* [patch V181 00/54] x86/pti: Final XMAS release
From: Thomas Gleixner @ 2017-12-20 21:35 UTC (permalink / raw)
  To: LKML
  Cc: x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, keescook, hughd,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Rik van Riel,
	Boris Ostrovsky, Juergen Gross, David Laight, Eduardo Valentin,
	aliguori, Will Deacon, Vlastimil Babka, daniel.gruss

Changes since V163:

  - Moved the cpu entry area out of the fixmap because that caused failures
    due to fixmap size and cleanup_highmap() zapping fixmap PTEs.

  - Moved all cpu entry area related code into separate files. The
    hodgepodge in cpu/common.c was really not appropriate.

  - Folded Juergen's XEN PV fix for vsyscall

  - Folded Peter's ACCESS bit simplification

  - Cleaned up and fixed dump_pagetables

  - Added Vlastimil's PTI/NOPTI marker for dumpstack

  - Addressed various minor review comments

  Diffstat against V163 appended.

Thanks to everyone who looked and cared!

It's perfect now because I'm going to have quiet holidays no matter what.

Nevertheless, please review and test the hell out of it.

The lot is based on:

  git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git WIP.x86/pti.entry

The series is also available from git:

  git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git WIP.x86/pti
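
For example, to fetch and check out the series (standard git usage; the
local branch name is arbitrary):

  $ git fetch git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git WIP.x86/pti
  $ git checkout -b pti-v181 FETCH_HEAD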

The patch tarball is at:

  https://tglx.de/~tglx/patches-pti-181.tar.bz2

  sha1sum of decompressed tarball: 3b9c1729efe58e793a40031f54e82cac8aba3884
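
  To verify the download (illustrative; standard coreutils):

    $ bunzip2 -k patches-pti-181.tar.bz2
    $ sha1sum patches-pti-181.tar
    3b9c1729efe58e793a40031f54e82cac8aba3884  patches-pti-181.tar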

Thanks,

	tglx

---

 Documentation/x86/x86_64/mm.txt         |    4 
 arch/x86/Kconfig                        |    3 
 arch/x86/entry/vsyscall/vsyscall_64.c   |    7 -
 arch/x86/events/intel/ds.c              |   53 ++++++----
 arch/x86/include/asm/cpu_entry_area.h   |   87 +++++++++++++++++
 arch/x86/include/asm/desc.h             |    1 
 arch/x86/include/asm/fixmap.h           |   96 -------------------
 arch/x86/include/asm/pgtable.h          |    4 
 arch/x86/include/asm/pgtable_32_types.h |   15 ++-
 arch/x86/include/asm/pgtable_64_types.h |   55 ++++++-----
 arch/x86/kernel/cpu/common.c            |  126 -------------------------
 arch/x86/kernel/dumpstack.c             |    7 +
 arch/x86/kernel/tls.c                   |   11 --
 arch/x86/kernel/traps.c                 |    6 -
 arch/x86/mm/Makefile                    |    8 -
 arch/x86/mm/cpu_entry_area.c            |  159 ++++++++++++++++++++++++++++++++
 arch/x86/mm/dump_pagetables.c           |  105 ++++++++++++---------
 arch/x86/mm/init_32.c                   |    6 +
 arch/x86/mm/kasan_init_64.c             |    6 -
 arch/x86/mm/pgtable_32.c                |    1 
 arch/x86/mm/pti.c                       |   52 +++++-----
 arch/x86/xen/mmu_pv.c                   |    2 
 22 files changed, 449 insertions(+), 365 deletions(-)


* [patch V181 01/54] x86/Kconfig: Limit NR_CPUS on 32bit to a sane amount
From: Thomas Gleixner @ 2017-12-20 21:35 UTC (permalink / raw)
  To: LKML
  Cc: x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, keescook, hughd,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Rik van Riel,
	Boris Ostrovsky, Juergen Gross, David Laight, Eduardo Valentin,
	aliguori, Will Deacon, Vlastimil Babka, daniel.gruss

[-- Attachment #1: x86-Kconfig--Limit-NR_CPUS-on-32bit-to-a-sane-amount.patch --]
[-- Type: text/plain, Size: 850 bytes --]

The recent cpu_entry_area changes fail to compile on 32bit when BIGSMP=y
and NR_CPUS=512 because the fixmap area becomes too big.

Limit the number of CPUs with BIGSMP to 64, which is already way too big
for 32bit, but it's at least a working limitation.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/Kconfig |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -925,7 +925,8 @@ config MAXSMP
 config NR_CPUS
 	int "Maximum number of CPUs" if SMP && !MAXSMP
 	range 2 8 if SMP && X86_32 && !X86_BIGSMP
-	range 2 512 if SMP && !MAXSMP && !CPUMASK_OFFSTACK
+	range 2 64 if SMP && X86_32 && X86_BIGSMP
+	range 2 512 if SMP && !MAXSMP && !CPUMASK_OFFSTACK && X86_64
 	range 2 8192 if SMP && !MAXSMP && CPUMASK_OFFSTACK && X86_64
 	default "1" if !SMP
 	default "8192" if MAXSMP


* [patch V181 02/54] x86/mm/dump_pagetables: Check PAGE_PRESENT for real
From: Thomas Gleixner @ 2017-12-20 21:35 UTC (permalink / raw)
  To: LKML
  Cc: x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, keescook, hughd,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Rik van Riel,
	Boris Ostrovsky, Juergen Gross, David Laight, Eduardo Valentin,
	aliguori, Will Deacon, Vlastimil Babka, daniel.gruss

[-- Attachment #1: x86-mm-dump_pagetables--Check-PAGE_PRESENT-for-real.patch --]
[-- Type: text/plain, Size: 925 bytes --]

From: Thomas Gleixner <tglx@linutronix.de>

The check for a present page in printk_prot():

       if (!pgprot_val(prot)) {
                /* Not present */

is bogus. If a PTE is set to PAGE_NONE then the pgprot_val is not zero and
the entry is decoded in bogus ways, e.g. as RX GLB. That is confusing when
analyzing mapping correctness. Check for the present bit to make an
informed decision.
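
A PAGE_NONE mapping shows why (abridged from the real definitions in
arch/x86/include/asm/pgtable_types.h; the protnone bit aliases the
global bit, hence the bogus "GLB" decode):

   #define _PAGE_BIT_PRESENT	0
   #define _PAGE_BIT_GLOBAL	8
   #define _PAGE_BIT_PROTNONE	_PAGE_BIT_GLOBAL /* only valid if !PRESENT */

   /* pgprot_val(PAGE_NONE) != 0, yet the entry is not present: */
   #define PAGE_NONE	__pgprot(_PAGE_PROTNONE | _PAGE_ACCESSED)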

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/mm/dump_pagetables.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/arch/x86/mm/dump_pagetables.c
+++ b/arch/x86/mm/dump_pagetables.c
@@ -140,7 +140,7 @@ static void printk_prot(struct seq_file
 	static const char * const level_name[] =
 		{ "cr3", "pgd", "p4d", "pud", "pmd", "pte" };
 
-	if (!pgprot_val(prot)) {
+	if (!(pr & _PAGE_PRESENT)) {
 		/* Not present */
 		pt_dump_cont_printf(m, dmsg, "                              ");
 	} else {


* [patch V181 03/54] x86/mm/dump_pagetables: Make the address hints correct and readable
From: Thomas Gleixner @ 2017-12-20 21:35 UTC (permalink / raw)
  To: LKML
  Cc: x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, keescook, hughd,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Rik van Riel,
	Boris Ostrovsky, Juergen Gross, David Laight, Eduardo Valentin,
	aliguori, Will Deacon, Vlastimil Babka, daniel.gruss

[-- Attachment #1: x86-mm-dump_pagetables--Make-the-address-hints-correct-and-readable.patch --]
[-- Type: text/plain, Size: 4135 bytes --]

The address hints are a trainwreck. The array entry numbers have to be kept
magically in sync with the actual hints, which is doomed as some of the
array members are initialized at runtime via the entry numbers.

Designated initializers have been around since before this code was
implemented....

Use the entry numbers to populate the address hints array and add the
missing bits and pieces. Split 32 and 64 bit for readability's sake.
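
For reference, designated initializers bind each array slot to its enum
index so the two can never drift apart (a generic C sketch, not part of
the patch):

   enum { FOO_NR, BAR_NR, END_NR };

   static struct addr_marker markers[] = {
   	[FOO_NR] = { 0x1000,	"Foo" },
   	[BAR_NR] = { 0x2000,	"Bar" },
   	[END_NR] = { -1,	NULL },
   };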

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/mm/dump_pagetables.c |   90 ++++++++++++++++++++++++------------------
 1 file changed, 53 insertions(+), 37 deletions(-)

--- a/arch/x86/mm/dump_pagetables.c
+++ b/arch/x86/mm/dump_pagetables.c
@@ -44,10 +44,12 @@ struct addr_marker {
 	unsigned long max_lines;
 };
 
-/* indices for address_markers; keep sync'd w/ address_markers below */
+/* Address space markers hints */
+
+#ifdef CONFIG_X86_64
+
 enum address_markers_idx {
 	USER_SPACE_NR = 0,
-#ifdef CONFIG_X86_64
 	KERNEL_SPACE_NR,
 	LOW_KERNEL_NR,
 	VMALLOC_START_NR,
@@ -56,56 +58,70 @@ enum address_markers_idx {
 	KASAN_SHADOW_START_NR,
 	KASAN_SHADOW_END_NR,
 #endif
-# ifdef CONFIG_X86_ESPFIX64
+#ifdef CONFIG_X86_ESPFIX64
 	ESPFIX_START_NR,
-# endif
+#endif
+#ifdef CONFIG_EFI
+	EFI_END_NR,
+#endif
 	HIGH_KERNEL_NR,
 	MODULES_VADDR_NR,
 	MODULES_END_NR,
-#else
+	FIXADDR_START_NR,
+	END_OF_SPACE_NR,
+};
+
+static struct addr_marker address_markers[] = {
+	[USER_SPACE_NR]		= { 0,			"User Space" },
+	[KERNEL_SPACE_NR]	= { (1UL << 63),	"Kernel Space" },
+	[LOW_KERNEL_NR]		= { 0UL,		"Low Kernel Mapping" },
+	[VMALLOC_START_NR]	= { 0UL,		"vmalloc() Area" },
+	[VMEMMAP_START_NR]	= { 0UL,		"Vmemmap" },
+#ifdef CONFIG_KASAN
+	[KASAN_SHADOW_START_NR]	= { KASAN_SHADOW_START,	"KASAN shadow" },
+	[KASAN_SHADOW_END_NR]	= { KASAN_SHADOW_END,	"KASAN shadow end" },
+#endif
+#ifdef CONFIG_X86_ESPFIX64
+	[ESPFIX_START_NR]	= { ESPFIX_BASE_ADDR,	"ESPfix Area", 16 },
+#endif
+#ifdef CONFIG_EFI
+	[EFI_END_NR]		= { EFI_VA_END,		"EFI Runtime Services" },
+#endif
+	[HIGH_KERNEL_NR]	= { __START_KERNEL_map,	"High Kernel Mapping" },
+	[MODULES_VADDR_NR]	= { MODULES_VADDR,	"Modules" },
+	[MODULES_END_NR]	= { MODULES_END,	"End Modules" },
+	[FIXADDR_START_NR]	= { FIXADDR_START,	"Fixmap Area" },
+	[END_OF_SPACE_NR]	= { -1,			NULL }
+};
+
+#else /* CONFIG_X86_64 */
+
+enum address_markers_idx {
+	USER_SPACE_NR = 0,
 	KERNEL_SPACE_NR,
 	VMALLOC_START_NR,
 	VMALLOC_END_NR,
-# ifdef CONFIG_HIGHMEM
+#ifdef CONFIG_HIGHMEM
 	PKMAP_BASE_NR,
-# endif
-	FIXADDR_START_NR,
 #endif
+	FIXADDR_START_NR,
+	END_OF_SPACE_NR,
 };
 
-/* Address space markers hints */
 static struct addr_marker address_markers[] = {
-	{ 0, "User Space" },
-#ifdef CONFIG_X86_64
-	{ 0x8000000000000000UL, "Kernel Space" },
-	{ 0/* PAGE_OFFSET */,   "Low Kernel Mapping" },
-	{ 0/* VMALLOC_START */, "vmalloc() Area" },
-	{ 0/* VMEMMAP_START */, "Vmemmap" },
-#ifdef CONFIG_KASAN
-	{ KASAN_SHADOW_START,	"KASAN shadow" },
-	{ KASAN_SHADOW_END,	"KASAN shadow end" },
-#endif
-# ifdef CONFIG_X86_ESPFIX64
-	{ ESPFIX_BASE_ADDR,	"ESPfix Area", 16 },
-# endif
-# ifdef CONFIG_EFI
-	{ EFI_VA_END,		"EFI Runtime Services" },
-# endif
-	{ __START_KERNEL_map,   "High Kernel Mapping" },
-	{ MODULES_VADDR,        "Modules" },
-	{ MODULES_END,          "End Modules" },
-#else
-	{ PAGE_OFFSET,          "Kernel Mapping" },
-	{ 0/* VMALLOC_START */, "vmalloc() Area" },
-	{ 0/*VMALLOC_END*/,     "vmalloc() End" },
-# ifdef CONFIG_HIGHMEM
-	{ 0/*PKMAP_BASE*/,      "Persistent kmap() Area" },
-# endif
-	{ 0/*FIXADDR_START*/,   "Fixmap Area" },
+	[USER_SPACE_NR]		= { 0,			"User Space" },
+	[KERNEL_SPACE_NR]	= { PAGE_OFFSET,	"Kernel Mapping" },
+	[VMALLOC_START_NR]	= { 0UL,		"vmalloc() Area" },
+	[VMALLOC_END_NR]	= { 0UL,		"vmalloc() End" },
+#ifdef CONFIG_HIGHMEM
+	[PKMAP_BASE_NR]		= { 0UL,		"Persistent kmap() Area" },
 #endif
-	{ -1, NULL }		/* End of list */
+	[FIXADDR_START_NR]	= { 0UL,		"Fixmap area" },
+	[END_OF_SPACE_NR]	= { -1,			NULL }
 };
 
+#endif /* !CONFIG_X86_64 */
+
 /* Multipliers for offsets within the PTEs */
 #define PTE_LEVEL_MULT (PAGE_SIZE)
 #define PMD_LEVEL_MULT (PTRS_PER_PTE * PTE_LEVEL_MULT)


* [patch V181 04/54] x86/vsyscall/64: Explicitly set _PAGE_USER in the pagetable hierarchy
From: Thomas Gleixner @ 2017-12-20 21:35 UTC (permalink / raw)
  To: LKML
  Cc: x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, keescook, hughd,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Rik van Riel,
	Boris Ostrovsky, Juergen Gross, David Laight, Eduardo Valentin,
	aliguori, Will Deacon, Vlastimil Babka, daniel.gruss,
	Ingo Molnar, Borislav Petkov, Kees Cook

[-- Attachment #1: 0075-x86-vsyscall-64-Explicitly-set-_PAGE_USER-in-the-pag.patch --]
[-- Type: text/plain, Size: 2905 bytes --]

From: Andy Lutomirski <luto@kernel.org>

The kernel is very erratic as to which pagetables have _PAGE_USER set.  The
vsyscall page gets lucky: it seems that all of the relevant pagetables are
among the apparently arbitrary ones that set _PAGE_USER.  Rather than
relying on chance, just explicitly set _PAGE_USER.

This will let us clean up pagetable setup to stop setting _PAGE_USER.  The
added code can also be reused by pagetable isolation to manage the
_PAGE_USER bit in the usermode tables.
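
For context: the CPU permits a user-mode access only when the U/S bit is
set at every level of the page walk, which is why each table covering
VSYSCALL_ADDR has to be touched (a schematic check, not kernel code; the
PTE itself gets _PAGE_USER from PAGE_KERNEL_VSYSCALL via __set_fixmap()):

   static bool user_walk_allowed(pgd_t *pgd, p4d_t *p4d, pud_t *pud,
   				 pmd_t *pmd)
   {
   	/* All levels above the PTE must have _PAGE_USER set */
   	return (pgd_val(*pgd) & _PAGE_USER) &&
   	       (p4d_val(*p4d) & _PAGE_USER) &&
   	       (pud_val(*pud) & _PAGE_USER) &&
   	       (pmd_val(*pmd) & _PAGE_USER);
   }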

[ tglx: Folded paravirt fix from Juergen Gross ]

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: David Laight <David.Laight@aculab.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Juergen Gross <jgross@suse.com>
---
 arch/x86/entry/vsyscall/vsyscall_64.c |   34 +++++++++++++++++++++++++++++++++-
 1 file changed, 33 insertions(+), 1 deletion(-)

--- a/arch/x86/entry/vsyscall/vsyscall_64.c
+++ b/arch/x86/entry/vsyscall/vsyscall_64.c
@@ -37,6 +37,7 @@
 #include <asm/unistd.h>
 #include <asm/fixmap.h>
 #include <asm/traps.h>
+#include <asm/paravirt.h>
 
 #define CREATE_TRACE_POINTS
 #include "vsyscall_trace.h"
@@ -329,16 +330,47 @@ int in_gate_area_no_mm(unsigned long add
 	return vsyscall_mode != NONE && (addr & PAGE_MASK) == VSYSCALL_ADDR;
 }
 
+/*
+ * The VSYSCALL page is the only user-accessible page in the kernel address
+ * range.  Normally, the kernel page tables can have _PAGE_USER clear, but
+ * the tables covering VSYSCALL_ADDR need _PAGE_USER set if vsyscalls
+ * are enabled.
+ *
+ * Some day we may create a "minimal" vsyscall mode in which we emulate
+ * vsyscalls but leave the page not present.  If so, we skip calling
+ * this.
+ */
+static void __init set_vsyscall_pgtable_user_bits(void)
+{
+	pgd_t *pgd;
+	p4d_t *p4d;
+	pud_t *pud;
+	pmd_t *pmd;
+
+	pgd = pgd_offset_k(VSYSCALL_ADDR);
+	set_pgd(pgd, __pgd(pgd_val(*pgd) | _PAGE_USER));
+	p4d = p4d_offset(pgd, VSYSCALL_ADDR);
+#if CONFIG_PGTABLE_LEVELS >= 5
+	p4d->p4d |= _PAGE_USER;
+#endif
+	pud = pud_offset(p4d, VSYSCALL_ADDR);
+	set_pud(pud, __pud(pud_val(*pud) | _PAGE_USER));
+	pmd = pmd_offset(pud, VSYSCALL_ADDR);
+	set_pmd(pmd, __pmd(pmd_val(*pmd) | _PAGE_USER));
+}
+
 void __init map_vsyscall(void)
 {
 	extern char __vsyscall_page;
 	unsigned long physaddr_vsyscall = __pa_symbol(&__vsyscall_page);
 
-	if (vsyscall_mode != NONE)
+	if (vsyscall_mode != NONE) {
 		__set_fixmap(VSYSCALL_PAGE, physaddr_vsyscall,
 			     vsyscall_mode == NATIVE
 			     ? PAGE_KERNEL_VSYSCALL
 			     : PAGE_KERNEL_VVAR);
+		set_vsyscall_pgtable_user_bits();
+	}
 
 	BUILD_BUG_ON((unsigned long)__fix_to_virt(VSYSCALL_PAGE) !=
 		     (unsigned long)VSYSCALL_ADDR);


* [patch V181 05/54] x86/vsyscall/64: Warn and fail vsyscall emulation in NATIVE mode
From: Thomas Gleixner @ 2017-12-20 21:35 UTC (permalink / raw)
  To: LKML
  Cc: x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, keescook, hughd,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Rik van Riel,
	Boris Ostrovsky, Juergen Gross, David Laight, Eduardo Valentin,
	aliguori, Will Deacon, Vlastimil Babka, daniel.gruss,
	Ingo Molnar, Borislav Petkov, Kees Cook

[-- Attachment #1: 0076-x86-vsyscall-64-Warn-and-fail-vsyscall-emulation-in-.patch --]
[-- Type: text/plain, Size: 1088 bytes --]

From: Andy Lutomirski <luto@kernel.org>

If something goes wrong with pagetable setup, vsyscall=native will
accidentally fall back to emulation.  Make it warn and fail so that we
notice.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: David Laight <David.Laight@aculab.com>
Cc: Kees Cook <keescook@chromium.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
---
 arch/x86/entry/vsyscall/vsyscall_64.c |    4 ++++
 1 file changed, 4 insertions(+)

--- a/arch/x86/entry/vsyscall/vsyscall_64.c
+++ b/arch/x86/entry/vsyscall/vsyscall_64.c
@@ -139,6 +139,10 @@ bool emulate_vsyscall(struct pt_regs *re
 
 	WARN_ON_ONCE(address != regs->ip);
 
+	/* This should be unreachable in NATIVE mode. */
+	if (WARN_ON(vsyscall_mode == NATIVE))
+		return false;
+
 	if (vsyscall_mode == NONE) {
 		warn_bad_vsyscall(KERN_INFO, regs,
 				  "vsyscall attempted with vsyscall=none");


* [patch V181 06/54] arch: Allow arch_dup_mmap() to fail
From: Thomas Gleixner @ 2017-12-20 21:35 UTC (permalink / raw)
  To: LKML
  Cc: x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, keescook, hughd,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Rik van Riel,
	Boris Ostrovsky, Juergen Gross, David Laight, Eduardo Valentin,
	aliguori, Will Deacon, Vlastimil Babka, daniel.gruss, linux-mm,
	dan.j.williams, kirill.shutemov

[-- Attachment #1: arch--Allow_arch_dup_mmap--_to_fail.patch --]
[-- Type: text/plain, Size: 3780 bytes --]

From: Thomas Gleixner <tglx@linutronix.de>

In order to sanitize the LDT initialization on x86 arch_dup_mmap() must be
allowed to fail. Fix up all instances.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Juergen Gross <jgross@suse.com>
Cc: Eduardo Valentin <eduval@amazon.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: aliguori@amazon.com
Cc: Brian Gerst <brgerst@gmail.com>
Cc: linux-mm@kvack.org
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: hughd@google.com
Cc: Will Deacon <will.deacon@arm.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: David Laight <David.Laight@aculab.com>
Cc: Andy Lutomirsky <luto@kernel.org>
Cc: keescook@google.com
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Borislav Petkov <bpetkov@suse.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: dan.j.williams@intel.com
Cc: kirill.shutemov@linux.intel.com

---
 arch/powerpc/include/asm/mmu_context.h   |    5 +++--
 arch/um/include/asm/mmu_context.h        |    3 ++-
 arch/unicore32/include/asm/mmu_context.h |    5 +++--
 arch/x86/include/asm/mmu_context.h       |    4 ++--
 include/asm-generic/mm_hooks.h           |    5 +++--
 kernel/fork.c                            |    3 +--
 6 files changed, 14 insertions(+), 11 deletions(-)

--- a/arch/powerpc/include/asm/mmu_context.h
+++ b/arch/powerpc/include/asm/mmu_context.h
@@ -114,9 +114,10 @@ static inline void enter_lazy_tlb(struct
 #endif
 }
 
-static inline void arch_dup_mmap(struct mm_struct *oldmm,
-				 struct mm_struct *mm)
+static inline int arch_dup_mmap(struct mm_struct *oldmm,
+				struct mm_struct *mm)
 {
+	return 0;
 }
 
 static inline void arch_exit_mmap(struct mm_struct *mm)
--- a/arch/um/include/asm/mmu_context.h
+++ b/arch/um/include/asm/mmu_context.h
@@ -15,9 +15,10 @@ extern void uml_setup_stubs(struct mm_st
 /*
  * Needed since we do not use the asm-generic/mm_hooks.h:
  */
-static inline void arch_dup_mmap(struct mm_struct *oldmm, struct mm_struct *mm)
+static inline int arch_dup_mmap(struct mm_struct *oldmm, struct mm_struct *mm)
 {
 	uml_setup_stubs(mm);
+	return 0;
 }
 extern void arch_exit_mmap(struct mm_struct *mm);
 static inline void arch_unmap(struct mm_struct *mm,
--- a/arch/unicore32/include/asm/mmu_context.h
+++ b/arch/unicore32/include/asm/mmu_context.h
@@ -81,9 +81,10 @@ do { \
 	} \
 } while (0)
 
-static inline void arch_dup_mmap(struct mm_struct *oldmm,
-				 struct mm_struct *mm)
+static inline int arch_dup_mmap(struct mm_struct *oldmm,
+				struct mm_struct *mm)
 {
+	return 0;
 }
 
 static inline void arch_unmap(struct mm_struct *mm,
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -176,10 +176,10 @@ do {						\
 } while (0)
 #endif
 
-static inline void arch_dup_mmap(struct mm_struct *oldmm,
-				 struct mm_struct *mm)
+static inline int arch_dup_mmap(struct mm_struct *oldmm, struct mm_struct *mm)
 {
 	paravirt_arch_dup_mmap(oldmm, mm);
+	return 0;
 }
 
 static inline void arch_exit_mmap(struct mm_struct *mm)
--- a/include/asm-generic/mm_hooks.h
+++ b/include/asm-generic/mm_hooks.h
@@ -7,9 +7,10 @@
 #ifndef _ASM_GENERIC_MM_HOOKS_H
 #define _ASM_GENERIC_MM_HOOKS_H
 
-static inline void arch_dup_mmap(struct mm_struct *oldmm,
-				 struct mm_struct *mm)
+static inline int arch_dup_mmap(struct mm_struct *oldmm,
+				struct mm_struct *mm)
 {
+	return 0;
 }
 
 static inline void arch_exit_mmap(struct mm_struct *mm)
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -721,8 +721,7 @@ static __latent_entropy int dup_mmap(str
 			goto out;
 	}
 	/* a new mm has just been created */
-	arch_dup_mmap(oldmm, mm);
-	retval = 0;
+	retval = arch_dup_mmap(oldmm, mm);
 out:
 	up_write(&mm->mmap_sem);
 	flush_tlb_mm(oldmm);

* [patch V181 07/54] x86/ldt: Rework locking
From: Thomas Gleixner @ 2017-12-20 21:35 UTC (permalink / raw)
  To: LKML
  Cc: x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, keescook, hughd,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Rik van Riel,
	Boris Ostrovsky, Juergen Gross, David Laight, Eduardo Valentin,
	aliguori, Will Deacon, Vlastimil Babka, daniel.gruss, linux-mm,
	dan.j.williams, kirill.shutemov

[-- Attachment #1: x86-ldt--Rework_locking.patch --]
[-- Type: text/plain, Size: 5180 bytes --]

From: Peter Zijlstra <peterz@infradead.org>

The LDT is duplicated on fork() and on exec(), which is wrong as exec()
should start from a clean state, i.e. without LDT. To fix this the LDT
duplication code will be moved into arch_dup_mmap() which is only called
for fork().

This introduces a locking problem. arch_dup_mmap() holds mmap_sem of the
parent process, but the LDT duplication code needs to acquire
mm->context.lock to access the LDT data safely, which is the reverse lock
order of write_ldt() where mmap_sem nests into context.lock.

Solve this by introducing a new rw semaphore which serializes the
read/write_ldt() syscall operations, and use context.lock to protect the
actual installation of the LDT descriptor.

So context.lock stabilizes mm->context.ldt and can nest inside of the new
semaphore or mmap_sem.
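
Schematically, the inversion looks like this (a sketch of the ordering
described above, not literal call chains):

   write_ldt()                        fork() -> dup_mmap()
     mutex_lock(&context.lock)          down_write(&mmap_sem)
       down_write(&mmap_sem)              arch_dup_mmap()
             ^ blocks                       mutex_lock(&context.lock)
                                                  ^ blocks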

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Juergen Gross <jgross@suse.com>
Cc: Eduardo Valentin <eduval@amazon.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: aliguori@amazon.com
Cc: Brian Gerst <brgerst@gmail.com>
Cc: linux-mm@kvack.org
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: hughd@google.com
Cc: Will Deacon <will.deacon@arm.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: David Laight <David.Laight@aculab.com>
Cc: Andy Lutomirsky <luto@kernel.org>
Cc: keescook@google.com
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Borislav Petkov <bpetkov@suse.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: dan.j.williams@intel.com
Cc: kirill.shutemov@linux.intel.com

---
 arch/x86/include/asm/mmu.h         |    4 +++-
 arch/x86/include/asm/mmu_context.h |    2 ++
 arch/x86/kernel/ldt.c              |   33 +++++++++++++++++++++------------
 3 files changed, 26 insertions(+), 13 deletions(-)

--- a/arch/x86/include/asm/mmu.h
+++ b/arch/x86/include/asm/mmu.h
@@ -3,6 +3,7 @@
 #define _ASM_X86_MMU_H
 
 #include <linux/spinlock.h>
+#include <linux/rwsem.h>
 #include <linux/mutex.h>
 #include <linux/atomic.h>
 
@@ -27,7 +28,8 @@ typedef struct {
 	atomic64_t tlb_gen;
 
 #ifdef CONFIG_MODIFY_LDT_SYSCALL
-	struct ldt_struct *ldt;
+	struct rw_semaphore	ldt_usr_sem;
+	struct ldt_struct	*ldt;
 #endif
 
 #ifdef CONFIG_X86_64
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -132,6 +132,8 @@ void enter_lazy_tlb(struct mm_struct *mm
 static inline int init_new_context(struct task_struct *tsk,
 				   struct mm_struct *mm)
 {
+	mutex_init(&mm->context.lock);
+
 	mm->context.ctx_id = atomic64_inc_return(&last_mm_ctx_id);
 	atomic64_set(&mm->context.tlb_gen, 0);
 
--- a/arch/x86/kernel/ldt.c
+++ b/arch/x86/kernel/ldt.c
@@ -5,6 +5,11 @@
  * Copyright (C) 2002 Andi Kleen
  *
  * This handles calls from both 32bit and 64bit mode.
+ *
+ * Lock order:
+ *	context.ldt_usr_sem
+ *	  mmap_sem
+ *	    context.lock
  */
 
 #include <linux/errno.h>
@@ -42,7 +47,7 @@ static void refresh_ldt_segments(void)
 #endif
 }
 
-/* context.lock is held for us, so we don't need any locking. */
+/* context.lock is held by the task which issued the smp function call */
 static void flush_ldt(void *__mm)
 {
 	struct mm_struct *mm = __mm;
@@ -99,15 +104,17 @@ static void finalize_ldt_struct(struct l
 	paravirt_alloc_ldt(ldt->entries, ldt->nr_entries);
 }
 
-/* context.lock is held */
-static void install_ldt(struct mm_struct *current_mm,
-			struct ldt_struct *ldt)
+static void install_ldt(struct mm_struct *mm, struct ldt_struct *ldt)
 {
+	mutex_lock(&mm->context.lock);
+
 	/* Synchronizes with READ_ONCE in load_mm_ldt. */
-	smp_store_release(&current_mm->context.ldt, ldt);
+	smp_store_release(&mm->context.ldt, ldt);
 
-	/* Activate the LDT for all CPUs using current_mm. */
-	on_each_cpu_mask(mm_cpumask(current_mm), flush_ldt, current_mm, true);
+	/* Activate the LDT for all CPUs using current's mm. */
+	on_each_cpu_mask(mm_cpumask(mm), flush_ldt, mm, true);
+
+	mutex_unlock(&mm->context.lock);
 }
 
 static void free_ldt_struct(struct ldt_struct *ldt)
@@ -133,7 +140,8 @@ int init_new_context_ldt(struct task_str
 	struct mm_struct *old_mm;
 	int retval = 0;
 
-	mutex_init(&mm->context.lock);
+	init_rwsem(&mm->context.ldt_usr_sem);
+
 	old_mm = current->mm;
 	if (!old_mm) {
 		mm->context.ldt = NULL;
@@ -180,7 +188,7 @@ static int read_ldt(void __user *ptr, un
 	unsigned long entries_size;
 	int retval;
 
-	mutex_lock(&mm->context.lock);
+	down_read(&mm->context.ldt_usr_sem);
 
 	if (!mm->context.ldt) {
 		retval = 0;
@@ -209,7 +217,7 @@ static int read_ldt(void __user *ptr, un
 	retval = bytecount;
 
 out_unlock:
-	mutex_unlock(&mm->context.lock);
+	up_read(&mm->context.ldt_usr_sem);
 	return retval;
 }
 
@@ -269,7 +277,8 @@ static int write_ldt(void __user *ptr, u
 			ldt.avl = 0;
 	}
 
-	mutex_lock(&mm->context.lock);
+	if (down_write_killable(&mm->context.ldt_usr_sem))
+		return -EINTR;
 
 	old_ldt       = mm->context.ldt;
 	old_nr_entries = old_ldt ? old_ldt->nr_entries : 0;
@@ -291,7 +300,7 @@ static int write_ldt(void __user *ptr, u
 	error = 0;
 
 out_unlock:
-	mutex_unlock(&mm->context.lock);
+	up_write(&mm->context.ldt_usr_sem);
 out:
 	return error;
 }

* [patch V181 08/54] x86/ldt: Prevent ldt inheritance on exec
From: Thomas Gleixner @ 2017-12-20 21:35 UTC (permalink / raw)
  To: LKML
  Cc: x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, keescook, hughd,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Rik van Riel,
	Boris Ostrovsky, Juergen Gross, David Laight, Eduardo Valentin,
	aliguori, Will Deacon, Vlastimil Babka, daniel.gruss, linux-mm,
	dan.j.williams, kirill.shutemov

[-- Attachment #1: x86-ldt--Prevent_ldt_inheritance_on_exec.patch --]
[-- Type: text/plain, Size: 5057 bytes --]

From: Thomas Gleixner <tglx@linutronix.de>

The LDT is inherited across fork() and exec(), but that makes no sense
at all because exec() is supposed to start the process clean.

The reason why this happens is that init_new_context_ldt() is called from
init_new_context() which obviously needs to be called for both fork() and
exec().

It would be surprising if anything relies on that behaviour, so it seems to
be safe to remove that misfeature.

Split the context initialization into two parts. Clear the ldt pointer and
initialize the mutex from the general context init and move the LDT
duplication to arch_dup_mmap() which is only called on fork().
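
Seen from userspace, the new semantics (a hypothetical sketch using the
raw syscall; error handling omitted):

   #include <asm/ldt.h>
   #include <sys/syscall.h>
   #include <unistd.h>

   struct user_desc desc = {
   	.entry_number	= 0,
   	.base_addr	= 0,
   	.limit		= 1,
   	.seg_32bit	= 1,
   };

   syscall(SYS_modify_ldt, 1, &desc, sizeof(desc));	/* 1 == write */
   /* fork():   the child gets a copy of the LDT (via arch_dup_mmap()) */
   /* execve(): the new program starts with an empty LDT               */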

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra <peterz@infradead.org>
Cc: Juergen Gross <jgross@suse.com>
Cc: Eduardo Valentin <eduval@amazon.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: aliguori@amazon.com
Cc: Brian Gerst <brgerst@gmail.com>
Cc: linux-mm@kvack.org
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: hughd@google.com
Cc: Will Deacon <will.deacon@arm.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: David Laight <David.Laight@aculab.com>
Cc: Andy Lutomirsky <luto@kernel.org>
Cc: keescook@google.com
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Borislav Petkov <bpetkov@suse.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: dan.j.williams@intel.com
Cc: kirill.shutemov@linux.intel.com

---
 arch/x86/include/asm/mmu_context.h    |   21 ++++++++++++++-------
 arch/x86/kernel/ldt.c                 |   18 +++++-------------
 tools/testing/selftests/x86/ldt_gdt.c |    9 +++------
 3 files changed, 22 insertions(+), 26 deletions(-)

--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -57,11 +57,17 @@ struct ldt_struct {
 /*
  * Used for LDT copy/destruction.
  */
-int init_new_context_ldt(struct task_struct *tsk, struct mm_struct *mm);
+static inline void init_new_context_ldt(struct mm_struct *mm)
+{
+	mm->context.ldt = NULL;
+	init_rwsem(&mm->context.ldt_usr_sem);
+}
+int ldt_dup_context(struct mm_struct *oldmm, struct mm_struct *mm);
 void destroy_context_ldt(struct mm_struct *mm);
 #else	/* CONFIG_MODIFY_LDT_SYSCALL */
-static inline int init_new_context_ldt(struct task_struct *tsk,
-				       struct mm_struct *mm)
+static inline void init_new_context_ldt(struct mm_struct *mm) { }
+static inline int ldt_dup_context(struct mm_struct *oldmm,
+				  struct mm_struct *mm)
 {
 	return 0;
 }
@@ -137,15 +143,16 @@ static inline int init_new_context(struc
 	mm->context.ctx_id = atomic64_inc_return(&last_mm_ctx_id);
 	atomic64_set(&mm->context.tlb_gen, 0);
 
-	#ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS
+#ifdef CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS
 	if (cpu_feature_enabled(X86_FEATURE_OSPKE)) {
 		/* pkey 0 is the default and always allocated */
 		mm->context.pkey_allocation_map = 0x1;
 		/* -1 means unallocated or invalid */
 		mm->context.execute_only_pkey = -1;
 	}
-	#endif
-	return init_new_context_ldt(tsk, mm);
+#endif
+	init_new_context_ldt(mm);
+	return 0;
 }
 static inline void destroy_context(struct mm_struct *mm)
 {
@@ -181,7 +188,7 @@ do {						\
 static inline int arch_dup_mmap(struct mm_struct *oldmm, struct mm_struct *mm)
 {
 	paravirt_arch_dup_mmap(oldmm, mm);
-	return 0;
+	return ldt_dup_context(oldmm, mm);
 }
 
 static inline void arch_exit_mmap(struct mm_struct *mm)
--- a/arch/x86/kernel/ldt.c
+++ b/arch/x86/kernel/ldt.c
@@ -131,28 +131,20 @@ static void free_ldt_struct(struct ldt_s
 }
 
 /*
- * we do not have to muck with descriptors here, that is
- * done in switch_mm() as needed.
+ * Called on fork from arch_dup_mmap(). Just copy the current LDT state,
+ * the new task is not running, so nothing can be installed.
  */
-int init_new_context_ldt(struct task_struct *tsk, struct mm_struct *mm)
+int ldt_dup_context(struct mm_struct *old_mm, struct mm_struct *mm)
 {
 	struct ldt_struct *new_ldt;
-	struct mm_struct *old_mm;
 	int retval = 0;
 
-	init_rwsem(&mm->context.ldt_usr_sem);
-
-	old_mm = current->mm;
-	if (!old_mm) {
-		mm->context.ldt = NULL;
+	if (!old_mm)
 		return 0;
-	}
 
 	mutex_lock(&old_mm->context.lock);
-	if (!old_mm->context.ldt) {
-		mm->context.ldt = NULL;
+	if (!old_mm->context.ldt)
 		goto out_unlock;
-	}
 
 	new_ldt = alloc_ldt_struct(old_mm->context.ldt->nr_entries);
 	if (!new_ldt) {
--- a/tools/testing/selftests/x86/ldt_gdt.c
+++ b/tools/testing/selftests/x86/ldt_gdt.c
@@ -627,13 +627,10 @@ static void do_multicpu_tests(void)
 static int finish_exec_test(void)
 {
 	/*
-	 * In a sensible world, this would be check_invalid_segment(0, 1);
-	 * For better or for worse, though, the LDT is inherited across exec.
-	 * We can probably change this safely, but for now we test it.
+	 * Older kernel versions did inherit the LDT on exec() which is
+	 * wrong because exec() starts from a clean state.
 	 */
-	check_valid_segment(0, 1,
-			    AR_DPL3 | AR_TYPE_XRCODE | AR_S | AR_P | AR_DB,
-			    42, true);
+	check_invalid_segment(0, 1);
 
 	return nerrs ? 1 : 0;
 }

* [patch V181 09/54] x86/mm/64: Improve the memory map documentation
From: Thomas Gleixner @ 2017-12-20 21:35 UTC (permalink / raw)
  To: LKML
  Cc: x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, keescook, hughd,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Rik van Riel,
	Boris Ostrovsky, Juergen Gross, David Laight, Eduardo Valentin,
	aliguori, Will Deacon, Vlastimil Babka, daniel.gruss, Kees Cook,
	Borislav Petkov, Kirill A. Shutemov

[-- Attachment #1: x86-mm-64--Improve_the_memory_map_documentation.patch --]
[-- Type: text/plain, Size: 2139 bytes --]

From: Andy Lutomirski <luto@kernel.org>

The old docs had the vsyscall range wrong and were missing the fixmap.
Fix both.

There used to be 8 MB reserved for future vsyscalls, but that's long gone.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Kees Cook <keescook@chromium.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: David Laight <David.Laight@aculab.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>

---
 Documentation/x86/x86_64/mm.txt |   10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

--- a/Documentation/x86/x86_64/mm.txt
+++ b/Documentation/x86/x86_64/mm.txt
@@ -19,8 +19,9 @@ ffffff0000000000 - ffffff7fffffffff (=39
 ffffffef00000000 - fffffffeffffffff (=64 GB) EFI region mapping space
 ... unused hole ...
 ffffffff80000000 - ffffffff9fffffff (=512 MB)  kernel text mapping, from phys 0
-ffffffffa0000000 - ffffffffff5fffff (=1526 MB) module mapping space (variable)
-ffffffffff600000 - ffffffffffdfffff (=8 MB) vsyscalls
+ffffffffa0000000 - [fixmap start]   (~1526 MB) module mapping space (variable)
+[fixmap start]   - ffffffffff5fffff kernel-internal fixmap range
+ffffffffff600000 - ffffffffff600fff (=4 kB) legacy vsyscall ABI
 ffffffffffe00000 - ffffffffffffffff (=2 MB) unused hole
 
 Virtual memory map with 5 level page tables:
@@ -41,8 +42,9 @@ ffffff0000000000 - ffffff7fffffffff (=39
 ffffffef00000000 - fffffffeffffffff (=64 GB) EFI region mapping space
 ... unused hole ...
 ffffffff80000000 - ffffffff9fffffff (=512 MB)  kernel text mapping, from phys 0
-ffffffffa0000000 - ffffffffff5fffff (=1526 MB) module mapping space
-ffffffffff600000 - ffffffffffdfffff (=8 MB) vsyscalls
+ffffffffa0000000 - [fixmap start]   (~1526 MB) module mapping space
+[fixmap start]   - ffffffffff5fffff kernel-internal fixmap range
+ffffffffff600000 - ffffffffff600fff (=4 kB) legacy vsyscall ABI
 ffffffffffe00000 - ffffffffffffffff (=2 MB) unused hole
 
 Architecture defines a 64-bit virtual address. Implementations can support


* [patch V181 10/54] x86/doc: Remove obvious weirdness
From: Thomas Gleixner @ 2017-12-20 21:35 UTC (permalink / raw)
  To: LKML
  Cc: x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, keescook, hughd,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Rik van Riel,
	Boris Ostrovsky, Juergen Gross, David Laight, Eduardo Valentin,
	aliguori, Will Deacon, Vlastimil Babka, daniel.gruss,
	Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin,
	linux-mm

[-- Attachment #1: 0069-x86-doc-Remove-obvious-weirdness.patch --]
[-- Type: text/plain, Size: 2754 bytes --]

From: Peter Zijlstra <peterz@infradead.org>

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Laight <David.Laight@aculab.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Eduardo Valentin <eduval@amazon.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: aliguori@amazon.com
Cc: daniel.gruss@iaik.tugraz.at
Cc: hughd@google.com
Cc: keescook@google.com
Cc: linux-mm@kvack.org
---
 Documentation/x86/x86_64/mm.txt |   12 +++---------
 1 file changed, 3 insertions(+), 9 deletions(-)

--- a/Documentation/x86/x86_64/mm.txt
+++ b/Documentation/x86/x86_64/mm.txt
@@ -1,6 +1,4 @@
 
-<previous description obsolete, deleted>
-
 Virtual memory map with 4 level page tables:
 
 0000000000000000 - 00007fffffffffff (=47 bits) user space, different per mm
@@ -49,8 +47,9 @@ ffffffffffe00000 - ffffffffffffffff (=2
 
 Architecture defines a 64-bit virtual address. Implementations can support
 less. Currently supported are 48- and 57-bit virtual addresses. Bits 63
-through to the most-significant implemented bit are set to either all ones
-or all zero. This causes hole between user space and kernel addresses.
+through to the most-significant implemented bit are sign extended.
+This causes a hole between user space and kernel addresses if you interpret
+them as unsigned.
 
 The direct mapping covers all memory in the system up to the highest
 memory address (this means in some cases it can also include PCI memory
@@ -60,9 +59,6 @@ vmalloc space is lazily synchronized int
 the processes using the page fault handler, with init_top_pgt as
 reference.
 
-Current X86-64 implementations support up to 46 bits of address space (64 TB),
-which is our current limit. This expands into MBZ space in the page tables.
-
 We map EFI runtime services in the 'efi_pgd' PGD in a 64Gb large virtual
 memory window (this size is arbitrary, it can be raised later if needed).
 The mappings are not part of any other kernel PGD and are only available
@@ -74,5 +70,3 @@ following fixmap section.
 Note that if CONFIG_RANDOMIZE_MEMORY is enabled, the direct mapping of all
 physical memory, vmalloc/ioremap space and virtual memory map are randomized.
 Their order is preserved but their base will be offset early at boot time.
-
--Andi Kleen, Jul 2004

^ permalink raw reply	[flat|nested] 85+ messages in thread

* [patch V181 11/54] x86/entry: Remove SYSENTER_stack naming
  2017-12-20 21:35 [patch V181 00/54] x86/pti: Final XMAS release Thomas Gleixner
                   ` (9 preceding siblings ...)
  2017-12-20 21:35   ` Thomas Gleixner
@ 2017-12-20 21:35 ` Thomas Gleixner
  2017-12-20 21:35   ` Thomas Gleixner
                   ` (45 subsequent siblings)
  56 siblings, 0 replies; 85+ messages in thread
From: Thomas Gleixner @ 2017-12-20 21:35 UTC (permalink / raw)
  To: LKML
  Cc: x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, keescook, hughd,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Rik van Riel,
	Boris Ostrovsky, Juergen Gross, David Laight, Eduardo Valentin,
	aliguori, Will Deacon, Vlastimil Babka, daniel.gruss,
	Dave Hansen, Ingo Molnar, Borislav Petkov, Borislav Petkov,
	H. Peter Anvin

[-- Attachment #1: 0070-x86-entry-Remove-SYSENTER_stack-naming.patch --]
[-- Type: text/plain, Size: 10926 bytes --]

From: Dave Hansen <dave.hansen@linux.intel.com>

If the kernel oopses while on the trampoline stack, it will print
"<SYSENTER>" even if SYSENTER is not involved.  That is rather confusing.

The "SYSENTER" stack is used for a lot more than SYSENTER now.  Give it a
better string to display in stack dumps, and rename the kernel code to
match.

Also move the 32-bit code over to the new naming even though it still uses
the entry stack only for SYSENTER.
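
For reference, a C sketch (not part of the patch) of the "are we on the
entry stack?" test that the 32-bit asm below performs; the unsigned
subtraction also rejects stack pointers above the stack's end:

  static bool on_entry_stack(unsigned long sp, struct entry_stack *stack)
  {
          unsigned long end = (unsigned long)(stack + 1);

          /* matches: ecx = end - esp; "jb" taken if ecx < SIZEOF_entry_stack */
          return end - sp < sizeof(*stack);
  }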

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Borislav Petkov <bp@suse.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
---
 arch/x86/entry/entry_32.S         |   12 ++++++------
 arch/x86/entry/entry_64.S         |    4 ++--
 arch/x86/include/asm/fixmap.h     |    8 ++++----
 arch/x86/include/asm/processor.h  |    6 +++---
 arch/x86/include/asm/stacktrace.h |    4 ++--
 arch/x86/kernel/asm-offsets.c     |    4 ++--
 arch/x86/kernel/asm-offsets_32.c  |    2 +-
 arch/x86/kernel/cpu/common.c      |   14 +++++++-------
 arch/x86/kernel/dumpstack.c       |   10 +++++-----
 arch/x86/kernel/dumpstack_32.c    |    6 +++---
 arch/x86/kernel/dumpstack_64.c    |   12 +++++++++---
 11 files changed, 44 insertions(+), 38 deletions(-)

--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -942,9 +942,9 @@ ENTRY(debug)
 
 	/* Are we currently on the SYSENTER stack? */
 	movl	PER_CPU_VAR(cpu_entry_area), %ecx
-	addl	$CPU_ENTRY_AREA_SYSENTER_stack + SIZEOF_SYSENTER_stack, %ecx
-	subl	%eax, %ecx	/* ecx = (end of SYSENTER_stack) - esp */
-	cmpl	$SIZEOF_SYSENTER_stack, %ecx
+	addl	$CPU_ENTRY_AREA_entry_stack + SIZEOF_entry_stack, %ecx
+	subl	%eax, %ecx	/* ecx = (end of entry_stack) - esp */
+	cmpl	$SIZEOF_entry_stack, %ecx
 	jb	.Ldebug_from_sysenter_stack
 
 	TRACE_IRQS_OFF
@@ -986,9 +986,9 @@ ENTRY(nmi)
 
 	/* Are we currently on the SYSENTER stack? */
 	movl	PER_CPU_VAR(cpu_entry_area), %ecx
-	addl	$CPU_ENTRY_AREA_SYSENTER_stack + SIZEOF_SYSENTER_stack, %ecx
-	subl	%eax, %ecx	/* ecx = (end of SYSENTER_stack) - esp */
-	cmpl	$SIZEOF_SYSENTER_stack, %ecx
+	addl	$CPU_ENTRY_AREA_entry_stack + SIZEOF_entry_stack, %ecx
+	subl	%eax, %ecx	/* ecx = (end of entry_stack) - esp */
+	cmpl	$SIZEOF_entry_stack, %ecx
 	jb	.Lnmi_from_sysenter_stack
 
 	/* Not on SYSENTER stack. */
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -154,8 +154,8 @@ END(native_usergs_sysret64)
 	_entry_trampoline - CPU_ENTRY_AREA_entry_trampoline(%rip)
 
 /* The top word of the SYSENTER stack is hot and is usable as scratch space. */
-#define RSP_SCRATCH	CPU_ENTRY_AREA_SYSENTER_stack + \
-			SIZEOF_SYSENTER_stack - 8 + CPU_ENTRY_AREA
+#define RSP_SCRATCH	CPU_ENTRY_AREA_entry_stack + \
+			SIZEOF_entry_stack - 8 + CPU_ENTRY_AREA
 
 ENTRY(entry_SYSCALL_64_trampoline)
 	UNWIND_HINT_EMPTY
--- a/arch/x86/include/asm/fixmap.h
+++ b/arch/x86/include/asm/fixmap.h
@@ -56,10 +56,10 @@ struct cpu_entry_area {
 	char gdt[PAGE_SIZE];
 
 	/*
-	 * The GDT is just below SYSENTER_stack and thus serves (on x86_64) as
+	 * The GDT is just below entry_stack and thus serves (on x86_64) as
 	 * a a read-only guard page.
 	 */
-	struct SYSENTER_stack_page SYSENTER_stack_page;
+	struct entry_stack_page entry_stack_page;
 
 	/*
 	 * On x86_64, the TSS is mapped RO.  On x86_32, it's mapped RW because
@@ -250,9 +250,9 @@ static inline struct cpu_entry_area *get
 	return (struct cpu_entry_area *)__fix_to_virt(__get_cpu_entry_area_page_index(cpu, 0));
 }
 
-static inline struct SYSENTER_stack *cpu_SYSENTER_stack(int cpu)
+static inline struct entry_stack *cpu_entry_stack(int cpu)
 {
-	return &get_cpu_entry_area(cpu)->SYSENTER_stack_page.stack;
+	return &get_cpu_entry_area(cpu)->entry_stack_page.stack;
 }
 
 #endif /* !__ASSEMBLY__ */
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -336,12 +336,12 @@ struct x86_hw_tss {
 #define IO_BITMAP_OFFSET		(offsetof(struct tss_struct, io_bitmap) - offsetof(struct tss_struct, x86_tss))
 #define INVALID_IO_BITMAP_OFFSET	0x8000
 
-struct SYSENTER_stack {
+struct entry_stack {
 	unsigned long		words[64];
 };
 
-struct SYSENTER_stack_page {
-	struct SYSENTER_stack stack;
+struct entry_stack_page {
+	struct entry_stack stack;
 } __aligned(PAGE_SIZE);
 
 struct tss_struct {
--- a/arch/x86/include/asm/stacktrace.h
+++ b/arch/x86/include/asm/stacktrace.h
@@ -16,7 +16,7 @@ enum stack_type {
 	STACK_TYPE_TASK,
 	STACK_TYPE_IRQ,
 	STACK_TYPE_SOFTIRQ,
-	STACK_TYPE_SYSENTER,
+	STACK_TYPE_ENTRY,
 	STACK_TYPE_EXCEPTION,
 	STACK_TYPE_EXCEPTION_LAST = STACK_TYPE_EXCEPTION + N_EXCEPTION_STACKS-1,
 };
@@ -29,7 +29,7 @@ struct stack_info {
 bool in_task_stack(unsigned long *stack, struct task_struct *task,
 		   struct stack_info *info);
 
-bool in_sysenter_stack(unsigned long *stack, struct stack_info *info);
+bool in_entry_stack(unsigned long *stack, struct stack_info *info);
 
 int get_stack_info(unsigned long *stack, struct task_struct *task,
 		   struct stack_info *info, unsigned long *visit_mask);
--- a/arch/x86/kernel/asm-offsets.c
+++ b/arch/x86/kernel/asm-offsets.c
@@ -97,6 +97,6 @@ void common(void) {
 	/* Layout info for cpu_entry_area */
 	OFFSET(CPU_ENTRY_AREA_tss, cpu_entry_area, tss);
 	OFFSET(CPU_ENTRY_AREA_entry_trampoline, cpu_entry_area, entry_trampoline);
-	OFFSET(CPU_ENTRY_AREA_SYSENTER_stack, cpu_entry_area, SYSENTER_stack_page);
-	DEFINE(SIZEOF_SYSENTER_stack, sizeof(struct SYSENTER_stack));
+	OFFSET(CPU_ENTRY_AREA_entry_stack, cpu_entry_area, entry_stack_page);
+	DEFINE(SIZEOF_entry_stack, sizeof(struct entry_stack));
 }
--- a/arch/x86/kernel/asm-offsets_32.c
+++ b/arch/x86/kernel/asm-offsets_32.c
@@ -48,7 +48,7 @@ void foo(void)
 
 	/* Offset from the sysenter stack to tss.sp0 */
 	DEFINE(TSS_sysenter_sp0, offsetof(struct cpu_entry_area, tss.x86_tss.sp0) -
-	       offsetofend(struct cpu_entry_area, SYSENTER_stack_page.stack));
+	       offsetofend(struct cpu_entry_area, entry_stack_page.stack));
 
 #ifdef CONFIG_CC_STACKPROTECTOR
 	BLANK();
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -487,8 +487,8 @@ static DEFINE_PER_CPU_PAGE_ALIGNED(char,
 	[(N_EXCEPTION_STACKS - 1) * EXCEPTION_STKSZ + DEBUG_STKSZ]);
 #endif
 
-static DEFINE_PER_CPU_PAGE_ALIGNED(struct SYSENTER_stack_page,
-				   SYSENTER_stack_storage);
+static DEFINE_PER_CPU_PAGE_ALIGNED(struct entry_stack_page,
+				   entry_stack_storage);
 
 static void __init
 set_percpu_fixmap_pages(int idx, void *ptr, int pages, pgprot_t prot)
@@ -523,8 +523,8 @@ static void __init setup_cpu_entry_area(
 #endif
 
 	__set_fixmap(get_cpu_entry_area_index(cpu, gdt), get_cpu_gdt_paddr(cpu), gdt_prot);
-	set_percpu_fixmap_pages(get_cpu_entry_area_index(cpu, SYSENTER_stack_page),
-				per_cpu_ptr(&SYSENTER_stack_storage, cpu), 1,
+	set_percpu_fixmap_pages(get_cpu_entry_area_index(cpu, entry_stack_page),
+				per_cpu_ptr(&entry_stack_storage, cpu), 1,
 				PAGE_KERNEL);
 
 	/*
@@ -1323,7 +1323,7 @@ void enable_sep_cpu(void)
 
 	tss->x86_tss.ss1 = __KERNEL_CS;
 	wrmsr(MSR_IA32_SYSENTER_CS, tss->x86_tss.ss1, 0);
-	wrmsr(MSR_IA32_SYSENTER_ESP, (unsigned long)(cpu_SYSENTER_stack(cpu) + 1), 0);
+	wrmsr(MSR_IA32_SYSENTER_ESP, (unsigned long)(cpu_entry_stack(cpu) + 1), 0);
 	wrmsr(MSR_IA32_SYSENTER_EIP, (unsigned long)entry_SYSENTER_32, 0);
 
 	put_cpu();
@@ -1440,7 +1440,7 @@ void syscall_init(void)
 	 * AMD doesn't allow SYSENTER in long mode (either 32- or 64-bit).
 	 */
 	wrmsrl_safe(MSR_IA32_SYSENTER_CS, (u64)__KERNEL_CS);
-	wrmsrl_safe(MSR_IA32_SYSENTER_ESP, (unsigned long)(cpu_SYSENTER_stack(cpu) + 1));
+	wrmsrl_safe(MSR_IA32_SYSENTER_ESP, (unsigned long)(cpu_entry_stack(cpu) + 1));
 	wrmsrl_safe(MSR_IA32_SYSENTER_EIP, (u64)entry_SYSENTER_compat);
 #else
 	wrmsrl(MSR_CSTAR, (unsigned long)ignore_sysret);
@@ -1655,7 +1655,7 @@ void cpu_init(void)
 	 */
 	set_tss_desc(cpu, &get_cpu_entry_area(cpu)->tss.x86_tss);
 	load_TR_desc();
-	load_sp0((unsigned long)(cpu_SYSENTER_stack(cpu) + 1));
+	load_sp0((unsigned long)(cpu_entry_stack(cpu) + 1));
 
 	load_mm_ldt(&init_mm);
 
--- a/arch/x86/kernel/dumpstack.c
+++ b/arch/x86/kernel/dumpstack.c
@@ -43,9 +43,9 @@ bool in_task_stack(unsigned long *stack,
 	return true;
 }
 
-bool in_sysenter_stack(unsigned long *stack, struct stack_info *info)
+bool in_entry_stack(unsigned long *stack, struct stack_info *info)
 {
-	struct SYSENTER_stack *ss = cpu_SYSENTER_stack(smp_processor_id());
+	struct entry_stack *ss = cpu_entry_stack(smp_processor_id());
 
 	void *begin = ss;
 	void *end = ss + 1;
@@ -53,7 +53,7 @@ bool in_sysenter_stack(unsigned long *st
 	if ((void *)stack < begin || (void *)stack >= end)
 		return false;
 
-	info->type	= STACK_TYPE_SYSENTER;
+	info->type	= STACK_TYPE_ENTRY;
 	info->begin	= begin;
 	info->end	= end;
 	info->next_sp	= NULL;
@@ -111,13 +111,13 @@ void show_trace_log_lvl(struct task_stru
 	 * - task stack
 	 * - interrupt stack
 	 * - HW exception stacks (double fault, nmi, debug, mce)
-	 * - SYSENTER stack
+	 * - entry stack
 	 *
 	 * x86-32 can have up to four stacks:
 	 * - task stack
 	 * - softirq stack
 	 * - hardirq stack
-	 * - SYSENTER stack
+	 * - entry stack
 	 */
 	for (regs = NULL; stack; stack = PTR_ALIGN(stack_info.next_sp, sizeof(long))) {
 		const char *stack_name;
--- a/arch/x86/kernel/dumpstack_32.c
+++ b/arch/x86/kernel/dumpstack_32.c
@@ -26,8 +26,8 @@ const char *stack_type_name(enum stack_t
 	if (type == STACK_TYPE_SOFTIRQ)
 		return "SOFTIRQ";
 
-	if (type == STACK_TYPE_SYSENTER)
-		return "SYSENTER";
+	if (type == STACK_TYPE_ENTRY)
+		return "ENTRY_TRAMPOLINE";
 
 	return NULL;
 }
@@ -96,7 +96,7 @@ int get_stack_info(unsigned long *stack,
 	if (task != current)
 		goto unknown;
 
-	if (in_sysenter_stack(stack, info))
+	if (in_entry_stack(stack, info))
 		goto recursion_check;
 
 	if (in_hardirq_stack(stack, info))
--- a/arch/x86/kernel/dumpstack_64.c
+++ b/arch/x86/kernel/dumpstack_64.c
@@ -37,8 +37,14 @@ const char *stack_type_name(enum stack_t
 	if (type == STACK_TYPE_IRQ)
 		return "IRQ";
 
-	if (type == STACK_TYPE_SYSENTER)
-		return "SYSENTER";
+	if (type == STACK_TYPE_ENTRY) {
+		/*
+		 * On 64-bit, we have a generic entry stack that we
+		 * use for all the kernel entry points, including
+		 * SYSENTER.
+		 */
+		return "ENTRY_TRAMPOLINE";
+	}
 
 	if (type >= STACK_TYPE_EXCEPTION && type <= STACK_TYPE_EXCEPTION_LAST)
 		return exception_stack_names[type - STACK_TYPE_EXCEPTION];
@@ -118,7 +124,7 @@ int get_stack_info(unsigned long *stack,
 	if (in_irq_stack(stack, info))
 		goto recursion_check;
 
-	if (in_sysenter_stack(stack, info))
+	if (in_entry_stack(stack, info))
 		goto recursion_check;
 
 	goto unknown;

^ permalink raw reply	[flat|nested] 85+ messages in thread

* [patch V181 12/54] x86/uv: Use the right tlbflush API
  2017-12-20 21:35 [patch V181 00/54] x86/pti: Final XMAS release Thomas Gleixner
@ 2017-12-20 21:35   ` Thomas Gleixner
  2017-12-20 21:35 ` [patch V181 02/54] x86/mm/dump_pagetables: Check PAGE_PRESENT for real Thomas Gleixner
                     ` (55 subsequent siblings)
  56 siblings, 0 replies; 85+ messages in thread
From: Thomas Gleixner @ 2017-12-20 21:35 UTC (permalink / raw)
  To: LKML
  Cc: x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, keescook, hughd,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Rik van Riel,
	Boris Ostrovsky, Juergen Gross, David Laight, Eduardo Valentin,
	aliguori, Will Deacon, Vlastimil Babka, daniel.gruss,
	Ingo Molnar, Andrew Banman, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Mike Travis, linux-mm

[-- Attachment #1: 0065-x86-uv-Use-the-right-tlbflush-API.patch --]
[-- Type: text/plain, Size: 1555 bytes --]

From: Peter Zijlstra <peterz@infradead.org>

Since uv_flush_tlb_others() implements flush_tlb_others(), which is
about flushing user mappings, it should use __flush_tlb_single(), which
is likewise about flushing user mappings.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Andrew Banman <abanman@hpe.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Laight <David.Laight@aculab.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Eduardo Valentin <eduval@amazon.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Travis <mike.travis@hpe.com>
Cc: Will Deacon <will.deacon@arm.com>
Cc: aliguori@amazon.com
Cc: daniel.gruss@iaik.tugraz.at
Cc: hughd@google.com
Cc: keescook@google.com
Cc: linux-mm@kvack.org
---
 arch/x86/platform/uv/tlb_uv.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/arch/x86/platform/uv/tlb_uv.c
+++ b/arch/x86/platform/uv/tlb_uv.c
@@ -299,7 +299,7 @@ static void bau_process_message(struct m
 		local_flush_tlb();
 		stat->d_alltlb++;
 	} else {
-		__flush_tlb_one(msg->address);
+		__flush_tlb_single(msg->address);
 		stat->d_onetlb++;
 	}
 	stat->d_requestee++;

^ permalink raw reply	[flat|nested] 85+ messages in thread

* [patch V181 13/54] x86/microcode: Dont abuse the tlbflush interface
  2017-12-20 21:35 [patch V181 00/54] x86/pti: Final XMAS release Thomas Gleixner
@ 2017-12-20 21:35   ` Thomas Gleixner
  2017-12-20 21:35 ` [patch V181 02/54] x86/mm/dump_pagetables: Check PAGE_PRESENT for real Thomas Gleixner
                     ` (55 subsequent siblings)
  56 siblings, 0 replies; 85+ messages in thread
From: Thomas Gleixner @ 2017-12-20 21:35 UTC (permalink / raw)
  To: LKML
  Cc: x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, keescook, hughd,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Rik van Riel,
	Boris Ostrovsky, Juergen Gross, David Laight, Eduardo Valentin,
	aliguori, Will Deacon, Vlastimil Babka, daniel.gruss,
	Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin,
	fenghua.yu, linux-mm

[-- Attachment #1: 0066-x86-microcode-Dont-abuse-the-tlbflush-interface.patch --]
[-- Type: text/plain, Size: 3051 bytes --]

From: Peter Zijlstra <peterz@infradead.org>

Commit: ec400ddeff20 ("x86/microcode_intel_early.c: Early update ucode on
Intel's CPU") grubbed into tlbflush internals without a coherent explanation.

Since it says it's a precaution and the SDM doesn't mention anything like
this, take it out back.
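
For context, the global-flush path that remains (see the tlbflush.h hunk
below) relies on the architectural rule that any CR4 write which changes
PGE flushes the whole TLB, global entries included, so the XOR works
whether or not PGE was set beforehand. A sketch:

  unsigned long cr4 = this_cpu_read(cpu_tlbstate.cr4);

  native_write_cr4(cr4 ^ X86_CR4_PGE);    /* toggled PGE: full flush */
  native_write_cr4(cr4);                  /* restored CR4: full flush again */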

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Laight <David.Laight@aculab.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Eduardo Valentin <eduval@amazon.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: aliguori@amazon.com
Cc: daniel.gruss@iaik.tugraz.at
Cc: fenghua.yu@intel.com
Cc: hughd@google.com
Cc: keescook@google.com
Cc: linux-mm@kvack.org
---
 arch/x86/include/asm/tlbflush.h       |   19 ++++++-------------
 arch/x86/kernel/cpu/microcode/intel.c |   13 -------------
 2 files changed, 6 insertions(+), 26 deletions(-)

--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -246,20 +246,9 @@ static inline void __native_flush_tlb(vo
 	preempt_enable();
 }
 
-static inline void __native_flush_tlb_global_irq_disabled(void)
-{
-	unsigned long cr4;
-
-	cr4 = this_cpu_read(cpu_tlbstate.cr4);
-	/* clear PGE */
-	native_write_cr4(cr4 & ~X86_CR4_PGE);
-	/* write old PGE again and flush TLBs */
-	native_write_cr4(cr4);
-}
-
 static inline void __native_flush_tlb_global(void)
 {
-	unsigned long flags;
+	unsigned long cr4, flags;
 
 	if (static_cpu_has(X86_FEATURE_INVPCID)) {
 		/*
@@ -277,7 +266,11 @@ static inline void __native_flush_tlb_gl
 	 */
 	raw_local_irq_save(flags);
 
-	__native_flush_tlb_global_irq_disabled();
+	cr4 = this_cpu_read(cpu_tlbstate.cr4);
+	/* toggle PGE */
+	native_write_cr4(cr4 ^ X86_CR4_PGE);
+	/* write old PGE again and flush TLBs */
+	native_write_cr4(cr4);
 
 	raw_local_irq_restore(flags);
 }
--- a/arch/x86/kernel/cpu/microcode/intel.c
+++ b/arch/x86/kernel/cpu/microcode/intel.c
@@ -565,15 +565,6 @@ static void print_ucode(struct ucode_cpu
 }
 #else
 
-/*
- * Flush global tlb. We only do this in x86_64 where paging has been enabled
- * already and PGE should be enabled as well.
- */
-static inline void flush_tlb_early(void)
-{
-	__native_flush_tlb_global_irq_disabled();
-}
-
 static inline void print_ucode(struct ucode_cpu_info *uci)
 {
 	struct microcode_intel *mc;
@@ -602,10 +593,6 @@ static int apply_microcode_early(struct
 	if (rev != mc->hdr.rev)
 		return -1;
 
-#ifdef CONFIG_X86_64
-	/* Flush global tlb. This is precaution. */
-	flush_tlb_early();
-#endif
 	uci->cpu_sig.rev = rev;
 
 	if (early)

^ permalink raw reply	[flat|nested] 85+ messages in thread

* [patch V181 14/54] x86/mm: Use __flush_tlb_one() for kernel memory
  2017-12-20 21:35 [patch V181 00/54] x86/pti: Final XMAS release Thomas Gleixner
@ 2017-12-20 21:35   ` Thomas Gleixner
  2017-12-20 21:35 ` [patch V181 02/54] x86/mm/dump_pagetables: Check PAGE_PRESENT for real Thomas Gleixner
                     ` (55 subsequent siblings)
  56 siblings, 0 replies; 85+ messages in thread
From: Thomas Gleixner @ 2017-12-20 21:35 UTC (permalink / raw)
  To: LKML
  Cc: x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, keescook, hughd,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Rik van Riel,
	Boris Ostrovsky, Juergen Gross, David Laight, Eduardo Valentin,
	aliguori, Will Deacon, Vlastimil Babka, daniel.gruss,
	Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin,
	linux-mm

[-- Attachment #1: 0064-x86-mm-Use-__flush_tlb_one-for-kernel-memory.patch --]
[-- Type: text/plain, Size: 1404 bytes --]

From: Peter Zijlstra <peterz@infradead.org>

__flush_tlb_single() is for user mappings, __flush_tlb_one() for
kernel mappings.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Laight <David.Laight@aculab.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Eduardo Valentin <eduval@amazon.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: aliguori@amazon.com
Cc: daniel.gruss@iaik.tugraz.at
Cc: hughd@google.com
Cc: keescook@google.com
Cc: linux-mm@kvack.org
---
 arch/x86/mm/tlb.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -551,7 +551,7 @@ static void do_kernel_range_flush(void *
 
 	/* flush range by one by one 'invlpg' */
 	for (addr = f->start; addr < f->end; addr += PAGE_SIZE)
-		__flush_tlb_single(addr);
+		__flush_tlb_one(addr);
 }
 
 void flush_tlb_kernel_range(unsigned long start, unsigned long end)

^ permalink raw reply	[flat|nested] 85+ messages in thread

* [patch V181 15/54] x86/mm: Remove superfluous barriers
  2017-12-20 21:35 [patch V181 00/54] x86/pti: Final XMAS release Thomas Gleixner
@ 2017-12-20 21:35   ` Thomas Gleixner
  2017-12-20 21:35 ` [patch V181 02/54] x86/mm/dump_pagetables: Check PAGE_PRESENT for real Thomas Gleixner
                     ` (55 subsequent siblings)
  56 siblings, 0 replies; 85+ messages in thread
From: Thomas Gleixner @ 2017-12-20 21:35 UTC (permalink / raw)
  To: LKML
  Cc: x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, keescook, hughd,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Rik van Riel,
	Boris Ostrovsky, Juergen Gross, David Laight, Eduardo Valentin,
	aliguori, Will Deacon, Vlastimil Babka, daniel.gruss,
	Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin,
	linux-mm

[-- Attachment #1: 0061-x86-mm-Remove-superfluous-barriers.patch --]
[-- Type: text/plain, Size: 1733 bytes --]

From: Peter Zijlstra <peterz@infradead.org>

atomic64_inc_return() already implies smp_mb() before and after.
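
Per the kernel's atomic ordering rules (Documentation/atomic_t.txt), every
value-returning atomic RMW operation is fully ordered, so the bare
increment keeps the documented barrier semantics:

  /* fully ordered, as if: smp_mb(); atomic64_inc(...); smp_mb(); */
  return atomic64_inc_return(&mm->context.tlb_gen);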

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Laight <David.Laight@aculab.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Eduardo Valentin <eduval@amazon.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: aliguori@amazon.com
Cc: daniel.gruss@iaik.tugraz.at
Cc: hughd@google.com
Cc: keescook@google.com
Cc: linux-mm@kvack.org
---
 arch/x86/include/asm/tlbflush.h |    8 +-------
 1 file changed, 1 insertion(+), 7 deletions(-)

--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -60,19 +60,13 @@ static inline void invpcid_flush_all_non
 
 static inline u64 inc_mm_tlb_gen(struct mm_struct *mm)
 {
-	u64 new_tlb_gen;
-
 	/*
 	 * Bump the generation count.  This also serves as a full barrier
 	 * that synchronizes with switch_mm(): callers are required to order
 	 * their read of mm_cpumask after their writes to the paging
 	 * structures.
 	 */
-	smp_mb__before_atomic();
-	new_tlb_gen = atomic64_inc_return(&mm->context.tlb_gen);
-	smp_mb__after_atomic();
-
-	return new_tlb_gen;
+	return atomic64_inc_return(&mm->context.tlb_gen);
 }
 
 #ifdef CONFIG_PARAVIRT

^ permalink raw reply	[flat|nested] 85+ messages in thread

* [patch V181 16/54] x86/mm: Clarify which functions are supposed to flush what
  2017-12-20 21:35 [patch V181 00/54] x86/pti: Final XMAS release Thomas Gleixner
@ 2017-12-20 21:35   ` Thomas Gleixner
  2017-12-20 21:35 ` [patch V181 02/54] x86/mm/dump_pagetables: Check PAGE_PRESENT for real Thomas Gleixner
                     ` (55 subsequent siblings)
  56 siblings, 0 replies; 85+ messages in thread
From: Thomas Gleixner @ 2017-12-20 21:35 UTC (permalink / raw)
  To: LKML
  Cc: x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, keescook, hughd,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Rik van Riel,
	Boris Ostrovsky, Juergen Gross, David Laight, Eduardo Valentin,
	aliguori, Will Deacon, Vlastimil Babka, daniel.gruss,
	Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin,
	linux-mm

[-- Attachment #1: 0067-x86-mm-Clarify-which-functions-are-supposed-to-flush.patch --]
[-- Type: text/plain, Size: 2412 bytes --]

From: Peter Zijlstra <peterz@infradead.org>

Per popular request, document what each of the TLB flush helpers is
supposed to flush.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Laight <David.Laight@aculab.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Eduardo Valentin <eduval@amazon.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: aliguori@amazon.com
Cc: daniel.gruss@iaik.tugraz.at
Cc: hughd@google.com
Cc: keescook@google.com
Cc: linux-mm@kvack.org
---
 arch/x86/include/asm/tlbflush.h |   23 +++++++++++++++++++++--
 1 file changed, 21 insertions(+), 2 deletions(-)

--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -228,6 +228,9 @@ static inline void cr4_set_bits_and_upda
 
 extern void initialize_tlbstate_and_flush(void);
 
+/*
+ * flush the entire current user mapping
+ */
 static inline void __native_flush_tlb(void)
 {
 	/*
@@ -240,6 +243,9 @@ static inline void __native_flush_tlb(vo
 	preempt_enable();
 }
 
+/*
+ * flush everything
+ */
 static inline void __native_flush_tlb_global(void)
 {
 	unsigned long cr4, flags;
@@ -269,17 +275,27 @@ static inline void __native_flush_tlb_gl
 	raw_local_irq_restore(flags);
 }
 
+/*
+ * flush one page in the user mapping
+ */
 static inline void __native_flush_tlb_single(unsigned long addr)
 {
 	asm volatile("invlpg (%0)" ::"r" (addr) : "memory");
 }
 
+/*
+ * flush everything
+ */
 static inline void __flush_tlb_all(void)
 {
-	if (boot_cpu_has(X86_FEATURE_PGE))
+	if (boot_cpu_has(X86_FEATURE_PGE)) {
 		__flush_tlb_global();
-	else
+	} else {
+		/*
+		 * !PGE -> !PCID (setup_pcid()), thus every flush is total.
+		 */
 		__flush_tlb();
+	}
 
 	/*
 	 * Note: if we somehow had PCID but not PGE, then this wouldn't work --
@@ -290,6 +306,9 @@ static inline void __flush_tlb_all(void)
 	 */
 }
 
+/*
+ * flush one page in the kernel mapping
+ */
 static inline void __flush_tlb_one(unsigned long addr)
 {
 	count_vm_tlb_event(NR_TLB_LOCAL_FLUSH_ONE);

^ permalink raw reply	[flat|nested] 85+ messages in thread

* [patch V181 17/54] x86/mm: Move the CR3 construction functions to tlbflush.h
  2017-12-20 21:35 [patch V181 00/54] x86/pti: Final XMAS release Thomas Gleixner
@ 2017-12-20 21:35   ` Thomas Gleixner
  2017-12-20 21:35 ` [patch V181 02/54] x86/mm/dump_pagetables: Check PAGE_PRESENT for real Thomas Gleixner
                     ` (55 subsequent siblings)
  56 siblings, 0 replies; 85+ messages in thread
From: Thomas Gleixner @ 2017-12-20 21:35 UTC (permalink / raw)
  To: LKML
  Cc: x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, keescook, hughd,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Rik van Riel,
	Boris Ostrovsky, Juergen Gross, David Laight, Eduardo Valentin,
	aliguori, Will Deacon, Vlastimil Babka, daniel.gruss,
	Dave Hansen, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	linux-mm

[-- Attachment #1: 0047-x86-mm-Move-the-CR3-construction-functions-to-tlbflu.patch --]
[-- Type: text/plain, Size: 5654 bytes --]

For flushing the TLB, the ASID which has been programmed into the hardware
must be known.  That differs from what is in 'cpu_tlbstate'.

Add functions to transform the 'cpu_tlbstate' values into the value
programmed into the hardware (CR3).

It's not easy to include mmu_context.h into tlbflush.h, so just move the
CR3 building over to tlbflush.h.
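
A usage sketch of the resulting CR3 layout (assuming PCID-capable
hardware): the low twelve bits carry the hardware PCID, which is the
software ASID plus one, above the page-aligned pgd address:

  /* e.g. switching to 'next' with software ASID 5: */
  write_cr3(build_cr3(next->pgd, 5));
  /* CR3 == __sme_pa(next->pgd) | 6, i.e. hardware PCID 6 */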

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: David Laight <David.Laight@aculab.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Eduardo Valentin <eduval@amazon.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: aliguori@amazon.com
Cc: daniel.gruss@iaik.tugraz.at
Cc: hughd@google.com
Cc: keescook@google.com
Cc: linux-mm@kvack.org

---
 arch/x86/include/asm/mmu_context.h |   29 +----------------------------
 arch/x86/include/asm/tlbflush.h    |   26 ++++++++++++++++++++++++++
 arch/x86/mm/tlb.c                  |    8 ++++----
 3 files changed, 31 insertions(+), 32 deletions(-)

--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -291,33 +291,6 @@ static inline bool arch_vma_access_permi
 }
 
 /*
- * If PCID is on, ASID-aware code paths put the ASID+1 into the PCID
- * bits.  This serves two purposes.  It prevents a nasty situation in
- * which PCID-unaware code saves CR3, loads some other value (with PCID
- * == 0), and then restores CR3, thus corrupting the TLB for ASID 0 if
- * the saved ASID was nonzero.  It also means that any bugs involving
- * loading a PCID-enabled CR3 with CR4.PCIDE off will trigger
- * deterministically.
- */
-
-static inline unsigned long build_cr3(struct mm_struct *mm, u16 asid)
-{
-	if (static_cpu_has(X86_FEATURE_PCID)) {
-		VM_WARN_ON_ONCE(asid > 4094);
-		return __sme_pa(mm->pgd) | (asid + 1);
-	} else {
-		VM_WARN_ON_ONCE(asid != 0);
-		return __sme_pa(mm->pgd);
-	}
-}
-
-static inline unsigned long build_cr3_noflush(struct mm_struct *mm, u16 asid)
-{
-	VM_WARN_ON_ONCE(asid > 4094);
-	return __sme_pa(mm->pgd) | (asid + 1) | CR3_NOFLUSH;
-}
-
-/*
  * This can be used from process context to figure out what the value of
  * CR3 is without needing to do a (slow) __read_cr3().
  *
@@ -326,7 +299,7 @@ static inline unsigned long build_cr3_no
  */
 static inline unsigned long __get_current_cr3_fast(void)
 {
-	unsigned long cr3 = build_cr3(this_cpu_read(cpu_tlbstate.loaded_mm),
+	unsigned long cr3 = build_cr3(this_cpu_read(cpu_tlbstate.loaded_mm)->pgd,
 		this_cpu_read(cpu_tlbstate.loaded_mm_asid));
 
 	/* For now, be very restrictive about when this can be called. */
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -69,6 +69,32 @@ static inline u64 inc_mm_tlb_gen(struct
 	return atomic64_inc_return(&mm->context.tlb_gen);
 }
 
+/*
+ * If PCID is on, ASID-aware code paths put the ASID+1 into the PCID bits.
+ * This serves two purposes.  It prevents a nasty situation in which
+ * PCID-unaware code saves CR3, loads some other value (with PCID == 0),
+ * and then restores CR3, thus corrupting the TLB for ASID 0 if the saved
+ * ASID was nonzero.  It also means that any bugs involving loading a
+ * PCID-enabled CR3 with CR4.PCIDE off will trigger deterministically.
+ */
+struct pgd_t;
+static inline unsigned long build_cr3(pgd_t *pgd, u16 asid)
+{
+	if (static_cpu_has(X86_FEATURE_PCID)) {
+		VM_WARN_ON_ONCE(asid > 4094);
+		return __sme_pa(pgd) | (asid + 1);
+	} else {
+		VM_WARN_ON_ONCE(asid != 0);
+		return __sme_pa(pgd);
+	}
+}
+
+static inline unsigned long build_cr3_noflush(pgd_t *pgd, u16 asid)
+{
+	VM_WARN_ON_ONCE(asid > 4094);
+	return __sme_pa(pgd) | (asid + 1) | CR3_NOFLUSH;
+}
+
 #ifdef CONFIG_PARAVIRT
 #include <asm/paravirt.h>
 #else
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -128,7 +128,7 @@ void switch_mm_irqs_off(struct mm_struct
 	 * isn't free.
 	 */
 #ifdef CONFIG_DEBUG_VM
-	if (WARN_ON_ONCE(__read_cr3() != build_cr3(real_prev, prev_asid))) {
+	if (WARN_ON_ONCE(__read_cr3() != build_cr3(real_prev->pgd, prev_asid))) {
 		/*
 		 * If we were to BUG here, we'd be very likely to kill
 		 * the system so hard that we don't see the call trace.
@@ -195,7 +195,7 @@ void switch_mm_irqs_off(struct mm_struct
 		if (need_flush) {
 			this_cpu_write(cpu_tlbstate.ctxs[new_asid].ctx_id, next->context.ctx_id);
 			this_cpu_write(cpu_tlbstate.ctxs[new_asid].tlb_gen, next_tlb_gen);
-			write_cr3(build_cr3(next, new_asid));
+			write_cr3(build_cr3(next->pgd, new_asid));
 
 			/*
 			 * NB: This gets called via leave_mm() in the idle path
@@ -208,7 +208,7 @@ void switch_mm_irqs_off(struct mm_struct
 			trace_tlb_flush_rcuidle(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL);
 		} else {
 			/* The new ASID is already up to date. */
-			write_cr3(build_cr3_noflush(next, new_asid));
+			write_cr3(build_cr3_noflush(next->pgd, new_asid));
 
 			/* See above wrt _rcuidle. */
 			trace_tlb_flush_rcuidle(TLB_FLUSH_ON_TASK_SWITCH, 0);
@@ -288,7 +288,7 @@ void initialize_tlbstate_and_flush(void)
 		!(cr4_read_shadow() & X86_CR4_PCIDE));
 
 	/* Force ASID 0 and force a TLB flush. */
-	write_cr3(build_cr3(mm, 0));
+	write_cr3(build_cr3(mm->pgd, 0));
 
 	/* Reinitialize tlbstate. */
 	this_cpu_write(cpu_tlbstate.loaded_mm_asid, 0);

^ permalink raw reply	[flat|nested] 85+ messages in thread

 		/*
 		 * If we were to BUG here, we'd be very likely to kill
 		 * the system so hard that we don't see the call trace.
@@ -195,7 +195,7 @@ void switch_mm_irqs_off(struct mm_struct
 		if (need_flush) {
 			this_cpu_write(cpu_tlbstate.ctxs[new_asid].ctx_id, next->context.ctx_id);
 			this_cpu_write(cpu_tlbstate.ctxs[new_asid].tlb_gen, next_tlb_gen);
-			write_cr3(build_cr3(next, new_asid));
+			write_cr3(build_cr3(next->pgd, new_asid));
 
 			/*
 			 * NB: This gets called via leave_mm() in the idle path
@@ -208,7 +208,7 @@ void switch_mm_irqs_off(struct mm_struct
 			trace_tlb_flush_rcuidle(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL);
 		} else {
 			/* The new ASID is already up to date. */
-			write_cr3(build_cr3_noflush(next, new_asid));
+			write_cr3(build_cr3_noflush(next->pgd, new_asid));
 
 			/* See above wrt _rcuidle. */
 			trace_tlb_flush_rcuidle(TLB_FLUSH_ON_TASK_SWITCH, 0);
@@ -288,7 +288,7 @@ void initialize_tlbstate_and_flush(void)
 		!(cr4_read_shadow() & X86_CR4_PCIDE));
 
 	/* Force ASID 0 and force a TLB flush. */
-	write_cr3(build_cr3(mm, 0));
+	write_cr3(build_cr3(mm->pgd, 0));
 
 	/* Reinitialize tlbstate. */
 	this_cpu_write(cpu_tlbstate.loaded_mm_asid, 0);


^ permalink raw reply	[flat|nested] 85+ messages in thread

* [patch V181 18/54] x86/mm: Remove hard-coded ASID limit checks
  2017-12-20 21:35 [patch V181 00/54] x86/pti: Final XMAS release Thomas Gleixner
@ 2017-12-20 21:35   ` Thomas Gleixner
  2017-12-20 21:35 ` [patch V181 02/54] x86/mm/dump_pagetables: Check PAGE_PRESENT for real Thomas Gleixner
                     ` (55 subsequent siblings)
  56 siblings, 0 replies; 85+ messages in thread
From: Thomas Gleixner @ 2017-12-20 21:35 UTC (permalink / raw)
  To: LKML
  Cc: x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, keescook, hughd,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Rik van Riel,
	Boris Ostrovsky, Juergen Gross, David Laight, Eduardo Valentin,
	aliguori, Will Deacon, Vlastimil Babka, daniel.gruss,
	Dave Hansen, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	linux-mm

[-- Attachment #1: 0048-x86-mm-Remove-hard-coded-ASID-limit-checks.patch --]
[-- Type: text/plain, Size: 2723 bytes --]

From: Dave Hansen <dave.hansen@linux.intel.com>

First, it's nice to remove the magic numbers.

Second, PAGE_TABLE_ISOLATION is going to consume half of the available ASID
space.  The space is currently unused, but add a comment to spell out this
new restriction.
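
As a rough sanity check of the macro arithmetic (a standalone compile test,
not part of the patch): with 12 hardware bits and PTI consuming none of
them yet, the new limit works out to the old hard-coded 4094.

	#define CR3_HW_ASID_BITS	12
	#define PTI_CONSUMED_ASID_BITS	0
	#define CR3_AVAIL_ASID_BITS (CR3_HW_ASID_BITS - PTI_CONSUMED_ASID_BITS)
	#define MAX_ASID_AVAILABLE ((1 << CR3_AVAIL_ASID_BITS) - 2)

	/* (1 << 12) - 2 == 4094, matching the literal being replaced; once
	 * PTI consumes one bit this drops to (1 << 11) - 2 == 2046. */
	_Static_assert(MAX_ASID_AVAILABLE == 4094, "matches the old limit");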

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: David Laight <David.Laight@aculab.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Eduardo Valentin <eduval@amazon.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: aliguori@amazon.com
Cc: daniel.gruss@iaik.tugraz.at
Cc: hughd@google.com
Cc: keescook@google.com
Cc: linux-mm@kvack.org
---
 arch/x86/include/asm/tlbflush.h |   20 ++++++++++++++++++--
 1 file changed, 18 insertions(+), 2 deletions(-)

--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -69,6 +69,22 @@ static inline u64 inc_mm_tlb_gen(struct
 	return atomic64_inc_return(&mm->context.tlb_gen);
 }
 
+/* There are 12 bits of space for ASIDS in CR3 */
+#define CR3_HW_ASID_BITS		12
+/*
+ * When enabled, PAGE_TABLE_ISOLATION consumes a single bit for
+ * user/kernel switches
+ */
+#define PTI_CONSUMED_ASID_BITS		0
+
+#define CR3_AVAIL_ASID_BITS (CR3_HW_ASID_BITS - PTI_CONSUMED_ASID_BITS)
+/*
+ * ASIDs are zero-based: 0->MAX_AVAIL_ASID are valid.  -1 below to account
+ * for them being zero-based.  Another -1 is because ASID 0 is reserved for
+ * use by non-PCID-aware users.
+ */
+#define MAX_ASID_AVAILABLE ((1 << CR3_AVAIL_ASID_BITS) - 2)
+
 /*
  * If PCID is on, ASID-aware code paths put the ASID+1 into the PCID bits.
  * This serves two purposes.  It prevents a nasty situation in which
@@ -81,7 +97,7 @@ struct pgd_t;
 static inline unsigned long build_cr3(pgd_t *pgd, u16 asid)
 {
 	if (static_cpu_has(X86_FEATURE_PCID)) {
-		VM_WARN_ON_ONCE(asid > 4094);
+		VM_WARN_ON_ONCE(asid > MAX_ASID_AVAILABLE);
 		return __sme_pa(pgd) | (asid + 1);
 	} else {
 		VM_WARN_ON_ONCE(asid != 0);
@@ -91,7 +107,7 @@ static inline unsigned long build_cr3(pg
 
 static inline unsigned long build_cr3_noflush(pgd_t *pgd, u16 asid)
 {
-	VM_WARN_ON_ONCE(asid > 4094);
+	VM_WARN_ON_ONCE(asid > MAX_ASID_AVAILABLE);
 	return __sme_pa(pgd) | (asid + 1) | CR3_NOFLUSH;
 }
 

^ permalink raw reply	[flat|nested] 85+ messages in thread

* [patch V181 19/54] x86/mm: Put MMU to hardware ASID translation in one place
  2017-12-20 21:35 [patch V181 00/54] x86/pti: Final XMAS release Thomas Gleixner
@ 2017-12-20 21:35   ` Thomas Gleixner
  2017-12-20 21:35 ` [patch V181 02/54] x86/mm/dump_pagetables: Check PAGE_PRESENT for real Thomas Gleixner
                     ` (55 subsequent siblings)
  56 siblings, 0 replies; 85+ messages in thread
From: Thomas Gleixner @ 2017-12-20 21:35 UTC (permalink / raw)
  To: LKML
  Cc: x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, keescook, hughd,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Rik van Riel,
	Boris Ostrovsky, Juergen Gross, David Laight, Eduardo Valentin,
	aliguori, Will Deacon, Vlastimil Babka, daniel.gruss,
	Dave Hansen, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	linux-mm

[-- Attachment #1: 0049-x86-mm-Put-MMU-to-hardware-ASID-translation-in-one-p.patch --]
[-- Type: text/plain, Size: 3383 bytes --]

From: Dave Hansen <dave.hansen@linux.intel.com>

There are effectively two ASID types:

 1. The one stored in the mmu_context that goes from 0..5
 2. The one programmed into the hardware that goes from 1..6

This consolidates the locations where the conversion between the two (done
by adding 1) happens into a single place, which gives us a nice spot for a
comment.  PAGE_TABLE_ISOLATION will also need to know, given an ASID, which
hardware ASID to flush for the userspace mapping.
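
A minimal userspace sketch of that mapping (illustrative only, mirroring
the kern_pcid() helper added below):

	#include <stdio.h>
	#include <stdint.h>

	/* Mirror of kern_pcid(): hardware PCIDs are the mm ASIDs shifted
	 * up by one, so mm ASID 0 never lands in hardware PCID 0. */
	static uint16_t mock_kern_pcid(uint16_t asid)
	{
		return asid + 1;
	}

	int main(void)
	{
		for (uint16_t asid = 0; asid <= 5; asid++)
			printf("mm ASID %u -> hw PCID %u\n",
			       asid, mock_kern_pcid(asid));
		return 0;
	}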

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: David Laight <David.Laight@aculab.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Eduardo Valentin <eduval@amazon.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: aliguori@amazon.com
Cc: daniel.gruss@iaik.tugraz.at
Cc: hughd@google.com
Cc: keescook@google.com
Cc: linux-mm@kvack.org
---
 arch/x86/include/asm/tlbflush.h |   29 ++++++++++++++++++-----------
 1 file changed, 18 insertions(+), 11 deletions(-)

--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -85,20 +85,26 @@ static inline u64 inc_mm_tlb_gen(struct
  */
 #define MAX_ASID_AVAILABLE ((1 << CR3_AVAIL_ASID_BITS) - 2)
 
-/*
- * If PCID is on, ASID-aware code paths put the ASID+1 into the PCID bits.
- * This serves two purposes.  It prevents a nasty situation in which
- * PCID-unaware code saves CR3, loads some other value (with PCID == 0),
- * and then restores CR3, thus corrupting the TLB for ASID 0 if the saved
- * ASID was nonzero.  It also means that any bugs involving loading a
- * PCID-enabled CR3 with CR4.PCIDE off will trigger deterministically.
- */
+static inline u16 kern_pcid(u16 asid)
+{
+	VM_WARN_ON_ONCE(asid > MAX_ASID_AVAILABLE);
+	/*
+	 * If PCID is on, ASID-aware code paths put the ASID+1 into the
+	 * PCID bits.  This serves two purposes.  It prevents a nasty
+	 * situation in which PCID-unaware code saves CR3, loads some other
+	 * value (with PCID == 0), and then restores CR3, thus corrupting
+	 * the TLB for ASID 0 if the saved ASID was nonzero.  It also means
+	 * that any bugs involving loading a PCID-enabled CR3 with
+	 * CR4.PCIDE off will trigger deterministically.
+	 */
+	return asid + 1;
+}
+
 struct pgd_t;
 static inline unsigned long build_cr3(pgd_t *pgd, u16 asid)
 {
 	if (static_cpu_has(X86_FEATURE_PCID)) {
-		VM_WARN_ON_ONCE(asid > MAX_ASID_AVAILABLE);
-		return __sme_pa(pgd) | (asid + 1);
+		return __sme_pa(pgd) | kern_pcid(asid);
 	} else {
 		VM_WARN_ON_ONCE(asid != 0);
 		return __sme_pa(pgd);
@@ -108,7 +114,8 @@ static inline unsigned long build_cr3(pg
 static inline unsigned long build_cr3_noflush(pgd_t *pgd, u16 asid)
 {
 	VM_WARN_ON_ONCE(asid > MAX_ASID_AVAILABLE);
-	return __sme_pa(pgd) | (asid + 1) | CR3_NOFLUSH;
+	VM_WARN_ON_ONCE(!this_cpu_has(X86_FEATURE_PCID));
+	return __sme_pa(pgd) | kern_pcid(asid) | CR3_NOFLUSH;
 }
 
 #ifdef CONFIG_PARAVIRT

^ permalink raw reply	[flat|nested] 85+ messages in thread

* [patch V181 20/54] x86/mm: Create asm/invpcid.h
  2017-12-20 21:35 [patch V181 00/54] x86/pti: Final XMAS release Thomas Gleixner
@ 2017-12-20 21:35   ` Thomas Gleixner
  2017-12-20 21:35 ` [patch V181 02/54] x86/mm/dump_pagetables: Check PAGE_PRESENT for real Thomas Gleixner
                     ` (55 subsequent siblings)
  56 siblings, 0 replies; 85+ messages in thread
From: Thomas Gleixner @ 2017-12-20 21:35 UTC (permalink / raw)
  To: LKML
  Cc: x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, keescook, hughd,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Rik van Riel,
	Boris Ostrovsky, Juergen Gross, David Laight, Eduardo Valentin,
	aliguori, Will Deacon, Vlastimil Babka, daniel.gruss,
	Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin,
	linux-mm

[-- Attachment #1: 0062-x86-mm-Create-asm-invpcid.h.patch --]
[-- Type: text/plain, Size: 4614 bytes --]

From: Peter Zijlstra <peterz@infradead.org>

Unclutter tlbflush.h a little.
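
For orientation, the four wrappers map one-to-one onto the architectural
INVPCID descriptor types. A hedged sketch of hypothetical call sites
(kernel context assumed; 'pcid' and 'addr' are placeholders):

	/* Hypothetical call sites for the helpers moved into invpcid.h */
	invpcid_flush_one(pcid, addr);       /* one address in one PCID      */
	invpcid_flush_single_context(pcid);  /* whole PCID, globals excluded */
	invpcid_flush_all();                 /* everything, globals included */
	invpcid_flush_all_nonglobals();      /* all PCIDs, globals excluded  */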

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Laight <David.Laight@aculab.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Eduardo Valentin <eduval@amazon.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: aliguori@amazon.com
Cc: daniel.gruss@iaik.tugraz.at
Cc: hughd@google.com
Cc: keescook@google.com
Cc: linux-mm@kvack.org
---
 arch/x86/include/asm/invpcid.h  |   53 ++++++++++++++++++++++++++++++++++++++++
 arch/x86/include/asm/tlbflush.h |   49 ------------------------------------
 2 files changed, 54 insertions(+), 48 deletions(-)

--- /dev/null
+++ b/arch/x86/include/asm/invpcid.h
@@ -0,0 +1,53 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_X86_INVPCID
+#define _ASM_X86_INVPCID
+
+static inline void __invpcid(unsigned long pcid, unsigned long addr,
+			     unsigned long type)
+{
+	struct { u64 d[2]; } desc = { { pcid, addr } };
+
+	/*
+	 * The memory clobber is because the whole point is to invalidate
+	 * stale TLB entries and, especially if we're flushing global
+	 * mappings, we don't want the compiler to reorder any subsequent
+	 * memory accesses before the TLB flush.
+	 *
+	 * The hex opcode is invpcid (%ecx), %eax in 32-bit mode and
+	 * invpcid (%rcx), %rax in long mode.
+	 */
+	asm volatile (".byte 0x66, 0x0f, 0x38, 0x82, 0x01"
+		      : : "m" (desc), "a" (type), "c" (&desc) : "memory");
+}
+
+#define INVPCID_TYPE_INDIV_ADDR		0
+#define INVPCID_TYPE_SINGLE_CTXT	1
+#define INVPCID_TYPE_ALL_INCL_GLOBAL	2
+#define INVPCID_TYPE_ALL_NON_GLOBAL	3
+
+/* Flush all mappings for a given pcid and addr, not including globals. */
+static inline void invpcid_flush_one(unsigned long pcid,
+				     unsigned long addr)
+{
+	__invpcid(pcid, addr, INVPCID_TYPE_INDIV_ADDR);
+}
+
+/* Flush all mappings for a given PCID, not including globals. */
+static inline void invpcid_flush_single_context(unsigned long pcid)
+{
+	__invpcid(pcid, 0, INVPCID_TYPE_SINGLE_CTXT);
+}
+
+/* Flush all mappings, including globals, for all PCIDs. */
+static inline void invpcid_flush_all(void)
+{
+	__invpcid(0, 0, INVPCID_TYPE_ALL_INCL_GLOBAL);
+}
+
+/* Flush all mappings for all PCIDs except globals. */
+static inline void invpcid_flush_all_nonglobals(void)
+{
+	__invpcid(0, 0, INVPCID_TYPE_ALL_NON_GLOBAL);
+}
+
+#endif /* _ASM_X86_INVPCID */
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -9,54 +9,7 @@
 #include <asm/cpufeature.h>
 #include <asm/special_insns.h>
 #include <asm/smp.h>
-
-static inline void __invpcid(unsigned long pcid, unsigned long addr,
-			     unsigned long type)
-{
-	struct { u64 d[2]; } desc = { { pcid, addr } };
-
-	/*
-	 * The memory clobber is because the whole point is to invalidate
-	 * stale TLB entries and, especially if we're flushing global
-	 * mappings, we don't want the compiler to reorder any subsequent
-	 * memory accesses before the TLB flush.
-	 *
-	 * The hex opcode is invpcid (%ecx), %eax in 32-bit mode and
-	 * invpcid (%rcx), %rax in long mode.
-	 */
-	asm volatile (".byte 0x66, 0x0f, 0x38, 0x82, 0x01"
-		      : : "m" (desc), "a" (type), "c" (&desc) : "memory");
-}
-
-#define INVPCID_TYPE_INDIV_ADDR		0
-#define INVPCID_TYPE_SINGLE_CTXT	1
-#define INVPCID_TYPE_ALL_INCL_GLOBAL	2
-#define INVPCID_TYPE_ALL_NON_GLOBAL	3
-
-/* Flush all mappings for a given pcid and addr, not including globals. */
-static inline void invpcid_flush_one(unsigned long pcid,
-				     unsigned long addr)
-{
-	__invpcid(pcid, addr, INVPCID_TYPE_INDIV_ADDR);
-}
-
-/* Flush all mappings for a given PCID, not including globals. */
-static inline void invpcid_flush_single_context(unsigned long pcid)
-{
-	__invpcid(pcid, 0, INVPCID_TYPE_SINGLE_CTXT);
-}
-
-/* Flush all mappings, including globals, for all PCIDs. */
-static inline void invpcid_flush_all(void)
-{
-	__invpcid(0, 0, INVPCID_TYPE_ALL_INCL_GLOBAL);
-}
-
-/* Flush all mappings for all PCIDs except globals. */
-static inline void invpcid_flush_all_nonglobals(void)
-{
-	__invpcid(0, 0, INVPCID_TYPE_ALL_NON_GLOBAL);
-}
+#include <asm/invpcid.h>
 
 static inline u64 inc_mm_tlb_gen(struct mm_struct *mm)
 {

^ permalink raw reply	[flat|nested] 85+ messages in thread

* [patch V181 21/54] x86/cpu_entry_area: Move it to a separate unit
  2017-12-20 21:35 [patch V181 00/54] x86/pti: Final XMAS release Thomas Gleixner
                   ` (19 preceding siblings ...)
  2017-12-20 21:35   ` Thomas Gleixner
@ 2017-12-20 21:35 ` Thomas Gleixner
  2017-12-20 22:29   ` Thomas Gleixner
  2017-12-20 21:35 ` [patch V181 22/54] x86/cpu_entry_area: Move it out of fixmap Thomas Gleixner
                   ` (35 subsequent siblings)
  56 siblings, 1 reply; 85+ messages in thread
From: Thomas Gleixner @ 2017-12-20 21:35 UTC (permalink / raw)
  To: LKML
  Cc: x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, keescook, hughd,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Rik van Riel,
	Boris Ostrovsky, Juergen Gross, David Laight, Eduardo Valentin,
	aliguori, Will Deacon, Vlastimil Babka, daniel.gruss

[-- Attachment #1: x86-cpu_entry_area--Move-it-to-a-separate-unit.patch --]
[-- Type: text/plain, Size: 11860 bytes --]

Separate the cpu_entry_area code out of cpu/common.c and the fixmap.
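
A simplified stand-in for the invariant the moved code relies on (a
hypothetical userspace compile test; the real field types are replaced
with page-sized char arrays):

	#include <assert.h>

	#define PAGE_SIZE 4096

	/* Every member of the real struct cpu_entry_area aliases whole,
	 * page-aligned backing pages; the mock keeps that property. */
	struct mock_cpu_entry_area {
		char gdt[PAGE_SIZE];
		char entry_stack_page[PAGE_SIZE];
		char tss[3 * PAGE_SIZE];	/* size illustrative only */
		char entry_trampoline[PAGE_SIZE];
	};

	static_assert(sizeof(struct mock_cpu_entry_area) % PAGE_SIZE == 0,
		      "same invariant as the BUILD_BUG_ON in the patch");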

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/include/asm/cpu_entry_area.h |   52 +++++++++++++++++
 arch/x86/include/asm/fixmap.h         |   41 -------------
 arch/x86/kernel/cpu/common.c          |   94 -------------------------------
 arch/x86/kernel/traps.c               |    1 
 arch/x86/mm/Makefile                  |    2 
 arch/x86/mm/cpu_entry_area.c          |  102 ++++++++++++++++++++++++++++++++++
 6 files changed, 157 insertions(+), 135 deletions(-)

--- /dev/null
+++ b/arch/x86/include/asm/cpu_entry_area.h
@@ -0,0 +1,52 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#ifndef _ASM_X86_CPU_ENTRY_AREA_H
+#define _ASM_X86_CPU_ENTRY_AREA_H
+
+#include <linux/percpu-defs.h>
+#include <asm/processor.h>
+
+/*
+ * cpu_entry_area is a percpu region that contains things needed by the CPU
+ * and early entry/exit code.  Real types aren't used for all fields here
+ * to avoid circular header dependencies.
+ *
+ * Every field is a virtual alias of some other allocated backing store.
+ * There is no direct allocation of a struct cpu_entry_area.
+ */
+struct cpu_entry_area {
+	char gdt[PAGE_SIZE];
+
+	/*
+	 * The GDT is just below entry_stack and thus serves (on x86_64) as
+	 * a read-only guard page.
+	 */
+	struct entry_stack_page entry_stack_page;
+
+	/*
+	 * On x86_64, the TSS is mapped RO.  On x86_32, it's mapped RW because
+	 * we need task switches to work, and task switches write to the TSS.
+	 */
+	struct tss_struct tss;
+
+	char entry_trampoline[PAGE_SIZE];
+
+#ifdef CONFIG_X86_64
+	/*
+	 * Exception stacks used for IST entries.
+	 *
+	 * In the future, this should have a separate slot for each stack
+	 * with guard pages between them.
+	 */
+	char exception_stacks[(N_EXCEPTION_STACKS - 1) * EXCEPTION_STKSZ + DEBUG_STKSZ];
+#endif
+};
+
+#define CPU_ENTRY_AREA_SIZE	(sizeof(struct cpu_entry_area))
+#define CPU_ENTRY_AREA_PAGES	(CPU_ENTRY_AREA_SIZE / PAGE_SIZE)
+
+DECLARE_PER_CPU(struct cpu_entry_area *, cpu_entry_area);
+
+extern void setup_cpu_entry_areas(void);
+
+#endif
--- a/arch/x86/include/asm/fixmap.h
+++ b/arch/x86/include/asm/fixmap.h
@@ -25,6 +25,7 @@
 #else
 #include <uapi/asm/vsyscall.h>
 #endif
+#include <asm/cpu_entry_area.h>
 
 /*
  * We can't declare FIXADDR_TOP as variable for x86_64 because vsyscall
@@ -45,46 +46,6 @@ extern unsigned long __FIXADDR_TOP;
 #endif
 
 /*
- * cpu_entry_area is a percpu region in the fixmap that contains things
- * needed by the CPU and early entry/exit code.  Real types aren't used
- * for all fields here to avoid circular header dependencies.
- *
- * Every field is a virtual alias of some other allocated backing store.
- * There is no direct allocation of a struct cpu_entry_area.
- */
-struct cpu_entry_area {
-	char gdt[PAGE_SIZE];
-
-	/*
-	 * The GDT is just below entry_stack and thus serves (on x86_64) as
-	 * a a read-only guard page.
-	 */
-	struct entry_stack_page entry_stack_page;
-
-	/*
-	 * On x86_64, the TSS is mapped RO.  On x86_32, it's mapped RW because
-	 * we need task switches to work, and task switches write to the TSS.
-	 */
-	struct tss_struct tss;
-
-	char entry_trampoline[PAGE_SIZE];
-
-#ifdef CONFIG_X86_64
-	/*
-	 * Exception stacks used for IST entries.
-	 *
-	 * In the future, this should have a separate slot for each stack
-	 * with guard pages between them.
-	 */
-	char exception_stacks[(N_EXCEPTION_STACKS - 1) * EXCEPTION_STKSZ + DEBUG_STKSZ];
-#endif
-};
-
-#define CPU_ENTRY_AREA_PAGES (sizeof(struct cpu_entry_area) / PAGE_SIZE)
-
-extern void setup_cpu_entry_areas(void);
-
-/*
  * Here we define all the compile-time 'special' virtual
  * addresses. The point is to have a constant address at
  * compile time, but to set the physical address only
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -482,102 +482,8 @@ static const unsigned int exception_stac
 	  [0 ... N_EXCEPTION_STACKS - 1]	= EXCEPTION_STKSZ,
 	  [DEBUG_STACK - 1]			= DEBUG_STKSZ
 };
-
-static DEFINE_PER_CPU_PAGE_ALIGNED(char, exception_stacks
-	[(N_EXCEPTION_STACKS - 1) * EXCEPTION_STKSZ + DEBUG_STKSZ]);
-#endif
-
-static DEFINE_PER_CPU_PAGE_ALIGNED(struct entry_stack_page,
-				   entry_stack_storage);
-
-static void __init
-set_percpu_fixmap_pages(int idx, void *ptr, int pages, pgprot_t prot)
-{
-	for ( ; pages; pages--, idx--, ptr += PAGE_SIZE)
-		__set_fixmap(idx, per_cpu_ptr_to_phys(ptr), prot);
-}
-
-/* Setup the fixmap mappings only once per-processor */
-static void __init setup_cpu_entry_area(int cpu)
-{
-#ifdef CONFIG_X86_64
-	extern char _entry_trampoline[];
-
-	/* On 64-bit systems, we use a read-only fixmap GDT and TSS. */
-	pgprot_t gdt_prot = PAGE_KERNEL_RO;
-	pgprot_t tss_prot = PAGE_KERNEL_RO;
-#else
-	/*
-	 * On native 32-bit systems, the GDT cannot be read-only because
-	 * our double fault handler uses a task gate, and entering through
-	 * a task gate needs to change an available TSS to busy.  If the
-	 * GDT is read-only, that will triple fault.  The TSS cannot be
-	 * read-only because the CPU writes to it on task switches.
-	 *
-	 * On Xen PV, the GDT must be read-only because the hypervisor
-	 * requires it.
-	 */
-	pgprot_t gdt_prot = boot_cpu_has(X86_FEATURE_XENPV) ?
-		PAGE_KERNEL_RO : PAGE_KERNEL;
-	pgprot_t tss_prot = PAGE_KERNEL;
 #endif
 
-	__set_fixmap(get_cpu_entry_area_index(cpu, gdt), get_cpu_gdt_paddr(cpu), gdt_prot);
-	set_percpu_fixmap_pages(get_cpu_entry_area_index(cpu, entry_stack_page),
-				per_cpu_ptr(&entry_stack_storage, cpu), 1,
-				PAGE_KERNEL);
-
-	/*
-	 * The Intel SDM says (Volume 3, 7.2.1):
-	 *
-	 *  Avoid placing a page boundary in the part of the TSS that the
-	 *  processor reads during a task switch (the first 104 bytes). The
-	 *  processor may not correctly perform address translations if a
-	 *  boundary occurs in this area. During a task switch, the processor
-	 *  reads and writes into the first 104 bytes of each TSS (using
-	 *  contiguous physical addresses beginning with the physical address
-	 *  of the first byte of the TSS). So, after TSS access begins, if
-	 *  part of the 104 bytes is not physically contiguous, the processor
-	 *  will access incorrect information without generating a page-fault
-	 *  exception.
-	 *
-	 * There are also a lot of errata involving the TSS spanning a page
-	 * boundary.  Assert that we're not doing that.
-	 */
-	BUILD_BUG_ON((offsetof(struct tss_struct, x86_tss) ^
-		      offsetofend(struct tss_struct, x86_tss)) & PAGE_MASK);
-	BUILD_BUG_ON(sizeof(struct tss_struct) % PAGE_SIZE != 0);
-	set_percpu_fixmap_pages(get_cpu_entry_area_index(cpu, tss),
-				&per_cpu(cpu_tss_rw, cpu),
-				sizeof(struct tss_struct) / PAGE_SIZE,
-				tss_prot);
-
-#ifdef CONFIG_X86_32
-	per_cpu(cpu_entry_area, cpu) = get_cpu_entry_area(cpu);
-#endif
-
-#ifdef CONFIG_X86_64
-	BUILD_BUG_ON(sizeof(exception_stacks) % PAGE_SIZE != 0);
-	BUILD_BUG_ON(sizeof(exception_stacks) !=
-		     sizeof(((struct cpu_entry_area *)0)->exception_stacks));
-	set_percpu_fixmap_pages(get_cpu_entry_area_index(cpu, exception_stacks),
-				&per_cpu(exception_stacks, cpu),
-				sizeof(exception_stacks) / PAGE_SIZE,
-				PAGE_KERNEL);
-
-	__set_fixmap(get_cpu_entry_area_index(cpu, entry_trampoline),
-		     __pa_symbol(_entry_trampoline), PAGE_KERNEL_RX);
-#endif
-}
-
-void __init setup_cpu_entry_areas(void)
-{
-	unsigned int cpu;
-
-	for_each_possible_cpu(cpu)
-		setup_cpu_entry_area(cpu);
-}
-
 /* Load the original GDT from the per-cpu structure */
 void load_direct_gdt(int cpu)
 {
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -60,6 +60,7 @@
 #include <asm/trace/mpx.h>
 #include <asm/mpx.h>
 #include <asm/vm86.h>
+#include <asm/cpu_entry_area.h>
 
 #ifdef CONFIG_X86_64
 #include <asm/x86_init.h>
--- a/arch/x86/mm/Makefile
+++ b/arch/x86/mm/Makefile
@@ -10,7 +10,7 @@ CFLAGS_REMOVE_mem_encrypt.o	= -pg
 endif
 
 obj-y	:=  init.o init_$(BITS).o fault.o ioremap.o extable.o pageattr.o mmap.o \
-	    pat.o pgtable.o physaddr.o setup_nx.o tlb.o
+	    pat.o pgtable.o physaddr.o setup_nx.o tlb.o cpu_entry_area.o
 
 # Make sure __phys_addr has no stackprotector
 nostackp := $(call cc-option, -fno-stack-protector)
--- /dev/null
+++ b/arch/x86/mm/cpu_entry_area.c
@@ -0,0 +1,102 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#include <linux/percpu.h>
+#include <asm/cpu_entry_area.h>
+#include <asm/pgtable.h>
+#include <asm/fixmap.h>
+#include <asm/desc.h>
+
+static DEFINE_PER_CPU_PAGE_ALIGNED(struct entry_stack_page, entry_stack_storage);
+
+#ifdef CONFIG_X86_64
+static DEFINE_PER_CPU_PAGE_ALIGNED(char, exception_stacks
+	[(N_EXCEPTION_STACKS - 1) * EXCEPTION_STKSZ + DEBUG_STKSZ]);
+#endif
+
+static void __init
+set_percpu_fixmap_pages(int idx, void *ptr, int pages, pgprot_t prot)
+{
+	for ( ; pages; pages--, idx--, ptr += PAGE_SIZE)
+		__set_fixmap(idx, per_cpu_ptr_to_phys(ptr), prot);
+}
+
+/* Setup the fixmap mappings only once per-processor */
+static void __init setup_cpu_entry_area(int cpu)
+{
+#ifdef CONFIG_X86_64
+	extern char _entry_trampoline[];
+
+	/* On 64-bit systems, we use a read-only fixmap GDT and TSS. */
+	pgprot_t gdt_prot = PAGE_KERNEL_RO;
+	pgprot_t tss_prot = PAGE_KERNEL_RO;
+#else
+	/*
+	 * On native 32-bit systems, the GDT cannot be read-only because
+	 * our double fault handler uses a task gate, and entering through
+	 * a task gate needs to change an available TSS to busy.  If the
+	 * GDT is read-only, that will triple fault.  The TSS cannot be
+	 * read-only because the CPU writes to it on task switches.
+	 *
+	 * On Xen PV, the GDT must be read-only because the hypervisor
+	 * requires it.
+	 */
+	pgprot_t gdt_prot = boot_cpu_has(X86_FEATURE_XENPV) ?
+		PAGE_KERNEL_RO : PAGE_KERNEL;
+	pgprot_t tss_prot = PAGE_KERNEL;
+#endif
+
+	__set_fixmap(get_cpu_entry_area_index(cpu, gdt), get_cpu_gdt_paddr(cpu), gdt_prot);
+	set_percpu_fixmap_pages(get_cpu_entry_area_index(cpu, entry_stack_page),
+				per_cpu_ptr(&entry_stack_storage, cpu), 1,
+				PAGE_KERNEL);
+
+	/*
+	 * The Intel SDM says (Volume 3, 7.2.1):
+	 *
+	 *  Avoid placing a page boundary in the part of the TSS that the
+	 *  processor reads during a task switch (the first 104 bytes). The
+	 *  processor may not correctly perform address translations if a
+	 *  boundary occurs in this area. During a task switch, the processor
+	 *  reads and writes into the first 104 bytes of each TSS (using
+	 *  contiguous physical addresses beginning with the physical address
+	 *  of the first byte of the TSS). So, after TSS access begins, if
+	 *  part of the 104 bytes is not physically contiguous, the processor
+	 *  will access incorrect information without generating a page-fault
+	 *  exception.
+	 *
+	 * There are also a lot of errata involving the TSS spanning a page
+	 * boundary.  Assert that we're not doing that.
+	 */
+	BUILD_BUG_ON((offsetof(struct tss_struct, x86_tss) ^
+		      offsetofend(struct tss_struct, x86_tss)) & PAGE_MASK);
+	BUILD_BUG_ON(sizeof(struct tss_struct) % PAGE_SIZE != 0);
+	set_percpu_fixmap_pages(get_cpu_entry_area_index(cpu, tss),
+				&per_cpu(cpu_tss_rw, cpu),
+				sizeof(struct tss_struct) / PAGE_SIZE,
+				tss_prot);
+
+#ifdef CONFIG_X86_32
+	per_cpu(cpu_entry_area, cpu) = get_cpu_entry_area(cpu);
+#endif
+
+#ifdef CONFIG_X86_64
+	BUILD_BUG_ON(sizeof(exception_stacks) % PAGE_SIZE != 0);
+	BUILD_BUG_ON(sizeof(exception_stacks) !=
+		     sizeof(((struct cpu_entry_area *)0)->exception_stacks));
+	set_percpu_fixmap_pages(get_cpu_entry_area_index(cpu, exception_stacks),
+				&per_cpu(exception_stacks, cpu),
+				sizeof(exception_stacks) / PAGE_SIZE,
+				PAGE_KERNEL);
+
+	__set_fixmap(get_cpu_entry_area_index(cpu, entry_trampoline),
+		     __pa_symbol(_entry_trampoline), PAGE_KERNEL_RX);
+#endif
+}
+
+void __init setup_cpu_entry_areas(void)
+{
+	unsigned int cpu;
+
+	for_each_possible_cpu(cpu)
+		setup_cpu_entry_area(cpu);
+}

^ permalink raw reply	[flat|nested] 85+ messages in thread

* [patch V181 22/54] x86/cpu_entry_area: Move it out of fixmap
  2017-12-20 21:35 [patch V181 00/54] x86/pti: Final XMAS release Thomas Gleixner
                   ` (20 preceding siblings ...)
  2017-12-20 21:35 ` [patch V181 21/54] x86/cpu_entry_area: Move it to a separate unit Thomas Gleixner
@ 2017-12-20 21:35 ` Thomas Gleixner
  2017-12-22  2:46   ` [V181,22/54] " Andrei Vagin
  2017-12-20 21:35 ` [patch V181 23/54] init: Invoke init_espfix_bsp() from mm_init() Thomas Gleixner
                   ` (34 subsequent siblings)
  56 siblings, 1 reply; 85+ messages in thread
From: Thomas Gleixner @ 2017-12-20 21:35 UTC (permalink / raw)
  To: LKML
  Cc: x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, keescook, hughd,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Rik van Riel,
	Boris Ostrovsky, Juergen Gross, David Laight, Eduardo Valentin,
	aliguori, Will Deacon, Vlastimil Babka, daniel.gruss

[-- Attachment #1: x86-cpu_entry_area--Move-it-out-of-fixmap.patch --]
[-- Type: text/plain, Size: 17468 bytes --]

Put the cpu_entry_area into a separate P4D entry. The fixmap gets too big,
and 0-day already hit a case where the fixmap PTEs were cleared by
cleanup_highmap().

Aside from that, the fixmap API is a pain as it's all backwards.
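
The new base falls out of simple arithmetic; a runnable userspace check
(not from the patch; P4D_SHIFT is 39 on x86-64, and a 64-bit unsigned
long is assumed):

	#include <stdio.h>

	int main(void)
	{
		/* CPU_ENTRY_AREA_PGD (-3), taken modulo 2^64 and shifted by
		 * P4D_SHIFT, gives the base; one slot spans 2^39 bytes. */
		unsigned long base = -3UL << 39;
		unsigned long end  = base + (1UL << 39) - 1;

		printf("%#lx - %#lx\n", base, end);
		/* prints "0xfffffe8000000000 - 0xfffffeffffffffff",
		 * matching the new entries in mm.txt */
		return 0;
	}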

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 Documentation/x86/x86_64/mm.txt         |    2 +
 arch/x86/include/asm/cpu_entry_area.h   |   24 ++++++++++++-
 arch/x86/include/asm/desc.h             |    1 
 arch/x86/include/asm/fixmap.h           |   32 -----------------
 arch/x86/include/asm/pgtable_32_types.h |   15 ++++++--
 arch/x86/include/asm/pgtable_64_types.h |   47 +++++++++++++++-----------
 arch/x86/kernel/dumpstack.c             |    1 
 arch/x86/kernel/traps.c                 |    5 +-
 arch/x86/mm/cpu_entry_area.c            |   57 +++++++++++++++++++++++---------
 arch/x86/mm/dump_pagetables.c           |    6 ++-
 arch/x86/mm/init_32.c                   |    6 +++
 arch/x86/mm/kasan_init_64.c             |    6 ++-
 arch/x86/mm/pgtable_32.c                |    1 
 arch/x86/xen/mmu_pv.c                   |    2 -
 14 files changed, 128 insertions(+), 77 deletions(-)

--- a/Documentation/x86/x86_64/mm.txt
+++ b/Documentation/x86/x86_64/mm.txt
@@ -12,6 +12,7 @@ ffffea0000000000 - ffffeaffffffffff (=40
 ... unused hole ...
 ffffec0000000000 - fffffbffffffffff (=44 bits) kasan shadow memory (16TB)
 ... unused hole ...
+fffffe8000000000 - fffffeffffffffff (=39 bits) cpu_entry_area mapping
 ffffff0000000000 - ffffff7fffffffff (=39 bits) %esp fixup stacks
 ... unused hole ...
 ffffffef00000000 - fffffffeffffffff (=64 GB) EFI region mapping space
@@ -35,6 +36,7 @@ ffd4000000000000 - ffd5ffffffffffff (=49
 ... unused hole ...
 ffdf000000000000 - fffffc0000000000 (=53 bits) kasan shadow memory (8PB)
 ... unused hole ...
+fffffe8000000000 - fffffeffffffffff (=39 bits) cpu_entry_area mapping
 ffffff0000000000 - ffffff7fffffffff (=39 bits) %esp fixup stacks
 ... unused hole ...
 ffffffef00000000 - fffffffeffffffff (=64 GB) EFI region mapping space
--- a/arch/x86/include/asm/cpu_entry_area.h
+++ b/arch/x86/include/asm/cpu_entry_area.h
@@ -43,10 +43,32 @@ struct cpu_entry_area {
 };
 
 #define CPU_ENTRY_AREA_SIZE	(sizeof(struct cpu_entry_area))
-#define CPU_ENTRY_AREA_PAGES	(CPU_ENTRY_AREA_SIZE / PAGE_SIZE)
+#define CPU_ENTRY_AREA_TOT_SIZE	(CPU_ENTRY_AREA_SIZE * NR_CPUS)
 
 DECLARE_PER_CPU(struct cpu_entry_area *, cpu_entry_area);
 
 extern void setup_cpu_entry_areas(void);
+extern void cea_set_pte(void *cea_vaddr, phys_addr_t pa, pgprot_t flags);
+
+#define	CPU_ENTRY_AREA_RO_IDT		CPU_ENTRY_AREA_BASE
+#define CPU_ENTRY_AREA_PER_CPU		(CPU_ENTRY_AREA_RO_IDT + PAGE_SIZE)
+
+#define CPU_ENTRY_AREA_RO_IDT_VADDR	((void *)CPU_ENTRY_AREA_RO_IDT)
+
+#define CPU_ENTRY_AREA_MAP_SIZE			\
+	(CPU_ENTRY_AREA_PER_CPU + CPU_ENTRY_AREA_TOT_SIZE - CPU_ENTRY_AREA_BASE)
+
+static inline struct cpu_entry_area *get_cpu_entry_area(int cpu)
+{
+	unsigned long va = CPU_ENTRY_AREA_PER_CPU + cpu * CPU_ENTRY_AREA_SIZE;
+	BUILD_BUG_ON(sizeof(struct cpu_entry_area) % PAGE_SIZE != 0);
+
+	return (struct cpu_entry_area *) va;
+}
+
+static inline struct entry_stack *cpu_entry_stack(int cpu)
+{
+	return &get_cpu_entry_area(cpu)->entry_stack_page.stack;
+}
 
 #endif
--- a/arch/x86/include/asm/desc.h
+++ b/arch/x86/include/asm/desc.h
@@ -7,6 +7,7 @@
 #include <asm/mmu.h>
 #include <asm/fixmap.h>
 #include <asm/irq_vectors.h>
+#include <asm/cpu_entry_area.h>
 
 #include <linux/smp.h>
 #include <linux/percpu.h>
--- a/arch/x86/include/asm/fixmap.h
+++ b/arch/x86/include/asm/fixmap.h
@@ -25,7 +25,6 @@
 #else
 #include <uapi/asm/vsyscall.h>
 #endif
-#include <asm/cpu_entry_area.h>
 
 /*
  * We can't declare FIXADDR_TOP as variable for x86_64 because vsyscall
@@ -84,7 +83,6 @@ enum fixed_addresses {
 	FIX_IO_APIC_BASE_0,
 	FIX_IO_APIC_BASE_END = FIX_IO_APIC_BASE_0 + MAX_IO_APICS - 1,
 #endif
-	FIX_RO_IDT,	/* Virtual mapping for read-only IDT */
 #ifdef CONFIG_X86_32
 	FIX_KMAP_BEGIN,	/* reserved pte's for temporary kernel mappings */
 	FIX_KMAP_END = FIX_KMAP_BEGIN+(KM_TYPE_NR*NR_CPUS)-1,
@@ -100,9 +98,6 @@ enum fixed_addresses {
 #ifdef	CONFIG_X86_INTEL_MID
 	FIX_LNW_VRTC,
 #endif
-	/* Fixmap entries to remap the GDTs, one per processor. */
-	FIX_CPU_ENTRY_AREA_TOP,
-	FIX_CPU_ENTRY_AREA_BOTTOM = FIX_CPU_ENTRY_AREA_TOP + (CPU_ENTRY_AREA_PAGES * NR_CPUS) - 1,
 
 #ifdef CONFIG_ACPI_APEI_GHES
 	/* Used for GHES mapping from assorted contexts */
@@ -143,7 +138,7 @@ enum fixed_addresses {
 extern void reserve_top_address(unsigned long reserve);
 
 #define FIXADDR_SIZE	(__end_of_permanent_fixed_addresses << PAGE_SHIFT)
-#define FIXADDR_START		(FIXADDR_TOP - FIXADDR_SIZE)
+#define FIXADDR_START	(FIXADDR_TOP - FIXADDR_SIZE)
 
 extern int fixmaps_set;
 
@@ -191,30 +186,5 @@ void __init *early_memremap_decrypted_wp
 void __early_set_fixmap(enum fixed_addresses idx,
 			phys_addr_t phys, pgprot_t flags);
 
-static inline unsigned int __get_cpu_entry_area_page_index(int cpu, int page)
-{
-	BUILD_BUG_ON(sizeof(struct cpu_entry_area) % PAGE_SIZE != 0);
-
-	return FIX_CPU_ENTRY_AREA_BOTTOM - cpu*CPU_ENTRY_AREA_PAGES - page;
-}
-
-#define __get_cpu_entry_area_offset_index(cpu, offset) ({		\
-	BUILD_BUG_ON(offset % PAGE_SIZE != 0);				\
-	__get_cpu_entry_area_page_index(cpu, offset / PAGE_SIZE);	\
-	})
-
-#define get_cpu_entry_area_index(cpu, field)				\
-	__get_cpu_entry_area_offset_index((cpu), offsetof(struct cpu_entry_area, field))
-
-static inline struct cpu_entry_area *get_cpu_entry_area(int cpu)
-{
-	return (struct cpu_entry_area *)__fix_to_virt(__get_cpu_entry_area_page_index(cpu, 0));
-}
-
-static inline struct entry_stack *cpu_entry_stack(int cpu)
-{
-	return &get_cpu_entry_area(cpu)->entry_stack_page.stack;
-}
-
 #endif /* !__ASSEMBLY__ */
 #endif /* _ASM_X86_FIXMAP_H */
--- a/arch/x86/include/asm/pgtable_32_types.h
+++ b/arch/x86/include/asm/pgtable_32_types.h
@@ -38,13 +38,22 @@ extern bool __vmalloc_start_set; /* set
 #define LAST_PKMAP 1024
 #endif
 
-#define PKMAP_BASE ((FIXADDR_START - PAGE_SIZE * (LAST_PKMAP + 1))	\
-		    & PMD_MASK)
+/*
+ * Define this here and validate with BUILD_BUG_ON() in pgtable_32.c
+ * to avoid include recursion hell
+ */
+#define CPU_ENTRY_AREA_PAGES	(NR_CPUS * 40)
+
+#define CPU_ENTRY_AREA_BASE				\
+	((FIXADDR_START - PAGE_SIZE * (CPU_ENTRY_AREA_PAGES + 1)) & PMD_MASK)
+
+#define PKMAP_BASE		\
+	((CPU_ENTRY_AREA_BASE - PAGE_SIZE) & PMD_MASK)
 
 #ifdef CONFIG_HIGHMEM
 # define VMALLOC_END	(PKMAP_BASE - 2 * PAGE_SIZE)
 #else
-# define VMALLOC_END	(FIXADDR_START - 2 * PAGE_SIZE)
+# define VMALLOC_END	(CPU_ENTRY_AREA_BASE - 2 * PAGE_SIZE)
 #endif
 
 #define MODULES_VADDR	VMALLOC_START
--- a/arch/x86/include/asm/pgtable_64_types.h
+++ b/arch/x86/include/asm/pgtable_64_types.h
@@ -76,32 +76,41 @@ typedef struct { pteval_t pte; } pte_t;
 #define PGDIR_MASK	(~(PGDIR_SIZE - 1))
 
 /* See Documentation/x86/x86_64/mm.txt for a description of the memory map. */
-#define MAXMEM		_AC(__AC(1, UL) << MAX_PHYSMEM_BITS, UL)
+#define MAXMEM			_AC(__AC(1, UL) << MAX_PHYSMEM_BITS, UL)
+
 #ifdef CONFIG_X86_5LEVEL
-#define VMALLOC_SIZE_TB _AC(16384, UL)
-#define __VMALLOC_BASE	_AC(0xff92000000000000, UL)
-#define __VMEMMAP_BASE	_AC(0xffd4000000000000, UL)
+# define VMALLOC_SIZE_TB	_AC(16384, UL)
+# define __VMALLOC_BASE		_AC(0xff92000000000000, UL)
+# define __VMEMMAP_BASE		_AC(0xffd4000000000000, UL)
 #else
-#define VMALLOC_SIZE_TB	_AC(32, UL)
-#define __VMALLOC_BASE	_AC(0xffffc90000000000, UL)
-#define __VMEMMAP_BASE	_AC(0xffffea0000000000, UL)
+# define VMALLOC_SIZE_TB	_AC(32, UL)
+# define __VMALLOC_BASE		_AC(0xffffc90000000000, UL)
+# define __VMEMMAP_BASE		_AC(0xffffea0000000000, UL)
 #endif
+
 #ifdef CONFIG_RANDOMIZE_MEMORY
-#define VMALLOC_START	vmalloc_base
-#define VMEMMAP_START	vmemmap_base
+# define VMALLOC_START		vmalloc_base
+# define VMEMMAP_START		vmemmap_base
 #else
-#define VMALLOC_START	__VMALLOC_BASE
-#define VMEMMAP_START	__VMEMMAP_BASE
+# define VMALLOC_START		__VMALLOC_BASE
+# define VMEMMAP_START		__VMEMMAP_BASE
 #endif /* CONFIG_RANDOMIZE_MEMORY */
-#define VMALLOC_END	(VMALLOC_START + _AC((VMALLOC_SIZE_TB << 40) - 1, UL))
-#define MODULES_VADDR    (__START_KERNEL_map + KERNEL_IMAGE_SIZE)
+
+#define VMALLOC_END		(VMALLOC_START + _AC((VMALLOC_SIZE_TB << 40) - 1, UL))
+
+#define MODULES_VADDR		(__START_KERNEL_map + KERNEL_IMAGE_SIZE)
 /* The module sections ends with the start of the fixmap */
-#define MODULES_END   __fix_to_virt(__end_of_fixed_addresses + 1)
-#define MODULES_LEN   (MODULES_END - MODULES_VADDR)
-#define ESPFIX_PGD_ENTRY _AC(-2, UL)
-#define ESPFIX_BASE_ADDR (ESPFIX_PGD_ENTRY << P4D_SHIFT)
-#define EFI_VA_START	 ( -4 * (_AC(1, UL) << 30))
-#define EFI_VA_END	 (-68 * (_AC(1, UL) << 30))
+#define MODULES_END		__fix_to_virt(__end_of_fixed_addresses + 1)
+#define MODULES_LEN		(MODULES_END - MODULES_VADDR)
+
+#define ESPFIX_PGD_ENTRY	_AC(-2, UL)
+#define ESPFIX_BASE_ADDR	(ESPFIX_PGD_ENTRY << P4D_SHIFT)
+
+#define CPU_ENTRY_AREA_PGD	_AC(-3, UL)
+#define CPU_ENTRY_AREA_BASE	(CPU_ENTRY_AREA_PGD << P4D_SHIFT)
+
+#define EFI_VA_START		( -4 * (_AC(1, UL) << 30))
+#define EFI_VA_END		(-68 * (_AC(1, UL) << 30))
 
 #define EARLY_DYNAMIC_PAGE_TABLES	64
 
--- a/arch/x86/kernel/dumpstack.c
+++ b/arch/x86/kernel/dumpstack.c
@@ -18,6 +18,7 @@
 #include <linux/nmi.h>
 #include <linux/sysfs.h>
 
+#include <asm/cpu_entry_area.h>
 #include <asm/stacktrace.h>
 #include <asm/unwind.h>
 
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -951,8 +951,9 @@ void __init trap_init(void)
 	 * "sidt" instruction will not leak the location of the kernel, and
 	 * to defend the IDT against arbitrary memory write vulnerabilities.
 	 * It will be reloaded in cpu_init() */
-	__set_fixmap(FIX_RO_IDT, __pa_symbol(idt_table), PAGE_KERNEL_RO);
-	idt_descr.address = fix_to_virt(FIX_RO_IDT);
+	cea_set_pte(CPU_ENTRY_AREA_RO_IDT_VADDR, __pa_symbol(idt_table),
+		    PAGE_KERNEL_RO);
+	idt_descr.address = CPU_ENTRY_AREA_RO_IDT;
 
 	/*
 	 * Should be a barrier for any external CPU state:
--- a/arch/x86/mm/cpu_entry_area.c
+++ b/arch/x86/mm/cpu_entry_area.c
@@ -13,11 +13,18 @@ static DEFINE_PER_CPU_PAGE_ALIGNED(char,
 	[(N_EXCEPTION_STACKS - 1) * EXCEPTION_STKSZ + DEBUG_STKSZ]);
 #endif
 
+void cea_set_pte(void *cea_vaddr, phys_addr_t pa, pgprot_t flags)
+{
+	unsigned long va = (unsigned long) cea_vaddr;
+
+	set_pte_vaddr(va, pfn_pte(pa >> PAGE_SHIFT, flags));
+}
+
 static void __init
-set_percpu_fixmap_pages(int idx, void *ptr, int pages, pgprot_t prot)
+cea_map_percpu_pages(void *cea_vaddr, void *ptr, int pages, pgprot_t prot)
 {
-	for ( ; pages; pages--, idx--, ptr += PAGE_SIZE)
-		__set_fixmap(idx, per_cpu_ptr_to_phys(ptr), prot);
+	for ( ; pages; pages--, cea_vaddr += PAGE_SIZE, ptr += PAGE_SIZE)
+		cea_set_pte(cea_vaddr, per_cpu_ptr_to_phys(ptr), prot);
 }
 
 /* Setup the fixmap mappings only once per-processor */
@@ -45,10 +52,12 @@ static void __init setup_cpu_entry_area(
 	pgprot_t tss_prot = PAGE_KERNEL;
 #endif
 
-	__set_fixmap(get_cpu_entry_area_index(cpu, gdt), get_cpu_gdt_paddr(cpu), gdt_prot);
-	set_percpu_fixmap_pages(get_cpu_entry_area_index(cpu, entry_stack_page),
-				per_cpu_ptr(&entry_stack_storage, cpu), 1,
-				PAGE_KERNEL);
+	cea_set_pte(&get_cpu_entry_area(cpu)->gdt, get_cpu_gdt_paddr(cpu),
+		    gdt_prot);
+
+	cea_map_percpu_pages(&get_cpu_entry_area(cpu)->entry_stack_page,
+			     per_cpu_ptr(&entry_stack_storage, cpu), 1,
+			     PAGE_KERNEL);
 
 	/*
 	 * The Intel SDM says (Volume 3, 7.2.1):
@@ -70,10 +79,9 @@ static void __init setup_cpu_entry_area(
 	BUILD_BUG_ON((offsetof(struct tss_struct, x86_tss) ^
 		      offsetofend(struct tss_struct, x86_tss)) & PAGE_MASK);
 	BUILD_BUG_ON(sizeof(struct tss_struct) % PAGE_SIZE != 0);
-	set_percpu_fixmap_pages(get_cpu_entry_area_index(cpu, tss),
-				&per_cpu(cpu_tss_rw, cpu),
-				sizeof(struct tss_struct) / PAGE_SIZE,
-				tss_prot);
+	cea_map_percpu_pages(&get_cpu_entry_area(cpu)->tss,
+			     &per_cpu(cpu_tss_rw, cpu),
+			     sizeof(struct tss_struct) / PAGE_SIZE, tss_prot);
 
 #ifdef CONFIG_X86_32
 	per_cpu(cpu_entry_area, cpu) = get_cpu_entry_area(cpu);
@@ -83,20 +91,37 @@ static void __init setup_cpu_entry_area(
 	BUILD_BUG_ON(sizeof(exception_stacks) % PAGE_SIZE != 0);
 	BUILD_BUG_ON(sizeof(exception_stacks) !=
 		     sizeof(((struct cpu_entry_area *)0)->exception_stacks));
-	set_percpu_fixmap_pages(get_cpu_entry_area_index(cpu, exception_stacks),
-				&per_cpu(exception_stacks, cpu),
-				sizeof(exception_stacks) / PAGE_SIZE,
-				PAGE_KERNEL);
+	cea_map_percpu_pages(&get_cpu_entry_area(cpu)->exception_stacks,
+			     &per_cpu(exception_stacks, cpu),
+			     sizeof(exception_stacks) / PAGE_SIZE, PAGE_KERNEL);
 
-	__set_fixmap(get_cpu_entry_area_index(cpu, entry_trampoline),
+	cea_set_pte(&get_cpu_entry_area(cpu)->entry_trampoline,
 		     __pa_symbol(_entry_trampoline), PAGE_KERNEL_RX);
 #endif
 }
 
+static __init void setup_cpu_entry_area_ptes(void)
+{
+#ifdef CONFIG_X86_32
+	unsigned long start, end;
+
+	BUILD_BUG_ON(CPU_ENTRY_AREA_PAGES * PAGE_SIZE < CPU_ENTRY_AREA_MAP_SIZE);
+	BUG_ON(CPU_ENTRY_AREA_BASE & ~PMD_MASK);
+
+	start = CPU_ENTRY_AREA_BASE;
+	end = start + CPU_ENTRY_AREA_MAP_SIZE;
+
+	for (; start < end; start += PMD_SIZE)
+		populate_extra_pte(start);
+#endif
+}
+
 void __init setup_cpu_entry_areas(void)
 {
 	unsigned int cpu;
 
+	setup_cpu_entry_area_ptes();
+
 	for_each_possible_cpu(cpu)
 		setup_cpu_entry_area(cpu);
 }
--- a/arch/x86/mm/dump_pagetables.c
+++ b/arch/x86/mm/dump_pagetables.c
@@ -58,6 +58,7 @@ enum address_markers_idx {
 	KASAN_SHADOW_START_NR,
 	KASAN_SHADOW_END_NR,
 #endif
+	CPU_ENTRY_AREA_NR,
 #ifdef CONFIG_X86_ESPFIX64
 	ESPFIX_START_NR,
 #endif
@@ -81,6 +82,7 @@ static struct addr_marker address_marker
 	[KASAN_SHADOW_START_NR]	= { KASAN_SHADOW_START,	"KASAN shadow" },
 	[KASAN_SHADOW_END_NR]	= { KASAN_SHADOW_END,	"KASAN shadow end" },
 #endif
+	[CPU_ENTRY_AREA_NR]	= { CPU_ENTRY_AREA_BASE, "CPU entry area" },
 #ifdef CONFIG_X86_ESPFIX64
 	[ESPFIX_START_NR]	= { ESPFIX_BASE_ADDR,	"ESPfix Area", 16 },
 #endif
@@ -104,6 +106,7 @@ enum address_markers_idx {
 #ifdef CONFIG_HIGHMEM
 	PKMAP_BASE_NR,
 #endif
+	CPU_ENTRY_AREA_NR,
 	FIXADDR_START_NR,
 	END_OF_SPACE_NR,
 };
@@ -116,6 +119,7 @@ static struct addr_marker address_marker
 #ifdef CONFIG_HIGHMEM
 	[PKMAP_BASE_NR]		= { 0UL,		"Persistent kmap() Area" },
 #endif
+	[CPU_ENTRY_AREA_NR]	= { 0UL,		"CPU entry area" },
 	[FIXADDR_START_NR]	= { 0UL,		"Fixmap area" },
 	[END_OF_SPACE_NR]	= { -1,			NULL }
 };
@@ -541,8 +545,8 @@ static int __init pt_dump_init(void)
 	address_markers[PKMAP_BASE_NR].start_address = PKMAP_BASE;
 # endif
 	address_markers[FIXADDR_START_NR].start_address = FIXADDR_START;
+	address_markers[CPU_ENTRY_AREA_NR].start_address = CPU_ENTRY_AREA_BASE;
 #endif
-
 	return 0;
 }
 __initcall(pt_dump_init);
--- a/arch/x86/mm/init_32.c
+++ b/arch/x86/mm/init_32.c
@@ -50,6 +50,7 @@
 #include <asm/setup.h>
 #include <asm/set_memory.h>
 #include <asm/page_types.h>
+#include <asm/cpu_entry_area.h>
 #include <asm/init.h>
 
 #include "mm_internal.h"
@@ -766,6 +767,7 @@ void __init mem_init(void)
 	mem_init_print_info(NULL);
 	printk(KERN_INFO "virtual kernel memory layout:\n"
 		"    fixmap  : 0x%08lx - 0x%08lx   (%4ld kB)\n"
+		"  cpu_entry : 0x%08lx - 0x%08lx   (%4ld kB)\n"
 #ifdef CONFIG_HIGHMEM
 		"    pkmap   : 0x%08lx - 0x%08lx   (%4ld kB)\n"
 #endif
@@ -777,6 +779,10 @@ void __init mem_init(void)
 		FIXADDR_START, FIXADDR_TOP,
 		(FIXADDR_TOP - FIXADDR_START) >> 10,
 
+		CPU_ENTRY_AREA_BASE,
+		CPU_ENTRY_AREA_BASE + CPU_ENTRY_AREA_MAP_SIZE,
+		CPU_ENTRY_AREA_MAP_SIZE >> 10,
+
 #ifdef CONFIG_HIGHMEM
 		PKMAP_BASE, PKMAP_BASE+LAST_PKMAP*PAGE_SIZE,
 		(LAST_PKMAP*PAGE_SIZE) >> 10,
--- a/arch/x86/mm/kasan_init_64.c
+++ b/arch/x86/mm/kasan_init_64.c
@@ -15,6 +15,7 @@
 #include <asm/tlbflush.h>
 #include <asm/sections.h>
 #include <asm/pgtable.h>
+#include <asm/cpu_entry_area.h>
 
 extern struct range pfn_mapped[E820_MAX_ENTRIES];
 
@@ -330,12 +331,13 @@ void __init kasan_init(void)
 			      (unsigned long)kasan_mem_to_shadow(_end),
 			      early_pfn_to_nid(__pa(_stext)));
 
-	shadow_cpu_entry_begin = (void *)__fix_to_virt(FIX_CPU_ENTRY_AREA_BOTTOM);
+	shadow_cpu_entry_begin = (void *)CPU_ENTRY_AREA_BASE;
 	shadow_cpu_entry_begin = kasan_mem_to_shadow(shadow_cpu_entry_begin);
 	shadow_cpu_entry_begin = (void *)round_down((unsigned long)shadow_cpu_entry_begin,
 						PAGE_SIZE);
 
-	shadow_cpu_entry_end = (void *)(__fix_to_virt(FIX_CPU_ENTRY_AREA_TOP) + PAGE_SIZE);
+	shadow_cpu_entry_end = (void *)(CPU_ENTRY_AREA_BASE +
+					CPU_ENTRY_AREA_TOT_SIZE);
 	shadow_cpu_entry_end = kasan_mem_to_shadow(shadow_cpu_entry_end);
 	shadow_cpu_entry_end = (void *)round_up((unsigned long)shadow_cpu_entry_end,
 					PAGE_SIZE);
--- a/arch/x86/mm/pgtable_32.c
+++ b/arch/x86/mm/pgtable_32.c
@@ -10,6 +10,7 @@
 #include <linux/pagemap.h>
 #include <linux/spinlock.h>
 
+#include <asm/cpu_entry_area.h>
 #include <asm/pgtable.h>
 #include <asm/pgalloc.h>
 #include <asm/fixmap.h>
--- a/arch/x86/xen/mmu_pv.c
+++ b/arch/x86/xen/mmu_pv.c
@@ -2261,7 +2261,6 @@ static void xen_set_fixmap(unsigned idx,
 
 	switch (idx) {
 	case FIX_BTMAP_END ... FIX_BTMAP_BEGIN:
-	case FIX_RO_IDT:
 #ifdef CONFIG_X86_32
 	case FIX_WP_TEST:
 # ifdef CONFIG_HIGHMEM
@@ -2272,7 +2271,6 @@ static void xen_set_fixmap(unsigned idx,
 #endif
 	case FIX_TEXT_POKE0:
 	case FIX_TEXT_POKE1:
-	case FIX_CPU_ENTRY_AREA_TOP ... FIX_CPU_ENTRY_AREA_BOTTOM:
 		/* All local page mappings */
 		pte = pfn_pte(phys, prot);
 		break;

^ permalink raw reply	[flat|nested] 85+ messages in thread

* [patch V181 23/54] init: Invoke init_espfix_bsp() from mm_init()
  2017-12-20 21:35 [patch V181 00/54] x86/pti: Final XMAS release Thomas Gleixner
                   ` (21 preceding siblings ...)
  2017-12-20 21:35 ` [patch V181 22/54] x86/cpu_entry_area: Move it out of fixmap Thomas Gleixner
@ 2017-12-20 21:35 ` Thomas Gleixner
  2017-12-20 21:35 ` [patch V181 24/54] x86/cpufeatures: Add X86_BUG_CPU_INSECURE Thomas Gleixner
                   ` (33 subsequent siblings)
  56 siblings, 0 replies; 85+ messages in thread
From: Thomas Gleixner @ 2017-12-20 21:35 UTC (permalink / raw)
  To: LKML
  Cc: x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, keescook, hughd,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Rik van Riel,
	Boris Ostrovsky, Juergen Gross, David Laight, Eduardo Valentin,
	aliguori, Will Deacon, Vlastimil Babka, daniel.gruss

[-- Attachment #1: init--Invoke-init_espfix_bsp---from-mm_init--.patch --]
[-- Type: text/plain, Size: 2506 bytes --]

From: Thomas Gleixner <tglx@linutronix.de>

init_espfix_bsp() needs to be invoked before the page table isolation
initialization. Move it into mm_init() which is the place where pti_init()
will be added.

While at it get rid of the #ifdeffery and provide proper stub functions.
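
The stub pattern in question, as a minimal generic sketch (CONFIG_FOO and
foo_init() are placeholder names, not part of the patch): when a feature is
compiled out, an empty static inline replaces the real declaration, so call
sites need no #ifdef guards and the call compiles away entirely:

	#ifdef CONFIG_FOO
	extern void foo_init(void);
	#else
	/* Compiled out: the empty inline vanishes at the call site */
	static inline void foo_init(void) { }
	#endif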

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/include/asm/espfix.h |    7 ++++---
 arch/x86/kernel/smpboot.c     |    6 +-----
 include/asm-generic/pgtable.h |    5 +++++
 init/main.c                   |    6 ++----
 4 files changed, 12 insertions(+), 12 deletions(-)

--- a/arch/x86/include/asm/espfix.h
+++ b/arch/x86/include/asm/espfix.h
@@ -2,7 +2,7 @@
 #ifndef _ASM_X86_ESPFIX_H
 #define _ASM_X86_ESPFIX_H
 
-#ifdef CONFIG_X86_64
+#ifdef CONFIG_X86_ESPFIX64
 
 #include <asm/percpu.h>
 
@@ -11,7 +11,8 @@ DECLARE_PER_CPU_READ_MOSTLY(unsigned lon
 
 extern void init_espfix_bsp(void);
 extern void init_espfix_ap(int cpu);
-
-#endif /* CONFIG_X86_64 */
+#else
+static inline void init_espfix_ap(int cpu) { }
+#endif
 
 #endif /* _ASM_X86_ESPFIX_H */
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -990,12 +990,8 @@ static int do_boot_cpu(int apicid, int c
 	initial_code = (unsigned long)start_secondary;
 	initial_stack  = idle->thread.sp;
 
-	/*
-	 * Enable the espfix hack for this CPU
-	*/
-#ifdef CONFIG_X86_ESPFIX64
+	/* Enable the espfix hack for this CPU */
 	init_espfix_ap(cpu);
-#endif
 
 	/* So we see what's up */
 	announce_cpu(cpu, apicid);
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -1017,6 +1017,11 @@ static inline int pmd_clear_huge(pmd_t *
 struct file;
 int phys_mem_access_prot_allowed(struct file *file, unsigned long pfn,
 			unsigned long size, pgprot_t *vma_prot);
+
+#ifndef CONFIG_X86_ESPFIX64
+static inline void init_espfix_bsp(void) { }
+#endif
+
 #endif /* !__ASSEMBLY__ */
 
 #ifndef io_remap_pfn_range
--- a/init/main.c
+++ b/init/main.c
@@ -504,6 +504,8 @@ static void __init mm_init(void)
 	pgtable_init();
 	vmalloc_init();
 	ioremap_huge_init();
+	/* Should be run before the first non-init thread is created */
+	init_espfix_bsp();
 }
 
 asmlinkage __visible void __init start_kernel(void)
@@ -674,10 +676,6 @@ asmlinkage __visible void __init start_k
 	if (efi_enabled(EFI_RUNTIME_SERVICES))
 		efi_enter_virtual_mode();
 #endif
-#ifdef CONFIG_X86_ESPFIX64
-	/* Should be run before the first non-init thread is created */
-	init_espfix_bsp();
-#endif
 	thread_stack_cache_init();
 	cred_init();
 	fork_init();

^ permalink raw reply	[flat|nested] 85+ messages in thread

* [patch V181 24/54] x86/cpufeatures: Add X86_BUG_CPU_INSECURE
  2017-12-20 21:35 [patch V181 00/54] x86/pti: Final XMAS release Thomas Gleixner
                   ` (22 preceding siblings ...)
  2017-12-20 21:35 ` [patch V181 23/54] init: Invoke init_espfix_bsp() from mm_init() Thomas Gleixner
@ 2017-12-20 21:35 ` Thomas Gleixner
  2017-12-20 21:35   ` Thomas Gleixner
                   ` (32 subsequent siblings)
  56 siblings, 0 replies; 85+ messages in thread
From: Thomas Gleixner @ 2017-12-20 21:35 UTC (permalink / raw)
  To: LKML
  Cc: x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, keescook, hughd,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Rik van Riel,
	Boris Ostrovsky, Juergen Gross, David Laight, Eduardo Valentin,
	aliguori, Will Deacon, Vlastimil Babka, daniel.gruss,
	Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin

[-- Attachment #1: 0027-x86-cpufeatures-Add-X86_BUG_CPU_INSECURE.patch --]
[-- Type: text/plain, Size: 3628 bytes --]

From: Thomas Gleixner <tglx@linutronix.de>

Many x86 CPUs leak information to user space due to missing isolation of
user space and kernel space page tables. There are many well documented
ways to exploit that.

The upcoming software mitigation of isolating the user and kernel space
page tables needs a misfeature flag so code can be made runtime
conditional.

Add a BUG bit which indicates that the CPU is affected and add a feature
bit which indicates that the software mitigation is enabled.

Assume for now that _ALL_ x86 CPUs are affected by this. Exceptions can be
made later.
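
As a rough sketch of how these bits make code runtime conditional (the
function below is purely illustrative and not part of the patch):

	static void example_pti_report(void)
	{
		/* CPU not affected: nothing to do */
		if (!boot_cpu_has_bug(X86_BUG_CPU_INSECURE))
			return;

		/* Mitigation compiled in and not disabled at boot */
		if (static_cpu_has(X86_FEATURE_PTI))
			pr_info("page table isolation enabled\n");
	}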

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Laight <David.Laight@aculab.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Eduardo Valentin <eduval@amazon.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: aliguori@amazon.com
Cc: daniel.gruss@iaik.tugraz.at
Cc: hughd@google.com
Cc: keescook@google.com
---
 arch/x86/include/asm/cpufeatures.h       |    3 ++-
 arch/x86/include/asm/disabled-features.h |    8 +++++++-
 arch/x86/kernel/cpu/common.c             |    4 ++++
 3 files changed, 13 insertions(+), 2 deletions(-)

--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -201,7 +201,7 @@
 #define X86_FEATURE_HW_PSTATE		( 7*32+ 8) /* AMD HW-PState */
 #define X86_FEATURE_PROC_FEEDBACK	( 7*32+ 9) /* AMD ProcFeedbackInterface */
 #define X86_FEATURE_SME			( 7*32+10) /* AMD Secure Memory Encryption */
-
+#define X86_FEATURE_PTI			( 7*32+11) /* Kernel Page Table Isolation enabled */
 #define X86_FEATURE_INTEL_PPIN		( 7*32+14) /* Intel Processor Inventory Number */
 #define X86_FEATURE_INTEL_PT		( 7*32+15) /* Intel Processor Trace */
 #define X86_FEATURE_AVX512_4VNNIW	( 7*32+16) /* AVX-512 Neural Network Instructions */
@@ -340,5 +340,6 @@
 #define X86_BUG_SWAPGS_FENCE		X86_BUG(11) /* SWAPGS without input dep on GS */
 #define X86_BUG_MONITOR			X86_BUG(12) /* IPI required to wake up remote CPU */
 #define X86_BUG_AMD_E400		X86_BUG(13) /* CPU is among the affected by Erratum 400 */
+#define X86_BUG_CPU_INSECURE		X86_BUG(14) /* CPU is insecure and needs kernel page table isolation */
 
 #endif /* _ASM_X86_CPUFEATURES_H */
--- a/arch/x86/include/asm/disabled-features.h
+++ b/arch/x86/include/asm/disabled-features.h
@@ -44,6 +44,12 @@
 # define DISABLE_LA57	(1<<(X86_FEATURE_LA57 & 31))
 #endif
 
+#ifdef CONFIG_PAGE_TABLE_ISOLATION
+# define DISABLE_PTI		0
+#else
+# define DISABLE_PTI		(1 << (X86_FEATURE_PTI & 31))
+#endif
+
 /*
  * Make sure to add features to the correct mask
  */
@@ -54,7 +60,7 @@
 #define DISABLED_MASK4	(DISABLE_PCID)
 #define DISABLED_MASK5	0
 #define DISABLED_MASK6	0
-#define DISABLED_MASK7	0
+#define DISABLED_MASK7	(DISABLE_PTI)
 #define DISABLED_MASK8	0
 #define DISABLED_MASK9	(DISABLE_MPX)
 #define DISABLED_MASK10	0
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -898,6 +898,10 @@ static void __init early_identify_cpu(st
 	}
 
 	setup_force_cpu_cap(X86_FEATURE_ALWAYS);
+
+	/* Assume for now that ALL x86 CPUs are insecure */
+	setup_force_cpu_bug(X86_BUG_CPU_INSECURE);
+
 	fpu__init_system(c);
 
 #ifdef CONFIG_X86_32

^ permalink raw reply	[flat|nested] 85+ messages in thread

* [patch V181 25/54] x86/mm/pti: Disable global pages if PAGE_TABLE_ISOLATION=y
  2017-12-20 21:35 [patch V181 00/54] x86/pti: Final XMAS release Thomas Gleixner
@ 2017-12-20 21:35   ` Thomas Gleixner
  2017-12-20 21:35 ` [patch V181 02/54] x86/mm/dump_pagetables: Check PAGE_PRESENT for real Thomas Gleixner
                     ` (55 subsequent siblings)
  56 siblings, 0 replies; 85+ messages in thread
From: Thomas Gleixner @ 2017-12-20 21:35 UTC (permalink / raw)
  To: LKML
  Cc: x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, keescook, hughd,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Rik van Riel,
	Boris Ostrovsky, Juergen Gross, David Laight, Eduardo Valentin,
	aliguori, Will Deacon, Vlastimil Babka, daniel.gruss,
	Dave Hansen, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	linux-mm

[-- Attachment #1: 0028-x86-mm-pti-Disable-global-pages-if-PAGE_TABLE_ISOLAT.patch --]
[-- Type: text/plain, Size: 2893 bytes --]

From: Dave Hansen <dave.hansen@linux.intel.com>

Global pages stay in the TLB across context switches.  Since all contexts
share the same kernel mapping, these mappings are marked as global pages
so kernel entries in the TLB are not flushed out on a context switch.

But, even having these entries in the TLB opens up something that an
attacker can use, such as the double-page-fault attack:

   http://www.ieee-security.org/TC/SP2013/papers/4977a191.pdf

That means that even when PAGE_TABLE_ISOLATION switches page tables
on return to user space the global pages would stay in the TLB cache.

Disable global pages so that kernel TLB entries can be flushed before
returning to user space. This way, all accesses to kernel addresses from
userspace result in a TLB miss independent of the existence of a kernel
mapping.

Suppress global pages via the __supported_pte_mask. The user space
mappings set PAGE_GLOBAL for the minimal kernel mappings which are
required for entry/exit. These mappings are set up manually so the
filtering does not take place.

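For illustration, the mechanism by which __supported_pte_mask filters PTE
flags, reduced to a sketch (the helper name is hypothetical; the real code
does this inside the pgprot massaging helpers):

	static pte_t example_mk_pte(unsigned long pfn, pgprot_t prot)
	{
		pteval_t val = (pfn << PAGE_SHIFT) | pgprot_val(prot);

		/*
		 * With X86_FEATURE_PTI set, _PAGE_GLOBAL is never added
		 * to __supported_pte_mask, so a global bit requested by
		 * the caller is silently dropped here.
		 */
		return __pte(val & __supported_pte_mask);
	}
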
[ The __supported_pte_mask simplification was written by Thomas Gleixner. ]

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: David Laight <David.Laight@aculab.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Eduardo Valentin <eduval@amazon.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: aliguori@amazon.com
Cc: daniel.gruss@iaik.tugraz.at
Cc: hughd@google.com
Cc: keescook@google.com
Cc: linux-mm@kvack.org
---
 arch/x86/mm/init.c |   12 +++++++++---
 1 file changed, 9 insertions(+), 3 deletions(-)

--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -161,6 +161,12 @@ struct map_range {
 
 static int page_size_mask;
 
+static void enable_global_pages(void)
+{
+	if (!static_cpu_has(X86_FEATURE_PTI))
+		__supported_pte_mask |= _PAGE_GLOBAL;
+}
+
 static void __init probe_page_size_mask(void)
 {
 	/*
@@ -179,11 +185,11 @@ static void __init probe_page_size_mask(
 		cr4_set_bits_and_update_boot(X86_CR4_PSE);
 
 	/* Enable PGE if available */
+	__supported_pte_mask &= ~_PAGE_GLOBAL;
 	if (boot_cpu_has(X86_FEATURE_PGE)) {
 		cr4_set_bits_and_update_boot(X86_CR4_PGE);
-		__supported_pte_mask |= _PAGE_GLOBAL;
-	} else
-		__supported_pte_mask &= ~_PAGE_GLOBAL;
+		enable_global_pages();
+	}
 
 	/* Enable 1 GB linear kernel mappings if available: */
 	if (direct_gbpages && boot_cpu_has(X86_FEATURE_GBPAGES)) {

^ permalink raw reply	[flat|nested] 85+ messages in thread

* [patch V181 26/54] x86/mm/pti: Prepare the x86/entry assembly code for entry/exit CR3 switching
  2017-12-20 21:35 [patch V181 00/54] x86/pti: Final XMAS release Thomas Gleixner
@ 2017-12-20 21:35   ` Thomas Gleixner
  2017-12-20 21:35 ` [patch V181 02/54] x86/mm/dump_pagetables: Check PAGE_PRESENT for real Thomas Gleixner
                     ` (55 subsequent siblings)
  56 siblings, 0 replies; 85+ messages in thread
From: Thomas Gleixner @ 2017-12-20 21:35 UTC (permalink / raw)
  To: LKML
  Cc: x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, keescook, hughd,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Rik van Riel,
	Boris Ostrovsky, Juergen Gross, David Laight, Eduardo Valentin,
	aliguori, Will Deacon, Vlastimil Babka, daniel.gruss,
	Dave Hansen, Ingo Molnar, Borislav Petkov, Borislav Petkov,
	H. Peter Anvin, linux-mm

[-- Attachment #1: 0029-x86-mm-pti-Prepare-the-x86-entry-assembly-code-for-e.patch --]
[-- Type: text/plain, Size: 10611 bytes --]

From: Dave Hansen <dave.hansen@linux.intel.com>

PAGE_TABLE_ISOLATION needs to switch to a different CR3 value when it
enters the kernel and switch back when it exits.  This essentially needs to
be done before leaving assembly code.

This is extra challenging because the switching context is tricky: the
registers that can be clobbered can vary.  It is also hard to store things
on the stack because either there is an established ABI (ptregs) or the stack is
entirely unsafe to use.

Establish a set of macros that allow changing to the user and kernel CR3
values.

Interactions with SWAPGS:

  Previous versions of the PAGE_TABLE_ISOLATION code relied on having
  per-CPU scratch space to save/restore a register that can be used for the
  CR3 MOV.  The %GS register is used to index into our per-CPU space, so
  SWAPGS *had* to be done before the CR3 switch.  That scratch space is gone
  now, but the semantic that SWAPGS must be done before the CR3 MOV is
  retained.  This is worth keeping because it is not hard to do and it
  allows things like per-CPU debugging information to be added later.

What this does in the NMI code is worth pointing out.  NMIs can interrupt
*any* context and they can also be nested with NMIs interrupting other
NMIs.  The comments below ".Lnmi_from_kernel" explain the format of the
stack during this situation.  Changing the format of this stack is hard.
Instead of storing the old CR3 value on the stack, this depends on the
*regular* register save/restore mechanism and then uses %r14 to keep CR3
during the NMI.  It is callee-saved and will not be clobbered by the C NMI
handlers that get called.
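
The address arithmetic that the new macros implement, written out in C for
clarity (the "example_" names are illustrative only, not part of the patch):

	/* The two 4k PGDs share one 8k allocation; bit 12 selects the half */
	#define EXAMPLE_PTI_SWITCH_MASK	(1UL << PAGE_SHIFT)

	static inline unsigned long example_kernel_cr3(unsigned long cr3)
	{
		return cr3 & ~EXAMPLE_PTI_SWITCH_MASK;	/* clear bit 12 */
	}

	static inline unsigned long example_user_cr3(unsigned long cr3)
	{
		return cr3 | EXAMPLE_PTI_SWITCH_MASK;	/* set bit 12 */
	}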

[ PeterZ: ESPFIX optimization ]

Based-on-code-from: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: David Laight <David.Laight@aculab.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Eduardo Valentin <eduval@amazon.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: aliguori@amazon.com
Cc: daniel.gruss@iaik.tugraz.at
Cc: hughd@google.com
Cc: keescook@google.com
Cc: linux-mm@kvack.org
---
 arch/x86/entry/calling.h         |   66 +++++++++++++++++++++++++++++++++++++++
 arch/x86/entry/entry_64.S        |   45 +++++++++++++++++++++++---
 arch/x86/entry/entry_64_compat.S |   24 +++++++++++++-
 3 files changed, 128 insertions(+), 7 deletions(-)

--- a/arch/x86/entry/calling.h
+++ b/arch/x86/entry/calling.h
@@ -1,6 +1,8 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 #include <linux/jump_label.h>
 #include <asm/unwind_hints.h>
+#include <asm/cpufeatures.h>
+#include <asm/page_types.h>
 
 /*
 
@@ -187,6 +189,70 @@ For 32-bit we have the following convent
 #endif
 .endm
 
+#ifdef CONFIG_PAGE_TABLE_ISOLATION
+
+/* PAGE_TABLE_ISOLATION PGDs are 8k.  Flip bit 12 to switch between the two halves: */
+#define PTI_SWITCH_MASK (1<<PAGE_SHIFT)
+
+.macro ADJUST_KERNEL_CR3 reg:req
+	/* Clear "PAGE_TABLE_ISOLATION bit", point CR3 at kernel pagetables: */
+	andq	$(~PTI_SWITCH_MASK), \reg
+.endm
+
+.macro ADJUST_USER_CR3 reg:req
+	/* Move CR3 up a page to the user page tables: */
+	orq	$(PTI_SWITCH_MASK), \reg
+.endm
+
+.macro SWITCH_TO_KERNEL_CR3 scratch_reg:req
+	mov	%cr3, \scratch_reg
+	ADJUST_KERNEL_CR3 \scratch_reg
+	mov	\scratch_reg, %cr3
+.endm
+
+.macro SWITCH_TO_USER_CR3 scratch_reg:req
+	mov	%cr3, \scratch_reg
+	ADJUST_USER_CR3 \scratch_reg
+	mov	\scratch_reg, %cr3
+.endm
+
+.macro SAVE_AND_SWITCH_TO_KERNEL_CR3 scratch_reg:req save_reg:req
+	movq	%cr3, \scratch_reg
+	movq	\scratch_reg, \save_reg
+	/*
+	 * Is the switch bit zero?  If so, CR3 already points at the
+	 * kernel page tables and no switch is needed.
+	 */
+	testq	$(PTI_SWITCH_MASK), \scratch_reg
+	jz	.Ldone_\@
+
+	ADJUST_KERNEL_CR3 \scratch_reg
+	movq	\scratch_reg, %cr3
+
+.Ldone_\@:
+.endm
+
+.macro RESTORE_CR3 save_reg:req
+	/*
+	 * The CR3 write could be avoided when not changing its value,
+	 * but would require a CR3 read *and* a scratch register.
+	 */
+	movq	\save_reg, %cr3
+.endm
+
+#else /* CONFIG_PAGE_TABLE_ISOLATION=n: */
+
+.macro SWITCH_TO_KERNEL_CR3 scratch_reg:req
+.endm
+.macro SWITCH_TO_USER_CR3 scratch_reg:req
+.endm
+.macro SAVE_AND_SWITCH_TO_KERNEL_CR3 scratch_reg:req save_reg:req
+.endm
+.macro RESTORE_CR3 save_reg:req
+.endm
+
+#endif
+
 #endif /* CONFIG_X86_64 */
 
 /*
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -164,6 +164,9 @@ ENTRY(entry_SYSCALL_64_trampoline)
 	/* Stash the user RSP. */
 	movq	%rsp, RSP_SCRATCH
 
+	/* Note: using %rsp as a scratch reg. */
+	SWITCH_TO_KERNEL_CR3 scratch_reg=%rsp
+
 	/* Load the top of the task stack into RSP */
 	movq	CPU_ENTRY_AREA_tss + TSS_sp1 + CPU_ENTRY_AREA, %rsp
 
@@ -203,6 +206,10 @@ ENTRY(entry_SYSCALL_64)
 	 */
 
 	swapgs
+	/*
+	 * This path is not taken when PAGE_TABLE_ISOLATION is disabled so it
+	 * is not required to switch CR3.
+	 */
 	movq	%rsp, PER_CPU_VAR(rsp_scratch)
 	movq	PER_CPU_VAR(cpu_current_top_of_stack), %rsp
 
@@ -399,6 +406,7 @@ GLOBAL(entry_SYSCALL_64_after_hwframe)
 	 * We are on the trampoline stack.  All regs except RDI are live.
 	 * We can do future final exit work right here.
 	 */
+	SWITCH_TO_USER_CR3 scratch_reg=%rdi
 
 	popq	%rdi
 	popq	%rsp
@@ -736,6 +744,8 @@ GLOBAL(swapgs_restore_regs_and_return_to
 	 * We can do future final exit work right here.
 	 */
 
+	SWITCH_TO_USER_CR3 scratch_reg=%rdi
+
 	/* Restore RDI. */
 	popq	%rdi
 	SWAPGS
@@ -818,7 +828,9 @@ ENTRY(native_iret)
 	 */
 
 	pushq	%rdi				/* Stash user RDI */
-	SWAPGS
+	SWAPGS					/* to kernel GS */
+	SWITCH_TO_KERNEL_CR3 scratch_reg=%rdi	/* to kernel CR3 */
+
 	movq	PER_CPU_VAR(espfix_waddr), %rdi
 	movq	%rax, (0*8)(%rdi)		/* user RAX */
 	movq	(1*8)(%rsp), %rax		/* user RIP */
@@ -834,7 +846,6 @@ ENTRY(native_iret)
 	/* Now RAX == RSP. */
 
 	andl	$0xffff0000, %eax		/* RAX = (RSP & 0xffff0000) */
-	popq	%rdi				/* Restore user RDI */
 
 	/*
 	 * espfix_stack[31:16] == 0.  The page tables are set up such that
@@ -845,7 +856,11 @@ ENTRY(native_iret)
 	 * still points to an RO alias of the ESPFIX stack.
 	 */
 	orq	PER_CPU_VAR(espfix_stack), %rax
-	SWAPGS
+
+	SWITCH_TO_USER_CR3 scratch_reg=%rdi	/* to user CR3 */
+	SWAPGS					/* to user GS */
+	popq	%rdi				/* Restore user RDI */
+
 	movq	%rax, %rsp
 	UNWIND_HINT_IRET_REGS offset=8
 
@@ -945,6 +960,8 @@ ENTRY(switch_to_thread_stack)
 	UNWIND_HINT_FUNC
 
 	pushq	%rdi
+	/* Need to switch before accessing the thread stack. */
+	SWITCH_TO_KERNEL_CR3 scratch_reg=%rdi
 	movq	%rsp, %rdi
 	movq	PER_CPU_VAR(cpu_current_top_of_stack), %rsp
 	UNWIND_HINT sp_offset=16 sp_reg=ORC_REG_DI
@@ -1244,7 +1261,11 @@ ENTRY(paranoid_entry)
 	js	1f				/* negative -> in kernel */
 	SWAPGS
 	xorl	%ebx, %ebx
-1:	ret
+
+1:
+	SAVE_AND_SWITCH_TO_KERNEL_CR3 scratch_reg=%rax save_reg=%r14
+
+	ret
 END(paranoid_entry)
 
 /*
@@ -1266,6 +1287,7 @@ ENTRY(paranoid_exit)
 	testl	%ebx, %ebx			/* swapgs needed? */
 	jnz	.Lparanoid_exit_no_swapgs
 	TRACE_IRQS_IRETQ
+	RESTORE_CR3	save_reg=%r14
 	SWAPGS_UNSAFE_STACK
 	jmp	.Lparanoid_exit_restore
 .Lparanoid_exit_no_swapgs:
@@ -1293,6 +1315,8 @@ ENTRY(error_entry)
 	 * from user mode due to an IRET fault.
 	 */
 	SWAPGS
+	/* We have user CR3.  Change to kernel CR3. */
+	SWITCH_TO_KERNEL_CR3 scratch_reg=%rax
 
 .Lerror_entry_from_usermode_after_swapgs:
 	/* Put us onto the real thread stack. */
@@ -1339,6 +1363,7 @@ ENTRY(error_entry)
 	 * .Lgs_change's error handler with kernel gsbase.
 	 */
 	SWAPGS
+	SWITCH_TO_KERNEL_CR3 scratch_reg=%rax
 	jmp .Lerror_entry_done
 
 .Lbstep_iret:
@@ -1348,10 +1373,11 @@ ENTRY(error_entry)
 
 .Lerror_bad_iret:
 	/*
-	 * We came from an IRET to user mode, so we have user gsbase.
-	 * Switch to kernel gsbase:
+	 * We came from an IRET to user mode, so we have user
+	 * gsbase and CR3.  Switch to kernel gsbase and CR3:
 	 */
 	SWAPGS
+	SWITCH_TO_KERNEL_CR3 scratch_reg=%rax
 
 	/*
 	 * Pretend that the exception came from user mode: set up pt_regs
@@ -1383,6 +1409,10 @@ END(error_exit)
 /*
  * Runs on exception stack.  Xen PV does not go through this path at all,
  * so we can use real assembly here.
+ *
+ * Registers:
+ *	%r14: Used to save/restore the CR3 of the interrupted context
+ *	      when PAGE_TABLE_ISOLATION is in use.  Do not clobber.
  */
 ENTRY(nmi)
 	UNWIND_HINT_IRET_REGS
@@ -1446,6 +1476,7 @@ ENTRY(nmi)
 
 	swapgs
 	cld
+	SWITCH_TO_KERNEL_CR3 scratch_reg=%rdx
 	movq	%rsp, %rdx
 	movq	PER_CPU_VAR(cpu_current_top_of_stack), %rsp
 	UNWIND_HINT_IRET_REGS base=%rdx offset=8
@@ -1698,6 +1729,8 @@ ENTRY(nmi)
 	movq	$-1, %rsi
 	call	do_nmi
 
+	RESTORE_CR3 save_reg=%r14
+
 	testl	%ebx, %ebx			/* swapgs needed? */
 	jnz	nmi_restore
 nmi_swapgs:
--- a/arch/x86/entry/entry_64_compat.S
+++ b/arch/x86/entry/entry_64_compat.S
@@ -49,6 +49,10 @@
 ENTRY(entry_SYSENTER_compat)
 	/* Interrupts are off on entry. */
 	SWAPGS
+
+	/* We are about to clobber %rsp anyway, clobbering here is OK */
+	SWITCH_TO_KERNEL_CR3 scratch_reg=%rsp
+
 	movq	PER_CPU_VAR(cpu_current_top_of_stack), %rsp
 
 	/*
@@ -216,6 +220,12 @@ GLOBAL(entry_SYSCALL_compat_after_hwfram
 	pushq   $0			/* pt_regs->r15 = 0 */
 
 	/*
+	 * We just saved %rdi so it is safe to clobber.  It is not
+	 * preserved during the C calls inside TRACE_IRQS_OFF anyway.
+	 */
+	SWITCH_TO_KERNEL_CR3 scratch_reg=%rdi
+
+	/*
 	 * User mode is traced as though IRQs are on, and SYSENTER
 	 * turned them off.
 	 */
@@ -256,10 +266,22 @@ GLOBAL(entry_SYSCALL_compat_after_hwfram
 	 * when the system call started, which is already known to user
 	 * code.  We zero R8-R10 to avoid info leaks.
          */
+	movq	RSP-ORIG_RAX(%rsp), %rsp
+
+	/*
+	 * The original userspace %rsp (RSP-ORIG_RAX(%rsp)) is stored
+	 * on the process stack which is not mapped to userspace and
+	 * not readable after we SWITCH_TO_USER_CR3.  Delay the CR3
+	 * switch until after the last reference to the process
+	 * stack.
+	 *
+	 * %r8 is zeroed before the sysret, thus safe to clobber.
+	 */
+	SWITCH_TO_USER_CR3 scratch_reg=%r8
+
 	xorq	%r8, %r8
 	xorq	%r9, %r9
 	xorq	%r10, %r10
-	movq	RSP-ORIG_RAX(%rsp), %rsp
 	swapgs
 	sysretl
 END(entry_SYSCALL_compat)

^ permalink raw reply	[flat|nested] 85+ messages in thread

* [patch V181 27/54] x86/mm/pti: Add infrastructure for page table isolation
  2017-12-20 21:35 [patch V181 00/54] x86/pti: Final XMAS release Thomas Gleixner
                   ` (25 preceding siblings ...)
  2017-12-20 21:35   ` Thomas Gleixner
@ 2017-12-20 21:35 ` Thomas Gleixner
  2017-12-20 21:35 ` [patch V181 28/54] x86/mm/pti: Add mapping helper functions Thomas Gleixner
                   ` (29 subsequent siblings)
  56 siblings, 0 replies; 85+ messages in thread
From: Thomas Gleixner @ 2017-12-20 21:35 UTC (permalink / raw)
  To: LKML
  Cc: x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, keescook, hughd,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Rik van Riel,
	Boris Ostrovsky, Juergen Gross, David Laight, Eduardo Valentin,
	aliguori, Will Deacon, Vlastimil Babka, daniel.gruss,
	Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin

[-- Attachment #1: 0030-x86-mm-pti-Add-infrastructure-for-page-table-isolati.patch --]
[-- Type: text/plain, Size: 7936 bytes --]

From: Thomas Gleixner <tglx@linutronix.de>

Add the initial files for kernel page table isolation, with a minimal init
function and the boot time detection for this misfeature.
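
Based on the pr_fmt/pr_info strings added below, booting with "nopti"
appended to the kernel command line should result in a log line like:

	Kernel/User page tables isolation: disabled on command line.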

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Borislav Petkov <bp@suse.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Laight <David.Laight@aculab.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Eduardo Valentin <eduval@amazon.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: aliguori@amazon.com
Cc: daniel.gruss@iaik.tugraz.at
Cc: hughd@google.com
Cc: keescook@google.com
---
 Documentation/admin-guide/kernel-parameters.txt |    2 
 arch/x86/boot/compressed/pagetable.c            |    3 
 arch/x86/entry/calling.h                        |    7 ++
 arch/x86/include/asm/pti.h                      |   14 ++++
 arch/x86/mm/Makefile                            |    7 +-
 arch/x86/mm/init.c                              |    2 
 arch/x86/mm/pti.c                               |   76 ++++++++++++++++++++++++
 include/linux/pti.h                             |   11 +++
 init/main.c                                     |    3 
 9 files changed, 122 insertions(+), 3 deletions(-)

--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -2685,6 +2685,8 @@
 			steal time is computed, but won't influence scheduler
 			behaviour
 
+	nopti		[X86-64] Disable kernel page table isolation
+
 	nolapic		[X86-32,APIC] Do not enable or use the local APIC.
 
 	nolapic_timer	[X86-32,APIC] Do not use the local APIC timer.
--- a/arch/x86/boot/compressed/pagetable.c
+++ b/arch/x86/boot/compressed/pagetable.c
@@ -23,6 +23,9 @@
  */
 #undef CONFIG_AMD_MEM_ENCRYPT
 
+/* No PAGE_TABLE_ISOLATION support needed either: */
+#undef CONFIG_PAGE_TABLE_ISOLATION
+
 #include "misc.h"
 
 /* These actually do the work of building the kernel identity maps. */
--- a/arch/x86/entry/calling.h
+++ b/arch/x86/entry/calling.h
@@ -205,18 +205,23 @@ For 32-bit we have the following convent
 .endm
 
 .macro SWITCH_TO_KERNEL_CR3 scratch_reg:req
+	ALTERNATIVE "jmp .Lend_\@", "", X86_FEATURE_PTI
 	mov	%cr3, \scratch_reg
 	ADJUST_KERNEL_CR3 \scratch_reg
 	mov	\scratch_reg, %cr3
+.Lend_\@:
 .endm
 
 .macro SWITCH_TO_USER_CR3 scratch_reg:req
+	ALTERNATIVE "jmp .Lend_\@", "", X86_FEATURE_PTI
 	mov	%cr3, \scratch_reg
 	ADJUST_USER_CR3 \scratch_reg
 	mov	\scratch_reg, %cr3
+.Lend_\@:
 .endm
 
 .macro SAVE_AND_SWITCH_TO_KERNEL_CR3 scratch_reg:req save_reg:req
+	ALTERNATIVE "jmp .Ldone_\@", "", X86_FEATURE_PTI
 	movq	%cr3, \scratch_reg
 	movq	\scratch_reg, \save_reg
 	/*
@@ -233,11 +238,13 @@ For 32-bit we have the following convent
 .endm
 
 .macro RESTORE_CR3 save_reg:req
+	ALTERNATIVE "jmp .Lend_\@", "", X86_FEATURE_PTI
 	/*
 	 * The CR3 write could be avoided when not changing its value,
 	 * but would require a CR3 read *and* a scratch register.
 	 */
 	movq	\save_reg, %cr3
+.Lend_\@:
 .endm
 
 #else /* CONFIG_PAGE_TABLE_ISOLATION=n: */
--- /dev/null
+++ b/arch/x86/include/asm/pti.h
@@ -0,0 +1,14 @@
+// SPDX-License-Identifier: GPL-2.0
+#ifndef _ASM_X86_PTI_H
+#define _ASM_X86_PTI_H
+#ifndef __ASSEMBLY__
+
+#ifdef CONFIG_PAGE_TABLE_ISOLATION
+extern void pti_init(void);
+extern void pti_check_boottime_disable(void);
+#else
+static inline void pti_check_boottime_disable(void) { }
+#endif
+
+#endif /* __ASSEMBLY__ */
+#endif /* _ASM_X86_PTI_H */
--- a/arch/x86/mm/Makefile
+++ b/arch/x86/mm/Makefile
@@ -43,9 +43,10 @@ obj-$(CONFIG_AMD_NUMA)		+= amdtopology.o
 obj-$(CONFIG_ACPI_NUMA)		+= srat.o
 obj-$(CONFIG_NUMA_EMU)		+= numa_emulation.o
 
-obj-$(CONFIG_X86_INTEL_MPX)	+= mpx.o
-obj-$(CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS) += pkeys.o
-obj-$(CONFIG_RANDOMIZE_MEMORY) += kaslr.o
+obj-$(CONFIG_X86_INTEL_MPX)			+= mpx.o
+obj-$(CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS)	+= pkeys.o
+obj-$(CONFIG_RANDOMIZE_MEMORY)			+= kaslr.o
+obj-$(CONFIG_PAGE_TABLE_ISOLATION)		+= pti.o
 
 obj-$(CONFIG_AMD_MEM_ENCRYPT)	+= mem_encrypt.o
 obj-$(CONFIG_AMD_MEM_ENCRYPT)	+= mem_encrypt_boot.o
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -20,6 +20,7 @@
 #include <asm/kaslr.h>
 #include <asm/hypervisor.h>
 #include <asm/cpufeature.h>
+#include <asm/pti.h>
 
 /*
  * We need to define the tracepoints somewhere, and tlb.c
@@ -630,6 +631,7 @@ void __init init_mem_mapping(void)
 {
 	unsigned long end;
 
+	pti_check_boottime_disable();
 	probe_page_size_mask();
 	setup_pcid();
 
--- /dev/null
+++ b/arch/x86/mm/pti.c
@@ -0,0 +1,76 @@
+/*
+ * Copyright(c) 2017 Intel Corporation. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * This code is based in part on work published here:
+ *
+ *	https://github.com/IAIK/KAISER
+ *
+ * The original work was written and signed off for the Linux
+ * kernel by:
+ *
+ *   Signed-off-by: Richard Fellner <richard.fellner@student.tugraz.at>
+ *   Signed-off-by: Moritz Lipp <moritz.lipp@iaik.tugraz.at>
+ *   Signed-off-by: Daniel Gruss <daniel.gruss@iaik.tugraz.at>
+ *   Signed-off-by: Michael Schwarz <michael.schwarz@iaik.tugraz.at>
+ *
+ * Major changes to the original code by: Dave Hansen <dave.hansen@intel.com>
+ * Mostly rewritten by Thomas Gleixner <tglx@linutronix.de> and
+ *		       Andy Lutomirsky <luto@amacapital.net>
+ */
+#include <linux/kernel.h>
+#include <linux/errno.h>
+#include <linux/string.h>
+#include <linux/types.h>
+#include <linux/bug.h>
+#include <linux/init.h>
+#include <linux/spinlock.h>
+#include <linux/mm.h>
+#include <linux/uaccess.h>
+
+#include <asm/cpufeature.h>
+#include <asm/hypervisor.h>
+#include <asm/cmdline.h>
+#include <asm/pti.h>
+#include <asm/pgtable.h>
+#include <asm/pgalloc.h>
+#include <asm/tlbflush.h>
+#include <asm/desc.h>
+
+#undef pr_fmt
+#define pr_fmt(fmt)     "Kernel/User page tables isolation: " fmt
+
+void __init pti_check_boottime_disable(void)
+{
+	bool enable = true;
+
+	if (cmdline_find_option_bool(boot_command_line, "nopti")) {
+		pr_info("disabled on command line.\n");
+		enable = false;
+	}
+	if (hypervisor_is_type(X86_HYPER_XEN_PV)) {
+		pr_info("disabled on XEN_PV.\n");
+		enable = false;
+	}
+	if (enable)
+		setup_force_cpu_cap(X86_FEATURE_PTI);
+}
+
+/*
+ * Initialize kernel page table isolation
+ */
+void __init pti_init(void)
+{
+	if (!static_cpu_has(X86_FEATURE_PTI))
+		return;
+
+	pr_info("enabled\n");
+}
--- /dev/null
+++ b/include/linux/pti.h
@@ -0,0 +1,11 @@
+// SPDX-License-Identifier: GPL-2.0
+#ifndef _INCLUDE_PTI_H
+#define _INCLUDE_PTI_H
+
+#ifdef CONFIG_PAGE_TABLE_ISOLATION
+#include <asm/pti.h>
+#else
+static inline void pti_init(void) { }
+#endif
+
+#endif
--- a/init/main.c
+++ b/init/main.c
@@ -75,6 +75,7 @@
 #include <linux/slab.h>
 #include <linux/perf_event.h>
 #include <linux/ptrace.h>
+#include <linux/pti.h>
 #include <linux/blkdev.h>
 #include <linux/elevator.h>
 #include <linux/sched_clock.h>
@@ -506,6 +507,8 @@ static void __init mm_init(void)
 	ioremap_huge_init();
 	/* Should be run before the first non-init thread is created */
 	init_espfix_bsp();
+	/* Should be run after espfix64 is set up. */
+	pti_init();
 }
 
 asmlinkage __visible void __init start_kernel(void)

^ permalink raw reply	[flat|nested] 85+ messages in thread

* [patch V181 28/54] x86/mm/pti: Add mapping helper functions
  2017-12-20 21:35 [patch V181 00/54] x86/pti: Final XMAS release Thomas Gleixner
                   ` (26 preceding siblings ...)
  2017-12-20 21:35 ` [patch V181 27/54] x86/mm/pti: Add infrastructure for page table isolation Thomas Gleixner
@ 2017-12-20 21:35 ` Thomas Gleixner
  2017-12-20 21:35 ` [patch V181 29/54] x86/mm/pti: Allow NX poison to be set in p4d/pgd Thomas Gleixner
                   ` (28 subsequent siblings)
  56 siblings, 0 replies; 85+ messages in thread
From: Thomas Gleixner @ 2017-12-20 21:35 UTC (permalink / raw)
  To: LKML
  Cc: x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, keescook, hughd,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Rik van Riel,
	Boris Ostrovsky, Juergen Gross, David Laight, Eduardo Valentin,
	aliguori, Will Deacon, Vlastimil Babka, daniel.gruss,
	Dave Hansen, Ingo Molnar, Borislav Petkov, H. Peter Anvin

[-- Attachment #1: 0031-x86-mm-pti-Add-mapping-helper-functions.patch --]
[-- Type: text/plain, Size: 6523 bytes --]

From: Dave Hansen <dave.hansen@linux.intel.com>

Add the pagetable helper functions to manage the separate user space page
tables.

[ tglx: Split out from the big combo kaiser patch. Folded Andys
  	simplification and made it out of line as Boris suggested ]
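
A minimal usage sketch of these helpers (illustrative only; it relies on
the 8k-aligned PGDs introduced later in this series):

	pgd_t *kpgd = pgd_offset(mm, addr);		/* kernel half	*/
	pgd_t *upgd = kernel_to_user_pgdp(kpgd);	/* flips bit 12	*/

	/* The conversion must round-trip: */
	WARN_ON(user_to_kernel_pgdp(upgd) != kpgd);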

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Laight <David.Laight@aculab.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Eduardo Valentin <eduval@amazon.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: aliguori@amazon.com
Cc: daniel.gruss@iaik.tugraz.at
Cc: hughd@google.com
Cc: keescook@google.com
---
 arch/x86/include/asm/pgtable.h    |    6 ++
 arch/x86/include/asm/pgtable_64.h |   92 ++++++++++++++++++++++++++++++++++++++
 arch/x86/mm/pti.c                 |   41 ++++++++++++++++
 3 files changed, 138 insertions(+), 1 deletion(-)

--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -909,7 +909,11 @@ static inline int pgd_none(pgd_t pgd)
  * pgd_offset() returns a (pgd_t *)
  * pgd_index() is used get the offset into the pgd page's array of pgd_t's;
  */
-#define pgd_offset(mm, address) ((mm)->pgd + pgd_index((address)))
+#define pgd_offset_pgd(pgd, address) (pgd + pgd_index((address)))
+/*
+ * a shortcut to get a pgd_t in a given mm
+ */
+#define pgd_offset(mm, address) pgd_offset_pgd((mm)->pgd, (address))
 /*
  * a shortcut which implies the use of the kernel's pgd, instead
  * of a process's
--- a/arch/x86/include/asm/pgtable_64.h
+++ b/arch/x86/include/asm/pgtable_64.h
@@ -131,9 +131,97 @@ static inline pud_t native_pudp_get_and_
 #endif
 }
 
+#ifdef CONFIG_PAGE_TABLE_ISOLATION
+/*
+ * All top-level PAGE_TABLE_ISOLATION page tables are order-1 pages
+ * (8k-aligned and 8k in size).  The kernel one is in the first 4k and
+ * the user one is in the last 4k.  To switch between them, you
+ * just need to flip bit 12 in their addresses.
+ */
+#define PTI_PGTABLE_SWITCH_BIT	PAGE_SHIFT
+
+/*
+ * This generates better code than the inline assembly in
+ * __set_bit().
+ */
+static inline void *ptr_set_bit(void *ptr, int bit)
+{
+	unsigned long __ptr = (unsigned long)ptr;
+
+	__ptr |= BIT(bit);
+	return (void *)__ptr;
+}
+static inline void *ptr_clear_bit(void *ptr, int bit)
+{
+	unsigned long __ptr = (unsigned long)ptr;
+
+	__ptr &= ~BIT(bit);
+	return (void *)__ptr;
+}
+
+static inline pgd_t *kernel_to_user_pgdp(pgd_t *pgdp)
+{
+	return ptr_set_bit(pgdp, PTI_PGTABLE_SWITCH_BIT);
+}
+
+static inline pgd_t *user_to_kernel_pgdp(pgd_t *pgdp)
+{
+	return ptr_clear_bit(pgdp, PTI_PGTABLE_SWITCH_BIT);
+}
+
+static inline p4d_t *kernel_to_user_p4dp(p4d_t *p4dp)
+{
+	return ptr_set_bit(p4dp, PTI_PGTABLE_SWITCH_BIT);
+}
+
+static inline p4d_t *user_to_kernel_p4dp(p4d_t *p4dp)
+{
+	return ptr_clear_bit(p4dp, PTI_PGTABLE_SWITCH_BIT);
+}
+#endif /* CONFIG_PAGE_TABLE_ISOLATION */
+
+/*
+ * Page table pages are page-aligned.  The lower half of the top
+ * level is used for userspace and the top half for the kernel.
+ *
+ * Returns true for parts of the PGD that map userspace and
+ * false for the parts that map the kernel.
+ */
+static inline bool pgdp_maps_userspace(void *__ptr)
+{
+	unsigned long ptr = (unsigned long)__ptr;
+
+	return (ptr & ~PAGE_MASK) < (PAGE_SIZE / 2);
+}
+
+#ifdef CONFIG_PAGE_TABLE_ISOLATION
+pgd_t __pti_set_user_pgd(pgd_t *pgdp, pgd_t pgd);
+
+/*
+ * Take a PGD location (pgdp) and a pgd value that needs to be set there.
+ * Populates the user and returns the resulting PGD that must be set in
+ * the kernel copy of the page tables.
+ */
+static inline pgd_t pti_set_user_pgd(pgd_t *pgdp, pgd_t pgd)
+{
+	if (!static_cpu_has(X86_FEATURE_PTI))
+		return pgd;
+	return __pti_set_user_pgd(pgdp, pgd);
+}
+#else
+static inline pgd_t pti_set_user_pgd(pgd_t *pgdp, pgd_t pgd)
+{
+	return pgd;
+}
+#endif
+
 static inline void native_set_p4d(p4d_t *p4dp, p4d_t p4d)
 {
+#if defined(CONFIG_PAGE_TABLE_ISOLATION) && !defined(CONFIG_X86_5LEVEL)
+	p4dp->pgd = pti_set_user_pgd(&p4dp->pgd, p4d.pgd);
+#else
 	*p4dp = p4d;
+#endif
 }
 
 static inline void native_p4d_clear(p4d_t *p4d)
@@ -147,7 +235,11 @@ static inline void native_p4d_clear(p4d_
 
 static inline void native_set_pgd(pgd_t *pgdp, pgd_t pgd)
 {
+#ifdef CONFIG_PAGE_TABLE_ISOLATION
+	*pgdp = pti_set_user_pgd(pgdp, pgd);
+#else
 	*pgdp = pgd;
+#endif
 }
 
 static inline void native_pgd_clear(pgd_t *pgd)
--- a/arch/x86/mm/pti.c
+++ b/arch/x86/mm/pti.c
@@ -64,6 +64,47 @@ void __init pti_check_boottime_disable(v
 		setup_force_cpu_cap(X86_FEATURE_PTI);
 }
 
+pgd_t __pti_set_user_pgd(pgd_t *pgdp, pgd_t pgd)
+{
+	/*
+	 * Changes to the high (kernel) portion of the kernelmode page
+	 * tables are not automatically propagated to the usermode tables.
+	 *
+	 * Users should keep in mind that, unlike the kernelmode tables,
+	 * there is no vmalloc_fault equivalent for the usermode tables.
+	 * Top-level entries added to init_mm's usermode pgd after boot
+	 * will not be automatically propagated to other mms.
+	 */
+	if (!pgdp_maps_userspace(pgdp))
+		return pgd;
+
+	/*
+	 * The user page tables get the full PGD, accessible from
+	 * userspace:
+	 */
+	kernel_to_user_pgdp(pgdp)->pgd = pgd.pgd;
+
+	/*
+	 * If this is normal user memory, make it NX in the kernel
+	 * pagetables so that, if we somehow screw up and return to
+	 * usermode with the kernel CR3 loaded, we'll get a page fault
+	 * instead of allowing user code to execute with the wrong CR3.
+	 *
+	 * As exceptions, we don't set NX if:
+	 *  - _PAGE_USER is not set.  This could be an executable
+	 *     EFI runtime mapping or something similar, and the kernel
+	 *     may execute from it
+	 *  - we don't have NX support
+	 *  - we're clearing the PGD (i.e. the new pgd is not present).
+	 */
+	if ((pgd.pgd & (_PAGE_USER|_PAGE_PRESENT)) == (_PAGE_USER|_PAGE_PRESENT) &&
+	    (__supported_pte_mask & _PAGE_NX))
+		pgd.pgd |= _PAGE_NX;
+
+	/* return the copy of the PGD we want the kernel to use: */
+	return pgd;
+}
+
 /*
  * Initialize kernel page table isolation
  */

^ permalink raw reply	[flat|nested] 85+ messages in thread

* [patch V181 29/54] x86/mm/pti: Allow NX poison to be set in p4d/pgd
  2017-12-20 21:35 [patch V181 00/54] x86/pti: Final XMAS release Thomas Gleixner
                   ` (27 preceding siblings ...)
  2017-12-20 21:35 ` [patch V181 28/54] x86/mm/pti: Add mapping helper functions Thomas Gleixner
@ 2017-12-20 21:35 ` Thomas Gleixner
  2017-12-20 21:35 ` [patch V181 30/54] x86/mm/pti: Allocate a separate user PGD Thomas Gleixner
                   ` (27 subsequent siblings)
  56 siblings, 0 replies; 85+ messages in thread
From: Thomas Gleixner @ 2017-12-20 21:35 UTC (permalink / raw)
  To: LKML
  Cc: x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, keescook, hughd,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Rik van Riel,
	Boris Ostrovsky, Juergen Gross, David Laight, Eduardo Valentin,
	aliguori, Will Deacon, Vlastimil Babka, daniel.gruss,
	Dave Hansen, Ingo Molnar, Borislav Petkov, H. Peter Anvin

[-- Attachment #1: 0032-x86-mm-pti-Allow-NX-poison-to-be-set-in-p4d-pgd.patch --]
[-- Type: text/plain, Size: 2121 bytes --]

From: Dave Hansen <dave.hansen@linux.intel.com>

With PAGE_TABLE_ISOLATION the user portion of the kernel page tables is
poisoned with the NX bit so if the entry code exits with the kernel page
tables selected in CR3, userspace crashes.

But doing so trips the p4d/pgd_bad() checks.  Make sure it does not do
that.
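
To illustrate (a sketch, not part of the patch): after poisoning, a
kernel-side entry for a user mapping carries flags like

	_KERNPG_TABLE | _PAGE_USER | _PAGE_NX

The old checks masked out only _KERNPG_TABLE and _PAGE_USER, so the
surviving _PAGE_NX made the entry look bad. Adding _PAGE_NX to the
ignored flags when PTI is compiled in lets the poisoned entries pass.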

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: David Laight <David.Laight@aculab.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Eduardo Valentin <eduval@amazon.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: aliguori@amazon.com
Cc: daniel.gruss@iaik.tugraz.at
Cc: hughd@google.com
Cc: keescook@google.com
---
 arch/x86/include/asm/pgtable.h |   14 ++++++++++++--
 1 file changed, 12 insertions(+), 2 deletions(-)

--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -846,7 +846,12 @@ static inline pud_t *pud_offset(p4d_t *p
 
 static inline int p4d_bad(p4d_t p4d)
 {
-	return (p4d_flags(p4d) & ~(_KERNPG_TABLE | _PAGE_USER)) != 0;
+	unsigned long ignore_flags = _KERNPG_TABLE | _PAGE_USER;
+
+	if (IS_ENABLED(CONFIG_PAGE_TABLE_ISOLATION))
+		ignore_flags |= _PAGE_NX;
+
+	return (p4d_flags(p4d) & ~ignore_flags) != 0;
 }
 #endif  /* CONFIG_PGTABLE_LEVELS > 3 */
 
@@ -880,7 +885,12 @@ static inline p4d_t *p4d_offset(pgd_t *p
 
 static inline int pgd_bad(pgd_t pgd)
 {
-	return (pgd_flags(pgd) & ~_PAGE_USER) != _KERNPG_TABLE;
+	unsigned long ignore_flags = _PAGE_USER;
+
+	if (IS_ENABLED(CONFIG_PAGE_TABLE_ISOLATION))
+		ignore_flags |= _PAGE_NX;
+
+	return (pgd_flags(pgd) & ~ignore_flags) != _KERNPG_TABLE;
 }
 
 static inline int pgd_none(pgd_t pgd)

^ permalink raw reply	[flat|nested] 85+ messages in thread

* [patch V181 30/54] x86/mm/pti: Allocate a separate user PGD
  2017-12-20 21:35 [patch V181 00/54] x86/pti: Final XMAS release Thomas Gleixner
                   ` (28 preceding siblings ...)
  2017-12-20 21:35 ` [patch V181 29/54] x86/mm/pti: Allow NX poison to be set in p4d/pgd Thomas Gleixner
@ 2017-12-20 21:35 ` Thomas Gleixner
  2017-12-20 21:35 ` [patch V181 31/54] x86/mm/pti: Populate " Thomas Gleixner
                   ` (26 subsequent siblings)
  56 siblings, 0 replies; 85+ messages in thread
From: Thomas Gleixner @ 2017-12-20 21:35 UTC (permalink / raw)
  To: LKML
  Cc: x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, keescook, hughd,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Rik van Riel,
	Boris Ostrovsky, Juergen Gross, David Laight, Eduardo Valentin,
	aliguori, Will Deacon, Vlastimil Babka, daniel.gruss,
	Dave Hansen, Ingo Molnar, Borislav Petkov, H. Peter Anvin

[-- Attachment #1: 0033-x86-mm-pti-Allocate-a-separate-user-PGD.patch --]
[-- Type: text/plain, Size: 4470 bytes --]

From: Dave Hansen <dave.hansen@linux.intel.com>

Kernel page table isolation requires two PGDs: one for the kernel, which
contains the full kernel mapping plus the user space mapping, and one for
user space, which contains the user space mappings plus the minimal set of
kernel mappings the architecture requires to transition to and from user
space.

Add the necessary preliminaries.
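
The resulting layout, sketched (assuming the order-1 allocation added
below):

	pgd               pgd + PAGE_SIZE
	+----------------+----------------+
	|   kernel PGD   |    user PGD    |
	+----------------+----------------+

	8k-aligned, so the two halves differ only in bit 12 of the address.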

[ tglx: Split out from the big kaiser dump ]

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: David Laight <David.Laight@aculab.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Eduardo Valentin <eduval@amazon.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: aliguori@amazon.com
Cc: daniel.gruss@iaik.tugraz.at
Cc: hughd@google.com
Cc: keescook@google.com
---
 arch/x86/kernel/head_64.S |   30 +++++++++++++++++++++++++++---
 arch/x86/mm/pgtable.c     |   16 ++++++++++++++--
 2 files changed, 41 insertions(+), 5 deletions(-)

--- a/arch/x86/kernel/head_64.S
+++ b/arch/x86/kernel/head_64.S
@@ -341,6 +341,27 @@ GLOBAL(early_recursion_flag)
 	.balign	PAGE_SIZE; \
 GLOBAL(name)
 
+#ifdef CONFIG_PAGE_TABLE_ISOLATION
+/*
+ * Each PGD needs to be 8k long and 8k aligned.  We do not
+ * ever go out to userspace with these, so we do not
+ * strictly *need* the second page, but this allows us to
+ * have a single set_pgd() implementation that does not
+ * need to worry about whether it has 4k or 8k to work
+ * with.
+ *
+ * This ensures PGDs are 8k long:
+ */
+#define PTI_USER_PGD_FILL	512
+/* This ensures they are 8k-aligned: */
+#define NEXT_PGD_PAGE(name) \
+	.balign 2 * PAGE_SIZE; \
+GLOBAL(name)
+#else
+#define NEXT_PGD_PAGE(name) NEXT_PAGE(name)
+#define PTI_USER_PGD_FILL	0
+#endif
+
 /* Automate the creation of 1 to 1 mapping pmd entries */
 #define PMDS(START, PERM, COUNT)			\
 	i = 0 ;						\
@@ -350,13 +371,14 @@ GLOBAL(name)
 	.endr
 
 	__INITDATA
-NEXT_PAGE(early_top_pgt)
+NEXT_PGD_PAGE(early_top_pgt)
 	.fill	511,8,0
 #ifdef CONFIG_X86_5LEVEL
 	.quad	level4_kernel_pgt - __START_KERNEL_map + _PAGE_TABLE_NOENC
 #else
 	.quad	level3_kernel_pgt - __START_KERNEL_map + _PAGE_TABLE_NOENC
 #endif
+	.fill	PTI_USER_PGD_FILL,8,0
 
 NEXT_PAGE(early_dynamic_pgts)
 	.fill	512*EARLY_DYNAMIC_PAGE_TABLES,8,0
@@ -364,13 +386,14 @@ NEXT_PAGE(early_dynamic_pgts)
 	.data
 
 #if defined(CONFIG_XEN_PV) || defined(CONFIG_XEN_PVH)
-NEXT_PAGE(init_top_pgt)
+NEXT_PGD_PAGE(init_top_pgt)
 	.quad   level3_ident_pgt - __START_KERNEL_map + _KERNPG_TABLE_NOENC
 	.org    init_top_pgt + PGD_PAGE_OFFSET*8, 0
 	.quad   level3_ident_pgt - __START_KERNEL_map + _KERNPG_TABLE_NOENC
 	.org    init_top_pgt + PGD_START_KERNEL*8, 0
 	/* (2^48-(2*1024*1024*1024))/(2^39) = 511 */
 	.quad   level3_kernel_pgt - __START_KERNEL_map + _PAGE_TABLE_NOENC
+	.fill	PTI_USER_PGD_FILL,8,0
 
 NEXT_PAGE(level3_ident_pgt)
 	.quad	level2_ident_pgt - __START_KERNEL_map + _KERNPG_TABLE_NOENC
@@ -381,8 +404,9 @@ NEXT_PAGE(level2_ident_pgt)
 	 */
 	PMDS(0, __PAGE_KERNEL_IDENT_LARGE_EXEC, PTRS_PER_PMD)
 #else
-NEXT_PAGE(init_top_pgt)
+NEXT_PGD_PAGE(init_top_pgt)
 	.fill	512,8,0
+	.fill	PTI_USER_PGD_FILL,8,0
 #endif
 
 #ifdef CONFIG_X86_5LEVEL
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -355,14 +355,26 @@ static inline void _pgd_free(pgd_t *pgd)
 		kmem_cache_free(pgd_cache, pgd);
 }
 #else
+
+#ifdef CONFIG_PAGE_TABLE_ISOLATION
+/*
+ * Instead of one PGD, we acquire two PGDs.  Being order-1, it is
+ * both 8k in size and 8k-aligned.  That lets us just flip bit 12
+ * in a pointer to swap between the two 4k halves.
+ */
+#define PGD_ALLOCATION_ORDER 1
+#else
+#define PGD_ALLOCATION_ORDER 0
+#endif
+
 static inline pgd_t *_pgd_alloc(void)
 {
-	return (pgd_t *)__get_free_page(PGALLOC_GFP);
+	return (pgd_t *)__get_free_pages(PGALLOC_GFP, PGD_ALLOCATION_ORDER);
 }
 
 static inline void _pgd_free(pgd_t *pgd)
 {
-	free_page((unsigned long)pgd);
+	free_pages((unsigned long)pgd, PGD_ALLOCATION_ORDER);
 }
 #endif /* CONFIG_X86_PAE */
 

^ permalink raw reply	[flat|nested] 85+ messages in thread

* [patch V181 31/54] x86/mm/pti: Populate user PGD
  2017-12-20 21:35 [patch V181 00/54] x86/pti: Final XMAS release Thomas Gleixner
                   ` (29 preceding siblings ...)
  2017-12-20 21:35 ` [patch V181 30/54] x86/mm/pti: Allocate a separate user PGD Thomas Gleixner
@ 2017-12-20 21:35 ` Thomas Gleixner
  2017-12-20 21:35 ` [patch V181 32/54] x86/mm/pti: Add functions to clone kernel PMDs Thomas Gleixner
                   ` (25 subsequent siblings)
  56 siblings, 0 replies; 85+ messages in thread
From: Thomas Gleixner @ 2017-12-20 21:35 UTC (permalink / raw)
  To: LKML
  Cc: x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, keescook, hughd,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Rik van Riel,
	Boris Ostrovsky, Juergen Gross, David Laight, Eduardo Valentin,
	aliguori, Will Deacon, Vlastimil Babka, daniel.gruss,
	Dave Hansen, Ingo Molnar, Borislav Petkov, H. Peter Anvin

[-- Attachment #1: 0034-x86-mm-pti-Populate-user-PGD.patch --]
[-- Type: text/plain, Size: 1797 bytes --]

From: Dave Hansen <dave.hansen@linux.intel.com>

In clone_pgd_range() copy the init user PGDs which cover the kernel half of
the address space, so a process has all the required kernel mappings
visible.
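
For reference, the pgd_ctor() call site which now implicitly clones both
halves looks roughly like this (sketch of existing code, not part of the
diff below):

	clone_pgd_range(pgd + KERNEL_PGD_BOUNDARY,
			swapper_pg_dir + KERNEL_PGD_BOUNDARY,
			KERNEL_PGD_PTRS);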

[ tglx: Split out from the big kaiser dump and folded Andys simplification ]

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: David Laight <David.Laight@aculab.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Eduardo Valentin <eduval@amazon.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: aliguori@amazon.com
Cc: daniel.gruss@iaik.tugraz.at
Cc: hughd@google.com
Cc: keescook@google.com
---
 arch/x86/include/asm/pgtable.h |    9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -1119,7 +1119,14 @@ static inline void pmdp_set_wrprotect(st
  */
 static inline void clone_pgd_range(pgd_t *dst, pgd_t *src, int count)
 {
-       memcpy(dst, src, count * sizeof(pgd_t));
+	memcpy(dst, src, count * sizeof(pgd_t));
+#ifdef CONFIG_PAGE_TABLE_ISOLATION
+	if (!static_cpu_has(X86_FEATURE_PTI))
+		return;
+	/* Clone the user space pgd as well */
+	memcpy(kernel_to_user_pgdp(dst), kernel_to_user_pgdp(src),
+	       count * sizeof(pgd_t));
+#endif
 }
 
 #define PTE_SHIFT ilog2(PTRS_PER_PTE)

^ permalink raw reply	[flat|nested] 85+ messages in thread

* [patch V181 32/54] x86/mm/pti: Add functions to clone kernel PMDs
  2017-12-20 21:35 [patch V181 00/54] x86/pti: Final XMAS release Thomas Gleixner
                   ` (30 preceding siblings ...)
  2017-12-20 21:35 ` [patch V181 31/54] x86/mm/pti: Populate " Thomas Gleixner
@ 2017-12-20 21:35 ` Thomas Gleixner
  2017-12-20 21:35 ` [patch V181 33/54] x86/mm/pti: Force entry through trampoline when PTI active Thomas Gleixner
                   ` (24 subsequent siblings)
  56 siblings, 0 replies; 85+ messages in thread
From: Thomas Gleixner @ 2017-12-20 21:35 UTC (permalink / raw)
  To: LKML
  Cc: x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, keescook, hughd,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Rik van Riel,
	Boris Ostrovsky, Juergen Gross, David Laight, Eduardo Valentin,
	aliguori, Will Deacon, Vlastimil Babka, daniel.gruss,
	Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin

[-- Attachment #1: 0036-x86-mm-pti-Add-functions-to-clone-kernel-PMDs.patch --]
[-- Type: text/plain, Size: 4992 bytes --]

From: Andy Lutomirski <luto@kernel.org>

Provide infrastructure to:

 - find a kernel PMD for a mapping which must be visible to user space for
   the entry/exit code to work.

 - walk an address range and share the kernel PMD with it.

This reuses a small part of the original KAISER patches to populate the
user space page table.

[ tglx: Made it universally usable so it can be used for any kind of shared
	mapping. Add a mechanism to clear specific bits in the user space
	visible PMD entry. Folded Andys simplifications ]
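
A later patch in this series uses the helper to share the entry text,
e.g.:

	pti_clone_pmds((unsigned long) __entry_text_start,
			(unsigned long) __irqentry_text_end, _PAGE_RW);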

Originally-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Borislav Petkov <bpetkov@suse.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Laight <David.Laight@aculab.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Eduardo Valentin <eduval@amazon.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: aliguori@amazon.com
Cc: daniel.gruss@iaik.tugraz.at
Cc: hughd@google.com
Cc: keescook@google.com
---
 arch/x86/mm/pti.c |  127 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 127 insertions(+)

--- a/arch/x86/mm/pti.c
+++ b/arch/x86/mm/pti.c
@@ -48,6 +48,11 @@
 #undef pr_fmt
 #define pr_fmt(fmt)     "Kernel/User page tables isolation: " fmt
 
+/* Backporting helper */
+#ifndef __GFP_NOTRACK
+#define __GFP_NOTRACK	0
+#endif
+
 void __init pti_check_boottime_disable(void)
 {
 	bool enable = true;
@@ -106,6 +111,128 @@ pgd_t __pti_set_user_pgd(pgd_t *pgdp, pg
 }
 
 /*
+ * Walk the user copy of the page tables (optionally) trying to allocate
+ * page table pages on the way down.
+ *
+ * Returns a pointer to a P4D on success, or NULL on failure.
+ */
+static p4d_t *pti_user_pagetable_walk_p4d(unsigned long address)
+{
+	pgd_t *pgd = kernel_to_user_pgdp(pgd_offset_k(address));
+	gfp_t gfp = (GFP_KERNEL | __GFP_NOTRACK | __GFP_ZERO);
+
+	if (address < PAGE_OFFSET) {
+		WARN_ONCE(1, "attempt to walk user address\n");
+		return NULL;
+	}
+
+	if (pgd_none(*pgd)) {
+		unsigned long new_p4d_page = __get_free_page(gfp);
+		if (!new_p4d_page)
+			return NULL;
+
+		if (pgd_none(*pgd)) {
+			set_pgd(pgd, __pgd(_KERNPG_TABLE | __pa(new_p4d_page)));
+			new_p4d_page = 0;
+		}
+		if (new_p4d_page)
+			free_page(new_p4d_page);
+	}
+	BUILD_BUG_ON(pgd_large(*pgd) != 0);
+
+	return p4d_offset(pgd, address);
+}
+
+/*
+ * Walk the user copy of the page tables (optionally) trying to allocate
+ * page table pages on the way down.
+ *
+ * Returns a pointer to a PMD on success, or NULL on failure.
+ */
+static pmd_t *pti_user_pagetable_walk_pmd(unsigned long address)
+{
+	gfp_t gfp = (GFP_KERNEL | __GFP_NOTRACK | __GFP_ZERO);
+	p4d_t *p4d = pti_user_pagetable_walk_p4d(address);
+	pud_t *pud;
+
+	BUILD_BUG_ON(p4d_large(*p4d) != 0);
+	if (p4d_none(*p4d)) {
+		unsigned long new_pud_page = __get_free_page(gfp);
+		if (!new_pud_page)
+			return NULL;
+
+		if (p4d_none(*p4d)) {
+			set_p4d(p4d, __p4d(_KERNPG_TABLE | __pa(new_pud_page)));
+			new_pud_page = 0;
+		}
+		if (new_pud_page)
+			free_page(new_pud_page);
+	}
+
+	pud = pud_offset(p4d, address);
+	/* The user page tables do not use large mappings: */
+	if (pud_large(*pud)) {
+		WARN_ON(1);
+		return NULL;
+	}
+	if (pud_none(*pud)) {
+		unsigned long new_pmd_page = __get_free_page(gfp);
+		if (!new_pmd_page)
+			return NULL;
+
+		if (pud_none(*pud)) {
+			set_pud(pud, __pud(_KERNPG_TABLE | __pa(new_pmd_page)));
+			new_pmd_page = 0;
+		}
+		if (new_pmd_page)
+			free_page(new_pmd_page);
+	}
+
+	return pmd_offset(pud, address);
+}
+
+static void __init
+pti_clone_pmds(unsigned long start, unsigned long end, pmdval_t clear)
+{
+	unsigned long addr;
+
+	/*
+	 * Clone the populated PMDs which cover start to end. These PMD areas
+	 * can have holes.
+	 */
+	for (addr = start; addr < end; addr += PMD_SIZE) {
+		pmd_t *pmd, *target_pmd;
+		pgd_t *pgd;
+		p4d_t *p4d;
+		pud_t *pud;
+
+		pgd = pgd_offset_k(addr);
+		if (WARN_ON(pgd_none(*pgd)))
+			return;
+		p4d = p4d_offset(pgd, addr);
+		if (WARN_ON(p4d_none(*p4d)))
+			return;
+		pud = pud_offset(p4d, addr);
+		if (pud_none(*pud))
+			continue;
+		pmd = pmd_offset(pud, addr);
+		if (pmd_none(*pmd))
+			continue;
+
+		target_pmd = pti_user_pagetable_walk_pmd(addr);
+		if (WARN_ON(!target_pmd))
+			return;
+
+		/*
+		 * Copy the PMD.  That is, the kernelmode and usermode
+		 * tables will share the last-level page tables of this
+		 * address range
+		 */
+		*target_pmd = pmd_clear_flags(*pmd, clear);
+	}
+}
+
+/*
  * Initialize kernel page table isolation
  */
 void __init pti_init(void)

^ permalink raw reply	[flat|nested] 85+ messages in thread

* [patch V181 33/54] x86/mm/pti: Force entry through trampoline when PTI active
  2017-12-20 21:35 [patch V181 00/54] x86/pti: Final XMAS release Thomas Gleixner
                   ` (31 preceding siblings ...)
  2017-12-20 21:35 ` [patch V181 32/54] x86/mm/pti: Add functions to clone kernel PMDs Thomas Gleixner
@ 2017-12-20 21:35 ` Thomas Gleixner
  2017-12-20 21:35 ` [patch V181 34/54] x86/mm/pti: Share cpu_entry_area with user space page tables Thomas Gleixner
                   ` (23 subsequent siblings)
  56 siblings, 0 replies; 85+ messages in thread
From: Thomas Gleixner @ 2017-12-20 21:35 UTC (permalink / raw)
  To: LKML
  Cc: x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, keescook, hughd,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Rik van Riel,
	Boris Ostrovsky, Juergen Gross, David Laight, Eduardo Valentin,
	aliguori, Will Deacon, Vlastimil Babka, daniel.gruss,
	Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin

[-- Attachment #1: 0037-x86-mm-pti-Force-entry-through-trampoline-when-PTI-a.patch --]
[-- Type: text/plain, Size: 1588 bytes --]

From: Thomas Gleixner <tglx@linutronix.de>

Force the entry through the trampoline only when PTI is active. Otherwise
go through the normal entry code.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Borislav Petkov <bp@suse.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Laight <David.Laight@aculab.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Eduardo Valentin <eduval@amazon.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: aliguori@amazon.com
Cc: daniel.gruss@iaik.tugraz.at
Cc: hughd@google.com
Cc: keescook@google.com
---
 arch/x86/kernel/cpu/common.c |    5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -1339,7 +1339,10 @@ void syscall_init(void)
 		(entry_SYSCALL_64_trampoline - _entry_trampoline);
 
 	wrmsr(MSR_STAR, 0, (__USER32_CS << 16) | __KERNEL_CS);
-	wrmsrl(MSR_LSTAR, SYSCALL64_entry_trampoline);
+	if (static_cpu_has(X86_FEATURE_PTI))
+		wrmsrl(MSR_LSTAR, SYSCALL64_entry_trampoline);
+	else
+		wrmsrl(MSR_LSTAR, (unsigned long)entry_SYSCALL_64);
 
 #ifdef CONFIG_IA32_EMULATION
 	wrmsrl(MSR_CSTAR, (unsigned long)entry_SYSCALL_compat);

^ permalink raw reply	[flat|nested] 85+ messages in thread

* [patch V181 34/54] x86/mm/pti: Share cpu_entry_area with user space page tables
  2017-12-20 21:35 [patch V181 00/54] x86/pti: Final XMAS release Thomas Gleixner
                   ` (32 preceding siblings ...)
  2017-12-20 21:35 ` [patch V181 33/54] x86/mm/pti: Force entry through trampoline when PTI active Thomas Gleixner
@ 2017-12-20 21:35 ` Thomas Gleixner
  2017-12-20 21:35 ` [patch V181 35/54] x86/entry: Align entry text section to PMD boundary Thomas Gleixner
                   ` (22 subsequent siblings)
  56 siblings, 0 replies; 85+ messages in thread
From: Thomas Gleixner @ 2017-12-20 21:35 UTC (permalink / raw)
  To: LKML
  Cc: x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, keescook, hughd,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Rik van Riel,
	Boris Ostrovsky, Juergen Gross, David Laight, Eduardo Valentin,
	aliguori, Will Deacon, Vlastimil Babka, daniel.gruss,
	Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin

[-- Attachment #1: 0039-x86-mm-pti-Share-cpu_entry_area-PMDs.patch --]
[-- Type: text/plain, Size: 1929 bytes --]

From: Andy Lutomirski <luto@kernel.org>

Share the cpu entry area so the user space and kernel space page tables
have the same P4D page.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Laight <David.Laight@aculab.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Eduardo Valentin <eduval@amazon.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: aliguori@amazon.com
Cc: daniel.gruss@iaik.tugraz.at
Cc: hughd@google.com
Cc: keescook@google.com
---
 arch/x86/mm/pti.c |   25 +++++++++++++++++++++++++
 1 file changed, 25 insertions(+)

--- a/arch/x86/mm/pti.c
+++ b/arch/x86/mm/pti.c
@@ -233,6 +233,29 @@ pti_clone_pmds(unsigned long start, unsi
 }
 
 /*
+ * Clone a single p4d (i.e. a top-level entry on 4-level systems and a
+ * next-level entry on 5-level systems).
+ */
+static void __init pti_clone_p4d(unsigned long addr)
+{
+	p4d_t *kernel_p4d, *user_p4d;
+	pgd_t *kernel_pgd;
+
+	user_p4d = pti_user_pagetable_walk_p4d(addr);
+	kernel_pgd = pgd_offset_k(addr);
+	kernel_p4d = p4d_offset(kernel_pgd, addr);
+	*user_p4d = *kernel_p4d;
+}
+
+/*
+ * Clone the CPU_ENTRY_AREA into the user space visible page table.
+ */
+static void __init pti_clone_user_shared(void)
+{
+	pti_clone_p4d(CPU_ENTRY_AREA_BASE);
+}
+
+/*
  * Initialize kernel page table isolation
  */
 void __init pti_init(void)
@@ -241,4 +264,6 @@ void __init pti_init(void)
 		return;
 
 	pr_info("enabled\n");
+
+	pti_clone_user_shared();
 }

^ permalink raw reply	[flat|nested] 85+ messages in thread

* [patch V181 35/54] x86/entry: Align entry text section to PMD boundary
  2017-12-20 21:35 [patch V181 00/54] x86/pti: Final XMAS release Thomas Gleixner
                   ` (33 preceding siblings ...)
  2017-12-20 21:35 ` [patch V181 34/54] x86/mm/pti: Share cpu_entry_area with user space page tables Thomas Gleixner
@ 2017-12-20 21:35 ` Thomas Gleixner
  2018-05-17 15:58   ` Josh Poimboeuf
  2017-12-20 21:35 ` [patch V181 36/54] x86/mm/pti: Share entry text PMD Thomas Gleixner
                   ` (21 subsequent siblings)
  56 siblings, 1 reply; 85+ messages in thread
From: Thomas Gleixner @ 2017-12-20 21:35 UTC (permalink / raw)
  To: LKML
  Cc: x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, keescook, hughd,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Rik van Riel,
	Boris Ostrovsky, Juergen Gross, David Laight, Eduardo Valentin,
	aliguori, Will Deacon, Vlastimil Babka, daniel.gruss,
	Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin

[-- Attachment #1: 0040-x86-entry-Align-entry-text-section-to-PMD-boundary.patch --]
[-- Type: text/plain, Size: 1728 bytes --]

From: Thomas Gleixner <tglx@linutronix.de>

The (irq)entry text must be visible in the user space page tables. To allow
simple PMD based sharing, make the entry text PMD aligned.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Laight <David.Laight@aculab.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Eduardo Valentin <eduval@amazon.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: aliguori@amazon.com
Cc: daniel.gruss@iaik.tugraz.at
Cc: hughd@google.com
Cc: keescook@google.com
---
 arch/x86/kernel/vmlinux.lds.S |    8 ++++++++
 1 file changed, 8 insertions(+)

--- a/arch/x86/kernel/vmlinux.lds.S
+++ b/arch/x86/kernel/vmlinux.lds.S
@@ -61,11 +61,17 @@ jiffies_64 = jiffies;
 		. = ALIGN(HPAGE_SIZE);				\
 		__end_rodata_hpage_align = .;
 
+#define ALIGN_ENTRY_TEXT_BEGIN	. = ALIGN(PMD_SIZE);
+#define ALIGN_ENTRY_TEXT_END	. = ALIGN(PMD_SIZE);
+
 #else
 
 #define X64_ALIGN_RODATA_BEGIN
 #define X64_ALIGN_RODATA_END
 
+#define ALIGN_ENTRY_TEXT_BEGIN
+#define ALIGN_ENTRY_TEXT_END
+
 #endif
 
 PHDRS {
@@ -102,8 +108,10 @@ SECTIONS
 		CPUIDLE_TEXT
 		LOCK_TEXT
 		KPROBES_TEXT
+		ALIGN_ENTRY_TEXT_BEGIN
 		ENTRY_TEXT
 		IRQENTRY_TEXT
+		ALIGN_ENTRY_TEXT_END
 		SOFTIRQENTRY_TEXT
 		*(.fixup)
 		*(.gnu.warning)

^ permalink raw reply	[flat|nested] 85+ messages in thread

* [patch V181 36/54] x86/mm/pti: Share entry text PMD
  2017-12-20 21:35 [patch V181 00/54] x86/pti: Final XMAS release Thomas Gleixner
                   ` (34 preceding siblings ...)
  2017-12-20 21:35 ` [patch V181 35/54] x86/entry: Align entry text section to PMD boundary Thomas Gleixner
@ 2017-12-20 21:35 ` Thomas Gleixner
  2017-12-20 21:35 ` [patch V181 37/54] x86/mm/pti: Map ESPFIX into user space Thomas Gleixner
                   ` (20 subsequent siblings)
  56 siblings, 0 replies; 85+ messages in thread
From: Thomas Gleixner @ 2017-12-20 21:35 UTC (permalink / raw)
  To: LKML
  Cc: x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, keescook, hughd,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Rik van Riel,
	Boris Ostrovsky, Juergen Gross, David Laight, Eduardo Valentin,
	aliguori, Will Deacon, Vlastimil Babka, daniel.gruss,
	Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin

[-- Attachment #1: 0041-x86-mm-pti-Share-entry-text-PMD.patch --]
[-- Type: text/plain, Size: 1793 bytes --]

From: Thomas Gleixner <tglx@linutronix.de>

Share the entry text PMD of the kernel mapping with the user space
mapping. If large pages are enabled this is a single PMD entry and at the
point where it is copied into the user page table the RW bit has not been
cleared yet. Clear it right away so the user space visible map becomes RX.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Laight <David.Laight@aculab.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Eduardo Valentin <eduval@amazon.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: aliguori@amazon.com
Cc: daniel.gruss@iaik.tugraz.at
Cc: hughd@google.com
Cc: keescook@google.com
---
 arch/x86/mm/pti.c |   10 ++++++++++
 1 file changed, 10 insertions(+)

--- a/arch/x86/mm/pti.c
+++ b/arch/x86/mm/pti.c
@@ -256,6 +256,15 @@ static void __init pti_clone_user_shared
 }
 
 /*
+ * Clone the populated PMDs of the entry and irqentry text and force them RO.
+ */
+static void __init pti_clone_entry_text(void)
+{
+	pti_clone_pmds((unsigned long) __entry_text_start,
+			(unsigned long) __irqentry_text_end, _PAGE_RW);
+}
+
+/*
  * Initialize kernel page table isolation
  */
 void __init pti_init(void)
@@ -266,4 +275,5 @@ void __init pti_init(void)
 	pr_info("enabled\n");
 
 	pti_clone_user_shared();
+	pti_clone_entry_text();
 }

^ permalink raw reply	[flat|nested] 85+ messages in thread

* [patch V181 37/54] x86/mm/pti: Map ESPFIX into user space
  2017-12-20 21:35 [patch V181 00/54] x86/pti: Final XMAS release Thomas Gleixner
                   ` (35 preceding siblings ...)
  2017-12-20 21:35 ` [patch V181 36/54] x86/mm/pti: Share entry text PMD Thomas Gleixner
@ 2017-12-20 21:35 ` Thomas Gleixner
  2017-12-20 21:35 ` [patch V181 38/54] x86/cpu_entry_area: Add debugstore entries to cpu_entry_area Thomas Gleixner
                   ` (19 subsequent siblings)
  56 siblings, 0 replies; 85+ messages in thread
From: Thomas Gleixner @ 2017-12-20 21:35 UTC (permalink / raw)
  To: LKML
  Cc: x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, keescook, hughd,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Rik van Riel,
	Boris Ostrovsky, Juergen Gross, David Laight, Eduardo Valentin,
	aliguori, Will Deacon, Vlastimil Babka, daniel.gruss, Kees Cook,
	Borislav Petkov

[-- Attachment #1: x86-mm-pti--Map-ESPFIX-into-user-space.patch --]
[-- Type: text/plain, Size: 1047 bytes --]

From: Andy Lutomirski <luto@kernel.org>

Map the ESPFIX pages into user space when PTI is enabled.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Kees Cook <keescook@chromium.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: David Laight <David.Laight@aculab.com>
Cc: Borislav Petkov <bp@alien8.de>
---
 arch/x86/mm/pti.c |   11 +++++++++++
 1 file changed, 11 insertions(+)

--- a/arch/x86/mm/pti.c
+++ b/arch/x86/mm/pti.c
@@ -256,6 +256,16 @@ static void __init pti_clone_user_shared
 }
 
 /*
+ * Clone the ESPFIX P4D into the user space visible page table
+ */
+static void __init pti_setup_espfix64(void)
+{
+#ifdef CONFIG_X86_ESPFIX64
+	pti_clone_p4d(ESPFIX_BASE_ADDR);
+#endif
+}
+
+/*
  * Clone the populated PMDs of the entry and irqentry text and force them RO.
  */
 static void __init pti_clone_entry_text(void)
@@ -276,4 +286,5 @@ void __init pti_init(void)
 
 	pti_clone_user_shared();
 	pti_clone_entry_text();
+	pti_setup_espfix64();
 }

^ permalink raw reply	[flat|nested] 85+ messages in thread

* [patch V181 38/54] x86/cpu_entry_area: Add debugstore entries to cpu_entry_area
  2017-12-20 21:35 [patch V181 00/54] x86/pti: Final XMAS release Thomas Gleixner
                   ` (36 preceding siblings ...)
  2017-12-20 21:35 ` [patch V181 37/54] x86/mm/pti: Map ESPFIX into user space Thomas Gleixner
@ 2017-12-20 21:35 ` Thomas Gleixner
  2017-12-20 21:35 ` [patch V181 39/54] x86/events/intel/ds: Map debug buffers in cpu_entry_area Thomas Gleixner
                   ` (18 subsequent siblings)
  56 siblings, 0 replies; 85+ messages in thread
From: Thomas Gleixner @ 2017-12-20 21:35 UTC (permalink / raw)
  To: LKML
  Cc: x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, keescook, hughd,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Rik van Riel,
	Boris Ostrovsky, Juergen Gross, David Laight, Eduardo Valentin,
	aliguori, Will Deacon, Vlastimil Babka, daniel.gruss,
	Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin

[-- Attachment #1: 0043-x86-fixmap-Add-debugstore-entries-to-cpu_entry_area.patch --]
[-- Type: text/plain, Size: 6696 bytes --]

From: Thomas Gleixner <tglx@linutronix.de>

The Intel PEBS/BTS debug store is a design trainwreck as it expects virtual
addresses which must be visible in any execution context.

So it is required to make these mappings visible to user space when kernel
page table isolation is active.

Provide enough room for the buffer mappings in the cpu_entry_area so the
buffers are available in the user space visible page tables.

At the point where the kernel side entry area is populated there is no
buffer available yet, but the kernel PMD must be populated. To achieve this,
set the entries for these buffers to non-present.
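
In effect (a sketch mirroring cea_allocate_ptes() in the diff below):

	for (; pages; pages--, cea_vaddr += PAGE_SIZE)
		cea_set_pte(cea_vaddr, 0, PAGE_NONE);	/* not present */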

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Laight <David.Laight@aculab.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Eduardo Valentin <eduval@amazon.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: aliguori@amazon.com
Cc: daniel.gruss@iaik.tugraz.at
Cc: hughd@google.com
Cc: keescook@google.com
---
 arch/x86/events/intel/ds.c            |    5 ++--
 arch/x86/events/perf_event.h          |   21 +------------------
 arch/x86/include/asm/cpu_entry_area.h |   13 ++++++++++++
 arch/x86/include/asm/intel_ds.h       |   36 ++++++++++++++++++++++++++++++++++
 arch/x86/mm/cpu_entry_area.c          |   32 ++++++++++++++++++++++++++++++
 5 files changed, 86 insertions(+), 21 deletions(-)

--- a/arch/x86/events/intel/ds.c
+++ b/arch/x86/events/intel/ds.c
@@ -8,11 +8,12 @@
 
 #include "../perf_event.h"
 
+/* Waste a full page so it can be mapped into the cpu_entry_area */
+DEFINE_PER_CPU_PAGE_ALIGNED(struct debug_store, cpu_debug_store);
+
 /* The size of a BTS record in bytes: */
 #define BTS_RECORD_SIZE		24
 
-#define BTS_BUFFER_SIZE		(PAGE_SIZE << 4)
-#define PEBS_BUFFER_SIZE	(PAGE_SIZE << 4)
 #define PEBS_FIXUP_SIZE		PAGE_SIZE
 
 /*
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -14,6 +14,8 @@
 
 #include <linux/perf_event.h>
 
+#include <asm/intel_ds.h>
+
 /* To enable MSR tracing please use the generic trace points. */
 
 /*
@@ -77,8 +79,6 @@ struct amd_nb {
 	struct event_constraint event_constraints[X86_PMC_IDX_MAX];
 };
 
-/* The maximal number of PEBS events: */
-#define MAX_PEBS_EVENTS		8
 #define PEBS_COUNTER_MASK	((1ULL << MAX_PEBS_EVENTS) - 1)
 
 /*
@@ -95,23 +95,6 @@ struct amd_nb {
 	PERF_SAMPLE_TRANSACTION | PERF_SAMPLE_PHYS_ADDR | \
 	PERF_SAMPLE_REGS_INTR | PERF_SAMPLE_REGS_USER)
 
-/*
- * A debug store configuration.
- *
- * We only support architectures that use 64bit fields.
- */
-struct debug_store {
-	u64	bts_buffer_base;
-	u64	bts_index;
-	u64	bts_absolute_maximum;
-	u64	bts_interrupt_threshold;
-	u64	pebs_buffer_base;
-	u64	pebs_index;
-	u64	pebs_absolute_maximum;
-	u64	pebs_interrupt_threshold;
-	u64	pebs_event_reset[MAX_PEBS_EVENTS];
-};
-
 #define PEBS_REGS \
 	(PERF_REG_X86_AX | \
 	 PERF_REG_X86_BX | \
--- a/arch/x86/include/asm/cpu_entry_area.h
+++ b/arch/x86/include/asm/cpu_entry_area.h
@@ -5,6 +5,7 @@
 
 #include <linux/percpu-defs.h>
 #include <asm/processor.h>
+#include <asm/intel_ds.h>
 
 /*
  * cpu_entry_area is a percpu region that contains things needed by the CPU
@@ -40,6 +41,18 @@ struct cpu_entry_area {
 	 */
 	char exception_stacks[(N_EXCEPTION_STACKS - 1) * EXCEPTION_STKSZ + DEBUG_STKSZ];
 #endif
+#ifdef CONFIG_CPU_SUP_INTEL
+	/*
+	 * Per CPU debug store for Intel performance monitoring. Wastes a
+	 * full page at the moment.
+	 */
+	struct debug_store cpu_debug_store;
+	/*
+	 * The actual PEBS/BTS buffers must be mapped to user space.
+	 * Reserve enough PTEs for them.
+	 */
+	struct debug_store_buffers cpu_debug_buffers;
+#endif
 };
 
 #define CPU_ENTRY_AREA_SIZE	(sizeof(struct cpu_entry_area))
--- /dev/null
+++ b/arch/x86/include/asm/intel_ds.h
@@ -0,0 +1,36 @@
+#ifndef _ASM_INTEL_DS_H
+#define _ASM_INTEL_DS_H
+
+#include <linux/percpu-defs.h>
+
+#define BTS_BUFFER_SIZE		(PAGE_SIZE << 4)
+#define PEBS_BUFFER_SIZE	(PAGE_SIZE << 4)
+
+/* The maximal number of PEBS events: */
+#define MAX_PEBS_EVENTS		8
+
+/*
+ * A debug store configuration.
+ *
+ * We only support architectures that use 64bit fields.
+ */
+struct debug_store {
+	u64	bts_buffer_base;
+	u64	bts_index;
+	u64	bts_absolute_maximum;
+	u64	bts_interrupt_threshold;
+	u64	pebs_buffer_base;
+	u64	pebs_index;
+	u64	pebs_absolute_maximum;
+	u64	pebs_interrupt_threshold;
+	u64	pebs_event_reset[MAX_PEBS_EVENTS];
+} __aligned(PAGE_SIZE);
+
+DECLARE_PER_CPU_PAGE_ALIGNED(struct debug_store, cpu_debug_store);
+
+struct debug_store_buffers {
+	char	bts_buffer[BTS_BUFFER_SIZE];
+	char	pebs_buffer[PEBS_BUFFER_SIZE];
+};
+
+#endif
--- a/arch/x86/mm/cpu_entry_area.c
+++ b/arch/x86/mm/cpu_entry_area.c
@@ -20,6 +20,16 @@ void cea_set_pte(void *cea_vaddr, phys_a
 	set_pte_vaddr(va, pfn_pte(pa >> PAGE_SHIFT, flags));
 }
 
+/*
+ * Force the population of PMDs for not yet allocated per cpu
+ * memory like debug store buffers.
+ */
+static void __init cea_allocate_ptes(void *cea_vaddr, int pages)
+{
+	for (; pages; pages--, cea_vaddr += PAGE_SIZE)
+		cea_set_pte(cea_vaddr, 0, PAGE_NONE);
+}
+
 static void __init
 cea_map_percpu_pages(void *cea_vaddr, void *ptr, int pages, pgprot_t prot)
 {
@@ -27,6 +37,27 @@ cea_map_percpu_pages(void *cea_vaddr, vo
 		cea_set_pte(cea_vaddr, per_cpu_ptr_to_phys(ptr), prot);
 }
 
+static void percpu_setup_debug_store(int cpu)
+{
+#ifdef CONFIG_CPU_SUP_INTEL
+	int npages;
+	void *cea;
+
+	if (boot_cpu_data.x86_vendor != X86_VENDOR_INTEL)
+		return;
+
+	cea = &get_cpu_entry_area(cpu)->cpu_debug_store;
+	npages = sizeof(struct debug_store) / PAGE_SIZE;
+	BUILD_BUG_ON(sizeof(struct debug_store) % PAGE_SIZE != 0);
+	cea_map_percpu_pages(cea, &per_cpu(cpu_debug_store, cpu), npages,
+			     PAGE_KERNEL);
+
+	cea = &get_cpu_entry_area(cpu)->cpu_debug_buffers;
+	npages = sizeof(struct debug_store_buffers) / PAGE_SIZE;
+	cea_allocate_ptes(cea, npages);
+#endif
+}
+
 /* Setup the fixmap mappings only once per-processor */
 static void __init setup_cpu_entry_area(int cpu)
 {
@@ -98,6 +129,7 @@ static void __init setup_cpu_entry_area(
 	cea_set_pte(&get_cpu_entry_area(cpu)->entry_trampoline,
 		     __pa_symbol(_entry_trampoline), PAGE_KERNEL_RX);
 #endif
+	percpu_setup_debug_store(cpu);
 }
 
 static __init void setup_cpu_entry_area_ptes(void)

^ permalink raw reply	[flat|nested] 85+ messages in thread

* [patch V181 39/54] x86/events/intel/ds: Map debug buffers in cpu_entry_area
  2017-12-20 21:35 [patch V181 00/54] x86/pti: Final XMAS release Thomas Gleixner
                   ` (37 preceding siblings ...)
  2017-12-20 21:35 ` [patch V181 38/54] x86/cpu_entry_area: Add debugstore entries to cpu_entry_area Thomas Gleixner
@ 2017-12-20 21:35 ` Thomas Gleixner
  2017-12-20 21:35 ` [patch V181 40/54] x86/mm/64: Make a full PGD-entry size hole in the memory map Thomas Gleixner
                   ` (17 subsequent siblings)
  56 siblings, 0 replies; 85+ messages in thread
From: Thomas Gleixner @ 2017-12-20 21:35 UTC (permalink / raw)
  To: LKML
  Cc: x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, keescook, hughd,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Rik van Riel,
	Boris Ostrovsky, Juergen Gross, David Laight, Eduardo Valentin,
	aliguori, Will Deacon, Vlastimil Babka, daniel.gruss,
	Dave Hansen, Ingo Molnar, Borislav Petkov, H. Peter Anvin

[-- Attachment #1: 0044-x86-events-intel-ds-Map-debug-buffers-in-fixmap.patch --]
[-- Type: text/plain, Size: 7825 bytes --]

From: Hugh Dickins <hughd@google.com>

The BTS and PEBS buffers both have their virtual addresses programmed into
the hardware.  This means that any access to them is performed via the page
tables.  The times that the hardware accesses these are entirely dependent
on how the performance monitoring hardware events are set up.  In other
words, there is no way for the kernel to tell when the hardware might
access these buffers.

To avoid perf crashes, place 'debug_store' in the cpu_entry_area, allocate
pages for the buffers and map them into the cpu_entry_area as well.

The PEBS fixup buffer does not need this treatment.
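
The resulting flow for the PEBS buffer, sketched from the diff below:

	buffer = dsalloc_pages(bsiz, GFP_KERNEL, cpu);	/* real pages */
	cea = &get_cpu_entry_area(cpu)->cpu_debug_buffers.pebs_buffer;
	ds_update_cea(cea, buffer, bsiz, PAGE_KERNEL);	/* alias in the CEA */
	ds->pebs_buffer_base = (unsigned long) cea;	/* hardware uses the alias */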

[ tglx: Got rid of the kaiser_add_mapping() cruft ]

Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: David Laight <David.Laight@aculab.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Eduardo Valentin <eduval@amazon.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: aliguori@amazon.com
Cc: daniel.gruss@iaik.tugraz.at
Cc: keescook@google.com
---
 arch/x86/events/intel/ds.c   |  125 +++++++++++++++++++++++++++----------------
 arch/x86/events/perf_event.h |    2 
 2 files changed, 82 insertions(+), 45 deletions(-)

--- a/arch/x86/events/intel/ds.c
+++ b/arch/x86/events/intel/ds.c
@@ -3,6 +3,7 @@
 #include <linux/types.h>
 #include <linux/slab.h>
 
+#include <asm/cpu_entry_area.h>
 #include <asm/perf_event.h>
 #include <asm/insn.h>
 
@@ -280,17 +281,52 @@ void fini_debug_store_on_cpu(int cpu)
 
 static DEFINE_PER_CPU(void *, insn_buffer);
 
-static int alloc_pebs_buffer(int cpu)
+static void ds_update_cea(void *cea, void *addr, size_t size, pgprot_t prot)
 {
-	struct debug_store *ds = per_cpu(cpu_hw_events, cpu).ds;
+	phys_addr_t pa;
+	size_t msz = 0;
+
+	pa = virt_to_phys(addr);
+	for (; msz < size; msz += PAGE_SIZE, pa += PAGE_SIZE, cea += PAGE_SIZE)
+		cea_set_pte(cea, pa, prot);
+}
+
+static void ds_clear_cea(void *cea, size_t size)
+{
+	size_t msz = 0;
+
+	for (; msz < size; msz += PAGE_SIZE, cea += PAGE_SIZE)
+		cea_set_pte(cea, 0, PAGE_NONE);
+}
+
+static void *dsalloc_pages(size_t size, gfp_t flags, int cpu)
+{
+	unsigned int order = get_order(size);
 	int node = cpu_to_node(cpu);
-	int max;
-	void *buffer, *ibuffer;
+	struct page *page;
+
+	page = __alloc_pages_node(node, flags | __GFP_ZERO, order);
+	return page ? page_address(page) : NULL;
+}
+
+static void dsfree_pages(const void *buffer, size_t size)
+{
+	if (buffer)
+		free_pages((unsigned long)buffer, get_order(size));
+}
+
+static int alloc_pebs_buffer(int cpu)
+{
+	struct cpu_hw_events *hwev = per_cpu_ptr(&cpu_hw_events, cpu);
+	struct debug_store *ds = hwev->ds;
+	size_t bsiz = x86_pmu.pebs_buffer_size;
+	int max, node = cpu_to_node(cpu);
+	void *buffer, *ibuffer, *cea;
 
 	if (!x86_pmu.pebs)
 		return 0;
 
-	buffer = kzalloc_node(x86_pmu.pebs_buffer_size, GFP_KERNEL, node);
+	buffer = dsalloc_pages(bsiz, GFP_KERNEL, cpu);
 	if (unlikely(!buffer))
 		return -ENOMEM;
 
@@ -301,25 +337,27 @@ static int alloc_pebs_buffer(int cpu)
 	if (x86_pmu.intel_cap.pebs_format < 2) {
 		ibuffer = kzalloc_node(PEBS_FIXUP_SIZE, GFP_KERNEL, node);
 		if (!ibuffer) {
-			kfree(buffer);
+			dsfree_pages(buffer, bsiz);
 			return -ENOMEM;
 		}
 		per_cpu(insn_buffer, cpu) = ibuffer;
 	}
-
-	max = x86_pmu.pebs_buffer_size / x86_pmu.pebs_record_size;
-
-	ds->pebs_buffer_base = (u64)(unsigned long)buffer;
+	hwev->ds_pebs_vaddr = buffer;
+	/* Update the cpu entry area mapping */
+	cea = &get_cpu_entry_area(cpu)->cpu_debug_buffers.pebs_buffer;
+	ds->pebs_buffer_base = (unsigned long) cea;
+	ds_update_cea(cea, buffer, bsiz, PAGE_KERNEL);
 	ds->pebs_index = ds->pebs_buffer_base;
-	ds->pebs_absolute_maximum = ds->pebs_buffer_base +
-		max * x86_pmu.pebs_record_size;
-
+	max = x86_pmu.pebs_record_size * (bsiz / x86_pmu.pebs_record_size);
+	ds->pebs_absolute_maximum = ds->pebs_buffer_base + max;
 	return 0;
 }
 
 static void release_pebs_buffer(int cpu)
 {
-	struct debug_store *ds = per_cpu(cpu_hw_events, cpu).ds;
+	struct cpu_hw_events *hwev = per_cpu_ptr(&cpu_hw_events, cpu);
+	struct debug_store *ds = hwev->ds;
+	void *cea;
 
 	if (!ds || !x86_pmu.pebs)
 		return;
@@ -327,73 +365,70 @@ static void release_pebs_buffer(int cpu)
 	kfree(per_cpu(insn_buffer, cpu));
 	per_cpu(insn_buffer, cpu) = NULL;
 
-	kfree((void *)(unsigned long)ds->pebs_buffer_base);
+	/* Clear the cpu entry area mapping */
+	cea = &get_cpu_entry_area(cpu)->cpu_debug_buffers.pebs_buffer;
+	ds_clear_cea(cea, x86_pmu.pebs_buffer_size);
 	ds->pebs_buffer_base = 0;
+	dsfree_pages(hwev->ds_pebs_vaddr, x86_pmu.pebs_buffer_size);
+	hwev->ds_pebs_vaddr = NULL;
 }
 
 static int alloc_bts_buffer(int cpu)
 {
-	struct debug_store *ds = per_cpu(cpu_hw_events, cpu).ds;
-	int node = cpu_to_node(cpu);
-	int max, thresh;
-	void *buffer;
+	struct cpu_hw_events *hwev = per_cpu_ptr(&cpu_hw_events, cpu);
+	struct debug_store *ds = hwev->ds;
+	void *buffer, *cea;
+	int max;
 
 	if (!x86_pmu.bts)
 		return 0;
 
-	buffer = kzalloc_node(BTS_BUFFER_SIZE, GFP_KERNEL | __GFP_NOWARN, node);
+	buffer = dsalloc_pages(BTS_BUFFER_SIZE, GFP_KERNEL | __GFP_NOWARN, cpu);
 	if (unlikely(!buffer)) {
 		WARN_ONCE(1, "%s: BTS buffer allocation failure\n", __func__);
 		return -ENOMEM;
 	}
-
-	max = BTS_BUFFER_SIZE / BTS_RECORD_SIZE;
-	thresh = max / 16;
-
-	ds->bts_buffer_base = (u64)(unsigned long)buffer;
+	hwev->ds_bts_vaddr = buffer;
+	/* Update the cpu entry area mapping */
+	cea = &get_cpu_entry_area(cpu)->cpu_debug_buffers.bts_buffer;
+	ds->bts_buffer_base = (unsigned long) cea;
+	ds_update_cea(cea, buffer, BTS_BUFFER_SIZE, PAGE_KERNEL);
 	ds->bts_index = ds->bts_buffer_base;
-	ds->bts_absolute_maximum = ds->bts_buffer_base +
-		max * BTS_RECORD_SIZE;
-	ds->bts_interrupt_threshold = ds->bts_absolute_maximum -
-		thresh * BTS_RECORD_SIZE;
-
+	max = BTS_RECORD_SIZE * (BTS_BUFFER_SIZE / BTS_RECORD_SIZE);
+	ds->bts_absolute_maximum = ds->bts_buffer_base + max;
+	ds->bts_interrupt_threshold = ds->bts_absolute_maximum - (max / 16);
 	return 0;
 }
 
 static void release_bts_buffer(int cpu)
 {
-	struct debug_store *ds = per_cpu(cpu_hw_events, cpu).ds;
+	struct cpu_hw_events *hwev = per_cpu_ptr(&cpu_hw_events, cpu);
+	struct debug_store *ds = hwev->ds;
+	void *cea;
 
 	if (!ds || !x86_pmu.bts)
 		return;
 
-	kfree((void *)(unsigned long)ds->bts_buffer_base);
+	/* Clear the cpu entry area mapping */
+	cea = &get_cpu_entry_area(cpu)->cpu_debug_buffers.bts_buffer;
+	ds_clear_cea(cea, BTS_BUFFER_SIZE);
 	ds->bts_buffer_base = 0;
+	dsfree_pages(hwev->ds_bts_vaddr, BTS_BUFFER_SIZE);
+	hwev->ds_bts_vaddr = NULL;
 }
 
 static int alloc_ds_buffer(int cpu)
 {
-	int node = cpu_to_node(cpu);
-	struct debug_store *ds;
-
-	ds = kzalloc_node(sizeof(*ds), GFP_KERNEL, node);
-	if (unlikely(!ds))
-		return -ENOMEM;
+	struct debug_store *ds = &get_cpu_entry_area(cpu)->cpu_debug_store;
 
+	memset(ds, 0, sizeof(*ds));
 	per_cpu(cpu_hw_events, cpu).ds = ds;
-
 	return 0;
 }
 
 static void release_ds_buffer(int cpu)
 {
-	struct debug_store *ds = per_cpu(cpu_hw_events, cpu).ds;
-
-	if (!ds)
-		return;
-
 	per_cpu(cpu_hw_events, cpu).ds = NULL;
-	kfree(ds);
 }
 
 void release_ds_buffers(void)
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -199,6 +199,8 @@ struct cpu_hw_events {
 	 * Intel DebugStore bits
 	 */
 	struct debug_store	*ds;
+	void			*ds_pebs_vaddr;
+	void			*ds_bts_vaddr;
 	u64			pebs_enabled;
 	int			n_pebs;
 	int			n_large_pebs;

^ permalink raw reply	[flat|nested] 85+ messages in thread

* [patch V181 40/54] x86/mm/64: Make a full PGD-entry size hole in the memory map
  2017-12-20 21:35 [patch V181 00/54] x86/pti: Final XMAS release Thomas Gleixner
                   ` (38 preceding siblings ...)
  2017-12-20 21:35 ` [patch V181 39/54] x86/events/intel/ds: Map debug buffers in cpu_entry_area Thomas Gleixner
@ 2017-12-20 21:35 ` Thomas Gleixner
  2017-12-20 21:35 ` [patch V181 41/54] x86/pti: Put the LDT in its own PGD if PTI is on Thomas Gleixner
                   ` (16 subsequent siblings)
  56 siblings, 0 replies; 85+ messages in thread
From: Thomas Gleixner @ 2017-12-20 21:35 UTC (permalink / raw)
  To: LKML
  Cc: x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, keescook, hughd,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Rik van Riel,
	Boris Ostrovsky, Juergen Gross, David Laight, Eduardo Valentin,
	aliguori, Will Deacon, Vlastimil Babka, daniel.gruss, Kees Cook,
	Borislav Petkov, Kirill A. Shutemov

[-- Attachment #1: x86-mm-64--Make_a_full_PGD-entry_size_hole_in_the_memory_map.patch --]
[-- Type: text/plain, Size: 2059 bytes --]

From: Andy Lutomirski <luto@kernel.org>

Shrink vmalloc space from 16384TiB to 12800TiB to enlarge the hole starting
at 0xff90000000000000 to be a full PGD entry.

A subsequent patch will use this hole for the pagetable isolation LDT
alias.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Kees Cook <keescook@chromium.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: David Laight <David.Laight@aculab.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>

---
 Documentation/x86/x86_64/mm.txt         |    4 ++--
 arch/x86/include/asm/pgtable_64_types.h |    4 ++--
 2 files changed, 4 insertions(+), 4 deletions(-)

--- a/Documentation/x86/x86_64/mm.txt
+++ b/Documentation/x86/x86_64/mm.txt
@@ -29,8 +29,8 @@ ffffffffffe00000 - ffffffffffffffff (=2
 hole caused by [56:63] sign extension
 ff00000000000000 - ff0fffffffffffff (=52 bits) guard hole, reserved for hypervisor
 ff10000000000000 - ff8fffffffffffff (=55 bits) direct mapping of all phys. memory
-ff90000000000000 - ff91ffffffffffff (=49 bits) hole
-ff92000000000000 - ffd1ffffffffffff (=54 bits) vmalloc/ioremap space
+ff90000000000000 - ff9fffffffffffff (=52 bits) hole
+ffa0000000000000 - ffd1ffffffffffff (=54 bits) vmalloc/ioremap space (12800 TB)
 ffd2000000000000 - ffd3ffffffffffff (=49 bits) hole
 ffd4000000000000 - ffd5ffffffffffff (=49 bits) virtual memory map (512TB)
 ... unused hole ...
--- a/arch/x86/include/asm/pgtable_64_types.h
+++ b/arch/x86/include/asm/pgtable_64_types.h
@@ -79,8 +79,8 @@ typedef struct { pteval_t pte; } pte_t;
 #define MAXMEM			_AC(__AC(1, UL) << MAX_PHYSMEM_BITS, UL)
 
 #ifdef CONFIG_X86_5LEVEL
-# define VMALLOC_SIZE_TB	_AC(16384, UL)
-# define __VMALLOC_BASE		_AC(0xff92000000000000, UL)
+# define VMALLOC_SIZE_TB	_AC(12800, UL)
+# define __VMALLOC_BASE		_AC(0xffa0000000000000, UL)
 # define __VMEMMAP_BASE		_AC(0xffd4000000000000, UL)
 #else
 # define VMALLOC_SIZE_TB	_AC(32, UL)

^ permalink raw reply	[flat|nested] 85+ messages in thread

* [patch V181 41/54] x86/pti: Put the LDT in its own PGD if PTI is on
  2017-12-20 21:35 [patch V181 00/54] x86/pti: Final XMAS release Thomas Gleixner
                   ` (39 preceding siblings ...)
  2017-12-20 21:35 ` [patch V181 40/54] x86/mm/64: Make a full PGD-entry size hole in the memory map Thomas Gleixner
@ 2017-12-20 21:35 ` Thomas Gleixner
  2017-12-20 21:35 ` [patch V181 42/54] x86/pti: Map the vsyscall page if needed Thomas Gleixner
                   ` (15 subsequent siblings)
  56 siblings, 0 replies; 85+ messages in thread
From: Thomas Gleixner @ 2017-12-20 21:35 UTC (permalink / raw)
  To: LKML
  Cc: x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, keescook, hughd,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Rik van Riel,
	Boris Ostrovsky, Juergen Gross, David Laight, Eduardo Valentin,
	aliguori, Will Deacon, Vlastimil Babka, daniel.gruss, Kees Cook,
	Borislav Petkov, Kirill A. Shutemov

[-- Attachment #1: x86-pti--Put_the_LDT_in_its_own_PGD_if_PTI_is_on.patch --]
[-- Type: text/plain, Size: 14128 bytes --]

From: Andy Lutomirski <luto@kernel.org>

With PTI enabled, the LDT must be mapped in the usermode tables somewhere.
The LDT is per process, i.e. per mm.

An earlier approach mapped the LDT on context switch into a fixmap area,
but that's a big overhead and exhausted the fixmap space when NR_CPUS got
big.

Take advantage of the fact that there is an address space hole which
provides a completely unused pgd. Use this pgd to manage per-mm LDT
mappings.

This has a down side: the LDT isn't (currently) randomized, and an attack
that can write the LDT is instant root due to call gates (thanks, AMD, for
leaving call gates in AMD64 but designing them wrong so they're only useful
for exploits).  This can be mitigated by making the LDT read-only or
randomizing the mapping, either of which is straightforward on top of this
patch.

This will significantly slow down LDT users, but that shouldn't matter for
important workloads -- the LDT is only used by DOSEMU(2), Wine, and very
old libc implementations.
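
As a sanity check, the LDT remap base addresses defined in the
pgtable_64_types.h hunk below can be reproduced in plain C (illustrative
only, not kernel code):

  #include <stdio.h>

  int main(void)
  {
          /* LDT_PGD_ENTRY << PGDIR_SHIFT; 5-level: -112 << 48 */
          unsigned long long base5 = (0ULL - 112) << 48;
          /* 4-level: -4 << 39 */
          unsigned long long base4 = (0ULL - 4) << 39;

          printf("5-level LDT base: %llx\n", base5); /* ff90000000000000 */
          printf("4-level LDT base: %llx\n", base4); /* fffffe0000000000 */
          return 0;
  }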

[ tglx: Decrapified it ]

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Kees Cook <keescook@chromium.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: David Laight <David.Laight@aculab.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "Kirill A. Shutemov" <kirill@shutemov.name>

---
 Documentation/x86/x86_64/mm.txt         |    3 
 arch/x86/include/asm/mmu_context.h      |   59 ++++++++++++-
 arch/x86/include/asm/pgtable_64_types.h |    4 
 arch/x86/include/asm/processor.h        |   23 +++--
 arch/x86/kernel/ldt.c                   |  139 +++++++++++++++++++++++++++++++-
 arch/x86/mm/dump_pagetables.c           |    9 ++
 6 files changed, 220 insertions(+), 17 deletions(-)

--- a/Documentation/x86/x86_64/mm.txt
+++ b/Documentation/x86/x86_64/mm.txt
@@ -12,6 +12,7 @@ ffffea0000000000 - ffffeaffffffffff (=40
 ... unused hole ...
 ffffec0000000000 - fffffbffffffffff (=44 bits) kasan shadow memory (16TB)
 ... unused hole ...
+fffffe0000000000 - fffffe7fffffffff (=39 bits) LDT remap for PTI
 fffffe8000000000 - fffffeffffffffff (=39 bits) cpu_entry_area mapping
 ffffff0000000000 - ffffff7fffffffff (=39 bits) %esp fixup stacks
 ... unused hole ...
@@ -29,7 +30,7 @@ ffffffffffe00000 - ffffffffffffffff (=2
 hole caused by [56:63] sign extension
 ff00000000000000 - ff0fffffffffffff (=52 bits) guard hole, reserved for hypervisor
 ff10000000000000 - ff8fffffffffffff (=55 bits) direct mapping of all phys. memory
-ff90000000000000 - ff9fffffffffffff (=52 bits) hole
+ff90000000000000 - ff9fffffffffffff (=52 bits) LDT remap for PTI
 ffa0000000000000 - ffd1ffffffffffff (=54 bits) vmalloc/ioremap space (12800 TB)
 ffd2000000000000 - ffd3ffffffffffff (=49 bits) hole
 ffd4000000000000 - ffd5ffffffffffff (=49 bits) virtual memory map (512TB)
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -50,10 +50,33 @@ struct ldt_struct {
 	 * call gates.  On native, we could merge the ldt_struct and LDT
 	 * allocations, but it's not worth trying to optimize.
 	 */
-	struct desc_struct *entries;
-	unsigned int nr_entries;
+	struct desc_struct	*entries;
+	unsigned int		nr_entries;
+
+	/*
+	 * If PTI is in use, then the entries array is not mapped while we're
+	 * in user mode.  The whole array will be aliased at the address
+	 * given by ldt_slot_va(slot).  We use two slots so that we can allocate
+	 * and map, and enable a new LDT without invalidating the mapping
+	 * of an older, still-in-use LDT.
+	 *
+	 * slot will be -1 if this LDT doesn't have an alias mapping.
+	 */
+	int			slot;
 };
 
+/* This is a multiple of PAGE_SIZE. */
+#define LDT_SLOT_STRIDE (LDT_ENTRIES * LDT_ENTRY_SIZE)
+
+static inline void *ldt_slot_va(int slot)
+{
+#ifdef CONFIG_X86_64
+	return (void *)(LDT_BASE_ADDR + LDT_SLOT_STRIDE * slot);
+#else
+	BUG();
+#endif
+}
+
 /*
  * Used for LDT copy/destruction.
  */
@@ -64,6 +87,7 @@ static inline void init_new_context_ldt(
 }
 int ldt_dup_context(struct mm_struct *oldmm, struct mm_struct *mm);
 void destroy_context_ldt(struct mm_struct *mm);
+void ldt_arch_exit_mmap(struct mm_struct *mm);
 #else	/* CONFIG_MODIFY_LDT_SYSCALL */
 static inline void init_new_context_ldt(struct mm_struct *mm) { }
 static inline int ldt_dup_context(struct mm_struct *oldmm,
@@ -71,7 +95,8 @@ static inline int ldt_dup_context(struct
 {
 	return 0;
 }
-static inline void destroy_context_ldt(struct mm_struct *mm) {}
+static inline void destroy_context_ldt(struct mm_struct *mm) { }
+static inline void ldt_arch_exit_mmap(struct mm_struct *mm) { }
 #endif
 
 static inline void load_mm_ldt(struct mm_struct *mm)
@@ -96,10 +121,31 @@ static inline void load_mm_ldt(struct mm
 	 * that we can see.
 	 */
 
-	if (unlikely(ldt))
-		set_ldt(ldt->entries, ldt->nr_entries);
-	else
+	if (unlikely(ldt)) {
+		if (static_cpu_has(X86_FEATURE_PTI)) {
+			if (WARN_ON_ONCE((unsigned long)ldt->slot > 1)) {
+				/*
+				 * Whoops -- either the new LDT isn't mapped
+				 * (if slot == -1) or is mapped into a bogus
+				 * slot (if slot > 1).
+				 */
+				clear_LDT();
+				return;
+			}
+
+			/*
+			 * If page table isolation is enabled, ldt->entries
+			 * will not be mapped in the userspace pagetables.
+			 * Tell the CPU to access the LDT through the alias
+			 * at ldt_slot_va(ldt->slot).
+			 */
+			set_ldt(ldt_slot_va(ldt->slot), ldt->nr_entries);
+		} else {
+			set_ldt(ldt->entries, ldt->nr_entries);
+		}
+	} else {
 		clear_LDT();
+	}
 #else
 	clear_LDT();
 #endif
@@ -194,6 +240,7 @@ static inline int arch_dup_mmap(struct m
 static inline void arch_exit_mmap(struct mm_struct *mm)
 {
 	paravirt_arch_exit_mmap(mm);
+	ldt_arch_exit_mmap(mm);
 }
 
 #ifdef CONFIG_X86_64
--- a/arch/x86/include/asm/pgtable_64_types.h
+++ b/arch/x86/include/asm/pgtable_64_types.h
@@ -82,10 +82,14 @@ typedef struct { pteval_t pte; } pte_t;
 # define VMALLOC_SIZE_TB	_AC(12800, UL)
 # define __VMALLOC_BASE		_AC(0xffa0000000000000, UL)
 # define __VMEMMAP_BASE		_AC(0xffd4000000000000, UL)
+# define LDT_PGD_ENTRY		_AC(-112, UL)
+# define LDT_BASE_ADDR		(LDT_PGD_ENTRY << PGDIR_SHIFT)
 #else
 # define VMALLOC_SIZE_TB	_AC(32, UL)
 # define __VMALLOC_BASE		_AC(0xffffc90000000000, UL)
 # define __VMEMMAP_BASE		_AC(0xffffea0000000000, UL)
+# define LDT_PGD_ENTRY		_AC(-4, UL)
+# define LDT_BASE_ADDR		(LDT_PGD_ENTRY << PGDIR_SHIFT)
 #endif
 
 #ifdef CONFIG_RANDOMIZE_MEMORY
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -851,13 +851,22 @@ static inline void spin_lock_prefetch(co
 
 #else
 /*
- * User space process size. 47bits minus one guard page.  The guard
- * page is necessary on Intel CPUs: if a SYSCALL instruction is at
- * the highest possible canonical userspace address, then that
- * syscall will enter the kernel with a non-canonical return
- * address, and SYSRET will explode dangerously.  We avoid this
- * particular problem by preventing anything from being mapped
- * at the maximum canonical address.
+ * User space process size.  This is the first address outside the user range.
+ * There are a few constraints that determine this:
+ *
+ * On Intel CPUs, if a SYSCALL instruction is at the highest canonical
+ * address, then that syscall will enter the kernel with a
+ * non-canonical return address, and SYSRET will explode dangerously.
+ * We avoid this particular problem by preventing anything executable
+ * from being mapped at the maximum canonical address.
+ *
+ * On AMD CPUs in the Ryzen family, there's a nasty bug in which the
+ * CPUs malfunction if they execute code from the highest canonical page.
+ * They'll speculate right off the end of the canonical space, and
+ * bad things happen.  This is worked around in the same way as the
+ * Intel problem.
+ *
+ * With page table isolation enabled, we map the LDT in ... [stay tuned]
  */
 #define TASK_SIZE_MAX	((1UL << __VIRTUAL_MASK_SHIFT) - PAGE_SIZE)
 
--- a/arch/x86/kernel/ldt.c
+++ b/arch/x86/kernel/ldt.c
@@ -24,6 +24,7 @@
 #include <linux/uaccess.h>
 
 #include <asm/ldt.h>
+#include <asm/tlb.h>
 #include <asm/desc.h>
 #include <asm/mmu_context.h>
 #include <asm/syscalls.h>
@@ -51,13 +52,11 @@ static void refresh_ldt_segments(void)
 static void flush_ldt(void *__mm)
 {
 	struct mm_struct *mm = __mm;
-	mm_context_t *pc;
 
 	if (this_cpu_read(cpu_tlbstate.loaded_mm) != mm)
 		return;
 
-	pc = &mm->context;
-	set_ldt(pc->ldt->entries, pc->ldt->nr_entries);
+	load_mm_ldt(mm);
 
 	refresh_ldt_segments();
 }
@@ -94,10 +93,121 @@ static struct ldt_struct *alloc_ldt_stru
 		return NULL;
 	}
 
+	/* The new LDT isn't aliased for PTI yet. */
+	new_ldt->slot = -1;
+
 	new_ldt->nr_entries = num_entries;
 	return new_ldt;
 }
 
+/*
+ * If PTI is enabled, this maps the LDT into the kernelmode and
+ * usermode tables for the given mm.
+ *
+ * There is no corresponding unmap function.  Even if the LDT is freed, we
+ * leave the PTEs around until the slot is reused or the mm is destroyed.
+ * This is harmless: the LDT is always in ordinary memory, and no one will
+ * access the freed slot.
+ *
+ * If we wanted to unmap freed LDTs, we'd also need to do a flush to make
+ * it useful, and the flush would slow down modify_ldt().
+ */
+static int
+map_ldt_struct(struct mm_struct *mm, struct ldt_struct *ldt, int slot)
+{
+#ifdef CONFIG_PAGE_TABLE_ISOLATION
+	bool is_vmalloc, had_top_level_entry;
+	unsigned long va;
+	spinlock_t *ptl;
+	pgd_t *pgd;
+	int i;
+
+	if (!static_cpu_has(X86_FEATURE_PTI))
+		return 0;
+
+	/*
+	 * Any given ldt_struct should have map_ldt_struct() called at most
+	 * once.
+	 */
+	WARN_ON(ldt->slot != -1);
+
+	/*
+	 * Did we already have the top level entry allocated?  We can't
+	 * use pgd_none() for this because it doesn't do anything on
+	 * 4-level page table kernels.
+	 */
+	pgd = pgd_offset(mm, LDT_BASE_ADDR);
+	had_top_level_entry = (pgd->pgd != 0);
+
+	is_vmalloc = is_vmalloc_addr(ldt->entries);
+
+	for (i = 0; i * PAGE_SIZE < ldt->nr_entries * LDT_ENTRY_SIZE; i++) {
+		unsigned long offset = i << PAGE_SHIFT;
+		const void *src = (char *)ldt->entries + offset;
+		unsigned long pfn;
+		pte_t pte, *ptep;
+
+		va = (unsigned long)ldt_slot_va(slot) + offset;
+		pfn = is_vmalloc ? vmalloc_to_pfn(src) :
+			page_to_pfn(virt_to_page(src));
+		/*
+		 * Treat the PTI LDT range as a *userspace* range.
+		 * get_locked_pte() will allocate all needed pagetables
+		 * and account for them in this mm.
+		 */
+		ptep = get_locked_pte(mm, va, &ptl);
+		if (!ptep)
+			return -ENOMEM;
+		pte = pfn_pte(pfn, __pgprot(__PAGE_KERNEL & ~_PAGE_GLOBAL));
+		set_pte_at(mm, va, ptep, pte);
+		pte_unmap_unlock(ptep, ptl);
+	}
+
+	if (mm->context.ldt) {
+		/*
+		 * We already had an LDT.  The top-level entry should already
+		 * have been allocated and synchronized with the usermode
+		 * tables.
+		 */
+		WARN_ON(!had_top_level_entry);
+		if (static_cpu_has(X86_FEATURE_PTI))
+			WARN_ON(!kernel_to_user_pgdp(pgd)->pgd);
+	} else {
+		/*
+		 * This is the first time we're mapping an LDT for this process.
+		 * Sync the pgd to the usermode tables.
+		 */
+		WARN_ON(had_top_level_entry);
+		if (static_cpu_has(X86_FEATURE_PTI)) {
+			WARN_ON(kernel_to_user_pgdp(pgd)->pgd);
+			set_pgd(kernel_to_user_pgdp(pgd), *pgd);
+		}
+	}
+
+	va = (unsigned long)ldt_slot_va(slot);
+	flush_tlb_mm_range(mm, va, va + LDT_SLOT_STRIDE, 0);
+
+	ldt->slot = slot;
+#endif
+	return 0;
+}
+
+static void free_ldt_pgtables(struct mm_struct *mm)
+{
+#ifdef CONFIG_PAGE_TABLE_ISOLATION
+	struct mmu_gather tlb;
+	unsigned long start = LDT_BASE_ADDR;
+	unsigned long end = start + (1UL << PGDIR_SHIFT);
+
+	if (!static_cpu_has(X86_FEATURE_PTI))
+		return;
+
+	tlb_gather_mmu(&tlb, mm, start, end);
+	free_pgd_range(&tlb, start, end, start, end);
+	tlb_finish_mmu(&tlb, start, end);
+#endif
+}
+
 /* After calling this, the LDT is immutable. */
 static void finalize_ldt_struct(struct ldt_struct *ldt)
 {
@@ -156,6 +266,12 @@ int ldt_dup_context(struct mm_struct *ol
 	       new_ldt->nr_entries * LDT_ENTRY_SIZE);
 	finalize_ldt_struct(new_ldt);
 
+	retval = map_ldt_struct(mm, new_ldt, 0);
+	if (retval) {
+		free_ldt_pgtables(mm);
+		free_ldt_struct(new_ldt);
+		goto out_unlock;
+	}
 	mm->context.ldt = new_ldt;
 
 out_unlock:
@@ -174,6 +290,11 @@ void destroy_context_ldt(struct mm_struc
 	mm->context.ldt = NULL;
 }
 
+void ldt_arch_exit_mmap(struct mm_struct *mm)
+{
+	free_ldt_pgtables(mm);
+}
+
 static int read_ldt(void __user *ptr, unsigned long bytecount)
 {
 	struct mm_struct *mm = current->mm;
@@ -287,6 +408,18 @@ static int write_ldt(void __user *ptr, u
 	new_ldt->entries[ldt_info.entry_number] = ldt;
 	finalize_ldt_struct(new_ldt);
 
+	/*
+	 * If we are using PTI, map the new LDT into the userspace pagetables.
+	 * If there is already an LDT, use the other slot so that other CPUs
+	 * will continue to use the old LDT until install_ldt() switches
+	 * them over to the new LDT.
+	 */
+	error = map_ldt_struct(mm, new_ldt, old_ldt ? !old_ldt->slot : 0);
+	if (error) {
+		free_ldt_struct(old_ldt);
+		goto out_unlock;
+	}
+
 	install_ldt(mm, new_ldt);
 	free_ldt_struct(old_ldt);
 	error = 0;
--- a/arch/x86/mm/dump_pagetables.c
+++ b/arch/x86/mm/dump_pagetables.c
@@ -52,12 +52,18 @@ enum address_markers_idx {
 	USER_SPACE_NR = 0,
 	KERNEL_SPACE_NR,
 	LOW_KERNEL_NR,
+#if defined(CONFIG_MODIFY_LDT_SYSCALL) && defined(CONFIG_X86_5LEVEL)
+	LDT_NR,
+#endif
 	VMALLOC_START_NR,
 	VMEMMAP_START_NR,
 #ifdef CONFIG_KASAN
 	KASAN_SHADOW_START_NR,
 	KASAN_SHADOW_END_NR,
 #endif
+#if defined(CONFIG_MODIFY_LDT_SYSCALL) && !defined(CONFIG_X86_5LEVEL)
+	LDT_NR,
+#endif
 	CPU_ENTRY_AREA_NR,
 #ifdef CONFIG_X86_ESPFIX64
 	ESPFIX_START_NR,
@@ -82,6 +88,9 @@ static struct addr_marker address_marker
 	[KASAN_SHADOW_START_NR]	= { KASAN_SHADOW_START,	"KASAN shadow" },
 	[KASAN_SHADOW_END_NR]	= { KASAN_SHADOW_END,	"KASAN shadow end" },
 #endif
+#ifdef CONFIG_MODIFY_LDT_SYSCALL
+	[LDT_NR]		= { LDT_BASE_ADDR,	"LDT remap" },
+#endif
 	[CPU_ENTRY_AREA_NR]	= { CPU_ENTRY_AREA_BASE,"CPU entry Area" },
 #ifdef CONFIG_X86_ESPFIX64
 	[ESPFIX_START_NR]	= { ESPFIX_BASE_ADDR,	"ESPfix Area", 16 },

^ permalink raw reply	[flat|nested] 85+ messages in thread

* [patch V181 42/54] x86/pti: Map the vsyscall page if needed
  2017-12-20 21:35 [patch V181 00/54] x86/pti: Final XMAS release Thomas Gleixner
                   ` (40 preceding siblings ...)
  2017-12-20 21:35 ` [patch V181 41/54] x86/pti: Put the LDT in its own PGD if PTI is on Thomas Gleixner
@ 2017-12-20 21:35 ` Thomas Gleixner
  2017-12-20 21:35 ` [patch V181 43/54] x86/mm: Allow flushing for future ASID switches Thomas Gleixner
                   ` (14 subsequent siblings)
  56 siblings, 0 replies; 85+ messages in thread
From: Thomas Gleixner @ 2017-12-20 21:35 UTC (permalink / raw)
  To: LKML
  Cc: x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, keescook, hughd,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Rik van Riel,
	Boris Ostrovsky, Juergen Gross, David Laight, Eduardo Valentin,
	aliguori, Will Deacon, Vlastimil Babka, daniel.gruss, Kees Cook,
	Borislav Petkov

[-- Attachment #1: x86-pti--Map_the_vsyscall_page_if_needed.patch --]
[-- Type: text/plain, Size: 4181 bytes --]

From: Andy Lutomirski <luto@kernel.org>

Make VSYSCALLs work fully in PTI mode.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Kees Cook <keescook@chromium.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: David Laight <David.Laight@aculab.com>
Cc: Borislav Petkov <bp@alien8.de>

---
 arch/x86/entry/vsyscall/vsyscall_64.c |    6 +--
 arch/x86/include/asm/vsyscall.h       |    1 
 arch/x86/mm/pti.c                     |   63 ++++++++++++++++++++++++++++++++++
 3 files changed, 67 insertions(+), 3 deletions(-)

--- a/arch/x86/entry/vsyscall/vsyscall_64.c
+++ b/arch/x86/entry/vsyscall/vsyscall_64.c
@@ -344,14 +344,14 @@ int in_gate_area_no_mm(unsigned long add
  * vsyscalls but leave the page not present.  If so, we skip calling
  * this.
  */
-static void __init set_vsyscall_pgtable_user_bits(void)
+void __init set_vsyscall_pgtable_user_bits(pgd_t *root)
 {
 	pgd_t *pgd;
 	p4d_t *p4d;
 	pud_t *pud;
 	pmd_t *pmd;
 
-	pgd = pgd_offset_k(VSYSCALL_ADDR);
+	pgd = pgd_offset_pgd(root, VSYSCALL_ADDR);
 	set_pgd(pgd, __pgd(pgd_val(*pgd) | _PAGE_USER));
 	p4d = p4d_offset(pgd, VSYSCALL_ADDR);
 #if CONFIG_PGTABLE_LEVELS >= 5
@@ -373,7 +373,7 @@ void __init map_vsyscall(void)
 			     vsyscall_mode == NATIVE
 			     ? PAGE_KERNEL_VSYSCALL
 			     : PAGE_KERNEL_VVAR);
-		set_vsyscall_pgtable_user_bits();
+		set_vsyscall_pgtable_user_bits(swapper_pg_dir);
 	}
 
 	BUILD_BUG_ON((unsigned long)__fix_to_virt(VSYSCALL_PAGE) !=
--- a/arch/x86/include/asm/vsyscall.h
+++ b/arch/x86/include/asm/vsyscall.h
@@ -7,6 +7,7 @@
 
 #ifdef CONFIG_X86_VSYSCALL_EMULATION
 extern void map_vsyscall(void);
+extern void set_vsyscall_pgtable_user_bits(pgd_t *root);
 
 /*
  * Called on instruction fetch fault in vsyscall page.
--- a/arch/x86/mm/pti.c
+++ b/arch/x86/mm/pti.c
@@ -38,6 +38,7 @@
 
 #include <asm/cpufeature.h>
 #include <asm/hypervisor.h>
+#include <asm/vsyscall.h>
 #include <asm/cmdline.h>
 #include <asm/pti.h>
 #include <asm/pgtable.h>
@@ -191,6 +192,48 @@ static pmd_t *pti_user_pagetable_walk_pm
 	return pmd_offset(pud, address);
 }
 
+/*
+ * Walk the shadow copy of the page tables (optionally) trying to allocate
+ * page table pages on the way down.  Does not support large pages.
+ *
+ * Note: this is only used when mapping *new* kernel data into the
+ * user/shadow page tables.  It is never used for userspace data.
+ *
+ * Returns a pointer to a PTE on success, or NULL on failure.
+ */
+static pte_t *pti_user_pagetable_walk_pte(unsigned long address)
+{
+	gfp_t gfp = (GFP_KERNEL | __GFP_NOTRACK | __GFP_ZERO);
+	pmd_t *pmd = pti_user_pagetable_walk_pmd(address);
+	pte_t *pte;
+
+	/* We can't do anything sensible if we hit a large mapping. */
+	if (pmd_large(*pmd)) {
+		WARN_ON(1);
+		return NULL;
+	}
+
+	if (pmd_none(*pmd)) {
+		unsigned long new_pte_page = __get_free_page(gfp);
+		if (!new_pte_page)
+			return NULL;
+
+		if (pmd_none(*pmd)) {
+			set_pmd(pmd, __pmd(_KERNPG_TABLE | __pa(new_pte_page)));
+			new_pte_page = 0;
+		}
+		if (new_pte_page)
+			free_page(new_pte_page);
+	}
+
+	pte = pte_offset_kernel(pmd, address);
+	if (pte_flags(*pte) & _PAGE_USER) {
+		WARN_ONCE(1, "attempt to walk to user pte\n");
+		return NULL;
+	}
+	return pte;
+}
+
 static void __init
 pti_clone_pmds(unsigned long start, unsigned long end, pmdval_t clear)
 {
@@ -232,6 +275,25 @@ pti_clone_pmds(unsigned long start, unsi
 	}
 }
 
+static void __init pti_setup_vsyscall(void)
+{
+#ifdef CONFIG_X86_VSYSCALL_EMULATION
+	pte_t *pte, *target_pte;
+	unsigned int level;
+
+	pte = lookup_address(VSYSCALL_ADDR, &level);
+	if (!pte || WARN_ON(level != PG_LEVEL_4K) || pte_none(*pte))
+		return;
+
+	target_pte = pti_user_pagetable_walk_pte(VSYSCALL_ADDR);
+	if (WARN_ON(!target_pte))
+		return;
+
+	*target_pte = *pte;
+	set_vsyscall_pgtable_user_bits(kernel_to_user_pgdp(swapper_pg_dir));
+#endif
+}
+
 /*
  * Clone a single p4d (i.e. a top-level entry on 4-level systems and a
  * next-level entry on 5-level systems.
@@ -287,4 +349,5 @@ void __init pti_init(void)
 	pti_clone_user_shared();
 	pti_clone_entry_text();
 	pti_setup_espfix64();
+	pti_setup_vsyscall();
 }

^ permalink raw reply	[flat|nested] 85+ messages in thread

* [patch V181 43/54] x86/mm: Allow flushing for future ASID switches
  2017-12-20 21:35 [patch V181 00/54] x86/pti: Final XMAS release Thomas Gleixner
                   ` (41 preceding siblings ...)
  2017-12-20 21:35 ` [patch V181 42/54] x86/pti: Map the vsyscall page if needed Thomas Gleixner
@ 2017-12-20 21:35 ` Thomas Gleixner
  2017-12-20 21:35 ` [patch V181 44/54] x86/mm: Abstract switching CR3 Thomas Gleixner
                   ` (13 subsequent siblings)
  56 siblings, 0 replies; 85+ messages in thread
From: Thomas Gleixner @ 2017-12-20 21:35 UTC (permalink / raw)
  To: LKML
  Cc: x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, keescook, hughd,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Rik van Riel,
	Boris Ostrovsky, Juergen Gross, David Laight, Eduardo Valentin,
	aliguori, Will Deacon, Vlastimil Babka, daniel.gruss,
	Dave Hansen, Ingo Molnar, Borislav Petkov, H. Peter Anvin

[-- Attachment #1: 0050-x86-mm-Allow-flushing-for-future-ASID-switches.patch --]
[-- Type: text/plain, Size: 5268 bytes --]

From: Dave Hansen <dave.hansen@linux.intel.com>

If changing the page tables in such a way that an invalidation of all
contexts (aka. PCIDs / ASIDs) is required, they can be actively invalidated
by:

 1. INVPCID for each PCID (works for single pages too).

 2. Load CR3 with each PCID without the NOFLUSH bit set

 3. Load CR3 with the NOFLUSH bit set for each and do INVLPG for each address.

But, none of these are really feasible since there are ~6 ASIDs (12 with
PAGE_TABLE_ISOLATION) at the time that invalidation is required.
Instead of actively invalidating them, invalidate the *current* context and
also mark the cpu_tlbstate _quickly_ to indicate that a future invalidation
is required.

At the next context switch, look for this indicator ('invalidate_other'
being set) and invalidate all of the cpu_tlbstate.ctxs[] entries.

This ensures that any future context switches will do a full flush
of the TLB, picking up the previous changes.
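
The scheme can be modeled in a few lines of plain C (illustrative only;
the names mirror the kernel structures but this is not kernel code):

  #include <stdbool.h>
  #include <stdio.h>

  #define TLB_NR_DYN_ASIDS 6

  static struct {
          unsigned long long ctx_id[TLB_NR_DYN_ASIDS];
          int loaded_asid;
          bool invalidate_other;
  } tlbstate = { .loaded_asid = 2 };

  /* The cheap part: flag it, the current ASID stays valid. */
  static void invalidate_other_asid(void)
  {
          tlbstate.invalidate_other = true;
  }

  /* Paid at the next context switch: zap all non-current contexts. */
  static void clear_asid_other(void)
  {
          for (int asid = 0; asid < TLB_NR_DYN_ASIDS; asid++)
                  if (asid != tlbstate.loaded_asid)
                          tlbstate.ctx_id[asid] = 0; /* forces flush on reuse */
          tlbstate.invalidate_other = false;
  }

  int main(void)
  {
          for (int i = 0; i < TLB_NR_DYN_ASIDS; i++)
                  tlbstate.ctx_id[i] = 100 + i;

          invalidate_other_asid();
          if (tlbstate.invalidate_other)
                  clear_asid_other();

          for (int i = 0; i < TLB_NR_DYN_ASIDS; i++)
                  printf("asid %d: ctx_id %llu\n", i, tlbstate.ctx_id[i]);
          return 0;
  }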

[ tglx: Folded more fixups from Peter ]

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: David Laight <David.Laight@aculab.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Eduardo Valentin <eduval@amazon.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: aliguori@amazon.com
Cc: daniel.gruss@iaik.tugraz.at
Cc: hughd@google.com
Cc: keescook@google.com
---
 arch/x86/include/asm/tlbflush.h |   37 +++++++++++++++++++++++++++++--------
 arch/x86/mm/tlb.c               |   35 +++++++++++++++++++++++++++++++++++
 2 files changed, 64 insertions(+), 8 deletions(-)

--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -135,6 +135,17 @@ struct tlb_state {
 	bool is_lazy;
 
 	/*
+	 * If set we changed the page tables in such a way that we
+	 * needed an invalidation of all contexts (aka. PCIDs / ASIDs).
+	 * This tells us to go invalidate all the non-loaded ctxs[]
+	 * on the next context switch.
+	 *
+	 * The current ctx was kept up-to-date as it ran and does not
+	 * need to be invalidated.
+	 */
+	bool invalidate_other;
+
+	/*
 	 * Access to this CR4 shadow and to H/W CR4 is protected by
 	 * disabling interrupts when modifying either one.
 	 */
@@ -212,6 +223,14 @@ static inline unsigned long cr4_read_sha
 }
 
 /*
+ * Mark all other ASIDs as invalid, preserves the current.
+ */
+static inline void invalidate_other_asid(void)
+{
+	this_cpu_write(cpu_tlbstate.invalidate_other, true);
+}
+
+/*
  * Save some of cr4 feature set we're using (e.g.  Pentium 4MB
  * enable and PPro Global page enable), so that any CPU's that boot
  * up after us can get the correct flags.  This should only be used
@@ -298,14 +317,6 @@ static inline void __flush_tlb_all(void)
 		 */
 		__flush_tlb();
 	}
-
-	/*
-	 * Note: if we somehow had PCID but not PGE, then this wouldn't work --
-	 * we'd end up flushing kernel translations for the current ASID but
-	 * we might fail to flush kernel translations for other cached ASIDs.
-	 *
-	 * To avoid this issue, we force PCID off if PGE is off.
-	 */
 }
 
 /*
@@ -315,6 +326,16 @@ static inline void __flush_tlb_one(unsig
 {
 	count_vm_tlb_event(NR_TLB_LOCAL_FLUSH_ONE);
 	__flush_tlb_single(addr);
+
+	if (!static_cpu_has(X86_FEATURE_PTI))
+		return;
+
+	/*
+	 * __flush_tlb_single() will have cleared the TLB entry for this ASID,
+	 * but since kernel space is replicated across all, we must also
+	 * invalidate all others.
+	 */
+	invalidate_other_asid();
 }
 
 #define TLB_FLUSH_ALL	-1UL
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -28,6 +28,38 @@
  *	Implement flush IPI by CALL_FUNCTION_VECTOR, Alex Shi
  */
 
+/*
+ * We get here when we do something requiring a TLB invalidation
+ * but could not go invalidate all of the contexts.  We do the
+ * necessary invalidation by clearing out the 'ctx_id' which
+ * forces a TLB flush when the context is loaded.
+ */
+void clear_asid_other(void)
+{
+	u16 asid;
+
+	/*
+	 * This is only expected to be set if we have disabled
+	 * kernel _PAGE_GLOBAL pages.
+	 */
+	if (!static_cpu_has(X86_FEATURE_PTI)) {
+		WARN_ON_ONCE(1);
+		return;
+	}
+
+	for (asid = 0; asid < TLB_NR_DYN_ASIDS; asid++) {
+		/* Do not need to flush the current asid */
+		if (asid == this_cpu_read(cpu_tlbstate.loaded_mm_asid))
+			continue;
+		/*
+		 * Make sure the next time we go to switch to
+		 * this asid, we do a flush:
+		 */
+		this_cpu_write(cpu_tlbstate.ctxs[asid].ctx_id, 0);
+	}
+	this_cpu_write(cpu_tlbstate.invalidate_other, false);
+}
+
 atomic64_t last_mm_ctx_id = ATOMIC64_INIT(1);
 
 
@@ -42,6 +74,9 @@ static void choose_new_asid(struct mm_st
 		return;
 	}
 
+	if (this_cpu_read(cpu_tlbstate.invalidate_other))
+		clear_asid_other();
+
 	for (asid = 0; asid < TLB_NR_DYN_ASIDS; asid++) {
 		if (this_cpu_read(cpu_tlbstate.ctxs[asid].ctx_id) !=
 		    next->context.ctx_id)

^ permalink raw reply	[flat|nested] 85+ messages in thread

* [patch V181 44/54] x86/mm: Abstract switching CR3
  2017-12-20 21:35 [patch V181 00/54] x86/pti: Final XMAS release Thomas Gleixner
                   ` (42 preceding siblings ...)
  2017-12-20 21:35 ` [patch V181 43/54] x86/mm: Allow flushing for future ASID switches Thomas Gleixner
@ 2017-12-20 21:35 ` Thomas Gleixner
  2017-12-20 21:35 ` [patch V181 45/54] x86/mm: Use/Fix PCID to optimize user/kernel switches Thomas Gleixner
                   ` (12 subsequent siblings)
  56 siblings, 0 replies; 85+ messages in thread
From: Thomas Gleixner @ 2017-12-20 21:35 UTC (permalink / raw)
  To: LKML
  Cc: x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, keescook, hughd,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Rik van Riel,
	Boris Ostrovsky, Juergen Gross, David Laight, Eduardo Valentin,
	aliguori, Will Deacon, Vlastimil Babka, daniel.gruss,
	Dave Hansen, Ingo Molnar, Borislav Petkov, H. Peter Anvin

[-- Attachment #1: 0051-x86-mm-Abstract-switching-CR3.patch --]
[-- Type: text/plain, Size: 2628 bytes --]

From: Dave Hansen <dave.hansen@linux.intel.com>

In preparation to adding additional PCID flushing, abstract the
loading of a new ASID into CR3.

[ PeterZ: Split out from big combo patch ]

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: David Laight <David.Laight@aculab.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Eduardo Valentin <eduval@amazon.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: aliguori@amazon.com
Cc: daniel.gruss@iaik.tugraz.at
Cc: hughd@google.com
Cc: keescook@google.com
---
 arch/x86/mm/tlb.c |   22 ++++++++++++++++++++--
 1 file changed, 20 insertions(+), 2 deletions(-)

--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -100,6 +100,24 @@ static void choose_new_asid(struct mm_st
 	*need_flush = true;
 }
 
+static void load_new_mm_cr3(pgd_t *pgdir, u16 new_asid, bool need_flush)
+{
+	unsigned long new_mm_cr3;
+
+	if (need_flush) {
+		new_mm_cr3 = build_cr3(pgdir, new_asid);
+	} else {
+		new_mm_cr3 = build_cr3_noflush(pgdir, new_asid);
+	}
+
+	/*
+	 * Caution: many callers of this function expect
+	 * that load_cr3() is serializing and orders TLB
+	 * fills with respect to the mm_cpumask writes.
+	 */
+	write_cr3(new_mm_cr3);
+}
+
 void leave_mm(int cpu)
 {
 	struct mm_struct *loaded_mm = this_cpu_read(cpu_tlbstate.loaded_mm);
@@ -230,7 +248,7 @@ void switch_mm_irqs_off(struct mm_struct
 		if (need_flush) {
 			this_cpu_write(cpu_tlbstate.ctxs[new_asid].ctx_id, next->context.ctx_id);
 			this_cpu_write(cpu_tlbstate.ctxs[new_asid].tlb_gen, next_tlb_gen);
-			write_cr3(build_cr3(next->pgd, new_asid));
+			load_new_mm_cr3(next->pgd, new_asid, true);
 
 			/*
 			 * NB: This gets called via leave_mm() in the idle path
@@ -243,7 +261,7 @@ void switch_mm_irqs_off(struct mm_struct
 			trace_tlb_flush_rcuidle(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL);
 		} else {
 			/* The new ASID is already up to date. */
-			write_cr3(build_cr3_noflush(next->pgd, new_asid));
+			load_new_mm_cr3(next->pgd, new_asid, false);
 
 			/* See above wrt _rcuidle. */
 			trace_tlb_flush_rcuidle(TLB_FLUSH_ON_TASK_SWITCH, 0);

^ permalink raw reply	[flat|nested] 85+ messages in thread

* [patch V181 45/54] x86/mm: Use/Fix PCID to optimize user/kernel switches
  2017-12-20 21:35 [patch V181 00/54] x86/pti: Final XMAS release Thomas Gleixner
                   ` (43 preceding siblings ...)
  2017-12-20 21:35 ` [patch V181 44/54] x86/mm: Abstract switching CR3 Thomas Gleixner
@ 2017-12-20 21:35 ` Thomas Gleixner
  2017-12-20 21:35 ` [patch V181 46/54] x86/mm: Optimize RESTORE_CR3 Thomas Gleixner
                   ` (11 subsequent siblings)
  56 siblings, 0 replies; 85+ messages in thread
From: Thomas Gleixner @ 2017-12-20 21:35 UTC (permalink / raw)
  To: LKML
  Cc: x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, keescook, hughd,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Rik van Riel,
	Boris Ostrovsky, Juergen Gross, David Laight, Eduardo Valentin,
	aliguori, Will Deacon, Vlastimil Babka, daniel.gruss,
	Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin

[-- Attachment #1: 0052-x86-mm-Use-Fix-PCID-to-optimize-user-kernel-switches.patch --]
[-- Type: text/plain, Size: 14794 bytes --]

From: Peter Zijlstra <peterz@infradead.org>

We can use PCID to retain the TLBs across CR3 switches; including those now
part of the user/kernel switch. This increases performance of kernel
entry/exit at the cost of more expensive/complicated TLB flushing.

Now that we have two address spaces, one for kernel and one for user space,
we need two PCIDs per mm. We use the top PCID bit to indicate a user PCID
(just like we use the PFN LSB for the PGD). Since we do TLB invalidation
from kernel space, the existing code will only invalidate the kernel PCID,
we augment that by marking the corresponding user PCID invalid, and upon
switching back to userspace, use a flushing CR3 write for the switch.

In order to access the user_pcid_flush_mask we use PER_CPU storage, which
means the previously established SWAPGS vs CR3 ordering is now mandatory.

Having to do this memory access does require additional registers; most
sites have a functioning stack and can spill one (RAX), while sites without
a functional stack must provide the second scratch register themselves.

Note: PCID is generally available on Intel Sandybridge and later CPUs.
Note: Up until this point TLB flushing was broken in this series.
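
The resulting bit layout can be illustrated in plain C (illustrative only,
not kernel code; the constants mirror the hunks below):

  #include <stdio.h>

  #define PTI_SWITCH_BIT  11  /* X86_CR3_PTI_SWITCH_BIT: user/kernel PCID */
  #define PGTABLE_BIT     12  /* selects the user half of the 8k PGD */

  static unsigned int kern_pcid(unsigned int asid) { return asid + 1; }

  static unsigned int user_pcid(unsigned int asid)
  {
          return kern_pcid(asid) | (1u << PTI_SWITCH_BIT);
  }

  int main(void)
  {
          unsigned int asid = 2;

          printf("kernel PCID: %#x\n", kern_pcid(asid)); /* 0x3 */
          printf("user PCID:   %#x\n", user_pcid(asid)); /* 0x803 */
          /* PTI_SWITCH_MASK used by the entry asm: */
          printf("switch mask: %#x\n",
                 (1u << PGTABLE_BIT) | (1u << PTI_SWITCH_BIT)); /* 0x1800 */
          return 0;
  }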

Based-on-code-from: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Laight <David.Laight@aculab.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Eduardo Valentin <eduval@amazon.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: aliguori@amazon.com
Cc: daniel.gruss@iaik.tugraz.at
Cc: hughd@google.com
Cc: keescook@google.com
---
 arch/x86/entry/calling.h                    |   72 ++++++++++++++++++----
 arch/x86/entry/entry_64.S                   |    9 +-
 arch/x86/entry/entry_64_compat.S            |    4 -
 arch/x86/include/asm/processor-flags.h      |    5 +
 arch/x86/include/asm/tlbflush.h             |   91 ++++++++++++++++++++++++----
 arch/x86/include/uapi/asm/processor-flags.h |    7 +-
 arch/x86/kernel/asm-offsets.c               |    4 +
 arch/x86/mm/init.c                          |    2 
 arch/x86/mm/tlb.c                           |    1 
 9 files changed, 162 insertions(+), 33 deletions(-)

--- a/arch/x86/entry/calling.h
+++ b/arch/x86/entry/calling.h
@@ -3,6 +3,9 @@
 #include <asm/unwind_hints.h>
 #include <asm/cpufeatures.h>
 #include <asm/page_types.h>
+#include <asm/percpu.h>
+#include <asm/asm-offsets.h>
+#include <asm/processor-flags.h>
 
 /*
 
@@ -191,17 +194,21 @@ For 32-bit we have the following convent
 
 #ifdef CONFIG_PAGE_TABLE_ISOLATION
 
-/* PAGE_TABLE_ISOLATION PGDs are 8k.  Flip bit 12 to switch between the two halves: */
-#define PTI_SWITCH_MASK (1<<PAGE_SHIFT)
+/*
+ * PAGE_TABLE_ISOLATION PGDs are 8k.  Flip bit 12 to switch between the two
+ * halves:
+ */
+#define PTI_SWITCH_PGTABLES_MASK	(1<<PAGE_SHIFT)
+#define PTI_SWITCH_MASK		(PTI_SWITCH_PGTABLES_MASK|(1<<X86_CR3_PTI_SWITCH_BIT))
 
-.macro ADJUST_KERNEL_CR3 reg:req
-	/* Clear "PAGE_TABLE_ISOLATION bit", point CR3 at kernel pagetables: */
-	andq	$(~PTI_SWITCH_MASK), \reg
+.macro SET_NOFLUSH_BIT	reg:req
+	bts	$X86_CR3_PCID_NOFLUSH_BIT, \reg
 .endm
 
-.macro ADJUST_USER_CR3 reg:req
-	/* Move CR3 up a page to the user page tables: */
-	orq	$(PTI_SWITCH_MASK), \reg
+.macro ADJUST_KERNEL_CR3 reg:req
+	ALTERNATIVE "", "SET_NOFLUSH_BIT \reg", X86_FEATURE_PCID
+	/* Clear PCID and "PAGE_TABLE_ISOLATION bit", point CR3 at kernel pagetables: */
+	andq    $(~PTI_SWITCH_MASK), \reg
 .endm
 
 .macro SWITCH_TO_KERNEL_CR3 scratch_reg:req
@@ -212,21 +219,58 @@ For 32-bit we have the following convent
 .Lend_\@:
 .endm
 
-.macro SWITCH_TO_USER_CR3 scratch_reg:req
+#define THIS_CPU_user_pcid_flush_mask   \
+	PER_CPU_VAR(cpu_tlbstate) + TLB_STATE_user_pcid_flush_mask
+
+.macro SWITCH_TO_USER_CR3_NOSTACK scratch_reg:req scratch_reg2:req
 	ALTERNATIVE "jmp .Lend_\@", "", X86_FEATURE_PTI
 	mov	%cr3, \scratch_reg
-	ADJUST_USER_CR3 \scratch_reg
+
+	ALTERNATIVE "jmp .Lwrcr3_\@", "", X86_FEATURE_PCID
+
+	/*
+	 * Test if the ASID needs a flush.
+	 */
+	movq	\scratch_reg, \scratch_reg2
+	andq	$(0x7FF), \scratch_reg		/* mask ASID */
+	bt	\scratch_reg, THIS_CPU_user_pcid_flush_mask
+	jnc	.Lnoflush_\@
+
+	/* Flush needed, clear the bit */
+	btr	\scratch_reg, THIS_CPU_user_pcid_flush_mask
+	movq	\scratch_reg2, \scratch_reg
+	jmp	.Lwrcr3_\@
+
+.Lnoflush_\@:
+	movq	\scratch_reg2, \scratch_reg
+	SET_NOFLUSH_BIT \scratch_reg
+
+.Lwrcr3_\@:
+	/* Flip the PGD and ASID to the user version */
+	orq     $(PTI_SWITCH_MASK), \scratch_reg
 	mov	\scratch_reg, %cr3
 .Lend_\@:
 .endm
 
+.macro SWITCH_TO_USER_CR3_STACK	scratch_reg:req
+	pushq	%rax
+	SWITCH_TO_USER_CR3_NOSTACK scratch_reg=\scratch_reg scratch_reg2=%rax
+	popq	%rax
+.endm
+
 .macro SAVE_AND_SWITCH_TO_KERNEL_CR3 scratch_reg:req save_reg:req
 	ALTERNATIVE "jmp .Ldone_\@", "", X86_FEATURE_PTI
 	movq	%cr3, \scratch_reg
 	movq	\scratch_reg, \save_reg
 	/*
-	 * Is the switch bit zero?  This means the address is
-	 * up in real PAGE_TABLE_ISOLATION patches in a moment.
+	 * Is the "switch mask" all zero?  That means that both of
+	 * these are zero:
+	 *
+	 *	1. The user/kernel PCID bit, and
+	 *	2. The user/kernel "bit" that points CR3 to the
+	 *	   bottom half of the 8k PGD
+	 *
+	 * That indicates a kernel CR3 value, not a user CR3.
 	 */
 	testq	$(PTI_SWITCH_MASK), \scratch_reg
 	jz	.Ldone_\@
@@ -251,7 +295,9 @@ For 32-bit we have the following convent
 
 .macro SWITCH_TO_KERNEL_CR3 scratch_reg:req
 .endm
-.macro SWITCH_TO_USER_CR3 scratch_reg:req
+.macro SWITCH_TO_USER_CR3_NOSTACK scratch_reg:req scratch_reg2:req
+.endm
+.macro SWITCH_TO_USER_CR3_STACK scratch_reg:req
 .endm
 .macro SAVE_AND_SWITCH_TO_KERNEL_CR3 scratch_reg:req save_reg:req
 .endm
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -23,7 +23,6 @@
 #include <asm/segment.h>
 #include <asm/cache.h>
 #include <asm/errno.h>
-#include "calling.h"
 #include <asm/asm-offsets.h>
 #include <asm/msr.h>
 #include <asm/unistd.h>
@@ -40,6 +39,8 @@
 #include <asm/frame.h>
 #include <linux/err.h>
 
+#include "calling.h"
+
 .code64
 .section .entry.text, "ax"
 
@@ -406,7 +407,7 @@ GLOBAL(entry_SYSCALL_64_after_hwframe)
 	 * We are on the trampoline stack.  All regs except RDI are live.
 	 * We can do future final exit work right here.
 	 */
-	SWITCH_TO_USER_CR3 scratch_reg=%rdi
+	SWITCH_TO_USER_CR3_STACK scratch_reg=%rdi
 
 	popq	%rdi
 	popq	%rsp
@@ -744,7 +745,7 @@ GLOBAL(swapgs_restore_regs_and_return_to
 	 * We can do future final exit work right here.
 	 */
 
-	SWITCH_TO_USER_CR3 scratch_reg=%rdi
+	SWITCH_TO_USER_CR3_STACK scratch_reg=%rdi
 
 	/* Restore RDI. */
 	popq	%rdi
@@ -857,7 +858,7 @@ ENTRY(native_iret)
 	 */
 	orq	PER_CPU_VAR(espfix_stack), %rax
 
-	SWITCH_TO_USER_CR3 scratch_reg=%rdi	/* to user CR3 */
+	SWITCH_TO_USER_CR3_STACK scratch_reg=%rdi
 	SWAPGS					/* to user GS */
 	popq	%rdi				/* Restore user RDI */
 
--- a/arch/x86/entry/entry_64_compat.S
+++ b/arch/x86/entry/entry_64_compat.S
@@ -275,9 +275,9 @@ GLOBAL(entry_SYSCALL_compat_after_hwfram
 	 * switch until after after the last reference to the process
 	 * stack.
 	 *
-	 * %r8 is zeroed before the sysret, thus safe to clobber.
+	 * %r8/%r9 are zeroed before the sysret, thus safe to clobber.
 	 */
-	SWITCH_TO_USER_CR3 scratch_reg=%r8
+	SWITCH_TO_USER_CR3_NOSTACK scratch_reg=%r8 scratch_reg2=%r9
 
 	xorq	%r8, %r8
 	xorq	%r9, %r9
--- a/arch/x86/include/asm/processor-flags.h
+++ b/arch/x86/include/asm/processor-flags.h
@@ -38,6 +38,11 @@
 #define CR3_ADDR_MASK	__sme_clr(0x7FFFFFFFFFFFF000ull)
 #define CR3_PCID_MASK	0xFFFull
 #define CR3_NOFLUSH	BIT_ULL(63)
+
+#ifdef CONFIG_PAGE_TABLE_ISOLATION
+# define X86_CR3_PTI_SWITCH_BIT	11
+#endif
+
 #else
 /*
  * CR3_ADDR_MASK needs at least bits 31:5 set on PAE systems, and we save
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -10,6 +10,8 @@
 #include <asm/special_insns.h>
 #include <asm/smp.h>
 #include <asm/invpcid.h>
+#include <asm/pti.h>
+#include <asm/processor-flags.h>
 
 static inline u64 inc_mm_tlb_gen(struct mm_struct *mm)
 {
@@ -24,24 +26,54 @@ static inline u64 inc_mm_tlb_gen(struct
 
 /* There are 12 bits of space for ASIDS in CR3 */
 #define CR3_HW_ASID_BITS		12
+
 /*
  * When enabled, PAGE_TABLE_ISOLATION consumes a single bit for
  * user/kernel switches
  */
-#define PTI_CONSUMED_ASID_BITS		0
+#ifdef CONFIG_PAGE_TABLE_ISOLATION
+# define PTI_CONSUMED_PCID_BITS	1
+#else
+# define PTI_CONSUMED_PCID_BITS	0
+#endif
+
+#define CR3_AVAIL_PCID_BITS (X86_CR3_PCID_BITS - PTI_CONSUMED_PCID_BITS)
 
-#define CR3_AVAIL_ASID_BITS (CR3_HW_ASID_BITS - PTI_CONSUMED_ASID_BITS)
 /*
  * ASIDs are zero-based: 0->MAX_AVAIL_ASID are valid.  -1 below to account
  * for them being zero-based.  Another -1 is because ASID 0 is reserved for
  * use by non-PCID-aware users.
  */
-#define MAX_ASID_AVAILABLE ((1 << CR3_AVAIL_ASID_BITS) - 2)
+#define MAX_ASID_AVAILABLE ((1 << CR3_AVAIL_PCID_BITS) - 2)
+
+/*
+ * 6 because 6 should be plenty and struct tlb_state will fit in two cache
+ * lines.
+ */
+#define TLB_NR_DYN_ASIDS	6
 
 static inline u16 kern_pcid(u16 asid)
 {
 	VM_WARN_ON_ONCE(asid > MAX_ASID_AVAILABLE);
+
+#ifdef CONFIG_PAGE_TABLE_ISOLATION
 	/*
+	 * Make sure that the dynamic ASID space does not conflict with the
+	 * bit we are using to switch between user and kernel ASIDs.
+	 */
+	BUILD_BUG_ON(TLB_NR_DYN_ASIDS >= (1 << X86_CR3_PTI_SWITCH_BIT));
+
+	/*
+	 * The ASID being passed in here should have respected the
+	 * MAX_ASID_AVAILABLE and thus never have the switch bit set.
+	 */
+	VM_WARN_ON_ONCE(asid & (1 << X86_CR3_PTI_SWITCH_BIT));
+#endif
+	/*
+	 * The dynamically-assigned ASIDs that get passed in are small
+	 * (<TLB_NR_DYN_ASIDS).  They never have the high switch bit set,
+	 * so do not bother to clear it.
+	 *
 	 * If PCID is on, ASID-aware code paths put the ASID+1 into the
 	 * PCID bits.  This serves two purposes.  It prevents a nasty
 	 * situation in which PCID-unaware code saves CR3, loads some other
@@ -95,12 +127,6 @@ static inline bool tlb_defer_switch_to_i
 	return !static_cpu_has(X86_FEATURE_PCID);
 }
 
-/*
- * 6 because 6 should be plenty and struct tlb_state will fit in
- * two cache lines.
- */
-#define TLB_NR_DYN_ASIDS 6
-
 struct tlb_context {
 	u64 ctx_id;
 	u64 tlb_gen;
@@ -146,6 +172,13 @@ struct tlb_state {
 	bool invalidate_other;
 
 	/*
+	 * Mask that contains TLB_NR_DYN_ASIDS+1 bits to indicate
+	 * the corresponding user PCID needs a flush next time we
+	 * switch to it; see SWITCH_TO_USER_CR3.
+	 */
+	unsigned short user_pcid_flush_mask;
+
+	/*
 	 * Access to this CR4 shadow and to H/W CR4 is protected by
 	 * disabling interrupts when modifying either one.
 	 */
@@ -250,14 +283,41 @@ static inline void cr4_set_bits_and_upda
 extern void initialize_tlbstate_and_flush(void);
 
 /*
+ * Given an ASID, flush the corresponding user ASID.  We can delay this
+ * until the next time we switch to it.
+ *
+ * See SWITCH_TO_USER_CR3.
+ */
+static inline void invalidate_user_asid(u16 asid)
+{
+	/* There is no user ASID if address space separation is off */
+	if (!IS_ENABLED(CONFIG_PAGE_TABLE_ISOLATION))
+		return;
+
+	/*
+	 * We only have a single ASID if PCID is off and the CR3
+	 * write will have flushed it.
+	 */
+	if (!cpu_feature_enabled(X86_FEATURE_PCID))
+		return;
+
+	if (!static_cpu_has(X86_FEATURE_PTI))
+		return;
+
+	__set_bit(kern_pcid(asid),
+		  (unsigned long *)this_cpu_ptr(&cpu_tlbstate.user_pcid_flush_mask));
+}
+
+/*
  * flush the entire current user mapping
  */
 static inline void __native_flush_tlb(void)
 {
+	invalidate_user_asid(this_cpu_read(cpu_tlbstate.loaded_mm_asid));
 	/*
-	 * If current->mm == NULL then we borrow a mm which may change during a
-	 * task switch and therefore we must not be preempted while we write CR3
-	 * back:
+	 * If current->mm == NULL then we borrow a mm which may change
+	 * during a task switch and therefore we must not be preempted
+	 * while we write CR3 back:
 	 */
 	preempt_disable();
 	native_write_cr3(__native_read_cr3());
@@ -301,7 +361,14 @@ static inline void __native_flush_tlb_gl
  */
 static inline void __native_flush_tlb_single(unsigned long addr)
 {
+	u32 loaded_mm_asid = this_cpu_read(cpu_tlbstate.loaded_mm_asid);
+
 	asm volatile("invlpg (%0)" ::"r" (addr) : "memory");
+
+	if (!static_cpu_has(X86_FEATURE_PTI))
+		return;
+
+	invalidate_user_asid(loaded_mm_asid);
 }
 
 /*
--- a/arch/x86/include/uapi/asm/processor-flags.h
+++ b/arch/x86/include/uapi/asm/processor-flags.h
@@ -78,7 +78,12 @@
 #define X86_CR3_PWT		_BITUL(X86_CR3_PWT_BIT)
 #define X86_CR3_PCD_BIT		4 /* Page Cache Disable */
 #define X86_CR3_PCD		_BITUL(X86_CR3_PCD_BIT)
-#define X86_CR3_PCID_MASK	_AC(0x00000fff,UL) /* PCID Mask */
+
+#define X86_CR3_PCID_BITS	12
+#define X86_CR3_PCID_MASK	(_AC((1UL << X86_CR3_PCID_BITS) - 1, UL))
+
+#define X86_CR3_PCID_NOFLUSH_BIT 63 /* Preserve old PCID */
+#define X86_CR3_PCID_NOFLUSH    _BITULL(X86_CR3_PCID_NOFLUSH_BIT)
 
 /*
  * Intel CPU features in CR4
--- a/arch/x86/kernel/asm-offsets.c
+++ b/arch/x86/kernel/asm-offsets.c
@@ -17,6 +17,7 @@
 #include <asm/sigframe.h>
 #include <asm/bootparam.h>
 #include <asm/suspend.h>
+#include <asm/tlbflush.h>
 
 #ifdef CONFIG_XEN
 #include <xen/interface/xen.h>
@@ -94,6 +95,9 @@ void common(void) {
 	BLANK();
 	DEFINE(PTREGS_SIZE, sizeof(struct pt_regs));
 
+	/* TLB state for the entry code */
+	OFFSET(TLB_STATE_user_pcid_flush_mask, tlb_state, user_pcid_flush_mask);
+
 	/* Layout info for cpu_entry_area */
 	OFFSET(CPU_ENTRY_AREA_tss, cpu_entry_area, tss);
 	OFFSET(CPU_ENTRY_AREA_entry_trampoline, cpu_entry_area, entry_trampoline);
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -855,7 +855,7 @@ void __init zone_sizes_init(void)
 	free_area_init_nodes(max_zone_pfns);
 }
 
-DEFINE_PER_CPU_SHARED_ALIGNED(struct tlb_state, cpu_tlbstate) = {
+__visible DEFINE_PER_CPU_SHARED_ALIGNED(struct tlb_state, cpu_tlbstate) = {
 	.loaded_mm = &init_mm,
 	.next_asid = 1,
 	.cr4 = ~0UL,	/* fail hard if we screw up cr4 shadow initialization */
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -105,6 +105,7 @@ static void load_new_mm_cr3(pgd_t *pgdir
 	unsigned long new_mm_cr3;
 
 	if (need_flush) {
+		invalidate_user_asid(new_asid);
 		new_mm_cr3 = build_cr3(pgdir, new_asid);
 	} else {
 		new_mm_cr3 = build_cr3_noflush(pgdir, new_asid);

^ permalink raw reply	[flat|nested] 85+ messages in thread

* [patch V181 46/54] x86/mm: Optimize RESTORE_CR3
  2017-12-20 21:35 [patch V181 00/54] x86/pti: Final XMAS release Thomas Gleixner
                   ` (44 preceding siblings ...)
  2017-12-20 21:35 ` [patch V181 45/54] x86/mm: Use/Fix PCID to optimize user/kernel switches Thomas Gleixner
@ 2017-12-20 21:35 ` Thomas Gleixner
  2017-12-20 21:35 ` [patch V181 47/54] x86/mm: Use INVPCID for __native_flush_tlb_single() Thomas Gleixner
                   ` (10 subsequent siblings)
  56 siblings, 0 replies; 85+ messages in thread
From: Thomas Gleixner @ 2017-12-20 21:35 UTC (permalink / raw)
  To: LKML
  Cc: x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, keescook, hughd,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Rik van Riel,
	Boris Ostrovsky, Juergen Gross, David Laight, Eduardo Valentin,
	aliguori, Will Deacon, Vlastimil Babka, daniel.gruss,
	Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin

[-- Attachment #1: 0053-x86-mm-Optimize-RESTORE_CR3.patch --]
[-- Type: text/plain, Size: 3234 bytes --]

From: Peter Zijlstra <peterz@infradead.org>

Most NMI/paranoid exceptions will not in fact change pagetables and would
thus not require TLB flushing; RESTORE_CR3, however, uses flushing CR3
writes.

Restores to kernel PCIDs can be NOFLUSH, because we explicitly flush the
kernel mappings. And now that we track which user PCIDs need flushing, we
can avoid those flushes too when possible.

This does mean RESTORE_CR3 needs an additional scratch_reg, luckily both
sites have plenty available.
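
The decision logic can be rendered as plain C (illustrative only; the real
code is the asm macro in the hunk below, and the mask handling here is a
simplified userspace stand-in):

  #include <stdbool.h>
  #include <stdio.h>

  #define PTI_SWITCH_BIT  11
  #define CR3_NOFLUSH     (1ULL << 63)

  static unsigned short user_pcid_flush_mask;

  static bool test_and_clear_flush(unsigned int asid)
  {
          bool pending = user_pcid_flush_mask & (1u << asid);

          user_pcid_flush_mask &= ~(1u << asid);
          return pending;
  }

  static unsigned long long restore_cr3(unsigned long long saved)
  {
          /* Kernel CR3: kernel mappings are flushed explicitly -> NOFLUSH */
          if (!(saved & (1ULL << PTI_SWITCH_BIT)))
                  return saved | CR3_NOFLUSH;
          /* User CR3: a flushing write only if this ASID is marked pending */
          if (!test_and_clear_flush(saved & 0x7FF))
                  return saved | CR3_NOFLUSH;
          return saved;
  }

  int main(void)
  {
          user_pcid_flush_mask = 1u << 3;         /* ASID 3 needs a flush */
          printf("%llx\n", restore_cr3(0x1803));  /* flushing write */
          printf("%llx\n", restore_cr3(0x1803));  /* NOFLUSH this time */
          return 0;
  }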

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Laight <David.Laight@aculab.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Eduardo Valentin <eduval@amazon.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: aliguori@amazon.com
Cc: daniel.gruss@iaik.tugraz.at
Cc: hughd@google.com
Cc: keescook@google.com
---
 arch/x86/entry/calling.h  |   30 ++++++++++++++++++++++++++++--
 arch/x86/entry/entry_64.S |    4 ++--
 2 files changed, 30 insertions(+), 4 deletions(-)

--- a/arch/x86/entry/calling.h
+++ b/arch/x86/entry/calling.h
@@ -281,8 +281,34 @@ For 32-bit we have the following convent
 .Ldone_\@:
 .endm
 
-.macro RESTORE_CR3 save_reg:req
+.macro RESTORE_CR3 scratch_reg:req save_reg:req
 	ALTERNATIVE "jmp .Lend_\@", "", X86_FEATURE_PTI
+
+	ALTERNATIVE "jmp .Lwrcr3_\@", "", X86_FEATURE_PCID
+
+	/*
+	 * KERNEL pages can always resume with NOFLUSH as we do
+	 * explicit flushes.
+	 */
+	bt	$X86_CR3_PTI_SWITCH_BIT, \save_reg
+	jnc	.Lnoflush_\@
+
+	/*
+	 * Check if there's a pending flush for the user ASID we're
+	 * about to set.
+	 */
+	movq	\save_reg, \scratch_reg
+	andq	$(0x7FF), \scratch_reg
+	bt	\scratch_reg, THIS_CPU_user_pcid_flush_mask
+	jnc	.Lnoflush_\@
+
+	btr	\scratch_reg, THIS_CPU_user_pcid_flush_mask
+	jmp	.Lwrcr3_\@
+
+.Lnoflush_\@:
+	SET_NOFLUSH_BIT \save_reg
+
+.Lwrcr3_\@:
 	/*
 	 * The CR3 write could be avoided when not changing its value,
 	 * but would require a CR3 read *and* a scratch register.
@@ -301,7 +327,7 @@ For 32-bit we have the following convent
 .endm
 .macro SAVE_AND_SWITCH_TO_KERNEL_CR3 scratch_reg:req save_reg:req
 .endm
-.macro RESTORE_CR3 save_reg:req
+.macro RESTORE_CR3 scratch_reg:req save_reg:req
 .endm
 
 #endif
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -1288,7 +1288,7 @@ ENTRY(paranoid_exit)
 	testl	%ebx, %ebx			/* swapgs needed? */
 	jnz	.Lparanoid_exit_no_swapgs
 	TRACE_IRQS_IRETQ
-	RESTORE_CR3	save_reg=%r14
+	RESTORE_CR3	scratch_reg=%rbx save_reg=%r14
 	SWAPGS_UNSAFE_STACK
 	jmp	.Lparanoid_exit_restore
 .Lparanoid_exit_no_swapgs:
@@ -1730,7 +1730,7 @@ ENTRY(nmi)
 	movq	$-1, %rsi
 	call	do_nmi
 
-	RESTORE_CR3 save_reg=%r14
+	RESTORE_CR3 scratch_reg=%r15 save_reg=%r14
 
 	testl	%ebx, %ebx			/* swapgs needed? */
 	jnz	nmi_restore

^ permalink raw reply	[flat|nested] 85+ messages in thread

* [patch V181 47/54] x86/mm: Use INVPCID for __native_flush_tlb_single()
  2017-12-20 21:35 [patch V181 00/54] x86/pti: Final XMAS release Thomas Gleixner
                   ` (45 preceding siblings ...)
  2017-12-20 21:35 ` [patch V181 46/54] x86/mm: Optimize RESTORE_CR3 Thomas Gleixner
@ 2017-12-20 21:35 ` Thomas Gleixner
  2017-12-20 21:35   ` Thomas Gleixner
                   ` (9 subsequent siblings)
  56 siblings, 0 replies; 85+ messages in thread
From: Thomas Gleixner @ 2017-12-20 21:35 UTC (permalink / raw)
  To: LKML
  Cc: x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, keescook, hughd,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Rik van Riel,
	Boris Ostrovsky, Juergen Gross, David Laight, Eduardo Valentin,
	aliguori, Will Deacon, Vlastimil Babka, daniel.gruss,
	Dave Hansen, Ingo Molnar, Borislav Petkov, H. Peter Anvin

[-- Attachment #1: 0054-x86-mm-Use-INVPCID-for-__native_flush_tlb_single.patch --]
[-- Type: text/plain, Size: 5739 bytes --]

From: Dave Hansen <dave.hansen@linux.intel.com>

This uses INVPCID to shoot down individual lines of the user mapping
instead of marking the entire user map as invalid. This
could/might/possibly be faster.

This for sure needs tlb_single_page_flush_ceiling to be redetermined;
esp. since INVPCID is _slow_.

A detailed performance analysis is available here:

  https://lkml.kernel.org/r/3062e486-3539-8a1f-5724-16199420be71@intel.com

[ PeterZ: Split out from big combo patch ]

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Eduardo Valentin <eduval@amazon.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: aliguori@amazon.com
Cc: daniel.gruss@iaik.tugraz.at
Cc: hughd@google.com
Cc: keescook@google.com
---
 arch/x86/include/asm/cpufeatures.h |    1 
 arch/x86/include/asm/tlbflush.h    |   23 ++++++++++++-
 arch/x86/mm/init.c                 |   64 +++++++++++++++++++++----------------
 3 files changed, 60 insertions(+), 28 deletions(-)

--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -197,6 +197,7 @@
 #define X86_FEATURE_CAT_L3		( 7*32+ 4) /* Cache Allocation Technology L3 */
 #define X86_FEATURE_CAT_L2		( 7*32+ 5) /* Cache Allocation Technology L2 */
 #define X86_FEATURE_CDP_L3		( 7*32+ 6) /* Code and Data Prioritization L3 */
+#define X86_FEATURE_INVPCID_SINGLE	( 7*32+ 7) /* Effectively INVPCID && CR4.PCIDE=1 */
 
 #define X86_FEATURE_HW_PSTATE		( 7*32+ 8) /* AMD HW-PState */
 #define X86_FEATURE_PROC_FEEDBACK	( 7*32+ 9) /* AMD ProcFeedbackInterface */
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -85,6 +85,18 @@ static inline u16 kern_pcid(u16 asid)
 	return asid + 1;
 }
 
+/*
+ * The user PCID is just the kernel one, plus the "switch bit".
+ */
+static inline u16 user_pcid(u16 asid)
+{
+	u16 ret = kern_pcid(asid);
+#ifdef CONFIG_PAGE_TABLE_ISOLATION
+	ret |= 1 << X86_CR3_PTI_SWITCH_BIT;
+#endif
+	return ret;
+}
+
 struct pgd_t;
 static inline unsigned long build_cr3(pgd_t *pgd, u16 asid)
 {
@@ -335,6 +347,8 @@ static inline void __native_flush_tlb_gl
 		/*
 		 * Using INVPCID is considerably faster than a pair of writes
 		 * to CR4 sandwiched inside an IRQ flag save/restore.
+		 *
+		 * Note, this works with CR4.PCIDE=0 or 1.
 		 */
 		invpcid_flush_all();
 		return;
@@ -368,7 +382,14 @@ static inline void __native_flush_tlb_si
 	if (!static_cpu_has(X86_FEATURE_PTI))
 		return;
 
-	invalidate_user_asid(loaded_mm_asid);
+	/*
+	 * Some platforms #GP if we call invpcid(type=1/2) before CR4.PCIDE=1.
+	 * Just use invalidate_user_asid() in case we are called early.
+	 */
+	if (!this_cpu_has(X86_FEATURE_INVPCID_SINGLE))
+		invalidate_user_asid(loaded_mm_asid);
+	else
+		invpcid_flush_one(user_pcid(loaded_mm_asid), addr);
 }
 
 /*
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -203,34 +203,44 @@ static void __init probe_page_size_mask(
 
 static void setup_pcid(void)
 {
-#ifdef CONFIG_X86_64
-	if (boot_cpu_has(X86_FEATURE_PCID)) {
-		if (boot_cpu_has(X86_FEATURE_PGE)) {
-			/*
-			 * This can't be cr4_set_bits_and_update_boot() --
-			 * the trampoline code can't handle CR4.PCIDE and
-			 * it wouldn't do any good anyway.  Despite the name,
-			 * cr4_set_bits_and_update_boot() doesn't actually
-			 * cause the bits in question to remain set all the
-			 * way through the secondary boot asm.
-			 *
-			 * Instead, we brute-force it and set CR4.PCIDE
-			 * manually in start_secondary().
-			 */
-			cr4_set_bits(X86_CR4_PCIDE);
-		} else {
-			/*
-			 * flush_tlb_all(), as currently implemented, won't
-			 * work if PCID is on but PGE is not.  Since that
-			 * combination doesn't exist on real hardware, there's
-			 * no reason to try to fully support it, but it's
-			 * polite to avoid corrupting data if we're on
-			 * an improperly configured VM.
-			 */
-			setup_clear_cpu_cap(X86_FEATURE_PCID);
-		}
+	if (!IS_ENABLED(CONFIG_X86_64))
+		return;
+
+	if (!boot_cpu_has(X86_FEATURE_PCID))
+		return;
+
+	if (boot_cpu_has(X86_FEATURE_PGE)) {
+		/*
+		 * This can't be cr4_set_bits_and_update_boot() -- the
+		 * trampoline code can't handle CR4.PCIDE and it wouldn't
+		 * do any good anyway.  Despite the name,
+		 * cr4_set_bits_and_update_boot() doesn't actually cause
+		 * the bits in question to remain set all the way through
+		 * the secondary boot asm.
+		 *
+		 * Instead, we brute-force it and set CR4.PCIDE manually in
+		 * start_secondary().
+		 */
+		cr4_set_bits(X86_CR4_PCIDE);
+
+		/*
+		 * INVPCID's single-context modes (2/3) only work if we set
+		 * X86_CR4_PCIDE, *and* we have INVPCID support.  It's unusable
+		 * on systems that have X86_CR4_PCIDE clear, or that have
+		 * no INVPCID support at all.
+		 */
+		if (boot_cpu_has(X86_FEATURE_INVPCID))
+			setup_force_cpu_cap(X86_FEATURE_INVPCID_SINGLE);
+	} else {
+		/*
+		 * flush_tlb_all(), as currently implemented, won't work if
+		 * PCID is on but PGE is not.  Since that combination
+		 * doesn't exist on real hardware, there's no reason to try
+		 * to fully support it, but it's polite to avoid corrupting
+		 * data if we're on an improperly configured VM.
+		 */
+		setup_clear_cpu_cap(X86_FEATURE_PCID);
 	}
-#endif
 }
 
 #ifdef CONFIG_X86_32

^ permalink raw reply	[flat|nested] 85+ messages in thread

* [patch V181 48/54] x86/mm: Clarify the whole ASID/kernel PCID/user PCID naming
  2017-12-20 21:35 [patch V181 00/54] x86/pti: Final XMAS release Thomas Gleixner
@ 2017-12-20 21:35   ` Thomas Gleixner
  2017-12-20 21:35 ` [patch V181 02/54] x86/mm/dump_pagetables: Check PAGE_PRESENT for real Thomas Gleixner
                     ` (55 subsequent siblings)
  56 siblings, 0 replies; 85+ messages in thread
From: Thomas Gleixner @ 2017-12-20 21:35 UTC (permalink / raw)
  To: LKML
  Cc: x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, keescook, hughd,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Rik van Riel,
	Boris Ostrovsky, Juergen Gross, David Laight, Eduardo Valentin,
	aliguori, Will Deacon, Vlastimil Babka, daniel.gruss,
	Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin,
	linux-mm

[-- Attachment #1: 0068-x86-mm-Clarify-the-whole-ASID-kernel-PCID-user-PCID-.patch --]
[-- Type: text/plain, Size: 4284 bytes --]

From: Peter Zijlstra <peterz@infradead.org>

Clarify the naming of the three spaces involved (ASID, kPCID and uPCID);
see the comment block added below. Ideally we'd also use sparse to
enforce this separation so it becomes much more difficult to mess up.
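
As a worked example of the mapping (a sketch; it assumes the PTI switch
bit sits at bit 11, which matches the +2048 offset in the comment block
added below):

static unsigned short sketch_kern_pcid(unsigned short asid)
{
	return asid + 1;		/* PCID 0 is reserved */
}

static unsigned short sketch_user_pcid(unsigned short asid)
{
	return sketch_kern_pcid(asid) | (1U << 11);
}

/* ASID 0 -> kPCID 1 / uPCID 2049, ASID 5 -> kPCID 6 / uPCID 2054 */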

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Laight <David.Laight@aculab.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Eduardo Valentin <eduval@amazon.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: aliguori@amazon.com
Cc: daniel.gruss@iaik.tugraz.at
Cc: hughd@google.com
Cc: keescook@google.com
Cc: linux-mm@kvack.org
---
 arch/x86/include/asm/tlbflush.h |   55 +++++++++++++++++++++++++++++++---------
 1 file changed, 43 insertions(+), 12 deletions(-)

--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -13,16 +13,33 @@
 #include <asm/pti.h>
 #include <asm/processor-flags.h>
 
-static inline u64 inc_mm_tlb_gen(struct mm_struct *mm)
-{
-	/*
-	 * Bump the generation count.  This also serves as a full barrier
-	 * that synchronizes with switch_mm(): callers are required to order
-	 * their read of mm_cpumask after their writes to the paging
-	 * structures.
-	 */
-	return atomic64_inc_return(&mm->context.tlb_gen);
-}
+/*
+ * The x86 feature is called PCID (Process Context IDentifier). It is similar
+ * to what is traditionally called ASID on the RISC processors.
+ *
+ * We don't use the traditional ASID implementation, where each process/mm gets
+ * its own ASID and we flush and restart when we run out of ASID space.
+ *
+ * Instead we have a small per-cpu array of ASIDs and cache the last few mm's
+ * that came by on this CPU, allowing cheaper switch_mm between processes on
+ * this CPU.
+ *
+ * We end up with different spaces for different things. To avoid confusion we
+ * use different names for each of them:
+ *
+ * ASID  - [0, TLB_NR_DYN_ASIDS-1]
+ *         the canonical identifier for an mm
+ *
+ * kPCID - [1, TLB_NR_DYN_ASIDS]
+ *         the value we write into the PCID part of CR3; corresponds to the
+ *         ASID+1, because PCID 0 is special.
+ *
+ * uPCID - [2048 + 1, 2048 + TLB_NR_DYN_ASIDS]
+ *         for KPTI each mm has two address spaces and thus needs two
+ *         PCID values, but we can still do with a single ASID denomination
+ *         for each mm. Corresponds to kPCID + 2048.
+ *
+ */
 
 /* There are 12 bits of space for ASIDS in CR3 */
 #define CR3_HW_ASID_BITS		12
@@ -41,7 +58,7 @@ static inline u64 inc_mm_tlb_gen(struct
 
 /*
  * ASIDs are zero-based: 0->MAX_AVAIL_ASID are valid.  -1 below to account
- * for them being zero-based.  Another -1 is because ASID 0 is reserved for
+ * for them being zero-based.  Another -1 is because PCID 0 is reserved for
  * use by non-PCID-aware users.
  */
 #define MAX_ASID_AVAILABLE ((1 << CR3_AVAIL_PCID_BITS) - 2)
@@ -52,6 +69,9 @@ static inline u64 inc_mm_tlb_gen(struct
  */
 #define TLB_NR_DYN_ASIDS	6
 
+/*
+ * Given @asid, compute kPCID
+ */
 static inline u16 kern_pcid(u16 asid)
 {
 	VM_WARN_ON_ONCE(asid > MAX_ASID_AVAILABLE);
@@ -86,7 +106,7 @@ static inline u16 kern_pcid(u16 asid)
 }
 
 /*
- * The user PCID is just the kernel one, plus the "switch bit".
+ * Given @asid, compute uPCID
  */
 static inline u16 user_pcid(u16 asid)
 {
@@ -484,6 +504,17 @@ static inline void flush_tlb_page(struct
 void native_flush_tlb_others(const struct cpumask *cpumask,
 			     const struct flush_tlb_info *info);
 
+static inline u64 inc_mm_tlb_gen(struct mm_struct *mm)
+{
+	/*
+	 * Bump the generation count.  This also serves as a full barrier
+	 * that synchronizes with switch_mm(): callers are required to order
+	 * their read of mm_cpumask after their writes to the paging
+	 * structures.
+	 */
+	return atomic64_inc_return(&mm->context.tlb_gen);
+}
+
 static inline void arch_tlbbatch_add_mm(struct arch_tlbflush_unmap_batch *batch,
 					struct mm_struct *mm)
 {

^ permalink raw reply	[flat|nested] 85+ messages in thread

* [patch V181 49/54] x86/dumpstack: Indicate in Oops whether pti is configured and enabled
  2017-12-20 21:35 [patch V181 00/54] x86/pti: Final XMAS release Thomas Gleixner
                   ` (47 preceding siblings ...)
  2017-12-20 21:35   ` Thomas Gleixner
@ 2017-12-20 21:35 ` Thomas Gleixner
  2017-12-20 22:03   ` Jiri Kosina
  2017-12-20 21:35   ` Thomas Gleixner
                   ` (7 subsequent siblings)
  56 siblings, 1 reply; 85+ messages in thread
From: Thomas Gleixner @ 2017-12-20 21:35 UTC (permalink / raw)
  To: LKML
  Cc: x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, keescook, hughd,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Rik van Riel,
	Boris Ostrovsky, Juergen Gross, David Laight, Eduardo Valentin,
	aliguori, Will Deacon, Vlastimil Babka, daniel.gruss, jkosina

[-- Attachment #1: x86-dumpstack--Indicate_in_Oops_whether_pti_is_configured_and_enabled.patch --]
[-- Type: text/plain, Size: 2385 bytes --]

From: Vlastimil Babka <vbabka@suse.cz>

CONFIG_PAGE_TABLE_ISOLATION is a relatively new and intrusive feature that may
still have some corner cases which could take some time to manifest and be
fixed. It would be useful to have Oops messages indicate whether it was
enabled for building the kernel, and whether it was disabled during boot.

Example of fully enabled:

	Oops: 0001 [#1] SMP PTI

Example of enabled during build, but disabled during boot:

	Oops: 0001 [#1] SMP NOPTI

We can decide to remove this after the feature has been tested in the field
long enough.
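
The selection added below boils down to this (a sketch for clarity; the
actual change is done inline in __die()):

static const char *pti_oops_tag(void)
{
	if (!IS_ENABLED(CONFIG_PAGE_TABLE_ISOLATION))
		return "";
	return boot_cpu_has(X86_FEATURE_PTI) ? " PTI" : " NOPTI";
}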

[ tglx: Made it use boot_cpu_has() as requested by Borislav ]

Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Eduardo Valentin <eduval@amazon.com>
Acked-by: Dave Hansen <dave.hansen@intel.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: aliguori@amazon.com
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: jkosina@suse.cz
Cc: keescook@google.com
Cc: hughd@google.com
Cc: Will Deacon <will.deacon@arm.com>
Cc: daniel.gruss@iaik.tugraz.at
Cc: David Laight <David.Laight@aculab.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Andy Lutomirsky <luto@kernel.org>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: bpetkov@suse.de
Cc: Linus Torvalds <torvalds@linux-foundation.org>

---
 arch/x86/kernel/dumpstack.c |    6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

--- a/arch/x86/kernel/dumpstack.c
+++ b/arch/x86/kernel/dumpstack.c
@@ -297,11 +297,13 @@ int __die(const char *str, struct pt_reg
 	unsigned long sp;
 #endif
 	printk(KERN_DEFAULT
-	       "%s: %04lx [#%d]%s%s%s%s\n", str, err & 0xffff, ++die_counter,
+	       "%s: %04lx [#%d]%s%s%s%s%s\n", str, err & 0xffff, ++die_counter,
 	       IS_ENABLED(CONFIG_PREEMPT) ? " PREEMPT"         : "",
 	       IS_ENABLED(CONFIG_SMP)     ? " SMP"             : "",
 	       debug_pagealloc_enabled()  ? " DEBUG_PAGEALLOC" : "",
-	       IS_ENABLED(CONFIG_KASAN)   ? " KASAN"           : "");
+	       IS_ENABLED(CONFIG_KASAN)   ? " KASAN"           : "",
+	       IS_ENABLED(CONFIG_PAGE_TABLE_ISOLATION) ?
+	       (boot_cpu_has(X86_FEATURE_PTI) ? " PTI" : " NOPTI") : "");
 
 	if (notify_die(DIE_OOPS, str, regs, err,
 			current->thread.trap_nr, SIGSEGV) == NOTIFY_STOP)

^ permalink raw reply	[flat|nested] 85+ messages in thread

* [patch V181 50/54] x86/mm/pti: Add Kconfig
  2017-12-20 21:35 [patch V181 00/54] x86/pti: Final XMAS release Thomas Gleixner
@ 2017-12-20 21:35   ` Thomas Gleixner
  2017-12-20 21:35 ` [patch V181 02/54] x86/mm/dump_pagetables: Check PAGE_PRESENT for real Thomas Gleixner
                     ` (55 subsequent siblings)
  56 siblings, 0 replies; 85+ messages in thread
From: Thomas Gleixner @ 2017-12-20 21:35 UTC (permalink / raw)
  To: LKML
  Cc: x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, keescook, hughd,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Rik van Riel,
	Boris Ostrovsky, Juergen Gross, David Laight, Eduardo Valentin,
	aliguori, Will Deacon, Vlastimil Babka, daniel.gruss,
	Dave Hansen, Ingo Molnar, Borislav Petkov, H. Peter Anvin,
	linux-mm

[-- Attachment #1: 0056-x86-mm-pti-Add-Kconfig.patch --]
[-- Type: text/plain, Size: 2404 bytes --]

From: Dave Hansen <dave.hansen@linux.intel.com>

Finally allow CONFIG_PAGE_TABLE_ISOLATION to be enabled.

PARAVIRT generally requires that the kernel not manage its own page tables.
It also means that the hypervisor and kernel must agree wholeheartedly
about what format the page tables are in and what they contain.
PAGE_TABLE_ISOLATION, unfortunately, changes the rules, so the two
cannot be used together.

I've seen conflicting feedback from maintainers lately about whether they
want the Kconfig magic to go first or last in a patch series.  It's going
last here because the partially-applied series leads to kernels that can
not boot in a bunch of cases.  I did a run through the entire series with
CONFIG_PAGE_TABLE_ISOLATION=y to look for build errors, though.

[ tglx: Removed SMP and !PARAVIRT dependencies as they no longer exist ]

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: David Laight <David.Laight@aculab.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Eduardo Valentin <eduval@amazon.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: aliguori@amazon.com
Cc: daniel.gruss@iaik.tugraz.at
Cc: hughd@google.com
Cc: keescook@google.com
Cc: linux-mm@kvack.org
---
 security/Kconfig |   10 ++++++++++
 1 file changed, 10 insertions(+)

--- a/security/Kconfig
+++ b/security/Kconfig
@@ -54,6 +54,16 @@ config SECURITY_NETWORK
 	  implement socket and networking access controls.
 	  If you are unsure how to answer this question, answer N.
 
+config PAGE_TABLE_ISOLATION
+	bool "Remove the kernel mapping in user mode"
+	depends on X86_64 && !UML
+	help
+	  This feature reduces the number of hardware side channels by
+	  ensuring that the majority of kernel addresses are not mapped
+	  into userspace.
+
+	  See Documentation/x86/pagetable-isolation.txt for more details.
+
 config SECURITY_INFINIBAND
 	bool "Infiniband Security Hooks"
 	depends on SECURITY && INFINIBAND

^ permalink raw reply	[flat|nested] 85+ messages in thread

* [patch V181 51/54] x86/mm/dump_pagetables: Add page table directory
  2017-12-20 21:35 [patch V181 00/54] x86/pti: Final XMAS release Thomas Gleixner
                   ` (49 preceding siblings ...)
  2017-12-20 21:35   ` Thomas Gleixner
@ 2017-12-20 21:35 ` Thomas Gleixner
  2017-12-20 21:35   ` Thomas Gleixner
                   ` (5 subsequent siblings)
  56 siblings, 0 replies; 85+ messages in thread
From: Thomas Gleixner @ 2017-12-20 21:35 UTC (permalink / raw)
  To: LKML
  Cc: x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, keescook, hughd,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Rik van Riel,
	Boris Ostrovsky, Juergen Gross, David Laight, Eduardo Valentin,
	aliguori, Will Deacon, Vlastimil Babka, daniel.gruss,
	Borislav Petkov, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin

[-- Attachment #1: 0057-x86-mm-dump_pagetables-Add-page-table-directory.patch --]
[-- Type: text/plain, Size: 1976 bytes --]

The upcoming support for dumping the kernel and the user space page tables
of the current process would create more random files in the top level
debugfs directory.

Add a page table directory and move the existing file to it.
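
Assuming the usual /sys/kernel/debug mount point, this moves the dump
from /sys/kernel/debug/kernel_page_tables to
/sys/kernel/debug/page_tables/kernel.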

Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Laight <David.Laight@aculab.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Eduardo Valentin <eduval@amazon.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: aliguori@amazon.com
Cc: daniel.gruss@iaik.tugraz.at
Cc: hughd@google.com
Cc: keescook@google.com
---
 arch/x86/mm/debug_pagetables.c |   15 ++++++++++-----
 1 file changed, 10 insertions(+), 5 deletions(-)

--- a/arch/x86/mm/debug_pagetables.c
+++ b/arch/x86/mm/debug_pagetables.c
@@ -22,21 +22,26 @@ static const struct file_operations ptdu
 	.release	= single_release,
 };
 
-static struct dentry *pe;
+static struct dentry *dir, *pe;
 
 static int __init pt_dump_debug_init(void)
 {
-	pe = debugfs_create_file("kernel_page_tables", S_IRUSR, NULL, NULL,
-				 &ptdump_fops);
-	if (!pe)
+	dir = debugfs_create_dir("page_tables", NULL);
+	if (!dir)
 		return -ENOMEM;
 
+	pe = debugfs_create_file("kernel", 0400, dir, NULL, &ptdump_fops);
+	if (!pe)
+		goto err;
 	return 0;
+err:
+	debugfs_remove_recursive(dir);
+	return -ENOMEM;
 }
 
 static void __exit pt_dump_debug_exit(void)
 {
-	debugfs_remove_recursive(pe);
+	debugfs_remove_recursive(dir);
 }
 
 module_init(pt_dump_debug_init);

^ permalink raw reply	[flat|nested] 85+ messages in thread

* [patch V181 52/54] x86/mm/dump_pagetables: Check user space page table for WX pages
  2017-12-20 21:35 [patch V181 00/54] x86/pti: Final XMAS release Thomas Gleixner
@ 2017-12-20 21:35   ` Thomas Gleixner
  2017-12-20 21:35 ` [patch V181 02/54] x86/mm/dump_pagetables: Check PAGE_PRESENT for real Thomas Gleixner
                     ` (55 subsequent siblings)
  56 siblings, 0 replies; 85+ messages in thread
From: Thomas Gleixner @ 2017-12-20 21:35 UTC (permalink / raw)
  To: LKML
  Cc: x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, keescook, hughd,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Rik van Riel,
	Boris Ostrovsky, Juergen Gross, David Laight, Eduardo Valentin,
	aliguori, Will Deacon, Vlastimil Babka, daniel.gruss,
	Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin,
	linux-mm

[-- Attachment #1: 0058-x86-mm-dump_pagetables-Check-user-space-page-table-f.patch --]
[-- Type: text/plain, Size: 3598 bytes --]

From: Thomas Gleixner <tglx@linutronix.de>

ptdump_walk_pgd_level_checkwx() checks the kernel page table for WX pages,
but does not check the PAGE_TABLE_ISOLATION user space page table.

Restructure the code so that dmesg output is selected by an explicit
argument and not implicitly by checking the pgd argument for !NULL.

Add the check for the user space page table.
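
For orientation, a sketch of the layout kernel_to_user_pgdp() relies on
(an illustration of the assumed PGD pairing, not the real helper):

/*
 * With PTI each mm's PGD is an 8k, 8k-aligned allocation: the kernel
 * half occupies the low 4k page, the user half the high one, so the
 * user PGD is reached by setting bit 12 (PAGE_SHIFT) of the pointer.
 */
static inline pgd_t *sketch_kernel_to_user_pgdp(pgd_t *pgdp)
{
	return (pgd_t *)((unsigned long)pgdp | (1UL << PAGE_SHIFT));
}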

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Laight <David.Laight@aculab.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Eduardo Valentin <eduval@amazon.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: aliguori@amazon.com
Cc: daniel.gruss@iaik.tugraz.at
Cc: hughd@google.com
Cc: keescook@google.com
Cc: linux-mm@kvack.org
---
 arch/x86/include/asm/pgtable.h |    1 +
 arch/x86/mm/debug_pagetables.c |    2 +-
 arch/x86/mm/dump_pagetables.c  |   30 +++++++++++++++++++++++++-----
 3 files changed, 27 insertions(+), 6 deletions(-)

--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -28,6 +28,7 @@ extern pgd_t early_top_pgt[PTRS_PER_PGD]
 int __init __early_make_pgtable(unsigned long address, pmdval_t pmd);
 
 void ptdump_walk_pgd_level(struct seq_file *m, pgd_t *pgd);
+void ptdump_walk_pgd_level_debugfs(struct seq_file *m, pgd_t *pgd);
 void ptdump_walk_pgd_level_checkwx(void);
 
 #ifdef CONFIG_DEBUG_WX
--- a/arch/x86/mm/debug_pagetables.c
+++ b/arch/x86/mm/debug_pagetables.c
@@ -5,7 +5,7 @@
 
 static int ptdump_show(struct seq_file *m, void *v)
 {
-	ptdump_walk_pgd_level(m, NULL);
+	ptdump_walk_pgd_level_debugfs(m, NULL);
 	return 0;
 }
 
--- a/arch/x86/mm/dump_pagetables.c
+++ b/arch/x86/mm/dump_pagetables.c
@@ -476,7 +476,7 @@ static inline bool is_hypervisor_range(i
 }
 
 static void ptdump_walk_pgd_level_core(struct seq_file *m, pgd_t *pgd,
-				       bool checkwx)
+				       bool checkwx, bool dmesg)
 {
 #ifdef CONFIG_X86_64
 	pgd_t *start = (pgd_t *) &init_top_pgt;
@@ -489,7 +489,7 @@ static void ptdump_walk_pgd_level_core(s
 
 	if (pgd) {
 		start = pgd;
-		st.to_dmesg = true;
+		st.to_dmesg = dmesg;
 	}
 
 	st.check_wx = checkwx;
@@ -527,13 +527,33 @@ static void ptdump_walk_pgd_level_core(s
 
 void ptdump_walk_pgd_level(struct seq_file *m, pgd_t *pgd)
 {
-	ptdump_walk_pgd_level_core(m, pgd, false);
+	ptdump_walk_pgd_level_core(m, pgd, false, true);
+}
+
+void ptdump_walk_pgd_level_debugfs(struct seq_file *m, pgd_t *pgd)
+{
+	ptdump_walk_pgd_level_core(m, pgd, false, false);
+}
+EXPORT_SYMBOL_GPL(ptdump_walk_pgd_level_debugfs);
+
+static void ptdump_walk_user_pgd_level_checkwx(void)
+{
+#ifdef CONFIG_PAGE_TABLE_ISOLATION
+	pgd_t *pgd = (pgd_t *) &init_top_pgt;
+
+	if (!static_cpu_has(X86_FEATURE_PTI))
+		return;
+
+	pr_info("x86/mm: Checking user space page tables\n");
+	pgd = kernel_to_user_pgdp(pgd);
+	ptdump_walk_pgd_level_core(NULL, pgd, true, false);
+#endif
 }
-EXPORT_SYMBOL_GPL(ptdump_walk_pgd_level);
 
 void ptdump_walk_pgd_level_checkwx(void)
 {
-	ptdump_walk_pgd_level_core(NULL, NULL, true);
+	ptdump_walk_pgd_level_core(NULL, NULL, true, false);
+	ptdump_walk_user_pgd_level_checkwx();
 }
 
 static int __init pt_dump_init(void)

^ permalink raw reply	[flat|nested] 85+ messages in thread

* [patch V181 53/54] x86/mm/dump_pagetables: Allow dumping current pagetables
  2017-12-20 21:35 [patch V181 00/54] x86/pti: Final XMAS release Thomas Gleixner
@ 2017-12-20 21:35   ` Thomas Gleixner
  2017-12-20 21:35 ` [patch V181 02/54] x86/mm/dump_pagetables: Check PAGE_PRESENT for real Thomas Gleixner
                     ` (55 subsequent siblings)
  56 siblings, 0 replies; 85+ messages in thread
From: Thomas Gleixner @ 2017-12-20 21:35 UTC (permalink / raw)
  To: LKML
  Cc: x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, keescook, hughd,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Rik van Riel,
	Boris Ostrovsky, Juergen Gross, David Laight, Eduardo Valentin,
	aliguori, Will Deacon, Vlastimil Babka, daniel.gruss,
	Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin,
	linux-mm

[-- Attachment #1: 0059-x86-mm-dump_pagetables-Allow-dumping-current-pagetab.patch --]
[-- Type: text/plain, Size: 5091 bytes --]

From: Thomas Gleixner <tglx@linutronix.de>

Add two debugfs files which allow dumping the page tables of the current
task.

current_kernel dumps the regular page table. This is the page table which
is normally shared between kernel and user space. If kernel page table
isolation is enabled this is the kernel space mapping.

If kernel page table isolation is enabled the second file, current_user,
dumps the user space page table.

These files allow verifying the resulting page tables for page table
isolation, but even in the normal case it's useful to be able to inspect
the user space page tables of current for debugging purposes.
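
Assuming the usual /sys/kernel/debug mount point, the new files appear
as /sys/kernel/debug/page_tables/current_kernel and
/sys/kernel/debug/page_tables/current_user, next to the existing kernel
dump.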

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Laight <David.Laight@aculab.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Eduardo Valentin <eduval@amazon.com>
Cc: Greg KH <gregkh@linuxfoundation.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Will Deacon <will.deacon@arm.com>
Cc: aliguori@amazon.com
Cc: daniel.gruss@iaik.tugraz.at
Cc: hughd@google.com
Cc: keescook@google.com
Cc: linux-mm@kvack.org
---
 arch/x86/include/asm/pgtable.h |    2 -
 arch/x86/mm/debug_pagetables.c |   71 ++++++++++++++++++++++++++++++++++++++---
 arch/x86/mm/dump_pagetables.c  |    6 ++-
 3 files changed, 73 insertions(+), 6 deletions(-)

--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -28,7 +28,7 @@ extern pgd_t early_top_pgt[PTRS_PER_PGD]
 int __init __early_make_pgtable(unsigned long address, pmdval_t pmd);
 
 void ptdump_walk_pgd_level(struct seq_file *m, pgd_t *pgd);
-void ptdump_walk_pgd_level_debugfs(struct seq_file *m, pgd_t *pgd);
+void ptdump_walk_pgd_level_debugfs(struct seq_file *m, pgd_t *pgd, bool user);
 void ptdump_walk_pgd_level_checkwx(void);
 
 #ifdef CONFIG_DEBUG_WX
--- a/arch/x86/mm/debug_pagetables.c
+++ b/arch/x86/mm/debug_pagetables.c
@@ -5,7 +5,7 @@
 
 static int ptdump_show(struct seq_file *m, void *v)
 {
-	ptdump_walk_pgd_level_debugfs(m, NULL);
+	ptdump_walk_pgd_level_debugfs(m, NULL, false);
 	return 0;
 }
 
@@ -22,7 +22,57 @@ static const struct file_operations ptdu
 	.release	= single_release,
 };
 
-static struct dentry *dir, *pe;
+static int ptdump_show_curknl(struct seq_file *m, void *v)
+{
+	if (current->mm->pgd) {
+		down_read(&current->mm->mmap_sem);
+		ptdump_walk_pgd_level_debugfs(m, current->mm->pgd, false);
+		up_read(&current->mm->mmap_sem);
+	}
+	return 0;
+}
+
+static int ptdump_open_curknl(struct inode *inode, struct file *filp)
+{
+	return single_open(filp, ptdump_show_curknl, NULL);
+}
+
+static const struct file_operations ptdump_curknl_fops = {
+	.owner		= THIS_MODULE,
+	.open		= ptdump_open_curknl,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.release	= single_release,
+};
+
+#ifdef CONFIG_PAGE_TABLE_ISOLATION
+static struct dentry *pe_curusr;
+
+static int ptdump_show_curusr(struct seq_file *m, void *v)
+{
+	if (current->mm->pgd) {
+		down_read(&current->mm->mmap_sem);
+		ptdump_walk_pgd_level_debugfs(m, current->mm->pgd, true);
+		up_read(&current->mm->mmap_sem);
+	}
+	return 0;
+}
+
+static int ptdump_open_curusr(struct inode *inode, struct file *filp)
+{
+	return single_open(filp, ptdump_show_curusr, NULL);
+}
+
+static const struct file_operations ptdump_curusr_fops = {
+	.owner		= THIS_MODULE,
+	.open		= ptdump_open_curusr,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.release	= single_release,
+};
+#endif
+
+static struct dentry *dir, *pe_knl, *pe_curknl;
 
 static int __init pt_dump_debug_init(void)
 {
@@ -30,9 +80,22 @@ static int __init pt_dump_debug_init(voi
 	if (!dir)
 		return -ENOMEM;
 
-	pe = debugfs_create_file("kernel", 0400, dir, NULL, &ptdump_fops);
-	if (!pe)
+	pe_knl = debugfs_create_file("kernel", 0400, dir, NULL,
+				     &ptdump_fops);
+	if (!pe_knl)
+		goto err;
+
+	pe_curknl = debugfs_create_file("current_kernel", 0400,
+					dir, NULL, &ptdump_curknl_fops);
+	if (!pe_curknl)
+		goto err;
+
+#ifdef CONFIG_PAGE_TABLE_ISOLATION
+	pe_curusr = debugfs_create_file("current_user", 0400,
+					dir, NULL, &ptdump_curusr_fops);
+	if (!pe_curusr)
 		goto err;
+#endif
 	return 0;
 err:
 	debugfs_remove_recursive(dir);
--- a/arch/x86/mm/dump_pagetables.c
+++ b/arch/x86/mm/dump_pagetables.c
@@ -530,8 +530,12 @@ void ptdump_walk_pgd_level(struct seq_fi
 	ptdump_walk_pgd_level_core(m, pgd, false, true);
 }
 
-void ptdump_walk_pgd_level_debugfs(struct seq_file *m, pgd_t *pgd)
+void ptdump_walk_pgd_level_debugfs(struct seq_file *m, pgd_t *pgd, bool user)
 {
+#ifdef CONFIG_PAGE_TABLE_ISOLATION
+	if (user && static_cpu_has(X86_FEATURE_PTI))
+		pgd = kernel_to_user_pgdp(pgd);
+#endif
 	ptdump_walk_pgd_level_core(m, pgd, false, false);
 }
 EXPORT_SYMBOL_GPL(ptdump_walk_pgd_level_debugfs);

^ permalink raw reply	[flat|nested] 85+ messages in thread

* [patch V181 54/54] x86/ldt: Make the LDT mapping RO
  2017-12-20 21:35 [patch V181 00/54] x86/pti: Final XMAS release Thomas Gleixner
                   ` (52 preceding siblings ...)
  2017-12-20 21:35   ` Thomas Gleixner
@ 2017-12-20 21:35 ` Thomas Gleixner
  2017-12-20 23:48 ` [patch V181 00/54] x86/pti: Final XMAS release Thomas Gleixner
                   ` (2 subsequent siblings)
  56 siblings, 0 replies; 85+ messages in thread
From: Thomas Gleixner @ 2017-12-20 21:35 UTC (permalink / raw)
  To: LKML
  Cc: x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, keescook, hughd,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Rik van Riel,
	Boris Ostrovsky, Juergen Gross, David Laight, Eduardo Valentin,
	aliguori, Will Deacon, Vlastimil Babka, daniel.gruss

[-- Attachment #1: x86-ldt--Make-the-LDT-mapping-RO.patch --]
[-- Type: text/plain, Size: 3220 bytes --]

From: Thomas Gleixner <tglx@linutronix.de>

Now that the LDT mapping is in a known area when PAGE_TABLE_ISOLATION is
enabled, it is a primary target for attacks if a user space interface
fails to validate a write address correctly. That can never happen, right?

The SDM states:

    If the segment descriptors in the GDT or an LDT are placed in ROM, the
    processor can enter an indefinite loop if software or the processor
    attempts to update (write to) the ROM-based segment descriptors. To
    prevent this problem, set the accessed bits for all segment descriptors
    placed in a ROM. Also, remove operating-system or executive code that
    attempts to modify segment descriptors located in ROM.

So it is a valid approach to set the ACCESS bit when setting up the LDT
entry and to map the table RO. Fix up the selftest so it can handle that
new mode.

Remove the manual ACCESS bit setter in set_tls_desc() as it is now
pointless. Folded in the patch from Peter Zijlstra.
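
For illustration, the invariant this establishes (a sketch, not code
from the patch):

/*
 * Every descriptor written into the now read-only LDT already has the
 * Accessed bit (bit 0 of the type field) set, so the CPU never needs
 * to write the entry back.
 */
static inline int ldt_desc_preaccessed(const struct desc_struct *d)
{
	return d->type & 1;
}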

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/include/asm/desc.h           |    2 ++
 arch/x86/kernel/ldt.c                 |    7 ++++++-
 arch/x86/kernel/tls.c                 |   11 ++---------
 tools/testing/selftests/x86/ldt_gdt.c |    3 +--
 4 files changed, 11 insertions(+), 12 deletions(-)

--- a/arch/x86/include/asm/desc.h
+++ b/arch/x86/include/asm/desc.h
@@ -21,6 +21,8 @@ static inline void fill_ldt(struct desc_
 
 	desc->type		= (info->read_exec_only ^ 1) << 1;
 	desc->type	       |= info->contents << 2;
+	/* Set the ACCESS bit so it can be mapped RO */
+	desc->type	       |= 1;
 
 	desc->s			= 1;
 	desc->dpl		= 0x3;
--- a/arch/x86/kernel/ldt.c
+++ b/arch/x86/kernel/ldt.c
@@ -158,7 +158,12 @@ map_ldt_struct(struct mm_struct *mm, str
 		ptep = get_locked_pte(mm, va, &ptl);
 		if (!ptep)
 			return -ENOMEM;
-		pte = pfn_pte(pfn, __pgprot(__PAGE_KERNEL & ~_PAGE_GLOBAL));
+		/*
+		 * Map it RO so the easy to find address is not a primary
+		 * target via some kernel interface which misses a
+		 * permission check.
+		 */
+		pte = pfn_pte(pfn, __pgprot(__PAGE_KERNEL_RO & ~_PAGE_GLOBAL));
 		set_pte_at(mm, va, ptep, pte);
 		pte_unmap_unlock(ptep, ptl);
 	}
--- a/arch/x86/kernel/tls.c
+++ b/arch/x86/kernel/tls.c
@@ -93,17 +93,10 @@ static void set_tls_desc(struct task_str
 	cpu = get_cpu();
 
 	while (n-- > 0) {
-		if (LDT_empty(info) || LDT_zero(info)) {
+		if (LDT_empty(info) || LDT_zero(info))
 			memset(desc, 0, sizeof(*desc));
-		} else {
+		else
 			fill_ldt(desc, info);
-
-			/*
-			 * Always set the accessed bit so that the CPU
-			 * doesn't try to write to the (read-only) GDT.
-			 */
-			desc->type |= 1;
-		}
 		++info;
 		++desc;
 	}
--- a/tools/testing/selftests/x86/ldt_gdt.c
+++ b/tools/testing/selftests/x86/ldt_gdt.c
@@ -122,8 +122,7 @@ static void check_valid_segment(uint16_t
 	 * NB: Different Linux versions do different things with the
 	 * accessed bit in set_thread_area().
 	 */
-	if (ar != expected_ar &&
-	    (ldt || ar != (expected_ar | AR_ACCESSED))) {
+	if (ar != expected_ar && ar != (expected_ar | AR_ACCESSED)) {
 		printf("[FAIL]\t%s entry %hu has AR 0x%08X but expected 0x%08X\n",
 		       (ldt ? "LDT" : "GDT"), index, ar, expected_ar);
 		nerrs++;

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [patch V181 49/54] x86/dumpstack: Indicate in Oops whether pti is configured and enabled
  2017-12-20 21:35 ` [patch V181 49/54] x86/dumpstack: Indicate in Oops whether pti is configured and enabled Thomas Gleixner
@ 2017-12-20 22:03   ` Jiri Kosina
  0 siblings, 0 replies; 85+ messages in thread
From: Jiri Kosina @ 2017-12-20 22:03 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, keescook, hughd,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Rik van Riel,
	Boris Ostrovsky, Juergen Gross, David Laight, Eduardo Valentin,
	aliguori, Will Deacon, Vlastimil Babka, daniel.gruss

On Wed, 20 Dec 2017, Thomas Gleixner wrote:

> From: Vlastimil Babka <vbabka@suse.cz>
> 
> CONFIG_PAGE_TABLE_ISOLATION is relatively new and intrusive feature that may
> still have some corner cases which could take some time to manifest and be
> fixed. It would be useful to have Oops messages indicate whether it was
> enabled for building the kernel, and whether it was disabled during boot.
> 
> Example of fully enabled:
> 
> 	Oops: 0001 [#1] SMP PTI
> 
> Example of enabled during build, but disabled during boot:
> 
> 	Oops: 0001 [#1] SMP NOPTI
> 
> We can decide to remove this after the feature has been tested in the field
> long enough.
> 
> [ tglx: Made it use boot_cpu_has() as requested by Borislav ]
> 
> Signed-off-by: Vlastimil Babka <vbabka@suse.cz>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> Reviewed-by: Eduardo Valentin <eduval@amazon.com>
> Acked-by: Dave Hansen <dave.hansen@intel.com>

FWIW

	Acked-by: Jiri Kosina <jkosina@suse.cz>

I've had a much more stupid variant of this when fuzzing pti/nopti kernels 
(distro backports, of course), and it was really helpful when diving into 
the crashes.

Thanks,

-- 
Jiri Kosina
SUSE Labs

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [patch V181 21/54] x86/cpu_entry_area: Move it to a separate unit
  2017-12-20 21:35 ` [patch V181 21/54] x86/cpu_entry_area: Move it to a separate unit Thomas Gleixner
@ 2017-12-20 22:29   ` Thomas Gleixner
  0 siblings, 0 replies; 85+ messages in thread
From: Thomas Gleixner @ 2017-12-20 22:29 UTC (permalink / raw)
  To: LKML
  Cc: x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, keescook, hughd,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Rik van Riel,
	Boris Ostrovsky, Juergen Gross, David Laight, Eduardo Valentin,
	aliguori, Will Deacon, Vlastimil Babka, daniel.gruss

On Wed, 20 Dec 2017, Thomas Gleixner wrote:
> +++ b/arch/x86/mm/cpu_entry_area.c
> @@ -0,0 +1,102 @@
> +// SPDX-License-Identifier: GPL-2.0
> +

Lacks an

#include <linux/spinlock.h> as 0-day noticed....

> +#include <linux/percpu.h>
> +#include <asm/cpu_entry_area.h>
> +#include <asm/pgtable.h>
> +#include <asm/fixmap.h>
> +#include <asm/desc.h>

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [patch V181 00/54] x86/pti: Final XMAS release
  2017-12-20 21:35 [patch V181 00/54] x86/pti: Final XMAS release Thomas Gleixner
                   ` (53 preceding siblings ...)
  2017-12-20 21:35 ` [patch V181 54/54] x86/ldt: Make the LDT mapping RO Thomas Gleixner
@ 2017-12-20 23:48 ` Thomas Gleixner
  2017-12-21 12:57 ` Kirill A. Shutemov
  2017-12-21 15:57 ` Boris Ostrovsky
  56 siblings, 0 replies; 85+ messages in thread
From: Thomas Gleixner @ 2017-12-20 23:48 UTC (permalink / raw)
  To: LKML
  Cc: x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, keescook, hughd,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Rik van Riel,
	Boris Ostrovsky, Juergen Gross, David Laight, Eduardo Valentin,
	aliguori, Will Deacon, Vlastimil Babka, daniel.gruss

On Wed, 20 Dec 2017, Thomas Gleixner wrote:

> Changes since V163:
> 
>   - Moved the cpu entry area out of the fixmap because that caused failures
>     due to fixmap size and cleanup_highmap() zapping fixmap PTEs.
> 
>   - Moved all cpu entry area related code into separate files. The
>     hodgepodge in cpu/common.c was really not appropriate.
> 
>   - Folded Juergens XEN PV fix for vsyscall
> 
>   - Folded Peters ACCESS bit simplification
> 
>   - Cleaned up and fixed dump_pagetables
> 
>   - Added Vlastimils PTI/NOPTI marker for dumpstack
> 
>   - Addressed various minor review comments
> 
>   Diffstat against V163 appended.
> 
> Thanks to everyone who looked and cared!
> 
> It's perfect now because I'm going to have quiet holidays no matter what.

Almost perfect. 0-day is amazing. It unearthed yet more include hell. I'm
not going to repost the whole thing. Find the delta fix below.

I've updated the git tree at:

   git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git WIP.x86/pti

head commit is now: 9b12709513089f4e71190685930e7f9f0d75fee7

The new patch tarball is at:

   https://tglx.de/~tglx/patches-pti-184.tar.bz2

   sha1sum of decompressed tarball: 2dbdfb57cf65a0c1e558ed55747e17d3c8d8adee

Thanks,

	tglx

8<--------------

diff --git a/arch/x86/include/asm/cpu_entry_area.h b/arch/x86/include/asm/cpu_entry_area.h
index e05a39029446..4a7884b8dca5 100644
--- a/arch/x86/include/asm/cpu_entry_area.h
+++ b/arch/x86/include/asm/cpu_entry_area.h
@@ -71,13 +71,7 @@ extern void cea_set_pte(void *cea_vaddr, phys_addr_t pa, pgprot_t flags);
 #define CPU_ENTRY_AREA_MAP_SIZE			\
 	(CPU_ENTRY_AREA_PER_CPU + CPU_ENTRY_AREA_TOT_SIZE - CPU_ENTRY_AREA_BASE)
 
-static inline struct cpu_entry_area *get_cpu_entry_area(int cpu)
-{
-	unsigned long va = CPU_ENTRY_AREA_PER_CPU + cpu * CPU_ENTRY_AREA_SIZE;
-	BUILD_BUG_ON(sizeof(struct cpu_entry_area) % PAGE_SIZE != 0);
-
-	return (struct cpu_entry_area *) va;
-}
+extern struct cpu_entry_area *get_cpu_entry_area(int cpu);
 
 static inline struct entry_stack *cpu_entry_stack(int cpu)
 {
diff --git a/arch/x86/mm/cpu_entry_area.c b/arch/x86/mm/cpu_entry_area.c
index 66c0d1207243..b5dfb762c64c 100644
--- a/arch/x86/mm/cpu_entry_area.c
+++ b/arch/x86/mm/cpu_entry_area.c
@@ -1,6 +1,8 @@
 // SPDX-License-Identifier: GPL-2.0
 
+#include <linux/spinlock.h>
 #include <linux/percpu.h>
+
 #include <asm/cpu_entry_area.h>
 #include <asm/pgtable.h>
 #include <asm/fixmap.h>
@@ -13,6 +15,15 @@ static DEFINE_PER_CPU_PAGE_ALIGNED(char, exception_stacks
 	[(N_EXCEPTION_STACKS - 1) * EXCEPTION_STKSZ + DEBUG_STKSZ]);
 #endif
 
+struct cpu_entry_area *get_cpu_entry_area(int cpu)
+{
+	unsigned long va = CPU_ENTRY_AREA_PER_CPU + cpu * CPU_ENTRY_AREA_SIZE;
+	BUILD_BUG_ON(sizeof(struct cpu_entry_area) % PAGE_SIZE != 0);
+
+	return (struct cpu_entry_area *) va;
+}
+EXPORT_SYMBOL(get_cpu_entry_area);
+
 void cea_set_pte(void *cea_vaddr, phys_addr_t pa, pgprot_t flags)
 {
 	unsigned long va = (unsigned long) cea_vaddr;
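
Taking get_cpu_entry_area() out of line breaks the header dependency
chain that 0-day kept tripping over, and the EXPORT_SYMBOL keeps
modular users linking. A hypothetical caller, for illustration only
(the helper below is not part of the series):

	#include <asm/cpu_entry_area.h>

	/* Illustration: modular debug code can use the exported
	 * accessor without seeing the area layout internals. */
	static void dump_entry_area_va(int cpu)
	{
		pr_info("cpu%d entry area at %p\n", cpu,
			get_cpu_entry_area(cpu));
	}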


* Re: [patch V181 00/54] x86/pti: Final XMAS release
  2017-12-20 21:35 [patch V181 00/54] x86/pti: Final XMAS release Thomas Gleixner
                   ` (54 preceding siblings ...)
  2017-12-20 23:48 ` [patch V181 00/54] x86/pti: Final XMAS release Thomas Gleixner
@ 2017-12-21 12:57 ` Kirill A. Shutemov
  2017-12-21 16:26   ` Kirill A. Shutemov
  2017-12-21 15:57 ` Boris Ostrovsky
  56 siblings, 1 reply; 85+ messages in thread
From: Kirill A. Shutemov @ 2017-12-21 12:57 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, keescook, hughd,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Rik van Riel,
	Boris Ostrovsky, Juergen Gross, David Laight, Eduardo Valentin,
	aliguori, Will Deacon, Vlastimil Babka, daniel.gruss

On Wed, Dec 20, 2017 at 10:35:03PM +0100, Thomas Gleixner wrote:
> The series is also available from git:
> 
>   git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git WIP.x86/pti

The patchset looks sane in 5-level paging configuration as long as commit
c739f930be1d ("x86/espfix/64: Fix espfix double-fault handling on 5-level
systems") from tip/x86/urgent is applied.

Tested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

BTW, can we push the patch to Linus?

-- 
 Kirill A. Shutemov


* Re: [patch V181 00/54] x86/pti: Final XMAS release
  2017-12-20 21:35 [patch V181 00/54] x86/pti: Final XMAS release Thomas Gleixner
                   ` (55 preceding siblings ...)
  2017-12-21 12:57 ` Kirill A. Shutemov
@ 2017-12-21 15:57 ` Boris Ostrovsky
  56 siblings, 0 replies; 85+ messages in thread
From: Boris Ostrovsky @ 2017-12-21 15:57 UTC (permalink / raw)
  To: Thomas Gleixner, LKML
  Cc: x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, keescook, hughd,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Rik van Riel,
	Juergen Gross, David Laight, Eduardo Valentin, aliguori,
	Will Deacon, Vlastimil Babka, daniel.gruss

On 12/20/2017 04:35 PM, Thomas Gleixner wrote:
> Changes since V163:
>
>   - Moved the cpu entry area out of the fixmap because that caused failures
>     due to fixmap size and cleanup_highmap() zapping fixmap PTEs.
>
>   - Moved all cpu entry area related code into separate files. The
>     hodgepodge in cpu/common.c was really not appropriate.
>
>   - Folded Juergens XEN PV fix for vsyscall
>
>   - Folded Peters ACCESS bit simplification
>
>   - Cleaned up and fixed dump_pagetables
>
>   - Added Vlastimils PTI/NOPTI marker for dumpstack
>
>   - Addressed various minor review comments
>
>   Diffstat against V163 appended.
>
> Thanks to everyone who looked and cared!
>
> It's perfect now because I'm going to have quiet holidays no matter what.
>
> Nevertheless, please review and test the hell out of it.

Passed my nightly tests, with both baremetal and various Xen guests.

-boris



* Re: [patch V181 00/54] x86/pti: Final XMAS release
  2017-12-21 12:57 ` Kirill A. Shutemov
@ 2017-12-21 16:26   ` Kirill A. Shutemov
  2017-12-21 18:39     ` Thomas Gleixner
  0 siblings, 1 reply; 85+ messages in thread
From: Kirill A. Shutemov @ 2017-12-21 16:26 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, keescook, hughd,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Rik van Riel,
	Boris Ostrovsky, Juergen Gross, David Laight, Eduardo Valentin,
	aliguori, Will Deacon, Vlastimil Babka, daniel.gruss

On Thu, Dec 21, 2017 at 03:57:02PM +0300, Kirill A. Shutemov wrote:
> On Wed, Dec 20, 2017 at 10:35:03PM +0100, Thomas Gleixner wrote:
> > The series is also available from git:
> > 
> >   git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git WIP.x86/pti
> 
> The patchset looks sane in 5-level paging configuration as long as commit
> c739f930be1d ("x86/espfix/64: Fix espfix double-fault handling on 5-level
> systems") from tip/x86/urgent is applied.
> 
> Tested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>

Failed to boot with EFI. I don't think it's limited to 5-level paging.

The fix is below.

	BUG: unable to handle kernel paging request at ff1000017d803000
	IP: __pti_set_user_pgd+0x22/0x44
	PGD 4be4067 P4D 4be5067 PUD 27f905067 PMD 27f718067 PTE 800000017d803060
	Oops: 0002 [#1] PREEMPT SMP DEBUG_PAGEALLOC PTI
	CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.14.0-00174-g9b1270951308 #6613
	Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
	task: ffffffff822114c0 task.stack: ffffffff82200000
	RIP: 0010:__pti_set_user_pgd+0x22/0x44
	RSP: 0000:ffffffff82203c70 EFLAGS: 00000202
	RAX: 000000007c490063 RBX: ffffffff82203e28 RCX: 0000000000000002
	RDX: 0000000000000001 RSI: 000000007c490063 RDI: ff1000017d803000
	RBP: 000000007ff58000 R08: 0000000000000000 R09: 0000000000000067
	R10: ffffffff82203a78 R11: 0000000000000001 R12: ff1000017d802000
	R13: 0000000000000000 R14: 000000007ff58000 R15: ffffffff82203e28
	FS:  0000000000000000(0000) GS:ff1000007f000000(0000) knlGS:0000000000000000
	CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
	CR2: ff1000017d803000 CR3: 000000000440a000 CR4: 00000000000016b0
	Call Trace:
	 __cpa_process_fault+0x3f6/0x5d0
	 ? get_page_from_freelist+0x34f/0xbd0
	 __change_page_attr_set_clr+0x820/0xd70
	 ? __alloc_pages_nodemask+0x124/0xfd0
	 ? __alloc_pages_nodemask+0x124/0xfd0
	 ? printk+0x3e/0x46
	 ? kernel_map_pages_in_pgd+0x91/0x180
	 kernel_map_pages_in_pgd+0x91/0x180
	 ? __map_region+0x37/0x53
	 __map_region+0x37/0x53
	 efi_map_region+0x27/0xb3
	 efi_enter_virtual_mode+0x26f/0x490
	 start_kernel+0x368/0x3df
	 secondary_startup_64+0xab/0xb0
	Code: 90 90 90 90 90 90 90 90 90 48 89 fa 48 89 f0 81 e2 ff 0f 00 00 48 81 fa ff 07 00 00 77 16 48 89 f2 48 81 cf 00 10 00 00 83 e2 05 <48> 89 37 48 83 fa 05 74 01 c3 48 83 3d 6c 7c 29 01 00 79 f5 48
	RIP: __pti_set_user_pgd+0x22/0x44 RSP: ffffffff82203c70
	CR2: ff1000017d803000

-----------8<-----------

From 373d3c99f9a8a70f19363c6f007e78216a5f935e Mon Sep 17 00:00:00 2001
From: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Date: Thu, 21 Dec 2017 19:11:54 +0300
Subject: [PATCH] x86/efi: Allocate two PGDs for EFI page tables if PTI is
 enabled

EFI has its own top-level page table to avoid inserting EFI region
mappings into standard kernel page tables.

For PTI, we need to allocate two PGD page tables instead of one.
The user side is never used, but this allows us to use normal helpers to
deal with EFI page tables.

Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
---
 arch/x86/include/asm/pgalloc.h | 11 +++++++++++
 arch/x86/mm/pgtable.c          | 11 -----------
 arch/x86/platform/efi/efi_64.c |  5 ++++-
 3 files changed, 15 insertions(+), 12 deletions(-)

diff --git a/arch/x86/include/asm/pgalloc.h b/arch/x86/include/asm/pgalloc.h
index 4b5e1eafada7..aff42e1da6ee 100644
--- a/arch/x86/include/asm/pgalloc.h
+++ b/arch/x86/include/asm/pgalloc.h
@@ -30,6 +30,17 @@ static inline void paravirt_release_p4d(unsigned long pfn) {}
  */
 extern gfp_t __userpte_alloc_gfp;
 
+#ifdef CONFIG_PAGE_TABLE_ISOLATION
+/*
+ * Instead of one PGD, we acquire two PGDs.  Being order-1, it is
+ * both 8k in size and 8k-aligned.  That lets us just flip bit 12
+ * in a pointer to swap between the two 4k halves.
+ */
+#define PGD_ALLOCATION_ORDER 1
+#else
+#define PGD_ALLOCATION_ORDER 0
+#endif
+
 /*
  * Allocate and free page tables.
  */
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index c05b6dccc72d..9b7bcbd33cc2 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -356,17 +356,6 @@ static inline void _pgd_free(pgd_t *pgd)
 }
 #else
 
-#ifdef CONFIG_PAGE_TABLE_ISOLATION
-/*
- * Instead of one PGD, we acquire two PGDs.  Being order-1, it is
- * both 8k in size and 8k-aligned.  That lets us just flip bit 12
- * in a pointer to swap between the two 4k halves.
- */
-#define PGD_ALLOCATION_ORDER 1
-#else
-#define PGD_ALLOCATION_ORDER 0
-#endif
-
 static inline pgd_t *_pgd_alloc(void)
 {
 	return (pgd_t *)__get_free_pages(PGALLOC_GFP, PGD_ALLOCATION_ORDER);
diff --git a/arch/x86/platform/efi/efi_64.c b/arch/x86/platform/efi/efi_64.c
index 20fb31579b69..39c4b35ac7a4 100644
--- a/arch/x86/platform/efi/efi_64.c
+++ b/arch/x86/platform/efi/efi_64.c
@@ -195,6 +195,9 @@ static pgd_t *efi_pgd;
  * because we want to avoid inserting EFI region mappings (EFI_VA_END
  * to EFI_VA_START) into the standard kernel page tables. Everything
  * else can be shared, see efi_sync_low_kernel_mappings().
+ *
+ * We don't want the pgd on the pgd_list and cannot use pgd_alloc() for the
+ * allocation.
  */
 int __init efi_alloc_page_tables(void)
 {
@@ -207,7 +210,7 @@ int __init efi_alloc_page_tables(void)
 		return 0;
 
 	gfp_mask = GFP_KERNEL | __GFP_NOTRACK | __GFP_ZERO;
-	efi_pgd = (pgd_t *)__get_free_page(gfp_mask);
+	efi_pgd = (pgd_t *)__get_free_pages(gfp_mask, PGD_ALLOCATION_ORDER);
 	if (!efi_pgd)
 		return -ENOMEM;
 
-- 
 Kirill A. Shutemov
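
The oops makes the failure mode visible: CR2 (ff1000017d803000) is
exactly one page above the EFI PGD in R12 (ff1000017d802000), i.e.
__pti_set_user_pgd faulted while writing the user half that the old
order-0 allocation never provided (the Code: dump even shows the
"or rdi, 0x1000"). For illustration, a minimal sketch of the bit-12
flip described in the PGD_ALLOCATION_ORDER comment (helper name
hypothetical; the series has its own accessors):

	/*
	 * Sketch only: with an 8k-aligned, order-1 PGD allocation the
	 * kernel copy occupies the low 4k page and the user copy the
	 * high one, so toggling bit 12 of the pointer switches halves.
	 */
	static pgd_t *user_pgdp_sketch(pgd_t *kernel_pgdp)
	{
		return (pgd_t *)((unsigned long)kernel_pgdp ^ PAGE_SIZE);
	}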


* Re: [patch V181 00/54] x86/pti: Final XMAS release
  2017-12-21 16:26   ` Kirill A. Shutemov
@ 2017-12-21 18:39     ` Thomas Gleixner
  0 siblings, 0 replies; 85+ messages in thread
From: Thomas Gleixner @ 2017-12-21 18:39 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: LKML, x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, keescook, hughd,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Rik van Riel,
	Boris Ostrovsky, Juergen Gross, David Laight, Eduardo Valentin,
	aliguori, Will Deacon, Vlastimil Babka, daniel.gruss

On Thu, 21 Dec 2017, Kirill A. Shutemov wrote:
> On Thu, Dec 21, 2017 at 03:57:02PM +0300, Kirill A. Shutemov wrote:
> > On Wed, Dec 20, 2017 at 10:35:03PM +0100, Thomas Gleixner wrote:
> > > The series is also available from git:
> > > 
> > >   git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git WIP.x86/pti
> > 
> > The patchset looks sane in 5-level paging configuration as long as commit
> > c739f930be1d ("x86/espfix/64: Fix espfix double-fault handling on 5-level
> > systems") from tip/x86/urgent is applied.
> > 
> > Tested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> 
> Failed to boot with EFI. I don't think it's limited to 5-level paging.

Uuurg. That's been there forever.

> The fix is below.

Thanks for catching and fixing that.

       tglx


* Re: [V181,22/54] x86/cpu_entry_area: Move it out of fixmap
  2017-12-20 21:35 ` [patch V181 22/54] x86/cpu_entry_area: Move it out of fixmap Thomas Gleixner
@ 2017-12-22  2:46   ` Andrei Vagin
  2017-12-22 13:05     ` Thomas Gleixner
  0 siblings, 1 reply; 85+ messages in thread
From: Andrei Vagin @ 2017-12-22  2:46 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, keescook, hughd,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Rik van Riel,
	Boris Ostrovsky, Juergen Gross, David Laight, Eduardo Valentin,
	aliguori, Will Deacon, Vlastimil Babka, daniel.gruss

Hi Thomas,

The kernel with this patch doesn't boot, if CONFIG_KASAN is set:
[    0.000000] Linux version 4.14.0-00142-g8604322546c0 (avagin@laptop) (gcc version 7.2.1 20170915 (Red Hat 7.2.1-2) (GCC)) #11 SMP Thu Dec 21 18:38:44 PST 2017
[    0.000000] Command line: root=/dev/vda2 ro debug console=ttyS0,115200 LANG=en_US.UTF-8 slub_debug=FZP raid=noautodetect selinux=0 earlyprintk=serial,ttyS0,115200
[    0.000000] x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'
[    0.000000] x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers'
[    0.000000] x86/fpu: Supporting XSAVE feature 0x004: 'AVX registers'
[    0.000000] x86/fpu: Supporting XSAVE feature 0x008: 'MPX bounds registers'
[    0.000000] x86/fpu: Supporting XSAVE feature 0x010: 'MPX CSR'
[    0.000000] x86/fpu: xstate_offset[2]:  576, xstate_sizes[2]:  256
[    0.000000] x86/fpu: xstate_offset[3]:  832, xstate_sizes[3]:   64
[    0.000000] x86/fpu: xstate_offset[4]:  896, xstate_sizes[4]:   64
[    0.000000] x86/fpu: Enabled xstate features 0x1f, context size is 960 bytes, using 'compacted' format.
[    0.000000] e820: BIOS-provided physical RAM map:
[    0.000000] BIOS-e820: [mem 0x0000000000000000-0x000000000009fbff] usable
[    0.000000] BIOS-e820: [mem 0x000000000009fc00-0x000000000009ffff] reserved
[    0.000000] BIOS-e820: [mem 0x00000000000f0000-0x00000000000fffff] reserved
[    0.000000] BIOS-e820: [mem 0x0000000000100000-0x000000007ffd8fff] usable
[    0.000000] BIOS-e820: [mem 0x000000007ffd9000-0x000000007fffffff] reserved
[    0.000000] BIOS-e820: [mem 0x00000000feffc000-0x00000000feffffff] reserved
[    0.000000] BIOS-e820: [mem 0x00000000fffc0000-0x00000000ffffffff] reserved
[    0.000000] bootconsole [earlyser0] enabled
[    0.000000] NX (Execute Disable) protection: active
[    0.000000] random: fast init done
[    0.000000] SMBIOS 2.8 present.
[    0.000000] DMI: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1.fc26 04/01/2014
[    0.000000] Hypervisor detected: KVM
[    0.000000] tsc: Fast TSC calibration using PIT
[    0.000000] e820: update [mem 0x00000000-0x00000fff] usable ==> reserved
[    0.000000] e820: remove [mem 0x000a0000-0x000fffff] usable
[    0.000000] e820: last_pfn = 0x7ffd9 max_arch_pfn = 0x400000000
[    0.000000] MTRR default type: write-back
[    0.000000] MTRR fixed ranges enabled:
[    0.000000]   00000-9FFFF write-back
[    0.000000]   A0000-BFFFF uncachable
[    0.000000]   C0000-FFFFF write-protect
[    0.000000] MTRR variable ranges enabled:
[    0.000000]   0 base 0080000000 mask FF80000000 uncachable
[    0.000000]   1 disabled
[    0.000000]   2 disabled
[    0.000000]   3 disabled
[    0.000000]   4 disabled
[    0.000000]   5 disabled
[    0.000000]   6 disabled
[    0.000000]   7 disabled
[    0.000000] x86/PAT: Configuration [0-7]: WB  WC  UC- UC  WB  WP  UC- WT  
[    0.000000] found SMP MP-table at [mem 0x000f6bd0-0x000f6bdf] mapped at [ffffffffff200bd0]
[    0.000000] Base memory trampoline at [ffff880000099000] 99000 size 24576
[    0.000000] Using GB pages for direct mapping
[    0.000000] BRK [0x5bf4e000, 0x5bf4efff] PGTABLE
[    0.000000] BRK [0x5bf4f000, 0x5bf4ffff] PGTABLE
[    0.000000] BRK [0x5bf50000, 0x5bf50fff] PGTABLE
[    0.000000] BRK [0x5bf51000, 0x5bf51fff] PGTABLE
[    0.000000] BRK [0x5bf52000, 0x5bf52fff] PGTABLE
[    0.000000] ACPI: Early table checksum verification disabled
[    0.000000] ACPI: RSDP 0x00000000000F69C0 000014 (v00 BOCHS )
[    0.000000] ACPI: RSDT 0x000000007FFE12FF 00002C (v01 BOCHS  BXPCRSDT 00000001 BXPC 00000001)
[    0.000000] ACPI: FACP 0x000000007FFE120B 000074 (v01 BOCHS  BXPCFACP 00000001 BXPC 00000001)
[    0.000000] ACPI: DSDT 0x000000007FFE0040 0011CB (v01 BOCHS  BXPCDSDT 00000001 BXPC 00000001)
[    0.000000] ACPI: FACS 0x000000007FFE0000 000040
[    0.000000] ACPI: APIC 0x000000007FFE127F 000080 (v01 BOCHS  BXPCAPIC 00000001 BXPC 00000001)
[    0.000000] ACPI: Local APIC address 0xfee00000
[    0.000000] No NUMA configuration found
[    0.000000] Faking a node at [mem 0x0000000000000000-0x000000007ffd8fff]
[    0.000000] NODE_DATA(0) allocated [mem 0x7ffc2000-0x7ffd8fff]
[    0.000000] kvm-clock: Using msrs 4b564d01 and 4b564d00
[    0.000000] kvm-clock: cpu 0, msr 0:7ffc1001, primary cpu clock
[    0.000000] kvm-clock: using sched offset of 137192604594 cycles
[    0.000000] clocksource: kvm-clock: mask: 0xffffffffffffffff max_cycles: 0x1cd42e4dffb, max_idle_ns: 881590591483 ns
[    0.000000] Zone ranges:
[    0.000000]   DMA      [mem 0x0000000000001000-0x0000000000ffffff]
[    0.000000]   DMA32    [mem 0x0000000001000000-0x000000007ffd8fff]
[    0.000000]   Normal   empty
[    0.000000]   Device   empty
[    0.000000] Movable zone start for each node
[    0.000000] Early memory node ranges
[    0.000000]   node   0: [mem 0x0000000000001000-0x000000000009efff]
[    0.000000]   node   0: [mem 0x0000000000100000-0x000000007ffd8fff]
[    0.000000] Initmem setup node 0 [mem 0x0000000000001000-0x000000007ffd8fff]
[    0.000000] On node 0 totalpages: 524151
[    0.000000]   DMA zone: 64 pages used for memmap
[    0.000000]   DMA zone: 21 pages reserved
[    0.000000]   DMA zone: 3998 pages, LIFO batch:0
[    0.000000]   DMA32 zone: 8128 pages used for memmap
[    0.000000]   DMA32 zone: 520153 pages, LIFO batch:31

And then it starts booting again...

On Wed, Dec 20, 2017 at 10:35:25PM +0100, Thomas Gleixner wrote:
> Put the cpu_entry_area into a separate p4d entry. The fixmap gets too big,
> and 0-day already hit a case where the fixmap PTEs were cleared by
> cleanup_highmap().
> 
> Aside from that, the fixmap API is a pain as it's all backwards.
> 
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> ---
>  Documentation/x86/x86_64/mm.txt         |    2 +
>  arch/x86/include/asm/cpu_entry_area.h   |   24 ++++++++++++-
>  arch/x86/include/asm/desc.h             |    1 
>  arch/x86/include/asm/fixmap.h           |   32 -----------------
>  arch/x86/include/asm/pgtable_32_types.h |   15 ++++++--
>  arch/x86/include/asm/pgtable_64_types.h |   47 +++++++++++++++-----------
>  arch/x86/kernel/dumpstack.c             |    1 
>  arch/x86/kernel/traps.c                 |    5 +-
>  arch/x86/mm/cpu_entry_area.c            |   57 +++++++++++++++++++++++---------
>  arch/x86/mm/dump_pagetables.c           |    6 ++-
>  arch/x86/mm/init_32.c                   |    6 +++
>  arch/x86/mm/kasan_init_64.c             |    6 ++-
>  arch/x86/mm/pgtable_32.c                |    1 
>  arch/x86/xen/mmu_pv.c                   |    2 -
>  14 files changed, 128 insertions(+), 77 deletions(-)
> 
> --- a/Documentation/x86/x86_64/mm.txt
> +++ b/Documentation/x86/x86_64/mm.txt
> @@ -12,6 +12,7 @@ ffffea0000000000 - ffffeaffffffffff (=40
>  ... unused hole ...
>  ffffec0000000000 - fffffbffffffffff (=44 bits) kasan shadow memory (16TB)
>  ... unused hole ...
> +fffffe8000000000 - fffffeffffffffff (=39 bits) cpu_entry_area mapping
>  ffffff0000000000 - ffffff7fffffffff (=39 bits) %esp fixup stacks
>  ... unused hole ...
>  ffffffef00000000 - fffffffeffffffff (=64 GB) EFI region mapping space
> @@ -35,6 +36,7 @@ ffd4000000000000 - ffd5ffffffffffff (=49
>  ... unused hole ...
>  ffdf000000000000 - fffffc0000000000 (=53 bits) kasan shadow memory (8PB)
>  ... unused hole ...
> +fffffe8000000000 - fffffeffffffffff (=39 bits) cpu_entry_area mapping
>  ffffff0000000000 - ffffff7fffffffff (=39 bits) %esp fixup stacks
>  ... unused hole ...
>  ffffffef00000000 - fffffffeffffffff (=64 GB) EFI region mapping space
> --- a/arch/x86/include/asm/cpu_entry_area.h
> +++ b/arch/x86/include/asm/cpu_entry_area.h
> @@ -43,10 +43,32 @@ struct cpu_entry_area {
>  };
>  
>  #define CPU_ENTRY_AREA_SIZE	(sizeof(struct cpu_entry_area))
> -#define CPU_ENTRY_AREA_PAGES	(CPU_ENTRY_AREA_SIZE / PAGE_SIZE)
> +#define CPU_ENTRY_AREA_TOT_SIZE	(CPU_ENTRY_AREA_SIZE * NR_CPUS)
>  
>  DECLARE_PER_CPU(struct cpu_entry_area *, cpu_entry_area);
>  
>  extern void setup_cpu_entry_areas(void);
> +extern void cea_set_pte(void *cea_vaddr, phys_addr_t pa, pgprot_t flags);
> +
> +#define	CPU_ENTRY_AREA_RO_IDT		CPU_ENTRY_AREA_BASE
> +#define CPU_ENTRY_AREA_PER_CPU		(CPU_ENTRY_AREA_RO_IDT + PAGE_SIZE)
> +
> +#define CPU_ENTRY_AREA_RO_IDT_VADDR	((void *)CPU_ENTRY_AREA_RO_IDT)
> +
> +#define CPU_ENTRY_AREA_MAP_SIZE			\
> +	(CPU_ENTRY_AREA_PER_CPU + CPU_ENTRY_AREA_TOT_SIZE - CPU_ENTRY_AREA_BASE)
> +
> +static inline struct cpu_entry_area *get_cpu_entry_area(int cpu)
> +{
> +	unsigned long va = CPU_ENTRY_AREA_PER_CPU + cpu * CPU_ENTRY_AREA_SIZE;
> +	BUILD_BUG_ON(sizeof(struct cpu_entry_area) % PAGE_SIZE != 0);
> +
> +	return (struct cpu_entry_area *) va;
> +}
> +
> +static inline struct entry_stack *cpu_entry_stack(int cpu)
> +{
> +	return &get_cpu_entry_area(cpu)->entry_stack_page.stack;
> +}
>  
>  #endif
> --- a/arch/x86/include/asm/desc.h
> +++ b/arch/x86/include/asm/desc.h
> @@ -7,6 +7,7 @@
>  #include <asm/mmu.h>
>  #include <asm/fixmap.h>
>  #include <asm/irq_vectors.h>
> +#include <asm/cpu_entry_area.h>
>  
>  #include <linux/smp.h>
>  #include <linux/percpu.h>
> --- a/arch/x86/include/asm/fixmap.h
> +++ b/arch/x86/include/asm/fixmap.h
> @@ -25,7 +25,6 @@
>  #else
>  #include <uapi/asm/vsyscall.h>
>  #endif
> -#include <asm/cpu_entry_area.h>
>  
>  /*
>   * We can't declare FIXADDR_TOP as variable for x86_64 because vsyscall
> @@ -84,7 +83,6 @@ enum fixed_addresses {
>  	FIX_IO_APIC_BASE_0,
>  	FIX_IO_APIC_BASE_END = FIX_IO_APIC_BASE_0 + MAX_IO_APICS - 1,
>  #endif
> -	FIX_RO_IDT,	/* Virtual mapping for read-only IDT */
>  #ifdef CONFIG_X86_32
>  	FIX_KMAP_BEGIN,	/* reserved pte's for temporary kernel mappings */
>  	FIX_KMAP_END = FIX_KMAP_BEGIN+(KM_TYPE_NR*NR_CPUS)-1,
> @@ -100,9 +98,6 @@ enum fixed_addresses {
>  #ifdef	CONFIG_X86_INTEL_MID
>  	FIX_LNW_VRTC,
>  #endif
> -	/* Fixmap entries to remap the GDTs, one per processor. */
> -	FIX_CPU_ENTRY_AREA_TOP,
> -	FIX_CPU_ENTRY_AREA_BOTTOM = FIX_CPU_ENTRY_AREA_TOP + (CPU_ENTRY_AREA_PAGES * NR_CPUS) - 1,
>  
>  #ifdef CONFIG_ACPI_APEI_GHES
>  	/* Used for GHES mapping from assorted contexts */
> @@ -143,7 +138,7 @@ enum fixed_addresses {
>  extern void reserve_top_address(unsigned long reserve);
>  
>  #define FIXADDR_SIZE	(__end_of_permanent_fixed_addresses << PAGE_SHIFT)
> -#define FIXADDR_START		(FIXADDR_TOP - FIXADDR_SIZE)
> +#define FIXADDR_START	(FIXADDR_TOP - FIXADDR_SIZE)
>  
>  extern int fixmaps_set;
>  
> @@ -191,30 +186,5 @@ void __init *early_memremap_decrypted_wp
>  void __early_set_fixmap(enum fixed_addresses idx,
>  			phys_addr_t phys, pgprot_t flags);
>  
> -static inline unsigned int __get_cpu_entry_area_page_index(int cpu, int page)
> -{
> -	BUILD_BUG_ON(sizeof(struct cpu_entry_area) % PAGE_SIZE != 0);
> -
> -	return FIX_CPU_ENTRY_AREA_BOTTOM - cpu*CPU_ENTRY_AREA_PAGES - page;
> -}
> -
> -#define __get_cpu_entry_area_offset_index(cpu, offset) ({		\
> -	BUILD_BUG_ON(offset % PAGE_SIZE != 0);				\
> -	__get_cpu_entry_area_page_index(cpu, offset / PAGE_SIZE);	\
> -	})
> -
> -#define get_cpu_entry_area_index(cpu, field)				\
> -	__get_cpu_entry_area_offset_index((cpu), offsetof(struct cpu_entry_area, field))
> -
> -static inline struct cpu_entry_area *get_cpu_entry_area(int cpu)
> -{
> -	return (struct cpu_entry_area *)__fix_to_virt(__get_cpu_entry_area_page_index(cpu, 0));
> -}
> -
> -static inline struct entry_stack *cpu_entry_stack(int cpu)
> -{
> -	return &get_cpu_entry_area(cpu)->entry_stack_page.stack;
> -}
> -
>  #endif /* !__ASSEMBLY__ */
>  #endif /* _ASM_X86_FIXMAP_H */
> --- a/arch/x86/include/asm/pgtable_32_types.h
> +++ b/arch/x86/include/asm/pgtable_32_types.h
> @@ -38,13 +38,22 @@ extern bool __vmalloc_start_set; /* set
>  #define LAST_PKMAP 1024
>  #endif
>  
> -#define PKMAP_BASE ((FIXADDR_START - PAGE_SIZE * (LAST_PKMAP + 1))	\
> -		    & PMD_MASK)
> +/*
> + * Define this here and validate with BUILD_BUG_ON() in pgtable_32.c
> + * to avoid include recursion hell
> + */
> +#define CPU_ENTRY_AREA_PAGES	(NR_CPUS * 40)
> +
> +#define CPU_ENTRY_AREA_BASE				\
> +	((FIXADDR_START - PAGE_SIZE * (CPU_ENTRY_AREA_PAGES + 1)) & PMD_MASK)
> +
> +#define PKMAP_BASE		\
> +	((CPU_ENTRY_AREA_BASE - PAGE_SIZE) & PMD_MASK)
>  
>  #ifdef CONFIG_HIGHMEM
>  # define VMALLOC_END	(PKMAP_BASE - 2 * PAGE_SIZE)
>  #else
> -# define VMALLOC_END	(FIXADDR_START - 2 * PAGE_SIZE)
> +# define VMALLOC_END	(CPU_ENTRY_AREA_BASE - 2 * PAGE_SIZE)
>  #endif
>  
>  #define MODULES_VADDR	VMALLOC_START
> --- a/arch/x86/include/asm/pgtable_64_types.h
> +++ b/arch/x86/include/asm/pgtable_64_types.h
> @@ -76,32 +76,41 @@ typedef struct { pteval_t pte; } pte_t;
>  #define PGDIR_MASK	(~(PGDIR_SIZE - 1))
>  
>  /* See Documentation/x86/x86_64/mm.txt for a description of the memory map. */
> -#define MAXMEM		_AC(__AC(1, UL) << MAX_PHYSMEM_BITS, UL)
> +#define MAXMEM			_AC(__AC(1, UL) << MAX_PHYSMEM_BITS, UL)
> +
>  #ifdef CONFIG_X86_5LEVEL
> -#define VMALLOC_SIZE_TB _AC(16384, UL)
> -#define __VMALLOC_BASE	_AC(0xff92000000000000, UL)
> -#define __VMEMMAP_BASE	_AC(0xffd4000000000000, UL)
> +# define VMALLOC_SIZE_TB	_AC(16384, UL)
> +# define __VMALLOC_BASE		_AC(0xff92000000000000, UL)
> +# define __VMEMMAP_BASE		_AC(0xffd4000000000000, UL)
>  #else
> -#define VMALLOC_SIZE_TB	_AC(32, UL)
> -#define __VMALLOC_BASE	_AC(0xffffc90000000000, UL)
> -#define __VMEMMAP_BASE	_AC(0xffffea0000000000, UL)
> +# define VMALLOC_SIZE_TB	_AC(32, UL)
> +# define __VMALLOC_BASE		_AC(0xffffc90000000000, UL)
> +# define __VMEMMAP_BASE		_AC(0xffffea0000000000, UL)
>  #endif
> +
>  #ifdef CONFIG_RANDOMIZE_MEMORY
> -#define VMALLOC_START	vmalloc_base
> -#define VMEMMAP_START	vmemmap_base
> +# define VMALLOC_START		vmalloc_base
> +# define VMEMMAP_START		vmemmap_base
>  #else
> -#define VMALLOC_START	__VMALLOC_BASE
> -#define VMEMMAP_START	__VMEMMAP_BASE
> +# define VMALLOC_START		__VMALLOC_BASE
> +# define VMEMMAP_START		__VMEMMAP_BASE
>  #endif /* CONFIG_RANDOMIZE_MEMORY */
> -#define VMALLOC_END	(VMALLOC_START + _AC((VMALLOC_SIZE_TB << 40) - 1, UL))
> -#define MODULES_VADDR    (__START_KERNEL_map + KERNEL_IMAGE_SIZE)
> +
> +#define VMALLOC_END		(VMALLOC_START + _AC((VMALLOC_SIZE_TB << 40) - 1, UL))
> +
> +#define MODULES_VADDR		(__START_KERNEL_map + KERNEL_IMAGE_SIZE)
>  /* The module sections ends with the start of the fixmap */
> -#define MODULES_END   __fix_to_virt(__end_of_fixed_addresses + 1)
> -#define MODULES_LEN   (MODULES_END - MODULES_VADDR)
> -#define ESPFIX_PGD_ENTRY _AC(-2, UL)
> -#define ESPFIX_BASE_ADDR (ESPFIX_PGD_ENTRY << P4D_SHIFT)
> -#define EFI_VA_START	 ( -4 * (_AC(1, UL) << 30))
> -#define EFI_VA_END	 (-68 * (_AC(1, UL) << 30))
> +#define MODULES_END		__fix_to_virt(__end_of_fixed_addresses + 1)
> +#define MODULES_LEN		(MODULES_END - MODULES_VADDR)
> +
> +#define ESPFIX_PGD_ENTRY	_AC(-2, UL)
> +#define ESPFIX_BASE_ADDR	(ESPFIX_PGD_ENTRY << P4D_SHIFT)
> +
> +#define CPU_ENTRY_AREA_PGD	_AC(-3, UL)
> +#define CPU_ENTRY_AREA_BASE	(CPU_ENTRY_AREA_PGD << P4D_SHIFT)
> +
> +#define EFI_VA_START		( -4 * (_AC(1, UL) << 30))
> +#define EFI_VA_END		(-68 * (_AC(1, UL) << 30))
>  
>  #define EARLY_DYNAMIC_PAGE_TABLES	64
>  
> --- a/arch/x86/kernel/dumpstack.c
> +++ b/arch/x86/kernel/dumpstack.c
> @@ -18,6 +18,7 @@
>  #include <linux/nmi.h>
>  #include <linux/sysfs.h>
>  
> +#include <asm/cpu_entry_area.h>
>  #include <asm/stacktrace.h>
>  #include <asm/unwind.h>
>  
> --- a/arch/x86/kernel/traps.c
> +++ b/arch/x86/kernel/traps.c
> @@ -951,8 +951,9 @@ void __init trap_init(void)
>  	 * "sidt" instruction will not leak the location of the kernel, and
>  	 * to defend the IDT against arbitrary memory write vulnerabilities.
>  	 * It will be reloaded in cpu_init() */
> -	__set_fixmap(FIX_RO_IDT, __pa_symbol(idt_table), PAGE_KERNEL_RO);
> -	idt_descr.address = fix_to_virt(FIX_RO_IDT);
> +	cea_set_pte(CPU_ENTRY_AREA_RO_IDT_VADDR, __pa_symbol(idt_table),
> +		    PAGE_KERNEL_RO);
> +	idt_descr.address = CPU_ENTRY_AREA_RO_IDT;
>  
>  	/*
>  	 * Should be a barrier for any external CPU state:
> --- a/arch/x86/mm/cpu_entry_area.c
> +++ b/arch/x86/mm/cpu_entry_area.c
> @@ -13,11 +13,18 @@ static DEFINE_PER_CPU_PAGE_ALIGNED(char,
>  	[(N_EXCEPTION_STACKS - 1) * EXCEPTION_STKSZ + DEBUG_STKSZ]);
>  #endif
>  
> +void cea_set_pte(void *cea_vaddr, phys_addr_t pa, pgprot_t flags)
> +{
> +	unsigned long va = (unsigned long) cea_vaddr;
> +
> +	set_pte_vaddr(va, pfn_pte(pa >> PAGE_SHIFT, flags));
> +}
> +
>  static void __init
> -set_percpu_fixmap_pages(int idx, void *ptr, int pages, pgprot_t prot)
> +cea_map_percpu_pages(void *cea_vaddr, void *ptr, int pages, pgprot_t prot)
>  {
> -	for ( ; pages; pages--, idx--, ptr += PAGE_SIZE)
> -		__set_fixmap(idx, per_cpu_ptr_to_phys(ptr), prot);
> +	for ( ; pages; pages--, cea_vaddr+= PAGE_SIZE, ptr += PAGE_SIZE)
> +		cea_set_pte(cea_vaddr, per_cpu_ptr_to_phys(ptr), prot);
>  }
>  
>  /* Setup the fixmap mappings only once per-processor */
> @@ -45,10 +52,12 @@ static void __init setup_cpu_entry_area(
>  	pgprot_t tss_prot = PAGE_KERNEL;
>  #endif
>  
> -	__set_fixmap(get_cpu_entry_area_index(cpu, gdt), get_cpu_gdt_paddr(cpu), gdt_prot);
> -	set_percpu_fixmap_pages(get_cpu_entry_area_index(cpu, entry_stack_page),
> -				per_cpu_ptr(&entry_stack_storage, cpu), 1,
> -				PAGE_KERNEL);
> +	cea_set_pte(&get_cpu_entry_area(cpu)->gdt, get_cpu_gdt_paddr(cpu),
> +		    gdt_prot);
> +
> +	cea_map_percpu_pages(&get_cpu_entry_area(cpu)->entry_stack_page,
> +			     per_cpu_ptr(&entry_stack_storage, cpu), 1,
> +			     PAGE_KERNEL);
>  
>  	/*
>  	 * The Intel SDM says (Volume 3, 7.2.1):
> @@ -70,10 +79,9 @@ static void __init setup_cpu_entry_area(
>  	BUILD_BUG_ON((offsetof(struct tss_struct, x86_tss) ^
>  		      offsetofend(struct tss_struct, x86_tss)) & PAGE_MASK);
>  	BUILD_BUG_ON(sizeof(struct tss_struct) % PAGE_SIZE != 0);
> -	set_percpu_fixmap_pages(get_cpu_entry_area_index(cpu, tss),
> -				&per_cpu(cpu_tss_rw, cpu),
> -				sizeof(struct tss_struct) / PAGE_SIZE,
> -				tss_prot);
> +	cea_map_percpu_pages(&get_cpu_entry_area(cpu)->tss,
> +			     &per_cpu(cpu_tss_rw, cpu),
> +			     sizeof(struct tss_struct) / PAGE_SIZE, tss_prot);
>  
>  #ifdef CONFIG_X86_32
>  	per_cpu(cpu_entry_area, cpu) = get_cpu_entry_area(cpu);
> @@ -83,20 +91,37 @@ static void __init setup_cpu_entry_area(
>  	BUILD_BUG_ON(sizeof(exception_stacks) % PAGE_SIZE != 0);
>  	BUILD_BUG_ON(sizeof(exception_stacks) !=
>  		     sizeof(((struct cpu_entry_area *)0)->exception_stacks));
> -	set_percpu_fixmap_pages(get_cpu_entry_area_index(cpu, exception_stacks),
> -				&per_cpu(exception_stacks, cpu),
> -				sizeof(exception_stacks) / PAGE_SIZE,
> -				PAGE_KERNEL);
> +	cea_map_percpu_pages(&get_cpu_entry_area(cpu)->exception_stacks,
> +			     &per_cpu(exception_stacks, cpu),
> +			     sizeof(exception_stacks) / PAGE_SIZE, PAGE_KERNEL);
>  
> -	__set_fixmap(get_cpu_entry_area_index(cpu, entry_trampoline),
> +	cea_set_pte(&get_cpu_entry_area(cpu)->entry_trampoline,
>  		     __pa_symbol(_entry_trampoline), PAGE_KERNEL_RX);
>  #endif
>  }
>  
> +static __init void setup_cpu_entry_area_ptes(void)
> +{
> +#ifdef CONFIG_X86_32
> +	unsigned long start, end;
> +
> +	BUILD_BUG_ON(CPU_ENTRY_AREA_PAGES * PAGE_SIZE < CPU_ENTRY_AREA_MAP_SIZE);
> +	BUG_ON(CPU_ENTRY_AREA_BASE & ~PMD_MASK);
> +
> +	start = CPU_ENTRY_AREA_BASE;
> +	end = start + CPU_ENTRY_AREA_MAP_SIZE;
> +
> +	for (; start < end; start += PMD_SIZE)
> +		populate_extra_pte(start);
> +#endif
> +}
> +
>  void __init setup_cpu_entry_areas(void)
>  {
>  	unsigned int cpu;
>  
> +	setup_cpu_entry_area_ptes();
> +
>  	for_each_possible_cpu(cpu)
>  		setup_cpu_entry_area(cpu);
>  }
> --- a/arch/x86/mm/dump_pagetables.c
> +++ b/arch/x86/mm/dump_pagetables.c
> @@ -58,6 +58,7 @@ enum address_markers_idx {
>  	KASAN_SHADOW_START_NR,
>  	KASAN_SHADOW_END_NR,
>  #endif
> +	CPU_ENTRY_AREA_NR,
>  #ifdef CONFIG_X86_ESPFIX64
>  	ESPFIX_START_NR,
>  #endif
> @@ -81,6 +82,7 @@ static struct addr_marker address_marker
>  	[KASAN_SHADOW_START_NR]	= { KASAN_SHADOW_START,	"KASAN shadow" },
>  	[KASAN_SHADOW_END_NR]	= { KASAN_SHADOW_END,	"KASAN shadow end" },
>  #endif
> +	[CPU_ENTRY_AREA_NR]	= { CPU_ENTRY_AREA_BASE,"CPU entry Area" },
>  #ifdef CONFIG_X86_ESPFIX64
>  	[ESPFIX_START_NR]	= { ESPFIX_BASE_ADDR,	"ESPfix Area", 16 },
>  #endif
> @@ -104,6 +106,7 @@ enum address_markers_idx {
>  #ifdef CONFIG_HIGHMEM
>  	PKMAP_BASE_NR,
>  #endif
> +	CPU_ENTRY_AREA_NR,
>  	FIXADDR_START_NR,
>  	END_OF_SPACE_NR,
>  };
> @@ -116,6 +119,7 @@ static struct addr_marker address_marker
>  #ifdef CONFIG_HIGHMEM
>  	[PKMAP_BASE_NR]		= { 0UL,		"Persistent kmap() Area" },
>  #endif
> +	[CPU_ENTRY_AREA_NR]	= { 0UL,		"CPU entry area" },
>  	[FIXADDR_START_NR]	= { 0UL,		"Fixmap area" },
>  	[END_OF_SPACE_NR]	= { -1,			NULL }
>  };
> @@ -541,8 +545,8 @@ static int __init pt_dump_init(void)
>  	address_markers[PKMAP_BASE_NR].start_address = PKMAP_BASE;
>  # endif
>  	address_markers[FIXADDR_START_NR].start_address = FIXADDR_START;
> +	address_markers[CPU_ENTRY_AREA_NR].start_address = CPU_ENTRY_AREA_BASE;
>  #endif
> -
>  	return 0;
>  }
>  __initcall(pt_dump_init);
> --- a/arch/x86/mm/init_32.c
> +++ b/arch/x86/mm/init_32.c
> @@ -50,6 +50,7 @@
>  #include <asm/setup.h>
>  #include <asm/set_memory.h>
>  #include <asm/page_types.h>
> +#include <asm/cpu_entry_area.h>
>  #include <asm/init.h>
>  
>  #include "mm_internal.h"
> @@ -766,6 +767,7 @@ void __init mem_init(void)
>  	mem_init_print_info(NULL);
>  	printk(KERN_INFO "virtual kernel memory layout:\n"
>  		"    fixmap  : 0x%08lx - 0x%08lx   (%4ld kB)\n"
> +		"  cpu_entry : 0x%08lx - 0x%08lx   (%4ld kB)\n"
>  #ifdef CONFIG_HIGHMEM
>  		"    pkmap   : 0x%08lx - 0x%08lx   (%4ld kB)\n"
>  #endif
> @@ -777,6 +779,10 @@ void __init mem_init(void)
>  		FIXADDR_START, FIXADDR_TOP,
>  		(FIXADDR_TOP - FIXADDR_START) >> 10,
>  
> +		CPU_ENTRY_AREA_BASE,
> +		CPU_ENTRY_AREA_BASE + CPU_ENTRY_AREA_MAP_SIZE,
> +		CPU_ENTRY_AREA_MAP_SIZE >> 10,
> +
>  #ifdef CONFIG_HIGHMEM
>  		PKMAP_BASE, PKMAP_BASE+LAST_PKMAP*PAGE_SIZE,
>  		(LAST_PKMAP*PAGE_SIZE) >> 10,
> --- a/arch/x86/mm/kasan_init_64.c
> +++ b/arch/x86/mm/kasan_init_64.c
> @@ -15,6 +15,7 @@
>  #include <asm/tlbflush.h>
>  #include <asm/sections.h>
>  #include <asm/pgtable.h>
> +#include <asm/cpu_entry_area.h>
>  
>  extern struct range pfn_mapped[E820_MAX_ENTRIES];
>  
> @@ -330,12 +331,13 @@ void __init kasan_init(void)
>  			      (unsigned long)kasan_mem_to_shadow(_end),
>  			      early_pfn_to_nid(__pa(_stext)));
>  
> -	shadow_cpu_entry_begin = (void *)__fix_to_virt(FIX_CPU_ENTRY_AREA_BOTTOM);
> +	shadow_cpu_entry_begin = (void *)CPU_ENTRY_AREA_BASE;
>  	shadow_cpu_entry_begin = kasan_mem_to_shadow(shadow_cpu_entry_begin);
>  	shadow_cpu_entry_begin = (void *)round_down((unsigned long)shadow_cpu_entry_begin,
>  						PAGE_SIZE);
>  
> -	shadow_cpu_entry_end = (void *)(__fix_to_virt(FIX_CPU_ENTRY_AREA_TOP) + PAGE_SIZE);
> +	shadow_cpu_entry_end = (void *)(CPU_ENTRY_AREA_BASE +
> +					CPU_ENTRY_AREA_TOT_SIZE);
>  	shadow_cpu_entry_end = kasan_mem_to_shadow(shadow_cpu_entry_end);
>  	shadow_cpu_entry_end = (void *)round_up((unsigned long)shadow_cpu_entry_end,
>  					PAGE_SIZE);
> --- a/arch/x86/mm/pgtable_32.c
> +++ b/arch/x86/mm/pgtable_32.c
> @@ -10,6 +10,7 @@
>  #include <linux/pagemap.h>
>  #include <linux/spinlock.h>
>  
> +#include <asm/cpu_entry_area.h>
>  #include <asm/pgtable.h>
>  #include <asm/pgalloc.h>
>  #include <asm/fixmap.h>
> --- a/arch/x86/xen/mmu_pv.c
> +++ b/arch/x86/xen/mmu_pv.c
> @@ -2261,7 +2261,6 @@ static void xen_set_fixmap(unsigned idx,
>  
>  	switch (idx) {
>  	case FIX_BTMAP_END ... FIX_BTMAP_BEGIN:
> -	case FIX_RO_IDT:
>  #ifdef CONFIG_X86_32
>  	case FIX_WP_TEST:
>  # ifdef CONFIG_HIGHMEM
> @@ -2272,7 +2271,6 @@ static void xen_set_fixmap(unsigned idx,
>  #endif
>  	case FIX_TEXT_POKE0:
>  	case FIX_TEXT_POKE1:
> -	case FIX_CPU_ENTRY_AREA_TOP ... FIX_CPU_ENTRY_AREA_BOTTOM:
>  		/* All local page mappings */
>  		pte = pfn_pte(phys, prot);
>  		break;
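
As a cross-check of the layout quoted above: the new area occupies one
full PGD slot, and the base address in the mm.txt hunk follows directly
from the define (worked example, 4-level paging assumed, where
P4D_SHIFT == PGDIR_SHIFT == 39):

	/* CPU_ENTRY_AREA_PGD is -3, the third slot from the top. */
	unsigned long base = -3UL << 39;
	/* base == 0xfffffe8000000000, and the slot spans 1UL << 39
	 * bytes, i.e. up to 0xfffffeffffffffff -- matching mm.txt. */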


* Re: [V181,22/54] x86/cpu_entry_area: Move it out of fixmap
  2017-12-22  2:46   ` [V181,22/54] " Andrei Vagin
@ 2017-12-22 13:05     ` Thomas Gleixner
  0 siblings, 0 replies; 85+ messages in thread
From: Thomas Gleixner @ 2017-12-22 13:05 UTC (permalink / raw)
  To: Andrei Vagin
  Cc: LKML, x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, keescook, hughd,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Rik van Riel,
	Boris Ostrovsky, Juergen Gross, David Laight, Eduardo Valentin,
	aliguori, Will Deacon, Vlastimil Babka, daniel.gruss,
	Andrey Ryabinin, Dmitry Vyukov, kasan-dev

On Thu, 21 Dec 2017, Andrei Vagin wrote:

> Hi Thomas,
> [    0.000000]   DMA32 zone: 520153 pages, LIFO batch:31
> 
> And then it starts booting again...

Yes. It triple faults. Aside from having made the area one page too small,
my approach to changing this was just naive.

Andrey rescued me and provided the fix below. Thanks again!

Thanks,

	tglx

---
 arch/x86/mm/kasan_init_64.c | 25 +++++++++++++------------
 1 file changed, 13 insertions(+), 12 deletions(-)

diff --git a/arch/x86/mm/kasan_init_64.c b/arch/x86/mm/kasan_init_64.c
index 6353b8d31e6a..47388f0c0e59 100644
--- a/arch/x86/mm/kasan_init_64.c
+++ b/arch/x86/mm/kasan_init_64.c
@@ -323,32 +323,33 @@ void __init kasan_init(void)
 		map_range(&pfn_mapped[i]);
 	}
 
-	kasan_populate_zero_shadow(
-		kasan_mem_to_shadow((void *)PAGE_OFFSET + MAXMEM),
-		kasan_mem_to_shadow((void *)__START_KERNEL_map));
-
-	kasan_populate_shadow((unsigned long)kasan_mem_to_shadow(_stext),
-			      (unsigned long)kasan_mem_to_shadow(_end),
-			      early_pfn_to_nid(__pa(_stext)));
-
 	shadow_cpu_entry_begin = (void *)CPU_ENTRY_AREA_BASE;
 	shadow_cpu_entry_begin = kasan_mem_to_shadow(shadow_cpu_entry_begin);
 	shadow_cpu_entry_begin = (void *)round_down((unsigned long)shadow_cpu_entry_begin,
 						PAGE_SIZE);
 
 	shadow_cpu_entry_end = (void *)(CPU_ENTRY_AREA_BASE +
-					CPU_ENTRY_AREA_TOT_SIZE);
+					CPU_ENTRY_AREA_MAP_SIZE);
 	shadow_cpu_entry_end = kasan_mem_to_shadow(shadow_cpu_entry_end);
 	shadow_cpu_entry_end = (void *)round_up((unsigned long)shadow_cpu_entry_end,
 					PAGE_SIZE);
 
-	kasan_populate_zero_shadow(kasan_mem_to_shadow((void *)MODULES_END),
-				   shadow_cpu_entry_begin);
+	kasan_populate_zero_shadow(
+		kasan_mem_to_shadow((void *)PAGE_OFFSET + MAXMEM),
+		shadow_cpu_entry_begin);
 
 	kasan_populate_shadow((unsigned long)shadow_cpu_entry_begin,
 			      (unsigned long)shadow_cpu_entry_end, 0);
 
-	kasan_populate_zero_shadow(shadow_cpu_entry_end, (void *)KASAN_SHADOW_END);
+	kasan_populate_zero_shadow(shadow_cpu_entry_end,
+				kasan_mem_to_shadow((void *)__START_KERNEL_map));
+
+	kasan_populate_shadow((unsigned long)kasan_mem_to_shadow(_stext),
+			      (unsigned long)kasan_mem_to_shadow(_end),
+			      early_pfn_to_nid(__pa(_stext)));
+
+	kasan_populate_zero_shadow(kasan_mem_to_shadow((void *)MODULES_END),
+				(void *)KASAN_SHADOW_END);
 
 	load_cr3(init_top_pgt);
 	__flush_tlb_all();
-- 
2.13.6
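
The reordering keeps the shadow population bottom-up: the cpu_entry_area
now sits below __START_KERNEL_map, so its shadow has to be carved out of
the big zero-shadow range rather than populated over it after the fact.
The shadow address for any range follows KASAN's usual translation (a
generic sketch; the x86-64 offset comes from Kconfig, and the shift is
KASAN_SHADOW_SCALE_SHIFT):

	/* Sketch of kasan_mem_to_shadow(): every 8 bytes of address
	 * space are tracked by one shadow byte at a fixed offset. */
	static void *kasan_mem_to_shadow_sketch(const void *addr)
	{
		return (void *)((unsigned long)addr >> 3)
			+ KASAN_SHADOW_OFFSET;
	}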


* Re: [patch V181 35/54] x86/entry: Align entry text section to PMD boundary
  2017-12-20 21:35 ` [patch V181 35/54] x86/entry: Align entry text section to PMD boundary Thomas Gleixner
@ 2018-05-17 15:58   ` Josh Poimboeuf
  2018-05-18 10:38     ` Thomas Gleixner
  0 siblings, 1 reply; 85+ messages in thread
From: Josh Poimboeuf @ 2018-05-17 15:58 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, keescook, hughd,
	Brian Gerst, Denys Vlasenko, Rik van Riel, Boris Ostrovsky,
	Juergen Gross, David Laight, Eduardo Valentin, aliguori,
	Will Deacon, Vlastimil Babka, daniel.gruss, Ingo Molnar,
	Borislav Petkov, Dave Hansen, H. Peter Anvin

On Wed, Dec 20, 2017 at 10:35:38PM +0100, Thomas Gleixner wrote:
> From: Thomas Gleixner <tglx@linutronix.de>
> 
> The (irq)entry text must be visible in the user space page tables. To allow
> simple PMD based sharing, make the entry text PMD aligned.
> 
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> Signed-off-by: Ingo Molnar <mingo@kernel.org>

With my config, this patch adds 2.5M (+~25%) to the kernel text size.
Is there a more granular way to achieve this?

-- 
Josh


* Re: [patch V181 35/54] x86/entry: Align entry text section to PMD boundary
  2018-05-17 15:58   ` Josh Poimboeuf
@ 2018-05-18 10:38     ` Thomas Gleixner
  0 siblings, 0 replies; 85+ messages in thread
From: Thomas Gleixner @ 2018-05-18 10:38 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: LKML, x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, keescook, hughd,
	Brian Gerst, Denys Vlasenko, Rik van Riel, Boris Ostrovsky,
	Juergen Gross, David Laight, Eduardo Valentin, aliguori,
	Will Deacon, Vlastimil Babka, daniel.gruss, Ingo Molnar,
	Borislav Petkov, Dave Hansen, H. Peter Anvin

On Thu, 17 May 2018, Josh Poimboeuf wrote:

> On Wed, Dec 20, 2017 at 10:35:38PM +0100, Thomas Gleixner wrote:
> > From: Thomas Gleixner <tglx@linutronix.de>
> > 
> > The (irq)entry text must be visible in the user space page tables. To allow
> > simple PMD based sharing, make the entry text PMD aligned.
> > 
> > Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> > Signed-off-by: Ingo Molnar <mingo@kernel.org>
> 
> With my config, this patch adds 2.5M (+~25%) to the kernel text size.
> Is there a more granular way to achieve this?

We could make it PTE granular with some effort, but that would make the
whole PTI handling more complex, so I'm not sure that's worth it.

Thanks,

	tglx
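
The PMD alignment is what makes the sharing cheap: a rough sketch of the
sharing step, modeled on the clone helpers added earlier in the series
(the third argument names flags to clear on the user copy, so the shared
entry text ends up read-only there):

	/* Sketch: with __entry_text_start/__entry_text_end PMD-aligned
	 * by the linker script, the user page tables can reference the
	 * same PMDs instead of tracking individual 4k pages. */
	static void pti_clone_entry_text_sketch(void)
	{
		pti_clone_pmds((unsigned long)__entry_text_start,
			       (unsigned long)__entry_text_end, _PAGE_RW);
	}

PTE granularity would shrink the alignment padding to 4k, but it would
mean cloning and keeping in sync individual PTE pages, which is the
extra complexity referred to above.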

