LKML Archive on lore.kernel.org
 help / Atom feed
* [patch 00/60] x86/kpti: Kernel Page Table Isolation (was KAISER)
@ 2017-12-04 14:07 Thomas Gleixner
  2017-12-04 14:07 ` [patch 01/60] x86/entry/64/paravirt: Use paravirt-safe macro to access eflags Thomas Gleixner
                   ` (62 more replies)
  0 siblings, 63 replies; 118+ messages in thread
From: Thomas Gleixner @ 2017-12-04 14:07 UTC (permalink / raw)
  To: LKML
  Cc: x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, keescook, hughd,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Rik van Riel,
	Boris Ostrovsky, Juergen Gross, David Laight, Eduardo Valentin,
	aliguori, Will Deacon, daniel.gruss

This series is a major overhaul of the KAISER patches:

1) Entry code

   Mostly the same, except for a handful of fixlets and delta
   improvements folded into the corresponding patches

   New: Map TSS read only into the user space visible mapping

     This is 64bit only, as 32bit needs the TSS mapped RW

     AMD confirmed that there is no issue with that. It would be nice to
     get confirmation from Intel as well.

2) Namespace

   Several people including Linus requested to change the KAISER name.

   We came up with a list of technically correct acronyms:

     User Address Space Separation, prefix uass_

     Forcefully Unmap Complete Kernel With Interrupt Trampolines, prefix fuckwit_

   but we are politically correct people so we settled for

    Kernel Page Table Isolation, prefix kpti_

   Linus, your call :)

3) The actual isolation patches

   - Replaced the magic kaiser_add/remove_mapping() code by mapping everything
     which needs to be shared with user space into the fixmap

   - PMD aligned the shared fixmap so the PTE page can be shared between
     user and kernel space page tables

   - Integrated all fixes and Peters rewrite of the PCID/TLB flush code.

   - Restructured the patch set in a way that it is simpler to review

   - Got rid of the strange wording of shadow page tables, because they are
     not shadowish at all. KASAN, virt etc. use shadows, but these tables
     are actively in use and integral part of the functionality

   - Moved the debugfs files into a new directory so they don't clutter the
     debugfs root directory.

LIMITATIONS:

   - allmod/yes config builds fail right now because the fixmap grows
     too large and breaks the EFI assumptions. This is still investigated.

     A possible solution is just to use one of the address space holes
     and grab a separate pgdir to map the cpu entry area. Not hard to do
     and it wont change much of the principle of these patches.

TODOs:

   - This needs a thorough review again. Sorry.

   - Please verify that all fixlets have been integrated. The mail threads
     are horribly scattered so I might have missed something.

   - Rewrite documentation. I dropped the documentation patch as it not
     longer applies.

   - Handle native vsyscalls. Right now the patch set supports only
     emulation, but it should be possible to support native as well.
     Nothing urgent, I'd rather prefer to kill them completely.

   - Populate a branch with minimal prerequisite patches to apply.

Thanks to Andy Lutomirsky, Peter Zijlstra, Ingo Molnar, Borislav Petkov and
Dave Hansen for all the help with this.

The patches apply on top of

    git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git x86/urgent

and are available from git in

    git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip.git WIP.x86/kpti

and as tarball from

    https://tglx.de/~tglx/patches-kpti-119.tar.bz2

    Signature file for the uncompressed tarball

    https://tglx.de/~tglx/patches-kpti-119.tar.sig

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 118+ messages in thread

* [patch 01/60] x86/entry/64/paravirt: Use paravirt-safe macro to access eflags
  2017-12-04 14:07 [patch 00/60] x86/kpti: Kernel Page Table Isolation (was KAISER) Thomas Gleixner
@ 2017-12-04 14:07 ` Thomas Gleixner
  2017-12-05 12:17   ` Juergen Gross
  2017-12-04 14:07 ` [patch 02/60] x86/unwinder/orc: Dont bail on stack overflow Thomas Gleixner
                   ` (61 subsequent siblings)
  62 siblings, 1 reply; 118+ messages in thread
From: Thomas Gleixner @ 2017-12-04 14:07 UTC (permalink / raw)
  To: LKML
  Cc: x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, keescook, hughd,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Rik van Riel,
	Boris Ostrovsky, Juergen Gross, David Laight, Eduardo Valentin,
	aliguori, Will Deacon, daniel.gruss, xen-devel

[-- Attachment #0: x86-entry-64-paravirt--Use_paravirt-safe_macro_to_access_eflags.patch --]
[-- Type: text/plain, Size: 3202 bytes --]

From: Boris Ostrovsky <boris.ostrovsky@oracle.com>

Commit 1d3e53e8624a ("x86/entry/64: Refactor IRQ stacks and make them
NMI-safe") added DEBUG_ENTRY_ASSERT_IRQS_OFF macro that acceses eflags
using 'pushfq' instruction when testing for IF bit. On PV Xen guests
looking at IF flag directly will always see it set, resulting in 'ud2'.

Introduce SAVE_FLAGS() macro that will use appropriate save_fl pv op when
running paravirt.

Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: jgross@suse.com
Cc: xen-devel@lists.xenproject.org
Cc: luto@kernel.org
Link: https://lkml.kernel.org/r/1512159805-6314-1-git-send-email-boris.ostrovsky@oracle.com

---
V3:
* Use CLBR_RAX to preserve all registers except %rax


 arch/x86/entry/entry_64.S        |    7 ++++---
 arch/x86/include/asm/irqflags.h  |    3 +++
 arch/x86/include/asm/paravirt.h  |    9 +++++++++
 arch/x86/kernel/asm-offsets_64.c |    3 +++
 4 files changed, 19 insertions(+), 3 deletions(-)

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index f81d50d..18474bb 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -466,12 +466,13 @@ END(irq_entries_start)
 
 .macro DEBUG_ENTRY_ASSERT_IRQS_OFF
 #ifdef CONFIG_DEBUG_ENTRY
-	pushfq
-	testl $X86_EFLAGS_IF, (%rsp)
+	pushq %rax
+	SAVE_FLAGS(CLBR_RAX)
+	testl $X86_EFLAGS_IF, %eax
 	jz .Lokay_\@
 	ud2
 .Lokay_\@:
-	addq $8, %rsp
+	popq %rax
 #endif
 .endm
 
diff --git a/arch/x86/include/asm/irqflags.h b/arch/x86/include/asm/irqflags.h
index c8ef23f..89f0895 100644
--- a/arch/x86/include/asm/irqflags.h
+++ b/arch/x86/include/asm/irqflags.h
@@ -142,6 +142,9 @@ static inline notrace unsigned long arch_local_irq_save(void)
 	swapgs;					\
 	sysretl
 
+#ifdef CONFIG_DEBUG_ENTRY
+#define SAVE_FLAGS(x)		pushfq; popq %rax
+#endif
 #else
 #define INTERRUPT_RETURN		iret
 #define ENABLE_INTERRUPTS_SYSEXIT	sti; sysexit
diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index 283efca..892df37 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -927,6 +927,15 @@ static inline notrace unsigned long arch_local_irq_save(void)
 	PARA_SITE(PARA_PATCH(pv_cpu_ops, PV_CPU_usergs_sysret64),	\
 		  CLBR_NONE,						\
 		  jmp PARA_INDIRECT(pv_cpu_ops+PV_CPU_usergs_sysret64))
+
+#ifdef CONFIG_DEBUG_ENTRY
+#define SAVE_FLAGS(clobbers)                                        \
+	PARA_SITE(PARA_PATCH(pv_irq_ops, PV_IRQ_save_fl), clobbers, \
+		  PV_SAVE_REGS(clobbers | CLBR_CALLEE_SAVE);        \
+		  call PARA_INDIRECT(pv_irq_ops+PV_IRQ_save_fl);    \
+		  PV_RESTORE_REGS(clobbers | CLBR_CALLEE_SAVE);)
+#endif
+
 #endif	/* CONFIG_X86_32 */
 
 #endif /* __ASSEMBLY__ */
diff --git a/arch/x86/kernel/asm-offsets_64.c b/arch/x86/kernel/asm-offsets_64.c
index 630212f..e3a5175 100644
--- a/arch/x86/kernel/asm-offsets_64.c
+++ b/arch/x86/kernel/asm-offsets_64.c
@@ -23,6 +23,9 @@ int main(void)
 #ifdef CONFIG_PARAVIRT
 	OFFSET(PV_CPU_usergs_sysret64, pv_cpu_ops, usergs_sysret64);
 	OFFSET(PV_CPU_swapgs, pv_cpu_ops, swapgs);
+#ifdef CONFIG_DEBUG_ENTRY
+	OFFSET(PV_IRQ_save_fl, pv_irq_ops, save_fl);
+#endif
 	BLANK();
 #endif
 
-- 
1.7.1

^ permalink raw reply	[flat|nested] 118+ messages in thread

* [patch 02/60] x86/unwinder/orc: Dont bail on stack overflow
  2017-12-04 14:07 [patch 00/60] x86/kpti: Kernel Page Table Isolation (was KAISER) Thomas Gleixner
  2017-12-04 14:07 ` [patch 01/60] x86/entry/64/paravirt: Use paravirt-safe macro to access eflags Thomas Gleixner
@ 2017-12-04 14:07 ` Thomas Gleixner
  2017-12-04 20:31   ` Andy Lutomirski
  2017-12-04 14:07 ` [patch 03/60] x86/unwinder: Handle stack overflows more gracefully Thomas Gleixner
                   ` (60 subsequent siblings)
  62 siblings, 1 reply; 118+ messages in thread
From: Thomas Gleixner @ 2017-12-04 14:07 UTC (permalink / raw)
  To: LKML
  Cc: x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, keescook, hughd,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Rik van Riel,
	Boris Ostrovsky, Juergen Gross, David Laight, Eduardo Valentin,
	aliguori, Will Deacon, daniel.gruss, Ingo Molnar

[-- Attachment #0: x86-unwinder-orc--Dont_bail_on_stack_overflow.patch --]
[-- Type: text/plain, Size: 2292 bytes --]

From: Andy Lutomirski <luto@kernel.org>

If the stack overflows into a guard page and the ORC unwinder should work
well: by construction, there can't be any meaningful data in the guard page
because no writes to the guard page will have succeeded.

But there is a bug that prevents unwinding from working correctly: if the
starting register state has RSP pointing into a stack guard page, the ORC
unwinder bails out immediately.

Instead of bailing out immediately check whether the next page up is a
valid check page and if so analyze that. As a result the ORC unwinder will
start the unwind.

Tested by intentionally overflowing the task stack.  The result is an
accurate call trace instead of a trace consisting purely of '?' entries.

There are a few other bugs that are triggered if the unwinder encounters a
stack overflow after the first step, but they are outside the scope of this
fix.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Borislav Petkov <bpetkov@suse.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/927042950d7f1a7007dd0f58538966a593508f8b.1511715954.git.luto@kernel.org

---
 arch/x86/kernel/unwind_orc.c |   14 ++++++++++++--
 1 file changed, 12 insertions(+), 2 deletions(-)

--- a/arch/x86/kernel/unwind_orc.c
+++ b/arch/x86/kernel/unwind_orc.c
@@ -553,8 +553,18 @@ void __unwind_start(struct unwind_state
 	}
 
 	if (get_stack_info((unsigned long *)state->sp, state->task,
-			   &state->stack_info, &state->stack_mask))
-		return;
+			   &state->stack_info, &state->stack_mask)) {
+		/*
+		 * We weren't on a valid stack.  It's possible that
+		 * we overflowed a valid stack into a guard page.
+		 * See if the next page up is valid so that we can
+		 * generate some kind of backtrace if this happens.
+		 */
+		void *next_page = (void *)PAGE_ALIGN((unsigned long)regs->sp);
+		if (get_stack_info(next_page, state->task, &state->stack_info,
+				   &state->stack_mask))
+			return;
+	}
 
 	/*
 	 * The caller can provide the address of the first frame directly

^ permalink raw reply	[flat|nested] 118+ messages in thread

* [patch 03/60] x86/unwinder: Handle stack overflows more gracefully
  2017-12-04 14:07 [patch 00/60] x86/kpti: Kernel Page Table Isolation (was KAISER) Thomas Gleixner
  2017-12-04 14:07 ` [patch 01/60] x86/entry/64/paravirt: Use paravirt-safe macro to access eflags Thomas Gleixner
  2017-12-04 14:07 ` [patch 02/60] x86/unwinder/orc: Dont bail on stack overflow Thomas Gleixner
@ 2017-12-04 14:07 ` Thomas Gleixner
  2017-12-04 14:07 ` [patch 04/60] x86/irq: Remove an old outdated comment about context tracking races Thomas Gleixner
                   ` (59 subsequent siblings)
  62 siblings, 0 replies; 118+ messages in thread
From: Thomas Gleixner @ 2017-12-04 14:07 UTC (permalink / raw)
  To: LKML
  Cc: x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, keescook, hughd,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Rik van Riel,
	Boris Ostrovsky, Juergen Gross, David Laight, Eduardo Valentin,
	aliguori, Will Deacon, daniel.gruss, Ingo Molnar

[-- Attachment #0: x86-unwinder--Handle_stack_overflows_more_gracefully.patch --]
[-- Type: text/plain, Size: 9443 bytes --]

From: Josh Poimboeuf <jpoimboe@redhat.com>

There are at least two unwinder bugs hindering the debugging of
stack-overflow crashes:

- It doesn't deal gracefully with the case where the stack overflows and
  the stack pointer itself isn't on a valid stack but the
  to-be-dereferenced data *is*.

- The ORC oops dump code doesn't know how to print partial pt_regs, for the
  case where if we get an interrupt/exception in *early* entry code
  before the full pt_regs have been saved.

Fix both issues.

Signed-off-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bpetkov@suse.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
http://lkml.kernel.org/r/20171126024031.uxi4numpbjm5rlbr@treble
---
 arch/x86/include/asm/kdebug.h |    1 
 arch/x86/include/asm/unwind.h |    7 +++
 arch/x86/kernel/dumpstack.c   |   32 ++++++++++++++---
 arch/x86/kernel/process_64.c  |   11 ++----
 arch/x86/kernel/unwind_orc.c  |   76 ++++++++++++++----------------------------
 5 files changed, 66 insertions(+), 61 deletions(-)

--- a/arch/x86/include/asm/kdebug.h
+++ b/arch/x86/include/asm/kdebug.h
@@ -26,6 +26,7 @@ extern void die(const char *, struct pt_
 extern int __must_check __die(const char *, struct pt_regs *, long);
 extern void show_stack_regs(struct pt_regs *regs);
 extern void __show_regs(struct pt_regs *regs, int all);
+extern void show_iret_regs(struct pt_regs *regs);
 extern unsigned long oops_begin(void);
 extern void oops_end(unsigned long, struct pt_regs *, int signr);
 
--- a/arch/x86/include/asm/unwind.h
+++ b/arch/x86/include/asm/unwind.h
@@ -7,6 +7,9 @@
 #include <asm/ptrace.h>
 #include <asm/stacktrace.h>
 
+#define IRET_FRAME_OFFSET (offsetof(struct pt_regs, ip))
+#define IRET_FRAME_SIZE   (sizeof(struct pt_regs) - IRET_FRAME_OFFSET)
+
 struct unwind_state {
 	struct stack_info stack_info;
 	unsigned long stack_mask;
@@ -52,6 +55,10 @@ void unwind_start(struct unwind_state *s
 }
 
 #if defined(CONFIG_UNWINDER_ORC) || defined(CONFIG_UNWINDER_FRAME_POINTER)
+/*
+ * WARNING: The entire pt_regs may not be safe to dereference.  In some cases,
+ * only the iret frame registers are accessible.  Use with caution!
+ */
 static inline struct pt_regs *unwind_get_entry_regs(struct unwind_state *state)
 {
 	if (unwind_done(state))
--- a/arch/x86/kernel/dumpstack.c
+++ b/arch/x86/kernel/dumpstack.c
@@ -50,6 +50,28 @@ static void printk_stack_address(unsigne
 	printk("%s %s%pB\n", log_lvl, reliable ? "" : "? ", (void *)address);
 }
 
+void show_iret_regs(struct pt_regs *regs)
+{
+	printk(KERN_DEFAULT "RIP: %04x:%pS\n", (int)regs->cs, (void *)regs->ip);
+	printk(KERN_DEFAULT "RSP: %04x:%016lx EFLAGS: %08lx", (int)regs->ss,
+		regs->sp, regs->flags);
+}
+
+static void show_regs_safe(struct stack_info *info, struct pt_regs *regs)
+{
+	if (on_stack(info, regs, sizeof(*regs)))
+		__show_regs(regs, 0);
+	else if (on_stack(info, (void *)regs + IRET_FRAME_OFFSET,
+			  IRET_FRAME_SIZE)) {
+		/*
+		 * When an interrupt or exception occurs in entry code, the
+		 * full pt_regs might not have been saved yet.  In that case
+		 * just print the iret frame.
+		 */
+		show_iret_regs(regs);
+	}
+}
+
 void show_trace_log_lvl(struct task_struct *task, struct pt_regs *regs,
 			unsigned long *stack, char *log_lvl)
 {
@@ -94,8 +116,8 @@ void show_trace_log_lvl(struct task_stru
 		if (stack_name)
 			printk("%s <%s>\n", log_lvl, stack_name);
 
-		if (regs && on_stack(&stack_info, regs, sizeof(*regs)))
-			__show_regs(regs, 0);
+		if (regs)
+			show_regs_safe(&stack_info, regs);
 
 		/*
 		 * Scan the stack, printing any text addresses we find.  At the
@@ -119,7 +141,7 @@ void show_trace_log_lvl(struct task_stru
 
 			/*
 			 * Don't print regs->ip again if it was already printed
-			 * by __show_regs() below.
+			 * by show_regs_safe() below.
 			 */
 			if (regs && stack == &regs->ip)
 				goto next;
@@ -155,8 +177,8 @@ void show_trace_log_lvl(struct task_stru
 
 			/* if the frame has entry regs, print them */
 			regs = unwind_get_entry_regs(&state);
-			if (regs && on_stack(&stack_info, regs, sizeof(*regs)))
-				__show_regs(regs, 0);
+			if (regs)
+				show_regs_safe(&stack_info, regs);
 		}
 
 		if (stack_name)
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -69,9 +69,8 @@ void __show_regs(struct pt_regs *regs, i
 	unsigned int fsindex, gsindex;
 	unsigned int ds, cs, es;
 
-	printk(KERN_DEFAULT "RIP: %04lx:%pS\n", regs->cs, (void *)regs->ip);
-	printk(KERN_DEFAULT "RSP: %04lx:%016lx EFLAGS: %08lx", regs->ss,
-		regs->sp, regs->flags);
+	show_iret_regs(regs);
+
 	if (regs->orig_ax != -1)
 		pr_cont(" ORIG_RAX: %016lx\n", regs->orig_ax);
 	else
@@ -88,6 +87,9 @@ void __show_regs(struct pt_regs *regs, i
 	printk(KERN_DEFAULT "R13: %016lx R14: %016lx R15: %016lx\n",
 	       regs->r13, regs->r14, regs->r15);
 
+	if (!all)
+		return;
+
 	asm("movl %%ds,%0" : "=r" (ds));
 	asm("movl %%cs,%0" : "=r" (cs));
 	asm("movl %%es,%0" : "=r" (es));
@@ -98,9 +100,6 @@ void __show_regs(struct pt_regs *regs, i
 	rdmsrl(MSR_GS_BASE, gs);
 	rdmsrl(MSR_KERNEL_GS_BASE, shadowgs);
 
-	if (!all)
-		return;
-
 	cr0 = read_cr0();
 	cr2 = read_cr2();
 	cr3 = __read_cr3();
--- a/arch/x86/kernel/unwind_orc.c
+++ b/arch/x86/kernel/unwind_orc.c
@@ -253,22 +253,15 @@ unsigned long *unwind_get_return_address
 	return NULL;
 }
 
-static bool stack_access_ok(struct unwind_state *state, unsigned long addr,
+static bool stack_access_ok(struct unwind_state *state, unsigned long _addr,
 			    size_t len)
 {
 	struct stack_info *info = &state->stack_info;
+	void *addr = (void *)_addr;
 
-	/*
-	 * If the address isn't on the current stack, switch to the next one.
-	 *
-	 * We may have to traverse multiple stacks to deal with the possibility
-	 * that info->next_sp could point to an empty stack and the address
-	 * could be on a subsequent stack.
-	 */
-	while (!on_stack(info, (void *)addr, len))
-		if (get_stack_info(info->next_sp, state->task, info,
-				   &state->stack_mask))
-			return false;
+	if (!on_stack(info, addr, len) &&
+	    (get_stack_info(addr, state->task, info, &state->stack_mask)))
+		return false;
 
 	return true;
 }
@@ -283,42 +276,32 @@ static bool deref_stack_reg(struct unwin
 	return true;
 }
 
-#define REGS_SIZE (sizeof(struct pt_regs))
-#define SP_OFFSET (offsetof(struct pt_regs, sp))
-#define IRET_REGS_SIZE (REGS_SIZE - offsetof(struct pt_regs, ip))
-#define IRET_SP_OFFSET (SP_OFFSET - offsetof(struct pt_regs, ip))
-
 static bool deref_stack_regs(struct unwind_state *state, unsigned long addr,
-			     unsigned long *ip, unsigned long *sp, bool full)
+			     unsigned long *ip, unsigned long *sp)
 {
-	size_t regs_size = full ? REGS_SIZE : IRET_REGS_SIZE;
-	size_t sp_offset = full ? SP_OFFSET : IRET_SP_OFFSET;
-	struct pt_regs *regs = (struct pt_regs *)(addr + regs_size - REGS_SIZE);
-
-	if (IS_ENABLED(CONFIG_X86_64)) {
-		if (!stack_access_ok(state, addr, regs_size))
-			return false;
-
-		*ip = regs->ip;
-		*sp = regs->sp;
+	struct pt_regs *regs = (struct pt_regs *)addr;
 
-		return true;
-	}
+	/* x86-32 support will be more complicated due to the &regs->sp hack */
+	BUILD_BUG_ON(IS_ENABLED(CONFIG_X86_32));
 
-	if (!stack_access_ok(state, addr, sp_offset))
+	if (!stack_access_ok(state, addr, sizeof(struct pt_regs)))
 		return false;
 
 	*ip = regs->ip;
+	*sp = regs->sp;
+	return true;
+}
 
-	if (user_mode(regs)) {
-		if (!stack_access_ok(state, addr + sp_offset,
-				     REGS_SIZE - SP_OFFSET))
-			return false;
-
-		*sp = regs->sp;
-	} else
-		*sp = (unsigned long)&regs->sp;
+static bool deref_stack_iret_regs(struct unwind_state *state, unsigned long addr,
+				  unsigned long *ip, unsigned long *sp)
+{
+	struct pt_regs *regs = (void *)addr - IRET_FRAME_OFFSET;
 
+	if (!stack_access_ok(state, addr, IRET_FRAME_SIZE))
+		return false;
+
+	*ip = regs->ip;
+	*sp = regs->sp;
 	return true;
 }
 
@@ -327,7 +310,6 @@ bool unwind_next_frame(struct unwind_sta
 	unsigned long ip_p, sp, orig_ip, prev_sp = state->sp;
 	enum stack_type prev_type = state->stack_info.type;
 	struct orc_entry *orc;
-	struct pt_regs *ptregs;
 	bool indirect = false;
 
 	if (unwind_done(state))
@@ -435,7 +417,7 @@ bool unwind_next_frame(struct unwind_sta
 		break;
 
 	case ORC_TYPE_REGS:
-		if (!deref_stack_regs(state, sp, &state->ip, &state->sp, true)) {
+		if (!deref_stack_regs(state, sp, &state->ip, &state->sp)) {
 			orc_warn("can't dereference registers at %p for ip %pB\n",
 				 (void *)sp, (void *)orig_ip);
 			goto done;
@@ -447,20 +429,14 @@ bool unwind_next_frame(struct unwind_sta
 		break;
 
 	case ORC_TYPE_REGS_IRET:
-		if (!deref_stack_regs(state, sp, &state->ip, &state->sp, false)) {
+		if (!deref_stack_iret_regs(state, sp, &state->ip, &state->sp)) {
 			orc_warn("can't dereference iret registers at %p for ip %pB\n",
 				 (void *)sp, (void *)orig_ip);
 			goto done;
 		}
 
-		ptregs = container_of((void *)sp, struct pt_regs, ip);
-		if ((unsigned long)ptregs >= prev_sp &&
-		    on_stack(&state->stack_info, ptregs, REGS_SIZE)) {
-			state->regs = ptregs;
-			state->full_regs = false;
-		} else
-			state->regs = NULL;
-
+		state->regs = (void *)sp - IRET_FRAME_OFFSET;
+		state->full_regs = false;
 		state->signal = true;
 		break;
 

^ permalink raw reply	[flat|nested] 118+ messages in thread

* [patch 04/60] x86/irq: Remove an old outdated comment about context tracking races
  2017-12-04 14:07 [patch 00/60] x86/kpti: Kernel Page Table Isolation (was KAISER) Thomas Gleixner
                   ` (2 preceding siblings ...)
  2017-12-04 14:07 ` [patch 03/60] x86/unwinder: Handle stack overflows more gracefully Thomas Gleixner
@ 2017-12-04 14:07 ` Thomas Gleixner
  2017-12-04 14:07 ` [patch 05/60] x86/irq/64: Print the offending IP in the stack overflow warning Thomas Gleixner
                   ` (58 subsequent siblings)
  62 siblings, 0 replies; 118+ messages in thread
From: Thomas Gleixner @ 2017-12-04 14:07 UTC (permalink / raw)
  To: LKML
  Cc: x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, keescook, hughd,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Rik van Riel,
	Boris Ostrovsky, Juergen Gross, David Laight, Eduardo Valentin,
	aliguori, Will Deacon, daniel.gruss, Ingo Molnar,
	Borislav Petkov, Dave Hansen

[-- Attachment #0: x86-irq--Remove_an_old_outdated_comment_about_context_tracking_races.patch --]
[-- Type: text/plain, Size: 1636 bytes --]

From: Andy Lutomirski <luto@kernel.org>

That race has been fixed and code cleaned up for a while now.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lkml.kernel.org/r/12e75976dbbb7ece2b0a64238f1d3892dfed1e16.1511497875.git.luto@kernel.org

---
 arch/x86/kernel/irq.c |   12 ------------
 1 file changed, 12 deletions(-)

--- a/arch/x86/kernel/irq.c
+++ b/arch/x86/kernel/irq.c
@@ -219,18 +219,6 @@ u64 arch_irq_stat(void)
 	/* high bit used in ret_from_ code  */
 	unsigned vector = ~regs->orig_ax;
 
-	/*
-	 * NB: Unlike exception entries, IRQ entries do not reliably
-	 * handle context tracking in the low-level entry code.  This is
-	 * because syscall entries execute briefly with IRQs on before
-	 * updating context tracking state, so we can take an IRQ from
-	 * kernel mode with CONTEXT_USER.  The low-level entry code only
-	 * updates the context if we came from user mode, so we won't
-	 * switch to CONTEXT_KERNEL.  We'll fix that once the syscall
-	 * code is cleaned up enough that we can cleanly defer enabling
-	 * IRQs.
-	 */
-
 	entering_irq();
 
 	/* entering_irq() tells RCU that we're not quiescent.  Check it. */

^ permalink raw reply	[flat|nested] 118+ messages in thread

* [patch 05/60] x86/irq/64: Print the offending IP in the stack overflow warning
  2017-12-04 14:07 [patch 00/60] x86/kpti: Kernel Page Table Isolation (was KAISER) Thomas Gleixner
                   ` (3 preceding siblings ...)
  2017-12-04 14:07 ` [patch 04/60] x86/irq: Remove an old outdated comment about context tracking races Thomas Gleixner
@ 2017-12-04 14:07 ` Thomas Gleixner
  2017-12-04 14:07 ` [patch 06/60] x86/entry/64: Allocate and enable the SYSENTER stack Thomas Gleixner
                   ` (57 subsequent siblings)
  62 siblings, 0 replies; 118+ messages in thread
From: Thomas Gleixner @ 2017-12-04 14:07 UTC (permalink / raw)
  To: LKML
  Cc: x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, keescook, hughd,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Rik van Riel,
	Boris Ostrovsky, Juergen Gross, David Laight, Eduardo Valentin,
	aliguori, Will Deacon, daniel.gruss, Ingo Molnar,
	Borislav Petkov, Dave Hansen

[-- Attachment #0: x86-irq-64--Print_the_offending_IP_in_the_stack_overflow_warning.patch --]
[-- Type: text/plain, Size: 1644 bytes --]

From: Andy Lutomirski <luto@kernel.org>

In case something goes wrong with unwind (not unlikely in case of
overflow), print the offending IP where we detected the overflow.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lkml.kernel.org/r/6fcf700cc5ee884fb739b67d1246ab4185c41409.1511497875.git.luto@kernel.org

---
 arch/x86/kernel/irq_64.c |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

--- a/arch/x86/kernel/irq_64.c
+++ b/arch/x86/kernel/irq_64.c
@@ -57,10 +57,10 @@ static inline void stack_overflow_check(
 	if (regs->sp >= estack_top && regs->sp <= estack_bottom)
 		return;
 
-	WARN_ONCE(1, "do_IRQ(): %s has overflown the kernel stack (cur:%Lx,sp:%lx,irq stk top-bottom:%Lx-%Lx,exception stk top-bottom:%Lx-%Lx)\n",
+	WARN_ONCE(1, "do_IRQ(): %s has overflown the kernel stack (cur:%Lx,sp:%lx,irq stk top-bottom:%Lx-%Lx,exception stk top-bottom:%Lx-%Lx,ip:%pF)\n",
 		current->comm, curbase, regs->sp,
 		irq_stack_top, irq_stack_bottom,
-		estack_top, estack_bottom);
+		estack_top, estack_bottom, (void *)regs->ip);
 
 	if (sysctl_panic_on_stackoverflow)
 		panic("low stack detected by irq handler - check messages\n");

^ permalink raw reply	[flat|nested] 118+ messages in thread

* [patch 06/60] x86/entry/64: Allocate and enable the SYSENTER stack
  2017-12-04 14:07 [patch 00/60] x86/kpti: Kernel Page Table Isolation (was KAISER) Thomas Gleixner
                   ` (4 preceding siblings ...)
  2017-12-04 14:07 ` [patch 05/60] x86/irq/64: Print the offending IP in the stack overflow warning Thomas Gleixner
@ 2017-12-04 14:07 ` Thomas Gleixner
  2017-12-04 14:07 ` [patch 07/60] x86/dumpstack: Add get_stack_info() support for " Thomas Gleixner
                   ` (56 subsequent siblings)
  62 siblings, 0 replies; 118+ messages in thread
From: Thomas Gleixner @ 2017-12-04 14:07 UTC (permalink / raw)
  To: LKML
  Cc: x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, keescook, hughd,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Rik van Riel,
	Boris Ostrovsky, Juergen Gross, David Laight, Eduardo Valentin,
	aliguori, Will Deacon, daniel.gruss, Ingo Molnar,
	Borislav Petkov, Dave Hansen

[-- Attachment #0: x86-entry-64--Allocate_and_enable_the_SYSENTER_stack.patch --]
[-- Type: text/plain, Size: 4793 bytes --]

From: Andy Lutomirski <luto@kernel.org>

This will simplify future changes that want scratch variables early in
the SYSENTER handler -- they'll be able to spill registers to the
stack.  It also lets us get rid of a SWAPGS_UNSAFE_STACK user.

This does not depend on CONFIG_IA32_EMULATION=y because we'll want the
stack space even without IA32 emulation.

As far as I can tell, the reason that this wasn't done from day 1 is
that we use IST for #DB and #BP, which is IMO rather nasty and causes
a lot more problems than it solves.  But, since #DB uses IST, we don't
actually need a real stack for SYSENTER (because SYSENTER with TF set
will invoke #DB on the IST stack rather than the SYSENTER stack).

I want to remove IST usage from these vectors some day, and this patch
is a prerequisite for that as well.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lkml.kernel.org/r/c37d6e68a73e1b5b1203e0e95b488fa8092b3cfb.1511497875.git.luto@kernel.org

---
 arch/x86/entry/entry_64_compat.S |    2 +-
 arch/x86/include/asm/processor.h |    3 ---
 arch/x86/kernel/asm-offsets.c    |    5 +++++
 arch/x86/kernel/asm-offsets_32.c |    5 -----
 arch/x86/kernel/cpu/common.c     |    4 +++-
 arch/x86/kernel/process.c        |    2 --
 arch/x86/kernel/traps.c          |    3 +--
 7 files changed, 10 insertions(+), 14 deletions(-)

--- a/arch/x86/entry/entry_64_compat.S
+++ b/arch/x86/entry/entry_64_compat.S
@@ -48,7 +48,7 @@
  */
 ENTRY(entry_SYSENTER_compat)
 	/* Interrupts are off on entry. */
-	SWAPGS_UNSAFE_STACK
+	SWAPGS
 	movq	PER_CPU_VAR(cpu_current_top_of_stack), %rsp
 
 	/*
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -340,14 +340,11 @@ struct tss_struct {
 	 */
 	unsigned long		io_bitmap[IO_BITMAP_LONGS + 1];
 
-#ifdef CONFIG_X86_32
 	/*
 	 * Space for the temporary SYSENTER stack.
 	 */
 	unsigned long		SYSENTER_stack_canary;
 	unsigned long		SYSENTER_stack[64];
-#endif
-
 } ____cacheline_aligned;
 
 DECLARE_PER_CPU_SHARED_ALIGNED(struct tss_struct, cpu_tss);
--- a/arch/x86/kernel/asm-offsets.c
+++ b/arch/x86/kernel/asm-offsets.c
@@ -93,4 +93,9 @@ void common(void) {
 
 	BLANK();
 	DEFINE(PTREGS_SIZE, sizeof(struct pt_regs));
+
+	/* Offset from cpu_tss to SYSENTER_stack */
+	OFFSET(CPU_TSS_SYSENTER_stack, tss_struct, SYSENTER_stack);
+	/* Size of SYSENTER_stack */
+	DEFINE(SIZEOF_SYSENTER_stack, sizeof(((struct tss_struct *)0)->SYSENTER_stack));
 }
--- a/arch/x86/kernel/asm-offsets_32.c
+++ b/arch/x86/kernel/asm-offsets_32.c
@@ -50,11 +50,6 @@ void foo(void)
 	DEFINE(TSS_sysenter_sp0, offsetof(struct tss_struct, x86_tss.sp0) -
 	       offsetofend(struct tss_struct, SYSENTER_stack));
 
-	/* Offset from cpu_tss to SYSENTER_stack */
-	OFFSET(CPU_TSS_SYSENTER_stack, tss_struct, SYSENTER_stack);
-	/* Size of SYSENTER_stack */
-	DEFINE(SIZEOF_SYSENTER_stack, sizeof(((struct tss_struct *)0)->SYSENTER_stack));
-
 #ifdef CONFIG_CC_STACKPROTECTOR
 	BLANK();
 	OFFSET(stack_canary_offset, stack_canary, canary);
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -1386,7 +1386,9 @@ void syscall_init(void)
 	 * AMD doesn't allow SYSENTER in long mode (either 32- or 64-bit).
 	 */
 	wrmsrl_safe(MSR_IA32_SYSENTER_CS, (u64)__KERNEL_CS);
-	wrmsrl_safe(MSR_IA32_SYSENTER_ESP, 0ULL);
+	wrmsrl_safe(MSR_IA32_SYSENTER_ESP,
+		    (unsigned long)this_cpu_ptr(&cpu_tss) +
+		    offsetofend(struct tss_struct, SYSENTER_stack));
 	wrmsrl_safe(MSR_IA32_SYSENTER_EIP, (u64)entry_SYSENTER_compat);
 #else
 	wrmsrl(MSR_CSTAR, (unsigned long)ignore_sysret);
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -71,9 +71,7 @@
 	  */
 	.io_bitmap		= { [0 ... IO_BITMAP_LONGS] = ~0 },
 #endif
-#ifdef CONFIG_X86_32
 	.SYSENTER_stack_canary	= STACK_END_MAGIC,
-#endif
 };
 EXPORT_PER_CPU_SYMBOL(cpu_tss);
 
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -800,14 +800,13 @@ dotraplinkage void do_debug(struct pt_re
 	debug_stack_usage_dec();
 
 exit:
-#if defined(CONFIG_X86_32)
 	/*
 	 * This is the most likely code path that involves non-trivial use
 	 * of the SYSENTER stack.  Check that we haven't overrun it.
 	 */
 	WARN(this_cpu_read(cpu_tss.SYSENTER_stack_canary) != STACK_END_MAGIC,
 	     "Overran or corrupted SYSENTER stack\n");
-#endif
+
 	ist_exit(regs);
 }
 NOKPROBE_SYMBOL(do_debug);

^ permalink raw reply	[flat|nested] 118+ messages in thread

* [patch 07/60] x86/dumpstack: Add get_stack_info() support for the SYSENTER stack
  2017-12-04 14:07 [patch 00/60] x86/kpti: Kernel Page Table Isolation (was KAISER) Thomas Gleixner
                   ` (5 preceding siblings ...)
  2017-12-04 14:07 ` [patch 06/60] x86/entry/64: Allocate and enable the SYSENTER stack Thomas Gleixner
@ 2017-12-04 14:07 ` " Thomas Gleixner
  2017-12-04 14:07 ` [patch 08/60] x86/entry/gdt: Put per-CPU GDT remaps in ascending order Thomas Gleixner
                   ` (55 subsequent siblings)
  62 siblings, 0 replies; 118+ messages in thread
From: Thomas Gleixner @ 2017-12-04 14:07 UTC (permalink / raw)
  To: LKML
  Cc: x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, keescook, hughd,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Rik van Riel,
	Boris Ostrovsky, Juergen Gross, David Laight, Eduardo Valentin,
	aliguori, Will Deacon, daniel.gruss, Ingo Molnar,
	Borislav Petkov, Dave Hansen

[-- Attachment #0: x86-dumpstack--Add_get_stack_info__support_for_the_SYSENTER_stack.patch --]
[-- Type: text/plain, Size: 4198 bytes --]

From: Andy Lutomirski <luto@kernel.org>

get_stack_info() doesn't currently know about the SYSENTER stack, so
unwinding will fail if we entered the kernel on the SYSENTER stack
and haven't fully switched off.  Teach get_stack_info() about the
SYSENTER stack.

With future patches applied that run part of the entry code on the
SYSENTER stack and introduce an intentional BUG(), I would get:

  PANIC: double fault, error_code: 0x0
  ...
  RIP: 0010:do_error_trap+0x33/0x1c0
  ...
  Call Trace:
  Code: ...

With this patch, I get:

  PANIC: double fault, error_code: 0x0
  ...
  Call Trace:
   <SYSENTER>
   ? async_page_fault+0x36/0x60
   ? invalid_op+0x22/0x40
   ? async_page_fault+0x36/0x60
   ? sync_regs+0x3c/0x40
   ? sync_regs+0x2e/0x40
   ? error_entry+0x6c/0xd0
   ? async_page_fault+0x36/0x60
   </SYSENTER>
  Code: ...

which is a lot more informative.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lkml.kernel.org/r/c32ce8b363e27fa9b4a4773297d5b4b0f4b39e94.1511497875.git.luto@kernel.org

---
 arch/x86/include/asm/stacktrace.h |    3 +++
 arch/x86/kernel/dumpstack.c       |   19 +++++++++++++++++++
 arch/x86/kernel/dumpstack_32.c    |    6 ++++++
 arch/x86/kernel/dumpstack_64.c    |    6 ++++++
 4 files changed, 34 insertions(+)

--- a/arch/x86/include/asm/stacktrace.h
+++ b/arch/x86/include/asm/stacktrace.h
@@ -16,6 +16,7 @@ enum stack_type {
 	STACK_TYPE_TASK,
 	STACK_TYPE_IRQ,
 	STACK_TYPE_SOFTIRQ,
+	STACK_TYPE_SYSENTER,
 	STACK_TYPE_EXCEPTION,
 	STACK_TYPE_EXCEPTION_LAST = STACK_TYPE_EXCEPTION + N_EXCEPTION_STACKS-1,
 };
@@ -28,6 +29,8 @@ struct stack_info {
 bool in_task_stack(unsigned long *stack, struct task_struct *task,
 		   struct stack_info *info);
 
+bool in_sysenter_stack(unsigned long *stack, struct stack_info *info);
+
 int get_stack_info(unsigned long *stack, struct task_struct *task,
 		   struct stack_info *info, unsigned long *visit_mask);
 
--- a/arch/x86/kernel/dumpstack.c
+++ b/arch/x86/kernel/dumpstack.c
@@ -43,6 +43,25 @@ bool in_task_stack(unsigned long *stack,
 	return true;
 }
 
+bool in_sysenter_stack(unsigned long *stack, struct stack_info *info)
+{
+	struct tss_struct *tss = this_cpu_ptr(&cpu_tss);
+
+	/* Treat the canary as part of the stack for unwinding purposes. */
+	void *begin = &tss->SYSENTER_stack_canary;
+	void *end = (void *)&tss->SYSENTER_stack + sizeof(tss->SYSENTER_stack);
+
+	if ((void *)stack < begin || (void *)stack >= end)
+		return false;
+
+	info->type	= STACK_TYPE_SYSENTER;
+	info->begin	= begin;
+	info->end	= end;
+	info->next_sp	= NULL;
+
+	return true;
+}
+
 static void printk_stack_address(unsigned long address, int reliable,
 				 char *log_lvl)
 {
--- a/arch/x86/kernel/dumpstack_32.c
+++ b/arch/x86/kernel/dumpstack_32.c
@@ -26,6 +26,9 @@ const char *stack_type_name(enum stack_t
 	if (type == STACK_TYPE_SOFTIRQ)
 		return "SOFTIRQ";
 
+	if (type == STACK_TYPE_SYSENTER)
+		return "SYSENTER";
+
 	return NULL;
 }
 
@@ -93,6 +96,9 @@ int get_stack_info(unsigned long *stack,
 	if (task != current)
 		goto unknown;
 
+	if (in_sysenter_stack(stack, info))
+		goto recursion_check;
+
 	if (in_hardirq_stack(stack, info))
 		goto recursion_check;
 
--- a/arch/x86/kernel/dumpstack_64.c
+++ b/arch/x86/kernel/dumpstack_64.c
@@ -37,6 +37,9 @@ const char *stack_type_name(enum stack_t
 	if (type == STACK_TYPE_IRQ)
 		return "IRQ";
 
+	if (type == STACK_TYPE_SYSENTER)
+		return "SYSENTER";
+
 	if (type >= STACK_TYPE_EXCEPTION && type <= STACK_TYPE_EXCEPTION_LAST)
 		return exception_stack_names[type - STACK_TYPE_EXCEPTION];
 
@@ -115,6 +118,9 @@ int get_stack_info(unsigned long *stack,
 	if (in_irq_stack(stack, info))
 		goto recursion_check;
 
+	if (in_sysenter_stack(stack, info))
+		goto recursion_check;
+
 	goto unknown;
 
 recursion_check:

^ permalink raw reply	[flat|nested] 118+ messages in thread

* [patch 08/60] x86/entry/gdt: Put per-CPU GDT remaps in ascending order
  2017-12-04 14:07 [patch 00/60] x86/kpti: Kernel Page Table Isolation (was KAISER) Thomas Gleixner
                   ` (6 preceding siblings ...)
  2017-12-04 14:07 ` [patch 07/60] x86/dumpstack: Add get_stack_info() support for " Thomas Gleixner
@ 2017-12-04 14:07 ` Thomas Gleixner
  2017-12-04 14:07 ` [patch 09/60] x86/mm/fixmap: Generalize the GDT fixmap mechanism, introduce struct cpu_entry_area Thomas Gleixner
                   ` (54 subsequent siblings)
  62 siblings, 0 replies; 118+ messages in thread
From: Thomas Gleixner @ 2017-12-04 14:07 UTC (permalink / raw)
  To: LKML
  Cc: x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, keescook, hughd,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Rik van Riel,
	Boris Ostrovsky, Juergen Gross, David Laight, Eduardo Valentin,
	aliguori, Will Deacon, daniel.gruss, Ingo Molnar,
	Borislav Petkov, Dave Hansen

[-- Attachment #0: x86-entry-gdt--Put_per-CPU_GDT_remaps_in_ascending_order.patch --]
[-- Type: text/plain, Size: 1487 bytes --]

From: Andy Lutomirski <luto@kernel.org>

We currently have CPU 0's GDT at the top of the GDT range and
higher-numbered CPUs at lower addresses.  This happens because the
fixmap is upside down (index 0 is the top of the fixmap).

Flip it so that GDTs are in ascending order by virtual address.
This will simplify a future patch that will generalize the GDT
remap to contain multiple pages.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lkml.kernel.org/r/3966a6edf6fd45deca4cf52a9b9276402499dda9.1511497875.git.luto@kernel.org

---
 arch/x86/include/asm/desc.h |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/arch/x86/include/asm/desc.h
+++ b/arch/x86/include/asm/desc.h
@@ -63,7 +63,7 @@ static inline struct desc_struct *get_cu
 /* Get the fixmap index for a specific processor */
 static inline unsigned int get_cpu_gdt_ro_index(int cpu)
 {
-	return FIX_GDT_REMAP_BEGIN + cpu;
+	return FIX_GDT_REMAP_END - cpu;
 }
 
 /* Provide the fixmap address of the remapped GDT */

^ permalink raw reply	[flat|nested] 118+ messages in thread

* [patch 09/60] x86/mm/fixmap: Generalize the GDT fixmap mechanism, introduce struct cpu_entry_area
  2017-12-04 14:07 [patch 00/60] x86/kpti: Kernel Page Table Isolation (was KAISER) Thomas Gleixner
                   ` (7 preceding siblings ...)
  2017-12-04 14:07 ` [patch 08/60] x86/entry/gdt: Put per-CPU GDT remaps in ascending order Thomas Gleixner
@ 2017-12-04 14:07 ` Thomas Gleixner
  2017-12-04 14:07 ` [patch 10/60] x86/kasan/64: Teach KASAN about the cpu_entry_area Thomas Gleixner
                   ` (53 subsequent siblings)
  62 siblings, 0 replies; 118+ messages in thread
From: Thomas Gleixner @ 2017-12-04 14:07 UTC (permalink / raw)
  To: LKML
  Cc: x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, keescook, hughd,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Rik van Riel,
	Boris Ostrovsky, Juergen Gross, David Laight, Eduardo Valentin,
	aliguori, Will Deacon, daniel.gruss, Ingo Molnar,
	Borislav Petkov, Dave Hansen

[-- Attachment #0: x86-mm-fixmap--Generalize_the_GDT_fixmap_mechanism-_introduce_struct_cpu_entry_area.patch --]
[-- Type: text/plain, Size: 5577 bytes --]

From: Andy Lutomirski <luto@kernel.org>

Currently, the GDT is an ad-hoc array of pages, one per CPU, in the
fixmap.  Generalize it to be an array of a new 'struct cpu_entry_area'
so that we can cleanly add new things to it.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lkml.kernel.org/r/22571d77ba1f3c714df9fa37db9a58218bc17597.1511497875.git.luto@kernel.org

---
 arch/x86/include/asm/desc.h   |    9 +--------
 arch/x86/include/asm/fixmap.h |   37 +++++++++++++++++++++++++++++++++++--
 arch/x86/kernel/cpu/common.c  |   14 +++++++-------
 arch/x86/xen/mmu_pv.c         |    2 +-
 4 files changed, 44 insertions(+), 18 deletions(-)

--- a/arch/x86/include/asm/desc.h
+++ b/arch/x86/include/asm/desc.h
@@ -60,17 +60,10 @@ static inline struct desc_struct *get_cu
 	return this_cpu_ptr(&gdt_page)->gdt;
 }
 
-/* Get the fixmap index for a specific processor */
-static inline unsigned int get_cpu_gdt_ro_index(int cpu)
-{
-	return FIX_GDT_REMAP_END - cpu;
-}
-
 /* Provide the fixmap address of the remapped GDT */
 static inline struct desc_struct *get_cpu_gdt_ro(int cpu)
 {
-	unsigned int idx = get_cpu_gdt_ro_index(cpu);
-	return (struct desc_struct *)__fix_to_virt(idx);
+	return (struct desc_struct *)&get_cpu_entry_area(cpu)->gdt;
 }
 
 /* Provide the current read-only GDT */
--- a/arch/x86/include/asm/fixmap.h
+++ b/arch/x86/include/asm/fixmap.h
@@ -44,6 +44,19 @@ extern unsigned long __FIXADDR_TOP;
 			 PAGE_SIZE)
 #endif
 
+/*
+ * cpu_entry_area is a percpu region in the fixmap that contains things
+ * needed by the CPU and early entry/exit code.  Real types aren't used
+ * for all fields here to avoid circular header dependencies.
+ *
+ * Every field is a virtual alias of some other allocated backing store.
+ * There is no direct allocation of a struct cpu_entry_area.
+ */
+struct cpu_entry_area {
+	char gdt[PAGE_SIZE];
+};
+
+#define CPU_ENTRY_AREA_PAGES (sizeof(struct cpu_entry_area) / PAGE_SIZE)
 
 /*
  * Here we define all the compile-time 'special' virtual
@@ -101,8 +114,8 @@ enum fixed_addresses {
 	FIX_LNW_VRTC,
 #endif
 	/* Fixmap entries to remap the GDTs, one per processor. */
-	FIX_GDT_REMAP_BEGIN,
-	FIX_GDT_REMAP_END = FIX_GDT_REMAP_BEGIN + NR_CPUS - 1,
+	FIX_CPU_ENTRY_AREA_TOP,
+	FIX_CPU_ENTRY_AREA_BOTTOM = FIX_CPU_ENTRY_AREA_TOP + (CPU_ENTRY_AREA_PAGES * NR_CPUS) - 1,
 
 	__end_of_permanent_fixed_addresses,
 
@@ -185,5 +198,25 @@ void __init *early_memremap_decrypted_wp
 void __early_set_fixmap(enum fixed_addresses idx,
 			phys_addr_t phys, pgprot_t flags);
 
+static inline unsigned int __get_cpu_entry_area_page_index(int cpu, int page)
+{
+	BUILD_BUG_ON(sizeof(struct cpu_entry_area) % PAGE_SIZE != 0);
+
+	return FIX_CPU_ENTRY_AREA_BOTTOM - cpu*CPU_ENTRY_AREA_PAGES - page;
+}
+
+#define __get_cpu_entry_area_offset_index(cpu, offset) ({		\
+	BUILD_BUG_ON(offset % PAGE_SIZE != 0);				\
+	__get_cpu_entry_area_page_index(cpu, offset / PAGE_SIZE);	\
+	})
+
+#define get_cpu_entry_area_index(cpu, field)				\
+	__get_cpu_entry_area_offset_index((cpu), offsetof(struct cpu_entry_area, field))
+
+static inline struct cpu_entry_area *get_cpu_entry_area(int cpu)
+{
+	return (struct cpu_entry_area *)__fix_to_virt(__get_cpu_entry_area_page_index(cpu, 0));
+}
+
 #endif /* !__ASSEMBLY__ */
 #endif /* _ASM_X86_FIXMAP_H */
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -490,12 +490,12 @@ void load_percpu_segment(int cpu)
 	load_stack_canary_segment();
 }
 
-/* Setup the fixmap mapping only once per-processor */
-static inline void setup_fixmap_gdt(int cpu)
+/* Setup the fixmap mappings only once per-processor */
+static inline void setup_cpu_entry_area(int cpu)
 {
 #ifdef CONFIG_X86_64
 	/* On 64-bit systems, we use a read-only fixmap GDT. */
-	pgprot_t prot = PAGE_KERNEL_RO;
+	pgprot_t gdt_prot = PAGE_KERNEL_RO;
 #else
 	/*
 	 * On native 32-bit systems, the GDT cannot be read-only because
@@ -506,11 +506,11 @@ static inline void setup_fixmap_gdt(int
 	 * On Xen PV, the GDT must be read-only because the hypervisor requires
 	 * it.
 	 */
-	pgprot_t prot = boot_cpu_has(X86_FEATURE_XENPV) ?
+	pgprot_t gdt_prot = boot_cpu_has(X86_FEATURE_XENPV) ?
 		PAGE_KERNEL_RO : PAGE_KERNEL;
 #endif
 
-	__set_fixmap(get_cpu_gdt_ro_index(cpu), get_cpu_gdt_paddr(cpu), prot);
+	__set_fixmap(get_cpu_entry_area_index(cpu, gdt), get_cpu_gdt_paddr(cpu), gdt_prot);
 }
 
 /* Load the original GDT from the per-cpu structure */
@@ -1614,7 +1614,7 @@ void cpu_init(void)
 	if (is_uv_system())
 		uv_cpu_init();
 
-	setup_fixmap_gdt(cpu);
+	setup_cpu_entry_area(cpu);
 	load_fixmap_gdt(cpu);
 }
 
@@ -1676,7 +1676,7 @@ void cpu_init(void)
 
 	fpu__init_cpu();
 
-	setup_fixmap_gdt(cpu);
+	setup_cpu_entry_area(cpu);
 	load_fixmap_gdt(cpu);
 }
 #endif
--- a/arch/x86/xen/mmu_pv.c
+++ b/arch/x86/xen/mmu_pv.c
@@ -2272,7 +2272,7 @@ static void xen_set_fixmap(unsigned idx,
 #endif
 	case FIX_TEXT_POKE0:
 	case FIX_TEXT_POKE1:
-	case FIX_GDT_REMAP_BEGIN ... FIX_GDT_REMAP_END:
+	case FIX_CPU_ENTRY_AREA_TOP ... FIX_CPU_ENTRY_AREA_BOTTOM:
 		/* All local page mappings */
 		pte = pfn_pte(phys, prot);
 		break;

^ permalink raw reply	[flat|nested] 118+ messages in thread

* [patch 10/60] x86/kasan/64: Teach KASAN about the cpu_entry_area
  2017-12-04 14:07 [patch 00/60] x86/kpti: Kernel Page Table Isolation (was KAISER) Thomas Gleixner
                   ` (8 preceding siblings ...)
  2017-12-04 14:07 ` [patch 09/60] x86/mm/fixmap: Generalize the GDT fixmap mechanism, introduce struct cpu_entry_area Thomas Gleixner
@ 2017-12-04 14:07 ` Thomas Gleixner
  2017-12-04 14:07 ` [patch 11/60] x86/entry: Fix assumptions that the HW TSS is at the beginning of cpu_tss Thomas Gleixner
                   ` (52 subsequent siblings)
  62 siblings, 0 replies; 118+ messages in thread
From: Thomas Gleixner @ 2017-12-04 14:07 UTC (permalink / raw)
  To: LKML
  Cc: x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, keescook, hughd,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Rik van Riel,
	Boris Ostrovsky, Juergen Gross, David Laight, Eduardo Valentin,
	aliguori, Will Deacon, daniel.gruss, Andrey Ryabinin,
	Ingo Molnar, Dave Hansen, kasan-dev, Borislav Petkov,
	Alexander Potapenko, Dmitry Vyukov

[-- Attachment #0: x86-kasan-64--Teach_KASAN_about_the_cpu_entry_area.patch --]
[-- Type: text/plain, Size: 2336 bytes --]

From: Andy Lutomirski <luto@kernel.org>

The cpu_entry_area will contain stacks.  Make sure that KASAN has
appropriate shadow mappings for them.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: kasan-dev@googlegroups.com
Cc: Borislav Petkov <bp@alien8.de>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Alexander Potapenko <glider@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Dmitry Vyukov <dvyukov@google.com>
Link: https://lkml.kernel.org/r/8407adf9126440d6467dade88fdb3e3b75fc1019.1511497875.git.luto@kernel.org

---
 arch/x86/mm/kasan_init_64.c |   18 +++++++++++++++++-
 1 file changed, 17 insertions(+), 1 deletion(-)

--- a/arch/x86/mm/kasan_init_64.c
+++ b/arch/x86/mm/kasan_init_64.c
@@ -277,6 +277,7 @@ void __init kasan_early_init(void)
 void __init kasan_init(void)
 {
 	int i;
+	void *shadow_cpu_entry_begin, *shadow_cpu_entry_end;
 
 #ifdef CONFIG_KASAN_INLINE
 	register_die_notifier(&kasan_die_notifier);
@@ -329,8 +330,23 @@ void __init kasan_init(void)
 			      (unsigned long)kasan_mem_to_shadow(_end),
 			      early_pfn_to_nid(__pa(_stext)));
 
+	shadow_cpu_entry_begin = (void *)__fix_to_virt(FIX_CPU_ENTRY_AREA_BOTTOM);
+	shadow_cpu_entry_begin = kasan_mem_to_shadow(shadow_cpu_entry_begin);
+	shadow_cpu_entry_begin = (void *)round_down((unsigned long)shadow_cpu_entry_begin,
+						PAGE_SIZE);
+
+	shadow_cpu_entry_end = (void *)(__fix_to_virt(FIX_CPU_ENTRY_AREA_TOP) + PAGE_SIZE);
+	shadow_cpu_entry_end = kasan_mem_to_shadow(shadow_cpu_entry_end);
+	shadow_cpu_entry_end = (void *)round_up((unsigned long)shadow_cpu_entry_end,
+					PAGE_SIZE);
+
 	kasan_populate_zero_shadow(kasan_mem_to_shadow((void *)MODULES_END),
-			(void *)KASAN_SHADOW_END);
+				   shadow_cpu_entry_begin);
+
+	kasan_populate_shadow((unsigned long)shadow_cpu_entry_begin,
+			      (unsigned long)shadow_cpu_entry_end, 0);
+
+	kasan_populate_zero_shadow(shadow_cpu_entry_end, (void *)KASAN_SHADOW_END);
 
 	load_cr3(init_top_pgt);
 	__flush_tlb_all();

^ permalink raw reply	[flat|nested] 118+ messages in thread

* [patch 11/60] x86/entry: Fix assumptions that the HW TSS is at the beginning of cpu_tss
  2017-12-04 14:07 [patch 00/60] x86/kpti: Kernel Page Table Isolation (was KAISER) Thomas Gleixner
                   ` (9 preceding siblings ...)
  2017-12-04 14:07 ` [patch 10/60] x86/kasan/64: Teach KASAN about the cpu_entry_area Thomas Gleixner
@ 2017-12-04 14:07 ` Thomas Gleixner
  2017-12-04 14:07 ` [patch 12/60] x86/dumpstack: Handle stack overflow on all stacks Thomas Gleixner
                   ` (51 subsequent siblings)
  62 siblings, 0 replies; 118+ messages in thread
From: Thomas Gleixner @ 2017-12-04 14:07 UTC (permalink / raw)
  To: LKML
  Cc: x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, keescook, hughd,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Rik van Riel,
	Boris Ostrovsky, Juergen Gross, David Laight, Eduardo Valentin,
	aliguori, Will Deacon, daniel.gruss, Ingo Molnar,
	Borislav Petkov

[-- Attachment #0: x86-entry--Fix_assumptions_that_the_HW_TSS_is_at_the_beginning_of_cpu_tss.patch --]
[-- Type: text/plain, Size: 6003 bytes --]

From: Andy Lutomirski <luto@kernel.org>

A future patch will move SYSENTER_stack to the beginning of cpu_tss
to help detect overflow.  Before this can happen, fix several code
paths that hardcode assumptions about the old layout.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Dave Hansen <dave.hansen@intel.com>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lkml.kernel.org/r/d40a2c5ae4539d64090849a374f3169ec492f4e2.1511497875.git.luto@kernel.org

---
 arch/x86/include/asm/desc.h      |    2 +-
 arch/x86/include/asm/processor.h |    9 +++++++--
 arch/x86/kernel/cpu/common.c     |    8 ++++----
 arch/x86/kernel/doublefault.c    |   32 +++++++++++++++-----------------
 arch/x86/kvm/vmx.c               |    2 +-
 arch/x86/power/cpu.c             |   13 +++++++------
 6 files changed, 35 insertions(+), 31 deletions(-)

--- a/arch/x86/include/asm/desc.h
+++ b/arch/x86/include/asm/desc.h
@@ -178,7 +178,7 @@ static inline void set_tssldt_descriptor
 #endif
 }
 
-static inline void __set_tss_desc(unsigned cpu, unsigned int entry, void *addr)
+static inline void __set_tss_desc(unsigned cpu, unsigned int entry, struct x86_hw_tss *addr)
 {
 	struct desc_struct *d = get_cpu_gdt_rw(cpu);
 	tss_desc tss;
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -163,7 +163,7 @@ enum cpuid_regs_idx {
 extern struct cpuinfo_x86	boot_cpu_data;
 extern struct cpuinfo_x86	new_cpu_data;
 
-extern struct tss_struct	doublefault_tss;
+extern struct x86_hw_tss	doublefault_tss;
 extern __u32			cpu_caps_cleared[NCAPINTS];
 extern __u32			cpu_caps_set[NCAPINTS];
 
@@ -253,6 +253,11 @@ static inline void load_cr3(pgd_t *pgdir
 	write_cr3(__sme_pa(pgdir));
 }
 
+/*
+ * Note that while the legacy 'TSS' name comes from 'Task State Segment',
+ * on modern x86 CPUs the TSS also holds information important to 64-bit mode,
+ * unrelated to the task-switch mechanism:
+ */
 #ifdef CONFIG_X86_32
 /* This is the TSS defined by the hardware. */
 struct x86_hw_tss {
@@ -323,7 +328,7 @@ struct x86_hw_tss {
 #define IO_BITMAP_BITS			65536
 #define IO_BITMAP_BYTES			(IO_BITMAP_BITS/8)
 #define IO_BITMAP_LONGS			(IO_BITMAP_BYTES/sizeof(long))
-#define IO_BITMAP_OFFSET		offsetof(struct tss_struct, io_bitmap)
+#define IO_BITMAP_OFFSET		(offsetof(struct tss_struct, io_bitmap) - offsetof(struct tss_struct, x86_tss))
 #define INVALID_IO_BITMAP_OFFSET	0x8000
 
 struct tss_struct {
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -1582,7 +1582,7 @@ void cpu_init(void)
 		}
 	}
 
-	t->x86_tss.io_bitmap_base = offsetof(struct tss_struct, io_bitmap);
+	t->x86_tss.io_bitmap_base = IO_BITMAP_OFFSET;
 
 	/*
 	 * <= is required because the CPU will access up to
@@ -1601,7 +1601,7 @@ void cpu_init(void)
 	 * Initialize the TSS.  Don't bother initializing sp0, as the initial
 	 * task never enters user mode.
 	 */
-	set_tss_desc(cpu, t);
+	set_tss_desc(cpu, &t->x86_tss);
 	load_TR_desc();
 
 	load_mm_ldt(&init_mm);
@@ -1659,12 +1659,12 @@ void cpu_init(void)
 	 * Initialize the TSS.  Don't bother initializing sp0, as the initial
 	 * task never enters user mode.
 	 */
-	set_tss_desc(cpu, t);
+	set_tss_desc(cpu, &t->x86_tss);
 	load_TR_desc();
 
 	load_mm_ldt(&init_mm);
 
-	t->x86_tss.io_bitmap_base = offsetof(struct tss_struct, io_bitmap);
+	t->x86_tss.io_bitmap_base = IO_BITMAP_OFFSET;
 
 #ifdef CONFIG_DOUBLEFAULT
 	/* Set up doublefault TSS pointer in the GDT */
--- a/arch/x86/kernel/doublefault.c
+++ b/arch/x86/kernel/doublefault.c
@@ -50,25 +50,23 @@ static void doublefault_fn(void)
 		cpu_relax();
 }
 
-struct tss_struct doublefault_tss __cacheline_aligned = {
-	.x86_tss = {
-		.sp0		= STACK_START,
-		.ss0		= __KERNEL_DS,
-		.ldt		= 0,
-		.io_bitmap_base	= INVALID_IO_BITMAP_OFFSET,
+struct x86_hw_tss doublefault_tss __cacheline_aligned = {
+	.sp0		= STACK_START,
+	.ss0		= __KERNEL_DS,
+	.ldt		= 0,
+	.io_bitmap_base	= INVALID_IO_BITMAP_OFFSET,
 
-		.ip		= (unsigned long) doublefault_fn,
-		/* 0x2 bit is always set */
-		.flags		= X86_EFLAGS_SF | 0x2,
-		.sp		= STACK_START,
-		.es		= __USER_DS,
-		.cs		= __KERNEL_CS,
-		.ss		= __KERNEL_DS,
-		.ds		= __USER_DS,
-		.fs		= __KERNEL_PERCPU,
+	.ip		= (unsigned long) doublefault_fn,
+	/* 0x2 bit is always set */
+	.flags		= X86_EFLAGS_SF | 0x2,
+	.sp		= STACK_START,
+	.es		= __USER_DS,
+	.cs		= __KERNEL_CS,
+	.ss		= __KERNEL_DS,
+	.ds		= __USER_DS,
+	.fs		= __KERNEL_PERCPU,
 
-		.__cr3		= __pa_nodebug(swapper_pg_dir),
-	}
+	.__cr3		= __pa_nodebug(swapper_pg_dir),
 };
 
 /* dummy for do_double_fault() call */
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -2291,7 +2291,7 @@ static void vmx_vcpu_load(struct kvm_vcp
 		 * processors.  See 22.2.4.
 		 */
 		vmcs_writel(HOST_TR_BASE,
-			    (unsigned long)this_cpu_ptr(&cpu_tss));
+			    (unsigned long)this_cpu_ptr(&cpu_tss.x86_tss));
 		vmcs_writel(HOST_GDTR_BASE, (unsigned long)gdt);   /* 22.2.4 */
 
 		/*
--- a/arch/x86/power/cpu.c
+++ b/arch/x86/power/cpu.c
@@ -165,12 +165,13 @@ static void fix_processor_context(void)
 	struct desc_struct *desc = get_cpu_gdt_rw(cpu);
 	tss_desc tss;
 #endif
-	set_tss_desc(cpu, t);	/*
-				 * This just modifies memory; should not be
-				 * necessary. But... This is necessary, because
-				 * 386 hardware has concept of busy TSS or some
-				 * similar stupidity.
-				 */
+
+	/*
+	 * This just modifies memory; should not be necessary. But... This is
+	 * necessary, because 386 hardware has concept of busy TSS or some
+	 * similar stupidity.
+	 */
+	set_tss_desc(cpu, &t->x86_tss);
 
 #ifdef CONFIG_X86_64
 	memcpy(&tss, &desc[GDT_ENTRY_TSS], sizeof(tss_desc));

^ permalink raw reply	[flat|nested] 118+ messages in thread

* [patch 12/60] x86/dumpstack: Handle stack overflow on all stacks
  2017-12-04 14:07 [patch 00/60] x86/kpti: Kernel Page Table Isolation (was KAISER) Thomas Gleixner
                   ` (10 preceding siblings ...)
  2017-12-04 14:07 ` [patch 11/60] x86/entry: Fix assumptions that the HW TSS is at the beginning of cpu_tss Thomas Gleixner
@ 2017-12-04 14:07 ` Thomas Gleixner
  2017-12-04 14:07 ` [patch 13/60] x86/entry: Move SYSENTER_stack to the beginning of struct tss_struct Thomas Gleixner
                   ` (50 subsequent siblings)
  62 siblings, 0 replies; 118+ messages in thread
From: Thomas Gleixner @ 2017-12-04 14:07 UTC (permalink / raw)
  To: LKML
  Cc: x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, keescook, hughd,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Rik van Riel,
	Boris Ostrovsky, Juergen Gross, David Laight, Eduardo Valentin,
	aliguori, Will Deacon, daniel.gruss, Ingo Molnar,
	Borislav Petkov, Dave Hansen

[-- Attachment #0: x86-dumpstack--Handle_stack_overflow_on_all_stacks.patch --]
[-- Type: text/plain, Size: 2336 bytes --]

From: Andy Lutomirski <luto@kernel.org>

We currently special-case stack overflow on the task stack.  We're
going to start putting special stacks in the fixmap with a custom
layout, so they'll have guard pages, too.  Teach the unwinder to be
able to unwind an overflow of any of the stacks.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lkml.kernel.org/r/5454bb325cb30a70457a47b50f22317be65eba7d.1511497875.git.luto@kernel.org

---
 arch/x86/kernel/dumpstack.c |   24 ++++++++++++++----------
 1 file changed, 14 insertions(+), 10 deletions(-)

--- a/arch/x86/kernel/dumpstack.c
+++ b/arch/x86/kernel/dumpstack.c
@@ -112,24 +112,28 @@ void show_trace_log_lvl(struct task_stru
 	 * - task stack
 	 * - interrupt stack
 	 * - HW exception stacks (double fault, nmi, debug, mce)
+	 * - SYSENTER stack
 	 *
-	 * x86-32 can have up to three stacks:
+	 * x86-32 can have up to four stacks:
 	 * - task stack
 	 * - softirq stack
 	 * - hardirq stack
+	 * - SYSENTER stack
 	 */
 	for (regs = NULL; stack; stack = PTR_ALIGN(stack_info.next_sp, sizeof(long))) {
 		const char *stack_name;
 
-		/*
-		 * If we overflowed the task stack into a guard page, jump back
-		 * to the bottom of the usable stack.
-		 */
-		if (task_stack_page(task) - (void *)stack < PAGE_SIZE)
-			stack = task_stack_page(task);
-
-		if (get_stack_info(stack, task, &stack_info, &visit_mask))
-			break;
+		if (get_stack_info(stack, task, &stack_info, &visit_mask)) {
+			/*
+			 * We weren't on a valid stack.  It's possible that
+			 * we overflowed a valid stack into a guard page.
+			 * See if the next page up is valid so that we can
+			 * generate some kind of backtrace if this happens.
+			 */
+			stack = (unsigned long *)PAGE_ALIGN((unsigned long)stack);
+			if (get_stack_info(stack, task, &stack_info, &visit_mask))
+				break;
+		}
 
 		stack_name = stack_type_name(stack_info.type);
 		if (stack_name)

^ permalink raw reply	[flat|nested] 118+ messages in thread

* [patch 13/60] x86/entry: Move SYSENTER_stack to the beginning of struct tss_struct
  2017-12-04 14:07 [patch 00/60] x86/kpti: Kernel Page Table Isolation (was KAISER) Thomas Gleixner
                   ` (11 preceding siblings ...)
  2017-12-04 14:07 ` [patch 12/60] x86/dumpstack: Handle stack overflow on all stacks Thomas Gleixner
@ 2017-12-04 14:07 ` Thomas Gleixner
  2017-12-04 14:07 ` [patch 14/60] x86/entry: Remap the TSS into the CPU entry area Thomas Gleixner
                   ` (49 subsequent siblings)
  62 siblings, 0 replies; 118+ messages in thread
From: Thomas Gleixner @ 2017-12-04 14:07 UTC (permalink / raw)
  To: LKML
  Cc: x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, keescook, hughd,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Rik van Riel,
	Boris Ostrovsky, Juergen Gross, David Laight, Eduardo Valentin,
	aliguori, Will Deacon, daniel.gruss, Ingo Molnar,
	Borislav Petkov, Dave Hansen

[-- Attachment #0: x86-entry--Move_SYSENTER_stack_to_the_beginning_of_struct_tss_struct.patch --]
[-- Type: text/plain, Size: 3444 bytes --]

From: Andy Lutomirski <luto@kernel.org>

SYSENTER_stack should have reliable overflow detection, which
means that it needs to be at the bottom of a page, not the top.
Move it to the beginning of struct tss_struct and page-align it.

Also add an assertion to make sure that the fixed hardware TSS
doesn't cross a page boundary.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lkml.kernel.org/r/8de9901e7c3a6aa8fac95b37b9c7b96f1900f11a.1511497875.git.luto@kernel.org

---
 arch/x86/include/asm/processor.h |   21 ++++++++++++---------
 arch/x86/kernel/cpu/common.c     |   21 +++++++++++++++++++++
 2 files changed, 33 insertions(+), 9 deletions(-)

--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -333,7 +333,16 @@ struct x86_hw_tss {
 
 struct tss_struct {
 	/*
-	 * The hardware state:
+	 * Space for the temporary SYSENTER stack, used for SYSENTER
+	 * and the entry trampoline as well.
+	 */
+	unsigned long		SYSENTER_stack_canary;
+	unsigned long		SYSENTER_stack[64];
+
+	/*
+	 * The fixed hardware portion.  This must not cross a page boundary
+	 * at risk of violating the SDM's advice and potentially triggering
+	 * errata.
 	 */
 	struct x86_hw_tss	x86_tss;
 
@@ -344,15 +353,9 @@ struct tss_struct {
 	 * be within the limit.
 	 */
 	unsigned long		io_bitmap[IO_BITMAP_LONGS + 1];
+} __aligned(PAGE_SIZE);
 
-	/*
-	 * Space for the temporary SYSENTER stack.
-	 */
-	unsigned long		SYSENTER_stack_canary;
-	unsigned long		SYSENTER_stack[64];
-} ____cacheline_aligned;
-
-DECLARE_PER_CPU_SHARED_ALIGNED(struct tss_struct, cpu_tss);
+DECLARE_PER_CPU_PAGE_ALIGNED(struct tss_struct, cpu_tss);
 
 /*
  * sizeof(unsigned long) coming from an extra "long" at the end
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -511,6 +511,27 @@ static inline void setup_cpu_entry_area(
 #endif
 
 	__set_fixmap(get_cpu_entry_area_index(cpu, gdt), get_cpu_gdt_paddr(cpu), gdt_prot);
+
+	/*
+	 * The Intel SDM says (Volume 3, 7.2.1):
+	 *
+	 *  Avoid placing a page boundary in the part of the TSS that the
+	 *  processor reads during a task switch (the first 104 bytes). The
+	 *  processor may not correctly perform address translations if a
+	 *  boundary occurs in this area. During a task switch, the processor
+	 *  reads and writes into the first 104 bytes of each TSS (using
+	 *  contiguous physical addresses beginning with the physical address
+	 *  of the first byte of the TSS). So, after TSS access begins, if
+	 *  part of the 104 bytes is not physically contiguous, the processor
+	 *  will access incorrect information without generating a page-fault
+	 *  exception.
+	 *
+	 * There are also a lot of errata involving the TSS spanning a page
+	 * boundary.  Assert that we're not doing that.
+	 */
+	BUILD_BUG_ON((offsetof(struct tss_struct, x86_tss) ^
+		      offsetofend(struct tss_struct, x86_tss)) & PAGE_MASK);
+
 }
 
 /* Load the original GDT from the per-cpu structure */

^ permalink raw reply	[flat|nested] 118+ messages in thread

* [patch 14/60] x86/entry: Remap the TSS into the CPU entry area
  2017-12-04 14:07 [patch 00/60] x86/kpti: Kernel Page Table Isolation (was KAISER) Thomas Gleixner
                   ` (12 preceding siblings ...)
  2017-12-04 14:07 ` [patch 13/60] x86/entry: Move SYSENTER_stack to the beginning of struct tss_struct Thomas Gleixner
@ 2017-12-04 14:07 ` Thomas Gleixner
  2017-12-04 18:20   ` Borislav Petkov
  2017-12-04 14:07 ` [patch 15/60] x86/entry/64: Separate cpu_current_top_of_stack from TSS.sp0 Thomas Gleixner
                   ` (48 subsequent siblings)
  62 siblings, 1 reply; 118+ messages in thread
From: Thomas Gleixner @ 2017-12-04 14:07 UTC (permalink / raw)
  To: LKML
  Cc: x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, keescook, hughd,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Rik van Riel,
	Boris Ostrovsky, Juergen Gross, David Laight, Eduardo Valentin,
	aliguori, Will Deacon, daniel.gruss, Ingo Molnar, Dave Hansen

[-- Attachment #0: x86-entry--Remap_the_TSS_into_the_CPU_entry_area.patch --]
[-- Type: text/plain, Size: 8124 bytes --]

From: Andy Lutomirski <luto@kernel.org>

This has a secondary purpose: it puts the entry stack into a region
with a well-controlled layout.  A subsequent patch will take
advantage of this to streamline the SYSCALL entry code to be able to
find it more easily.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lkml.kernel.org/r/cdcba7e1e82122461b3ca36bb3ef6713ba605e35.1511497875.git.luto@kernel.org

---
 arch/x86/entry/entry_32.S     |    6 ++++--
 arch/x86/include/asm/fixmap.h |    7 +++++++
 arch/x86/kernel/asm-offsets.c |    3 +++
 arch/x86/kernel/cpu/common.c  |   41 +++++++++++++++++++++++++++++++++++------
 arch/x86/kernel/dumpstack.c   |    3 ++-
 arch/x86/kvm/vmx.c            |    2 +-
 arch/x86/power/cpu.c          |   11 ++++++-----
 7 files changed, 58 insertions(+), 15 deletions(-)

--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -941,7 +941,8 @@ ENTRY(debug)
 	movl	%esp, %eax			# pt_regs pointer
 
 	/* Are we currently on the SYSENTER stack? */
-	PER_CPU(cpu_tss + CPU_TSS_SYSENTER_stack + SIZEOF_SYSENTER_stack, %ecx)
+	movl	PER_CPU_VAR(cpu_entry_area), %ecx
+	addl	$CPU_ENTRY_AREA_tss + CPU_TSS_SYSENTER_stack + SIZEOF_SYSENTER_stack, %ecx
 	subl	%eax, %ecx	/* ecx = (end of SYSENTER_stack) - esp */
 	cmpl	$SIZEOF_SYSENTER_stack, %ecx
 	jb	.Ldebug_from_sysenter_stack
@@ -984,7 +985,8 @@ ENTRY(nmi)
 	movl	%esp, %eax			# pt_regs pointer
 
 	/* Are we currently on the SYSENTER stack? */
-	PER_CPU(cpu_tss + CPU_TSS_SYSENTER_stack + SIZEOF_SYSENTER_stack, %ecx)
+	movl	PER_CPU_VAR(cpu_entry_area), %ecx
+	addl	$CPU_ENTRY_AREA_tss + CPU_TSS_SYSENTER_stack + SIZEOF_SYSENTER_stack, %ecx
 	subl	%eax, %ecx	/* ecx = (end of SYSENTER_stack) - esp */
 	cmpl	$SIZEOF_SYSENTER_stack, %ecx
 	jb	.Lnmi_from_sysenter_stack
--- a/arch/x86/include/asm/fixmap.h
+++ b/arch/x86/include/asm/fixmap.h
@@ -54,6 +54,13 @@ extern unsigned long __FIXADDR_TOP;
  */
 struct cpu_entry_area {
 	char gdt[PAGE_SIZE];
+
+	/*
+	 * The GDT is just below cpu_tss and thus serves (on x86_64) as a
+	 * a read-only guard page for the SYSENTER stack at the bottom
+	 * of the TSS region.
+	 */
+	struct tss_struct tss;
 };
 
 #define CPU_ENTRY_AREA_PAGES (sizeof(struct cpu_entry_area) / PAGE_SIZE)
--- a/arch/x86/kernel/asm-offsets.c
+++ b/arch/x86/kernel/asm-offsets.c
@@ -98,4 +98,7 @@ void common(void) {
 	OFFSET(CPU_TSS_SYSENTER_stack, tss_struct, SYSENTER_stack);
 	/* Size of SYSENTER_stack */
 	DEFINE(SIZEOF_SYSENTER_stack, sizeof(((struct tss_struct *)0)->SYSENTER_stack));
+
+	/* Layout info for cpu_entry_area */
+	OFFSET(CPU_ENTRY_AREA_tss, cpu_entry_area, tss);
 }
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -490,6 +490,22 @@ void load_percpu_segment(int cpu)
 	load_stack_canary_segment();
 }
 
+static void set_percpu_fixmap_pages(int fixmap_index, void *ptr,
+				    int pages, pgprot_t prot)
+{
+	int i;
+
+	for (i = 0; i < pages; i++) {
+		__set_fixmap(fixmap_index - i,
+			     per_cpu_ptr_to_phys(ptr + i * PAGE_SIZE), prot);
+	}
+}
+
+#ifdef CONFIG_X86_32
+/* The 32-bit entry code needs to find cpu_entry_area. */
+DEFINE_PER_CPU(struct cpu_entry_area *, cpu_entry_area);
+#endif
+
 /* Setup the fixmap mappings only once per-processor */
 static inline void setup_cpu_entry_area(int cpu)
 {
@@ -531,7 +547,15 @@ static inline void setup_cpu_entry_area(
 	 */
 	BUILD_BUG_ON((offsetof(struct tss_struct, x86_tss) ^
 		      offsetofend(struct tss_struct, x86_tss)) & PAGE_MASK);
+	BUILD_BUG_ON(sizeof(struct tss_struct) % PAGE_SIZE != 0);
+	set_percpu_fixmap_pages(get_cpu_entry_area_index(cpu, tss),
+				&per_cpu(cpu_tss, cpu),
+				sizeof(struct tss_struct) / PAGE_SIZE,
+				PAGE_KERNEL);
 
+#ifdef CONFIG_X86_32
+	this_cpu_write(cpu_entry_area, get_cpu_entry_area(cpu));
+#endif
 }
 
 /* Load the original GDT from the per-cpu structure */
@@ -1282,7 +1306,8 @@ void enable_sep_cpu(void)
 	wrmsr(MSR_IA32_SYSENTER_CS, tss->x86_tss.ss1, 0);
 
 	wrmsr(MSR_IA32_SYSENTER_ESP,
-	      (unsigned long)tss + offsetofend(struct tss_struct, SYSENTER_stack),
+	      (unsigned long)&get_cpu_entry_area(cpu)->tss +
+	      offsetofend(struct tss_struct, SYSENTER_stack),
 	      0);
 
 	wrmsr(MSR_IA32_SYSENTER_EIP, (unsigned long)entry_SYSENTER_32, 0);
@@ -1395,6 +1420,8 @@ static DEFINE_PER_CPU_PAGE_ALIGNED(char,
 /* May not be marked __init: used by software suspend */
 void syscall_init(void)
 {
+	int cpu = smp_processor_id();
+
 	wrmsr(MSR_STAR, 0, (__USER32_CS << 16) | __KERNEL_CS);
 	wrmsrl(MSR_LSTAR, (unsigned long)entry_SYSCALL_64);
 
@@ -1408,7 +1435,7 @@ void syscall_init(void)
 	 */
 	wrmsrl_safe(MSR_IA32_SYSENTER_CS, (u64)__KERNEL_CS);
 	wrmsrl_safe(MSR_IA32_SYSENTER_ESP,
-		    (unsigned long)this_cpu_ptr(&cpu_tss) +
+		    (unsigned long)&get_cpu_entry_area(cpu)->tss +
 		    offsetofend(struct tss_struct, SYSENTER_stack));
 	wrmsrl_safe(MSR_IA32_SYSENTER_EIP, (u64)entry_SYSENTER_compat);
 #else
@@ -1618,11 +1645,13 @@ void cpu_init(void)
 	initialize_tlbstate_and_flush();
 	enter_lazy_tlb(&init_mm, me);
 
+	setup_cpu_entry_area(cpu);
+
 	/*
 	 * Initialize the TSS.  Don't bother initializing sp0, as the initial
 	 * task never enters user mode.
 	 */
-	set_tss_desc(cpu, &t->x86_tss);
+	set_tss_desc(cpu, &get_cpu_entry_area(cpu)->tss.x86_tss);
 	load_TR_desc();
 
 	load_mm_ldt(&init_mm);
@@ -1635,7 +1664,6 @@ void cpu_init(void)
 	if (is_uv_system())
 		uv_cpu_init();
 
-	setup_cpu_entry_area(cpu);
 	load_fixmap_gdt(cpu);
 }
 
@@ -1676,11 +1704,13 @@ void cpu_init(void)
 	initialize_tlbstate_and_flush();
 	enter_lazy_tlb(&init_mm, curr);
 
+	setup_cpu_entry_area(cpu);
+
 	/*
 	 * Initialize the TSS.  Don't bother initializing sp0, as the initial
 	 * task never enters user mode.
 	 */
-	set_tss_desc(cpu, &t->x86_tss);
+	set_tss_desc(cpu, &get_cpu_entry_area(cpu)->tss.x86_tss);
 	load_TR_desc();
 
 	load_mm_ldt(&init_mm);
@@ -1697,7 +1727,6 @@ void cpu_init(void)
 
 	fpu__init_cpu();
 
-	setup_cpu_entry_area(cpu);
 	load_fixmap_gdt(cpu);
 }
 #endif
--- a/arch/x86/kernel/dumpstack.c
+++ b/arch/x86/kernel/dumpstack.c
@@ -45,7 +45,8 @@ bool in_task_stack(unsigned long *stack,
 
 bool in_sysenter_stack(unsigned long *stack, struct stack_info *info)
 {
-	struct tss_struct *tss = this_cpu_ptr(&cpu_tss);
+	int cpu = smp_processor_id();
+	struct tss_struct *tss = &get_cpu_entry_area(cpu)->tss;
 
 	/* Treat the canary as part of the stack for unwinding purposes. */
 	void *begin = &tss->SYSENTER_stack_canary;
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -2291,7 +2291,7 @@ static void vmx_vcpu_load(struct kvm_vcp
 		 * processors.  See 22.2.4.
 		 */
 		vmcs_writel(HOST_TR_BASE,
-			    (unsigned long)this_cpu_ptr(&cpu_tss.x86_tss));
+			    (unsigned long)&get_cpu_entry_area(cpu)->tss.x86_tss);
 		vmcs_writel(HOST_GDTR_BASE, (unsigned long)gdt);   /* 22.2.4 */
 
 		/*
--- a/arch/x86/power/cpu.c
+++ b/arch/x86/power/cpu.c
@@ -160,18 +160,19 @@ static void do_fpu_end(void)
 static void fix_processor_context(void)
 {
 	int cpu = smp_processor_id();
-	struct tss_struct *t = &per_cpu(cpu_tss, cpu);
 #ifdef CONFIG_X86_64
 	struct desc_struct *desc = get_cpu_gdt_rw(cpu);
 	tss_desc tss;
 #endif
 
 	/*
-	 * This just modifies memory; should not be necessary. But... This is
-	 * necessary, because 386 hardware has concept of busy TSS or some
-	 * similar stupidity.
+	 * We need to reload TR, which requires that we change the
+	 * GDT entry to indicate "available" first.
+	 *
+	 * XXX: This could probably all be replaced by a call to
+	 * force_reload_TR().
 	 */
-	set_tss_desc(cpu, &t->x86_tss);
+	set_tss_desc(cpu, &get_cpu_entry_area(cpu)->tss.x86_tss);
 
 #ifdef CONFIG_X86_64
 	memcpy(&tss, &desc[GDT_ENTRY_TSS], sizeof(tss_desc));

^ permalink raw reply	[flat|nested] 118+ messages in thread

* [patch 15/60] x86/entry/64: Separate cpu_current_top_of_stack from TSS.sp0
  2017-12-04 14:07 [patch 00/60] x86/kpti: Kernel Page Table Isolation (was KAISER) Thomas Gleixner
                   ` (13 preceding siblings ...)
  2017-12-04 14:07 ` [patch 14/60] x86/entry: Remap the TSS into the CPU entry area Thomas Gleixner
@ 2017-12-04 14:07 ` Thomas Gleixner
  2017-12-04 14:07 ` [patch 16/60] x86/espfix/64: Stop assuming that pt_regs is on the entry stack Thomas Gleixner
                   ` (47 subsequent siblings)
  62 siblings, 0 replies; 118+ messages in thread
From: Thomas Gleixner @ 2017-12-04 14:07 UTC (permalink / raw)
  To: LKML
  Cc: x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, keescook, hughd,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Rik van Riel,
	Boris Ostrovsky, Juergen Gross, David Laight, Eduardo Valentin,
	aliguori, Will Deacon, daniel.gruss, Ingo Molnar,
	Borislav Petkov, Dave Hansen

[-- Attachment #0: x86-entry-64--Separate_cpu_current_top_of_stack_from_TSS.sp0.patch --]
[-- Type: text/plain, Size: 3812 bytes --]

From: Andy Lutomirski <luto@kernel.org>

On 64-bit kernels, we used to assume that TSS.sp0 was the current
top of stack.  With the addition of an entry trampoline, this will
no longer be the case.  Store the current top of stack in TSS.sp1,
which is otherwise unused but shares the same cacheline.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lkml.kernel.org/r/f56634c746a2926eb7bae61e7b80ed51a1940769.1511497875.git.luto@kernel.org

---
 arch/x86/include/asm/processor.h   |   18 +++++++++++++-----
 arch/x86/include/asm/thread_info.h |    2 +-
 arch/x86/kernel/asm-offsets_64.c   |    1 +
 arch/x86/kernel/process.c          |   10 ++++++++++
 arch/x86/kernel/process_64.c       |    1 +
 5 files changed, 26 insertions(+), 6 deletions(-)

--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -310,7 +310,13 @@ struct x86_hw_tss {
 struct x86_hw_tss {
 	u32			reserved1;
 	u64			sp0;
+
+	/*
+	 * We store cpu_current_top_of_stack in sp1 so it's always accessible.
+	 * Linux does not use ring 1, so sp1 is not otherwise needed.
+	 */
 	u64			sp1;
+
 	u64			sp2;
 	u64			reserved2;
 	u64			ist[7];
@@ -369,6 +375,8 @@ DECLARE_PER_CPU_PAGE_ALIGNED(struct tss_
 
 #ifdef CONFIG_X86_32
 DECLARE_PER_CPU(unsigned long, cpu_current_top_of_stack);
+#else
+#define cpu_current_top_of_stack cpu_tss.x86_tss.sp1
 #endif
 
 /*
@@ -540,12 +548,12 @@ static inline void native_swapgs(void)
 
 static inline unsigned long current_top_of_stack(void)
 {
-#ifdef CONFIG_X86_64
-	return this_cpu_read_stable(cpu_tss.x86_tss.sp0);
-#else
-	/* sp0 on x86_32 is special in and around vm86 mode. */
+	/*
+	 *  We can't read directly from tss.sp0: sp0 on x86_32 is special in
+	 *  and around vm86 mode and sp0 on x86_64 is special because of the
+	 *  entry trampoline.
+	 */
 	return this_cpu_read_stable(cpu_current_top_of_stack);
-#endif
 }
 
 static inline bool on_thread_stack(void)
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -207,7 +207,7 @@ static inline int arch_within_stack_fram
 #else /* !__ASSEMBLY__ */
 
 #ifdef CONFIG_X86_64
-# define cpu_current_top_of_stack (cpu_tss + TSS_sp0)
+# define cpu_current_top_of_stack (cpu_tss + TSS_sp1)
 #endif
 
 #endif
--- a/arch/x86/kernel/asm-offsets_64.c
+++ b/arch/x86/kernel/asm-offsets_64.c
@@ -66,6 +66,7 @@ int main(void)
 
 	OFFSET(TSS_ist, tss_struct, x86_tss.ist);
 	OFFSET(TSS_sp0, tss_struct, x86_tss.sp0);
+	OFFSET(TSS_sp1, tss_struct, x86_tss.sp1);
 	BLANK();
 
 #ifdef CONFIG_CC_STACKPROTECTOR
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -56,6 +56,16 @@
 		 * Poison it.
 		 */
 		.sp0 = (1UL << (BITS_PER_LONG-1)) + 1,
+
+#ifdef CONFIG_X86_64
+		/*
+		 * .sp1 is cpu_current_top_of_stack.  The init task never
+		 * runs user code, but cpu_current_top_of_stack should still
+		 * be well defined before the first context switch.
+		 */
+		.sp1 = TOP_OF_INIT_STACK,
+#endif
+
 #ifdef CONFIG_X86_32
 		.ss0 = __KERNEL_DS,
 		.ss1 = __KERNEL_CS,
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -461,6 +461,7 @@ void compat_start_thread(struct pt_regs
 	 * Switch the PDA and FPU contexts.
 	 */
 	this_cpu_write(current_task, next_p);
+	this_cpu_write(cpu_current_top_of_stack, task_top_of_stack(next_p));
 
 	/* Reload sp0. */
 	update_sp0(next_p);

^ permalink raw reply	[flat|nested] 118+ messages in thread

* [patch 16/60] x86/espfix/64: Stop assuming that pt_regs is on the entry stack
  2017-12-04 14:07 [patch 00/60] x86/kpti: Kernel Page Table Isolation (was KAISER) Thomas Gleixner
                   ` (14 preceding siblings ...)
  2017-12-04 14:07 ` [patch 15/60] x86/entry/64: Separate cpu_current_top_of_stack from TSS.sp0 Thomas Gleixner
@ 2017-12-04 14:07 ` Thomas Gleixner
  2017-12-04 14:07 ` [patch 17/60] x86/entry/64: Use a per-CPU trampoline stack for IDT entries Thomas Gleixner
                   ` (46 subsequent siblings)
  62 siblings, 0 replies; 118+ messages in thread
From: Thomas Gleixner @ 2017-12-04 14:07 UTC (permalink / raw)
  To: LKML
  Cc: x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, keescook, hughd,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Rik van Riel,
	Boris Ostrovsky, Juergen Gross, David Laight, Eduardo Valentin,
	aliguori, Will Deacon, daniel.gruss, Ingo Molnar,
	Borislav Petkov, Dave Hansen

[-- Attachment #0: x86-espfix-64--Stop_assuming_that_pt_regs_is_on_the_entry_stack.patch --]
[-- Type: text/plain, Size: 3911 bytes --]

From: Andy Lutomirski <luto@kernel.org>

When we start using an entry trampoline, a #GP from userspace will
be delivered on the entry stack, not on the task stack.  Fix the
espfix64 #DF fixup to set up #GP according to TSS.SP0, rather than
assuming that pt_regs + 1 == SP0.  This won't change anything
without an entry stack, but it will make the code continue to work
when an entry stack is added.

While we're at it, improve the comments to explain what's actually
going on.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lkml.kernel.org/r/b1ef4136616c6bd2a75d1fd2736d1d54437d65a8.1511497875.git.luto@kernel.org

---
 arch/x86/kernel/traps.c |   37 ++++++++++++++++++++++++++++---------
 1 file changed, 28 insertions(+), 9 deletions(-)

--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -349,9 +349,15 @@ dotraplinkage void do_double_fault(struc
 
 	/*
 	 * If IRET takes a non-IST fault on the espfix64 stack, then we
-	 * end up promoting it to a doublefault.  In that case, modify
-	 * the stack to make it look like we just entered the #GP
-	 * handler from user space, similar to bad_iret.
+	 * end up promoting it to a doublefault.  In that case, take
+	 * advantage of the fact that we're not using the normal (TSS.sp0)
+	 * stack right now.  We can write a fake #GP(0) frame at TSS.sp0
+	 * and then modify our own IRET frame so that, when we return,
+	 * we land directly at the #GP(0) vector with the stack already
+	 * set up according to its expectations.
+	 *
+	 * The net result is that our #GP handler will think that we
+	 * entered from usermode with the bad user context.
 	 *
 	 * No need for ist_enter here because we don't use RCU.
 	 */
@@ -359,13 +365,26 @@ dotraplinkage void do_double_fault(struc
 		regs->cs == __KERNEL_CS &&
 		regs->ip == (unsigned long)native_irq_return_iret)
 	{
-		struct pt_regs *normal_regs = task_pt_regs(current);
+		struct pt_regs *gpregs = (struct pt_regs *)this_cpu_read(cpu_tss.x86_tss.sp0) - 1;
 
-		/* Fake a #GP(0) from userspace. */
-		memmove(&normal_regs->ip, (void *)regs->sp, 5*8);
-		normal_regs->orig_ax = 0;  /* Missing (lost) #GP error code */
+		/*
+		 * regs->sp points to the failing IRET frame on the
+		 * ESPFIX64 stack.  Copy it to the entry stack.  This fills
+		 * in gpregs->ss through gpregs->ip.
+		 *
+		 */
+		memmove(&gpregs->ip, (void *)regs->sp, 5*8);
+		gpregs->orig_ax = 0;  /* Missing (lost) #GP error code */
+
+		/*
+		 * Adjust our frame so that we return straight to the #GP
+		 * vector with the expected RSP value.  This is safe because
+		 * we won't enable interupts or schedule before we invoke
+		 * general_protection, so nothing will clobber the stack
+		 * frame we just set up.
+		 */
 		regs->ip = (unsigned long)general_protection;
-		regs->sp = (unsigned long)&normal_regs->orig_ax;
+		regs->sp = (unsigned long)&gpregs->orig_ax;
 
 		return;
 	}
@@ -390,7 +409,7 @@ dotraplinkage void do_double_fault(struc
 	 *
 	 *   Processors update CR2 whenever a page fault is detected. If a
 	 *   second page fault occurs while an earlier page fault is being
-	 *   deliv- ered, the faulting linear address of the second fault will
+	 *   delivered, the faulting linear address of the second fault will
 	 *   overwrite the contents of CR2 (replacing the previous
 	 *   address). These updates to CR2 occur even if the page fault
 	 *   results in a double fault or occurs during the delivery of a

^ permalink raw reply	[flat|nested] 118+ messages in thread

* [patch 17/60] x86/entry/64: Use a per-CPU trampoline stack for IDT entries
  2017-12-04 14:07 [patch 00/60] x86/kpti: Kernel Page Table Isolation (was KAISER) Thomas Gleixner
                   ` (15 preceding siblings ...)
  2017-12-04 14:07 ` [patch 16/60] x86/espfix/64: Stop assuming that pt_regs is on the entry stack Thomas Gleixner
@ 2017-12-04 14:07 ` Thomas Gleixner
  2017-12-04 14:07 ` [patch 18/60] x86/entry/64: Return to userspace from the trampoline stack Thomas Gleixner
                   ` (45 subsequent siblings)
  62 siblings, 0 replies; 118+ messages in thread
From: Thomas Gleixner @ 2017-12-04 14:07 UTC (permalink / raw)
  To: LKML
  Cc: x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, keescook, hughd,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Rik van Riel,
	Boris Ostrovsky, Juergen Gross, David Laight, Eduardo Valentin,
	aliguori, Will Deacon, daniel.gruss, Ingo Molnar,
	Borislav Petkov, Dave Hansen

[-- Attachment #0: x86-entry-64--Use_a_per-CPU_trampoline_stack_for_IDT_entries.patch --]
[-- Type: text/plain, Size: 7743 bytes --]

From: Andy Lutomirski <luto@kernel.org>

Historically, IDT entries from usermode have always gone directly
to the running task's kernel stack.  Rearrange it so that we enter on
a per-CPU trampoline stack and then manually switch to the task's stack.
This touches a couple of extra cachelines, but it gives us a chance
to run some code before we touch the kernel stack.

The asm isn't exactly beautiful, but I think that fully refactoring
it can wait.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lkml.kernel.org/r/fa3958723a1a85baeaf309c735b775841205800e.1511497875.git.luto@kernel.org

---
 arch/x86/entry/entry_64.S        |   67 +++++++++++++++++++++++++++++----------
 arch/x86/entry/entry_64_compat.S |    5 ++
 arch/x86/include/asm/switch_to.h |    3 -
 arch/x86/include/asm/traps.h     |    1 
 arch/x86/kernel/cpu/common.c     |    6 ++-
 arch/x86/kernel/traps.c          |   12 +++---
 6 files changed, 65 insertions(+), 29 deletions(-)

--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -564,6 +564,13 @@ END(irq_entries_start)
 /* 0(%rsp): ~(interrupt number) */
 	.macro interrupt func
 	cld
+
+	testb	$3, CS-ORIG_RAX(%rsp)
+	jz	1f
+	SWAPGS
+	call	switch_to_thread_stack
+1:
+
 	ALLOC_PT_GPREGS_ON_STACK
 	SAVE_C_REGS
 	SAVE_EXTRA_REGS
@@ -573,12 +580,8 @@ END(irq_entries_start)
 	jz	1f
 
 	/*
-	 * IRQ from user mode.  Switch to kernel gsbase and inform context
-	 * tracking that we're in kernel mode.
-	 */
-	SWAPGS
-
-	/*
+	 * IRQ from user mode.
+	 *
 	 * We need to tell lockdep that IRQs are off.  We can't do this until
 	 * we fix gsbase, and we should do it before enter_from_user_mode
 	 * (which can take locks).  Since TRACE_IRQS_OFF idempotent,
@@ -832,6 +835,32 @@ apicinterrupt IRQ_WORK_VECTOR			irq_work
  */
 #define CPU_TSS_IST(x) PER_CPU_VAR(cpu_tss) + (TSS_ist + ((x) - 1) * 8)
 
+/*
+ * Switch to the thread stack.  This is called with the IRET frame and
+ * orig_ax on the stack.  (That is, RDI..R12 are not on the stack and
+ * space has not been allocated for them.)
+ */
+ENTRY(switch_to_thread_stack)
+	UNWIND_HINT_FUNC
+
+	pushq	%rdi
+	movq	%rsp, %rdi
+	movq	PER_CPU_VAR(cpu_current_top_of_stack), %rsp
+	UNWIND_HINT sp_offset=16 sp_reg=ORC_REG_DI
+
+	pushq	7*8(%rdi)		/* regs->ss */
+	pushq	6*8(%rdi)		/* regs->rsp */
+	pushq	5*8(%rdi)		/* regs->eflags */
+	pushq	4*8(%rdi)		/* regs->cs */
+	pushq	3*8(%rdi)		/* regs->ip */
+	pushq	2*8(%rdi)		/* regs->orig_ax */
+	pushq	8(%rdi)			/* return address */
+	UNWIND_HINT_FUNC
+
+	movq	(%rdi), %rdi
+	ret
+END(switch_to_thread_stack)
+
 .macro idtentry sym do_sym has_error_code:req paranoid=0 shift_ist=-1
 ENTRY(\sym)
 	UNWIND_HINT_IRET_REGS offset=\has_error_code*8
@@ -849,11 +878,12 @@ ENTRY(\sym)
 
 	ALLOC_PT_GPREGS_ON_STACK
 
-	.if \paranoid
-	.if \paranoid == 1
+	.if \paranoid < 2
 	testb	$3, CS(%rsp)			/* If coming from userspace, switch stacks */
-	jnz	1f
+	jnz	.Lfrom_usermode_switch_stack_\@
 	.endif
+
+	.if \paranoid
 	call	paranoid_entry
 	.else
 	call	error_entry
@@ -895,20 +925,15 @@ ENTRY(\sym)
 	jmp	error_exit
 	.endif
 
-	.if \paranoid == 1
+	.if \paranoid < 2
 	/*
-	 * Paranoid entry from userspace.  Switch stacks and treat it
+	 * Entry from userspace.  Switch stacks and treat it
 	 * as a normal entry.  This means that paranoid handlers
 	 * run in real process context if user_mode(regs).
 	 */
-1:
+.Lfrom_usermode_switch_stack_\@:
 	call	error_entry
 
-
-	movq	%rsp, %rdi			/* pt_regs pointer */
-	call	sync_regs
-	movq	%rax, %rsp			/* switch stack */
-
 	movq	%rsp, %rdi			/* pt_regs pointer */
 
 	.if \has_error_code
@@ -1171,6 +1196,14 @@ ENTRY(error_entry)
 	SWAPGS
 
 .Lerror_entry_from_usermode_after_swapgs:
+	/* Put us onto the real thread stack. */
+	popq	%r12				/* save return addr in %12 */
+	movq	%rsp, %rdi			/* arg0 = pt_regs pointer */
+	call	sync_regs
+	movq	%rax, %rsp			/* switch stack */
+	ENCODE_FRAME_POINTER
+	pushq	%r12
+
 	/*
 	 * We need to tell lockdep that IRQs are off.  We can't do this until
 	 * we fix gsbase, and we should do it before enter_from_user_mode
--- a/arch/x86/entry/entry_64_compat.S
+++ b/arch/x86/entry/entry_64_compat.S
@@ -306,8 +306,11 @@ ENTRY(entry_INT80_compat)
 	 */
 	movl	%eax, %eax
 
-	/* Construct struct pt_regs on stack (iret frame is already on stack) */
 	pushq	%rax			/* pt_regs->orig_ax */
+
+	/* switch to thread stack expects orig_ax to be pushed */
+	call	switch_to_thread_stack
+
 	pushq	%rdi			/* pt_regs->di */
 	pushq	%rsi			/* pt_regs->si */
 	pushq	%rdx			/* pt_regs->dx */
--- a/arch/x86/include/asm/switch_to.h
+++ b/arch/x86/include/asm/switch_to.h
@@ -90,10 +90,9 @@ static inline void refresh_sysenter_cs(s
 /* This is used when switching tasks or entering/exiting vm86 mode. */
 static inline void update_sp0(struct task_struct *task)
 {
+	/* On x86_64, sp0 always points to the entry trampoline stack, which is constant: */
 #ifdef CONFIG_X86_32
 	load_sp0(task->thread.sp0);
-#else
-	load_sp0(task_top_of_stack(task));
 #endif
 }
 
--- a/arch/x86/include/asm/traps.h
+++ b/arch/x86/include/asm/traps.h
@@ -75,7 +75,6 @@ dotraplinkage void do_segment_not_presen
 dotraplinkage void do_stack_segment(struct pt_regs *, long);
 #ifdef CONFIG_X86_64
 dotraplinkage void do_double_fault(struct pt_regs *, long);
-asmlinkage struct pt_regs *sync_regs(struct pt_regs *);
 #endif
 dotraplinkage void do_general_protection(struct pt_regs *, long);
 dotraplinkage void do_page_fault(struct pt_regs *, unsigned long);
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -1648,11 +1648,13 @@ void cpu_init(void)
 	setup_cpu_entry_area(cpu);
 
 	/*
-	 * Initialize the TSS.  Don't bother initializing sp0, as the initial
-	 * task never enters user mode.
+	 * Initialize the TSS.  sp0 points to the entry trampoline stack
+	 * regardless of what task is running.
 	 */
 	set_tss_desc(cpu, &get_cpu_entry_area(cpu)->tss.x86_tss);
 	load_TR_desc();
+	load_sp0((unsigned long)&get_cpu_entry_area(cpu)->tss +
+		 offsetofend(struct tss_struct, SYSENTER_stack));
 
 	load_mm_ldt(&init_mm);
 
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -631,7 +631,7 @@ NOKPROBE_SYMBOL(do_int3);
  */
 asmlinkage __visible notrace struct pt_regs *sync_regs(struct pt_regs *eregs)
 {
-	struct pt_regs *regs = task_pt_regs(current);
+	struct pt_regs *regs = (struct pt_regs *)this_cpu_read(cpu_current_top_of_stack) - 1;
 	*regs = *eregs;
 	return regs;
 }
@@ -648,13 +648,13 @@ struct bad_iret_stack *fixup_bad_iret(st
 	/*
 	 * This is called from entry_64.S early in handling a fault
 	 * caused by a bad iret to user mode.  To handle the fault
-	 * correctly, we want move our stack frame to task_pt_regs
-	 * and we want to pretend that the exception came from the
-	 * iret target.
+	 * correctly, we want to move our stack frame to where it would
+	 * be had we entered directly on the entry stack (rather than
+	 * just below the IRET frame) and we want to pretend that the
+	 * exception came from the IRET target.
 	 */
 	struct bad_iret_stack *new_stack =
-		container_of(task_pt_regs(current),
-			     struct bad_iret_stack, regs);
+		(struct bad_iret_stack *)this_cpu_read(cpu_tss.x86_tss.sp0) - 1;
 
 	/* Copy the IRET target to the new stack. */
 	memmove(&new_stack->regs.ip, (void *)s->regs.sp, 5*8);

^ permalink raw reply	[flat|nested] 118+ messages in thread

* [patch 18/60] x86/entry/64: Return to userspace from the trampoline stack
  2017-12-04 14:07 [patch 00/60] x86/kpti: Kernel Page Table Isolation (was KAISER) Thomas Gleixner
                   ` (16 preceding siblings ...)
  2017-12-04 14:07 ` [patch 17/60] x86/entry/64: Use a per-CPU trampoline stack for IDT entries Thomas Gleixner
@ 2017-12-04 14:07 ` Thomas Gleixner
  2017-12-04 14:07 ` [patch 19/60] x86/entry/64: Create a per-CPU SYSCALL entry trampoline Thomas Gleixner
                   ` (44 subsequent siblings)
  62 siblings, 0 replies; 118+ messages in thread
From: Thomas Gleixner @ 2017-12-04 14:07 UTC (permalink / raw)
  To: LKML
  Cc: x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, keescook, hughd,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Rik van Riel,
	Boris Ostrovsky, Juergen Gross, David Laight, Eduardo Valentin,
	aliguori, Will Deacon, daniel.gruss, Ingo Molnar,
	Borislav Petkov, Dave Hansen

[-- Attachment #0: x86-entry-64--Return_to_userspace_from_the_trampoline_stack.patch --]
[-- Type: text/plain, Size: 2850 bytes --]

From: Andy Lutomirski <luto@kernel.org>

By itself, this is useless.  It gives us the ability to run some final code
before exit that cannnot run on the kernel stack.  This could include a CR3
switch a la KERNEL_PAGE_TABLE_ISOLATION or some kernel stack erasing, for
example.  (Or even weird things like *changing* which kernel stack gets
used as an ASLR-strengthening mechanism.)

The SYSRET32 path is not covered yet.  It could be in the future or
we could just ignore it and force the slow path if needed.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lkml.kernel.org/r/d350017000eed20922c3b2711a2d9229dc809256.1511497875.git.luto@kernel.org

---
 arch/x86/entry/entry_64.S |   55 ++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 51 insertions(+), 4 deletions(-)

--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -330,8 +330,24 @@ GLOBAL(entry_SYSCALL_64_after_hwframe)
 	popq	%rsi	/* skip rcx */
 	popq	%rdx
 	popq	%rsi
+
+	/*
+	 * Now all regs are restored except RSP and RDI.
+	 * Save old stack pointer and switch to trampoline stack.
+	 */
+	movq	%rsp, %rdi
+	movq	PER_CPU_VAR(cpu_tss + TSS_sp0), %rsp
+
+	pushq	RSP-RDI(%rdi)	/* RSP */
+	pushq	(%rdi)		/* RDI */
+
+	/*
+	 * We are on the trampoline stack.  All regs except RDI are live.
+	 * We can do future final exit work right here.
+	 */
+
 	popq	%rdi
-	movq	RSP-ORIG_RAX(%rsp), %rsp
+	popq	%rsp
 	USERGS_SYSRET64
 END(entry_SYSCALL_64)
 
@@ -634,10 +650,41 @@ GLOBAL(swapgs_restore_regs_and_return_to
 	ud2
 1:
 #endif
-	SWAPGS
 	POP_EXTRA_REGS
-	POP_C_REGS
-	addq	$8, %rsp	/* skip regs->orig_ax */
+	popq	%r11
+	popq	%r10
+	popq	%r9
+	popq	%r8
+	popq	%rax
+	popq	%rcx
+	popq	%rdx
+	popq	%rsi
+
+	/*
+	 * The stack is now user RDI, orig_ax, RIP, CS, EFLAGS, RSP, SS.
+	 * Save old stack pointer and switch to trampoline stack.
+	 */
+	movq	%rsp, %rdi
+	movq	PER_CPU_VAR(cpu_tss + TSS_sp0), %rsp
+
+	/* Copy the IRET frame to the trampoline stack. */
+	pushq	6*8(%rdi)	/* SS */
+	pushq	5*8(%rdi)	/* RSP */
+	pushq	4*8(%rdi)	/* EFLAGS */
+	pushq	3*8(%rdi)	/* CS */
+	pushq	2*8(%rdi)	/* RIP */
+
+	/* Push user RDI on the trampoline stack. */
+	pushq	(%rdi)
+
+	/*
+	 * We are on the trampoline stack.  All regs except RDI are live.
+	 * We can do future final exit work right here.
+	 */
+
+	/* Restore RDI. */
+	popq	%rdi
+	SWAPGS
 	INTERRUPT_RETURN
 
 

^ permalink raw reply	[flat|nested] 118+ messages in thread

* [patch 19/60] x86/entry/64: Create a per-CPU SYSCALL entry trampoline
  2017-12-04 14:07 [patch 00/60] x86/kpti: Kernel Page Table Isolation (was KAISER) Thomas Gleixner
                   ` (17 preceding siblings ...)
  2017-12-04 14:07 ` [patch 18/60] x86/entry/64: Return to userspace from the trampoline stack Thomas Gleixner
@ 2017-12-04 14:07 ` Thomas Gleixner
  2017-12-04 22:30   ` Andy Lutomirski
  2017-12-04 14:07 ` [patch 20/60] x86/entry/64: Move the IST stacks into struct cpu_entry_area Thomas Gleixner
                   ` (43 subsequent siblings)
  62 siblings, 1 reply; 118+ messages in thread
From: Thomas Gleixner @ 2017-12-04 14:07 UTC (permalink / raw)
  To: LKML
  Cc: x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, keescook, hughd,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Rik van Riel,
	Boris Ostrovsky, Juergen Gross, David Laight, Eduardo Valentin,
	aliguori, Will Deacon, daniel.gruss, Ingo Molnar, Dave Hansen

[-- Attachment #0: x86-entry-64--Create_a_per-CPU_SYSCALL_entry_trampoline.patch --]
[-- Type: text/plain, Size: 7345 bytes --]

From: Andy Lutomirski <luto@kernel.org>

Handling SYSCALL is tricky: the SYSCALL handler is entered with every
single register (except FLAGS), including RSP, live.  It somehow needs
to set RSP to point to a valid stack, which means it needs to save the
user RSP somewhere and find its own stack pointer.  The canonical way
to do this is with SWAPGS, which lets us access percpu data using the
%gs prefix.

With KERNEL_PAGE_TABLE_ISOLATION-like pagetable switching, this is
problematic.  Without a scratch register, switching CR3 is impossible, so
%gs-based percpu memory would need to be mapped in the user pagetables.
Doing that without information leaks is difficult or impossible.

Instead, use a different sneaky trick.  Map a copy of the first part
of the SYSCALL asm at a different address for each CPU.  Now RIP
varies depending on the CPU, so we can use RIP-relative memory access
to access percpu memory.  By putting the relevant information (one
scratch slot and the stack address) at a constant offset relative to
RIP, we can make SYSCALL work without relying on %gs.

A nice thing about this approach is that we can easily switch it on
and off if we want pagetable switching to be configurable.

The compat variant of SYSCALL doesn't have this problem in the first
place -- there are plenty of scratch registers, since we don't care
about preserving r8-r15.  This patch therefore doesn't touch SYSCALL32
at all.

This patch actually seems to be a small speedup.  With this patch,
SYSCALL touches an extra cache line and an extra virtual page, but
the pipeline no longer stalls waiting for SWAPGS.  It seems that, at
least in a tight loop, the latter outweights the former.

Thanks to David Laight for an optimization tip.

XXX: Whenever we settle how KERNEL_PAGE_TABLE_ISOLATION gets turned on
and off, we should do the same to this.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bpetkov@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lkml.kernel.org/r/b95ccae0a5a2f090c901e49fce7c9e8ff6acd40d.1511497875.git.luto@kernel.org

---
 arch/x86/entry/entry_64.S     |   58 ++++++++++++++++++++++++++++++++++++++++++
 arch/x86/include/asm/fixmap.h |    2 +
 arch/x86/kernel/asm-offsets.c |    1 
 arch/x86/kernel/cpu/common.c  |   15 ++++++++++
 arch/x86/kernel/vmlinux.lds.S |    9 ++++++
 5 files changed, 84 insertions(+), 1 deletion(-)

--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -140,6 +140,64 @@ END(native_usergs_sysret64)
  * with them due to bugs in both AMD and Intel CPUs.
  */
 
+	.pushsection .entry_trampoline, "ax"
+
+/*
+ * The code in here gets remapped into cpu_entry_area's trampoline.  This means
+ * that the assembler and linker have the wrong idea as to where this code
+ * lives (and, in fact, it's mapped more than once, so it's not even at a
+ * fixed address).  So we can't reference any symbols outside the entry
+ * trampoline and expect it to work.
+ *
+ * Instead, we carefully abuse %rip-relative addressing.
+ * _entry_trampoline(%rip) refers to the start of the remapped) entry
+ * trampoline.  We can thus find cpu_entry_area with this macro:
+ */
+
+#define CPU_ENTRY_AREA \
+	_entry_trampoline - CPU_ENTRY_AREA_entry_trampoline(%rip)
+
+/* The top word of the SYSENTER stack is hot and is usable as scratch space. */
+#define RSP_SCRATCH	CPU_ENTRY_AREA_tss + CPU_TSS_SYSENTER_stack + \
+			SIZEOF_SYSENTER_stack - 8 + CPU_ENTRY_AREA
+
+ENTRY(entry_SYSCALL_64_trampoline)
+	UNWIND_HINT_EMPTY
+	swapgs
+
+	/* Stash the user RSP. */
+	movq	%rsp, RSP_SCRATCH
+
+	/* Load the top of the task stack into RSP */
+	movq	CPU_ENTRY_AREA_tss + TSS_sp1 + CPU_ENTRY_AREA, %rsp
+
+	/* Start building the simulated IRET frame. */
+	pushq	$__USER_DS			/* pt_regs->ss */
+	pushq	RSP_SCRATCH			/* pt_regs->sp */
+	pushq	%r11				/* pt_regs->flags */
+	pushq	$__USER_CS			/* pt_regs->cs */
+	pushq	%rcx				/* pt_regs->ip */
+
+	/*
+	 * x86 lacks a near absolute jump, and we can't jump to the real
+	 * entry text with a relative jump.  We could push the target
+	 * address and then use retq, but this destroys the pipeline on
+	 * many CPUs (wasting over 20 cycles on Sandy Bridge).  Instead,
+	 * spill RDI and restore it in a second-stage trampoline.
+	 */
+	pushq	%rdi
+	movq	$entry_SYSCALL_64_stage2, %rdi
+	jmp	*%rdi
+END(entry_SYSCALL_64_trampoline)
+
+	.popsection
+
+ENTRY(entry_SYSCALL_64_stage2)
+	UNWIND_HINT_EMPTY
+	popq	%rdi
+	jmp	entry_SYSCALL_64_after_hwframe
+END(entry_SYSCALL_64_stage2)
+
 ENTRY(entry_SYSCALL_64)
 	UNWIND_HINT_EMPTY
 	/*
--- a/arch/x86/include/asm/fixmap.h
+++ b/arch/x86/include/asm/fixmap.h
@@ -61,6 +61,8 @@ struct cpu_entry_area {
 	 * of the TSS region.
 	 */
 	struct tss_struct tss;
+
+	char entry_trampoline[PAGE_SIZE];
 };
 
 #define CPU_ENTRY_AREA_PAGES (sizeof(struct cpu_entry_area) / PAGE_SIZE)
--- a/arch/x86/kernel/asm-offsets.c
+++ b/arch/x86/kernel/asm-offsets.c
@@ -101,4 +101,5 @@ void common(void) {
 
 	/* Layout info for cpu_entry_area */
 	OFFSET(CPU_ENTRY_AREA_tss, cpu_entry_area, tss);
+	OFFSET(CPU_ENTRY_AREA_entry_trampoline, cpu_entry_area, entry_trampoline);
 }
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -510,6 +510,8 @@ DEFINE_PER_CPU(struct cpu_entry_area *,
 static inline void setup_cpu_entry_area(int cpu)
 {
 #ifdef CONFIG_X86_64
+	extern char _entry_trampoline[];
+
 	/* On 64-bit systems, we use a read-only fixmap GDT. */
 	pgprot_t gdt_prot = PAGE_KERNEL_RO;
 #else
@@ -556,6 +558,11 @@ static inline void setup_cpu_entry_area(
 #ifdef CONFIG_X86_32
 	this_cpu_write(cpu_entry_area, get_cpu_entry_area(cpu));
 #endif
+
+#ifdef CONFIG_X86_64
+	__set_fixmap(get_cpu_entry_area_index(cpu, entry_trampoline),
+		     __pa_symbol(_entry_trampoline), PAGE_KERNEL_RX);
+#endif
 }
 
 /* Load the original GDT from the per-cpu structure */
@@ -1420,10 +1427,16 @@ static DEFINE_PER_CPU_PAGE_ALIGNED(char,
 /* May not be marked __init: used by software suspend */
 void syscall_init(void)
 {
+	extern char _entry_trampoline[];
+	extern char entry_SYSCALL_64_trampoline[];
+
 	int cpu = smp_processor_id();
+	unsigned long SYSCALL64_entry_trampoline =
+		(unsigned long)get_cpu_entry_area(cpu)->entry_trampoline +
+		(entry_SYSCALL_64_trampoline - _entry_trampoline);
 
 	wrmsr(MSR_STAR, 0, (__USER32_CS << 16) | __KERNEL_CS);
-	wrmsrl(MSR_LSTAR, (unsigned long)entry_SYSCALL_64);
+	wrmsrl(MSR_LSTAR, SYSCALL64_entry_trampoline);
 
 #ifdef CONFIG_IA32_EMULATION
 	wrmsrl(MSR_CSTAR, (unsigned long)entry_SYSCALL_compat);
--- a/arch/x86/kernel/vmlinux.lds.S
+++ b/arch/x86/kernel/vmlinux.lds.S
@@ -107,6 +107,15 @@ SECTIONS
 		SOFTIRQENTRY_TEXT
 		*(.fixup)
 		*(.gnu.warning)
+
+#ifdef CONFIG_X86_64
+		. = ALIGN(PAGE_SIZE);
+		_entry_trampoline = .;
+		*(.entry_trampoline)
+		. = ALIGN(PAGE_SIZE);
+		ASSERT(. - _entry_trampoline == PAGE_SIZE, "entry trampoline is too big");
+#endif
+
 		/* End of text section */
 		_etext = .;
 	} :text = 0x9090

^ permalink raw reply	[flat|nested] 118+ messages in thread

* [patch 20/60] x86/entry/64: Move the IST stacks into struct cpu_entry_area
  2017-12-04 14:07 [patch 00/60] x86/kpti: Kernel Page Table Isolation (was KAISER) Thomas Gleixner
                   ` (18 preceding siblings ...)
  2017-12-04 14:07 ` [patch 19/60] x86/entry/64: Create a per-CPU SYSCALL entry trampoline Thomas Gleixner
@ 2017-12-04 14:07 ` Thomas Gleixner
  2017-12-04 14:07 ` [patch 21/60] x86/entry/64: Remove the SYSENTER stack canary Thomas Gleixner
                   ` (42 subsequent siblings)
  62 siblings, 0 replies; 118+ messages in thread
From: Thomas Gleixner @ 2017-12-04 14:07 UTC (permalink / raw)
  To: LKML
  Cc: x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, keescook, hughd,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Rik van Riel,
	Boris Ostrovsky, Juergen Gross, David Laight, Eduardo Valentin,
	aliguori, Will Deacon, daniel.gruss, Ingo Molnar,
	Borislav Petkov, Dave Hansen

[-- Attachment #0: x86-entry-64--Move_the_IST_stacks_into_struct_cpu_entry_area.patch --]
[-- Type: text/plain, Size: 6361 bytes --]

From: Andy Lutomirski <luto@kernel.org>

The IST stacks are needed when an IST exception occurs and are accessed
before any kernel code at all runs.  Move them into struct cpu_entry_area.

The IST stacks are unlike the rest of cpu_entry_area: they're used even for
entries from kernel mode.  This means that they should be set up before we
load the final IDT.  Move cpu_entry_area setup to trap_init() for the boot
CPU and set it up for all possible CPUs at once in native_smp_prepare_cpus().

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lkml.kernel.org/r/0ffddccdc0ce1953f950a553142662cf68258fb7.1511497875.git.luto@kernel.org

---
 arch/x86/include/asm/fixmap.h |   12 ++++++
 arch/x86/kernel/cpu/common.c  |   74 +++++++++++++++++++++++-------------------
 arch/x86/kernel/traps.c       |    3 +
 3 files changed, 57 insertions(+), 32 deletions(-)

--- a/arch/x86/include/asm/fixmap.h
+++ b/arch/x86/include/asm/fixmap.h
@@ -63,10 +63,22 @@ struct cpu_entry_area {
 	struct tss_struct tss;
 
 	char entry_trampoline[PAGE_SIZE];
+
+#ifdef CONFIG_X86_64
+	/*
+	 * Exception stacks used for IST entries.
+	 *
+	 * In the future, this should have a separate slot for each stack
+	 * with guard pages between them.
+	 */
+	char exception_stacks[(N_EXCEPTION_STACKS - 1) * EXCEPTION_STKSZ + DEBUG_STKSZ];
+#endif
 };
 
 #define CPU_ENTRY_AREA_PAGES (sizeof(struct cpu_entry_area) / PAGE_SIZE)
 
+extern void setup_cpu_entry_areas(void);
+
 /*
  * Here we define all the compile-time 'special' virtual
  * addresses. The point is to have a constant address at
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -490,24 +490,36 @@ void load_percpu_segment(int cpu)
 	load_stack_canary_segment();
 }
 
-static void set_percpu_fixmap_pages(int fixmap_index, void *ptr,
-				    int pages, pgprot_t prot)
-{
-	int i;
-
-	for (i = 0; i < pages; i++) {
-		__set_fixmap(fixmap_index - i,
-			     per_cpu_ptr_to_phys(ptr + i * PAGE_SIZE), prot);
-	}
-}
-
 #ifdef CONFIG_X86_32
 /* The 32-bit entry code needs to find cpu_entry_area. */
 DEFINE_PER_CPU(struct cpu_entry_area *, cpu_entry_area);
 #endif
 
+#ifdef CONFIG_X86_64
+/*
+ * Special IST stacks which the CPU switches to when it calls
+ * an IST-marked descriptor entry. Up to 7 stacks (hardware
+ * limit), all of them are 4K, except the debug stack which
+ * is 8K.
+ */
+static const unsigned int exception_stack_sizes[N_EXCEPTION_STACKS] = {
+	  [0 ... N_EXCEPTION_STACKS - 1]	= EXCEPTION_STKSZ,
+	  [DEBUG_STACK - 1]			= DEBUG_STKSZ
+};
+
+static DEFINE_PER_CPU_PAGE_ALIGNED(char, exception_stacks
+	[(N_EXCEPTION_STACKS - 1) * EXCEPTION_STKSZ + DEBUG_STKSZ]);
+#endif
+
+static void __init
+set_percpu_fixmap_pages(int idx, void *ptr, int pages, pgprot_t prot)
+{
+	for ( ; pages; pages--, idx--, ptr += PAGE_SIZE)
+		__set_fixmap(idx, per_cpu_ptr_to_phys(ptr), prot);
+}
+
 /* Setup the fixmap mappings only once per-processor */
-static inline void setup_cpu_entry_area(int cpu)
+static void __init setup_cpu_entry_area(int cpu)
 {
 #ifdef CONFIG_X86_64
 	extern char _entry_trampoline[];
@@ -556,15 +568,31 @@ static inline void setup_cpu_entry_area(
 				PAGE_KERNEL);
 
 #ifdef CONFIG_X86_32
-	this_cpu_write(cpu_entry_area, get_cpu_entry_area(cpu));
+	per_cpu(cpu_entry_area, cpu) = get_cpu_entry_area(cpu);
 #endif
 
 #ifdef CONFIG_X86_64
+	BUILD_BUG_ON(sizeof(exception_stacks) % PAGE_SIZE != 0);
+	BUILD_BUG_ON(sizeof(exception_stacks) !=
+		     sizeof(((struct cpu_entry_area *)0)->exception_stacks));
+	set_percpu_fixmap_pages(get_cpu_entry_area_index(cpu, exception_stacks),
+				&per_cpu(exception_stacks, cpu),
+				sizeof(exception_stacks) / PAGE_SIZE,
+				PAGE_KERNEL);
+
 	__set_fixmap(get_cpu_entry_area_index(cpu, entry_trampoline),
 		     __pa_symbol(_entry_trampoline), PAGE_KERNEL_RX);
 #endif
 }
 
+void __init setup_cpu_entry_areas(void)
+{
+	unsigned int cpu;
+
+	for_each_possible_cpu(cpu)
+		setup_cpu_entry_area(cpu);
+}
+
 /* Load the original GDT from the per-cpu structure */
 void load_direct_gdt(int cpu)
 {
@@ -1410,20 +1438,6 @@ DEFINE_PER_CPU(unsigned int, irq_count)
 DEFINE_PER_CPU(int, __preempt_count) = INIT_PREEMPT_COUNT;
 EXPORT_PER_CPU_SYMBOL(__preempt_count);
 
-/*
- * Special IST stacks which the CPU switches to when it calls
- * an IST-marked descriptor entry. Up to 7 stacks (hardware
- * limit), all of them are 4K, except the debug stack which
- * is 8K.
- */
-static const unsigned int exception_stack_sizes[N_EXCEPTION_STACKS] = {
-	  [0 ... N_EXCEPTION_STACKS - 1]	= EXCEPTION_STKSZ,
-	  [DEBUG_STACK - 1]			= DEBUG_STKSZ
-};
-
-static DEFINE_PER_CPU_PAGE_ALIGNED(char, exception_stacks
-	[(N_EXCEPTION_STACKS - 1) * EXCEPTION_STKSZ + DEBUG_STKSZ]);
-
 /* May not be marked __init: used by software suspend */
 void syscall_init(void)
 {
@@ -1632,7 +1646,7 @@ void cpu_init(void)
 	 * set up and load the per-CPU TSS
 	 */
 	if (!oist->ist[0]) {
-		char *estacks = per_cpu(exception_stacks, cpu);
+		char *estacks = get_cpu_entry_area(cpu)->exception_stacks;
 
 		for (v = 0; v < N_EXCEPTION_STACKS; v++) {
 			estacks += exception_stack_sizes[v];
@@ -1658,8 +1672,6 @@ void cpu_init(void)
 	initialize_tlbstate_and_flush();
 	enter_lazy_tlb(&init_mm, me);
 
-	setup_cpu_entry_area(cpu);
-
 	/*
 	 * Initialize the TSS.  sp0 points to the entry trampoline stack
 	 * regardless of what task is running.
@@ -1719,8 +1731,6 @@ void cpu_init(void)
 	initialize_tlbstate_and_flush();
 	enter_lazy_tlb(&init_mm, curr);
 
-	setup_cpu_entry_area(cpu);
-
 	/*
 	 * Initialize the TSS.  Don't bother initializing sp0, as the initial
 	 * task never enters user mode.
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -952,6 +952,9 @@ dotraplinkage void do_iret_error(struct
 
 void __init trap_init(void)
 {
+	/* Init cpu_entry_area before IST entries are set up */
+	setup_cpu_entry_areas();
+
 	idt_setup_traps();
 
 	/*

^ permalink raw reply	[flat|nested] 118+ messages in thread

* [patch 21/60] x86/entry/64: Remove the SYSENTER stack canary
  2017-12-04 14:07 [patch 00/60] x86/kpti: Kernel Page Table Isolation (was KAISER) Thomas Gleixner
                   ` (19 preceding siblings ...)
  2017-12-04 14:07 ` [patch 20/60] x86/entry/64: Move the IST stacks into struct cpu_entry_area Thomas Gleixner
@ 2017-12-04 14:07 ` Thomas Gleixner
  2017-12-04 14:07 ` [patch 22/60] x86/entry: Clean up the SYSENTER_stack code Thomas Gleixner
                   ` (41 subsequent siblings)
  62 siblings, 0 replies; 118+ messages in thread
From: Thomas Gleixner @ 2017-12-04 14:07 UTC (permalink / raw)
  To: LKML
  Cc: x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, keescook, hughd,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Rik van Riel,
	Boris Ostrovsky, Juergen Gross, David Laight, Eduardo Valentin,
	aliguori, Will Deacon, daniel.gruss, Ingo Molnar,
	Borislav Petkov, Dave Hansen

[-- Attachment #0: x86-entry-64--Remove_the_SYSENTER_stack_canary.patch --]
[-- Type: text/plain, Size: 2537 bytes --]

From: Andy Lutomirski <luto@kernel.org>

Now that the SYSENTER stack has a guard page, there's no need for a canary
to detect overflow after the fact.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lkml.kernel.org/r/be3179c0a38c392fa44ebeb7dd89391ff5c010c3.1511497875.git.luto@kernel.org

---
 arch/x86/include/asm/processor.h |    1 -
 arch/x86/kernel/dumpstack.c      |    3 +--
 arch/x86/kernel/process.c        |    1 -
 arch/x86/kernel/traps.c          |    7 -------
 4 files changed, 1 insertion(+), 11 deletions(-)

--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -342,7 +342,6 @@ struct tss_struct {
 	 * Space for the temporary SYSENTER stack, used for SYSENTER
 	 * and the entry trampoline as well.
 	 */
-	unsigned long		SYSENTER_stack_canary;
 	unsigned long		SYSENTER_stack[64];
 
 	/*
--- a/arch/x86/kernel/dumpstack.c
+++ b/arch/x86/kernel/dumpstack.c
@@ -48,8 +48,7 @@ bool in_sysenter_stack(unsigned long *st
 	int cpu = smp_processor_id();
 	struct tss_struct *tss = &get_cpu_entry_area(cpu)->tss;
 
-	/* Treat the canary as part of the stack for unwinding purposes. */
-	void *begin = &tss->SYSENTER_stack_canary;
+	void *begin = &tss->SYSENTER_stack;
 	void *end = (void *)&tss->SYSENTER_stack + sizeof(tss->SYSENTER_stack);
 
 	if ((void *)stack < begin || (void *)stack >= end)
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -81,7 +81,6 @@
 	  */
 	.io_bitmap		= { [0 ... IO_BITMAP_LONGS] = ~0 },
 #endif
-	.SYSENTER_stack_canary	= STACK_END_MAGIC,
 };
 EXPORT_PER_CPU_SYMBOL(cpu_tss);
 
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -819,13 +819,6 @@ dotraplinkage void do_debug(struct pt_re
 	debug_stack_usage_dec();
 
 exit:
-	/*
-	 * This is the most likely code path that involves non-trivial use
-	 * of the SYSENTER stack.  Check that we haven't overrun it.
-	 */
-	WARN(this_cpu_read(cpu_tss.SYSENTER_stack_canary) != STACK_END_MAGIC,
-	     "Overran or corrupted SYSENTER stack\n");
-
 	ist_exit(regs);
 }
 NOKPROBE_SYMBOL(do_debug);

^ permalink raw reply	[flat|nested] 118+ messages in thread

* [patch 22/60] x86/entry: Clean up the SYSENTER_stack code
  2017-12-04 14:07 [patch 00/60] x86/kpti: Kernel Page Table Isolation (was KAISER) Thomas Gleixner
                   ` (20 preceding siblings ...)
  2017-12-04 14:07 ` [patch 21/60] x86/entry/64: Remove the SYSENTER stack canary Thomas Gleixner
@ 2017-12-04 14:07 ` Thomas Gleixner
  2017-12-04 19:41   ` Borislav Petkov
  2017-12-04 14:07 ` [patch 23/60] x86/entry/64: Make cpu_entry_area.tss read-only Thomas Gleixner
                   ` (40 subsequent siblings)
  62 siblings, 1 reply; 118+ messages in thread
From: Thomas Gleixner @ 2017-12-04 14:07 UTC (permalink / raw)
  To: LKML
  Cc: x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, keescook, hughd,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Rik van Riel,
	Boris Ostrovsky, Juergen Gross, David Laight, Eduardo Valentin,
	aliguori, Will Deacon, daniel.gruss, Ingo Molnar, Dave Hansen

[-- Attachment #0: x86-entry--Clean_up_the_SYSENTER_stack_code.patch --]
[-- Type: text/plain, Size: 6211 bytes --]

From: Andy Lutomirski <luto@kernel.org>

The existing code was a mess, mainly because C arrays are nasty.  Turn
SYSENTER_stack into a struct, add a helper to find it, and do all the
obvious cleanups this enables.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lkml.kernel.org/r/38ff640712c9b591b32de24a080daf13afaba234.1511497875.git.luto@kernel.org

---
 arch/x86/entry/entry_32.S        |    4 ++--
 arch/x86/entry/entry_64.S        |    2 +-
 arch/x86/include/asm/fixmap.h    |    5 +++++
 arch/x86/include/asm/processor.h |    6 +++++-
 arch/x86/kernel/asm-offsets.c    |    6 ++----
 arch/x86/kernel/cpu/common.c     |   14 +++-----------
 arch/x86/kernel/dumpstack.c      |    7 +++----
 7 files changed, 21 insertions(+), 23 deletions(-)

--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -942,7 +942,7 @@ ENTRY(debug)
 
 	/* Are we currently on the SYSENTER stack? */
 	movl	PER_CPU_VAR(cpu_entry_area), %ecx
-	addl	$CPU_ENTRY_AREA_tss + CPU_TSS_SYSENTER_stack + SIZEOF_SYSENTER_stack, %ecx
+	addl	$CPU_ENTRY_AREA_tss + TSS_STRUCT_SYSENTER_stack + SIZEOF_SYSENTER_stack, %ecx
 	subl	%eax, %ecx	/* ecx = (end of SYSENTER_stack) - esp */
 	cmpl	$SIZEOF_SYSENTER_stack, %ecx
 	jb	.Ldebug_from_sysenter_stack
@@ -986,7 +986,7 @@ ENTRY(nmi)
 
 	/* Are we currently on the SYSENTER stack? */
 	movl	PER_CPU_VAR(cpu_entry_area), %ecx
-	addl	$CPU_ENTRY_AREA_tss + CPU_TSS_SYSENTER_stack + SIZEOF_SYSENTER_stack, %ecx
+	addl	$CPU_ENTRY_AREA_tss + TSS_STRUCT_SYSENTER_stack + SIZEOF_SYSENTER_stack, %ecx
 	subl	%eax, %ecx	/* ecx = (end of SYSENTER_stack) - esp */
 	cmpl	$SIZEOF_SYSENTER_stack, %ecx
 	jb	.Lnmi_from_sysenter_stack
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -158,7 +158,7 @@ END(native_usergs_sysret64)
 	_entry_trampoline - CPU_ENTRY_AREA_entry_trampoline(%rip)
 
 /* The top word of the SYSENTER stack is hot and is usable as scratch space. */
-#define RSP_SCRATCH	CPU_ENTRY_AREA_tss + CPU_TSS_SYSENTER_stack + \
+#define RSP_SCRATCH	CPU_ENTRY_AREA_tss + TSS_STRUCT_SYSENTER_stack + \
 			SIZEOF_SYSENTER_stack - 8 + CPU_ENTRY_AREA
 
 ENTRY(entry_SYSCALL_64_trampoline)
--- a/arch/x86/include/asm/fixmap.h
+++ b/arch/x86/include/asm/fixmap.h
@@ -239,5 +239,10 @@ static inline struct cpu_entry_area *get
 	return (struct cpu_entry_area *)__fix_to_virt(__get_cpu_entry_area_page_index(cpu, 0));
 }
 
+static inline struct SYSENTER_stack *cpu_SYSENTER_stack(int cpu)
+{
+	return &get_cpu_entry_area(cpu)->tss.SYSENTER_stack;
+}
+
 #endif /* !__ASSEMBLY__ */
 #endif /* _ASM_X86_FIXMAP_H */
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -337,12 +337,16 @@ struct x86_hw_tss {
 #define IO_BITMAP_OFFSET		(offsetof(struct tss_struct, io_bitmap) - offsetof(struct tss_struct, x86_tss))
 #define INVALID_IO_BITMAP_OFFSET	0x8000
 
+struct SYSENTER_stack {
+	unsigned long		words[64];
+};
+
 struct tss_struct {
 	/*
 	 * Space for the temporary SYSENTER stack, used for SYSENTER
 	 * and the entry trampoline as well.
 	 */
-	unsigned long		SYSENTER_stack[64];
+	struct SYSENTER_stack	SYSENTER_stack;
 
 	/*
 	 * The fixed hardware portion.  This must not cross a page boundary
--- a/arch/x86/kernel/asm-offsets.c
+++ b/arch/x86/kernel/asm-offsets.c
@@ -94,10 +94,8 @@ void common(void) {
 	BLANK();
 	DEFINE(PTREGS_SIZE, sizeof(struct pt_regs));
 
-	/* Offset from cpu_tss to SYSENTER_stack */
-	OFFSET(CPU_TSS_SYSENTER_stack, tss_struct, SYSENTER_stack);
-	/* Size of SYSENTER_stack */
-	DEFINE(SIZEOF_SYSENTER_stack, sizeof(((struct tss_struct *)0)->SYSENTER_stack));
+	OFFSET(TSS_STRUCT_SYSENTER_stack, tss_struct, SYSENTER_stack);
+	DEFINE(SIZEOF_SYSENTER_stack, sizeof(struct SYSENTER_stack));
 
 	/* Layout info for cpu_entry_area */
 	OFFSET(CPU_ENTRY_AREA_tss, cpu_entry_area, tss);
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -1339,12 +1339,7 @@ void enable_sep_cpu(void)
 
 	tss->x86_tss.ss1 = __KERNEL_CS;
 	wrmsr(MSR_IA32_SYSENTER_CS, tss->x86_tss.ss1, 0);
-
-	wrmsr(MSR_IA32_SYSENTER_ESP,
-	      (unsigned long)&get_cpu_entry_area(cpu)->tss +
-	      offsetofend(struct tss_struct, SYSENTER_stack),
-	      0);
-
+	wrmsr(MSR_IA32_SYSENTER_ESP, (unsigned long)(cpu_SYSENTER_stack(cpu) + 1), 0);
 	wrmsr(MSR_IA32_SYSENTER_EIP, (unsigned long)entry_SYSENTER_32, 0);
 
 	put_cpu();
@@ -1461,9 +1456,7 @@ void syscall_init(void)
 	 * AMD doesn't allow SYSENTER in long mode (either 32- or 64-bit).
 	 */
 	wrmsrl_safe(MSR_IA32_SYSENTER_CS, (u64)__KERNEL_CS);
-	wrmsrl_safe(MSR_IA32_SYSENTER_ESP,
-		    (unsigned long)&get_cpu_entry_area(cpu)->tss +
-		    offsetofend(struct tss_struct, SYSENTER_stack));
+	wrmsrl_safe(MSR_IA32_SYSENTER_ESP, (unsigned long)(cpu_SYSENTER_stack(cpu) + 1));
 	wrmsrl_safe(MSR_IA32_SYSENTER_EIP, (u64)entry_SYSENTER_compat);
 #else
 	wrmsrl(MSR_CSTAR, (unsigned long)ignore_sysret);
@@ -1678,8 +1671,7 @@ void cpu_init(void)
 	 */
 	set_tss_desc(cpu, &get_cpu_entry_area(cpu)->tss.x86_tss);
 	load_TR_desc();
-	load_sp0((unsigned long)&get_cpu_entry_area(cpu)->tss +
-		 offsetofend(struct tss_struct, SYSENTER_stack));
+	load_sp0((unsigned long)(cpu_SYSENTER_stack(cpu) + 1));
 
 	load_mm_ldt(&init_mm);
 
--- a/arch/x86/kernel/dumpstack.c
+++ b/arch/x86/kernel/dumpstack.c
@@ -45,11 +45,10 @@ bool in_task_stack(unsigned long *stack,
 
 bool in_sysenter_stack(unsigned long *stack, struct stack_info *info)
 {
-	int cpu = smp_processor_id();
-	struct tss_struct *tss = &get_cpu_entry_area(cpu)->tss;
+	struct SYSENTER_stack *ss = cpu_SYSENTER_stack(smp_processor_id());
 
-	void *begin = &tss->SYSENTER_stack;
-	void *end = (void *)&tss->SYSENTER_stack + sizeof(tss->SYSENTER_stack);
+	void *begin = ss;
+	void *end = ss + 1;
 
 	if ((void *)stack < begin || (void *)stack >= end)
 		return false;

^ permalink raw reply	[flat|nested] 118+ messages in thread

* [patch 23/60] x86/entry/64: Make cpu_entry_area.tss read-only
  2017-12-04 14:07 [patch 00/60] x86/kpti: Kernel Page Table Isolation (was KAISER) Thomas Gleixner
                   ` (21 preceding siblings ...)
  2017-12-04 14:07 ` [patch 22/60] x86/entry: Clean up the SYSENTER_stack code Thomas Gleixner
@ 2017-12-04 14:07 ` Thomas Gleixner
  2017-12-04 20:25   ` Borislav Petkov
  2017-12-04 14:07 ` [patch 24/60] x86/paravirt: Dont patch flush_tlb_single Thomas Gleixner
                   ` (39 subsequent siblings)
  62 siblings, 1 reply; 118+ messages in thread
From: Thomas Gleixner @ 2017-12-04 14:07 UTC (permalink / raw)
  To: LKML
  Cc: x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, keescook, hughd,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Rik van Riel,
	Boris Ostrovsky, Juergen Gross, David Laight, Eduardo Valentin,
	aliguori, Will Deacon, daniel.gruss, Kees Cook, Borislav Petkov

[-- Attachment #0: x86-entry-64--Make_cpu_entry_area.tss_read-only.patch --]
[-- Type: text/plain, Size: 15479 bytes --]

From: Andy Lutomirski <luto@kernel.org>

The TSS is a fairly juicy target for exploits, and, now that the TSS
is in the cpu_entry_area, it's no longer protected by kASLR.  Make it
read-only on x86_64.

On x86_32, it can't be RO because it's written by the CPU during task
switches, and we use a task gate for double faults.  I'd also be
nervous about errata if we tried to make it RO even on configurations
without double fault handling.

[ tglx: AMD confirmed that there is no problem on 64bit with TSS RO.  So
  	it's probably safe to assume that it's a non issue, though Intel
  	might have been creative in that area. Still waiting for
  	confirmation. ]

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Kees Cook <keescook@chromium.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: David Laight <David.Laight@aculab.com>
Cc: Borislav Petkov <bp@alien8.de>
Link: https://lkml.kernel.org/r/7d2f65f86a46e3489ba996932554485c3d345632.1512109321.git.luto@kernel.org

---
 arch/x86/entry/entry_32.S          |    4 ++--
 arch/x86/entry/entry_64.S          |    8 ++++----
 arch/x86/include/asm/fixmap.h      |   13 +++++++++----
 arch/x86/include/asm/processor.h   |   17 ++++++++---------
 arch/x86/include/asm/switch_to.h   |    4 ++--
 arch/x86/include/asm/thread_info.h |    2 +-
 arch/x86/kernel/asm-offsets.c      |    5 ++---
 arch/x86/kernel/asm-offsets_32.c   |    4 ++--
 arch/x86/kernel/cpu/common.c       |   29 +++++++++++++++++++----------
 arch/x86/kernel/ioport.c           |    2 +-
 arch/x86/kernel/process.c          |    6 +++---
 arch/x86/kernel/process_32.c       |    2 +-
 arch/x86/kernel/process_64.c       |    2 +-
 arch/x86/kernel/traps.c            |    4 ++--
 arch/x86/lib/delay.c               |    4 ++--
 arch/x86/xen/enlighten_pv.c        |    2 +-
 16 files changed, 60 insertions(+), 48 deletions(-)

--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -942,7 +942,7 @@ ENTRY(debug)
 
 	/* Are we currently on the SYSENTER stack? */
 	movl	PER_CPU_VAR(cpu_entry_area), %ecx
-	addl	$CPU_ENTRY_AREA_tss + TSS_STRUCT_SYSENTER_stack + SIZEOF_SYSENTER_stack, %ecx
+	addl	$CPU_ENTRY_AREA_SYSENTER_stack + SIZEOF_SYSENTER_stack, %ecx
 	subl	%eax, %ecx	/* ecx = (end of SYSENTER_stack) - esp */
 	cmpl	$SIZEOF_SYSENTER_stack, %ecx
 	jb	.Ldebug_from_sysenter_stack
@@ -986,7 +986,7 @@ ENTRY(nmi)
 
 	/* Are we currently on the SYSENTER stack? */
 	movl	PER_CPU_VAR(cpu_entry_area), %ecx
-	addl	$CPU_ENTRY_AREA_tss + TSS_STRUCT_SYSENTER_stack + SIZEOF_SYSENTER_stack, %ecx
+	addl	$CPU_ENTRY_AREA_SYSENTER_stack + SIZEOF_SYSENTER_stack, %ecx
 	subl	%eax, %ecx	/* ecx = (end of SYSENTER_stack) - esp */
 	cmpl	$SIZEOF_SYSENTER_stack, %ecx
 	jb	.Lnmi_from_sysenter_stack
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -158,7 +158,7 @@ END(native_usergs_sysret64)
 	_entry_trampoline - CPU_ENTRY_AREA_entry_trampoline(%rip)
 
 /* The top word of the SYSENTER stack is hot and is usable as scratch space. */
-#define RSP_SCRATCH	CPU_ENTRY_AREA_tss + TSS_STRUCT_SYSENTER_stack + \
+#define RSP_SCRATCH	CPU_ENTRY_AREA_SYSENTER_stack + \
 			SIZEOF_SYSENTER_stack - 8 + CPU_ENTRY_AREA
 
 ENTRY(entry_SYSCALL_64_trampoline)
@@ -394,7 +394,7 @@ GLOBAL(entry_SYSCALL_64_after_hwframe)
 	 * Save old stack pointer and switch to trampoline stack.
 	 */
 	movq	%rsp, %rdi
-	movq	PER_CPU_VAR(cpu_tss + TSS_sp0), %rsp
+	movq	PER_CPU_VAR(cpu_tss_rw + TSS_sp0), %rsp
 
 	pushq	RSP-RDI(%rdi)	/* RSP */
 	pushq	(%rdi)		/* RDI */
@@ -723,7 +723,7 @@ GLOBAL(swapgs_restore_regs_and_return_to
 	 * Save old stack pointer and switch to trampoline stack.
 	 */
 	movq	%rsp, %rdi
-	movq	PER_CPU_VAR(cpu_tss + TSS_sp0), %rsp
+	movq	PER_CPU_VAR(cpu_tss_rw + TSS_sp0), %rsp
 
 	/* Copy the IRET frame to the trampoline stack. */
 	pushq	6*8(%rdi)	/* SS */
@@ -938,7 +938,7 @@ apicinterrupt IRQ_WORK_VECTOR			irq_work
 /*
  * Exception entry points.
  */
-#define CPU_TSS_IST(x) PER_CPU_VAR(cpu_tss) + (TSS_ist + ((x) - 1) * 8)
+#define CPU_TSS_IST(x) PER_CPU_VAR(cpu_tss_rw) + (TSS_ist + ((x) - 1) * 8)
 
 /*
  * Switch to the thread stack.  This is called with the IRET frame and
--- a/arch/x86/include/asm/fixmap.h
+++ b/arch/x86/include/asm/fixmap.h
@@ -56,9 +56,14 @@ struct cpu_entry_area {
 	char gdt[PAGE_SIZE];
 
 	/*
-	 * The GDT is just below cpu_tss and thus serves (on x86_64) as a
-	 * a read-only guard page for the SYSENTER stack at the bottom
-	 * of the TSS region.
+	 * The GDT is just below SYSENTER_stack and thus serves (on x86_64) as
+	 * a a read-only guard page.
+	 */
+	struct SYSENTER_stack_page SYSENTER_stack_page;
+
+	/*
+	 * On x86_64, the TSS is mapped RO.  On x86_32, it's mapped RW because
+	 * we need task switches to work, and task switches write to the TSS.
 	 */
 	struct tss_struct tss;
 
@@ -241,7 +246,7 @@ static inline struct cpu_entry_area *get
 
 static inline struct SYSENTER_stack *cpu_SYSENTER_stack(int cpu)
 {
-	return &get_cpu_entry_area(cpu)->tss.SYSENTER_stack;
+	return &get_cpu_entry_area(cpu)->SYSENTER_stack_page.stack;
 }
 
 #endif /* !__ASSEMBLY__ */
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -341,13 +341,11 @@ struct SYSENTER_stack {
 	unsigned long		words[64];
 };
 
-struct tss_struct {
-	/*
-	 * Space for the temporary SYSENTER stack, used for SYSENTER
-	 * and the entry trampoline as well.
-	 */
-	struct SYSENTER_stack	SYSENTER_stack;
+struct SYSENTER_stack_page {
+	struct SYSENTER_stack stack;
+} __aligned(PAGE_SIZE);
 
+struct tss_struct {
 	/*
 	 * The fixed hardware portion.  This must not cross a page boundary
 	 * at risk of violating the SDM's advice and potentially triggering
@@ -364,7 +362,7 @@ struct tss_struct {
 	unsigned long		io_bitmap[IO_BITMAP_LONGS + 1];
 } __aligned(PAGE_SIZE);
 
-DECLARE_PER_CPU_PAGE_ALIGNED(struct tss_struct, cpu_tss);
+DECLARE_PER_CPU_PAGE_ALIGNED(struct tss_struct, cpu_tss_rw);
 
 /*
  * sizeof(unsigned long) coming from an extra "long" at the end
@@ -379,7 +377,8 @@ DECLARE_PER_CPU_PAGE_ALIGNED(struct tss_
 #ifdef CONFIG_X86_32
 DECLARE_PER_CPU(unsigned long, cpu_current_top_of_stack);
 #else
-#define cpu_current_top_of_stack cpu_tss.x86_tss.sp1
+/* The RO copy can't be accessed with this_cpu_xyz(), so use the RW copy. */
+#define cpu_current_top_of_stack cpu_tss_rw.x86_tss.sp1
 #endif
 
 /*
@@ -539,7 +538,7 @@ static inline void native_set_iopl_mask(
 static inline void
 native_load_sp0(unsigned long sp0)
 {
-	this_cpu_write(cpu_tss.x86_tss.sp0, sp0);
+	this_cpu_write(cpu_tss_rw.x86_tss.sp0, sp0);
 }
 
 static inline void native_swapgs(void)
--- a/arch/x86/include/asm/switch_to.h
+++ b/arch/x86/include/asm/switch_to.h
@@ -79,10 +79,10 @@ do {									\
 static inline void refresh_sysenter_cs(struct thread_struct *thread)
 {
 	/* Only happens when SEP is enabled, no need to test "SEP"arately: */
-	if (unlikely(this_cpu_read(cpu_tss.x86_tss.ss1) == thread->sysenter_cs))
+	if (unlikely(this_cpu_read(cpu_tss_rw.x86_tss.ss1) == thread->sysenter_cs))
 		return;
 
-	this_cpu_write(cpu_tss.x86_tss.ss1, thread->sysenter_cs);
+	this_cpu_write(cpu_tss_rw.x86_tss.ss1, thread->sysenter_cs);
 	wrmsr(MSR_IA32_SYSENTER_CS, thread->sysenter_cs, 0);
 }
 #endif
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -207,7 +207,7 @@ static inline int arch_within_stack_fram
 #else /* !__ASSEMBLY__ */
 
 #ifdef CONFIG_X86_64
-# define cpu_current_top_of_stack (cpu_tss + TSS_sp1)
+# define cpu_current_top_of_stack (cpu_tss_rw + TSS_sp1)
 #endif
 
 #endif
--- a/arch/x86/kernel/asm-offsets.c
+++ b/arch/x86/kernel/asm-offsets.c
@@ -94,10 +94,9 @@ void common(void) {
 	BLANK();
 	DEFINE(PTREGS_SIZE, sizeof(struct pt_regs));
 
-	OFFSET(TSS_STRUCT_SYSENTER_stack, tss_struct, SYSENTER_stack);
-	DEFINE(SIZEOF_SYSENTER_stack, sizeof(struct SYSENTER_stack));
-
 	/* Layout info for cpu_entry_area */
 	OFFSET(CPU_ENTRY_AREA_tss, cpu_entry_area, tss);
 	OFFSET(CPU_ENTRY_AREA_entry_trampoline, cpu_entry_area, entry_trampoline);
+	OFFSET(CPU_ENTRY_AREA_SYSENTER_stack, cpu_entry_area, SYSENTER_stack_page);
+	DEFINE(SIZEOF_SYSENTER_stack, sizeof(struct SYSENTER_stack));
 }
--- a/arch/x86/kernel/asm-offsets_32.c
+++ b/arch/x86/kernel/asm-offsets_32.c
@@ -47,8 +47,8 @@ void foo(void)
 	BLANK();
 
 	/* Offset from the sysenter stack to tss.sp0 */
-	DEFINE(TSS_sysenter_sp0, offsetof(struct tss_struct, x86_tss.sp0) -
-	       offsetofend(struct tss_struct, SYSENTER_stack));
+	DEFINE(TSS_sysenter_sp0, offsetof(struct cpu_entry_area, tss.x86_tss.sp0) -
+	       offsetofend(struct cpu_entry_area, SYSENTER_stack_page.stack));
 
 #ifdef CONFIG_CC_STACKPROTECTOR
 	BLANK();
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -511,6 +511,9 @@ static DEFINE_PER_CPU_PAGE_ALIGNED(char,
 	[(N_EXCEPTION_STACKS - 1) * EXCEPTION_STKSZ + DEBUG_STKSZ]);
 #endif
 
+static DEFINE_PER_CPU_PAGE_ALIGNED(struct SYSENTER_stack_page,
+				   SYSENTER_stack_storage);
+
 static void __init
 set_percpu_fixmap_pages(int idx, void *ptr, int pages, pgprot_t prot)
 {
@@ -524,23 +527,29 @@ static void __init setup_cpu_entry_area(
 #ifdef CONFIG_X86_64
 	extern char _entry_trampoline[];
 
-	/* On 64-bit systems, we use a read-only fixmap GDT. */
+	/* On 64-bit systems, we use a read-only fixmap GDT and TSS. */
 	pgprot_t gdt_prot = PAGE_KERNEL_RO;
+	pgprot_t tss_prot = PAGE_KERNEL_RO;
 #else
 	/*
 	 * On native 32-bit systems, the GDT cannot be read-only because
 	 * our double fault handler uses a task gate, and entering through
-	 * a task gate needs to change an available TSS to busy.  If the GDT
-	 * is read-only, that will triple fault.
+	 * a task gate needs to change an available TSS to busy.  If the
+	 * GDT is read-only, that will triple fault.  The TSS cannot be
+	 * read-only because the CPU writes to it on task switches.
 	 *
-	 * On Xen PV, the GDT must be read-only because the hypervisor requires
-	 * it.
+	 * On Xen PV, the GDT must be read-only because the hypervisor
+	 * requires it.
 	 */
 	pgprot_t gdt_prot = boot_cpu_has(X86_FEATURE_XENPV) ?
 		PAGE_KERNEL_RO : PAGE_KERNEL;
+	pgprot_t tss_prot = PAGE_KERNEL;
 #endif
 
 	__set_fixmap(get_cpu_entry_area_index(cpu, gdt), get_cpu_gdt_paddr(cpu), gdt_prot);
+	set_percpu_fixmap_pages(get_cpu_entry_area_index(cpu, SYSENTER_stack_page),
+				per_cpu_ptr(&SYSENTER_stack_storage, cpu), 1,
+				PAGE_KERNEL);
 
 	/*
 	 * The Intel SDM says (Volume 3, 7.2.1):
@@ -563,9 +572,9 @@ static void __init setup_cpu_entry_area(
 		      offsetofend(struct tss_struct, x86_tss)) & PAGE_MASK);
 	BUILD_BUG_ON(sizeof(struct tss_struct) % PAGE_SIZE != 0);
 	set_percpu_fixmap_pages(get_cpu_entry_area_index(cpu, tss),
-				&per_cpu(cpu_tss, cpu),
+				&per_cpu(cpu_tss_rw, cpu),
 				sizeof(struct tss_struct) / PAGE_SIZE,
-				PAGE_KERNEL);
+				tss_prot);
 
 #ifdef CONFIG_X86_32
 	per_cpu(cpu_entry_area, cpu) = get_cpu_entry_area(cpu);
@@ -1330,7 +1339,7 @@ void enable_sep_cpu(void)
 		return;
 
 	cpu = get_cpu();
-	tss = &per_cpu(cpu_tss, cpu);
+	tss = &per_cpu(cpu_tss_rw, cpu);
 
 	/*
 	 * We cache MSR_IA32_SYSENTER_CS's value in the TSS's ss1 field --
@@ -1600,7 +1609,7 @@ void cpu_init(void)
 	if (cpu)
 		load_ucode_ap();
 
-	t = &per_cpu(cpu_tss, cpu);
+	t = &per_cpu(cpu_tss_rw, cpu);
 	oist = &per_cpu(orig_ist, cpu);
 
 #ifdef CONFIG_NUMA
@@ -1692,7 +1701,7 @@ void cpu_init(void)
 {
 	int cpu = smp_processor_id();
 	struct task_struct *curr = current;
-	struct tss_struct *t = &per_cpu(cpu_tss, cpu);
+	struct tss_struct *t = &per_cpu(cpu_tss_rw, cpu);
 
 	wait_for_master_cpu(cpu);
 
--- a/arch/x86/kernel/ioport.c
+++ b/arch/x86/kernel/ioport.c
@@ -67,7 +67,7 @@ asmlinkage long sys_ioperm(unsigned long
 	 * because the ->io_bitmap_max value must match the bitmap
 	 * contents:
 	 */
-	tss = &per_cpu(cpu_tss, get_cpu());
+	tss = &per_cpu(cpu_tss_rw, get_cpu());
 
 	if (turn_on)
 		bitmap_clear(t->io_bitmap_ptr, from, num);
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -47,7 +47,7 @@
  * section. Since TSS's are completely CPU-local, we want them
  * on exact cacheline boundaries, to eliminate cacheline ping-pong.
  */
-__visible DEFINE_PER_CPU_SHARED_ALIGNED(struct tss_struct, cpu_tss) = {
+__visible DEFINE_PER_CPU_SHARED_ALIGNED(struct tss_struct, cpu_tss_rw) = {
 	.x86_tss = {
 		/*
 		 * .sp0 is only used when entering ring 0 from a lower
@@ -82,7 +82,7 @@
 	.io_bitmap		= { [0 ... IO_BITMAP_LONGS] = ~0 },
 #endif
 };
-EXPORT_PER_CPU_SYMBOL(cpu_tss);
+EXPORT_PER_CPU_SYMBOL(cpu_tss_rw);
 
 DEFINE_PER_CPU(bool, __tss_limit_invalid);
 EXPORT_PER_CPU_SYMBOL_GPL(__tss_limit_invalid);
@@ -111,7 +111,7 @@ void exit_thread(struct task_struct *tsk
 	struct fpu *fpu = &t->fpu;
 
 	if (bp) {
-		struct tss_struct *tss = &per_cpu(cpu_tss, get_cpu());
+		struct tss_struct *tss = &per_cpu(cpu_tss_rw, get_cpu());
 
 		t->io_bitmap_ptr = NULL;
 		clear_thread_flag(TIF_IO_BITMAP);
--- a/arch/x86/kernel/process_32.c
+++ b/arch/x86/kernel/process_32.c
@@ -234,7 +234,7 @@ EXPORT_SYMBOL_GPL(start_thread);
 	struct fpu *prev_fpu = &prev->fpu;
 	struct fpu *next_fpu = &next->fpu;
 	int cpu = smp_processor_id();
-	struct tss_struct *tss = &per_cpu(cpu_tss, cpu);
+	struct tss_struct *tss = &per_cpu(cpu_tss_rw, cpu);
 
 	/* never put a printk in __switch_to... printk() calls wake_up*() indirectly */
 
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -399,7 +399,7 @@ void compat_start_thread(struct pt_regs
 	struct fpu *prev_fpu = &prev->fpu;
 	struct fpu *next_fpu = &next->fpu;
 	int cpu = smp_processor_id();
-	struct tss_struct *tss = &per_cpu(cpu_tss, cpu);
+	struct tss_struct *tss = &per_cpu(cpu_tss_rw, cpu);
 
 	WARN_ON_ONCE(IS_ENABLED(CONFIG_DEBUG_ENTRY) &&
 		     this_cpu_read(irq_count) != -1);
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -365,7 +365,7 @@ dotraplinkage void do_double_fault(struc
 		regs->cs == __KERNEL_CS &&
 		regs->ip == (unsigned long)native_irq_return_iret)
 	{
-		struct pt_regs *gpregs = (struct pt_regs *)this_cpu_read(cpu_tss.x86_tss.sp0) - 1;
+		struct pt_regs *gpregs = (struct pt_regs *)this_cpu_read(cpu_tss_rw.x86_tss.sp0) - 1;
 
 		/*
 		 * regs->sp points to the failing IRET frame on the
@@ -654,7 +654,7 @@ struct bad_iret_stack *fixup_bad_iret(st
 	 * exception came from the IRET target.
 	 */
 	struct bad_iret_stack *new_stack =
-		(struct bad_iret_stack *)this_cpu_read(cpu_tss.x86_tss.sp0) - 1;
+		(struct bad_iret_stack *)this_cpu_read(cpu_tss_rw.x86_tss.sp0) - 1;
 
 	/* Copy the IRET target to the new stack. */
 	memmove(&new_stack->regs.ip, (void *)s->regs.sp, 5*8);
--- a/arch/x86/lib/delay.c
+++ b/arch/x86/lib/delay.c
@@ -107,10 +107,10 @@ static void delay_mwaitx(unsigned long _
 		delay = min_t(u64, MWAITX_MAX_LOOPS, loops);
 
 		/*
-		 * Use cpu_tss as a cacheline-aligned, seldomly
+		 * Use cpu_tss_rw as a cacheline-aligned, seldomly
 		 * accessed per-cpu variable as the monitor target.
 		 */
-		__monitorx(raw_cpu_ptr(&cpu_tss), 0, 0);
+		__monitorx(raw_cpu_ptr(&cpu_tss_rw), 0, 0);
 
 		/*
 		 * AMD, like Intel, supports the EAX hint and EAX=0xf
--- a/arch/x86/xen/enlighten_pv.c
+++ b/arch/x86/xen/enlighten_pv.c
@@ -826,7 +826,7 @@ static void xen_load_sp0(unsigned long s
 	mcs = xen_mc_entry(0);
 	MULTI_stack_switch(mcs.mc, __KERNEL_DS, sp0);
 	xen_mc_issue(PARAVIRT_LAZY_CPU);
-	this_cpu_write(cpu_tss.x86_tss.sp0, sp0);
+	this_cpu_write(cpu_tss_rw.x86_tss.sp0, sp0);
 }
 
 void xen_set_iopl_mask(unsigned mask)

^ permalink raw reply	[flat|nested] 118+ messages in thread

* [patch 24/60] x86/paravirt: Dont patch flush_tlb_single
  2017-12-04 14:07 [patch 00/60] x86/kpti: Kernel Page Table Isolation (was KAISER) Thomas Gleixner
                   ` (22 preceding siblings ...)
  2017-12-04 14:07 ` [patch 23/60] x86/entry/64: Make cpu_entry_area.tss read-only Thomas Gleixner
@ 2017-12-04 14:07 ` Thomas Gleixner
  2017-12-05 12:18   ` Juergen Gross
  2017-12-04 14:07 ` [patch 25/60] x86/paravirt: Provide a way to check for hypervisors Thomas Gleixner
                   ` (38 subsequent siblings)
  62 siblings, 1 reply; 118+ messages in thread
From: Thomas Gleixner @ 2017-12-04 14:07 UTC (permalink / raw)
  To: LKML
  Cc: x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, keescook, hughd,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Rik van Riel,
	Boris Ostrovsky, Juergen Gross, David Laight, Eduardo Valentin,
	aliguori, Will Deacon, daniel.gruss, Dave Hansen,
	michael.schwarz, linux-mm, Borislav Petkov, moritz.lipp,
	richard.fellner

[-- Attachment #0: x86-paravirt--Dont_patch_flush_tlb_single.patch --]
[-- Type: text/plain, Size: 1859 bytes --]

From: Thomas Gleixner <tglx@linutronix.de>

native_flush_tlb_single() will be changed with the upcoming
KERNEL_PAGE_TABLE_ISOLATION feature. This requires to have more code in
there than INVLPG.

Remove the paravirt patching for it.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: michael.schwarz@iaik.tugraz.at
Cc: Brian Gerst <brgerst@gmail.com>
Cc: hughd@google.com
Cc: daniel.gruss@iaik.tugraz.at
Cc: linux-mm@kvack.org
Cc: Borislav Petkov <bp@alien8.de>
Cc: moritz.lipp@iaik.tugraz.at
Cc: keescook@google.com
Cc: Boris Ostrovsky <boris.ostrovsky@oracle.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: richard.fellner@student.tugraz.at

---
 arch/x86/kernel/paravirt_patch_64.c |    2 --
 1 file changed, 2 deletions(-)

--- a/arch/x86/kernel/paravirt_patch_64.c
+++ b/arch/x86/kernel/paravirt_patch_64.c
@@ -10,7 +10,6 @@ DEF_NATIVE(pv_irq_ops, save_fl, "pushfq;
 DEF_NATIVE(pv_mmu_ops, read_cr2, "movq %cr2, %rax");
 DEF_NATIVE(pv_mmu_ops, read_cr3, "movq %cr3, %rax");
 DEF_NATIVE(pv_mmu_ops, write_cr3, "movq %rdi, %cr3");
-DEF_NATIVE(pv_mmu_ops, flush_tlb_single, "invlpg (%rdi)");
 DEF_NATIVE(pv_cpu_ops, wbinvd, "wbinvd");
 
 DEF_NATIVE(pv_cpu_ops, usergs_sysret64, "swapgs; sysretq");
@@ -60,7 +59,6 @@ unsigned native_patch(u8 type, u16 clobb
 		PATCH_SITE(pv_mmu_ops, read_cr2);
 		PATCH_SITE(pv_mmu_ops, read_cr3);
 		PATCH_SITE(pv_mmu_ops, write_cr3);
-		PATCH_SITE(pv_mmu_ops, flush_tlb_single);
 		PATCH_SITE(pv_cpu_ops, wbinvd);
 #if defined(CONFIG_PARAVIRT_SPINLOCKS)
 		case PARAVIRT_PATCH(pv_lock_ops.queued_spin_unlock):

^ permalink raw reply	[flat|nested] 118+ messages in thread

* [patch 25/60] x86/paravirt: Provide a way to check for hypervisors
  2017-12-04 14:07 [patch 00/60] x86/kpti: Kernel Page Table Isolation (was KAISER) Thomas Gleixner
                   ` (23 preceding siblings ...)
  2017-12-04 14:07 ` [patch 24/60] x86/paravirt: Dont patch flush_tlb_single Thomas Gleixner
@ 2017-12-04 14:07 ` Thomas Gleixner
  2017-12-05 12:19   ` Juergen Gross
  2017-12-04 14:07 ` [patch 26/60] x86/cpufeature: Make cpu bugs sticky Thomas Gleixner
                   ` (37 subsequent siblings)
  62 siblings, 1 reply; 118+ messages in thread
From: Thomas Gleixner @ 2017-12-04 14:07 UTC (permalink / raw)
  To: LKML
  Cc: x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, keescook, hughd,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Rik van Riel,
	Boris Ostrovsky, Juergen Gross, David Laight, Eduardo Valentin,
	aliguori, Will Deacon, daniel.gruss

[-- Attachment #0: x86-paravirt--Provide_a_way_to_check_for_hypervisors.patch --]
[-- Type: text/plain, Size: 1728 bytes --]

From: Thomas Gleixner <tglx@linutronix.de>

There is no generic way to test whether a kernel is running on a specific
hypervisor. But that's required to prevent the upcoming user address space
separation feature in certain guest modes.

Make the hypervisor type enum unconditionally available and provide a
helper function which allows to test for a specific type.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

---
 arch/x86/include/asm/hypervisor.h |   25 +++++++++++++++----------
 1 file changed, 15 insertions(+), 10 deletions(-)

--- a/arch/x86/include/asm/hypervisor.h
+++ b/arch/x86/include/asm/hypervisor.h
@@ -20,16 +20,7 @@
 #ifndef _ASM_X86_HYPERVISOR_H
 #define _ASM_X86_HYPERVISOR_H
 
-#ifdef CONFIG_HYPERVISOR_GUEST
-
-#include <asm/kvm_para.h>
-#include <asm/x86_init.h>
-#include <asm/xen/hypervisor.h>
-
-/*
- * x86 hypervisor information
- */
-
+/* x86 hypervisor types  */
 enum x86_hypervisor_type {
 	X86_HYPER_NATIVE = 0,
 	X86_HYPER_VMWARE,
@@ -39,6 +30,12 @@ enum x86_hypervisor_type {
 	X86_HYPER_KVM,
 };
 
+#ifdef CONFIG_HYPERVISOR_GUEST
+
+#include <asm/kvm_para.h>
+#include <asm/x86_init.h>
+#include <asm/xen/hypervisor.h>
+
 struct hypervisor_x86 {
 	/* Hypervisor name */
 	const char	*name;
@@ -58,7 +55,15 @@ struct hypervisor_x86 {
 
 extern enum x86_hypervisor_type x86_hyper_type;
 extern void init_hypervisor_platform(void);
+static inline bool hypervisor_is_type(enum x86_hypervisor_type type)
+{
+	return x86_hyper_type == type;
+}
 #else
 static inline void init_hypervisor_platform(void) { }
+static inline bool hypervisor_is_type(enum x86_hypervisor_type type)
+{
+	return type == X86_HYPER_NATIVE;
+}
 #endif /* CONFIG_HYPERVISOR_GUEST */
 #endif /* _ASM_X86_HYPERVISOR_H */

^ permalink raw reply	[flat|nested] 118+ messages in thread

* [patch 26/60] x86/cpufeature: Make cpu bugs sticky
  2017-12-04 14:07 [patch 00/60] x86/kpti: Kernel Page Table Isolation (was KAISER) Thomas Gleixner
                   ` (24 preceding siblings ...)
  2017-12-04 14:07 ` [patch 25/60] x86/paravirt: Provide a way to check for hypervisors Thomas Gleixner
@ 2017-12-04 14:07 ` Thomas Gleixner
  2017-12-04 22:39   ` Borislav Petkov
  2017-12-04 14:07 ` [patch 27/60] x86/cpufeatures: Add X86_BUG_CPU_INSECURE Thomas Gleixner
                   ` (36 subsequent siblings)
  62 siblings, 1 reply; 118+ messages in thread
From: Thomas Gleixner @ 2017-12-04 14:07 UTC (permalink / raw)
  To: LKML
  Cc: x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, keescook, hughd,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Rik van Riel,
	Boris Ostrovsky, Juergen Gross, David Laight, Eduardo Valentin,
	aliguori, Will Deacon, daniel.gruss

[-- Attachment #0: x86-cpufeature--Make_cpu_bugs_sticky.patch --]
[-- Type: text/plain, Size: 2048 bytes --]

From: Thomas Gleixner <tglx@linutronix.de>

There is currently no way to force CPU bug bits like CPU feature bits. That
makes it impossible to set a bug bit once at boot and have it stick for all
upcoming CPUs.

Extend the force set/clear arrays to handle bug bits as well.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

---
 arch/x86/include/asm/cpufeature.h |    2 ++
 arch/x86/include/asm/processor.h  |    4 ++--
 arch/x86/kernel/cpu/common.c      |    6 +++---
 3 files changed, 7 insertions(+), 5 deletions(-)

--- a/arch/x86/include/asm/cpufeature.h
+++ b/arch/x86/include/asm/cpufeature.h
@@ -135,6 +135,8 @@ extern void clear_cpu_cap(struct cpuinfo
 	set_bit(bit, (unsigned long *)cpu_caps_set);	\
 } while (0)
 
+#define setup_force_cpu_bug(bit) setup_force_cpu_cap(bit)
+
 #if defined(CC_HAVE_ASM_GOTO) && defined(CONFIG_X86_FAST_FEATURE_TESTS)
 /*
  * Static testing of CPU features.  Used the same as boot_cpu_has().
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -164,8 +164,8 @@ extern struct cpuinfo_x86	boot_cpu_data;
 extern struct cpuinfo_x86	new_cpu_data;
 
 extern struct x86_hw_tss	doublefault_tss;
-extern __u32			cpu_caps_cleared[NCAPINTS];
-extern __u32			cpu_caps_set[NCAPINTS];
+extern __u32			cpu_caps_cleared[NCAPINTS + NBUGINTS];
+extern __u32			cpu_caps_set[NCAPINTS + NBUGINTS];
 
 #ifdef CONFIG_SMP
 DECLARE_PER_CPU_READ_MOSTLY(struct cpuinfo_x86, cpu_info);
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -476,8 +476,8 @@ static const char *table_lookup_model(st
 	return NULL;		/* Not found */
 }
 
-__u32 cpu_caps_cleared[NCAPINTS];
-__u32 cpu_caps_set[NCAPINTS];
+__u32 cpu_caps_cleared[NCAPINTS + NBUGINTS];
+__u32 cpu_caps_set[NCAPINTS + NBUGINTS];
 
 void load_percpu_segment(int cpu)
 {
@@ -836,7 +836,7 @@ static void apply_forced_caps(struct cpu
 {
 	int i;
 
-	for (i = 0; i < NCAPINTS; i++) {
+	for (i = 0; i < NCAPINTS + NBUGINTS; i++) {
 		c->x86_capability[i] &= ~cpu_caps_cleared[i];
 		c->x86_capability[i] |= cpu_caps_set[i];
 	}

^ permalink raw reply	[flat|nested] 118+ messages in thread

* [patch 27/60] x86/cpufeatures: Add X86_BUG_CPU_INSECURE
  2017-12-04 14:07 [patch 00/60] x86/kpti: Kernel Page Table Isolation (was KAISER) Thomas Gleixner
                   ` (25 preceding siblings ...)
  2017-12-04 14:07 ` [patch 26/60] x86/cpufeature: Make cpu bugs sticky Thomas Gleixner
@ 2017-12-04 14:07 ` Thomas Gleixner
  2017-12-04 23:18   ` Borislav Petkov
  2017-12-04 14:07 ` [patch 28/60] x86/mm/kpti: Disable global pages if KERNEL_PAGE_TABLE_ISOLATION=y Thomas Gleixner
                   ` (35 subsequent siblings)
  62 siblings, 1 reply; 118+ messages in thread
From: Thomas Gleixner @ 2017-12-04 14:07 UTC (permalink / raw)
  To: LKML
  Cc: x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, keescook, hughd,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Rik van Riel,
	Boris Ostrovsky, Juergen Gross, David Laight, Eduardo Valentin,
	aliguori, Will Deacon, daniel.gruss

[-- Attachment #0: x86-cpufeatures--Add_X86_BUG_CPU_INSECURE.patch --]
[-- Type: text/plain, Size: 1663 bytes --]

From: Thomas Gleixner <tglx@linutronix.de>

Many x86 CPUs leak information to user space due to missing isolation of
user space and kernel space page tables. There are many well documented
ways to exploit that.

The upcoming software migitation of isolating the user and kernel space
page tables needs a misfeature flag so code can be made runtime
conditional.

Add two BUG bits: One which indicates that the CPU is affected and one that
the software migitation is enabled.

Assume for now that _ALL_ x86 CPUs are affected by this. Exceptions can be
made later.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

---
 arch/x86/include/asm/cpufeatures.h |    2 ++
 arch/x86/kernel/cpu/common.c       |    4 ++++
 2 files changed, 6 insertions(+)

--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -340,5 +340,7 @@
 #define X86_BUG_SWAPGS_FENCE		X86_BUG(11) /* SWAPGS without input dep on GS */
 #define X86_BUG_MONITOR			X86_BUG(12) /* IPI required to wake up remote CPU */
 #define X86_BUG_AMD_E400		X86_BUG(13) /* CPU is among the affected by Erratum 400 */
+#define X86_BUG_CPU_INSECURE		X86_BUG(14) /* CPU is insecure and needs kernel page table isolation */
+#define X86_BUG_CPU_SECURE_MODE_KPTI	X86_BUG(15) /* Kernel Page Table Isolation enabled*/
 
 #endif /* _ASM_X86_CPUFEATURES_H */
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -1016,6 +1016,10 @@ static void __init early_identify_cpu(st
 	}
 
 	setup_force_cpu_cap(X86_FEATURE_ALWAYS);
+
+	/* Assume for now that ALL x86 CPUs are insecure */
+	setup_force_cpu_bug(X86_BUG_CPU_INSECURE);
+
 	fpu__init_system(c);
 
 #ifdef CONFIG_X86_32

^ permalink raw reply	[flat|nested] 118+ messages in thread

* [patch 28/60] x86/mm/kpti: Disable global pages if KERNEL_PAGE_TABLE_ISOLATION=y
  2017-12-04 14:07 [patch 00/60] x86/kpti: Kernel Page Table Isolation (was KAISER) Thomas Gleixner
                   ` (26 preceding siblings ...)
  2017-12-04 14:07 ` [patch 27/60] x86/cpufeatures: Add X86_BUG_CPU_INSECURE Thomas Gleixner
@ 2017-12-04 14:07 ` Thomas Gleixner
  2017-12-05 14:34   ` Borislav Petkov
  2017-12-04 14:07 ` [patch 29/60] x86/mm/kpti: Prepare the x86/entry assembly code for entry/exit CR3 switching Thomas Gleixner
                   ` (34 subsequent siblings)
  62 siblings, 1 reply; 118+ messages in thread
From: Thomas Gleixner @ 2017-12-04 14:07 UTC (permalink / raw)
  To: LKML
  Cc: x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, keescook, hughd,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Rik van Riel,
	Boris Ostrovsky, Juergen Gross, David Laight, Eduardo Valentin,
	aliguori, Will Deacon, daniel.gruss, Dave Hansen, Ingo Molnar,
	moritz.lipp, linux-mm, richard.fellner, michael.schwarz

[-- Attachment #0: x86-mm-kpti--Disable_global_pages_if_KERNEL_PAGE_TABLE_ISOLATION-y.patch --]
[-- Type: text/plain, Size: 2747 bytes --]

From: Dave Hansen <dave.hansen@linux.intel.com>

Global pages stay in the TLB across context switches.  Since all contexts
share the same kernel mapping, these mappings are marked as global pages
so kernel entries in the TLB are not flushed out on a context switch.

But, even having these entries in the TLB opens up something that an
attacker can use, such as the double-page-fault attack:

   http://www.ieee-security.org/TC/SP2013/papers/4977a191.pdf

That means that even when KERNEL_PAGE_TABLE_ISOLATION switches page tables
on return to user space the global pages would stay in the TLB cache.

Disable global pages so that kernel TLB entries can be flushed before
returning to user space. This way, all accesses to kernel addresses from
userspace result in a TLB miss independent of the existence of a kernel
mapping.

Supress global pages via the __supported_pte_mask. The user space
mappings set PAGE_GLOBAL for the minimal kernel mappings which are
required for entry/exit. These mappings are set up manually so the
filtering does not take place.

[ The __supported_pte_mask simplification was written by Thomas Gleixner. ]

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: keescook@google.com
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: moritz.lipp@iaik.tugraz.at
Cc: linux-mm@kvack.org
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: hughd@google.com
Cc: daniel.gruss@iaik.tugraz.at
Cc: richard.fellner@student.tugraz.at
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: michael.schwarz@iaik.tugraz.at
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lkml.kernel.org/r/20171123003441.63DDFC6F@viggo.jf.intel.com

---
 arch/x86/mm/init.c |   12 +++++++++---
 1 file changed, 9 insertions(+), 3 deletions(-)

--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -161,6 +161,12 @@ struct map_range {
 
 static int page_size_mask;
 
+static void enable_global_pages(void)
+{
+	if (!static_cpu_has_bug(X86_BUG_CPU_SECURE_MODE_KPTI))
+		__supported_pte_mask |= _PAGE_GLOBAL;
+}
+
 static void __init probe_page_size_mask(void)
 {
 	/*
@@ -179,11 +185,11 @@ static void __init probe_page_size_mask(
 		cr4_set_bits_and_update_boot(X86_CR4_PSE);
 
 	/* Enable PGE if available */
+	__supported_pte_mask &= ~_PAGE_GLOBAL;
 	if (boot_cpu_has(X86_FEATURE_PGE)) {
 		cr4_set_bits_and_update_boot(X86_CR4_PGE);
-		__supported_pte_mask |= _PAGE_GLOBAL;
-	} else
-		__supported_pte_mask &= ~_PAGE_GLOBAL;
+		enable_global_pages();
+	}
 
 	/* Enable 1 GB linear kernel mappings if available: */
 	if (direct_gbpages && boot_cpu_has(X86_FEATURE_GBPAGES)) {

^ permalink raw reply	[flat|nested] 118+ messages in thread

* [patch 29/60] x86/mm/kpti: Prepare the x86/entry assembly code for entry/exit CR3 switching
  2017-12-04 14:07 [patch 00/60] x86/kpti: Kernel Page Table Isolation (was KAISER) Thomas Gleixner
                   ` (27 preceding siblings ...)
  2017-12-04 14:07 ` [patch 28/60] x86/mm/kpti: Disable global pages if KERNEL_PAGE_TABLE_ISOLATION=y Thomas Gleixner
@ 2017-12-04 14:07 ` Thomas Gleixner
  2017-12-04 14:07 ` [patch 30/60] x86/mm/kpti: Add infrastructure for page table isolation Thomas Gleixner
                   ` (33 subsequent siblings)
  62 siblings, 0 replies; 118+ messages in thread
From: Thomas Gleixner @ 2017-12-04 14:07 UTC (permalink / raw)
  To: LKML
  Cc: x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, keescook, hughd,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Rik van Riel,
	Boris Ostrovsky, Juergen Gross, David Laight, Eduardo Valentin,
	aliguori, Will Deacon, daniel.gruss, Dave Hansen, Ingo Molnar,
	Borislav Petkov, moritz.lipp, linux-mm, richard.fellner,
	michael.schwarz

[-- Attachment #0: x86-mm-kpti--Prepare_the_x86-entry_assembly_code_for_entry-exit_CR3_switching.patch --]
[-- Type: text/plain, Size: 10572 bytes --]

From: Dave Hansen <dave.hansen@linux.intel.com>

KERNEL_PAGE_TABLE_ISOLATION needs to switch to a different CR3 value when
it enters the kernel and switch back when it exits.  This essentially needs
to be done before leaving assembly code.

This is extra challenging because the switching context is tricky: the
registers that can be clobbered can vary.  It is also hard to store things
on the stack because there is an established ABI (ptregs) or the stack is
entirely unsafe to use.

Establish a set of macros that allow changing to the user and kernel CR3
values.

Interactions with SWAPGS:
Previous versions of the KERNEL_PAGE_TABLE_ISOLATION code relied on having
per-CPU scratch space to save/restore a register that can be used for the
CR3 MOV.  The %GS register is used to index into our per-CPU space, so
SWAPGS *had* to be done before the CR3 switch.  That scratch space is gone
now, but the semantic that SWAPGS must be done before the CR3 MOV is
retained.  This is good to keep because it is not that hard to do and it
allows to do things like add per-CPU debugging information.

What this does in the NMI code is worth pointing out.  NMIs can interrupt
*any* context and they can also be nested with NMIs interrupting other
NMIs.  The comments below ".Lnmi_from_kernel" explain the format of the
stack during this situation.  Changing the format of this stack is hard.
Instead of storing the old CR3 value on the stack, this depends on the
*regular* register save/restore mechanism and then uses %r14 to keep CR3
during the NMI.  It is callee-saved and will not be clobbered by the C NMI
handlers that get called.

[ peterz: ESPFIX optimization ]

Based-on-code-from: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: keescook@google.com
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: moritz.lipp@iaik.tugraz.at
Cc: linux-mm@kvack.org
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: hughd@google.com
Cc: daniel.gruss@iaik.tugraz.at
Cc: richard.fellner@student.tugraz.at
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: michael.schwarz@iaik.tugraz.at
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lkml.kernel.org/r/20171123003442.2D047A7D@viggo.jf.intel.com

---
 arch/x86/entry/calling.h         |   66 +++++++++++++++++++++++++++++++++++++++
 arch/x86/entry/entry_64.S        |   45 +++++++++++++++++++++++---
 arch/x86/entry/entry_64_compat.S |   24 +++++++++++++-
 3 files changed, 128 insertions(+), 7 deletions(-)

--- a/arch/x86/entry/calling.h
+++ b/arch/x86/entry/calling.h
@@ -1,6 +1,8 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 #include <linux/jump_label.h>
 #include <asm/unwind_hints.h>
+#include <asm/cpufeatures.h>
+#include <asm/page_types.h>
 
 /*
 
@@ -187,6 +189,70 @@ For 32-bit we have the following convent
 #endif
 .endm
 
+#ifdef CONFIG_KERNEL_PAGE_TABLE_ISOLATION
+
+/* KERNEL_PAGE_TABLE_ISOLATION PGDs are 8k.  Flip bit 12 to switch between the two halves: */
+#define KPTI_SWITCH_MASK (1<<PAGE_SHIFT)
+
+.macro ADJUST_KERNEL_CR3 reg:req
+	/* Clear "KERNEL_PAGE_TABLE_ISOLATION bit", point CR3 at kernel pagetables: */
+	andq	$(~KPTI_SWITCH_MASK), \reg
+.endm
+
+.macro ADJUST_USER_CR3 reg:req
+	/* Move CR3 up a page to the user page tables: */
+	orq	$(KPTI_SWITCH_MASK), \reg
+.endm
+
+.macro SWITCH_TO_KERNEL_CR3 scratch_reg:req
+	mov	%cr3, \scratch_reg
+	ADJUST_KERNEL_CR3 \scratch_reg
+	mov	\scratch_reg, %cr3
+.endm
+
+.macro SWITCH_TO_USER_CR3 scratch_reg:req
+	mov	%cr3, \scratch_reg
+	ADJUST_USER_CR3 \scratch_reg
+	mov	\scratch_reg, %cr3
+.endm
+
+.macro SAVE_AND_SWITCH_TO_KERNEL_CR3 scratch_reg:req save_reg:req
+	movq	%cr3, \scratch_reg
+	movq	\scratch_reg, \save_reg
+	/*
+	 * Is the switch bit zero?  This means the address is
+	 * up in real KERNEL_PAGE_TABLE_ISOLATION patches in a moment.
+	 */
+	testq	$(KPTI_SWITCH_MASK), \scratch_reg
+	jz	.Ldone_\@
+
+	ADJUST_KERNEL_CR3 \scratch_reg
+	movq	\scratch_reg, %cr3
+
+.Ldone_\@:
+.endm
+
+.macro RESTORE_CR3 save_reg:req
+	/*
+	 * The CR3 write could be avoided when not changing its value,
+	 * but would require a CR3 read *and* a scratch register.
+	 */
+	movq	\save_reg, %cr3
+.endm
+
+#else /* CONFIG_KERNEL_PAGE_TABLE_ISOLATION=n: */
+
+.macro SWITCH_TO_KERNEL_CR3 scratch_reg:req
+.endm
+.macro SWITCH_TO_USER_CR3 scratch_reg:req
+.endm
+.macro SAVE_AND_SWITCH_TO_KERNEL_CR3 scratch_reg:req save_reg:req
+.endm
+.macro RESTORE_CR3 save_reg:req
+.endm
+
+#endif
+
 #endif /* CONFIG_X86_64 */
 
 /*
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -168,6 +168,9 @@ ENTRY(entry_SYSCALL_64_trampoline)
 	/* Stash the user RSP. */
 	movq	%rsp, RSP_SCRATCH
 
+	/* Note: using %rsp as a scratch reg. */
+	SWITCH_TO_KERNEL_CR3 scratch_reg=%rsp
+
 	/* Load the top of the task stack into RSP */
 	movq	CPU_ENTRY_AREA_tss + TSS_sp1 + CPU_ENTRY_AREA, %rsp
 
@@ -208,6 +211,10 @@ ENTRY(entry_SYSCALL_64)
 
 	swapgs
 	movq	%rsp, PER_CPU_VAR(rsp_scratch)
+	/*
+	 * This path is not taken when KERNEL_PAGE_TABLE_ISOLATION is disabled so it
+	 * is not required to switch CR3.
+	 */
 	movq	PER_CPU_VAR(cpu_current_top_of_stack), %rsp
 
 	/* Construct struct pt_regs on stack */
@@ -403,6 +410,7 @@ GLOBAL(entry_SYSCALL_64_after_hwframe)
 	 * We are on the trampoline stack.  All regs except RDI are live.
 	 * We can do future final exit work right here.
 	 */
+	SWITCH_TO_USER_CR3 scratch_reg=%rdi
 
 	popq	%rdi
 	popq	%rsp
@@ -740,6 +748,8 @@ GLOBAL(swapgs_restore_regs_and_return_to
 	 * We can do future final exit work right here.
 	 */
 
+	SWITCH_TO_USER_CR3 scratch_reg=%rdi
+
 	/* Restore RDI. */
 	popq	%rdi
 	SWAPGS
@@ -822,7 +832,9 @@ ENTRY(native_iret)
 	 */
 
 	pushq	%rdi				/* Stash user RDI */
-	SWAPGS
+	SWAPGS					/* to kernel GS */
+	SWITCH_TO_KERNEL_CR3 scratch_reg=%rdi	/* to kernel CR3 */
+
 	movq	PER_CPU_VAR(espfix_waddr), %rdi
 	movq	%rax, (0*8)(%rdi)		/* user RAX */
 	movq	(1*8)(%rsp), %rax		/* user RIP */
@@ -838,7 +850,6 @@ ENTRY(native_iret)
 	/* Now RAX == RSP. */
 
 	andl	$0xffff0000, %eax		/* RAX = (RSP & 0xffff0000) */
-	popq	%rdi				/* Restore user RDI */
 
 	/*
 	 * espfix_stack[31:16] == 0.  The page tables are set up such that
@@ -849,7 +860,11 @@ ENTRY(native_iret)
 	 * still points to an RO alias of the ESPFIX stack.
 	 */
 	orq	PER_CPU_VAR(espfix_stack), %rax
-	SWAPGS
+
+	SWITCH_TO_USER_CR3 scratch_reg=%rdi	/* to user CR3 */
+	SWAPGS					/* to user GS */
+	popq	%rdi				/* Restore user RDI */
+
 	movq	%rax, %rsp
 	UNWIND_HINT_IRET_REGS offset=8
 
@@ -949,6 +964,8 @@ ENTRY(switch_to_thread_stack)
 	UNWIND_HINT_FUNC
 
 	pushq	%rdi
+	/* Need to switch before accessing the thread stack. */
+	SWITCH_TO_KERNEL_CR3 scratch_reg=%rdi
 	movq	%rsp, %rdi
 	movq	PER_CPU_VAR(cpu_current_top_of_stack), %rsp
 	UNWIND_HINT sp_offset=16 sp_reg=ORC_REG_DI
@@ -1250,7 +1267,11 @@ ENTRY(paranoid_entry)
 	js	1f				/* negative -> in kernel */
 	SWAPGS
 	xorl	%ebx, %ebx
-1:	ret
+
+1:
+	SAVE_AND_SWITCH_TO_KERNEL_CR3 scratch_reg=%rax save_reg=%r14
+
+	ret
 END(paranoid_entry)
 
 /*
@@ -1272,6 +1293,7 @@ ENTRY(paranoid_exit)
 	testl	%ebx, %ebx			/* swapgs needed? */
 	jnz	.Lparanoid_exit_no_swapgs
 	TRACE_IRQS_IRETQ
+	RESTORE_CR3	save_reg=%r14
 	SWAPGS_UNSAFE_STACK
 	jmp	.Lparanoid_exit_restore
 .Lparanoid_exit_no_swapgs:
@@ -1299,6 +1321,8 @@ ENTRY(error_entry)
 	 * from user mode due to an IRET fault.
 	 */
 	SWAPGS
+	/* We have user CR3.  Change to kernel CR3. */
+	SWITCH_TO_KERNEL_CR3 scratch_reg=%rax
 
 .Lerror_entry_from_usermode_after_swapgs:
 	/* Put us onto the real thread stack. */
@@ -1345,6 +1369,7 @@ ENTRY(error_entry)
 	 * .Lgs_change's error handler with kernel gsbase.
 	 */
 	SWAPGS
+	SWITCH_TO_KERNEL_CR3 scratch_reg=%rax
 	jmp .Lerror_entry_done
 
 .Lbstep_iret:
@@ -1354,10 +1379,11 @@ ENTRY(error_entry)
 
 .Lerror_bad_iret:
 	/*
-	 * We came from an IRET to user mode, so we have user gsbase.
-	 * Switch to kernel gsbase:
+	 * We came from an IRET to user mode, so we have user
+	 * gsbase and CR3.  Switch to kernel gsbase and CR3:
 	 */
 	SWAPGS
+	SWITCH_TO_KERNEL_CR3 scratch_reg=%rax
 
 	/*
 	 * Pretend that the exception came from user mode: set up pt_regs
@@ -1389,6 +1415,10 @@ END(error_exit)
 /*
  * Runs on exception stack.  Xen PV does not go through this path at all,
  * so we can use real assembly here.
+ *
+ * Registers:
+ *	%r14: Used to save/restore the CR3 of the interrupted context
+ *	      when KERNEL_PAGE_TABLE_ISOLATION is in use.  Do not clobber.
  */
 ENTRY(nmi)
 	UNWIND_HINT_IRET_REGS
@@ -1452,6 +1482,7 @@ ENTRY(nmi)
 
 	swapgs
 	cld
+	SWITCH_TO_KERNEL_CR3 scratch_reg=%rdx
 	movq	%rsp, %rdx
 	movq	PER_CPU_VAR(cpu_current_top_of_stack), %rsp
 	UNWIND_HINT_IRET_REGS base=%rdx offset=8
@@ -1704,6 +1735,8 @@ ENTRY(nmi)
 	movq	$-1, %rsi
 	call	do_nmi
 
+	RESTORE_CR3 save_reg=%r14
+
 	testl	%ebx, %ebx			/* swapgs needed? */
 	jnz	nmi_restore
 nmi_swapgs:
--- a/arch/x86/entry/entry_64_compat.S
+++ b/arch/x86/entry/entry_64_compat.S
@@ -49,6 +49,10 @@
 ENTRY(entry_SYSENTER_compat)
 	/* Interrupts are off on entry. */
 	SWAPGS
+
+	/* We are about to clobber %rsp anyway, clobbering here is OK */
+	SWITCH_TO_KERNEL_CR3 scratch_reg=%rsp
+
 	movq	PER_CPU_VAR(cpu_current_top_of_stack), %rsp
 
 	/*
@@ -216,6 +220,12 @@ GLOBAL(entry_SYSCALL_compat_after_hwfram
 	pushq   $0			/* pt_regs->r15 = 0 */
 
 	/*
+	 * We just saved %rdi so it is safe to clobber.  It is not
+	 * preserved during the C calls inside TRACE_IRQS_OFF anyway.
+	 */
+	SWITCH_TO_KERNEL_CR3 scratch_reg=%rdi
+
+	/*
 	 * User mode is traced as though IRQs are on, and SYSENTER
 	 * turned them off.
 	 */
@@ -256,10 +266,22 @@ GLOBAL(entry_SYSCALL_compat_after_hwfram
 	 * when the system call started, which is already known to user
 	 * code.  We zero R8-R10 to avoid info leaks.
          */
+	movq	RSP-ORIG_RAX(%rsp), %rsp
+
+	/*
+	 * The original userspace %rsp (RSP-ORIG_RAX(%rsp)) is stored
+	 * on the process stack which is not mapped to userspace and
+	 * not readable after we SWITCH_TO_USER_CR3.  Delay the CR3
+	 * switch until after after the last reference to the process
+	 * stack.
+	 *
+	 * %r8 is zeroed before the sysret, thus safe to clobber.
+	 */
+	SWITCH_TO_USER_CR3 scratch_reg=%r8
+
 	xorq	%r8, %r8
 	xorq	%r9, %r9
 	xorq	%r10, %r10
-	movq	RSP-ORIG_RAX(%rsp), %rsp
 	swapgs
 	sysretl
 END(entry_SYSCALL_compat)

^ permalink raw reply	[flat|nested] 118+ messages in thread

* [patch 30/60] x86/mm/kpti: Add infrastructure for page table isolation
  2017-12-04 14:07 [patch 00/60] x86/kpti: Kernel Page Table Isolation (was KAISER) Thomas Gleixner
                   ` (28 preceding siblings ...)
  2017-12-04 14:07 ` [patch 29/60] x86/mm/kpti: Prepare the x86/entry assembly code for entry/exit CR3 switching Thomas Gleixner
@ 2017-12-04 14:07 ` Thomas Gleixner
  2017-12-05 15:20   ` Borislav Petkov
  2017-12-04 14:07 ` [patch 31/60] x86/mm/kpti: Add mapping helper functions Thomas Gleixner
                   ` (32 subsequent siblings)
  62 siblings, 1 reply; 118+ messages in thread
From: Thomas Gleixner @ 2017-12-04 14:07 UTC (permalink / raw)
  To: LKML
  Cc: x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, keescook, hughd,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Rik van Riel,
	Boris Ostrovsky, Juergen Gross, David Laight, Eduardo Valentin,
	aliguori, Will Deacon, daniel.gruss

[-- Attachment #0: x86-kpti--Add_infrastructure_for_page_table_isolation.patch --]
[-- Type: text/plain, Size: 7218 bytes --]

From: Thomas Gleixner <tglx@linutronix.de>

Add the initial files for kernel page table isolation, with a minimal init
function and the boot time detection for this misfeature.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 Documentation/admin-guide/kernel-parameters.txt |    2 
 arch/x86/boot/compressed/pagetable.c            |    3 
 arch/x86/entry/calling.h                        |    7 ++
 arch/x86/include/asm/kpti.h                     |   14 ++++
 arch/x86/mm/Makefile                            |    7 +-
 arch/x86/mm/init.c                              |    2 
 arch/x86/mm/kpti.c                              |   76 ++++++++++++++++++++++++
 include/linux/kpti.h                            |   11 +++
 init/main.c                                     |    2 
 9 files changed, 121 insertions(+), 3 deletions(-)

--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -2711,6 +2711,8 @@
 			steal time is computed, but won't influence scheduler
 			behaviour
 
+	nokpti		[X86-64] Disable kernel page table isolation
+
 	nolapic		[X86-32,APIC] Do not enable or use the local APIC.
 
 	nolapic_timer	[X86-32,APIC] Do not use the local APIC timer.
--- a/arch/x86/boot/compressed/pagetable.c
+++ b/arch/x86/boot/compressed/pagetable.c
@@ -23,6 +23,9 @@
  */
 #undef CONFIG_AMD_MEM_ENCRYPT
 
+/* No KERNEL_PAGE_TABLE_ISOLATION support needed either: */
+#undef CONFIG_KERNEL_PAGE_TABLE_ISOLATION
+
 #include "misc.h"
 
 /* These actually do the work of building the kernel identity maps. */
--- a/arch/x86/entry/calling.h
+++ b/arch/x86/entry/calling.h
@@ -205,18 +205,23 @@ For 32-bit we have the following convent
 .endm
 
 .macro SWITCH_TO_KERNEL_CR3 scratch_reg:req
+	ALTERNATIVE "jmp .Lend_\@", "", X86_BUG_CPU_SECURE_MODE_KPTI
 	mov	%cr3, \scratch_reg
 	ADJUST_KERNEL_CR3 \scratch_reg
 	mov	\scratch_reg, %cr3
+.Lend_\@:
 .endm
 
 .macro SWITCH_TO_USER_CR3 scratch_reg:req
+	ALTERNATIVE "jmp .Lend_\@", "", X86_BUG_CPU_SECURE_MODE_KPTI
 	mov	%cr3, \scratch_reg
 	ADJUST_USER_CR3 \scratch_reg
 	mov	\scratch_reg, %cr3
+.Lend_\@:
 .endm
 
 .macro SAVE_AND_SWITCH_TO_KERNEL_CR3 scratch_reg:req save_reg:req
+	ALTERNATIVE "jmp .Ldone_\@", "", X86_BUG_CPU_SECURE_MODE_KPTI
 	movq	%cr3, \scratch_reg
 	movq	\scratch_reg, \save_reg
 	/*
@@ -233,11 +238,13 @@ For 32-bit we have the following convent
 .endm
 
 .macro RESTORE_CR3 save_reg:req
+	ALTERNATIVE "jmp .Lend_\@", "", X86_BUG_CPU_SECURE_MODE_KPTI
 	/*
 	 * The CR3 write could be avoided when not changing its value,
 	 * but would require a CR3 read *and* a scratch register.
 	 */
 	movq	\save_reg, %cr3
+.Lend_\@:
 .endm
 
 #else /* CONFIG_KERNEL_PAGE_TABLE_ISOLATION=n: */
--- /dev/null
+++ b/arch/x86/include/asm/kpti.h
@@ -0,0 +1,14 @@
+// SPDX-License-Identifier: GPL-2.0
+#ifndef _ASM_X86_KPTI_H
+#define _ASM_X86_KPTI_H
+#ifndef __ASSEMBLY__
+
+#ifdef CONFIG_KERNEL_PAGE_TABLE_ISOLATION
+extern void kpti_init(void);
+extern void kpti_check_boottime_disable(void);
+#else
+static inline void kpti_check_boottime_disable(void) { }
+#endif
+
+#endif /* __ASSEMBLY__ */
+#endif /* _ASM_X86_KPTI_H */
--- a/arch/x86/mm/Makefile
+++ b/arch/x86/mm/Makefile
@@ -43,9 +43,10 @@ obj-$(CONFIG_AMD_NUMA)		+= amdtopology.o
 obj-$(CONFIG_ACPI_NUMA)		+= srat.o
 obj-$(CONFIG_NUMA_EMU)		+= numa_emulation.o
 
-obj-$(CONFIG_X86_INTEL_MPX)	+= mpx.o
-obj-$(CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS) += pkeys.o
-obj-$(CONFIG_RANDOMIZE_MEMORY) += kaslr.o
+obj-$(CONFIG_X86_INTEL_MPX)			+= mpx.o
+obj-$(CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS) 	+= pkeys.o
+obj-$(CONFIG_RANDOMIZE_MEMORY) 			+= kaslr.o
+obj-$(CONFIG_KERNEL_PAGE_TABLE_ISOLATION)	+= kpti.o
 
 obj-$(CONFIG_AMD_MEM_ENCRYPT)	+= mem_encrypt.o
 obj-$(CONFIG_AMD_MEM_ENCRYPT)	+= mem_encrypt_boot.o
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -20,6 +20,7 @@
 #include <asm/kaslr.h>
 #include <asm/hypervisor.h>
 #include <asm/cpufeature.h>
+#include <asm/kpti.h>
 
 /*
  * We need to define the tracepoints somewhere, and tlb.c
@@ -630,6 +631,7 @@ void __init init_mem_mapping(void)
 {
 	unsigned long end;
 
+	kpti_check_boottime_disable();
 	probe_page_size_mask();
 	setup_pcid();
 
--- /dev/null
+++ b/arch/x86/mm/kpti.c
@@ -0,0 +1,76 @@
+/*
+ * Copyright(c) 2017 Intel Corporation. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * This code is based in part on work published here:
+ *
+ *	https://github.com/IAIK/KAISER
+ *
+ * The original work was written by and and signed off by for the Linux
+ * kernel by:
+ *
+ *   Signed-off-by: Richard Fellner <richard.fellner@student.tugraz.at>
+ *   Signed-off-by: Moritz Lipp <moritz.lipp@iaik.tugraz.at>
+ *   Signed-off-by: Daniel Gruss <daniel.gruss@iaik.tugraz.at>
+ *   Signed-off-by: Michael Schwarz <michael.schwarz@iaik.tugraz.at>
+ *
+ * Major changes to the original code by: Dave Hansen <dave.hansen@intel.com>
+ * Mostly rewritten by Thomas Gleixner <tglx@linutronix.de> and
+ *		       Andy Lutomirsky <luto@amacapital.net>
+ */
+#include <linux/kernel.h>
+#include <linux/errno.h>
+#include <linux/string.h>
+#include <linux/types.h>
+#include <linux/bug.h>
+#include <linux/init.h>
+#include <linux/spinlock.h>
+#include <linux/mm.h>
+#include <linux/uaccess.h>
+
+#include <asm/cpufeature.h>
+#include <asm/hypervisor.h>
+#include <asm/cmdline.h>
+#include <asm/kpti.h>
+#include <asm/pgtable.h>
+#include <asm/pgalloc.h>
+#include <asm/tlbflush.h>
+#include <asm/desc.h>
+
+#undef pr_fmt
+#define pr_fmt(fmt)     "Kernel/User page tables isolation: " fmt
+
+void __init kpti_check_boottime_disable(void)
+{
+	bool enable = true;
+
+	if (cmdline_find_option_bool(boot_command_line, "nokpti")) {
+		pr_info("disabled on command line.\n");
+		enable = false;
+	}
+	if (hypervisor_is_type(X86_HYPER_XEN_PV)) {
+		pr_info("disabled on XEN_PV.\n");
+		enable = false;
+	}
+	if (enable)
+		setup_force_cpu_bug(X86_BUG_CPU_SECURE_MODE_KPTI);
+}
+
+/*
+ * Initialize kernel page table isolation
+ */
+void __init kpti_init(void)
+{
+	if (!static_cpu_has_bug(X86_BUG_CPU_SECURE_MODE_KPTI))
+		return;
+
+	pr_info("enabled\n");
+}
--- /dev/null
+++ b/include/linux/kpti.h
@@ -0,0 +1,11 @@
+// SPDX-License-Identifier: GPL-2.0
+#ifndef _INCLUDE_KPTI_H
+#define _INCLUDE_KPTI_H
+
+#ifdef CONFIG_KERNEL_PAGE_TABLE_ISOLATION
+#include <asm/kpti.h>
+#else
+static inline void kpti_init(void) { }
+#endif
+
+#endif
--- a/init/main.c
+++ b/init/main.c
@@ -76,6 +76,7 @@
 #include <linux/slab.h>
 #include <linux/perf_event.h>
 #include <linux/ptrace.h>
+#include <linux/kpti.h>
 #include <linux/blkdev.h>
 #include <linux/elevator.h>
 #include <linux/sched_clock.h>
@@ -505,6 +506,7 @@ static void __init mm_init(void)
 	pgtable_init();
 	vmalloc_init();
 	ioremap_huge_init();
+	kpti_init();
 }
 
 asmlinkage __visible void __init start_kernel(void)

^ permalink raw reply	[flat|nested] 118+ messages in thread

* [patch 31/60] x86/mm/kpti: Add mapping helper functions
  2017-12-04 14:07 [patch 00/60] x86/kpti: Kernel Page Table Isolation (was KAISER) Thomas Gleixner
                   ` (29 preceding siblings ...)
  2017-12-04 14:07 ` [patch 30/60] x86/mm/kpti: Add infrastructure for page table isolation Thomas Gleixner
@ 2017-12-04 14:07 ` Thomas Gleixner
  2017-12-04 22:27   ` Andy Lutomirski
  2017-12-05 16:01   ` Borislav Petkov
  2017-12-04 14:07 ` [patch 32/60] x86/mm/kpti: Allow NX poison to be set in p4d/pgd Thomas Gleixner
                   ` (31 subsequent siblings)
  62 siblings, 2 replies; 118+ messages in thread
From: Thomas Gleixner @ 2017-12-04 14:07 UTC (permalink / raw)
  To: LKML
  Cc: x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, keescook, hughd,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Rik van Riel,
	Boris Ostrovsky, Juergen Gross, David Laight, Eduardo Valentin,
	aliguori, Will Deacon, daniel.gruss, Dave Hansen

[-- Attachment #0: x86-kpti--Add_mapping_helper_functions.patch --]
[-- Type: text/plain, Size: 4763 bytes --]

From: Dave Hansen <dave.hansen@linux.intel.com>

Add the pagetable helper functions do manage the separate user space page
tables.

[ tglx: Split out from the big combo kaiser patch ]

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

---
 arch/x86/include/asm/pgtable_64.h |  139 ++++++++++++++++++++++++++++++++++++++
 1 file changed, 139 insertions(+)

--- a/arch/x86/include/asm/pgtable_64.h
+++ b/arch/x86/include/asm/pgtable_64.h
@@ -131,9 +131,144 @@ static inline pud_t native_pudp_get_and_
 #endif
 }
 
+#ifdef CONFIG_KERNEL_PAGE_TABLE_ISOLATION
+/*
+ * All top-level KERNEL_PAGE_TABLE_ISOLATION page tables are order-1 pages
+ * (8k-aligned and 8k in size).  The kernel one is at the beginning 4k and
+ * the user one is in the last 4k.  To switch between them, you
+ * just need to flip the 12th bit in their addresses.
+ */
+#define KPTI_PGTABLE_SWITCH_BIT	PAGE_SHIFT
+
+/*
+ * This generates better code than the inline assembly in
+ * __set_bit().
+ */
+static inline void *ptr_set_bit(void *ptr, int bit)
+{
+	unsigned long __ptr = (unsigned long)ptr;
+
+	__ptr |= BIT(bit);
+	return (void *)__ptr;
+}
+static inline void *ptr_clear_bit(void *ptr, int bit)
+{
+	unsigned long __ptr = (unsigned long)ptr;
+
+	__ptr &= ~BIT(bit);
+	return (void *)__ptr;
+}
+
+static inline pgd_t *kernel_to_user_pgdp(pgd_t *pgdp)
+{
+	return ptr_set_bit(pgdp, KPTI_PGTABLE_SWITCH_BIT);
+}
+
+static inline pgd_t *user_to_kernel_pgdp(pgd_t *pgdp)
+{
+	return ptr_clear_bit(pgdp, KPTI_PGTABLE_SWITCH_BIT);
+}
+
+static inline p4d_t *kernel_to_user_p4dp(p4d_t *p4dp)
+{
+	return ptr_set_bit(p4dp, KPTI_PGTABLE_SWITCH_BIT);
+}
+
+static inline p4d_t *user_to_kernel_p4dp(p4d_t *p4dp)
+{
+	return ptr_clear_bit(p4dp, KPTI_PGTABLE_SWITCH_BIT);
+}
+#endif /* CONFIG_KERNEL_PAGE_TABLE_ISOLATION */
+
+/*
+ * Page table pages are page-aligned.  The lower half of the top
+ * level is used for userspace and the top half for the kernel.
+ *
+ * Returns true for parts of the PGD that map userspace and
+ * false for the parts that map the kernel.
+ */
+static inline bool pgdp_maps_userspace(void *__ptr)
+{
+	unsigned long ptr = (unsigned long)__ptr;
+
+	return (ptr & ~PAGE_MASK) < (PAGE_SIZE / 2);
+}
+
+/*
+ * Does this PGD allow access from userspace?
+ */
+static inline bool pgd_userspace_access(pgd_t pgd)
+{
+	return pgd.pgd & _PAGE_USER;
+}
+
+/*
+ * Take a PGD location (pgdp) and a pgd value that needs to be set there.
+ * Populates the user and returns the resulting PGD that must be set in
+ * the kernel copy of the page tables.
+ */
+static inline pgd_t kpti_set_user_pgd(pgd_t *pgdp, pgd_t pgd)
+{
+#ifdef CONFIG_KERNEL_PAGE_TABLE_ISOLATION
+	if (!static_cpu_has_bug(X86_BUG_CPU_SECURE_MODE_KPTI))
+		return pgd;
+
+	if (pgd_userspace_access(pgd)) {
+		if (pgdp_maps_userspace(pgdp)) {
+			/*
+			 * The user page tables get the full PGD,
+			 * accessible from userspace:
+			 */
+			kernel_to_user_pgdp(pgdp)->pgd = pgd.pgd;
+			/*
+			 * For the copy of the pgd that the kernel uses,
+			 * make it unusable to userspace.  This ensures on
+			 * in case that a return to userspace with the
+			 * kernel CR3 value, userspace will crash instead
+			 * of running.
+			 *
+			 * Note: NX might be not available or disabled.
+			 */
+			if (__supported_pte_mask & _PAGE_NX)
+				pgd.pgd |= _PAGE_NX;
+		}
+	} else if (pgd_userspace_access(*pgdp)) {
+		/*
+		 * We are clearing a _PAGE_USER PGD for which we presumably
+		 * populated the user PGD.  We must now clear the user PGD
+		 * entry.
+		 */
+		if (pgdp_maps_userspace(pgdp)) {
+			kernel_to_user_pgdp(pgdp)->pgd = pgd.pgd;
+		} else {
+			/*
+			 * Attempted to clear a _PAGE_USER PGD which is in
+			 * the kernel porttion of the address space.  PGDs
+			 * are pre-populated and we never clear them.
+			 */
+			WARN_ON_ONCE(1);
+		}
+	} else {
+		/*
+		 * _PAGE_USER was not set in either the PGD being set or
+		 * cleared.  All kernel PGDs should be pre-populated so
+		 * this should never happen after boot.
+		 */
+		WARN_ON_ONCE(system_state == SYSTEM_RUNNING);
+	}
+#endif
+	/* return the copy of the PGD we want the kernel to use: */
+	return pgd;
+}
+
+
 static inline void native_set_p4d(p4d_t *p4dp, p4d_t p4d)
 {
+#if defined(CONFIG_KERNEL_PAGE_TABLE_ISOLATION) && !defined(CONFIG_X86_5LEVEL)
+	p4dp->pgd = kpti_set_user_pgd(&p4dp->pgd, p4d.pgd);
+#else
 	*p4dp = p4d;
+#endif
 }
 
 static inline void native_p4d_clear(p4d_t *p4d)
@@ -147,7 +282,11 @@ static inline void native_p4d_clear(p4d_
 
 static inline void native_set_pgd(pgd_t *pgdp, pgd_t pgd)
 {
+#ifdef CONFIG_KERNEL_PAGE_TABLE_ISOLATION
+	*pgdp = kpti_set_user_pgd(pgdp, pgd);
+#else
 	*pgdp = pgd;
+#endif
 }
 
 static inline void native_pgd_clear(pgd_t *pgd)

^ permalink raw reply	[flat|nested] 118+ messages in thread

* [patch 32/60] x86/mm/kpti: Allow NX poison to be set in p4d/pgd
  2017-12-04 14:07 [patch 00/60] x86/kpti: Kernel Page Table Isolation (was KAISER) Thomas Gleixner
                   ` (30 preceding siblings ...)
  2017-12-04 14:07 ` [patch 31/60] x86/mm/kpti: Add mapping helper functions Thomas Gleixner
@ 2017-12-04 14:07 ` Thomas Gleixner
  2017-12-05 17:09   ` Borislav Petkov
  2017-12-04 14:07 ` [patch 33/60] x86/mm/kpti: Allocate a separate user PGD Thomas Gleixner
                   ` (30 subsequent siblings)
  62 siblings, 1 reply; 118+ messages in thread
From: Thomas Gleixner @ 2017-12-04 14:07 UTC (permalink / raw)
  To: LKML
  Cc: x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, keescook, hughd,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Rik van Riel,
	Boris Ostrovsky, Juergen Gross, David Laight, Eduardo Valentin,
	aliguori, Will Deacon, daniel.gruss, Dave Hansen

[-- Attachment #0: x86-mm-kpti--Allow_NX_poison_to_be_set_in_p4d-pgd.patch --]
[-- Type: text/plain, Size: 1422 bytes --]

From: Dave Hansen <dave.hansen@linux.intel.com>

With KERNEL_PAGE_TABLE_ISOLATION the user portion of the kernel page
tables is poisoned with the NX bit so if the entry code exits with the
kernel page tables selected in CR3, userspace crashes.

But doing so trips the p4d/pgd_bad() checks.  Make sure it does not do
that.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

---
 arch/x86/include/asm/pgtable.h |   14 ++++++++++++--
 1 file changed, 12 insertions(+), 2 deletions(-)

--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -846,7 +846,12 @@ static inline pud_t *pud_offset(p4d_t *p
 
 static inline int p4d_bad(p4d_t p4d)
 {
-	return (p4d_flags(p4d) & ~(_KERNPG_TABLE | _PAGE_USER)) != 0;
+	unsigned long ignore_flags = _KERNPG_TABLE | _PAGE_USER;
+
+	if (IS_ENABLED(CONFIG_KERNEL_PAGE_TABLE_ISOLATION))
+		ignore_flags |= _PAGE_NX;
+
+	return (p4d_flags(p4d) & ~ignore_flags) != 0;
 }
 #endif  /* CONFIG_PGTABLE_LEVELS > 3 */
 
@@ -880,7 +885,12 @@ static inline p4d_t *p4d_offset(pgd_t *p
 
 static inline int pgd_bad(pgd_t pgd)
 {
-	return (pgd_flags(pgd) & ~_PAGE_USER) != _KERNPG_TABLE;
+	unsigned long ignore_flags = _PAGE_USER;
+
+	if (IS_ENABLED(CONFIG_KERNEL_PAGE_TABLE_ISOLATION))
+		ignore_flags |= _PAGE_NX;
+
+	return (pgd_flags(pgd) & ~ignore_flags) != _KERNPG_TABLE;
 }
 
 static inline int pgd_none(pgd_t pgd)

^ permalink raw reply	[flat|nested] 118+ messages in thread

* [patch 33/60] x86/mm/kpti: Allocate a separate user PGD
  2017-12-04 14:07 [patch 00/60] x86/kpti: Kernel Page Table Isolation (was KAISER) Thomas Gleixner
                   ` (31 preceding siblings ...)
  2017-12-04 14:07 ` [patch 32/60] x86/mm/kpti: Allow NX poison to be set in p4d/pgd Thomas Gleixner
@ 2017-12-04 14:07 ` Thomas Gleixner
  2017-12-05 18:33   ` Borislav Petkov
  2017-12-04 14:07 ` [patch 34/60] x86/mm/kpti: Populate " Thomas Gleixner
                   ` (29 subsequent siblings)
  62 siblings, 1 reply; 118+ messages in thread
From: Thomas Gleixner @ 2017-12-04 14:07 UTC (permalink / raw)
  To: LKML
  Cc: x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, keescook, hughd,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Rik van Riel,
	Boris Ostrovsky, Juergen Gross, David Laight, Eduardo Valentin,
	aliguori, Will Deacon, daniel.gruss, Dave Hansen

[-- Attachment #0: x86-kpti--Allocate_a_separate_user_PGD.patch --]
[-- Type: text/plain, Size: 3768 bytes --]

From: Dave Hansen <dave.hansen@linux.intel.com>

Kernel page table isolation requires to have two PGDs. One for the kernel,
which contains the full kernel mapping plus the user space mapping and one
for user space which contains the user space mappings and the minimal set
of kernel mappings which are required by the architecture to be able to
transition from and to user space.

Add the necessary preliminaries.

[ tglx: Split out from the big kaiser dump ]

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

---
 arch/x86/kernel/head_64.S |   30 +++++++++++++++++++++++++++---
 arch/x86/mm/pgtable.c     |   16 ++++++++++++++--
 2 files changed, 41 insertions(+), 5 deletions(-)

--- a/arch/x86/kernel/head_64.S
+++ b/arch/x86/kernel/head_64.S
@@ -341,6 +341,27 @@ GLOBAL(early_recursion_flag)
 	.balign	PAGE_SIZE; \
 GLOBAL(name)
 
+#ifdef CONFIG_KERNEL_PAGE_TABLE_ISOLATION
+/*
+ * Each PGD needs to be 8k long and 8k aligned.  We do not
+ * ever go out to userspace with these, so we do not
+ * strictly *need* the second page, but this allows us to
+ * have a single set_pgd() implementation that does not
+ * need to worry about whether it has 4k or 8k to work
+ * with.
+ *
+ * This ensures PGDs are 8k long:
+ */
+#define KPTI_USER_PGD_FILL	512
+/* This ensures they are 8k-aligned: */
+#define NEXT_PGD_PAGE(name) \
+	.balign 2 * PAGE_SIZE; \
+GLOBAL(name)
+#else
+#define NEXT_PGD_PAGE(name) NEXT_PAGE(name)
+#define KPTI_USER_PGD_FILL	0
+#endif
+
 /* Automate the creation of 1 to 1 mapping pmd entries */
 #define PMDS(START, PERM, COUNT)			\
 	i = 0 ;						\
@@ -350,13 +371,14 @@ GLOBAL(name)
 	.endr
 
 	__INITDATA
-NEXT_PAGE(early_top_pgt)
+NEXT_PGD_PAGE(early_top_pgt)
 	.fill	511,8,0
 #ifdef CONFIG_X86_5LEVEL
 	.quad	level4_kernel_pgt - __START_KERNEL_map + _PAGE_TABLE_NOENC
 #else
 	.quad	level3_kernel_pgt - __START_KERNEL_map + _PAGE_TABLE_NOENC
 #endif
+	.fill	KPTI_USER_PGD_FILL,8,0
 
 NEXT_PAGE(early_dynamic_pgts)
 	.fill	512*EARLY_DYNAMIC_PAGE_TABLES,8,0
@@ -364,13 +386,14 @@ NEXT_PAGE(early_dynamic_pgts)
 	.data
 
 #if defined(CONFIG_XEN_PV) || defined(CONFIG_XEN_PVH)
-NEXT_PAGE(init_top_pgt)
+NEXT_PGD_PAGE(init_top_pgt)
 	.quad   level3_ident_pgt - __START_KERNEL_map + _KERNPG_TABLE_NOENC
 	.org    init_top_pgt + PGD_PAGE_OFFSET*8, 0
 	.quad   level3_ident_pgt - __START_KERNEL_map + _KERNPG_TABLE_NOENC
 	.org    init_top_pgt + PGD_START_KERNEL*8, 0
 	/* (2^48-(2*1024*1024*1024))/(2^39) = 511 */
 	.quad   level3_kernel_pgt - __START_KERNEL_map + _PAGE_TABLE_NOENC
+	.fill	KPTI_USER_PGD_FILL,8,0
 
 NEXT_PAGE(level3_ident_pgt)
 	.quad	level2_ident_pgt - __START_KERNEL_map + _KERNPG_TABLE_NOENC
@@ -381,8 +404,9 @@ NEXT_PAGE(level2_ident_pgt)
 	 */
 	PMDS(0, __PAGE_KERNEL_IDENT_LARGE_EXEC, PTRS_PER_PMD)
 #else
-NEXT_PAGE(init_top_pgt)
+NEXT_PGD_PAGE(init_top_pgt)
 	.fill	512,8,0
+	.fill	KPTI_USER_PGD_FILL,8,0
 #endif
 
 #ifdef CONFIG_X86_5LEVEL
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -355,14 +355,26 @@ static inline void _pgd_free(pgd_t *pgd)
 		kmem_cache_free(pgd_cache, pgd);
 }
 #else
+
+#ifdef CONFIG_KERNEL_PAGE_TABLE_ISOLATION
+/*
+ * Instead of one pgd, we aquire two pgds.  Being order-1, it is
+ * both 8k in size and 8k-aligned.  That lets us just flip bit 12
+ * in a pointer to swap between the two 4k halves.
+ */
+#define PGD_ALLOCATION_ORDER 1
+#else
+#define PGD_ALLOCATION_ORDER 0
+#endif
+
 static inline pgd_t *_pgd_alloc(void)
 {
-	return (pgd_t *)__get_free_page(PGALLOC_GFP);
+	return (pgd_t *)__get_free_pages(PGALLOC_GFP, PGD_ALLOCATION_ORDER);
 }
 
 static inline void _pgd_free(pgd_t *pgd)
 {
-	free_page((unsigned long)pgd);
+	free_pages((unsigned long)pgd, PGD_ALLOCATION_ORDER);
 }
 #endif /* CONFIG_X86_PAE */
 

^ permalink raw reply	[flat|nested] 118+ messages in thread

* [patch 34/60] x86/mm/kpti: Populate user PGD
  2017-12-04 14:07 [patch 00/60] x86/kpti: Kernel Page Table Isolation (was KAISER) Thomas Gleixner
                   ` (32 preceding siblings ...)
  2017-12-04 14:07 ` [patch 33/60] x86/mm/kpti: Allocate a separate user PGD Thomas Gleixner
@ 2017-12-04 14:07 ` " Thomas Gleixner
  2017-12-05 19:17   ` Borislav Petkov
  2017-12-04 14:07 ` [patch 35/60] x86/espfix: Ensure that ESPFIX is visible in " Thomas Gleixner
                   ` (28 subsequent siblings)
  62 siblings, 1 reply; 118+ messages in thread
From: Thomas Gleixner @ 2017-12-04 14:07 UTC (permalink / raw)
  To: LKML
  Cc: x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, keescook, hughd,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Rik van Riel,
	Boris Ostrovsky, Juergen Gross, David Laight, Eduardo Valentin,
	aliguori, Will Deacon, daniel.gruss, Dave Hansen

[-- Attachment #0: x86-kpti--Populate_user_PGD.patch --]
[-- Type: text/plain, Size: 2624 bytes --]

From: Dave Hansen <dave.hansen@linux.intel.com>

Populate the PGD entries in the init user PGD which cover the kernel half
of the address space. This makes sure that the installment of the user
visible kernel mappings finds a populated PGD.

In clone_pgd_range() copy the init user PGDs which cover the kernel half of
the address space, so a process has all the required kernel mappings
visible.

[ tglx: Split out from the big kaiser dump ]

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

---
 arch/x86/include/asm/pgtable.h |    5 +++++
 arch/x86/mm/kpti.c             |   41 +++++++++++++++++++++++++++++++++++++++++
 2 files changed, 46 insertions(+)

--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -1116,6 +1116,11 @@ static inline void pmdp_set_wrprotect(st
 static inline void clone_pgd_range(pgd_t *dst, pgd_t *src, int count)
 {
        memcpy(dst, src, count * sizeof(pgd_t));
+#ifdef CONFIG_KERNEL_PAGE_TABLE_ISOLATION
+	/* Clone the user space pgd as well */
+	memcpy(kernel_to_user_pgdp(dst), kernel_to_user_pgdp(src),
+	       count * sizeof(pgd_t));
+#endif
 }
 
 #define PTE_SHIFT ilog2(PTRS_PER_PTE)
--- a/arch/x86/mm/kpti.c
+++ b/arch/x86/mm/kpti.c
@@ -65,6 +65,45 @@ void __init kpti_check_boottime_disable(
 }
 
 /*
+ * Ensure that the top level of the user page tables are entirely
+ * populated.  This ensures that all processes that get forked have the
+ * same entries.  This way, we do not have to ever go set up new entries in
+ * older processes.
+ *
+ * Note: we never free these, so there are no updates to them after this.
+ */
+static void __init kpti_init_all_pgds(void)
+{
+	pgd_t *pgd;
+	int i;
+
+	pgd = kernel_to_user_pgdp(pgd_offset_k(0UL));
+	for (i = PTRS_PER_PGD / 2; i < PTRS_PER_PGD; i++) {
+		/*
+		 * Each PGD entry moves up PGDIR_SIZE bytes through the
+		 * address space, so get the first virtual address mapped
+		 * by PGD #i:
+		 */
+		unsigned long addr = i * PGDIR_SIZE;
+#if CONFIG_PGTABLE_LEVELS > 4
+		p4d_t *p4d = p4d_alloc_one(&init_mm, addr);
+		if (!p4d) {
+			WARN_ON(1);
+			break;
+		}
+		set_pgd(pgd + i, __pgd(_KERNPG_TABLE | __pa(p4d)));
+#else /* CONFIG_PGTABLE_LEVELS <= 4 */
+		pud_t *pud = pud_alloc_one(&init_mm, addr);
+		if (!pud) {
+			WARN_ON(1);
+			break;
+		}
+		set_pgd(pgd + i, __pgd(_KERNPG_TABLE | __pa(pud)));
+#endif /* CONFIG_PGTABLE_LEVELS */
+	}
+}
+
+/*
  * Initialize kernel page table isolation
  */
 void __init kpti_init(void)
@@ -73,4 +112,6 @@ void __init kpti_init(void)
 		return;
 
 	pr_info("enabled\n");
+
+	kpti_init_all_pgds();
 }

^ permalink raw reply	[flat|nested] 118+ messages in thread

* [patch 35/60] x86/espfix: Ensure that ESPFIX is visible in user PGD
  2017-12-04 14:07 [patch 00/60] x86/kpti: Kernel Page Table Isolation (was KAISER) Thomas Gleixner
                   ` (33 preceding siblings ...)
  2017-12-04 14:07 ` [patch 34/60] x86/mm/kpti: Populate " Thomas Gleixner
@ 2017-12-04 14:07 ` " Thomas Gleixner
  2017-12-04 22:28   ` Andy Lutomirski
  2017-12-04 14:07 ` [patch 36/60] x86/mm/kpti: Add functions to clone kernel PMDs Thomas Gleixner
                   ` (27 subsequent siblings)
  62 siblings, 1 reply; 118+ messages in thread
From: Thomas Gleixner @ 2017-12-04 14:07 UTC (permalink / raw)
  To: LKML
  Cc: x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, keescook, hughd,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Rik van Riel,
	Boris Ostrovsky, Juergen Gross, David Laight, Eduardo Valentin,
	aliguori, Will Deacon, daniel.gruss, Dave Hansen

[-- Attachment #0: x86-espfix--Ensure_that_ESPFIX_is_visible_in_user_PGD.patch --]
[-- Type: text/plain, Size: 1274 bytes --]

From: Dave Hansen <dave.hansen@linux.intel.com>

Clone the ESPFIX alias mapping area so the entry/exit code has access to it
even with the user space page tables.

[ tglx: Remove the per cpu user mapped oddity ]

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

---
 arch/x86/kernel/espfix_64.c |   16 ++++++++++++++++
 1 file changed, 16 insertions(+)

--- a/arch/x86/kernel/espfix_64.c
+++ b/arch/x86/kernel/espfix_64.c
@@ -129,6 +129,22 @@ void __init init_espfix_bsp(void)
 	p4d = p4d_alloc(&init_mm, pgd, ESPFIX_BASE_ADDR);
 	p4d_populate(&init_mm, p4d, espfix_pud_page);
 
+	/*
+	 * Just copy the top-level PGD that is mapping the espfix area to
+	 * ensure it is mapped into the user page tables.
+	 *
+	 * For 5-level paging, the espfix pgd was populated when
+	 * kpti_init() pre-populated all the pgd entries.  The above
+	 * p4d_alloc() would never do anything and the p4d_populate() would
+	 * be done to a p4d already mapped in the userspace pgd.
+	 */
+#ifdef CONFIG_KERNEL_PAGE_TABLE_ISOLATION
+	if (CONFIG_PGTABLE_LEVELS <= 4) {
+		set_pgd(kernel_to_user_pgdp(pgd),
+			__pgd(_KERNPG_TABLE | (p4d_pfn(*p4d) << PAGE_SHIFT)));
+	}
+#endif
+
 	/* Randomize the locations */
 	init_espfix_random();
 

^ permalink raw reply	[flat|nested] 118+ messages in thread

* [patch 36/60] x86/mm/kpti: Add functions to clone kernel PMDs
  2017-12-04 14:07 [patch 00/60] x86/kpti: Kernel Page Table Isolation (was KAISER) Thomas Gleixner
                   ` (34 preceding siblings ...)
  2017-12-04 14:07 ` [patch 35/60] x86/espfix: Ensure that ESPFIX is visible in " Thomas Gleixner
@ 2017-12-04 14:07 ` Thomas Gleixner
  2017-12-06 15:39   ` Borislav Petkov
  2017-12-04 14:07 ` [patch 37/60] x86mm//kpti: Force entry through trampoline when KPTI active Thomas Gleixner
                   ` (26 subsequent siblings)
  62 siblings, 1 reply; 118+ messages in thread
From: Thomas Gleixner @ 2017-12-04 14:07 UTC (permalink / raw)
  To: LKML
  Cc: x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, keescook, hughd,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Rik van Riel,
	Boris Ostrovsky, Juergen Gross, David Laight, Eduardo Valentin,
	aliguori, Will Deacon, daniel.gruss

[-- Attachment #0: x86-kpti--Add_functions_to_clone_kernel_PMDs.patch --]
[-- Type: text/plain, Size: 3553 bytes --]

From: Andy Lutomirski <luto@kernel.org>

Provide infrastructure to:

 - find a kernel PMD for a mapping which must be visible to user space for
   the entry/exit code to work.

 - walk an address range and share the kernel PMD with it.

This reuses a small part of the original KAISER patches to populate the
user space page table.

[ tglx: Made it universally usable so it can be used for any kind of shared
  	mapping. Add a mechanism to clear specific bits in the user space
	visible PMD entry. ]

Originally-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

---
 arch/x86/mm/kpti.c |  102 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 102 insertions(+)

--- a/arch/x86/mm/kpti.c
+++ b/arch/x86/mm/kpti.c
@@ -65,6 +65,108 @@ void __init kpti_check_boottime_disable(
 }
 
 /*
+ * Walk the user copy of the page tables (optionally) trying to allocate
+ * page table pages on the way down.
+ *
+ * Returns a pointer to a PMD on success, or NULL on failure.
+ */
+static pmd_t *kpti_user_pagetable_walk_pmd(unsigned long address)
+{
+	pgd_t *pgd = kernel_to_user_pgdp(pgd_offset_k(address));
+	gfp_t gfp = (GFP_KERNEL | __GFP_NOTRACK | __GFP_ZERO);
+	pud_t *pud;
+	p4d_t *p4d;
+
+	if (address < PAGE_OFFSET) {
+		WARN_ONCE(1, "attempt to walk user address\n");
+		return NULL;
+	}
+
+	if (pgd_none(*pgd)) {
+		WARN_ONCE(1, "All user pgds should have been populated\n");
+		return NULL;
+	}
+	BUILD_BUG_ON(pgd_large(*pgd) != 0);
+
+	p4d = p4d_offset(pgd, address);
+	BUILD_BUG_ON(p4d_large(*p4d) != 0);
+	if (p4d_none(*p4d)) {
+		unsigned long new_pud_page = __get_free_page(gfp);
+		if (!new_pud_page)
+			return NULL;
+
+		if (p4d_none(*p4d)) {
+			set_p4d(p4d, __p4d(_KERNPG_TABLE | __pa(new_pud_page)));
+			new_pud_page = 0;
+		}
+		if (new_pud_page)
+			free_page(new_pud_page);
+	}
+
+	pud = pud_offset(p4d, address);
+	/* The user page tables do not use large mappings: */
+	if (pud_large(*pud)) {
+		WARN_ON(1);
+		return NULL;
+	}
+	if (pud_none(*pud)) {
+		unsigned long new_pmd_page = __get_free_page(gfp);
+		if (!new_pmd_page)
+			return NULL;
+
+		if (pud_none(*pud)) {
+			set_pud(pud, __pud(_KERNPG_TABLE | __pa(new_pmd_page)));
+			new_pmd_page = 0;
+		}
+		if (new_pmd_page)
+			free_page(new_pmd_page);
+	}
+
+	return pmd_offset(pud, address);
+}
+
+static void __init
+kpti_clone_pmds(unsigned long start, unsigned long end, pmdval_t clear)
+{
+	unsigned long addr;
+
+	/*
+	 * Clone the populated PMDs which cover start to end. These PMD areas
+	 * can have holes.
+	 */
+	for (addr = start; addr < end; addr += PMD_SIZE) {
+		pmd_t *pmd, *target_pmd;
+		pgd_t *pgd;
+		p4d_t *p4d;
+		pud_t *pud;
+
+		pgd = pgd_offset_k(addr);
+		if (WARN_ON(pgd_none(*pgd)))
+			return;
+		p4d = p4d_offset(pgd, addr);
+		if (WARN_ON(p4d_none(*p4d)))
+			return;
+		pud = pud_offset(p4d, addr);
+		if (pud_none(*pud))
+			continue;
+		pmd = pmd_offset(pud, addr);
+		if (pmd_none(*pmd))
+			continue;
+
+		target_pmd = kpti_user_pagetable_walk_pmd(addr);
+		if (WARN_ON(!target_pmd))
+			return;
+
+		/*
+		 * Copy the PMD.  That is, the kernelmode and usermode
+		 * tables will share the last-level page tables of this
+		 * address range
+		 */
+		*target_pmd = pmd_clear_flags(*pmd, clear);
+	}
+}
+
+/*
  * Ensure that the top level of the user page tables are entirely
  * populated.  This ensures that all processes that get forked have the
  * same entries.  This way, we do not have to ever go set up new entries in

^ permalink raw reply	[flat|nested] 118+ messages in thread

* [patch 37/60] x86mm//kpti: Force entry through trampoline when KPTI active
  2017-12-04 14:07 [patch 00/60] x86/kpti: Kernel Page Table Isolation (was KAISER) Thomas Gleixner
                   ` (35 preceding siblings ...)
  2017-12-04 14:07 ` [patch 36/60] x86/mm/kpti: Add functions to clone kernel PMDs Thomas Gleixner
@ 2017-12-04 14:07 ` Thomas Gleixner
  2017-12-06 16:01   ` Borislav Petkov
  2017-12-04 14:07 ` [patch 38/60] x86/fixmap: Move cpu entry area into a separate PMD Thomas Gleixner
                   ` (25 subsequent siblings)
  62 siblings, 1 reply; 118+ messages in thread
From: Thomas Gleixner @ 2017-12-04 14:07 UTC (permalink / raw)
  To: LKML
  Cc: x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, keescook, hughd,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Rik van Riel,
	Boris Ostrovsky, Juergen Gross, David Laight, Eduardo Valentin,
	aliguori, Will Deacon, daniel.gruss

[-- Attachment #0: x86-kpti--Force_entry_through_trampoline.patch --]
[-- Type: text/plain, Size: 840 bytes --]

From: Thomas Gleixner <tglx@linutronix.de>

Force the entry through the trampoline only when KPTI is active. Otherwise
go through the normal entry code.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

---
 arch/x86/kernel/cpu/common.c |    5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -1458,7 +1458,10 @@ void syscall_init(void)
 		(entry_SYSCALL_64_trampoline - _entry_trampoline);
 
 	wrmsr(MSR_STAR, 0, (__USER32_CS << 16) | __KERNEL_CS);
-	wrmsrl(MSR_LSTAR, SYSCALL64_entry_trampoline);
+	if (static_cpu_has_bug(X86_BUG_CPU_SECURE_MODE_KPTI))
+		wrmsrl(MSR_LSTAR, SYSCALL64_entry_trampoline);
+	else
+		wrmsrl(MSR_LSTAR, (unsigned long)entry_SYSCALL_64);
 
 #ifdef CONFIG_IA32_EMULATION
 	wrmsrl(MSR_CSTAR, (unsigned long)entry_SYSCALL_compat);

^ permalink raw reply	[flat|nested] 118+ messages in thread

* [patch 38/60] x86/fixmap: Move cpu entry area into a separate PMD
  2017-12-04 14:07 [patch 00/60] x86/kpti: Kernel Page Table Isolation (was KAISER) Thomas Gleixner
                   ` (36 preceding siblings ...)
  2017-12-04 14:07 ` [patch 37/60] x86mm//kpti: Force entry through trampoline when KPTI active Thomas Gleixner
@ 2017-12-04 14:07 ` Thomas Gleixner
  2017-12-06 18:57   ` Borislav Petkov
  2017-12-04 14:07 ` [patch 39/60] x86/mm/kpti: Share cpu_entry_area PMDs Thomas Gleixner
                   ` (24 subsequent siblings)
  62 siblings, 1 reply; 118+ messages in thread
From: Thomas Gleixner @ 2017-12-04 14:07 UTC (permalink / raw)
  To: LKML
  Cc: x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, keescook, hughd,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Rik van Riel,
	Boris Ostrovsky, Juergen Gross, David Laight, Eduardo Valentin,
	aliguori, Will Deacon, daniel.gruss

[-- Attachment #0: x86-fixmap--Move_cpu_entry_area_into_a_separate_PMD.patch --]
[-- Type: text/plain, Size: 1544 bytes --]

From: Andy Lutomirski <luto@kernel.org>

This allows the cpu entry area PMDs to be shared between the kernel and
user space page tables.

[ tglx: Fixed bottom of by one and added guards so other fixmaps can be
  	added later ]

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

---
 arch/x86/include/asm/fixmap.h |   14 ++++++++++----
 1 file changed, 10 insertions(+), 4 deletions(-)

--- a/arch/x86/include/asm/fixmap.h
+++ b/arch/x86/include/asm/fixmap.h
@@ -134,16 +134,22 @@ enum fixed_addresses {
 #ifdef CONFIG_PARAVIRT
 	FIX_PARAVIRT_BOOTMAP,
 #endif
-	FIX_TEXT_POKE1,	/* reserve 2 pages for text_poke() */
-	FIX_TEXT_POKE0, /* first page is last, because allocation is backward */
 #ifdef	CONFIG_X86_INTEL_MID
 	FIX_LNW_VRTC,
 #endif
-	/* Fixmap entries to remap the GDTs, one per processor. */
+	FIX_TEXT_POKE1,	/* reserve 2 pages for text_poke() */
+	FIX_TEXT_POKE0, /* first page is last, because allocation is backward */
+
+	/*
+	 * Fixmap entries to remap the IDT, and the per cpu entry areas.
+	 * Aligend to a PMD boundary.
+	 */
+	FIX_USR_SHARED_TOP = round_up(FIX_TEXT_POKE0 + 1, PTRS_PER_PMD),
 	FIX_CPU_ENTRY_AREA_TOP,
 	FIX_CPU_ENTRY_AREA_BOTTOM = FIX_CPU_ENTRY_AREA_TOP + (CPU_ENTRY_AREA_PAGES * NR_CPUS) - 1,
+	FIX_USR_SHARED_BOTTOM  = round_up(FIX_CPU_ENTRY_AREA_BOTTOM + 2, PTRS_PER_PMD) - 1,
 
-	__end_of_permanent_fixed_addresses,
+	__end_of_permanent_fixed_addresses = FIX_USR_SHARED_BOTTOM,
 
 	/*
 	 * 512 temporary boot-time mappings, used by early_ioremap(),

^ permalink raw reply	[flat|nested] 118+ messages in thread

* [patch 39/60] x86/mm/kpti: Share cpu_entry_area PMDs
  2017-12-04 14:07 [patch 00/60] x86/kpti: Kernel Page Table Isolation (was KAISER) Thomas Gleixner
                   ` (37 preceding siblings ...)
  2017-12-04 14:07 ` [patch 38/60] x86/fixmap: Move cpu entry area into a separate PMD Thomas Gleixner
@ 2017-12-04 14:07 ` Thomas Gleixner
  2017-12-06 21:18   ` Borislav Petkov
  2017-12-04 14:07 ` [patch 40/60] x86: PMD align entry text Thomas Gleixner
                   ` (23 subsequent siblings)
  62 siblings, 1 reply; 118+ messages in thread
From: Thomas Gleixner @ 2017-12-04 14:07 UTC (permalink / raw)
  To: LKML
  Cc: x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, keescook, hughd,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Rik van Riel,
	Boris Ostrovsky, Juergen Gross, David Laight, Eduardo Valentin,
	aliguori, Will Deacon, daniel.gruss

[-- Attachment #0: x86-kpti--Clone_cpu_entry_area_PMDs.patch --]
[-- Type: text/plain, Size: 1309 bytes --]

From: Andy Lutomirski <luto@kernel.org>

Share the FIX_USR_SHARED PMDs so the user space and kernel space page
tables have the same PMD page.

[ tglx: Made it use the FIX_USR_SHARED range so later additions
  	are covered automatically ]

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/mm/kpti.c |   18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

--- a/arch/x86/mm/kpti.c
+++ b/arch/x86/mm/kpti.c
@@ -167,6 +167,23 @@ kpti_clone_pmds(unsigned long start, uns
 }
 
 /*
+ * Clone the populated PMDs of the user shared fixmaps into the user space
+ * visible page table.
+ */
+static void __init kpti_clone_user_shared(void)
+{
+	unsigned long bot, top;
+
+	bot = __fix_to_virt(FIX_USR_SHARED_BOTTOM);
+	top = __fix_to_virt(FIX_USR_SHARED_TOP) + PAGE_SIZE;
+
+	/* Top of the user shared block must be PMD-aligned. */
+	WARN_ON(top & ~PMD_MASK);
+
+	kpti_clone_pmds(bot, top, 0);
+}
+
+/*
  * Ensure that the top level of the user page tables are entirely
  * populated.  This ensures that all processes that get forked have the
  * same entries.  This way, we do not have to ever go set up new entries in
@@ -216,4 +233,5 @@ void __init kpti_init(void)
 	pr_info("enabled\n");
 
 	kpti_init_all_pgds();
+	kpti_clone_user_shared();
 }

^ permalink raw reply	[flat|nested] 118+ messages in thread

* [patch 40/60] x86: PMD align entry text
  2017-12-04 14:07 [patch 00/60] x86/kpti: Kernel Page Table Isolation (was KAISER) Thomas Gleixner
                   ` (38 preceding siblings ...)
  2017-12-04 14:07 ` [patch 39/60] x86/mm/kpti: Share cpu_entry_area PMDs Thomas Gleixner
@ 2017-12-04 14:07 ` Thomas Gleixner
  2017-12-07  8:07   ` Borislav Petkov
  2017-12-04 14:07 ` [patch 41/60] x86/mm/kpti: Share entry text PMD Thomas Gleixner
                   ` (22 subsequent siblings)
  62 siblings, 1 reply; 118+ messages in thread
From: Thomas Gleixner @ 2017-12-04 14:07 UTC (permalink / raw)
  To: LKML
  Cc: x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, keescook, hughd,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Rik van Riel,
	Boris Ostrovsky, Juergen Gross, David Laight, Eduardo Valentin,
	aliguori, Will Deacon, daniel.gruss

[-- Attachment #0: x86--PMD_align_entry_text.patch --]
[-- Type: text/plain, Size: 969 bytes --]

From: Thomas Gleixner <tglx@linutronix.de>

The (irq)entry text must be visible in the user space page tables. To allow
simple PMD based sharing, make the entry text PMD aligned.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

---
 arch/x86/kernel/vmlinux.lds.S |    8 ++++++++
 1 file changed, 8 insertions(+)

--- a/arch/x86/kernel/vmlinux.lds.S
+++ b/arch/x86/kernel/vmlinux.lds.S
@@ -61,11 +61,17 @@ jiffies_64 = jiffies;
 		. = ALIGN(HPAGE_SIZE);				\
 		__end_rodata_hpage_align = .;
 
+#define ALIGN_ENRTY_TEXT_BEGIN	. = ALIGN(PMD_SIZE);
+#define ALIGN_ENRTY_TEXT_END	. = ALIGN(PMD_SIZE);
+
 #else
 
 #define X64_ALIGN_RODATA_BEGIN
 #define X64_ALIGN_RODATA_END
 
+#define ALIGN_ENRTY_TEXT_BEGIN
+#define ALIGN_ENRTY_TEXT_END
+
 #endif
 
 PHDRS {
@@ -102,8 +108,10 @@ SECTIONS
 		CPUIDLE_TEXT
 		LOCK_TEXT
 		KPROBES_TEXT
+		ALIGN_ENRTY_TEXT_BEGIN
 		ENTRY_TEXT
 		IRQENTRY_TEXT
+		ALIGN_ENRTY_TEXT_END
 		SOFTIRQENTRY_TEXT
 		*(.fixup)
 		*(.gnu.warning)

^ permalink raw reply	[flat|nested] 118+ messages in thread

* [patch 41/60] x86/mm/kpti: Share entry text PMD
  2017-12-04 14:07 [patch 00/60] x86/kpti: Kernel Page Table Isolation (was KAISER) Thomas Gleixner
                   ` (39 preceding siblings ...)
  2017-12-04 14:07 ` [patch 40/60] x86: PMD align entry text Thomas Gleixner
@ 2017-12-04 14:07 ` Thomas Gleixner
  2017-12-07  8:24   ` Borislav Petkov
  2017-12-04 14:07 ` [patch 42/60] x86/fixmap: Move IDT fixmap into the cpu_entry_area range Thomas Gleixner
                   ` (21 subsequent siblings)
  62 siblings, 1 reply; 118+ messages in thread
From: Thomas Gleixner @ 2017-12-04 14:07 UTC (permalink / raw)
  To: LKML
  Cc: x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, keescook, hughd,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Rik van Riel,
	Boris Ostrovsky, Juergen Gross, David Laight, Eduardo Valentin,
	aliguori, Will Deacon, daniel.gruss

[-- Attachment #0: x86-kpti--Clone_entry_text_PMD.patch --]
[-- Type: text/plain, Size: 1182 bytes --]

From: Thomas Gleixner <tglx@linutronix.de>

Share the entry text PMD of the kernel mapping with the user space
mapping. If large pages are enabled this is a single PMD entry and at the
point where it is copied into the user page table the RW bit has not been
cleared yet. Clear it right away so the user space visible map becomes RX.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/mm/kpti.c |   10 ++++++++++
 1 file changed, 10 insertions(+)

--- a/arch/x86/mm/kpti.c
+++ b/arch/x86/mm/kpti.c
@@ -184,6 +184,15 @@ static void __init kpti_clone_user_share
 }
 
 /*
+ * Clone the populated PMDs of the entry and irqentry text and force it RO.
+ */
+static void __init kpti_clone_entry_text(void)
+{
+	kpti_clone_pmds((unsigned long) __entry_text_start,
+			(unsigned long) __irqentry_text_end, _PAGE_RW);
+}
+
+/*
  * Ensure that the top level of the user page tables are entirely
  * populated.  This ensures that all processes that get forked have the
  * same entries.  This way, we do not have to ever go set up new entries in
@@ -234,4 +243,5 @@ void __init kpti_init(void)
 
 	kpti_init_all_pgds();
 	kpti_clone_user_shared();
+	kpti_clone_entry_text();
 }

^ permalink raw reply	[flat|nested] 118+ messages in thread

* [patch 42/60] x86/fixmap: Move IDT fixmap into the cpu_entry_area range
  2017-12-04 14:07 [patch 00/60] x86/kpti: Kernel Page Table Isolation (was KAISER) Thomas Gleixner
                   ` (40 preceding siblings ...)
  2017-12-04 14:07 ` [patch 41/60] x86/mm/kpti: Share entry text PMD Thomas Gleixner
@ 2017-12-04 14:07 ` Thomas Gleixner
  2017-12-04 14:07 ` [patch 43/60] x86/fixmap: Add debugstore entries to cpu_entry_area Thomas Gleixner
                   ` (20 subsequent siblings)
  62 siblings, 0 replies; 118+ messages in thread
From: Thomas Gleixner @ 2017-12-04 14:07 UTC (permalink / raw)
  To: LKML
  Cc: x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, keescook, hughd,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Rik van Riel,
	Boris Ostrovsky, Juergen Gross, David Laight, Eduardo Valentin,
	aliguori, Will Deacon, daniel.gruss

[-- Attachment #0: x86-fixmap--Move_IDT_fixmap_into_the_cpu_entry_area_range.patch --]
[-- Type: text/plain, Size: 1043 bytes --]

From: Thomas Gleixner <tglx@linutronix.de>

That makes it automatically a shared mapping along with the cpu_entry_area.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/include/asm/fixmap.h |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/arch/x86/include/asm/fixmap.h
+++ b/arch/x86/include/asm/fixmap.h
@@ -123,7 +123,6 @@ enum fixed_addresses {
 	FIX_IO_APIC_BASE_0,
 	FIX_IO_APIC_BASE_END = FIX_IO_APIC_BASE_0 + MAX_IO_APICS - 1,
 #endif
-	FIX_RO_IDT,	/* Virtual mapping for read-only IDT */
 #ifdef CONFIG_X86_32
 	FIX_KMAP_BEGIN,	/* reserved pte's for temporary kernel mappings */
 	FIX_KMAP_END = FIX_KMAP_BEGIN+(KM_TYPE_NR*NR_CPUS)-1,
@@ -145,6 +144,7 @@ enum fixed_addresses {
 	 * Aligend to a PMD boundary.
 	 */
 	FIX_USR_SHARED_TOP = round_up(FIX_TEXT_POKE0 + 1, PTRS_PER_PMD),
+	FIX_RO_IDT,
 	FIX_CPU_ENTRY_AREA_TOP,
 	FIX_CPU_ENTRY_AREA_BOTTOM = FIX_CPU_ENTRY_AREA_TOP + (CPU_ENTRY_AREA_PAGES * NR_CPUS) - 1,
 	FIX_USR_SHARED_BOTTOM  = round_up(FIX_CPU_ENTRY_AREA_BOTTOM + 2, PTRS_PER_PMD) - 1,

^ permalink raw reply	[flat|nested] 118+ messages in thread

* [patch 43/60] x86/fixmap: Add debugstore entries to cpu_entry_area
  2017-12-04 14:07 [patch 00/60] x86/kpti: Kernel Page Table Isolation (was KAISER) Thomas Gleixner
                   ` (41 preceding siblings ...)
  2017-12-04 14:07 ` [patch 42/60] x86/fixmap: Move IDT fixmap into the cpu_entry_area range Thomas Gleixner
@ 2017-12-04 14:07 ` Thomas Gleixner
  2017-12-07  9:55   ` Borislav Petkov
  2017-12-04 14:07 ` [patch 44/60] x86/events/intel/ds: Map debug buffers in fixmap Thomas Gleixner
                   ` (19 subsequent siblings)
  62 siblings, 1 reply; 118+ messages in thread
From: Thomas Gleixner @ 2017-12-04 14:07 UTC (permalink / raw)
  To: LKML
  Cc: x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, keescook, hughd,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Rik van Riel,
	Boris Ostrovsky, Juergen Gross, David Laight, Eduardo Valentin,
	aliguori, Will Deacon, daniel.gruss

[-- Attachment #0: x86-fixmap--Add_debugstore_entries_to_cpu_entry_area.patch --]
[-- Type: text/plain, Size: 5666 bytes --]

From: Thomas Gleixner <tglx@linutronix.de>

The intel PEBS/BTS debug store is a design trainwreck as is expects virtual
addresses which must be visible in any execution context.

So it is required to make these mappings visible to user space when kernel
page table isolation is active.

Provide enough room for the buffer mappings in the cpu_entry_area so the
buffers are available in the user space visible fixmap.

At the point where the kernel side fixmap is populated there is no buffer
available yet, but the kernel PMD must be populated. To achieve this set
the fixmap entries for these buffers to non present.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/events/intel/ds.c      |    5 +++--
 arch/x86/events/perf_event.h    |   21 ++-------------------
 arch/x86/include/asm/fixmap.h   |   13 +++++++++++++
 arch/x86/include/asm/intel_ds.h |   36 ++++++++++++++++++++++++++++++++++++
 arch/x86/kernel/cpu/common.c    |   21 +++++++++++++++++++++
 5 files changed, 75 insertions(+), 21 deletions(-)

--- a/arch/x86/events/intel/ds.c
+++ b/arch/x86/events/intel/ds.c
@@ -8,11 +8,12 @@
 
 #include "../perf_event.h"
 
+/* Waste a full page so it can be mapped into the cpu_entry_area */
+DEFINE_PER_CPU_PAGE_ALIGNED(struct debug_store, cpu_debug_store);
+
 /* The size of a BTS record in bytes: */
 #define BTS_RECORD_SIZE		24
 
-#define BTS_BUFFER_SIZE		(PAGE_SIZE << 4)
-#define PEBS_BUFFER_SIZE	(PAGE_SIZE << 4)
 #define PEBS_FIXUP_SIZE		PAGE_SIZE
 
 /*
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -14,6 +14,8 @@
 
 #include <linux/perf_event.h>
 
+#include <asm/intel_ds.h>
+
 /* To enable MSR tracing please use the generic trace points. */
 
 /*
@@ -77,8 +79,6 @@ struct amd_nb {
 	struct event_constraint event_constraints[X86_PMC_IDX_MAX];
 };
 
-/* The maximal number of PEBS events: */
-#define MAX_PEBS_EVENTS		8
 #define PEBS_COUNTER_MASK	((1ULL << MAX_PEBS_EVENTS) - 1)
 
 /*
@@ -95,23 +95,6 @@ struct amd_nb {
 	PERF_SAMPLE_TRANSACTION | PERF_SAMPLE_PHYS_ADDR | \
 	PERF_SAMPLE_REGS_INTR | PERF_SAMPLE_REGS_USER)
 
-/*
- * A debug store configuration.
- *
- * We only support architectures that use 64bit fields.
- */
-struct debug_store {
-	u64	bts_buffer_base;
-	u64	bts_index;
-	u64	bts_absolute_maximum;
-	u64	bts_interrupt_threshold;
-	u64	pebs_buffer_base;
-	u64	pebs_index;
-	u64	pebs_absolute_maximum;
-	u64	pebs_interrupt_threshold;
-	u64	pebs_event_reset[MAX_PEBS_EVENTS];
-};
-
 #define PEBS_REGS \
 	(PERF_REG_X86_AX | \
 	 PERF_REG_X86_BX | \
--- a/arch/x86/include/asm/fixmap.h
+++ b/arch/x86/include/asm/fixmap.h
@@ -19,6 +19,7 @@
 #include <asm/acpi.h>
 #include <asm/apicdef.h>
 #include <asm/page.h>
+#include <asm/intel_ds.h>
 #ifdef CONFIG_X86_32
 #include <linux/threads.h>
 #include <asm/kmap_types.h>
@@ -78,6 +79,18 @@ struct cpu_entry_area {
 	 */
 	char exception_stacks[(N_EXCEPTION_STACKS - 1) * EXCEPTION_STKSZ + DEBUG_STKSZ];
 #endif
+#ifdef CONFIG_CPU_SUP_INTEL
+	/*
+	 * Per CPU debug store for Intel performance monitoring. Wastes a
+	 * full page at the moment.
+	 */
+	struct debug_store cpu_debug_store;
+	/*
+	 * The actual PEBS/BTS buffers must be mapped to user space
+	 * Reserve enough fixmap PTEs.
+	 */
+	struct debug_store_buffers cpu_debug_buffers;
+#endif
 };
 
 #define CPU_ENTRY_AREA_PAGES (sizeof(struct cpu_entry_area) / PAGE_SIZE)
--- /dev/null
+++ b/arch/x86/include/asm/intel_ds.h
@@ -0,0 +1,36 @@
+#ifndef _ASM_INTEL_DS_H
+#define _ASM_INTEL_DS_H
+
+#include <linux/percpu-defs.h>
+
+#define BTS_BUFFER_SIZE		(PAGE_SIZE << 4)
+#define PEBS_BUFFER_SIZE	(PAGE_SIZE << 4)
+
+/* The maximal number of PEBS events: */
+#define MAX_PEBS_EVENTS		8
+
+/*
+ * A debug store configuration.
+ *
+ * We only support architectures that use 64bit fields.
+ */
+struct debug_store {
+	u64	bts_buffer_base;
+	u64	bts_index;
+	u64	bts_absolute_maximum;
+	u64	bts_interrupt_threshold;
+	u64	pebs_buffer_base;
+	u64	pebs_index;
+	u64	pebs_absolute_maximum;
+	u64	pebs_interrupt_threshold;
+	u64	pebs_event_reset[MAX_PEBS_EVENTS];
+} __aligned(PAGE_SIZE);
+
+DECLARE_PER_CPU_PAGE_ALIGNED(struct debug_store, cpu_debug_store);
+
+struct debug_store_buffers {
+	char	bts_buffer[BTS_BUFFER_SIZE];
+	char	pebs_buffer[PEBS_BUFFER_SIZE];
+};
+
+#endif
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -31,6 +31,7 @@
 #include <linux/cpumask.h>
 #include <asm/pgtable.h>
 #include <linux/atomic.h>
+#include <asm/intel_ds.h>
 #include <asm/proto.h>
 #include <asm/setup.h>
 #include <asm/apic.h>
@@ -514,6 +515,16 @@ static DEFINE_PER_CPU_PAGE_ALIGNED(char,
 static DEFINE_PER_CPU_PAGE_ALIGNED(struct SYSENTER_stack_page,
 				   SYSENTER_stack_storage);
 
+/*
+ * Force the population of PMDs for not yet allocated per cpu
+ * memory like debug store buffers.
+ */
+static void __init set_percpu_fixmap_ptes(int idx, int pages)
+{
+	for (; pages; pages--, idx--)
+		__set_fixmap(idx, 0, PAGE_NONE);
+}
+
 static void __init
 set_percpu_fixmap_pages(int idx, void *ptr, int pages, pgprot_t prot)
 {
@@ -592,6 +603,16 @@ static void __init setup_cpu_entry_area(
 	__set_fixmap(get_cpu_entry_area_index(cpu, entry_trampoline),
 		     __pa_symbol(_entry_trampoline), PAGE_KERNEL_RX);
 #endif
+
+#ifdef CONFIG_CPU_SUP_INTEL
+	BUILD_BUG_ON(sizeof(struct debug_store) % PAGE_SIZE != 0);
+	set_percpu_fixmap_pages(get_cpu_entry_area_index(cpu, cpu_debug_store),
+				&per_cpu(cpu_debug_store, cpu),
+				sizeof(struct debug_store) / PAGE_SIZE,
+				PAGE_KERNEL);
+	set_percpu_fixmap_ptes(get_cpu_entry_area_index(cpu, cpu_debug_buffers),
+			       sizeof(struct debug_store_buffers) / PAGE_SIZE);
+#endif
 }
 
 void __init setup_cpu_entry_areas(void)

^ permalink raw reply	[flat|nested] 118+ messages in thread

* [patch 44/60] x86/events/intel/ds: Map debug buffers in fixmap
  2017-12-04 14:07 [patch 00/60] x86/kpti: Kernel Page Table Isolation (was KAISER) Thomas Gleixner
                   ` (42 preceding siblings ...)
  2017-12-04 14:07 ` [patch 43/60] x86/fixmap: Add debugstore entries to cpu_entry_area Thomas Gleixner
@ 2017-12-04 14:07 ` Thomas Gleixner
  2017-12-04 14:07 ` [patch 45/60] x86/fixmap: Add ldt entries to user shared fixmap Thomas Gleixner
                   ` (18 subsequent siblings)
  62 siblings, 0 replies; 118+ messages in thread
From: Thomas Gleixner @ 2017-12-04 14:07 UTC (permalink / raw)
  To: LKML
  Cc: x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, keescook, hughd,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Rik van Riel,
	Boris Ostrovsky, Juergen Gross, David Laight, Eduardo Valentin,
	aliguori, Will Deacon, daniel.gruss, Dave Hansen

[-- Attachment #0: x86-events-intel-ds--Map_debug_buffers_in_fixmap.patch --]
[-- Type: text/plain, Size: 6820 bytes --]

From: Hugh Dickins <hughd@google.com>

The BTS and PEBS buffers both have their virtual addresses programmed into
the hardware.  This means that any access to them is performed via the page
tables.  The times that the hardware accesses these are entirely dependent
on how the performance monitoring hardware events are set up.  In other
words, there is no way for the kernel to tell when the hardware might
access these buffers.

To avoid perf crashes, place 'debug_store' allocate pages and map them into
the cpu_entry_area fixmap.

The PEBS fixup buffer does not need this treatment.

[ tglx: Got rid of the kaiser_add_mapping() cruft ]

Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

---
 arch/x86/events/intel/ds.c   |  114 ++++++++++++++++++++++++++-----------------
 arch/x86/events/perf_event.h |    2 
 2 files changed, 73 insertions(+), 43 deletions(-)

--- a/arch/x86/events/intel/ds.c
+++ b/arch/x86/events/intel/ds.c
@@ -280,17 +280,46 @@ void fini_debug_store_on_cpu(int cpu)
 
 static DEFINE_PER_CPU(void *, insn_buffer);
 
-static int alloc_pebs_buffer(int cpu)
+static u64 ds_update_fixmap(int idx, void *addr, size_t size, pgprot_t prot)
 {
-	struct debug_store *ds = per_cpu(cpu_hw_events, cpu).ds;
+	phys_addr_t pa, va;
+	size_t msz = 0;
+
+	va = __fix_to_virt(idx);
+	pa = virt_to_phys(addr);
+	for (; msz < size; idx--, msz += PAGE_SIZE, pa += PAGE_SIZE)
+		__set_fixmap(idx, pa, prot);
+	return va;
+}
+
+static void *dsalloc_pages(size_t size, gfp_t flags, int cpu)
+{
+	unsigned int order = get_order(size);
 	int node = cpu_to_node(cpu);
-	int max;
+	struct page *page;
+
+	page = __alloc_pages_node(node, flags | __GFP_ZERO, order);
+	return page ? page_address(page) : NULL;
+}
+
+static void dsfree_pages(const void *buffer, size_t size)
+{
+	if (buffer)
+		free_pages((unsigned long)buffer, get_order(size));
+}
+
+static int alloc_pebs_buffer(int cpu)
+{
+	struct cpu_hw_events *hwev = per_cpu_ptr(&cpu_hw_events, cpu);
+	struct debug_store *ds = hwev->ds;
+	size_t bsiz = x86_pmu.pebs_buffer_size;
+	int idx, max, node = cpu_to_node(cpu);
 	void *buffer, *ibuffer;
 
 	if (!x86_pmu.pebs)
 		return 0;
 
-	buffer = kzalloc_node(x86_pmu.pebs_buffer_size, GFP_KERNEL, node);
+	buffer = dsalloc_pages(bsiz, GFP_KERNEL, cpu);
 	if (unlikely(!buffer))
 		return -ENOMEM;
 
@@ -301,25 +330,27 @@ static int alloc_pebs_buffer(int cpu)
 	if (x86_pmu.intel_cap.pebs_format < 2) {
 		ibuffer = kzalloc_node(PEBS_FIXUP_SIZE, GFP_KERNEL, node);
 		if (!ibuffer) {
-			kfree(buffer);
+			dsfree_pages(buffer, bsiz);
 			return -ENOMEM;
 		}
 		per_cpu(insn_buffer, cpu) = ibuffer;
 	}
-
-	max = x86_pmu.pebs_buffer_size / x86_pmu.pebs_record_size;
-
-	ds->pebs_buffer_base = (u64)(unsigned long)buffer;
+	hwev->ds_pebs_vaddr = buffer;
+	/* Update the fixmap */
+	idx = get_cpu_entry_area_index(cpu, cpu_debug_buffers.pebs_buffer);
+	ds->pebs_buffer_base = ds_update_fixmap(idx, buffer, bsiz,
+						PAGE_KERNEL);
 	ds->pebs_index = ds->pebs_buffer_base;
-	ds->pebs_absolute_maximum = ds->pebs_buffer_base +
-		max * x86_pmu.pebs_record_size;
-
+	max = x86_pmu.pebs_record_size * (bsiz / x86_pmu.pebs_record_size);
+	ds->pebs_absolute_maximum = ds->pebs_buffer_base + max;
 	return 0;
 }
 
 static void release_pebs_buffer(int cpu)
 {
-	struct debug_store *ds = per_cpu(cpu_hw_events, cpu).ds;
+	struct cpu_hw_events *hwev = per_cpu_ptr(&cpu_hw_events, cpu);
+	struct debug_store *ds = hwev->ds;
+	int idx;
 
 	if (!ds || !x86_pmu.pebs)
 		return;
@@ -327,73 +358,70 @@ static void release_pebs_buffer(int cpu)
 	kfree(per_cpu(insn_buffer, cpu));
 	per_cpu(insn_buffer, cpu) = NULL;
 
-	kfree((void *)(unsigned long)ds->pebs_buffer_base);
+	/* Clear the fixmap */
+	idx = get_cpu_entry_area_index(cpu, cpu_debug_buffers.pebs_buffer);
+	ds_update_fixmap(idx, 0, x86_pmu.pebs_buffer_size, PAGE_NONE);
 	ds->pebs_buffer_base = 0;
+	dsfree_pages(hwev->ds_pebs_vaddr, x86_pmu.pebs_buffer_size);
+	hwev->ds_pebs_vaddr = NULL;
 }
 
 static int alloc_bts_buffer(int cpu)
 {
-	struct debug_store *ds = per_cpu(cpu_hw_events, cpu).ds;
-	int node = cpu_to_node(cpu);
-	int max, thresh;
+	struct cpu_hw_events *hwev = per_cpu_ptr(&cpu_hw_events, cpu);
+	struct debug_store *ds = hwev->ds;
+	int idx, max;
 	void *buffer;
 
 	if (!x86_pmu.bts)
 		return 0;
 
-	buffer = kzalloc_node(BTS_BUFFER_SIZE, GFP_KERNEL | __GFP_NOWARN, node);
+	buffer = dsalloc_pages(BTS_BUFFER_SIZE, GFP_KERNEL | __GFP_NOWARN, cpu);
 	if (unlikely(!buffer)) {
 		WARN_ONCE(1, "%s: BTS buffer allocation failure\n", __func__);
 		return -ENOMEM;
 	}
-
-	max = BTS_BUFFER_SIZE / BTS_RECORD_SIZE;
-	thresh = max / 16;
-
-	ds->bts_buffer_base = (u64)(unsigned long)buffer;
+	hwev->ds_bts_vaddr = buffer;
+	/* Update the fixmap */
+	idx = get_cpu_entry_area_index(cpu, cpu_debug_buffers.bts_buffer);
+	ds->bts_buffer_base = ds_update_fixmap(idx, buffer, BTS_BUFFER_SIZE,
+					       PAGE_KERNEL);
 	ds->bts_index = ds->bts_buffer_base;
-	ds->bts_absolute_maximum = ds->bts_buffer_base +
-		max * BTS_RECORD_SIZE;
-	ds->bts_interrupt_threshold = ds->bts_absolute_maximum -
-		thresh * BTS_RECORD_SIZE;
-
+	max = BTS_RECORD_SIZE * (BTS_BUFFER_SIZE / BTS_RECORD_SIZE);
+	ds->bts_absolute_maximum = ds->bts_buffer_base + max;
+	ds->bts_interrupt_threshold = ds->bts_absolute_maximum - (max / 16);
 	return 0;
 }
 
 static void release_bts_buffer(int cpu)
 {
-	struct debug_store *ds = per_cpu(cpu_hw_events, cpu).ds;
+	struct cpu_hw_events *hwev = per_cpu_ptr(&cpu_hw_events, cpu);
+	struct debug_store *ds = hwev->ds;
+	int idx;
 
 	if (!ds || !x86_pmu.bts)
 		return;
 
-	kfree((void *)(unsigned long)ds->bts_buffer_base);
+	/* Clear the fixmap */
+	idx = get_cpu_entry_area_index(cpu, cpu_debug_buffers.bts_buffer);
+	ds_update_fixmap(idx, 0, BTS_BUFFER_SIZE, PAGE_NONE);
 	ds->bts_buffer_base = 0;
+	dsfree_pages(hwev->ds_bts_vaddr, BTS_BUFFER_SIZE);
+	hwev->ds_bts_vaddr = NULL;
 }
 
 static int alloc_ds_buffer(int cpu)
 {
-	int node = cpu_to_node(cpu);
-	struct debug_store *ds;
-
-	ds = kzalloc_node(sizeof(*ds), GFP_KERNEL, node);
-	if (unlikely(!ds))
-		return -ENOMEM;
+	struct debug_store *ds = &get_cpu_entry_area(cpu)->cpu_debug_store;
 
+	memset(ds, 0, sizeof(*ds));
 	per_cpu(cpu_hw_events, cpu).ds = ds;
-
 	return 0;
 }
 
 static void release_ds_buffer(int cpu)
 {
-	struct debug_store *ds = per_cpu(cpu_hw_events, cpu).ds;
-
-	if (!ds)
-		return;
-
 	per_cpu(cpu_hw_events, cpu).ds = NULL;
-	kfree(ds);
 }
 
 void release_ds_buffers(void)
--- a/arch/x86/events/perf_event.h
+++ b/arch/x86/events/perf_event.h
@@ -199,6 +199,8 @@ struct cpu_hw_events {
 	 * Intel DebugStore bits
 	 */
 	struct debug_store	*ds;
+	void			*ds_pebs_vaddr;
+	void			*ds_bts_vaddr;
 	u64			pebs_enabled;
 	int			n_pebs;
 	int			n_large_pebs;

^ permalink raw reply	[flat|nested] 118+ messages in thread

* [patch 45/60] x86/fixmap: Add ldt entries to user shared fixmap
  2017-12-04 14:07 [patch 00/60] x86/kpti: Kernel Page Table Isolation (was KAISER) Thomas Gleixner
                   ` (43 preceding siblings ...)
  2017-12-04 14:07 ` [patch 44/60] x86/events/intel/ds: Map debug buffers in fixmap Thomas Gleixner
@ 2017-12-04 14:07 ` Thomas Gleixner
  2017-12-04 14:07 ` [patch 46/60] x86/ldt: Rename ldt_struct->entries member Thomas Gleixner
                   ` (17 subsequent siblings)
  62 siblings, 0 replies; 118+ messages in thread
From: Thomas Gleixner @ 2017-12-04 14:07 UTC (permalink / raw)
  To: LKML
  Cc: x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, keescook, hughd,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Rik van Riel,
	Boris Ostrovsky, Juergen Gross, David Laight, Eduardo Valentin,
	aliguori, Will Deacon, daniel.gruss

[-- Attachment #0: x86-fixmap--Add_ldt_entries_to_user_shared_fixmap.patch --]
[-- Type: text/plain, Size: 1428 bytes --]

From: Thomas Gleixner <tglx@linutronix.de>

LDT entries need to be user visible. Add them to the user shared fixmaps so
they can be mapped to the actual location of the LDT entries of a process
on task switch.

Populate the PTEs upfront so the PMD sharing works.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

---
 arch/x86/include/asm/fixmap.h |    3 +++
 arch/x86/kernel/cpu/common.c  |    2 ++
 2 files changed, 5 insertions(+)

--- a/arch/x86/include/asm/fixmap.h
+++ b/arch/x86/include/asm/fixmap.h
@@ -20,6 +20,7 @@
 #include <asm/apicdef.h>
 #include <asm/page.h>
 #include <asm/intel_ds.h>
+#include <asm/ldt.h>
 #ifdef CONFIG_X86_32
 #include <linux/threads.h>
 #include <asm/kmap_types.h>
@@ -91,6 +92,8 @@ struct cpu_entry_area {
 	 */
 	struct debug_store_buffers cpu_debug_buffers;
 #endif
+	/* Provide fixmap space for user LDTs */
+	char ldt_entries[LDT_ENTRIES * LDT_ENTRY_SIZE];
 };
 
 #define CPU_ENTRY_AREA_PAGES (sizeof(struct cpu_entry_area) / PAGE_SIZE)
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -613,6 +613,8 @@ static void __init setup_cpu_entry_area(
 	set_percpu_fixmap_ptes(get_cpu_entry_area_index(cpu, cpu_debug_buffers),
 			       sizeof(struct debug_store_buffers) / PAGE_SIZE);
 #endif
+	set_percpu_fixmap_ptes(get_cpu_entry_area_index(cpu, ldt_entries),
+			       (LDT_ENTRIES * LDT_ENTRY_SIZE) / PAGE_SIZE);
 }
 
 void __init setup_cpu_entry_areas(void)

^ permalink raw reply	[flat|nested] 118+ messages in thread

* [patch 46/60] x86/ldt: Rename ldt_struct->entries member
  2017-12-04 14:07 [patch 00/60] x86/kpti: Kernel Page Table Isolation (was KAISER) Thomas Gleixner
                   ` (44 preceding siblings ...)
  2017-12-04 14:07 ` [patch 45/60] x86/fixmap: Add ldt entries to user shared fixmap Thomas Gleixner
@ 2017-12-04 14:07 ` Thomas Gleixner
  2017-12-04 14:07 ` [patch 47/60] x86/ldt: Map LDT entries into fixmap Thomas Gleixner
                   ` (16 subsequent siblings)
  62 siblings, 0 replies; 118+ messages in thread
From: Thomas Gleixner @ 2017-12-04 14:07 UTC (permalink / raw)
  To: LKML
  Cc: x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, keescook, hughd,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Rik van Riel,
	Boris Ostrovsky, Juergen Gross, David Laight, Eduardo Valentin,
	aliguori, Will Deacon, daniel.gruss

[-- Attachment #0: x86-ldt-Rename_ldt_struct---Entries_member.patch --]
[-- Type: text/plain, Size: 5772 bytes --]

From: Thomas Gleixner <tglx@linutronix.de>

To support user shared LDT entry mappings it's required to change the LDT
related code so that the kernel side only references the real page mapping
of the LDT. When the LDT is loaded then the entries are alias mapped in the
per cpu fixmap. To catch all users rename ldt_struct->entries and fix them
up.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/events/core.c             |    2 +-
 arch/x86/include/asm/mmu_context.h |    4 ++--
 arch/x86/kernel/ldt.c              |   28 +++++++++++++++-------------
 arch/x86/kernel/process_64.c       |    2 +-
 arch/x86/kernel/step.c             |    2 +-
 arch/x86/lib/insn-eval.c           |    2 +-
 arch/x86/math-emu/fpu_system.h     |    2 +-
 7 files changed, 22 insertions(+), 20 deletions(-)

--- a/arch/x86/events/core.c
+++ b/arch/x86/events/core.c
@@ -2375,7 +2375,7 @@ static unsigned long get_segment_base(un
 		if (!ldt || idx >= ldt->nr_entries)
 			return 0;
 
-		desc = &ldt->entries[idx];
+		desc = &ldt->entries_va[idx];
 #else
 		return 0;
 #endif
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -50,7 +50,7 @@ struct ldt_struct {
 	 * call gates.  On native, we could merge the ldt_struct and LDT
 	 * allocations, but it's not worth trying to optimize.
 	 */
-	struct desc_struct *entries;
+	struct desc_struct *entries_va;
 	unsigned int nr_entries;
 };
 
@@ -91,7 +91,7 @@ static inline void load_mm_ldt(struct mm
 	 */
 
 	if (unlikely(ldt))
-		set_ldt(ldt->entries, ldt->nr_entries);
+		set_ldt(ldt->entries_va, ldt->nr_entries);
 	else
 		clear_LDT();
 #else
--- a/arch/x86/kernel/ldt.c
+++ b/arch/x86/kernel/ldt.c
@@ -52,7 +52,7 @@ static void flush_ldt(void *__mm)
 		return;
 
 	pc = &mm->context;
-	set_ldt(pc->ldt->entries, pc->ldt->nr_entries);
+	set_ldt(pc->ldt->entries_va, pc->ldt->nr_entries);
 
 	refresh_ldt_segments();
 }
@@ -80,11 +80,11 @@ static struct ldt_struct *alloc_ldt_stru
 	 * than PAGE_SIZE.
 	 */
 	if (alloc_size > PAGE_SIZE)
-		new_ldt->entries = vzalloc(alloc_size);
+		new_ldt->entries_va = vzalloc(alloc_size);
 	else
-		new_ldt->entries = (void *)get_zeroed_page(GFP_KERNEL);
+		new_ldt->entries_va = (void *)get_zeroed_page(GFP_KERNEL);
 
-	if (!new_ldt->entries) {
+	if (!new_ldt->entries_va) {
 		kfree(new_ldt);
 		return NULL;
 	}
@@ -96,7 +96,7 @@ static struct ldt_struct *alloc_ldt_stru
 /* After calling this, the LDT is immutable. */
 static void finalize_ldt_struct(struct ldt_struct *ldt)
 {
-	paravirt_alloc_ldt(ldt->entries, ldt->nr_entries);
+	paravirt_alloc_ldt(ldt->entries_va, ldt->nr_entries);
 }
 
 /* context.lock is held */
@@ -115,11 +115,11 @@ static void free_ldt_struct(struct ldt_s
 	if (likely(!ldt))
 		return;
 
-	paravirt_free_ldt(ldt->entries, ldt->nr_entries);
+	paravirt_free_ldt(ldt->entries_va, ldt->nr_entries);
 	if (ldt->nr_entries * LDT_ENTRY_SIZE > PAGE_SIZE)
-		vfree_atomic(ldt->entries);
+		vfree_atomic(ldt->entries_va);
 	else
-		free_page((unsigned long)ldt->entries);
+		free_page((unsigned long)ldt->entries_va);
 	kfree(ldt);
 }
 
@@ -152,7 +152,7 @@ int init_new_context_ldt(struct task_str
 		goto out_unlock;
 	}
 
-	memcpy(new_ldt->entries, old_mm->context.ldt->entries,
+	memcpy(new_ldt->entries_va, old_mm->context.ldt->entries_va,
 	       new_ldt->nr_entries * LDT_ENTRY_SIZE);
 	finalize_ldt_struct(new_ldt);
 
@@ -194,7 +194,7 @@ static int read_ldt(void __user *ptr, un
 	if (entries_size > bytecount)
 		entries_size = bytecount;
 
-	if (copy_to_user(ptr, mm->context.ldt->entries, entries_size)) {
+	if (copy_to_user(ptr, mm->context.ldt->entries_va, entries_size)) {
 		retval = -EFAULT;
 		goto out_unlock;
 	}
@@ -280,10 +280,12 @@ static int write_ldt(void __user *ptr, u
 	if (!new_ldt)
 		goto out_unlock;
 
-	if (old_ldt)
-		memcpy(new_ldt->entries, old_ldt->entries, old_nr_entries * LDT_ENTRY_SIZE);
+	if (old_ldt) {
+		memcpy(new_ldt->entries_va, old_ldt->entries_va,
+		       old_nr_entries * LDT_ENTRY_SIZE);
+	}
 
-	new_ldt->entries[ldt_info.entry_number] = ldt;
+	new_ldt->entries_va[ldt_info.entry_number] = ldt;
 	finalize_ldt_struct(new_ldt);
 
 	install_ldt(mm, new_ldt);
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -139,7 +139,7 @@ void release_thread(struct task_struct *
 		if (dead_task->mm->context.ldt) {
 			pr_warn("WARNING: dead process %s still has LDT? <%p/%d>\n",
 				dead_task->comm,
-				dead_task->mm->context.ldt->entries,
+				dead_task->mm->context.ldt->entries_va,
 				dead_task->mm->context.ldt->nr_entries);
 			BUG();
 		}
--- a/arch/x86/kernel/step.c
+++ b/arch/x86/kernel/step.c
@@ -38,7 +38,7 @@ unsigned long convert_ip_to_linear(struc
 			     seg >= child->mm->context.ldt->nr_entries))
 			addr = -1L; /* bogus selector, access would fault */
 		else {
-			desc = &child->mm->context.ldt->entries[seg];
+			desc = &child->mm->context.ldt->entries_va[seg];
 			base = get_desc_base(desc);
 
 			/* 16-bit code segment? */
--- a/arch/x86/lib/insn-eval.c
+++ b/arch/x86/lib/insn-eval.c
@@ -583,7 +583,7 @@ static struct desc_struct *get_desc(unsi
 		mutex_lock(&current->active_mm->context.lock);
 		ldt = current->active_mm->context.ldt;
 		if (ldt && sel < ldt->nr_entries)
-			desc = &ldt->entries[sel];
+			desc = &ldt->entries_va[sel];
 
 		mutex_unlock(&current->active_mm->context.lock);
 
--- a/arch/x86/math-emu/fpu_system.h
+++ b/arch/x86/math-emu/fpu_system.h
@@ -29,7 +29,7 @@ static inline struct desc_struct FPU_get
 	seg >>= 3;
 	mutex_lock(&current->mm->context.lock);
 	if (current->mm->context.ldt && seg < current->mm->context.ldt->nr_entries)
-		ret = current->mm->context.ldt->entries[seg];
+		ret = current->mm->context.ldt->entries_va[seg];
 	mutex_unlock(&current->mm->context.lock);
 #endif
 	return ret;

^ permalink raw reply	[flat|nested] 118+ messages in thread

* [patch 47/60] x86/ldt: Map LDT entries into fixmap
  2017-12-04 14:07 [patch 00/60] x86/kpti: Kernel Page Table Isolation (was KAISER) Thomas Gleixner
                   ` (45 preceding siblings ...)
  2017-12-04 14:07 ` [patch 46/60] x86/ldt: Rename ldt_struct->entries member Thomas Gleixner
@ 2017-12-04 14:07 ` Thomas Gleixner
  2017-12-04 22:33   ` Andy Lutomirski
  2017-12-04 14:07 ` [patch 48/60] x86/mm: Move the CR3 construction functions to tlbflush.h Thomas Gleixner
                   ` (15 subsequent siblings)
  62 siblings, 1 reply; 118+ messages in thread
From: Thomas Gleixner @ 2017-12-04 14:07 UTC (permalink / raw)
  To: LKML
  Cc: x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, keescook, hughd,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Rik van Riel,
	Boris Ostrovsky, Juergen Gross, David Laight, Eduardo Valentin,
	aliguori, Will Deacon, daniel.gruss

[-- Attachment #0: x86-ldt--Map_LDT_entries_into_fixmap.patch --]
[-- Type: text/plain, Size: 6867 bytes --]

From: Thomas Gleixner <tglx@linutronix.de>

LDT is not really commonly used on 64bit so the overhead of populating the
fixmap entries on context switch for the rare LDT syscall users is a
reasonable trade off vs. having extra dynamically managed mapping space per
process.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/include/asm/mmu_context.h |   44 ++++--------------
 arch/x86/kernel/ldt.c              |   87 +++++++++++++++++++++++++++++++------
 2 files changed, 84 insertions(+), 47 deletions(-)

--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -45,13 +45,17 @@ static inline void load_mm_cr4(struct mm
  */
 struct ldt_struct {
 	/*
-	 * Xen requires page-aligned LDTs with special permissions.  This is
-	 * needed to prevent us from installing evil descriptors such as
+	 * Xen requires page-aligned LDTs with special permissions.  This
+	 * is needed to prevent us from installing evil descriptors such as
 	 * call gates.  On native, we could merge the ldt_struct and LDT
-	 * allocations, but it's not worth trying to optimize.
+	 * allocations, but it's not worth trying to optimize and it does
+	 * not work with page table isolation enabled, which requires
+	 * page-aligned LDT entries as well.
 	 */
-	struct desc_struct *entries_va;
-	unsigned int nr_entries;
+	struct desc_struct	*entries_va;
+	phys_addr_t		entries_pa;
+	unsigned int		nr_entries;
+	unsigned int		order;
 };
 
 /*
@@ -59,6 +63,7 @@ struct ldt_struct {
  */
 int init_new_context_ldt(struct task_struct *tsk, struct mm_struct *mm);
 void destroy_context_ldt(struct mm_struct *mm);
+void load_mm_ldt(struct mm_struct *mm);
 #else	/* CONFIG_MODIFY_LDT_SYSCALL */
 static inline int init_new_context_ldt(struct task_struct *tsk,
 				       struct mm_struct *mm)
@@ -66,38 +71,11 @@ static inline int init_new_context_ldt(s
 	return 0;
 }
 static inline void destroy_context_ldt(struct mm_struct *mm) {}
-#endif
-
 static inline void load_mm_ldt(struct mm_struct *mm)
 {
-#ifdef CONFIG_MODIFY_LDT_SYSCALL
-	struct ldt_struct *ldt;
-
-	/* READ_ONCE synchronizes with smp_store_release */
-	ldt = READ_ONCE(mm->context.ldt);
-
-	/*
-	 * Any change to mm->context.ldt is followed by an IPI to all
-	 * CPUs with the mm active.  The LDT will not be freed until
-	 * after the IPI is handled by all such CPUs.  This means that,
-	 * if the ldt_struct changes before we return, the values we see
-	 * will be safe, and the new values will be loaded before we run
-	 * any user code.
-	 *
-	 * NB: don't try to convert this to use RCU without extreme care.
-	 * We would still need IRQs off, because we don't want to change
-	 * the local LDT after an IPI loaded a newer value than the one
-	 * that we can see.
-	 */
-
-	if (unlikely(ldt))
-		set_ldt(ldt->entries_va, ldt->nr_entries);
-	else
-		clear_LDT();
-#else
 	clear_LDT();
-#endif
 }
+#endif
 
 static inline void switch_ldt(struct mm_struct *prev, struct mm_struct *next)
 {
--- a/arch/x86/kernel/ldt.c
+++ b/arch/x86/kernel/ldt.c
@@ -22,6 +22,7 @@
 #include <asm/desc.h>
 #include <asm/mmu_context.h>
 #include <asm/syscalls.h>
+#include <asm/fixmap.h>
 
 static void refresh_ldt_segments(void)
 {
@@ -42,6 +43,61 @@ static void refresh_ldt_segments(void)
 #endif
 }
 
+#ifdef CONFIG_KERNEL_PAGE_TABLE_ISOLATION
+
+#define LDT_EPP		(PAGE_SIZE / LDT_ENTRY_SIZE)
+
+static void set_ldt_and_map(struct ldt_struct *ldt)
+{
+	phys_addr_t pa = ldt->entries_pa;
+	void *fixva;
+	int idx, i;
+
+	if (!static_cpu_has_bug(X86_BUG_CPU_SECURE_MODE_KPTI)) {
+		set_ldt(ldt->entries_va, ldt->nr_entries);
+		return;
+	}
+
+	idx = get_cpu_entry_area_index(smp_processor_id(), ldt_entries);
+	fixva = (void *) __fix_to_virt(idx);
+	for (i = 0; i < ldt->nr_entries; idx--, i += LDT_EPP, pa += PAGE_SIZE)
+		__set_fixmap(idx, pa, PAGE_KERNEL);
+	set_ldt(fixva, ldt->nr_entries);
+}
+#else
+static void set_ldt_and_map(struct ldt_struct *ldt)
+{
+	set_ldt(ldt->entries_va, ldt->nr_entries);
+}
+#endif
+
+void load_mm_ldt(struct mm_struct *mm)
+{
+	struct ldt_struct *ldt;
+
+	/* READ_ONCE synchronizes with smp_store_release */
+	ldt = READ_ONCE(mm->context.ldt);
+
+	/*
+	 * Any change to mm->context.ldt is followed by an IPI to all
+	 * CPUs with the mm active.  The LDT will not be freed until
+	 * after the IPI is handled by all such CPUs.  This means that,
+	 * if the ldt_struct changes before we return, the values we see
+	 * will be safe, and the new values will be loaded before we run
+	 * any user code.
+	 *
+	 * NB: don't try to convert this to use RCU without extreme care.
+	 * We would still need IRQs off, because we don't want to change
+	 * the local LDT after an IPI loaded a newer value than the one
+	 * that we can see.
+	 */
+
+	if (unlikely(ldt))
+		set_ldt_and_map(ldt);
+	else
+		clear_LDT();
+}
+
 /* context.lock is held for us, so we don't need any locking. */
 static void flush_ldt(void *__mm)
 {
@@ -52,26 +108,35 @@ static void flush_ldt(void *__mm)
 		return;
 
 	pc = &mm->context;
-	set_ldt(pc->ldt->entries_va, pc->ldt->nr_entries);
+	set_ldt_and_map(pc->ldt);
 
 	refresh_ldt_segments();
 }
 
+static void __free_ldt_struct(struct ldt_struct *ldt)
+{
+	free_pages((unsigned long)ldt->entries_va, ldt->order);
+	kfree(ldt);
+}
+
 /* The caller must call finalize_ldt_struct on the result. LDT starts zeroed. */
 static struct ldt_struct *alloc_ldt_struct(unsigned int num_entries)
 {
 	struct ldt_struct *new_ldt;
 	unsigned int alloc_size;
+	struct page *page;
+	int order;
 
 	if (num_entries > LDT_ENTRIES)
 		return NULL;
 
-	new_ldt = kmalloc(sizeof(struct ldt_struct), GFP_KERNEL);
+	new_ldt = kzalloc(sizeof(struct ldt_struct), GFP_KERNEL);
 	if (!new_ldt)
 		return NULL;
 
 	BUILD_BUG_ON(LDT_ENTRY_SIZE != sizeof(struct desc_struct));
 	alloc_size = num_entries * LDT_ENTRY_SIZE;
+	order = get_order(alloc_size);
 
 	/*
 	 * Xen is very picky: it requires a page-aligned LDT that has no
@@ -79,16 +144,14 @@ static struct ldt_struct *alloc_ldt_stru
 	 * Keep it simple: zero the whole allocation and never allocate less
 	 * than PAGE_SIZE.
 	 */
-	if (alloc_size > PAGE_SIZE)
-		new_ldt->entries_va = vzalloc(alloc_size);
-	else
-		new_ldt->entries_va = (void *)get_zeroed_page(GFP_KERNEL);
-
-	if (!new_ldt->entries_va) {
+	page = alloc_pages(GFP_KERNEL | __GFP_ZERO, order);
+	if (!page) {
 		kfree(new_ldt);
 		return NULL;
 	}
-
+	new_ldt->entries_va = page_address(page);
+	new_ldt->entries_pa = virt_to_phys(new_ldt->entries_va);
+	new_ldt->order = order;
 	new_ldt->nr_entries = num_entries;
 	return new_ldt;
 }
@@ -116,11 +179,7 @@ static void free_ldt_struct(struct ldt_s
 		return;
 
 	paravirt_free_ldt(ldt->entries_va, ldt->nr_entries);
-	if (ldt->nr_entries * LDT_ENTRY_SIZE > PAGE_SIZE)
-		vfree_atomic(ldt->entries_va);
-	else
-		free_page((unsigned long)ldt->entries_va);
-	kfree(ldt);
+	__free_ldt_struct(ldt);
 }
 
 /*

^ permalink raw reply	[flat|nested] 118+ messages in thread

* [patch 48/60] x86/mm: Move the CR3 construction functions to tlbflush.h
  2017-12-04 14:07 [patch 00/60] x86/kpti: Kernel Page Table Isolation (was KAISER) Thomas Gleixner
                   ` (46 preceding siblings ...)
  2017-12-04 14:07 ` [patch 47/60] x86/ldt: Map LDT entries into fixmap Thomas Gleixner
@ 2017-12-04 14:07 ` Thomas Gleixner
  2017-12-04 14:07 ` [patch 49/60] x86/mm: Remove hard-coded ASID limit checks Thomas Gleixner
                   ` (14 subsequent siblings)
  62 siblings, 0 replies; 118+ messages in thread
From: Thomas Gleixner @ 2017-12-04 14:07 UTC (permalink / raw)
  To: LKML
  Cc: x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, keescook, hughd,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Rik van Riel,
	Boris Ostrovsky, Juergen Gross, David Laight, Eduardo Valentin,
	aliguori, Will Deacon, daniel.gruss, Dave Hansen, Ingo Molnar,
	moritz.lipp, linux-mm, Borislav Petkov, michael.schwarz,
	richard.fellner

[-- Attachment #0: x86-mm--Move_the_CR3_construction_functions_to_tlbflush.h.patch --]
[-- Type: text/plain, Size: 5580 bytes --]

From: Dave Hansen <dave.hansen@linux.intel.com>

For flushing the TLB, the ASID which has been programmed into the hardware
must be known.  That differs from what is in 'cpu_tlbstate'.

Add functions to transform the 'cpu_tlbstate' values into to the one
programmed into the hardware (CR3).

It's not easy to include mmu_context.h into tlbflush.h, so just move
the CR3 building over to tlbflush.h.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: keescook@google.com
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: moritz.lipp@iaik.tugraz.at
Cc: linux-mm@kvack.org
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: hughd@google.com
Cc: daniel.gruss@iaik.tugraz.at
Cc: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: michael.schwarz@iaik.tugraz.at
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: richard.fellner@student.tugraz.at
Link: https://lkml.kernel.org/r/20171123003502.CC87BF47@viggo.jf.intel.com

---
 arch/x86/include/asm/mmu_context.h |   29 +----------------------------
 arch/x86/include/asm/tlbflush.h    |   26 ++++++++++++++++++++++++++
 arch/x86/mm/tlb.c                  |    8 ++++----
 3 files changed, 31 insertions(+), 32 deletions(-)

--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -260,33 +260,6 @@ static inline bool arch_vma_access_permi
 }
 
 /*
- * If PCID is on, ASID-aware code paths put the ASID+1 into the PCID
- * bits.  This serves two purposes.  It prevents a nasty situation in
- * which PCID-unaware code saves CR3, loads some other value (with PCID
- * == 0), and then restores CR3, thus corrupting the TLB for ASID 0 if
- * the saved ASID was nonzero.  It also means that any bugs involving
- * loading a PCID-enabled CR3 with CR4.PCIDE off will trigger
- * deterministically.
- */
-
-static inline unsigned long build_cr3(struct mm_struct *mm, u16 asid)
-{
-	if (static_cpu_has(X86_FEATURE_PCID)) {
-		VM_WARN_ON_ONCE(asid > 4094);
-		return __sme_pa(mm->pgd) | (asid + 1);
-	} else {
-		VM_WARN_ON_ONCE(asid != 0);
-		return __sme_pa(mm->pgd);
-	}
-}
-
-static inline unsigned long build_cr3_noflush(struct mm_struct *mm, u16 asid)
-{
-	VM_WARN_ON_ONCE(asid > 4094);
-	return __sme_pa(mm->pgd) | (asid + 1) | CR3_NOFLUSH;
-}
-
-/*
  * This can be used from process context to figure out what the value of
  * CR3 is without needing to do a (slow) __read_cr3().
  *
@@ -295,7 +268,7 @@ static inline unsigned long build_cr3_no
  */
 static inline unsigned long __get_current_cr3_fast(void)
 {
-	unsigned long cr3 = build_cr3(this_cpu_read(cpu_tlbstate.loaded_mm),
+	unsigned long cr3 = build_cr3(this_cpu_read(cpu_tlbstate.loaded_mm)->pgd,
 		this_cpu_read(cpu_tlbstate.loaded_mm_asid));
 
 	/* For now, be very restrictive about when this can be called. */
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -75,6 +75,32 @@ static inline u64 inc_mm_tlb_gen(struct
 	return new_tlb_gen;
 }
 
+/*
+ * If PCID is on, ASID-aware code paths put the ASID+1 into the PCID bits.
+ * This serves two purposes.  It prevents a nasty situation in which
+ * PCID-unaware code saves CR3, loads some other value (with PCID == 0),
+ * and then restores CR3, thus corrupting the TLB for ASID 0 if the saved
+ * ASID was nonzero.  It also means that any bugs involving loading a
+ * PCID-enabled CR3 with CR4.PCIDE off will trigger deterministically.
+ */
+struct pgd_t;
+static inline unsigned long build_cr3(pgd_t *pgd, u16 asid)
+{
+	if (static_cpu_has(X86_FEATURE_PCID)) {
+		VM_WARN_ON_ONCE(asid > 4094);
+		return __sme_pa(pgd) | (asid + 1);
+	} else {
+		VM_WARN_ON_ONCE(asid != 0);
+		return __sme_pa(pgd);
+	}
+}
+
+static inline unsigned long build_cr3_noflush(pgd_t *pgd, u16 asid)
+{
+	VM_WARN_ON_ONCE(asid > 4094);
+	return __sme_pa(pgd) | (asid + 1) | CR3_NOFLUSH;
+}
+
 #ifdef CONFIG_PARAVIRT
 #include <asm/paravirt.h>
 #else
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -128,7 +128,7 @@ void switch_mm_irqs_off(struct mm_struct
 	 * isn't free.
 	 */
 #ifdef CONFIG_DEBUG_VM
-	if (WARN_ON_ONCE(__read_cr3() != build_cr3(real_prev, prev_asid))) {
+	if (WARN_ON_ONCE(__read_cr3() != build_cr3(real_prev->pgd, prev_asid))) {
 		/*
 		 * If we were to BUG here, we'd be very likely to kill
 		 * the system so hard that we don't see the call trace.
@@ -195,7 +195,7 @@ void switch_mm_irqs_off(struct mm_struct
 		if (need_flush) {
 			this_cpu_write(cpu_tlbstate.ctxs[new_asid].ctx_id, next->context.ctx_id);
 			this_cpu_write(cpu_tlbstate.ctxs[new_asid].tlb_gen, next_tlb_gen);
-			write_cr3(build_cr3(next, new_asid));
+			write_cr3(build_cr3(next->pgd, new_asid));
 
 			/*
 			 * NB: This gets called via leave_mm() in the idle path
@@ -208,7 +208,7 @@ void switch_mm_irqs_off(struct mm_struct
 			trace_tlb_flush_rcuidle(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL);
 		} else {
 			/* The new ASID is already up to date. */
-			write_cr3(build_cr3_noflush(next, new_asid));
+			write_cr3(build_cr3_noflush(next->pgd, new_asid));
 
 			/* See above wrt _rcuidle. */
 			trace_tlb_flush_rcuidle(TLB_FLUSH_ON_TASK_SWITCH, 0);
@@ -288,7 +288,7 @@ void initialize_tlbstate_and_flush(void)
 		!(cr4_read_shadow() & X86_CR4_PCIDE));
 
 	/* Force ASID 0 and force a TLB flush. */
-	write_cr3(build_cr3(mm, 0));
+	write_cr3(build_cr3(mm->pgd, 0));
 
 	/* Reinitialize tlbstate. */
 	this_cpu_write(cpu_tlbstate.loaded_mm_asid, 0);

^ permalink raw reply	[flat|nested] 118+ messages in thread

* [patch 49/60] x86/mm: Remove hard-coded ASID limit checks
  2017-12-04 14:07 [patch 00/60] x86/kpti: Kernel Page Table Isolation (was KAISER) Thomas Gleixner
                   ` (47 preceding siblings ...)
  2017-12-04 14:07 ` [patch 48/60] x86/mm: Move the CR3 construction functions to tlbflush.h Thomas Gleixner
@ 2017-12-04 14:07 ` Thomas Gleixner
  2017-12-04 14:07 ` [patch 50/60] x86/mm: Put MMU to hardware ASID translation in one place Thomas Gleixner
                   ` (13 subsequent siblings)
  62 siblings, 0 replies; 118+ messages in thread
From: Thomas Gleixner @ 2017-12-04 14:07 UTC (permalink / raw)
  To: LKML
  Cc: x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, keescook, hughd,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Rik van Riel,
	Boris Ostrovsky, Juergen Gross, David Laight, Eduardo Valentin,
	aliguori, Will Deacon, daniel.gruss, Dave Hansen, Ingo Molnar,
	moritz.lipp, linux-mm, Borislav Petkov, michael.schwarz,
	richard.fellner

[-- Attachment #0: x86-mm--Remove_hard-coded_ASID_limit_checks.patch --]
[-- Type: text/plain, Size: 2578 bytes --]

From: Dave Hansen <dave.hansen@linux.intel.com>

First, it's nice to remove the magic numbers.

Second, KERNEL_PAGE_TABLE_ISOLATION is going to consume half of the
available ASID space.  The space is currently unused, but add a comment to
spell out this new restriction.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: keescook@google.com
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: moritz.lipp@iaik.tugraz.at
Cc: linux-mm@kvack.org
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: hughd@google.com
Cc: daniel.gruss@iaik.tugraz.at
Cc: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: michael.schwarz@iaik.tugraz.at
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: richard.fellner@student.tugraz.at
Link: https://lkml.kernel.org/r/20171123003504.57EDB845@viggo.jf.intel.com

---
 arch/x86/include/asm/tlbflush.h |   20 ++++++++++++++++++--
 1 file changed, 18 insertions(+), 2 deletions(-)

--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -75,6 +75,22 @@ static inline u64 inc_mm_tlb_gen(struct
 	return new_tlb_gen;
 }
 
+/* There are 12 bits of space for ASIDS in CR3 */
+#define CR3_HW_ASID_BITS		12
+/*
+ * When enabled, KERNEL_PAGE_TABLE_ISOLATION consumes a single bit for
+ * user/kernel switches
+ */
+#define KPTI_CONSUMED_ASID_BITS		0
+
+#define CR3_AVAIL_ASID_BITS (CR3_HW_ASID_BITS - KPTI_CONSUMED_ASID_BITS)
+/*
+ * ASIDs are zero-based: 0->MAX_AVAIL_ASID are valid.  -1 below to account
+ * for them being zero-based.  Another -1 is because ASID 0 is reserved for
+ * use by non-PCID-aware users.
+ */
+#define MAX_ASID_AVAILABLE ((1 << CR3_AVAIL_ASID_BITS) - 2)
+
 /*
  * If PCID is on, ASID-aware code paths put the ASID+1 into the PCID bits.
  * This serves two purposes.  It prevents a nasty situation in which
@@ -87,7 +103,7 @@ struct pgd_t;
 static inline unsigned long build_cr3(pgd_t *pgd, u16 asid)
 {
 	if (static_cpu_has(X86_FEATURE_PCID)) {
-		VM_WARN_ON_ONCE(asid > 4094);
+		VM_WARN_ON_ONCE(asid > MAX_ASID_AVAILABLE);
 		return __sme_pa(pgd) | (asid + 1);
 	} else {
 		VM_WARN_ON_ONCE(asid != 0);
@@ -97,7 +113,7 @@ static inline unsigned long build_cr3(pg
 
 static inline unsigned long build_cr3_noflush(pgd_t *pgd, u16 asid)
 {
-	VM_WARN_ON_ONCE(asid > 4094);
+	VM_WARN_ON_ONCE(asid > MAX_ASID_AVAILABLE);
 	return __sme_pa(pgd) | (asid + 1) | CR3_NOFLUSH;
 }
 

^ permalink raw reply	[flat|nested] 118+ messages in thread

* [patch 50/60] x86/mm: Put MMU to hardware ASID translation in one place
  2017-12-04 14:07 [patch 00/60] x86/kpti: Kernel Page Table Isolation (was KAISER) Thomas Gleixner
                   ` (48 preceding siblings ...)
  2017-12-04 14:07 ` [patch 49/60] x86/mm: Remove hard-coded ASID limit checks Thomas Gleixner
@ 2017-12-04 14:07 ` Thomas Gleixner
  2017-12-04 14:07 ` [patch 51/60] x86/mm: Allow flushing for future ASID switches Thomas Gleixner
                   ` (12 subsequent siblings)
  62 siblings, 0 replies; 118+ messages in thread
From: Thomas Gleixner @ 2017-12-04 14:07 UTC (permalink / raw)
  To: LKML
  Cc: x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, keescook, hughd,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Rik van Riel,
	Boris Ostrovsky, Juergen Gross, David Laight, Eduardo Valentin,
	aliguori, Will Deacon, daniel.gruss, Dave Hansen, Ingo Molnar,
	moritz.lipp, linux-mm, Borislav Petkov, michael.schwarz,
	richard.fellner

[-- Attachment #0: x86-mm--Put_MMU-to-h-w_ASID_translation_in_one_place.patch --]
[-- Type: text/plain, Size: 3258 bytes --]

From: Dave Hansen <dave.hansen@linux.intel.com>

There are effectively two ASID types:

 1. The one stored in the mmu_context that goes from 0..5
 2. The one programmed into the hardware that goes from 1..6

This consolidates the locations where converting between the two (by doing
a +1) to a single place which gives us a nice place to comment.
KERNEL_PAGE_TABLE_ISOLATION will also need to, given an ASID, know which
hardware ASID to flush for the userspace mapping.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: keescook@google.com
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: moritz.lipp@iaik.tugraz.at
Cc: linux-mm@kvack.org
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: hughd@google.com
Cc: daniel.gruss@iaik.tugraz.at
Cc: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: michael.schwarz@iaik.tugraz.at
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: richard.fellner@student.tugraz.at
Link: https://lkml.kernel.org/r/20171123003506.67E81D7F@viggo.jf.intel.com

---
 arch/x86/include/asm/tlbflush.h |   29 ++++++++++++++++++-----------
 1 file changed, 18 insertions(+), 11 deletions(-)

--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -91,20 +91,26 @@ static inline u64 inc_mm_tlb_gen(struct
  */
 #define MAX_ASID_AVAILABLE ((1 << CR3_AVAIL_ASID_BITS) - 2)
 
-/*
- * If PCID is on, ASID-aware code paths put the ASID+1 into the PCID bits.
- * This serves two purposes.  It prevents a nasty situation in which
- * PCID-unaware code saves CR3, loads some other value (with PCID == 0),
- * and then restores CR3, thus corrupting the TLB for ASID 0 if the saved
- * ASID was nonzero.  It also means that any bugs involving loading a
- * PCID-enabled CR3 with CR4.PCIDE off will trigger deterministically.
- */
+static inline u16 kern_pcid(u16 asid)
+{
+	VM_WARN_ON_ONCE(asid > MAX_ASID_AVAILABLE);
+	/*
+	 * If PCID is on, ASID-aware code paths put the ASID+1 into the
+	 * PCID bits.  This serves two purposes.  It prevents a nasty
+	 * situation in which PCID-unaware code saves CR3, loads some other
+	 * value (with PCID == 0), and then restores CR3, thus corrupting
+	 * the TLB for ASID 0 if the saved ASID was nonzero.  It also means
+	 * that any bugs involving loading a PCID-enabled CR3 with
+	 * CR4.PCIDE off will trigger deterministically.
+	 */
+	return asid + 1;
+}
+
 struct pgd_t;
 static inline unsigned long build_cr3(pgd_t *pgd, u16 asid)
 {
 	if (static_cpu_has(X86_FEATURE_PCID)) {
-		VM_WARN_ON_ONCE(asid > MAX_ASID_AVAILABLE);
-		return __sme_pa(pgd) | (asid + 1);
+		return __sme_pa(pgd) | kern_pcid(asid);
 	} else {
 		VM_WARN_ON_ONCE(asid != 0);
 		return __sme_pa(pgd);
@@ -114,7 +120,8 @@ static inline unsigned long build_cr3(pg
 static inline unsigned long build_cr3_noflush(pgd_t *pgd, u16 asid)
 {
 	VM_WARN_ON_ONCE(asid > MAX_ASID_AVAILABLE);
-	return __sme_pa(pgd) | (asid + 1) | CR3_NOFLUSH;
+	VM_WARN_ON_ONCE(!this_cpu_has(X86_FEATURE_PCID));
+	return __sme_pa(pgd) | kern_pcid(asid) | CR3_NOFLUSH;
 }
 
 #ifdef CONFIG_PARAVIRT

^ permalink raw reply	[flat|nested] 118+ messages in thread

* [patch 51/60] x86/mm: Allow flushing for future ASID switches
  2017-12-04 14:07 [patch 00/60] x86/kpti: Kernel Page Table Isolation (was KAISER) Thomas Gleixner
                   ` (49 preceding siblings ...)
  2017-12-04 14:07 ` [patch 50/60] x86/mm: Put MMU to hardware ASID translation in one place Thomas Gleixner
@ 2017-12-04 14:07 ` Thomas Gleixner
  2017-12-04 22:22   ` Andy Lutomirski
  2017-12-04 14:07 ` [patch 52/60] x86/mm: Abstract switching CR3 Thomas Gleixner
                   ` (11 subsequent siblings)
  62 siblings, 1 reply; 118+ messages in thread
From: Thomas Gleixner @ 2017-12-04 14:07 UTC (permalink / raw)
  To: LKML
  Cc: x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, keescook, hughd,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Rik van Riel,
	Boris Ostrovsky, Juergen Gross, David Laight, Eduardo Valentin,
	aliguori, Will Deacon, daniel.gruss, Dave Hansen, Ingo Molnar,
	michael.schwarz, Borislav Petkov, moritz.lipp, richard.fellner

[-- Attachment #0: x86-mm--Allow_flushing_for_future_ASID_switches.patch --]
[-- Type: text/plain, Size: 5602 bytes --]

From: Dave Hansen <dave.hansen@linux.intel.com>

If changing the page tables in such a way that an invalidation of all
contexts (aka. PCIDs / ASIDs) is required, they can be actively invalidated
by:

 1. INVPCID for each PCID (works for single pages too).

 2. Load CR3 with each PCID without the NOFLUSH bit set

 3. Load CR3 with the NOFLUSH bit set for each and do INVLPG for each address.

But, none of these are really feasible since there are ~6 ASIDs (12 with
KERNEL_PAGE_TABLE_ISOLATION) at the time that invalidation is required.
Instead of actively invalidating them, invalidate the *current* context and
also mark the cpu_tlbstate _quickly_ to indicate future invalidation to be
required.

At the next context-switch, look for this indicator
('invalidate_other' being set) invalidate all of the
cpu_tlbstate.ctxs[] entries.

This ensures that any future context switches will do a full flush
of the TLB, picking up the previous changes.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: michael.schwarz@iaik.tugraz.at
Cc: daniel.gruss@iaik.tugraz.at
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: hughd@google.com
Cc: Borislav Petkov <bp@alien8.de>
Cc: moritz.lipp@iaik.tugraz.at
Cc: keescook@google.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: richard.fellner@student.tugraz.at
Link: https://lkml.kernel.org/r/20171123003507.E8C327F5@viggo.jf.intel.com

---
 arch/x86/include/asm/tlbflush.h |   42 ++++++++++++++++++++++++++++++----------
 arch/x86/mm/tlb.c               |   37 +++++++++++++++++++++++++++++++++++
 2 files changed, 69 insertions(+), 10 deletions(-)

--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -188,6 +188,17 @@ struct tlb_state {
 	bool is_lazy;
 
 	/*
+	 * If set we changed the page tables in such a way that we
+	 * needed an invalidation of all contexts (aka. PCIDs / ASIDs).
+	 * This tells us to go invalidate all the non-loaded ctxs[]
+	 * on the next context switch.
+	 *
+	 * The current ctx was kept up-to-date as it ran and does not
+	 * need to be invalidated.
+	 */
+	bool invalidate_other;
+
+	/*
 	 * Access to this CR4 shadow and to H/W CR4 is protected by
 	 * disabling interrupts when modifying either one.
 	 */
@@ -267,6 +278,19 @@ static inline unsigned long cr4_read_sha
 	return this_cpu_read(cpu_tlbstate.cr4);
 }
 
+static inline void invalidate_pcid_other(void)
+{
+	/*
+	 * With global pages, all of the shared kenel page tables
+	 * are set as _PAGE_GLOBAL.  We have no shared nonglobals
+	 * and nothing to do here.
+	 */
+	if (!static_cpu_has_bug(X86_BUG_CPU_SECURE_MODE_KPTI))
+		return;
+
+	this_cpu_write(cpu_tlbstate.invalidate_other, true);
+}
+
 /*
  * Save some of cr4 feature set we're using (e.g.  Pentium 4MB
  * enable and PPro Global page enable), so that any CPU's that boot
@@ -341,24 +365,22 @@ static inline void __native_flush_tlb_si
 
 static inline void __flush_tlb_all(void)
 {
-	if (boot_cpu_has(X86_FEATURE_PGE))
+	if (boot_cpu_has(X86_FEATURE_PGE)) {
 		__flush_tlb_global();
-	else
+	} else {
 		__flush_tlb();
-
-	/*
-	 * Note: if we somehow had PCID but not PGE, then this wouldn't work --
-	 * we'd end up flushing kernel translations for the current ASID but
-	 * we might fail to flush kernel translations for other cached ASIDs.
-	 *
-	 * To avoid this issue, we force PCID off if PGE is off.
-	 */
+	}
 }
 
 static inline void __flush_tlb_one(unsigned long addr)
 {
 	count_vm_tlb_event(NR_TLB_LOCAL_FLUSH_ONE);
 	__flush_tlb_single(addr);
+	/*
+	 * Invalidate other address spaces inaccessible to single-page
+	 * invalidation:
+	 */
+	invalidate_pcid_other();
 }
 
 #define TLB_FLUSH_ALL	-1UL
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -28,6 +28,38 @@
  *	Implement flush IPI by CALL_FUNCTION_VECTOR, Alex Shi
  */
 
+/*
+ * We get here when we do something requiring a TLB invalidation
+ * but could not go invalidate all of the contexts.  We do the
+ * necessary invalidation by clearing out the 'ctx_id' which
+ * forces a TLB flush when the context is loaded.
+ */
+void clear_asid_other(void)
+{
+	u16 asid;
+
+	/*
+	 * This is only expected to be set if we have disabled
+	 * kernel _PAGE_GLOBAL pages.
+	 */
+	if (!static_cpu_has_bug(X86_BUG_CPU_SECURE_MODE_KPTI)) {
+		WARN_ON_ONCE(1);
+		return;
+	}
+
+	for (asid = 0; asid < TLB_NR_DYN_ASIDS; asid++) {
+		/* Do not need to flush the current asid */
+		if (asid == this_cpu_read(cpu_tlbstate.loaded_mm_asid))
+			continue;
+		/*
+		 * Make sure the next time we go to switch to
+		 * this asid, we do a flush:
+		 */
+		this_cpu_write(cpu_tlbstate.ctxs[asid].ctx_id, 0);
+	}
+	this_cpu_write(cpu_tlbstate.invalidate_other, false);
+}
+
 atomic64_t last_mm_ctx_id = ATOMIC64_INIT(1);
 
 
@@ -42,6 +74,9 @@ static void choose_new_asid(struct mm_st
 		return;
 	}
 
+	if (this_cpu_read(cpu_tlbstate.invalidate_other))
+		clear_asid_other();
+
 	for (asid = 0; asid < TLB_NR_DYN_ASIDS; asid++) {
 		if (this_cpu_read(cpu_tlbstate.ctxs[asid].ctx_id) !=
 		    next->context.ctx_id)
@@ -552,6 +587,8 @@ static void do_kernel_range_flush(void *
 	/* flush range by one by one 'invlpg' */
 	for (addr = f->start; addr < f->end; addr += PAGE_SIZE)
 		__flush_tlb_single(addr);
+
+	invalidate_pcid_other();
 }
 
 void flush_tlb_kernel_range(unsigned long start, unsigned long end)

^ permalink raw reply	[flat|nested] 118+ messages in thread

* [patch 52/60] x86/mm: Abstract switching CR3
  2017-12-04 14:07 [patch 00/60] x86/kpti: Kernel Page Table Isolation (was KAISER) Thomas Gleixner
                   ` (50 preceding siblings ...)
  2017-12-04 14:07 ` [patch 51/60] x86/mm: Allow flushing for future ASID switches Thomas Gleixner
@ 2017-12-04 14:07 ` Thomas Gleixner
  2017-12-04 14:07 ` [patch 53/60] x86/mm: Use/Fix PCID to optimize user/kernel switches Thomas Gleixner
                   ` (10 subsequent siblings)
  62 siblings, 0 replies; 118+ messages in thread
From: Thomas Gleixner @ 2017-12-04 14:07 UTC (permalink / raw)
  To: LKML
  Cc: x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, keescook, hughd,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Rik van Riel,
	Boris Ostrovsky, Juergen Gross, David Laight, Eduardo Valentin,
	aliguori, Will Deacon, daniel.gruss, Dave Hansen

[-- Attachment #0: x86-mm--Abstract-switching-CR3.patch --]
[-- Type: text/plain, Size: 1914 bytes --]

From: Dave Hansen <dave.hansen@linux.intel.com>

In preparation to adding additional PCID flushing, abstract the
loading of a new ASID into CR3.

[ Peterz: Split out from big combo patch ]

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/mm/tlb.c |   22 ++++++++++++++++++++--
 1 file changed, 20 insertions(+), 2 deletions(-)

--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -100,6 +100,24 @@ static void choose_new_asid(struct mm_st
 	*need_flush = true;
 }
 
+static void load_new_mm_cr3(pgd_t *pgdir, u16 new_asid, bool need_flush)
+{
+	unsigned long new_mm_cr3;
+
+	if (need_flush) {
+		new_mm_cr3 = build_cr3(pgdir, new_asid);
+	} else {
+		new_mm_cr3 = build_cr3_noflush(pgdir, new_asid);
+	}
+
+	/*
+	 * Caution: many callers of this function expect
+	 * that load_cr3() is serializing and orders TLB
+	 * fills with respect to the mm_cpumask writes.
+	 */
+	write_cr3(new_mm_cr3);
+}
+
 void leave_mm(int cpu)
 {
 	struct mm_struct *loaded_mm = this_cpu_read(cpu_tlbstate.loaded_mm);
@@ -230,7 +248,7 @@ void switch_mm_irqs_off(struct mm_struct
 		if (need_flush) {
 			this_cpu_write(cpu_tlbstate.ctxs[new_asid].ctx_id, next->context.ctx_id);
 			this_cpu_write(cpu_tlbstate.ctxs[new_asid].tlb_gen, next_tlb_gen);
-			write_cr3(build_cr3(next->pgd, new_asid));
+			load_new_mm_cr3(next->pgd, new_asid, true);
 
 			/*
 			 * NB: This gets called via leave_mm() in the idle path
@@ -243,7 +261,7 @@ void switch_mm_irqs_off(struct mm_struct
 			trace_tlb_flush_rcuidle(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL);
 		} else {
 			/* The new ASID is already up to date. */
-			write_cr3(build_cr3_noflush(next->pgd, new_asid));
+			load_new_mm_cr3(next->pgd, new_asid, false);
 
 			/* See above wrt _rcuidle. */
 			trace_tlb_flush_rcuidle(TLB_FLUSH_ON_TASK_SWITCH, 0);

^ permalink raw reply	[flat|nested] 118+ messages in thread

* [patch 53/60] x86/mm: Use/Fix PCID to optimize user/kernel switches
  2017-12-04 14:07 [patch 00/60] x86/kpti: Kernel Page Table Isolation (was KAISER) Thomas Gleixner
                   ` (51 preceding siblings ...)
  2017-12-04 14:07 ` [patch 52/60] x86/mm: Abstract switching CR3 Thomas Gleixner
@ 2017-12-04 14:07 ` Thomas Gleixner
  2017-12-05 21:46   ` Andy Lutomirski
  2017-12-04 14:08 ` [patch 54/60] x86/mm: Optimize RESTORE_CR3 Thomas Gleixner
                   ` (9 subsequent siblings)
  62 siblings, 1 reply; 118+ messages in thread
From: Thomas Gleixner @ 2017-12-04 14:07 UTC (permalink / raw)
  To: LKML
  Cc: x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, keescook, hughd,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Rik van Riel,
	Boris Ostrovsky, Juergen Gross, David Laight, Eduardo Valentin,
	aliguori, Will Deacon, daniel.gruss

[-- Attachment #0: x86-mm--Use-Fix-PCID-to-optimize-user-kernel-switches.patch --]
[-- Type: text/plain, Size: 14260 bytes --]

We can use PCID to retain the TLBs across CR3 switches; including
those now part of the user/kernel switch. This increases performance
of kernel entry/exit at the cost of more expensive/complicated TLB
flushing.

Now that we have two address spaces, one for kernel and one for user
space, we need two PCIDs per mm. We use the top PCID bit to indicate a
user PCID (just like we use the PFN LSB for the PGD). Since we do TLB
invalidation from kernel space, the existing code will only invalidate
the kernel PCID, we augment that by marking the corresponding user
PCID invalid, and upon switching back to userspace, use a flushing CR3
write for the switch.

In order to access the user_pcid_flush_mask we use PER_CPU storage,
which means the previously established SWAPGS vs CR3 ordering is now
mandatory and required.

Having to do this memory access does require additional registers,
most sites have a functioning stack and we can spill one (RAX), sites
without functional stack need to otherwise provide the second scratch
register.

Note: PCID is generally available on Intel Sandybridge and later CPUs.
Note: Up until this point TLB flushing was broken in this series.

Based-on-code-from: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/entry/calling.h                    |   72 ++++++++++++++++++----
 arch/x86/entry/entry_64.S                   |    9 +-
 arch/x86/entry/entry_64_compat.S            |    4 -
 arch/x86/include/asm/processor-flags.h      |    5 +
 arch/x86/include/asm/tlbflush.h             |   91 ++++++++++++++++++++++++----
 arch/x86/include/uapi/asm/processor-flags.h |    7 +-
 arch/x86/kernel/asm-offsets.c               |    2 
 arch/x86/mm/init.c                          |    2 
 arch/x86/mm/tlb.c                           |    1 
 9 files changed, 160 insertions(+), 33 deletions(-)

--- a/arch/x86/entry/calling.h
+++ b/arch/x86/entry/calling.h
@@ -3,6 +3,9 @@
 #include <asm/unwind_hints.h>
 #include <asm/cpufeatures.h>
 #include <asm/page_types.h>
+#include <asm/percpu.h>
+#include <asm/asm-offsets.h>
+#include <asm/processor-flags.h>
 
 /*
 
@@ -191,17 +194,21 @@ For 32-bit we have the following convent
 
 #ifdef CONFIG_KERNEL_PAGE_TABLE_ISOLATION
 
-/* KERNEL_PAGE_TABLE_ISOLATION PGDs are 8k.  Flip bit 12 to switch between the two halves: */
-#define KPTI_SWITCH_MASK (1<<PAGE_SHIFT)
+/*
+ * KERNEL_PAGE_TABLE_ISOLATION PGDs are 8k.  Flip bit 12 to switch between the two
+ * halves:
+ */
+#define KPTI_SWITCH_PGTABLES_MASK	(1<<PAGE_SHIFT)
+#define KPTI_SWITCH_MASK		(KPTI_SWITCH_PGTABLES_MASK|(1<<X86_CR3_KPTI_SWITCH_BIT))
 
-.macro ADJUST_KERNEL_CR3 reg:req
-	/* Clear "KERNEL_PAGE_TABLE_ISOLATION bit", point CR3 at kernel pagetables: */
-	andq	$(~KPTI_SWITCH_MASK), \reg
+.macro SET_NOFLUSH_BIT	reg:req
+	bts	$X86_CR3_PCID_NOFLUSH_BIT, \reg
 .endm
 
-.macro ADJUST_USER_CR3 reg:req
-	/* Move CR3 up a page to the user page tables: */
-	orq	$(KPTI_SWITCH_MASK), \reg
+.macro ADJUST_KERNEL_CR3 reg:req
+	ALTERNATIVE "", "SET_NOFLUSH_BIT \reg", X86_FEATURE_PCID
+	/* Clear PCID and "KERNEL_PAGE_TABLE_ISOLATION bit", point CR3 at kernel pagetables: */
+	andq    $(~KPTI_SWITCH_MASK), \reg
 .endm
 
 .macro SWITCH_TO_KERNEL_CR3 scratch_reg:req
@@ -212,21 +219,58 @@ For 32-bit we have the following convent
 .Lend_\@:
 .endm
 
-.macro SWITCH_TO_USER_CR3 scratch_reg:req
+#define THIS_CPU_user_pcid_flush_mask   \
+	PER_CPU_VAR(cpu_tlbstate) + TLB_STATE_user_pcid_flush_mask
+
+.macro SWITCH_TO_USER_CR3_NOSTACK scratch_reg:req scratch_reg2:req
 	ALTERNATIVE "jmp .Lend_\@", "", X86_BUG_CPU_SECURE_MODE_KPTI
 	mov	%cr3, \scratch_reg
-	ADJUST_USER_CR3 \scratch_reg
+
+	ALTERNATIVE "jmp .Lwrcr3_\@", "", X86_FEATURE_PCID
+
+	/*
+	 * Test if the ASID needs a flush.
+	 */
+	movq	\scratch_reg, \scratch_reg2
+	andq	$(0x7FF), \scratch_reg		/* mask ASID */
+	bt	\scratch_reg, THIS_CPU_user_pcid_flush_mask
+	jnc	.Lnoflush_\@
+
+	/* Flush needed, clear the bit */
+	btr	\scratch_reg, THIS_CPU_user_pcid_flush_mask
+	movq	\scratch_reg2, \scratch_reg
+	jmp	.Lwrcr3_\@
+
+.Lnoflush_\@:
+	movq	\scratch_reg2, \scratch_reg
+	SET_NOFLUSH_BIT \scratch_reg
+
+.Lwrcr3_\@:
+	/* Flip the PGD and ASID to the user version */
+	orq     $(KPTI_SWITCH_MASK), \scratch_reg
 	mov	\scratch_reg, %cr3
 .Lend_\@:
 .endm
 
+.macro SWITCH_TO_USER_CR3_STACK	scratch_reg:req
+	pushq	%rax
+	SWITCH_TO_USER_CR3_NOSTACK scratch_reg=\scratch_reg scratch_reg2=%rax
+	popq	%rax
+.endm
+
 .macro SAVE_AND_SWITCH_TO_KERNEL_CR3 scratch_reg:req save_reg:req
 	ALTERNATIVE "jmp .Ldone_\@", "", X86_BUG_CPU_SECURE_MODE_KPTI
 	movq	%cr3, \scratch_reg
 	movq	\scratch_reg, \save_reg
 	/*
-	 * Is the switch bit zero?  This means the address is
-	 * up in real KERNEL_PAGE_TABLE_ISOLATION patches in a moment.
+	 * Is the "switch mask" all zero?  That means that both of
+	 * these are zero:
+	 *
+	 *	1. The user/kernel PCID bit, and
+	 *	2. The user/kernel "bit" that points CR3 to the
+	 *	   bottom half of the 8k PGD
+	 *
+	 * That indicates a kernel CR3 value, not a user CR3.
 	 */
 	testq	$(KPTI_SWITCH_MASK), \scratch_reg
 	jz	.Ldone_\@
@@ -251,7 +295,9 @@ For 32-bit we have the following convent
 
 .macro SWITCH_TO_KERNEL_CR3 scratch_reg:req
 .endm
-.macro SWITCH_TO_USER_CR3 scratch_reg:req
+.macro SWITCH_TO_USER_CR3_NOSTACK scratch_reg:req scratch_reg2:req
+.endm
+.macro SWITCH_TO_USER_CR3_STACK scratch_reg:req
 .endm
 .macro SAVE_AND_SWITCH_TO_KERNEL_CR3 scratch_reg:req save_reg:req
 .endm
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -23,7 +23,6 @@
 #include <asm/segment.h>
 #include <asm/cache.h>
 #include <asm/errno.h>
-#include "calling.h"
 #include <asm/asm-offsets.h>
 #include <asm/msr.h>
 #include <asm/unistd.h>
@@ -40,6 +39,8 @@
 #include <asm/frame.h>
 #include <linux/err.h>
 
+#include "calling.h"
+
 .code64
 .section .entry.text, "ax"
 
@@ -410,7 +411,7 @@ GLOBAL(entry_SYSCALL_64_after_hwframe)
 	 * We are on the trampoline stack.  All regs except RDI are live.
 	 * We can do future final exit work right here.
 	 */
-	SWITCH_TO_USER_CR3 scratch_reg=%rdi
+	SWITCH_TO_USER_CR3_STACK scratch_reg=%rdi
 
 	popq	%rdi
 	popq	%rsp
@@ -748,7 +749,7 @@ GLOBAL(swapgs_restore_regs_and_return_to
 	 * We can do future final exit work right here.
 	 */
 
-	SWITCH_TO_USER_CR3 scratch_reg=%rdi
+	SWITCH_TO_USER_CR3_STACK scratch_reg=%rdi
 
 	/* Restore RDI. */
 	popq	%rdi
@@ -861,7 +862,7 @@ ENTRY(native_iret)
 	 */
 	orq	PER_CPU_VAR(espfix_stack), %rax
 
-	SWITCH_TO_USER_CR3 scratch_reg=%rdi	/* to user CR3 */
+	SWITCH_TO_USER_CR3_STACK scratch_reg=%rdi
 	SWAPGS					/* to user GS */
 	popq	%rdi				/* Restore user RDI */
 
--- a/arch/x86/entry/entry_64_compat.S
+++ b/arch/x86/entry/entry_64_compat.S
@@ -275,9 +275,9 @@ GLOBAL(entry_SYSCALL_compat_after_hwfram
 	 * switch until after after the last reference to the process
 	 * stack.
 	 *
-	 * %r8 is zeroed before the sysret, thus safe to clobber.
+	 * %r8/%r9 are zeroed before the sysret, thus safe to clobber.
 	 */
-	SWITCH_TO_USER_CR3 scratch_reg=%r8
+	SWITCH_TO_USER_CR3_NOSTACK scratch_reg=%r8 scratch_reg2=%r9
 
 	xorq	%r8, %r8
 	xorq	%r9, %r9
--- a/arch/x86/include/asm/processor-flags.h
+++ b/arch/x86/include/asm/processor-flags.h
@@ -38,6 +38,11 @@
 #define CR3_ADDR_MASK	__sme_clr(0x7FFFFFFFFFFFF000ull)
 #define CR3_PCID_MASK	0xFFFull
 #define CR3_NOFLUSH	BIT_ULL(63)
+
+#ifdef CONFIG_KERNEL_PAGE_TABLE_ISOLATION
+# define X86_CR3_KPTI_SWITCH_BIT	11
+#endif
+
 #else
 /*
  * CR3_ADDR_MASK needs at least bits 31:5 set on PAE systems, and we save
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -9,6 +9,8 @@
 #include <asm/cpufeature.h>
 #include <asm/special_insns.h>
 #include <asm/smp.h>
+#include <asm/kpti.h>
+#include <asm/processor-flags.h>
 
 static inline void __invpcid(unsigned long pcid, unsigned long addr,
 			     unsigned long type)
@@ -77,24 +79,54 @@ static inline u64 inc_mm_tlb_gen(struct
 
 /* There are 12 bits of space for ASIDS in CR3 */
 #define CR3_HW_ASID_BITS		12
+
 /*
  * When enabled, KERNEL_PAGE_TABLE_ISOLATION consumes a single bit for
  * user/kernel switches
  */
-#define KPTI_CONSUMED_ASID_BITS		0
+#ifdef CONFIG_KERNEL_PAGE_TABLE_ISOLATION
+# define KPTI_CONSUMED_PCID_BITS	1
+#else
+# define KPTI_CONSUMED_PCID_BITS	0
+#endif
+
+#define CR3_AVAIL_PCID_BITS (X86_CR3_PCID_BITS - KPTI_CONSUMED_PCID_BITS)
 
-#define CR3_AVAIL_ASID_BITS (CR3_HW_ASID_BITS - KPTI_CONSUMED_ASID_BITS)
 /*
  * ASIDs are zero-based: 0->MAX_AVAIL_ASID are valid.  -1 below to account
  * for them being zero-based.  Another -1 is because ASID 0 is reserved for
  * use by non-PCID-aware users.
  */
-#define MAX_ASID_AVAILABLE ((1 << CR3_AVAIL_ASID_BITS) - 2)
+#define MAX_ASID_AVAILABLE ((1 << CR3_AVAIL_PCID_BITS) - 2)
+
+/*
+ * 6 because 6 should be plenty and struct tlb_state will fit in two cache
+ * lines.
+ */
+#define TLB_NR_DYN_ASIDS	6
 
 static inline u16 kern_pcid(u16 asid)
 {
 	VM_WARN_ON_ONCE(asid > MAX_ASID_AVAILABLE);
+
+#ifdef CONFIG_KERNEL_PAGE_TABLE_ISOLATION
+	/*
+	 * Make sure that the dynamic ASID space does not confict with the
+	 * bit we are using to switch between user and kernel ASIDs.
+	 */
+	BUILD_BUG_ON(TLB_NR_DYN_ASIDS >= (1 << X86_CR3_KPTI_SWITCH_BIT));
+
 	/*
+	 * The ASID being passed in here should have respected the
+	 * MAX_ASID_AVAILABLE and thus never have the switch bit set.
+	 */
+	VM_WARN_ON_ONCE(asid & (1 << X86_CR3_KPTI_SWITCH_BIT));
+#endif
+	/*
+	 * The dynamically-assigned ASIDs that get passed in are small
+	 * (<TLB_NR_DYN_ASIDS).  They never have the high switch bit set,
+	 * so do not bother to clear it.
+	 *
 	 * If PCID is on, ASID-aware code paths put the ASID+1 into the
 	 * PCID bits.  This serves two purposes.  It prevents a nasty
 	 * situation in which PCID-unaware code saves CR3, loads some other
@@ -148,12 +180,6 @@ static inline bool tlb_defer_switch_to_i
 	return !static_cpu_has(X86_FEATURE_PCID);
 }
 
-/*
- * 6 because 6 should be plenty and struct tlb_state will fit in
- * two cache lines.
- */
-#define TLB_NR_DYN_ASIDS 6
-
 struct tlb_context {
 	u64 ctx_id;
 	u64 tlb_gen;
@@ -199,6 +225,13 @@ struct tlb_state {
 	bool invalidate_other;
 
 	/*
+	 * Mask that contains TLB_NR_DYN_ASIDS+1 bits to indicate
+	 * the corresponding user PCID needs a flush next time we
+	 * switch to it; see SWITCH_TO_USER_CR3.
+	 */
+	unsigned short user_pcid_flush_mask;
+
+	/*
 	 * Access to this CR4 shadow and to H/W CR4 is protected by
 	 * disabling interrupts when modifying either one.
 	 */
@@ -310,12 +343,39 @@ static inline void cr4_set_bits_and_upda
 
 extern void initialize_tlbstate_and_flush(void);
 
+/*
+ * Given an ASID, flush the corresponding user ASID.  We can delay this
+ * until the next time we switch to it.
+ *
+ * See SWITCH_TO_USER_CR3.
+ */
+static inline void invalidate_user_asid(u16 asid)
+{
+	/* There is no user ASID if address space separation is off */
+	if (!IS_ENABLED(CONFIG_KERNEL_PAGE_TABLE_ISOLATION))
+		return;
+
+	/*
+	 * We only have a single ASID if PCID is off and the CR3
+	 * write will have flushed it.
+	 */
+	if (!cpu_feature_enabled(X86_FEATURE_PCID))
+		return;
+
+	if (!static_cpu_has_bug(X86_BUG_CPU_SECURE_MODE_KPTI))
+		return;
+
+	__set_bit(kern_pcid(asid),
+		  (unsigned long *)this_cpu_ptr(&cpu_tlbstate.user_pcid_flush_mask));
+}
+
 static inline void __native_flush_tlb(void)
 {
+	invalidate_user_asid(this_cpu_read(cpu_tlbstate.loaded_mm_asid));
 	/*
-	 * If current->mm == NULL then we borrow a mm which may change during a
-	 * task switch and therefore we must not be preempted while we write CR3
-	 * back:
+	 * If current->mm == NULL then we borrow a mm which may change
+	 * during a task switch and therefore we must not be preempted
+	 * while we write CR3 back:
 	 */
 	preempt_disable();
 	native_write_cr3(__native_read_cr3());
@@ -360,7 +420,14 @@ static inline void __native_flush_tlb_gl
 
 static inline void __native_flush_tlb_single(unsigned long addr)
 {
+	u32 loaded_mm_asid = this_cpu_read(cpu_tlbstate.loaded_mm_asid);
+
 	asm volatile("invlpg (%0)" ::"r" (addr) : "memory");
+
+	if (!static_cpu_has_bug(X86_BUG_CPU_SECURE_MODE_KPTI))
+		return;
+
+	invalidate_user_asid(loaded_mm_asid);
 }
 
 static inline void __flush_tlb_all(void)
--- a/arch/x86/include/uapi/asm/processor-flags.h
+++ b/arch/x86/include/uapi/asm/processor-flags.h
@@ -78,7 +78,12 @@
 #define X86_CR3_PWT		_BITUL(X86_CR3_PWT_BIT)
 #define X86_CR3_PCD_BIT		4 /* Page Cache Disable */
 #define X86_CR3_PCD		_BITUL(X86_CR3_PCD_BIT)
-#define X86_CR3_PCID_MASK	_AC(0x00000fff,UL) /* PCID Mask */
+
+#define X86_CR3_PCID_BITS	12
+#define X86_CR3_PCID_MASK	(_AC((1UL << X86_CR3_PCID_BITS) - 1, UL))
+
+#define X86_CR3_PCID_NOFLUSH_BIT 63 /* Preserve old PCID */
+#define X86_CR3_PCID_NOFLUSH    _BITULL(X86_CR3_PCID_NOFLUSH_BIT)
 
 /*
  * Intel CPU features in CR4
--- a/arch/x86/kernel/asm-offsets.c
+++ b/arch/x86/kernel/asm-offsets.c
@@ -17,6 +17,7 @@
 #include <asm/sigframe.h>
 #include <asm/bootparam.h>
 #include <asm/suspend.h>
+#include <asm/tlbflush.h>
 
 #ifdef CONFIG_XEN
 #include <xen/interface/xen.h>
@@ -97,6 +98,7 @@ void common(void) {
 	/* Layout info for cpu_entry_area */
 	OFFSET(CPU_ENTRY_AREA_tss, cpu_entry_area, tss);
 	OFFSET(CPU_ENTRY_AREA_entry_trampoline, cpu_entry_area, entry_trampoline);
+	OFFSET(TLB_STATE_user_pcid_flush_mask, tlb_state, user_pcid_flush_mask);
 	OFFSET(CPU_ENTRY_AREA_SYSENTER_stack, cpu_entry_area, SYSENTER_stack_page);
 	DEFINE(SIZEOF_SYSENTER_stack, sizeof(struct SYSENTER_stack));
 }
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -855,7 +855,7 @@ void __init zone_sizes_init(void)
 	free_area_init_nodes(max_zone_pfns);
 }
 
-DEFINE_PER_CPU_SHARED_ALIGNED(struct tlb_state, cpu_tlbstate) = {
+__visible DEFINE_PER_CPU_SHARED_ALIGNED(struct tlb_state, cpu_tlbstate) = {
 	.loaded_mm = &init_mm,
 	.next_asid = 1,
 	.cr4 = ~0UL,	/* fail hard if we screw up cr4 shadow initialization */
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -105,6 +105,7 @@ static void load_new_mm_cr3(pgd_t *pgdir
 	unsigned long new_mm_cr3;
 
 	if (need_flush) {
+		invalidate_user_asid(new_asid);
 		new_mm_cr3 = build_cr3(pgdir, new_asid);
 	} else {
 		new_mm_cr3 = build_cr3_noflush(pgdir, new_asid);

^ permalink raw reply	[flat|nested] 118+ messages in thread

* [patch 54/60] x86/mm: Optimize RESTORE_CR3
  2017-12-04 14:07 [patch 00/60] x86/kpti: Kernel Page Table Isolation (was KAISER) Thomas Gleixner
                   ` (52 preceding siblings ...)
  2017-12-04 14:07 ` [patch 53/60] x86/mm: Use/Fix PCID to optimize user/kernel switches Thomas Gleixner
@ 2017-12-04 14:08 ` Thomas Gleixner
  2017-12-04 14:08 ` [patch 55/60] x86/mm: Use INVPCID for __native_flush_tlb_single() Thomas Gleixner
                   ` (8 subsequent siblings)
  62 siblings, 0 replies; 118+ messages in thread
From: Thomas Gleixner @ 2017-12-04 14:08 UTC (permalink / raw)
  To: LKML
  Cc: x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, keescook, hughd,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Rik van Riel,
	Boris Ostrovsky, Juergen Gross, David Laight, Eduardo Valentin,
	aliguori, Will Deacon, daniel.gruss

[-- Attachment #0: x86-mm--Optimize-RESTORE_CR3.patch --]
[-- Type: text/plain, Size: 2485 bytes --]

Most NMI/paranoid exceptions will not in fact change pagetables and
would thus not require TLB flushing, however RESTORE_CR3 uses flushing
CR3 writes.

Restores to kernel PCIDs can be NOFLUSH, because we explicitly flush
the kernel mappings and now that we track which user PCIDs need
flushing we can avoid those too when possible.

This does mean RESTORE_CR3 needs an additional scratch_reg, luckily
both sites have plenty available.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/entry/calling.h  |   30 ++++++++++++++++++++++++++++--
 arch/x86/entry/entry_64.S |    4 ++--
 2 files changed, 30 insertions(+), 4 deletions(-)

--- a/arch/x86/entry/calling.h
+++ b/arch/x86/entry/calling.h
@@ -281,8 +281,34 @@ For 32-bit we have the following convent
 .Ldone_\@:
 .endm
 
-.macro RESTORE_CR3 save_reg:req
+.macro RESTORE_CR3 scratch_reg:req save_reg:req
 	ALTERNATIVE "jmp .Lend_\@", "", X86_BUG_CPU_SECURE_MODE_KPTI
+
+	ALTERNATIVE "jmp .Lwrcr3_\@", "", X86_FEATURE_PCID
+
+	/*
+	 * KERNEL pages can always resume with NOFLUSH as we do
+	 * explicit flushes.
+	 */
+	bt	$X86_CR3_KPTI_SWITCH_BIT, \save_reg
+	jnc	.Lnoflush_\@
+
+	/*
+	 * Check if there's a pending flush for the user ASID we're
+	 * about to set.
+	 */
+	movq	\save_reg, \scratch_reg
+	andq	$(0x7FF), \scratch_reg
+	bt	\scratch_reg, THIS_CPU_user_pcid_flush_mask
+	jnc	.Lnoflush_\@
+
+	btr	\scratch_reg, THIS_CPU_user_pcid_flush_mask
+	jmp	.Lwrcr3_\@
+
+.Lnoflush_\@:
+	SET_NOFLUSH_BIT \save_reg
+
+.Lwrcr3_\@:
 	/*
 	 * The CR3 write could be avoided when not changing its value,
 	 * but would require a CR3 read *and* a scratch register.
@@ -301,7 +327,7 @@ For 32-bit we have the following convent
 .endm
 .macro SAVE_AND_SWITCH_TO_KERNEL_CR3 scratch_reg:req save_reg:req
 .endm
-.macro RESTORE_CR3 save_reg:req
+.macro RESTORE_CR3 scratch_reg:req save_reg:req
 .endm
 
 #endif
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -1294,7 +1294,7 @@ ENTRY(paranoid_exit)
 	testl	%ebx, %ebx			/* swapgs needed? */
 	jnz	.Lparanoid_exit_no_swapgs
 	TRACE_IRQS_IRETQ
-	RESTORE_CR3	save_reg=%r14
+	RESTORE_CR3	scratch_reg=%rbx save_reg=%r14
 	SWAPGS_UNSAFE_STACK
 	jmp	.Lparanoid_exit_restore
 .Lparanoid_exit_no_swapgs:
@@ -1736,7 +1736,7 @@ ENTRY(nmi)
 	movq	$-1, %rsi
 	call	do_nmi
 
-	RESTORE_CR3 save_reg=%r14
+	RESTORE_CR3 scratch_reg=%r15 save_reg=%r14
 
 	testl	%ebx, %ebx			/* swapgs needed? */
 	jnz	nmi_restore

^ permalink raw reply	[flat|nested] 118+ messages in thread

* [patch 55/60] x86/mm: Use INVPCID for __native_flush_tlb_single()
  2017-12-04 14:07 [patch 00/60] x86/kpti: Kernel Page Table Isolation (was KAISER) Thomas Gleixner
                   ` (53 preceding siblings ...)
  2017-12-04 14:08 ` [patch 54/60] x86/mm: Optimize RESTORE_CR3 Thomas Gleixner
@ 2017-12-04 14:08 ` Thomas Gleixner
  2017-12-04 22:25   ` Andy Lutomirski
  2017-12-04 14:08 ` [patch 56/60] x86/mm/kpti: Disable native VSYSCALL Thomas Gleixner
                   ` (7 subsequent siblings)
  62 siblings, 1 reply; 118+ messages in thread
From: Thomas Gleixner @ 2017-12-04 14:08 UTC (permalink / raw)
  To: LKML
  Cc: x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, keescook, hughd,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Rik van Riel,
	Boris Ostrovsky, Juergen Gross, David Laight, Eduardo Valentin,
	aliguori, Will Deacon, daniel.gruss, Dave Hansen

[-- Attachment #0: x86-mm--Use-INVPCID-for__native_flush_tlb_single.patch --]
[-- Type: text/plain, Size: 5047 bytes --]

From: Dave Hansen <dave.hansen@linux.intel.com>

This uses INVPCID to shoot down individual lines of the user mapping
instead of marking the entire user map as invalid. This
could/might/possibly be faster.

This for sure needs tlb_single_page_flush_ceiling to be redetermined;
esp. since INVPCID is _slow_.

[ Peterz: Split out from big combo patch ]

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/include/asm/cpufeatures.h |    1 
 arch/x86/include/asm/tlbflush.h    |   23 ++++++++++++-
 arch/x86/mm/init.c                 |   64 +++++++++++++++++++++----------------
 3 files changed, 60 insertions(+), 28 deletions(-)

--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -197,6 +197,7 @@
 #define X86_FEATURE_CAT_L3		( 7*32+ 4) /* Cache Allocation Technology L3 */
 #define X86_FEATURE_CAT_L2		( 7*32+ 5) /* Cache Allocation Technology L2 */
 #define X86_FEATURE_CDP_L3		( 7*32+ 6) /* Code and Data Prioritization L3 */
+#define X86_FEATURE_INVPCID_SINGLE	( 7*32+ 7) /* Effectively INVPCID && CR4.PCIDE=1 */
 
 #define X86_FEATURE_HW_PSTATE		( 7*32+ 8) /* AMD HW-PState */
 #define X86_FEATURE_PROC_FEEDBACK	( 7*32+ 9) /* AMD ProcFeedbackInterface */
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -138,6 +138,18 @@ static inline u16 kern_pcid(u16 asid)
 	return asid + 1;
 }
 
+/*
+ * The user PCID is just the kernel one, plus the "switch bit".
+ */
+static inline u16 user_pcid(u16 asid)
+{
+	u16 ret = kern_pcid(asid);
+#ifdef CONFIG_KERNEL_PAGE_TABLE_ISOLATION
+	ret |= 1 << X86_CR3_KPTI_SWITCH_BIT;
+#endif
+	return ret;
+}
+
 struct pgd_t;
 static inline unsigned long build_cr3(pgd_t *pgd, u16 asid)
 {
@@ -401,6 +413,8 @@ static inline void __native_flush_tlb_gl
 		/*
 		 * Using INVPCID is considerably faster than a pair of writes
 		 * to CR4 sandwiched inside an IRQ flag save/restore.
+		 *
+		 * Note, this works with CR4.PCIDE=0 or 1.
 		 */
 		invpcid_flush_all();
 		return;
@@ -427,7 +441,14 @@ static inline void __native_flush_tlb_si
 	if (!static_cpu_has_bug(X86_BUG_CPU_SECURE_MODE_KPTI))
 		return;
 
-	invalidate_user_asid(loaded_mm_asid);
+	/*
+	 * Some platforms #GP if we call invpcid(type=1/2) before CR4.PCIDE=1.
+	 * Just use invalidate_user_asid() in case we are called early.
+	 */
+	if (!this_cpu_has(X86_FEATURE_INVPCID_SINGLE))
+		invalidate_user_asid(loaded_mm_asid);
+	else
+		invpcid_flush_one(user_pcid(loaded_mm_asid), addr);
 }
 
 static inline void __flush_tlb_all(void)
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -203,34 +203,44 @@ static void __init probe_page_size_mask(
 
 static void setup_pcid(void)
 {
-#ifdef CONFIG_X86_64
-	if (boot_cpu_has(X86_FEATURE_PCID)) {
-		if (boot_cpu_has(X86_FEATURE_PGE)) {
-			/*
-			 * This can't be cr4_set_bits_and_update_boot() --
-			 * the trampoline code can't handle CR4.PCIDE and
-			 * it wouldn't do any good anyway.  Despite the name,
-			 * cr4_set_bits_and_update_boot() doesn't actually
-			 * cause the bits in question to remain set all the
-			 * way through the secondary boot asm.
-			 *
-			 * Instead, we brute-force it and set CR4.PCIDE
-			 * manually in start_secondary().
-			 */
-			cr4_set_bits(X86_CR4_PCIDE);
-		} else {
-			/*
-			 * flush_tlb_all(), as currently implemented, won't
-			 * work if PCID is on but PGE is not.  Since that
-			 * combination doesn't exist on real hardware, there's
-			 * no reason to try to fully support it, but it's
-			 * polite to avoid corrupting data if we're on
-			 * an improperly configured VM.
-			 */
-			setup_clear_cpu_cap(X86_FEATURE_PCID);
-		}
+	if (!IS_ENABLED(CONFIG_X86_64))
+		return;
+
+	if (!boot_cpu_has(X86_FEATURE_PCID))
+		return;
+
+	if (boot_cpu_has(X86_FEATURE_PGE)) {
+		/*
+		 * This can't be cr4_set_bits_and_update_boot() -- the
+		 * trampoline code can't handle CR4.PCIDE and it wouldn't
+		 * do any good anyway.  Despite the name,
+		 * cr4_set_bits_and_update_boot() doesn't actually cause
+		 * the bits in question to remain set all the way through
+		 * the secondary boot asm.
+		 *
+		 * Instead, we brute-force it and set CR4.PCIDE manually in
+		 * start_secondary().
+		 */
+		cr4_set_bits(X86_CR4_PCIDE);
+
+		/*
+		 * INVPCID's single-context modes (2/3) only work if we set
+		 * X86_CR4_PCIDE, *and* we INVPCID support.  It's unusable
+		 * on systems that have X86_CR4_PCIDE clear, or that have
+		 * no INVPCID support at all.
+		 */
+		if (boot_cpu_has(X86_FEATURE_INVPCID))
+			setup_force_cpu_cap(X86_FEATURE_INVPCID_SINGLE);
+	} else {
+		/*
+		 * flush_tlb_all(), as currently implemented, won't work if
+		 * PCID is on but PGE is not.  Since that combination
+		 * doesn't exist on real hardware, there's no reason to try
+		 * to fully support it, but it's polite to avoid corrupting
+		 * data if we're on an improperly configured VM.
+		 */
+		setup_clear_cpu_cap(X86_FEATURE_PCID);
 	}
-#endif
 }
 
 #ifdef CONFIG_X86_32

^ permalink raw reply	[flat|nested] 118+ messages in thread

* [patch 56/60] x86/mm/kpti: Disable native VSYSCALL
  2017-12-04 14:07 [patch 00/60] x86/kpti: Kernel Page Table Isolation (was KAISER) Thomas Gleixner
                   ` (54 preceding siblings ...)
  2017-12-04 14:08 ` [patch 55/60] x86/mm: Use INVPCID for __native_flush_tlb_single() Thomas Gleixner
@ 2017-12-04 14:08 ` Thomas Gleixner
  2017-12-04 22:33   ` Andy Lutomirski
  2017-12-04 14:08 ` [patch 57/60] x86/mm/kpti: Add Kconfig Thomas Gleixner
                   ` (6 subsequent siblings)
  62 siblings, 1 reply; 118+ messages in thread
From: Thomas Gleixner @ 2017-12-04 14:08 UTC (permalink / raw)
  To: LKML
  Cc: x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, keescook, hughd,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Rik van Riel,
	Boris Ostrovsky, Juergen Gross, David Laight, Eduardo Valentin,
	aliguori, Will Deacon, daniel.gruss, Dave Hansen, Ingo Molnar,
	moritz.lipp, linux-mm, Borislav Petkov, michael.schwarz,
	richard.fellner

[-- Attachment #0: x86-mm-kpti--Disable_native_VSYSCALL.patch --]
[-- Type: text/plain, Size: 2636 bytes --]

From: Dave Hansen <dave.hansen@linux.intel.com>

The KERNEL_PAGE_TABLE_ISOLATION code attempts to "poison" the user
portion of the kernel page tables. It detects entries that it wants that it
wants to poison in two ways:

 * Looking for addresses >= PAGE_OFFSET

 * Looking for entries without _PAGE_USER set

But, to allow the _PAGE_USER check to work, it must never be set on
init_mm entries, and an earlier patch in this series ensured that it
will never be set.

The VDSO is at a address >= PAGE_OFFSET and it is also mapped by init_mm.
Because of the earlier, KERNEL_PAGE_TABLE_ISOLATION-enforced restriction,
_PAGE_USER is never set which makes the VDSO unreadable to userspace.

This makes the "NATIVE" case totally unusable since userspace can not even
see the memory any more.  Disable it whenever KERNEL_PAGE_TABLE_ISOLATION
is enabled.

Also add some help text about how KERNEL_PAGE_TABLE_ISOLATION might
affect the emulation case as well.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: keescook@google.com
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: moritz.lipp@iaik.tugraz.at
Cc: linux-mm@kvack.org
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: hughd@google.com
Cc: daniel.gruss@iaik.tugraz.at
Cc: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: michael.schwarz@iaik.tugraz.at
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: richard.fellner@student.tugraz.at
Link: https://lkml.kernel.org/r/20171123003513.10CAD896@viggo.jf.intel.com

---
 arch/x86/Kconfig |    8 ++++++++
 1 file changed, 8 insertions(+)

--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -2249,6 +2249,9 @@ choice
 
 	config LEGACY_VSYSCALL_NATIVE
 		bool "Native"
+		# The VSYSCALL page comes from the kernel page tables
+		# and is not available when KERNEL_PAGE_TABLE_ISOLATION is enabled.
+		depends on !KERNEL_PAGE_TABLE_ISOLATION
 		help
 		  Actual executable code is located in the fixed vsyscall
 		  address mapping, implementing time() efficiently. Since
@@ -2266,6 +2269,11 @@ choice
 		  exploits. This configuration is recommended when userspace
 		  still uses the vsyscall area.
 
+		  When KERNEL_PAGE_TABLE_ISOLATION is enabled, the vsyscall area will become
+		  unreadable.  This emulation option still works, but KERNEL_PAGE_TABLE_ISOLATION
+		  will make it harder to do things like trace code using the
+		  emulation.
+
 	config LEGACY_VSYSCALL_NONE
 		bool "None"
 		help

^ permalink raw reply	[flat|nested] 118+ messages in thread

* [patch 57/60] x86/mm/kpti: Add Kconfig
  2017-12-04 14:07 [patch 00/60] x86/kpti: Kernel Page Table Isolation (was KAISER) Thomas Gleixner
                   ` (55 preceding siblings ...)
  2017-12-04 14:08 ` [patch 56/60] x86/mm/kpti: Disable native VSYSCALL Thomas Gleixner
@ 2017-12-04 14:08 ` Thomas Gleixner
  2017-12-04 16:54   ` Andy Lutomirski
  2017-12-04 14:08 ` [patch 58/60] x86/mm/debug_pagetables: Add page table directory Thomas Gleixner
                   ` (5 subsequent siblings)
  62 siblings, 1 reply; 118+ messages in thread
From: Thomas Gleixner @ 2017-12-04 14:08 UTC (permalink / raw)
  To: LKML
  Cc: x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, keescook, hughd,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Rik van Riel,
	Boris Ostrovsky, Juergen Gross, David Laight, Eduardo Valentin,
	aliguori, Will Deacon, daniel.gruss, Dave Hansen, Ingo Molnar,
	moritz.lipp, linux-mm, Borislav Petkov, michael.schwarz,
	richard.fellner

[-- Attachment #0: x86-mm-kpti--Add_Kconfig.patch --]
[-- Type: text/plain, Size: 2346 bytes --]

From: Dave Hansen <dave.hansen@linux.intel.com>

Finally allow CONFIG_KERNEL_PAGE_TABLE_ISOLATION to be enabled.

PARAVIRT generally requires that the kernel not manage its own page tables.
It also means that the hypervisor and kernel must agree wholeheartedly
about what format the page tables are in and what they contain.
KERNEL_PAGE_TABLE_ISOLATION, unfortunately, changes the rules and they
can not be used together.

I've seen conflicting feedback from maintainers lately about whether they
want the Kconfig magic to go first or last in a patch series.  It's going
last here because the partially-applied series leads to kernels that can
not boot in a bunch of cases.  I did a run through the entire series with
CONFIG_KERNEL_PAGE_TABLE_ISOLATION=y to look for build errors, though.

[ tglx: Removed SMP and !PARAVIRT dependencies as they not longer exist ]

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: keescook@google.com
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: moritz.lipp@iaik.tugraz.at
Cc: linux-mm@kvack.org
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: hughd@google.com
Cc: daniel.gruss@iaik.tugraz.at
Cc: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: michael.schwarz@iaik.tugraz.at
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: richard.fellner@student.tugraz.at
Link: https://lkml.kernel.org/r/20171123003524.88C90659@viggo.jf.intel.com

---
 security/Kconfig |   10 ++++++++++
 1 file changed, 10 insertions(+)

--- a/security/Kconfig
+++ b/security/Kconfig
@@ -54,6 +54,16 @@ config SECURITY_NETWORK
 	  implement socket and networking access controls.
 	  If you are unsure how to answer this question, answer N.
 
+config KERNEL_PAGE_TABLE_ISOLATION
+	bool "Remove the kernel mapping in user mode"
+	depends on X86_64 && JUMP_LABEL
+	help
+	  This feature reduces the number of hardware side channels by
+	  ensuring that the majority of kernel addresses are not mapped
+	  into userspace.
+
+	  See Documentation/x86/pagetable-isolation.txt for more details.
+
 config SECURITY_INFINIBAND
 	bool "Infiniband Security Hooks"
 	depends on SECURITY && INFINIBAND

^ permalink raw reply	[flat|nested] 118+ messages in thread

* [patch 58/60] x86/mm/debug_pagetables: Add page table directory
  2017-12-04 14:07 [patch 00/60] x86/kpti: Kernel Page Table Isolation (was KAISER) Thomas Gleixner
                   ` (56 preceding siblings ...)
  2017-12-04 14:08 ` [patch 57/60] x86/mm/kpti: Add Kconfig Thomas Gleixner
@ 2017-12-04 14:08 ` Thomas Gleixner
  2017-12-04 14:08 ` [patch 59/60] x86/mm/dump_pagetables: Check user space page table for WX pages Thomas Gleixner
                   ` (4 subsequent siblings)
  62 siblings, 0 replies; 118+ messages in thread
From: Thomas Gleixner @ 2017-12-04 14:08 UTC (permalink / raw)
  To: LKML
  Cc: x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, keescook, hughd,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Rik van Riel,
	Boris Ostrovsky, Juergen Gross, David Laight, Eduardo Valentin,
	aliguori, Will Deacon, daniel.gruss, Borislav Petkov

[-- Attachment #0: x86-mm-debug_pagetables-Add_pagetable_dir.patch --]
[-- Type: text/plain, Size: 1253 bytes --]

From: Borislav Petkov <bp@suse.de>

The upcoming support for dumping the kernel and the user space page tables
of the current process would create more random files in the top level
debugfs directory.

Add a page table directory and move the existing file to it.

Signed-off-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

---
 arch/x86/mm/debug_pagetables.c |   15 ++++++++++-----
 1 file changed, 10 insertions(+), 5 deletions(-)

--- a/arch/x86/mm/debug_pagetables.c
+++ b/arch/x86/mm/debug_pagetables.c
@@ -22,21 +22,26 @@ static const struct file_operations ptdu
 	.release	= single_release,
 };
 
-static struct dentry *pe;
+static struct dentry *dir, *pe;
 
 static int __init pt_dump_debug_init(void)
 {
-	pe = debugfs_create_file("kernel_page_tables", S_IRUSR, NULL, NULL,
-				 &ptdump_fops);
-	if (!pe)
+	dir = debugfs_create_dir("page_tables", NULL);
+	if (!dir)
 		return -ENOMEM;
 
+	pe = debugfs_create_file("kernel", 0400, dir, NULL, &ptdump_fops);
+	if (!pe)
+		goto err;
 	return 0;
+err:
+	debugfs_remove_recursive(dir);
+	return -ENOMEM;
 }
 
 static void __exit pt_dump_debug_exit(void)
 {
-	debugfs_remove_recursive(pe);
+	debugfs_remove_recursive(dir);
 }
 
 module_init(pt_dump_debug_init);

^ permalink raw reply	[flat|nested] 118+ messages in thread

* [patch 59/60] x86/mm/dump_pagetables: Check user space page table for WX pages
  2017-12-04 14:07 [patch 00/60] x86/kpti: Kernel Page Table Isolation (was KAISER) Thomas Gleixner
                   ` (57 preceding siblings ...)
  2017-12-04 14:08 ` [patch 58/60] x86/mm/debug_pagetables: Add page table directory Thomas Gleixner
@ 2017-12-04 14:08 ` Thomas Gleixner
  2017-12-04 14:08 ` [patch 60/60] x86/mm/debug_pagetables: Allow dumping current pagetables Thomas Gleixner
                   ` (3 subsequent siblings)
  62 siblings, 0 replies; 118+ messages in thread
From: Thomas Gleixner @ 2017-12-04 14:08 UTC (permalink / raw)
  To: LKML
  Cc: x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, keescook, hughd,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Rik van Riel,
	Boris Ostrovsky, Juergen Gross, David Laight, Eduardo Valentin,
	aliguori, Will Deacon, daniel.gruss, moritz.lipp, linux-mm,
	Dave Hansen, Borislav Petkov, michael.schwarz, richard.fellner

[-- Attachment #0: x86-mm-dump_pagetables--Check_shadow_page_table_for_WX_pages.patch --]
[-- Type: text/plain, Size: 3416 bytes --]

From: Thomas Gleixner <tglx@linutronix.de>

ptdump_walk_pgd_level_checkwx() checks the kernel page table for WX pages,
but does not check the KERNEL_PAGE_TABLE_ISOLATION user space page table.

Restructure the code so that dmesg output is selected by an explicit
argument and not implicit via checking the pgd argument for !NULL.

Add the check for the user space page table.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: keescook@google.com
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: moritz.lipp@iaik.tugraz.at
Cc: linux-mm@kvack.org
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: hughd@google.com
Cc: daniel.gruss@iaik.tugraz.at
Cc: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: michael.schwarz@iaik.tugraz.at
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: richard.fellner@student.tugraz.at

---
 arch/x86/include/asm/pgtable.h |    1 +
 arch/x86/mm/debug_pagetables.c |    2 +-
 arch/x86/mm/dump_pagetables.c  |   30 +++++++++++++++++++++++++-----
 3 files changed, 27 insertions(+), 6 deletions(-)

--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -28,6 +28,7 @@ extern pgd_t early_top_pgt[PTRS_PER_PGD]
 int __init __early_make_pgtable(unsigned long address, pmdval_t pmd);
 
 void ptdump_walk_pgd_level(struct seq_file *m, pgd_t *pgd);
+void ptdump_walk_pgd_level_debugfs(struct seq_file *m, pgd_t *pgd);
 void ptdump_walk_pgd_level_checkwx(void);
 
 #ifdef CONFIG_DEBUG_WX
--- a/arch/x86/mm/debug_pagetables.c
+++ b/arch/x86/mm/debug_pagetables.c
@@ -5,7 +5,7 @@
 
 static int ptdump_show(struct seq_file *m, void *v)
 {
-	ptdump_walk_pgd_level(m, NULL);
+	ptdump_walk_pgd_level_debugfs(m, NULL);
 	return 0;
 }
 
--- a/arch/x86/mm/dump_pagetables.c
+++ b/arch/x86/mm/dump_pagetables.c
@@ -447,7 +447,7 @@ static inline bool is_hypervisor_range(i
 }
 
 static void ptdump_walk_pgd_level_core(struct seq_file *m, pgd_t *pgd,
-				       bool checkwx)
+				       bool checkwx, bool dmesg)
 {
 #ifdef CONFIG_X86_64
 	pgd_t *start = (pgd_t *) &init_top_pgt;
@@ -460,7 +460,7 @@ static void ptdump_walk_pgd_level_core(s
 
 	if (pgd) {
 		start = pgd;
-		st.to_dmesg = true;
+		st.to_dmesg = dmesg;
 	}
 
 	st.check_wx = checkwx;
@@ -498,13 +498,33 @@ static void ptdump_walk_pgd_level_core(s
 
 void ptdump_walk_pgd_level(struct seq_file *m, pgd_t *pgd)
 {
-	ptdump_walk_pgd_level_core(m, pgd, false);
+	ptdump_walk_pgd_level_core(m, pgd, false, true);
+}
+
+void ptdump_walk_pgd_level_debugfs(struct seq_file *m, pgd_t *pgd)
+{
+	ptdump_walk_pgd_level_core(m, pgd, false, false);
+}
+EXPORT_SYMBOL_GPL(ptdump_walk_pgd_level_debugfs);
+
+static void ptdump_walk_user_pgd_level_checkwx(void)
+{
+#ifdef CONFIG_KERNEL_PAGE_TABLE_ISOLATION
+	pgd_t *pgd = (pgd_t *) &init_top_pgt;
+
+	if (!static_cpu_has_bug(X86_BUG_CPU_SECURE_MODE_KPTI))
+		return;
+
+	pr_info("x86/mm: Checking user space page tables\n");
+	pgd = kernel_to_user_pgdp(pgd);
+	ptdump_walk_pgd_level_core(NULL, pgd, true, false);
+#endif
 }
-EXPORT_SYMBOL_GPL(ptdump_walk_pgd_level);
 
 void ptdump_walk_pgd_level_checkwx(void)
 {
-	ptdump_walk_pgd_level_core(NULL, NULL, true);
+	ptdump_walk_pgd_level_core(NULL, NULL, true, false);
+	ptdump_walk_user_pgd_level_checkwx();
 }
 
 static int __init pt_dump_init(void)

^ permalink raw reply	[flat|nested] 118+ messages in thread

* [patch 60/60] x86/mm/debug_pagetables: Allow dumping current pagetables
  2017-12-04 14:07 [patch 00/60] x86/kpti: Kernel Page Table Isolation (was KAISER) Thomas Gleixner
                   ` (58 preceding siblings ...)
  2017-12-04 14:08 ` [patch 59/60] x86/mm/dump_pagetables: Check user space page table for WX pages Thomas Gleixner
@ 2017-12-04 14:08 ` Thomas Gleixner
  2017-12-04 18:02 ` [patch 00/60] x86/kpti: Kernel Page Table Isolation (was KAISER) Linus Torvalds
                   ` (2 subsequent siblings)
  62 siblings, 0 replies; 118+ messages in thread
From: Thomas Gleixner @ 2017-12-04 14:08 UTC (permalink / raw)
  To: LKML
  Cc: x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, keescook, hughd,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Rik van Riel,
	Boris Ostrovsky, Juergen Gross, David Laight, Eduardo Valentin,
	aliguori, Will Deacon, daniel.gruss, moritz.lipp, linux-mm,
	Dave Hansen, Borislav Petkov, michael.schwarz, richard.fellner

[-- Attachment #0: x86-mm-debug_pagetables--Allow_dumping_current_pagetables.patch --]
[-- Type: text/plain, Size: 4916 bytes --]

From: Thomas Gleixner <tglx@linutronix.de>

Add two debugfs files which allow to dump the pagetable of the current
task.

current_kernel dumps the regular page table. This is the page table which
is normally shared between kernel and user space. If kernel page table
isolation is enabled this is the kernel space mapping.

If kernel page table isolation is enabled the second file, current_user,
dumps the user space page table.

These files allow to verify the resulting page tables for page table
isolation, but even in the normal case its useful to be able to inspect
user space page tables of current for debugging purposes.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: keescook@google.com
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: moritz.lipp@iaik.tugraz.at
Cc: linux-mm@kvack.org
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: hughd@google.com
Cc: daniel.gruss@iaik.tugraz.at
Cc: Borislav Petkov <bp@alien8.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: michael.schwarz@iaik.tugraz.at
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: richard.fellner@student.tugraz.at

---
 arch/x86/include/asm/pgtable.h |    2 -
 arch/x86/mm/debug_pagetables.c |   71 ++++++++++++++++++++++++++++++++++++++---
 arch/x86/mm/dump_pagetables.c  |    6 ++-
 3 files changed, 73 insertions(+), 6 deletions(-)

--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -28,7 +28,7 @@ extern pgd_t early_top_pgt[PTRS_PER_PGD]
 int __init __early_make_pgtable(unsigned long address, pmdval_t pmd);
 
 void ptdump_walk_pgd_level(struct seq_file *m, pgd_t *pgd);
-void ptdump_walk_pgd_level_debugfs(struct seq_file *m, pgd_t *pgd);
+void ptdump_walk_pgd_level_debugfs(struct seq_file *m, pgd_t *pgd, bool user);
 void ptdump_walk_pgd_level_checkwx(void);
 
 #ifdef CONFIG_DEBUG_WX
--- a/arch/x86/mm/debug_pagetables.c
+++ b/arch/x86/mm/debug_pagetables.c
@@ -5,7 +5,7 @@
 
 static int ptdump_show(struct seq_file *m, void *v)
 {
-	ptdump_walk_pgd_level_debugfs(m, NULL);
+	ptdump_walk_pgd_level_debugfs(m, NULL, false);
 	return 0;
 }
 
@@ -22,7 +22,57 @@ static const struct file_operations ptdu
 	.release	= single_release,
 };
 
-static struct dentry *dir, *pe;
+static int ptdump_show_curknl(struct seq_file *m, void *v)
+{
+	if (current->mm->pgd) {
+		down_read(&current->mm->mmap_sem);
+		ptdump_walk_pgd_level_debugfs(m, current->mm->pgd, false);
+		up_read(&current->mm->mmap_sem);
+	}
+	return 0;
+}
+
+static int ptdump_open_curknl(struct inode *inode, struct file *filp)
+{
+	return single_open(filp, ptdump_show_curknl, NULL);
+}
+
+static const struct file_operations ptdump_curknl_fops = {
+	.owner		= THIS_MODULE,
+	.open		= ptdump_open_curknl,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.release	= single_release,
+};
+
+#ifdef CONFIG_KERNEL_PAGE_TABLE_ISOLATION
+static struct dentry *pe_curusr;
+
+static int ptdump_show_curusr(struct seq_file *m, void *v)
+{
+	if (current->mm->pgd) {
+		down_read(&current->mm->mmap_sem);
+		ptdump_walk_pgd_level_debugfs(m, current->mm->pgd, true);
+		up_read(&current->mm->mmap_sem);
+	}
+	return 0;
+}
+
+static int ptdump_open_curusr(struct inode *inode, struct file *filp)
+{
+	return single_open(filp, ptdump_show_curusr, NULL);
+}
+
+static const struct file_operations ptdump_curusr_fops = {
+	.owner		= THIS_MODULE,
+	.open		= ptdump_open_curusr,
+	.read		= seq_read,
+	.llseek		= seq_lseek,
+	.release	= single_release,
+};
+#endif
+
+static struct dentry *dir, *pe_knl, *pe_curknl;
 
 static int __init pt_dump_debug_init(void)
 {
@@ -30,9 +80,22 @@ static int __init pt_dump_debug_init(voi
 	if (!dir)
 		return -ENOMEM;
 
-	pe = debugfs_create_file("kernel", 0400, dir, NULL, &ptdump_fops);
-	if (!pe)
+	pe_knl = debugfs_create_file("kernel", 0400, dir, NULL,
+				     &ptdump_fops);
+	if (!pe_knl)
+		goto err;
+
+	pe_curknl = debugfs_create_file("current_kernel", 0400,
+					dir, NULL, &ptdump_curknl_fops);
+	if (!pe_curknl)
+		goto err;
+
+#ifdef CONFIG_KERNEL_PAGE_TABLE_ISOLATION
+	pe_curusr = debugfs_create_file("current_user", 0400,
+					dir, NULL, &ptdump_curusr_fops);
+	if (!pe_curusr)
 		goto err;
+#endif
 	return 0;
 err:
 	debugfs_remove_recursive(dir);
--- a/arch/x86/mm/dump_pagetables.c
+++ b/arch/x86/mm/dump_pagetables.c
@@ -501,8 +501,12 @@ void ptdump_walk_pgd_level(struct seq_fi
 	ptdump_walk_pgd_level_core(m, pgd, false, true);
 }
 
-void ptdump_walk_pgd_level_debugfs(struct seq_file *m, pgd_t *pgd)
+void ptdump_walk_pgd_level_debugfs(struct seq_file *m, pgd_t *pgd, bool user)
 {
+#ifdef CONFIG_KERNEL_PAGE_TABLE_ISOLATION
+	if (user && static_cpu_has_bug(X86_BUG_CPU_SECURE_MODE_KPTI))
+		pgd = kernel_to_user_pgdp(pgd);
+#endif
 	ptdump_walk_pgd_level_core(m, pgd, false, false);
 }
 EXPORT_SYMBOL_GPL(ptdump_walk_pgd_level_debugfs);

^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [patch 57/60] x86/mm/kpti: Add Kconfig
  2017-12-04 14:08 ` [patch 57/60] x86/mm/kpti: Add Kconfig Thomas Gleixner
@ 2017-12-04 16:54   ` Andy Lutomirski
  2017-12-04 16:57     ` Thomas Gleixner
  0 siblings, 1 reply; 118+ messages in thread
From: Andy Lutomirski @ 2017-12-04 16:54 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, X86 ML, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, Kees Cook, Hugh Dickins,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Rik van Riel,
	Boris Ostrovsky, Juergen Gross, David Laight, Eduardo Valentin,
	aliguori, Will Deacon, Daniel Gruss, Dave Hansen, Ingo Molnar,
	moritz.lipp, linux-mm, Borislav Petkov, michael.schwarz,
	richard.fellner

On Mon, Dec 4, 2017 at 6:08 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> From: Dave Hansen <dave.hansen@linux.intel.com>
>
> Finally allow CONFIG_KERNEL_PAGE_TABLE_ISOLATION to be enabled.
>
> PARAVIRT generally requires that the kernel not manage its own page tables.
> It also means that the hypervisor and kernel must agree wholeheartedly
> about what format the page tables are in and what they contain.
> KERNEL_PAGE_TABLE_ISOLATION, unfortunately, changes the rules and they
> can not be used together.
>
> I've seen conflicting feedback from maintainers lately about whether they
> want the Kconfig magic to go first or last in a patch series.  It's going
> last here because the partially-applied series leads to kernels that can
> not boot in a bunch of cases.  I did a run through the entire series with
> CONFIG_KERNEL_PAGE_TABLE_ISOLATION=y to look for build errors, though.
>
> [ tglx: Removed SMP and !PARAVIRT dependencies as they not longer exist ]
>
> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
> Signed-off-by: Ingo Molnar <mingo@kernel.org>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: keescook@google.com
> Cc: Denys Vlasenko <dvlasenk@redhat.com>
> Cc: moritz.lipp@iaik.tugraz.at
> Cc: linux-mm@kvack.org
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Brian Gerst <brgerst@gmail.com>
> Cc: hughd@google.com
> Cc: daniel.gruss@iaik.tugraz.at
> Cc: Borislav Petkov <bp@alien8.de>
> Cc: Andy Lutomirski <luto@kernel.org>
> Cc: Josh Poimboeuf <jpoimboe@redhat.com>
> Cc: michael.schwarz@iaik.tugraz.at
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: richard.fellner@student.tugraz.at
> Link: https://lkml.kernel.org/r/20171123003524.88C90659@viggo.jf.intel.com
>
> ---
>  security/Kconfig |   10 ++++++++++
>  1 file changed, 10 insertions(+)
>
> --- a/security/Kconfig
> +++ b/security/Kconfig
> @@ -54,6 +54,16 @@ config SECURITY_NETWORK
>           implement socket and networking access controls.
>           If you are unsure how to answer this question, answer N.
>
> +config KERNEL_PAGE_TABLE_ISOLATION
> +       bool "Remove the kernel mapping in user mode"
> +       depends on X86_64 && JUMP_LABEL

select JUMP_LABEL perhaps?

> +       help
> +         This feature reduces the number of hardware side channels by
> +         ensuring that the majority of kernel addresses are not mapped
> +         into userspace.
> +
> +         See Documentation/x86/pagetable-isolation.txt for more details.
> +
>  config SECURITY_INFINIBAND
>         bool "Infiniband Security Hooks"
>         depends on SECURITY && INFINIBAND
>
>

^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [patch 57/60] x86/mm/kpti: Add Kconfig
  2017-12-04 16:54   ` Andy Lutomirski
@ 2017-12-04 16:57     ` Thomas Gleixner
  2017-12-05  9:34       ` Thomas Gleixner
  0 siblings, 1 reply; 118+ messages in thread
From: Thomas Gleixner @ 2017-12-04 16:57 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: LKML, X86 ML, Linus Torvalds, Peter Zijlstra, Dave Hansen,
	Borislav Petkov, Greg KH, Kees Cook, Hugh Dickins, Brian Gerst,
	Josh Poimboeuf, Denys Vlasenko, Rik van Riel, Boris Ostrovsky,
	Juergen Gross, David Laight, Eduardo Valentin, aliguori,
	Will Deacon, Daniel Gruss, Dave Hansen, Ingo Molnar, moritz.lipp,
	linux-mm, Borislav Petkov, michael.schwarz, richard.fellner

On Mon, 4 Dec 2017, Andy Lutomirski wrote:
> On Mon, Dec 4, 2017 at 6:08 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> > --- a/security/Kconfig
> > +++ b/security/Kconfig
> > @@ -54,6 +54,16 @@ config SECURITY_NETWORK
> >           implement socket and networking access controls.
> >           If you are unsure how to answer this question, answer N.
> >
> > +config KERNEL_PAGE_TABLE_ISOLATION
> > +       bool "Remove the kernel mapping in user mode"
> > +       depends on X86_64 && JUMP_LABEL
> 
> select JUMP_LABEL perhaps?

Silly me. Yes.

^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [patch 00/60] x86/kpti: Kernel Page Table Isolation (was KAISER)
  2017-12-04 14:07 [patch 00/60] x86/kpti: Kernel Page Table Isolation (was KAISER) Thomas Gleixner
                   ` (59 preceding siblings ...)
  2017-12-04 14:08 ` [patch 60/60] x86/mm/debug_pagetables: Allow dumping current pagetables Thomas Gleixner
@ 2017-12-04 18:02 ` Linus Torvalds
  2017-12-04 18:18   ` Thomas Gleixner
  2017-12-05 21:49 ` Andy Lutomirski
  2018-01-19 20:56 ` Andrew Morton
  62 siblings, 1 reply; 118+ messages in thread
From: Linus Torvalds @ 2017-12-04 18:02 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, the arch/x86 maintainers, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, Kees Cook, Hugh Dickins,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Rik van Riel,
	Boris Ostrovsky, Juergen Gross, David Laight, Eduardo Valentin,
	Liguori, Anthony, Will Deacon, Daniel Gruss

On Mon, Dec 4, 2017 at 6:07 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
>
>     Kernel Page Table Isolation, prefix kpti_
>
>    Linus, your call :)

I think you probably chose the right name here. The alternatives sound
intriguing, but probably not the right thing to do.

How much of this is considered worth trying to integrate early?
Clearly I'm not taking all of it.

                    Linus

^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [patch 00/60] x86/kpti: Kernel Page Table Isolation (was KAISER)
  2017-12-04 18:02 ` [patch 00/60] x86/kpti: Kernel Page Table Isolation (was KAISER) Linus Torvalds
@ 2017-12-04 18:18   ` Thomas Gleixner
  2017-12-04 18:21     ` Boris Ostrovsky
  2017-12-04 18:28     ` Linus Torvalds
  0 siblings, 2 replies; 118+ messages in thread
From: Thomas Gleixner @ 2017-12-04 18:18 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: LKML, the arch/x86 maintainers, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, Kees Cook, Hugh Dickins,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Rik van Riel,
	Boris Ostrovsky, Juergen Gross, David Laight, Eduardo Valentin,
	Liguori, Anthony, Will Deacon, Daniel Gruss

On Mon, 4 Dec 2017, Linus Torvalds wrote:
> On Mon, Dec 4, 2017 at 6:07 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> >
> >     Kernel Page Table Isolation, prefix kpti_
> >
> >    Linus, your call :)
> 
> I think you probably chose the right name here. The alternatives sound
> intriguing, but probably not the right thing to do.
> 
> How much of this is considered worth trying to integrate early?

Probably the entry changes, but we need to sort out that fixmap issue first
and that affects the entry changes as well. Give me a day or two and I can
tell you.

> Clearly I'm not taking all of it.

I did not expect that.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [patch 14/60] x86/entry: Remap the TSS into the CPU entry area
  2017-12-04 14:07 ` [patch 14/60] x86/entry: Remap the TSS into the CPU entry area Thomas Gleixner
@ 2017-12-04 18:20   ` Borislav Petkov
  0 siblings, 0 replies; 118+ messages in thread
From: Borislav Petkov @ 2017-12-04 18:20 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Greg KH, keescook, hughd, Brian Gerst,
	Josh Poimboeuf, Denys Vlasenko, Rik van Riel, Boris Ostrovsky,
	Juergen Gross, David Laight, Eduardo Valentin, aliguori,
	Will Deacon, daniel.gruss, Ingo Molnar, Dave Hansen

On Mon, Dec 04, 2017 at 03:07:20PM +0100, Thomas Gleixner wrote:
> From: Andy Lutomirski <luto@kernel.org>
> 
> This has a secondary purpose: it puts the entry stack into a region
> with a well-controlled layout.  A subsequent patch will take
> advantage of this to streamline the SYSCALL entry code to be able to
> find it more easily.
> 
> Signed-off-by: Andy Lutomirski <luto@kernel.org>
> Signed-off-by: Ingo Molnar <mingo@kernel.org>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: Denys Vlasenko <dvlasenk@redhat.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Brian Gerst <brgerst@gmail.com>
> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: Josh Poimboeuf <jpoimboe@redhat.com>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Link: https://lkml.kernel.org/r/cdcba7e1e82122461b3ca36bb3ef6713ba605e35.1511497875.git.luto@kernel.org
> 
> ---
>  arch/x86/entry/entry_32.S     |    6 ++++--
>  arch/x86/include/asm/fixmap.h |    7 +++++++
>  arch/x86/kernel/asm-offsets.c |    3 +++
>  arch/x86/kernel/cpu/common.c  |   41 +++++++++++++++++++++++++++++++++++------
>  arch/x86/kernel/dumpstack.c   |    3 ++-
>  arch/x86/kvm/vmx.c            |    2 +-
>  arch/x86/power/cpu.c          |   11 ++++++-----
>  7 files changed, 58 insertions(+), 15 deletions(-)

Reviewed-by: Borislav Petkov <bp@suse.de>

-- 
Regards/Gruss,
    Boris.

SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
-- 

^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [patch 00/60] x86/kpti: Kernel Page Table Isolation (was KAISER)
  2017-12-04 18:18   ` Thomas Gleixner
@ 2017-12-04 18:21     ` Boris Ostrovsky
  2017-12-04 18:28     ` Linus Torvalds
  1 sibling, 0 replies; 118+ messages in thread
From: Boris Ostrovsky @ 2017-12-04 18:21 UTC (permalink / raw)
  To: Thomas Gleixner, Linus Torvalds
  Cc: LKML, the arch/x86 maintainers, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, Kees Cook, Hugh Dickins,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Rik van Riel,
	Juergen Gross, David Laight, Eduardo Valentin, Liguori, Anthony,
	Will Deacon, Daniel Gruss

On 12/04/2017 01:18 PM, Thomas Gleixner wrote:
> On Mon, 4 Dec 2017, Linus Torvalds wrote:
>> On Mon, Dec 4, 2017 at 6:07 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
>>>     Kernel Page Table Isolation, prefix kpti_
>>>
>>>    Linus, your call :)
>> I think you probably chose the right name here. The alternatives sound
>> intriguing, but probably not the right thing to do.
>>
>> How much of this is considered worth trying to integrate early?
> Probably the entry changes, but we need to sort out that fixmap issue first
> and that affects the entry changes as well. Give me a day or two and I can
> tell you.

This series breaks Xen PV.

When I tested it last time it was patch 17 (of this series). I don't
know whether it breaks now due to the same patch, I haven't had a chance
to look into this yet, sorry.

-boris


>> Clearly I'm not taking all of it.
> I did not expect that.
>
> Thanks,
>
> 	tglx

^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [patch 00/60] x86/kpti: Kernel Page Table Isolation (was KAISER)
  2017-12-04 18:18   ` Thomas Gleixner
  2017-12-04 18:21     ` Boris Ostrovsky
@ 2017-12-04 18:28     ` Linus Torvalds
  1 sibling, 0 replies; 118+ messages in thread
From: Linus Torvalds @ 2017-12-04 18:28 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, the arch/x86 maintainers, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, Kees Cook, Hugh Dickins,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Rik van Riel,
	Boris Ostrovsky, Juergen Gross, David Laight, Eduardo Valentin,
	Liguori, Anthony, Will Deacon, Daniel Gruss

On Mon, Dec 4, 2017 at 10:18 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
>>
>> How much of this is considered worth trying to integrate early?
>
> Probably the entry changes, but we need to sort out that fixmap issue first
> and that affects the entry changes as well. Give me a day or two and I can
> tell you.

Sure. I've skimmed through the patches, and a number of the early ones
seem to be "obviously safe and independently nice cleanups". Even the
sysenter stack setup etc that isn't really required without the other
work seems sane and fine.

In fact, I have to say that the patches themselves look very good.
Nothing made me go "Christ, what an ugly hack". Maybe that is because
of just the skimming through, but still, it was not an unpleasant
read-through.

The problem, of course, is how *subtle* all the interactions are, and
how one missed "oh, the CPU also needs this" makes for some really
nasty breakage.  So it may all look nice and clean, and then blow up
horribly in some very particular configuration.

And yes, paravirtualization is evil.

            Linus

^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [patch 22/60] x86/entry: Clean up the SYSENTER_stack code
  2017-12-04 14:07 ` [patch 22/60] x86/entry: Clean up the SYSENTER_stack code Thomas Gleixner
@ 2017-12-04 19:41   ` Borislav Petkov
  0 siblings, 0 replies; 118+ messages in thread
From: Borislav Petkov @ 2017-12-04 19:41 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Greg KH, keescook, hughd, Brian Gerst,
	Josh Poimboeuf, Denys Vlasenko, Rik van Riel, Boris Ostrovsky,
	Juergen Gross, David Laight, Eduardo Valentin, aliguori,
	Will Deacon, daniel.gruss, Ingo Molnar, Dave Hansen

On Mon, Dec 04, 2017 at 03:07:28PM +0100, Thomas Gleixner wrote:
> From: Andy Lutomirski <luto@kernel.org>
> 
> The existing code was a mess, mainly because C arrays are nasty.  Turn
> SYSENTER_stack into a struct, add a helper to find it, and do all the
> obvious cleanups this enables.
> 
> Signed-off-by: Andy Lutomirski <luto@kernel.org>
> Signed-off-by: Ingo Molnar <mingo@kernel.org>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: Denys Vlasenko <dvlasenk@redhat.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Brian Gerst <brgerst@gmail.com>
> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: Josh Poimboeuf <jpoimboe@redhat.com>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Link: https://lkml.kernel.org/r/38ff640712c9b591b32de24a080daf13afaba234.1511497875.git.luto@kernel.org
> 
> ---
>  arch/x86/entry/entry_32.S        |    4 ++--
>  arch/x86/entry/entry_64.S        |    2 +-
>  arch/x86/include/asm/fixmap.h    |    5 +++++
>  arch/x86/include/asm/processor.h |    6 +++++-
>  arch/x86/kernel/asm-offsets.c    |    6 ++----
>  arch/x86/kernel/cpu/common.c     |   14 +++-----------
>  arch/x86/kernel/dumpstack.c      |    7 +++----
>  7 files changed, 21 insertions(+), 23 deletions(-)

Reviewed-by: Borislav Petkov <bp@suse.de>

-- 
Regards/Gruss,
    Boris.

SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
-- 

^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [patch 23/60] x86/entry/64: Make cpu_entry_area.tss read-only
  2017-12-04 14:07 ` [patch 23/60] x86/entry/64: Make cpu_entry_area.tss read-only Thomas Gleixner
@ 2017-12-04 20:25   ` Borislav Petkov
  0 siblings, 0 replies; 118+ messages in thread
From: Borislav Petkov @ 2017-12-04 20:25 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Greg KH, keescook, hughd, Brian Gerst,
	Josh Poimboeuf, Denys Vlasenko, Rik van Riel, Boris Ostrovsky,
	Juergen Gross, David Laight, Eduardo Valentin, aliguori,
	Will Deacon, daniel.gruss, Kees Cook

On Mon, Dec 04, 2017 at 03:07:29PM +0100, Thomas Gleixner wrote:
> From: Andy Lutomirski <luto@kernel.org>
> 
> The TSS is a fairly juicy target for exploits, and, now that the TSS
> is in the cpu_entry_area, it's no longer protected by kASLR.  Make it
> read-only on x86_64.
> 
> On x86_32, it can't be RO because it's written by the CPU during task
> switches, and we use a task gate for double faults.  I'd also be
> nervous about errata if we tried to make it RO even on configurations
> without double fault handling.
> 
> [ tglx: AMD confirmed that there is no problem on 64bit with TSS RO.  So
>   	it's probably safe to assume that it's a non issue, though Intel
>   	might have been creative in that area. Still waiting for
>   	confirmation. ]
> 
> Signed-off-by: Andy Lutomirski <luto@kernel.org>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> Cc: Kees Cook <keescook@chromium.org>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Brian Gerst <brgerst@gmail.com>
> Cc: David Laight <David.Laight@aculab.com>
> Cc: Borislav Petkov <bp@alien8.de>
> Link: https://lkml.kernel.org/r/7d2f65f86a46e3489ba996932554485c3d345632.1512109321.git.luto@kernel.org
> 
> ---
>  arch/x86/entry/entry_32.S          |    4 ++--
>  arch/x86/entry/entry_64.S          |    8 ++++----
>  arch/x86/include/asm/fixmap.h      |   13 +++++++++----
>  arch/x86/include/asm/processor.h   |   17 ++++++++---------
>  arch/x86/include/asm/switch_to.h   |    4 ++--
>  arch/x86/include/asm/thread_info.h |    2 +-
>  arch/x86/kernel/asm-offsets.c      |    5 ++---
>  arch/x86/kernel/asm-offsets_32.c   |    4 ++--
>  arch/x86/kernel/cpu/common.c       |   29 +++++++++++++++++++----------
>  arch/x86/kernel/ioport.c           |    2 +-
>  arch/x86/kernel/process.c          |    6 +++---
>  arch/x86/kernel/process_32.c       |    2 +-
>  arch/x86/kernel/process_64.c       |    2 +-
>  arch/x86/kernel/traps.c            |    4 ++--
>  arch/x86/lib/delay.c               |    4 ++--
>  arch/x86/xen/enlighten_pv.c        |    2 +-
>  16 files changed, 60 insertions(+), 48 deletions(-)

Reviewed-by: Borislav Petkov <bp@suse.de>

-- 
Regards/Gruss,
    Boris.

SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
-- 

^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [patch 02/60] x86/unwinder/orc: Dont bail on stack overflow
  2017-12-04 14:07 ` [patch 02/60] x86/unwinder/orc: Dont bail on stack overflow Thomas Gleixner
@ 2017-12-04 20:31   ` Andy Lutomirski
  2017-12-04 21:31     ` Thomas Gleixner
  0 siblings, 1 reply; 118+ messages in thread
From: Andy Lutomirski @ 2017-12-04 20:31 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, keescook, hughd,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Rik van Riel,
	Boris Ostrovsky, Juergen Gross, David Laight, Eduardo Valentin,
	aliguori, Will Deacon, daniel.gruss, Ingo Molnar

Fixlet lost here

> On Dec 4, 2017, at 6:07 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> 
> regs->sp

Should be state->sp

^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [patch 02/60] x86/unwinder/orc: Dont bail on stack overflow
  2017-12-04 20:31   ` Andy Lutomirski
@ 2017-12-04 21:31     ` Thomas Gleixner
  0 siblings, 0 replies; 118+ messages in thread
From: Thomas Gleixner @ 2017-12-04 21:31 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: LKML, x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, keescook, hughd,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Rik van Riel,
	Boris Ostrovsky, Juergen Gross, David Laight, Eduardo Valentin,
	aliguori, Will Deacon, daniel.gruss, Ingo Molnar

On Mon, 4 Dec 2017, Andy Lutomirski wrote:
> Fixlet lost here
> 
> > On Dec 4, 2017, at 6:07 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> > 
> > regs->sp
> 
> Should be state->sp

Fixed.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [patch 51/60] x86/mm: Allow flushing for future ASID switches
  2017-12-04 14:07 ` [patch 51/60] x86/mm: Allow flushing for future ASID switches Thomas Gleixner
@ 2017-12-04 22:22   ` Andy Lutomirski
  2017-12-04 22:34     ` Dave Hansen
  2017-12-04 22:47     ` Peter Zijlstra
  0 siblings, 2 replies; 118+ messages in thread
From: Andy Lutomirski @ 2017-12-04 22:22 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, X86 ML, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, Kees Cook, Hugh Dickins,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Rik van Riel,
	Boris Ostrovsky, Juergen Gross, David Laight, Eduardo Valentin,
	aliguori, Will Deacon, Daniel Gruss, Dave Hansen, Ingo Molnar,
	michael.schwarz, Borislav Petkov, moritz.lipp, richard.fellner

On Mon, Dec 4, 2017 at 6:07 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> From: Dave Hansen <dave.hansen@linux.intel.com>
>
> If changing the page tables in such a way that an invalidation of all
> contexts (aka. PCIDs / ASIDs) is required, they can be actively invalidated
> by:
>
>  1. INVPCID for each PCID (works for single pages too).
>
>  2. Load CR3 with each PCID without the NOFLUSH bit set
>
>  3. Load CR3 with the NOFLUSH bit set for each and do INVLPG for each address.
>
> But, none of these are really feasible since there are ~6 ASIDs (12 with
> KERNEL_PAGE_TABLE_ISOLATION) at the time that invalidation is required.
> Instead of actively invalidating them, invalidate the *current* context and
> also mark the cpu_tlbstate _quickly_ to indicate future invalidation to be
> required.
>
> At the next context-switch, look for this indicator
> ('invalidate_other' being set) invalidate all of the
> cpu_tlbstate.ctxs[] entries.
>
> This ensures that any future context switches will do a full flush
> of the TLB, picking up the previous changes.
>
> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
> Signed-off-by: Ingo Molnar <mingo@kernel.org>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> Cc: Rik van Riel <riel@redhat.com>
> Cc: Denys Vlasenko <dvlasenk@redhat.com>
> Cc: Andy Lutomirski <luto@kernel.org>
> Cc: michael.schwarz@iaik.tugraz.at
> Cc: daniel.gruss@iaik.tugraz.at
> Cc: Brian Gerst <brgerst@gmail.com>
> Cc: Josh Poimboeuf <jpoimboe@redhat.com>
> Cc: hughd@google.com
> Cc: Borislav Petkov <bp@alien8.de>
> Cc: moritz.lipp@iaik.tugraz.at
> Cc: keescook@google.com
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: richard.fellner@student.tugraz.at
> Link: https://lkml.kernel.org/r/20171123003507.E8C327F5@viggo.jf.intel.com
>
> ---
>  arch/x86/include/asm/tlbflush.h |   42 ++++++++++++++++++++++++++++++----------
>  arch/x86/mm/tlb.c               |   37 +++++++++++++++++++++++++++++++++++
>  2 files changed, 69 insertions(+), 10 deletions(-)
>
> --- a/arch/x86/include/asm/tlbflush.h
> +++ b/arch/x86/include/asm/tlbflush.h
> @@ -188,6 +188,17 @@ struct tlb_state {
>         bool is_lazy;
>
>         /*
> +        * If set we changed the page tables in such a way that we
> +        * needed an invalidation of all contexts (aka. PCIDs / ASIDs).
> +        * This tells us to go invalidate all the non-loaded ctxs[]
> +        * on the next context switch.
> +        *
> +        * The current ctx was kept up-to-date as it ran and does not
> +        * need to be invalidated.
> +        */
> +       bool invalidate_other;
> +
> +       /*
>          * Access to this CR4 shadow and to H/W CR4 is protected by
>          * disabling interrupts when modifying either one.
>          */
> @@ -267,6 +278,19 @@ static inline unsigned long cr4_read_sha
>         return this_cpu_read(cpu_tlbstate.cr4);
>  }
>
> +static inline void invalidate_pcid_other(void)
> +{
> +       /*
> +        * With global pages, all of the shared kenel page tables
> +        * are set as _PAGE_GLOBAL.  We have no shared nonglobals
> +        * and nothing to do here.
> +        */
> +       if (!static_cpu_has_bug(X86_BUG_CPU_SECURE_MODE_KPTI))
> +               return;

I think I'd be more comfortable if this check were in the caller, not
here.  Shouldn't a function called invalidate_pcid_other() do what the
name says?

> +
> +       this_cpu_write(cpu_tlbstate.invalidate_other, true);

Why do we need this extra variable instead of just looping over all
other ASIDs and invalidating them?  It would be something like:

        for (i = 1; i < TLB_NR_DYN_ASIDS; i++) {
                if (i != this_cpu_read(cpu_tlbstate.loaded_mm_asid))
                       this_cpu_write(cpu_tlbstate.ctxs[i].ctx_id, 0);
        }

modulo epic whitespace damage and possible typos.

> +}
> +
>  /*
>   * Save some of cr4 feature set we're using (e.g.  Pentium 4MB
>   * enable and PPro Global page enable), so that any CPU's that boot
> @@ -341,24 +365,22 @@ static inline void __native_flush_tlb_si
>
>  static inline void __flush_tlb_all(void)
>  {
> -       if (boot_cpu_has(X86_FEATURE_PGE))
> +       if (boot_cpu_has(X86_FEATURE_PGE)) {
>                 __flush_tlb_global();
> -       else
> +       } else {
>                 __flush_tlb();
> -
> -       /*
> -        * Note: if we somehow had PCID but not PGE, then this wouldn't work --
> -        * we'd end up flushing kernel translations for the current ASID but
> -        * we might fail to flush kernel translations for other cached ASIDs.
> -        *
> -        * To avoid this issue, we force PCID off if PGE is off.
> -        */
> +       }
>  }
>
>  static inline void __flush_tlb_one(unsigned long addr)
>  {
>         count_vm_tlb_event(NR_TLB_LOCAL_FLUSH_ONE);
>         __flush_tlb_single(addr);
> +       /*
> +        * Invalidate other address spaces inaccessible to single-page
> +        * invalidation:
> +        */

Ugh.  If I'm reading this right, __flush_tlb_single() means "flush one
user address" and __flush_tlb_one() means "flush one kernel address".
That's, um, not exactly obvious.  Could this be at least commented
better?

--Andy

^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [patch 55/60] x86/mm: Use INVPCID for __native_flush_tlb_single()
  2017-12-04 14:08 ` [patch 55/60] x86/mm: Use INVPCID for __native_flush_tlb_single() Thomas Gleixner
@ 2017-12-04 22:25   ` Andy Lutomirski
  2017-12-04 22:51     ` Peter Zijlstra
  0 siblings, 1 reply; 118+ messages in thread
From: Andy Lutomirski @ 2017-12-04 22:25 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, X86 ML, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, Kees Cook, Hugh Dickins,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Rik van Riel,
	Boris Ostrovsky, Juergen Gross, David Laight, Eduardo Valentin,
	aliguori, Will Deacon, Daniel Gruss, Dave Hansen

On Mon, Dec 4, 2017 at 6:08 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> From: Dave Hansen <dave.hansen@linux.intel.com>
>
> This uses INVPCID to shoot down individual lines of the user mapping
> instead of marking the entire user map as invalid. This
> could/might/possibly be faster.
>
> This for sure needs tlb_single_page_flush_ceiling to be redetermined;
> esp. since INVPCID is _slow_.

I'm wondering if INVPCID is *so* slow that this patch is entirely
counterproductive.

--Andy

^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [patch 31/60] x86/mm/kpti: Add mapping helper functions
  2017-12-04 14:07 ` [patch 31/60] x86/mm/kpti: Add mapping helper functions Thomas Gleixner
@ 2017-12-04 22:27   ` Andy Lutomirski
  2017-12-05 16:01   ` Borislav Petkov
  1 sibling, 0 replies; 118+ messages in thread
From: Andy Lutomirski @ 2017-12-04 22:27 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, X86 ML, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, Kees Cook, Hugh Dickins,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Rik van Riel,
	Boris Ostrovsky, Juergen Gross, David Laight, Eduardo Valentin,
	aliguori, Will Deacon, Daniel Gruss, Dave Hansen

On Mon, Dec 4, 2017 at 6:07 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> From: Dave Hansen <dave.hansen@linux.intel.com>
>
> Add the pagetable helper functions do manage the separate user space page
> tables.
>
> [ tglx: Split out from the big combo kaiser patch ]

> +/*
> + * Take a PGD location (pgdp) and a pgd value that needs to be set there.
> + * Populates the user and returns the resulting PGD that must be set in
> + * the kernel copy of the page tables.
> + */
> +static inline pgd_t kpti_set_user_pgd(pgd_t *pgdp, pgd_t pgd)
> +{
> +#ifdef CONFIG_KERNEL_PAGE_TABLE_ISOLATION
> +       if (!static_cpu_has_bug(X86_BUG_CPU_SECURE_MODE_KPTI))
> +               return pgd;
> +
> +       if (pgd_userspace_access(pgd)) {
> +               if (pgdp_maps_userspace(pgdp)) {
> +                       /*
> +                        * The user page tables get the full PGD,
> +                        * accessible from userspace:
> +                        */
> +                       kernel_to_user_pgdp(pgdp)->pgd = pgd.pgd;
> +                       /*
> +                        * For the copy of the pgd that the kernel uses,
> +                        * make it unusable to userspace.  This ensures on
> +                        * in case that a return to userspace with the
> +                        * kernel CR3 value, userspace will crash instead
> +                        * of running.
> +                        *
> +                        * Note: NX might be not available or disabled.
> +                        */
> +                       if (__supported_pte_mask & _PAGE_NX)
> +                               pgd.pgd |= _PAGE_NX;
> +               }
> +       } else if (pgd_userspace_access(*pgdp)) {
> +               /*
> +                * We are clearing a _PAGE_USER PGD for which we presumably
> +                * populated the user PGD.  We must now clear the user PGD
> +                * entry.
> +                */
> +               if (pgdp_maps_userspace(pgdp)) {
> +                       kernel_to_user_pgdp(pgdp)->pgd = pgd.pgd;
> +               } else {
> +                       /*
> +                        * Attempted to clear a _PAGE_USER PGD which is in
> +                        * the kernel porttion of the address space.  PGDs
> +                        * are pre-populated and we never clear them.
> +                        */
> +                       WARN_ON_ONCE(1);
> +               }
> +       } else {
> +               /*
> +                * _PAGE_USER was not set in either the PGD being set or
> +                * cleared.  All kernel PGDs should be pre-populated so
> +                * this should never happen after boot.
> +                */
> +               WARN_ON_ONCE(system_state == SYSTEM_RUNNING);
> +       }
> +#endif
> +       /* return the copy of the PGD we want the kernel to use: */
> +       return pgd;
> +}
> +

I mentioned this earlier, but I think this should be:


  VM_BUG_ON(pgdp points to a usermode table);

  if (pgdp_maps_userspace(pgdp)) {
    /* Install the pgd as requested into the usermode tables. */
    kernel_to_user_pgdp(pgdp)->pgd = pgd.pgd;

    if (pgd_val(pgd) & _PAGE_USER) {
      /*
       * This is a normal user pgd -- the kernelmode mapping should have NX
       * set to prevent erroneous usermode execution with the kernel tables.
       */
      return __pgd(pgd_val(pgd) | _PAGE_NX;
    } else {
      /* This is a weird mapping, e.g. EFI.  Map it straight through. */
      return pgd;
    }
  } else {
    /*
     * We can get here due to vmalloc, a vmalloc fault, memory
hot-add, or initial setup
     * of kernelmode page tables.  Regardless of which particular code
path we're in,
     * these mappings should not be automatically propagated to the
usermode tables.
     */
    return pgd;
  }
}

That should make all the VSYSCALL nastiness go away.

^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [patch 35/60] x86/espfix: Ensure that ESPFIX is visible in user PGD
  2017-12-04 14:07 ` [patch 35/60] x86/espfix: Ensure that ESPFIX is visible in " Thomas Gleixner
@ 2017-12-04 22:28   ` Andy Lutomirski
  0 siblings, 0 replies; 118+ messages in thread
From: Andy Lutomirski @ 2017-12-04 22:28 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, X86 ML, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, Kees Cook, Hugh Dickins,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Rik van Riel,
	Boris Ostrovsky, Juergen Gross, David Laight, Eduardo Valentin,
	aliguori, Will Deacon, Daniel Gruss, Dave Hansen

On Mon, Dec 4, 2017 at 6:07 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> From: Dave Hansen <dave.hansen@linux.intel.com>
>
> Clone the ESPFIX alias mapping area so the entry/exit code has access to it
> even with the user space page tables.
>
> [ tglx: Remove the per cpu user mapped oddity ]
>
> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
>
> ---
>  arch/x86/kernel/espfix_64.c |   16 ++++++++++++++++
>  1 file changed, 16 insertions(+)
>
> --- a/arch/x86/kernel/espfix_64.c
> +++ b/arch/x86/kernel/espfix_64.c
> @@ -129,6 +129,22 @@ void __init init_espfix_bsp(void)
>         p4d = p4d_alloc(&init_mm, pgd, ESPFIX_BASE_ADDR);
>         p4d_populate(&init_mm, p4d, espfix_pud_page);
>
> +       /*
> +        * Just copy the top-level PGD that is mapping the espfix area to
> +        * ensure it is mapped into the user page tables.
> +        *
> +        * For 5-level paging, the espfix pgd was populated when
> +        * kpti_init() pre-populated all the pgd entries.  The above
> +        * p4d_alloc() would never do anything and the p4d_populate() would
> +        * be done to a p4d already mapped in the userspace pgd.
> +        */

Is this actually true?  From brief inspection, it doesn't seem to be
the case, nor do I see why it should be true.

> +#ifdef CONFIG_KERNEL_PAGE_TABLE_ISOLATION
> +       if (CONFIG_PGTABLE_LEVELS <= 4) {
> +               set_pgd(kernel_to_user_pgdp(pgd),
> +                       __pgd(_KERNPG_TABLE | (p4d_pfn(*p4d) << PAGE_SHIFT)));
> +       }
> +#endif
> +
>         /* Randomize the locations */
>         init_espfix_random();
>
>
>

^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [patch 19/60] x86/entry/64: Create a per-CPU SYSCALL entry trampoline
  2017-12-04 14:07 ` [patch 19/60] x86/entry/64: Create a per-CPU SYSCALL entry trampoline Thomas Gleixner
@ 2017-12-04 22:30   ` Andy Lutomirski
  0 siblings, 0 replies; 118+ messages in thread
From: Andy Lutomirski @ 2017-12-04 22:30 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, X86 ML, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, Kees Cook, Hugh Dickins,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Rik van Riel,
	Boris Ostrovsky, Juergen Gross, David Laight, Eduardo Valentin,
	aliguori, Will Deacon, Daniel Gruss, Ingo Molnar, Dave Hansen

On Mon, Dec 4, 2017 at 6:07 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> From: Andy Lutomirski <luto@kernel.org>
>

> XXX: Whenever we settle how KERNEL_PAGE_TABLE_ISOLATION gets turned on
> and off, we should do the same to this.

This is done now :)

^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [patch 47/60] x86/ldt: Map LDT entries into fixmap
  2017-12-04 14:07 ` [patch 47/60] x86/ldt: Map LDT entries into fixmap Thomas Gleixner
@ 2017-12-04 22:33   ` Andy Lutomirski
  2017-12-04 22:51     ` Thomas Gleixner
  0 siblings, 1 reply; 118+ messages in thread
From: Andy Lutomirski @ 2017-12-04 22:33 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, X86 ML, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, Kees Cook, Hugh Dickins,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Rik van Riel,
	Boris Ostrovsky, Juergen Gross, David Laight, Eduardo Valentin,
	aliguori, Will Deacon, Daniel Gruss

On Mon, Dec 4, 2017 at 6:07 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> From: Thomas Gleixner <tglx@linutronix.de>
>
> LDT is not really commonly used on 64bit so the overhead of populating the
> fixmap entries on context switch for the rare LDT syscall users is a
> reasonable trade off vs. having extra dynamically managed mapping space per
> process.
>

Hmm, I wonder just how slow this is.  It might be okay.  It's
certainly not the way I imagined it working.

--Andy

^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [patch 56/60] x86/mm/kpti: Disable native VSYSCALL
  2017-12-04 14:08 ` [patch 56/60] x86/mm/kpti: Disable native VSYSCALL Thomas Gleixner
@ 2017-12-04 22:33   ` Andy Lutomirski
  0 siblings, 0 replies; 118+ messages in thread
From: Andy Lutomirski @ 2017-12-04 22:33 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, X86 ML, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, Kees Cook, Hugh Dickins,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Rik van Riel,
	Boris Ostrovsky, Juergen Gross, David Laight, Eduardo Valentin,
	aliguori, Will Deacon, Daniel Gruss, Dave Hansen, Ingo Molnar,
	moritz.lipp, linux-mm, Borislav Petkov, michael.schwarz,
	richard.fellner

On Mon, Dec 4, 2017 at 6:08 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> From: Dave Hansen <dave.hansen@linux.intel.com>
>
> The KERNEL_PAGE_TABLE_ISOLATION code attempts to "poison" the user
> portion of the kernel page tables. It detects entries that it wants that it
> wants to poison in two ways:
>
>  * Looking for addresses >= PAGE_OFFSET
>
>  * Looking for entries without _PAGE_USER set
>
> But, to allow the _PAGE_USER check to work, it must never be set on
> init_mm entries, and an earlier patch in this series ensured that it
> will never be set.
>
> The VDSO is at a address >= PAGE_OFFSET and it is also mapped by init_mm.
> Because of the earlier, KERNEL_PAGE_TABLE_ISOLATION-enforced restriction,
> _PAGE_USER is never set which makes the VDSO unreadable to userspace.
>
> This makes the "NATIVE" case totally unusable since userspace can not even
> see the memory any more.  Disable it whenever KERNEL_PAGE_TABLE_ISOLATION
> is enabled.
>
> Also add some help text about how KERNEL_PAGE_TABLE_ISOLATION might
> affect the emulation case as well.
>

I think my other suggestion may obsolete this patch.

^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [patch 51/60] x86/mm: Allow flushing for future ASID switches
  2017-12-04 22:22   ` Andy Lutomirski
@ 2017-12-04 22:34     ` Dave Hansen
  2017-12-04 22:36       ` Andy Lutomirski
  2017-12-04 22:47     ` Peter Zijlstra
  1 sibling, 1 reply; 118+ messages in thread
From: Dave Hansen @ 2017-12-04 22:34 UTC (permalink / raw)
  To: Andy Lutomirski, Thomas Gleixner
  Cc: LKML, X86 ML, Linus Torvalds, Peter Zijlstra, Borislav Petkov,
	Greg KH, Kees Cook, Hugh Dickins, Brian Gerst, Josh Poimboeuf,
	Denys Vlasenko, Rik van Riel, Boris Ostrovsky, Juergen Gross,
	David Laight, Eduardo Valentin, aliguori, Will Deacon,
	Daniel Gruss, Dave Hansen, Ingo Molnar, michael.schwarz,
	Borislav Petkov, moritz.lipp, richard.fellner

On 12/04/2017 02:22 PM, Andy Lutomirski wrote:
>> +
>> +       this_cpu_write(cpu_tlbstate.invalidate_other, true);
> 
> Why do we need this extra variable instead of just looping over all
> other ASIDs and invalidating them?  It would be something like:
> 
>         for (i = 1; i < TLB_NR_DYN_ASIDS; i++) {
>                 if (i != this_cpu_read(cpu_tlbstate.loaded_mm_asid))
>                        this_cpu_write(cpu_tlbstate.ctxs[i].ctx_id, 0);
>         }

We have loops like this:

	for (addr = start; addr < end; addr += PAGE_SIZE)
		flush_tlb_single();

Where flush_tlb_single() does a invalidate_pcid_other().  So, inlining
flush_tlb_single() rougly looks like:

	for (addr = start; addr < end; addr += PAGE_SIZE) {
		invlpg;
		for (i = 1; i < TLB_NR_DYN_ASIDS; i++) {
			this_cpu_write(cpu_tlbstate.ctxs[i].ctx_id, 0);
	}

or, with a "invalidate_other" variable:

	for (addr = start; addr < end; addr += PAGE_SIZE) {
		invlpg;
		this_cpu_write(cpu_tlbstate.ctxs.invalidate_other, 1);
	}

The double-for-loop looks a bit wasteful to me.


>>  static inline void __flush_tlb_one(unsigned long addr)
>>  {
>>         count_vm_tlb_event(NR_TLB_LOCAL_FLUSH_ONE);
>>         __flush_tlb_single(addr);
>> +       /*
>> +        * Invalidate other address spaces inaccessible to single-page
>> +        * invalidation:
>> +        */
> 
> Ugh.  If I'm reading this right, __flush_tlb_single() means "flush one
> user address" and __flush_tlb_one() means "flush one kernel address".
> That's, um, not exactly obvious.  Could this be at least commented
> better?

That sounds sane, but let me take a look at it.

Didn't Peter have some patches to do some of that rename?

^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [patch 51/60] x86/mm: Allow flushing for future ASID switches
  2017-12-04 22:34     ` Dave Hansen
@ 2017-12-04 22:36       ` Andy Lutomirski
  0 siblings, 0 replies; 118+ messages in thread
From: Andy Lutomirski @ 2017-12-04 22:36 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Andy Lutomirski, Thomas Gleixner, LKML, X86 ML, Linus Torvalds,
	Peter Zijlstra, Borislav Petkov, Greg KH, Kees Cook,
	Hugh Dickins, Brian Gerst, Josh Poimboeuf, Denys Vlasenko,
	Rik van Riel, Boris Ostrovsky, Juergen Gross, David Laight,
	Eduardo Valentin, aliguori, Will Deacon, Daniel Gruss,
	Dave Hansen, Ingo Molnar, michael.schwarz, Borislav Petkov,
	moritz.lipp, richard.fellner

On Mon, Dec 4, 2017 at 2:34 PM, Dave Hansen <dave.hansen@intel.com> wrote:
> On 12/04/2017 02:22 PM, Andy Lutomirski wrote:
>>> +
>>> +       this_cpu_write(cpu_tlbstate.invalidate_other, true);
>>
>> Why do we need this extra variable instead of just looping over all
>> other ASIDs and invalidating them?  It would be something like:
>>
>>         for (i = 1; i < TLB_NR_DYN_ASIDS; i++) {
>>                 if (i != this_cpu_read(cpu_tlbstate.loaded_mm_asid))
>>                        this_cpu_write(cpu_tlbstate.ctxs[i].ctx_id, 0);
>>         }
>
> We have loops like this:
>
>         for (addr = start; addr < end; addr += PAGE_SIZE)
>                 flush_tlb_single();

Couldn't we just make those looks more intelligent:

for (...)
  flush_tlb_kernelmode_single(...);

if (kpti)
  invalidate_asid_other();

(Isn't there only one such look now, in flush_tlb_func_common()?)

--Andy

^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [patch 26/60] x86/cpufeature: Make cpu bugs sticky
  2017-12-04 14:07 ` [patch 26/60] x86/cpufeature: Make cpu bugs sticky Thomas Gleixner
@ 2017-12-04 22:39   ` Borislav Petkov
  0 siblings, 0 replies; 118+ messages in thread
From: Borislav Petkov @ 2017-12-04 22:39 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Greg KH, keescook, hughd, Brian Gerst,
	Josh Poimboeuf, Denys Vlasenko, Rik van Riel, Boris Ostrovsky,
	Juergen Gross, David Laight, Eduardo Valentin, aliguori,
	Will Deacon, daniel.gruss

On Mon, Dec 04, 2017 at 03:07:32PM +0100, Thomas Gleixner wrote:
> From: Thomas Gleixner <tglx@linutronix.de>
> 
> There is currently no way to force CPU bug bits like CPU feature bits. That
> makes it impossible to set a bug bit once at boot and have it stick for all
> upcoming CPUs.
> 
> Extend the force set/clear arrays to handle bug bits as well.
> 
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> 
> ---
>  arch/x86/include/asm/cpufeature.h |    2 ++
>  arch/x86/include/asm/processor.h  |    4 ++--
>  arch/x86/kernel/cpu/common.c      |    6 +++---
>  3 files changed, 7 insertions(+), 5 deletions(-)

This whole area needs more work... well, later...

Reviewed-by: Borislav Petkov <bp@suse.de>

-- 
Regards/Gruss,
    Boris.

SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
-- 

^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [patch 51/60] x86/mm: Allow flushing for future ASID switches
  2017-12-04 22:22   ` Andy Lutomirski
  2017-12-04 22:34     ` Dave Hansen
@ 2017-12-04 22:47     ` Peter Zijlstra
  2017-12-04 22:54       ` Andy Lutomirski
  1 sibling, 1 reply; 118+ messages in thread
From: Peter Zijlstra @ 2017-12-04 22:47 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Thomas Gleixner, LKML, X86 ML, Linus Torvalds, Dave Hansen,
	Borislav Petkov, Greg KH, Kees Cook, Hugh Dickins, Brian Gerst,
	Josh Poimboeuf, Denys Vlasenko, Rik van Riel, Boris Ostrovsky,
	Juergen Gross, David Laight, Eduardo Valentin, aliguori,
	Will Deacon, Daniel Gruss, Dave Hansen, Ingo Molnar,
	michael.schwarz, Borislav Petkov, moritz.lipp, richard.fellner,
	abanman, mike.travis

On Mon, Dec 04, 2017 at 02:22:54PM -0800, Andy Lutomirski wrote:

> > +static inline void invalidate_pcid_other(void)
> > +{
> > +       /*
> > +        * With global pages, all of the shared kenel page tables
> > +        * are set as _PAGE_GLOBAL.  We have no shared nonglobals
> > +        * and nothing to do here.
> > +        */
> > +       if (!static_cpu_has_bug(X86_BUG_CPU_SECURE_MODE_KPTI))
> > +               return;
> 
> I think I'd be more comfortable if this check were in the caller, not
> here.  Shouldn't a function called invalidate_pcid_other() do what the
> name says?

Yeah, you're probably right. The thing is course that we only ever need
that operation for kpti (as of now). But me renaming this stuff made
this problem :/

> > +       this_cpu_write(cpu_tlbstate.invalidate_other, true);
> 
> Why do we need this extra variable instead of just looping over all
> other ASIDs and invalidating them?  It would be something like:
> 
>         for (i = 1; i < TLB_NR_DYN_ASIDS; i++) {
>                 if (i != this_cpu_read(cpu_tlbstate.loaded_mm_asid))
>                        this_cpu_write(cpu_tlbstate.ctxs[i].ctx_id, 0);
>         }
> 
> modulo epic whitespace damage and possible typos.

I think the point is that we can do many invalidate_other's before we
ever do a switch_mm(). The above would be more expensive.

Not sure it would matter in practise though.

> >  static inline void __flush_tlb_one(unsigned long addr)
> >  {
> >         count_vm_tlb_event(NR_TLB_LOCAL_FLUSH_ONE);
> >         __flush_tlb_single(addr);
> > +       /*
> > +        * Invalidate other address spaces inaccessible to single-page
> > +        * invalidation:
> > +        */
> 
> Ugh.  If I'm reading this right, __flush_tlb_single() means "flush one
> user address" and __flush_tlb_one() means "flush one kernel address".

That would make sense, woulnd't it? :-) But afaict the __flush_tlb_one()
user in tlb_uv.c is in fact for userspace and should be
__flush_tlb_single().

Andrew, Mike, can either of you shed light on what exactly you need
invalidated there?

> That's, um, not exactly obvious.  Could this be at least commented
> better?

As is __flush_tlb_single() does user and __flush_tlb_one() does
user+kernel.

^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [patch 47/60] x86/ldt: Map LDT entries into fixmap
  2017-12-04 22:33   ` Andy Lutomirski
@ 2017-12-04 22:51     ` Thomas Gleixner
  0 siblings, 0 replies; 118+ messages in thread
From: Thomas Gleixner @ 2017-12-04 22:51 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: LKML, X86 ML, Linus Torvalds, Peter Zijlstra, Dave Hansen,
	Borislav Petkov, Greg KH, Kees Cook, Hugh Dickins, Brian Gerst,
	Josh Poimboeuf, Denys Vlasenko, Rik van Riel, Boris Ostrovsky,
	Juergen Gross, David Laight, Eduardo Valentin, aliguori,
	Will Deacon, Daniel Gruss

On Mon, 4 Dec 2017, Andy Lutomirski wrote:

> On Mon, Dec 4, 2017 at 6:07 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> > From: Thomas Gleixner <tglx@linutronix.de>
> >
> > LDT is not really commonly used on 64bit so the overhead of populating the
> > fixmap entries on context switch for the rare LDT syscall users is a
> > reasonable trade off vs. having extra dynamically managed mapping space per
> > process.
> >
> 
> Hmm, I wonder just how slow this is.  It might be okay.  It's
> certainly not the way I imagined it working.

I know, it was the laziest way I could come up with. The only nasty thing
here is that __set_fixmap() does a tlb flush which is pointless as that
happens anyway. On my todo list was a noflush variant for set_fixmap along
with a variant which takes a whole range. That would simplify other places
as well. Though with the plan to map that stuff to a different place we
actually can avoid the weirdness of fixmaps

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [patch 55/60] x86/mm: Use INVPCID for __native_flush_tlb_single()
  2017-12-04 22:25   ` Andy Lutomirski
@ 2017-12-04 22:51     ` Peter Zijlstra
  2017-12-05 13:51       ` Dave Hansen
  0 siblings, 1 reply; 118+ messages in thread
From: Peter Zijlstra @ 2017-12-04 22:51 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Thomas Gleixner, LKML, X86 ML, Linus Torvalds, Dave Hansen,
	Borislav Petkov, Greg KH, Kees Cook, Hugh Dickins, Brian Gerst,
	Josh Poimboeuf, Denys Vlasenko, Rik van Riel, Boris Ostrovsky,
	Juergen Gross, David Laight, Eduardo Valentin, aliguori,
	Will Deacon, Daniel Gruss, Dave Hansen

On Mon, Dec 04, 2017 at 02:25:43PM -0800, Andy Lutomirski wrote:
> On Mon, Dec 4, 2017 at 6:08 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> > From: Dave Hansen <dave.hansen@linux.intel.com>
> >
> > This uses INVPCID to shoot down individual lines of the user mapping
> > instead of marking the entire user map as invalid. This
> > could/might/possibly be faster.
> >
> > This for sure needs tlb_single_page_flush_ceiling to be redetermined;
> > esp. since INVPCID is _slow_.
> 
> I'm wondering if INVPCID is *so* slow that this patch is entirely
> counterproductive.

We should find some of the benchmarks that were used to determine
tlb_single_page_flush_ceiling and measure. I've not gotten around to
doing either.

Someone called Dave Hansen did that patch and might still have something
lying around to help with that:

  a5102476a24b ("x86/mm: Set TLB flush tunable to sane value (33)")

^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [patch 51/60] x86/mm: Allow flushing for future ASID switches
  2017-12-04 22:47     ` Peter Zijlstra
@ 2017-12-04 22:54       ` Andy Lutomirski
  2017-12-04 23:06         ` Peter Zijlstra
  0 siblings, 1 reply; 118+ messages in thread
From: Andy Lutomirski @ 2017-12-04 22:54 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Andy Lutomirski, Thomas Gleixner, LKML, X86 ML, Linus Torvalds,
	Dave Hansen, Borislav Petkov, Greg KH, Kees Cook, Hugh Dickins,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Rik van Riel,
	Boris Ostrovsky, Juergen Gross, David Laight, Eduardo Valentin,
	aliguori, Will Deacon, Daniel Gruss, Dave Hansen, Ingo Molnar,
	michael.schwarz, Borislav Petkov, moritz.lipp, richard.fellner,
	Andrew Banman, mike.travis

On Mon, Dec 4, 2017 at 2:47 PM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Mon, Dec 04, 2017 at 02:22:54PM -0800, Andy Lutomirski wrote:
>
>> > +static inline void invalidate_pcid_other(void)
>> > +{
>> > +       /*
>> > +        * With global pages, all of the shared kenel page tables
>> > +        * are set as _PAGE_GLOBAL.  We have no shared nonglobals
>> > +        * and nothing to do here.
>> > +        */
>> > +       if (!static_cpu_has_bug(X86_BUG_CPU_SECURE_MODE_KPTI))
>> > +               return;
>>
>> I think I'd be more comfortable if this check were in the caller, not
>> here.  Shouldn't a function called invalidate_pcid_other() do what the
>> name says?
>
> Yeah, you're probably right. The thing is course that we only ever need
> that operation for kpti (as of now). But me renaming this stuff made
> this problem :/
>
>> > +       this_cpu_write(cpu_tlbstate.invalidate_other, true);
>>
>> Why do we need this extra variable instead of just looping over all
>> other ASIDs and invalidating them?  It would be something like:
>>
>>         for (i = 1; i < TLB_NR_DYN_ASIDS; i++) {
>>                 if (i != this_cpu_read(cpu_tlbstate.loaded_mm_asid))
>>                        this_cpu_write(cpu_tlbstate.ctxs[i].ctx_id, 0);
>>         }
>>
>> modulo epic whitespace damage and possible typos.
>
> I think the point is that we can do many invalidate_other's before we
> ever do a switch_mm(). The above would be more expensive.
>
> Not sure it would matter in practise though.
>
>> >  static inline void __flush_tlb_one(unsigned long addr)
>> >  {
>> >         count_vm_tlb_event(NR_TLB_LOCAL_FLUSH_ONE);
>> >         __flush_tlb_single(addr);
>> > +       /*
>> > +        * Invalidate other address spaces inaccessible to single-page
>> > +        * invalidation:
>> > +        */
>>
>> Ugh.  If I'm reading this right, __flush_tlb_single() means "flush one
>> user address" and __flush_tlb_one() means "flush one kernel address".
>
> That would make sense, woulnd't it? :-) But afaict the __flush_tlb_one()
> user in tlb_uv.c is in fact for userspace and should be
> __flush_tlb_single().
>
> Andrew, Mike, can either of you shed light on what exactly you need
> invalidated there?
>
>> That's, um, not exactly obvious.  Could this be at least commented
>> better?
>
> As is __flush_tlb_single() does user and __flush_tlb_one() does
> user+kernel.

Yep.  A one-liner above the function to that effect would make it
*way* clearer what's going on.

^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [patch 51/60] x86/mm: Allow flushing for future ASID switches
  2017-12-04 22:54       ` Andy Lutomirski
@ 2017-12-04 23:06         ` Peter Zijlstra
  0 siblings, 0 replies; 118+ messages in thread
From: Peter Zijlstra @ 2017-12-04 23:06 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Thomas Gleixner, LKML, X86 ML, Linus Torvalds, Dave Hansen,
	Borislav Petkov, Greg KH, Kees Cook, Hugh Dickins, Brian Gerst,
	Josh Poimboeuf, Denys Vlasenko, Rik van Riel, Boris Ostrovsky,
	Juergen Gross, David Laight, Eduardo Valentin, aliguori,
	Will Deacon, Daniel Gruss, Dave Hansen, Ingo Molnar,
	michael.schwarz, Borislav Petkov, moritz.lipp, richard.fellner,
	Andrew Banman, mike.travis

On Mon, Dec 04, 2017 at 02:54:46PM -0800, Andy Lutomirski wrote:
> On Mon, Dec 4, 2017 at 2:47 PM, Peter Zijlstra <peterz@infradead.org> wrote:

> > As is __flush_tlb_single() does user and __flush_tlb_one() does
> > user+kernel.
> 
> Yep.  A one-liner above the function to that effect would make it
> *way* clearer what's going on.

Bah, since my notes are upstairs I actually got that wrong,
do_kernel_range_flush() also uses __flush_tlb_single(), but then it
finishes with invalidate_pcid_other(), so effectively it shoots down
world.

So we should probably switch do_kernel_range_flush() to
__flush_tlb_one() and tlb_uv.c (pending SGI approval) to
__flush_tlb_single().

I'll dig through my notes in the morning and do a patch with comments.

^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [patch 27/60] x86/cpufeatures: Add X86_BUG_CPU_INSECURE
  2017-12-04 14:07 ` [patch 27/60] x86/cpufeatures: Add X86_BUG_CPU_INSECURE Thomas Gleixner
@ 2017-12-04 23:18   ` Borislav Petkov
  0 siblings, 0 replies; 118+ messages in thread
From: Borislav Petkov @ 2017-12-04 23:18 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Greg KH, keescook, hughd, Brian Gerst,
	Josh Poimboeuf, Denys Vlasenko, Rik van Riel, Boris Ostrovsky,
	Juergen Gross, David Laight, Eduardo Valentin, aliguori,
	Will Deacon, daniel.gruss

On Mon, Dec 04, 2017 at 03:07:33PM +0100, Thomas Gleixner wrote:
> From: Thomas Gleixner <tglx@linutronix.de>
> 
> Many x86 CPUs leak information to user space due to missing isolation of
> user space and kernel space page tables. There are many well documented
> ways to exploit that.
> 
> The upcoming software migitation of isolating the user and kernel space
> page tables needs a misfeature flag so code can be made runtime
> conditional.
> 
> Add two BUG bits: One which indicates that the CPU is affected and one that
> the software migitation is enabled.
> 
> Assume for now that _ALL_ x86 CPUs are affected by this. Exceptions can be
> made later.
> 
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> 
> ---
>  arch/x86/include/asm/cpufeatures.h |    2 ++
>  arch/x86/kernel/cpu/common.c       |    4 ++++
>  2 files changed, 6 insertions(+)
> 
> --- a/arch/x86/include/asm/cpufeatures.h
> +++ b/arch/x86/include/asm/cpufeatures.h
> @@ -340,5 +340,7 @@
>  #define X86_BUG_SWAPGS_FENCE		X86_BUG(11) /* SWAPGS without input dep on GS */
>  #define X86_BUG_MONITOR			X86_BUG(12) /* IPI required to wake up remote CPU */
>  #define X86_BUG_AMD_E400		X86_BUG(13) /* CPU is among the affected by Erratum 400 */
> +#define X86_BUG_CPU_INSECURE		X86_BUG(14) /* CPU is insecure and needs kernel page table isolation */
> +#define X86_BUG_CPU_SECURE_MODE_KPTI	X86_BUG(15) /* Kernel Page Table Isolation enabled*/

Right, if this second one is going to denote that the workaround is
enabled, let's make it a feature bit and shorter:

#define X86_FEATURE_KPTI

Delta diff below.

---
diff --git a/arch/x86/entry/calling.h b/arch/x86/entry/calling.h
index 4dd0bda9fe09..604b62a5a2fe 100644
--- a/arch/x86/entry/calling.h
+++ b/arch/x86/entry/calling.h
@@ -212,7 +212,7 @@ For 32-bit we have the following conventions - kernel is built with
 .endm
 
 .macro SWITCH_TO_KERNEL_CR3 scratch_reg:req
-	ALTERNATIVE "jmp .Lend_\@", "", X86_BUG_CPU_SECURE_MODE_KPTI
+	ALTERNATIVE "jmp .Lend_\@", "", X86_FEATURE_KPTI
 	mov	%cr3, \scratch_reg
 	ADJUST_KERNEL_CR3 \scratch_reg
 	mov	\scratch_reg, %cr3
@@ -223,7 +223,7 @@ For 32-bit we have the following conventions - kernel is built with
 	PER_CPU_VAR(cpu_tlbstate) + TLB_STATE_user_pcid_flush_mask
 
 .macro SWITCH_TO_USER_CR3_NOSTACK scratch_reg:req scratch_reg2:req
-	ALTERNATIVE "jmp .Lend_\@", "", X86_BUG_CPU_SECURE_MODE_KPTI
+	ALTERNATIVE "jmp .Lend_\@", "", X86_FEATURE_KPTI
 	mov	%cr3, \scratch_reg
 
 	ALTERNATIVE "jmp .Lwrcr3_\@", "", X86_FEATURE_PCID
@@ -259,7 +259,7 @@ For 32-bit we have the following conventions - kernel is built with
 .endm
 
 .macro SAVE_AND_SWITCH_TO_KERNEL_CR3 scratch_reg:req save_reg:req
-	ALTERNATIVE "jmp .Ldone_\@", "", X86_BUG_CPU_SECURE_MODE_KPTI
+	ALTERNATIVE "jmp .Ldone_\@", "", X86_FEATURE_KPTI
 	movq	%cr3, \scratch_reg
 	movq	\scratch_reg, \save_reg
 	/*
@@ -282,7 +282,7 @@ For 32-bit we have the following conventions - kernel is built with
 .endm
 
 .macro RESTORE_CR3 scratch_reg:req save_reg:req
-	ALTERNATIVE "jmp .Lend_\@", "", X86_BUG_CPU_SECURE_MODE_KPTI
+	ALTERNATIVE "jmp .Lend_\@", "", X86_FEATURE_KPTI
 
 	ALTERNATIVE "jmp .Lwrcr3_\@", "", X86_FEATURE_PCID
 
diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index 6e905acb4e97..b367c23e7d83 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -198,10 +198,10 @@
 #define X86_FEATURE_CAT_L2		( 7*32+ 5) /* Cache Allocation Technology L2 */
 #define X86_FEATURE_CDP_L3		( 7*32+ 6) /* Code and Data Prioritization L3 */
 #define X86_FEATURE_INVPCID_SINGLE	( 7*32+ 7) /* Effectively INVPCID && CR4.PCIDE=1 */
-
 #define X86_FEATURE_HW_PSTATE		( 7*32+ 8) /* AMD HW-PState */
 #define X86_FEATURE_PROC_FEEDBACK	( 7*32+ 9) /* AMD ProcFeedbackInterface */
 #define X86_FEATURE_SME			( 7*32+10) /* AMD Secure Memory Encryption */
+#define X86_FEATURE_KPTI		( 7*32+11) /* Kernel Page Table Isolation enabled */
 
 #define X86_FEATURE_INTEL_PPIN		( 7*32+14) /* Intel Processor Inventory Number */
 #define X86_FEATURE_INTEL_PT		( 7*32+15) /* Intel Processor Trace */
@@ -342,6 +342,5 @@
 #define X86_BUG_MONITOR			X86_BUG(12) /* IPI required to wake up remote CPU */
 #define X86_BUG_AMD_E400		X86_BUG(13) /* CPU is among the affected by Erratum 400 */
 #define X86_BUG_CPU_INSECURE		X86_BUG(14) /* CPU is insecure and needs kernel page table isolation */
-#define X86_BUG_CPU_SECURE_MODE_KPTI	X86_BUG(15) /* Kernel Page Table Isolation enabled*/
 
 #endif /* _ASM_X86_CPUFEATURES_H */
diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h
index 0405960cee25..d1bf0b3a8232 100644
--- a/arch/x86/include/asm/pgtable_64.h
+++ b/arch/x86/include/asm/pgtable_64.h
@@ -210,7 +210,7 @@ static inline bool pgd_userspace_access(pgd_t pgd)
 static inline pgd_t kpti_set_user_pgd(pgd_t *pgdp, pgd_t pgd)
 {
 #ifdef CONFIG_KERNEL_PAGE_TABLE_ISOLATION
-	if (!static_cpu_has_bug(X86_BUG_CPU_SECURE_MODE_KPTI))
+	if (!static_cpu_has(X86_FEATURE_KPTI))
 		return pgd;
 
 	if (pgd_userspace_access(pgd)) {
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 55ebfd144f18..d84167c036c0 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -330,7 +330,7 @@ static inline void invalidate_pcid_other(void)
 	 * are set as _PAGE_GLOBAL.  We have no shared nonglobals
 	 * and nothing to do here.
 	 */
-	if (!static_cpu_has_bug(X86_BUG_CPU_SECURE_MODE_KPTI))
+	if (!static_cpu_has(X86_FEATURE_KPTI))
 		return;
 
 	this_cpu_write(cpu_tlbstate.invalidate_other, true);
@@ -374,7 +374,7 @@ static inline void invalidate_user_asid(u16 asid)
 	if (!cpu_feature_enabled(X86_FEATURE_PCID))
 		return;
 
-	if (!static_cpu_has_bug(X86_BUG_CPU_SECURE_MODE_KPTI))
+	if (!static_cpu_has(X86_FEATURE_KPTI))
 		return;
 
 	__set_bit(kern_pcid(asid),
@@ -438,7 +438,7 @@ static inline void __native_flush_tlb_single(unsigned long addr)
 
 	asm volatile("invlpg (%0)" ::"r" (addr) : "memory");
 
-	if (!static_cpu_has_bug(X86_BUG_CPU_SECURE_MODE_KPTI))
+	if (!static_cpu_has(X86_FEATURE_KPTI))
 		return;
 
 	/*
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index b38a426a9855..4aa7b1efa6d8 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -1481,7 +1481,7 @@ void syscall_init(void)
 		(entry_SYSCALL_64_trampoline - _entry_trampoline);
 
 	wrmsr(MSR_STAR, 0, (__USER32_CS << 16) | __KERNEL_CS);
-	if (static_cpu_has_bug(X86_BUG_CPU_SECURE_MODE_KPTI))
+	if (static_cpu_has(X86_FEATURE_KPTI))
 		wrmsrl(MSR_LSTAR, SYSCALL64_entry_trampoline);
 	else
 		wrmsrl(MSR_LSTAR, (unsigned long)entry_SYSCALL_64);
diff --git a/arch/x86/kernel/ldt.c b/arch/x86/kernel/ldt.c
index f63a2b00d775..15dfdb76523d 100644
--- a/arch/x86/kernel/ldt.c
+++ b/arch/x86/kernel/ldt.c
@@ -53,7 +53,7 @@ static void set_ldt_and_map(struct ldt_struct *ldt)
 	void *fixva;
 	int idx, i;
 
-	if (!static_cpu_has_bug(X86_BUG_CPU_SECURE_MODE_KPTI)) {
+	if (!static_cpu_has(X86_FEATURE_KPTI)) {
 		set_ldt(ldt->entries_va, ldt->nr_entries);
 		return;
 	}
diff --git a/arch/x86/mm/dump_pagetables.c b/arch/x86/mm/dump_pagetables.c
index f9dfc20234e9..f18041e7d4d2 100644
--- a/arch/x86/mm/dump_pagetables.c
+++ b/arch/x86/mm/dump_pagetables.c
@@ -504,7 +504,7 @@ void ptdump_walk_pgd_level(struct seq_file *m, pgd_t *pgd)
 void ptdump_walk_pgd_level_debugfs(struct seq_file *m, pgd_t *pgd, bool user)
 {
 #ifdef CONFIG_KERNEL_PAGE_TABLE_ISOLATION
-	if (user && static_cpu_has_bug(X86_BUG_CPU_SECURE_MODE_KPTI))
+	if (user && static_cpu_has(X86_FEATURE_KPTI))
 		pgd = kernel_to_user_pgdp(pgd);
 #endif
 	ptdump_walk_pgd_level_core(m, pgd, false, false);
@@ -516,7 +516,7 @@ static void ptdump_walk_user_pgd_level_checkwx(void)
 #ifdef CONFIG_KERNEL_PAGE_TABLE_ISOLATION
 	pgd_t *pgd = (pgd_t *) &init_top_pgt;
 
-	if (!static_cpu_has_bug(X86_BUG_CPU_SECURE_MODE_KPTI))
+	if (!static_cpu_has(X86_FEATURE_KPTI))
 		return;
 
 	pr_info("x86/mm: Checking user space page tables\n");
diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index ffd55531206e..d65bc503da44 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -164,7 +164,7 @@ static int page_size_mask;
 
 static void enable_global_pages(void)
 {
-	if (!static_cpu_has_bug(X86_BUG_CPU_SECURE_MODE_KPTI))
+	if (!static_cpu_has(X86_FEATURE_KPTI))
 		__supported_pte_mask |= _PAGE_GLOBAL;
 }
 
diff --git a/arch/x86/mm/kpti.c b/arch/x86/mm/kpti.c
index a3b39c01e028..b8f2e300e26c 100644
--- a/arch/x86/mm/kpti.c
+++ b/arch/x86/mm/kpti.c
@@ -61,7 +61,7 @@ void __init kpti_check_boottime_disable(void)
 		enable = false;
 	}
 	if (enable)
-		setup_force_cpu_bug(X86_BUG_CPU_SECURE_MODE_KPTI);
+		setup_force_cpu_cap(X86_FEATURE_KPTI);
 }
 
 /*
@@ -236,7 +236,7 @@ static void __init kpti_init_all_pgds(void)
  */
 void __init kpti_init(void)
 {
-	if (!static_cpu_has_bug(X86_BUG_CPU_SECURE_MODE_KPTI))
+	if (!static_cpu_has(X86_FEATURE_KPTI))
 		return;
 
 	pr_info("enabled\n");
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 20f6cc4e49b8..430c6ba24ad7 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -42,7 +42,7 @@ void clear_asid_other(void)
 	 * This is only expected to be set if we have disabled
 	 * kernel _PAGE_GLOBAL pages.
 	 */
-	if (!static_cpu_has_bug(X86_BUG_CPU_SECURE_MODE_KPTI)) {
+	if (!static_cpu_has(X86_FEATURE_KPTI)) {
 		WARN_ON_ONCE(1);
 		return;
 	}

-- 
Regards/Gruss,
    Boris.

SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
-- 

^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [patch 57/60] x86/mm/kpti: Add Kconfig
  2017-12-04 16:57     ` Thomas Gleixner
@ 2017-12-05  9:34       ` Thomas Gleixner
  0 siblings, 0 replies; 118+ messages in thread
From: Thomas Gleixner @ 2017-12-05  9:34 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: LKML, X86 ML, Linus Torvalds, Peter Zijlstra, Dave Hansen,
	Borislav Petkov, Greg KH, Kees Cook, Hugh Dickins, Brian Gerst,
	Josh Poimboeuf, Denys Vlasenko, Rik van Riel, Boris Ostrovsky,
	Juergen Gross, David Laight, Eduardo Valentin, aliguori,
	Will Deacon, Daniel Gruss, Dave Hansen, Ingo Molnar, moritz.lipp,
	linux-mm, Borislav Petkov, michael.schwarz, richard.fellner

On Mon, 4 Dec 2017, Thomas Gleixner wrote:
> On Mon, 4 Dec 2017, Andy Lutomirski wrote:
> > On Mon, Dec 4, 2017 at 6:08 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> > > --- a/security/Kconfig
> > > +++ b/security/Kconfig
> > > @@ -54,6 +54,16 @@ config SECURITY_NETWORK
> > >           implement socket and networking access controls.
> > >           If you are unsure how to answer this question, answer N.
> > >
> > > +config KERNEL_PAGE_TABLE_ISOLATION
> > > +       bool "Remove the kernel mapping in user mode"
> > > +       depends on X86_64 && JUMP_LABEL
> > 
> > select JUMP_LABEL perhaps?
> 
> Silly me. Yes.

Peter just pointed out that we switched everything to cpu_has() which is
using alternatives so jump label is not longer required at all.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [patch 01/60] x86/entry/64/paravirt: Use paravirt-safe macro to access eflags
  2017-12-04 14:07 ` [patch 01/60] x86/entry/64/paravirt: Use paravirt-safe macro to access eflags Thomas Gleixner
@ 2017-12-05 12:17   ` Juergen Gross
  0 siblings, 0 replies; 118+ messages in thread
From: Juergen Gross @ 2017-12-05 12:17 UTC (permalink / raw)
  To: Thomas Gleixner, LKML
  Cc: x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, keescook, hughd,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Rik van Riel,
	Boris Ostrovsky, David Laight, Eduardo Valentin, aliguori,
	Will Deacon, daniel.gruss, xen-devel

On 04/12/17 15:07, Thomas Gleixner wrote:
> From: Boris Ostrovsky <boris.ostrovsky@oracle.com>
> 
> Commit 1d3e53e8624a ("x86/entry/64: Refactor IRQ stacks and make them
> NMI-safe") added DEBUG_ENTRY_ASSERT_IRQS_OFF macro that acceses eflags
> using 'pushfq' instruction when testing for IF bit. On PV Xen guests
> looking at IF flag directly will always see it set, resulting in 'ud2'.
> 
> Introduce SAVE_FLAGS() macro that will use appropriate save_fl pv op when
> running paravirt.
> 
> Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> Cc: jgross@suse.com
> Cc: xen-devel@lists.xenproject.org
> Cc: luto@kernel.org
> Link: https://lkml.kernel.org/r/1512159805-6314-1-git-send-email-boris.ostrovsky@oracle.com

Reviewed-by: Juergen Gross <jgross@suse.com>


Juergen

^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [patch 24/60] x86/paravirt: Dont patch flush_tlb_single
  2017-12-04 14:07 ` [patch 24/60] x86/paravirt: Dont patch flush_tlb_single Thomas Gleixner
@ 2017-12-05 12:18   ` Juergen Gross
  0 siblings, 0 replies; 118+ messages in thread
From: Juergen Gross @ 2017-12-05 12:18 UTC (permalink / raw)
  To: Thomas Gleixner, LKML
  Cc: x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, keescook, hughd,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Rik van Riel,
	Boris Ostrovsky, David Laight, Eduardo Valentin, aliguori,
	Will Deacon, daniel.gruss, Dave Hansen, michael.schwarz,
	linux-mm, Borislav Petkov, moritz.lipp, richard.fellner

On 04/12/17 15:07, Thomas Gleixner wrote:
> From: Thomas Gleixner <tglx@linutronix.de>
> 
> native_flush_tlb_single() will be changed with the upcoming
> KERNEL_PAGE_TABLE_ISOLATION feature. This requires to have more code in
> there than INVLPG.
> 
> Remove the paravirt patching for it.
> 
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> Acked-by: Peter Zijlstra <peterz@infradead.org>
> Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>

Reviewed-by: Juergen Gross <jgross@suse.com>


Juergen

^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [patch 25/60] x86/paravirt: Provide a way to check for hypervisors
  2017-12-04 14:07 ` [patch 25/60] x86/paravirt: Provide a way to check for hypervisors Thomas Gleixner
@ 2017-12-05 12:19   ` Juergen Gross
  0 siblings, 0 replies; 118+ messages in thread
From: Juergen Gross @ 2017-12-05 12:19 UTC (permalink / raw)
  To: Thomas Gleixner, LKML
  Cc: x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, keescook, hughd,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Rik van Riel,
	Boris Ostrovsky, David Laight, Eduardo Valentin, aliguori,
	Will Deacon, daniel.gruss

On 04/12/17 15:07, Thomas Gleixner wrote:
> From: Thomas Gleixner <tglx@linutronix.de>
> 
> There is no generic way to test whether a kernel is running on a specific
> hypervisor. But that's required to prevent the upcoming user address space
> separation feature in certain guest modes.
> 
> Make the hypervisor type enum unconditionally available and provide a
> helper function which allows to test for a specific type.
> 
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

Reviewed-by: Juergen Gross <jgross@suse.com>


Juergen

^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [patch 55/60] x86/mm: Use INVPCID for __native_flush_tlb_single()
  2017-12-04 22:51     ` Peter Zijlstra
@ 2017-12-05 13:51       ` Dave Hansen
  2017-12-05 14:08         ` Peter Zijlstra
  0 siblings, 1 reply; 118+ messages in thread
From: Dave Hansen @ 2017-12-05 13:51 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski
  Cc: Thomas Gleixner, LKML, X86 ML, Linus Torvalds, Borislav Petkov,
	Greg KH, Kees Cook, Hugh Dickins, Brian Gerst, Josh Poimboeuf,
	Denys Vlasenko, Rik van Riel, Boris Ostrovsky, Juergen Gross,
	David Laight, Eduardo Valentin, aliguori, Will Deacon,
	Daniel Gruss, Dave Hansen

On 12/04/2017 02:51 PM, Peter Zijlstra wrote:
> We should find some of the benchmarks that were used to determine
> tlb_single_page_flush_ceiling and measure. I've not gotten around to
> doing either.
> 
> Someone called Dave Hansen did that patch and might still have something
> lying around to help with that:
> 
>   a5102476a24b ("x86/mm: Set TLB flush tunable to sane value (33)")

I hate git. :)

But, yeah, we have certainly changed enough variables to necessitate
measuring it again.

^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [patch 55/60] x86/mm: Use INVPCID for __native_flush_tlb_single()
  2017-12-05 13:51       ` Dave Hansen
@ 2017-12-05 14:08         ` Peter Zijlstra
  0 siblings, 0 replies; 118+ messages in thread
From: Peter Zijlstra @ 2017-12-05 14:08 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Andy Lutomirski, Thomas Gleixner, LKML, X86 ML, Linus Torvalds,
	Borislav Petkov, Greg KH, Kees Cook, Hugh Dickins, Brian Gerst,
	Josh Poimboeuf, Denys Vlasenko, Rik van Riel, Boris Ostrovsky,
	Juergen Gross, David Laight, Eduardo Valentin, aliguori,
	Will Deacon, Daniel Gruss, Dave Hansen

On Tue, Dec 05, 2017 at 05:51:28AM -0800, Dave Hansen wrote:
> On 12/04/2017 02:51 PM, Peter Zijlstra wrote:
> > We should find some of the benchmarks that were used to determine
> > tlb_single_page_flush_ceiling and measure. I've not gotten around to
> > doing either.
> > 
> > Someone called Dave Hansen did that patch and might still have something
> > lying around to help with that:
> > 
> >   a5102476a24b ("x86/mm: Set TLB flush tunable to sane value (33)")
> 
> I hate git. :)

:-)

> But, yeah, we have certainly changed enough variables to necessitate
> measuring it again.

Its more than that, I think much of that could show if it makes sense to
use invpcid_flush_one() at all.

^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [patch 28/60] x86/mm/kpti: Disable global pages if KERNEL_PAGE_TABLE_ISOLATION=y
  2017-12-04 14:07 ` [patch 28/60] x86/mm/kpti: Disable global pages if KERNEL_PAGE_TABLE_ISOLATION=y Thomas Gleixner
@ 2017-12-05 14:34   ` Borislav Petkov
  0 siblings, 0 replies; 118+ messages in thread
From: Borislav Petkov @ 2017-12-05 14:34 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Greg KH, keescook, hughd, Brian Gerst,
	Josh Poimboeuf, Denys Vlasenko, Rik van Riel, Boris Ostrovsky,
	Juergen Gross, David Laight, Eduardo Valentin, aliguori,
	Will Deacon, daniel.gruss, Dave Hansen, Ingo Molnar, moritz.lipp,
	linux-mm, richard.fellner, michael.schwarz

On Mon, Dec 04, 2017 at 03:07:34PM +0100, Thomas Gleixner wrote:
> From: Dave Hansen <dave.hansen@linux.intel.com>
> 
> Global pages stay in the TLB across context switches.  Since all contexts
> share the same kernel mapping, these mappings are marked as global pages
> so kernel entries in the TLB are not flushed out on a context switch.
> 
> But, even having these entries in the TLB opens up something that an
> attacker can use, such as the double-page-fault attack:
> 
>    http://www.ieee-security.org/TC/SP2013/papers/4977a191.pdf
> 
> That means that even when KERNEL_PAGE_TABLE_ISOLATION switches page tables
> on return to user space the global pages would stay in the TLB cache.
> 
> Disable global pages so that kernel TLB entries can be flushed before
> returning to user space. This way, all accesses to kernel addresses from
> userspace result in a TLB miss independent of the existence of a kernel
> mapping.
> 
> Supress global pages via the __supported_pte_mask. The user space

"Suppress"

Otherwise

Reviewed-by: Borislav Petkov <bp@suse.de>

-- 
Regards/Gruss,
    Boris.

SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
-- 

^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [patch 30/60] x86/mm/kpti: Add infrastructure for page table isolation
  2017-12-04 14:07 ` [patch 30/60] x86/mm/kpti: Add infrastructure for page table isolation Thomas Gleixner
@ 2017-12-05 15:20   ` Borislav Petkov
  0 siblings, 0 replies; 118+ messages in thread
From: Borislav Petkov @ 2017-12-05 15:20 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Greg KH, keescook, hughd, Brian Gerst,
	Josh Poimboeuf, Denys Vlasenko, Rik van Riel, Boris Ostrovsky,
	Juergen Gross, David Laight, Eduardo Valentin, aliguori,
	Will Deacon, daniel.gruss

On Mon, Dec 04, 2017 at 03:07:36PM +0100, Thomas Gleixner wrote:
> From: Thomas Gleixner <tglx@linutronix.de>
> 
> Add the initial files for kernel page table isolation, with a minimal init
> function and the boot time detection for this misfeature.
> 
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> ---
>  Documentation/admin-guide/kernel-parameters.txt |    2 
>  arch/x86/boot/compressed/pagetable.c            |    3 
>  arch/x86/entry/calling.h                        |    7 ++
>  arch/x86/include/asm/kpti.h                     |   14 ++++
>  arch/x86/mm/Makefile                            |    7 +-
>  arch/x86/mm/init.c                              |    2 
>  arch/x86/mm/kpti.c                              |   76 ++++++++++++++++++++++++
>  include/linux/kpti.h                            |   11 +++
>  init/main.c                                     |    2 
>  9 files changed, 121 insertions(+), 3 deletions(-)

Nice splitting.

Reviewed-by: Borislav Petkov <bp@suse.de>

-- 
Regards/Gruss,
    Boris.

SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
-- 

^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [patch 31/60] x86/mm/kpti: Add mapping helper functions
  2017-12-04 14:07 ` [patch 31/60] x86/mm/kpti: Add mapping helper functions Thomas Gleixner
  2017-12-04 22:27   ` Andy Lutomirski
@ 2017-12-05 16:01   ` Borislav Petkov
  2017-12-07  8:33     ` Borislav Petkov
  1 sibling, 1 reply; 118+ messages in thread
From: Borislav Petkov @ 2017-12-05 16:01 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Greg KH, keescook, hughd, Brian Gerst,
	Josh Poimboeuf, Denys Vlasenko, Rik van Riel, Boris Ostrovsky,
	Juergen Gross, David Laight, Eduardo Valentin, aliguori,
	Will Deacon, daniel.gruss, Dave Hansen

On Mon, Dec 04, 2017 at 03:07:37PM +0100, Thomas Gleixner wrote:
> From: Dave Hansen <dave.hansen@linux.intel.com>
> 
> Add the pagetable helper functions do manage the separate user space page
> tables.
> 
> [ tglx: Split out from the big combo kaiser patch ]
> 
> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> 
> ---
>  arch/x86/include/asm/pgtable_64.h |  139 ++++++++++++++++++++++++++++++++++++++
>  1 file changed, 139 insertions(+)

...

> +/*
> + * Take a PGD location (pgdp) and a pgd value that needs to be set there.
> + * Populates the user and returns the resulting PGD that must be set in
> + * the kernel copy of the page tables.
> + */
> +static inline pgd_t kpti_set_user_pgd(pgd_t *pgdp, pgd_t pgd)
> +{

Btw, do we want to inline a relatively big function like that? I see at
least 20-ish callsites of set_pgd() only.

> +#ifdef CONFIG_KERNEL_PAGE_TABLE_ISOLATION
> +	if (!static_cpu_has_bug(X86_BUG_CPU_SECURE_MODE_KPTI))
> +		return pgd;
> +
> +	if (pgd_userspace_access(pgd)) {
> +		if (pgdp_maps_userspace(pgdp)) {
> +			/*
> +			 * The user page tables get the full PGD,
> +			 * accessible from userspace:
> +			 */
> +			kernel_to_user_pgdp(pgdp)->pgd = pgd.pgd;
> +			/*
> +			 * For the copy of the pgd that the kernel uses,
> +			 * make it unusable to userspace.  This ensures on
> +			 * in case that a return to userspace with the
> +			 * kernel CR3 value, userspace will crash instead
> +			 * of running.
> +			 *
> +			 * Note: NX might be not available or disabled.
> +			 */
> +			if (__supported_pte_mask & _PAGE_NX)
> +				pgd.pgd |= _PAGE_NX;
> +		}
> +	} else if (pgd_userspace_access(*pgdp)) {
> +		/*
> +		 * We are clearing a _PAGE_USER PGD for which we presumably
> +		 * populated the user PGD.  We must now clear the user PGD
> +		 * entry.
> +		 */
> +		if (pgdp_maps_userspace(pgdp)) {
> +			kernel_to_user_pgdp(pgdp)->pgd = pgd.pgd;
> +		} else {
> +			/*
> +			 * Attempted to clear a _PAGE_USER PGD which is in
> +			 * the kernel porttion of the address space.  PGDs

"portion"

> +			 * are pre-populated and we never clear them.
> +			 */
> +			WARN_ON_ONCE(1);
> +		}
> +	} else {
> +		/*
> +		 * _PAGE_USER was not set in either the PGD being set or
> +		 * cleared.  All kernel PGDs should be pre-populated so
> +		 * this should never happen after boot.
> +		 */
> +		WARN_ON_ONCE(system_state == SYSTEM_RUNNING);
> +	}

Btw, we could keep the warning and have a separate path kernel users
like kernel_ident_mapping_init() (i.e., kexec, hibernation, et al) call
to bypass the warning vs all the remaining users which call the default
set_pgd().

> +#endif
> +	/* return the copy of the PGD we want the kernel to use: */
> +	return pgd;
> +}
> +
> +
>  static inline void native_set_p4d(p4d_t *p4dp, p4d_t p4d)
>  {
> +#if defined(CONFIG_KERNEL_PAGE_TABLE_ISOLATION) && !defined(CONFIG_X86_5LEVEL)
> +	p4dp->pgd = kpti_set_user_pgd(&p4dp->pgd, p4d.pgd);
> +#else
>  	*p4dp = p4d;
> +#endif
>  }
>  
>  static inline void native_p4d_clear(p4d_t *p4d)
> @@ -147,7 +282,11 @@ static inline void native_p4d_clear(p4d_
>  
>  static inline void native_set_pgd(pgd_t *pgdp, pgd_t pgd)
>  {
> +#ifdef CONFIG_KERNEL_PAGE_TABLE_ISOLATION
> +	*pgdp = kpti_set_user_pgd(pgdp, pgd);
> +#else
>  	*pgdp = pgd;
> +#endif

I guess that ifdef is not needed as kpti_set_user_pgd() already has it
so we can do simply:

static inline void native_set_pgd(pgd_t *pgdp, pgd_t pgd)
{
        *pgdp = kpti_set_user_pgd(pgdp, pgd);
}

AFAICT.

-- 
Regards/Gruss,
    Boris.

SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
-- 

^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [patch 32/60] x86/mm/kpti: Allow NX poison to be set in p4d/pgd
  2017-12-04 14:07 ` [patch 32/60] x86/mm/kpti: Allow NX poison to be set in p4d/pgd Thomas Gleixner
@ 2017-12-05 17:09   ` Borislav Petkov
  0 siblings, 0 replies; 118+ messages in thread
From: Borislav Petkov @ 2017-12-05 17:09 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Greg KH, keescook, hughd, Brian Gerst,
	Josh Poimboeuf, Denys Vlasenko, Rik van Riel, Boris Ostrovsky,
	Juergen Gross, David Laight, Eduardo Valentin, aliguori,
	Will Deacon, daniel.gruss, Dave Hansen

On Mon, Dec 04, 2017 at 03:07:38PM +0100, Thomas Gleixner wrote:
> From: Dave Hansen <dave.hansen@linux.intel.com>
> 
> With KERNEL_PAGE_TABLE_ISOLATION the user portion of the kernel page
> tables is poisoned with the NX bit so if the entry code exits with the
> kernel page tables selected in CR3, userspace crashes.
> 
> But doing so trips the p4d/pgd_bad() checks.  Make sure it does not do
> that.
> 
> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> 
> ---
>  arch/x86/include/asm/pgtable.h |   14 ++++++++++++--
>  1 file changed, 12 insertions(+), 2 deletions(-)

Reviewed-by: Borislav Petkov <bp@suse.de>

-- 
Regards/Gruss,
    Boris.

SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
-- 

^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [patch 33/60] x86/mm/kpti: Allocate a separate user PGD
  2017-12-04 14:07 ` [patch 33/60] x86/mm/kpti: Allocate a separate user PGD Thomas Gleixner
@ 2017-12-05 18:33   ` Borislav Petkov
  2017-12-06 20:56     ` Ingo Molnar
  0 siblings, 1 reply; 118+ messages in thread
From: Borislav Petkov @ 2017-12-05 18:33 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Greg KH, keescook, hughd, Brian Gerst,
	Josh Poimboeuf, Denys Vlasenko, Rik van Riel, Boris Ostrovsky,
	Juergen Gross, David Laight, Eduardo Valentin, aliguori,
	Will Deacon, daniel.gruss, Dave Hansen

On Mon, Dec 04, 2017 at 03:07:39PM +0100, Thomas Gleixner wrote:
> From: Dave Hansen <dave.hansen@linux.intel.com>
> 
> Kernel page table isolation requires to have two PGDs. One for the kernel,
> which contains the full kernel mapping plus the user space mapping and one
> for user space which contains the user space mappings and the minimal set
> of kernel mappings which are required by the architecture to be able to
> transition from and to user space.
> 
> Add the necessary preliminaries.
> 
> [ tglx: Split out from the big kaiser dump ]
> 
> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> 
> ---
>  arch/x86/kernel/head_64.S |   30 +++++++++++++++++++++++++++---
>  arch/x86/mm/pgtable.c     |   16 ++++++++++++++--
>  2 files changed, 41 insertions(+), 5 deletions(-)

...

> --- a/arch/x86/mm/pgtable.c
> +++ b/arch/x86/mm/pgtable.c
> @@ -355,14 +355,26 @@ static inline void _pgd_free(pgd_t *pgd)
>  		kmem_cache_free(pgd_cache, pgd);
>  }
>  #else
> +
> +#ifdef CONFIG_KERNEL_PAGE_TABLE_ISOLATION
> +/*
> + * Instead of one pgd, we aquire two pgds.  Being order-1, it is

"acquire"

Otherwise:

Reviewed-by: Borislav Petkov <bp@suse.de>

-- 
Regards/Gruss,
    Boris.

SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
-- 

^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [patch 34/60] x86/mm/kpti: Populate user PGD
  2017-12-04 14:07 ` [patch 34/60] x86/mm/kpti: Populate " Thomas Gleixner
@ 2017-12-05 19:17   ` Borislav Petkov
  0 siblings, 0 replies; 118+ messages in thread
From: Borislav Petkov @ 2017-12-05 19:17 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Greg KH, keescook, hughd, Brian Gerst,
	Josh Poimboeuf, Denys Vlasenko, Rik van Riel, Boris Ostrovsky,
	Juergen Gross, David Laight, Eduardo Valentin, aliguori,
	Will Deacon, daniel.gruss, Dave Hansen

On Mon, Dec 04, 2017 at 03:07:40PM +0100, Thomas Gleixner wrote:
> From: Dave Hansen <dave.hansen@linux.intel.com>
> 
> Populate the PGD entries in the init user PGD which cover the kernel half
> of the address space. This makes sure that the installment of the user
> visible kernel mappings finds a populated PGD.
> 
> In clone_pgd_range() copy the init user PGDs which cover the kernel half of
> the address space, so a process has all the required kernel mappings
> visible.
> 
> [ tglx: Split out from the big kaiser dump ]
> 
> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> 
> ---
>  arch/x86/include/asm/pgtable.h |    5 +++++
>  arch/x86/mm/kpti.c             |   41 +++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 46 insertions(+)

Reviewed-by: Borislav Petkov <bp@suse.de>

-- 
Regards/Gruss,
    Boris.

SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
-- 

^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [patch 53/60] x86/mm: Use/Fix PCID to optimize user/kernel switches
  2017-12-04 14:07 ` [patch 53/60] x86/mm: Use/Fix PCID to optimize user/kernel switches Thomas Gleixner
@ 2017-12-05 21:46   ` Andy Lutomirski
  2017-12-05 22:05     ` Peter Zijlstra
  0 siblings, 1 reply; 118+ messages in thread
From: Andy Lutomirski @ 2017-12-05 21:46 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, X86 ML, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, Kees Cook, Hugh Dickins,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Rik van Riel,
	Boris Ostrovsky, Juergen Gross, David Laight, Eduardo Valentin,
	aliguori, Will Deacon, Daniel Gruss

On Mon, Dec 4, 2017 at 6:07 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> We can use PCID to retain the TLBs across CR3 switches; including
> those now part of the user/kernel switch. This increases performance
> of kernel entry/exit at the cost of more expensive/complicated TLB
> flushing.
>
> Now that we have two address spaces, one for kernel and one for user
> space, we need two PCIDs per mm. We use the top PCID bit to indicate a
> user PCID (just like we use the PFN LSB for the PGD). Since we do TLB
> invalidation from kernel space, the existing code will only invalidate
> the kernel PCID, we augment that by marking the corresponding user
> PCID invalid, and upon switching back to userspace, use a flushing CR3
> write for the switch.
>
> In order to access the user_pcid_flush_mask we use PER_CPU storage,
> which means the previously established SWAPGS vs CR3 ordering is now
> mandatory and required.
>
> Having to do this memory access does require additional registers,
> most sites have a functioning stack and we can spill one (RAX), sites
> without functional stack need to otherwise provide the second scratch
> register.
>
> Note: PCID is generally available on Intel Sandybridge and later CPUs.
> Note: Up until this point TLB flushing was broken in this series.

I haven't checked that hard which patch introduces this bug, but it
seems that, with this applied, nothing propagates
non-mm-switch-related flushes to usermode.  Shouldn't
flush_tlb_func_common() contain a call to invalidate_user_asid() near
the bottom?  Alternatively, it could be in local_flush_tlb() and
__flush_tlb_single() (or whatever the hell the flush-one-usermode-TLB
function ends up being called).

Also, on a somewhat related note, __flush_tlb_single() is called from
both flush_tlb_func_common() and do_kernel_range_flush.  That sounds
wrong.

^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [patch 00/60] x86/kpti: Kernel Page Table Isolation (was KAISER)
  2017-12-04 14:07 [patch 00/60] x86/kpti: Kernel Page Table Isolation (was KAISER) Thomas Gleixner
                   ` (60 preceding siblings ...)
  2017-12-04 18:02 ` [patch 00/60] x86/kpti: Kernel Page Table Isolation (was KAISER) Linus Torvalds
@ 2017-12-05 21:49 ` Andy Lutomirski
  2017-12-05 21:57   ` Dave Hansen
  2018-01-19 20:56 ` Andrew Morton
  62 siblings, 1 reply; 118+ messages in thread
From: Andy Lutomirski @ 2017-12-05 21:49 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, X86 ML, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, Kees Cook, Hugh Dickins,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Rik van Riel,
	Boris Ostrovsky, Juergen Gross, David Laight, Eduardo Valentin,
	aliguori, Will Deacon, Daniel Gruss

Random thought for the future: KPTI will make it possible to avoid
global IPI broadcasts on kernel flushes as we discussed, incorrectly,
two years ago at LPC.  This could be nice.

^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [patch 00/60] x86/kpti: Kernel Page Table Isolation (was KAISER)
  2017-12-05 21:49 ` Andy Lutomirski
@ 2017-12-05 21:57   ` Dave Hansen
  2017-12-05 23:19     ` Andy Lutomirski
  0 siblings, 1 reply; 118+ messages in thread
From: Dave Hansen @ 2017-12-05 21:57 UTC (permalink / raw)
  To: Andy Lutomirski, Thomas Gleixner
  Cc: LKML, X86 ML, Linus Torvalds, Peter Zijlstra, Borislav Petkov,
	Greg KH, Kees Cook, Hugh Dickins, Brian Gerst, Josh Poimboeuf,
	Denys Vlasenko, Rik van Riel, Boris Ostrovsky, Juergen Gross,
	David Laight, Eduardo Valentin, aliguori, Will Deacon,
	Daniel Gruss

On 12/05/2017 01:49 PM, Andy Lutomirski wrote:
> Random thought for the future: KPTI will make it possible to avoid
> global IPI broadcasts on kernel flushes as we discussed, incorrectly,
> two years ago at LPC.  This could be nice.

I'm slow.  How?

^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [patch 53/60] x86/mm: Use/Fix PCID to optimize user/kernel switches
  2017-12-05 21:46   ` Andy Lutomirski
@ 2017-12-05 22:05     ` Peter Zijlstra
  2017-12-05 22:08       ` Dave Hansen
  0 siblings, 1 reply; 118+ messages in thread
From: Peter Zijlstra @ 2017-12-05 22:05 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Thomas Gleixner, LKML, X86 ML, Linus Torvalds, Dave Hansen,
	Borislav Petkov, Greg KH, Kees Cook, Hugh Dickins, Brian Gerst,
	Josh Poimboeuf, Denys Vlasenko, Rik van Riel, Boris Ostrovsky,
	Juergen Gross, David Laight, Eduardo Valentin, aliguori,
	Will Deacon, Daniel Gruss

On Tue, Dec 05, 2017 at 01:46:36PM -0800, Andy Lutomirski wrote:
> On Mon, Dec 4, 2017 at 6:07 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> > We can use PCID to retain the TLBs across CR3 switches; including
> > those now part of the user/kernel switch. This increases performance
> > of kernel entry/exit at the cost of more expensive/complicated TLB
> > flushing.
> >
> > Now that we have two address spaces, one for kernel and one for user
> > space, we need two PCIDs per mm. We use the top PCID bit to indicate a
> > user PCID (just like we use the PFN LSB for the PGD). Since we do TLB
> > invalidation from kernel space, the existing code will only invalidate
> > the kernel PCID, we augment that by marking the corresponding user
> > PCID invalid, and upon switching back to userspace, use a flushing CR3
> > write for the switch.
> >
> > In order to access the user_pcid_flush_mask we use PER_CPU storage,
> > which means the previously established SWAPGS vs CR3 ordering is now
> > mandatory and required.
> >
> > Having to do this memory access does require additional registers,
> > most sites have a functioning stack and we can spill one (RAX), sites
> > without functional stack need to otherwise provide the second scratch
> > register.
> >
> > Note: PCID is generally available on Intel Sandybridge and later CPUs.
> > Note: Up until this point TLB flushing was broken in this series.
> 
> I haven't checked that hard which patch introduces this bug, but it
> seems that, with this applied, nothing propagates
> non-mm-switch-related flushes to usermode.  Shouldn't
> flush_tlb_func_common() contain a call to invalidate_user_asid() near
> the bottom?  Alternatively, it could be in local_flush_tlb() and
> __flush_tlb_single() (or whatever the hell the flush-one-usermode-TLB
> function ends up being called).

__native_flush_tlb_single() has the invalidate_user_asid()
__native_flush_tlb() has the invalidate_user_asid().

Which should be exactly that last option you mention.

> Also, on a somewhat related note, __flush_tlb_single() is called from
> both flush_tlb_func_common() and do_kernel_range_flush.  That sounds
> wrong.

Fixed that in the patches I send out earlier today.

^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [patch 53/60] x86/mm: Use/Fix PCID to optimize user/kernel switches
  2017-12-05 22:05     ` Peter Zijlstra
@ 2017-12-05 22:08       ` Dave Hansen
  0 siblings, 0 replies; 118+ messages in thread
From: Dave Hansen @ 2017-12-05 22:08 UTC (permalink / raw)
  To: Peter Zijlstra, Andy Lutomirski
  Cc: Thomas Gleixner, LKML, X86 ML, Linus Torvalds, Borislav Petkov,
	Greg KH, Kees Cook, Hugh Dickins, Brian Gerst, Josh Poimboeuf,
	Denys Vlasenko, Rik van Riel, Boris Ostrovsky, Juergen Gross,
	David Laight, Eduardo Valentin, aliguori, Will Deacon,
	Daniel Gruss

On 12/05/2017 02:05 PM, Peter Zijlstra wrote:
>> I haven't checked that hard which patch introduces this bug, but it
>> seems that, with this applied, nothing propagates
>> non-mm-switch-related flushes to usermode.  Shouldn't
>> flush_tlb_func_common() contain a call to invalidate_user_asid() near
>> the bottom?  Alternatively, it could be in local_flush_tlb() and
>> __flush_tlb_single() (or whatever the hell the flush-one-usermode-TLB
>> function ends up being called).
> __native_flush_tlb_single() has the invalidate_user_asid()
> __native_flush_tlb() has the invalidate_user_asid().
> 
> Which should be exactly that last option you mention.

I can also see INVPCIDs in profiles, so it's definitely getting used.

^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [patch 00/60] x86/kpti: Kernel Page Table Isolation (was KAISER)
  2017-12-05 21:57   ` Dave Hansen
@ 2017-12-05 23:19     ` Andy Lutomirski
  0 siblings, 0 replies; 118+ messages in thread
From: Andy Lutomirski @ 2017-12-05 23:19 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Andy Lutomirski, Thomas Gleixner, LKML, X86 ML, Linus Torvalds,
	Peter Zijlstra, Borislav Petkov, Greg KH, Kees Cook,
	Hugh Dickins, Brian Gerst, Josh Poimboeuf, Denys Vlasenko,
	Rik van Riel, Boris Ostrovsky, Juergen Gross, David Laight,
	Eduardo Valentin, aliguori, Will Deacon, Daniel Gruss

On Tue, Dec 5, 2017 at 1:57 PM, Dave Hansen <dave.hansen@intel.com> wrote:
> On 12/05/2017 01:49 PM, Andy Lutomirski wrote:
>> Random thought for the future: KPTI will make it possible to avoid
>> global IPI broadcasts on kernel flushes as we discussed, incorrectly,
>> two years ago at LPC.  This could be nice.
>
> I'm slow.  How?
>


By introducing an (optional) atomic check for need-to-flush on
switches from user CR3 to kernel CR3.

^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [patch 36/60] x86/mm/kpti: Add functions to clone kernel PMDs
  2017-12-04 14:07 ` [patch 36/60] x86/mm/kpti: Add functions to clone kernel PMDs Thomas Gleixner
@ 2017-12-06 15:39   ` Borislav Petkov
  0 siblings, 0 replies; 118+ messages in thread
From: Borislav Petkov @ 2017-12-06 15:39 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Greg KH, keescook, hughd, Brian Gerst,
	Josh Poimboeuf, Denys Vlasenko, Rik van Riel, Boris Ostrovsky,
	Juergen Gross, David Laight, Eduardo Valentin, aliguori,
	Will Deacon, daniel.gruss

On Mon, Dec 04, 2017 at 03:07:42PM +0100, Thomas Gleixner wrote:
> From: Andy Lutomirski <luto@kernel.org>
> 
> Provide infrastructure to:
> 
>  - find a kernel PMD for a mapping which must be visible to user space for
>    the entry/exit code to work.
> 
>  - walk an address range and share the kernel PMD with it.
> 
> This reuses a small part of the original KAISER patches to populate the
> user space page table.
> 
> [ tglx: Made it universally usable so it can be used for any kind of shared
>   	mapping. Add a mechanism to clear specific bits in the user space
> 	visible PMD entry. ]
> 
> Originally-by: Dave Hansen <dave.hansen@linux.intel.com>
> Signed-off-by: Andy Lutomirski <luto@kernel.org>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> 
> ---
>  arch/x86/mm/kpti.c |  102 +++++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 102 insertions(+)
> 
> --- a/arch/x86/mm/kpti.c
> +++ b/arch/x86/mm/kpti.c
> @@ -65,6 +65,108 @@ void __init kpti_check_boottime_disable(
>  }
>  
>  /*
> + * Walk the user copy of the page tables (optionally) trying to allocate
> + * page table pages on the way down.
> + *
> + * Returns a pointer to a PMD on success, or NULL on failure.
> + */
> +static pmd_t *kpti_user_pagetable_walk_pmd(unsigned long address)
> +{
> +	pgd_t *pgd = kernel_to_user_pgdp(pgd_offset_k(address));
> +	gfp_t gfp = (GFP_KERNEL | __GFP_NOTRACK | __GFP_ZERO);
> +	pud_t *pud;
> +	p4d_t *p4d;
> +
> +	if (address < PAGE_OFFSET) {
> +		WARN_ONCE(1, "attempt to walk user address\n");
> +		return NULL;
> +	}
> +
> +	if (pgd_none(*pgd)) {
> +		WARN_ONCE(1, "All user pgds should have been populated\n");
> +		return NULL;
> +	}
> +	BUILD_BUG_ON(pgd_large(*pgd) != 0);

Must be some 5LEVEL thing? Because it currently does:

static inline int pgd_large(pgd_t pgd) { return 0; }

> +
> +	p4d = p4d_offset(pgd, address);
> +	BUILD_BUG_ON(p4d_large(*p4d) != 0);

That too.

> +	if (p4d_none(*p4d)) {
> +		unsigned long new_pud_page = __get_free_page(gfp);
> +		if (!new_pud_page)
> +			return NULL;
> +
> +		if (p4d_none(*p4d)) {

We already tested that above or does __get_free_page() have side-effects?

> +			set_p4d(p4d, __p4d(_KERNPG_TABLE | __pa(new_pud_page)));
> +			new_pud_page = 0;
> +		}
> +		if (new_pud_page)
> +			free_page(new_pud_page);
> +	}
> +
> +	pud = pud_offset(p4d, address);
> +	/* The user page tables do not use large mappings: */
> +	if (pud_large(*pud)) {
> +		WARN_ON(1);
> +		return NULL;
> +	}
> +	if (pud_none(*pud)) {
> +		unsigned long new_pmd_page = __get_free_page(gfp);
> +		if (!new_pmd_page)
> +			return NULL;
> +
> +		if (pud_none(*pud)) {

Ditto.

-- 
Regards/Gruss,
    Boris.

SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
-- 

^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [patch 37/60] x86mm//kpti: Force entry through trampoline when KPTI active
  2017-12-04 14:07 ` [patch 37/60] x86mm//kpti: Force entry through trampoline when KPTI active Thomas Gleixner
@ 2017-12-06 16:01   ` Borislav Petkov
  0 siblings, 0 replies; 118+ messages in thread
From: Borislav Petkov @ 2017-12-06 16:01 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Greg KH, keescook, hughd, Brian Gerst,
	Josh Poimboeuf, Denys Vlasenko, Rik van Riel, Boris Ostrovsky,
	Juergen Gross, David Laight, Eduardo Valentin, aliguori,
	Will Deacon, daniel.gruss

On Mon, Dec 04, 2017 at 03:07:43PM +0100, Thomas Gleixner wrote:
> From: Thomas Gleixner <tglx@linutronix.de>
> 
> Force the entry through the trampoline only when KPTI is active. Otherwise
> go through the normal entry code.
> 
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> 
> ---
>  arch/x86/kernel/cpu/common.c |    5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> --- a/arch/x86/kernel/cpu/common.c
> +++ b/arch/x86/kernel/cpu/common.c
> @@ -1458,7 +1458,10 @@ void syscall_init(void)
>  		(entry_SYSCALL_64_trampoline - _entry_trampoline);
>  
>  	wrmsr(MSR_STAR, 0, (__USER32_CS << 16) | __KERNEL_CS);
> -	wrmsrl(MSR_LSTAR, SYSCALL64_entry_trampoline);
> +	if (static_cpu_has_bug(X86_BUG_CPU_SECURE_MODE_KPTI))
> +		wrmsrl(MSR_LSTAR, SYSCALL64_entry_trampoline);
> +	else
> +		wrmsrl(MSR_LSTAR, (unsigned long)entry_SYSCALL_64);
>  
>  #ifdef CONFIG_IA32_EMULATION
>  	wrmsrl(MSR_CSTAR, (unsigned long)entry_SYSCALL_compat);

Just a nitpick:

Subject: x86/mm/kpti:...

Otherwise,

Reviewed-by: Borislav Petkov <bp@suse.de>

-- 
Regards/Gruss,
    Boris.

SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
-- 

^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [patch 38/60] x86/fixmap: Move cpu entry area into a separate PMD
  2017-12-04 14:07 ` [patch 38/60] x86/fixmap: Move cpu entry area into a separate PMD Thomas Gleixner
@ 2017-12-06 18:57   ` Borislav Petkov
  0 siblings, 0 replies; 118+ messages in thread
From: Borislav Petkov @ 2017-12-06 18:57 UTC (permalink / raw)
  To: Thomas Gleixner, Andy Lutomirsky
  Cc: LKML, x86, Linus Torvalds, Peter Zijlstra, Dave Hansen, Greg KH,
	keescook, hughd, Brian Gerst, Josh Poimboeuf, Denys Vlasenko,
	Rik van Riel, Boris Ostrovsky, Juergen Gross, David Laight,
	Eduardo Valentin, aliguori, Will Deacon, daniel.gruss

On Mon, Dec 04, 2017 at 03:07:44PM +0100, Thomas Gleixner wrote:
> From: Andy Lutomirski <luto@kernel.org>
> 
> This allows the cpu entry area PMDs to be shared between the kernel and
> user space page tables.
> 
> [ tglx: Fixed bottom of by one and added guards so other fixmaps can be
>   	added later ]
> 
> Signed-off-by: Andy Lutomirski <luto@kernel.org>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> 
> ---
>  arch/x86/include/asm/fixmap.h |   14 ++++++++++----
>  1 file changed, 10 insertions(+), 4 deletions(-)
> 
> --- a/arch/x86/include/asm/fixmap.h
> +++ b/arch/x86/include/asm/fixmap.h
> @@ -134,16 +134,22 @@ enum fixed_addresses {
>  #ifdef CONFIG_PARAVIRT
>  	FIX_PARAVIRT_BOOTMAP,
>  #endif
> -	FIX_TEXT_POKE1,	/* reserve 2 pages for text_poke() */
> -	FIX_TEXT_POKE0, /* first page is last, because allocation is backward */
>  #ifdef	CONFIG_X86_INTEL_MID
>  	FIX_LNW_VRTC,
>  #endif
> -	/* Fixmap entries to remap the GDTs, one per processor. */
> +	FIX_TEXT_POKE1,	/* reserve 2 pages for text_poke() */
> +	FIX_TEXT_POKE0, /* first page is last, because allocation is backward */
> +
> +	/*
> +	 * Fixmap entries to remap the IDT, and the per cpu entry areas.
> +	 * Aligend to a PMD boundary.

"Aligned"

> +	 */
> +	FIX_USR_SHARED_TOP = round_up(FIX_TEXT_POKE0 + 1, PTRS_PER_PMD),
>  	FIX_CPU_ENTRY_AREA_TOP,
>  	FIX_CPU_ENTRY_AREA_BOTTOM = FIX_CPU_ENTRY_AREA_TOP + (CPU_ENTRY_AREA_PAGES * NR_CPUS) - 1,
> +	FIX_USR_SHARED_BOTTOM  = round_up(FIX_CPU_ENTRY_AREA_BOTTOM + 2, PTRS_PER_PMD) - 1,

So those look like this here:

FIX_TEXT_POKE0:			0x285, va: 0xffffffffff57a000
FIX_USR_SHARED_TOP:		0x400, va: 0xffffffffff3ff000
FIX_CPU_ENTRY_AREA_TOP:		0x401, va: 0xffffffffff3fe000
FIX_CPU_ENTRY_AREA_BOTTOM:	0x458, va: 0xffffffffff3a7000
FIX_USR_SHARED_BOTTOM:		0x5ff, va: 0xffffffffff200000

and FIX_CPU_ENTRY_AREA_TOP is the one PTE before the last 4K. But we
could just as well use the last one too, no? I.e.,

	FIX_USR_SHARED_TOP = round_up(FIX_TEXT_POKE0 + 1, PTRS_PER_PMD),
	FIX_CPU_ENTRY_AREA_TOP = FIX_USR_SHARED_TOP,

?

-- 
Regards/Gruss,
    Boris.

SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
-- 

^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [patch 33/60] x86/mm/kpti: Allocate a separate user PGD
  2017-12-05 18:33   ` Borislav Petkov
@ 2017-12-06 20:56     ` Ingo Molnar
  0 siblings, 0 replies; 118+ messages in thread
From: Ingo Molnar @ 2017-12-06 20:56 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Thomas Gleixner, LKML, x86, Linus Torvalds, Andy Lutomirsky,
	Peter Zijlstra, Dave Hansen, Greg KH, keescook, hughd,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Rik van Riel,
	Boris Ostrovsky, Juergen Gross, David Laight, Eduardo Valentin,
	aliguori, Will Deacon, daniel.gruss, Dave Hansen


* Borislav Petkov <bp@suse.de> wrote:

> On Mon, Dec 04, 2017 at 03:07:39PM +0100, Thomas Gleixner wrote:
> > From: Dave Hansen <dave.hansen@linux.intel.com>
> > 
> > Kernel page table isolation requires to have two PGDs. One for the kernel,
> > which contains the full kernel mapping plus the user space mapping and one
> > for user space which contains the user space mappings and the minimal set
> > of kernel mappings which are required by the architecture to be able to
> > transition from and to user space.
> > 
> > Add the necessary preliminaries.
> > 
> > [ tglx: Split out from the big kaiser dump ]
> > 
> > Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
> > Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> > 
> > ---
> >  arch/x86/kernel/head_64.S |   30 +++++++++++++++++++++++++++---
> >  arch/x86/mm/pgtable.c     |   16 ++++++++++++++--
> >  2 files changed, 41 insertions(+), 5 deletions(-)
> 
> ...
> 
> > --- a/arch/x86/mm/pgtable.c
> > +++ b/arch/x86/mm/pgtable.c
> > @@ -355,14 +355,26 @@ static inline void _pgd_free(pgd_t *pgd)
> >  		kmem_cache_free(pgd_cache, pgd);
> >  }
> >  #else
> > +
> > +#ifdef CONFIG_KERNEL_PAGE_TABLE_ISOLATION
> > +/*
> > + * Instead of one pgd, we aquire two pgds.  Being order-1, it is
> 
> "acquire"

Fixed. I also did a s/pgd/PGD

> Otherwise:
> 
> Reviewed-by: Borislav Petkov <bp@suse.de>

Thanks!

	Ingo

^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [patch 39/60] x86/mm/kpti: Share cpu_entry_area PMDs
  2017-12-04 14:07 ` [patch 39/60] x86/mm/kpti: Share cpu_entry_area PMDs Thomas Gleixner
@ 2017-12-06 21:18   ` Borislav Petkov
  0 siblings, 0 replies; 118+ messages in thread
From: Borislav Petkov @ 2017-12-06 21:18 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Greg KH, keescook, hughd, Brian Gerst,
	Josh Poimboeuf, Denys Vlasenko, Rik van Riel, Boris Ostrovsky,
	Juergen Gross, David Laight, Eduardo Valentin, aliguori,
	Will Deacon, daniel.gruss

On Mon, Dec 04, 2017 at 03:07:45PM +0100, Thomas Gleixner wrote:
> From: Andy Lutomirski <luto@kernel.org>
> 
> Share the FIX_USR_SHARED PMDs so the user space and kernel space page
> tables have the same PMD page.
> 
> [ tglx: Made it use the FIX_USR_SHARED range so later additions
>   	are covered automatically ]
> 
> Signed-off-by: Andy Lutomirski <luto@kernel.org>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> ---
>  arch/x86/mm/kpti.c |   18 ++++++++++++++++++
>  1 file changed, 18 insertions(+)
> 
> --- a/arch/x86/mm/kpti.c
> +++ b/arch/x86/mm/kpti.c
> @@ -167,6 +167,23 @@ kpti_clone_pmds(unsigned long start, uns
>  }
>  
>  /*
> + * Clone the populated PMDs of the user shared fixmaps into the user space
> + * visible page table.
> + */
> +static void __init kpti_clone_user_shared(void)
> +{
> +	unsigned long bot, top;
> +
> +	bot = __fix_to_virt(FIX_USR_SHARED_BOTTOM);
> +	top = __fix_to_virt(FIX_USR_SHARED_TOP) + PAGE_SIZE;
> +
> +	/* Top of the user shared block must be PMD-aligned. */
> +	WARN_ON(top & ~PMD_MASK);

Or

	WARN_ON(top & (PMD_SIZE - 1));


Otherwise:

Reviewed-by: Borislav Petkov <bp@suse.de>

-- 
Regards/Gruss,
    Boris.

SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
-- 

^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [patch 40/60] x86: PMD align entry text
  2017-12-04 14:07 ` [patch 40/60] x86: PMD align entry text Thomas Gleixner
@ 2017-12-07  8:07   ` Borislav Petkov
  0 siblings, 0 replies; 118+ messages in thread
From: Borislav Petkov @ 2017-12-07  8:07 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Greg KH, keescook, hughd, Brian Gerst,
	Josh Poimboeuf, Denys Vlasenko, Rik van Riel, Boris Ostrovsky,
	Juergen Gross, David Laight, Eduardo Valentin, aliguori,
	Will Deacon, daniel.gruss

On Mon, Dec 04, 2017 at 03:07:46PM +0100, Thomas Gleixner wrote:
> From: Thomas Gleixner <tglx@linutronix.de>
> 
> The (irq)entry text must be visible in the user space page tables. To allow
> simple PMD based sharing, make the entry text PMD aligned.
> 
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> 
> ---
>  arch/x86/kernel/vmlinux.lds.S |    8 ++++++++
>  1 file changed, 8 insertions(+)

Reviewed-by: Borislav Petkov <bp@suse.de>

-- 
Regards/Gruss,
    Boris.

SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
-- 

^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [patch 41/60] x86/mm/kpti: Share entry text PMD
  2017-12-04 14:07 ` [patch 41/60] x86/mm/kpti: Share entry text PMD Thomas Gleixner
@ 2017-12-07  8:24   ` Borislav Petkov
  0 siblings, 0 replies; 118+ messages in thread
From: Borislav Petkov @ 2017-12-07  8:24 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Greg KH, keescook, hughd, Brian Gerst,
	Josh Poimboeuf, Denys Vlasenko, Rik van Riel, Boris Ostrovsky,
	Juergen Gross, David Laight, Eduardo Valentin, aliguori,
	Will Deacon, daniel.gruss

On Mon, Dec 04, 2017 at 03:07:47PM +0100, Thomas Gleixner wrote:
> From: Thomas Gleixner <tglx@linutronix.de>
> 
> Share the entry text PMD of the kernel mapping with the user space
> mapping. If large pages are enabled this is a single PMD entry and at the
> point where it is copied into the user page table the RW bit has not been
> cleared yet. Clear it right away so the user space visible map becomes RX.
> 
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> ---
>  arch/x86/mm/kpti.c |   10 ++++++++++
>  1 file changed, 10 insertions(+)

Reviewed-by: Borislav Petkov <bp@suse.de>

-- 
Regards/Gruss,
    Boris.

SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
-- 

^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [patch 31/60] x86/mm/kpti: Add mapping helper functions
  2017-12-05 16:01   ` Borislav Petkov
@ 2017-12-07  8:33     ` Borislav Petkov
  0 siblings, 0 replies; 118+ messages in thread
From: Borislav Petkov @ 2017-12-07  8:33 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Greg KH, keescook, hughd, Brian Gerst,
	Josh Poimboeuf, Denys Vlasenko, Rik van Riel, Boris Ostrovsky,
	Juergen Gross, David Laight, Eduardo Valentin, aliguori,
	Will Deacon, daniel.gruss, Dave Hansen

On Tue, Dec 05, 2017 at 05:01:34PM +0100, Borislav Petkov wrote:
> > +/*
> > + * Take a PGD location (pgdp) and a pgd value that needs to be set there.
> > + * Populates the user and returns the resulting PGD that must be set in
> > + * the kernel copy of the page tables.
> > + */
> > +static inline pgd_t kpti_set_user_pgd(pgd_t *pgdp, pgd_t pgd)
> > +{
> 
> Btw, do we want to inline a relatively big function like that? I see at
> least 20-ish callsites of set_pgd() only.

Yap, looking at Hugh's version, he has moved it to kaiser.c. I guess in
our case, that should be arch/x86/mm/kpti.c respectively.

-- 
Regards/Gruss,
    Boris.

SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
-- 

^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [patch 43/60] x86/fixmap: Add debugstore entries to cpu_entry_area
  2017-12-04 14:07 ` [patch 43/60] x86/fixmap: Add debugstore entries to cpu_entry_area Thomas Gleixner
@ 2017-12-07  9:55   ` Borislav Petkov
  0 siblings, 0 replies; 118+ messages in thread
From: Borislav Petkov @ 2017-12-07  9:55 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Greg KH, keescook, hughd, Brian Gerst,
	Josh Poimboeuf, Denys Vlasenko, Rik van Riel, Boris Ostrovsky,
	Juergen Gross, David Laight, Eduardo Valentin, aliguori,
	Will Deacon, daniel.gruss

On Mon, Dec 04, 2017 at 03:07:49PM +0100, Thomas Gleixner wrote:
> From: Thomas Gleixner <tglx@linutronix.de>
> 
> The intel PEBS/BTS debug store is a design trainwreck as is expects virtual
> addresses which must be visible in any execution context.

Sure, what can possibly go wrong?! :-\

> So it is required to make these mappings visible to user space when kernel
> page table isolation is active.
> 
> Provide enough room for the buffer mappings in the cpu_entry_area so the
> buffers are available in the user space visible fixmap.
> 
> At the point where the kernel side fixmap is populated there is no buffer
> available yet, but the kernel PMD must be populated. To achieve this set
> the fixmap entries for these buffers to non present.
> 
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> ---
>  arch/x86/events/intel/ds.c      |    5 +++--
>  arch/x86/events/perf_event.h    |   21 ++-------------------
>  arch/x86/include/asm/fixmap.h   |   13 +++++++++++++
>  arch/x86/include/asm/intel_ds.h |   36 ++++++++++++++++++++++++++++++++++++
>  arch/x86/kernel/cpu/common.c    |   21 +++++++++++++++++++++
>  5 files changed, 75 insertions(+), 21 deletions(-)

...

> @@ -592,6 +603,16 @@ static void __init setup_cpu_entry_area(
>  	__set_fixmap(get_cpu_entry_area_index(cpu, entry_trampoline),
>  		     __pa_symbol(_entry_trampoline), PAGE_KERNEL_RX);
>  #endif
> +
> +#ifdef CONFIG_CPU_SUP_INTEL
> +	BUILD_BUG_ON(sizeof(struct debug_store) % PAGE_SIZE != 0);
> +	set_percpu_fixmap_pages(get_cpu_entry_area_index(cpu, cpu_debug_store),
> +				&per_cpu(cpu_debug_store, cpu),
> +				sizeof(struct debug_store) / PAGE_SIZE,
> +				PAGE_KERNEL);
> +	set_percpu_fixmap_ptes(get_cpu_entry_area_index(cpu, cpu_debug_buffers),
> +			       sizeof(struct debug_store_buffers) / PAGE_SIZE);
> +#endif

I guess we can do that additionally, so as not to setup the mappings on
distro kernels running !INTEL:

---
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 1364a8f378f8..5cfb68090a24 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -606,12 +606,16 @@ static void __init setup_cpu_entry_area(int cpu)
 
 #ifdef CONFIG_CPU_SUP_INTEL
 	BUILD_BUG_ON(sizeof(struct debug_store) % PAGE_SIZE != 0);
-	set_percpu_fixmap_pages(get_cpu_entry_area_index(cpu, cpu_debug_store),
-				&per_cpu(cpu_debug_store, cpu),
-				sizeof(struct debug_store) / PAGE_SIZE,
-				PAGE_KERNEL);
-	set_percpu_fixmap_ptes(get_cpu_entry_area_index(cpu, cpu_debug_buffers),
-			       sizeof(struct debug_store_buffers) / PAGE_SIZE);
+
+	if (boot_cpu_data.x86_vendor == X86_VENDOR_INTEL) {
+		set_percpu_fixmap_pages(get_cpu_entry_area_index(cpu, cpu_debug_store),
+					&per_cpu(cpu_debug_store, cpu),
+					sizeof(struct debug_store) / PAGE_SIZE,
+					PAGE_KERNEL);
+
+		set_percpu_fixmap_ptes(get_cpu_entry_area_index(cpu, cpu_debug_buffers),
+				       sizeof(struct debug_store_buffers) / PAGE_SIZE);
+	}
 #endif
 }

-- 
Regards/Gruss,
    Boris.

SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton, HRB 21284 (AG Nürnberg)
-- 

^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [patch 00/60] x86/kpti: Kernel Page Table Isolation (was KAISER)
  2017-12-04 14:07 [patch 00/60] x86/kpti: Kernel Page Table Isolation (was KAISER) Thomas Gleixner
                   ` (61 preceding siblings ...)
  2017-12-05 21:49 ` Andy Lutomirski
@ 2018-01-19 20:56 ` Andrew Morton
  2018-01-19 21:06   ` Dave Hansen
  2018-01-20 19:59   ` Thomas Gleixner
  62 siblings, 2 replies; 118+ messages in thread
From: Andrew Morton @ 2018-01-19 20:56 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: LKML, x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, keescook, hughd,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Rik van Riel,
	Boris Ostrovsky, Juergen Gross, David Laight, Eduardo Valentin,
	aliguori, Will Deacon, daniel.gruss, neil.berrington

Should KPTI have a MAINTAINERS entry?

Neil Berrington (cc'ed) is reporting "Double fault in load_new_mm_cr3 with KPTI
enabled" at https://bugzilla.kernel.org/show_bug.cgi?id=198517

^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [patch 00/60] x86/kpti: Kernel Page Table Isolation (was KAISER)
  2018-01-19 20:56 ` Andrew Morton
@ 2018-01-19 21:06   ` Dave Hansen
  2018-01-20 19:59   ` Thomas Gleixner
  1 sibling, 0 replies; 118+ messages in thread
From: Dave Hansen @ 2018-01-19 21:06 UTC (permalink / raw)
  To: Andrew Morton, Thomas Gleixner
  Cc: LKML, x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Borislav Petkov, Greg KH, keescook, hughd, Brian Gerst,
	Josh Poimboeuf, Denys Vlasenko, Rik van Riel, Boris Ostrovsky,
	Juergen Gross, David Laight, Eduardo Valentin, aliguori,
	Will Deacon, daniel.gruss, neil.berrington

On 01/19/2018 12:56 PM, Andrew Morton wrote:
> Should KPTI have a MAINTAINERS entry?
> 
> Neil Berrington (cc'ed) is reporting "Double fault in load_new_mm_cr3 with KPTI
> enabled" at https://bugzilla.kernel.org/show_bug.cgi?id=198517

Seems sane to me.  There have been quite a few patches I wish I'd been
cc'd on along the way.  I think Andy L in particular is probably way
under-cc'd on x86 stuff in general.

A better long-term solution (that others have suggested) is probably to
create an x86-discuss@vger.kernel.org or something.

^ permalink raw reply	[flat|nested] 118+ messages in thread

* Re: [patch 00/60] x86/kpti: Kernel Page Table Isolation (was KAISER)
  2018-01-19 20:56 ` Andrew Morton
  2018-01-19 21:06   ` Dave Hansen
@ 2018-01-20 19:59   ` Thomas Gleixner
  1 sibling, 0 replies; 118+ messages in thread
From: Thomas Gleixner @ 2018-01-20 19:59 UTC (permalink / raw)
  To: Andrew Morton
  Cc: LKML, x86, Linus Torvalds, Andy Lutomirsky, Peter Zijlstra,
	Dave Hansen, Borislav Petkov, Greg KH, keescook, hughd,
	Brian Gerst, Josh Poimboeuf, Denys Vlasenko, Rik van Riel,
	Boris Ostrovsky, Juergen Gross, David Laight, Eduardo Valentin,
	aliguori, Will Deacon, daniel.gruss, neil.berrington

On Fri, 19 Jan 2018, Andrew Morton wrote:

> Should KPTI have a MAINTAINERS entry?

I don't think so. It's all x86 core code which has a maintainer entry.

> Neil Berrington (cc'ed) is reporting "Double fault in load_new_mm_cr3 with KPTI
> enabled" at https://bugzilla.kernel.org/show_bug.cgi?id=198517

Neil, the screenshot shows that this is on a ubuntu 4.13 something
kernel. Can you reproduce on 4.14.14 or on Linus latest ?

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 118+ messages in thread

end of thread, back to index

Thread overview: 118+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-12-04 14:07 [patch 00/60] x86/kpti: Kernel Page Table Isolation (was KAISER) Thomas Gleixner
2017-12-04 14:07 ` [patch 01/60] x86/entry/64/paravirt: Use paravirt-safe macro to access eflags Thomas Gleixner
2017-12-05 12:17   ` Juergen Gross
2017-12-04 14:07 ` [patch 02/60] x86/unwinder/orc: Dont bail on stack overflow Thomas Gleixner
2017-12-04 20:31   ` Andy Lutomirski
2017-12-04 21:31     ` Thomas Gleixner
2017-12-04 14:07 ` [patch 03/60] x86/unwinder: Handle stack overflows more gracefully Thomas Gleixner
2017-12-04 14:07 ` [patch 04/60] x86/irq: Remove an old outdated comment about context tracking races Thomas Gleixner
2017-12-04 14:07 ` [patch 05/60] x86/irq/64: Print the offending IP in the stack overflow warning Thomas Gleixner
2017-12-04 14:07 ` [patch 06/60] x86/entry/64: Allocate and enable the SYSENTER stack Thomas Gleixner
2017-12-04 14:07 ` [patch 07/60] x86/dumpstack: Add get_stack_info() support for " Thomas Gleixner
2017-12-04 14:07 ` [patch 08/60] x86/entry/gdt: Put per-CPU GDT remaps in ascending order Thomas Gleixner
2017-12-04 14:07 ` [patch 09/60] x86/mm/fixmap: Generalize the GDT fixmap mechanism, introduce struct cpu_entry_area Thomas Gleixner
2017-12-04 14:07 ` [patch 10/60] x86/kasan/64: Teach KASAN about the cpu_entry_area Thomas Gleixner
2017-12-04 14:07 ` [patch 11/60] x86/entry: Fix assumptions that the HW TSS is at the beginning of cpu_tss Thomas Gleixner
2017-12-04 14:07 ` [patch 12/60] x86/dumpstack: Handle stack overflow on all stacks Thomas Gleixner
2017-12-04 14:07 ` [patch 13/60] x86/entry: Move SYSENTER_stack to the beginning of struct tss_struct Thomas Gleixner
2017-12-04 14:07 ` [patch 14/60] x86/entry: Remap the TSS into the CPU entry area Thomas Gleixner
2017-12-04 18:20   ` Borislav Petkov
2017-12-04 14:07 ` [patch 15/60] x86/entry/64: Separate cpu_current_top_of_stack from TSS.sp0 Thomas Gleixner
2017-12-04 14:07 ` [patch 16/60] x86/espfix/64: Stop assuming that pt_regs is on the entry stack Thomas Gleixner
2017-12-04 14:07 ` [patch 17/60] x86/entry/64: Use a per-CPU trampoline stack for IDT entries Thomas Gleixner
2017-12-04 14:07 ` [patch 18/60] x86/entry/64: Return to userspace from the trampoline stack Thomas Gleixner
2017-12-04 14:07 ` [patch 19/60] x86/entry/64: Create a per-CPU SYSCALL entry trampoline Thomas Gleixner
2017-12-04 22:30   ` Andy Lutomirski
2017-12-04 14:07 ` [patch 20/60] x86/entry/64: Move the IST stacks into struct cpu_entry_area Thomas Gleixner
2017-12-04 14:07 ` [patch 21/60] x86/entry/64: Remove the SYSENTER stack canary Thomas Gleixner
2017-12-04 14:07 ` [patch 22/60] x86/entry: Clean up the SYSENTER_stack code Thomas Gleixner
2017-12-04 19:41   ` Borislav Petkov
2017-12-04 14:07 ` [patch 23/60] x86/entry/64: Make cpu_entry_area.tss read-only Thomas Gleixner
2017-12-04 20:25   ` Borislav Petkov
2017-12-04 14:07 ` [patch 24/60] x86/paravirt: Dont patch flush_tlb_single Thomas Gleixner
2017-12-05 12:18   ` Juergen Gross
2017-12-04 14:07 ` [patch 25/60] x86/paravirt: Provide a way to check for hypervisors Thomas Gleixner
2017-12-05 12:19   ` Juergen Gross
2017-12-04 14:07 ` [patch 26/60] x86/cpufeature: Make cpu bugs sticky Thomas Gleixner
2017-12-04 22:39   ` Borislav Petkov
2017-12-04 14:07 ` [patch 27/60] x86/cpufeatures: Add X86_BUG_CPU_INSECURE Thomas Gleixner
2017-12-04 23:18   ` Borislav Petkov
2017-12-04 14:07 ` [patch 28/60] x86/mm/kpti: Disable global pages if KERNEL_PAGE_TABLE_ISOLATION=y Thomas Gleixner
2017-12-05 14:34   ` Borislav Petkov
2017-12-04 14:07 ` [patch 29/60] x86/mm/kpti: Prepare the x86/entry assembly code for entry/exit CR3 switching Thomas Gleixner
2017-12-04 14:07 ` [patch 30/60] x86/mm/kpti: Add infrastructure for page table isolation Thomas Gleixner
2017-12-05 15:20   ` Borislav Petkov
2017-12-04 14:07 ` [patch 31/60] x86/mm/kpti: Add mapping helper functions Thomas Gleixner
2017-12-04 22:27   ` Andy Lutomirski
2017-12-05 16:01   ` Borislav Petkov
2017-12-07  8:33     ` Borislav Petkov
2017-12-04 14:07 ` [patch 32/60] x86/mm/kpti: Allow NX poison to be set in p4d/pgd Thomas Gleixner
2017-12-05 17:09   ` Borislav Petkov
2017-12-04 14:07 ` [patch 33/60] x86/mm/kpti: Allocate a separate user PGD Thomas Gleixner
2017-12-05 18:33   ` Borislav Petkov
2017-12-06 20:56     ` Ingo Molnar
2017-12-04 14:07 ` [patch 34/60] x86/mm/kpti: Populate " Thomas Gleixner
2017-12-05 19:17   ` Borislav Petkov
2017-12-04 14:07 ` [patch 35/60] x86/espfix: Ensure that ESPFIX is visible in " Thomas Gleixner
2017-12-04 22:28   ` Andy Lutomirski
2017-12-04 14:07 ` [patch 36/60] x86/mm/kpti: Add functions to clone kernel PMDs Thomas Gleixner
2017-12-06 15:39   ` Borislav Petkov
2017-12-04 14:07 ` [patch 37/60] x86mm//kpti: Force entry through trampoline when KPTI active Thomas Gleixner
2017-12-06 16:01   ` Borislav Petkov
2017-12-04 14:07 ` [patch 38/60] x86/fixmap: Move cpu entry area into a separate PMD Thomas Gleixner
2017-12-06 18:57   ` Borislav Petkov
2017-12-04 14:07 ` [patch 39/60] x86/mm/kpti: Share cpu_entry_area PMDs Thomas Gleixner
2017-12-06 21:18   ` Borislav Petkov
2017-12-04 14:07 ` [patch 40/60] x86: PMD align entry text Thomas Gleixner
2017-12-07  8:07   ` Borislav Petkov
2017-12-04 14:07 ` [patch 41/60] x86/mm/kpti: Share entry text PMD Thomas Gleixner
2017-12-07  8:24   ` Borislav Petkov
2017-12-04 14:07 ` [patch 42/60] x86/fixmap: Move IDT fixmap into the cpu_entry_area range Thomas Gleixner
2017-12-04 14:07 ` [patch 43/60] x86/fixmap: Add debugstore entries to cpu_entry_area Thomas Gleixner
2017-12-07  9:55   ` Borislav Petkov
2017-12-04 14:07 ` [patch 44/60] x86/events/intel/ds: Map debug buffers in fixmap Thomas Gleixner
2017-12-04 14:07 ` [patch 45/60] x86/fixmap: Add ldt entries to user shared fixmap Thomas Gleixner
2017-12-04 14:07 ` [patch 46/60] x86/ldt: Rename ldt_struct->entries member Thomas Gleixner
2017-12-04 14:07 ` [patch 47/60] x86/ldt: Map LDT entries into fixmap Thomas Gleixner
2017-12-04 22:33   ` Andy Lutomirski
2017-12-04 22:51     ` Thomas Gleixner
2017-12-04 14:07 ` [patch 48/60] x86/mm: Move the CR3 construction functions to tlbflush.h Thomas Gleixner
2017-12-04 14:07 ` [patch 49/60] x86/mm: Remove hard-coded ASID limit checks Thomas Gleixner
2017-12-04 14:07 ` [patch 50/60] x86/mm: Put MMU to hardware ASID translation in one place Thomas Gleixner
2017-12-04 14:07 ` [patch 51/60] x86/mm: Allow flushing for future ASID switches Thomas Gleixner
2017-12-04 22:22   ` Andy Lutomirski
2017-12-04 22:34     ` Dave Hansen
2017-12-04 22:36       ` Andy Lutomirski
2017-12-04 22:47     ` Peter Zijlstra
2017-12-04 22:54       ` Andy Lutomirski
2017-12-04 23:06         ` Peter Zijlstra
2017-12-04 14:07 ` [patch 52/60] x86/mm: Abstract switching CR3 Thomas Gleixner
2017-12-04 14:07 ` [patch 53/60] x86/mm: Use/Fix PCID to optimize user/kernel switches Thomas Gleixner
2017-12-05 21:46   ` Andy Lutomirski
2017-12-05 22:05     ` Peter Zijlstra
2017-12-05 22:08       ` Dave Hansen
2017-12-04 14:08 ` [patch 54/60] x86/mm: Optimize RESTORE_CR3 Thomas Gleixner
2017-12-04 14:08 ` [patch 55/60] x86/mm: Use INVPCID for __native_flush_tlb_single() Thomas Gleixner
2017-12-04 22:25   ` Andy Lutomirski
2017-12-04 22:51     ` Peter Zijlstra
2017-12-05 13:51       ` Dave Hansen
2017-12-05 14:08         ` Peter Zijlstra
2017-12-04 14:08 ` [patch 56/60] x86/mm/kpti: Disable native VSYSCALL Thomas Gleixner
2017-12-04 22:33   ` Andy Lutomirski
2017-12-04 14:08 ` [patch 57/60] x86/mm/kpti: Add Kconfig Thomas Gleixner
2017-12-04 16:54   ` Andy Lutomirski
2017-12-04 16:57     ` Thomas Gleixner
2017-12-05  9:34       ` Thomas Gleixner
2017-12-04 14:08 ` [patch 58/60] x86/mm/debug_pagetables: Add page table directory Thomas Gleixner
2017-12-04 14:08 ` [patch 59/60] x86/mm/dump_pagetables: Check user space page table for WX pages Thomas Gleixner
2017-12-04 14:08 ` [patch 60/60] x86/mm/debug_pagetables: Allow dumping current pagetables Thomas Gleixner
2017-12-04 18:02 ` [patch 00/60] x86/kpti: Kernel Page Table Isolation (was KAISER) Linus Torvalds
2017-12-04 18:18   ` Thomas Gleixner
2017-12-04 18:21     ` Boris Ostrovsky
2017-12-04 18:28     ` Linus Torvalds
2017-12-05 21:49 ` Andy Lutomirski
2017-12-05 21:57   ` Dave Hansen
2017-12-05 23:19     ` Andy Lutomirski
2018-01-19 20:56 ` Andrew Morton
2018-01-19 21:06   ` Dave Hansen
2018-01-20 19:59   ` Thomas Gleixner

LKML Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/lkml/0 lkml/git/0.git
	git clone --mirror https://lore.kernel.org/lkml/1 lkml/git/1.git
	git clone --mirror https://lore.kernel.org/lkml/2 lkml/git/2.git
	git clone --mirror https://lore.kernel.org/lkml/3 lkml/git/3.git
	git clone --mirror https://lore.kernel.org/lkml/4 lkml/git/4.git
	git clone --mirror https://lore.kernel.org/lkml/5 lkml/git/5.git
	git clone --mirror https://lore.kernel.org/lkml/6 lkml/git/6.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 lkml lkml/ https://lore.kernel.org/lkml \
		linux-kernel@vger.kernel.org linux-kernel@archiver.kernel.org
	public-inbox-index lkml


Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.linux-kernel


AGPL code for this site: git clone https://public-inbox.org/ public-inbox