linux-kernel.vger.kernel.org archive mirror
* [PATCH 00/23] KAISER: unmap most of the kernel from userspace page tables
@ 2017-10-31 22:31 Dave Hansen
  2017-10-31 22:31 ` [PATCH 01/23] x86, kaiser: prepare assembly for entry/exit CR3 switching Dave Hansen
                   ` (26 more replies)
  0 siblings, 27 replies; 102+ messages in thread
From: Dave Hansen @ 2017-10-31 22:31 UTC (permalink / raw)
  To: linux-kernel; +Cc: linux-mm, dave.hansen

tl;dr:

KAISER makes it harder to defeat KASLR, but makes syscalls and
interrupts slower.  These patches are based on work from a team at
Graz University of Technology posted here[1].  The major addition is
support for Intel PCIDs, which builds on top of Andy Lutomirski's PCID
work merged for 4.14.  PCIDs make KAISER's overhead very reasonable
for a wide variety of use cases.

Full Description:

KAISER is a countermeasure against attacks on kernel address
information.  There are at least three existing, published,
approaches using the shared user/kernel mapping and hardware features
to defeat KASLR.  One approach referenced in the paper locates the
kernel by observing differences in page fault timing between
present-but-inaccessible kernel pages and non-present pages.

KAISER addresses this by unmapping (most of) the kernel when
userspace runs.  It leaves the existing page tables largely alone and
refers to them as "kernel page tables".  For running userspace, a new
"shadow" copy of the page tables is allocated for each process.  The
shadow page tables map all the same user memory as the "kernel" copy,
but only map a minimal set of kernel memory.
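
As a rough illustration (not code from this series; the real helpers,
native_get_normal_pgd()/native_get_shadow_pgd(), appear in a later
patch), the two top-level tables live in a single 8k, 8k-aligned
allocation, so moving between them is just flipping bit 12
(PAGE_SHIFT) of the address:

	#include <stdint.h>

	#define KAISER_PGTABLE_SWITCH_BIT	12	/* PAGE_SHIFT */

	/* kernel copy lives in the first 4k of the 8k allocation */
	static inline uint64_t kaiser_kernel_pgd(uint64_t pgd)
	{
		return pgd & ~(1ull << KAISER_PGTABLE_SWITCH_BIT);
	}

	/* user (shadow) copy lives in the last 4k */
	static inline uint64_t kaiser_user_pgd(uint64_t pgd)
	{
		return pgd | (1ull << KAISER_PGTABLE_SWITCH_BIT);
	}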

When we enter the kernel via syscalls, interrupts or exceptions,
page tables are switched to the full "kernel" copy.  When the system
switches back to user mode, the "shadow" copy is used.  Process
Context IDentifiers (PCIDs) are used to ensure that the TLB is not
flushed when switching between page tables, which makes syscalls
roughly 2x faster than without them.  PCIDs are usable on Haswell and
newer CPUs (the ones with "v4", also known as fourth-generation Core).
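
For reference only (this shows the architectural mechanism, not the
code in this series): with CR4.PCIDE enabled, the low 12 bits of the
value written to CR3 select a PCID, and setting bit 63 of that value
asks the CPU not to flush the TLB entries tagged with that PCID.  The
helper name below is made up:

	#include <stdint.h>

	#define CR3_PCID_MASK	0xfffull
	#define CR3_NOFLUSH	(1ull << 63)

	/* compose a CR3 value that switches page tables without a TLB flush */
	static inline uint64_t build_cr3_noflush(uint64_t pgd_pa, uint16_t pcid)
	{
		return (pgd_pa & ~CR3_PCID_MASK) |
		       (pcid & CR3_PCID_MASK) | CR3_NOFLUSH;
	}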

The minimal kernel page tables try to map only what is needed to
enter/exit the kernel such as the entry/exit functions, interrupt
descriptors (IDT) and the kernel stacks.  This minimal set of data
can still reveal the kernel's ASLR base address.  But, this minimal
kernel data is all trusted, which makes it harder to exploit than
data in the kernel direct map, which contains loads of
user-controlled data.

KAISER will affect performance for anything that does system calls or
interrupts: everything.  Just the new instructions (CR3 manipulation)
add a few hundred cycles to a syscall or interrupt.  Most workloads
that we have run show single-digit regressions.  5% is a good round
number for what is typical.  The worst we have seen is a roughly 30%
regression on a loopback networking test that did a ton of syscalls
and context switches.  More details about possible performance
impacts are in the new Documentation/ file.

This code is based on a version I downloaded from
https://github.com/IAIK/KAISER.  It has been heavily modified.

The approach is described in detail in a paper[2].  However, there is
some incorrect information in the paper, both about how Linux and
how the hardware work.  For instance, I do not share the opinion that
KAISER has "runtime overhead of only 0.28%".  Please rely on this
patch series as the canonical source of information about this
submission.

Here is one example of how the kernel image grows with CONFIG_KAISER
on and off.  Most of the size increase is presumably from additional
alignment requirements for mapping entry/exit code and structures.

    text    data     bss      dec filename
11786064 7356724 2928640 22071428 vmlinux-nokaiser
11798203 7371704 2928640 22098547 vmlinux-kaiser
  +12139  +14980       0   +27119

To give folks an idea what the performance impact is like, I took
the following test and ran it single-threaded:

	https://github.com/antonblanchard/will-it-scale/blob/master/tests/lseek1.c

It's a pretty quick syscall so this shows how much KAISER slows
down syscalls (and how much PCIDs help).  The units here are
lseeks/second:

        no kaiser: 5.2M
    kaiser+  pcid: 3.0M
    kaiser+nopcid: 2.2M

"nopcid" is literally with the "nopcid" command-line option which
turns PCIDs off entirely.
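
For the curious, a minimal stand-in for that test looks roughly like
this (the real harness is will-it-scale's lseek1.c linked above; this
is only an approximation of the measured loop):

	#include <fcntl.h>
	#include <unistd.h>

	int main(void)
	{
		int fd = open("/tmp/lseek-test", O_RDWR | O_CREAT, 0600);
		unsigned long i;

		/*
		 * Each iteration is one quick syscall, i.e. one round
		 * trip through the entry/exit (and CR3-switching) path.
		 */
		for (i = 0; i < 100000000UL; i++)
			lseek(fd, 0, SEEK_SET);

		close(fd);
		return 0;
	}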

Thanks to:
The original KAISER team at Graz University of Technology.
Andy Lutomirski for all the help with the entry code.
Kirill Shutemov for a helpful review of the code.

1. https://github.com/IAIK/KAISER
2. https://gruss.cc/files/kaiser.pdf

--

The code is available here:

	https://git.kernel.org/pub/scm/linux/kernel/git/daveh/x86-kaiser.git/

 Documentation/x86/kaiser.txt                | 128 ++++++
 arch/x86/Kconfig                            |   4 +
 arch/x86/entry/calling.h                    |  77 ++++
 arch/x86/entry/entry_64.S                   |  34 +-
 arch/x86/entry/entry_64_compat.S            |  13 +
 arch/x86/events/intel/ds.c                  |  57 ++-
 arch/x86/include/asm/cpufeatures.h          |   1 +
 arch/x86/include/asm/desc.h                 |   2 +-
 arch/x86/include/asm/hw_irq.h               |   2 +-
 arch/x86/include/asm/kaiser.h               |  59 +++
 arch/x86/include/asm/mmu_context.h          |  29 +-
 arch/x86/include/asm/pgalloc.h              |  32 +-
 arch/x86/include/asm/pgtable.h              |  20 +-
 arch/x86/include/asm/pgtable_64.h           | 121 ++++++
 arch/x86/include/asm/pgtable_types.h        |  16 +
 arch/x86/include/asm/processor.h            |   2 +-
 arch/x86/include/asm/tlbflush.h             | 230 +++++++++--
 arch/x86/include/uapi/asm/processor-flags.h |   3 +-
 arch/x86/kernel/cpu/common.c                |  21 +-
 arch/x86/kernel/espfix_64.c                 |  22 +-
 arch/x86/kernel/head_64.S                   |  30 +-
 arch/x86/kernel/irqinit.c                   |   2 +-
 arch/x86/kernel/ldt.c                       |  25 +-
 arch/x86/kernel/process.c                   |   2 +-
 arch/x86/kernel/process_64.c                |   2 +-
 arch/x86/kvm/x86.c                          |   3 +-
 arch/x86/mm/Makefile                        |   1 +
 arch/x86/mm/init.c                          |  75 ++--
 arch/x86/mm/kaiser.c                        | 416 ++++++++++++++++++++
 arch/x86/mm/pageattr.c                      |  63 ++-
 arch/x86/mm/pgtable.c                       |  16 +-
 arch/x86/mm/tlb.c                           | 105 ++++-
 include/asm-generic/vmlinux.lds.h           |  17 +
 include/linux/kaiser.h                      |  34 ++
 include/linux/percpu-defs.h                 |  32 +-
 init/main.c                                 |   2 +
 kernel/fork.c                               |   6 +
 security/Kconfig                            |  10 +
 38 files changed, 1565 insertions(+), 149 deletions(-)


* [PATCH 01/23] x86, kaiser: prepare assembly for entry/exit CR3 switching
  2017-10-31 22:31 [PATCH 00/23] KAISER: unmap most of the kernel from userspace page tables Dave Hansen
@ 2017-10-31 22:31 ` Dave Hansen
  2017-11-01  0:43   ` Brian Gerst
                     ` (2 more replies)
  2017-10-31 22:31 ` [PATCH 02/23] x86, kaiser: do not set _PAGE_USER for init_mm page tables Dave Hansen
                   ` (25 subsequent siblings)
  26 siblings, 3 replies; 102+ messages in thread
From: Dave Hansen @ 2017-10-31 22:31 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, dave.hansen, moritz.lipp, daniel.gruss,
	michael.schwarz, luto, torvalds, keescook, hughd, x86


This is largely code from Andy Lutomirski.  I fixed a few bugs
in it, and added a few SWITCH_TO_* spots.

KAISER needs to switch to a different CR3 value when it enters
the kernel and switch back when it exits.  This essentially
needs to be done before we leave assembly code.

This is extra challenging because the context in which we have to
make this switch is tricky: the registers we are allowed to
clobber can vary.  It's also hard to store things on the stack
because there are already things on it with an established ABI
(ptregs) or the stack is unsafe to use at all.

This patch establishes the macros used to switch to the user and
kernel CR3 values, but does not yet actually change which page
tables are in use: the ADJUST_* macros are left empty for now.
The code will, however, clobber the registers that it says it will
and does perform *writes* to CR3.  So, this patch by itself tests
that the registers we are clobbering and restoring from are OK,
and that things like our stack manipulation are in safe places.

In other words, if you bisect to here, this *does* introduce
changes that can break things.

Interactions with SWAPGS: previous versions of the KAISER code
relied on having per-cpu scratch space so that we had a register
to clobber for our CR3 MOV.  The %GS register is what we use
to index into our per-cpu space, so SWAPGS *had* to be done
before the CR3 switch.  That scratch space is gone now, but we
still keep the semantic that SWAPGS must be done before the
CR3 MOV.  This is good to keep because it is not that hard to
do and it allows us to do things like add per-cpu debugging
information to help us figure out what goes wrong sometimes.

What this does in the NMI code is worth pointing out.  NMIs
can interrupt *any* context and they can also be nested with
NMIs interrupting other NMIs.  The comments below
".Lnmi_from_kernel" explain the format of the stack that we
have to deal with in this situation.  Changing the format of
this stack is not a fun exercise: I tried.  Instead of
storing the old CR3 value on the stack, we depend on the
*regular* register save/restore mechanism and then use %r14
to keep CR3 during the NMI.  It will not be clobbered by the
C NMI handlers that get called.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Moritz Lipp <moritz.lipp@iaik.tugraz.at>
Cc: Daniel Gruss <daniel.gruss@iaik.tugraz.at>
Cc: Michael Schwarz <michael.schwarz@iaik.tugraz.at>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Kees Cook <keescook@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: x86@kernel.org
---

 b/arch/x86/entry/calling.h         |   40 +++++++++++++++++++++++++++++++++++++
 b/arch/x86/entry/entry_64.S        |   33 +++++++++++++++++++++++++-----
 b/arch/x86/entry/entry_64_compat.S |   13 ++++++++++++
 3 files changed, 81 insertions(+), 5 deletions(-)

diff -puN arch/x86/entry/calling.h~kaiser-luto-base-cr3-work arch/x86/entry/calling.h
--- a/arch/x86/entry/calling.h~kaiser-luto-base-cr3-work	2017-10-31 15:03:48.105007253 -0700
+++ b/arch/x86/entry/calling.h	2017-10-31 15:03:48.113007631 -0700
@@ -1,5 +1,6 @@
 #include <linux/jump_label.h>
 #include <asm/unwind_hints.h>
+#include <asm/cpufeatures.h>
 
 /*
 
@@ -217,6 +218,45 @@ For 32-bit we have the following convent
 #endif
 .endm
 
+.macro ADJUST_KERNEL_CR3 reg:req
+.endm
+
+.macro ADJUST_USER_CR3 reg:req
+.endm
+
+.macro SWITCH_TO_KERNEL_CR3 scratch_reg:req
+	mov	%cr3, \scratch_reg
+	ADJUST_KERNEL_CR3 \scratch_reg
+	mov	\scratch_reg, %cr3
+.endm
+
+.macro SWITCH_TO_USER_CR3 scratch_reg:req
+	mov	%cr3, \scratch_reg
+	ADJUST_USER_CR3 \scratch_reg
+	mov	\scratch_reg, %cr3
+.endm
+
+.macro SAVE_AND_SWITCH_TO_KERNEL_CR3 scratch_reg:req save_reg:req
+	movq	%cr3, %r\scratch_reg
+	movq	%r\scratch_reg, \save_reg
+	/*
+	 * Just stick a random bit in here that never gets set.  Fixed
+	 * up in real KAISER patches in a moment.
+	 */
+	bt	$63, %r\scratch_reg
+	jz	.Ldone_\@
+
+	ADJUST_KERNEL_CR3 %r\scratch_reg
+	movq	%r\scratch_reg, %cr3
+
+.Ldone_\@:
+.endm
+
+.macro RESTORE_CR3 save_reg:req
+	/* optimize this */
+	movq	\save_reg, %cr3
+.endm
+
 #endif /* CONFIG_X86_64 */
 
 /*
diff -puN arch/x86/entry/entry_64_compat.S~kaiser-luto-base-cr3-work arch/x86/entry/entry_64_compat.S
--- a/arch/x86/entry/entry_64_compat.S~kaiser-luto-base-cr3-work	2017-10-31 15:03:48.107007348 -0700
+++ b/arch/x86/entry/entry_64_compat.S	2017-10-31 15:03:48.113007631 -0700
@@ -48,8 +48,13 @@
 ENTRY(entry_SYSENTER_compat)
 	/* Interrupts are off on entry. */
 	SWAPGS_UNSAFE_STACK
+
 	movq	PER_CPU_VAR(cpu_current_top_of_stack), %rsp
 
+	pushq	%rdi
+	SWITCH_TO_KERNEL_CR3 scratch_reg=%rdi
+	popq	%rdi
+
 	/*
 	 * User tracing code (ptrace or signal handlers) might assume that
 	 * the saved RAX contains a 32-bit number when we're invoking a 32-bit
@@ -91,6 +96,9 @@ ENTRY(entry_SYSENTER_compat)
 	pushq   $0			/* pt_regs->r15 = 0 */
 	cld
 
+	pushq	%rdi
+	SWITCH_TO_KERNEL_CR3 scratch_reg=%rdi
+	popq	%rdi
 	/*
 	 * SYSENTER doesn't filter flags, so we need to clear NT and AC
 	 * ourselves.  To save a few cycles, we can check whether
@@ -214,6 +222,8 @@ GLOBAL(entry_SYSCALL_compat_after_hwfram
 	pushq   $0			/* pt_regs->r14 = 0 */
 	pushq   $0			/* pt_regs->r15 = 0 */
 
+	SWITCH_TO_KERNEL_CR3 scratch_reg=%rdi
+
 	/*
 	 * User mode is traced as though IRQs are on, and SYSENTER
 	 * turned them off.
@@ -240,6 +250,7 @@ sysret32_from_system_call:
 	popq	%rsi			/* pt_regs->si */
 	popq	%rdi			/* pt_regs->di */
 
+	SWITCH_TO_USER_CR3 scratch_reg=%r8
         /*
          * USERGS_SYSRET32 does:
          *  GSBASE = user's GS base
@@ -324,6 +335,7 @@ ENTRY(entry_INT80_compat)
 	pushq   %r15                    /* pt_regs->r15 */
 	cld
 
+	SWITCH_TO_KERNEL_CR3 scratch_reg=%r11
 	/*
 	 * User mode is traced as though IRQs are on, and the interrupt
 	 * gate turned them off.
@@ -337,6 +349,7 @@ ENTRY(entry_INT80_compat)
 	/* Go back to user mode. */
 	TRACE_IRQS_ON
 	SWAPGS
+	SWITCH_TO_USER_CR3 scratch_reg=%r11
 	jmp	restore_regs_and_iret
 END(entry_INT80_compat)
 
diff -puN arch/x86/entry/entry_64.S~kaiser-luto-base-cr3-work arch/x86/entry/entry_64.S
--- a/arch/x86/entry/entry_64.S~kaiser-luto-base-cr3-work	2017-10-31 15:03:48.109007442 -0700
+++ b/arch/x86/entry/entry_64.S	2017-10-31 15:03:48.115007726 -0700
@@ -147,8 +147,6 @@ ENTRY(entry_SYSCALL_64)
 	movq	%rsp, PER_CPU_VAR(rsp_scratch)
 	movq	PER_CPU_VAR(cpu_current_top_of_stack), %rsp
 
-	TRACE_IRQS_OFF
-
 	/* Construct struct pt_regs on stack */
 	pushq	$__USER_DS			/* pt_regs->ss */
 	pushq	PER_CPU_VAR(rsp_scratch)	/* pt_regs->sp */
@@ -169,6 +167,13 @@ GLOBAL(entry_SYSCALL_64_after_hwframe)
 	sub	$(6*8), %rsp			/* pt_regs->bp, bx, r12-15 not saved */
 	UNWIND_HINT_REGS extra=0
 
+	/* NB: right here, all regs except r11 are live. */
+
+	SWITCH_TO_KERNEL_CR3 scratch_reg=%r11
+
+	/* Must wait until we have the kernel CR3 to call C functions: */
+	TRACE_IRQS_OFF
+
 	/*
 	 * If we need to do entry work or if we guess we'll need to do
 	 * exit work, go straight to the slow path.
@@ -220,6 +225,7 @@ entry_SYSCALL_64_fastpath:
 	TRACE_IRQS_ON		/* user mode is traced as IRQs on */
 	movq	RIP(%rsp), %rcx
 	movq	EFLAGS(%rsp), %r11
+	SWITCH_TO_USER_CR3 scratch_reg=%rdi
 	RESTORE_C_REGS_EXCEPT_RCX_R11
 	movq	RSP(%rsp), %rsp
 	UNWIND_HINT_EMPTY
@@ -313,6 +319,7 @@ return_from_SYSCALL_64:
 	 * perf profiles. Nothing jumps here.
 	 */
 syscall_return_via_sysret:
+	SWITCH_TO_USER_CR3 scratch_reg=%rdi
 	/* rcx and r11 are already restored (see code above) */
 	RESTORE_C_REGS_EXCEPT_RCX_R11
 	movq	RSP(%rsp), %rsp
@@ -320,6 +327,7 @@ syscall_return_via_sysret:
 	USERGS_SYSRET64
 
 opportunistic_sysret_failed:
+	SWITCH_TO_USER_CR3 scratch_reg=%rdi
 	SWAPGS
 	jmp	restore_c_regs_and_iret
 END(entry_SYSCALL_64)
@@ -422,6 +430,7 @@ ENTRY(ret_from_fork)
 	movq	%rsp, %rdi
 	call	syscall_return_slowpath	/* returns with IRQs disabled */
 	TRACE_IRQS_ON			/* user mode is traced as IRQS on */
+	SWITCH_TO_USER_CR3 scratch_reg=%rdi
 	SWAPGS
 	jmp	restore_regs_and_iret
 
@@ -611,6 +620,7 @@ GLOBAL(retint_user)
 	mov	%rsp,%rdi
 	call	prepare_exit_to_usermode
 	TRACE_IRQS_IRETQ
+	SWITCH_TO_USER_CR3 scratch_reg=%rdi
 	SWAPGS
 	jmp	restore_regs_and_iret
 
@@ -1091,7 +1101,11 @@ ENTRY(paranoid_entry)
 	js	1f				/* negative -> in kernel */
 	SWAPGS
 	xorl	%ebx, %ebx
-1:	ret
+
+1:
+	SAVE_AND_SWITCH_TO_KERNEL_CR3 scratch_reg=ax save_reg=%r14
+
+	ret
 END(paranoid_entry)
 
 /*
@@ -1118,6 +1132,7 @@ ENTRY(paranoid_exit)
 paranoid_exit_no_swapgs:
 	TRACE_IRQS_IRETQ_DEBUG
 paranoid_exit_restore:
+	RESTORE_CR3	%r14
 	RESTORE_EXTRA_REGS
 	RESTORE_C_REGS
 	REMOVE_PT_GPREGS_FROM_STACK 8
@@ -1144,6 +1159,9 @@ ENTRY(error_entry)
 	 */
 	SWAPGS
 
+	/* We have user CR3.  Change to kernel CR3. */
+	SWITCH_TO_KERNEL_CR3 scratch_reg=%rax
+
 .Lerror_entry_from_usermode_after_swapgs:
 	/*
 	 * We need to tell lockdep that IRQs are off.  We can't do this until
@@ -1190,9 +1208,10 @@ ENTRY(error_entry)
 
 .Lerror_bad_iret:
 	/*
-	 * We came from an IRET to user mode, so we have user gsbase.
-	 * Switch to kernel gsbase:
+	 * We came from an IRET to user mode, so we have user
+	 * gsbase and CR3.  Switch to kernel gsbase and CR3:
 	 */
+	SWITCH_TO_KERNEL_CR3 scratch_reg=%rax
 	SWAPGS
 
 	/*
@@ -1313,6 +1332,7 @@ ENTRY(nmi)
 	UNWIND_HINT_REGS
 	ENCODE_FRAME_POINTER
 
+	SWITCH_TO_KERNEL_CR3 scratch_reg=%rdi
 	/*
 	 * At this point we no longer need to worry about stack damage
 	 * due to nesting -- we're on the normal thread stack and we're
@@ -1328,6 +1348,7 @@ ENTRY(nmi)
 	 * work, because we don't want to enable interrupts.
 	 */
 	SWAPGS
+	SWITCH_TO_USER_CR3 scratch_reg=%rdi
 	jmp	restore_regs_and_iret
 
 .Lnmi_from_kernel:
@@ -1538,6 +1559,8 @@ end_repeat_nmi:
 	movq	$-1, %rsi
 	call	do_nmi
 
+	RESTORE_CR3 save_reg=%r14
+
 	testl	%ebx, %ebx			/* swapgs needed? */
 	jnz	nmi_restore
 nmi_swapgs:
_


* [PATCH 02/23] x86, kaiser: do not set _PAGE_USER for init_mm page tables
  2017-10-31 22:31 [PATCH 00/23] KAISER: unmap most of the kernel from userspace page tables Dave Hansen
  2017-10-31 22:31 ` [PATCH 01/23] x86, kaiser: prepare assembly for entry/exit CR3 switching Dave Hansen
@ 2017-10-31 22:31 ` Dave Hansen
  2017-11-01 21:11   ` Thomas Gleixner
  2017-10-31 22:31 ` [PATCH 03/23] x86, kaiser: disable global pages Dave Hansen
                   ` (24 subsequent siblings)
  26 siblings, 1 reply; 102+ messages in thread
From: Dave Hansen @ 2017-10-31 22:31 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, dave.hansen, moritz.lipp, daniel.gruss,
	michael.schwarz, luto, torvalds, keescook, hughd, x86


init_mm is for kernel-exclusive use.  If someone is allocating page
tables in it, do not set _PAGE_USER on them.  This ensures that
we do *not* set NX on these page tables in the KAISER code.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Moritz Lipp <moritz.lipp@iaik.tugraz.at>
Cc: Daniel Gruss <daniel.gruss@iaik.tugraz.at>
Cc: Michael Schwarz <michael.schwarz@iaik.tugraz.at>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Kees Cook <keescook@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: x86@kernel.org
---

 b/arch/x86/include/asm/pgalloc.h |   32 +++++++++++++++++++++++++++-----
 1 file changed, 27 insertions(+), 5 deletions(-)

diff -puN arch/x86/include/asm/pgalloc.h~kaiser-prep-clear-_PAGE_USER-for-init_mm arch/x86/include/asm/pgalloc.h
--- a/arch/x86/include/asm/pgalloc.h~kaiser-prep-clear-_PAGE_USER-for-init_mm	2017-10-31 15:03:48.745037506 -0700
+++ b/arch/x86/include/asm/pgalloc.h	2017-10-31 15:03:48.749037695 -0700
@@ -61,20 +61,36 @@ static inline void __pte_free_tlb(struct
 	___pte_free_tlb(tlb, pte);
 }
 
+/*
+ * _KERNPG_TABLE has _PAGE_USER clear which tells the KAISER code
+ * that this mapping is for kernel use only.  That makes sure that
+ * we leave the mapping usable by the kernel and do not try to
+ * sabotage it by doing stuff like setting _PAGE_NX on it.
+ */
+static inline pteval_t mm_pgtable_flags(struct mm_struct *mm)
+{
+	if (!mm || (mm == &init_mm))
+		return _KERNPG_TABLE;
+	return _PAGE_TABLE;
+}
+
 static inline void pmd_populate_kernel(struct mm_struct *mm,
 				       pmd_t *pmd, pte_t *pte)
 {
+	pteval_t pgtable_flags = mm_pgtable_flags(mm);
+
 	paravirt_alloc_pte(mm, __pa(pte) >> PAGE_SHIFT);
-	set_pmd(pmd, __pmd(__pa(pte) | _PAGE_TABLE));
+	set_pmd(pmd, __pmd(__pa(pte) | pgtable_flags));
 }
 
 static inline void pmd_populate(struct mm_struct *mm, pmd_t *pmd,
 				struct page *pte)
 {
+	pteval_t pgtable_flags = mm_pgtable_flags(mm);
 	unsigned long pfn = page_to_pfn(pte);
 
 	paravirt_alloc_pte(mm, pfn);
-	set_pmd(pmd, __pmd(((pteval_t)pfn << PAGE_SHIFT) | _PAGE_TABLE));
+	set_pmd(pmd, __pmd(((pteval_t)pfn << PAGE_SHIFT) | pgtable_flags));
 }
 
 #define pmd_pgtable(pmd) pmd_page(pmd)
@@ -117,16 +133,20 @@ extern void pud_populate(struct mm_struc
 #else	/* !CONFIG_X86_PAE */
 static inline void pud_populate(struct mm_struct *mm, pud_t *pud, pmd_t *pmd)
 {
+	pteval_t pgtable_flags = mm_pgtable_flags(mm);
+
 	paravirt_alloc_pmd(mm, __pa(pmd) >> PAGE_SHIFT);
-	set_pud(pud, __pud(_PAGE_TABLE | __pa(pmd)));
+	set_pud(pud, __pud(__pa(pmd) | pgtable_flags));
 }
 #endif	/* CONFIG_X86_PAE */
 
 #if CONFIG_PGTABLE_LEVELS > 3
 static inline void p4d_populate(struct mm_struct *mm, p4d_t *p4d, pud_t *pud)
 {
+	pteval_t pgtable_flags = mm_pgtable_flags(mm);
+
 	paravirt_alloc_pud(mm, __pa(pud) >> PAGE_SHIFT);
-	set_p4d(p4d, __p4d(_PAGE_TABLE | __pa(pud)));
+	set_p4d(p4d, __p4d(__pa(pud) | pgtable_flags));
 }
 
 static inline pud_t *pud_alloc_one(struct mm_struct *mm, unsigned long addr)
@@ -155,8 +175,10 @@ static inline void __pud_free_tlb(struct
 #if CONFIG_PGTABLE_LEVELS > 4
 static inline void pgd_populate(struct mm_struct *mm, pgd_t *pgd, p4d_t *p4d)
 {
+	pteval_t pgtable_flags = mm_pgtable_flags(mm);
+
 	paravirt_alloc_p4d(mm, __pa(p4d) >> PAGE_SHIFT);
-	set_pgd(pgd, __pgd(_PAGE_TABLE | __pa(p4d)));
+	set_pgd(pgd, __pgd(__pa(p4d) | pgtable_flags));
 }
 
 static inline p4d_t *p4d_alloc_one(struct mm_struct *mm, unsigned long addr)
_


* [PATCH 03/23] x86, kaiser: disable global pages
  2017-10-31 22:31 [PATCH 00/23] KAISER: unmap most of the kernel from userspace page tables Dave Hansen
  2017-10-31 22:31 ` [PATCH 01/23] x86, kaiser: prepare assembly for entry/exit CR3 switching Dave Hansen
  2017-10-31 22:31 ` [PATCH 02/23] x86, kaiser: do not set _PAGE_USER for init_mm page tables Dave Hansen
@ 2017-10-31 22:31 ` Dave Hansen
  2017-11-01 21:18   ` Thomas Gleixner
  2017-10-31 22:31 ` [PATCH 04/23] x86, tlb: make CR4-based TLB flushes more robust Dave Hansen
                   ` (23 subsequent siblings)
  26 siblings, 1 reply; 102+ messages in thread
From: Dave Hansen @ 2017-10-31 22:31 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, dave.hansen, moritz.lipp, daniel.gruss,
	michael.schwarz, luto, torvalds, keescook, hughd, x86


Global pages stay in the TLB across context switches.  Since all
contexts share the same kernel mapping, we use global pages to
allow kernel entries in the TLB to survive when we context
switch.

But, even having these entries in the TLB opens up something that
an attacker can use [1].

Disable global pages so that kernel TLB entries are flushed when
we run userspace.  This way, all accesses to kernel memory result
in a TLB miss whether there is good data there or not.  Without
this, even when KAISER switches page tables, the kernel entries
might remain in the TLB.

1. The double-page-fault attack:
   http://www.ieee-security.org/TC/SP2013/papers/4977a191.pdf

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Moritz Lipp <moritz.lipp@iaik.tugraz.at>
Cc: Daniel Gruss <daniel.gruss@iaik.tugraz.at>
Cc: Michael Schwarz <michael.schwarz@iaik.tugraz.at>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Kees Cook <keescook@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: x86@kernel.org
---

 b/arch/x86/Kconfig                     |    4 ++++
 b/arch/x86/include/asm/pgtable_types.h |    5 +++++
 2 files changed, 9 insertions(+)

diff -puN arch/x86/include/asm/pgtable_types.h~kaiser-prep-disable-global-pages arch/x86/include/asm/pgtable_types.h
--- a/arch/x86/include/asm/pgtable_types.h~kaiser-prep-disable-global-pages	2017-10-31 15:03:49.314064402 -0700
+++ b/arch/x86/include/asm/pgtable_types.h	2017-10-31 15:03:49.323064827 -0700
@@ -47,7 +47,12 @@
 #define _PAGE_ACCESSED	(_AT(pteval_t, 1) << _PAGE_BIT_ACCESSED)
 #define _PAGE_DIRTY	(_AT(pteval_t, 1) << _PAGE_BIT_DIRTY)
 #define _PAGE_PSE	(_AT(pteval_t, 1) << _PAGE_BIT_PSE)
+#ifdef CONFIG_X86_GLOBAL_PAGES
 #define _PAGE_GLOBAL	(_AT(pteval_t, 1) << _PAGE_BIT_GLOBAL)
+#else
+/* We must ensure that kernel TLBs are unusable while in userspace */
+#define _PAGE_GLOBAL	(_AT(pteval_t, 0))
+#endif
 #define _PAGE_SOFTW1	(_AT(pteval_t, 1) << _PAGE_BIT_SOFTW1)
 #define _PAGE_SOFTW2	(_AT(pteval_t, 1) << _PAGE_BIT_SOFTW2)
 #define _PAGE_PAT	(_AT(pteval_t, 1) << _PAGE_BIT_PAT)
diff -puN arch/x86/Kconfig~kaiser-prep-disable-global-pages arch/x86/Kconfig
--- a/arch/x86/Kconfig~kaiser-prep-disable-global-pages	2017-10-31 15:03:49.318064591 -0700
+++ b/arch/x86/Kconfig	2017-10-31 15:03:49.325064922 -0700
@@ -327,6 +327,10 @@ config ARCH_SUPPORTS_UPROBES
 config FIX_EARLYCON_MEM
 	def_bool y
 
+config X86_GLOBAL_PAGES
+	def_bool y
+	depends on ! KAISER
+
 config PGTABLE_LEVELS
 	int
 	default 5 if X86_5LEVEL
_


* [PATCH 04/23] x86, tlb: make CR4-based TLB flushes more robust
  2017-10-31 22:31 [PATCH 00/23] KAISER: unmap most of the kernel from userspace page tables Dave Hansen
                   ` (2 preceding siblings ...)
  2017-10-31 22:31 ` [PATCH 03/23] x86, kaiser: disable global pages Dave Hansen
@ 2017-10-31 22:31 ` Dave Hansen
  2017-11-01  8:01   ` Andy Lutomirski
  2017-11-01 21:25   ` Thomas Gleixner
  2017-10-31 22:31 ` [PATCH 05/23] x86, mm: document X86_CR4_PGE toggling behavior Dave Hansen
                   ` (22 subsequent siblings)
  26 siblings, 2 replies; 102+ messages in thread
From: Dave Hansen @ 2017-10-31 22:31 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, dave.hansen, moritz.lipp, daniel.gruss,
	michael.schwarz, luto, torvalds, keescook, hughd, x86


Our CR4-based TLB flush currently requires global pages to be
supported *and* enabled.  But, we really only need them to be
supported.  Make the code more robust by allowing X86_CR4_PGE to
be cleared as well as set.

This change was suggested by Kirill Shutemov.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Moritz Lipp <moritz.lipp@iaik.tugraz.at>
Cc: Daniel Gruss <daniel.gruss@iaik.tugraz.at>
Cc: Michael Schwarz <michael.schwarz@iaik.tugraz.at>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Kees Cook <keescook@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: x86@kernel.org
---

 b/arch/x86/include/asm/tlbflush.h |   17 ++++++++++++++---
 1 file changed, 14 insertions(+), 3 deletions(-)

diff -puN arch/x86/include/asm/tlbflush.h~kaiser-prep-make-cr4-writes-tolerate-clear-pge arch/x86/include/asm/tlbflush.h
--- a/arch/x86/include/asm/tlbflush.h~kaiser-prep-make-cr4-writes-tolerate-clear-pge	2017-10-31 15:03:49.913092716 -0700
+++ b/arch/x86/include/asm/tlbflush.h	2017-10-31 15:03:49.917092905 -0700
@@ -250,9 +250,20 @@ static inline void __native_flush_tlb_gl
 	unsigned long cr4;
 
 	cr4 = this_cpu_read(cpu_tlbstate.cr4);
-	/* clear PGE */
-	native_write_cr4(cr4 & ~X86_CR4_PGE);
-	/* write old PGE again and flush TLBs */
+	/*
+	 * This function is only called on systems that support X86_CR4_PGE
+	 * and that always keep it set.  Warn if we are called without
+	 * PGE set.
+	 */
+	WARN_ON_ONCE(!(cr4 & X86_CR4_PGE));
+	/*
+	 * Architecturally, any _change_ to X86_CR4_PGE will fully flush the
+	 * TLB of all entries including all entries in all PCIDs and all
+	 * global pages.  Make sure that we _change_ the bit, regardless of
+	 * whether we had X86_CR4_PGE set in the first place.
+	 */
+	native_write_cr4(cr4 ^ X86_CR4_PGE);
+	/* Put original CR4 value back: */
 	native_write_cr4(cr4);
 }
 
_


* [PATCH 05/23] x86, mm: document X86_CR4_PGE toggling behavior
  2017-10-31 22:31 [PATCH 00/23] KAISER: unmap most of the kernel from userspace page tables Dave Hansen
                   ` (3 preceding siblings ...)
  2017-10-31 22:31 ` [PATCH 04/23] x86, tlb: make CR4-based TLB flushes more robust Dave Hansen
@ 2017-10-31 22:31 ` Dave Hansen
  2017-10-31 23:31   ` Kees Cook
  2017-10-31 22:31 ` [PATCH 06/23] x86, kaiser: introduce user-mapped percpu areas Dave Hansen
                   ` (21 subsequent siblings)
  26 siblings, 1 reply; 102+ messages in thread
From: Dave Hansen @ 2017-10-31 22:31 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, dave.hansen, moritz.lipp, daniel.gruss,
	michael.schwarz, luto, torvalds, keescook, hughd, x86


The comment says it all.  The problem is that the X86_CR4_PGE bit
affects all PCIDs in a way that is totally obscure.

This makes the behavior easier to find when grepping for PCID-
related code, and it documents the hardware behavior that we are
depending on.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Moritz Lipp <moritz.lipp@iaik.tugraz.at>
Cc: Daniel Gruss <daniel.gruss@iaik.tugraz.at>
Cc: Michael Schwarz <michael.schwarz@iaik.tugraz.at>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Kees Cook <keescook@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: x86@kernel.org
---

 b/arch/x86/include/asm/tlbflush.h |    6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff -puN arch/x86/include/asm/tlbflush.h~kaiser-prep-document-cr4-pge-behavior arch/x86/include/asm/tlbflush.h
--- a/arch/x86/include/asm/tlbflush.h~kaiser-prep-document-cr4-pge-behavior	2017-10-31 15:03:50.479119470 -0700
+++ b/arch/x86/include/asm/tlbflush.h	2017-10-31 15:03:50.482119612 -0700
@@ -258,9 +258,11 @@ static inline void __native_flush_tlb_gl
 	WARN_ON_ONCE(!(cr4 & X86_CR4_PGE));
 	/*
 	 * Architecturally, any _change_ to X86_CR4_PGE will fully flush the
-	 * TLB of all entries including all entries in all PCIDs and all
-	 * global pages.  Make sure that we _change_ the bit, regardless of
+	 * TLB.  Make sure that we _change_ the bit, regardless of
 	 * whether we had X86_CR4_PGE set in the first place.
+	 *
+	 * Note that just toggling PGE *also* flushes all entries from all
+	 * PCIDs, regardless of the state of X86_CR4_PCIDE.
 	 */
 	native_write_cr4(cr4 ^ X86_CR4_PGE);
 	/* Put original CR4 value back: */
_


* [PATCH 06/23] x86, kaiser: introduce user-mapped percpu areas
  2017-10-31 22:31 [PATCH 00/23] KAISER: unmap most of the kernel from userspace page tables Dave Hansen
                   ` (4 preceding siblings ...)
  2017-10-31 22:31 ` [PATCH 05/23] x86, mm: document X86_CR4_PGE toggling behavior Dave Hansen
@ 2017-10-31 22:31 ` Dave Hansen
  2017-11-01 21:47   ` Thomas Gleixner
  2017-10-31 22:31 ` [PATCH 07/23] x86, kaiser: unmap kernel from userspace page tables (core patch) Dave Hansen
                   ` (20 subsequent siblings)
  26 siblings, 1 reply; 102+ messages in thread
From: Dave Hansen @ 2017-10-31 22:31 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, dave.hansen, moritz.lipp, daniel.gruss,
	michael.schwarz, luto, torvalds, keescook, hughd, x86


These patches are based on work from a team at Graz University of
Technology posted here: https://github.com/IAIK/KAISER

The KAISER approach keeps two copies of the page tables: one for running
in the kernel and one for running userspace.  But, there are a few
structures that are needed for switching in and out of the kernel and
a good subset of *those* are per-cpu data.

This patch creates a new kind of per-cpu data that is mapped into
both copies and can be used no matter which copy of the page tables
we are using.

Thanks to Hugh Dickins for cleanups to this code.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Moritz Lipp <moritz.lipp@iaik.tugraz.at>
Cc: Daniel Gruss <daniel.gruss@iaik.tugraz.at>
Cc: Michael Schwarz <michael.schwarz@iaik.tugraz.at>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Kees Cook <keescook@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: x86@kernel.org
---

 b/arch/x86/include/asm/desc.h       |    2 +-
 b/arch/x86/include/asm/hw_irq.h     |    2 +-
 b/arch/x86/include/asm/processor.h  |    2 +-
 b/arch/x86/kernel/cpu/common.c      |    4 ++--
 b/arch/x86/kernel/irqinit.c         |    2 +-
 b/arch/x86/kernel/process.c         |    2 +-
 b/include/asm-generic/vmlinux.lds.h |    7 +++++++
 b/include/linux/percpu-defs.h       |   32 +++++++++++++++++++++++++++++++-
 8 files changed, 45 insertions(+), 8 deletions(-)

diff -puN arch/x86/include/asm/desc.h~kaiser-prep-user-mapped-percpu arch/x86/include/asm/desc.h
--- a/arch/x86/include/asm/desc.h~kaiser-prep-user-mapped-percpu	2017-10-31 15:03:51.046146272 -0700
+++ b/arch/x86/include/asm/desc.h	2017-10-31 15:03:51.066147217 -0700
@@ -45,7 +45,7 @@ struct gdt_page {
 	struct desc_struct gdt[GDT_ENTRIES];
 } __attribute__((aligned(PAGE_SIZE)));
 
-DECLARE_PER_CPU_PAGE_ALIGNED(struct gdt_page, gdt_page);
+DECLARE_PER_CPU_PAGE_ALIGNED_USER_MAPPED(struct gdt_page, gdt_page);
 
 /* Provide the original GDT */
 static inline struct desc_struct *get_cpu_gdt_rw(unsigned int cpu)
diff -puN arch/x86/include/asm/hw_irq.h~kaiser-prep-user-mapped-percpu arch/x86/include/asm/hw_irq.h
--- a/arch/x86/include/asm/hw_irq.h~kaiser-prep-user-mapped-percpu	2017-10-31 15:03:51.048146366 -0700
+++ b/arch/x86/include/asm/hw_irq.h	2017-10-31 15:03:51.066147217 -0700
@@ -160,7 +160,7 @@ extern char irq_entries_start[];
 #define VECTOR_RETRIGGERED	((void *)~0UL)
 
 typedef struct irq_desc* vector_irq_t[NR_VECTORS];
-DECLARE_PER_CPU(vector_irq_t, vector_irq);
+DECLARE_PER_CPU_USER_MAPPED(vector_irq_t, vector_irq);
 
 #endif /* !ASSEMBLY_ */
 
diff -puN arch/x86/include/asm/processor.h~kaiser-prep-user-mapped-percpu arch/x86/include/asm/processor.h
--- a/arch/x86/include/asm/processor.h~kaiser-prep-user-mapped-percpu	2017-10-31 15:03:51.051146508 -0700
+++ b/arch/x86/include/asm/processor.h	2017-10-31 15:03:51.067147264 -0700
@@ -348,7 +348,7 @@ struct tss_struct {
 
 } ____cacheline_aligned;
 
-DECLARE_PER_CPU_SHARED_ALIGNED(struct tss_struct, cpu_tss);
+DECLARE_PER_CPU_SHARED_ALIGNED_USER_MAPPED(struct tss_struct, cpu_tss);
 
 /*
  * sizeof(unsigned long) coming from an extra "long" at the end
diff -puN arch/x86/kernel/cpu/common.c~kaiser-prep-user-mapped-percpu arch/x86/kernel/cpu/common.c
--- a/arch/x86/kernel/cpu/common.c~kaiser-prep-user-mapped-percpu	2017-10-31 15:03:51.053146603 -0700
+++ b/arch/x86/kernel/cpu/common.c	2017-10-31 15:03:51.067147264 -0700
@@ -98,7 +98,7 @@ static const struct cpu_dev default_cpu
 
 static const struct cpu_dev *this_cpu = &default_cpu;
 
-DEFINE_PER_CPU_PAGE_ALIGNED(struct gdt_page, gdt_page) = { .gdt = {
+DEFINE_PER_CPU_PAGE_ALIGNED_USER_MAPPED(struct gdt_page, gdt_page) = { .gdt = {
 #ifdef CONFIG_X86_64
 	/*
 	 * We need valid kernel segments for data and code in long mode too
@@ -1345,7 +1345,7 @@ static const unsigned int exception_stac
 	  [DEBUG_STACK - 1]			= DEBUG_STKSZ
 };
 
-static DEFINE_PER_CPU_PAGE_ALIGNED(char, exception_stacks
+DEFINE_PER_CPU_PAGE_ALIGNED_USER_MAPPED(char, exception_stacks
 	[(N_EXCEPTION_STACKS - 1) * EXCEPTION_STKSZ + DEBUG_STKSZ]);
 
 /* May not be marked __init: used by software suspend */
diff -puN arch/x86/kernel/irqinit.c~kaiser-prep-user-mapped-percpu arch/x86/kernel/irqinit.c
--- a/arch/x86/kernel/irqinit.c~kaiser-prep-user-mapped-percpu	2017-10-31 15:03:51.055146697 -0700
+++ b/arch/x86/kernel/irqinit.c	2017-10-31 15:03:51.068147312 -0700
@@ -51,7 +51,7 @@ static struct irqaction irq2 = {
 	.flags = IRQF_NO_THREAD,
 };
 
-DEFINE_PER_CPU(vector_irq_t, vector_irq) = {
+DEFINE_PER_CPU_USER_MAPPED(vector_irq_t, vector_irq) = {
 	[0 ... NR_VECTORS - 1] = VECTOR_UNUSED,
 };
 
diff -puN arch/x86/kernel/process.c~kaiser-prep-user-mapped-percpu arch/x86/kernel/process.c
--- a/arch/x86/kernel/process.c~kaiser-prep-user-mapped-percpu	2017-10-31 15:03:51.057146792 -0700
+++ b/arch/x86/kernel/process.c	2017-10-31 15:03:51.068147312 -0700
@@ -46,7 +46,7 @@
  * section. Since TSS's are completely CPU-local, we want them
  * on exact cacheline boundaries, to eliminate cacheline ping-pong.
  */
-__visible DEFINE_PER_CPU_SHARED_ALIGNED(struct tss_struct, cpu_tss) = {
+__visible DEFINE_PER_CPU_SHARED_ALIGNED_USER_MAPPED(struct tss_struct, cpu_tss) = {
 	.x86_tss = {
 		.sp0 = TOP_OF_INIT_STACK,
 #ifdef CONFIG_X86_32
diff -puN include/asm-generic/vmlinux.lds.h~kaiser-prep-user-mapped-percpu include/asm-generic/vmlinux.lds.h
--- a/include/asm-generic/vmlinux.lds.h~kaiser-prep-user-mapped-percpu	2017-10-31 15:03:51.059146886 -0700
+++ b/include/asm-generic/vmlinux.lds.h	2017-10-31 15:03:51.068147312 -0700
@@ -807,7 +807,14 @@
  */
 #define PERCPU_INPUT(cacheline)						\
 	VMLINUX_SYMBOL(__per_cpu_start) = .;				\
+	VMLINUX_SYMBOL(__per_cpu_user_mapped_start) = .;		\
 	*(.data..percpu..first)						\
+	. = ALIGN(cacheline);						\
+	*(.data..percpu..user_mapped)					\
+	*(.data..percpu..user_mapped..shared_aligned)			\
+	. = ALIGN(PAGE_SIZE);						\
+	*(.data..percpu..user_mapped..page_aligned)			\
+	VMLINUX_SYMBOL(__per_cpu_user_mapped_end) = .;			\
 	. = ALIGN(PAGE_SIZE);						\
 	*(.data..percpu..page_aligned)					\
 	. = ALIGN(cacheline);						\
diff -puN include/linux/percpu-defs.h~kaiser-prep-user-mapped-percpu include/linux/percpu-defs.h
--- a/include/linux/percpu-defs.h~kaiser-prep-user-mapped-percpu	2017-10-31 15:03:51.062147028 -0700
+++ b/include/linux/percpu-defs.h	2017-10-31 15:03:51.069147359 -0700
@@ -35,6 +35,12 @@
 
 #endif
 
+#ifdef CONFIG_KAISER
+#define USER_MAPPED_SECTION "..user_mapped"
+#else
+#define USER_MAPPED_SECTION ""
+#endif
+
 /*
  * Base implementations of per-CPU variable declarations and definitions, where
  * the section in which the variable is to be placed is provided by the
@@ -115,6 +121,12 @@
 #define DEFINE_PER_CPU(type, name)					\
 	DEFINE_PER_CPU_SECTION(type, name, "")
 
+#define DECLARE_PER_CPU_USER_MAPPED(type, name)				\
+	DECLARE_PER_CPU_SECTION(type, name, USER_MAPPED_SECTION)
+
+#define DEFINE_PER_CPU_USER_MAPPED(type, name)				\
+	DEFINE_PER_CPU_SECTION(type, name, USER_MAPPED_SECTION)
+
 /*
  * Declaration/definition used for per-CPU variables that must come first in
  * the set of variables.
@@ -144,6 +156,14 @@
 	DEFINE_PER_CPU_SECTION(type, name, PER_CPU_SHARED_ALIGNED_SECTION) \
 	____cacheline_aligned_in_smp
 
+#define DECLARE_PER_CPU_SHARED_ALIGNED_USER_MAPPED(type, name)		\
+	DECLARE_PER_CPU_SECTION(type, name, USER_MAPPED_SECTION PER_CPU_SHARED_ALIGNED_SECTION) \
+	____cacheline_aligned_in_smp
+
+#define DEFINE_PER_CPU_SHARED_ALIGNED_USER_MAPPED(type, name)		\
+	DEFINE_PER_CPU_SECTION(type, name, USER_MAPPED_SECTION PER_CPU_SHARED_ALIGNED_SECTION) \
+	____cacheline_aligned_in_smp
+
 #define DECLARE_PER_CPU_ALIGNED(type, name)				\
 	DECLARE_PER_CPU_SECTION(type, name, PER_CPU_ALIGNED_SECTION)	\
 	____cacheline_aligned
@@ -162,11 +182,21 @@
 #define DEFINE_PER_CPU_PAGE_ALIGNED(type, name)				\
 	DEFINE_PER_CPU_SECTION(type, name, "..page_aligned")		\
 	__aligned(PAGE_SIZE)
+/*
+ * Declaration/definition used for per-CPU variables that must be page aligned and need to be mapped in user mode.
+ */
+#define DECLARE_PER_CPU_PAGE_ALIGNED_USER_MAPPED(type, name)		\
+	DECLARE_PER_CPU_SECTION(type, name, USER_MAPPED_SECTION"..page_aligned") \
+	__aligned(PAGE_SIZE)
+
+#define DEFINE_PER_CPU_PAGE_ALIGNED_USER_MAPPED(type, name)		\
+	DEFINE_PER_CPU_SECTION(type, name, USER_MAPPED_SECTION"..page_aligned") \
+	__aligned(PAGE_SIZE)
 
 /*
  * Declaration/definition used for per-CPU variables that must be read mostly.
  */
-#define DECLARE_PER_CPU_READ_MOSTLY(type, name)			\
+#define DECLARE_PER_CPU_READ_MOSTLY(type, name)				\
 	DECLARE_PER_CPU_SECTION(type, name, "..read_mostly")
 
 #define DEFINE_PER_CPU_READ_MOSTLY(type, name)				\
_


* [PATCH 07/23] x86, kaiser: unmap kernel from userspace page tables (core patch)
  2017-10-31 22:31 [PATCH 00/23] KAISER: unmap most of the kernel from userspace page tables Dave Hansen
                   ` (5 preceding siblings ...)
  2017-10-31 22:31 ` [PATCH 06/23] x86, kaiser: introduce user-mapped percpu areas Dave Hansen
@ 2017-10-31 22:31 ` Dave Hansen
  2017-10-31 22:32 ` [PATCH 08/23] x86, kaiser: only populate shadow page tables for userspace Dave Hansen
                   ` (19 subsequent siblings)
  26 siblings, 0 replies; 102+ messages in thread
From: Dave Hansen @ 2017-10-31 22:31 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, dave.hansen, moritz.lipp, daniel.gruss,
	michael.schwarz, luto, torvalds, keescook, hughd, x86


These patches are based on work from a team at Graz University of
Technology: https://github.com/IAIK/KAISER .  This would not have
been possible without their work as a starting point.

KAISER is a countermeasure against side channel attacks on kernel
virtual memory.  It leaves the existing page tables largely alone and
refers to them as the "kernel page tables".  It adds a "shadow" pgd for
every process, which is intended for use when we run userspace.  The
shadow pgd maps all the same user memory as the "kernel" copy, but
only maps a minimal set of kernel memory.

Whenever we enter the kernel (syscalls, interrupts, exceptions), the
pgd is switched to the "kernel" copy.  When the system switches back
to user mode, the shadow pgd is used.

The minimalistic kernel page tables try to map only what is needed to
enter/exit the kernel such as the entry/exit functions themselves and
the interrupt descriptors (IDT).
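
For illustration, code that needs one of those structures visible in
the user copy would call the kaiser_add_mapping() helper added later
in this patch.  The data and function names in this sketch are made
up; only the kaiser_add_mapping() signature comes from the patch:

	#include <linux/init.h>
	#include <asm/kaiser.h>
	#include <asm/pgtable_types.h>

	/* hypothetical data needed during entry/exit */
	static unsigned long entry_scratch[512];

	static void __init example_map_entry_scratch(void)
	{
		/* also map it into the shadow (user) copy of the page tables */
		kaiser_add_mapping((unsigned long)entry_scratch,
				   sizeof(entry_scratch), __PAGE_KERNEL);
	}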

Changes from original KAISER patch:
 * Gobs of coding style cleanups
 * The original patch tried to allocate an order-2 page, then
   8k-align the result.  That's silly since order-2 is already
   guaranteed to be 16k-aligned.  Removed that gunk and just
   allocate an order-1 page.
 * Handle (or at least detect and warn on) allocation failures
 * Use _KERNPG_TABLE, not _PAGE_TABLE when creating mappings for
   the kernel in the shadow (user) page tables.
 * BUG_ON() for !pte_none() case was totally insane: it checked
   the physical address of the 'struct page' against the physical
   address of the page being mapped.
 * Added 5-level page table support
 * Never free kaiser page tables.  We don't have the locking to
   keep them from getting used while we free them.
 * Use a totally different scheme in the entry code.  The
   original code just fell apart in horrific ways in debug faults,
   NMIs, or when iret faults.  Big thanks to Andy Lutomirski for
   reducing the number of places we had to patch.  He made the
   code a ton simpler.

Note: The original KAISER authors signed-off on their patch.  Some of
their code has been broken out into other patches in this series, but
their SoB was only retained here.

Signed-off-by: Moritz Lipp <moritz.lipp@iaik.tugraz.at>
Signed-off-by: Daniel Gruss <daniel.gruss@iaik.tugraz.at>
Signed-off-by: Michael Schwarz <michael.schwarz@iaik.tugraz.at>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Moritz Lipp <moritz.lipp@iaik.tugraz.at>
Cc: Daniel Gruss <daniel.gruss@iaik.tugraz.at>
Cc: Michael Schwarz <michael.schwarz@iaik.tugraz.at>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Kees Cook <keescook@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: x86@kernel.org
---

 b/Documentation/x86/kaiser.txt      |  128 ++++++++++++
 b/arch/x86/entry/calling.h          |   32 ++-
 b/arch/x86/include/asm/kaiser.h     |   59 +++++
 b/arch/x86/include/asm/pgtable.h    |    6 
 b/arch/x86/include/asm/pgtable_64.h |   93 ++++++++
 b/arch/x86/kernel/espfix_64.c       |   17 +
 b/arch/x86/kernel/head_64.S         |   14 +
 b/arch/x86/mm/Makefile              |    1 
 b/arch/x86/mm/kaiser.c              |  380 ++++++++++++++++++++++++++++++++++++
 b/arch/x86/mm/pageattr.c            |    2 
 b/arch/x86/mm/pgtable.c             |   16 +
 b/include/linux/kaiser.h            |   34 +++
 b/init/main.c                       |    2 
 b/kernel/fork.c                     |    6 
 14 files changed, 781 insertions(+), 9 deletions(-)

diff -puN arch/x86/entry/calling.h~kaiser-base arch/x86/entry/calling.h
--- a/arch/x86/entry/calling.h~kaiser-base	2017-10-31 15:03:51.817182716 -0700
+++ b/arch/x86/entry/calling.h	2017-10-31 15:03:51.842183897 -0700
@@ -1,6 +1,7 @@
 #include <linux/jump_label.h>
 #include <asm/unwind_hints.h>
 #include <asm/cpufeatures.h>
+#include <asm/page_types.h>
 
 /*
 
@@ -218,10 +219,19 @@ For 32-bit we have the following convent
 #endif
 .endm
 
+#ifdef CONFIG_KAISER
+
+/* KAISER PGDs are 8k.  We flip bit 12 to switch between the two halves: */
+#define KAISER_SWITCH_MASK (1<<PAGE_SHIFT)
+
 .macro ADJUST_KERNEL_CR3 reg:req
+	/* Clear "KAISER bit", point CR3 at kernel pagetables: */
+	andq	$(~KAISER_SWITCH_MASK), \reg
 .endm
 
 .macro ADJUST_USER_CR3 reg:req
+	/* Move CR3 up a page to the user page tables: */
+	orq	$(KAISER_SWITCH_MASK), \reg
 .endm
 
 .macro SWITCH_TO_KERNEL_CR3 scratch_reg:req
@@ -240,10 +250,10 @@ For 32-bit we have the following convent
 	movq	%cr3, %r\scratch_reg
 	movq	%r\scratch_reg, \save_reg
 	/*
-	 * Just stick a random bit in here that never gets set.  Fixed
+	 * Is the switch bit zero?  This means the address is
 	 * up in real KAISER patches in a moment.
 	 */
-	bt	$63, %r\scratch_reg
+	testq	$(KAISER_SWITCH_MASK), %r\scratch_reg
 	jz	.Ldone_\@
 
 	ADJUST_KERNEL_CR3 %r\scratch_reg
@@ -253,10 +263,26 @@ For 32-bit we have the following convent
 .endm
 
 .macro RESTORE_CR3 save_reg:req
-	/* optimize this */
+	/*
+	 * We could avoid the CR3 write if not changing its value,
+	 * but that requires a CR3 read *and* a scratch register.
+	 */
 	movq	\save_reg, %cr3
 .endm
 
+#else /* CONFIG_KAISER=n: */
+
+.macro SWITCH_TO_KERNEL_CR3 scratch_reg:req
+.endm
+.macro SWITCH_TO_USER_CR3 scratch_reg:req
+.endm
+.macro SAVE_AND_SWITCH_TO_KERNEL_CR3 scratch_reg:req save_reg:req
+.endm
+.macro RESTORE_CR3 save_reg:req
+.endm
+
+#endif
+
 #endif /* CONFIG_X86_64 */
 
 /*
diff -puN /dev/null arch/x86/include/asm/kaiser.h
--- /dev/null	2017-05-17 09:46:39.241182829 -0700
+++ b/arch/x86/include/asm/kaiser.h	2017-10-31 15:03:51.843183945 -0700
@@ -0,0 +1,59 @@
+#ifndef _ASM_X86_KAISER_H
+#define _ASM_X86_KAISER_H
+/*
+ * Copyright(c) 2017 Intel Corporation. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * Based on work published here: https://github.com/IAIK/KAISER
+ * Modified by Dave Hansen <dave.hansen@intel.com> to actually work.
+ */
+#ifndef __ASSEMBLY__
+
+#ifdef CONFIG_KAISER
+/**
+ *  kaiser_add_mapping - map a kernel range into the user page tables
+ *  @addr: the start address of the range
+ *  @size: the size of the range
+ *  @flags: The mapping flags of the pages
+ *
+ *  Use this on all data and code that need to be mapped into both
+ *  copies of the page tables.  This includes the code that switches
+ *  to/from userspace and all of the hardware structures that are
+ *  virtually-addressed and needed in userspace like the interrupt
+ *  table.
+ */
+extern int kaiser_add_mapping(unsigned long addr, unsigned long size,
+			      unsigned long flags);
+
+extern int kaiser_map_stack(struct task_struct *tsk);
+
+/**
+ *  kaiser_remove_mapping - remove a kernel mapping from the userpage tables
+ *  @addr: the start address of the range
+ *  @size: the size of the range
+ */
+extern void kaiser_remove_mapping(unsigned long start, unsigned long size);
+
+/**
+ *  kaiser_init - Initialize the shadow mapping
+ *
+ *  Most parts of the shadow mapping can be mapped upon boot
+ *  time.  Only per-process things like the thread stacks
+ *  or a new LDT have to be mapped at runtime.  These boot-
+ *  time mappings are permanent and never unmapped.
+ */
+extern void kaiser_init(void);
+
+#endif
+
+#endif /* __ASSEMBLY__ */
+
+#endif /* _ASM_X86_KAISER_H */
diff -puN arch/x86/include/asm/pgtable_64.h~kaiser-base arch/x86/include/asm/pgtable_64.h
--- a/arch/x86/include/asm/pgtable_64.h~kaiser-base	2017-10-31 15:03:51.819182810 -0700
+++ b/arch/x86/include/asm/pgtable_64.h	2017-10-31 15:03:51.843183945 -0700
@@ -130,9 +130,88 @@ static inline pud_t native_pudp_get_and_
 #endif
 }
 
+#ifdef CONFIG_KAISER
+/*
+ * All top-level KAISER page tables are order-1 pages (8k-aligned
+ * and 8k in size).  The kernel one is at the beginning 4k and
+ * the user (shadow) one is in the last 4k.  To switch between
+ * them, you just need to flip the 12th bit in their addresses.
+ */
+#define KAISER_PGTABLE_SWITCH_BIT	PAGE_SHIFT
+
+/*
+ * This generates better code than the inline assembly in
+ * __set_bit().
+ */
+static inline void *ptr_set_bit(void *ptr, int bit)
+{
+	unsigned long __ptr = (unsigned long)ptr;
+	__ptr |= (1<<bit);
+	return (void *)__ptr;
+}
+static inline void *ptr_clear_bit(void *ptr, int bit)
+{
+	unsigned long __ptr = (unsigned long)ptr;
+	__ptr &= ~(1<<bit);
+	return (void *)__ptr;
+}
+
+static inline pgd_t *native_get_shadow_pgd(pgd_t *pgdp)
+{
+	return ptr_set_bit(pgdp, KAISER_PGTABLE_SWITCH_BIT);
+}
+static inline pgd_t *native_get_normal_pgd(pgd_t *pgdp)
+{
+	return ptr_clear_bit(pgdp, KAISER_PGTABLE_SWITCH_BIT);
+}
+static inline p4d_t *native_get_shadow_p4d(p4d_t *p4dp)
+{
+	return ptr_set_bit(p4dp, KAISER_PGTABLE_SWITCH_BIT);
+}
+static inline p4d_t *native_get_normal_p4d(p4d_t *p4dp)
+{
+	return ptr_clear_bit(p4dp, KAISER_PGTABLE_SWITCH_BIT);
+}
+#endif /* CONFIG_KAISER */
+
+/*
+ * Page table pages are page-aligned.  The lower half of the top
+ * level is used for userspace and the top half for the kernel.
+ * This returns true for user pages that need to get copied into
+ * both the user and kernel copies of the page tables, and false
+ * for kernel pages that should only be in the kernel copy.
+ */
+static inline bool is_userspace_pgd(void *__ptr)
+{
+	unsigned long ptr = (unsigned long)__ptr;
+
+	return ((ptr % PAGE_SIZE) < (PAGE_SIZE / 2));
+}
+
 static inline void native_set_p4d(p4d_t *p4dp, p4d_t p4d)
 {
+#if defined(CONFIG_KAISER) && !defined(CONFIG_X86_5LEVEL)
+	/*
+	 * set_pgd() does not get called when we are running
+	 * CONFIG_X86_5LEVEL=y.  So, just hack around it.  We
+	 * know here that we have a p4d but that it is really at
+	 * the top level of the page tables; it is really just a
+	 * pgd.
+	 */
+	/* Do we need to also populate the shadow p4d? */
+	if (is_userspace_pgd(p4dp))
+		native_get_shadow_p4d(p4dp)->pgd = p4d.pgd;
+	/*
+	 * Even if the entry is *mapping* userspace, ensure
+	 * that userspace can not use it.  This way, if we
+	 * get out to userspace with the wrong CR3 value,
+	 * userspace will crash instead of running.
+	 */
+	if (!p4d.pgd.pgd)
+		p4dp->pgd.pgd = p4d.pgd.pgd | _PAGE_NX;
+#else /* CONFIG_KAISER */
 	*p4dp = p4d;
+#endif
 }
 
 static inline void native_p4d_clear(p4d_t *p4d)
@@ -146,7 +225,21 @@ static inline void native_p4d_clear(p4d_
 
 static inline void native_set_pgd(pgd_t *pgdp, pgd_t pgd)
 {
+#ifdef CONFIG_KAISER
+	/* Do we need to also populate the shadow pgd? */
+	if (is_userspace_pgd(pgdp))
+		native_get_shadow_pgd(pgdp)->pgd = pgd.pgd;
+	/*
+	 * Even if the entry is mapping userspace, ensure
+	 * that it is unusable for userspace.  This way,
+	 * if we get out to userspace with the wrong CR3
+	 * value, userspace will crash instead of running.
+	 */
+	if (!pgd_none(pgd))
+		pgdp->pgd = pgd.pgd | _PAGE_NX;
+#else /* CONFIG_KAISER */
 	*pgdp = pgd;
+#endif
 }
 
 static inline void native_pgd_clear(pgd_t *pgd)
diff -puN arch/x86/include/asm/pgtable.h~kaiser-base arch/x86/include/asm/pgtable.h
--- a/arch/x86/include/asm/pgtable.h~kaiser-base	2017-10-31 15:03:51.821182905 -0700
+++ b/arch/x86/include/asm/pgtable.h	2017-10-31 15:03:51.844183992 -0700
@@ -1105,6 +1105,12 @@ static inline void pmdp_set_wrprotect(st
 static inline void clone_pgd_range(pgd_t *dst, pgd_t *src, int count)
 {
        memcpy(dst, src, count * sizeof(pgd_t));
+#ifdef CONFIG_KAISER
+	/* Clone the shadow pgd part as well */
+	memcpy(native_get_shadow_pgd(dst),
+	       native_get_shadow_pgd(src),
+	       count * sizeof(pgd_t));
+#endif
 }
 
 #define PTE_SHIFT ilog2(PTRS_PER_PTE)
diff -puN arch/x86/kernel/espfix_64.c~kaiser-base arch/x86/kernel/espfix_64.c
--- a/arch/x86/kernel/espfix_64.c~kaiser-base	2017-10-31 15:03:51.823182999 -0700
+++ b/arch/x86/kernel/espfix_64.c	2017-10-31 15:03:51.844183992 -0700
@@ -41,6 +41,7 @@
 #include <asm/pgalloc.h>
 #include <asm/setup.h>
 #include <asm/espfix.h>
+#include <asm/kaiser.h>
 
 /*
  * Note: we only need 6*8 = 48 bytes for the espfix stack, but round
@@ -128,6 +129,22 @@ void __init init_espfix_bsp(void)
 	pgd = &init_top_pgt[pgd_index(ESPFIX_BASE_ADDR)];
 	p4d = p4d_alloc(&init_mm, pgd, ESPFIX_BASE_ADDR);
 	p4d_populate(&init_mm, p4d, espfix_pud_page);
+	/*
+	 * Just copy the top-level PGD that is mapping the espfix
+	 * area to ensure it is mapped into the shadow user page
+	 * tables.
+	 *
+	 * For 5-level paging, we should have already populated
+	 * the espfix pgd when kaiser_init() pre-populated all
+	 * the pgd entries.  The above p4d_alloc() would never do
+	 * anything and the p4d_populate() would be done to a p4d
+	 * already mapped in the userspace pgd.
+	 */
+#ifdef CONFIG_KAISER
+	if (CONFIG_PGTABLE_LEVELS <= 4)
+		set_pgd(native_get_shadow_pgd(pgd),
+			__pgd(_KERNPG_TABLE | (p4d_pfn(*p4d) << PAGE_SHIFT)));
+#endif
 
 	/* Randomize the locations */
 	init_espfix_random();
diff -puN arch/x86/kernel/head_64.S~kaiser-base arch/x86/kernel/head_64.S
--- a/arch/x86/kernel/head_64.S~kaiser-base	2017-10-31 15:03:51.826183141 -0700
+++ b/arch/x86/kernel/head_64.S	2017-10-31 15:03:51.844183992 -0700
@@ -339,6 +339,14 @@ GLOBAL(early_recursion_flag)
 	.balign	PAGE_SIZE; \
 GLOBAL(name)
 
+#ifdef CONFIG_KAISER
+#define NEXT_PGD_PAGE(name) \
+	.balign 2 * PAGE_SIZE; \
+GLOBAL(name)
+#else
+#define NEXT_PGD_PAGE(name) NEXT_PAGE(name)
+#endif
+
 /* Automate the creation of 1 to 1 mapping pmd entries */
 #define PMDS(START, PERM, COUNT)			\
 	i = 0 ;						\
@@ -348,7 +356,7 @@ GLOBAL(name)
 	.endr
 
 	__INITDATA
-NEXT_PAGE(early_top_pgt)
+NEXT_PGD_PAGE(early_top_pgt)
 	.fill	511,8,0
 #ifdef CONFIG_X86_5LEVEL
 	.quad	level4_kernel_pgt - __START_KERNEL_map + _PAGE_TABLE_NOENC
@@ -362,10 +370,10 @@ NEXT_PAGE(early_dynamic_pgts)
 	.data
 
 #ifndef CONFIG_XEN
-NEXT_PAGE(init_top_pgt)
+NEXT_PGD_PAGE(init_top_pgt)
 	.fill	512,8,0
 #else
-NEXT_PAGE(init_top_pgt)
+NEXT_PGD_PAGE(init_top_pgt)
 	.quad   level3_ident_pgt - __START_KERNEL_map + _KERNPG_TABLE_NOENC
 	.org    init_top_pgt + PGD_PAGE_OFFSET*8, 0
 	.quad   level3_ident_pgt - __START_KERNEL_map + _KERNPG_TABLE_NOENC
diff -puN /dev/null arch/x86/mm/kaiser.c
--- /dev/null	2017-05-17 09:46:39.241182829 -0700
+++ b/arch/x86/mm/kaiser.c	2017-10-31 15:03:51.845184039 -0700
@@ -0,0 +1,380 @@
+/*
+ * Copyright(c) 2017 Intel Corporation. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * Based on work published here: https://github.com/IAIK/KAISER
+ * Modified by Dave Hansen <dave.hansen@intel.com> to actually work.
+ */
+#include <linux/kernel.h>
+#include <linux/errno.h>
+#include <linux/string.h>
+#include <linux/types.h>
+#include <linux/bug.h>
+#include <linux/init.h>
+#include <linux/spinlock.h>
+#include <linux/mm.h>
+#include <linux/uaccess.h>
+
+#include <asm/kaiser.h>
+#include <asm/pgtable.h>
+#include <asm/pgalloc.h>
+#include <asm/tlbflush.h>
+#include <asm/desc.h>
+
+/*
+ * At runtime, the only things we map are some things for CPU
+ * hotplug, and stacks for new processes.  No two CPUs will ever
+ * be populating the same addresses, so we only need to ensure
+ * that we protect between two CPUs trying to allocate and
+ * populate the same page table page.
+ *
+ * Only take this lock when doing a set_p[4um]d(), but it is not
+ * needed for doing a set_pte().  We assume that only the *owner*
+ * of a given allocation will be doing this for _their_
+ * allocation.
+ *
+ * This ensures that once a system has been running for a while
+ * and there have been stacks all over and these page tables
+ * are fully populated, there will be no further acquisitions of
+ * this lock.
+ */
+static DEFINE_SPINLOCK(shadow_table_allocation_lock);
+
+/*
+ * Returns -1 on error.
+ */
+static inline unsigned long get_pa_from_mapping(unsigned long vaddr)
+{
+	pgd_t *pgd;
+	p4d_t *p4d;
+	pud_t *pud;
+	pmd_t *pmd;
+	pte_t *pte;
+
+	pgd = pgd_offset_k(vaddr);
+	/*
+	 * We made all the kernel PGDs present in kaiser_init().
+	 * We expect them to stay that way.
+	 */
+	if (pgd_none(*pgd)) {
+		WARN_ON_ONCE(1);
+		return -1;
+	}
+	/*
+	 * PGDs are either 512GB or 128TB on all x86_64
+	 * configurations.  We don't handle these.
+	 */
+	if (pgd_large(*pgd)) {
+		WARN_ON_ONCE(1);
+		return -1;
+	}
+
+	p4d = p4d_offset(pgd, vaddr);
+	if (p4d_none(*p4d)) {
+		WARN_ON_ONCE(1);
+		return -1;
+	}
+
+	pud = pud_offset(p4d, vaddr);
+	if (pud_none(*pud)) {
+		WARN_ON_ONCE(1);
+		return -1;
+	}
+
+	if (pud_large(*pud))
+		return (pud_pfn(*pud) << PAGE_SHIFT) | (vaddr & ~PUD_PAGE_MASK);
+
+	pmd = pmd_offset(pud, vaddr);
+	if (pmd_none(*pmd)) {
+		WARN_ON_ONCE(1);
+		return -1;
+	}
+
+	if (pmd_large(*pmd))
+		return (pmd_pfn(*pmd) << PAGE_SHIFT) | (vaddr & ~PMD_PAGE_MASK);
+
+	pte = pte_offset_kernel(pmd, vaddr);
+	if (pte_none(*pte)) {
+		WARN_ON_ONCE(1);
+		return -1;
+	}
+
+	return (pte_pfn(*pte) << PAGE_SHIFT) | (vaddr & ~PAGE_MASK);
+}
+
+/*
+ * This is a relatively normal page table walk, except that it
+ * also tries to allocate page table pages along the way.
+ *
+ * Returns a pointer to a PTE on success, or NULL on failure.
+ */
+#define KAISER_WALK_ATOMIC  0x1
+static pte_t *kaiser_pagetable_walk(unsigned long address, unsigned long flags)
+{
+	pmd_t *pmd;
+	pud_t *pud;
+	p4d_t *p4d;
+	pgd_t *pgd = native_get_shadow_pgd(pgd_offset_k(address));
+	gfp_t gfp = (GFP_KERNEL | __GFP_NOTRACK | __GFP_ZERO);
+
+	if (flags & KAISER_WALK_ATOMIC) {
+		gfp &= ~GFP_KERNEL;
+		gfp |= __GFP_HIGH | __GFP_ATOMIC;
+	}
+
+	if (pgd_none(*pgd)) {
+		WARN_ONCE(1, "All shadow pgds should have been populated");
+		return NULL;
+	}
+	BUILD_BUG_ON(pgd_large(*pgd) != 0);
+
+	p4d = p4d_offset(pgd, address);
+	BUILD_BUG_ON(p4d_large(*p4d) != 0);
+	if (p4d_none(*p4d)) {
+		unsigned long new_pud_page = __get_free_page(gfp);
+		if (!new_pud_page)
+			return NULL;
+
+		spin_lock(&shadow_table_allocation_lock);
+		if (p4d_none(*p4d))
+			set_p4d(p4d, __p4d(_KERNPG_TABLE | __pa(new_pud_page)));
+		else
+			free_page(new_pud_page);
+		spin_unlock(&shadow_table_allocation_lock);
+	}
+
+	pud = pud_offset(p4d, address);
+	/* The shadow page tables do not use large mappings: */
+	if (pud_large(*pud)) {
+		WARN_ON(1);
+		return NULL;
+	}
+	if (pud_none(*pud)) {
+		unsigned long new_pmd_page = __get_free_page(gfp);
+		if (!new_pmd_page)
+			return NULL;
+
+		spin_lock(&shadow_table_allocation_lock);
+		if (pud_none(*pud))
+			set_pud(pud, __pud(_KERNPG_TABLE | __pa(new_pmd_page)));
+		else
+			free_page(new_pmd_page);
+		spin_unlock(&shadow_table_allocation_lock);
+	}
+
+	pmd = pmd_offset(pud, address);
+	/* The shadow page tables do not use large mappings: */
+	if (pmd_large(*pmd)) {
+		WARN_ON(1);
+		return NULL;
+	}
+	if (pmd_none(*pmd)) {
+		unsigned long new_pte_page = __get_free_page(gfp);
+		if (!new_pte_page)
+			return NULL;
+
+		spin_lock(&shadow_table_allocation_lock);
+		if (pmd_none(*pmd))
+			set_pmd(pmd, __pmd(_KERNPG_TABLE  | __pa(new_pte_page)));
+		else
+			free_page(new_pte_page);
+		spin_unlock(&shadow_table_allocation_lock);
+	}
+
+	return pte_offset_kernel(pmd, address);
+}
+
+/*
+ * Given a kernel address, @__start_addr, copy that mapping into
+ * the user (shadow) page tables.  This may need to allocate page
+ * table pages.
+ */
+int kaiser_add_user_map(const void *__start_addr, unsigned long size,
+			unsigned long flags)
+{
+	pte_t *pte;
+	unsigned long start_addr = (unsigned long)__start_addr;
+	unsigned long address = start_addr & PAGE_MASK;
+	unsigned long end_addr = PAGE_ALIGN(start_addr + size);
+	unsigned long target_address;
+
+	for (; address < end_addr; address += PAGE_SIZE) {
+		target_address = get_pa_from_mapping(address);
+		if (target_address == -1)
+			return -EIO;
+
+		pte = kaiser_pagetable_walk(address, false);
+		/*
+		 * Errors come from either -ENOMEM for a page
+		 * table page, or something screwy that did a
+		 * WARN_ON().  Just return -ENOMEM.
+		 */
+		if (!pte)
+			return -ENOMEM;
+		if (pte_none(*pte)) {
+			set_pte(pte, __pte(flags | target_address));
+		} else {
+			pte_t tmp;
+			set_pte(&tmp, __pte(flags | target_address));
+			WARN_ON_ONCE(!pte_same(*pte, tmp));
+		}
+	}
+	return 0;
+}
+
+/*
+ * The stack mapping is done from generic code, which can not pass
+ * the x86-specific __PAGE_KERNEL, so supply it here.
+ */
+int kaiser_map_stack(struct task_struct *tsk)
+{
+	return kaiser_add_mapping((unsigned long)tsk->stack, THREAD_SIZE,
+				  __PAGE_KERNEL);
+}
+
+int kaiser_add_user_map_ptrs(const void *__start_addr,
+			     const void *__end_addr,
+			     unsigned long flags)
+{
+	return kaiser_add_user_map(__start_addr,
+				   __end_addr - __start_addr,
+				   flags);
+}
+
+/*
+ * Ensure that the top level of the (shadow) page tables are
+ * entirely populated.  This ensures that all processes that get
+ * forked have the same entries.  This way, we do not have to
+ * ever go set up new entries in older processes.
+ *
+ * Note: we never free these, so there are no updates to them
+ * after this.
+ */
+static void __init kaiser_init_all_pgds(void)
+{
+	pgd_t *pgd;
+	int i = 0;
+
+	pgd = native_get_shadow_pgd(pgd_offset_k(0UL));
+	for (i = PTRS_PER_PGD / 2; i < PTRS_PER_PGD; i++) {
+		unsigned long addr = PAGE_OFFSET + i * PGDIR_SIZE;
+#if CONFIG_PGTABLE_LEVELS > 4
+		p4d_t *p4d = p4d_alloc_one(&init_mm, addr);
+		if (!p4d) {
+			WARN_ON(1);
+			break;
+		}
+		set_pgd(pgd + i, __pgd(_KERNPG_TABLE | __pa(p4d)));
+#else /* CONFIG_PGTABLE_LEVELS <= 4 */
+		pud_t *pud = pud_alloc_one(&init_mm, addr);
+		if (!pud) {
+			WARN_ON(1);
+			break;
+		}
+		set_pgd(pgd + i, __pgd(_KERNPG_TABLE | __pa(pud)));
+#endif /* CONFIG_PGTABLE_LEVELS */
+	}
+}
+
+/*
+ * The page table allocations in here can theoretically fail, but
+ * we can not do much about it in early boot.  Do the checking
+ * and warning in a macro to make it more readable.
+ */
+#define kaiser_add_user_map_early(start, size, flags) do {	\
+	int __ret = kaiser_add_user_map(start, size, flags);	\
+	WARN_ON(__ret);						\
+} while (0)
+
+#define kaiser_add_user_map_ptrs_early(start, end, flags) do {		\
+	int __ret = kaiser_add_user_map_ptrs(start, end, flags);	\
+	WARN_ON(__ret);							\
+} while (0)
+
+extern char __per_cpu_user_mapped_start[], __per_cpu_user_mapped_end[];
+/*
+ * If anything in here fails, we will likely die on one of the
+ * first kernel->user transitions and init will die.  But, we
+ * will have most of the kernel up by then and should be able to
+ * get a clean warning out of it.  If we BUG_ON() here, we run
+ * the risk of it happening before we have good console output.
+ */
+void __init kaiser_init(void)
+{
+	int cpu;
+
+	kaiser_init_all_pgds();
+
+	for_each_possible_cpu(cpu) {
+		void *percpu_vaddr = __per_cpu_user_mapped_start +
+				     per_cpu_offset(cpu);
+		unsigned long percpu_sz = __per_cpu_user_mapped_end -
+					  __per_cpu_user_mapped_start;
+		kaiser_add_user_map_early(percpu_vaddr, percpu_sz,
+					  __PAGE_KERNEL);
+	}
+
+	kaiser_add_user_map_ptrs_early(__entry_text_start, __entry_text_end,
+				       __PAGE_KERNEL_RX);
+
+	/* the fixed map address of the idt_table */
+	kaiser_add_user_map_early((void *)idt_descr.address,
+				  sizeof(gate_desc) * NR_VECTORS,
+				  __PAGE_KERNEL_RO);
+}
+
+int kaiser_add_mapping(unsigned long addr, unsigned long size,
+		       unsigned long flags)
+{
+	return kaiser_add_user_map((const void *)addr, size, flags);
+}
+
+void kaiser_remove_mapping(unsigned long start, unsigned long size)
+{
+	unsigned long addr;
+
+	/* The shadow page tables always use small pages: */
+	for (addr = start; addr < start + size; addr += PAGE_SIZE) {
+		/*
+		 * Do an "atomic" walk in case this got called from an atomic
+		 * context.  This should not do any allocations because we
+		 * should only be walking things that are known to be mapped.
+		 */
+		pte_t *pte = kaiser_pagetable_walk(addr, KAISER_WALK_ATOMIC);
+
+		/*
+		 * We are removing a mapping that should
+		 * exist.  WARN if it was not there:
+		 */
+		if (!pte) {
+			WARN_ON_ONCE(1);
+			continue;
+		}
+
+		pte_clear(&init_mm, addr, pte);
+	}
+	/*
+	 * This ensures that the TLB entries used to map this data are
+	 * no longer usable on *this* CPU.  We theoretically want to
+	 * flush the entries on all CPUs here, but that's too
+	 * expensive right now: this is called to unmap process
+	 * stacks in the exit() path.
+	 *
+	 * This can change if we get to the point where this is not
+	 * in a remotely hot path, like only called via write_ldt().
+	 *
+	 * Note: we could probably also just invalidate the individual
+	 * addresses to take care of *this* PCID and then do a
+	 * tlb_flush_shared_nonglobals() to ensure that all other
+	 * PCIDs get flushed before being used again.
+	 */
+	__native_flush_tlb_global();
+}
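
A minimal usage sketch of the interface this file exports (not part
of the patch; the helper names are illustrative), mirroring what later
patches in this series do for the LDT and the perf BTS/PEBS buffers:
map an object into the user/shadow tables when it is created, unmap
it when it is freed.

	static void *example_alloc_user_mapped(unsigned long size)
	{
		unsigned long addr = __get_free_pages(GFP_KERNEL | __GFP_ZERO,
						      get_order(size));

		if (!addr)
			return NULL;
		/* make the object visible in the user/shadow page tables */
		if (kaiser_add_mapping(addr, size, __PAGE_KERNEL)) {
			free_pages(addr, get_order(size));
			return NULL;
		}
		return (void *)addr;
	}

	static void example_free_user_mapped(void *buf, unsigned long size)
	{
		kaiser_remove_mapping((unsigned long)buf, size);
		free_pages((unsigned long)buf, get_order(size));
	}
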
diff -puN arch/x86/mm/Makefile~kaiser-base arch/x86/mm/Makefile
--- a/arch/x86/mm/Makefile~kaiser-base	2017-10-31 15:03:51.828183236 -0700
+++ b/arch/x86/mm/Makefile	2017-10-31 15:03:51.845184039 -0700
@@ -45,6 +45,7 @@ obj-$(CONFIG_NUMA_EMU)		+= numa_emulatio
 obj-$(CONFIG_X86_INTEL_MPX)	+= mpx.o
 obj-$(CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS) += pkeys.o
 obj-$(CONFIG_RANDOMIZE_MEMORY) += kaslr.o
+obj-$(CONFIG_KAISER)		+= kaiser.o
 
 obj-$(CONFIG_AMD_MEM_ENCRYPT)	+= mem_encrypt.o
 obj-$(CONFIG_AMD_MEM_ENCRYPT)	+= mem_encrypt_boot.o
diff -puN arch/x86/mm/pageattr.c~kaiser-base arch/x86/mm/pageattr.c
--- a/arch/x86/mm/pageattr.c~kaiser-base	2017-10-31 15:03:51.830183330 -0700
+++ b/arch/x86/mm/pageattr.c	2017-10-31 15:03:51.847184134 -0700
@@ -859,7 +859,7 @@ static void unmap_pmd_range(pud_t *pud,
 			pud_clear(pud);
 }
 
-static void unmap_pud_range(p4d_t *p4d, unsigned long start, unsigned long end)
+void unmap_pud_range(p4d_t *p4d, unsigned long start, unsigned long end)
 {
 	pud_t *pud = pud_offset(p4d, start);
 
diff -puN arch/x86/mm/pgtable.c~kaiser-base arch/x86/mm/pgtable.c
--- a/arch/x86/mm/pgtable.c~kaiser-base	2017-10-31 15:03:51.833183472 -0700
+++ b/arch/x86/mm/pgtable.c	2017-10-31 15:03:51.847184134 -0700
@@ -354,14 +354,26 @@ static inline void _pgd_free(pgd_t *pgd)
 		kmem_cache_free(pgd_cache, pgd);
 }
 #else
+
+#ifdef CONFIG_KAISER
+/*
+ * Instead of one pgd, we acquire two pgds.  Being order-1, it is
+ * both 8k in size and 8k-aligned.  That lets us just flip bit 12
+ * in a pointer to swap between the two 4k halves.
+ */
+#define PGD_ALLOCATION_ORDER 1
+#else
+#define PGD_ALLOCATION_ORDER 0
+#endif
+
 static inline pgd_t *_pgd_alloc(void)
 {
-	return (pgd_t *)__get_free_page(PGALLOC_GFP);
+	return (pgd_t *)__get_free_pages(PGALLOC_GFP, PGD_ALLOCATION_ORDER);
 }
 
 static inline void _pgd_free(pgd_t *pgd)
 {
-	free_page((unsigned long)pgd);
+	free_pages((unsigned long)pgd, PGD_ALLOCATION_ORDER);
 }
 #endif /* CONFIG_X86_PAE */
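
A minimal sketch (not in the patch) of what the order-1, 8k-aligned
allocation above buys us: the kernel half and the shadow half of a PGD
differ only in bit 12 of their addresses, so hopping from a pointer
into one half to the same slot in the other is a single XOR with
PAGE_SIZE.  The accessors the series actually uses live in
pgtable_64.h; the name below is illustrative.

	static inline pgd_t *other_pgd_half(pgd_t *pgdp)
	{
		/* toggle bit 12: kernel half <-> shadow half */
		return (pgd_t *)((unsigned long)pgdp ^ PAGE_SIZE);
	}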
 
diff -puN /dev/null Documentation/x86/kaiser.txt
--- /dev/null	2017-05-17 09:46:39.241182829 -0700
+++ b/Documentation/x86/kaiser.txt	2017-10-31 15:03:51.848184181 -0700
@@ -0,0 +1,128 @@
+KAISER is a countermeasure against attacks on kernel address
+information.  There are at least three existing, published,
+approaches using the shared user/kernel mapping and hardware features
+to defeat KASLR.  One approach referenced in the paper locates the
+kernel by observing differences in page fault timing between
+present-but-inaccessible kernel pages and non-present pages.
+
+When we enter the kernel via syscalls, interrupts or exceptions,
+page tables are switched to the full "kernel" copy.  When the
+system switches back to user mode, the user/shadow copy is used.
+
+The minimalistic kernel portion of the user page tables tries to
+map only what is needed to enter/exit the kernel such as the
+entry/exit functions themselves and the interrupt descriptor
+table (IDT).
+
+This helps ensure that side-channel attacks that leverage the
+paging structures do not function when KAISER is enabled.  It
+can be enabled by setting CONFIG_KAISER=y.
+
+Protection against side-channel attacks is important.  But,
+this protection comes at a cost:
+
+1. Increased Memory Use
+  a. Each process now needs an order-1 PGD instead of order-0.
+     (Consumes 4k per process).
+  b. The pre-allocated second-level (p4d or pud) kernel page
+     table pages cost ~1MB of additional memory at boot.  This
+     is not totally wasted because some of these pages would
+     have been needed eventually for normal kernel page tables
+     and things in the vmalloc() area like vmemmap[].
+  c. Statically-allocated structures and entry/exit text must
+     be padded out to 4k (or 8k for PGDs) so they can be mapped
+     into the user page tables.  This bloats the kernel image
+     by ~20-30k.
+  d. The shadow page tables eventually grow to map all of used
+     vmalloc() space.  They can have roughly the same memory
+     consumption as the vmalloc() page tables.
+
+2. Runtime Cost
+  a. CR3 manipulation to switch between the page table copies
+     must be done at interrupt, syscall, and exception entry
+     and exit (it can be skipped when the kernel is interrupted,
+     though.)  Moves to CR3 are on the order of a hundred
+     cycles, and we need one at entry and another at exit.
+  b. Task stacks must be mapped/unmapped.  We need to walk
+     and modify the shadow page tables at fork() and exit().
+  c. Global pages are disabled.  This feature of the MMU
+     allows different processes to share TLB entries mapping
+     the kernel.  Losing the feature means potentially more
+     TLB misses after a context switch.
+  d. Process Context IDentifiers (PCID) is a CPU feature that
+     allows us to skip flushing the entire TLB when we switch
+     the page tables.  This makes switching the page tables
+     (at context switch, or kernel entry/exit) cheaper.  But,
+     on systems with PCID support, the context switch code
+     must flush both the user and kernel entries out of the
+     TLB, with an INVPCID in addition to the CR3 write.  This
+     INVPCID is generally slower than a CR3 write, but still
+     on the order of a hundred cycles.
+  e. The shadow page tables must be populated for each new
+     process.  Even without KAISER, since we share all of the
+     kernel mappings in all processes, we can do all this
+     population for kernel addresses at the top level of the
+     page tables (the PGD level).  But, with KAISER, we now
+     have *two* kernel mappings: one in the kernel page tables
+     that maps everything and one in the user/shadow page
+     tables mapping the "minimal" kernel.  At fork(), we
+     copy the portion of the shadow PGD that maps the minimal
+     kernel structures in addition to the normal kernel one.
+  f. In addition to the fork()-time copying, we must also
+     update the shadow PGD any time a set_pgd() is done on a
+     PGD used to map userspace.  This ensures that the kernel
+     and user/shadow copies always map the same userspace
+     memory.
+  g. On systems without PCID support, each CR3 write flushes
+     the entire TLB.  That means that each syscall, interrupt
+     or exception flushes the TLB.
+
+Possible Future Work:
+1. We can be more careful about not actually writing to CR3
+   unless we actually switch it.
+2. Try to have dedicated entry/exit kernel stacks so we do
+   not have to map/unmap the task/thread stacks.
+3. Compress the user/shadow-mapped data to be mapped together
+   underneath a single PGD entry.
+4. Re-enable global pages, but use them for mappings in the
+   user/shadow page tables.  This would allow the kernel to
+   take advantage of TLB entries that were established from
+   the user page tables.  This might speed up the entry/exit
+   code or userspace since it will not have to reload all of
+   its TLB entries.  However, its upside is limited by PCID
+   being used.
+5. Allow KAISER to be enabled/disabled at runtime so folks can
+   run a single kernel image.
+
+Debugging:
+
+Bugs in KAISER cause a few different signatures of crashes
+that are worth noting here.
+
+ * Crashes in early boot, especially around CPU bringup.  Bugs
+   in the trampoline code or mappings cause these.
+ * Crashes at the first interrupt.  Caused by bugs in entry_64.S,
+   like screwing up a page table switch.  Also caused by
+   incorrectly mapping the IRQ handler entry code.
+ * Crashes at the first NMI.  The NMI code is separate from main
+   interrupt handlers and can have bugs that do not affect
+   normal interrupts.  Also caused by incorrectly mapping NMI
+   code.  NMIs that interrupt the entry code must be very
+   careful and can be the cause of crashes that show up when
+   running perf.
+ * Kernel crashes at the first exit to userspace.  entry_64.S
+   bugs, or failing to map some of the exit code.
+ * Crashes at first interrupt that interrupts userspace. The paths
+   in entry_64.S that return to userspace are sometimes separate
+   from the ones that return to the kernel.
+ * Double faults: overflowing the kernel stack because of page
+   faults upon page faults.  Caused by touching non-kaiser-mapped
+   data in the entry code, or forgetting to switch to kernel
+   CR3 before calling into C functions which are not kaiser-mapped.
+ * Failures of the selftests/x86 code.  Usually a bug in one of the
+   more obscure corners of entry_64.S.
+ * Userspace segfaults early in boot, sometimes manifesting
+   as mount(8) failing to mount the rootfs.  These have
+   tended to be TLB invalidation issues.  Usually invalidating
+   the wrong PCID, or otherwise missing an invalidation.
+
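As a rough sketch of the extra fork()-time work described in item 2e
above (not code from this series; it assumes the shadow PGD sits one
page after the kernel PGD and that kernel mappings occupy the top half
of each copy):

	static void copy_shadow_kernel_entries(pgd_t *child_pgd,
					       pgd_t *parent_pgd)
	{
		/* the shadow PGD is the second 4k page of the order-1 pair */
		pgd_t *child_shadow  = child_pgd  + PTRS_PER_PGD;
		pgd_t *parent_shadow = parent_pgd + PTRS_PER_PGD;
		int i;

		/* the top half maps the minimal kernel in the user/shadow copy */
		for (i = PTRS_PER_PGD / 2; i < PTRS_PER_PGD; i++)
			child_shadow[i] = parent_shadow[i];
	}
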
diff -puN /dev/null include/linux/kaiser.h
--- /dev/null	2017-05-17 09:46:39.241182829 -0700
+++ b/include/linux/kaiser.h	2017-10-31 15:03:51.848184181 -0700
@@ -0,0 +1,34 @@
+#ifndef _INCLUDE_KAISER_H
+#define _INCLUDE_KAISER_H
+
+#ifdef CONFIG_KAISER
+#include <asm/kaiser.h>
+#else
+
+/*
+ * These stubs are used whenever CONFIG_KAISER is off, which
+ * includes architectures that support KAISER, but have it
+ * disabled.
+ */
+
+static inline int kaiser_map_stack(struct task_struct *tsk)
+{
+	return 0;
+}
+
+static inline void kaiser_init(void)
+{
+}
+
+static inline void kaiser_remove_mapping(unsigned long start, unsigned long size)
+{
+}
+
+static inline int kaiser_add_mapping(unsigned long addr, unsigned long size,
+				     unsigned long flags)
+{
+	return 0;
+}
+
+#endif /* !CONFIG_KAISER */
+#endif /* _INCLUDE_KAISER_H */
diff -puN init/main.c~kaiser-base init/main.c
--- a/init/main.c~kaiser-base	2017-10-31 15:03:51.836183614 -0700
+++ b/init/main.c	2017-10-31 15:03:51.848184181 -0700
@@ -75,6 +75,7 @@
 #include <linux/slab.h>
 #include <linux/perf_event.h>
 #include <linux/ptrace.h>
+#include <linux/kaiser.h>
 #include <linux/blkdev.h>
 #include <linux/elevator.h>
 #include <linux/sched_clock.h>
@@ -504,6 +505,7 @@ static void __init mm_init(void)
 	pgtable_init();
 	vmalloc_init();
 	ioremap_huge_init();
+	kaiser_init();
 }
 
 asmlinkage __visible void __init start_kernel(void)
diff -puN kernel/fork.c~kaiser-base kernel/fork.c
--- a/kernel/fork.c~kaiser-base	2017-10-31 15:03:51.838183708 -0700
+++ b/kernel/fork.c	2017-10-31 15:03:51.849184228 -0700
@@ -70,6 +70,7 @@
 #include <linux/tsacct_kern.h>
 #include <linux/cn_proc.h>
 #include <linux/freezer.h>
+#include <linux/kaiser.h>
 #include <linux/delayacct.h>
 #include <linux/taskstats_kern.h>
 #include <linux/random.h>
@@ -247,6 +248,8 @@ static unsigned long *alloc_thread_stack
 
 static inline void free_thread_stack(struct task_struct *tsk)
 {
+	kaiser_remove_mapping((unsigned long)tsk->stack, THREAD_SIZE);
+
 #ifdef CONFIG_VMAP_STACK
 	if (task_stack_vm_area(tsk)) {
 		int i;
@@ -536,6 +539,9 @@ static struct task_struct *dup_task_stru
 	 * functions again.
 	 */
 	tsk->stack = stack;
+	err = kaiser_map_stack(tsk);
+	if (err)
+		goto free_stack;
 #ifdef CONFIG_VMAP_STACK
 	tsk->stack_vm_area = stack_vm_area;
 #endif
_

^ permalink raw reply	[flat|nested] 102+ messages in thread

* [PATCH 08/23] x86, kaiser: only populate shadow page tables for userspace
  2017-10-31 22:31 [PATCH 00/23] KAISER: unmap most of the kernel from userspace page tables Dave Hansen
                   ` (6 preceding siblings ...)
  2017-10-31 22:31 ` [PATCH 07/23] x86, kaiser: unmap kernel from userspace page tables (core patch) Dave Hansen
@ 2017-10-31 22:32 ` Dave Hansen
  2017-10-31 23:35   ` Kees Cook
  2017-10-31 22:32 ` [PATCH 09/23] x86, kaiser: allow NX to be set in p4d/pgd Dave Hansen
                   ` (18 subsequent siblings)
  26 siblings, 1 reply; 102+ messages in thread
From: Dave Hansen @ 2017-10-31 22:32 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, dave.hansen, moritz.lipp, daniel.gruss,
	michael.schwarz, luto, torvalds, keescook, hughd, x86


KAISER has two copies of the page tables: one for the kernel and
one for when we are running in userspace.  There is also a kernel
portion of each of the page tables: the part that *maps* the
kernel.

The kernel portion is relatively static and uses pre-populated
PGDs.  Nobody ever calls set_pgd() on the kernel portion during
normal operation.

The userspace portion of the page tables is updated frequently as
userspace pages are mapped and we demand-allocate page table
pages.  These updates of the userspace *portion* of the tables
need to be reflected into both the kernel and user/shadow copies.

The original KAISER patches did this by effectively looking at
the address that we are updating *for*.  If it is <PAGE_OFFSET,
we are doing an update for the userspace portion of the page
tables and must make an entry in the shadow.  We also make the
kernel copy of this new entry unusable for userspace.

However, this has a wrinkle: we have a few places where we use
low addresses in supervisor (kernel) mode.  When we make EFI
calls, they use traditionally user addresses in supervisor mode
and trip over these checks.  The trampoline code that we use for
booting secondary CPUs has a similar issue.

Remember, we need to do two things for a userspace PGD: populate
the shadow and sabotage the kernel PGD so it can not be used in
userspace.  This patch fixes the wrinkle by only doing those two
things when we are dealing with a user address *and* the PGD has
_PAGE_USER set.  That way, we do not accidentally sabotage the
in-kernel users of low addresses that are typically used only for
userspace.
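
The rule boils down to a single predicate; the sketch below just
restates it (the patch implements the real logic inside
kaiser_set_shadow_pgd(), using the pgdp_maps_userspace() helper it
introduces):

	/* populate the shadow and NX-poison the kernel copy only when: */
	static inline bool needs_shadow_treatment(pgd_t *pgdp, pgd_t pgd)
	{
		return pgdp_maps_userspace(pgdp) && (pgd.pgd & _PAGE_USER);
	}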

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Moritz Lipp <moritz.lipp@iaik.tugraz.at>
Cc: Daniel Gruss <daniel.gruss@iaik.tugraz.at>
Cc: Michael Schwarz <michael.schwarz@iaik.tugraz.at>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Kees Cook <keescook@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: x86@kernel.org
---

 b/arch/x86/include/asm/pgtable_64.h |   94 +++++++++++++++++++++++-------------
 1 file changed, 61 insertions(+), 33 deletions(-)

diff -puN arch/x86/include/asm/pgtable_64.h~kaiser-set-pgd-careful-plus-NX arch/x86/include/asm/pgtable_64.h
--- a/arch/x86/include/asm/pgtable_64.h~kaiser-set-pgd-careful-plus-NX	2017-10-31 15:03:52.732225966 -0700
+++ b/arch/x86/include/asm/pgtable_64.h	2017-10-31 15:03:52.736226155 -0700
@@ -177,38 +177,76 @@ static inline p4d_t *native_get_normal_p
 /*
  * Page table pages are page-aligned.  The lower half of the top
  * level is used for userspace and the top half for the kernel.
- * This returns true for user pages that need to get copied into
- * both the user and kernel copies of the page tables, and false
- * for kernel pages that should only be in the kernel copy.
+ *
+ * Returns true for parts of the PGD that map userspace and
+ * false for the parts that map the kernel.
  */
-static inline bool is_userspace_pgd(void *__ptr)
+static inline bool pgdp_maps_userspace(void *__ptr)
 {
 	unsigned long ptr = (unsigned long)__ptr;
 
 	return ((ptr % PAGE_SIZE) < (PAGE_SIZE / 2));
 }
 
+/*
+ * Does this PGD allow access via userspace?
+ */
+static inline bool pgd_userspace_access(pgd_t pgd)
+{
+	return (pgd.pgd & _PAGE_USER);
+}
+
+/*
+ * Returns the pgd_t that the kernel should use in its page tables.
+ */
+static inline pgd_t kaiser_set_shadow_pgd(pgd_t *pgdp, pgd_t pgd)
+{
+#ifdef CONFIG_KAISER
+	if (pgd_userspace_access(pgd)) {
+		if (pgdp_maps_userspace(pgdp)) {
+			/*
+			 * The user/shadow page tables get the full
+			 * PGD, accessible to userspace:
+			 */
+			native_get_shadow_pgd(pgdp)->pgd = pgd.pgd;
+			/*
+			 * For the copy of the pgd that the kernel
+			 * uses, make it unusable to userspace.  This
+			 * ensures if we get out to userspace with the
+			 * wrong CR3 value, userspace will crash
+			 * instead of running.
+			 */
+			pgd.pgd |= _PAGE_NX;
+		}
+	} else if (!pgd.pgd) {
+		/*
+		 * We are clearing the PGD and can not check  _PAGE_USER
+		 * in the zero'd PGD.  We never do this on the
+		 * pre-populated kernel PGDs, except for pgd_bad().
+		 */
+		if (pgdp_maps_userspace(pgdp)) {
+			native_get_shadow_pgd(pgdp)->pgd = pgd.pgd;
+		} else {
+			/*
+			 * Uh, we are very confused.  We have been
+			 * asked to clear a PGD that is in the kernel
+			 * part of the address space.  We preallocated
+			 * all the KAISER PGDs, so this should never
+			 * happen.
+			 */
+			WARN_ON_ONCE(1);
+		}
+	}
+#endif
+	/* return the copy of the PGD we want the kernel to use: */
+	return pgd;
+}
+
+
 static inline void native_set_p4d(p4d_t *p4dp, p4d_t p4d)
 {
 #if defined(CONFIG_KAISER) && !defined(CONFIG_X86_5LEVEL)
-	/*
-	 * set_pgd() does not get called when we are running
-	 * CONFIG_X86_5LEVEL=y.  So, just hack around it.  We
-	 * know here that we have a p4d but that it is really at
-	 * the top level of the page tables; it is really just a
-	 * pgd.
-	 */
-	/* Do we need to also populate the shadow p4d? */
-	if (is_userspace_pgd(p4dp))
-		native_get_shadow_p4d(p4dp)->pgd = p4d.pgd;
-	/*
-	 * Even if the entry is *mapping* userspace, ensure
-	 * that userspace can not use it.  This way, if we
-	 * get out to userspace with the wrong CR3 value,
-	 * userspace will crash instead of running.
-	 */
-	if (!p4d.pgd.pgd)
-		p4dp->pgd.pgd = p4d.pgd.pgd | _PAGE_NX;
+	p4dp->pgd = kaiser_set_shadow_pgd(&p4dp->pgd, p4d.pgd);
 #else /* CONFIG_KAISER */
 	*p4dp = p4d;
 #endif
@@ -226,17 +264,7 @@ static inline void native_p4d_clear(p4d_
 static inline void native_set_pgd(pgd_t *pgdp, pgd_t pgd)
 {
 #ifdef CONFIG_KAISER
-	/* Do we need to also populate the shadow pgd? */
-	if (is_userspace_pgd(pgdp))
-		native_get_shadow_pgd(pgdp)->pgd = pgd.pgd;
-	/*
-	 * Even if the entry is mapping userspace, ensure
-	 * that it is unusable for userspace.  This way,
-	 * if we get out to userspace with the wrong CR3
-	 * value, userspace will crash instead of running.
-	 */
-	if (!pgd_none(pgd))
-		pgdp->pgd = pgd.pgd | _PAGE_NX;
+	*pgdp = kaiser_set_shadow_pgd(pgdp, pgd);
 #else /* CONFIG_KAISER */
 	*pgdp = pgd;
 #endif
_

^ permalink raw reply	[flat|nested] 102+ messages in thread

* [PATCH 09/23] x86, kaiser: allow NX to be set in p4d/pgd
  2017-10-31 22:31 [PATCH 00/23] KAISER: unmap most of the kernel from userspace page tables Dave Hansen
                   ` (7 preceding siblings ...)
  2017-10-31 22:32 ` [PATCH 08/23] x86, kaiser: only populate shadow page tables for userspace Dave Hansen
@ 2017-10-31 22:32 ` Dave Hansen
  2017-10-31 22:32 ` [PATCH 10/23] x86, kaiser: make sure static PGDs are 8k in size Dave Hansen
                   ` (17 subsequent siblings)
  26 siblings, 0 replies; 102+ messages in thread
From: Dave Hansen @ 2017-10-31 22:32 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, dave.hansen, moritz.lipp, daniel.gruss,
	michael.schwarz, luto, torvalds, keescook, hughd, x86


We protect the user portion of the kernel page tables with the NX
bit to cripple it.  But, that trips the p4d/pgd_bad() checks.
Make sure the NX bit does not trip them.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Moritz Lipp <moritz.lipp@iaik.tugraz.at>
Cc: Daniel Gruss <daniel.gruss@iaik.tugraz.at>
Cc: Michael Schwarz <michael.schwarz@iaik.tugraz.at>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Kees Cook <keescook@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: x86@kernel.org
---

 b/arch/x86/include/asm/pgtable.h |   14 ++++++++++++--
 1 file changed, 12 insertions(+), 2 deletions(-)

diff -puN arch/x86/include/asm/pgtable.h~kaiser-p4d-allow-nx arch/x86/include/asm/pgtable.h
--- a/arch/x86/include/asm/pgtable.h~kaiser-p4d-allow-nx	2017-10-31 15:03:53.299252767 -0700
+++ b/arch/x86/include/asm/pgtable.h	2017-10-31 15:03:53.304253004 -0700
@@ -845,7 +845,12 @@ static inline pud_t *pud_offset(p4d_t *p
 
 static inline int p4d_bad(p4d_t p4d)
 {
-	return (p4d_flags(p4d) & ~(_KERNPG_TABLE | _PAGE_USER)) != 0;
+	unsigned long ignore_flags = _KERNPG_TABLE | _PAGE_USER;
+
+	if (IS_ENABLED(CONFIG_KAISER))
+		ignore_flags |= _PAGE_NX;
+
+	return (p4d_flags(p4d) & ~ignore_flags) != 0;
 }
 #endif  /* CONFIG_PGTABLE_LEVELS > 3 */
 
@@ -879,7 +884,12 @@ static inline p4d_t *p4d_offset(pgd_t *p
 
 static inline int pgd_bad(pgd_t pgd)
 {
-	return (pgd_flags(pgd) & ~_PAGE_USER) != _KERNPG_TABLE;
+	unsigned long ignore_flags = _PAGE_USER;
+
+	if (IS_ENABLED(CONFIG_KAISER))
+		ignore_flags |= _PAGE_NX;
+
+	return (pgd_flags(pgd) & ~ignore_flags) != _KERNPG_TABLE;
 }
 
 static inline int pgd_none(pgd_t pgd)
_

^ permalink raw reply	[flat|nested] 102+ messages in thread

* [PATCH 10/23] x86, kaiser: make sure static PGDs are 8k in size
  2017-10-31 22:31 [PATCH 00/23] KAISER: unmap most of the kernel from userspace page tables Dave Hansen
                   ` (8 preceding siblings ...)
  2017-10-31 22:32 ` [PATCH 09/23] x86, kaiser: allow NX to be set in p4d/pgd Dave Hansen
@ 2017-10-31 22:32 ` Dave Hansen
  2017-10-31 22:32 ` [PATCH 11/23] x86, kaiser: map GDT into user page tables Dave Hansen
                   ` (16 subsequent siblings)
  26 siblings, 0 replies; 102+ messages in thread
From: Dave Hansen @ 2017-10-31 22:32 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, dave.hansen, moritz.lipp, daniel.gruss,
	michael.schwarz, luto, torvalds, keescook, hughd, x86


We have a few PGDs that come out of the kernel binary instead of being
allocated dynamically.  Before this patch, they are all 8k-aligned,
but we also need them to be 8k in *size*.

The original KAISER patch did not do this.  It probably just lucked out
that it did not trample over data after the last PGD.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Moritz Lipp <moritz.lipp@iaik.tugraz.at>
Cc: Daniel Gruss <daniel.gruss@iaik.tugraz.at>
Cc: Michael Schwarz <michael.schwarz@iaik.tugraz.at>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Kees Cook <keescook@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: x86@kernel.org
---

 b/arch/x86/kernel/head_64.S |   16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff -puN arch/x86/kernel/head_64.S~kaiser-head_S-pgds-need-8k-too arch/x86/kernel/head_64.S
--- a/arch/x86/kernel/head_64.S~kaiser-head_S-pgds-need-8k-too	2017-10-31 15:03:53.866279568 -0700
+++ b/arch/x86/kernel/head_64.S	2017-10-31 15:03:53.870279757 -0700
@@ -340,11 +340,24 @@ GLOBAL(early_recursion_flag)
 GLOBAL(name)
 
 #ifdef CONFIG_KAISER
+/*
+ * Each PGD needs to be 8k long and 8k aligned.  We do not
+ * ever go out to userspace with these, so we do not
+ * strictly *need* the second page, but this allows us to
+ * have a single set_pgd() implementation that does not
+ * need to worry about whether it has 4k or 8k to work
+ * with.
+ *
+ * This ensures PGDs are 8k long:
+ */
+#define KAISER_USER_PGD_FILL	512
+/* This ensures they are 8k-aligned: */
 #define NEXT_PGD_PAGE(name) \
 	.balign 2 * PAGE_SIZE; \
 GLOBAL(name)
 #else
 #define NEXT_PGD_PAGE(name) NEXT_PAGE(name)
+#define KAISER_USER_PGD_FILL	0
 #endif
 
 /* Automate the creation of 1 to 1 mapping pmd entries */
@@ -363,6 +376,7 @@ NEXT_PGD_PAGE(early_top_pgt)
 #else
 	.quad	level3_kernel_pgt - __START_KERNEL_map + _PAGE_TABLE_NOENC
 #endif
+	.fill	KAISER_USER_PGD_FILL,8,0
 
 NEXT_PAGE(early_dynamic_pgts)
 	.fill	512*EARLY_DYNAMIC_PAGE_TABLES,8,0
@@ -372,6 +386,7 @@ NEXT_PAGE(early_dynamic_pgts)
 #ifndef CONFIG_XEN
 NEXT_PGD_PAGE(init_top_pgt)
 	.fill	512,8,0
+	.fill	KAISER_USER_PGD_FILL,8,0
 #else
 NEXT_PGD_PAGE(init_top_pgt)
 	.quad   level3_ident_pgt - __START_KERNEL_map + _KERNPG_TABLE_NOENC
@@ -380,6 +395,7 @@ NEXT_PGD_PAGE(init_top_pgt)
 	.org    init_top_pgt + PGD_START_KERNEL*8, 0
 	/* (2^48-(2*1024*1024*1024))/(2^39) = 511 */
 	.quad   level3_kernel_pgt - __START_KERNEL_map + _PAGE_TABLE_NOENC
+	.fill	KAISER_USER_PGD_FILL,8,0
 
 NEXT_PAGE(level3_ident_pgt)
 	.quad	level2_ident_pgt - __START_KERNEL_map + _KERNPG_TABLE_NOENC
_

^ permalink raw reply	[flat|nested] 102+ messages in thread

* [PATCH 11/23] x86, kaiser: map GDT into user page tables
  2017-10-31 22:31 [PATCH 00/23] KAISER: unmap most of the kernel from userspace page tables Dave Hansen
                   ` (9 preceding siblings ...)
  2017-10-31 22:32 ` [PATCH 10/23] x86, kaiser: make sure static PGDs are 8k in size Dave Hansen
@ 2017-10-31 22:32 ` Dave Hansen
  2017-10-31 22:32 ` [PATCH 12/23] x86, kaiser: map dynamically-allocated LDTs Dave Hansen
                   ` (15 subsequent siblings)
  26 siblings, 0 replies; 102+ messages in thread
From: Dave Hansen @ 2017-10-31 22:32 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, dave.hansen, moritz.lipp, daniel.gruss,
	michael.schwarz, luto, torvalds, keescook, hughd, x86


The GDT is used to control the x86 segmentation mechanism.  It
must be virtually mapped when switching segments or at IRET
time when switching between userspace and kernel.

The original KAISER patch did not do this.  I have no idea how
it ever worked.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Moritz Lipp <moritz.lipp@iaik.tugraz.at>
Cc: Daniel Gruss <daniel.gruss@iaik.tugraz.at>
Cc: Michael Schwarz <michael.schwarz@iaik.tugraz.at>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Kees Cook <keescook@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: x86@kernel.org
---

 b/arch/x86/kernel/cpu/common.c |   15 +++++++++++++++
 b/arch/x86/mm/kaiser.c         |   10 ++++++++++
 2 files changed, 25 insertions(+)

diff -puN arch/x86/kernel/cpu/common.c~kaiser-user-map-gdt-pages arch/x86/kernel/cpu/common.c
--- a/arch/x86/kernel/cpu/common.c~kaiser-user-map-gdt-pages	2017-10-31 15:03:54.432306321 -0700
+++ b/arch/x86/kernel/cpu/common.c	2017-10-31 15:03:54.441306747 -0700
@@ -5,6 +5,7 @@
 #include <linux/export.h>
 #include <linux/percpu.h>
 #include <linux/string.h>
+#include <linux/kaiser.h>
 #include <linux/ctype.h>
 #include <linux/delay.h>
 #include <linux/sched/mm.h>
@@ -487,6 +488,20 @@ static inline void setup_fixmap_gdt(int
 #endif
 
 	__set_fixmap(get_cpu_gdt_ro_index(cpu), get_cpu_gdt_paddr(cpu), prot);
+
+	/* CPU 0's mapping is done in kaiser_init() */
+	if (cpu) {
+		int ret;
+
+		ret = kaiser_add_mapping((unsigned long) get_cpu_gdt_ro(cpu),
+					 PAGE_SIZE, __PAGE_KERNEL_RO);
+		/*
+		 * We do not have a good way to fail CPU bringup.
+		 * Just WARN about it and hope we boot far enough
+		 * to get a good log out.
+		 */
+		WARN_ON(ret);
+	}
 }
 
 /* Load the original GDT from the per-cpu structure */
diff -puN arch/x86/mm/kaiser.c~kaiser-user-map-gdt-pages arch/x86/mm/kaiser.c
--- a/arch/x86/mm/kaiser.c~kaiser-user-map-gdt-pages	2017-10-31 15:03:54.436306511 -0700
+++ b/arch/x86/mm/kaiser.c	2017-10-31 15:03:54.442306794 -0700
@@ -329,6 +329,16 @@ void __init kaiser_init(void)
 	kaiser_add_user_map_early((void *)idt_descr.address,
 				  sizeof(gate_desc) * NR_VECTORS,
 				  __PAGE_KERNEL_RO);
+
+	/*
+	 * We could theoretically do this in setup_fixmap_gdt().
+	 * But, we would need to rewrite the above page table
+	 * allocation code to use the bootmem allocator.  The
+	 * buddy allocator is not available at the time that we
+	 * call setup_fixmap_gdt() for CPU 0.
+	 */
+	kaiser_add_user_map_early(get_cpu_gdt_ro(0), PAGE_SIZE,
+				  __PAGE_KERNEL_RO);
 }
 
 int kaiser_add_mapping(unsigned long addr, unsigned long size,
_

^ permalink raw reply	[flat|nested] 102+ messages in thread

* [PATCH 12/23] x86, kaiser: map dynamically-allocated LDTs
  2017-10-31 22:31 [PATCH 00/23] KAISER: unmap most of the kernel from userspace page tables Dave Hansen
                   ` (10 preceding siblings ...)
  2017-10-31 22:32 ` [PATCH 11/23] x86, kaiser: map GDT into user page tables Dave Hansen
@ 2017-10-31 22:32 ` Dave Hansen
  2017-11-01  8:00   ` Andy Lutomirski
  2017-10-31 22:32 ` [PATCH 13/23] x86, kaiser: map espfix structures Dave Hansen
                   ` (14 subsequent siblings)
  26 siblings, 1 reply; 102+ messages in thread
From: Dave Hansen @ 2017-10-31 22:32 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, dave.hansen, moritz.lipp, daniel.gruss,
	michael.schwarz, luto, torvalds, keescook, hughd, x86


Normally, a process just has a NULL mm->context.ldt.  But, we
have a syscall for a process to set a new one.  If a process does
that, we need to map the new LDT.

The original KAISER patch missed this case.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Moritz Lipp <moritz.lipp@iaik.tugraz.at>
Cc: Daniel Gruss <daniel.gruss@iaik.tugraz.at>
Cc: Michael Schwarz <michael.schwarz@iaik.tugraz.at>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Kees Cook <keescook@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: x86@kernel.org
---

 b/arch/x86/kernel/ldt.c |   25 ++++++++++++++++++++-----
 1 file changed, 20 insertions(+), 5 deletions(-)

diff -puN arch/x86/kernel/ldt.c~kaiser-user-map-new-ldts arch/x86/kernel/ldt.c
--- a/arch/x86/kernel/ldt.c~kaiser-user-map-new-ldts	2017-10-31 15:03:55.034334777 -0700
+++ b/arch/x86/kernel/ldt.c	2017-10-31 15:03:55.038334966 -0700
@@ -10,6 +10,7 @@
 #include <linux/gfp.h>
 #include <linux/sched.h>
 #include <linux/string.h>
+#include <linux/kaiser.h>
 #include <linux/mm.h>
 #include <linux/smp.h>
 #include <linux/slab.h>
@@ -55,11 +56,21 @@ static void flush_ldt(void *__mm)
 	refresh_ldt_segments();
 }
 
+static void __free_ldt_struct(struct ldt_struct *ldt)
+{
+	if (ldt->nr_entries * LDT_ENTRY_SIZE > PAGE_SIZE)
+		vfree_atomic(ldt->entries);
+	else
+		free_page((unsigned long)ldt->entries);
+	kfree(ldt);
+}
+
 /* The caller must call finalize_ldt_struct on the result. LDT starts zeroed. */
 static struct ldt_struct *alloc_ldt_struct(unsigned int num_entries)
 {
 	struct ldt_struct *new_ldt;
 	unsigned int alloc_size;
+	int ret;
 
 	if (num_entries > LDT_ENTRIES)
 		return NULL;
@@ -87,6 +98,12 @@ static struct ldt_struct *alloc_ldt_stru
 		return NULL;
 	}
 
+	ret = kaiser_add_mapping((unsigned long)new_ldt->entries, alloc_size,
+				 __PAGE_KERNEL);
+	if (ret) {
+		__free_ldt_struct(new_ldt);
+		return NULL;
+	}
 	new_ldt->nr_entries = num_entries;
 	return new_ldt;
 }
@@ -113,12 +130,10 @@ static void free_ldt_struct(struct ldt_s
 	if (likely(!ldt))
 		return;
 
+	kaiser_remove_mapping((unsigned long)ldt->entries,
+			      ldt->nr_entries * LDT_ENTRY_SIZE);
 	paravirt_free_ldt(ldt->entries, ldt->nr_entries);
-	if (ldt->nr_entries * LDT_ENTRY_SIZE > PAGE_SIZE)
-		vfree_atomic(ldt->entries);
-	else
-		free_page((unsigned long)ldt->entries);
-	kfree(ldt);
+	__free_ldt_struct(ldt);
 }
 
 /*
_

^ permalink raw reply	[flat|nested] 102+ messages in thread

* [PATCH 13/23] x86, kaiser: map espfix structures
  2017-10-31 22:31 [PATCH 00/23] KAISER: unmap most of the kernel from userspace page tables Dave Hansen
                   ` (11 preceding siblings ...)
  2017-10-31 22:32 ` [PATCH 12/23] x86, kaiser: map dynamically-allocated LDTs Dave Hansen
@ 2017-10-31 22:32 ` Dave Hansen
  2017-10-31 22:32 ` [PATCH 14/23] x86, kaiser: map entry stack variables Dave Hansen
                   ` (13 subsequent siblings)
  26 siblings, 0 replies; 102+ messages in thread
From: Dave Hansen @ 2017-10-31 22:32 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, dave.hansen, moritz.lipp, daniel.gruss,
	michael.schwarz, luto, torvalds, keescook, hughd, x86


We have some rather arcane code to help when we IRET to 16-bit
segments: the "espfix" code.  This consists of a few per-cpu
variables:

	espfix_stack: tells us where we allocated the stack
	  	      (the bottom)
	espfix_waddr: tells us where we can actually point %rsp

and the stack itself.  We need all three things mapped for this
to work.

Note: the espfix code runs with a kernel GSBASE, but user
(shadow) page tables.  We could switch to the kernel page tables
here and then not have to map any of this, but just
user-pagetable-mapping is simpler.  To switch over to the kernel
copy, we would need some temporary storage which is in short
supply at this point.

The original KAISER patch missed this case.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Moritz Lipp <moritz.lipp@iaik.tugraz.at>
Cc: Daniel Gruss <daniel.gruss@iaik.tugraz.at>
Cc: Michael Schwarz <michael.schwarz@iaik.tugraz.at>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Kees Cook <keescook@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: x86@kernel.org
---

 b/arch/x86/kernel/espfix_64.c |    7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff -puN arch/x86/kernel/espfix_64.c~kaiser-user-map-espfix arch/x86/kernel/espfix_64.c
--- a/arch/x86/kernel/espfix_64.c~kaiser-user-map-espfix	2017-10-31 15:03:55.601361577 -0700
+++ b/arch/x86/kernel/espfix_64.c	2017-10-31 15:03:55.605361766 -0700
@@ -33,6 +33,7 @@
 
 #include <linux/init.h>
 #include <linux/init_task.h>
+#include <linux/kaiser.h>
 #include <linux/kernel.h>
 #include <linux/percpu.h>
 #include <linux/gfp.h>
@@ -41,7 +42,6 @@
 #include <asm/pgalloc.h>
 #include <asm/setup.h>
 #include <asm/espfix.h>
-#include <asm/kaiser.h>
 
 /*
  * Note: we only need 6*8 = 48 bytes for the espfix stack, but round
@@ -61,8 +61,8 @@
 #define PGALLOC_GFP (GFP_KERNEL | __GFP_NOTRACK | __GFP_ZERO)
 
 /* This contains the *bottom* address of the espfix stack */
-DEFINE_PER_CPU_READ_MOSTLY(unsigned long, espfix_stack);
-DEFINE_PER_CPU_READ_MOSTLY(unsigned long, espfix_waddr);
+DEFINE_PER_CPU_USER_MAPPED(unsigned long, espfix_stack);
+DEFINE_PER_CPU_USER_MAPPED(unsigned long, espfix_waddr);
 
 /* Initialization mutex - should this be a spinlock? */
 static DEFINE_MUTEX(espfix_init_mutex);
@@ -225,4 +225,5 @@ done:
 	per_cpu(espfix_stack, cpu) = addr;
 	per_cpu(espfix_waddr, cpu) = (unsigned long)stack_page
 				      + (addr & ~PAGE_MASK);
+	kaiser_add_mapping((unsigned long)stack_page, PAGE_SIZE, __PAGE_KERNEL);
 }
_

^ permalink raw reply	[flat|nested] 102+ messages in thread

* [PATCH 14/23] x86, kaiser: map entry stack variables
  2017-10-31 22:31 [PATCH 00/23] KAISER: unmap most of the kernel from userspace page tables Dave Hansen
                   ` (12 preceding siblings ...)
  2017-10-31 22:32 ` [PATCH 13/23] x86, kaiser: map espfix structures Dave Hansen
@ 2017-10-31 22:32 ` Dave Hansen
  2017-10-31 22:32 ` [PATCH 15/23] x86, kaiser: map trace interrupt entry Dave Hansen
                   ` (12 subsequent siblings)
  26 siblings, 0 replies; 102+ messages in thread
From: Dave Hansen @ 2017-10-31 22:32 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, dave.hansen, moritz.lipp, daniel.gruss,
	michael.schwarz, luto, torvalds, keescook, hughd, x86


There are times that we enter the kernel and do not have a safe
stack, like at SYSCALL entry.  We use the per-cpu variables
'rsp_scratch' and 'cpu_current_top_of_stack' to save off the old
%rsp and find a safe place to have a stack.

You can not directly manipulate the CR3 register.  You can only
'MOV' to it from another register, which means we need to clobber
a register in order to do any CR3 manipulation.  User-mapping these
variables allows us to obtain a safe stack *before* we switch the
CR3 value.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Moritz Lipp <moritz.lipp@iaik.tugraz.at>
Cc: Daniel Gruss <daniel.gruss@iaik.tugraz.at>
Cc: Michael Schwarz <michael.schwarz@iaik.tugraz.at>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Kees Cook <keescook@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: x86@kernel.org
---

 b/arch/x86/kernel/cpu/common.c |    2 +-
 b/arch/x86/kernel/process_64.c |    2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff -puN arch/x86/kernel/cpu/common.c~kaiser-user-map-stack-helper-vars arch/x86/kernel/cpu/common.c
--- a/arch/x86/kernel/cpu/common.c~kaiser-user-map-stack-helper-vars	2017-10-31 15:03:56.168388378 -0700
+++ b/arch/x86/kernel/cpu/common.c	2017-10-31 15:03:56.175388709 -0700
@@ -1440,7 +1440,7 @@ EXPORT_PER_CPU_SYMBOL(__preempt_count);
  * the top of the kernel stack.  Use an extra percpu variable to track the
  * top of the kernel stack directly.
  */
-DEFINE_PER_CPU(unsigned long, cpu_current_top_of_stack) =
+DEFINE_PER_CPU_USER_MAPPED(unsigned long, cpu_current_top_of_stack) =
 	(unsigned long)&init_thread_union + THREAD_SIZE;
 EXPORT_PER_CPU_SYMBOL(cpu_current_top_of_stack);
 
diff -puN arch/x86/kernel/process_64.c~kaiser-user-map-stack-helper-vars arch/x86/kernel/process_64.c
--- a/arch/x86/kernel/process_64.c~kaiser-user-map-stack-helper-vars	2017-10-31 15:03:56.170388472 -0700
+++ b/arch/x86/kernel/process_64.c	2017-10-31 15:03:56.176388756 -0700
@@ -59,7 +59,7 @@
 #include <asm/unistd_32_ia32.h>
 #endif
 
-__visible DEFINE_PER_CPU(unsigned long, rsp_scratch);
+__visible DEFINE_PER_CPU_USER_MAPPED(unsigned long, rsp_scratch);
 
 /* Prints also some state that isn't saved in the pt_regs */
 void __show_regs(struct pt_regs *regs, int all)
_

^ permalink raw reply	[flat|nested] 102+ messages in thread

* [PATCH 15/23] x86, kaiser: map trace interrupt entry
  2017-10-31 22:31 [PATCH 00/23] KAISER: unmap most of the kernel from userspace page tables Dave Hansen
                   ` (13 preceding siblings ...)
  2017-10-31 22:32 ` [PATCH 14/23] x86, kaiser: map entry stack variables Dave Hansen
@ 2017-10-31 22:32 ` Dave Hansen
  2017-10-31 22:32 ` [PATCH 16/23] x86, kaiser: map debug IDT tables Dave Hansen
                   ` (11 subsequent siblings)
  26 siblings, 0 replies; 102+ messages in thread
From: Dave Hansen @ 2017-10-31 22:32 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, dave.hansen, moritz.lipp, daniel.gruss,
	michael.schwarz, luto, torvalds, keescook, hughd, x86


We put all of the interrupt entry/exit code into a special
section (.irqentry.text).  This enables the ftrace code to figure
out when we are in a "grey area" of interrupt handling before the
C code has taken over and marked in the data structures that we
are in an interrupt.

KAISER needs to map this section into the user page tables
because it contains the assembly that helps us enter interrupt
routines.  In addition to the assembly which KAISER *needs*, the
section also contains the first C function that handles an
interrupt.  This is unfortunate, but it doesn't really hurt
anything.

This patch also aligns the .entry.text and .irqentry.text.  This
ensures that we KAISER-map the section we want and *only* the
section we want.  Otherwise, we might pull in extra code that was
never explicitly meant to be KAISER-mapped, but just happened to
get pulled in with something that shared the same page.  That also
generally does not hurt anything, but it can make things hard
to debug because random build alignment can cause things to
fail.

This was missed in the original KAISER patch.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Moritz Lipp <moritz.lipp@iaik.tugraz.at>
Cc: Daniel Gruss <daniel.gruss@iaik.tugraz.at>
Cc: Michael Schwarz <michael.schwarz@iaik.tugraz.at>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Kees Cook <keescook@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: x86@kernel.org
---

 b/arch/x86/mm/kaiser.c              |   14 ++++++++++++++
 b/include/asm-generic/vmlinux.lds.h |   10 ++++++++++
 2 files changed, 24 insertions(+)

diff -puN arch/x86/mm/kaiser.c~kaiser-user-map-trace-irqentry_text arch/x86/mm/kaiser.c
--- a/arch/x86/mm/kaiser.c~kaiser-user-map-trace-irqentry_text	2017-10-31 15:03:56.764416549 -0700
+++ b/arch/x86/mm/kaiser.c	2017-10-31 15:03:56.770416832 -0700
@@ -19,6 +19,7 @@
 #include <linux/types.h>
 #include <linux/bug.h>
 #include <linux/init.h>
+#include <linux/interrupt.h>
 #include <linux/spinlock.h>
 #include <linux/mm.h>
 #include <linux/uaccess.h>
@@ -339,6 +340,19 @@ void __init kaiser_init(void)
 	 */
 	kaiser_add_user_map_early(get_cpu_gdt_ro(0), PAGE_SIZE,
 				  __PAGE_KERNEL_RO);
+
+	/*
+	 * .irqentry.text helps us identify code that runs before
+	 * we get a chance to call entering_irq().  This includes
+	 * the interrupt entry assembly plus the first C function
+	 * that gets called.  KAISER does not need the C code
+	 * mapped.  We just use the .irqentry.text section as-is
+	 * to avoid having to carve out a new section for the
+	 * assembly only.
+	 */
+	kaiser_add_user_map_ptrs_early(__irqentry_text_start,
+				       __irqentry_text_end,
+				       __PAGE_KERNEL_RX);
 }
 
 int kaiser_add_mapping(unsigned long addr, unsigned long size,
diff -puN include/asm-generic/vmlinux.lds.h~kaiser-user-map-trace-irqentry_text include/asm-generic/vmlinux.lds.h
--- a/include/asm-generic/vmlinux.lds.h~kaiser-user-map-trace-irqentry_text	2017-10-31 15:03:56.766416643 -0700
+++ b/include/asm-generic/vmlinux.lds.h	2017-10-31 15:03:56.772416927 -0700
@@ -59,6 +59,12 @@
 /* Align . to a 8 byte boundary equals to maximum function alignment. */
 #define ALIGN_FUNCTION()  . = ALIGN(8)
 
+#ifdef CONFIG_KAISER
+#define ALIGN_KAISER()	. = ALIGN(PAGE_SIZE);
+#else
+#define ALIGN_KAISER()
+#endif
+
 /*
  * LD_DEAD_CODE_DATA_ELIMINATION option enables -fdata-sections, which
  * generates .data.identifier sections, which need to be pulled in with
@@ -493,15 +499,19 @@
 		VMLINUX_SYMBOL(__kprobes_text_end) = .;
 
 #define ENTRY_TEXT							\
+		ALIGN_KAISER();						\
 		ALIGN_FUNCTION();					\
 		VMLINUX_SYMBOL(__entry_text_start) = .;			\
 		*(.entry.text)						\
+		ALIGN_KAISER();						\
 		VMLINUX_SYMBOL(__entry_text_end) = .;
 
 #define IRQENTRY_TEXT							\
+		ALIGN_KAISER();						\
 		ALIGN_FUNCTION();					\
 		VMLINUX_SYMBOL(__irqentry_text_start) = .;		\
 		*(.irqentry.text)					\
+		ALIGN_KAISER();						\
 		VMLINUX_SYMBOL(__irqentry_text_end) = .;
 
 #define SOFTIRQENTRY_TEXT						\
_

^ permalink raw reply	[flat|nested] 102+ messages in thread

* [PATCH 16/23] x86, kaiser: map debug IDT tables
  2017-10-31 22:31 [PATCH 00/23] KAISER: unmap most of the kernel from userspace page tables Dave Hansen
                   ` (14 preceding siblings ...)
  2017-10-31 22:32 ` [PATCH 15/23] x86, kaiser: map trace interrupt entry Dave Hansen
@ 2017-10-31 22:32 ` Dave Hansen
  2017-10-31 22:32 ` [PATCH 17/23] x86, kaiser: map virtually-addressed performance monitoring buffers Dave Hansen
                   ` (10 subsequent siblings)
  26 siblings, 0 replies; 102+ messages in thread
From: Dave Hansen @ 2017-10-31 22:32 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, dave.hansen, moritz.lipp, daniel.gruss,
	michael.schwarz, luto, torvalds, keescook, hughd, x86


The debug IDT tables are another structure that the CPU
references via a virtual address.  It obviously needs them in
order to handle an interrupt arriving in userspace, so they need
to be mapped into the user copy of the page tables.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Moritz Lipp <moritz.lipp@iaik.tugraz.at>
Cc: Daniel Gruss <daniel.gruss@iaik.tugraz.at>
Cc: Michael Schwarz <michael.schwarz@iaik.tugraz.at>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Kees Cook <keescook@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: x86@kernel.org
---

 b/arch/x86/mm/kaiser.c |   12 ++++++++++++
 1 file changed, 12 insertions(+)

diff -puN arch/x86/mm/kaiser.c~kaiser-user-map-trace-and-debug-idt arch/x86/mm/kaiser.c
--- a/arch/x86/mm/kaiser.c~kaiser-user-map-trace-and-debug-idt	2017-10-31 15:03:57.365444956 -0700
+++ b/arch/x86/mm/kaiser.c	2017-10-31 15:03:57.368445098 -0700
@@ -250,6 +250,14 @@ int kaiser_add_user_map_ptrs(const void
 				   flags);
 }
 
+static int kaiser_user_map_ptr_early(const void *start_addr, unsigned long size,
+				 unsigned long flags)
+{
+	int ret = kaiser_add_user_map(start_addr, size, flags);
+	WARN_ON(ret);
+	return ret;
+}
+
 /*
  * Ensure that the top level of the (shadow) page tables are
  * entirely populated.  This ensures that all processes that get
@@ -331,6 +339,10 @@ void __init kaiser_init(void)
 				  sizeof(gate_desc) * NR_VECTORS,
 				  __PAGE_KERNEL_RO);
 
+	kaiser_user_map_ptr_early(&debug_idt_table,
+				  sizeof(gate_desc) * NR_VECTORS,
+				  __PAGE_KERNEL);
+
 	/*
 	 * We could theoretically do this in setup_fixmap_gdt().
 	 * But, we would need to rewrite the above page table
_

^ permalink raw reply	[flat|nested] 102+ messages in thread

* [PATCH 17/23] x86, kaiser: map virtually-addressed performance monitoring buffers
  2017-10-31 22:31 [PATCH 00/23] KAISER: unmap most of the kernel from userspace page tables Dave Hansen
                   ` (15 preceding siblings ...)
  2017-10-31 22:32 ` [PATCH 16/23] x86, kaiser: map debug IDT tables Dave Hansen
@ 2017-10-31 22:32 ` Dave Hansen
  2017-10-31 22:32 ` [PATCH 18/23] x86, mm: Move CR3 construction functions Dave Hansen
                   ` (9 subsequent siblings)
  26 siblings, 0 replies; 102+ messages in thread
From: Dave Hansen @ 2017-10-31 22:32 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, dave.hansen, hughd, moritz.lipp, daniel.gruss,
	michael.schwarz, luto, torvalds, keescook, x86


From: Hugh Dickins <hughd@google.com>

The BTS and PEBS buffers both have their virtual addresses programmed
into the hardware.  This means that we have to access them via the page
tables.  The times that the hardware accesses these are entirely
dependent on how the performance monitoring hardware events are set up.
In other words, we have no idea when we might need to access these
buffers.

Avoid perf crashes: place debug_store in the user-mapped per-cpu area
instead of allocating, and use page allocator plus kaiser_add_mapping()
to keep the BTS and PEBS buffers user-mapped (that is, present in the
user mapping, though visible only to kernel and hardware).  The PEBS
fixup buffer does not need this treatment.

The need for a user-mapped struct debug_store showed up before doing
any conscious perf testing: in a couple of kernel paging oopses on
Westmere, implicating the debug_store offset of the per-cpu area.

Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Moritz Lipp <moritz.lipp@iaik.tugraz.at>
Cc: Daniel Gruss <daniel.gruss@iaik.tugraz.at>
Cc: Michael Schwarz <michael.schwarz@iaik.tugraz.at>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Kees Cook <keescook@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: x86@kernel.org
---

 b/arch/x86/events/intel/ds.c |   57 +++++++++++++++++++++++++++++++++----------
 1 file changed, 45 insertions(+), 12 deletions(-)

diff -puN arch/x86/events/intel/ds.c~kaiser-user-map-virtually-addressed-performance-monitoring-buffers arch/x86/events/intel/ds.c
--- a/arch/x86/events/intel/ds.c~kaiser-user-map-virtually-addressed-performance-monitoring-buffers	2017-10-31 15:03:57.933471803 -0700
+++ b/arch/x86/events/intel/ds.c	2017-10-31 15:03:57.937471992 -0700
@@ -2,11 +2,15 @@
 #include <linux/types.h>
 #include <linux/slab.h>
 
+#include <asm/kaiser.h>
 #include <asm/perf_event.h>
 #include <asm/insn.h>
 
 #include "../perf_event.h"
 
+static
+DEFINE_PER_CPU_SHARED_ALIGNED_USER_MAPPED(struct debug_store, cpu_debug_store);
+
 /* The size of a BTS record in bytes: */
 #define BTS_RECORD_SIZE		24
 
@@ -278,6 +282,39 @@ void fini_debug_store_on_cpu(int cpu)
 
 static DEFINE_PER_CPU(void *, insn_buffer);
 
+static void *dsalloc(size_t size, gfp_t flags, int node)
+{
+#ifdef CONFIG_KAISER
+	unsigned int order = get_order(size);
+	struct page *page;
+	unsigned long addr;
+
+	page = __alloc_pages_node(node, flags | __GFP_ZERO, order);
+	if (!page)
+		return NULL;
+	addr = (unsigned long)page_address(page);
+	if (kaiser_add_mapping(addr, size, __PAGE_KERNEL) < 0) {
+		__free_pages(page, order);
+		addr = 0;
+	}
+	return (void *)addr;
+#else
+	return kmalloc_node(size, flags | __GFP_ZERO, node);
+#endif
+}
+
+static void dsfree(const void *buffer, size_t size)
+{
+#ifdef CONFIG_KAISER
+	if (!buffer)
+		return;
+	kaiser_remove_mapping((unsigned long)buffer, size);
+	free_pages((unsigned long)buffer, get_order(size));
+#else
+	kfree(buffer);
+#endif
+}
+
 static int alloc_pebs_buffer(int cpu)
 {
 	struct debug_store *ds = per_cpu(cpu_hw_events, cpu).ds;
@@ -288,7 +325,7 @@ static int alloc_pebs_buffer(int cpu)
 	if (!x86_pmu.pebs)
 		return 0;
 
-	buffer = kzalloc_node(x86_pmu.pebs_buffer_size, GFP_KERNEL, node);
+	buffer = dsalloc(x86_pmu.pebs_buffer_size, GFP_KERNEL, node);
 	if (unlikely(!buffer))
 		return -ENOMEM;
 
@@ -299,7 +336,7 @@ static int alloc_pebs_buffer(int cpu)
 	if (x86_pmu.intel_cap.pebs_format < 2) {
 		ibuffer = kzalloc_node(PEBS_FIXUP_SIZE, GFP_KERNEL, node);
 		if (!ibuffer) {
-			kfree(buffer);
+			dsfree(buffer, x86_pmu.pebs_buffer_size);
 			return -ENOMEM;
 		}
 		per_cpu(insn_buffer, cpu) = ibuffer;
@@ -325,7 +362,8 @@ static void release_pebs_buffer(int cpu)
 	kfree(per_cpu(insn_buffer, cpu));
 	per_cpu(insn_buffer, cpu) = NULL;
 
-	kfree((void *)(unsigned long)ds->pebs_buffer_base);
+	dsfree((void *)(unsigned long)ds->pebs_buffer_base,
+			x86_pmu.pebs_buffer_size);
 	ds->pebs_buffer_base = 0;
 }
 
@@ -339,7 +377,7 @@ static int alloc_bts_buffer(int cpu)
 	if (!x86_pmu.bts)
 		return 0;
 
-	buffer = kzalloc_node(BTS_BUFFER_SIZE, GFP_KERNEL | __GFP_NOWARN, node);
+	buffer = dsalloc(BTS_BUFFER_SIZE, GFP_KERNEL | __GFP_NOWARN, node);
 	if (unlikely(!buffer)) {
 		WARN_ONCE(1, "%s: BTS buffer allocation failure\n", __func__);
 		return -ENOMEM;
@@ -365,19 +403,15 @@ static void release_bts_buffer(int cpu)
 	if (!ds || !x86_pmu.bts)
 		return;
 
-	kfree((void *)(unsigned long)ds->bts_buffer_base);
+	dsfree((void *)(unsigned long)ds->bts_buffer_base, BTS_BUFFER_SIZE);
 	ds->bts_buffer_base = 0;
 }
 
 static int alloc_ds_buffer(int cpu)
 {
-	int node = cpu_to_node(cpu);
-	struct debug_store *ds;
-
-	ds = kzalloc_node(sizeof(*ds), GFP_KERNEL, node);
-	if (unlikely(!ds))
-		return -ENOMEM;
+	struct debug_store *ds = per_cpu_ptr(&cpu_debug_store, cpu);
 
+	memset(ds, 0, sizeof(*ds));
 	per_cpu(cpu_hw_events, cpu).ds = ds;
 
 	return 0;
@@ -391,7 +425,6 @@ static void release_ds_buffer(int cpu)
 		return;
 
 	per_cpu(cpu_hw_events, cpu).ds = NULL;
-	kfree(ds);
 }
 
 void release_ds_buffers(void)
_

^ permalink raw reply	[flat|nested] 102+ messages in thread

* [PATCH 18/23] x86, mm: Move CR3 construction functions
  2017-10-31 22:31 [PATCH 00/23] KAISER: unmap most of the kernel from userspace page tables Dave Hansen
                   ` (16 preceding siblings ...)
  2017-10-31 22:32 ` [PATCH 17/23] x86, kaiser: map virtually-addressed performance monitoring buffers Dave Hansen
@ 2017-10-31 22:32 ` Dave Hansen
  2017-10-31 22:32 ` [PATCH 19/23] x86, mm: remove hard-coded ASID limit checks Dave Hansen
                   ` (8 subsequent siblings)
  26 siblings, 0 replies; 102+ messages in thread
From: Dave Hansen @ 2017-10-31 22:32 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, dave.hansen, moritz.lipp, daniel.gruss,
	michael.schwarz, luto, torvalds, keescook, hughd, x86


For flushing the TLB, we need to know which ASID has been programmed
into the hardware.  Since that differs from what is in 'cpu_tlbstate',
we need to be able to transform the ASID in cpu_tlbstate to the one
programmed into the hardware.

It's not easy to include mmu_context.h into tlbflush.h, so just move
the CR3-building functions over to tlbflush.h.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Moritz Lipp <moritz.lipp@iaik.tugraz.at>
Cc: Daniel Gruss <daniel.gruss@iaik.tugraz.at>
Cc: Michael Schwarz <michael.schwarz@iaik.tugraz.at>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Kees Cook <keescook@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: x86@kernel.org
---

 b/arch/x86/include/asm/mmu_context.h |   29 +----------------------------
 b/arch/x86/include/asm/tlbflush.h    |   27 +++++++++++++++++++++++++++
 b/arch/x86/mm/tlb.c                  |    8 ++++----
 3 files changed, 32 insertions(+), 32 deletions(-)

diff -puN arch/x86/include/asm/mmu_context.h~kaiser-pcid-pre-build-func-move arch/x86/include/asm/mmu_context.h
--- a/arch/x86/include/asm/mmu_context.h~kaiser-pcid-pre-build-func-move	2017-10-31 15:03:58.508498981 -0700
+++ b/arch/x86/include/asm/mmu_context.h	2017-10-31 15:03:58.516499360 -0700
@@ -281,33 +281,6 @@ static inline bool arch_vma_access_permi
 }
 
 /*
- * If PCID is on, ASID-aware code paths put the ASID+1 into the PCID
- * bits.  This serves two purposes.  It prevents a nasty situation in
- * which PCID-unaware code saves CR3, loads some other value (with PCID
- * == 0), and then restores CR3, thus corrupting the TLB for ASID 0 if
- * the saved ASID was nonzero.  It also means that any bugs involving
- * loading a PCID-enabled CR3 with CR4.PCIDE off will trigger
- * deterministically.
- */
-
-static inline unsigned long build_cr3(struct mm_struct *mm, u16 asid)
-{
-	if (static_cpu_has(X86_FEATURE_PCID)) {
-		VM_WARN_ON_ONCE(asid > 4094);
-		return __sme_pa(mm->pgd) | (asid + 1);
-	} else {
-		VM_WARN_ON_ONCE(asid != 0);
-		return __sme_pa(mm->pgd);
-	}
-}
-
-static inline unsigned long build_cr3_noflush(struct mm_struct *mm, u16 asid)
-{
-	VM_WARN_ON_ONCE(asid > 4094);
-	return __sme_pa(mm->pgd) | (asid + 1) | CR3_NOFLUSH;
-}
-
-/*
  * This can be used from process context to figure out what the value of
  * CR3 is without needing to do a (slow) __read_cr3().
  *
@@ -316,7 +289,7 @@ static inline unsigned long build_cr3_no
  */
 static inline unsigned long __get_current_cr3_fast(void)
 {
-	unsigned long cr3 = build_cr3(this_cpu_read(cpu_tlbstate.loaded_mm),
+	unsigned long cr3 = build_cr3(this_cpu_read(cpu_tlbstate.loaded_mm)->pgd,
 		this_cpu_read(cpu_tlbstate.loaded_mm_asid));
 
 	/* For now, be very restrictive about when this can be called. */
diff -puN arch/x86/include/asm/tlbflush.h~kaiser-pcid-pre-build-func-move arch/x86/include/asm/tlbflush.h
--- a/arch/x86/include/asm/tlbflush.h~kaiser-pcid-pre-build-func-move	2017-10-31 15:03:58.510499076 -0700
+++ b/arch/x86/include/asm/tlbflush.h	2017-10-31 15:03:58.518499454 -0700
@@ -74,6 +74,33 @@ static inline u64 inc_mm_tlb_gen(struct
 	return new_tlb_gen;
 }
 
+/*
+ * If PCID is on, ASID-aware code paths put the ASID+1 into the PCID
+ * bits.  This serves two purposes.  It prevents a nasty situation in
+ * which PCID-unaware code saves CR3, loads some other value (with PCID
+ * == 0), and then restores CR3, thus corrupting the TLB for ASID 0 if
+ * the saved ASID was nonzero.  It also means that any bugs involving
+ * loading a PCID-enabled CR3 with CR4.PCIDE off will trigger
+ * deterministically.
+ */
+struct pgd_t;
+static inline unsigned long build_cr3(pgd_t *pgd, u16 asid)
+{
+	if (static_cpu_has(X86_FEATURE_PCID)) {
+		VM_WARN_ON_ONCE(asid > 4094);
+		return __sme_pa(pgd) | (asid + 1);
+	} else {
+		VM_WARN_ON_ONCE(asid != 0);
+		return __sme_pa(pgd);
+	}
+}
+
+static inline unsigned long build_cr3_noflush(pgd_t *pgd, u16 asid)
+{
+	VM_WARN_ON_ONCE(asid > 4094);
+	return __sme_pa(pgd) | (asid + 1) | CR3_NOFLUSH;
+}
+
 #ifdef CONFIG_PARAVIRT
 #include <asm/paravirt.h>
 #else
diff -puN arch/x86/mm/tlb.c~kaiser-pcid-pre-build-func-move arch/x86/mm/tlb.c
--- a/arch/x86/mm/tlb.c~kaiser-pcid-pre-build-func-move	2017-10-31 15:03:58.512499170 -0700
+++ b/arch/x86/mm/tlb.c	2017-10-31 15:03:58.518499454 -0700
@@ -127,7 +127,7 @@ void switch_mm_irqs_off(struct mm_struct
 	 * isn't free.
 	 */
 #ifdef CONFIG_DEBUG_VM
-	if (WARN_ON_ONCE(__read_cr3() != build_cr3(real_prev, prev_asid))) {
+	if (WARN_ON_ONCE(__read_cr3() != build_cr3(real_prev->pgd, prev_asid))) {
 		/*
 		 * If we were to BUG here, we'd be very likely to kill
 		 * the system so hard that we don't see the call trace.
@@ -194,12 +194,12 @@ void switch_mm_irqs_off(struct mm_struct
 		if (need_flush) {
 			this_cpu_write(cpu_tlbstate.ctxs[new_asid].ctx_id, next->context.ctx_id);
 			this_cpu_write(cpu_tlbstate.ctxs[new_asid].tlb_gen, next_tlb_gen);
-			write_cr3(build_cr3(next, new_asid));
+			write_cr3(build_cr3(next->pgd, new_asid));
 			trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH,
 					TLB_FLUSH_ALL);
 		} else {
 			/* The new ASID is already up to date. */
-			write_cr3(build_cr3_noflush(next, new_asid));
+			write_cr3(build_cr3_noflush(next->pgd, new_asid));
 			trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH, 0);
 		}
 
@@ -277,7 +277,7 @@ void initialize_tlbstate_and_flush(void)
 		!(cr4_read_shadow() & X86_CR4_PCIDE));
 
 	/* Force ASID 0 and force a TLB flush. */
-	write_cr3(build_cr3(mm, 0));
+	write_cr3(build_cr3(mm->pgd, 0));
 
 	/* Reinitialize tlbstate. */
 	this_cpu_write(cpu_tlbstate.loaded_mm_asid, 0);
_

^ permalink raw reply	[flat|nested] 102+ messages in thread

* [PATCH 19/23] x86, mm: remove hard-coded ASID limit checks
  2017-10-31 22:31 [PATCH 00/23] KAISER: unmap most of the kernel from userspace page tables Dave Hansen
                   ` (17 preceding siblings ...)
  2017-10-31 22:32 ` [PATCH 18/23] x86, mm: Move CR3 construction functions Dave Hansen
@ 2017-10-31 22:32 ` Dave Hansen
  2017-10-31 22:32 ` [PATCH 20/23] x86, mm: put mmu-to-h/w ASID translation in one place Dave Hansen
                   ` (7 subsequent siblings)
  26 siblings, 0 replies; 102+ messages in thread
From: Dave Hansen @ 2017-10-31 22:32 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, dave.hansen, moritz.lipp, daniel.gruss,
	michael.schwarz, luto, torvalds, keescook, hughd, x86


First, it's nice to remove the magic numbers.

Second, KAISER is going to eat up half of the available ASID
space.  We do not use that space today, but we need to at least
spell out this new restriction.
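
To put rough numbers on that (a back-of-the-envelope sketch; the
real macros are in the diff below, and the KAISER bit itself is
only claimed later in this series):

	/*
	 * 12 ASID bits in CR3      -> 4096 hardware PCIDs
	 * KAISER claims 1 bit      -> 2048 values per side (kernel/user)
	 * ASID 0 stays reserved    -> 2047 usable ASIDs per side
	 */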

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Moritz Lipp <moritz.lipp@iaik.tugraz.at>
Cc: Daniel Gruss <daniel.gruss@iaik.tugraz.at>
Cc: Michael Schwarz <michael.schwarz@iaik.tugraz.at>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Kees Cook <keescook@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: x86@kernel.org
---

 b/arch/x86/include/asm/tlbflush.h |   16 ++++++++++++++--
 1 file changed, 14 insertions(+), 2 deletions(-)

diff -puN arch/x86/include/asm/tlbflush.h~kaiser-pcid-pre-build-asids-macros arch/x86/include/asm/tlbflush.h
--- a/arch/x86/include/asm/tlbflush.h~kaiser-pcid-pre-build-asids-macros	2017-10-31 15:03:59.132528476 -0700
+++ b/arch/x86/include/asm/tlbflush.h	2017-10-31 15:03:59.135528617 -0700
@@ -74,6 +74,18 @@ static inline u64 inc_mm_tlb_gen(struct
 	return new_tlb_gen;
 }
 
+/* There are 12 bits of space for ASIDS in CR3 */
+#define CR3_HW_ASID_BITS 12
+/* When enabled, KAISER consumes a single bit for user/kernel switches */
+#define KAISER_CONSUMED_ASID_BITS 0
+
+#define CR3_AVAIL_ASID_BITS (CR3_HW_ASID_BITS-KAISER_CONSUMED_ASID_BITS)
+/*
+ * We lose a single extra ASID because 0 is reserved for use
+ * by non-PCID-aware users.
+ */
+#define NR_AVAIL_ASIDS ((1<<CR3_AVAIL_ASID_BITS) - 1)
+
 /*
  * If PCID is on, ASID-aware code paths put the ASID+1 into the PCID
  * bits.  This serves two purposes.  It prevents a nasty situation in
@@ -87,7 +99,7 @@ struct pgd_t;
 static inline unsigned long build_cr3(pgd_t *pgd, u16 asid)
 {
 	if (static_cpu_has(X86_FEATURE_PCID)) {
-		VM_WARN_ON_ONCE(asid > 4094);
+		VM_WARN_ON_ONCE(asid > NR_AVAIL_ASIDS);
 		return __sme_pa(pgd) | (asid + 1);
 	} else {
 		VM_WARN_ON_ONCE(asid != 0);
@@ -97,7 +109,7 @@ static inline unsigned long build_cr3(pg
 
 static inline unsigned long build_cr3_noflush(pgd_t *pgd, u16 asid)
 {
-	VM_WARN_ON_ONCE(asid > 4094);
+	VM_WARN_ON_ONCE(asid > NR_AVAIL_ASIDS);
 	return __sme_pa(pgd) | (asid + 1) | CR3_NOFLUSH;
 }
 
_

^ permalink raw reply	[flat|nested] 102+ messages in thread

* [PATCH 20/23] x86, mm: put mmu-to-h/w ASID translation in one place
  2017-10-31 22:31 [PATCH 00/23] KAISER: unmap most of the kernel from userspace page tables Dave Hansen
                   ` (18 preceding siblings ...)
  2017-10-31 22:32 ` [PATCH 19/23] x86, mm: remove hard-coded ASID limit checks Dave Hansen
@ 2017-10-31 22:32 ` Dave Hansen
  2017-10-31 22:32 ` [PATCH 21/23] x86, pcid, kaiser: allow flushing for future ASID switches Dave Hansen
                   ` (6 subsequent siblings)
  26 siblings, 0 replies; 102+ messages in thread
From: Dave Hansen @ 2017-10-31 22:32 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, dave.hansen, moritz.lipp, daniel.gruss,
	michael.schwarz, luto, torvalds, keescook, hughd, x86


We effectively have two ASID types:
1. The one stored in the mmu_context that goes from 0->5
2. The one we program into the hardware that goes from 1->6

Let's just put the +1 in a single place, which gives us a
nice place to comment.  KAISER will also need a way to know,
given an ASID, which hardware ASID to flush for the userspace
mapping.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Moritz Lipp <moritz.lipp@iaik.tugraz.at>
Cc: Daniel Gruss <daniel.gruss@iaik.tugraz.at>
Cc: Michael Schwarz <michael.schwarz@iaik.tugraz.at>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Kees Cook <keescook@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: x86@kernel.org
---

 b/arch/x86/include/asm/tlbflush.h |   30 ++++++++++++++++++------------
 1 file changed, 18 insertions(+), 12 deletions(-)

diff -puN arch/x86/include/asm/tlbflush.h~kaiser-pcid-pre-build-kern arch/x86/include/asm/tlbflush.h
--- a/arch/x86/include/asm/tlbflush.h~kaiser-pcid-pre-build-kern	2017-10-31 15:03:59.699555275 -0700
+++ b/arch/x86/include/asm/tlbflush.h	2017-10-31 15:03:59.703555465 -0700
@@ -86,21 +86,26 @@ static inline u64 inc_mm_tlb_gen(struct
  */
 #define NR_AVAIL_ASIDS ((1<<CR3_AVAIL_ASID_BITS) - 1)
 
-/*
- * If PCID is on, ASID-aware code paths put the ASID+1 into the PCID
- * bits.  This serves two purposes.  It prevents a nasty situation in
- * which PCID-unaware code saves CR3, loads some other value (with PCID
- * == 0), and then restores CR3, thus corrupting the TLB for ASID 0 if
- * the saved ASID was nonzero.  It also means that any bugs involving
- * loading a PCID-enabled CR3 with CR4.PCIDE off will trigger
- * deterministically.
- */
+static inline u16 kern_asid(u16 asid)
+{
+	VM_WARN_ON_ONCE(asid >= NR_AVAIL_ASIDS);
+	/*
+	 * If PCID is on, ASID-aware code paths put the ASID+1 into the PCID
+	 * bits.  This serves two purposes.  It prevents a nasty situation in
+	 * which PCID-unaware code saves CR3, loads some other value (with PCID
+	 * == 0), and then restores CR3, thus corrupting the TLB for ASID 0 if
+	 * the saved ASID was nonzero.  It also means that any bugs involving
+	 * loading a PCID-enabled CR3 with CR4.PCIDE off will trigger
+	 * deterministically.
+	 */
+	return asid + 1;
+}
+
 struct pgd_t;
 static inline unsigned long build_cr3(pgd_t *pgd, u16 asid)
 {
 	if (static_cpu_has(X86_FEATURE_PCID)) {
-		VM_WARN_ON_ONCE(asid > NR_AVAIL_ASIDS);
-		return __sme_pa(pgd) | (asid + 1);
+		return __sme_pa(pgd) | kern_asid(asid);
 	} else {
 		VM_WARN_ON_ONCE(asid != 0);
 		return __sme_pa(pgd);
@@ -110,7 +115,8 @@ static inline unsigned long build_cr3(pg
 static inline unsigned long build_cr3_noflush(pgd_t *pgd, u16 asid)
 {
 	VM_WARN_ON_ONCE(asid > NR_AVAIL_ASIDS);
-	return __sme_pa(pgd) | (asid + 1) | CR3_NOFLUSH;
+	VM_WARN_ON_ONCE(!this_cpu_has(X86_FEATURE_PCID));
+	return __sme_pa(pgd) | kern_asid(asid) | CR3_NOFLUSH;
 }
 
 #ifdef CONFIG_PARAVIRT
_

^ permalink raw reply	[flat|nested] 102+ messages in thread

* [PATCH 21/23] x86, pcid, kaiser: allow flushing for future ASID switches
  2017-10-31 22:31 [PATCH 00/23] KAISER: unmap most of the kernel from userspace page tables Dave Hansen
                   ` (19 preceding siblings ...)
  2017-10-31 22:32 ` [PATCH 20/23] x86, mm: put mmu-to-h/w ASID translation in one place Dave Hansen
@ 2017-10-31 22:32 ` Dave Hansen
  2017-11-01  8:03   ` Andy Lutomirski
  2017-10-31 22:32 ` [PATCH 22/23] x86, kaiser: use PCID feature to make user and kernel switches faster Dave Hansen
                   ` (5 subsequent siblings)
  26 siblings, 1 reply; 102+ messages in thread
From: Dave Hansen @ 2017-10-31 22:32 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, dave.hansen, moritz.lipp, daniel.gruss,
	michael.schwarz, luto, torvalds, keescook, hughd, x86


If we change the page tables in such a way that we need an
invalidation of all contexts (aka PCIDs / ASIDs), we can
actively invalidate them by:
 1. INVPCID for each PCID (works for single pages too).
 2. Load CR3 with each PCID without the NOFLUSH bit set.
 3. Load CR3 with the NOFLUSH bit set for each and do
    INVLPG for each address.

But, none of these are really feasible since we have ~6 ASIDs (12 with
KAISER) at the time that we need to do an invalidation.  So, we just
invalidate the *current* context and then quickly flag the others as
invalid in cpu_tlbstate.

Then, at the next context-switch, we notice that we had
'all_other_ctxs_invalid' marked, and go invalidate all of the
cpu_tlbstate.ctxs[] entries.

This ensures that any future context switches will do a full flush
of the TLB so they pick up the changes.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Moritz Lipp <moritz.lipp@iaik.tugraz.at>
Cc: Daniel Gruss <daniel.gruss@iaik.tugraz.at>
Cc: Michael Schwarz <michael.schwarz@iaik.tugraz.at>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Kees Cook <keescook@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: x86@kernel.org
---

 b/arch/x86/include/asm/tlbflush.h |   47 +++++++++++++++++++++++++++++---------
 b/arch/x86/mm/tlb.c               |   35 ++++++++++++++++++++++++++++
 2 files changed, 72 insertions(+), 10 deletions(-)

diff -puN arch/x86/include/asm/tlbflush.h~kaiser-pcid-pre-clear-pcid-cache arch/x86/include/asm/tlbflush.h
--- a/arch/x86/include/asm/tlbflush.h~kaiser-pcid-pre-clear-pcid-cache	2017-10-31 15:04:00.268582170 -0700
+++ b/arch/x86/include/asm/tlbflush.h	2017-10-31 15:04:00.275582500 -0700
@@ -183,6 +183,17 @@ struct tlb_state {
 	bool is_lazy;
 
 	/*
+	 * If set we changed the page tables in such a way that we
+	 * needed an invalidation of all contexts (aka. PCIDs / ASIDs).
+	 * This tells us to go invalidate all the non-loaded ctxs[]
+	 * on the next context switch.
+	 *
+	 * The current ctx was kept up-to-date as it ran and does not
+	 * need to be invalidated.
+	 */
+	bool all_other_ctxs_invalid;
+
+	/*
 	 * Access to this CR4 shadow and to H/W CR4 is protected by
 	 * disabling interrupts when modifying either one.
 	 */
@@ -259,6 +270,19 @@ static inline unsigned long cr4_read_sha
 	return this_cpu_read(cpu_tlbstate.cr4);
 }
 
+static inline void tlb_flush_shared_nonglobals(void)
+{
+	/*
+	 * With global pages, all of the shared kernel page tables
+	 * are set as _PAGE_GLOBAL.  We have no shared nonglobals
+	 * and nothing to do here.
+	 */
+	if (IS_ENABLED(CONFIG_X86_GLOBAL_PAGES))
+		return;
+
+	this_cpu_write(cpu_tlbstate.all_other_ctxs_invalid, true);
+}
+
 /*
  * Save some of cr4 feature set we're using (e.g.  Pentium 4MB
  * enable and PPro Global page enable), so that any CPU's that boot
@@ -288,6 +312,10 @@ static inline void __native_flush_tlb(vo
 	preempt_disable();
 	native_write_cr3(__native_read_cr3());
 	preempt_enable();
+	/*
+	 * Does not need tlb_flush_shared_nonglobals() since the CR3 write
+	 * without PCIDs flushes all non-globals.
+	 */
 }
 
 static inline void __native_flush_tlb_global_irq_disabled(void)
@@ -346,24 +374,23 @@ static inline void __native_flush_tlb_si
 
 static inline void __flush_tlb_all(void)
 {
-	if (boot_cpu_has(X86_FEATURE_PGE))
+	if (boot_cpu_has(X86_FEATURE_PGE)) {
 		__flush_tlb_global();
-	else
+	} else {
 		__flush_tlb();
-
-	/*
-	 * Note: if we somehow had PCID but not PGE, then this wouldn't work --
-	 * we'd end up flushing kernel translations for the current ASID but
-	 * we might fail to flush kernel translations for other cached ASIDs.
-	 *
-	 * To avoid this issue, we force PCID off if PGE is off.
-	 */
+		tlb_flush_shared_nonglobals();
+	}
 }
 
 static inline void __flush_tlb_one(unsigned long addr)
 {
 	count_vm_tlb_event(NR_TLB_LOCAL_FLUSH_ONE);
 	__flush_tlb_single(addr);
+	/*
+	 * Invalidate other address spaces inaccessible to single-page
+	 * invalidation:
+	 */
+	tlb_flush_shared_nonglobals();
 }
 
 #define TLB_FLUSH_ALL	-1UL
diff -puN arch/x86/mm/tlb.c~kaiser-pcid-pre-clear-pcid-cache arch/x86/mm/tlb.c
--- a/arch/x86/mm/tlb.c~kaiser-pcid-pre-clear-pcid-cache	2017-10-31 15:04:00.271582311 -0700
+++ b/arch/x86/mm/tlb.c	2017-10-31 15:04:00.275582500 -0700
@@ -28,6 +28,38 @@
  *	Implement flush IPI by CALL_FUNCTION_VECTOR, Alex Shi
  */
 
+/*
+ * We get here when we do something requiring a TLB invalidation
+ * but could not go invalidate all of the contexts.  We do the
+ * necessary invalidation by clearing out the 'ctx_id' which
+ * forces a TLB flush when the context is loaded.
+ */
+void clear_non_loaded_ctxs(void)
+{
+	u16 asid;
+
+	/*
+	 * This is only expected to be set if we have disabled
+	 * kernel _PAGE_GLOBAL pages.
+	 */
+        if (IS_ENABLED(CONFIG_X86_GLOBAL_PAGES)) {
+		WARN_ON_ONCE(1);
+                return;
+	}
+
+	for (asid = 0; asid < TLB_NR_DYN_ASIDS; asid++) {
+		/* Do not need to flush the current asid */
+		if (asid == this_cpu_read(cpu_tlbstate.loaded_mm_asid))
+			continue;
+		/*
+		 * Make sure the next time we go to switch to
+		 * this asid, we do a flush:
+		 */
+		this_cpu_write(cpu_tlbstate.ctxs[asid].ctx_id, 0);
+	}
+	this_cpu_write(cpu_tlbstate.all_other_ctxs_invalid, false);
+}
+
 atomic64_t last_mm_ctx_id = ATOMIC64_INIT(1);
 
 
@@ -42,6 +74,9 @@ static void choose_new_asid(struct mm_st
 		return;
 	}
 
+	if (this_cpu_read(cpu_tlbstate.all_other_ctxs_invalid))
+		clear_non_loaded_ctxs();
+
 	for (asid = 0; asid < TLB_NR_DYN_ASIDS; asid++) {
 		if (this_cpu_read(cpu_tlbstate.ctxs[asid].ctx_id) !=
 		    next->context.ctx_id)
_

^ permalink raw reply	[flat|nested] 102+ messages in thread

* [PATCH 22/23] x86, kaiser: use PCID feature to make user and kernel switches faster
  2017-10-31 22:31 [PATCH 00/23] KAISER: unmap most of the kernel from userspace page tables Dave Hansen
                   ` (20 preceding siblings ...)
  2017-10-31 22:32 ` [PATCH 21/23] x86, pcid, kaiser: allow flushing for future ASID switches Dave Hansen
@ 2017-10-31 22:32 ` Dave Hansen
  2017-10-31 22:32 ` [PATCH 23/23] x86, kaiser: add Kconfig Dave Hansen
                   ` (4 subsequent siblings)
  26 siblings, 0 replies; 102+ messages in thread
From: Dave Hansen @ 2017-10-31 22:32 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, dave.hansen, moritz.lipp, daniel.gruss,
	michael.schwarz, luto, torvalds, keescook, hughd, x86


Short summary: Use x86 PCID feature to avoid flushing the TLB at all
interrupts and syscalls.  Speed them up.  Makes context switches
and TLB flushing slower.

Background:

KAISER keeps two copies of the page tables.  We switch between them
with the CR3 register.  But, CR3 was really designed for context
switches and changing it also flushes the entire TLB (modulo global
pages).  This TLB flush increases the cost of interrupts and context
switches.  For syscall-heavy microbenchmarks it can cut the rate of
syscalls by 2/3.

But, now we have support for an Intel CPU feature called Process
Context IDentifiers (PCID) in the kernel thanks to Andy Lutomirski.
This feature is intended to allow you to switch between contexts
without flushing the TLB.

Implementation:

We can use PCIDs to avoid flushing the TLB at kernel entry/exit.
This speeds up both interrupts and syscalls.

We do this by assigning the kernel and userspace different ASIDs.  On
entry from userspace, we move over to the kernel page tables *and*
ASID.  On exit, we restore the user page tables and ASID.  Fortunately,
the ASID is programmed via CR3, which we are already using to switch
between the page table copies.  So, we get one-stop shopping.
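
As a rough sketch of what the two CR3 values end up looking like
(using the helpers added below; this is illustrative, not the
actual entry code):

	/*
	 * Sketch only: the kernel and user CR3 values differ in bit 12
	 * (which 4k half of the 8k PGD is used) and in bit 11 (the
	 * user/kernel PCID).  Bit 63 (NOFLUSH) is set on both switches
	 * when PCIDs are available so that neither entry nor exit
	 * flushes the TLB.
	 */
	static unsigned long sketch_kern_cr3(pgd_t *pgd, u16 asid)
	{
		return __pa(pgd) | kern_asid(asid) | X86_CR3_PCID_NOFLUSH;
	}

	static unsigned long sketch_user_cr3(pgd_t *pgd, u16 asid)
	{
		/* the user half of the 8k PGD sits one page above the kernel half */
		return (__pa(pgd) + PAGE_SIZE) | user_asid(asid) |
		       X86_CR3_PCID_NOFLUSH;
	}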

In current kernels, CR3 is used to switch between processes, which also
provides all the TLB flushing that we need at a context switch.  But,
with KAISER, that CR3 move only flushes the current (kernel) ASID.  We
need an extra TLB flushing operation to flush the user ASID: invpcid.
This is probably ~100 cycles, but this is done with the assumption that
the time we lose in context switches is more than made up for in
interrupts and syscalls.

Support:

PCIDs are generally available on Sandybridge and newer CPUs.  However,
the accompanying INVPCID instruction did not become available until
Haswell (the ones with "v4", or called fourth-generation Core).  This
instruction allows non-current-PCID TLB entries to be flushed without
switching CR3 and global pages to be flushed without a double
MOV-to-CR4.

Without INVPCID, PCIDs are much harder to use.  TLB invalidation gets
much more onerous:

1. Every kernel TLB flush (even for a single page) requires an
   interrupts-off MOV-to-CR4 which is very expensive.  This is because
   there is no way to flush a kernel address that might be loaded
   in *EVERY* PCID.  Right now, there are "only" ~12 of these per-cpu,
   but it is too painful to use MOV-to-CR3 to flush them all.  That
   leaves only the MOV-to-CR4.
2. Every userspace flush (even for a single page) requires one of the
   following:
   a. A pair of flushing (bit 63 clear) CR3 writes: one for
      the kernel ASID and another for userspace.
   b. A pair of non-flushing CR3 writes (bit 63 set) with the
      flush done for each.  For instance, what is currently a
      single instruction without KAISER:

		invpcid_flush_one(current_pcid, addr);

      becomes this with KAISER:

      		invpcid_flush_one(current_kern_pcid, addr);
		invpcid_flush_one(current_user_pcid, addr);

      and this without INVPCID:

      		__native_flush_tlb_single(addr);
		write_cr3(mm->pgd | current_user_pcid | NOFLUSH);
      		__native_flush_tlb_single(addr);
		write_cr3(mm->pgd | current_kern_pcid | NOFLUSH);

So, for now, we fully disable PCIDs with KAISER when INVPCID is
not available.  This is fixable, but it's an optimization that
we can do later.

Hugh Dickins also points out that PCIDs really have two distinct
use-cases in the context of KAISER.  The first way they can be used
is as "TLB preservation across context-swtich", which is what
Andy Lutomirksi's 4.14 PCID code does.  They can also be used as
a "KAISER syscall/interrupt accelerator".  If we just use them to
speed up syscall/interrupts (and ignore the context-switch TLB
preservation), then the deficiency of not having INVPCID
becomes much less onerous.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Moritz Lipp <moritz.lipp@iaik.tugraz.at>
Cc: Daniel Gruss <daniel.gruss@iaik.tugraz.at>
Cc: Michael Schwarz <michael.schwarz@iaik.tugraz.at>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Kees Cook <keescook@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: x86@kernel.org
---

 b/arch/x86/entry/calling.h                    |   25 +++-
 b/arch/x86/entry/entry_64.S                   |    1 
 b/arch/x86/include/asm/cpufeatures.h          |    1 
 b/arch/x86/include/asm/pgtable_types.h        |   11 ++
 b/arch/x86/include/asm/tlbflush.h             |  141 +++++++++++++++++++++-----
 b/arch/x86/include/uapi/asm/processor-flags.h |    3 
 b/arch/x86/kvm/x86.c                          |    3 
 b/arch/x86/mm/init.c                          |   75 +++++++++----
 b/arch/x86/mm/tlb.c                           |   66 +++++++++++-
 9 files changed, 264 insertions(+), 62 deletions(-)

diff -puN arch/x86/entry/calling.h~kaiser-pcid arch/x86/entry/calling.h
--- a/arch/x86/entry/calling.h~kaiser-pcid	2017-10-31 15:04:00.871610671 -0700
+++ b/arch/x86/entry/calling.h	2017-10-31 15:04:00.895611805 -0700
@@ -2,6 +2,7 @@
 #include <asm/unwind_hints.h>
 #include <asm/cpufeatures.h>
 #include <asm/page_types.h>
+#include <asm/pgtable_types.h>
 
 /*
 
@@ -222,16 +223,20 @@ For 32-bit we have the following convent
 #ifdef CONFIG_KAISER
 
 /* KAISER PGDs are 8k.  We flip bit 12 to switch between the two halves: */
-#define KAISER_SWITCH_MASK (1<<PAGE_SHIFT)
+#define KAISER_SWITCH_PGTABLES_MASK (1<<PAGE_SHIFT)
+#define KAISER_SWITCH_MASK     (KAISER_SWITCH_PGTABLES_MASK|\
+				(1<<X86_CR3_KAISER_SWITCH_BIT))
 
 .macro ADJUST_KERNEL_CR3 reg:req
-	/* Clear "KAISER bit", point CR3 at kernel pagetables: */
-	andq	$(~KAISER_SWITCH_MASK), \reg
+	ALTERNATIVE "", "bts $63, \reg", X86_FEATURE_PCID
+        /* Clear PCID and "KAISER bit", point CR3 at kernel pagetables: */
+	andq    $(~KAISER_SWITCH_MASK), \reg
 .endm
 
 .macro ADJUST_USER_CR3 reg:req
-	/* Move CR3 up a page to the user page tables: */
-	orq	$(KAISER_SWITCH_MASK), \reg
+	ALTERNATIVE "", "bts $63, \reg", X86_FEATURE_PCID
+	/* Set user PCID bit, and move CR3 up a page to the user page tables: */
+	orq     $(KAISER_SWITCH_MASK), \reg
 .endm
 
 .macro SWITCH_TO_KERNEL_CR3 scratch_reg:req
@@ -250,8 +255,14 @@ For 32-bit we have the following convent
 	movq	%cr3, %r\scratch_reg
 	movq	%r\scratch_reg, \save_reg
 	/*
-	 * Is the switch bit zero?  This means the address is
-	 * up in real KAISER patches in a moment.
+         * Is the "switch mask" all zero?  That means that both of
+	 * these are zero:
+	 *
+	 *     1. The user/kernel PCID bit, and
+	 *     2. The user/kernel "bit" that points CR3 to the
+	 *	  bottom half of the 8k PGD
+	 *
+	 * That indicates a kernel CR3 value, not user/shadow.
 	 */
 	testq	$(KAISER_SWITCH_MASK), %r\scratch_reg
 	jz	.Ldone_\@
diff -puN arch/x86/entry/entry_64.S~kaiser-pcid arch/x86/entry/entry_64.S
--- a/arch/x86/entry/entry_64.S~kaiser-pcid	2017-10-31 15:04:00.873610765 -0700
+++ b/arch/x86/entry/entry_64.S	2017-10-31 15:04:00.896611852 -0700
@@ -575,6 +575,7 @@ END(irq_entries_start)
 	 * tracking that we're in kernel mode.
 	 */
 	SWAPGS
+	SWITCH_TO_KERNEL_CR3 scratch_reg=%rax
 
 	/*
 	 * We need to tell lockdep that IRQs are off.  We can't do this until
diff -puN arch/x86/include/asm/cpufeatures.h~kaiser-pcid arch/x86/include/asm/cpufeatures.h
--- a/arch/x86/include/asm/cpufeatures.h~kaiser-pcid	2017-10-31 15:04:00.875610860 -0700
+++ b/arch/x86/include/asm/cpufeatures.h	2017-10-31 15:04:00.896611852 -0700
@@ -193,6 +193,7 @@
 #define X86_FEATURE_CAT_L3	( 7*32+ 4) /* Cache Allocation Technology L3 */
 #define X86_FEATURE_CAT_L2	( 7*32+ 5) /* Cache Allocation Technology L2 */
 #define X86_FEATURE_CDP_L3	( 7*32+ 6) /* Code and Data Prioritization L3 */
+#define X86_FEATURE_INVPCID_SINGLE ( 7*32+ 7) /* Effectively INVPCID && CR4.PCIDE=1 */
 
 #define X86_FEATURE_HW_PSTATE	( 7*32+ 8) /* AMD HW-PState */
 #define X86_FEATURE_PROC_FEEDBACK ( 7*32+ 9) /* AMD ProcFeedbackInterface */
diff -puN arch/x86/include/asm/pgtable_types.h~kaiser-pcid arch/x86/include/asm/pgtable_types.h
--- a/arch/x86/include/asm/pgtable_types.h~kaiser-pcid	2017-10-31 15:04:00.877610954 -0700
+++ b/arch/x86/include/asm/pgtable_types.h	2017-10-31 15:04:00.898611947 -0700
@@ -144,6 +144,17 @@
 			 _PAGE_SOFT_DIRTY)
 #define _HPAGE_CHG_MASK (_PAGE_CHG_MASK | _PAGE_PSE)
 
+/* The ASID is the lower 12 bits of CR3 */
+#define X86_CR3_PCID_ASID_MASK  (_AC((1<<12)-1, UL))
+
+/* Mask for all the PCID-related bits in CR3: */
+#define X86_CR3_PCID_MASK       (X86_CR3_PCID_NOFLUSH | X86_CR3_PCID_ASID_MASK)
+
+/* Make sure this is only usable in KAISER #ifdef'd code: */
+#ifdef CONFIG_KAISER
+#define X86_CR3_KAISER_SWITCH_BIT 11
+#endif
+
 /*
  * The cache modes defined here are used to translate between pure SW usage
  * and the HW defined cache mode bits and/or PAT entries.
diff -puN arch/x86/include/asm/tlbflush.h~kaiser-pcid arch/x86/include/asm/tlbflush.h
--- a/arch/x86/include/asm/tlbflush.h~kaiser-pcid	2017-10-31 15:04:00.879611049 -0700
+++ b/arch/x86/include/asm/tlbflush.h	2017-10-31 15:04:00.898611947 -0700
@@ -77,7 +77,12 @@ static inline u64 inc_mm_tlb_gen(struct
 /* There are 12 bits of space for ASIDS in CR3 */
 #define CR3_HW_ASID_BITS 12
 /* When enabled, KAISER consumes a single bit for user/kernel switches */
+#ifdef CONFIG_KAISER
+#define X86_CR3_KAISER_SWITCH_BIT 11
+#define KAISER_CONSUMED_ASID_BITS 1
+#else
 #define KAISER_CONSUMED_ASID_BITS 0
+#endif
 
 #define CR3_AVAIL_ASID_BITS (CR3_HW_ASID_BITS-KAISER_CONSUMED_ASID_BITS)
 /*
@@ -86,21 +91,62 @@ static inline u64 inc_mm_tlb_gen(struct
  */
 #define NR_AVAIL_ASIDS ((1<<CR3_AVAIL_ASID_BITS) - 1)
 
+/*
+ * 6 because 6 should be plenty and struct tlb_state will fit in
+ * two cache lines.
+ */
+#define TLB_NR_DYN_ASIDS 6
+
 static inline u16 kern_asid(u16 asid)
 {
 	VM_WARN_ON_ONCE(asid >= NR_AVAIL_ASIDS);
+
+#ifdef CONFIG_KAISER
+	/*
+	 * Make sure that the dynamic ASID space does not conflict
+	 * with the bit we are using to switch between user and
+	 * kernel ASIDs.
+	 */
+	BUILD_BUG_ON(TLB_NR_DYN_ASIDS >= (1<<X86_CR3_KAISER_SWITCH_BIT));
+
 	/*
-	 * If PCID is on, ASID-aware code paths put the ASID+1 into the PCID
-	 * bits.  This serves two purposes.  It prevents a nasty situation in
-	 * which PCID-unaware code saves CR3, loads some other value (with PCID
-	 * == 0), and then restores CR3, thus corrupting the TLB for ASID 0 if
-	 * the saved ASID was nonzero.  It also means that any bugs involving
-	 * loading a PCID-enabled CR3 with CR4.PCIDE off will trigger
-	 * deterministically.
+	 * The ASID being passed in here should have respected
+	 * the NR_AVAIL_ASIDS and thus never have the switch
+	 * bit set.
+	 */
+	VM_WARN_ON_ONCE(asid & (1<<X86_CR3_KAISER_SWITCH_BIT));
+#endif
+	/*
+	 * The dynamically-assigned ASIDs that get passed in  are
+	 * small (<TLB_NR_DYN_ASIDS).  They never have the high
+	 * switch bit set, so do not bother to clear it.
+	 */
+
+	/*
+	 * If PCID is on, ASID-aware code paths put the ASID+1
+	 * into the PCID bits.  This serves two purposes.  It
+	 * prevents a nasty situation in which PCID-unaware code
+	 * saves CR3, loads some other value (with PCID == 0),
+	 * and then restores CR3, thus corrupting the TLB for
+	 * ASID 0 if the saved ASID was nonzero.  It also means
+	 * that any bugs involving loading a PCID-enabled CR3
+	 * with CR4.PCIDE off will trigger deterministically.
 	 */
 	return asid + 1;
 }
 
+/*
+ * The user ASID is just the kernel one, plus the "switch bit".
+ */
+static inline u16 user_asid(u16 asid)
+{
+	u16 ret = kern_asid(asid);
+#ifdef CONFIG_KAISER
+	ret |= 1<<X86_CR3_KAISER_SWITCH_BIT;
+#endif
+	return ret;
+}
+
 struct pgd_t;
 static inline unsigned long build_cr3(pgd_t *pgd, u16 asid)
 {
@@ -143,12 +189,6 @@ static inline bool tlb_defer_switch_to_i
 	return !static_cpu_has(X86_FEATURE_PCID);
 }
 
-/*
- * 6 because 6 should be plenty and struct tlb_state will fit in
- * two cache lines.
- */
-#define TLB_NR_DYN_ASIDS 6
-
 struct tlb_context {
 	u64 ctx_id;
 	u64 tlb_gen;
@@ -304,18 +344,42 @@ extern void initialize_tlbstate_and_flus
 
 static inline void __native_flush_tlb(void)
 {
-	/*
-	 * If current->mm == NULL then we borrow a mm which may change during a
-	 * task switch and therefore we must not be preempted while we write CR3
-	 * back:
-	 */
-	preempt_disable();
-	native_write_cr3(__native_read_cr3());
-	preempt_enable();
-	/*
-	 * Does not need tlb_flush_shared_nonglobals() since the CR3 write
-	 * without PCIDs flushes all non-globals.
-	 */
+	if (!cpu_feature_enabled(X86_FEATURE_INVPCID)) {
+		/*
+		 * native_write_cr3() only clears the current PCID if
+		 * CR4 has X86_CR4_PCIDE set.  In other words, this does
+		 * not fully flush the TLB if PCIDs are in use.
+ 		 *
+ 		 * With KAISER and PCIDs, the means that we did not
+		 * flush the user PCID.  Warn if it gets called.
+		 */
+		if (IS_ENABLED(CONFIG_KAISER))
+ 			WARN_ON_ONCE(this_cpu_read(cpu_tlbstate.cr4) &
+ 				     X86_CR4_PCIDE);
+		/*
+		 * If current->mm == NULL then we borrow a mm
+		 * which may change during a task switch and
+		 * therefore we must not be preempted while we
+		 * write CR3 back:
+		 */
+		preempt_disable();
+		native_write_cr3(__native_read_cr3());
+		preempt_enable();
+		/*
+		 * Does not need tlb_flush_shared_nonglobals()
+		 * since the CR3 write without PCIDs flushes all
+		 * non-globals.
+		 */
+		return;
+	}
+ 	/*
+	 * We are no longer using globals with KAISER, so a
+	 * "nonglobals" flush would work too. But, this is more
+	 * conservative.
+	 *
+	 * Note, this works with CR4.PCIDE=0 or 1.
+ 	 */
+	invpcid_flush_all();
 }
 
 static inline void __native_flush_tlb_global_irq_disabled(void)
@@ -350,6 +414,8 @@ static inline void __native_flush_tlb_gl
 		/*
 		 * Using INVPCID is considerably faster than a pair of writes
 		 * to CR4 sandwiched inside an IRQ flag save/restore.
+		 *
+		 * Note, this works with CR4.PCIDE=0 or 1.
 		 */
 		invpcid_flush_all();
 		return;
@@ -369,7 +435,30 @@ static inline void __native_flush_tlb_gl
 
 static inline void __native_flush_tlb_single(unsigned long addr)
 {
-	asm volatile("invlpg (%0)" ::"r" (addr) : "memory");
+	u32 loaded_mm_asid = this_cpu_read(cpu_tlbstate.loaded_mm_asid);
+
+	/*
+	 * Some platforms #GP if we call invpcid(type=1/2) before
+	 * CR4.PCIDE=1.  Just call invpcid in the case we are called
+	 * early.
+	 */
+	if (!this_cpu_has(X86_FEATURE_INVPCID_SINGLE)) {
+		asm volatile("invlpg (%0)" ::"r" (addr) : "memory");
+		return;
+	}
+	/* Flush the address out of both PCIDs. */
+	/*
+	 * An optimization here might be to determine addresses
+	 * that are only kernel-mapped and only flush the kernel
+	 * ASID.  But, userspace flushes are probably much more
+	 * important performance-wise.
+	 *
+	 * Make sure to do only a single invpcid when KAISER is
+	 * disabled and we have only a single ASID.
+	 */
+	if (kern_asid(loaded_mm_asid) != user_asid(loaded_mm_asid))
+		invpcid_flush_one(user_asid(loaded_mm_asid), addr);
+	invpcid_flush_one(kern_asid(loaded_mm_asid), addr);
 }
 
 static inline void __flush_tlb_all(void)
diff -puN arch/x86/include/uapi/asm/processor-flags.h~kaiser-pcid arch/x86/include/uapi/asm/processor-flags.h
--- a/arch/x86/include/uapi/asm/processor-flags.h~kaiser-pcid	2017-10-31 15:04:00.882611191 -0700
+++ b/arch/x86/include/uapi/asm/processor-flags.h	2017-10-31 15:04:00.899611994 -0700
@@ -77,7 +77,8 @@
 #define X86_CR3_PWT		_BITUL(X86_CR3_PWT_BIT)
 #define X86_CR3_PCD_BIT		4 /* Page Cache Disable */
 #define X86_CR3_PCD		_BITUL(X86_CR3_PCD_BIT)
-#define X86_CR3_PCID_MASK	_AC(0x00000fff,UL) /* PCID Mask */
+#define X86_CR3_PCID_NOFLUSH_BIT 63 /* Preserve old PCID */
+#define X86_CR3_PCID_NOFLUSH    _BITULL(X86_CR3_PCID_NOFLUSH_BIT)
 
 /*
  * Intel CPU features in CR4
diff -puN arch/x86/kvm/x86.c~kaiser-pcid arch/x86/kvm/x86.c
--- a/arch/x86/kvm/x86.c~kaiser-pcid	2017-10-31 15:04:00.885611332 -0700
+++ b/arch/x86/kvm/x86.c	2017-10-31 15:04:00.902612136 -0700
@@ -805,7 +805,8 @@ int kvm_set_cr4(struct kvm_vcpu *vcpu, u
 			return 1;
 
 		/* PCID can not be enabled when cr3[11:0]!=000H or EFER.LMA=0 */
-		if ((kvm_read_cr3(vcpu) & X86_CR3_PCID_MASK) || !is_long_mode(vcpu))
+		if ((kvm_read_cr3(vcpu) & X86_CR3_PCID_ASID_MASK) ||
+		    !is_long_mode(vcpu))
 			return 1;
 	}
 
diff -puN arch/x86/mm/init.c~kaiser-pcid arch/x86/mm/init.c
--- a/arch/x86/mm/init.c~kaiser-pcid	2017-10-31 15:04:00.887611427 -0700
+++ b/arch/x86/mm/init.c	2017-10-31 15:04:00.902612136 -0700
@@ -196,34 +196,59 @@ static void __init probe_page_size_mask(
 
 static void setup_pcid(void)
 {
-#ifdef CONFIG_X86_64
-	if (boot_cpu_has(X86_FEATURE_PCID)) {
-		if (boot_cpu_has(X86_FEATURE_PGE)) {
-			/*
-			 * This can't be cr4_set_bits_and_update_boot() --
-			 * the trampoline code can't handle CR4.PCIDE and
-			 * it wouldn't do any good anyway.  Despite the name,
-			 * cr4_set_bits_and_update_boot() doesn't actually
-			 * cause the bits in question to remain set all the
-			 * way through the secondary boot asm.
-			 *
-			 * Instead, we brute-force it and set CR4.PCIDE
-			 * manually in start_secondary().
-			 */
-			cr4_set_bits(X86_CR4_PCIDE);
-		} else {
-			/*
-			 * flush_tlb_all(), as currently implemented, won't
-			 * work if PCID is on but PGE is not.  Since that
-			 * combination doesn't exist on real hardware, there's
-			 * no reason to try to fully support it, but it's
-			 * polite to avoid corrupting data if we're on
-			 * an improperly configured VM.
-			 */
+	if (!IS_ENABLED(CONFIG_X86_64))
+		return;
+
+	if (!boot_cpu_has(X86_FEATURE_PCID))
+		return;
+
+	if (boot_cpu_has(X86_FEATURE_PGE)) {
+		/*
+		 * KAISER uses a PCID for the kernel and another
+		 * for userspace.  Both PCIDs need to be flushed
+		 * when the TLB flush functions are called.  But,
+		 * flushing *another* PCID is insane without
+		 * INVPCID.  Just avoid using PCIDs at all if we
+		 * have KAISER and do not have INVPCID.
+		 */
+		if (!IS_ENABLED(CONFIG_X86_GLOBAL_PAGES) &&
+		    !boot_cpu_has(X86_FEATURE_INVPCID)) {
 			setup_clear_cpu_cap(X86_FEATURE_PCID);
+			return;
 		}
+		/*
+		 * This can't be cr4_set_bits_and_update_boot() --
+		 * the trampoline code can't handle CR4.PCIDE and
+		 * it wouldn't do any good anyway.  Despite the name,
+		 * cr4_set_bits_and_update_boot() doesn't actually
+		 * cause the bits in question to remain set all the
+		 * way through the secondary boot asm.
+		 *
+		 * Instead, we brute-force it and set CR4.PCIDE
+		 * manually in start_secondary().
+		 */
+		cr4_set_bits(X86_CR4_PCIDE);
+
+		/*
+		 * INVPCID's single-context modes (2/3) only work
+		 * if we set X86_CR4_PCIDE, *and* we INVPCID
+		 * support.  It's unusable on systems that have
+		 * X86_CR4_PCIDE clear, or that have no INVPCID
+		 * support at all.
+		 */
+		if (boot_cpu_has(X86_FEATURE_INVPCID))
+			setup_force_cpu_cap(X86_FEATURE_INVPCID_SINGLE);
+	} else {
+		/*
+		 * flush_tlb_all(), as currently implemented, won't
+		 * work if PCID is on but PGE is not.  Since that
+		 * combination doesn't exist on real hardware, there's
+		 * no reason to try to fully support it, but it's
+		 * polite to avoid corrupting data if we're on
+		 * an improperly configured VM.
+		 */
+		setup_clear_cpu_cap(X86_FEATURE_PCID);
 	}
-#endif
 }
 
 #ifdef CONFIG_X86_32
diff -puN arch/x86/mm/tlb.c~kaiser-pcid arch/x86/mm/tlb.c
--- a/arch/x86/mm/tlb.c~kaiser-pcid	2017-10-31 15:04:00.890611569 -0700
+++ b/arch/x86/mm/tlb.c	2017-10-31 15:04:00.903612183 -0700
@@ -100,6 +100,68 @@ static void choose_new_asid(struct mm_st
 	*need_flush = true;
 }
 
+/*
+ * Given a kernel asid, flush the corresponding KAISER
+ * user ASID.
+ */
+static void flush_user_asid(pgd_t *pgd, u16 kern_asid)
+{
+	/* There is no user ASID if KAISER is off */
+	if (!IS_ENABLED(CONFIG_KAISER))
+		return;
+	/*
+	 * We only have a single ASID if PCID is off and the CR3
+	 * write will have flushed it.
+	 */
+	if (!cpu_feature_enabled(X86_FEATURE_PCID))
+		return;
+	/*
+	 * With PCIDs enabled, write_cr3() only flushes TLB
+	 * entries for the current (kernel) ASID.  This leaves
+	 * old TLB entries for the user ASID in place and we must
+	 * flush that context separately.  We can theoretically
+	 * delay doing this until we actually load up the
+	 * userspace CR3, but do it here for simplicity.
+	 */
+	if (cpu_feature_enabled(X86_FEATURE_INVPCID)) {
+		invpcid_flush_single_context(user_asid(kern_asid));
+	} else {
+		/*
+		 * On systems with PCIDs, but no INVPCID, the only
+		 * way to flush a PCID is a CR3 write.  Note that
+		 * we use the kernel page tables with the *user*
+		 * ASID here.
+		 */
+		unsigned long user_asid_flush_cr3;
+		user_asid_flush_cr3 = build_cr3(pgd, user_asid(kern_asid));
+		write_cr3(user_asid_flush_cr3);
+		/*
+		 * We do not use PCIDs with KAISER unless we also
+		 * have INVPCID.  Getting here is unexpected.
+		 */
+		WARN_ON_ONCE(1);
+	}
+}
+
+static void load_new_mm_cr3(pgd_t *pgdir, u16 new_asid, bool need_flush)
+{
+	unsigned long new_mm_cr3;
+
+	if (need_flush) {
+		flush_user_asid(pgdir, new_asid);
+		new_mm_cr3 = build_cr3(pgdir, new_asid);
+	} else {
+		new_mm_cr3 = build_cr3_noflush(pgdir, new_asid);
+	}
+
+	/*
+	 * Caution: many callers of this function expect
+	 * that load_cr3() is serializing and orders TLB
+	 * fills with respect to the mm_cpumask writes.
+	 */
+	write_cr3(new_mm_cr3);
+}
+
 void leave_mm(int cpu)
 {
 	struct mm_struct *loaded_mm = this_cpu_read(cpu_tlbstate.loaded_mm);
@@ -229,12 +291,12 @@ void switch_mm_irqs_off(struct mm_struct
 		if (need_flush) {
 			this_cpu_write(cpu_tlbstate.ctxs[new_asid].ctx_id, next->context.ctx_id);
 			this_cpu_write(cpu_tlbstate.ctxs[new_asid].tlb_gen, next_tlb_gen);
-			write_cr3(build_cr3(next->pgd, new_asid));
+			load_new_mm_cr3(next->pgd, new_asid, true);
 			trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH,
 					TLB_FLUSH_ALL);
 		} else {
 			/* The new ASID is already up to date. */
-			write_cr3(build_cr3_noflush(next->pgd, new_asid));
+			load_new_mm_cr3(next->pgd, new_asid, false);
 			trace_tlb_flush(TLB_FLUSH_ON_TASK_SWITCH, 0);
 		}
 
_

^ permalink raw reply	[flat|nested] 102+ messages in thread

* [PATCH 23/23] x86, kaiser: add Kconfig
  2017-10-31 22:31 [PATCH 00/23] KAISER: unmap most of the kernel from userspace page tables Dave Hansen
                   ` (21 preceding siblings ...)
  2017-10-31 22:32 ` [PATCH 22/23] x86, kaiser: use PCID feature to make user and kernel switches faster Dave Hansen
@ 2017-10-31 22:32 ` Dave Hansen
  2017-10-31 23:59   ` Kees Cook
  2017-10-31 23:27 ` [PATCH 00/23] KAISER: unmap most of the kernel from userspace page tables Linus Torvalds
                   ` (3 subsequent siblings)
  26 siblings, 1 reply; 102+ messages in thread
From: Dave Hansen @ 2017-10-31 22:32 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, dave.hansen, moritz.lipp, daniel.gruss,
	michael.schwarz, luto, torvalds, keescook, hughd, x86


PARAVIRT generally requires that the kernel not manage its own page
tables.  It also means that the hypervisor and kernel must agree
wholeheartedly about what format the page tables are in and what
they contain.  KAISER, unfortunately, changes the rules and they
can not be used together.

I've seen conflicting feedback from maintainers lately about whether
they want the Kconfig magic to go first or last in a patch series.
It's going last here because the partially-applied series leads to
kernels that can not boot in a bunch of cases.  I did a run through
the entire series with CONFIG_KAISER=y to look for build errors,
though.

Note from Hugh Dickins on why it depends on SMP:

	It is absurd that KAISER should depend on SMP, but
	apparently nobody has tried a UP build before: which
	breaks on implicit declaration of function
	'per_cpu_offset' in arch/x86/mm/kaiser.c.

	Now, you would expect that to be trivially fixed up; but
	looking at the System.map when that block is #ifdef'ed
	out of kaiser_init(), I see that in a UP build
	__per_cpu_user_mapped_end is precisely at
	__per_cpu_user_mapped_start, and the items carefully
	gathered into that section for user-mapping on SMP,
	dispersed elsewhere on UP.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Moritz Lipp <moritz.lipp@iaik.tugraz.at>
Cc: Daniel Gruss <daniel.gruss@iaik.tugraz.at>
Cc: Michael Schwarz <michael.schwarz@iaik.tugraz.at>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Kees Cook <keescook@google.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: x86@kernel.org
---

 b/security/Kconfig |   10 ++++++++++
 1 file changed, 10 insertions(+)

diff -puN security/Kconfig~kaiser-kconfig security/Kconfig
--- a/security/Kconfig~kaiser-kconfig	2017-10-31 15:04:01.680648908 -0700
+++ b/security/Kconfig	2017-10-31 15:04:01.684649097 -0700
@@ -54,6 +54,16 @@ config SECURITY_NETWORK
 	  implement socket and networking access controls.
 	  If you are unsure how to answer this question, answer N.
 
+config KAISER
+	bool "Remove the kernel mapping in user mode"
+	depends on X86_64 && SMP && !PARAVIRT
+	help
+	  This feature reduces the number of hardware side channels by
+	  ensuring that the majority of kernel addresses are not mapped
+	  into userspace.
+
+	  See Documentation/x86/kaiser.txt for more details.
+
 config SECURITY_INFINIBAND
 	bool "Infiniband Security Hooks"
 	depends on SECURITY && INFINIBAND
_

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH 00/23] KAISER: unmap most of the kernel from userspace page tables
  2017-10-31 22:31 [PATCH 00/23] KAISER: unmap most of the kernel from userspace page tables Dave Hansen
                   ` (22 preceding siblings ...)
  2017-10-31 22:32 ` [PATCH 23/23] x86, kaiser: add Kconfig Dave Hansen
@ 2017-10-31 23:27 ` Linus Torvalds
  2017-10-31 23:44   ` Dave Hansen
  2017-11-01 15:53   ` Dave Hansen
  2017-11-01  8:54 ` Ingo Molnar
                   ` (2 subsequent siblings)
  26 siblings, 2 replies; 102+ messages in thread
From: Linus Torvalds @ 2017-10-31 23:27 UTC (permalink / raw)
  To: Dave Hansen, Andy Lutomirski
  Cc: Linux Kernel Mailing List, linux-mm, Kees Cook, Hugh Dickins

Inconveniently, the people you cc'd on the actual patches did *not*
get cc'd with this 00/23 cover letter email.

Also, the documentation was then hidden in patch 07/23, which wasn't
exactly obvious.

So I'd like this to be presented a bit differently.

That said, a couple of comments/questions on this version of the patch series..

 (a) is this on top of Andy's entry cleanups?

     If not, that probably needs to be sorted out.

 (b) the TLB global bit really is nastily done. You basically disable
_PAGE_GLOBAL entirely.

     I can see how/why that would make things simpler, but it's almost
certainly the wrong approach. The small subset of kernel pages that
are always mapped should definitely retain the global bit, so that you
don't always take a TLB miss on those! Those are probably some of the
most latency-critical pages, since there's generally no prefetching
for the kernel entry code or for things like IDT/GDT accesses..

     So even if you don't want to have global pages for normal kernel
entries, you don't want to just make _PAGE_GLOBAL be defined as zero.
You'd want to just use _PAGE_GLOBAL conditionally.

     Hmm?

 (c) am I reading the code correctly, and the shadow page tables are
*completely* duplicated?

     That seems insane. Why isn't only the top level shadowed, and
then lower levels are shared between the shadowed and the "kernel"
page tables?

     But I may be mis-reading the code completely.

Apart from those three questions, I don't see any huge downside to the
patch series, apart from the obvious performance/complexity issues.

              Linus

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH 05/23] x86, mm: document X86_CR4_PGE toggling behavior
  2017-10-31 22:31 ` [PATCH 05/23] x86, mm: document X86_CR4_PGE toggling behavior Dave Hansen
@ 2017-10-31 23:31   ` Kees Cook
  0 siblings, 0 replies; 102+ messages in thread
From: Kees Cook @ 2017-10-31 23:31 UTC (permalink / raw)
  To: Dave Hansen
  Cc: LKML, Linux-MM, moritz.lipp, daniel.gruss, michael.schwarz,
	Andy Lutomirski, Linus Torvalds, Hugh Dickins, x86

On Tue, Oct 31, 2017 at 3:31 PM, Dave Hansen
<dave.hansen@linux.intel.com> wrote:
>
> The comment says it all here.  The problem here is that the
> X86_CR4_PGE bit affects all PCIDs in a way that is totally
> obscure.
>
> This makes it easier for someone to find if grepping for PCID-
> related stuff and documents the hardware behavior that we are
> depending on.
>
> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: Moritz Lipp <moritz.lipp@iaik.tugraz.at>
> Cc: Daniel Gruss <daniel.gruss@iaik.tugraz.at>
> Cc: Michael Schwarz <michael.schwarz@iaik.tugraz.at>
> Cc: Andy Lutomirski <luto@kernel.org>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: Kees Cook <keescook@google.com>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: x86@kernel.org
> ---
>
>  b/arch/x86/include/asm/tlbflush.h |    6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
>
> diff -puN arch/x86/include/asm/tlbflush.h~kaiser-prep-document-cr4-pge-behavior arch/x86/include/asm/tlbflush.h
> --- a/arch/x86/include/asm/tlbflush.h~kaiser-prep-document-cr4-pge-behavior     2017-10-31 15:03:50.479119470 -0700
> +++ b/arch/x86/include/asm/tlbflush.h   2017-10-31 15:03:50.482119612 -0700
> @@ -258,9 +258,11 @@ static inline void __native_flush_tlb_gl
>         WARN_ON_ONCE(!(cr4 & X86_CR4_PGE));
>         /*
>          * Architecturally, any _change_ to X86_CR4_PGE will fully flush the
> -        * TLB of all entries including all entries in all PCIDs and all
> -        * global pages.  Make sure that we _change_ the bit, regardless of
> +        * all entries.  Make sure that we _change_ the bit, regardless of

nit: "... flush the all entries." Drop "the" in the line above?

>          * whether we had X86_CR4_PGE set in the first place.
> +        *
> +        * Note that just toggling PGE *also* flushes all entries from all
> +        * PCIDs, regardless of the state of X86_CR4_PCIDE.
>          */
>         native_write_cr4(cr4 ^ X86_CR4_PGE);
>         /* Put original CR3 value back: */

pre-existing nit: s/CR3/CR4/

-Kees

-- 
Kees Cook
Pixel Security

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH 08/23] x86, kaiser: only populate shadow page tables for userspace
  2017-10-31 22:32 ` [PATCH 08/23] x86, kaiser: only populate shadow page tables for userspace Dave Hansen
@ 2017-10-31 23:35   ` Kees Cook
  0 siblings, 0 replies; 102+ messages in thread
From: Kees Cook @ 2017-10-31 23:35 UTC (permalink / raw)
  To: Dave Hansen
  Cc: LKML, Linux-MM, moritz.lipp, daniel.gruss, michael.schwarz,
	Andy Lutomirski, Linus Torvalds, Hugh Dickins, x86

On Tue, Oct 31, 2017 at 3:32 PM, Dave Hansen
<dave.hansen@linux.intel.com> wrote:
> KAISER has two copies of the page tables: one for the kernel and
> one for when we are running in userspace.  There is also a kernel
> portion of each of the page tables: the part that *maps* the
> kernel.

I wonder if it might make sense to update
arch/x86/mm/debug_pagetables.c to show the shadow table in some way?
Right now, only the "real" page tables are visible there.

-Kees

-- 
Kees Cook
Pixel Security

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH 00/23] KAISER: unmap most of the kernel from userspace page tables
  2017-10-31 23:27 ` [PATCH 00/23] KAISER: unmap most of the kernel from userspace page tables Linus Torvalds
@ 2017-10-31 23:44   ` Dave Hansen
  2017-11-01  0:21     ` Dave Hansen
                       ` (2 more replies)
  2017-11-01 15:53   ` Dave Hansen
  1 sibling, 3 replies; 102+ messages in thread
From: Dave Hansen @ 2017-10-31 23:44 UTC (permalink / raw)
  To: Linus Torvalds, Andy Lutomirski
  Cc: Linux Kernel Mailing List, linux-mm, Kees Cook, Hugh Dickins

On 10/31/2017 04:27 PM, Linus Torvalds wrote:
> Inconveniently, the people you cc'd on the actual patches did *not*
> get cc'd with this 00/23 cover letter email.

Urg, sorry about that.

>  (a) is this on top of Andy's entry cleanups?
> 
>      If not, that probably needs to be sorted out.

It is not.  However, I did a version on top of his earlier cleanups, so
I know this can be easily ported on top of them.  It didn't make a major
difference in the number of places that KAISER had to patch, unfortunately.

>  (b) the TLB global bit really is nastily done. You basically disable
> _PAGE_GLOBAL entirely.
> 
>      I can see how/why that would make things simpler, but it's almost
> certainly the wrong approach. The small subset of kernel pages that
> are always mapped should definitely retain the global bit, so that you
> don't always take a TLB miss on those! Those are probably some of the
> most latency-critical pages, since there's generally no prefetching
> for the kernel entry code or for things like IDT/GDT accesses..
> 
>      So even if you don't want to have global pages for normal kernel
> entries, you don't want to just make _PAGE_GLOBAL be defined as zero.
> You'd want to just use _PAGE_GLOBAL conditionally.
> 
>      Hmm?

That's a good point.  Shouldn't be hard to implement at all.  We'll just
need to take _PAGE_GLOBAL out of the default _KERNPG_TABLE definition, I
think.

>  (c) am I reading the code correctly, and the shadow page tables are
> *completely* duplicated?
> 
>      That seems insane. Why isn't only tyhe top level shadowed, and
> then lower levels are shared between the shadowed and the "kernel"
> page tables?

There are obviously two PGDs.  The userspace half of the PGD is an exact
copy so all the lower levels are shared.  You can see this bit in the
memcpy that we do in clone_pgd_range().

For the kernel half, we don't share any of the lower levels.  That's
mostly because the stuff that we're mapping into the user/shadow copy is
only 4k aligned and (probably) never >2MB, so there's really no
opportunity to share.
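
Roughly, the sharing works like this (a sketch of the idea, not the
exact kernel code):

	/*
	 * Sketch only: copying the user half of the PGD copies pointers
	 * to the lower-level tables, so the p4d/pud/pmd/pte pages are
	 * shared between the kernel and shadow copies.  The kernel half
	 * of the shadow PGD is populated separately with the minimal
	 * entry/exit mappings.
	 */
	static void sketch_share_user_half(pgd_t *shadow_pgd, pgd_t *kernel_pgd)
	{
		int user_entries = PTRS_PER_PGD / 2;	/* low half covers userspace */

		memcpy(shadow_pgd, kernel_pgd, user_entries * sizeof(pgd_t));
	}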

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH 23/23] x86, kaiser: add Kconfig
  2017-10-31 22:32 ` [PATCH 23/23] x86, kaiser: add Kconfig Dave Hansen
@ 2017-10-31 23:59   ` Kees Cook
  2017-11-01  9:07     ` Borislav Petkov
  0 siblings, 1 reply; 102+ messages in thread
From: Kees Cook @ 2017-10-31 23:59 UTC (permalink / raw)
  To: Dave Hansen
  Cc: LKML, Linux-MM, moritz.lipp, daniel.gruss, michael.schwarz,
	Andy Lutomirski, Linus Torvalds, Hugh Dickins, x86

On Tue, Oct 31, 2017 at 3:32 PM, Dave Hansen
<dave.hansen@linux.intel.com> wrote:
>
> PARAVIRT generally requires that the kernel not manage its own page
> tables.  It also means that the hypervisor and kernel must agree
> wholeheartedly about what format the page tables are in and what
> they contain.  KAISER, unfortunately, changes the rules and they
> can not be used together.

A quick look through "#ifdef CONFIG_KAISER" looks like it might be
possible to make this a runtime setting at some point. When doing
KASLR, it was much more useful to make this runtime selectable so that
distro kernels could build the support in, but let users decide if
they wanted to enable it.

> I've seen conflicting feedback from maintainers lately about whether
> they want the Kconfig magic to go first or last in a patch series.
> It's going last here because the partially-applied series leads to
> kernels that can not boot in a bunch of cases.  I did a run through
> the entire series with CONFIG_KAISER=y to look for build errors,
> though.

Yeah, I think last tends to be the best, though it isn't great for
debugging. Doing it earlier, though, tends to lead to a lot of
confusion about whether some feature is actually operating sanely or
not.

-Kees

> Note from Hugh Dickins on why it depends on SMP:
>
>         It is absurd that KAISER should depend on SMP, but
>         apparently nobody has tried a UP build before: which
>         breaks on implicit declaration of function
>         'per_cpu_offset' in arch/x86/mm/kaiser.c.
>
>         Now, you would expect that to be trivially fixed up; but
>         looking at the System.map when that block is #ifdef'ed
>         out of kaiser_init(), I see that in a UP build
>         __per_cpu_user_mapped_end is precisely at
>         __per_cpu_user_mapped_start, and the items carefully
>         gathered into that section for user-mapping on SMP,
>         dispersed elsewhere on UP.
>
> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: Moritz Lipp <moritz.lipp@iaik.tugraz.at>
> Cc: Daniel Gruss <daniel.gruss@iaik.tugraz.at>
> Cc: Michael Schwarz <michael.schwarz@iaik.tugraz.at>
> Cc: Andy Lutomirski <luto@kernel.org>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: Kees Cook <keescook@google.com>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: x86@kernel.org
> ---
>
>  b/security/Kconfig |   10 ++++++++++
>  1 file changed, 10 insertions(+)
>
> diff -puN security/Kconfig~kaiser-kconfig security/Kconfig
> --- a/security/Kconfig~kaiser-kconfig   2017-10-31 15:04:01.680648908 -0700
> +++ b/security/Kconfig  2017-10-31 15:04:01.684649097 -0700
> @@ -54,6 +54,16 @@ config SECURITY_NETWORK
>           implement socket and networking access controls.
>           If you are unsure how to answer this question, answer N.
>
> +config KAISER
> +       bool "Remove the kernel mapping in user mode"
> +       depends on X86_64 && SMP && !PARAVIRT
> +       help
> +         This feature reduces the number of hardware side channels by
> +         ensuring that the majority of kernel addresses are not mapped
> +         into userspace.
> +
> +         See Documentation/x86/kaiser.txt for more details.
> +
>  config SECURITY_INFINIBAND
>         bool "Infiniband Security Hooks"
>         depends on SECURITY && INFINIBAND
> _



-- 
Kees Cook
Pixel Security

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH 00/23] KAISER: unmap most of the kernel from userspace page tables
  2017-10-31 23:44   ` Dave Hansen
@ 2017-11-01  0:21     ` Dave Hansen
  2017-11-01  7:59     ` Andy Lutomirski
  2017-11-01 16:08     ` Linus Torvalds
  2 siblings, 0 replies; 102+ messages in thread
From: Dave Hansen @ 2017-11-01  0:21 UTC (permalink / raw)
  To: Dave Hansen, Linus Torvalds, Andy Lutomirski
  Cc: Linux Kernel Mailing List, linux-mm, Kees Cook, Hugh Dickins

On 10/31/2017 04:44 PM, Dave Hansen wrote:
>>      That seems insane. Why isn't only the top level shadowed, and
>> then lower levels are shared between the shadowed and the "kernel"
>> page tables?
> There are obviously two PGDs.  The userspace half of the PGD is an exact
> copy so all the lower levels are shared.  You can see this bit in the
> memcpy that we do in clone_pgd_range().

This is wrong.

The userspace copying is done via the code we add to native_set_pgd().
Whenever we set the kernel PGD, we also make sure to make a
corresponding entry in the user/shadow PGD.

The memcpy() that I was talking about does the kernel portion of the PGD.
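
In rough pseudocode, the idea is (both helpers below are illustrative
names, not necessarily the ones the series uses):

	/*
	 * Sketch: any PGD write that maps userspace is mirrored into the
	 * shadow PGD, so everything below the PGD stays shared.
	 */
	static inline void set_pgd_and_mirror(pgd_t *pgdp, pgd_t pgd)
	{
		if (pgdp_maps_userspace(pgdp))			/* hypothetical */
			shadow_pgdp(pgdp)->pgd = pgd.pgd;	/* hypothetical */

		pgdp->pgd = pgd.pgd;
	}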

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH 01/23] x86, kaiser: prepare assembly for entry/exit CR3 switching
  2017-10-31 22:31 ` [PATCH 01/23] x86, kaiser: prepare assembly for entry/exit CR3 switching Dave Hansen
@ 2017-11-01  0:43   ` Brian Gerst
  2017-11-01  1:08     ` Dave Hansen
  2017-11-01 18:18   ` Borislav Petkov
  2017-11-01 21:01   ` Thomas Gleixner
  2 siblings, 1 reply; 102+ messages in thread
From: Brian Gerst @ 2017-11-01  0:43 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Linux Kernel Mailing List, Linux-MM, moritz.lipp, daniel.gruss,
	michael.schwarz, Andy Lutomirski, Linus Torvalds, Kees Cook,
	hughd, the arch/x86 maintainers

On Tue, Oct 31, 2017 at 6:31 PM, Dave Hansen
<dave.hansen@linux.intel.com> wrote:
>
> This is largely code from Andy Lutomirski.  I fixed a few bugs
> in it, and added a few SWITCH_TO_* spots.
>
> KAISER needs to switch to a different CR3 value when it enters
> the kernel and switch back when it exits.  This essentially
> needs to be done before we leave assembly code.
>
> This is extra challenging because the context in which we have to
> make this switch is tricky: the registers we are allowed to
> clobber can vary.  It's also hard to store things on the stack
> because there are already things on it with an established ABI
> (ptregs) or the stack is unsafe to use at all.
>
> This patch establishes a set of macros that allow changing to
> the user and kernel CR3 values, but do not actually switch
> CR3.  The code will, however, clobber the registers that it
> says it will and also does perform *writes* to CR3.  So, this
> patch by itself tests that the registers we are clobbering
> and restoring from are OK, and that things like our stack
> manipulation are in safe places.
>
> In other words, if you bisect to here, this *does* introduce
> changes that can break things.
>
> Interactions with SWAPGS: previous versions of the KAISER code
> relied on having per-cpu scratch space so we have a register
> to clobber for our CR3 MOV.  The %GS register is what we use
> to index into our per-cpu space, so SWAPGS *had* to be done
> before the CR3 switch.  That scratch space is gone now, but we
> still keep the semantic that SWAPGS must be done before the
> CR3 MOV.  This is good to keep because it is not that hard to
> do and it allows us to do things like add per-cpu debugging
> information to help us figure out what goes wrong sometimes.
>
> What this does in the NMI code is worth pointing out.  NMIs
> can interrupt *any* context and they can also be nested with
> NMIs interrupting other NMIs.  The comments below
> ".Lnmi_from_kernel" explain the format of the stack that we
> have to deal with this situation.  Changing the format of
> this stack is not a fun exercise: I tried.  Instead of
> storing the old CR3 value on the stack, we depend on the
> *regular* register save/restore mechanism and then use %r14
> to keep CR3 during the NMI.  It will not be clobbered by the
> C NMI handlers that get called.
>
> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: Moritz Lipp <moritz.lipp@iaik.tugraz.at>
> Cc: Daniel Gruss <daniel.gruss@iaik.tugraz.at>
> Cc: Michael Schwarz <michael.schwarz@iaik.tugraz.at>
> Cc: Andy Lutomirski <luto@kernel.org>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: Kees Cook <keescook@google.com>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: x86@kernel.org
> ---
>
>  b/arch/x86/entry/calling.h         |   40 +++++++++++++++++++++++++++++++++++++
>  b/arch/x86/entry/entry_64.S        |   33 +++++++++++++++++++++++++-----
>  b/arch/x86/entry/entry_64_compat.S |   13 ++++++++++++
>  3 files changed, 81 insertions(+), 5 deletions(-)
>
> diff -puN arch/x86/entry/calling.h~kaiser-luto-base-cr3-work arch/x86/entry/calling.h
> --- a/arch/x86/entry/calling.h~kaiser-luto-base-cr3-work        2017-10-31 15:03:48.105007253 -0700
> +++ b/arch/x86/entry/calling.h  2017-10-31 15:03:48.113007631 -0700
> @@ -1,5 +1,6 @@
>  #include <linux/jump_label.h>
>  #include <asm/unwind_hints.h>
> +#include <asm/cpufeatures.h>
>
>  /*
>
> @@ -217,6 +218,45 @@ For 32-bit we have the following convent
>  #endif
>  .endm
>
> +.macro ADJUST_KERNEL_CR3 reg:req
> +.endm
> +
> +.macro ADJUST_USER_CR3 reg:req
> +.endm
> +
> +.macro SWITCH_TO_KERNEL_CR3 scratch_reg:req
> +       mov     %cr3, \scratch_reg
> +       ADJUST_KERNEL_CR3 \scratch_reg
> +       mov     \scratch_reg, %cr3
> +.endm
> +
> +.macro SWITCH_TO_USER_CR3 scratch_reg:req
> +       mov     %cr3, \scratch_reg
> +       ADJUST_USER_CR3 \scratch_reg
> +       mov     \scratch_reg, %cr3
> +.endm
> +
> +.macro SAVE_AND_SWITCH_TO_KERNEL_CR3 scratch_reg:req save_reg:req
> +       movq    %cr3, %r\scratch_reg
> +       movq    %r\scratch_reg, \save_reg
> +       /*
> +        * Just stick a random bit in here that never gets set.  Fixed
> +        * up in real KAISER patches in a moment.
> +        */
> +       bt      $63, %r\scratch_reg
> +       jz      .Ldone_\@
> +
> +       ADJUST_KERNEL_CR3 %r\scratch_reg
> +       movq    %r\scratch_reg, %cr3
> +
> +.Ldone_\@:
> +.endm
> +
> +.macro RESTORE_CR3 save_reg:req
> +       /* optimize this */
> +       movq    \save_reg, %cr3
> +.endm
> +
>  #endif /* CONFIG_X86_64 */
>
>  /*
> diff -puN arch/x86/entry/entry_64_compat.S~kaiser-luto-base-cr3-work arch/x86/entry/entry_64_compat.S
> --- a/arch/x86/entry/entry_64_compat.S~kaiser-luto-base-cr3-work        2017-10-31 15:03:48.107007348 -0700
> +++ b/arch/x86/entry/entry_64_compat.S  2017-10-31 15:03:48.113007631 -0700
> @@ -48,8 +48,13 @@
>  ENTRY(entry_SYSENTER_compat)
>         /* Interrupts are off on entry. */
>         SWAPGS_UNSAFE_STACK
> +
>         movq    PER_CPU_VAR(cpu_current_top_of_stack), %rsp
>
> +       pushq   %rdi
> +       SWITCH_TO_KERNEL_CR3 scratch_reg=%rdi
> +       popq    %rdi
> +
>         /*
>          * User tracing code (ptrace or signal handlers) might assume that
>          * the saved RAX contains a 32-bit number when we're invoking a 32-bit
> @@ -91,6 +96,9 @@ ENTRY(entry_SYSENTER_compat)
>         pushq   $0                      /* pt_regs->r15 = 0 */
>         cld
>
> +       pushq   %rdi
> +       SWITCH_TO_KERNEL_CR3 scratch_reg=%rdi
> +       popq    %rdi
>         /*
>          * SYSENTER doesn't filter flags, so we need to clear NT and AC
>          * ourselves.  To save a few cycles, we can check whether
> @@ -214,6 +222,8 @@ GLOBAL(entry_SYSCALL_compat_after_hwfram
>         pushq   $0                      /* pt_regs->r14 = 0 */
>         pushq   $0                      /* pt_regs->r15 = 0 */
>
> +       SWITCH_TO_KERNEL_CR3 scratch_reg=%rdi
> +
>         /*
>          * User mode is traced as though IRQs are on, and SYSENTER
>          * turned them off.
> @@ -240,6 +250,7 @@ sysret32_from_system_call:
>         popq    %rsi                    /* pt_regs->si */
>         popq    %rdi                    /* pt_regs->di */
>
> +       SWITCH_TO_USER_CR3 scratch_reg=%r8
>          /*
>           * USERGS_SYSRET32 does:
>           *  GSBASE = user's GS base
> @@ -324,6 +335,7 @@ ENTRY(entry_INT80_compat)
>         pushq   %r15                    /* pt_regs->r15 */
>         cld
>
> +       SWITCH_TO_KERNEL_CR3 scratch_reg=%r11
>         /*
>          * User mode is traced as though IRQs are on, and the interrupt
>          * gate turned them off.
> @@ -337,6 +349,7 @@ ENTRY(entry_INT80_compat)
>         /* Go back to user mode. */
>         TRACE_IRQS_ON
>         SWAPGS
> +       SWITCH_TO_USER_CR3 scratch_reg=%r11
>         jmp     restore_regs_and_iret
>  END(entry_INT80_compat)
>
> diff -puN arch/x86/entry/entry_64.S~kaiser-luto-base-cr3-work arch/x86/entry/entry_64.S
> --- a/arch/x86/entry/entry_64.S~kaiser-luto-base-cr3-work       2017-10-31 15:03:48.109007442 -0700
> +++ b/arch/x86/entry/entry_64.S 2017-10-31 15:03:48.115007726 -0700
> @@ -147,8 +147,6 @@ ENTRY(entry_SYSCALL_64)
>         movq    %rsp, PER_CPU_VAR(rsp_scratch)
>         movq    PER_CPU_VAR(cpu_current_top_of_stack), %rsp
>
> -       TRACE_IRQS_OFF
> -
>         /* Construct struct pt_regs on stack */
>         pushq   $__USER_DS                      /* pt_regs->ss */
>         pushq   PER_CPU_VAR(rsp_scratch)        /* pt_regs->sp */
> @@ -169,6 +167,13 @@ GLOBAL(entry_SYSCALL_64_after_hwframe)
>         sub     $(6*8), %rsp                    /* pt_regs->bp, bx, r12-15 not saved */
>         UNWIND_HINT_REGS extra=0
>
> +       /* NB: right here, all regs except r11 are live. */
> +
> +       SWITCH_TO_KERNEL_CR3 scratch_reg=%r11
> +
> +       /* Must wait until we have the kernel CR3 to call C functions: */
> +       TRACE_IRQS_OFF
> +
>         /*
>          * If we need to do entry work or if we guess we'll need to do
>          * exit work, go straight to the slow path.
> @@ -220,6 +225,7 @@ entry_SYSCALL_64_fastpath:
>         TRACE_IRQS_ON           /* user mode is traced as IRQs on */
>         movq    RIP(%rsp), %rcx
>         movq    EFLAGS(%rsp), %r11
> +       SWITCH_TO_USER_CR3 scratch_reg=%rdi
>         RESTORE_C_REGS_EXCEPT_RCX_R11
>         movq    RSP(%rsp), %rsp
>         UNWIND_HINT_EMPTY
> @@ -313,6 +319,7 @@ return_from_SYSCALL_64:
>          * perf profiles. Nothing jumps here.
>          */
>  syscall_return_via_sysret:
> +       SWITCH_TO_USER_CR3 scratch_reg=%rdi
>         /* rcx and r11 are already restored (see code above) */
>         RESTORE_C_REGS_EXCEPT_RCX_R11
>         movq    RSP(%rsp), %rsp
> @@ -320,6 +327,7 @@ syscall_return_via_sysret:
>         USERGS_SYSRET64
>
>  opportunistic_sysret_failed:
> +       SWITCH_TO_USER_CR3 scratch_reg=%rdi
>         SWAPGS
>         jmp     restore_c_regs_and_iret
>  END(entry_SYSCALL_64)
> @@ -422,6 +430,7 @@ ENTRY(ret_from_fork)
>         movq    %rsp, %rdi
>         call    syscall_return_slowpath /* returns with IRQs disabled */
>         TRACE_IRQS_ON                   /* user mode is traced as IRQS on */
> +       SWITCH_TO_USER_CR3 scratch_reg=%rdi
>         SWAPGS
>         jmp     restore_regs_and_iret
>
> @@ -611,6 +620,7 @@ GLOBAL(retint_user)
>         mov     %rsp,%rdi
>         call    prepare_exit_to_usermode
>         TRACE_IRQS_IRETQ
> +       SWITCH_TO_USER_CR3 scratch_reg=%rdi
>         SWAPGS
>         jmp     restore_regs_and_iret
>
> @@ -1091,7 +1101,11 @@ ENTRY(paranoid_entry)
>         js      1f                              /* negative -> in kernel */
>         SWAPGS
>         xorl    %ebx, %ebx
> -1:     ret
> +
> +1:
> +       SAVE_AND_SWITCH_TO_KERNEL_CR3 scratch_reg=ax save_reg=%r14
> +
> +       ret
>  END(paranoid_entry)
>
>  /*
> @@ -1118,6 +1132,7 @@ ENTRY(paranoid_exit)
>  paranoid_exit_no_swapgs:
>         TRACE_IRQS_IRETQ_DEBUG
>  paranoid_exit_restore:
> +       RESTORE_CR3     %r14
>         RESTORE_EXTRA_REGS
>         RESTORE_C_REGS
>         REMOVE_PT_GPREGS_FROM_STACK 8
> @@ -1144,6 +1159,9 @@ ENTRY(error_entry)
>          */
>         SWAPGS
>
> +       /* We have user CR3.  Change to kernel CR3. */
> +       SWITCH_TO_KERNEL_CR3 scratch_reg=%rax
> +
>  .Lerror_entry_from_usermode_after_swapgs:
>         /*
>          * We need to tell lockdep that IRQs are off.  We can't do this until
> @@ -1190,9 +1208,10 @@ ENTRY(error_entry)
>
>  .Lerror_bad_iret:
>         /*
> -        * We came from an IRET to user mode, so we have user gsbase.
> -        * Switch to kernel gsbase:
> +        * We came from an IRET to user mode, so we have user
> +        * gsbase and CR3.  Switch to kernel gsbase and CR3:
>          */
> +       SWITCH_TO_KERNEL_CR3 scratch_reg=%rax
>         SWAPGS
>
>         /*
> @@ -1313,6 +1332,7 @@ ENTRY(nmi)
>         UNWIND_HINT_REGS
>         ENCODE_FRAME_POINTER
>
> +       SWITCH_TO_KERNEL_CR3 scratch_reg=%rdi
>         /*
>          * At this point we no longer need to worry about stack damage
>          * due to nesting -- we're on the normal thread stack and we're
> @@ -1328,6 +1348,7 @@ ENTRY(nmi)
>          * work, because we don't want to enable interrupts.
>          */
>         SWAPGS
> +       SWITCH_TO_USER_CR3 scratch_reg=%rdi
>         jmp     restore_regs_and_iret
>
>  .Lnmi_from_kernel:
> @@ -1538,6 +1559,8 @@ end_repeat_nmi:
>         movq    $-1, %rsi
>         call    do_nmi
>
> +       RESTORE_CR3 save_reg=%r14
> +
>         testl   %ebx, %ebx                      /* swapgs needed? */
>         jnz     nmi_restore
>  nmi_swapgs:
> _

This all needs to be conditional on a config option.  Something with
this amount of performance impact needs to be 100% optional.

--
Brian Gerst

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH 01/23] x86, kaiser: prepare assembly for entry/exit CR3 switching
  2017-11-01  0:43   ` Brian Gerst
@ 2017-11-01  1:08     ` Dave Hansen
  0 siblings, 0 replies; 102+ messages in thread
From: Dave Hansen @ 2017-11-01  1:08 UTC (permalink / raw)
  To: Brian Gerst
  Cc: Linux Kernel Mailing List, Linux-MM, moritz.lipp, daniel.gruss,
	michael.schwarz, Andy Lutomirski, Linus Torvalds, Kees Cook,
	hughd, the arch/x86 maintainers

On 10/31/2017 05:43 PM, Brian Gerst wrote:
>>
>> +       RESTORE_CR3 save_reg=%r14
>> +
>>         testl   %ebx, %ebx                      /* swapgs needed? */
>>         jnz     nmi_restore
>>  nmi_swapgs:
>> _
> This all needs to be conditional on a config option.  Something with
> this amount of performance impact needs to be 100% optional.

The 07/23 patch does just this.  I should have at least called that out
in the description.

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH 00/23] KAISER: unmap most of the kernel from userspace page tables
  2017-10-31 23:44   ` Dave Hansen
  2017-11-01  0:21     ` Dave Hansen
@ 2017-11-01  7:59     ` Andy Lutomirski
  2017-11-01 16:08     ` Linus Torvalds
  2 siblings, 0 replies; 102+ messages in thread
From: Andy Lutomirski @ 2017-11-01  7:59 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Linus Torvalds, Andy Lutomirski, Linux Kernel Mailing List,
	linux-mm, Kees Cook, Hugh Dickins

On Tue, Oct 31, 2017 at 4:44 PM, Dave Hansen
<dave.hansen@linux.intel.com> wrote:
> On 10/31/2017 04:27 PM, Linus Torvalds wrote:
>> Inconveniently, the people you cc'd on the actual patches did *not*
>> get cc'd with this 00/23 cover letter email.
>
> Urg, sorry about that.
>
>>  (a) is this on top of Andy's entry cleanups?
>>
>>      If not, that probably needs to be sorted out.
>
> It is not.  However, I did a version on top of his earlier cleanups, so
> I know this can be easily ported on top of them.  It didn't make a major
> difference in the number of places that KAISER had to patch, unfortunately.
>
>>  (b) the TLB global bit really is nastily done. You basically disable
>> _PAGE_GLOBAL entirely.
>>
>>      I can see how/why that would make things simpler, but it's almost
>> certainly the wrong approach. The small subset of kernel pages that
>> are always mapped should definitely retain the global bit, so that you
>> don't always take a TLB miss on those! Those are probably some of the
>> most latency-critical pages, since there's generally no prefetching
>> for the kernel entry code or for things like IDT/GDT accesses..
>>
>>      So even if you don't want to have global pages for normal kernel
>> entries, you don't want to just make _PAGE_GLOBAL be defined as zero.
>> You'd want to just use _PAGE_GLOBAL conditionally.
>>
>>      Hmm?
>
> That's a good point.  Shouldn't be hard to implement at all.  We'll just
> need to take _PAGE_GLOBAL out of the default _KERNPG_TABLE definition, I
> think.
>
>>  (c) am I reading the code correctly, and the shadow page tables are
>> *completely* duplicated?
>>
>>      That seems insane. Why isn't only the top level shadowed, and
>> then lower levels are shared between the shadowed and the "kernel"
>> page tables?
>
> There are obviously two PGDs.  The userspace half of the PGD is an exact
> copy so all the lower levels are shared.  You can see this bit in the
> memcpy that we do in clone_pgd_range().
>
> For the kernel half, we don't share any of the lower levels.  That's
> mostly because the stuff that we're mapping into the user/shadow copy is
> only 4k aligned and (probably) never >2MB, so there's really no
> opportunity to share.
>

I think we should map exactly two kernel PGDs: one for the fixmap and
one for the special shared stuff.  Those PGDs should be mapped
identically in the user tables.  We can eventually (or immediately)
get rid of the fixmap, too, by moving the IDT and GDT and making a
special user fixmap table for the vsyscall page.

--Andy

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH 12/23] x86, kaiser: map dynamically-allocated LDTs
  2017-10-31 22:32 ` [PATCH 12/23] x86, kaiser: map dynamically-allocated LDTs Dave Hansen
@ 2017-11-01  8:00   ` Andy Lutomirski
  2017-11-01  8:06     ` Ingo Molnar
  0 siblings, 1 reply; 102+ messages in thread
From: Andy Lutomirski @ 2017-11-01  8:00 UTC (permalink / raw)
  To: Dave Hansen
  Cc: linux-kernel, linux-mm, moritz.lipp, daniel.gruss,
	michael.schwarz, Andrew Lutomirski, Linus Torvalds, Kees Cook,
	Hugh Dickins, X86 ML

On Tue, Oct 31, 2017 at 3:32 PM, Dave Hansen
<dave.hansen@linux.intel.com> wrote:
>
> Normally, a process just has a NULL mm->context.ldt.  But, we
> have a syscall for a process to set a new one.  If a process does
> that, we need to map the new LDT.
>
> The original KAISER patch missed this case.

Tglx suggested that we instead increase the padding at the top of the
user address space from 4k to 64k and put the LDT there.  This is a
slight ABI break, but I'd be rather surprised if anything noticed,
especially because the randomized vdso currently regularly lands there
(IIRC), so any user code that explicitly uses those 60k already
collides with the vdso.

I can make this happen.

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH 04/23] x86, tlb: make CR4-based TLB flushes more robust
  2017-10-31 22:31 ` [PATCH 04/23] x86, tlb: make CR4-based TLB flushes more robust Dave Hansen
@ 2017-11-01  8:01   ` Andy Lutomirski
  2017-11-01 10:11     ` Kirill A. Shutemov
  2017-11-01 21:25   ` Thomas Gleixner
  1 sibling, 1 reply; 102+ messages in thread
From: Andy Lutomirski @ 2017-11-01  8:01 UTC (permalink / raw)
  To: Dave Hansen
  Cc: linux-kernel, linux-mm, moritz.lipp, daniel.gruss,
	michael.schwarz, Andrew Lutomirski, Linus Torvalds, Kees Cook,
	Hugh Dickins, X86 ML

On Tue, Oct 31, 2017 at 3:31 PM, Dave Hansen
<dave.hansen@linux.intel.com> wrote:
>
> > Our CR4-based TLB flush currently requires global pages to be
> > supported *and* enabled.  But, we really only need them to be
> > supported.  Make the code more robust by allowing X86_CR4_PGE to
> clear as well as set.
>
> This change was suggested by Kirill Shutemov.

> I may have missed something, but why would we ever have CR4.PGE off?

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH 21/23] x86, pcid, kaiser: allow flushing for future ASID switches
  2017-10-31 22:32 ` [PATCH 21/23] x86, pcid, kaiser: allow flushing for future ASID switches Dave Hansen
@ 2017-11-01  8:03   ` Andy Lutomirski
  2017-11-01 14:17     ` Dave Hansen
  0 siblings, 1 reply; 102+ messages in thread
From: Andy Lutomirski @ 2017-11-01  8:03 UTC (permalink / raw)
  To: Dave Hansen
  Cc: linux-kernel, linux-mm, moritz.lipp, daniel.gruss,
	michael.schwarz, Andrew Lutomirski, Linus Torvalds, Kees Cook,
	Hugh Dickins, X86 ML

On Tue, Oct 31, 2017 at 3:32 PM, Dave Hansen
<dave.hansen@linux.intel.com> wrote:
>
> If we change the page tables in such a way that we need an
> invalidation of all contexts (aka. PCIDs / ASIDs) we can
> actively invalidate them by:
>  1. INVPCID for each PCID (works for single pages too).
>  2. Load CR3 with each PCID without the NOFLUSH bit set
>  3. Load CR3 with the NOFLUSH bit set for each and do
>     INVLPG for each address.
>
> But, none of these are really feasible since we have ~6 ASIDs (12 with
> KAISER) at the time that we need to do an invalidation.  So, we just
> invalidate the *current* context and then mark the cpu_tlbstate
> _quickly_.
>
> Then, at the next context-switch, we notice that we had
> 'all_other_ctxs_invalid' marked, and go invalidate all of the
> cpu_tlbstate.ctxs[] entries.
>
> This ensures that any future context switches will do a full flush
> of the TLB so they pick up the changes.

I'm confused.  What was wrong with the old code?  I guess I just don't
see what the problem is that is solved by this patch.

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH 12/23] x86, kaiser: map dynamically-allocated LDTs
  2017-11-01  8:00   ` Andy Lutomirski
@ 2017-11-01  8:06     ` Ingo Molnar
  0 siblings, 0 replies; 102+ messages in thread
From: Ingo Molnar @ 2017-11-01  8:06 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Dave Hansen, linux-kernel, linux-mm, moritz.lipp, daniel.gruss,
	michael.schwarz, Linus Torvalds, Kees Cook, Hugh Dickins, X86 ML


* Andy Lutomirski <luto@kernel.org> wrote:

> On Tue, Oct 31, 2017 at 3:32 PM, Dave Hansen
> <dave.hansen@linux.intel.com> wrote:
> >
> > Normally, a process just has a NULL mm->context.ldt.  But, we
> > have a syscall for a process to set a new one.  If a process does
> > that, we need to map the new LDT.
> >
> > The original KAISER patch missed this case.
> 
> Tglx suggested that we instead increase the padding at the top of the
> user address space from 4k to 64k and put the LDT there.  This is a
> slight ABI break, but I'd be rather surprised if anything noticed,
> especially because the randomized vdso currently regularly lands there
> (IIRC), so any user code that explicitly uses those 60k already
> collides with the vdso.
> 
> I can make this happen.

Yes, let's try that.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH 00/23] KAISER: unmap most of the kernel from userspace page tables
  2017-10-31 22:31 [PATCH 00/23] KAISER: unmap most of the kernel from userspace page tables Dave Hansen
                   ` (23 preceding siblings ...)
  2017-10-31 23:27 ` [PATCH 00/23] KAISER: unmap most of the kernel from userspace page tables Linus Torvalds
@ 2017-11-01  8:54 ` Ingo Molnar
  2017-11-01 14:09   ` Thomas Gleixner
  2017-11-01 22:14   ` Dave Hansen
  2017-11-02 19:01 ` Will Deacon
  2017-11-22 16:19 ` Pavel Machek
  26 siblings, 2 replies; 102+ messages in thread
From: Ingo Molnar @ 2017-11-01  8:54 UTC (permalink / raw)
  To: Dave Hansen
  Cc: linux-kernel, linux-mm, Andy Lutomirski, Linus Torvalds,
	Thomas Gleixner, Peter Zijlstra, H. Peter Anvin,
	Borislav Petkov, Brian Gerst, Denys Vlasenko, Josh Poimboeuf, Thomas Garnier


(Filled in the missing Cc: list)

* Dave Hansen <dave.hansen@linux.intel.com> wrote:

> tl;dr:
> 
> KAISER makes it harder to defeat KASLR, but makes syscalls and
> interrupts slower.  These patches are based on work from a team at
> Graz University of Technology posted here[1].  The major addition is
> support for Intel PCIDs which builds on top of Andy Lutomorski's PCID
> work merged for 4.14.  PCIDs make KAISER's overhead very reasonable
> for a wide variety of use cases.

Ok, while I never thought I'd see the 4g:4g patch come to 64-bit kernels ;-),
this series is a lot better than earlier versions of this feature, and it
solves a number of KASLR timing attacks rather fundamentally.

Beyond the inevitable cavalcade of (solvable) problems that will pop up during 
review, one major item I'd like to see addressed is runtime configurability: it 
should be possible to switch between a CR3-flushing and a regular syscall and page 
table model on the admin level, without restarting the kernel and apps. Distros 
really, really don't want to double the number of kernel variants they have.

The 'Kaiser off' runtime switch doesn't have to be as efficient as 
CONFIG_KAISER=n, at least initially, but at minimum it should avoid the most 
expensive page table switching paths in the syscall entry codepaths.
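
A minimal sketch of the kind of runtime gate meant here (an assumption,
not something in this series; the asm entry paths would additionally
need an ALTERNATIVE/patching mechanism on top of such a key):

	DEFINE_STATIC_KEY_TRUE(kaiser_enabled_key);

	static int __init nokaiser_setup(char *s)
	{
		static_branch_disable(&kaiser_enabled_key);
		return 0;
	}
	early_param("nokaiser", nokaiser_setup);

	int kaiser_add_user_map(const void *__start_addr, unsigned long size,
				unsigned long flags)
	{
		if (!static_branch_likely(&kaiser_enabled_key))
			return 0;	/* shadow tables unused when disabled */
		/* ...normal shadow-mapping work... */
		return 0;
	}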

Also, this series should be based on Andy's latest syscall entry cleanup work.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH 23/23] x86, kaiser: add Kconfig
  2017-10-31 23:59   ` Kees Cook
@ 2017-11-01  9:07     ` Borislav Petkov
  0 siblings, 0 replies; 102+ messages in thread
From: Borislav Petkov @ 2017-11-01  9:07 UTC (permalink / raw)
  To: Kees Cook
  Cc: Dave Hansen, LKML, Linux-MM, moritz.lipp, daniel.gruss,
	michael.schwarz, Andy Lutomirski, Linus Torvalds, Hugh Dickins,
	x86

On Tue, Oct 31, 2017 at 04:59:37PM -0700, Kees Cook wrote:
> A quick look through "#ifdef CONFIG_KAISER" looks like it might be
> possible to make this a runtime setting at some point. When doing
> KASLR, it was much more useful to make this runtime selectable so that
> distro kernels could build the support in, but let users decide if
> they wanted to enable it.

Yes please.

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH 04/23] x86, tlb: make CR4-based TLB flushes more robust
  2017-11-01  8:01   ` Andy Lutomirski
@ 2017-11-01 10:11     ` Kirill A. Shutemov
  2017-11-01 10:38       ` Andy Lutomirski
  0 siblings, 1 reply; 102+ messages in thread
From: Kirill A. Shutemov @ 2017-11-01 10:11 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Dave Hansen, linux-kernel, linux-mm, moritz.lipp, daniel.gruss,
	michael.schwarz, Linus Torvalds, Kees Cook, Hugh Dickins, X86 ML

On Wed, Nov 01, 2017 at 01:01:45AM -0700, Andy Lutomirski wrote:
> On Tue, Oct 31, 2017 at 3:31 PM, Dave Hansen
> <dave.hansen@linux.intel.com> wrote:
> >
> > Our CR4-based TLB flush currently requires global pages to be
> > supported *and* enabled.  But, we really only need them to be
> > supported.  Make the code more robust by allowing X86_CR4_PGE to
> > clear as well as set.
> >
> > This change was suggested by Kirill Shutemov.
> 
> I may have missed something, but why would we ever have CR4.PGE off?

This came out of me thinking about whether we can disable global pages
by not turning on CR4.PGE instead of making _PAGE_GLOBAL zero.

Dave decided not to take this path, but this change would make
__native_flush_tlb_global_irq_disabled() a bit less fragile in case
the situation changes in the future.

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH 04/23] x86, tlb: make CR4-based TLB flushes more robust
  2017-11-01 10:11     ` Kirill A. Shutemov
@ 2017-11-01 10:38       ` Andy Lutomirski
  2017-11-01 10:56         ` Kirill A. Shutemov
  0 siblings, 1 reply; 102+ messages in thread
From: Andy Lutomirski @ 2017-11-01 10:38 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andy Lutomirski, Dave Hansen, linux-kernel, linux-mm,
	moritz.lipp, daniel.gruss, michael.schwarz, Linus Torvalds,
	Kees Cook, Hugh Dickins, X86 ML

On Wed, Nov 1, 2017 at 3:11 AM, Kirill A. Shutemov <kirill@shutemov.name> wrote:
> On Wed, Nov 01, 2017 at 01:01:45AM -0700, Andy Lutomirski wrote:
>> On Tue, Oct 31, 2017 at 3:31 PM, Dave Hansen
>> <dave.hansen@linux.intel.com> wrote:
>> >
>> > Our CR4-based TLB flush currently requires global pages to be
>> > supported *and* enabled.  But, we really only need them to be
>> > supported.  Make the code more robust by allowing X86_CR4_PGE to
>> > clear as well as set.
>> >
>> > This change was suggested by Kirill Shutemov.
>>
>> I may have missed something, but why would we ever have CR4.PGE off?
>
> This came out of me thinking about whether we can disable global pages
> by not turning on CR4.PGE instead of making _PAGE_GLOBAL zero.
>
> Dave decided not to take this path, but this change would make
> __native_flush_tlb_global_irq_disabled() a bit less fragile in case
> the situation changes in the future.

How about just adding a VM_WARN_ON_ONCE, then?

--Andy

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH 04/23] x86, tlb: make CR4-based TLB flushes more robust
  2017-11-01 10:38       ` Andy Lutomirski
@ 2017-11-01 10:56         ` Kirill A. Shutemov
  2017-11-01 11:18           ` Andy Lutomirski
  0 siblings, 1 reply; 102+ messages in thread
From: Kirill A. Shutemov @ 2017-11-01 10:56 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Dave Hansen, linux-kernel, linux-mm, moritz.lipp, daniel.gruss,
	michael.schwarz, Linus Torvalds, Kees Cook, Hugh Dickins, X86 ML

On Wed, Nov 01, 2017 at 03:38:23AM -0700, Andy Lutomirski wrote:
> On Wed, Nov 1, 2017 at 3:11 AM, Kirill A. Shutemov <kirill@shutemov.name> wrote:
> > On Wed, Nov 01, 2017 at 01:01:45AM -0700, Andy Lutomirski wrote:
> >> On Tue, Oct 31, 2017 at 3:31 PM, Dave Hansen
> >> <dave.hansen@linux.intel.com> wrote:
> >> >
> >> > Our CR4-based TLB flush currently requires global pages to be
> >> > supported *and* enabled.  But, we really only need them to be
> >> > supported.  Make the code more robust by allowing X86_CR4_PGE to
> >> > clear as well as set.
> >> >
> >> > This change was suggested by Kirill Shutemov.
> >>
> >> I may have missed something, but why would we ever have CR4.PGE off?
> >
> > This came out of me thinking about whether we can disable global pages
> > by not turning on CR4.PGE instead of making _PAGE_GLOBAL zero.
> >
> > Dave decided not to take this path, but this change would make
> > __native_flush_tlb_global_irq_disabled() a bit less fragile in case
> > the situation changes in the future.
> 
> How about just adding a VM_WARN_ON_ONCE, then?

What's wrong with xor? The function will continue to work this way even if
CR4.PGE is disabled.
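
For reference, the xor variant under discussion is roughly (a sketch,
not the exact patch; it must be called with interrupts off):

	static inline void flush_tlb_global_toggle_pge(void)
	{
		unsigned long cr4 = native_read_cr4();

		/*
		 * Any change to CR4.PGE flushes the entire TLB including
		 * global entries, so flipping the bit and then restoring
		 * it works whether PGE was set or clear to begin with.
		 */
		native_write_cr4(cr4 ^ X86_CR4_PGE);
		native_write_cr4(cr4);
	}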

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH 04/23] x86, tlb: make CR4-based TLB flushes more robust
  2017-11-01 10:56         ` Kirill A. Shutemov
@ 2017-11-01 11:18           ` Andy Lutomirski
  2017-11-01 22:21             ` Dave Hansen
  0 siblings, 1 reply; 102+ messages in thread
From: Andy Lutomirski @ 2017-11-01 11:18 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andy Lutomirski, Dave Hansen, linux-kernel, linux-mm,
	moritz.lipp, Daniel Gruss, michael.schwarz, Linus Torvalds,
	Kees Cook, Hugh Dickins, X86 ML

On Wed, Nov 1, 2017 at 3:56 AM, Kirill A. Shutemov <kirill@shutemov.name> wrote:
> On Wed, Nov 01, 2017 at 03:38:23AM -0700, Andy Lutomirski wrote:
>> On Wed, Nov 1, 2017 at 3:11 AM, Kirill A. Shutemov <kirill@shutemov.name> wrote:
>> > On Wed, Nov 01, 2017 at 01:01:45AM -0700, Andy Lutomirski wrote:
>> >> On Tue, Oct 31, 2017 at 3:31 PM, Dave Hansen
>> >> <dave.hansen@linux.intel.com> wrote:
>> >> >
>> >> > Our CR4-based TLB flush currently requires global pages to be
>> >> > supported *and* enabled.  But, we really only need them to be
>> >> > supported.  Make the code more robust by allowing X86_CR4_PGE to
>> >> > clear as well as set.
>> >> >
>> >> > This change was suggested by Kirill Shutemov.
>> >>
>> >> I may have missed something, but why would we ever have CR4.PGE off?
>> >
>> > This came out of me thinking about whether we can disable global pages
>> > by not turning on CR4.PGE instead of making _PAGE_GLOBAL zero.
>> >
>> > Dave decided not to take this path, but this change would make
>> > __native_flush_tlb_global_irq_disabled() a bit less fragile in case
>> > the situation changes in the future.
>>
>> How about just adding a VM_WARN_ON_ONCE, then?
>
> What's wrong with xor? The function will continue to work this way even if
> CR4.PGE is disabled.

That's true.  OTOH, since no one is actually proposing doing that,
there's an argument that people should get warned and therefore be
forced to think about it.

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH 00/23] KAISER: unmap most of the kernel from userspace page tables
  2017-11-01  8:54 ` Ingo Molnar
@ 2017-11-01 14:09   ` Thomas Gleixner
  2017-11-01 22:14   ` Dave Hansen
  1 sibling, 0 replies; 102+ messages in thread
From: Thomas Gleixner @ 2017-11-01 14:09 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Dave Hansen, LKML, linux-mm, Andy Lutomirski, Linus Torvalds,
	Peter Zijlstra, H. Peter Anvin, Borislav Petkov, Brian Gerst, Denys Vlasenko,
	Josh Poimboeuf, Thomas Garnier

On Wed, 1 Nov 2017, Ingo Molnar wrote:
> Beyond the inevitable cavalcade of (solvable) problems that will pop up during 
> review, one major item I'd like to see addressed is runtime configurability: it 
> should be possible to switch between a CR3-flushing and a regular syscall and page 
> table model on the admin level, without restarting the kernel and apps. Distros 
> really, really don't want to double the number of kernel variants they have.

And this removes the !PARAVIRT dependency as well, because when the kernel
detects xen_pv() it simply disables kaiser and everything works.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH 21/23] x86, pcid, kaiser: allow flushing for future ASID switches
  2017-11-01  8:03   ` Andy Lutomirski
@ 2017-11-01 14:17     ` Dave Hansen
  2017-11-01 20:31       ` Andy Lutomirski
  0 siblings, 1 reply; 102+ messages in thread
From: Dave Hansen @ 2017-11-01 14:17 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: linux-kernel, linux-mm, moritz.lipp, daniel.gruss,
	michael.schwarz, Linus Torvalds, Kees Cook, Hugh Dickins, X86 ML

On 11/01/2017 01:03 AM, Andy Lutomirski wrote:
>> This ensures that any future context switches will do a full flush
>> of the TLB so they pick up the changes.
> I'm confused.  What was wrong with the old code?  I guess I just don't
> see what the problem is that is solved by this patch.

Instead of flushing *now* with INVPCID, this lets us flush *later* with
CR3.  It just hijacks the code that you already have that flushes CR3
when loading a new ASID by making all ASIDs look new in the future.

We have to load CR3 anyway, so we might as well just do this flush then.
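
Roughly (a sketch; the two helpers named below are illustrative
stand-ins, only the all_other_ctxs_invalid flag comes from the patch
description):

	static void invalidate_all_contexts(void)
	{
		flush_current_asid_now();	/* hypothetical: flush what's in use now */
		this_cpu_write(cpu_tlbstate.all_other_ctxs_invalid, true);
	}

	/* ...and on the context-switch / CR3-load path: */
	static void flush_stale_contexts_if_needed(void)
	{
		if (this_cpu_read(cpu_tlbstate.all_other_ctxs_invalid)) {
			/* hypothetical: wipe cached ASID generations so the
			 * upcoming CR3 load is done without NOFLUSH */
			clear_cached_asid_generations();
			this_cpu_write(cpu_tlbstate.all_other_ctxs_invalid, false);
		}
	}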

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH 00/23] KAISER: unmap most of the kernel from userspace page tables
  2017-10-31 23:27 ` [PATCH 00/23] KAISER: unmap most of the kernel from userspace page tables Linus Torvalds
  2017-10-31 23:44   ` Dave Hansen
@ 2017-11-01 15:53   ` Dave Hansen
  1 sibling, 0 replies; 102+ messages in thread
From: Dave Hansen @ 2017-11-01 15:53 UTC (permalink / raw)
  To: Linus Torvalds, Andy Lutomirski
  Cc: Linux Kernel Mailing List, linux-mm, Kees Cook, Hugh Dickins

On 10/31/2017 04:27 PM, Linus Torvalds wrote:
>      So even if you don't want to have global pages for normal kernel
> entries, you don't want to just make _PAGE_GLOBAL be defined as zero.
> You'd want to just use _PAGE_GLOBAL conditionally.

I implemented this, then did a quick test with some code that does a
bunch of quick system calls:

> 	https://github.com/antonblanchard/will-it-scale/blob/master/tests/lseek1.c

It helps a wee bit (~3%) with PCIDs, and much more when PCIDs are not in
use (~15%).  Here are the numbers:  ("ge" means "Global Entry"):

no kaiser       : 5.2M
kaiser+  pcid	: 3.0M
kaiser+  pcid+ge: 3.1M
kaiser+nopcid   : 2.2M
kaiser+nopcid+ge: 2.5M

This *does* use Global pages for the process stack (which is not ideal),
but it sounds like Andy's entry stack stuff will get rid of the need to
do that in the first place.

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH 00/23] KAISER: unmap most of the kernel from userspace page tables
  2017-10-31 23:44   ` Dave Hansen
  2017-11-01  0:21     ` Dave Hansen
  2017-11-01  7:59     ` Andy Lutomirski
@ 2017-11-01 16:08     ` Linus Torvalds
  2017-11-01 17:31       ` Dave Hansen
  2 siblings, 1 reply; 102+ messages in thread
From: Linus Torvalds @ 2017-11-01 16:08 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Andy Lutomirski, Linux Kernel Mailing List, linux-mm, Kees Cook,
	Hugh Dickins

On Tue, Oct 31, 2017 at 4:44 PM, Dave Hansen
<dave.hansen@linux.intel.com> wrote:
> On 10/31/2017 04:27 PM, Linus Torvalds wrote:
>>  (c) am I reading the code correctly, and the shadow page tables are
>> *completely* duplicated?
>>
>>      That seems insane. Why isn't only the top level shadowed, and
>> then lower levels are shared between the shadowed and the "kernel"
>> page tables?
>
> There are obviously two PGDs.  The userspace half of the PGD is an exact
> copy so all the lower levels are shared.  The userspace copying is
> done via the code we add to native_set_pgd().

So the thing that made me think you do all levels was that confusing
kaiser_pagetable_walk() code (and to a lesser degree
get_pa_from_mapping()).

That code definitely walks and allocates all levels.

So it really doesn't seem to be just sharing the top page table entry.

And that worries me because that seems to be a very fundamental coherency issue.

I'm assuming that this is about mapping only the individual kernel
parts, but I'd like to get comments and clarification about that.

                  Linus

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH 00/23] KAISER: unmap most of the kernel from userspace page tables
  2017-11-01 16:08     ` Linus Torvalds
@ 2017-11-01 17:31       ` Dave Hansen
  2017-11-01 17:58         ` Randy Dunlap
  2017-11-01 18:27         ` Linus Torvalds
  0 siblings, 2 replies; 102+ messages in thread
From: Dave Hansen @ 2017-11-01 17:31 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andy Lutomirski, Linux Kernel Mailing List, linux-mm, Kees Cook,
	Hugh Dickins

[-- Attachment #1: Type: text/plain, Size: 2141 bytes --]

On 11/01/2017 09:08 AM, Linus Torvalds wrote:
> On Tue, Oct 31, 2017 at 4:44 PM, Dave Hansen
> <dave.hansen@linux.intel.com> wrote:
>> On 10/31/2017 04:27 PM, Linus Torvalds wrote:
>>>  (c) am I reading the code correctly, and the shadow page tables are
>>> *completely* duplicated?
>>>
>>>      That seems insane. Why isn't only the top level shadowed, and
>>> then lower levels are shared between the shadowed and the "kernel"
>>> page tables?
>>
>> There are obviously two PGDs.  The userspace half of the PGD is an exact
>> copy so all the lower levels are shared.  The userspace copying is
>> done via the code we add to native_set_pgd().
> 
> So the thing that made me think you do all levels was that confusing
> kaiser_pagetable_walk() code (and to a lesser degree
> get_pa_from_mapping()).
> 
> That code definitely walks and allocates all levels.
> 
> So it really doesn't seem to be just sharing the top page table entry.

Yeah, they're quite lightly commented and badly named now that I go look
at them.

get_pa_from_mapping() should be called something like
get_pa_from_kernel_map().  Its job is to look at the main (kernel) page
tables and go get an address from there.  It's only ever called on
kernel addresses.

kaiser_pagetable_walk() should probably be
kaiser_shadow_pagetable_walk().  Its job is to walk the shadow copy and
find the location of a 4k PTE.  You can then populate that PTE with the
address you got from get_pa_from_mapping() (or clear it in the remove
mapping case).

I've attached an update to the core patch and Documentation that should
help clear this up.

> And that worries me because that seems to be a very fundamental coherency issue.
> 
> I'm assuming that this is about mapping only the individual kernel
> parts, but I'd like to get comments and clarification about that.

I assume that you're really worried about having to go two places to do
one thing, like clearing a dirty bit, or unmapping a PTE, especially
when we have to do that for userspace.  Thankfully, the sharing of the
page tables (under the PGD) for userspace gets rid of most of this
nastiness.

I hope that's more clear now.

[-- Attachment #2: kaiser-core-update1.patch --]
[-- Type: text/x-patch, Size: 4884 bytes --]

diff --git a/Documentation/x86/kaiser.txt b/Documentation/x86/kaiser.txt
index 67a70d2..5b5e9c4 100644
--- a/Documentation/x86/kaiser.txt
+++ b/Documentation/x86/kaiser.txt
@@ -1,3 +1,6 @@
+Overview
+========
+
 KAISER is a countermeasure against attacks on kernel address
 information.  There are at least three existing, published,
 approaches using the shared user/kernel mapping and hardware features
@@ -18,6 +21,35 @@ This helps ensure that side-channel attacks that leverage the
 paging structures do not function when KAISER is enabled.  It
 can be enabled by setting CONFIG_KAISER=y
 
+Page Table Management
+=====================
+
+KAISER logically keeps a "copy" of the page tables which unmap
+the kernel while in userspace.  The kernel manages the page
+tables as normal, but the "copying" is done with a few tricks
+that mean that we do not have to manage two full copies.
+
+The first trick is that for any new kernel mapping, we
+presume that we do not want it mapped to userspace.  That means
+we normally have no copying to do.  We only copy the kernel
+entries over to the shadow in response to a kaiser_add_*()
+call which is rare.
+
+For a new userspace mapping, the kernel makes the entries in
+its page tables like normal.  The only difference is when the
+kernel makes entries in the top (PGD) level.  In addition to
+setting the entry in the main kernel PGD, a copy of the entry
+is made in the shadow PGD.
+
+PGD entries always point to another page table.  Two PGD
+entries pointing to the same thing gives us shared page tables
+for all the lower entries.  This leaves a single, shared set of
+userspace page tables to manage.  One PTE to lock, one set
+of accessed bits, dirty bits, etc...
+
+Overhead
+========
+
 Protection against side-channel attacks is important.  But,
 this protection comes at a cost:
 
diff --git a/arch/x86/mm/kaiser.c b/arch/x86/mm/kaiser.c
index 57f7637..cde9014 100644
--- a/arch/x86/mm/kaiser.c
+++ b/arch/x86/mm/kaiser.c
@@ -49,9 +49,21 @@
 static DEFINE_SPINLOCK(shadow_table_allocation_lock);
 
 /*
+ * This is a generic page table walker used only for walking kernel
+ * addresses.  We use it too help recreate the "shadow" page tables
+ * which are used while we are in userspace.
+ *
+ * This can be called on any kernel memory addresses and will work
+ * with any page sizes and any types: normal linear map memory,
+ * vmalloc(), even kmap().
+ *
+ * Note: this is only used when mapping new *kernel* entries into
+ * the user/shadow page tables.  It is never used for userspace
+ * addresses.
+ *
  * Returns -1 on error.
  */
-static inline unsigned long get_pa_from_mapping(unsigned long vaddr)
+static inline unsigned long get_pa_from_kernel_map(unsigned long vaddr)
 {
 	pgd_t *pgd;
 	p4d_t *p4d;
@@ -59,6 +71,8 @@ static inline unsigned long get_pa_from_mapping(unsigned long vaddr)
 	pmd_t *pmd;
 	pte_t *pte;
 
+	WARN_ON_ONCE(vaddr < PAGE_OFFSET);
+
 	pgd = pgd_offset_k(vaddr);
 	/*
 	 * We made all the kernel PGDs present in kaiser_init().
@@ -111,13 +125,19 @@ static inline unsigned long get_pa_from_mapping(unsigned long vaddr)
 }
 
 /*
- * This is a relatively normal page table walk, except that it
- * also tries to allocate page tables pages along the way.
+ * Walk the shadow copy of the page tables (optionally) trying to
+ * allocate page table pages on the way down.  Does not support
+ * large pages since the data we are mapping is (generally) not
+ * large enough or aligned to 2MB.
+ *
+ * Note: this is only used when mapping *new* kernel data into the
+ * user/shadow page tables.  It is never used for userspace data.
  *
  * Returns a pointer to a PTE on success, or NULL on failure.
  */
 #define KAISER_WALK_ATOMIC  0x1
-static pte_t *kaiser_pagetable_walk(unsigned long address, unsigned long flags)
+static pte_t *kaiser_shadow_pagetable_walk(unsigned long address,
+					   unsigned long flags)
 {
 	pmd_t *pmd;
 	pud_t *pud;
@@ -207,11 +227,11 @@ int kaiser_add_user_map(const void *__start_addr, unsigned long size,
 	unsigned long target_address;
 
 	for (; address < end_addr; address += PAGE_SIZE) {
-		target_address = get_pa_from_mapping(address);
+		target_address = get_pa_from_kernel_map(address);
 		if (target_address == -1)
 			return -EIO;
 
-		pte = kaiser_pagetable_walk(address, false);
+		pte = kaiser_shadow_pagetable_walk(address, false);
 		/*
 		 * Errors come from either -ENOMEM for a page
 		 * table page, or something screwy that did a
@@ -348,7 +368,7 @@ void kaiser_remove_mapping(unsigned long start, unsigned long size)
 		 * context.  This should not do any allocations because we
 		 * should only be walking things that are known to be mapped.
 		 */
-		pte_t *pte = kaiser_pagetable_walk(addr, KAISER_WALK_ATOMIC);
+		pte_t *pte = kaiser_shadow_pagetable_walk(addr, KAISER_WALK_ATOMIC);
 
 		/*
 		 * We are removing a mapping that shoud

^ permalink raw reply related	[flat|nested] 102+ messages in thread

* Re: [PATCH 00/23] KAISER: unmap most of the kernel from userspace page tables
  2017-11-01 17:31       ` Dave Hansen
@ 2017-11-01 17:58         ` Randy Dunlap
  2017-11-01 18:27         ` Linus Torvalds
  1 sibling, 0 replies; 102+ messages in thread
From: Randy Dunlap @ 2017-11-01 17:58 UTC (permalink / raw)
  To: Dave Hansen, Linus Torvalds
  Cc: Andy Lutomirski, Linux Kernel Mailing List, linux-mm, Kees Cook,
	Hugh Dickins

On 11/01/2017 10:31 AM, Dave Hansen wrote:

(from attachment)

diff --git a/arch/x86/mm/kaiser.c b/arch/x86/mm/kaiser.c
index 57f7637..cde9014 100644
--- a/arch/x86/mm/kaiser.c
+++ b/arch/x86/mm/kaiser.c
@@ -49,9 +49,21 @@
 static DEFINE_SPINLOCK(shadow_table_allocation_lock);
 
 /*
+ * This is a generic page table walker used only for walking kernel
+ * addresses.  We use it too help recreate the "shadow" page tables

                          to help 

+ * which are used while we are in userspace.
+ *



-- 
~Randy

^ permalink raw reply related	[flat|nested] 102+ messages in thread

* Re: [PATCH 01/23] x86, kaiser: prepare assembly for entry/exit CR3 switching
  2017-10-31 22:31 ` [PATCH 01/23] x86, kaiser: prepare assembly for entry/exit CR3 switching Dave Hansen
  2017-11-01  0:43   ` Brian Gerst
@ 2017-11-01 18:18   ` Borislav Petkov
  2017-11-01 18:27     ` Dave Hansen
  2017-11-01 21:01   ` Thomas Gleixner
  2 siblings, 1 reply; 102+ messages in thread
From: Borislav Petkov @ 2017-11-01 18:18 UTC (permalink / raw)
  To: Dave Hansen
  Cc: linux-kernel, linux-mm, moritz.lipp, daniel.gruss,
	michael.schwarz, luto, torvalds, keescook, hughd, x86

On Tue, Oct 31, 2017 at 03:31:48PM -0700, Dave Hansen wrote:
> diff -puN arch/x86/entry/calling.h~kaiser-luto-base-cr3-work arch/x86/entry/calling.h
> --- a/arch/x86/entry/calling.h~kaiser-luto-base-cr3-work	2017-10-31 15:03:48.105007253 -0700
> +++ b/arch/x86/entry/calling.h	2017-10-31 15:03:48.113007631 -0700
> @@ -1,5 +1,6 @@
>  #include <linux/jump_label.h>
>  #include <asm/unwind_hints.h>
> +#include <asm/cpufeatures.h>
>  
>  /*
>  
> @@ -217,6 +218,45 @@ For 32-bit we have the following convent
>  #endif
>  .endm
>  
> +.macro ADJUST_KERNEL_CR3 reg:req
> +.endm
> +
> +.macro ADJUST_USER_CR3 reg:req
> +.endm
> +
> +.macro SWITCH_TO_KERNEL_CR3 scratch_reg:req
> +	mov	%cr3, \scratch_reg
> +	ADJUST_KERNEL_CR3 \scratch_reg
> +	mov	\scratch_reg, %cr3
> +.endm
> +
> +.macro SWITCH_TO_USER_CR3 scratch_reg:req
> +	mov	%cr3, \scratch_reg
> +	ADJUST_USER_CR3 \scratch_reg
> +	mov	\scratch_reg, %cr3
> +.endm
> +
> +.macro SAVE_AND_SWITCH_TO_KERNEL_CR3 scratch_reg:req save_reg:req
> +	movq	%cr3, %r\scratch_reg
> +	movq	%r\scratch_reg, \save_reg

So one of the args gets passed as "ax", for example, which then gets
completed to a register with the "%r" prepended and the other is a full
register: %r14.

What for? Can we stick with one format pls?

> +	/*
> +	 * Just stick a random bit in here that never gets set.  Fixed
> +	 * up in real KAISER patches in a moment.
> +	 */
> +	bt	$63, %r\scratch_reg
> +	jz	.Ldone_\@
> +
> +	ADJUST_KERNEL_CR3 %r\scratch_reg
> +	movq	%r\scratch_reg, %cr3
> +
> +.Ldone_\@:
> +.endm
> +
> +.macro RESTORE_CR3 save_reg:req
> +	/* optimize this */
> +	movq	\save_reg, %cr3
> +.endm
> +
>  #endif /* CONFIG_X86_64 */
>  
>  /*
> diff -puN arch/x86/entry/entry_64_compat.S~kaiser-luto-base-cr3-work arch/x86/entry/entry_64_compat.S
> --- a/arch/x86/entry/entry_64_compat.S~kaiser-luto-base-cr3-work	2017-10-31 15:03:48.107007348 -0700
> +++ b/arch/x86/entry/entry_64_compat.S	2017-10-31 15:03:48.113007631 -0700
> @@ -48,8 +48,13 @@
>  ENTRY(entry_SYSENTER_compat)
>  	/* Interrupts are off on entry. */
>  	SWAPGS_UNSAFE_STACK
> +
>  	movq	PER_CPU_VAR(cpu_current_top_of_stack), %rsp
>  
> +	pushq	%rdi
> +	SWITCH_TO_KERNEL_CR3 scratch_reg=%rdi
> +	popq	%rdi

So we switch to kernel CR3 right after we've set up the kernel stack...

> +
>  	/*
>  	 * User tracing code (ptrace or signal handlers) might assume that
>  	 * the saved RAX contains a 32-bit number when we're invoking a 32-bit
> @@ -91,6 +96,9 @@ ENTRY(entry_SYSENTER_compat)
>  	pushq   $0			/* pt_regs->r15 = 0 */
>  	cld
>  
> +	pushq	%rdi
> +	SWITCH_TO_KERNEL_CR3 scratch_reg=%rdi
> +	popq	%rdi

... and switch here *again*, after pushing pt_regs?!? What's up?

>  	/*
>  	 * SYSENTER doesn't filter flags, so we need to clear NT and AC
>  	 * ourselves.  To save a few cycles, we can check whether
-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH 00/23] KAISER: unmap most of the kernel from userspace page tables
  2017-11-01 17:31       ` Dave Hansen
  2017-11-01 17:58         ` Randy Dunlap
@ 2017-11-01 18:27         ` Linus Torvalds
  2017-11-01 18:46           ` Dave Hansen
  1 sibling, 1 reply; 102+ messages in thread
From: Linus Torvalds @ 2017-11-01 18:27 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Andy Lutomirski, Linux Kernel Mailing List, linux-mm, Kees Cook,
	Hugh Dickins

On Wed, Nov 1, 2017 at 10:31 AM, Dave Hansen
<dave.hansen@linux.intel.com> wrote:
>
> I assume that you're really worried about having to go two places to do
> one thing, like clearing a dirty bit, or unmapping a PTE, especially
> when we have to do that for userspace.  Thankfully, the sharing of the
> page tables (under the PGD) for userspace gets rid of most of this
> nastiness.

Right. That's the primary thing, and just clarifying that this is for
kernel addresses only will help at least some.

But even for the kernel case, it worries me a bit. We have much fewer
coherency issues for the kernel, but we do end up having some cases
that modify kernel mappings too. Most notably there are the
cacheability things where we've had machine check exceptions when the
same page is mapped non-cacheable in user space and cacheable in kernel
space, which ends up causing all that pain we have in
arch/x86/mm/pageattr.c.

I very much think you limit the pages that get mapped in the shadow
page tables to the point where this shouldn't be an issue, but at the
same time, I very much do want people to be aware of it and this be
commented very clearly in the code.

Honestly, the code looks like it is designed to, and can, map
arbitrary physical pages at arbitrary virtual addresses. And that is
NOT RIGHT.

So I'd like to see not just the comments about this, but I'd like to
see the code itself actually making that very clear. Have *code* that
verifies that nobody ever tries to use this on a user address (because
that would *completely* screw up all coherency), but also I don't see
why the code possibly looks up the old physical address in the page
table. Is there _any_ possible reason why you'd want to look up a page
from an old page table? As far as I can tell, we should always know
the physical page we are mapping a priori - we're never re-mapping
random virtual addresses or a highmem page or anything like that.
We're mapping the 1:1 kernel mapping only.

So the code really looks much too generic to me. It seems to be
designed to be used for cases where it simply could not *possibly* be
valid to use.

There's a disease in computer science that thinks that "generic code"
is somehow better code. That's not the case. We aren't mapping generic
pages, and must not map them or let people make that mistake. I'd
*much* rather the code make it very clear that it's not generic code
in any way shape or form.

                Linus

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH 01/23] x86, kaiser: prepare assembly for entry/exit CR3 switching
  2017-11-01 18:18   ` Borislav Petkov
@ 2017-11-01 18:27     ` Dave Hansen
  2017-11-01 20:42       ` Borislav Petkov
  0 siblings, 1 reply; 102+ messages in thread
From: Dave Hansen @ 2017-11-01 18:27 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: linux-kernel, linux-mm, moritz.lipp, daniel.gruss,
	michael.schwarz, luto, torvalds, keescook, hughd, x86

On 11/01/2017 11:18 AM, Borislav Petkov wrote:
>> +.macro SAVE_AND_SWITCH_TO_KERNEL_CR3 scratch_reg:req save_reg:req
>> +	movq	%cr3, %r\scratch_reg
>> +	movq	%r\scratch_reg, \save_reg
> 
> So one of the args gets passed as "ax", for example, which then gets
> completed to a register with the "%r" prepended and the other is a full
> register: %r14.
> 
> What for? Can we stick with one format pls?

This allows for a tiny optimization of Andy's that I realize I must have
blown away at some point.  It lets us do a 32-bit-register instruction
(i.e., using %eXX) when checking KAISER_SWITCH_MASK instead of a 64-bit
register via %rXX.

I don't feel strongly about maintaining that optimization; it looks weird
and surely doesn't actually do much.
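
For illustration only, the trick is that a bare name like "di" lets the
macro form either register width, so the mask check can be a 32-bit
instruction (sketch, not the actual macro):

	.macro CHECK_KAISER_SWITCH_BIT scratch_reg:req
	movq	%cr3, %r\scratch_reg
	testl	$KAISER_SWITCH_MASK, %e\scratch_reg
	.endm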

>> diff -puN arch/x86/entry/entry_64_compat.S~kaiser-luto-base-cr3-work arch/x86/entry/entry_64_compat.S
>> --- a/arch/x86/entry/entry_64_compat.S~kaiser-luto-base-cr3-work	2017-10-31 15:03:48.107007348 -0700
>> +++ b/arch/x86/entry/entry_64_compat.S	2017-10-31 15:03:48.113007631 -0700
>> @@ -48,8 +48,13 @@
>>  ENTRY(entry_SYSENTER_compat)
>>  	/* Interrupts are off on entry. */
>>  	SWAPGS_UNSAFE_STACK
>> +
>>  	movq	PER_CPU_VAR(cpu_current_top_of_stack), %rsp
>>  
>> +	pushq	%rdi
>> +	SWITCH_TO_KERNEL_CR3 scratch_reg=%rdi
>> +	popq	%rdi
> 
> So we switch to kernel CR3 right after we've setup kernel stack...
> 
>> +
>>  	/*
>>  	 * User tracing code (ptrace or signal handlers) might assume that
>>  	 * the saved RAX contains a 32-bit number when we're invoking a 32-bit
>> @@ -91,6 +96,9 @@ ENTRY(entry_SYSENTER_compat)
>>  	pushq   $0			/* pt_regs->r15 = 0 */
>>  	cld
>>  
>> +	pushq	%rdi
>> +	SWITCH_TO_KERNEL_CR3 scratch_reg=%rdi
>> +	popq	%rdi
> 
> ... and switch here *again*, after pushing pt_regs?!? What's up?
> 
>>  	/*
>>  	 * SYSENTER doesn't filter flags, so we need to clear NT and AC
>>  	 * ourselves.  To save a few cycles, we can check whether

Thanks for catching that.  We can kill one of these.  I'm inclined to
kill the first one.  Looking at the second one since we've just saved
off ptregs, that should make %rdi safe to clobber without the push/pop
at all.
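
Roughly, the second site would then end up like this (sketch only):

	pushq   $0			/* pt_regs->r15 = 0 */
	cld

	/* pt_regs is complete here, so %rdi is free to clobber: */
	SWITCH_TO_KERNEL_CR3 scratch_reg=%rdi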

Does that seem like it would work?

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH 00/23] KAISER: unmap most of the kernel from userspace page tables
  2017-11-01 18:27         ` Linus Torvalds
@ 2017-11-01 18:46           ` Dave Hansen
  2017-11-01 19:05             ` Linus Torvalds
  0 siblings, 1 reply; 102+ messages in thread
From: Dave Hansen @ 2017-11-01 18:46 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andy Lutomirski, Linux Kernel Mailing List, linux-mm, Kees Cook,
	Hugh Dickins

On 11/01/2017 11:27 AM, Linus Torvalds wrote:
> So I'd like to see not just the comments about this, but I'd like to
> see the code itself actually making that very clear. Have *code* that
> verifies that nobody ever tries to use this on a user address (because
> that would *completely* screw up all coherency), but also I don't see
> why the code possibly looks up the old physical address in the page
> table. Is there _any_ possible reason why you'd want to look up a page
> from an old page table? As far as I can tell, we should always know
> the physical page we are mapping a priori - we're never re-mapping
> random virtual addresses or a highmem page or anything like that.
> We're mapping the 1:1 kernel mapping only.

The vmalloc()'d stacks definitely need the page table walk.  That's yet
another thing that will get simpler once we stop needing to map the
process stacks.  I think there was also a need to do this for the fixmap
addresses for the GDT.

But, I'm totally with you on making this stuff less generic.

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH 00/23] KAISER: unmap most of the kernel from userspace page tables
  2017-11-01 18:46           ` Dave Hansen
@ 2017-11-01 19:05             ` Linus Torvalds
  2017-11-01 20:33               ` Andy Lutomirski
  0 siblings, 1 reply; 102+ messages in thread
From: Linus Torvalds @ 2017-11-01 19:05 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Andy Lutomirski, Linux Kernel Mailing List, linux-mm, Kees Cook,
	Hugh Dickins

On Wed, Nov 1, 2017 at 11:46 AM, Dave Hansen
<dave.hansen@linux.intel.com> wrote:
>
> The vmalloc()'d stacks definitely need the page table walk.

Ugh, yes. Nasty.

Andy at some point mentioned a per-cpu initial stack trampoline thing
for his exception patches, but I'm not sure he actually ever did that.

Andy?

              Linus

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH 21/23] x86, pcid, kaiser: allow flushing for future ASID switches
  2017-11-01 14:17     ` Dave Hansen
@ 2017-11-01 20:31       ` Andy Lutomirski
  2017-11-01 20:59         ` Dave Hansen
  0 siblings, 1 reply; 102+ messages in thread
From: Andy Lutomirski @ 2017-11-01 20:31 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Andy Lutomirski, linux-kernel, linux-mm, moritz.lipp,
	Daniel Gruss, michael.schwarz, Linus Torvalds, Kees Cook,
	Hugh Dickins, X86 ML

On Wed, Nov 1, 2017 at 7:17 AM, Dave Hansen <dave.hansen@linux.intel.com> wrote:
> On 11/01/2017 01:03 AM, Andy Lutomirski wrote:
>>> This ensures that any future context switches will do a full flush
>>> of the TLB so they pick up the changes.
>> I'm confused.  What was wrong with the old code?  I guess I just don't
>> see what the problem is that is solved by this patch.
>
> Instead of flushing *now* with INVPCID, this lets us flush *later* with
> CR3.  It just hijacks the code that you already have that flushes CR3
> when loading a new ASID by making all ASIDs look new in the future.
>
> We have to load CR3 anyway, so we might as well just do this flush then.

Would it make more sense to put it in flush_tlb_func_common() instead?

Also, I don't understand what clear_non_loaded_ctxs() is trying to do.
It looks like it's invalidating all the other logical address spaces.
And I don't see why you want an all_other_ctxs_invalid variable.  Isn't
the goal to mark a single ASID as needing a *user* flush the next time
we switch to user mode using that ASID?  Your code seems like it's
going to flush a lot of *kernel* PCIDs.

Can you explain the overall logic?

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH 00/23] KAISER: unmap most of the kernel from userspace page tables
  2017-11-01 19:05             ` Linus Torvalds
@ 2017-11-01 20:33               ` Andy Lutomirski
  2017-11-02  7:32                 ` Andy Lutomirski
  0 siblings, 1 reply; 102+ messages in thread
From: Andy Lutomirski @ 2017-11-01 20:33 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dave Hansen, Andy Lutomirski, Linux Kernel Mailing List,
	linux-mm, Kees Cook, Hugh Dickins

On Wed, Nov 1, 2017 at 12:05 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Wed, Nov 1, 2017 at 11:46 AM, Dave Hansen
> <dave.hansen@linux.intel.com> wrote:
>>
>> The vmalloc()'d stacks definitely need the page table walk.
>
> Ugh, yes. Nasty.
>
> Andy at some point mentioned a per-cpu initial stack trampoline thing
> for his exception patches, but I'm not sure he actually ever did that.
>
> Andy?

I'm going to push it to kernel.org very shortly (like twenty minutes
maybe).  Then the 0day bot can chew on it.  With the proposed LDT
rework, we don't need to do any of the dynamic mapping stuff, I think.

>
>               Linus

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH 01/23] x86, kaiser: prepare assembly for entry/exit CR3 switching
  2017-11-01 18:27     ` Dave Hansen
@ 2017-11-01 20:42       ` Borislav Petkov
  0 siblings, 0 replies; 102+ messages in thread
From: Borislav Petkov @ 2017-11-01 20:42 UTC (permalink / raw)
  To: Dave Hansen
  Cc: linux-kernel, linux-mm, moritz.lipp, daniel.gruss,
	michael.schwarz, luto, torvalds, keescook, hughd, x86

On Wed, Nov 01, 2017 at 11:27:48AM -0700, Dave Hansen wrote:
> This allows for a tiny optimization of Andy's that I realize I must have
> blown away at some point.  It lets us do a 32-bit-register instruction
> (i.e., using %eXX) when checking KAISER_SWITCH_MASK instead of a 64-bit
> register via %rXX.
> 
> I don't feel strongly about maintaining that optimization; it looks weird
> and surely doesn't actually do much.

Yeah, and consistent syntax would probably bring more.

> Thanks for catching that.  We can kill one of these.  I'm inclined to
> kill the first one.  Looking at the second one since we've just saved
> off ptregs, that should make %rdi safe to clobber without the push/pop
> at all.
> 
> Does that seem like it would work?

Yap, sounds about right.

Thx.

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH 21/23] x86, pcid, kaiser: allow flushing for future ASID switches
  2017-11-01 20:31       ` Andy Lutomirski
@ 2017-11-01 20:59         ` Dave Hansen
  2017-11-01 21:04           ` Andy Lutomirski
  0 siblings, 1 reply; 102+ messages in thread
From: Dave Hansen @ 2017-11-01 20:59 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: linux-kernel, linux-mm, moritz.lipp, Daniel Gruss,
	michael.schwarz, Linus Torvalds, Kees Cook, Hugh Dickins, X86 ML

On 11/01/2017 01:31 PM, Andy Lutomirski wrote:
> On Wed, Nov 1, 2017 at 7:17 AM, Dave Hansen <dave.hansen@linux.intel.com> wrote:
>> On 11/01/2017 01:03 AM, Andy Lutomirski wrote:
>>>> This ensures that any future context switches will do a full flush
>>>> of the TLB so they pick up the changes.
>>> I'm confused.  What was wrong with the old code?  I guess I just don't
>>> see what the problem is that is solved by this patch.
>>
>> Instead of flushing *now* with INVPCID, this lets us flush *later* with
>> CR3.  It just hijacks the code that you already have that flushes CR3
>> when loading a new ASID by making all ASIDs look new in the future.
>>
>> We have to load CR3 anyway, so we might as well just do this flush then.
> 
> Would it make more sense to put it in flush_tlb_func_common() instead?
> 
> Also, I don't understand what clear_non_loaded_ctxs() is trying to do.
> It looks like it's invalidating all the other logical address spaces.
> And I don't see why you want an all_other_ctxs_invalid variable.  Isn't
> the goal to mark a single ASID as needing a *user* flush the next time
> we switch to user mode using that ASID?  Your code seems like it's
> going to flush a lot of *kernel* PCIDs.

The point of the whole thing is to (relatively) efficiently flush
*kernel* TLB entries in *other* address spaces.  I did it way down in
the TLB handling functions because not everybody goes through
flush_tlb_func_common() to flush kernel addresses.

I used the variable instead of just invalidating the contexts directly
because I hooked into the __flush_tlb_single() path and it's used in
loops like this:

	for (addr = start; addr < end; addr += PAGE_SIZE)
		__flush_tlb_single(addr);

I didn't want to add a loop that effectively does:

	for (addr = start; addr < end; addr += PAGE_SIZE) {
		__flush_tlb_single(addr);
		for (i = 0; i < TLB_NR_DYN_ASIDS; i++)
			this_cpu_write(cpu_tlbstate.ctxs[i].ctx_id, 0);
	}

Even with just 6 ASIDS it seemed a little silly.  It would get _very_
silly if we ever decided to grow TLB_NR_DYN_ASIDS.
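
To spell out the intent (sketch only; the helper names here are made up,
not the ones in the patch):

	/* Hot path (__flush_tlb_single() and friends): just set a flag. */
	static inline void mark_other_asids_stale(void)
	{
		this_cpu_write(cpu_tlbstate.all_other_ctxs_invalid, true);
	}

	/* Consumed once, where we already have to write CR3 anyway: */
	static inline void flush_stale_asids_if_needed(void)
	{
		int i;

		if (!this_cpu_read(cpu_tlbstate.all_other_ctxs_invalid))
			return;

		for (i = 0; i < TLB_NR_DYN_ASIDS; i++) {
			if (i == this_cpu_read(cpu_tlbstate.loaded_mm_asid))
				continue;
			this_cpu_write(cpu_tlbstate.ctxs[i].ctx_id, 0);
		}
		this_cpu_write(cpu_tlbstate.all_other_ctxs_invalid, false);
	}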

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH 01/23] x86, kaiser: prepare assembly for entry/exit CR3 switching
  2017-10-31 22:31 ` [PATCH 01/23] x86, kaiser: prepare assembly for entry/exit CR3 switching Dave Hansen
  2017-11-01  0:43   ` Brian Gerst
  2017-11-01 18:18   ` Borislav Petkov
@ 2017-11-01 21:01   ` Thomas Gleixner
  2017-11-01 22:58     ` Dave Hansen
  2 siblings, 1 reply; 102+ messages in thread
From: Thomas Gleixner @ 2017-11-01 21:01 UTC (permalink / raw)
  To: Dave Hansen
  Cc: linux-kernel, linux-mm, moritz.lipp, daniel.gruss,
	michael.schwarz, luto, torvalds, keescook, hughd, x86

On Tue, 31 Oct 2017, Dave Hansen wrote:
>  
> +	pushq	%rdi
> +	SWITCH_TO_KERNEL_CR3 scratch_reg=%rdi
> +	popq	%rdi

Can you please have a macro variant which does:

    SWITCH_TO_KERNEL_CR3_PUSH reg=%rdi

So the pushq/popq is inside the macro. This has two reasons:

   1) If KAISER=n the pointless pushq/popq go away

   2) We need a boottime switch for that stuff, so we better have all
      related code in the various macros in order to patch it in/out.

Also, please wrap these macros in #ifdef KAISER right away and provide the
stubs as well. It does not make sense to have them in patch 7 when patch 1
introduces them.
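
Something along these lines, as a sketch (exact config symbol aside):

#ifdef CONFIG_KAISER
.macro SWITCH_TO_KERNEL_CR3_PUSH reg:req
	pushq	\reg
	SWITCH_TO_KERNEL_CR3 scratch_reg=\reg
	popq	\reg
.endm
#else
/* Stub, so call sites need no ifdeffery when KAISER is off: */
.macro SWITCH_TO_KERNEL_CR3_PUSH reg:req
.endm
#endif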

Aside of Boris comments this looks about right.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH 21/23] x86, pcid, kaiser: allow flushing for future ASID switches
  2017-11-01 20:59         ` Dave Hansen
@ 2017-11-01 21:04           ` Andy Lutomirski
  2017-11-01 21:06             ` Dave Hansen
  0 siblings, 1 reply; 102+ messages in thread
From: Andy Lutomirski @ 2017-11-01 21:04 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Andy Lutomirski, linux-kernel, linux-mm, moritz.lipp,
	Daniel Gruss, michael.schwarz, Linus Torvalds, Kees Cook,
	Hugh Dickins, X86 ML

On Wed, Nov 1, 2017 at 1:59 PM, Dave Hansen <dave.hansen@linux.intel.com> wrote:
> On 11/01/2017 01:31 PM, Andy Lutomirski wrote:
>> On Wed, Nov 1, 2017 at 7:17 AM, Dave Hansen <dave.hansen@linux.intel.com> wrote:
>>> On 11/01/2017 01:03 AM, Andy Lutomirski wrote:
>>>>> This ensures that any future context switches will do a full flush
>>>>> of the TLB so they pick up the changes.
>>>> I'm confused.  What was wrong with the old code?  I guess I just don't
>>>> see what the problem is that is solved by this patch.
>>>
>>> Instead of flushing *now* with INVPCID, this lets us flush *later* with
>>> CR3.  It just hijacks the code that you already have that flushes CR3
>>> when loading a new ASID by making all ASIDs look new in the future.
>>>
>>> We have to load CR3 anyway, so we might as well just do this flush then.
>>
>> Would it make more sense to put it in flush_tlb_func_common() instead?
>>
>> Also, I don't understand what clear_non_loaded_ctxs() is trying to do.
>> It looks like it's invalidating all the other logical address spaces.
>> And I don't see why you want an all_other_ctxs_invalid variable.  Isn't
>> the goal to mark a single ASID as needing a *user* flush the next time
>> we switch to user mode using that ASID?  Your code seems like it's
>> going to flush a lot of *kernel* PCIDs.
>
> The point of the whole thing is to (relatively) efficiently flush
> *kernel* TLB entries in *other* address spaces.

Aha!  That wasn't at all clear to me from the changelog.  Can I make a
totally different suggestion?  Add a new function
__flush_tlb_one_kernel() and use it for kernel addresses.  That
function should just do __flush_tlb_all() if KAISER is on.  Then make
sure that there are no performance-critical loops that call
__flush_tlb_one_kernel() in KAISER mode.  The approach you're using is
quite expensive, and I suspect that just doing a global flush may
actually be faster.  It's certainly a lot simpler.

Optionally add a warning to __flush_tlb_one() if the address is a
kernel address to help notice any missed conversions.  Or just rename
it to __flush_tlb_one_user().
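
Something like this is what I have in mind (sketch; kaiser_enabled() is a
placeholder for however the gate ends up being spelled):

	static inline void __flush_tlb_one_kernel(unsigned long addr)
	{
		/* Kernel mappings live in every ASID/PGD copy: punt to a full flush. */
		if (kaiser_enabled()) {
			__flush_tlb_all();
			return;
		}
		__flush_tlb_one(addr);
	}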

--Andy

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH 21/23] x86, pcid, kaiser: allow flushing for future ASID switches
  2017-11-01 21:04           ` Andy Lutomirski
@ 2017-11-01 21:06             ` Dave Hansen
  0 siblings, 0 replies; 102+ messages in thread
From: Dave Hansen @ 2017-11-01 21:06 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: linux-kernel, linux-mm, moritz.lipp, Daniel Gruss,
	michael.schwarz, Linus Torvalds, Kees Cook, Hugh Dickins, X86 ML

On 11/01/2017 02:04 PM, Andy Lutomirski wrote:
> Aha!  That wasn't at all clear to me from the changelog.  Can I make a
> totally different suggestion?  Add a new function
> __flush_tlb_one_kernel() and use it for kernel addresses. 

I'll look into this.

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH 02/23] x86, kaiser: do not set _PAGE_USER for init_mm page tables
  2017-10-31 22:31 ` [PATCH 02/23] x86, kaiser: do not set _PAGE_USER for init_mm page tables Dave Hansen
@ 2017-11-01 21:11   ` Thomas Gleixner
  2017-11-01 21:24     ` Andy Lutomirski
  0 siblings, 1 reply; 102+ messages in thread
From: Thomas Gleixner @ 2017-11-01 21:11 UTC (permalink / raw)
  To: Dave Hansen
  Cc: linux-kernel, linux-mm, moritz.lipp, daniel.gruss,
	michael.schwarz, luto, torvalds, keescook, hughd, x86

On Tue, 31 Oct 2017, Dave Hansen wrote:

> 
> init_mm is for kernel-exclusive use.  If someone is allocating page
> tables in it, do not set _PAGE_USER on them.  This ensures that
> we do *not* set NX on these page tables in the KAISER code.

This changelog is confusing at best.

Why is this a kaiser issue? Nothing should ever create _PAGE_USER entries
in init_mm, right?

So this is a general improvement and creating a _PAGE_USER entry in init_mm
should be considered a bug in the first place.

> +/*
> + * _KERNPG_TABLE has _PAGE_USER clear which tells the KAISER code
> + * that this mapping is for kernel use only.  That makes sure that
> + * we leave the mapping usable by the kernel and do not try to
> + * sabotage it by doing stuff like setting _PAGE_NX on it.

So this comment should not mention KAISER at all. As I explained above
there are no user mappings in init_mm and this should be expressed here.

The fact that KAISER can make use of this information is a different story.

Other than that:

      Reviewed-by: Thomas Gleixner <tglx@linutronix.de>

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH 03/23] x86, kaiser: disable global pages
  2017-10-31 22:31 ` [PATCH 03/23] x86, kaiser: disable global pages Dave Hansen
@ 2017-11-01 21:18   ` Thomas Gleixner
  2017-11-01 22:12     ` Dave Hansen
  0 siblings, 1 reply; 102+ messages in thread
From: Thomas Gleixner @ 2017-11-01 21:18 UTC (permalink / raw)
  To: Dave Hansen
  Cc: linux-kernel, linux-mm, moritz.lipp, daniel.gruss,
	michael.schwarz, luto, torvalds, keescook, hughd, x86

On Tue, 31 Oct 2017, Dave Hansen wrote:
> --- a/arch/x86/include/asm/pgtable_types.h~kaiser-prep-disable-global-pages	2017-10-31 15:03:49.314064402 -0700
> +++ b/arch/x86/include/asm/pgtable_types.h	2017-10-31 15:03:49.323064827 -0700
> @@ -47,7 +47,12 @@
>  #define _PAGE_ACCESSED	(_AT(pteval_t, 1) << _PAGE_BIT_ACCESSED)
>  #define _PAGE_DIRTY	(_AT(pteval_t, 1) << _PAGE_BIT_DIRTY)
>  #define _PAGE_PSE	(_AT(pteval_t, 1) << _PAGE_BIT_PSE)
> +#ifdef CONFIG_X86_GLOBAL_PAGES
>  #define _PAGE_GLOBAL	(_AT(pteval_t, 1) << _PAGE_BIT_GLOBAL)
> +#else
> +/* We must ensure that kernel TLBs are unusable while in userspace */
> +#define _PAGE_GLOBAL	(_AT(pteval_t, 0))
> +#endif

What you really want to do here is to clear PAGE_GLOBAL in the
supported_pte_mask. probe_page_size_mask() is the proper place for that.

This allows both .config and boottime configuration.
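
Roughly (sketch; kaiser_enabled() stands in for the .config/boot-time switch):

	/* in probe_page_size_mask(): */
	if (boot_cpu_has(X86_FEATURE_PGE) && !kaiser_enabled()) {
		cr4_set_bits_and_update_boot(X86_CR4_PGE);
		__supported_pte_mask |= _PAGE_GLOBAL;
	} else {
		__supported_pte_mask &= ~_PAGE_GLOBAL;
	}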

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH 02/23] x86, kaiser: do not set _PAGE_USER for init_mm page tables
  2017-11-01 21:11   ` Thomas Gleixner
@ 2017-11-01 21:24     ` Andy Lutomirski
  2017-11-01 21:28       ` Thomas Gleixner
  0 siblings, 1 reply; 102+ messages in thread
From: Andy Lutomirski @ 2017-11-01 21:24 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Dave Hansen, linux-kernel, linux-mm, moritz.lipp, Daniel Gruss,
	michael.schwarz, Andrew Lutomirski, Linus Torvalds, Kees Cook,
	Hugh Dickins, X86 ML

On Wed, Nov 1, 2017 at 2:11 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
> On Tue, 31 Oct 2017, Dave Hansen wrote:
>
>>
>> init_mm is for kernel-exclusive use.  If someone is allocating page
>> tables in it, do not set _PAGE_USER on them.  This ensures that
>> we do *not* set NX on these page tables in the KAISER code.
>
> This changelog is confusing at best.
>
> Why is this a kaiser issue? Nothing should ever create _PAGE_USER entries
> in init_mm, right?

The vsyscall page is _PAGE_USER and lives in init_mm via the fixmap.

--Andy

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH 04/23] x86, tlb: make CR4-based TLB flushes more robust
  2017-10-31 22:31 ` [PATCH 04/23] x86, tlb: make CR4-based TLB flushes more robust Dave Hansen
  2017-11-01  8:01   ` Andy Lutomirski
@ 2017-11-01 21:25   ` Thomas Gleixner
  2017-11-01 22:24     ` Dave Hansen
  1 sibling, 1 reply; 102+ messages in thread
From: Thomas Gleixner @ 2017-11-01 21:25 UTC (permalink / raw)
  To: Dave Hansen
  Cc: linux-kernel, linux-mm, moritz.lipp, daniel.gruss,
	michael.schwarz, luto, torvalds, keescook, hughd, x86

On Tue, 31 Oct 2017, Dave Hansen wrote:
> Our CR4-based TLB flush currently requires global pages to be
> supported *and* enabled.  But, we really only need for them to be
> supported.  Make the code more robust by allowing X86_CR4_PGE to
> clear as well as set.

That's not what the patch is actually doing.

>  	cr4 = this_cpu_read(cpu_tlbstate.cr4);
> -	/* clear PGE */
> -	native_write_cr4(cr4 & ~X86_CR4_PGE);
> -	/* write old PGE again and flush TLBs */
> +	/*
> +	 * This function is only called on systems that support X86_CR4_PGE
> +	 * and where always set X86_CR4_PGE.  Warn if we are called without
> +	 * PGE set.
> +	 */
> +	WARN_ON_ONCE(!(cr4 & X86_CR4_PGE));

Because if CR4_PGE is not set, this warning triggers. So this defeats the
toggle mode you are implementing.

> +	/*
> +	 * Architecturally, any _change_ to X86_CR4_PGE will fully flush the
> +	 * TLB of all entries including all entries in all PCIDs and all
> +	 * global pages.  Make sure that we _change_ the bit, regardless of
> +	 * whether we had X86_CR4_PGE set in the first place.
> +	 */
> +	native_write_cr4(cr4 ^ X86_CR4_PGE);
> +	/* Put original CR3 value back: */

That wants to be CR4. Restoring CR3 to CR4 might be suboptimal.

>  	native_write_cr4(cr4);

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH 02/23] x86, kaiser: do not set _PAGE_USER for init_mm page tables
  2017-11-01 21:24     ` Andy Lutomirski
@ 2017-11-01 21:28       ` Thomas Gleixner
  2017-11-01 21:52         ` Dave Hansen
  2017-11-02  7:07         ` Andy Lutomirski
  0 siblings, 2 replies; 102+ messages in thread
From: Thomas Gleixner @ 2017-11-01 21:28 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Dave Hansen, linux-kernel, linux-mm, moritz.lipp, Daniel Gruss,
	michael.schwarz, Linus Torvalds, Kees Cook, Hugh Dickins, X86 ML

On Wed, 1 Nov 2017, Andy Lutomirski wrote:

> On Wed, Nov 1, 2017 at 2:11 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
> > On Tue, 31 Oct 2017, Dave Hansen wrote:
> >
> >>
> >> init_mm is for kernel-exclusive use.  If someone is allocating page
> >> tables in it, do not set _PAGE_USER on them.  This ensures that
> >> we do *not* set NX on these page tables in the KAISER code.
> >
> > This changelog is confusing at best.
> >
> > Why is this a kaiser issue? Nothing should ever create _PAGE_USER entries
> > in init_mm, right?
> 
> The vsyscall page is _PAGE_USER and lives in init_mm via the fixmap.

Groan, forgot about that abomination, but still there is no point in having
it marked PAGE_USER in the init_mm at all, kaiser or not.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH 06/23] x86, kaiser: introduce user-mapped percpu areas
  2017-10-31 22:31 ` [PATCH 06/23] x86, kaiser: introduce user-mapped percpu areas Dave Hansen
@ 2017-11-01 21:47   ` Thomas Gleixner
  0 siblings, 0 replies; 102+ messages in thread
From: Thomas Gleixner @ 2017-11-01 21:47 UTC (permalink / raw)
  To: Dave Hansen
  Cc: linux-kernel, linux-mm, moritz.lipp, daniel.gruss,
	michael.schwarz, luto, torvalds, keescook, hughd, x86

On Tue, 31 Oct 2017, Dave Hansen wrote:

> 
> These patches are based on work from a team at Graz University of
> Technology posted here: https://github.com/IAIK/KAISER
> 
> The KAISER approach keeps two copies of the page tables: one for running
> in the kernel and one for running userspace.  But, there are a few
> structures that are needed for switching in and out of the kernel and
> a good subset of *those* are per-cpu data.
> 
> This patch creates a new kind of per-cpu data that is mapped and can be
> used no matter which copy of the page tables we are using.

Please split out the percpu-defs.h change into a separate patch.
 
> -DECLARE_PER_CPU_PAGE_ALIGNED(struct gdt_page, gdt_page);
> +DECLARE_PER_CPU_PAGE_ALIGNED_USER_MAPPED(struct gdt_page, gdt_page);

Ok.

>  /* Provide the original GDT */
>  static inline struct desc_struct *get_cpu_gdt_rw(unsigned int cpu)
> diff -puN arch/x86/include/asm/hw_irq.h~kaiser-prep-user-mapped-percpu arch/x86/include/asm/hw_irq.h
> --- a/arch/x86/include/asm/hw_irq.h~kaiser-prep-user-mapped-percpu	2017-10-31 15:03:51.048146366 -0700
> +++ b/arch/x86/include/asm/hw_irq.h	2017-10-31 15:03:51.066147217 -0700
> @@ -160,7 +160,7 @@ extern char irq_entries_start[];
>  #define VECTOR_RETRIGGERED	((void *)~0UL)
>  
>  typedef struct irq_desc* vector_irq_t[NR_VECTORS];
> -DECLARE_PER_CPU(vector_irq_t, vector_irq);
> +DECLARE_PER_CPU_USER_MAPPED(vector_irq_t, vector_irq);

Why? The vector_irq array has nothing to do with user space. It's a
software handled storage which is used in the irq dispatcher way after the
exception entry happened.

I think you confused that with the IDT, which is missing here.

> -DECLARE_PER_CPU_SHARED_ALIGNED(struct tss_struct, cpu_tss);
> +DECLARE_PER_CPU_SHARED_ALIGNED_USER_MAPPED(struct tss_struct, cpu_tss);

Ok.

> -DEFINE_PER_CPU_PAGE_ALIGNED(struct gdt_page, gdt_page) = { .gdt = {
> +DEFINE_PER_CPU_PAGE_ALIGNED_USER_MAPPED(struct gdt_page, gdt_page) = { .gdt = {

Ok.

> -static DEFINE_PER_CPU_PAGE_ALIGNED(char, exception_stacks
> +DEFINE_PER_CPU_PAGE_ALIGNED_USER_MAPPED(char, exception_stacks
>  	[(N_EXCEPTION_STACKS - 1) * EXCEPTION_STKSZ + DEBUG_STKSZ]);

Hmm. I don't think that's a good idea. We discussed that in Prague with
Andy and the Peters and came to the conclusion that we want a stub stack in
the user mapping and switch to the kernel stacks in software after
switching back to the kernel mappings. Andy's 'Pile o' entry...' series
paves the way to that already. So can we please put kaiser on top of those
and do it properly right away?

> -DEFINE_PER_CPU(vector_irq_t, vector_irq) = {
> +DEFINE_PER_CPU_USER_MAPPED(vector_irq_t, vector_irq) = {
>  	[0 ... NR_VECTORS - 1] = VECTOR_UNUSED,
>  };

See above.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH 02/23] x86, kaiser: do not set _PAGE_USER for init_mm page tables
  2017-11-01 21:28       ` Thomas Gleixner
@ 2017-11-01 21:52         ` Dave Hansen
  2017-11-01 22:11           ` Thomas Gleixner
  2017-11-01 22:12           ` Linus Torvalds
  2017-11-02  7:07         ` Andy Lutomirski
  1 sibling, 2 replies; 102+ messages in thread
From: Dave Hansen @ 2017-11-01 21:52 UTC (permalink / raw)
  To: Thomas Gleixner, Andy Lutomirski
  Cc: linux-kernel, linux-mm, moritz.lipp, Daniel Gruss,
	michael.schwarz, Linus Torvalds, Kees Cook, Hugh Dickins, X86 ML

On 11/01/2017 02:28 PM, Thomas Gleixner wrote:
> On Wed, 1 Nov 2017, Andy Lutomirski wrote:
>> The vsyscall page is _PAGE_USER and lives in init_mm via the fixmap.
> 
> Groan, forgot about that abomination, but still there is no point in having
> it marked PAGE_USER in the init_mm at all, kaiser or not.

So shouldn't this patch effectively make the vsyscall page unusable?
Any idea why that didn't show up in any of the x86 selftests?

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH 02/23] x86, kaiser: do not set _PAGE_USER for init_mm page tables
  2017-11-01 21:52         ` Dave Hansen
@ 2017-11-01 22:11           ` Thomas Gleixner
  2017-11-01 22:12           ` Linus Torvalds
  1 sibling, 0 replies; 102+ messages in thread
From: Thomas Gleixner @ 2017-11-01 22:11 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Andy Lutomirski, linux-kernel, linux-mm, moritz.lipp,
	Daniel Gruss, michael.schwarz, Linus Torvalds, Kees Cook,
	Hugh Dickins, X86 ML

On Wed, 1 Nov 2017, Dave Hansen wrote:

> On 11/01/2017 02:28 PM, Thomas Gleixner wrote:
> > On Wed, 1 Nov 2017, Andy Lutomirski wrote:
> >> The vsyscall page is _PAGE_USER and lives in init_mm via the fixmap.
> > 
> > Groan, forgot about that abomination, but still there is no point in having
> > it marked PAGE_USER in the init_mm at all, kaiser or not.
> 
> So shouldn't this patch effectively make the vsyscall page unusable?
> Any idea why that didn't show up in any of the x86 selftests?

vsyscall is the legacy mechanism. Halfway modern userspace does not need
it at all.

The default for it is EMULATE except you set it to NATIVE either via
Kconfig or on the kernel command line. Distros ship it with EMULATE set.
The emulation does not use the fixmap; it traps the access and emulates it.

But that aside. The point is that the fixmap exists in the init_mm and if
vsyscall is enabled then its also established in the process mappings.

So this can be done as a general correctness change:

  - Prevent USER mappings in init_mm

  - Make sure the fixmap gets the USER bit in the process mapping when
    vsyscall is in native mode.

We can avoid the latter by just removing the native vsyscall support and only
supporting emulation and none. It's about time to kill that stuff anyway.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH 03/23] x86, kaiser: disable global pages
  2017-11-01 21:18   ` Thomas Gleixner
@ 2017-11-01 22:12     ` Dave Hansen
  2017-11-01 22:28       ` Thomas Gleixner
  0 siblings, 1 reply; 102+ messages in thread
From: Dave Hansen @ 2017-11-01 22:12 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-kernel, linux-mm, moritz.lipp, daniel.gruss,
	michael.schwarz, luto, torvalds, keescook, hughd, x86

On 11/01/2017 02:18 PM, Thomas Gleixner wrote:
> On Tue, 31 Oct 2017, Dave Hansen wrote:
>> --- a/arch/x86/include/asm/pgtable_types.h~kaiser-prep-disable-global-pages	2017-10-31 15:03:49.314064402 -0700
>> +++ b/arch/x86/include/asm/pgtable_types.h	2017-10-31 15:03:49.323064827 -0700
>> @@ -47,7 +47,12 @@
>>  #define _PAGE_ACCESSED	(_AT(pteval_t, 1) << _PAGE_BIT_ACCESSED)
>>  #define _PAGE_DIRTY	(_AT(pteval_t, 1) << _PAGE_BIT_DIRTY)
>>  #define _PAGE_PSE	(_AT(pteval_t, 1) << _PAGE_BIT_PSE)
>> +#ifdef CONFIG_X86_GLOBAL_PAGES
>>  #define _PAGE_GLOBAL	(_AT(pteval_t, 1) << _PAGE_BIT_GLOBAL)
>> +#else
>> +/* We must ensure that kernel TLBs are unusable while in userspace */
>> +#define _PAGE_GLOBAL	(_AT(pteval_t, 0))
>> +#endif
> 
> What you really want to do here is to clear PAGE_GLOBAL in the
> supported_pte_mask. probe_page_size_mask() is the proper place for that.

How does something like this look?  I just remove _PAGE_GLOBAL from the
default __PAGE_KERNEL permissions.

> https://git.kernel.org/pub/scm/linux/kernel/git/daveh/x86-kaiser.git/commit/?h=kaiser-dynamic-414rc6-20171101&id=c9f7109207f87c168a6674a4826a701bd0c7333f

I was a bit worried that if we pull _PAGE_GLOBAL out of
__supported_pte_mask itself, we might not be able to use it for the
shadow entries that map the entry/exit code like Linus suggested.

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH 02/23] x86, kaiser: do not set _PAGE_USER for init_mm page tables
  2017-11-01 21:52         ` Dave Hansen
  2017-11-01 22:11           ` Thomas Gleixner
@ 2017-11-01 22:12           ` Linus Torvalds
  2017-11-01 22:20             ` Thomas Gleixner
  1 sibling, 1 reply; 102+ messages in thread
From: Linus Torvalds @ 2017-11-01 22:12 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Thomas Gleixner, Andy Lutomirski, linux-kernel, linux-mm,
	moritz.lipp, Daniel Gruss, michael.schwarz, Kees Cook,
	Hugh Dickins, X86 ML

On Wed, Nov 1, 2017 at 2:52 PM, Dave Hansen <dave.hansen@linux.intel.com> wrote:
> On 11/01/2017 02:28 PM, Thomas Gleixner wrote:
>> On Wed, 1 Nov 2017, Andy Lutomirski wrote:
>>> The vsyscall page is _PAGE_USER and lives in init_mm via the fixmap.
>>
>> Groan, forgot about that abomination, but still there is no point in having
>> it marked PAGE_USER in the init_mm at all, kaiser or not.
>
> So shouldn't this patch effectively make the vsyscall page unusable?
> Any idea why that didn't show up in any of the x86 selftests?

I actually think there may be two issues here:

 - vsyscall isn't even used much - if any - any more

 - the vsyscall emulation works fine without _PAGE_USER, since the
whole point is that we take a fault on it and then emulate.

We do expose the vsyscall page read-only to user space in the
emulation case, but I'm not convinced that's even required.

Nobody who configures KAISER enabled would possibly want to have the
actual native vsyscall page enabled. That would be an insane
combination.

So the only possible difference would be a user mode program that
actually looks at the vsyscall page, which sounds unlikely to be an
issue.  It's legacy and not really used.

            Linus

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH 00/23] KAISER: unmap most of the kernel from userspace page tables
  2017-11-01  8:54 ` Ingo Molnar
  2017-11-01 14:09   ` Thomas Gleixner
@ 2017-11-01 22:14   ` Dave Hansen
  2017-11-01 22:28     ` Linus Torvalds
                       ` (2 more replies)
  1 sibling, 3 replies; 102+ messages in thread
From: Dave Hansen @ 2017-11-01 22:14 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, linux-mm, Andy Lutomirski, Linus Torvalds,
	Thomas Gleixner, Peter Zijlstra, H. Peter Anvin,
	Borislav Petkov, Brian Gerst, Denys Vlasenko, Josh Poimboeuf, Thomas Garnier,
	Kees Cook

On 11/01/2017 01:54 AM, Ingo Molnar wrote:
> Beyond the inevitable cavalcade of (solvable) problems that will pop up during 
> review, one major item I'd like to see addressed is runtime configurability: it 
> should be possible to switch between a CR3-flushing and a regular syscall and page 
> table model on the admin level, without restarting the kernel and apps. Distros 
> really, really don't want to double the number of kernel variants they have.
> 
> The 'Kaiser off' runtime switch doesn't have to be as efficient as 
> CONFIG_KAISER=n, at least initially, but at minimum it should avoid the most 
> expensive page table switching paths in the syscall entry codepaths.

Due to popular demand, I went and implemented this today.  It's not the
prettiest code I ever wrote, but it's pretty small.

Just in case anyone wants to play with it, I threw a snapshot of it up here:

> https://git.kernel.org/pub/scm/linux/kernel/git/daveh/x86-kaiser.git/log/?h=kaiser-dynamic-414rc6-20171101

I ran some quick tests.  When CONFIG_KAISER=y, but "echo 0 >
kaiser-enabled", the tests that I ran were within the noise vs. a
vanilla kernel, and that's with *zero* optimization.

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH 02/23] x86, kaiser: do not set _PAGE_USER for init_mm page tables
  2017-11-01 22:12           ` Linus Torvalds
@ 2017-11-01 22:20             ` Thomas Gleixner
  2017-11-01 22:45               ` Kees Cook
  2017-11-02  7:10               ` Andy Lutomirski
  0 siblings, 2 replies; 102+ messages in thread
From: Thomas Gleixner @ 2017-11-01 22:20 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Dave Hansen, Andy Lutomirski, linux-kernel, linux-mm,
	moritz.lipp, Daniel Gruss, michael.schwarz, Kees Cook,
	Hugh Dickins, X86 ML

On Wed, 1 Nov 2017, Linus Torvalds wrote:
> On Wed, Nov 1, 2017 at 2:52 PM, Dave Hansen <dave.hansen@linux.intel.com> wrote:
> > On 11/01/2017 02:28 PM, Thomas Gleixner wrote:
> >> On Wed, 1 Nov 2017, Andy Lutomirski wrote:
> >>> The vsyscall page is _PAGE_USER and lives in init_mm via the fixmap.
> >>
> >> Groan, forgot about that abomination, but still there is no point in having
> >> it marked PAGE_USER in the init_mm at all, kaiser or not.
> >
> > So shouldn't this patch effectively make the vsyscall page unusable?
> > Any idea why that didn't show up in any of the x86 selftests?
> 
> I actually think there may be two issues here:
> 
>  - vsyscall isn't even used much - if any - any more

Only legacy user space uses it.

>  - the vsyscall emulation works fine without _PAGE_USER, since the
> whole point is that we take a fault on it and then emulate.
> 
> We do expose the vsyscall page read-only to user space in the
> emulation case, but I'm not convinced that's even required.

I don't see a reason why it needs to be mapped at all for emulation.

> Nobody who configures KAISER enabled would possibly want to have the
> actual native vsyscall page enabled. That would be an insane
> combination.
> 
> So the only possible difference would be a user mode program that
> actually looks at the vsyscall page, which sounds unlikely to be an
> issue.  It's legacy and not really used.

Right, and we can either disable the NATIVE mode when KAISER is on or just
rip the native mode out completely. Most distros have native mode disabled
anyway, so you cannot even enable it on the kernel command line.

I'm all for ripping it out or at least removing the config switch to enable
native mode as a first step.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH 04/23] x86, tlb: make CR4-based TLB flushes more robust
  2017-11-01 11:18           ` Andy Lutomirski
@ 2017-11-01 22:21             ` Dave Hansen
  0 siblings, 0 replies; 102+ messages in thread
From: Dave Hansen @ 2017-11-01 22:21 UTC (permalink / raw)
  To: Andy Lutomirski, Kirill A. Shutemov
  Cc: linux-kernel, linux-mm, moritz.lipp, Daniel Gruss,
	michael.schwarz, Linus Torvalds, Kees Cook, Hugh Dickins, X86 ML

On 11/01/2017 04:18 AM, Andy Lutomirski wrote:
>>> How about just adding a VM_WARN_ON_ONCE, then?
>> What's wrong with xor? The function will continue to work this way even if
>> CR4.PGE is disabled.
> That's true.  OTOH, since no one is actually proposing doing that,
> there's an argument that people should get warned and therefore be
> forced to think about it.

What this patch does in the end is make sure that
__native_flush_tlb_global_irq_disabled() works, no matter the initial
state of CR4.PGE, *and* it makes it WARN if it gets called in an
unexpected initial state (CR4.PGE clear).

That's the best of both worlds IMNHO.  Makes people think, and does the
right thing no matter what.

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH 04/23] x86, tlb: make CR4-based TLB flushes more robust
  2017-11-01 21:25   ` Thomas Gleixner
@ 2017-11-01 22:24     ` Dave Hansen
  2017-11-01 22:30       ` Thomas Gleixner
  0 siblings, 1 reply; 102+ messages in thread
From: Dave Hansen @ 2017-11-01 22:24 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-kernel, linux-mm, moritz.lipp, daniel.gruss,
	michael.schwarz, luto, torvalds, keescook, hughd, x86

On 11/01/2017 02:25 PM, Thomas Gleixner wrote:
>>  	cr4 = this_cpu_read(cpu_tlbstate.cr4);
>> -	/* clear PGE */
>> -	native_write_cr4(cr4 & ~X86_CR4_PGE);
>> -	/* write old PGE again and flush TLBs */
>> +	/*
>> +	 * This function is only called on systems that support X86_CR4_PGE
>> +	 * and where always set X86_CR4_PGE.  Warn if we are called without
>> +	 * PGE set.
>> +	 */
>> +	WARN_ON_ONCE(!(cr4 & X86_CR4_PGE));
> Because if CR4_PGE is not set, this warning triggers. So this defeats the
> toggle mode you are implementing.

The warning is there because there is probably plenty of *other* stuff
that breaks if we have X86_FEATURE_PGE=1, but CR4.PGE=0.

The point of this was to make this function do the right thing no matter
what, but warn if it gets called in an unexpected way.

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH 03/23] x86, kaiser: disable global pages
  2017-11-01 22:12     ` Dave Hansen
@ 2017-11-01 22:28       ` Thomas Gleixner
  0 siblings, 0 replies; 102+ messages in thread
From: Thomas Gleixner @ 2017-11-01 22:28 UTC (permalink / raw)
  To: Dave Hansen
  Cc: linux-kernel, linux-mm, moritz.lipp, daniel.gruss,
	michael.schwarz, luto, torvalds, keescook, hughd, x86

On Wed, 1 Nov 2017, Dave Hansen wrote:
> On 11/01/2017 02:18 PM, Thomas Gleixner wrote:
> > On Tue, 31 Oct 2017, Dave Hansen wrote:
> >> --- a/arch/x86/include/asm/pgtable_types.h~kaiser-prep-disable-global-pages	2017-10-31 15:03:49.314064402 -0700
> >> +++ b/arch/x86/include/asm/pgtable_types.h	2017-10-31 15:03:49.323064827 -0700
> >> @@ -47,7 +47,12 @@
> >>  #define _PAGE_ACCESSED	(_AT(pteval_t, 1) << _PAGE_BIT_ACCESSED)
> >>  #define _PAGE_DIRTY	(_AT(pteval_t, 1) << _PAGE_BIT_DIRTY)
> >>  #define _PAGE_PSE	(_AT(pteval_t, 1) << _PAGE_BIT_PSE)
> >> +#ifdef CONFIG_X86_GLOBAL_PAGES
> >>  #define _PAGE_GLOBAL	(_AT(pteval_t, 1) << _PAGE_BIT_GLOBAL)
> >> +#else
> >> +/* We must ensure that kernel TLBs are unusable while in userspace */
> >> +#define _PAGE_GLOBAL	(_AT(pteval_t, 0))
> >> +#endif
> > 
> > What you really want to do here is to clear PAGE_GLOBAL in the
> > supported_pte_mask. probe_page_size_mask() is the proper place for that.
> 
> How does something like this look?  I just remove _PAGE_GLOBAL from the
> default __PAGE_KERNEL permissions.

That should work, but how do you bring _PAGE_GLOBAL back when kaiser is
disabled at boot/runtime?

You might want to make __PAGE_KERNEL_GLOBAL a variable, but that might be
impossible for the early ASM stuff.

> I was a bit worried that if we pull _PAGE_GLOBAL out of
> __supported_pte_mask itself, we might not be able to use it for the
> shadow entries that map the entry/exit code like Linus suggested.

Hmm. Good point.  

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH 00/23] KAISER: unmap most of the kernel from userspace page tables
  2017-11-01 22:14   ` Dave Hansen
@ 2017-11-01 22:28     ` Linus Torvalds
  2017-11-02  8:03     ` Peter Zijlstra
  2017-11-03 11:07     ` Kirill A. Shutemov
  2 siblings, 0 replies; 102+ messages in thread
From: Linus Torvalds @ 2017-11-01 22:28 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Ingo Molnar, Linux Kernel Mailing List, linux-mm,
	Andy Lutomirski, Thomas Gleixner, Peter Zijlstra, H. Peter Anvin,
	Borislav Petkov, Brian Gerst, Denys Vlasenko, Josh Poimboeuf, Thomas Garnier,
	Kees Cook

On Wed, Nov 1, 2017 at 3:14 PM, Dave Hansen <dave.hansen@linux.intel.com> wrote:
>
> I ran some quick tests.  When CONFIG_KAISER=y, but "echo 0 >
> kaiser-enabled", the tests that I ran were within the noise vs. a
> vanilla kernel, and that's with *zero* optimization.

I guess the optimal version just ends up switching between two
different entrypoints for the on/off case.

And the not-quite-as-aggressive, but almost-optimal version would just
be a two-byte asm alternative with an unconditional branch to the
movcr3 code and back, and is turned into a noop when it's off.

But since 99%+ of the cost is going to be that cr3 write, even the
stupid "just load value and branch over the cr3 conditionally" is
going to make things hard to measure.
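
Something in this direction for the branch-over variant (sketch only; the
feature bit name is made up here):

	.macro SWITCH_TO_KERNEL_CR3_MAYBE scratch_reg:req
	/* 2-byte short jump when kaiser is off, NOPed out when it is on: */
	ALTERNATIVE "jmp .Lcr3_done_\@", "", X86_FEATURE_KAISER
	SWITCH_TO_KERNEL_CR3 scratch_reg=\scratch_reg
.Lcr3_done_\@:
	.endm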

                Linus

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH 04/23] x86, tlb: make CR4-based TLB flushes more robust
  2017-11-01 22:24     ` Dave Hansen
@ 2017-11-01 22:30       ` Thomas Gleixner
  0 siblings, 0 replies; 102+ messages in thread
From: Thomas Gleixner @ 2017-11-01 22:30 UTC (permalink / raw)
  To: Dave Hansen
  Cc: linux-kernel, linux-mm, moritz.lipp, daniel.gruss,
	michael.schwarz, luto, torvalds, keescook, hughd, x86


On Wed, 1 Nov 2017, Dave Hansen wrote:

> On 11/01/2017 02:25 PM, Thomas Gleixner wrote:
> >>  	cr4 = this_cpu_read(cpu_tlbstate.cr4);
> >> -	/* clear PGE */
> >> -	native_write_cr4(cr4 & ~X86_CR4_PGE);
> >> -	/* write old PGE again and flush TLBs */
> >> +	/*
> >> +	 * This function is only called on systems that support X86_CR4_PGE
> >> +	 * and where always set X86_CR4_PGE.  Warn if we are called without
> >> +	 * PGE set.
> >> +	 */
> >> +	WARN_ON_ONCE(!(cr4 & X86_CR4_PGE));
> > Because if CR4_PGE is not set, this warning triggers. So this defeats the
> > toggle mode you are implementing.
> 
> The warning is there because there is probably plenty of *other* stuff
> that breaks if we have X86_FEATURE_PGE=1, but CR4.PGE=0.
> 
> The point of this was to make this function do the right thing no matter
> what, but warn if it gets called in an unexpected way.

Fair enough. Can you please reflect that in the changelog ?

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH 02/23] x86, kaiser: do not set _PAGE_USER for init_mm page tables
  2017-11-01 22:20             ` Thomas Gleixner
@ 2017-11-01 22:45               ` Kees Cook
  2017-11-02  7:10               ` Andy Lutomirski
  1 sibling, 0 replies; 102+ messages in thread
From: Kees Cook @ 2017-11-01 22:45 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Linus Torvalds, Dave Hansen, Andy Lutomirski, linux-kernel,
	linux-mm, moritz.lipp, Daniel Gruss, michael.schwarz,
	Hugh Dickins, X86 ML

On Wed, Nov 1, 2017 at 3:20 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
> On Wed, 1 Nov 2017, Linus Torvalds wrote:
>> On Wed, Nov 1, 2017 at 2:52 PM, Dave Hansen <dave.hansen@linux.intel.com> wrote:
>> > On 11/01/2017 02:28 PM, Thomas Gleixner wrote:
>> >> On Wed, 1 Nov 2017, Andy Lutomirski wrote:
>> >>> The vsyscall page is _PAGE_USER and lives in init_mm via the fixmap.
>> >>
>> >> Groan, forgot about that abomination, but still there is no point in having
>> >> it marked PAGE_USER in the init_mm at all, kaiser or not.
>> >
>> > So shouldn't this patch effectively make the vsyscall page unusable?
>> > Any idea why that didn't show up in any of the x86 selftests?
>>
>> I actually think there may be two issues here:
>>
>>  - vsyscall isn't even used much - if any - any more
>
> Only legacy user space uses it.
>
>>  - the vsyscall emulation works fine without _PAGE_USER, since the
>> whole point is that we take a fault on it and then emulate.
>>
>> We do expose the vsyscall page read-only to user space in the
>> emulation case, but I'm not convinced that's even required.
>
> I don't see a reason why it needs to be mapped at all for emulation.
>
>> Nobody who configures KAISER enabled would possibly want to have the
>> actual native vsyscall page enabled. That would be an insane
>> combination.
>>
>> So the only possible difference would be a user mode program that
>> actually looks at the vsyscall page, which sounds unlikely to be an
>> issue.  It's legacy and not really used.
>
> Right, and we can either disable the NATIVE mode when KAISER is on or just
> rip the native mode out completely. Most distros have native mode disabled
> anyway, so you cannot even enable it on the kernel command line.
>
> I'm all for ripping it out or at least removing the config switch to enable
> native mode as a first step.

I would like to see NATIVE removed too.

-Kees

-- 
Kees Cook
Pixel Security

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH 01/23] x86, kaiser: prepare assembly for entry/exit CR3 switching
  2017-11-01 21:01   ` Thomas Gleixner
@ 2017-11-01 22:58     ` Dave Hansen
  0 siblings, 0 replies; 102+ messages in thread
From: Dave Hansen @ 2017-11-01 22:58 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-kernel, linux-mm, moritz.lipp, daniel.gruss,
	michael.schwarz, luto, torvalds, keescook, hughd, x86,
	Borislav Petkov

On 11/01/2017 02:01 PM, Thomas Gleixner wrote:
> On Tue, 31 Oct 2017, Dave Hansen wrote:
>>  
>> +	pushq	%rdi
>> +	SWITCH_TO_KERNEL_CR3 scratch_reg=%rdi
>> +	popq	%rdi
> 
> Can you please have a macro variant which does:
> 
>     SWITCH_TO_KERNEL_CR3_PUSH reg=%rdi
> 
> So the pushq/popq is inside the macro. This has two reasons:
> 
>    1) If KAISER=n the pointless pushq/popq go away
> 
>    2) We need a boottime switch for that stuff, so we better have all
>       related code in the various macros in order to patch it in/out.

After Boris's comments, these push/pops are totally unnecessary.  We
just delay the CR3 switch until after we've stashed off pt_regs and are
allowed to clobber things.

> Also, please wrap these macros in #ifdef KAISER right away and provide the
> stubs as well. It does not make sense to have them in patch 7 when patch 1
> introduces them.

Will do.

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH 02/23] x86, kaiser: do not set _PAGE_USER for init_mm page tables
  2017-11-01 21:28       ` Thomas Gleixner
  2017-11-01 21:52         ` Dave Hansen
@ 2017-11-02  7:07         ` Andy Lutomirski
  2017-11-02 11:21           ` Thomas Gleixner
  1 sibling, 1 reply; 102+ messages in thread
From: Andy Lutomirski @ 2017-11-02  7:07 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Andy Lutomirski, Dave Hansen, linux-kernel, linux-mm,
	moritz.lipp, Daniel Gruss, michael.schwarz, Linus Torvalds,
	Kees Cook, Hugh Dickins, X86 ML

On Wed, Nov 1, 2017 at 2:28 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
> On Wed, 1 Nov 2017, Andy Lutomirski wrote:
>
>> On Wed, Nov 1, 2017 at 2:11 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
>> > On Tue, 31 Oct 2017, Dave Hansen wrote:
>> >
>> >>
>> >> init_mm is for kernel-exclusive use.  If someone is allocating page
>> >> tables in it, do not set _PAGE_USER on them.  This ensures that
>> >> we do *not* set NX on these page tables in the KAISER code.
>> >
>> > This changelog is confusing at best.
>> >
>> > Why is this a kaiser issue? Nothing should ever create _PAGE_USER entries
>> > in init_mm, right?
>>
>> The vsyscall page is _PAGE_USER and lives in init_mm via the fixmap.
>
> Groan, forgot about that abomination, but still there is no point in having
> it marked PAGE_USER in the init_mm at all, kaiser or not.
>

How can it be PAGE_USER in user mms but not init_mm?  It's the same page table.

> Thanks,
>
>         tglx

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH 02/23] x86, kaiser: do not set _PAGE_USER for init_mm page tables
  2017-11-01 22:20             ` Thomas Gleixner
  2017-11-01 22:45               ` Kees Cook
@ 2017-11-02  7:10               ` Andy Lutomirski
  2017-11-02 11:33                 ` Thomas Gleixner
  1 sibling, 1 reply; 102+ messages in thread
From: Andy Lutomirski @ 2017-11-02  7:10 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Linus Torvalds, Dave Hansen, Andy Lutomirski, linux-kernel,
	linux-mm, moritz.lipp, Daniel Gruss, michael.schwarz, Kees Cook,
	Hugh Dickins, X86 ML

On Wed, Nov 1, 2017 at 3:20 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
> On Wed, 1 Nov 2017, Linus Torvalds wrote:
>> On Wed, Nov 1, 2017 at 2:52 PM, Dave Hansen <dave.hansen@linux.intel.com> wrote:
>> > On 11/01/2017 02:28 PM, Thomas Gleixner wrote:
>> >> On Wed, 1 Nov 2017, Andy Lutomirski wrote:
>> >>> The vsyscall page is _PAGE_USER and lives in init_mm via the fixmap.
>> >>
>> >> Groan, forgot about that abomination, but still there is no point in having
>> >> it marked PAGE_USER in the init_mm at all, kaiser or not.
>> >
>> > So shouldn't this patch effectively make the vsyscall page unusable?
>> > Any idea why that didn't show up in any of the x86 selftests?
>>
>> I actually think there may be two issues here:
>>
>>  - vsyscall isn't even used much - if any - any more
>
> Only legacy user space uses it.
>
>>  - the vsyscall emulation works fine without _PAGE_USER, since the
>> whole point is that we take a fault on it and then emulate.
>>
>> We do expose the vsyscall page read-only to user space in the
>> emulation case, but I'm not convinced that's even required.
>
> I don't see a reason why it needs to be mapped at all for emulation.

At least a couple years ago, the maintainers of some userspace tracing
tools complained very loudly about the early versions of the patches.
There are programs like pin (semi-open-source IIRC) that parse
instructions, make an instrumented copy, and run it.  This means that
the vsyscall page needs to contain text that is semantically
equivalent to what calling it actually does.

So yes, read access needs to work.  I should add a selftest for this.

This is needed in emulation mode as well as native mode, so removing
native mode is totally orthogonal.
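
The core of such a selftest would be something like this (sketch; a real
test would install a signal handler rather than just crashing on a fault):

	#include <stdio.h>
	#include <string.h>

	int main(void)
	{
		/* The legacy vsyscall page sits at a fixed address on x86_64. */
		const unsigned char *vsyscall =
				(const unsigned char *)0xffffffffff600000UL;
		unsigned char buf[16];

		/* Must not fault as long as the page stays readable: */
		memcpy(buf, vsyscall, sizeof(buf));
		printf("vsyscall page starts with 0x%02x 0x%02x\n",
		       buf[0], buf[1]);
		return 0;
	}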

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH 00/23] KAISER: unmap most of the kernel from userspace page tables
  2017-11-01 20:33               ` Andy Lutomirski
@ 2017-11-02  7:32                 ` Andy Lutomirski
  2017-11-02  7:54                   ` Andy Lutomirski
  0 siblings, 1 reply; 102+ messages in thread
From: Andy Lutomirski @ 2017-11-02  7:32 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Linus Torvalds, Dave Hansen, Linux Kernel Mailing List, linux-mm,
	Kees Cook, Hugh Dickins

On Wed, Nov 1, 2017 at 1:33 PM, Andy Lutomirski <luto@kernel.org> wrote:
> On Wed, Nov 1, 2017 at 12:05 PM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
>> On Wed, Nov 1, 2017 at 11:46 AM, Dave Hansen
>> <dave.hansen@linux.intel.com> wrote:
>>>
>>> The vmalloc()'d stacks definitely need the page table walk.
>>
>> Ugh, yes. Nasty.
>>
>> Andy at some point mentioned a per-cpu initial stack trampoline thing
>> for his exception patches, but I'm not sure he actually ever did that.
>>
>> Andy?
>
> I'm going to push it to kernel.org very shortly (like twenty minutes
> maybe).  Then the 0day bot can chew on it.  With the proposed LDT
> rework, we don't need to do any of the dynamic mapping stuff, I think.

FWIW, I pushed all but the actual stack switching part.  Something
broke in the rebase and it doesn't boot right now :(

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH 00/23] KAISER: unmap most of the kernel from userspace page tables
  2017-11-02  7:32                 ` Andy Lutomirski
@ 2017-11-02  7:54                   ` Andy Lutomirski
  0 siblings, 0 replies; 102+ messages in thread
From: Andy Lutomirski @ 2017-11-02  7:54 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Linus Torvalds, Dave Hansen, Linux Kernel Mailing List, linux-mm,
	Kees Cook, Hugh Dickins

On Thu, Nov 2, 2017 at 12:32 AM, Andy Lutomirski <luto@kernel.org> wrote:
> On Wed, Nov 1, 2017 at 1:33 PM, Andy Lutomirski <luto@kernel.org> wrote:
>> On Wed, Nov 1, 2017 at 12:05 PM, Linus Torvalds
>> <torvalds@linux-foundation.org> wrote:
>>> On Wed, Nov 1, 2017 at 11:46 AM, Dave Hansen
>>> <dave.hansen@linux.intel.com> wrote:
>>>>
>>>> The vmalloc()'d stacks definitely need the page table walk.
>>>
>>> Ugh, yes. Nasty.
>>>
>>> Andy at some point mentioned a per-cpu initial stack trampoline thing
>>> for his exception patches, but I'm not sure he actually ever did that.
>>>
>>> Andy?
>>
>> I'm going to push it to kernel.org very shortly (like twenty minutes
>> maybe).  Then the 0day bot can chew on it.  With the proposed LDT
>> rework, we don't need to do any of the dynamic mapping stuff, I think.
>
> FWIW, I pushed all but the actual stack switching part.  Something
> broke in the rebase and it doesn't boot right now :(

Okay, that was embarrassing.  The rebase error was, drumroll please, I
forgot one of the patches.  Sigh.

It's here:

https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/log/?h=x86/entry_consolidation

The last few patches are terminally ugly.  I'll clean them up shortly
and email them out.  That being said, unless there's a showstopper
bug, this should be a fine base for Dave's development.

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH 00/23] KAISER: unmap most of the kernel from userspace page tables
  2017-11-01 22:14   ` Dave Hansen
  2017-11-01 22:28     ` Linus Torvalds
@ 2017-11-02  8:03     ` Peter Zijlstra
  2017-11-03 11:07     ` Kirill A. Shutemov
  2 siblings, 0 replies; 102+ messages in thread
From: Peter Zijlstra @ 2017-11-02  8:03 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Ingo Molnar, linux-kernel, linux-mm, Andy Lutomirski,
	Linus Torvalds, Thomas Gleixner, H. Peter Anvin,
	Borislav Petkov, Brian Gerst, Denys Vlasenko, Josh Poimboeuf, Thomas Garnier,
	Kees Cook

On Wed, Nov 01, 2017 at 03:14:11PM -0700, Dave Hansen wrote:
> On 11/01/2017 01:54 AM, Ingo Molnar wrote:
> > Beyond the inevitable cavalcade of (solvable) problems that will pop up during 
> > review, one major item I'd like to see addressed is runtime configurability: it 
> > should be possible to switch between a CR3-flushing and a regular syscall and page 
> > table model on the admin level, without restarting the kernel and apps. Distros 
> > really, really don't want to double the number of kernel variants they have.
> > 
> > The 'Kaiser off' runtime switch doesn't have to be as efficient as 
> > CONFIG_KAISER=n, at least initially, but at minimum it should avoid the most 
> > expensive page table switching paths in the syscall entry codepaths.
> 
> Due to popular demand, I went and implemented this today.  It's not the
> prettiest code I ever wrote, but it's pretty small.
> 
> Just in case anyone wants to play with it, I threw a snapshot of it up here:
> 
> > https://git.kernel.org/pub/scm/linux/kernel/git/daveh/x86-kaiser.git/log/?h=kaiser-dynamic-414rc6-20171101
> 
> I ran some quick tests.  When CONFIG_KAISER=y, but "echo 0 >
> kaiser-enabled", the tests that I ran were within the noise vs. a
> vanilla kernel, and that's with *zero* optimization.

I resent that you don't think the NMI is performance critical ;-)
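
As a reference point, here is a minimal sketch of what a runtime knob like
the "kaiser-enabled" file mentioned above could look like; the debugfs
wiring is illustrative only, not necessarily how the snapshot branch does
it:

#include <linux/cache.h>
#include <linux/debugfs.h>
#include <linux/init.h>

/* Checked on the entry/exit paths: 1 = switch CR3, 0 = leave it alone. */
u32 kaiser_enabled __read_mostly = 1;

static int __init kaiser_debugfs_init(void)
{
	/* exposes /sys/kernel/debug/kaiser-enabled */
	debugfs_create_u32("kaiser-enabled", 0644, NULL, &kaiser_enabled);
	return 0;
}
late_initcall(kaiser_debugfs_init);

In practice the flag alone is not enough; the entry assembly also has to
honour it (or be patched at runtime), which is where the real complexity,
including the NMI concern above, comes in.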

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH 02/23] x86, kaiser: do not set _PAGE_USER for init_mm page tables
  2017-11-02  7:07         ` Andy Lutomirski
@ 2017-11-02 11:21           ` Thomas Gleixner
  0 siblings, 0 replies; 102+ messages in thread
From: Thomas Gleixner @ 2017-11-02 11:21 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Dave Hansen, linux-kernel, linux-mm, moritz.lipp, Daniel Gruss,
	michael.schwarz, Linus Torvalds, Kees Cook, Hugh Dickins, X86 ML

On Thu, 2 Nov 2017, Andy Lutomirski wrote:
> On Wed, Nov 1, 2017 at 2:28 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
> > On Wed, 1 Nov 2017, Andy Lutomirski wrote:
> >
> >> On Wed, Nov 1, 2017 at 2:11 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
> >> > On Tue, 31 Oct 2017, Dave Hansen wrote:
> >> >
> >> >>
> >> >> init_mm is for kernel-exclusive use.  If someone is allocating page
> >> >> tables in it, do not set _PAGE_USER on them.  This ensures that
> >> >> we do *not* set NX on these page tables in the KAISER code.
> >> >
> >> > This changelog is confusing at best.
> >> >
> >> > Why is this a kaiser issue? Nothing should ever create _PAGE_USER entries
> >> > in init_mm, right?
> >>
> >> The vsyscall page is _PAGE_USER and lives in init_mm via the fixmap.
> >
> > Groan, forgot about that abomination, but still there is no point in having
> > it marked PAGE_USER in the init_mm at all, kaiser or not.
> >
> 
> How can it be PAGE_USER in user mms but not init_mm?  It's the same page table.

Right you are. Brain was already shut down, it seems.

Thanks,

	tglx
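
To make the quoted changelog's rule concrete: the idea is to never use the
_PAGE_USER-bearing _PAGE_TABLE flags when populating init_mm.  The helper
below is a made-up name used purely for illustration, not code from the
series:

/*
 * init_mm page tables are kernel-only: populate them with _KERNPG_TABLE
 * (no _PAGE_USER) so the KAISER code never treats them as user-visible
 * entries and never sets NX on them.
 */
static inline pgdval_t pgd_populate_flags(struct mm_struct *mm)
{
	return (mm == &init_mm) ? _KERNPG_TABLE : _PAGE_TABLE;
}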

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH 02/23] x86, kaiser: do not set _PAGE_USER for init_mm page tables
  2017-11-02  7:10               ` Andy Lutomirski
@ 2017-11-02 11:33                 ` Thomas Gleixner
  2017-11-02 11:59                   ` Andy Lutomirski
  2017-11-02 16:38                   ` Dave Hansen
  0 siblings, 2 replies; 102+ messages in thread
From: Thomas Gleixner @ 2017-11-02 11:33 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Linus Torvalds, Dave Hansen, linux-kernel, linux-mm, moritz.lipp,
	Daniel Gruss, michael.schwarz, Kees Cook, Hugh Dickins, X86 ML

On Thu, 2 Nov 2017, Andy Lutomirski wrote:
> On Wed, Nov 1, 2017 at 3:20 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
> > On Wed, 1 Nov 2017, Linus Torvalds wrote:
> >> On Wed, Nov 1, 2017 at 2:52 PM, Dave Hansen <dave.hansen@linux.intel.com> wrote:
> >> > On 11/01/2017 02:28 PM, Thomas Gleixner wrote:
> >> >> On Wed, 1 Nov 2017, Andy Lutomirski wrote:
> >> >>> The vsyscall page is _PAGE_USER and lives in init_mm via the fixmap.
> >> >>
> >> >> Groan, forgot about that abomination, but still there is no point in having
> >> >> it marked PAGE_USER in the init_mm at all, kaiser or not.
> >> >
> >> > So shouldn't this patch effectively make the vsyscall page unusable?
> >> > Any idea why that didn't show up in any of the x86 selftests?
> >>
> >> I actually think there may be two issues here:
> >>
> >>  - vsyscall isn't even used much - if any - any more
> >
> > Only legacy user space uses it.
> >
> >>  - the vsyscall emulation works fine without _PAGE_USER, since the
> >> whole point is that we take a fault on it and then emulate.
> >>
> >> We do expose the vsyscall page read-only to user space in the
> >> emulation case, but I'm not convinced that's even required.
> >
> > I don't see a reason why it needs to be mapped at all for emulation.
> 
> At least a couple of years ago, the maintainers of some userspace tracing
> tools complained very loudly about the early versions of the patches.
> There are programs like pin (semi-open-source IIRC) that parse
> instructions, make an instrumented copy, and run it.  This means that
> the vsyscall page needs to contain text that is semantically
> equivalent to what calling it actually does.
> 
> So yes, read access needs to work.  I should add a selftest for this.
> 
> This is needed in emulation mode as well as native mode, so removing
> native mode is totally orthogonal.

Fair enough. I enabled function tracing with emulate_vsyscall as the filter
on a couple of machines and so far I have no hit at all. Though I found a
VM with a real old user space (~2005) and that actually used it.

So for the problem at hand, I'd suggest we disable the vsyscall stuff if
CONFIG_KAISER=y and be done with it.

Thanks,

	tglx
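
A rough sketch of the kind of selftest mentioned above, checking whether
the vsyscall page is still readable from userspace.  This is a stand-alone
illustration, not the actual tools/testing/selftests code:

#include <setjmp.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>

static sigjmp_buf jb;

static void on_sigsegv(int sig)
{
	siglongjmp(jb, 1);
}

int main(void)
{
	/* the legacy vsyscall page lives at a fixed address */
	const void *vsyscall = (const void *)0xffffffffff600000UL;
	unsigned char buf[16];

	signal(SIGSEGV, on_sigsegv);
	if (sigsetjmp(jb, 1)) {
		printf("vsyscall page is not readable\n");
		return 1;
	}
	memcpy(buf, vsyscall, sizeof(buf));
	printf("vsyscall page is readable\n");
	return 0;
}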

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH 02/23] x86, kaiser: do not set _PAGE_USER for init_mm page tables
  2017-11-02 11:33                 ` Thomas Gleixner
@ 2017-11-02 11:59                   ` Andy Lutomirski
  2017-11-02 12:56                     ` Thomas Gleixner
  2017-11-02 16:38                   ` Dave Hansen
  1 sibling, 1 reply; 102+ messages in thread
From: Andy Lutomirski @ 2017-11-02 11:59 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Andy Lutomirski, Linus Torvalds, Dave Hansen, linux-kernel,
	linux-mm, moritz.lipp, Daniel Gruss, michael.schwarz, Kees Cook,
	Hugh Dickins, X86 ML



> On Nov 2, 2017, at 12:33 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
> 
>> On Thu, 2 Nov 2017, Andy Lutomirski wrote:
>>> On Wed, Nov 1, 2017 at 3:20 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
>>>> On Wed, 1 Nov 2017, Linus Torvalds wrote:
>>>>> On Wed, Nov 1, 2017 at 2:52 PM, Dave Hansen <dave.hansen@linux.intel.com> wrote:
>>>>>> On 11/01/2017 02:28 PM, Thomas Gleixner wrote:
>>>>>>> On Wed, 1 Nov 2017, Andy Lutomirski wrote:
>>>>>>> The vsyscall page is _PAGE_USER and lives in init_mm via the fixmap.
>>>>>> 
>>>>>> Groan, forgot about that abomination, but still there is no point in having
>>>>>> it marked PAGE_USER in the init_mm at all, kaiser or not.
>>>>> 
>>>>> So shouldn't this patch effectively make the vsyscall page unusable?
>>>>> Any idea why that didn't show up in any of the x86 selftests?
>>>> 
>>>> I actually think there may be two issues here:
>>>> 
>>>> - vsyscall isn't even used much - if any - any more
>>> 
>>> Only legacy user space uses it.
>>> 
>>>> - the vsyscall emulation works fine without _PAGE_USER, since the
>>>> whole point is that we take a fault on it and then emulate.
>>>> 
>>>> We do expose the vsyscall page read-only to user space in the
>>>> emulation case, but I'm not convinced that's even required.
>>> 
>>> I don't see a reason why it needs to be mapped at all for emulation.
>> 
>> At least a couple of years ago, the maintainers of some userspace tracing
>> tools complained very loudly about the early versions of the patches.
>> There are programs like pin (semi-open-source IIRC) that parse
>> instructions, make an instrumented copy, and run it.  This means that
>> the vsyscall page needs to contain text that is semantically
>> equivalent to what calling it actually does.
>> 
>> So yes, read access needs to work.  I should add a selftest for this.
>> 
>> This is needed in emulation mode as well as native mode, so removing
>> native mode is totally orthogonal.
> 
> Fair enough. I enabled function tracing with emulate_vsyscall as the filter
> on a couple of machines and so far I have no hit at all. Though I found a
> VM with a real old user space (~2005) and that actually used it.
> 
> So for the problem at hand, I'd suggest we disable the vsyscall stuff if
> CONFIG_KAISER=y and be done with it.

I think that time() on not-so-old glibc uses it.  Even more recent versions of Go use it. :(

> 
> Thanks,
> 
>    tglx

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH 02/23] x86, kaiser: do not set _PAGE_USER for init_mm page tables
  2017-11-02 11:59                   ` Andy Lutomirski
@ 2017-11-02 12:56                     ` Thomas Gleixner
  0 siblings, 0 replies; 102+ messages in thread
From: Thomas Gleixner @ 2017-11-02 12:56 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andy Lutomirski, Linus Torvalds, Dave Hansen, linux-kernel,
	linux-mm, moritz.lipp, Daniel Gruss, michael.schwarz, Kees Cook,
	Hugh Dickins, X86 ML

On Thu, 2 Nov 2017, Andy Lutomirski wrote:
> > On Nov 2, 2017, at 12:33 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
> > Fair enough. I enabled function tracing with emulate_vsyscall as the filter
> > on a couple of machines and so far I have no hit at all. Though I found a
> > VM with a real old user space (~2005) and that actually used it.
> > 
> > So for the problem at hand, I'd suggest we disable the vsyscall stuff if
> > CONFIG_KAISER=y and be done with it.
> 
> I think that time() on not-so-old glibc uses it.

Sigh.

> Even more recent versions of Go use it. :(

Groan. The VDSO has been there since 2007, and the first usable version of Go
was released in 2012...

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH 02/23] x86, kaiser: do not set _PAGE_USER for init_mm page tables
  2017-11-02 11:33                 ` Thomas Gleixner
  2017-11-02 11:59                   ` Andy Lutomirski
@ 2017-11-02 16:38                   ` Dave Hansen
  2017-11-02 18:19                     ` Andy Lutomirski
  1 sibling, 1 reply; 102+ messages in thread
From: Dave Hansen @ 2017-11-02 16:38 UTC (permalink / raw)
  To: Thomas Gleixner, Andy Lutomirski
  Cc: Linus Torvalds, linux-kernel, linux-mm, moritz.lipp,
	Daniel Gruss, michael.schwarz, Kees Cook, Hugh Dickins, X86 ML

On 11/02/2017 04:33 AM, Thomas Gleixner wrote:
> So for the problem at hand, I'd suggest we disable the vsyscall stuff if
> CONFIG_KAISER=y and be done with it.

Just to be clear, are we suggesting to just disable
LEGACY_VSYSCALL_NATIVE if KAISER=y, and allow LEGACY_VSYSCALL_EMULATE?
Or, do we just force LEGACY_VSYSCALL_NONE=y?

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH 02/23] x86, kaiser: do not set _PAGE_USER for init_mm page tables
  2017-11-02 16:38                   ` Dave Hansen
@ 2017-11-02 18:19                     ` Andy Lutomirski
  2017-11-02 18:24                       ` Thomas Gleixner
  2017-11-02 18:24                       ` Linus Torvalds
  0 siblings, 2 replies; 102+ messages in thread
From: Andy Lutomirski @ 2017-11-02 18:19 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Thomas Gleixner, Andy Lutomirski, Linus Torvalds, linux-kernel,
	linux-mm, moritz.lipp, Daniel Gruss, michael.schwarz, Kees Cook,
	Hugh Dickins, X86 ML



> On Nov 2, 2017, at 5:38 PM, Dave Hansen <dave.hansen@linux.intel.com> wrote:
> 
>> On 11/02/2017 04:33 AM, Thomas Gleixner wrote:
>> So for the problem at hand, I'd suggest we disable the vsyscall stuff if
>> CONFIG_KAISER=y and be done with it.
> 
> Just to be clear, are we suggesting to just disable
> LEGACY_VSYSCALL_NATIVE if KAISER=y, and allow LEGACY_VSYSCALL_EMULATE?
> Or, do we just force LEGACY_VSYSCALL_NONE=y?

We'd have to force NONE, and Linus won't like it.

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH 02/23] x86, kaiser: do not set _PAGE_USER for init_mm page tables
  2017-11-02 18:19                     ` Andy Lutomirski
@ 2017-11-02 18:24                       ` Thomas Gleixner
  2017-11-02 18:24                       ` Linus Torvalds
  1 sibling, 0 replies; 102+ messages in thread
From: Thomas Gleixner @ 2017-11-02 18:24 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Dave Hansen, Andy Lutomirski, Linus Torvalds, linux-kernel,
	linux-mm, moritz.lipp, Daniel Gruss, michael.schwarz, Kees Cook,
	Hugh Dickins, X86 ML

On Thu, 2 Nov 2017, Andy Lutomirski wrote:

> > On Nov 2, 2017, at 5:38 PM, Dave Hansen <dave.hansen@linux.intel.com> wrote:
> > 
> >> On 11/02/2017 04:33 AM, Thomas Gleixner wrote:
> >> So for the problem at hand, I'd suggest we disable the vsyscall stuff if
> >> CONFIG_KAISER=y and be done with it.
> > 
> > Just to be clear, are we suggesting to just disable
> > LEGACY_VSYSCALL_NATIVE if KAISER=y, and allow LEGACY_VSYSCALL_EMULATE?
> > Or, do we just force LEGACY_VSYSCALL_NONE=y?
> 
> We'd have to force NONE, and Linus won't like it.

Much as I hate it, I have already grudgingly accepted that we have to keep it
alive in some way or other.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH 02/23] x86, kaiser: do not set _PAGE_USER for init_mm page tables
  2017-11-02 18:19                     ` Andy Lutomirski
  2017-11-02 18:24                       ` Thomas Gleixner
@ 2017-11-02 18:24                       ` Linus Torvalds
  2017-11-02 18:40                         ` Thomas Gleixner
  1 sibling, 1 reply; 102+ messages in thread
From: Linus Torvalds @ 2017-11-02 18:24 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Dave Hansen, Thomas Gleixner, Andy Lutomirski, linux-kernel,
	linux-mm, moritz.lipp, Daniel Gruss, michael.schwarz, Kees Cook,
	Hugh Dickins, X86 ML

On Thu, Nov 2, 2017 at 11:19 AM, Andy Lutomirski <luto@amacapital.net> wrote:
>
> We'd have to force NONE, and Linus won't like it.

Oh, I think it's fine for the kaiser case.

I am not convinced anybody will actually use it, but if you do use it,
I suspect "the legacy vsyscall page no longer works" is the least of
your worries.

That said, I think you can keep emulation, and just make it
unreadable. That will keep legacy binaries still working, and will
break a much smaller subset. So we have four cases:

 - native
 - read-only emulation
 - unreadable emulation
 - none

and kaiser triggering that unreadable case sounds like the option
least likely to cause trouble. vsyscalls still work, anybody who tries
to trace them and look at the code will not.

              Linus
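
The four cases map naturally onto a mode enum.  The sketch below is only a
naming exercise; at this point the kernel's vsyscall code knows NATIVE,
EMULATE and NONE, and the unreadable-emulation variant would be new:

enum vsyscall_mode {
	VSYSCALL_NATIVE,	 /* page executable, calls run directly        */
	VSYSCALL_EMULATE,	 /* page readable, calls trap and get emulated */
	VSYSCALL_EMULATE_NOREAD, /* hypothetical: emulated, page not readable  */
	VSYSCALL_NONE,		 /* page not mapped at all                     */
};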

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH 02/23] x86, kaiser: do not set _PAGE_USER for init_mm page tables
  2017-11-02 18:24                       ` Linus Torvalds
@ 2017-11-02 18:40                         ` Thomas Gleixner
  2017-11-02 18:57                           ` Linus Torvalds
  0 siblings, 1 reply; 102+ messages in thread
From: Thomas Gleixner @ 2017-11-02 18:40 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andy Lutomirski, Dave Hansen, Andy Lutomirski, linux-kernel,
	linux-mm, moritz.lipp, Daniel Gruss, michael.schwarz, Kees Cook,
	Hugh Dickins, X86 ML

On Thu, 2 Nov 2017, Linus Torvalds wrote:

> On Thu, Nov 2, 2017 at 11:19 AM, Andy Lutomirski <luto@amacapital.net> wrote:
> >
> > We'd have to force NONE, and Linus won't like it.
> 
> Oh, I think it's fine for the kaiser case.
> 
> I am not convinced anybody will actually use it, but if you do use it,
> I suspect "the legacy vsyscall page no longer works" is the least of
> your worries.
> 
> That said, I think you can keep emulation, and just make it
> unreadable. That will keep legacy binaries still working, and will
> break a much smaller subset. So we have four cases:
> 
>  - native
>  - read-only emulation
>  - unreadable emulation
>  - none
> 
> and kaiser triggering that unreadable case sounds like the option
> least likely to cause trouble. vsyscalls still work, anybody who tries
> to trace them and look at the code will not.

Hmm. Not sure. IIRC you need to be able to read it to figure out where the
entry points are. They are at fixed offsets, but there is some voodoo out
there which reads the 'elf' to get to them.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH 02/23] x86, kaiser: do not set _PAGE_USER for init_mm page tables
  2017-11-02 18:40                         ` Thomas Gleixner
@ 2017-11-02 18:57                           ` Linus Torvalds
  2017-11-02 21:41                             ` Thomas Gleixner
  0 siblings, 1 reply; 102+ messages in thread
From: Linus Torvalds @ 2017-11-02 18:57 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Andy Lutomirski, Dave Hansen, Andy Lutomirski, linux-kernel,
	linux-mm, moritz.lipp, Daniel Gruss, michael.schwarz, Kees Cook,
	Hugh Dickins, X86 ML

On Thu, Nov 2, 2017 at 11:40 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
>
> Hmm. Not sure. IIRC you need to be able to read it to figure out where the
> entry points are. They are at fixed offsets, but there is some voodoo out
> there which reads the 'elf' to get to them.

That would actually be really painful.

But I *think* you're confusing it with the vdso case, which really
does do that whole "generate ELF information for debuggers and dynamic
linkers" thing. The vsyscall page never did that afaik, and purely
relied on fixed addresses.

              Linus

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH 00/23] KAISER: unmap most of the kernel from userspace page tables
  2017-10-31 22:31 [PATCH 00/23] KAISER: unmap most of the kernel from userspace page tables Dave Hansen
                   ` (24 preceding siblings ...)
  2017-11-01  8:54 ` Ingo Molnar
@ 2017-11-02 19:01 ` Will Deacon
  2017-11-02 19:38   ` Dave Hansen
  2017-11-22 16:19 ` Pavel Machek
  26 siblings, 1 reply; 102+ messages in thread
From: Will Deacon @ 2017-11-02 19:01 UTC (permalink / raw)
  To: Dave Hansen; +Cc: linux-kernel, linux-mm, linux-arm-kernel

Hi Dave,

[+linux-arm-kernel]

On Tue, Oct 31, 2017 at 03:31:46PM -0700, Dave Hansen wrote:
> KAISER makes it harder to defeat KASLR, but makes syscalls and
> interrupts slower.  These patches are based on work from a team at
> Graz University of Technology posted here[1].  The major addition is
> support for Intel PCIDs which builds on top of Andy Lutomorski's PCID
> work merged for 4.14.  PCIDs make KAISER's overhead very reasonable
> for a wide variety of use cases.

I just wanted to say that I've got a version of this up and running for
arm64. I'm still ironing out a few small details, but I hope to post it
after the merge window. We always use ASIDs, and the perf impact looks
like it aligns roughly with your findings for a PCID-enabled x86 system.

Cheers,

Will

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH 00/23] KAISER: unmap most of the kernel from userspace page tables
  2017-11-02 19:01 ` Will Deacon
@ 2017-11-02 19:38   ` Dave Hansen
  2017-11-03 13:41     ` Will Deacon
  0 siblings, 1 reply; 102+ messages in thread
From: Dave Hansen @ 2017-11-02 19:38 UTC (permalink / raw)
  To: Will Deacon; +Cc: linux-kernel, linux-mm, linux-arm-kernel

On 11/02/2017 12:01 PM, Will Deacon wrote:
> On Tue, Oct 31, 2017 at 03:31:46PM -0700, Dave Hansen wrote:
>> KAISER makes it harder to defeat KASLR, but makes syscalls and
>> interrupts slower.  These patches are based on work from a team at
>> Graz University of Technology posted here[1].  The major addition is
>> support for Intel PCIDs which builds on top of Andy Lutomorski's PCID
>> work merged for 4.14.  PCIDs make KAISER's overhead very reasonable
>> for a wide variety of use cases.
> I just wanted to say that I've got a version of this up and running for
> arm64. I'm still ironing out a few small details, but I hope to post it
> after the merge window. We always use ASIDs, and the perf impact looks
> like it aligns roughly with your findings for a PCID-enabled x86 system.

Welcome to the party!

I don't know if you've found anything different, but there's been woefully
little code that's really cross-architecture.  The kernel task
stack-mapping stuff _was_, but it's going away.  The per-cpu-user-mapped
section stuff might be common, I guess.

Is there any other common infrastructure that we can or should be sharing?
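
For reference on the per-cpu-user-mapped section mentioned above: the idea
is to group entry-critical per-cpu data into one linker section that stays
mapped while the shadow page tables are active.  The sketch below follows
the series' "..user_mapped" naming from memory, so treat the exact macro
names, and the example variable, as illustrative:

/* Place a per-cpu variable into a section that remains mapped for the
 * entry code even when the user (shadow) page tables are loaded. */
#define DECLARE_PER_CPU_USER_MAPPED(type, name)			\
	DECLARE_PER_CPU_SECTION(type, name, "..user_mapped")

#define DEFINE_PER_CPU_USER_MAPPED(type, name)			\
	DEFINE_PER_CPU_SECTION(type, name, "..user_mapped")

/* Example only: scratch space the entry trampoline might need. */
DEFINE_PER_CPU_USER_MAPPED(unsigned long, entry_scratch);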

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH 02/23] x86, kaiser: do not set _PAGE_USER for init_mm page tables
  2017-11-02 18:57                           ` Linus Torvalds
@ 2017-11-02 21:41                             ` Thomas Gleixner
  0 siblings, 0 replies; 102+ messages in thread
From: Thomas Gleixner @ 2017-11-02 21:41 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andy Lutomirski, Dave Hansen, Andy Lutomirski, linux-kernel,
	linux-mm, moritz.lipp, Daniel Gruss, michael.schwarz, Kees Cook,
	Hugh Dickins, X86 ML


On Thu, 2 Nov 2017, Linus Torvalds wrote:

> On Thu, Nov 2, 2017 at 11:40 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> >
> > Hmm. Not sure. IIRC you need to be able to read it to figure out where the
> > entry points are. They are at fixed offsets, but there is some voodoo out
> > there which reads the 'elf' to get to them.
> 
> That would actually be really painful.
> 
> But I *think* you're confusing it with the vdso case, which really
> does do that whole "generate ELF information for debuggers and dynamic
> linkers" thing. The vsyscall page never did that afaik, and purely
> relied on fixed addresses.

Yes, I managed to confuse myself. The vsyscall page has only the fixed-offset
entry points, at least when it's in emulation mode.

Thanks,

	tglx
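
For context on the "fixed offsets": the legacy vsyscall page sits at a
fixed virtual address with its three entry points at fixed 1KB strides,
which is why no ELF parsing is needed to find them.  The address and
strides are long-standing ABI constants; the macro names below are just
for this sketch:

#define VSYSCALL_ADDR	0xffffffffff600000UL
#define VSYSCALL_GTOD	(VSYSCALL_ADDR + 0x000)	/* gettimeofday() */
#define VSYSCALL_TIME	(VSYSCALL_ADDR + 0x400)	/* time()         */
#define VSYSCALL_GETCPU	(VSYSCALL_ADDR + 0x800)	/* getcpu()       */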

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH 00/23] KAISER: unmap most of the kernel from userspace page tables
  2017-11-01 22:14   ` Dave Hansen
  2017-11-01 22:28     ` Linus Torvalds
  2017-11-02  8:03     ` Peter Zijlstra
@ 2017-11-03 11:07     ` Kirill A. Shutemov
  2 siblings, 0 replies; 102+ messages in thread
From: Kirill A. Shutemov @ 2017-11-03 11:07 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Ingo Molnar, linux-kernel, linux-mm, Andy Lutomirski,
	Linus Torvalds, Thomas Gleixner, Peter Zijlstra, H. Peter Anvin,
	Borislav Petkov, Brian Gerst, Denys Vlasenko, Josh Poimboeuf, Thomas Garnier,
	Kees Cook

On Wed, Nov 01, 2017 at 03:14:11PM -0700, Dave Hansen wrote:
> On 11/01/2017 01:54 AM, Ingo Molnar wrote:
> > Beyond the inevitable cavalcade of (solvable) problems that will pop up during 
> > review, one major item I'd like to see addressed is runtime configurability: it 
> > should be possible to switch between a CR3-flushing and a regular syscall and page 
> > table model on the admin level, without restarting the kernel and apps. Distros 
> > really, really don't want to double the number of kernel variants they have.
> > 
> > The 'Kaiser off' runtime switch doesn't have to be as efficient as 
> > CONFIG_KAISER=n, at least initially, but at minimum it should avoid the most 
> > expensive page table switching paths in the syscall entry codepaths.
> 
> Due to popular demand, I went and implemented this today.  It's not the
> prettiest code I ever wrote, but it's pretty small.
> 
> Just in case anyone wants to play with it, I threw a snapshot of it up here:
> 
> > https://git.kernel.org/pub/scm/linux/kernel/git/daveh/x86-kaiser.git/log/?h=kaiser-dynamic-414rc6-20171101
> 
> I ran some quick tests.  When CONFIG_KAISER=y, but "echo 0 >
> kaiser-enabled", the tests that I ran were within the noise vs. a
> vanilla kernel, and that's with *zero* optimization.

It doesn't compile with KASLR enabled :P

Fixup:

diff --git a/arch/x86/boot/compressed/pagetable.c b/arch/x86/boot/compressed/pagetable.c
index f1aa43854bed..7be5fdd77a3f 100644
--- a/arch/x86/boot/compressed/pagetable.c
+++ b/arch/x86/boot/compressed/pagetable.c
@@ -35,6 +35,10 @@
 /* Used by pgtable.h asm code to force instruction serialization. */
 unsigned long __force_order;
 
+#ifdef CONFIG_KAISER
+int kaiser_enabled = 1;
+#endif
+
 /* Used to track our page table allocation area. */
 struct alloc_pgt_data {
 	unsigned char *pgt_buf;
-- 
 Kirill A. Shutemov

^ permalink raw reply related	[flat|nested] 102+ messages in thread

* Re: [PATCH 00/23] KAISER: unmap most of the kernel from userspace page tables
  2017-11-02 19:38   ` Dave Hansen
@ 2017-11-03 13:41     ` Will Deacon
  0 siblings, 0 replies; 102+ messages in thread
From: Will Deacon @ 2017-11-03 13:41 UTC (permalink / raw)
  To: Dave Hansen; +Cc: linux-kernel, linux-mm, linux-arm-kernel

On Thu, Nov 02, 2017 at 12:38:05PM -0700, Dave Hansen wrote:
> On 11/02/2017 12:01 PM, Will Deacon wrote:
> > On Tue, Oct 31, 2017 at 03:31:46PM -0700, Dave Hansen wrote:
> >> KAISER makes it harder to defeat KASLR, but makes syscalls and
> >> interrupts slower.  These patches are based on work from a team at
> >> Graz University of Technology posted here[1].  The major addition is
> >> support for Intel PCIDs which builds on top of Andy Lutomorski's PCID
> >> work merged for 4.14.  PCIDs make KAISER's overhead very reasonable
> >> for a wide variety of use cases.
> > I just wanted to say that I've got a version of this up and running for
> > arm64. I'm still ironing out a few small details, but I hope to post it
> > after the merge window. We always use ASIDs, and the perf impact looks
> > like it aligns roughly with your findings for a PCID-enabled x86 system.
> 
> Welcome to the party!
> 
> I don't know if you've found anything different, but there's been woefully
> little code that's really cross-architecture.  The kernel task
> stack-mapping stuff _was_, but it's going away.  The per-cpu-user-mapped
> section stuff might be common, I guess.

I currently don't have anything mapped other than the trampoline page, so
I haven't had to do per-cpu stuff (yet). This will interfere with perf
tracing using SPE, but if that's the only thing that needs it then it's
a hard sell, I think.

> Is there any other common infrastructure that we can or should be sharing?

I really can't see anything. My changes are broadly divided into:

  * Page table setup
  * Exception entry/exit via trampoline
  * User access (e.g. get_user)
  * TLB invalidation
  * Context switch (backend of switch_mm)

which is all deeply arch-specific.

Will

^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH 00/23] KAISER: unmap most of the kernel from userspace page tables
  2017-10-31 22:31 [PATCH 00/23] KAISER: unmap most of the kernel from userspace page tables Dave Hansen
                   ` (25 preceding siblings ...)
  2017-11-02 19:01 ` Will Deacon
@ 2017-11-22 16:19 ` Pavel Machek
  2017-11-23 10:47   ` Pavel Machek
  26 siblings, 1 reply; 102+ messages in thread
From: Pavel Machek @ 2017-11-22 16:19 UTC (permalink / raw)
  To: Dave Hansen; +Cc: linux-kernel, linux-mm


Hi!

> KAISER makes it harder to defeat KASLR, but makes syscalls and
> interrupts slower.  These patches are based on work from a team at
> Graz University of Technology posted here[1].  The major addition is
> support for Intel PCIDs which builds on top of Andy Lutomorski's PCID
> work merged for 4.14.  PCIDs make KAISER's overhead very reasonable
> for a wide variety of use cases.

Is it useful?

> Full Description:
> 
> KAISER is a countermeasure against attacks on kernel address
> information.  There are at least three existing, published,
> approaches using the shared user/kernel mapping and hardware features
> to defeat KASLR.  One approach referenced in the paper locates the
> kernel by observing differences in page fault timing between
> present-but-inaccessable kernel pages and non-present pages.

I mean... evil userspace will still be able to determine the kernel's
location using cache aliasing effects, right?
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


^ permalink raw reply	[flat|nested] 102+ messages in thread

* Re: [PATCH 00/23] KAISER: unmap most of the kernel from userspace page tables
  2017-11-22 16:19 ` Pavel Machek
@ 2017-11-23 10:47   ` Pavel Machek
  0 siblings, 0 replies; 102+ messages in thread
From: Pavel Machek @ 2017-11-23 10:47 UTC (permalink / raw)
  To: Dave Hansen; +Cc: linux-kernel, linux-mm


On Wed 2017-11-22 17:19:07, Pavel Machek wrote:
> Hi!
> 
> > KAISER makes it harder to defeat KASLR, but makes syscalls and
> > interrupts slower.  These patches are based on work from a team at
> > Graz University of Technology posted here[1].  The major addition is
> > support for Intel PCIDs which builds on top of Andy Lutomorski's PCID
> > work merged for 4.14.  PCIDs make KAISER's overhead very reasonable
> > for a wide variety of use cases.
> 
> Is it useful?
> 
> > Full Description:
> > 
> > KAISER is a countermeasure against attacks on kernel address
> > information.  There are at least three existing, published,
> > approaches using the shared user/kernel mapping and hardware features
> > to defeat KASLR.  One approach referenced in the paper locates the
> > kernel by observing differences in page fault timing between
> > present-but-inaccessable kernel pages and non-present pages.
> 
> I mean... evil userspace will still be able to determine the kernel's
> location using cache aliasing effects, right?

Issues with AnC attacks are tracked via several CVE identifiers.

CVE-2017-5925 is assigned to track the developments for Intel processors
CVE-2017-5926 is assigned to track the developments for AMD processors

									Pavel

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


^ permalink raw reply	[flat|nested] 102+ messages in thread

end of thread, other threads:[~2017-11-23 10:47 UTC | newest]

Thread overview: 102+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-10-31 22:31 [PATCH 00/23] KAISER: unmap most of the kernel from userspace page tables Dave Hansen
2017-10-31 22:31 ` [PATCH 01/23] x86, kaiser: prepare assembly for entry/exit CR3 switching Dave Hansen
2017-11-01  0:43   ` Brian Gerst
2017-11-01  1:08     ` Dave Hansen
2017-11-01 18:18   ` Borislav Petkov
2017-11-01 18:27     ` Dave Hansen
2017-11-01 20:42       ` Borislav Petkov
2017-11-01 21:01   ` Thomas Gleixner
2017-11-01 22:58     ` Dave Hansen
2017-10-31 22:31 ` [PATCH 02/23] x86, kaiser: do not set _PAGE_USER for init_mm page tables Dave Hansen
2017-11-01 21:11   ` Thomas Gleixner
2017-11-01 21:24     ` Andy Lutomirski
2017-11-01 21:28       ` Thomas Gleixner
2017-11-01 21:52         ` Dave Hansen
2017-11-01 22:11           ` Thomas Gleixner
2017-11-01 22:12           ` Linus Torvalds
2017-11-01 22:20             ` Thomas Gleixner
2017-11-01 22:45               ` Kees Cook
2017-11-02  7:10               ` Andy Lutomirski
2017-11-02 11:33                 ` Thomas Gleixner
2017-11-02 11:59                   ` Andy Lutomirski
2017-11-02 12:56                     ` Thomas Gleixner
2017-11-02 16:38                   ` Dave Hansen
2017-11-02 18:19                     ` Andy Lutomirski
2017-11-02 18:24                       ` Thomas Gleixner
2017-11-02 18:24                       ` Linus Torvalds
2017-11-02 18:40                         ` Thomas Gleixner
2017-11-02 18:57                           ` Linus Torvalds
2017-11-02 21:41                             ` Thomas Gleixner
2017-11-02  7:07         ` Andy Lutomirski
2017-11-02 11:21           ` Thomas Gleixner
2017-10-31 22:31 ` [PATCH 03/23] x86, kaiser: disable global pages Dave Hansen
2017-11-01 21:18   ` Thomas Gleixner
2017-11-01 22:12     ` Dave Hansen
2017-11-01 22:28       ` Thomas Gleixner
2017-10-31 22:31 ` [PATCH 04/23] x86, tlb: make CR4-based TLB flushes more robust Dave Hansen
2017-11-01  8:01   ` Andy Lutomirski
2017-11-01 10:11     ` Kirill A. Shutemov
2017-11-01 10:38       ` Andy Lutomirski
2017-11-01 10:56         ` Kirill A. Shutemov
2017-11-01 11:18           ` Andy Lutomirski
2017-11-01 22:21             ` Dave Hansen
2017-11-01 21:25   ` Thomas Gleixner
2017-11-01 22:24     ` Dave Hansen
2017-11-01 22:30       ` Thomas Gleixner
2017-10-31 22:31 ` [PATCH 05/23] x86, mm: document X86_CR4_PGE toggling behavior Dave Hansen
2017-10-31 23:31   ` Kees Cook
2017-10-31 22:31 ` [PATCH 06/23] x86, kaiser: introduce user-mapped percpu areas Dave Hansen
2017-11-01 21:47   ` Thomas Gleixner
2017-10-31 22:31 ` [PATCH 07/23] x86, kaiser: unmap kernel from userspace page tables (core patch) Dave Hansen
2017-10-31 22:32 ` [PATCH 08/23] x86, kaiser: only populate shadow page tables for userspace Dave Hansen
2017-10-31 23:35   ` Kees Cook
2017-10-31 22:32 ` [PATCH 09/23] x86, kaiser: allow NX to be set in p4d/pgd Dave Hansen
2017-10-31 22:32 ` [PATCH 10/23] x86, kaiser: make sure static PGDs are 8k in size Dave Hansen
2017-10-31 22:32 ` [PATCH 11/23] x86, kaiser: map GDT into user page tables Dave Hansen
2017-10-31 22:32 ` [PATCH 12/23] x86, kaiser: map dynamically-allocated LDTs Dave Hansen
2017-11-01  8:00   ` Andy Lutomirski
2017-11-01  8:06     ` Ingo Molnar
2017-10-31 22:32 ` [PATCH 13/23] x86, kaiser: map espfix structures Dave Hansen
2017-10-31 22:32 ` [PATCH 14/23] x86, kaiser: map entry stack variables Dave Hansen
2017-10-31 22:32 ` [PATCH 15/23] x86, kaiser: map trace interrupt entry Dave Hansen
2017-10-31 22:32 ` [PATCH 16/23] x86, kaiser: map debug IDT tables Dave Hansen
2017-10-31 22:32 ` [PATCH 17/23] x86, kaiser: map virtually-addressed performance monitoring buffers Dave Hansen
2017-10-31 22:32 ` [PATCH 18/23] x86, mm: Move CR3 construction functions Dave Hansen
2017-10-31 22:32 ` [PATCH 19/23] x86, mm: remove hard-coded ASID limit checks Dave Hansen
2017-10-31 22:32 ` [PATCH 20/23] x86, mm: put mmu-to-h/w ASID translation in one place Dave Hansen
2017-10-31 22:32 ` [PATCH 21/23] x86, pcid, kaiser: allow flushing for future ASID switches Dave Hansen
2017-11-01  8:03   ` Andy Lutomirski
2017-11-01 14:17     ` Dave Hansen
2017-11-01 20:31       ` Andy Lutomirski
2017-11-01 20:59         ` Dave Hansen
2017-11-01 21:04           ` Andy Lutomirski
2017-11-01 21:06             ` Dave Hansen
2017-10-31 22:32 ` [PATCH 22/23] x86, kaiser: use PCID feature to make user and kernel switches faster Dave Hansen
2017-10-31 22:32 ` [PATCH 23/23] x86, kaiser: add Kconfig Dave Hansen
2017-10-31 23:59   ` Kees Cook
2017-11-01  9:07     ` Borislav Petkov
2017-10-31 23:27 ` [PATCH 00/23] KAISER: unmap most of the kernel from userspace page tables Linus Torvalds
2017-10-31 23:44   ` Dave Hansen
2017-11-01  0:21     ` Dave Hansen
2017-11-01  7:59     ` Andy Lutomirski
2017-11-01 16:08     ` Linus Torvalds
2017-11-01 17:31       ` Dave Hansen
2017-11-01 17:58         ` Randy Dunlap
2017-11-01 18:27         ` Linus Torvalds
2017-11-01 18:46           ` Dave Hansen
2017-11-01 19:05             ` Linus Torvalds
2017-11-01 20:33               ` Andy Lutomirski
2017-11-02  7:32                 ` Andy Lutomirski
2017-11-02  7:54                   ` Andy Lutomirski
2017-11-01 15:53   ` Dave Hansen
2017-11-01  8:54 ` Ingo Molnar
2017-11-01 14:09   ` Thomas Gleixner
2017-11-01 22:14   ` Dave Hansen
2017-11-01 22:28     ` Linus Torvalds
2017-11-02  8:03     ` Peter Zijlstra
2017-11-03 11:07     ` Kirill A. Shutemov
2017-11-02 19:01 ` Will Deacon
2017-11-02 19:38   ` Dave Hansen
2017-11-03 13:41     ` Will Deacon
2017-11-22 16:19 ` Pavel Machek
2017-11-23 10:47   ` Pavel Machek

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).