* [PATCH 00/43] x86 entry-stack and Kaiser series, 2017/11/24 version
@ 2017-11-24  9:14 Ingo Molnar
From: Ingo Molnar @ 2017-11-24  9:14 UTC (permalink / raw)
  To: linux-kernel
  Cc: Dave Hansen, Andy Lutomirski, Thomas Gleixner, H . Peter Anvin,
	Peter Zijlstra, Borislav Petkov, Linus Torvalds

This is a linear series of patches combining the latest entry-stack and Kaiser
bits from Andy Lutomirski (v3 series from today) and Dave Hansen
(kaiser-414-tipwip-20171123 version), applied on top of the latest
tip:x86/urgent (12a78d43de76), plus fixes, for easier review.

The code should be the latest posted by Andy and Dave.

Any bugs caused by mis-merges, mis-backmerges or mis-fixes are mine.

Thanks,

    Ingo

Andy Lutomirski (19):
  x86/entry/64: Allocate and enable the SYSENTER stack
  x86/dumpstack: Add get_stack_info() support for the SYSENTER stack
  x86/gdt: Put per-cpu GDT remaps in ascending order
  x86/fixmap: Generalize the GDT fixmap mechanism
  x86/kasan/64: Teach KASAN about the cpu_entry_area
  x86/entry: Fix assumptions that the HW TSS is at the beginning of cpu_tss
  x86/dumpstack: Handle stack overflow on all stacks
  x86/entry: Move SYSENTER_stack to the beginning of struct tss_struct
  x86/entry: Remap the TSS into the cpu entry area
  x86/entry/64: Separate cpu_current_top_of_stack from TSS.sp0
  x86/espfix/64: Stop assuming that pt_regs is on the entry stack
  x86/entry/64: Use a percpu trampoline stack for IDT entries
  x86/entry/64: Return to userspace from the trampoline stack
  x86/entry/64: Create a percpu SYSCALL entry trampoline
  x86/irq: Remove an old outdated comment about context tracking races
  x86/irq/64: Print the offending IP in the stack overflow warning
  x86/entry/64: Move the IST stacks into cpu_entry_area
  x86/entry/64: Remove the SYSENTER stack canary
  x86/entry: Clean up SYSENTER_stack code

Dave Hansen (22):
  x86/mm/kaiser: Disable global pages by default with KAISER
  x86/mm/kaiser: Prepare assembly for entry/exit CR3 switching
  x86/mm/kaiser: Introduce user-mapped per-cpu areas
  x86/mm/kaiser: Mark per-cpu data structures required for entry/exit
  x86/mm/kaiser: Unmap kernel from userspace page tables (core patch)
  x86/mm/kaiser: Allow NX poison to be set in p4d/pgd
  x86/mm/kaiser: Make sure static PGDs are 8k in size
  x86/mm/kaiser: Map CPU entry area
  x86/mm/kaiser: Map dynamically-allocated LDTs
  x86/mm/kaiser: Map espfix structures
  x86/mm/kaiser: Map entry stack variable
  x86/mm: Move CR3 construction functions
  x86/mm: Remove hard-coded ASID limit checks
  x86/mm: Put mmu-to-h/w ASID translation in one place
  x86/mm: Allow flushing for future ASID switches
  x86/mm/kaiser: Use PCID feature to make user and kernel switches faster
  x86/mm/kaiser: Disable native VSYSCALL
  x86/mm/kaiser: Add debugfs file to turn KAISER on/off at runtime
  x86/mm/kaiser: Add a function to check for KAISER being enabled
  x86/mm/kaiser: Un-poison PGDs at runtime
  x86/mm/kaiser: Allow KAISER to be enabled/disabled at runtime
  x86/mm/kaiser: Add Kconfig

Hugh Dickins (1):
  x86/mm/kaiser: Map virtually-addressed performance monitoring buffers

Masami Hiramatsu (1):
  x86/decoder: Add new TEST instruction pattern

 Documentation/x86/kaiser.txt                | 162 ++++++++
 arch/x86/Kconfig                            |   8 +
 arch/x86/boot/compressed/pagetable.c        |   6 +
 arch/x86/entry/calling.h                    |  89 ++++
 arch/x86/entry/entry_32.S                   |   6 +-
 arch/x86/entry/entry_64.S                   | 215 ++++++++--
 arch/x86/entry/entry_64_compat.S            |  39 +-
 arch/x86/events/intel/ds.c                  |  49 ++-
 arch/x86/include/asm/cpufeatures.h          |   1 +
 arch/x86/include/asm/desc.h                 |  13 +-
 arch/x86/include/asm/fixmap.h               |  58 ++-
 arch/x86/include/asm/kaiser.h               |  68 +++
 arch/x86/include/asm/mmu_context.h          |  29 +-
 arch/x86/include/asm/pgtable.h              |  19 +-
 arch/x86/include/asm/pgtable_64.h           | 146 +++++++
 arch/x86/include/asm/pgtable_types.h        |  25 +-
 arch/x86/include/asm/processor.h            |  49 ++-
 arch/x86/include/asm/stacktrace.h           |   3 +
 arch/x86/include/asm/switch_to.h            |   2 +-
 arch/x86/include/asm/thread_info.h          |   2 +-
 arch/x86/include/asm/tlbflush.h             | 208 ++++++++--
 arch/x86/include/asm/traps.h                |   1 -
 arch/x86/include/uapi/asm/processor-flags.h |   3 +-
 arch/x86/kernel/asm-offsets.c               |   7 +
 arch/x86/kernel/asm-offsets_32.c            |   5 -
 arch/x86/kernel/asm-offsets_64.c            |   1 +
 arch/x86/kernel/cpu/common.c                | 139 +++++--
 arch/x86/kernel/doublefault.c               |  36 +-
 arch/x86/kernel/dumpstack.c                 |  42 +-
 arch/x86/kernel/dumpstack_32.c              |   6 +
 arch/x86/kernel/dumpstack_64.c              |   6 +
 arch/x86/kernel/espfix_64.c                 |  27 +-
 arch/x86/kernel/head_64.S                   |  30 +-
 arch/x86/kernel/irq.c                       |  12 -
 arch/x86/kernel/irq_64.c                    |   4 +-
 arch/x86/kernel/ldt.c                       |  25 +-
 arch/x86/kernel/process.c                   |  15 +-
 arch/x86/kernel/process_64.c                |   3 +-
 arch/x86/kernel/traps.c                     |  27 +-
 arch/x86/kernel/vmlinux.lds.S               |  10 +
 arch/x86/kvm/x86.c                          |   3 +-
 arch/x86/lib/x86-opcode-map.txt             |   2 +-
 arch/x86/mm/Makefile                        |   1 +
 arch/x86/mm/init.c                          |  75 ++--
 arch/x86/mm/kaiser.c                        | 620 ++++++++++++++++++++++++++++
 arch/x86/mm/kasan_init_64.c                 |  13 +-
 arch/x86/mm/pageattr.c                      |  18 +-
 arch/x86/mm/pgtable.c                       |  16 +-
 arch/x86/mm/tlb.c                           | 105 ++++-
 arch/x86/power/cpu.c                        |  16 +-
 arch/x86/xen/mmu_pv.c                       |   2 +-
 include/asm-generic/vmlinux.lds.h           |   7 +
 include/linux/kaiser.h                      |  38 ++
 include/linux/percpu-defs.h                 |  30 ++
 init/main.c                                 |   3 +
 kernel/fork.c                               |   1 +
 security/Kconfig                            |  10 +
 57 files changed, 2259 insertions(+), 297 deletions(-)
 create mode 100644 Documentation/x86/kaiser.txt
 create mode 100644 arch/x86/include/asm/kaiser.h
 create mode 100644 arch/x86/mm/kaiser.c
 create mode 100644 include/linux/kaiser.h

-- 
2.14.1

* [PATCH 01/43] x86/decoder: Add new TEST instruction pattern
@ 2017-11-24  9:14 ` Ingo Molnar
From: Ingo Molnar @ 2017-11-24  9:14 UTC (permalink / raw)
  To: linux-kernel
  Cc: Dave Hansen, Andy Lutomirski, Thomas Gleixner, H . Peter Anvin,
	Peter Zijlstra, Borislav Petkov, Linus Torvalds

From: Masami Hiramatsu <mhiramat@kernel.org>

The kbuild test robot reported this build warning:

  Warning: arch/x86/tools/test_get_len found difference at <jump_table>:ffffffff8103dd2c

  Warning: ffffffff8103dd82: f6 09 d8 testb $0xd8,(%rcx)
  Warning: objdump says 3 bytes, but insn_get_length() says 2
  Warning: decoded and checked 1569014 instructions with 1 warnings

This sequence seems to be a new instruction that is not in the opcode map in the Intel SDM.

The instruction sequence is "F6 09 d8", which means Group3 (F6), a ModR/M byte of
MOD(00) REG(001) RM(001), and the immediate 0xd8. Intel SDM Vol. 2, A.4, Table A-6
says the table index within the group is the "Encoding of Bits 5,4,3 of the ModR/M
Byte (bits 2,1,0 in parenthesis)".

In that table, the opcodes are listed by the REG-bit index as:

  000          001          010   011   100          101           110          111
  TEST Ib/Iz   (undefined)  NOT   NEG   MUL AL/rAX   IMUL AL/rAX   DIV AL/rAX   IDIV AL/rAX

So it seems that TEST Ib is also assigned to index 001, the previously undefined slot.

Add the new pattern.
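
As a standalone illustration (not part of the patch), here is a minimal C
sketch of how the REG field of the ModR/M byte selects the index within the
group; the byte value is the one from the warning above:

  #include <stdio.h>

  int main(void)
  {
          unsigned char modrm = 0x09;             /* second byte of "F6 09 d8" */

          unsigned int mod = (modrm >> 6) & 0x3;  /* bits 7,6 */
          unsigned int reg = (modrm >> 3) & 0x7;  /* bits 5,4,3: the group index */
          unsigned int rm  = modrm & 0x7;         /* bits 2,1,0 */

          /* Prints "mod=0 reg=1 rm=1": Grp3 index 001, i.e. TEST Eb,Ib,
           * with the trailing 0xd8 byte being the imm8 operand. */
          printf("mod=%u reg=%u rm=%u\n", mod, reg, rm);
          return 0;
  }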

Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Cc: <stable@vger.kernel.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/lib/x86-opcode-map.txt | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/lib/x86-opcode-map.txt b/arch/x86/lib/x86-opcode-map.txt
index 12e377184ee4..c4d55919fac1 100644
--- a/arch/x86/lib/x86-opcode-map.txt
+++ b/arch/x86/lib/x86-opcode-map.txt
@@ -896,7 +896,7 @@ EndTable
 
 GrpTable: Grp3_1
 0: TEST Eb,Ib
-1:
+1: TEST Eb,Ib
 2: NOT Eb
 3: NEG Eb
 4: MUL AL,Eb
-- 
2.14.1

* [PATCH 02/43] x86/entry/64: Allocate and enable the SYSENTER stack
@ 2017-11-24  9:14 ` Ingo Molnar
From: Ingo Molnar @ 2017-11-24  9:14 UTC (permalink / raw)
  To: linux-kernel
  Cc: Dave Hansen, Andy Lutomirski, Thomas Gleixner, H . Peter Anvin,
	Peter Zijlstra, Borislav Petkov, Linus Torvalds

From: Andy Lutomirski <luto@kernel.org>

This will simplify future changes that want scratch variables early in
the SYSENTER handler -- they'll be able to spill registers to the
stack.  It also lets us get rid of a SWAPGS_UNSAFE_STACK user.

This does not depend on CONFIG_IA32_EMULATION because we'll want the
stack space even without IA32 emulation.

As far as I can tell, the reason that this wasn't done from day 1 is
that we use IST for #DB and #BP, which is IMO rather nasty and causes
a lot more problems than it solves.  But, since #DB uses IST, we don't
actually need a real stack for SYSENTER (because SYSENTER with TF set
will invoke #DB on the IST stack rather than the SYSENTER stack).
I want to remove IST usage from these vectors some day, and this patch
is a prerequisite for that as well.
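
To make the stack-top arithmetic concrete, here is a minimal user-space sketch
(with a toy struct, not the real tss_struct) of why MSR_IA32_SYSENTER_ESP gets
loaded with offsetofend() of the stack array: x86 stacks grow down, so the MSR
must point just past the array's last byte:

  #include <stdio.h>
  #include <stddef.h>

  /* Local copy of the kernel's offsetofend() helper. */
  #define offsetofend(TYPE, MEMBER) \
          (offsetof(TYPE, MEMBER) + sizeof(((TYPE *)0)->MEMBER))

  struct toy_tss {
          unsigned long other[8];
          unsigned long SYSENTER_stack_canary;
          unsigned long SYSENTER_stack[64];
  };

  int main(void)
  {
          struct toy_tss tss;

          /* The initial stack pointer must be one past the *end* of the
           * array, i.e. base + offsetofend(..., SYSENTER_stack). */
          unsigned long top = (unsigned long)&tss +
                              offsetofend(struct toy_tss, SYSENTER_stack);

          printf("stack top: %#lx (array ends at %p)\n",
                 top, (void *)&tss.SYSENTER_stack[64]);
          return 0;
  }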

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Borislav Petkov <bpetkov@suse.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/c37d6e68a73e1b5b1203e0e95b488fa8092b3cfb.1511497875.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/entry/entry_64_compat.S | 2 +-
 arch/x86/include/asm/processor.h | 3 ---
 arch/x86/kernel/asm-offsets.c    | 5 +++++
 arch/x86/kernel/asm-offsets_32.c | 5 -----
 arch/x86/kernel/cpu/common.c     | 4 +++-
 arch/x86/kernel/process.c        | 2 --
 arch/x86/kernel/traps.c          | 3 +--
 7 files changed, 10 insertions(+), 14 deletions(-)

diff --git a/arch/x86/entry/entry_64_compat.S b/arch/x86/entry/entry_64_compat.S
index 568e130d932c..dcc6987f9bae 100644
--- a/arch/x86/entry/entry_64_compat.S
+++ b/arch/x86/entry/entry_64_compat.S
@@ -48,7 +48,7 @@
  */
 ENTRY(entry_SYSENTER_compat)
 	/* Interrupts are off on entry. */
-	SWAPGS_UNSAFE_STACK
+	SWAPGS
 	movq	PER_CPU_VAR(cpu_current_top_of_stack), %rsp
 
 	/*
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index cc16fa882e3e..504a3bb4d5f0 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -340,14 +340,11 @@ struct tss_struct {
 	 */
 	unsigned long		io_bitmap[IO_BITMAP_LONGS + 1];
 
-#ifdef CONFIG_X86_32
 	/*
 	 * Space for the temporary SYSENTER stack.
 	 */
 	unsigned long		SYSENTER_stack_canary;
 	unsigned long		SYSENTER_stack[64];
-#endif
-
 } ____cacheline_aligned;
 
 DECLARE_PER_CPU_SHARED_ALIGNED(struct tss_struct, cpu_tss);
diff --git a/arch/x86/kernel/asm-offsets.c b/arch/x86/kernel/asm-offsets.c
index 8ea78275480d..b275863128eb 100644
--- a/arch/x86/kernel/asm-offsets.c
+++ b/arch/x86/kernel/asm-offsets.c
@@ -93,4 +93,9 @@ void common(void) {
 
 	BLANK();
 	DEFINE(PTREGS_SIZE, sizeof(struct pt_regs));
+
+	/* Offset from cpu_tss to SYSENTER_stack */
+	OFFSET(CPU_TSS_SYSENTER_stack, tss_struct, SYSENTER_stack);
+	/* Size of SYSENTER_stack */
+	DEFINE(SIZEOF_SYSENTER_stack, sizeof(((struct tss_struct *)0)->SYSENTER_stack));
 }
diff --git a/arch/x86/kernel/asm-offsets_32.c b/arch/x86/kernel/asm-offsets_32.c
index dedf428b20b6..52ce4ea16e53 100644
--- a/arch/x86/kernel/asm-offsets_32.c
+++ b/arch/x86/kernel/asm-offsets_32.c
@@ -50,11 +50,6 @@ void foo(void)
 	DEFINE(TSS_sysenter_sp0, offsetof(struct tss_struct, x86_tss.sp0) -
 	       offsetofend(struct tss_struct, SYSENTER_stack));
 
-	/* Offset from cpu_tss to SYSENTER_stack */
-	OFFSET(CPU_TSS_SYSENTER_stack, tss_struct, SYSENTER_stack);
-	/* Size of SYSENTER_stack */
-	DEFINE(SIZEOF_SYSENTER_stack, sizeof(((struct tss_struct *)0)->SYSENTER_stack));
-
 #ifdef CONFIG_CC_STACKPROTECTOR
 	BLANK();
 	OFFSET(stack_canary_offset, stack_canary, canary);
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index fa998ca8aa5a..ccb5f66c4e5b 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -1386,7 +1386,9 @@ void syscall_init(void)
 	 * AMD doesn't allow SYSENTER in long mode (either 32- or 64-bit).
 	 */
 	wrmsrl_safe(MSR_IA32_SYSENTER_CS, (u64)__KERNEL_CS);
-	wrmsrl_safe(MSR_IA32_SYSENTER_ESP, 0ULL);
+	wrmsrl_safe(MSR_IA32_SYSENTER_ESP,
+		    (unsigned long)this_cpu_ptr(&cpu_tss) +
+		    offsetofend(struct tss_struct, SYSENTER_stack));
 	wrmsrl_safe(MSR_IA32_SYSENTER_EIP, (u64)entry_SYSENTER_compat);
 #else
 	wrmsrl(MSR_CSTAR, (unsigned long)ignore_sysret);
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 97fb3e5737f5..35d674157fda 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -71,9 +71,7 @@ __visible DEFINE_PER_CPU_SHARED_ALIGNED(struct tss_struct, cpu_tss) = {
 	  */
 	.io_bitmap		= { [0 ... IO_BITMAP_LONGS] = ~0 },
 #endif
-#ifdef CONFIG_X86_32
 	.SYSENTER_stack_canary	= STACK_END_MAGIC,
-#endif
 };
 EXPORT_PER_CPU_SYMBOL(cpu_tss);
 
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index b7b0f74a2150..2008dd0f8ccb 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -800,14 +800,13 @@ dotraplinkage void do_debug(struct pt_regs *regs, long error_code)
 	debug_stack_usage_dec();
 
 exit:
-#if defined(CONFIG_X86_32)
 	/*
 	 * This is the most likely code path that involves non-trivial use
 	 * of the SYSENTER stack.  Check that we haven't overrun it.
 	 */
 	WARN(this_cpu_read(cpu_tss.SYSENTER_stack_canary) != STACK_END_MAGIC,
 	     "Overran or corrupted SYSENTER stack\n");
-#endif
+
 	ist_exit(regs);
 }
 NOKPROBE_SYMBOL(do_debug);
-- 
2.14.1

* [PATCH 03/43] x86/dumpstack: Add get_stack_info() support for the SYSENTER stack
@ 2017-11-24  9:14 ` Ingo Molnar
From: Ingo Molnar @ 2017-11-24  9:14 UTC (permalink / raw)
  To: linux-kernel
  Cc: Dave Hansen, Andy Lutomirski, Thomas Gleixner, H . Peter Anvin,
	Peter Zijlstra, Borislav Petkov, Linus Torvalds

From: Andy Lutomirski <luto@kernel.org>

get_stack_info() doesn't currently know about the SYSENTER stack, so
unwinding will fail if we entered the kernel on the SYSENTER stack
and haven't fully switched off.  Teach get_stack_info() about the
SYSENTER stack.

With future patches applied that run part of the entry code on the
SYSENTER stack and introduce an intentional BUG(), I would get:

PANIC: double fault, error_code: 0x0
...
RIP: 0010:do_error_trap+0x33/0x1c0
...
Call Trace:
Code: ...

With this patch, I get:

PANIC: double fault, error_code: 0x0
...
Call Trace:
 <SYSENTER>
 ? async_page_fault+0x36/0x60
 ? invalid_op+0x22/0x40
 ? async_page_fault+0x36/0x60
 ? sync_regs+0x3c/0x40
 ? sync_regs+0x2e/0x40
 ? error_entry+0x6c/0xd0
 ? async_page_fault+0x36/0x60
 </SYSENTER>
Code: ...

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Reviewed-by: Borislav Petkov <bp@suse.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Borislav Petkov <bpetkov@suse.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/c32ce8b363e27fa9b4a4773297d5b4b0f4b39e94.1511497875.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/include/asm/stacktrace.h |  3 +++
 arch/x86/kernel/dumpstack.c       | 19 +++++++++++++++++++
 arch/x86/kernel/dumpstack_32.c    |  6 ++++++
 arch/x86/kernel/dumpstack_64.c    |  6 ++++++
 4 files changed, 34 insertions(+)

diff --git a/arch/x86/include/asm/stacktrace.h b/arch/x86/include/asm/stacktrace.h
index 8da111b3c342..f8062bfd43a0 100644
--- a/arch/x86/include/asm/stacktrace.h
+++ b/arch/x86/include/asm/stacktrace.h
@@ -16,6 +16,7 @@ enum stack_type {
 	STACK_TYPE_TASK,
 	STACK_TYPE_IRQ,
 	STACK_TYPE_SOFTIRQ,
+	STACK_TYPE_SYSENTER,
 	STACK_TYPE_EXCEPTION,
 	STACK_TYPE_EXCEPTION_LAST = STACK_TYPE_EXCEPTION + N_EXCEPTION_STACKS-1,
 };
@@ -28,6 +29,8 @@ struct stack_info {
 bool in_task_stack(unsigned long *stack, struct task_struct *task,
 		   struct stack_info *info);
 
+bool in_sysenter_stack(unsigned long *stack, struct stack_info *info);
+
 int get_stack_info(unsigned long *stack, struct task_struct *task,
 		   struct stack_info *info, unsigned long *visit_mask);
 
diff --git a/arch/x86/kernel/dumpstack.c b/arch/x86/kernel/dumpstack.c
index f13b4c00a5de..5e7d10e8ca25 100644
--- a/arch/x86/kernel/dumpstack.c
+++ b/arch/x86/kernel/dumpstack.c
@@ -43,6 +43,25 @@ bool in_task_stack(unsigned long *stack, struct task_struct *task,
 	return true;
 }
 
+bool in_sysenter_stack(unsigned long *stack, struct stack_info *info)
+{
+	struct tss_struct *tss = this_cpu_ptr(&cpu_tss);
+
+	/* Treat the canary as part of the stack for unwinding purposes. */
+	void *begin = &tss->SYSENTER_stack_canary;
+	void *end = (void *)&tss->SYSENTER_stack + sizeof(tss->SYSENTER_stack);
+
+	if ((void *)stack < begin || (void *)stack >= end)
+		return false;
+
+	info->type	= STACK_TYPE_SYSENTER;
+	info->begin	= begin;
+	info->end	= end;
+	info->next_sp	= NULL;
+
+	return true;
+}
+
 static void printk_stack_address(unsigned long address, int reliable,
 				 char *log_lvl)
 {
diff --git a/arch/x86/kernel/dumpstack_32.c b/arch/x86/kernel/dumpstack_32.c
index daefae83a3aa..5ff13a6b3680 100644
--- a/arch/x86/kernel/dumpstack_32.c
+++ b/arch/x86/kernel/dumpstack_32.c
@@ -26,6 +26,9 @@ const char *stack_type_name(enum stack_type type)
 	if (type == STACK_TYPE_SOFTIRQ)
 		return "SOFTIRQ";
 
+	if (type == STACK_TYPE_SYSENTER)
+		return "SYSENTER";
+
 	return NULL;
 }
 
@@ -93,6 +96,9 @@ int get_stack_info(unsigned long *stack, struct task_struct *task,
 	if (task != current)
 		goto unknown;
 
+	if (in_sysenter_stack(stack, info))
+		goto recursion_check;
+
 	if (in_hardirq_stack(stack, info))
 		goto recursion_check;
 
diff --git a/arch/x86/kernel/dumpstack_64.c b/arch/x86/kernel/dumpstack_64.c
index 88ce2ffdb110..abc828f8c297 100644
--- a/arch/x86/kernel/dumpstack_64.c
+++ b/arch/x86/kernel/dumpstack_64.c
@@ -37,6 +37,9 @@ const char *stack_type_name(enum stack_type type)
 	if (type == STACK_TYPE_IRQ)
 		return "IRQ";
 
+	if (type == STACK_TYPE_SYSENTER)
+		return "SYSENTER";
+
 	if (type >= STACK_TYPE_EXCEPTION && type <= STACK_TYPE_EXCEPTION_LAST)
 		return exception_stack_names[type - STACK_TYPE_EXCEPTION];
 
@@ -115,6 +118,9 @@ int get_stack_info(unsigned long *stack, struct task_struct *task,
 	if (in_irq_stack(stack, info))
 		goto recursion_check;
 
+	if (in_sysenter_stack(stack, info))
+		goto recursion_check;
+
 	goto unknown;
 
 recursion_check:
-- 
2.14.1

* [PATCH 04/43] x86/gdt: Put per-cpu GDT remaps in ascending order
@ 2017-11-24  9:14 ` Ingo Molnar
From: Ingo Molnar @ 2017-11-24  9:14 UTC (permalink / raw)
  To: linux-kernel
  Cc: Dave Hansen, Andy Lutomirski, Thomas Gleixner, H . Peter Anvin,
	Peter Zijlstra, Borislav Petkov, Linus Torvalds

From: Andy Lutomirski <luto@kernel.org>

We currently have CPU 0's GDT at the top of the GDT range and
higher-numbered CPUs at lower addresses.  This happens because the
fixmap is upside down (index 0 is the top of the fixmap).

Flip it so that GDTs are in ascending order by virtual address.
This will simplify a future patch that will generalize the GDT
remap to contain multiple pages.
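
For illustration, a small user-space sketch of the fixmap arithmetic (the
constants here are made up; the real ones live in asm/fixmap.h). Since
__fix_to_virt() subtracts the index from the fixmap top, indexing with
FIX_GDT_REMAP_END - cpu makes the virtual addresses ascend with the CPU number:

  #include <stdio.h>

  #define PAGE_SIZE            4096UL
  #define FIXADDR_TOP          0xffffffffff5ff000UL  /* toy value */
  #define FIX_GDT_REMAP_BEGIN  100
  #define NR_CPUS              4
  #define FIX_GDT_REMAP_END    (FIX_GDT_REMAP_BEGIN + NR_CPUS - 1)

  /* Higher fixmap index == lower virtual address. */
  static unsigned long fix_to_virt(unsigned int idx)
  {
          return FIXADDR_TOP - idx * PAGE_SIZE;
  }

  int main(void)
  {
          int cpu;

          /* Old: BEGIN + cpu -> addresses *descend* as cpu grows.
           * New: END - cpu   -> addresses *ascend* as cpu grows.  */
          for (cpu = 0; cpu < NR_CPUS; cpu++)
                  printf("cpu %d: old %#lx new %#lx\n", cpu,
                         fix_to_virt(FIX_GDT_REMAP_BEGIN + cpu),
                         fix_to_virt(FIX_GDT_REMAP_END - cpu));
          return 0;
  }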

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Reviewed-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Borislav Petkov <bpetkov@suse.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/3966a6edf6fd45deca4cf52a9b9276402499dda9.1511497875.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/include/asm/desc.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/desc.h b/arch/x86/include/asm/desc.h
index 4011cb03ef08..95cd95eb7285 100644
--- a/arch/x86/include/asm/desc.h
+++ b/arch/x86/include/asm/desc.h
@@ -63,7 +63,7 @@ static inline struct desc_struct *get_current_gdt_rw(void)
 /* Get the fixmap index for a specific processor */
 static inline unsigned int get_cpu_gdt_ro_index(int cpu)
 {
-	return FIX_GDT_REMAP_BEGIN + cpu;
+	return FIX_GDT_REMAP_END - cpu;
 }
 
 /* Provide the fixmap address of the remapped GDT */
-- 
2.14.1

* [PATCH 05/43] x86/fixmap: Generalize the GDT fixmap mechanism
@ 2017-11-24  9:14 ` Ingo Molnar
From: Ingo Molnar @ 2017-11-24  9:14 UTC (permalink / raw)
  To: linux-kernel
  Cc: Dave Hansen, Andy Lutomirski, Thomas Gleixner, H . Peter Anvin,
	Peter Zijlstra, Borislav Petkov, Linus Torvalds

From: Andy Lutomirski <luto@kernel.org>

Currently, the GDT is an ad-hoc array of pages, one per CPU, in the
fixmap.  Generalize it to be an array of a new struct cpu_entry_area
so that we can cleanly add new things to it.
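
A toy sketch of the resulting index math (all names and sizes here are
stand-ins, including the hypothetical extra field): each CPU owns a contiguous
run of fixmap slots, and a field's slot falls out of its byte offset within the
struct, mirroring __get_cpu_entry_area_offset_index() in the patch:

  #include <stdio.h>
  #include <stddef.h>

  #define PAGE_SIZE 4096UL

  struct cpu_entry_area {
          char gdt[PAGE_SIZE];
          char tss[3 * PAGE_SIZE];        /* hypothetical later addition */
  };

  #define CPU_ENTRY_AREA_PAGES (sizeof(struct cpu_entry_area) / PAGE_SIZE)

  /* Fixmap slot of the page holding 'offset' within cpu's area,
   * counting down from a per-cpu bottom slot. */
  static unsigned int area_page_index(int cpu, size_t offset)
  {
          unsigned int bottom = 1000;     /* stand-in for FIX_CPU_ENTRY_AREA_BOTTOM */

          return bottom - cpu * CPU_ENTRY_AREA_PAGES - offset / PAGE_SIZE;
  }

  int main(void)
  {
          printf("cpu 0 gdt slot: %u\n", area_page_index(0, offsetof(struct cpu_entry_area, gdt)));
          printf("cpu 0 tss slot: %u\n", area_page_index(0, offsetof(struct cpu_entry_area, tss)));
          printf("cpu 1 gdt slot: %u\n", area_page_index(1, offsetof(struct cpu_entry_area, gdt)));
          return 0;
  }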

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Borislav Petkov <bpetkov@suse.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/22571d77ba1f3c714df9fa37db9a58218bc17597.1511497875.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/include/asm/desc.h   |  9 +--------
 arch/x86/include/asm/fixmap.h | 34 ++++++++++++++++++++++++++++++++--
 arch/x86/kernel/cpu/common.c  | 14 +++++++-------
 arch/x86/xen/mmu_pv.c         |  2 +-
 4 files changed, 41 insertions(+), 18 deletions(-)

diff --git a/arch/x86/include/asm/desc.h b/arch/x86/include/asm/desc.h
index 95cd95eb7285..194ffab00ebe 100644
--- a/arch/x86/include/asm/desc.h
+++ b/arch/x86/include/asm/desc.h
@@ -60,17 +60,10 @@ static inline struct desc_struct *get_current_gdt_rw(void)
 	return this_cpu_ptr(&gdt_page)->gdt;
 }
 
-/* Get the fixmap index for a specific processor */
-static inline unsigned int get_cpu_gdt_ro_index(int cpu)
-{
-	return FIX_GDT_REMAP_END - cpu;
-}
-
 /* Provide the fixmap address of the remapped GDT */
 static inline struct desc_struct *get_cpu_gdt_ro(int cpu)
 {
-	unsigned int idx = get_cpu_gdt_ro_index(cpu);
-	return (struct desc_struct *)__fix_to_virt(idx);
+	return (struct desc_struct *)&get_cpu_entry_area(cpu)->gdt;
 }
 
 /* Provide the current read-only GDT */
diff --git a/arch/x86/include/asm/fixmap.h b/arch/x86/include/asm/fixmap.h
index dcd9fb55e679..0f4c92f02968 100644
--- a/arch/x86/include/asm/fixmap.h
+++ b/arch/x86/include/asm/fixmap.h
@@ -44,6 +44,16 @@ extern unsigned long __FIXADDR_TOP;
 			 PAGE_SIZE)
 #endif
 
+/*
+ * cpu_entry_area is a percpu region in the fixmap that contains things
+ * needed by the CPU and early entry/exit code.  Real types aren't used
+ * for all fields here to avoid circular header dependencies.
+ */
+struct cpu_entry_area {
+	char gdt[PAGE_SIZE];
+};
+
+#define CPU_ENTRY_AREA_PAGES (sizeof(struct cpu_entry_area) / PAGE_SIZE)
 
 /*
  * Here we define all the compile-time 'special' virtual
@@ -101,8 +111,8 @@ enum fixed_addresses {
 	FIX_LNW_VRTC,
 #endif
 	/* Fixmap entries to remap the GDTs, one per processor. */
-	FIX_GDT_REMAP_BEGIN,
-	FIX_GDT_REMAP_END = FIX_GDT_REMAP_BEGIN + NR_CPUS - 1,
+	FIX_CPU_ENTRY_AREA_TOP,
+	FIX_CPU_ENTRY_AREA_BOTTOM = FIX_CPU_ENTRY_AREA_TOP + (CPU_ENTRY_AREA_PAGES * NR_CPUS) - 1,
 
 	__end_of_permanent_fixed_addresses,
 
@@ -185,5 +195,25 @@ void __init *early_memremap_decrypted_wp(resource_size_t phys_addr,
 void __early_set_fixmap(enum fixed_addresses idx,
 			phys_addr_t phys, pgprot_t flags);
 
+static inline unsigned int __get_cpu_entry_area_page_index(int cpu, int page)
+{
+	BUILD_BUG_ON(sizeof(struct cpu_entry_area) % PAGE_SIZE != 0);
+
+	return FIX_CPU_ENTRY_AREA_BOTTOM - cpu*CPU_ENTRY_AREA_PAGES - page;
+}
+
+#define __get_cpu_entry_area_offset_index(cpu, offset) ({		\
+	BUILD_BUG_ON(offset % PAGE_SIZE != 0);				\
+	__get_cpu_entry_area_page_index(cpu, offset / PAGE_SIZE);	\
+	})
+
+#define get_cpu_entry_area_index(cpu, field)				\
+	__get_cpu_entry_area_offset_index((cpu), offsetof(struct cpu_entry_area, field))
+
+static inline struct cpu_entry_area *get_cpu_entry_area(int cpu)
+{
+	return (struct cpu_entry_area *)__fix_to_virt(__get_cpu_entry_area_page_index(cpu, 0));
+}
+
 #endif /* !__ASSEMBLY__ */
 #endif /* _ASM_X86_FIXMAP_H */
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index ccb5f66c4e5b..c0fb3eb37ee0 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -490,12 +490,12 @@ void load_percpu_segment(int cpu)
 	load_stack_canary_segment();
 }
 
-/* Setup the fixmap mapping only once per-processor */
-static inline void setup_fixmap_gdt(int cpu)
+/* Setup the fixmap mappings only once per-processor */
+static inline void setup_cpu_entry_area(int cpu)
 {
 #ifdef CONFIG_X86_64
 	/* On 64-bit systems, we use a read-only fixmap GDT. */
-	pgprot_t prot = PAGE_KERNEL_RO;
+	pgprot_t gdt_prot = PAGE_KERNEL_RO;
 #else
 	/*
 	 * On native 32-bit systems, the GDT cannot be read-only because
@@ -506,11 +506,11 @@ static inline void setup_fixmap_gdt(int cpu)
 	 * On Xen PV, the GDT must be read-only because the hypervisor requires
 	 * it.
 	 */
-	pgprot_t prot = boot_cpu_has(X86_FEATURE_XENPV) ?
+	pgprot_t gdt_prot = boot_cpu_has(X86_FEATURE_XENPV) ?
 		PAGE_KERNEL_RO : PAGE_KERNEL;
 #endif
 
-	__set_fixmap(get_cpu_gdt_ro_index(cpu), get_cpu_gdt_paddr(cpu), prot);
+	__set_fixmap(get_cpu_entry_area_index(cpu, gdt), get_cpu_gdt_paddr(cpu), gdt_prot);
 }
 
 /* Load the original GDT from the per-cpu structure */
@@ -1614,7 +1614,7 @@ void cpu_init(void)
 	if (is_uv_system())
 		uv_cpu_init();
 
-	setup_fixmap_gdt(cpu);
+	setup_cpu_entry_area(cpu);
 	load_fixmap_gdt(cpu);
 }
 
@@ -1676,7 +1676,7 @@ void cpu_init(void)
 
 	fpu__init_cpu();
 
-	setup_fixmap_gdt(cpu);
+	setup_cpu_entry_area(cpu);
 	load_fixmap_gdt(cpu);
 }
 #endif
diff --git a/arch/x86/xen/mmu_pv.c b/arch/x86/xen/mmu_pv.c
index 2ccdaba31a07..c2454237fa67 100644
--- a/arch/x86/xen/mmu_pv.c
+++ b/arch/x86/xen/mmu_pv.c
@@ -2272,7 +2272,7 @@ static void xen_set_fixmap(unsigned idx, phys_addr_t phys, pgprot_t prot)
 #endif
 	case FIX_TEXT_POKE0:
 	case FIX_TEXT_POKE1:
-	case FIX_GDT_REMAP_BEGIN ... FIX_GDT_REMAP_END:
+	case FIX_CPU_ENTRY_AREA_TOP ... FIX_CPU_ENTRY_AREA_BOTTOM:
 		/* All local page mappings */
 		pte = pfn_pte(phys, prot);
 		break;
-- 
2.14.1

* [PATCH 06/43] x86/kasan/64: Teach KASAN about the cpu_entry_area
@ 2017-11-24  9:14 ` Ingo Molnar
From: Ingo Molnar @ 2017-11-24  9:14 UTC (permalink / raw)
  To: linux-kernel
  Cc: Dave Hansen, Andy Lutomirski, Thomas Gleixner, H . Peter Anvin,
	Peter Zijlstra, Borislav Petkov, Linus Torvalds

From: Andy Lutomirski <luto@kernel.org>

The cpu_entry_area will contain stacks.  Make sure that KASAN has
appropriate shadow mappings for them.
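
For reference, a hedged sketch of the shadow arithmetic involved (the offset
value below is a stand-in; the real one is fixed by the kernel configuration):
KASAN maps every 8 bytes of address space to one shadow byte, so the
cpu_entry_area range needs its own populated shadow range:

  #include <stdio.h>

  #define KASAN_SHADOW_OFFSET       0xdffffc0000000000UL  /* toy value */
  #define KASAN_SHADOW_SCALE_SHIFT  3  /* one shadow byte per 8 bytes */

  static unsigned long mem_to_shadow(unsigned long addr)
  {
          return (addr >> KASAN_SHADOW_SCALE_SHIFT) + KASAN_SHADOW_OFFSET;
  }

  int main(void)
  {
          unsigned long begin = 0xffffffffff578000UL;  /* made-up area start */
          unsigned long end   = 0xffffffffff57c000UL;  /* made-up area end */

          /* Every byte in [begin, end) needs a backing shadow byte,
           * which is why kasan_init() must populate this range. */
          printf("shadow: %#lx - %#lx\n",
                 mem_to_shadow(begin), mem_to_shadow(end));
          return 0;
  }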

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Borislav Petkov <bpetkov@suse.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: kasan-dev@googlegroups.com
Link: http://lkml.kernel.org/r/8407adf9126440d6467dade88fdb3e3b75fc1019.1511497875.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/mm/kasan_init_64.c | 13 ++++++++++++-
 1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/arch/x86/mm/kasan_init_64.c b/arch/x86/mm/kasan_init_64.c
index 99dfed6dfef8..54561dce742e 100644
--- a/arch/x86/mm/kasan_init_64.c
+++ b/arch/x86/mm/kasan_init_64.c
@@ -277,6 +277,7 @@ void __init kasan_early_init(void)
 void __init kasan_init(void)
 {
 	int i;
+	void *cpu_entry_area_begin, *cpu_entry_area_end;
 
 #ifdef CONFIG_KASAN_INLINE
 	register_die_notifier(&kasan_die_notifier);
@@ -329,8 +330,18 @@ void __init kasan_init(void)
 			      (unsigned long)kasan_mem_to_shadow(_end),
 			      early_pfn_to_nid(__pa(_stext)));
 
+	cpu_entry_area_begin = (void *)(__fix_to_virt(FIX_CPU_ENTRY_AREA_BOTTOM));
+	cpu_entry_area_end = (void *)(__fix_to_virt(FIX_CPU_ENTRY_AREA_TOP) + PAGE_SIZE);
+
 	kasan_populate_zero_shadow(kasan_mem_to_shadow((void *)MODULES_END),
-			(void *)KASAN_SHADOW_END);
+				   kasan_mem_to_shadow(cpu_entry_area_begin));
+
+	kasan_populate_shadow((unsigned long)kasan_mem_to_shadow(cpu_entry_area_begin),
+			      (unsigned long)kasan_mem_to_shadow(cpu_entry_area_end),
+		0);
+
+	kasan_populate_zero_shadow(kasan_mem_to_shadow(cpu_entry_area_end),
+				   (void *)KASAN_SHADOW_END);
 
 	load_cr3(init_top_pgt);
 	__flush_tlb_all();
-- 
2.14.1

* [PATCH 07/43] x86/entry: Fix assumptions that the HW TSS is at the beginning of cpu_tss
@ 2017-11-24  9:14 ` Ingo Molnar
From: Ingo Molnar @ 2017-11-24  9:14 UTC (permalink / raw)
  To: linux-kernel
  Cc: Dave Hansen, Andy Lutomirski, Thomas Gleixner, H . Peter Anvin,
	Peter Zijlstra, Borislav Petkov, Linus Torvalds

From: Andy Lutomirski <luto@kernel.org>

A future patch will move SYSENTER_stack to the beginning of cpu_tss
to help detect overflow.  Before this can happen, fix several code
paths that hardcode assumptions about the old layout.
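
One of those assumptions is the I/O bitmap offset: the CPU interprets
io_bitmap_base relative to the start of the hardware TSS, not relative to
struct tss_struct. A toy sketch (simplified struct, made-up sizes) of why the
two differ once x86_tss no longer sits at offset 0:

  #include <stdio.h>
  #include <stddef.h>

  struct x86_hw_tss { char regs[104]; };  /* toy: 104-byte hardware part */

  struct tss_struct {
          unsigned long SYSENTER_stack[64];  /* soon moved to the front */
          struct x86_hw_tss x86_tss;
          unsigned long io_bitmap[1024];
  };

  int main(void)
  {
          /* What the CPU needs: bitmap offset relative to the HW TSS base. */
          size_t hw_relative = offsetof(struct tss_struct, io_bitmap) -
                               offsetof(struct tss_struct, x86_tss);

          /* The old hardcoded form, only correct while x86_tss sat at 0. */
          size_t struct_relative = offsetof(struct tss_struct, io_bitmap);

          printf("relative to x86_tss: %zu, relative to tss_struct: %zu\n",
                 hw_relative, struct_relative);
          return 0;
  }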

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Reviewed-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Borislav Petkov <bpetkov@suse.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/d40a2c5ae4539d64090849a374f3169ec492f4e2.1511497875.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/include/asm/desc.h      |  2 +-
 arch/x86/include/asm/processor.h |  4 ++--
 arch/x86/kernel/cpu/common.c     |  8 ++++----
 arch/x86/kernel/doublefault.c    | 36 +++++++++++++++++-------------------
 arch/x86/power/cpu.c             | 13 +++++++------
 5 files changed, 31 insertions(+), 32 deletions(-)

diff --git a/arch/x86/include/asm/desc.h b/arch/x86/include/asm/desc.h
index 194ffab00ebe..aab4fe9f49f8 100644
--- a/arch/x86/include/asm/desc.h
+++ b/arch/x86/include/asm/desc.h
@@ -178,7 +178,7 @@ static inline void set_tssldt_descriptor(void *d, unsigned long addr,
 #endif
 }
 
-static inline void __set_tss_desc(unsigned cpu, unsigned int entry, void *addr)
+static inline void __set_tss_desc(unsigned cpu, unsigned int entry, struct x86_hw_tss *addr)
 {
 	struct desc_struct *d = get_cpu_gdt_rw(cpu);
 	tss_desc tss;
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 504a3bb4d5f0..c24456429c7d 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -163,7 +163,7 @@ enum cpuid_regs_idx {
 extern struct cpuinfo_x86	boot_cpu_data;
 extern struct cpuinfo_x86	new_cpu_data;
 
-extern struct tss_struct	doublefault_tss;
+extern struct x86_hw_tss	doublefault_tss;
 extern __u32			cpu_caps_cleared[NCAPINTS];
 extern __u32			cpu_caps_set[NCAPINTS];
 
@@ -323,7 +323,7 @@ struct x86_hw_tss {
 #define IO_BITMAP_BITS			65536
 #define IO_BITMAP_BYTES			(IO_BITMAP_BITS/8)
 #define IO_BITMAP_LONGS			(IO_BITMAP_BYTES/sizeof(long))
-#define IO_BITMAP_OFFSET		offsetof(struct tss_struct, io_bitmap)
+#define IO_BITMAP_OFFSET		(offsetof(struct tss_struct, io_bitmap) - offsetof(struct tss_struct, x86_tss))
 #define INVALID_IO_BITMAP_OFFSET	0x8000
 
 struct tss_struct {
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index c0fb3eb37ee0..62cdc10a7d94 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -1582,7 +1582,7 @@ void cpu_init(void)
 		}
 	}
 
-	t->x86_tss.io_bitmap_base = offsetof(struct tss_struct, io_bitmap);
+	t->x86_tss.io_bitmap_base = IO_BITMAP_OFFSET;
 
 	/*
 	 * <= is required because the CPU will access up to
@@ -1601,7 +1601,7 @@ void cpu_init(void)
 	 * Initialize the TSS.  Don't bother initializing sp0, as the initial
 	 * task never enters user mode.
 	 */
-	set_tss_desc(cpu, t);
+	set_tss_desc(cpu, &t->x86_tss);
 	load_TR_desc();
 
 	load_mm_ldt(&init_mm);
@@ -1659,12 +1659,12 @@ void cpu_init(void)
 	 * Initialize the TSS.  Don't bother initializing sp0, as the initial
 	 * task never enters user mode.
 	 */
-	set_tss_desc(cpu, t);
+	set_tss_desc(cpu, &t->x86_tss);
 	load_TR_desc();
 
 	load_mm_ldt(&init_mm);
 
-	t->x86_tss.io_bitmap_base = offsetof(struct tss_struct, io_bitmap);
+	t->x86_tss.io_bitmap_base = IO_BITMAP_OFFSET;
 
 #ifdef CONFIG_DOUBLEFAULT
 	/* Set up doublefault TSS pointer in the GDT */
diff --git a/arch/x86/kernel/doublefault.c b/arch/x86/kernel/doublefault.c
index 0e662c55ae90..0b8cedb20d6d 100644
--- a/arch/x86/kernel/doublefault.c
+++ b/arch/x86/kernel/doublefault.c
@@ -50,25 +50,23 @@ static void doublefault_fn(void)
 		cpu_relax();
 }
 
-struct tss_struct doublefault_tss __cacheline_aligned = {
-	.x86_tss = {
-		.sp0		= STACK_START,
-		.ss0		= __KERNEL_DS,
-		.ldt		= 0,
-		.io_bitmap_base	= INVALID_IO_BITMAP_OFFSET,
-
-		.ip		= (unsigned long) doublefault_fn,
-		/* 0x2 bit is always set */
-		.flags		= X86_EFLAGS_SF | 0x2,
-		.sp		= STACK_START,
-		.es		= __USER_DS,
-		.cs		= __KERNEL_CS,
-		.ss		= __KERNEL_DS,
-		.ds		= __USER_DS,
-		.fs		= __KERNEL_PERCPU,
-
-		.__cr3		= __pa_nodebug(swapper_pg_dir),
-	}
+struct x86_hw_tss doublefault_tss __cacheline_aligned = {
+	.sp0		= STACK_START,
+	.ss0		= __KERNEL_DS,
+	.ldt		= 0,
+	.io_bitmap_base	= INVALID_IO_BITMAP_OFFSET,
+
+	.ip		= (unsigned long) doublefault_fn,
+	/* 0x2 bit is always set */
+	.flags		= X86_EFLAGS_SF | 0x2,
+	.sp		= STACK_START,
+	.es		= __USER_DS,
+	.cs		= __KERNEL_CS,
+	.ss		= __KERNEL_DS,
+	.ds		= __USER_DS,
+	.fs		= __KERNEL_PERCPU,
+
+	.__cr3		= __pa_nodebug(swapper_pg_dir),
 };
 
 /* dummy for do_double_fault() call */
diff --git a/arch/x86/power/cpu.c b/arch/x86/power/cpu.c
index 84fcfde53f8f..50593e138281 100644
--- a/arch/x86/power/cpu.c
+++ b/arch/x86/power/cpu.c
@@ -165,12 +165,13 @@ static void fix_processor_context(void)
 	struct desc_struct *desc = get_cpu_gdt_rw(cpu);
 	tss_desc tss;
 #endif
-	set_tss_desc(cpu, t);	/*
-				 * This just modifies memory; should not be
-				 * necessary. But... This is necessary, because
-				 * 386 hardware has concept of busy TSS or some
-				 * similar stupidity.
-				 */
+
+	/*
+	 * This just modifies memory; should not be necessary. But... This is
+	 * necessary, because 386 hardware has concept of busy TSS or some
+	 * similar stupidity.
+	 */
+	set_tss_desc(cpu, &t->x86_tss);
 
 #ifdef CONFIG_X86_64
 	memcpy(&tss, &desc[GDT_ENTRY_TSS], sizeof(tss_desc));
-- 
2.14.1

* [PATCH 08/43] x86/dumpstack: Handle stack overflow on all stacks
@ 2017-11-24  9:14 ` Ingo Molnar
From: Ingo Molnar @ 2017-11-24  9:14 UTC (permalink / raw)
  To: linux-kernel
  Cc: Dave Hansen, Andy Lutomirski, Thomas Gleixner, H . Peter Anvin,
	Peter Zijlstra, Borislav Petkov, Linus Torvalds

From: Andy Lutomirski <luto@kernel.org>

We currently special-case stack overflow on the task stack.  We're
going to start putting special stacks in the fixmap with a custom
layout, so they'll have guard pages, too.  Teach the unwinder to be
able to unwind an overflow of any of the stacks.
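
The recovery step below relies on simple page-rounding: if the saved stack
pointer landed in a guard page, aligning it up to the next page boundary points
back at the lowest valid address of the overflowed stack. A small sketch with
made-up addresses:

  #include <stdio.h>

  #define PAGE_SIZE      4096UL
  #define PAGE_MASK      (~(PAGE_SIZE - 1))
  #define PAGE_ALIGN(x)  (((x) + PAGE_SIZE - 1) & PAGE_MASK)

  int main(void)
  {
          /* Hypothetical: stack at [0x5000, 0x6000), guard page below it. */
          unsigned long overflowed_sp = 0x4ff8;  /* landed in the guard page */

          /* Rounding up to the next page boundary yields 0x5000, the
           * bottom of the stack that overflowed. */
          printf("retry at %#lx\n", PAGE_ALIGN(overflowed_sp));
          return 0;
  }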

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Reviewed-by: Borislav Petkov <bp@suse.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Borislav Petkov <bpetkov@suse.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/5454bb325cb30a70457a47b50f22317be65eba7d.1511497875.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/kernel/dumpstack.c | 24 ++++++++++++++----------
 1 file changed, 14 insertions(+), 10 deletions(-)

diff --git a/arch/x86/kernel/dumpstack.c b/arch/x86/kernel/dumpstack.c
index 5e7d10e8ca25..a8aa70c05489 100644
--- a/arch/x86/kernel/dumpstack.c
+++ b/arch/x86/kernel/dumpstack.c
@@ -90,24 +90,28 @@ void show_trace_log_lvl(struct task_struct *task, struct pt_regs *regs,
 	 * - task stack
 	 * - interrupt stack
 	 * - HW exception stacks (double fault, nmi, debug, mce)
+	 * - SYSENTER stack
 	 *
-	 * x86-32 can have up to three stacks:
+	 * x86-32 can have up to four stacks:
 	 * - task stack
 	 * - softirq stack
 	 * - hardirq stack
+	 * - SYSENTER stack
 	 */
 	for (regs = NULL; stack; stack = PTR_ALIGN(stack_info.next_sp, sizeof(long))) {
 		const char *stack_name;
 
-		/*
-		 * If we overflowed the task stack into a guard page, jump back
-		 * to the bottom of the usable stack.
-		 */
-		if (task_stack_page(task) - (void *)stack < PAGE_SIZE)
-			stack = task_stack_page(task);
-
-		if (get_stack_info(stack, task, &stack_info, &visit_mask))
-			break;
+		if (get_stack_info(stack, task, &stack_info, &visit_mask)) {
+			/*
+			 * We weren't on a valid stack.  It's possible that
+			 * we overflowed a valid stack into a guard page.
+			 * See if the next page up is valid so that we can
+			 * generate some kind of backtrace if this happens.
+			 */
+			stack = (unsigned long *)PAGE_ALIGN((unsigned long)stack);
+			if (get_stack_info(stack, task, &stack_info, &visit_mask))
+				break;
+		}
 
 		stack_name = stack_type_name(stack_info.type);
 		if (stack_name)
-- 
2.14.1

* [PATCH 09/43] x86/entry: Move SYSENTER_stack to the beginning of struct tss_struct
@ 2017-11-24  9:14 ` Ingo Molnar
From: Ingo Molnar @ 2017-11-24  9:14 UTC (permalink / raw)
  To: linux-kernel
  Cc: Dave Hansen, Andy Lutomirski, Thomas Gleixner, H . Peter Anvin,
	Peter Zijlstra, Borislav Petkov, Linus Torvalds

From: Andy Lutomirski <luto@kernel.org>

SYSENTER_stack should have reliable overflow detection, which
means that it needs to be at the bottom of a page, not the top.
Move it to the beginning of struct tss_struct and page-align it.

Also add an assertion to make sure that the fixed hardware TSS
doesn't cross a page boundary.
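
The assertion uses an XOR trick: (start ^ end) & PAGE_MASK is zero exactly when
both offsets have the same page number. A standalone sketch with made-up
offsets:

  #include <stdio.h>

  #define PAGE_SIZE  4096UL
  #define PAGE_MASK  (~(PAGE_SIZE - 1))

  /* Nonzero iff 'start' and 'end' fall on different pages, i.e. the
   * region between them crosses a page boundary. */
  static int crosses_page(unsigned long start, unsigned long end)
  {
          return ((start ^ end) & PAGE_MASK) != 0;
  }

  int main(void)
  {
          printf("%d\n", crosses_page(0x1040, 0x10a8));  /* 0: same page */
          printf("%d\n", crosses_page(0x1fc0, 0x2028));  /* 1: crosses   */
          return 0;
  }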

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Borislav Petkov <bpetkov@suse.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/8de9901e7c3a6aa8fac95b37b9c7b96f1900f11a.1511497875.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/include/asm/processor.h | 21 ++++++++++++---------
 arch/x86/kernel/cpu/common.c     | 21 +++++++++++++++++++++
 2 files changed, 33 insertions(+), 9 deletions(-)

diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index c24456429c7d..48d44fae3d27 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -328,7 +328,16 @@ struct x86_hw_tss {
 
 struct tss_struct {
 	/*
-	 * The hardware state:
+	 * Space for the temporary SYSENTER stack, used for SYSENTER
+	 * and the entry trampoline as well.
+	 */
+	unsigned long		SYSENTER_stack_canary;
+	unsigned long		SYSENTER_stack[64];
+
+	/*
+	 * The fixed hardware portion.  This must not cross a page boundary
+	 * at risk of violating the SDM's advice and potentially triggering
+	 * errata.
 	 */
 	struct x86_hw_tss	x86_tss;
 
@@ -339,15 +348,9 @@ struct tss_struct {
 	 * be within the limit.
 	 */
 	unsigned long		io_bitmap[IO_BITMAP_LONGS + 1];
+} __aligned(PAGE_SIZE);
 
-	/*
-	 * Space for the temporary SYSENTER stack.
-	 */
-	unsigned long		SYSENTER_stack_canary;
-	unsigned long		SYSENTER_stack[64];
-} ____cacheline_aligned;
-
-DECLARE_PER_CPU_SHARED_ALIGNED(struct tss_struct, cpu_tss);
+DECLARE_PER_CPU_PAGE_ALIGNED(struct tss_struct, cpu_tss);
 
 /*
  * sizeof(unsigned long) coming from an extra "long" at the end
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 62cdc10a7d94..d173f6013467 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -511,6 +511,27 @@ static inline void setup_cpu_entry_area(int cpu)
 #endif
 
 	__set_fixmap(get_cpu_entry_area_index(cpu, gdt), get_cpu_gdt_paddr(cpu), gdt_prot);
+
+	/*
+	 * The Intel SDM says (Volume 3, 7.2.1):
+	 *
+	 *  Avoid placing a page boundary in the part of the TSS that the
+	 *  processor reads during a task switch (the first 104 bytes). The
+	 *  processor may not correctly perform address translations if a
+	 *  boundary occurs in this area. During a task switch, the processor
+	 *  reads and writes into the first 104 bytes of each TSS (using
+	 *  contiguous physical addresses beginning with the physical address
+	 *  of the first byte of the TSS). So, after TSS access begins, if
+	 *  part of the 104 bytes is not physically contiguous, the processor
+	 *  will access incorrect information without generating a page-fault
+	 *  exception.
+	 *
+	 * There are also a lot of errata involving the TSS spanning a page
+	 * boundary.  Assert that we're not doing that.
+	 */
+	BUILD_BUG_ON((offsetof(struct tss_struct, x86_tss) ^
+		      offsetofend(struct tss_struct, x86_tss)) & PAGE_MASK);
+
 }
 
 /* Load the original GDT from the per-cpu structure */
-- 
2.14.1

* [PATCH 10/43] x86/entry: Remap the TSS into the cpu entry area
@ 2017-11-24  9:14 ` Ingo Molnar
From: Ingo Molnar @ 2017-11-24  9:14 UTC (permalink / raw)
  To: linux-kernel
  Cc: Dave Hansen, Andy Lutomirski, Thomas Gleixner, H . Peter Anvin,
	Peter Zijlstra, Borislav Petkov, Linus Torvalds

From: Andy Lutomirski <luto@kernel.org>

This has a secondary purpose: it puts the entry stack into a region
with a well-controlled layout.  A subsequent patch will take
advantage of this to streamline the SYSCALL entry code to be able to
find it more easily.
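
The remap creates a second (alias) virtual mapping of the same physical per-cpu
pages. As a loose user-space analogy (using memfd, purely illustrative), writes
through one mapping are visible through the other because both name the same
physical page:

  #define _GNU_SOURCE
  #include <stdio.h>
  #include <string.h>
  #include <sys/mman.h>
  #include <unistd.h>

  int main(void)
  {
          int fd = memfd_create("page", 0);

          if (fd < 0 || ftruncate(fd, 4096) < 0)
                  return 1;

          /* Two virtual views of one physical page: a writable "percpu"
           * view and a read-only "fixmap alias" view. */
          char *percpu_view = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                                   MAP_SHARED, fd, 0);
          char *alias_view  = mmap(NULL, 4096, PROT_READ,
                                   MAP_SHARED, fd, 0);

          strcpy(percpu_view, "written via the percpu view");
          printf("alias sees: \"%s\"\n", alias_view);
          return 0;
  }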

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Borislav Petkov <bpetkov@suse.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/cdcba7e1e82122461b3ca36bb3ef6713ba605e35.1511497875.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/entry/entry_32.S     |  6 ++++--
 arch/x86/include/asm/fixmap.h |  7 +++++++
 arch/x86/kernel/asm-offsets.c |  3 +++
 arch/x86/kernel/cpu/common.c  | 38 ++++++++++++++++++++++++++++++++------
 arch/x86/kernel/dumpstack.c   |  3 ++-
 arch/x86/power/cpu.c          | 11 ++++++-----
 6 files changed, 54 insertions(+), 14 deletions(-)

diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
index 4838037f97f6..0ab316c46806 100644
--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -941,7 +941,8 @@ ENTRY(debug)
 	movl	%esp, %eax			# pt_regs pointer
 
 	/* Are we currently on the SYSENTER stack? */
-	PER_CPU(cpu_tss + CPU_TSS_SYSENTER_stack + SIZEOF_SYSENTER_stack, %ecx)
+	movl	PER_CPU_VAR(cpu_entry_area), %ecx
+	addl	$CPU_ENTRY_AREA_tss + CPU_TSS_SYSENTER_stack + SIZEOF_SYSENTER_stack, %ecx
 	subl	%eax, %ecx	/* ecx = (end of SYSENTER_stack) - esp */
 	cmpl	$SIZEOF_SYSENTER_stack, %ecx
 	jb	.Ldebug_from_sysenter_stack
@@ -984,7 +985,8 @@ ENTRY(nmi)
 	movl	%esp, %eax			# pt_regs pointer
 
 	/* Are we currently on the SYSENTER stack? */
-	PER_CPU(cpu_tss + CPU_TSS_SYSENTER_stack + SIZEOF_SYSENTER_stack, %ecx)
+	movl	PER_CPU_VAR(cpu_entry_area), %ecx
+	addl	$CPU_ENTRY_AREA_tss + CPU_TSS_SYSENTER_stack + SIZEOF_SYSENTER_stack, %ecx
 	subl	%eax, %ecx	/* ecx = (end of SYSENTER_stack) - esp */
 	cmpl	$SIZEOF_SYSENTER_stack, %ecx
 	jb	.Lnmi_from_sysenter_stack
diff --git a/arch/x86/include/asm/fixmap.h b/arch/x86/include/asm/fixmap.h
index 0f4c92f02968..3a42da14c2cb 100644
--- a/arch/x86/include/asm/fixmap.h
+++ b/arch/x86/include/asm/fixmap.h
@@ -51,6 +51,13 @@ extern unsigned long __FIXADDR_TOP;
  */
 struct cpu_entry_area {
 	char gdt[PAGE_SIZE];
+
+	/*
+	 * The GDT is just below cpu_tss and thus serves (on x86_64) as
+	 * a read-only guard page for the SYSENTER stack at the bottom
+	 * of the TSS region.
+	 */
+	struct tss_struct tss;
 };
 
 #define CPU_ENTRY_AREA_PAGES (sizeof(struct cpu_entry_area) / PAGE_SIZE)
diff --git a/arch/x86/kernel/asm-offsets.c b/arch/x86/kernel/asm-offsets.c
index b275863128eb..55858b277cf6 100644
--- a/arch/x86/kernel/asm-offsets.c
+++ b/arch/x86/kernel/asm-offsets.c
@@ -98,4 +98,7 @@ void common(void) {
 	OFFSET(CPU_TSS_SYSENTER_stack, tss_struct, SYSENTER_stack);
 	/* Size of SYSENTER_stack */
 	DEFINE(SIZEOF_SYSENTER_stack, sizeof(((struct tss_struct *)0)->SYSENTER_stack));
+
+	/* Layout info for cpu_entry_area */
+	OFFSET(CPU_ENTRY_AREA_tss, cpu_entry_area, tss);
 }
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index d173f6013467..c67742df569a 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -490,6 +490,19 @@ void load_percpu_segment(int cpu)
 	load_stack_canary_segment();
 }
 
+static void set_percpu_fixmap_pages(int fixmap_index, void *ptr, int pages, pgprot_t prot)
+{
+	int i;
+
+	for (i = 0; i < pages; i++)
+		__set_fixmap(fixmap_index - i, per_cpu_ptr_to_phys(ptr + i*PAGE_SIZE), prot);
+}
+
+#ifdef CONFIG_X86_32
+/* The 32-bit entry code needs to find cpu_entry_area. */
+DEFINE_PER_CPU(struct cpu_entry_area *, cpu_entry_area);
+#endif
+
 /* Setup the fixmap mappings only once per-processor */
 static inline void setup_cpu_entry_area(int cpu)
 {
@@ -531,7 +544,15 @@ static inline void setup_cpu_entry_area(int cpu)
 	 */
 	BUILD_BUG_ON((offsetof(struct tss_struct, x86_tss) ^
 		      offsetofend(struct tss_struct, x86_tss)) & PAGE_MASK);
+	BUILD_BUG_ON(sizeof(struct tss_struct) % PAGE_SIZE != 0);
+	set_percpu_fixmap_pages(get_cpu_entry_area_index(cpu, tss),
+				&per_cpu(cpu_tss, cpu),
+				sizeof(struct tss_struct) / PAGE_SIZE,
+				PAGE_KERNEL);
 
+#ifdef CONFIG_X86_32
+	this_cpu_write(cpu_entry_area, get_cpu_entry_area(cpu));
+#endif
 }
 
 /* Load the original GDT from the per-cpu structure */
@@ -1282,7 +1303,8 @@ void enable_sep_cpu(void)
 	wrmsr(MSR_IA32_SYSENTER_CS, tss->x86_tss.ss1, 0);
 
 	wrmsr(MSR_IA32_SYSENTER_ESP,
-	      (unsigned long)tss + offsetofend(struct tss_struct, SYSENTER_stack),
+	      (unsigned long)&get_cpu_entry_area(cpu)->tss +
+	      offsetofend(struct tss_struct, SYSENTER_stack),
 	      0);
 
 	wrmsr(MSR_IA32_SYSENTER_EIP, (unsigned long)entry_SYSENTER_32, 0);
@@ -1395,6 +1417,8 @@ static DEFINE_PER_CPU_PAGE_ALIGNED(char, exception_stacks
 /* May not be marked __init: used by software suspend */
 void syscall_init(void)
 {
+	int cpu = smp_processor_id();
+
 	wrmsr(MSR_STAR, 0, (__USER32_CS << 16) | __KERNEL_CS);
 	wrmsrl(MSR_LSTAR, (unsigned long)entry_SYSCALL_64);
 
@@ -1408,7 +1432,7 @@ void syscall_init(void)
 	 */
 	wrmsrl_safe(MSR_IA32_SYSENTER_CS, (u64)__KERNEL_CS);
 	wrmsrl_safe(MSR_IA32_SYSENTER_ESP,
-		    (unsigned long)this_cpu_ptr(&cpu_tss) +
+		    (unsigned long)&get_cpu_entry_area(cpu)->tss +
 		    offsetofend(struct tss_struct, SYSENTER_stack));
 	wrmsrl_safe(MSR_IA32_SYSENTER_EIP, (u64)entry_SYSENTER_compat);
 #else
@@ -1618,11 +1642,13 @@ void cpu_init(void)
 	initialize_tlbstate_and_flush();
 	enter_lazy_tlb(&init_mm, me);
 
+	setup_cpu_entry_area(cpu);
+
 	/*
 	 * Initialize the TSS.  Don't bother initializing sp0, as the initial
 	 * task never enters user mode.
 	 */
-	set_tss_desc(cpu, &t->x86_tss);
+	set_tss_desc(cpu, &get_cpu_entry_area(cpu)->tss.x86_tss);
 	load_TR_desc();
 
 	load_mm_ldt(&init_mm);
@@ -1635,7 +1661,6 @@ void cpu_init(void)
 	if (is_uv_system())
 		uv_cpu_init();
 
-	setup_cpu_entry_area(cpu);
 	load_fixmap_gdt(cpu);
 }
 
@@ -1676,11 +1701,13 @@ void cpu_init(void)
 	initialize_tlbstate_and_flush();
 	enter_lazy_tlb(&init_mm, curr);
 
+	setup_cpu_entry_area(cpu);
+
 	/*
 	 * Initialize the TSS.  Don't bother initializing sp0, as the initial
 	 * task never enters user mode.
 	 */
-	set_tss_desc(cpu, &t->x86_tss);
+	set_tss_desc(cpu, &get_cpu_entry_area(cpu)->tss.x86_tss);
 	load_TR_desc();
 
 	load_mm_ldt(&init_mm);
@@ -1697,7 +1724,6 @@ void cpu_init(void)
 
 	fpu__init_cpu();
 
-	setup_cpu_entry_area(cpu);
 	load_fixmap_gdt(cpu);
 }
 #endif
diff --git a/arch/x86/kernel/dumpstack.c b/arch/x86/kernel/dumpstack.c
index a8aa70c05489..bb61919c9335 100644
--- a/arch/x86/kernel/dumpstack.c
+++ b/arch/x86/kernel/dumpstack.c
@@ -45,7 +45,8 @@ bool in_task_stack(unsigned long *stack, struct task_struct *task,
 
 bool in_sysenter_stack(unsigned long *stack, struct stack_info *info)
 {
-	struct tss_struct *tss = this_cpu_ptr(&cpu_tss);
+	int cpu = smp_processor_id();
+	struct tss_struct *tss = &get_cpu_entry_area(cpu)->tss;
 
 	/* Treat the canary as part of the stack for unwinding purposes. */
 	void *begin = &tss->SYSENTER_stack_canary;
diff --git a/arch/x86/power/cpu.c b/arch/x86/power/cpu.c
index 50593e138281..04d5157fe7f8 100644
--- a/arch/x86/power/cpu.c
+++ b/arch/x86/power/cpu.c
@@ -160,18 +160,19 @@ static void do_fpu_end(void)
 static void fix_processor_context(void)
 {
 	int cpu = smp_processor_id();
-	struct tss_struct *t = &per_cpu(cpu_tss, cpu);
 #ifdef CONFIG_X86_64
 	struct desc_struct *desc = get_cpu_gdt_rw(cpu);
 	tss_desc tss;
 #endif
 
 	/*
-	 * This just modifies memory; should not be necessary. But... This is
-	 * necessary, because 386 hardware has concept of busy TSS or some
-	 * similar stupidity.
+	 * We need to reload TR, which requires that we change the
+	 * GDT entry to indicate "available" first.
+	 *
+	 * XXX: This could probably all be replaced by a call to
+	 * force_reload_TR().
 	 */
-	set_tss_desc(cpu, &t->x86_tss);
+	set_tss_desc(cpu, &get_cpu_entry_area(cpu)->tss.x86_tss);
 
 #ifdef CONFIG_X86_64
 	memcpy(&tss, &desc[GDT_ENTRY_TSS], sizeof(tss_desc));
-- 
2.14.1

^ permalink raw reply related	[flat|nested] 82+ messages in thread

* [PATCH 11/43] x86/entry/64: Separate cpu_current_top_of_stack from TSS.sp0
  2017-11-24  9:14 [PATCH 00/43] x86 entry-stack and Kaiser series, 2017/11/24 version Ingo Molnar
                   ` (9 preceding siblings ...)
  2017-11-24  9:14 ` [PATCH 10/43] x86/entry: Remap the TSS into the cpu entry area Ingo Molnar
@ 2017-11-24  9:14 ` Ingo Molnar
  2017-11-24 14:19   ` Borislav Petkov
  2017-11-24  9:14 ` [PATCH 12/43] x86/espfix/64: Stop assuming that pt_regs is on the entry stack Ingo Molnar
                   ` (33 subsequent siblings)
  44 siblings, 1 reply; 82+ messages in thread
From: Ingo Molnar @ 2017-11-24  9:14 UTC (permalink / raw)
  To: linux-kernel
  Cc: Dave Hansen, Andy Lutomirski, Thomas Gleixner, H . Peter Anvin,
	Peter Zijlstra, Borislav Petkov, Linus Torvalds

From: Andy Lutomirski <luto@kernel.org>

On 64-bit kernels, we used to assume that TSS.sp0 was the current
top of stack.  With the addition of an entry trampoline, this will
no longer be the case.  Store the current top of stack in TSS.sp1,
which is otherwise unused but shares the same cacheline.
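
For reference, the layout that makes this cheap (offsets from the
64-bit x86_hw_tss shown below; illustrative arithmetic, not new code):

	/*
	 * u32 reserved1 occupies bytes 0-3 of the packed struct, so:
	 *   sp0 sits at offset  4
	 *   sp1 sits at offset 12
	 * Both land in the TSS's first 64-byte cacheline, so reading the
	 * top of stack from sp1 is no colder than reading sp0 was.
	 */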

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Borislav Petkov <bpetkov@suse.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/f56634c746a2926eb7bae61e7b80ed51a1940769.1511497875.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/include/asm/processor.h   | 18 +++++++++++++-----
 arch/x86/include/asm/thread_info.h |  2 +-
 arch/x86/kernel/asm-offsets_64.c   |  1 +
 arch/x86/kernel/process.c          | 10 ++++++++++
 arch/x86/kernel/process_64.c       |  1 +
 5 files changed, 26 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 48d44fae3d27..3a09e5571a92 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -305,7 +305,13 @@ struct x86_hw_tss {
 struct x86_hw_tss {
 	u32			reserved1;
 	u64			sp0;
+
+	/*
+	 * We store cpu_current_top_of_stack in sp1 so it's always accessible.
+	 * Linux does not use ring 1, so sp1 is not otherwise needed.
+	 */
 	u64			sp1;
+
 	u64			sp2;
 	u64			reserved2;
 	u64			ist[7];
@@ -364,6 +370,8 @@ DECLARE_PER_CPU_PAGE_ALIGNED(struct tss_struct, cpu_tss);
 
 #ifdef CONFIG_X86_32
 DECLARE_PER_CPU(unsigned long, cpu_current_top_of_stack);
+#else
+#define cpu_current_top_of_stack cpu_tss.x86_tss.sp1
 #endif
 
 /*
@@ -535,12 +543,12 @@ static inline void native_swapgs(void)
 
 static inline unsigned long current_top_of_stack(void)
 {
-#ifdef CONFIG_X86_64
-	return this_cpu_read_stable(cpu_tss.x86_tss.sp0);
-#else
-	/* sp0 on x86_32 is special in and around vm86 mode. */
+	/*
+	 *  We can't read directly from tss.sp0: sp0 on x86_32 is special in
+	 *  and around vm86 mode and sp0 on x86_64 is special because of the
+	 *  entry trampoline.
+	 */
 	return this_cpu_read_stable(cpu_current_top_of_stack);
-#endif
 }
 
 static inline bool on_thread_stack(void)
diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
index 70f425947dc5..44a04999791e 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -207,7 +207,7 @@ static inline int arch_within_stack_frames(const void * const stack,
 #else /* !__ASSEMBLY__ */
 
 #ifdef CONFIG_X86_64
-# define cpu_current_top_of_stack (cpu_tss + TSS_sp0)
+# define cpu_current_top_of_stack (cpu_tss + TSS_sp1)
 #endif
 
 #endif
diff --git a/arch/x86/kernel/asm-offsets_64.c b/arch/x86/kernel/asm-offsets_64.c
index 630212fa9b9d..ad649a8a74a0 100644
--- a/arch/x86/kernel/asm-offsets_64.c
+++ b/arch/x86/kernel/asm-offsets_64.c
@@ -63,6 +63,7 @@ int main(void)
 
 	OFFSET(TSS_ist, tss_struct, x86_tss.ist);
 	OFFSET(TSS_sp0, tss_struct, x86_tss.sp0);
+	OFFSET(TSS_sp1, tss_struct, x86_tss.sp1);
 	BLANK();
 
 #ifdef CONFIG_CC_STACKPROTECTOR
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 35d674157fda..86e83762e3b3 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -56,6 +56,16 @@ __visible DEFINE_PER_CPU_SHARED_ALIGNED(struct tss_struct, cpu_tss) = {
 		 * Poison it.
 		 */
 		.sp0 = (1UL << (BITS_PER_LONG-1)) + 1,
+
+#ifdef CONFIG_X86_64
+		/*
+		 * .sp1 is cpu_current_top_of_stack.  The init task never
+		 * runs user code, but cpu_current_top_of_stack should still
+		 * be well defined before the first context switch.
+		 */
+		.sp1 = TOP_OF_INIT_STACK,
+#endif
+
 #ifdef CONFIG_X86_32
 		.ss0 = __KERNEL_DS,
 		.ss1 = __KERNEL_CS,
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index eeeb34f85c25..bafe65b08697 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -462,6 +462,7 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
 	 * Switch the PDA and FPU contexts.
 	 */
 	this_cpu_write(current_task, next_p);
+	this_cpu_write(cpu_current_top_of_stack, task_top_of_stack(next_p));
 
 	/* Reload sp0. */
 	update_sp0(next_p);
-- 
2.14.1

^ permalink raw reply related	[flat|nested] 82+ messages in thread

* [PATCH 12/43] x86/espfix/64: Stop assuming that pt_regs is on the entry stack
  2017-11-24  9:14 [PATCH 00/43] x86 entry-stack and Kaiser series, 2017/11/24 version Ingo Molnar
                   ` (10 preceding siblings ...)
  2017-11-24  9:14 ` [PATCH 11/43] x86/entry/64: Separate cpu_current_top_of_stack from TSS.sp0 Ingo Molnar
@ 2017-11-24  9:14 ` Ingo Molnar
  2017-11-24  9:14 ` [PATCH 13/43] x86/entry/64: Use a percpu trampoline stack for IDT entries Ingo Molnar
                   ` (32 subsequent siblings)
  44 siblings, 0 replies; 82+ messages in thread
From: Ingo Molnar @ 2017-11-24  9:14 UTC (permalink / raw)
  To: linux-kernel
  Cc: Dave Hansen, Andy Lutomirski, Thomas Gleixner, H . Peter Anvin,
	Peter Zijlstra, Borislav Petkov, Linus Torvalds

From: Andy Lutomirski <luto@kernel.org>

When we start using an entry trampoline, a #GP from userspace will
be delivered on the entry stack, not on the task stack.  Fix the
espfix64 #DF fixup to set up #GP according to TSS.SP0, rather than
assuming that pt_regs + 1 == SP0.  This won't change anything
without an entry stack, but it will make the code continue to work
when an entry stack is added.
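
For intuition, the fix boils down to this pointer arithmetic (a
minimal sketch of the hunk below; the faked #GP pt_regs are placed
just below TSS.sp0):

	struct pt_regs *normal_regs =
		(struct pt_regs *)this_cpu_read(cpu_tss.x86_tss.sp0) - 1;
	/* i.e. the regs occupy [sp0 - sizeof(*normal_regs), sp0) */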

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Borislav Petkov <bpetkov@suse.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: http://lkml.kernel.org/r/b1ef4136616c6bd2a75d1fd2736d1d54437d65a8.1511497875.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/kernel/traps.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 2008dd0f8ccb..1bd43f044c62 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -359,7 +359,8 @@ dotraplinkage void do_double_fault(struct pt_regs *regs, long error_code)
 		regs->cs == __KERNEL_CS &&
 		regs->ip == (unsigned long)native_irq_return_iret)
 	{
-		struct pt_regs *normal_regs = task_pt_regs(current);
+		struct pt_regs *normal_regs =
+			(struct pt_regs *)this_cpu_read(cpu_tss.x86_tss.sp0) - 1;
 
 		/* Fake a #GP(0) from userspace. */
 		memmove(&normal_regs->ip, (void *)regs->sp, 5*8);
@@ -390,7 +391,7 @@ dotraplinkage void do_double_fault(struct pt_regs *regs, long error_code)
 	 *
 	 *   Processors update CR2 whenever a page fault is detected. If a
 	 *   second page fault occurs while an earlier page fault is being
-	 *   deliv- ered, the faulting linear address of the second fault will
+	 *   delivered, the faulting linear address of the second fault will
 	 *   overwrite the contents of CR2 (replacing the previous
 	 *   address). These updates to CR2 occur even if the page fault
 	 *   results in a double fault or occurs during the delivery of a
-- 
2.14.1

^ permalink raw reply related	[flat|nested] 82+ messages in thread

* [PATCH 13/43] x86/entry/64: Use a percpu trampoline stack for IDT entries
  2017-11-24  9:14 [PATCH 00/43] x86 entry-stack and Kaiser series, 2017/11/24 version Ingo Molnar
                   ` (11 preceding siblings ...)
  2017-11-24  9:14 ` [PATCH 12/43] x86/espfix/64: Stop assuming that pt_regs is on the entry stack Ingo Molnar
@ 2017-11-24  9:14 ` Ingo Molnar
  2017-11-24 11:27   ` Thomas Gleixner
  2017-11-24  9:14 ` [PATCH 14/43] x86/entry/64: Return to userspace from the trampoline stack Ingo Molnar
                   ` (31 subsequent siblings)
  44 siblings, 1 reply; 82+ messages in thread
From: Ingo Molnar @ 2017-11-24  9:14 UTC (permalink / raw)
  To: linux-kernel
  Cc: Dave Hansen, Andy Lutomirski, Thomas Gleixner, H . Peter Anvin,
	Peter Zijlstra, Borislav Petkov, Linus Torvalds

From: Andy Lutomirski <luto@kernel.org>

Historically, IDT entries from usermode have always gone directly
to the running task's kernel stack.  Rearrange it so that we enter on
a percpu trampoline stack and then manually switch to the task's stack.
This touches a couple of extra cachelines, but it gives us a chance
to run some code before we touch the kernel stack.

The asm isn't exactly beautiful, but I think that fully refactoring
it can wait.
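
As a mental model for the new switch_to_thread_stack helper below,
this is the entry-stack layout once the helper has pushed %rdi; every
word except the saved RDI is replayed onto the thread stack
(illustrative comment, offsets taken from the asm):

	/*
	 *  0(%rdi)  saved RDI      (pushed by the helper itself)
	 *  8(%rdi)  return address (from "call switch_to_thread_stack")
	 * 16(%rdi)  orig_ax
	 * 24(%rdi)  RIP
	 * 32(%rdi)  CS
	 * 40(%rdi)  EFLAGS
	 * 48(%rdi)  RSP
	 * 56(%rdi)  SS
	 */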

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Borislav Petkov <bpetkov@suse.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/fa3958723a1a85baeaf309c735b775841205800e.1511497875.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/entry/entry_64.S        | 67 ++++++++++++++++++++++++++++++----------
 arch/x86/entry/entry_64_compat.S |  5 ++-
 arch/x86/include/asm/switch_to.h |  2 +-
 arch/x86/include/asm/traps.h     |  1 -
 arch/x86/kernel/cpu/common.c     |  6 ++--
 arch/x86/kernel/traps.c          | 18 +++++------
 6 files changed, 68 insertions(+), 31 deletions(-)

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index f81d50d7ceac..7d47199f405f 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -563,6 +563,13 @@ END(irq_entries_start)
 /* 0(%rsp): ~(interrupt number) */
 	.macro interrupt func
 	cld
+
+	testb	$3, CS-ORIG_RAX(%rsp)
+	jz	1f
+	SWAPGS
+	call	switch_to_thread_stack
+1:
+
 	ALLOC_PT_GPREGS_ON_STACK
 	SAVE_C_REGS
 	SAVE_EXTRA_REGS
@@ -572,12 +579,8 @@ END(irq_entries_start)
 	jz	1f
 
 	/*
-	 * IRQ from user mode.  Switch to kernel gsbase and inform context
-	 * tracking that we're in kernel mode.
-	 */
-	SWAPGS
-
-	/*
+	 * IRQ from user mode.
+	 *
 	 * We need to tell lockdep that IRQs are off.  We can't do this until
 	 * we fix gsbase, and we should do it before enter_from_user_mode
 	 * (which can take locks).  Since TRACE_IRQS_OFF idempotent,
@@ -831,6 +834,32 @@ apicinterrupt IRQ_WORK_VECTOR			irq_work_interrupt		smp_irq_work_interrupt
  */
 #define CPU_TSS_IST(x) PER_CPU_VAR(cpu_tss) + (TSS_ist + ((x) - 1) * 8)
 
+/*
+ * Switch to the thread stack.  This is called with the IRET frame and
+ * orig_ax on the stack.  (That is, RDI..R12 are not on the stack and
+ * space has not been allocated for them.)
+ */
+ENTRY(switch_to_thread_stack)
+	UNWIND_HINT_FUNC
+
+	pushq	%rdi
+	movq	%rsp, %rdi
+	movq	PER_CPU_VAR(cpu_current_top_of_stack), %rsp
+	UNWIND_HINT sp_offset=16 sp_reg=ORC_REG_DI
+
+	pushq	7*8(%rdi)		/* regs->ss */
+	pushq	6*8(%rdi)		/* regs->rsp */
+	pushq	5*8(%rdi)		/* regs->eflags */
+	pushq	4*8(%rdi)		/* regs->cs */
+	pushq	3*8(%rdi)		/* regs->ip */
+	pushq	2*8(%rdi)		/* regs->orig_ax */
+	pushq	8(%rdi)			/* return address */
+	UNWIND_HINT_FUNC
+
+	movq	(%rdi), %rdi
+	ret
+END(switch_to_thread_stack)
+
 .macro idtentry sym do_sym has_error_code:req paranoid=0 shift_ist=-1
 ENTRY(\sym)
 	UNWIND_HINT_IRET_REGS offset=\has_error_code*8
@@ -848,11 +877,12 @@ ENTRY(\sym)
 
 	ALLOC_PT_GPREGS_ON_STACK
 
-	.if \paranoid
-	.if \paranoid == 1
+	.if \paranoid < 2
 	testb	$3, CS(%rsp)			/* If coming from userspace, switch stacks */
-	jnz	1f
+	jnz	.Lfrom_usermode_switch_stack_\@
 	.endif
+
+	.if \paranoid
 	call	paranoid_entry
 	.else
 	call	error_entry
@@ -894,20 +924,15 @@ ENTRY(\sym)
 	jmp	error_exit
 	.endif
 
-	.if \paranoid == 1
+	.if \paranoid < 2
 	/*
-	 * Paranoid entry from userspace.  Switch stacks and treat it
+	 * Entry from userspace.  Switch stacks and treat it
 	 * as a normal entry.  This means that paranoid handlers
 	 * run in real process context if user_mode(regs).
 	 */
-1:
+.Lfrom_usermode_switch_stack_\@:
 	call	error_entry
 
-
-	movq	%rsp, %rdi			/* pt_regs pointer */
-	call	sync_regs
-	movq	%rax, %rsp			/* switch stack */
-
 	movq	%rsp, %rdi			/* pt_regs pointer */
 
 	.if \has_error_code
@@ -1170,6 +1195,14 @@ ENTRY(error_entry)
 	SWAPGS
 
 .Lerror_entry_from_usermode_after_swapgs:
+	/* Put us onto the real thread stack. */
+	popq	%r12				/* save return addr in %12 */
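+						/* (callee-saved, so it survives the C call below) */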
+	movq	%rsp, %rdi			/* arg0 = pt_regs pointer */
+	call	sync_regs
+	movq	%rax, %rsp			/* switch stack */
+	ENCODE_FRAME_POINTER
+	pushq	%r12
+
 	/*
 	 * We need to tell lockdep that IRQs are off.  We can't do this until
 	 * we fix gsbase, and we should do it before enter_from_user_mode
diff --git a/arch/x86/entry/entry_64_compat.S b/arch/x86/entry/entry_64_compat.S
index dcc6987f9bae..95ad40eb7eff 100644
--- a/arch/x86/entry/entry_64_compat.S
+++ b/arch/x86/entry/entry_64_compat.S
@@ -306,8 +306,11 @@ ENTRY(entry_INT80_compat)
 	 */
 	movl	%eax, %eax
 
-	/* Construct struct pt_regs on stack (iret frame is already on stack) */
 	pushq	%rax			/* pt_regs->orig_ax */
+
+	/* switch to thread stack expects orig_ax to be pushed */
+	call	switch_to_thread_stack
+
 	pushq	%rdi			/* pt_regs->di */
 	pushq	%rsi			/* pt_regs->si */
 	pushq	%rdx			/* pt_regs->dx */
diff --git a/arch/x86/include/asm/switch_to.h b/arch/x86/include/asm/switch_to.h
index 8c6bd6863db9..a6796ac8d311 100644
--- a/arch/x86/include/asm/switch_to.h
+++ b/arch/x86/include/asm/switch_to.h
@@ -93,7 +93,7 @@ static inline void update_sp0(struct task_struct *task)
 #ifdef CONFIG_X86_32
 	load_sp0(task->thread.sp0);
 #else
-	load_sp0(task_top_of_stack(task));
+	/* On x86_64, sp0 always points to the entry trampoline stack. */
 #endif
 }
 
diff --git a/arch/x86/include/asm/traps.h b/arch/x86/include/asm/traps.h
index 1fadd310ff68..31051f35cbb7 100644
--- a/arch/x86/include/asm/traps.h
+++ b/arch/x86/include/asm/traps.h
@@ -75,7 +75,6 @@ dotraplinkage void do_segment_not_present(struct pt_regs *, long);
 dotraplinkage void do_stack_segment(struct pt_regs *, long);
 #ifdef CONFIG_X86_64
 dotraplinkage void do_double_fault(struct pt_regs *, long);
-asmlinkage struct pt_regs *sync_regs(struct pt_regs *);
 #endif
 dotraplinkage void do_general_protection(struct pt_regs *, long);
 dotraplinkage void do_page_fault(struct pt_regs *, unsigned long);
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index c67742df569a..7c82a8a8bfda 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -1645,11 +1645,13 @@ void cpu_init(void)
 	setup_cpu_entry_area(cpu);
 
 	/*
-	 * Initialize the TSS.  Don't bother initializing sp0, as the initial
-	 * task never enters user mode.
+	 * Initialize the TSS.  sp0 points to the entry trampoline stack
+	 * regardless of what task is running.
 	 */
 	set_tss_desc(cpu, &get_cpu_entry_area(cpu)->tss.x86_tss);
 	load_TR_desc();
+	load_sp0((unsigned long)&get_cpu_entry_area(cpu)->tss +
+		 offsetofend(struct tss_struct, SYSENTER_stack));
 
 	load_mm_ldt(&init_mm);
 
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 1bd43f044c62..cbc4272bb9dd 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -359,8 +359,7 @@ dotraplinkage void do_double_fault(struct pt_regs *regs, long error_code)
 		regs->cs == __KERNEL_CS &&
 		regs->ip == (unsigned long)native_irq_return_iret)
 	{
-		struct pt_regs *normal_regs =
-			(struct pt_regs *)this_cpu_read(cpu_tss.x86_tss.sp0) - 1;
+		struct pt_regs *normal_regs = (struct pt_regs *)this_cpu_read(cpu_tss.x86_tss.sp0) - 1;
 
 		/* Fake a #GP(0) from userspace. */
 		memmove(&normal_regs->ip, (void *)regs->sp, 5*8);
@@ -611,9 +610,10 @@ NOKPROBE_SYMBOL(do_int3);
  * interrupted code was in user mode. The actual stack switch is done in
  * entry_64.S
  */
-asmlinkage __visible notrace struct pt_regs *sync_regs(struct pt_regs *eregs)
+asmlinkage __visible notrace
+struct pt_regs *sync_regs(struct pt_regs *eregs)
 {
-	struct pt_regs *regs = task_pt_regs(current);
+	struct pt_regs *regs = (struct pt_regs *)this_cpu_read(cpu_current_top_of_stack) - 1;
 	*regs = *eregs;
 	return regs;
 }
@@ -630,13 +630,13 @@ struct bad_iret_stack *fixup_bad_iret(struct bad_iret_stack *s)
 	/*
 	 * This is called from entry_64.S early in handling a fault
 	 * caused by a bad iret to user mode.  To handle the fault
-	 * correctly, we want move our stack frame to task_pt_regs
-	 * and we want to pretend that the exception came from the
-	 * iret target.
+	 * correctly, we want to move our stack frame to where it would
+	 * be had we entered directly on the entry stack (rather than
+	 * just below the IRET frame) and we want to pretend that the
+	 * exception came from the iret target.
 	 */
 	struct bad_iret_stack *new_stack =
-		container_of(task_pt_regs(current),
-			     struct bad_iret_stack, regs);
+		(struct bad_iret_stack *)this_cpu_read(cpu_tss.x86_tss.sp0) - 1;
 
 	/* Copy the IRET target to the new stack. */
 	memmove(&new_stack->regs.ip, (void *)s->regs.sp, 5*8);
-- 
2.14.1

^ permalink raw reply related	[flat|nested] 82+ messages in thread

* [PATCH 14/43] x86/entry/64: Return to userspace from the trampoline stack
  2017-11-24  9:14 [PATCH 00/43] x86 entry-stack and Kaiser series, 2017/11/24 version Ingo Molnar
                   ` (12 preceding siblings ...)
  2017-11-24  9:14 ` [PATCH 13/43] x86/entry/64: Use a percpu trampoline stack for IDT entries Ingo Molnar
@ 2017-11-24  9:14 ` Ingo Molnar
  2017-11-24 13:46   ` Thomas Gleixner
  2017-11-24  9:14 ` [PATCH 15/43] x86/entry/64: Create a percpu SYSCALL entry trampoline Ingo Molnar
                   ` (30 subsequent siblings)
  44 siblings, 1 reply; 82+ messages in thread
From: Ingo Molnar @ 2017-11-24  9:14 UTC (permalink / raw)
  To: linux-kernel
  Cc: Dave Hansen, Andy Lutomirski, Thomas Gleixner, H . Peter Anvin,
	Peter Zijlstra, Borislav Petkov, Linus Torvalds

From: Andy Lutomirski <luto@kernel.org>

By itself, this is useless.  It gives us the ability to run some final
code before exit that cannot run on the kernel stack.  This could
include a CR3 switch a la KAISER or some kernel stack erasing, for
example.  (Or even weird things like *changing* which kernel stack
gets used as an ASLR-strengthening mechanism.)

The SYSRET32 path is not covered yet.  It could be in the future or
we could just ignore it and force the slow path if needed.
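
Schematically, the reworked exit paths below do the following (a
sketch, not the literal instruction sequence):

	/*
	 * 1. Pop every register except RDI and RSP.
	 * 2. RDI := old RSP              (pointer into the pt_regs)
	 * 3. RSP := TSS.sp0              (per-cpu trampoline stack)
	 * 4. Copy what the return needs through the RDI pointer: the
	 *    full IRET frame (IRET path) or just the user RSP (SYSRET
	 *    path), plus the user RDI.
	 * 5. (future final exit work, e.g. a CR3 switch, goes here)
	 * 6. Pop RDI, then SWAPGS+IRET or USERGS_SYSRET64.
	 */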

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Borislav Petkov <bpetkov@suse.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/d350017000eed20922c3b2711a2d9229dc809256.1511497875.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/entry/entry_64.S | 55 +++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 51 insertions(+), 4 deletions(-)

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 7d47199f405f..426b8c669d6a 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -330,8 +330,24 @@ GLOBAL(entry_SYSCALL_64_after_hwframe)
 	popq	%rsi	/* skip rcx */
 	popq	%rdx
 	popq	%rsi
+
+	/*
+	 * Now all regs are restored except RSP and RDI.
+	 * Save old stack pointer and switch to trampoline stack.
+	 */
+	movq	%rsp, %rdi
+	movq	PER_CPU_VAR(cpu_tss + TSS_sp0), %rsp
+
+	pushq	RSP-RDI(%rdi)	/* RSP */
+	pushq	(%rdi)		/* RDI */
+
+	/*
+	 * We are on the trampoline stack.  All regs except RDI are live.
+	 * We can do future final exit work right here.
+	 */
+
 	popq	%rdi
-	movq	RSP-ORIG_RAX(%rsp), %rsp
+	popq	%rsp
 	USERGS_SYSRET64
 END(entry_SYSCALL_64)
 
@@ -633,10 +649,41 @@ GLOBAL(swapgs_restore_regs_and_return_to_usermode)
 	ud2
 1:
 #endif
-	SWAPGS
 	POP_EXTRA_REGS
-	POP_C_REGS
-	addq	$8, %rsp	/* skip regs->orig_ax */
+	popq	%r11
+	popq	%r10
+	popq	%r9
+	popq	%r8
+	popq	%rax
+	popq	%rcx
+	popq	%rdx
+	popq	%rsi
+
+	/*
+	 * The stack is now user RDI, orig_ax, RIP, CS, EFLAGS, RSP, SS.
+	 * Save old stack pointer and switch to trampoline stack.
+	 */
+	movq	%rsp, %rdi
+	movq	PER_CPU_VAR(cpu_tss + TSS_sp0), %rsp
+
+	/* Copy the IRET frame to the trampoline stack. */
+	pushq	6*8(%rdi)	/* SS */
+	pushq	5*8(%rdi)	/* RSP */
+	pushq	4*8(%rdi)	/* EFLAGS */
+	pushq	3*8(%rdi)	/* CS */
+	pushq	2*8(%rdi)	/* RIP */
+
+	/* Push user RDI on the trampoline stack. */
+	pushq	(%rdi)
+
+	/*
+	 * We are on the trampoline stack.  All regs except RDI are live.
+	 * We can do future final exit work right here.
+	 */
+
+	/* Restore RDI. */
+	popq	%rdi
+	SWAPGS
 	INTERRUPT_RETURN
 
 
-- 
2.14.1

^ permalink raw reply related	[flat|nested] 82+ messages in thread

* [PATCH 15/43] x86/entry/64: Create a percpu SYSCALL entry trampoline
  2017-11-24  9:14 [PATCH 00/43] x86 entry-stack and Kaiser series, 2017/11/24 version Ingo Molnar
                   ` (13 preceding siblings ...)
  2017-11-24  9:14 ` [PATCH 14/43] x86/entry/64: Return to userspace from the trampoline stack Ingo Molnar
@ 2017-11-24  9:14 ` Ingo Molnar
  2017-11-24 13:52   ` Thomas Gleixner
  2017-11-24  9:14 ` [PATCH 16/43] x86/irq: Remove an old outdated comment about context tracking races Ingo Molnar
                   ` (29 subsequent siblings)
  44 siblings, 1 reply; 82+ messages in thread
From: Ingo Molnar @ 2017-11-24  9:14 UTC (permalink / raw)
  To: linux-kernel
  Cc: Dave Hansen, Andy Lutomirski, Thomas Gleixner, H . Peter Anvin,
	Peter Zijlstra, Borislav Petkov, Linus Torvalds

From: Andy Lutomirski <luto@kernel.org>

Handling SYSCALL is tricky: the SYSCALL handler is entered with every
single register (except FLAGS), including RSP, live.  It somehow needs
to set RSP to point to a valid stack, which means it needs to save the
user RSP somewhere and find its own stack pointer.  The canonical way
to do this is with SWAPGS, which lets us access percpu data using the
%gs prefix.

With KAISER-like pagetable switching, this is problematic.  Without a
scratch register, switching CR3 is impossible, so %gs-based percpu
memory would need to be mapped in the user pagetables.  Doing that
without information leaks is difficult or impossible.

Instead, use a different sneaky trick.  Map a copy of the first part
of the SYSCALL asm at a different address for each CPU.  Now RIP
varies depending on the CPU, so we can use RIP-relative memory access
to access percpu memory.  By putting the relevant information (one
scratch slot and the stack address) at a constant offset relative to
RIP, we can make SYSCALL work without relying on %gs.

A nice thing about this approach is that we can easily switch it on
and off if we want pagetable switching to be configurable.

The compat variant of SYSCALL doesn't have this problem in the first
place -- there are plenty of scratch registers, since we don't care
about preserving r8-r15.  This patch therefore doesn't touch SYSCALL32
at all.

XXX: Whenever we settle how KAISER gets turned on and off, we should do
the same to this.
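
To see why the %rip-relative CPU_ENTRY_AREA macro defined below lands
on the right per-CPU area, it helps to work the addressing through (a
sketch using the names from the patch; area_base is the runtime base
of this CPU's cpu_entry_area):

	/*
	 * The assembler encodes a displacement against the link-time RIP:
	 *
	 *   disp = (_entry_trampoline - CPU_ENTRY_AREA_entry_trampoline)
	 *          - rip_link
	 *
	 * but the trampoline copy actually executes at
	 *
	 *   rip_run = rip_link - _entry_trampoline
	 *             + area_base + CPU_ENTRY_AREA_entry_trampoline
	 *
	 * so the effective address rip_run + disp collapses to area_base,
	 * the start of the running CPU's cpu_entry_area.
	 */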

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Borislav Petkov <bpetkov@suse.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/b95ccae0a5a2f090c901e49fce7c9e8ff6acd40d.1511497875.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/entry/entry_64.S     | 48 +++++++++++++++++++++++++++++++++++++++++++
 arch/x86/include/asm/fixmap.h |  2 ++
 arch/x86/kernel/asm-offsets.c |  1 +
 arch/x86/kernel/cpu/common.c  | 12 ++++++++++-
 arch/x86/kernel/vmlinux.lds.S | 10 +++++++++
 5 files changed, 72 insertions(+), 1 deletion(-)

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 426b8c669d6a..0cde243b7542 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -140,6 +140,54 @@ END(native_usergs_sysret64)
  * with them due to bugs in both AMD and Intel CPUs.
  */
 
+	.pushsection .entry_trampoline, "ax"
+
+/*
+ * The code in here gets remapped into cpu_entry_area's trampoline.  This means
+ * that the assembler and linker have the wrong idea as to where this code
+ * lives (and, in fact, it's mapped more than once, so it's not even at a
+ * fixed address).  So we can't reference any symbols outside the entry
+ * trampoline and expect it to work.
+ *
+ * Instead, we carefully abuse %rip-relative addressing.
+ * _entry_trampoline(%rip) refers to the start of the remapped entry
+ * trampoline.  We can thus find cpu_entry_area with this macro:
+ */
+
+#define CPU_ENTRY_AREA \
+	_entry_trampoline - CPU_ENTRY_AREA_entry_trampoline(%rip)
+
+/* The top word of the SYSENTER stack is hot and is usable as scratch space. */
+#define RSP_SCRATCH CPU_ENTRY_AREA_tss + CPU_TSS_SYSENTER_stack + \
+	SIZEOF_SYSENTER_stack - 8 + CPU_ENTRY_AREA
+
+ENTRY(entry_SYSCALL_64_trampoline)
+	UNWIND_HINT_EMPTY
+	swapgs
+
+	/* Stash the user RSP. */
+	movq	%rsp, RSP_SCRATCH
+
+	/* Load the top of the task stack into RSP */
+	movq	CPU_ENTRY_AREA_tss + TSS_sp1 + CPU_ENTRY_AREA, %rsp
+
+	/* Start building the simulated IRET frame. */
+	pushq	$__USER_DS			/* pt_regs->ss */
+	pushq	RSP_SCRATCH			/* pt_regs->sp */
+	pushq	%r11				/* pt_regs->flags */
+	pushq	$__USER_CS			/* pt_regs->cs */
+	pushq	%rcx				/* pt_regs->ip */
+
+	/*
+	 * x86 lacks a near absolute jump, and we can't jump to the real
+	 * entry text with a relative jump, so we fake it using retq.
+	 */
+	pushq	$entry_SYSCALL_64_after_hwframe
+	retq
+END(entry_SYSCALL_64_trampoline)
+
+	.popsection
+
 ENTRY(entry_SYSCALL_64)
 	UNWIND_HINT_EMPTY
 	/*
diff --git a/arch/x86/include/asm/fixmap.h b/arch/x86/include/asm/fixmap.h
index 3a42da14c2cb..7eb1b5490395 100644
--- a/arch/x86/include/asm/fixmap.h
+++ b/arch/x86/include/asm/fixmap.h
@@ -58,6 +58,8 @@ struct cpu_entry_area {
 	 * of the TSS region.
 	 */
 	struct tss_struct tss;
+
+	char entry_trampoline[PAGE_SIZE];
 };
 
 #define CPU_ENTRY_AREA_PAGES (sizeof(struct cpu_entry_area) / PAGE_SIZE)
diff --git a/arch/x86/kernel/asm-offsets.c b/arch/x86/kernel/asm-offsets.c
index 55858b277cf6..61b1af88ac07 100644
--- a/arch/x86/kernel/asm-offsets.c
+++ b/arch/x86/kernel/asm-offsets.c
@@ -101,4 +101,5 @@ void common(void) {
 
 	/* Layout info for cpu_entry_area */
 	OFFSET(CPU_ENTRY_AREA_tss, cpu_entry_area, tss);
+	OFFSET(CPU_ENTRY_AREA_entry_trampoline, cpu_entry_area, entry_trampoline);
 }
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 7c82a8a8bfda..5a05db084659 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -507,6 +507,8 @@ DEFINE_PER_CPU(struct cpu_entry_area *, cpu_entry_area);
 static inline void setup_cpu_entry_area(int cpu)
 {
 #ifdef CONFIG_X86_64
+	extern char _entry_trampoline[];
+
 	/* On 64-bit systems, we use a read-only fixmap GDT. */
 	pgprot_t gdt_prot = PAGE_KERNEL_RO;
 #else
@@ -553,6 +555,11 @@ static inline void setup_cpu_entry_area(int cpu)
 #ifdef CONFIG_X86_32
 	this_cpu_write(cpu_entry_area, get_cpu_entry_area(cpu));
 #endif
+
+#ifdef CONFIG_X86_64
+	__set_fixmap(get_cpu_entry_area_index(cpu, entry_trampoline),
+		     __pa_symbol(_entry_trampoline), PAGE_KERNEL_RX);
+#endif
 }
 
 /* Load the original GDT from the per-cpu structure */
@@ -1417,10 +1424,13 @@ static DEFINE_PER_CPU_PAGE_ALIGNED(char, exception_stacks
 /* May not be marked __init: used by software suspend */
 void syscall_init(void)
 {
+	extern char _entry_trampoline[];
+	extern char entry_SYSCALL_64_trampoline[];
+
 	int cpu = smp_processor_id();
 
 	wrmsr(MSR_STAR, 0, (__USER32_CS << 16) | __KERNEL_CS);
-	wrmsrl(MSR_LSTAR, (unsigned long)entry_SYSCALL_64);
+	wrmsrl(MSR_LSTAR, (unsigned long)get_cpu_entry_area(cpu)->entry_trampoline + (entry_SYSCALL_64_trampoline - _entry_trampoline));
 
 #ifdef CONFIG_IA32_EMULATION
 	wrmsrl(MSR_CSTAR, (unsigned long)entry_SYSCALL_compat);
diff --git a/arch/x86/kernel/vmlinux.lds.S b/arch/x86/kernel/vmlinux.lds.S
index a4009fb9be87..2738cfb6c8c8 100644
--- a/arch/x86/kernel/vmlinux.lds.S
+++ b/arch/x86/kernel/vmlinux.lds.S
@@ -107,6 +107,16 @@ SECTIONS
 		SOFTIRQENTRY_TEXT
 		*(.fixup)
 		*(.gnu.warning)
+
+#ifdef CONFIG_X86_64
+		/* Entry trampoline */
+		. = ALIGN(PAGE_SIZE);
+		_entry_trampoline = .;
+		*(.entry_trampoline)
+		. = ALIGN(PAGE_SIZE);
+		ASSERT(. - _entry_trampoline == PAGE_SIZE, "entry trampoline is too big");
+#endif
+
 		/* End of text section */
 		_etext = .;
 	} :text = 0x9090
-- 
2.14.1

^ permalink raw reply related	[flat|nested] 82+ messages in thread

* [PATCH 16/43] x86/irq: Remove an old outdated comment about context tracking races
  2017-11-24  9:14 [PATCH 00/43] x86 entry-stack and Kaiser series, 2017/11/24 version Ingo Molnar
                   ` (14 preceding siblings ...)
  2017-11-24  9:14 ` [PATCH 15/43] x86/entry/64: Create a percpu SYSCALL entry trampoline Ingo Molnar
@ 2017-11-24  9:14 ` Ingo Molnar
  2017-11-24 13:53   ` Thomas Gleixner
  2017-11-24  9:14 ` [PATCH 17/43] x86/irq/64: Print the offending IP in the stack overflow warning Ingo Molnar
                   ` (28 subsequent siblings)
  44 siblings, 1 reply; 82+ messages in thread
From: Ingo Molnar @ 2017-11-24  9:14 UTC (permalink / raw)
  To: linux-kernel
  Cc: Dave Hansen, Andy Lutomirski, Thomas Gleixner, H . Peter Anvin,
	Peter Zijlstra, Borislav Petkov, Linus Torvalds

From: Andy Lutomirski <luto@kernel.org>

That race has been fixed and the code cleaned up for a while now.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Borislav Petkov <bpetkov@suse.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/12e75976dbbb7ece2b0a64238f1d3892dfed1e16.1511497875.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/kernel/irq.c | 12 ------------
 1 file changed, 12 deletions(-)

diff --git a/arch/x86/kernel/irq.c b/arch/x86/kernel/irq.c
index 49cfd9fe7589..68e1867cca80 100644
--- a/arch/x86/kernel/irq.c
+++ b/arch/x86/kernel/irq.c
@@ -219,18 +219,6 @@ __visible unsigned int __irq_entry do_IRQ(struct pt_regs *regs)
 	/* high bit used in ret_from_ code  */
 	unsigned vector = ~regs->orig_ax;
 
-	/*
-	 * NB: Unlike exception entries, IRQ entries do not reliably
-	 * handle context tracking in the low-level entry code.  This is
-	 * because syscall entries execute briefly with IRQs on before
-	 * updating context tracking state, so we can take an IRQ from
-	 * kernel mode with CONTEXT_USER.  The low-level entry code only
-	 * updates the context if we came from user mode, so we won't
-	 * switch to CONTEXT_KERNEL.  We'll fix that once the syscall
-	 * code is cleaned up enough that we can cleanly defer enabling
-	 * IRQs.
-	 */
-
 	entering_irq();
 
 	/* entering_irq() tells RCU that we're not quiescent.  Check it. */
-- 
2.14.1

^ permalink raw reply related	[flat|nested] 82+ messages in thread

* [PATCH 17/43] x86/irq/64: Print the offending IP in the stack overflow warning
  2017-11-24  9:14 [PATCH 00/43] x86 entry-stack and Kaiser series, 2017/11/24 version Ingo Molnar
                   ` (15 preceding siblings ...)
  2017-11-24  9:14 ` [PATCH 16/43] x86/irq: Remove an old outdated comment about context tracking races Ingo Molnar
@ 2017-11-24  9:14 ` Ingo Molnar
  2017-11-24 14:22   ` Thomas Gleixner
  2017-11-24  9:14 ` [PATCH 18/43] x86/entry/64: Move the IST stacks into cpu_entry_area Ingo Molnar
                   ` (27 subsequent siblings)
  44 siblings, 1 reply; 82+ messages in thread
From: Ingo Molnar @ 2017-11-24  9:14 UTC (permalink / raw)
  To: linux-kernel
  Cc: Dave Hansen, Andy Lutomirski, Thomas Gleixner, H . Peter Anvin,
	Peter Zijlstra, Borislav Petkov, Linus Torvalds

From: Andy Lutomirski <luto@kernel.org>

In case something goes wrong with unwinding (not unlikely in case of
a stack overflow), print the offending IP at which the overflow was
detected.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Borislav Petkov <bpetkov@suse.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/6fcf700cc5ee884fb739b67d1246ab4185c41409.1511497875.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/kernel/irq_64.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/irq_64.c b/arch/x86/kernel/irq_64.c
index 020efbf5786b..d86e344f5b3d 100644
--- a/arch/x86/kernel/irq_64.c
+++ b/arch/x86/kernel/irq_64.c
@@ -57,10 +57,10 @@ static inline void stack_overflow_check(struct pt_regs *regs)
 	if (regs->sp >= estack_top && regs->sp <= estack_bottom)
 		return;
 
-	WARN_ONCE(1, "do_IRQ(): %s has overflown the kernel stack (cur:%Lx,sp:%lx,irq stk top-bottom:%Lx-%Lx,exception stk top-bottom:%Lx-%Lx)\n",
+	WARN_ONCE(1, "do_IRQ(): %s has overflown the kernel stack (cur:%Lx,sp:%lx,irq stk top-bottom:%Lx-%Lx,exception stk top-bottom:%Lx-%Lx,ip:%pF)\n",
 		current->comm, curbase, regs->sp,
 		irq_stack_top, irq_stack_bottom,
-		estack_top, estack_bottom);
+		estack_top, estack_bottom, (void *)regs->ip);
 
 	if (sysctl_panic_on_stackoverflow)
 		panic("low stack detected by irq handler - check messages\n");
-- 
2.14.1

^ permalink raw reply related	[flat|nested] 82+ messages in thread

* [PATCH 18/43] x86/entry/64: Move the IST stacks into cpu_entry_area
  2017-11-24  9:14 [PATCH 00/43] x86 entry-stack and Kaiser series, 2017/11/24 version Ingo Molnar
                   ` (16 preceding siblings ...)
  2017-11-24  9:14 ` [PATCH 17/43] x86/irq/64: Print the offending IP in the stack overflow warning Ingo Molnar
@ 2017-11-24  9:14 ` Ingo Molnar
  2017-11-24 14:23   ` Thomas Gleixner
  2017-11-24  9:14 ` [PATCH 19/43] x86/entry/64: Remove the SYSENTER stack canary Ingo Molnar
                   ` (26 subsequent siblings)
  44 siblings, 1 reply; 82+ messages in thread
From: Ingo Molnar @ 2017-11-24  9:14 UTC (permalink / raw)
  To: linux-kernel
  Cc: Dave Hansen, Andy Lutomirski, Thomas Gleixner, H . Peter Anvin,
	Peter Zijlstra, Borislav Petkov, Linus Torvalds

From: Andy Lutomirski <luto@kernel.org>

The IST stacks are needed when an IST exception occurs and are
accessed before any kernel code at all runs.  Move them into
cpu_entry_area.
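
For context: the CPU loads an IST stack pointer straight out of the
live TSS the moment a gate with an IST index fires, before a single
handler instruction runs, which is why these stacks have to live
somewhere the entry code can reach.  The wiring itself is the
existing ist[] setup in cpu_init(), now pointed at the entry-area
copy (fragment matching the diff below, shown for illustration):

	char *estacks = get_cpu_entry_area(cpu)->exception_stacks;

	for (v = 0; v < N_EXCEPTION_STACKS; v++) {
		estacks += exception_stack_sizes[v];
		oist->ist[v] = t->x86_tss.ist[v] = (unsigned long)estacks;
	}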

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Borislav Petkov <bpetkov@suse.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/0ffddccdc0ce1953f950a553142662cf68258fb7.1511497875.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/include/asm/fixmap.h | 10 ++++++++++
 arch/x86/kernel/cpu/common.c  | 40 +++++++++++++++++++++++++---------------
 2 files changed, 35 insertions(+), 15 deletions(-)

diff --git a/arch/x86/include/asm/fixmap.h b/arch/x86/include/asm/fixmap.h
index 7eb1b5490395..15cf010225c9 100644
--- a/arch/x86/include/asm/fixmap.h
+++ b/arch/x86/include/asm/fixmap.h
@@ -60,6 +60,16 @@ struct cpu_entry_area {
 	struct tss_struct tss;
 
 	char entry_trampoline[PAGE_SIZE];
+
+#ifdef CONFIG_X86_64
+	/*
+	 * Exception stacks used for IST entries.
+	 *
+	 * In the future, this should have a separate slot for each stack
+	 * with guard pages between them.
+	 */
+	char exception_stacks[(N_EXCEPTION_STACKS - 1) * EXCEPTION_STKSZ + DEBUG_STKSZ];
+#endif
 };
 
 #define CPU_ENTRY_AREA_PAGES (sizeof(struct cpu_entry_area) / PAGE_SIZE)
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 5a05db084659..6b949e6ea0f9 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -503,6 +503,22 @@ static void set_percpu_fixmap_pages(int fixmap_index, void *ptr, int pages, pgpr
 DEFINE_PER_CPU(struct cpu_entry_area *, cpu_entry_area);
 #endif
 
+#ifdef CONFIG_X86_64
+/*
+ * Special IST stacks which the CPU switches to when it calls
+ * an IST-marked descriptor entry. Up to 7 stacks (hardware
+ * limit), all of them are 4K, except the debug stack which
+ * is 8K.
+ */
+static const unsigned int exception_stack_sizes[N_EXCEPTION_STACKS] = {
+	  [0 ... N_EXCEPTION_STACKS - 1]	= EXCEPTION_STKSZ,
+	  [DEBUG_STACK - 1]			= DEBUG_STKSZ
+};
+
+static DEFINE_PER_CPU_PAGE_ALIGNED(char, exception_stacks
+	[(N_EXCEPTION_STACKS - 1) * EXCEPTION_STKSZ + DEBUG_STKSZ]);
+#endif
+
 /* Setup the fixmap mappings only once per-processor */
 static inline void setup_cpu_entry_area(int cpu)
 {
@@ -557,6 +573,14 @@ static inline void setup_cpu_entry_area(int cpu)
 #endif
 
 #ifdef CONFIG_X86_64
+	BUILD_BUG_ON(sizeof(exception_stacks) % PAGE_SIZE != 0);
+	BUILD_BUG_ON(sizeof(exception_stacks) !=
+		     sizeof(((struct cpu_entry_area *)0)->exception_stacks));
+	set_percpu_fixmap_pages(get_cpu_entry_area_index(cpu, exception_stacks),
+				&per_cpu(exception_stacks, cpu),
+				sizeof(exception_stacks) / PAGE_SIZE,
+				PAGE_KERNEL);
+
 	__set_fixmap(get_cpu_entry_area_index(cpu, entry_trampoline),
 		     __pa_symbol(_entry_trampoline), PAGE_KERNEL_RX);
 #endif
@@ -1407,20 +1431,6 @@ DEFINE_PER_CPU(unsigned int, irq_count) __visible = -1;
 DEFINE_PER_CPU(int, __preempt_count) = INIT_PREEMPT_COUNT;
 EXPORT_PER_CPU_SYMBOL(__preempt_count);
 
-/*
- * Special IST stacks which the CPU switches to when it calls
- * an IST-marked descriptor entry. Up to 7 stacks (hardware
- * limit), all of them are 4K, except the debug stack which
- * is 8K.
- */
-static const unsigned int exception_stack_sizes[N_EXCEPTION_STACKS] = {
-	  [0 ... N_EXCEPTION_STACKS - 1]	= EXCEPTION_STKSZ,
-	  [DEBUG_STACK - 1]			= DEBUG_STKSZ
-};
-
-static DEFINE_PER_CPU_PAGE_ALIGNED(char, exception_stacks
-	[(N_EXCEPTION_STACKS - 1) * EXCEPTION_STKSZ + DEBUG_STKSZ]);
-
 /* May not be marked __init: used by software suspend */
 void syscall_init(void)
 {
@@ -1626,7 +1636,7 @@ void cpu_init(void)
 	 * set up and load the per-CPU TSS
 	 */
 	if (!oist->ist[0]) {
-		char *estacks = per_cpu(exception_stacks, cpu);
+		char *estacks = get_cpu_entry_area(cpu)->exception_stacks;
 
 		for (v = 0; v < N_EXCEPTION_STACKS; v++) {
 			estacks += exception_stack_sizes[v];
-- 
2.14.1

^ permalink raw reply related	[flat|nested] 82+ messages in thread

* [PATCH 19/43] x86/entry/64: Remove the SYSENTER stack canary
  2017-11-24  9:14 [PATCH 00/43] x86 entry-stack and Kaiser series, 2017/11/24 version Ingo Molnar
                   ` (17 preceding siblings ...)
  2017-11-24  9:14 ` [PATCH 18/43] x86/entry/64: Move the IST stacks into cpu_entry_area Ingo Molnar
@ 2017-11-24  9:14 ` Ingo Molnar
  2017-11-24 14:23   ` Thomas Gleixner
  2017-11-24  9:14 ` [PATCH 20/43] x86/entry: Clean up SYSENTER_stack code Ingo Molnar
                   ` (25 subsequent siblings)
  44 siblings, 1 reply; 82+ messages in thread
From: Ingo Molnar @ 2017-11-24  9:14 UTC (permalink / raw)
  To: linux-kernel
  Cc: Dave Hansen, Andy Lutomirski, Thomas Gleixner, H . Peter Anvin,
	Peter Zijlstra, Borislav Petkov, Linus Torvalds

From: Andy Lutomirski <luto@kernel.org>

Now that the SYSENTER stack has a guard page, there's no need for a
canary to detect overflow after the fact.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Borislav Petkov <bpetkov@suse.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/be3179c0a38c392fa44ebeb7dd89391ff5c010c3.1511497875.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/include/asm/processor.h | 1 -
 arch/x86/kernel/dumpstack.c      | 3 +--
 arch/x86/kernel/process.c        | 1 -
 arch/x86/kernel/traps.c          | 7 -------
 4 files changed, 1 insertion(+), 11 deletions(-)

diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 3a09e5571a92..7743aedb82ea 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -337,7 +337,6 @@ struct tss_struct {
 	 * Space for the temporary SYSENTER stack, used for SYSENTER
 	 * and the entry trampoline as well.
 	 */
-	unsigned long		SYSENTER_stack_canary;
 	unsigned long		SYSENTER_stack[64];
 
 	/*
diff --git a/arch/x86/kernel/dumpstack.c b/arch/x86/kernel/dumpstack.c
index bb61919c9335..9ce5fcf7d14d 100644
--- a/arch/x86/kernel/dumpstack.c
+++ b/arch/x86/kernel/dumpstack.c
@@ -48,8 +48,7 @@ bool in_sysenter_stack(unsigned long *stack, struct stack_info *info)
 	int cpu = smp_processor_id();
 	struct tss_struct *tss = &get_cpu_entry_area(cpu)->tss;
 
-	/* Treat the canary as part of the stack for unwinding purposes. */
-	void *begin = &tss->SYSENTER_stack_canary;
+	void *begin = &tss->SYSENTER_stack;
 	void *end = (void *)&tss->SYSENTER_stack + sizeof(tss->SYSENTER_stack);
 
 	if ((void *)stack < begin || (void *)stack >= end)
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 86e83762e3b3..6a04287f222b 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -81,7 +81,6 @@ __visible DEFINE_PER_CPU_SHARED_ALIGNED(struct tss_struct, cpu_tss) = {
 	  */
 	.io_bitmap		= { [0 ... IO_BITMAP_LONGS] = ~0 },
 #endif
-	.SYSENTER_stack_canary	= STACK_END_MAGIC,
 };
 EXPORT_PER_CPU_SYMBOL(cpu_tss);
 
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index cbc4272bb9dd..19475dbff068 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -801,13 +801,6 @@ dotraplinkage void do_debug(struct pt_regs *regs, long error_code)
 	debug_stack_usage_dec();
 
 exit:
-	/*
-	 * This is the most likely code path that involves non-trivial use
-	 * of the SYSENTER stack.  Check that we haven't overrun it.
-	 */
-	WARN(this_cpu_read(cpu_tss.SYSENTER_stack_canary) != STACK_END_MAGIC,
-	     "Overran or corrupted SYSENTER stack\n");
-
 	ist_exit(regs);
 }
 NOKPROBE_SYMBOL(do_debug);
-- 
2.14.1

^ permalink raw reply related	[flat|nested] 82+ messages in thread

* [PATCH 20/43] x86/entry: Clean up SYSENTER_stack code
  2017-11-24  9:14 [PATCH 00/43] x86 entry-stack and Kaiser series, 2017/11/24 version Ingo Molnar
                   ` (18 preceding siblings ...)
  2017-11-24  9:14 ` [PATCH 19/43] x86/entry/64: Remove the SYSENTER stack canary Ingo Molnar
@ 2017-11-24  9:14 ` Ingo Molnar
  2017-11-24 14:24   ` Thomas Gleixner
  2017-11-24  9:14 ` [PATCH 21/43] x86/mm/kaiser: Disable global pages by default with KAISER Ingo Molnar
                   ` (24 subsequent siblings)
  44 siblings, 1 reply; 82+ messages in thread
From: Ingo Molnar @ 2017-11-24  9:14 UTC (permalink / raw)
  To: linux-kernel
  Cc: Dave Hansen, Andy Lutomirski, Thomas Gleixner, H . Peter Anvin,
	Peter Zijlstra, Borislav Petkov, Linus Torvalds

From: Andy Lutomirski <luto@kernel.org>

The existing code was a mess, mainly because C arrays are nasty.
Turn SYSENTER_stack into a struct, add a helper to find it, and do
all the obvious cleanups this enables.
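
The payoff is that the stack top becomes plain pointer arithmetic (a
sketch of the pattern the diff below uses):

	struct SYSENTER_stack *ss = cpu_SYSENTER_stack(cpu);

	/* "ss + 1" is one past the end of the struct, i.e. the initial
	 * stack pointer of this downward-growing stack: */
	wrmsrl_safe(MSR_IA32_SYSENTER_ESP, (unsigned long)(ss + 1));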

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Borislav Petkov <bpetkov@suse.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/38ff640712c9b591b32de24a080daf13afaba234.1511497875.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/entry/entry_32.S        |  4 ++--
 arch/x86/entry/entry_64.S        |  2 +-
 arch/x86/include/asm/fixmap.h    |  5 +++++
 arch/x86/include/asm/processor.h |  6 +++++-
 arch/x86/kernel/asm-offsets.c    |  6 ++----
 arch/x86/kernel/cpu/common.c     | 14 +++-----------
 arch/x86/kernel/dumpstack.c      |  7 +++----
 7 files changed, 21 insertions(+), 23 deletions(-)

diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
index 0ab316c46806..3629bcbf85a2 100644
--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -942,7 +942,7 @@ ENTRY(debug)
 
 	/* Are we currently on the SYSENTER stack? */
 	movl	PER_CPU_VAR(cpu_entry_area), %ecx
-	addl	$CPU_ENTRY_AREA_tss + CPU_TSS_SYSENTER_stack + SIZEOF_SYSENTER_stack, %ecx
+	addl	$CPU_ENTRY_AREA_tss + TSS_STRUCT_SYSENTER_stack + SIZEOF_SYSENTER_stack, %ecx
 	subl	%eax, %ecx	/* ecx = (end of SYSENTER_stack) - esp */
 	cmpl	$SIZEOF_SYSENTER_stack, %ecx
 	jb	.Ldebug_from_sysenter_stack
@@ -986,7 +986,7 @@ ENTRY(nmi)
 
 	/* Are we currently on the SYSENTER stack? */
 	movl	PER_CPU_VAR(cpu_entry_area), %ecx
-	addl	$CPU_ENTRY_AREA_tss + CPU_TSS_SYSENTER_stack + SIZEOF_SYSENTER_stack, %ecx
+	addl	$CPU_ENTRY_AREA_tss + TSS_STRUCT_SYSENTER_stack + SIZEOF_SYSENTER_stack, %ecx
 	subl	%eax, %ecx	/* ecx = (end of SYSENTER_stack) - esp */
 	cmpl	$SIZEOF_SYSENTER_stack, %ecx
 	jb	.Lnmi_from_sysenter_stack
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 0cde243b7542..34e3110b0876 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -158,7 +158,7 @@ END(native_usergs_sysret64)
 	_entry_trampoline - CPU_ENTRY_AREA_entry_trampoline(%rip)
 
 /* The top word of the SYSENTER stack is hot and is usable as scratch space. */
-#define RSP_SCRATCH CPU_ENTRY_AREA_tss + CPU_TSS_SYSENTER_stack + \
+#define RSP_SCRATCH CPU_ENTRY_AREA_tss + TSS_STRUCT_SYSENTER_stack + \
 	SIZEOF_SYSENTER_stack - 8 + CPU_ENTRY_AREA
 
 ENTRY(entry_SYSCALL_64_trampoline)
diff --git a/arch/x86/include/asm/fixmap.h b/arch/x86/include/asm/fixmap.h
index 15cf010225c9..ceb04ab0a642 100644
--- a/arch/x86/include/asm/fixmap.h
+++ b/arch/x86/include/asm/fixmap.h
@@ -234,5 +234,10 @@ static inline struct cpu_entry_area *get_cpu_entry_area(int cpu)
 	return (struct cpu_entry_area *)__fix_to_virt(__get_cpu_entry_area_page_index(cpu, 0));
 }
 
+static inline struct SYSENTER_stack *cpu_SYSENTER_stack(int cpu)
+{
+	return &get_cpu_entry_area((cpu))->tss.SYSENTER_stack;
+}
+
 #endif /* !__ASSEMBLY__ */
 #endif /* _ASM_X86_FIXMAP_H */
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 7743aedb82ea..54f3ee3bc8a0 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -332,12 +332,16 @@ struct x86_hw_tss {
 #define IO_BITMAP_OFFSET		(offsetof(struct tss_struct, io_bitmap) - offsetof(struct tss_struct, x86_tss))
 #define INVALID_IO_BITMAP_OFFSET	0x8000
 
+struct SYSENTER_stack {
+	unsigned long		words[64];
+};
+
 struct tss_struct {
 	/*
 	 * Space for the temporary SYSENTER stack, used for SYSENTER
 	 * and the entry trampoline as well.
 	 */
-	unsigned long		SYSENTER_stack[64];
+	struct SYSENTER_stack	SYSENTER_stack;
 
 	/*
 	 * The fixed hardware portion.  This must not cross a page boundary
diff --git a/arch/x86/kernel/asm-offsets.c b/arch/x86/kernel/asm-offsets.c
index 61b1af88ac07..46c0995344aa 100644
--- a/arch/x86/kernel/asm-offsets.c
+++ b/arch/x86/kernel/asm-offsets.c
@@ -94,10 +94,8 @@ void common(void) {
 	BLANK();
 	DEFINE(PTREGS_SIZE, sizeof(struct pt_regs));
 
-	/* Offset from cpu_tss to SYSENTER_stack */
-	OFFSET(CPU_TSS_SYSENTER_stack, tss_struct, SYSENTER_stack);
-	/* Size of SYSENTER_stack */
-	DEFINE(SIZEOF_SYSENTER_stack, sizeof(((struct tss_struct *)0)->SYSENTER_stack));
+	OFFSET(TSS_STRUCT_SYSENTER_stack, tss_struct, SYSENTER_stack);
+	DEFINE(SIZEOF_SYSENTER_stack, sizeof(struct SYSENTER_stack));
 
 	/* Layout info for cpu_entry_area */
 	OFFSET(CPU_ENTRY_AREA_tss, cpu_entry_area, tss);
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 6b949e6ea0f9..f9c7e6852874 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -1332,12 +1332,7 @@ void enable_sep_cpu(void)
 
 	tss->x86_tss.ss1 = __KERNEL_CS;
 	wrmsr(MSR_IA32_SYSENTER_CS, tss->x86_tss.ss1, 0);
-
-	wrmsr(MSR_IA32_SYSENTER_ESP,
-	      (unsigned long)&get_cpu_entry_area(cpu)->tss +
-	      offsetofend(struct tss_struct, SYSENTER_stack),
-	      0);
-
+	wrmsr(MSR_IA32_SYSENTER_ESP, (unsigned long)(cpu_SYSENTER_stack(cpu) + 1), 0);
 	wrmsr(MSR_IA32_SYSENTER_EIP, (unsigned long)entry_SYSENTER_32, 0);
 
 	put_cpu();
@@ -1451,9 +1446,7 @@ void syscall_init(void)
 	 * AMD doesn't allow SYSENTER in long mode (either 32- or 64-bit).
 	 */
 	wrmsrl_safe(MSR_IA32_SYSENTER_CS, (u64)__KERNEL_CS);
-	wrmsrl_safe(MSR_IA32_SYSENTER_ESP,
-		    (unsigned long)&get_cpu_entry_area(cpu)->tss +
-		    offsetofend(struct tss_struct, SYSENTER_stack));
+	wrmsrl_safe(MSR_IA32_SYSENTER_ESP, (unsigned long)(cpu_SYSENTER_stack(cpu) + 1));
 	wrmsrl_safe(MSR_IA32_SYSENTER_EIP, (u64)entry_SYSENTER_compat);
 #else
 	wrmsrl(MSR_CSTAR, (unsigned long)ignore_sysret);
@@ -1670,8 +1663,7 @@ void cpu_init(void)
 	 */
 	set_tss_desc(cpu, &get_cpu_entry_area(cpu)->tss.x86_tss);
 	load_TR_desc();
-	load_sp0((unsigned long)&get_cpu_entry_area(cpu)->tss +
-		 offsetofend(struct tss_struct, SYSENTER_stack));
+	load_sp0((unsigned long)(cpu_SYSENTER_stack(cpu) + 1));
 
 	load_mm_ldt(&init_mm);
 
diff --git a/arch/x86/kernel/dumpstack.c b/arch/x86/kernel/dumpstack.c
index 9ce5fcf7d14d..d58dd121c0af 100644
--- a/arch/x86/kernel/dumpstack.c
+++ b/arch/x86/kernel/dumpstack.c
@@ -45,11 +45,10 @@ bool in_task_stack(unsigned long *stack, struct task_struct *task,
 
 bool in_sysenter_stack(unsigned long *stack, struct stack_info *info)
 {
-	int cpu = smp_processor_id();
-	struct tss_struct *tss = &get_cpu_entry_area(cpu)->tss;
+	struct SYSENTER_stack *ss = cpu_SYSENTER_stack(smp_processor_id());
 
-	void *begin = &tss->SYSENTER_stack;
-	void *end = (void *)&tss->SYSENTER_stack + sizeof(tss->SYSENTER_stack);
+	void *begin = ss;
+	void *end = ss + 1;
 
 	if ((void *)stack < begin || (void *)stack >= end)
 		return false;
-- 
2.14.1

^ permalink raw reply related	[flat|nested] 82+ messages in thread

* [PATCH 21/43] x86/mm/kaiser: Disable global pages by default with KAISER
  2017-11-24  9:14 [PATCH 00/43] x86 entry-stack and Kaiser series, 2017/11/24 version Ingo Molnar
                   ` (19 preceding siblings ...)
  2017-11-24  9:14 ` [PATCH 20/43] x86/entry: Clean up SYSENTER_stack code Ingo Molnar
@ 2017-11-24  9:14 ` Ingo Molnar
  2017-11-24  9:14 ` [PATCH 22/43] x86/mm/kaiser: Prepare assembly for entry/exit CR3 switching Ingo Molnar
                   ` (23 subsequent siblings)
  44 siblings, 0 replies; 82+ messages in thread
From: Ingo Molnar @ 2017-11-24  9:14 UTC (permalink / raw)
  To: linux-kernel
  Cc: Dave Hansen, Andy Lutomirski, Thomas Gleixner, H . Peter Anvin,
	Peter Zijlstra, Borislav Petkov, Linus Torvalds

From: Dave Hansen <dave.hansen@linux.intel.com>

Global pages stay in the TLB across context switches.  Since all contexts
share the same kernel mapping, these mappings are marked as global pages
so kernel entries in the TLB are not flushed out on a context switch.

But, even having these entries in the TLB opens up something that an
attacker can use [1].

That means that even when KAISER switches page tables on return to user
space the global pages would stay in the TLB cache.

Disable global pages so that kernel TLB entries can be flushed before
returning to user space. This way, all accesses to kernel addresses from
userspace result in a TLB miss independent of the existence of a kernel
mapping.

Replace _PAGE_GLOBAL by __PAGE_KERNEL_GLOBAL and keep _PAGE_GLOBAL
available so that it can still be used for a few selected kernel mappings
which must be visible to userspace, when KAISER is enabled, like the
entry/exit code and data.

1. The double-page-fault attack:
   http://www.ieee-security.org/TC/SP2013/papers/4977a191.pdf
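
Illustratively (a hypothetical fragment, not taken from this patch),
one of the few mappings that must stay user-visible and global, such
as entry text, would then opt back in explicitly:

	/* __PAGE_KERNEL_RX no longer carries the global bit under
	 * KAISER; add _PAGE_GLOBAL back by hand where it is wanted: */
	pgprot_t prot = __pgprot(__PAGE_KERNEL_RX | _PAGE_GLOBAL);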

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Reviewed-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Daniel Gruss <daniel.gruss@iaik.tugraz.at>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Kees Cook <keescook@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michael Schwarz <michael.schwarz@iaik.tugraz.at>
Cc: Moritz Lipp <moritz.lipp@iaik.tugraz.at>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Richard Fellner <richard.fellner@student.tugraz.at>
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20171123003441.63DDFC6F@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/include/asm/pgtable_types.h | 14 +++++++++++++-
 arch/x86/mm/pageattr.c               | 16 ++++++++--------
 2 files changed, 21 insertions(+), 9 deletions(-)

diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index 9e9b05fc4860..1fc2f22b9002 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -180,8 +180,20 @@ enum page_cache_mode {
 #define PAGE_READONLY_EXEC	__pgprot(_PAGE_PRESENT | _PAGE_USER |	\
 					 _PAGE_ACCESSED)
 
+/*
+ * Disable global pages for anything using the default
+ * __PAGE_KERNEL* macros.  PGE will still be enabled
+ * and _PAGE_GLOBAL may still be used carefully.
+ */
+#ifdef CONFIG_KAISER
+#define __PAGE_KERNEL_GLOBAL	0
+#else
+#define __PAGE_KERNEL_GLOBAL	_PAGE_GLOBAL
+#endif
+
 #define __PAGE_KERNEL_EXEC						\
-	(_PAGE_PRESENT | _PAGE_RW | _PAGE_DIRTY | _PAGE_ACCESSED | _PAGE_GLOBAL)
+	(_PAGE_PRESENT | _PAGE_RW | _PAGE_DIRTY | _PAGE_ACCESSED |	\
+	 __PAGE_KERNEL_GLOBAL)
 #define __PAGE_KERNEL		(__PAGE_KERNEL_EXEC | _PAGE_NX)
 
 #define __PAGE_KERNEL_RO		(__PAGE_KERNEL & ~_PAGE_RW)
diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c
index 3fe68483463c..ffe584fa1f5e 100644
--- a/arch/x86/mm/pageattr.c
+++ b/arch/x86/mm/pageattr.c
@@ -585,9 +585,9 @@ try_preserve_large_page(pte_t *kpte, unsigned long address,
 	 * for the ancient hardware that doesn't support it.
 	 */
 	if (pgprot_val(req_prot) & _PAGE_PRESENT)
-		pgprot_val(req_prot) |= _PAGE_PSE | _PAGE_GLOBAL;
+		pgprot_val(req_prot) |= _PAGE_PSE | __PAGE_KERNEL_GLOBAL;
 	else
-		pgprot_val(req_prot) &= ~(_PAGE_PSE | _PAGE_GLOBAL);
+		pgprot_val(req_prot) &= ~(_PAGE_PSE | __PAGE_KERNEL_GLOBAL);
 
 	req_prot = canon_pgprot(req_prot);
 
@@ -705,9 +705,9 @@ __split_large_page(struct cpa_data *cpa, pte_t *kpte, unsigned long address,
 	 * for the ancient hardware that doesn't support it.
 	 */
 	if (pgprot_val(ref_prot) & _PAGE_PRESENT)
-		pgprot_val(ref_prot) |= _PAGE_GLOBAL;
+		pgprot_val(ref_prot) |= __PAGE_KERNEL_GLOBAL;
 	else
-		pgprot_val(ref_prot) &= ~_PAGE_GLOBAL;
+		pgprot_val(ref_prot) &= ~__PAGE_KERNEL_GLOBAL;
 
 	/*
 	 * Get the target pfn from the original entry:
@@ -938,9 +938,9 @@ static void populate_pte(struct cpa_data *cpa,
 	 * support it.
 	 */
 	if (pgprot_val(pgprot) & _PAGE_PRESENT)
-		pgprot_val(pgprot) |= _PAGE_GLOBAL;
+		pgprot_val(pgprot) |= __PAGE_KERNEL_GLOBAL;
 	else
-		pgprot_val(pgprot) &= ~_PAGE_GLOBAL;
+		pgprot_val(pgprot) &= ~__PAGE_KERNEL_GLOBAL;
 
 	pgprot = canon_pgprot(pgprot);
 
@@ -1242,9 +1242,9 @@ static int __change_page_attr(struct cpa_data *cpa, int primary)
 		 * support it.
 		 */
 		if (pgprot_val(new_prot) & _PAGE_PRESENT)
-			pgprot_val(new_prot) |= _PAGE_GLOBAL;
+			pgprot_val(new_prot) |= __PAGE_KERNEL_GLOBAL;
 		else
-			pgprot_val(new_prot) &= ~_PAGE_GLOBAL;
+			pgprot_val(new_prot) &= ~__PAGE_KERNEL_GLOBAL;
 
 		/*
 		 * We need to keep the pfn from the existing PTE,
-- 
2.14.1

^ permalink raw reply related	[flat|nested] 82+ messages in thread

* [PATCH 22/43] x86/mm/kaiser: Prepare assembly for entry/exit CR3 switching
  2017-11-24  9:14 [PATCH 00/43] x86 entry-stack and Kaiser series, 2017/11/24 version Ingo Molnar
                   ` (20 preceding siblings ...)
  2017-11-24  9:14 ` [PATCH 21/43] x86/mm/kaiser: Disable global pages by default with KAISER Ingo Molnar
@ 2017-11-24  9:14 ` Ingo Molnar
  2017-11-24 12:05   ` Peter Zijlstra
  2017-11-24  9:14 ` [PATCH 23/43] x86/mm/kaiser: Introduce user-mapped per-cpu areas Ingo Molnar
                   ` (22 subsequent siblings)
  44 siblings, 1 reply; 82+ messages in thread
From: Ingo Molnar @ 2017-11-24  9:14 UTC (permalink / raw)
  To: linux-kernel
  Cc: Dave Hansen, Andy Lutomirski, Thomas Gleixner, H . Peter Anvin,
	Peter Zijlstra, Borislav Petkov, Linus Torvalds

From: Dave Hansen <dave.hansen@linux.intel.com>

This is largely code from Andy Lutomirski.  I fixed a few bugs
in it, and added a few SWITCH_TO_* spots.

KAISER needs to switch to a different CR3 value when it enters
the kernel and switch back when it exits.  This essentially
needs to be done before leaving assembly code.

This is extra challenging because the switching context is
tricky: the registers that can be clobbered can vary.  It is also
hard to store things on the stack because either there is an
established ABI (ptregs) or the stack is entirely unsafe to use.

This patch establishes a set of macros that allow changing to
the user and kernel CR3 values.
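
In C terms the switch is just a bit flip (a sketch of what the new
assembly macros do; the helper names here are illustrative, the mask
comes from this patch):

	/* KAISER PGDs are 8k; the user half is the upper 4k: */
	#define KAISER_SWITCH_MASK	(1 << PAGE_SHIFT)

	static unsigned long kaiser_kernel_cr3(unsigned long cr3)
	{
		return cr3 & ~KAISER_SWITCH_MASK;	/* clear bit 12 */
	}

	static unsigned long kaiser_user_cr3(unsigned long cr3)
	{
		return cr3 | KAISER_SWITCH_MASK;	/* set bit 12 */
	}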

Interactions with SWAPGS: previous versions of the KAISER code
relied on having per-cpu scratch space to save/restore a register
that can be used for the CR3 MOV.  The %GS register is used to
index into our per-cpu space, so SWAPGS *had* to be done before
the CR3 switch.  That scratch space is gone now, but the semantic
that SWAPGS must be done before the CR3 MOV is retained.  This is
good to keep because it is not that hard to do and it allows us
to do things like add per-cpu debugging information to help us
figure out what goes wrong sometimes.

What this does in the NMI code is worth pointing out.  NMIs
can interrupt *any* context and they can also be nested with
NMIs interrupting other NMIs.  The comments below
".Lnmi_from_kernel" explain the format of the stack during this
situation.  Changing the format of this stack is not a fun
exercise: I tried.  Instead of storing the old CR3 value on the
stack, this patch depends on the *regular* register save/restore
mechanism and then uses %r14 to keep CR3 during the NMI.  It is
callee-saved and will not be clobbered by the C NMI handlers that
get called.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Daniel Gruss <daniel.gruss@iaik.tugraz.at>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Kees Cook <keescook@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michael Schwarz <michael.schwarz@iaik.tugraz.at>
Cc: Moritz Lipp <moritz.lipp@iaik.tugraz.at>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Richard Fellner <richard.fellner@student.tugraz.at>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20171123003442.2D047A7D@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/entry/calling.h         | 65 ++++++++++++++++++++++++++++++++++++++++
 arch/x86/entry/entry_64.S        | 44 +++++++++++++++++++++++++--
 arch/x86/entry/entry_64_compat.S | 32 +++++++++++++++++++-
 3 files changed, 137 insertions(+), 4 deletions(-)

diff --git a/arch/x86/entry/calling.h b/arch/x86/entry/calling.h
index 3fd8bc560fae..e1650da01323 100644
--- a/arch/x86/entry/calling.h
+++ b/arch/x86/entry/calling.h
@@ -1,6 +1,7 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 #include <linux/jump_label.h>
 #include <asm/unwind_hints.h>
+#include <asm/cpufeatures.h>
 
 /*
 
@@ -187,6 +188,70 @@ For 32-bit we have the following conventions - kernel is built with
 #endif
 .endm
 
+#ifdef CONFIG_KAISER
+
+/* KAISER PGDs are 8k.  Flip bit 12 to switch between the two halves: */
+#define KAISER_SWITCH_MASK (1<<PAGE_SHIFT)
+
+.macro ADJUST_KERNEL_CR3 reg:req
+	/* Clear "KAISER bit", point CR3 at kernel pagetables: */
+	andq	$(~KAISER_SWITCH_MASK), \reg
+.endm
+
+.macro ADJUST_USER_CR3 reg:req
+	/* Move CR3 up a page to the user page tables: */
+	orq	$(KAISER_SWITCH_MASK), \reg
+.endm
+
+.macro SWITCH_TO_KERNEL_CR3 scratch_reg:req
+	mov	%cr3, \scratch_reg
+	ADJUST_KERNEL_CR3 \scratch_reg
+	mov	\scratch_reg, %cr3
+.endm
+
+.macro SWITCH_TO_USER_CR3 scratch_reg:req
+	mov	%cr3, \scratch_reg
+	ADJUST_USER_CR3 \scratch_reg
+	mov	\scratch_reg, %cr3
+.endm
+
+.macro SAVE_AND_SWITCH_TO_KERNEL_CR3 scratch_reg:req save_reg:req
+	movq	%cr3, %r\scratch_reg
+	movq	%r\scratch_reg, \save_reg
+	/*
+	 * Is the switch bit zero?  Then CR3 already points at the
+	 * kernel page tables and no CR3 write is needed.
+	 */
+	testq	$(KAISER_SWITCH_MASK), %r\scratch_reg
+	jz	.Ldone_\@
+
+	ADJUST_KERNEL_CR3 %r\scratch_reg
+	movq	%r\scratch_reg, %cr3
+
+.Ldone_\@:
+.endm
+
+.macro RESTORE_CR3 save_reg:req
+	/*
+	 * The CR3 write could be avoided when not changing its value,
+	 * but would require a CR3 read *and* a scratch register.
+	 */
+	movq	\save_reg, %cr3
+.endm
+
+#else /* CONFIG_KAISER=n: */
+
+.macro SWITCH_TO_KERNEL_CR3 scratch_reg:req
+.endm
+.macro SWITCH_TO_USER_CR3 scratch_reg:req
+.endm
+.macro SAVE_AND_SWITCH_TO_KERNEL_CR3 scratch_reg:req save_reg:req
+.endm
+.macro RESTORE_CR3 save_reg:req
+.endm
+
+#endif
+
 #endif /* CONFIG_X86_64 */
 
 /*
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 34e3110b0876..07ed55e9e35a 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -168,6 +168,9 @@ ENTRY(entry_SYSCALL_64_trampoline)
 	/* Stash the user RSP. */
 	movq	%rsp, RSP_SCRATCH
 
+	/* Note: using %rsp as a scratch reg. */
+	SWITCH_TO_KERNEL_CR3 scratch_reg=%rsp
+
 	/* Load the top of the task stack into RSP */
 	movq	CPU_ENTRY_AREA_tss + TSS_sp1 + CPU_ENTRY_AREA, %rsp
 
@@ -198,6 +201,15 @@ ENTRY(entry_SYSCALL_64)
 
 	swapgs
 	movq	%rsp, PER_CPU_VAR(rsp_scratch)
+
+	/*
+	 * The kernel CR3 is needed to map the process stack, but we
+	 * need a scratch register to be able to load CR3.  %rsp is
+	 * clobberable right now, so use it as a scratch register.
+	 * %rsp will look crazy here for a couple of instructions.
+	 */
+	SWITCH_TO_KERNEL_CR3 scratch_reg=%rsp
+
 	movq	PER_CPU_VAR(cpu_current_top_of_stack), %rsp
 
 	/* Construct struct pt_regs on stack */
@@ -247,6 +259,9 @@ GLOBAL(entry_SYSCALL_64_after_hwframe)
 	ja	1f				/* return -ENOSYS (already in pt_regs->ax) */
 	movq	%r10, %rcx
 
+	/* Must wait until we have the kernel CR3 to call C functions: */
+	TRACE_IRQS_OFF
+
 	/*
 	 * This call instruction is handled specially in stub_ptregs_64.
 	 * It might end up jumping to the slow path.  If it jumps, RAX
@@ -393,6 +408,7 @@ GLOBAL(entry_SYSCALL_64_after_hwframe)
 	 * We are on the trampoline stack.  All regs except RDI are live.
 	 * We can do future final exit work right here.
 	 */
+	SWITCH_TO_USER_CR3 scratch_reg=%rdi
 
 	popq	%rdi
 	popq	%rsp
@@ -729,6 +745,8 @@ GLOBAL(swapgs_restore_regs_and_return_to_usermode)
 	 * We can do future final exit work right here.
 	 */
 
+	SWITCH_TO_USER_CR3 scratch_reg=%rdi
+
 	/* Restore RDI. */
 	popq	%rdi
 	SWAPGS
@@ -937,6 +955,9 @@ apicinterrupt IRQ_WORK_VECTOR			irq_work_interrupt		smp_irq_work_interrupt
 ENTRY(switch_to_thread_stack)
 	UNWIND_HINT_FUNC
 
+	/* Need to switch before accessing the thread stack. */
+	SWITCH_TO_KERNEL_CR3 scratch_reg=%rdi
+
 	pushq	%rdi
 	movq	%rsp, %rdi
 	movq	PER_CPU_VAR(cpu_current_top_of_stack), %rsp
@@ -1239,7 +1260,11 @@ ENTRY(paranoid_entry)
 	js	1f				/* negative -> in kernel */
 	SWAPGS
 	xorl	%ebx, %ebx
-1:	ret
+
+1:
+	SAVE_AND_SWITCH_TO_KERNEL_CR3 scratch_reg=ax save_reg=%r14
+
+	ret
 END(paranoid_entry)
 
 /*
@@ -1261,6 +1286,7 @@ ENTRY(paranoid_exit)
 	testl	%ebx, %ebx			/* swapgs needed? */
 	jnz	.Lparanoid_exit_no_swapgs
 	TRACE_IRQS_IRETQ
+	RESTORE_CR3	%r14
 	SWAPGS_UNSAFE_STACK
 	jmp	.Lparanoid_exit_restore
 .Lparanoid_exit_no_swapgs:
@@ -1289,6 +1315,9 @@ ENTRY(error_entry)
 	 */
 	SWAPGS
 
+	/* We have user CR3.  Change to kernel CR3. */
+	SWITCH_TO_KERNEL_CR3 scratch_reg=%rax
+
 .Lerror_entry_from_usermode_after_swapgs:
 	/* Put us onto the real thread stack. */
 	popq	%r12				/* save return addr in %12 */
@@ -1333,6 +1362,7 @@ ENTRY(error_entry)
 	 * gsbase and proceed.  We'll fix up the exception and land in
 	 * .Lgs_change's error handler with kernel gsbase.
 	 */
+	SWITCH_TO_KERNEL_CR3 scratch_reg=%rax
 	SWAPGS
 	jmp .Lerror_entry_done
 
@@ -1343,9 +1373,10 @@ ENTRY(error_entry)
 
 .Lerror_bad_iret:
 	/*
-	 * We came from an IRET to user mode, so we have user gsbase.
-	 * Switch to kernel gsbase:
+	 * We came from an IRET to user mode, so we have user
+	 * gsbase and CR3.  Switch to kernel gsbase and CR3:
 	 */
+	SWITCH_TO_KERNEL_CR3 scratch_reg=%rax
 	SWAPGS
 
 	/*
@@ -1378,6 +1409,10 @@ END(error_exit)
 /*
  * Runs on exception stack.  Xen PV does not go through this path at all,
  * so we can use real assembly here.
+ *
+ * Registers:
+ *	%r14: Used to save/restore the CR3 of the interrupted context
+ *	      when KAISER is in use.  Do not clobber.
  */
 ENTRY(nmi)
 	UNWIND_HINT_IRET_REGS
@@ -1441,6 +1476,7 @@ ENTRY(nmi)
 
 	swapgs
 	cld
+	SWITCH_TO_KERNEL_CR3 scratch_reg=%rdx
 	movq	%rsp, %rdx
 	movq	PER_CPU_VAR(cpu_current_top_of_stack), %rsp
 	UNWIND_HINT_IRET_REGS base=%rdx offset=8
@@ -1693,6 +1729,8 @@ ENTRY(nmi)
 	movq	$-1, %rsi
 	call	do_nmi
 
+	RESTORE_CR3 save_reg=%r14
+
 	testl	%ebx, %ebx			/* swapgs needed? */
 	jnz	nmi_restore
 nmi_swapgs:
diff --git a/arch/x86/entry/entry_64_compat.S b/arch/x86/entry/entry_64_compat.S
index 95ad40eb7eff..57cd353c0667 100644
--- a/arch/x86/entry/entry_64_compat.S
+++ b/arch/x86/entry/entry_64_compat.S
@@ -49,6 +49,10 @@
 ENTRY(entry_SYSENTER_compat)
 	/* Interrupts are off on entry. */
 	SWAPGS
+
+	/* We are about to clobber %rsp anyway, clobbering here is OK */
+	SWITCH_TO_KERNEL_CR3 scratch_reg=%rsp
+
 	movq	PER_CPU_VAR(cpu_current_top_of_stack), %rsp
 
 	/*
@@ -215,6 +219,12 @@ GLOBAL(entry_SYSCALL_compat_after_hwframe)
 	pushq   $0			/* pt_regs->r14 = 0 */
 	pushq   $0			/* pt_regs->r15 = 0 */
 
+	/*
+	 * We just saved %rdi so it is safe to clobber.  It is not
+	 * preserved during the C calls inside TRACE_IRQS_OFF anyway.
+	 */
+	SWITCH_TO_KERNEL_CR3 scratch_reg=%rdi
+
 	/*
 	 * User mode is traced as though IRQs are on, and SYSENTER
 	 * turned them off.
@@ -256,10 +266,22 @@ GLOBAL(entry_SYSCALL_compat_after_hwframe)
 	 * when the system call started, which is already known to user
 	 * code.  We zero R8-R10 to avoid info leaks.
          */
+	movq	RSP-ORIG_RAX(%rsp), %rsp
+
+	/*
+	 * The original userspace %rsp (RSP-ORIG_RAX(%rsp)) is stored
+	 * on the process stack which is not mapped to userspace and
+	 * not readable after we SWITCH_TO_USER_CR3.  Delay the CR3
+	 * switch until after the last reference to the process
+	 * stack.
+	 *
+	 * %r8 is zeroed before the sysret, thus safe to clobber.
+	 */
+	SWITCH_TO_USER_CR3 scratch_reg=%r8
+
 	xorq	%r8, %r8
 	xorq	%r9, %r9
 	xorq	%r10, %r10
-	movq	RSP-ORIG_RAX(%rsp), %rsp
 	swapgs
 	sysretl
 END(entry_SYSCALL_compat)
@@ -297,6 +319,14 @@ ENTRY(entry_INT80_compat)
 	ASM_CLAC			/* Do this early to minimize exposure */
 	SWAPGS
 
+	/*
+	 * Must switch CR3 before thread stack is used.  %r8 itself
+	 * is not saved into pt_regs and is not preserved across
+	 * function calls (like TRACE_IRQS_OFF calls), thus should
+	 * be safe to use.
+	 */
+	SWITCH_TO_KERNEL_CR3 scratch_reg=%r8
+
 	/*
 	 * User tracing code (ptrace or signal handlers) might assume that
 	 * the saved RAX contains a 32-bit number when we're invoking a 32-bit
-- 
2.14.1

^ permalink raw reply related	[flat|nested] 82+ messages in thread

* [PATCH 23/43] x86/mm/kaiser: Introduce user-mapped per-cpu areas
  2017-11-24  9:14 [PATCH 00/43] x86 entry-stack and Kaiser series, 2017/11/24 version Ingo Molnar
                   ` (21 preceding siblings ...)
  2017-11-24  9:14 ` [PATCH 22/43] x86/mm/kaiser: Prepare assembly for entry/exit CR3 switching Ingo Molnar
@ 2017-11-24  9:14 ` Ingo Molnar
  2017-11-24  9:14 ` [PATCH 24/43] x86/mm/kaiser: Mark per-cpu data structures required for entry/exit Ingo Molnar
                   ` (21 subsequent siblings)
  44 siblings, 0 replies; 82+ messages in thread
From: Ingo Molnar @ 2017-11-24  9:14 UTC (permalink / raw)
  To: linux-kernel
  Cc: Dave Hansen, Andy Lutomirski, Thomas Gleixner, H . Peter Anvin,
	Peter Zijlstra, Borislav Petkov, Linus Torvalds

From: Dave Hansen <dave.hansen@linux.intel.com>

These patches are based on work from a team at Graz University of
Technology posted here: https://github.com/IAIK/KAISER

The KAISER approach keeps two copies of the page tables: one for running
in the kernel and one for running userspace.  But, there are a few
structures that are needed for switching in and out of the kernel and
a good subset of *those* are per-cpu data.

This patch creates a new kind of per-cpu data that is mapped and
can be used no matter which copy of the page tables is active.
Users of this new section will be forthcoming.
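
A minimal sketch of the intended usage (the variable name here is
illustrative; the real users follow in the next patch):

	/* Placed in .data..percpu..user_mapped, so it stays visible
	 * in both copies of the page tables: */
	DEFINE_PER_CPU_USER_MAPPED(unsigned long, kaiser_scratch);

Such a variable is then accessed like any other per-cpu variable,
e.g. via this_cpu_read()/this_cpu_write().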

Thanks to Hugh Dickins for cleanups to this code.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Daniel Gruss <daniel.gruss@iaik.tugraz.at>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Kees Cook <keescook@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michael Schwarz <michael.schwarz@iaik.tugraz.at>
Cc: Moritz Lipp <moritz.lipp@iaik.tugraz.at>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Richard Fellner <richard.fellner@student.tugraz.at>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20171123003444.196CB6DB@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 include/asm-generic/vmlinux.lds.h |  7 +++++++
 include/linux/percpu-defs.h       | 30 ++++++++++++++++++++++++++++++
 2 files changed, 37 insertions(+)

diff --git a/include/asm-generic/vmlinux.lds.h b/include/asm-generic/vmlinux.lds.h
index bdcd1caae092..e12168936d3f 100644
--- a/include/asm-generic/vmlinux.lds.h
+++ b/include/asm-generic/vmlinux.lds.h
@@ -826,7 +826,14 @@
  */
 #define PERCPU_INPUT(cacheline)						\
 	VMLINUX_SYMBOL(__per_cpu_start) = .;				\
+	VMLINUX_SYMBOL(__per_cpu_user_mapped_start) = .;		\
 	*(.data..percpu..first)						\
+	. = ALIGN(cacheline);						\
+	*(.data..percpu..user_mapped)					\
+	*(.data..percpu..user_mapped..shared_aligned)			\
+	. = ALIGN(PAGE_SIZE);						\
+	*(.data..percpu..user_mapped..page_aligned)			\
+	VMLINUX_SYMBOL(__per_cpu_user_mapped_end) = .;			\
 	. = ALIGN(PAGE_SIZE);						\
 	*(.data..percpu..page_aligned)					\
 	. = ALIGN(cacheline);						\
diff --git a/include/linux/percpu-defs.h b/include/linux/percpu-defs.h
index 2d2096ba1cfe..752513674295 100644
--- a/include/linux/percpu-defs.h
+++ b/include/linux/percpu-defs.h
@@ -35,6 +35,12 @@
 
 #endif
 
+#ifdef CONFIG_KAISER
+#define USER_MAPPED_SECTION "..user_mapped"
+#else
+#define USER_MAPPED_SECTION ""
+#endif
+
 /*
  * Base implementations of per-CPU variable declarations and definitions, where
  * the section in which the variable is to be placed is provided by the
@@ -115,6 +121,12 @@
 #define DEFINE_PER_CPU(type, name)					\
 	DEFINE_PER_CPU_SECTION(type, name, "")
 
+#define DECLARE_PER_CPU_USER_MAPPED(type, name)				\
+	DECLARE_PER_CPU_SECTION(type, name, USER_MAPPED_SECTION)
+
+#define DEFINE_PER_CPU_USER_MAPPED(type, name)				\
+	DEFINE_PER_CPU_SECTION(type, name, USER_MAPPED_SECTION)
+
 /*
  * Declaration/definition used for per-CPU variables that must come first in
  * the set of variables.
@@ -144,6 +156,14 @@
 	DEFINE_PER_CPU_SECTION(type, name, PER_CPU_SHARED_ALIGNED_SECTION) \
 	____cacheline_aligned_in_smp
 
+#define DECLARE_PER_CPU_SHARED_ALIGNED_USER_MAPPED(type, name)		\
+	DECLARE_PER_CPU_SECTION(type, name, USER_MAPPED_SECTION PER_CPU_SHARED_ALIGNED_SECTION) \
+	____cacheline_aligned_in_smp
+
+#define DEFINE_PER_CPU_SHARED_ALIGNED_USER_MAPPED(type, name)		\
+	DEFINE_PER_CPU_SECTION(type, name, USER_MAPPED_SECTION PER_CPU_SHARED_ALIGNED_SECTION) \
+	____cacheline_aligned_in_smp
+
 #define DECLARE_PER_CPU_ALIGNED(type, name)				\
 	DECLARE_PER_CPU_SECTION(type, name, PER_CPU_ALIGNED_SECTION)	\
 	____cacheline_aligned
@@ -162,6 +182,16 @@
 #define DEFINE_PER_CPU_PAGE_ALIGNED(type, name)				\
 	DEFINE_PER_CPU_SECTION(type, name, "..page_aligned")		\
 	__aligned(PAGE_SIZE)
+/*
+ * Declaration/definition used for per-CPU variables that must be page aligned and need to be mapped in user mode.
+ */
+#define DECLARE_PER_CPU_PAGE_ALIGNED_USER_MAPPED(type, name)		\
+	DECLARE_PER_CPU_SECTION(type, name, USER_MAPPED_SECTION"..page_aligned") \
+	__aligned(PAGE_SIZE)
+
+#define DEFINE_PER_CPU_PAGE_ALIGNED_USER_MAPPED(type, name)		\
+	DEFINE_PER_CPU_SECTION(type, name, USER_MAPPED_SECTION"..page_aligned") \
+	__aligned(PAGE_SIZE)
 
 /*
  * Declaration/definition used for per-CPU variables that must be read mostly.
-- 
2.14.1

^ permalink raw reply related	[flat|nested] 82+ messages in thread

* [PATCH 24/43] x86/mm/kaiser: Mark per-cpu data structures required for entry/exit
  2017-11-24  9:14 [PATCH 00/43] x86 entry-stack and Kaiser series, 2017/11/24 version Ingo Molnar
                   ` (22 preceding siblings ...)
  2017-11-24  9:14 ` [PATCH 23/43] x86/mm/kaiser: Introduce user-mapped per-cpu areas Ingo Molnar
@ 2017-11-24  9:14 ` Ingo Molnar
  2017-11-24  9:14 ` [PATCH 25/43] x86/mm/kaiser: Unmap kernel from userspace page tables (core patch) Ingo Molnar
                   ` (20 subsequent siblings)
  44 siblings, 0 replies; 82+ messages in thread
From: Ingo Molnar @ 2017-11-24  9:14 UTC (permalink / raw)
  To: linux-kernel
  Cc: Dave Hansen, Andy Lutomirski, Thomas Gleixner, H . Peter Anvin,
	Peter Zijlstra, Borislav Petkov, Linus Torvalds

From: Dave Hansen <dave.hansen@linux.intel.com>

These patches are based on work from a team at Graz University of
Technology posted here: https://github.com/IAIK/KAISER

The KAISER approach keeps two copies of the page tables: one for running
in the kernel and one for running userspace.  But, there are a few
structures that are needed for switching in and out of the kernel and
a good subset of *those* are per-cpu data.

Here's a short summary of the things mapped to userspace:
 * The gdt_page's virtual address is pointed to by the LGDT instruction.
   It is needed to define the segments.  Deeply required by CPU to run.
 * cpu_tss tells the CPU, among other things, where the new stacks are
   after user<->kernel transitions.  Needed by the CPU to make ring
   transitions.
 * exception_stacks are needed at interrupt and exception entry
   so that there is storage for, among other things, some temporary
   space to permit clobbering a register to load the kernel CR3.
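
The diff below applies the same one-line pattern to each of them: the
DEFINE/DECLARE macros are switched to their _USER_MAPPED variants,
e.g. (from the desc.h hunk):

	DECLARE_PER_CPU_PAGE_ALIGNED_USER_MAPPED(struct gdt_page, gdt_page);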

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Daniel Gruss <daniel.gruss@iaik.tugraz.at>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Kees Cook <keescook@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michael Schwarz <michael.schwarz@iaik.tugraz.at>
Cc: Moritz Lipp <moritz.lipp@iaik.tugraz.at>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Richard Fellner <richard.fellner@student.tugraz.at>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20171123003445.DF9EA351@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/include/asm/desc.h      | 2 +-
 arch/x86/include/asm/processor.h | 2 +-
 arch/x86/kernel/cpu/common.c     | 4 ++--
 arch/x86/kernel/process.c        | 2 +-
 4 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/desc.h b/arch/x86/include/asm/desc.h
index aab4fe9f49f8..300090d1c209 100644
--- a/arch/x86/include/asm/desc.h
+++ b/arch/x86/include/asm/desc.h
@@ -46,7 +46,7 @@ struct gdt_page {
 	struct desc_struct gdt[GDT_ENTRIES];
 } __attribute__((aligned(PAGE_SIZE)));
 
-DECLARE_PER_CPU_PAGE_ALIGNED(struct gdt_page, gdt_page);
+DECLARE_PER_CPU_PAGE_ALIGNED_USER_MAPPED(struct gdt_page, gdt_page);
 
 /* Provide the original GDT */
 static inline struct desc_struct *get_cpu_gdt_rw(unsigned int cpu)
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 54f3ee3bc8a0..83dd7c97ba5d 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -359,7 +359,7 @@ struct tss_struct {
 	unsigned long		io_bitmap[IO_BITMAP_LONGS + 1];
 } __aligned(PAGE_SIZE);
 
-DECLARE_PER_CPU_PAGE_ALIGNED(struct tss_struct, cpu_tss);
+DECLARE_PER_CPU_PAGE_ALIGNED_USER_MAPPED(struct tss_struct, cpu_tss);
 
 /*
  * sizeof(unsigned long) coming from an extra "long" at the end
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index f9c7e6852874..3b6920c9fef7 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -98,7 +98,7 @@ static const struct cpu_dev default_cpu = {
 
 static const struct cpu_dev *this_cpu = &default_cpu;
 
-DEFINE_PER_CPU_PAGE_ALIGNED(struct gdt_page, gdt_page) = { .gdt = {
+DEFINE_PER_CPU_PAGE_ALIGNED_USER_MAPPED(struct gdt_page, gdt_page) = { .gdt = {
 #ifdef CONFIG_X86_64
 	/*
 	 * We need valid kernel segments for data and code in long mode too
@@ -515,7 +515,7 @@ static const unsigned int exception_stack_sizes[N_EXCEPTION_STACKS] = {
 	  [DEBUG_STACK - 1]			= DEBUG_STKSZ
 };
 
-static DEFINE_PER_CPU_PAGE_ALIGNED(char, exception_stacks
+DEFINE_PER_CPU_PAGE_ALIGNED_USER_MAPPED(char, exception_stacks
 	[(N_EXCEPTION_STACKS - 1) * EXCEPTION_STKSZ + DEBUG_STKSZ]);
 #endif
 
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 6a04287f222b..9365b4f965e0 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -47,7 +47,7 @@
  * section. Since TSS's are completely CPU-local, we want them
  * on exact cacheline boundaries, to eliminate cacheline ping-pong.
  */
-__visible DEFINE_PER_CPU_SHARED_ALIGNED(struct tss_struct, cpu_tss) = {
+__visible DEFINE_PER_CPU_SHARED_ALIGNED_USER_MAPPED(struct tss_struct, cpu_tss) = {
 	.x86_tss = {
 		/*
 		 * .sp0 is only used when entering ring 0 from a lower
-- 
2.14.1

^ permalink raw reply related	[flat|nested] 82+ messages in thread

* [PATCH 25/43] x86/mm/kaiser: Unmap kernel from userspace page tables (core patch)
  2017-11-24  9:14 [PATCH 00/43] x86 entry-stack and Kaiser series, 2017/11/24 version Ingo Molnar
                   ` (23 preceding siblings ...)
  2017-11-24  9:14 ` [PATCH 24/43] x86/mm/kaiser: Mark per-cpu data structures required for entry/exit Ingo Molnar
@ 2017-11-24  9:14 ` Ingo Molnar
  2017-11-24 12:13   ` Peter Zijlstra
                     ` (2 more replies)
  2017-11-24  9:14 ` [PATCH 26/43] x86/mm/kaiser: Allow NX poison to be set in p4d/pgd Ingo Molnar
                   ` (19 subsequent siblings)
  44 siblings, 3 replies; 82+ messages in thread
From: Ingo Molnar @ 2017-11-24  9:14 UTC (permalink / raw)
  To: linux-kernel
  Cc: Dave Hansen, Andy Lutomirski, Thomas Gleixner, H . Peter Anvin,
	Peter Zijlstra, Borislav Petkov, Linus Torvalds

From: Dave Hansen <dave.hansen@linux.intel.com>

These patches are based on work from a team at Graz University of
Technology: https://github.com/IAIK/KAISER .  This work would not have
been possible without their work as a starting point.

KAISER is a countermeasure against side channel attacks against kernel
virtual memory.  It leaves the existing page tables largely alone and
refers to them as the "kernel" page tables.  It adds a "shadow" pgd for
every process, which is intended for use when running userspace.  The
shadow pgd maps all the same user memory as the "kernel" copy, but
only maps a minimal set of kernel memory.

Whenever entering the kernel (syscalls, interrupts, exceptions), the
pgd is switched to the "kernel" copy.  When switching back to user
mode, the shadow pgd is used.

The minimalistic kernel page tables try to map only what is needed to
enter/exit the kernel, such as the entry/exit functions themselves and
the interrupt descriptor table (IDT).

=== Page Table Poisoning ===

KAISER has two copies of the page tables: one for the kernel and
one for when running in userspace.  There is also a kernel
portion of each of the page tables: the part that *maps* the
kernel.

The kernel portion is relatively static and uses pre-populated
PGDs.  Nobody ever calls set_pgd() on the kernel portion during
normal operation.

The userspace portion of the page tables is updated frequently as
userspace pages are mapped and page table pages are allocated.
These updates of the userspace *portion* of the tables need to be
reflected into both the kernel and user/shadow copies.

The original KAISER patches did this by effectively looking at the
address that is being updated.  If it is <PAGE_OFFSET, it is
considered to be doing an update for the userspace portion of the page
tables and must make an entry in the shadow.

However, this has a wrinkle: there are a few places where low
addresses are used in supervisor (kernel) mode.  When EFI calls
are made, they use what are traditionally user addresses in
supervisor mode and trip over these checks.  The trampoline code that
is used for booting secondary CPUs has a similar issue.

Remember, there are two things that KAISER needs performed on a
userspace PGD:

 1. Populate the shadow itself
 2. Poison the kernel PGD so it cannot be used by userspace.

Only perform these actions when dealing with a user address *and* the
PGD has _PAGE_USER set.  That way, in-kernel users of low addresses
typically used by userspace are not accidentally poisoned.
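
A sketch of that check, as implemented by kaiser_set_shadow_pgd() in
the pgtable_64.h hunk below:

	if (pgd_userspace_access(pgd) && pgdp_maps_userspace(pgdp)) {
		/* 1. Populate the user/shadow copy of the PGD: */
		kernel_to_shadow_pgdp(pgdp)->pgd = pgd.pgd;
		/* 2. Poison the kernel copy so userspace cannot use it: */
		pgd.pgd |= _PAGE_NX;
	}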

Changes from original KAISER patch:
 * Gobs of coding style cleanups
 * The original patch tried to allocate an order-2 page, then
   8k-align the result.  That's silly since order-2 is already
   guaranteed to be 16k-aligned.  Removed that gunk and just
   allocate an order-1 page.
 * Handle (or at least detect and warn on) allocation failures
 * Use _KERNPG_TABLE, not _PAGE_TABLE when creating mappings for
   the kernel in the shadow (user) page tables.
 * BUG_ON() for !pte_none() case was totally insane: it checked
   the physical address of the 'struct page' against the physical
   address of the page being mapped.
 * Added 5-level page table support
 * Never free kaiser page tables.  We don't have the locking to
   keep them from getting referenced during the freeing process.
 * Use a totally different scheme in the entry code.  The
   original code just fell apart in horrific ways in debug faults,
   NMIs, or when iret faults.  Big thanks to Andy Lutomirski for
   reducing the number of places that needed to be patched.  He
   made the code a ton simpler.
 * Use new entry trampoline instead of mapping process stacks.

Note: The original KAISER authors signed-off on their patch.  Some of
their code has been broken out into other patches in this series, but
their SoB was only retained here.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Daniel Gruss <daniel.gruss@iaik.tugraz.at>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Kees Cook <keescook@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michael Schwarz <michael.schwarz@iaik.tugraz.at>
Cc: Moritz Lipp <moritz.lipp@iaik.tugraz.at>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Richard Fellner <richard.fellner@student.tugraz.at>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20171123003447.1DB395E3@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 Documentation/x86/kaiser.txt         | 162 +++++++++++++
 arch/x86/boot/compressed/pagetable.c |   6 +
 arch/x86/entry/calling.h             |   1 +
 arch/x86/include/asm/kaiser.h        |  57 +++++
 arch/x86/include/asm/pgtable.h       |   5 +
 arch/x86/include/asm/pgtable_64.h    | 132 +++++++++++
 arch/x86/kernel/espfix_64.c          |  17 ++
 arch/x86/kernel/head_64.S            |  14 +-
 arch/x86/mm/Makefile                 |   1 +
 arch/x86/mm/kaiser.c                 | 441 +++++++++++++++++++++++++++++++++++
 arch/x86/mm/pageattr.c               |   2 +-
 arch/x86/mm/pgtable.c                |  16 +-
 include/linux/kaiser.h               |  29 +++
 init/main.c                          |   3 +
 kernel/fork.c                        |   1 +
 15 files changed, 881 insertions(+), 6 deletions(-)

diff --git a/Documentation/x86/kaiser.txt b/Documentation/x86/kaiser.txt
new file mode 100644
index 000000000000..745c4be39b92
--- /dev/null
+++ b/Documentation/x86/kaiser.txt
@@ -0,0 +1,162 @@
+Overview
+========
+
+KAISER is a countermeasure against attacks on kernel address
+information.  There are at least three existing, published,
+approaches using the shared user/kernel mapping and hardware features
+to defeat KASLR.  One approach referenced in the paper locates the
+kernel by observing differences in page fault timing between
+present-but-inaccessible kernel pages and non-present pages.
+
+When the kernel is entered via syscalls, interrupts or exceptions,
+page tables are switched to the full "kernel" copy.  When the
+system switches back to user mode, the user/shadow copy is used.
+
+The minimalistic kernel portion of the user page tables tries to
+map only what is needed to enter/exit the kernel such as the
+entry/exit functions themselves and the interrupt descriptor
+table (IDT).  There are a few unnecessary things that get mapped
+such as the first C function when entering an interrupt (see
+comments in kaiser.c).
+
+This helps to ensure that side-channel attacks that leverage the
+paging structures do not function when KAISER is enabled.  It can be
+enabled by setting CONFIG_KAISER=y.
+
+Page Table Management
+=====================
+
+When KAISER is enabled, the kernel manages two sets of page
+tables.  The first copy is very similar to what would be present
+for a kernel without KAISER.  This includes a complete mapping of
+userspace that the kernel can use for things like copy_to_user().
+
+The second (shadow) is used when running userspace and mirrors the
+mapping of userspace present in the kernel copy.  It maps only
+the kernel data needed to enter and exit the kernel.
+
+The shadow is populated by the kaiser_add_*() functions.  Only
+kernel data which has been explicitly mapped will appear in the
+shadow copy.  These calls are rare at runtime.
+
+For a new userspace mapping, the kernel makes the entries in its
+page tables like normal.  The only difference is when the kernel
+makes entries in the top (PGD) level.  In addition to setting the
+entry in the main kernel PGD, a copy of the entry is made in the
+shadow PGD.
+
+For user space mappings the kernel creates an entry in the kernel
+PGD and the same entry in the shadow PGD, so the underlying page
+table to which the PGD entry points is shared down to the PTE
+level.  This leaves a single, shared set of userspace page tables
+to manage.  One PTE to lock, one set set of accessed bits, dirty
+bits, etc...
+
+Overhead
+========
+
+Protection against side-channel attacks is important.  But,
+this protection comes at a cost:
+
+1. Increased Memory Use
+  a. Each process now needs an order-1 PGD instead of order-0.
+     (Consumes 4k per process).
+  b. The pre-allocated second-level (p4d or pud) kernel page
+     table pages cost ~1MB of additional memory at boot.  This
+     is not totally wasted because some of these pages would
+     have been needed eventually for normal kernel page tables
+     and things in the vmalloc() area like vmemmap[].
+  c. Statically-allocated structures and entry/exit text must
+     be padded out to 4k (or 8k for PGDs) so they can be mapped
+     into the user page tables.  This bloats the kernel image
+     by ~20-30k.
+  d. The shadow page tables eventually grow to map all of used
+     vmalloc() space.  They can have roughly the same memory
+     consumption as the vmalloc() page tables.
+
+2. Runtime Cost
+  a. CR3 manipulation to switch between the page table copies
+     must be done at interrupt, syscall, and exception entry
+     and exit (it can be skipped when the kernel is interrupted,
+     though).  Moves to CR3 are on the order of a hundred
+     cycles, and are required at every entry and every exit.
+  b. Task stacks must be mapped/unmapped.  We need to walk
+     and modify the shadow page tables at fork() and exit().
+  c. Global pages are disabled.  This feature of the MMU
+     allows different processes to share TLB entries mapping
+     the kernel.  Losing the feature means potentially more
+     TLB misses after a context switch.
+  d. Process Context IDentifiers (PCID) is a CPU feature that
+     allows us to skip flushing the entire TLB when switching
+     page tables.  This makes switching the page tables (at
+     context switch, or kernel entry/exit) cheaper.  But, on
+     systems with PCID support, the context switch code must flush
+     both the user and kernel entries out of the TLB, with an
+     INVPCID in addition to the CR3 write.  This INVPCID is
+     generally slower than a CR3 write, but still on the order of
+     a hundred cycles.
+  e. The shadow page tables must be populated for each new
+     process.  Even without KAISER, the shared kernel mappings
+     are created by copying top-level (PGD) entries into each
+     new process.  But, with KAISER, there are now *two* kernel
+     mappings: one in the kernel page tables that maps everything
+     and one in the user/shadow page tables mapping the "minimal"
+     kernel.  At fork(), a copy of the portion of the shadow PGD
+     that maps the minimal kernel structures is needed in
+     addition to the normal kernel PGD.
+  f. In addition to the fork()-time copying, there must also
+     be an update to the shadow PGD any time a set_pgd() is done
+     on a PGD used to map userspace.  This ensures that the kernel
+     and user/shadow copies always map the same userspace
+     memory.
+  g. On systems without PCID support, each CR3 write flushes
+     the entire TLB.  That means that each syscall, interrupt
+     or exception flushes the TLB.
+
+Possible Future Work:
+1. We can be more careful about not actually writing to CR3
+   unless its value is actually changed.
+2. Compress the user/shadow-mapped data to be mapped together
+   underneath a single PGD entry.
+3. Re-enable global pages, but use them for mappings in the
+   user/shadow page tables.  This would allow the kernel to
+   take advantage of TLB entries that were established from
+   the user page tables.  This might speed up the entry/exit
+   code or userspace since it will not have to reload all of
+   its TLB entries.  However, its upside is limited when PCID
+   is in use.
+4. Allow KAISER to be enabled/disabled at runtime so folks can
+   run a single kernel image.
+
+Debugging:
+
+Bugs in KAISER cause a few different signatures of crashes
+that are worth noting here.
+
+ * Crashes in early boot, especially around CPU bringup.  Bugs
+   in the trampoline code or mappings cause these.
+ * Crashes at the first interrupt.  Caused by bugs in entry_64.S,
+   like screwing up a page table switch.  Also caused by
+   incorrectly mapping the IRQ handler entry code.
+ * Crashes at the first NMI.  The NMI code is separate from main
+   interrupt handlers and can have bugs that do not affect
+   normal interrupts.  Also caused by incorrectly mapping NMI
+   code.  NMIs that interrupt the entry code must be very
+   careful and can be the cause of crashes that show up when
+   running perf.
+ * Kernel crashes at the first exit to userspace.  entry_64.S
+   bugs, or failing to map some of the exit code.
+ * Crashes at first interrupt that interrupts userspace. The paths
+   in entry_64.S that return to userspace are sometimes separate
+   from the ones that return to the kernel.
+ * Double faults: overflowing the kernel stack because of page
+   faults upon page faults.  Caused by touching non-kaiser-mapped
+   data in the entry code, or forgetting to switch to kernel
+   CR3 before calling into C functions which are not kaiser-mapped.
+ * Failures of the selftests/x86 code.  Usually a bug in one of the
+   more obscure corners of entry_64.S.
+ * Userspace segfaults early in boot, sometimes manifesting
+   as mount(8) failing to mount the rootfs.  These have
+   tended to be TLB invalidation issues.  Usually invalidating
+   the wrong PCID, or otherwise missing an invalidation.
+
diff --git a/arch/x86/boot/compressed/pagetable.c b/arch/x86/boot/compressed/pagetable.c
index d5364ca2e3f9..6b40804c477c 100644
--- a/arch/x86/boot/compressed/pagetable.c
+++ b/arch/x86/boot/compressed/pagetable.c
@@ -36,6 +36,12 @@
 /* Used by pgtable.h asm code to force instruction serialization. */
 unsigned long __force_order;
 
+/*
+ * We share the kernel_ident_mapping_init(), but the early boot
+ * version does not need the KAISER logic:
+ */
+int kaiser_enabled = 0;
+
 /* Used to track our page table allocation area. */
 struct alloc_pgt_data {
 	unsigned char *pgt_buf;
diff --git a/arch/x86/entry/calling.h b/arch/x86/entry/calling.h
index e1650da01323..d087c3aa0514 100644
--- a/arch/x86/entry/calling.h
+++ b/arch/x86/entry/calling.h
@@ -2,6 +2,7 @@
 #include <linux/jump_label.h>
 #include <asm/unwind_hints.h>
 #include <asm/cpufeatures.h>
+#include <asm/page_types.h>
 
 /*
 
diff --git a/arch/x86/include/asm/kaiser.h b/arch/x86/include/asm/kaiser.h
new file mode 100644
index 000000000000..3c2cc71b4058
--- /dev/null
+++ b/arch/x86/include/asm/kaiser.h
@@ -0,0 +1,57 @@
+#ifndef _ASM_X86_KAISER_H
+#define _ASM_X86_KAISER_H
+/*
+ * Copyright(c) 2017 Intel Corporation. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * Based on work published here: https://github.com/IAIK/KAISER
+ * Modified by Dave Hansen <dave.hansen@intel.com> to actually work.
+ */
+#ifndef __ASSEMBLY__
+
+#ifdef CONFIG_KAISER
+/**
+ *  kaiser_add_mapping - map a kernel range into the user page tables
+ *  @addr: the start address of the range
+ *  @size: the size of the range
+ *  @flags: The mapping flags of the pages
+ *
+ *  Use this on all data and code that need to be mapped into both
+ *  copies of the page tables.  This includes the code that switches
+ *  to/from userspace and all of the hardware structures that are
+ *  virtually-addressed and needed in userspace like the interrupt
+ *  table.
+ */
+extern int kaiser_add_mapping(unsigned long addr, unsigned long size,
+			      unsigned long flags);
+
+/**
+ *  kaiser_remove_mapping - remove a kernel mapping from the userpage tables
+ *  @addr: the start address of the range
+ *  @size: the size of the range
+ */
+extern void kaiser_remove_mapping(unsigned long start, unsigned long size);
+
+/**
+ *  kaiser_init - Initialize the shadow mapping
+ *
+ *  Most parts of the shadow mapping can be mapped upon boot
+ *  time.  Only per-process things like the thread stacks
+ *  or a new LDT have to be mapped at runtime.  These boot-
+ *  time mappings are permanent and never unmapped.
+ */
+extern void kaiser_init(void);
+
+#endif
+
+#endif /* __ASSEMBLY__ */
+
+#endif /* _ASM_X86_KAISER_H */
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index f735c3016325..d3901124143f 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -1106,6 +1106,11 @@ static inline void pmdp_set_wrprotect(struct mm_struct *mm,
 static inline void clone_pgd_range(pgd_t *dst, pgd_t *src, int count)
 {
        memcpy(dst, src, count * sizeof(pgd_t));
+#ifdef CONFIG_KAISER
+	/* Clone the shadow pgd part as well */
+	memcpy(kernel_to_shadow_pgdp(dst), kernel_to_shadow_pgdp(src),
+	       count * sizeof(pgd_t));
+#endif
 }
 
 #define PTE_SHIFT ilog2(PTRS_PER_PTE)
diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h
index e9f05331e732..c239839e92bd 100644
--- a/arch/x86/include/asm/pgtable_64.h
+++ b/arch/x86/include/asm/pgtable_64.h
@@ -131,9 +131,137 @@ static inline pud_t native_pudp_get_and_clear(pud_t *xp)
 #endif
 }
 
+#ifdef CONFIG_KAISER
+/*
+ * All top-level KAISER page tables are order-1 pages (8k-aligned
+ * and 8k in size).  The kernel one is at the beginning 4k and
+ * the user (shadow) one is in the last 4k.  To switch between
+ * them, you just need to flip the 12th bit in their addresses.
+ */
+#define KAISER_PGTABLE_SWITCH_BIT	PAGE_SHIFT
+
+/*
+ * This generates better code than the inline assembly in
+ * __set_bit().
+ */
+static inline void *ptr_set_bit(void *ptr, int bit)
+{
+	unsigned long __ptr = (unsigned long)ptr;
+
+	__ptr |= (1<<bit);
+	return (void *)__ptr;
+}
+static inline void *ptr_clear_bit(void *ptr, int bit)
+{
+	unsigned long __ptr = (unsigned long)ptr;
+
+	__ptr &= ~(1<<bit);
+	return (void *)__ptr;
+}
+
+static inline pgd_t *kernel_to_shadow_pgdp(pgd_t *pgdp)
+{
+	return ptr_set_bit(pgdp, KAISER_PGTABLE_SWITCH_BIT);
+}
+static inline pgd_t *shadow_to_kernel_pgdp(pgd_t *pgdp)
+{
+	return ptr_clear_bit(pgdp, KAISER_PGTABLE_SWITCH_BIT);
+}
+static inline p4d_t *kernel_to_shadow_p4dp(p4d_t *p4dp)
+{
+	return ptr_set_bit(p4dp, KAISER_PGTABLE_SWITCH_BIT);
+}
+static inline p4d_t *shadow_to_kernel_p4dp(p4d_t *p4dp)
+{
+	return ptr_clear_bit(p4dp, KAISER_PGTABLE_SWITCH_BIT);
+}
+#endif /* CONFIG_KAISER */
+
+/*
+ * Page table pages are page-aligned.  The lower half of the top
+ * level is used for userspace and the top half for the kernel.
+ *
+ * Returns true for parts of the PGD that map userspace and
+ * false for the parts that map the kernel.
+ */
+static inline bool pgdp_maps_userspace(void *__ptr)
+{
+	unsigned long ptr = (unsigned long)__ptr;
+
+	return (ptr & ~PAGE_MASK) < (PAGE_SIZE / 2);
+}
+
+/*
+ * Does this PGD allow access from userspace?
+ */
+static inline bool pgd_userspace_access(pgd_t pgd)
+{
+	return pgd.pgd & _PAGE_USER;
+}
+
+/*
+ * Take a PGD location (pgdp) and a pgd value that needs
+ * to be set there.  Populates the shadow and returns
+ * the resulting PGD that must be set in the kernel copy
+ * of the page tables.
+ */
+static inline pgd_t kaiser_set_shadow_pgd(pgd_t *pgdp, pgd_t pgd)
+{
+#ifdef CONFIG_KAISER
+	if (pgd_userspace_access(pgd)) {
+		if (pgdp_maps_userspace(pgdp)) {
+			/*
+			 * The user/shadow page tables get the full
+			 * PGD, accessible from userspace:
+			 */
+			kernel_to_shadow_pgdp(pgdp)->pgd = pgd.pgd;
+			/*
+			 * For the copy of the pgd that the kernel
+			 * uses, make it unusable to userspace.  This
+			 * ensures if we get out to userspace with the
+			 * wrong CR3 value, userspace will crash
+			 * instead of running.
+			 */
+			pgd.pgd |= _PAGE_NX;
+		}
+	} else if (pgd_userspace_access(*pgdp)) {
+		/*
+		 * We are clearing a _PAGE_USER PGD for which we
+		 * presumably populated the shadow.  We must now
+		 * clear the shadow PGD entry.
+		 */
+		if (pgdp_maps_userspace(pgdp)) {
+			kernel_to_shadow_pgdp(pgdp)->pgd = pgd.pgd;
+		} else {
+			/*
+			 * Attempted to clear a _PAGE_USER PGD which
+		 * is in the kernel portion of the address
+			 * space.  PGDs are pre-populated and we
+			 * never clear them.
+			 */
+			WARN_ON_ONCE(1);
+		}
+	} else {
+		/*
+		 * _PAGE_USER was not set in either the PGD being set
+		 * or cleared.  All kernel PGDs should be
+		 * pre-populated so this should never happen after
+		 * boot.
+		 */
+	}
+#endif
+	/* return the copy of the PGD we want the kernel to use: */
+	return pgd;
+}
+
+
 static inline void native_set_p4d(p4d_t *p4dp, p4d_t p4d)
 {
+#if defined(CONFIG_KAISER) && !defined(CONFIG_X86_5LEVEL)
+	p4dp->pgd = kaiser_set_shadow_pgd(&p4dp->pgd, p4d.pgd);
+#else /* CONFIG_KAISER */
 	*p4dp = p4d;
+#endif
 }
 
 static inline void native_p4d_clear(p4d_t *p4d)
@@ -147,7 +275,11 @@ static inline void native_p4d_clear(p4d_t *p4d)
 
 static inline void native_set_pgd(pgd_t *pgdp, pgd_t pgd)
 {
+#ifdef CONFIG_KAISER
+	*pgdp = kaiser_set_shadow_pgd(pgdp, pgd);
+#else /* CONFIG_KAISER */
 	*pgdp = pgd;
+#endif
 }
 
 static inline void native_pgd_clear(pgd_t *pgd)
diff --git a/arch/x86/kernel/espfix_64.c b/arch/x86/kernel/espfix_64.c
index 7d7715dde901..4780dba2cc59 100644
--- a/arch/x86/kernel/espfix_64.c
+++ b/arch/x86/kernel/espfix_64.c
@@ -41,6 +41,7 @@
 #include <asm/pgalloc.h>
 #include <asm/setup.h>
 #include <asm/espfix.h>
+#include <asm/kaiser.h>
 
 /*
  * Note: we only need 6*8 = 48 bytes for the espfix stack, but round
@@ -128,6 +129,22 @@ void __init init_espfix_bsp(void)
 	pgd = &init_top_pgt[pgd_index(ESPFIX_BASE_ADDR)];
 	p4d = p4d_alloc(&init_mm, pgd, ESPFIX_BASE_ADDR);
 	p4d_populate(&init_mm, p4d, espfix_pud_page);
+	/*
+	 * Just copy the top-level PGD that is mapping the espfix
+	 * area to ensure it is mapped into the shadow user page
+	 * tables.
+	 *
+	 * For 5-level paging, the espfix pgd was populated when
+	 * kaiser_init() pre-populated all the pgd entries.  The above
+	 * p4d_alloc() would never do anything and the p4d_populate()
+	 * would be done to a p4d already mapped in the userspace pgd.
+	 */
+#ifdef CONFIG_KAISER
+	if (CONFIG_PGTABLE_LEVELS <= 4) {
+		set_pgd(kernel_to_shadow_pgdp(pgd),
+			__pgd(_KERNPG_TABLE | (p4d_pfn(*p4d) << PAGE_SHIFT)));
+	}
+#endif
 
 	/* Randomize the locations */
 	init_espfix_random();
diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
index 7dca675fe78d..43d1cffd1fcf 100644
--- a/arch/x86/kernel/head_64.S
+++ b/arch/x86/kernel/head_64.S
@@ -341,6 +341,14 @@ GLOBAL(early_recursion_flag)
 	.balign	PAGE_SIZE; \
 GLOBAL(name)
 
+#ifdef CONFIG_KAISER
+#define NEXT_PGD_PAGE(name) \
+	.balign 2 * PAGE_SIZE; \
+GLOBAL(name)
+#else
+#define NEXT_PGD_PAGE(name) NEXT_PAGE(name)
+#endif
+
 /* Automate the creation of 1 to 1 mapping pmd entries */
 #define PMDS(START, PERM, COUNT)			\
 	i = 0 ;						\
@@ -350,7 +358,7 @@ GLOBAL(name)
 	.endr
 
 	__INITDATA
-NEXT_PAGE(early_top_pgt)
+NEXT_PGD_PAGE(early_top_pgt)
 	.fill	511,8,0
 #ifdef CONFIG_X86_5LEVEL
 	.quad	level4_kernel_pgt - __START_KERNEL_map + _PAGE_TABLE_NOENC
@@ -364,7 +372,7 @@ NEXT_PAGE(early_dynamic_pgts)
 	.data
 
 #if defined(CONFIG_XEN_PV) || defined(CONFIG_XEN_PVH)
-NEXT_PAGE(init_top_pgt)
+NEXT_PGD_PAGE(init_top_pgt)
 	.quad   level3_ident_pgt - __START_KERNEL_map + _KERNPG_TABLE_NOENC
 	.org    init_top_pgt + PGD_PAGE_OFFSET*8, 0
 	.quad   level3_ident_pgt - __START_KERNEL_map + _KERNPG_TABLE_NOENC
@@ -381,7 +389,7 @@ NEXT_PAGE(level2_ident_pgt)
 	 */
 	PMDS(0, __PAGE_KERNEL_IDENT_LARGE_EXEC, PTRS_PER_PMD)
 #else
-NEXT_PAGE(init_top_pgt)
+NEXT_PGD_PAGE(init_top_pgt)
 	.fill	512,8,0
 #endif
 
diff --git a/arch/x86/mm/Makefile b/arch/x86/mm/Makefile
index 7ba7f3d7f477..1684e8891165 100644
--- a/arch/x86/mm/Makefile
+++ b/arch/x86/mm/Makefile
@@ -46,6 +46,7 @@ obj-$(CONFIG_NUMA_EMU)		+= numa_emulation.o
 obj-$(CONFIG_X86_INTEL_MPX)	+= mpx.o
 obj-$(CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS) += pkeys.o
 obj-$(CONFIG_RANDOMIZE_MEMORY) += kaslr.o
+obj-$(CONFIG_KAISER)		+= kaiser.o
 
 obj-$(CONFIG_AMD_MEM_ENCRYPT)	+= mem_encrypt.o
 obj-$(CONFIG_AMD_MEM_ENCRYPT)	+= mem_encrypt_boot.o
diff --git a/arch/x86/mm/kaiser.c b/arch/x86/mm/kaiser.c
new file mode 100644
index 000000000000..7f7561e9971d
--- /dev/null
+++ b/arch/x86/mm/kaiser.c
@@ -0,0 +1,441 @@
+/*
+ * Copyright(c) 2017 Intel Corporation. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * This code is based in part on work published here:
+ *
+ *	https://github.com/IAIK/KAISER
+ *
+ * The original work was written and signed off for the Linux
+ * kernel by:
+ *
+ *   Signed-off-by: Richard Fellner <richard.fellner@student.tugraz.at>
+ *   Signed-off-by: Moritz Lipp <moritz.lipp@iaik.tugraz.at>
+ *   Signed-off-by: Daniel Gruss <daniel.gruss@iaik.tugraz.at>
+ *   Signed-off-by: Michael Schwarz <michael.schwarz@iaik.tugraz.at>
+ *
+ * Major changes to the original code by: Dave Hansen <dave.hansen@intel.com>
+ */
+#include <linux/kernel.h>
+#include <linux/errno.h>
+#include <linux/string.h>
+#include <linux/types.h>
+#include <linux/bug.h>
+#include <linux/init.h>
+#include <linux/spinlock.h>
+#include <linux/mm.h>
+#include <linux/uaccess.h>
+
+#include <asm/kaiser.h>
+#include <asm/pgtable.h>
+#include <asm/pgalloc.h>
+#include <asm/tlbflush.h>
+#include <asm/desc.h>
+
+#define KAISER_WALK_ATOMIC  0x1
+
+/*
+ * At runtime, the only things we map are some things for CPU
+ * hotplug, and stacks for new processes.  No two CPUs will ever
+ * be populating the same addresses, so we only need to ensure
+ * that we protect between two CPUs trying to allocate and
+ * populate the same page table page.
+ *
+ * Only take this lock when doing a set_p[4um]d(), but it is not
+ * needed for doing a set_pte().  We assume that only the *owner*
+ * of a given allocation will be doing this for _their_
+ * allocation.
+ *
+ * This ensures that once a system has been running for a while
+ * and there have been stacks all over and these page tables
+ * are fully populated, there will be no further acquisitions of
+ * this lock.
+ */
+static DEFINE_SPINLOCK(shadow_table_allocation_lock);
+
+/*
+ * This is only for walking kernel addresses.  We use it to help
+ * recreate the "shadow" page tables which are used while we are in
+ * userspace.
+ *
+ * This can be called on any kernel memory addresses and will work
+ * with any page sizes and any types: normal linear map memory,
+ * vmalloc(), even kmap().
+ *
+ * Note: this is only used when mapping new *kernel* entries into
+ * the user/shadow page tables.  It is never used for userspace
+ * addresses.
+ *
+ * Returns -1 on error.
+ */
+static inline unsigned long get_pa_from_kernel_map(unsigned long vaddr)
+{
+	pgd_t *pgd;
+	p4d_t *p4d;
+	pud_t *pud;
+	pmd_t *pmd;
+	pte_t *pte;
+
+	/* We should only be asked to walk kernel addresses */
+	if (vaddr < PAGE_OFFSET) {
+		WARN_ON_ONCE(1);
+		return -1;
+	}
+
+	pgd = pgd_offset_k(vaddr);
+	/*
+	 * We made all the kernel PGDs present in kaiser_init().
+	 * We expect them to stay that way.
+	 */
+	if (pgd_none(*pgd)) {
+		WARN_ON_ONCE(1);
+		return -1;
+	}
+	/*
+	 * PGDs are either 512GB or 128TB on all x86_64
+	 * configurations.  We don't handle these.
+	 */
+	BUILD_BUG_ON(pgd_large(*pgd) != 0);
+
+	p4d = p4d_offset(pgd, vaddr);
+	if (p4d_none(*p4d)) {
+		WARN_ON_ONCE(1);
+		return -1;
+	}
+
+	pud = pud_offset(p4d, vaddr);
+	if (pud_none(*pud)) {
+		WARN_ON_ONCE(1);
+		return -1;
+	}
+
+	if (pud_large(*pud))
+		return (pud_pfn(*pud) << PAGE_SHIFT) | (vaddr & ~PUD_PAGE_MASK);
+
+	pmd = pmd_offset(pud, vaddr);
+	if (pmd_none(*pmd)) {
+		WARN_ON_ONCE(1);
+		return -1;
+	}
+
+	if (pmd_large(*pmd))
+		return (pmd_pfn(*pmd) << PAGE_SHIFT) | (vaddr & ~PMD_PAGE_MASK);
+
+	pte = pte_offset_kernel(pmd, vaddr);
+	if (pte_none(*pte)) {
+		WARN_ON_ONCE(1);
+		return -1;
+	}
+
+	return (pte_pfn(*pte) << PAGE_SHIFT) | (vaddr & ~PAGE_MASK);
+}
+
+/*
+ * Walk the shadow copy of the page tables (optionally) trying to
+ * allocate page table pages on the way down.  Does not support
+ * large pages since the data we are mapping is (generally) not
+ * large enough or aligned to 2MB.
+ *
+ * Note: this is only used when mapping *new* kernel data into the
+ * user/shadow page tables.  It is never used for userspace data.
+ *
+ * Returns a pointer to a PTE on success, or NULL on failure.
+ */
+static pte_t *kaiser_shadow_pagetable_walk(unsigned long address,
+					   unsigned long flags)
+{
+	pte_t *pte;
+	pmd_t *pmd;
+	pud_t *pud;
+	p4d_t *p4d;
+	pgd_t *pgd = kernel_to_shadow_pgdp(pgd_offset_k(address));
+	gfp_t gfp = (GFP_KERNEL | __GFP_NOTRACK | __GFP_ZERO);
+
+	if (flags & KAISER_WALK_ATOMIC) {
+		gfp &= ~GFP_KERNEL;
+		gfp |= __GFP_HIGH | __GFP_ATOMIC;
+	}
+
+	if (address < PAGE_OFFSET) {
+		WARN_ONCE(1, "attempt to walk user address\n");
+		return NULL;
+	}
+
+	if (pgd_none(*pgd)) {
+		WARN_ONCE(1, "All shadow pgds should have been populated\n");
+		return NULL;
+	}
+	BUILD_BUG_ON(pgd_large(*pgd) != 0);
+
+	p4d = p4d_offset(pgd, address);
+	BUILD_BUG_ON(p4d_large(*p4d) != 0);
+	if (p4d_none(*p4d)) {
+		unsigned long new_pud_page = __get_free_page(gfp);
+		if (!new_pud_page)
+			return NULL;
+
+		spin_lock(&shadow_table_allocation_lock);
+		if (p4d_none(*p4d))
+			set_p4d(p4d, __p4d(_KERNPG_TABLE | __pa(new_pud_page)));
+		else
+			free_page(new_pud_page);
+		spin_unlock(&shadow_table_allocation_lock);
+	}
+
+	pud = pud_offset(p4d, address);
+	/* The shadow page tables do not use large mappings: */
+	if (pud_large(*pud)) {
+		WARN_ON(1);
+		return NULL;
+	}
+	if (pud_none(*pud)) {
+		unsigned long new_pmd_page = __get_free_page(gfp);
+		if (!new_pmd_page)
+			return NULL;
+
+		spin_lock(&shadow_table_allocation_lock);
+		if (pud_none(*pud))
+			set_pud(pud, __pud(_KERNPG_TABLE | __pa(new_pmd_page)));
+		else
+			free_page(new_pmd_page);
+		spin_unlock(&shadow_table_allocation_lock);
+	}
+
+	pmd = pmd_offset(pud, address);
+	/* The shadow page tables do not use large mappings: */
+	if (pmd_large(*pmd)) {
+		WARN_ON(1);
+		return NULL;
+	}
+	if (pmd_none(*pmd)) {
+		unsigned long new_pte_page = __get_free_page(gfp);
+		if (!new_pte_page)
+			return NULL;
+
+		spin_lock(&shadow_table_allocation_lock);
+		if (pmd_none(*pmd))
+			set_pmd(pmd, __pmd(_KERNPG_TABLE  | __pa(new_pte_page)));
+		else
+			free_page(new_pte_page);
+		spin_unlock(&shadow_table_allocation_lock);
+	}
+
+	pte = pte_offset_kernel(pmd, address);
+	if (pte_flags(*pte) & _PAGE_USER) {
+		WARN_ONCE(1, "attempt to walk to user pte\n");
+		return NULL;
+	}
+	return pte;
+}
+
+/*
+ * Given a kernel address, @__start_addr, copy that mapping into
+ * the user (shadow) page tables.  This may need to allocate page
+ * table pages.
+ */
+int kaiser_add_user_map(const void *__start_addr, unsigned long size,
+			unsigned long flags)
+{
+	pte_t *pte;
+	unsigned long start_addr = (unsigned long)__start_addr;
+	unsigned long address = start_addr & PAGE_MASK;
+	unsigned long end_addr = PAGE_ALIGN(start_addr + size);
+	unsigned long target_address;
+
+	for (; address < end_addr; address += PAGE_SIZE) {
+		target_address = get_pa_from_kernel_map(address);
+		if (target_address == -1)
+			return -EIO;
+
+		pte = kaiser_shadow_pagetable_walk(address, 0);
+		/*
+		 * Errors come from either -ENOMEM for a page
+		 * table page, or something screwy that did a
+		 * WARN_ON().  Just return -ENOMEM.
+		 */
+		if (!pte)
+			return -ENOMEM;
+		if (pte_none(*pte)) {
+			set_pte(pte, __pte(flags | target_address));
+		} else {
+			pte_t tmp;
+			/*
+			 * Make a fake, temporary PTE that mimics the
+			 * one we would have created.
+			 */
+			set_pte(&tmp, __pte(flags | target_address));
+			/*
+			 * Warn if the pte that would have been
+			 * created is different from the one that
+			 * was there previously.  In other words,
+			 * we allow the same PTE value to be set,
+			 * but not changed.
+			 */
+			WARN_ON_ONCE(!pte_same(*pte, tmp));
+		}
+	}
+	return 0;
+}
+
+int kaiser_add_user_map_ptrs(const void *__start_addr,
+			     const void *__end_addr,
+			     unsigned long flags)
+{
+	return kaiser_add_user_map(__start_addr,
+				   __end_addr - __start_addr,
+				   flags);
+}
+
+/*
+ * Ensure that the top level of the (shadow) page tables are
+ * entirely populated.  This ensures that all processes that get
+ * forked have the same entries.  This way, we never have to go
+ * set up new entries in older processes.
+ *
+ * Note: we never free these, so there are no updates to them
+ * after this.
+ */
+static void __init kaiser_init_all_pgds(void)
+{
+	pgd_t *pgd;
+	int i;
+
+	pgd = kernel_to_shadow_pgdp(pgd_offset_k(0UL));
+	for (i = PTRS_PER_PGD / 2; i < PTRS_PER_PGD; i++) {
+		/*
+		 * Each PGD entry moves up PGDIR_SIZE bytes through
+		 * the address space, so get the first virtual
+		 * address mapped by PGD #i:
+		 */
+		unsigned long addr = i * PGDIR_SIZE;
+#if CONFIG_PGTABLE_LEVELS > 4
+		p4d_t *p4d = p4d_alloc_one(&init_mm, addr);
+		if (!p4d) {
+			WARN_ON(1);
+			break;
+		}
+		set_pgd(pgd + i, __pgd(_KERNPG_TABLE | __pa(p4d)));
+#else /* CONFIG_PGTABLE_LEVELS <= 4 */
+		pud_t *pud = pud_alloc_one(&init_mm, addr);
+		if (!pud) {
+			WARN_ON(1);
+			break;
+		}
+		set_pgd(pgd + i, __pgd(_KERNPG_TABLE | __pa(pud)));
+#endif /* CONFIG_PGTABLE_LEVELS */
+	}
+}
+
+/*
+ * Page table allocations called by kaiser_add_user_map() can
+ * theoretically fail, but are very unlikely to fail in early boot.
+ * If one does fail, this at least prints a warning before the crash.
+ *
+ * Do the checking and warning in a macro to make it more readable
+ * and to preserve the line numbers in the warning message, which
+ * an inline function would not give us.
+ */
+#define kaiser_add_user_map_early(start, size, flags) do {	\
+	int __ret = kaiser_add_user_map(start, size, flags);	\
+	WARN_ON(__ret);						\
+} while (0)
+
+#define kaiser_add_user_map_ptrs_early(start, end, flags) do {		\
+	int __ret = kaiser_add_user_map_ptrs(start, end, flags);	\
+	WARN_ON(__ret);							\
+} while (0)
+
+extern char __per_cpu_user_mapped_start[], __per_cpu_user_mapped_end[];
+/*
+ * If anything in here fails, we will likely die on one of the
+ * first kernel->user transitions, taking init down with us.  But
+ * we will have most of the kernel up by then and should be able
+ * to get a clean warning out of it.  If we BUG_ON() here, we run
+ * the risk of crashing before we have good console output.
+ *
+ * When KAISER is enabled, we remove _PAGE_GLOBAL from all of the
+ * kernel PTE permissions.  This ensures that the TLB entries for
+ * the kernel are not available when in userspace.  However, for
+ * the pages that are available to userspace *anyway*, we might as
+ * well continue to map them _PAGE_GLOBAL and enjoy the potential
+ * performance advantages.
+ */
+void __init kaiser_init(void)
+{
+	int cpu;
+
+	kaiser_init_all_pgds();
+
+	for_each_possible_cpu(cpu) {
+		void *percpu_vaddr = __per_cpu_user_mapped_start +
+				     per_cpu_offset(cpu);
+		unsigned long percpu_sz = __per_cpu_user_mapped_end -
+					  __per_cpu_user_mapped_start;
+		kaiser_add_user_map_early(percpu_vaddr, percpu_sz,
+					  __PAGE_KERNEL | _PAGE_GLOBAL);
+	}
+
+	kaiser_add_user_map_ptrs_early(__entry_text_start, __entry_text_end,
+				       __PAGE_KERNEL_RX | _PAGE_GLOBAL);
+
+	/* the fixed map address of the idt_table */
+	kaiser_add_user_map_early((void *)idt_descr.address,
+				  sizeof(gate_desc) * NR_VECTORS,
+				  __PAGE_KERNEL_RO | _PAGE_GLOBAL);
+}
+
+int kaiser_add_mapping(unsigned long addr, unsigned long size,
+		       unsigned long flags)
+{
+	return kaiser_add_user_map((const void *)addr, size, flags);
+}
+
+void kaiser_remove_mapping(unsigned long start, unsigned long size)
+{
+	unsigned long addr;
+
+	/* The shadow page tables always use small pages: */
+	for (addr = start; addr < start + size; addr += PAGE_SIZE) {
+		/*
+		 * Do an "atomic" walk in case this got called from an atomic
+		 * context.  This should not do any allocations because we
+		 * should only be walking things that are known to be mapped.
+		 */
+		pte_t *pte = kaiser_shadow_pagetable_walk(addr, KAISER_WALK_ATOMIC);
+
+		/*
+		 * We are removing a mapping that should
+		 * exist.  WARN if it was not there:
+		 */
+		if (!pte) {
+			WARN_ON_ONCE(1);
+			continue;
+		}
+
+		pte_clear(&init_mm, addr, pte);
+	}
+	/*
+	 * This ensures that the TLB entries used to map this data are
+	 * no longer usable on *this* CPU.  We theoretically want to
+	 * flush the entries on all CPUs here, but that's too
+	 * expensive right now: this is called to unmap process
+	 * stacks in the exit() path.
+	 *
+	 * This can change if we get to the point where this is not
+	 * in a remotely hot path, like only called via write_ldt().
+	 *
+	 * Note: we could probably also just invalidate the individual
+	 * addresses to take care of *this* PCID and then do a
+	 * tlb_flush_shared_nonglobals() to ensure that all other
+	 * PCIDs get flushed before being used again.
+	 */
+	__native_flush_tlb_global();
+}
diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c
index ffe584fa1f5e..1b3dbf3b3846 100644
--- a/arch/x86/mm/pageattr.c
+++ b/arch/x86/mm/pageattr.c
@@ -859,7 +859,7 @@ static void unmap_pmd_range(pud_t *pud, unsigned long start, unsigned long end)
 			pud_clear(pud);
 }
 
-static void unmap_pud_range(p4d_t *p4d, unsigned long start, unsigned long end)
+void unmap_pud_range(p4d_t *p4d, unsigned long start, unsigned long end)
 {
 	pud_t *pud = pud_offset(p4d, start);
 
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index 17ebc5a978cc..1e47ce734404 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -355,14 +355,26 @@ static inline void _pgd_free(pgd_t *pgd)
 		kmem_cache_free(pgd_cache, pgd);
 }
 #else
+
+#ifdef CONFIG_KAISER
+/*
+ * Instead of one pgd, we acquire two pgds.  Being order-1, it is
+ * both 8k in size and 8k-aligned.  That lets us just flip bit 12
+ * in a pointer to swap between the two 4k halves.
+ */
+#define PGD_ALLOCATION_ORDER 1
+#else
+#define PGD_ALLOCATION_ORDER 0
+#endif
+
 static inline pgd_t *_pgd_alloc(void)
 {
-	return (pgd_t *)__get_free_page(PGALLOC_GFP);
+	return (pgd_t *)__get_free_pages(PGALLOC_GFP, PGD_ALLOCATION_ORDER);
 }
 
 static inline void _pgd_free(pgd_t *pgd)
 {
-	free_page((unsigned long)pgd);
+	free_pages((unsigned long)pgd, PGD_ALLOCATION_ORDER);
 }
 #endif /* CONFIG_X86_PAE */
 
diff --git a/include/linux/kaiser.h b/include/linux/kaiser.h
new file mode 100644
index 000000000000..0fd800efa95c
--- /dev/null
+++ b/include/linux/kaiser.h
@@ -0,0 +1,29 @@
+#ifndef _INCLUDE_KAISER_H
+#define _INCLUDE_KAISER_H
+
+#ifdef CONFIG_KAISER
+#include <asm/kaiser.h>
+#else
+
+/*
+ * These stubs are used whenever CONFIG_KAISER is off, which
+ * includes architectures that support KAISER, but have it
+ * disabled.
+ */
+
+static inline void kaiser_init(void)
+{
+}
+
+static inline void kaiser_remove_mapping(unsigned long start, unsigned long size)
+{
+}
+
+static inline int kaiser_add_mapping(unsigned long addr, unsigned long size,
+				     unsigned long flags)
+{
+	return 0;
+}
+
+#endif /* !CONFIG_KAISER */
+#endif /* _INCLUDE_KAISER_H */
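
[ Illustration, not part of the patch: a hypothetical caller of the
  stub API above (the function names here are made up).  It pairs
  the two calls the same way the dsalloc()/dsfree() helpers later
  in this series do; both calls compile away when CONFIG_KAISER is
  off: ]

	#include <linux/gfp.h>
	#include <linux/kaiser.h>

	static void *example_alloc_user_visible(void)
	{
		unsigned long addr = get_zeroed_page(GFP_KERNEL);

		if (!addr)
			return NULL;
		/* Mirror the page into the user/shadow page tables: */
		if (kaiser_add_mapping(addr, PAGE_SIZE, __PAGE_KERNEL)) {
			free_page(addr);
			return NULL;
		}
		return (void *)addr;
	}

	static void example_free_user_visible(void *p)
	{
		kaiser_remove_mapping((unsigned long)p, PAGE_SIZE);
		free_page((unsigned long)p);
	}
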
diff --git a/init/main.c b/init/main.c
index 3bdd8da90f69..559bc0a6e9ad 100644
--- a/init/main.c
+++ b/init/main.c
@@ -76,6 +76,7 @@
 #include <linux/slab.h>
 #include <linux/perf_event.h>
 #include <linux/ptrace.h>
+#include <linux/kaiser.h>
 #include <linux/blkdev.h>
 #include <linux/elevator.h>
 #include <linux/sched_clock.h>
@@ -505,6 +506,8 @@ static void __init mm_init(void)
 	pgtable_init();
 	vmalloc_init();
 	ioremap_huge_init();
+	/* This just needs to be done before we first run userspace: */
+	kaiser_init();
 }
 
 asmlinkage __visible void __init start_kernel(void)
diff --git a/kernel/fork.c b/kernel/fork.c
index 07cc743698d3..685202058d65 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -70,6 +70,7 @@
 #include <linux/tsacct_kern.h>
 #include <linux/cn_proc.h>
 #include <linux/freezer.h>
+#include <linux/kaiser.h>
 #include <linux/delayacct.h>
 #include <linux/taskstats_kern.h>
 #include <linux/random.h>
-- 
2.14.1

^ permalink raw reply related	[flat|nested] 82+ messages in thread

* [PATCH 26/43] x86/mm/kaiser: Allow NX poison to be set in p4d/pgd
  2017-11-24  9:14 [PATCH 00/43] x86 entry-stack and Kaiser series, 2017/11/24 version Ingo Molnar
                   ` (24 preceding siblings ...)
  2017-11-24  9:14 ` [PATCH 25/43] x86/mm/kaiser: Unmap kernel from userspace page tables (core patch) Ingo Molnar
@ 2017-11-24  9:14 ` Ingo Molnar
  2017-11-24  9:14 ` [PATCH 27/43] x86/mm/kaiser: Make sure static PGDs are 8k in size Ingo Molnar
                   ` (18 subsequent siblings)
  44 siblings, 0 replies; 82+ messages in thread
From: Ingo Molnar @ 2017-11-24  9:14 UTC (permalink / raw)
  To: linux-kernel
  Cc: Dave Hansen, Andy Lutomirski, Thomas Gleixner, H . Peter Anvin,
	Peter Zijlstra, Borislav Petkov, Linus Torvalds

From: Dave Hansen <dave.hansen@linux.intel.com>

The user portion of the kernel page tables uses the NX bit to
poison its entries against use from userspace.  But that trips
the p4d/pgd_bad() checks.  Make sure it does not do that.
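
For illustration (a sketch, not taken from the patch; 'entry' here
stands for any populated kernel PGD entry): a poisoned user-half
entry, as created elsewhere in this series, is just a normal kernel
table entry with _PAGE_NX added:

	pgd_t poisoned = __pgd(pgd_val(entry) | _PAGE_NX);

The old check, (pgd_flags(pgd) & ~_PAGE_USER) != _KERNPG_TABLE, does
not mask off that extra _PAGE_NX bit and thus reports the entry as
bad, which is why _PAGE_NX gets added to 'ignore_flags' below.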

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Daniel Gruss <daniel.gruss@iaik.tugraz.at>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Kees Cook <keescook@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michael Schwarz <michael.schwarz@iaik.tugraz.at>
Cc: Moritz Lipp <moritz.lipp@iaik.tugraz.at>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Richard Fellner <richard.fellner@student.tugraz.at>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20171123003448.C6AB3575@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/include/asm/pgtable.h | 14 ++++++++++++--
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index d3901124143f..9cceaf6c0405 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -846,7 +846,12 @@ static inline pud_t *pud_offset(p4d_t *p4d, unsigned long address)
 
 static inline int p4d_bad(p4d_t p4d)
 {
-	return (p4d_flags(p4d) & ~(_KERNPG_TABLE | _PAGE_USER)) != 0;
+	unsigned long ignore_flags = _KERNPG_TABLE | _PAGE_USER;
+
+	if (IS_ENABLED(CONFIG_KAISER))
+		ignore_flags |= _PAGE_NX;
+
+	return (p4d_flags(p4d) & ~ignore_flags) != 0;
 }
 #endif  /* CONFIG_PGTABLE_LEVELS > 3 */
 
@@ -880,7 +885,12 @@ static inline p4d_t *p4d_offset(pgd_t *pgd, unsigned long address)
 
 static inline int pgd_bad(pgd_t pgd)
 {
-	return (pgd_flags(pgd) & ~_PAGE_USER) != _KERNPG_TABLE;
+	unsigned long ignore_flags = _PAGE_USER;
+
+	if (IS_ENABLED(CONFIG_KAISER))
+		ignore_flags |= _PAGE_NX;
+
+	return (pgd_flags(pgd) & ~ignore_flags) != _KERNPG_TABLE;
 }
 
 static inline int pgd_none(pgd_t pgd)
-- 
2.14.1

^ permalink raw reply related	[flat|nested] 82+ messages in thread

* [PATCH 27/43] x86/mm/kaiser: Make sure static PGDs are 8k in size
  2017-11-24  9:14 [PATCH 00/43] x86 entry-stack and Kaiser series, 2017/11/24 version Ingo Molnar
                   ` (25 preceding siblings ...)
  2017-11-24  9:14 ` [PATCH 26/43] x86/mm/kaiser: Allow NX poison to be set in p4d/pgd Ingo Molnar
@ 2017-11-24  9:14 ` Ingo Molnar
  2017-11-24  9:14 ` [PATCH 28/43] x86/mm/kaiser: Map CPU entry area Ingo Molnar
                   ` (17 subsequent siblings)
  44 siblings, 0 replies; 82+ messages in thread
From: Ingo Molnar @ 2017-11-24  9:14 UTC (permalink / raw)
  To: linux-kernel
  Cc: Dave Hansen, Andy Lutomirski, Thomas Gleixner, H . Peter Anvin,
	Peter Zijlstra, Borislav Petkov, Linus Torvalds

From: Dave Hansen <dave.hansen@linux.intel.com>

A few PGDs come out of the kernel binary instead of being
allocated dynamically.  Before this patch, they are all
8k-aligned, but they must also be 8k in *size*.

The original KAISER patch did not do this.  It probably just
lucked out that it did not trample over data after the last PGD.
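
As a sketch of why the size and the alignment matter together: with
an order-1, 8k-aligned PGD, the kernel half and the shadow (user)
half differ only in bit 12 of a PGD pointer, so a helper like the
series' kernel_to_shadow_pgdp() can be as simple as (sketch only,
the real helper lives in the core patch):

	static inline pgd_t *kernel_to_shadow_pgdp_sketch(pgd_t *pgdp)
	{
		/* The shadow half is the second 4k page of the pair: */
		return (pgd_t *)((unsigned long)pgdp | (1UL << PAGE_SHIFT));
	}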

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Daniel Gruss <daniel.gruss@iaik.tugraz.at>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Kees Cook <keescook@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michael Schwarz <michael.schwarz@iaik.tugraz.at>
Cc: Moritz Lipp <moritz.lipp@iaik.tugraz.at>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Richard Fellner <richard.fellner@student.tugraz.at>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20171123003450.76492124@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/kernel/head_64.S | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
index 43d1cffd1fcf..58087ab1782e 100644
--- a/arch/x86/kernel/head_64.S
+++ b/arch/x86/kernel/head_64.S
@@ -342,11 +342,24 @@ GLOBAL(early_recursion_flag)
 GLOBAL(name)
 
 #ifdef CONFIG_KAISER
+/*
+ * Each PGD needs to be 8k long and 8k aligned.  We do not
+ * ever go out to userspace with these, so we do not
+ * strictly *need* the second page, but this allows us to
+ * have a single set_pgd() implementation that does not
+ * need to worry about whether it has 4k or 8k to work
+ * with.
+ *
+ * This ensures PGDs are 8k long:
+ */
+#define KAISER_USER_PGD_FILL	512
+/* This ensures they are 8k-aligned: */
 #define NEXT_PGD_PAGE(name) \
 	.balign 2 * PAGE_SIZE; \
 GLOBAL(name)
 #else
 #define NEXT_PGD_PAGE(name) NEXT_PAGE(name)
+#define KAISER_USER_PGD_FILL	0
 #endif
 
 /* Automate the creation of 1 to 1 mapping pmd entries */
@@ -365,6 +378,7 @@ NEXT_PGD_PAGE(early_top_pgt)
 #else
 	.quad	level3_kernel_pgt - __START_KERNEL_map + _PAGE_TABLE_NOENC
 #endif
+	.fill	KAISER_USER_PGD_FILL,8,0
 
 NEXT_PAGE(early_dynamic_pgts)
 	.fill	512*EARLY_DYNAMIC_PAGE_TABLES,8,0
@@ -379,6 +393,7 @@ NEXT_PGD_PAGE(init_top_pgt)
 	.org    init_top_pgt + PGD_START_KERNEL*8, 0
 	/* (2^48-(2*1024*1024*1024))/(2^39) = 511 */
 	.quad   level3_kernel_pgt - __START_KERNEL_map + _PAGE_TABLE_NOENC
+	.fill	KAISER_USER_PGD_FILL,8,0
 
 NEXT_PAGE(level3_ident_pgt)
 	.quad	level2_ident_pgt - __START_KERNEL_map + _KERNPG_TABLE_NOENC
@@ -391,6 +406,7 @@ NEXT_PAGE(level2_ident_pgt)
 #else
 NEXT_PGD_PAGE(init_top_pgt)
 	.fill	512,8,0
+	.fill	KAISER_USER_PGD_FILL,8,0
 #endif
 
 #ifdef CONFIG_X86_5LEVEL
-- 
2.14.1

^ permalink raw reply related	[flat|nested] 82+ messages in thread

* [PATCH 28/43] x86/mm/kaiser: Map CPU entry area
  2017-11-24  9:14 [PATCH 00/43] x86 entry-stack and Kaiser series, 2017/11/24 version Ingo Molnar
                   ` (26 preceding siblings ...)
  2017-11-24  9:14 ` [PATCH 27/43] x86/mm/kaiser: Make sure static PGDs are 8k in size Ingo Molnar
@ 2017-11-24  9:14 ` Ingo Molnar
  2017-11-24 13:43   ` Peter Zijlstra
  2017-11-24  9:14 ` [PATCH 29/43] x86/mm/kaiser: Map dynamically-allocated LDTs Ingo Molnar
                   ` (16 subsequent siblings)
  44 siblings, 1 reply; 82+ messages in thread
From: Ingo Molnar @ 2017-11-24  9:14 UTC (permalink / raw)
  To: linux-kernel
  Cc: Dave Hansen, Andy Lutomirski, Thomas Gleixner, H . Peter Anvin,
	Peter Zijlstra, Borislav Petkov, Linus Torvalds

From: Dave Hansen <dave.hansen@linux.intel.com>

There is now a special 'struct cpu_entry_area' that contains all
of the data needed to enter the kernel.  It's mapped in the fixmap
area and contains:

 * The GDT (the hardware segment descriptor table)
 * The TSS (the task state segment, which points the hardware
   at the various stacks, and contains the entry stack).
 * The entry trampoline code itself
 * The exception stacks (aka the IST stacks)
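
(Roughly, as a sketch with the field names used below; the exact
definition was added earlier in this series:)

	struct cpu_entry_area {
		char gdt[PAGE_SIZE];
		struct tss_struct tss;
		char entry_trampoline[PAGE_SIZE];
		char exception_stacks[N_EXCEPTION_STACKS * EXCEPTION_STKSZ];
	};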

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Daniel Gruss <daniel.gruss@iaik.tugraz.at>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Kees Cook <keescook@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michael Schwarz <michael.schwarz@iaik.tugraz.at>
Cc: Moritz Lipp <moritz.lipp@iaik.tugraz.at>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Richard Fellner <richard.fellner@student.tugraz.at>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20171123003453.D4CB33A9@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/include/asm/kaiser.h |  6 ++++++
 arch/x86/kernel/cpu/common.c  |  4 ++++
 arch/x86/mm/kaiser.c          | 31 +++++++++++++++++++++++++++++++
 include/linux/kaiser.h        |  3 +++
 4 files changed, 44 insertions(+)

diff --git a/arch/x86/include/asm/kaiser.h b/arch/x86/include/asm/kaiser.h
index 3c2cc71b4058..040cb096d29d 100644
--- a/arch/x86/include/asm/kaiser.h
+++ b/arch/x86/include/asm/kaiser.h
@@ -33,6 +33,12 @@
 extern int kaiser_add_mapping(unsigned long addr, unsigned long size,
 			      unsigned long flags);
 
+/**
+ *  kaiser_add_mapping_cpu_entry - map the cpu entry area
+ *  @cpu: the CPU for which the entry area is being mapped
+ */
+extern void kaiser_add_mapping_cpu_entry(int cpu);
+
 /**
  *  kaiser_remove_mapping - remove a kernel mapping from the userpage tables
  *  @addr: the start address of the range
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 3b6920c9fef7..d6bcf397b00d 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -4,6 +4,7 @@
 #include <linux/kernel.h>
 #include <linux/export.h>
 #include <linux/percpu.h>
+#include <linux/kaiser.h>
 #include <linux/string.h>
 #include <linux/ctype.h>
 #include <linux/delay.h>
@@ -584,6 +585,9 @@ static inline void setup_cpu_entry_area(int cpu)
 	__set_fixmap(get_cpu_entry_area_index(cpu, entry_trampoline),
 		     __pa_symbol(_entry_trampoline), PAGE_KERNEL_RX);
 #endif
+	/* CPU 0's mapping is done in kaiser_init() */
+	if (cpu)
+		kaiser_add_mapping_cpu_entry(cpu);
 }
 
 /* Load the original GDT from the per-cpu structure */
diff --git a/arch/x86/mm/kaiser.c b/arch/x86/mm/kaiser.c
index 7f7561e9971d..4665dd724efb 100644
--- a/arch/x86/mm/kaiser.c
+++ b/arch/x86/mm/kaiser.c
@@ -353,6 +353,26 @@ static void __init kaiser_init_all_pgds(void)
 	WARN_ON(__ret);							\
 } while (0)
 
+void kaiser_add_mapping_cpu_entry(int cpu)
+{
+	kaiser_add_user_map_early(get_cpu_gdt_ro(cpu), PAGE_SIZE,
+				  __PAGE_KERNEL_RO);
+
+	/* includes the entry stack */
+	kaiser_add_user_map_early(&get_cpu_entry_area(cpu)->tss,
+				  sizeof(get_cpu_entry_area(cpu)->tss),
+				  __PAGE_KERNEL | _PAGE_GLOBAL);
+
+	/* Entry code, so needs to be EXEC */
+	kaiser_add_user_map_early(&get_cpu_entry_area(cpu)->entry_trampoline,
+				  sizeof(get_cpu_entry_area(cpu)->entry_trampoline),
+				  __PAGE_KERNEL_EXEC | _PAGE_GLOBAL);
+
+	kaiser_add_user_map_early(&get_cpu_entry_area(cpu)->exception_stacks,
+				 sizeof(get_cpu_entry_area(cpu)->exception_stacks),
+				 __PAGE_KERNEL | _PAGE_GLOBAL);
+}
+
 extern char __per_cpu_user_mapped_start[], __per_cpu_user_mapped_end[];
 /*
  * If anything in here fails, we will likely die on one of the
@@ -390,6 +410,17 @@ void __init kaiser_init(void)
 	kaiser_add_user_map_early((void *)idt_descr.address,
 				  sizeof(gate_desc) * NR_VECTORS,
 				  __PAGE_KERNEL_RO | _PAGE_GLOBAL);
+
+	/*
+	 * We delay CPU 0's mappings because these structures are
+	 * created before the page allocator is up.  Deferring it
+	 * until here lets us use the plain page allocator
+	 * unconditionally in the page table code above.
+	 *
+	 * This is OK because kaiser_init() is called long before
+	 * we ever run userspace and need the KAISER mappings.
+	 */
+	kaiser_add_mapping_cpu_entry(0);
 }
 
 int kaiser_add_mapping(unsigned long addr, unsigned long size,
diff --git a/include/linux/kaiser.h b/include/linux/kaiser.h
index 0fd800efa95c..77db4230a0dd 100644
--- a/include/linux/kaiser.h
+++ b/include/linux/kaiser.h
@@ -25,5 +25,8 @@ static inline int kaiser_add_mapping(unsigned long addr, unsigned long size,
 	return 0;
 }
 
+static inline void kaiser_add_mapping_cpu_entry(int cpu)
+{
+}
 #endif /* !CONFIG_KAISER */
 #endif /* _INCLUDE_KAISER_H */
-- 
2.14.1

^ permalink raw reply related	[flat|nested] 82+ messages in thread

* [PATCH 29/43] x86/mm/kaiser: Map dynamically-allocated LDTs
  2017-11-24  9:14 [PATCH 00/43] x86 entry-stack and Kaiser series, 2017/11/24 version Ingo Molnar
                   ` (27 preceding siblings ...)
  2017-11-24  9:14 ` [PATCH 28/43] x86/mm/kaiser: Map CPU entry area Ingo Molnar
@ 2017-11-24  9:14 ` Ingo Molnar
  2017-11-24  9:14 ` [PATCH 30/43] x86/mm/kaiser: Map espfix structures Ingo Molnar
                   ` (15 subsequent siblings)
  44 siblings, 0 replies; 82+ messages in thread
From: Ingo Molnar @ 2017-11-24  9:14 UTC (permalink / raw)
  To: linux-kernel
  Cc: Dave Hansen, Andy Lutomirski, Thomas Gleixner, H . Peter Anvin,
	Peter Zijlstra, Borislav Petkov, Linus Torvalds

From: Dave Hansen <dave.hansen@linux.intel.com>

Normally, a process has a NULL mm->context.ldt.  But, there is a
syscall for a process to set a new one.  If a process does that,
the LDT must be mapped into the user page tables, just like the
default copy.

The original KAISER patch missed this case.
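
(For reference, a hypothetical userspace trigger for this path, not
part of the patch: installing an LDT entry via modify_ldt(2) is what
makes the kernel allocate mm->context.ldt in the first place.)

	#include <asm/ldt.h>
	#include <sys/syscall.h>
	#include <unistd.h>

	static int install_ldt_entry(void)
	{
		struct user_desc desc = {
			.entry_number	= 0,
			.base_addr	= 0,
			.limit		= 0xfffff,
			.seg_32bit	= 1,
			.limit_in_pages	= 1,
		};

		/* func 1 == write an LDT entry */
		return syscall(SYS_modify_ldt, 1, &desc, sizeof(desc));
	}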

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Daniel Gruss <daniel.gruss@iaik.tugraz.at>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Kees Cook <keescook@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michael Schwarz <michael.schwarz@iaik.tugraz.at>
Cc: Moritz Lipp <moritz.lipp@iaik.tugraz.at>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Richard Fellner <richard.fellner@student.tugraz.at>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20171123003455.275397F7@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/kernel/ldt.c | 25 ++++++++++++++++++++-----
 1 file changed, 20 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kernel/ldt.c b/arch/x86/kernel/ldt.c
index 1c1eae961340..d6ab1144fdbf 100644
--- a/arch/x86/kernel/ldt.c
+++ b/arch/x86/kernel/ldt.c
@@ -11,6 +11,7 @@
 #include <linux/gfp.h>
 #include <linux/sched.h>
 #include <linux/string.h>
+#include <linux/kaiser.h>
 #include <linux/mm.h>
 #include <linux/smp.h>
 #include <linux/syscalls.h>
@@ -57,11 +58,21 @@ static void flush_ldt(void *__mm)
 	refresh_ldt_segments();
 }
 
+static void __free_ldt_struct(struct ldt_struct *ldt)
+{
+	if (ldt->nr_entries * LDT_ENTRY_SIZE > PAGE_SIZE)
+		vfree_atomic(ldt->entries);
+	else
+		free_page((unsigned long)ldt->entries);
+	kfree(ldt);
+}
+
 /* The caller must call finalize_ldt_struct on the result. LDT starts zeroed. */
 static struct ldt_struct *alloc_ldt_struct(unsigned int num_entries)
 {
 	struct ldt_struct *new_ldt;
 	unsigned int alloc_size;
+	int ret;
 
 	if (num_entries > LDT_ENTRIES)
 		return NULL;
@@ -89,6 +100,12 @@ static struct ldt_struct *alloc_ldt_struct(unsigned int num_entries)
 		return NULL;
 	}
 
+	ret = kaiser_add_mapping((unsigned long)new_ldt->entries, alloc_size,
+				 __PAGE_KERNEL | _PAGE_GLOBAL);
+	if (ret) {
+		__free_ldt_struct(new_ldt);
+		return NULL;
+	}
 	new_ldt->nr_entries = num_entries;
 	return new_ldt;
 }
@@ -115,12 +132,10 @@ static void free_ldt_struct(struct ldt_struct *ldt)
 	if (likely(!ldt))
 		return;
 
+	kaiser_remove_mapping((unsigned long)ldt->entries,
+			      ldt->nr_entries * LDT_ENTRY_SIZE);
 	paravirt_free_ldt(ldt->entries, ldt->nr_entries);
-	if (ldt->nr_entries * LDT_ENTRY_SIZE > PAGE_SIZE)
-		vfree_atomic(ldt->entries);
-	else
-		free_page((unsigned long)ldt->entries);
-	kfree(ldt);
+	__free_ldt_struct(ldt);
 }
 
 /*
-- 
2.14.1

^ permalink raw reply related	[flat|nested] 82+ messages in thread

* [PATCH 30/43] x86/mm/kaiser: Map espfix structures
  2017-11-24  9:14 [PATCH 00/43] x86 entry-stack and Kaiser series, 2017/11/24 version Ingo Molnar
                   ` (28 preceding siblings ...)
  2017-11-24  9:14 ` [PATCH 29/43] x86/mm/kaiser: Map dynamically-allocated LDTs Ingo Molnar
@ 2017-11-24  9:14 ` Ingo Molnar
  2017-11-24 13:47   ` Peter Zijlstra
  2017-11-24  9:14 ` [PATCH 31/43] x86/mm/kaiser: Map entry stack variable Ingo Molnar
                   ` (14 subsequent siblings)
  44 siblings, 1 reply; 82+ messages in thread
From: Ingo Molnar @ 2017-11-24  9:14 UTC (permalink / raw)
  To: linux-kernel
  Cc: Dave Hansen, Andy Lutomirski, Thomas Gleixner, H . Peter Anvin,
	Peter Zijlstra, Borislav Petkov, Linus Torvalds

From: Dave Hansen <dave.hansen@linux.intel.com>

There is some rather arcane code to help when an IRET returns
to 16-bit segments.  It is referred to as the "espfix" code.
This consists of a few per-cpu variables:

	espfix_stack: tells us where the stack is allocated
		      (the bottom)
	espfix_waddr: tells us to where %rsp may be pointed
		      (the top)

These are in addition to the stack itself.  All three things must
be mapped for the espfix code to function.

Note: the espfix code runs with a kernel GSBASE, but user
(shadow) page tables.  A switch to the kernel page tables could
be performed instead of mapping these structures, but mapping
them is simpler and less likely to break the assembly.  To switch
over to the kernel copy, additional temporary storage would be
required, which is in short supply in this context.

The original KAISER patch missed this case.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Daniel Gruss <daniel.gruss@iaik.tugraz.at>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Kees Cook <keescook@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michael Schwarz <michael.schwarz@iaik.tugraz.at>
Cc: Moritz Lipp <moritz.lipp@iaik.tugraz.at>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Richard Fellner <richard.fellner@student.tugraz.at>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20171123003457.EB854D0D@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/kernel/espfix_64.c | 12 +++++++++---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/espfix_64.c b/arch/x86/kernel/espfix_64.c
index 4780dba2cc59..8bb116d73aaa 100644
--- a/arch/x86/kernel/espfix_64.c
+++ b/arch/x86/kernel/espfix_64.c
@@ -33,6 +33,7 @@
 
 #include <linux/init.h>
 #include <linux/init_task.h>
+#include <linux/kaiser.h>
 #include <linux/kernel.h>
 #include <linux/percpu.h>
 #include <linux/gfp.h>
@@ -41,7 +42,6 @@
 #include <asm/pgalloc.h>
 #include <asm/setup.h>
 #include <asm/espfix.h>
-#include <asm/kaiser.h>
 
 /*
  * Note: we only need 6*8 = 48 bytes for the espfix stack, but round
@@ -61,8 +61,8 @@
 #define PGALLOC_GFP (GFP_KERNEL | __GFP_NOTRACK | __GFP_ZERO)
 
 /* This contains the *bottom* address of the espfix stack */
-DEFINE_PER_CPU_READ_MOSTLY(unsigned long, espfix_stack);
-DEFINE_PER_CPU_READ_MOSTLY(unsigned long, espfix_waddr);
+DEFINE_PER_CPU_USER_MAPPED(unsigned long, espfix_stack);
+DEFINE_PER_CPU_USER_MAPPED(unsigned long, espfix_waddr);
 
 /* Initialization mutex - should this be a spinlock? */
 static DEFINE_MUTEX(espfix_init_mutex);
@@ -225,4 +225,10 @@ void init_espfix_ap(int cpu)
 	per_cpu(espfix_stack, cpu) = addr;
 	per_cpu(espfix_waddr, cpu) = (unsigned long)stack_page
 				      + (addr & ~PAGE_MASK);
+	/*
+	 * _PAGE_GLOBAL is not really required.  This is not a hot
+	 * path, but we do it here for consistency.
+	 */
+	kaiser_add_mapping((unsigned long)stack_page, PAGE_SIZE,
+			__PAGE_KERNEL | _PAGE_GLOBAL);
 }
-- 
2.14.1

^ permalink raw reply related	[flat|nested] 82+ messages in thread

* [PATCH 31/43] x86/mm/kaiser: Map entry stack variable
  2017-11-24  9:14 [PATCH 00/43] x86 entry-stack and Kaiser series, 2017/11/24 version Ingo Molnar
                   ` (29 preceding siblings ...)
  2017-11-24  9:14 ` [PATCH 30/43] x86/mm/kaiser: Map espfix structures Ingo Molnar
@ 2017-11-24  9:14 ` Ingo Molnar
  2017-11-24  9:14 ` [PATCH 32/43] x86/mm/kaiser: Map virtually-addressed performance monitoring buffers Ingo Molnar
                   ` (13 subsequent siblings)
  44 siblings, 0 replies; 82+ messages in thread
From: Ingo Molnar @ 2017-11-24  9:14 UTC (permalink / raw)
  To: linux-kernel
  Cc: Dave Hansen, Andy Lutomirski, Thomas Gleixner, H . Peter Anvin,
	Peter Zijlstra, Borislav Petkov, Linus Torvalds

From: Dave Hansen <dave.hansen@linux.intel.com>

There are times when the kernel is entered but there is no
safe stack, like at SYSCALL entry.  To obtain a safe stack, we
have to clobber %rsp and store the clobbered value in
'rsp_scratch'.

Map this to userspace to allow us to do this stack switch before
the CR3 switch.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Daniel Gruss <daniel.gruss@iaik.tugraz.at>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Kees Cook <keescook@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michael Schwarz <michael.schwarz@iaik.tugraz.at>
Cc: Moritz Lipp <moritz.lipp@iaik.tugraz.at>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Richard Fellner <richard.fellner@student.tugraz.at>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20171123003459.C0FF167A@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/kernel/process_64.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index bafe65b08697..9a0220aa2bf9 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -59,7 +59,7 @@
 #include <asm/unistd_32_ia32.h>
 #endif
 
-__visible DEFINE_PER_CPU(unsigned long, rsp_scratch);
+__visible DEFINE_PER_CPU_USER_MAPPED(unsigned long, rsp_scratch);
 
 /* Prints also some state that isn't saved in the pt_regs */
 void __show_regs(struct pt_regs *regs, int all)
-- 
2.14.1

^ permalink raw reply related	[flat|nested] 82+ messages in thread

* [PATCH 32/43] x86/mm/kaiser: Map virtually-addressed performance monitoring buffers
  2017-11-24  9:14 [PATCH 00/43] x86 entry-stack and Kaiser series, 2017/11/24 version Ingo Molnar
                   ` (30 preceding siblings ...)
  2017-11-24  9:14 ` [PATCH 31/43] x86/mm/kaiser: Map entry stack variable Ingo Molnar
@ 2017-11-24  9:14 ` Ingo Molnar
  2017-11-24  9:14 ` [PATCH 33/43] x86/mm: Move CR3 construction functions Ingo Molnar
                   ` (12 subsequent siblings)
  44 siblings, 0 replies; 82+ messages in thread
From: Ingo Molnar @ 2017-11-24  9:14 UTC (permalink / raw)
  To: linux-kernel
  Cc: Dave Hansen, Andy Lutomirski, Thomas Gleixner, H . Peter Anvin,
	Peter Zijlstra, Borislav Petkov, Linus Torvalds

From: Hugh Dickins <hughd@google.com>

The BTS and PEBS buffers both have their virtual addresses
programmed into the hardware.  This means that any access to them
is performed via the page tables.  The times that the hardware
accesses these are entirely dependent on how the performance
monitoring hardware events are set up.  In other words, there is
no way for the kernel to tell when the hardware might access
these buffers.

To avoid perf crashes, place 'debug_store' in the user-mapped
per-cpu area instead of allocating it dynamically.  Also use the
page allocator plus kaiser_add_mapping() to keep the BTS and PEBS
buffers user-mapped (that is, present in the user mapping, though
visible only to kernel and hardware).  The PEBS fixup buffer does
not need this treatment.

The need for a user-mapped struct debug_store showed up even before
any conscious perf testing was done: a couple of kernel paging oopses
on Westmere implicated the debug_store offset of the per-cpu area.

Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Daniel Gruss <daniel.gruss@iaik.tugraz.at>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Kees Cook <keescook@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michael Schwarz <michael.schwarz@iaik.tugraz.at>
Cc: Moritz Lipp <moritz.lipp@iaik.tugraz.at>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Richard Fellner <richard.fellner@student.tugraz.at>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20171123003500.7EC0DB4E@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>

x86/mm: Fix Kaiser build on 32-bit, backmerge to: x86/mm/kaiser: Map virtually-addressed performance monitoring buffers

Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/events/intel/ds.c | 49 ++++++++++++++++++++++++++++++++++------------
 1 file changed, 37 insertions(+), 12 deletions(-)

diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
index 3674a4b6f8bd..61388b01962d 100644
--- a/arch/x86/events/intel/ds.c
+++ b/arch/x86/events/intel/ds.c
@@ -3,11 +3,15 @@
 #include <linux/types.h>
 #include <linux/slab.h>
 
+#include <linux/kaiser.h>
 #include <asm/perf_event.h>
 #include <asm/insn.h>
 
 #include "../perf_event.h"
 
+static
+DEFINE_PER_CPU_SHARED_ALIGNED_USER_MAPPED(struct debug_store, cpu_debug_store);
+
 /* The size of a BTS record in bytes: */
 #define BTS_RECORD_SIZE		24
 
@@ -279,6 +283,31 @@ void fini_debug_store_on_cpu(int cpu)
 
 static DEFINE_PER_CPU(void *, insn_buffer);
 
+static void *dsalloc(size_t size, gfp_t flags, int node)
+{
+	unsigned int order = get_order(size);
+	struct page *page;
+	unsigned long addr;
+
+	page = __alloc_pages_node(node, flags | __GFP_ZERO, order);
+	if (!page)
+		return NULL;
+	addr = (unsigned long)page_address(page);
+	if (kaiser_add_mapping(addr, size, __PAGE_KERNEL | _PAGE_GLOBAL) < 0) {
+		__free_pages(page, order);
+		addr = 0;
+	}
+	return (void *)addr;
+}
+
+static void dsfree(const void *buffer, size_t size)
+{
+	if (!buffer)
+		return;
+	kaiser_remove_mapping((unsigned long)buffer, size);
+	free_pages((unsigned long)buffer, get_order(size));
+}
+
 static int alloc_pebs_buffer(int cpu)
 {
 	struct debug_store *ds = per_cpu(cpu_hw_events, cpu).ds;
@@ -289,7 +318,7 @@ static int alloc_pebs_buffer(int cpu)
 	if (!x86_pmu.pebs)
 		return 0;
 
-	buffer = kzalloc_node(x86_pmu.pebs_buffer_size, GFP_KERNEL, node);
+	buffer = dsalloc(x86_pmu.pebs_buffer_size, GFP_KERNEL, node);
 	if (unlikely(!buffer))
 		return -ENOMEM;
 
@@ -300,7 +329,7 @@ static int alloc_pebs_buffer(int cpu)
 	if (x86_pmu.intel_cap.pebs_format < 2) {
 		ibuffer = kzalloc_node(PEBS_FIXUP_SIZE, GFP_KERNEL, node);
 		if (!ibuffer) {
-			kfree(buffer);
+			dsfree(buffer, x86_pmu.pebs_buffer_size);
 			return -ENOMEM;
 		}
 		per_cpu(insn_buffer, cpu) = ibuffer;
@@ -326,7 +355,8 @@ static void release_pebs_buffer(int cpu)
 	kfree(per_cpu(insn_buffer, cpu));
 	per_cpu(insn_buffer, cpu) = NULL;
 
-	kfree((void *)(unsigned long)ds->pebs_buffer_base);
+	dsfree((void *)(unsigned long)ds->pebs_buffer_base,
+			x86_pmu.pebs_buffer_size);
 	ds->pebs_buffer_base = 0;
 }
 
@@ -340,7 +370,7 @@ static int alloc_bts_buffer(int cpu)
 	if (!x86_pmu.bts)
 		return 0;
 
-	buffer = kzalloc_node(BTS_BUFFER_SIZE, GFP_KERNEL | __GFP_NOWARN, node);
+	buffer = dsalloc(BTS_BUFFER_SIZE, GFP_KERNEL | __GFP_NOWARN, node);
 	if (unlikely(!buffer)) {
 		WARN_ONCE(1, "%s: BTS buffer allocation failure\n", __func__);
 		return -ENOMEM;
@@ -366,19 +396,15 @@ static void release_bts_buffer(int cpu)
 	if (!ds || !x86_pmu.bts)
 		return;
 
-	kfree((void *)(unsigned long)ds->bts_buffer_base);
+	dsfree((void *)(unsigned long)ds->bts_buffer_base, BTS_BUFFER_SIZE);
 	ds->bts_buffer_base = 0;
 }
 
 static int alloc_ds_buffer(int cpu)
 {
-	int node = cpu_to_node(cpu);
-	struct debug_store *ds;
-
-	ds = kzalloc_node(sizeof(*ds), GFP_KERNEL, node);
-	if (unlikely(!ds))
-		return -ENOMEM;
+	struct debug_store *ds = per_cpu_ptr(&cpu_debug_store, cpu);
 
+	memset(ds, 0, sizeof(*ds));
 	per_cpu(cpu_hw_events, cpu).ds = ds;
 
 	return 0;
@@ -392,7 +418,6 @@ static void release_ds_buffer(int cpu)
 		return;
 
 	per_cpu(cpu_hw_events, cpu).ds = NULL;
-	kfree(ds);
 }
 
 void release_ds_buffers(void)
-- 
2.14.1

^ permalink raw reply related	[flat|nested] 82+ messages in thread

* [PATCH 33/43] x86/mm: Move CR3 construction functions
  2017-11-24  9:14 [PATCH 00/43] x86 entry-stack and Kaiser series, 2017/11/24 version Ingo Molnar
                   ` (31 preceding siblings ...)
  2017-11-24  9:14 ` [PATCH 32/43] x86/mm/kaiser: Map virtually-addressed performance monitoring buffers Ingo Molnar
@ 2017-11-24  9:14 ` Ingo Molnar
  2017-11-24  9:14 ` [PATCH 34/43] x86/mm: Remove hard-coded ASID limit checks Ingo Molnar
                   ` (11 subsequent siblings)
  44 siblings, 0 replies; 82+ messages in thread
From: Ingo Molnar @ 2017-11-24  9:14 UTC (permalink / raw)
  To: linux-kernel
  Cc: Dave Hansen, Andy Lutomirski, Thomas Gleixner, H . Peter Anvin,
	Peter Zijlstra, Borislav Petkov, Linus Torvalds

From: Dave Hansen <dave.hansen@linux.intel.com>

For flushing the TLB, the ASID which has been programmed into the
hardware must be known.  That differs from what is in 'cpu_tlbstate'.

Add functions to transform the 'cpu_tlbstate' values into the one
programmed into the hardware (CR3).

It's not easy to include mmu_context.h into tlbflush.h, so just move
the CR3 building over to tlbflush.h.
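
For illustration, the CR3 values these helpers build are simply the
page-table root plus the hardware ASID in the low PCID bits:

	cr3 = __sme_pa(pgd) | (asid + 1);                /* flushing     */
	cr3 = __sme_pa(pgd) | (asid + 1) | CR3_NOFLUSH;  /* non-flushing */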

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Daniel Gruss <daniel.gruss@iaik.tugraz.at>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Kees Cook <keescook@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michael Schwarz <michael.schwarz@iaik.tugraz.at>
Cc: Moritz Lipp <moritz.lipp@iaik.tugraz.at>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Richard Fellner <richard.fellner@student.tugraz.at>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20171123003502.CC87BF47@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/include/asm/mmu_context.h | 29 +----------------------------
 arch/x86/include/asm/tlbflush.h    | 27 +++++++++++++++++++++++++++
 arch/x86/mm/tlb.c                  |  8 ++++----
 3 files changed, 32 insertions(+), 32 deletions(-)

diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
index 6d16d15d09a0..5e1a1ecb65c6 100644
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -281,33 +281,6 @@ static inline bool arch_vma_access_permitted(struct vm_area_struct *vma,
 	return __pkru_allows_pkey(vma_pkey(vma), write);
 }
 
-/*
- * If PCID is on, ASID-aware code paths put the ASID+1 into the PCID
- * bits.  This serves two purposes.  It prevents a nasty situation in
- * which PCID-unaware code saves CR3, loads some other value (with PCID
- * == 0), and then restores CR3, thus corrupting the TLB for ASID 0 if
- * the saved ASID was nonzero.  It also means that any bugs involving
- * loading a PCID-enabled CR3 with CR4.PCIDE off will trigger
- * deterministically.
- */
-
-static inline unsigned long build_cr3(struct mm_struct *mm, u16 asid)
-{
-	if (static_cpu_has(X86_FEATURE_PCID)) {
-		VM_WARN_ON_ONCE(asid > 4094);
-		return __sme_pa(mm->pgd) | (asid + 1);
-	} else {
-		VM_WARN_ON_ONCE(asid != 0);
-		return __sme_pa(mm->pgd);
-	}
-}
-
-static inline unsigned long build_cr3_noflush(struct mm_struct *mm, u16 asid)
-{
-	VM_WARN_ON_ONCE(asid > 4094);
-	return __sme_pa(mm->pgd) | (asid + 1) | CR3_NOFLUSH;
-}
-
 /*
  * This can be used from process context to figure out what the value of
  * CR3 is without needing to do a (slow) __read_cr3().
@@ -317,7 +290,7 @@ static inline unsigned long build_cr3_noflush(struct mm_struct *mm, u16 asid)
  */
 static inline unsigned long __get_current_cr3_fast(void)
 {
-	unsigned long cr3 = build_cr3(this_cpu_read(cpu_tlbstate.loaded_mm),
+	unsigned long cr3 = build_cr3(this_cpu_read(cpu_tlbstate.loaded_mm)->pgd,
 		this_cpu_read(cpu_tlbstate.loaded_mm_asid));
 
 	/* For now, be very restrictive about when this can be called. */
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 509046cfa5ce..df28f1a61afa 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -75,6 +75,33 @@ static inline u64 inc_mm_tlb_gen(struct mm_struct *mm)
 	return new_tlb_gen;
 }
 
+/*
+ * If PCID is on, ASID-aware code paths put the ASID+1 into the PCID
+ * bits.  This serves two purposes.  It prevents a nasty situation in
+ * which PCID-unaware code saves CR3, loads some other value (with PCID
+ * == 0), and then restores CR3, thus corrupting the TLB for ASID 0 if
+ * the saved ASID was nonzero.  It also means that any bugs involving
+ * loading a PCID-enabled CR3 with CR4.PCIDE off will trigger
+ * deterministically.
+ */
+struct pgd_t;
+static inline unsigned long build_cr3(pgd_t *pgd, u16 asid)
+{
+	if (static_cpu_has(X86_FEATURE_PCID)) {
+		VM_WARN_ON_ONCE(asid > 4094);
+		return __sme_pa(pgd) | (asid + 1);
+	} else {
+		VM_WARN_ON_ONCE(asid != 0);
+		return __sme_pa(pgd);
+	}
+}
+
+static inline unsigned long build_cr3_noflush(pgd_t *pgd, u16 asid)
+{
+	VM_WARN_ON_ONCE(asid > 4094);
+	return __sme_pa(pgd) | (asid + 1) | CR3_NOFLUSH;
+}
+
 #ifdef CONFIG_PARAVIRT
 #include <asm/paravirt.h>
 #else
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 3118392cdf75..e629dbda01a0 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -128,7 +128,7 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 	 * isn't free.
 	 */
 #ifdef CONFIG_DEBUG_VM
-	if (WARN_ON_ONCE(__read_cr3() != build_cr3(real_prev, prev_asid))) {
+	if (WARN_ON_ONCE(__read_cr3() != build_cr3(real_prev->pgd, prev_asid))) {
 		/*
 		 * If we were to BUG here, we'd be very likely to kill
 		 * the system so hard that we don't see the call trace.
@@ -195,7 +195,7 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 		if (need_flush) {
 			this_cpu_write(cpu_tlbstate.ctxs[new_asid].ctx_id, next->context.ctx_id);
 			this_cpu_write(cpu_tlbstate.ctxs[new_asid].tlb_gen, next_tlb_gen);
-			write_cr3(build_cr3(next, new_asid));
+			write_cr3(build_cr3(next->pgd, new_asid));
 
 			/*
 			 * NB: This gets called via leave_mm() in the idle path
@@ -208,7 +208,7 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 			trace_tlb_flush_rcuidle(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL);
 		} else {
 			/* The new ASID is already up to date. */
-			write_cr3(build_cr3_noflush(next, new_asid));
+			write_cr3(build_cr3_noflush(next->pgd, new_asid));
 
 			/* See above wrt _rcuidle. */
 			trace_tlb_flush_rcuidle(TLB_FLUSH_ON_TASK_SWITCH, 0);
@@ -288,7 +288,7 @@ void initialize_tlbstate_and_flush(void)
 		!(cr4_read_shadow() & X86_CR4_PCIDE));
 
 	/* Force ASID 0 and force a TLB flush. */
-	write_cr3(build_cr3(mm, 0));
+	write_cr3(build_cr3(mm->pgd, 0));
 
 	/* Reinitialize tlbstate. */
 	this_cpu_write(cpu_tlbstate.loaded_mm_asid, 0);
-- 
2.14.1

^ permalink raw reply related	[flat|nested] 82+ messages in thread

* [PATCH 34/43] x86/mm: Remove hard-coded ASID limit checks
  2017-11-24  9:14 [PATCH 00/43] x86 entry-stack and Kaiser series, 2017/11/24 version Ingo Molnar
                   ` (32 preceding siblings ...)
  2017-11-24  9:14 ` [PATCH 33/43] x86/mm: Move CR3 construction functions Ingo Molnar
@ 2017-11-24  9:14 ` Ingo Molnar
  2017-11-24  9:14 ` [PATCH 35/43] x86/mm: Put mmu-to-h/w ASID translation in one place Ingo Molnar
                   ` (10 subsequent siblings)
  44 siblings, 0 replies; 82+ messages in thread
From: Ingo Molnar @ 2017-11-24  9:14 UTC (permalink / raw)
  To: linux-kernel
  Cc: Dave Hansen, Andy Lutomirski, Thomas Gleixner, H . Peter Anvin,
	Peter Zijlstra, Borislav Petkov, Linus Torvalds

From: Dave Hansen <dave.hansen@linux.intel.com>

First, it's nice to remove the magic numbers.

Second, KAISER is going to consume half of the available ASID
space.  The space is currently unused, but add a comment to spell
out this new restriction.
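
Concretely, with the values introduced below:

	CR3_AVAIL_ASID_BITS = 12 - 0 = 12
	MAX_ASID_AVAILABLE  = (1 << 12) - 2 = 4094

which is exactly the magic '4094' that gets replaced.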

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Daniel Gruss <daniel.gruss@iaik.tugraz.at>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Kees Cook <keescook@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michael Schwarz <michael.schwarz@iaik.tugraz.at>
Cc: Moritz Lipp <moritz.lipp@iaik.tugraz.at>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Richard Fellner <richard.fellner@student.tugraz.at>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20171123003504.57EDB845@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/include/asm/tlbflush.h | 17 +++++++++++++++--
 1 file changed, 15 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index df28f1a61afa..3101581c5da0 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -75,6 +75,19 @@ static inline u64 inc_mm_tlb_gen(struct mm_struct *mm)
 	return new_tlb_gen;
 }
 
+/* There are 12 bits of space for ASIDs in CR3 */
+#define CR3_HW_ASID_BITS 12
+/* When enabled, KAISER consumes a single bit for user/kernel switches */
+#define KAISER_CONSUMED_ASID_BITS 0
+
+#define CR3_AVAIL_ASID_BITS (CR3_HW_ASID_BITS - KAISER_CONSUMED_ASID_BITS)
+/*
+ * ASIDs are zero-based: 0->MAX_AVAIL_ASID are valid.  -1 below
+ * to account for them being zero-based.  Another -1 is because ASID 0
+ * is reserved for use by non-PCID-aware users.
+ */
+#define MAX_ASID_AVAILABLE ((1<<CR3_AVAIL_ASID_BITS) - 2)
+
 /*
  * If PCID is on, ASID-aware code paths put the ASID+1 into the PCID
  * bits.  This serves two purposes.  It prevents a nasty situation in
@@ -88,7 +101,7 @@ struct pgd_t;
 static inline unsigned long build_cr3(pgd_t *pgd, u16 asid)
 {
 	if (static_cpu_has(X86_FEATURE_PCID)) {
-		VM_WARN_ON_ONCE(asid > 4094);
+		VM_WARN_ON_ONCE(asid > MAX_ASID_AVAILABLE);
 		return __sme_pa(pgd) | (asid + 1);
 	} else {
 		VM_WARN_ON_ONCE(asid != 0);
@@ -98,7 +111,7 @@ static inline unsigned long build_cr3(pgd_t *pgd, u16 asid)
 
 static inline unsigned long build_cr3_noflush(pgd_t *pgd, u16 asid)
 {
-	VM_WARN_ON_ONCE(asid > 4094);
+	VM_WARN_ON_ONCE(asid > MAX_ASID_AVAILABLE);
 	return __sme_pa(pgd) | (asid + 1) | CR3_NOFLUSH;
 }
 
-- 
2.14.1

^ permalink raw reply related	[flat|nested] 82+ messages in thread

* [PATCH 35/43] x86/mm: Put mmu-to-h/w ASID translation in one place
  2017-11-24  9:14 [PATCH 00/43] x86 entry-stack and Kaiser series, 2017/11/24 version Ingo Molnar
                   ` (33 preceding siblings ...)
  2017-11-24  9:14 ` [PATCH 34/43] x86/mm: Remove hard-coded ASID limit checks Ingo Molnar
@ 2017-11-24  9:14 ` Ingo Molnar
  2017-11-24  9:14 ` [PATCH 36/43] x86/mm: Allow flushing for future ASID switches Ingo Molnar
                   ` (9 subsequent siblings)
  44 siblings, 0 replies; 82+ messages in thread
From: Ingo Molnar @ 2017-11-24  9:14 UTC (permalink / raw)
  To: linux-kernel
  Cc: Dave Hansen, Andy Lutomirski, Thomas Gleixner, H . Peter Anvin,
	Peter Zijlstra, Borislav Petkov, Linus Torvalds

From: Dave Hansen <dave.hansen@linux.intel.com>

There are effectively two ASID types:
1. The one stored in the mmu_context that goes from 0->5
2. The one programmed into the hardware that goes from 1->6

This consolidates the places where we convert between the two
(by doing +1) into a single spot, which gives us a nice place to
comment.  KAISER will also need to know, given an ASID, which
hardware ASID to flush for the userspace mapping.
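
In other words (this is just the +1 from the patch below):

	mm ASID (cpu_tlbstate):  0  1  2  3  4  5
	h/w ASID (CR3 PCID):     1  2  3  4  5  6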

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Daniel Gruss <daniel.gruss@iaik.tugraz.at>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Kees Cook <keescook@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michael Schwarz <michael.schwarz@iaik.tugraz.at>
Cc: Moritz Lipp <moritz.lipp@iaik.tugraz.at>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Richard Fellner <richard.fellner@student.tugraz.at>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20171123003506.67E81D7F@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/include/asm/tlbflush.h | 30 ++++++++++++++++++------------
 1 file changed, 18 insertions(+), 12 deletions(-)

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 3101581c5da0..24b27eb5904c 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -88,21 +88,26 @@ static inline u64 inc_mm_tlb_gen(struct mm_struct *mm)
  */
 #define MAX_ASID_AVAILABLE ((1<<CR3_AVAIL_ASID_BITS) - 2)
 
-/*
- * If PCID is on, ASID-aware code paths put the ASID+1 into the PCID
- * bits.  This serves two purposes.  It prevents a nasty situation in
- * which PCID-unaware code saves CR3, loads some other value (with PCID
- * == 0), and then restores CR3, thus corrupting the TLB for ASID 0 if
- * the saved ASID was nonzero.  It also means that any bugs involving
- * loading a PCID-enabled CR3 with CR4.PCIDE off will trigger
- * deterministically.
- */
+static inline u16 kern_asid(u16 asid)
+{
+	VM_WARN_ON_ONCE(asid > MAX_ASID_AVAILABLE);
+	/*
+	 * If PCID is on, ASID-aware code paths put the ASID+1 into the PCID
+	 * bits.  This serves two purposes.  It prevents a nasty situation in
+	 * which PCID-unaware code saves CR3, loads some other value (with PCID
+	 * == 0), and then restores CR3, thus corrupting the TLB for ASID 0 if
+	 * the saved ASID was nonzero.  It also means that any bugs involving
+	 * loading a PCID-enabled CR3 with CR4.PCIDE off will trigger
+	 * deterministically.
+	 */
+	return asid + 1;
+}
+
 struct pgd_t;
 static inline unsigned long build_cr3(pgd_t *pgd, u16 asid)
 {
 	if (static_cpu_has(X86_FEATURE_PCID)) {
-		VM_WARN_ON_ONCE(asid > MAX_ASID_AVAILABLE);
-		return __sme_pa(pgd) | (asid + 1);
+		return __sme_pa(pgd) | kern_asid(asid);
 	} else {
 		VM_WARN_ON_ONCE(asid != 0);
 		return __sme_pa(pgd);
@@ -112,7 +117,8 @@ static inline unsigned long build_cr3(pgd_t *pgd, u16 asid)
 static inline unsigned long build_cr3_noflush(pgd_t *pgd, u16 asid)
 {
 	VM_WARN_ON_ONCE(asid > MAX_ASID_AVAILABLE);
-	return __sme_pa(pgd) | (asid + 1) | CR3_NOFLUSH;
+	VM_WARN_ON_ONCE(!this_cpu_has(X86_FEATURE_PCID));
+	return __sme_pa(pgd) | kern_asid(asid) | CR3_NOFLUSH;
 }
 
 #ifdef CONFIG_PARAVIRT
-- 
2.14.1

^ permalink raw reply related	[flat|nested] 82+ messages in thread

* [PATCH 36/43] x86/mm: Allow flushing for future ASID switches
  2017-11-24  9:14 [PATCH 00/43] x86 entry-stack and Kaiser series, 2017/11/24 version Ingo Molnar
                   ` (34 preceding siblings ...)
  2017-11-24  9:14 ` [PATCH 35/43] x86/mm: Put mmu-to-h/w ASID translation in one place Ingo Molnar
@ 2017-11-24  9:14 ` Ingo Molnar
  2017-11-24  9:14 ` [PATCH 37/43] x86/mm/kaiser: Use PCID feature to make user and kernel switches faster Ingo Molnar
                   ` (8 subsequent siblings)
  44 siblings, 0 replies; 82+ messages in thread
From: Ingo Molnar @ 2017-11-24  9:14 UTC (permalink / raw)
  To: linux-kernel
  Cc: Dave Hansen, Andy Lutomirski, Thomas Gleixner, H . Peter Anvin,
	Peter Zijlstra, Borislav Petkov, Linus Torvalds

From: Dave Hansen <dave.hansen@linux.intel.com>

If changing the page tables in such a way that an invalidation of
all contexts (aka. PCIDs / ASIDs) is required, they can be
actively invalidated by:

 1. INVPCID for each PCID (works for single pages too).
 2. Load CR3 with each PCID without the NOFLUSH bit set
 3. Load CR3 with the NOFLUSH bit set for each and do
    INVLPG for each address.

But none of these are really feasible since there are ~6 ASIDs (12 with
KAISER) in use at the time that invalidation is required.  Instead of
actively invalidating them all, invalidate only the *current* context
and _quickly_ mark the cpu_tlbstate to indicate that a future
invalidation is required.

At the next context switch, look for this indicator
('all_other_ctxs_invalid' being set) and, if set, invalidate all
of the cpu_tlbstate.ctxs[] entries.

This ensures that any future context switches will do a full flush
of the TLB, picking up the previous changes.
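
A minimal sketch of the pattern (all names are from the patch below):

	/* Writer side, after changing shared non-global mappings: */
	__flush_tlb();				/* flush the current context */
	tlb_flush_shared_nonglobals();		/* defer all the others */

	/* Consumer side, at context-switch time in choose_new_asid(): */
	if (this_cpu_read(cpu_tlbstate.all_other_ctxs_invalid))
		clear_non_loaded_ctxs();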

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Daniel Gruss <daniel.gruss@iaik.tugraz.at>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Kees Cook <keescook@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michael Schwarz <michael.schwarz@iaik.tugraz.at>
Cc: Moritz Lipp <moritz.lipp@iaik.tugraz.at>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Richard Fellner <richard.fellner@student.tugraz.at>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20171123003507.E8C327F5@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/include/asm/tlbflush.h | 47 ++++++++++++++++++++++++++++++++---------
 arch/x86/mm/tlb.c               | 35 ++++++++++++++++++++++++++++++
 2 files changed, 72 insertions(+), 10 deletions(-)

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 24b27eb5904c..bb5ba71038ee 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -184,6 +184,17 @@ struct tlb_state {
 	 */
 	bool is_lazy;
 
+	/*
+	 * If set we changed the page tables in such a way that we
+	 * needed an invalidation of all contexts (aka. PCIDs / ASIDs).
+	 * This tells us to go invalidate all the non-loaded ctxs[]
+	 * on the next context switch.
+	 *
+	 * The current ctx was kept up-to-date as it ran and does not
+	 * need to be invalidated.
+	 */
+	bool all_other_ctxs_invalid;
+
 	/*
 	 * Access to this CR4 shadow and to H/W CR4 is protected by
 	 * disabling interrupts when modifying either one.
@@ -261,6 +272,19 @@ static inline unsigned long cr4_read_shadow(void)
 	return this_cpu_read(cpu_tlbstate.cr4);
 }
 
+static inline void tlb_flush_shared_nonglobals(void)
+{
+	/*
+	 * With global pages, all of the shared kernel page tables
+	 * are set as _PAGE_GLOBAL.  We have no shared nonglobals
+	 * and nothing to do here.
+	 */
+	if (IS_ENABLED(CONFIG_X86_GLOBAL_PAGES))
+		return;
+
+	this_cpu_write(cpu_tlbstate.all_other_ctxs_invalid, true);
+}
+
 /*
  * Save some of cr4 feature set we're using (e.g.  Pentium 4MB
  * enable and PPro Global page enable), so that any CPU's that boot
@@ -290,6 +314,10 @@ static inline void __native_flush_tlb(void)
 	preempt_disable();
 	native_write_cr3(__native_read_cr3());
 	preempt_enable();
+	/*
+	 * Does not need tlb_flush_shared_nonglobals() since the CR3 write
+	 * without PCIDs flushes all non-globals.
+	 */
 }
 
 static inline void __native_flush_tlb_global_irq_disabled(void)
@@ -335,24 +363,23 @@ static inline void __native_flush_tlb_single(unsigned long addr)
 
 static inline void __flush_tlb_all(void)
 {
-	if (boot_cpu_has(X86_FEATURE_PGE))
+	if (boot_cpu_has(X86_FEATURE_PGE)) {
 		__flush_tlb_global();
-	else
+	} else {
 		__flush_tlb();
-
-	/*
-	 * Note: if we somehow had PCID but not PGE, then this wouldn't work --
-	 * we'd end up flushing kernel translations for the current ASID but
-	 * we might fail to flush kernel translations for other cached ASIDs.
-	 *
-	 * To avoid this issue, we force PCID off if PGE is off.
-	 */
+		tlb_flush_shared_nonglobals();
+	}
 }
 
 static inline void __flush_tlb_one(unsigned long addr)
 {
 	count_vm_tlb_event(NR_TLB_LOCAL_FLUSH_ONE);
 	__flush_tlb_single(addr);
+	/*
+	 * Invalidate other address spaces inaccessible to single-page
+	 * invalidation:
+	 */
+	tlb_flush_shared_nonglobals();
 }
 
 #define TLB_FLUSH_ALL	-1UL
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index e629dbda01a0..81941f1690fa 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -28,6 +28,38 @@
  *	Implement flush IPI by CALL_FUNCTION_VECTOR, Alex Shi
  */
 
+/*
+ * We get here when we do something requiring a TLB invalidation
+ * but could not go invalidate all of the contexts.  We do the
+ * necessary invalidation by clearing out the 'ctx_id' which
+ * forces a TLB flush when the context is loaded.
+ */
+void clear_non_loaded_ctxs(void)
+{
+	u16 asid;
+
+	/*
+	 * This is only expected to be set if we have disabled
+	 * kernel _PAGE_GLOBAL pages.
+	 */
+	if (IS_ENABLED(CONFIG_X86_GLOBAL_PAGES)) {
+		WARN_ON_ONCE(1);
+		return;
+	}
+
+	for (asid = 0; asid < TLB_NR_DYN_ASIDS; asid++) {
+		/* Do not need to flush the current asid */
+		if (asid == this_cpu_read(cpu_tlbstate.loaded_mm_asid))
+			continue;
+		/*
+		 * Make sure the next time we go to switch to
+		 * this asid, we do a flush:
+		 */
+		this_cpu_write(cpu_tlbstate.ctxs[asid].ctx_id, 0);
+	}
+	this_cpu_write(cpu_tlbstate.all_other_ctxs_invalid, false);
+}
+
 atomic64_t last_mm_ctx_id = ATOMIC64_INIT(1);
 
 
@@ -42,6 +74,9 @@ static void choose_new_asid(struct mm_struct *next, u64 next_tlb_gen,
 		return;
 	}
 
+	if (this_cpu_read(cpu_tlbstate.all_other_ctxs_invalid))
+		clear_non_loaded_ctxs();
+
 	for (asid = 0; asid < TLB_NR_DYN_ASIDS; asid++) {
 		if (this_cpu_read(cpu_tlbstate.ctxs[asid].ctx_id) !=
 		    next->context.ctx_id)
-- 
2.14.1

^ permalink raw reply related	[flat|nested] 82+ messages in thread

* [PATCH 37/43] x86/mm/kaiser: Use PCID feature to make user and kernel switches faster
  2017-11-24  9:14 [PATCH 00/43] x86 entry-stack and Kaiser series, 2017/11/24 version Ingo Molnar
                   ` (35 preceding siblings ...)
  2017-11-24  9:14 ` [PATCH 36/43] x86/mm: Allow flushing for future ASID switches Ingo Molnar
@ 2017-11-24  9:14 ` Ingo Molnar
  2017-11-24  9:14 ` [PATCH 38/43] x86/mm/kaiser: Disable native VSYSCALL Ingo Molnar
                   ` (7 subsequent siblings)
  44 siblings, 0 replies; 82+ messages in thread
From: Ingo Molnar @ 2017-11-24  9:14 UTC (permalink / raw)
  To: linux-kernel
  Cc: Dave Hansen, Andy Lutomirski, Thomas Gleixner, H . Peter Anvin,
	Peter Zijlstra, Borislav Petkov, Linus Torvalds

From: Dave Hansen <dave.hansen@linux.intel.com>

Short summary: Use the x86 PCID feature to avoid flushing the TLB at all
interrupts and syscalls.  This speeds them up, but makes context
switches and TLB flushing slower.

Background:

KAISER keeps two copies of the page tables.  Switches between the
copies are performed by writing to the CR3 register.  But, CR3
was really designed for context switches and writes to it also
flush the entire TLB (modulo global pages).  This TLB flush
increases the cost of interrupts and context switches.  For
syscall-heavy microbenchmarks it can cut the rate of syscalls by
2/3.

The kernel recently gained support for an Intel CPU feature
called Process Context IDentifiers (PCID) thanks to Andy
Lutomirski.  This feature is intended to allow you to switch
between contexts without flushing the TLB.

Implementation:

PCIDs can be used to avoid flushing the TLB at kernel entry/exit.
This speeds up both interrupts and syscalls.

First, the kernel and userspace must be assigned different ASIDs.
On entry from userspace, move over to the kernel page tables
*and* ASID.  On exit, restore the user page tables and ASID.
Fortunately, the ASID is programmed via CR3, which is already
being used to switch between the user and kernel page tables.
This gives us convenient, one-stop shopping.
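
For illustration, here is a minimal user-space sketch of the two CR3
values that result (the bit positions are the ones introduced later in
this series; the PGD address and ASID are made up, and the real
kern_asid() additionally adds 1 to the ASID):

	#include <stdio.h>
	#include <stdint.h>

	#define KAISER_PCID_SWITCH_BIT    11  /* user vs. kernel ASID     */
	#define KAISER_PGTABLE_SWITCH_BIT 12  /* which half of the 8k PGD */

	int main(void)
	{
		uint64_t pgd  = 0x1000000;  /* hypothetical PGD address */
		uint64_t asid = 2;          /* hypothetical kernel ASID */

		uint64_t kern_cr3 = pgd | asid;
		uint64_t user_cr3 = kern_cr3 |
				    (1ULL << KAISER_PCID_SWITCH_BIT) |
				    (1ULL << KAISER_PGTABLE_SWITCH_BIT);

		printf("kernel CR3: 0x%llx\n", (unsigned long long)kern_cr3);
		printf("user   CR3: 0x%llx\n", (unsigned long long)user_cr3);
		return 0;
	}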

The CR3 write which is used to switch between processes provides
all the TLB flushing normally required at context switch time.
But, with KAISER, that CR3 write only flushes the current
(kernel) ASID.  An extra TLB flush operation is now required in
order to flush the user ASID.  That flush uses the INVPCID
instruction and probably costs ~100 cycles, but the assumption is
that the time lost in context switches is more than made up for by
the lower cost of interrupts and syscalls.

Support:

PCIDs are generally available on Sandybridge and newer CPUs.  However,
the accompanying INVPCID instruction did not become available until
Haswell (the ones with "v4", also called fourth-generation Core).  This
instruction allows non-current-PCID TLB entries to be flushed without
switching CR3 and global pages to be flushed without a double
MOV-to-CR4.

Without INVPCID, PCIDs are much harder to use.  TLB invalidation gets
much more onerous:

1. Every kernel TLB flush (even for a single page) requires an
   interrupts-off MOV-to-CR4 which is very expensive.  This is because
   there is no way to flush a kernel address that might be loaded
   in *EVERY* PCID.  Right now, there are "only" ~12 of these per-cpu,
   but flushing each of them with a MOV-to-CR3 is too painful.  That
   leaves only the MOV-to-CR4.
2. Every userspace flush (even for a single page) requires one of the
   following:
   a. A pair of flushing (bit 63 clear) CR3 writes: one for
      the kernel ASID and another for userspace.
   b. A pair of non-flushing CR3 writes (bit 63 set) with the
      flush done for each.  For instance, what is currently a
      single instruction without KAISER:

		invpcid_flush_one(current_pcid, addr);

      becomes this with KAISER:

		invpcid_flush_one(current_kern_pcid, addr);
		invpcid_flush_one(current_user_pcid, addr);

      and this without INVPCID:

		__native_flush_tlb_single(addr);
		write_cr3(mm->pgd | current_user_pcid | NOFLUSH);
		__native_flush_tlb_single(addr);
		write_cr3(mm->pgd | current_kern_pcid | NOFLUSH);

So, for now, fully disable PCIDs with KAISER when INVPCID is not
available.  This is fixable, but it's an optimization that can be
performed later.

Hugh Dickins also points out that PCIDs really have two distinct
use-cases in the context of KAISER.  The first way they can be used
is as "TLB preservation across context-switch", which is what
Andy Lutomirski's 4.14 PCID code does.  They can also be used as
a "KAISER syscall/interrupt accelerator".  If we just use them to
speed up syscall/interrupts (and ignore the context-switch TLB
preservation), then the deficiency of not having INVPCID
becomes much less onerous.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Daniel Gruss <daniel.gruss@iaik.tugraz.at>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Kees Cook <keescook@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michael Schwarz <michael.schwarz@iaik.tugraz.at>
Cc: Moritz Lipp <moritz.lipp@iaik.tugraz.at>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Richard Fellner <richard.fellner@student.tugraz.at>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20171123003509.EC42DD15@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/entry/calling.h                    |  25 +++--
 arch/x86/entry/entry_64.S                   |   1 +
 arch/x86/include/asm/cpufeatures.h          |   1 +
 arch/x86/include/asm/pgtable_types.h        |  11 +++
 arch/x86/include/asm/tlbflush.h             | 137 +++++++++++++++++++++++-----
 arch/x86/include/uapi/asm/processor-flags.h |   3 +-
 arch/x86/kvm/x86.c                          |   3 +-
 arch/x86/mm/init.c                          |  75 ++++++++++-----
 arch/x86/mm/tlb.c                           |  66 +++++++++++++-
 9 files changed, 262 insertions(+), 60 deletions(-)

diff --git a/arch/x86/entry/calling.h b/arch/x86/entry/calling.h
index d087c3aa0514..66af80514197 100644
--- a/arch/x86/entry/calling.h
+++ b/arch/x86/entry/calling.h
@@ -3,6 +3,7 @@
 #include <asm/unwind_hints.h>
 #include <asm/cpufeatures.h>
 #include <asm/page_types.h>
+#include <asm/pgtable_types.h>
 
 /*
 
@@ -192,16 +193,20 @@ For 32-bit we have the following conventions - kernel is built with
 #ifdef CONFIG_KAISER
 
 /* KAISER PGDs are 8k.  Flip bit 12 to switch between the two halves: */
-#define KAISER_SWITCH_MASK (1<<PAGE_SHIFT)
+#define KAISER_SWITCH_PGTABLES_MASK (1<<PAGE_SHIFT)
+#define KAISER_SWITCH_MASK     (KAISER_SWITCH_PGTABLES_MASK|\
+				(1<<X86_CR3_KAISER_SWITCH_BIT))
 
 .macro ADJUST_KERNEL_CR3 reg:req
-	/* Clear "KAISER bit", point CR3 at kernel pagetables: */
-	andq	$(~KAISER_SWITCH_MASK), \reg
+	ALTERNATIVE "", "bts $63, \reg", X86_FEATURE_PCID
+	/* Clear PCID and "KAISER bit", point CR3 at kernel pagetables: */
+	andq    $(~KAISER_SWITCH_MASK), \reg
 .endm
 
 .macro ADJUST_USER_CR3 reg:req
-	/* Move CR3 up a page to the user page tables: */
-	orq	$(KAISER_SWITCH_MASK), \reg
+	ALTERNATIVE "", "bts $63, \reg", X86_FEATURE_PCID
+	/* Set user PCID bit, and move CR3 up a page to the user page tables: */
+	orq     $(KAISER_SWITCH_MASK), \reg
 .endm
 
 .macro SWITCH_TO_KERNEL_CR3 scratch_reg:req
@@ -220,8 +225,14 @@ For 32-bit we have the following conventions - kernel is built with
 	movq	%cr3, %r\scratch_reg
 	movq	%r\scratch_reg, \save_reg
 	/*
-	 * Is the switch bit zero?  This means the address is
-	 * up in real KAISER patches in a moment.
+	 * Is the "switch mask" all zero?  That means that both of
+	 * these are zero:
+	 *
+	 *	1. The user/kernel PCID bit, and
+	 *	2. The user/kernel "bit" that points CR3 to the
+	 *	   bottom half of the 8k PGD
+	 *
+	 * That indicates a kernel CR3 value, not user/shadow.
 	 */
 	testq	$(KAISER_SWITCH_MASK), %r\scratch_reg
 	jz	.Ldone_\@
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 07ed55e9e35a..20be5e89a36a 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -647,6 +647,7 @@ END(irq_entries_start)
 	testb	$3, CS-ORIG_RAX(%rsp)
 	jz	1f
 	SWAPGS
+	SWITCH_TO_KERNEL_CR3 scratch_reg=%rax
 	call	switch_to_thread_stack
 1:
 
diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index c0b0e9e8aa66..ea51d4a28d96 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -197,6 +197,7 @@
 #define X86_FEATURE_CAT_L3		( 7*32+ 4) /* Cache Allocation Technology L3 */
 #define X86_FEATURE_CAT_L2		( 7*32+ 5) /* Cache Allocation Technology L2 */
 #define X86_FEATURE_CDP_L3		( 7*32+ 6) /* Code and Data Prioritization L3 */
+#define X86_FEATURE_INVPCID_SINGLE ( 7*32+ 7) /* Effectively INVPCID && CR4.PCIDE=1 */
 
 #define X86_FEATURE_HW_PSTATE		( 7*32+ 8) /* AMD HW-PState */
 #define X86_FEATURE_PROC_FEEDBACK	( 7*32+ 9) /* AMD ProcFeedbackInterface */
diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index 1fc2f22b9002..8bc825a5b125 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -140,6 +140,17 @@
 			 _PAGE_SOFT_DIRTY)
 #define _HPAGE_CHG_MASK (_PAGE_CHG_MASK | _PAGE_PSE)
 
+/* The ASID is the lower 12 bits of CR3 */
+#define X86_CR3_PCID_ASID_MASK  (_AC((1<<12)-1, UL))
+
+/* Mask for all the PCID-related bits in CR3: */
+#define X86_CR3_PCID_MASK       (X86_CR3_PCID_NOFLUSH | X86_CR3_PCID_ASID_MASK)
+
+/* Make sure this is only usable in KAISER #ifdef'd code: */
+#ifdef CONFIG_KAISER
+#define X86_CR3_KAISER_SWITCH_BIT 11
+#endif
+
 /*
  * The cache modes defined here are used to translate between pure SW usage
  * and the HW defined cache mode bits and/or PAT entries.
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index bb5ba71038ee..ea1a3dce91c2 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -78,7 +78,12 @@ static inline u64 inc_mm_tlb_gen(struct mm_struct *mm)
 /* There are 12 bits of space for ASIDS in CR3 */
 #define CR3_HW_ASID_BITS 12
 /* When enabled, KAISER consumes a single bit for user/kernel switches */
+#ifdef CONFIG_KAISER
+#define X86_CR3_KAISER_SWITCH_BIT 11
+#define KAISER_CONSUMED_ASID_BITS 1
+#else
 #define KAISER_CONSUMED_ASID_BITS 0
+#endif
 
 #define CR3_AVAIL_ASID_BITS (CR3_HW_ASID_BITS - KAISER_CONSUMED_ASID_BITS)
 /*
@@ -88,21 +93,62 @@ static inline u64 inc_mm_tlb_gen(struct mm_struct *mm)
  */
 #define MAX_ASID_AVAILABLE ((1<<CR3_AVAIL_ASID_BITS) - 2)
 
+/*
+ * 6 because 6 should be plenty and struct tlb_state will fit in
+ * two cache lines.
+ */
+#define TLB_NR_DYN_ASIDS 6
+
 static inline u16 kern_asid(u16 asid)
 {
 	VM_WARN_ON_ONCE(asid > MAX_ASID_AVAILABLE);
+
+#ifdef CONFIG_KAISER
+	/*
+	 * Make sure that the dynamic ASID space does not conflict
+	 * with the bit we are using to switch between user and
+	 * kernel ASIDs.
+	 */
+	BUILD_BUG_ON(TLB_NR_DYN_ASIDS >= (1<<X86_CR3_KAISER_SWITCH_BIT));
+
 	/*
-	 * If PCID is on, ASID-aware code paths put the ASID+1 into the PCID
-	 * bits.  This serves two purposes.  It prevents a nasty situation in
-	 * which PCID-unaware code saves CR3, loads some other value (with PCID
-	 * == 0), and then restores CR3, thus corrupting the TLB for ASID 0 if
-	 * the saved ASID was nonzero.  It also means that any bugs involving
-	 * loading a PCID-enabled CR3 with CR4.PCIDE off will trigger
-	 * deterministically.
+	 * The ASID being passed in here should have respected
+	 * the MAX_ASID_AVAILABLE and thus never have the switch
+	 * bit set.
+	 */
+	VM_WARN_ON_ONCE(asid & (1<<X86_CR3_KAISER_SWITCH_BIT));
+#endif
+	/*
+	 * The dynamically-assigned ASIDs that get passed in are
+	 * small (<TLB_NR_DYN_ASIDS).  They never have the high
+	 * switch bit set, so do not bother to clear it.
+	 */
+
+	/*
+	 * If PCID is on, ASID-aware code paths put the ASID+1
+	 * into the PCID bits.  This serves two purposes.  It
+	 * prevents a nasty situation in which PCID-unaware code
+	 * saves CR3, loads some other value (with PCID == 0),
+	 * and then restores CR3, thus corrupting the TLB for
+	 * ASID 0 if the saved ASID was nonzero.  It also means
+	 * that any bugs involving loading a PCID-enabled CR3
+	 * with CR4.PCIDE off will trigger deterministically.
 	 */
 	return asid + 1;
 }
 
+/*
+ * The user ASID is just the kernel one, plus the "switch bit".
+ */
+static inline u16 user_asid(u16 asid)
+{
+	u16 ret = kern_asid(asid);
+#ifdef CONFIG_KAISER
+	ret |= 1<<X86_CR3_KAISER_SWITCH_BIT;
+#endif
+	return ret;
+}
+
 struct pgd_t;
 static inline unsigned long build_cr3(pgd_t *pgd, u16 asid)
 {
@@ -145,12 +191,6 @@ static inline bool tlb_defer_switch_to_init_mm(void)
 	return !static_cpu_has(X86_FEATURE_PCID);
 }
 
-/*
- * 6 because 6 should be plenty and struct tlb_state will fit in
- * two cache lines.
- */
-#define TLB_NR_DYN_ASIDS 6
-
 struct tlb_context {
 	u64 ctx_id;
 	u64 tlb_gen;
@@ -306,18 +346,42 @@ extern void initialize_tlbstate_and_flush(void);
 
 static inline void __native_flush_tlb(void)
 {
+	if (!cpu_feature_enabled(X86_FEATURE_INVPCID)) {
+		/*
+		 * native_write_cr3() only clears the current PCID if
+		 * CR4 has X86_CR4_PCIDE set.  In other words, this does
+		 * not fully flush the TLB if PCIDs are in use.
+		 *
+		 * With KAISER and PCIDs, this means that we did not
+		 * flush the user PCID.  Warn if it gets called.
+		 */
+		if (IS_ENABLED(CONFIG_KAISER))
+			WARN_ON_ONCE(this_cpu_read(cpu_tlbstate.cr4) &
+				     X86_CR4_PCIDE);
+		/*
+		 * If current->mm == NULL then we borrow a mm
+		 * which may change during a task switch and
+		 * therefore we must not be preempted while we
+		 * write CR3 back:
+		 */
+		preempt_disable();
+		native_write_cr3(__native_read_cr3());
+		preempt_enable();
+		/*
+		 * Does not need tlb_flush_shared_nonglobals()
+		 * since the CR3 write without PCIDs flushes all
+		 * non-globals.
+		 */
+		return;
+	}
 	/*
-	 * If current->mm == NULL then we borrow a mm which may change during a
-	 * task switch and therefore we must not be preempted while we write CR3
-	 * back:
-	 */
-	preempt_disable();
-	native_write_cr3(__native_read_cr3());
-	preempt_enable();
-	/*
-	 * Does not need tlb_flush_shared_nonglobals() since the CR3 write
-	 * without PCIDs flushes all non-globals.
+	 * We are no longer using globals with KAISER, so a
+	 * "nonglobals" flush would work too. But, this is more
+	 * conservative.
+	 *
+	 * Note, this works with CR4.PCIDE=0 or 1.
 	 */
+	invpcid_flush_all();
 }
 
 static inline void __native_flush_tlb_global_irq_disabled(void)
@@ -339,6 +403,8 @@ static inline void __native_flush_tlb_global(void)
 		/*
 		 * Using INVPCID is considerably faster than a pair of writes
 		 * to CR4 sandwiched inside an IRQ flag save/restore.
+		 *
+		 * Note, this works with CR4.PCIDE=0 or 1.
 		 */
 		invpcid_flush_all();
 		return;
@@ -358,7 +424,30 @@ static inline void __native_flush_tlb_global(void)
 
 static inline void __native_flush_tlb_single(unsigned long addr)
 {
-	asm volatile("invlpg (%0)" ::"r" (addr) : "memory");
+	u32 loaded_mm_asid = this_cpu_read(cpu_tlbstate.loaded_mm_asid);
+
+	/*
+	 * Some platforms #GP if we call invpcid(type=1/2) before
+	 * CR4.PCIDE=1.  Just call invpcid in the case we are called
+	 * early.
+	 */
+	if (!this_cpu_has(X86_FEATURE_INVPCID_SINGLE)) {
+		asm volatile("invlpg (%0)" ::"r" (addr) : "memory");
+		return;
+	}
+	/* Flush the address out of both PCIDs. */
+	/*
+	 * An optimization here might be to determine addresses
+	 * that are only kernel-mapped and only flush the kernel
+	 * ASID.  But, userspace flushes are probably much more
+	 * important performance-wise.
+	 *
+	 * Make sure to do only a single invpcid when KAISER is
+	 * disabled and we have only a single ASID.
+	 */
+	if (kern_asid(loaded_mm_asid) != user_asid(loaded_mm_asid))
+		invpcid_flush_one(user_asid(loaded_mm_asid), addr);
+	invpcid_flush_one(kern_asid(loaded_mm_asid), addr);
 }
 
 static inline void __flush_tlb_all(void)
diff --git a/arch/x86/include/uapi/asm/processor-flags.h b/arch/x86/include/uapi/asm/processor-flags.h
index 7e1e730396ae..7ef94b64dbb4 100644
--- a/arch/x86/include/uapi/asm/processor-flags.h
+++ b/arch/x86/include/uapi/asm/processor-flags.h
@@ -78,7 +78,8 @@
 #define X86_CR3_PWT		_BITUL(X86_CR3_PWT_BIT)
 #define X86_CR3_PCD_BIT		4 /* Page Cache Disable */
 #define X86_CR3_PCD		_BITUL(X86_CR3_PCD_BIT)
-#define X86_CR3_PCID_MASK	_AC(0x00000fff,UL) /* PCID Mask */
+#define X86_CR3_PCID_NOFLUSH_BIT 63 /* Preserve old PCID */
+#define X86_CR3_PCID_NOFLUSH    _BITULL(X86_CR3_PCID_NOFLUSH_BIT)
 
 /*
  * Intel CPU features in CR4
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 03869eb7fcd6..cd7ed7a874d1 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -805,7 +805,8 @@ int kvm_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
 			return 1;
 
 		/* PCID can not be enabled when cr3[11:0]!=000H or EFER.LMA=0 */
-		if ((kvm_read_cr3(vcpu) & X86_CR3_PCID_MASK) || !is_long_mode(vcpu))
+		if ((kvm_read_cr3(vcpu) & X86_CR3_PCID_ASID_MASK) ||
+		    !is_long_mode(vcpu))
 			return 1;
 	}
 
diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index a22c2b95e513..9618e57d46cf 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -196,34 +196,59 @@ static void __init probe_page_size_mask(void)
 
 static void setup_pcid(void)
 {
-#ifdef CONFIG_X86_64
-	if (boot_cpu_has(X86_FEATURE_PCID)) {
-		if (boot_cpu_has(X86_FEATURE_PGE)) {
-			/*
-			 * This can't be cr4_set_bits_and_update_boot() --
-			 * the trampoline code can't handle CR4.PCIDE and
-			 * it wouldn't do any good anyway.  Despite the name,
-			 * cr4_set_bits_and_update_boot() doesn't actually
-			 * cause the bits in question to remain set all the
-			 * way through the secondary boot asm.
-			 *
-			 * Instead, we brute-force it and set CR4.PCIDE
-			 * manually in start_secondary().
-			 */
-			cr4_set_bits(X86_CR4_PCIDE);
-		} else {
-			/*
-			 * flush_tlb_all(), as currently implemented, won't
-			 * work if PCID is on but PGE is not.  Since that
-			 * combination doesn't exist on real hardware, there's
-			 * no reason to try to fully support it, but it's
-			 * polite to avoid corrupting data if we're on
-			 * an improperly configured VM.
-			 */
+	if (!IS_ENABLED(CONFIG_X86_64))
+		return;
+
+	if (!boot_cpu_has(X86_FEATURE_PCID))
+		return;
+
+	if (boot_cpu_has(X86_FEATURE_PGE)) {
+		/*
+		 * KAISER uses a PCID for the kernel and another
+		 * for userspace.  Both PCIDs need to be flushed
+		 * when the TLB flush functions are called.  But,
+		 * flushing *another* PCID is insane without
+		 * INVPCID.  Just avoid using PCIDs at all if we
+		 * have KAISER and do not have INVPCID.
+		 */
+		if (!IS_ENABLED(CONFIG_X86_GLOBAL_PAGES) &&
+		    !boot_cpu_has(X86_FEATURE_INVPCID)) {
 			setup_clear_cpu_cap(X86_FEATURE_PCID);
+			return;
 		}
+		/*
+		 * This can't be cr4_set_bits_and_update_boot() --
+		 * the trampoline code can't handle CR4.PCIDE and
+		 * it wouldn't do any good anyway.  Despite the name,
+		 * cr4_set_bits_and_update_boot() doesn't actually
+		 * cause the bits in question to remain set all the
+		 * way through the secondary boot asm.
+		 *
+		 * Instead, we brute-force it and set CR4.PCIDE
+		 * manually in start_secondary().
+		 */
+		cr4_set_bits(X86_CR4_PCIDE);
+
+		/*
+		 * INVPCID's single-context modes (2/3) only work
+		 * if we set X86_CR4_PCIDE, *and* we have INVPCID
+		 * support.  It's unusable on systems that have
+		 * X86_CR4_PCIDE clear, or that have no INVPCID
+		 * support at all.
+		 */
+		if (boot_cpu_has(X86_FEATURE_INVPCID))
+			setup_force_cpu_cap(X86_FEATURE_INVPCID_SINGLE);
+	} else {
+		/*
+		 * flush_tlb_all(), as currently implemented, won't
+		 * work if PCID is on but PGE is not.  Since that
+		 * combination doesn't exist on real hardware, there's
+		 * no reason to try to fully support it, but it's
+		 * polite to avoid corrupting data if we're on
+		 * an improperly configured VM.
+		 */
+		setup_clear_cpu_cap(X86_FEATURE_PCID);
 	}
-#endif
 }
 
 #ifdef CONFIG_X86_32
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 81941f1690fa..f75b6eb47a6d 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -100,6 +100,68 @@ static void choose_new_asid(struct mm_struct *next, u64 next_tlb_gen,
 	*need_flush = true;
 }
 
+/*
+ * Given a kernel asid, flush the corresponding KAISER
+ * user ASID.
+ */
+static void flush_user_asid(pgd_t *pgd, u16 kern_asid)
+{
+	/* There is no user ASID if KAISER is off */
+	if (!IS_ENABLED(CONFIG_KAISER))
+		return;
+	/*
+	 * We only have a single ASID if PCID is off and the CR3
+	 * write will have flushed it.
+	 */
+	if (!cpu_feature_enabled(X86_FEATURE_PCID))
+		return;
+	/*
+	 * With PCIDs enabled, write_cr3() only flushes TLB
+	 * entries for the current (kernel) ASID.  This leaves
+	 * old TLB entries for the user ASID in place and we must
+	 * flush that context separately.  We can theoretically
+	 * delay doing this until we actually load up the
+	 * userspace CR3, but do it here for simplicity.
+	 */
+	if (cpu_feature_enabled(X86_FEATURE_INVPCID)) {
+		invpcid_flush_single_context(user_asid(kern_asid));
+	} else {
+		/*
+		 * On systems with PCIDs, but no INVPCID, the only
+		 * way to flush a PCID is a CR3 write.  Note that
+		 * we use the kernel page tables with the *user*
+		 * ASID here.
+		 */
+		unsigned long user_asid_flush_cr3;
+		user_asid_flush_cr3 = build_cr3(pgd, user_asid(kern_asid));
+		write_cr3(user_asid_flush_cr3);
+		/*
+		 * We do not use PCIDs with KAISER unless we also
+		 * have INVPCID.  Getting here is unexpected.
+		 */
+		WARN_ON_ONCE(1);
+	}
+}
+
+static void load_new_mm_cr3(pgd_t *pgdir, u16 new_asid, bool need_flush)
+{
+	unsigned long new_mm_cr3;
+
+	if (need_flush) {
+		flush_user_asid(pgdir, new_asid);
+		new_mm_cr3 = build_cr3(pgdir, new_asid);
+	} else {
+		new_mm_cr3 = build_cr3_noflush(pgdir, new_asid);
+	}
+
+	/*
+	 * Caution: many callers of this function expect
+	 * that load_cr3() is serializing and orders TLB
+	 * fills with respect to the mm_cpumask writes.
+	 */
+	write_cr3(new_mm_cr3);
+}
+
 void leave_mm(int cpu)
 {
 	struct mm_struct *loaded_mm = this_cpu_read(cpu_tlbstate.loaded_mm);
@@ -230,7 +292,7 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 		if (need_flush) {
 			this_cpu_write(cpu_tlbstate.ctxs[new_asid].ctx_id, next->context.ctx_id);
 			this_cpu_write(cpu_tlbstate.ctxs[new_asid].tlb_gen, next_tlb_gen);
-			write_cr3(build_cr3(next->pgd, new_asid));
+			load_new_mm_cr3(next->pgd, new_asid, true);
 
 			/*
 			 * NB: This gets called via leave_mm() in the idle path
@@ -243,7 +305,7 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 			trace_tlb_flush_rcuidle(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL);
 		} else {
 			/* The new ASID is already up to date. */
-			write_cr3(build_cr3_noflush(next->pgd, new_asid));
+			load_new_mm_cr3(next->pgd, new_asid, false);
 
 			/* See above wrt _rcuidle. */
 			trace_tlb_flush_rcuidle(TLB_FLUSH_ON_TASK_SWITCH, 0);
-- 
2.14.1

^ permalink raw reply related	[flat|nested] 82+ messages in thread

* [PATCH 38/43] x86/mm/kaiser: Disable native VSYSCALL
  2017-11-24  9:14 [PATCH 00/43] x86 entry-stack and Kaiser series, 2017/11/24 version Ingo Molnar
                   ` (36 preceding siblings ...)
  2017-11-24  9:14 ` [PATCH 37/43] x86/mm/kaiser: Use PCID feature to make user and kernel switches faster Ingo Molnar
@ 2017-11-24  9:14 ` Ingo Molnar
  2017-11-24  9:14 ` [PATCH 39/43] x86/mm/kaiser: Add debugfs file to turn KAISER on/off at runtime Ingo Molnar
                   ` (6 subsequent siblings)
  44 siblings, 0 replies; 82+ messages in thread
From: Ingo Molnar @ 2017-11-24  9:14 UTC (permalink / raw)
  To: linux-kernel
  Cc: Dave Hansen, Andy Lutomirski, Thomas Gleixner, H . Peter Anvin,
	Peter Zijlstra, Borislav Petkov, Linus Torvalds

From: Dave Hansen <dave.hansen@linux.intel.com>

The KAISER code attempts to "poison" the user portion of the kernel page
tables.  It detects entries that it wants to poison in two ways:
 * Looking for addresses >= PAGE_OFFSET
 * Looking for entries without _PAGE_USER set

But, to allow the _PAGE_USER check to work, it must never be set on
init_mm entries, and an earlier patch in this series ensured that it
will never be set.

The VDSO is at an address >= PAGE_OFFSET and it is also mapped by init_mm.
Because of the earlier, KAISER-enforced restriction, _PAGE_USER is never
set which makes the VDSO unreadable to userspace.

This makes the "NATIVE" case totally unusable since userspace can not
even see the memory any more.  Disable it whenever KAISER is enabled.

Also add some help text about how KAISER might affect the emulation
case as well.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Daniel Gruss <daniel.gruss@iaik.tugraz.at>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Kees Cook <keescook@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michael Schwarz <michael.schwarz@iaik.tugraz.at>
Cc: Moritz Lipp <moritz.lipp@iaik.tugraz.at>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Richard Fellner <richard.fellner@student.tugraz.at>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20171123003513.10CAD896@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/Kconfig | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 09dcc94c4484..d23cd2902b10 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -2249,6 +2249,9 @@ choice
 
 	config LEGACY_VSYSCALL_NATIVE
 		bool "Native"
+		# The VSYSCALL page comes from the kernel page tables
+		# and is not available when KAISER is enabled.
+		depends on ! KAISER
 		help
 		  Actual executable code is located in the fixed vsyscall
 		  address mapping, implementing time() efficiently. Since
@@ -2266,6 +2269,11 @@ choice
 		  exploits. This configuration is recommended when userspace
 		  still uses the vsyscall area.
 
+		  When KAISER is enabled, the vsyscall area will become
+		  unreadable.  This emulation option still works, but KAISER
+		  will make it harder to do things like trace code using the
+		  emulation.
+
 	config LEGACY_VSYSCALL_NONE
 		bool "None"
 		help
-- 
2.14.1

^ permalink raw reply related	[flat|nested] 82+ messages in thread

* [PATCH 39/43] x86/mm/kaiser: Add debugfs file to turn KAISER on/off at runtime
  2017-11-24  9:14 [PATCH 00/43] x86 entry-stack and Kaiser series, 2017/11/24 version Ingo Molnar
                   ` (37 preceding siblings ...)
  2017-11-24  9:14 ` [PATCH 38/43] x86/mm/kaiser: Disable native VSYSCALL Ingo Molnar
@ 2017-11-24  9:14 ` Ingo Molnar
  2017-11-24  9:14 ` [PATCH 40/43] x86/mm/kaiser: Add a function to check for KAISER being enabled Ingo Molnar
                   ` (5 subsequent siblings)
  44 siblings, 0 replies; 82+ messages in thread
From: Ingo Molnar @ 2017-11-24  9:14 UTC (permalink / raw)
  To: linux-kernel
  Cc: Dave Hansen, Andy Lutomirski, Thomas Gleixner, H . Peter Anvin,
	Peter Zijlstra, Borislav Petkov, Linus Torvalds

From: Dave Hansen <dave.hansen@linux.intel.com>

This will be used in a few patches.  Right now, it's not wired up
to do anything useful.
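
Once it is wired up by the later patches in this series, poking at the
file looks like this (hypothetical session):

	# cat /sys/kernel/debug/x86/kaiser-enabled
	1
	# echo 0 > /sys/kernel/debug/x86/kaiser-enabled
	# cat /sys/kernel/debug/x86/kaiser-enabled
	0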

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Daniel Gruss <daniel.gruss@iaik.tugraz.at>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Kees Cook <keescook@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michael Schwarz <michael.schwarz@iaik.tugraz.at>
Cc: Moritz Lipp <moritz.lipp@iaik.tugraz.at>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Richard Fellner <richard.fellner@student.tugraz.at>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20171123003517.8EAB76E0@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/mm/kaiser.c | 48 ++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 48 insertions(+)

diff --git a/arch/x86/mm/kaiser.c b/arch/x86/mm/kaiser.c
index 4665dd724efb..968d5b62d597 100644
--- a/arch/x86/mm/kaiser.c
+++ b/arch/x86/mm/kaiser.c
@@ -29,6 +29,7 @@
 #include <linux/string.h>
 #include <linux/types.h>
 #include <linux/bug.h>
+#include <linux/debugfs.h>
 #include <linux/init.h>
 #include <linux/spinlock.h>
 #include <linux/mm.h>
@@ -470,3 +471,50 @@ void kaiser_remove_mapping(unsigned long start, unsigned long size)
 	 */
 	__native_flush_tlb_global();
 }
+
+int kaiser_enabled = 1;
+static ssize_t kaiser_enabled_read_file(struct file *file, char __user *user_buf,
+			     size_t count, loff_t *ppos)
+{
+	char buf[32];
+	unsigned int len;
+
+	len = sprintf(buf, "%d\n", kaiser_enabled);
+	return simple_read_from_buffer(user_buf, count, ppos, buf, len);
+}
+
+static ssize_t kaiser_enabled_write_file(struct file *file,
+		 const char __user *user_buf, size_t count, loff_t *ppos)
+{
+	char buf[32];
+	ssize_t len;
+	unsigned int enable;
+
+	len = min(count, sizeof(buf) - 1);
+	if (copy_from_user(buf, user_buf, len))
+		return -EFAULT;
+
+	buf[len] = '\0';
+	if (kstrtoint(buf, 0, &enable))
+		return -EINVAL;
+
+	if (enable > 1)
+		return -EINVAL;
+
+	WRITE_ONCE(kaiser_enabled, enable);
+	return count;
+}
+
+static const struct file_operations fops_kaiser_enabled = {
+	.read = kaiser_enabled_read_file,
+	.write = kaiser_enabled_write_file,
+	.llseek = default_llseek,
+};
+
+static int __init create_kaiser_enabled(void)
+{
+	debugfs_create_file("kaiser-enabled", S_IRUSR | S_IWUSR,
+			    arch_debugfs_dir, NULL, &fops_kaiser_enabled);
+	return 0;
+}
+late_initcall(create_kaiser_enabled);
-- 
2.14.1

^ permalink raw reply related	[flat|nested] 82+ messages in thread

* [PATCH 40/43] x86/mm/kaiser: Add a function to check for KAISER being enabled
  2017-11-24  9:14 [PATCH 00/43] x86 entry-stack and Kaiser series, 2017/11/24 version Ingo Molnar
                   ` (38 preceding siblings ...)
  2017-11-24  9:14 ` [PATCH 39/43] x86/mm/kaiser: Add debugfs file to turn KAISER on/off at runtime Ingo Molnar
@ 2017-11-24  9:14 ` Ingo Molnar
  2017-11-24  9:14 ` [PATCH 41/43] x86/mm/kaiser: Un-poison PGDs at runtime Ingo Molnar
                   ` (4 subsequent siblings)
  44 siblings, 0 replies; 82+ messages in thread
From: Ingo Molnar @ 2017-11-24  9:14 UTC (permalink / raw)
  To: linux-kernel
  Cc: Dave Hansen, Andy Lutomirski, Thomas Gleixner, H . Peter Anvin,
	Peter Zijlstra, Borislav Petkov, Linus Torvalds

From: Dave Hansen <dave.hansen@linux.intel.com>

Currently, all of the checks for KAISER are compile-time checks.

Runtime checks are needed for turning it on/off at runtime.

Add a function to do that.
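
A call site looks like the one added later in this series (see the
runtime un-poisoning patch):

	if (kaiser_active())
		kaiser_poison_pgd(&pgd);

The CONFIG_KAISER=n stub always returns false, so such checks can be
optimized away entirely.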

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Daniel Gruss <daniel.gruss@iaik.tugraz.at>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Kees Cook <keescook@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michael Schwarz <michael.schwarz@iaik.tugraz.at>
Cc: Moritz Lipp <moritz.lipp@iaik.tugraz.at>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Richard Fellner <richard.fellner@student.tugraz.at>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20171123003518.B7D81B14@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/include/asm/kaiser.h | 5 +++++
 include/linux/kaiser.h        | 5 +++++
 2 files changed, 10 insertions(+)

diff --git a/arch/x86/include/asm/kaiser.h b/arch/x86/include/asm/kaiser.h
index 040cb096d29d..35f12a8a7071 100644
--- a/arch/x86/include/asm/kaiser.h
+++ b/arch/x86/include/asm/kaiser.h
@@ -56,6 +56,11 @@ extern void kaiser_remove_mapping(unsigned long start, unsigned long size);
  */
 extern void kaiser_init(void);
 
+static inline bool kaiser_active(void)
+{
+	extern int kaiser_enabled;
+	return kaiser_enabled;
+}
 #endif
 
 #endif /* __ASSEMBLY__ */
diff --git a/include/linux/kaiser.h b/include/linux/kaiser.h
index 77db4230a0dd..a3d28d00d555 100644
--- a/include/linux/kaiser.h
+++ b/include/linux/kaiser.h
@@ -28,5 +28,10 @@ static inline int kaiser_add_mapping(unsigned long addr, unsigned long size,
 static inline void kaiser_add_mapping_cpu_entry(int cpu)
 {
 }
+
+static inline bool kaiser_active(void)
+{
+	return 0;
+}
 #endif /* !CONFIG_KAISER */
 #endif /* _INCLUDE_KAISER_H */
-- 
2.14.1

^ permalink raw reply related	[flat|nested] 82+ messages in thread

* [PATCH 41/43] x86/mm/kaiser: Un-poison PGDs at runtime
  2017-11-24  9:14 [PATCH 00/43] x86 entry-stack and Kaiser series, 2017/11/24 version Ingo Molnar
                   ` (39 preceding siblings ...)
  2017-11-24  9:14 ` [PATCH 40/43] x86/mm/kaiser: Add a function to check for KAISER being enabled Ingo Molnar
@ 2017-11-24  9:14 ` Ingo Molnar
  2017-11-24  9:14 ` [PATCH 42/43] x86/mm/kaiser: Allow KAISER to be enabled/disabled " Ingo Molnar
                   ` (3 subsequent siblings)
  44 siblings, 0 replies; 82+ messages in thread
From: Ingo Molnar @ 2017-11-24  9:14 UTC (permalink / raw)
  To: linux-kernel
  Cc: Dave Hansen, Andy Lutomirski, Thomas Gleixner, H . Peter Anvin,
	Peter Zijlstra, Borislav Petkov, Linus Torvalds

From: Dave Hansen <dave.hansen@linux.intel.com>

With KAISER, kernel PGDs that map userspace are "poisoned" with
the NX bit.  This ensures that if a kernel->user CR3 switch is
missed, userspace crashes instead of running in an unhardened
state.
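
The poisoning rule itself is tiny.  As a self-contained illustration
(raw 64-bit values instead of pgd_t, with the standard x86-64 bit
positions):

	#include <stdint.h>

	#define _PAGE_PRESENT (1ULL <<  0)
	#define _PAGE_NX      (1ULL << 63)

	static uint64_t poison(uint64_t pgd)
	{
		return (pgd & _PAGE_PRESENT) ? (pgd | _PAGE_NX) : pgd;
	}

	static uint64_t unpoison(uint64_t pgd)
	{
		return (pgd & _PAGE_PRESENT) ? (pgd & ~_PAGE_NX) : pgd;
	}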

This code will be needed in a moment when KAISER is turned
on and off at runtime.

Note that an __ASSEMBLY__ #ifdef is now required since kaiser.h
is indirectly included into assembly.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Daniel Gruss <daniel.gruss@iaik.tugraz.at>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Kees Cook <keescook@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michael Schwarz <michael.schwarz@iaik.tugraz.at>
Cc: Moritz Lipp <moritz.lipp@iaik.tugraz.at>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Richard Fellner <richard.fellner@student.tugraz.at>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20171123003521.A90AC3AF@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/include/asm/pgtable_64.h | 16 +++++++++++++++-
 arch/x86/mm/kaiser.c              | 38 ++++++++++++++++++++++++++++++++++++++
 include/linux/kaiser.h            |  3 ++-
 3 files changed, 55 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h
index c239839e92bd..89bde2091af1 100644
--- a/arch/x86/include/asm/pgtable_64.h
+++ b/arch/x86/include/asm/pgtable_64.h
@@ -3,6 +3,7 @@
 #define _ASM_X86_PGTABLE_64_H
 
 #include <linux/const.h>
+#include <linux/kaiser.h>
 #include <asm/pgtable_64_types.h>
 
 #ifndef __ASSEMBLY__
@@ -199,6 +200,18 @@ static inline bool pgd_userspace_access(pgd_t pgd)
 	return pgd.pgd & _PAGE_USER;
 }
 
+static inline void kaiser_poison_pgd(pgd_t *pgd)
+{
+	if (pgd->pgd & _PAGE_PRESENT)
+		pgd->pgd |= _PAGE_NX;
+}
+
+static inline void kaiser_unpoison_pgd(pgd_t *pgd)
+{
+	if (pgd->pgd & _PAGE_PRESENT)
+		pgd->pgd &= ~_PAGE_NX;
+}
+
 /*
  * Take a PGD location (pgdp) and a pgd value that needs
  * to be set there.  Populates the shadow and returns
@@ -222,7 +235,8 @@ static inline pgd_t kaiser_set_shadow_pgd(pgd_t *pgdp, pgd_t pgd)
 			 * wrong CR3 value, userspace will crash
 			 * instead of running.
 			 */
-			pgd.pgd |= _PAGE_NX;
+			if (kaiser_active())
+				kaiser_poison_pgd(&pgd);
 		}
 	} else if (pgd_userspace_access(*pgdp)) {
 		/*
diff --git a/arch/x86/mm/kaiser.c b/arch/x86/mm/kaiser.c
index 968d5b62d597..06966b111280 100644
--- a/arch/x86/mm/kaiser.c
+++ b/arch/x86/mm/kaiser.c
@@ -501,6 +501,9 @@ static ssize_t kaiser_enabled_write_file(struct file *file,
 	if (enable > 1)
 		return -EINVAL;
 
+	if (kaiser_enabled == enable)
+		return count;
+
 	WRITE_ONCE(kaiser_enabled, enable);
 	return count;
 }
@@ -518,3 +521,38 @@ static int __init create_kaiser_enabled(void)
 	return 0;
 }
 late_initcall(create_kaiser_enabled);
+
+enum poison {
+	KAISER_POISON,
+	KAISER_UNPOISON
+};
+void kaiser_poison_pgd_page(pgd_t *pgd_page, enum poison do_poison)
+{
+	int i = 0;
+
+	for (i = 0; i < PTRS_PER_PGD; i++) {
+		pgd_t *pgd = &pgd_page[i];
+
+		/* Stop once we hit kernel addresses: */
+		if (!pgdp_maps_userspace(pgd))
+			break;
+
+		if (do_poison == KAISER_POISON)
+			kaiser_poison_pgd(pgd);
+		else
+			kaiser_unpoison_pgd(pgd);
+	}
+
+}
+
+void kaiser_poison_pgds(enum poison do_poison)
+{
+	struct page *page;
+
+	spin_lock(&pgd_lock);
+	list_for_each_entry(page, &pgd_list, lru) {
+		pgd_t *pgd = (pgd_t *)page_address(page);
+		kaiser_poison_pgd_page(pgd, do_poison);
+	}
+	spin_unlock(&pgd_lock);
+}
diff --git a/include/linux/kaiser.h b/include/linux/kaiser.h
index a3d28d00d555..83d465599646 100644
--- a/include/linux/kaiser.h
+++ b/include/linux/kaiser.h
@@ -4,7 +4,7 @@
 #ifdef CONFIG_KAISER
 #include <asm/kaiser.h>
 #else
-
+#ifndef __ASSEMBLY__
 /*
  * These stubs are used whenever CONFIG_KAISER is off, which
  * includes architectures that support KAISER, but have it
@@ -33,5 +33,6 @@ static inline bool kaiser_active(void)
 {
 	return 0;
 }
+#endif /* __ASSEMBLY__ */
 #endif /* !CONFIG_KAISER */
 #endif /* _INCLUDE_KAISER_H */
-- 
2.14.1

^ permalink raw reply related	[flat|nested] 82+ messages in thread

* [PATCH 42/43] x86/mm/kaiser: Allow KAISER to be enabled/disabled at runtime
  2017-11-24  9:14 [PATCH 00/43] x86 entry-stack and Kaiser series, 2017/11/24 version Ingo Molnar
                   ` (40 preceding siblings ...)
  2017-11-24  9:14 ` [PATCH 41/43] x86/mm/kaiser: Un-poison PGDs at runtime Ingo Molnar
@ 2017-11-24  9:14 ` Ingo Molnar
  2017-11-24  9:14 ` [PATCH 43/43] x86/mm/kaiser: Add Kconfig Ingo Molnar
                   ` (2 subsequent siblings)
  44 siblings, 0 replies; 82+ messages in thread
From: Ingo Molnar @ 2017-11-24  9:14 UTC (permalink / raw)
  To: linux-kernel
  Cc: Dave Hansen, Andy Lutomirski, Thomas Gleixner, H . Peter Anvin,
	Peter Zijlstra, Borislav Petkov, Linus Torvalds

From: Dave Hansen <dave.hansen@linux.intel.com>

The KAISER CR3 switches are expensive for many reasons.  Not all systems
benefit from the protection provided by KAISER.  Some of them can not
pay the high performance cost.

This patch adds a debugfs file.  To disable KAISER, you do:

	echo 0 > /sys/kernel/debug/x86/kaiser-enabled

and to re-enable it, you do:

	echo 1 > /sys/kernel/debug/x86/kaiser-enabled

This is a *minimal* implementation.  There are certainly plenty of
optimizations that can be done on top of this by using ALTERNATIVES
among other things.

This does, however, completely remove all the KAISER-based CR3 writes.
This permits a paravirtualized system that can not tolerate CR3
writes to theoretically survive with CONFIG_KAISER=y, albeit with
/sys/kernel/debug/x86/kaiser-enabled=0.
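
The subtle part is the ordering in the disable path: the PGDs must be
made usable by userspace, and every CPU's TLB flushed, *before* the
assembly stops switching CR3, and every CPU must then be forced through
an interrupt so that none keeps running on the user CR3.  In outline
(the same calls as in the patch below):

	kaiser_poison_pgds(KAISER_UNPOISON); /* kernel PGDs usable by userspace */
	flush_tlb_all();                     /* drop the poisoned TLB entries */
	kaiser_asm_do_switch[0] = 0;         /* assembly stops switching CR3 */
	flush_tlb_all();                     /* kick every CPU off the user CR3 */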

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Daniel Gruss <daniel.gruss@iaik.tugraz.at>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Kees Cook <keescook@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michael Schwarz <michael.schwarz@iaik.tugraz.at>
Cc: Moritz Lipp <moritz.lipp@iaik.tugraz.at>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Richard Fellner <richard.fellner@student.tugraz.at>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20171123003523.28FFBAB6@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/entry/calling.h | 12 +++++++++
 arch/x86/mm/kaiser.c     | 70 +++++++++++++++++++++++++++++++++++++++++++++---
 2 files changed, 78 insertions(+), 4 deletions(-)

diff --git a/arch/x86/entry/calling.h b/arch/x86/entry/calling.h
index 66af80514197..89ccf7ae0e23 100644
--- a/arch/x86/entry/calling.h
+++ b/arch/x86/entry/calling.h
@@ -209,19 +209,29 @@ For 32-bit we have the following conventions - kernel is built with
 	orq     $(KAISER_SWITCH_MASK), \reg
 .endm
 
+.macro JUMP_IF_KAISER_OFF	label
+	testq   $1, kaiser_asm_do_switch
+	jz      \label
+.endm
+
 .macro SWITCH_TO_KERNEL_CR3 scratch_reg:req
+	JUMP_IF_KAISER_OFF	.Lswitch_done_\@
 	mov	%cr3, \scratch_reg
 	ADJUST_KERNEL_CR3 \scratch_reg
 	mov	\scratch_reg, %cr3
+.Lswitch_done_\@:
 .endm
 
 .macro SWITCH_TO_USER_CR3 scratch_reg:req
+	JUMP_IF_KAISER_OFF	.Lswitch_done_\@
 	mov	%cr3, \scratch_reg
 	ADJUST_USER_CR3 \scratch_reg
 	mov	\scratch_reg, %cr3
+.Lswitch_done_\@:
 .endm
 
 .macro SAVE_AND_SWITCH_TO_KERNEL_CR3 scratch_reg:req save_reg:req
+	JUMP_IF_KAISER_OFF	.Ldone_\@
 	movq	%cr3, %r\scratch_reg
 	movq	%r\scratch_reg, \save_reg
 	/*
@@ -244,11 +254,13 @@ For 32-bit we have the following conventions - kernel is built with
 .endm
 
 .macro RESTORE_CR3 save_reg:req
+	JUMP_IF_KAISER_OFF	.Ldone_\@
 	/*
 	 * The CR3 write could be avoided when not changing its value,
 	 * but would require a CR3 read *and* a scratch register.
 	 */
 	movq	\save_reg, %cr3
+.Ldone_\@:
 .endm
 
 #else /* CONFIG_KAISER=n: */
diff --git a/arch/x86/mm/kaiser.c b/arch/x86/mm/kaiser.c
index 06966b111280..1eb27b410556 100644
--- a/arch/x86/mm/kaiser.c
+++ b/arch/x86/mm/kaiser.c
@@ -43,6 +43,9 @@
 
 #define KAISER_WALK_ATOMIC  0x1
 
+__aligned(PAGE_SIZE)
+unsigned long kaiser_asm_do_switch[PAGE_SIZE/sizeof(unsigned long)] = { 1 };
+
 /*
  * At runtime, the only things we map are some things for CPU
  * hotplug, and stacks for new processes.  No two CPUs will ever
@@ -395,6 +398,9 @@ void __init kaiser_init(void)
 
 	kaiser_init_all_pgds();
 
+	kaiser_add_user_map_early(&kaiser_asm_do_switch, PAGE_SIZE,
+				  __PAGE_KERNEL | _PAGE_GLOBAL);
+
 	for_each_possible_cpu(cpu) {
 		void *percpu_vaddr = __per_cpu_user_mapped_start +
 				     per_cpu_offset(cpu);
@@ -483,6 +489,56 @@ static ssize_t kaiser_enabled_read_file(struct file *file, char __user *user_buf
 	return simple_read_from_buffer(user_buf, count, ppos, buf, len);
 }
 
+enum poison {
+	KAISER_POISON,
+	KAISER_UNPOISON
+};
+void kaiser_poison_pgds(enum poison do_poison);
+
+void kaiser_do_disable(void)
+{
+	/* Make sure the kernel PGDs are usable by userspace: */
+	kaiser_poison_pgds(KAISER_UNPOISON);
+
+	/*
+	 * Make sure all the CPUs have the poison clear in their TLBs.
+	 * This also functions as a barrier to ensure that everyone
+	 * sees the unpoisoned PGDs.
+	 */
+	flush_tlb_all();
+
+	/* Tell the assembly code to stop switching CR3. */
+	kaiser_asm_do_switch[0] = 0;
+
+	/*
+	 * Make sure everybody does an interrupt.  This means that
+	 * they have gone through a SWITCH_TO_KERNEL_CR3 and are no
+	 * longer running on the userspace CR3.  If we did not do
+	 * this, we might have CPUs running on the shadow page tables
+	 * that then enter the kernel and think they do *not* need to
+	 * switch.
+	 */
+	flush_tlb_all();
+}
+
+void kaiser_do_enable(void)
+{
+	/* Tell the assembly code to start switching CR3: */
+	kaiser_asm_do_switch[0] = 1;
+
+	/* Make sure everyone can see the kaiser_asm_do_switch update: */
+	synchronize_rcu();
+
+	/*
+	 * Now that userspace is no longer using the kernel copy of
+	 * the page tables, we can poison it:
+	 */
+	kaiser_poison_pgds(KAISER_POISON);
+
+	/* Make sure all the CPUs see the poison: */
+	flush_tlb_all();
+}
+
 static ssize_t kaiser_enabled_write_file(struct file *file,
 		 const char __user *user_buf, size_t count, loff_t *ppos)
 {
@@ -504,7 +560,17 @@ static ssize_t kaiser_enabled_write_file(struct file *file,
 	if (kaiser_enabled == enable)
 		return count;
 
+	/*
+	 * This tells the page table code to stop poisoning PGDs
+	 */
 	WRITE_ONCE(kaiser_enabled, enable);
+	synchronize_rcu();
+
+	if (enable)
+		kaiser_do_enable();
+	else
+		kaiser_do_disable();
+
 	return count;
 }
 
@@ -522,10 +588,6 @@ static int __init create_kaiser_enabled(void)
 }
 late_initcall(create_kaiser_enabled);
 
-enum poison {
-	KAISER_POISON,
-	KAISER_UNPOISON
-};
 void kaiser_poison_pgd_page(pgd_t *pgd_page, enum poison do_poison)
 {
 	int i = 0;
-- 
2.14.1

^ permalink raw reply related	[flat|nested] 82+ messages in thread

* [PATCH 43/43] x86/mm/kaiser: Add Kconfig
  2017-11-24  9:14 [PATCH 00/43] x86 entry-stack and Kaiser series, 2017/11/24 version Ingo Molnar
                   ` (41 preceding siblings ...)
  2017-11-24  9:14 ` [PATCH 42/43] x86/mm/kaiser: Allow KAISER to be enabled/disabled " Ingo Molnar
@ 2017-11-24  9:14 ` Ingo Molnar
  2017-11-24 13:55 ` [PATCH 00/43] x86 entry-stack and Kaiser series, 2017/11/24 version Ingo Molnar
  2017-11-24 15:23 ` Thomas Gleixner
  44 siblings, 0 replies; 82+ messages in thread
From: Ingo Molnar @ 2017-11-24  9:14 UTC (permalink / raw)
  To: linux-kernel
  Cc: Dave Hansen, Andy Lutomirski, Thomas Gleixner, H . Peter Anvin,
	Peter Zijlstra, Borislav Petkov, Linus Torvalds

From: Dave Hansen <dave.hansen@linux.intel.com>

PARAVIRT generally requires that the kernel not manage its own page
tables.  It also means that the hypervisor and kernel must agree
wholeheartedly about what format the page tables are in and what
they contain.  KAISER, unfortunately, changes the rules and they
can not be used together.

I've seen conflicting feedback from maintainers lately about whether
they want the Kconfig magic to go first or last in a patch series.
It's going last here because the partially-applied series leads to
kernels that can not boot in a bunch of cases.  I did a run through
the entire series with CONFIG_KAISER=y to look for build errors,
though.

Note from Hugh Dickins on why it depends on SMP:

	It is absurd that KAISER should depend on SMP, but
	apparently nobody has tried a UP build before: which
	breaks on implicit declaration of function
	'per_cpu_offset' in arch/x86/mm/kaiser.c.

	Now, you would expect that to be trivially fixed up; but
	looking at the System.map when that block is #ifdef'ed
	out of kaiser_init(), I see that in a UP build
	__per_cpu_user_mapped_end is precisely at
	__per_cpu_user_mapped_start, and the items carefully
	gathered into that section for user-mapping on SMP,
	dispersed elsewhere on UP.
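
With the whole series applied, a minimal config fragment that satisfies
the dependencies looks like this (a sketch; everything else as in the
base config):

	CONFIG_X86_64=y
	CONFIG_SMP=y
	# CONFIG_PARAVIRT is not set
	CONFIG_KAISER=y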

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Daniel Gruss <daniel.gruss@iaik.tugraz.at>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Kees Cook <keescook@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Michael Schwarz <michael.schwarz@iaik.tugraz.at>
Cc: Moritz Lipp <moritz.lipp@iaik.tugraz.at>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Richard Fellner <richard.fellner@student.tugraz.at>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-mm@kvack.org
Link: http://lkml.kernel.org/r/20171123003524.88C90659@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 security/Kconfig | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/security/Kconfig b/security/Kconfig
index e8e449444e65..99b530d0dd9e 100644
--- a/security/Kconfig
+++ b/security/Kconfig
@@ -54,6 +54,16 @@ config SECURITY_NETWORK
 	  implement socket and networking access controls.
 	  If you are unsure how to answer this question, answer N.
 
+config KAISER
+	bool "Remove the kernel mapping in user mode"
+	depends on X86_64 && SMP && !PARAVIRT
+	help
+	  This feature reduces the number of hardware side channels by
+	  ensuring that the majority of kernel addresses are not mapped
+	  into userspace.
+
+	  See Documentation/x86/kaiser.txt for more details.
+
 config SECURITY_INFINIBAND
 	bool "Infiniband Security Hooks"
 	depends on SECURITY && INFINIBAND
-- 
2.14.1

^ permalink raw reply related	[flat|nested] 82+ messages in thread

* Re: [PATCH 01/43] x86/decoder: Add new TEST instruction pattern
  2017-11-24  9:14 ` [PATCH 01/43] x86/decoder: Add new TEST instruction pattern Ingo Molnar
@ 2017-11-24 10:38   ` Borislav Petkov
  2017-12-02  7:39   ` Robert Elliott (Persistent Memory)
  1 sibling, 0 replies; 82+ messages in thread
From: Borislav Petkov @ 2017-11-24 10:38 UTC (permalink / raw)
  To: Ingo Molnar, H.J. Lu
  Cc: linux-kernel, Dave Hansen, Andy Lutomirski, Thomas Gleixner,
	H . Peter Anvin, Peter Zijlstra, Linus Torvalds

On Fri, Nov 24, 2017 at 10:14:06AM +0100, Ingo Molnar wrote:
> From: Masami Hiramatsu <mhiramat@kernel.org>
> 
> The kbuild test robot reported this build warning:
> 
>   Warning: arch/x86/tools/test_get_len found difference at <jump_table>:ffffffff8103dd2c
> 
>   Warning: ffffffff8103dd82: f6 09 d8 testb $0xd8,(%rcx)
>   Warning: objdump says 3 bytes, but insn_get_length() says 2
>   Warning: decoded and checked 1569014 instructions with 1 warnings
> 
> This sequence seems to be a new instruction not in the opcode map in the Intel SDM.
> 
> The instruction sequence is "F6 09 d8", means Group3(F6), MOD(00)REG(001)RM(001), and 0xd8.

So that's TEST Eb,Ib with ModRM.reg == 1b which is documented in the AMD APM but
not in the Intel SDM.

Maybe H.J. has some insights on why.

CCed and leaving in the rest for reference.

> Intel SDM vol2 A.4 Table A-6 said the table index in the group is "Encoding of Bits 5,4,3 of
> the ModR/M Byte (bits 2,1,0 in parenthesis)"
> 
> In that table, opcodes listed by the index REG bits as:
> 
>   000         001       010 011  100        101        110         111
>  TEST Ib/Iz,(undefined),NOT,NEG,MUL AL/rAX,IMUL AL/rAX,DIV AL/rAX,IDIV AL/rAX
> 
> So, it seems TEST Ib is assigned to 001.
> 
> Add the new pattern.
> 
> Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> Reported-by: kbuild test robot <fengguang.wu@intel.com>
> Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
> Cc: <stable@vger.kernel.org>
> Cc: H. Peter Anvin <hpa@zytor.com>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: linux-kernel@vger.kernel.org
> Signed-off-by: Ingo Molnar <mingo@kernel.org>
> ---
>  arch/x86/lib/x86-opcode-map.txt | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/arch/x86/lib/x86-opcode-map.txt b/arch/x86/lib/x86-opcode-map.txt
> index 12e377184ee4..c4d55919fac1 100644
> --- a/arch/x86/lib/x86-opcode-map.txt
> +++ b/arch/x86/lib/x86-opcode-map.txt
> @@ -896,7 +896,7 @@ EndTable
>  
>  GrpTable: Grp3_1
>  0: TEST Eb,Ib
> -1:
> +1: TEST Eb,Ib
>  2: NOT Eb
>  3: NEG Eb
>  4: MUL AL,Eb
> -- 
> 2.14.1
> 

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 05/43] x86/fixmap: Generalize the GDT fixmap mechanism
  2017-11-24  9:14 ` [PATCH 05/43] x86/fixmap: Generalize the GDT fixmap mechanism Ingo Molnar
@ 2017-11-24 11:00   ` Borislav Petkov
  0 siblings, 0 replies; 82+ messages in thread
From: Borislav Petkov @ 2017-11-24 11:00 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Dave Hansen, Andy Lutomirski, Thomas Gleixner,
	H . Peter Anvin, Peter Zijlstra, Linus Torvalds

On Fri, Nov 24, 2017 at 10:14:10AM +0100, Ingo Molnar wrote:
> From: Andy Lutomirski <luto@kernel.org>
> 
> Currently, the GDT is an ad-hoc array of pages, one per CPU, in the
> fixmap.  Generalize it to be an array of a new struct cpu_entry_area
> so that we can cleanly add new things to it.
> 
> Signed-off-by: Andy Lutomirski <luto@kernel.org>
> Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
> Cc: Borislav Petkov <bp@alien8.de>
> Cc: Borislav Petkov <bpetkov@suse.de>
> Cc: Brian Gerst <brgerst@gmail.com>
> Cc: Dave Hansen <dave.hansen@intel.com>
> Cc: Denys Vlasenko <dvlasenk@redhat.com>
> Cc: H. Peter Anvin <hpa@zytor.com>
> Cc: Josh Poimboeuf <jpoimboe@redhat.com>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Link: http://lkml.kernel.org/r/22571d77ba1f3c714df9fa37db9a58218bc17597.1511497875.git.luto@kernel.org
> Signed-off-by: Ingo Molnar <mingo@kernel.org>
> ---
>  arch/x86/include/asm/desc.h   |  9 +--------
>  arch/x86/include/asm/fixmap.h | 34 ++++++++++++++++++++++++++++++++--
>  arch/x86/kernel/cpu/common.c  | 14 +++++++-------
>  arch/x86/xen/mmu_pv.c         |  2 +-
>  4 files changed, 41 insertions(+), 18 deletions(-)

Reviewed-by: Borislav Petkov <bp@suse.de>

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 13/43] x86/entry/64: Use a percpu trampoline stack for IDT entries
  2017-11-24  9:14 ` [PATCH 13/43] x86/entry/64: Use a percpu trampoline stack for IDT entries Ingo Molnar
@ 2017-11-24 11:27   ` Thomas Gleixner
  0 siblings, 0 replies; 82+ messages in thread
From: Thomas Gleixner @ 2017-11-24 11:27 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Dave Hansen, Andy Lutomirski, H . Peter Anvin,
	Peter Zijlstra, Borislav Petkov, Linus Torvalds

On Fri, 24 Nov 2017, Ingo Molnar wrote:
> From: Andy Lutomirski <luto@kernel.org>
> @@ -563,6 +563,13 @@ END(irq_entries_start)
>  /* 0(%rsp): ~(interrupt number) */
>  	.macro interrupt func
>  	cld
> +
> +	testb	$3, CS-ORIG_RAX(%rsp)
> +	jz	1f
> +	SWAPGS
> +	call	switch_to_thread_stack
> +1:

Yes, that's what I thought it should look like.

>  	ALLOC_PT_GPREGS_ON_STACK
>  	SAVE_C_REGS
>  	SAVE_EXTRA_REGS
> @@ -572,12 +579,8 @@ END(irq_entries_start)
>  	jz	1f

If you change that to 2f and adjust the label down there it gets even
simpler to read. I know it works, but I still find it disturbing.

>  	/*
> -	 * IRQ from user mode.  Switch to kernel gsbase and inform context
> -	 * tracking that we're in kernel mode.
> -	 */
> -	SWAPGS
> -
> -	/*
> +	 * IRQ from user mode.
> +	 *
>  	 * We need to tell lockdep that IRQs are off.  We can't do this until
>  	 * we fix gsbase, and we should do it before enter_from_user_mode
>  	 * (which can take locks).  Since TRACE_IRQS_OFF is idempotent,
> @@ -831,6 +834,32 @@ apicinterrupt IRQ_WORK_VECTOR			irq_work_interrupt		smp_irq_work_interrupt
>   */
>  #define CPU_TSS_IST(x) PER_CPU_VAR(cpu_tss) + (TSS_ist + ((x) - 1) * 8)
>  
> +/*
> + * Switch to the thread stack.  This is called with the IRET frame and
> + * orig_ax on the stack.  (That is, RDI..R12 are not on the stack and
> + * space has not been allocated for them.)
> + */
> +ENTRY(switch_to_thread_stack)
> +	UNWIND_HINT_FUNC
> +
> +	pushq	%rdi
> +	movq	%rsp, %rdi
> +	movq	PER_CPU_VAR(cpu_current_top_of_stack), %rsp
> +	UNWIND_HINT sp_offset=16 sp_reg=ORC_REG_DI
> +
> +	pushq	7*8(%rdi)		/* regs->ss */
> +	pushq	6*8(%rdi)		/* regs->rsp */
> +	pushq	5*8(%rdi)		/* regs->eflags */
> +	pushq	4*8(%rdi)		/* regs->cs */
> +	pushq	3*8(%rdi)		/* regs->ip */
> +	pushq	2*8(%rdi)		/* regs->orig_ax */
> +	pushq	8(%rdi)			/* return address */
> +	UNWIND_HINT_FUNC
> +
> +	movq	(%rdi), %rdi
> +	ret
> +END(switch_to_thread_stack)

Much nicer.

>  .macro idtentry sym do_sym has_error_code:req paranoid=0 shift_ist=-1
>  ENTRY(\sym)
>  	UNWIND_HINT_IRET_REGS offset=\has_error_code*8
> @@ -848,11 +877,12 @@ ENTRY(\sym)
>  
>  	ALLOC_PT_GPREGS_ON_STACK
>  
> -	.if \paranoid
> -	.if \paranoid == 1
> +	.if \paranoid < 2
>  	testb	$3, CS(%rsp)			/* If coming from userspace, switch stacks */
> -	jnz	1f
> +	jnz	.Lfrom_usermode_switch_stack_\@
>  	.endif
> +
> +	.if \paranoid
>  	call	paranoid_entry
>  	.else
>  	call	error_entry
> @@ -894,20 +924,15 @@ ENTRY(\sym)
>  	jmp	error_exit
>  	.endif
>  
> -	.if \paranoid == 1
> +	.if \paranoid < 2
>  	/*
> -	 * Paranoid entry from userspace.  Switch stacks and treat it
> +	 * Entry from userspace.  Switch stacks and treat it
>  	 * as a normal entry.  This means that paranoid handlers
>  	 * run in real process context if user_mode(regs).
>  	 */
> -1:
> +.Lfrom_usermode_switch_stack_\@:
>  	call	error_entry
>  
> -
> -	movq	%rsp, %rdi			/* pt_regs pointer */
> -	call	sync_regs
> -	movq	%rax, %rsp			/* switch stack */
> -
>  	movq	%rsp, %rdi			/* pt_regs pointer */
>  
>  	.if \has_error_code
> @@ -1170,6 +1195,14 @@ ENTRY(error_entry)
>  	SWAPGS
>  
>  .Lerror_entry_from_usermode_after_swapgs:
> +	/* Put us onto the real thread stack. */
> +	popq	%r12				/* save return addr in %12 */
> +	movq	%rsp, %rdi			/* arg0 = pt_regs pointer */
> +	call	sync_regs
> +	movq	%rax, %rsp			/* switch stack */
> +	ENCODE_FRAME_POINTER
> +	pushq	%r12
> +

Reviewed-by: Thomas Gleixner <tglx@linutronix.de>

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 09/43] x86/entry: Move SYSENTER_stack to the beginning of struct tss_struct
  2017-11-24  9:14 ` [PATCH 09/43] x86/entry: Move SYSENTER_stack to the beginning of struct tss_struct Ingo Molnar
@ 2017-11-24 11:44   ` Borislav Petkov
  0 siblings, 0 replies; 82+ messages in thread
From: Borislav Petkov @ 2017-11-24 11:44 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Dave Hansen, Andy Lutomirski, Thomas Gleixner,
	H . Peter Anvin, Peter Zijlstra, Linus Torvalds

On Fri, Nov 24, 2017 at 10:14:14AM +0100, Ingo Molnar wrote:
> From: Andy Lutomirski <luto@kernel.org>
> 
> SYSENTER_stack should have reliable overflow detection, which
> means that it needs to be at the bottom of a page, not the top.
> Move it to the beginning of struct tss_struct and page-align it.
> 
> Also add an assertion to make sure that the fixed hardware TSS
> doesn't cross a page boundary.
> 
> Signed-off-by: Andy Lutomirski <luto@kernel.org>
> Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
> Cc: Borislav Petkov <bp@alien8.de>
> Cc: Borislav Petkov <bpetkov@suse.de>
> Cc: Brian Gerst <brgerst@gmail.com>
> Cc: Dave Hansen <dave.hansen@intel.com>
> Cc: Denys Vlasenko <dvlasenk@redhat.com>
> Cc: H. Peter Anvin <hpa@zytor.com>
> Cc: Josh Poimboeuf <jpoimboe@redhat.com>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Link: http://lkml.kernel.org/r/8de9901e7c3a6aa8fac95b37b9c7b96f1900f11a.1511497875.git.luto@kernel.org
> Signed-off-by: Ingo Molnar <mingo@kernel.org>
> ---
>  arch/x86/include/asm/processor.h | 21 ++++++++++++---------
>  arch/x86/kernel/cpu/common.c     | 21 +++++++++++++++++++++
>  2 files changed, 33 insertions(+), 9 deletions(-)

Reviewed-by: Borislav Petkov <bp@suse.de>

Thanks to tglx for clarifying the whole top and bottom meaning here for
me - I was confused.
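
As a layout sketch of the "bottom of the page" idea (struct and field
names assumed here, not the actual kernel definitions): the stack sits
at the lowest addresses of its page, so overflowing it walks off the
start of the page and faults on the guard page instead of silently
corrupting whatever sits below:

	#include <stddef.h>

	#define PAGE_SIZE 4096

	struct tss_struct_sketch {
		/* lowest addresses first: an overflow hits the guard page */
		unsigned long SYSENTER_stack[64];
		/* the fixed hardware TSS follows */
		char hw_tss[104];
	} __attribute__((aligned(PAGE_SIZE)));

	/* the assertion from the changelog: the hardware TSS must not
	   cross a page boundary */
	_Static_assert(offsetof(struct tss_struct_sketch, hw_tss) / PAGE_SIZE ==
		       (offsetof(struct tss_struct_sketch, hw_tss) + 104 - 1) / PAGE_SIZE,
		       "hardware TSS crosses a page boundary");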

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 22/43] x86/mm/kaiser: Prepare assembly for entry/exit CR3 switching
  2017-11-24  9:14 ` [PATCH 22/43] x86/mm/kaiser: Prepare assembly for entry/exit CR3 switching Ingo Molnar
@ 2017-11-24 12:05   ` Peter Zijlstra
  2017-11-24 12:17     ` Ingo Molnar
  0 siblings, 1 reply; 82+ messages in thread
From: Peter Zijlstra @ 2017-11-24 12:05 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Dave Hansen, Andy Lutomirski, Thomas Gleixner,
	H . Peter Anvin, Borislav Petkov, Linus Torvalds

On Fri, Nov 24, 2017 at 10:14:27AM +0100, Ingo Molnar wrote:
> Interactions with SWAPGS: previous versions of the KAISER code
> relied on having per-cpu scratch space to save/restore a register
> that can be used for the CR3 MOV.  The %GS register is used to
> index into our per-cpu space, so SWAPGS *had* to be done before
> the CR3 switch.  That scratch space is gone now, but the semantic
> that SWAPGS must be done before the CR3 MOV is retained.  This is
> good to keep because it is not that hard to do and it allows us
> to do things like add per-cpu debugging information to help us
> figure out what goes wrong sometimes.

> +.macro SAVE_AND_SWITCH_TO_KERNEL_CR3 scratch_reg:req save_reg:req
> +	movq	%cr3, %r\scratch_reg
> +	movq	%r\scratch_reg, \save_reg
> +	/*
> +	 * Is the switch bit zero?  This means the address is
> +	 * up in real KAISER patches in a moment.
> +	 */
> +	testq	$(KAISER_SWITCH_MASK), %r\scratch_reg
> +	jz	.Ldone_\@
> +
> +	ADJUST_KERNEL_CR3 %r\scratch_reg
> +	movq	%r\scratch_reg, %cr3
> +
> +.Ldone_\@:
> +.endm

> @@ -1333,6 +1362,7 @@ ENTRY(error_entry)
>  	 * gsbase and proceed.  We'll fix up the exception and land in
>  	 * .Lgs_change's error handler with kernel gsbase.
>  	 */
> +	SWITCH_TO_KERNEL_CR3 scratch_reg=%rax
>  	SWAPGS
>  	jmp .Lerror_entry_done
>  

> @@ -1343,9 +1373,10 @@ ENTRY(error_entry)
>  
>  .Lerror_bad_iret:
>  	/*
> -	 * We came from an IRET to user mode, so we have user gsbase.
> -	 * Switch to kernel gsbase:
> +	 * We came from an IRET to user mode, so we have user
> +	 * gsbase and CR3.  Switch to kernel gsbase and CR3:
>  	 */
> +	SWITCH_TO_KERNEL_CR3 scratch_reg=%rax
>  	SWAPGS
>  
>  	/*

The Changelog states SWAPGS must be done before, yet the code does
after.

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 25/43] x86/mm/kaiser: Unmap kernel from userspace page tables (core patch)
  2017-11-24  9:14 ` [PATCH 25/43] x86/mm/kaiser: Unmap kernel from userspace page tables (core patch) Ingo Molnar
@ 2017-11-24 12:13   ` Peter Zijlstra
  2017-11-24 13:46     ` Ingo Molnar
  2017-11-24 12:16   ` Peter Zijlstra
  2017-11-24 13:30   ` Peter Zijlstra
  2 siblings, 1 reply; 82+ messages in thread
From: Peter Zijlstra @ 2017-11-24 12:13 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Dave Hansen, Andy Lutomirski, Thomas Gleixner,
	H . Peter Anvin, Borislav Petkov, Linus Torvalds

On Fri, Nov 24, 2017 at 10:14:30AM +0100, Ingo Molnar wrote:
> Note: The original KAISER authors signed-off on their patch.  Some of
> their code has been broken out into other patches in this series, but
> their SoB was only retained here.

This is not in fact the case anymore..

> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: Andy Lutomirski <luto@kernel.org>
> Cc: Borislav Petkov <bp@alien8.de>
> Cc: Brian Gerst <brgerst@gmail.com>
> Cc: Daniel Gruss <daniel.gruss@iaik.tugraz.at>
> Cc: Denys Vlasenko <dvlasenk@redhat.com>
> Cc: H. Peter Anvin <hpa@zytor.com>
> Cc: Hugh Dickins <hughd@google.com>
> Cc: Josh Poimboeuf <jpoimboe@redhat.com>
> Cc: Kees Cook <keescook@google.com>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: Michael Schwarz <michael.schwarz@iaik.tugraz.at>
> Cc: Moritz Lipp <moritz.lipp@iaik.tugraz.at>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: Richard Fellner <richard.fellner@student.tugraz.at>
> Cc: Thomas Gleixner <tglx@linutronix.de>
> Cc: linux-mm@kvack.org
> Link: http://lkml.kernel.org/r/20171123003447.1DB395E3@viggo.jf.intel.com
> Signed-off-by: Ingo Molnar <mingo@kernel.org>

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 25/43] x86/mm/kaiser: Unmap kernel from userspace page tables (core patch)
  2017-11-24  9:14 ` [PATCH 25/43] x86/mm/kaiser: Unmap kernel from userspace page tables (core patch) Ingo Molnar
  2017-11-24 12:13   ` Peter Zijlstra
@ 2017-11-24 12:16   ` Peter Zijlstra
  2017-11-24 16:33     ` Dave Hansen
  2017-11-24 13:30   ` Peter Zijlstra
  2 siblings, 1 reply; 82+ messages in thread
From: Peter Zijlstra @ 2017-11-24 12:16 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Dave Hansen, Andy Lutomirski, Thomas Gleixner,
	H . Peter Anvin, Borislav Petkov, Linus Torvalds

On Fri, Nov 24, 2017 at 10:14:30AM +0100, Ingo Molnar wrote:
> +The minimalistic kernel portion of the user page tables try to
> +map only what is needed to enter/exit the kernel such as the
> +entry/exit functions themselves and the interrupt descriptor
> +table (IDT).  

                There are a few unnecessary things that get mapped
> +such as the first C function when entering an interrupt (see
> +comments in kaiser.c).

If I understood Andy's patches correctly, this should no longer be
required. Is this text still correct?

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 22/43] x86/mm/kaiser: Prepare assembly for entry/exit CR3 switching
  2017-11-24 12:05   ` Peter Zijlstra
@ 2017-11-24 12:17     ` Ingo Molnar
  2017-11-24 12:45       ` Peter Zijlstra
  0 siblings, 1 reply; 82+ messages in thread
From: Ingo Molnar @ 2017-11-24 12:17 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Dave Hansen, Andy Lutomirski, Thomas Gleixner,
	H . Peter Anvin, Borislav Petkov, Linus Torvalds


* Peter Zijlstra <peterz@infradead.org> wrote:

> On Fri, Nov 24, 2017 at 10:14:27AM +0100, Ingo Molnar wrote:
> > Interactions with SWAPGS: previous versions of the KAISER code
> > relied on having per-cpu scratch space to save/restore a register
> > that can be used for the CR3 MOV.  The %GS register is used to
> > index into our per-cpu space, so SWAPGS *had* to be done before
> > the CR3 switch.  That scratch space is gone now, but the semantic
> > that SWAPGS must be done before the CR3 MOV is retained.  This is
> > good to keep because it is not that hard to do and it allows us
> > to do things like add per-cpu debugging information to help us
> > figure out what goes wrong sometimes.
> 
> > +.macro SAVE_AND_SWITCH_TO_KERNEL_CR3 scratch_reg:req save_reg:req
> > +	movq	%cr3, %r\scratch_reg
> > +	movq	%r\scratch_reg, \save_reg
> > +	/*
> > +	 * Is the switch bit zero?  This means the address is
> > +	 * up in real KAISER patches in a moment.
> > +	 */
> > +	testq	$(KAISER_SWITCH_MASK), %r\scratch_reg
> > +	jz	.Ldone_\@
> > +
> > +	ADJUST_KERNEL_CR3 %r\scratch_reg
> > +	movq	%r\scratch_reg, %cr3
> > +
> > +.Ldone_\@:
> > +.endm
> 
> > @@ -1333,6 +1362,7 @@ ENTRY(error_entry)
> >  	 * gsbase and proceed.  We'll fix up the exception and land in
> >  	 * .Lgs_change's error handler with kernel gsbase.
> >  	 */
> > +	SWITCH_TO_KERNEL_CR3 scratch_reg=%rax
> >  	SWAPGS
> >  	jmp .Lerror_entry_done
> >  
> 
> > @@ -1343,9 +1373,10 @@ ENTRY(error_entry)
> >  
> >  .Lerror_bad_iret:
> >  	/*
> > -	 * We came from an IRET to user mode, so we have user gsbase.
> > -	 * Switch to kernel gsbase:
> > +	 * We came from an IRET to user mode, so we have user
> > +	 * gsbase and CR3.  Switch to kernel gsbase and CR3:
> >  	 */
> > +	SWITCH_TO_KERNEL_CR3 scratch_reg=%rax
> >  	SWAPGS
> >  
> >  	/*
> 
> The Changelog states SWAPGS must be done before, yet the code does
> after.

Yes, so this is the SWAPGS that is done before we go back to user-space.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 22/43] x86/mm/kaiser: Prepare assembly for entry/exit CR3 switching
  2017-11-24 12:17     ` Ingo Molnar
@ 2017-11-24 12:45       ` Peter Zijlstra
  2017-11-24 13:04         ` Thomas Gleixner
  0 siblings, 1 reply; 82+ messages in thread
From: Peter Zijlstra @ 2017-11-24 12:45 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Dave Hansen, Andy Lutomirski, Thomas Gleixner,
	H . Peter Anvin, Borislav Petkov, Linus Torvalds

On Fri, Nov 24, 2017 at 01:17:06PM +0100, Ingo Molnar wrote:
> 
> * Peter Zijlstra <peterz@infradead.org> wrote:
> 
> > On Fri, Nov 24, 2017 at 10:14:27AM +0100, Ingo Molnar wrote:
> > > @@ -1343,9 +1373,10 @@ ENTRY(error_entry)
> > >  
> > >  .Lerror_bad_iret:
> > >  	/*
> > > +	 * We came from an IRET to user mode, so we have user
> > > +	 * gsbase and CR3.  Switch to kernel gsbase and CR3:
> > >  	 */
> > > +	SWITCH_TO_KERNEL_CR3 scratch_reg=%rax
> > >  	SWAPGS
> > >  
> > >  	/*
> > 
> > The Changelog states SWAPGS must be done before, yet the code does
> > after.
> 
> Yes, so this is the SWAPGS that is done before we go back to user-space.

The comment there clearly states we have user gs and we need to switch
to kernel gs. The Changelog states we should switch gs before cr3.

So either the comment or the code needs fixing.

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 22/43] x86/mm/kaiser: Prepare assembly for entry/exit CR3 switching
  2017-11-24 12:45       ` Peter Zijlstra
@ 2017-11-24 13:04         ` Thomas Gleixner
  0 siblings, 0 replies; 82+ messages in thread
From: Thomas Gleixner @ 2017-11-24 13:04 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Dave Hansen, Andy Lutomirski,
	H . Peter Anvin, Borislav Petkov, Linus Torvalds

On Fri, 24 Nov 2017, Peter Zijlstra wrote:

> On Fri, Nov 24, 2017 at 01:17:06PM +0100, Ingo Molnar wrote:
> > 
> > * Peter Zijlstra <peterz@infradead.org> wrote:
> > 
> > > On Fri, Nov 24, 2017 at 10:14:27AM +0100, Ingo Molnar wrote:
> > > > @@ -1343,9 +1373,10 @@ ENTRY(error_entry)
> > > >  
> > > >  .Lerror_bad_iret:
> > > >  	/*
> > > > +	 * We came from an IRET to user mode, so we have user
> > > > +	 * gsbase and CR3.  Switch to kernel gsbase and CR3:
> > > >  	 */
> > > > +	SWITCH_TO_KERNEL_CR3 scratch_reg=%rax
> > > >  	SWAPGS
> > > >  
> > > >  	/*
> > > 
> > > The Changelog states SWAPGS must be done before, yet the code does
> > > after.
> > 
> > Yes, so this is the SWAPGS that is done before we go back to user-space.
> 
> The comment there clearly states we have user gs and we need to switch
> to kernel gs. The Changelog states we should switch gs before cr3.
> 
> So either the comment or the code needs fixing.

The code :)

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 25/43] x86/mm/kaiser: Unmap kernel from userspace page tables (core patch)
  2017-11-24  9:14 ` [PATCH 25/43] x86/mm/kaiser: Unmap kernel from userspace page tables (core patch) Ingo Molnar
  2017-11-24 12:13   ` Peter Zijlstra
  2017-11-24 12:16   ` Peter Zijlstra
@ 2017-11-24 13:30   ` Peter Zijlstra
  2017-11-26 15:15     ` Ingo Molnar
  2 siblings, 1 reply; 82+ messages in thread
From: Peter Zijlstra @ 2017-11-24 13:30 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Dave Hansen, Andy Lutomirski, Thomas Gleixner,
	H . Peter Anvin, Borislav Petkov, Linus Torvalds

On Fri, Nov 24, 2017 at 10:14:30AM +0100, Ingo Molnar wrote:
> +static pte_t *kaiser_shadow_pagetable_walk(unsigned long address,
> +					   unsigned long flags)
> +{
> +	pte_t *pte;
> +	pmd_t *pmd;
> +	pud_t *pud;
> +	p4d_t *p4d;
> +	pgd_t *pgd = kernel_to_shadow_pgdp(pgd_offset_k(address));
> +	gfp_t gfp = (GFP_KERNEL | __GFP_NOTRACK | __GFP_ZERO);
> +
> +	if (flags & KAISER_WALK_ATOMIC) {
> +		gfp &= ~GFP_KERNEL;
> +		gfp |= __GFP_HIGH | __GFP_ATOMIC;
> +	}
> +
> +	if (address < PAGE_OFFSET) {
> +		WARN_ONCE(1, "attempt to walk user address\n");
> +		return NULL;
> +	}
> +
> +	if (pgd_none(*pgd)) {
> +		WARN_ONCE(1, "All shadow pgds should have been populated\n");
> +		return NULL;
> +	}
> +	BUILD_BUG_ON(pgd_large(*pgd) != 0);
> +
> +	p4d = p4d_offset(pgd, address);
> +	BUILD_BUG_ON(p4d_large(*p4d) != 0);
> +	if (p4d_none(*p4d)) {
> +		unsigned long new_pud_page = __get_free_page(gfp);
> +		if (!new_pud_page)
> +			return NULL;
> +
> +		spin_lock(&shadow_table_allocation_lock);
> +		if (p4d_none(*p4d))
> +			set_p4d(p4d, __p4d(_KERNPG_TABLE | __pa(new_pud_page)));
> +		else
> +			free_page(new_pud_page);
> +		spin_unlock(&shadow_table_allocation_lock);

So mm/memory.c has two patterns here.. I prefer the other one:

		spin_lock(&shadow_table_allocation_lock);
		if (p4d_none(*p4d)) {
			set_p4d(p4d, __p4d(_KERNEL_TABLE | __pa(new_pud_page)));
			new_pud_page = NULL;
		}
		spin_unlock(&shadow_table_allocation_lock);
		if (new_pud_page)
			free_page(new_pud_page);

> +	}

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 28/43] x86/mm/kaiser: Map CPU entry area
  2017-11-24  9:14 ` [PATCH 28/43] x86/mm/kaiser: Map CPU entry area Ingo Molnar
@ 2017-11-24 13:43   ` Peter Zijlstra
  0 siblings, 0 replies; 82+ messages in thread
From: Peter Zijlstra @ 2017-11-24 13:43 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Dave Hansen, Andy Lutomirski, Thomas Gleixner,
	H . Peter Anvin, Borislav Petkov, Linus Torvalds

On Fri, Nov 24, 2017 at 10:14:33AM +0100, Ingo Molnar wrote:
> + 	/* CPU 0's mapping is done in kaiser_init() */
> +	if (cpu)
> +		kaiser_add_mapping_cpu_entry(cpu);

This hard assumes CPU0 is the boot CPU. I know we dropped Voyager
support a while back, but can/should we hard rely on that?

We do have __boot_cpu_id / get_boot_cpu_id() for these here things.
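
Something like this (a sketch using the get_boot_cpu_id() helper
mentioned above, not a tested change) would avoid the hard-coded
assumption:

	/* map every CPU's entry area except the boot CPU's, which
	   kaiser_init() already did */
	if (cpu != get_boot_cpu_id())
		kaiser_add_mapping_cpu_entry(cpu);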

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 14/43] x86/entry/64: Return to userspace from the trampoline stack
  2017-11-24  9:14 ` [PATCH 14/43] x86/entry/64: Return to userspace from the trampoline stack Ingo Molnar
@ 2017-11-24 13:46   ` Thomas Gleixner
  0 siblings, 0 replies; 82+ messages in thread
From: Thomas Gleixner @ 2017-11-24 13:46 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Dave Hansen, Andy Lutomirski, H . Peter Anvin,
	Peter Zijlstra, Borislav Petkov, Linus Torvalds

On Fri, 24 Nov 2017, Ingo Molnar wrote:

> From: Andy Lutomirski <luto@kernel.org>
> 
> By itself, this is useless.  It gives us the ability to run some final
> code before exit that cannot run on the kernel stack.  This could
> include a CR3 switch a la KAISER or some kernel stack erasing, for
> example.  (Or even weird things like *changing* which kernel stack
> gets used as an ASLR-strengthening mechanism.)
> 
> The SYSRET32 path is not covered yet.  It could be in the future or
> we could just ignore it and force the slow path if needed.
> 
> Signed-off-by: Andy Lutomirski <luto@kernel.org>

Reviewed-by: Thomas Gleixner <tglx@linutronix.de>

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 25/43] x86/mm/kaiser: Unmap kernel from userspace page tables (core patch)
  2017-11-24 12:13   ` Peter Zijlstra
@ 2017-11-24 13:46     ` Ingo Molnar
  0 siblings, 0 replies; 82+ messages in thread
From: Ingo Molnar @ 2017-11-24 13:46 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Dave Hansen, Andy Lutomirski, Thomas Gleixner,
	H . Peter Anvin, Borislav Petkov, Linus Torvalds


* Peter Zijlstra <peterz@infradead.org> wrote:

> On Fri, Nov 24, 2017 at 10:14:30AM +0100, Ingo Molnar wrote:
> > Note: The original KAISER authors signed-off on their patch.  Some of
> > their code has been broken out into other patches in this series, but
> > their SoB was only retained here.
> 
> This is not in fact the case anymore..

Indeed, I have updated the changelog to say this instead:

    Note: The original KAISER authors signed-off on their patch, which
    SoB we retained in arch/x86/mm/kaiser.c.
    
Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 30/43] x86/mm/kaiser: Map espfix structures
  2017-11-24  9:14 ` [PATCH 30/43] x86/mm/kaiser: Map espfix structures Ingo Molnar
@ 2017-11-24 13:47   ` Peter Zijlstra
  2017-11-24 16:17     ` Andy Lutomirski
  0 siblings, 1 reply; 82+ messages in thread
From: Peter Zijlstra @ 2017-11-24 13:47 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Dave Hansen, Andy Lutomirski, Thomas Gleixner,
	H . Peter Anvin, Borislav Petkov, Linus Torvalds

On Fri, Nov 24, 2017 at 10:14:35AM +0100, Ingo Molnar wrote:
> From: Dave Hansen <dave.hansen@linux.intel.com>
> 
> There is some rather arcane code to help when an IRET returns
> to 16-bit segments.  It is referred to as the "espfix" code.
> This consists of a few per-cpu variables:
> 
> 	espfix_stack: tells us where the stack is allocated
> 		      (the bottom)
> 	espfix_waddr: tells us to where %rsp may be pointed
> 		      (the top)
> 
> These are in addition to the stack itself.  All three things must
> be mapped for the espfix code to function.
> 
> Note: the espfix code runs with a kernel GSBASE, but user
> (shadow) page tables.  A switch to the kernel page tables could
> be performed instead of mapping these structures, but mapping
> them is simpler and less likely to break the assembly.  To switch
> over to the kernel copy, additional temporary storage would be
> required which is in short supply in this context.

With Andy's patches that should actually be doable, no?
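
For reference, the %rsp reconstruction that the espfix code performs
boils down to this C sketch (variable names taken from the changelog;
espfix_stack has bits 31:16 clear, so the OR just splices the two
halves together):

	static unsigned long espfix_user_rsp(unsigned long rsp,
					     unsigned long espfix_stack)
	{
		/* keep bits 31:16 of the old RSP, take the rest from
		 * the per-cpu espfix base
		 */
		return (rsp & 0xffff0000UL) | espfix_stack;
	}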

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 15/43] x86/entry/64: Create a percpu SYSCALL entry trampoline
  2017-11-24  9:14 ` [PATCH 15/43] x86/entry/64: Create a percpu SYSCALL entry trampoline Ingo Molnar
@ 2017-11-24 13:52   ` Thomas Gleixner
  0 siblings, 0 replies; 82+ messages in thread
From: Thomas Gleixner @ 2017-11-24 13:52 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Dave Hansen, Andy Lutomirski, H . Peter Anvin,
	Peter Zijlstra, Borislav Petkov, Linus Torvalds

On Fri, 24 Nov 2017, Ingo Molnar wrote:

> From: Andy Lutomirski <luto@kernel.org>
> 
> Handling SYSCALL is tricky: the SYSCALL handler is entered with every
> single register (except FLAGS), including RSP, live.  It somehow needs
> to set RSP to point to a valid stack, which means it needs to save the
> user RSP somewhere and find its own stack pointer.  The canonical way
> to do this is with SWAPGS, which lets us access percpu data using the
> %gs prefix.
> 
> With KAISER-like pagetable switching, this is problematic.  Without a
> scratch register, switching CR3 is impossible, so %gs-based percpu
> memory would need to be mapped in the user pagetables.  Doing that
> without information leaks is difficult or impossible.
> 
> Instead, use a different sneaky trick.  Map a copy of the first part
> of the SYSCALL asm at a different address for each CPU.  Now RIP
> varies depending on the CPU, so we can use RIP-relative memory access
> to access percpu memory.  By putting the relevant information (one
> scratch slot and the stack address) at a constant offset relative to
> RIP, we can make SYSCALL work without relying on %gs.

Smart!

> A nice thing about this approach is that we can easily switch it on
> and off if we want pagetable switching to be configurable.
> 
> The compat variant of SYSCALL doesn't have this problem in the first
> place -- there are plenty of scratch registers, since we don't care
> about preserving r8-r15.  This patch therefore doesn't touch SYSCALL32
> at all.
> 
> XXX: Whenever we settle how KAISER gets turned on and off, we should do
> the same to this.
> 
> Signed-off-by: Andy Lutomirski <luto@kernel.org>

Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
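
A rough picture of the layout trick (names and sizes assumed, not the
actual implementation):

	/*
	 * One copy of this page is mapped per CPU, each at a different
	 * virtual address.  The scratch slot and the stack pointer sit
	 * at a fixed offset from the copied code, so the stub can reach
	 * them RIP-relatively, without touching %gs.
	 */
	struct entry_trampoline_sketch {
		char code[4096 - 2 * sizeof(unsigned long)];	/* copied SYSCALL stub */
		unsigned long scratch;		/* save the user RSP here  */
		unsigned long top_of_stack;	/* then load this into RSP */
	};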

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 16/43] x86/irq: Remove an old outdated comment about context tracking races
  2017-11-24  9:14 ` [PATCH 16/43] x86/irq: Remove an old outdated comment about context tracking races Ingo Molnar
@ 2017-11-24 13:53   ` Thomas Gleixner
  0 siblings, 0 replies; 82+ messages in thread
From: Thomas Gleixner @ 2017-11-24 13:53 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Dave Hansen, Andy Lutomirski, H . Peter Anvin,
	Peter Zijlstra, Borislav Petkov, Linus Torvalds

On Fri, 24 Nov 2017, Ingo Molnar wrote:

> From: Andy Lutomirski <luto@kernel.org>
> 
> That race has been fixed and code cleaned up for a while now.
> 
> Signed-off-by: Andy Lutomirski <luto@kernel.org>

Reviewed-by: Thomas Gleixner <tglx@linutronix.de>

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 00/43] x86 entry-stack and Kaiser series, 2017/11/24 version
  2017-11-24  9:14 [PATCH 00/43] x86 entry-stack and Kaiser series, 2017/11/24 version Ingo Molnar
                   ` (42 preceding siblings ...)
  2017-11-24  9:14 ` [PATCH 43/43] x86/mm/kaiser: Add Kconfig Ingo Molnar
@ 2017-11-24 13:55 ` Ingo Molnar
  2017-11-24 15:23 ` Thomas Gleixner
  44 siblings, 0 replies; 82+ messages in thread
From: Ingo Molnar @ 2017-11-24 13:55 UTC (permalink / raw)
  To: linux-kernel
  Cc: Dave Hansen, Andy Lutomirski, Thomas Gleixner, H . Peter Anvin,
	Peter Zijlstra, Borislav Petkov, Linus Torvalds


* Ingo Molnar <mingo@kernel.org> wrote:

> This is a linear series of patches of the latest entry-stack plus Kaiser
> bits from Andy Lutomirski (v3 series from today) and Dave Hansen
> (kaiser-414-tipwip-20171123 version), on top of latest tip:x86/urgent (12a78d43de76),
> plus fixes - for easier review.
> 
> The code should be the latest posted by Andy and Dave.
> 
> Any bugs caused by mis-merges, mis-backmerges or mis-fixes are mine.

There were some mis-merges in the assembly code, crashing the kernel on bootup 
with Kaiser enabled. Thomas helped find & fix them.

I've pushed out the latest to tip:WIP.x86/mm, the interdiff between the posted and 
the Git version can be found below.

Thanks,

	Ingo

===============>
 arch/x86/entry/entry_64.S        | 12 ++----------
 arch/x86/entry/entry_64_compat.S |  8 --------
 arch/x86/events/intel/ds.c       |  2 +-
 3 files changed, 3 insertions(+), 19 deletions(-)

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 20be5e89a36a..4ac952080869 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -201,7 +201,6 @@ ENTRY(entry_SYSCALL_64)
 
 	swapgs
 	movq	%rsp, PER_CPU_VAR(rsp_scratch)
-
 	/*
 	 * The kernel CR3 is needed to map the process stack, but we
 	 * need a scratch register to be able to load CR3.  %rsp is
@@ -209,7 +208,6 @@ ENTRY(entry_SYSCALL_64)
 	 * %rsp will look crazy here for a couple instructions.
 	 */
 	SWITCH_TO_KERNEL_CR3 scratch_reg=%rsp
-
 	movq	PER_CPU_VAR(cpu_current_top_of_stack), %rsp
 
 	/* Construct struct pt_regs on stack */
@@ -259,9 +257,6 @@ GLOBAL(entry_SYSCALL_64_after_hwframe)
 	ja	1f				/* return -ENOSYS (already in pt_regs->ax) */
 	movq	%r10, %rcx
 
-	/* Must wait until we have the kernel CR3 to call C functions: */
-	TRACE_IRQS_OFF
-
 	/*
 	 * This call instruction is handled specially in stub_ptregs_64.
 	 * It might end up jumping to the slow path.  If it jumps, RAX
@@ -647,7 +642,6 @@ END(irq_entries_start)
 	testb	$3, CS-ORIG_RAX(%rsp)
 	jz	1f
 	SWAPGS
-	SWITCH_TO_KERNEL_CR3 scratch_reg=%rax
 	call	switch_to_thread_stack
 1:
 
@@ -956,10 +950,9 @@ apicinterrupt IRQ_WORK_VECTOR			irq_work_interrupt		smp_irq_work_interrupt
 ENTRY(switch_to_thread_stack)
 	UNWIND_HINT_FUNC
 
+	pushq	%rdi
 	/* Need to switch before accessing the thread stack. */
 	SWITCH_TO_KERNEL_CR3 scratch_reg=%rdi
-
-	pushq	%rdi
 	movq	%rsp, %rdi
 	movq	PER_CPU_VAR(cpu_current_top_of_stack), %rsp
 	UNWIND_HINT sp_offset=16 sp_reg=ORC_REG_DI
@@ -1315,7 +1308,6 @@ ENTRY(error_entry)
 	 * from user mode due to an IRET fault.
 	 */
 	SWAPGS
-
 	/* We have user CR3.  Change to kernel CR3. */
 	SWITCH_TO_KERNEL_CR3 scratch_reg=%rax
 
@@ -1377,8 +1369,8 @@ ENTRY(error_entry)
 	 * We came from an IRET to user mode, so we have user
 	 * gsbase and CR3.  Switch to kernel gsbase and CR3:
 	 */
-	SWITCH_TO_KERNEL_CR3 scratch_reg=%rax
 	SWAPGS
+	SWITCH_TO_KERNEL_CR3 scratch_reg=%rax
 
 	/*
 	 * Pretend that the exception came from user mode: set up pt_regs
diff --git a/arch/x86/entry/entry_64_compat.S b/arch/x86/entry/entry_64_compat.S
index 57cd353c0667..05238b29895e 100644
--- a/arch/x86/entry/entry_64_compat.S
+++ b/arch/x86/entry/entry_64_compat.S
@@ -319,14 +319,6 @@ ENTRY(entry_INT80_compat)
 	ASM_CLAC			/* Do this early to minimize exposure */
 	SWAPGS
 
-	/*
-	 * Must switch CR3 before thread stack is used.  %r8 itself
-	 * is not saved into pt_regs and is not preserved across
-	 * function calls (like TRACE_IRQS_OFF calls), thus should
-	 * be safe to use.
-	 */
-	SWITCH_TO_KERNEL_CR3 scratch_reg=%r8
-
 	/*
 	 * User tracing code (ptrace or signal handlers) might assume that
 	 * the saved RAX contains a 32-bit number when we're invoking a 32-bit
diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
index 61388b01962d..b5cf473e443a 100644
--- a/arch/x86/events/intel/ds.c
+++ b/arch/x86/events/intel/ds.c
@@ -1,9 +1,9 @@
 // SPDX-License-Identifier: GPL-2.0
 #include <linux/bitops.h>
 #include <linux/types.h>
+#include <linux/kaiser.h>
 #include <linux/slab.h>
 
-#include <linux/kaiser.h>
 #include <asm/perf_event.h>
 #include <asm/insn.h>
 

^ permalink raw reply related	[flat|nested] 82+ messages in thread

* Re: [PATCH 11/43] x86/entry/64: Separate cpu_current_top_of_stack from TSS.sp0
  2017-11-24  9:14 ` [PATCH 11/43] x86/entry/64: Separate cpu_current_top_of_stack from TSS.sp0 Ingo Molnar
@ 2017-11-24 14:19   ` Borislav Petkov
  0 siblings, 0 replies; 82+ messages in thread
From: Borislav Petkov @ 2017-11-24 14:19 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Dave Hansen, Andy Lutomirski, Thomas Gleixner,
	H . Peter Anvin, Peter Zijlstra, Linus Torvalds

On Fri, Nov 24, 2017 at 10:14:16AM +0100, Ingo Molnar wrote:
> From: Andy Lutomirski <luto@kernel.org>
> 
> On 64-bit kernels, we used to assume that TSS.sp0 was the current
> top of stack.  With the addition of an entry trampoline, this will
> no longer be the case.  Store the current top of stack in TSS.sp1,
> which is otherwise unused but shares the same cacheline.
> 
> Signed-off-by: Andy Lutomirski <luto@kernel.org>
> Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
> Cc: Borislav Petkov <bp@alien8.de>
> Cc: Borislav Petkov <bpetkov@suse.de>
> Cc: Brian Gerst <brgerst@gmail.com>
> Cc: Dave Hansen <dave.hansen@intel.com>
> Cc: Denys Vlasenko <dvlasenk@redhat.com>
> Cc: H. Peter Anvin <hpa@zytor.com>
> Cc: Josh Poimboeuf <jpoimboe@redhat.com>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Link: http://lkml.kernel.org/r/f56634c746a2926eb7bae61e7b80ed51a1940769.1511497875.git.luto@kernel.org
> Signed-off-by: Ingo Molnar <mingo@kernel.org>
> ---
>  arch/x86/include/asm/processor.h   | 18 +++++++++++++-----
>  arch/x86/include/asm/thread_info.h |  2 +-
>  arch/x86/kernel/asm-offsets_64.c   |  1 +
>  arch/x86/kernel/process.c          | 10 ++++++++++
>  arch/x86/kernel/process_64.c       |  1 +
>  5 files changed, 26 insertions(+), 6 deletions(-)

Reviewed-by: Borislav Petkov <bp@suse.de>

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 17/43] x86/irq/64: Print the offending IP in the stack overflow warning
  2017-11-24  9:14 ` [PATCH 17/43] x86/irq/64: Print the offending IP in the stack overflow warning Ingo Molnar
@ 2017-11-24 14:22   ` Thomas Gleixner
  0 siblings, 0 replies; 82+ messages in thread
From: Thomas Gleixner @ 2017-11-24 14:22 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Dave Hansen, Andy Lutomirski, H . Peter Anvin,
	Peter Zijlstra, Borislav Petkov, Linus Torvalds

On Fri, 24 Nov 2017, Ingo Molnar wrote:

> From: Andy Lutomirski <luto@kernel.org>
> 
> In case something goes wrong with unwind (not unlikely in case of
> overflow), print the offending IP where we detected the overflow.
> 
> Signed-off-by: Andy Lutomirski <luto@kernel.org>

Reviewed-by: Thomas Gleixner <tglx@linutronix.de>

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 18/43] x86/entry/64: Move the IST stacks into cpu_entry_area
  2017-11-24  9:14 ` [PATCH 18/43] x86/entry/64: Move the IST stacks into cpu_entry_area Ingo Molnar
@ 2017-11-24 14:23   ` Thomas Gleixner
  0 siblings, 0 replies; 82+ messages in thread
From: Thomas Gleixner @ 2017-11-24 14:23 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Dave Hansen, Andy Lutomirski, H . Peter Anvin,
	Peter Zijlstra, Borislav Petkov, Linus Torvalds

On Fri, 24 Nov 2017, Ingo Molnar wrote:

> From: Andy Lutomirski <luto@kernel.org>
> 
> The IST stacks are needed when an IST exception occurs and are
> accessed before any kernel code at all runs.  Move them into
> cpu_entry_area.
> 
> Signed-off-by: Andy Lutomirski <luto@kernel.org>

Reviewed-by: Thomas Gleixner <tglx@linutronix.de>

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 19/43] x86/entry/64: Remove the SYSENTER stack canary
  2017-11-24  9:14 ` [PATCH 19/43] x86/entry/64: Remove the SYSENTER stack canary Ingo Molnar
@ 2017-11-24 14:23   ` Thomas Gleixner
  0 siblings, 0 replies; 82+ messages in thread
From: Thomas Gleixner @ 2017-11-24 14:23 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Dave Hansen, Andy Lutomirski, H . Peter Anvin,
	Peter Zijlstra, Borislav Petkov, Linus Torvalds

On Fri, 24 Nov 2017, Ingo Molnar wrote:

> From: Andy Lutomirski <luto@kernel.org>
> 
> Now that the SYSENTER stack has a guard page, there's no need for a
> canary to detect overflow after the fact.

Reviewed-by: Thomas Gleixner <tglx@linutronix.de>

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 20/43] x86/entry: Clean up SYSENTER_stack code
  2017-11-24  9:14 ` [PATCH 20/43] x86/entry: Clean up SYSENTER_stack code Ingo Molnar
@ 2017-11-24 14:24   ` Thomas Gleixner
  0 siblings, 0 replies; 82+ messages in thread
From: Thomas Gleixner @ 2017-11-24 14:24 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Dave Hansen, Andy Lutomirski, H . Peter Anvin,
	Peter Zijlstra, Borislav Petkov, Linus Torvalds

On Fri, 24 Nov 2017, Ingo Molnar wrote:

> From: Andy Lutomirski <luto@kernel.org>
> 
> The existing code was a mess, mainly because C arrays are nasty.
> Turn SYSENTER_stack into a struct, add a helper to find it, and do
> all the obvious cleanups this enables.

Nice.

> Signed-off-by: Andy Lutomirski <luto@kernel.org>

Reviewed-by: Thomas Gleixner <tglx@linutronix.de>

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 00/43] x86 entry-stack and Kaiser series, 2017/11/24 version
  2017-11-24  9:14 [PATCH 00/43] x86 entry-stack and Kaiser series, 2017/11/24 version Ingo Molnar
                   ` (43 preceding siblings ...)
  2017-11-24 13:55 ` [PATCH 00/43] x86 entry-stack and Kaiser series, 2017/11/24 version Ingo Molnar
@ 2017-11-24 15:23 ` Thomas Gleixner
  2017-11-24 17:19   ` Ingo Molnar
  44 siblings, 1 reply; 82+ messages in thread
From: Thomas Gleixner @ 2017-11-24 15:23 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Dave Hansen, Andy Lutomirski, H . Peter Anvin,
	Peter Zijlstra, Borislav Petkov, Linus Torvalds

On Fri, 24 Nov 2017, Ingo Molnar wrote:

> This is a linear series of patches of the latest entry-stack plus Kaiser
> bits from Andy Lutomirski (v3 series from today) and Dave Hansen
> (kaiser-414-tipwip-20171123 version), on top of latest tip:x86/urgent (12a78d43de76),
> plus fixes - for easier review.
> 
> The code should be the latest posted by Andy and Dave.
> 
> Any bugs caused by mis-merges, mis-backmerges or mis-fixes are mine.

There are a few mismerges as we established already. Can you please repost
the series (at least the kaiser bits) so I can continue reviewing from a
working state?

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 30/43] x86/mm/kaiser: Map espfix structures
  2017-11-24 13:47   ` Peter Zijlstra
@ 2017-11-24 16:17     ` Andy Lutomirski
  2017-11-27  9:14       ` Peter Zijlstra
  0 siblings, 1 reply; 82+ messages in thread
From: Andy Lutomirski @ 2017-11-24 16:17 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Dave Hansen, Thomas Gleixner,
	H . Peter Anvin, Borislav Petkov, Linus Torvalds

On Fri, Nov 24, 2017 at 5:47 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Fri, Nov 24, 2017 at 10:14:35AM +0100, Ingo Molnar wrote:
>> From: Dave Hansen <dave.hansen@linux.intel.com>
>>
>> There is some rather arcane code to help when an IRET returns
>> to 16-bit segments.  It is referred to as the "espfix" code.
>> This consists of a few per-cpu variables:
>>
>>       espfix_stack: tells us where the stack is allocated
>>                     (the bottom)
>>       espfix_waddr: tells us to where %rsp may be pointed
>>                     (the top)
>>
>> These are in addition to the stack itself.  All three things must
>> be mapped for the espfix code to function.
>>
>> Note: the espfix code runs with a kernel GSBASE, but user
>> (shadow) page tables.  A switch to the kernel page tables could
>> be performed instead of mapping these structures, but mapping
>> them is simpler and less likely to break the assembly.  To switch
>> over to the kernel copy, additional temporary storage would be
>> required which is in short supply in this context.
>
> With Andy's patches that should actually be doable, no?

I don't think it has much to do with my patches.  We can freely spill
to the stack in the espfix64 code, though.

--Andy

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 25/43] x86/mm/kaiser: Unmap kernel from userspace page tables (core patch)
  2017-11-24 12:16   ` Peter Zijlstra
@ 2017-11-24 16:33     ` Dave Hansen
  2017-11-26 15:13       ` Ingo Molnar
  0 siblings, 1 reply; 82+ messages in thread
From: Dave Hansen @ 2017-11-24 16:33 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar
  Cc: linux-kernel, Andy Lutomirski, Thomas Gleixner, H . Peter Anvin,
	Borislav Petkov, Linus Torvalds

On 11/24/2017 04:16 AM, Peter Zijlstra wrote:
> On Fri, Nov 24, 2017 at 10:14:30AM +0100, Ingo Molnar wrote:
>> +The minimalistic kernel portion of the user page tables try to
>> +map only what is needed to enter/exit the kernel such as the
>> +entry/exit functions themselves and the interrupt descriptor
>> +table (IDT).  
> 
>                 There are a few unnecessary things that get mapped
>> +such as the first C function when entering an interrupt (see
>> +comments in kaiser.c).
> 
> If I understood Andy's patches correctly, this should no longer be
> required. Is this text still correct?

It is out of date.  We were able to remove the irq handler mappings.

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 00/43] x86 entry-stack and Kaiser series, 2017/11/24 version
  2017-11-24 15:23 ` Thomas Gleixner
@ 2017-11-24 17:19   ` Ingo Molnar
  0 siblings, 0 replies; 82+ messages in thread
From: Ingo Molnar @ 2017-11-24 17:19 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-kernel, Dave Hansen, Andy Lutomirski, H . Peter Anvin,
	Peter Zijlstra, Borislav Petkov, Linus Torvalds


* Thomas Gleixner <tglx@linutronix.de> wrote:

> On Fri, 24 Nov 2017, Ingo Molnar wrote:
> 
> > This is a linear series of patches of the latest entry-stack plus Kaiser
> > bits from Andy Lutomirski (v3 series from today) and Dave Hansen
> > (kaiser-414-tipwip-20171123 version), on top of latest tip:x86/urgent (12a78d43de76),
> > plus fixes - for easier review.
> > 
> > The code should be the latest posted by Andy and Dave.
> > 
> > Any bugs caused by mis-merges, mis-backmerges or mis-fixes are mine.
> 
> There are a few mismerges as we established already. Can you please repost
> the series (at least the kaiser bits) so I can continue reviewing from a
> working state?

Sure, will do that in a minute.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 25/43] x86/mm/kaiser: Unmap kernel from userspace page tables (core patch)
  2017-11-24 16:33     ` Dave Hansen
@ 2017-11-26 15:13       ` Ingo Molnar
  0 siblings, 0 replies; 82+ messages in thread
From: Ingo Molnar @ 2017-11-26 15:13 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Peter Zijlstra, linux-kernel, Andy Lutomirski, Thomas Gleixner,
	H . Peter Anvin, Borislav Petkov, Linus Torvalds


* Dave Hansen <dave.hansen@linux.intel.com> wrote:

> On 11/24/2017 04:16 AM, Peter Zijlstra wrote:
> > On Fri, Nov 24, 2017 at 10:14:30AM +0100, Ingo Molnar wrote:
> >> +The minimalistic kernel portion of the user page tables try to
> >> +map only what is needed to enter/exit the kernel such as the
> >> +entry/exit functions themselves and the interrupt descriptor
> >> +table (IDT).  
> > 
> >                 There are a few unnecessary things that get mapped
> >> +such as the first C function when entering an interrupt (see
> >> +comments in kaiser.c).
> > 
> > If I understood Andy's patches correctly, this should no longer be
> > required. Is this text still correct?
> 
> It is out of date.  We were able to remove the irq handler mappings.

I have updated the Documentation/x86/kaiser.txt file accordingly.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 25/43] x86/mm/kaiser: Unmap kernel from userspace page tables (core patch)
  2017-11-24 13:30   ` Peter Zijlstra
@ 2017-11-26 15:15     ` Ingo Molnar
  2017-11-27  8:59       ` [PATCH] x86/mm/kaiser: Use the other page_table_lock pattern Peter Zijlstra
  2017-11-27  8:59       ` [PATCH] mm: Unify page_table_lock allocation pattern Peter Zijlstra
  0 siblings, 2 replies; 82+ messages in thread
From: Ingo Molnar @ 2017-11-26 15:15 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Dave Hansen, Andy Lutomirski, Thomas Gleixner,
	H . Peter Anvin, Borislav Petkov, Linus Torvalds


* Peter Zijlstra <peterz@infradead.org> wrote:

> On Fri, Nov 24, 2017 at 10:14:30AM +0100, Ingo Molnar wrote:
> > +static pte_t *kaiser_shadow_pagetable_walk(unsigned long address,
> > +					   unsigned long flags)
> > +{
> > +	pte_t *pte;
> > +	pmd_t *pmd;
> > +	pud_t *pud;
> > +	p4d_t *p4d;
> > +	pgd_t *pgd = kernel_to_shadow_pgdp(pgd_offset_k(address));
> > +	gfp_t gfp = (GFP_KERNEL | __GFP_NOTRACK | __GFP_ZERO);
> > +
> > +	if (flags & KAISER_WALK_ATOMIC) {
> > +		gfp &= ~GFP_KERNEL;
> > +		gfp |= __GFP_HIGH | __GFP_ATOMIC;
> > +	}
> > +
> > +	if (address < PAGE_OFFSET) {
> > +		WARN_ONCE(1, "attempt to walk user address\n");
> > +		return NULL;
> > +	}
> > +
> > +	if (pgd_none(*pgd)) {
> > +		WARN_ONCE(1, "All shadow pgds should have been populated\n");
> > +		return NULL;
> > +	}
> > +	BUILD_BUG_ON(pgd_large(*pgd) != 0);
> > +
> > +	p4d = p4d_offset(pgd, address);
> > +	BUILD_BUG_ON(p4d_large(*p4d) != 0);
> > +	if (p4d_none(*p4d)) {
> > +		unsigned long new_pud_page = __get_free_page(gfp);
> > +		if (!new_pud_page)
> > +			return NULL;
> > +
> > +		spin_lock(&shadow_table_allocation_lock);
> > +		if (p4d_none(*p4d))
> > +			set_p4d(p4d, __p4d(_KERNPG_TABLE | __pa(new_pud_page)));
> > +		else
> > +			free_page(new_pud_page);
> > +		spin_unlock(&shadow_table_allocation_lock);
> 
> So mm/memory.c has two patterns here.. I prefer the other one:
> 
> 		spin_lock(&shadow_table_allocation_lock);
> 		if (p4d_none(*p4d)) {
> 			set_p4d(p4d, __p4d(_KERNEL_TABLE | __pa(new_pud_page)));
> 			new_pud_page = NULL;
> 		}
> 		spin_unlock(&shadow_table_allocation_lock);
> 		if (new_pud_page)
> 			free_page(new_pud_page);
> 
> > +	}

Ok, would be nice to get this cleanup as a delta patch, because the existing 
pattern has been tested to a fair degree already.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 82+ messages in thread

* [PATCH] x86/mm/kaiser: Use the other page_table_lock pattern
  2017-11-26 15:15     ` Ingo Molnar
@ 2017-11-27  8:59       ` Peter Zijlstra
  2017-11-27  8:59       ` [PATCH] mm: Unify page_table_lock allocation pattern Peter Zijlstra
  1 sibling, 0 replies; 82+ messages in thread
From: Peter Zijlstra @ 2017-11-27  8:59 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Dave Hansen, Andy Lutomirski, Thomas Gleixner,
	H . Peter Anvin, Borislav Petkov, Linus Torvalds

Subject: x86/mm/kaiser: Use the other page_table_lock pattern
From: Peter Zijlstra <peterz@infradead.org>
Date: Mon Nov 27 09:35:08 CET 2017

Use the other page_table_lock pattern; this removes the free from
under the lock, reducing worst case hold times and makes it a leaf
lock.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 arch/x86/mm/kaiser.c |   24 +++++++++++++++---------
 1 file changed, 15 insertions(+), 9 deletions(-)

--- a/arch/x86/mm/kaiser.c
+++ b/arch/x86/mm/kaiser.c
@@ -183,11 +183,13 @@ static pte_t *kaiser_shadow_pagetable_wa
 			return NULL;
 
 		spin_lock(&shadow_table_allocation_lock);
-		if (p4d_none(*p4d))
+		if (p4d_none(*p4d)) {
 			set_p4d(p4d, __p4d(_KERNPG_TABLE | __pa(new_pud_page)));
-		else
-			free_page(new_pud_page);
+			new_pud_page = 0;
+		}
 		spin_unlock(&shadow_table_allocation_lock);
+		if (new_pud_page)
+			free_page(new_pud_page);
 	}
 
 	pud = pud_offset(p4d, address);
@@ -202,11 +204,13 @@ static pte_t *kaiser_shadow_pagetable_wa
 			return NULL;
 
 		spin_lock(&shadow_table_allocation_lock);
-		if (pud_none(*pud))
+		if (pud_none(*pud)) {
 			set_pud(pud, __pud(_KERNPG_TABLE | __pa(new_pmd_page)));
-		else
-			free_page(new_pmd_page);
+			new_pmd_page = 0;
+		}
 		spin_unlock(&shadow_table_allocation_lock);
+		if (new_pmd_page)
+			free_page(new_pmd_page);
 	}
 
 	pmd = pmd_offset(pud, address);
@@ -221,11 +225,13 @@ static pte_t *kaiser_shadow_pagetable_wa
 			return NULL;
 
 		spin_lock(&shadow_table_allocation_lock);
-		if (pmd_none(*pmd))
+		if (pmd_none(*pmd)) {
 			set_pmd(pmd, __pmd(_KERNPG_TABLE  | __pa(new_pte_page)));
-		else
-			free_page(new_pte_page);
+			new_pte_page = 0;
+		}
 		spin_unlock(&shadow_table_allocation_lock);
+		if (new_pte_page)
+			free_page(new_pte_page);
 	}
 
 	pte = pte_offset_kernel(pmd, address);

^ permalink raw reply	[flat|nested] 82+ messages in thread

* [PATCH] mm: Unify page_table_lock allocation pattern
  2017-11-26 15:15     ` Ingo Molnar
  2017-11-27  8:59       ` [PATCH] x86/mm/kaiser: Use the other page_table_lock pattern Peter Zijlstra
@ 2017-11-27  8:59       ` Peter Zijlstra
  1 sibling, 0 replies; 82+ messages in thread
From: Peter Zijlstra @ 2017-11-27  8:59 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Dave Hansen, Andy Lutomirski, Thomas Gleixner,
	H . Peter Anvin, Borislav Petkov, Linus Torvalds

Subject: mm: Unify page_table_lock allocation pattern
From: Peter Zijlstra <peterz@infradead.org>
Date: Mon Nov 27 09:35:04 CET 2017

There are two different patterns wrt page_table_lock and allocating
new pages. Get rid of this diversity.

I picked this variant because it does less work under the lock and
makes page_table_lock a leaf lock.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 mm/memory.c |   33 ++++++++++++++++++++-------------
 1 file changed, 20 insertions(+), 13 deletions(-)

--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4100,11 +4100,14 @@ int __p4d_alloc(struct mm_struct *mm, pg
 	smp_wmb(); /* See comment in __pte_alloc */
 
 	spin_lock(&mm->page_table_lock);
-	if (pgd_present(*pgd))		/* Another has populated it */
-		p4d_free(mm, new);
-	else
+	if (!pgd_present(*pgd)) {
 		pgd_populate(mm, pgd, new);
+		new = NULL;
+	}
 	spin_unlock(&mm->page_table_lock);
+	if (new)
+		p4d_free(mm, new);
+
 	return 0;
 }
 #endif /* __PAGETABLE_P4D_FOLDED */
@@ -4124,17 +4127,19 @@ int __pud_alloc(struct mm_struct *mm, p4
 
 	spin_lock(&mm->page_table_lock);
 #ifndef __ARCH_HAS_5LEVEL_HACK
-	if (p4d_present(*p4d))		/* Another has populated it */
-		pud_free(mm, new);
-	else
+	if (!p4d_present(*p4d)) {
 		p4d_populate(mm, p4d, new);
+		new = NULL;
+	}
 #else
-	if (pgd_present(*p4d))		/* Another has populated it */
-		pud_free(mm, new);
-	else
+	if (!pgd_present(*p4d)) {
 		pgd_populate(mm, p4d, new);
+		new = NULL;
+	}
 #endif /* __ARCH_HAS_5LEVEL_HACK */
 	spin_unlock(&mm->page_table_lock);
+	if (new)
+		pud_free(mm, new);
 	return 0;
 }
 #endif /* __PAGETABLE_PUD_FOLDED */
@@ -4158,16 +4163,18 @@ int __pmd_alloc(struct mm_struct *mm, pu
 	if (!pud_present(*pud)) {
 		mm_inc_nr_pmds(mm);
 		pud_populate(mm, pud, new);
-	} else	/* Another has populated it */
-		pmd_free(mm, new);
+		new = NULL;
+	}
 #else
 	if (!pgd_present(*pud)) {
 		mm_inc_nr_pmds(mm);
 		pgd_populate(mm, pud, new);
-	} else /* Another has populated it */
-		pmd_free(mm, new);
+		new = NULL;
+	}
 #endif /* __ARCH_HAS_4LEVEL_HACK */
 	spin_unlock(ptl);
+	if (new)
+		pmd_free(mm, new);
 	return 0;
 }
 #endif /* __PAGETABLE_PMD_FOLDED */

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 30/43] x86/mm/kaiser: Map espfix structures
  2017-11-24 16:17     ` Andy Lutomirski
@ 2017-11-27  9:14       ` Peter Zijlstra
  2017-11-27 15:35         ` Peter Zijlstra
  0 siblings, 1 reply; 82+ messages in thread
From: Peter Zijlstra @ 2017-11-27  9:14 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Ingo Molnar, linux-kernel, Dave Hansen, Thomas Gleixner,
	H . Peter Anvin, Borislav Petkov, Linus Torvalds

On Fri, Nov 24, 2017 at 08:17:06AM -0800, Andy Lutomirski wrote:
> On Fri, Nov 24, 2017 at 5:47 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> > On Fri, Nov 24, 2017 at 10:14:35AM +0100, Ingo Molnar wrote:
> >> From: Dave Hansen <dave.hansen@linux.intel.com>
> >>
> >> There is some rather arcane code to help when an IRET returns
> >> to 16-bit segments.  It is referred to as the "espfix" code.
> >> This consists of a few per-cpu variables:
> >>
> >>       espfix_stack: tells us where the stack is allocated
> >>                     (the bottom)
> >>       espfix_waddr: tells us to where %rsp may be pointed
> >>                     (the top)
> >>
> >> These are in addition to the stack itself.  All three things must
> >> be mapped for the espfix code to function.
> >>
> >> Note: the espfix code runs with a kernel GSBASE, but user
> >> (shadow) page tables.  A switch to the kernel page tables could
> >> be performed instead of mapping these structures, but mapping
> >> them is simpler and less likely to break the assembly.  To switch
> >> over to the kernel copy, additional temporary storage would be
> >> required which is in short supply in this context.
> >
> > With Andy's patches that should actually be doable, no?
> 
> I don't think it has much to do with my patches.  We can freely spill
> to the stack in the espfix64 code, though.

Ah, I was thinking of how you made scratch space easier for the SYSENTER
stuff.

But if we can freely spill here, should we not do the kernel switch
instead of doing this user mapping? The way I understand things, the
less of these magic mappings we have the better.

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 30/43] x86/mm/kaiser: Map espfix structures
  2017-11-27  9:14       ` Peter Zijlstra
@ 2017-11-27 15:35         ` Peter Zijlstra
  0 siblings, 0 replies; 82+ messages in thread
From: Peter Zijlstra @ 2017-11-27 15:35 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Ingo Molnar, linux-kernel, Dave Hansen, Thomas Gleixner,
	H . Peter Anvin, Borislav Petkov, Linus Torvalds

On Mon, Nov 27, 2017 at 10:14:24AM +0100, Peter Zijlstra wrote:

> But if we can freely spill here, should we not do the kernel switch
> instead of doing this user mapping? The way I understand things, the
> less of these magic mappings we have the better.

Turns out, we don't need more scratch regs at all.

The below seems to survive tools/testing/selftests/x86/sigreturn_64
which exercises the ESPFIX crud.

---
 arch/x86/entry/entry_64.S   | 11 ++++++++---
 arch/x86/kernel/espfix_64.c | 10 ++--------
 2 files changed, 10 insertions(+), 11 deletions(-)

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index df0152bee8a8..289ba2680952 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -825,7 +825,9 @@ ENTRY(native_iret)
 	 */
 
 	pushq	%rdi				/* Stash user RDI */
-	SWAPGS
+	SWAPGS					/* to kernel GS */
+	SWITCH_TO_KERNEL_CR3 scratch_reg=%rdi	/* to kernel CR3 */
+
 	movq	PER_CPU_VAR(espfix_waddr), %rdi
 	movq	%rax, (0*8)(%rdi)		/* user RAX */
 	movq	(1*8)(%rsp), %rax		/* user RIP */
@@ -841,7 +843,6 @@ ENTRY(native_iret)
 	/* Now RAX == RSP. */
 
 	andl	$0xffff0000, %eax		/* RAX = (RSP & 0xffff0000) */
-	popq	%rdi				/* Restore user RDI */
 
 	/*
 	 * espfix_stack[31:16] == 0.  The page tables are set up such that
@@ -852,7 +853,11 @@ ENTRY(native_iret)
 	 * still points to an RO alias of the ESPFIX stack.
 	 */
 	orq	PER_CPU_VAR(espfix_stack), %rax
-	SWAPGS
+
+	SWITCH_TO_USER_CR3 scratch_reg=%rdi	/* to user CR3 */
+	SWAPGS					/* to user GS */
+	popq	%rdi				/* Restore user RDI */
+
 	movq	%rax, %rsp
 	UNWIND_HINT_IRET_REGS offset=8
 
diff --git a/arch/x86/kernel/espfix_64.c b/arch/x86/kernel/espfix_64.c
index 8bb116d73aaa..8826475d786c 100644
--- a/arch/x86/kernel/espfix_64.c
+++ b/arch/x86/kernel/espfix_64.c
@@ -61,8 +61,8 @@
 #define PGALLOC_GFP (GFP_KERNEL | __GFP_NOTRACK | __GFP_ZERO)
 
 /* This contains the *bottom* address of the espfix stack */
-DEFINE_PER_CPU_USER_MAPPED(unsigned long, espfix_stack);
-DEFINE_PER_CPU_USER_MAPPED(unsigned long, espfix_waddr);
+DEFINE_PER_CPU_READ_MOSTLY(unsigned long, espfix_stack);
+DEFINE_PER_CPU_READ_MOSTLY(unsigned long, espfix_waddr);
 
 /* Initialization mutex - should this be a spinlock? */
 static DEFINE_MUTEX(espfix_init_mutex);
@@ -225,10 +225,4 @@ void init_espfix_ap(int cpu)
 	per_cpu(espfix_stack, cpu) = addr;
 	per_cpu(espfix_waddr, cpu) = (unsigned long)stack_page
 				      + (addr & ~PAGE_MASK);
-	/*
-	 * _PAGE_GLOBAL is not really required.  This is not a hot
-	 * path, but we do it here for consistency.
-	 */
-	kaiser_add_mapping((unsigned long)stack_page, PAGE_SIZE,
-			__PAGE_KERNEL | _PAGE_GLOBAL);
 }

^ permalink raw reply related	[flat|nested] 82+ messages in thread

* RE: [PATCH 01/43] x86/decoder: Add new TEST instruction pattern
  2017-11-24  9:14 ` [PATCH 01/43] x86/decoder: Add new TEST instruction pattern Ingo Molnar
  2017-11-24 10:38   ` Borislav Petkov
@ 2017-12-02  7:39   ` Robert Elliott (Persistent Memory)
  1 sibling, 0 replies; 82+ messages in thread
From: Robert Elliott (Persistent Memory) @ 2017-12-02  7:39 UTC (permalink / raw)
  To: Ingo Molnar, linux-kernel
  Cc: Dave Hansen, Andy Lutomirski, Thomas Gleixner, H . Peter Anvin,
	Peter Zijlstra, Borislav Petkov, Linus Torvalds,
	Greg Kroah-Hartman

> -----Original Message-----
> From: linux-kernel-owner@vger.kernel.org [mailto:linux-kernel-
> owner@vger.kernel.org] On Behalf Of Ingo Molnar
> Sent: Friday, November 24, 2017 3:14 AM
> To: linux-kernel@vger.kernel.org
> Subject: [PATCH 01/43] x86/decoder: Add new TEST instruction pattern
> 
> From: Masami Hiramatsu <mhiramat@kernel.org>
> 
...
> diff --git a/arch/x86/lib/x86-opcode-map.txt b/arch/x86/lib/x86-opcode-map.txt
> index 12e377184ee4..c4d55919fac1 100644

I think this patch (commit 12a78d43de76, also posted for 3.18, 4.4, and 4.9)
needs to update these files as well:
    tools/objtool/arch/x86/lib/x86-opcode-map.txt
    tools/perf/util/intel-pt-decoder/x86-opcode-map.txt

to avoid warnings like:

Warning: synced file at 'tools/objtool/arch/x86/lib/x86-opcode-map.txt' differs from latest kernel version at 'arch/x86/lib/x86-opcode-map.txt'
  LINK     /home/user/linux/tools/objtool/objtool


---
Robert Elliott, HPE Persistent Memory

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 23/43] x86/mm/kaiser: Introduce user-mapped per-cpu areas
  2017-11-26 17:41   ` Borislav Petkov
  2017-11-27  9:26     ` Ingo Molnar
@ 2017-11-27 21:14     ` Dave Hansen
  1 sibling, 0 replies; 82+ messages in thread
From: Dave Hansen @ 2017-11-27 21:14 UTC (permalink / raw)
  To: Borislav Petkov, Ingo Molnar
  Cc: linux-kernel, Andy Lutomirski, Thomas Gleixner, H . Peter Anvin,
	Peter Zijlstra, Linus Torvalds

On 11/26/2017 09:41 AM, Borislav Petkov wrote:
>> The KAISER approach keeps two copies of the page tables: one for running
>> in the kernel and one for running userspace.  But, there are a few
>> structures that are needed for switching in and out of the kernel and
>> a good subset of *those* are per-cpu data.
>>
>> This patch creates a new kind of per-cpu data that is mapped and
> Never say "This patch" in the commit message of a patch. It is
> tautologically useless.

Look at any academic paper's abstract: it almost always describes the
problem and the state of the art, and then starts to describe the paper's
content.  It's entirely normal to say "this paper" to help differentiate
these things.

Patches can and should be the same.

We should not litter the text with "this patch does this", "this patch
does that", but we should not outlaw it entirely.  IOW, you can't just
blindly say not to do it.

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 23/43] x86/mm/kaiser: Introduce user-mapped per-cpu areas
  2017-11-26 17:41   ` Borislav Petkov
@ 2017-11-27  9:26     ` Ingo Molnar
  2017-11-27 21:14     ` Dave Hansen
  1 sibling, 0 replies; 82+ messages in thread
From: Ingo Molnar @ 2017-11-27  9:26 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: linux-kernel, Dave Hansen, Andy Lutomirski, Thomas Gleixner,
	H . Peter Anvin, Peter Zijlstra, Linus Torvalds


* Borislav Petkov <bp@alien8.de> wrote:

> On Fri, Nov 24, 2017 at 06:23:51PM +0100, Ingo Molnar wrote:
> > From: Dave Hansen <dave.hansen@linux.intel.com>
> > 
> > These patches are based on work from a team at Graz University of
> > Technology posted here: https://github.com/IAIK/KAISER
> > 
> > The KAISER approach keeps two copies of the page tables: one for running
> > in the kernel and one for running userspace.  But, there are a few
> > structures that are needed for switching in and out of the kernel and
> > a good subset of *those* are per-cpu data.
> > 
> > This patch creates a new kind of per-cpu data that is mapped and
> 
> Never say "This patch" in the commit message of a patch. It is
> tautologically useless.

It makes sense in some contexts, though. For example:

  The compat variant of SYSCALL doesn't have this problem in the first
  place -- there are plenty of scratch registers, since we don't care
  about preserving r8-r15. This patch therefore doesn't touch SYSCALL32
  at all.

If we only had:

   We don't touch SYSCALL32 at all.

it would be ambiguous: does it describe the status quo, or a decision we 
made while writing the patch? The 'this patch' variant adds extra emphasis 
that SYSCALL32 is fine and doesn't require any changes.

Also, even the above variant:

> > This patch creates a new kind of per-cpu data that is mapped and ...

is a bit clearer than:

> > Create a new kind of per-cpu data that is mapped and ...

That's because it stresses that the behavior is new. The latter form carries 
that information as well, but is pretty close to:

> > The kernel creates a new kind of per-cpu data that is mapped and ...

... which describes the status quo and not new behavior. Explicitly qualifying who 
does something makes it clearer.

I.e. redundancy sometimes helps readability. We don't want 'this patch' in 
every second sentence, but having it written out once, especially where we 
switch from describing existing behavior to describing new behavior (as is 
the case here), is perfectly fine.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 82+ messages in thread

* Re: [PATCH 23/43] x86/mm/kaiser: Introduce user-mapped per-cpu areas
  2017-11-24 17:23 ` [PATCH 23/43] x86/mm/kaiser: Introduce user-mapped per-cpu areas Ingo Molnar
@ 2017-11-26 17:41   ` Borislav Petkov
  2017-11-27  9:26     ` Ingo Molnar
  2017-11-27 21:14     ` Dave Hansen
  0 siblings, 2 replies; 82+ messages in thread
From: Borislav Petkov @ 2017-11-26 17:41 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Dave Hansen, Andy Lutomirski, Thomas Gleixner,
	H . Peter Anvin, Peter Zijlstra, Linus Torvalds

On Fri, Nov 24, 2017 at 06:23:51PM +0100, Ingo Molnar wrote:
> From: Dave Hansen <dave.hansen@linux.intel.com>
> 
> These patches are based on work from a team at Graz University of
> Technology posted here: https://github.com/IAIK/KAISER
> 
> The KAISER approach keeps two copies of the page tables: one for running
> in the kernel and one for running userspace.  But, there are a few
> structures that are needed for switching in and out of the kernel and
> a good subset of *those* are per-cpu data.
> 
> This patch creates a new kind of per-cpu data that is mapped and

Never say "This patch" in the commit message of a patch. It is
tautologically useless.

> can be used no matter which copy of the page tables is active.
> Users of this new section will be forthcoming.
> 
> Thanks to Hugh Dickins for cleanups to this code.
> 
> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: daniel.gruss@iaik.tugraz.at
> Cc: hughd@google.com
> Cc: keescook@google.com
> Cc: linux-mm@kvack.org
> Cc: luto@kernel.org
> Cc: michael.schwarz@iaik.tugraz.at
> Cc: moritz.lipp@iaik.tugraz.at
> Cc: richard.fellner@student.tugraz.at
> Link: https://lkml.kernel.org/r/20171123003444.196CB6DB@viggo.jf.intel.com
> Signed-off-by: Ingo Molnar <mingo@kernel.org>
> ---
>  include/asm-generic/vmlinux.lds.h |  7 +++++++
>  include/linux/percpu-defs.h       | 30 ++++++++++++++++++++++++++++++
>  2 files changed, 37 insertions(+)
> 
> diff --git a/include/asm-generic/vmlinux.lds.h b/include/asm-generic/vmlinux.lds.h
> index bdcd1caae092..e12168936d3f 100644
> --- a/include/asm-generic/vmlinux.lds.h
> +++ b/include/asm-generic/vmlinux.lds.h
> @@ -826,7 +826,14 @@
>   */
>  #define PERCPU_INPUT(cacheline)						\
>  	VMLINUX_SYMBOL(__per_cpu_start) = .;				\
> +	VMLINUX_SYMBOL(__per_cpu_user_mapped_start) = .;		\
>  	*(.data..percpu..first)						\
> +	. = ALIGN(cacheline);						\
> +	*(.data..percpu..user_mapped)					\
> +	*(.data..percpu..user_mapped..shared_aligned)			\
> +	. = ALIGN(PAGE_SIZE);						\
> +	*(.data..percpu..user_mapped..page_aligned)			\
> +	VMLINUX_SYMBOL(__per_cpu_user_mapped_end) = .;			\
>  	. = ALIGN(PAGE_SIZE);						\
>  	*(.data..percpu..page_aligned)					\
>  	. = ALIGN(cacheline);						\
> diff --git a/include/linux/percpu-defs.h b/include/linux/percpu-defs.h
> index 2d2096ba1cfe..752513674295 100644
> --- a/include/linux/percpu-defs.h
> +++ b/include/linux/percpu-defs.h
> @@ -35,6 +35,12 @@
>  
>  #endif
>  
> +#ifdef CONFIG_KAISER
> +#define USER_MAPPED_SECTION "..user_mapped"
> +#else
> +#define USER_MAPPED_SECTION ""
> +#endif
> +
>  /*
>   * Base implementations of per-CPU variable declarations and definitions, where
>   * the section in which the variable is to be placed is provided by the
> @@ -115,6 +121,12 @@
>  #define DEFINE_PER_CPU(type, name)					\
>  	DEFINE_PER_CPU_SECTION(type, name, "")
>  
> +#define DECLARE_PER_CPU_USER_MAPPED(type, name)				\
> +	DECLARE_PER_CPU_SECTION(type, name, USER_MAPPED_SECTION)
> +
> +#define DEFINE_PER_CPU_USER_MAPPED(type, name)				\
> +	DEFINE_PER_CPU_SECTION(type, name, USER_MAPPED_SECTION)
> +
>  /*
>   * Declaration/definition used for per-CPU variables that must come first in
>   * the set of variables.
> @@ -144,6 +156,14 @@
>  	DEFINE_PER_CPU_SECTION(type, name, PER_CPU_SHARED_ALIGNED_SECTION) \
>  	____cacheline_aligned_in_smp
>  
> +#define DECLARE_PER_CPU_SHARED_ALIGNED_USER_MAPPED(type, name)		\
> +	DECLARE_PER_CPU_SECTION(type, name, USER_MAPPED_SECTION PER_CPU_SHARED_ALIGNED_SECTION) \
> +	____cacheline_aligned_in_smp
> +
> +#define DEFINE_PER_CPU_SHARED_ALIGNED_USER_MAPPED(type, name)		\
> +	DEFINE_PER_CPU_SECTION(type, name, USER_MAPPED_SECTION PER_CPU_SHARED_ALIGNED_SECTION) \
> +	____cacheline_aligned_in_smp
> +
>  #define DECLARE_PER_CPU_ALIGNED(type, name)				\
>  	DECLARE_PER_CPU_SECTION(type, name, PER_CPU_ALIGNED_SECTION)	\
>  	____cacheline_aligned
> @@ -162,6 +182,16 @@
>  #define DEFINE_PER_CPU_PAGE_ALIGNED(type, name)				\
>  	DEFINE_PER_CPU_SECTION(type, name, "..page_aligned")		\
>  	__aligned(PAGE_SIZE)
> +/*
> + * Declaration/definition used for per-CPU variables that must be page aligned and need to be mapped in user mode.
> + */

WARNING: line over 100 characters
#122: FILE: include/linux/percpu-defs.h:186:
+ * Declaration/definition used for per-CPU variables that must be page aligned and need to be mapped in user mode.

> +#define DECLARE_PER_CPU_PAGE_ALIGNED_USER_MAPPED(type, name)		\
> +	DECLARE_PER_CPU_SECTION(type, name, USER_MAPPED_SECTION"..page_aligned") \
> +	__aligned(PAGE_SIZE)
> +
> +#define DEFINE_PER_CPU_PAGE_ALIGNED_USER_MAPPED(type, name)		\
> +	DEFINE_PER_CPU_SECTION(type, name, USER_MAPPED_SECTION"..page_aligned") \
> +	__aligned(PAGE_SIZE)
>  
>  /*
>   * Declaration/definition used for per-CPU variables that must be read mostly.
> -- 

With that:

Reviewed-by: Borislav Petkov <bp@suse.de>

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

^ permalink raw reply	[flat|nested] 82+ messages in thread

* [PATCH 23/43] x86/mm/kaiser: Introduce user-mapped per-cpu areas
  2017-11-24 17:23 [PATCH 00/43] x86 entry-stack and Kaiser series, 2017/11/24, v2 version Ingo Molnar
@ 2017-11-24 17:23 ` Ingo Molnar
  2017-11-26 17:41   ` Borislav Petkov
  0 siblings, 1 reply; 82+ messages in thread
From: Ingo Molnar @ 2017-11-24 17:23 UTC (permalink / raw)
  To: linux-kernel
  Cc: Dave Hansen, Andy Lutomirski, Thomas Gleixner, H . Peter Anvin,
	Peter Zijlstra, Borislav Petkov, Linus Torvalds

From: Dave Hansen <dave.hansen@linux.intel.com>

These patches are based on work from a team at Graz University of
Technology posted here: https://github.com/IAIK/KAISER

The KAISER approach keeps two copies of the page tables: one for running
in the kernel and one for running userspace.  But, there are a few
structures that are needed for switching in and out of the kernel and
a good subset of *those* are per-cpu data.

This patch creates a new kind of per-cpu data that is mapped and
can be used no matter which copy of the page tables is active.
Users of this new section will be forthcoming.

Thanks to Hugh Dickins for cleanups to this code.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: daniel.gruss@iaik.tugraz.at
Cc: hughd@google.com
Cc: keescook@google.com
Cc: linux-mm@kvack.org
Cc: luto@kernel.org
Cc: michael.schwarz@iaik.tugraz.at
Cc: moritz.lipp@iaik.tugraz.at
Cc: richard.fellner@student.tugraz.at
Link: https://lkml.kernel.org/r/20171123003444.196CB6DB@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 include/asm-generic/vmlinux.lds.h |  7 +++++++
 include/linux/percpu-defs.h       | 30 ++++++++++++++++++++++++++++++
 2 files changed, 37 insertions(+)

diff --git a/include/asm-generic/vmlinux.lds.h b/include/asm-generic/vmlinux.lds.h
index bdcd1caae092..e12168936d3f 100644
--- a/include/asm-generic/vmlinux.lds.h
+++ b/include/asm-generic/vmlinux.lds.h
@@ -826,7 +826,14 @@
  */
 #define PERCPU_INPUT(cacheline)						\
 	VMLINUX_SYMBOL(__per_cpu_start) = .;				\
+	VMLINUX_SYMBOL(__per_cpu_user_mapped_start) = .;		\
 	*(.data..percpu..first)						\
+	. = ALIGN(cacheline);						\
+	*(.data..percpu..user_mapped)					\
+	*(.data..percpu..user_mapped..shared_aligned)			\
+	. = ALIGN(PAGE_SIZE);						\
+	*(.data..percpu..user_mapped..page_aligned)			\
+	VMLINUX_SYMBOL(__per_cpu_user_mapped_end) = .;			\
 	. = ALIGN(PAGE_SIZE);						\
 	*(.data..percpu..page_aligned)					\
 	. = ALIGN(cacheline);						\
diff --git a/include/linux/percpu-defs.h b/include/linux/percpu-defs.h
index 2d2096ba1cfe..752513674295 100644
--- a/include/linux/percpu-defs.h
+++ b/include/linux/percpu-defs.h
@@ -35,6 +35,12 @@
 
 #endif
 
+#ifdef CONFIG_KAISER
+#define USER_MAPPED_SECTION "..user_mapped"
+#else
+#define USER_MAPPED_SECTION ""
+#endif
+
 /*
  * Base implementations of per-CPU variable declarations and definitions, where
  * the section in which the variable is to be placed is provided by the
@@ -115,6 +121,12 @@
 #define DEFINE_PER_CPU(type, name)					\
 	DEFINE_PER_CPU_SECTION(type, name, "")
 
+#define DECLARE_PER_CPU_USER_MAPPED(type, name)				\
+	DECLARE_PER_CPU_SECTION(type, name, USER_MAPPED_SECTION)
+
+#define DEFINE_PER_CPU_USER_MAPPED(type, name)				\
+	DEFINE_PER_CPU_SECTION(type, name, USER_MAPPED_SECTION)
+
 /*
  * Declaration/definition used for per-CPU variables that must come first in
  * the set of variables.
@@ -144,6 +156,14 @@
 	DEFINE_PER_CPU_SECTION(type, name, PER_CPU_SHARED_ALIGNED_SECTION) \
 	____cacheline_aligned_in_smp
 
+#define DECLARE_PER_CPU_SHARED_ALIGNED_USER_MAPPED(type, name)		\
+	DECLARE_PER_CPU_SECTION(type, name, USER_MAPPED_SECTION PER_CPU_SHARED_ALIGNED_SECTION) \
+	____cacheline_aligned_in_smp
+
+#define DEFINE_PER_CPU_SHARED_ALIGNED_USER_MAPPED(type, name)		\
+	DEFINE_PER_CPU_SECTION(type, name, USER_MAPPED_SECTION PER_CPU_SHARED_ALIGNED_SECTION) \
+	____cacheline_aligned_in_smp
+
 #define DECLARE_PER_CPU_ALIGNED(type, name)				\
 	DECLARE_PER_CPU_SECTION(type, name, PER_CPU_ALIGNED_SECTION)	\
 	____cacheline_aligned
@@ -162,6 +182,16 @@
 #define DEFINE_PER_CPU_PAGE_ALIGNED(type, name)				\
 	DEFINE_PER_CPU_SECTION(type, name, "..page_aligned")		\
 	__aligned(PAGE_SIZE)
+/*
+ * Declaration/definition used for per-CPU variables that must be page aligned and need to be mapped in user mode.
+ */
+#define DECLARE_PER_CPU_PAGE_ALIGNED_USER_MAPPED(type, name)		\
+	DECLARE_PER_CPU_SECTION(type, name, USER_MAPPED_SECTION"..page_aligned") \
+	__aligned(PAGE_SIZE)
+
+#define DEFINE_PER_CPU_PAGE_ALIGNED_USER_MAPPED(type, name)		\
+	DEFINE_PER_CPU_SECTION(type, name, USER_MAPPED_SECTION"..page_aligned") \
+	__aligned(PAGE_SIZE)
 
 /*
  * Declaration/definition used for per-CPU variables that must be read mostly.
-- 
2.14.1
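
As a usage illustration of the new macros (hypothetical variable and
function, not part of the series):

#include <linux/percpu-defs.h>

/*
 * With CONFIG_KAISER=y this lands in the .data..percpu..user_mapped
 * section and so stays mapped in both page-table copies; without it,
 * the section suffix is empty and this is an ordinary per-cpu variable.
 */
DEFINE_PER_CPU_USER_MAPPED(unsigned long, entry_scratch);

static void record_entry_ip(unsigned long ip)
{
	/* The usual per-cpu accessors work unchanged. */
	this_cpu_write(entry_scratch, ip);
}

Plain DEFINE_PER_CPU() users are unaffected; only data that the entry
code touches while the user copy of the page tables is active needs the
_USER_MAPPED variants.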

^ permalink raw reply related	[flat|nested] 82+ messages in thread

end of thread, other threads:[~2017-12-02  7:39 UTC | newest]

Thread overview: 82+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-11-24  9:14 [PATCH 00/43] x86 entry-stack and Kaiser series, 2017/11/24 version Ingo Molnar
2017-11-24  9:14 ` [PATCH 01/43] x86/decoder: Add new TEST instruction pattern Ingo Molnar
2017-11-24 10:38   ` Borislav Petkov
2017-12-02  7:39   ` Robert Elliott (Persistent Memory)
2017-11-24  9:14 ` [PATCH 02/43] x86/entry/64: Allocate and enable the SYSENTER stack Ingo Molnar
2017-11-24  9:14 ` [PATCH 03/43] x86/dumpstack: Add get_stack_info() support for " Ingo Molnar
2017-11-24  9:14 ` [PATCH 04/43] x86/gdt: Put per-cpu GDT remaps in ascending order Ingo Molnar
2017-11-24  9:14 ` [PATCH 05/43] x86/fixmap: Generalize the GDT fixmap mechanism Ingo Molnar
2017-11-24 11:00   ` Borislav Petkov
2017-11-24  9:14 ` [PATCH 06/43] x86/kasan/64: Teach KASAN about the cpu_entry_area Ingo Molnar
2017-11-24  9:14 ` [PATCH 07/43] x86/entry: Fix assumptions that the HW TSS is at the beginning of cpu_tss Ingo Molnar
2017-11-24  9:14 ` [PATCH 08/43] x86/dumpstack: Handle stack overflow on all stacks Ingo Molnar
2017-11-24  9:14 ` [PATCH 09/43] x86/entry: Move SYSENTER_stack to the beginning of struct tss_struct Ingo Molnar
2017-11-24 11:44   ` Borislav Petkov
2017-11-24  9:14 ` [PATCH 10/43] x86/entry: Remap the TSS into the cpu entry area Ingo Molnar
2017-11-24  9:14 ` [PATCH 11/43] x86/entry/64: Separate cpu_current_top_of_stack from TSS.sp0 Ingo Molnar
2017-11-24 14:19   ` Borislav Petkov
2017-11-24  9:14 ` [PATCH 12/43] x86/espfix/64: Stop assuming that pt_regs is on the entry stack Ingo Molnar
2017-11-24  9:14 ` [PATCH 13/43] x86/entry/64: Use a percpu trampoline stack for IDT entries Ingo Molnar
2017-11-24 11:27   ` Thomas Gleixner
2017-11-24  9:14 ` [PATCH 14/43] x86/entry/64: Return to userspace from the trampoline stack Ingo Molnar
2017-11-24 13:46   ` Thomas Gleixner
2017-11-24  9:14 ` [PATCH 15/43] x86/entry/64: Create a percpu SYSCALL entry trampoline Ingo Molnar
2017-11-24 13:52   ` Thomas Gleixner
2017-11-24  9:14 ` [PATCH 16/43] x86/irq: Remove an old outdated comment about context tracking races Ingo Molnar
2017-11-24 13:53   ` Thomas Gleixner
2017-11-24  9:14 ` [PATCH 17/43] x86/irq/64: Print the offending IP in the stack overflow warning Ingo Molnar
2017-11-24 14:22   ` Thomas Gleixner
2017-11-24  9:14 ` [PATCH 18/43] x86/entry/64: Move the IST stacks into cpu_entry_area Ingo Molnar
2017-11-24 14:23   ` Thomas Gleixner
2017-11-24  9:14 ` [PATCH 19/43] x86/entry/64: Remove the SYSENTER stack canary Ingo Molnar
2017-11-24 14:23   ` Thomas Gleixner
2017-11-24  9:14 ` [PATCH 20/43] x86/entry: Clean up SYSENTER_stack code Ingo Molnar
2017-11-24 14:24   ` Thomas Gleixner
2017-11-24  9:14 ` [PATCH 21/43] x86/mm/kaiser: Disable global pages by default with KAISER Ingo Molnar
2017-11-24  9:14 ` [PATCH 22/43] x86/mm/kaiser: Prepare assembly for entry/exit CR3 switching Ingo Molnar
2017-11-24 12:05   ` Peter Zijlstra
2017-11-24 12:17     ` Ingo Molnar
2017-11-24 12:45       ` Peter Zijlstra
2017-11-24 13:04         ` Thomas Gleixner
2017-11-24  9:14 ` [PATCH 23/43] x86/mm/kaiser: Introduce user-mapped per-cpu areas Ingo Molnar
2017-11-24  9:14 ` [PATCH 24/43] x86/mm/kaiser: Mark per-cpu data structures required for entry/exit Ingo Molnar
2017-11-24  9:14 ` [PATCH 25/43] x86/mm/kaiser: Unmap kernel from userspace page tables (core patch) Ingo Molnar
2017-11-24 12:13   ` Peter Zijlstra
2017-11-24 13:46     ` Ingo Molnar
2017-11-24 12:16   ` Peter Zijlstra
2017-11-24 16:33     ` Dave Hansen
2017-11-26 15:13       ` Ingo Molnar
2017-11-24 13:30   ` Peter Zijlstra
2017-11-26 15:15     ` Ingo Molnar
2017-11-27  8:59       ` [PATCH] x86/mm/kaiser: Use the other page_table_lock pattern Peter Zijlstra
2017-11-27  8:59       ` [PATCH] mm: Unify page_table_lock allocation pattern Peter Zijlstra
2017-11-24  9:14 ` [PATCH 26/43] x86/mm/kaiser: Allow NX poison to be set in p4d/pgd Ingo Molnar
2017-11-24  9:14 ` [PATCH 27/43] x86/mm/kaiser: Make sure static PGDs are 8k in size Ingo Molnar
2017-11-24  9:14 ` [PATCH 28/43] x86/mm/kaiser: Map CPU entry area Ingo Molnar
2017-11-24 13:43   ` Peter Zijlstra
2017-11-24  9:14 ` [PATCH 29/43] x86/mm/kaiser: Map dynamically-allocated LDTs Ingo Molnar
2017-11-24  9:14 ` [PATCH 30/43] x86/mm/kaiser: Map espfix structures Ingo Molnar
2017-11-24 13:47   ` Peter Zijlstra
2017-11-24 16:17     ` Andy Lutomirski
2017-11-27  9:14       ` Peter Zijlstra
2017-11-27 15:35         ` Peter Zijlstra
2017-11-24  9:14 ` [PATCH 31/43] x86/mm/kaiser: Map entry stack variable Ingo Molnar
2017-11-24  9:14 ` [PATCH 32/43] x86/mm/kaiser: Map virtually-addressed performance monitoring buffers Ingo Molnar
2017-11-24  9:14 ` [PATCH 33/43] x86/mm: Move CR3 construction functions Ingo Molnar
2017-11-24  9:14 ` [PATCH 34/43] x86/mm: Remove hard-coded ASID limit checks Ingo Molnar
2017-11-24  9:14 ` [PATCH 35/43] x86/mm: Put mmu-to-h/w ASID translation in one place Ingo Molnar
2017-11-24  9:14 ` [PATCH 36/43] x86/mm: Allow flushing for future ASID switches Ingo Molnar
2017-11-24  9:14 ` [PATCH 37/43] x86/mm/kaiser: Use PCID feature to make user and kernel switches faster Ingo Molnar
2017-11-24  9:14 ` [PATCH 38/43] x86/mm/kaiser: Disable native VSYSCALL Ingo Molnar
2017-11-24  9:14 ` [PATCH 39/43] x86/mm/kaiser: Add debugfs file to turn KAISER on/off at runtime Ingo Molnar
2017-11-24  9:14 ` [PATCH 40/43] x86/mm/kaiser: Add a function to check for KAISER being enabled Ingo Molnar
2017-11-24  9:14 ` [PATCH 41/43] x86/mm/kaiser: Un-poison PGDs at runtime Ingo Molnar
2017-11-24  9:14 ` [PATCH 42/43] x86/mm/kaiser: Allow KAISER to be enabled/disabled " Ingo Molnar
2017-11-24  9:14 ` [PATCH 43/43] x86/mm/kaiser: Add Kconfig Ingo Molnar
2017-11-24 13:55 ` [PATCH 00/43] x86 entry-stack and Kaiser series, 2017/11/24 version Ingo Molnar
2017-11-24 15:23 ` Thomas Gleixner
2017-11-24 17:19   ` Ingo Molnar
2017-11-24 17:23 [PATCH 00/43] x86 entry-stack and Kaiser series, 2017/11/24, v2 version Ingo Molnar
2017-11-24 17:23 ` [PATCH 23/43] x86/mm/kaiser: Introduce user-mapped per-cpu areas Ingo Molnar
2017-11-26 17:41   ` Borislav Petkov
2017-11-27  9:26     ` Ingo Molnar
2017-11-27 21:14     ` Dave Hansen
