All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 00/43] x86 entry-stack and Kaiser series, 2017/11/24, v2 version
@ 2017-11-24 17:23 Ingo Molnar
  2017-11-24 17:23 ` [PATCH 01/43] x86/decoder: Add new TEST instruction pattern Ingo Molnar
                   ` (43 more replies)
  0 siblings, 44 replies; 113+ messages in thread
From: Ingo Molnar @ 2017-11-24 17:23 UTC (permalink / raw)
  To: linux-kernel
  Cc: Dave Hansen, Andy Lutomirski, Thomas Gleixner, H . Peter Anvin,
	Peter Zijlstra, Borislav Petkov, Linus Torvalds

This is a repost of the latest entry-stack plus Kaiser bits from Andy Lutomirski
(v3 series from today) and Dave Hansen (kaiser-414-tipwip-20171123 version),
on top of latest tip:x86/urgent (12a78d43de76).

This version is pretty well tested, at least on the usual x86 tree test systems.
It has a couple of merge mistakes fixed, the biggest difference is in patch #22:

  x86/mm/kaiser: Prepare assembly for entry/exit CR3 switching

The other patches are identical or very close to what I posted earlier today.

Any remaining bugs caused by mis-merges, mis-backmerges or mis-fixes are mine.

Thanks,

    Ingo

Andy Lutomirski (19):
  x86/asm/64: Allocate and enable the SYSENTER stack
  x86/dumpstack: Add get_stack_info() support for the SYSENTER stack
  x86/gdt: Put per-cpu GDT remaps in ascending order
  x86/fixmap: Generalize the GDT fixmap mechanism
  x86/kasan/64: Teach KASAN about the cpu_entry_area
  x86/asm: Fix assumptions that the HW TSS is at the beginning of cpu_tss
  x86/dumpstack: Handle stack overflow on all stacks
  x86/asm: Move SYSENTER_stack to the beginning of struct tss_struct
  x86/asm: Remap the TSS into the cpu entry area
  x86/asm/64: Separate cpu_current_top_of_stack from TSS.sp0
  x86/espfix/64: Stop assuming that pt_regs is on the entry stack
  x86/asm/64: Use a percpu trampoline stack for IDT entries
  x86/asm/64: Return to userspace from the trampoline stack
  x86/entry/64: Create a percpu SYSCALL entry trampoline
  x86/irq: Remove an old outdated comment about context tracking races
  x86/irq/64: In the stack overflow warning, print the offending IP
  x86/entry/64: Move the IST stacks into cpu_entry_area
  x86/entry/64: Remove the SYSENTER stack canary
  x86/entry: Clean up SYSENTER_stack code

Dave Hansen (22):
  x86/mm/kaiser: Disable global pages by default with KAISER
  x86/mm/kaiser: Prepare assembly for entry/exit CR3 switching
  x86/mm/kaiser: Introduce user-mapped per-cpu areas
  x86/mm/kaiser: Mark per-cpu data structures required for entry/exit
  x86/mm/kaiser: Unmap kernel from userspace page tables (core patch)
  x86/mm/kaiser: Allow NX poison to be set in p4d/pgd
  x86/mm/kaiser: Make sure static PGDs are 8k in size
  x86/mm/kaiser: Map cpu entry area
  x86/mm/kaiser: Map dynamically-allocated LDTs
  x86/mm/kaiser: Map espfix structures
  x86/mm/kaiser: Map entry stack variable
  x86/mm: Move CR3 construction functions
  x86/mm: Remove hard-coded ASID limit checks
  x86/mm: Put mmu-to-h/w ASID translation in one place
  x86/mm/kaiser: Allow flushing for future ASID switches
  x86/mm/kaiser: Use PCID feature to make user and kernel switches faster
  x86/mm/kaiser: Disable native VSYSCALL
  x86/mm/kaiser: Add debugfs file to turn KAISER on/off at runtime
  x86/mm/kaiser: Add a function to check for KAISER being enabled
  x86/mm/kaiser: Un-poison PGDs at runtime
  x86/mm/kaiser: Allow KAISER to be enabled/disabled at runtime
  x86/mm/kaiser: Add Kconfig

Hugh Dickins (1):
  x86/mm/kaiser: Map virtually-addressed performance monitoring buffers

Masami Hiramatsu (1):
  x86/decoder: Add new TEST instruction pattern

 Documentation/x86/kaiser.txt                | 162 ++++++++
 arch/x86/Kconfig                            |   8 +
 arch/x86/boot/compressed/pagetable.c        |   6 +
 arch/x86/entry/calling.h                    |  89 ++++
 arch/x86/entry/entry_32.S                   |   6 +-
 arch/x86/entry/entry_64.S                   | 207 ++++++++--
 arch/x86/entry/entry_64_compat.S            |  31 +-
 arch/x86/events/intel/ds.c                  |  49 ++-
 arch/x86/include/asm/cpufeatures.h          |   1 +
 arch/x86/include/asm/desc.h                 |  13 +-
 arch/x86/include/asm/fixmap.h               |  58 ++-
 arch/x86/include/asm/kaiser.h               |  68 +++
 arch/x86/include/asm/mmu_context.h          |  29 +-
 arch/x86/include/asm/pgtable.h              |  19 +-
 arch/x86/include/asm/pgtable_64.h           | 146 +++++++
 arch/x86/include/asm/pgtable_types.h        |  25 +-
 arch/x86/include/asm/processor.h            |  49 ++-
 arch/x86/include/asm/stacktrace.h           |   3 +
 arch/x86/include/asm/switch_to.h            |   2 +-
 arch/x86/include/asm/thread_info.h          |   2 +-
 arch/x86/include/asm/tlbflush.h             | 208 ++++++++--
 arch/x86/include/asm/traps.h                |   1 -
 arch/x86/include/uapi/asm/processor-flags.h |   3 +-
 arch/x86/kernel/asm-offsets.c               |   7 +
 arch/x86/kernel/asm-offsets_32.c            |   5 -
 arch/x86/kernel/asm-offsets_64.c            |   1 +
 arch/x86/kernel/cpu/common.c                | 139 +++++--
 arch/x86/kernel/doublefault.c               |  36 +-
 arch/x86/kernel/dumpstack.c                 |  42 +-
 arch/x86/kernel/dumpstack_32.c              |   6 +
 arch/x86/kernel/dumpstack_64.c              |   6 +
 arch/x86/kernel/espfix_64.c                 |  27 +-
 arch/x86/kernel/head_64.S                   |  30 +-
 arch/x86/kernel/irq.c                       |  12 -
 arch/x86/kernel/irq_64.c                    |   4 +-
 arch/x86/kernel/ldt.c                       |  25 +-
 arch/x86/kernel/process.c                   |  15 +-
 arch/x86/kernel/process_64.c                |   3 +-
 arch/x86/kernel/traps.c                     |  27 +-
 arch/x86/kernel/vmlinux.lds.S               |  10 +
 arch/x86/kvm/x86.c                          |   3 +-
 arch/x86/lib/x86-opcode-map.txt             |   2 +-
 arch/x86/mm/Makefile                        |   1 +
 arch/x86/mm/init.c                          |  75 ++--
 arch/x86/mm/kaiser.c                        | 620 ++++++++++++++++++++++++++++
 arch/x86/mm/kasan_init_64.c                 |  13 +-
 arch/x86/mm/pageattr.c                      |  18 +-
 arch/x86/mm/pgtable.c                       |  16 +-
 arch/x86/mm/tlb.c                           | 105 ++++-
 arch/x86/power/cpu.c                        |  16 +-
 arch/x86/xen/mmu_pv.c                       |   2 +-
 include/asm-generic/vmlinux.lds.h           |   7 +
 include/linux/kaiser.h                      |  38 ++
 include/linux/percpu-defs.h                 |  30 ++
 init/main.c                                 |   3 +
 kernel/fork.c                               |   1 +
 security/Kconfig                            |  10 +
 57 files changed, 2243 insertions(+), 297 deletions(-)
 create mode 100644 Documentation/x86/kaiser.txt
 create mode 100644 arch/x86/include/asm/kaiser.h
 create mode 100644 arch/x86/mm/kaiser.c
 create mode 100644 include/linux/kaiser.h

-- 
2.14.1

^ permalink raw reply	[flat|nested] 113+ messages in thread

* [PATCH 01/43] x86/decoder: Add new TEST instruction pattern
  2017-11-24 17:23 [PATCH 00/43] x86 entry-stack and Kaiser series, 2017/11/24, v2 version Ingo Molnar
@ 2017-11-24 17:23 ` Ingo Molnar
  2017-11-24 17:23 ` [PATCH 02/43] x86/asm/64: Allocate and enable the SYSENTER stack Ingo Molnar
                   ` (42 subsequent siblings)
  43 siblings, 0 replies; 113+ messages in thread
From: Ingo Molnar @ 2017-11-24 17:23 UTC (permalink / raw)
  To: linux-kernel
  Cc: Dave Hansen, Andy Lutomirski, Thomas Gleixner, H . Peter Anvin,
	Peter Zijlstra, Borislav Petkov, Linus Torvalds

From: Masami Hiramatsu <mhiramat@kernel.org>

The kbuild test robot reported this build warning:

  Warning: arch/x86/tools/test_get_len found difference at <jump_table>:ffffffff8103dd2c

  Warning: ffffffff8103dd82: f6 09 d8 testb $0xd8,(%rcx)
  Warning: objdump says 3 bytes, but insn_get_length() says 2
  Warning: decoded and checked 1569014 instructions with 1 warnings

This sequence seems to be a new instruction not in the opcode map in the Intel SDM.

The instruction sequence is "F6 09 d8", means Group3(F6), MOD(00)REG(001)RM(001), and 0xd8.
Intel SDM vol2 A.4 Table A-6 said the table index in the group is "Encoding of Bits 5,4,3 of
the ModR/M Byte (bits 2,1,0 in parenthesis)"

In that table, opcodes listed by the index REG bits as:

  000         001       010 011  100        101        110         111
 TEST Ib/Iz,(undefined),NOT,NEG,MUL AL/rAX,IMUL AL/rAX,DIV AL/rAX,IDIV AL/rAX

So, it seems TEST Ib is assigned to 001.

Add the new pattern.

Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Reported-by: kbuild test robot <fengguang.wu@intel.com>
Signed-off-by: Masami Hiramatsu <mhiramat@kernel.org>
Cc: <stable@vger.kernel.org>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/lib/x86-opcode-map.txt | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/lib/x86-opcode-map.txt b/arch/x86/lib/x86-opcode-map.txt
index 12e377184ee4..c4d55919fac1 100644
--- a/arch/x86/lib/x86-opcode-map.txt
+++ b/arch/x86/lib/x86-opcode-map.txt
@@ -896,7 +896,7 @@ EndTable
 
 GrpTable: Grp3_1
 0: TEST Eb,Ib
-1:
+1: TEST Eb,Ib
 2: NOT Eb
 3: NEG Eb
 4: MUL AL,Eb
-- 
2.14.1

^ permalink raw reply	[flat|nested] 113+ messages in thread

* [PATCH 02/43] x86/asm/64: Allocate and enable the SYSENTER stack
  2017-11-24 17:23 [PATCH 00/43] x86 entry-stack and Kaiser series, 2017/11/24, v2 version Ingo Molnar
  2017-11-24 17:23 ` [PATCH 01/43] x86/decoder: Add new TEST instruction pattern Ingo Molnar
@ 2017-11-24 17:23 ` Ingo Molnar
  2017-11-24 17:23 ` [PATCH 03/43] x86/dumpstack: Add get_stack_info() support for " Ingo Molnar
                   ` (41 subsequent siblings)
  43 siblings, 0 replies; 113+ messages in thread
From: Ingo Molnar @ 2017-11-24 17:23 UTC (permalink / raw)
  To: linux-kernel
  Cc: Dave Hansen, Andy Lutomirski, Thomas Gleixner, H . Peter Anvin,
	Peter Zijlstra, Borislav Petkov, Linus Torvalds

From: Andy Lutomirski <luto@kernel.org>

This will simplify future changes that want scratch variables early in
the SYSENTER handler -- they'll be able to spill registers to the
stack.  It also lets us get rid of a SWAPGS_UNSAFE_STACK user.

This does not depend on CONFIG_IA32_EMULATION because we'll want the
stack space even without IA32 emulation.

As far as I can tell, the reason that this wasn't done from day 1 is
that we use IST for #DB and #BP, which is IMO rather nasty and causes
a lot more problems than it solves.  But, since #DB uses IST, we don't
actually need a real stack for SYSENTER (because SYSENTER with TF set
will invoke #DB on the IST stack rather than the SYSENTER stack).
I want to remove IST usage from these vectors some day, and this patch
is a prerequisite for that as well.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Cc: Borislav Petkov <bpetkov@suse.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lkml.kernel.org/r/c37d6e68a73e1b5b1203e0e95b488fa8092b3cfb.1511497875.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/entry/entry_64_compat.S | 2 +-
 arch/x86/include/asm/processor.h | 3 ---
 arch/x86/kernel/asm-offsets.c    | 5 +++++
 arch/x86/kernel/asm-offsets_32.c | 5 -----
 arch/x86/kernel/cpu/common.c     | 4 +++-
 arch/x86/kernel/process.c        | 2 --
 arch/x86/kernel/traps.c          | 3 +--
 7 files changed, 10 insertions(+), 14 deletions(-)

diff --git a/arch/x86/entry/entry_64_compat.S b/arch/x86/entry/entry_64_compat.S
index 568e130d932c..dcc6987f9bae 100644
--- a/arch/x86/entry/entry_64_compat.S
+++ b/arch/x86/entry/entry_64_compat.S
@@ -48,7 +48,7 @@
  */
 ENTRY(entry_SYSENTER_compat)
 	/* Interrupts are off on entry. */
-	SWAPGS_UNSAFE_STACK
+	SWAPGS
 	movq	PER_CPU_VAR(cpu_current_top_of_stack), %rsp
 
 	/*
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index cc16fa882e3e..504a3bb4d5f0 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -340,14 +340,11 @@ struct tss_struct {
 	 */
 	unsigned long		io_bitmap[IO_BITMAP_LONGS + 1];
 
-#ifdef CONFIG_X86_32
 	/*
 	 * Space for the temporary SYSENTER stack.
 	 */
 	unsigned long		SYSENTER_stack_canary;
 	unsigned long		SYSENTER_stack[64];
-#endif
-
 } ____cacheline_aligned;
 
 DECLARE_PER_CPU_SHARED_ALIGNED(struct tss_struct, cpu_tss);
diff --git a/arch/x86/kernel/asm-offsets.c b/arch/x86/kernel/asm-offsets.c
index 8ea78275480d..b275863128eb 100644
--- a/arch/x86/kernel/asm-offsets.c
+++ b/arch/x86/kernel/asm-offsets.c
@@ -93,4 +93,9 @@ void common(void) {
 
 	BLANK();
 	DEFINE(PTREGS_SIZE, sizeof(struct pt_regs));
+
+	/* Offset from cpu_tss to SYSENTER_stack */
+	OFFSET(CPU_TSS_SYSENTER_stack, tss_struct, SYSENTER_stack);
+	/* Size of SYSENTER_stack */
+	DEFINE(SIZEOF_SYSENTER_stack, sizeof(((struct tss_struct *)0)->SYSENTER_stack));
 }
diff --git a/arch/x86/kernel/asm-offsets_32.c b/arch/x86/kernel/asm-offsets_32.c
index dedf428b20b6..52ce4ea16e53 100644
--- a/arch/x86/kernel/asm-offsets_32.c
+++ b/arch/x86/kernel/asm-offsets_32.c
@@ -50,11 +50,6 @@ void foo(void)
 	DEFINE(TSS_sysenter_sp0, offsetof(struct tss_struct, x86_tss.sp0) -
 	       offsetofend(struct tss_struct, SYSENTER_stack));
 
-	/* Offset from cpu_tss to SYSENTER_stack */
-	OFFSET(CPU_TSS_SYSENTER_stack, tss_struct, SYSENTER_stack);
-	/* Size of SYSENTER_stack */
-	DEFINE(SIZEOF_SYSENTER_stack, sizeof(((struct tss_struct *)0)->SYSENTER_stack));
-
 #ifdef CONFIG_CC_STACKPROTECTOR
 	BLANK();
 	OFFSET(stack_canary_offset, stack_canary, canary);
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index fa998ca8aa5a..ccb5f66c4e5b 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -1386,7 +1386,9 @@ void syscall_init(void)
 	 * AMD doesn't allow SYSENTER in long mode (either 32- or 64-bit).
 	 */
 	wrmsrl_safe(MSR_IA32_SYSENTER_CS, (u64)__KERNEL_CS);
-	wrmsrl_safe(MSR_IA32_SYSENTER_ESP, 0ULL);
+	wrmsrl_safe(MSR_IA32_SYSENTER_ESP,
+		    (unsigned long)this_cpu_ptr(&cpu_tss) +
+		    offsetofend(struct tss_struct, SYSENTER_stack));
 	wrmsrl_safe(MSR_IA32_SYSENTER_EIP, (u64)entry_SYSENTER_compat);
 #else
 	wrmsrl(MSR_CSTAR, (unsigned long)ignore_sysret);
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 97fb3e5737f5..35d674157fda 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -71,9 +71,7 @@ __visible DEFINE_PER_CPU_SHARED_ALIGNED(struct tss_struct, cpu_tss) = {
 	  */
 	.io_bitmap		= { [0 ... IO_BITMAP_LONGS] = ~0 },
 #endif
-#ifdef CONFIG_X86_32
 	.SYSENTER_stack_canary	= STACK_END_MAGIC,
-#endif
 };
 EXPORT_PER_CPU_SYMBOL(cpu_tss);
 
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index b7b0f74a2150..2008dd0f8ccb 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -800,14 +800,13 @@ dotraplinkage void do_debug(struct pt_regs *regs, long error_code)
 	debug_stack_usage_dec();
 
 exit:
-#if defined(CONFIG_X86_32)
 	/*
 	 * This is the most likely code path that involves non-trivial use
 	 * of the SYSENTER stack.  Check that we haven't overrun it.
 	 */
 	WARN(this_cpu_read(cpu_tss.SYSENTER_stack_canary) != STACK_END_MAGIC,
 	     "Overran or corrupted SYSENTER stack\n");
-#endif
+
 	ist_exit(regs);
 }
 NOKPROBE_SYMBOL(do_debug);
-- 
2.14.1

^ permalink raw reply	[flat|nested] 113+ messages in thread

* [PATCH 03/43] x86/dumpstack: Add get_stack_info() support for the SYSENTER stack
  2017-11-24 17:23 [PATCH 00/43] x86 entry-stack and Kaiser series, 2017/11/24, v2 version Ingo Molnar
  2017-11-24 17:23 ` [PATCH 01/43] x86/decoder: Add new TEST instruction pattern Ingo Molnar
  2017-11-24 17:23 ` [PATCH 02/43] x86/asm/64: Allocate and enable the SYSENTER stack Ingo Molnar
@ 2017-11-24 17:23 ` Ingo Molnar
  2017-11-24 17:23 ` [PATCH 04/43] x86/gdt: Put per-cpu GDT remaps in ascending order Ingo Molnar
                   ` (40 subsequent siblings)
  43 siblings, 0 replies; 113+ messages in thread
From: Ingo Molnar @ 2017-11-24 17:23 UTC (permalink / raw)
  To: linux-kernel
  Cc: Dave Hansen, Andy Lutomirski, Thomas Gleixner, H . Peter Anvin,
	Peter Zijlstra, Borislav Petkov, Linus Torvalds

From: Andy Lutomirski <luto@kernel.org>

get_stack_info() doesn't currently know about the SYSENTER stack, so
unwinding will fail if we entered the kernel on the SYSENTER stack
and haven't fully switched off.  Teach get_stack_info() about the
SYSENTER stack.

With future patches applied that run part of the entry code on the
SYSENTER stack and introduce an intentional BUG(), I would get:

PANIC: double fault, error_code: 0x0
...
RIP: 0010:do_error_trap+0x33/0x1c0
...
Call Trace:
Code: ...

With this patch, I get:

PANIC: double fault, error_code: 0x0
...
Call Trace:
 <SYSENTER>
 ? async_page_fault+0x36/0x60
 ? invalid_op+0x22/0x40
 ? async_page_fault+0x36/0x60
 ? sync_regs+0x3c/0x40
 ? sync_regs+0x2e/0x40
 ? error_entry+0x6c/0xd0
 ? async_page_fault+0x36/0x60
 </SYSENTER>
Code: ...

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Cc: Borislav Petkov <bpetkov@suse.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lkml.kernel.org/r/c32ce8b363e27fa9b4a4773297d5b4b0f4b39e94.1511497875.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/include/asm/stacktrace.h |  3 +++
 arch/x86/kernel/dumpstack.c       | 19 +++++++++++++++++++
 arch/x86/kernel/dumpstack_32.c    |  6 ++++++
 arch/x86/kernel/dumpstack_64.c    |  6 ++++++
 4 files changed, 34 insertions(+)

diff --git a/arch/x86/include/asm/stacktrace.h b/arch/x86/include/asm/stacktrace.h
index 8da111b3c342..f8062bfd43a0 100644
--- a/arch/x86/include/asm/stacktrace.h
+++ b/arch/x86/include/asm/stacktrace.h
@@ -16,6 +16,7 @@ enum stack_type {
 	STACK_TYPE_TASK,
 	STACK_TYPE_IRQ,
 	STACK_TYPE_SOFTIRQ,
+	STACK_TYPE_SYSENTER,
 	STACK_TYPE_EXCEPTION,
 	STACK_TYPE_EXCEPTION_LAST = STACK_TYPE_EXCEPTION + N_EXCEPTION_STACKS-1,
 };
@@ -28,6 +29,8 @@ struct stack_info {
 bool in_task_stack(unsigned long *stack, struct task_struct *task,
 		   struct stack_info *info);
 
+bool in_sysenter_stack(unsigned long *stack, struct stack_info *info);
+
 int get_stack_info(unsigned long *stack, struct task_struct *task,
 		   struct stack_info *info, unsigned long *visit_mask);
 
diff --git a/arch/x86/kernel/dumpstack.c b/arch/x86/kernel/dumpstack.c
index f13b4c00a5de..5e7d10e8ca25 100644
--- a/arch/x86/kernel/dumpstack.c
+++ b/arch/x86/kernel/dumpstack.c
@@ -43,6 +43,25 @@ bool in_task_stack(unsigned long *stack, struct task_struct *task,
 	return true;
 }
 
+bool in_sysenter_stack(unsigned long *stack, struct stack_info *info)
+{
+	struct tss_struct *tss = this_cpu_ptr(&cpu_tss);
+
+	/* Treat the canary as part of the stack for unwinding purposes. */
+	void *begin = &tss->SYSENTER_stack_canary;
+	void *end = (void *)&tss->SYSENTER_stack + sizeof(tss->SYSENTER_stack);
+
+	if ((void *)stack < begin || (void *)stack >= end)
+		return false;
+
+	info->type	= STACK_TYPE_SYSENTER;
+	info->begin	= begin;
+	info->end	= end;
+	info->next_sp	= NULL;
+
+	return true;
+}
+
 static void printk_stack_address(unsigned long address, int reliable,
 				 char *log_lvl)
 {
diff --git a/arch/x86/kernel/dumpstack_32.c b/arch/x86/kernel/dumpstack_32.c
index daefae83a3aa..5ff13a6b3680 100644
--- a/arch/x86/kernel/dumpstack_32.c
+++ b/arch/x86/kernel/dumpstack_32.c
@@ -26,6 +26,9 @@ const char *stack_type_name(enum stack_type type)
 	if (type == STACK_TYPE_SOFTIRQ)
 		return "SOFTIRQ";
 
+	if (type == STACK_TYPE_SYSENTER)
+		return "SYSENTER";
+
 	return NULL;
 }
 
@@ -93,6 +96,9 @@ int get_stack_info(unsigned long *stack, struct task_struct *task,
 	if (task != current)
 		goto unknown;
 
+	if (in_sysenter_stack(stack, info))
+		goto recursion_check;
+
 	if (in_hardirq_stack(stack, info))
 		goto recursion_check;
 
diff --git a/arch/x86/kernel/dumpstack_64.c b/arch/x86/kernel/dumpstack_64.c
index 88ce2ffdb110..abc828f8c297 100644
--- a/arch/x86/kernel/dumpstack_64.c
+++ b/arch/x86/kernel/dumpstack_64.c
@@ -37,6 +37,9 @@ const char *stack_type_name(enum stack_type type)
 	if (type == STACK_TYPE_IRQ)
 		return "IRQ";
 
+	if (type == STACK_TYPE_SYSENTER)
+		return "SYSENTER";
+
 	if (type >= STACK_TYPE_EXCEPTION && type <= STACK_TYPE_EXCEPTION_LAST)
 		return exception_stack_names[type - STACK_TYPE_EXCEPTION];
 
@@ -115,6 +118,9 @@ int get_stack_info(unsigned long *stack, struct task_struct *task,
 	if (in_irq_stack(stack, info))
 		goto recursion_check;
 
+	if (in_sysenter_stack(stack, info))
+		goto recursion_check;
+
 	goto unknown;
 
 recursion_check:
-- 
2.14.1

^ permalink raw reply	[flat|nested] 113+ messages in thread

* [PATCH 04/43] x86/gdt: Put per-cpu GDT remaps in ascending order
  2017-11-24 17:23 [PATCH 00/43] x86 entry-stack and Kaiser series, 2017/11/24, v2 version Ingo Molnar
                   ` (2 preceding siblings ...)
  2017-11-24 17:23 ` [PATCH 03/43] x86/dumpstack: Add get_stack_info() support for " Ingo Molnar
@ 2017-11-24 17:23 ` Ingo Molnar
  2017-11-24 17:23 ` [PATCH 05/43] x86/fixmap: Generalize the GDT fixmap mechanism Ingo Molnar
                   ` (39 subsequent siblings)
  43 siblings, 0 replies; 113+ messages in thread
From: Ingo Molnar @ 2017-11-24 17:23 UTC (permalink / raw)
  To: linux-kernel
  Cc: Dave Hansen, Andy Lutomirski, Thomas Gleixner, H . Peter Anvin,
	Peter Zijlstra, Borislav Petkov, Linus Torvalds

From: Andy Lutomirski <luto@kernel.org>

We currently have CPU 0's GDT at the top of the GDT range and
higher-numbered CPUs at lower addresses.  This happens because the
fixmap is upside down (index 0 is the top of the fixmap).

Flip it so that GDTs are in ascending order by virtual address.
This will simplify a future patch that will generalize the GDT
remap to contain multiple pages.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Borislav Petkov <bpetkov@suse.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lkml.kernel.org/r/3966a6edf6fd45deca4cf52a9b9276402499dda9.1511497875.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/include/asm/desc.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/desc.h b/arch/x86/include/asm/desc.h
index 4011cb03ef08..95cd95eb7285 100644
--- a/arch/x86/include/asm/desc.h
+++ b/arch/x86/include/asm/desc.h
@@ -63,7 +63,7 @@ static inline struct desc_struct *get_current_gdt_rw(void)
 /* Get the fixmap index for a specific processor */
 static inline unsigned int get_cpu_gdt_ro_index(int cpu)
 {
-	return FIX_GDT_REMAP_BEGIN + cpu;
+	return FIX_GDT_REMAP_END - cpu;
 }
 
 /* Provide the fixmap address of the remapped GDT */
-- 
2.14.1

^ permalink raw reply	[flat|nested] 113+ messages in thread

* [PATCH 05/43] x86/fixmap: Generalize the GDT fixmap mechanism
  2017-11-24 17:23 [PATCH 00/43] x86 entry-stack and Kaiser series, 2017/11/24, v2 version Ingo Molnar
                   ` (3 preceding siblings ...)
  2017-11-24 17:23 ` [PATCH 04/43] x86/gdt: Put per-cpu GDT remaps in ascending order Ingo Molnar
@ 2017-11-24 17:23 ` Ingo Molnar
  2017-11-24 17:23 ` [PATCH 06/43] x86/kasan/64: Teach KASAN about the cpu_entry_area Ingo Molnar
                   ` (38 subsequent siblings)
  43 siblings, 0 replies; 113+ messages in thread
From: Ingo Molnar @ 2017-11-24 17:23 UTC (permalink / raw)
  To: linux-kernel
  Cc: Dave Hansen, Andy Lutomirski, Thomas Gleixner, H . Peter Anvin,
	Peter Zijlstra, Borislav Petkov, Linus Torvalds

From: Andy Lutomirski <luto@kernel.org>

Currently, the GDT is an ad-hoc array of pages, one per CPU, in the
fixmap.  Generalize it to be an array of a new struct cpu_entry_area
so that we can cleanly add new things to it.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Borislav Petkov <bpetkov@suse.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lkml.kernel.org/r/22571d77ba1f3c714df9fa37db9a58218bc17597.1511497875.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/include/asm/desc.h   |  9 +--------
 arch/x86/include/asm/fixmap.h | 34 ++++++++++++++++++++++++++++++++--
 arch/x86/kernel/cpu/common.c  | 14 +++++++-------
 arch/x86/xen/mmu_pv.c         |  2 +-
 4 files changed, 41 insertions(+), 18 deletions(-)

diff --git a/arch/x86/include/asm/desc.h b/arch/x86/include/asm/desc.h
index 95cd95eb7285..194ffab00ebe 100644
--- a/arch/x86/include/asm/desc.h
+++ b/arch/x86/include/asm/desc.h
@@ -60,17 +60,10 @@ static inline struct desc_struct *get_current_gdt_rw(void)
 	return this_cpu_ptr(&gdt_page)->gdt;
 }
 
-/* Get the fixmap index for a specific processor */
-static inline unsigned int get_cpu_gdt_ro_index(int cpu)
-{
-	return FIX_GDT_REMAP_END - cpu;
-}
-
 /* Provide the fixmap address of the remapped GDT */
 static inline struct desc_struct *get_cpu_gdt_ro(int cpu)
 {
-	unsigned int idx = get_cpu_gdt_ro_index(cpu);
-	return (struct desc_struct *)__fix_to_virt(idx);
+	return (struct desc_struct *)&get_cpu_entry_area(cpu)->gdt;
 }
 
 /* Provide the current read-only GDT */
diff --git a/arch/x86/include/asm/fixmap.h b/arch/x86/include/asm/fixmap.h
index dcd9fb55e679..0f4c92f02968 100644
--- a/arch/x86/include/asm/fixmap.h
+++ b/arch/x86/include/asm/fixmap.h
@@ -44,6 +44,16 @@ extern unsigned long __FIXADDR_TOP;
 			 PAGE_SIZE)
 #endif
 
+/*
+ * cpu_entry_area is a percpu region in the fixmap that contains things
+ * needed by the CPU and early entry/exit code.  Real types aren't used
+ * for all fields here to avoid circular header dependencies.
+ */
+struct cpu_entry_area {
+	char gdt[PAGE_SIZE];
+};
+
+#define CPU_ENTRY_AREA_PAGES (sizeof(struct cpu_entry_area) / PAGE_SIZE)
 
 /*
  * Here we define all the compile-time 'special' virtual
@@ -101,8 +111,8 @@ enum fixed_addresses {
 	FIX_LNW_VRTC,
 #endif
 	/* Fixmap entries to remap the GDTs, one per processor. */
-	FIX_GDT_REMAP_BEGIN,
-	FIX_GDT_REMAP_END = FIX_GDT_REMAP_BEGIN + NR_CPUS - 1,
+	FIX_CPU_ENTRY_AREA_TOP,
+	FIX_CPU_ENTRY_AREA_BOTTOM = FIX_CPU_ENTRY_AREA_TOP + (CPU_ENTRY_AREA_PAGES * NR_CPUS) - 1,
 
 	__end_of_permanent_fixed_addresses,
 
@@ -185,5 +195,25 @@ void __init *early_memremap_decrypted_wp(resource_size_t phys_addr,
 void __early_set_fixmap(enum fixed_addresses idx,
 			phys_addr_t phys, pgprot_t flags);
 
+static inline unsigned int __get_cpu_entry_area_page_index(int cpu, int page)
+{
+	BUILD_BUG_ON(sizeof(struct cpu_entry_area) % PAGE_SIZE != 0);
+
+	return FIX_CPU_ENTRY_AREA_BOTTOM - cpu*CPU_ENTRY_AREA_PAGES - page;
+}
+
+#define __get_cpu_entry_area_offset_index(cpu, offset) ({		\
+	BUILD_BUG_ON(offset % PAGE_SIZE != 0);				\
+	__get_cpu_entry_area_page_index(cpu, offset / PAGE_SIZE);	\
+	})
+
+#define get_cpu_entry_area_index(cpu, field)				\
+	__get_cpu_entry_area_offset_index((cpu), offsetof(struct cpu_entry_area, field))
+
+static inline struct cpu_entry_area *get_cpu_entry_area(int cpu)
+{
+	return (struct cpu_entry_area *)__fix_to_virt(__get_cpu_entry_area_page_index(cpu, 0));
+}
+
 #endif /* !__ASSEMBLY__ */
 #endif /* _ASM_X86_FIXMAP_H */
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index ccb5f66c4e5b..c0fb3eb37ee0 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -490,12 +490,12 @@ void load_percpu_segment(int cpu)
 	load_stack_canary_segment();
 }
 
-/* Setup the fixmap mapping only once per-processor */
-static inline void setup_fixmap_gdt(int cpu)
+/* Setup the fixmap mappings only once per-processor */
+static inline void setup_cpu_entry_area(int cpu)
 {
 #ifdef CONFIG_X86_64
 	/* On 64-bit systems, we use a read-only fixmap GDT. */
-	pgprot_t prot = PAGE_KERNEL_RO;
+	pgprot_t gdt_prot = PAGE_KERNEL_RO;
 #else
 	/*
 	 * On native 32-bit systems, the GDT cannot be read-only because
@@ -506,11 +506,11 @@ static inline void setup_fixmap_gdt(int cpu)
 	 * On Xen PV, the GDT must be read-only because the hypervisor requires
 	 * it.
 	 */
-	pgprot_t prot = boot_cpu_has(X86_FEATURE_XENPV) ?
+	pgprot_t gdt_prot = boot_cpu_has(X86_FEATURE_XENPV) ?
 		PAGE_KERNEL_RO : PAGE_KERNEL;
 #endif
 
-	__set_fixmap(get_cpu_gdt_ro_index(cpu), get_cpu_gdt_paddr(cpu), prot);
+	__set_fixmap(get_cpu_entry_area_index(cpu, gdt), get_cpu_gdt_paddr(cpu), gdt_prot);
 }
 
 /* Load the original GDT from the per-cpu structure */
@@ -1614,7 +1614,7 @@ void cpu_init(void)
 	if (is_uv_system())
 		uv_cpu_init();
 
-	setup_fixmap_gdt(cpu);
+	setup_cpu_entry_area(cpu);
 	load_fixmap_gdt(cpu);
 }
 
@@ -1676,7 +1676,7 @@ void cpu_init(void)
 
 	fpu__init_cpu();
 
-	setup_fixmap_gdt(cpu);
+	setup_cpu_entry_area(cpu);
 	load_fixmap_gdt(cpu);
 }
 #endif
diff --git a/arch/x86/xen/mmu_pv.c b/arch/x86/xen/mmu_pv.c
index 2ccdaba31a07..c2454237fa67 100644
--- a/arch/x86/xen/mmu_pv.c
+++ b/arch/x86/xen/mmu_pv.c
@@ -2272,7 +2272,7 @@ static void xen_set_fixmap(unsigned idx, phys_addr_t phys, pgprot_t prot)
 #endif
 	case FIX_TEXT_POKE0:
 	case FIX_TEXT_POKE1:
-	case FIX_GDT_REMAP_BEGIN ... FIX_GDT_REMAP_END:
+	case FIX_CPU_ENTRY_AREA_TOP ... FIX_CPU_ENTRY_AREA_BOTTOM:
 		/* All local page mappings */
 		pte = pfn_pte(phys, prot);
 		break;
-- 
2.14.1

^ permalink raw reply	[flat|nested] 113+ messages in thread

* [PATCH 06/43] x86/kasan/64: Teach KASAN about the cpu_entry_area
  2017-11-24 17:23 [PATCH 00/43] x86 entry-stack and Kaiser series, 2017/11/24, v2 version Ingo Molnar
                   ` (4 preceding siblings ...)
  2017-11-24 17:23 ` [PATCH 05/43] x86/fixmap: Generalize the GDT fixmap mechanism Ingo Molnar
@ 2017-11-24 17:23 ` Ingo Molnar
  2017-11-24 17:23 ` [PATCH 07/43] x86/asm: Fix assumptions that the HW TSS is at the beginning of cpu_tss Ingo Molnar
                   ` (37 subsequent siblings)
  43 siblings, 0 replies; 113+ messages in thread
From: Ingo Molnar @ 2017-11-24 17:23 UTC (permalink / raw)
  To: linux-kernel
  Cc: Dave Hansen, Andy Lutomirski, Thomas Gleixner, H . Peter Anvin,
	Peter Zijlstra, Borislav Petkov, Linus Torvalds

From: Andy Lutomirski <luto@kernel.org>

The cpu_entry_area will contain stacks.  Make sure that KASAN has
appropriate shadow mappings for them.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Alexander Potapenko <glider@google.com>
Cc: Andrey Ryabinin <aryabinin@virtuozzo.com>
Cc: Borislav Petkov <bpetkov@suse.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: kasan-dev@googlegroups.com
Link: https://lkml.kernel.org/r/8407adf9126440d6467dade88fdb3e3b75fc1019.1511497875.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/mm/kasan_init_64.c | 13 ++++++++++++-
 1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/arch/x86/mm/kasan_init_64.c b/arch/x86/mm/kasan_init_64.c
index 99dfed6dfef8..54561dce742e 100644
--- a/arch/x86/mm/kasan_init_64.c
+++ b/arch/x86/mm/kasan_init_64.c
@@ -277,6 +277,7 @@ void __init kasan_early_init(void)
 void __init kasan_init(void)
 {
 	int i;
+	void *cpu_entry_area_begin, *cpu_entry_area_end;
 
 #ifdef CONFIG_KASAN_INLINE
 	register_die_notifier(&kasan_die_notifier);
@@ -329,8 +330,18 @@ void __init kasan_init(void)
 			      (unsigned long)kasan_mem_to_shadow(_end),
 			      early_pfn_to_nid(__pa(_stext)));
 
+	cpu_entry_area_begin = (void *)(__fix_to_virt(FIX_CPU_ENTRY_AREA_BOTTOM));
+	cpu_entry_area_end = (void *)(__fix_to_virt(FIX_CPU_ENTRY_AREA_TOP) + PAGE_SIZE);
+
 	kasan_populate_zero_shadow(kasan_mem_to_shadow((void *)MODULES_END),
-			(void *)KASAN_SHADOW_END);
+				   kasan_mem_to_shadow(cpu_entry_area_begin));
+
+	kasan_populate_shadow((unsigned long)kasan_mem_to_shadow(cpu_entry_area_begin),
+			      (unsigned long)kasan_mem_to_shadow(cpu_entry_area_end),
+		0);
+
+	kasan_populate_zero_shadow(kasan_mem_to_shadow(cpu_entry_area_end),
+				   (void *)KASAN_SHADOW_END);
 
 	load_cr3(init_top_pgt);
 	__flush_tlb_all();
-- 
2.14.1

^ permalink raw reply	[flat|nested] 113+ messages in thread

* [PATCH 07/43] x86/asm: Fix assumptions that the HW TSS is at the beginning of cpu_tss
  2017-11-24 17:23 [PATCH 00/43] x86 entry-stack and Kaiser series, 2017/11/24, v2 version Ingo Molnar
                   ` (5 preceding siblings ...)
  2017-11-24 17:23 ` [PATCH 06/43] x86/kasan/64: Teach KASAN about the cpu_entry_area Ingo Molnar
@ 2017-11-24 17:23 ` Ingo Molnar
  2017-11-24 17:23 ` [PATCH 08/43] x86/dumpstack: Handle stack overflow on all stacks Ingo Molnar
                   ` (36 subsequent siblings)
  43 siblings, 0 replies; 113+ messages in thread
From: Ingo Molnar @ 2017-11-24 17:23 UTC (permalink / raw)
  To: linux-kernel
  Cc: Dave Hansen, Andy Lutomirski, Thomas Gleixner, H . Peter Anvin,
	Peter Zijlstra, Borislav Petkov, Linus Torvalds

From: Andy Lutomirski <luto@kernel.org>

A future patch will move SYSENTER_stack to the beginning of cpu_tss
to help detect overflow.  Before this can happen, fix several code
paths that hardcode assumptions about the old layout

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Borislav Petkov <bpetkov@suse.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lkml.kernel.org/r/d40a2c5ae4539d64090849a374f3169ec492f4e2.1511497875.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/include/asm/desc.h      |  2 +-
 arch/x86/include/asm/processor.h |  4 ++--
 arch/x86/kernel/cpu/common.c     |  8 ++++----
 arch/x86/kernel/doublefault.c    | 36 +++++++++++++++++-------------------
 arch/x86/power/cpu.c             | 13 +++++++------
 5 files changed, 31 insertions(+), 32 deletions(-)

diff --git a/arch/x86/include/asm/desc.h b/arch/x86/include/asm/desc.h
index 194ffab00ebe..aab4fe9f49f8 100644
--- a/arch/x86/include/asm/desc.h
+++ b/arch/x86/include/asm/desc.h
@@ -178,7 +178,7 @@ static inline void set_tssldt_descriptor(void *d, unsigned long addr,
 #endif
 }
 
-static inline void __set_tss_desc(unsigned cpu, unsigned int entry, void *addr)
+static inline void __set_tss_desc(unsigned cpu, unsigned int entry, struct x86_hw_tss *addr)
 {
 	struct desc_struct *d = get_cpu_gdt_rw(cpu);
 	tss_desc tss;
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 504a3bb4d5f0..c24456429c7d 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -163,7 +163,7 @@ enum cpuid_regs_idx {
 extern struct cpuinfo_x86	boot_cpu_data;
 extern struct cpuinfo_x86	new_cpu_data;
 
-extern struct tss_struct	doublefault_tss;
+extern struct x86_hw_tss	doublefault_tss;
 extern __u32			cpu_caps_cleared[NCAPINTS];
 extern __u32			cpu_caps_set[NCAPINTS];
 
@@ -323,7 +323,7 @@ struct x86_hw_tss {
 #define IO_BITMAP_BITS			65536
 #define IO_BITMAP_BYTES			(IO_BITMAP_BITS/8)
 #define IO_BITMAP_LONGS			(IO_BITMAP_BYTES/sizeof(long))
-#define IO_BITMAP_OFFSET		offsetof(struct tss_struct, io_bitmap)
+#define IO_BITMAP_OFFSET		(offsetof(struct tss_struct, io_bitmap) - offsetof(struct tss_struct, x86_tss))
 #define INVALID_IO_BITMAP_OFFSET	0x8000
 
 struct tss_struct {
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index c0fb3eb37ee0..62cdc10a7d94 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -1582,7 +1582,7 @@ void cpu_init(void)
 		}
 	}
 
-	t->x86_tss.io_bitmap_base = offsetof(struct tss_struct, io_bitmap);
+	t->x86_tss.io_bitmap_base = IO_BITMAP_OFFSET;
 
 	/*
 	 * <= is required because the CPU will access up to
@@ -1601,7 +1601,7 @@ void cpu_init(void)
 	 * Initialize the TSS.  Don't bother initializing sp0, as the initial
 	 * task never enters user mode.
 	 */
-	set_tss_desc(cpu, t);
+	set_tss_desc(cpu, &t->x86_tss);
 	load_TR_desc();
 
 	load_mm_ldt(&init_mm);
@@ -1659,12 +1659,12 @@ void cpu_init(void)
 	 * Initialize the TSS.  Don't bother initializing sp0, as the initial
 	 * task never enters user mode.
 	 */
-	set_tss_desc(cpu, t);
+	set_tss_desc(cpu, &t->x86_tss);
 	load_TR_desc();
 
 	load_mm_ldt(&init_mm);
 
-	t->x86_tss.io_bitmap_base = offsetof(struct tss_struct, io_bitmap);
+	t->x86_tss.io_bitmap_base = IO_BITMAP_OFFSET;
 
 #ifdef CONFIG_DOUBLEFAULT
 	/* Set up doublefault TSS pointer in the GDT */
diff --git a/arch/x86/kernel/doublefault.c b/arch/x86/kernel/doublefault.c
index 0e662c55ae90..0b8cedb20d6d 100644
--- a/arch/x86/kernel/doublefault.c
+++ b/arch/x86/kernel/doublefault.c
@@ -50,25 +50,23 @@ static void doublefault_fn(void)
 		cpu_relax();
 }
 
-struct tss_struct doublefault_tss __cacheline_aligned = {
-	.x86_tss = {
-		.sp0		= STACK_START,
-		.ss0		= __KERNEL_DS,
-		.ldt		= 0,
-		.io_bitmap_base	= INVALID_IO_BITMAP_OFFSET,
-
-		.ip		= (unsigned long) doublefault_fn,
-		/* 0x2 bit is always set */
-		.flags		= X86_EFLAGS_SF | 0x2,
-		.sp		= STACK_START,
-		.es		= __USER_DS,
-		.cs		= __KERNEL_CS,
-		.ss		= __KERNEL_DS,
-		.ds		= __USER_DS,
-		.fs		= __KERNEL_PERCPU,
-
-		.__cr3		= __pa_nodebug(swapper_pg_dir),
-	}
+struct x86_hw_tss doublefault_tss __cacheline_aligned = {
+	.sp0		= STACK_START,
+	.ss0		= __KERNEL_DS,
+	.ldt		= 0,
+	.io_bitmap_base	= INVALID_IO_BITMAP_OFFSET,
+
+	.ip		= (unsigned long) doublefault_fn,
+	/* 0x2 bit is always set */
+	.flags		= X86_EFLAGS_SF | 0x2,
+	.sp		= STACK_START,
+	.es		= __USER_DS,
+	.cs		= __KERNEL_CS,
+	.ss		= __KERNEL_DS,
+	.ds		= __USER_DS,
+	.fs		= __KERNEL_PERCPU,
+
+	.__cr3		= __pa_nodebug(swapper_pg_dir),
 };
 
 /* dummy for do_double_fault() call */
diff --git a/arch/x86/power/cpu.c b/arch/x86/power/cpu.c
index 84fcfde53f8f..50593e138281 100644
--- a/arch/x86/power/cpu.c
+++ b/arch/x86/power/cpu.c
@@ -165,12 +165,13 @@ static void fix_processor_context(void)
 	struct desc_struct *desc = get_cpu_gdt_rw(cpu);
 	tss_desc tss;
 #endif
-	set_tss_desc(cpu, t);	/*
-				 * This just modifies memory; should not be
-				 * necessary. But... This is necessary, because
-				 * 386 hardware has concept of busy TSS or some
-				 * similar stupidity.
-				 */
+
+	/*
+	 * This just modifies memory; should not be necessary. But... This is
+	 * necessary, because 386 hardware has concept of busy TSS or some
+	 * similar stupidity.
+	 */
+	set_tss_desc(cpu, &t->x86_tss);
 
 #ifdef CONFIG_X86_64
 	memcpy(&tss, &desc[GDT_ENTRY_TSS], sizeof(tss_desc));
-- 
2.14.1

^ permalink raw reply	[flat|nested] 113+ messages in thread

* [PATCH 08/43] x86/dumpstack: Handle stack overflow on all stacks
  2017-11-24 17:23 [PATCH 00/43] x86 entry-stack and Kaiser series, 2017/11/24, v2 version Ingo Molnar
                   ` (6 preceding siblings ...)
  2017-11-24 17:23 ` [PATCH 07/43] x86/asm: Fix assumptions that the HW TSS is at the beginning of cpu_tss Ingo Molnar
@ 2017-11-24 17:23 ` Ingo Molnar
  2017-11-24 17:23 ` [PATCH 09/43] x86/asm: Move SYSENTER_stack to the beginning of struct tss_struct Ingo Molnar
                   ` (35 subsequent siblings)
  43 siblings, 0 replies; 113+ messages in thread
From: Ingo Molnar @ 2017-11-24 17:23 UTC (permalink / raw)
  To: linux-kernel
  Cc: Dave Hansen, Andy Lutomirski, Thomas Gleixner, H . Peter Anvin,
	Peter Zijlstra, Borislav Petkov, Linus Torvalds

From: Andy Lutomirski <luto@kernel.org>

We currently special-case stack overflow on the task stack.  We're
going to start putting special stacks in the fixmap with a custom
layout, so they'll have guard pages, too.  Teach the unwinder to be
able to unwind an overflow of any of the stacks.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Cc: Borislav Petkov <bpetkov@suse.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lkml.kernel.org/r/5454bb325cb30a70457a47b50f22317be65eba7d.1511497875.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/kernel/dumpstack.c | 24 ++++++++++++++----------
 1 file changed, 14 insertions(+), 10 deletions(-)

diff --git a/arch/x86/kernel/dumpstack.c b/arch/x86/kernel/dumpstack.c
index 5e7d10e8ca25..a8aa70c05489 100644
--- a/arch/x86/kernel/dumpstack.c
+++ b/arch/x86/kernel/dumpstack.c
@@ -90,24 +90,28 @@ void show_trace_log_lvl(struct task_struct *task, struct pt_regs *regs,
 	 * - task stack
 	 * - interrupt stack
 	 * - HW exception stacks (double fault, nmi, debug, mce)
+	 * - SYSENTER stack
 	 *
-	 * x86-32 can have up to three stacks:
+	 * x86-32 can have up to four stacks:
 	 * - task stack
 	 * - softirq stack
 	 * - hardirq stack
+	 * - SYSENTER stack
 	 */
 	for (regs = NULL; stack; stack = PTR_ALIGN(stack_info.next_sp, sizeof(long))) {
 		const char *stack_name;
 
-		/*
-		 * If we overflowed the task stack into a guard page, jump back
-		 * to the bottom of the usable stack.
-		 */
-		if (task_stack_page(task) - (void *)stack < PAGE_SIZE)
-			stack = task_stack_page(task);
-
-		if (get_stack_info(stack, task, &stack_info, &visit_mask))
-			break;
+		if (get_stack_info(stack, task, &stack_info, &visit_mask)) {
+			/*
+			 * We weren't on a valid stack.  It's possible that
+			 * we overflowed a valid stack into a guard page.
+			 * See if the next page up is valid so that we can
+			 * generate some kind of backtrace if this happens.
+			 */
+			stack = (unsigned long *)PAGE_ALIGN((unsigned long)stack);
+			if (get_stack_info(stack, task, &stack_info, &visit_mask))
+				break;
+		}
 
 		stack_name = stack_type_name(stack_info.type);
 		if (stack_name)
-- 
2.14.1

^ permalink raw reply	[flat|nested] 113+ messages in thread

* [PATCH 09/43] x86/asm: Move SYSENTER_stack to the beginning of struct tss_struct
  2017-11-24 17:23 [PATCH 00/43] x86 entry-stack and Kaiser series, 2017/11/24, v2 version Ingo Molnar
                   ` (7 preceding siblings ...)
  2017-11-24 17:23 ` [PATCH 08/43] x86/dumpstack: Handle stack overflow on all stacks Ingo Molnar
@ 2017-11-24 17:23 ` Ingo Molnar
  2017-11-24 17:23 ` [PATCH 10/43] x86/asm: Remap the TSS into the cpu entry area Ingo Molnar
                   ` (34 subsequent siblings)
  43 siblings, 0 replies; 113+ messages in thread
From: Ingo Molnar @ 2017-11-24 17:23 UTC (permalink / raw)
  To: linux-kernel
  Cc: Dave Hansen, Andy Lutomirski, Thomas Gleixner, H . Peter Anvin,
	Peter Zijlstra, Borislav Petkov, Linus Torvalds

From: Andy Lutomirski <luto@kernel.org>

SYSENTER_stack should have reliable overflow detection, which
means that it needs to be at the bottom of a page, not the top.
Move it to the beginning of struct tss_struct and page-align it.

Also add an assertion to make sure that the fixed hardware TSS
doesn't cross a page boundary.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Borislav Petkov <bpetkov@suse.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lkml.kernel.org/r/8de9901e7c3a6aa8fac95b37b9c7b96f1900f11a.1511497875.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/include/asm/processor.h | 21 ++++++++++++---------
 arch/x86/kernel/cpu/common.c     | 21 +++++++++++++++++++++
 2 files changed, 33 insertions(+), 9 deletions(-)

diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index c24456429c7d..48d44fae3d27 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -328,7 +328,16 @@ struct x86_hw_tss {
 
 struct tss_struct {
 	/*
-	 * The hardware state:
+	 * Space for the temporary SYSENTER stack, used for SYSENTER
+	 * and the entry trampoline as well.
+	 */
+	unsigned long		SYSENTER_stack_canary;
+	unsigned long		SYSENTER_stack[64];
+
+	/*
+	 * The fixed hardware portion.  This must not cross a page boundary
+	 * at risk of violating the SDM's advice and potentially triggering
+	 * errata.
 	 */
 	struct x86_hw_tss	x86_tss;
 
@@ -339,15 +348,9 @@ struct tss_struct {
 	 * be within the limit.
 	 */
 	unsigned long		io_bitmap[IO_BITMAP_LONGS + 1];
+} __aligned(PAGE_SIZE);
 
-	/*
-	 * Space for the temporary SYSENTER stack.
-	 */
-	unsigned long		SYSENTER_stack_canary;
-	unsigned long		SYSENTER_stack[64];
-} ____cacheline_aligned;
-
-DECLARE_PER_CPU_SHARED_ALIGNED(struct tss_struct, cpu_tss);
+DECLARE_PER_CPU_PAGE_ALIGNED(struct tss_struct, cpu_tss);
 
 /*
  * sizeof(unsigned long) coming from an extra "long" at the end
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 62cdc10a7d94..d173f6013467 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -511,6 +511,27 @@ static inline void setup_cpu_entry_area(int cpu)
 #endif
 
 	__set_fixmap(get_cpu_entry_area_index(cpu, gdt), get_cpu_gdt_paddr(cpu), gdt_prot);
+
+	/*
+	 * The Intel SDM says (Volume 3, 7.2.1):
+	 *
+	 *  Avoid placing a page boundary in the part of the TSS that the
+	 *  processor reads during a task switch (the first 104 bytes). The
+	 *  processor may not correctly perform address translations if a
+	 *  boundary occurs in this area. During a task switch, the processor
+	 *  reads and writes into the first 104 bytes of each TSS (using
+	 *  contiguous physical addresses beginning with the physical address
+	 *  of the first byte of the TSS). So, after TSS access begins, if
+	 *  part of the 104 bytes is not physically contiguous, the processor
+	 *  will access incorrect information without generating a page-fault
+	 *  exception.
+	 *
+	 * There are also a lot of errata involving the TSS spanning a page
+	 * boundary.  Assert that we're not doing that.
+	 */
+	BUILD_BUG_ON((offsetof(struct tss_struct, x86_tss) ^
+		      offsetofend(struct tss_struct, x86_tss)) & PAGE_MASK);
+
 }
 
 /* Load the original GDT from the per-cpu structure */
-- 
2.14.1

^ permalink raw reply	[flat|nested] 113+ messages in thread

* [PATCH 10/43] x86/asm: Remap the TSS into the cpu entry area
  2017-11-24 17:23 [PATCH 00/43] x86 entry-stack and Kaiser series, 2017/11/24, v2 version Ingo Molnar
                   ` (8 preceding siblings ...)
  2017-11-24 17:23 ` [PATCH 09/43] x86/asm: Move SYSENTER_stack to the beginning of struct tss_struct Ingo Molnar
@ 2017-11-24 17:23 ` Ingo Molnar
  2017-11-24 17:23 ` [PATCH 11/43] x86/asm/64: Separate cpu_current_top_of_stack from TSS.sp0 Ingo Molnar
                   ` (33 subsequent siblings)
  43 siblings, 0 replies; 113+ messages in thread
From: Ingo Molnar @ 2017-11-24 17:23 UTC (permalink / raw)
  To: linux-kernel
  Cc: Dave Hansen, Andy Lutomirski, Thomas Gleixner, H . Peter Anvin,
	Peter Zijlstra, Borislav Petkov, Linus Torvalds

From: Andy Lutomirski <luto@kernel.org>

This has a secondary purpose: it puts the entry stack into a region
with a well-controlled layout.  A subsequent patch will take
advantage of this to streamline the SYSCALL entry code to be able to
find it more easily.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Borislav Petkov <bpetkov@suse.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lkml.kernel.org/r/cdcba7e1e82122461b3ca36bb3ef6713ba605e35.1511497875.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/entry/entry_32.S     |  6 ++++--
 arch/x86/include/asm/fixmap.h |  7 +++++++
 arch/x86/kernel/asm-offsets.c |  3 +++
 arch/x86/kernel/cpu/common.c  | 38 ++++++++++++++++++++++++++++++++------
 arch/x86/kernel/dumpstack.c   |  3 ++-
 arch/x86/power/cpu.c          | 11 ++++++-----
 6 files changed, 54 insertions(+), 14 deletions(-)

diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
index 4838037f97f6..0ab316c46806 100644
--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -941,7 +941,8 @@ ENTRY(debug)
 	movl	%esp, %eax			# pt_regs pointer
 
 	/* Are we currently on the SYSENTER stack? */
-	PER_CPU(cpu_tss + CPU_TSS_SYSENTER_stack + SIZEOF_SYSENTER_stack, %ecx)
+	movl	PER_CPU_VAR(cpu_entry_area), %ecx
+	addl	$CPU_ENTRY_AREA_tss + CPU_TSS_SYSENTER_stack + SIZEOF_SYSENTER_stack, %ecx
 	subl	%eax, %ecx	/* ecx = (end of SYSENTER_stack) - esp */
 	cmpl	$SIZEOF_SYSENTER_stack, %ecx
 	jb	.Ldebug_from_sysenter_stack
@@ -984,7 +985,8 @@ ENTRY(nmi)
 	movl	%esp, %eax			# pt_regs pointer
 
 	/* Are we currently on the SYSENTER stack? */
-	PER_CPU(cpu_tss + CPU_TSS_SYSENTER_stack + SIZEOF_SYSENTER_stack, %ecx)
+	movl	PER_CPU_VAR(cpu_entry_area), %ecx
+	addl	$CPU_ENTRY_AREA_tss + CPU_TSS_SYSENTER_stack + SIZEOF_SYSENTER_stack, %ecx
 	subl	%eax, %ecx	/* ecx = (end of SYSENTER_stack) - esp */
 	cmpl	$SIZEOF_SYSENTER_stack, %ecx
 	jb	.Lnmi_from_sysenter_stack
diff --git a/arch/x86/include/asm/fixmap.h b/arch/x86/include/asm/fixmap.h
index 0f4c92f02968..3a42da14c2cb 100644
--- a/arch/x86/include/asm/fixmap.h
+++ b/arch/x86/include/asm/fixmap.h
@@ -51,6 +51,13 @@ extern unsigned long __FIXADDR_TOP;
  */
 struct cpu_entry_area {
 	char gdt[PAGE_SIZE];
+
+	/*
+	 * The GDT is just below cpu_tss and thus serves (on x86_64) as a
+	 * a read-only guard page for the SYSENTER stack at the bottom
+	 * of the TSS region.
+	 */
+	struct tss_struct tss;
 };
 
 #define CPU_ENTRY_AREA_PAGES (sizeof(struct cpu_entry_area) / PAGE_SIZE)
diff --git a/arch/x86/kernel/asm-offsets.c b/arch/x86/kernel/asm-offsets.c
index b275863128eb..55858b277cf6 100644
--- a/arch/x86/kernel/asm-offsets.c
+++ b/arch/x86/kernel/asm-offsets.c
@@ -98,4 +98,7 @@ void common(void) {
 	OFFSET(CPU_TSS_SYSENTER_stack, tss_struct, SYSENTER_stack);
 	/* Size of SYSENTER_stack */
 	DEFINE(SIZEOF_SYSENTER_stack, sizeof(((struct tss_struct *)0)->SYSENTER_stack));
+
+	/* Layout info for cpu_entry_area */
+	OFFSET(CPU_ENTRY_AREA_tss, cpu_entry_area, tss);
 }
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index d173f6013467..c67742df569a 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -490,6 +490,19 @@ void load_percpu_segment(int cpu)
 	load_stack_canary_segment();
 }
 
+static void set_percpu_fixmap_pages(int fixmap_index, void *ptr, int pages, pgprot_t prot)
+{
+	int i;
+
+	for (i = 0; i < pages; i++)
+		__set_fixmap(fixmap_index - i, per_cpu_ptr_to_phys(ptr + i*PAGE_SIZE), prot);
+}
+
+#ifdef CONFIG_X86_32
+/* The 32-bit entry code needs to find cpu_entry_area. */
+DEFINE_PER_CPU(struct cpu_entry_area *, cpu_entry_area);
+#endif
+
 /* Setup the fixmap mappings only once per-processor */
 static inline void setup_cpu_entry_area(int cpu)
 {
@@ -531,7 +544,15 @@ static inline void setup_cpu_entry_area(int cpu)
 	 */
 	BUILD_BUG_ON((offsetof(struct tss_struct, x86_tss) ^
 		      offsetofend(struct tss_struct, x86_tss)) & PAGE_MASK);
+	BUILD_BUG_ON(sizeof(struct tss_struct) % PAGE_SIZE != 0);
+	set_percpu_fixmap_pages(get_cpu_entry_area_index(cpu, tss),
+				&per_cpu(cpu_tss, cpu),
+				sizeof(struct tss_struct) / PAGE_SIZE,
+				PAGE_KERNEL);
 
+#ifdef CONFIG_X86_32
+	this_cpu_write(cpu_entry_area, get_cpu_entry_area(cpu));
+#endif
 }
 
 /* Load the original GDT from the per-cpu structure */
@@ -1282,7 +1303,8 @@ void enable_sep_cpu(void)
 	wrmsr(MSR_IA32_SYSENTER_CS, tss->x86_tss.ss1, 0);
 
 	wrmsr(MSR_IA32_SYSENTER_ESP,
-	      (unsigned long)tss + offsetofend(struct tss_struct, SYSENTER_stack),
+	      (unsigned long)&get_cpu_entry_area(cpu)->tss +
+	      offsetofend(struct tss_struct, SYSENTER_stack),
 	      0);
 
 	wrmsr(MSR_IA32_SYSENTER_EIP, (unsigned long)entry_SYSENTER_32, 0);
@@ -1395,6 +1417,8 @@ static DEFINE_PER_CPU_PAGE_ALIGNED(char, exception_stacks
 /* May not be marked __init: used by software suspend */
 void syscall_init(void)
 {
+	int cpu = smp_processor_id();
+
 	wrmsr(MSR_STAR, 0, (__USER32_CS << 16) | __KERNEL_CS);
 	wrmsrl(MSR_LSTAR, (unsigned long)entry_SYSCALL_64);
 
@@ -1408,7 +1432,7 @@ void syscall_init(void)
 	 */
 	wrmsrl_safe(MSR_IA32_SYSENTER_CS, (u64)__KERNEL_CS);
 	wrmsrl_safe(MSR_IA32_SYSENTER_ESP,
-		    (unsigned long)this_cpu_ptr(&cpu_tss) +
+		    (unsigned long)&get_cpu_entry_area(cpu)->tss +
 		    offsetofend(struct tss_struct, SYSENTER_stack));
 	wrmsrl_safe(MSR_IA32_SYSENTER_EIP, (u64)entry_SYSENTER_compat);
 #else
@@ -1618,11 +1642,13 @@ void cpu_init(void)
 	initialize_tlbstate_and_flush();
 	enter_lazy_tlb(&init_mm, me);
 
+	setup_cpu_entry_area(cpu);
+
 	/*
 	 * Initialize the TSS.  Don't bother initializing sp0, as the initial
 	 * task never enters user mode.
 	 */
-	set_tss_desc(cpu, &t->x86_tss);
+	set_tss_desc(cpu, &get_cpu_entry_area(cpu)->tss.x86_tss);
 	load_TR_desc();
 
 	load_mm_ldt(&init_mm);
@@ -1635,7 +1661,6 @@ void cpu_init(void)
 	if (is_uv_system())
 		uv_cpu_init();
 
-	setup_cpu_entry_area(cpu);
 	load_fixmap_gdt(cpu);
 }
 
@@ -1676,11 +1701,13 @@ void cpu_init(void)
 	initialize_tlbstate_and_flush();
 	enter_lazy_tlb(&init_mm, curr);
 
+	setup_cpu_entry_area(cpu);
+
 	/*
 	 * Initialize the TSS.  Don't bother initializing sp0, as the initial
 	 * task never enters user mode.
 	 */
-	set_tss_desc(cpu, &t->x86_tss);
+	set_tss_desc(cpu, &get_cpu_entry_area(cpu)->tss.x86_tss);
 	load_TR_desc();
 
 	load_mm_ldt(&init_mm);
@@ -1697,7 +1724,6 @@ void cpu_init(void)
 
 	fpu__init_cpu();
 
-	setup_cpu_entry_area(cpu);
 	load_fixmap_gdt(cpu);
 }
 #endif
diff --git a/arch/x86/kernel/dumpstack.c b/arch/x86/kernel/dumpstack.c
index a8aa70c05489..bb61919c9335 100644
--- a/arch/x86/kernel/dumpstack.c
+++ b/arch/x86/kernel/dumpstack.c
@@ -45,7 +45,8 @@ bool in_task_stack(unsigned long *stack, struct task_struct *task,
 
 bool in_sysenter_stack(unsigned long *stack, struct stack_info *info)
 {
-	struct tss_struct *tss = this_cpu_ptr(&cpu_tss);
+	int cpu = smp_processor_id();
+	struct tss_struct *tss = &get_cpu_entry_area(cpu)->tss;
 
 	/* Treat the canary as part of the stack for unwinding purposes. */
 	void *begin = &tss->SYSENTER_stack_canary;
diff --git a/arch/x86/power/cpu.c b/arch/x86/power/cpu.c
index 50593e138281..04d5157fe7f8 100644
--- a/arch/x86/power/cpu.c
+++ b/arch/x86/power/cpu.c
@@ -160,18 +160,19 @@ static void do_fpu_end(void)
 static void fix_processor_context(void)
 {
 	int cpu = smp_processor_id();
-	struct tss_struct *t = &per_cpu(cpu_tss, cpu);
 #ifdef CONFIG_X86_64
 	struct desc_struct *desc = get_cpu_gdt_rw(cpu);
 	tss_desc tss;
 #endif
 
 	/*
-	 * This just modifies memory; should not be necessary. But... This is
-	 * necessary, because 386 hardware has concept of busy TSS or some
-	 * similar stupidity.
+	 * We need to reload TR, which requires that we change the
+	 * GDT entry to indicate "available" first.
+	 *
+	 * XXX: This could probably all be replaced by a call to
+	 * force_reload_TR().
 	 */
-	set_tss_desc(cpu, &t->x86_tss);
+	set_tss_desc(cpu, &get_cpu_entry_area(cpu)->tss.x86_tss);
 
 #ifdef CONFIG_X86_64
 	memcpy(&tss, &desc[GDT_ENTRY_TSS], sizeof(tss_desc));
-- 
2.14.1

^ permalink raw reply	[flat|nested] 113+ messages in thread

* [PATCH 11/43] x86/asm/64: Separate cpu_current_top_of_stack from TSS.sp0
  2017-11-24 17:23 [PATCH 00/43] x86 entry-stack and Kaiser series, 2017/11/24, v2 version Ingo Molnar
                   ` (9 preceding siblings ...)
  2017-11-24 17:23 ` [PATCH 10/43] x86/asm: Remap the TSS into the cpu entry area Ingo Molnar
@ 2017-11-24 17:23 ` Ingo Molnar
  2017-11-24 17:23 ` [PATCH 12/43] x86/espfix/64: Stop assuming that pt_regs is on the entry stack Ingo Molnar
                   ` (32 subsequent siblings)
  43 siblings, 0 replies; 113+ messages in thread
From: Ingo Molnar @ 2017-11-24 17:23 UTC (permalink / raw)
  To: linux-kernel
  Cc: Dave Hansen, Andy Lutomirski, Thomas Gleixner, H . Peter Anvin,
	Peter Zijlstra, Borislav Petkov, Linus Torvalds

From: Andy Lutomirski <luto@kernel.org>

On 64-bit kernels, we used to assume that TSS.sp0 was the current
top of stack.  With the addition of an entry trampoline, this will
no longer be the case.  Store the current top of stack in TSS.sp1,
which is otherwise unused but shares the same cacheline.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Borislav Petkov <bpetkov@suse.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lkml.kernel.org/r/f56634c746a2926eb7bae61e7b80ed51a1940769.1511497875.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/include/asm/processor.h   | 18 +++++++++++++-----
 arch/x86/include/asm/thread_info.h |  2 +-
 arch/x86/kernel/asm-offsets_64.c   |  1 +
 arch/x86/kernel/process.c          | 10 ++++++++++
 arch/x86/kernel/process_64.c       |  1 +
 5 files changed, 26 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 48d44fae3d27..3a09e5571a92 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -305,7 +305,13 @@ struct x86_hw_tss {
 struct x86_hw_tss {
 	u32			reserved1;
 	u64			sp0;
+
+	/*
+	 * We store cpu_current_top_of_stack in sp1 so it's always accessible.
+	 * Linux does not use ring 1, so sp1 is not otherwise needed.
+	 */
 	u64			sp1;
+
 	u64			sp2;
 	u64			reserved2;
 	u64			ist[7];
@@ -364,6 +370,8 @@ DECLARE_PER_CPU_PAGE_ALIGNED(struct tss_struct, cpu_tss);
 
 #ifdef CONFIG_X86_32
 DECLARE_PER_CPU(unsigned long, cpu_current_top_of_stack);
+#else
+#define cpu_current_top_of_stack cpu_tss.x86_tss.sp1
 #endif
 
 /*
@@ -535,12 +543,12 @@ static inline void native_swapgs(void)
 
 static inline unsigned long current_top_of_stack(void)
 {
-#ifdef CONFIG_X86_64
-	return this_cpu_read_stable(cpu_tss.x86_tss.sp0);
-#else
-	/* sp0 on x86_32 is special in and around vm86 mode. */
+	/*
+	 *  We can't read directly from tss.sp0: sp0 on x86_32 is special in
+	 *  and around vm86 mode and sp0 on x86_64 is special because of the
+	 *  entry trampoline.
+	 */
 	return this_cpu_read_stable(cpu_current_top_of_stack);
-#endif
 }
 
 static inline bool on_thread_stack(void)
diff --git a/arch/x86/include/asm/thread_info.h b/arch/x86/include/asm/thread_info.h
index 70f425947dc5..44a04999791e 100644
--- a/arch/x86/include/asm/thread_info.h
+++ b/arch/x86/include/asm/thread_info.h
@@ -207,7 +207,7 @@ static inline int arch_within_stack_frames(const void * const stack,
 #else /* !__ASSEMBLY__ */
 
 #ifdef CONFIG_X86_64
-# define cpu_current_top_of_stack (cpu_tss + TSS_sp0)
+# define cpu_current_top_of_stack (cpu_tss + TSS_sp1)
 #endif
 
 #endif
diff --git a/arch/x86/kernel/asm-offsets_64.c b/arch/x86/kernel/asm-offsets_64.c
index 630212fa9b9d..ad649a8a74a0 100644
--- a/arch/x86/kernel/asm-offsets_64.c
+++ b/arch/x86/kernel/asm-offsets_64.c
@@ -63,6 +63,7 @@ int main(void)
 
 	OFFSET(TSS_ist, tss_struct, x86_tss.ist);
 	OFFSET(TSS_sp0, tss_struct, x86_tss.sp0);
+	OFFSET(TSS_sp1, tss_struct, x86_tss.sp1);
 	BLANK();
 
 #ifdef CONFIG_CC_STACKPROTECTOR
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 35d674157fda..86e83762e3b3 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -56,6 +56,16 @@ __visible DEFINE_PER_CPU_SHARED_ALIGNED(struct tss_struct, cpu_tss) = {
 		 * Poison it.
 		 */
 		.sp0 = (1UL << (BITS_PER_LONG-1)) + 1,
+
+#ifdef CONFIG_X86_64
+		/*
+		 * .sp1 is cpu_current_top_of_stack.  The init task never
+		 * runs user code, but cpu_current_top_of_stack should still
+		 * be well defined before the first context switch.
+		 */
+		.sp1 = TOP_OF_INIT_STACK,
+#endif
+
 #ifdef CONFIG_X86_32
 		.ss0 = __KERNEL_DS,
 		.ss1 = __KERNEL_CS,
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index eeeb34f85c25..bafe65b08697 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -462,6 +462,7 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
 	 * Switch the PDA and FPU contexts.
 	 */
 	this_cpu_write(current_task, next_p);
+	this_cpu_write(cpu_current_top_of_stack, task_top_of_stack(next_p));
 
 	/* Reload sp0. */
 	update_sp0(next_p);
-- 
2.14.1

^ permalink raw reply	[flat|nested] 113+ messages in thread

* [PATCH 12/43] x86/espfix/64: Stop assuming that pt_regs is on the entry stack
  2017-11-24 17:23 [PATCH 00/43] x86 entry-stack and Kaiser series, 2017/11/24, v2 version Ingo Molnar
                   ` (10 preceding siblings ...)
  2017-11-24 17:23 ` [PATCH 11/43] x86/asm/64: Separate cpu_current_top_of_stack from TSS.sp0 Ingo Molnar
@ 2017-11-24 17:23 ` Ingo Molnar
  2017-11-24 18:25   ` Borislav Petkov
  2017-11-24 17:23 ` [PATCH 13/43] x86/asm/64: Use a percpu trampoline stack for IDT entries Ingo Molnar
                   ` (31 subsequent siblings)
  43 siblings, 1 reply; 113+ messages in thread
From: Ingo Molnar @ 2017-11-24 17:23 UTC (permalink / raw)
  To: linux-kernel
  Cc: Dave Hansen, Andy Lutomirski, Thomas Gleixner, H . Peter Anvin,
	Peter Zijlstra, Borislav Petkov, Linus Torvalds

From: Andy Lutomirski <luto@kernel.org>

When we start using an entry trampoline, a #GP from userspace will
be delivered on the entry stack, not on the task stack.  Fix the
espfix64 #DF fixup to set up #GP according to TSS.SP0, rather than
assuming that pt_regs + 1 == SP0.  This won't change anything
without an entry stack, but it will make the code continue to work
when an entry stack is added.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Borislav Petkov <bpetkov@suse.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lkml.kernel.org/r/b1ef4136616c6bd2a75d1fd2736d1d54437d65a8.1511497875.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/kernel/traps.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 2008dd0f8ccb..1bd43f044c62 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -359,7 +359,8 @@ dotraplinkage void do_double_fault(struct pt_regs *regs, long error_code)
 		regs->cs == __KERNEL_CS &&
 		regs->ip == (unsigned long)native_irq_return_iret)
 	{
-		struct pt_regs *normal_regs = task_pt_regs(current);
+		struct pt_regs *normal_regs =
+			(struct pt_regs *)this_cpu_read(cpu_tss.x86_tss.sp0) - 1;
 
 		/* Fake a #GP(0) from userspace. */
 		memmove(&normal_regs->ip, (void *)regs->sp, 5*8);
@@ -390,7 +391,7 @@ dotraplinkage void do_double_fault(struct pt_regs *regs, long error_code)
 	 *
 	 *   Processors update CR2 whenever a page fault is detected. If a
 	 *   second page fault occurs while an earlier page fault is being
-	 *   deliv- ered, the faulting linear address of the second fault will
+	 *   delivered, the faulting linear address of the second fault will
 	 *   overwrite the contents of CR2 (replacing the previous
 	 *   address). These updates to CR2 occur even if the page fault
 	 *   results in a double fault or occurs during the delivery of a
-- 
2.14.1

^ permalink raw reply	[flat|nested] 113+ messages in thread

* [PATCH 13/43] x86/asm/64: Use a percpu trampoline stack for IDT entries
  2017-11-24 17:23 [PATCH 00/43] x86 entry-stack and Kaiser series, 2017/11/24, v2 version Ingo Molnar
                   ` (11 preceding siblings ...)
  2017-11-24 17:23 ` [PATCH 12/43] x86/espfix/64: Stop assuming that pt_regs is on the entry stack Ingo Molnar
@ 2017-11-24 17:23 ` Ingo Molnar
  2017-11-24 19:02   ` Borislav Petkov
  2017-11-24 17:23 ` [PATCH 14/43] x86/asm/64: Return to userspace from the trampoline stack Ingo Molnar
                   ` (30 subsequent siblings)
  43 siblings, 1 reply; 113+ messages in thread
From: Ingo Molnar @ 2017-11-24 17:23 UTC (permalink / raw)
  To: linux-kernel
  Cc: Dave Hansen, Andy Lutomirski, Thomas Gleixner, H . Peter Anvin,
	Peter Zijlstra, Borislav Petkov, Linus Torvalds

From: Andy Lutomirski <luto@kernel.org>

Historically, IDT entries from usermode have always gone directly
to the running task's kernel stack.  Rearrange it so that we enter on
a percpu trampoline stack and then manually switch to the task's stack.
This touches a couple of extra cachelines, but it gives us a chance
to run some code before we touch the kernel stack.

The asm isn't exactly beautiful, but I think that fully refactoring
it can wait.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Borislav Petkov <bpetkov@suse.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lkml.kernel.org/r/fa3958723a1a85baeaf309c735b775841205800e.1511497875.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/entry/entry_64.S        | 67 ++++++++++++++++++++++++++++++----------
 arch/x86/entry/entry_64_compat.S |  5 ++-
 arch/x86/include/asm/switch_to.h |  2 +-
 arch/x86/include/asm/traps.h     |  1 -
 arch/x86/kernel/cpu/common.c     |  6 ++--
 arch/x86/kernel/traps.c          | 18 +++++------
 6 files changed, 68 insertions(+), 31 deletions(-)

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index f81d50d7ceac..7d47199f405f 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -563,6 +563,13 @@ END(irq_entries_start)
 /* 0(%rsp): ~(interrupt number) */
 	.macro interrupt func
 	cld
+
+	testb	$3, CS-ORIG_RAX(%rsp)
+	jz	1f
+	SWAPGS
+	call	switch_to_thread_stack
+1:
+
 	ALLOC_PT_GPREGS_ON_STACK
 	SAVE_C_REGS
 	SAVE_EXTRA_REGS
@@ -572,12 +579,8 @@ END(irq_entries_start)
 	jz	1f
 
 	/*
-	 * IRQ from user mode.  Switch to kernel gsbase and inform context
-	 * tracking that we're in kernel mode.
-	 */
-	SWAPGS
-
-	/*
+	 * IRQ from user mode.
+	 *
 	 * We need to tell lockdep that IRQs are off.  We can't do this until
 	 * we fix gsbase, and we should do it before enter_from_user_mode
 	 * (which can take locks).  Since TRACE_IRQS_OFF idempotent,
@@ -831,6 +834,32 @@ apicinterrupt IRQ_WORK_VECTOR			irq_work_interrupt		smp_irq_work_interrupt
  */
 #define CPU_TSS_IST(x) PER_CPU_VAR(cpu_tss) + (TSS_ist + ((x) - 1) * 8)
 
+/*
+ * Switch to the thread stack.  This is called with the IRET frame and
+ * orig_ax on the stack.  (That is, RDI..R12 are not on the stack and
+ * space has not been allocated for them.)
+ */
+ENTRY(switch_to_thread_stack)
+	UNWIND_HINT_FUNC
+
+	pushq	%rdi
+	movq	%rsp, %rdi
+	movq	PER_CPU_VAR(cpu_current_top_of_stack), %rsp
+	UNWIND_HINT sp_offset=16 sp_reg=ORC_REG_DI
+
+	pushq	7*8(%rdi)		/* regs->ss */
+	pushq	6*8(%rdi)		/* regs->rsp */
+	pushq	5*8(%rdi)		/* regs->eflags */
+	pushq	4*8(%rdi)		/* regs->cs */
+	pushq	3*8(%rdi)		/* regs->ip */
+	pushq	2*8(%rdi)		/* regs->orig_ax */
+	pushq	8(%rdi)			/* return address */
+	UNWIND_HINT_FUNC
+
+	movq	(%rdi), %rdi
+	ret
+END(switch_to_thread_stack)
+
 .macro idtentry sym do_sym has_error_code:req paranoid=0 shift_ist=-1
 ENTRY(\sym)
 	UNWIND_HINT_IRET_REGS offset=\has_error_code*8
@@ -848,11 +877,12 @@ ENTRY(\sym)
 
 	ALLOC_PT_GPREGS_ON_STACK
 
-	.if \paranoid
-	.if \paranoid == 1
+	.if \paranoid < 2
 	testb	$3, CS(%rsp)			/* If coming from userspace, switch stacks */
-	jnz	1f
+	jnz	.Lfrom_usermode_switch_stack_\@
 	.endif
+
+	.if \paranoid
 	call	paranoid_entry
 	.else
 	call	error_entry
@@ -894,20 +924,15 @@ ENTRY(\sym)
 	jmp	error_exit
 	.endif
 
-	.if \paranoid == 1
+	.if \paranoid < 2
 	/*
-	 * Paranoid entry from userspace.  Switch stacks and treat it
+	 * Entry from userspace.  Switch stacks and treat it
 	 * as a normal entry.  This means that paranoid handlers
 	 * run in real process context if user_mode(regs).
 	 */
-1:
+.Lfrom_usermode_switch_stack_\@:
 	call	error_entry
 
-
-	movq	%rsp, %rdi			/* pt_regs pointer */
-	call	sync_regs
-	movq	%rax, %rsp			/* switch stack */
-
 	movq	%rsp, %rdi			/* pt_regs pointer */
 
 	.if \has_error_code
@@ -1170,6 +1195,14 @@ ENTRY(error_entry)
 	SWAPGS
 
 .Lerror_entry_from_usermode_after_swapgs:
+	/* Put us onto the real thread stack. */
+	popq	%r12				/* save return addr in %12 */
+	movq	%rsp, %rdi			/* arg0 = pt_regs pointer */
+	call	sync_regs
+	movq	%rax, %rsp			/* switch stack */
+	ENCODE_FRAME_POINTER
+	pushq	%r12
+
 	/*
 	 * We need to tell lockdep that IRQs are off.  We can't do this until
 	 * we fix gsbase, and we should do it before enter_from_user_mode
diff --git a/arch/x86/entry/entry_64_compat.S b/arch/x86/entry/entry_64_compat.S
index dcc6987f9bae..95ad40eb7eff 100644
--- a/arch/x86/entry/entry_64_compat.S
+++ b/arch/x86/entry/entry_64_compat.S
@@ -306,8 +306,11 @@ ENTRY(entry_INT80_compat)
 	 */
 	movl	%eax, %eax
 
-	/* Construct struct pt_regs on stack (iret frame is already on stack) */
 	pushq	%rax			/* pt_regs->orig_ax */
+
+	/* switch to thread stack expects orig_ax to be pushed */
+	call	switch_to_thread_stack
+
 	pushq	%rdi			/* pt_regs->di */
 	pushq	%rsi			/* pt_regs->si */
 	pushq	%rdx			/* pt_regs->dx */
diff --git a/arch/x86/include/asm/switch_to.h b/arch/x86/include/asm/switch_to.h
index 8c6bd6863db9..a6796ac8d311 100644
--- a/arch/x86/include/asm/switch_to.h
+++ b/arch/x86/include/asm/switch_to.h
@@ -93,7 +93,7 @@ static inline void update_sp0(struct task_struct *task)
 #ifdef CONFIG_X86_32
 	load_sp0(task->thread.sp0);
 #else
-	load_sp0(task_top_of_stack(task));
+	/* On x86_64, sp0 always points to the entry trampoline stack. */
 #endif
 }
 
diff --git a/arch/x86/include/asm/traps.h b/arch/x86/include/asm/traps.h
index 1fadd310ff68..31051f35cbb7 100644
--- a/arch/x86/include/asm/traps.h
+++ b/arch/x86/include/asm/traps.h
@@ -75,7 +75,6 @@ dotraplinkage void do_segment_not_present(struct pt_regs *, long);
 dotraplinkage void do_stack_segment(struct pt_regs *, long);
 #ifdef CONFIG_X86_64
 dotraplinkage void do_double_fault(struct pt_regs *, long);
-asmlinkage struct pt_regs *sync_regs(struct pt_regs *);
 #endif
 dotraplinkage void do_general_protection(struct pt_regs *, long);
 dotraplinkage void do_page_fault(struct pt_regs *, unsigned long);
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index c67742df569a..7c82a8a8bfda 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -1645,11 +1645,13 @@ void cpu_init(void)
 	setup_cpu_entry_area(cpu);
 
 	/*
-	 * Initialize the TSS.  Don't bother initializing sp0, as the initial
-	 * task never enters user mode.
+	 * Initialize the TSS.  sp0 points to the entry trampoline stack
+	 * regardless of what task is running.
 	 */
 	set_tss_desc(cpu, &get_cpu_entry_area(cpu)->tss.x86_tss);
 	load_TR_desc();
+	load_sp0((unsigned long)&get_cpu_entry_area(cpu)->tss +
+		 offsetofend(struct tss_struct, SYSENTER_stack));
 
 	load_mm_ldt(&init_mm);
 
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 1bd43f044c62..cbc4272bb9dd 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -359,8 +359,7 @@ dotraplinkage void do_double_fault(struct pt_regs *regs, long error_code)
 		regs->cs == __KERNEL_CS &&
 		regs->ip == (unsigned long)native_irq_return_iret)
 	{
-		struct pt_regs *normal_regs =
-			(struct pt_regs *)this_cpu_read(cpu_tss.x86_tss.sp0) - 1;
+		struct pt_regs *normal_regs = (struct pt_regs *)this_cpu_read(cpu_tss.x86_tss.sp0) - 1;
 
 		/* Fake a #GP(0) from userspace. */
 		memmove(&normal_regs->ip, (void *)regs->sp, 5*8);
@@ -611,9 +610,10 @@ NOKPROBE_SYMBOL(do_int3);
  * interrupted code was in user mode. The actual stack switch is done in
  * entry_64.S
  */
-asmlinkage __visible notrace struct pt_regs *sync_regs(struct pt_regs *eregs)
+asmlinkage __visible notrace
+struct pt_regs *sync_regs(struct pt_regs *eregs)
 {
-	struct pt_regs *regs = task_pt_regs(current);
+	struct pt_regs *regs = (struct pt_regs *)this_cpu_read(cpu_current_top_of_stack) - 1;
 	*regs = *eregs;
 	return regs;
 }
@@ -630,13 +630,13 @@ struct bad_iret_stack *fixup_bad_iret(struct bad_iret_stack *s)
 	/*
 	 * This is called from entry_64.S early in handling a fault
 	 * caused by a bad iret to user mode.  To handle the fault
-	 * correctly, we want move our stack frame to task_pt_regs
-	 * and we want to pretend that the exception came from the
-	 * iret target.
+	 * correctly, we want move our stack frame to where it would
+	 * be had we entered directly on the entry stack (rather than
+	 * just below the IRET frame) and we want to pretend that the
+	 * exception came from the iret target.
 	 */
 	struct bad_iret_stack *new_stack =
-		container_of(task_pt_regs(current),
-			     struct bad_iret_stack, regs);
+		(struct bad_iret_stack *)this_cpu_read(cpu_tss.x86_tss.sp0) - 1;
 
 	/* Copy the IRET target to the new stack. */
 	memmove(&new_stack->regs.ip, (void *)s->regs.sp, 5*8);
-- 
2.14.1

^ permalink raw reply	[flat|nested] 113+ messages in thread

* [PATCH 14/43] x86/asm/64: Return to userspace from the trampoline stack
  2017-11-24 17:23 [PATCH 00/43] x86 entry-stack and Kaiser series, 2017/11/24, v2 version Ingo Molnar
                   ` (12 preceding siblings ...)
  2017-11-24 17:23 ` [PATCH 13/43] x86/asm/64: Use a percpu trampoline stack for IDT entries Ingo Molnar
@ 2017-11-24 17:23 ` Ingo Molnar
  2017-11-24 19:10   ` Borislav Petkov
  2017-11-24 17:23 ` [PATCH 15/43] x86/entry/64: Create a percpu SYSCALL entry trampoline Ingo Molnar
                   ` (29 subsequent siblings)
  43 siblings, 1 reply; 113+ messages in thread
From: Ingo Molnar @ 2017-11-24 17:23 UTC (permalink / raw)
  To: linux-kernel
  Cc: Dave Hansen, Andy Lutomirski, Thomas Gleixner, H . Peter Anvin,
	Peter Zijlstra, Borislav Petkov, Linus Torvalds

From: Andy Lutomirski <luto@kernel.org>

By itself, this is useless.  It gives us the ability to run some final
code before exit that cannnot run on the kernel stack.  This could
include a CR3 switch a la KAISER or some kernel stack erasing, for
example.  (Or even weird things like *changing* which kernel stack
gets used as an ASLR-strengthening mechanism.)

The SYSRET32 path is not covered yet.  It could be in the future or
we could just ignore it and force the slow path if needed.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Borislav Petkov <bpetkov@suse.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lkml.kernel.org/r/d350017000eed20922c3b2711a2d9229dc809256.1511497875.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/entry/entry_64.S | 55 +++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 51 insertions(+), 4 deletions(-)

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 7d47199f405f..426b8c669d6a 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -330,8 +330,24 @@ GLOBAL(entry_SYSCALL_64_after_hwframe)
 	popq	%rsi	/* skip rcx */
 	popq	%rdx
 	popq	%rsi
+
+	/*
+	 * Now all regs are restored except RSP and RDI.
+	 * Save old stack pointer and switch to trampoline stack.
+	 */
+	movq	%rsp, %rdi
+	movq	PER_CPU_VAR(cpu_tss + TSS_sp0), %rsp
+
+	pushq	RSP-RDI(%rdi)	/* RSP */
+	pushq	(%rdi)		/* RDI */
+
+	/*
+	 * We are on the trampoline stack.  All regs except RDI are live.
+	 * We can do future final exit work right here.
+	 */
+
 	popq	%rdi
-	movq	RSP-ORIG_RAX(%rsp), %rsp
+	popq	%rsp
 	USERGS_SYSRET64
 END(entry_SYSCALL_64)
 
@@ -633,10 +649,41 @@ GLOBAL(swapgs_restore_regs_and_return_to_usermode)
 	ud2
 1:
 #endif
-	SWAPGS
 	POP_EXTRA_REGS
-	POP_C_REGS
-	addq	$8, %rsp	/* skip regs->orig_ax */
+	popq	%r11
+	popq	%r10
+	popq	%r9
+	popq	%r8
+	popq	%rax
+	popq	%rcx
+	popq	%rdx
+	popq	%rsi
+
+	/*
+	 * The stack is now user RDI, orig_ax, RIP, CS, EFLAGS, RSP, SS.
+	 * Save old stack pointer and switch to trampoline stack.
+	 */
+	movq	%rsp, %rdi
+	movq	PER_CPU_VAR(cpu_tss + TSS_sp0), %rsp
+
+	/* Copy the IRET frame to the trampoline stack. */
+	pushq	6*8(%rdi)	/* SS */
+	pushq	5*8(%rdi)	/* RSP */
+	pushq	4*8(%rdi)	/* EFLAGS */
+	pushq	3*8(%rdi)	/* CS */
+	pushq	2*8(%rdi)	/* RIP */
+
+	/* Push user RDI on the trampoline stack. */
+	pushq	(%rdi)
+
+	/*
+	 * We are on the trampoline stack.  All regs except RDI are live.
+	 * We can do future final exit work right here.
+	 */
+
+	/* Restore RDI. */
+	popq	%rdi
+	SWAPGS
 	INTERRUPT_RETURN
 
 
-- 
2.14.1

^ permalink raw reply	[flat|nested] 113+ messages in thread

* [PATCH 15/43] x86/entry/64: Create a percpu SYSCALL entry trampoline
  2017-11-24 17:23 [PATCH 00/43] x86 entry-stack and Kaiser series, 2017/11/24, v2 version Ingo Molnar
                   ` (13 preceding siblings ...)
  2017-11-24 17:23 ` [PATCH 14/43] x86/asm/64: Return to userspace from the trampoline stack Ingo Molnar
@ 2017-11-24 17:23 ` Ingo Molnar
  2017-11-25 11:40   ` Borislav Petkov
  2017-11-24 17:23 ` [PATCH 16/43] x86/irq: Remove an old outdated comment about context tracking races Ingo Molnar
                   ` (28 subsequent siblings)
  43 siblings, 1 reply; 113+ messages in thread
From: Ingo Molnar @ 2017-11-24 17:23 UTC (permalink / raw)
  To: linux-kernel
  Cc: Dave Hansen, Andy Lutomirski, Thomas Gleixner, H . Peter Anvin,
	Peter Zijlstra, Borislav Petkov, Linus Torvalds

From: Andy Lutomirski <luto@kernel.org>

Handling SYSCALL is tricky: the SYSCALL handler is entered with every
single register (except FLAGS), including RSP, live.  It somehow needs
to set RSP to point to a valid stack, which means it needs to save the
user RSP somewhere and find its own stack pointer.  The canonical way
to do this is with SWAPGS, which lets us access percpu data using the
%gs prefix.

With KAISER-like pagetable switching, this is problematic.  Without a
scratch register, switching CR3 is impossible, so %gs-based percpu
memory would need to be mapped in the user pagetables.  Doing that
without information leaks is difficult or impossible.

Instead, use a different sneaky trick.  Map a copy of the first part
of the SYSCALL asm at a different address for each CPU.  Now RIP
varies depending on the CPU, so we can use RIP-relative memory access
to access percpu memory.  By putting the relevant information (one
scratch slot and the stack address) at a constant offset relative to
RIP, we can make SYSCALL work without relying on %gs.

A nice thing about this approach is that we can easily switch it on
and off if we want pagetable switching to be configurable.

The compat variant of SYSCALL doesn't have this problem in the first
place -- there are plenty of scratch registers, since we don't care
about preserving r8-r15.  This patch therefore doesn't touch SYSCALL32
at all.

XXX: Whenever we settle how KAISER gets turned on and off, we should do
the same to this.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Borislav Petkov <bpetkov@suse.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lkml.kernel.org/r/b95ccae0a5a2f090c901e49fce7c9e8ff6acd40d.1511497875.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/entry/entry_64.S     | 48 +++++++++++++++++++++++++++++++++++++++++++
 arch/x86/include/asm/fixmap.h |  2 ++
 arch/x86/kernel/asm-offsets.c |  1 +
 arch/x86/kernel/cpu/common.c  | 12 ++++++++++-
 arch/x86/kernel/vmlinux.lds.S | 10 +++++++++
 5 files changed, 72 insertions(+), 1 deletion(-)

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 426b8c669d6a..0cde243b7542 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -140,6 +140,54 @@ END(native_usergs_sysret64)
  * with them due to bugs in both AMD and Intel CPUs.
  */
 
+	.pushsection .entry_trampoline, "ax"
+
+/*
+ * The code in here gets remapped into cpu_entry_area's trampoline.  This means
+ * that the assembler and linker have the wrong idea as to where this code
+ * lives (and, in fact, it's mapped more than once, so it's not even at a
+ * fixed address).  So we can't reference any symbols outside the entry
+ * trampoline and expect it to work.
+ *
+ * Instead, we carefully abuse %rip-relative addressing.
+ * .Lentry_trampoline(%rip) refers to the start of the remapped) entry
+ * trampoline.  We can thus find cpu_entry_area with this macro:
+ */
+
+#define CPU_ENTRY_AREA \
+	_entry_trampoline - CPU_ENTRY_AREA_entry_trampoline(%rip)
+
+/* The top word of the SYSENTER stack is hot and is usable as scratch space. */
+#define RSP_SCRATCH CPU_ENTRY_AREA_tss + CPU_TSS_SYSENTER_stack + \
+	SIZEOF_SYSENTER_stack - 8 + CPU_ENTRY_AREA
+
+ENTRY(entry_SYSCALL_64_trampoline)
+	UNWIND_HINT_EMPTY
+	swapgs
+
+	/* Stash the user RSP. */
+	movq	%rsp, RSP_SCRATCH
+
+	/* Load the top of the task stack into RSP */
+	movq	CPU_ENTRY_AREA_tss + TSS_sp1 + CPU_ENTRY_AREA, %rsp
+
+	/* Start building the simulated IRET frame. */
+	pushq	$__USER_DS			/* pt_regs->ss */
+	pushq	RSP_SCRATCH			/* pt_regs->sp */
+	pushq	%r11				/* pt_regs->flags */
+	pushq	$__USER_CS			/* pt_regs->cs */
+	pushq	%rcx				/* pt_regs->ip */
+
+	/*
+	 * x86 lacks a near absolute jump, and we can't jump to the real
+	 * entry text with a relative jump, so we fake it using retq.
+	 */
+	pushq	$entry_SYSCALL_64_after_hwframe
+	retq
+END(entry_SYSCALL_64_trampoline)
+
+	.popsection
+
 ENTRY(entry_SYSCALL_64)
 	UNWIND_HINT_EMPTY
 	/*
diff --git a/arch/x86/include/asm/fixmap.h b/arch/x86/include/asm/fixmap.h
index 3a42da14c2cb..7eb1b5490395 100644
--- a/arch/x86/include/asm/fixmap.h
+++ b/arch/x86/include/asm/fixmap.h
@@ -58,6 +58,8 @@ struct cpu_entry_area {
 	 * of the TSS region.
 	 */
 	struct tss_struct tss;
+
+	char entry_trampoline[PAGE_SIZE];
 };
 
 #define CPU_ENTRY_AREA_PAGES (sizeof(struct cpu_entry_area) / PAGE_SIZE)
diff --git a/arch/x86/kernel/asm-offsets.c b/arch/x86/kernel/asm-offsets.c
index 55858b277cf6..61b1af88ac07 100644
--- a/arch/x86/kernel/asm-offsets.c
+++ b/arch/x86/kernel/asm-offsets.c
@@ -101,4 +101,5 @@ void common(void) {
 
 	/* Layout info for cpu_entry_area */
 	OFFSET(CPU_ENTRY_AREA_tss, cpu_entry_area, tss);
+	OFFSET(CPU_ENTRY_AREA_entry_trampoline, cpu_entry_area, entry_trampoline);
 }
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 7c82a8a8bfda..5a05db084659 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -507,6 +507,8 @@ DEFINE_PER_CPU(struct cpu_entry_area *, cpu_entry_area);
 static inline void setup_cpu_entry_area(int cpu)
 {
 #ifdef CONFIG_X86_64
+	extern char _entry_trampoline[];
+
 	/* On 64-bit systems, we use a read-only fixmap GDT. */
 	pgprot_t gdt_prot = PAGE_KERNEL_RO;
 #else
@@ -553,6 +555,11 @@ static inline void setup_cpu_entry_area(int cpu)
 #ifdef CONFIG_X86_32
 	this_cpu_write(cpu_entry_area, get_cpu_entry_area(cpu));
 #endif
+
+#ifdef CONFIG_X86_64
+	__set_fixmap(get_cpu_entry_area_index(cpu, entry_trampoline),
+		     __pa_symbol(_entry_trampoline), PAGE_KERNEL_RX);
+#endif
 }
 
 /* Load the original GDT from the per-cpu structure */
@@ -1417,10 +1424,13 @@ static DEFINE_PER_CPU_PAGE_ALIGNED(char, exception_stacks
 /* May not be marked __init: used by software suspend */
 void syscall_init(void)
 {
+	extern char _entry_trampoline[];
+	extern char entry_SYSCALL_64_trampoline[];
+
 	int cpu = smp_processor_id();
 
 	wrmsr(MSR_STAR, 0, (__USER32_CS << 16) | __KERNEL_CS);
-	wrmsrl(MSR_LSTAR, (unsigned long)entry_SYSCALL_64);
+	wrmsrl(MSR_LSTAR, (unsigned long)get_cpu_entry_area(cpu)->entry_trampoline + (entry_SYSCALL_64_trampoline - _entry_trampoline));
 
 #ifdef CONFIG_IA32_EMULATION
 	wrmsrl(MSR_CSTAR, (unsigned long)entry_SYSCALL_compat);
diff --git a/arch/x86/kernel/vmlinux.lds.S b/arch/x86/kernel/vmlinux.lds.S
index a4009fb9be87..2738cfb6c8c8 100644
--- a/arch/x86/kernel/vmlinux.lds.S
+++ b/arch/x86/kernel/vmlinux.lds.S
@@ -107,6 +107,16 @@ SECTIONS
 		SOFTIRQENTRY_TEXT
 		*(.fixup)
 		*(.gnu.warning)
+
+#ifdef CONFIG_X86_64
+		/* Entry trampoline */
+		. = ALIGN(PAGE_SIZE);
+		_entry_trampoline = .;
+		*(.entry_trampoline)
+		. = ALIGN(PAGE_SIZE);
+		ASSERT(. - _entry_trampoline == PAGE_SIZE, "entry trampoline is too big");
+#endif
+
 		/* End of text section */
 		_etext = .;
 	} :text = 0x9090
-- 
2.14.1

^ permalink raw reply	[flat|nested] 113+ messages in thread

* [PATCH 16/43] x86/irq: Remove an old outdated comment about context tracking races
  2017-11-24 17:23 [PATCH 00/43] x86 entry-stack and Kaiser series, 2017/11/24, v2 version Ingo Molnar
                   ` (14 preceding siblings ...)
  2017-11-24 17:23 ` [PATCH 15/43] x86/entry/64: Create a percpu SYSCALL entry trampoline Ingo Molnar
@ 2017-11-24 17:23 ` Ingo Molnar
  2017-11-25 12:05   ` Borislav Petkov
  2017-11-24 17:23 ` [PATCH 17/43] x86/irq/64: In the stack overflow warning, print the offending IP Ingo Molnar
                   ` (27 subsequent siblings)
  43 siblings, 1 reply; 113+ messages in thread
From: Ingo Molnar @ 2017-11-24 17:23 UTC (permalink / raw)
  To: linux-kernel
  Cc: Dave Hansen, Andy Lutomirski, Thomas Gleixner, H . Peter Anvin,
	Peter Zijlstra, Borislav Petkov, Linus Torvalds

From: Andy Lutomirski <luto@kernel.org>

That race has been fixed and code cleaned up for a while now.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Borislav Petkov <bpetkov@suse.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lkml.kernel.org/r/12e75976dbbb7ece2b0a64238f1d3892dfed1e16.1511497875.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/kernel/irq.c | 12 ------------
 1 file changed, 12 deletions(-)

diff --git a/arch/x86/kernel/irq.c b/arch/x86/kernel/irq.c
index 49cfd9fe7589..68e1867cca80 100644
--- a/arch/x86/kernel/irq.c
+++ b/arch/x86/kernel/irq.c
@@ -219,18 +219,6 @@ __visible unsigned int __irq_entry do_IRQ(struct pt_regs *regs)
 	/* high bit used in ret_from_ code  */
 	unsigned vector = ~regs->orig_ax;
 
-	/*
-	 * NB: Unlike exception entries, IRQ entries do not reliably
-	 * handle context tracking in the low-level entry code.  This is
-	 * because syscall entries execute briefly with IRQs on before
-	 * updating context tracking state, so we can take an IRQ from
-	 * kernel mode with CONTEXT_USER.  The low-level entry code only
-	 * updates the context if we came from user mode, so we won't
-	 * switch to CONTEXT_KERNEL.  We'll fix that once the syscall
-	 * code is cleaned up enough that we can cleanly defer enabling
-	 * IRQs.
-	 */
-
 	entering_irq();
 
 	/* entering_irq() tells RCU that we're not quiescent.  Check it. */
-- 
2.14.1

^ permalink raw reply	[flat|nested] 113+ messages in thread

* [PATCH 17/43] x86/irq/64: In the stack overflow warning, print the offending IP
  2017-11-24 17:23 [PATCH 00/43] x86 entry-stack and Kaiser series, 2017/11/24, v2 version Ingo Molnar
                   ` (15 preceding siblings ...)
  2017-11-24 17:23 ` [PATCH 16/43] x86/irq: Remove an old outdated comment about context tracking races Ingo Molnar
@ 2017-11-24 17:23 ` Ingo Molnar
  2017-11-25 12:07   ` Borislav Petkov
  2017-11-24 17:23 ` [PATCH 18/43] x86/entry/64: Move the IST stacks into cpu_entry_area Ingo Molnar
                   ` (26 subsequent siblings)
  43 siblings, 1 reply; 113+ messages in thread
From: Ingo Molnar @ 2017-11-24 17:23 UTC (permalink / raw)
  To: linux-kernel
  Cc: Dave Hansen, Andy Lutomirski, Thomas Gleixner, H . Peter Anvin,
	Peter Zijlstra, Borislav Petkov, Linus Torvalds

From: Andy Lutomirski <luto@kernel.org>

In case something goes wrong with unwind (not unlikely in case of
overflow), print the offending IP where we detected the overflow.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Borislav Petkov <bpetkov@suse.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lkml.kernel.org/r/6fcf700cc5ee884fb739b67d1246ab4185c41409.1511497875.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/kernel/irq_64.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/irq_64.c b/arch/x86/kernel/irq_64.c
index 020efbf5786b..d86e344f5b3d 100644
--- a/arch/x86/kernel/irq_64.c
+++ b/arch/x86/kernel/irq_64.c
@@ -57,10 +57,10 @@ static inline void stack_overflow_check(struct pt_regs *regs)
 	if (regs->sp >= estack_top && regs->sp <= estack_bottom)
 		return;
 
-	WARN_ONCE(1, "do_IRQ(): %s has overflown the kernel stack (cur:%Lx,sp:%lx,irq stk top-bottom:%Lx-%Lx,exception stk top-bottom:%Lx-%Lx)\n",
+	WARN_ONCE(1, "do_IRQ(): %s has overflown the kernel stack (cur:%Lx,sp:%lx,irq stk top-bottom:%Lx-%Lx,exception stk top-bottom:%Lx-%Lx,ip:%pF)\n",
 		current->comm, curbase, regs->sp,
 		irq_stack_top, irq_stack_bottom,
-		estack_top, estack_bottom);
+		estack_top, estack_bottom, (void *)regs->ip);
 
 	if (sysctl_panic_on_stackoverflow)
 		panic("low stack detected by irq handler - check messages\n");
-- 
2.14.1

^ permalink raw reply	[flat|nested] 113+ messages in thread

* [PATCH 18/43] x86/entry/64: Move the IST stacks into cpu_entry_area
  2017-11-24 17:23 [PATCH 00/43] x86 entry-stack and Kaiser series, 2017/11/24, v2 version Ingo Molnar
                   ` (16 preceding siblings ...)
  2017-11-24 17:23 ` [PATCH 17/43] x86/irq/64: In the stack overflow warning, print the offending IP Ingo Molnar
@ 2017-11-24 17:23 ` Ingo Molnar
  2017-11-25 12:34   ` Borislav Petkov
  2017-11-24 17:23 ` [PATCH 19/43] x86/entry/64: Remove the SYSENTER stack canary Ingo Molnar
                   ` (25 subsequent siblings)
  43 siblings, 1 reply; 113+ messages in thread
From: Ingo Molnar @ 2017-11-24 17:23 UTC (permalink / raw)
  To: linux-kernel
  Cc: Dave Hansen, Andy Lutomirski, Thomas Gleixner, H . Peter Anvin,
	Peter Zijlstra, Borislav Petkov, Linus Torvalds

From: Andy Lutomirski <luto@kernel.org>

The IST stacks are needed when an IST exception occurs and are
accessed before any kernel code at all runs.  Move them into
cpu_entry_area.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Borislav Petkov <bpetkov@suse.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lkml.kernel.org/r/0ffddccdc0ce1953f950a553142662cf68258fb7.1511497875.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/include/asm/fixmap.h | 10 ++++++++++
 arch/x86/kernel/cpu/common.c  | 40 +++++++++++++++++++++++++---------------
 2 files changed, 35 insertions(+), 15 deletions(-)

diff --git a/arch/x86/include/asm/fixmap.h b/arch/x86/include/asm/fixmap.h
index 7eb1b5490395..15cf010225c9 100644
--- a/arch/x86/include/asm/fixmap.h
+++ b/arch/x86/include/asm/fixmap.h
@@ -60,6 +60,16 @@ struct cpu_entry_area {
 	struct tss_struct tss;
 
 	char entry_trampoline[PAGE_SIZE];
+
+#ifdef CONFIG_X86_64
+	/*
+	 * Exception stacks used for IST entries.
+	 *
+	 * In the future, this should have a separate slot for each stack
+	 * with guard pages between them.
+	 */
+	char exception_stacks[(N_EXCEPTION_STACKS - 1) * EXCEPTION_STKSZ + DEBUG_STKSZ];
+#endif
 };
 
 #define CPU_ENTRY_AREA_PAGES (sizeof(struct cpu_entry_area) / PAGE_SIZE)
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 5a05db084659..6b949e6ea0f9 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -503,6 +503,22 @@ static void set_percpu_fixmap_pages(int fixmap_index, void *ptr, int pages, pgpr
 DEFINE_PER_CPU(struct cpu_entry_area *, cpu_entry_area);
 #endif
 
+#ifdef CONFIG_X86_64
+/*
+ * Special IST stacks which the CPU switches to when it calls
+ * an IST-marked descriptor entry. Up to 7 stacks (hardware
+ * limit), all of them are 4K, except the debug stack which
+ * is 8K.
+ */
+static const unsigned int exception_stack_sizes[N_EXCEPTION_STACKS] = {
+	  [0 ... N_EXCEPTION_STACKS - 1]	= EXCEPTION_STKSZ,
+	  [DEBUG_STACK - 1]			= DEBUG_STKSZ
+};
+
+static DEFINE_PER_CPU_PAGE_ALIGNED(char, exception_stacks
+	[(N_EXCEPTION_STACKS - 1) * EXCEPTION_STKSZ + DEBUG_STKSZ]);
+#endif
+
 /* Setup the fixmap mappings only once per-processor */
 static inline void setup_cpu_entry_area(int cpu)
 {
@@ -557,6 +573,14 @@ static inline void setup_cpu_entry_area(int cpu)
 #endif
 
 #ifdef CONFIG_X86_64
+	BUILD_BUG_ON(sizeof(exception_stacks) % PAGE_SIZE != 0);
+	BUILD_BUG_ON(sizeof(exception_stacks) !=
+		     sizeof(((struct cpu_entry_area *)0)->exception_stacks));
+	set_percpu_fixmap_pages(get_cpu_entry_area_index(cpu, exception_stacks),
+				&per_cpu(exception_stacks, cpu),
+				sizeof(exception_stacks) / PAGE_SIZE,
+				PAGE_KERNEL);
+
 	__set_fixmap(get_cpu_entry_area_index(cpu, entry_trampoline),
 		     __pa_symbol(_entry_trampoline), PAGE_KERNEL_RX);
 #endif
@@ -1407,20 +1431,6 @@ DEFINE_PER_CPU(unsigned int, irq_count) __visible = -1;
 DEFINE_PER_CPU(int, __preempt_count) = INIT_PREEMPT_COUNT;
 EXPORT_PER_CPU_SYMBOL(__preempt_count);
 
-/*
- * Special IST stacks which the CPU switches to when it calls
- * an IST-marked descriptor entry. Up to 7 stacks (hardware
- * limit), all of them are 4K, except the debug stack which
- * is 8K.
- */
-static const unsigned int exception_stack_sizes[N_EXCEPTION_STACKS] = {
-	  [0 ... N_EXCEPTION_STACKS - 1]	= EXCEPTION_STKSZ,
-	  [DEBUG_STACK - 1]			= DEBUG_STKSZ
-};
-
-static DEFINE_PER_CPU_PAGE_ALIGNED(char, exception_stacks
-	[(N_EXCEPTION_STACKS - 1) * EXCEPTION_STKSZ + DEBUG_STKSZ]);
-
 /* May not be marked __init: used by software suspend */
 void syscall_init(void)
 {
@@ -1626,7 +1636,7 @@ void cpu_init(void)
 	 * set up and load the per-CPU TSS
 	 */
 	if (!oist->ist[0]) {
-		char *estacks = per_cpu(exception_stacks, cpu);
+		char *estacks = get_cpu_entry_area(cpu)->exception_stacks;
 
 		for (v = 0; v < N_EXCEPTION_STACKS; v++) {
 			estacks += exception_stack_sizes[v];
-- 
2.14.1

^ permalink raw reply	[flat|nested] 113+ messages in thread

* [PATCH 19/43] x86/entry/64: Remove the SYSENTER stack canary
  2017-11-24 17:23 [PATCH 00/43] x86 entry-stack and Kaiser series, 2017/11/24, v2 version Ingo Molnar
                   ` (17 preceding siblings ...)
  2017-11-24 17:23 ` [PATCH 18/43] x86/entry/64: Move the IST stacks into cpu_entry_area Ingo Molnar
@ 2017-11-24 17:23 ` Ingo Molnar
  2017-11-25 15:29   ` Borislav Petkov
  2017-11-24 17:23 ` [PATCH 20/43] x86/entry: Clean up SYSENTER_stack code Ingo Molnar
                   ` (24 subsequent siblings)
  43 siblings, 1 reply; 113+ messages in thread
From: Ingo Molnar @ 2017-11-24 17:23 UTC (permalink / raw)
  To: linux-kernel
  Cc: Dave Hansen, Andy Lutomirski, Thomas Gleixner, H . Peter Anvin,
	Peter Zijlstra, Borislav Petkov, Linus Torvalds

From: Andy Lutomirski <luto@kernel.org>

Now that the SYSENTER stack has a guard page, there's no need for a
canary to detect overflow after the fact.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Borislav Petkov <bpetkov@suse.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lkml.kernel.org/r/be3179c0a38c392fa44ebeb7dd89391ff5c010c3.1511497875.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/include/asm/processor.h | 1 -
 arch/x86/kernel/dumpstack.c      | 3 +--
 arch/x86/kernel/process.c        | 1 -
 arch/x86/kernel/traps.c          | 7 -------
 4 files changed, 1 insertion(+), 11 deletions(-)

diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 3a09e5571a92..7743aedb82ea 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -337,7 +337,6 @@ struct tss_struct {
 	 * Space for the temporary SYSENTER stack, used for SYSENTER
 	 * and the entry trampoline as well.
 	 */
-	unsigned long		SYSENTER_stack_canary;
 	unsigned long		SYSENTER_stack[64];
 
 	/*
diff --git a/arch/x86/kernel/dumpstack.c b/arch/x86/kernel/dumpstack.c
index bb61919c9335..9ce5fcf7d14d 100644
--- a/arch/x86/kernel/dumpstack.c
+++ b/arch/x86/kernel/dumpstack.c
@@ -48,8 +48,7 @@ bool in_sysenter_stack(unsigned long *stack, struct stack_info *info)
 	int cpu = smp_processor_id();
 	struct tss_struct *tss = &get_cpu_entry_area(cpu)->tss;
 
-	/* Treat the canary as part of the stack for unwinding purposes. */
-	void *begin = &tss->SYSENTER_stack_canary;
+	void *begin = &tss->SYSENTER_stack;
 	void *end = (void *)&tss->SYSENTER_stack + sizeof(tss->SYSENTER_stack);
 
 	if ((void *)stack < begin || (void *)stack >= end)
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 86e83762e3b3..6a04287f222b 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -81,7 +81,6 @@ __visible DEFINE_PER_CPU_SHARED_ALIGNED(struct tss_struct, cpu_tss) = {
 	  */
 	.io_bitmap		= { [0 ... IO_BITMAP_LONGS] = ~0 },
 #endif
-	.SYSENTER_stack_canary	= STACK_END_MAGIC,
 };
 EXPORT_PER_CPU_SYMBOL(cpu_tss);
 
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index cbc4272bb9dd..19475dbff068 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -801,13 +801,6 @@ dotraplinkage void do_debug(struct pt_regs *regs, long error_code)
 	debug_stack_usage_dec();
 
 exit:
-	/*
-	 * This is the most likely code path that involves non-trivial use
-	 * of the SYSENTER stack.  Check that we haven't overrun it.
-	 */
-	WARN(this_cpu_read(cpu_tss.SYSENTER_stack_canary) != STACK_END_MAGIC,
-	     "Overran or corrupted SYSENTER stack\n");
-
 	ist_exit(regs);
 }
 NOKPROBE_SYMBOL(do_debug);
-- 
2.14.1

^ permalink raw reply	[flat|nested] 113+ messages in thread

* [PATCH 20/43] x86/entry: Clean up SYSENTER_stack code
  2017-11-24 17:23 [PATCH 00/43] x86 entry-stack and Kaiser series, 2017/11/24, v2 version Ingo Molnar
                   ` (18 preceding siblings ...)
  2017-11-24 17:23 ` [PATCH 19/43] x86/entry/64: Remove the SYSENTER stack canary Ingo Molnar
@ 2017-11-24 17:23 ` Ingo Molnar
  2017-11-25 16:39   ` Borislav Petkov
  2017-11-24 17:23 ` [PATCH 21/43] x86/mm/kaiser: Disable global pages by default with KAISER Ingo Molnar
                   ` (23 subsequent siblings)
  43 siblings, 1 reply; 113+ messages in thread
From: Ingo Molnar @ 2017-11-24 17:23 UTC (permalink / raw)
  To: linux-kernel
  Cc: Dave Hansen, Andy Lutomirski, Thomas Gleixner, H . Peter Anvin,
	Peter Zijlstra, Borislav Petkov, Linus Torvalds

From: Andy Lutomirski <luto@kernel.org>

The existing code was a mess, mainly because C arrays are nasty.
Turn SYSENTER_stack into a struct, add a helper to find it, and do
all the obvious cleanups this enables.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Borislav Petkov <bpetkov@suse.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lkml.kernel.org/r/38ff640712c9b591b32de24a080daf13afaba234.1511497875.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/entry/entry_32.S        |  4 ++--
 arch/x86/entry/entry_64.S        |  2 +-
 arch/x86/include/asm/fixmap.h    |  5 +++++
 arch/x86/include/asm/processor.h |  6 +++++-
 arch/x86/kernel/asm-offsets.c    |  6 ++----
 arch/x86/kernel/cpu/common.c     | 14 +++-----------
 arch/x86/kernel/dumpstack.c      |  7 +++----
 7 files changed, 21 insertions(+), 23 deletions(-)

diff --git a/arch/x86/entry/entry_32.S b/arch/x86/entry/entry_32.S
index 0ab316c46806..3629bcbf85a2 100644
--- a/arch/x86/entry/entry_32.S
+++ b/arch/x86/entry/entry_32.S
@@ -942,7 +942,7 @@ ENTRY(debug)
 
 	/* Are we currently on the SYSENTER stack? */
 	movl	PER_CPU_VAR(cpu_entry_area), %ecx
-	addl	$CPU_ENTRY_AREA_tss + CPU_TSS_SYSENTER_stack + SIZEOF_SYSENTER_stack, %ecx
+	addl	$CPU_ENTRY_AREA_tss + TSS_STRUCT_SYSENTER_stack + SIZEOF_SYSENTER_stack, %ecx
 	subl	%eax, %ecx	/* ecx = (end of SYSENTER_stack) - esp */
 	cmpl	$SIZEOF_SYSENTER_stack, %ecx
 	jb	.Ldebug_from_sysenter_stack
@@ -986,7 +986,7 @@ ENTRY(nmi)
 
 	/* Are we currently on the SYSENTER stack? */
 	movl	PER_CPU_VAR(cpu_entry_area), %ecx
-	addl	$CPU_ENTRY_AREA_tss + CPU_TSS_SYSENTER_stack + SIZEOF_SYSENTER_stack, %ecx
+	addl	$CPU_ENTRY_AREA_tss + TSS_STRUCT_SYSENTER_stack + SIZEOF_SYSENTER_stack, %ecx
 	subl	%eax, %ecx	/* ecx = (end of SYSENTER_stack) - esp */
 	cmpl	$SIZEOF_SYSENTER_stack, %ecx
 	jb	.Lnmi_from_sysenter_stack
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 0cde243b7542..34e3110b0876 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -158,7 +158,7 @@ END(native_usergs_sysret64)
 	_entry_trampoline - CPU_ENTRY_AREA_entry_trampoline(%rip)
 
 /* The top word of the SYSENTER stack is hot and is usable as scratch space. */
-#define RSP_SCRATCH CPU_ENTRY_AREA_tss + CPU_TSS_SYSENTER_stack + \
+#define RSP_SCRATCH CPU_ENTRY_AREA_tss + TSS_STRUCT_SYSENTER_stack + \
 	SIZEOF_SYSENTER_stack - 8 + CPU_ENTRY_AREA
 
 ENTRY(entry_SYSCALL_64_trampoline)
diff --git a/arch/x86/include/asm/fixmap.h b/arch/x86/include/asm/fixmap.h
index 15cf010225c9..ceb04ab0a642 100644
--- a/arch/x86/include/asm/fixmap.h
+++ b/arch/x86/include/asm/fixmap.h
@@ -234,5 +234,10 @@ static inline struct cpu_entry_area *get_cpu_entry_area(int cpu)
 	return (struct cpu_entry_area *)__fix_to_virt(__get_cpu_entry_area_page_index(cpu, 0));
 }
 
+static inline struct SYSENTER_stack *cpu_SYSENTER_stack(int cpu)
+{
+	return &get_cpu_entry_area((cpu))->tss.SYSENTER_stack;
+}
+
 #endif /* !__ASSEMBLY__ */
 #endif /* _ASM_X86_FIXMAP_H */
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 7743aedb82ea..54f3ee3bc8a0 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -332,12 +332,16 @@ struct x86_hw_tss {
 #define IO_BITMAP_OFFSET		(offsetof(struct tss_struct, io_bitmap) - offsetof(struct tss_struct, x86_tss))
 #define INVALID_IO_BITMAP_OFFSET	0x8000
 
+struct SYSENTER_stack {
+	unsigned long		words[64];
+};
+
 struct tss_struct {
 	/*
 	 * Space for the temporary SYSENTER stack, used for SYSENTER
 	 * and the entry trampoline as well.
 	 */
-	unsigned long		SYSENTER_stack[64];
+	struct SYSENTER_stack	SYSENTER_stack;
 
 	/*
 	 * The fixed hardware portion.  This must not cross a page boundary
diff --git a/arch/x86/kernel/asm-offsets.c b/arch/x86/kernel/asm-offsets.c
index 61b1af88ac07..46c0995344aa 100644
--- a/arch/x86/kernel/asm-offsets.c
+++ b/arch/x86/kernel/asm-offsets.c
@@ -94,10 +94,8 @@ void common(void) {
 	BLANK();
 	DEFINE(PTREGS_SIZE, sizeof(struct pt_regs));
 
-	/* Offset from cpu_tss to SYSENTER_stack */
-	OFFSET(CPU_TSS_SYSENTER_stack, tss_struct, SYSENTER_stack);
-	/* Size of SYSENTER_stack */
-	DEFINE(SIZEOF_SYSENTER_stack, sizeof(((struct tss_struct *)0)->SYSENTER_stack));
+	OFFSET(TSS_STRUCT_SYSENTER_stack, tss_struct, SYSENTER_stack);
+	DEFINE(SIZEOF_SYSENTER_stack, sizeof(struct SYSENTER_stack));
 
 	/* Layout info for cpu_entry_area */
 	OFFSET(CPU_ENTRY_AREA_tss, cpu_entry_area, tss);
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 6b949e6ea0f9..f9c7e6852874 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -1332,12 +1332,7 @@ void enable_sep_cpu(void)
 
 	tss->x86_tss.ss1 = __KERNEL_CS;
 	wrmsr(MSR_IA32_SYSENTER_CS, tss->x86_tss.ss1, 0);
-
-	wrmsr(MSR_IA32_SYSENTER_ESP,
-	      (unsigned long)&get_cpu_entry_area(cpu)->tss +
-	      offsetofend(struct tss_struct, SYSENTER_stack),
-	      0);
-
+	wrmsr(MSR_IA32_SYSENTER_ESP, (unsigned long)(cpu_SYSENTER_stack(cpu) + 1), 0);
 	wrmsr(MSR_IA32_SYSENTER_EIP, (unsigned long)entry_SYSENTER_32, 0);
 
 	put_cpu();
@@ -1451,9 +1446,7 @@ void syscall_init(void)
 	 * AMD doesn't allow SYSENTER in long mode (either 32- or 64-bit).
 	 */
 	wrmsrl_safe(MSR_IA32_SYSENTER_CS, (u64)__KERNEL_CS);
-	wrmsrl_safe(MSR_IA32_SYSENTER_ESP,
-		    (unsigned long)&get_cpu_entry_area(cpu)->tss +
-		    offsetofend(struct tss_struct, SYSENTER_stack));
+	wrmsrl_safe(MSR_IA32_SYSENTER_ESP, (unsigned long)(cpu_SYSENTER_stack(cpu) + 1));
 	wrmsrl_safe(MSR_IA32_SYSENTER_EIP, (u64)entry_SYSENTER_compat);
 #else
 	wrmsrl(MSR_CSTAR, (unsigned long)ignore_sysret);
@@ -1670,8 +1663,7 @@ void cpu_init(void)
 	 */
 	set_tss_desc(cpu, &get_cpu_entry_area(cpu)->tss.x86_tss);
 	load_TR_desc();
-	load_sp0((unsigned long)&get_cpu_entry_area(cpu)->tss +
-		 offsetofend(struct tss_struct, SYSENTER_stack));
+	load_sp0((unsigned long)(cpu_SYSENTER_stack(cpu) + 1));
 
 	load_mm_ldt(&init_mm);
 
diff --git a/arch/x86/kernel/dumpstack.c b/arch/x86/kernel/dumpstack.c
index 9ce5fcf7d14d..d58dd121c0af 100644
--- a/arch/x86/kernel/dumpstack.c
+++ b/arch/x86/kernel/dumpstack.c
@@ -45,11 +45,10 @@ bool in_task_stack(unsigned long *stack, struct task_struct *task,
 
 bool in_sysenter_stack(unsigned long *stack, struct stack_info *info)
 {
-	int cpu = smp_processor_id();
-	struct tss_struct *tss = &get_cpu_entry_area(cpu)->tss;
+	struct SYSENTER_stack *ss = cpu_SYSENTER_stack(smp_processor_id());
 
-	void *begin = &tss->SYSENTER_stack;
-	void *end = (void *)&tss->SYSENTER_stack + sizeof(tss->SYSENTER_stack);
+	void *begin = ss;
+	void *end = ss + 1;
 
 	if ((void *)stack < begin || (void *)stack >= end)
 		return false;
-- 
2.14.1

^ permalink raw reply	[flat|nested] 113+ messages in thread

* [PATCH 21/43] x86/mm/kaiser: Disable global pages by default with KAISER
  2017-11-24 17:23 [PATCH 00/43] x86 entry-stack and Kaiser series, 2017/11/24, v2 version Ingo Molnar
                   ` (19 preceding siblings ...)
  2017-11-24 17:23 ` [PATCH 20/43] x86/entry: Clean up SYSENTER_stack code Ingo Molnar
@ 2017-11-24 17:23 ` Ingo Molnar
  2017-11-24 17:23 ` [PATCH 22/43] x86/mm/kaiser: Prepare assembly for entry/exit CR3 switching Ingo Molnar
                   ` (22 subsequent siblings)
  43 siblings, 0 replies; 113+ messages in thread
From: Ingo Molnar @ 2017-11-24 17:23 UTC (permalink / raw)
  To: linux-kernel
  Cc: Dave Hansen, Andy Lutomirski, Thomas Gleixner, H . Peter Anvin,
	Peter Zijlstra, Borislav Petkov, Linus Torvalds

From: Dave Hansen <dave.hansen@linux.intel.com>

Global pages stay in the TLB across context switches.  Since all contexts
share the same kernel mapping, these mappings are marked as global pages
so kernel entries in the TLB are not flushed out on a context switch.

But, even having these entries in the TLB opens up something that an
attacker can use [1].

That means that even when KAISER switches page tables on return to user
space the global pages would stay in the TLB cache.

Disable global pages so that kernel TLB entries can be flushed before
returning to user space. This way, all accesses to kernel addresses from
userspace result in a TLB miss independent of the existence of a kernel
mapping.

Replace _PAGE_GLOBAL by __PAGE_KERNEL_GLOBAL and keep _PAGE_GLOBAL
available so that it can still be used for a few selected kernel mappings
which must be visible to userspace, when KAISER is enabled, like the
entry/exit code and data.

1. The double-page-fault attack:
   http://www.ieee-security.org/TC/SP2013/papers/4977a191.pdf

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bp@suse.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: daniel.gruss@iaik.tugraz.at
Cc: hughd@google.com
Cc: keescook@google.com
Cc: linux-mm@kvack.org
Cc: luto@kernel.org
Cc: michael.schwarz@iaik.tugraz.at
Cc: moritz.lipp@iaik.tugraz.at
Cc: richard.fellner@student.tugraz.at
Link: https://lkml.kernel.org/r/20171123003441.63DDFC6F@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/include/asm/pgtable_types.h | 14 +++++++++++++-
 arch/x86/mm/pageattr.c               | 16 ++++++++--------
 2 files changed, 21 insertions(+), 9 deletions(-)

diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index 9e9b05fc4860..1fc2f22b9002 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -180,8 +180,20 @@ enum page_cache_mode {
 #define PAGE_READONLY_EXEC	__pgprot(_PAGE_PRESENT | _PAGE_USER |	\
 					 _PAGE_ACCESSED)
 
+/*
+ * Disable global pages for anything using the default
+ * __PAGE_KERNEL* macros.  PGE will still be enabled
+ * and _PAGE_GLOBAL may still be used carefully.
+ */
+#ifdef CONFIG_KAISER
+#define __PAGE_KERNEL_GLOBAL	0
+#else
+#define __PAGE_KERNEL_GLOBAL	_PAGE_GLOBAL
+#endif
+
 #define __PAGE_KERNEL_EXEC						\
-	(_PAGE_PRESENT | _PAGE_RW | _PAGE_DIRTY | _PAGE_ACCESSED | _PAGE_GLOBAL)
+	(_PAGE_PRESENT | _PAGE_RW | _PAGE_DIRTY | _PAGE_ACCESSED |	\
+	 __PAGE_KERNEL_GLOBAL)
 #define __PAGE_KERNEL		(__PAGE_KERNEL_EXEC | _PAGE_NX)
 
 #define __PAGE_KERNEL_RO		(__PAGE_KERNEL & ~_PAGE_RW)
diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c
index 3fe68483463c..ffe584fa1f5e 100644
--- a/arch/x86/mm/pageattr.c
+++ b/arch/x86/mm/pageattr.c
@@ -585,9 +585,9 @@ try_preserve_large_page(pte_t *kpte, unsigned long address,
 	 * for the ancient hardware that doesn't support it.
 	 */
 	if (pgprot_val(req_prot) & _PAGE_PRESENT)
-		pgprot_val(req_prot) |= _PAGE_PSE | _PAGE_GLOBAL;
+		pgprot_val(req_prot) |= _PAGE_PSE | __PAGE_KERNEL_GLOBAL;
 	else
-		pgprot_val(req_prot) &= ~(_PAGE_PSE | _PAGE_GLOBAL);
+		pgprot_val(req_prot) &= ~(_PAGE_PSE | __PAGE_KERNEL_GLOBAL);
 
 	req_prot = canon_pgprot(req_prot);
 
@@ -705,9 +705,9 @@ __split_large_page(struct cpa_data *cpa, pte_t *kpte, unsigned long address,
 	 * for the ancient hardware that doesn't support it.
 	 */
 	if (pgprot_val(ref_prot) & _PAGE_PRESENT)
-		pgprot_val(ref_prot) |= _PAGE_GLOBAL;
+		pgprot_val(ref_prot) |= __PAGE_KERNEL_GLOBAL;
 	else
-		pgprot_val(ref_prot) &= ~_PAGE_GLOBAL;
+		pgprot_val(ref_prot) &= ~__PAGE_KERNEL_GLOBAL;
 
 	/*
 	 * Get the target pfn from the original entry:
@@ -938,9 +938,9 @@ static void populate_pte(struct cpa_data *cpa,
 	 * support it.
 	 */
 	if (pgprot_val(pgprot) & _PAGE_PRESENT)
-		pgprot_val(pgprot) |= _PAGE_GLOBAL;
+		pgprot_val(pgprot) |= __PAGE_KERNEL_GLOBAL;
 	else
-		pgprot_val(pgprot) &= ~_PAGE_GLOBAL;
+		pgprot_val(pgprot) &= ~__PAGE_KERNEL_GLOBAL;
 
 	pgprot = canon_pgprot(pgprot);
 
@@ -1242,9 +1242,9 @@ static int __change_page_attr(struct cpa_data *cpa, int primary)
 		 * support it.
 		 */
 		if (pgprot_val(new_prot) & _PAGE_PRESENT)
-			pgprot_val(new_prot) |= _PAGE_GLOBAL;
+			pgprot_val(new_prot) |= __PAGE_KERNEL_GLOBAL;
 		else
-			pgprot_val(new_prot) &= ~_PAGE_GLOBAL;
+			pgprot_val(new_prot) &= ~__PAGE_KERNEL_GLOBAL;
 
 		/*
 		 * We need to keep the pfn from the existing PTE,
-- 
2.14.1

^ permalink raw reply	[flat|nested] 113+ messages in thread

* [PATCH 22/43] x86/mm/kaiser: Prepare assembly for entry/exit CR3 switching
  2017-11-24 17:23 [PATCH 00/43] x86 entry-stack and Kaiser series, 2017/11/24, v2 version Ingo Molnar
                   ` (20 preceding siblings ...)
  2017-11-24 17:23 ` [PATCH 21/43] x86/mm/kaiser: Disable global pages by default with KAISER Ingo Molnar
@ 2017-11-24 17:23 ` Ingo Molnar
  2017-11-25  0:02   ` Thomas Gleixner
  2017-11-26 11:50   ` Borislav Petkov
  2017-11-24 17:23 ` [PATCH 23/43] x86/mm/kaiser: Introduce user-mapped per-cpu areas Ingo Molnar
                   ` (21 subsequent siblings)
  43 siblings, 2 replies; 113+ messages in thread
From: Ingo Molnar @ 2017-11-24 17:23 UTC (permalink / raw)
  To: linux-kernel
  Cc: Dave Hansen, Andy Lutomirski, Thomas Gleixner, H . Peter Anvin,
	Peter Zijlstra, Borislav Petkov, Linus Torvalds

From: Dave Hansen <dave.hansen@linux.intel.com>

This is largely code from Andy Lutomirski.  I fixed a few bugs
in it, and added a few SWITCH_TO_* spots.

KAISER needs to switch to a different CR3 value when it enters
the kernel and switch back when it exits.  This essentially
needs to be done before leaving assembly code.

This is extra challenging because the switching context is
tricky: the registers that can be clobbered can vary.  It is also
hard to store things on the stack because there is an established
ABI (ptregs) or the stack is entirely unsafe to use.

This patch establishes a set of macros that allow changing to
the user and kernel CR3 values.

Interactions with SWAPGS: previous versions of the KAISER code
relied on having per-cpu scratch space to save/restore a register
that can be used for the CR3 MOV.  The %GS register is used to
index into our per-cpu space, so SWAPGS *had* to be done before
the CR3 switch.  That scratch space is gone now, but the semantic
that SWAPGS must be done before the CR3 MOV is retained.  This is
good to keep because it is not that hard to do and it allows us
to do things like add per-cpu debugging information to help us
figure out what goes wrong sometimes.

What this does in the NMI code is worth pointing out.  NMIs
can interrupt *any* context and they can also be nested with
NMIs interrupting other NMIs.  The comments below
".Lnmi_from_kernel" explain the format of the stack during this
situation.  Changing the format of this stack is not a fun
exercise: I tried.  Instead of storing the old CR3 value on the
stack, this patch depend on the *regular* register save/restore
mechanism and then uses %r14 to keep CR3 during the NMI.  It is
callee-saved and will not be clobbered by the C NMI handlers that
get called.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: daniel.gruss@iaik.tugraz.at
Cc: hughd@google.com
Cc: keescook@google.com
Cc: linux-mm@kvack.org
Cc: luto@kernel.org
Cc: michael.schwarz@iaik.tugraz.at
Cc: moritz.lipp@iaik.tugraz.at
Cc: richard.fellner@student.tugraz.at
Link: https://lkml.kernel.org/r/20171123003442.2D047A7D@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/entry/calling.h         | 65 ++++++++++++++++++++++++++++++++++++++++
 arch/x86/entry/entry_64.S        | 37 +++++++++++++++++++++--
 arch/x86/entry/entry_64_compat.S | 24 ++++++++++++++-
 3 files changed, 122 insertions(+), 4 deletions(-)

diff --git a/arch/x86/entry/calling.h b/arch/x86/entry/calling.h
index 3fd8bc560fae..e1650da01323 100644
--- a/arch/x86/entry/calling.h
+++ b/arch/x86/entry/calling.h
@@ -1,6 +1,7 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 #include <linux/jump_label.h>
 #include <asm/unwind_hints.h>
+#include <asm/cpufeatures.h>
 
 /*
 
@@ -187,6 +188,70 @@ For 32-bit we have the following conventions - kernel is built with
 #endif
 .endm
 
+#ifdef CONFIG_KAISER
+
+/* KAISER PGDs are 8k.  Flip bit 12 to switch between the two halves: */
+#define KAISER_SWITCH_MASK (1<<PAGE_SHIFT)
+
+.macro ADJUST_KERNEL_CR3 reg:req
+	/* Clear "KAISER bit", point CR3 at kernel pagetables: */
+	andq	$(~KAISER_SWITCH_MASK), \reg
+.endm
+
+.macro ADJUST_USER_CR3 reg:req
+	/* Move CR3 up a page to the user page tables: */
+	orq	$(KAISER_SWITCH_MASK), \reg
+.endm
+
+.macro SWITCH_TO_KERNEL_CR3 scratch_reg:req
+	mov	%cr3, \scratch_reg
+	ADJUST_KERNEL_CR3 \scratch_reg
+	mov	\scratch_reg, %cr3
+.endm
+
+.macro SWITCH_TO_USER_CR3 scratch_reg:req
+	mov	%cr3, \scratch_reg
+	ADJUST_USER_CR3 \scratch_reg
+	mov	\scratch_reg, %cr3
+.endm
+
+.macro SAVE_AND_SWITCH_TO_KERNEL_CR3 scratch_reg:req save_reg:req
+	movq	%cr3, %r\scratch_reg
+	movq	%r\scratch_reg, \save_reg
+	/*
+	 * Is the switch bit zero?  This means the address is
+	 * up in real KAISER patches in a moment.
+	 */
+	testq	$(KAISER_SWITCH_MASK), %r\scratch_reg
+	jz	.Ldone_\@
+
+	ADJUST_KERNEL_CR3 %r\scratch_reg
+	movq	%r\scratch_reg, %cr3
+
+.Ldone_\@:
+.endm
+
+.macro RESTORE_CR3 save_reg:req
+	/*
+	 * The CR3 write could be avoided when not changing its value,
+	 * but would require a CR3 read *and* a scratch register.
+	 */
+	movq	\save_reg, %cr3
+.endm
+
+#else /* CONFIG_KAISER=n: */
+
+.macro SWITCH_TO_KERNEL_CR3 scratch_reg:req
+.endm
+.macro SWITCH_TO_USER_CR3 scratch_reg:req
+.endm
+.macro SAVE_AND_SWITCH_TO_KERNEL_CR3 scratch_reg:req save_reg:req
+.endm
+.macro RESTORE_CR3 save_reg:req
+.endm
+
+#endif
+
 #endif /* CONFIG_X86_64 */
 
 /*
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 34e3110b0876..4ac952080869 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -168,6 +168,9 @@ ENTRY(entry_SYSCALL_64_trampoline)
 	/* Stash the user RSP. */
 	movq	%rsp, RSP_SCRATCH
 
+	/* Note: using %rsp as a scratch reg. */
+	SWITCH_TO_KERNEL_CR3 scratch_reg=%rsp
+
 	/* Load the top of the task stack into RSP */
 	movq	CPU_ENTRY_AREA_tss + TSS_sp1 + CPU_ENTRY_AREA, %rsp
 
@@ -198,6 +201,13 @@ ENTRY(entry_SYSCALL_64)
 
 	swapgs
 	movq	%rsp, PER_CPU_VAR(rsp_scratch)
+	/*
+	 * The kernel CR3 is needed to map the process stack, but we
+	 * need a scratch register to be able to load CR3.  %rsp is
+	 * clobberable right now, so use it as a scratch register.
+	 * %rsp will be look crazy here for a couple instructions.
+	 */
+	SWITCH_TO_KERNEL_CR3 scratch_reg=%rsp
 	movq	PER_CPU_VAR(cpu_current_top_of_stack), %rsp
 
 	/* Construct struct pt_regs on stack */
@@ -393,6 +403,7 @@ GLOBAL(entry_SYSCALL_64_after_hwframe)
 	 * We are on the trampoline stack.  All regs except RDI are live.
 	 * We can do future final exit work right here.
 	 */
+	SWITCH_TO_USER_CR3 scratch_reg=%rdi
 
 	popq	%rdi
 	popq	%rsp
@@ -729,6 +740,8 @@ GLOBAL(swapgs_restore_regs_and_return_to_usermode)
 	 * We can do future final exit work right here.
 	 */
 
+	SWITCH_TO_USER_CR3 scratch_reg=%rdi
+
 	/* Restore RDI. */
 	popq	%rdi
 	SWAPGS
@@ -938,6 +951,8 @@ ENTRY(switch_to_thread_stack)
 	UNWIND_HINT_FUNC
 
 	pushq	%rdi
+	/* Need to switch before accessing the thread stack. */
+	SWITCH_TO_KERNEL_CR3 scratch_reg=%rdi
 	movq	%rsp, %rdi
 	movq	PER_CPU_VAR(cpu_current_top_of_stack), %rsp
 	UNWIND_HINT sp_offset=16 sp_reg=ORC_REG_DI
@@ -1239,7 +1254,11 @@ ENTRY(paranoid_entry)
 	js	1f				/* negative -> in kernel */
 	SWAPGS
 	xorl	%ebx, %ebx
-1:	ret
+
+1:
+	SAVE_AND_SWITCH_TO_KERNEL_CR3 scratch_reg=ax save_reg=%r14
+
+	ret
 END(paranoid_entry)
 
 /*
@@ -1261,6 +1280,7 @@ ENTRY(paranoid_exit)
 	testl	%ebx, %ebx			/* swapgs needed? */
 	jnz	.Lparanoid_exit_no_swapgs
 	TRACE_IRQS_IRETQ
+	RESTORE_CR3	%r14
 	SWAPGS_UNSAFE_STACK
 	jmp	.Lparanoid_exit_restore
 .Lparanoid_exit_no_swapgs:
@@ -1288,6 +1308,8 @@ ENTRY(error_entry)
 	 * from user mode due to an IRET fault.
 	 */
 	SWAPGS
+	/* We have user CR3.  Change to kernel CR3. */
+	SWITCH_TO_KERNEL_CR3 scratch_reg=%rax
 
 .Lerror_entry_from_usermode_after_swapgs:
 	/* Put us onto the real thread stack. */
@@ -1333,6 +1355,7 @@ ENTRY(error_entry)
 	 * gsbase and proceed.  We'll fix up the exception and land in
 	 * .Lgs_change's error handler with kernel gsbase.
 	 */
+	SWITCH_TO_KERNEL_CR3 scratch_reg=%rax
 	SWAPGS
 	jmp .Lerror_entry_done
 
@@ -1343,10 +1366,11 @@ ENTRY(error_entry)
 
 .Lerror_bad_iret:
 	/*
-	 * We came from an IRET to user mode, so we have user gsbase.
-	 * Switch to kernel gsbase:
+	 * We came from an IRET to user mode, so we have user
+	 * gsbase and CR3.  Switch to kernel gsbase and CR3:
 	 */
 	SWAPGS
+	SWITCH_TO_KERNEL_CR3 scratch_reg=%rax
 
 	/*
 	 * Pretend that the exception came from user mode: set up pt_regs
@@ -1378,6 +1402,10 @@ END(error_exit)
 /*
  * Runs on exception stack.  Xen PV does not go through this path at all,
  * so we can use real assembly here.
+ *
+ * Registers:
+ *	%r14: Used to save/restore the CR3 of the interrupted context
+ *	      when KAISER is in use.  Do not clobber.
  */
 ENTRY(nmi)
 	UNWIND_HINT_IRET_REGS
@@ -1441,6 +1469,7 @@ ENTRY(nmi)
 
 	swapgs
 	cld
+	SWITCH_TO_KERNEL_CR3 scratch_reg=%rdx
 	movq	%rsp, %rdx
 	movq	PER_CPU_VAR(cpu_current_top_of_stack), %rsp
 	UNWIND_HINT_IRET_REGS base=%rdx offset=8
@@ -1693,6 +1722,8 @@ ENTRY(nmi)
 	movq	$-1, %rsi
 	call	do_nmi
 
+	RESTORE_CR3 save_reg=%r14
+
 	testl	%ebx, %ebx			/* swapgs needed? */
 	jnz	nmi_restore
 nmi_swapgs:
diff --git a/arch/x86/entry/entry_64_compat.S b/arch/x86/entry/entry_64_compat.S
index 95ad40eb7eff..05238b29895e 100644
--- a/arch/x86/entry/entry_64_compat.S
+++ b/arch/x86/entry/entry_64_compat.S
@@ -49,6 +49,10 @@
 ENTRY(entry_SYSENTER_compat)
 	/* Interrupts are off on entry. */
 	SWAPGS
+
+	/* We are about to clobber %rsp anyway, clobbering here is OK */
+	SWITCH_TO_KERNEL_CR3 scratch_reg=%rsp
+
 	movq	PER_CPU_VAR(cpu_current_top_of_stack), %rsp
 
 	/*
@@ -215,6 +219,12 @@ GLOBAL(entry_SYSCALL_compat_after_hwframe)
 	pushq   $0			/* pt_regs->r14 = 0 */
 	pushq   $0			/* pt_regs->r15 = 0 */
 
+	/*
+	 * We just saved %rdi so it is safe to clobber.  It is not
+	 * preserved during the C calls inside TRACE_IRQS_OFF anyway.
+	 */
+	SWITCH_TO_KERNEL_CR3 scratch_reg=%rdi
+
 	/*
 	 * User mode is traced as though IRQs are on, and SYSENTER
 	 * turned them off.
@@ -256,10 +266,22 @@ GLOBAL(entry_SYSCALL_compat_after_hwframe)
 	 * when the system call started, which is already known to user
 	 * code.  We zero R8-R10 to avoid info leaks.
          */
+	movq	RSP-ORIG_RAX(%rsp), %rsp
+
+	/*
+	 * The original userspace %rsp (RSP-ORIG_RAX(%rsp)) is stored
+	 * on the process stack which is not mapped to userspace and
+	 * not readable after we SWITCH_TO_USER_CR3.  Delay the CR3
+	 * switch until after after the last reference to the process
+	 * stack.
+	 *
+	 * %r8 is zeroed before the sysret, thus safe to clobber.
+	 */
+	SWITCH_TO_USER_CR3 scratch_reg=%r8
+
 	xorq	%r8, %r8
 	xorq	%r9, %r9
 	xorq	%r10, %r10
-	movq	RSP-ORIG_RAX(%rsp), %rsp
 	swapgs
 	sysretl
 END(entry_SYSCALL_compat)
-- 
2.14.1

^ permalink raw reply	[flat|nested] 113+ messages in thread

* [PATCH 23/43] x86/mm/kaiser: Introduce user-mapped per-cpu areas
  2017-11-24 17:23 [PATCH 00/43] x86 entry-stack and Kaiser series, 2017/11/24, v2 version Ingo Molnar
                   ` (21 preceding siblings ...)
  2017-11-24 17:23 ` [PATCH 22/43] x86/mm/kaiser: Prepare assembly for entry/exit CR3 switching Ingo Molnar
@ 2017-11-24 17:23 ` Ingo Molnar
  2017-11-26 17:41   ` Borislav Petkov
  2017-11-24 17:23 ` [PATCH 24/43] x86/mm/kaiser: Mark per-cpu data structures required for entry/exit Ingo Molnar
                   ` (20 subsequent siblings)
  43 siblings, 1 reply; 113+ messages in thread
From: Ingo Molnar @ 2017-11-24 17:23 UTC (permalink / raw)
  To: linux-kernel
  Cc: Dave Hansen, Andy Lutomirski, Thomas Gleixner, H . Peter Anvin,
	Peter Zijlstra, Borislav Petkov, Linus Torvalds

From: Dave Hansen <dave.hansen@linux.intel.com>

These patches are based on work from a team at Graz University of
Technology posted here: https://github.com/IAIK/KAISER

The KAISER approach keeps two copies of the page tables: one for running
in the kernel and one for running userspace.  But, there are a few
structures that are needed for switching in and out of the kernel and
a good subset of *those* are per-cpu data.

This patch creates a new kind of per-cpu data that is mapped and
can be used no matter which copy of the page tables is active.
Users of this new section will be forthcoming.

Thanks to Hugh Dickins for cleanups to this code.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: daniel.gruss@iaik.tugraz.at
Cc: hughd@google.com
Cc: keescook@google.com
Cc: linux-mm@kvack.org
Cc: luto@kernel.org
Cc: michael.schwarz@iaik.tugraz.at
Cc: moritz.lipp@iaik.tugraz.at
Cc: richard.fellner@student.tugraz.at
Link: https://lkml.kernel.org/r/20171123003444.196CB6DB@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 include/asm-generic/vmlinux.lds.h |  7 +++++++
 include/linux/percpu-defs.h       | 30 ++++++++++++++++++++++++++++++
 2 files changed, 37 insertions(+)

diff --git a/include/asm-generic/vmlinux.lds.h b/include/asm-generic/vmlinux.lds.h
index bdcd1caae092..e12168936d3f 100644
--- a/include/asm-generic/vmlinux.lds.h
+++ b/include/asm-generic/vmlinux.lds.h
@@ -826,7 +826,14 @@
  */
 #define PERCPU_INPUT(cacheline)						\
 	VMLINUX_SYMBOL(__per_cpu_start) = .;				\
+	VMLINUX_SYMBOL(__per_cpu_user_mapped_start) = .;		\
 	*(.data..percpu..first)						\
+	. = ALIGN(cacheline);						\
+	*(.data..percpu..user_mapped)					\
+	*(.data..percpu..user_mapped..shared_aligned)			\
+	. = ALIGN(PAGE_SIZE);						\
+	*(.data..percpu..user_mapped..page_aligned)			\
+	VMLINUX_SYMBOL(__per_cpu_user_mapped_end) = .;			\
 	. = ALIGN(PAGE_SIZE);						\
 	*(.data..percpu..page_aligned)					\
 	. = ALIGN(cacheline);						\
diff --git a/include/linux/percpu-defs.h b/include/linux/percpu-defs.h
index 2d2096ba1cfe..752513674295 100644
--- a/include/linux/percpu-defs.h
+++ b/include/linux/percpu-defs.h
@@ -35,6 +35,12 @@
 
 #endif
 
+#ifdef CONFIG_KAISER
+#define USER_MAPPED_SECTION "..user_mapped"
+#else
+#define USER_MAPPED_SECTION ""
+#endif
+
 /*
  * Base implementations of per-CPU variable declarations and definitions, where
  * the section in which the variable is to be placed is provided by the
@@ -115,6 +121,12 @@
 #define DEFINE_PER_CPU(type, name)					\
 	DEFINE_PER_CPU_SECTION(type, name, "")
 
+#define DECLARE_PER_CPU_USER_MAPPED(type, name)				\
+	DECLARE_PER_CPU_SECTION(type, name, USER_MAPPED_SECTION)
+
+#define DEFINE_PER_CPU_USER_MAPPED(type, name)				\
+	DEFINE_PER_CPU_SECTION(type, name, USER_MAPPED_SECTION)
+
 /*
  * Declaration/definition used for per-CPU variables that must come first in
  * the set of variables.
@@ -144,6 +156,14 @@
 	DEFINE_PER_CPU_SECTION(type, name, PER_CPU_SHARED_ALIGNED_SECTION) \
 	____cacheline_aligned_in_smp
 
+#define DECLARE_PER_CPU_SHARED_ALIGNED_USER_MAPPED(type, name)		\
+	DECLARE_PER_CPU_SECTION(type, name, USER_MAPPED_SECTION PER_CPU_SHARED_ALIGNED_SECTION) \
+	____cacheline_aligned_in_smp
+
+#define DEFINE_PER_CPU_SHARED_ALIGNED_USER_MAPPED(type, name)		\
+	DEFINE_PER_CPU_SECTION(type, name, USER_MAPPED_SECTION PER_CPU_SHARED_ALIGNED_SECTION) \
+	____cacheline_aligned_in_smp
+
 #define DECLARE_PER_CPU_ALIGNED(type, name)				\
 	DECLARE_PER_CPU_SECTION(type, name, PER_CPU_ALIGNED_SECTION)	\
 	____cacheline_aligned
@@ -162,6 +182,16 @@
 #define DEFINE_PER_CPU_PAGE_ALIGNED(type, name)				\
 	DEFINE_PER_CPU_SECTION(type, name, "..page_aligned")		\
 	__aligned(PAGE_SIZE)
+/*
+ * Declaration/definition used for per-CPU variables that must be page aligned and need to be mapped in user mode.
+ */
+#define DECLARE_PER_CPU_PAGE_ALIGNED_USER_MAPPED(type, name)		\
+	DECLARE_PER_CPU_SECTION(type, name, USER_MAPPED_SECTION"..page_aligned") \
+	__aligned(PAGE_SIZE)
+
+#define DEFINE_PER_CPU_PAGE_ALIGNED_USER_MAPPED(type, name)		\
+	DEFINE_PER_CPU_SECTION(type, name, USER_MAPPED_SECTION"..page_aligned") \
+	__aligned(PAGE_SIZE)
 
 /*
  * Declaration/definition used for per-CPU variables that must be read mostly.
-- 
2.14.1

^ permalink raw reply	[flat|nested] 113+ messages in thread

* [PATCH 24/43] x86/mm/kaiser: Mark per-cpu data structures required for entry/exit
  2017-11-24 17:23 [PATCH 00/43] x86 entry-stack and Kaiser series, 2017/11/24, v2 version Ingo Molnar
                   ` (22 preceding siblings ...)
  2017-11-24 17:23 ` [PATCH 23/43] x86/mm/kaiser: Introduce user-mapped per-cpu areas Ingo Molnar
@ 2017-11-24 17:23 ` Ingo Molnar
  2017-11-25 17:17   ` Thomas Gleixner
  2017-11-24 17:23 ` [PATCH 25/43] x86/mm/kaiser: Unmap kernel from userspace page tables (core patch) Ingo Molnar
                   ` (19 subsequent siblings)
  43 siblings, 1 reply; 113+ messages in thread
From: Ingo Molnar @ 2017-11-24 17:23 UTC (permalink / raw)
  To: linux-kernel
  Cc: Dave Hansen, Andy Lutomirski, Thomas Gleixner, H . Peter Anvin,
	Peter Zijlstra, Borislav Petkov, Linus Torvalds

From: Dave Hansen <dave.hansen@linux.intel.com>

These patches are based on work from a team at Graz University of
Technology posted here: https://github.com/IAIK/KAISER

The KAISER approach keeps two copies of the page tables: one for running
in the kernel and one for running userspace.  But, there are a few
structures that are needed for switching in and out of the kernel and
a good subset of *those* are per-cpu data.

Here's a short summary of the things mapped to userspace:
 * The gdt_page's virtual address is pointed to by the LGDT instruction.
   It is needed to define the segments.  Deeply required by CPU to run.
 * cpu_tss tells the CPU, among other things, where the new stacks are
   after user<->kernel transitions.  Needed by the CPU to make ring
   transitions.
 * exception_stacks are needed at interrupt and exception entry
   so that there is storage for, among other things, some temporary
   space to permit clobbering a register to load the kernel CR3.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: daniel.gruss@iaik.tugraz.at
Cc: hughd@google.com
Cc: keescook@google.com
Cc: linux-mm@kvack.org
Cc: luto@kernel.org
Cc: michael.schwarz@iaik.tugraz.at
Cc: moritz.lipp@iaik.tugraz.at
Cc: richard.fellner@student.tugraz.at
Link: https://lkml.kernel.org/r/20171123003445.DF9EA351@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/include/asm/desc.h      | 2 +-
 arch/x86/include/asm/processor.h | 2 +-
 arch/x86/kernel/cpu/common.c     | 4 ++--
 arch/x86/kernel/process.c        | 2 +-
 4 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/desc.h b/arch/x86/include/asm/desc.h
index aab4fe9f49f8..300090d1c209 100644
--- a/arch/x86/include/asm/desc.h
+++ b/arch/x86/include/asm/desc.h
@@ -46,7 +46,7 @@ struct gdt_page {
 	struct desc_struct gdt[GDT_ENTRIES];
 } __attribute__((aligned(PAGE_SIZE)));
 
-DECLARE_PER_CPU_PAGE_ALIGNED(struct gdt_page, gdt_page);
+DECLARE_PER_CPU_PAGE_ALIGNED_USER_MAPPED(struct gdt_page, gdt_page);
 
 /* Provide the original GDT */
 static inline struct desc_struct *get_cpu_gdt_rw(unsigned int cpu)
diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 54f3ee3bc8a0..83dd7c97ba5d 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -359,7 +359,7 @@ struct tss_struct {
 	unsigned long		io_bitmap[IO_BITMAP_LONGS + 1];
 } __aligned(PAGE_SIZE);
 
-DECLARE_PER_CPU_PAGE_ALIGNED(struct tss_struct, cpu_tss);
+DECLARE_PER_CPU_PAGE_ALIGNED_USER_MAPPED(struct tss_struct, cpu_tss);
 
 /*
  * sizeof(unsigned long) coming from an extra "long" at the end
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index f9c7e6852874..3b6920c9fef7 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -98,7 +98,7 @@ static const struct cpu_dev default_cpu = {
 
 static const struct cpu_dev *this_cpu = &default_cpu;
 
-DEFINE_PER_CPU_PAGE_ALIGNED(struct gdt_page, gdt_page) = { .gdt = {
+DEFINE_PER_CPU_PAGE_ALIGNED_USER_MAPPED(struct gdt_page, gdt_page) = { .gdt = {
 #ifdef CONFIG_X86_64
 	/*
 	 * We need valid kernel segments for data and code in long mode too
@@ -515,7 +515,7 @@ static const unsigned int exception_stack_sizes[N_EXCEPTION_STACKS] = {
 	  [DEBUG_STACK - 1]			= DEBUG_STKSZ
 };
 
-static DEFINE_PER_CPU_PAGE_ALIGNED(char, exception_stacks
+DEFINE_PER_CPU_PAGE_ALIGNED_USER_MAPPED(char, exception_stacks
 	[(N_EXCEPTION_STACKS - 1) * EXCEPTION_STKSZ + DEBUG_STKSZ]);
 #endif
 
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 6a04287f222b..9365b4f965e0 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -47,7 +47,7 @@
  * section. Since TSS's are completely CPU-local, we want them
  * on exact cacheline boundaries, to eliminate cacheline ping-pong.
  */
-__visible DEFINE_PER_CPU_SHARED_ALIGNED(struct tss_struct, cpu_tss) = {
+__visible DEFINE_PER_CPU_SHARED_ALIGNED_USER_MAPPED(struct tss_struct, cpu_tss) = {
 	.x86_tss = {
 		/*
 		 * .sp0 is only used when entering ring 0 from a lower
-- 
2.14.1

^ permalink raw reply	[flat|nested] 113+ messages in thread

* [PATCH 25/43] x86/mm/kaiser: Unmap kernel from userspace page tables (core patch)
  2017-11-24 17:23 [PATCH 00/43] x86 entry-stack and Kaiser series, 2017/11/24, v2 version Ingo Molnar
                   ` (23 preceding siblings ...)
  2017-11-24 17:23 ` [PATCH 24/43] x86/mm/kaiser: Mark per-cpu data structures required for entry/exit Ingo Molnar
@ 2017-11-24 17:23 ` Ingo Molnar
  2017-11-26 18:51   ` Borislav Petkov
                     ` (2 more replies)
  2017-11-24 17:23 ` [PATCH 26/43] x86/mm/kaiser: Allow NX poison to be set in p4d/pgd Ingo Molnar
                   ` (18 subsequent siblings)
  43 siblings, 3 replies; 113+ messages in thread
From: Ingo Molnar @ 2017-11-24 17:23 UTC (permalink / raw)
  To: linux-kernel
  Cc: Dave Hansen, Andy Lutomirski, Thomas Gleixner, H . Peter Anvin,
	Peter Zijlstra, Borislav Petkov, Linus Torvalds

From: Dave Hansen <dave.hansen@linux.intel.com>

These patches are based on work from a team at Graz University of
Technology: https://github.com/IAIK/KAISER .  This work would not have
been possible without their work as a starting point.

KAISER is a countermeasure against side channel attacks against kernel
virtual memory.  It leaves the existing page tables largely alone and
refers to them as the "kernel page tables.  It adds a "shadow" pgd for
every process which is intended for use when running userspace.  The
shadow pgd maps all the same user memory as the "kernel" copy, but
only maps a minimal set of kernel memory.

Whenever entering the kernel (syscalls, interrupts, exceptions), the
pgd is switched to the "kernel" copy.  When switching back to user
mode, the shadow pgd is used.

The minimalistic kernel page tables try to map only what is needed to
enter/exit the kernel such as the entry/exit functions themselves and
the interrupt descriptors (IDT).

=== Page Table Poisoning ===

KAISER has two copies of the page tables: one for the kernel and
one for when running in userspace.  There is also a kernel
portion of each of the page tables: the part that *maps* the
kernel.

The kernel portion is relatively static and uses pre-populated
PGDs.  Nobody ever calls set_pgd() on the kernel portion during
normal operation.

The userspace portion of the page tables is updated frequently as
userspace pages are mapped and page table pages are allocated.
These updates of the userspace *portion* of the tables need to be
reflected into both the kernel and user/shadow copies.

The original KAISER patches did this by effectively looking at the
address that is being updated.  If it is <PAGE_OFFSET, it is
considered to be doing an update for the userspace portion of the page
tables and must make an entry in the shadow.

However, this has a wrinkle: there are a few places where low
addresses are used in supervisor (kernel) mode.  When EFI calls
are made, they use what are traditionally user addresses in
supervisor mode and trip over these checks.  The trampoline code
that used for booting secondary CPUs has a similar issue.

Remember, there are two things that KAISER needs performed on a
userspace PGD:

 1. Populate the shadow itself
 2. Poison the kernel PGD so it can not be used by userspace.

Only perform these actions when dealing with a user address *and* the
PGD has _PAGE_USER set.  That way, in-kernel users of low addresses
typically used by userspace are not accidentally poisoned.

Changes from original KAISER patch:
 * Gobs of coding style cleanups
 * The original patch tried to allocate an order-2 page, then
   8k-align the result.  That's silly since order-2 is already
   guaranteed to be 16k-aligned.  Removed that gunk and just
   allocate an order-1 page.
 * Handle (or at least detect and warn on) allocation failures
 * Use _KERNPG_TABLE, not _PAGE_TABLE when creating mappings for
   the kernel in the shadow (user) page tables.
 * BUG_ON() for !pte_none() case was totally insane: it checked
   the physical address of the 'struct page' against the physical
   address of the page being mapped.
 * Added 5-level page table support
 * Never free kaiser page tables.  We don't have the locking to
   keep them from getting referenced during the freeing process.
 * Use a totally different scheme in the entry code.  The
   original code just fell apart in horrific ways in debug faults,
   NMIs, or when iret faults.  Big thanks to Andy Lutomirski for
   reducing the number of places that needed to be patched.  He
   made the code a ton simpler.
 * Use new entry trampoline instead of mapping process stacks.

Note: The original KAISER authors signed-off on their patch, which
SoB we retained in arch/x86/mm/kaiser.c.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: daniel.gruss@iaik.tugraz.at
Cc: hughd@google.com
Cc: keescook@google.com
Cc: linux-mm@kvack.org
Cc: luto@kernel.org
Cc: michael.schwarz@iaik.tugraz.at
Cc: moritz.lipp@iaik.tugraz.at
Cc: richard.fellner@student.tugraz.at
Link: https://lkml.kernel.org/r/20171123003447.1DB395E3@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 Documentation/x86/kaiser.txt         | 162 +++++++++++++
 arch/x86/boot/compressed/pagetable.c |   6 +
 arch/x86/entry/calling.h             |   1 +
 arch/x86/include/asm/kaiser.h        |  57 +++++
 arch/x86/include/asm/pgtable.h       |   5 +
 arch/x86/include/asm/pgtable_64.h    | 132 +++++++++++
 arch/x86/kernel/espfix_64.c          |  17 ++
 arch/x86/kernel/head_64.S            |  14 +-
 arch/x86/mm/Makefile                 |   1 +
 arch/x86/mm/kaiser.c                 | 441 +++++++++++++++++++++++++++++++++++
 arch/x86/mm/pageattr.c               |   2 +-
 arch/x86/mm/pgtable.c                |  16 +-
 include/linux/kaiser.h               |  29 +++
 init/main.c                          |   3 +
 kernel/fork.c                        |   1 +
 15 files changed, 881 insertions(+), 6 deletions(-)

diff --git a/Documentation/x86/kaiser.txt b/Documentation/x86/kaiser.txt
new file mode 100644
index 000000000000..745c4be39b92
--- /dev/null
+++ b/Documentation/x86/kaiser.txt
@@ -0,0 +1,162 @@
+Overview
+========
+
+KAISER is a countermeasure against attacks on kernel address
+information.  There are at least three existing, published,
+approaches using the shared user/kernel mapping and hardware features
+to defeat KASLR.  One approach referenced in the paper locates the
+kernel by observing differences in page fault timing between
+present-but-inaccessable kernel pages and non-present pages.
+
+When the kernel is entered via syscalls, interrupts or exceptions,
+page tables are switched to the full "kernel" copy.  When the
+system switches back to user mode, the user/shadow copy is used.
+
+The minimalistic kernel portion of the user page tables try to
+map only what is needed to enter/exit the kernel such as the
+entry/exit functions themselves and the interrupt descriptor
+table (IDT).  There are a few unnecessary things that get mapped
+such as the first C function when entering an interrupt (see
+comments in kaiser.c).
+
+This helps to ensure that side-channel attacks that leverage the
+paging structures do not function when KAISER is enabled.  It can be
+enabled by setting CONFIG_KAISER=y
+
+Page Table Management
+=====================
+
+When KAISER is enabled, the kernel manages two sets of page
+tables.  The first copy is very similar to what would be present
+for a kernel without KAISER.  This includes a complete mapping of
+userspace that the kernel can use for things like copy_to_user().
+
+The second (shadow) is used when running userspace and mirrors the
+mapping of userspace present in the kernel copy.  It maps a only
+the kernel data needed to enter and exit the kernel.
+
+The shadow is populated by the kaiser_add_*() functions.  Only
+kernel data which has been explicity mapped will appear in the
+shadow copy.  These calls are rare at runtime.
+
+For a new userspace mapping, the kernel makes the entries in its
+page tables like normal.  The only difference is when the kernel
+makes entries in the top (PGD) level.  In addition to setting the
+entry in the main kernel PGD, a copy if the entry is made in the
+shadow PGD.
+
+For user space mappings the kernel creates an entry in the kernel
+PGD and the same entry in the shadow PGD, so the underlying page
+table to which the PGD entry points is shared down to the PTE
+level.  This leaves a single, shared set of userspace page tables
+to manage.  One PTE to lock, one set set of accessed bits, dirty
+bits, etc...
+
+Overhead
+========
+
+Protection against side-channel attacks is important.  But,
+this protection comes at a cost:
+
+1. Increased Memory Use
+  a. Each process now needs an order-1 PGD instead of order-0.
+     (Consumes 4k per process).
+  b. The pre-allocated second-level (p4d or pud) kernel page
+     table pages cost ~1MB of additional memory at boot.  This
+     is not totally wasted because some of these pages would
+     have been needed eventually for normal kernel page tables
+     and things in the vmalloc() area like vmemmap[].
+  c. Statically-allocated structures and entry/exit text must
+     be padded out to 4k (or 8k for PGDs) so they can be mapped
+     into the user page tables.  This bloats the kernel image
+     by ~20-30k.
+  d. The shadow page tables eventually grow to map all of used
+     vmalloc() space.  They can have roughly the same memory
+     consumption as the vmalloc() page tables.
+
+2. Runtime Cost
+  a. CR3 manipulation to switch between the page table copies
+     must be done at interrupt, syscall, and exception entry
+     and exit (it can be skipped when the kernel is interrupted,
+     though.)  Moves to CR3 are on the order of a hundred
+     cycles, and are required every at entry and every at exit.
+  b. Task stacks must be mapped/unmapped.  We need to walk
+     and modify the shadow page tables at fork() and exit().
+  c. Global pages are disabled.  This feature of the MMU
+     allows different processes to share TLB entries mapping
+     the kernel.  Losing the feature means potentially more
+     TLB misses after a context switch.
+  d. Process Context IDentifiers (PCID) is a CPU feature that
+     allows us to skip flushing the entire TLB when switching
+     page tables.  This makes switching the page tables (at
+     context switch, or kernel entry/exit) cheaper.  But, on
+     systems with PCID support, the context switch code must flush
+     both the user and kernel entries out of the TLB, with an
+     INVPCID in addition to the CR3 write.  This INVPCID is
+     generally slower than a CR3 write, but still on the order of
+     a hundred cycles.
+  e. The shadow page tables must be populated for each new
+     process.  Even without KAISER, the shared kernel mappings
+     are created by copying top-level (PGD) entries into each
+     new process.  But, with KAISER, there are now *two* kernel
+     mappings: one in the kernel page tables that maps everything
+     and one in the user/shadow page tables mapping the "minimal"
+     kernel.  At fork(), a copy of the portion of the shadow PGD
+     that maps the minimal kernel structures is needed in
+     addition to the normal kernel PGD.
+  f. In addition to the fork()-time copying, there must also
+     be an update to the shadow PGD any time a set_pgd() is done
+     on a PGD used to map userspace.  This ensures that the kernel
+     and user/shadow copies always map the same userspace
+     memory.
+  g. On systems without PCID support, each CR3 write flushes
+     the entire TLB.  That means that each syscall, interrupt
+     or exception flushes the TLB.
+
+Possible Future Work:
+1. We can be more careful about not actually writing to CR3
+   unless its value is actually changed.
+2. Compress the user/shadow-mapped data to be mapped together
+   underneath a single PGD entry.
+3. Re-enable global pages, but use them for mappings in the
+   user/shadow page tables.  This would allow the kernel to
+   take advantage of TLB entries that were established from
+   the user page tables.  This might speed up the entry/exit
+   code or userspace since it will not have to reload all of
+   its TLB entries.  However, its upside is limited by PCID
+   being used.
+4. Allow KAISER to enabled/disabled at runtime so folks can
+   run a single kernel image.
+
+Debugging:
+
+Bugs in KAISER cause a few different signatures of crashes
+that are worth noting here.
+
+ * Crashes in early boot, especially around CPU bringup.  Bugs
+   in the trampoline code or mappings cause these.
+ * Crashes at the first interrupt.  Caused by bugs in entry_64.S,
+   like screwing up a page table switch.  Also caused by
+   incorrectly mapping the IRQ handler entry code.
+ * Crashes at the first NMI.  The NMI code is separate from main
+   interrupt handlers and can have bugs that do not affect
+   normal interrupts.  Also caused by incorrectly mapping NMI
+   code.  NMIs that interrupt the entry code must be very
+   careful and can be the cause of crashes that show up when
+   running perf.
+ * Kernel crashes at the first exit to userspace.  entry_64.S
+   bugs, or failing to map some of the exit code.
+ * Crashes at first interrupt that interrupts userspace. The paths
+   in entry_64.S that return to userspace are sometimes separate
+   from the ones that return to the kernel.
+ * Double faults: overflowing the kernel stack because of page
+   faults upon page faults.  Caused by touching non-kaiser-mapped
+   data in the entry code, or forgetting to switch to kernel
+   CR3 before calling into C functions which are not kaiser-mapped.
+ * Failures of the selftests/x86 code.  Usually a bug in one of the
+   more obscure corners of entry_64.S
+ * Userspace segfaults early in boot, sometimes manifesting
+   as mount(8) failing to mount the rootfs.  These have
+   tended to be TLB invalidation issues.  Usually invalidating
+   the wrong PCID, or otherwise missing an invalidation.
+
diff --git a/arch/x86/boot/compressed/pagetable.c b/arch/x86/boot/compressed/pagetable.c
index d5364ca2e3f9..6b40804c477c 100644
--- a/arch/x86/boot/compressed/pagetable.c
+++ b/arch/x86/boot/compressed/pagetable.c
@@ -36,6 +36,12 @@
 /* Used by pgtable.h asm code to force instruction serialization. */
 unsigned long __force_order;
 
+/*
+ * We share the kernel_ident_mapping_init(), but the early boot
+ * version does not need the Kaiser-logic:
+ */
+int kaiser_enabled = 0;
+
 /* Used to track our page table allocation area. */
 struct alloc_pgt_data {
 	unsigned char *pgt_buf;
diff --git a/arch/x86/entry/calling.h b/arch/x86/entry/calling.h
index e1650da01323..d087c3aa0514 100644
--- a/arch/x86/entry/calling.h
+++ b/arch/x86/entry/calling.h
@@ -2,6 +2,7 @@
 #include <linux/jump_label.h>
 #include <asm/unwind_hints.h>
 #include <asm/cpufeatures.h>
+#include <asm/page_types.h>
 
 /*
 
diff --git a/arch/x86/include/asm/kaiser.h b/arch/x86/include/asm/kaiser.h
new file mode 100644
index 000000000000..3c2cc71b4058
--- /dev/null
+++ b/arch/x86/include/asm/kaiser.h
@@ -0,0 +1,57 @@
+#ifndef _ASM_X86_KAISER_H
+#define _ASM_X86_KAISER_H
+/*
+ * Copyright(c) 2017 Intel Corporation. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * Based on work published here: https://github.com/IAIK/KAISER
+ * Modified by Dave Hansen <dave.hansen@intel.com to actually work.
+ */
+#ifndef __ASSEMBLY__
+
+#ifdef CONFIG_KAISER
+/**
+ *  kaiser_add_mapping - map a kernel range into the user page tables
+ *  @addr: the start address of the range
+ *  @size: the size of the range
+ *  @flags: The mapping flags of the pages
+ *
+ *  Use this on all data and code that need to be mapped into both
+ *  copies of the page tables.  This includes the code that switches
+ *  to/from userspace and all of the hardware structures that are
+ *  virtually-addressed and needed in userspace like the interrupt
+ *  table.
+ */
+extern int kaiser_add_mapping(unsigned long addr, unsigned long size,
+			      unsigned long flags);
+
+/**
+ *  kaiser_remove_mapping - remove a kernel mapping from the userpage tables
+ *  @addr: the start address of the range
+ *  @size: the size of the range
+ */
+extern void kaiser_remove_mapping(unsigned long start, unsigned long size);
+
+/**
+ *  kaiser_init - Initialize the shadow mapping
+ *
+ *  Most parts of the shadow mapping can be mapped upon boot
+ *  time.  Only per-process things like the thread stacks
+ *  or a new LDT have to be mapped at runtime.  These boot-
+ *  time mappings are permanent and never unmapped.
+ */
+extern void kaiser_init(void);
+
+#endif
+
+#endif /* __ASSEMBLY__ */
+
+#endif /* _ASM_X86_KAISER_H */
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index f735c3016325..d3901124143f 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -1106,6 +1106,11 @@ static inline void pmdp_set_wrprotect(struct mm_struct *mm,
 static inline void clone_pgd_range(pgd_t *dst, pgd_t *src, int count)
 {
        memcpy(dst, src, count * sizeof(pgd_t));
+#ifdef CONFIG_KAISER
+	/* Clone the shadow pgd part as well */
+	memcpy(kernel_to_shadow_pgdp(dst), kernel_to_shadow_pgdp(src),
+	       count * sizeof(pgd_t));
+#endif
 }
 
 #define PTE_SHIFT ilog2(PTRS_PER_PTE)
diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h
index e9f05331e732..c239839e92bd 100644
--- a/arch/x86/include/asm/pgtable_64.h
+++ b/arch/x86/include/asm/pgtable_64.h
@@ -131,9 +131,137 @@ static inline pud_t native_pudp_get_and_clear(pud_t *xp)
 #endif
 }
 
+#ifdef CONFIG_KAISER
+/*
+ * All top-level KAISER page tables are order-1 pages (8k-aligned
+ * and 8k in size).  The kernel one is at the beginning 4k and
+ * the user (shadow) one is in the last 4k.  To switch between
+ * them, you just need to flip the 12th bit in their addresses.
+ */
+#define KAISER_PGTABLE_SWITCH_BIT	PAGE_SHIFT
+
+/*
+ * This generates better code than the inline assembly in
+ * __set_bit().
+ */
+static inline void *ptr_set_bit(void *ptr, int bit)
+{
+	unsigned long __ptr = (unsigned long)ptr;
+
+	__ptr |= (1<<bit);
+	return (void *)__ptr;
+}
+static inline void *ptr_clear_bit(void *ptr, int bit)
+{
+	unsigned long __ptr = (unsigned long)ptr;
+
+	__ptr &= ~(1<<bit);
+	return (void *)__ptr;
+}
+
+static inline pgd_t *kernel_to_shadow_pgdp(pgd_t *pgdp)
+{
+	return ptr_set_bit(pgdp, KAISER_PGTABLE_SWITCH_BIT);
+}
+static inline pgd_t *shadow_to_kernel_pgdp(pgd_t *pgdp)
+{
+	return ptr_clear_bit(pgdp, KAISER_PGTABLE_SWITCH_BIT);
+}
+static inline p4d_t *kernel_to_shadow_p4dp(p4d_t *p4dp)
+{
+	return ptr_set_bit(p4dp, KAISER_PGTABLE_SWITCH_BIT);
+}
+static inline p4d_t *shadow_to_kernel_p4dp(p4d_t *p4dp)
+{
+	return ptr_clear_bit(p4dp, KAISER_PGTABLE_SWITCH_BIT);
+}
+#endif /* CONFIG_KAISER */
+
+/*
+ * Page table pages are page-aligned.  The lower half of the top
+ * level is used for userspace and the top half for the kernel.
+ *
+ * Returns true for parts of the PGD that map userspace and
+ * false for the parts that map the kernel.
+ */
+static inline bool pgdp_maps_userspace(void *__ptr)
+{
+	unsigned long ptr = (unsigned long)__ptr;
+
+	return (ptr & ~PAGE_MASK) < (PAGE_SIZE / 2);
+}
+
+/*
+ * Does this PGD allow access from userspace?
+ */
+static inline bool pgd_userspace_access(pgd_t pgd)
+{
+	return pgd.pgd & _PAGE_USER;
+}
+
+/*
+ * Take a PGD location (pgdp) and a pgd value that needs
+ * to be set there.  Populates the shadow and returns
+ * the resulting PGD that must be set in the kernel copy
+ * of the page tables.
+ */
+static inline pgd_t kaiser_set_shadow_pgd(pgd_t *pgdp, pgd_t pgd)
+{
+#ifdef CONFIG_KAISER
+	if (pgd_userspace_access(pgd)) {
+		if (pgdp_maps_userspace(pgdp)) {
+			/*
+			 * The user/shadow page tables get the full
+			 * PGD, accessible from userspace:
+			 */
+			kernel_to_shadow_pgdp(pgdp)->pgd = pgd.pgd;
+			/*
+			 * For the copy of the pgd that the kernel
+			 * uses, make it unusable to userspace.  This
+			 * ensures if we get out to userspace with the
+			 * wrong CR3 value, userspace will crash
+			 * instead of running.
+			 */
+			pgd.pgd |= _PAGE_NX;
+		}
+	} else if (pgd_userspace_access(*pgdp)) {
+		/*
+		 * We are clearing a _PAGE_USER PGD for which we
+		 * presumably populated the shadow.  We must now
+		 * clear the shadow PGD entry.
+		 */
+		if (pgdp_maps_userspace(pgdp)) {
+			kernel_to_shadow_pgdp(pgdp)->pgd = pgd.pgd;
+		} else {
+			/*
+			 * Attempted to clear a _PAGE_USER PGD which
+			 * is in the kernel porttion of the address
+			 * space.  PGDs are pre-populated and we
+			 * never clear them.
+			 */
+			WARN_ON_ONCE(1);
+		}
+	} else {
+		/*
+		 * _PAGE_USER was not set in either the PGD being set
+		 * or cleared.  All kernel PGDs should be
+		 * pre-populated so this should never happen after
+		 * boot.
+		 */
+	}
+#endif
+	/* return the copy of the PGD we want the kernel to use: */
+	return pgd;
+}
+
+
 static inline void native_set_p4d(p4d_t *p4dp, p4d_t p4d)
 {
+#if defined(CONFIG_KAISER) && !defined(CONFIG_X86_5LEVEL)
+	p4dp->pgd = kaiser_set_shadow_pgd(&p4dp->pgd, p4d.pgd);
+#else /* CONFIG_KAISER */
 	*p4dp = p4d;
+#endif
 }
 
 static inline void native_p4d_clear(p4d_t *p4d)
@@ -147,7 +275,11 @@ static inline void native_p4d_clear(p4d_t *p4d)
 
 static inline void native_set_pgd(pgd_t *pgdp, pgd_t pgd)
 {
+#ifdef CONFIG_KAISER
+	*pgdp = kaiser_set_shadow_pgd(pgdp, pgd);
+#else /* CONFIG_KAISER */
 	*pgdp = pgd;
+#endif
 }
 
 static inline void native_pgd_clear(pgd_t *pgd)
diff --git a/arch/x86/kernel/espfix_64.c b/arch/x86/kernel/espfix_64.c
index 7d7715dde901..4780dba2cc59 100644
--- a/arch/x86/kernel/espfix_64.c
+++ b/arch/x86/kernel/espfix_64.c
@@ -41,6 +41,7 @@
 #include <asm/pgalloc.h>
 #include <asm/setup.h>
 #include <asm/espfix.h>
+#include <asm/kaiser.h>
 
 /*
  * Note: we only need 6*8 = 48 bytes for the espfix stack, but round
@@ -128,6 +129,22 @@ void __init init_espfix_bsp(void)
 	pgd = &init_top_pgt[pgd_index(ESPFIX_BASE_ADDR)];
 	p4d = p4d_alloc(&init_mm, pgd, ESPFIX_BASE_ADDR);
 	p4d_populate(&init_mm, p4d, espfix_pud_page);
+	/*
+	 * Just copy the top-level PGD that is mapping the espfix
+	 * area to ensure it is mapped into the shadow user page
+	 * tables.
+	 *
+	 * For 5-level paging, the espfix pgd was populated when
+	 * kaiser_init() pre-populated all the pgd entries.  The above
+	 * p4d_alloc() would never do anything and the p4d_populate()
+	 * would be done to a p4d already mapped in the userspace pgd.
+	 */
+#ifdef CONFIG_KAISER
+	if (CONFIG_PGTABLE_LEVELS <= 4) {
+		set_pgd(kernel_to_shadow_pgdp(pgd),
+			__pgd(_KERNPG_TABLE | (p4d_pfn(*p4d) << PAGE_SHIFT)));
+	}
+#endif
 
 	/* Randomize the locations */
 	init_espfix_random();
diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
index 7dca675fe78d..43d1cffd1fcf 100644
--- a/arch/x86/kernel/head_64.S
+++ b/arch/x86/kernel/head_64.S
@@ -341,6 +341,14 @@ GLOBAL(early_recursion_flag)
 	.balign	PAGE_SIZE; \
 GLOBAL(name)
 
+#ifdef CONFIG_KAISER
+#define NEXT_PGD_PAGE(name) \
+	.balign 2 * PAGE_SIZE; \
+GLOBAL(name)
+#else
+#define NEXT_PGD_PAGE(name) NEXT_PAGE(name)
+#endif
+
 /* Automate the creation of 1 to 1 mapping pmd entries */
 #define PMDS(START, PERM, COUNT)			\
 	i = 0 ;						\
@@ -350,7 +358,7 @@ GLOBAL(name)
 	.endr
 
 	__INITDATA
-NEXT_PAGE(early_top_pgt)
+NEXT_PGD_PAGE(early_top_pgt)
 	.fill	511,8,0
 #ifdef CONFIG_X86_5LEVEL
 	.quad	level4_kernel_pgt - __START_KERNEL_map + _PAGE_TABLE_NOENC
@@ -364,7 +372,7 @@ NEXT_PAGE(early_dynamic_pgts)
 	.data
 
 #if defined(CONFIG_XEN_PV) || defined(CONFIG_XEN_PVH)
-NEXT_PAGE(init_top_pgt)
+NEXT_PGD_PAGE(init_top_pgt)
 	.quad   level3_ident_pgt - __START_KERNEL_map + _KERNPG_TABLE_NOENC
 	.org    init_top_pgt + PGD_PAGE_OFFSET*8, 0
 	.quad   level3_ident_pgt - __START_KERNEL_map + _KERNPG_TABLE_NOENC
@@ -381,7 +389,7 @@ NEXT_PAGE(level2_ident_pgt)
 	 */
 	PMDS(0, __PAGE_KERNEL_IDENT_LARGE_EXEC, PTRS_PER_PMD)
 #else
-NEXT_PAGE(init_top_pgt)
+NEXT_PGD_PAGE(init_top_pgt)
 	.fill	512,8,0
 #endif
 
diff --git a/arch/x86/mm/Makefile b/arch/x86/mm/Makefile
index 7ba7f3d7f477..1684e8891165 100644
--- a/arch/x86/mm/Makefile
+++ b/arch/x86/mm/Makefile
@@ -46,6 +46,7 @@ obj-$(CONFIG_NUMA_EMU)		+= numa_emulation.o
 obj-$(CONFIG_X86_INTEL_MPX)	+= mpx.o
 obj-$(CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS) += pkeys.o
 obj-$(CONFIG_RANDOMIZE_MEMORY) += kaslr.o
+obj-$(CONFIG_KAISER)		+= kaiser.o
 
 obj-$(CONFIG_AMD_MEM_ENCRYPT)	+= mem_encrypt.o
 obj-$(CONFIG_AMD_MEM_ENCRYPT)	+= mem_encrypt_boot.o
diff --git a/arch/x86/mm/kaiser.c b/arch/x86/mm/kaiser.c
new file mode 100644
index 000000000000..7f7561e9971d
--- /dev/null
+++ b/arch/x86/mm/kaiser.c
@@ -0,0 +1,441 @@
+/*
+ * Copyright(c) 2017 Intel Corporation. All rights reserved.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * This code is based in part on work published here:
+ *
+ *	https://github.com/IAIK/KAISER
+ *
+ * The original work was written by and and signed off by for the Linux
+ * kernel by:
+ *
+ *   Signed-off-by: Richard Fellner <richard.fellner@student.tugraz.at>
+ *   Signed-off-by: Moritz Lipp <moritz.lipp@iaik.tugraz.at>
+ *   Signed-off-by: Daniel Gruss <daniel.gruss@iaik.tugraz.at>
+ *   Signed-off-by: Michael Schwarz <michael.schwarz@iaik.tugraz.at>
+ *
+ * Major changes to the original code by: Dave Hansen <dave.hansen@intel.com>
+ */
+#include <linux/kernel.h>
+#include <linux/errno.h>
+#include <linux/string.h>
+#include <linux/types.h>
+#include <linux/bug.h>
+#include <linux/init.h>
+#include <linux/spinlock.h>
+#include <linux/mm.h>
+#include <linux/uaccess.h>
+
+#include <asm/kaiser.h>
+#include <asm/pgtable.h>
+#include <asm/pgalloc.h>
+#include <asm/tlbflush.h>
+#include <asm/desc.h>
+
+#define KAISER_WALK_ATOMIC  0x1
+
+/*
+ * At runtime, the only things we map are some things for CPU
+ * hotplug, and stacks for new processes.  No two CPUs will ever
+ * be populating the same addresses, so we only need to ensure
+ * that we protect between two CPUs trying to allocate and
+ * populate the same page table page.
+ *
+ * Only take this lock when doing a set_p[4um]d(), but it is not
+ * needed for doing a set_pte().  We assume that only the *owner*
+ * of a given allocation will be doing this for _their_
+ * allocation.
+ *
+ * This ensures that once a system has been running for a while
+ * and there have been stacks all over and these page tables
+ * are fully populated, there will be no further acquisitions of
+ * this lock.
+ */
+static DEFINE_SPINLOCK(shadow_table_allocation_lock);
+
+/*
+ * This is only for walking kernel addresses.  We use it to help
+ * recreate the "shadow" page tables which are used while we are in
+ * userspace.
+ *
+ * This can be called on any kernel memory addresses and will work
+ * with any page sizes and any types: normal linear map memory,
+ * vmalloc(), even kmap().
+ *
+ * Note: this is only used when mapping new *kernel* entries into
+ * the user/shadow page tables.  It is never used for userspace
+ * addresses.
+ *
+ * Returns -1 on error.
+ */
+static inline unsigned long get_pa_from_kernel_map(unsigned long vaddr)
+{
+	pgd_t *pgd;
+	p4d_t *p4d;
+	pud_t *pud;
+	pmd_t *pmd;
+	pte_t *pte;
+
+	/* We should only be asked to walk kernel addresses */
+	if (vaddr < PAGE_OFFSET) {
+		WARN_ON_ONCE(1);
+		return -1;
+	}
+
+	pgd = pgd_offset_k(vaddr);
+	/*
+	 * We made all the kernel PGDs present in kaiser_init().
+	 * We expect them to stay that way.
+	 */
+	if (pgd_none(*pgd)) {
+		WARN_ON_ONCE(1);
+		return -1;
+	}
+	/*
+	 * PGDs are either 512GB or 128TB on all x86_64
+	 * configurations.  We don't handle these.
+	 */
+	BUILD_BUG_ON(pgd_large(*pgd) != 0);
+
+	p4d = p4d_offset(pgd, vaddr);
+	if (p4d_none(*p4d)) {
+		WARN_ON_ONCE(1);
+		return -1;
+	}
+
+	pud = pud_offset(p4d, vaddr);
+	if (pud_none(*pud)) {
+		WARN_ON_ONCE(1);
+		return -1;
+	}
+
+	if (pud_large(*pud))
+		return (pud_pfn(*pud) << PAGE_SHIFT) | (vaddr & ~PUD_PAGE_MASK);
+
+	pmd = pmd_offset(pud, vaddr);
+	if (pmd_none(*pmd)) {
+		WARN_ON_ONCE(1);
+		return -1;
+	}
+
+	if (pmd_large(*pmd))
+		return (pmd_pfn(*pmd) << PAGE_SHIFT) | (vaddr & ~PMD_PAGE_MASK);
+
+	pte = pte_offset_kernel(pmd, vaddr);
+	if (pte_none(*pte)) {
+		WARN_ON_ONCE(1);
+		return -1;
+	}
+
+	return (pte_pfn(*pte) << PAGE_SHIFT) | (vaddr & ~PAGE_MASK);
+}
+
+/*
+ * Walk the shadow copy of the page tables (optionally) trying to
+ * allocate page table pages on the way down.  Does not support
+ * large pages since the data we are mapping is (generally) not
+ * large enough or aligned to 2MB.
+ *
+ * Note: this is only used when mapping *new* kernel data into the
+ * user/shadow page tables.  It is never used for userspace data.
+ *
+ * Returns a pointer to a PTE on success, or NULL on failure.
+ */
+static pte_t *kaiser_shadow_pagetable_walk(unsigned long address,
+					   unsigned long flags)
+{
+	pte_t *pte;
+	pmd_t *pmd;
+	pud_t *pud;
+	p4d_t *p4d;
+	pgd_t *pgd = kernel_to_shadow_pgdp(pgd_offset_k(address));
+	gfp_t gfp = (GFP_KERNEL | __GFP_NOTRACK | __GFP_ZERO);
+
+	if (flags & KAISER_WALK_ATOMIC) {
+		gfp &= ~GFP_KERNEL;
+		gfp |= __GFP_HIGH | __GFP_ATOMIC;
+	}
+
+	if (address < PAGE_OFFSET) {
+		WARN_ONCE(1, "attempt to walk user address\n");
+		return NULL;
+	}
+
+	if (pgd_none(*pgd)) {
+		WARN_ONCE(1, "All shadow pgds should have been populated\n");
+		return NULL;
+	}
+	BUILD_BUG_ON(pgd_large(*pgd) != 0);
+
+	p4d = p4d_offset(pgd, address);
+	BUILD_BUG_ON(p4d_large(*p4d) != 0);
+	if (p4d_none(*p4d)) {
+		unsigned long new_pud_page = __get_free_page(gfp);
+		if (!new_pud_page)
+			return NULL;
+
+		spin_lock(&shadow_table_allocation_lock);
+		if (p4d_none(*p4d))
+			set_p4d(p4d, __p4d(_KERNPG_TABLE | __pa(new_pud_page)));
+		else
+			free_page(new_pud_page);
+		spin_unlock(&shadow_table_allocation_lock);
+	}
+
+	pud = pud_offset(p4d, address);
+	/* The shadow page tables do not use large mappings: */
+	if (pud_large(*pud)) {
+		WARN_ON(1);
+		return NULL;
+	}
+	if (pud_none(*pud)) {
+		unsigned long new_pmd_page = __get_free_page(gfp);
+		if (!new_pmd_page)
+			return NULL;
+
+		spin_lock(&shadow_table_allocation_lock);
+		if (pud_none(*pud))
+			set_pud(pud, __pud(_KERNPG_TABLE | __pa(new_pmd_page)));
+		else
+			free_page(new_pmd_page);
+		spin_unlock(&shadow_table_allocation_lock);
+	}
+
+	pmd = pmd_offset(pud, address);
+	/* The shadow page tables do not use large mappings: */
+	if (pmd_large(*pmd)) {
+		WARN_ON(1);
+		return NULL;
+	}
+	if (pmd_none(*pmd)) {
+		unsigned long new_pte_page = __get_free_page(gfp);
+		if (!new_pte_page)
+			return NULL;
+
+		spin_lock(&shadow_table_allocation_lock);
+		if (pmd_none(*pmd))
+			set_pmd(pmd, __pmd(_KERNPG_TABLE  | __pa(new_pte_page)));
+		else
+			free_page(new_pte_page);
+		spin_unlock(&shadow_table_allocation_lock);
+	}
+
+	pte = pte_offset_kernel(pmd, address);
+	if (pte_flags(*pte) & _PAGE_USER) {
+		WARN_ONCE(1, "attempt to walk to user pte\n");
+		return NULL;
+	}
+	return pte;
+}
+
+/*
+ * Given a kernel address, @__start_addr, copy that mapping into
+ * the user (shadow) page tables.  This may need to allocate page
+ * table pages.
+ */
+int kaiser_add_user_map(const void *__start_addr, unsigned long size,
+			unsigned long flags)
+{
+	pte_t *pte;
+	unsigned long start_addr = (unsigned long)__start_addr;
+	unsigned long address = start_addr & PAGE_MASK;
+	unsigned long end_addr = PAGE_ALIGN(start_addr + size);
+	unsigned long target_address;
+
+	for (; address < end_addr; address += PAGE_SIZE) {
+		target_address = get_pa_from_kernel_map(address);
+		if (target_address == -1)
+			return -EIO;
+
+		pte = kaiser_shadow_pagetable_walk(address, false);
+		/*
+		 * Errors come from either -ENOMEM for a page
+		 * table page, or something screwy that did a
+		 * WARN_ON().  Just return -ENOMEM.
+		 */
+		if (!pte)
+			return -ENOMEM;
+		if (pte_none(*pte)) {
+			set_pte(pte, __pte(flags | target_address));
+		} else {
+			pte_t tmp;
+			/*
+			 * Make a fake, temporary PTE that mimics the
+			 * one we would have created.
+			 */
+			set_pte(&tmp, __pte(flags | target_address));
+			/*
+			 * Warn if the pte that would have been
+			 * created is different from the one that
+			 * was there previously.  In other words,
+			 * we allow the same PTE value to be set,
+			 * but not changed.
+			 */
+			WARN_ON_ONCE(!pte_same(*pte, tmp));
+		}
+	}
+	return 0;
+}
+
+int kaiser_add_user_map_ptrs(const void *__start_addr,
+			     const void *__end_addr,
+			     unsigned long flags)
+{
+	return kaiser_add_user_map(__start_addr,
+				   __end_addr - __start_addr,
+				   flags);
+}
+
+/*
+ * Ensure that the top level of the (shadow) page tables are
+ * entirely populated.  This ensures that all processes that get
+ * forked have the same entries.  This way, we do not have to
+ * ever go set up new entries in older processes.
+ *
+ * Note: we never free these, so there are no updates to them
+ * after this.
+ */
+static void __init kaiser_init_all_pgds(void)
+{
+	pgd_t *pgd;
+	int i;
+
+	pgd = kernel_to_shadow_pgdp(pgd_offset_k(0UL));
+	for (i = PTRS_PER_PGD / 2; i < PTRS_PER_PGD; i++) {
+		/*
+		 * Each PGD entry moves up PGDIR_SIZE bytes through
+		 * the address space, so get the first virtual
+		 * address mapped by PGD #i:
+		 */
+		unsigned long addr = i * PGDIR_SIZE;
+#if CONFIG_PGTABLE_LEVELS > 4
+		p4d_t *p4d = p4d_alloc_one(&init_mm, addr);
+		if (!p4d) {
+			WARN_ON(1);
+			break;
+		}
+		set_pgd(pgd + i, __pgd(_KERNPG_TABLE | __pa(p4d)));
+#else /* CONFIG_PGTABLE_LEVELS <= 4 */
+		pud_t *pud = pud_alloc_one(&init_mm, addr);
+		if (!pud) {
+			WARN_ON(1);
+			break;
+		}
+		set_pgd(pgd + i, __pgd(_KERNPG_TABLE | __pa(pud)));
+#endif /* CONFIG_PGTABLE_LEVELS */
+	}
+}
+
+/*
+ * Page table allocations called by kaiser_add_user_map() can
+ * theoretically fail, but are very unlikely to fail in early boot.
+ * This would at least output a warning before crashing.
+ *
+ * Do the checking and warning in a macro to make it more readable and
+ * preserve line numbers in the warning message that you would not get
+ * with an inline.
+ */
+#define kaiser_add_user_map_early(start, size, flags) do {	\
+	int __ret = kaiser_add_user_map(start, size, flags);	\
+	WARN_ON(__ret);						\
+} while (0)
+
+#define kaiser_add_user_map_ptrs_early(start, end, flags) do {		\
+	int __ret = kaiser_add_user_map_ptrs(start, end, flags);	\
+	WARN_ON(__ret);							\
+} while (0)
+
+extern char __per_cpu_user_mapped_start[], __per_cpu_user_mapped_end[];
+/*
+ * If anything in here fails, we will likely die on one of the
+ * first kernel->user transitions and init will die.  But, we
+ * will have most of the kernel up by then and should be able to
+ * get a clean warning out of it.  If we BUG_ON() here, we run
+ * the risk of being before we have good console output.
+ *
+ * When KAISER is enabled, we remove _PAGE_GLOBAL from all of the
+ * kernel PTE permissions.  This ensures that the TLB entries for
+ * the kernel are not available when in userspace.  However, for
+ * the pages that are available to userspace *anyway*, we might as
+ * well continue to map them _PAGE_GLOBAL and enjoy the potential
+ * performance advantages.
+ */
+void __init kaiser_init(void)
+{
+	int cpu;
+
+	kaiser_init_all_pgds();
+
+	for_each_possible_cpu(cpu) {
+		void *percpu_vaddr = __per_cpu_user_mapped_start +
+				     per_cpu_offset(cpu);
+		unsigned long percpu_sz = __per_cpu_user_mapped_end -
+					  __per_cpu_user_mapped_start;
+		kaiser_add_user_map_early(percpu_vaddr, percpu_sz,
+					  __PAGE_KERNEL | _PAGE_GLOBAL);
+	}
+
+	kaiser_add_user_map_ptrs_early(__entry_text_start, __entry_text_end,
+				       __PAGE_KERNEL_RX | _PAGE_GLOBAL);
+
+	/* the fixed map address of the idt_table */
+	kaiser_add_user_map_early((void *)idt_descr.address,
+				  sizeof(gate_desc) * NR_VECTORS,
+				  __PAGE_KERNEL_RO | _PAGE_GLOBAL);
+}
+
+int kaiser_add_mapping(unsigned long addr, unsigned long size,
+		       unsigned long flags)
+{
+	return kaiser_add_user_map((const void *)addr, size, flags);
+}
+
+void kaiser_remove_mapping(unsigned long start, unsigned long size)
+{
+	unsigned long addr;
+
+	/* The shadow page tables always use small pages: */
+	for (addr = start; addr < start + size; addr += PAGE_SIZE) {
+		/*
+		 * Do an "atomic" walk in case this got called from an atomic
+		 * context.  This should not do any allocations because we
+		 * should only be walking things that are known to be mapped.
+		 */
+		pte_t *pte = kaiser_shadow_pagetable_walk(addr, KAISER_WALK_ATOMIC);
+
+		/*
+		 * We are removing a mapping that should
+		 * exist.  WARN if it was not there:
+		 */
+		if (!pte) {
+			WARN_ON_ONCE(1);
+			continue;
+		}
+
+		pte_clear(&init_mm, addr, pte);
+	}
+	/*
+	 * This ensures that the TLB entries used to map this data are
+	 * no longer usable on *this* CPU.  We theoretically want to
+	 * flush the entries on all CPUs here, but that's too
+	 * expensive right now: this is called to unmap process
+	 * stacks in the exit() path.
+	 *
+	 * This can change if we get to the point where this is not
+	 * in a remotely hot path, like only called via write_ldt().
+	 *
+	 * Note: we could probably also just invalidate the individual
+	 * addresses to take care of *this* PCID and then do a
+	 * tlb_flush_shared_nonglobals() to ensure that all other
+	 * PCIDs get flushed before being used again.
+	 */
+	__native_flush_tlb_global();
+}
diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c
index ffe584fa1f5e..1b3dbf3b3846 100644
--- a/arch/x86/mm/pageattr.c
+++ b/arch/x86/mm/pageattr.c
@@ -859,7 +859,7 @@ static void unmap_pmd_range(pud_t *pud, unsigned long start, unsigned long end)
 			pud_clear(pud);
 }
 
-static void unmap_pud_range(p4d_t *p4d, unsigned long start, unsigned long end)
+void unmap_pud_range(p4d_t *p4d, unsigned long start, unsigned long end)
 {
 	pud_t *pud = pud_offset(p4d, start);
 
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index 17ebc5a978cc..1e47ce734404 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -355,14 +355,26 @@ static inline void _pgd_free(pgd_t *pgd)
 		kmem_cache_free(pgd_cache, pgd);
 }
 #else
+
+#ifdef CONFIG_KAISER
+/*
+ * Instead of one pgd, we aquire two pgds.  Being order-1, it is
+ * both 8k in size and 8k-aligned.  That lets us just flip bit 12
+ * in a pointer to swap between the two 4k halves.
+ */
+#define PGD_ALLOCATION_ORDER 1
+#else
+#define PGD_ALLOCATION_ORDER 0
+#endif
+
 static inline pgd_t *_pgd_alloc(void)
 {
-	return (pgd_t *)__get_free_page(PGALLOC_GFP);
+	return (pgd_t *)__get_free_pages(PGALLOC_GFP, PGD_ALLOCATION_ORDER);
 }
 
 static inline void _pgd_free(pgd_t *pgd)
 {
-	free_page((unsigned long)pgd);
+	free_pages((unsigned long)pgd, PGD_ALLOCATION_ORDER);
 }
 #endif /* CONFIG_X86_PAE */
 
diff --git a/include/linux/kaiser.h b/include/linux/kaiser.h
new file mode 100644
index 000000000000..0fd800efa95c
--- /dev/null
+++ b/include/linux/kaiser.h
@@ -0,0 +1,29 @@
+#ifndef _INCLUDE_KAISER_H
+#define _INCLUDE_KAISER_H
+
+#ifdef CONFIG_KAISER
+#include <asm/kaiser.h>
+#else
+
+/*
+ * These stubs are used whenever CONFIG_KAISER is off, which
+ * includes architectures that support KAISER, but have it
+ * disabled.
+ */
+
+static inline void kaiser_init(void)
+{
+}
+
+static inline void kaiser_remove_mapping(unsigned long start, unsigned long size)
+{
+}
+
+static inline int kaiser_add_mapping(unsigned long addr, unsigned long size,
+				     unsigned long flags)
+{
+	return 0;
+}
+
+#endif /* !CONFIG_KAISER */
+#endif /* _INCLUDE_KAISER_H */
diff --git a/init/main.c b/init/main.c
index 3bdd8da90f69..559bc0a6e9ad 100644
--- a/init/main.c
+++ b/init/main.c
@@ -76,6 +76,7 @@
 #include <linux/slab.h>
 #include <linux/perf_event.h>
 #include <linux/ptrace.h>
+#include <linux/kaiser.h>
 #include <linux/blkdev.h>
 #include <linux/elevator.h>
 #include <linux/sched_clock.h>
@@ -505,6 +506,8 @@ static void __init mm_init(void)
 	pgtable_init();
 	vmalloc_init();
 	ioremap_huge_init();
+	/* This just needs to be done before we first run userspace: */
+	kaiser_init();
 }
 
 asmlinkage __visible void __init start_kernel(void)
diff --git a/kernel/fork.c b/kernel/fork.c
index 07cc743698d3..685202058d65 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -70,6 +70,7 @@
 #include <linux/tsacct_kern.h>
 #include <linux/cn_proc.h>
 #include <linux/freezer.h>
+#include <linux/kaiser.h>
 #include <linux/delayacct.h>
 #include <linux/taskstats_kern.h>
 #include <linux/random.h>
-- 
2.14.1

^ permalink raw reply	[flat|nested] 113+ messages in thread

* [PATCH 26/43] x86/mm/kaiser: Allow NX poison to be set in p4d/pgd
  2017-11-24 17:23 [PATCH 00/43] x86 entry-stack and Kaiser series, 2017/11/24, v2 version Ingo Molnar
                   ` (24 preceding siblings ...)
  2017-11-24 17:23 ` [PATCH 25/43] x86/mm/kaiser: Unmap kernel from userspace page tables (core patch) Ingo Molnar
@ 2017-11-24 17:23 ` Ingo Molnar
  2017-11-24 17:23 ` [PATCH 27/43] x86/mm/kaiser: Make sure static PGDs are 8k in size Ingo Molnar
                   ` (17 subsequent siblings)
  43 siblings, 0 replies; 113+ messages in thread
From: Ingo Molnar @ 2017-11-24 17:23 UTC (permalink / raw)
  To: linux-kernel
  Cc: Dave Hansen, Andy Lutomirski, Thomas Gleixner, H . Peter Anvin,
	Peter Zijlstra, Borislav Petkov, Linus Torvalds

From: Dave Hansen <dave.hansen@linux.intel.com>

The user portion of the kernel page tables use the NX bit to
poison them for userspace.  But, that trips the p4d/pgd_bad()
checks.  Make sure it does not do that.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: daniel.gruss@iaik.tugraz.at
Cc: hughd@google.com
Cc: keescook@google.com
Cc: linux-mm@kvack.org
Cc: luto@kernel.org
Cc: michael.schwarz@iaik.tugraz.at
Cc: moritz.lipp@iaik.tugraz.at
Cc: richard.fellner@student.tugraz.at
Link: https://lkml.kernel.org/r/20171123003448.C6AB3575@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/include/asm/pgtable.h | 14 ++++++++++++--
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index d3901124143f..9cceaf6c0405 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -846,7 +846,12 @@ static inline pud_t *pud_offset(p4d_t *p4d, unsigned long address)
 
 static inline int p4d_bad(p4d_t p4d)
 {
-	return (p4d_flags(p4d) & ~(_KERNPG_TABLE | _PAGE_USER)) != 0;
+	unsigned long ignore_flags = _KERNPG_TABLE | _PAGE_USER;
+
+	if (IS_ENABLED(CONFIG_KAISER))
+		ignore_flags |= _PAGE_NX;
+
+	return (p4d_flags(p4d) & ~ignore_flags) != 0;
 }
 #endif  /* CONFIG_PGTABLE_LEVELS > 3 */
 
@@ -880,7 +885,12 @@ static inline p4d_t *p4d_offset(pgd_t *pgd, unsigned long address)
 
 static inline int pgd_bad(pgd_t pgd)
 {
-	return (pgd_flags(pgd) & ~_PAGE_USER) != _KERNPG_TABLE;
+	unsigned long ignore_flags = _PAGE_USER;
+
+	if (IS_ENABLED(CONFIG_KAISER))
+		ignore_flags |= _PAGE_NX;
+
+	return (pgd_flags(pgd) & ~ignore_flags) != _KERNPG_TABLE;
 }
 
 static inline int pgd_none(pgd_t pgd)
-- 
2.14.1

^ permalink raw reply	[flat|nested] 113+ messages in thread

* [PATCH 27/43] x86/mm/kaiser: Make sure static PGDs are 8k in size
  2017-11-24 17:23 [PATCH 00/43] x86 entry-stack and Kaiser series, 2017/11/24, v2 version Ingo Molnar
                   ` (25 preceding siblings ...)
  2017-11-24 17:23 ` [PATCH 26/43] x86/mm/kaiser: Allow NX poison to be set in p4d/pgd Ingo Molnar
@ 2017-11-24 17:23 ` Ingo Molnar
  2017-11-24 17:23 ` [PATCH 28/43] x86/mm/kaiser: Map cpu entry area Ingo Molnar
                   ` (16 subsequent siblings)
  43 siblings, 0 replies; 113+ messages in thread
From: Ingo Molnar @ 2017-11-24 17:23 UTC (permalink / raw)
  To: linux-kernel
  Cc: Dave Hansen, Andy Lutomirski, Thomas Gleixner, H . Peter Anvin,
	Peter Zijlstra, Borislav Petkov, Linus Torvalds

From: Dave Hansen <dave.hansen@linux.intel.com>

A few PGDs come out of the kernel binary instead of being
allocated dynamically.  Before this patch, they are all
8k-aligned, but they must also be 8k in *size*.

The original KAISER patch did not do this.  It probably just
lucked out that it did not trample over data after the last PGD.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: daniel.gruss@iaik.tugraz.at
Cc: hughd@google.com
Cc: keescook@google.com
Cc: linux-mm@kvack.org
Cc: luto@kernel.org
Cc: michael.schwarz@iaik.tugraz.at
Cc: moritz.lipp@iaik.tugraz.at
Cc: richard.fellner@student.tugraz.at
Link: https://lkml.kernel.org/r/20171123003450.76492124@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/kernel/head_64.S | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
index 43d1cffd1fcf..58087ab1782e 100644
--- a/arch/x86/kernel/head_64.S
+++ b/arch/x86/kernel/head_64.S
@@ -342,11 +342,24 @@ GLOBAL(early_recursion_flag)
 GLOBAL(name)
 
 #ifdef CONFIG_KAISER
+/*
+ * Each PGD needs to be 8k long and 8k aligned.  We do not
+ * ever go out to userspace with these, so we do not
+ * strictly *need* the second page, but this allows us to
+ * have a single set_pgd() implementation that does not
+ * need to worry about whether it has 4k or 8k to work
+ * with.
+ *
+ * This ensures PGDs are 8k long:
+ */
+#define KAISER_USER_PGD_FILL	512
+/* This ensures they are 8k-aligned: */
 #define NEXT_PGD_PAGE(name) \
 	.balign 2 * PAGE_SIZE; \
 GLOBAL(name)
 #else
 #define NEXT_PGD_PAGE(name) NEXT_PAGE(name)
+#define KAISER_USER_PGD_FILL	0
 #endif
 
 /* Automate the creation of 1 to 1 mapping pmd entries */
@@ -365,6 +378,7 @@ NEXT_PGD_PAGE(early_top_pgt)
 #else
 	.quad	level3_kernel_pgt - __START_KERNEL_map + _PAGE_TABLE_NOENC
 #endif
+	.fill	KAISER_USER_PGD_FILL,8,0
 
 NEXT_PAGE(early_dynamic_pgts)
 	.fill	512*EARLY_DYNAMIC_PAGE_TABLES,8,0
@@ -379,6 +393,7 @@ NEXT_PGD_PAGE(init_top_pgt)
 	.org    init_top_pgt + PGD_START_KERNEL*8, 0
 	/* (2^48-(2*1024*1024*1024))/(2^39) = 511 */
 	.quad   level3_kernel_pgt - __START_KERNEL_map + _PAGE_TABLE_NOENC
+	.fill	KAISER_USER_PGD_FILL,8,0
 
 NEXT_PAGE(level3_ident_pgt)
 	.quad	level2_ident_pgt - __START_KERNEL_map + _KERNPG_TABLE_NOENC
@@ -391,6 +406,7 @@ NEXT_PAGE(level2_ident_pgt)
 #else
 NEXT_PGD_PAGE(init_top_pgt)
 	.fill	512,8,0
+	.fill	KAISER_USER_PGD_FILL,8,0
 #endif
 
 #ifdef CONFIG_X86_5LEVEL
-- 
2.14.1

^ permalink raw reply	[flat|nested] 113+ messages in thread

* [PATCH 28/43] x86/mm/kaiser: Map cpu entry area
  2017-11-24 17:23 [PATCH 00/43] x86 entry-stack and Kaiser series, 2017/11/24, v2 version Ingo Molnar
                   ` (26 preceding siblings ...)
  2017-11-24 17:23 ` [PATCH 27/43] x86/mm/kaiser: Make sure static PGDs are 8k in size Ingo Molnar
@ 2017-11-24 17:23 ` Ingo Molnar
  2017-11-25 21:40   ` Thomas Gleixner
  2017-11-24 17:23 ` [PATCH 29/43] x86/mm/kaiser: Map dynamically-allocated LDTs Ingo Molnar
                   ` (15 subsequent siblings)
  43 siblings, 1 reply; 113+ messages in thread
From: Ingo Molnar @ 2017-11-24 17:23 UTC (permalink / raw)
  To: linux-kernel
  Cc: Dave Hansen, Andy Lutomirski, Thomas Gleixner, H . Peter Anvin,
	Peter Zijlstra, Borislav Petkov, Linus Torvalds

From: Dave Hansen <dave.hansen@linux.intel.com>

There is now a special 'struct cpu_entry' area that contains all
of the data needed to enter the kernel.  It's mapped in the fixmap
area and contains:

 * The GDT (hardware segment descriptor)
 * The TSS (thread information structure that points the hardware
   to the various stacks, and contains the entry stack).
 * The entry trampoline code itself
 * The exception stacks (aka IRQ stacks)

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: daniel.gruss@iaik.tugraz.at
Cc: hughd@google.com
Cc: keescook@google.com
Cc: linux-mm@kvack.org
Cc: luto@kernel.org
Cc: michael.schwarz@iaik.tugraz.at
Cc: moritz.lipp@iaik.tugraz.at
Cc: richard.fellner@student.tugraz.at
Link: https://lkml.kernel.org/r/20171123003453.D4CB33A9@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/include/asm/kaiser.h |  6 ++++++
 arch/x86/kernel/cpu/common.c  |  4 ++++
 arch/x86/mm/kaiser.c          | 31 +++++++++++++++++++++++++++++++
 include/linux/kaiser.h        |  3 +++
 4 files changed, 44 insertions(+)

diff --git a/arch/x86/include/asm/kaiser.h b/arch/x86/include/asm/kaiser.h
index 3c2cc71b4058..040cb096d29d 100644
--- a/arch/x86/include/asm/kaiser.h
+++ b/arch/x86/include/asm/kaiser.h
@@ -33,6 +33,12 @@
 extern int kaiser_add_mapping(unsigned long addr, unsigned long size,
 			      unsigned long flags);
 
+/**
+ *  kaiser_add_mapping_cpu_entry - map the cpu entry area
+ *  @cpu: the CPU for which the entry area is being mapped
+ */
+extern void kaiser_add_mapping_cpu_entry(int cpu);
+
 /**
  *  kaiser_remove_mapping - remove a kernel mapping from the userpage tables
  *  @addr: the start address of the range
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 3b6920c9fef7..d6bcf397b00d 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -4,6 +4,7 @@
 #include <linux/kernel.h>
 #include <linux/export.h>
 #include <linux/percpu.h>
+#include <linux/kaiser.h>
 #include <linux/string.h>
 #include <linux/ctype.h>
 #include <linux/delay.h>
@@ -584,6 +585,9 @@ static inline void setup_cpu_entry_area(int cpu)
 	__set_fixmap(get_cpu_entry_area_index(cpu, entry_trampoline),
 		     __pa_symbol(_entry_trampoline), PAGE_KERNEL_RX);
 #endif
+ 	/* CPU 0's mapping is done in kaiser_init() */
+	if (cpu)
+		kaiser_add_mapping_cpu_entry(cpu);
 }
 
 /* Load the original GDT from the per-cpu structure */
diff --git a/arch/x86/mm/kaiser.c b/arch/x86/mm/kaiser.c
index 7f7561e9971d..4665dd724efb 100644
--- a/arch/x86/mm/kaiser.c
+++ b/arch/x86/mm/kaiser.c
@@ -353,6 +353,26 @@ static void __init kaiser_init_all_pgds(void)
 	WARN_ON(__ret);							\
 } while (0)
 
+void kaiser_add_mapping_cpu_entry(int cpu)
+{
+	kaiser_add_user_map_early(get_cpu_gdt_ro(cpu), PAGE_SIZE,
+				  __PAGE_KERNEL_RO);
+
+	/* includes the entry stack */
+	kaiser_add_user_map_early(&get_cpu_entry_area(cpu)->tss,
+				  sizeof(get_cpu_entry_area(cpu)->tss),
+				  __PAGE_KERNEL | _PAGE_GLOBAL);
+
+	/* Entry code, so needs to be EXEC */
+	kaiser_add_user_map_early(&get_cpu_entry_area(cpu)->entry_trampoline,
+				  sizeof(get_cpu_entry_area(cpu)->entry_trampoline),
+				  __PAGE_KERNEL_EXEC | _PAGE_GLOBAL);
+
+	kaiser_add_user_map_early(&get_cpu_entry_area(cpu)->exception_stacks,
+				 sizeof(get_cpu_entry_area(cpu)->exception_stacks),
+				 __PAGE_KERNEL | _PAGE_GLOBAL);
+}
+
 extern char __per_cpu_user_mapped_start[], __per_cpu_user_mapped_end[];
 /*
  * If anything in here fails, we will likely die on one of the
@@ -390,6 +410,17 @@ void __init kaiser_init(void)
 	kaiser_add_user_map_early((void *)idt_descr.address,
 				  sizeof(gate_desc) * NR_VECTORS,
 				  __PAGE_KERNEL_RO | _PAGE_GLOBAL);
+
+	/*
+	 * We delay CPU 0's mappings because these structures are
+	 * created before the page allocator is up.  Deferring it
+	 * until here lets us use the plain page allocator
+	 * unconditionally in the page table code above.
+	 *
+	 * This is OK because kaiser_init() is called long before
+	 * we ever run userspace and need the KAISER mappings.
+	 */
+	kaiser_add_mapping_cpu_entry(0);
 }
 
 int kaiser_add_mapping(unsigned long addr, unsigned long size,
diff --git a/include/linux/kaiser.h b/include/linux/kaiser.h
index 0fd800efa95c..77db4230a0dd 100644
--- a/include/linux/kaiser.h
+++ b/include/linux/kaiser.h
@@ -25,5 +25,8 @@ static inline int kaiser_add_mapping(unsigned long addr, unsigned long size,
 	return 0;
 }
 
+static inline void kaiser_add_mapping_cpu_entry(int cpu)
+{
+}
 #endif /* !CONFIG_KAISER */
 #endif /* _INCLUDE_KAISER_H */
-- 
2.14.1

^ permalink raw reply	[flat|nested] 113+ messages in thread

* [PATCH 29/43] x86/mm/kaiser: Map dynamically-allocated LDTs
  2017-11-24 17:23 [PATCH 00/43] x86 entry-stack and Kaiser series, 2017/11/24, v2 version Ingo Molnar
                   ` (27 preceding siblings ...)
  2017-11-24 17:23 ` [PATCH 28/43] x86/mm/kaiser: Map cpu entry area Ingo Molnar
@ 2017-11-24 17:23 ` Ingo Molnar
  2017-11-24 17:23 ` [PATCH 30/43] x86/mm/kaiser: Map espfix structures Ingo Molnar
                   ` (14 subsequent siblings)
  43 siblings, 0 replies; 113+ messages in thread
From: Ingo Molnar @ 2017-11-24 17:23 UTC (permalink / raw)
  To: linux-kernel
  Cc: Dave Hansen, Andy Lutomirski, Thomas Gleixner, H . Peter Anvin,
	Peter Zijlstra, Borislav Petkov, Linus Torvalds

From: Dave Hansen <dave.hansen@linux.intel.com>

Normally, a process has a NULL mm->context.ldt.  But, there is a
syscall for a process to set a new one.  If a process does that,
the LDT be mapped into the user page tables, just like the
default copy.

The original KAISER patch missed this case.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: daniel.gruss@iaik.tugraz.at
Cc: hughd@google.com
Cc: keescook@google.com
Cc: linux-mm@kvack.org
Cc: luto@kernel.org
Cc: michael.schwarz@iaik.tugraz.at
Cc: moritz.lipp@iaik.tugraz.at
Cc: richard.fellner@student.tugraz.at
Link: https://lkml.kernel.org/r/20171123003455.275397F7@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/kernel/ldt.c | 25 ++++++++++++++++++++-----
 1 file changed, 20 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kernel/ldt.c b/arch/x86/kernel/ldt.c
index 1c1eae961340..d6ab1144fdbf 100644
--- a/arch/x86/kernel/ldt.c
+++ b/arch/x86/kernel/ldt.c
@@ -11,6 +11,7 @@
 #include <linux/gfp.h>
 #include <linux/sched.h>
 #include <linux/string.h>
+#include <linux/kaiser.h>
 #include <linux/mm.h>
 #include <linux/smp.h>
 #include <linux/syscalls.h>
@@ -57,11 +58,21 @@ static void flush_ldt(void *__mm)
 	refresh_ldt_segments();
 }
 
+static void __free_ldt_struct(struct ldt_struct *ldt)
+{
+	if (ldt->nr_entries * LDT_ENTRY_SIZE > PAGE_SIZE)
+		vfree_atomic(ldt->entries);
+	else
+		free_page((unsigned long)ldt->entries);
+	kfree(ldt);
+}
+
 /* The caller must call finalize_ldt_struct on the result. LDT starts zeroed. */
 static struct ldt_struct *alloc_ldt_struct(unsigned int num_entries)
 {
 	struct ldt_struct *new_ldt;
 	unsigned int alloc_size;
+	int ret;
 
 	if (num_entries > LDT_ENTRIES)
 		return NULL;
@@ -89,6 +100,12 @@ static struct ldt_struct *alloc_ldt_struct(unsigned int num_entries)
 		return NULL;
 	}
 
+	ret = kaiser_add_mapping((unsigned long)new_ldt->entries, alloc_size,
+				 __PAGE_KERNEL | _PAGE_GLOBAL);
+	if (ret) {
+		__free_ldt_struct(new_ldt);
+		return NULL;
+	}
 	new_ldt->nr_entries = num_entries;
 	return new_ldt;
 }
@@ -115,12 +132,10 @@ static void free_ldt_struct(struct ldt_struct *ldt)
 	if (likely(!ldt))
 		return;
 
+	kaiser_remove_mapping((unsigned long)ldt->entries,
+			      ldt->nr_entries * LDT_ENTRY_SIZE);
 	paravirt_free_ldt(ldt->entries, ldt->nr_entries);
-	if (ldt->nr_entries * LDT_ENTRY_SIZE > PAGE_SIZE)
-		vfree_atomic(ldt->entries);
-	else
-		free_page((unsigned long)ldt->entries);
-	kfree(ldt);
+	__free_ldt_struct(ldt);
 }
 
 /*
-- 
2.14.1

^ permalink raw reply	[flat|nested] 113+ messages in thread

* [PATCH 30/43] x86/mm/kaiser: Map espfix structures
  2017-11-24 17:23 [PATCH 00/43] x86 entry-stack and Kaiser series, 2017/11/24, v2 version Ingo Molnar
                   ` (28 preceding siblings ...)
  2017-11-24 17:23 ` [PATCH 29/43] x86/mm/kaiser: Map dynamically-allocated LDTs Ingo Molnar
@ 2017-11-24 17:23 ` Ingo Molnar
  2017-11-24 17:23 ` [PATCH 31/43] x86/mm/kaiser: Map entry stack variable Ingo Molnar
                   ` (13 subsequent siblings)
  43 siblings, 0 replies; 113+ messages in thread
From: Ingo Molnar @ 2017-11-24 17:23 UTC (permalink / raw)
  To: linux-kernel
  Cc: Dave Hansen, Andy Lutomirski, Thomas Gleixner, H . Peter Anvin,
	Peter Zijlstra, Borislav Petkov, Linus Torvalds

From: Dave Hansen <dave.hansen@linux.intel.com>

There is some rather arcane code to help when an IRET returns
to 16-bit segments.  It is referred to as the "espfix" code.
This consists of a few per-cpu variables:

	espfix_stack: tells us where the stack is allocated
	  	      (the bottom)
	espfix_waddr: tells us to where %rsp may be pointed
		      (the top)

These are in addition to the stack itself.  All three things must
be mapped for the espfix code to function.

Note: the espfix code runs with a kernel GSBASE, but user
(shadow) page tables.  A switch to the kernel page tables could
be performed instead of mapping these structures, but mapping
them is simpler and less likely to break the assembly.  To switch
over to the kernel copy, additional temporary storage would be
required which is in short supply in this context.

The original KAISER patch missed this case.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: daniel.gruss@iaik.tugraz.at
Cc: hughd@google.com
Cc: keescook@google.com
Cc: linux-mm@kvack.org
Cc: luto@kernel.org
Cc: michael.schwarz@iaik.tugraz.at
Cc: moritz.lipp@iaik.tugraz.at
Cc: richard.fellner@student.tugraz.at
Link: https://lkml.kernel.org/r/20171123003457.EB854D0D@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/kernel/espfix_64.c | 12 +++++++++---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kernel/espfix_64.c b/arch/x86/kernel/espfix_64.c
index 4780dba2cc59..8bb116d73aaa 100644
--- a/arch/x86/kernel/espfix_64.c
+++ b/arch/x86/kernel/espfix_64.c
@@ -33,6 +33,7 @@
 
 #include <linux/init.h>
 #include <linux/init_task.h>
+#include <linux/kaiser.h>
 #include <linux/kernel.h>
 #include <linux/percpu.h>
 #include <linux/gfp.h>
@@ -41,7 +42,6 @@
 #include <asm/pgalloc.h>
 #include <asm/setup.h>
 #include <asm/espfix.h>
-#include <asm/kaiser.h>
 
 /*
  * Note: we only need 6*8 = 48 bytes for the espfix stack, but round
@@ -61,8 +61,8 @@
 #define PGALLOC_GFP (GFP_KERNEL | __GFP_NOTRACK | __GFP_ZERO)
 
 /* This contains the *bottom* address of the espfix stack */
-DEFINE_PER_CPU_READ_MOSTLY(unsigned long, espfix_stack);
-DEFINE_PER_CPU_READ_MOSTLY(unsigned long, espfix_waddr);
+DEFINE_PER_CPU_USER_MAPPED(unsigned long, espfix_stack);
+DEFINE_PER_CPU_USER_MAPPED(unsigned long, espfix_waddr);
 
 /* Initialization mutex - should this be a spinlock? */
 static DEFINE_MUTEX(espfix_init_mutex);
@@ -225,4 +225,10 @@ void init_espfix_ap(int cpu)
 	per_cpu(espfix_stack, cpu) = addr;
 	per_cpu(espfix_waddr, cpu) = (unsigned long)stack_page
 				      + (addr & ~PAGE_MASK);
+	/*
+	 * _PAGE_GLOBAL is not really required.  This is not a hot
+	 * path, but we do it here for consistency.
+	 */
+	kaiser_add_mapping((unsigned long)stack_page, PAGE_SIZE,
+			__PAGE_KERNEL | _PAGE_GLOBAL);
 }
-- 
2.14.1

^ permalink raw reply	[flat|nested] 113+ messages in thread

* [PATCH 31/43] x86/mm/kaiser: Map entry stack variable
  2017-11-24 17:23 [PATCH 00/43] x86 entry-stack and Kaiser series, 2017/11/24, v2 version Ingo Molnar
                   ` (29 preceding siblings ...)
  2017-11-24 17:23 ` [PATCH 30/43] x86/mm/kaiser: Map espfix structures Ingo Molnar
@ 2017-11-24 17:23 ` Ingo Molnar
  2017-11-24 17:24 ` [PATCH 32/43] x86/mm/kaiser: Map virtually-addressed performance monitoring buffers Ingo Molnar
                   ` (12 subsequent siblings)
  43 siblings, 0 replies; 113+ messages in thread
From: Ingo Molnar @ 2017-11-24 17:23 UTC (permalink / raw)
  To: linux-kernel
  Cc: Dave Hansen, Andy Lutomirski, Thomas Gleixner, H . Peter Anvin,
	Peter Zijlstra, Borislav Petkov, Linus Torvalds

From: Dave Hansen <dave.hansen@linux.intel.com>

There are times where the kernel is entered but there is no
safe stack, like at SYSCALL entry.  To obtain a safe stack, we
have to clobber %rsp and store the clobbered value in
'rsp_scratch'.

Map this to userspace to allow us to do this stack switch before
the CR3 switch.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: daniel.gruss@iaik.tugraz.at
Cc: hughd@google.com
Cc: keescook@google.com
Cc: linux-mm@kvack.org
Cc: luto@kernel.org
Cc: michael.schwarz@iaik.tugraz.at
Cc: moritz.lipp@iaik.tugraz.at
Cc: richard.fellner@student.tugraz.at
Link: https://lkml.kernel.org/r/20171123003459.C0FF167A@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/kernel/process_64.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index bafe65b08697..9a0220aa2bf9 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -59,7 +59,7 @@
 #include <asm/unistd_32_ia32.h>
 #endif
 
-__visible DEFINE_PER_CPU(unsigned long, rsp_scratch);
+__visible DEFINE_PER_CPU_USER_MAPPED(unsigned long, rsp_scratch);
 
 /* Prints also some state that isn't saved in the pt_regs */
 void __show_regs(struct pt_regs *regs, int all)
-- 
2.14.1

^ permalink raw reply	[flat|nested] 113+ messages in thread

* [PATCH 32/43] x86/mm/kaiser: Map virtually-addressed performance monitoring buffers
  2017-11-24 17:23 [PATCH 00/43] x86 entry-stack and Kaiser series, 2017/11/24, v2 version Ingo Molnar
                   ` (30 preceding siblings ...)
  2017-11-24 17:23 ` [PATCH 31/43] x86/mm/kaiser: Map entry stack variable Ingo Molnar
@ 2017-11-24 17:24 ` Ingo Molnar
  2017-11-24 17:24 ` [PATCH 33/43] x86/mm: Move CR3 construction functions Ingo Molnar
                   ` (11 subsequent siblings)
  43 siblings, 0 replies; 113+ messages in thread
From: Ingo Molnar @ 2017-11-24 17:24 UTC (permalink / raw)
  To: linux-kernel
  Cc: Dave Hansen, Andy Lutomirski, Thomas Gleixner, H . Peter Anvin,
	Peter Zijlstra, Borislav Petkov, Linus Torvalds

From: Hugh Dickins <hughd@google.com>

[Dave] Add explicit _PAGE_GLOBAL
[Dave] remove KAISER #ifdefs by moving kmalloc() to plain page allocator
[Dave] reword the commit message a bit to be consistent with other patches

The BTS and PEBS buffers both have their virtual addresses
programmed into the hardware.  This means that any access to them
is performed via the page tables.  The times that the hardware
accesses these are entirely dependent on how the performance
monitoring hardware events are set up.  In other words, there is
no way for the kernel to tell when the hardware might access
these buffers.

To avoid perf crashes, place 'debug_store' in the user-mapped
per-cpu area instead of dynamically allocating.  Also use the
page allocator plus kaiser_add_mapping() to keep the BTS and PEBS
buffers user-mapped (that is, present in the user mapping, though
visible only to kernel and hardware).  The PEBS fixup buffer does
not need this treatment.

The need for a user-mapped struct debug_store showed up before doing
any conscious perf testing: in a couple of kernel paging oopses on
Westmere, implicating the debug_store offset of the per-cpu area.

Signed-off-by: Hugh Dickins <hughd@google.com>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: daniel.gruss@iaik.tugraz.at
Cc: keescook@google.com
Cc: linux-mm@kvack.org
Cc: luto@kernel.org
Cc: michael.schwarz@iaik.tugraz.at
Cc: moritz.lipp@iaik.tugraz.at
Cc: richard.fellner@student.tugraz.at
Link: https://lkml.kernel.org/r/20171123003500.7EC0DB4E@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/events/intel/ds.c | 49 ++++++++++++++++++++++++++++++++++------------
 1 file changed, 37 insertions(+), 12 deletions(-)

diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
index 3674a4b6f8bd..b5cf473e443a 100644
--- a/arch/x86/events/intel/ds.c
+++ b/arch/x86/events/intel/ds.c
@@ -1,6 +1,7 @@
 // SPDX-License-Identifier: GPL-2.0
 #include <linux/bitops.h>
 #include <linux/types.h>
+#include <linux/kaiser.h>
 #include <linux/slab.h>
 
 #include <asm/perf_event.h>
@@ -8,6 +9,9 @@
 
 #include "../perf_event.h"
 
+static
+DEFINE_PER_CPU_SHARED_ALIGNED_USER_MAPPED(struct debug_store, cpu_debug_store);
+
 /* The size of a BTS record in bytes: */
 #define BTS_RECORD_SIZE		24
 
@@ -279,6 +283,31 @@ void fini_debug_store_on_cpu(int cpu)
 
 static DEFINE_PER_CPU(void *, insn_buffer);
 
+static void *dsalloc(size_t size, gfp_t flags, int node)
+{
+	unsigned int order = get_order(size);
+	struct page *page;
+	unsigned long addr;
+
+	page = __alloc_pages_node(node, flags | __GFP_ZERO, order);
+	if (!page)
+		return NULL;
+	addr = (unsigned long)page_address(page);
+	if (kaiser_add_mapping(addr, size, __PAGE_KERNEL | _PAGE_GLOBAL) < 0) {
+		__free_pages(page, order);
+		addr = 0;
+	}
+	return (void *)addr;
+}
+
+static void dsfree(const void *buffer, size_t size)
+{
+	if (!buffer)
+		return;
+	kaiser_remove_mapping((unsigned long)buffer, size);
+	free_pages((unsigned long)buffer, get_order(size));
+}
+
 static int alloc_pebs_buffer(int cpu)
 {
 	struct debug_store *ds = per_cpu(cpu_hw_events, cpu).ds;
@@ -289,7 +318,7 @@ static int alloc_pebs_buffer(int cpu)
 	if (!x86_pmu.pebs)
 		return 0;
 
-	buffer = kzalloc_node(x86_pmu.pebs_buffer_size, GFP_KERNEL, node);
+	buffer = dsalloc(x86_pmu.pebs_buffer_size, GFP_KERNEL, node);
 	if (unlikely(!buffer))
 		return -ENOMEM;
 
@@ -300,7 +329,7 @@ static int alloc_pebs_buffer(int cpu)
 	if (x86_pmu.intel_cap.pebs_format < 2) {
 		ibuffer = kzalloc_node(PEBS_FIXUP_SIZE, GFP_KERNEL, node);
 		if (!ibuffer) {
-			kfree(buffer);
+			dsfree(buffer, x86_pmu.pebs_buffer_size);
 			return -ENOMEM;
 		}
 		per_cpu(insn_buffer, cpu) = ibuffer;
@@ -326,7 +355,8 @@ static void release_pebs_buffer(int cpu)
 	kfree(per_cpu(insn_buffer, cpu));
 	per_cpu(insn_buffer, cpu) = NULL;
 
-	kfree((void *)(unsigned long)ds->pebs_buffer_base);
+	dsfree((void *)(unsigned long)ds->pebs_buffer_base,
+			x86_pmu.pebs_buffer_size);
 	ds->pebs_buffer_base = 0;
 }
 
@@ -340,7 +370,7 @@ static int alloc_bts_buffer(int cpu)
 	if (!x86_pmu.bts)
 		return 0;
 
-	buffer = kzalloc_node(BTS_BUFFER_SIZE, GFP_KERNEL | __GFP_NOWARN, node);
+	buffer = dsalloc(BTS_BUFFER_SIZE, GFP_KERNEL | __GFP_NOWARN, node);
 	if (unlikely(!buffer)) {
 		WARN_ONCE(1, "%s: BTS buffer allocation failure\n", __func__);
 		return -ENOMEM;
@@ -366,19 +396,15 @@ static void release_bts_buffer(int cpu)
 	if (!ds || !x86_pmu.bts)
 		return;
 
-	kfree((void *)(unsigned long)ds->bts_buffer_base);
+	dsfree((void *)(unsigned long)ds->bts_buffer_base, BTS_BUFFER_SIZE);
 	ds->bts_buffer_base = 0;
 }
 
 static int alloc_ds_buffer(int cpu)
 {
-	int node = cpu_to_node(cpu);
-	struct debug_store *ds;
-
-	ds = kzalloc_node(sizeof(*ds), GFP_KERNEL, node);
-	if (unlikely(!ds))
-		return -ENOMEM;
+	struct debug_store *ds = per_cpu_ptr(&cpu_debug_store, cpu);
 
+	memset(ds, 0, sizeof(*ds));
 	per_cpu(cpu_hw_events, cpu).ds = ds;
 
 	return 0;
@@ -392,7 +418,6 @@ static void release_ds_buffer(int cpu)
 		return;
 
 	per_cpu(cpu_hw_events, cpu).ds = NULL;
-	kfree(ds);
 }
 
 void release_ds_buffers(void)
-- 
2.14.1

^ permalink raw reply	[flat|nested] 113+ messages in thread

* [PATCH 33/43] x86/mm: Move CR3 construction functions
  2017-11-24 17:23 [PATCH 00/43] x86 entry-stack and Kaiser series, 2017/11/24, v2 version Ingo Molnar
                   ` (31 preceding siblings ...)
  2017-11-24 17:24 ` [PATCH 32/43] x86/mm/kaiser: Map virtually-addressed performance monitoring buffers Ingo Molnar
@ 2017-11-24 17:24 ` Ingo Molnar
  2017-11-24 17:24 ` [PATCH 34/43] x86/mm: Remove hard-coded ASID limit checks Ingo Molnar
                   ` (10 subsequent siblings)
  43 siblings, 0 replies; 113+ messages in thread
From: Ingo Molnar @ 2017-11-24 17:24 UTC (permalink / raw)
  To: linux-kernel
  Cc: Dave Hansen, Andy Lutomirski, Thomas Gleixner, H . Peter Anvin,
	Peter Zijlstra, Borislav Petkov, Linus Torvalds

From: Dave Hansen <dave.hansen@linux.intel.com>

For flushing the TLB, the ASID which has been programmed into the
hardware must be known.  That differs from what is in 'cpu_tlbstate'.

Add functions to transform the 'cpu_tlbstate' values into to the one
programmed into the hardware (CR3).

It's not easy to include mmu_context.h into tlbflush.h, so just move
the CR3 building over to tlbflush.h.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: daniel.gruss@iaik.tugraz.at
Cc: hughd@google.com
Cc: keescook@google.com
Cc: linux-mm@kvack.org
Cc: luto@kernel.org
Cc: michael.schwarz@iaik.tugraz.at
Cc: moritz.lipp@iaik.tugraz.at
Cc: richard.fellner@student.tugraz.at
Link: https://lkml.kernel.org/r/20171123003502.CC87BF47@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/include/asm/mmu_context.h | 29 +----------------------------
 arch/x86/include/asm/tlbflush.h    | 27 +++++++++++++++++++++++++++
 arch/x86/mm/tlb.c                  |  8 ++++----
 3 files changed, 32 insertions(+), 32 deletions(-)

diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
index 6d16d15d09a0..5e1a1ecb65c6 100644
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -281,33 +281,6 @@ static inline bool arch_vma_access_permitted(struct vm_area_struct *vma,
 	return __pkru_allows_pkey(vma_pkey(vma), write);
 }
 
-/*
- * If PCID is on, ASID-aware code paths put the ASID+1 into the PCID
- * bits.  This serves two purposes.  It prevents a nasty situation in
- * which PCID-unaware code saves CR3, loads some other value (with PCID
- * == 0), and then restores CR3, thus corrupting the TLB for ASID 0 if
- * the saved ASID was nonzero.  It also means that any bugs involving
- * loading a PCID-enabled CR3 with CR4.PCIDE off will trigger
- * deterministically.
- */
-
-static inline unsigned long build_cr3(struct mm_struct *mm, u16 asid)
-{
-	if (static_cpu_has(X86_FEATURE_PCID)) {
-		VM_WARN_ON_ONCE(asid > 4094);
-		return __sme_pa(mm->pgd) | (asid + 1);
-	} else {
-		VM_WARN_ON_ONCE(asid != 0);
-		return __sme_pa(mm->pgd);
-	}
-}
-
-static inline unsigned long build_cr3_noflush(struct mm_struct *mm, u16 asid)
-{
-	VM_WARN_ON_ONCE(asid > 4094);
-	return __sme_pa(mm->pgd) | (asid + 1) | CR3_NOFLUSH;
-}
-
 /*
  * This can be used from process context to figure out what the value of
  * CR3 is without needing to do a (slow) __read_cr3().
@@ -317,7 +290,7 @@ static inline unsigned long build_cr3_noflush(struct mm_struct *mm, u16 asid)
  */
 static inline unsigned long __get_current_cr3_fast(void)
 {
-	unsigned long cr3 = build_cr3(this_cpu_read(cpu_tlbstate.loaded_mm),
+	unsigned long cr3 = build_cr3(this_cpu_read(cpu_tlbstate.loaded_mm)->pgd,
 		this_cpu_read(cpu_tlbstate.loaded_mm_asid));
 
 	/* For now, be very restrictive about when this can be called. */
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 509046cfa5ce..df28f1a61afa 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -75,6 +75,33 @@ static inline u64 inc_mm_tlb_gen(struct mm_struct *mm)
 	return new_tlb_gen;
 }
 
+/*
+ * If PCID is on, ASID-aware code paths put the ASID+1 into the PCID
+ * bits.  This serves two purposes.  It prevents a nasty situation in
+ * which PCID-unaware code saves CR3, loads some other value (with PCID
+ * == 0), and then restores CR3, thus corrupting the TLB for ASID 0 if
+ * the saved ASID was nonzero.  It also means that any bugs involving
+ * loading a PCID-enabled CR3 with CR4.PCIDE off will trigger
+ * deterministically.
+ */
+struct pgd_t;
+static inline unsigned long build_cr3(pgd_t *pgd, u16 asid)
+{
+	if (static_cpu_has(X86_FEATURE_PCID)) {
+		VM_WARN_ON_ONCE(asid > 4094);
+		return __sme_pa(pgd) | (asid + 1);
+	} else {
+		VM_WARN_ON_ONCE(asid != 0);
+		return __sme_pa(pgd);
+	}
+}
+
+static inline unsigned long build_cr3_noflush(pgd_t *pgd, u16 asid)
+{
+	VM_WARN_ON_ONCE(asid > 4094);
+	return __sme_pa(pgd) | (asid + 1) | CR3_NOFLUSH;
+}
+
 #ifdef CONFIG_PARAVIRT
 #include <asm/paravirt.h>
 #else
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 3118392cdf75..e629dbda01a0 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -128,7 +128,7 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 	 * isn't free.
 	 */
 #ifdef CONFIG_DEBUG_VM
-	if (WARN_ON_ONCE(__read_cr3() != build_cr3(real_prev, prev_asid))) {
+	if (WARN_ON_ONCE(__read_cr3() != build_cr3(real_prev->pgd, prev_asid))) {
 		/*
 		 * If we were to BUG here, we'd be very likely to kill
 		 * the system so hard that we don't see the call trace.
@@ -195,7 +195,7 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 		if (need_flush) {
 			this_cpu_write(cpu_tlbstate.ctxs[new_asid].ctx_id, next->context.ctx_id);
 			this_cpu_write(cpu_tlbstate.ctxs[new_asid].tlb_gen, next_tlb_gen);
-			write_cr3(build_cr3(next, new_asid));
+			write_cr3(build_cr3(next->pgd, new_asid));
 
 			/*
 			 * NB: This gets called via leave_mm() in the idle path
@@ -208,7 +208,7 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 			trace_tlb_flush_rcuidle(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL);
 		} else {
 			/* The new ASID is already up to date. */
-			write_cr3(build_cr3_noflush(next, new_asid));
+			write_cr3(build_cr3_noflush(next->pgd, new_asid));
 
 			/* See above wrt _rcuidle. */
 			trace_tlb_flush_rcuidle(TLB_FLUSH_ON_TASK_SWITCH, 0);
@@ -288,7 +288,7 @@ void initialize_tlbstate_and_flush(void)
 		!(cr4_read_shadow() & X86_CR4_PCIDE));
 
 	/* Force ASID 0 and force a TLB flush. */
-	write_cr3(build_cr3(mm, 0));
+	write_cr3(build_cr3(mm->pgd, 0));
 
 	/* Reinitialize tlbstate. */
 	this_cpu_write(cpu_tlbstate.loaded_mm_asid, 0);
-- 
2.14.1

^ permalink raw reply	[flat|nested] 113+ messages in thread

* [PATCH 34/43] x86/mm: Remove hard-coded ASID limit checks
  2017-11-24 17:23 [PATCH 00/43] x86 entry-stack and Kaiser series, 2017/11/24, v2 version Ingo Molnar
                   ` (32 preceding siblings ...)
  2017-11-24 17:24 ` [PATCH 33/43] x86/mm: Move CR3 construction functions Ingo Molnar
@ 2017-11-24 17:24 ` Ingo Molnar
  2017-11-24 17:24 ` [PATCH 35/43] x86/mm: Put mmu-to-h/w ASID translation in one place Ingo Molnar
                   ` (9 subsequent siblings)
  43 siblings, 0 replies; 113+ messages in thread
From: Ingo Molnar @ 2017-11-24 17:24 UTC (permalink / raw)
  To: linux-kernel
  Cc: Dave Hansen, Andy Lutomirski, Thomas Gleixner, H . Peter Anvin,
	Peter Zijlstra, Borislav Petkov, Linus Torvalds

From: Dave Hansen <dave.hansen@linux.intel.com>

First, it's nice to remove the magic numbers.

Second, KAISER is going to consume half of the available ASID
space.  The space is currently unused, but add a comment to spell
out this new restriction.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: daniel.gruss@iaik.tugraz.at
Cc: hughd@google.com
Cc: keescook@google.com
Cc: linux-mm@kvack.org
Cc: luto@kernel.org
Cc: michael.schwarz@iaik.tugraz.at
Cc: moritz.lipp@iaik.tugraz.at
Cc: richard.fellner@student.tugraz.at
Link: https://lkml.kernel.org/r/20171123003504.57EDB845@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/include/asm/tlbflush.h | 17 +++++++++++++++--
 1 file changed, 15 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index df28f1a61afa..3101581c5da0 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -75,6 +75,19 @@ static inline u64 inc_mm_tlb_gen(struct mm_struct *mm)
 	return new_tlb_gen;
 }
 
+/* There are 12 bits of space for ASIDS in CR3 */
+#define CR3_HW_ASID_BITS 12
+/* When enabled, KAISER consumes a single bit for user/kernel switches */
+#define KAISER_CONSUMED_ASID_BITS 0
+
+#define CR3_AVAIL_ASID_BITS (CR3_HW_ASID_BITS - KAISER_CONSUMED_ASID_BITS)
+/*
+ * ASIDs are zero-based: 0->MAX_AVAIL_ASID are valid.  -1 below
+ * to account for them being zero-based.  Another -1 is because ASID 0
+ * is reserved for use by non-PCID-aware users.
+ */
+#define MAX_ASID_AVAILABLE ((1<<CR3_AVAIL_ASID_BITS) - 2)
+
 /*
  * If PCID is on, ASID-aware code paths put the ASID+1 into the PCID
  * bits.  This serves two purposes.  It prevents a nasty situation in
@@ -88,7 +101,7 @@ struct pgd_t;
 static inline unsigned long build_cr3(pgd_t *pgd, u16 asid)
 {
 	if (static_cpu_has(X86_FEATURE_PCID)) {
-		VM_WARN_ON_ONCE(asid > 4094);
+		VM_WARN_ON_ONCE(asid > MAX_ASID_AVAILABLE);
 		return __sme_pa(pgd) | (asid + 1);
 	} else {
 		VM_WARN_ON_ONCE(asid != 0);
@@ -98,7 +111,7 @@ static inline unsigned long build_cr3(pgd_t *pgd, u16 asid)
 
 static inline unsigned long build_cr3_noflush(pgd_t *pgd, u16 asid)
 {
-	VM_WARN_ON_ONCE(asid > 4094);
+	VM_WARN_ON_ONCE(asid > MAX_ASID_AVAILABLE);
 	return __sme_pa(pgd) | (asid + 1) | CR3_NOFLUSH;
 }
 
-- 
2.14.1

^ permalink raw reply	[flat|nested] 113+ messages in thread

* [PATCH 35/43] x86/mm: Put mmu-to-h/w ASID translation in one place
  2017-11-24 17:23 [PATCH 00/43] x86 entry-stack and Kaiser series, 2017/11/24, v2 version Ingo Molnar
                   ` (33 preceding siblings ...)
  2017-11-24 17:24 ` [PATCH 34/43] x86/mm: Remove hard-coded ASID limit checks Ingo Molnar
@ 2017-11-24 17:24 ` Ingo Molnar
  2017-11-24 17:24 ` [PATCH 36/43] x86/mm/kaiser: Allow flushing for future ASID switches Ingo Molnar
                   ` (8 subsequent siblings)
  43 siblings, 0 replies; 113+ messages in thread
From: Ingo Molnar @ 2017-11-24 17:24 UTC (permalink / raw)
  To: linux-kernel
  Cc: Dave Hansen, Andy Lutomirski, Thomas Gleixner, H . Peter Anvin,
	Peter Zijlstra, Borislav Petkov, Linus Torvalds

From: Dave Hansen <dave.hansen@linux.intel.com>

There are effectively two ASID types:
1. The one stored in the mmu_context that goes from 0->5
2. The one programmed into the hardware that goes from 1->6

This consolidates the locations where converting beween the two
(by doing +1) to a single place which gives us a nice place to
comment.  KAISER will also need to, given an ASID, know which
hardware ASID to flush for the userspace mapping.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: daniel.gruss@iaik.tugraz.at
Cc: hughd@google.com
Cc: keescook@google.com
Cc: linux-mm@kvack.org
Cc: luto@kernel.org
Cc: michael.schwarz@iaik.tugraz.at
Cc: moritz.lipp@iaik.tugraz.at
Cc: richard.fellner@student.tugraz.at
Link: https://lkml.kernel.org/r/20171123003506.67E81D7F@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/include/asm/tlbflush.h | 30 ++++++++++++++++++------------
 1 file changed, 18 insertions(+), 12 deletions(-)

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 3101581c5da0..24b27eb5904c 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -88,21 +88,26 @@ static inline u64 inc_mm_tlb_gen(struct mm_struct *mm)
  */
 #define MAX_ASID_AVAILABLE ((1<<CR3_AVAIL_ASID_BITS) - 2)
 
-/*
- * If PCID is on, ASID-aware code paths put the ASID+1 into the PCID
- * bits.  This serves two purposes.  It prevents a nasty situation in
- * which PCID-unaware code saves CR3, loads some other value (with PCID
- * == 0), and then restores CR3, thus corrupting the TLB for ASID 0 if
- * the saved ASID was nonzero.  It also means that any bugs involving
- * loading a PCID-enabled CR3 with CR4.PCIDE off will trigger
- * deterministically.
- */
+static inline u16 kern_asid(u16 asid)
+{
+	VM_WARN_ON_ONCE(asid > MAX_ASID_AVAILABLE);
+	/*
+	 * If PCID is on, ASID-aware code paths put the ASID+1 into the PCID
+	 * bits.  This serves two purposes.  It prevents a nasty situation in
+	 * which PCID-unaware code saves CR3, loads some other value (with PCID
+	 * == 0), and then restores CR3, thus corrupting the TLB for ASID 0 if
+	 * the saved ASID was nonzero.  It also means that any bugs involving
+	 * loading a PCID-enabled CR3 with CR4.PCIDE off will trigger
+	 * deterministically.
+	 */
+	return asid + 1;
+}
+
 struct pgd_t;
 static inline unsigned long build_cr3(pgd_t *pgd, u16 asid)
 {
 	if (static_cpu_has(X86_FEATURE_PCID)) {
-		VM_WARN_ON_ONCE(asid > MAX_ASID_AVAILABLE);
-		return __sme_pa(pgd) | (asid + 1);
+		return __sme_pa(pgd) | kern_asid(asid);
 	} else {
 		VM_WARN_ON_ONCE(asid != 0);
 		return __sme_pa(pgd);
@@ -112,7 +117,8 @@ static inline unsigned long build_cr3(pgd_t *pgd, u16 asid)
 static inline unsigned long build_cr3_noflush(pgd_t *pgd, u16 asid)
 {
 	VM_WARN_ON_ONCE(asid > MAX_ASID_AVAILABLE);
-	return __sme_pa(pgd) | (asid + 1) | CR3_NOFLUSH;
+	VM_WARN_ON_ONCE(!this_cpu_has(X86_FEATURE_PCID));
+	return __sme_pa(pgd) | kern_asid(asid) | CR3_NOFLUSH;
 }
 
 #ifdef CONFIG_PARAVIRT
-- 
2.14.1

^ permalink raw reply	[flat|nested] 113+ messages in thread

* [PATCH 36/43] x86/mm/kaiser: Allow flushing for future ASID switches
  2017-11-24 17:23 [PATCH 00/43] x86 entry-stack and Kaiser series, 2017/11/24, v2 version Ingo Molnar
                   ` (34 preceding siblings ...)
  2017-11-24 17:24 ` [PATCH 35/43] x86/mm: Put mmu-to-h/w ASID translation in one place Ingo Molnar
@ 2017-11-24 17:24 ` Ingo Molnar
  2017-11-24 17:24 ` [PATCH 37/43] x86/mm/kaiser: Use PCID feature to make user and kernel switches faster Ingo Molnar
                   ` (7 subsequent siblings)
  43 siblings, 0 replies; 113+ messages in thread
From: Ingo Molnar @ 2017-11-24 17:24 UTC (permalink / raw)
  To: linux-kernel
  Cc: Dave Hansen, Andy Lutomirski, Thomas Gleixner, H . Peter Anvin,
	Peter Zijlstra, Borislav Petkov, Linus Torvalds

From: Dave Hansen <dave.hansen@linux.intel.com>

If changing the page tables in such a way that an invalidation of
all contexts (aka. PCIDs / ASIDs) is required, they can be
actively invalidated by:

 1. INVPCID for each PCID (works for single pages too).
 2. Load CR3 with each PCID without the NOFLUSH bit set
 3. Load CR3 with the NOFLUSH bit set for each and do
    INVLPG for each address.

But, none of these are really feasible since there are ~6 ASIDs (12 with
KAISER) at the time that invalidation is required.  Instead of
actively invalidating them, invalidate the *current* context and
also mark the cpu_tlbstate _quickly_ to indicate future invalidation
to be required.

At the next context-switch, look for this indicator
('all_other_ctxs_invalid' being set) invalidate all of the
cpu_tlbstate.ctxs[] entries.

This ensures that any future context switches will do a full flush
of the TLB, picking up the previous changes.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: daniel.gruss@iaik.tugraz.at
Cc: hughd@google.com
Cc: keescook@google.com
Cc: linux-mm@kvack.org
Cc: luto@kernel.org
Cc: michael.schwarz@iaik.tugraz.at
Cc: moritz.lipp@iaik.tugraz.at
Cc: richard.fellner@student.tugraz.at
Link: https://lkml.kernel.org/r/20171123003507.E8C327F5@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/include/asm/tlbflush.h | 47 ++++++++++++++++++++++++++++++++---------
 arch/x86/mm/tlb.c               | 35 ++++++++++++++++++++++++++++++
 2 files changed, 72 insertions(+), 10 deletions(-)

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 24b27eb5904c..bb5ba71038ee 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -184,6 +184,17 @@ struct tlb_state {
 	 */
 	bool is_lazy;
 
+	/*
+	 * If set we changed the page tables in such a way that we
+	 * needed an invalidation of all contexts (aka. PCIDs / ASIDs).
+	 * This tells us to go invalidate all the non-loaded ctxs[]
+	 * on the next context switch.
+	 *
+	 * The current ctx was kept up-to-date as it ran and does not
+	 * need to be invalidated.
+	 */
+	bool all_other_ctxs_invalid;
+
 	/*
 	 * Access to this CR4 shadow and to H/W CR4 is protected by
 	 * disabling interrupts when modifying either one.
@@ -261,6 +272,19 @@ static inline unsigned long cr4_read_shadow(void)
 	return this_cpu_read(cpu_tlbstate.cr4);
 }
 
+static inline void tlb_flush_shared_nonglobals(void)
+{
+	/*
+	 * With global pages, all of the shared kenel page tables
+	 * are set as _PAGE_GLOBAL.  We have no shared nonglobals
+	 * and nothing to do here.
+	 */
+	if (IS_ENABLED(CONFIG_X86_GLOBAL_PAGES))
+		return;
+
+	this_cpu_write(cpu_tlbstate.all_other_ctxs_invalid, true);
+}
+
 /*
  * Save some of cr4 feature set we're using (e.g.  Pentium 4MB
  * enable and PPro Global page enable), so that any CPU's that boot
@@ -290,6 +314,10 @@ static inline void __native_flush_tlb(void)
 	preempt_disable();
 	native_write_cr3(__native_read_cr3());
 	preempt_enable();
+	/*
+	 * Does not need tlb_flush_shared_nonglobals() since the CR3 write
+	 * without PCIDs flushes all non-globals.
+	 */
 }
 
 static inline void __native_flush_tlb_global_irq_disabled(void)
@@ -335,24 +363,23 @@ static inline void __native_flush_tlb_single(unsigned long addr)
 
 static inline void __flush_tlb_all(void)
 {
-	if (boot_cpu_has(X86_FEATURE_PGE))
+	if (boot_cpu_has(X86_FEATURE_PGE)) {
 		__flush_tlb_global();
-	else
+	} else {
 		__flush_tlb();
-
-	/*
-	 * Note: if we somehow had PCID but not PGE, then this wouldn't work --
-	 * we'd end up flushing kernel translations for the current ASID but
-	 * we might fail to flush kernel translations for other cached ASIDs.
-	 *
-	 * To avoid this issue, we force PCID off if PGE is off.
-	 */
+		tlb_flush_shared_nonglobals();
+	}
 }
 
 static inline void __flush_tlb_one(unsigned long addr)
 {
 	count_vm_tlb_event(NR_TLB_LOCAL_FLUSH_ONE);
 	__flush_tlb_single(addr);
+	/*
+	 * Invalidate other address spaces inaccessible to single-page
+	 * invalidation:
+	 */
+	tlb_flush_shared_nonglobals();
 }
 
 #define TLB_FLUSH_ALL	-1UL
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index e629dbda01a0..81941f1690fa 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -28,6 +28,38 @@
  *	Implement flush IPI by CALL_FUNCTION_VECTOR, Alex Shi
  */
 
+/*
+ * We get here when we do something requiring a TLB invalidation
+ * but could not go invalidate all of the contexts.  We do the
+ * necessary invalidation by clearing out the 'ctx_id' which
+ * forces a TLB flush when the context is loaded.
+ */
+void clear_non_loaded_ctxs(void)
+{
+	u16 asid;
+
+	/*
+	 * This is only expected to be set if we have disabled
+	 * kernel _PAGE_GLOBAL pages.
+	 */
+	if (IS_ENABLED(CONFIG_X86_GLOBAL_PAGES)) {
+		WARN_ON_ONCE(1);
+		return;
+	}
+
+	for (asid = 0; asid < TLB_NR_DYN_ASIDS; asid++) {
+		/* Do not need to flush the current asid */
+		if (asid == this_cpu_read(cpu_tlbstate.loaded_mm_asid))
+			continue;
+		/*
+		 * Make sure the next time we go to switch to
+		 * this asid, we do a flush:
+		 */
+		this_cpu_write(cpu_tlbstate.ctxs[asid].ctx_id, 0);
+	}
+	this_cpu_write(cpu_tlbstate.all_other_ctxs_invalid, false);
+}
+
 atomic64_t last_mm_ctx_id = ATOMIC64_INIT(1);
 
 
@@ -42,6 +74,9 @@ static void choose_new_asid(struct mm_struct *next, u64 next_tlb_gen,
 		return;
 	}
 
+	if (this_cpu_read(cpu_tlbstate.all_other_ctxs_invalid))
+		clear_non_loaded_ctxs();
+
 	for (asid = 0; asid < TLB_NR_DYN_ASIDS; asid++) {
 		if (this_cpu_read(cpu_tlbstate.ctxs[asid].ctx_id) !=
 		    next->context.ctx_id)
-- 
2.14.1

^ permalink raw reply	[flat|nested] 113+ messages in thread

* [PATCH 37/43] x86/mm/kaiser: Use PCID feature to make user and kernel switches faster
  2017-11-24 17:23 [PATCH 00/43] x86 entry-stack and Kaiser series, 2017/11/24, v2 version Ingo Molnar
                   ` (35 preceding siblings ...)
  2017-11-24 17:24 ` [PATCH 36/43] x86/mm/kaiser: Allow flushing for future ASID switches Ingo Molnar
@ 2017-11-24 17:24 ` Ingo Molnar
  2017-11-24 17:24 ` [PATCH 38/43] x86/mm/kaiser: Disable native VSYSCALL Ingo Molnar
                   ` (6 subsequent siblings)
  43 siblings, 0 replies; 113+ messages in thread
From: Ingo Molnar @ 2017-11-24 17:24 UTC (permalink / raw)
  To: linux-kernel
  Cc: Dave Hansen, Andy Lutomirski, Thomas Gleixner, H . Peter Anvin,
	Peter Zijlstra, Borislav Petkov, Linus Torvalds

From: Dave Hansen <dave.hansen@linux.intel.com>

Short summary: Use x86 PCID feature to avoid flushing the TLB at all
interrupts and syscalls.  Speed them up.  Makes context switches
and TLB flushing slower.

Background:

KAISER keeps two copies of the page tables.  Switches between the
copies are performed by writing to the CR3 register.  But, CR3
was really designed for context switches and writes to it also
flush the entire TLB (modulo global pages).  This TLB flush
increases the cost of interrupts and context switches.  For
syscall-heavy microbenchmarks it can cut the rate of syscalls by
2/3.

The kernel recently gained support for and Intel CPU feature
called Process Context IDentifiers (PCID) thanks to Andy
Lutomirski.  This feature is intended to allow you to switch
between contexts without flushing the TLB.

Implementation:

PCIDs can be used to avoid flushing the TLB at kernel entry/exit.
This is speeds up both interrupts and syscalls.

First, the kernel and userspace must be assigned different ASIDs.
On entry from userspace, move over to the kernel page tables
*and* ASID.  On exit, restore the user page tables and ASID.
Fortunately, the ASID is programmed via CR3, which is already
being used to switch between the user and kernel page tables.
This gives us convenient, one-stop shopping.

The CR3 write which is used to switch between processes provides
all the TLB flushing normally required at context switch time.
But, with KAISER, that CR3 write only flushes the current
(kernel) ASID.  An extra TLB flush operation is now required in
order to flush the user ASID.  This new instruction (INVPCID) is
probably ~100 cycles, but this is done with the assumption that
the time lost in context switches is more than made up for by
lower cost of interrupts and syscalls.

Support:

PCIDs are generally available on Sandybridge and newer CPUs.  However,
the accompanying INVPCID instruction did not become available until
Haswell (the ones with "v4", or called fourth-generation Core).  This
instruction allows non-current-PCID TLB entries to be flushed without
switching CR3 and global pages to be flushed without a double
MOV-to-CR4.

Without INVPCID, PCIDs are much harder to use.  TLB invalidation gets
much more onerous:

1. Every kernel TLB flush (even for a single page) requires an
   interrupts-off MOV-to-CR4 which is very expensive.  This is because
   there is no way to flush a kernel address that might be loaded
   in *EVERY* PCID.  Right now, there are "only" ~12 of these per-cpu,
   but that's too painful to use the MOV-to-CR3 to flush them.  That
   leaves only the MOV-to-CR4.
2. Every userspace flush (even for a single page requires one of the
   following:
   a. A pair of flushing (bit 63 clear) CR3 writes: one for
      the kernel ASID and another for userspace.
   b. A pair of non-flushing CR3 writes (bit 63 set) with the
      flush done for each.  For instance, what is currently a
      single instruction without KAISER:

		invpcid_flush_one(current_pcid, addr);

      becomes this with KAISER:

      		invpcid_flush_one(current_kern_pcid, addr);
		invpcid_flush_one(current_user_pcid, addr);

      and this without INVPCID:

      		__native_flush_tlb_single(addr);
		write_cr3(mm->pgd | current_user_pcid | NOFLUSH);
      		__native_flush_tlb_single(addr);
		write_cr3(mm->pgd | current_kern_pcid | NOFLUSH);

So, for now, fully disable PCIDs with KAISER when INVPCID is not
available.  This is fixable, but it's an optimization that can be
performed later.

Hugh Dickins also points out that PCIDs really have two distinct
use-cases in the context of KAISER.  The first way they can be used
is as "TLB preservation across context-switch", which is what
Andy Lutomirksi's 4.14 PCID code does.  They can also be used as
a "KAISER syscall/interrupt accelerator".  If we just use them to
speed up syscall/interrupts (and ignore the context-switch TLB
preservation), then the deficiency of not having INVPCID
becomes much less onerous.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: daniel.gruss@iaik.tugraz.at
Cc: hughd@google.com
Cc: keescook@google.com
Cc: linux-mm@kvack.org
Cc: luto@kernel.org
Cc: michael.schwarz@iaik.tugraz.at
Cc: moritz.lipp@iaik.tugraz.at
Cc: richard.fellner@student.tugraz.at
Link: https://lkml.kernel.org/r/20171123003509.EC42DD15@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/entry/calling.h                    |  25 +++--
 arch/x86/include/asm/cpufeatures.h          |   1 +
 arch/x86/include/asm/pgtable_types.h        |  11 +++
 arch/x86/include/asm/tlbflush.h             | 137 +++++++++++++++++++++++-----
 arch/x86/include/uapi/asm/processor-flags.h |   3 +-
 arch/x86/kvm/x86.c                          |   3 +-
 arch/x86/mm/init.c                          |  75 ++++++++++-----
 arch/x86/mm/tlb.c                           |  66 +++++++++++++-
 8 files changed, 261 insertions(+), 60 deletions(-)

diff --git a/arch/x86/entry/calling.h b/arch/x86/entry/calling.h
index d087c3aa0514..66af80514197 100644
--- a/arch/x86/entry/calling.h
+++ b/arch/x86/entry/calling.h
@@ -3,6 +3,7 @@
 #include <asm/unwind_hints.h>
 #include <asm/cpufeatures.h>
 #include <asm/page_types.h>
+#include <asm/pgtable_types.h>
 
 /*
 
@@ -192,16 +193,20 @@ For 32-bit we have the following conventions - kernel is built with
 #ifdef CONFIG_KAISER
 
 /* KAISER PGDs are 8k.  Flip bit 12 to switch between the two halves: */
-#define KAISER_SWITCH_MASK (1<<PAGE_SHIFT)
+#define KAISER_SWITCH_PGTABLES_MASK (1<<PAGE_SHIFT)
+#define KAISER_SWITCH_MASK     (KAISER_SWITCH_PGTABLES_MASK|\
+				(1<<X86_CR3_KAISER_SWITCH_BIT))
 
 .macro ADJUST_KERNEL_CR3 reg:req
-	/* Clear "KAISER bit", point CR3 at kernel pagetables: */
-	andq	$(~KAISER_SWITCH_MASK), \reg
+	ALTERNATIVE "", "bts $63, \reg", X86_FEATURE_PCID
+	/* Clear PCID and "KAISER bit", point CR3 at kernel pagetables: */
+	andq    $(~KAISER_SWITCH_MASK), \reg
 .endm
 
 .macro ADJUST_USER_CR3 reg:req
-	/* Move CR3 up a page to the user page tables: */
-	orq	$(KAISER_SWITCH_MASK), \reg
+	ALTERNATIVE "", "bts $63, \reg", X86_FEATURE_PCID
+	/* Set user PCID bit, and move CR3 up a page to the user page tables: */
+	orq     $(KAISER_SWITCH_MASK), \reg
 .endm
 
 .macro SWITCH_TO_KERNEL_CR3 scratch_reg:req
@@ -220,8 +225,14 @@ For 32-bit we have the following conventions - kernel is built with
 	movq	%cr3, %r\scratch_reg
 	movq	%r\scratch_reg, \save_reg
 	/*
-	 * Is the switch bit zero?  This means the address is
-	 * up in real KAISER patches in a moment.
+	 * Is the "switch mask" all zero?  That means that both of
+	 * these are zero:
+	 *
+	 *	1. The user/kernel PCID bit, and
+	 *	2. The user/kernel "bit" that points CR3 to the
+	 *	   bottom half of the 8k PGD
+	 *
+	 * That indicates a kernel CR3 value, not user/shadow.
 	 */
 	testq	$(KAISER_SWITCH_MASK), %r\scratch_reg
 	jz	.Ldone_\@
diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index c0b0e9e8aa66..ea51d4a28d96 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -197,6 +197,7 @@
 #define X86_FEATURE_CAT_L3		( 7*32+ 4) /* Cache Allocation Technology L3 */
 #define X86_FEATURE_CAT_L2		( 7*32+ 5) /* Cache Allocation Technology L2 */
 #define X86_FEATURE_CDP_L3		( 7*32+ 6) /* Code and Data Prioritization L3 */
+#define X86_FEATURE_INVPCID_SINGLE ( 7*32+ 7) /* Effectively INVPCID && CR4.PCIDE=1 */
 
 #define X86_FEATURE_HW_PSTATE		( 7*32+ 8) /* AMD HW-PState */
 #define X86_FEATURE_PROC_FEEDBACK	( 7*32+ 9) /* AMD ProcFeedbackInterface */
diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index 1fc2f22b9002..8bc825a5b125 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -140,6 +140,17 @@
 			 _PAGE_SOFT_DIRTY)
 #define _HPAGE_CHG_MASK (_PAGE_CHG_MASK | _PAGE_PSE)
 
+/* The ASID is the lower 12 bits of CR3 */
+#define X86_CR3_PCID_ASID_MASK  (_AC((1<<12)-1, UL))
+
+/* Mask for all the PCID-related bits in CR3: */
+#define X86_CR3_PCID_MASK       (X86_CR3_PCID_NOFLUSH | X86_CR3_PCID_ASID_MASK)
+
+/* Make sure this is only usable in KAISER #ifdef'd code: */
+#ifdef CONFIG_KAISER
+#define X86_CR3_KAISER_SWITCH_BIT 11
+#endif
+
 /*
  * The cache modes defined here are used to translate between pure SW usage
  * and the HW defined cache mode bits and/or PAT entries.
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index bb5ba71038ee..ea1a3dce91c2 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -78,7 +78,12 @@ static inline u64 inc_mm_tlb_gen(struct mm_struct *mm)
 /* There are 12 bits of space for ASIDS in CR3 */
 #define CR3_HW_ASID_BITS 12
 /* When enabled, KAISER consumes a single bit for user/kernel switches */
+#ifdef CONFIG_KAISER
+#define X86_CR3_KAISER_SWITCH_BIT 11
+#define KAISER_CONSUMED_ASID_BITS 1
+#else
 #define KAISER_CONSUMED_ASID_BITS 0
+#endif
 
 #define CR3_AVAIL_ASID_BITS (CR3_HW_ASID_BITS - KAISER_CONSUMED_ASID_BITS)
 /*
@@ -88,21 +93,62 @@ static inline u64 inc_mm_tlb_gen(struct mm_struct *mm)
  */
 #define MAX_ASID_AVAILABLE ((1<<CR3_AVAIL_ASID_BITS) - 2)
 
+/*
+ * 6 because 6 should be plenty and struct tlb_state will fit in
+ * two cache lines.
+ */
+#define TLB_NR_DYN_ASIDS 6
+
 static inline u16 kern_asid(u16 asid)
 {
 	VM_WARN_ON_ONCE(asid > MAX_ASID_AVAILABLE);
+
+#ifdef CONFIG_KAISER
+	/*
+	 * Make sure that the dynamic ASID space does not confict
+	 * with the bit we are using to switch between user and
+	 * kernel ASIDs.
+	 */
+	BUILD_BUG_ON(TLB_NR_DYN_ASIDS >= (1<<X86_CR3_KAISER_SWITCH_BIT));
+
 	/*
-	 * If PCID is on, ASID-aware code paths put the ASID+1 into the PCID
-	 * bits.  This serves two purposes.  It prevents a nasty situation in
-	 * which PCID-unaware code saves CR3, loads some other value (with PCID
-	 * == 0), and then restores CR3, thus corrupting the TLB for ASID 0 if
-	 * the saved ASID was nonzero.  It also means that any bugs involving
-	 * loading a PCID-enabled CR3 with CR4.PCIDE off will trigger
-	 * deterministically.
+	 * The ASID being passed in here should have respected
+	 * the MAX_ASID_AVAILABLE and thus never have the switch
+	 * bit set.
+	 */
+	VM_WARN_ON_ONCE(asid & (1<<X86_CR3_KAISER_SWITCH_BIT));
+#endif
+	/*
+	 * The dynamically-assigned ASIDs that get passed in  are
+	 * small (<TLB_NR_DYN_ASIDS).  They never have the high
+	 * switch bit set, so do not bother to clear it.
+	 */
+
+	/*
+	 * If PCID is on, ASID-aware code paths put the ASID+1
+	 * into the PCID bits.  This serves two purposes.  It
+	 * prevents a nasty situation in which PCID-unaware code
+	 * saves CR3, loads some other value (with PCID == 0),
+	 * and then restores CR3, thus corrupting the TLB for
+	 * ASID 0 if the saved ASID was nonzero.  It also means
+	 * that any bugs involving loading a PCID-enabled CR3
+	 * with CR4.PCIDE off will trigger deterministically.
 	 */
 	return asid + 1;
 }
 
+/*
+ * The user ASID is just the kernel one, plus the "switch bit".
+ */
+static inline u16 user_asid(u16 asid)
+{
+	u16 ret = kern_asid(asid);
+#ifdef CONFIG_KAISER
+	ret |= 1<<X86_CR3_KAISER_SWITCH_BIT;
+#endif
+	return ret;
+}
+
 struct pgd_t;
 static inline unsigned long build_cr3(pgd_t *pgd, u16 asid)
 {
@@ -145,12 +191,6 @@ static inline bool tlb_defer_switch_to_init_mm(void)
 	return !static_cpu_has(X86_FEATURE_PCID);
 }
 
-/*
- * 6 because 6 should be plenty and struct tlb_state will fit in
- * two cache lines.
- */
-#define TLB_NR_DYN_ASIDS 6
-
 struct tlb_context {
 	u64 ctx_id;
 	u64 tlb_gen;
@@ -306,18 +346,42 @@ extern void initialize_tlbstate_and_flush(void);
 
 static inline void __native_flush_tlb(void)
 {
+	if (!cpu_feature_enabled(X86_FEATURE_INVPCID)) {
+		/*
+		 * native_write_cr3() only clears the current PCID if
+		 * CR4 has X86_CR4_PCIDE set.  In other words, this does
+		 * not fully flush the TLB if PCIDs are in use.
+		 *
+		 * With KAISER and PCIDs, the means that we did not
+		 * flush the user PCID.  Warn if it gets called.
+		 */
+		if (IS_ENABLED(CONFIG_KAISER))
+			WARN_ON_ONCE(this_cpu_read(cpu_tlbstate.cr4) &
+				     X86_CR4_PCIDE);
+		/*
+		 * If current->mm == NULL then we borrow a mm
+		 * which may change during a task switch and
+		 * therefore we must not be preempted while we
+		 * write CR3 back:
+		 */
+		preempt_disable();
+		native_write_cr3(__native_read_cr3());
+		preempt_enable();
+		/*
+		 * Does not need tlb_flush_shared_nonglobals()
+		 * since the CR3 write without PCIDs flushes all
+		 * non-globals.
+		 */
+		return;
+	}
 	/*
-	 * If current->mm == NULL then we borrow a mm which may change during a
-	 * task switch and therefore we must not be preempted while we write CR3
-	 * back:
-	 */
-	preempt_disable();
-	native_write_cr3(__native_read_cr3());
-	preempt_enable();
-	/*
-	 * Does not need tlb_flush_shared_nonglobals() since the CR3 write
-	 * without PCIDs flushes all non-globals.
+	 * We are no longer using globals with KAISER, so a
+	 * "nonglobals" flush would work too. But, this is more
+	 * conservative.
+	 *
+	 * Note, this works with CR4.PCIDE=0 or 1.
 	 */
+	invpcid_flush_all();
 }
 
 static inline void __native_flush_tlb_global_irq_disabled(void)
@@ -339,6 +403,8 @@ static inline void __native_flush_tlb_global(void)
 		/*
 		 * Using INVPCID is considerably faster than a pair of writes
 		 * to CR4 sandwiched inside an IRQ flag save/restore.
+		 *
+		 * Note, this works with CR4.PCIDE=0 or 1.
 		 */
 		invpcid_flush_all();
 		return;
@@ -358,7 +424,30 @@ static inline void __native_flush_tlb_global(void)
 
 static inline void __native_flush_tlb_single(unsigned long addr)
 {
-	asm volatile("invlpg (%0)" ::"r" (addr) : "memory");
+	u32 loaded_mm_asid = this_cpu_read(cpu_tlbstate.loaded_mm_asid);
+
+	/*
+	 * Some platforms #GP if we call invpcid(type=1/2) before
+	 * CR4.PCIDE=1.  Just call invpcid in the case we are called
+	 * early.
+	 */
+	if (!this_cpu_has(X86_FEATURE_INVPCID_SINGLE)) {
+		asm volatile("invlpg (%0)" ::"r" (addr) : "memory");
+		return;
+	}
+	/* Flush the address out of both PCIDs. */
+	/*
+	 * An optimization here might be to determine addresses
+	 * that are only kernel-mapped and only flush the kernel
+	 * ASID.  But, userspace flushes are probably much more
+	 * important performance-wise.
+	 *
+	 * Make sure to do only a single invpcid when KAISER is
+	 * disabled and we have only a single ASID.
+	 */
+	if (kern_asid(loaded_mm_asid) != user_asid(loaded_mm_asid))
+		invpcid_flush_one(user_asid(loaded_mm_asid), addr);
+	invpcid_flush_one(kern_asid(loaded_mm_asid), addr);
 }
 
 static inline void __flush_tlb_all(void)
diff --git a/arch/x86/include/uapi/asm/processor-flags.h b/arch/x86/include/uapi/asm/processor-flags.h
index 7e1e730396ae..7ef94b64dbb4 100644
--- a/arch/x86/include/uapi/asm/processor-flags.h
+++ b/arch/x86/include/uapi/asm/processor-flags.h
@@ -78,7 +78,8 @@
 #define X86_CR3_PWT		_BITUL(X86_CR3_PWT_BIT)
 #define X86_CR3_PCD_BIT		4 /* Page Cache Disable */
 #define X86_CR3_PCD		_BITUL(X86_CR3_PCD_BIT)
-#define X86_CR3_PCID_MASK	_AC(0x00000fff,UL) /* PCID Mask */
+#define X86_CR3_PCID_NOFLUSH_BIT 63 /* Preserve old PCID */
+#define X86_CR3_PCID_NOFLUSH    _BITULL(X86_CR3_PCID_NOFLUSH_BIT)
 
 /*
  * Intel CPU features in CR4
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 03869eb7fcd6..cd7ed7a874d1 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -805,7 +805,8 @@ int kvm_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
 			return 1;
 
 		/* PCID can not be enabled when cr3[11:0]!=000H or EFER.LMA=0 */
-		if ((kvm_read_cr3(vcpu) & X86_CR3_PCID_MASK) || !is_long_mode(vcpu))
+		if ((kvm_read_cr3(vcpu) & X86_CR3_PCID_ASID_MASK) ||
+		    !is_long_mode(vcpu))
 			return 1;
 	}
 
diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index a22c2b95e513..9618e57d46cf 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -196,34 +196,59 @@ static void __init probe_page_size_mask(void)
 
 static void setup_pcid(void)
 {
-#ifdef CONFIG_X86_64
-	if (boot_cpu_has(X86_FEATURE_PCID)) {
-		if (boot_cpu_has(X86_FEATURE_PGE)) {
-			/*
-			 * This can't be cr4_set_bits_and_update_boot() --
-			 * the trampoline code can't handle CR4.PCIDE and
-			 * it wouldn't do any good anyway.  Despite the name,
-			 * cr4_set_bits_and_update_boot() doesn't actually
-			 * cause the bits in question to remain set all the
-			 * way through the secondary boot asm.
-			 *
-			 * Instead, we brute-force it and set CR4.PCIDE
-			 * manually in start_secondary().
-			 */
-			cr4_set_bits(X86_CR4_PCIDE);
-		} else {
-			/*
-			 * flush_tlb_all(), as currently implemented, won't
-			 * work if PCID is on but PGE is not.  Since that
-			 * combination doesn't exist on real hardware, there's
-			 * no reason to try to fully support it, but it's
-			 * polite to avoid corrupting data if we're on
-			 * an improperly configured VM.
-			 */
+	if (!IS_ENABLED(CONFIG_X86_64))
+		return;
+
+	if (!boot_cpu_has(X86_FEATURE_PCID))
+		return;
+
+	if (boot_cpu_has(X86_FEATURE_PGE)) {
+		/*
+		 * KAISER uses a PCID for the kernel and another
+		 * for userspace.  Both PCIDs need to be flushed
+		 * when the TLB flush functions are called.  But,
+		 * flushing *another* PCID is insane without
+		 * INVPCID.  Just avoid using PCIDs at all if we
+		 * have KAISER and do not have INVPCID.
+		 */
+		if (!IS_ENABLED(CONFIG_X86_GLOBAL_PAGES) &&
+		    !boot_cpu_has(X86_FEATURE_INVPCID)) {
 			setup_clear_cpu_cap(X86_FEATURE_PCID);
+			return;
 		}
+		/*
+		 * This can't be cr4_set_bits_and_update_boot() --
+		 * the trampoline code can't handle CR4.PCIDE and
+		 * it wouldn't do any good anyway.  Despite the name,
+		 * cr4_set_bits_and_update_boot() doesn't actually
+		 * cause the bits in question to remain set all the
+		 * way through the secondary boot asm.
+		 *
+		 * Instead, we brute-force it and set CR4.PCIDE
+		 * manually in start_secondary().
+		 */
+		cr4_set_bits(X86_CR4_PCIDE);
+
+		/*
+		 * INVPCID's single-context modes (2/3) only work
+		 * if we set X86_CR4_PCIDE, *and* we INVPCID
+		 * support.  It's unusable on systems that have
+		 * X86_CR4_PCIDE clear, or that have no INVPCID
+		 * support at all.
+		 */
+		if (boot_cpu_has(X86_FEATURE_INVPCID))
+			setup_force_cpu_cap(X86_FEATURE_INVPCID_SINGLE);
+	} else {
+		/*
+		 * flush_tlb_all(), as currently implemented, won't
+		 * work if PCID is on but PGE is not.  Since that
+		 * combination doesn't exist on real hardware, there's
+		 * no reason to try to fully support it, but it's
+		 * polite to avoid corrupting data if we're on
+		 * an improperly configured VM.
+		 */
+		setup_clear_cpu_cap(X86_FEATURE_PCID);
 	}
-#endif
 }
 
 #ifdef CONFIG_X86_32
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 81941f1690fa..f75b6eb47a6d 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -100,6 +100,68 @@ static void choose_new_asid(struct mm_struct *next, u64 next_tlb_gen,
 	*need_flush = true;
 }
 
+/*
+ * Given a kernel asid, flush the corresponding KAISER
+ * user ASID.
+ */
+static void flush_user_asid(pgd_t *pgd, u16 kern_asid)
+{
+	/* There is no user ASID if KAISER is off */
+	if (!IS_ENABLED(CONFIG_KAISER))
+		return;
+	/*
+	 * We only have a single ASID if PCID is off and the CR3
+	 * write will have flushed it.
+	 */
+	if (!cpu_feature_enabled(X86_FEATURE_PCID))
+		return;
+	/*
+	 * With PCIDs enabled, write_cr3() only flushes TLB
+	 * entries for the current (kernel) ASID.  This leaves
+	 * old TLB entries for the user ASID in place and we must
+	 * flush that context separately.  We can theoretically
+	 * delay doing this until we actually load up the
+	 * userspace CR3, but do it here for simplicity.
+	 */
+	if (cpu_feature_enabled(X86_FEATURE_INVPCID)) {
+		invpcid_flush_single_context(user_asid(kern_asid));
+	} else {
+		/*
+		 * On systems with PCIDs, but no INVPCID, the only
+		 * way to flush a PCID is a CR3 write.  Note that
+		 * we use the kernel page tables with the *user*
+		 * ASID here.
+		 */
+		unsigned long user_asid_flush_cr3;
+		user_asid_flush_cr3 = build_cr3(pgd, user_asid(kern_asid));
+		write_cr3(user_asid_flush_cr3);
+		/*
+		 * We do not use PCIDs with KAISER unless we also
+		 * have INVPCID.  Getting here is unexpected.
+		 */
+		WARN_ON_ONCE(1);
+	}
+}
+
+static void load_new_mm_cr3(pgd_t *pgdir, u16 new_asid, bool need_flush)
+{
+	unsigned long new_mm_cr3;
+
+	if (need_flush) {
+		flush_user_asid(pgdir, new_asid);
+		new_mm_cr3 = build_cr3(pgdir, new_asid);
+	} else {
+		new_mm_cr3 = build_cr3_noflush(pgdir, new_asid);
+	}
+
+	/*
+	 * Caution: many callers of this function expect
+	 * that load_cr3() is serializing and orders TLB
+	 * fills with respect to the mm_cpumask writes.
+	 */
+	write_cr3(new_mm_cr3);
+}
+
 void leave_mm(int cpu)
 {
 	struct mm_struct *loaded_mm = this_cpu_read(cpu_tlbstate.loaded_mm);
@@ -230,7 +292,7 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 		if (need_flush) {
 			this_cpu_write(cpu_tlbstate.ctxs[new_asid].ctx_id, next->context.ctx_id);
 			this_cpu_write(cpu_tlbstate.ctxs[new_asid].tlb_gen, next_tlb_gen);
-			write_cr3(build_cr3(next->pgd, new_asid));
+			load_new_mm_cr3(next->pgd, new_asid, true);
 
 			/*
 			 * NB: This gets called via leave_mm() in the idle path
@@ -243,7 +305,7 @@ void switch_mm_irqs_off(struct mm_struct *prev, struct mm_struct *next,
 			trace_tlb_flush_rcuidle(TLB_FLUSH_ON_TASK_SWITCH, TLB_FLUSH_ALL);
 		} else {
 			/* The new ASID is already up to date. */
-			write_cr3(build_cr3_noflush(next->pgd, new_asid));
+			load_new_mm_cr3(next->pgd, new_asid, false);
 
 			/* See above wrt _rcuidle. */
 			trace_tlb_flush_rcuidle(TLB_FLUSH_ON_TASK_SWITCH, 0);
-- 
2.14.1

^ permalink raw reply	[flat|nested] 113+ messages in thread

* [PATCH 38/43] x86/mm/kaiser: Disable native VSYSCALL
  2017-11-24 17:23 [PATCH 00/43] x86 entry-stack and Kaiser series, 2017/11/24, v2 version Ingo Molnar
                   ` (36 preceding siblings ...)
  2017-11-24 17:24 ` [PATCH 37/43] x86/mm/kaiser: Use PCID feature to make user and kernel switches faster Ingo Molnar
@ 2017-11-24 17:24 ` Ingo Molnar
  2017-11-24 17:24 ` [PATCH 39/43] x86/mm/kaiser: Add debugfs file to turn KAISER on/off at runtime Ingo Molnar
                   ` (5 subsequent siblings)
  43 siblings, 0 replies; 113+ messages in thread
From: Ingo Molnar @ 2017-11-24 17:24 UTC (permalink / raw)
  To: linux-kernel
  Cc: Dave Hansen, Andy Lutomirski, Thomas Gleixner, H . Peter Anvin,
	Peter Zijlstra, Borislav Petkov, Linus Torvalds

From: Dave Hansen <dave.hansen@linux.intel.com>

The KAISER code attempts to "poison" the user portion of the kernel page
tables.  It detects entries that it wants that it wants to poison in two
ways:
 * Looking for addresses >= PAGE_OFFSET
 * Looking for entries without _PAGE_USER set

But, to allow the _PAGE_USER check to work, it must never be set on
init_mm entries, and an earlier patch in this series ensured that it
will never be set.

The VDSO is at a address >= PAGE_OFFSET and it is also mapped by init_mm.
Because of the earlier, KAISER-enforced restriction, _PAGE_USER is never
set which makes the VDSO unreadable to userspace.

This makes the "NATIVE" case totally unusable since userspace can not
even see the memory any more.  Disable it whenever KAISER is enabled.

Also add some help text about how KAISER might affect the emulation
case as well.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: daniel.gruss@iaik.tugraz.at
Cc: hughd@google.com
Cc: keescook@google.com
Cc: linux-mm@kvack.org
Cc: luto@kernel.org
Cc: michael.schwarz@iaik.tugraz.at
Cc: moritz.lipp@iaik.tugraz.at
Cc: richard.fellner@student.tugraz.at
Link: https://lkml.kernel.org/r/20171123003513.10CAD896@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/Kconfig | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 09dcc94c4484..d23cd2902b10 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -2249,6 +2249,9 @@ choice
 
 	config LEGACY_VSYSCALL_NATIVE
 		bool "Native"
+		# The VSYSCALL page comes from the kernel page tables
+		# and is not available when KAISER is enabled.
+		depends on ! KAISER
 		help
 		  Actual executable code is located in the fixed vsyscall
 		  address mapping, implementing time() efficiently. Since
@@ -2266,6 +2269,11 @@ choice
 		  exploits. This configuration is recommended when userspace
 		  still uses the vsyscall area.
 
+		  When KAISER is enabled, the vsyscall area will become
+		  unreadable.  This emulation option still works, but KAISER
+		  will make it harder to do things like trace code using the
+		  emulation.
+
 	config LEGACY_VSYSCALL_NONE
 		bool "None"
 		help
-- 
2.14.1

^ permalink raw reply	[flat|nested] 113+ messages in thread

* [PATCH 39/43] x86/mm/kaiser: Add debugfs file to turn KAISER on/off at runtime
  2017-11-24 17:23 [PATCH 00/43] x86 entry-stack and Kaiser series, 2017/11/24, v2 version Ingo Molnar
                   ` (37 preceding siblings ...)
  2017-11-24 17:24 ` [PATCH 38/43] x86/mm/kaiser: Disable native VSYSCALL Ingo Molnar
@ 2017-11-24 17:24 ` Ingo Molnar
  2017-11-24 17:24 ` [PATCH 40/43] x86/mm/kaiser: Add a function to check for KAISER being enabled Ingo Molnar
                   ` (4 subsequent siblings)
  43 siblings, 0 replies; 113+ messages in thread
From: Ingo Molnar @ 2017-11-24 17:24 UTC (permalink / raw)
  To: linux-kernel
  Cc: Dave Hansen, Andy Lutomirski, Thomas Gleixner, H . Peter Anvin,
	Peter Zijlstra, Borislav Petkov, Linus Torvalds

From: Dave Hansen <dave.hansen@linux.intel.com>

This will be used in a few patches.  Right now, it's not wired up
to do anything useful.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: daniel.gruss@iaik.tugraz.at
Cc: hughd@google.com
Cc: keescook@google.com
Cc: linux-mm@kvack.org
Cc: luto@kernel.org
Cc: michael.schwarz@iaik.tugraz.at
Cc: moritz.lipp@iaik.tugraz.at
Cc: richard.fellner@student.tugraz.at
Link: https://lkml.kernel.org/r/20171123003517.8EAB76E0@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/mm/kaiser.c | 48 ++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 48 insertions(+)

diff --git a/arch/x86/mm/kaiser.c b/arch/x86/mm/kaiser.c
index 4665dd724efb..968d5b62d597 100644
--- a/arch/x86/mm/kaiser.c
+++ b/arch/x86/mm/kaiser.c
@@ -29,6 +29,7 @@
 #include <linux/string.h>
 #include <linux/types.h>
 #include <linux/bug.h>
+#include <linux/debugfs.h>
 #include <linux/init.h>
 #include <linux/spinlock.h>
 #include <linux/mm.h>
@@ -470,3 +471,50 @@ void kaiser_remove_mapping(unsigned long start, unsigned long size)
 	 */
 	__native_flush_tlb_global();
 }
+
+int kaiser_enabled = 1;
+static ssize_t kaiser_enabled_read_file(struct file *file, char __user *user_buf,
+			     size_t count, loff_t *ppos)
+{
+	char buf[32];
+	unsigned int len;
+
+	len = sprintf(buf, "%d\n", kaiser_enabled);
+	return simple_read_from_buffer(user_buf, count, ppos, buf, len);
+}
+
+static ssize_t kaiser_enabled_write_file(struct file *file,
+		 const char __user *user_buf, size_t count, loff_t *ppos)
+{
+	char buf[32];
+	ssize_t len;
+	unsigned int enable;
+
+	len = min(count, sizeof(buf) - 1);
+	if (copy_from_user(buf, user_buf, len))
+		return -EFAULT;
+
+	buf[len] = '\0';
+	if (kstrtoint(buf, 0, &enable))
+		return -EINVAL;
+
+	if (enable > 1)
+		return -EINVAL;
+
+	WRITE_ONCE(kaiser_enabled, enable);
+	return count;
+}
+
+static const struct file_operations fops_kaiser_enabled = {
+	.read = kaiser_enabled_read_file,
+	.write = kaiser_enabled_write_file,
+	.llseek = default_llseek,
+};
+
+static int __init create_kaiser_enabled(void)
+{
+	debugfs_create_file("kaiser-enabled", S_IRUSR | S_IWUSR,
+			    arch_debugfs_dir, NULL, &fops_kaiser_enabled);
+	return 0;
+}
+late_initcall(create_kaiser_enabled);
-- 
2.14.1

^ permalink raw reply	[flat|nested] 113+ messages in thread

* [PATCH 40/43] x86/mm/kaiser: Add a function to check for KAISER being enabled
  2017-11-24 17:23 [PATCH 00/43] x86 entry-stack and Kaiser series, 2017/11/24, v2 version Ingo Molnar
                   ` (38 preceding siblings ...)
  2017-11-24 17:24 ` [PATCH 39/43] x86/mm/kaiser: Add debugfs file to turn KAISER on/off at runtime Ingo Molnar
@ 2017-11-24 17:24 ` Ingo Molnar
  2017-11-24 17:24 ` [PATCH 41/43] x86/mm/kaiser: Un-poison PGDs at runtime Ingo Molnar
                   ` (3 subsequent siblings)
  43 siblings, 0 replies; 113+ messages in thread
From: Ingo Molnar @ 2017-11-24 17:24 UTC (permalink / raw)
  To: linux-kernel
  Cc: Dave Hansen, Andy Lutomirski, Thomas Gleixner, H . Peter Anvin,
	Peter Zijlstra, Borislav Petkov, Linus Torvalds

From: Dave Hansen <dave.hansen@linux.intel.com>

Currently, all of the checks for KAISER are compile-time checks.

Runtime checks are needed for turning it on/off at runtime.

Add a function to do that.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: daniel.gruss@iaik.tugraz.at
Cc: hughd@google.com
Cc: keescook@google.com
Cc: linux-mm@kvack.org
Cc: luto@kernel.org
Cc: michael.schwarz@iaik.tugraz.at
Cc: moritz.lipp@iaik.tugraz.at
Cc: richard.fellner@student.tugraz.at
Link: https://lkml.kernel.org/r/20171123003518.B7D81B14@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/include/asm/kaiser.h | 5 +++++
 include/linux/kaiser.h        | 5 +++++
 2 files changed, 10 insertions(+)

diff --git a/arch/x86/include/asm/kaiser.h b/arch/x86/include/asm/kaiser.h
index 040cb096d29d..35f12a8a7071 100644
--- a/arch/x86/include/asm/kaiser.h
+++ b/arch/x86/include/asm/kaiser.h
@@ -56,6 +56,11 @@ extern void kaiser_remove_mapping(unsigned long start, unsigned long size);
  */
 extern void kaiser_init(void);
 
+static inline bool kaiser_active(void)
+{
+	extern int kaiser_enabled;
+	return kaiser_enabled;
+}
 #endif
 
 #endif /* __ASSEMBLY__ */
diff --git a/include/linux/kaiser.h b/include/linux/kaiser.h
index 77db4230a0dd..a3d28d00d555 100644
--- a/include/linux/kaiser.h
+++ b/include/linux/kaiser.h
@@ -28,5 +28,10 @@ static inline int kaiser_add_mapping(unsigned long addr, unsigned long size,
 static inline void kaiser_add_mapping_cpu_entry(int cpu)
 {
 }
+
+static inline bool kaiser_active(void)
+{
+	return 0;
+}
 #endif /* !CONFIG_KAISER */
 #endif /* _INCLUDE_KAISER_H */
-- 
2.14.1

^ permalink raw reply	[flat|nested] 113+ messages in thread

* [PATCH 41/43] x86/mm/kaiser: Un-poison PGDs at runtime
  2017-11-24 17:23 [PATCH 00/43] x86 entry-stack and Kaiser series, 2017/11/24, v2 version Ingo Molnar
                   ` (39 preceding siblings ...)
  2017-11-24 17:24 ` [PATCH 40/43] x86/mm/kaiser: Add a function to check for KAISER being enabled Ingo Molnar
@ 2017-11-24 17:24 ` Ingo Molnar
  2017-11-24 17:24 ` [PATCH 42/43] x86/mm/kaiser: Allow KAISER to be enabled/disabled " Ingo Molnar
                   ` (2 subsequent siblings)
  43 siblings, 0 replies; 113+ messages in thread
From: Ingo Molnar @ 2017-11-24 17:24 UTC (permalink / raw)
  To: linux-kernel
  Cc: Dave Hansen, Andy Lutomirski, Thomas Gleixner, H . Peter Anvin,
	Peter Zijlstra, Borislav Petkov, Linus Torvalds

From: Dave Hansen <dave.hansen@linux.intel.com>

With KAISER Kernel PGDs that map userspace are "poisoned" with
the NX bit.  This ensures that if a kernel->user CR3 switch is
missed, userspace crashes instead of running in an unhardened
state.

This code will be needed in a moment when KAISER is turned
on and off at runtime.

Note that an __ASSEMBLY__ #ifdef is now required since kaiser.h
is indirectly included into assembly.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: daniel.gruss@iaik.tugraz.at
Cc: hughd@google.com
Cc: keescook@google.com
Cc: linux-mm@kvack.org
Cc: luto@kernel.org
Cc: michael.schwarz@iaik.tugraz.at
Cc: moritz.lipp@iaik.tugraz.at
Cc: richard.fellner@student.tugraz.at
Link: https://lkml.kernel.org/r/20171123003521.A90AC3AF@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/include/asm/pgtable_64.h | 16 +++++++++++++++-
 arch/x86/mm/kaiser.c              | 38 ++++++++++++++++++++++++++++++++++++++
 include/linux/kaiser.h            |  3 ++-
 3 files changed, 55 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h
index c239839e92bd..89bde2091af1 100644
--- a/arch/x86/include/asm/pgtable_64.h
+++ b/arch/x86/include/asm/pgtable_64.h
@@ -3,6 +3,7 @@
 #define _ASM_X86_PGTABLE_64_H
 
 #include <linux/const.h>
+#include <linux/kaiser.h>
 #include <asm/pgtable_64_types.h>
 
 #ifndef __ASSEMBLY__
@@ -199,6 +200,18 @@ static inline bool pgd_userspace_access(pgd_t pgd)
 	return pgd.pgd & _PAGE_USER;
 }
 
+static inline void kaiser_poison_pgd(pgd_t *pgd)
+{
+	if (pgd->pgd & _PAGE_PRESENT)
+		pgd->pgd |= _PAGE_NX;
+}
+
+static inline void kaiser_unpoison_pgd(pgd_t *pgd)
+{
+	if (pgd->pgd & _PAGE_PRESENT)
+		pgd->pgd &= ~_PAGE_NX;
+}
+
 /*
  * Take a PGD location (pgdp) and a pgd value that needs
  * to be set there.  Populates the shadow and returns
@@ -222,7 +235,8 @@ static inline pgd_t kaiser_set_shadow_pgd(pgd_t *pgdp, pgd_t pgd)
 			 * wrong CR3 value, userspace will crash
 			 * instead of running.
 			 */
-			pgd.pgd |= _PAGE_NX;
+			if (kaiser_active())
+				kaiser_poison_pgd(&pgd);
 		}
 	} else if (pgd_userspace_access(*pgdp)) {
 		/*
diff --git a/arch/x86/mm/kaiser.c b/arch/x86/mm/kaiser.c
index 968d5b62d597..06966b111280 100644
--- a/arch/x86/mm/kaiser.c
+++ b/arch/x86/mm/kaiser.c
@@ -501,6 +501,9 @@ static ssize_t kaiser_enabled_write_file(struct file *file,
 	if (enable > 1)
 		return -EINVAL;
 
+	if (kaiser_enabled == enable)
+		return count;
+
 	WRITE_ONCE(kaiser_enabled, enable);
 	return count;
 }
@@ -518,3 +521,38 @@ static int __init create_kaiser_enabled(void)
 	return 0;
 }
 late_initcall(create_kaiser_enabled);
+
+enum poison {
+	KAISER_POISON,
+	KAISER_UNPOISON
+};
+void kaiser_poison_pgd_page(pgd_t *pgd_page, enum poison do_poison)
+{
+	int i = 0;
+
+	for (i = 0; i < PTRS_PER_PGD; i++) {
+		pgd_t *pgd = &pgd_page[i];
+
+		/* Stop once we hit kernel addresses: */
+		if (!pgdp_maps_userspace(pgd))
+			break;
+
+		if (do_poison == KAISER_POISON)
+			kaiser_poison_pgd(pgd);
+		else
+			kaiser_unpoison_pgd(pgd);
+	}
+
+}
+
+void kaiser_poison_pgds(enum poison do_poison)
+{
+	struct page *page;
+
+	spin_lock(&pgd_lock);
+	list_for_each_entry(page, &pgd_list, lru) {
+		pgd_t *pgd = (pgd_t *)page_address(page);
+		kaiser_poison_pgd_page(pgd, do_poison);
+	}
+	spin_unlock(&pgd_lock);
+}
diff --git a/include/linux/kaiser.h b/include/linux/kaiser.h
index a3d28d00d555..83d465599646 100644
--- a/include/linux/kaiser.h
+++ b/include/linux/kaiser.h
@@ -4,7 +4,7 @@
 #ifdef CONFIG_KAISER
 #include <asm/kaiser.h>
 #else
-
+#ifndef __ASSEMBLY__
 /*
  * These stubs are used whenever CONFIG_KAISER is off, which
  * includes architectures that support KAISER, but have it
@@ -33,5 +33,6 @@ static inline bool kaiser_active(void)
 {
 	return 0;
 }
+#endif /* __ASSEMBLY__ */
 #endif /* !CONFIG_KAISER */
 #endif /* _INCLUDE_KAISER_H */
-- 
2.14.1

^ permalink raw reply	[flat|nested] 113+ messages in thread

* [PATCH 42/43] x86/mm/kaiser: Allow KAISER to be enabled/disabled at runtime
  2017-11-24 17:23 [PATCH 00/43] x86 entry-stack and Kaiser series, 2017/11/24, v2 version Ingo Molnar
                   ` (40 preceding siblings ...)
  2017-11-24 17:24 ` [PATCH 41/43] x86/mm/kaiser: Un-poison PGDs at runtime Ingo Molnar
@ 2017-11-24 17:24 ` Ingo Molnar
  2017-11-25 19:18   ` Thomas Gleixner
  2017-11-24 17:24 ` [PATCH 43/43] x86/mm/kaiser: Add Kconfig Ingo Molnar
  2017-11-24 20:22 ` [crash] PANIC: double fault, error_code: 0x0 Ingo Molnar
  43 siblings, 1 reply; 113+ messages in thread
From: Ingo Molnar @ 2017-11-24 17:24 UTC (permalink / raw)
  To: linux-kernel
  Cc: Dave Hansen, Andy Lutomirski, Thomas Gleixner, H . Peter Anvin,
	Peter Zijlstra, Borislav Petkov, Linus Torvalds

From: Dave Hansen <dave.hansen@linux.intel.com>

The KAISER CR3 switches are expensive for many reasons.  Not all systems
benefit from the protection provided by KAISER.  Some of them can not
pay the high performance cost.

This patch adds a debugfs file.  To disable KAISER, you do:

	echo 0 > /sys/kernel/debug/x86/kaiser-enabled

and to re-enable it, you can:

	echo 1 > /sys/kernel/debug/x86/kaiser-enabled

This is a *minimal* implementation.  There are certainly plenty of
optimizations that can be done on top of this by using ALTERNATIVES
among other things.

This does, however, completely remove all the KAISER-based CR3 writes.
This permits a paravirtualized system that can not tolerate CR3
writes to theoretically survive with CONFIG_KAISER=y, albeit with
/sys/kernel/debug/x86/kaiser-enabled=0.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: daniel.gruss@iaik.tugraz.at
Cc: hughd@google.com
Cc: keescook@google.com
Cc: linux-mm@kvack.org
Cc: luto@kernel.org
Cc: michael.schwarz@iaik.tugraz.at
Cc: moritz.lipp@iaik.tugraz.at
Cc: richard.fellner@student.tugraz.at
Link: https://lkml.kernel.org/r/20171123003523.28FFBAB6@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/entry/calling.h | 12 +++++++++
 arch/x86/mm/kaiser.c     | 70 +++++++++++++++++++++++++++++++++++++++++++++---
 2 files changed, 78 insertions(+), 4 deletions(-)

diff --git a/arch/x86/entry/calling.h b/arch/x86/entry/calling.h
index 66af80514197..89ccf7ae0e23 100644
--- a/arch/x86/entry/calling.h
+++ b/arch/x86/entry/calling.h
@@ -209,19 +209,29 @@ For 32-bit we have the following conventions - kernel is built with
 	orq     $(KAISER_SWITCH_MASK), \reg
 .endm
 
+.macro JUMP_IF_KAISER_OFF	label
+	testq   $1, kaiser_asm_do_switch
+	jz      \label
+.endm
+
 .macro SWITCH_TO_KERNEL_CR3 scratch_reg:req
+	JUMP_IF_KAISER_OFF	.Lswitch_done_\@
 	mov	%cr3, \scratch_reg
 	ADJUST_KERNEL_CR3 \scratch_reg
 	mov	\scratch_reg, %cr3
+.Lswitch_done_\@:
 .endm
 
 .macro SWITCH_TO_USER_CR3 scratch_reg:req
+	JUMP_IF_KAISER_OFF	.Lswitch_done_\@
 	mov	%cr3, \scratch_reg
 	ADJUST_USER_CR3 \scratch_reg
 	mov	\scratch_reg, %cr3
+.Lswitch_done_\@:
 .endm
 
 .macro SAVE_AND_SWITCH_TO_KERNEL_CR3 scratch_reg:req save_reg:req
+	JUMP_IF_KAISER_OFF	.Ldone_\@
 	movq	%cr3, %r\scratch_reg
 	movq	%r\scratch_reg, \save_reg
 	/*
@@ -244,11 +254,13 @@ For 32-bit we have the following conventions - kernel is built with
 .endm
 
 .macro RESTORE_CR3 save_reg:req
+	JUMP_IF_KAISER_OFF	.Ldone_\@
 	/*
 	 * The CR3 write could be avoided when not changing its value,
 	 * but would require a CR3 read *and* a scratch register.
 	 */
 	movq	\save_reg, %cr3
+.Ldone_\@:
 .endm
 
 #else /* CONFIG_KAISER=n: */
diff --git a/arch/x86/mm/kaiser.c b/arch/x86/mm/kaiser.c
index 06966b111280..1eb27b410556 100644
--- a/arch/x86/mm/kaiser.c
+++ b/arch/x86/mm/kaiser.c
@@ -43,6 +43,9 @@
 
 #define KAISER_WALK_ATOMIC  0x1
 
+__aligned(PAGE_SIZE)
+unsigned long kaiser_asm_do_switch[PAGE_SIZE/sizeof(unsigned long)] = { 1 };
+
 /*
  * At runtime, the only things we map are some things for CPU
  * hotplug, and stacks for new processes.  No two CPUs will ever
@@ -395,6 +398,9 @@ void __init kaiser_init(void)
 
 	kaiser_init_all_pgds();
 
+	kaiser_add_user_map_early(&kaiser_asm_do_switch, PAGE_SIZE,
+				  __PAGE_KERNEL | _PAGE_GLOBAL);
+
 	for_each_possible_cpu(cpu) {
 		void *percpu_vaddr = __per_cpu_user_mapped_start +
 				     per_cpu_offset(cpu);
@@ -483,6 +489,56 @@ static ssize_t kaiser_enabled_read_file(struct file *file, char __user *user_buf
 	return simple_read_from_buffer(user_buf, count, ppos, buf, len);
 }
 
+enum poison {
+	KAISER_POISON,
+	KAISER_UNPOISON
+};
+void kaiser_poison_pgds(enum poison do_poison);
+
+void kaiser_do_disable(void)
+{
+	/* Make sure the kernel PGDs are usable by userspace: */
+	kaiser_poison_pgds(KAISER_UNPOISON);
+
+	/*
+	 * Make sure all the CPUs have the poison clear in their TLBs.
+	 * This also functions as a barrier to ensure that everyone
+	 * sees the unpoisoned PGDs.
+	 */
+	flush_tlb_all();
+
+	/* Tell the assembly code to stop switching CR3. */
+	kaiser_asm_do_switch[0] = 0;
+
+	/*
+	 * Make sure everybody does an interrupt.  This means that
+	 * they have gone through a SWITCH_TO_KERNEL_CR3 amd are no
+	 * longer running on the userspace CR3.  If we did not do
+	 * this, we might have CPUs running on the shadow page tables
+	 * that then enter the kernel and think they do *not* need to
+	 * switch.
+	 */
+	flush_tlb_all();
+}
+
+void kaiser_do_enable(void)
+{
+	/* Tell the assembly code to start switching CR3: */
+	kaiser_asm_do_switch[0] = 1;
+
+	/* Make sure everyone can see the kaiser_asm_do_switch update: */
+	synchronize_rcu();
+
+	/*
+	 * Now that userspace is no longer using the kernel copy of
+	 * the page tables, we can poison it:
+	 */
+	kaiser_poison_pgds(KAISER_POISON);
+
+	/* Make sure all the CPUs see the poison: */
+	flush_tlb_all();
+}
+
 static ssize_t kaiser_enabled_write_file(struct file *file,
 		 const char __user *user_buf, size_t count, loff_t *ppos)
 {
@@ -504,7 +560,17 @@ static ssize_t kaiser_enabled_write_file(struct file *file,
 	if (kaiser_enabled == enable)
 		return count;
 
+	/*
+	 * This tells the page table code to stop poisoning PGDs
+	 */
 	WRITE_ONCE(kaiser_enabled, enable);
+	synchronize_rcu();
+
+	if (enable)
+		kaiser_do_enable();
+	else
+		kaiser_do_disable();
+
 	return count;
 }
 
@@ -522,10 +588,6 @@ static int __init create_kaiser_enabled(void)
 }
 late_initcall(create_kaiser_enabled);
 
-enum poison {
-	KAISER_POISON,
-	KAISER_UNPOISON
-};
 void kaiser_poison_pgd_page(pgd_t *pgd_page, enum poison do_poison)
 {
 	int i = 0;
-- 
2.14.1

^ permalink raw reply	[flat|nested] 113+ messages in thread

* [PATCH 43/43] x86/mm/kaiser: Add Kconfig
  2017-11-24 17:23 [PATCH 00/43] x86 entry-stack and Kaiser series, 2017/11/24, v2 version Ingo Molnar
                   ` (41 preceding siblings ...)
  2017-11-24 17:24 ` [PATCH 42/43] x86/mm/kaiser: Allow KAISER to be enabled/disabled " Ingo Molnar
@ 2017-11-24 17:24 ` Ingo Molnar
  2017-11-24 20:22 ` [crash] PANIC: double fault, error_code: 0x0 Ingo Molnar
  43 siblings, 0 replies; 113+ messages in thread
From: Ingo Molnar @ 2017-11-24 17:24 UTC (permalink / raw)
  To: linux-kernel
  Cc: Dave Hansen, Andy Lutomirski, Thomas Gleixner, H . Peter Anvin,
	Peter Zijlstra, Borislav Petkov, Linus Torvalds

From: Dave Hansen <dave.hansen@linux.intel.com>

PARAVIRT generally requires that the kernel not manage its own page
tables.  It also means that the hypervisor and kernel must agree
wholeheartedly about what format the page tables are in and what
they contain.  KAISER, unfortunately, changes the rules and they
can not be used together.

I've seen conflicting feedback from maintainers lately about whether
they want the Kconfig magic to go first or last in a patch series.
It's going last here because the partially-applied series leads to
kernels that can not boot in a bunch of cases.  I did a run through
the entire series with CONFIG_KAISER=y to look for build errors,
though.

Note from Hugh Dickins on why it depends on SMP:

	It is absurd that KAISER should depend on SMP, but
	apparently nobody has tried a UP build before: which
	breaks on implicit declaration of function
	'per_cpu_offset' in arch/x86/mm/kaiser.c.

	Now, you would expect that to be trivially fixed up; but
	looking at the System.map when that block is #ifdef'ed
	out of kaiser_init(), I see that in a UP build
	__per_cpu_user_mapped_end is precisely at
	__per_cpu_user_mapped_start, and the items carefully
	gathered into that section for user-mapping on SMP,
	dispersed elsewhere on UP.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: daniel.gruss@iaik.tugraz.at
Cc: hughd@google.com
Cc: keescook@google.com
Cc: linux-mm@kvack.org
Cc: luto@kernel.org
Cc: michael.schwarz@iaik.tugraz.at
Cc: moritz.lipp@iaik.tugraz.at
Cc: richard.fellner@student.tugraz.at
Link: https://lkml.kernel.org/r/20171123003524.88C90659@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 security/Kconfig | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/security/Kconfig b/security/Kconfig
index e8e449444e65..99b530d0dd9e 100644
--- a/security/Kconfig
+++ b/security/Kconfig
@@ -54,6 +54,16 @@ config SECURITY_NETWORK
 	  implement socket and networking access controls.
 	  If you are unsure how to answer this question, answer N.
 
+config KAISER
+	bool "Remove the kernel mapping in user mode"
+	depends on X86_64 && SMP && !PARAVIRT
+	help
+	  This feature reduces the number of hardware side channels by
+	  ensuring that the majority of kernel addresses are not mapped
+	  into userspace.
+
+	  See Documentation/x86/kaiser.txt for more details.
+
 config SECURITY_INFINIBAND
 	bool "Infiniband Security Hooks"
 	depends on SECURITY && INFINIBAND
-- 
2.14.1

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 12/43] x86/espfix/64: Stop assuming that pt_regs is on the entry stack
  2017-11-24 17:23 ` [PATCH 12/43] x86/espfix/64: Stop assuming that pt_regs is on the entry stack Ingo Molnar
@ 2017-11-24 18:25   ` Borislav Petkov
  2017-11-24 19:12     ` Andy Lutomirski
  0 siblings, 1 reply; 113+ messages in thread
From: Borislav Petkov @ 2017-11-24 18:25 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Dave Hansen, Andy Lutomirski, Thomas Gleixner,
	H . Peter Anvin, Peter Zijlstra, Linus Torvalds

On Fri, Nov 24, 2017 at 06:23:40PM +0100, Ingo Molnar wrote:
> From: Andy Lutomirski <luto@kernel.org>
> 
> When we start using an entry trampoline, a #GP from userspace will
> be delivered on the entry stack, not on the task stack.  Fix the
> espfix64 #DF fixup to set up #GP according to TSS.SP0, rather than
> assuming that pt_regs + 1 == SP0.  This won't change anything
> without an entry stack, but it will make the code continue to work
> when an entry stack is added.
> 
> Signed-off-by: Andy Lutomirski <luto@kernel.org>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
> Cc: Borislav Petkov <bpetkov@suse.de>
> Cc: Brian Gerst <brgerst@gmail.com>
> Cc: Dave Hansen <dave.hansen@intel.com>
> Cc: Josh Poimboeuf <jpoimboe@redhat.com>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Link: https://lkml.kernel.org/r/b1ef4136616c6bd2a75d1fd2736d1d54437d65a8.1511497875.git.luto@kernel.org
> Signed-off-by: Ingo Molnar <mingo@kernel.org>
> ---
>  arch/x86/kernel/traps.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
> index 2008dd0f8ccb..1bd43f044c62 100644
> --- a/arch/x86/kernel/traps.c
> +++ b/arch/x86/kernel/traps.c
> @@ -359,7 +359,8 @@ dotraplinkage void do_double_fault(struct pt_regs *regs, long error_code)
>  		regs->cs == __KERNEL_CS &&
>  		regs->ip == (unsigned long)native_irq_return_iret)
>  	{
> -		struct pt_regs *normal_regs = task_pt_regs(current);
> +		struct pt_regs *normal_regs =
> +			(struct pt_regs *)this_cpu_read(cpu_tss.x86_tss.sp0) - 1;

Just let that line stick out. Also, you can shorten it by renaming
normal_regs to something much shorter - it is a local variable and the
comment already explains everything you you can just as well have:

	 struct pt_regs *r = (struct pt_regs *)this_cpu_read(cpu_tss.x86_tss.sp0) - 1;

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 13/43] x86/asm/64: Use a percpu trampoline stack for IDT entries
  2017-11-24 17:23 ` [PATCH 13/43] x86/asm/64: Use a percpu trampoline stack for IDT entries Ingo Molnar
@ 2017-11-24 19:02   ` Borislav Petkov
  2017-11-26 14:16     ` Ingo Molnar
  0 siblings, 1 reply; 113+ messages in thread
From: Borislav Petkov @ 2017-11-24 19:02 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Dave Hansen, Andy Lutomirski, Thomas Gleixner,
	H . Peter Anvin, Peter Zijlstra, Linus Torvalds

On Fri, Nov 24, 2017 at 06:23:41PM +0100, Ingo Molnar wrote:
> From: Andy Lutomirski <luto@kernel.org>
> 
> Historically, IDT entries from usermode have always gone directly
> to the running task's kernel stack.  Rearrange it so that we enter on
> a percpu trampoline stack and then manually switch to the task's stack.
> This touches a couple of extra cachelines, but it gives us a chance
> to run some code before we touch the kernel stack.
> 
> The asm isn't exactly beautiful, but I think that fully refactoring
> it can wait.
> 
> Signed-off-by: Andy Lutomirski <luto@kernel.org>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

I think you mean Reviewed-by: here. The following patches have it too.

> Cc: Borislav Petkov <bpetkov@suse.de>
> Cc: Brian Gerst <brgerst@gmail.com>
> Cc: Dave Hansen <dave.hansen@intel.com>
> Cc: Josh Poimboeuf <jpoimboe@redhat.com>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Link: https://lkml.kernel.org/r/fa3958723a1a85baeaf309c735b775841205800e.1511497875.git.luto@kernel.org
> Signed-off-by: Ingo Molnar <mingo@kernel.org>
> ---
>  arch/x86/entry/entry_64.S        | 67 ++++++++++++++++++++++++++++++----------
>  arch/x86/entry/entry_64_compat.S |  5 ++-
>  arch/x86/include/asm/switch_to.h |  2 +-
>  arch/x86/include/asm/traps.h     |  1 -
>  arch/x86/kernel/cpu/common.c     |  6 ++--
>  arch/x86/kernel/traps.c          | 18 +++++------
>  6 files changed, 68 insertions(+), 31 deletions(-)

...

> diff --git a/arch/x86/include/asm/switch_to.h b/arch/x86/include/asm/switch_to.h
> index 8c6bd6863db9..a6796ac8d311 100644
> --- a/arch/x86/include/asm/switch_to.h
> +++ b/arch/x86/include/asm/switch_to.h
> @@ -93,7 +93,7 @@ static inline void update_sp0(struct task_struct *task)
>  #ifdef CONFIG_X86_32
>  	load_sp0(task->thread.sp0);
>  #else
> -	load_sp0(task_top_of_stack(task));
> +	/* On x86_64, sp0 always points to the entry trampoline stack. */
>  #endif

You can put this comment to the one ontop and remove the #else.
ifdeffery is always ugly and the less, the better.

...

> -asmlinkage __visible notrace struct pt_regs *sync_regs(struct pt_regs *eregs)
> +asmlinkage __visible notrace
> +struct pt_regs *sync_regs(struct pt_regs *eregs)
>  {
> -	struct pt_regs *regs = task_pt_regs(current);
> +	struct pt_regs *regs = (struct pt_regs *)this_cpu_read(cpu_current_top_of_stack) - 1;
>  	*regs = *eregs;
>  	return regs;
>  }
> @@ -630,13 +630,13 @@ struct bad_iret_stack *fixup_bad_iret(struct bad_iret_stack *s)
>  	/*
>  	 * This is called from entry_64.S early in handling a fault
>  	 * caused by a bad iret to user mode.  To handle the fault
> -	 * correctly, we want move our stack frame to task_pt_regs
> -	 * and we want to pretend that the exception came from the
> -	 * iret target.
> +	 * correctly, we want move our stack frame to where it would

" ... we want to move... "

> +	 * be had we entered directly on the entry stack (rather than
> +	 * just below the IRET frame) and we want to pretend that the
> +	 * exception came from the iret target.

s/iret/IRET/

>  	 */
>  	struct bad_iret_stack *new_stack =
> -		container_of(task_pt_regs(current),
> -			     struct bad_iret_stack, regs);
> +		(struct bad_iret_stack *)this_cpu_read(cpu_tss.x86_tss.sp0) - 1;
>  
>  	/* Copy the IRET target to the new stack. */
>  	memmove(&new_stack->regs.ip, (void *)s->regs.sp, 5*8);
> -- 

with that:

Reviewed-by: Borislav Petkov <bp@suse.de>

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 14/43] x86/asm/64: Return to userspace from the trampoline stack
  2017-11-24 17:23 ` [PATCH 14/43] x86/asm/64: Return to userspace from the trampoline stack Ingo Molnar
@ 2017-11-24 19:10   ` Borislav Petkov
  2017-11-26 14:18     ` Ingo Molnar
  0 siblings, 1 reply; 113+ messages in thread
From: Borislav Petkov @ 2017-11-24 19:10 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Dave Hansen, Andy Lutomirski, Thomas Gleixner,
	H . Peter Anvin, Peter Zijlstra, Linus Torvalds

On Fri, Nov 24, 2017 at 06:23:42PM +0100, Ingo Molnar wrote:
> From: Andy Lutomirski <luto@kernel.org>
> 
> By itself, this is useless.  It gives us the ability to run some final
> code before exit that cannnot run on the kernel stack.  This could
> include a CR3 switch a la KAISER or some kernel stack erasing, for
> example.  (Or even weird things like *changing* which kernel stack
> gets used as an ASLR-strengthening mechanism.)
> 
> The SYSRET32 path is not covered yet.  It could be in the future or
> we could just ignore it and force the slow path if needed.
> 
> Signed-off-by: Andy Lutomirski <luto@kernel.org>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> Cc: Borislav Petkov <bpetkov@suse.de>
> Cc: Brian Gerst <brgerst@gmail.com>
> Cc: Dave Hansen <dave.hansen@intel.com>
> Cc: Josh Poimboeuf <jpoimboe@redhat.com>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Link: https://lkml.kernel.org/r/d350017000eed20922c3b2711a2d9229dc809256.1511497875.git.luto@kernel.org
> Signed-off-by: Ingo Molnar <mingo@kernel.org>
> ---
>  arch/x86/entry/entry_64.S | 55 +++++++++++++++++++++++++++++++++++++++++++----
>  1 file changed, 51 insertions(+), 4 deletions(-)

Nice commenting, future generations will appreciate it!

:-)

Reviewed-by: Borislav Petkov <bp@suse.de>

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 12/43] x86/espfix/64: Stop assuming that pt_regs is on the entry stack
  2017-11-24 18:25   ` Borislav Petkov
@ 2017-11-24 19:12     ` Andy Lutomirski
  2017-11-26 14:05       ` Ingo Molnar
  0 siblings, 1 reply; 113+ messages in thread
From: Andy Lutomirski @ 2017-11-24 19:12 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Ingo Molnar, linux-kernel, Dave Hansen, Thomas Gleixner,
	H . Peter Anvin, Peter Zijlstra, Linus Torvalds

On Fri, Nov 24, 2017 at 10:25 AM, Borislav Petkov <bp@alien8.de> wrote:
> On Fri, Nov 24, 2017 at 06:23:40PM +0100, Ingo Molnar wrote:
>> From: Andy Lutomirski <luto@kernel.org>
>>
>> When we start using an entry trampoline, a #GP from userspace will
>> be delivered on the entry stack, not on the task stack.  Fix the
>> espfix64 #DF fixup to set up #GP according to TSS.SP0, rather than
>> assuming that pt_regs + 1 == SP0.  This won't change anything
>> without an entry stack, but it will make the code continue to work
>> when an entry stack is added.
>>
>> Signed-off-by: Andy Lutomirski <luto@kernel.org>
>> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
>> Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
>> Cc: Borislav Petkov <bpetkov@suse.de>
>> Cc: Brian Gerst <brgerst@gmail.com>
>> Cc: Dave Hansen <dave.hansen@intel.com>
>> Cc: Josh Poimboeuf <jpoimboe@redhat.com>
>> Cc: Linus Torvalds <torvalds@linux-foundation.org>
>> Cc: Peter Zijlstra <peterz@infradead.org>
>> Link: https://lkml.kernel.org/r/b1ef4136616c6bd2a75d1fd2736d1d54437d65a8.1511497875.git.luto@kernel.org
>> Signed-off-by: Ingo Molnar <mingo@kernel.org>
>> ---
>>  arch/x86/kernel/traps.c | 5 +++--
>>  1 file changed, 3 insertions(+), 2 deletions(-)
>>
>> diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
>> index 2008dd0f8ccb..1bd43f044c62 100644
>> --- a/arch/x86/kernel/traps.c
>> +++ b/arch/x86/kernel/traps.c
>> @@ -359,7 +359,8 @@ dotraplinkage void do_double_fault(struct pt_regs *regs, long error_code)
>>               regs->cs == __KERNEL_CS &&
>>               regs->ip == (unsigned long)native_irq_return_iret)
>>       {
>> -             struct pt_regs *normal_regs = task_pt_regs(current);
>> +             struct pt_regs *normal_regs =
>> +                     (struct pt_regs *)this_cpu_read(cpu_tss.x86_tss.sp0) - 1;
>
> Just let that line stick out. Also, you can shorten it by renaming
> normal_regs to something much shorter - it is a local variable and the
> comment already explains everything you you can just as well have:
>
>          struct pt_regs *r = (struct pt_regs *)this_cpu_read(cpu_tss.x86_tss.sp0) - 1;
>

Done, along with much better comments.

https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/commit/?h=x86/entry_stack&id=6510485d026abdd144d170b1bc8508327acb5648

> --
> Regards/Gruss,
>     Boris.
>
> Good mailing practices for 400: avoid top-posting and trim the reply.

^ permalink raw reply	[flat|nested] 113+ messages in thread

* [crash] PANIC: double fault, error_code: 0x0
  2017-11-24 17:23 [PATCH 00/43] x86 entry-stack and Kaiser series, 2017/11/24, v2 version Ingo Molnar
                   ` (42 preceding siblings ...)
  2017-11-24 17:24 ` [PATCH 43/43] x86/mm/kaiser: Add Kconfig Ingo Molnar
@ 2017-11-24 20:22 ` Ingo Molnar
  2017-11-24 20:59   ` Andy Lutomirski
                     ` (2 more replies)
  43 siblings, 3 replies; 113+ messages in thread
From: Ingo Molnar @ 2017-11-24 20:22 UTC (permalink / raw)
  To: linux-kernel, Andy Lutomirski, Dave Hansen
  Cc: Dave Hansen, Andy Lutomirski, Thomas Gleixner, H . Peter Anvin,
	Peter Zijlstra, Borislav Petkov, Linus Torvalds


* Ingo Molnar <mingo@kernel.org> wrote:

> This is a repost of the latest entry-stack plus Kaiser bits from Andy Lutomirski
> (v3 series from today) and Dave Hansen (kaiser-414-tipwip-20171123 version),
> on top of latest tip:x86/urgent (12a78d43de76).
> 
> This version is pretty well tested, at least on the usual x86 tree test systems.
> It has a couple of merge mistakes fixed, the biggest difference is in patch #22:
> 
>   x86/mm/kaiser: Prepare assembly for entry/exit CR3 switching
> 
> The other patches are identical or very close to what I posted earlier today.

Here's a new bug, on a testsystem I get the double fault boot crash attached 
below. The same bzImage crashes on other systems as well, so it's not CPU 
dependent.

Via Kconfig-bisection I have narrowed it down to the following .config detail: 
it's triggered by _disabling_ CONFIG_DEBUG_ENTRY and enabling CONFIG_KAISER=y.

I.e. one of the sanity checks of CONFIG_DEBUG_ENTRY has some positive side effect. 
I'll try to track down which one it is - any ideas meanwhile?

Thanks,

	Ingo

[    8.797733] calling  pt_dump_init+0x0/0x3b @ 1
[    8.803144] initcall pt_dump_init+0x0/0x3b returned 0 after 1 usecs
[    8.810589] calling  aes_init+0x0/0x11 @ 1
[    8.815757] initcall aes_init+0x0/0x11 returned 0 after 141 usecs
[    8.823020] calling  ghash_pclmulqdqni_mod_init+0x0/0x54 @ 1
[    8.831002] PANIC: double fault, error_code: 0x0
[    8.831002] CPU: 11 PID: 260 Comm: modprobe Not tainted 4.14.0-01419-g1b46550a680d-dirty #17
[    8.831002] Hardware name: Intel Corporation S2600GZ/S2600GZ, BIOS SE5C600.86B.02.02.0002.122320131210 12/23/2013
[    8.831002] task: ffff880828ba8000 task.stack: ffffc90004444000
[    8.831002] RIP: 0010:page_fault+0x11/0x60
[    8.831002] RSP: 0000:ffffffffff0e7fc8 EFLAGS: 00010046
[    8.831002] RAX: 00000000819d4d77 RBX: 0000000000000001 RCX: ffffffff819d4d77
[    8.831002] RDX: 0000000000000003 RSI: 0000000000000010 RDI: ffffffffff0e8078
[    8.831002] RBP: 0000000000000000 R08: 00007ffd7f1aa530 R09: 00007f9407f98400
[    8.831002] R10: 0000000000000007 R11: 0000000000000000 R12: 00007ffd7f1aa680
[    8.831002] R13: 00007f9407f91f80 R14: 0000000000000007 R15: 0000000000000000
[    8.831002] FS:  00007f9407f8f700(0000) GS:ffff88082e640000(0000) knlGS:0000000000000000
[    8.831002] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    8.831002] CR2: ffffffffff0e7fb8 CR3: 0000000828bc4000 CR4: 00000000001406e0
[    8.831002] Call Trace:
[    8.831002]  <SYSENTER>
[    8.831002]  ? __do_page_fault+0x4c0/0x4c0
[    8.831002]  ? page_fault+0x2c/0x60
[    8.831002]  ? native_iret+0x7/0x7
[    8.831002]  ? __do_page_fault+0x4c0/0x4c0
[    8.831002]  ? page_fault+0x2c/0x60
[    8.831002]  ? __entry_text_end+0x1/0x1
[    8.831002]  </SYSENTER>
[    8.831002] Code: ff e8 a4 75 6a ff e9 9f 02 00 00 66 66 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 48 83 c4 88 f6 84 24 88 00 00 00 03 75 20 <e8> 4a 01 00 00 48 89 e7 48 8b 74 24 78 48 c7 44 24 78 ff ff ff 
[    8.831002] Kernel panic - not syncing: Machine halted.
[    8.831002] CPU: 11 PID: 260 Comm: modprobe Not tainted 4.14.0-01419-g1b46550a680d-dirty #17
[    8.831002] Hardware name: Intel Corporation S2600GZ/S2600GZ, BIOS SE5C600.86B.02.02.0002.122320131210 12/23/2013
[    8.831002] Call Trace:
[    8.831002]  <#DF>
[    8.831002]  dump_stack+0x46/0x62
[    8.831002]  panic+0xde/0x221
[    8.831002]  df_debug+0x29/0x30
[    8.831002]  do_double_fault+0x8f/0x120
[    8.831002]  double_fault+0x22/0x30
[    8.831002] RIP: 0010:page_fault+0x11/0x60
[    8.831002] RSP: 0000:ffffffffff0e7fc8 EFLAGS: 00010046
[    8.831002] RAX: 00000000819d4d77 RBX: 0000000000000001 RCX: ffffffff819d4d77
[    8.831002] RDX: 0000000000000003 RSI: 0000000000000010 RDI: ffffffffff0e8078
[    8.831002] RBP: 0000000000000000 R08: 00007ffd7f1aa530 R09: 00007f9407f98400
[    8.831002] R10: 0000000000000007 R11: 0000000000000000 R12: 00007ffd7f1aa680
[    8.831002] R13: 00007f9407f91f80 R14: 0000000000000007 R15: 0000000000000000
[    8.831002]  ? native_iret+0x7/0x7
[    8.831002] WARNING: can't dereference iret registers at ffffffffff0e8048 for ip page_fault+0x11/0x60
[    8.831002]  </#DF>
[    8.831002]  <SYSENTER>
[    8.831002]  ? __do_page_fault+0x4c0/0x4c0
[    8.831002]  ? page_fault+0x2c/0x60
[    8.831002]  ? native_iret+0x7/0x7
[    8.831002]  ? __do_page_fault+0x4c0/0x4c0
[    8.831002]  ? page_fault+0x2c/0x60
[    8.831002]  ? __entry_text_end+0x1/0x1
[    8.831002]  </SYSENTER>
[    8.831002] Kernel Offset: disabled
[    8.831002] Rebooting in 1 seconds..
[    8.831002] ACPI MEMORY or I/O RESET_REG.

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [crash] PANIC: double fault, error_code: 0x0
  2017-11-24 20:22 ` [crash] PANIC: double fault, error_code: 0x0 Ingo Molnar
@ 2017-11-24 20:59   ` Andy Lutomirski
  2017-11-24 21:49     ` Ingo Molnar
  2017-11-24 22:09   ` Ingo Molnar
  2017-11-25  4:09   ` [crash] PANIC: double fault, error_code: 0x0 Dave Hansen
  2 siblings, 1 reply; 113+ messages in thread
From: Andy Lutomirski @ 2017-11-24 20:59 UTC (permalink / raw)
  To: Ingo Molnar, Josh Poimboeuf
  Cc: linux-kernel, Andy Lutomirski, Dave Hansen, Thomas Gleixner,
	H . Peter Anvin, Peter Zijlstra, Borislav Petkov, Linus Torvalds

On Fri, Nov 24, 2017 at 12:22 PM, Ingo Molnar <mingo@kernel.org> wrote:
>
> * Ingo Molnar <mingo@kernel.org> wrote:
>
>> This is a repost of the latest entry-stack plus Kaiser bits from Andy Lutomirski
>> (v3 series from today) and Dave Hansen (kaiser-414-tipwip-20171123 version),
>> on top of latest tip:x86/urgent (12a78d43de76).
>>
>> This version is pretty well tested, at least on the usual x86 tree test systems.
>> It has a couple of merge mistakes fixed, the biggest difference is in patch #22:
>>
>>   x86/mm/kaiser: Prepare assembly for entry/exit CR3 switching
>>
>> The other patches are identical or very close to what I posted earlier today.
>
> Here's a new bug, on a testsystem I get the double fault boot crash attached
> below. The same bzImage crashes on other systems as well, so it's not CPU
> dependent.
>
> Via Kconfig-bisection I have narrowed it down to the following .config detail:
> it's triggered by _disabling_ CONFIG_DEBUG_ENTRY and enabling CONFIG_KAISER=y.
>
> I.e. one of the sanity checks of CONFIG_DEBUG_ENTRY has some positive side effect.

That's weird and definitely not intentional.

> I'll try to track down which one it is - any ideas meanwhile?
>
> Thanks,
>
>         Ingo
>
> [    8.797733] calling  pt_dump_init+0x0/0x3b @ 1
> [    8.803144] initcall pt_dump_init+0x0/0x3b returned 0 after 1 usecs
> [    8.810589] calling  aes_init+0x0/0x11 @ 1
> [    8.815757] initcall aes_init+0x0/0x11 returned 0 after 141 usecs
> [    8.823020] calling  ghash_pclmulqdqni_mod_init+0x0/0x54 @ 1
> [    8.831002] PANIC: double fault, error_code: 0x0

The double fault will be a stack overflow on the SYSENTER stack caused
by the page fault.  You could try increasing the [64] to something
larger (PAGE_SIZE/8 perhaps) to see if the stack trace is better.

> [    8.831002] CPU: 11 PID: 260 Comm: modprobe Not tainted 4.14.0-01419-g1b46550a680d-dirty #17
> [    8.831002] Hardware name: Intel Corporation S2600GZ/S2600GZ, BIOS SE5C600.86B.02.02.0002.122320131210 12/23/2013
> [    8.831002] task: ffff880828ba8000 task.stack: ffffc90004444000
> [    8.831002] RIP: 0010:page_fault+0x11/0x60
> [    8.831002] RSP: 0000:ffffffffff0e7fc8 EFLAGS: 00010046
> [    8.831002] RAX: 00000000819d4d77 RBX: 0000000000000001 RCX: ffffffff819d4d77
> [    8.831002] RDX: 0000000000000003 RSI: 0000000000000010 RDI: ffffffffff0e8078
> [    8.831002] RBP: 0000000000000000 R08: 00007ffd7f1aa530 R09: 00007f9407f98400
> [    8.831002] R10: 0000000000000007 R11: 0000000000000000 R12: 00007ffd7f1aa680
> [    8.831002] R13: 00007f9407f91f80 R14: 0000000000000007 R15: 0000000000000000
> [    8.831002] FS:  00007f9407f8f700(0000) GS:ffff88082e640000(0000) knlGS:0000000000000000
> [    8.831002] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [    8.831002] CR2: ffffffffff0e7fb8 CR3: 0000000828bc4000 CR4: 00000000001406e0

Sadly CR2 is likely useless at this point because the double fault
will have clobbered it.

> [    8.831002] Call Trace:
> [    8.831002]  <SYSENTER>
> [    8.831002]  ? __do_page_fault+0x4c0/0x4c0
> [    8.831002]  ? page_fault+0x2c/0x60
> [    8.831002]  ? native_iret+0x7/0x7

This is weird.  native_iret+7 is IRETQ, but that should only appear at
the very top of the stack unless it's a nested entry.  But a nested
IRETQ should never fail because it's a kernel context which is, by
construction, always valid.

> [    8.831002]  ? __do_page_fault+0x4c0/0x4c0
> [    8.831002]  ? page_fault+0x2c/0x60

It looks like the initial entry may have been a page fault from user mode.

> [    8.831002]  ? __entry_text_end+0x1/0x1

Um, what?

Josh, do you know why these stack traces are crappy?  I think they
should unwind perfectly with ORC enabled.  My guess is that the stack
access check is failing because RSP is out of bounds, but it shouldn't
need access to the out-of-bounds part to unwind back to where it's in
bounds.

Anyway, my best guess is that there's an error in the page tables
that's causing GDT access to fault.  But even that's a rather weak
theory, since we shouldn't be executing page_fault on the SYSENTER
stack under any circumstances.  Perhaps one of the CR3 switches is
page faulting or is selecting bogus page tables, causing an immediate
nested fault.  On a nested fault, the stack switch won't activate
because regs point to kernel mode.

Maybe CONFIG_DEBUG_ENTRY should add some logic to the beginning of
page_fault to detect this condition and deliberately double-fault to
get a better trace.  That would be a bit nontrivial, though.

> [    8.831002]  </SYSENTER>
> [    8.831002] Code: ff e8 a4 75 6a ff e9 9f 02 00 00 66 66 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 48 83 c4 88 f6 84 24 88 00 00 00 03 75 20 <e8> 4a 01 00 00 48 89 e7 48 8b 74 24 78 48 c7 44 24 78 ff ff ff
> [    8.831002] Kernel panic - not syncing: Machine halted.
> [    8.831002] CPU: 11 PID: 260 Comm: modprobe Not tainted 4.14.0-01419-g1b46550a680d-dirty #17
> [    8.831002] Hardware name: Intel Corporation S2600GZ/S2600GZ, BIOS SE5C600.86B.02.02.0002.122320131210 12/23/2013
> [    8.831002] Call Trace:
> [    8.831002]  <#DF>
> [    8.831002]  dump_stack+0x46/0x62
> [    8.831002]  panic+0xde/0x221
> [    8.831002]  df_debug+0x29/0x30
> [    8.831002]  do_double_fault+0x8f/0x120
> [    8.831002]  double_fault+0x22/0x30
> [    8.831002] RIP: 0010:page_fault+0x11/0x60
> [    8.831002] RSP: 0000:ffffffffff0e7fc8 EFLAGS: 00010046
> [    8.831002] RAX: 00000000819d4d77 RBX: 0000000000000001 RCX: ffffffff819d4d77
> [    8.831002] RDX: 0000000000000003 RSI: 0000000000000010 RDI: ffffffffff0e8078
> [    8.831002] RBP: 0000000000000000 R08: 00007ffd7f1aa530 R09: 00007f9407f98400
> [    8.831002] R10: 0000000000000007 R11: 0000000000000000 R12: 00007ffd7f1aa680
> [    8.831002] R13: 00007f9407f91f80 R14: 0000000000000007 R15: 0000000000000000
> [    8.831002]  ? native_iret+0x7/0x7
> [    8.831002] WARNING: can't dereference iret registers at ffffffffff0e8048 for ip page_fault+0x11/0x60
> [    8.831002]  </#DF>
> [    8.831002]  <SYSENTER>
> [    8.831002]  ? __do_page_fault+0x4c0/0x4c0
> [    8.831002]  ? page_fault+0x2c/0x60
> [    8.831002]  ? native_iret+0x7/0x7
> [    8.831002]  ? __do_page_fault+0x4c0/0x4c0
> [    8.831002]  ? page_fault+0x2c/0x60
> [    8.831002]  ? __entry_text_end+0x1/0x1
> [    8.831002]  </SYSENTER>
> [    8.831002] Kernel Offset: disabled
> [    8.831002] Rebooting in 1 seconds..
> [    8.831002] ACPI MEMORY or I/O RESET_REG.
>

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [crash] PANIC: double fault, error_code: 0x0
  2017-11-24 20:59   ` Andy Lutomirski
@ 2017-11-24 21:49     ` Ingo Molnar
  2017-11-24 21:52       ` Ingo Molnar
  0 siblings, 1 reply; 113+ messages in thread
From: Ingo Molnar @ 2017-11-24 21:49 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Josh Poimboeuf, linux-kernel, Dave Hansen, Thomas Gleixner,
	H . Peter Anvin, Peter Zijlstra, Borislav Petkov, Linus Torvalds


* Andy Lutomirski <luto@kernel.org> wrote:

> On Fri, Nov 24, 2017 at 12:22 PM, Ingo Molnar <mingo@kernel.org> wrote:
> >
> > * Ingo Molnar <mingo@kernel.org> wrote:
> >
> >> This is a repost of the latest entry-stack plus Kaiser bits from Andy Lutomirski
> >> (v3 series from today) and Dave Hansen (kaiser-414-tipwip-20171123 version),
> >> on top of latest tip:x86/urgent (12a78d43de76).
> >>
> >> This version is pretty well tested, at least on the usual x86 tree test systems.
> >> It has a couple of merge mistakes fixed, the biggest difference is in patch #22:
> >>
> >>   x86/mm/kaiser: Prepare assembly for entry/exit CR3 switching
> >>
> >> The other patches are identical or very close to what I posted earlier today.
> >
> > Here's a new bug, on a testsystem I get the double fault boot crash attached
> > below. The same bzImage crashes on other systems as well, so it's not CPU
> > dependent.
> >
> > Via Kconfig-bisection I have narrowed it down to the following .config detail:
> > it's triggered by _disabling_ CONFIG_DEBUG_ENTRY and enabling CONFIG_KAISER=y.
> >
> > I.e. one of the sanity checks of CONFIG_DEBUG_ENTRY has some positive side effect.
> 
> That's weird and definitely not intentional.

Btw., can you reproduce the crash by disabling CONFIG_DEBUG_ENTRY with 
CONFIG_KAISER=y?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [crash] PANIC: double fault, error_code: 0x0
  2017-11-24 21:49     ` Ingo Molnar
@ 2017-11-24 21:52       ` Ingo Molnar
  0 siblings, 0 replies; 113+ messages in thread
From: Ingo Molnar @ 2017-11-24 21:52 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Josh Poimboeuf, linux-kernel, Dave Hansen, Thomas Gleixner,
	H . Peter Anvin, Peter Zijlstra, Borislav Petkov, Linus Torvalds

[-- Attachment #1: Type: text/plain, Size: 358 bytes --]


* Ingo Molnar <mingo@kernel.org> wrote:

> > That's weird and definitely not intentional.
> 
> Btw., can you reproduce the crash by disabling CONFIG_DEBUG_ENTRY with 
> CONFIG_KAISER=y?

I've attached a .config that triggers the crash on bootup on just about any 
hardware, before even hitting user-space. So I think it's IRQ entry related.

Thanks,

	Ingo

[-- Attachment #2: config --]
[-- Type: text/plain, Size: 115237 bytes --]

#
# Automatically generated file; DO NOT EDIT.
# Linux/x86_64 4.14.0 Kernel Configuration
#
CONFIG_64BIT=y
CONFIG_X86_64=y
CONFIG_X86=y
CONFIG_INSTRUCTION_DECODER=y
CONFIG_OUTPUT_FORMAT="elf64-x86-64"
CONFIG_ARCH_DEFCONFIG="arch/x86/configs/x86_64_defconfig"
CONFIG_LOCKDEP_SUPPORT=y
CONFIG_STACKTRACE_SUPPORT=y
CONFIG_MMU=y
CONFIG_ARCH_MMAP_RND_BITS_MIN=28
CONFIG_ARCH_MMAP_RND_BITS_MAX=32
CONFIG_ARCH_MMAP_RND_COMPAT_BITS_MIN=8
CONFIG_ARCH_MMAP_RND_COMPAT_BITS_MAX=16
CONFIG_NEED_DMA_MAP_STATE=y
CONFIG_NEED_SG_DMA_LENGTH=y
CONFIG_GENERIC_ISA_DMA=y
CONFIG_GENERIC_BUG=y
CONFIG_GENERIC_BUG_RELATIVE_POINTERS=y
CONFIG_GENERIC_HWEIGHT=y
CONFIG_ARCH_MAY_HAVE_PC_FDC=y
CONFIG_RWSEM_XCHGADD_ALGORITHM=y
CONFIG_GENERIC_CALIBRATE_DELAY=y
CONFIG_ARCH_HAS_CPU_RELAX=y
CONFIG_ARCH_HAS_CACHE_LINE_SIZE=y
CONFIG_HAVE_SETUP_PER_CPU_AREA=y
CONFIG_NEED_PER_CPU_EMBED_FIRST_CHUNK=y
CONFIG_NEED_PER_CPU_PAGE_FIRST_CHUNK=y
CONFIG_ARCH_HIBERNATION_POSSIBLE=y
CONFIG_ARCH_SUSPEND_POSSIBLE=y
CONFIG_ARCH_WANT_HUGE_PMD_SHARE=y
CONFIG_ARCH_WANT_GENERAL_HUGETLB=y
CONFIG_ZONE_DMA32=y
CONFIG_AUDIT_ARCH=y
CONFIG_ARCH_SUPPORTS_OPTIMIZED_INLINING=y
CONFIG_ARCH_SUPPORTS_DEBUG_PAGEALLOC=y
CONFIG_HAVE_INTEL_TXT=y
CONFIG_X86_64_SMP=y
CONFIG_ARCH_SUPPORTS_UPROBES=y
CONFIG_FIX_EARLYCON_MEM=y
CONFIG_PGTABLE_LEVELS=4
CONFIG_DEFCONFIG_LIST="/lib/modules/$UNAME_RELEASE/.config"
CONFIG_IRQ_WORK=y
CONFIG_BUILDTIME_EXTABLE_SORT=y
CONFIG_THREAD_INFO_IN_TASK=y

#
# General setup
#
CONFIG_INIT_ENV_ARG_LIMIT=32
CONFIG_CROSS_COMPILE=""
# CONFIG_COMPILE_TEST is not set
CONFIG_LOCALVERSION=""
CONFIG_LOCALVERSION_AUTO=y
CONFIG_HAVE_KERNEL_GZIP=y
CONFIG_HAVE_KERNEL_BZIP2=y
CONFIG_HAVE_KERNEL_LZMA=y
CONFIG_HAVE_KERNEL_XZ=y
CONFIG_HAVE_KERNEL_LZO=y
CONFIG_HAVE_KERNEL_LZ4=y
CONFIG_KERNEL_GZIP=y
# CONFIG_KERNEL_BZIP2 is not set
# CONFIG_KERNEL_LZMA is not set
# CONFIG_KERNEL_XZ is not set
# CONFIG_KERNEL_LZO is not set
# CONFIG_KERNEL_LZ4 is not set
CONFIG_DEFAULT_HOSTNAME="(none)"
CONFIG_SWAP=y
CONFIG_SYSVIPC=y
CONFIG_SYSVIPC_SYSCTL=y
CONFIG_POSIX_MQUEUE=y
CONFIG_POSIX_MQUEUE_SYSCTL=y
CONFIG_CROSS_MEMORY_ATTACH=y
CONFIG_FHANDLE=y
CONFIG_USELIB=y
CONFIG_AUDIT=y
CONFIG_HAVE_ARCH_AUDITSYSCALL=y
CONFIG_AUDITSYSCALL=y
CONFIG_AUDIT_WATCH=y
CONFIG_AUDIT_TREE=y

#
# IRQ subsystem
#
CONFIG_GENERIC_IRQ_PROBE=y
CONFIG_GENERIC_IRQ_SHOW=y
CONFIG_GENERIC_IRQ_EFFECTIVE_AFF_MASK=y
CONFIG_GENERIC_PENDING_IRQ=y
CONFIG_GENERIC_IRQ_MIGRATION=y
CONFIG_IRQ_DOMAIN=y
CONFIG_IRQ_DOMAIN_HIERARCHY=y
CONFIG_GENERIC_MSI_IRQ=y
CONFIG_GENERIC_MSI_IRQ_DOMAIN=y
CONFIG_GENERIC_IRQ_MATRIX_ALLOCATOR=y
CONFIG_GENERIC_IRQ_RESERVATION_MODE=y
# CONFIG_IRQ_DOMAIN_DEBUG is not set
CONFIG_IRQ_FORCED_THREADING=y
CONFIG_SPARSE_IRQ=y
# CONFIG_GENERIC_IRQ_DEBUGFS is not set
CONFIG_CLOCKSOURCE_WATCHDOG=y
CONFIG_ARCH_CLOCKSOURCE_DATA=y
CONFIG_CLOCKSOURCE_VALIDATE_LAST_CYCLE=y
CONFIG_GENERIC_TIME_VSYSCALL=y
CONFIG_GENERIC_CLOCKEVENTS=y
CONFIG_GENERIC_CLOCKEVENTS_BROADCAST=y
CONFIG_GENERIC_CLOCKEVENTS_MIN_ADJUST=y
CONFIG_GENERIC_CMOS_UPDATE=y

#
# Timers subsystem
#
CONFIG_TICK_ONESHOT=y
CONFIG_NO_HZ_COMMON=y
# CONFIG_HZ_PERIODIC is not set
CONFIG_NO_HZ_IDLE=y
# CONFIG_NO_HZ_FULL is not set
CONFIG_NO_HZ=y
CONFIG_HIGH_RES_TIMERS=y

#
# CPU/Task time and stats accounting
#
CONFIG_VIRT_CPU_ACCOUNTING=y
# CONFIG_TICK_CPU_ACCOUNTING is not set
CONFIG_VIRT_CPU_ACCOUNTING_GEN=y
# CONFIG_IRQ_TIME_ACCOUNTING is not set
CONFIG_BSD_PROCESS_ACCT=y
CONFIG_BSD_PROCESS_ACCT_V3=y
CONFIG_TASKSTATS=y
CONFIG_TASK_DELAY_ACCT=y
CONFIG_TASK_XACCT=y
CONFIG_TASK_IO_ACCOUNTING=y
# CONFIG_CPU_ISOLATION is not set

#
# RCU Subsystem
#
CONFIG_TREE_RCU=y
# CONFIG_RCU_EXPERT is not set
CONFIG_SRCU=y
CONFIG_TREE_SRCU=y
# CONFIG_TASKS_RCU is not set
CONFIG_RCU_STALL_COMMON=y
CONFIG_RCU_NEED_SEGCBLIST=y
CONFIG_CONTEXT_TRACKING=y
CONFIG_CONTEXT_TRACKING_FORCE=y
CONFIG_BUILD_BIN2C=y
CONFIG_IKCONFIG=m
# CONFIG_IKCONFIG_PROC is not set
CONFIG_LOG_BUF_SHIFT=22
CONFIG_LOG_CPU_MAX_BUF_SHIFT=12
CONFIG_PRINTK_SAFE_LOG_BUF_SHIFT=13
CONFIG_HAVE_UNSTABLE_SCHED_CLOCK=y
CONFIG_ARCH_SUPPORTS_NUMA_BALANCING=y
CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH=y
CONFIG_ARCH_SUPPORTS_INT128=y
# CONFIG_NUMA_BALANCING is not set
CONFIG_CGROUPS=y
# CONFIG_MEMCG is not set
CONFIG_BLK_CGROUP=y
# CONFIG_DEBUG_BLK_CGROUP is not set
CONFIG_CGROUP_SCHED=y
CONFIG_FAIR_GROUP_SCHED=y
CONFIG_CFS_BANDWIDTH=y
CONFIG_RT_GROUP_SCHED=y
# CONFIG_CGROUP_PIDS is not set
# CONFIG_CGROUP_RDMA is not set
CONFIG_CGROUP_FREEZER=y
# CONFIG_CGROUP_HUGETLB is not set
CONFIG_CPUSETS=y
CONFIG_PROC_PID_CPUSET=y
CONFIG_CGROUP_DEVICE=y
CONFIG_CGROUP_CPUACCT=y
CONFIG_CGROUP_PERF=y
# CONFIG_CGROUP_DEBUG is not set
CONFIG_SOCK_CGROUP_DATA=y
# CONFIG_CHECKPOINT_RESTORE is not set
CONFIG_NAMESPACES=y
CONFIG_UTS_NS=y
CONFIG_IPC_NS=y
# CONFIG_USER_NS is not set
CONFIG_PID_NS=y
CONFIG_NET_NS=y
CONFIG_SCHED_AUTOGROUP=y
CONFIG_SYSFS_DEPRECATED=y
# CONFIG_SYSFS_DEPRECATED_V2 is not set
CONFIG_RELAY=y
CONFIG_BLK_DEV_INITRD=y
CONFIG_INITRAMFS_SOURCE=""
CONFIG_RD_GZIP=y
CONFIG_RD_BZIP2=y
CONFIG_RD_LZMA=y
CONFIG_RD_XZ=y
CONFIG_RD_LZO=y
CONFIG_RD_LZ4=y
CONFIG_CC_OPTIMIZE_FOR_PERFORMANCE=y
# CONFIG_CC_OPTIMIZE_FOR_SIZE is not set
CONFIG_SYSCTL=y
CONFIG_ANON_INODES=y
CONFIG_HAVE_UID16=y
CONFIG_SYSCTL_EXCEPTION_TRACE=y
CONFIG_HAVE_PCSPKR_PLATFORM=y
CONFIG_BPF=y
# CONFIG_EXPERT is not set
CONFIG_UID16=y
CONFIG_MULTIUSER=y
CONFIG_SGETMASK_SYSCALL=y
CONFIG_SYSFS_SYSCALL=y
# CONFIG_SYSCTL_SYSCALL is not set
CONFIG_POSIX_TIMERS=y
CONFIG_KALLSYMS=y
CONFIG_KALLSYMS_ALL=y
CONFIG_KALLSYMS_ABSOLUTE_PERCPU=y
CONFIG_KALLSYMS_BASE_RELATIVE=y
CONFIG_PRINTK=y
CONFIG_PRINTK_NMI=y
CONFIG_BUG=y
CONFIG_ELF_CORE=y
CONFIG_PCSPKR_PLATFORM=y
CONFIG_BASE_FULL=y
CONFIG_FUTEX=y
CONFIG_FUTEX_PI=y
CONFIG_EPOLL=y
CONFIG_SIGNALFD=y
CONFIG_TIMERFD=y
CONFIG_EVENTFD=y
# CONFIG_BPF_SYSCALL is not set
CONFIG_SHMEM=y
CONFIG_AIO=y
CONFIG_ADVISE_SYSCALLS=y
# CONFIG_USERFAULTFD is not set
CONFIG_PCI_QUIRKS=y
CONFIG_MEMBARRIER=y
# CONFIG_EMBEDDED is not set
CONFIG_HAVE_PERF_EVENTS=y
# CONFIG_PC104 is not set

#
# Kernel Performance Events And Counters
#
CONFIG_PERF_EVENTS=y
# CONFIG_DEBUG_PERF_USE_VMALLOC is not set
CONFIG_VM_EVENT_COUNTERS=y
CONFIG_SLUB_DEBUG=y
# CONFIG_COMPAT_BRK is not set
# CONFIG_SLAB is not set
CONFIG_SLUB=y
CONFIG_SLAB_MERGE_DEFAULT=y
# CONFIG_SLAB_FREELIST_RANDOM is not set
# CONFIG_SLAB_FREELIST_HARDENED is not set
CONFIG_SLUB_CPU_PARTIAL=y
# CONFIG_SYSTEM_DATA_VERIFICATION is not set
CONFIG_PROFILING=y
CONFIG_TRACEPOINTS=y
CONFIG_CRASH_CORE=y
CONFIG_KEXEC_CORE=y
# CONFIG_OPROFILE is not set
CONFIG_HAVE_OPROFILE=y
CONFIG_OPROFILE_NMI_TIMER=y
CONFIG_KPROBES=y
CONFIG_JUMP_LABEL=y
# CONFIG_STATIC_KEYS_SELFTEST is not set
CONFIG_OPTPROBES=y
CONFIG_KPROBES_ON_FTRACE=y
CONFIG_UPROBES=y
# CONFIG_HAVE_64BIT_ALIGNED_ACCESS is not set
CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS=y
CONFIG_ARCH_USE_BUILTIN_BSWAP=y
CONFIG_KRETPROBES=y
CONFIG_USER_RETURN_NOTIFIER=y
CONFIG_HAVE_IOREMAP_PROT=y
CONFIG_HAVE_KPROBES=y
CONFIG_HAVE_KRETPROBES=y
CONFIG_HAVE_OPTPROBES=y
CONFIG_HAVE_KPROBES_ON_FTRACE=y
CONFIG_HAVE_NMI=y
CONFIG_HAVE_ARCH_TRACEHOOK=y
CONFIG_HAVE_DMA_CONTIGUOUS=y
CONFIG_GENERIC_SMP_IDLE_THREAD=y
CONFIG_ARCH_HAS_FORTIFY_SOURCE=y
CONFIG_ARCH_HAS_SET_MEMORY=y
CONFIG_ARCH_WANTS_DYNAMIC_TASK_STRUCT=y
CONFIG_HAVE_REGS_AND_STACK_ACCESS_API=y
CONFIG_HAVE_CLK=y
CONFIG_HAVE_DMA_API_DEBUG=y
CONFIG_HAVE_HW_BREAKPOINT=y
CONFIG_HAVE_MIXED_BREAKPOINTS_REGS=y
CONFIG_HAVE_USER_RETURN_NOTIFIER=y
CONFIG_HAVE_PERF_EVENTS_NMI=y
CONFIG_HAVE_HARDLOCKUP_DETECTOR_PERF=y
CONFIG_HAVE_PERF_REGS=y
CONFIG_HAVE_PERF_USER_STACK_DUMP=y
CONFIG_HAVE_ARCH_JUMP_LABEL=y
CONFIG_HAVE_RCU_TABLE_FREE=y
CONFIG_ARCH_HAVE_NMI_SAFE_CMPXCHG=y
CONFIG_HAVE_ALIGNED_STRUCT_PAGE=y
CONFIG_HAVE_CMPXCHG_LOCAL=y
CONFIG_HAVE_CMPXCHG_DOUBLE=y
CONFIG_ARCH_WANT_COMPAT_IPC_PARSE_VERSION=y
CONFIG_ARCH_WANT_OLD_COMPAT_IPC=y
CONFIG_HAVE_ARCH_SECCOMP_FILTER=y
CONFIG_SECCOMP_FILTER=y
CONFIG_HAVE_GCC_PLUGINS=y
# CONFIG_GCC_PLUGINS is not set
CONFIG_HAVE_CC_STACKPROTECTOR=y
CONFIG_CC_STACKPROTECTOR=y
# CONFIG_CC_STACKPROTECTOR_NONE is not set
# CONFIG_CC_STACKPROTECTOR_REGULAR is not set
CONFIG_CC_STACKPROTECTOR_STRONG=y
CONFIG_THIN_ARCHIVES=y
CONFIG_HAVE_ARCH_WITHIN_STACK_FRAMES=y
CONFIG_HAVE_CONTEXT_TRACKING=y
CONFIG_HAVE_VIRT_CPU_ACCOUNTING_GEN=y
CONFIG_HAVE_IRQ_TIME_ACCOUNTING=y
CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE=y
CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD=y
CONFIG_HAVE_ARCH_HUGE_VMAP=y
CONFIG_HAVE_ARCH_SOFT_DIRTY=y
CONFIG_HAVE_MOD_ARCH_SPECIFIC=y
CONFIG_MODULES_USE_ELF_RELA=y
CONFIG_HAVE_IRQ_EXIT_ON_IRQ_STACK=y
CONFIG_ARCH_HAS_ELF_RANDOMIZE=y
CONFIG_HAVE_ARCH_MMAP_RND_BITS=y
CONFIG_HAVE_EXIT_THREAD=y
CONFIG_ARCH_MMAP_RND_BITS=28
CONFIG_HAVE_ARCH_MMAP_RND_COMPAT_BITS=y
CONFIG_ARCH_MMAP_RND_COMPAT_BITS=8
CONFIG_HAVE_ARCH_COMPAT_MMAP_BASES=y
CONFIG_HAVE_COPY_THREAD_TLS=y
CONFIG_HAVE_STACK_VALIDATION=y
# CONFIG_HAVE_ARCH_HASH is not set
# CONFIG_ISA_BUS_API is not set
CONFIG_OLD_SIGSUSPEND3=y
CONFIG_COMPAT_OLD_SIGACTION=y
# CONFIG_CPU_NO_EFFICIENT_FFS is not set
CONFIG_HAVE_ARCH_VMAP_STACK=y
CONFIG_VMAP_STACK=y
# CONFIG_ARCH_OPTIONAL_KERNEL_RWX is not set
# CONFIG_ARCH_OPTIONAL_KERNEL_RWX_DEFAULT is not set
CONFIG_ARCH_HAS_STRICT_KERNEL_RWX=y
CONFIG_STRICT_KERNEL_RWX=y
CONFIG_ARCH_HAS_STRICT_MODULE_RWX=y
CONFIG_STRICT_MODULE_RWX=y
CONFIG_ARCH_HAS_REFCOUNT=y
# CONFIG_REFCOUNT_FULL is not set

#
# GCOV-based kernel profiling
#
# CONFIG_GCOV_KERNEL is not set
CONFIG_ARCH_HAS_GCOV_PROFILE_ALL=y
# CONFIG_HAVE_GENERIC_DMA_COHERENT is not set
CONFIG_SLABINFO=y
CONFIG_RT_MUTEXES=y
CONFIG_BASE_SMALL=0
CONFIG_MODULES=y
# CONFIG_MODULE_FORCE_LOAD is not set
CONFIG_MODULE_UNLOAD=y
# CONFIG_MODULE_FORCE_UNLOAD is not set
# CONFIG_MODVERSIONS is not set
# CONFIG_MODULE_SRCVERSION_ALL is not set
# CONFIG_MODULE_SIG is not set
# CONFIG_MODULE_COMPRESS is not set
CONFIG_MODULES_TREE_LOOKUP=y
CONFIG_BLOCK=y
CONFIG_BLK_SCSI_REQUEST=y
CONFIG_BLK_DEV_BSG=y
CONFIG_BLK_DEV_BSGLIB=y
CONFIG_BLK_DEV_INTEGRITY=y
# CONFIG_BLK_DEV_ZONED is not set
CONFIG_BLK_DEV_THROTTLING=y
# CONFIG_BLK_DEV_THROTTLING_LOW is not set
# CONFIG_BLK_CMDLINE_PARSER is not set
# CONFIG_BLK_WBT is not set
CONFIG_BLK_DEBUG_FS=y
# CONFIG_BLK_SED_OPAL is not set

#
# Partition Types
#
CONFIG_PARTITION_ADVANCED=y
# CONFIG_ACORN_PARTITION is not set
# CONFIG_AIX_PARTITION is not set
CONFIG_OSF_PARTITION=y
CONFIG_AMIGA_PARTITION=y
# CONFIG_ATARI_PARTITION is not set
CONFIG_MAC_PARTITION=y
CONFIG_MSDOS_PARTITION=y
CONFIG_BSD_DISKLABEL=y
CONFIG_MINIX_SUBPARTITION=y
CONFIG_SOLARIS_X86_PARTITION=y
CONFIG_UNIXWARE_DISKLABEL=y
# CONFIG_LDM_PARTITION is not set
CONFIG_SGI_PARTITION=y
# CONFIG_ULTRIX_PARTITION is not set
CONFIG_SUN_PARTITION=y
CONFIG_KARMA_PARTITION=y
CONFIG_EFI_PARTITION=y
# CONFIG_SYSV68_PARTITION is not set
# CONFIG_CMDLINE_PARTITION is not set
CONFIG_BLOCK_COMPAT=y
CONFIG_BLK_MQ_PCI=y
CONFIG_BLK_MQ_VIRTIO=y

#
# IO Schedulers
#
CONFIG_IOSCHED_NOOP=y
CONFIG_IOSCHED_DEADLINE=y
CONFIG_IOSCHED_CFQ=y
CONFIG_CFQ_GROUP_IOSCHED=y
# CONFIG_DEFAULT_DEADLINE is not set
CONFIG_DEFAULT_CFQ=y
# CONFIG_DEFAULT_NOOP is not set
CONFIG_DEFAULT_IOSCHED="cfq"
CONFIG_MQ_IOSCHED_DEADLINE=y
CONFIG_MQ_IOSCHED_KYBER=y
# CONFIG_IOSCHED_BFQ is not set
CONFIG_PREEMPT_NOTIFIERS=y
CONFIG_INLINE_SPIN_UNLOCK_IRQ=y
CONFIG_INLINE_READ_UNLOCK=y
CONFIG_INLINE_READ_UNLOCK_IRQ=y
CONFIG_INLINE_WRITE_UNLOCK=y
CONFIG_INLINE_WRITE_UNLOCK_IRQ=y
CONFIG_ARCH_SUPPORTS_ATOMIC_RMW=y
CONFIG_MUTEX_SPIN_ON_OWNER=y
CONFIG_RWSEM_SPIN_ON_OWNER=y
CONFIG_LOCK_SPIN_ON_OWNER=y
CONFIG_ARCH_USE_QUEUED_SPINLOCKS=y
CONFIG_QUEUED_SPINLOCKS=y
CONFIG_ARCH_USE_QUEUED_RWLOCKS=y
CONFIG_QUEUED_RWLOCKS=y
CONFIG_FREEZER=y

#
# Processor type and features
#
CONFIG_ZONE_DMA=y
CONFIG_SMP=y
CONFIG_X86_FEATURE_NAMES=y
CONFIG_X86_FAST_FEATURE_TESTS=y
CONFIG_X86_X2APIC=y
CONFIG_X86_MPPARSE=y
# CONFIG_GOLDFISH is not set
# CONFIG_INTEL_RDT is not set
CONFIG_X86_EXTENDED_PLATFORM=y
# CONFIG_X86_NUMACHIP is not set
# CONFIG_X86_VSMP is not set
# CONFIG_X86_UV is not set
# CONFIG_X86_GOLDFISH is not set
# CONFIG_X86_INTEL_MID is not set
# CONFIG_X86_INTEL_LPSS is not set
# CONFIG_X86_AMD_PLATFORM_DEVICE is not set
# CONFIG_IOSF_MBI is not set
CONFIG_X86_SUPPORTS_MEMORY_FAILURE=y
CONFIG_SCHED_OMIT_FRAME_POINTER=y
# CONFIG_HYPERVISOR_GUEST is not set
CONFIG_NO_BOOTMEM=y
# CONFIG_MK8 is not set
# CONFIG_MPSC is not set
# CONFIG_MCORE2 is not set
# CONFIG_MATOM is not set
CONFIG_GENERIC_CPU=y
CONFIG_X86_INTERNODE_CACHE_SHIFT=6
CONFIG_X86_L1_CACHE_SHIFT=6
CONFIG_X86_TSC=y
CONFIG_X86_CMPXCHG64=y
CONFIG_X86_CMOV=y
CONFIG_X86_MINIMUM_CPU_FAMILY=64
CONFIG_X86_DEBUGCTLMSR=y
CONFIG_CPU_SUP_INTEL=y
CONFIG_CPU_SUP_AMD=y
CONFIG_CPU_SUP_CENTAUR=y
CONFIG_HPET_TIMER=y
CONFIG_HPET_EMULATE_RTC=y
CONFIG_DMI=y
CONFIG_GART_IOMMU=y
# CONFIG_CALGARY_IOMMU is not set
CONFIG_SWIOTLB=y
CONFIG_IOMMU_HELPER=y
# CONFIG_MAXSMP is not set
CONFIG_NR_CPUS=128
CONFIG_SCHED_SMT=y
CONFIG_SCHED_MC=y
CONFIG_SCHED_MC_PRIO=y
# CONFIG_PREEMPT_NONE is not set
CONFIG_PREEMPT_VOLUNTARY=y
# CONFIG_PREEMPT is not set
CONFIG_X86_LOCAL_APIC=y
CONFIG_X86_IO_APIC=y
CONFIG_X86_REROUTE_FOR_BROKEN_BOOT_IRQS=y
CONFIG_X86_MCE=y
# CONFIG_X86_MCELOG_LEGACY is not set
CONFIG_X86_MCE_INTEL=y
CONFIG_X86_MCE_AMD=y
CONFIG_X86_MCE_THRESHOLD=y
# CONFIG_X86_MCE_INJECT is not set
CONFIG_X86_THERMAL_VECTOR=y

#
# Performance monitoring
#
CONFIG_PERF_EVENTS_INTEL_UNCORE=y
CONFIG_PERF_EVENTS_INTEL_RAPL=y
CONFIG_PERF_EVENTS_INTEL_CSTATE=y
# CONFIG_PERF_EVENTS_AMD_POWER is not set
# CONFIG_VM86 is not set
CONFIG_X86_16BIT=y
CONFIG_X86_ESPFIX64=y
CONFIG_X86_VSYSCALL_EMULATION=y
# CONFIG_I8K is not set
CONFIG_MICROCODE=y
CONFIG_MICROCODE_INTEL=y
CONFIG_MICROCODE_AMD=y
CONFIG_MICROCODE_OLD_INTERFACE=y
CONFIG_X86_MSR=y
CONFIG_X86_CPUID=y
# CONFIG_X86_5LEVEL is not set
CONFIG_ARCH_PHYS_ADDR_T_64BIT=y
CONFIG_ARCH_DMA_ADDR_T_64BIT=y
CONFIG_X86_DIRECT_GBPAGES=y
CONFIG_ARCH_HAS_MEM_ENCRYPT=y
# CONFIG_AMD_MEM_ENCRYPT is not set
CONFIG_NUMA=y
CONFIG_AMD_NUMA=y
CONFIG_X86_64_ACPI_NUMA=y
CONFIG_NODES_SPAN_OTHER_NODES=y
# CONFIG_NUMA_EMU is not set
CONFIG_NODES_SHIFT=9
CONFIG_ARCH_SPARSEMEM_ENABLE=y
CONFIG_ARCH_SPARSEMEM_DEFAULT=y
CONFIG_ARCH_SELECT_MEMORY_MODEL=y
CONFIG_ARCH_PROC_KCORE_TEXT=y
CONFIG_ILLEGAL_POINTER_VALUE=0xdead000000000000
CONFIG_SELECT_MEMORY_MODEL=y
CONFIG_SPARSEMEM_MANUAL=y
CONFIG_SPARSEMEM=y
CONFIG_NEED_MULTIPLE_NODES=y
CONFIG_HAVE_MEMORY_PRESENT=y
CONFIG_SPARSEMEM_EXTREME=y
CONFIG_SPARSEMEM_VMEMMAP_ENABLE=y
CONFIG_SPARSEMEM_ALLOC_MEM_MAP_TOGETHER=y
CONFIG_SPARSEMEM_VMEMMAP=y
CONFIG_HAVE_MEMBLOCK=y
CONFIG_HAVE_MEMBLOCK_NODE_MAP=y
CONFIG_HAVE_GENERIC_GUP=y
CONFIG_ARCH_DISCARD_MEMBLOCK=y
CONFIG_MEMORY_ISOLATION=y
# CONFIG_HAVE_BOOTMEM_INFO_NODE is not set
# CONFIG_MEMORY_HOTPLUG is not set
CONFIG_SPLIT_PTLOCK_CPUS=4
CONFIG_ARCH_ENABLE_SPLIT_PMD_PTLOCK=y
CONFIG_COMPACTION=y
CONFIG_MIGRATION=y
CONFIG_ARCH_ENABLE_HUGEPAGE_MIGRATION=y
CONFIG_ARCH_ENABLE_THP_MIGRATION=y
CONFIG_PHYS_ADDR_T_64BIT=y
CONFIG_BOUNCE=y
CONFIG_VIRT_TO_BUS=y
CONFIG_MMU_NOTIFIER=y
CONFIG_KSM=y
CONFIG_DEFAULT_MMAP_MIN_ADDR=4096
CONFIG_ARCH_SUPPORTS_MEMORY_FAILURE=y
CONFIG_MEMORY_FAILURE=y
# CONFIG_HWPOISON_INJECT is not set
CONFIG_TRANSPARENT_HUGEPAGE=y
CONFIG_TRANSPARENT_HUGEPAGE_ALWAYS=y
# CONFIG_TRANSPARENT_HUGEPAGE_MADVISE is not set
CONFIG_ARCH_WANTS_THP_SWAP=y
CONFIG_THP_SWAP=y
CONFIG_TRANSPARENT_HUGE_PAGECACHE=y
CONFIG_CLEANCACHE=y
CONFIG_FRONTSWAP=y
# CONFIG_CMA is not set
# CONFIG_ZSWAP is not set
# CONFIG_ZPOOL is not set
# CONFIG_ZBUD is not set
# CONFIG_ZSMALLOC is not set
CONFIG_GENERIC_EARLY_IOREMAP=y
CONFIG_ARCH_SUPPORTS_DEFERRED_STRUCT_PAGE_INIT=y
# CONFIG_IDLE_PAGE_TRACKING is not set
CONFIG_ARCH_HAS_ZONE_DEVICE=y
CONFIG_ARCH_USES_HIGH_VMA_FLAGS=y
CONFIG_ARCH_HAS_PKEYS=y
# CONFIG_PERCPU_STATS is not set
# CONFIG_X86_PMEM_LEGACY is not set
CONFIG_X86_CHECK_BIOS_CORRUPTION=y
# CONFIG_X86_BOOTPARAM_MEMORY_CORRUPTION_CHECK is not set
CONFIG_X86_RESERVE_LOW=64
CONFIG_MTRR=y
CONFIG_MTRR_SANITIZER=y
CONFIG_MTRR_SANITIZER_ENABLE_DEFAULT=1
CONFIG_MTRR_SANITIZER_SPARE_REG_NR_DEFAULT=1
CONFIG_X86_PAT=y
CONFIG_ARCH_USES_PG_UNCACHED=y
CONFIG_ARCH_RANDOM=y
CONFIG_X86_SMAP=y
CONFIG_X86_INTEL_UMIP=y
# CONFIG_X86_INTEL_MPX is not set
CONFIG_X86_INTEL_MEMORY_PROTECTION_KEYS=y
CONFIG_EFI=y
CONFIG_EFI_STUB=y
# CONFIG_EFI_MIXED is not set
CONFIG_SECCOMP=y
# CONFIG_HZ_100 is not set
# CONFIG_HZ_250 is not set
# CONFIG_HZ_300 is not set
CONFIG_HZ_1000=y
CONFIG_HZ=1000
CONFIG_SCHED_HRTICK=y
CONFIG_KEXEC=y
# CONFIG_KEXEC_FILE is not set
CONFIG_CRASH_DUMP=y
CONFIG_KEXEC_JUMP=y
CONFIG_PHYSICAL_START=0x1000000
CONFIG_RELOCATABLE=y
# CONFIG_RANDOMIZE_BASE is not set
CONFIG_PHYSICAL_ALIGN=0x1000000
CONFIG_HOTPLUG_CPU=y
# CONFIG_BOOTPARAM_HOTPLUG_CPU0 is not set
# CONFIG_DEBUG_HOTPLUG_CPU0 is not set
# CONFIG_COMPAT_VDSO is not set
CONFIG_LEGACY_VSYSCALL_EMULATE=y
# CONFIG_LEGACY_VSYSCALL_NONE is not set
CONFIG_CMDLINE_BOOL=y
CONFIG_CMDLINE=""
# CONFIG_CMDLINE_OVERRIDE is not set
CONFIG_MODIFY_LDT_SYSCALL=y
CONFIG_HAVE_LIVEPATCH=y
# CONFIG_LIVEPATCH is not set
CONFIG_ARCH_HAS_ADD_PAGES=y
CONFIG_ARCH_ENABLE_MEMORY_HOTPLUG=y
CONFIG_USE_PERCPU_NUMA_NODE_ID=y

#
# Power management and ACPI options
#
CONFIG_ARCH_HIBERNATION_HEADER=y
CONFIG_SUSPEND=y
CONFIG_SUSPEND_FREEZER=y
CONFIG_HIBERNATE_CALLBACKS=y
CONFIG_HIBERNATION=y
CONFIG_PM_STD_PARTITION=""
CONFIG_PM_SLEEP=y
CONFIG_PM_SLEEP_SMP=y
# CONFIG_PM_AUTOSLEEP is not set
# CONFIG_PM_WAKELOCKS is not set
CONFIG_PM=y
CONFIG_PM_DEBUG=y
CONFIG_PM_ADVANCED_DEBUG=y
# CONFIG_PM_TEST_SUSPEND is not set
CONFIG_PM_SLEEP_DEBUG=y
CONFIG_PM_TRACE=y
CONFIG_PM_TRACE_RTC=y
CONFIG_PM_CLK=y
# CONFIG_WQ_POWER_EFFICIENT_DEFAULT is not set
CONFIG_ACPI=y
CONFIG_ACPI_LEGACY_TABLES_LOOKUP=y
CONFIG_ARCH_MIGHT_HAVE_ACPI_PDC=y
CONFIG_ACPI_SYSTEM_POWER_STATES_SUPPORT=y
# CONFIG_ACPI_DEBUGGER is not set
CONFIG_ACPI_SLEEP=y
# CONFIG_ACPI_PROCFS_POWER is not set
CONFIG_ACPI_REV_OVERRIDE_POSSIBLE=y
# CONFIG_ACPI_EC_DEBUGFS is not set
CONFIG_ACPI_AC=y
CONFIG_ACPI_BATTERY=y
CONFIG_ACPI_BUTTON=y
# CONFIG_ACPI_VIDEO is not set
CONFIG_ACPI_FAN=y
CONFIG_ACPI_DOCK=y
CONFIG_ACPI_CPU_FREQ_PSS=y
CONFIG_ACPI_PROCESSOR_CSTATE=y
CONFIG_ACPI_PROCESSOR_IDLE=y
CONFIG_ACPI_CPPC_LIB=y
CONFIG_ACPI_PROCESSOR=y
CONFIG_ACPI_HOTPLUG_CPU=y
# CONFIG_ACPI_PROCESSOR_AGGREGATOR is not set
CONFIG_ACPI_THERMAL=y
CONFIG_ACPI_NUMA=y
# CONFIG_ACPI_CUSTOM_DSDT is not set
CONFIG_ARCH_HAS_ACPI_TABLE_UPGRADE=y
CONFIG_ACPI_TABLE_UPGRADE=y
# CONFIG_ACPI_DEBUG is not set
CONFIG_ACPI_PCI_SLOT=y
CONFIG_X86_PM_TIMER=y
CONFIG_ACPI_CONTAINER=y
CONFIG_ACPI_HOTPLUG_IOAPIC=y
# CONFIG_ACPI_SBS is not set
CONFIG_ACPI_HED=y
# CONFIG_ACPI_CUSTOM_METHOD is not set
# CONFIG_ACPI_BGRT is not set
# CONFIG_ACPI_REDUCED_HARDWARE_ONLY is not set
# CONFIG_ACPI_NFIT is not set
CONFIG_HAVE_ACPI_APEI=y
CONFIG_HAVE_ACPI_APEI_NMI=y
CONFIG_ACPI_APEI=y
CONFIG_ACPI_APEI_GHES=y
CONFIG_ACPI_APEI_PCIEAER=y
CONFIG_ACPI_APEI_MEMORY_FAILURE=y
# CONFIG_ACPI_APEI_EINJ is not set
# CONFIG_ACPI_APEI_ERST_DEBUG is not set
# CONFIG_DPTF_POWER is not set
# CONFIG_ACPI_EXTLOG is not set
# CONFIG_PMIC_OPREGION is not set
# CONFIG_ACPI_CONFIGFS is not set
CONFIG_SFI=y

#
# CPU Frequency scaling
#
CONFIG_CPU_FREQ=y
CONFIG_CPU_FREQ_GOV_ATTR_SET=y
CONFIG_CPU_FREQ_GOV_COMMON=y
# CONFIG_CPU_FREQ_STAT is not set
# CONFIG_CPU_FREQ_DEFAULT_GOV_PERFORMANCE is not set
# CONFIG_CPU_FREQ_DEFAULT_GOV_POWERSAVE is not set
# CONFIG_CPU_FREQ_DEFAULT_GOV_USERSPACE is not set
CONFIG_CPU_FREQ_DEFAULT_GOV_ONDEMAND=y
# CONFIG_CPU_FREQ_DEFAULT_GOV_CONSERVATIVE is not set
# CONFIG_CPU_FREQ_DEFAULT_GOV_SCHEDUTIL is not set
CONFIG_CPU_FREQ_GOV_PERFORMANCE=y
CONFIG_CPU_FREQ_GOV_POWERSAVE=y
CONFIG_CPU_FREQ_GOV_USERSPACE=y
CONFIG_CPU_FREQ_GOV_ONDEMAND=y
CONFIG_CPU_FREQ_GOV_CONSERVATIVE=y
# CONFIG_CPU_FREQ_GOV_SCHEDUTIL is not set

#
# CPU frequency scaling drivers
#
CONFIG_X86_INTEL_PSTATE=y
CONFIG_X86_PCC_CPUFREQ=y
CONFIG_X86_ACPI_CPUFREQ=y
CONFIG_X86_ACPI_CPUFREQ_CPB=y
CONFIG_X86_POWERNOW_K8=y
# CONFIG_X86_AMD_FREQ_SENSITIVITY is not set
# CONFIG_X86_SPEEDSTEP_CENTRINO is not set
CONFIG_X86_P4_CLOCKMOD=y

#
# shared options
#
CONFIG_X86_SPEEDSTEP_LIB=y

#
# CPU Idle
#
CONFIG_CPU_IDLE=y
CONFIG_CPU_IDLE_GOV_LADDER=y
CONFIG_CPU_IDLE_GOV_MENU=y
# CONFIG_ARCH_NEEDS_CPU_IDLE_COUPLED is not set
CONFIG_INTEL_IDLE=y

#
# Bus options (PCI etc.)
#
CONFIG_PCI=y
CONFIG_PCI_DIRECT=y
CONFIG_PCI_MMCONFIG=y
CONFIG_PCI_DOMAINS=y
CONFIG_PCIEPORTBUS=y
CONFIG_HOTPLUG_PCI_PCIE=y
CONFIG_PCIEAER=y
CONFIG_PCIE_ECRC=y
# CONFIG_PCIEAER_INJECT is not set
CONFIG_PCIEASPM=y
# CONFIG_PCIEASPM_DEBUG is not set
CONFIG_PCIEASPM_DEFAULT=y
# CONFIG_PCIEASPM_POWERSAVE is not set
# CONFIG_PCIEASPM_POWER_SUPERSAVE is not set
# CONFIG_PCIEASPM_PERFORMANCE is not set
CONFIG_PCIE_PME=y
# CONFIG_PCIE_DPC is not set
# CONFIG_PCIE_PTM is not set
CONFIG_PCI_BUS_ADDR_T_64BIT=y
CONFIG_PCI_MSI=y
CONFIG_PCI_MSI_IRQ_DOMAIN=y
# CONFIG_PCI_DEBUG is not set
# CONFIG_PCI_REALLOC_ENABLE_AUTO is not set
CONFIG_PCI_STUB=y
CONFIG_PCI_ATS=y
CONFIG_PCI_LOCKLESS_CONFIG=y
CONFIG_PCI_IOV=y
CONFIG_PCI_PRI=y
CONFIG_PCI_PASID=y
CONFIG_PCI_LABEL=y
CONFIG_HOTPLUG_PCI=y
CONFIG_HOTPLUG_PCI_ACPI=y
# CONFIG_HOTPLUG_PCI_ACPI_IBM is not set
# CONFIG_HOTPLUG_PCI_CPCI is not set
# CONFIG_HOTPLUG_PCI_SHPC is not set

#
# DesignWare PCI Core Support
#
# CONFIG_PCIE_DW_PLAT is not set

#
# PCI host controller drivers
#
# CONFIG_VMD is not set

#
# PCI Endpoint
#
# CONFIG_PCI_ENDPOINT is not set

#
# PCI switch controller drivers
#
# CONFIG_PCI_SW_SWITCHTEC is not set
CONFIG_ISA_DMA_API=y
CONFIG_AMD_NB=y
CONFIG_PCCARD=y
CONFIG_PCMCIA=y
CONFIG_PCMCIA_LOAD_CIS=y
CONFIG_CARDBUS=y

#
# PC-card bridges
#
# CONFIG_YENTA is not set
# CONFIG_PD6729 is not set
# CONFIG_I82092 is not set
# CONFIG_RAPIDIO is not set
# CONFIG_X86_SYSFB is not set

#
# Executable file formats / Emulations
#
CONFIG_BINFMT_ELF=y
CONFIG_COMPAT_BINFMT_ELF=y
CONFIG_ELFCORE=y
CONFIG_CORE_DUMP_DEFAULT_ELF_HEADERS=y
CONFIG_BINFMT_SCRIPT=y
# CONFIG_HAVE_AOUT is not set
# CONFIG_BINFMT_MISC is not set
CONFIG_COREDUMP=y
CONFIG_IA32_EMULATION=y
# CONFIG_IA32_AOUT is not set
# CONFIG_X86_X32 is not set
CONFIG_COMPAT_32=y
CONFIG_COMPAT=y
CONFIG_COMPAT_FOR_U64_ALIGNMENT=y
CONFIG_SYSVIPC_COMPAT=y
CONFIG_X86_DEV_DMA_OPS=y
CONFIG_NET=y
CONFIG_NET_INGRESS=y

#
# Networking options
#
CONFIG_PACKET=y
# CONFIG_PACKET_DIAG is not set
CONFIG_UNIX=y
# CONFIG_UNIX_DIAG is not set
# CONFIG_TLS is not set
CONFIG_XFRM=y
CONFIG_XFRM_ALGO=y
CONFIG_XFRM_USER=y
CONFIG_XFRM_SUB_POLICY=y
CONFIG_XFRM_MIGRATE=y
CONFIG_XFRM_STATISTICS=y
# CONFIG_NET_KEY is not set
CONFIG_INET=y
CONFIG_IP_MULTICAST=y
CONFIG_IP_ADVANCED_ROUTER=y
CONFIG_IP_FIB_TRIE_STATS=y
CONFIG_IP_MULTIPLE_TABLES=y
CONFIG_IP_ROUTE_MULTIPATH=y
CONFIG_IP_ROUTE_VERBOSE=y
CONFIG_IP_PNP=y
CONFIG_IP_PNP_DHCP=y
# CONFIG_IP_PNP_BOOTP is not set
# CONFIG_IP_PNP_RARP is not set
# CONFIG_NET_IPIP is not set
# CONFIG_NET_IPGRE_DEMUX is not set
# CONFIG_NET_IP_TUNNEL is not set
CONFIG_IP_MROUTE=y
CONFIG_IP_MROUTE_MULTIPLE_TABLES=y
CONFIG_IP_PIMSM_V1=y
CONFIG_IP_PIMSM_V2=y
CONFIG_SYN_COOKIES=y
# CONFIG_NET_UDP_TUNNEL is not set
# CONFIG_NET_FOU is not set
# CONFIG_INET_AH is not set
# CONFIG_INET_ESP is not set
# CONFIG_INET_IPCOMP is not set
# CONFIG_INET_XFRM_TUNNEL is not set
# CONFIG_INET_TUNNEL is not set
# CONFIG_INET_XFRM_MODE_TRANSPORT is not set
# CONFIG_INET_XFRM_MODE_TUNNEL is not set
# CONFIG_INET_XFRM_MODE_BEET is not set
# CONFIG_INET_DIAG is not set
CONFIG_TCP_CONG_ADVANCED=y
# CONFIG_TCP_CONG_BIC is not set
CONFIG_TCP_CONG_CUBIC=y
# CONFIG_TCP_CONG_WESTWOOD is not set
# CONFIG_TCP_CONG_HTCP is not set
# CONFIG_TCP_CONG_HSTCP is not set
# CONFIG_TCP_CONG_HYBLA is not set
# CONFIG_TCP_CONG_VEGAS is not set
# CONFIG_TCP_CONG_NV is not set
# CONFIG_TCP_CONG_SCALABLE is not set
# CONFIG_TCP_CONG_LP is not set
# CONFIG_TCP_CONG_VENO is not set
# CONFIG_TCP_CONG_YEAH is not set
# CONFIG_TCP_CONG_ILLINOIS is not set
# CONFIG_TCP_CONG_DCTCP is not set
# CONFIG_TCP_CONG_CDG is not set
# CONFIG_TCP_CONG_BBR is not set
CONFIG_DEFAULT_CUBIC=y
# CONFIG_DEFAULT_RENO is not set
CONFIG_DEFAULT_TCP_CONG="cubic"
CONFIG_TCP_MD5SIG=y
CONFIG_IPV6=y
CONFIG_IPV6_ROUTER_PREF=y
CONFIG_IPV6_ROUTE_INFO=y
CONFIG_IPV6_OPTIMISTIC_DAD=y
# CONFIG_INET6_AH is not set
# CONFIG_INET6_ESP is not set
# CONFIG_INET6_IPCOMP is not set
CONFIG_IPV6_MIP6=y
# CONFIG_IPV6_ILA is not set
# CONFIG_INET6_XFRM_TUNNEL is not set
# CONFIG_INET6_TUNNEL is not set
# CONFIG_INET6_XFRM_MODE_TRANSPORT is not set
# CONFIG_INET6_XFRM_MODE_TUNNEL is not set
# CONFIG_INET6_XFRM_MODE_BEET is not set
# CONFIG_INET6_XFRM_MODE_ROUTEOPTIMIZATION is not set
# CONFIG_IPV6_SIT is not set
# CONFIG_IPV6_TUNNEL is not set
# CONFIG_IPV6_FOU is not set
# CONFIG_IPV6_FOU_TUNNEL is not set
CONFIG_IPV6_MULTIPLE_TABLES=y
CONFIG_IPV6_SUBTREES=y
CONFIG_IPV6_MROUTE=y
CONFIG_IPV6_MROUTE_MULTIPLE_TABLES=y
CONFIG_IPV6_PIMSM_V2=y
# CONFIG_IPV6_SEG6_LWTUNNEL is not set
# CONFIG_IPV6_SEG6_HMAC is not set
CONFIG_NETLABEL=y
CONFIG_NETWORK_SECMARK=y
CONFIG_NET_PTP_CLASSIFY=y
# CONFIG_NETWORK_PHY_TIMESTAMPING is not set
CONFIG_NETFILTER=y
CONFIG_NETFILTER_ADVANCED=y

#
# Core Netfilter Configuration
#
CONFIG_NETFILTER_INGRESS=y
# CONFIG_NETFILTER_NETLINK_ACCT is not set
# CONFIG_NETFILTER_NETLINK_QUEUE is not set
# CONFIG_NETFILTER_NETLINK_LOG is not set
# CONFIG_NF_CONNTRACK is not set
# CONFIG_NF_LOG_NETDEV is not set
# CONFIG_NF_TABLES is not set
CONFIG_NETFILTER_XTABLES=y

#
# Xtables combined modules
#
# CONFIG_NETFILTER_XT_MARK is not set

#
# Xtables targets
#
# CONFIG_NETFILTER_XT_TARGET_AUDIT is not set
# CONFIG_NETFILTER_XT_TARGET_CLASSIFY is not set
# CONFIG_NETFILTER_XT_TARGET_HMARK is not set
# CONFIG_NETFILTER_XT_TARGET_IDLETIMER is not set
# CONFIG_NETFILTER_XT_TARGET_LED is not set
# CONFIG_NETFILTER_XT_TARGET_LOG is not set
# CONFIG_NETFILTER_XT_TARGET_MARK is not set
# CONFIG_NETFILTER_XT_TARGET_NFLOG is not set
# CONFIG_NETFILTER_XT_TARGET_NFQUEUE is not set
# CONFIG_NETFILTER_XT_TARGET_RATEEST is not set
# CONFIG_NETFILTER_XT_TARGET_TEE is not set
# CONFIG_NETFILTER_XT_TARGET_SECMARK is not set
# CONFIG_NETFILTER_XT_TARGET_TCPMSS is not set

#
# Xtables matches
#
# CONFIG_NETFILTER_XT_MATCH_ADDRTYPE is not set
# CONFIG_NETFILTER_XT_MATCH_BPF is not set
# CONFIG_NETFILTER_XT_MATCH_CGROUP is not set
# CONFIG_NETFILTER_XT_MATCH_COMMENT is not set
# CONFIG_NETFILTER_XT_MATCH_CPU is not set
# CONFIG_NETFILTER_XT_MATCH_DCCP is not set
# CONFIG_NETFILTER_XT_MATCH_DEVGROUP is not set
# CONFIG_NETFILTER_XT_MATCH_DSCP is not set
# CONFIG_NETFILTER_XT_MATCH_ECN is not set
# CONFIG_NETFILTER_XT_MATCH_ESP is not set
# CONFIG_NETFILTER_XT_MATCH_HASHLIMIT is not set
# CONFIG_NETFILTER_XT_MATCH_HL is not set
# CONFIG_NETFILTER_XT_MATCH_IPCOMP is not set
# CONFIG_NETFILTER_XT_MATCH_IPRANGE is not set
# CONFIG_NETFILTER_XT_MATCH_L2TP is not set
# CONFIG_NETFILTER_XT_MATCH_LENGTH is not set
# CONFIG_NETFILTER_XT_MATCH_LIMIT is not set
# CONFIG_NETFILTER_XT_MATCH_MAC is not set
# CONFIG_NETFILTER_XT_MATCH_MARK is not set
# CONFIG_NETFILTER_XT_MATCH_MULTIPORT is not set
# CONFIG_NETFILTER_XT_MATCH_NFACCT is not set
# CONFIG_NETFILTER_XT_MATCH_OWNER is not set
# CONFIG_NETFILTER_XT_MATCH_POLICY is not set
# CONFIG_NETFILTER_XT_MATCH_PKTTYPE is not set
# CONFIG_NETFILTER_XT_MATCH_QUOTA is not set
# CONFIG_NETFILTER_XT_MATCH_RATEEST is not set
# CONFIG_NETFILTER_XT_MATCH_REALM is not set
# CONFIG_NETFILTER_XT_MATCH_RECENT is not set
# CONFIG_NETFILTER_XT_MATCH_SCTP is not set
# CONFIG_NETFILTER_XT_MATCH_STATISTIC is not set
# CONFIG_NETFILTER_XT_MATCH_STRING is not set
# CONFIG_NETFILTER_XT_MATCH_TCPMSS is not set
# CONFIG_NETFILTER_XT_MATCH_TIME is not set
# CONFIG_NETFILTER_XT_MATCH_U32 is not set
# CONFIG_IP_SET is not set
# CONFIG_IP_VS is not set

#
# IP: Netfilter Configuration
#
# CONFIG_NF_DEFRAG_IPV4 is not set
# CONFIG_NF_SOCKET_IPV4 is not set
# CONFIG_NF_DUP_IPV4 is not set
# CONFIG_NF_LOG_ARP is not set
# CONFIG_NF_LOG_IPV4 is not set
CONFIG_NF_REJECT_IPV4=y
CONFIG_IP_NF_IPTABLES=y
# CONFIG_IP_NF_MATCH_AH is not set
# CONFIG_IP_NF_MATCH_ECN is not set
# CONFIG_IP_NF_MATCH_TTL is not set
CONFIG_IP_NF_FILTER=y
CONFIG_IP_NF_TARGET_REJECT=y
# CONFIG_IP_NF_MANGLE is not set
# CONFIG_IP_NF_RAW is not set
# CONFIG_IP_NF_SECURITY is not set
# CONFIG_IP_NF_ARPTABLES is not set

#
# IPv6: Netfilter Configuration
#
# CONFIG_NF_DEFRAG_IPV6 is not set
# CONFIG_NF_SOCKET_IPV6 is not set
# CONFIG_NF_DUP_IPV6 is not set
# CONFIG_NF_REJECT_IPV6 is not set
# CONFIG_NF_LOG_IPV6 is not set
# CONFIG_IP6_NF_IPTABLES is not set
# CONFIG_IP_DCCP is not set
# CONFIG_IP_SCTP is not set
# CONFIG_RDS is not set
# CONFIG_TIPC is not set
# CONFIG_ATM is not set
# CONFIG_L2TP is not set
# CONFIG_BRIDGE is not set
CONFIG_HAVE_NET_DSA=y
# CONFIG_NET_DSA is not set
# CONFIG_VLAN_8021Q is not set
# CONFIG_DECNET is not set
# CONFIG_LLC2 is not set
# CONFIG_IPX is not set
# CONFIG_ATALK is not set
# CONFIG_X25 is not set
# CONFIG_LAPB is not set
# CONFIG_PHONET is not set
# CONFIG_6LOWPAN is not set
# CONFIG_IEEE802154 is not set
CONFIG_NET_SCHED=y

#
# Queueing/Scheduling
#
# CONFIG_NET_SCH_CBQ is not set
# CONFIG_NET_SCH_HTB is not set
# CONFIG_NET_SCH_HFSC is not set
# CONFIG_NET_SCH_PRIO is not set
# CONFIG_NET_SCH_MULTIQ is not set
# CONFIG_NET_SCH_RED is not set
# CONFIG_NET_SCH_SFB is not set
# CONFIG_NET_SCH_SFQ is not set
# CONFIG_NET_SCH_TEQL is not set
# CONFIG_NET_SCH_TBF is not set
# CONFIG_NET_SCH_GRED is not set
# CONFIG_NET_SCH_DSMARK is not set
# CONFIG_NET_SCH_NETEM is not set
# CONFIG_NET_SCH_DRR is not set
# CONFIG_NET_SCH_MQPRIO is not set
# CONFIG_NET_SCH_CHOKE is not set
# CONFIG_NET_SCH_QFQ is not set
# CONFIG_NET_SCH_CODEL is not set
# CONFIG_NET_SCH_FQ_CODEL is not set
# CONFIG_NET_SCH_FQ is not set
# CONFIG_NET_SCH_HHF is not set
# CONFIG_NET_SCH_PIE is not set
# CONFIG_NET_SCH_INGRESS is not set
# CONFIG_NET_SCH_PLUG is not set
# CONFIG_NET_SCH_DEFAULT is not set

#
# Classification
#
CONFIG_NET_CLS=y
# CONFIG_NET_CLS_BASIC is not set
# CONFIG_NET_CLS_TCINDEX is not set
# CONFIG_NET_CLS_ROUTE4 is not set
# CONFIG_NET_CLS_FW is not set
# CONFIG_NET_CLS_U32 is not set
# CONFIG_NET_CLS_RSVP is not set
# CONFIG_NET_CLS_RSVP6 is not set
# CONFIG_NET_CLS_FLOW is not set
CONFIG_NET_CLS_CGROUP=y
# CONFIG_NET_CLS_BPF is not set
# CONFIG_NET_CLS_FLOWER is not set
# CONFIG_NET_CLS_MATCHALL is not set
CONFIG_NET_EMATCH=y
CONFIG_NET_EMATCH_STACK=32
# CONFIG_NET_EMATCH_CMP is not set
# CONFIG_NET_EMATCH_NBYTE is not set
# CONFIG_NET_EMATCH_U32 is not set
# CONFIG_NET_EMATCH_META is not set
# CONFIG_NET_EMATCH_TEXT is not set
CONFIG_NET_CLS_ACT=y
# CONFIG_NET_ACT_POLICE is not set
# CONFIG_NET_ACT_GACT is not set
# CONFIG_NET_ACT_MIRRED is not set
# CONFIG_NET_ACT_SAMPLE is not set
# CONFIG_NET_ACT_IPT is not set
# CONFIG_NET_ACT_NAT is not set
# CONFIG_NET_ACT_PEDIT is not set
# CONFIG_NET_ACT_SIMP is not set
# CONFIG_NET_ACT_SKBEDIT is not set
# CONFIG_NET_ACT_CSUM is not set
# CONFIG_NET_ACT_VLAN is not set
# CONFIG_NET_ACT_BPF is not set
# CONFIG_NET_ACT_SKBMOD is not set
# CONFIG_NET_ACT_IFE is not set
# CONFIG_NET_ACT_TUNNEL_KEY is not set
CONFIG_NET_SCH_FIFO=y
CONFIG_DCB=y
CONFIG_DNS_RESOLVER=y
# CONFIG_BATMAN_ADV is not set
# CONFIG_OPENVSWITCH is not set
# CONFIG_VSOCKETS is not set
# CONFIG_NETLINK_DIAG is not set
# CONFIG_MPLS is not set
# CONFIG_NET_NSH is not set
# CONFIG_HSR is not set
# CONFIG_NET_SWITCHDEV is not set
# CONFIG_NET_L3_MASTER_DEV is not set
# CONFIG_NET_NCSI is not set
CONFIG_RPS=y
CONFIG_RFS_ACCEL=y
CONFIG_XPS=y
# CONFIG_CGROUP_NET_PRIO is not set
CONFIG_CGROUP_NET_CLASSID=y
CONFIG_NET_RX_BUSY_POLL=y
CONFIG_BQL=y
CONFIG_BPF_JIT=y
CONFIG_NET_FLOW_LIMIT=y

#
# Network testing
#
# CONFIG_NET_PKTGEN is not set
# CONFIG_NET_TCPPROBE is not set
CONFIG_NET_DROP_MONITOR=y
CONFIG_HAMRADIO=y

#
# Packet Radio protocols
#
# CONFIG_AX25 is not set
# CONFIG_CAN is not set
CONFIG_BT=y
CONFIG_BT_BREDR=y
# CONFIG_BT_RFCOMM is not set
CONFIG_BT_BNEP=y
CONFIG_BT_BNEP_MC_FILTER=y
CONFIG_BT_BNEP_PROTO_FILTER=y
# CONFIG_BT_HIDP is not set
CONFIG_BT_HS=y
CONFIG_BT_LE=y
# CONFIG_BT_LEDS is not set
# CONFIG_BT_SELFTEST is not set
CONFIG_BT_DEBUGFS=y

#
# Bluetooth device drivers
#
# CONFIG_BT_HCIBTUSB is not set
# CONFIG_BT_HCIUART is not set
# CONFIG_BT_HCIBCM203X is not set
# CONFIG_BT_HCIBFUSB is not set
# CONFIG_BT_HCIDTL1 is not set
# CONFIG_BT_HCIBT3C is not set
# CONFIG_BT_HCIBLUECARD is not set
# CONFIG_BT_HCIBTUART is not set
# CONFIG_BT_HCIVHCI is not set
# CONFIG_BT_MRVL is not set
# CONFIG_AF_RXRPC is not set
# CONFIG_AF_KCM is not set
# CONFIG_STREAM_PARSER is not set
CONFIG_FIB_RULES=y
CONFIG_WIRELESS=y
# CONFIG_CFG80211 is not set
# CONFIG_LIB80211 is not set

#
# CFG80211 needs to be enabled for MAC80211
#
CONFIG_MAC80211_STA_HASH_MAX_SIZE=0
# CONFIG_WIMAX is not set
CONFIG_RFKILL=y
CONFIG_RFKILL_LEDS=y
CONFIG_RFKILL_INPUT=y
CONFIG_NET_9P=y
CONFIG_NET_9P_VIRTIO=y
# CONFIG_NET_9P_DEBUG is not set
# CONFIG_CAIF is not set
# CONFIG_CEPH_LIB is not set
# CONFIG_NFC is not set
# CONFIG_PSAMPLE is not set
# CONFIG_NET_IFE is not set
# CONFIG_LWTUNNEL is not set
# CONFIG_DST_CACHE is not set
CONFIG_GRO_CELLS=y
# CONFIG_NET_DEVLINK is not set
CONFIG_MAY_USE_DEVLINK=y
CONFIG_HAVE_EBPF_JIT=y

#
# Device Drivers
#

#
# Generic Driver Options
#
CONFIG_UEVENT_HELPER=y
CONFIG_UEVENT_HELPER_PATH=""
CONFIG_DEVTMPFS=y
CONFIG_DEVTMPFS_MOUNT=y
CONFIG_STANDALONE=y
CONFIG_PREVENT_FIRMWARE_BUILD=y
CONFIG_FW_LOADER=y
# CONFIG_FIRMWARE_IN_KERNEL is not set
CONFIG_EXTRA_FIRMWARE=""
# CONFIG_FW_LOADER_USER_HELPER_FALLBACK is not set
CONFIG_ALLOW_DEV_COREDUMP=y
# CONFIG_DEBUG_DRIVER is not set
CONFIG_DEBUG_DEVRES=y
# CONFIG_DEBUG_TEST_DRIVER_REMOVE is not set
# CONFIG_TEST_ASYNC_DRIVER_PROBE is not set
# CONFIG_SYS_HYPERVISOR is not set
# CONFIG_GENERIC_CPU_DEVICES is not set
CONFIG_GENERIC_CPU_AUTOPROBE=y
CONFIG_REGMAP=y
CONFIG_REGMAP_I2C=y
# CONFIG_DMA_SHARED_BUFFER is not set

#
# Bus devices
#
CONFIG_CONNECTOR=y
CONFIG_PROC_EVENTS=y
# CONFIG_MTD is not set
# CONFIG_OF is not set
CONFIG_ARCH_MIGHT_HAVE_PC_PARPORT=y
# CONFIG_PARPORT is not set
CONFIG_PNP=y
# CONFIG_PNP_DEBUG_MESSAGES is not set

#
# Protocols
#
CONFIG_PNPACPI=y
CONFIG_BLK_DEV=y
# CONFIG_BLK_DEV_NULL_BLK is not set
# CONFIG_BLK_DEV_FD is not set
# CONFIG_BLK_DEV_PCIESSD_MTIP32XX is not set
# CONFIG_BLK_DEV_DAC960 is not set
# CONFIG_BLK_DEV_UMEM is not set
# CONFIG_BLK_DEV_COW_COMMON is not set
CONFIG_BLK_DEV_LOOP=y
CONFIG_BLK_DEV_LOOP_MIN_COUNT=8
# CONFIG_BLK_DEV_CRYPTOLOOP is not set
# CONFIG_BLK_DEV_DRBD is not set
# CONFIG_BLK_DEV_NBD is not set
# CONFIG_BLK_DEV_SKD is not set
# CONFIG_BLK_DEV_SX8 is not set
CONFIG_BLK_DEV_RAM=y
CONFIG_BLK_DEV_RAM_COUNT=16
CONFIG_BLK_DEV_RAM_SIZE=4096
# CONFIG_CDROM_PKTCDVD is not set
# CONFIG_ATA_OVER_ETH is not set
CONFIG_VIRTIO_BLK=y
# CONFIG_VIRTIO_BLK_SCSI is not set
# CONFIG_BLK_DEV_RBD is not set
# CONFIG_BLK_DEV_RSXX is not set
# CONFIG_BLK_DEV_NVME is not set
# CONFIG_NVME_FC is not set
# CONFIG_NVME_TARGET is not set

#
# Misc devices
#
# CONFIG_SENSORS_LIS3LV02D is not set
# CONFIG_AD525X_DPOT is not set
# CONFIG_DUMMY_IRQ is not set
# CONFIG_IBM_ASM is not set
# CONFIG_PHANTOM is not set
# CONFIG_SGI_IOC4 is not set
# CONFIG_TIFM_CORE is not set
# CONFIG_ICS932S401 is not set
# CONFIG_ENCLOSURE_SERVICES is not set
# CONFIG_HP_ILO is not set
# CONFIG_APDS9802ALS is not set
# CONFIG_ISL29003 is not set
# CONFIG_ISL29020 is not set
# CONFIG_SENSORS_TSL2550 is not set
# CONFIG_SENSORS_BH1770 is not set
# CONFIG_SENSORS_APDS990X is not set
# CONFIG_HMC6352 is not set
# CONFIG_DS1682 is not set
# CONFIG_USB_SWITCH_FSA9480 is not set
# CONFIG_SRAM is not set
# CONFIG_PCI_ENDPOINT_TEST is not set
# CONFIG_C2PORT is not set

#
# EEPROM support
#
# CONFIG_EEPROM_AT24 is not set
# CONFIG_EEPROM_LEGACY is not set
# CONFIG_EEPROM_MAX6875 is not set
# CONFIG_EEPROM_93CX6 is not set
# CONFIG_EEPROM_IDT_89HPESX is not set
# CONFIG_CB710_CORE is not set

#
# Texas Instruments shared transport line discipline
#
# CONFIG_SENSORS_LIS3_I2C is not set

#
# Altera FPGA firmware download module
#
# CONFIG_ALTERA_STAPL is not set
# CONFIG_INTEL_MEI is not set
# CONFIG_INTEL_MEI_ME is not set
# CONFIG_INTEL_MEI_TXE is not set
# CONFIG_VMWARE_VMCI is not set

#
# Intel MIC Bus Driver
#
# CONFIG_INTEL_MIC_BUS is not set

#
# SCIF Bus Driver
#
# CONFIG_SCIF_BUS is not set

#
# VOP Bus Driver
#
# CONFIG_VOP_BUS is not set

#
# Intel MIC Host Driver
#

#
# Intel MIC Card Driver
#

#
# SCIF Driver
#

#
# Intel MIC Coprocessor State Management (COSM) Drivers
#

#
# VOP Driver
#
# CONFIG_GENWQE is not set
# CONFIG_ECHO is not set
# CONFIG_CXL_BASE is not set
# CONFIG_CXL_AFU_DRIVER_OPS is not set
# CONFIG_CXL_LIB is not set
CONFIG_HAVE_IDE=y
# CONFIG_IDE is not set

#
# SCSI device support
#
CONFIG_SCSI_MOD=y
CONFIG_RAID_ATTRS=y
CONFIG_SCSI=y
CONFIG_SCSI_DMA=y
# CONFIG_SCSI_NETLINK is not set
# CONFIG_SCSI_MQ_DEFAULT is not set
CONFIG_SCSI_PROC_FS=y

#
# SCSI support type (disk, tape, CD-ROM)
#
CONFIG_BLK_DEV_SD=y
# CONFIG_CHR_DEV_ST is not set
# CONFIG_CHR_DEV_OSST is not set
CONFIG_BLK_DEV_SR=y
CONFIG_BLK_DEV_SR_VENDOR=y
CONFIG_CHR_DEV_SG=y
# CONFIG_CHR_DEV_SCH is not set
CONFIG_SCSI_CONSTANTS=y
CONFIG_SCSI_LOGGING=y
CONFIG_SCSI_SCAN_ASYNC=y

#
# SCSI Transports
#
CONFIG_SCSI_SPI_ATTRS=y
# CONFIG_SCSI_FC_ATTRS is not set
# CONFIG_SCSI_ISCSI_ATTRS is not set
CONFIG_SCSI_SAS_ATTRS=y
# CONFIG_SCSI_SAS_LIBSAS is not set
# CONFIG_SCSI_SRP_ATTRS is not set
CONFIG_SCSI_LOWLEVEL=y
# CONFIG_ISCSI_TCP is not set
# CONFIG_ISCSI_BOOT_SYSFS is not set
# CONFIG_SCSI_CXGB3_ISCSI is not set
# CONFIG_SCSI_CXGB4_ISCSI is not set
# CONFIG_SCSI_BNX2_ISCSI is not set
# CONFIG_BE2ISCSI is not set
# CONFIG_BLK_DEV_3W_XXXX_RAID is not set
# CONFIG_SCSI_HPSA is not set
# CONFIG_SCSI_3W_9XXX is not set
# CONFIG_SCSI_3W_SAS is not set
# CONFIG_SCSI_ACARD is not set
# CONFIG_SCSI_AACRAID is not set
CONFIG_SCSI_AIC7XXX=y
CONFIG_AIC7XXX_CMDS_PER_DEVICE=32
CONFIG_AIC7XXX_RESET_DELAY_MS=5000
CONFIG_AIC7XXX_DEBUG_ENABLE=y
CONFIG_AIC7XXX_DEBUG_MASK=0
CONFIG_AIC7XXX_REG_PRETTY_PRINT=y
# CONFIG_SCSI_AIC79XX is not set
# CONFIG_SCSI_AIC94XX is not set
# CONFIG_SCSI_MVSAS is not set
# CONFIG_SCSI_MVUMI is not set
# CONFIG_SCSI_DPT_I2O is not set
# CONFIG_SCSI_ADVANSYS is not set
# CONFIG_SCSI_ARCMSR is not set
# CONFIG_SCSI_ESAS2R is not set
CONFIG_MEGARAID_NEWGEN=y
# CONFIG_MEGARAID_MM is not set
# CONFIG_MEGARAID_LEGACY is not set
# CONFIG_MEGARAID_SAS is not set
CONFIG_SCSI_MPT3SAS=y
CONFIG_SCSI_MPT2SAS_MAX_SGE=128
CONFIG_SCSI_MPT3SAS_MAX_SGE=128
CONFIG_SCSI_MPT2SAS=y
# CONFIG_SCSI_SMARTPQI is not set
# CONFIG_SCSI_UFSHCD is not set
# CONFIG_SCSI_HPTIOP is not set
# CONFIG_SCSI_BUSLOGIC is not set
# CONFIG_VMWARE_PVSCSI is not set
# CONFIG_SCSI_SNIC is not set
# CONFIG_SCSI_DMX3191D is not set
# CONFIG_SCSI_EATA is not set
# CONFIG_SCSI_FUTURE_DOMAIN is not set
# CONFIG_SCSI_GDTH is not set
# CONFIG_SCSI_ISCI is not set
# CONFIG_SCSI_IPS is not set
# CONFIG_SCSI_INITIO is not set
# CONFIG_SCSI_INIA100 is not set
# CONFIG_SCSI_STEX is not set
# CONFIG_SCSI_SYM53C8XX_2 is not set
# CONFIG_SCSI_IPR is not set
# CONFIG_SCSI_QLOGIC_1280 is not set
# CONFIG_SCSI_QLA_ISCSI is not set
# CONFIG_SCSI_DC395x is not set
# CONFIG_SCSI_AM53C974 is not set
# CONFIG_SCSI_WD719X is not set
# CONFIG_SCSI_DEBUG is not set
# CONFIG_SCSI_PMCRAID is not set
# CONFIG_SCSI_PM8001 is not set
# CONFIG_SCSI_VIRTIO is not set
# CONFIG_SCSI_LOWLEVEL_PCMCIA is not set
CONFIG_SCSI_DH=y
# CONFIG_SCSI_DH_RDAC is not set
# CONFIG_SCSI_DH_HP_SW is not set
# CONFIG_SCSI_DH_EMC is not set
# CONFIG_SCSI_DH_ALUA is not set
# CONFIG_SCSI_OSD_INITIATOR is not set
CONFIG_ATA=y
# CONFIG_ATA_NONSTANDARD is not set
CONFIG_ATA_VERBOSE_ERROR=y
CONFIG_ATA_ACPI=y
# CONFIG_SATA_ZPODD is not set
CONFIG_SATA_PMP=y

#
# Controllers with non-SFF native interface
#
CONFIG_SATA_AHCI=y
# CONFIG_SATA_AHCI_PLATFORM is not set
# CONFIG_SATA_INIC162X is not set
# CONFIG_SATA_ACARD_AHCI is not set
# CONFIG_SATA_SIL24 is not set
CONFIG_ATA_SFF=y

#
# SFF controllers with custom DMA interface
#
# CONFIG_PDC_ADMA is not set
# CONFIG_SATA_QSTOR is not set
# CONFIG_SATA_SX4 is not set
CONFIG_ATA_BMDMA=y

#
# SATA SFF controllers with BMDMA
#
CONFIG_ATA_PIIX=y
# CONFIG_SATA_DWC is not set
# CONFIG_SATA_MV is not set
CONFIG_SATA_NV=y
# CONFIG_SATA_PROMISE is not set
# CONFIG_SATA_SIL is not set
# CONFIG_SATA_SIS is not set
# CONFIG_SATA_SVW is not set
# CONFIG_SATA_ULI is not set
# CONFIG_SATA_VIA is not set
# CONFIG_SATA_VITESSE is not set

#
# PATA SFF controllers with BMDMA
#
# CONFIG_PATA_ALI is not set
CONFIG_PATA_AMD=y
# CONFIG_PATA_ARTOP is not set
CONFIG_PATA_ATIIXP=y
# CONFIG_PATA_ATP867X is not set
# CONFIG_PATA_CMD64X is not set
# CONFIG_PATA_CYPRESS is not set
# CONFIG_PATA_EFAR is not set
# CONFIG_PATA_HPT366 is not set
# CONFIG_PATA_HPT37X is not set
# CONFIG_PATA_HPT3X2N is not set
# CONFIG_PATA_HPT3X3 is not set
# CONFIG_PATA_IT8213 is not set
# CONFIG_PATA_IT821X is not set
CONFIG_PATA_JMICRON=y
# CONFIG_PATA_MARVELL is not set
# CONFIG_PATA_NETCELL is not set
# CONFIG_PATA_NINJA32 is not set
# CONFIG_PATA_NS87415 is not set
CONFIG_PATA_OLDPIIX=y
# CONFIG_PATA_OPTIDMA is not set
# CONFIG_PATA_PDC2027X is not set
# CONFIG_PATA_PDC_OLD is not set
# CONFIG_PATA_RADISYS is not set
# CONFIG_PATA_RDC is not set
# CONFIG_PATA_SCH is not set
# CONFIG_PATA_SERVERWORKS is not set
# CONFIG_PATA_SIL680 is not set
# CONFIG_PATA_SIS is not set
# CONFIG_PATA_TOSHIBA is not set
# CONFIG_PATA_TRIFLEX is not set
CONFIG_PATA_VIA=y
# CONFIG_PATA_WINBOND is not set

#
# PIO-only SFF controllers
#
# CONFIG_PATA_CMD640_PCI is not set
# CONFIG_PATA_MPIIX is not set
# CONFIG_PATA_NS87410 is not set
# CONFIG_PATA_OPTI is not set
# CONFIG_PATA_PCMCIA is not set
# CONFIG_PATA_RZ1000 is not set

#
# Generic fallback / legacy drivers
#
CONFIG_PATA_ACPI=y
CONFIG_ATA_GENERIC=y
# CONFIG_PATA_LEGACY is not set
CONFIG_MD=y
CONFIG_BLK_DEV_MD=y
CONFIG_MD_AUTODETECT=y
CONFIG_MD_LINEAR=y
CONFIG_MD_RAID0=y
CONFIG_MD_RAID1=y
CONFIG_MD_RAID10=y
CONFIG_MD_RAID456=y
CONFIG_MD_MULTIPATH=y
CONFIG_MD_FAULTY=y
CONFIG_BCACHE=y
# CONFIG_BCACHE_DEBUG is not set
# CONFIG_BCACHE_CLOSURES_DEBUG is not set
CONFIG_BLK_DEV_DM_BUILTIN=y
CONFIG_BLK_DEV_DM=y
# CONFIG_DM_MQ_DEFAULT is not set
CONFIG_DM_DEBUG=y
CONFIG_DM_BUFIO=y
# CONFIG_DM_DEBUG_BLOCK_MANAGER_LOCKING is not set
# CONFIG_DM_CRYPT is not set
CONFIG_DM_SNAPSHOT=y
# CONFIG_DM_THIN_PROVISIONING is not set
# CONFIG_DM_CACHE is not set
# CONFIG_DM_ERA is not set
CONFIG_DM_MIRROR=y
# CONFIG_DM_LOG_USERSPACE is not set
# CONFIG_DM_RAID is not set
CONFIG_DM_ZERO=y
# CONFIG_DM_MULTIPATH is not set
# CONFIG_DM_DELAY is not set
CONFIG_DM_UEVENT=y
# CONFIG_DM_FLAKEY is not set
# CONFIG_DM_VERITY is not set
# CONFIG_DM_SWITCH is not set
# CONFIG_DM_LOG_WRITES is not set
# CONFIG_DM_INTEGRITY is not set
# CONFIG_TARGET_CORE is not set
CONFIG_FUSION=y
# CONFIG_FUSION_SPI is not set
# CONFIG_FUSION_SAS is not set
CONFIG_FUSION_MAX_SGE=40
CONFIG_FUSION_LOGGING=y

#
# IEEE 1394 (FireWire) support
#
# CONFIG_FIREWIRE is not set
# CONFIG_FIREWIRE_NOSY is not set
CONFIG_MACINTOSH_DRIVERS=y
CONFIG_MAC_EMUMOUSEBTN=y
CONFIG_NETDEVICES=y
CONFIG_MII=y
CONFIG_NET_CORE=y
# CONFIG_BONDING is not set
# CONFIG_DUMMY is not set
# CONFIG_EQUALIZER is not set
CONFIG_NET_FC=y
# CONFIG_IFB is not set
# CONFIG_NET_TEAM is not set
# CONFIG_MACVLAN is not set
# CONFIG_VXLAN is not set
# CONFIG_MACSEC is not set
CONFIG_NETCONSOLE=y
CONFIG_NETCONSOLE_DYNAMIC=y
CONFIG_NETPOLL=y
CONFIG_NET_POLL_CONTROLLER=y
# CONFIG_TUN is not set
# CONFIG_TUN_VNET_CROSS_LE is not set
# CONFIG_VETH is not set
CONFIG_VIRTIO_NET=y
# CONFIG_NLMON is not set
# CONFIG_ARCNET is not set

#
# CAIF transport drivers
#

#
# Distributed Switch Architecture drivers
#
CONFIG_ETHERNET=y
CONFIG_MDIO=y
CONFIG_NET_VENDOR_3COM=y
# CONFIG_PCMCIA_3C574 is not set
# CONFIG_PCMCIA_3C589 is not set
CONFIG_VORTEX=y
# CONFIG_TYPHOON is not set
CONFIG_NET_VENDOR_ADAPTEC=y
# CONFIG_ADAPTEC_STARFIRE is not set
CONFIG_NET_VENDOR_AGERE=y
# CONFIG_ET131X is not set
CONFIG_NET_VENDOR_ALACRITECH=y
# CONFIG_SLICOSS is not set
CONFIG_NET_VENDOR_ALTEON=y
# CONFIG_ACENIC is not set
# CONFIG_ALTERA_TSE is not set
CONFIG_NET_VENDOR_AMAZON=y
# CONFIG_ENA_ETHERNET is not set
CONFIG_NET_VENDOR_AMD=y
# CONFIG_AMD8111_ETH is not set
# CONFIG_PCNET32 is not set
# CONFIG_PCMCIA_NMCLAN is not set
# CONFIG_AMD_XGBE is not set
# CONFIG_AMD_XGBE_HAVE_ECC is not set
CONFIG_NET_VENDOR_AQUANTIA=y
# CONFIG_AQTION is not set
CONFIG_NET_VENDOR_ARC=y
CONFIG_NET_VENDOR_ATHEROS=y
# CONFIG_ATL2 is not set
# CONFIG_ATL1 is not set
# CONFIG_ATL1E is not set
# CONFIG_ATL1C is not set
# CONFIG_ALX is not set
# CONFIG_NET_VENDOR_AURORA is not set
CONFIG_NET_CADENCE=y
# CONFIG_MACB is not set
CONFIG_NET_VENDOR_BROADCOM=y
# CONFIG_B44 is not set
# CONFIG_BNX2 is not set
# CONFIG_CNIC is not set
CONFIG_TIGON3=y
CONFIG_TIGON3_HWMON=y
# CONFIG_BNX2X is not set
# CONFIG_BNXT is not set
CONFIG_NET_VENDOR_BROCADE=y
# CONFIG_BNA is not set
CONFIG_NET_VENDOR_CAVIUM=y
# CONFIG_THUNDER_NIC_PF is not set
# CONFIG_THUNDER_NIC_VF is not set
# CONFIG_THUNDER_NIC_BGX is not set
# CONFIG_THUNDER_NIC_RGX is not set
# CONFIG_LIQUIDIO is not set
# CONFIG_LIQUIDIO_VF is not set
CONFIG_NET_VENDOR_CHELSIO=y
# CONFIG_CHELSIO_T1 is not set
# CONFIG_CHELSIO_T3 is not set
# CONFIG_CHELSIO_T4 is not set
# CONFIG_CHELSIO_T4VF is not set
CONFIG_NET_VENDOR_CISCO=y
# CONFIG_ENIC is not set
# CONFIG_CX_ECAT is not set
# CONFIG_DNET is not set
CONFIG_NET_VENDOR_DEC=y
CONFIG_NET_TULIP=y
# CONFIG_DE2104X is not set
# CONFIG_TULIP is not set
# CONFIG_DE4X5 is not set
# CONFIG_WINBOND_840 is not set
# CONFIG_DM9102 is not set
# CONFIG_ULI526X is not set
# CONFIG_PCMCIA_XIRCOM is not set
CONFIG_NET_VENDOR_DLINK=y
# CONFIG_DL2K is not set
# CONFIG_SUNDANCE is not set
CONFIG_NET_VENDOR_EMULEX=y
# CONFIG_BE2NET is not set
CONFIG_NET_VENDOR_EZCHIP=y
CONFIG_NET_VENDOR_EXAR=y
# CONFIG_S2IO is not set
# CONFIG_VXGE is not set
# CONFIG_NET_VENDOR_FUJITSU is not set
# CONFIG_NET_VENDOR_HP is not set
CONFIG_NET_VENDOR_HUAWEI=y
# CONFIG_HINIC is not set
CONFIG_NET_VENDOR_INTEL=y
CONFIG_E100=y
# CONFIG_E1000 is not set
CONFIG_E1000E=y
CONFIG_E1000E_HWTS=y
CONFIG_IGB=y
CONFIG_IGB_HWMON=y
CONFIG_IGB_DCA=y
# CONFIG_IGBVF is not set
CONFIG_IXGB=y
CONFIG_IXGBE=y
CONFIG_IXGBE_HWMON=y
CONFIG_IXGBE_DCA=y
# CONFIG_IXGBE_DCB is not set
# CONFIG_IXGBEVF is not set
# CONFIG_I40E is not set
# CONFIG_I40EVF is not set
# CONFIG_FM10K is not set
# CONFIG_NET_VENDOR_I825XX is not set
# CONFIG_JME is not set
CONFIG_NET_VENDOR_MARVELL=y
# CONFIG_MVMDIO is not set
CONFIG_SKGE=y
# CONFIG_SKGE_DEBUG is not set
CONFIG_SKGE_GENESIS=y
# CONFIG_SKY2 is not set
CONFIG_NET_VENDOR_MELLANOX=y
# CONFIG_MLX4_EN is not set
# CONFIG_MLX4_CORE is not set
# CONFIG_MLX5_CORE is not set
# CONFIG_MLXSW_CORE is not set
# CONFIG_MLXFW is not set
CONFIG_NET_VENDOR_MICREL=y
# CONFIG_KS8842 is not set
# CONFIG_KS8851_MLL is not set
# CONFIG_KSZ884X_PCI is not set
CONFIG_NET_VENDOR_MYRI=y
# CONFIG_MYRI10GE is not set
# CONFIG_FEALNX is not set
CONFIG_NET_VENDOR_NATSEMI=y
# CONFIG_NATSEMI is not set
# CONFIG_NS83820 is not set
CONFIG_NET_VENDOR_NETRONOME=y
# CONFIG_NFP is not set
CONFIG_NET_VENDOR_8390=y
# CONFIG_PCMCIA_AXNET is not set
# CONFIG_NE2K_PCI is not set
# CONFIG_PCMCIA_PCNET is not set
CONFIG_NET_VENDOR_NVIDIA=y
CONFIG_FORCEDETH=y
CONFIG_NET_VENDOR_OKI=y
# CONFIG_ETHOC is not set
CONFIG_NET_PACKET_ENGINE=y
# CONFIG_HAMACHI is not set
# CONFIG_YELLOWFIN is not set
CONFIG_NET_VENDOR_QLOGIC=y
# CONFIG_QLA3XXX is not set
# CONFIG_QLCNIC is not set
# CONFIG_QLGE is not set
# CONFIG_NETXEN_NIC is not set
# CONFIG_QED is not set
CONFIG_NET_VENDOR_QUALCOMM=y
# CONFIG_QCOM_EMAC is not set
# CONFIG_RMNET is not set
CONFIG_NET_VENDOR_REALTEK=y
# CONFIG_8139CP is not set
CONFIG_8139TOO=y
# CONFIG_8139TOO_PIO is not set
# CONFIG_8139TOO_TUNE_TWISTER is not set
CONFIG_8139TOO_8129=y
# CONFIG_8139_OLD_RX_RESET is not set
# CONFIG_R8169 is not set
CONFIG_NET_VENDOR_RENESAS=y
CONFIG_NET_VENDOR_RDC=y
# CONFIG_R6040 is not set
CONFIG_NET_VENDOR_ROCKER=y
CONFIG_NET_VENDOR_SAMSUNG=y
# CONFIG_SXGBE_ETH is not set
# CONFIG_NET_VENDOR_SEEQ is not set
CONFIG_NET_VENDOR_SILAN=y
# CONFIG_SC92031 is not set
CONFIG_NET_VENDOR_SIS=y
# CONFIG_SIS900 is not set
# CONFIG_SIS190 is not set
CONFIG_NET_VENDOR_SOLARFLARE=y
# CONFIG_SFC is not set
# CONFIG_SFC_FALCON is not set
CONFIG_NET_VENDOR_SMSC=y
# CONFIG_PCMCIA_SMC91C92 is not set
# CONFIG_EPIC100 is not set
# CONFIG_SMSC911X is not set
# CONFIG_SMSC9420 is not set
CONFIG_NET_VENDOR_STMICRO=y
# CONFIG_STMMAC_ETH is not set
CONFIG_NET_VENDOR_SUN=y
# CONFIG_HAPPYMEAL is not set
# CONFIG_SUNGEM is not set
# CONFIG_CASSINI is not set
# CONFIG_NIU is not set
CONFIG_NET_VENDOR_TEHUTI=y
# CONFIG_TEHUTI is not set
CONFIG_NET_VENDOR_TI=y
# CONFIG_TI_CPSW_ALE is not set
# CONFIG_TLAN is not set
CONFIG_NET_VENDOR_VIA=y
# CONFIG_VIA_RHINE is not set
# CONFIG_VIA_VELOCITY is not set
CONFIG_NET_VENDOR_WIZNET=y
# CONFIG_WIZNET_W5100 is not set
# CONFIG_WIZNET_W5300 is not set
CONFIG_NET_VENDOR_XIRCOM=y
# CONFIG_PCMCIA_XIRC2PS is not set
CONFIG_NET_VENDOR_SYNOPSYS=y
# CONFIG_DWC_XLGMAC is not set
# CONFIG_FDDI is not set
# CONFIG_HIPPI is not set
# CONFIG_NET_SB1000 is not set
CONFIG_MDIO_DEVICE=y
CONFIG_MDIO_BUS=y
# CONFIG_MDIO_BITBANG is not set
# CONFIG_MDIO_THUNDER is not set
CONFIG_PHYLIB=y
CONFIG_SWPHY=y
# CONFIG_LED_TRIGGER_PHY is not set

#
# MII PHY device drivers
#
# CONFIG_AMD_PHY is not set
# CONFIG_AQUANTIA_PHY is not set
# CONFIG_AT803X_PHY is not set
# CONFIG_BCM7XXX_PHY is not set
# CONFIG_BCM87XX_PHY is not set
# CONFIG_BROADCOM_PHY is not set
# CONFIG_CICADA_PHY is not set
# CONFIG_CORTINA_PHY is not set
# CONFIG_DAVICOM_PHY is not set
# CONFIG_DP83848_PHY is not set
# CONFIG_DP83867_PHY is not set
CONFIG_FIXED_PHY=y
# CONFIG_ICPLUS_PHY is not set
# CONFIG_INTEL_XWAY_PHY is not set
# CONFIG_LSI_ET1011C_PHY is not set
# CONFIG_LXT_PHY is not set
# CONFIG_MARVELL_PHY is not set
# CONFIG_MARVELL_10G_PHY is not set
# CONFIG_MICREL_PHY is not set
# CONFIG_MICROCHIP_PHY is not set
# CONFIG_MICROSEMI_PHY is not set
# CONFIG_NATIONAL_PHY is not set
# CONFIG_QSEMI_PHY is not set
# CONFIG_REALTEK_PHY is not set
# CONFIG_ROCKCHIP_PHY is not set
# CONFIG_SMSC_PHY is not set
# CONFIG_STE10XP is not set
# CONFIG_TERANETICS_PHY is not set
# CONFIG_VITESSE_PHY is not set
# CONFIG_XILINX_GMII2RGMII is not set
# CONFIG_PPP is not set
# CONFIG_SLIP is not set
CONFIG_USB_NET_DRIVERS=y
# CONFIG_USB_CATC is not set
# CONFIG_USB_KAWETH is not set
# CONFIG_USB_PEGASUS is not set
# CONFIG_USB_RTL8150 is not set
# CONFIG_USB_RTL8152 is not set
# CONFIG_USB_LAN78XX is not set
# CONFIG_USB_USBNET is not set
# CONFIG_USB_HSO is not set
# CONFIG_USB_IPHETH is not set
CONFIG_WLAN=y
CONFIG_WLAN_VENDOR_ADMTEK=y
CONFIG_WLAN_VENDOR_ATH=y
# CONFIG_ATH_DEBUG is not set
# CONFIG_ATH5K_PCI is not set
CONFIG_WLAN_VENDOR_ATMEL=y
CONFIG_WLAN_VENDOR_BROADCOM=y
CONFIG_WLAN_VENDOR_CISCO=y
CONFIG_WLAN_VENDOR_INTEL=y
CONFIG_WLAN_VENDOR_INTERSIL=y
# CONFIG_HOSTAP is not set
# CONFIG_PRISM54 is not set
CONFIG_WLAN_VENDOR_MARVELL=y
CONFIG_WLAN_VENDOR_MEDIATEK=y
CONFIG_WLAN_VENDOR_RALINK=y
CONFIG_WLAN_VENDOR_REALTEK=y
CONFIG_WLAN_VENDOR_RSI=y
CONFIG_WLAN_VENDOR_ST=y
CONFIG_WLAN_VENDOR_TI=y
CONFIG_WLAN_VENDOR_ZYDAS=y
CONFIG_WLAN_VENDOR_QUANTENNA=y
# CONFIG_PCMCIA_RAYCS is not set

#
# Enable WiMAX (Networking options) to see the WiMAX drivers
#
# CONFIG_WAN is not set
# CONFIG_VMXNET3 is not set
# CONFIG_FUJITSU_ES is not set
CONFIG_ISDN=y
# CONFIG_ISDN_I4L is not set
# CONFIG_ISDN_CAPI is not set
# CONFIG_ISDN_DRV_GIGASET is not set
# CONFIG_HYSDN is not set
# CONFIG_MISDN is not set
# CONFIG_NVM is not set

#
# Input device support
#
CONFIG_INPUT=y
CONFIG_INPUT_LEDS=y
CONFIG_INPUT_FF_MEMLESS=y
CONFIG_INPUT_POLLDEV=y
# CONFIG_INPUT_SPARSEKMAP is not set
# CONFIG_INPUT_MATRIXKMAP is not set

#
# Userland interfaces
#
CONFIG_INPUT_MOUSEDEV=y
# CONFIG_INPUT_MOUSEDEV_PSAUX is not set
CONFIG_INPUT_MOUSEDEV_SCREEN_X=1024
CONFIG_INPUT_MOUSEDEV_SCREEN_Y=768
# CONFIG_INPUT_JOYDEV is not set
CONFIG_INPUT_EVDEV=y
# CONFIG_INPUT_EVBUG is not set

#
# Input Device Drivers
#
CONFIG_INPUT_KEYBOARD=y
# CONFIG_KEYBOARD_ADP5588 is not set
# CONFIG_KEYBOARD_ADP5589 is not set
CONFIG_KEYBOARD_ATKBD=y
# CONFIG_KEYBOARD_QT1070 is not set
# CONFIG_KEYBOARD_QT2160 is not set
# CONFIG_KEYBOARD_DLINK_DIR685 is not set
# CONFIG_KEYBOARD_LKKBD is not set
# CONFIG_KEYBOARD_TCA6416 is not set
# CONFIG_KEYBOARD_TCA8418 is not set
# CONFIG_KEYBOARD_LM8323 is not set
# CONFIG_KEYBOARD_LM8333 is not set
# CONFIG_KEYBOARD_MAX7359 is not set
# CONFIG_KEYBOARD_MCS is not set
# CONFIG_KEYBOARD_MPR121 is not set
# CONFIG_KEYBOARD_NEWTON is not set
# CONFIG_KEYBOARD_OPENCORES is not set
# CONFIG_KEYBOARD_SAMSUNG is not set
# CONFIG_KEYBOARD_STOWAWAY is not set
# CONFIG_KEYBOARD_SUNKBD is not set
# CONFIG_KEYBOARD_TM2_TOUCHKEY is not set
# CONFIG_KEYBOARD_XTKBD is not set
CONFIG_INPUT_MOUSE=y
CONFIG_MOUSE_PS2=y
CONFIG_MOUSE_PS2_ALPS=y
CONFIG_MOUSE_PS2_BYD=y
CONFIG_MOUSE_PS2_LOGIPS2PP=y
CONFIG_MOUSE_PS2_SYNAPTICS=y
CONFIG_MOUSE_PS2_SYNAPTICS_SMBUS=y
CONFIG_MOUSE_PS2_CYPRESS=y
CONFIG_MOUSE_PS2_LIFEBOOK=y
CONFIG_MOUSE_PS2_TRACKPOINT=y
CONFIG_MOUSE_PS2_ELANTECH=y
CONFIG_MOUSE_PS2_SENTELIC=y
# CONFIG_MOUSE_PS2_TOUCHKIT is not set
CONFIG_MOUSE_PS2_FOCALTECH=y
CONFIG_MOUSE_PS2_SMBUS=y
# CONFIG_MOUSE_SERIAL is not set
# CONFIG_MOUSE_APPLETOUCH is not set
# CONFIG_MOUSE_BCM5974 is not set
# CONFIG_MOUSE_CYAPA is not set
# CONFIG_MOUSE_ELAN_I2C is not set
# CONFIG_MOUSE_VSXXXAA is not set
# CONFIG_MOUSE_SYNAPTICS_I2C is not set
# CONFIG_MOUSE_SYNAPTICS_USB is not set
CONFIG_INPUT_JOYSTICK=y
# CONFIG_JOYSTICK_ANALOG is not set
# CONFIG_JOYSTICK_A3D is not set
# CONFIG_JOYSTICK_ADI is not set
# CONFIG_JOYSTICK_COBRA is not set
# CONFIG_JOYSTICK_GF2K is not set
# CONFIG_JOYSTICK_GRIP is not set
# CONFIG_JOYSTICK_GRIP_MP is not set
# CONFIG_JOYSTICK_GUILLEMOT is not set
# CONFIG_JOYSTICK_INTERACT is not set
# CONFIG_JOYSTICK_SIDEWINDER is not set
# CONFIG_JOYSTICK_TMDC is not set
# CONFIG_JOYSTICK_IFORCE is not set
# CONFIG_JOYSTICK_WARRIOR is not set
# CONFIG_JOYSTICK_MAGELLAN is not set
# CONFIG_JOYSTICK_SPACEORB is not set
# CONFIG_JOYSTICK_SPACEBALL is not set
# CONFIG_JOYSTICK_STINGER is not set
# CONFIG_JOYSTICK_TWIDJOY is not set
# CONFIG_JOYSTICK_ZHENHUA is not set
# CONFIG_JOYSTICK_AS5011 is not set
# CONFIG_JOYSTICK_JOYDUMP is not set
# CONFIG_JOYSTICK_XPAD is not set
CONFIG_INPUT_TABLET=y
# CONFIG_TABLET_USB_ACECAD is not set
# CONFIG_TABLET_USB_AIPTEK is not set
# CONFIG_TABLET_USB_GTCO is not set
# CONFIG_TABLET_USB_HANWANG is not set
# CONFIG_TABLET_USB_KBTAB is not set
# CONFIG_TABLET_USB_PEGASUS is not set
# CONFIG_TABLET_SERIAL_WACOM4 is not set
CONFIG_INPUT_TOUCHSCREEN=y
CONFIG_TOUCHSCREEN_PROPERTIES=y
# CONFIG_TOUCHSCREEN_AD7879 is not set
# CONFIG_TOUCHSCREEN_ATMEL_MXT is not set
# CONFIG_TOUCHSCREEN_BU21013 is not set
# CONFIG_TOUCHSCREEN_CYTTSP_CORE is not set
# CONFIG_TOUCHSCREEN_CYTTSP4_CORE is not set
# CONFIG_TOUCHSCREEN_DYNAPRO is not set
# CONFIG_TOUCHSCREEN_HAMPSHIRE is not set
# CONFIG_TOUCHSCREEN_EETI is not set
# CONFIG_TOUCHSCREEN_EGALAX_SERIAL is not set
# CONFIG_TOUCHSCREEN_FUJITSU is not set
# CONFIG_TOUCHSCREEN_ILI210X is not set
# CONFIG_TOUCHSCREEN_GUNZE is not set
# CONFIG_TOUCHSCREEN_EKTF2127 is not set
# CONFIG_TOUCHSCREEN_ELAN is not set
# CONFIG_TOUCHSCREEN_ELO is not set
# CONFIG_TOUCHSCREEN_WACOM_W8001 is not set
# CONFIG_TOUCHSCREEN_WACOM_I2C is not set
# CONFIG_TOUCHSCREEN_MAX11801 is not set
# CONFIG_TOUCHSCREEN_MCS5000 is not set
# CONFIG_TOUCHSCREEN_MMS114 is not set
# CONFIG_TOUCHSCREEN_MELFAS_MIP4 is not set
# CONFIG_TOUCHSCREEN_MTOUCH is not set
# CONFIG_TOUCHSCREEN_INEXIO is not set
# CONFIG_TOUCHSCREEN_MK712 is not set
# CONFIG_TOUCHSCREEN_PENMOUNT is not set
# CONFIG_TOUCHSCREEN_EDT_FT5X06 is not set
# CONFIG_TOUCHSCREEN_TOUCHRIGHT is not set
# CONFIG_TOUCHSCREEN_TOUCHWIN is not set
# CONFIG_TOUCHSCREEN_PIXCIR is not set
# CONFIG_TOUCHSCREEN_WDT87XX_I2C is not set
# CONFIG_TOUCHSCREEN_USB_COMPOSITE is not set
# CONFIG_TOUCHSCREEN_TOUCHIT213 is not set
# CONFIG_TOUCHSCREEN_TSC_SERIO is not set
# CONFIG_TOUCHSCREEN_TSC2004 is not set
# CONFIG_TOUCHSCREEN_TSC2007 is not set
# CONFIG_TOUCHSCREEN_SILEAD is not set
# CONFIG_TOUCHSCREEN_ST1232 is not set
# CONFIG_TOUCHSCREEN_STMFTS is not set
# CONFIG_TOUCHSCREEN_SX8654 is not set
# CONFIG_TOUCHSCREEN_TPS6507X is not set
# CONFIG_TOUCHSCREEN_ZET6223 is not set
# CONFIG_TOUCHSCREEN_ROHM_BU21023 is not set
CONFIG_INPUT_MISC=y
# CONFIG_INPUT_AD714X is not set
# CONFIG_INPUT_BMA150 is not set
# CONFIG_INPUT_E3X0_BUTTON is not set
# CONFIG_INPUT_PCSPKR is not set
# CONFIG_INPUT_MMA8450 is not set
# CONFIG_INPUT_APANEL is not set
# CONFIG_INPUT_ATLAS_BTNS is not set
# CONFIG_INPUT_ATI_REMOTE2 is not set
# CONFIG_INPUT_KEYSPAN_REMOTE is not set
# CONFIG_INPUT_KXTJ9 is not set
# CONFIG_INPUT_POWERMATE is not set
# CONFIG_INPUT_YEALINK is not set
# CONFIG_INPUT_CM109 is not set
CONFIG_INPUT_UINPUT=y
# CONFIG_INPUT_PCF8574 is not set
# CONFIG_INPUT_ADXL34X is not set
# CONFIG_INPUT_IMS_PCU is not set
# CONFIG_INPUT_CMA3000 is not set
# CONFIG_INPUT_IDEAPAD_SLIDEBAR is not set
# CONFIG_INPUT_DRV2665_HAPTICS is not set
# CONFIG_INPUT_DRV2667_HAPTICS is not set
# CONFIG_RMI4_CORE is not set

#
# Hardware I/O ports
#
CONFIG_SERIO=y
CONFIG_ARCH_MIGHT_HAVE_PC_SERIO=y
CONFIG_SERIO_I8042=y
CONFIG_SERIO_SERPORT=y
# CONFIG_SERIO_CT82C710 is not set
# CONFIG_SERIO_PCIPS2 is not set
CONFIG_SERIO_LIBPS2=y
# CONFIG_SERIO_RAW is not set
# CONFIG_SERIO_ALTERA_PS2 is not set
# CONFIG_SERIO_PS2MULT is not set
# CONFIG_SERIO_ARC_PS2 is not set
# CONFIG_USERIO is not set
# CONFIG_GAMEPORT is not set

#
# Character devices
#
CONFIG_TTY=y
CONFIG_VT=y
CONFIG_CONSOLE_TRANSLATIONS=y
CONFIG_VT_CONSOLE=y
CONFIG_VT_CONSOLE_SLEEP=y
CONFIG_HW_CONSOLE=y
CONFIG_VT_HW_CONSOLE_BINDING=y
CONFIG_UNIX98_PTYS=y
CONFIG_LEGACY_PTYS=y
CONFIG_LEGACY_PTY_COUNT=256
CONFIG_SERIAL_NONSTANDARD=y
# CONFIG_ROCKETPORT is not set
# CONFIG_CYCLADES is not set
# CONFIG_MOXA_INTELLIO is not set
# CONFIG_MOXA_SMARTIO is not set
# CONFIG_SYNCLINK is not set
# CONFIG_SYNCLINKMP is not set
# CONFIG_SYNCLINK_GT is not set
# CONFIG_NOZOMI is not set
# CONFIG_ISI is not set
# CONFIG_N_HDLC is not set
# CONFIG_N_GSM is not set
# CONFIG_TRACE_SINK is not set
CONFIG_DEVMEM=y
# CONFIG_DEVKMEM is not set

#
# Serial drivers
#
CONFIG_SERIAL_EARLYCON=y
CONFIG_SERIAL_8250=y
CONFIG_SERIAL_8250_DEPRECATED_OPTIONS=y
CONFIG_SERIAL_8250_PNP=y
# CONFIG_SERIAL_8250_FINTEK is not set
CONFIG_SERIAL_8250_CONSOLE=y
CONFIG_SERIAL_8250_DMA=y
CONFIG_SERIAL_8250_PCI=y
CONFIG_SERIAL_8250_EXAR=y
# CONFIG_SERIAL_8250_CS is not set
CONFIG_SERIAL_8250_NR_UARTS=32
CONFIG_SERIAL_8250_RUNTIME_UARTS=4
CONFIG_SERIAL_8250_EXTENDED=y
CONFIG_SERIAL_8250_MANY_PORTS=y
CONFIG_SERIAL_8250_SHARE_IRQ=y
# CONFIG_SERIAL_8250_DETECT_IRQ is not set
CONFIG_SERIAL_8250_RSA=y
# CONFIG_SERIAL_8250_FSL is not set
# CONFIG_SERIAL_8250_DW is not set
# CONFIG_SERIAL_8250_RT288X is not set
CONFIG_SERIAL_8250_LPSS=y
CONFIG_SERIAL_8250_MID=y
# CONFIG_SERIAL_8250_MOXA is not set

#
# Non-8250 serial port support
#
# CONFIG_SERIAL_KGDB_NMI is not set
# CONFIG_SERIAL_UARTLITE is not set
CONFIG_SERIAL_CORE=y
CONFIG_SERIAL_CORE_CONSOLE=y
CONFIG_CONSOLE_POLL=y
# CONFIG_SERIAL_JSM is not set
# CONFIG_SERIAL_SCCNXP is not set
# CONFIG_SERIAL_SC16IS7XX is not set
# CONFIG_SERIAL_ALTERA_JTAGUART is not set
# CONFIG_SERIAL_ALTERA_UART is not set
# CONFIG_SERIAL_ARC is not set
# CONFIG_SERIAL_RP2 is not set
# CONFIG_SERIAL_FSL_LPUART is not set
# CONFIG_SERIAL_DEV_BUS is not set
CONFIG_HVC_DRIVER=y
CONFIG_VIRTIO_CONSOLE=y
# CONFIG_IPMI_HANDLER is not set
CONFIG_HW_RANDOM=y
# CONFIG_HW_RANDOM_TIMERIOMEM is not set
# CONFIG_HW_RANDOM_INTEL is not set
# CONFIG_HW_RANDOM_AMD is not set
# CONFIG_HW_RANDOM_VIA is not set
# CONFIG_HW_RANDOM_VIRTIO is not set
CONFIG_NVRAM=y
# CONFIG_R3964 is not set
# CONFIG_APPLICOM is not set

#
# PCMCIA character devices
#
# CONFIG_SYNCLINK_CS is not set
# CONFIG_CARDMAN_4000 is not set
# CONFIG_CARDMAN_4040 is not set
# CONFIG_SCR24X is not set
# CONFIG_IPWIRELESS is not set
# CONFIG_MWAVE is not set
CONFIG_RAW_DRIVER=y
CONFIG_MAX_RAW_DEVS=8192
CONFIG_HPET=y
# CONFIG_HPET_MMAP is not set
# CONFIG_HANGCHECK_TIMER is not set
# CONFIG_TCG_TPM is not set
# CONFIG_TELCLOCK is not set
CONFIG_DEVPORT=y
# CONFIG_XILLYBUS is not set

#
# I2C support
#
CONFIG_I2C=y
CONFIG_ACPI_I2C_OPREGION=y
CONFIG_I2C_BOARDINFO=y
CONFIG_I2C_COMPAT=y
# CONFIG_I2C_CHARDEV is not set
# CONFIG_I2C_MUX is not set
CONFIG_I2C_HELPER_AUTO=y
CONFIG_I2C_ALGOBIT=y

#
# I2C Hardware Bus support
#

#
# PC SMBus host controller drivers
#
# CONFIG_I2C_ALI1535 is not set
# CONFIG_I2C_ALI1563 is not set
# CONFIG_I2C_ALI15X3 is not set
# CONFIG_I2C_AMD756 is not set
# CONFIG_I2C_AMD8111 is not set
# CONFIG_I2C_I801 is not set
# CONFIG_I2C_ISCH is not set
# CONFIG_I2C_ISMT is not set
CONFIG_I2C_PIIX4=y
# CONFIG_I2C_NFORCE2 is not set
# CONFIG_I2C_SIS5595 is not set
# CONFIG_I2C_SIS630 is not set
# CONFIG_I2C_SIS96X is not set
# CONFIG_I2C_VIA is not set
# CONFIG_I2C_VIAPRO is not set

#
# ACPI drivers
#
# CONFIG_I2C_SCMI is not set

#
# I2C system bus drivers (mostly embedded / system-on-chip)
#
# CONFIG_I2C_DESIGNWARE_PLATFORM is not set
# CONFIG_I2C_DESIGNWARE_PCI is not set
# CONFIG_I2C_EMEV2 is not set
# CONFIG_I2C_OCORES is not set
# CONFIG_I2C_PCA_PLATFORM is not set
# CONFIG_I2C_PXA_PCI is not set
# CONFIG_I2C_SIMTEC is not set
# CONFIG_I2C_XILINX is not set

#
# External I2C/SMBus adapter drivers
#
# CONFIG_I2C_DIOLAN_U2C is not set
# CONFIG_I2C_PARPORT_LIGHT is not set
# CONFIG_I2C_ROBOTFUZZ_OSIF is not set
# CONFIG_I2C_TAOS_EVM is not set
# CONFIG_I2C_TINY_USB is not set

#
# Other I2C/SMBus bus drivers
#
# CONFIG_I2C_MLXCPLD is not set
# CONFIG_I2C_STUB is not set
# CONFIG_I2C_SLAVE is not set
# CONFIG_I2C_DEBUG_CORE is not set
# CONFIG_I2C_DEBUG_ALGO is not set
# CONFIG_I2C_DEBUG_BUS is not set
# CONFIG_SPI is not set
# CONFIG_SPMI is not set
# CONFIG_HSI is not set
CONFIG_PPS=y
# CONFIG_PPS_DEBUG is not set

#
# PPS clients support
#
# CONFIG_PPS_CLIENT_KTIMER is not set
# CONFIG_PPS_CLIENT_LDISC is not set
# CONFIG_PPS_CLIENT_GPIO is not set

#
# PPS generators support
#

#
# PTP clock support
#
CONFIG_PTP_1588_CLOCK=y

#
# Enable PHYLIB and NETWORK_PHY_TIMESTAMPING to see the additional clocks.
#
# CONFIG_GPIOLIB is not set
# CONFIG_W1 is not set
# CONFIG_POWER_AVS is not set
# CONFIG_POWER_RESET is not set
CONFIG_POWER_SUPPLY=y
# CONFIG_POWER_SUPPLY_DEBUG is not set
# CONFIG_PDA_POWER is not set
# CONFIG_TEST_POWER is not set
# CONFIG_BATTERY_DS2780 is not set
# CONFIG_BATTERY_DS2781 is not set
# CONFIG_BATTERY_DS2782 is not set
# CONFIG_BATTERY_SBS is not set
# CONFIG_CHARGER_SBS is not set
# CONFIG_BATTERY_BQ27XXX is not set
# CONFIG_BATTERY_MAX17040 is not set
# CONFIG_BATTERY_MAX17042 is not set
# CONFIG_CHARGER_MAX8903 is not set
# CONFIG_CHARGER_LP8727 is not set
# CONFIG_CHARGER_BQ2415X is not set
# CONFIG_CHARGER_SMB347 is not set
# CONFIG_BATTERY_GAUGE_LTC2941 is not set
CONFIG_HWMON=y
# CONFIG_HWMON_VID is not set
# CONFIG_HWMON_DEBUG_CHIP is not set

#
# Native drivers
#
# CONFIG_SENSORS_ABITUGURU is not set
# CONFIG_SENSORS_ABITUGURU3 is not set
# CONFIG_SENSORS_AD7414 is not set
# CONFIG_SENSORS_AD7418 is not set
# CONFIG_SENSORS_ADM1021 is not set
# CONFIG_SENSORS_ADM1025 is not set
# CONFIG_SENSORS_ADM1026 is not set
# CONFIG_SENSORS_ADM1029 is not set
# CONFIG_SENSORS_ADM1031 is not set
# CONFIG_SENSORS_ADM9240 is not set
# CONFIG_SENSORS_ADT7410 is not set
# CONFIG_SENSORS_ADT7411 is not set
# CONFIG_SENSORS_ADT7462 is not set
# CONFIG_SENSORS_ADT7470 is not set
# CONFIG_SENSORS_ADT7475 is not set
# CONFIG_SENSORS_ASC7621 is not set
# CONFIG_SENSORS_K8TEMP is not set
CONFIG_SENSORS_K10TEMP=y
CONFIG_SENSORS_FAM15H_POWER=y
# CONFIG_SENSORS_APPLESMC is not set
# CONFIG_SENSORS_ASB100 is not set
# CONFIG_SENSORS_ASPEED is not set
# CONFIG_SENSORS_ATXP1 is not set
# CONFIG_SENSORS_DS620 is not set
# CONFIG_SENSORS_DS1621 is not set
# CONFIG_SENSORS_DELL_SMM is not set
# CONFIG_SENSORS_I5K_AMB is not set
# CONFIG_SENSORS_F71805F is not set
# CONFIG_SENSORS_F71882FG is not set
# CONFIG_SENSORS_F75375S is not set
# CONFIG_SENSORS_FSCHMD is not set
# CONFIG_SENSORS_FTSTEUTATES is not set
# CONFIG_SENSORS_GL518SM is not set
# CONFIG_SENSORS_GL520SM is not set
# CONFIG_SENSORS_G760A is not set
# CONFIG_SENSORS_G762 is not set
# CONFIG_SENSORS_HIH6130 is not set
# CONFIG_SENSORS_I5500 is not set
# CONFIG_SENSORS_CORETEMP is not set
# CONFIG_SENSORS_IT87 is not set
# CONFIG_SENSORS_JC42 is not set
# CONFIG_SENSORS_POWR1220 is not set
# CONFIG_SENSORS_LINEAGE is not set
# CONFIG_SENSORS_LTC2945 is not set
# CONFIG_SENSORS_LTC2990 is not set
# CONFIG_SENSORS_LTC4151 is not set
# CONFIG_SENSORS_LTC4215 is not set
# CONFIG_SENSORS_LTC4222 is not set
# CONFIG_SENSORS_LTC4245 is not set
# CONFIG_SENSORS_LTC4260 is not set
# CONFIG_SENSORS_LTC4261 is not set
# CONFIG_SENSORS_MAX16065 is not set
# CONFIG_SENSORS_MAX1619 is not set
# CONFIG_SENSORS_MAX1668 is not set
# CONFIG_SENSORS_MAX197 is not set
# CONFIG_SENSORS_MAX6621 is not set
# CONFIG_SENSORS_MAX6639 is not set
# CONFIG_SENSORS_MAX6642 is not set
# CONFIG_SENSORS_MAX6650 is not set
# CONFIG_SENSORS_MAX6697 is not set
# CONFIG_SENSORS_MAX31790 is not set
# CONFIG_SENSORS_MCP3021 is not set
# CONFIG_SENSORS_TC654 is not set
# CONFIG_SENSORS_LM63 is not set
# CONFIG_SENSORS_LM73 is not set
# CONFIG_SENSORS_LM75 is not set
# CONFIG_SENSORS_LM77 is not set
# CONFIG_SENSORS_LM78 is not set
# CONFIG_SENSORS_LM80 is not set
# CONFIG_SENSORS_LM83 is not set
# CONFIG_SENSORS_LM85 is not set
# CONFIG_SENSORS_LM87 is not set
# CONFIG_SENSORS_LM90 is not set
# CONFIG_SENSORS_LM92 is not set
# CONFIG_SENSORS_LM93 is not set
# CONFIG_SENSORS_LM95234 is not set
# CONFIG_SENSORS_LM95241 is not set
# CONFIG_SENSORS_LM95245 is not set
# CONFIG_SENSORS_PC87360 is not set
# CONFIG_SENSORS_PC87427 is not set
# CONFIG_SENSORS_NTC_THERMISTOR is not set
# CONFIG_SENSORS_NCT6683 is not set
# CONFIG_SENSORS_NCT6775 is not set
# CONFIG_SENSORS_NCT7802 is not set
# CONFIG_SENSORS_NCT7904 is not set
# CONFIG_SENSORS_PCF8591 is not set
# CONFIG_PMBUS is not set
# CONFIG_SENSORS_SHT21 is not set
# CONFIG_SENSORS_SHT3x is not set
# CONFIG_SENSORS_SHTC1 is not set
# CONFIG_SENSORS_SIS5595 is not set
# CONFIG_SENSORS_DME1737 is not set
# CONFIG_SENSORS_EMC1403 is not set
# CONFIG_SENSORS_EMC2103 is not set
# CONFIG_SENSORS_EMC6W201 is not set
# CONFIG_SENSORS_SMSC47M1 is not set
# CONFIG_SENSORS_SMSC47M192 is not set
# CONFIG_SENSORS_SMSC47B397 is not set
# CONFIG_SENSORS_SCH56XX_COMMON is not set
# CONFIG_SENSORS_SCH5627 is not set
# CONFIG_SENSORS_SCH5636 is not set
# CONFIG_SENSORS_STTS751 is not set
# CONFIG_SENSORS_SMM665 is not set
# CONFIG_SENSORS_ADC128D818 is not set
# CONFIG_SENSORS_ADS1015 is not set
# CONFIG_SENSORS_ADS7828 is not set
# CONFIG_SENSORS_AMC6821 is not set
# CONFIG_SENSORS_INA209 is not set
# CONFIG_SENSORS_INA2XX is not set
# CONFIG_SENSORS_INA3221 is not set
# CONFIG_SENSORS_TC74 is not set
# CONFIG_SENSORS_THMC50 is not set
# CONFIG_SENSORS_TMP102 is not set
# CONFIG_SENSORS_TMP103 is not set
# CONFIG_SENSORS_TMP108 is not set
# CONFIG_SENSORS_TMP401 is not set
# CONFIG_SENSORS_TMP421 is not set
# CONFIG_SENSORS_VIA_CPUTEMP is not set
# CONFIG_SENSORS_VIA686A is not set
# CONFIG_SENSORS_VT1211 is not set
# CONFIG_SENSORS_VT8231 is not set
# CONFIG_SENSORS_W83781D is not set
# CONFIG_SENSORS_W83791D is not set
# CONFIG_SENSORS_W83792D is not set
# CONFIG_SENSORS_W83793 is not set
# CONFIG_SENSORS_W83795 is not set
# CONFIG_SENSORS_W83L785TS is not set
# CONFIG_SENSORS_W83L786NG is not set
# CONFIG_SENSORS_W83627HF is not set
# CONFIG_SENSORS_W83627EHF is not set
# CONFIG_SENSORS_XGENE is not set

#
# ACPI drivers
#
# CONFIG_SENSORS_ACPI_POWER is not set
# CONFIG_SENSORS_ATK0110 is not set
CONFIG_THERMAL=y
CONFIG_THERMAL_EMERGENCY_POWEROFF_DELAY_MS=0
CONFIG_THERMAL_HWMON=y
CONFIG_THERMAL_WRITABLE_TRIPS=y
CONFIG_THERMAL_DEFAULT_GOV_STEP_WISE=y
# CONFIG_THERMAL_DEFAULT_GOV_FAIR_SHARE is not set
# CONFIG_THERMAL_DEFAULT_GOV_USER_SPACE is not set
# CONFIG_THERMAL_DEFAULT_GOV_POWER_ALLOCATOR is not set
# CONFIG_THERMAL_GOV_FAIR_SHARE is not set
CONFIG_THERMAL_GOV_STEP_WISE=y
# CONFIG_THERMAL_GOV_BANG_BANG is not set
CONFIG_THERMAL_GOV_USER_SPACE=y
# CONFIG_THERMAL_GOV_POWER_ALLOCATOR is not set
# CONFIG_THERMAL_EMULATION is not set
# CONFIG_INTEL_POWERCLAMP is not set
CONFIG_X86_PKG_TEMP_THERMAL=y
# CONFIG_INTEL_SOC_DTS_THERMAL is not set

#
# ACPI INT340X thermal drivers
#
# CONFIG_INT340X_THERMAL is not set
# CONFIG_INTEL_PCH_THERMAL is not set
CONFIG_WATCHDOG=y
CONFIG_WATCHDOG_CORE=y
# CONFIG_WATCHDOG_NOWAYOUT is not set
CONFIG_WATCHDOG_HANDLE_BOOT_ENABLED=y
# CONFIG_WATCHDOG_SYSFS is not set

#
# Watchdog Device Drivers
#
# CONFIG_SOFT_WATCHDOG is not set
# CONFIG_WDAT_WDT is not set
# CONFIG_XILINX_WATCHDOG is not set
# CONFIG_ZIIRAVE_WATCHDOG is not set
# CONFIG_CADENCE_WATCHDOG is not set
# CONFIG_DW_WATCHDOG is not set
# CONFIG_MAX63XX_WATCHDOG is not set
# CONFIG_ACQUIRE_WDT is not set
# CONFIG_ADVANTECH_WDT is not set
# CONFIG_ALIM1535_WDT is not set
# CONFIG_ALIM7101_WDT is not set
# CONFIG_F71808E_WDT is not set
CONFIG_SP5100_TCO=y
# CONFIG_SBC_FITPC2_WATCHDOG is not set
# CONFIG_EUROTECH_WDT is not set
# CONFIG_IB700_WDT is not set
# CONFIG_IBMASR is not set
# CONFIG_WAFER_WDT is not set
# CONFIG_I6300ESB_WDT is not set
# CONFIG_IE6XX_WDT is not set
# CONFIG_ITCO_WDT is not set
# CONFIG_IT8712F_WDT is not set
# CONFIG_IT87_WDT is not set
# CONFIG_HP_WATCHDOG is not set
# CONFIG_SC1200_WDT is not set
# CONFIG_PC87413_WDT is not set
# CONFIG_NV_TCO is not set
# CONFIG_60XX_WDT is not set
# CONFIG_CPU5_WDT is not set
# CONFIG_SMSC_SCH311X_WDT is not set
# CONFIG_SMSC37B787_WDT is not set
# CONFIG_VIA_WDT is not set
# CONFIG_W83627HF_WDT is not set
# CONFIG_W83877F_WDT is not set
# CONFIG_W83977F_WDT is not set
# CONFIG_MACHZ_WDT is not set
# CONFIG_SBC_EPX_C3_WATCHDOG is not set
# CONFIG_NI903X_WDT is not set
# CONFIG_NIC7018_WDT is not set

#
# PCI-based Watchdog Cards
#
# CONFIG_PCIPCWATCHDOG is not set
# CONFIG_WDTPCI is not set

#
# USB-based Watchdog Cards
#
# CONFIG_USBPCWATCHDOG is not set

#
# Watchdog Pretimeout Governors
#
# CONFIG_WATCHDOG_PRETIMEOUT_GOV is not set
CONFIG_SSB_POSSIBLE=y

#
# Sonics Silicon Backplane
#
# CONFIG_SSB is not set
CONFIG_BCMA_POSSIBLE=y
# CONFIG_BCMA is not set

#
# Multifunction device drivers
#
# CONFIG_MFD_CORE is not set
# CONFIG_MFD_AS3711 is not set
# CONFIG_PMIC_ADP5520 is not set
# CONFIG_MFD_BCM590XX is not set
# CONFIG_MFD_BD9571MWV is not set
# CONFIG_MFD_AXP20X_I2C is not set
# CONFIG_MFD_CROS_EC is not set
# CONFIG_PMIC_DA903X is not set
# CONFIG_MFD_DA9052_I2C is not set
# CONFIG_MFD_DA9055 is not set
# CONFIG_MFD_DA9062 is not set
# CONFIG_MFD_DA9063 is not set
# CONFIG_MFD_DA9150 is not set
# CONFIG_MFD_DLN2 is not set
# CONFIG_MFD_MC13XXX_I2C is not set
# CONFIG_HTC_PASIC3 is not set
# CONFIG_MFD_INTEL_QUARK_I2C_GPIO is not set
# CONFIG_LPC_ICH is not set
# CONFIG_LPC_SCH is not set
# CONFIG_INTEL_SOC_PMIC_CHTWC is not set
# CONFIG_MFD_INTEL_LPSS_ACPI is not set
# CONFIG_MFD_INTEL_LPSS_PCI is not set
# CONFIG_MFD_JANZ_CMODIO is not set
# CONFIG_MFD_KEMPLD is not set
# CONFIG_MFD_88PM800 is not set
# CONFIG_MFD_88PM805 is not set
# CONFIG_MFD_88PM860X is not set
# CONFIG_MFD_MAX14577 is not set
# CONFIG_MFD_MAX77693 is not set
# CONFIG_MFD_MAX77843 is not set
# CONFIG_MFD_MAX8907 is not set
# CONFIG_MFD_MAX8925 is not set
# CONFIG_MFD_MAX8997 is not set
# CONFIG_MFD_MAX8998 is not set
# CONFIG_MFD_MT6397 is not set
# CONFIG_MFD_MENF21BMC is not set
# CONFIG_MFD_VIPERBOARD is not set
# CONFIG_MFD_RETU is not set
# CONFIG_MFD_PCF50633 is not set
# CONFIG_MFD_RDC321X is not set
# CONFIG_MFD_RTSX_PCI is not set
# CONFIG_MFD_RT5033 is not set
# CONFIG_MFD_RTSX_USB is not set
# CONFIG_MFD_RC5T583 is not set
# CONFIG_MFD_SEC_CORE is not set
# CONFIG_MFD_SI476X_CORE is not set
# CONFIG_MFD_SM501 is not set
# CONFIG_MFD_SKY81452 is not set
# CONFIG_MFD_SMSC is not set
# CONFIG_ABX500_CORE is not set
# CONFIG_MFD_SYSCON is not set
# CONFIG_MFD_TI_AM335X_TSCADC is not set
# CONFIG_MFD_LP3943 is not set
# CONFIG_MFD_LP8788 is not set
# CONFIG_MFD_TI_LMU is not set
# CONFIG_MFD_PALMAS is not set
# CONFIG_TPS6105X is not set
# CONFIG_TPS6507X is not set
# CONFIG_MFD_TPS65086 is not set
# CONFIG_MFD_TPS65090 is not set
# CONFIG_MFD_TPS65217 is not set
# CONFIG_MFD_TPS68470 is not set
# CONFIG_MFD_TI_LP873X is not set
# CONFIG_MFD_TPS65218 is not set
# CONFIG_MFD_TPS6586X is not set
# CONFIG_MFD_TPS65912_I2C is not set
# CONFIG_MFD_TPS80031 is not set
# CONFIG_TWL4030_CORE is not set
# CONFIG_TWL6040_CORE is not set
# CONFIG_MFD_WL1273_CORE is not set
# CONFIG_MFD_LM3533 is not set
# CONFIG_MFD_TMIO is not set
# CONFIG_MFD_VX855 is not set
# CONFIG_MFD_ARIZONA_I2C is not set
# CONFIG_MFD_WM8400 is not set
# CONFIG_MFD_WM831X_I2C is not set
# CONFIG_MFD_WM8350_I2C is not set
# CONFIG_MFD_WM8994 is not set
# CONFIG_REGULATOR is not set
CONFIG_RC_CORE=y
CONFIG_RC_MAP=y
CONFIG_RC_DECODERS=y
# CONFIG_LIRC is not set
CONFIG_IR_NEC_DECODER=y
CONFIG_IR_RC5_DECODER=y
CONFIG_IR_RC6_DECODER=y
CONFIG_IR_JVC_DECODER=y
CONFIG_IR_SONY_DECODER=y
CONFIG_IR_SANYO_DECODER=y
CONFIG_IR_SHARP_DECODER=y
CONFIG_IR_MCE_KBD_DECODER=y
CONFIG_IR_XMP_DECODER=y
# CONFIG_RC_DEVICES is not set
# CONFIG_MEDIA_SUPPORT is not set

#
# Graphics support
#
CONFIG_AGP=y
CONFIG_AGP_AMD64=y
CONFIG_AGP_INTEL=y
CONFIG_AGP_SIS=y
CONFIG_AGP_VIA=y
CONFIG_INTEL_GTT=y
CONFIG_VGA_ARB=y
CONFIG_VGA_ARB_MAX_GPUS=16
CONFIG_VGA_SWITCHEROO=y
# CONFIG_DRM is not set

#
# ACP (Audio CoProcessor) Configuration
#
# CONFIG_DRM_LIB_RANDOM is not set

#
# Frame buffer Devices
#
CONFIG_FB=y
# CONFIG_FIRMWARE_EDID is not set
CONFIG_FB_CMDLINE=y
CONFIG_FB_NOTIFY=y
# CONFIG_FB_DDC is not set
# CONFIG_FB_BOOT_VESA_SUPPORT is not set
CONFIG_FB_CFB_FILLRECT=y
CONFIG_FB_CFB_COPYAREA=y
CONFIG_FB_CFB_IMAGEBLIT=y
# CONFIG_FB_CFB_REV_PIXELS_IN_BYTE is not set
# CONFIG_FB_SYS_FILLRECT is not set
# CONFIG_FB_SYS_COPYAREA is not set
# CONFIG_FB_SYS_IMAGEBLIT is not set
# CONFIG_FB_PROVIDE_GET_FB_UNMAPPED_AREA is not set
# CONFIG_FB_FOREIGN_ENDIAN is not set
# CONFIG_FB_SYS_FOPS is not set
# CONFIG_FB_SVGALIB is not set
# CONFIG_FB_MACMODES is not set
# CONFIG_FB_BACKLIGHT is not set
CONFIG_FB_MODE_HELPERS=y
CONFIG_FB_TILEBLITTING=y

#
# Frame buffer hardware drivers
#
# CONFIG_FB_CIRRUS is not set
# CONFIG_FB_PM2 is not set
# CONFIG_FB_CYBER2000 is not set
# CONFIG_FB_ARC is not set
# CONFIG_FB_ASILIANT is not set
# CONFIG_FB_IMSTT is not set
# CONFIG_FB_VGA16 is not set
# CONFIG_FB_UVESA is not set
# CONFIG_FB_VESA is not set
CONFIG_FB_EFI=y
# CONFIG_FB_N411 is not set
# CONFIG_FB_HGA is not set
# CONFIG_FB_OPENCORES is not set
# CONFIG_FB_S1D13XXX is not set
# CONFIG_FB_NVIDIA is not set
# CONFIG_FB_RIVA is not set
# CONFIG_FB_I740 is not set
# CONFIG_FB_LE80578 is not set
# CONFIG_FB_MATROX is not set
# CONFIG_FB_RADEON is not set
# CONFIG_FB_ATY128 is not set
# CONFIG_FB_ATY is not set
# CONFIG_FB_S3 is not set
# CONFIG_FB_SAVAGE is not set
# CONFIG_FB_SIS is not set
# CONFIG_FB_NEOMAGIC is not set
# CONFIG_FB_KYRO is not set
# CONFIG_FB_3DFX is not set
# CONFIG_FB_VOODOO1 is not set
# CONFIG_FB_VT8623 is not set
# CONFIG_FB_TRIDENT is not set
# CONFIG_FB_ARK is not set
# CONFIG_FB_PM3 is not set
# CONFIG_FB_CARMINE is not set
# CONFIG_FB_SMSCUFX is not set
# CONFIG_FB_UDL is not set
# CONFIG_FB_IBM_GXT4500 is not set
# CONFIG_FB_VIRTUAL is not set
# CONFIG_FB_METRONOME is not set
# CONFIG_FB_MB862XX is not set
# CONFIG_FB_BROADSHEET is not set
# CONFIG_FB_AUO_K190X is not set
# CONFIG_FB_SIMPLE is not set
# CONFIG_FB_SM712 is not set
CONFIG_BACKLIGHT_LCD_SUPPORT=y
# CONFIG_LCD_CLASS_DEVICE is not set
CONFIG_BACKLIGHT_CLASS_DEVICE=y
# CONFIG_BACKLIGHT_GENERIC is not set
# CONFIG_BACKLIGHT_APPLE is not set
# CONFIG_BACKLIGHT_PM8941_WLED is not set
# CONFIG_BACKLIGHT_SAHARA is not set
# CONFIG_BACKLIGHT_ADP8860 is not set
# CONFIG_BACKLIGHT_ADP8870 is not set
# CONFIG_BACKLIGHT_LM3639 is not set
# CONFIG_BACKLIGHT_LV5207LP is not set
# CONFIG_BACKLIGHT_BD6107 is not set
# CONFIG_BACKLIGHT_ARCXCNN is not set
# CONFIG_VGASTATE is not set

#
# Console display driver support
#
CONFIG_VGA_CONSOLE=y
CONFIG_VGACON_SOFT_SCROLLBACK=y
CONFIG_VGACON_SOFT_SCROLLBACK_SIZE=64
# CONFIG_VGACON_SOFT_SCROLLBACK_PERSISTENT_ENABLE_BY_DEFAULT is not set
CONFIG_DUMMY_CONSOLE=y
CONFIG_DUMMY_CONSOLE_COLUMNS=80
CONFIG_DUMMY_CONSOLE_ROWS=25
# CONFIG_FRAMEBUFFER_CONSOLE is not set
CONFIG_LOGO=y
# CONFIG_LOGO_LINUX_MONO is not set
# CONFIG_LOGO_LINUX_VGA16 is not set
CONFIG_LOGO_LINUX_CLUT224=y
# CONFIG_SOUND is not set

#
# HID support
#
CONFIG_HID=y
CONFIG_HID_BATTERY_STRENGTH=y
CONFIG_HIDRAW=y
# CONFIG_UHID is not set
CONFIG_HID_GENERIC=y

#
# Special HID drivers
#
CONFIG_HID_A4TECH=y
# CONFIG_HID_ACCUTOUCH is not set
# CONFIG_HID_ACRUX is not set
CONFIG_HID_APPLE=y
# CONFIG_HID_APPLEIR is not set
# CONFIG_HID_ASUS is not set
# CONFIG_HID_AUREAL is not set
CONFIG_HID_BELKIN=y
# CONFIG_HID_BETOP_FF is not set
CONFIG_HID_CHERRY=y
CONFIG_HID_CHICONY=y
# CONFIG_HID_CORSAIR is not set
# CONFIG_HID_CMEDIA is not set
CONFIG_HID_CYPRESS=y
# CONFIG_HID_DRAGONRISE is not set
# CONFIG_HID_EMS_FF is not set
# CONFIG_HID_ELECOM is not set
# CONFIG_HID_ELO is not set
CONFIG_HID_EZKEY=y
# CONFIG_HID_GEMBIRD is not set
# CONFIG_HID_GFRM is not set
# CONFIG_HID_HOLTEK is not set
# CONFIG_HID_GT683R is not set
# CONFIG_HID_KEYTOUCH is not set
# CONFIG_HID_KYE is not set
# CONFIG_HID_UCLOGIC is not set
# CONFIG_HID_WALTOP is not set
# CONFIG_HID_GYRATION is not set
# CONFIG_HID_ICADE is not set
CONFIG_HID_ITE=y
# CONFIG_HID_TWINHAN is not set
CONFIG_HID_KENSINGTON=y
# CONFIG_HID_LCPOWER is not set
# CONFIG_HID_LED is not set
# CONFIG_HID_LENOVO is not set
CONFIG_HID_LOGITECH=y
# CONFIG_HID_LOGITECH_DJ is not set
# CONFIG_HID_LOGITECH_HIDPP is not set
CONFIG_LOGITECH_FF=y
CONFIG_LOGIRUMBLEPAD2_FF=y
CONFIG_LOGIG940_FF=y
CONFIG_LOGIWHEELS_FF=y
# CONFIG_HID_MAGICMOUSE is not set
# CONFIG_HID_MAYFLASH is not set
CONFIG_HID_MICROSOFT=y
CONFIG_HID_MONTEREY=y
# CONFIG_HID_MULTITOUCH is not set
# CONFIG_HID_NTI is not set
CONFIG_HID_NTRIG=y
# CONFIG_HID_ORTEK is not set
# CONFIG_HID_PANTHERLORD is not set
# CONFIG_HID_PENMOUNT is not set
# CONFIG_HID_PETALYNX is not set
# CONFIG_HID_PICOLCD is not set
# CONFIG_HID_PLANTRONICS is not set
# CONFIG_HID_PRIMAX is not set
# CONFIG_HID_RETRODE is not set
# CONFIG_HID_ROCCAT is not set
# CONFIG_HID_SAITEK is not set
# CONFIG_HID_SAMSUNG is not set
# CONFIG_HID_SONY is not set
# CONFIG_HID_SPEEDLINK is not set
# CONFIG_HID_STEELSERIES is not set
# CONFIG_HID_SUNPLUS is not set
# CONFIG_HID_RMI is not set
# CONFIG_HID_GREENASIA is not set
# CONFIG_HID_SMARTJOYPLUS is not set
# CONFIG_HID_TIVO is not set
# CONFIG_HID_TOPSEED is not set
# CONFIG_HID_THINGM is not set
# CONFIG_HID_THRUSTMASTER is not set
# CONFIG_HID_UDRAW_PS3 is not set
# CONFIG_HID_WACOM is not set
# CONFIG_HID_WIIMOTE is not set
# CONFIG_HID_XINMO is not set
# CONFIG_HID_ZEROPLUS is not set
# CONFIG_HID_ZYDACRON is not set
# CONFIG_HID_SENSOR_HUB is not set
# CONFIG_HID_ALPS is not set

#
# USB HID support
#
CONFIG_USB_HID=y
CONFIG_HID_PID=y
CONFIG_USB_HIDDEV=y

#
# I2C HID support
#
# CONFIG_I2C_HID is not set

#
# Intel ISH HID support
#
# CONFIG_INTEL_ISH_HID is not set
CONFIG_USB_OHCI_LITTLE_ENDIAN=y
CONFIG_USB_SUPPORT=y
CONFIG_USB_COMMON=y
CONFIG_USB_ARCH_HAS_HCD=y
CONFIG_USB=y
CONFIG_USB_PCI=y
CONFIG_USB_ANNOUNCE_NEW_DEVICES=y

#
# Miscellaneous USB options
#
CONFIG_USB_DEFAULT_PERSIST=y
# CONFIG_USB_DYNAMIC_MINORS is not set
# CONFIG_USB_OTG is not set
# CONFIG_USB_OTG_WHITELIST is not set
# CONFIG_USB_LEDS_TRIGGER_USBPORT is not set
CONFIG_USB_MON=y
# CONFIG_USB_WUSB_CBAF is not set

#
# USB Host Controller Drivers
#
# CONFIG_USB_C67X00_HCD is not set
CONFIG_USB_XHCI_HCD=y
CONFIG_USB_XHCI_PCI=y
# CONFIG_USB_XHCI_PLATFORM is not set
CONFIG_USB_EHCI_HCD=y
CONFIG_USB_EHCI_ROOT_HUB_TT=y
CONFIG_USB_EHCI_TT_NEWSCHED=y
CONFIG_USB_EHCI_PCI=y
# CONFIG_USB_EHCI_HCD_PLATFORM is not set
# CONFIG_USB_OXU210HP_HCD is not set
# CONFIG_USB_ISP116X_HCD is not set
# CONFIG_USB_ISP1362_HCD is not set
# CONFIG_USB_FOTG210_HCD is not set
CONFIG_USB_OHCI_HCD=y
CONFIG_USB_OHCI_HCD_PCI=y
# CONFIG_USB_OHCI_HCD_PLATFORM is not set
CONFIG_USB_UHCI_HCD=y
# CONFIG_USB_SL811_HCD is not set
# CONFIG_USB_R8A66597_HCD is not set
# CONFIG_USB_HCD_TEST_MODE is not set

#
# USB Device Class drivers
#
# CONFIG_USB_ACM is not set
# CONFIG_USB_PRINTER is not set
# CONFIG_USB_WDM is not set
# CONFIG_USB_TMC is not set

#
# NOTE: USB_STORAGE depends on SCSI but BLK_DEV_SD may
#

#
# also be needed; see USB_STORAGE Help for more info
#
# CONFIG_USB_STORAGE is not set

#
# USB Imaging devices
#
# CONFIG_USB_MDC800 is not set
# CONFIG_USB_MICROTEK is not set
# CONFIG_USBIP_CORE is not set
# CONFIG_USB_MUSB_HDRC is not set
# CONFIG_USB_DWC3 is not set
# CONFIG_USB_DWC2 is not set
# CONFIG_USB_CHIPIDEA is not set
# CONFIG_USB_ISP1760 is not set

#
# USB port drivers
#
CONFIG_USB_SERIAL=y
CONFIG_USB_SERIAL_CONSOLE=y
CONFIG_USB_SERIAL_GENERIC=y
# CONFIG_USB_SERIAL_SIMPLE is not set
# CONFIG_USB_SERIAL_AIRCABLE is not set
# CONFIG_USB_SERIAL_ARK3116 is not set
# CONFIG_USB_SERIAL_BELKIN is not set
# CONFIG_USB_SERIAL_CH341 is not set
# CONFIG_USB_SERIAL_WHITEHEAT is not set
# CONFIG_USB_SERIAL_DIGI_ACCELEPORT is not set
# CONFIG_USB_SERIAL_CP210X is not set
# CONFIG_USB_SERIAL_CYPRESS_M8 is not set
# CONFIG_USB_SERIAL_EMPEG is not set
# CONFIG_USB_SERIAL_FTDI_SIO is not set
# CONFIG_USB_SERIAL_VISOR is not set
# CONFIG_USB_SERIAL_IPAQ is not set
# CONFIG_USB_SERIAL_IR is not set
# CONFIG_USB_SERIAL_EDGEPORT is not set
# CONFIG_USB_SERIAL_EDGEPORT_TI is not set
# CONFIG_USB_SERIAL_F81232 is not set
# CONFIG_USB_SERIAL_F8153X is not set
# CONFIG_USB_SERIAL_GARMIN is not set
# CONFIG_USB_SERIAL_IPW is not set
# CONFIG_USB_SERIAL_IUU is not set
# CONFIG_USB_SERIAL_KEYSPAN_PDA is not set
# CONFIG_USB_SERIAL_KEYSPAN is not set
# CONFIG_USB_SERIAL_KLSI is not set
# CONFIG_USB_SERIAL_KOBIL_SCT is not set
# CONFIG_USB_SERIAL_MCT_U232 is not set
# CONFIG_USB_SERIAL_METRO is not set
# CONFIG_USB_SERIAL_MOS7720 is not set
# CONFIG_USB_SERIAL_MOS7840 is not set
# CONFIG_USB_SERIAL_MXUPORT is not set
# CONFIG_USB_SERIAL_NAVMAN is not set
# CONFIG_USB_SERIAL_PL2303 is not set
# CONFIG_USB_SERIAL_OTI6858 is not set
# CONFIG_USB_SERIAL_QCAUX is not set
# CONFIG_USB_SERIAL_QUALCOMM is not set
# CONFIG_USB_SERIAL_SPCP8X5 is not set
# CONFIG_USB_SERIAL_SAFE is not set
# CONFIG_USB_SERIAL_SIERRAWIRELESS is not set
# CONFIG_USB_SERIAL_SYMBOL is not set
# CONFIG_USB_SERIAL_TI is not set
# CONFIG_USB_SERIAL_CYBERJACK is not set
# CONFIG_USB_SERIAL_XIRCOM is not set
# CONFIG_USB_SERIAL_OPTION is not set
# CONFIG_USB_SERIAL_OMNINET is not set
# CONFIG_USB_SERIAL_OPTICON is not set
# CONFIG_USB_SERIAL_XSENS_MT is not set
# CONFIG_USB_SERIAL_WISHBONE is not set
# CONFIG_USB_SERIAL_SSU100 is not set
# CONFIG_USB_SERIAL_QT2 is not set
# CONFIG_USB_SERIAL_UPD78F0730 is not set
# CONFIG_USB_SERIAL_DEBUG is not set

#
# USB Miscellaneous drivers
#
# CONFIG_USB_EMI62 is not set
# CONFIG_USB_EMI26 is not set
# CONFIG_USB_ADUTUX is not set
# CONFIG_USB_SEVSEG is not set
# CONFIG_USB_RIO500 is not set
# CONFIG_USB_LEGOTOWER is not set
# CONFIG_USB_LCD is not set
# CONFIG_USB_CYPRESS_CY7C63 is not set
# CONFIG_USB_CYTHERM is not set
# CONFIG_USB_IDMOUSE is not set
# CONFIG_USB_FTDI_ELAN is not set
# CONFIG_USB_APPLEDISPLAY is not set
# CONFIG_USB_SISUSBVGA is not set
# CONFIG_USB_LD is not set
# CONFIG_USB_TRANCEVIBRATOR is not set
# CONFIG_USB_IOWARRIOR is not set
# CONFIG_USB_TEST is not set
# CONFIG_USB_EHSET_TEST_FIXTURE is not set
# CONFIG_USB_ISIGHTFW is not set
# CONFIG_USB_YUREX is not set
# CONFIG_USB_EZUSB_FX2 is not set
# CONFIG_USB_HUB_USB251XB is not set
# CONFIG_USB_HSIC_USB3503 is not set
# CONFIG_USB_HSIC_USB4604 is not set
# CONFIG_USB_LINK_LAYER_TEST is not set
# CONFIG_USB_CHAOSKEY is not set

#
# USB Physical Layer drivers
#
# CONFIG_USB_PHY is not set
# CONFIG_NOP_USB_XCEIV is not set
# CONFIG_USB_ISP1301 is not set
# CONFIG_USB_GADGET is not set

#
# USB Power Delivery and Type-C drivers
#
# CONFIG_TYPEC_UCSI is not set
# CONFIG_USB_LED_TRIG is not set
# CONFIG_USB_ULPI_BUS is not set
# CONFIG_UWB is not set
# CONFIG_MMC is not set
# CONFIG_MEMSTICK is not set
CONFIG_NEW_LEDS=y
CONFIG_LEDS_CLASS=y
# CONFIG_LEDS_CLASS_FLASH is not set
# CONFIG_LEDS_BRIGHTNESS_HW_CHANGED is not set

#
# LED drivers
#
# CONFIG_LEDS_LM3530 is not set
# CONFIG_LEDS_LM3642 is not set
# CONFIG_LEDS_PCA9532 is not set
# CONFIG_LEDS_LP3944 is not set
# CONFIG_LEDS_LP5521 is not set
# CONFIG_LEDS_LP5523 is not set
# CONFIG_LEDS_LP5562 is not set
# CONFIG_LEDS_LP8501 is not set
# CONFIG_LEDS_LP8860 is not set
# CONFIG_LEDS_CLEVO_MAIL is not set
# CONFIG_LEDS_PCA955X is not set
# CONFIG_LEDS_PCA963X is not set
# CONFIG_LEDS_BD2802 is not set
# CONFIG_LEDS_INTEL_SS4200 is not set
# CONFIG_LEDS_TCA6507 is not set
# CONFIG_LEDS_TLC591XX is not set
# CONFIG_LEDS_LM355x is not set

#
# LED driver for blink(1) USB RGB LED is under Special HID drivers (HID_THINGM)
#
# CONFIG_LEDS_BLINKM is not set
# CONFIG_LEDS_MLXCPLD is not set
# CONFIG_LEDS_USER is not set
# CONFIG_LEDS_NIC78BX is not set

#
# LED Triggers
#
CONFIG_LEDS_TRIGGERS=y
# CONFIG_LEDS_TRIGGER_TIMER is not set
# CONFIG_LEDS_TRIGGER_ONESHOT is not set
# CONFIG_LEDS_TRIGGER_DISK is not set
# CONFIG_LEDS_TRIGGER_HEARTBEAT is not set
# CONFIG_LEDS_TRIGGER_BACKLIGHT is not set
# CONFIG_LEDS_TRIGGER_CPU is not set
# CONFIG_LEDS_TRIGGER_DEFAULT_ON is not set

#
# iptables trigger is under Netfilter config (LED target)
#
# CONFIG_LEDS_TRIGGER_TRANSIENT is not set
# CONFIG_LEDS_TRIGGER_CAMERA is not set
# CONFIG_LEDS_TRIGGER_PANIC is not set
CONFIG_ACCESSIBILITY=y
CONFIG_A11Y_BRAILLE_CONSOLE=y
# CONFIG_INFINIBAND is not set
CONFIG_EDAC_ATOMIC_SCRUB=y
CONFIG_EDAC_SUPPORT=y
CONFIG_EDAC=y
CONFIG_EDAC_LEGACY_SYSFS=y
# CONFIG_EDAC_DEBUG is not set
CONFIG_EDAC_DECODE_MCE=y
CONFIG_EDAC_GHES=y
CONFIG_EDAC_AMD64=y
# CONFIG_EDAC_AMD64_ERROR_INJECTION is not set
# CONFIG_EDAC_E752X is not set
# CONFIG_EDAC_I82975X is not set
# CONFIG_EDAC_I3000 is not set
# CONFIG_EDAC_I3200 is not set
# CONFIG_EDAC_IE31200 is not set
# CONFIG_EDAC_X38 is not set
# CONFIG_EDAC_I5400 is not set
# CONFIG_EDAC_I7CORE is not set
# CONFIG_EDAC_I5000 is not set
# CONFIG_EDAC_I5100 is not set
# CONFIG_EDAC_I7300 is not set
# CONFIG_EDAC_SBRIDGE is not set
# CONFIG_EDAC_SKX is not set
# CONFIG_EDAC_PND2 is not set
CONFIG_RTC_LIB=y
CONFIG_RTC_MC146818_LIB=y
CONFIG_RTC_CLASS=y
CONFIG_RTC_HCTOSYS=y
CONFIG_RTC_HCTOSYS_DEVICE="rtc0"
CONFIG_RTC_SYSTOHC=y
CONFIG_RTC_SYSTOHC_DEVICE="rtc0"
# CONFIG_RTC_DEBUG is not set
CONFIG_RTC_NVMEM=y

#
# RTC interfaces
#
CONFIG_RTC_INTF_SYSFS=y
CONFIG_RTC_INTF_PROC=y
CONFIG_RTC_INTF_DEV=y
# CONFIG_RTC_INTF_DEV_UIE_EMUL is not set
# CONFIG_RTC_DRV_TEST is not set

#
# I2C RTC drivers
#
# CONFIG_RTC_DRV_ABB5ZES3 is not set
# CONFIG_RTC_DRV_ABX80X is not set
# CONFIG_RTC_DRV_DS1307 is not set
# CONFIG_RTC_DRV_DS1374 is not set
# CONFIG_RTC_DRV_DS1672 is not set
# CONFIG_RTC_DRV_MAX6900 is not set
# CONFIG_RTC_DRV_RS5C372 is not set
# CONFIG_RTC_DRV_ISL1208 is not set
# CONFIG_RTC_DRV_ISL12022 is not set
# CONFIG_RTC_DRV_X1205 is not set
# CONFIG_RTC_DRV_PCF8523 is not set
# CONFIG_RTC_DRV_PCF85063 is not set
# CONFIG_RTC_DRV_PCF8563 is not set
# CONFIG_RTC_DRV_PCF8583 is not set
# CONFIG_RTC_DRV_M41T80 is not set
# CONFIG_RTC_DRV_BQ32K is not set
# CONFIG_RTC_DRV_S35390A is not set
# CONFIG_RTC_DRV_FM3130 is not set
# CONFIG_RTC_DRV_RX8010 is not set
# CONFIG_RTC_DRV_RX8581 is not set
# CONFIG_RTC_DRV_RX8025 is not set
# CONFIG_RTC_DRV_EM3027 is not set
# CONFIG_RTC_DRV_RV8803 is not set

#
# SPI RTC drivers
#
CONFIG_RTC_I2C_AND_SPI=y

#
# SPI and I2C RTC drivers
#
# CONFIG_RTC_DRV_DS3232 is not set
# CONFIG_RTC_DRV_PCF2127 is not set
# CONFIG_RTC_DRV_RV3029C2 is not set

#
# Platform RTC drivers
#
CONFIG_RTC_DRV_CMOS=y
# CONFIG_RTC_DRV_DS1286 is not set
# CONFIG_RTC_DRV_DS1511 is not set
# CONFIG_RTC_DRV_DS1553 is not set
# CONFIG_RTC_DRV_DS1685_FAMILY is not set
# CONFIG_RTC_DRV_DS1742 is not set
# CONFIG_RTC_DRV_DS2404 is not set
# CONFIG_RTC_DRV_STK17TA8 is not set
# CONFIG_RTC_DRV_M48T86 is not set
# CONFIG_RTC_DRV_M48T35 is not set
# CONFIG_RTC_DRV_M48T59 is not set
# CONFIG_RTC_DRV_MSM6242 is not set
# CONFIG_RTC_DRV_BQ4802 is not set
# CONFIG_RTC_DRV_RP5C01 is not set
# CONFIG_RTC_DRV_V3020 is not set

#
# on-CPU RTC drivers
#
# CONFIG_RTC_DRV_FTRTC010 is not set

#
# HID Sensor RTC drivers
#
# CONFIG_RTC_DRV_HID_SENSOR_TIME is not set
CONFIG_DMADEVICES=y
# CONFIG_DMADEVICES_DEBUG is not set

#
# DMA Devices
#
CONFIG_DMA_ENGINE=y
CONFIG_DMA_VIRTUAL_CHANNELS=y
CONFIG_DMA_ACPI=y
# CONFIG_ALTERA_MSGDMA is not set
# CONFIG_INTEL_IDMA64 is not set
CONFIG_INTEL_IOATDMA=y
# CONFIG_QCOM_HIDMA_MGMT is not set
# CONFIG_QCOM_HIDMA is not set
CONFIG_DW_DMAC_CORE=y
# CONFIG_DW_DMAC is not set
# CONFIG_DW_DMAC_PCI is not set
CONFIG_HSU_DMA=y

#
# DMA Clients
#
CONFIG_ASYNC_TX_DMA=y
# CONFIG_DMATEST is not set
CONFIG_DMA_ENGINE_RAID=y

#
# DMABUF options
#
# CONFIG_SYNC_FILE is not set
CONFIG_DCA=y
CONFIG_AUXDISPLAY=y
# CONFIG_IMG_ASCII_LCD is not set
# CONFIG_UIO is not set
# CONFIG_VFIO is not set
CONFIG_IRQ_BYPASS_MANAGER=y
# CONFIG_VIRT_DRIVERS is not set
CONFIG_VIRTIO=y

#
# Virtio drivers
#
CONFIG_VIRTIO_PCI=y
CONFIG_VIRTIO_PCI_LEGACY=y
# CONFIG_VIRTIO_BALLOON is not set
# CONFIG_VIRTIO_INPUT is not set
# CONFIG_VIRTIO_MMIO is not set

#
# Microsoft Hyper-V guest support
#
# CONFIG_HYPERV_TSCPAGE is not set
# CONFIG_STAGING is not set
CONFIG_X86_PLATFORM_DEVICES=y
# CONFIG_ACERHDF is not set
# CONFIG_ASUS_LAPTOP is not set
# CONFIG_DELL_LAPTOP is not set
# CONFIG_DELL_SMO8800 is not set
# CONFIG_DELL_RBTN is not set
# CONFIG_FUJITSU_LAPTOP is not set
# CONFIG_FUJITSU_TABLET is not set
# CONFIG_AMILO_RFKILL is not set
# CONFIG_HP_ACCEL is not set
# CONFIG_HP_WIRELESS is not set
# CONFIG_MSI_LAPTOP is not set
# CONFIG_PANASONIC_LAPTOP is not set
# CONFIG_COMPAL_LAPTOP is not set
# CONFIG_SONY_LAPTOP is not set
# CONFIG_IDEAPAD_LAPTOP is not set
# CONFIG_THINKPAD_ACPI is not set
# CONFIG_SENSORS_HDAPS is not set
# CONFIG_INTEL_MENLOW is not set
# CONFIG_EEEPC_LAPTOP is not set
# CONFIG_ASUS_WIRELESS is not set
# CONFIG_ACPI_WMI is not set
# CONFIG_TOPSTAR_LAPTOP is not set
# CONFIG_TOSHIBA_BT_RFKILL is not set
# CONFIG_TOSHIBA_HAPS is not set
# CONFIG_ACPI_CMPC is not set
# CONFIG_INTEL_CHT_INT33FE is not set
# CONFIG_INTEL_HID_EVENT is not set
# CONFIG_INTEL_VBTN is not set
# CONFIG_INTEL_IPS is not set
# CONFIG_INTEL_PMC_CORE is not set
# CONFIG_IBM_RTL is not set
# CONFIG_SAMSUNG_LAPTOP is not set
# CONFIG_INTEL_OAKTRAIL is not set
# CONFIG_SAMSUNG_Q10 is not set
# CONFIG_APPLE_GMUX is not set
# CONFIG_INTEL_RST is not set
# CONFIG_INTEL_SMARTCONNECT is not set
# CONFIG_PVPANIC is not set
# CONFIG_INTEL_PMC_IPC is not set
# CONFIG_SURFACE_PRO3_BUTTON is not set
# CONFIG_INTEL_PUNIT_IPC is not set
# CONFIG_MLX_PLATFORM is not set
# CONFIG_MLX_CPLD_PLATFORM is not set
# CONFIG_INTEL_TURBO_MAX_3 is not set
CONFIG_PMC_ATOM=y
# CONFIG_CHROME_PLATFORMS is not set
CONFIG_CLKDEV_LOOKUP=y
CONFIG_HAVE_CLK_PREPARE=y
CONFIG_COMMON_CLK=y

#
# Common Clock Framework
#
# CONFIG_COMMON_CLK_SI5351 is not set
# CONFIG_COMMON_CLK_CDCE706 is not set
# CONFIG_COMMON_CLK_CS2000_CP is not set
# CONFIG_COMMON_CLK_NXP is not set
# CONFIG_COMMON_CLK_PXA is not set
# CONFIG_COMMON_CLK_PIC32 is not set
# CONFIG_HWSPINLOCK is not set

#
# Clock Source drivers
#
CONFIG_CLKEVT_I8253=y
CONFIG_I8253_LOCK=y
CONFIG_CLKBLD_I8253=y
# CONFIG_ATMEL_PIT is not set
# CONFIG_SH_TIMER_CMT is not set
# CONFIG_SH_TIMER_MTU2 is not set
# CONFIG_SH_TIMER_TMU is not set
# CONFIG_EM_TIMER_STI is not set
CONFIG_MAILBOX=y
CONFIG_PCC=y
# CONFIG_ALTERA_MBOX is not set
CONFIG_IOMMU_API=y
CONFIG_IOMMU_SUPPORT=y

#
# Generic IOMMU Pagetable Support
#
CONFIG_IOMMU_IOVA=y
CONFIG_AMD_IOMMU=y
# CONFIG_AMD_IOMMU_V2 is not set
CONFIG_DMAR_TABLE=y
CONFIG_INTEL_IOMMU=y
# CONFIG_INTEL_IOMMU_SVM is not set
# CONFIG_INTEL_IOMMU_DEFAULT_ON is not set
CONFIG_INTEL_IOMMU_FLOPPY_WA=y
CONFIG_IRQ_REMAP=y

#
# Remoteproc drivers
#
# CONFIG_REMOTEPROC is not set

#
# Rpmsg drivers
#
# CONFIG_RPMSG_QCOM_GLINK_RPM is not set

#
# SOC (System On Chip) specific Drivers
#

#
# Amlogic SoC drivers
#

#
# Broadcom SoC drivers
#

#
# i.MX SoC drivers
#

#
# Qualcomm SoC drivers
#
# CONFIG_SUNXI_SRAM is not set
# CONFIG_SOC_TI is not set
# CONFIG_PM_DEVFREQ is not set
# CONFIG_EXTCON is not set
# CONFIG_MEMORY is not set
# CONFIG_IIO is not set
# CONFIG_NTB is not set
# CONFIG_VME_BUS is not set
# CONFIG_PWM is not set

#
# IRQ chip support
#
CONFIG_ARM_GIC_MAX_NR=1
# CONFIG_ARM_GIC_V3_ITS is not set
# CONFIG_IPACK_BUS is not set
# CONFIG_RESET_CONTROLLER is not set
# CONFIG_FMC is not set

#
# PHY Subsystem
#
# CONFIG_GENERIC_PHY is not set
# CONFIG_BCM_KONA_USB2_PHY is not set
# CONFIG_PHY_PXA_28NM_HSIC is not set
# CONFIG_PHY_PXA_28NM_USB2 is not set
# CONFIG_POWERCAP is not set
# CONFIG_MCB is not set

#
# Performance monitor support
#
CONFIG_RAS=y
# CONFIG_RAS_CEC is not set
# CONFIG_THUNDERBOLT is not set

#
# Android
#
# CONFIG_ANDROID is not set
# CONFIG_LIBNVDIMM is not set
CONFIG_DAX=y
# CONFIG_DEV_DAX is not set
CONFIG_NVMEM=y
# CONFIG_STM is not set
# CONFIG_INTEL_TH is not set
# CONFIG_FPGA is not set

#
# FSI support
#
# CONFIG_FSI is not set

#
# Firmware Drivers
#
# CONFIG_EDD is not set
CONFIG_FIRMWARE_MEMMAP=y
# CONFIG_DELL_RBU is not set
# CONFIG_DCDBAS is not set
CONFIG_DMIID=y
CONFIG_DMI_SYSFS=y
CONFIG_DMI_SCAN_MACHINE_NON_EFI_FALLBACK=y
CONFIG_ISCSI_IBFT_FIND=y
# CONFIG_ISCSI_IBFT is not set
# CONFIG_FW_CFG_SYSFS is not set
# CONFIG_GOOGLE_FIRMWARE is not set

#
# EFI (Extensible Firmware Interface) Support
#
CONFIG_EFI_VARS=y
CONFIG_EFI_ESRT=y
CONFIG_EFI_VARS_PSTORE=y
# CONFIG_EFI_VARS_PSTORE_DEFAULT_DISABLE is not set
CONFIG_EFI_RUNTIME_MAP=y
# CONFIG_EFI_FAKE_MEMMAP is not set
CONFIG_EFI_RUNTIME_WRAPPERS=y
# CONFIG_EFI_BOOTLOADER_CONTROL is not set
# CONFIG_EFI_CAPSULE_LOADER is not set
# CONFIG_EFI_TEST is not set
# CONFIG_APPLE_PROPERTIES is not set
# CONFIG_RESET_ATTACK_MITIGATION is not set
CONFIG_UEFI_CPER=y
# CONFIG_EFI_DEV_PATH_PARSER is not set

#
# Tegra firmware driver
#

#
# File systems
#
CONFIG_DCACHE_WORD_ACCESS=y
# CONFIG_EXT2_FS is not set
CONFIG_EXT3_FS=y
CONFIG_EXT3_FS_POSIX_ACL=y
CONFIG_EXT3_FS_SECURITY=y
CONFIG_EXT4_FS=y
CONFIG_EXT4_USE_FOR_EXT2=y
CONFIG_EXT4_FS_POSIX_ACL=y
CONFIG_EXT4_FS_SECURITY=y
# CONFIG_EXT4_ENCRYPTION is not set
# CONFIG_EXT4_DEBUG is not set
CONFIG_JBD2=y
# CONFIG_JBD2_DEBUG is not set
CONFIG_FS_MBCACHE=y
# CONFIG_REISERFS_FS is not set
# CONFIG_JFS_FS is not set
# CONFIG_XFS_FS is not set
# CONFIG_GFS2_FS is not set
# CONFIG_OCFS2_FS is not set
# CONFIG_BTRFS_FS is not set
# CONFIG_NILFS2_FS is not set
# CONFIG_F2FS_FS is not set
# CONFIG_FS_DAX is not set
CONFIG_FS_POSIX_ACL=y
CONFIG_EXPORTFS=y
# CONFIG_EXPORTFS_BLOCK_OPS is not set
CONFIG_FILE_LOCKING=y
CONFIG_MANDATORY_FILE_LOCKING=y
# CONFIG_FS_ENCRYPTION is not set
CONFIG_FSNOTIFY=y
CONFIG_DNOTIFY=y
CONFIG_INOTIFY_USER=y
CONFIG_FANOTIFY=y
CONFIG_FANOTIFY_ACCESS_PERMISSIONS=y
CONFIG_QUOTA=y
CONFIG_QUOTA_NETLINK_INTERFACE=y
# CONFIG_PRINT_QUOTA_WARNING is not set
# CONFIG_QUOTA_DEBUG is not set
CONFIG_QUOTA_TREE=y
# CONFIG_QFMT_V1 is not set
CONFIG_QFMT_V2=y
CONFIG_QUOTACTL=y
CONFIG_QUOTACTL_COMPAT=y
CONFIG_AUTOFS4_FS=y
# CONFIG_FUSE_FS is not set
# CONFIG_OVERLAY_FS is not set

#
# Caches
#
# CONFIG_FSCACHE is not set

#
# CD-ROM/DVD Filesystems
#
CONFIG_ISO9660_FS=y
CONFIG_JOLIET=y
CONFIG_ZISOFS=y
# CONFIG_UDF_FS is not set

#
# DOS/FAT/NT Filesystems
#
# CONFIG_MSDOS_FS is not set
# CONFIG_VFAT_FS is not set
# CONFIG_NTFS_FS is not set

#
# Pseudo filesystems
#
CONFIG_PROC_FS=y
CONFIG_PROC_KCORE=y
CONFIG_PROC_VMCORE=y
CONFIG_PROC_SYSCTL=y
CONFIG_PROC_PAGE_MONITOR=y
# CONFIG_PROC_CHILDREN is not set
CONFIG_KERNFS=y
CONFIG_SYSFS=y
CONFIG_TMPFS=y
CONFIG_TMPFS_POSIX_ACL=y
CONFIG_TMPFS_XATTR=y
CONFIG_HUGETLBFS=y
CONFIG_HUGETLB_PAGE=y
CONFIG_ARCH_HAS_GIGANTIC_PAGE=y
CONFIG_CONFIGFS_FS=y
# CONFIG_EFIVAR_FS is not set
CONFIG_MISC_FILESYSTEMS=y
# CONFIG_ORANGEFS_FS is not set
# CONFIG_ADFS_FS is not set
# CONFIG_AFFS_FS is not set
# CONFIG_ECRYPT_FS is not set
# CONFIG_HFS_FS is not set
# CONFIG_HFSPLUS_FS is not set
# CONFIG_BEFS_FS is not set
# CONFIG_BFS_FS is not set
# CONFIG_EFS_FS is not set
# CONFIG_CRAMFS is not set
# CONFIG_SQUASHFS is not set
# CONFIG_VXFS_FS is not set
# CONFIG_MINIX_FS is not set
# CONFIG_OMFS_FS is not set
# CONFIG_HPFS_FS is not set
# CONFIG_QNX4FS_FS is not set
# CONFIG_QNX6FS_FS is not set
# CONFIG_ROMFS_FS is not set
CONFIG_PSTORE=y
CONFIG_PSTORE_ZLIB_COMPRESS=y
# CONFIG_PSTORE_LZO_COMPRESS is not set
# CONFIG_PSTORE_LZ4_COMPRESS is not set
# CONFIG_PSTORE_CONSOLE is not set
# CONFIG_PSTORE_PMSG is not set
# CONFIG_PSTORE_FTRACE is not set
# CONFIG_PSTORE_RAM is not set
# CONFIG_SYSV_FS is not set
# CONFIG_UFS_FS is not set
CONFIG_NETWORK_FILESYSTEMS=y
CONFIG_NFS_FS=y
# CONFIG_NFS_V2 is not set
# CONFIG_NFS_V3 is not set
# CONFIG_NFS_V4 is not set
# CONFIG_NFS_SWAP is not set
# CONFIG_ROOT_NFS is not set
CONFIG_NFS_DEBUG=y
# CONFIG_NFSD is not set
CONFIG_GRACE_PERIOD=y
CONFIG_LOCKD=y
CONFIG_NFS_COMMON=y
CONFIG_SUNRPC=y
CONFIG_SUNRPC_DEBUG=y
# CONFIG_CEPH_FS is not set
# CONFIG_CIFS is not set
# CONFIG_NCP_FS is not set
# CONFIG_CODA_FS is not set
# CONFIG_AFS_FS is not set
CONFIG_9P_FS=y
# CONFIG_9P_FS_POSIX_ACL is not set
# CONFIG_9P_FS_SECURITY is not set
CONFIG_NLS=y
CONFIG_NLS_DEFAULT="utf8"
CONFIG_NLS_CODEPAGE_437=y
# CONFIG_NLS_CODEPAGE_737 is not set
# CONFIG_NLS_CODEPAGE_775 is not set
# CONFIG_NLS_CODEPAGE_850 is not set
# CONFIG_NLS_CODEPAGE_852 is not set
# CONFIG_NLS_CODEPAGE_855 is not set
# CONFIG_NLS_CODEPAGE_857 is not set
# CONFIG_NLS_CODEPAGE_860 is not set
# CONFIG_NLS_CODEPAGE_861 is not set
# CONFIG_NLS_CODEPAGE_862 is not set
# CONFIG_NLS_CODEPAGE_863 is not set
# CONFIG_NLS_CODEPAGE_864 is not set
# CONFIG_NLS_CODEPAGE_865 is not set
# CONFIG_NLS_CODEPAGE_866 is not set
# CONFIG_NLS_CODEPAGE_869 is not set
# CONFIG_NLS_CODEPAGE_936 is not set
# CONFIG_NLS_CODEPAGE_950 is not set
# CONFIG_NLS_CODEPAGE_932 is not set
# CONFIG_NLS_CODEPAGE_949 is not set
# CONFIG_NLS_CODEPAGE_874 is not set
# CONFIG_NLS_ISO8859_8 is not set
# CONFIG_NLS_CODEPAGE_1250 is not set
# CONFIG_NLS_CODEPAGE_1251 is not set
CONFIG_NLS_ASCII=y
# CONFIG_NLS_ISO8859_1 is not set
# CONFIG_NLS_ISO8859_2 is not set
# CONFIG_NLS_ISO8859_3 is not set
# CONFIG_NLS_ISO8859_4 is not set
# CONFIG_NLS_ISO8859_5 is not set
# CONFIG_NLS_ISO8859_6 is not set
# CONFIG_NLS_ISO8859_7 is not set
# CONFIG_NLS_ISO8859_9 is not set
# CONFIG_NLS_ISO8859_13 is not set
# CONFIG_NLS_ISO8859_14 is not set
# CONFIG_NLS_ISO8859_15 is not set
# CONFIG_NLS_KOI8_R is not set
# CONFIG_NLS_KOI8_U is not set
# CONFIG_NLS_MAC_ROMAN is not set
# CONFIG_NLS_MAC_CELTIC is not set
# CONFIG_NLS_MAC_CENTEURO is not set
# CONFIG_NLS_MAC_CROATIAN is not set
# CONFIG_NLS_MAC_CYRILLIC is not set
# CONFIG_NLS_MAC_GAELIC is not set
# CONFIG_NLS_MAC_GREEK is not set
# CONFIG_NLS_MAC_ICELAND is not set
# CONFIG_NLS_MAC_INUIT is not set
# CONFIG_NLS_MAC_ROMANIAN is not set
# CONFIG_NLS_MAC_TURKISH is not set
# CONFIG_NLS_UTF8 is not set
# CONFIG_DLM is not set

#
# Kernel hacking
#
CONFIG_TRACE_IRQFLAGS_SUPPORT=y

#
# printk and dmesg options
#
CONFIG_PRINTK_TIME=y
CONFIG_CONSOLE_LOGLEVEL_DEFAULT=7
CONFIG_MESSAGE_LOGLEVEL_DEFAULT=4
CONFIG_BOOT_PRINTK_DELAY=y
CONFIG_DYNAMIC_DEBUG=y

#
# Compile-time checks and compiler options
#
# CONFIG_DEBUG_INFO is not set
# CONFIG_ENABLE_WARN_DEPRECATED is not set
CONFIG_ENABLE_MUST_CHECK=y
CONFIG_FRAME_WARN=2048
CONFIG_STRIP_ASM_SYMS=y
# CONFIG_READABLE_ASM is not set
CONFIG_UNUSED_SYMBOLS=y
# CONFIG_PAGE_OWNER is not set
CONFIG_DEBUG_FS=y
CONFIG_HEADERS_CHECK=y
# CONFIG_DEBUG_SECTION_MISMATCH is not set
CONFIG_SECTION_MISMATCH_WARN_ONLY=y
CONFIG_STACK_VALIDATION=y
# CONFIG_DEBUG_FORCE_WEAK_PER_CPU is not set
CONFIG_MAGIC_SYSRQ=y
CONFIG_MAGIC_SYSRQ_DEFAULT_ENABLE=0x1
CONFIG_MAGIC_SYSRQ_SERIAL=y
CONFIG_DEBUG_KERNEL=y

#
# Memory Debugging
#
# CONFIG_PAGE_EXTENSION is not set
# CONFIG_DEBUG_PAGEALLOC is not set
# CONFIG_PAGE_POISONING is not set
# CONFIG_DEBUG_PAGE_REF is not set
CONFIG_DEBUG_RODATA_TEST=y
# CONFIG_DEBUG_OBJECTS is not set
# CONFIG_SLUB_DEBUG_ON is not set
# CONFIG_SLUB_STATS is not set
CONFIG_HAVE_DEBUG_KMEMLEAK=y
# CONFIG_DEBUG_KMEMLEAK is not set
# CONFIG_DEBUG_STACK_USAGE is not set
CONFIG_DEBUG_VM=y
# CONFIG_DEBUG_VM_VMACACHE is not set
# CONFIG_DEBUG_VM_RB is not set
# CONFIG_DEBUG_VM_PGFLAGS is not set
CONFIG_ARCH_HAS_DEBUG_VIRTUAL=y
# CONFIG_DEBUG_VIRTUAL is not set
CONFIG_DEBUG_MEMORY_INIT=y
# CONFIG_DEBUG_PER_CPU_MAPS is not set
CONFIG_HAVE_DEBUG_STACKOVERFLOW=y
CONFIG_DEBUG_STACKOVERFLOW=y
CONFIG_HAVE_ARCH_KMEMCHECK=y
CONFIG_HAVE_ARCH_KASAN=y
# CONFIG_KASAN is not set
CONFIG_ARCH_HAS_KCOV=y
# CONFIG_KCOV is not set
CONFIG_DEBUG_SHIRQ=y

#
# Debug Lockups and Hangs
#
# CONFIG_SOFTLOCKUP_DETECTOR is not set
CONFIG_HARDLOCKUP_CHECK_TIMESTAMP=y
# CONFIG_HARDLOCKUP_DETECTOR is not set
# CONFIG_DETECT_HUNG_TASK is not set
# CONFIG_WQ_WATCHDOG is not set
# CONFIG_PANIC_ON_OOPS is not set
CONFIG_PANIC_ON_OOPS_VALUE=0
CONFIG_PANIC_TIMEOUT=0
CONFIG_SCHED_DEBUG=y
CONFIG_SCHED_INFO=y
CONFIG_SCHEDSTATS=y
# CONFIG_SCHED_STACK_END_CHECK is not set
# CONFIG_DEBUG_TIMEKEEPING is not set

#
# Lock Debugging (spinlocks, mutexes, etc...)
#
# CONFIG_DEBUG_RT_MUTEXES is not set
# CONFIG_DEBUG_SPINLOCK is not set
# CONFIG_DEBUG_MUTEXES is not set
# CONFIG_DEBUG_WW_MUTEX_SLOWPATH is not set
# CONFIG_DEBUG_LOCK_ALLOC is not set
# CONFIG_PROVE_LOCKING is not set
# CONFIG_LOCK_STAT is not set
# CONFIG_DEBUG_ATOMIC_SLEEP is not set
# CONFIG_DEBUG_LOCKING_API_SELFTESTS is not set
# CONFIG_LOCK_TORTURE_TEST is not set
# CONFIG_WW_MUTEX_SELFTEST is not set
CONFIG_STACKTRACE=y
# CONFIG_WARN_ALL_UNSEEDED_RANDOM is not set
# CONFIG_DEBUG_KOBJECT is not set
CONFIG_DEBUG_BUGVERBOSE=y
# CONFIG_DEBUG_LIST is not set
# CONFIG_DEBUG_PI_LIST is not set
# CONFIG_DEBUG_SG is not set
# CONFIG_DEBUG_NOTIFIERS is not set
# CONFIG_DEBUG_CREDENTIALS is not set

#
# RCU Debugging
#
# CONFIG_PROVE_RCU is not set
# CONFIG_TORTURE_TEST is not set
# CONFIG_RCU_PERF_TEST is not set
# CONFIG_RCU_TORTURE_TEST is not set
CONFIG_RCU_CPU_STALL_TIMEOUT=60
# CONFIG_RCU_TRACE is not set
# CONFIG_RCU_EQS_DEBUG is not set
# CONFIG_DEBUG_WQ_FORCE_RR_CPU is not set
# CONFIG_DEBUG_BLOCK_EXT_DEVT is not set
# CONFIG_CPU_HOTPLUG_STATE_CONTROL is not set
# CONFIG_NOTIFIER_ERROR_INJECTION is not set
# CONFIG_FAULT_INJECTION is not set
CONFIG_LATENCYTOP=y
CONFIG_USER_STACKTRACE_SUPPORT=y
CONFIG_NOP_TRACER=y
CONFIG_HAVE_FUNCTION_TRACER=y
CONFIG_HAVE_FUNCTION_GRAPH_TRACER=y
CONFIG_HAVE_DYNAMIC_FTRACE=y
CONFIG_HAVE_DYNAMIC_FTRACE_WITH_REGS=y
CONFIG_HAVE_FTRACE_MCOUNT_RECORD=y
CONFIG_HAVE_SYSCALL_TRACEPOINTS=y
CONFIG_HAVE_FENTRY=y
CONFIG_HAVE_C_RECORDMCOUNT=y
CONFIG_TRACER_MAX_TRACE=y
CONFIG_TRACE_CLOCK=y
CONFIG_RING_BUFFER=y
CONFIG_EVENT_TRACING=y
CONFIG_CONTEXT_SWITCH_TRACER=y
CONFIG_TRACING=y
CONFIG_GENERIC_TRACER=y
CONFIG_TRACING_SUPPORT=y
CONFIG_FTRACE=y
CONFIG_FUNCTION_TRACER=y
CONFIG_FUNCTION_GRAPH_TRACER=y
# CONFIG_IRQSOFF_TRACER is not set
CONFIG_SCHED_TRACER=y
# CONFIG_HWLAT_TRACER is not set
CONFIG_FTRACE_SYSCALLS=y
CONFIG_TRACER_SNAPSHOT=y
# CONFIG_TRACER_SNAPSHOT_PER_CPU_SWAP is not set
CONFIG_BRANCH_PROFILE_NONE=y
# CONFIG_PROFILE_ANNOTATED_BRANCHES is not set
# CONFIG_PROFILE_ALL_BRANCHES is not set
CONFIG_STACK_TRACER=y
CONFIG_BLK_DEV_IO_TRACE=y
CONFIG_KPROBE_EVENTS=y
CONFIG_UPROBE_EVENTS=y
CONFIG_PROBE_EVENTS=y
CONFIG_DYNAMIC_FTRACE=y
CONFIG_DYNAMIC_FTRACE_WITH_REGS=y
CONFIG_FUNCTION_PROFILER=y
CONFIG_FTRACE_MCOUNT_RECORD=y
# CONFIG_FTRACE_STARTUP_TEST is not set
# CONFIG_MMIOTRACE is not set
# CONFIG_HIST_TRIGGERS is not set
# CONFIG_TRACEPOINT_BENCHMARK is not set
# CONFIG_RING_BUFFER_BENCHMARK is not set
# CONFIG_RING_BUFFER_STARTUP_TEST is not set
# CONFIG_TRACE_EVAL_MAP_FILE is not set
CONFIG_PROVIDE_OHCI1394_DMA_INIT=y
# CONFIG_DMA_API_DEBUG is not set

#
# Runtime Testing
#
# CONFIG_LKDTM is not set
# CONFIG_TEST_LIST_SORT is not set
# CONFIG_TEST_SORT is not set
# CONFIG_KPROBES_SANITY_TEST is not set
# CONFIG_BACKTRACE_SELF_TEST is not set
# CONFIG_RBTREE_TEST is not set
# CONFIG_INTERVAL_TREE_TEST is not set
# CONFIG_PERCPU_TEST is not set
CONFIG_ATOMIC64_SELFTEST=y
# CONFIG_ASYNC_RAID6_TEST is not set
# CONFIG_TEST_HEXDUMP is not set
# CONFIG_TEST_STRING_HELPERS is not set
CONFIG_TEST_KSTRTOX=y
# CONFIG_TEST_PRINTF is not set
# CONFIG_TEST_BITMAP is not set
# CONFIG_TEST_UUID is not set
# CONFIG_TEST_RHASHTABLE is not set
# CONFIG_TEST_HASH is not set
# CONFIG_TEST_LKM is not set
# CONFIG_TEST_USER_COPY is not set
# CONFIG_TEST_BPF is not set
# CONFIG_TEST_FIRMWARE is not set
# CONFIG_TEST_SYSCTL is not set
# CONFIG_TEST_UDELAY is not set
# CONFIG_TEST_STATIC_KEYS is not set
# CONFIG_TEST_KMOD is not set
# CONFIG_MEMTEST is not set
# CONFIG_BUG_ON_DATA_CORRUPTION is not set
# CONFIG_SAMPLES is not set
CONFIG_HAVE_ARCH_KGDB=y
CONFIG_KGDB=y
CONFIG_KGDB_SERIAL_CONSOLE=y
CONFIG_KGDB_TESTS=y
# CONFIG_KGDB_TESTS_ON_BOOT is not set
CONFIG_KGDB_LOW_LEVEL_TRAP=y
CONFIG_KGDB_KDB=y
CONFIG_KDB_DEFAULT_ENABLE=0x1
CONFIG_KDB_KEYBOARD=y
CONFIG_KDB_CONTINUE_CATASTROPHIC=0
CONFIG_ARCH_HAS_UBSAN_SANITIZE_ALL=y
# CONFIG_ARCH_WANTS_UBSAN_NO_NULL is not set
# CONFIG_UBSAN is not set
CONFIG_ARCH_HAS_DEVMEM_IS_ALLOWED=y
CONFIG_STRICT_DEVMEM=y
# CONFIG_IO_STRICT_DEVMEM is not set
CONFIG_EARLY_PRINTK_USB=y
# CONFIG_X86_VERBOSE_BOOTUP is not set
CONFIG_EARLY_PRINTK=y
CONFIG_EARLY_PRINTK_DBGP=y
# CONFIG_EARLY_PRINTK_EFI is not set
# CONFIG_EARLY_PRINTK_USB_XDBC is not set
CONFIG_X86_PTDUMP_CORE=y
# CONFIG_X86_PTDUMP is not set
# CONFIG_EFI_PGT_DUMP is not set
CONFIG_DEBUG_WX=y
CONFIG_DOUBLEFAULT=y
# CONFIG_DEBUG_TLBFLUSH is not set
# CONFIG_IOMMU_DEBUG is not set
# CONFIG_IOMMU_STRESS is not set
CONFIG_HAVE_MMIOTRACE_SUPPORT=y
# CONFIG_X86_DECODER_SELFTEST is not set
CONFIG_IO_DELAY_TYPE_0X80=0
CONFIG_IO_DELAY_TYPE_0XED=1
CONFIG_IO_DELAY_TYPE_UDELAY=2
CONFIG_IO_DELAY_TYPE_NONE=3
CONFIG_IO_DELAY_0X80=y
# CONFIG_IO_DELAY_0XED is not set
# CONFIG_IO_DELAY_UDELAY is not set
# CONFIG_IO_DELAY_NONE is not set
CONFIG_DEFAULT_IO_DELAY_TYPE=0
CONFIG_DEBUG_BOOT_PARAMS=y
# CONFIG_CPA_DEBUG is not set
CONFIG_OPTIMIZE_INLINING=y
# CONFIG_DEBUG_ENTRY is not set
CONFIG_DEBUG_NMI_SELFTEST=y
CONFIG_X86_DEBUG_FPU=y
# CONFIG_PUNIT_ATOM_DEBUG is not set
CONFIG_UNWINDER_ORC=y
# CONFIG_UNWINDER_FRAME_POINTER is not set

#
# Security options
#
CONFIG_KEYS=y
CONFIG_KEYS_COMPAT=y
# CONFIG_PERSISTENT_KEYRINGS is not set
# CONFIG_BIG_KEYS is not set
# CONFIG_ENCRYPTED_KEYS is not set
# CONFIG_KEY_DH_OPERATIONS is not set
# CONFIG_SECURITY_DMESG_RESTRICT is not set
CONFIG_SECURITY=y
CONFIG_SECURITY_WRITABLE_HOOKS=y
CONFIG_SECURITYFS=y
CONFIG_SECURITY_NETWORK=y
CONFIG_KAISER=y
CONFIG_SECURITY_NETWORK_XFRM=y
# CONFIG_SECURITY_PATH is not set
CONFIG_INTEL_TXT=y
CONFIG_LSM_MMAP_MIN_ADDR=65536
CONFIG_HAVE_HARDENED_USERCOPY_ALLOCATOR=y
# CONFIG_HARDENED_USERCOPY is not set
# CONFIG_FORTIFY_SOURCE is not set
# CONFIG_STATIC_USERMODEHELPER is not set
CONFIG_SECURITY_SELINUX=y
CONFIG_SECURITY_SELINUX_BOOTPARAM=y
CONFIG_SECURITY_SELINUX_BOOTPARAM_VALUE=1
CONFIG_SECURITY_SELINUX_DISABLE=y
CONFIG_SECURITY_SELINUX_DEVELOP=y
CONFIG_SECURITY_SELINUX_AVC_STATS=y
CONFIG_SECURITY_SELINUX_CHECKREQPROT_VALUE=1
# CONFIG_SECURITY_SMACK is not set
# CONFIG_SECURITY_TOMOYO is not set
# CONFIG_SECURITY_APPARMOR is not set
# CONFIG_SECURITY_LOADPIN is not set
# CONFIG_SECURITY_YAMA is not set
CONFIG_INTEGRITY=y
# CONFIG_INTEGRITY_SIGNATURE is not set
CONFIG_INTEGRITY_AUDIT=y
# CONFIG_IMA is not set
# CONFIG_EVM is not set
CONFIG_DEFAULT_SECURITY_SELINUX=y
# CONFIG_DEFAULT_SECURITY_DAC is not set
CONFIG_DEFAULT_SECURITY="selinux"
CONFIG_XOR_BLOCKS=y
CONFIG_ASYNC_CORE=y
CONFIG_ASYNC_MEMCPY=y
CONFIG_ASYNC_XOR=y
CONFIG_ASYNC_PQ=y
CONFIG_ASYNC_RAID6_RECOV=y
CONFIG_CRYPTO=y

#
# Crypto core or helper
#
CONFIG_CRYPTO_ALGAPI=y
CONFIG_CRYPTO_ALGAPI2=y
CONFIG_CRYPTO_AEAD=y
CONFIG_CRYPTO_AEAD2=y
CONFIG_CRYPTO_BLKCIPHER=y
CONFIG_CRYPTO_BLKCIPHER2=y
CONFIG_CRYPTO_HASH=y
CONFIG_CRYPTO_HASH2=y
CONFIG_CRYPTO_RNG=y
CONFIG_CRYPTO_RNG2=y
CONFIG_CRYPTO_RNG_DEFAULT=y
CONFIG_CRYPTO_AKCIPHER2=y
CONFIG_CRYPTO_KPP2=y
CONFIG_CRYPTO_ACOMP2=y
# CONFIG_CRYPTO_RSA is not set
# CONFIG_CRYPTO_DH is not set
CONFIG_CRYPTO_ECDH=y
CONFIG_CRYPTO_MANAGER=y
CONFIG_CRYPTO_MANAGER2=y
# CONFIG_CRYPTO_USER is not set
# CONFIG_CRYPTO_MANAGER_DISABLE_TESTS is not set
CONFIG_CRYPTO_GF128MUL=y
CONFIG_CRYPTO_NULL=y
CONFIG_CRYPTO_NULL2=y
# CONFIG_CRYPTO_PCRYPT is not set
CONFIG_CRYPTO_WORKQUEUE=y
CONFIG_CRYPTO_CRYPTD=y
# CONFIG_CRYPTO_MCRYPTD is not set
CONFIG_CRYPTO_AUTHENC=m
# CONFIG_CRYPTO_TEST is not set
CONFIG_CRYPTO_SIMD=y
CONFIG_CRYPTO_GLUE_HELPER_X86=y
CONFIG_CRYPTO_ENGINE=m

#
# Authenticated Encryption with Associated Data
#
# CONFIG_CRYPTO_CCM is not set
# CONFIG_CRYPTO_GCM is not set
# CONFIG_CRYPTO_CHACHA20POLY1305 is not set
CONFIG_CRYPTO_SEQIV=y
CONFIG_CRYPTO_ECHAINIV=m

#
# Block modes
#
CONFIG_CRYPTO_CBC=y
CONFIG_CRYPTO_CTR=y
# CONFIG_CRYPTO_CTS is not set
CONFIG_CRYPTO_ECB=y
CONFIG_CRYPTO_LRW=y
# CONFIG_CRYPTO_PCBC is not set
CONFIG_CRYPTO_XTS=y
# CONFIG_CRYPTO_KEYWRAP is not set

#
# Hash modes
#
CONFIG_CRYPTO_CMAC=y
CONFIG_CRYPTO_HMAC=y
# CONFIG_CRYPTO_XCBC is not set
# CONFIG_CRYPTO_VMAC is not set

#
# Digest
#
CONFIG_CRYPTO_CRC32C=y
CONFIG_CRYPTO_CRC32C_INTEL=y
# CONFIG_CRYPTO_CRC32 is not set
# CONFIG_CRYPTO_CRC32_PCLMUL is not set
CONFIG_CRYPTO_CRCT10DIF=y
# CONFIG_CRYPTO_CRCT10DIF_PCLMUL is not set
# CONFIG_CRYPTO_GHASH is not set
# CONFIG_CRYPTO_POLY1305 is not set
# CONFIG_CRYPTO_POLY1305_X86_64 is not set
# CONFIG_CRYPTO_MD4 is not set
CONFIG_CRYPTO_MD5=y
# CONFIG_CRYPTO_MICHAEL_MIC is not set
# CONFIG_CRYPTO_RMD128 is not set
# CONFIG_CRYPTO_RMD160 is not set
# CONFIG_CRYPTO_RMD256 is not set
# CONFIG_CRYPTO_RMD320 is not set
CONFIG_CRYPTO_SHA1=y
# CONFIG_CRYPTO_SHA1_SSSE3 is not set
# CONFIG_CRYPTO_SHA256_SSSE3 is not set
# CONFIG_CRYPTO_SHA512_SSSE3 is not set
# CONFIG_CRYPTO_SHA1_MB is not set
# CONFIG_CRYPTO_SHA256_MB is not set
# CONFIG_CRYPTO_SHA512_MB is not set
CONFIG_CRYPTO_SHA256=y
# CONFIG_CRYPTO_SHA512 is not set
# CONFIG_CRYPTO_SHA3 is not set
# CONFIG_CRYPTO_TGR192 is not set
# CONFIG_CRYPTO_WP512 is not set
CONFIG_CRYPTO_GHASH_CLMUL_NI_INTEL=y

#
# Ciphers
#
CONFIG_CRYPTO_AES=y
# CONFIG_CRYPTO_AES_TI is not set
CONFIG_CRYPTO_AES_X86_64=y
CONFIG_CRYPTO_AES_NI_INTEL=y
# CONFIG_CRYPTO_ANUBIS is not set
# CONFIG_CRYPTO_ARC4 is not set
# CONFIG_CRYPTO_BLOWFISH is not set
# CONFIG_CRYPTO_BLOWFISH_X86_64 is not set
# CONFIG_CRYPTO_CAMELLIA is not set
# CONFIG_CRYPTO_CAMELLIA_X86_64 is not set
# CONFIG_CRYPTO_CAMELLIA_AESNI_AVX_X86_64 is not set
# CONFIG_CRYPTO_CAMELLIA_AESNI_AVX2_X86_64 is not set
# CONFIG_CRYPTO_CAST5 is not set
# CONFIG_CRYPTO_CAST5_AVX_X86_64 is not set
# CONFIG_CRYPTO_CAST6 is not set
# CONFIG_CRYPTO_CAST6_AVX_X86_64 is not set
# CONFIG_CRYPTO_DES is not set
# CONFIG_CRYPTO_DES3_EDE_X86_64 is not set
# CONFIG_CRYPTO_FCRYPT is not set
# CONFIG_CRYPTO_KHAZAD is not set
# CONFIG_CRYPTO_SALSA20 is not set
# CONFIG_CRYPTO_SALSA20_X86_64 is not set
# CONFIG_CRYPTO_CHACHA20 is not set
# CONFIG_CRYPTO_CHACHA20_X86_64 is not set
# CONFIG_CRYPTO_SEED is not set
# CONFIG_CRYPTO_SERPENT is not set
# CONFIG_CRYPTO_SERPENT_SSE2_X86_64 is not set
# CONFIG_CRYPTO_SERPENT_AVX_X86_64 is not set
# CONFIG_CRYPTO_SERPENT_AVX2_X86_64 is not set
# CONFIG_CRYPTO_TEA is not set
# CONFIG_CRYPTO_TWOFISH is not set
# CONFIG_CRYPTO_TWOFISH_X86_64 is not set
# CONFIG_CRYPTO_TWOFISH_X86_64_3WAY is not set
# CONFIG_CRYPTO_TWOFISH_AVX_X86_64 is not set

#
# Compression
#
# CONFIG_CRYPTO_DEFLATE is not set
# CONFIG_CRYPTO_LZO is not set
# CONFIG_CRYPTO_842 is not set
# CONFIG_CRYPTO_LZ4 is not set
# CONFIG_CRYPTO_LZ4HC is not set

#
# Random Number Generation
#
# CONFIG_CRYPTO_ANSI_CPRNG is not set
CONFIG_CRYPTO_DRBG_MENU=y
CONFIG_CRYPTO_DRBG_HMAC=y
# CONFIG_CRYPTO_DRBG_HASH is not set
# CONFIG_CRYPTO_DRBG_CTR is not set
CONFIG_CRYPTO_DRBG=y
CONFIG_CRYPTO_JITTERENTROPY=y
CONFIG_CRYPTO_USER_API=y
CONFIG_CRYPTO_USER_API_HASH=y
CONFIG_CRYPTO_USER_API_SKCIPHER=y
# CONFIG_CRYPTO_USER_API_RNG is not set
# CONFIG_CRYPTO_USER_API_AEAD is not set
CONFIG_CRYPTO_HW=y
# CONFIG_CRYPTO_DEV_PADLOCK is not set
# CONFIG_CRYPTO_DEV_FSL_CAAM_CRYPTO_API_DESC is not set
# CONFIG_CRYPTO_DEV_CCP is not set
# CONFIG_CRYPTO_DEV_QAT_DH895xCC is not set
# CONFIG_CRYPTO_DEV_QAT_C3XXX is not set
# CONFIG_CRYPTO_DEV_QAT_C62X is not set
# CONFIG_CRYPTO_DEV_QAT_DH895xCCVF is not set
# CONFIG_CRYPTO_DEV_QAT_C3XXXVF is not set
# CONFIG_CRYPTO_DEV_QAT_C62XVF is not set
# CONFIG_CRYPTO_DEV_NITROX_CNN55XX is not set
CONFIG_CRYPTO_DEV_VIRTIO=m
# CONFIG_ASYMMETRIC_KEY_TYPE is not set

#
# Certificates for signature checking
#
# CONFIG_SYSTEM_BLACKLIST_KEYRING is not set
CONFIG_HAVE_KVM=y
CONFIG_HAVE_KVM_IRQCHIP=y
CONFIG_HAVE_KVM_IRQFD=y
CONFIG_HAVE_KVM_IRQ_ROUTING=y
CONFIG_HAVE_KVM_EVENTFD=y
CONFIG_KVM_MMIO=y
CONFIG_KVM_ASYNC_PF=y
CONFIG_HAVE_KVM_MSI=y
CONFIG_HAVE_KVM_CPU_RELAX_INTERCEPT=y
CONFIG_KVM_VFIO=y
CONFIG_KVM_GENERIC_DIRTYLOG_READ_PROTECT=y
CONFIG_KVM_COMPAT=y
CONFIG_HAVE_KVM_IRQ_BYPASS=y
CONFIG_VIRTUALIZATION=y
CONFIG_KVM=y
CONFIG_KVM_INTEL=y
CONFIG_KVM_AMD=y
CONFIG_KVM_MMU_AUDIT=y
# CONFIG_VHOST_NET is not set
# CONFIG_VHOST_CROSS_ENDIAN_LEGACY is not set
CONFIG_BINARY_PRINTF=y

#
# Library routines
#
CONFIG_RAID6_PQ=y
CONFIG_BITREVERSE=y
# CONFIG_HAVE_ARCH_BITREVERSE is not set
CONFIG_RATIONAL=y
CONFIG_GENERIC_STRNCPY_FROM_USER=y
CONFIG_GENERIC_STRNLEN_USER=y
CONFIG_GENERIC_NET_UTILS=y
CONFIG_GENERIC_FIND_FIRST_BIT=y
CONFIG_GENERIC_PCI_IOMAP=y
CONFIG_GENERIC_IOMAP=y
CONFIG_GENERIC_IO=y
CONFIG_ARCH_USE_CMPXCHG_LOCKREF=y
CONFIG_ARCH_HAS_FAST_MULTIPLIER=y
CONFIG_CRC_CCITT=y
CONFIG_CRC16=y
CONFIG_CRC_T10DIF=y
# CONFIG_CRC_ITU_T is not set
CONFIG_CRC32=y
# CONFIG_CRC32_SELFTEST is not set
CONFIG_CRC32_SLICEBY8=y
# CONFIG_CRC32_SLICEBY4 is not set
# CONFIG_CRC32_SARWATE is not set
# CONFIG_CRC32_BIT is not set
# CONFIG_CRC4 is not set
# CONFIG_CRC7 is not set
CONFIG_LIBCRC32C=y
# CONFIG_CRC8 is not set
# CONFIG_AUDIT_ARCH_COMPAT_GENERIC is not set
# CONFIG_RANDOM32_SELFTEST is not set
CONFIG_ZLIB_INFLATE=y
CONFIG_ZLIB_DEFLATE=y
CONFIG_LZO_COMPRESS=y
CONFIG_LZO_DECOMPRESS=y
CONFIG_LZ4_DECOMPRESS=y
CONFIG_XZ_DEC=y
CONFIG_XZ_DEC_X86=y
CONFIG_XZ_DEC_POWERPC=y
CONFIG_XZ_DEC_IA64=y
CONFIG_XZ_DEC_ARM=y
CONFIG_XZ_DEC_ARMTHUMB=y
CONFIG_XZ_DEC_SPARC=y
CONFIG_XZ_DEC_BCJ=y
# CONFIG_XZ_DEC_TEST is not set
CONFIG_DECOMPRESS_GZIP=y
CONFIG_DECOMPRESS_BZIP2=y
CONFIG_DECOMPRESS_LZMA=y
CONFIG_DECOMPRESS_XZ=y
CONFIG_DECOMPRESS_LZO=y
CONFIG_DECOMPRESS_LZ4=y
CONFIG_GENERIC_ALLOCATOR=y
CONFIG_RADIX_TREE_MULTIORDER=y
CONFIG_ASSOCIATIVE_ARRAY=y
CONFIG_HAS_IOMEM=y
CONFIG_HAS_IOPORT_MAP=y
CONFIG_HAS_DMA=y
# CONFIG_DMA_NOOP_OPS is not set
# CONFIG_DMA_VIRT_OPS is not set
CONFIG_CPU_RMAP=y
CONFIG_DQL=y
CONFIG_GLOB=y
# CONFIG_GLOB_SELFTEST is not set
CONFIG_NLATTR=y
# CONFIG_CORDIC is not set
# CONFIG_DDR is not set
# CONFIG_IRQ_POLL is not set
CONFIG_UCS2_STRING=y
# CONFIG_SG_SPLIT is not set
CONFIG_SG_POOL=y
CONFIG_ARCH_HAS_SG_CHAIN=y
CONFIG_ARCH_HAS_PMEM_API=y
CONFIG_ARCH_HAS_UACCESS_FLUSHCACHE=y
CONFIG_SBITMAP=y
# CONFIG_STRING_SELFTEST is not set

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [crash] PANIC: double fault, error_code: 0x0
  2017-11-24 20:22 ` [crash] PANIC: double fault, error_code: 0x0 Ingo Molnar
  2017-11-24 20:59   ` Andy Lutomirski
@ 2017-11-24 22:09   ` Ingo Molnar
  2017-11-24 22:35     ` Andy Lutomirski
  2017-11-25  4:09   ` [crash] PANIC: double fault, error_code: 0x0 Dave Hansen
  2 siblings, 1 reply; 113+ messages in thread
From: Ingo Molnar @ 2017-11-24 22:09 UTC (permalink / raw)
  To: linux-kernel, Andy Lutomirski, Dave Hansen
  Cc: Andy Lutomirski, Thomas Gleixner, H . Peter Anvin,
	Peter Zijlstra, Borislav Petkov, Linus Torvalds


* Ingo Molnar <mingo@kernel.org> wrote:

> 
> * Ingo Molnar <mingo@kernel.org> wrote:
> 
> > This is a repost of the latest entry-stack plus Kaiser bits from Andy Lutomirski
> > (v3 series from today) and Dave Hansen (kaiser-414-tipwip-20171123 version),
> > on top of latest tip:x86/urgent (12a78d43de76).
> > 
> > This version is pretty well tested, at least on the usual x86 tree test systems.
> > It has a couple of merge mistakes fixed, the biggest difference is in patch #22:
> > 
> >   x86/mm/kaiser: Prepare assembly for entry/exit CR3 switching
> > 
> > The other patches are identical or very close to what I posted earlier today.
> 
> Here's a new bug, on a testsystem I get the double fault boot crash attached 
> below. The same bzImage crashes on other systems as well, so it's not CPU 
> dependent.
> 
> Via Kconfig-bisection I have narrowed it down to the following .config detail: 
> it's triggered by _disabling_ CONFIG_DEBUG_ENTRY and enabling CONFIG_KAISER=y.
> 
> I.e. one of the sanity checks of CONFIG_DEBUG_ENTRY has some positive side effect. 
> I'll try to track down which one it is - any ideas meanwhile?
> 
> Thanks,
> 
> 	Ingo
> 
> [    8.797733] calling  pt_dump_init+0x0/0x3b @ 1
> [    8.803144] initcall pt_dump_init+0x0/0x3b returned 0 after 1 usecs
> [    8.810589] calling  aes_init+0x0/0x11 @ 1
> [    8.815757] initcall aes_init+0x0/0x11 returned 0 after 141 usecs
> [    8.823020] calling  ghash_pclmulqdqni_mod_init+0x0/0x54 @ 1
> [    8.831002] PANIC: double fault, error_code: 0x0
> [    8.831002] CPU: 11 PID: 260 Comm: modprobe Not tainted 4.14.0-01419-g1b46550a680d-dirty #17
> [    8.831002] Hardware name: Intel Corporation S2600GZ/S2600GZ, BIOS SE5C600.86B.02.02.0002.122320131210 12/23/2013
> [    8.831002] task: ffff880828ba8000 task.stack: ffffc90004444000
> [    8.831002] RIP: 0010:page_fault+0x11/0x60
> [    8.831002] RSP: 0000:ffffffffff0e7fc8 EFLAGS: 00010046
> [    8.831002] RAX: 00000000819d4d77 RBX: 0000000000000001 RCX: ffffffff819d4d77

After much more debugging, the patch below 'fixes' the crash as well, when 
CONFIG_DEBUG_ENTRY is disabled.

Note that if *any* of those 4 padding sequences is removed, the kernel starts 
crashing again. Also note that the exact size of the padding appears to be not 
material - it could be larger as well, i.e. it's not an alignment bug I think.

In any case it's not a problem in the actual assembly code paths itself it 
appears.

One guess would be tha it's some sort of sizing bug: maybe the padding forces a 
key piece of data or code on another page - but I'm too tired to root cause it 
right now.

Any ideas?

Thanks,

	Ingo
---
 arch/x86/entry/entry_64.S | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 4ac952080869..e83029892017 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -547,6 +547,8 @@ END(irq_entries_start)
 	ud2
 .Lokay_\@:
 	addq $8, %rsp
+#else
+	.rep 16; nop; .endr
 #endif
 .endm
 
@@ -597,6 +599,8 @@ END(irq_entries_start)
 	je	.Lirq_stack_okay\@
 	ud2
 	.Lirq_stack_okay\@:
+#else
+	.rep 16; nop; .endr
 #endif
 
 .Lirq_stack_push_old_rsp_\@:
@@ -707,6 +711,8 @@ GLOBAL(swapgs_restore_regs_and_return_to_usermode)
 	jnz	1f
 	ud2
 1:
+#else
+	.rep 16; nop; .endr
 #endif
 	POP_EXTRA_REGS
 	popq	%r11
@@ -773,6 +779,8 @@ GLOBAL(restore_regs_and_return_to_kernel)
 	jz	1f
 	ud2
 1:
+#else
+	.rep 16; nop; .endr
 #endif
 	POP_EXTRA_REGS
 	POP_C_REGS

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [crash] PANIC: double fault, error_code: 0x0
  2017-11-24 22:09   ` Ingo Molnar
@ 2017-11-24 22:35     ` Andy Lutomirski
  2017-11-24 22:53       ` Ingo Molnar
  0 siblings, 1 reply; 113+ messages in thread
From: Andy Lutomirski @ 2017-11-24 22:35 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Andy Lutomirski, Dave Hansen, Thomas Gleixner,
	H . Peter Anvin, Peter Zijlstra, Borislav Petkov, Linus Torvalds



> On Nov 24, 2017, at 3:09 PM, Ingo Molnar <mingo@kernel.org> wrote:
> 
> 
> * Ingo Molnar <mingo@kernel.org> wrote:
> 
>> 
>> * Ingo Molnar <mingo@kernel.org> wrote:
>> 
>>> This is a repost of the latest entry-stack plus Kaiser bits from Andy Lutomirski
>>> (v3 series from today) and Dave Hansen (kaiser-414-tipwip-20171123 version),
>>> on top of latest tip:x86/urgent (12a78d43de76).
>>> 
>>> This version is pretty well tested, at least on the usual x86 tree test systems.
>>> It has a couple of merge mistakes fixed, the biggest difference is in patch #22:
>>> 
>>>  x86/mm/kaiser: Prepare assembly for entry/exit CR3 switching
>>> 
>>> The other patches are identical or very close to what I posted earlier today.
>> 
>> Here's a new bug, on a testsystem I get the double fault boot crash attached 
>> below. The same bzImage crashes on other systems as well, so it's not CPU 
>> dependent.
>> 
>> Via Kconfig-bisection I have narrowed it down to the following .config detail: 
>> it's triggered by _disabling_ CONFIG_DEBUG_ENTRY and enabling CONFIG_KAISER=y.
>> 
>> I.e. one of the sanity checks of CONFIG_DEBUG_ENTRY has some positive side effect. 
>> I'll try to track down which one it is - any ideas meanwhile?
>> 
>> Thanks,
>> 
>>    Ingo
>> 
>> [    8.797733] calling  pt_dump_init+0x0/0x3b @ 1
>> [    8.803144] initcall pt_dump_init+0x0/0x3b returned 0 after 1 usecs
>> [    8.810589] calling  aes_init+0x0/0x11 @ 1
>> [    8.815757] initcall aes_init+0x0/0x11 returned 0 after 141 usecs
>> [    8.823020] calling  ghash_pclmulqdqni_mod_init+0x0/0x54 @ 1
>> [    8.831002] PANIC: double fault, error_code: 0x0
>> [    8.831002] CPU: 11 PID: 260 Comm: modprobe Not tainted 4.14.0-01419-g1b46550a680d-dirty #17
>> [    8.831002] Hardware name: Intel Corporation S2600GZ/S2600GZ, BIOS SE5C600.86B.02.02.0002.122320131210 12/23/2013
>> [    8.831002] task: ffff880828ba8000 task.stack: ffffc90004444000
>> [    8.831002] RIP: 0010:page_fault+0x11/0x60
>> [    8.831002] RSP: 0000:ffffffffff0e7fc8 EFLAGS: 00010046
>> [    8.831002] RAX: 00000000819d4d77 RBX: 0000000000000001 RCX: ffffffff819d4d77
> 
> After much more debugging, the patch below 'fixes' the crash as well, when 
> CONFIG_DEBUG_ENTRY is disabled.
> 
> Note that if *any* of those 4 padding sequences is removed, the kernel starts 
> crashing again. Also note that the exact size of the padding appears to be not 
> material - it could be larger as well, i.e. it's not an alignment bug I think.
> 
> In any case it's not a problem in the actual assembly code paths itself it 
> appears.
> 
> One guess would be tha it's some sort of sizing bug: maybe the padding forces a 
> key piece of data or code on another page - but I'm too tired to root cause it 
> right now.
> 
> Any ideas?

This smells like a pagerable setup bug. Either the pagetables are a bit broken or they're totally busted and the passing gets something in a more TLB-friendly place.

> 
> Thanks,
> 
>    Ingo
> ---
> arch/x86/entry/entry_64.S | 8 ++++++++
> 1 file changed, 8 insertions(+)
> 
> diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
> index 4ac952080869..e83029892017 100644
> --- a/arch/x86/entry/entry_64.S
> +++ b/arch/x86/entry/entry_64.S
> @@ -547,6 +547,8 @@ END(irq_entries_start)
>    ud2
> .Lokay_\@:
>    addq $8, %rsp
> +#else
> +    .rep 16; nop; .endr
> #endif
> .endm
> 
> @@ -597,6 +599,8 @@ END(irq_entries_start)
>    je    .Lirq_stack_okay\@
>    ud2
>    .Lirq_stack_okay\@:
> +#else
> +    .rep 16; nop; .endr
> #endif
> 
> .Lirq_stack_push_old_rsp_\@:
> @@ -707,6 +711,8 @@ GLOBAL(swapgs_restore_regs_and_return_to_usermode)
>    jnz    1f
>    ud2
> 1:
> +#else
> +    .rep 16; nop; .endr
> #endif
>    POP_EXTRA_REGS
>    popq    %r11
> @@ -773,6 +779,8 @@ GLOBAL(restore_regs_and_return_to_kernel)
>    jz    1f
>    ud2
> 1:
> +#else
> +    .rep 16; nop; .endr
> #endif
>    POP_EXTRA_REGS
>    POP_C_REGS

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [crash] PANIC: double fault, error_code: 0x0
  2017-11-24 22:35     ` Andy Lutomirski
@ 2017-11-24 22:53       ` Ingo Molnar
  2017-11-25  9:21         ` Ingo Molnar
  0 siblings, 1 reply; 113+ messages in thread
From: Ingo Molnar @ 2017-11-24 22:53 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: linux-kernel, Andy Lutomirski, Dave Hansen, Thomas Gleixner,
	H . Peter Anvin, Peter Zijlstra, Borislav Petkov, Linus Torvalds


* Andy Lutomirski <luto@amacapital.net> wrote:

> > Note that if *any* of those 4 padding sequences is removed, the kernel starts 
> > crashing again. Also note that the exact size of the padding appears to be not 
> > material - it could be larger as well, i.e. it's not an alignment bug I think.
> > 
> > In any case it's not a problem in the actual assembly code paths itself it 
> > appears.
> > 
> > One guess would be tha it's some sort of sizing bug: maybe the padding forces a 
> > key piece of data or code on another page - but I'm too tired to root cause it 
> > right now.
> > 
> > Any ideas?
> 
> This smells like a pagerable setup bug. Either the pagetables are a bit broken or they're totally busted and the passing gets something in a more TLB-friendly place.

Also note that the delta patch below also keeps it working, i.e. doubling the 
first padding and eliminating the second padding.

I.e. it's the total per IRQ entry padding that matters, not the exact placement of 
the padding.

I.e. some sort of sizing bug - IDT and/or the pagetables.

(Also note that in my config NR_CPUS is at 128 - defconfigs are 64.)

Thanks,

	Ingo

---
 arch/x86/entry/entry_64.S |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

Index: linux/arch/x86/entry/entry_64.S
===================================================================
--- linux.orig/arch/x86/entry/entry_64.S
+++ linux/arch/x86/entry/entry_64.S
@@ -548,7 +548,7 @@ END(irq_entries_start)
 .Lokay_\@:
 	addq $8, %rsp
 #else
-	.rep 16; nop; .endr
+	.rep 32; nop; .endr
 #endif
 .endm
 
@@ -600,7 +600,7 @@ END(irq_entries_start)
 	ud2
 	.Lirq_stack_okay\@:
 #else
-	.rep 16; nop; .endr
+//	.rep 16; nop; .endr
 #endif
 
 .Lirq_stack_push_old_rsp_\@:

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 22/43] x86/mm/kaiser: Prepare assembly for entry/exit CR3 switching
  2017-11-24 17:23 ` [PATCH 22/43] x86/mm/kaiser: Prepare assembly for entry/exit CR3 switching Ingo Molnar
@ 2017-11-25  0:02   ` Thomas Gleixner
  2017-11-25 12:41     ` Thomas Gleixner
  2017-11-26 11:50   ` Borislav Petkov
  1 sibling, 1 reply; 113+ messages in thread
From: Thomas Gleixner @ 2017-11-25  0:02 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Dave Hansen, Andy Lutomirski, H . Peter Anvin,
	Peter Zijlstra, Borislav Petkov, Linus Torvalds

On Fri, 24 Nov 2017, Ingo Molnar wrote:
> @@ -1288,6 +1308,8 @@ ENTRY(error_entry)
>  	 * from user mode due to an IRET fault.
>  	 */
>  	SWAPGS
> +	/* We have user CR3.  Change to kernel CR3. */
> +	SWITCH_TO_KERNEL_CR3 scratch_reg=%rax
>  
>  .Lerror_entry_from_usermode_after_swapgs:
>  	/* Put us onto the real thread stack. */
> @@ -1333,6 +1355,7 @@ ENTRY(error_entry)
>  	 * gsbase and proceed.  We'll fix up the exception and land in
>  	 * .Lgs_change's error handler with kernel gsbase.
>  	 */
> +	SWITCH_TO_KERNEL_CR3 scratch_reg=%rax
>  	SWAPGS

This is wrong. SWAPGS needs to come first.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [crash] PANIC: double fault, error_code: 0x0
  2017-11-24 20:22 ` [crash] PANIC: double fault, error_code: 0x0 Ingo Molnar
  2017-11-24 20:59   ` Andy Lutomirski
  2017-11-24 22:09   ` Ingo Molnar
@ 2017-11-25  4:09   ` Dave Hansen
  2017-11-25  4:15     ` Dave Hansen
  2 siblings, 1 reply; 113+ messages in thread
From: Dave Hansen @ 2017-11-25  4:09 UTC (permalink / raw)
  To: Ingo Molnar, linux-kernel, Andy Lutomirski
  Cc: Andy Lutomirski, Thomas Gleixner, H . Peter Anvin,
	Peter Zijlstra, Borislav Petkov, Linus Torvalds

On 11/24/2017 12:22 PM, Ingo Molnar wrote:
> [    8.831002] RIP: 0010:page_fault+0x11/0x60
> [    8.831002] RSP: 0000:ffffffffff0e7fc8 EFLAGS: 00010046
> [    8.831002] RAX: 00000000819d4d77 RBX: 0000000000000001 RCX: ffffffff819d4d77
> [    8.831002] RDX: 0000000000000003 RSI: 0000000000000010 RDI: ffffffffff0e8078
> [    8.831002] RBP: 0000000000000000 R08: 00007ffd7f1aa530 R09: 00007f9407f98400
> [    8.831002] R10: 0000000000000007 R11: 0000000000000000 R12: 00007ffd7f1aa680
> [    8.831002] R13: 00007f9407f91f80 R14: 0000000000000007 R15: 0000000000000000
> [    8.831002]  ? native_iret+0x7/0x7
> [    8.831002] WARNING: can't dereference iret registers at ffffffffff0e8048 for ip page_fault+0x11/0x60
> [    8.831002]  </#DF>
> [    8.831002]  <SYSENTER>
> [    8.831002]  ? __do_page_fault+0x4c0/0x4c0
> [    8.831002]  ? page_fault+0x2c/0x60
> [    8.831002]  ? native_iret+0x7/0x7
> [    8.831002]  ? __do_page_fault+0x4c0/0x4c0
> [    8.831002]  ? page_fault+0x2c/0x60
> [    8.831002]  ? __entry_text_end+0x1/0x1
> [    8.831002]  </SYSENTER>

Just a stab in the dark from looking at this for a few seconds...

Doesn't this stack trace mean we started C-code page fault handing on
the sysenter stack?  Seems like we missed a switch to the process stack.

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [crash] PANIC: double fault, error_code: 0x0
  2017-11-25  4:09   ` [crash] PANIC: double fault, error_code: 0x0 Dave Hansen
@ 2017-11-25  4:15     ` Dave Hansen
  0 siblings, 0 replies; 113+ messages in thread
From: Dave Hansen @ 2017-11-25  4:15 UTC (permalink / raw)
  To: Dave Hansen, Ingo Molnar, linux-kernel, Andy Lutomirski
  Cc: Andy Lutomirski, Thomas Gleixner, H . Peter Anvin,
	Peter Zijlstra, Borislav Petkov, Linus Torvalds

On 11/24/2017 08:09 PM, Dave Hansen wrote:
> Doesn't this stack trace mean we started C-code page fault handing on
> the sysenter stack?  Seems like we missed a switch to the process stack.
... and probably a KAISER switch to the kernel CR3.

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [crash] PANIC: double fault, error_code: 0x0
  2017-11-24 22:53       ` Ingo Molnar
@ 2017-11-25  9:21         ` Ingo Molnar
  2017-11-25  9:32           ` Ingo Molnar
  0 siblings, 1 reply; 113+ messages in thread
From: Ingo Molnar @ 2017-11-25  9:21 UTC (permalink / raw)
  To: Andy Lutomirski, Dave Hansen
  Cc: linux-kernel, Andy Lutomirski, Dave Hansen, Thomas Gleixner,
	H . Peter Anvin, Peter Zijlstra, Borislav Petkov, Linus Torvalds


* Ingo Molnar <mingo@kernel.org> wrote:

> 
> * Andy Lutomirski <luto@amacapital.net> wrote:
> 
> > > Note that if *any* of those 4 padding sequences is removed, the kernel starts 
> > > crashing again. Also note that the exact size of the padding appears to be not 
> > > material - it could be larger as well, i.e. it's not an alignment bug I think.
> > > 
> > > In any case it's not a problem in the actual assembly code paths itself it 
> > > appears.
> > > 
> > > One guess would be tha it's some sort of sizing bug: maybe the padding forces a 
> > > key piece of data or code on another page - but I'm too tired to root cause it 
> > > right now.
> > > 
> > > Any ideas?
> > 
> > This smells like a pagerable setup bug. Either the pagetables are a bit broken or they're totally busted and the passing gets something in a more TLB-friendly place.
> 
> Also note that the delta patch below also keeps it working, i.e. doubling the 
> first padding and eliminating the second padding.
> 
> I.e. it's the total per IRQ entry padding that matters, not the exact placement of 
> the padding.
> 
> I.e. some sort of sizing bug - IDT and/or the pagetables.
> 
> (Also note that in my config NR_CPUS is at 128 - defconfigs are 64.)

The simplest padding I found is the one below - this indicates some sort of 
section sizing or page table setup bug (or page alignment bug) and makes races and 
other bugs less likely.

Thanks,

	Ingo

=================>
 arch/x86/entry/entry_64.S | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 4ac952080869..ea992ca4e74f 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -547,6 +547,8 @@ END(irq_entries_start)
 	ud2
 .Lokay_\@:
 	addq $8, %rsp
+#else
+	.rep 64; nop; .endr
 #endif
 .endm
 

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [crash] PANIC: double fault, error_code: 0x0
  2017-11-25  9:21         ` Ingo Molnar
@ 2017-11-25  9:32           ` Ingo Molnar
  2017-11-25  9:39             ` Ingo Molnar
  0 siblings, 1 reply; 113+ messages in thread
From: Ingo Molnar @ 2017-11-25  9:32 UTC (permalink / raw)
  To: Andy Lutomirski, Dave Hansen, Josh Poimboeuf
  Cc: linux-kernel, Andy Lutomirski, Thomas Gleixner, H . Peter Anvin,
	Peter Zijlstra, Borislav Petkov, Linus Torvalds


* Ingo Molnar <mingo@kernel.org> wrote:

> > (Also note that in my config NR_CPUS is at 128 - defconfigs are 64.)
> 
> The simplest padding I found is the one below - this indicates some sort of 
> section sizing or page table setup bug (or page alignment bug) and makes races and 
> other bugs less likely.
> 
> diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
> index 4ac952080869..ea992ca4e74f 100644
> --- a/arch/x86/entry/entry_64.S
> +++ b/arch/x86/entry/entry_64.S
> @@ -547,6 +547,8 @@ END(irq_entries_start)
>  	ud2
>  .Lokay_\@:
>  	addq $8, %rsp
> +#else
> +	.rep 64; nop; .endr

Also note that turning off CONFIG_UNWINDER_ORC also solves the crash. I did that 
in an attempt to get a different backtrace.

So it's either unwinder related, or seemingly minor changes to code 
alignment/placement will make the bug go away.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [crash] PANIC: double fault, error_code: 0x0
  2017-11-25  9:32           ` Ingo Molnar
@ 2017-11-25  9:39             ` Ingo Molnar
  2017-11-25 11:17               ` [PATCH] x86/mm/kaiser: Fix IRQ entries text section mapping Ingo Molnar
  0 siblings, 1 reply; 113+ messages in thread
From: Ingo Molnar @ 2017-11-25  9:39 UTC (permalink / raw)
  To: Andy Lutomirski, Dave Hansen, Josh Poimboeuf
  Cc: linux-kernel, Andy Lutomirski, Thomas Gleixner, H . Peter Anvin,
	Peter Zijlstra, Borislav Petkov, Linus Torvalds


* Ingo Molnar <mingo@kernel.org> wrote:

> > diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
> > index 4ac952080869..ea992ca4e74f 100644
> > --- a/arch/x86/entry/entry_64.S
> > +++ b/arch/x86/entry/entry_64.S
> > @@ -547,6 +547,8 @@ END(irq_entries_start)
> >  	ud2
> >  .Lokay_\@:
> >  	addq $8, %rsp
> > +#else
> > +	.rep 64; nop; .endr
> 
> Also note that turning off CONFIG_UNWINDER_ORC also solves the crash. I did that 
> in an attempt to get a different backtrace.
> 
> So it's either unwinder related, or seemingly minor changes to code 
> alignment/placement will make the bug go away.

Ok, I think the Orc unwinder is innocent: I just forced a build with frame 
pointers but with ORC debuginfo and unwinder, and that is booting fine too.

So it's the specific code size and alignment present in the config I sent that is 
triggering the bug. Fudging that alignment/sizing with the workaround patch above 
makes the crash go away.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 113+ messages in thread

* [PATCH] x86/mm/kaiser: Fix IRQ entries text section mapping
  2017-11-25  9:39             ` Ingo Molnar
@ 2017-11-25 11:17               ` Ingo Molnar
  2017-11-25 16:08                 ` Thomas Gleixner
  0 siblings, 1 reply; 113+ messages in thread
From: Ingo Molnar @ 2017-11-25 11:17 UTC (permalink / raw)
  To: Andy Lutomirski, Dave Hansen, Josh Poimboeuf
  Cc: linux-kernel, Andy Lutomirski, Thomas Gleixner, H . Peter Anvin,
	Peter Zijlstra, Borislav Petkov, Linus Torvalds


* Ingo Molnar <mingo@kernel.org> wrote:

> > So it's either unwinder related, or seemingly minor changes to code 
> > alignment/placement will make the bug go away.
> 
> Ok, I think the Orc unwinder is innocent: I just forced a build with frame 
> pointers but with ORC debuginfo and unwinder, and that is booting fine too.
> 
> So it's the specific code size and alignment present in the config I sent that is 
> triggering the bug. Fudging that alignment/sizing with the workaround patch above 
> makes the crash go away.

Ok, after a few hours of debugging with Thomas today we found the bug: it was 
caused by Kaiser not mapping the __irqentry_text_start...__irqentry_text_end 
kernel virtual memory range, which the IRQ entry code requires before it switches 
to the kernel page tables.

This bug was hidden in most configs by virtue of:

 	kaiser_add_user_map_ptrs_early(__entry_text_start, __entry_text_end,
 				       __PAGE_KERNEL_RX | _PAGE_GLOBAL);

because the __irqentry_text_start...__irqentry_text_end kernel code section comes 
right after this section and is pretty small - so it usually fits into the last 
page of the mapping.

But for certain configs the __entry_text_start address crossed the next page 
boundary, and was not mapped - resulting in a page fault and then in a double 
fault as the entry stack overflowed.

The patch below fixes the crash, and after some testing I'll backmerge it to the 
core Kaiser patch in tip:WIP.x86/mm.

There's a few takeaways from this incident though:

 - Entry stack overflow debugging is pretty crude and should probably be improved.

 - We should better isolate all Kaiser-mapped sections and pad them out
   to page boundary. This would immediately trigger any simila false-sharing 
   side-effect mapping bugs and they wouldn't be .config dependent Heisenbugs.

Thanks,

	Ingo

=============>
>From 19bbacd5f7c66b4716716bad97848a9cacd6fe68 Mon Sep 17 00:00:00 2001
From: Ingo Molnar <mingo@kernel.org>
Date: Sat, 25 Nov 2017 12:10:38 +0100
Subject: [PATCH] x86/mm/kaiser: Fix IRQ entries text section mapping

Backmerge to:

  f9e7e52fa076: x86/mm/kaiser: Unmap kernel from userspace page tables (core patch)

Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/mm/kaiser.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/x86/mm/kaiser.c b/arch/x86/mm/kaiser.c
index 1eb27b410556..ea8887bf550a 100644
--- a/arch/x86/mm/kaiser.c
+++ b/arch/x86/mm/kaiser.c
@@ -412,6 +412,8 @@ void __init kaiser_init(void)
 
 	kaiser_add_user_map_ptrs_early(__entry_text_start, __entry_text_end,
 				       __PAGE_KERNEL_RX | _PAGE_GLOBAL);
+	kaiser_add_user_map_ptrs_early(__irqentry_text_start, __irqentry_text_end,
+				       __PAGE_KERNEL_RX | _PAGE_GLOBAL);
 
 	/* the fixed map address of the idt_table */
 	kaiser_add_user_map_early((void *)idt_descr.address,

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 15/43] x86/entry/64: Create a percpu SYSCALL entry trampoline
  2017-11-24 17:23 ` [PATCH 15/43] x86/entry/64: Create a percpu SYSCALL entry trampoline Ingo Molnar
@ 2017-11-25 11:40   ` Borislav Petkov
  2017-11-25 15:00     ` Andy Lutomirski
  0 siblings, 1 reply; 113+ messages in thread
From: Borislav Petkov @ 2017-11-25 11:40 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Dave Hansen, Andy Lutomirski, Thomas Gleixner,
	H . Peter Anvin, Peter Zijlstra, Linus Torvalds

On Fri, Nov 24, 2017 at 06:23:43PM +0100, Ingo Molnar wrote:
> From: Andy Lutomirski <luto@kernel.org>
> 
> Handling SYSCALL is tricky: the SYSCALL handler is entered with every
> single register (except FLAGS), including RSP, live.  It somehow needs
> to set RSP to point to a valid stack, which means it needs to save the
> user RSP somewhere and find its own stack pointer.  The canonical way
> to do this is with SWAPGS, which lets us access percpu data using the
> %gs prefix.
> 
> With KAISER-like pagetable switching, this is problematic.  Without a
> scratch register, switching CR3 is impossible, so %gs-based percpu
> memory would need to be mapped in the user pagetables.  Doing that
> without information leaks is difficult or impossible.
> 
> Instead, use a different sneaky trick.  Map a copy of the first part
> of the SYSCALL asm at a different address for each CPU.  Now RIP
> varies depending on the CPU, so we can use RIP-relative memory access
> to access percpu memory.  By putting the relevant information (one
> scratch slot and the stack address) at a constant offset relative to
> RIP, we can make SYSCALL work without relying on %gs.
> 
> A nice thing about this approach is that we can easily switch it on
> and off if we want pagetable switching to be configurable.
> 
> The compat variant of SYSCALL doesn't have this problem in the first
> place -- there are plenty of scratch registers, since we don't care
> about preserving r8-r15.  This patch therefore doesn't touch SYSCALL32
> at all.
> 
> XXX: Whenever we settle how KAISER gets turned on and off, we should do
> the same to this.
> 
> Signed-off-by: Andy Lutomirski <luto@kernel.org>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> Cc: Borislav Petkov <bpetkov@suse.de>
> Cc: Brian Gerst <brgerst@gmail.com>
> Cc: Dave Hansen <dave.hansen@intel.com>
> Cc: Josh Poimboeuf <jpoimboe@redhat.com>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Link: https://lkml.kernel.org/r/b95ccae0a5a2f090c901e49fce7c9e8ff6acd40d.1511497875.git.luto@kernel.org
> Signed-off-by: Ingo Molnar <mingo@kernel.org>
> ---
>  arch/x86/entry/entry_64.S     | 48 +++++++++++++++++++++++++++++++++++++++++++
>  arch/x86/include/asm/fixmap.h |  2 ++
>  arch/x86/kernel/asm-offsets.c |  1 +
>  arch/x86/kernel/cpu/common.c  | 12 ++++++++++-
>  arch/x86/kernel/vmlinux.lds.S | 10 +++++++++
>  5 files changed, 72 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
> index 426b8c669d6a..0cde243b7542 100644
> --- a/arch/x86/entry/entry_64.S
> +++ b/arch/x86/entry/entry_64.S
> @@ -140,6 +140,54 @@ END(native_usergs_sysret64)
>   * with them due to bugs in both AMD and Intel CPUs.
>   */
>  
> +	.pushsection .entry_trampoline, "ax"
> +
> +/*
> + * The code in here gets remapped into cpu_entry_area's trampoline.  This means
> + * that the assembler and linker have the wrong idea as to where this code
> + * lives (and, in fact, it's mapped more than once, so it's not even at a
> + * fixed address).  So we can't reference any symbols outside the entry
> + * trampoline and expect it to work.
> + *
> + * Instead, we carefully abuse %rip-relative addressing.
> + * .Lentry_trampoline(%rip) refers to the start of the remapped) entry

_entry_trampoline(%rip) I'd guess.

> + * trampoline.  We can thus find cpu_entry_area with this macro:

Uuh, fun. :)

> + */
> +
> +#define CPU_ENTRY_AREA \
> +	_entry_trampoline - CPU_ENTRY_AREA_entry_trampoline(%rip)

So this generates

	_entry_trampoline - 16384(%rip)

here. IINM, the layout looks like this

[ GDT | TSS | TSS | TSS | trampoline ]

where each section is a page, and we have 4 pages per CPU. Just for my
own understanding...

> +
> +/* The top word of the SYSENTER stack is hot and is usable as scratch space. */
> +#define RSP_SCRATCH CPU_ENTRY_AREA_tss + CPU_TSS_SYSENTER_stack + \
> +	SIZEOF_SYSENTER_stack - 8 + CPU_ENTRY_AREA

I'm wondering if it would be easier to make SYSENTER_stack part of
struct cpu_entry_area and thus simplify that wild offset math here :)

Also, pls align:

#define RSP_SCRATCH CPU_ENTRY_AREA_tss + CPU_TSS_SYSENTER_stack + \
                    SIZEOF_SYSENTER_stack - 8 + CPU_ENTRY_AREA

> +
> +ENTRY(entry_SYSCALL_64_trampoline)
> +	UNWIND_HINT_EMPTY
> +	swapgs
> +
> +	/* Stash the user RSP. */
> +	movq	%rsp, RSP_SCRATCH
> +
> +	/* Load the top of the task stack into RSP */
> +	movq	CPU_ENTRY_AREA_tss + TSS_sp1 + CPU_ENTRY_AREA, %rsp

Yeah, let's put CPU_ENTRY_AREA first because then it reads easier:

movq    CPU_ENTRY_AREA + CPU_ENTRY_AREA_tss + TSS_sp1, %rsp

i.e., pointer to cpu_entry_area, offset to tss within said
cpu_entry_area and then inside that, sp1.

Ditto for above.

...

> @@ -1417,10 +1424,13 @@ static DEFINE_PER_CPU_PAGE_ALIGNED(char, exception_stacks
>  /* May not be marked __init: used by software suspend */
>  void syscall_init(void)
>  {
> +	extern char _entry_trampoline[];
> +	extern char entry_SYSCALL_64_trampoline[];
> +
>  	int cpu = smp_processor_id();
>  
>  	wrmsr(MSR_STAR, 0, (__USER32_CS << 16) | __KERNEL_CS);
> -	wrmsrl(MSR_LSTAR, (unsigned long)entry_SYSCALL_64);
> +	wrmsrl(MSR_LSTAR, (unsigned long)get_cpu_entry_area(cpu)->entry_trampoline + (entry_SYSCALL_64_trampoline - _entry_trampoline));

Definitely a local var.

>  #ifdef CONFIG_IA32_EMULATION
>  	wrmsrl(MSR_CSTAR, (unsigned long)entry_SYSCALL_compat);
> diff --git a/arch/x86/kernel/vmlinux.lds.S b/arch/x86/kernel/vmlinux.lds.S
> index a4009fb9be87..2738cfb6c8c8 100644
> --- a/arch/x86/kernel/vmlinux.lds.S
> +++ b/arch/x86/kernel/vmlinux.lds.S
> @@ -107,6 +107,16 @@ SECTIONS
>  		SOFTIRQENTRY_TEXT
>  		*(.fixup)
>  		*(.gnu.warning)
> +
> +#ifdef CONFIG_X86_64
> +		/* Entry trampoline */

No need for that comment - variable and section names are enough. :)

> +		. = ALIGN(PAGE_SIZE);
> +		_entry_trampoline = .;
> +		*(.entry_trampoline)
> +		. = ALIGN(PAGE_SIZE);
> +		ASSERT(. - _entry_trampoline == PAGE_SIZE, "entry trampoline is too big");
> +#endif
> +
>  		/* End of text section */
>  		_etext = .;
>  	} :text = 0x9090
> -- 

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 16/43] x86/irq: Remove an old outdated comment about context tracking races
  2017-11-24 17:23 ` [PATCH 16/43] x86/irq: Remove an old outdated comment about context tracking races Ingo Molnar
@ 2017-11-25 12:05   ` Borislav Petkov
  0 siblings, 0 replies; 113+ messages in thread
From: Borislav Petkov @ 2017-11-25 12:05 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Dave Hansen, Andy Lutomirski, Thomas Gleixner,
	H . Peter Anvin, Peter Zijlstra, Linus Torvalds

On Fri, Nov 24, 2017 at 06:23:44PM +0100, Ingo Molnar wrote:
> From: Andy Lutomirski <luto@kernel.org>
> 
> That race has been fixed and code cleaned up for a while now.
> 
> Signed-off-by: Andy Lutomirski <luto@kernel.org>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> Cc: Borislav Petkov <bpetkov@suse.de>
> Cc: Brian Gerst <brgerst@gmail.com>
> Cc: Dave Hansen <dave.hansen@intel.com>
> Cc: Josh Poimboeuf <jpoimboe@redhat.com>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Link: https://lkml.kernel.org/r/12e75976dbbb7ece2b0a64238f1d3892dfed1e16.1511497875.git.luto@kernel.org
> Signed-off-by: Ingo Molnar <mingo@kernel.org>
> ---
>  arch/x86/kernel/irq.c | 12 ------------
>  1 file changed, 12 deletions(-)
> 
> diff --git a/arch/x86/kernel/irq.c b/arch/x86/kernel/irq.c
> index 49cfd9fe7589..68e1867cca80 100644
> --- a/arch/x86/kernel/irq.c
> +++ b/arch/x86/kernel/irq.c
> @@ -219,18 +219,6 @@ __visible unsigned int __irq_entry do_IRQ(struct pt_regs *regs)
>  	/* high bit used in ret_from_ code  */
>  	unsigned vector = ~regs->orig_ax;
>  
> -	/*
> -	 * NB: Unlike exception entries, IRQ entries do not reliably
> -	 * handle context tracking in the low-level entry code.  This is
> -	 * because syscall entries execute briefly with IRQs on before
> -	 * updating context tracking state, so we can take an IRQ from
> -	 * kernel mode with CONTEXT_USER.  The low-level entry code only
> -	 * updates the context if we came from user mode, so we won't
> -	 * switch to CONTEXT_KERNEL.  We'll fix that once the syscall
> -	 * code is cleaned up enough that we can cleanly defer enabling
> -	 * IRQs.
> -	 */
> -
>  	entering_irq();
>  
>  	/* entering_irq() tells RCU that we're not quiescent.  Check it. */
> -- 

Reviewed-by: Borislav Petkov <bp@suse.de>

Also, fixes like that should move to the top of the patchset.

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 17/43] x86/irq/64: In the stack overflow warning, print the offending IP
  2017-11-24 17:23 ` [PATCH 17/43] x86/irq/64: In the stack overflow warning, print the offending IP Ingo Molnar
@ 2017-11-25 12:07   ` Borislav Petkov
  0 siblings, 0 replies; 113+ messages in thread
From: Borislav Petkov @ 2017-11-25 12:07 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Dave Hansen, Andy Lutomirski, Thomas Gleixner,
	H . Peter Anvin, Peter Zijlstra, Linus Torvalds

On Fri, Nov 24, 2017 at 06:23:45PM +0100, Ingo Molnar wrote:
> From: Andy Lutomirski <luto@kernel.org>
> 
> In case something goes wrong with unwind (not unlikely in case of
> overflow), print the offending IP where we detected the overflow.
> 
> Signed-off-by: Andy Lutomirski <luto@kernel.org>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> Cc: Borislav Petkov <bpetkov@suse.de>
> Cc: Brian Gerst <brgerst@gmail.com>
> Cc: Dave Hansen <dave.hansen@intel.com>
> Cc: Josh Poimboeuf <jpoimboe@redhat.com>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Link: https://lkml.kernel.org/r/6fcf700cc5ee884fb739b67d1246ab4185c41409.1511497875.git.luto@kernel.org
> Signed-off-by: Ingo Molnar <mingo@kernel.org>
> ---
>  arch/x86/kernel/irq_64.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/kernel/irq_64.c b/arch/x86/kernel/irq_64.c
> index 020efbf5786b..d86e344f5b3d 100644
> --- a/arch/x86/kernel/irq_64.c
> +++ b/arch/x86/kernel/irq_64.c
> @@ -57,10 +57,10 @@ static inline void stack_overflow_check(struct pt_regs *regs)
>  	if (regs->sp >= estack_top && regs->sp <= estack_bottom)
>  		return;
>  
> -	WARN_ONCE(1, "do_IRQ(): %s has overflown the kernel stack (cur:%Lx,sp:%lx,irq stk top-bottom:%Lx-%Lx,exception stk top-bottom:%Lx-%Lx)\n",
> +	WARN_ONCE(1, "do_IRQ(): %s has overflown the kernel stack (cur:%Lx,sp:%lx,irq stk top-bottom:%Lx-%Lx,exception stk top-bottom:%Lx-%Lx,ip:%pF)\n",
>  		current->comm, curbase, regs->sp,
>  		irq_stack_top, irq_stack_bottom,
> -		estack_top, estack_bottom);
> +		estack_top, estack_bottom, (void *)regs->ip);
>  
>  	if (sysctl_panic_on_stackoverflow)
>  		panic("low stack detected by irq handler - check messages\n");
> -- 

Reviewed-by: Borislav Petkov <bp@suse.de>

Also move to the beginning of the patchset.

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 18/43] x86/entry/64: Move the IST stacks into cpu_entry_area
  2017-11-24 17:23 ` [PATCH 18/43] x86/entry/64: Move the IST stacks into cpu_entry_area Ingo Molnar
@ 2017-11-25 12:34   ` Borislav Petkov
  0 siblings, 0 replies; 113+ messages in thread
From: Borislav Petkov @ 2017-11-25 12:34 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Dave Hansen, Andy Lutomirski, Thomas Gleixner,
	H . Peter Anvin, Peter Zijlstra, Linus Torvalds

On Fri, Nov 24, 2017 at 06:23:46PM +0100, Ingo Molnar wrote:
> From: Andy Lutomirski <luto@kernel.org>
> 
> The IST stacks are needed when an IST exception occurs and are
> accessed before any kernel code at all runs.  Move them into
> cpu_entry_area.
> 
> Signed-off-by: Andy Lutomirski <luto@kernel.org>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> Cc: Borislav Petkov <bpetkov@suse.de>
> Cc: Brian Gerst <brgerst@gmail.com>
> Cc: Dave Hansen <dave.hansen@intel.com>
> Cc: Josh Poimboeuf <jpoimboe@redhat.com>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Link: https://lkml.kernel.org/r/0ffddccdc0ce1953f950a553142662cf68258fb7.1511497875.git.luto@kernel.org
> Signed-off-by: Ingo Molnar <mingo@kernel.org>
> ---
>  arch/x86/include/asm/fixmap.h | 10 ++++++++++
>  arch/x86/kernel/cpu/common.c  | 40 +++++++++++++++++++++++++---------------
>  2 files changed, 35 insertions(+), 15 deletions(-)

Reviewed-by: Borislav Petkov <bp@suse.de>

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 22/43] x86/mm/kaiser: Prepare assembly for entry/exit CR3 switching
  2017-11-25  0:02   ` Thomas Gleixner
@ 2017-11-25 12:41     ` Thomas Gleixner
  0 siblings, 0 replies; 113+ messages in thread
From: Thomas Gleixner @ 2017-11-25 12:41 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Dave Hansen, Andy Lutomirski, H . Peter Anvin,
	Peter Zijlstra, Borislav Petkov, Linus Torvalds

On Sat, 25 Nov 2017, Thomas Gleixner wrote:

> On Fri, 24 Nov 2017, Ingo Molnar wrote:
> > @@ -1288,6 +1308,8 @@ ENTRY(error_entry)
> >  	 * from user mode due to an IRET fault.
> >  	 */
> >  	SWAPGS
> > +	/* We have user CR3.  Change to kernel CR3. */
> > +	SWITCH_TO_KERNEL_CR3 scratch_reg=%rax
> >  
> >  .Lerror_entry_from_usermode_after_swapgs:
> >  	/* Put us onto the real thread stack. */
> > @@ -1333,6 +1355,7 @@ ENTRY(error_entry)
> >  	 * gsbase and proceed.  We'll fix up the exception and land in
> >  	 * .Lgs_change's error handler with kernel gsbase.
> >  	 */
> > +	SWITCH_TO_KERNEL_CR3 scratch_reg=%rax
> >  	SWAPGS
> 
> This is wrong. SWAPGS needs to come first.

With this fixed:

Reviewed-by: Thomas Gleixner <tglx@linutronix.de>

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 15/43] x86/entry/64: Create a percpu SYSCALL entry trampoline
  2017-11-25 11:40   ` Borislav Petkov
@ 2017-11-25 15:00     ` Andy Lutomirski
  2017-11-26 14:26       ` [PATCH] " Ingo Molnar
  0 siblings, 1 reply; 113+ messages in thread
From: Andy Lutomirski @ 2017-11-25 15:00 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Ingo Molnar, linux-kernel, Dave Hansen, Thomas Gleixner,
	H . Peter Anvin, Peter Zijlstra, Linus Torvalds

On Sat, Nov 25, 2017 at 3:40 AM, Borislav Petkov <bp@alien8.de> wrote:
> On Fri, Nov 24, 2017 at 06:23:43PM +0100, Ingo Molnar wrote:
>> From: Andy Lutomirski <luto@kernel.org>
>>
>> Handling SYSCALL is tricky: the SYSCALL handler is entered with every
>> single register (except FLAGS), including RSP, live.  It somehow needs
>> to set RSP to point to a valid stack, which means it needs to save the
>> user RSP somewhere and find its own stack pointer.  The canonical way
>> to do this is with SWAPGS, which lets us access percpu data using the
>> %gs prefix.
>>
>> With KAISER-like pagetable switching, this is problematic.  Without a
>> scratch register, switching CR3 is impossible, so %gs-based percpu
>> memory would need to be mapped in the user pagetables.  Doing that
>> without information leaks is difficult or impossible.
>>
>> Instead, use a different sneaky trick.  Map a copy of the first part
>> of the SYSCALL asm at a different address for each CPU.  Now RIP
>> varies depending on the CPU, so we can use RIP-relative memory access
>> to access percpu memory.  By putting the relevant information (one
>> scratch slot and the stack address) at a constant offset relative to
>> RIP, we can make SYSCALL work without relying on %gs.
>>
>> A nice thing about this approach is that we can easily switch it on
>> and off if we want pagetable switching to be configurable.
>>
>> The compat variant of SYSCALL doesn't have this problem in the first
>> place -- there are plenty of scratch registers, since we don't care
>> about preserving r8-r15.  This patch therefore doesn't touch SYSCALL32
>> at all.
>>
>> XXX: Whenever we settle how KAISER gets turned on and off, we should do
>> the same to this.
>>
>> Signed-off-by: Andy Lutomirski <luto@kernel.org>
>> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
>> Cc: Borislav Petkov <bpetkov@suse.de>
>> Cc: Brian Gerst <brgerst@gmail.com>
>> Cc: Dave Hansen <dave.hansen@intel.com>
>> Cc: Josh Poimboeuf <jpoimboe@redhat.com>
>> Cc: Linus Torvalds <torvalds@linux-foundation.org>
>> Cc: Peter Zijlstra <peterz@infradead.org>
>> Link: https://lkml.kernel.org/r/b95ccae0a5a2f090c901e49fce7c9e8ff6acd40d.1511497875.git.luto@kernel.org
>> Signed-off-by: Ingo Molnar <mingo@kernel.org>
>> ---
>>  arch/x86/entry/entry_64.S     | 48 +++++++++++++++++++++++++++++++++++++++++++
>>  arch/x86/include/asm/fixmap.h |  2 ++
>>  arch/x86/kernel/asm-offsets.c |  1 +
>>  arch/x86/kernel/cpu/common.c  | 12 ++++++++++-
>>  arch/x86/kernel/vmlinux.lds.S | 10 +++++++++
>>  5 files changed, 72 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
>> index 426b8c669d6a..0cde243b7542 100644
>> --- a/arch/x86/entry/entry_64.S
>> +++ b/arch/x86/entry/entry_64.S
>> @@ -140,6 +140,54 @@ END(native_usergs_sysret64)
>>   * with them due to bugs in both AMD and Intel CPUs.
>>   */
>>
>> +     .pushsection .entry_trampoline, "ax"
>> +
>> +/*
>> + * The code in here gets remapped into cpu_entry_area's trampoline.  This means
>> + * that the assembler and linker have the wrong idea as to where this code
>> + * lives (and, in fact, it's mapped more than once, so it's not even at a
>> + * fixed address).  So we can't reference any symbols outside the entry
>> + * trampoline and expect it to work.
>> + *
>> + * Instead, we carefully abuse %rip-relative addressing.
>> + * .Lentry_trampoline(%rip) refers to the start of the remapped) entry
>
> _entry_trampoline(%rip) I'd guess.

Indeed.  It used to be .L, but then I put it in the linker script the
obvious way and it wasn't any more.

>
>> + * trampoline.  We can thus find cpu_entry_area with this macro:
>
> Uuh, fun. :)

That's what I thought!

>
>> + */
>> +
>> +#define CPU_ENTRY_AREA \
>> +     _entry_trampoline - CPU_ENTRY_AREA_entry_trampoline(%rip)
>
> So this generates
>
>         _entry_trampoline - 16384(%rip)
>
> here. IINM, the layout looks like this
>
> [ GDT | TSS | TSS | TSS | trampoline ]
>
> where each section is a page, and we have 4 pages per CPU. Just for my
> own understanding...

Indeed.

>
>> +
>> +/* The top word of the SYSENTER stack is hot and is usable as scratch space. */
>> +#define RSP_SCRATCH CPU_ENTRY_AREA_tss + CPU_TSS_SYSENTER_stack + \
>> +     SIZEOF_SYSENTER_stack - 8 + CPU_ENTRY_AREA
>
> I'm wondering if it would be easier to make SYSENTER_stack part of
> struct cpu_entry_area and thus simplify that wild offset math here :)

It's like that with the last patch that I haven't send out applied, actually :)

>
> Also, pls align:
>
> #define RSP_SCRATCH CPU_ENTRY_AREA_tss + CPU_TSS_SYSENTER_stack + \
>                     SIZEOF_SYSENTER_stack - 8 + CPU_ENTRY_AREA

Done.

>
>> +
>> +ENTRY(entry_SYSCALL_64_trampoline)
>> +     UNWIND_HINT_EMPTY
>> +     swapgs
>> +
>> +     /* Stash the user RSP. */
>> +     movq    %rsp, RSP_SCRATCH
>> +
>> +     /* Load the top of the task stack into RSP */
>> +     movq    CPU_ENTRY_AREA_tss + TSS_sp1 + CPU_ENTRY_AREA, %rsp
>
> Yeah, let's put CPU_ENTRY_AREA first because then it reads easier:
>
> movq    CPU_ENTRY_AREA + CPU_ENTRY_AREA_tss + TSS_sp1, %rsp

Nope, it won't build.  That would expand like 0x1(%rip) + 0x2, which
isn't valid.

>
> i.e., pointer to cpu_entry_area, offset to tss within said
> cpu_entry_area and then inside that, sp1.
>
> Ditto for above.
>
> ...
>
>> @@ -1417,10 +1424,13 @@ static DEFINE_PER_CPU_PAGE_ALIGNED(char, exception_stacks
>>  /* May not be marked __init: used by software suspend */
>>  void syscall_init(void)
>>  {
>> +     extern char _entry_trampoline[];
>> +     extern char entry_SYSCALL_64_trampoline[];
>> +
>>       int cpu = smp_processor_id();
>>
>>       wrmsr(MSR_STAR, 0, (__USER32_CS << 16) | __KERNEL_CS);
>> -     wrmsrl(MSR_LSTAR, (unsigned long)entry_SYSCALL_64);
>> +     wrmsrl(MSR_LSTAR, (unsigned long)get_cpu_entry_area(cpu)->entry_trampoline + (entry_SYSCALL_64_trampoline - _entry_trampoline));
>
> Definitely a local var.

Done.

>
>>  #ifdef CONFIG_IA32_EMULATION
>>       wrmsrl(MSR_CSTAR, (unsigned long)entry_SYSCALL_compat);
>> diff --git a/arch/x86/kernel/vmlinux.lds.S b/arch/x86/kernel/vmlinux.lds.S
>> index a4009fb9be87..2738cfb6c8c8 100644
>> --- a/arch/x86/kernel/vmlinux.lds.S
>> +++ b/arch/x86/kernel/vmlinux.lds.S
>> @@ -107,6 +107,16 @@ SECTIONS
>>               SOFTIRQENTRY_TEXT
>>               *(.fixup)
>>               *(.gnu.warning)
>> +
>> +#ifdef CONFIG_X86_64
>> +             /* Entry trampoline */
>
> No need for that comment - variable and section names are enough. :)
>

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 19/43] x86/entry/64: Remove the SYSENTER stack canary
  2017-11-24 17:23 ` [PATCH 19/43] x86/entry/64: Remove the SYSENTER stack canary Ingo Molnar
@ 2017-11-25 15:29   ` Borislav Petkov
  0 siblings, 0 replies; 113+ messages in thread
From: Borislav Petkov @ 2017-11-25 15:29 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Dave Hansen, Andy Lutomirski, Thomas Gleixner,
	H . Peter Anvin, Peter Zijlstra, Linus Torvalds

On Fri, Nov 24, 2017 at 06:23:47PM +0100, Ingo Molnar wrote:
> From: Andy Lutomirski <luto@kernel.org>
> 
> Now that the SYSENTER stack has a guard page, there's no need for a
> canary to detect overflow after the fact.
> 
> Signed-off-by: Andy Lutomirski <luto@kernel.org>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> Cc: Borislav Petkov <bpetkov@suse.de>
> Cc: Brian Gerst <brgerst@gmail.com>
> Cc: Dave Hansen <dave.hansen@intel.com>
> Cc: Josh Poimboeuf <jpoimboe@redhat.com>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Link: https://lkml.kernel.org/r/be3179c0a38c392fa44ebeb7dd89391ff5c010c3.1511497875.git.luto@kernel.org
> Signed-off-by: Ingo Molnar <mingo@kernel.org>
> ---
>  arch/x86/include/asm/processor.h | 1 -
>  arch/x86/kernel/dumpstack.c      | 3 +--
>  arch/x86/kernel/process.c        | 1 -
>  arch/x86/kernel/traps.c          | 7 -------
>  4 files changed, 1 insertion(+), 11 deletions(-)

Nice.

Reviewed-by: Borislav Petkov <bp@suse.de>

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH] x86/mm/kaiser: Fix IRQ entries text section mapping
  2017-11-25 11:17               ` [PATCH] x86/mm/kaiser: Fix IRQ entries text section mapping Ingo Molnar
@ 2017-11-25 16:08                 ` Thomas Gleixner
  2017-11-25 20:06                   ` Steven Rostedt
  2017-11-27  8:14                   ` Peter Zijlstra
  0 siblings, 2 replies; 113+ messages in thread
From: Thomas Gleixner @ 2017-11-25 16:08 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andy Lutomirski, Dave Hansen, Josh Poimboeuf, LKML,
	Andy Lutomirski, H . Peter Anvin, Peter Zijlstra,
	Borislav Petkov, Linus Torvalds, Steven Rostedt, Andrey Ryabinin,
	kasan-dev, Masami Hiramatsu

On Sat, 25 Nov 2017, Ingo Molnar wrote:
>  	kaiser_add_user_map_ptrs_early(__entry_text_start, __entry_text_end,
>  				       __PAGE_KERNEL_RX | _PAGE_GLOBAL);
> +	kaiser_add_user_map_ptrs_early(__irqentry_text_start, __irqentry_text_end,
> +				       __PAGE_KERNEL_RX | _PAGE_GLOBAL);

The bad news is that this maps more stuff than actually needed:

ffffffff81ab1c20 T __irqentry_text_start
ffffffff81ab1c20 T apic_timer_interrupt
ffffffff81ab1cf0 T x86_platform_ipi
ffffffff81ab1dc0 T threshold_interrupt
ffffffff81ab1e90 T deferred_error_interrupt
ffffffff81ab1f60 T thermal_interrupt
ffffffff81ab2030 T call_function_single_interrupt
ffffffff81ab2100 T call_function_interrupt
ffffffff81ab21d0 T reschedule_interrupt
ffffffff81ab22a0 T error_interrupt
ffffffff81ab2370 T spurious_interrupt
ffffffff81ab2440 T irq_work_interrupt

The above are the real entry points. The below is already C-Code which has
nothing to do with the entry region.

ffffffff81ab2510 T do_IRQ
ffffffff81ab2610 T smp_x86_platform_ipi
ffffffff81ab2840 T smp_irq_work_interrupt
ffffffff81ab2a50 T smp_deferred_error_interrupt
ffffffff81ab2c60 T smp_threshold_interrupt
ffffffff81ab2e70 T smp_thermal_interrupt
ffffffff81ab3080 T smp_reschedule_interrupt
ffffffff81ab3280 T smp_call_function_interrupt
ffffffff81ab3490 T smp_call_function_single_interrupt
ffffffff81ab36a0 T smp_apic_timer_interrupt
ffffffff81ab3900 T smp_spurious_interrupt
ffffffff81ab3b40 T smp_error_interrupt
ffffffff81ab3e20 T smp_irq_move_cleanup_interrupt
ffffffff81ab3ecc T __irqentry_text_end

irqentry_text is checked by kasan, kprobes, tracing and the unwinder.

kasan uses it to filter irq stacks, i.e. to limit the stack entries to the
point where it hits something inside irqentry_text. Shouldn't be a problem
to restrict it to the actual entry code.

kprobes has a similar thing. The comment says:

	Do not optimize in the entry code due to the unstable stack
	and register handling.

The C functions are not affected by that ....

Tracing uses it to stop the printout of the function graph. Should be safe
to print the C-Code functions.

The unwinder might get confused. The comment there says:

  Don't warn if the unwinder got lost due to an interrupt in entry
  code or in the C handler before the first frame pointer got set up:

I think the unwinder one is easy to solve by having a separate section for
the C-code portion.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 20/43] x86/entry: Clean up SYSENTER_stack code
  2017-11-24 17:23 ` [PATCH 20/43] x86/entry: Clean up SYSENTER_stack code Ingo Molnar
@ 2017-11-25 16:39   ` Borislav Petkov
  2017-11-25 16:50     ` Thomas Gleixner
  0 siblings, 1 reply; 113+ messages in thread
From: Borislav Petkov @ 2017-11-25 16:39 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Dave Hansen, Andy Lutomirski, Thomas Gleixner,
	H . Peter Anvin, Peter Zijlstra, Linus Torvalds

On Fri, Nov 24, 2017 at 06:23:48PM +0100, Ingo Molnar wrote:
> From: Andy Lutomirski <luto@kernel.org>
> 
> The existing code was a mess, mainly because C arrays are nasty.
> Turn SYSENTER_stack into a struct, add a helper to find it, and do
> all the obvious cleanups this enables.
> 
> Signed-off-by: Andy Lutomirski <luto@kernel.org>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> Cc: Borislav Petkov <bpetkov@suse.de>
> Cc: Brian Gerst <brgerst@gmail.com>
> Cc: Dave Hansen <dave.hansen@intel.com>
> Cc: Josh Poimboeuf <jpoimboe@redhat.com>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Link: https://lkml.kernel.org/r/38ff640712c9b591b32de24a080daf13afaba234.1511497875.git.luto@kernel.org
> Signed-off-by: Ingo Molnar <mingo@kernel.org>
> ---

...

> diff --git a/arch/x86/kernel/asm-offsets.c b/arch/x86/kernel/asm-offsets.c
> index 61b1af88ac07..46c0995344aa 100644
> --- a/arch/x86/kernel/asm-offsets.c
> +++ b/arch/x86/kernel/asm-offsets.c
> @@ -94,10 +94,8 @@ void common(void) {
>  	BLANK();
>  	DEFINE(PTREGS_SIZE, sizeof(struct pt_regs));
>  
> -	/* Offset from cpu_tss to SYSENTER_stack */
> -	OFFSET(CPU_TSS_SYSENTER_stack, tss_struct, SYSENTER_stack);
> -	/* Size of SYSENTER_stack */
> -	DEFINE(SIZEOF_SYSENTER_stack, sizeof(((struct tss_struct *)0)->SYSENTER_stack));
> +	OFFSET(TSS_STRUCT_SYSENTER_stack, tss_struct, SYSENTER_stack);
> +	DEFINE(SIZEOF_SYSENTER_stack, sizeof(struct SYSENTER_stack));
>  
>  	/* Layout info for cpu_entry_area */
>  	OFFSET(CPU_ENTRY_AREA_tss, cpu_entry_area, tss);
> diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
> index 6b949e6ea0f9..f9c7e6852874 100644
> --- a/arch/x86/kernel/cpu/common.c
> +++ b/arch/x86/kernel/cpu/common.c
> @@ -1332,12 +1332,7 @@ void enable_sep_cpu(void)
>  
>  	tss->x86_tss.ss1 = __KERNEL_CS;
>  	wrmsr(MSR_IA32_SYSENTER_CS, tss->x86_tss.ss1, 0);
> -
> -	wrmsr(MSR_IA32_SYSENTER_ESP,
> -	      (unsigned long)&get_cpu_entry_area(cpu)->tss +
> -	      offsetofend(struct tss_struct, SYSENTER_stack),
> -	      0);
> -
> +	wrmsr(MSR_IA32_SYSENTER_ESP, (unsigned long)(cpu_SYSENTER_stack(cpu) + 1), 0);
>  	wrmsr(MSR_IA32_SYSENTER_EIP, (unsigned long)entry_SYSENTER_32, 0);

Right, so we have now two TSS thingies, AFAICT:

	tss = &per_cpu(cpu_tss, cpu);

which is cpu_tss and then indirectly, we have also:

	&get_cpu_entry_area((cpu))->tss

And those are two different things in my guest here:

[    0.044002] tss: 0xf5747000
[    0.044706] entry area tss: 0xffef1000

What is the logic here? We carry two TSSs per CPU - one which is RO
for the entry area and the other is the actual cpu_tss thing? Or am I
misreading it?

Thx.

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 20/43] x86/entry: Clean up SYSENTER_stack code
  2017-11-25 16:39   ` Borislav Petkov
@ 2017-11-25 16:50     ` Thomas Gleixner
  2017-11-25 16:55       ` Andy Lutomirski
  0 siblings, 1 reply; 113+ messages in thread
From: Thomas Gleixner @ 2017-11-25 16:50 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Ingo Molnar, linux-kernel, Dave Hansen, Andy Lutomirski,
	H . Peter Anvin, Peter Zijlstra, Linus Torvalds

On Sat, 25 Nov 2017, Borislav Petkov wrote:
> > -
> > +	wrmsr(MSR_IA32_SYSENTER_ESP, (unsigned long)(cpu_SYSENTER_stack(cpu) + 1), 0);
> >  	wrmsr(MSR_IA32_SYSENTER_EIP, (unsigned long)entry_SYSENTER_32, 0);
> 
> Right, so we have now two TSS thingies, AFAICT:
> 
> 	tss = &per_cpu(cpu_tss, cpu);
> 
> which is cpu_tss and then indirectly, we have also:
> 
> 	&get_cpu_entry_area((cpu))->tss
> 
> And those are two different things in my guest here:
> 
> [    0.044002] tss: 0xf5747000
> [    0.044706] entry area tss: 0xffef1000
> 
> What is the logic here? We carry two TSSs per CPU - one which is RO
> for the entry area and the other is the actual cpu_tss thing? Or am I
> misreading it?

entry area tss is a alias mapping of cpu_tss

+       set_percpu_fixmap_pages(get_cpu_entry_area_index(cpu, tss),
+                               &per_cpu(cpu_tss, cpu),
+                               sizeof(struct tss_struct) / PAGE_SIZE,
+                               PAGE_KERNEL);

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 20/43] x86/entry: Clean up SYSENTER_stack code
  2017-11-25 16:50     ` Thomas Gleixner
@ 2017-11-25 16:55       ` Andy Lutomirski
  2017-11-25 17:03         ` Thomas Gleixner
  0 siblings, 1 reply; 113+ messages in thread
From: Andy Lutomirski @ 2017-11-25 16:55 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Borislav Petkov, Ingo Molnar, linux-kernel, Dave Hansen,
	H . Peter Anvin, Peter Zijlstra, Linus Torvalds



> On Nov 25, 2017, at 9:50 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> 
> On Sat, 25 Nov 2017, Borislav Petkov wrote:
>>> -
>>> +    wrmsr(MSR_IA32_SYSENTER_ESP, (unsigned long)(cpu_SYSENTER_stack(cpu) + 1), 0);
>>>    wrmsr(MSR_IA32_SYSENTER_EIP, (unsigned long)entry_SYSENTER_32, 0);
>> 
>> Right, so we have now two TSS thingies, AFAICT:
>> 
>>    tss = &per_cpu(cpu_tss, cpu);
>> 
>> which is cpu_tss and then indirectly, we have also:
>> 
>>    &get_cpu_entry_area((cpu))->tss
>> 
>> And those are two different things in my guest here:
>> 
>> [    0.044002] tss: 0xf5747000
>> [    0.044706] entry area tss: 0xffef1000
>> 
>> What is the logic here? We carry two TSSs per CPU - one which is RO
>> for the entry area and the other is the actual cpu_tss thing? Or am I
>> misreading it?
> 
> entry area tss is a alias mapping of cpu_tss
> 
> +       set_percpu_fixmap_pages(get_cpu_entry_area_index(cpu, tss),
> +                               &per_cpu(cpu_tss, cpu),
> +                               sizeof(struct tss_struct) / PAGE_SIZE,
> +                               PAGE_KERNEL);
> 

Exactly.  And, in the patch I haven't emailed, the alias is RO on x86_64.

Maybe I should rename cpu_tss to cpu_tss_rw in that patch.

> Thanks,
> 
>    tglx

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 20/43] x86/entry: Clean up SYSENTER_stack code
  2017-11-25 16:55       ` Andy Lutomirski
@ 2017-11-25 17:03         ` Thomas Gleixner
  2017-11-25 17:10           ` Borislav Petkov
  0 siblings, 1 reply; 113+ messages in thread
From: Thomas Gleixner @ 2017-11-25 17:03 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Borislav Petkov, Ingo Molnar, linux-kernel, Dave Hansen,
	H . Peter Anvin, Peter Zijlstra, Linus Torvalds

On Sat, 25 Nov 2017, Andy Lutomirski wrote:
> > On Nov 25, 2017, at 9:50 AM, Thomas Gleixner <tglx@linutronix.de> wrote:
> > 
> > On Sat, 25 Nov 2017, Borislav Petkov wrote:
> >>> -
> >>> +    wrmsr(MSR_IA32_SYSENTER_ESP, (unsigned long)(cpu_SYSENTER_stack(cpu) + 1), 0);
> >>>    wrmsr(MSR_IA32_SYSENTER_EIP, (unsigned long)entry_SYSENTER_32, 0);
> >> 
> >> Right, so we have now two TSS thingies, AFAICT:
> >> 
> >>    tss = &per_cpu(cpu_tss, cpu);
> >> 
> >> which is cpu_tss and then indirectly, we have also:
> >> 
> >>    &get_cpu_entry_area((cpu))->tss
> >> 
> >> And those are two different things in my guest here:
> >> 
> >> [    0.044002] tss: 0xf5747000
> >> [    0.044706] entry area tss: 0xffef1000
> >> 
> >> What is the logic here? We carry two TSSs per CPU - one which is RO
> >> for the entry area and the other is the actual cpu_tss thing? Or am I
> >> misreading it?
> > 
> > entry area tss is a alias mapping of cpu_tss
> > 
> > +       set_percpu_fixmap_pages(get_cpu_entry_area_index(cpu, tss),
> > +                               &per_cpu(cpu_tss, cpu),
> > +                               sizeof(struct tss_struct) / PAGE_SIZE,
> > +                               PAGE_KERNEL);
> > 
> 
> Exactly.  And, in the patch I haven't emailed, the alias is RO on x86_64.

Btw, I don't think you have to worry about making it RO. The SDM is blurry
about that for 64bit. RW is a must for 32bit though.

> Maybe I should rename cpu_tss to cpu_tss_rw in that patch.

For clarity that would be nice.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 20/43] x86/entry: Clean up SYSENTER_stack code
  2017-11-25 17:03         ` Thomas Gleixner
@ 2017-11-25 17:10           ` Borislav Petkov
  2017-11-25 17:26             ` Andy Lutomirski
  0 siblings, 1 reply; 113+ messages in thread
From: Borislav Petkov @ 2017-11-25 17:10 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Andy Lutomirski, Ingo Molnar, linux-kernel, Dave Hansen,
	H . Peter Anvin, Peter Zijlstra, Linus Torvalds

On Sat, Nov 25, 2017 at 06:03:36PM +0100, Thomas Gleixner wrote:
> > Maybe I should rename cpu_tss to cpu_tss_rw in that patch.
> 
> For clarity that would be nice.

+ a comment stating the alias mapping. It took tglx and me a while on
IRC to figure it out. :-)

Thx.

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 24/43] x86/mm/kaiser: Mark per-cpu data structures required for entry/exit
  2017-11-24 17:23 ` [PATCH 24/43] x86/mm/kaiser: Mark per-cpu data structures required for entry/exit Ingo Molnar
@ 2017-11-25 17:17   ` Thomas Gleixner
  2017-11-26 15:54     ` Ingo Molnar
  0 siblings, 1 reply; 113+ messages in thread
From: Thomas Gleixner @ 2017-11-25 17:17 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Dave Hansen, Andy Lutomirski, H . Peter Anvin,
	Peter Zijlstra, Borislav Petkov, Linus Torvalds

On Fri, 24 Nov 2017, Ingo Molnar wrote:
> diff --git a/arch/x86/include/asm/desc.h b/arch/x86/include/asm/desc.h
> index aab4fe9f49f8..300090d1c209 100644
> --- a/arch/x86/include/asm/desc.h
> +++ b/arch/x86/include/asm/desc.h
> @@ -46,7 +46,7 @@ struct gdt_page {
>  	struct desc_struct gdt[GDT_ENTRIES];
>  } __attribute__((aligned(PAGE_SIZE)));
>  
> -DECLARE_PER_CPU_PAGE_ALIGNED(struct gdt_page, gdt_page);
> +DECLARE_PER_CPU_PAGE_ALIGNED_USER_MAPPED(struct gdt_page, gdt_page);

This patch is obsolete. We have the alias mappings in the cpu_entry_area
already.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 20/43] x86/entry: Clean up SYSENTER_stack code
  2017-11-25 17:10           ` Borislav Petkov
@ 2017-11-25 17:26             ` Andy Lutomirski
  2017-11-27  9:27               ` Peter Zijlstra
  0 siblings, 1 reply; 113+ messages in thread
From: Andy Lutomirski @ 2017-11-25 17:26 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Thomas Gleixner, Ingo Molnar, linux-kernel, Dave Hansen,
	H . Peter Anvin, Peter Zijlstra, Linus Torvalds

On Sat, Nov 25, 2017 at 9:10 AM, Borislav Petkov <bp@alien8.de> wrote:
> On Sat, Nov 25, 2017 at 06:03:36PM +0100, Thomas Gleixner wrote:
>> > Maybe I should rename cpu_tss to cpu_tss_rw in that patch.
>>
>> For clarity that would be nice.
>
> + a comment stating the alias mapping. It took tglx and me a while on
> IRC to figure it out. :-)

I'm adding this comment to the top of cpu_entry_area:

+ * Every field is a virtual alias of some other allocated backing store.
+ * There is no direct allocation of a struct cpu_entry_area.

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 42/43] x86/mm/kaiser: Allow KAISER to be enabled/disabled at runtime
  2017-11-24 17:24 ` [PATCH 42/43] x86/mm/kaiser: Allow KAISER to be enabled/disabled " Ingo Molnar
@ 2017-11-25 19:18   ` Thomas Gleixner
  2017-11-25 19:53     ` Andy Lutomirski
  0 siblings, 1 reply; 113+ messages in thread
From: Thomas Gleixner @ 2017-11-25 19:18 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Dave Hansen, Andy Lutomirski, H . Peter Anvin,
	Peter Zijlstra, Borislav Petkov, Linus Torvalds

On Fri, 24 Nov 2017, Ingo Molnar wrote:

> From: Dave Hansen <dave.hansen@linux.intel.com>
> 
> The KAISER CR3 switches are expensive for many reasons.  Not all systems
> benefit from the protection provided by KAISER.  Some of them can not
> pay the high performance cost.
> 
> This patch adds a debugfs file.  To disable KAISER, you do:
> 
> 	echo 0 > /sys/kernel/debug/x86/kaiser-enabled
> 
> and to re-enable it, you can:
> 
> 	echo 1 > /sys/kernel/debug/x86/kaiser-enabled
> 
> This is a *minimal* implementation.  There are certainly plenty of
> optimizations that can be done on top of this by using ALTERNATIVES
> among other things.

It's not only minimal. It's naive and broken. That thing explodes when
toggled in the wrong moment. I did not even attempt to debug that, because
I think the approach is wrong.

If you really want to make it runtime switchable, then:

 - the shadow tables need to be updated unconditionally. I did not check
   whether thats done right now, but explosions are simpler to achieve when
   switching it back on. Though switching it off crashes as well.

 - you need to make sure that no task is in user space or on the way to it.
   The much I hate stop_machine(), that's probably the right tool.
   Once everything is in stomp_machine() the switch can be flipped.

 - the poisoning/unpoisoning of the kernel tables does not need to be done
   from stop_machine(). That can be done from regular context with a TIF
   flag, so you can make sure that every task is up to date before
   returning to user space. Though that needs a lot of thought.

For now I really want to see that removed entirely and replaced by a simple
boot time switch. We can use the global variable for now and optimize it
later on.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 42/43] x86/mm/kaiser: Allow KAISER to be enabled/disabled at runtime
  2017-11-25 19:18   ` Thomas Gleixner
@ 2017-11-25 19:53     ` Andy Lutomirski
  2017-11-25 20:05       ` Thomas Gleixner
  0 siblings, 1 reply; 113+ messages in thread
From: Andy Lutomirski @ 2017-11-25 19:53 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Ingo Molnar, linux-kernel, Dave Hansen, H . Peter Anvin,
	Peter Zijlstra, Borislav Petkov, Linus Torvalds



> On Nov 25, 2017, at 12:18 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
> 
>> On Fri, 24 Nov 2017, Ingo Molnar wrote:
>> 
>> From: Dave Hansen <dave.hansen@linux.intel.com>
>> 
>> The KAISER CR3 switches are expensive for many reasons.  Not all systems
>> benefit from the protection provided by KAISER.  Some of them can not
>> pay the high performance cost.
>> 
>> This patch adds a debugfs file.  To disable KAISER, you do:
>> 
>>    echo 0 > /sys/kernel/debug/x86/kaiser-enabled
>> 
>> and to re-enable it, you can:
>> 
>>    echo 1 > /sys/kernel/debug/x86/kaiser-enabled
>> 
>> This is a *minimal* implementation.  There are certainly plenty of
>> optimizations that can be done on top of this by using ALTERNATIVES
>> among other things.
> 
> It's not only minimal. It's naive and broken. That thing explodes when
> toggled in the wrong moment. I did not even attempt to debug that, because
> I think the approach is wrong.
> 
> If you really want to make it runtime switchable, then:
> 
> - the shadow tables need to be updated unconditionally. I did not check
>   whether thats done right now, but explosions are simpler to achieve when
>   switching it back on. Though switching it off crashes as well.
> 
> - you need to make sure that no task is in user space or on the way to it.
>   The much I hate stop_machine(), that's probably the right tool.
>   Once everything is in stomp_machine() the switch can be flipped.
> 
> - the poisoning/unpoisoning of the kernel tables does not need to be done
>   from stop_machine(). That can be done from regular context with a TIF
>   flag, so you can make sure that every task is up to date before
>   returning to user space. Though that needs a lot of thought.
> 
> For now I really want to see that removed entirely and replaced by a simple
> boot time switch. We can use the global variable for now and optimize it
> later on.
> 

Nah, let's do it right: use either X86_FEATURE_WHATEVER or a static_branch.  We have nice asm support for both.

Keep in mind that, for a static_branch, actually setting the thing needs to be deferred, but that's straightforward.

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 42/43] x86/mm/kaiser: Allow KAISER to be enabled/disabled at runtime
  2017-11-25 19:53     ` Andy Lutomirski
@ 2017-11-25 20:05       ` Thomas Gleixner
  2017-11-25 22:10         ` Andy Lutomirski
  0 siblings, 1 reply; 113+ messages in thread
From: Thomas Gleixner @ 2017-11-25 20:05 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Ingo Molnar, linux-kernel, Dave Hansen, H . Peter Anvin,
	Peter Zijlstra, Borislav Petkov, Linus Torvalds

On Sat, 25 Nov 2017, Andy Lutomirski wrote:
> > On Nov 25, 2017, at 12:18 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
> >> On Fri, 24 Nov 2017, Ingo Molnar wrote:
> >> 
> >> From: Dave Hansen <dave.hansen@linux.intel.com>
> >> 
> >> The KAISER CR3 switches are expensive for many reasons.  Not all systems
> >> benefit from the protection provided by KAISER.  Some of them can not
> >> pay the high performance cost.
> >> 
> >> This patch adds a debugfs file.  To disable KAISER, you do:
> >> 
> >>    echo 0 > /sys/kernel/debug/x86/kaiser-enabled
> >> 
> >> and to re-enable it, you can:
> >> 
> >>    echo 1 > /sys/kernel/debug/x86/kaiser-enabled
> >> 
> >> This is a *minimal* implementation.  There are certainly plenty of
> >> optimizations that can be done on top of this by using ALTERNATIVES
> >> among other things.
> > 
> > It's not only minimal. It's naive and broken. That thing explodes when
> > toggled in the wrong moment. I did not even attempt to debug that, because
> > I think the approach is wrong.
> > 
> > If you really want to make it runtime switchable, then:
> > 
> > - the shadow tables need to be updated unconditionally. I did not check
> >   whether thats done right now, but explosions are simpler to achieve when
> >   switching it back on. Though switching it off crashes as well.
> > 
> > - you need to make sure that no task is in user space or on the way to it.
> >   The much I hate stop_machine(), that's probably the right tool.
> >   Once everything is in stomp_machine() the switch can be flipped.
> > 
> > - the poisoning/unpoisoning of the kernel tables does not need to be done
> >   from stop_machine(). That can be done from regular context with a TIF
> >   flag, so you can make sure that every task is up to date before
> >   returning to user space. Though that needs a lot of thought.
> > 
> > For now I really want to see that removed entirely and replaced by a simple
> > boot time switch. We can use the global variable for now and optimize it
> > later on.
> > 
> 
> Nah, let's do it right: use either X86_FEATURE_WHATEVER or a
> static_branch.  We have nice asm support for both.

Agreed.

> Keep in mind that, for a static_branch, actually setting the thing needs
> to be deferred, but that's straightforward.

That's not an issue during boot. That would be an issue for a run time
switch.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH] x86/mm/kaiser: Fix IRQ entries text section mapping
  2017-11-25 16:08                 ` Thomas Gleixner
@ 2017-11-25 20:06                   ` Steven Rostedt
  2017-11-27  8:14                   ` Peter Zijlstra
  1 sibling, 0 replies; 113+ messages in thread
From: Steven Rostedt @ 2017-11-25 20:06 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Ingo Molnar, Andy Lutomirski, Dave Hansen, Josh Poimboeuf, LKML,
	Andy Lutomirski, H . Peter Anvin, Peter Zijlstra,
	Borislav Petkov, Linus Torvalds, Andrey Ryabinin, kasan-dev,
	Masami Hiramatsu

On Sat, 25 Nov 2017 17:08:08 +0100 (CET)
Thomas Gleixner <tglx@linutronix.de> wrote:

> Tracing uses it to stop the printout of the function graph. Should be safe
> to print the C-Code functions.

It uses it to filter out interrupts, as well as to print the:

"<=========="

"==========>"

annotations. Since function tracing only traces C functions, this will
be lost.

That said, we could possibly replace that check with keeping track
of the hardirq state in the "flags" field of the trace event entry.

-- Steve

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 28/43] x86/mm/kaiser: Map cpu entry area
  2017-11-24 17:23 ` [PATCH 28/43] x86/mm/kaiser: Map cpu entry area Ingo Molnar
@ 2017-11-25 21:40   ` Thomas Gleixner
  2017-11-26 15:19     ` Ingo Molnar
  0 siblings, 1 reply; 113+ messages in thread
From: Thomas Gleixner @ 2017-11-25 21:40 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Dave Hansen, Andy Lutomirski, H . Peter Anvin,
	Peter Zijlstra, Borislav Petkov, Linus Torvalds

On Fri, 24 Nov 2017, Ingo Molnar wrote:
> +void kaiser_add_mapping_cpu_entry(int cpu)
> +{
> +	kaiser_add_user_map_early(get_cpu_gdt_ro(cpu), PAGE_SIZE,
> +				  __PAGE_KERNEL_RO);
> +
> +	/* includes the entry stack */
> +	kaiser_add_user_map_early(&get_cpu_entry_area(cpu)->tss,
> +				  sizeof(get_cpu_entry_area(cpu)->tss),
> +				  __PAGE_KERNEL | _PAGE_GLOBAL);
> +
> +	/* Entry code, so needs to be EXEC */
> +	kaiser_add_user_map_early(&get_cpu_entry_area(cpu)->entry_trampoline,
> +				  sizeof(get_cpu_entry_area(cpu)->entry_trampoline),
> +				  __PAGE_KERNEL_EXEC | _PAGE_GLOBAL);

This creates a RWX mapping and wants to be __PAGE_KERNEL_RX!

Whats weird is that I saw dump_walk_pgd_level_checkwx() complain once about
that, but on the next boot it failed to detect it for whatever reason. This
needs further investigation.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 42/43] x86/mm/kaiser: Allow KAISER to be enabled/disabled at runtime
  2017-11-25 20:05       ` Thomas Gleixner
@ 2017-11-25 22:10         ` Andy Lutomirski
  2017-11-25 22:48           ` Thomas Gleixner
  0 siblings, 1 reply; 113+ messages in thread
From: Andy Lutomirski @ 2017-11-25 22:10 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Ingo Molnar, linux-kernel, Dave Hansen, H . Peter Anvin,
	Peter Zijlstra, Borislav Petkov, Linus Torvalds



> On Nov 25, 2017, at 1:05 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
> 
> On Sat, 25 Nov 2017, Andy Lutomirski wrote:
>>>> On Nov 25, 2017, at 12:18 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
>>>> On Fri, 24 Nov 2017, Ingo Molnar wrote:
>>>> 
>>>> From: Dave Hansen <dave.hansen@linux.intel.com>
>>>> 
>>>> The KAISER CR3 switches are expensive for many reasons.  Not all systems
>>>> benefit from the protection provided by KAISER.  Some of them can not
>>>> pay the high performance cost.
>>>> 
>>>> This patch adds a debugfs file.  To disable KAISER, you do:
>>>> 
>>>>   echo 0 > /sys/kernel/debug/x86/kaiser-enabled
>>>> 
>>>> and to re-enable it, you can:
>>>> 
>>>>   echo 1 > /sys/kernel/debug/x86/kaiser-enabled
>>>> 
>>>> This is a *minimal* implementation.  There are certainly plenty of
>>>> optimizations that can be done on top of this by using ALTERNATIVES
>>>> among other things.
>>> 
>>> It's not only minimal. It's naive and broken. That thing explodes when
>>> toggled in the wrong moment. I did not even attempt to debug that, because
>>> I think the approach is wrong.
>>> 
>>> If you really want to make it runtime switchable, then:
>>> 
>>> - the shadow tables need to be updated unconditionally. I did not check
>>>  whether thats done right now, but explosions are simpler to achieve when
>>>  switching it back on. Though switching it off crashes as well.
>>> 
>>> - you need to make sure that no task is in user space or on the way to it.
>>>  The much I hate stop_machine(), that's probably the right tool.
>>>  Once everything is in stomp_machine() the switch can be flipped.
>>> 
>>> - the poisoning/unpoisoning of the kernel tables does not need to be done
>>>  from stop_machine(). That can be done from regular context with a TIF
>>>  flag, so you can make sure that every task is up to date before
>>>  returning to user space. Though that needs a lot of thought.
>>> 
>>> For now I really want to see that removed entirely and replaced by a simple
>>> boot time switch. We can use the global variable for now and optimize it
>>> later on.
>>> 
>> 
>> Nah, let's do it right: use either X86_FEATURE_WHATEVER or a
>> static_branch.  We have nice asm support for both.
> 
> Agreed.
> 
>> Keep in mind that, for a static_branch, actually setting the thing needs
>> to be deferred, but that's straightforward.
> 
> That's not an issue during boot. That would be an issue for a run time
> switch.

What I mean is: if you modify a static_branch too early, it blows up terribly.

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 42/43] x86/mm/kaiser: Allow KAISER to be enabled/disabled at runtime
  2017-11-25 22:10         ` Andy Lutomirski
@ 2017-11-25 22:48           ` Thomas Gleixner
  2017-11-26  0:21             ` Andy Lutomirski
  0 siblings, 1 reply; 113+ messages in thread
From: Thomas Gleixner @ 2017-11-25 22:48 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Ingo Molnar, linux-kernel, Dave Hansen, H . Peter Anvin,
	Peter Zijlstra, Borislav Petkov, Linus Torvalds

On Sat, 25 Nov 2017, Andy Lutomirski wrote:
> > On Nov 25, 2017, at 1:05 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
> > On Sat, 25 Nov 2017, Andy Lutomirski wrote:
> >> Keep in mind that, for a static_branch, actually setting the thing needs
> >> to be deferred, but that's straightforward.
> > 
> > That's not an issue during boot. That would be an issue for a run time
> > switch.
> 
> What I mean is: if you modify a static_branch too early, it blows up terribly.

I'm aware of that. We can't switch it in the early boot stage. But that
does not matter as we can switch way before we reach user space.

The early kaiser mappings are fine whether we use them later or not. At the
point in boot where we actually make the decision, there is nothing more
than the extra 4k shadow which got initialized.

If we ever want to do runtime switching, then the full shadow mapping needs
to be maintained even while kaiser is disabled, just the NX poisoning of
the user space mappings is what makes the difference.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 42/43] x86/mm/kaiser: Allow KAISER to be enabled/disabled at runtime
  2017-11-25 22:48           ` Thomas Gleixner
@ 2017-11-26  0:21             ` Andy Lutomirski
  2017-11-26  8:11               ` Thomas Gleixner
  0 siblings, 1 reply; 113+ messages in thread
From: Andy Lutomirski @ 2017-11-26  0:21 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Ingo Molnar, linux-kernel, Dave Hansen, H . Peter Anvin,
	Peter Zijlstra, Borislav Petkov, Linus Torvalds

On Sat, Nov 25, 2017 at 2:48 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
> On Sat, 25 Nov 2017, Andy Lutomirski wrote:
>> > On Nov 25, 2017, at 1:05 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
>> > On Sat, 25 Nov 2017, Andy Lutomirski wrote:
>> >> Keep in mind that, for a static_branch, actually setting the thing needs
>> >> to be deferred, but that's straightforward.
>> >
>> > That's not an issue during boot. That would be an issue for a run time
>> > switch.
>>
>> What I mean is: if you modify a static_branch too early, it blows up terribly.
>
> I'm aware of that. We can't switch it in the early boot stage. But that
> does not matter as we can switch way before we reach user space.
>
> The early kaiser mappings are fine whether we use them later or not. At the
> point in boot where we actually make the decision, there is nothing more
> than the extra 4k shadow which got initialized.
>
> If we ever want to do runtime switching, then the full shadow mapping needs
> to be maintained even while kaiser is disabled, just the NX poisoning of
> the user space mappings is what makes the difference.

One unfortunate thing is that, if we boot with kaiser off and don't
intend to ever switch it on at runtime, we could avoid the 8k pgd
allocations.  But if we want to be able to enable kaiser, we need the
8k mappings.  My inclination is to not try for runtime control until
some distro asks for it.

In general, I think that trying to runtime switch without stop_machine
is a bit nuts, and getting it to be reliable even with stop_machine is
gross.  Not to mention that stop_machine is currently incompatible
with writing to static branches, although that's fixable.

--Andy

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 42/43] x86/mm/kaiser: Allow KAISER to be enabled/disabled at runtime
  2017-11-26  0:21             ` Andy Lutomirski
@ 2017-11-26  8:11               ` Thomas Gleixner
  0 siblings, 0 replies; 113+ messages in thread
From: Thomas Gleixner @ 2017-11-26  8:11 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Ingo Molnar, linux-kernel, Dave Hansen, H . Peter Anvin,
	Peter Zijlstra, Borislav Petkov, Linus Torvalds

On Sat, 25 Nov 2017, Andy Lutomirski wrote:
> On Sat, Nov 25, 2017 at 2:48 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
> > On Sat, 25 Nov 2017, Andy Lutomirski wrote:
> >> > On Nov 25, 2017, at 1:05 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
> >> > On Sat, 25 Nov 2017, Andy Lutomirski wrote:
> >> >> Keep in mind that, for a static_branch, actually setting the thing needs
> >> >> to be deferred, but that's straightforward.
> >> >
> >> > That's not an issue during boot. That would be an issue for a run time
> >> > switch.
> >>
> >> What I mean is: if you modify a static_branch too early, it blows up terribly.
> >
> > I'm aware of that. We can't switch it in the early boot stage. But that
> > does not matter as we can switch way before we reach user space.
> >
> > The early kaiser mappings are fine whether we use them later or not. At the
> > point in boot where we actually make the decision, there is nothing more
> > than the extra 4k shadow which got initialized.
> >
> > If we ever want to do runtime switching, then the full shadow mapping needs
> > to be maintained even while kaiser is disabled, just the NX poisoning of
> > the user space mappings is what makes the difference.
> 
> One unfortunate thing is that, if we boot with kaiser off and don't
> intend to ever switch it on at runtime, we could avoid the 8k pgd
> allocations.  But if we want to be able to enable kaiser, we need the
> 8k mappings.  My inclination is to not try for runtime control until
> some distro asks for it.

I completely agree. boot time is good enough. We should start with the 8k
allocations for simplicity reasons and when that works have a patch on top
which switches is back to 4k.

> In general, I think that trying to runtime switch without stop_machine
> is a bit nuts, and getting it to be reliable even with stop_machine is
> gross.  Not to mention that stop_machine is currently incompatible
> with writing to static branches, although that's fixable.

Yes, it's doable, but surely not trivial and I'd like to avoid the mess it
creates.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 22/43] x86/mm/kaiser: Prepare assembly for entry/exit CR3 switching
  2017-11-24 17:23 ` [PATCH 22/43] x86/mm/kaiser: Prepare assembly for entry/exit CR3 switching Ingo Molnar
  2017-11-25  0:02   ` Thomas Gleixner
@ 2017-11-26 11:50   ` Borislav Petkov
  2017-11-26 14:55     ` [PATCH v2] x86/mm/kaiser: Prepare the x86/entry assembly code " Ingo Molnar
  1 sibling, 1 reply; 113+ messages in thread
From: Borislav Petkov @ 2017-11-26 11:50 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Dave Hansen, Andy Lutomirski, Thomas Gleixner,
	H . Peter Anvin, Peter Zijlstra, Linus Torvalds

On Fri, Nov 24, 2017 at 06:23:50PM +0100, Ingo Molnar wrote:
> diff --git a/arch/x86/entry/calling.h b/arch/x86/entry/calling.h
> index 3fd8bc560fae..e1650da01323 100644
> --- a/arch/x86/entry/calling.h
> +++ b/arch/x86/entry/calling.h
> @@ -1,6 +1,7 @@
>  /* SPDX-License-Identifier: GPL-2.0 */
>  #include <linux/jump_label.h>
>  #include <asm/unwind_hints.h>
> +#include <asm/cpufeatures.h>
>  
>  /*
>  
> @@ -187,6 +188,70 @@ For 32-bit we have the following conventions - kernel is built with
>  #endif
>  .endm
>  
> +#ifdef CONFIG_KAISER
> +
> +/* KAISER PGDs are 8k.  Flip bit 12 to switch between the two halves: */
> +#define KAISER_SWITCH_MASK (1<<PAGE_SHIFT)

Btw, entry_64.o doesn't build when at this patch if you force-enable
CONFIG_KAISER by doing

#define CONFIG_KAISER

above it.

arch/x86/entry/entry_64.S: Assembler messages:
arch/x86/entry/entry_64.S:210: Error: invalid operands (*ABS* and *UND* sections) for `<<'
arch/x86/entry/entry_64.S:406: Error: invalid operands (*ABS* and *UND* sections) for `<<'
arch/x86/entry/entry_64.S:743: Error: invalid operands (*ABS* and *UND* sections) for `<<'
arch/x86/entry/entry_64.S:955: Error: invalid operands (*ABS* and *UND* sections) for `<<'
...

due to the missing PAGE_SHIFT definition in asm.

I'm assuming that is resolved later - not that it breaks bisection...

> +.macro SAVE_AND_SWITCH_TO_KERNEL_CR3 scratch_reg:req save_reg:req
> +	movq	%cr3, %r\scratch_reg
> +	movq	%r\scratch_reg, \save_reg

What happened to making it uniform so that that macro can be invoked
like this:

	SAVE_AND_SWITCH_TO_KERNEL_CR3 scratch_reg=%rax ...

instead of "splitting" the arg?

IOW, hunk below builds here, and asm looks correct:

    14bf:       31 db                   xor    %ebx,%ebx
    14c1:       0f 20 d8                mov    %cr3,%rax
    14c4:       49 89 c6                mov    %rax,%r14
    14c7:       48 a9 00 10 00 00       test   $0x1000,%rax
    14cd:       74 09                   je     14d8 <paranoid_entry+0x78>
    14cf:       48 25 ff ef ff ff       and    $0xffffffffffffefff,%rax
    14d5:       0f 22 d8                mov    %rax,%cr3

---
diff --git a/arch/x86/entry/calling.h b/arch/x86/entry/calling.h
index e1650da01323..d528f7060774 100644
--- a/arch/x86/entry/calling.h
+++ b/arch/x86/entry/calling.h
@@ -188,10 +188,12 @@ For 32-bit we have the following conventions - kernel is built with
 #endif
 .endm
 
+#define CONFIG_KAISER
+
 #ifdef CONFIG_KAISER
 
 /* KAISER PGDs are 8k.  Flip bit 12 to switch between the two halves: */
-#define KAISER_SWITCH_MASK (1<<PAGE_SHIFT)
+#define KAISER_SWITCH_MASK (1<<12)
 
 .macro ADJUST_KERNEL_CR3 reg:req
 	/* Clear "KAISER bit", point CR3 at kernel pagetables: */
@@ -216,17 +218,17 @@ For 32-bit we have the following conventions - kernel is built with
 .endm
 
 .macro SAVE_AND_SWITCH_TO_KERNEL_CR3 scratch_reg:req save_reg:req
-	movq	%cr3, %r\scratch_reg
-	movq	%r\scratch_reg, \save_reg
+	movq	%cr3, \scratch_reg
+	movq	\scratch_reg, \save_reg
 	/*
 	 * Is the switch bit zero?  This means the address is
 	 * up in real KAISER patches in a moment.
 	 */
-	testq	$(KAISER_SWITCH_MASK), %r\scratch_reg
+	testq	$(KAISER_SWITCH_MASK), \scratch_reg
 	jz	.Ldone_\@
 
-	ADJUST_KERNEL_CR3 %r\scratch_reg
-	movq	%r\scratch_reg, %cr3
+	ADJUST_KERNEL_CR3 \scratch_reg
+	movq	\scratch_reg, %cr3
 
 .Ldone_\@:
 .endm
diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 4ac952080869..5a15d0852b2f 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -1256,7 +1256,7 @@ ENTRY(paranoid_entry)
 	xorl	%ebx, %ebx
 
 1:
-	SAVE_AND_SWITCH_TO_KERNEL_CR3 scratch_reg=ax save_reg=%r14
+	SAVE_AND_SWITCH_TO_KERNEL_CR3 scratch_reg=%rax save_reg=%r14
 
 	ret
 END(paranoid_entry)

> diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
> index 34e3110b0876..4ac952080869 100644
> --- a/arch/x86/entry/entry_64.S
> +++ b/arch/x86/entry/entry_64.S
> @@ -168,6 +168,9 @@ ENTRY(entry_SYSCALL_64_trampoline)
>  	/* Stash the user RSP. */
>  	movq	%rsp, RSP_SCRATCH
>  
> +	/* Note: using %rsp as a scratch reg. */

Haha, yap, it just got freed :)

> +	SWITCH_TO_KERNEL_CR3 scratch_reg=%rsp
> +
>  	/* Load the top of the task stack into RSP */
>  	movq	CPU_ENTRY_AREA_tss + TSS_sp1 + CPU_ENTRY_AREA, %rsp
>  
> @@ -198,6 +201,13 @@ ENTRY(entry_SYSCALL_64)
>  
>  	swapgs
>  	movq	%rsp, PER_CPU_VAR(rsp_scratch)

<---- newline here.

> +	/*
> +	 * The kernel CR3 is needed to map the process stack, but we
> +	 * need a scratch register to be able to load CR3.  %rsp is
> +	 * clobberable right now, so use it as a scratch register.
> +	 * %rsp will be look crazy here for a couple instructions.

s/be // or "will be looking crazy" :-)

> +	 */
> +	SWITCH_TO_KERNEL_CR3 scratch_reg=%rsp

Now, this is questionable: we did enter through the trampoline
entry_SYSCALL_64_trampoline so theoretically, we wouldn't need to switch
to CR3 here again because, well, we did already.

I.e., entry_SYSCALL_64 is not going to be called anymore. Unless we will
jump to it when we decide to jump over the trampolines in the kaiser
disabled case. Just pointing it out here so that we don't forget to deal
with this...

> @@ -1239,7 +1254,11 @@ ENTRY(paranoid_entry)
>  	js	1f				/* negative -> in kernel */
>  	SWAPGS
>  	xorl	%ebx, %ebx
> -1:	ret
> +
> +1:
> +	SAVE_AND_SWITCH_TO_KERNEL_CR3 scratch_reg=ax save_reg=%r14
> +
> +	ret
>  END(paranoid_entry)
>  
>  /*
> @@ -1261,6 +1280,7 @@ ENTRY(paranoid_exit)
>  	testl	%ebx, %ebx			/* swapgs needed? */
>  	jnz	.Lparanoid_exit_no_swapgs
>  	TRACE_IRQS_IRETQ
> +	RESTORE_CR3	%r14

	RESTORE_CR3 save_reg=%r14

like the other invocation below.

But if the runtime disable gets changed to a boottime one, you don't
need that macro anymore.

>  	SWAPGS_UNSAFE_STACK
>  	jmp	.Lparanoid_exit_restore
>  .Lparanoid_exit_no_swapgs:

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 12/43] x86/espfix/64: Stop assuming that pt_regs is on the entry stack
  2017-11-24 19:12     ` Andy Lutomirski
@ 2017-11-26 14:05       ` Ingo Molnar
  2017-11-26 17:28         ` Borislav Petkov
  0 siblings, 1 reply; 113+ messages in thread
From: Ingo Molnar @ 2017-11-26 14:05 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Borislav Petkov, linux-kernel, Dave Hansen, Thomas Gleixner,
	H . Peter Anvin, Peter Zijlstra, Linus Torvalds


* Andy Lutomirski <luto@kernel.org> wrote:

> On Fri, Nov 24, 2017 at 10:25 AM, Borislav Petkov <bp@alien8.de> wrote:
> > On Fri, Nov 24, 2017 at 06:23:40PM +0100, Ingo Molnar wrote:
> >> From: Andy Lutomirski <luto@kernel.org>
> >>
> >> When we start using an entry trampoline, a #GP from userspace will
> >> be delivered on the entry stack, not on the task stack.  Fix the
> >> espfix64 #DF fixup to set up #GP according to TSS.SP0, rather than
> >> assuming that pt_regs + 1 == SP0.  This won't change anything
> >> without an entry stack, but it will make the code continue to work
> >> when an entry stack is added.
> >>
> >> Signed-off-by: Andy Lutomirski <luto@kernel.org>
> >> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> >> Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
> >> Cc: Borislav Petkov <bpetkov@suse.de>
> >> Cc: Brian Gerst <brgerst@gmail.com>
> >> Cc: Dave Hansen <dave.hansen@intel.com>
> >> Cc: Josh Poimboeuf <jpoimboe@redhat.com>
> >> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> >> Cc: Peter Zijlstra <peterz@infradead.org>
> >> Link: https://lkml.kernel.org/r/b1ef4136616c6bd2a75d1fd2736d1d54437d65a8.1511497875.git.luto@kernel.org
> >> Signed-off-by: Ingo Molnar <mingo@kernel.org>
> >> ---
> >>  arch/x86/kernel/traps.c | 5 +++--
> >>  1 file changed, 3 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
> >> index 2008dd0f8ccb..1bd43f044c62 100644
> >> --- a/arch/x86/kernel/traps.c
> >> +++ b/arch/x86/kernel/traps.c
> >> @@ -359,7 +359,8 @@ dotraplinkage void do_double_fault(struct pt_regs *regs, long error_code)
> >>               regs->cs == __KERNEL_CS &&
> >>               regs->ip == (unsigned long)native_irq_return_iret)
> >>       {
> >> -             struct pt_regs *normal_regs = task_pt_regs(current);
> >> +             struct pt_regs *normal_regs =
> >> +                     (struct pt_regs *)this_cpu_read(cpu_tss.x86_tss.sp0) - 1;
> >
> > Just let that line stick out. Also, you can shorten it by renaming
> > normal_regs to something much shorter - it is a local variable and the
> > comment already explains everything you you can just as well have:
> >
> >          struct pt_regs *r = (struct pt_regs *)this_cpu_read(cpu_tss.x86_tss.sp0) - 1;
> >
> 
> Done, along with much better comments.
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/commit/?h=x86/entry_stack&id=6510485d026abdd144d170b1bc8508327acb5648

I have added Boris's reviewed-by tag:

  Reviewed-by: Borislav Petkov <bp@suse.de>

which I suspect applies now?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 13/43] x86/asm/64: Use a percpu trampoline stack for IDT entries
  2017-11-24 19:02   ` Borislav Petkov
@ 2017-11-26 14:16     ` Ingo Molnar
  0 siblings, 0 replies; 113+ messages in thread
From: Ingo Molnar @ 2017-11-26 14:16 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: linux-kernel, Dave Hansen, Andy Lutomirski, Thomas Gleixner,
	H . Peter Anvin, Peter Zijlstra, Linus Torvalds


* Borislav Petkov <bp@alien8.de> wrote:

> On Fri, Nov 24, 2017 at 06:23:41PM +0100, Ingo Molnar wrote:
> > From: Andy Lutomirski <luto@kernel.org>
> > 
> > Historically, IDT entries from usermode have always gone directly
> > to the running task's kernel stack.  Rearrange it so that we enter on
> > a percpu trampoline stack and then manually switch to the task's stack.
> > This touches a couple of extra cachelines, but it gives us a chance
> > to run some code before we touch the kernel stack.
> > 
> > The asm isn't exactly beautiful, but I think that fully refactoring
> > it can wait.
> > 
> > Signed-off-by: Andy Lutomirski <luto@kernel.org>
> > Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> 
> I think you mean Reviewed-by: here. The following patches have it too.
> 
> > Cc: Borislav Petkov <bpetkov@suse.de>
> > Cc: Brian Gerst <brgerst@gmail.com>
> > Cc: Dave Hansen <dave.hansen@intel.com>
> > Cc: Josh Poimboeuf <jpoimboe@redhat.com>
> > Cc: Linus Torvalds <torvalds@linux-foundation.org>
> > Cc: Peter Zijlstra <peterz@infradead.org>
> > Link: https://lkml.kernel.org/r/fa3958723a1a85baeaf309c735b775841205800e.1511497875.git.luto@kernel.org
> > Signed-off-by: Ingo Molnar <mingo@kernel.org>
> > ---
> >  arch/x86/entry/entry_64.S        | 67 ++++++++++++++++++++++++++++++----------
> >  arch/x86/entry/entry_64_compat.S |  5 ++-
> >  arch/x86/include/asm/switch_to.h |  2 +-
> >  arch/x86/include/asm/traps.h     |  1 -
> >  arch/x86/kernel/cpu/common.c     |  6 ++--
> >  arch/x86/kernel/traps.c          | 18 +++++------
> >  6 files changed, 68 insertions(+), 31 deletions(-)
> 
> ...
> 
> > diff --git a/arch/x86/include/asm/switch_to.h b/arch/x86/include/asm/switch_to.h
> > index 8c6bd6863db9..a6796ac8d311 100644
> > --- a/arch/x86/include/asm/switch_to.h
> > +++ b/arch/x86/include/asm/switch_to.h
> > @@ -93,7 +93,7 @@ static inline void update_sp0(struct task_struct *task)
> >  #ifdef CONFIG_X86_32
> >  	load_sp0(task->thread.sp0);
> >  #else
> > -	load_sp0(task_top_of_stack(task));
> > +	/* On x86_64, sp0 always points to the entry trampoline stack. */
> >  #endif
> 
> You can put this comment to the one ontop and remove the #else.
> ifdeffery is always ugly and the less, the better.

Ok, I made it:

        /* On x86_64, sp0 always points to the entry trampoline stack, which is constant: */
#ifdef CONFIG_X86_32
        load_sp0(task->thread.sp0);
#endif

> >  	/*
> >  	 * This is called from entry_64.S early in handling a fault
> >  	 * caused by a bad iret to user mode.  To handle the fault
> > -	 * correctly, we want move our stack frame to task_pt_regs
> > -	 * and we want to pretend that the exception came from the
> > -	 * iret target.
> > +	 * correctly, we want move our stack frame to where it would
> 
> " ... we want to move... "
> 
> > +	 * be had we entered directly on the entry stack (rather than
> > +	 * just below the IRET frame) and we want to pretend that the
> > +	 * exception came from the iret target.
> 
> s/iret/IRET/
> 
> >  	 */
> >  	struct bad_iret_stack *new_stack =
> > -		container_of(task_pt_regs(current),
> > -			     struct bad_iret_stack, regs);
> > +		(struct bad_iret_stack *)this_cpu_read(cpu_tss.x86_tss.sp0) - 1;
> >  
> >  	/* Copy the IRET target to the new stack. */
> >  	memmove(&new_stack->regs.ip, (void *)s->regs.sp, 5*8);
> > -- 
> 
> with that:
> 
> Reviewed-by: Borislav Petkov <bp@suse.de>

Fixed those details and added your tag, thanks Boris!

	Ingo

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 14/43] x86/asm/64: Return to userspace from the trampoline stack
  2017-11-24 19:10   ` Borislav Petkov
@ 2017-11-26 14:18     ` Ingo Molnar
  2017-11-26 17:33       ` Borislav Petkov
  0 siblings, 1 reply; 113+ messages in thread
From: Ingo Molnar @ 2017-11-26 14:18 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: linux-kernel, Dave Hansen, Andy Lutomirski, Thomas Gleixner,
	H . Peter Anvin, Peter Zijlstra, Linus Torvalds


* Borislav Petkov <bp@alien8.de> wrote:

> On Fri, Nov 24, 2017 at 06:23:42PM +0100, Ingo Molnar wrote:
> > From: Andy Lutomirski <luto@kernel.org>
> > 
> > By itself, this is useless.  It gives us the ability to run some final
> > code before exit that cannnot run on the kernel stack.  This could
> > include a CR3 switch a la KAISER or some kernel stack erasing, for
> > example.  (Or even weird things like *changing* which kernel stack
> > gets used as an ASLR-strengthening mechanism.)
> > 
> > The SYSRET32 path is not covered yet.  It could be in the future or
> > we could just ignore it and force the slow path if needed.
> > 
> > Signed-off-by: Andy Lutomirski <luto@kernel.org>
> > Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> > Cc: Borislav Petkov <bpetkov@suse.de>
> > Cc: Brian Gerst <brgerst@gmail.com>
> > Cc: Dave Hansen <dave.hansen@intel.com>
> > Cc: Josh Poimboeuf <jpoimboe@redhat.com>
> > Cc: Linus Torvalds <torvalds@linux-foundation.org>
> > Cc: Peter Zijlstra <peterz@infradead.org>
> > Link: https://lkml.kernel.org/r/d350017000eed20922c3b2711a2d9229dc809256.1511497875.git.luto@kernel.org
> > Signed-off-by: Ingo Molnar <mingo@kernel.org>
> > ---
> >  arch/x86/entry/entry_64.S | 55 +++++++++++++++++++++++++++++++++++++++++++----
> >  1 file changed, 51 insertions(+), 4 deletions(-)
> 
> Nice commenting, future generations will appreciate it!
> 
> :-)
> 
> Reviewed-by: Borislav Petkov <bp@suse.de>

Heh, indeed - if by 'future generations' you mean 'the guys who merge it,
a few months down the line'. ;-)

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 113+ messages in thread

* [PATCH] x86/entry/64: Create a percpu SYSCALL entry trampoline
  2017-11-25 15:00     ` Andy Lutomirski
@ 2017-11-26 14:26       ` Ingo Molnar
  0 siblings, 0 replies; 113+ messages in thread
From: Ingo Molnar @ 2017-11-26 14:26 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Borislav Petkov, linux-kernel, Dave Hansen, Thomas Gleixner,
	H . Peter Anvin, Peter Zijlstra, Linus Torvalds


* Andy Lutomirski <luto@amacapital.net> wrote:

> On Sat, Nov 25, 2017 at 3:40 AM, Borislav Petkov <bp@alien8.de> wrote:
> > On Fri, Nov 24, 2017 at 06:23:43PM +0100, Ingo Molnar wrote:
> >> From: Andy Lutomirski <luto@kernel.org>
> >>
> >> Handling SYSCALL is tricky: the SYSCALL handler is entered with every
> >> single register (except FLAGS), including RSP, live.  It somehow needs
> >> to set RSP to point to a valid stack, which means it needs to save the
> >> user RSP somewhere and find its own stack pointer.  The canonical way
> >> to do this is with SWAPGS, which lets us access percpu data using the
> >> %gs prefix.
> >>
> >> With KAISER-like pagetable switching, this is problematic.  Without a
> >> scratch register, switching CR3 is impossible, so %gs-based percpu
> >> memory would need to be mapped in the user pagetables.  Doing that
> >> without information leaks is difficult or impossible.
> >>
> >> Instead, use a different sneaky trick.  Map a copy of the first part
> >> of the SYSCALL asm at a different address for each CPU.  Now RIP
> >> varies depending on the CPU, so we can use RIP-relative memory access
> >> to access percpu memory.  By putting the relevant information (one
> >> scratch slot and the stack address) at a constant offset relative to
> >> RIP, we can make SYSCALL work without relying on %gs.
> >>
> >> A nice thing about this approach is that we can easily switch it on
> >> and off if we want pagetable switching to be configurable.
> >>
> >> The compat variant of SYSCALL doesn't have this problem in the first
> >> place -- there are plenty of scratch registers, since we don't care
> >> about preserving r8-r15.  This patch therefore doesn't touch SYSCALL32
> >> at all.
> >>
> >> XXX: Whenever we settle how KAISER gets turned on and off, we should do
> >> the same to this.
> >>
> >> Signed-off-by: Andy Lutomirski <luto@kernel.org>
> >> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> >> Cc: Borislav Petkov <bpetkov@suse.de>
> >> Cc: Brian Gerst <brgerst@gmail.com>
> >> Cc: Dave Hansen <dave.hansen@intel.com>
> >> Cc: Josh Poimboeuf <jpoimboe@redhat.com>
> >> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> >> Cc: Peter Zijlstra <peterz@infradead.org>
> >> Link: https://lkml.kernel.org/r/b95ccae0a5a2f090c901e49fce7c9e8ff6acd40d.1511497875.git.luto@kernel.org
> >> Signed-off-by: Ingo Molnar <mingo@kernel.org>
> >> ---
> >>  arch/x86/entry/entry_64.S     | 48 +++++++++++++++++++++++++++++++++++++++++++
> >>  arch/x86/include/asm/fixmap.h |  2 ++
> >>  arch/x86/kernel/asm-offsets.c |  1 +
> >>  arch/x86/kernel/cpu/common.c  | 12 ++++++++++-
> >>  arch/x86/kernel/vmlinux.lds.S | 10 +++++++++
> >>  5 files changed, 72 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
> >> index 426b8c669d6a..0cde243b7542 100644
> >> --- a/arch/x86/entry/entry_64.S
> >> +++ b/arch/x86/entry/entry_64.S
> >> @@ -140,6 +140,54 @@ END(native_usergs_sysret64)
> >>   * with them due to bugs in both AMD and Intel CPUs.
> >>   */
> >>
> >> +     .pushsection .entry_trampoline, "ax"
> >> +
> >> +/*
> >> + * The code in here gets remapped into cpu_entry_area's trampoline.  This means
> >> + * that the assembler and linker have the wrong idea as to where this code
> >> + * lives (and, in fact, it's mapped more than once, so it's not even at a
> >> + * fixed address).  So we can't reference any symbols outside the entry
> >> + * trampoline and expect it to work.
> >> + *
> >> + * Instead, we carefully abuse %rip-relative addressing.
> >> + * .Lentry_trampoline(%rip) refers to the start of the remapped) entry
> >
> > _entry_trampoline(%rip) I'd guess.
> 
> Indeed.  It used to be .L, but then I put it in the linker script the
> obvious way and it wasn't any more.
> 
> >
> >> + * trampoline.  We can thus find cpu_entry_area with this macro:
> >
> > Uuh, fun. :)
> 
> That's what I thought!
> 
> >
> >> + */
> >> +
> >> +#define CPU_ENTRY_AREA \
> >> +     _entry_trampoline - CPU_ENTRY_AREA_entry_trampoline(%rip)
> >
> > So this generates
> >
> >         _entry_trampoline - 16384(%rip)
> >
> > here. IINM, the layout looks like this
> >
> > [ GDT | TSS | TSS | TSS | trampoline ]
> >
> > where each section is a page, and we have 4 pages per CPU. Just for my
> > own understanding...
> 
> Indeed.
> 
> >
> >> +
> >> +/* The top word of the SYSENTER stack is hot and is usable as scratch space. */
> >> +#define RSP_SCRATCH CPU_ENTRY_AREA_tss + CPU_TSS_SYSENTER_stack + \
> >> +     SIZEOF_SYSENTER_stack - 8 + CPU_ENTRY_AREA
> >
> > I'm wondering if it would be easier to make SYSENTER_stack part of
> > struct cpu_entry_area and thus simplify that wild offset math here :)
> 
> It's like that with the last patch that I haven't send out applied, actually :)
> 
> >
> > Also, pls align:
> >
> > #define RSP_SCRATCH CPU_ENTRY_AREA_tss + CPU_TSS_SYSENTER_stack + \
> >                     SIZEOF_SYSENTER_stack - 8 + CPU_ENTRY_AREA
> 
> Done.
> 
> >
> >> +
> >> +ENTRY(entry_SYSCALL_64_trampoline)
> >> +     UNWIND_HINT_EMPTY
> >> +     swapgs
> >> +
> >> +     /* Stash the user RSP. */
> >> +     movq    %rsp, RSP_SCRATCH
> >> +
> >> +     /* Load the top of the task stack into RSP */
> >> +     movq    CPU_ENTRY_AREA_tss + TSS_sp1 + CPU_ENTRY_AREA, %rsp
> >
> > Yeah, let's put CPU_ENTRY_AREA first because then it reads easier:
> >
> > movq    CPU_ENTRY_AREA + CPU_ENTRY_AREA_tss + TSS_sp1, %rsp
> 
> Nope, it won't build.  That would expand like 0x1(%rip) + 0x2, which
> isn't valid.
> 
> >
> > i.e., pointer to cpu_entry_area, offset to tss within said
> > cpu_entry_area and then inside that, sp1.
> >
> > Ditto for above.
> >
> > ...
> >
> >> @@ -1417,10 +1424,13 @@ static DEFINE_PER_CPU_PAGE_ALIGNED(char, exception_stacks
> >>  /* May not be marked __init: used by software suspend */
> >>  void syscall_init(void)
> >>  {
> >> +     extern char _entry_trampoline[];
> >> +     extern char entry_SYSCALL_64_trampoline[];
> >> +
> >>       int cpu = smp_processor_id();
> >>
> >>       wrmsr(MSR_STAR, 0, (__USER32_CS << 16) | __KERNEL_CS);
> >> -     wrmsrl(MSR_LSTAR, (unsigned long)entry_SYSCALL_64);
> >> +     wrmsrl(MSR_LSTAR, (unsigned long)get_cpu_entry_area(cpu)->entry_trampoline + (entry_SYSCALL_64_trampoline - _entry_trampoline));
> >
> > Definitely a local var.
> 
> Done.
> 
> >
> >>  #ifdef CONFIG_IA32_EMULATION
> >>       wrmsrl(MSR_CSTAR, (unsigned long)entry_SYSCALL_compat);
> >> diff --git a/arch/x86/kernel/vmlinux.lds.S b/arch/x86/kernel/vmlinux.lds.S
> >> index a4009fb9be87..2738cfb6c8c8 100644
> >> --- a/arch/x86/kernel/vmlinux.lds.S
> >> +++ b/arch/x86/kernel/vmlinux.lds.S
> >> @@ -107,6 +107,16 @@ SECTIONS
> >>               SOFTIRQENTRY_TEXT
> >>               *(.fixup)
> >>               *(.gnu.warning)
> >> +
> >> +#ifdef CONFIG_X86_64
> >> +             /* Entry trampoline */
> >
> > No need for that comment - variable and section names are enough. :)

Ok, I've picked up this version and added Boris's Reviewed-by tag - see the latest 
below.

Thanks,

	Ingo

==================>
Subject: x86/entry/64: Create a percpu SYSCALL entry trampoline
From: Andy Lutomirski <luto@kernel.org>
Date: Sat, 11 Nov 2017 19:52:54 -0800

Handling SYSCALL is tricky: the SYSCALL handler is entered with every
single register (except FLAGS), including RSP, live.  It somehow needs
to set RSP to point to a valid stack, which means it needs to save the
user RSP somewhere and find its own stack pointer.  The canonical way
to do this is with SWAPGS, which lets us access percpu data using the
%gs prefix.

With KAISER-like pagetable switching, this is problematic.  Without a
scratch register, switching CR3 is impossible, so %gs-based percpu
memory would need to be mapped in the user pagetables.  Doing that
without information leaks is difficult or impossible.

Instead, use a different sneaky trick.  Map a copy of the first part
of the SYSCALL asm at a different address for each CPU.  Now RIP
varies depending on the CPU, so we can use RIP-relative memory access
to access percpu memory.  By putting the relevant information (one
scratch slot and the stack address) at a constant offset relative to
RIP, we can make SYSCALL work without relying on %gs.

A nice thing about this approach is that we can easily switch it on
and off if we want pagetable switching to be configurable.

The compat variant of SYSCALL doesn't have this problem in the first
place -- there are plenty of scratch registers, since we don't care
about preserving r8-r15.  This patch therefore doesn't touch SYSCALL32
at all.

XXX: Whenever we settle how KAISER gets turned on and off, we should do
the same to this.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Borislav Petkov <bpetkov@suse.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Link: https://lkml.kernel.org/r/b95ccae0a5a2f090c901e49fce7c9e8ff6acd40d.1511497875.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/entry/entry_64.S     |   48 ++++++++++++++++++++++++++++++++++++++++++
 arch/x86/include/asm/fixmap.h |    2 +
 arch/x86/kernel/asm-offsets.c |    1 
 arch/x86/kernel/cpu/common.c  |   15 ++++++++++++-
 arch/x86/kernel/vmlinux.lds.S |    9 +++++++
 5 files changed, 74 insertions(+), 1 deletion(-)

Index: tip/arch/x86/entry/entry_64.S
===================================================================
--- tip.orig/arch/x86/entry/entry_64.S
+++ tip/arch/x86/entry/entry_64.S
@@ -140,6 +140,54 @@ END(native_usergs_sysret64)
  * with them due to bugs in both AMD and Intel CPUs.
  */
 
+	.pushsection .entry_trampoline, "ax"
+
+/*
+ * The code in here gets remapped into cpu_entry_area's trampoline.  This means
+ * that the assembler and linker have the wrong idea as to where this code
+ * lives (and, in fact, it's mapped more than once, so it's not even at a
+ * fixed address).  So we can't reference any symbols outside the entry
+ * trampoline and expect it to work.
+ *
+ * Instead, we carefully abuse %rip-relative addressing.
+ * _entry_trampoline(%rip) refers to the start of the remapped) entry
+ * trampoline.  We can thus find cpu_entry_area with this macro:
+ */
+
+#define CPU_ENTRY_AREA \
+	_entry_trampoline - CPU_ENTRY_AREA_entry_trampoline(%rip)
+
+/* The top word of the SYSENTER stack is hot and is usable as scratch space. */
+#define RSP_SCRATCH	CPU_ENTRY_AREA_tss + CPU_TSS_SYSENTER_stack + \
+			SIZEOF_SYSENTER_stack - 8 + CPU_ENTRY_AREA
+
+ENTRY(entry_SYSCALL_64_trampoline)
+	UNWIND_HINT_EMPTY
+	swapgs
+
+	/* Stash the user RSP. */
+	movq	%rsp, RSP_SCRATCH
+
+	/* Load the top of the task stack into RSP */
+	movq	CPU_ENTRY_AREA_tss + TSS_sp1 + CPU_ENTRY_AREA, %rsp
+
+	/* Start building the simulated IRET frame. */
+	pushq	$__USER_DS			/* pt_regs->ss */
+	pushq	RSP_SCRATCH			/* pt_regs->sp */
+	pushq	%r11				/* pt_regs->flags */
+	pushq	$__USER_CS			/* pt_regs->cs */
+	pushq	%rcx				/* pt_regs->ip */
+
+	/*
+	 * x86 lacks a near absolute jump, and we can't jump to the real
+	 * entry text with a relative jump, so we fake it using retq.
+	 */
+	pushq	$entry_SYSCALL_64_after_hwframe
+	retq
+END(entry_SYSCALL_64_trampoline)
+
+	.popsection
+
 ENTRY(entry_SYSCALL_64)
 	UNWIND_HINT_EMPTY
 	/*
Index: tip/arch/x86/include/asm/fixmap.h
===================================================================
--- tip.orig/arch/x86/include/asm/fixmap.h
+++ tip/arch/x86/include/asm/fixmap.h
@@ -58,6 +58,8 @@ struct cpu_entry_area {
 	 * of the TSS region.
 	 */
 	struct tss_struct tss;
+
+	char entry_trampoline[PAGE_SIZE];
 };
 
 #define CPU_ENTRY_AREA_PAGES (sizeof(struct cpu_entry_area) / PAGE_SIZE)
Index: tip/arch/x86/kernel/asm-offsets.c
===================================================================
--- tip.orig/arch/x86/kernel/asm-offsets.c
+++ tip/arch/x86/kernel/asm-offsets.c
@@ -101,4 +101,5 @@ void common(void) {
 
 	/* Layout info for cpu_entry_area */
 	OFFSET(CPU_ENTRY_AREA_tss, cpu_entry_area, tss);
+	OFFSET(CPU_ENTRY_AREA_entry_trampoline, cpu_entry_area, entry_trampoline);
 }
Index: tip/arch/x86/kernel/cpu/common.c
===================================================================
--- tip.orig/arch/x86/kernel/cpu/common.c
+++ tip/arch/x86/kernel/cpu/common.c
@@ -507,6 +507,8 @@ DEFINE_PER_CPU(struct cpu_entry_area *,
 static inline void setup_cpu_entry_area(int cpu)
 {
 #ifdef CONFIG_X86_64
+	extern char _entry_trampoline[];
+
 	/* On 64-bit systems, we use a read-only fixmap GDT. */
 	pgprot_t gdt_prot = PAGE_KERNEL_RO;
 #else
@@ -553,6 +555,11 @@ static inline void setup_cpu_entry_area(
 #ifdef CONFIG_X86_32
 	this_cpu_write(cpu_entry_area, get_cpu_entry_area(cpu));
 #endif
+
+#ifdef CONFIG_X86_64
+	__set_fixmap(get_cpu_entry_area_index(cpu, entry_trampoline),
+		     __pa_symbol(_entry_trampoline), PAGE_KERNEL_RX);
+#endif
 }
 
 /* Load the original GDT from the per-cpu structure */
@@ -1417,10 +1424,16 @@ static DEFINE_PER_CPU_PAGE_ALIGNED(char,
 /* May not be marked __init: used by software suspend */
 void syscall_init(void)
 {
+	extern char _entry_trampoline[];
+	extern char entry_SYSCALL_64_trampoline[];
+
 	int cpu = smp_processor_id();
+	unsigned long SYSCALL64_entry_trampoline =
+		(unsigned long)get_cpu_entry_area(cpu)->entry_trampoline +
+		(entry_SYSCALL_64_trampoline - _entry_trampoline);
 
 	wrmsr(MSR_STAR, 0, (__USER32_CS << 16) | __KERNEL_CS);
-	wrmsrl(MSR_LSTAR, (unsigned long)entry_SYSCALL_64);
+	wrmsrl(MSR_LSTAR, SYSCALL64_entry_trampoline);
 
 #ifdef CONFIG_IA32_EMULATION
 	wrmsrl(MSR_CSTAR, (unsigned long)entry_SYSCALL_compat);
Index: tip/arch/x86/kernel/vmlinux.lds.S
===================================================================
--- tip.orig/arch/x86/kernel/vmlinux.lds.S
+++ tip/arch/x86/kernel/vmlinux.lds.S
@@ -107,6 +107,15 @@ SECTIONS
 		SOFTIRQENTRY_TEXT
 		*(.fixup)
 		*(.gnu.warning)
+
+#ifdef CONFIG_X86_64
+		. = ALIGN(PAGE_SIZE);
+		_entry_trampoline = .;
+		*(.entry_trampoline)
+		. = ALIGN(PAGE_SIZE);
+		ASSERT(. - _entry_trampoline == PAGE_SIZE, "entry trampoline is too big");
+#endif
+
 		/* End of text section */
 		_etext = .;
 	} :text = 0x9090

^ permalink raw reply	[flat|nested] 113+ messages in thread

* [PATCH v2] x86/mm/kaiser: Prepare the x86/entry assembly code for entry/exit CR3 switching
  2017-11-26 11:50   ` Borislav Petkov
@ 2017-11-26 14:55     ` Ingo Molnar
  2017-11-27 13:29       ` Josh Poimboeuf
  0 siblings, 1 reply; 113+ messages in thread
From: Ingo Molnar @ 2017-11-26 14:55 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: linux-kernel, Dave Hansen, Andy Lutomirski, Thomas Gleixner,
	H . Peter Anvin, Peter Zijlstra, Linus Torvalds


* Borislav Petkov <bp@alien8.de> wrote:

> On Fri, Nov 24, 2017 at 06:23:50PM +0100, Ingo Molnar wrote:
> > diff --git a/arch/x86/entry/calling.h b/arch/x86/entry/calling.h
> > index 3fd8bc560fae..e1650da01323 100644
> > --- a/arch/x86/entry/calling.h
> > +++ b/arch/x86/entry/calling.h
> > @@ -1,6 +1,7 @@
> >  /* SPDX-License-Identifier: GPL-2.0 */
> >  #include <linux/jump_label.h>
> >  #include <asm/unwind_hints.h>
> > +#include <asm/cpufeatures.h>
> >  
> >  /*
> >  
> > @@ -187,6 +188,70 @@ For 32-bit we have the following conventions - kernel is built with
> >  #endif
> >  .endm
> >  
> > +#ifdef CONFIG_KAISER
> > +
> > +/* KAISER PGDs are 8k.  Flip bit 12 to switch between the two halves: */
> > +#define KAISER_SWITCH_MASK (1<<PAGE_SHIFT)
> 
> Btw, entry_64.o doesn't build when at this patch if you force-enable
> CONFIG_KAISER by doing
> 
> #define CONFIG_KAISER
> 
> above it.
> 
> arch/x86/entry/entry_64.S: Assembler messages:
> arch/x86/entry/entry_64.S:210: Error: invalid operands (*ABS* and *UND* sections) for `<<'
> arch/x86/entry/entry_64.S:406: Error: invalid operands (*ABS* and *UND* sections) for `<<'
> arch/x86/entry/entry_64.S:743: Error: invalid operands (*ABS* and *UND* sections) for `<<'
> arch/x86/entry/entry_64.S:955: Error: invalid operands (*ABS* and *UND* sections) for `<<'
> ...
> 
> due to the missing PAGE_SHIFT definition in asm.
> 
> I'm assuming that is resolved later - not that it breaks bisection...
> 
> > +.macro SAVE_AND_SWITCH_TO_KERNEL_CR3 scratch_reg:req save_reg:req
> > +	movq	%cr3, %r\scratch_reg
> > +	movq	%r\scratch_reg, \save_reg
> 
> What happened to making it uniform so that that macro can be invoked
> like this:
> 
> 	SAVE_AND_SWITCH_TO_KERNEL_CR3 scratch_reg=%rax ...
> 
> instead of "splitting" the arg?
> 
> IOW, hunk below builds here, and asm looks correct:
> 
>     14bf:       31 db                   xor    %ebx,%ebx
>     14c1:       0f 20 d8                mov    %cr3,%rax
>     14c4:       49 89 c6                mov    %rax,%r14
>     14c7:       48 a9 00 10 00 00       test   $0x1000,%rax
>     14cd:       74 09                   je     14d8 <paranoid_entry+0x78>
>     14cf:       48 25 ff ef ff ff       and    $0xffffffffffffefff,%rax
>     14d5:       0f 22 d8                mov    %rax,%cr3
> 
> ---
> diff --git a/arch/x86/entry/calling.h b/arch/x86/entry/calling.h
> index e1650da01323..d528f7060774 100644
> --- a/arch/x86/entry/calling.h
> +++ b/arch/x86/entry/calling.h
> @@ -188,10 +188,12 @@ For 32-bit we have the following conventions - kernel is built with
>  #endif
>  .endm
>  
> +#define CONFIG_KAISER
> +
>  #ifdef CONFIG_KAISER
>  
>  /* KAISER PGDs are 8k.  Flip bit 12 to switch between the two halves: */
> -#define KAISER_SWITCH_MASK (1<<PAGE_SHIFT)
> +#define KAISER_SWITCH_MASK (1<<12)
>  
>  .macro ADJUST_KERNEL_CR3 reg:req
>  	/* Clear "KAISER bit", point CR3 at kernel pagetables: */
> @@ -216,17 +218,17 @@ For 32-bit we have the following conventions - kernel is built with
>  .endm
>  
>  .macro SAVE_AND_SWITCH_TO_KERNEL_CR3 scratch_reg:req save_reg:req
> -	movq	%cr3, %r\scratch_reg
> -	movq	%r\scratch_reg, \save_reg
> +	movq	%cr3, \scratch_reg
> +	movq	\scratch_reg, \save_reg
>  	/*
>  	 * Is the switch bit zero?  This means the address is
>  	 * up in real KAISER patches in a moment.
>  	 */
> -	testq	$(KAISER_SWITCH_MASK), %r\scratch_reg
> +	testq	$(KAISER_SWITCH_MASK), \scratch_reg
>  	jz	.Ldone_\@
>  
> -	ADJUST_KERNEL_CR3 %r\scratch_reg
> -	movq	%r\scratch_reg, %cr3
> +	ADJUST_KERNEL_CR3 \scratch_reg
> +	movq	\scratch_reg, %cr3
>  
>  .Ldone_\@:
>  .endm
> diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
> index 4ac952080869..5a15d0852b2f 100644
> --- a/arch/x86/entry/entry_64.S
> +++ b/arch/x86/entry/entry_64.S
> @@ -1256,7 +1256,7 @@ ENTRY(paranoid_entry)
>  	xorl	%ebx, %ebx
>  
>  1:
> -	SAVE_AND_SWITCH_TO_KERNEL_CR3 scratch_reg=ax save_reg=%r14
> +	SAVE_AND_SWITCH_TO_KERNEL_CR3 scratch_reg=%rax save_reg=%r14
>  
>  	ret
>  END(paranoid_entry)
> 
> > diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
> > index 34e3110b0876..4ac952080869 100644
> > --- a/arch/x86/entry/entry_64.S
> > +++ b/arch/x86/entry/entry_64.S
> > @@ -168,6 +168,9 @@ ENTRY(entry_SYSCALL_64_trampoline)
> >  	/* Stash the user RSP. */
> >  	movq	%rsp, RSP_SCRATCH
> >  
> > +	/* Note: using %rsp as a scratch reg. */
> 
> Haha, yap, it just got freed :)
> 
> > +	SWITCH_TO_KERNEL_CR3 scratch_reg=%rsp
> > +
> >  	/* Load the top of the task stack into RSP */
> >  	movq	CPU_ENTRY_AREA_tss + TSS_sp1 + CPU_ENTRY_AREA, %rsp
> >  
> > @@ -198,6 +201,13 @@ ENTRY(entry_SYSCALL_64)
> >  
> >  	swapgs
> >  	movq	%rsp, PER_CPU_VAR(rsp_scratch)
> 
> <---- newline here.
> 
> > +	/*
> > +	 * The kernel CR3 is needed to map the process stack, but we
> > +	 * need a scratch register to be able to load CR3.  %rsp is
> > +	 * clobberable right now, so use it as a scratch register.
> > +	 * %rsp will be look crazy here for a couple instructions.
> 
> s/be // or "will be looking crazy" :-)
> 
> > +	 */
> > +	SWITCH_TO_KERNEL_CR3 scratch_reg=%rsp
> 
> Now, this is questionable: we did enter through the trampoline
> entry_SYSCALL_64_trampoline so theoretically, we wouldn't need to switch
> to CR3 here again because, well, we did already.
> 
> I.e., entry_SYSCALL_64 is not going to be called anymore. Unless we will
> jump to it when we decide to jump over the trampolines in the kaiser
> disabled case. Just pointing it out here so that we don't forget to deal
> with this...
> 
> > @@ -1239,7 +1254,11 @@ ENTRY(paranoid_entry)
> >  	js	1f				/* negative -> in kernel */
> >  	SWAPGS
> >  	xorl	%ebx, %ebx
> > -1:	ret
> > +
> > +1:
> > +	SAVE_AND_SWITCH_TO_KERNEL_CR3 scratch_reg=ax save_reg=%r14
> > +
> > +	ret
> >  END(paranoid_entry)
> >  
> >  /*
> > @@ -1261,6 +1280,7 @@ ENTRY(paranoid_exit)
> >  	testl	%ebx, %ebx			/* swapgs needed? */
> >  	jnz	.Lparanoid_exit_no_swapgs
> >  	TRACE_IRQS_IRETQ
> > +	RESTORE_CR3	%r14
> 
> 	RESTORE_CR3 save_reg=%r14
> 
> like the other invocation below.
> 
> But if the runtime disable gets changed to a boottime one, you don't
> need that macro anymore.

Ok, I've incorporated all these fixes into the latest version - attached below.

[ Also added your Reviewed-by tag, to save a step in the best-case scenario ;-) ]

Thanks,

	Ingo

===================>
Subject: x86/mm/kaiser: Prepare the x86/entry assembly code for entry/exit CR3 switching
From: Dave Hansen <dave.hansen@linux.intel.com>
Date: Wed, 22 Nov 2017 16:34:42 -0800

From: Dave Hansen <dave.hansen@linux.intel.com>

This is largely code from Andy Lutomirski.  I fixed a few bugs
in it, and added a few SWITCH_TO_* spots.

KAISER needs to switch to a different CR3 value when it enters
the kernel and switch back when it exits.  This essentially
needs to be done before leaving assembly code.

This is extra challenging because the switching context is
tricky: the registers that can be clobbered can vary.  It is also
hard to store things on the stack because there is an established
ABI (ptregs) or the stack is entirely unsafe to use.

This patch establishes a set of macros that allow changing to
the user and kernel CR3 values.

Interactions with SWAPGS: previous versions of the KAISER code
relied on having per-CPU scratch space to save/restore a register
that can be used for the CR3 MOV.  The %GS register is used to
index into our per-CPU space, so SWAPGS *had* to be done before
the CR3 switch.  That scratch space is gone now, but the semantic
that SWAPGS must be done before the CR3 MOV is retained.  This is
good to keep because it is not that hard to do and it allows us
to do things like add per-CPU debugging information to help us
figure out what goes wrong sometimes.

What this does in the NMI code is worth pointing out.  NMIs
can interrupt *any* context and they can also be nested with
NMIs interrupting other NMIs.  The comments below
".Lnmi_from_kernel" explain the format of the stack during this
situation.  Changing the format of this stack is not a fun
exercise: I tried.  Instead of storing the old CR3 value on the
stack, this patch depend on the *regular* register save/restore
mechanism and then uses %r14 to keep CR3 during the NMI.  It is
callee-saved and will not be clobbered by the C NMI handlers that
get called.

Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: daniel.gruss@iaik.tugraz.at
Cc: hughd@google.com
Cc: keescook@google.com
Cc: linux-mm@kvack.org
Cc: luto@kernel.org
Cc: michael.schwarz@iaik.tugraz.at
Cc: moritz.lipp@iaik.tugraz.at
Cc: richard.fellner@student.tugraz.at
Link: https://lkml.kernel.org/r/20171123003442.2D047A7D@viggo.jf.intel.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/entry/calling.h         |   65 +++++++++++++++++++++++++++++++++++++++
 arch/x86/entry/entry_64.S        |   38 +++++++++++++++++++++-
 arch/x86/entry/entry_64_compat.S |   24 +++++++++++++-
 3 files changed, 123 insertions(+), 4 deletions(-)

Index: tip/arch/x86/entry/calling.h
===================================================================
--- tip.orig/arch/x86/entry/calling.h
+++ tip/arch/x86/entry/calling.h
@@ -1,6 +1,7 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 #include <linux/jump_label.h>
 #include <asm/unwind_hints.h>
+#include <asm/cpufeatures.h>
 
 /*
 
@@ -187,6 +188,70 @@ For 32-bit we have the following convent
 #endif
 .endm
 
+#ifdef CONFIG_KAISER
+
+/* KAISER PGDs are 8k.  Flip bit 12 to switch between the two halves: */
+#define KAISER_SWITCH_MASK (1<<PAGE_SHIFT)
+
+.macro ADJUST_KERNEL_CR3 reg:req
+	/* Clear "KAISER bit", point CR3 at kernel pagetables: */
+	andq	$(~KAISER_SWITCH_MASK), \reg
+.endm
+
+.macro ADJUST_USER_CR3 reg:req
+	/* Move CR3 up a page to the user page tables: */
+	orq	$(KAISER_SWITCH_MASK), \reg
+.endm
+
+.macro SWITCH_TO_KERNEL_CR3 scratch_reg:req
+	mov	%cr3, \scratch_reg
+	ADJUST_KERNEL_CR3 \scratch_reg
+	mov	\scratch_reg, %cr3
+.endm
+
+.macro SWITCH_TO_USER_CR3 scratch_reg:req
+	mov	%cr3, \scratch_reg
+	ADJUST_USER_CR3 \scratch_reg
+	mov	\scratch_reg, %cr3
+.endm
+
+.macro SAVE_AND_SWITCH_TO_KERNEL_CR3 scratch_reg:req save_reg:req
+	movq	%cr3, %r\scratch_reg
+	movq	%r\scratch_reg, \save_reg
+	/*
+	 * Is the switch bit zero?  This means the address is
+	 * up in real KAISER patches in a moment.
+	 */
+	testq	$(KAISER_SWITCH_MASK), %r\scratch_reg
+	jz	.Ldone_\@
+
+	ADJUST_KERNEL_CR3 %r\scratch_reg
+	movq	%r\scratch_reg, %cr3
+
+.Ldone_\@:
+.endm
+
+.macro RESTORE_CR3 save_reg:req
+	/*
+	 * The CR3 write could be avoided when not changing its value,
+	 * but would require a CR3 read *and* a scratch register.
+	 */
+	movq	\save_reg, %cr3
+.endm
+
+#else /* CONFIG_KAISER=n: */
+
+.macro SWITCH_TO_KERNEL_CR3 scratch_reg:req
+.endm
+.macro SWITCH_TO_USER_CR3 scratch_reg:req
+.endm
+.macro SAVE_AND_SWITCH_TO_KERNEL_CR3 scratch_reg:req save_reg:req
+.endm
+.macro RESTORE_CR3 save_reg:req
+.endm
+
+#endif
+
 #endif /* CONFIG_X86_64 */
 
 /*
Index: tip/arch/x86/entry/entry_64.S
===================================================================
--- tip.orig/arch/x86/entry/entry_64.S
+++ tip/arch/x86/entry/entry_64.S
@@ -168,6 +168,9 @@ ENTRY(entry_SYSCALL_64_trampoline)
 	/* Stash the user RSP. */
 	movq	%rsp, RSP_SCRATCH
 
+	/* Note: using %rsp as a scratch reg. */
+	SWITCH_TO_KERNEL_CR3 scratch_reg=%rsp
+
 	/* Load the top of the task stack into RSP */
 	movq	CPU_ENTRY_AREA_tss + TSS_sp1 + CPU_ENTRY_AREA, %rsp
 
@@ -198,6 +201,14 @@ ENTRY(entry_SYSCALL_64)
 
 	swapgs
 	movq	%rsp, PER_CPU_VAR(rsp_scratch)
+
+	/*
+	 * The kernel CR3 is needed to map the process stack, but we
+	 * need a scratch register to be able to load CR3.  %rsp is
+	 * clobberable right now, so use it as a scratch register.
+	 * %rsp will look crazy here for a couple instructions.
+	 */
+	SWITCH_TO_KERNEL_CR3 scratch_reg=%rsp
 	movq	PER_CPU_VAR(cpu_current_top_of_stack), %rsp
 
 	/* Construct struct pt_regs on stack */
@@ -393,6 +404,7 @@ syscall_return_via_sysret:
 	 * We are on the trampoline stack.  All regs except RDI are live.
 	 * We can do future final exit work right here.
 	 */
+	SWITCH_TO_USER_CR3 scratch_reg=%rdi
 
 	popq	%rdi
 	popq	%rsp
@@ -729,6 +741,8 @@ GLOBAL(swapgs_restore_regs_and_return_to
 	 * We can do future final exit work right here.
 	 */
 
+	SWITCH_TO_USER_CR3 scratch_reg=%rdi
+
 	/* Restore RDI. */
 	popq	%rdi
 	SWAPGS
@@ -938,6 +952,8 @@ ENTRY(switch_to_thread_stack)
 	UNWIND_HINT_FUNC
 
 	pushq	%rdi
+	/* Need to switch before accessing the thread stack. */
+	SWITCH_TO_KERNEL_CR3 scratch_reg=%rdi
 	movq	%rsp, %rdi
 	movq	PER_CPU_VAR(cpu_current_top_of_stack), %rsp
 	UNWIND_HINT sp_offset=16 sp_reg=ORC_REG_DI
@@ -1239,7 +1255,11 @@ ENTRY(paranoid_entry)
 	js	1f				/* negative -> in kernel */
 	SWAPGS
 	xorl	%ebx, %ebx
-1:	ret
+
+1:
+	SAVE_AND_SWITCH_TO_KERNEL_CR3 scratch_reg=ax save_reg=%r14
+
+	ret
 END(paranoid_entry)
 
 /*
@@ -1261,6 +1281,7 @@ ENTRY(paranoid_exit)
 	testl	%ebx, %ebx			/* swapgs needed? */
 	jnz	.Lparanoid_exit_no_swapgs
 	TRACE_IRQS_IRETQ
+	RESTORE_CR3	save_reg=%r14
 	SWAPGS_UNSAFE_STACK
 	jmp	.Lparanoid_exit_restore
 .Lparanoid_exit_no_swapgs:
@@ -1288,6 +1309,8 @@ ENTRY(error_entry)
 	 * from user mode due to an IRET fault.
 	 */
 	SWAPGS
+	/* We have user CR3.  Change to kernel CR3. */
+	SWITCH_TO_KERNEL_CR3 scratch_reg=%rax
 
 .Lerror_entry_from_usermode_after_swapgs:
 	/* Put us onto the real thread stack. */
@@ -1334,6 +1357,7 @@ ENTRY(error_entry)
 	 * .Lgs_change's error handler with kernel gsbase.
 	 */
 	SWAPGS
+	SWITCH_TO_KERNEL_CR3 scratch_reg=%rax
 	jmp .Lerror_entry_done
 
 .Lbstep_iret:
@@ -1343,10 +1367,11 @@ ENTRY(error_entry)
 
 .Lerror_bad_iret:
 	/*
-	 * We came from an IRET to user mode, so we have user gsbase.
-	 * Switch to kernel gsbase:
+	 * We came from an IRET to user mode, so we have user
+	 * gsbase and CR3.  Switch to kernel gsbase and CR3:
 	 */
 	SWAPGS
+	SWITCH_TO_KERNEL_CR3 scratch_reg=%rax
 
 	/*
 	 * Pretend that the exception came from user mode: set up pt_regs
@@ -1378,6 +1403,10 @@ END(error_exit)
 /*
  * Runs on exception stack.  Xen PV does not go through this path at all,
  * so we can use real assembly here.
+ *
+ * Registers:
+ *	%r14: Used to save/restore the CR3 of the interrupted context
+ *	      when KAISER is in use.  Do not clobber.
  */
 ENTRY(nmi)
 	UNWIND_HINT_IRET_REGS
@@ -1441,6 +1470,7 @@ ENTRY(nmi)
 
 	swapgs
 	cld
+	SWITCH_TO_KERNEL_CR3 scratch_reg=%rdx
 	movq	%rsp, %rdx
 	movq	PER_CPU_VAR(cpu_current_top_of_stack), %rsp
 	UNWIND_HINT_IRET_REGS base=%rdx offset=8
@@ -1693,6 +1723,8 @@ end_repeat_nmi:
 	movq	$-1, %rsi
 	call	do_nmi
 
+	RESTORE_CR3 save_reg=%r14
+
 	testl	%ebx, %ebx			/* swapgs needed? */
 	jnz	nmi_restore
 nmi_swapgs:
Index: tip/arch/x86/entry/entry_64_compat.S
===================================================================
--- tip.orig/arch/x86/entry/entry_64_compat.S
+++ tip/arch/x86/entry/entry_64_compat.S
@@ -49,6 +49,10 @@
 ENTRY(entry_SYSENTER_compat)
 	/* Interrupts are off on entry. */
 	SWAPGS
+
+	/* We are about to clobber %rsp anyway, clobbering here is OK */
+	SWITCH_TO_KERNEL_CR3 scratch_reg=%rsp
+
 	movq	PER_CPU_VAR(cpu_current_top_of_stack), %rsp
 
 	/*
@@ -216,6 +220,12 @@ GLOBAL(entry_SYSCALL_compat_after_hwfram
 	pushq   $0			/* pt_regs->r15 = 0 */
 
 	/*
+	 * We just saved %rdi so it is safe to clobber.  It is not
+	 * preserved during the C calls inside TRACE_IRQS_OFF anyway.
+	 */
+	SWITCH_TO_KERNEL_CR3 scratch_reg=%rdi
+
+	/*
 	 * User mode is traced as though IRQs are on, and SYSENTER
 	 * turned them off.
 	 */
@@ -256,10 +266,22 @@ sysret32_from_system_call:
 	 * when the system call started, which is already known to user
 	 * code.  We zero R8-R10 to avoid info leaks.
          */
+	movq	RSP-ORIG_RAX(%rsp), %rsp
+
+	/*
+	 * The original userspace %rsp (RSP-ORIG_RAX(%rsp)) is stored
+	 * on the process stack which is not mapped to userspace and
+	 * not readable after we SWITCH_TO_USER_CR3.  Delay the CR3
+	 * switch until after after the last reference to the process
+	 * stack.
+	 *
+	 * %r8 is zeroed before the sysret, thus safe to clobber.
+	 */
+	SWITCH_TO_USER_CR3 scratch_reg=%r8
+
 	xorq	%r8, %r8
 	xorq	%r9, %r9
 	xorq	%r10, %r10
-	movq	RSP-ORIG_RAX(%rsp), %rsp
 	swapgs
 	sysretl
 END(entry_SYSCALL_compat)

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 28/43] x86/mm/kaiser: Map cpu entry area
  2017-11-25 21:40   ` Thomas Gleixner
@ 2017-11-26 15:19     ` Ingo Molnar
  0 siblings, 0 replies; 113+ messages in thread
From: Ingo Molnar @ 2017-11-26 15:19 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-kernel, Dave Hansen, Andy Lutomirski, H . Peter Anvin,
	Peter Zijlstra, Borislav Petkov, Linus Torvalds


* Thomas Gleixner <tglx@linutronix.de> wrote:

> On Fri, 24 Nov 2017, Ingo Molnar wrote:
> > +void kaiser_add_mapping_cpu_entry(int cpu)
> > +{
> > +	kaiser_add_user_map_early(get_cpu_gdt_ro(cpu), PAGE_SIZE,
> > +				  __PAGE_KERNEL_RO);
> > +
> > +	/* includes the entry stack */
> > +	kaiser_add_user_map_early(&get_cpu_entry_area(cpu)->tss,
> > +				  sizeof(get_cpu_entry_area(cpu)->tss),
> > +				  __PAGE_KERNEL | _PAGE_GLOBAL);
> > +
> > +	/* Entry code, so needs to be EXEC */
> > +	kaiser_add_user_map_early(&get_cpu_entry_area(cpu)->entry_trampoline,
> > +				  sizeof(get_cpu_entry_area(cpu)->entry_trampoline),
> > +				  __PAGE_KERNEL_EXEC | _PAGE_GLOBAL);
> 
> This creates a RWX mapping and wants to be __PAGE_KERNEL_RX!

So I think __PAGE_KERNEL_EXEC and __PAGE_KERNEL are really dangerous names which 
creates RWX mappings without people intending it.

Should be renamed to something like __PAGE_KERNEL_RWX ? And not be used anywhere ...

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 24/43] x86/mm/kaiser: Mark per-cpu data structures required for entry/exit
  2017-11-25 17:17   ` Thomas Gleixner
@ 2017-11-26 15:54     ` Ingo Molnar
  0 siblings, 0 replies; 113+ messages in thread
From: Ingo Molnar @ 2017-11-26 15:54 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: linux-kernel, Dave Hansen, Andy Lutomirski, H . Peter Anvin,
	Peter Zijlstra, Borislav Petkov, Linus Torvalds


* Thomas Gleixner <tglx@linutronix.de> wrote:

> On Fri, 24 Nov 2017, Ingo Molnar wrote:
> > diff --git a/arch/x86/include/asm/desc.h b/arch/x86/include/asm/desc.h
> > index aab4fe9f49f8..300090d1c209 100644
> > --- a/arch/x86/include/asm/desc.h
> > +++ b/arch/x86/include/asm/desc.h
> > @@ -46,7 +46,7 @@ struct gdt_page {
> >  	struct desc_struct gdt[GDT_ENTRIES];
> >  } __attribute__((aligned(PAGE_SIZE)));
> >  
> > -DECLARE_PER_CPU_PAGE_ALIGNED(struct gdt_page, gdt_page);
> > +DECLARE_PER_CPU_PAGE_ALIGNED_USER_MAPPED(struct gdt_page, gdt_page);
> 
> This patch is obsolete. We have the alias mappings in the cpu_entry_area
> already.

Gone in the latest WIP.x86/mm.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 12/43] x86/espfix/64: Stop assuming that pt_regs is on the entry stack
  2017-11-26 14:05       ` Ingo Molnar
@ 2017-11-26 17:28         ` Borislav Petkov
  2017-11-27  9:19           ` Ingo Molnar
  0 siblings, 1 reply; 113+ messages in thread
From: Borislav Petkov @ 2017-11-26 17:28 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Andy Lutomirski, linux-kernel, Dave Hansen, Thomas Gleixner,
	H . Peter Anvin, Peter Zijlstra, Linus Torvalds

On Sun, Nov 26, 2017 at 03:05:33PM +0100, Ingo Molnar wrote:
> > https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/commit/?h=x86/entry_stack&id=6510485d026abdd144d170b1bc8508327acb5648
> 
> I have added Boris's reviewed-by tag:
> 
>   Reviewed-by: Borislav Petkov <bp@suse.de>
> 
> which I suspect applies now?

Almost, looking at the patch at that URL, there's a trailing "modify" in the
comment:

+	 * we land directly at the #GP(0) vector with the stack already
+	 * set up according to its expectations.modify
						^^^^^^
which you can amend-remove.

Other than that, it looks good!

Thx.

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 14/43] x86/asm/64: Return to userspace from the trampoline stack
  2017-11-26 14:18     ` Ingo Molnar
@ 2017-11-26 17:33       ` Borislav Petkov
  0 siblings, 0 replies; 113+ messages in thread
From: Borislav Petkov @ 2017-11-26 17:33 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Dave Hansen, Andy Lutomirski, Thomas Gleixner,
	H . Peter Anvin, Peter Zijlstra, Linus Torvalds

On Sun, Nov 26, 2017 at 03:18:47PM +0100, Ingo Molnar wrote:
> Heh, indeed - if by 'future generations' you mean 'the guys who merge it,
> a few months down the line'. ;-)

... and the poor souls who get to backport it and maybe even the guys
who merged it, get to look at this after some time has passed by and
then all of a sudden, they start scratching heads: "Why TF did we do
that then? What were we thinking?"

It happens even to the best of us. :-)

And entry code is tricky as hell so hell yeah! to all the commenting.

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 23/43] x86/mm/kaiser: Introduce user-mapped per-cpu areas
  2017-11-24 17:23 ` [PATCH 23/43] x86/mm/kaiser: Introduce user-mapped per-cpu areas Ingo Molnar
@ 2017-11-26 17:41   ` Borislav Petkov
  2017-11-27  9:26     ` Ingo Molnar
  2017-11-27 21:14     ` Dave Hansen
  0 siblings, 2 replies; 113+ messages in thread
From: Borislav Petkov @ 2017-11-26 17:41 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Dave Hansen, Andy Lutomirski, Thomas Gleixner,
	H . Peter Anvin, Peter Zijlstra, Linus Torvalds

On Fri, Nov 24, 2017 at 06:23:51PM +0100, Ingo Molnar wrote:
> From: Dave Hansen <dave.hansen@linux.intel.com>
> 
> These patches are based on work from a team at Graz University of
> Technology posted here: https://github.com/IAIK/KAISER
> 
> The KAISER approach keeps two copies of the page tables: one for running
> in the kernel and one for running userspace.  But, there are a few
> structures that are needed for switching in and out of the kernel and
> a good subset of *those* are per-cpu data.
> 
> This patch creates a new kind of per-cpu data that is mapped and

Never say "This patch" in the commit message of a patch. It is
tautologically useless.

> can be used no matter which copy of the page tables is active.
> Users of this new section will be forthcoming.
> 
> Thanks to Hugh Dickins for cleanups to this code.
> 
> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Cc: daniel.gruss@iaik.tugraz.at
> Cc: hughd@google.com
> Cc: keescook@google.com
> Cc: linux-mm@kvack.org
> Cc: luto@kernel.org
> Cc: michael.schwarz@iaik.tugraz.at
> Cc: moritz.lipp@iaik.tugraz.at
> Cc: richard.fellner@student.tugraz.at
> Link: https://lkml.kernel.org/r/20171123003444.196CB6DB@viggo.jf.intel.com
> Signed-off-by: Ingo Molnar <mingo@kernel.org>
> ---
>  include/asm-generic/vmlinux.lds.h |  7 +++++++
>  include/linux/percpu-defs.h       | 30 ++++++++++++++++++++++++++++++
>  2 files changed, 37 insertions(+)
> 
> diff --git a/include/asm-generic/vmlinux.lds.h b/include/asm-generic/vmlinux.lds.h
> index bdcd1caae092..e12168936d3f 100644
> --- a/include/asm-generic/vmlinux.lds.h
> +++ b/include/asm-generic/vmlinux.lds.h
> @@ -826,7 +826,14 @@
>   */
>  #define PERCPU_INPUT(cacheline)						\
>  	VMLINUX_SYMBOL(__per_cpu_start) = .;				\
> +	VMLINUX_SYMBOL(__per_cpu_user_mapped_start) = .;		\
>  	*(.data..percpu..first)						\
> +	. = ALIGN(cacheline);						\
> +	*(.data..percpu..user_mapped)					\
> +	*(.data..percpu..user_mapped..shared_aligned)			\
> +	. = ALIGN(PAGE_SIZE);						\
> +	*(.data..percpu..user_mapped..page_aligned)			\
> +	VMLINUX_SYMBOL(__per_cpu_user_mapped_end) = .;			\
>  	. = ALIGN(PAGE_SIZE);						\
>  	*(.data..percpu..page_aligned)					\
>  	. = ALIGN(cacheline);						\
> diff --git a/include/linux/percpu-defs.h b/include/linux/percpu-defs.h
> index 2d2096ba1cfe..752513674295 100644
> --- a/include/linux/percpu-defs.h
> +++ b/include/linux/percpu-defs.h
> @@ -35,6 +35,12 @@
>  
>  #endif
>  
> +#ifdef CONFIG_KAISER
> +#define USER_MAPPED_SECTION "..user_mapped"
> +#else
> +#define USER_MAPPED_SECTION ""
> +#endif
>  /*
>   * Base implementations of per-CPU variable declarations and definitions, where
>   * the section in which the variable is to be placed is provided by the
> @@ -115,6 +121,12 @@
>  #define DEFINE_PER_CPU(type, name)					\
>  	DEFINE_PER_CPU_SECTION(type, name, "")
>  
> +#define DECLARE_PER_CPU_USER_MAPPED(type, name)				\
> +	DECLARE_PER_CPU_SECTION(type, name, USER_MAPPED_SECTION)
> +
> +#define DEFINE_PER_CPU_USER_MAPPED(type, name)				\
> +	DEFINE_PER_CPU_SECTION(type, name, USER_MAPPED_SECTION)
> +
>  /*
>   * Declaration/definition used for per-CPU variables that must come first in
>   * the set of variables.
> @@ -144,6 +156,14 @@
>  	DEFINE_PER_CPU_SECTION(type, name, PER_CPU_SHARED_ALIGNED_SECTION) \
>  	____cacheline_aligned_in_smp
>  
> +#define DECLARE_PER_CPU_SHARED_ALIGNED_USER_MAPPED(type, name)		\
> +	DECLARE_PER_CPU_SECTION(type, name, USER_MAPPED_SECTION PER_CPU_SHARED_ALIGNED_SECTION) \
> +	____cacheline_aligned_in_smp
> +
> +#define DEFINE_PER_CPU_SHARED_ALIGNED_USER_MAPPED(type, name)		\
> +	DEFINE_PER_CPU_SECTION(type, name, USER_MAPPED_SECTION PER_CPU_SHARED_ALIGNED_SECTION) \
> +	____cacheline_aligned_in_smp
> +
>  #define DECLARE_PER_CPU_ALIGNED(type, name)				\
>  	DECLARE_PER_CPU_SECTION(type, name, PER_CPU_ALIGNED_SECTION)	\
>  	____cacheline_aligned
> @@ -162,6 +182,16 @@
>  #define DEFINE_PER_CPU_PAGE_ALIGNED(type, name)				\
>  	DEFINE_PER_CPU_SECTION(type, name, "..page_aligned")		\
>  	__aligned(PAGE_SIZE)
> +/*
> + * Declaration/definition used for per-CPU variables that must be page aligned and need to be mapped in user mode.
> + */

WARNING: line over 100 characters
#122: FILE: include/linux/percpu-defs.h:186:
+ * Declaration/definition used for per-CPU variables that must be page aligned and need to be mapped in user mode.

> +#define DECLARE_PER_CPU_PAGE_ALIGNED_USER_MAPPED(type, name)		\
> +	DECLARE_PER_CPU_SECTION(type, name, USER_MAPPED_SECTION"..page_aligned") \
> +	__aligned(PAGE_SIZE)
> +
> +#define DEFINE_PER_CPU_PAGE_ALIGNED_USER_MAPPED(type, name)		\
> +	DEFINE_PER_CPU_SECTION(type, name, USER_MAPPED_SECTION"..page_aligned") \
> +	__aligned(PAGE_SIZE)
>  
>  /*
>   * Declaration/definition used for per-CPU variables that must be read mostly.
> -- 

With that:

Reviewed-by: Borislav Petkov <bp@suse.de>

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 25/43] x86/mm/kaiser: Unmap kernel from userspace page tables (core patch)
  2017-11-24 17:23 ` [PATCH 25/43] x86/mm/kaiser: Unmap kernel from userspace page tables (core patch) Ingo Molnar
@ 2017-11-26 18:51   ` Borislav Petkov
  2017-11-27  9:30     ` Ingo Molnar
  2017-11-26 20:49   ` Borislav Petkov
  2017-11-26 22:25   ` [PATCH 25/43] x86/mm/kaiser: Unmap kernel from userspace page tables (core patch), noexec=off Borislav Petkov
  2 siblings, 1 reply; 113+ messages in thread
From: Borislav Petkov @ 2017-11-26 18:51 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Dave Hansen, Andy Lutomirski, Thomas Gleixner,
	H . Peter Anvin, Peter Zijlstra, Linus Torvalds

On Fri, Nov 24, 2017 at 06:23:53PM +0100, Ingo Molnar wrote:
> diff --git a/Documentation/x86/kaiser.txt b/Documentation/x86/kaiser.txt
> new file mode 100644

Here some text cleanups/typos fixes on top, after reading through it:

---
diff --git a/Documentation/x86/kaiser.txt b/Documentation/x86/kaiser.txt
index 46efa3662e22..f2df0441f6ea 100644
--- a/Documentation/x86/kaiser.txt
+++ b/Documentation/x86/kaiser.txt
@@ -4,9 +4,10 @@ Overview
 KAISER is a countermeasure against attacks on kernel address
 information.  There are at least three existing, published,
 approaches using the shared user/kernel mapping and hardware features
-to defeat KASLR.  One approach referenced in the paper locates the
-kernel by observing differences in page fault timing between
-present-but-inaccessable kernel pages and non-present pages.
+to defeat KASLR.  One approach referenced in the paper
+(https://gruss.cc/files/kaiser.pdf) locates the kernel by
+observing differences in page fault timing between
+present-but-inaccessible kernel pages and non-present pages.
 
 When the kernel is entered via syscalls, interrupts or exceptions,
 page tables are switched to the full "kernel" copy.  When the
@@ -18,37 +19,36 @@ entry/exit functions themselves and the interrupt descriptor
 table (IDT).
 
 This helps to ensure that side-channel attacks that leverage the
-paging structures do not function when KAISER is enabled.  It can be
-enabled by setting CONFIG_KAISER=y
+paging structures do not function when KAISER is enabled, by setting
+CONFIG_KAISER=y.
 
 Page Table Management
 =====================
 
 When KAISER is enabled, the kernel manages two sets of page
 tables.  The first copy is very similar to what would be present
-for a kernel without KAISER.  This includes a complete mapping of
-userspace that the kernel can use for things like copy_to_user().
+for a kernel without KAISER.  It includes a complete mapping of
+userspace that the kernel needs for things like copy_*_user().
 
 The second (shadow) is used when running userspace and mirrors the
-mapping of userspace present in the kernel copy.  It maps a only
+mapping of userspace present in the kernel copy.  It maps only
 the kernel data needed to enter and exit the kernel.
 
 The shadow is populated by the kaiser_add_*() functions.  Only
-kernel data which has been explicity mapped will appear in the
-shadow copy.  These calls are rare at runtime.
+kernel data which has been explicitly mapped will appear in the
+shadow copy. These calls are rare at runtime.
 
 For a new userspace mapping, the kernel makes the entries in its
 page tables like normal.  The only difference is when the kernel
 makes entries in the top (PGD) level.  In addition to setting the
-entry in the main kernel PGD, a copy if the entry is made in the
+entry in the main kernel PGD, a copy of the entry is made in the
 shadow PGD.
 
 For user space mappings the kernel creates an entry in the kernel
 PGD and the same entry in the shadow PGD, so the underlying page
-table to which the PGD entry points is shared down to the PTE
+table to which the PGD entry points to, is shared down to the PTE
 level.  This leaves a single, shared set of userspace page tables
-to manage.  One PTE to lock, one set set of accessed bits, dirty
-bits, etc...
+to manage.  One PTE to lock, one set of accessed, dirty bits, etc...
 
 Overhead
 ========
@@ -76,8 +76,8 @@ this protection comes at a cost:
   a. CR3 manipulation to switch between the page table copies
      must be done at interrupt, syscall, and exception entry
      and exit (it can be skipped when the kernel is interrupted,
-     though.)  Moves to CR3 are on the order of a hundred
-     cycles, and are required every at entry and every at exit.
+     though.)  CR3 modifications are in the order of a hundred
+     cycles, and are required at every entry and exit.
   b. Task stacks must be mapped/unmapped.  We need to walk
      and modify the shadow page tables at fork() and exit().
   c. Global pages are disabled.  This feature of the MMU
@@ -91,7 +91,7 @@ this protection comes at a cost:
      systems with PCID support, the context switch code must flush
      both the user and kernel entries out of the TLB, with an
      INVPCID in addition to the CR3 write.  This INVPCID is
-     generally slower than a CR3 write, but still on the order of
+     generally slower than a CR3 write, but still in the order of
      a hundred cycles.
   e. The shadow page tables must be populated for each new
      process.  Even without KAISER, the shared kernel mappings
@@ -123,7 +123,7 @@ Possible Future Work:
    code or userspace since it will not have to reload all of
    its TLB entries.  However, its upside is limited by PCID
    being used.
-4. Allow KAISER to enabled/disabled at runtime so folks can
+4. Allow KAISER to be enabled/disabled at runtime so folks can
    run a single kernel image.
 
 Debugging:
@@ -144,7 +144,7 @@ that are worth noting here.
    running perf.
  * Kernel crashes at the first exit to userspace.  entry_64.S
    bugs, or failing to map some of the exit code.
- * Crashes at first interrupt that interrupts userspace. The paths
+ * Crashes at the first interrupt that interrupts userspace. The paths
    in entry_64.S that return to userspace are sometimes separate
    from the ones that return to the kernel.
  * Double faults: overflowing the kernel stack because of page
@@ -157,4 +157,3 @@ that are worth noting here.
    as mount(8) failing to mount the rootfs.  These have
    tended to be TLB invalidation issues.  Usually invalidating
    the wrong PCID, or otherwise missing an invalidation.
-

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 25/43] x86/mm/kaiser: Unmap kernel from userspace page tables (core patch)
  2017-11-24 17:23 ` [PATCH 25/43] x86/mm/kaiser: Unmap kernel from userspace page tables (core patch) Ingo Molnar
  2017-11-26 18:51   ` Borislav Petkov
@ 2017-11-26 20:49   ` Borislav Petkov
  2017-11-27 10:38     ` Ingo Molnar
  2017-11-26 22:25   ` [PATCH 25/43] x86/mm/kaiser: Unmap kernel from userspace page tables (core patch), noexec=off Borislav Petkov
  2 siblings, 1 reply; 113+ messages in thread
From: Borislav Petkov @ 2017-11-26 20:49 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Dave Hansen, Andy Lutomirski, Thomas Gleixner,
	H . Peter Anvin, Peter Zijlstra, Linus Torvalds

On Fri, Nov 24, 2017 at 06:23:53PM +0100, Ingo Molnar wrote:
> diff --git a/arch/x86/entry/calling.h b/arch/x86/entry/calling.h
> index e1650da01323..d087c3aa0514 100644
> --- a/arch/x86/entry/calling.h
> +++ b/arch/x86/entry/calling.h
> @@ -2,6 +2,7 @@
>  #include <linux/jump_label.h>
>  #include <asm/unwind_hints.h>
>  #include <asm/cpufeatures.h>
> +#include <asm/page_types.h>
>  
>  /*
>  
> diff --git a/arch/x86/include/asm/kaiser.h b/arch/x86/include/asm/kaiser.h
> new file mode 100644
> index 000000000000..3c2cc71b4058
> --- /dev/null
> +++ b/arch/x86/include/asm/kaiser.h
> @@ -0,0 +1,57 @@
> +#ifndef _ASM_X86_KAISER_H
> +#define _ASM_X86_KAISER_H
> +/*
> + * Copyright(c) 2017 Intel Corporation. All rights reserved.
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of version 2 of the GNU General Public License as
> + * published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope that it will be useful, but
> + * WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> + * General Public License for more details.
> + *
> + * Based on work published here: https://github.com/IAIK/KAISER
> + * Modified by Dave Hansen <dave.hansen@intel.com to actually work.
> + */
> +#ifndef __ASSEMBLY__
> +
> +#ifdef CONFIG_KAISER
> +/**
> + *  kaiser_add_mapping - map a kernel range into the user page tables
> + *  @addr: the start address of the range
> + *  @size: the size of the range
> + *  @flags: The mapping flags of the pages
> + *
> + *  Use this on all data and code that need to be mapped into both
> + *  copies of the page tables.  This includes the code that switches
> + *  to/from userspace and all of the hardware structures that are
> + *  virtually-addressed and needed in userspace like the interrupt
> + *  table.
> + */
> +extern int kaiser_add_mapping(unsigned long addr, unsigned long size,
> +			      unsigned long flags);
> +
> +/**
> + *  kaiser_remove_mapping - remove a kernel mapping from the userpage tables
> + *  @addr: the start address of the range
> + *  @size: the size of the range
> + */
> +extern void kaiser_remove_mapping(unsigned long start, unsigned long size);
> +
> +/**
> + *  kaiser_init - Initialize the shadow mapping
> + *
> + *  Most parts of the shadow mapping can be mapped upon boot
> + *  time.  Only per-process things like the thread stacks
> + *  or a new LDT have to be mapped at runtime.  These boot-
> + *  time mappings are permanent and never unmapped.
> + */
> +extern void kaiser_init(void);

Those externs are not needed.

> +
> +#endif
> +
> +#endif /* __ASSEMBLY__ */
> +
> +#endif /* _ASM_X86_KAISER_H */
> diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
> index f735c3016325..d3901124143f 100644
> --- a/arch/x86/include/asm/pgtable.h
> +++ b/arch/x86/include/asm/pgtable.h
> @@ -1106,6 +1106,11 @@ static inline void pmdp_set_wrprotect(struct mm_struct *mm,
>  static inline void clone_pgd_range(pgd_t *dst, pgd_t *src, int count)
>  {
>         memcpy(dst, src, count * sizeof(pgd_t));
> +#ifdef CONFIG_KAISER
> +	/* Clone the shadow pgd part as well */
> +	memcpy(kernel_to_shadow_pgdp(dst), kernel_to_shadow_pgdp(src),
> +	       count * sizeof(pgd_t));
> +#endif
>  }
>  
>  #define PTE_SHIFT ilog2(PTRS_PER_PTE)
> diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h
> index e9f05331e732..c239839e92bd 100644
> --- a/arch/x86/include/asm/pgtable_64.h
> +++ b/arch/x86/include/asm/pgtable_64.h
> @@ -131,9 +131,137 @@ static inline pud_t native_pudp_get_and_clear(pud_t *xp)
>  #endif
>  }
>  
> +#ifdef CONFIG_KAISER
> +/*
> + * All top-level KAISER page tables are order-1 pages (8k-aligned
> + * and 8k in size).  The kernel one is at the beginning 4k and
> + * the user (shadow) one is in the last 4k.  To switch between
> + * them, you just need to flip the 12th bit in their addresses.
> + */
> +#define KAISER_PGTABLE_SWITCH_BIT	PAGE_SHIFT
> +
> +/*
> + * This generates better code than the inline assembly in
> + * __set_bit().
> + */
> +static inline void *ptr_set_bit(void *ptr, int bit)
> +{
> +	unsigned long __ptr = (unsigned long)ptr;
> +
> +	__ptr |= (1<<bit);

	__ptr |= BIT(bit);

Ditto below.

> +	return (void *)__ptr;
> +}
> +static inline void *ptr_clear_bit(void *ptr, int bit)
> +{
> +	unsigned long __ptr = (unsigned long)ptr;
> +
> +	__ptr &= ~(1<<bit);
> +	return (void *)__ptr;
> +}
> +
> +static inline pgd_t *kernel_to_shadow_pgdp(pgd_t *pgdp)
> +{
> +	return ptr_set_bit(pgdp, KAISER_PGTABLE_SWITCH_BIT);
> +}
> +static inline pgd_t *shadow_to_kernel_pgdp(pgd_t *pgdp)
> +{
> +	return ptr_clear_bit(pgdp, KAISER_PGTABLE_SWITCH_BIT);
> +}
> +static inline p4d_t *kernel_to_shadow_p4dp(p4d_t *p4dp)
> +{
> +	return ptr_set_bit(p4dp, KAISER_PGTABLE_SWITCH_BIT);
> +}
> +static inline p4d_t *shadow_to_kernel_p4dp(p4d_t *p4dp)
> +{
> +	return ptr_clear_bit(p4dp, KAISER_PGTABLE_SWITCH_BIT);
> +}
> +#endif /* CONFIG_KAISER */
> +
> +/*
> + * Page table pages are page-aligned.  The lower half of the top
> + * level is used for userspace and the top half for the kernel.
> + *
> + * Returns true for parts of the PGD that map userspace and
> + * false for the parts that map the kernel.
> + */
> +static inline bool pgdp_maps_userspace(void *__ptr)
> +{
> +	unsigned long ptr = (unsigned long)__ptr;
> +
> +	return (ptr & ~PAGE_MASK) < (PAGE_SIZE / 2);

This generates

	return (ptr & ~(~(((1UL) << 12)-1))) < (((1UL) << 12) / 2);

and if you turn that ~PAGE_MASK into PAGE_SIZE-1 you get

	return (ptr &  (((1UL) << 12)-1))  < (((1UL) << 12) / 2);

which removes the self-cancelling negation:

	return (ptr & (PAGE_SIZE-1)) < (PAGE_SIZE / 2);

The final asm is the same, though.

> +
> +/*
> + * Does this PGD allow access from userspace?
> + */
> +static inline bool pgd_userspace_access(pgd_t pgd)
> +{
> +	return pgd.pgd & _PAGE_USER;
> +}
> +
> +/*
> + * Take a PGD location (pgdp) and a pgd value that needs
> + * to be set there.  Populates the shadow and returns
> + * the resulting PGD that must be set in the kernel copy
> + * of the page tables.
> + */
> +static inline pgd_t kaiser_set_shadow_pgd(pgd_t *pgdp, pgd_t pgd)
> +{
> +#ifdef CONFIG_KAISER
> +	if (pgd_userspace_access(pgd)) {
> +		if (pgdp_maps_userspace(pgdp)) {
> +			/*
> +			 * The user/shadow page tables get the full
> +			 * PGD, accessible from userspace:
> +			 */
> +			kernel_to_shadow_pgdp(pgdp)->pgd = pgd.pgd;
> +			/*
> +			 * For the copy of the pgd that the kernel
> +			 * uses, make it unusable to userspace.  This
> +			 * ensures if we get out to userspace with the
> +			 * wrong CR3 value, userspace will crash
> +			 * instead of running.
> +			 */
> +			pgd.pgd |= _PAGE_NX;
> +		}
> +	} else if (pgd_userspace_access(*pgdp)) {
> +		/*
> +		 * We are clearing a _PAGE_USER PGD for which we
> +		 * presumably populated the shadow.  We must now
> +		 * clear the shadow PGD entry.
> +		 */
> +		if (pgdp_maps_userspace(pgdp)) {
> +			kernel_to_shadow_pgdp(pgdp)->pgd = pgd.pgd;
> +		} else {
> +			/*
> +			 * Attempted to clear a _PAGE_USER PGD which
> +			 * is in the kernel porttion of the address
> +			 * space.  PGDs are pre-populated and we
> +			 * never clear them.
> +			 */
> +			WARN_ON_ONCE(1);
> +		}
> +	} else {
> +		/*
> +		 * _PAGE_USER was not set in either the PGD being set
> +		 * or cleared.  All kernel PGDs should be
> +		 * pre-populated so this should never happen after
> +		 * boot.
> +		 */

So maybe do:

		WARN_ON_ONCE(system_state == SYSTEM_RUNNING);

> +	}
> +#endif
> +	/* return the copy of the PGD we want the kernel to use: */
> +	return pgd;
> +}
> +
> +
>  static inline void native_set_p4d(p4d_t *p4dp, p4d_t p4d)
>  {
> +#if defined(CONFIG_KAISER) && !defined(CONFIG_X86_5LEVEL)
> +	p4dp->pgd = kaiser_set_shadow_pgd(&p4dp->pgd, p4d.pgd);
> +#else /* CONFIG_KAISER */

No need for that comment.

>  	*p4dp = p4d;
> +#endif
>  }
>  
>  static inline void native_p4d_clear(p4d_t *p4d)
> @@ -147,7 +275,11 @@ static inline void native_p4d_clear(p4d_t *p4d)
>  
>  static inline void native_set_pgd(pgd_t *pgdp, pgd_t pgd)
>  {
> +#ifdef CONFIG_KAISER
> +	*pgdp = kaiser_set_shadow_pgd(pgdp, pgd);
> +#else /* CONFIG_KAISER */

No need for that comment.

>  	*pgdp = pgd;
> +#endif
>  }
>  
>  static inline void native_pgd_clear(pgd_t *pgd)
> diff --git a/arch/x86/kernel/espfix_64.c b/arch/x86/kernel/espfix_64.c
> index 7d7715dde901..4780dba2cc59 100644
> --- a/arch/x86/kernel/espfix_64.c
> +++ b/arch/x86/kernel/espfix_64.c
> @@ -41,6 +41,7 @@
>  #include <asm/pgalloc.h>
>  #include <asm/setup.h>
>  #include <asm/espfix.h>
> +#include <asm/kaiser.h>
>  
>  /*
>   * Note: we only need 6*8 = 48 bytes for the espfix stack, but round
> @@ -128,6 +129,22 @@ void __init init_espfix_bsp(void)
>  	pgd = &init_top_pgt[pgd_index(ESPFIX_BASE_ADDR)];
>  	p4d = p4d_alloc(&init_mm, pgd, ESPFIX_BASE_ADDR);
>  	p4d_populate(&init_mm, p4d, espfix_pud_page);

<---- newline here.

> +	/*
> +	 * Just copy the top-level PGD that is mapping the espfix
> +	 * area to ensure it is mapped into the shadow user page
> +	 * tables.
> +	 *
> +	 * For 5-level paging, the espfix pgd was populated when
> +	 * kaiser_init() pre-populated all the pgd entries.  The above
> +	 * p4d_alloc() would never do anything and the p4d_populate()
> +	 * would be done to a p4d already mapped in the userspace pgd.
> +	 */
> +#ifdef CONFIG_KAISER
> +	if (CONFIG_PGTABLE_LEVELS <= 4) {
> +		set_pgd(kernel_to_shadow_pgdp(pgd),
> +			__pgd(_KERNPG_TABLE | (p4d_pfn(*p4d) << PAGE_SHIFT)));
> +	}
> +#endif
>  
>  	/* Randomize the locations */
>  	init_espfix_random();

End of part I of the review of this biggy :)

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 25/43] x86/mm/kaiser: Unmap kernel from userspace page tables (core patch), noexec=off
  2017-11-24 17:23 ` [PATCH 25/43] x86/mm/kaiser: Unmap kernel from userspace page tables (core patch) Ingo Molnar
  2017-11-26 18:51   ` Borislav Petkov
  2017-11-26 20:49   ` Borislav Petkov
@ 2017-11-26 22:25   ` Borislav Petkov
  2017-11-26 22:41     ` Thomas Gleixner
  2 siblings, 1 reply; 113+ messages in thread
From: Borislav Petkov @ 2017-11-26 22:25 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Dave Hansen, Andy Lutomirski, Thomas Gleixner,
	H . Peter Anvin, Peter Zijlstra, Linus Torvalds

On Fri, Nov 24, 2017 at 06:23:53PM +0100, Ingo Molnar wrote:
> + * Take a PGD location (pgdp) and a pgd value that needs
> + * to be set there.  Populates the shadow and returns
> + * the resulting PGD that must be set in the kernel copy
> + * of the page tables.
> + */
> +static inline pgd_t kaiser_set_shadow_pgd(pgd_t *pgdp, pgd_t pgd)
> +{
> +#ifdef CONFIG_KAISER
> +	if (pgd_userspace_access(pgd)) {
> +		if (pgdp_maps_userspace(pgdp)) {
> +			/*
> +			 * The user/shadow page tables get the full
> +			 * PGD, accessible from userspace:
> +			 */
> +			kernel_to_shadow_pgdp(pgdp)->pgd = pgd.pgd;
> +			/*
> +			 * For the copy of the pgd that the kernel
> +			 * uses, make it unusable to userspace.  This
> +			 * ensures if we get out to userspace with the
> +			 * wrong CR3 value, userspace will crash
> +			 * instead of running.
> +			 */
> +			pgd.pgd |= _PAGE_NX;

Lemme hold this down here so that we don't forget (and tglx is working on it
already... ):

So we need to handle the case where we boot with "noexec=off" and thus
clear _PAGE_NX from __supported_pte_mask. I'd vouch for a conservative
solution where we warn if _PAGE_NX is not set in __supported_pte_mask
and thus at least tell the user that she shouldn't do noexec kernels and
expect kaiser protection...

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 25/43] x86/mm/kaiser: Unmap kernel from userspace page tables (core patch), noexec=off
  2017-11-26 22:25   ` [PATCH 25/43] x86/mm/kaiser: Unmap kernel from userspace page tables (core patch), noexec=off Borislav Petkov
@ 2017-11-26 22:41     ` Thomas Gleixner
  0 siblings, 0 replies; 113+ messages in thread
From: Thomas Gleixner @ 2017-11-26 22:41 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Ingo Molnar, linux-kernel, Dave Hansen, Andy Lutomirski,
	H . Peter Anvin, Peter Zijlstra, Linus Torvalds

On Sun, 26 Nov 2017, Borislav Petkov wrote:
> On Fri, Nov 24, 2017 at 06:23:53PM +0100, Ingo Molnar wrote:
> > + * Take a PGD location (pgdp) and a pgd value that needs
> > + * to be set there.  Populates the shadow and returns
> > + * the resulting PGD that must be set in the kernel copy
> > + * of the page tables.
> > + */
> > +static inline pgd_t kaiser_set_shadow_pgd(pgd_t *pgdp, pgd_t pgd)
> > +{
> > +#ifdef CONFIG_KAISER
> > +	if (pgd_userspace_access(pgd)) {
> > +		if (pgdp_maps_userspace(pgdp)) {
> > +			/*
> > +			 * The user/shadow page tables get the full
> > +			 * PGD, accessible from userspace:
> > +			 */
> > +			kernel_to_shadow_pgdp(pgdp)->pgd = pgd.pgd;
> > +			/*
> > +			 * For the copy of the pgd that the kernel
> > +			 * uses, make it unusable to userspace.  This
> > +			 * ensures if we get out to userspace with the
> > +			 * wrong CR3 value, userspace will crash
> > +			 * instead of running.
> > +			 */
> > +			pgd.pgd |= _PAGE_NX;
> 
> Lemme hold this down here so that we don't forget (and tglx is working on it
> already... ):

A worse problem exists vs. PAGE_GLOBAL. Its enforced by KAISER
unconditionally... Fixes are posted soon as an update to the series I sent
earlier today.

> So we need to handle the case where we boot with "noexec=off" and thus
> clear _PAGE_NX from __supported_pte_mask. I'd vouch for a conservative
> solution where we warn if _PAGE_NX is not set in __supported_pte_mask
> and thus at least tell the user that she shouldn't do noexec kernels and
> expect kaiser protection...

Let me add that.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH] x86/mm/kaiser: Fix IRQ entries text section mapping
  2017-11-25 16:08                 ` Thomas Gleixner
  2017-11-25 20:06                   ` Steven Rostedt
@ 2017-11-27  8:14                   ` Peter Zijlstra
  2017-11-27  8:21                     ` Peter Zijlstra
  1 sibling, 1 reply; 113+ messages in thread
From: Peter Zijlstra @ 2017-11-27  8:14 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Ingo Molnar, Andy Lutomirski, Dave Hansen, Josh Poimboeuf, LKML,
	Andy Lutomirski, H . Peter Anvin, Borislav Petkov,
	Linus Torvalds, Steven Rostedt, Andrey Ryabinin, kasan-dev,
	Masami Hiramatsu

On Sat, Nov 25, 2017 at 05:08:08PM +0100, Thomas Gleixner wrote:
> On Sat, 25 Nov 2017, Ingo Molnar wrote:
> >  	kaiser_add_user_map_ptrs_early(__entry_text_start, __entry_text_end,
> >  				       __PAGE_KERNEL_RX | _PAGE_GLOBAL);
> > +	kaiser_add_user_map_ptrs_early(__irqentry_text_start, __irqentry_text_end,
> > +				       __PAGE_KERNEL_RX | _PAGE_GLOBAL);
> 
> The bad news is that this maps more stuff than actually needed:
> 
> ffffffff81ab1c20 T __irqentry_text_start
> ffffffff81ab1c20 T apic_timer_interrupt
> ffffffff81ab1cf0 T x86_platform_ipi
> ffffffff81ab1dc0 T threshold_interrupt
> ffffffff81ab1e90 T deferred_error_interrupt
> ffffffff81ab1f60 T thermal_interrupt
> ffffffff81ab2030 T call_function_single_interrupt
> ffffffff81ab2100 T call_function_interrupt
> ffffffff81ab21d0 T reschedule_interrupt
> ffffffff81ab22a0 T error_interrupt
> ffffffff81ab2370 T spurious_interrupt
> ffffffff81ab2440 T irq_work_interrupt

I'm confused though; for IDT we have that trampoline Andy did. So the
interrupt should land on the entry stack, do the switch magic and then
call the 'real' IDT entry, no?

So why do these functions need to be mapped at all?

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH] x86/mm/kaiser: Fix IRQ entries text section mapping
  2017-11-27  8:14                   ` Peter Zijlstra
@ 2017-11-27  8:21                     ` Peter Zijlstra
  0 siblings, 0 replies; 113+ messages in thread
From: Peter Zijlstra @ 2017-11-27  8:21 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Ingo Molnar, Andy Lutomirski, Dave Hansen, Josh Poimboeuf, LKML,
	Andy Lutomirski, H . Peter Anvin, Borislav Petkov,
	Linus Torvalds, Steven Rostedt, Andrey Ryabinin, kasan-dev,
	Masami Hiramatsu

On Mon, Nov 27, 2017 at 09:14:48AM +0100, Peter Zijlstra wrote:
> On Sat, Nov 25, 2017 at 05:08:08PM +0100, Thomas Gleixner wrote:
> > On Sat, 25 Nov 2017, Ingo Molnar wrote:
> > >  	kaiser_add_user_map_ptrs_early(__entry_text_start, __entry_text_end,
> > >  				       __PAGE_KERNEL_RX | _PAGE_GLOBAL);
> > > +	kaiser_add_user_map_ptrs_early(__irqentry_text_start, __irqentry_text_end,
> > > +				       __PAGE_KERNEL_RX | _PAGE_GLOBAL);
> > 
> > The bad news is that this maps more stuff than actually needed:
> > 
> > ffffffff81ab1c20 T __irqentry_text_start
> > ffffffff81ab1c20 T apic_timer_interrupt
> > ffffffff81ab1cf0 T x86_platform_ipi
> > ffffffff81ab1dc0 T threshold_interrupt
> > ffffffff81ab1e90 T deferred_error_interrupt
> > ffffffff81ab1f60 T thermal_interrupt
> > ffffffff81ab2030 T call_function_single_interrupt
> > ffffffff81ab2100 T call_function_interrupt
> > ffffffff81ab21d0 T reschedule_interrupt
> > ffffffff81ab22a0 T error_interrupt
> > ffffffff81ab2370 T spurious_interrupt
> > ffffffff81ab2440 T irq_work_interrupt
> 
> I'm confused though; for IDT we have that trampoline Andy did. So the
> interrupt should land on the entry stack, do the switch magic and then
> call the 'real' IDT entry, no?
> 
> So why do these functions need to be mapped at all?

Aah, I see. The switch magic is in the generic macro and thus ends up in
the above functions...

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 12/43] x86/espfix/64: Stop assuming that pt_regs is on the entry stack
  2017-11-26 17:28         ` Borislav Petkov
@ 2017-11-27  9:19           ` Ingo Molnar
  0 siblings, 0 replies; 113+ messages in thread
From: Ingo Molnar @ 2017-11-27  9:19 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Andy Lutomirski, linux-kernel, Dave Hansen, Thomas Gleixner,
	H . Peter Anvin, Peter Zijlstra, Linus Torvalds


* Borislav Petkov <bp@alien8.de> wrote:

> On Sun, Nov 26, 2017 at 03:05:33PM +0100, Ingo Molnar wrote:
> > > https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/commit/?h=x86/entry_stack&id=6510485d026abdd144d170b1bc8508327acb5648
> > 
> > I have added Boris's reviewed-by tag:
> > 
> >   Reviewed-by: Borislav Petkov <bp@suse.de>
> > 
> > which I suspect applies now?
> 
> Almost, looking at the patch at that URL, there's a trailing "modify" in the
> comment:
> 
> +	 * we land directly at the #GP(0) vector with the stack already
> +	 * set up according to its expectations.modify
> 						^^^^^^
> which you can amend-remove.

Oops - fixed.

> Other than that, it looks good!

Thanks!

	Ingo

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 23/43] x86/mm/kaiser: Introduce user-mapped per-cpu areas
  2017-11-26 17:41   ` Borislav Petkov
@ 2017-11-27  9:26     ` Ingo Molnar
  2017-11-27 21:14     ` Dave Hansen
  1 sibling, 0 replies; 113+ messages in thread
From: Ingo Molnar @ 2017-11-27  9:26 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: linux-kernel, Dave Hansen, Andy Lutomirski, Thomas Gleixner,
	H . Peter Anvin, Peter Zijlstra, Linus Torvalds


* Borislav Petkov <bp@alien8.de> wrote:

> On Fri, Nov 24, 2017 at 06:23:51PM +0100, Ingo Molnar wrote:
> > From: Dave Hansen <dave.hansen@linux.intel.com>
> > 
> > These patches are based on work from a team at Graz University of
> > Technology posted here: https://github.com/IAIK/KAISER
> > 
> > The KAISER approach keeps two copies of the page tables: one for running
> > in the kernel and one for running userspace.  But, there are a few
> > structures that are needed for switching in and out of the kernel and
> > a good subset of *those* are per-cpu data.
> > 
> > This patch creates a new kind of per-cpu data that is mapped and
> 
> Never say "This patch" in the commit message of a patch. It is
> tautologically useless.

It makes sense in some contexts though. For example:

  The compat variant of SYSCALL doesn't have this problem in the first
  place -- there are plenty of scratch registers, since we don't care
  about preserving r8-r15. This patch therefore doesn't touch SYSCALL32
  at all.

If we only had:

   We don't touch SYSCALL32 at all.

it would be ambiguous: does it describe the current status quo, or perhaps a 
decision we made when writing the patch? The 'this patch' variant adds extra 
emphasis that SYSCALL32 is fine and doesn't require any changes.

Also, even the above variant:

> > This patch creates a new kind of per-cpu data that is mapped and ...

is a bit clearer than:

> > Create a new kind of per-cpu data that is mapped and ...

Because it stresses that this is a new change. The latter form includes that 
information as well, but is pretty close to:

> > The kernel creates a new kind of per-cpu data that is mapped and ...

... which describes the status quo and not new behavior. Explicitly qualifying who 
does something makes it clearer.

I.e. redundancy sometimes helps readability. We don't want 'this patch' in every 
second sentence, but having it written out once, especially where we switch from 
describing existing behavior to describing new behavior (as is the case here) is 
perfectly fine and improves readability.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 20/43] x86/entry: Clean up SYSENTER_stack code
  2017-11-25 17:26             ` Andy Lutomirski
@ 2017-11-27  9:27               ` Peter Zijlstra
  0 siblings, 0 replies; 113+ messages in thread
From: Peter Zijlstra @ 2017-11-27  9:27 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Borislav Petkov, Thomas Gleixner, Ingo Molnar, linux-kernel,
	Dave Hansen, H . Peter Anvin, Linus Torvalds

On Sat, Nov 25, 2017 at 09:26:38AM -0800, Andy Lutomirski wrote:
> On Sat, Nov 25, 2017 at 9:10 AM, Borislav Petkov <bp@alien8.de> wrote:
> > On Sat, Nov 25, 2017 at 06:03:36PM +0100, Thomas Gleixner wrote:
> >> > Maybe I should rename cpu_tss to cpu_tss_rw in that patch.
> >>
> >> For clarity that would be nice.
> >
> > + a comment stating the alias mapping. It took tglx and me a while on
> > IRC to figure it out. :-)
> 
> I'm adding this comment to the top of cpu_entry_area:
> 
> + * Every field is a virtual alias of some other allocated backing store.
> + * There is no direct allocation of a struct cpu_entry_area.

Right, thanks! That did take me a few minutes to sort as well :-)

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 25/43] x86/mm/kaiser: Unmap kernel from userspace page tables (core patch)
  2017-11-26 18:51   ` Borislav Petkov
@ 2017-11-27  9:30     ` Ingo Molnar
  0 siblings, 0 replies; 113+ messages in thread
From: Ingo Molnar @ 2017-11-27  9:30 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: linux-kernel, Dave Hansen, Andy Lutomirski, Thomas Gleixner,
	H . Peter Anvin, Peter Zijlstra, Linus Torvalds


* Borislav Petkov <bp@alien8.de> wrote:

> On Fri, Nov 24, 2017 at 06:23:53PM +0100, Ingo Molnar wrote:
> > diff --git a/Documentation/x86/kaiser.txt b/Documentation/x86/kaiser.txt
> > new file mode 100644
> 
> Here some text cleanups/typos fixes on top, after reading through it:

Applied, thanks!

	Ingo

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 25/43] x86/mm/kaiser: Unmap kernel from userspace page tables (core patch)
  2017-11-26 20:49   ` Borislav Petkov
@ 2017-11-27 10:38     ` Ingo Molnar
  0 siblings, 0 replies; 113+ messages in thread
From: Ingo Molnar @ 2017-11-27 10:38 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: linux-kernel, Dave Hansen, Andy Lutomirski, Thomas Gleixner,
	H . Peter Anvin, Peter Zijlstra, Linus Torvalds


* Borislav Petkov <bp@alien8.de> wrote:

> On Fri, Nov 24, 2017 at 06:23:53PM +0100, Ingo Molnar wrote:
> > diff --git a/arch/x86/entry/calling.h b/arch/x86/entry/calling.h
> > index e1650da01323..d087c3aa0514 100644
> > --- a/arch/x86/entry/calling.h
> > +++ b/arch/x86/entry/calling.h
> > @@ -2,6 +2,7 @@
> >  #include <linux/jump_label.h>
> >  #include <asm/unwind_hints.h>
> >  #include <asm/cpufeatures.h>
> > +#include <asm/page_types.h>
> >  
> >  /*
> >  
> > diff --git a/arch/x86/include/asm/kaiser.h b/arch/x86/include/asm/kaiser.h
> > new file mode 100644
> > index 000000000000..3c2cc71b4058
> > --- /dev/null
> > +++ b/arch/x86/include/asm/kaiser.h
> > @@ -0,0 +1,57 @@
> > +#ifndef _ASM_X86_KAISER_H
> > +#define _ASM_X86_KAISER_H
> > +/*
> > + * Copyright(c) 2017 Intel Corporation. All rights reserved.
> > + *
> > + * This program is free software; you can redistribute it and/or modify
> > + * it under the terms of version 2 of the GNU General Public License as
> > + * published by the Free Software Foundation.
> > + *
> > + * This program is distributed in the hope that it will be useful, but
> > + * WITHOUT ANY WARRANTY; without even the implied warranty of
> > + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
> > + * General Public License for more details.
> > + *
> > + * Based on work published here: https://github.com/IAIK/KAISER
> > + * Modified by Dave Hansen <dave.hansen@intel.com to actually work.
> > + */
> > +#ifndef __ASSEMBLY__
> > +
> > +#ifdef CONFIG_KAISER
> > +/**
> > + *  kaiser_add_mapping - map a kernel range into the user page tables
> > + *  @addr: the start address of the range
> > + *  @size: the size of the range
> > + *  @flags: The mapping flags of the pages
> > + *
> > + *  Use this on all data and code that need to be mapped into both
> > + *  copies of the page tables.  This includes the code that switches
> > + *  to/from userspace and all of the hardware structures that are
> > + *  virtually-addressed and needed in userspace like the interrupt
> > + *  table.
> > + */
> > +extern int kaiser_add_mapping(unsigned long addr, unsigned long size,
> > +			      unsigned long flags);
> > +
> > +/**
> > + *  kaiser_remove_mapping - remove a kernel mapping from the userpage tables
> > + *  @addr: the start address of the range
> > + *  @size: the size of the range
> > + */
> > +extern void kaiser_remove_mapping(unsigned long start, unsigned long size);
> > +
> > +/**
> > + *  kaiser_init - Initialize the shadow mapping
> > + *
> > + *  Most parts of the shadow mapping can be mapped upon boot
> > + *  time.  Only per-process things like the thread stacks
> > + *  or a new LDT have to be mapped at runtime.  These boot-
> > + *  time mappings are permanent and never unmapped.
> > + */
> > +extern void kaiser_init(void);
> 
> Those externs are not needed.

Well, we typically use them for API definitions, even though technically they are 
superfluous.

> > + * This generates better code than the inline assembly in
> > + * __set_bit().
> > + */
> > +static inline void *ptr_set_bit(void *ptr, int bit)
> > +{
> > +	unsigned long __ptr = (unsigned long)ptr;
> > +
> > +	__ptr |= (1<<bit);
> 
> 	__ptr |= BIT(bit);
> 
> Ditto below.

Fixed.

> > +static inline bool pgdp_maps_userspace(void *__ptr)
> > +{
> > +	unsigned long ptr = (unsigned long)__ptr;
> > +
> > +	return (ptr & ~PAGE_MASK) < (PAGE_SIZE / 2);
> 
> This generates
> 
> 	return (ptr & ~(~(((1UL) << 12)-1))) < (((1UL) << 12) / 2);
> 
> and if you turn that ~PAGE_MASK into PAGE_SIZE-1 you get
> 
> 	return (ptr &  (((1UL) << 12)-1))  < (((1UL) << 12) / 2);
> 
> which removes the self-cancelling negation:
> 
> 	return (ptr & (PAGE_SIZE-1)) < (PAGE_SIZE / 2);
> 
> The final asm is the same, though.

I left the original version, because it generates the same output - unless you 
feel strongly about this?

> > +	} else {
> > +		/*
> > +		 * _PAGE_USER was not set in either the PGD being set
> > +		 * or cleared.  All kernel PGDs should be
> > +		 * pre-populated so this should never happen after
> > +		 * boot.
> > +		 */
> 
> So maybe do:
> 
> 		WARN_ON_ONCE(system_state == SYSTEM_RUNNING);

Indeed, check added.

> >  static inline void native_set_p4d(p4d_t *p4dp, p4d_t p4d)
> >  {
> > +#if defined(CONFIG_KAISER) && !defined(CONFIG_X86_5LEVEL)
> > +	p4dp->pgd = kaiser_set_shadow_pgd(&p4dp->pgd, p4d.pgd);
> > +#else /* CONFIG_KAISER */
> 
> No need for that comment.

Fixed.

> >  /*
> >   * Note: we only need 6*8 = 48 bytes for the espfix stack, but round
> > @@ -128,6 +129,22 @@ void __init init_espfix_bsp(void)
> >  	pgd = &init_top_pgt[pgd_index(ESPFIX_BASE_ADDR)];
> >  	p4d = p4d_alloc(&init_mm, pgd, ESPFIX_BASE_ADDR);
> >  	p4d_populate(&init_mm, p4d, espfix_pud_page);
> 
> <---- newline here.

Added.

> >  	/* Randomize the locations */
> >  	init_espfix_random();
> 
> End of part I of the review of this biggy :)

Below is the delta patch - I'm backmerging it to the core patch.

I'll send out the whole series in the next hour or so, people asked for it so that 
they can review the latest.

Thanks,

	Ingo

============>

 arch/x86/include/asm/pgtable_64.h | 9 +++++----
 arch/x86/kernel/espfix_64.c       | 1 +
 2 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h
index c8c0dc1cfe86..8d725fcb921b 100644
--- a/arch/x86/include/asm/pgtable_64.h
+++ b/arch/x86/include/asm/pgtable_64.h
@@ -148,14 +148,14 @@ static inline void *ptr_set_bit(void *ptr, int bit)
 {
 	unsigned long __ptr = (unsigned long)ptr;
 
-	__ptr |= (1<<bit);
+	__ptr |= BIT(bit);
 	return (void *)__ptr;
 }
 static inline void *ptr_clear_bit(void *ptr, int bit)
 {
 	unsigned long __ptr = (unsigned long)ptr;
 
-	__ptr &= ~(1<<bit);
+	__ptr &= ~BIT(bit);
 	return (void *)__ptr;
 }
 
@@ -255,6 +255,7 @@ static inline pgd_t kaiser_set_shadow_pgd(pgd_t *pgdp, pgd_t pgd)
 		 * pre-populated so this should never happen after
 		 * boot.
 		 */
+		WARN_ON_ONCE(system_state == SYSTEM_RUNNING);
 	}
 #endif
 	/* return the copy of the PGD we want the kernel to use: */
@@ -266,7 +267,7 @@ static inline void native_set_p4d(p4d_t *p4dp, p4d_t p4d)
 {
 #if defined(CONFIG_KAISER) && !defined(CONFIG_X86_5LEVEL)
 	p4dp->pgd = kaiser_set_shadow_pgd(&p4dp->pgd, p4d.pgd);
-#else /* CONFIG_KAISER */
+#else
 	*p4dp = p4d;
 #endif
 }
@@ -284,7 +285,7 @@ static inline void native_set_pgd(pgd_t *pgdp, pgd_t pgd)
 {
 #ifdef CONFIG_KAISER
 	*pgdp = kaiser_set_shadow_pgd(pgdp, pgd);
-#else /* CONFIG_KAISER */
+#else
 	*pgdp = pgd;
 #endif
 }
diff --git a/arch/x86/kernel/espfix_64.c b/arch/x86/kernel/espfix_64.c
index 8bb116d73aaa..319033f5bbd9 100644
--- a/arch/x86/kernel/espfix_64.c
+++ b/arch/x86/kernel/espfix_64.c
@@ -129,6 +129,7 @@ void __init init_espfix_bsp(void)
 	pgd = &init_top_pgt[pgd_index(ESPFIX_BASE_ADDR)];
 	p4d = p4d_alloc(&init_mm, pgd, ESPFIX_BASE_ADDR);
 	p4d_populate(&init_mm, p4d, espfix_pud_page);
+
 	/*
 	 * Just copy the top-level PGD that is mapping the espfix
 	 * area to ensure it is mapped into the shadow user page

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH v2] x86/mm/kaiser: Prepare the x86/entry assembly code for entry/exit CR3 switching
  2017-11-26 14:55     ` [PATCH v2] x86/mm/kaiser: Prepare the x86/entry assembly code " Ingo Molnar
@ 2017-11-27 13:29       ` Josh Poimboeuf
  2017-11-27 13:36         ` Thomas Gleixner
  0 siblings, 1 reply; 113+ messages in thread
From: Josh Poimboeuf @ 2017-11-27 13:29 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Borislav Petkov, linux-kernel, Dave Hansen, Andy Lutomirski,
	Thomas Gleixner, H . Peter Anvin, Peter Zijlstra, Linus Torvalds

On Sun, Nov 26, 2017 at 03:55:53PM +0100, Ingo Molnar wrote:
> @@ -198,6 +201,14 @@ ENTRY(entry_SYSCALL_64)
>  
>  	swapgs
>  	movq	%rsp, PER_CPU_VAR(rsp_scratch)
> +
> +	/*
> +	 * The kernel CR3 is needed to map the process stack, but we
> +	 * need a scratch register to be able to load CR3.  %rsp is
> +	 * clobberable right now, so use it as a scratch register.
> +	 * %rsp will look crazy here for a couple instructions.
> +	 */
> +	SWITCH_TO_KERNEL_CR3 scratch_reg=%rsp
>  	movq	PER_CPU_VAR(cpu_current_top_of_stack), %rsp

I think it doesn't make sense to have a SWITCH_TO_KERNEL_CR3 here.  This
code is the (future) kaiser-disabled path, where SWITCH_TO_KERNEL_CR3 is
a no-op anyway.

-- 
Josh

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH v2] x86/mm/kaiser: Prepare the x86/entry assembly code for entry/exit CR3 switching
  2017-11-27 13:29       ` Josh Poimboeuf
@ 2017-11-27 13:36         ` Thomas Gleixner
  0 siblings, 0 replies; 113+ messages in thread
From: Thomas Gleixner @ 2017-11-27 13:36 UTC (permalink / raw)
  To: Josh Poimboeuf
  Cc: Ingo Molnar, Borislav Petkov, linux-kernel, Dave Hansen,
	Andy Lutomirski, H . Peter Anvin, Peter Zijlstra, Linus Torvalds

On Mon, 27 Nov 2017, Josh Poimboeuf wrote:
> On Sun, Nov 26, 2017 at 03:55:53PM +0100, Ingo Molnar wrote:
> > @@ -198,6 +201,14 @@ ENTRY(entry_SYSCALL_64)
> >  
> >  	swapgs
> >  	movq	%rsp, PER_CPU_VAR(rsp_scratch)
> > +
> > +	/*
> > +	 * The kernel CR3 is needed to map the process stack, but we
> > +	 * need a scratch register to be able to load CR3.  %rsp is
> > +	 * clobberable right now, so use it as a scratch register.
> > +	 * %rsp will look crazy here for a couple instructions.
> > +	 */
> > +	SWITCH_TO_KERNEL_CR3 scratch_reg=%rsp
> >  	movq	PER_CPU_VAR(cpu_current_top_of_stack), %rsp
> 
> I think it doesn't make sense to have a SWITCH_TO_KERNEL_CR3 here.  This
> code is the (future) kaiser-disabled path, where SWITCH_TO_KERNEL_CR3 is
> a no-op anyway.

Yes, Borislav noticed that as well. That's a left over from the early
versions where the trampoline did not exist.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 23/43] x86/mm/kaiser: Introduce user-mapped per-cpu areas
  2017-11-26 17:41   ` Borislav Petkov
  2017-11-27  9:26     ` Ingo Molnar
@ 2017-11-27 21:14     ` Dave Hansen
  1 sibling, 0 replies; 113+ messages in thread
From: Dave Hansen @ 2017-11-27 21:14 UTC (permalink / raw)
  To: Borislav Petkov, Ingo Molnar
  Cc: linux-kernel, Andy Lutomirski, Thomas Gleixner, H . Peter Anvin,
	Peter Zijlstra, Linus Torvalds

On 11/26/2017 09:41 AM, Borislav Petkov wrote:
>> The KAISER approach keeps two copies of the page tables: one for running
>> in the kernel and one for running userspace.  But, there are a few
>> structures that are needed for switching in and out of the kernel and
>> a good subset of *those* are per-cpu data.
>>
>> This patch creates a new kind of per-cpu data that is mapped and
> Never say "This patch" in the commit message of a patch. It is
> tautologically useless.

Look at any academic paper's abstract.  They almost always describe the
problem and the state of the art, and then start to describe the paper's
content.  It's entirely normal to say "this paper" to help differentiate
these things.

Patches can and should be the same.

We should not litter the text with "this patch does this", "this patch
does that", but we should not outlaw it entirely.  IOW, you can't just
blindly say, not to do it.

^ permalink raw reply	[flat|nested] 113+ messages in thread

* Re: [PATCH 15/43] x86/entry/64: Create a percpu SYSCALL entry trampoline
  2017-11-24  9:14 ` [PATCH 15/43] x86/entry/64: Create a percpu SYSCALL entry trampoline Ingo Molnar
@ 2017-11-24 13:52   ` Thomas Gleixner
  0 siblings, 0 replies; 113+ messages in thread
From: Thomas Gleixner @ 2017-11-24 13:52 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-kernel, Dave Hansen, Andy Lutomirski, H . Peter Anvin,
	Peter Zijlstra, Borislav Petkov, Linus Torvalds

On Fri, 24 Nov 2017, Ingo Molnar wrote:

> From: Andy Lutomirski <luto@kernel.org>
> 
> Handling SYSCALL is tricky: the SYSCALL handler is entered with every
> single register (except FLAGS), including RSP, live.  It somehow needs
> to set RSP to point to a valid stack, which means it needs to save the
> user RSP somewhere and find its own stack pointer.  The canonical way
> to do this is with SWAPGS, which lets us access percpu data using the
> %gs prefix.
> 
> With KAISER-like pagetable switching, this is problematic.  Without a
> scratch register, switching CR3 is impossible, so %gs-based percpu
> memory would need to be mapped in the user pagetables.  Doing that
> without information leaks is difficult or impossible.
> 
> Instead, use a different sneaky trick.  Map a copy of the first part
> of the SYSCALL asm at a different address for each CPU.  Now RIP
> varies depending on the CPU, so we can use RIP-relative memory access
> to access percpu memory.  By putting the relevant information (one
> scratch slot and the stack address) at a constant offset relative to
> RIP, we can make SYSCALL work without relying on %gs.

Smart!

> A nice thing about this approach is that we can easily switch it on
> and off if we want pagetable switching to be configurable.
> 
> The compat variant of SYSCALL doesn't have this problem in the first
> place -- there are plenty of scratch registers, since we don't care
> about preserving r8-r15.  This patch therefore doesn't touch SYSCALL32
> at all.
> 
> XXX: Whenever we settle how KAISER gets turned on and off, we should do
> the same to this.
> 
> Signed-off-by: Andy Lutomirski <luto@kernel.org>

Reviewed-by: Thomas Gleixner <tglx@linutronix.de>

^ permalink raw reply	[flat|nested] 113+ messages in thread

* [PATCH 15/43] x86/entry/64: Create a percpu SYSCALL entry trampoline
  2017-11-24  9:14 [PATCH 00/43] x86 entry-stack and Kaiser series, 2017/11/24 version Ingo Molnar
@ 2017-11-24  9:14 ` Ingo Molnar
  2017-11-24 13:52   ` Thomas Gleixner
  0 siblings, 1 reply; 113+ messages in thread
From: Ingo Molnar @ 2017-11-24  9:14 UTC (permalink / raw)
  To: linux-kernel
  Cc: Dave Hansen, Andy Lutomirski, Thomas Gleixner, H . Peter Anvin,
	Peter Zijlstra, Borislav Petkov, Linus Torvalds

From: Andy Lutomirski <luto@kernel.org>

Handling SYSCALL is tricky: the SYSCALL handler is entered with every
single register (except FLAGS), including RSP, live.  It somehow needs
to set RSP to point to a valid stack, which means it needs to save the
user RSP somewhere and find its own stack pointer.  The canonical way
to do this is with SWAPGS, which lets us access percpu data using the
%gs prefix.

With KAISER-like pagetable switching, this is problematic.  Without a
scratch register, switching CR3 is impossible, so %gs-based percpu
memory would need to be mapped in the user pagetables.  Doing that
without information leaks is difficult or impossible.

Instead, use a different sneaky trick.  Map a copy of the first part
of the SYSCALL asm at a different address for each CPU.  Now RIP
varies depending on the CPU, so we can use RIP-relative memory access
to access percpu memory.  By putting the relevant information (one
scratch slot and the stack address) at a constant offset relative to
RIP, we can make SYSCALL work without relying on %gs.

A nice thing about this approach is that we can easily switch it on
and off if we want pagetable switching to be configurable.

The compat variant of SYSCALL doesn't have this problem in the first
place -- there are plenty of scratch registers, since we don't care
about preserving r8-r15.  This patch therefore doesn't touch SYSCALL32
at all.

XXX: Whenever we settle how KAISER gets turned on and off, we should do
the same to this.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Borislav Petkov <bpetkov@suse.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/b95ccae0a5a2f090c901e49fce7c9e8ff6acd40d.1511497875.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/entry/entry_64.S     | 48 +++++++++++++++++++++++++++++++++++++++++++
 arch/x86/include/asm/fixmap.h |  2 ++
 arch/x86/kernel/asm-offsets.c |  1 +
 arch/x86/kernel/cpu/common.c  | 12 ++++++++++-
 arch/x86/kernel/vmlinux.lds.S | 10 +++++++++
 5 files changed, 72 insertions(+), 1 deletion(-)

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 426b8c669d6a..0cde243b7542 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -140,6 +140,54 @@ END(native_usergs_sysret64)
  * with them due to bugs in both AMD and Intel CPUs.
  */
 
+	.pushsection .entry_trampoline, "ax"
+
+/*
+ * The code in here gets remapped into cpu_entry_area's trampoline.  This means
+ * that the assembler and linker have the wrong idea as to where this code
+ * lives (and, in fact, it's mapped more than once, so it's not even at a
+ * fixed address).  So we can't reference any symbols outside the entry
+ * trampoline and expect it to work.
+ *
+ * Instead, we carefully abuse %rip-relative addressing.
+ * .Lentry_trampoline(%rip) refers to the start of the remapped) entry
+ * trampoline.  We can thus find cpu_entry_area with this macro:
+ */
+
+#define CPU_ENTRY_AREA \
+	_entry_trampoline - CPU_ENTRY_AREA_entry_trampoline(%rip)
+
+/* The top word of the SYSENTER stack is hot and is usable as scratch space. */
+#define RSP_SCRATCH CPU_ENTRY_AREA_tss + CPU_TSS_SYSENTER_stack + \
+	SIZEOF_SYSENTER_stack - 8 + CPU_ENTRY_AREA
+
+ENTRY(entry_SYSCALL_64_trampoline)
+	UNWIND_HINT_EMPTY
+	swapgs
+
+	/* Stash the user RSP. */
+	movq	%rsp, RSP_SCRATCH
+
+	/* Load the top of the task stack into RSP */
+	movq	CPU_ENTRY_AREA_tss + TSS_sp1 + CPU_ENTRY_AREA, %rsp
+
+	/* Start building the simulated IRET frame. */
+	pushq	$__USER_DS			/* pt_regs->ss */
+	pushq	RSP_SCRATCH			/* pt_regs->sp */
+	pushq	%r11				/* pt_regs->flags */
+	pushq	$__USER_CS			/* pt_regs->cs */
+	pushq	%rcx				/* pt_regs->ip */
+
+	/*
+	 * x86 lacks a near absolute jump, and we can't jump to the real
+	 * entry text with a relative jump, so we fake it using retq.
+	 */
+	pushq	$entry_SYSCALL_64_after_hwframe
+	retq
+END(entry_SYSCALL_64_trampoline)
+
+	.popsection
+
 ENTRY(entry_SYSCALL_64)
 	UNWIND_HINT_EMPTY
 	/*
diff --git a/arch/x86/include/asm/fixmap.h b/arch/x86/include/asm/fixmap.h
index 3a42da14c2cb..7eb1b5490395 100644
--- a/arch/x86/include/asm/fixmap.h
+++ b/arch/x86/include/asm/fixmap.h
@@ -58,6 +58,8 @@ struct cpu_entry_area {
 	 * of the TSS region.
 	 */
 	struct tss_struct tss;
+
+	char entry_trampoline[PAGE_SIZE];
 };
 
 #define CPU_ENTRY_AREA_PAGES (sizeof(struct cpu_entry_area) / PAGE_SIZE)
diff --git a/arch/x86/kernel/asm-offsets.c b/arch/x86/kernel/asm-offsets.c
index 55858b277cf6..61b1af88ac07 100644
--- a/arch/x86/kernel/asm-offsets.c
+++ b/arch/x86/kernel/asm-offsets.c
@@ -101,4 +101,5 @@ void common(void) {
 
 	/* Layout info for cpu_entry_area */
 	OFFSET(CPU_ENTRY_AREA_tss, cpu_entry_area, tss);
+	OFFSET(CPU_ENTRY_AREA_entry_trampoline, cpu_entry_area, entry_trampoline);
 }
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 7c82a8a8bfda..5a05db084659 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -507,6 +507,8 @@ DEFINE_PER_CPU(struct cpu_entry_area *, cpu_entry_area);
 static inline void setup_cpu_entry_area(int cpu)
 {
 #ifdef CONFIG_X86_64
+	extern char _entry_trampoline[];
+
 	/* On 64-bit systems, we use a read-only fixmap GDT. */
 	pgprot_t gdt_prot = PAGE_KERNEL_RO;
 #else
@@ -553,6 +555,11 @@ static inline void setup_cpu_entry_area(int cpu)
 #ifdef CONFIG_X86_32
 	this_cpu_write(cpu_entry_area, get_cpu_entry_area(cpu));
 #endif
+
+#ifdef CONFIG_X86_64
+	__set_fixmap(get_cpu_entry_area_index(cpu, entry_trampoline),
+		     __pa_symbol(_entry_trampoline), PAGE_KERNEL_RX);
+#endif
 }
 
 /* Load the original GDT from the per-cpu structure */
@@ -1417,10 +1424,13 @@ static DEFINE_PER_CPU_PAGE_ALIGNED(char, exception_stacks
 /* May not be marked __init: used by software suspend */
 void syscall_init(void)
 {
+	extern char _entry_trampoline[];
+	extern char entry_SYSCALL_64_trampoline[];
+
 	int cpu = smp_processor_id();
 
 	wrmsr(MSR_STAR, 0, (__USER32_CS << 16) | __KERNEL_CS);
-	wrmsrl(MSR_LSTAR, (unsigned long)entry_SYSCALL_64);
+	wrmsrl(MSR_LSTAR, (unsigned long)get_cpu_entry_area(cpu)->entry_trampoline + (entry_SYSCALL_64_trampoline - _entry_trampoline));
 
 #ifdef CONFIG_IA32_EMULATION
 	wrmsrl(MSR_CSTAR, (unsigned long)entry_SYSCALL_compat);
diff --git a/arch/x86/kernel/vmlinux.lds.S b/arch/x86/kernel/vmlinux.lds.S
index a4009fb9be87..2738cfb6c8c8 100644
--- a/arch/x86/kernel/vmlinux.lds.S
+++ b/arch/x86/kernel/vmlinux.lds.S
@@ -107,6 +107,16 @@ SECTIONS
 		SOFTIRQENTRY_TEXT
 		*(.fixup)
 		*(.gnu.warning)
+
+#ifdef CONFIG_X86_64
+		/* Entry trampoline */
+		. = ALIGN(PAGE_SIZE);
+		_entry_trampoline = .;
+		*(.entry_trampoline)
+		. = ALIGN(PAGE_SIZE);
+		ASSERT(. - _entry_trampoline == PAGE_SIZE, "entry trampoline is too big");
+#endif
+
 		/* End of text section */
 		_etext = .;
 	} :text = 0x9090
-- 
2.14.1

^ permalink raw reply	[flat|nested] 113+ messages in thread

end of thread, other threads:[~2017-11-27 21:15 UTC | newest]

Thread overview: 113+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-11-24 17:23 [PATCH 00/43] x86 entry-stack and Kaiser series, 2017/11/24, v2 version Ingo Molnar
2017-11-24 17:23 ` [PATCH 01/43] x86/decoder: Add new TEST instruction pattern Ingo Molnar
2017-11-24 17:23 ` [PATCH 02/43] x86/asm/64: Allocate and enable the SYSENTER stack Ingo Molnar
2017-11-24 17:23 ` [PATCH 03/43] x86/dumpstack: Add get_stack_info() support for " Ingo Molnar
2017-11-24 17:23 ` [PATCH 04/43] x86/gdt: Put per-cpu GDT remaps in ascending order Ingo Molnar
2017-11-24 17:23 ` [PATCH 05/43] x86/fixmap: Generalize the GDT fixmap mechanism Ingo Molnar
2017-11-24 17:23 ` [PATCH 06/43] x86/kasan/64: Teach KASAN about the cpu_entry_area Ingo Molnar
2017-11-24 17:23 ` [PATCH 07/43] x86/asm: Fix assumptions that the HW TSS is at the beginning of cpu_tss Ingo Molnar
2017-11-24 17:23 ` [PATCH 08/43] x86/dumpstack: Handle stack overflow on all stacks Ingo Molnar
2017-11-24 17:23 ` [PATCH 09/43] x86/asm: Move SYSENTER_stack to the beginning of struct tss_struct Ingo Molnar
2017-11-24 17:23 ` [PATCH 10/43] x86/asm: Remap the TSS into the cpu entry area Ingo Molnar
2017-11-24 17:23 ` [PATCH 11/43] x86/asm/64: Separate cpu_current_top_of_stack from TSS.sp0 Ingo Molnar
2017-11-24 17:23 ` [PATCH 12/43] x86/espfix/64: Stop assuming that pt_regs is on the entry stack Ingo Molnar
2017-11-24 18:25   ` Borislav Petkov
2017-11-24 19:12     ` Andy Lutomirski
2017-11-26 14:05       ` Ingo Molnar
2017-11-26 17:28         ` Borislav Petkov
2017-11-27  9:19           ` Ingo Molnar
2017-11-24 17:23 ` [PATCH 13/43] x86/asm/64: Use a percpu trampoline stack for IDT entries Ingo Molnar
2017-11-24 19:02   ` Borislav Petkov
2017-11-26 14:16     ` Ingo Molnar
2017-11-24 17:23 ` [PATCH 14/43] x86/asm/64: Return to userspace from the trampoline stack Ingo Molnar
2017-11-24 19:10   ` Borislav Petkov
2017-11-26 14:18     ` Ingo Molnar
2017-11-26 17:33       ` Borislav Petkov
2017-11-24 17:23 ` [PATCH 15/43] x86/entry/64: Create a percpu SYSCALL entry trampoline Ingo Molnar
2017-11-25 11:40   ` Borislav Petkov
2017-11-25 15:00     ` Andy Lutomirski
2017-11-26 14:26       ` [PATCH] " Ingo Molnar
2017-11-24 17:23 ` [PATCH 16/43] x86/irq: Remove an old outdated comment about context tracking races Ingo Molnar
2017-11-25 12:05   ` Borislav Petkov
2017-11-24 17:23 ` [PATCH 17/43] x86/irq/64: In the stack overflow warning, print the offending IP Ingo Molnar
2017-11-25 12:07   ` Borislav Petkov
2017-11-24 17:23 ` [PATCH 18/43] x86/entry/64: Move the IST stacks into cpu_entry_area Ingo Molnar
2017-11-25 12:34   ` Borislav Petkov
2017-11-24 17:23 ` [PATCH 19/43] x86/entry/64: Remove the SYSENTER stack canary Ingo Molnar
2017-11-25 15:29   ` Borislav Petkov
2017-11-24 17:23 ` [PATCH 20/43] x86/entry: Clean up SYSENTER_stack code Ingo Molnar
2017-11-25 16:39   ` Borislav Petkov
2017-11-25 16:50     ` Thomas Gleixner
2017-11-25 16:55       ` Andy Lutomirski
2017-11-25 17:03         ` Thomas Gleixner
2017-11-25 17:10           ` Borislav Petkov
2017-11-25 17:26             ` Andy Lutomirski
2017-11-27  9:27               ` Peter Zijlstra
2017-11-24 17:23 ` [PATCH 21/43] x86/mm/kaiser: Disable global pages by default with KAISER Ingo Molnar
2017-11-24 17:23 ` [PATCH 22/43] x86/mm/kaiser: Prepare assembly for entry/exit CR3 switching Ingo Molnar
2017-11-25  0:02   ` Thomas Gleixner
2017-11-25 12:41     ` Thomas Gleixner
2017-11-26 11:50   ` Borislav Petkov
2017-11-26 14:55     ` [PATCH v2] x86/mm/kaiser: Prepare the x86/entry assembly code " Ingo Molnar
2017-11-27 13:29       ` Josh Poimboeuf
2017-11-27 13:36         ` Thomas Gleixner
2017-11-24 17:23 ` [PATCH 23/43] x86/mm/kaiser: Introduce user-mapped per-cpu areas Ingo Molnar
2017-11-26 17:41   ` Borislav Petkov
2017-11-27  9:26     ` Ingo Molnar
2017-11-27 21:14     ` Dave Hansen
2017-11-24 17:23 ` [PATCH 24/43] x86/mm/kaiser: Mark per-cpu data structures required for entry/exit Ingo Molnar
2017-11-25 17:17   ` Thomas Gleixner
2017-11-26 15:54     ` Ingo Molnar
2017-11-24 17:23 ` [PATCH 25/43] x86/mm/kaiser: Unmap kernel from userspace page tables (core patch) Ingo Molnar
2017-11-26 18:51   ` Borislav Petkov
2017-11-27  9:30     ` Ingo Molnar
2017-11-26 20:49   ` Borislav Petkov
2017-11-27 10:38     ` Ingo Molnar
2017-11-26 22:25   ` [PATCH 25/43] x86/mm/kaiser: Unmap kernel from userspace page tables (core patch), noexec=off Borislav Petkov
2017-11-26 22:41     ` Thomas Gleixner
2017-11-24 17:23 ` [PATCH 26/43] x86/mm/kaiser: Allow NX poison to be set in p4d/pgd Ingo Molnar
2017-11-24 17:23 ` [PATCH 27/43] x86/mm/kaiser: Make sure static PGDs are 8k in size Ingo Molnar
2017-11-24 17:23 ` [PATCH 28/43] x86/mm/kaiser: Map cpu entry area Ingo Molnar
2017-11-25 21:40   ` Thomas Gleixner
2017-11-26 15:19     ` Ingo Molnar
2017-11-24 17:23 ` [PATCH 29/43] x86/mm/kaiser: Map dynamically-allocated LDTs Ingo Molnar
2017-11-24 17:23 ` [PATCH 30/43] x86/mm/kaiser: Map espfix structures Ingo Molnar
2017-11-24 17:23 ` [PATCH 31/43] x86/mm/kaiser: Map entry stack variable Ingo Molnar
2017-11-24 17:24 ` [PATCH 32/43] x86/mm/kaiser: Map virtually-addressed performance monitoring buffers Ingo Molnar
2017-11-24 17:24 ` [PATCH 33/43] x86/mm: Move CR3 construction functions Ingo Molnar
2017-11-24 17:24 ` [PATCH 34/43] x86/mm: Remove hard-coded ASID limit checks Ingo Molnar
2017-11-24 17:24 ` [PATCH 35/43] x86/mm: Put mmu-to-h/w ASID translation in one place Ingo Molnar
2017-11-24 17:24 ` [PATCH 36/43] x86/mm/kaiser: Allow flushing for future ASID switches Ingo Molnar
2017-11-24 17:24 ` [PATCH 37/43] x86/mm/kaiser: Use PCID feature to make user and kernel switches faster Ingo Molnar
2017-11-24 17:24 ` [PATCH 38/43] x86/mm/kaiser: Disable native VSYSCALL Ingo Molnar
2017-11-24 17:24 ` [PATCH 39/43] x86/mm/kaiser: Add debugfs file to turn KAISER on/off at runtime Ingo Molnar
2017-11-24 17:24 ` [PATCH 40/43] x86/mm/kaiser: Add a function to check for KAISER being enabled Ingo Molnar
2017-11-24 17:24 ` [PATCH 41/43] x86/mm/kaiser: Un-poison PGDs at runtime Ingo Molnar
2017-11-24 17:24 ` [PATCH 42/43] x86/mm/kaiser: Allow KAISER to be enabled/disabled " Ingo Molnar
2017-11-25 19:18   ` Thomas Gleixner
2017-11-25 19:53     ` Andy Lutomirski
2017-11-25 20:05       ` Thomas Gleixner
2017-11-25 22:10         ` Andy Lutomirski
2017-11-25 22:48           ` Thomas Gleixner
2017-11-26  0:21             ` Andy Lutomirski
2017-11-26  8:11               ` Thomas Gleixner
2017-11-24 17:24 ` [PATCH 43/43] x86/mm/kaiser: Add Kconfig Ingo Molnar
2017-11-24 20:22 ` [crash] PANIC: double fault, error_code: 0x0 Ingo Molnar
2017-11-24 20:59   ` Andy Lutomirski
2017-11-24 21:49     ` Ingo Molnar
2017-11-24 21:52       ` Ingo Molnar
2017-11-24 22:09   ` Ingo Molnar
2017-11-24 22:35     ` Andy Lutomirski
2017-11-24 22:53       ` Ingo Molnar
2017-11-25  9:21         ` Ingo Molnar
2017-11-25  9:32           ` Ingo Molnar
2017-11-25  9:39             ` Ingo Molnar
2017-11-25 11:17               ` [PATCH] x86/mm/kaiser: Fix IRQ entries text section mapping Ingo Molnar
2017-11-25 16:08                 ` Thomas Gleixner
2017-11-25 20:06                   ` Steven Rostedt
2017-11-27  8:14                   ` Peter Zijlstra
2017-11-27  8:21                     ` Peter Zijlstra
2017-11-25  4:09   ` [crash] PANIC: double fault, error_code: 0x0 Dave Hansen
2017-11-25  4:15     ` Dave Hansen
  -- strict thread matches above, loose matches on Subject: below --
2017-11-24  9:14 [PATCH 00/43] x86 entry-stack and Kaiser series, 2017/11/24 version Ingo Molnar
2017-11-24  9:14 ` [PATCH 15/43] x86/entry/64: Create a percpu SYSCALL entry trampoline Ingo Molnar
2017-11-24 13:52   ` Thomas Gleixner

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.