* Updated RD/WRFS/GSBASE patchkit
@ 2015-04-10 15:50 Andi Kleen
  2015-04-10 15:50 ` [PATCH 1/8] percpu: Add a DEFINE_PER_CPU_2PAGE_ALIGNED Andi Kleen
                   ` (7 more replies)
  0 siblings, 8 replies; 51+ messages in thread
From: Andi Kleen @ 2015-04-10 15:50 UTC (permalink / raw)
  To: x86; +Cc: luto, linux-kernel

This patchkit implements support for the RD/WR(FS|GS)BASE instructions
in recent Intel CPUs. It lets user programs use these instructions to
gain an additional address register and do fast context switches in
user space, and it also uses them to optimize some kernel paths
(context switch, paranoid interrupts).
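
As a rough user-space illustration (a sketch, not part of the patches;
the _writegsbase_u64 intrinsic name comes from the documentation added
in patch 7, and building it needs gcc -mfsgsbase plus an
FSGSBASE-enabled kernel):

	#include <immintrin.h>

	/* Use GS as a cheap extra address register; this faults with
	 * #UD if the kernel has not enabled FSGSBASE. */
	static void set_extra_base(void *p)
	{
		_writegsbase_u64((unsigned long long)p);
	}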

The main change in this version is a port to the recent entry_64.S changes.

-Andi


^ permalink raw reply	[flat|nested] 51+ messages in thread

* [PATCH 1/8] percpu: Add a DEFINE_PER_CPU_2PAGE_ALIGNED
  2015-04-10 15:50 Updated RD/WRFS/GSBASE patchkit Andi Kleen
@ 2015-04-10 15:50 ` Andi Kleen
  2015-04-10 15:50 ` [PATCH 2/8] x86: Naturally align the debug IST stack Andi Kleen
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 51+ messages in thread
From: Andi Kleen @ 2015-04-10 15:50 UTC (permalink / raw)
  To: x86; +Cc: luto, linux-kernel, Andi Kleen

From: Andi Kleen <ak@linux.intel.com>

Needed by a follow-on patch, which has to naturally align an 8K stack.
I just put it into the page-aligned section. This may cause an extra
4K hole if we're unlucky.
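
For illustration, the follow-on patch uses it roughly like this (a
sketch; the real declaration is DEFINE_PER_CPU_2PAGE_ALIGNED(char,
debug_stack[DEBUG_STKSZ]) in patch 2, where DEBUG_STKSZ is 8K):

	/* A per-CPU stack aligned to its own 8K size, so the stack
	 * bottom can be found by masking any pointer into it. */
	static DEFINE_PER_CPU_2PAGE_ALIGNED(char, debug_stack[2 * PAGE_SIZE]);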

Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 include/linux/percpu-defs.h | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/include/linux/percpu-defs.h b/include/linux/percpu-defs.h
index 57f3a1c..70616be 100644
--- a/include/linux/percpu-defs.h
+++ b/include/linux/percpu-defs.h
@@ -163,6 +163,10 @@
 	DEFINE_PER_CPU_SECTION(type, name, "..page_aligned")		\
 	__aligned(PAGE_SIZE)
 
+#define DEFINE_PER_CPU_2PAGE_ALIGNED(type, name)			\
+	DEFINE_PER_CPU_SECTION(type, name, "..page_aligned")		\
+	__aligned(2*PAGE_SIZE)
+
 /*
  * Declaration/definition used for per-CPU variables that must be read mostly.
  */
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH 2/8] x86: Naturally align the debug IST stack
  2015-04-10 15:50 Updated RD/WRFS/GSBASE patchkit Andi Kleen
  2015-04-10 15:50 ` [PATCH 1/8] percpu: Add a DEFINE_PER_CPU_2PAGE_ALIGNED Andi Kleen
@ 2015-04-10 15:50 ` Andi Kleen
  2015-04-10 18:50   ` Andy Lutomirski
  2015-04-10 15:50 ` [PATCH 3/8] x86: Add intrinsics/macros for new rd/wr fs/gs base instructions Andi Kleen
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 51+ messages in thread
From: Andi Kleen @ 2015-04-10 15:50 UTC (permalink / raw)
  To: x86; +Cc: luto, linux-kernel, Andi Kleen

From: Andi Kleen <ak@linux.intel.com>

The follow-on patch for RD*BASE requires the IST stacks to be naturally
aligned, because it uses an AND mask on RSP to access the bottom of the
stack. This was already true for all of them except the debug IST stack.
Make the debug IST stack naturally aligned too.
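
A C sketch of the masking trick the follow-on patch relies on (for
clarity only; the real code is the assembly in that patch):

	/* With a power-of-two stack size and the stack aligned to its
	 * own size, any RSP inside it masks down to the stack bottom. */
	static unsigned long stack_bottom(unsigned long rsp, unsigned long size)
	{
		return rsp & ~(size - 1);	/* e.g. size == DEBUG_STKSZ */
	}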

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 arch/x86/include/asm/page_64_types.h | 4 ++--
 arch/x86/kernel/cpu/common.c         | 8 +++++++-
 2 files changed, 9 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/page_64_types.h b/arch/x86/include/asm/page_64_types.h
index 4edd53b..b631f04 100644
--- a/arch/x86/include/asm/page_64_types.h
+++ b/arch/x86/include/asm/page_64_types.h
@@ -22,8 +22,8 @@
 
 #define DOUBLEFAULT_STACK 1
 #define NMI_STACK 2
-#define DEBUG_STACK 3
-#define MCE_STACK 4
+#define MCE_STACK 3
+#define DEBUG_STACK 4 /* must always be last */
 #define N_EXCEPTION_STACKS 4  /* hw limit: 7 */
 
 #define PUD_PAGE_SIZE		(_AC(1, UL) << PUD_SHIFT)
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 2346c95..33a0293 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -1158,8 +1158,12 @@ static const unsigned int exception_stack_sizes[N_EXCEPTION_STACKS] = {
 	  [DEBUG_STACK - 1]			= DEBUG_STKSZ
 };
 
+/* The IST stacks must be naturally aligned */
+
 static DEFINE_PER_CPU_PAGE_ALIGNED(char, exception_stacks
-	[(N_EXCEPTION_STACKS - 1) * EXCEPTION_STKSZ + DEBUG_STKSZ]);
+	[(N_EXCEPTION_STACKS - 1) * EXCEPTION_STKSZ]);
+static DEFINE_PER_CPU_2PAGE_ALIGNED(char, debug_stack
+	[DEBUG_STKSZ]);
 
 /* May not be marked __init: used by software suspend */
 void syscall_init(void)
@@ -1349,6 +1353,8 @@ void cpu_init(void)
 		char *estacks = per_cpu(exception_stacks, cpu);
 
 		for (v = 0; v < N_EXCEPTION_STACKS; v++) {
+			if (v == DEBUG_STACK - 1)
+				estacks = per_cpu(debug_stack, cpu);
 			estacks += exception_stack_sizes[v];
 			oist->ist[v] = t->x86_tss.ist[v] =
 					(unsigned long)estacks;
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH 3/8] x86: Add intrinsics/macros for new rd/wr fs/gs base instructions
  2015-04-10 15:50 Updated RD/WRFS/GSBASE patchkit Andi Kleen
  2015-04-10 15:50 ` [PATCH 1/8] percpu: Add a DEFINE_PER_CPU_2PAGE_ALIGNED Andi Kleen
  2015-04-10 15:50 ` [PATCH 2/8] x86: Naturally align the debug IST stack Andi Kleen
@ 2015-04-10 15:50 ` Andi Kleen
  2015-04-10 19:06   ` Andy Lutomirski
  2015-04-10 19:07   ` Andy Lutomirski
  2015-04-10 15:50 ` [PATCH 4/8] x86: Add support for rd/wr fs/gs base Andi Kleen
                   ` (4 subsequent siblings)
  7 siblings, 2 replies; 51+ messages in thread
From: Andi Kleen @ 2015-04-10 15:50 UTC (permalink / raw)
  To: x86; +Cc: luto, linux-kernel, Andi Kleen

From: Andi Kleen <ak@linux.intel.com>

Add C intrinsics and assembler macros for the new rd/wr fs/gs base
instructions and for swapgs.

Very straightforward. Used in a follow-on patch.

For the assembler, macros are only defined for the few fixed registers
used by entry_64.S.
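
A minimal C usage sketch (it mirrors how the follow-on patch reads the
user GS base; this must run with interrupts disabled, and only on CPUs
where CR4.FSGSBASE has been set):

	static unsigned long read_user_gs_base(void)
	{
		unsigned long gs;

		swapgs();	/* switch to the user GS base */
		gs = rdgsbase();
		swapgs();	/* back to the kernel GS base */
		return gs;
	}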

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 arch/x86/include/asm/fsgs.h | 54 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 54 insertions(+)
 create mode 100644 arch/x86/include/asm/fsgs.h

diff --git a/arch/x86/include/asm/fsgs.h b/arch/x86/include/asm/fsgs.h
new file mode 100644
index 0000000..1df5085
--- /dev/null
+++ b/arch/x86/include/asm/fsgs.h
@@ -0,0 +1,54 @@
+#ifndef _ASM_FSGS_H
+#define _ASM_FSGS_H 1
+
+#ifndef __ASSEMBLY__
+
+static inline void swapgs(void)
+{
+	asm volatile("swapgs" ::: "memory");
+}
+
+/* Must be protected by X86_FEATURE_FSGSBASE check. */
+
+static inline unsigned long rdgsbase(void)
+{
+	unsigned long gs;
+	asm volatile(".byte 0xf3,0x48,0x0f,0xae,0xc8 # rdgsbaseq %%rax"
+			: "=a" (gs)
+			:: "memory");
+	return gs;
+}
+
+static inline unsigned long rdfsbase(void)
+{
+	unsigned long fs;
+	asm volatile(".byte 0xf3,0x48,0x0f,0xae,0xc0 # rdfsbaseq %%rax"
+			: "=a" (fs)
+			:: "memory");
+	return fs;
+}
+
+static inline void wrgsbase(unsigned long gs)
+{
+	asm volatile(".byte 0xf3,0x48,0x0f,0xae,0xd8 # wrgsbaseq %%rax"
+			:: "a" (gs)
+			: "memory");
+}
+
+static inline void wrfsbase(unsigned long fs)
+{
+	asm volatile(".byte 0xf3,0x48,0x0f,0xae,0xd0 # wrfsbaseq %%rax"
+			:: "a" (fs)
+			: "memory");
+}
+
+#else
+
+/* Handle old assemblers. */
+#define RDGSBASE_R15 .byte 0xf3,0x49,0x0f,0xae,0xcf
+#define WRGSBASE_RDI .byte 0xf3,0x48,0x0f,0xae,0xdf
+#define WRGSBASE_R15 .byte 0xf3,0x49,0x0f,0xae,0xdf
+
+#endif /* __ASSEMBLY__ */
+
+#endif
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH 4/8] x86: Add support for rd/wr fs/gs base
  2015-04-10 15:50 Updated RD/WRFS/GSBASE patchkit Andi Kleen
                   ` (2 preceding siblings ...)
  2015-04-10 15:50 ` [PATCH 3/8] x86: Add intrinsics/macros for new rd/wr fs/gs base instructions Andi Kleen
@ 2015-04-10 15:50 ` Andi Kleen
  2015-04-10 19:12   ` Andy Lutomirski
  2015-04-10 15:50 ` [PATCH 5/8] x86: Make old K8 swapgs workaround conditional Andi Kleen
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 51+ messages in thread
From: Andi Kleen @ 2015-04-10 15:50 UTC (permalink / raw)
  To: x86; +Cc: luto, linux-kernel, Andi Kleen

From: Andi Kleen <ak@linux.intel.com>

Introduction:

IvyBridge added four new instructions to directly read and write the
fs and gs 64bit base registers. Previously this had to be done with a
system call that wrote to MSRs. The main use case is fast user space
threading, which needs to switch the fs/gs registers quickly. Another
use case is having (relatively) cheap access to an additional address
register per thread.

The instructions are opt-in and have to be explicitly enabled
by the OS.

For more details on how to use the instructions see
Documentation/x86/fsgs.txt, added in a follow-on patch.

Paranoid exception path changes:
===============================

The paranoid entry/exit code is used for any NMI-like
exception.

Previously Linux couldn't support the new instructions
because the paranoid entry code relied on the GS base never being
negative outside the kernel to decide when to use swapgs. It would
check the GS base MSR value and assume it was already running in
the kernel if negative.

To get rid of this assumption we have to revamp the paranoid exception
path to not rely on this. We can use the new instructions
to get (relatively) quick access to the GS value, and use
it directly.

This is also significantly faster than an MSR read, so it will speed
up NMIs (critical for profiling).

The kernel gs for the paranoid path is now stored at the
bottom of the IST stack (so that it can be derived from RSP).
For this we need to know the size of the IST stack
(4K or 8K), which is now passed in as a stack parameter
to save_paranoid.

The original patch compared the gs with the kernel gs and
assumed that if it was identical, swapgs was not needed
(and no user space processing was needed). This
was nice and simple and didn't need a lot of changes.

But it had the side effect that if a user process set its
GS to the same value as the kernel's, it could lose rescheduling
checks (a racing reschedule IPI would only have been acted upon
at the next non-paranoid interrupt).

This version now switches to full save/restore of the GS.
This requires quite some changes in the paranoid path.
Unfortunately I didn't come up with a simpler scheme.

Previously we had a flag in EBX that indicated whether
SWAPGS needs to be called later or not. In the new scheme
this turns into a tristate, with a new "restore from R15"
mode that is used when the FSGSBASE instructions are available.
In this case the GS base is saved and restored.
The exit paths are all adjusted to handle this correctly.

So the three possible states for the paranoid exit path are
(sketched in C below):

- Do nothing (pre FSGSBASE), if we interrupted the kernel,
  as determined by the existing negative GS
- Do swapgs, if we interrupted user space with positive GS
  (pre FSGSBASE), or if we saved GS but it was overwritten
  later by a context switch (with FSGSBASE)
- Restore from R15 (with FSGSBASE), if the GS base was saved
  in R15
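
A C-like sketch of that decision (hypothetical pseudocode; the real
code is the paranoid_exit assembly in this patch):

	switch (ebx) {
	case DO_NOTHING:	/* interrupted kernel, kernel GS still live */
		break;
	case DO_SWAPGS:		/* pre FSGSBASE: swap back to the user GS */
		swapgs();
		break;
	case DO_RESTORE_R15:	/* FSGSBASE: write back the saved base */
		wrgsbase(r15);
		break;
	}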

Context switch changes:
======================

After these changes we also need to use the new instructions to
save/restore fs and gs in the context switch, so that new values set
by user space won't disappear. This is also significantly faster when
the 64bit base actually has to be switched (that is, when the base is
above 4GB), as we can replace the slow MSR write with a faster
wr[fg]sbase execution.

This in turn enables fast switching when there are enough threads
that their TLS segments do not fit below 4GB (or on some newer
systems which don't properly hint the stack limit). Alternatively,
programs that use fs as an additional base register will not pay a
significant context switch penalty.

It is all done in a single patch because there was no
simple way to do it in pieces without leaving crash
holes in between.

v2: Change to save/restore GS instead of using swapgs
based on the value. Large scale changes.
v3: Fix wrong flag initialization in fallback path.
Thanks 0day!
v4: Make swapgs code paths kprobes safe.
Port to new base line code which now switches indexes.
v5: Port to new kernel which avoids paranoid entry for ring 3.
Removed some code that handled this previously.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 arch/x86/kernel/cpu/common.c |   6 +++
 arch/x86/kernel/entry_64.S   | 114 ++++++++++++++++++++++++++++++++++---------
 arch/x86/kernel/process_64.c |  42 ++++++++++++++--
 3 files changed, 134 insertions(+), 28 deletions(-)

diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 33a0293..7df88a3 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -957,6 +957,9 @@ static void identify_cpu(struct cpuinfo_x86 *c)
 #ifdef CONFIG_NUMA
 	numa_add_cpu(smp_processor_id());
 #endif
+
+	if (cpu_has(c, X86_FEATURE_FSGSBASE))
+		cr4_set_bits(X86_CR4_FSGSBASE);
 }
 
 #ifdef CONFIG_X86_64
@@ -1351,10 +1354,13 @@ void cpu_init(void)
 	 */
 	if (!oist->ist[0]) {
 		char *estacks = per_cpu(exception_stacks, cpu);
+		void *gs = per_cpu(irq_stack_union.gs_base, cpu);
 
 		for (v = 0; v < N_EXCEPTION_STACKS; v++) {
 			if (v == DEBUG_STACK - 1)
 				estacks = per_cpu(debug_stack, cpu);
+			/* Store GS at bottom of stack for bootstrap access */
+			*(void **)estacks = gs;
 			estacks += exception_stack_sizes[v];
 			oist->ist[v] = t->x86_tss.ist[v] =
 					(unsigned long)estacks;
diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index f0095a7..0b74ab0 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -58,6 +58,8 @@
 #include <asm/context_tracking.h>
 #include <asm/smap.h>
 #include <asm/pgtable_types.h>
+#include <asm/alternative-asm.h>
+#include <asm/fsgs.h>
 #include <linux/err.h>
 
 /* Avoid __ASSEMBLER__'ifying <linux/audit.h> just for this.  */
@@ -218,32 +220,74 @@ ENDPROC(native_usergs_sysret64)
 	CFI_REL_OFFSET r15, R15+\offset
 	.endm
 
+/* Values of the ebx flag: */
+#define DO_RESTORE_R15	 2 /* Restore GS at end */
+#define DO_SWAPGS	 1 /* Use SWAPGS at end */
+#define DO_NOTHING	 0 /* Back to ring 0 with same gs */
+
+/*
+ * Stack layout:
+ * +16 pt_regs
+ * +8  stack mask for ist or 0
+ * +0  return address.
+ */
+#define OFF 16
+#define STKMSK 8
+
 ENTRY(save_paranoid)
-	XCPT_FRAME 1 RDI+8
+	XCPT_FRAME 1 RDI+OFF
 	cld
-	movq %rdi, RDI+8(%rsp)
-	movq %rsi, RSI+8(%rsp)
-	movq_cfi rdx, RDX+8
-	movq_cfi rcx, RCX+8
-	movq_cfi rax, RAX+8
-	movq %r8, R8+8(%rsp)
-	movq %r9, R9+8(%rsp)
-	movq %r10, R10+8(%rsp)
-	movq %r11, R11+8(%rsp)
-	movq_cfi rbx, RBX+8
-	movq %rbp, RBP+8(%rsp)
-	movq %r12, R12+8(%rsp)
-	movq %r13, R13+8(%rsp)
-	movq %r14, R14+8(%rsp)
-	movq %r15, R15+8(%rsp)
-	movl $1,%ebx
+	movq %rdi, RDI+OFF(%rsp)
+	movq %rsi, RSI+OFF(%rsp)
+	movq_cfi rdx, RDX+OFF
+	movq_cfi rcx, RCX+OFF
+	movq_cfi rax, RAX+OFF
+	movq %r8, R8+OFF(%rsp)
+	movq %r9, R9+OFF(%rsp)
+	movq %r10, R10+OFF(%rsp)
+	movq %r11, R11+OFF(%rsp)
+	movq_cfi rbx, RBX+OFF
+	movq %rbp, RBP+OFF(%rsp)
+	movq %r12, R12+OFF(%rsp)
+	movq %r13, R13+OFF(%rsp)
+	movq %r14, R14+OFF(%rsp)
+	movq %r15, R15+OFF(%rsp)
+	movq $-1,ORIG_RAX+OFF(%rsp)	/* no syscall to restart */
+33:
+	ASM_NOP5	/* May be replaced with jump to paranoid_save_gs */
+34:
+	movl $DO_NOTHING,%ebx
 	movl $MSR_GS_BASE,%ecx
 	rdmsr
 	testl %edx,%edx
 	js 1f	/* negative -> in kernel */
 	SWAPGS
-	xorl %ebx,%ebx
+	movl $DO_SWAPGS,%ebx
 1:	ret
+
+	/* Patch in jump to paranoid_save_gs for X86_FEATURE_FSGSBASE */
+	.section .altinstr_replacement,"ax"
+35:	.byte 0xe9 /* 32bit near jump */
+	.long paranoid_save_gs-34b
+	.previous
+	.section .altinstructions,"a"
+	altinstruction_entry 33b,35b,X86_FEATURE_FSGSBASE,5,5
+	.previous
+
+	/* Faster version not using RDMSR, and also not assuming
+	 * anything about the previous GS value.
+	 * This allows the user to arbitrarily change GS using
+	 * WRGSBASE.
+	 */
+paranoid_save_gs:
+	RDGSBASE_R15			# read previous gs
+	movq STKMSK(%rsp),%rax		# get ist stack mask
+	andq %rsp,%rax			# get bottom of stack
+	movq (%rax),%rdi		# get expected GS
+	WRGSBASE_RDI			# set gs for kernel
+	mov $DO_RESTORE_R15,%ebx	# flag for exit code
+	ret
+
 	CFI_ENDPROC
 END(save_paranoid)
 
@@ -1026,7 +1070,8 @@ apicinterrupt IRQ_WORK_VECTOR \
  */
 #define INIT_TSS_IST(x) PER_CPU_VAR(init_tss) + (TSS_ist + ((x) - 1) * 8)
 
-.macro idtentry sym do_sym has_error_code:req paranoid=0 shift_ist=-1
+.macro idtentry sym do_sym has_error_code:req paranoid=0 shift_ist=-1 \
+		stack_mask=-EXCEPTION_STKSZ
 ENTRY(\sym)
 	/* Sanity check */
 	.if \shift_ist != -1 && \paranoid == 0
@@ -1055,7 +1100,10 @@ ENTRY(\sym)
 	testl $3, CS(%rsp)		/* If coming from userspace, switch */
 	jnz 1f				/* stacks. */
 	.endif
+	pushq_cfi $\stack_mask		/* ist stack size */
 	call save_paranoid
+	/* r15: previous gs */
+	popq_cfi %rax			/* Drop ist stack size */
 	.else
 	call error_entry
 	.endif
@@ -1090,7 +1138,7 @@ ENTRY(\sym)
 	.endif
 
 	.if \paranoid
-	jmp paranoid_exit		/* %ebx: no swapgs flag */
+	jmp paranoid_exit		/* %ebx: no swapgs flag, r15: old gs */
 	.else
 	jmp error_exit			/* %ebx: no swapgs flag */
 	.endif
@@ -1311,9 +1359,12 @@ apicinterrupt3 HYPERVISOR_CALLBACK_VECTOR \
 	hyperv_callback_vector hyperv_vector_handler
 #endif /* CONFIG_HYPERV */
 
-idtentry debug do_debug has_error_code=0 paranoid=1 shift_ist=DEBUG_STACK
-idtentry int3 do_int3 has_error_code=0 paranoid=1 shift_ist=DEBUG_STACK
+idtentry debug do_debug has_error_code=0 paranoid=1 shift_ist=DEBUG_STACK \
+	 stack_mask=-DEBUG_STKSZ
+idtentry int3 do_int3 has_error_code=0 paranoid=1 shift_ist=DEBUG_STACK \
+	 stack_mask=-DEBUG_STKSZ
 idtentry stack_segment do_stack_segment has_error_code=1
+
 #ifdef CONFIG_XEN
 idtentry xen_debug do_debug has_error_code=0
 idtentry xen_int3 do_int3 has_error_code=0
@@ -1339,17 +1390,25 @@ idtentry machine_check has_error_code=0 paranoid=1 do_sym=*machine_check_vector(
 	 * to try to handle preemption here.
 	 */
 
-	/* ebx:	no swapgs flag */
+	/* ebx:	no swapgs flag, r15: gs if ebx == DO_RESTORE_R15 */
 ENTRY(paranoid_exit)
 	DEFAULT_FRAME
 	DISABLE_INTERRUPTS(CLBR_NONE)
 	TRACE_IRQS_OFF_DEBUG
 	testl %ebx,%ebx				/* swapgs needed? */
 	jnz paranoid_restore
+	cmpl  $DO_NOTHING,%ebx		/* swapgs needed? */
+	je  paranoid_restore
 	TRACE_IRQS_IRETQ 0
+	cmpl  $DO_RESTORE_R15,%ebx	/* restore gs? */
+	je  paranoid_restore_gs
 	SWAPGS_UNSAFE_STACK
 	RESTORE_ALL 8
+paranoid_restore_gs:
+	WRGSBASE_R15
+	RESTORE_ALL 8
 	INTERRUPT_RETURN
+	jmp irq_return
 paranoid_restore:
 	TRACE_IRQS_IRETQ_DEBUG 0
 	RESTORE_ALL 8
@@ -1650,6 +1709,7 @@ end_repeat_nmi:
 	pushq_cfi $-1		/* ORIG_RAX: no syscall to restart */
 	subq $ORIG_RAX-R15, %rsp
 	CFI_ADJUST_CFA_OFFSET ORIG_RAX-R15
+	pushq_cfi $-EXCEPTION_STKSZ	/* ist stack size */
 	/*
 	 * Use save_paranoid to handle SWAPGS, but no need to use paranoid_exit
 	 * as we should not be calling schedule in NMI context.
@@ -1658,6 +1718,8 @@ end_repeat_nmi:
 	 * exceptions might do.
 	 */
 	call save_paranoid
+	/* r15: previous gs */
+	popq_cfi %rax		/* pop ist size */
 	DEFAULT_FRAME 0
 
 	/*
@@ -1682,11 +1744,15 @@ end_repeat_nmi:
 	je 1f
 	movq %r12, %cr2
 1:
-	
+	cmpl $DO_RESTORE_R15,%ebx
+	je nmi_gs_restore
 	testl %ebx,%ebx				/* swapgs needed? */
 	jnz nmi_restore
 nmi_swapgs:
 	SWAPGS_UNSAFE_STACK
+	jmp nmi_restore
+nmi_gs_restore:
+	WRGSBASE_R15				/* restore gs */
 nmi_restore:
 	/* Pop the extra iret frame at once */
 	RESTORE_ALL 6*8
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index 67fcc43..3019c51 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -49,6 +49,7 @@
 #include <asm/syscalls.h>
 #include <asm/debugreg.h>
 #include <asm/switch_to.h>
+#include <asm/fsgs.h>
 
 asmlinkage extern void ret_from_fork(void);
 
@@ -261,6 +262,27 @@ void start_thread_ia32(struct pt_regs *regs, u32 new_ip, u32 new_sp)
 }
 #endif
 
+/* Out of line to be protected from kprobes. */
+
+/* Interrupts are disabled here. */
+static noinline __kprobes void switch_gs(unsigned long gs)
+{
+	swapgs();
+	wrgsbase(gs);
+	swapgs();
+}
+
+/* Interrupts are disabled here. */
+static noinline __kprobes unsigned long read_user_gs(void)
+{
+	unsigned long gs;
+
+	swapgs();
+	gs = rdgsbase();
+	swapgs();
+	return gs;
+}
+
 /*
  *	switch_to(x,y) should switch tasks from x to y.
  *
@@ -293,6 +315,10 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
 	 */
 	savesegment(fs, fsindex);
 	savesegment(gs, gsindex);
+	if (static_cpu_has(X86_FEATURE_FSGSBASE)) {
+		prev->fs = rdfsbase();
+		prev->gs = read_user_gs();
+	}
 
 	/*
 	 * Load TLS before restoring any segments so that segment loads
@@ -381,8 +407,12 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
 		if (fsindex)
 			prev->fs = 0;
 	}
-	if (next->fs)
-		wrmsrl(MSR_FS_BASE, next->fs);
+	if (next->fs) {
+		if (static_cpu_has(X86_FEATURE_FSGSBASE))
+			wrfsbase(next->fs);
+		else
+			wrmsrl(MSR_FS_BASE, next->fs);
+	}
 	prev->fsindex = fsindex;
 
 	if (unlikely(gsindex | next->gsindex | prev->gs)) {
@@ -392,8 +422,12 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
 		if (gsindex)
 			prev->gs = 0;
 	}
-	if (next->gs)
-		wrmsrl(MSR_KERNEL_GS_BASE, next->gs);
+	if (next->gs) {
+		if (static_cpu_has(X86_FEATURE_FSGSBASE))
+			switch_gs(next->gs);
+		else
+			wrmsrl(MSR_KERNEL_GS_BASE, next->gs);
+	}
 	prev->gsindex = gsindex;
 
 	switch_fpu_finish(next_p, fpu);
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH 5/8] x86: Make old K8 swapgs workaround conditional
  2015-04-10 15:50 Updated RD/WRFS/GSBASE patchkit Andi Kleen
                   ` (3 preceding siblings ...)
  2015-04-10 15:50 ` [PATCH 4/8] x86: Add support for rd/wr fs/gs base Andi Kleen
@ 2015-04-10 15:50 ` Andi Kleen
  2015-04-10 21:46   ` Andy Lutomirski
  2015-04-10 22:01   ` Borislav Petkov
  2015-04-10 15:50 ` [PATCH 6/8] x86: Enumerate kernel FSGS capability in AT_HWCAP2 Andi Kleen
                   ` (2 subsequent siblings)
  7 siblings, 2 replies; 51+ messages in thread
From: Andi Kleen @ 2015-04-10 15:50 UTC (permalink / raw)
  To: x86; +Cc: luto, linux-kernel, Andi Kleen

From: Andi Kleen <ak@linux.intel.com>

Every GS selector/index reload always paid an extra MFENCE
between the two SWAPGS. This worked around a bug in early K8
steppings. No other CPUs need the extra mfence, so patch the
extra MFENCE in only for K8.
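
Conceptually the alternatives machinery then does something like the
following at boot (a hedged sketch with a hypothetical nop3_site
variable; the real mechanism is apply_alternatives() walking the
.altinstructions section):

	/* The site boots as a 3-byte NOP; only CPUs that set the bug
	 * bit get the MFENCE (0f ae f0) patched in over it. */
	if (boot_cpu_has_bug(X86_BUG_SWAPGS_MFENCE))
		text_poke(nop3_site, "\x0f\xae\xf0", 3);	/* mfence */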

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 arch/x86/include/asm/cpufeature.h |  1 +
 arch/x86/kernel/cpu/amd.c         |  3 +++
 arch/x86/kernel/entry_64.S        | 10 +++++++++-
 3 files changed, 13 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/cpufeature.h b/arch/x86/include/asm/cpufeature.h
index 90a5485..c695fad 100644
--- a/arch/x86/include/asm/cpufeature.h
+++ b/arch/x86/include/asm/cpufeature.h
@@ -255,6 +255,7 @@
 #define X86_BUG_11AP		X86_BUG(5) /* Bad local APIC aka 11AP */
 #define X86_BUG_FXSAVE_LEAK	X86_BUG(6) /* FXSAVE leaks FOP/FIP/FOP */
 #define X86_BUG_CLFLUSH_MONITOR	X86_BUG(7) /* AAI65, CLFLUSH required before MONITOR */
+#define X86_BUG_SWAPGS_MFENCE	X86_BUG(8) /* SWAPGS may need MFENCE */
 
 #if defined(__KERNEL__) && !defined(__ASSEMBLY__)
 
diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c
index a220239..e7f5667 100644
--- a/arch/x86/kernel/cpu/amd.c
+++ b/arch/x86/kernel/cpu/amd.c
@@ -551,6 +551,9 @@ static void init_amd_k8(struct cpuinfo_x86 *c)
 	if ((level >= 0x0f48 && level < 0x0f50) || level >= 0x0f58)
 		set_cpu_cap(c, X86_FEATURE_REP_GOOD);
 
+	/* Early steppings needed an mfence on swapgs. */
+	set_cpu_cap(c, X86_BUG_SWAPGS_MFENCE);
+
 	/*
 	 * Some BIOSes incorrectly force this feature, but only K8 revision D
 	 * (model = 0x14) and later actually support it.
diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index 0b74ab0..bb44292 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -1212,13 +1212,21 @@ ENTRY(native_load_gs_index)
 	SWAPGS
 gs_change:
 	movl %edi,%gs
-2:	mfence		/* workaround */
+2:	ASM_NOP3	/* may be replaced with mfence */
 	SWAPGS
 	popfq_cfi
 	ret
 	CFI_ENDPROC
 END(native_load_gs_index)
 
+	/* Early K8 systems needed an mfence after swapgs to work around a bug */
+	.section .altinstr_replacement,"ax"
+3:	mfence
+	.previous
+	.section .altinstructions,"a"
+	altinstruction_entry 2b,3b,X86_BUG_SWAPGS_MFENCE,3,3
+	.previous
+
 	_ASM_EXTABLE(gs_change,bad_gs)
 	.section .fixup,"ax"
 	/* running with kernelgs */
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH 6/8] x86: Enumerate kernel FSGS capability in AT_HWCAP2
  2015-04-10 15:50 Updated RD/WRFS/GSBASE patchkit Andi Kleen
                   ` (4 preceding siblings ...)
  2015-04-10 15:50 ` [PATCH 5/8] x86: Make old K8 swapgs workaround conditional Andi Kleen
@ 2015-04-10 15:50 ` Andi Kleen
  2015-04-10 19:15   ` Andy Lutomirski
  2015-04-10 22:14   ` Borislav Petkov
  2015-04-10 15:50 ` [PATCH 7/8] x86: Add documentation for rd/wr fs/gs base Andi Kleen
  2015-04-10 15:50 ` [PATCH 8/8] x86: Use rd/wr fs/gs base in arch_prctl Andi Kleen
  7 siblings, 2 replies; 51+ messages in thread
From: Andi Kleen @ 2015-04-10 15:50 UTC (permalink / raw)
  To: x86; +Cc: luto, linux-kernel, Andi Kleen

From: Andi Kleen <ak@linux.intel.com>

The kernel needs to explicitly enable RD/WRFSBASE to handle context
switches correctly. So the application needs to know if it can safely
use these instructions. Just looking at the CPUID bit is not enough,
because it may be running on a kernel that does not enable the
instructions.

One way for the application would be to just try the instruction and
catch the SIGILL. But that is difficult to do in libraries, which may
not want to overwrite the signal handlers of the main application.

So we need to provide a way for the application to discover the kernel
capability.

I used AT_HWCAP2 in the ELF aux vector, which is already used by
PPC for similar things. We define a new Linux-defined bitmap
returned in AT_HWCAP2. Currently it has only one bit set, indicating
that the kernel is FSGSBASE capable.

The application can then access it manually, or with the
getauxval() function in newer glibc.
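
A user-space detection sketch (getauxval() exists in glibc 2.16 and
later, and AT_HWCAP2 must be defined by the libc headers;
HWCAP2_FSGSBASE comes from the new uapi header in this patch):

	#include <sys/auxv.h>

	#define HWCAP2_FSGSBASE	(1 << 0)	/* from asm/hwcap.h */

	int kernel_has_fsgsbase(void)
	{
		return (getauxval(AT_HWCAP2) & HWCAP2_FSGSBASE) != 0;
	}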

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 arch/x86/include/asm/elf.h        | 7 +++++++
 arch/x86/include/uapi/asm/hwcap.h | 7 +++++++
 arch/x86/kernel/cpu/common.c      | 7 ++++++-
 3 files changed, 20 insertions(+), 1 deletion(-)
 create mode 100644 arch/x86/include/uapi/asm/hwcap.h

diff --git a/arch/x86/include/asm/elf.h b/arch/x86/include/asm/elf.h
index ca3347a..47dac31 100644
--- a/arch/x86/include/asm/elf.h
+++ b/arch/x86/include/asm/elf.h
@@ -257,6 +257,13 @@ extern int force_personality32;
 
 #define ELF_HWCAP		(boot_cpu_data.x86_capability[0])
 
+extern unsigned kernel_enabled_features;
+
+/* HWCAP2 supplies kernel-enabled CPU features, so that the application
+   can know that it can safely use them. The bits are defined in
+   uapi/asm/hwcap.h. */
+#define ELF_HWCAP2		kernel_enabled_features
+
 /* This yields a string that ld.so will use to load implementation
    specific libraries for optimization.  This is more specific in
    intent than poking at uname or /proc/cpuinfo.
diff --git a/arch/x86/include/uapi/asm/hwcap.h b/arch/x86/include/uapi/asm/hwcap.h
new file mode 100644
index 0000000..d9c54f8
--- /dev/null
+++ b/arch/x86/include/uapi/asm/hwcap.h
@@ -0,0 +1,7 @@
+#ifndef _ASM_HWCAP_H
+#define _ASM_HWCAP_H 1
+
+#define HWCAP2_FSGSBASE	(1 << 0) 	/* Kernel enabled RD/WR FS/GS BASE */
+/* up to bit 31 free */
+
+#endif
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 7df88a3..81186cb 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -34,6 +34,7 @@
 #include <asm/i387.h>
 #include <asm/fpu-internal.h>
 #include <asm/mtrr.h>
+#include <asm/hwcap.h>
 #include <linux/numa.h>
 #include <asm/asm.h>
 #include <asm/cpu.h>
@@ -49,6 +50,8 @@
 
 #include "cpu.h"
 
+unsigned kernel_enabled_features __read_mostly;
+
 /* all of these masks are initialized in setup_cpu_local_masks() */
 cpumask_var_t cpu_initialized_mask;
 cpumask_var_t cpu_callout_mask;
@@ -958,8 +961,10 @@ static void identify_cpu(struct cpuinfo_x86 *c)
 	numa_add_cpu(smp_processor_id());
 #endif
 
-	if (cpu_has(c, X86_FEATURE_FSGSBASE))
+	if (cpu_has(c, X86_FEATURE_FSGSBASE)) {
+		kernel_enabled_features |= HWCAP2_FSGSBASE;
 		cr4_set_bits(X86_CR4_FSGSBASE);
+	}
 }
 
 #ifdef CONFIG_X86_64
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH 7/8] x86: Add documentation for rd/wr fs/gs base
  2015-04-10 15:50 Updated RD/WRFS/GSBASE patchkit Andi Kleen
                   ` (5 preceding siblings ...)
  2015-04-10 15:50 ` [PATCH 6/8] x86: Enumerate kernel FSGS capability in AT_HWCAP2 Andi Kleen
@ 2015-04-10 15:50 ` Andi Kleen
  2015-04-10 19:17   ` Andy Lutomirski
  2015-04-10 15:50 ` [PATCH 8/8] x86: Use rd/wr fs/gs base in arch_prctl Andi Kleen
  7 siblings, 1 reply; 51+ messages in thread
From: Andi Kleen @ 2015-04-10 15:50 UTC (permalink / raw)
  To: x86; +Cc: luto, linux-kernel, Andi Kleen

From: Andi Kleen <ak@linux.intel.com>

v2: Minor updates to documentation requested in review.
Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 Documentation/x86/fsgs.txt | 76 ++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 76 insertions(+)
 create mode 100644 Documentation/x86/fsgs.txt

diff --git a/Documentation/x86/fsgs.txt b/Documentation/x86/fsgs.txt
new file mode 100644
index 0000000..26a1e29
--- /dev/null
+++ b/Documentation/x86/fsgs.txt
@@ -0,0 +1,76 @@
+
+Using FS and GS prefixes on x86_64-linux
+
+The x86 architecture supports segment prefixes per instruction to add an
+offset to an address.  On 64bit x86, these are mostly nops, except for FS
+and GS.
+
+This offers an efficient way to reference a global pointer.
+
+The compiler has to generate special code to use these base registers,
+or they can be accessed with inline assembler:
+
+	mov %gs:offset,%reg
+	mov %fs:offset,%reg
+
+FS is used to address thread local storage (TLS), declared using
+__thread.  The compiler then automatically generates the correct prefixes and
+relocations to access these values.
+
+FS is normally managed by the runtime code or the threading library.
+
+GS is freely available, but may need special (compiler or inline assembler)
+code to use.
+
+Traditionally 64bit FS and GS could be set by the arch_prctl system call:
+
+	arch_prctl(ARCH_SET_GS, value)
+	arch_prctl(ARCH_SET_FS, value)
+
+[There was also an older method using modify_ldt(), inherited from 32bit,
+but this is not discussed here.]
+
+However, using a syscall is problematic for user space threading
+libraries that want to context switch in user space, since their
+whole point is avoiding the overhead of a syscall. It is also cleaner
+for compilers wanting to use the extra register to have instructions
+to write it, or to read it directly to compute addresses and offsets.
+
+Newer Intel CPUs (Ivy Bridge and later) added new instructions to directly
+access these registers quickly from user context:
+
+	RDFSBASE %reg	read the FS base	(or _readfsbase_u64)
+	RDGSBASE %reg	read the GS base	(or _readgsbase_u64)
+
+	WRFSBASE %reg	write the FS base	(or _writefsbase_u64)
+	WRGSBASE %reg	write the GS base	(or _writegsbase_u64)
+
+The instructions are supported by the CPU when the "fsgsbase" string is shown in
+/proc/cpuinfo (or directly retrieved through the CPUID instruction).
+The instructions are only available to 64bit binaries.
+
+However, the kernel needs to explicitly enable these instructions, as
+it may otherwise not context switch the state correctly. Newer Linux
+kernels enable this. When the kernel has not enabled the instructions,
+they will fault with a #UD exception.
+
+An FSGSBASE-enabled kernel can be detected by checking the AT_HWCAP2
+bitmask in the aux vector. When the HWCAP2_FSGSBASE bit is set, the
+kernel supports the FSGSBASE instructions.
+
+	#include <sys/auxv.h>
+	#include <elf.h>
+
+	/* Will eventually be in asm/hwcap.h */
+	#define HWCAP2_FSGSBASE        (1 << 0)
+
+	unsigned long val = getauxval(AT_HWCAP2);
+	if (val & HWCAP2_FSGSBASE) {
+		asm("wrgsbase %0" :: "r" (ptr));
+	}
+
+Another requirement is that the FS or GS selector has to be zero
+(this is normally true unless it is changed explicitly).
+
+
+Andi Kleen
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH 8/8] x86: Use rd/wr fs/gs base in arch_prctl
  2015-04-10 15:50 Updated RD/WRFS/GSBASE patchkit Andi Kleen
                   ` (6 preceding siblings ...)
  2015-04-10 15:50 ` [PATCH 7/8] x86: Add documentation for rd/wr fs/gs base Andi Kleen
@ 2015-04-10 15:50 ` Andi Kleen
  2015-04-10 19:19   ` Andy Lutomirski
  7 siblings, 1 reply; 51+ messages in thread
From: Andi Kleen @ 2015-04-10 15:50 UTC (permalink / raw)
  To: x86; +Cc: luto, linux-kernel, Andi Kleen

From: Andi Kleen <ak@linux.intel.com>

Convert arch_prctl to use the new instructions to
change fs/gs if available, instead of using MSRs.

This is merely a small performance optimization,
no new functionality.

With the new instructions the syscall is really obsolete,
as everything can be set directly in ring 3. But the syscall
is widely used by existing software, so we still support it.
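
For comparison (a sketch mirroring the fsgs.txt documentation from
patch 7; ptr is a placeholder):

	/* Old: set the GS base via the syscall. */
	arch_prctl(ARCH_SET_GS, (unsigned long)ptr);

	/* New: on an FSGSBASE kernel, set it directly from ring 3. */
	asm volatile("wrgsbase %0" :: "r" (ptr));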

The syscall still enforces that the addresses are not
in kernel space, even though that is no longer needed.
This is mainly so that programs written for new CPUs
do not suddenly fail on old CPUs.

When the new instructions are available, the context switch
prefers them over the old "use GDT segment rewrite" trick.

v2: Make kprobes safe
Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 arch/x86/kernel/process_64.c | 48 ++++++++++++++++++++++++++++++++++++--------
 1 file changed, 40 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index 3019c51..1fe4d79 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -534,20 +534,38 @@ unsigned long get_wchan(struct task_struct *p)
 	return 0;
 }
 
+static noinline __kprobes void reload_user_gs(unsigned long addr)
+{
+	local_irq_disable();
+	swapgs();
+	loadsegment(gs, 0);
+	wrgsbase(addr);
+	swapgs();
+	local_irq_enable();
+}
+
 long do_arch_prctl(struct task_struct *task, int code, unsigned long addr)
 {
 	int ret = 0;
 	int doit = task == current;
 	int cpu;
+	int fast_seg = boot_cpu_has(X86_FEATURE_FSGSBASE);
 
 	switch (code) {
 	case ARCH_SET_GS:
+		/*
+		 * With fast_seg we don't need that check anymore,
+		 * but keep it so that programs do not suddenly
+		 * start failing when run on older CPUs.
+		 * If you really want to set an address in kernel space
+		 * use WRGSBASE directly.
+		 */
 		if (addr >= TASK_SIZE_OF(task))
 			return -EPERM;
 		cpu = get_cpu();
 		/* handle small bases via the GDT because that's faster to
 		   switch. */
-		if (addr <= 0xffffffff) {
+		if (addr <= 0xffffffff && !fast_seg) {
 			set_32bit_tls(task, GS_TLS, addr);
 			if (doit) {
 				load_TLS(&task->thread, cpu);
@@ -559,8 +577,12 @@ long do_arch_prctl(struct task_struct *task, int code, unsigned long addr)
 			task->thread.gsindex = 0;
 			task->thread.gs = addr;
 			if (doit) {
-				load_gs_index(0);
-				ret = wrmsrl_safe(MSR_KERNEL_GS_BASE, addr);
+				if (fast_seg) {
+					reload_user_gs(addr);
+				} else {
+					load_gs_index(0);
+					ret = wrmsrl_safe(MSR_KERNEL_GS_BASE, addr);
+				}
 			}
 		}
 		put_cpu();
@@ -573,7 +595,7 @@ long do_arch_prctl(struct task_struct *task, int code, unsigned long addr)
 		cpu = get_cpu();
 		/* handle small bases via the GDT because that's faster to
 		   switch. */
-		if (addr <= 0xffffffff) {
+		if (addr <= 0xffffffff && !fast_seg) {
 			set_32bit_tls(task, FS_TLS, addr);
 			if (doit) {
 				load_TLS(&task->thread, cpu);
@@ -588,7 +610,10 @@ long do_arch_prctl(struct task_struct *task, int code, unsigned long addr)
 				/* set the selector to 0 to not confuse
 				   __switch_to */
 				loadsegment(fs, 0);
-				ret = wrmsrl_safe(MSR_FS_BASE, addr);
+				if (fast_seg)
+					wrfsbase(addr);
+				else
+					ret = wrmsrl_safe(MSR_FS_BASE, addr);
 			}
 		}
 		put_cpu();
@@ -597,6 +622,8 @@ long do_arch_prctl(struct task_struct *task, int code, unsigned long addr)
 		unsigned long base;
 		if (task->thread.fsindex == FS_TLS_SEL)
 			base = read_32bit_tls(task, FS_TLS);
+		else if (doit && fast_seg)
+			base = rdfsbase();
 		else if (doit)
 			rdmsrl(MSR_FS_BASE, base);
 		else
@@ -611,9 +638,14 @@ long do_arch_prctl(struct task_struct *task, int code, unsigned long addr)
 			base = read_32bit_tls(task, GS_TLS);
 		else if (doit) {
 			savesegment(gs, gsindex);
-			if (gsindex)
-				rdmsrl(MSR_KERNEL_GS_BASE, base);
-			else
+			if (gsindex) {
+				if (fast_seg) {
+					local_irq_disable();
+					base = read_user_gs();
+					local_irq_enable();
+				} else
+					rdmsrl(MSR_KERNEL_GS_BASE, base);
+			} else
 				base = task->thread.gs;
 		} else
 			base = task->thread.gs;
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* Re: [PATCH 2/8] x86: Naturally align the debug IST stack
  2015-04-10 15:50 ` [PATCH 2/8] x86: Naturally align the debug IST stack Andi Kleen
@ 2015-04-10 18:50   ` Andy Lutomirski
  0 siblings, 0 replies; 51+ messages in thread
From: Andy Lutomirski @ 2015-04-10 18:50 UTC (permalink / raw)
  To: Andi Kleen; +Cc: X86 ML, Andrew Lutomirski, linux-kernel, Andi Kleen

On Fri, Apr 10, 2015 at 8:50 AM, Andi Kleen <andi@firstfloor.org> wrote:
> From: Andi Kleen <ak@linux.intel.com>
>
> The follow-on patch for RD*BASE requires the IST stacks to be naturally
> aligned, because it uses an AND mask on RSP to access the bottom of the
> stack. This was already true for all of them except the debug IST stack.
> Make the debug IST stack naturally aligned too.
>
> Signed-off-by: Andi Kleen <ak@linux.intel.com>
> ---
>  arch/x86/include/asm/page_64_types.h | 4 ++--
>  arch/x86/kernel/cpu/common.c         | 8 +++++++-
>  2 files changed, 9 insertions(+), 3 deletions(-)
>
> diff --git a/arch/x86/include/asm/page_64_types.h b/arch/x86/include/asm/page_64_types.h
> index 4edd53b..b631f04 100644
> --- a/arch/x86/include/asm/page_64_types.h
> +++ b/arch/x86/include/asm/page_64_types.h
> @@ -22,8 +22,8 @@
>
>  #define DOUBLEFAULT_STACK 1
>  #define NMI_STACK 2
> -#define DEBUG_STACK 3
> -#define MCE_STACK 4
> +#define MCE_STACK 3
> +#define DEBUG_STACK 4 /* must always be last */
>  #define N_EXCEPTION_STACKS 4  /* hw limit: 7 */
>
>  #define PUD_PAGE_SIZE          (_AC(1, UL) << PUD_SHIFT)
> diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
> index 2346c95..33a0293 100644
> --- a/arch/x86/kernel/cpu/common.c
> +++ b/arch/x86/kernel/cpu/common.c
> @@ -1158,8 +1158,12 @@ static const unsigned int exception_stack_sizes[N_EXCEPTION_STACKS] = {
>           [DEBUG_STACK - 1]                     = DEBUG_STKSZ
>  };
>
> +/* The IST stacks must be naturally aligned */
> +
>  static DEFINE_PER_CPU_PAGE_ALIGNED(char, exception_stacks
> -       [(N_EXCEPTION_STACKS - 1) * EXCEPTION_STKSZ + DEBUG_STKSZ]);
> +       [(N_EXCEPTION_STACKS - 1) * EXCEPTION_STKSZ]);
> +static DEFINE_PER_CPU_2PAGE_ALIGNED(char, debug_stack
> +       [DEBUG_STKSZ]);

IMO this needs some kind of assertion that DEBUG_STKSZ == 2 *
PAGE_SIZE.  Or you could instead pass the required alignment to the
macro if you add a parameter.

--Andy

>
>  /* May not be marked __init: used by software suspend */
>  void syscall_init(void)
> @@ -1349,6 +1353,8 @@ void cpu_init(void)
>                 char *estacks = per_cpu(exception_stacks, cpu);
>
>                 for (v = 0; v < N_EXCEPTION_STACKS; v++) {
> +                       if (v == DEBUG_STACK - 1)
> +                               estacks = per_cpu(debug_stack, cpu);
>                         estacks += exception_stack_sizes[v];
>                         oist->ist[v] = t->x86_tss.ist[v] =
>                                         (unsigned long)estacks;
> --
> 1.9.3
>



-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 3/8] x86: Add intrinsics/macros for new rd/wr fs/gs base instructions
  2015-04-10 15:50 ` [PATCH 3/8] x86: Add intrinsics/macros for new rd/wr fs/gs base instructions Andi Kleen
@ 2015-04-10 19:06   ` Andy Lutomirski
  2015-04-10 19:07   ` Andy Lutomirski
  1 sibling, 0 replies; 51+ messages in thread
From: Andy Lutomirski @ 2015-04-10 19:06 UTC (permalink / raw)
  To: Andi Kleen; +Cc: X86 ML, Andrew Lutomirski, linux-kernel, Andi Kleen

On Fri, Apr 10, 2015 at 8:50 AM, Andi Kleen <andi@firstfloor.org> wrote:
> From: Andi Kleen <ak@linux.intel.com>
>
> Add C intrinsics and assembler macros for the new rd/wr fs/gs base
> instructions and for swapgs.
>
> Very straightforward. Used in a follow-on patch.
>
> For the assembler, macros are only defined for the few fixed registers
> used by entry_64.S.
>
> Signed-off-by: Andi Kleen <ak@linux.intel.com>
> ---
>  arch/x86/include/asm/fsgs.h | 54 +++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 54 insertions(+)
>  create mode 100644 arch/x86/include/asm/fsgs.h
>
> diff --git a/arch/x86/include/asm/fsgs.h b/arch/x86/include/asm/fsgs.h
> new file mode 100644
> index 0000000..1df5085
> --- /dev/null
> +++ b/arch/x86/include/asm/fsgs.h
> @@ -0,0 +1,54 @@
> +#ifndef _ASM_FSGS_H
> +#define _ASM_FSGS_H 1
> +
> +#ifndef __ASSEMBLY__
> +
> +static inline void swapgs(void)
> +{
> +       asm volatile("swapgs" ::: "memory");
> +}
> +
> +/* Must be protected by X86_FEATURE_FSGSBASE check. */
> +
> +static inline unsigned long rdgsbase(void)
> +{
> +       unsigned long gs;
> +       asm volatile(".byte 0xf3,0x48,0x0f,0xae,0xc8 # rdgsbaseq %%rax"
> +                       : "=a" (gs)
> +                       :: "memory");
> +       return gs;
> +}
> +
> +static inline unsigned long rdfsbase(void)
> +{
> +       unsigned long fs;
> +       asm volatile(".byte 0xf3,0x48,0x0f,0xae,0xc0 # rdfsbaseq %%rax"
> +                       : "=a" (fs)
> +                       :: "memory");
> +       return fs;
> +}
> +
> +static inline void wrgsbase(unsigned long gs)
> +{
> +       asm volatile(".byte 0xf3,0x48,0x0f,0xae,0xd8 # wrgsbaseq %%rax"
> +                       :: "a" (gs)
> +                       : "memory");
> +}
> +
> +static inline void wrfsbase(unsigned long fs)
> +{
> +       asm volatile(".byte 0xf3,0x48,0x0f,0xae,0xd0 # wrfsbaseq %%rax"
> +                       :: "a" (fs)
> +                       : "memory");
> +}

I think several of these should either be always_inline or have
tracing and kprobes disabled.

--Andy

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 3/8] x86: Add intrinsics/macros for new rd/wr fs/gs base instructions
  2015-04-10 15:50 ` [PATCH 3/8] x86: Add intrinsics/macros for new rd/wr fs/gs base instructions Andi Kleen
  2015-04-10 19:06   ` Andy Lutomirski
@ 2015-04-10 19:07   ` Andy Lutomirski
  1 sibling, 0 replies; 51+ messages in thread
From: Andy Lutomirski @ 2015-04-10 19:07 UTC (permalink / raw)
  To: Andi Kleen
  Cc: X86 ML, Andrew Lutomirski, linux-kernel, Andi Kleen, Borislav Petkov

On Fri, Apr 10, 2015 at 8:50 AM, Andi Kleen <andi@firstfloor.org> wrote:
> From: Andi Kleen <ak@linux.intel.com>
>
> Add C intrinsics and assembler macros for the new rd/wr fs/gs base
> instructions and for swapgs.
>
> Very straightforward. Used in a follow-on patch.
>
> For the assembler, macros are only defined for the few fixed registers
> used by entry_64.S.
>
> Signed-off-by: Andi Kleen <ak@linux.intel.com>
> ---
>  arch/x86/include/asm/fsgs.h | 54 +++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 54 insertions(+)
>  create mode 100644 arch/x86/include/asm/fsgs.h
>
> diff --git a/arch/x86/include/asm/fsgs.h b/arch/x86/include/asm/fsgs.h
> new file mode 100644
> index 0000000..1df5085
> --- /dev/null
> +++ b/arch/x86/include/asm/fsgs.h
> @@ -0,0 +1,54 @@
> +#ifndef _ASM_FSGS_H
> +#define _ASM_FSGS_H 1
> +
> +#ifndef __ASSEMBLY__
> +
> +static inline void swapgs(void)
> +{
> +       asm volatile("swapgs" ::: "memory");
> +}
> +
> +/* Must be protected by X86_FEATURE_FSGSBASE check. */
> +
> +static inline unsigned long rdgsbase(void)
> +{
> +       unsigned long gs;
> +       asm volatile(".byte 0xf3,0x48,0x0f,0xae,0xc8 # rdgsbaseq %%rax"
> +                       : "=a" (gs)
> +                       :: "memory");
> +       return gs;
> +}
> +
> +static inline unsigned long rdfsbase(void)
> +{
> +       unsigned long fs;
> +       asm volatile(".byte 0xf3,0x48,0x0f,0xae,0xc0 # rdfsbaseq %%rax"
> +                       : "=a" (fs)
> +                       :: "memory");
> +       return fs;
> +}
> +
> +static inline void wrgsbase(unsigned long gs)
> +{
> +       asm volatile(".byte 0xf3,0x48,0x0f,0xae,0xd8 # wrgsbaseq %%rax"
> +                       :: "a" (gs)
> +                       : "memory");
> +}
> +
> +static inline void wrfsbase(unsigned long fs)
> +{
> +       asm volatile(".byte 0xf3,0x48,0x0f,0xae,0xd0 # wrfsbaseq %%rax"
> +                       :: "a" (fs)
> +                       : "memory");
> +}

I think these should be always_inline.  If they somehow fail to get
inlined and then get kprobed or traced, bad things will happen.

--Andy

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 4/8] x86: Add support for rd/wr fs/gs base
  2015-04-10 15:50 ` [PATCH 4/8] x86: Add support for rd/wr fs/gs base Andi Kleen
@ 2015-04-10 19:12   ` Andy Lutomirski
  2015-04-10 20:21     ` Andi Kleen
  0 siblings, 1 reply; 51+ messages in thread
From: Andy Lutomirski @ 2015-04-10 19:12 UTC (permalink / raw)
  To: Andi Kleen
  Cc: X86 ML, Andrew Lutomirski, linux-kernel, Andi Kleen,
	Steven Rostedt, Borislav Petkov

On Fri, Apr 10, 2015 at 8:50 AM, Andi Kleen <andi@firstfloor.org> wrote:
> From: Andi Kleen <ak@linux.intel.com>
>
> Introduction:
>
> IvyBridge added four new instructions to directly read and write the
> fs and gs 64bit base registers. Previously this had to be done with a
> system call that wrote to MSRs. The main use case is fast user space
> threading, which needs to switch the fs/gs registers quickly. Another
> use case is having (relatively) cheap access to an additional address
> register per thread.
>
> The instructions are opt-in and have to be explicitly enabled
> by the OS.
>
> For more details on how to use the instructions see
> Documentation/x86/fsgs.txt, added in a follow-on patch.
>
> Paranoid exception path changes:
> ===============================
>
> The paranoid entry/exit code is used for any NMI-like
> exception.
>
> Previously Linux couldn't support the new instructions
> because the paranoid entry code relied on the GS base never being
> negative outside the kernel to decide when to use swapgs. It would
> check the GS base MSR value and assume it was already running in
> the kernel if negative.
>
> To get rid of this assumption we have to revamp the paranoid exception
> path to not rely on this. We can use the new instructions
> to get (relatively) quick access to the GS value, and use
> it directly.
>
> This is also significantly faster than an MSR read, so it will speed
> up NMIs (critical for profiling).
>
> The kernel gs for the paranoid path is now stored at the
> bottom of the IST stack (so that it can be derived from RSP).
> For this we need to know the size of the IST stack
> (4K or 8K), which is now passed in as a stack parameter
> to save_paranoid.
>
> The original patch compared the gs with the kernel gs and
> assumed that if it was identical, swapgs was not needed
> (and no user space processing was needed). This
> was nice and simple and didn't need a lot of changes.
>
> But it had the side effect that if a user process set its
> GS to the same value as the kernel's, it could lose rescheduling
> checks (a racing reschedule IPI would only have been acted upon
> at the next non-paranoid interrupt).
>
> This version now switches to full save/restore of the GS.
> This requires quite some changes in the paranoid path.
> Unfortunately I didn't come up with a simpler scheme.
>
> Previously we had a flag in EBX that indicated whether
> SWAPGS needs to be called later or not. In the new scheme
> this turns into a tristate, with a new "restore from R15"
> mode that is used when the FSGSBASE instructions are available.
> In this case the GS base is saved and restored.
> The exit paths are all adjusted to handle this correctly.
>
> So the three possible states for the paranoid exit path are:
>
> - Do nothing (pre FSGSBASE), if we interrupted the kernel,
>   as determined by the existing negative GS
> - Do swapgs, if we interrupted user space with positive GS
>   (pre FSGSBASE), or if we saved GS but it was overwritten
>   later by a context switch (with FSGSBASE)

We never run paranoid_exit if we interrupted userspace, and we can't
context switch on the IST stack, so I don't see how this is possible.

> - Restore from R15 (with FSGSBASE), if the gs base was saved
> in R15

What about case 4: we interrupted the kernel with usergs?  (The code
seems more correct in this regard, but this description above is
confusing to me.)

>
> Context switch changes:
> ======================
>
> After these changes we also need to use the new instructions to
> save/restore fs and gs in the context switch, so that new values set
> by user space won't disappear. This is also significantly faster when
> the 64bit base actually has to be switched (that is, when the base is
> above 4GB), as we can replace the slow MSR write with a faster
> wr[fg]sbase execution.
>
> This in turn enables fast switching when there are enough threads
> that their TLS segments do not fit below 4GB (or on some newer
> systems which don't properly hint the stack limit). Alternatively,
> programs that use fs as an additional base register will not pay a
> significant context switch penalty.
>
> It is all done in a single patch because there was no
> simple way to do it in pieces without leaving crash
> holes in between.
>
> v2: Change to save/restore GS instead of using swapgs
> based on the value. Large scale changes.
> v3: Fix wrong flag initialization in fallback path.
> Thanks 0day!
> v4: Make swapgs code paths kprobes safe.
> Port to new base line code which now switches indexes.
> v5: Port to new kernel which avoids paranoid entry for ring 3.
> Removed some code that handled this previously.
> Signed-off-by: Andi Kleen <ak@linux.intel.com>
> ---
>  arch/x86/kernel/cpu/common.c |   6 +++
>  arch/x86/kernel/entry_64.S   | 114 ++++++++++++++++++++++++++++++++++---------
>  arch/x86/kernel/process_64.c |  42 ++++++++++++++--
>  3 files changed, 134 insertions(+), 28 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
> index 33a0293..7df88a3 100644
> --- a/arch/x86/kernel/cpu/common.c
> +++ b/arch/x86/kernel/cpu/common.c
> @@ -957,6 +957,9 @@ static void identify_cpu(struct cpuinfo_x86 *c)
>  #ifdef CONFIG_NUMA
>         numa_add_cpu(smp_processor_id());
>  #endif
> +
> +       if (cpu_has(c, X86_FEATURE_FSGSBASE))
> +               cr4_set_bits(X86_CR4_FSGSBASE);
>  }
>
>  #ifdef CONFIG_X86_64
> @@ -1351,10 +1354,13 @@ void cpu_init(void)
>          */
>         if (!oist->ist[0]) {
>                 char *estacks = per_cpu(exception_stacks, cpu);
> +               void *gs = per_cpu(irq_stack_union.gs_base, cpu);
>
>                 for (v = 0; v < N_EXCEPTION_STACKS; v++) {
>                         if (v == DEBUG_STACK - 1)
>                                 estacks = per_cpu(debug_stack, cpu);
> +                       /* Store GS at bottom of stack for bootstrap access */
> +                       *(void **)estacks = gs;
>                         estacks += exception_stack_sizes[v];
>                         oist->ist[v] = t->x86_tss.ist[v] =
>                                         (unsigned long)estacks;

Seems reasonable to me.

You could possibly simplify some things if you wrote the address to
the bottom of *each* debug stack.  Then you wouldn't need the extra
alignment stuff.

> diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
> index f0095a7..0b74ab0 100644
> --- a/arch/x86/kernel/entry_64.S
> +++ b/arch/x86/kernel/entry_64.S
> @@ -58,6 +58,8 @@
>  #include <asm/context_tracking.h>
>  #include <asm/smap.h>
>  #include <asm/pgtable_types.h>
> +#include <asm/alternative-asm.h>
> +#include <asm/fsgs.h>
>  #include <linux/err.h>
>
>  /* Avoid __ASSEMBLER__'ifying <linux/audit.h> just for this.  */
> @@ -218,32 +220,74 @@ ENDPROC(native_usergs_sysret64)
>         CFI_REL_OFFSET r15, R15+\offset
>         .endm
>
> +/* Values of the ebx flag: */
> +#define DO_RESTORE_R15  2 /* Restore GS at end */
> +#define DO_SWAPGS       1 /* Use SWAPGS at end */
> +#define DO_NOTHING      0 /* Back to ring 0 with same gs */
> +
> +/*
> + * Stack layout:
> + * +16 pt_regs
> + * +8  stack mask for ist or 0

What does that mean?

Oh, I get it.  It's the size of the IST stack we're on.  Let's please
make all IST stacks the same size as suggested above and get rid of
this.  After all, they really are all the same size -- there's just
more than one debug stack.

> + * +0  return address.
> + */
> +#define OFF 16
> +#define STKMSK 8
> +
>  ENTRY(save_paranoid)
> -       XCPT_FRAME 1 RDI+8
> +       XCPT_FRAME 1 RDI+OFF
>         cld
> -       movq %rdi, RDI+8(%rsp)
> -       movq %rsi, RSI+8(%rsp)
> -       movq_cfi rdx, RDX+8
> -       movq_cfi rcx, RCX+8
> -       movq_cfi rax, RAX+8
> -       movq %r8, R8+8(%rsp)
> -       movq %r9, R9+8(%rsp)
> -       movq %r10, R10+8(%rsp)
> -       movq %r11, R11+8(%rsp)
> -       movq_cfi rbx, RBX+8
> -       movq %rbp, RBP+8(%rsp)
> -       movq %r12, R12+8(%rsp)
> -       movq %r13, R13+8(%rsp)
> -       movq %r14, R14+8(%rsp)
> -       movq %r15, R15+8(%rsp)
> -       movl $1,%ebx
> +       movq %rdi, RDI+OFF(%rsp)
> +       movq %rsi, RSI+OFF(%rsp)
> +       movq_cfi rdx, RDX+OFF
> +       movq_cfi rcx, RCX+OFF
> +       movq_cfi rax, RAX+OFF
> +       movq %r8, R8+OFF(%rsp)
> +       movq %r9, R9+OFF(%rsp)
> +       movq %r10, R10+OFF(%rsp)
> +       movq %r11, R11+OFF(%rsp)
> +       movq_cfi rbx, RBX+OFF
> +       movq %rbp, RBP+OFF(%rsp)
> +       movq %r12, R12+OFF(%rsp)
> +       movq %r13, R13+OFF(%rsp)
> +       movq %r14, R14+OFF(%rsp)
> +       movq %r15, R15+OFF(%rsp)
> +       movq $-1,ORIG_RAX+OFF(%rsp)     /* no syscall to restart */
> +33:
> +       ASM_NOP5        /* May be replaced with jump to paranoid_save_gs */

Is there some reason that the normal ALTERNATIVE macro can't be used here?

> +34:
> +       movl $DO_NOTHING,%ebx
>         movl $MSR_GS_BASE,%ecx
>         rdmsr
>         testl %edx,%edx
>         js 1f   /* negative -> in kernel */
>         SWAPGS
> -       xorl %ebx,%ebx
> +       movl $DO_SWAPGS,%ebx
>  1:     ret
> +
> +       /* Patch in jump to paranoid_save_gs for X86_FEATURE_FSGSBASE */
> +       .section .altinstr_replacement,"ax"
> +35:    .byte 0xe9 /* 32bit near jump */
> +       .long paranoid_save_gs-34b
> +       .previous
> +       .section .altinstructions,"a"
> +       altinstruction_entry 33b,35b,X86_FEATURE_FSGSBASE,5,5
> +       .previous
> +
> +       /* Faster version not using RDMSR, and also not assuming
> +        * anything about the previous GS value.
> +        * This allows the user to arbitrarily change GS using
> +        * WRGSBASE.
> +        */
> +paranoid_save_gs:
> +       RDGSBASE_R15                    # read previous gs
> +       movq STKMSK(%rsp),%rax          # get ist stack mask
> +       andq %rsp,%rax                  # get bottom of stack
> +       movq (%rax),%rdi                # get expected GS
> +       WRGSBASE_RDI                    # set gs for kernel
> +       mov $DO_RESTORE_R15,%ebx        # flag for exit code

I think this code isn't needed.  There's a CPU feature that tells us
that we're in this case, so why do we need a marker in ebx?

> +       ret
> +
>         CFI_ENDPROC
>  END(save_paranoid)
>
> @@ -1026,7 +1070,8 @@ apicinterrupt IRQ_WORK_VECTOR \
>   */
>  #define INIT_TSS_IST(x) PER_CPU_VAR(init_tss) + (TSS_ist + ((x) - 1) * 8)
>
> -.macro idtentry sym do_sym has_error_code:req paranoid=0 shift_ist=-1
> +.macro idtentry sym do_sym has_error_code:req paranoid=0 shift_ist=-1 \
> +               stack_mask=-EXCEPTION_STKSZ

This can be removed as well, I think.

>  ENTRY(\sym)
>         /* Sanity check */
>         .if \shift_ist != -1 && \paranoid == 0
> @@ -1055,7 +1100,10 @@ ENTRY(\sym)
>         testl $3, CS(%rsp)              /* If coming from userspace, switch */
>         jnz 1f                          /* stacks. */
>         .endif
> +       pushq_cfi $\stack_mask          /* ist stack size */
>         call save_paranoid
> +       /* r15: previous gs */
> +       popq_cfi %rax                   /* Drop ist stack size */

Ditto.

>         .else
>         call error_entry
>         .endif
> @@ -1090,7 +1138,7 @@ ENTRY(\sym)
>         .endif
>
>         .if \paranoid
> -       jmp paranoid_exit               /* %ebx: no swapgs flag */
> +       jmp paranoid_exit               /* %ebx: no swapgs flag, r15: old gs */
>         .else
>         jmp error_exit                  /* %ebx: no swapgs flag */
>         .endif
> @@ -1311,9 +1359,12 @@ apicinterrupt3 HYPERVISOR_CALLBACK_VECTOR \
>         hyperv_callback_vector hyperv_vector_handler
>  #endif /* CONFIG_HYPERV */
>
> -idtentry debug do_debug has_error_code=0 paranoid=1 shift_ist=DEBUG_STACK
> -idtentry int3 do_int3 has_error_code=0 paranoid=1 shift_ist=DEBUG_STACK
> +idtentry debug do_debug has_error_code=0 paranoid=1 shift_ist=DEBUG_STACK \
> +        stack_mask=-DEBUG_STKSZ
> +idtentry int3 do_int3 has_error_code=0 paranoid=1 shift_ist=DEBUG_STACK \
> +        stack_mask=-DEBUG_STKSZ
>  idtentry stack_segment do_stack_segment has_error_code=1
> +

Ditto.

>  #ifdef CONFIG_XEN
>  idtentry xen_debug do_debug has_error_code=0
>  idtentry xen_int3 do_int3 has_error_code=0
> @@ -1339,17 +1390,25 @@ idtentry machine_check has_error_code=0 paranoid=1 do_sym=*machine_check_vector(
>          * to try to handle preemption here.
>          */
>
> -       /* ebx: no swapgs flag */
> +       /* ebx: no swapgs flag, r15: gs if ebx == DO_RESTORE_R15 */
>  ENTRY(paranoid_exit)
>         DEFAULT_FRAME
>         DISABLE_INTERRUPTS(CLBR_NONE)
>         TRACE_IRQS_OFF_DEBUG
>         testl %ebx,%ebx                         /* swapgs needed? */
>         jnz paranoid_restore
> +       cmpl  $DO_NOTHING,%ebx          /* swapgs needed? */
> +       je  paranoid_restore

Something's wrong here, I think.  DO_NOTHING == 0, so this is just
duplicate code.

>         TRACE_IRQS_IRETQ 0
> +       cmpl  $DO_RESTORE_R15,%ebx      /* restore gs? */
> +       je  paranoid_restore_gs

Can't we use alternatives here?

>         SWAPGS_UNSAFE_STACK
>         RESTORE_ALL 8
> +paranoid_restore_gs:
> +       WRGSBASE_R15

I'd rather jump back a couple instructions and avoid lots of
duplication below, especially since...

> +       RESTORE_ALL 8
>         INTERRUPT_RETURN
> +       jmp irq_return

... this jmp irq_return is dead code.

>  paranoid_restore:
>         TRACE_IRQS_IRETQ_DEBUG 0
>         RESTORE_ALL 8
> @@ -1650,6 +1709,7 @@ end_repeat_nmi:
>         pushq_cfi $-1           /* ORIG_RAX: no syscall to restart */
>         subq $ORIG_RAX-R15, %rsp
>         CFI_ADJUST_CFA_OFFSET ORIG_RAX-R15
> +       pushq_cfi $-EXCEPTION_STKSZ     /* ist stack size */

This needs Steven's blessing, I think.  This stuff is already very twisted.

>         /*
>          * Use save_paranoid to handle SWAPGS, but no need to use paranoid_exit
>          * as we should not be calling schedule in NMI context.
> @@ -1658,6 +1718,8 @@ end_repeat_nmi:
>          * exceptions might do.
>          */
>         call save_paranoid
> +       /* r15: previous gs */
> +       popq_cfi %rax           /* pop ist size */
>         DEFAULT_FRAME 0
>
>         /*
> @@ -1682,11 +1744,15 @@ end_repeat_nmi:
>         je 1f
>         movq %r12, %cr2
>  1:
> -
> +       cmpl $DO_RESTORE_R15,%ebx
> +       je nmi_gs_restore
>         testl %ebx,%ebx                         /* swapgs needed? */
>         jnz nmi_restore
>  nmi_swapgs:
>         SWAPGS_UNSAFE_STACK
> +       jmp nmi_restore
> +nmi_gs_restore:
> +       WRGSBASE_R15                            /* restore gs */
>  nmi_restore:
>         /* Pop the extra iret frame at once */
>         RESTORE_ALL 6*8
> diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
> index 67fcc43..3019c51 100644
> --- a/arch/x86/kernel/process_64.c
> +++ b/arch/x86/kernel/process_64.c
> @@ -49,6 +49,7 @@
>  #include <asm/syscalls.h>
>  #include <asm/debugreg.h>
>  #include <asm/switch_to.h>
> +#include <asm/fsgs.h>
>
>  asmlinkage extern void ret_from_fork(void);
>
> @@ -261,6 +262,27 @@ void start_thread_ia32(struct pt_regs *regs, u32 new_ip, u32 new_sp)
>  }
>  #endif
>
> +/* Out of line to be protected from kprobes. */
> +
> +/* Interrupts are disabled here. */
> +static noinline __kprobes void switch_gs(unsigned long gs)
> +{
> +       swapgs();
> +       wrgsbase(gs);
> +       swapgs();
> +}

NOKPROBE_SYMBOL is preferred these days, I think.  Also, let's call
this something more helpful like write_user_gsbase.

> +
> +/* Interrupts are disabled here. */
> +static noinline __kprobes unsigned long read_user_gs(void)
> +{
> +       unsigned long gs;
> +
> +       swapgs();
> +       gs = rdgsbase();
> +       swapgs();
> +       return gs;
> +}

How about read_user_gsbase?

> +
>  /*
>   *     switch_to(x,y) should switch tasks from x to y.
>   *
> @@ -293,6 +315,10 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
>          */
>         savesegment(fs, fsindex);
>         savesegment(gs, gsindex);
> +       if (static_cpu_has(X86_FEATURE_FSGSBASE)) {
> +               prev->fs = rdfsbase();
> +               prev->gs = read_user_gs();
> +       }
>
>         /*
>          * Load TLS before restoring any segments so that segment loads
> @@ -381,8 +407,12 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
>                 if (fsindex)
>                         prev->fs = 0;

Unless I misunderstood the docs, the assumption that fsindex != 0
implies no fsbase override isn't really accurate any more.  (It never
was really accurate, but it used to be a lot closer to the truth.)

>         }
> -       if (next->fs)
> -               wrmsrl(MSR_FS_BASE, next->fs);
> +       if (next->fs) {
> +               if (static_cpu_has(X86_FEATURE_FSGSBASE))
> +                       wrfsbase(next->fs);
> +               else
> +                       wrmsrl(MSR_FS_BASE, next->fs);
> +       }
>         prev->fsindex = fsindex;
>
>         if (unlikely(gsindex | next->gsindex | prev->gs)) {
> @@ -392,8 +422,12 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
>                 if (gsindex)
>                         prev->gs = 0;
>         }
> -       if (next->gs)
> -               wrmsrl(MSR_KERNEL_GS_BASE, next->gs);
> +       if (next->gs) {
> +               if (static_cpu_has(X86_FEATURE_FSGSBASE))
> +                       switch_gs(next->gs);
> +               else
> +                       wrmsrl(MSR_KERNEL_GS_BASE, next->gs);
> +       }
>         prev->gsindex = gsindex;
>
>         switch_fpu_finish(next_p, fpu);
> --
> 1.9.3
>

--Andy

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 6/8] x86: Enumerate kernel FSGS capability in AT_HWCAP2
  2015-04-10 15:50 ` [PATCH 6/8] x86: Enumerate kernel FSGS capability in AT_HWCAP2 Andi Kleen
@ 2015-04-10 19:15   ` Andy Lutomirski
  2015-04-10 22:14   ` Borislav Petkov
  1 sibling, 0 replies; 51+ messages in thread
From: Andy Lutomirski @ 2015-04-10 19:15 UTC (permalink / raw)
  To: Andi Kleen; +Cc: X86 ML, Andrew Lutomirski, linux-kernel, Andi Kleen

On Fri, Apr 10, 2015 at 8:50 AM, Andi Kleen <andi@firstfloor.org> wrote:
> From: Andi Kleen <ak@linux.intel.com>
>
> The kernel needs to explicitly enable RD/WRFSBASE to handle context
> switch correctly. So the application needs to know if it can safely use
> these instructions. Just looking at the CPUID bit is not enough, because
> the application may be running on a kernel that does not enable them.
>
> One way for the application would be to just try and catch the SIGILL.
> But that is difficult to do in libraries which may not want
> to overwrite the signal handlers of the main application.
>
> So we need to provide a way for the application to discover the kernel
> capability.
>
> I used AT_HWCAP2 in the ELF aux vector, which is already used by
> PPC for similar things. We define a new Linux-defined bitmap
> returned in AT_HWCAP2.  Currently it has only one bit set,
> indicating that the kernel is FSGSBASE capable.
>
> The application can then access it manually or using
> the getauxval() function in newer glibc.
>
> Signed-off-by: Andi Kleen <ak@linux.intel.com>
> ---
>  arch/x86/include/asm/elf.h        | 7 +++++++
>  arch/x86/include/uapi/asm/hwcap.h | 7 +++++++
>  arch/x86/kernel/cpu/common.c      | 7 ++++++-
>  3 files changed, 20 insertions(+), 1 deletion(-)
>  create mode 100644 arch/x86/include/uapi/asm/hwcap.h
>
> diff --git a/arch/x86/include/asm/elf.h b/arch/x86/include/asm/elf.h
> index ca3347a..47dac31 100644
> --- a/arch/x86/include/asm/elf.h
> +++ b/arch/x86/include/asm/elf.h
> @@ -257,6 +257,13 @@ extern int force_personality32;
>
>  #define ELF_HWCAP              (boot_cpu_data.x86_capability[0])
>
> +extern unsigned kernel_enabled_features;
> +
> +/* HWCAP2 supplies kernel enabled CPU features, so that the application
> +   can know that it can safely use them. The bits are defined in
> +   uapi/asm/hwcap.h. */
> +#define ELF_HWCAP2             kernel_enabled_features

Could we name this something more like elf_hwcap2 instead of
kernel_enabled_features?

--Andy

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 7/8] x86: Add documentation for rd/wr fs/gs base
  2015-04-10 15:50 ` [PATCH 7/8] x86: Add documentation for rd/wr fs/gs base Andi Kleen
@ 2015-04-10 19:17   ` Andy Lutomirski
  2015-04-10 20:22     ` Andi Kleen
  0 siblings, 1 reply; 51+ messages in thread
From: Andy Lutomirski @ 2015-04-10 19:17 UTC (permalink / raw)
  To: Andi Kleen
  Cc: X86 ML, Andrew Lutomirski, linux-kernel, Andi Kleen, Borislav Petkov

On Fri, Apr 10, 2015 at 8:50 AM, Andi Kleen <andi@firstfloor.org> wrote:
> From: Andi Kleen <ak@linux.intel.com>
>
> v2: Minor updates to documentation requested in review.
> Signed-off-by: Andi Kleen <ak@linux.intel.com>
> ---
>  Documentation/x86/fsgs.txt | 76 ++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 76 insertions(+)
>  create mode 100644 Documentation/x86/fsgs.txt
>
> diff --git a/Documentation/x86/fsgs.txt b/Documentation/x86/fsgs.txt
> new file mode 100644
> index 0000000..26a1e29
> --- /dev/null
> +++ b/Documentation/x86/fsgs.txt
> @@ -0,0 +1,76 @@
> +
> +Using FS and GS prefixes on x86_64-linux
> +
> +The x86 architecture supports per-instruction segment prefixes that add an
> +offset to an address.  On 64bit x86, these prefixes are mostly no-ops, except
> +for FS and GS.
> +
> +This offers an efficient way to reference a global pointer.
> +
> +The compiler has to generate special code to use these base registers,
> +or they can be accessed with inline assembler.
> +
> +       mov %gs:offset,%reg
> +       mov %fs:offset,%reg
> +
> +FS is used to address thread local storage (TLS), declared using
> +__thread.  The compiler then automatically generates the correct prefixes and
> +relocations to access these values.
> +
> +FS is normally managed by the runtime code or the threading library.
> +
> +GS is freely available, but may need special (compiler or inline assembler)
> +code to use.
> +
> +Traditionally 64bit FS and GS could be set by the arch_prctl system call
> +
> +       arch_prctl(ARCH_SET_GS, value)
> +       arch_prctl(ARCH_SET_FS, value)
> +
> +[There was also an older method using modify_ldt(), inherited from 32bit,
> +but this is not discussed here.]
> +
> +However using a syscall is problematic for user space threading libraries
> +that want to context switch in user space. The whole point of them
> +is avoiding the overhead of a syscall. It is also cleaner for compilers
> +that want to use the extra register to be able to write it with plain
> +instructions, or to read it directly to compute addresses and offsets.
> +
> +Newer Intel CPUs (Ivy Bridge and later) added new instructions to directly
> +access these registers quickly from user context
> +
> +       RDFSBASE %reg   read the FS base        (or _readfsbase_u64)
> +       RDGSBASE %reg   read the GS base        (or _readgsbase_u64)
> +
> +       WRFSBASE %reg   write the FS base       (or _writefsbase_u64)
> +       WRGSBASE %reg   write the GS base       (or _writegsbase_u64)
> +
> +The instructions are supported by the CPU when the "fsgsbase" string is shown
> +in /proc/cpuinfo (or can be queried directly through the CPUID instruction).
> +The instructions are only available to 64bit binaries.
> +
> +However the kernel needs to explicitly enable these instructions, as it
> +may otherwise not correctly context switch the state. Newer Linux
> +kernels enable this. When the kernel does not enable the instructions,
> +they fault with a #UD exception.
> +
> +An FSGSBASE enabled kernel can be detected by checking the AT_HWCAP2
> +bitmask in the aux vector. When the HWCAP2_FSGSBASE bit is set, the
> +kernel supports the FSGSBASE instructions.
> +
> +       #include <sys/auxv.h>
> +       #include <elf.h>
> +
> +       /* Will eventually be in asm/hwcap.h */
> +       #define HWCAP2_FSGSBASE        (1 << 0)
> +
> +        unsigned val = getauxval(AT_HWCAP2);
> +        if (val & HWCAP2_FSGSBASE) {
> +                asm("wrgsbase %0" :: "r" (ptr));
> +        }
> +
> +Another requirement is that the FS or GS selector has to be zero
> +(this is normally true unless it has been changed explicitly).

IMO this is worthy of explanation.

I think that your __switch_to is buggy and that's why the selectors
need to be zero.  Is that the only issue?  If so, let's fix the bug
instead.

--Andy

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 8/8] x86: Use rd/wr fs/gs base in arch_prctl
  2015-04-10 15:50 ` [PATCH 8/8] x86: Use rd/wr fs/gs base in arch_prctl Andi Kleen
@ 2015-04-10 19:19   ` Andy Lutomirski
  2015-04-10 19:58     ` Andi Kleen
  0 siblings, 1 reply; 51+ messages in thread
From: Andy Lutomirski @ 2015-04-10 19:19 UTC (permalink / raw)
  To: Andi Kleen
  Cc: X86 ML, Andrew Lutomirski, linux-kernel, Andi Kleen, Borislav Petkov

On Fri, Apr 10, 2015 at 8:50 AM, Andi Kleen <andi@firstfloor.org> wrote:
> From: Andi Kleen <ak@linux.intel.com>
>
> Convert arch_prctl to use the new instructions to
> change fs/gs if available, instead of using MSRs.
>
> This is merely a small performance optimization,
> no new functionality.
>
> With the new instructions the syscall is really obsolete,
> as everything can be set directly in ring 3. But the syscall
> is widely used by existing software, so we still support it.

It's also necessary on my poor obsolete Sandy Bridge machines :)

>
> The syscall still enforces that the addresses are not
> in kernel space, even though that is no longer needed.
> This is mainly so that the programs written for new CPUs
> do not suddenly fail on old CPUs.
>
> With the new instructions available, the context switch code
> prefers to use them instead of the old
> "use GDT segment rewrite" trick.
>
> v2: Make kprobes safe
> Signed-off-by: Andi Kleen <ak@linux.intel.com>
> ---
>  arch/x86/kernel/process_64.c | 48 ++++++++++++++++++++++++++++++++++++--------
>  1 file changed, 40 insertions(+), 8 deletions(-)
>
> diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
> index 3019c51..1fe4d79 100644
> --- a/arch/x86/kernel/process_64.c
> +++ b/arch/x86/kernel/process_64.c
> @@ -534,20 +534,38 @@ unsigned long get_wchan(struct task_struct *p)
>         return 0;
>  }
>
> +static noinline __kprobes void reload_user_gs(unsigned long addr)
> +{
> +       local_irq_disable();
> +       swapgs();
> +       loadsegment(gs, 0);
> +       wrgsbase(addr);
> +       swapgs();
> +       local_irq_enable();
> +}

These names are terrifying.  How about
write_user_gsbase_and_set_gs_to_zero?  (Yes, it's a mouthful.)  Or
write_user_gs_and_gsbase(unsigned short gs, unsigned long gsbase),
which we'll need if we make __switch_to fully correct.

Also, at some point we should fix the fsindex vs fs crap in
thread_struct.  They should be fsindex and fsbase.

--Andy

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 8/8] x86: Use rd/wr fs/gs base in arch_prctl
  2015-04-10 19:19   ` Andy Lutomirski
@ 2015-04-10 19:58     ` Andi Kleen
  0 siblings, 0 replies; 51+ messages in thread
From: Andi Kleen @ 2015-04-10 19:58 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andi Kleen, X86 ML, Andrew Lutomirski, linux-kernel, Borislav Petkov

> These names are terrifying.  How about
> write_user_gsbase_and_set_gs_to_zero?  (Yes, it's a mouthful.)  Or
> write_user_gs_and_gsbase(unsigned short gs, unsigned long gsbase),

Sorry, we're not programming Java here.

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 4/8] x86: Add support for rd/wr fs/gs base
  2015-04-10 19:12   ` Andy Lutomirski
@ 2015-04-10 20:21     ` Andi Kleen
  2015-04-10 20:25       ` Borislav Petkov
  2015-04-10 20:34       ` Andy Lutomirski
  0 siblings, 2 replies; 51+ messages in thread
From: Andi Kleen @ 2015-04-10 20:21 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andi Kleen, X86 ML, Andrew Lutomirski, linux-kernel,
	Steven Rostedt, Borislav Petkov

> We never run paranoid_exit if we interrupted userspace, and we can't
> context switch on the IST stack, so I don't see how this is possible.
> 
> > - Restore from R15 (with FSGSBASE), if the gs base was saved
> > in R15
> 
> What about case 4: we interrupted the kernel with usergs?  (The code
> seems more correct in this regard, but this description above is
> confusing to me.)

I'll fix the description.

> >                                 estacks = per_cpu(debug_stack, cpu);
> > +                       /* Store GS at bottom of stack for bootstrap access */
> > +                       *(void **)estacks = gs;
> >                         estacks += exception_stack_sizes[v];
> >                         oist->ist[v] = t->x86_tss.ist[v] =
> >                                         (unsigned long)estacks;
> 
> Seems reasonable to me.
> 
> You could possibly simplify some things if you wrote the address to
> the bottom of *each* debug stack.  Then you wouldn't need the extra
> alignment stuff.

It would waste 16K or so per CPU. I don't think the if is a problem.


> > +/*
> > + * Stack layout:
> > + * +16 pt_regs
> > + * +8  stack mask for ist or 0
> 
> What does that mean?
> 
> Oh, I get it.  It's the size of the IST stack we're on.  Let's please
> make all IST stacks the same size as suggested above and get rid of
> this.  After all, they really are all the same size -- there's just
> more than one debug stack.

I don't want to waste the memory here. A few more instructions are much
preferable.

> > +       movq_cfi rdx, RDX+OFF
> > +       movq_cfi rcx, RCX+OFF
> > +       movq_cfi rax, RAX+OFF
> > +       movq %r8, R8+OFF(%rsp)
> > +       movq %r9, R9+OFF(%rsp)
> > +       movq %r10, R10+OFF(%rsp)
> > +       movq %r11, R11+OFF(%rsp)
> > +       movq_cfi rbx, RBX+OFF
> > +       movq %rbp, RBP+OFF(%rsp)
> > +       movq %r12, R12+OFF(%rsp)
> > +       movq %r13, R13+OFF(%rsp)
> > +       movq %r14, R14+OFF(%rsp)
> > +       movq %r15, R15+OFF(%rsp)
> > +       movq $-1,ORIG_RAX+OFF(%rsp)     /* no syscall to restart */
> > +33:
> > +       ASM_NOP5        /* May be replaced with jump to paranoid_save_gs */
> 
> Is there some reason that the normal ALTERNATIVE macro can't be used here?

This is assembler, not C.

Since you asked for such extensive use I added a new macro for it now.

> > +.macro idtentry sym do_sym has_error_code:req paranoid=0 shift_ist=-1 \
> > +               stack_mask=-EXCEPTION_STKSZ
> 
> This can be removed as well, I think.

Not with the different sized stacks.


-Andi
-- 
ak@linux.intel.com -- Speaking for myself only

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 7/8] x86: Add documentation for rd/wr fs/gs base
  2015-04-10 19:17   ` Andy Lutomirski
@ 2015-04-10 20:22     ` Andi Kleen
  2015-04-10 20:38       ` Andy Lutomirski
  0 siblings, 1 reply; 51+ messages in thread
From: Andi Kleen @ 2015-04-10 20:22 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andi Kleen, X86 ML, Andrew Lutomirski, linux-kernel, Borislav Petkov




> I think that your __switch_to is buggy and that's why the selectors
> need to be zero.  Is that the only issue?  If so, let's fix the bug
> instead.

I don't think there is a bug.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 4/8] x86: Add support for rd/wr fs/gs base
  2015-04-10 20:21     ` Andi Kleen
@ 2015-04-10 20:25       ` Borislav Petkov
  2015-04-10 20:52         ` Andi Kleen
  2015-04-10 20:34       ` Andy Lutomirski
  1 sibling, 1 reply; 51+ messages in thread
From: Borislav Petkov @ 2015-04-10 20:25 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Andy Lutomirski, Andi Kleen, X86 ML, Andrew Lutomirski,
	linux-kernel, Steven Rostedt

On Fri, Apr 10, 2015 at 01:21:32PM -0700, Andi Kleen wrote:
> > Is there some reason that the normal ALTERNATIVE macro can't be used here?
> 
> This is assembler, not C.

He's talking about the ALTERNATIVE asm macro in alternative-asm.h in
tip/master now. No open coded crap.

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.
--

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 4/8] x86: Add support for rd/wr fs/gs base
  2015-04-10 20:21     ` Andi Kleen
  2015-04-10 20:25       ` Borislav Petkov
@ 2015-04-10 20:34       ` Andy Lutomirski
  2015-04-10 20:41         ` Andi Kleen
  1 sibling, 1 reply; 51+ messages in thread
From: Andy Lutomirski @ 2015-04-10 20:34 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Andi Kleen, X86 ML, Andrew Lutomirski, linux-kernel,
	Steven Rostedt, Borislav Petkov

On Fri, Apr 10, 2015 at 1:21 PM, Andi Kleen <ak@linux.intel.com> wrote:
>> We never run paranoid_exit if we interrupted userspace, and we can't
>> context switch on the IST stack, so I don't see how this is possible.
>>
>> > - Restore from R15 (with FSGSBASE), if the gs base was saved
>> > in R15
>>
>> What about case 4: we interrupted the kernel with usergs?  (The code
>> seems more correct in this regard, but this description above is
>> confusing to me.)
>
> I'll fix the description.
>
>> >                                 estacks = per_cpu(debug_stack, cpu);
>> > +                       /* Store GS at bottom of stack for bootstrap access */
>> > +                       *(void **)estacks = gs;
>> >                         estacks += exception_stack_sizes[v];
>> >                         oist->ist[v] = t->x86_tss.ist[v] =
>> >                                         (unsigned long)estacks;
>>
>> Seems reasonable to me.
>>
>> You could possibly simplify some things if you wrote the address to
>> the bottom of *each* debug stack.  Then you wouldn't need the extra
>> alignment stuff.
>
> It would waste 16K or so per CPU. I don't think the if is a problem.
>

It wouldn't take any additional memory at all.  Currently we have 8k
of "debug" stack which is really two 4k pieces, and you're putting the
kernel gs base in the bottom word.  I'm suggesting that you duplicate
the kernel gs base at the bottom word and at the bottom word + 4k.  We
already have a hard limit of 4k of debug stack because of the IST
shift mechanism -- it really is two separate 4k stacks, not one 8k
stack.

Heck, we could rename it DEBUG_STACK_1 and DEBUG_STACK_2, although I
wouldn't get too excited about it since I'm working on patches to
remove the debug stack entirely.

The benefit of this is that the mask needed to find the kernel gs base
is the same for all entries.
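
Roughly this, in cpu_init() (untested sketch -- estacks and gs are the
variables from your patch, "off" is new):

        estacks = per_cpu(debug_stack, cpu);
        /* Duplicate the kernel gs base at the bottom of each 4k
           sub-stack, so one -EXCEPTION_STKSZ mask finds it from any
           IST shift level. */
        for (off = 0; off < DEBUG_STKSZ; off += EXCEPTION_STKSZ)
                *(void **)(estacks + off) = gs;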

--Andy

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 7/8] x86: Add documentation for rd/wr fs/gs base
  2015-04-10 20:22     ` Andi Kleen
@ 2015-04-10 20:38       ` Andy Lutomirski
  2015-04-10 20:46         ` Andi Kleen
  0 siblings, 1 reply; 51+ messages in thread
From: Andy Lutomirski @ 2015-04-10 20:38 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Andi Kleen, X86 ML, Andrew Lutomirski, linux-kernel, Borislav Petkov

On Fri, Apr 10, 2015 at 1:22 PM, Andi Kleen <ak@linux.intel.com> wrote:
>
>
>
>> I think that your __switch_to is buggy and that's why the selectors
>> need to be zero.  Is that the only issue?  If so, let's fix the bug
>> instead.
>
> I don't think there is a bug.

So what's the issue?

If user code programs fs != 0 and fsbase != whatever is implied by fs
and the GDT/LDT, what happens?  There's already a minor buglet in that
area without wrfsbase, but fixing it would be a big performance hit
because we don't have rdfsbase and rdgsbase to read the state
efficiently.  (Specifically, if we have gs == 0, gsbase == 0, but
*saved* gsbase != 0, then we corrupt gsbase on context switch.)

But, with the new instructions, we can do it simply, efficiently, and
correctly in all cases.  Let's do so.

--Andy

>
> -Andi
>
> --
> ak@linux.intel.com -- Speaking for myself only



-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 4/8] x86: Add support for rd/wr fs/gs base
  2015-04-10 20:34       ` Andy Lutomirski
@ 2015-04-10 20:41         ` Andi Kleen
  2015-04-10 20:47           ` Andy Lutomirski
  0 siblings, 1 reply; 51+ messages in thread
From: Andi Kleen @ 2015-04-10 20:41 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andi Kleen, X86 ML, Andrew Lutomirski, linux-kernel,
	Steven Rostedt, Borislav Petkov

> It wouldn't take any additional memory at all.  Currently we have 8k
> of "debug" stack which is really two 4k pieces, and you're putting the
> kernel gs base in the bottom word.  I'm suggesting that you duplicate
> the kernel gs base at the bottom word and at the bottom word + 4k.  We
> already have a hard limit of 4k of debug stack because of the IST
> shift mechanism -- it really is two separate 4k stacks, not one 8k
> stack.

Seems like a hack. What happens if we add an uneven number of stacks?

Just handling it in the code is simple enough.


-Andi
-- 
ak@linux.intel.com -- Speaking for myself only

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 7/8] x86: Add documentation for rd/wr fs/gs base
  2015-04-10 20:38       ` Andy Lutomirski
@ 2015-04-10 20:46         ` Andi Kleen
  2015-04-10 20:52           ` Andy Lutomirski
  0 siblings, 1 reply; 51+ messages in thread
From: Andi Kleen @ 2015-04-10 20:46 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andi Kleen, X86 ML, Andrew Lutomirski, linux-kernel, Borislav Petkov

> If user code programs fs != 0 and fsbase != whatever is implied by fs
> and the GDT/LDT, what happens? 

We load the value from the LDT/GDT.

> There's already a minor buglet in that
> area without wrfsbase, but fixing it would be a big performance hit
> because we don't have rdfsbase and rdgsbase to read the state
> efficiently.  (Specifically, if we have gs == 0, gsbase == 0, but
> *saved* gsbase != 0, then we corrupt gsbase on context switch.)
> 
> But, with the new instructions, we can do it simply, efficiently, and
> correctly in all cases.  Let's do so.

We would need an instruction to write the index without changing the base.
That's not what the new instructions do.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 4/8] x86: Add support for rd/wr fs/gs base
  2015-04-10 20:41         ` Andi Kleen
@ 2015-04-10 20:47           ` Andy Lutomirski
  2015-04-10 20:57             ` Andi Kleen
  2015-04-10 22:52             ` Andi Kleen
  0 siblings, 2 replies; 51+ messages in thread
From: Andy Lutomirski @ 2015-04-10 20:47 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Andi Kleen, X86 ML, Andrew Lutomirski, linux-kernel,
	Steven Rostedt, Borislav Petkov

On Fri, Apr 10, 2015 at 1:41 PM, Andi Kleen <ak@linux.intel.com> wrote:
>> It wouldn't take any additional memory at all.  Currently we have 8k
>> of "debug" stack which is really two 4k pieces, and you're putting the
>> kernel gs base in the bottom word.  I'm suggesting that you duplicate
>> the kernel gs base at the bottom word and at the bottom word + 4k.  We
>> already have a hard limit of 4k of debug stack because of the IST
>> shift mechanism -- it really is two separate 4k stacks, not one 8k
>> stack.
>
> Seems like a hack. What happens if we add an uneven number of stacks?

This works for DEBUG_STKSZ == n * EXCEPTION_STKSZ for any n.  Just
duplicate the pointer n times.

I think all of this stems from unfortunate naming.  DEBUG_STACK isn't
one stack -- it's a debug stack *array*.  The IST shift mechanism
means that we can use different entries in that array as our stacks
depending on how deeply nested we are.

>
> Just handling it in the code is simple enough.

It seems to account for over half the asm diff.  I'm talking about the
addition of approximately two lines of C and the removal of a huge
chunk of the asm diff.

--Andy

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 4/8] x86: Add support for rd/wr fs/gs base
  2015-04-10 20:25       ` Borislav Petkov
@ 2015-04-10 20:52         ` Andi Kleen
  2015-04-10 20:53           ` Borislav Petkov
  0 siblings, 1 reply; 51+ messages in thread
From: Andi Kleen @ 2015-04-10 20:52 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Andy Lutomirski, Andi Kleen, X86 ML, Andrew Lutomirski,
	linux-kernel, Steven Rostedt

On Fri, Apr 10, 2015 at 10:25:42PM +0200, Borislav Petkov wrote:
> On Fri, Apr 10, 2015 at 01:21:32PM -0700, Andi Kleen wrote:
> > > Is there some reason that the normal ALTERNATIVE macro can't be used here?
> > 
> > This is assembler, not C.
> 
> He's talking about the ALTERNATIVE asm macro in alternative-asm.h in
> tip/master now. No open coded crap.

I don't think that macro works with jumps that are patched in
(and it probably should never have been added, because of that)

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 7/8] x86: Add documentation for rd/wr fs/gs base
  2015-04-10 20:46         ` Andi Kleen
@ 2015-04-10 20:52           ` Andy Lutomirski
  0 siblings, 0 replies; 51+ messages in thread
From: Andy Lutomirski @ 2015-04-10 20:52 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Andi Kleen, X86 ML, Andrew Lutomirski, linux-kernel, Borislav Petkov

On Fri, Apr 10, 2015 at 1:46 PM, Andi Kleen <ak@linux.intel.com> wrote:
>> If user code programs fs != 0 and fsbase != whatever is implied by fs
>> and the GDT/LDT, what happens?
>
> We load the value from the LDT/GDT.
>
>> There's already a minor buglet in that
>> area without wrfsbase, but fixing it would be a big performance hit
>> because we don't have rdfsbase and rdgsbase to read the state
>> efficiently.  (Specifically, if we have gs == 0, gsbase == 0, but
>> *saved* gsbase != 0, then we corrupt gsbase on context switch.)
>>
>> But, with the new instructions, we can do it simply, efficiently, and
>> correctly in all cases.  Let's do so.
>
> We would need an instruction to write the index without changing the base.
> That's not what the new instructions do.

I think it's as simple as:

if (has fancy new feature) {
  savesegment(fs, prev->fsindex);
  prev->fsbase = rdfsbase();
  savesegment(gs, prev->gsindex);
  prev->gsbase = read_user_gsbase();

  loadsegment(fs, next->fsindex);
  wrfsbase(next->fsbase);
  write_gs_base_and_index(next->gsindex, next->gsbase);
} else {
  do the old mess;
}

Hmm.  This may need a bit of thought wrt ptrace.  We also need to
consider what happens in the event that the selector load fails.
Presumably we fall back to base == 0.

(Note: no matter what we do here, we at least need to think about
ptrace.  I would argue that poking a nonzero value into fs or gs from
ptrace should zero the saved base register, even on old hardware.
Presumably changing from nonzero to zero should also zero it.)
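
Something along these lines where ptrace pokes gs, perhaps (a sketch of
the proposed rule only; "value" is the poked selector and thread.gs is
the saved base, as in today's thread_struct):

  /* Any change of the gs selector invalidates the saved gs base. */
  if (value != child->thread.gsindex)
    child->thread.gs = 0;
  child->thread.gsindex = value;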

--Andy

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 4/8] x86: Add support for rd/wr fs/gs base
  2015-04-10 20:52         ` Andi Kleen
@ 2015-04-10 20:53           ` Borislav Petkov
  0 siblings, 0 replies; 51+ messages in thread
From: Borislav Petkov @ 2015-04-10 20:53 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Andy Lutomirski, Andi Kleen, X86 ML, Andrew Lutomirski,
	linux-kernel, Steven Rostedt

On Fri, Apr 10, 2015 at 01:52:05PM -0700, Andi Kleen wrote:
> I don't think that macro works jumps that are patched in
> (and probably should have never been added because of that)

-ENOPARSE.

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.
--

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 4/8] x86: Add support for rd/wr fs/gs base
  2015-04-10 20:47           ` Andy Lutomirski
@ 2015-04-10 20:57             ` Andi Kleen
  2015-04-10 21:07               ` Andy Lutomirski
  2015-04-10 22:52             ` Andi Kleen
  1 sibling, 1 reply; 51+ messages in thread
From: Andi Kleen @ 2015-04-10 20:57 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andi Kleen, X86 ML, Andrew Lutomirski, linux-kernel,
	Steven Rostedt, Borislav Petkov

> I think all of this stems from unfortunate naming.  DEBUG_STACK isn't
> one stack -- it's a debug stack *array*.  The IST shift mechanism
> means that we can use different entries in that array as our stacks
> depending on how deeply nested we are.

I still think it's a terrible idea.

> > Just handling it in the code is simple enough.
> 
> It seems to account for over half the asm diff.  I'm talking about the
> addition of approximately two lines of C and the removal of a huge
> chunk of the asm diff.

It's just adding offsets to the stack code. Nothing complicated.
It's also straightforward code. Far more preferable than your magic
overlapping stacks.

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 4/8] x86: Add support for rd/wr fs/gs base
  2015-04-10 20:57             ` Andi Kleen
@ 2015-04-10 21:07               ` Andy Lutomirski
  0 siblings, 0 replies; 51+ messages in thread
From: Andy Lutomirski @ 2015-04-10 21:07 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Andi Kleen, X86 ML, Andrew Lutomirski, linux-kernel,
	Steven Rostedt, Borislav Petkov

On Fri, Apr 10, 2015 at 1:57 PM, Andi Kleen <ak@linux.intel.com> wrote:
>> I think all of this stems from unfortunate naming.  DEBUG_STACK isn't
>> one stack -- it's a debug stack *array*.  The IST shift mechanism
>> means that we can use different entries in that array as our stacks
>> depending on how deeply nested we are.
>
> I still think it's a terrible idea.
>
>> > Just handling it in the code is simple enough.
>>
>> It seems to account for over half the asm diff.  I'm talking about the
>> addition of approximately two lines of C and the removal of a huge
>> chunk of the asm diff.
>
> It's just adding offsets to the stack code. Nothing complicated.
> It's also straight forward code. Far more preferable than your magic
> overlapping stacks.

There are no overlapping stacks.  There's an array of stacks.  That
array wasn't my idea.  I want to delete it, and I have WIP code that
prepares to delete it.  But we have it today, and all of the IST
stacks have exactly the same usable size, and we might as well take
advantage of that.

The layout is straightforward if overcomplicated:

--- debug stack + 8k ---
... 4088 bytes free, including pt_regs ...
kernel gs base
--- debug stack + 4k ---
... 4088 bytes free, including pt_regs ...
kernel gs base
--- bottom of debug stack array ---

There are more stacks above or below this (I haven't checked which
order the things are in, and it doesn't matter).

The top debug stack *is not 8k*.  It's 4k, despite the awkward
description in the code that allocates it.

So the IST stacks really are all the same size.

--Andy

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 5/8] x86: Make old K8 swapgs workaround conditional
  2015-04-10 15:50 ` [PATCH 5/8] x86: Make old K8 swapgs workaround conditional Andi Kleen
@ 2015-04-10 21:46   ` Andy Lutomirski
  2015-04-10 22:01   ` Borislav Petkov
  1 sibling, 0 replies; 51+ messages in thread
From: Andy Lutomirski @ 2015-04-10 21:46 UTC (permalink / raw)
  To: Andi Kleen; +Cc: X86 ML, Andrew Lutomirski, linux-kernel, Andi Kleen

On Fri, Apr 10, 2015 at 8:50 AM, Andi Kleen <andi@firstfloor.org> wrote:
> From: Andi Kleen <ak@linux.intel.com>
>
> Every gs selector/index reload always paid an extra MFENCE
> between the two SWAPGS. This was to work around an old
> bug in early K8 steppings.  All other CPUs don't need the extra
> mfence. Patch in the extra MFENCE only for K8.
>
> Signed-off-by: Andi Kleen <ak@linux.intel.com>
> ---
>  arch/x86/include/asm/cpufeature.h |  1 +
>  arch/x86/kernel/cpu/amd.c         |  3 +++
>  arch/x86/kernel/entry_64.S        | 10 +++++++++-
>  3 files changed, 13 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/include/asm/cpufeature.h b/arch/x86/include/asm/cpufeature.h
> index 90a5485..c695fad 100644
> --- a/arch/x86/include/asm/cpufeature.h
> +++ b/arch/x86/include/asm/cpufeature.h
> @@ -255,6 +255,7 @@
>  #define X86_BUG_11AP           X86_BUG(5) /* Bad local APIC aka 11AP */
>  #define X86_BUG_FXSAVE_LEAK    X86_BUG(6) /* FXSAVE leaks FOP/FIP/FOP */
>  #define X86_BUG_CLFLUSH_MONITOR        X86_BUG(7) /* AAI65, CLFLUSH required before MONITOR */
> +#define X86_BUG_SWAPGS_MFENCE  X86_BUG(8) /* SWAPGS may need MFENCE */
>
>  #if defined(__KERNEL__) && !defined(__ASSEMBLY__)
>
> diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c
> index a220239..e7f5667 100644
> --- a/arch/x86/kernel/cpu/amd.c
> +++ b/arch/x86/kernel/cpu/amd.c
> @@ -551,6 +551,9 @@ static void init_amd_k8(struct cpuinfo_x86 *c)
>         if ((level >= 0x0f48 && level < 0x0f50) || level >= 0x0f58)
>                 set_cpu_cap(c, X86_FEATURE_REP_GOOD);
>
> > +       /* Early steppings needed an mfence on swapgs. */
> +       set_cpu_cap(c, X86_BUG_SWAPGS_MFENCE);
> +
>         /*
>          * Some BIOSes incorrectly force this feature, but only K8 revision D
>          * (model = 0x14) and later actually support it.
> diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
> index 0b74ab0..bb44292 100644
> --- a/arch/x86/kernel/entry_64.S
> +++ b/arch/x86/kernel/entry_64.S
> @@ -1212,13 +1212,21 @@ ENTRY(native_load_gs_index)
>         SWAPGS
>  gs_change:
>         movl %edi,%gs
> -2:     mfence          /* workaround */
> +2:     ASM_NOP3        /* may be replaced with mfence */

ALTERNATIVE, please.

--Andy

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 5/8] x86: Make old K8 swapgs workaround conditional
  2015-04-10 15:50 ` [PATCH 5/8] x86: Make old K8 swapgs workaround conditional Andi Kleen
  2015-04-10 21:46   ` Andy Lutomirski
@ 2015-04-10 22:01   ` Borislav Petkov
  2015-04-10 23:10     ` Andi Kleen
  1 sibling, 1 reply; 51+ messages in thread
From: Borislav Petkov @ 2015-04-10 22:01 UTC (permalink / raw)
  To: Andi Kleen; +Cc: x86, luto, linux-kernel, Andi Kleen

On Fri, Apr 10, 2015 at 08:50:30AM -0700, Andi Kleen wrote:
> From: Andi Kleen <ak@linux.intel.com>
> 
> Every gs selector/index reload always paid an extra MFENCE
> between the two SWAPGS. This was to work around an old
> bug in early K8 steppings.  All other CPUs don't need the extra
> mfence. Patch in the extra MFENCE only for K8.
> 
> Signed-off-by: Andi Kleen <ak@linux.intel.com>
> ---
>  arch/x86/include/asm/cpufeature.h |  1 +
>  arch/x86/kernel/cpu/amd.c         |  3 +++
>  arch/x86/kernel/entry_64.S        | 10 +++++++++-
>  3 files changed, 13 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/include/asm/cpufeature.h b/arch/x86/include/asm/cpufeature.h
> index 90a5485..c695fad 100644
> --- a/arch/x86/include/asm/cpufeature.h
> +++ b/arch/x86/include/asm/cpufeature.h
> @@ -255,6 +255,7 @@
>  #define X86_BUG_11AP		X86_BUG(5) /* Bad local APIC aka 11AP */
>  #define X86_BUG_FXSAVE_LEAK	X86_BUG(6) /* FXSAVE leaks FOP/FIP/FOP */
>  #define X86_BUG_CLFLUSH_MONITOR	X86_BUG(7) /* AAI65, CLFLUSH required before MONITOR */
> +#define X86_BUG_SWAPGS_MFENCE	X86_BUG(8) /* SWAPGS may need MFENCE */
>  
>  #if defined(__KERNEL__) && !defined(__ASSEMBLY__)
>  
> diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c
> index a220239..e7f5667 100644
> --- a/arch/x86/kernel/cpu/amd.c
> +++ b/arch/x86/kernel/cpu/amd.c
> @@ -551,6 +551,9 @@ static void init_amd_k8(struct cpuinfo_x86 *c)
>  	if ((level >= 0x0f48 && level < 0x0f50) || level >= 0x0f58)
>  		set_cpu_cap(c, X86_FEATURE_REP_GOOD);
>  
> > +	/* Early steppings needed an mfence on swapgs. */
> +	set_cpu_cap(c, X86_BUG_SWAPGS_MFENCE);

set_cpu_bug()

and this should not be set on all K8 but for the early steppings only
which need it.
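
I.e., something like this (sketch only - K8_SWAPGS_FIXED_REV below is a
made-up placeholder, the real cutoff needs to come from the erratum):

	/* Hypothetical: only revisions before the fix need the fence */
	if (level < K8_SWAPGS_FIXED_REV)
		set_cpu_bug(c, X86_BUG_SWAPGS_MFENCE);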

> +
>  	/*
>  	 * Some BIOSes incorrectly force this feature, but only K8 revision D
>  	 * (model = 0x14) and later actually support it.
> diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
> index 0b74ab0..bb44292 100644
> --- a/arch/x86/kernel/entry_64.S
> +++ b/arch/x86/kernel/entry_64.S
> @@ -1212,13 +1212,21 @@ ENTRY(native_load_gs_index)
>  	SWAPGS
>  gs_change:
>  	movl %edi,%gs
> -2:	mfence		/* workaround */
> +2:	ASM_NOP3	/* may be replaced with mfence */
>  	SWAPGS
>  	popfq_cfi
>  	ret
>  	CFI_ENDPROC
>  END(native_load_gs_index)
>  
> +	/* Early K8 systems needed an mfence after swapgs to work around a bug */
> +	.section .altinstr_replacement,"ax"
> +3:	mfence
> +	.previous
> +	.section .altinstructions,"a"
> +	altinstruction_entry 2b,3b,X86_BUG_SWAPGS_MFENCE,3,3
> +	.previous
> +

What AndyL said.

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.
--

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 6/8] x86: Enumerate kernel FSGS capability in AT_HWCAP2
  2015-04-10 15:50 ` [PATCH 6/8] x86: Enumerate kernel FSGS capability in AT_HWCAP2 Andi Kleen
  2015-04-10 19:15   ` Andy Lutomirski
@ 2015-04-10 22:14   ` Borislav Petkov
  2015-04-10 23:07     ` Andi Kleen
  2015-04-10 23:08     ` Andi Kleen
  1 sibling, 2 replies; 51+ messages in thread
From: Borislav Petkov @ 2015-04-10 22:14 UTC (permalink / raw)
  To: Andi Kleen; +Cc: x86, luto, linux-kernel, Andi Kleen

On Fri, Apr 10, 2015 at 08:50:31AM -0700, Andi Kleen wrote:
> From: Andi Kleen <ak@linux.intel.com>
> 
> The kernel needs to explicitly enable RD/WRFSBASE to handle context
> switch correctly. So the application needs to know if it can safely use
> these instructions. Just looking at the CPUID bit is not enough, because
> the application may be running on a kernel that does not enable them.
> 
> One way for the application would be to just try and catch the SIGILL.
> But that is difficult to do in libraries which may not want
> to overwrite the signal handlers of the main application.
> 
> So we need to provide a way for the application to discover the kernel
> capability.
> 
> I used AT_HWCAP2 in the ELF aux vector, which is already used by
> PPC for similar things. We define a new Linux-defined bitmap
> returned in AT_HWCAP2.  Currently it has only one bit set,
> indicating that the kernel is FSGSBASE capable.
> 
> The application can then access it manually or using
> the getauxval() function in newer glibc.
> 
> Signed-off-by: Andi Kleen <ak@linux.intel.com>
> ---
>  arch/x86/include/asm/elf.h        | 7 +++++++
>  arch/x86/include/uapi/asm/hwcap.h | 7 +++++++
>  arch/x86/kernel/cpu/common.c      | 7 ++++++-
>  3 files changed, 20 insertions(+), 1 deletion(-)
>  create mode 100644 arch/x86/include/uapi/asm/hwcap.h
> 
> diff --git a/arch/x86/include/asm/elf.h b/arch/x86/include/asm/elf.h
> index ca3347a..47dac31 100644
> --- a/arch/x86/include/asm/elf.h
> +++ b/arch/x86/include/asm/elf.h
> @@ -257,6 +257,13 @@ extern int force_personality32;
>  
>  #define ELF_HWCAP		(boot_cpu_data.x86_capability[0])
>  
> +extern unsigned kernel_enabled_features;
> +
> > +/* HWCAP2 supplies kernel enabled CPU features, so that the application
> > +   can know that it can safely use them. The bits are defined in
> > +   uapi/asm/hwcap.h. */

Comments formatting.

> +#define ELF_HWCAP2		kernel_enabled_features
> +
>  /* This yields a string that ld.so will use to load implementation
>     specific libraries for optimization.  This is more specific in
>     intent than poking at uname or /proc/cpuinfo.
> diff --git a/arch/x86/include/uapi/asm/hwcap.h b/arch/x86/include/uapi/asm/hwcap.h
> new file mode 100644
> index 0000000..d9c54f8
> --- /dev/null
> +++ b/arch/x86/include/uapi/asm/hwcap.h
> @@ -0,0 +1,7 @@
> +#ifndef _ASM_HWCAP_H
> +#define _ASM_HWCAP_H 1
> +
> +#define HWCAP2_FSGSBASE	(1 << 0) 	/* Kernel enabled RD/WR FS/GS BASE */

BIT()

> > +/* up to bit 31 free */
> +
> +#endif
> diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
> index 7df88a3..81186cb 100644
> --- a/arch/x86/kernel/cpu/common.c
> +++ b/arch/x86/kernel/cpu/common.c
> @@ -34,6 +34,7 @@
>  #include <asm/i387.h>
>  #include <asm/fpu-internal.h>
>  #include <asm/mtrr.h>
> +#include <asm/hwcap.h>
>  #include <linux/numa.h>
>  #include <asm/asm.h>
>  #include <asm/cpu.h>
> @@ -49,6 +50,8 @@
>  
>  #include "cpu.h"
>  
> +unsigned kernel_enabled_features __read_mostly;
> +
>  /* all of these masks are initialized in setup_cpu_local_masks() */
>  cpumask_var_t cpu_initialized_mask;
>  cpumask_var_t cpu_callout_mask;
> @@ -958,8 +961,10 @@ static void identify_cpu(struct cpuinfo_x86 *c)
>  	numa_add_cpu(smp_processor_id());
>  #endif
>  
> -	if (cpu_has(c, X86_FEATURE_FSGSBASE))
> +	if (cpu_has(c, X86_FEATURE_FSGSBASE)) {

static_cpu_has_safe()

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.
--

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 4/8] x86: Add support for rd/wr fs/gs base
  2015-04-10 20:47           ` Andy Lutomirski
  2015-04-10 20:57             ` Andi Kleen
@ 2015-04-10 22:52             ` Andi Kleen
  2015-04-10 23:00               ` Andy Lutomirski
  1 sibling, 1 reply; 51+ messages in thread
From: Andi Kleen @ 2015-04-10 22:52 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andi Kleen, Andi Kleen, X86 ML, Andrew Lutomirski, linux-kernel,
	Steven Rostedt, Borislav Petkov

> I think all of this stems from unfortunate naming.  DEBUG_STACK isn't
> one stack -- it's a debug stack *array*.  The IST shift mechanism
> means that we can use different entries in that array as our stacks
> depending on how deeply nested we are.

It's not. It was always intended as one stack. It still is, for anyone
not nesting debuggers (which afaik doesn't happen in-tree, it was
only for Jan Beulich's magic debugger)

I wrote the original code, so I should know.

-Andi


-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 4/8] x86: Add support for rd/wr fs/gs base
  2015-04-10 22:52             ` Andi Kleen
@ 2015-04-10 23:00               ` Andy Lutomirski
  2015-04-10 23:05                 ` Andi Kleen
  0 siblings, 1 reply; 51+ messages in thread
From: Andy Lutomirski @ 2015-04-10 23:00 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Andi Kleen, X86 ML, Andrew Lutomirski, linux-kernel,
	Steven Rostedt, Borislav Petkov

On Fri, Apr 10, 2015 at 3:52 PM, Andi Kleen <andi@firstfloor.org> wrote:
>> I think all of this stems from unfortunate naming.  DEBUG_STACK isn't
>> one stack -- it's a debug stack *array*.  The IST shift mechanism
>> means that we can use different entries in that array as our stacks
>> depending on how deeply nested we are.
>
> It's not. It was always intended as one stack. It still is, for anyone
> not nesting debuggers (which afaik doesn't happen in-tree, it was
> only for Jan Beulich's magic debugger)
>
> I wrote the original code, so I should know.

Regardless of who wrote what, the entry code does this:

        .if \shift_ist != -1
        subq $EXCEPTION_STKSZ, INIT_TSS_IST(\shift_ist)
        .endif

        call \do_sym

        .if \shift_ist != -1
        addq $EXCEPTION_STKSZ, INIT_TSS_IST(\shift_ist)
        .endif

I wrote that in its current form, but it does *exactly* the same thing
(as in literally identical generated code) as the old code.

One might argue that this code serves no purpose, but it's there, so
we had better keep our per-invocation usage of DEBUG_STACK within 4k.

--Andy

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 4/8] x86: Add support for rd/wr fs/gs base
  2015-04-10 23:00               ` Andy Lutomirski
@ 2015-04-10 23:05                 ` Andi Kleen
  2015-04-10 23:15                   ` Andy Lutomirski
                                     ` (2 more replies)
  0 siblings, 3 replies; 51+ messages in thread
From: Andi Kleen @ 2015-04-10 23:05 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andi Kleen, Andi Kleen, X86 ML, Andrew Lutomirski, linux-kernel,
	Steven Rostedt, Borislav Petkov, JBeulich

> One might argue that this code serves no purpose, but it's there, so
> we had better keep our per-invocation usage of DEBUG_STACK within 4k.

Only if you run NLKD. I doubt KDB or GDB support nesting.
We can ask Jan if he still uses it.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 6/8] x86: Enumerate kernel FSGS capability in AT_HWCAP2
  2015-04-10 22:14   ` Borislav Petkov
@ 2015-04-10 23:07     ` Andi Kleen
  2015-04-10 23:08     ` Andi Kleen
  1 sibling, 0 replies; 51+ messages in thread
From: Andi Kleen @ 2015-04-10 23:07 UTC (permalink / raw)
  To: Borislav Petkov; +Cc: Andi Kleen, x86, luto, linux-kernel, Andi Kleen

> > +++ b/arch/x86/include/uapi/asm/hwcap.h
> > @@ -0,0 +1,7 @@
> > +#ifndef _ASM_HWCAP_H
> > +#define _ASM_HWCAP_H 1
> > +
> > +#define HWCAP2_FSGSBASE	(1 << 0) 	/* Kernel enabled RD/WR FS/GS BASE */
> 
> BIT()

We cannot use BIT in uapi headers for obvious reasons.

> >  cpumask_var_t cpu_initialized_mask;
> >  cpumask_var_t cpu_callout_mask;
> > @@ -958,8 +961,10 @@ static void identify_cpu(struct cpuinfo_x86 *c)
> >  	numa_add_cpu(smp_processor_id());
> >  #endif
> >  
> > -	if (cpu_has(c, X86_FEATURE_FSGSBASE))
> > +	if (cpu_has(c, X86_FEATURE_FSGSBASE)) {
> 
> static_cpu_has_safe()

That checks the boot CPU? We want to check the current CPU.

-Andi


-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 6/8] x86: Enumerate kernel FSGS capability in AT_HWCAP2
  2015-04-10 22:14   ` Borislav Petkov
  2015-04-10 23:07     ` Andi Kleen
@ 2015-04-10 23:08     ` Andi Kleen
  2015-04-11  7:16       ` Borislav Petkov
  1 sibling, 1 reply; 51+ messages in thread
From: Andi Kleen @ 2015-04-10 23:08 UTC (permalink / raw)
  To: Borislav Petkov; +Cc: Andi Kleen, x86, luto, linux-kernel, Andi Kleen

> > +extern unsigned kernel_enabled_features;
> > +
> > +/* HWCAP2 supplies kernel enabled CPU feature, so that the application
> > +   can know that it can safely use them. The bits are defined in
> > +   uapi/asm/hwcap.h. */
> 
> Comments formatting.

The formatting matches all the other comments in the file.

-Andi

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 5/8] x86: Make old K8 swapgs workaround conditional
  2015-04-10 22:01   ` Borislav Petkov
@ 2015-04-10 23:10     ` Andi Kleen
  2015-04-11  7:18       ` Borislav Petkov
  0 siblings, 1 reply; 51+ messages in thread
From: Andi Kleen @ 2015-04-10 23:10 UTC (permalink / raw)
  To: Borislav Petkov; +Cc: Andi Kleen, x86, luto, linux-kernel, Andi Kleen

On Sat, Apr 11, 2015 at 12:01:10AM +0200, Borislav Petkov wrote:
> > diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c
> > index a220239..e7f5667 100644
> > --- a/arch/x86/kernel/cpu/amd.c
> > +++ b/arch/x86/kernel/cpu/amd.c
> > @@ -551,6 +551,9 @@ static void init_amd_k8(struct cpuinfo_x86 *c)
> >  	if ((level >= 0x0f48 && level < 0x0f50) || level >= 0x0f58)
> >  		set_cpu_cap(c, X86_FEATURE_REP_GOOD);
> >  
> > +	/* Early steppings needed a mfence on swapgs. */
> > +	set_cpu_cap(c, X86_BUG_SWAPGS_MFENCE);
> 
> set_cpu_bug()
> 
> and this should not be set on all K8 but for the early steppings only
> which need it.

I don't know which one, but I don't think it really matters.

-Andi

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 4/8] x86: Add support for rd/wr fs/gs base
  2015-04-10 23:05                 ` Andi Kleen
@ 2015-04-10 23:15                   ` Andy Lutomirski
  2015-04-10 23:18                     ` Andi Kleen
  2015-04-10 23:16                   ` Andi Kleen
  2015-04-13  7:07                   ` Jan Beulich
  2 siblings, 1 reply; 51+ messages in thread
From: Andy Lutomirski @ 2015-04-10 23:15 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Andi Kleen, X86 ML, Andrew Lutomirski, linux-kernel,
	Steven Rostedt, Borislav Petkov, Jan Beulich

On Fri, Apr 10, 2015 at 4:05 PM, Andi Kleen <andi@firstfloor.org> wrote:
>> One might argue that this code serves no purpose, but it's there, so
>> we had better keep our per-invocation usage of DEBUG_STACK within 4k.
>
> Only if you run NLKD. I doubt KDB or GDB support nesting.
> We can ask Jan if he still uses it.

You can trigger these things in various ways with kprobes,
single-stepping, and user-space watchpoints.  These things can
definitely nest in interesting ways.  Whether or not the IST shift
mechanism is involved if they nest is a complicated question, because
there's also the "debug" IDT in play.

OMG this code is overcomplicated.

--Andy

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 4/8] x86: Add support for rd/wr fs/gs base
  2015-04-10 23:05                 ` Andi Kleen
  2015-04-10 23:15                   ` Andy Lutomirski
@ 2015-04-10 23:16                   ` Andi Kleen
  2015-04-13  7:07                   ` Jan Beulich
  2 siblings, 0 replies; 51+ messages in thread
From: Andi Kleen @ 2015-04-10 23:16 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Andy Lutomirski, Andi Kleen, X86 ML, Andrew Lutomirski,
	linux-kernel, Steven Rostedt, Borislav Petkov, JBeulich

On Sat, Apr 11, 2015 at 01:05:46AM +0200, Andi Kleen wrote:
> > One might argue that this code serves no purpose, but it's there, so
> > we had better keep our per-invocation usage of DEBUG_STACK within 4k.
> 
> Only if you run NLKD. I doubt KDB or GDB support nesting.
> We can ask Jan if he still uses it.

Thinking about it more, we likely still need it for kprobes,
in case someone sets one in the int3 or debug path.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 4/8] x86: Add support for rd/wr fs/gs base
  2015-04-10 23:15                   ` Andy Lutomirski
@ 2015-04-10 23:18                     ` Andi Kleen
  2015-04-10 23:21                       ` Andy Lutomirski
  0 siblings, 1 reply; 51+ messages in thread
From: Andi Kleen @ 2015-04-10 23:18 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andi Kleen, Andi Kleen, X86 ML, Andrew Lutomirski, linux-kernel,
	Steven Rostedt, Borislav Petkov, Jan Beulich

On Fri, Apr 10, 2015 at 04:15:39PM -0700, Andy Lutomirski wrote:
> On Fri, Apr 10, 2015 at 4:05 PM, Andi Kleen <andi@firstfloor.org> wrote:
> >> One might argue that this code serves no purpose, but it's there, so
> >> we had better keep our per-invocation usage of DEBUG_STACK within 4k.
> >
> > Only if you run NLKD. I doubt KDB or GDB support nesting.
> > We can ask Jan if he still uses it.
> 
> You can trigger these things in various ways with kprobes,
> single-stepping, and user-space watchpoints.  These things can
> definitely nest in interesting ways.  Whether or not the IST shift
> mechanism is involved if they nest is a complicated question, because
> there's also the "debug" IDT in play.

Yes we need it for (old style) kprobes in those paths.

For user space it shouldn't matter.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 4/8] x86: Add support for rd/wr fs/gs base
  2015-04-10 23:18                     ` Andi Kleen
@ 2015-04-10 23:21                       ` Andy Lutomirski
  0 siblings, 0 replies; 51+ messages in thread
From: Andy Lutomirski @ 2015-04-10 23:21 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Andi Kleen, X86 ML, Andrew Lutomirski, linux-kernel,
	Steven Rostedt, Borislav Petkov, Jan Beulich

On Fri, Apr 10, 2015 at 4:18 PM, Andi Kleen <andi@firstfloor.org> wrote:
> On Fri, Apr 10, 2015 at 04:15:39PM -0700, Andy Lutomirski wrote:
>> On Fri, Apr 10, 2015 at 4:05 PM, Andi Kleen <andi@firstfloor.org> wrote:
>> >> One might argue that this code serves no purpose, but it's there, so
>> >> we had better keep our per-invocation usage of DEBUG_STACK within 4k.
>> >
>> > Only if you run NLKD. I doubt KDB or GDB support nesting.
>> > We can ask Jan if he still uses it.
>>
>> You can trigger these things in various ways with kprobes,
>> single-stepping, and user-space watchpoints.  These things can
>> definitely nest in interesting ways.  Whether or not the IST shift
>> mechanism is involved if they nest is a complicated question, because
>> there's also the "debug" IDT in play.
>
> Yes we need it for (old style) kprobes in those paths.
>
> For user space it shouldn't matter.

For entries directly from userspace, we're not using the IST stack any more.

--Andy

>
> -Andi
>
> --
> ak@linux.intel.com -- Speaking for myself only.



-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 6/8] x86: Enumerate kernel FSGS capability in AT_HWCAP2
  2015-04-10 23:08     ` Andi Kleen
@ 2015-04-11  7:16       ` Borislav Petkov
  2015-04-11  8:35         ` Intel FSGSBASE support (was: Re: [PATCH 6/8] x86: Enumerate kernel FSGS capability in AT_HWCAP2) Ingo Molnar
  0 siblings, 1 reply; 51+ messages in thread
From: Borislav Petkov @ 2015-04-11  7:16 UTC (permalink / raw)
  To: Andi Kleen; +Cc: x86, luto, linux-kernel, Andi Kleen

On Sat, Apr 11, 2015 at 01:08:48AM +0200, Andi Kleen wrote:
> The formatting matches all the other comments in the file.

That doesn't mean you need to add new comments with the old formatting
which we're trying to get rid of.

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.
--

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 5/8] x86: Make old K8 swapgs workaround conditional
  2015-04-10 23:10     ` Andi Kleen
@ 2015-04-11  7:18       ` Borislav Petkov
  0 siblings, 0 replies; 51+ messages in thread
From: Borislav Petkov @ 2015-04-11  7:18 UTC (permalink / raw)
  To: Andi Kleen; +Cc: x86, luto, linux-kernel, Andi Kleen

On Sat, Apr 11, 2015 at 01:10:13AM +0200, Andi Kleen wrote:
> I don't know which one, but I don't think it really matters.

We're not punishing innocent K8 machines just because you don't
remember.

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.
--

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Intel FSGSBASE support (was: Re: [PATCH 6/8] x86: Enumerate kernel FSGS capability in AT_HWCAP2)
  2015-04-11  7:16       ` Borislav Petkov
@ 2015-04-11  8:35         ` Ingo Molnar
  0 siblings, 0 replies; 51+ messages in thread
From: Ingo Molnar @ 2015-04-11  8:35 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Andi Kleen, x86, luto, linux-kernel, Andi Kleen, Thomas Gleixner,
	H. Peter Anvin, Peter Zijlstra, Denys Vlasenko, Andy Lutomirski


* Borislav Petkov <bp@alien8.de> wrote:

> On Sat, Apr 11, 2015 at 01:08:48AM +0200, Andi Kleen wrote:

> > > > +/* HWCAP2 supplies kernel enabled CPU feature, so that the application
> > > > +   can know that it can safely use them. The bits are defined in
> > > > +   uapi/asm/hwcap.h. */
> > >
> > > Comments formatting.
> >
> > The formatting matches all the other comments in the file.
>
> That doesn't mean you need to add new comments with the old 
> formatting which we're trying to get rid of.

Exactly, and the thing is, I've seen this behavior before, so I'm also 
going to ignore all these Intel FSGSBASE patches from Andi Kleen, for 
the following technical reasons:

 - they are poorly written,

 - a necessary precondition of such features is the thorough (and
   constructively conducted) clean-up of the underlying code,

 - the series exposes a new user-space ABI that will have to be
   supported forever and has to be done right on the first attempt,

 - unacceptable passive-aggressive behavior was directed by Andi
   against constructive, technical feedback from reviewers and
   maintainers.

So consider Intel FSGSBASE support NACK-ed on these four technical 
grounds. All four problems have to be properly addressed (beyond 
addressing all the other feedback that was given) before the NACK is 
lifted.

Maybe someone else has the time to pick up and deobfuscate (or 
entirely rewrite) these patches into a properly written series?

The Intel FSGSBASE hardware feature itself looks potentially useful, 
so I'm not opposed to the concept itself.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 4/8] x86: Add support for rd/wr fs/gs base
  2015-04-10 23:05                 ` Andi Kleen
  2015-04-10 23:15                   ` Andy Lutomirski
  2015-04-10 23:16                   ` Andi Kleen
@ 2015-04-13  7:07                   ` Jan Beulich
  2 siblings, 0 replies; 51+ messages in thread
From: Jan Beulich @ 2015-04-13  7:07 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Borislav Petkov, Andy Lutomirski, Steven Rostedt,
	Andrew Lutomirski, X86 ML, Andi Kleen, linux-kernel

>>> On 11.04.15 at 01:05, <andi@firstfloor.org> wrote:
>>  One might argue that this code serves no purpose, but it's there, so
>> we had better keep our per-invocation usage of DEBUG_STACK within 4k.
> 
> Only if you run NLKD. I doubt KDB or GDB support nesting.
> We can ask Jan if he still uses it.

I haven't had the time to bring NLKD forward for the last several years.

Jan


^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 4/8] x86: Add support for rd/wr fs/gs base
  2014-11-11 20:05   ` Andy Lutomirski
@ 2014-11-11 20:49     ` Andy Lutomirski
  0 siblings, 0 replies; 51+ messages in thread
From: Andy Lutomirski @ 2014-11-11 20:49 UTC (permalink / raw)
  To: Andi Kleen, X86 ML; +Cc: linux-kernel, Andi Kleen, Borislav Petkov

On Tue, Nov 11, 2014 at 12:05 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> On 11/10/2014 03:55 PM, Andi Kleen wrote:
>> To prevent recursive interrupts clobbering this
>> state in the task_struct this is only done for interrupts
>> coming directly from ring 3.
>
> Since this just came up in a different context today, I'd like to
> propose a different solution to this piece of the problem.
>
> Can we change the paranoid entry to check if the entry came from ring 3
> and to just switch stacks immediately to the standard kernel stack and
> run the non-paranoid entry code?  This eliminates paranoid_userspace
> entirely, and there are no special gsbase machinations any more for the
> entry-from-userspace path.
>
> In fact, I think that this will result in the MSR KERNEL_GS_BASE value
> *always* matching the userspace gs value from any C code in the kernel,
> since we'll always swapgs exactly once on entry from userspace.

I'll send patches in a couple hours.  I have it mostly working.

--Andy

^ permalink raw reply	[flat|nested] 51+ messages in thread

* Re: [PATCH 4/8] x86: Add support for rd/wr fs/gs base
  2014-11-10 23:55 ` [PATCH 4/8] x86: Add support for rd/wr fs/gs base Andi Kleen
@ 2014-11-11 20:05   ` Andy Lutomirski
  2014-11-11 20:49     ` Andy Lutomirski
  0 siblings, 1 reply; 51+ messages in thread
From: Andy Lutomirski @ 2014-11-11 20:05 UTC (permalink / raw)
  To: Andi Kleen, x86; +Cc: linux-kernel, Andi Kleen, Borislav Petkov

On 11/10/2014 03:55 PM, Andi Kleen wrote:
> From: Andi Kleen <ak@linux.intel.com>
> 
> Introduction:
> 
> IvyBridge added four new instructions to directly write the fs and gs
> 64bit base registers. Previously this had to be done with a system
> call to write to MSRs. The main use case is fast user space threading
> and switching the fs/gs registers quickly there. Another use
> case is having (relatively) cheap access to a new address
> register per thread.
> 
> The instructions are opt-in and have to be explicitly enabled
> by the OS.
> 
> For more details on how to use the instructions see
> Documentation/x86/fsgs.txt, added in a follow-on patch.
> 
> Paranoid exception path changes:
> ===============================
> 
> The paranoid entry/exit code is used for any NMI-like
> exception.
> 
> Previously Linux couldn't support the new instructions
> because the paranoid entry code relied on the gs base never being
> negative outside the kernel to decide when to use swapgs. It would
> check the gs MSR value and assume it was already running in the
> kernel if negative.
> 
> To get rid of this assumption we have to revamp the paranoid exception
> path to not rely on this. We can use the new instructions
> to get (relatively) quick access to the GS value, and use
> it directly.
> 
> This is also significantly faster than an MSR read, so it will
> speed up NMIs (critical for profiling).
> 
> The kernel gs for the paranoid path is now stored at the
> bottom of the IST stack (so that it can be derived from RSP).
> For this we need to know the size of the IST stack
> (4K or 8K), which is now passed in as a stack parameter
> to save_paranoid.
> 
> The original patch compared the gs with the kernel gs and
> assumed that if it was identical, swapgs was not needed
> (and no user space processing was needed). This
> was nice and simple and didn't need a lot of changes.
> 
> But this had the side effect that if a user process set its
> GS to the same as the kernel's, it may lose rescheduling
> checks (so a racing reschedule IPI would only have been
> acted upon at the next non-paranoid interrupt).
> 
> This version now switches to full save/restore of the GS.
> This requires quite a few changes in the paranoid path.
> Unfortunately I didn't come up with a simpler scheme.
> 
> Previously we had a flag in EBX that indicated whether
> SWAPGS needs to be called later or not. In the new scheme
> this turns into a tristate, with a new "restore from R15"
> mode that is used when the FSGSBASE instructions are available.
> In this case the GS base is saved and restored.
> The exit paths are all adjusted to handle this correctly.
> 
> There is one complication: to allow debuggers (especially
> from the int3 or debug vectors) access to the user GS
> we need to save it in the task struct. Normally
> the next context switch would overwrite it with the wrong
> value from kernel_gs, so we also set a new flag in task_struct
> that prevents this. The flag is cleared on context switch.
> 
> [Even with the additional flag the new FS/GS context switch
> is vastly faster than the old MSR based version for bases >4GB]
> 
> To prevent recursive interrupts clobbering this
> state in the task_struct this is only done for interrupts
> coming directly from ring 3.

Since this just came up in a different context today, I'd like to
propose a different solution to this piece of the problem.

Can we change the paranoid entry to check if the entry came from ring 3
and to just switch stacks immediately to the standard kernel stack and
run the non-paranoid entry code?  This eliminates paranoid_userspace
entirely, and there are no special gsbase machinations any more for the
entry-from-userspace path.

In fact, I think that this will result in the MSR KERNEL_GS_BASE value
*always* matching the userspace gs value from any C code in the kernel,
since we'll always swapgs exactly once on entry from userspace.
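
Roughly, in C-ish pseudocode (a sketch only, with made-up helper
names, not the real entry code):

	/* proposed paranoid entry for the from-ring-3 case, sketch */
	if (user_mode(regs)) {			/* CS & 3 set */
		swapgs();			/* exactly one swapgs per user entry */
		switch_to_kernel_stack();	/* hypothetical helper */
		return non_paranoid_entry(regs); /* ordinary exit path applies */
	}
	/* interrupted the kernel: keep the existing paranoid logic */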

If we make this change, then:

> 
> After a reschedule comes back we check if the flag is still
> set. If it wasn't set the GS is back in the (swapped) kernel
> gs so we revert to the SWAPGS mode, instead of restoring GS.
> 
> So the three possible states for the paranoid exit path are:
> 
> - Do nothing (pre FSGSBASE), if we interrupted the kernel
> as determined by the existing negative GS

This works the same way as before.

> - Do swapgs, if we interrupted user space with positive GS
> (pre FSGSBASE), or we saved gs, but it was overwritten
> later by a context switch (with FSGSBASE)

This can never happen on paranoid return.

> - Restore from R15 (with FSGSBASE), if the gs base was saved
> in R15, and not overwritten by a context switch.
> 

I think that if we implement my suggestion, then we can't have a context
switch here either.

> Context switch changes:
> ======================
> 
> Then after these changes we need to also use the new instructions
> to save/restore fs and gs, so that the new values set by the
> users won't disappear.  This is also significantly
> faster for the case when the 64bit base has to be switched
> (that is when GS is larger than 4GB), as we can replace
> the slow MSR write with a faster wr[fg]sbase execution.
> 
> The instructions do not context switch
> the segment index, so the old invariant that fs or gs index
> have to be 0 for a different 64bit value to stick is still
> true. Previously it was enforced by arch_prctl; now the user
> program has to make sure it keeps the segment indexes zero.
> If it doesn't, the changes may not stick.
> 
> [Yes, this implies that programs can find out when they
> got context switched. However they could already do that
> before by checking the timing.]

I don't like this.  Also, starting in 3.19 (if my CR4 patches get
merged), this won't be true in a tight enough seccomp sandbox.

What's wrong with switching the segment indexes separately?  In the
common case, this will be two reads of segment registers per context
switch (to check if the previous segment registers were nonzero), which
should be pretty cheap.
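
Concretely, I mean something like this in __switch_to, using the
existing savesegment/loadsegment/load_gs_index helpers (an untested
sketch, not a tested patch):

	unsigned short fsindex, gsindex;

	savesegment(fs, fsindex);
	savesegment(gs, gsindex);
	/* reload only when the old or the new index is nonzero */
	if (fsindex || next->fsindex)
		loadsegment(fs, next->fsindex);
	if (gsindex || next->gsindex)
		load_gs_index(next->gsindex);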

Also, it's probably very unsafe to let one task leak an LDT index in fs
or gs into another task.  This is asking for cross-task segfaults in the
hands of malicious users.


Also, at some point hpa suggested that we should start saving and
restoring fs, gs, fsbase, and gsbase in sigcontext when we enable this
feature.  Userspace code might want that.

> diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
> index 59def4c..db74af5 100644
> --- a/arch/x86/kernel/cpu/common.c
> +++ b/arch/x86/kernel/cpu/common.c
> @@ -953,6 +953,9 @@ static void identify_cpu(struct cpuinfo_x86 *c)
>  #ifdef CONFIG_NUMA
>  	numa_add_cpu(smp_processor_id());
>  #endif
> +
> +	if (cpu_has(c, X86_FEATURE_FSGSBASE))
> +		set_in_cr4(X86_CR4_FSGSBASE);

This will conflict with my cr4 patches (the merge will fail to compile).
 We can cross that bridge when we get to it.

>  
> +/* Values of the ebx flag: */
> +#define DO_RESTORE_R15	 2 /* Restore GS at end */
> +#define DO_SWAPGS	 1 /* Use SWAPGS at end */
> +#define DO_NOTHING	 0 /* Back to ring 0 with same gs */
> +
> +/*
> + * Stack layout:
> + * +16 pt_regs
> + * +8  stack mask for ist or 0
> + * +0  return address.
> + */
> +#define OFF 16
> +#define STKMSK 8
> +
>  ENTRY(save_paranoid)
> -	XCPT_FRAME 1 RDI+8
> +	XCPT_FRAME 1 RDI+OFF
>  	cld
> -	movq %rdi, RDI+8(%rsp)
> -	movq %rsi, RSI+8(%rsp)
> -	movq_cfi rdx, RDX+8
> -	movq_cfi rcx, RCX+8
> -	movq_cfi rax, RAX+8
> -	movq %r8, R8+8(%rsp)
> -	movq %r9, R9+8(%rsp)
> -	movq %r10, R10+8(%rsp)
> -	movq %r11, R11+8(%rsp)
> -	movq_cfi rbx, RBX+8
> -	movq %rbp, RBP+8(%rsp)
> -	movq %r12, R12+8(%rsp)
> -	movq %r13, R13+8(%rsp)
> -	movq %r14, R14+8(%rsp)
> -	movq %r15, R15+8(%rsp)
> -	movl $1,%ebx
> +	movq %rdi, RDI+OFF(%rsp)
> +	movq %rsi, RSI+OFF(%rsp)
> +	movq_cfi rdx, RDX+OFF
> +	movq_cfi rcx, RCX+OFF
> +	movq_cfi rax, RAX+OFF
> +	movq %r8, R8+OFF(%rsp)
> +	movq %r9, R9+OFF(%rsp)
> +	movq %r10, R10+OFF(%rsp)
> +	movq %r11, R11+OFF(%rsp)
> +	movq_cfi rbx, RBX+OFF
> +	movq %rbp, RBP+OFF(%rsp)
> +	movq %r12, R12+OFF(%rsp)
> +	movq %r13, R13+OFF(%rsp)
> +	movq %r14, R14+OFF(%rsp)
> +	movq %r15, R15+OFF(%rsp)
> +	movq $-1,ORIG_RAX+OFF(%rsp)	/* no syscall to restart */
> +33:
> +	ASM_NOP5	/* May be replaced with jump to paranoid_save_gs */
> +34:
> +	movl $DO_NOTHING,%ebx
>  	movl $MSR_GS_BASE,%ecx
>  	rdmsr
>  	testl %edx,%edx
>  	js 1f	/* negative -> in kernel */
>  	SWAPGS
> -	xorl %ebx,%ebx
> +	movl $DO_SWAPGS,%ebx
>  1:	ret
> +
> +	/* Patch in jump to paranoid_save_gs for X86_FEATURE_FSGSBASE */
> +	.section .altinstr_replacement,"ax"
> +35:	.byte 0xe9 /* 32bit near jump */
> +	.long paranoid_save_gs-34b
> +	.previous
> +	.section .altinstructions,"a"
> +	altinstruction_entry 33b,35b,X86_FEATURE_FSGSBASE,5,5
> +	.previous
> +
> +	/* Faster version not using RDMSR, and also not assuming
> +	 * anything about the previous GS value.
> + * This allows the user to arbitrarily change GS using
> +	 * WRGSBASE.
> +	 */
> +paranoid_save_gs:
> +	RDGSBASE_R15			# read previous gs
> +	movq STKMSK(%rsp),%rax		# get ist stack mask
> +	andq %rsp,%rax			# get bottom of stack
> +	movq (%rax),%rdi		# get expected GS
> +	WRGSBASE_RDI			# set gs for kernel
> +	mov $DO_RESTORE_R15,%ebx	# flag for exit code
> +	testl $3,CS+OFF(%rsp)		# did we come from ring 0?
> +	je paranoid_save_gs_kernel
> +	/*
> +	 * Saving GS in the task struct allows a debugger to manipulate
> +	 * it. We only do this when coming from ring 3 to avoid recursion
> +	 * clobbering the state.
> +	 */
> +	movq PER_CPU_VAR(current_task),%rcx # get current
> +	movb $1,task_thread_gs_saved(%rcx)  # tell context switch gs is
> +					    # already saved
> +	movq %r15,task_thread_gs(%rcx)      # save gs in task struct

As mentioned above, it seems like this would be simplified if this code
path could never run.

> +paranoid_save_gs_kernel:
> +	ret
> +
>  	CFI_ENDPROC
>  END(save_paranoid)
>  
> @@ -1040,7 +1096,8 @@ apicinterrupt IRQ_WORK_VECTOR \
>   */
>  #define INIT_TSS_IST(x) PER_CPU_VAR(init_tss) + (TSS_ist + ((x) - 1) * 8)
>  
> -.macro idtentry sym do_sym has_error_code:req paranoid=0 shift_ist=-1
> +.macro idtentry sym do_sym has_error_code:req paranoid=0 shift_ist=-1 \
> +		stack_mask=-EXCEPTION_STKSZ
>  ENTRY(\sym)
>  	/* Sanity check */
>  	.if \shift_ist != -1 && \paranoid == 0
> @@ -1064,7 +1121,10 @@ ENTRY(\sym)
>  	CFI_ADJUST_CFA_OFFSET ORIG_RAX-R15
>  
>  	.if \paranoid
> +	pushq_cfi $\stack_mask		/* ist stack size */
>  	call save_paranoid
> +	/* r15: previous gs */
> +	popq_cfi %rax			/* Drop ist stack size */

IMO this is a bit ugly, but I don't see a better solution off the top of
my head.

>  	.else
>  	call error_entry
>  	.endif
> @@ -1099,7 +1159,7 @@ ENTRY(\sym)
>  	.endif
>  
>  	.if \paranoid
> -	jmp paranoid_exit		/* %ebx: no swapgs flag */
> +	jmp paranoid_exit		/* %ebx: no swapgs flag, r15: old gs */
>  	.else
>  	jmp error_exit			/* %ebx: no swapgs flag */
>  	.endif
> @@ -1287,8 +1347,10 @@ apicinterrupt3 HYPERVISOR_CALLBACK_VECTOR \
>  	hyperv_callback_vector hyperv_vector_handler
>  #endif /* CONFIG_HYPERV */
>  
> -idtentry debug do_debug has_error_code=0 paranoid=1 shift_ist=DEBUG_STACK
> -idtentry int3 do_int3 has_error_code=0 paranoid=1 shift_ist=DEBUG_STACK
> +idtentry debug do_debug has_error_code=0 paranoid=1 shift_ist=DEBUG_STACK \
> +	 stack_mask=-DEBUG_STKSZ
> +idtentry int3 do_int3 has_error_code=0 paranoid=1 shift_ist=DEBUG_STACK \
> +	 stack_mask=-DEBUG_STKSZ
>  idtentry stack_segment do_stack_segment has_error_code=1 paranoid=1
>  #ifdef CONFIG_XEN
>  idtentry xen_debug do_debug has_error_code=0
> @@ -1317,49 +1379,69 @@ idtentry machine_check has_error_code=0 paranoid=1 do_sym=*machine_check_vector(
>  	 * hard flags at once, atomically)
>  	 */
>  
> -	/* ebx:	no swapgs flag */
> +	/* ebx:	no swapgs flag, r15: gs if ebx == DO_RESTORE_R15 */
>  ENTRY(paranoid_exit)
>  	DEFAULT_FRAME
>  	DISABLE_INTERRUPTS(CLBR_NONE)
>  	TRACE_IRQS_OFF_DEBUG
> -	testl %ebx,%ebx				/* swapgs needed? */
> -	jnz paranoid_restore
> +	cmpl  $DO_NOTHING,%ebx		/* swapgs needed? */
> +	je  paranoid_restore
>  	testl $3,CS(%rsp)
>  	jnz   paranoid_userspace
>  paranoid_swapgs:
>  	TRACE_IRQS_IRETQ 0
> +	cmpl  $DO_RESTORE_R15,%ebx	/* restore gs? */
> +	je  paranoid_restore_gs

If we switched off the IST for an entry from userspace, we wouldn't need
the conditionals at all for the rdfsgsbase case, right?


>  	SWAPGS_UNSAFE_STACK
>  	RESTORE_ALL 8
>  	jmp irq_return
> +paranoid_restore_gs:
> +	WRGSBASE_R15
> +	RESTORE_ALL 8
> +	jmp irq_return
>  paranoid_restore:
>  	TRACE_IRQS_IRETQ_DEBUG 0
>  	RESTORE_ALL 8
>  	jmp irq_return
>  paranoid_userspace:
>  	GET_THREAD_INFO(%rcx)
> -	movl TI_flags(%rcx),%ebx
> -	andl $_TIF_WORK_MASK,%ebx
> -	jz paranoid_swapgs
> +	movl TI_flags(%rcx),%ebp
> +	andl $_TIF_WORK_MASK,%ebp
> +	jz paranoid_clear_gs_flag
>  	movq %rsp,%rdi			/* &pt_regs */
>  	call sync_regs
>  	movq %rax,%rsp			/* switch stack for scheduling */
> -	testl $_TIF_NEED_RESCHED,%ebx
> +	testl $_TIF_NEED_RESCHED,%ebp
>  	jnz paranoid_schedule
> -	movl %ebx,%edx			/* arg3: thread flags */
> +	movl %ebp,%edx			/* arg3: thread flags */
>  	TRACE_IRQS_ON
>  	ENABLE_INTERRUPTS(CLBR_NONE)
>  	xorl %esi,%esi 			/* arg2: oldset */
>  	movq %rsp,%rdi 			/* arg1: &pt_regs */
>  	call do_notify_resume
> -	DISABLE_INTERRUPTS(CLBR_NONE)
> -	TRACE_IRQS_OFF
> -	jmp paranoid_userspace
> +	jmp paranoid_after_schedule
> +paranoid_clear_gs_flag:
> +	movq PER_CPU_VAR(current_task),%rcx
> +	movb $0,task_thread_gs_saved(%rcx)
> +	jmp paranoid_swapgs
>  paranoid_schedule:
>  	TRACE_IRQS_ON
>  	ENABLE_INTERRUPTS(CLBR_ANY)
>  	SCHEDULE_USER
> +paranoid_after_schedule:
>  	DISABLE_INTERRUPTS(CLBR_ANY)
>  	TRACE_IRQS_OFF
> +	cmpl $DO_RESTORE_R15,%ebx
> +	jne  paranoid_userspace
> +	movq PER_CPU_VAR(current_task),%rcx
> +	cmpb $0,task_thread_gs_saved(%rcx) /* did we schedule? */
> +	jne  paranoid_userspace		   /* did not schedule: keep our gs */
> +	/*
> +	 * Otherwise the context switch has put the correct gs into
> +	 * kernel_gs, so we can just end with swapgs and it
> +	 * will DTRT.
> +	 */
> +	mov  $DO_SWAPGS,%ebx		   /* force swapgs after schedule */

OMG more special cases :(  I'm liking my stack-switching proposal even
more.  Grr, entry code is a mess.

> +	if (static_cpu_has(X86_FEATURE_FSGSBASE)) {
> +		prev->fs = rdfsbase();
> +		/* Interrupts are disabled here. */
> +		if (!prev->gs_saved) {
> +			swapgs();
> +			prev->gs = rdgsbase();
> +			swapgs();

Is this adequately protected against kprobes?

--Andy

^ permalink raw reply	[flat|nested] 51+ messages in thread

* [PATCH 4/8] x86: Add support for rd/wr fs/gs base
  2014-11-10 23:55 [PATCH 1/8] percpu: Add a DEFINE_PER_CPU_2PAGE_ALIGNED Andi Kleen
@ 2014-11-10 23:55 ` Andi Kleen
  2014-11-11 20:05   ` Andy Lutomirski
  0 siblings, 1 reply; 51+ messages in thread
From: Andi Kleen @ 2014-11-10 23:55 UTC (permalink / raw)
  To: x86; +Cc: linux-kernel, Andi Kleen

From: Andi Kleen <ak@linux.intel.com>

Introduction:

IvyBridge added four new instructions to directly write the fs and gs
64bit base registers. Previously this had to be done with a system
call to write to MSRs. The main use case is fast user space threading
and switching the fs/gs registers quickly there. Another use
case is having (relatively) cheap access to a new address
register per thread.

The instructions are opt-in and have to be explicitly enabled
by the OS.

For more details on how to use the instructions see
Documentation/x86/fsgs.txt, added in a follow-on patch.
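
For illustration, once the kernel sets CR4.FSGSBASE user space can
use the base registers roughly like this (a minimal sketch, assuming
an assembler that knows the new mnemonics; it is not part of this
patch):

	#include <stdint.h>

	static inline uint64_t rdgsbase(void)
	{
		uint64_t base;
		asm volatile("rdgsbase %0" : "=r" (base));
		return base;
	}

	static inline void wrgsbase(uint64_t base)
	{
		asm volatile("wrgsbase %0" : : "r" (base));
	}

	/* user-level thread switch: swap the gs base pointer directly,
	   with no syscall and no MSR write */
	static void switch_user_thread(uint64_t next_base)
	{
		wrgsbase(next_base);
	}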

Paranoid exception path changes:
===============================

The paranoid entry/exit code is used for any NMI-like
exception.

Previously Linux couldn't support the new instructions
because the paranoid entry code relied on the gs base never being
negative outside the kernel to decide when to use swapgs. It would
check the gs MSR value and assume it was already running in the
kernel if negative.

To get rid of this assumption we have to revamp the paranoid exception
path to not rely on this. We can use the new instructions
to get (relatively) quick access to the GS value, and use
it directly.

This is also significantly faster than an MSR read, so it will
speed up NMIs (critical for profiling).

The kernel gs for the paranoid path is now stored at the
bottom of the IST stack (so that it can be derived from RSP).
For this we need to know the size of the IST stack
(4K or 8K), which is now passed in as a stack parameter
to save_paranoid.
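
(In C terms the lookup is just a mask, assuming the stack is
naturally aligned to its size as an earlier patch arranges; a
sketch only, the real version is the and/mov sequence in the asm
below:

	static void *ist_kernel_gs(unsigned long rsp, unsigned long stack_size)
	{
		/* stack_size is EXCEPTION_STKSZ or DEBUG_STKSZ */
		unsigned long bottom = rsp & ~(stack_size - 1);
		return *(void **)bottom;	/* slot filled in cpu_init() */
	}
)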

The original patch compared the gs with the kernel gs and
assumed that if it was identical, swapgs was not needed
(and no user space processing was needed). This
was nice and simple and didn't need a lot of changes.

But this had the side effect that if a user process set its
GS to the same as the kernel's, it may lose rescheduling
checks (so a racing reschedule IPI would only have been
acted upon at the next non-paranoid interrupt).

This version now switches to full save/restore of the GS.
This requires quite a few changes in the paranoid path.
Unfortunately I didn't come up with a simpler scheme.

Previously we had a flag in EBX that indicated whether
SWAPGS needs to be called later or not. In the new scheme
this turns into a tristate, with a new "restore from R15"
mode that is used when the FSGSBASE instructions are available.
In this case the GS base is saved and restored.
The exit paths are all adjusted to handle this correctly.

There is one complication: to allow debuggers (especially
from the int3 or debug vectors) access to the user GS
we need to save it in the task struct. Normally
the next context switch would overwrite it with the wrong
value from kernel_gs, so we also set a new flag in task_struct
that prevents this. The flag is cleared on context switch.

[Even with the additional flag the new FS/GS context switch
is vastly faster than the old MSR based version for bases >4GB]

To prevent recursive interrupts clobbering this
state in the task_struct this is only done for interrupts
coming directly from ring 3.

After a reschedule comes back we check if the flag is still
set. If it wasn't set the GS is back in the (swapped) kernel
gs so we revert to the SWAPGS mode, instead of restoring GS.

So the three possible states for the paranoid exit path are:

- Do nothing (pre FSGSBASE), if we interrupted the kernel
as determined by the existing negative GS
- Do swapgs, if we interrupted user space with positive GS
(pre FSGSBASE), or we saved gs, but it was overwritten
later by a context switch (with FSGSBASE)
- Restore from R15 (with FSGSBASE), if the gs base was saved
in R15, and not overwritten by a context switch.
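
In C-like pseudocode the exit decision becomes (illustration only,
the real code below is in asm):

	switch (ebx) {
	case DO_NOTHING:	/* interrupted the kernel, gs untouched */
		break;
	case DO_SWAPGS:		/* user gs is still in the swapped MSR */
		swapgs();
		break;
	case DO_RESTORE_R15:	/* gs base was saved in r15 on entry */
		wrgsbase(saved_r15);
		break;
	}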

Context switch changes:
======================

Then after these changes we need to also use the new instructions
to save/restore fs and gs, so that the new values set by the
users won't disappear.  This is also significantly
faster for the case when the 64bit base has to be switched
(that is when GS is larger than 4GB), as we can replace
the slow MSR write with a faster wr[fg]sbase execution.

The instructions do not context switch
the segment index, so the old invariant that fs or gs index
have to be 0 for a different 64bit value to stick is still
true. Previously it was enforced by arch_prctl; now the user
program has to make sure it keeps the segment indexes zero.
If it doesn't, the changes may not stick.

[Yes, this implies that programs can find out when they
got context switched. However they could already do that
before by checking the timing.]

This in turn enables fast switching when there are
enough threads that their TLS segments do not fit below 4GB,
or alternatively programs that use fs as an additional base
register will not get a significant context switch penalty.

It is all done in a single patch because there was no
simple way to do it in pieces without leaving crash
holes in between.

v2: Change to save/restore GS instead of using swapgs
based on the value. Large scale changes.
v3: Fix wrong flag initialization in fallback path.
Thanks 0day!
Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 arch/x86/include/asm/processor.h |   4 +
 arch/x86/kernel/asm-offsets_64.c |   4 +
 arch/x86/kernel/cpu/common.c     |   6 ++
 arch/x86/kernel/entry_64.S       | 163 +++++++++++++++++++++++++++++++--------
 arch/x86/kernel/process_64.c     |  31 +++++++-
 5 files changed, 170 insertions(+), 38 deletions(-)

diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index eb71ec7..987a631 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -522,6 +522,10 @@ struct thread_struct {
 	 * a short time
 	 */
 	unsigned char fpu_counter;
+	/*
+	 * Set to one when gs is already saved.
+	 */
+	unsigned char  gs_saved;
 };
 
 /*
diff --git a/arch/x86/kernel/asm-offsets_64.c b/arch/x86/kernel/asm-offsets_64.c
index e7c798b..32ab510 100644
--- a/arch/x86/kernel/asm-offsets_64.c
+++ b/arch/x86/kernel/asm-offsets_64.c
@@ -85,6 +85,10 @@ int main(void)
 
 	DEFINE(__NR_ia32_syscall_max, sizeof(syscalls_ia32) - 1);
 	DEFINE(IA32_NR_syscalls, sizeof(syscalls_ia32));
+	BLANK();
+
+	OFFSET(task_thread_gs, task_struct, thread.gs);
+	OFFSET(task_thread_gs_saved, task_struct, thread.gs_saved);
 
 	return 0;
 }
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 59def4c..db74af5 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -953,6 +953,9 @@ static void identify_cpu(struct cpuinfo_x86 *c)
 #ifdef CONFIG_NUMA
 	numa_add_cpu(smp_processor_id());
 #endif
+
+	if (cpu_has(c, X86_FEATURE_FSGSBASE))
+		set_in_cr4(X86_CR4_FSGSBASE);
 }
 
 #ifdef CONFIG_X86_64
@@ -1351,10 +1354,13 @@ void cpu_init(void)
 	 */
 	if (!oist->ist[0]) {
 		char *estacks = per_cpu(exception_stacks, cpu);
+		void *gs = per_cpu(irq_stack_union.gs_base, cpu);
 
 		for (v = 0; v < N_EXCEPTION_STACKS; v++) {
 			if (v == DEBUG_STACK - 1)
 				estacks = per_cpu(debug_stack, cpu);
+			/* Store GS at bottom of stack for bootstrap access */
+			*(void **)estacks = gs;
 			estacks += exception_stack_sizes[v];
 			oist->ist[v] = t->x86_tss.ist[v] =
 					(unsigned long)estacks;
diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index df088bb..9f7fde5 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -58,6 +58,8 @@
 #include <asm/context_tracking.h>
 #include <asm/smap.h>
 #include <asm/pgtable_types.h>
+#include <asm/alternative-asm.h>
+#include <asm/fsgs.h>
 #include <linux/err.h>
 
 /* Avoid __ASSEMBLER__'ifying <linux/audit.h> just for this.  */
@@ -283,32 +285,86 @@ ENDPROC(native_usergs_sysret64)
 	TRACE_IRQS_OFF
 	.endm
 
+/* Values of the ebx flag: */
+#define DO_RESTORE_R15	 2 /* Restore GS at end */
+#define DO_SWAPGS	 1 /* Use SWAPGS at end */
+#define DO_NOTHING	 0 /* Back to ring 0 with same gs */
+
+/*
+ * Stack layout:
+ * +16 pt_regs
+ * +8  stack mask for ist or 0
+ * +0  return address.
+ */
+#define OFF 16
+#define STKMSK 8
+
 ENTRY(save_paranoid)
-	XCPT_FRAME 1 RDI+8
+	XCPT_FRAME 1 RDI+OFF
 	cld
-	movq %rdi, RDI+8(%rsp)
-	movq %rsi, RSI+8(%rsp)
-	movq_cfi rdx, RDX+8
-	movq_cfi rcx, RCX+8
-	movq_cfi rax, RAX+8
-	movq %r8, R8+8(%rsp)
-	movq %r9, R9+8(%rsp)
-	movq %r10, R10+8(%rsp)
-	movq %r11, R11+8(%rsp)
-	movq_cfi rbx, RBX+8
-	movq %rbp, RBP+8(%rsp)
-	movq %r12, R12+8(%rsp)
-	movq %r13, R13+8(%rsp)
-	movq %r14, R14+8(%rsp)
-	movq %r15, R15+8(%rsp)
-	movl $1,%ebx
+	movq %rdi, RDI+OFF(%rsp)
+	movq %rsi, RSI+OFF(%rsp)
+	movq_cfi rdx, RDX+OFF
+	movq_cfi rcx, RCX+OFF
+	movq_cfi rax, RAX+OFF
+	movq %r8, R8+OFF(%rsp)
+	movq %r9, R9+OFF(%rsp)
+	movq %r10, R10+OFF(%rsp)
+	movq %r11, R11+OFF(%rsp)
+	movq_cfi rbx, RBX+OFF
+	movq %rbp, RBP+OFF(%rsp)
+	movq %r12, R12+OFF(%rsp)
+	movq %r13, R13+OFF(%rsp)
+	movq %r14, R14+OFF(%rsp)
+	movq %r15, R15+OFF(%rsp)
+	movq $-1,ORIG_RAX+OFF(%rsp)	/* no syscall to restart */
+33:
+	ASM_NOP5	/* May be replaced with jump to paranoid_save_gs */
+34:
+	movl $DO_NOTHING,%ebx
 	movl $MSR_GS_BASE,%ecx
 	rdmsr
 	testl %edx,%edx
 	js 1f	/* negative -> in kernel */
 	SWAPGS
-	xorl %ebx,%ebx
+	movl $DO_SWAPGS,%ebx
 1:	ret
+
+	/* Patch in jump to paranoid_save_gs for X86_FEATURE_FSGSBASE */
+	.section .altinstr_replacement,"ax"
+35:	.byte 0xe9 /* 32bit near jump */
+	.long paranoid_save_gs-34b
+	.previous
+	.section .altinstructions,"a"
+	altinstruction_entry 33b,35b,X86_FEATURE_FSGSBASE,5,5
+	.previous
+
+	/* Faster version not using RDMSR, and also not assuming
+	 * anything about the previous GS value.
+ * This allows the user to arbitrarily change GS using
+	 * WRGSBASE.
+	 */
+paranoid_save_gs:
+	RDGSBASE_R15			# read previous gs
+	movq STKMSK(%rsp),%rax		# get ist stack mask
+	andq %rsp,%rax			# get bottom of stack
+	movq (%rax),%rdi		# get expected GS
+	WRGSBASE_RDI			# set gs for kernel
+	mov $DO_RESTORE_R15,%ebx	# flag for exit code
+	testl $3,CS+OFF(%rsp)		# did we come from ring 0?
+	je paranoid_save_gs_kernel
+	/*
+	 * Saving GS in the task struct allows a debugger to manipulate
+	 * it. We only do this when coming from ring 3 to avoid recursion
+	 * clobbering the state.
+	 */
+	movq PER_CPU_VAR(current_task),%rcx # get current
+	movb $1,task_thread_gs_saved(%rcx)  # tell context switch gs is
+					    # already saved
+	movq %r15,task_thread_gs(%rcx)      # save gs in task struct
+paranoid_save_gs_kernel:
+	ret
+
 	CFI_ENDPROC
 END(save_paranoid)
 
@@ -1040,7 +1096,8 @@ apicinterrupt IRQ_WORK_VECTOR \
  */
 #define INIT_TSS_IST(x) PER_CPU_VAR(init_tss) + (TSS_ist + ((x) - 1) * 8)
 
-.macro idtentry sym do_sym has_error_code:req paranoid=0 shift_ist=-1
+.macro idtentry sym do_sym has_error_code:req paranoid=0 shift_ist=-1 \
+		stack_mask=-EXCEPTION_STKSZ
 ENTRY(\sym)
 	/* Sanity check */
 	.if \shift_ist != -1 && \paranoid == 0
@@ -1064,7 +1121,10 @@ ENTRY(\sym)
 	CFI_ADJUST_CFA_OFFSET ORIG_RAX-R15
 
 	.if \paranoid
+	pushq_cfi $\stack_mask		/* ist stack size */
 	call save_paranoid
+	/* r15: previous gs */
+	popq_cfi %rax			/* Drop ist stack size */
 	.else
 	call error_entry
 	.endif
@@ -1099,7 +1159,7 @@ ENTRY(\sym)
 	.endif
 
 	.if \paranoid
-	jmp paranoid_exit		/* %ebx: no swapgs flag */
+	jmp paranoid_exit		/* %ebx: no swapgs flag, r15: old gs */
 	.else
 	jmp error_exit			/* %ebx: no swapgs flag */
 	.endif
@@ -1287,8 +1347,10 @@ apicinterrupt3 HYPERVISOR_CALLBACK_VECTOR \
 	hyperv_callback_vector hyperv_vector_handler
 #endif /* CONFIG_HYPERV */
 
-idtentry debug do_debug has_error_code=0 paranoid=1 shift_ist=DEBUG_STACK
-idtentry int3 do_int3 has_error_code=0 paranoid=1 shift_ist=DEBUG_STACK
+idtentry debug do_debug has_error_code=0 paranoid=1 shift_ist=DEBUG_STACK \
+	 stack_mask=-DEBUG_STKSZ
+idtentry int3 do_int3 has_error_code=0 paranoid=1 shift_ist=DEBUG_STACK \
+	 stack_mask=-DEBUG_STKSZ
 idtentry stack_segment do_stack_segment has_error_code=1 paranoid=1
 #ifdef CONFIG_XEN
 idtentry xen_debug do_debug has_error_code=0
@@ -1317,49 +1379,69 @@ idtentry machine_check has_error_code=0 paranoid=1 do_sym=*machine_check_vector(
 	 * hard flags at once, atomically)
 	 */
 
-	/* ebx:	no swapgs flag */
+	/* ebx:	no swapgs flag, r15: gs if ebx == DO_RESTORE_R15 */
 ENTRY(paranoid_exit)
 	DEFAULT_FRAME
 	DISABLE_INTERRUPTS(CLBR_NONE)
 	TRACE_IRQS_OFF_DEBUG
-	testl %ebx,%ebx				/* swapgs needed? */
-	jnz paranoid_restore
+	cmpl  $DO_NOTHING,%ebx		/* swapgs needed? */
+	je  paranoid_restore
 	testl $3,CS(%rsp)
 	jnz   paranoid_userspace
 paranoid_swapgs:
 	TRACE_IRQS_IRETQ 0
+	cmpl  $DO_RESTORE_R15,%ebx	/* restore gs? */
+	je  paranoid_restore_gs
 	SWAPGS_UNSAFE_STACK
 	RESTORE_ALL 8
 	jmp irq_return
+paranoid_restore_gs:
+	WRGSBASE_R15
+	RESTORE_ALL 8
+	jmp irq_return
 paranoid_restore:
 	TRACE_IRQS_IRETQ_DEBUG 0
 	RESTORE_ALL 8
 	jmp irq_return
 paranoid_userspace:
 	GET_THREAD_INFO(%rcx)
-	movl TI_flags(%rcx),%ebx
-	andl $_TIF_WORK_MASK,%ebx
-	jz paranoid_swapgs
+	movl TI_flags(%rcx),%ebp
+	andl $_TIF_WORK_MASK,%ebp
+	jz paranoid_clear_gs_flag
 	movq %rsp,%rdi			/* &pt_regs */
 	call sync_regs
 	movq %rax,%rsp			/* switch stack for scheduling */
-	testl $_TIF_NEED_RESCHED,%ebx
+	testl $_TIF_NEED_RESCHED,%ebp
 	jnz paranoid_schedule
-	movl %ebx,%edx			/* arg3: thread flags */
+	movl %ebp,%edx			/* arg3: thread flags */
 	TRACE_IRQS_ON
 	ENABLE_INTERRUPTS(CLBR_NONE)
 	xorl %esi,%esi 			/* arg2: oldset */
 	movq %rsp,%rdi 			/* arg1: &pt_regs */
 	call do_notify_resume
-	DISABLE_INTERRUPTS(CLBR_NONE)
-	TRACE_IRQS_OFF
-	jmp paranoid_userspace
+	jmp paranoid_after_schedule
+paranoid_clear_gs_flag:
+	movq PER_CPU_VAR(current_task),%rcx
+	movb $0,task_thread_gs_saved(%rcx)
+	jmp paranoid_swapgs
 paranoid_schedule:
 	TRACE_IRQS_ON
 	ENABLE_INTERRUPTS(CLBR_ANY)
 	SCHEDULE_USER
+paranoid_after_schedule:
 	DISABLE_INTERRUPTS(CLBR_ANY)
 	TRACE_IRQS_OFF
+	cmpl $DO_RESTORE_R15,%ebx
+	jne  paranoid_userspace
+	movq PER_CPU_VAR(current_task),%rcx
+	cmpb $0,task_thread_gs_saved(%rcx) /* did we schedule? */
+	jne  paranoid_userspace		   /* did not schedule: keep our gs */
+	/*
+	 * Otherwise the context switch has put the correct gs into
+	 * kernel_gs, so we can just end with swapgs and it
+	 * will DTRT.
+	 */
+	mov  $DO_SWAPGS,%ebx		   /* force swapgs after schedule */
 	jmp paranoid_userspace
 	CFI_ENDPROC
 END(paranoid_exit)
@@ -1650,6 +1732,7 @@ end_repeat_nmi:
 	pushq_cfi $-1		/* ORIG_RAX: no syscall to restart */
 	subq $ORIG_RAX-R15, %rsp
 	CFI_ADJUST_CFA_OFFSET ORIG_RAX-R15
+	pushq_cfi $-EXCEPTION_STKSZ	/* ist stack size */
 	/*
 	 * Use save_paranoid to handle SWAPGS, but no need to use paranoid_exit
 	 * as we should not be calling schedule in NMI context.
@@ -1658,6 +1741,8 @@ end_repeat_nmi:
 	 * exceptions might do.
 	 */
 	call save_paranoid
+	/* r15: previous gs */
+	popq_cfi %rax		/* pop ist size */
 	DEFAULT_FRAME 0
 
 	/*
@@ -1682,11 +1767,21 @@ end_repeat_nmi:
 	je 1f
 	movq %r12, %cr2
 1:
-	
+	cmpl $DO_RESTORE_R15,%ebx
+	je nmi_gs_restore
 	testl %ebx,%ebx				/* swapgs needed? */
 	jnz nmi_restore
 nmi_swapgs:
 	SWAPGS_UNSAFE_STACK
+	jmp nmi_restore
+nmi_gs_restore:
+	/* If we came from ring 3 clear the saved gs flag. */
+	testl $3,CS(%rsp)
+	je   nmi_wrgsbase
+	movq PER_CPU_VAR(current_task),%rcx
+	movb $0,task_thread_gs_saved(%rcx)
+nmi_wrgsbase:
+	WRGSBASE_R15				/* restore gs */
 nmi_restore:
 	/* Pop the extra iret frame at once */
 	RESTORE_ALL 6*8
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index 3ed4a68..df554e2 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -49,6 +49,7 @@
 #include <asm/syscalls.h>
 #include <asm/debugreg.h>
 #include <asm/switch_to.h>
+#include <asm/fsgs.h>
 
 asmlinkage extern void ret_from_fork(void);
 
@@ -308,6 +309,16 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
 	 */
 	savesegment(fs, fsindex);
 	savesegment(gs, gsindex);
+	if (static_cpu_has(X86_FEATURE_FSGSBASE)) {
+		prev->fs = rdfsbase();
+		/* Interrupts are disabled here. */
+		if (!prev->gs_saved) {
+			swapgs();
+			prev->gs = rdgsbase();
+			swapgs();
+		}
+		prev->gs_saved = 0;
+	}
 
 	load_TLS(next, cpu);
 
@@ -338,8 +349,12 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
 			prev->fs = 0;
 	}
 	/* when next process has a 64bit base use it */
-	if (next->fs)
-		wrmsrl(MSR_FS_BASE, next->fs);
+	if (next->fs) {
+		if (static_cpu_has(X86_FEATURE_FSGSBASE))
+			wrfsbase(next->fs);
+		else
+			wrmsrl(MSR_FS_BASE, next->fs);
+	}
 	prev->fsindex = fsindex;
 
 	if (unlikely(gsindex | next->gsindex | prev->gs)) {
@@ -347,8 +362,16 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
 		if (gsindex)
 			prev->gs = 0;
 	}
-	if (next->gs)
-		wrmsrl(MSR_KERNEL_GS_BASE, next->gs);
+	if (next->gs) {
+		if (static_cpu_has(X86_FEATURE_FSGSBASE)) {
+			/* Interrupts are disabled here. */
+			swapgs();
+			wrgsbase(next->gs);
+			swapgs();
+		} else {
+			wrmsrl(MSR_KERNEL_GS_BASE, next->gs);
+		}
+	}
 	prev->gsindex = gsindex;
 
 	switch_fpu_finish(next_p, fpu);
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 51+ messages in thread

* [PATCH 4/8] x86: Add support for rd/wr fs/gs base
  2014-10-15  5:11 [PATCH 1/8] percpu: Add a DEFINE_PER_CPU_2PAGE_ALIGNED Andi Kleen
@ 2014-10-15  5:11 ` Andi Kleen
  0 siblings, 0 replies; 51+ messages in thread
From: Andi Kleen @ 2014-10-15  5:11 UTC (permalink / raw)
  To: linux-kernel; +Cc: x86, Andi Kleen

From: Andi Kleen <ak@linux.intel.com>

Introduction:

IvyBridge added four new instructions to directly write the fs and gs
64bit base registers. Previously this had to be done with a system
call to write to MSRs. The main use case is fast user space threading
and switching the fs/gs registers quickly there. Another use
case is having (relatively) cheap access to a new address
register per thread.

The instructions are opt-in and have to be explicitly enabled
by the OS.

For more details on how to use the instructions see
Documentation/x86/fsgs.txt, added in a follow-on patch.

Paranoid exception path changes:
===============================

The paranoid entry/exit code is used for any NMI-like
exception.

Previously Linux couldn't support the new instructions
because the paranoid entry code relied on the gs base never being
negative outside the kernel to decide when to use swapgs. It would
check the gs MSR value and assume it was already running in the
kernel if negative.

To get rid of this assumption we have to revamp the paranoid exception
path to not rely on this. We can use the new instructions
to get (relatively) quick access to the GS value, and use
it directly.

This is also significantly faster than an MSR read, so it will
speed up NMIs (critical for profiling).

The kernel gs for the paranoid path is now stored at the
bottom of the IST stack (so that it can be derived from RSP).
For this we need to know the size of the IST stack
(4K or 8K), which is now passed in as a stack parameter
to save_paranoid.

The original patch compared the gs with the kernel gs and
assumed that if it was identical, swapgs was not needed
(and no user space processing was needed). This
was nice and simple and didn't need a lot of changes.

But this had the side effect that if a user process set its
GS to the same as the kernel's, it may lose rescheduling
checks (so a racing reschedule IPI would only have been
acted upon at the next non-paranoid interrupt).

This version now switches to full save/restore of the GS.
This requires quite a few changes in the paranoid path.
Unfortunately I didn't come up with a simpler scheme.

Previously we had a flag in EBX that indicated whether
SWAPGS needs to be called later or not. In the new scheme
this turns into a tristate, with a new "restore from R15"
mode that is used when the FSGSBASE instructions are available.
In this case the GS base is saved and restored.
The exit paths are all adjusted to handle this correctly.

There is one complication: to allow debuggers (especially
from the int3 or debug vectors) access to the user GS
we need to save it in the task struct. Normally
the next context switch would overwrite it with the wrong
value from kernel_gs, so we also set a new flag in task_struct
that prevents this. The flag is cleared on context switch.

[Even with the additional flag the new FS/GS context switch
is vastly faster than the old MSR based version for bases >4GB]

To prevent recursive interrupts clobbering this
state in the task_struct this is only done for interrupts
coming directly from ring 3.

After a reschedule comes back we check if the flag is still
set. If it wasn't set the GS is back in the (swapped) kernel
gs so we revert to the SWAPGS mode, instead of restoring GS.

So the three possible states for the paranoid exit path are:

- Do nothing (pre FSGSBASE), if we interrupted the kernel
as determined by the existing negative GS
- Do swapgs, if we interrupted user space with positive GS
(pre FSGSBASE), or we saved gs, but it was overwritten
later by a context switch (with FSGSBASE)
- Restore from R15 (with FSGSBASE), if the gs base was saved
in R15, and not overwritten by a context switch.

Context switch changes:
======================

Then after these changes we need to also use the new instructions
to save/restore fs and gs, so that the new values set by the
users won't disappear.  This is also significantly
faster for the case when the 64bit base has to be switched
(that is when GS is larger than 4GB), as we can replace
the slow MSR write with a faster wr[fg]sbase execution.

The instructions do not context switch
the segment index, so the old invariant that fs or gs index
have to be 0 for a different 64bit value to stick is still
true. Previously it was enforced by arch_prctl; now the user
program has to make sure it keeps the segment indexes zero.
If it doesn't, the changes may not stick.

[Yes, this implies that programs can find out when they
got context switched. However they could already do that
before by checking the timing.]

This in turn enables fast switching when there are
enough threads that their TLS segments do not fit below 4GB,
or alternatively programs that use fs as an additional base
register will not get a significant context switch penalty.

It is all done in a single patch because there was no
simple way to do it in pieces without leaving crash
holes in between.

v2: Change to save/restore GS instead of using swapgs
based on the value. Large scale changes.
v3: Fix wrong flag initialization in fallback path.
Thanks 0day!
Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 arch/x86/include/asm/processor.h |   4 +
 arch/x86/kernel/asm-offsets_64.c |   4 +
 arch/x86/kernel/cpu/common.c     |   6 ++
 arch/x86/kernel/entry_64.S       | 163 +++++++++++++++++++++++++++++++--------
 arch/x86/kernel/process_64.c     |  31 +++++++-
 5 files changed, 170 insertions(+), 38 deletions(-)

diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index eb71ec7..987a631 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -522,6 +522,10 @@ struct thread_struct {
 	 * a short time
 	 */
 	unsigned char fpu_counter;
+	/*
+	 * Set to one when gs is already saved.
+	 */
+	unsigned char  gs_saved;
 };
 
 /*
diff --git a/arch/x86/kernel/asm-offsets_64.c b/arch/x86/kernel/asm-offsets_64.c
index e7c798b..32ab510 100644
--- a/arch/x86/kernel/asm-offsets_64.c
+++ b/arch/x86/kernel/asm-offsets_64.c
@@ -85,6 +85,10 @@ int main(void)
 
 	DEFINE(__NR_ia32_syscall_max, sizeof(syscalls_ia32) - 1);
 	DEFINE(IA32_NR_syscalls, sizeof(syscalls_ia32));
+	BLANK();
+
+	OFFSET(task_thread_gs, task_struct, thread.gs);
+	OFFSET(task_thread_gs_saved, task_struct, thread.gs_saved);
 
 	return 0;
 }
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index a9f1331..4d5368f 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -953,6 +953,9 @@ static void identify_cpu(struct cpuinfo_x86 *c)
 #ifdef CONFIG_NUMA
 	numa_add_cpu(smp_processor_id());
 #endif
+
+	if (cpu_has(c, X86_FEATURE_FSGSBASE))
+		set_in_cr4(X86_CR4_FSGSBASE);
 }
 
 #ifdef CONFIG_X86_64
@@ -1338,10 +1341,13 @@ void cpu_init(void)
 	 */
 	if (!oist->ist[0]) {
 		char *estacks = per_cpu(exception_stacks, cpu);
+		void *gs = per_cpu(irq_stack_union.gs_base, cpu);
 
 		for (v = 0; v < N_EXCEPTION_STACKS; v++) {
 			if (v == DEBUG_STACK - 1)
 				estacks = per_cpu(debug_stack, cpu);
+			/* Store GS at bottom of stack for bootstrap access */
+			*(void **)estacks = gs;
 			estacks += exception_stack_sizes[v];
 			oist->ist[v] = t->x86_tss.ist[v] =
 					(unsigned long)estacks;
diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index 2fac134..43b46a1 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -58,6 +58,8 @@
 #include <asm/context_tracking.h>
 #include <asm/smap.h>
 #include <asm/pgtable_types.h>
+#include <asm/alternative-asm.h>
+#include <asm/fsgs.h>
 #include <linux/err.h>
 
 /* Avoid __ASSEMBLER__'ifying <linux/audit.h> just for this.  */
@@ -283,32 +285,86 @@ ENDPROC(native_usergs_sysret64)
 	TRACE_IRQS_OFF
 	.endm
 
+/* Values of the ebx flag: */
+#define DO_RESTORE_R15	 2 /* Restore GS at end */
+#define DO_SWAPGS	 1 /* Use SWAPGS at end */
+#define DO_NOTHING	 0 /* Back to ring 0 with same gs */
+
+/*
+ * Stack layout:
+ * +16 pt_regs
+ * +8  stack mask for ist or 0
+ * +0  return address.
+ */
+#define OFF 16
+#define STKMSK 8
+
 ENTRY(save_paranoid)
-	XCPT_FRAME 1 RDI+8
+	XCPT_FRAME 1 RDI+OFF
 	cld
-	movq %rdi, RDI+8(%rsp)
-	movq %rsi, RSI+8(%rsp)
-	movq_cfi rdx, RDX+8
-	movq_cfi rcx, RCX+8
-	movq_cfi rax, RAX+8
-	movq %r8, R8+8(%rsp)
-	movq %r9, R9+8(%rsp)
-	movq %r10, R10+8(%rsp)
-	movq %r11, R11+8(%rsp)
-	movq_cfi rbx, RBX+8
-	movq %rbp, RBP+8(%rsp)
-	movq %r12, R12+8(%rsp)
-	movq %r13, R13+8(%rsp)
-	movq %r14, R14+8(%rsp)
-	movq %r15, R15+8(%rsp)
-	movl $1,%ebx
+	movq %rdi, RDI+OFF(%rsp)
+	movq %rsi, RSI+OFF(%rsp)
+	movq_cfi rdx, RDX+OFF
+	movq_cfi rcx, RCX+OFF
+	movq_cfi rax, RAX+OFF
+	movq %r8, R8+OFF(%rsp)
+	movq %r9, R9+OFF(%rsp)
+	movq %r10, R10+OFF(%rsp)
+	movq %r11, R11+OFF(%rsp)
+	movq_cfi rbx, RBX+OFF
+	movq %rbp, RBP+OFF(%rsp)
+	movq %r12, R12+OFF(%rsp)
+	movq %r13, R13+OFF(%rsp)
+	movq %r14, R14+OFF(%rsp)
+	movq %r15, R15+OFF(%rsp)
+	movq $-1,ORIG_RAX+OFF(%rsp)	/* no syscall to restart */
+33:
+	ASM_NOP5	/* May be replaced with jump to paranoid_save_gs */
+34:
+	movl $DO_NOTHING,%ebx
 	movl $MSR_GS_BASE,%ecx
 	rdmsr
 	testl %edx,%edx
 	js 1f	/* negative -> in kernel */
 	SWAPGS
-	xorl %ebx,%ebx
+	movl $DO_SWAPGS,%ebx
 1:	ret
+
+	/* Patch in jump to paranoid_save_gs for X86_FEATURE_FSGSBASE */
+	.section .altinstr_replacement,"ax"
+35:	.byte 0xe9 /* 32bit near jump */
+	.long paranoid_save_gs-34b
+	.previous
+	.section .altinstructions,"a"
+	altinstruction_entry 33b,35b,X86_FEATURE_FSGSBASE,5,5
+	.previous
+
+	/* Faster version not using RDMSR, and also not assuming
+	 * anything about the previous GS value.
+ * This allows the user to arbitrarily change GS using
+	 * WRGSBASE.
+	 */
+paranoid_save_gs:
+	RDGSBASE_R15			# read previous gs
+	movq STKMSK(%rsp),%rax		# get ist stack mask
+	andq %rsp,%rax			# get bottom of stack
+	movq (%rax),%rdi		# get expected GS
+	WRGSBASE_RDI			# set gs for kernel
+	mov $DO_RESTORE_R15,%ebx	# flag for exit code
+	testl $3,CS+OFF(%rsp)		# did we come from ring 0?
+	je paranoid_save_gs_kernel
+	/*
+	 * Saving GS in the task struct allows a debugger to manipulate
+	 * it. We only do this when coming from ring 3 to avoid recursion
+	 * clobbering the state.
+	 */
+	movq PER_CPU_VAR(current_task),%rcx # get current
+	movb $1,task_thread_gs_saved(%rcx)  # tell context switch gs is
+					    # already saved
+	movq %r15,task_thread_gs(%rcx)      # save gs in task struct
+paranoid_save_gs_kernel:
+	ret
+
 	CFI_ENDPROC
 END(save_paranoid)
 
@@ -1053,7 +1109,8 @@ apicinterrupt IRQ_WORK_VECTOR \
  */
 #define INIT_TSS_IST(x) PER_CPU_VAR(init_tss) + (TSS_ist + ((x) - 1) * 8)
 
-.macro idtentry sym do_sym has_error_code:req paranoid=0 shift_ist=-1
+.macro idtentry sym do_sym has_error_code:req paranoid=0 shift_ist=-1 \
+		stack_mask=-EXCEPTION_STKSZ
 ENTRY(\sym)
 	/* Sanity check */
 	.if \shift_ist != -1 && \paranoid == 0
@@ -1077,7 +1134,10 @@ ENTRY(\sym)
 	CFI_ADJUST_CFA_OFFSET ORIG_RAX-R15
 
 	.if \paranoid
+	pushq_cfi $\stack_mask		/* ist stack size */
 	call save_paranoid
+	/* r15: previous gs */
+	popq_cfi %rax			/* Drop ist stack size */
 	.else
 	call error_entry
 	.endif
@@ -1112,7 +1172,7 @@ ENTRY(\sym)
 	.endif
 
 	.if \paranoid
-	jmp paranoid_exit		/* %ebx: no swapgs flag */
+	jmp paranoid_exit		/* %ebx: no swapgs flag, r15: old gs */
 	.else
 	jmp error_exit			/* %ebx: no swapgs flag */
 	.endif
@@ -1300,8 +1360,10 @@ apicinterrupt3 HYPERVISOR_CALLBACK_VECTOR \
 	hyperv_callback_vector hyperv_vector_handler
 #endif /* CONFIG_HYPERV */
 
-idtentry debug do_debug has_error_code=0 paranoid=1 shift_ist=DEBUG_STACK
-idtentry int3 do_int3 has_error_code=0 paranoid=1 shift_ist=DEBUG_STACK
+idtentry debug do_debug has_error_code=0 paranoid=1 shift_ist=DEBUG_STACK \
+	 stack_mask=-DEBUG_STKSZ
+idtentry int3 do_int3 has_error_code=0 paranoid=1 shift_ist=DEBUG_STACK \
+	 stack_mask=-DEBUG_STKSZ
 idtentry stack_segment do_stack_segment has_error_code=1 paranoid=1
 #ifdef CONFIG_XEN
 idtentry xen_debug do_debug has_error_code=0
@@ -1330,49 +1392,69 @@ idtentry machine_check has_error_code=0 paranoid=1 do_sym=*machine_check_vector(
 	 * hard flags at once, atomically)
 	 */
 
-	/* ebx:	no swapgs flag */
+	/* ebx:	no swapgs flag, r15: gs if ebx == DO_RESTORE_R15 */
 ENTRY(paranoid_exit)
 	DEFAULT_FRAME
 	DISABLE_INTERRUPTS(CLBR_NONE)
 	TRACE_IRQS_OFF_DEBUG
-	testl %ebx,%ebx				/* swapgs needed? */
-	jnz paranoid_restore
+	cmpl  $DO_NOTHING,%ebx		/* swapgs needed? */
+	je  paranoid_restore
 	testl $3,CS(%rsp)
 	jnz   paranoid_userspace
 paranoid_swapgs:
 	TRACE_IRQS_IRETQ 0
+	cmpl  $DO_RESTORE_R15,%ebx	/* restore gs? */
+	je  paranoid_restore_gs
 	SWAPGS_UNSAFE_STACK
 	RESTORE_ALL 8
 	jmp irq_return
+paranoid_restore_gs:
+	WRGSBASE_R15
+	RESTORE_ALL 8
+	jmp irq_return
 paranoid_restore:
 	TRACE_IRQS_IRETQ_DEBUG 0
 	RESTORE_ALL 8
 	jmp irq_return
 paranoid_userspace:
 	GET_THREAD_INFO(%rcx)
-	movl TI_flags(%rcx),%ebx
-	andl $_TIF_WORK_MASK,%ebx
-	jz paranoid_swapgs
+	movl TI_flags(%rcx),%ebp
+	andl $_TIF_WORK_MASK,%ebp
+	jz paranoid_clear_gs_flag
 	movq %rsp,%rdi			/* &pt_regs */
 	call sync_regs
 	movq %rax,%rsp			/* switch stack for scheduling */
-	testl $_TIF_NEED_RESCHED,%ebx
+	testl $_TIF_NEED_RESCHED,%ebp
 	jnz paranoid_schedule
-	movl %ebx,%edx			/* arg3: thread flags */
+	movl %ebp,%edx			/* arg3: thread flags */
 	TRACE_IRQS_ON
 	ENABLE_INTERRUPTS(CLBR_NONE)
 	xorl %esi,%esi 			/* arg2: oldset */
 	movq %rsp,%rdi 			/* arg1: &pt_regs */
 	call do_notify_resume
-	DISABLE_INTERRUPTS(CLBR_NONE)
-	TRACE_IRQS_OFF
-	jmp paranoid_userspace
+	jmp paranoid_after_schedule
+paranoid_clear_gs_flag:
+	movq PER_CPU_VAR(current_task),%rcx
+	movb $0,task_thread_gs_saved(%rcx)
+	jmp paranoid_swapgs
 paranoid_schedule:
 	TRACE_IRQS_ON
 	ENABLE_INTERRUPTS(CLBR_ANY)
 	SCHEDULE_USER
+paranoid_after_schedule:
 	DISABLE_INTERRUPTS(CLBR_ANY)
 	TRACE_IRQS_OFF
+	cmpl $DO_RESTORE_R15,%ebx
+	jne  paranoid_userspace
+	movq PER_CPU_VAR(current_task),%rcx
+	cmpb $0,task_thread_gs_saved(%rcx) /* did we schedule? */
+	jne  paranoid_userspace		   /* did not schedule: keep our gs */
+	/*
+	 * Otherwise the context switch has put the correct gs into
+	 * kernel_gs, so we can just end with swapgs and it
+	 * will DTRT.
+	 */
+	mov  $DO_SWAPGS,%ebx		   /* force swapgs after schedule */
 	jmp paranoid_userspace
 	CFI_ENDPROC
 END(paranoid_exit)
@@ -1663,6 +1745,7 @@ end_repeat_nmi:
 	pushq_cfi $-1		/* ORIG_RAX: no syscall to restart */
 	subq $ORIG_RAX-R15, %rsp
 	CFI_ADJUST_CFA_OFFSET ORIG_RAX-R15
+	pushq_cfi $-EXCEPTION_STKSZ	/* ist stack size */
 	/*
 	 * Use save_paranoid to handle SWAPGS, but no need to use paranoid_exit
 	 * as we should not be calling schedule in NMI context.
@@ -1671,6 +1754,8 @@ end_repeat_nmi:
 	 * exceptions might do.
 	 */
 	call save_paranoid
+	/* r15: previous gs */
+	popq_cfi %rax		/* pop ist size */
 	DEFAULT_FRAME 0
 
 	/*
@@ -1695,11 +1780,21 @@ end_repeat_nmi:
 	je 1f
 	movq %r12, %cr2
 1:
-	
+	cmpl $DO_RESTORE_R15,%ebx
+	je nmi_gs_restore
 	testl %ebx,%ebx				/* swapgs needed? */
 	jnz nmi_restore
 nmi_swapgs:
 	SWAPGS_UNSAFE_STACK
+	jmp nmi_restore
+nmi_gs_restore:
+	/* If we came from ring 3 clear the saved gs flag. */
+	testl $3,CS(%rsp)
+	je   nmi_wrgsbase
+	movq PER_CPU_VAR(current_task),%rcx
+	movb $0,task_thread_gs_saved(%rcx)
+nmi_wrgsbase:
+	WRGSBASE_R15				/* restore gs */
 nmi_restore:
 	/* Pop the extra iret frame at once */
 	RESTORE_ALL 6*8
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index ca5b02d..9bad75a 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -49,6 +49,7 @@
 #include <asm/syscalls.h>
 #include <asm/debugreg.h>
 #include <asm/switch_to.h>
+#include <asm/fsgs.h>
 
 asmlinkage extern void ret_from_fork(void);
 
@@ -311,6 +312,16 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
 	 */
 	savesegment(fs, fsindex);
 	savesegment(gs, gsindex);
+	if (static_cpu_has(X86_FEATURE_FSGSBASE)) {
+		prev->fs = rdfsbase();
+		/* Interrupts are disabled here. */
+		if (!prev->gs_saved) {
+			swapgs();
+			prev->gs = rdgsbase();
+			swapgs();
+		}
+		prev->gs_saved = 0;
+	}
 
 	load_TLS(next, cpu);
 
@@ -341,8 +352,12 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
 			prev->fs = 0;
 	}
 	/* when next process has a 64bit base use it */
-	if (next->fs)
-		wrmsrl(MSR_FS_BASE, next->fs);
+	if (next->fs) {
+		if (static_cpu_has(X86_FEATURE_FSGSBASE))
+			wrfsbase(next->fs);
+		else
+			wrmsrl(MSR_FS_BASE, next->fs);
+	}
 	prev->fsindex = fsindex;
 
 	if (unlikely(gsindex | next->gsindex | prev->gs)) {
@@ -350,8 +365,16 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
 		if (gsindex)
 			prev->gs = 0;
 	}
-	if (next->gs)
-		wrmsrl(MSR_KERNEL_GS_BASE, next->gs);
+	if (next->gs) {
+		if (static_cpu_has(X86_FEATURE_FSGSBASE)) {
+			/* Interrupts are disabled here. */
+			swapgs();
+			wrgsbase(next->gs);
+			swapgs();
+		} else {
+			wrmsrl(MSR_KERNEL_GS_BASE, next->gs);
+		}
+	}
 	prev->gsindex = gsindex;
 
 	switch_fpu_finish(next_p, fpu);
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 51+ messages in thread

end of thread, other threads:[~2015-04-13  7:07 UTC | newest]

Thread overview: 51+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-04-10 15:50 Updated RD/WRFS/GSBASE patchkit Andi Kleen
2015-04-10 15:50 ` [PATCH 1/8] percpu: Add a DEFINE_PER_CPU_2PAGE_ALIGNED Andi Kleen
2015-04-10 15:50 ` [PATCH 2/8] x86: Naturally align the debug IST stack Andi Kleen
2015-04-10 18:50   ` Andy Lutomirski
2015-04-10 15:50 ` [PATCH 3/8] x86: Add intrinsics/macros for new rd/wr fs/gs base instructions Andi Kleen
2015-04-10 19:06   ` Andy Lutomirski
2015-04-10 19:07   ` Andy Lutomirski
2015-04-10 15:50 ` [PATCH 4/8] x86: Add support for rd/wr fs/gs base Andi Kleen
2015-04-10 19:12   ` Andy Lutomirski
2015-04-10 20:21     ` Andi Kleen
2015-04-10 20:25       ` Borislav Petkov
2015-04-10 20:52         ` Andi Kleen
2015-04-10 20:53           ` Borislav Petkov
2015-04-10 20:34       ` Andy Lutomirski
2015-04-10 20:41         ` Andi Kleen
2015-04-10 20:47           ` Andy Lutomirski
2015-04-10 20:57             ` Andi Kleen
2015-04-10 21:07               ` Andy Lutomirski
2015-04-10 22:52             ` Andi Kleen
2015-04-10 23:00               ` Andy Lutomirski
2015-04-10 23:05                 ` Andi Kleen
2015-04-10 23:15                   ` Andy Lutomirski
2015-04-10 23:18                     ` Andi Kleen
2015-04-10 23:21                       ` Andy Lutomirski
2015-04-10 23:16                   ` Andi Kleen
2015-04-13  7:07                   ` Jan Beulich
2015-04-10 15:50 ` [PATCH 5/8] x86: Make old K8 swapgs workaround conditional Andi Kleen
2015-04-10 21:46   ` Andy Lutomirski
2015-04-10 22:01   ` Borislav Petkov
2015-04-10 23:10     ` Andi Kleen
2015-04-11  7:18       ` Borislav Petkov
2015-04-10 15:50 ` [PATCH 6/8] x86: Enumerate kernel FSGS capability in AT_HWCAP2 Andi Kleen
2015-04-10 19:15   ` Andy Lutomirski
2015-04-10 22:14   ` Borislav Petkov
2015-04-10 23:07     ` Andi Kleen
2015-04-10 23:08     ` Andi Kleen
2015-04-11  7:16       ` Borislav Petkov
2015-04-11  8:35         ` Intel FSGSBASE support (was: Re: [PATCH 6/8] x86: Enumerate kernel FSGS capability in AT_HWCAP2) Ingo Molnar
2015-04-10 15:50 ` [PATCH 7/8] x86: Add documentation for rd/wr fs/gs base Andi Kleen
2015-04-10 19:17   ` Andy Lutomirski
2015-04-10 20:22     ` Andi Kleen
2015-04-10 20:38       ` Andy Lutomirski
2015-04-10 20:46         ` Andi Kleen
2015-04-10 20:52           ` Andy Lutomirski
2015-04-10 15:50 ` [PATCH 8/8] x86: Use rd/wr fs/gs base in arch_prctl Andi Kleen
2015-04-10 19:19   ` Andy Lutomirski
2015-04-10 19:58     ` Andi Kleen
  -- strict thread matches above, loose matches on Subject: below --
2014-11-10 23:55 [PATCH 1/8] percpu: Add a DEFINE_PER_CPU_2PAGE_ALIGNED Andi Kleen
2014-11-10 23:55 ` [PATCH 4/8] x86: Add support for rd/wr fs/gs base Andi Kleen
2014-11-11 20:05   ` Andy Lutomirski
2014-11-11 20:49     ` Andy Lutomirski
2014-10-15  5:11 [PATCH 1/8] percpu: Add a DEFINE_PER_CPU_2PAGE_ALIGNED Andi Kleen
2014-10-15  5:11 ` [PATCH 4/8] x86: Add support for rd/wr fs/gs base Andi Kleen
