linux-kernel.vger.kernel.org archive mirror
From: Andi Kleen <andi@firstfloor.org>
To: x86@kernel.org
Cc: luto@amacapital.net, linux-kernel@vger.kernel.org,
	Andi Kleen <ak@linux.intel.com>
Subject: [PATCH 2/9] x86: Add support for rd/wr fs/gs base
Date: Mon, 21 Mar 2016 09:16:02 -0700
Message-ID: <1458576969-13309-3-git-send-email-andi@firstfloor.org>
In-Reply-To: <1458576969-13309-1-git-send-email-andi@firstfloor.org>

From: Andi Kleen <ak@linux.intel.com>

Introduction:

Ivy Bridge added four new instructions to directly read and write the
fs and gs 64-bit base registers. Previously this had to be done with a
system call that accesses the MSRs. The main use case is fast user space
threading and switching the fs/gs registers quickly there. Another use
case is having (relatively) cheap access to a new address
register per thread.

The instructions are opt-in and have to be explicitly enabled
by the OS.

For more details on how to use the instructions, see
Documentation/x86/fsgs.txt, added in a follow-on patch.

Paranoid exception path changes:
===============================

The paranoid entry/exit code is used for any NMI-like
exception.

Previously Linux couldn't support the new instructions
because the paranoid entry code relied on the gs base never being
negative outside the kernel to decide when to use swapgs. It would
check the gs MSR value and assume it was already running in
kernel if negative.
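
As a rough illustration of why that assumption breaks, the old check can
be modelled in plain C. This is only a sketch, not the kernel code: the
function name and the example addresses are made up for this note, and
it relies on WRGSBASE accepting any canonical address, including one in
the kernel half.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Model of the old paranoid-entry test: rdmsr returns the high half of
 * MSR_GS_BASE in %edx, and a set sign bit was taken to mean "already
 * running with the kernel GS". Hypothetical helper, illustration only.
 */
int old_check_says_kernel(uint64_t gs_base)
{
	return (gs_base >> 63) != 0;
}

/* Example addresses are invented for this sketch. */
void old_check_example(void)
{
	/* A typical kernel per-cpu base is "negative". */
	assert(old_check_says_kernel(0xffff880000010000ULL));
	/* A typical user TLS base is not. */
	assert(!old_check_says_kernel(0x00007f0000001000ULL));
	/* A user task that loaded a kernel-half address via WRGSBASE
	 * is misclassified as kernel context. */
	assert(old_check_says_kernel(0xffff800000000000ULL));
}
```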

To get rid of this assumption we have to revamp the paranoid exception
path to not rely on this. We can use the new instructions
to get (relatively) quick access to the GS value, and use
it directly to save/restore the GSBASE instead of using
SWAPGS.

This is also significantly faster than an MSR read, so it will speed
up NMIs (useful for profiling).

The kernel gs for the paranoid path is now stored at the
bottom of the IST stack (so that it can be derived from RSP).
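
The lookup is just an alignment mask on the stack pointer. A small user
space sketch of the arithmetic, assuming a 4K EXCEPTION_STKSZ (the
constant and the addresses below are stand-ins chosen for this note):

```c
#include <assert.h>
#include <stdint.h>

/* Assumed stand-in for EXCEPTION_STKSZ; the real value comes from the
 * kernel headers. */
#define SKETCH_STKSZ 4096UL

/* Round a stack pointer down to the start of its 4K block, which is
 * where the patch stores the kernel GS base. */
uintptr_t ist_stack_bottom(uintptr_t rsp)
{
	return rsp & ~(SKETCH_STKSZ - 1);
}

void ist_stack_bottom_example(void)
{
	/* Anywhere inside a 4K stack maps to its first slot. */
	assert(ist_stack_bottom(0x10000ff8UL) == 0x10000000UL);
	/* In the upper half of an 8K stack the mask lands on the second
	 * 4K boundary, so a copy of the GS base is needed there too. */
	assert(ist_stack_bottom(0x10001ff0UL) == 0x10001000UL);
}
```

The second case is why cpu_init() in this patch also writes the value at
estacks + EXCEPTION_STKSZ for stacks larger than EXCEPTION_STKSZ.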

The original patch compared the gs with the kernel gs and
assumed that if it was identical, swapgs was not needed
(and no user space processing was needed). This
was nice and simple and didn't need a lot of changes.

But this had the side effect that if a user process set its
GS to the same value as the kernel's, it could lose rescheduling
checks (so a racing reschedule IPI would only have been
acted upon at the next non-paranoid interrupt).

This version now switches to full save/restore of the GS.

Where swapgs used to be needed and the new instructions are
available, we now restore the original GS value in the exit
path.
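
The save/restore scheme can be pictured with a tiny C model. This is a
sketch only: struct cpu_model and both helpers are invented for this
note and stand in for the RDGSBASE_R15 / WRGSBASE_RDI / WRGSBASE_R15
sequence in the assembly below.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of the one piece of CPU state we care about. */
struct cpu_model {
	uint64_t gs_base;
};

/* Entry: remember whatever GS base was live, then install the kernel's.
 * The return value plays the role of %r15. */
uint64_t model_paranoid_entry(struct cpu_model *cpu, uint64_t kernel_gs)
{
	uint64_t prev = cpu->gs_base;	/* RDGSBASE_R15 */

	cpu->gs_base = kernel_gs;	/* WRGSBASE_RDI */
	return prev;
}

/* Exit: put back the saved value without inspecting it. */
void model_paranoid_exit(struct cpu_model *cpu, uint64_t prev)
{
	cpu->gs_base = prev;		/* WRGSBASE_R15 */
}

void model_example(void)
{
	const uint64_t kernel_gs = 0xffff880000010000ULL; /* made up */
	struct cpu_model cpu = { .gs_base = 0x00007f0000001000ULL };
	uint64_t saved = model_paranoid_entry(&cpu, kernel_gs);

	assert(cpu.gs_base == kernel_gs);
	model_paranoid_exit(&cpu, saved);
	assert(cpu.gs_base == 0x00007f0000001000ULL);
}
```

Because nothing is decided by comparing values, a user GS base that
happens to equal the kernel's goes through the same save and restore as
any other value.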

Context switch changes:
======================

After these changes we also need to use the new instructions
to save/restore fs and gs, so that the new values set by the
users won't disappear.  This is also significantly
faster for the case when the 64-bit base has to be switched
(that is, when the GS base is above 4GB), as we can replace
the slow MSR write with a faster wr[fg]sbase execution.

This in turn enables fast switching when there are
enough threads that their TLS segments do not fit below 4GB
(or with some newer systems which don't properly hint the
stack limit). Alternatively, programs that use fs as an additional base
register will not get a significant context switch penalty.

It is all done in a single patch because there was no
simple way to do it in pieces without having crash
holes in between.

v2: Change to save/restore GS instead of using swapgs
based on the value. Large scale changes.
v3: Fix wrong flag initialization in fallback path.
Thanks 0day!
v4: Make swapgs code paths kprobes safe.
Port to new base line code which now switches indexes.
v5: Port to new kernel which avoids paranoid entry for ring 3.
Removed some code that handled this previously.
v6: Remove obsolete code. Use macro for ALTERNATIVE. Use
ALTERNATIVE for exit path, eliminating the DO_RESTORE_G15 flag.
Various cleanups. Improve description.
v7: Port to new entry code. Some fixes/cleanups.
v8: Lots of changes.

Signed-off-by: Andi Kleen <ak@linux.intel.com>
---
 arch/x86/entry/entry_64.S    | 31 +++++++++++++++++++++++++++
 arch/x86/kernel/cpu/common.c |  9 ++++++++
 arch/x86/kernel/process_64.c | 51 ++++++++++++++++++++++++++++++++++++++------
 3 files changed, 85 insertions(+), 6 deletions(-)

diff --git a/arch/x86/entry/entry_64.S b/arch/x86/entry/entry_64.S
index 858b555..c605710 100644
--- a/arch/x86/entry/entry_64.S
+++ b/arch/x86/entry/entry_64.S
@@ -35,6 +35,8 @@
 #include <asm/asm.h>
 #include <asm/smap.h>
 #include <asm/pgtable_types.h>
+#include <asm/alternative-asm.h>
+#include <asm/fsgs.h>
 #include <linux/err.h>
 
 /* Avoid __ASSEMBLER__'ifying <linux/audit.h> just for this.  */
@@ -678,6 +680,7 @@ ENTRY(\sym)
 	jnz	1f
 	.endif
 	call	paranoid_entry
+	/* r15: previous gs if FSGSBASE, otherwise %ebx: swapgs flag */
 	.else
 	call	error_entry
 	.endif
@@ -933,6 +936,7 @@ ENTRY(paranoid_entry)
 	cld
 	SAVE_C_REGS 8
 	SAVE_EXTRA_REGS 8
+	ALTERNATIVE "", "jmp paranoid_save_gs", X86_FEATURE_FSGSBASE
 	movl	$1, %ebx
 	movl	$MSR_GS_BASE, %ecx
 	rdmsr
@@ -943,6 +947,25 @@ ENTRY(paranoid_entry)
 1:	ret
 END(paranoid_entry)
 
+	/*
+	 * Faster version not using RDMSR, and also not assuming
+	 * anything about the previous GS value.
+	 * This allows the user to arbitrarily change GS using
+	 * WRGSBASE. The kernel GS is at the bottom of the
+	 * IST stack.
+	 *
+	 * We don't use the %ebx flag in this case, gs is always
+	 * unconditionally saved/restored in R15
+	 */
+ENTRY(paranoid_save_gs)
+	RDGSBASE_R15				# read previous gs
+	movq $~(EXCEPTION_STKSZ-1), %rax	# get ist stack mask
+	andq %rsp,%rax				# get bottom of stack
+	movq (%rax),%rdi			# get expected GS
+	WRGSBASE_RDI				# set gs for kernel
+	ret
+END(paranoid_save_gs)
+
 /*
  * "Paranoid" exit path from exception stack.  This is invoked
  * only on return from non-NMI IST interrupts that came
@@ -958,11 +981,14 @@ END(paranoid_entry)
 ENTRY(paranoid_exit)
 	DISABLE_INTERRUPTS(CLBR_NONE)
 	TRACE_IRQS_OFF_DEBUG
+	ALTERNATIVE "", "jmp paranoid_gsrestore", X86_FEATURE_FSGSBASE
 	testl	%ebx, %ebx			/* swapgs needed? */
 	jnz	paranoid_exit_no_swapgs
 	TRACE_IRQS_IRETQ
 	SWAPGS_UNSAFE_STACK
 	jmp	paranoid_exit_restore
+paranoid_gsrestore:
+	WRGSBASE_R15
 paranoid_exit_no_swapgs:
 	TRACE_IRQS_IRETQ_DEBUG
 paranoid_exit_restore:
@@ -1380,16 +1406,21 @@ end_repeat_nmi:
 	 * exceptions might do.
 	 */
 	call	paranoid_entry
+	/* r15: previous gs if FSGSBASE, otherwise %ebx swapgs flag */
 
 	/* paranoidentry do_nmi, 0; without TRACE_IRQS_OFF */
 	movq	%rsp, %rdi
 	movq	$-1, %rsi
 	call	do_nmi
 
+	ALTERNATIVE "", "jmp nmi_gsrestore", X86_FEATURE_FSGSBASE
 	testl	%ebx, %ebx			/* swapgs needed? */
 	jnz	nmi_restore
 nmi_swapgs:
 	SWAPGS_UNSAFE_STACK
+	jmp nmi_restore
+nmi_gsrestore:
+	WRGSBASE_R15
 nmi_restore:
 	RESTORE_EXTRA_REGS
 	RESTORE_C_REGS
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 249461f..f581cd1 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -1018,6 +1018,9 @@ static void identify_cpu(struct cpuinfo_x86 *c)
 #endif
 	/* The boot/hotplug time assigment got cleared, restore it */
 	c->logical_proc_id = topology_phys_to_logical_pkg(c->phys_proc_id);
+
+	if (cpu_has(c, X86_FEATURE_FSGSBASE))
+		cr4_set_bits(X86_CR4_FSGSBASE);
 }
 
 /*
@@ -1422,8 +1425,14 @@ void cpu_init(void)
 	 */
 	if (!oist->ist[0]) {
 		char *estacks = per_cpu(exception_stacks, cpu);
+		void *gs = per_cpu(irq_stack_union.gs_base, cpu);
 
 		for (v = 0; v < N_EXCEPTION_STACKS; v++) {
+			/* Store GS at bottom of stack for bootstrap access */
+			*(void **)estacks = gs;
+			/* Put it on every 4K entry */
+			if (exception_stack_sizes[v] > EXCEPTION_STKSZ)
+				*(void **)(estacks + EXCEPTION_STKSZ) = gs;
 			estacks += exception_stack_sizes[v];
 			oist->ist[v] = t->x86_tss.ist[v] =
 					(unsigned long)estacks;
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index b9d99e0..53fa839 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -48,6 +48,7 @@
 #include <asm/syscalls.h>
 #include <asm/debugreg.h>
 #include <asm/switch_to.h>
+#include <asm/fsgs.h>
 
 asmlinkage extern void ret_from_fork(void);
 
@@ -260,6 +261,27 @@ void compat_start_thread(struct pt_regs *regs, u32 new_ip, u32 new_sp)
 }
 #endif
 
+/* Out of line to be protected from kprobes. */
+
+/* Interrupts are disabled here. */
+static noinline __kprobes void switch_gs_base(unsigned long gs)
+{
+	swapgs();
+	wrgsbase(gs);
+	swapgs();
+}
+
+/* Interrupts are disabled here. */
+static noinline __kprobes unsigned long read_user_gsbase(void)
+{
+	unsigned long gs;
+
+	swapgs();
+	gs = rdgsbase();
+	swapgs();
+	return gs;
+}
+
 /*
  *	switch_to(x,y) should switch tasks from x to y.
  *
@@ -291,6 +313,10 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
 	 */
 	savesegment(fs, fsindex);
 	savesegment(gs, gsindex);
+	if (static_cpu_has(X86_FEATURE_FSGSBASE)) {
+		prev->fs = rdfsbase();
+		prev->gs = read_user_gsbase();
+	}
 
 	/*
 	 * Load TLS before restoring any segments so that segment loads
@@ -330,6 +356,8 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
 		loadsegment(ds, next->ds);
 
 	/*
+	 * Description of code path without FSGSBASE:
+	 *
 	 * Switch FS and GS.
 	 *
 	 * These are even more complicated than DS and ES: they have
@@ -361,8 +389,11 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
 	 * base address.
 	 *
 	 * Note: This all depends on arch_prctl being the only way that
-	 * user code can override the segment base.  Once wrfsbase and
-	 * wrgsbase are enabled, most of this code will need to change.
+	 * user code can override the segment base.
+	 *
+	 * Description with FSGSBASE:
+	 * We simply save/restore the bases, and the indexes.
+	 *
 	 */
 	if (unlikely(fsindex | next->fsindex | prev->fs)) {
 		loadsegment(fs, next->fsindex);
@@ -379,8 +410,12 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
 		if (fsindex)
 			prev->fs = 0;
 	}
-	if (next->fs)
-		wrmsrl(MSR_FS_BASE, next->fs);
+	if (next->fs) {
+		if (static_cpu_has(X86_FEATURE_FSGSBASE))
+			wrfsbase(next->fs);
+		else
+			wrmsrl(MSR_FS_BASE, next->fs);
+	}
 	prev->fsindex = fsindex;
 
 	if (unlikely(gsindex | next->gsindex | prev->gs)) {
@@ -390,8 +425,12 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
 		if (gsindex)
 			prev->gs = 0;
 	}
-	if (next->gs)
-		wrmsrl(MSR_KERNEL_GS_BASE, next->gs);
+	if (next->gs) {
+		if (static_cpu_has(X86_FEATURE_FSGSBASE))
+			switch_gs_base(next->gs);
+		else
+			wrmsrl(MSR_KERNEL_GS_BASE, next->gs);
+	}
 	prev->gsindex = gsindex;
 
 	switch_fpu_finish(next_fpu, fpu_switch);
-- 
2.5.5
