* [PATCH v2 0/3] x86_64,entry: Rearrange the syscall exit optimizations
@ 2015-01-17  0:19 Andy Lutomirski
  2015-01-17  0:19 ` [PATCH v2 1/3] x86_64,entry: Fix RCX for traced syscalls Andy Lutomirski
                   ` (3 more replies)
  0 siblings, 4 replies; 9+ messages in thread
From: Andy Lutomirski @ 2015-01-17  0:19 UTC (permalink / raw)
  To: x86, linux-kernel, Linus Torvalds
  Cc: Frédéric Weisbecker, Oleg Nesterov, kvm list,
	Borislav Petkov, Rik van Riel, Andy Lutomirski

Linus, I suspect you'll either like or hate this series.  Or maybe
you'll think it's crazy but you'll like it anyway.  I'm curious
which of those is the case. :)

The syscall exit asm is a big mess.  There's a really fast path, some
kind of fast path code (with a hard-coded optimization for audit), and
the really slow path.  The result is that it's very hard to work with
this code.  There are some asm paths that are much slower than they
should be (context tracking is a major offender), but no one really
wants to add even more asm to speed them up.

This series takes a different, unorthodox approach.  Rather than trying
to avoid entering the very slow iret path, it adds a way back out of the
iret path.  The result is a dramatic speedup for context tracking, user
return notification, and similar code, at the cost of a few lines of
tricky asm.  Nonetheless, it's barely a net addition of asm code,
because we get to remove the fast path optimizations for audit and
rescheduling.

Thoughts?  If this works, it opens the door for a lot of further
consolidation of the exit code.

Note: patch 1 in this series has been floating around on the list
for quite a while.  It's mandatory for this series to work, because
the buglet that it fixes almost completely defeats the optimization
that I'm introducing.

Note: Some optimization along these lines is almost certainly needed
to get anything resembling acceptable performance out of the FPU changes
that Rik is working on.

Andy Lutomirski (3):
  x86_64,entry: Fix RCX for traced syscalls
  x86_64,entry: Use sysret to return to userspace when possible
  x86_64,entry: Remove the syscall exit audit and schedule optimizations

 arch/x86/kernel/entry_64.S | 109 +++++++++++++++++++++++++--------------------
 1 file changed, 61 insertions(+), 48 deletions(-)

-- 
2.1.0



* [PATCH v2 1/3] x86_64,entry: Fix RCX for traced syscalls
  2015-01-17  0:19 [PATCH v2 0/3] x86_64,entry: Rearrange the syscall exit optimizations Andy Lutomirski
@ 2015-01-17  0:19 ` Andy Lutomirski
  2015-01-17 10:13   ` [tip:x86/asm] x86_64 entry: Fix RCX for ptraced syscalls tip-bot for Andy Lutomirski
  2015-01-17  0:19 ` [PATCH v2 2/3] x86_64,entry: Use sysret to return to userspace when possible Andy Lutomirski
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 9+ messages in thread
From: Andy Lutomirski @ 2015-01-17  0:19 UTC (permalink / raw)
  To: x86, linux-kernel, Linus Torvalds
  Cc: Frédéric Weisbecker, Oleg Nesterov, kvm list,
	Borislav Petkov, Rik van Riel, Andy Lutomirski

The int_ret_from_sys_call and syscall tracing code disagrees with
the sysret path as to the value of RCX.

The Intel SDM, the AMD APM, and my laptop all agree that sysret
returns with RCX == RIP.  The syscall tracing code does not respect
this property.
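
For reference, the relevant architectural behavior (slightly
simplified) is:

	syscall:  RCX <- RIP of the next instruction,  R11 <- RFLAGS
	sysret:   RIP <- RCX,                          RFLAGS <- R11

sysret leaves RCX and R11 themselves untouched, so after a plain
syscall/sysret round trip userspace observes RCX == RIP and
R11 == RFLAGS.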

For example, this program:

#include <stdio.h>

int main()
{
	extern const char syscall_rip[];
	unsigned long rcx = 1;
	unsigned long orig_rcx = rcx;
	asm ("mov $-1, %%eax\n\t"
	     "syscall\n\t"
	     "syscall_rip:"
	     : "+c" (rcx) : : "rax", "r11");
	printf("syscall: RCX = %lX  RIP = %lX  orig RCX = %lx\n",
	       rcx, (unsigned long)syscall_rip, orig_rcx);
	return 0;
}

prints:
syscall: RCX = 400556  RIP = 400556  orig RCX = 1

Running it under strace gives this instead:
syscall: RCX = FFFFFFFFFFFFFFFF  RIP = 400556  orig RCX = 1

This changes FIXUP_TOP_OF_STACK to match sysret, causing the test to
show RCX == RIP even under strace.

It looks like this is a partial revert of:
88e4bc32686e [PATCH] x86-64 architecture specific sync for 2.5.8
from the historic git tree.

Signed-off-by: Andy Lutomirski <luto@amacapital.net>
---
 arch/x86/kernel/entry_64.S | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index 7d59df23e5bb..501212f14c87 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -143,7 +143,8 @@ ENDPROC(native_usergs_sysret64)
 	movq \tmp,RSP+\offset(%rsp)
 	movq $__USER_DS,SS+\offset(%rsp)
 	movq $__USER_CS,CS+\offset(%rsp)
-	movq $-1,RCX+\offset(%rsp)
+	movq RIP+\offset(%rsp),\tmp  /* get rip */
+	movq \tmp,RCX+\offset(%rsp)  /* copy it to rcx as sysret would do */
 	movq R11+\offset(%rsp),\tmp  /* get eflags */
 	movq \tmp,EFLAGS+\offset(%rsp)
 	.endm
-- 
2.1.0



* [PATCH v2 2/3] x86_64,entry: Use sysret to return to userspace when possible
  2015-01-17  0:19 [PATCH v2 0/3] x86_64,entry: Rearrange the syscall exit optimizations Andy Lutomirski
  2015-01-17  0:19 ` [PATCH v2 1/3] x86_64,entry: Fix RCX for traced syscalls Andy Lutomirski
@ 2015-01-17  0:19 ` Andy Lutomirski
  2015-02-04  6:01   ` [tip:x86/asm] x86_64, entry: " tip-bot for Andy Lutomirski
  2015-01-17  0:19 ` [PATCH v2 3/3] x86_64,entry: Remove the syscall exit audit and schedule optimizations Andy Lutomirski
  2015-01-17 11:15 ` [PATCH v2 0/3] x86_64,entry: Rearrange the syscall exit optimizations Linus Torvalds
  3 siblings, 1 reply; 9+ messages in thread
From: Andy Lutomirski @ 2015-01-17  0:19 UTC (permalink / raw)
  To: x86, linux-kernel, Linus Torvalds
  Cc: Frédéric Weisbecker, Oleg Nesterov, kvm list,
	Borislav Petkov, Rik van Riel, Andy Lutomirski

The x86_64 entry code currently jumps through complex and
inconsistent hoops to try to minimize the impact of syscall exit
work.  For a true fast-path syscall, almost nothing needs to be
done, so returning is just a check for exit work and sysret.  For a
full slow-path return from a syscall, the C exit hook is invoked if
needed and we join the iret path.

Using iret to return to userspace is very slow, so the entry code
has accumulated various special cases to try to do certain forms of
exit work without invoking iret.  This is error-prone, since it
duplicates assembly code paths, and it's dangerous, since sysret
can malfunction in interesting ways if used carelessly.  It's
also inefficient, since a lot of useful cases aren't optimized
and therefore force an iret out of a combination of paranoia and
the fact that no one has bothered to write even more asm code
to avoid it.

I would argue that this approach is backwards.  Rather than trying
to avoid the iret path, we should instead try to make the iret path
fast.  Under a specific set of conditions, iret is unnecessary.  In
particular, if RIP==RCX, RFLAGS==R11, RIP is canonical, RF is not
set, and both SS and CS are as expected, then
movq 32(%rsp),%rsp;sysret does the same thing as iret.  This set of
conditions is nearly always satisfied on return from syscalls, and
it can even occasionally be satisfied on return from an irq.
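
In C-like terms, the applicability check amounts to roughly the
following sketch.  It is illustrative only -- the real check is the
asm added below, and the field names are just struct pt_regs-style
shorthand:

static bool can_use_sysret(const struct pt_regs *regs)
{
	if (regs->cx != regs->ip)		/* sysret sets RIP from RCX */
		return false;
	if (regs->ip >> __VIRTUAL_MASK_SHIFT)	/* non-canonical or kernel RIP */
		return false;
	if (regs->cs != __USER_CS || regs->ss != __USER_DS)
		return false;
	if (regs->r11 != regs->flags)		/* sysret sets RFLAGS from R11 */
		return false;
	if (regs->flags & X86_EFLAGS_RF)	/* sysret cannot restore RF */
		return false;
	return true;				/* nothing to check for RSP */
}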

Even with the careful checks for sysret applicability, this cuts
nearly 80ns off of the overhead from syscalls with unoptimized exit
work.  This includes tracing and context tracking, and any return
that invokes KVM's user return notifier.  For example, the cost of
getpid with CONFIG_CONTEXT_TRACKING_FORCE=y drops from ~360ns to
~280ns on my computer.

This may allow the removal and even eventual conversion to C
of a respectable amount of exit asm.

This may require further tweaking to give the full benefit on Xen.

It may be worthwhile to adjust signal delivery and exec to try to hit
the sysret path.

This does not optimize returns to 32-bit userspace.  Making the same
optimization for CS == __USER32_CS is conceptually straightforward,
but it will require some tedious code to handle the differences
between sysretl and sysexitl.

Signed-off-by: Andy Lutomirski <luto@amacapital.net>
---
 arch/x86/kernel/entry_64.S | 54 ++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 54 insertions(+)

diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index 501212f14c87..eeab4cf8b2c9 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -794,6 +794,60 @@ retint_swapgs:		/* return to user-space */
 	 */
 	DISABLE_INTERRUPTS(CLBR_ANY)
 	TRACE_IRQS_IRETQ
+
+	/*
+	 * Try to use SYSRET instead of IRET if we're returning to
+	 * a completely clean 64-bit userspace context.
+	 */
+	movq (RCX-R11)(%rsp), %rcx
+	cmpq %rcx,(RIP-R11)(%rsp)		/* RCX == RIP */
+	jne opportunistic_sysret_failed
+
+	/*
+	 * On Intel CPUs, sysret with non-canonical RCX/RIP will #GP
+	 * in kernel space.  This essentially lets the user take over
+	 * the kernel, since userspace controls RSP.  It's not worth
+	 * testing for canonicalness exactly -- this check detects any
+	 * of the 17 high bits set, which is true for non-canonical
+	 * or kernel addresses.  (This will pessimize vsyscall=native.
+	 * Big deal.)
+	 *
+	 * If virtual addresses ever become wider, this will need
+	 * to be updated to remain correct on both old and new CPUs.
+	 */
+	.ifne __VIRTUAL_MASK_SHIFT - 47
+	.error "virtual address width changed -- sysret checks need update"
+	.endif
+	shr $__VIRTUAL_MASK_SHIFT, %rcx
+	jnz opportunistic_sysret_failed
+
+	cmpq $__USER_CS,(CS-R11)(%rsp)		/* CS must match SYSRET */
+	jne opportunistic_sysret_failed
+
+	movq (R11-ARGOFFSET)(%rsp), %r11
+	cmpq %r11,(EFLAGS-ARGOFFSET)(%rsp)	/* R11 == RFLAGS */
+	jne opportunistic_sysret_failed
+
+	testq $X86_EFLAGS_RF,%r11		/* sysret can't restore RF */
+	jnz opportunistic_sysret_failed
+
+	/* nothing to check for RSP */
+
+	cmpq $__USER_DS,(SS-ARGOFFSET)(%rsp)	/* SS must match SYSRET */
+	jne opportunistic_sysret_failed
+
+	/*
+	 * We win!  This label is here just for ease of understanding
+	 * perf profiles.  Nothing jumps here.
+	 */
+irq_return_via_sysret:
+	CFI_REMEMBER_STATE
+	RESTORE_ARGS 1,8,1
+	movq (RSP-RIP)(%rsp),%rsp
+	USERGS_SYSRET64
+	CFI_RESTORE_STATE
+
+opportunistic_sysret_failed:
 	SWAPGS
 	jmp restore_args
 
-- 
2.1.0



* [PATCH v2 3/3] x86_64,entry: Remove the syscall exit audit and schedule optimizations
  2015-01-17  0:19 [PATCH v2 0/3] x86_64,entry: Rearrange the syscall exit optimizations Andy Lutomirski
  2015-01-17  0:19 ` [PATCH v2 1/3] x86_64,entry: Fix RCX for traced syscalls Andy Lutomirski
  2015-01-17  0:19 ` [PATCH v2 2/3] x86_64,entry: Use sysret to return to userspace when possible Andy Lutomirski
@ 2015-01-17  0:19 ` Andy Lutomirski
  2015-02-04  6:02   ` [tip:x86/asm] x86_64, entry: " tip-bot for Andy Lutomirski
  2015-01-17 11:15 ` [PATCH v2 0/3] x86_64,entry: Rearrange the syscall exit optimizations Linus Torvalds
  3 siblings, 1 reply; 9+ messages in thread
From: Andy Lutomirski @ 2015-01-17  0:19 UTC (permalink / raw)
  To: x86, linux-kernel, Linus Torvalds
  Cc: Frédéric Weisbecker, Oleg Nesterov, kvm list,
	Borislav Petkov, Rik van Riel, Andy Lutomirski

We used to optimize rescheduling and audit on syscall exit.  Now
that the full slow path is reasonably fast, remove these
optimizations.  Syscall exit auditing is now handled exclusively by
syscall_trace_leave.

This adds something like 10ns to the previously optimized paths on
my computer, presumably due mostly to SAVE_REST / RESTORE_REST.

I think that we should eventually replace both the syscall and
non-paranoid interrupt exit slow paths with a pair of C functions
along the lines of the syscall entry hooks.
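
Purely as a sketch of the shape such a pair of C hooks could take --
everything below is hypothetical; the names, work flags, and helpers
are illustrative and not part of this series:

/* Hypothetical C slow-path exit hook; all identifiers are illustrative. */
void exit_to_usermode_slowpath(struct pt_regs *regs)
{
	for (;;) {
		unsigned long work = pending_exit_work();	/* hypothetical */

		if (!work)
			break;
		if (work & WORK_RESCHED)			/* hypothetical flag */
			schedule();
		if (work & WORK_SIGPENDING)			/* hypothetical flag */
			deliver_pending_signals(regs);		/* hypothetical */
		if (work & WORK_AUDIT)				/* hypothetical flag */
			audit_exit_hook(regs);			/* hypothetical */
	}
}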

Acked-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
---
 arch/x86/kernel/entry_64.S | 52 +++++-----------------------------------------
 1 file changed, 5 insertions(+), 47 deletions(-)

diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index eeab4cf8b2c9..db13655c3a2a 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -361,15 +361,12 @@ system_call_fastpath:
  * Has incomplete stack frame and undefined top of stack.
  */
 ret_from_sys_call:
-	movl $_TIF_ALLWORK_MASK,%edi
-	/* edi:	flagmask */
-sysret_check:
+	testl $_TIF_ALLWORK_MASK,TI_flags+THREAD_INFO(%rsp,RIP-ARGOFFSET)
+	jnz int_ret_from_sys_call_fixup	/* Go to the slow path */
+
 	LOCKDEP_SYS_EXIT
 	DISABLE_INTERRUPTS(CLBR_NONE)
 	TRACE_IRQS_OFF
-	movl TI_flags+THREAD_INFO(%rsp,RIP-ARGOFFSET),%edx
-	andl %edi,%edx
-	jnz  sysret_careful
 	CFI_REMEMBER_STATE
 	/*
 	 * sysretq will re-enable interrupts:
@@ -383,49 +380,10 @@ sysret_check:
 	USERGS_SYSRET64
 
 	CFI_RESTORE_STATE
-	/* Handle reschedules */
-	/* edx:	work, edi: workmask */
-sysret_careful:
-	bt $TIF_NEED_RESCHED,%edx
-	jnc sysret_signal
-	TRACE_IRQS_ON
-	ENABLE_INTERRUPTS(CLBR_NONE)
-	pushq_cfi %rdi
-	SCHEDULE_USER
-	popq_cfi %rdi
-	jmp sysret_check
 
-	/* Handle a signal */
-sysret_signal:
-	TRACE_IRQS_ON
-	ENABLE_INTERRUPTS(CLBR_NONE)
-#ifdef CONFIG_AUDITSYSCALL
-	bt $TIF_SYSCALL_AUDIT,%edx
-	jc sysret_audit
-#endif
-	/*
-	 * We have a signal, or exit tracing or single-step.
-	 * These all wind up with the iret return path anyway,
-	 * so just join that path right now.
-	 */
+int_ret_from_sys_call_fixup:
 	FIXUP_TOP_OF_STACK %r11, -ARGOFFSET
-	jmp int_check_syscall_exit_work
-
-#ifdef CONFIG_AUDITSYSCALL
-	/*
-	 * Return fast path for syscall audit.  Call __audit_syscall_exit()
-	 * directly and then jump back to the fast path with TIF_SYSCALL_AUDIT
-	 * masked off.
-	 */
-sysret_audit:
-	movq RAX-ARGOFFSET(%rsp),%rsi	/* second arg, syscall return value */
-	cmpq $-MAX_ERRNO,%rsi	/* is it < -MAX_ERRNO? */
-	setbe %al		/* 1 if so, 0 if not */
-	movzbl %al,%edi		/* zero-extend that into %edi */
-	call __audit_syscall_exit
-	movl $(_TIF_ALLWORK_MASK & ~_TIF_SYSCALL_AUDIT),%edi
-	jmp sysret_check
-#endif	/* CONFIG_AUDITSYSCALL */
+	jmp int_ret_from_sys_call
 
 	/* Do syscall tracing */
 tracesys:
-- 
2.1.0



* [tip:x86/asm] x86_64 entry: Fix RCX for ptraced syscalls
  2015-01-17  0:19 ` [PATCH v2 1/3] x86_64,entry: Fix RCX for traced syscalls Andy Lutomirski
@ 2015-01-17 10:13   ` tip-bot for Andy Lutomirski
  0 siblings, 0 replies; 9+ messages in thread
From: tip-bot for Andy Lutomirski @ 2015-01-17 10:13 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: fweisbec, luto, mingo, hpa, bp, oleg, tglx, linux-kernel, torvalds, riel

Commit-ID:  0fcedc8631ec28ca25d3c0b116e8fa0c19dd5f6d
Gitweb:     http://git.kernel.org/tip/0fcedc8631ec28ca25d3c0b116e8fa0c19dd5f6d
Author:     Andy Lutomirski <luto@amacapital.net>
AuthorDate: Fri, 16 Jan 2015 16:19:27 -0800
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Sat, 17 Jan 2015 11:02:53 +0100

x86_64 entry: Fix RCX for ptraced syscalls

The int_ret_from_sys_call and syscall tracing code disagrees
with the sysret path as to the value of RCX.

The Intel SDM, the AMD APM, and my laptop all agree that sysret
returns with RCX == RIP.  The syscall tracing code does not
respect this property.

For example, this program:

#include <stdio.h>

int main()
{
	extern const char syscall_rip[];
	unsigned long rcx = 1;
	unsigned long orig_rcx = rcx;
	asm ("mov $-1, %%eax\n\t"
	     "syscall\n\t"
	     "syscall_rip:"
	     : "+c" (rcx) : : "rax", "r11");
	printf("syscall: RCX = %lX  RIP = %lX  orig RCX = %lx\n",
	       rcx, (unsigned long)syscall_rip, orig_rcx);
	return 0;
}

prints:

  syscall: RCX = 400556  RIP = 400556  orig RCX = 1

Running it under strace gives this instead:

  syscall: RCX = FFFFFFFFFFFFFFFF  RIP = 400556  orig RCX = 1

This changes FIXUP_TOP_OF_STACK to match sysret, causing the
test to show RCX == RIP even under strace.

It looks like this is a partial revert of:
88e4bc32686e ("[PATCH] x86-64 architecture specific sync for 2.5.8")
from the historic git tree.

Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/c9a418c3dc3993cb88bb7773800225fd318a4c67.1421453410.git.luto@amacapital.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/kernel/entry_64.S | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index 9ebaf63..c653dc4 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -143,7 +143,8 @@ ENDPROC(native_usergs_sysret64)
 	movq \tmp,RSP+\offset(%rsp)
 	movq $__USER_DS,SS+\offset(%rsp)
 	movq $__USER_CS,CS+\offset(%rsp)
-	movq $-1,RCX+\offset(%rsp)
+	movq RIP+\offset(%rsp),\tmp  /* get rip */
+	movq \tmp,RCX+\offset(%rsp)  /* copy it to rcx as sysret would do */
 	movq R11+\offset(%rsp),\tmp  /* get eflags */
 	movq \tmp,EFLAGS+\offset(%rsp)
 	.endm


* Re: [PATCH v2 0/3] x86_64,entry: Rearrange the syscall exit optimizations
  2015-01-17  0:19 [PATCH v2 0/3] x86_64,entry: Rearrange the syscall exit optimizations Andy Lutomirski
                   ` (2 preceding siblings ...)
  2015-01-17  0:19 ` [PATCH v2 3/3] x86_64,entry: Remove the syscall exit audit and schedule optimizations Andy Lutomirski
@ 2015-01-17 11:15 ` Linus Torvalds
  3 siblings, 0 replies; 9+ messages in thread
From: Linus Torvalds @ 2015-01-17 11:15 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: the arch/x86 maintainers, Linux Kernel Mailing List,
	Frédéric Weisbecker, Oleg Nesterov, kvm list,
	Borislav Petkov, Rik van Riel

On Sat, Jan 17, 2015 at 1:19 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> Linus, I suspect you'll either like or hate this series.  Or maybe
> you'll think it's crazy but you'll like it anyway.  I'm curious
> which of those is the case. :)

I have no hugely strong reaction to the patches, but it seems to be a
good simplification of our model, in addition to allowing sysret for
more cases. So Ack, as far as I'm concerned.

           Linus


* [tip:x86/asm] x86_64, entry: Use sysret to return to userspace when possible
  2015-01-17  0:19 ` [PATCH v2 2/3] x86_64,entry: Use sysret to return to userspace when possible Andy Lutomirski
@ 2015-02-04  6:01   ` tip-bot for Andy Lutomirski
  2015-02-06 18:56     ` Fwd: " Andy Lutomirski
  0 siblings, 1 reply; 9+ messages in thread
From: tip-bot for Andy Lutomirski @ 2015-02-04  6:01 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: tglx, luto, linux-kernel, mingo, hpa

Commit-ID:  2a23c6b8a9c42620182a2d2cfc7c16f6ff8c42b4
Gitweb:     http://git.kernel.org/tip/2a23c6b8a9c42620182a2d2cfc7c16f6ff8c42b4
Author:     Andy Lutomirski <luto@amacapital.net>
AuthorDate: Tue, 22 Jul 2014 12:46:50 -0700
Committer:  Andy Lutomirski <luto@amacapital.net>
CommitDate: Sun, 1 Feb 2015 04:03:01 -0800

x86_64, entry: Use sysret to return to userspace when possible

The x86_64 entry code currently jumps through complex and
inconsistent hoops to try to minimize the impact of syscall exit
work.  For a true fast-path syscall, almost nothing needs to be
done, so returning is just a check for exit work and sysret.  For a
full slow-path return from a syscall, the C exit hook is invoked if
needed and we join the iret path.

Using iret to return to userspace is very slow, so the entry code
has accumulated various special cases to try to do certain forms of
exit work without invoking iret.  This is error-prone, since it
duplicates assembly code paths, and it's dangerous, since sysret
can malfunction in interesting ways if used carelessly.  It's
also inefficient, since a lot of useful cases aren't optimized
and therefore force an iret out of a combination of paranoia and
the fact that no one has bothered to write even more asm code
to avoid it.

I would argue that this approach is backwards.  Rather than trying
to avoid the iret path, we should instead try to make the iret path
fast.  Under a specific set of conditions, iret is unnecessary.  In
particular, if RIP==RCX, RFLAGS==R11, RIP is canonical, RF is not
set, and both SS and CS are as expected, then
movq 32(%rsp),%rsp;sysret does the same thing as iret.  This set of
conditions is nearly always satisfied on return from syscalls, and
it can even occasionally be satisfied on return from an irq.

Even with the careful checks for sysret applicability, this cuts
nearly 80ns off of the overhead from syscalls with unoptimized exit
work.  This includes tracing and context tracking, and any return
that invokes KVM's user return notifier.  For example, the cost of
getpid with CONFIG_CONTEXT_TRACKING_FORCE=y drops from ~360ns to
~280ns on my computer.

This may allow the removal and even eventual conversion to C
of a respectable amount of exit asm.

This may require further tweaking to give the full benefit on Xen.

It may be worthwhile to adjust signal delivery and exec to try to hit
the sysret path.

This does not optimize returns to 32-bit userspace.  Making the same
optimization for CS == __USER32_CS is conceptually straightforward,
but it will require some tedious code to handle the differences
between sysretl and sysexitl.

Link: http://lkml.kernel.org/r/71428f63e681e1b4aa1a781e3ef7c27f027d1103.1421453410.git.luto@amacapital.net
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
---
 arch/x86/kernel/entry_64.S | 54 ++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 54 insertions(+)

diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index 501212f..eeab4cf 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -794,6 +794,60 @@ retint_swapgs:		/* return to user-space */
 	 */
 	DISABLE_INTERRUPTS(CLBR_ANY)
 	TRACE_IRQS_IRETQ
+
+	/*
+	 * Try to use SYSRET instead of IRET if we're returning to
+	 * a completely clean 64-bit userspace context.
+	 */
+	movq (RCX-R11)(%rsp), %rcx
+	cmpq %rcx,(RIP-R11)(%rsp)		/* RCX == RIP */
+	jne opportunistic_sysret_failed
+
+	/*
+	 * On Intel CPUs, sysret with non-canonical RCX/RIP will #GP
+	 * in kernel space.  This essentially lets the user take over
+	 * the kernel, since userspace controls RSP.  It's not worth
+	 * testing for canonicalness exactly -- this check detects any
+	 * of the 17 high bits set, which is true for non-canonical
+	 * or kernel addresses.  (This will pessimize vsyscall=native.
+	 * Big deal.)
+	 *
+	 * If virtual addresses ever become wider, this will need
+	 * to be updated to remain correct on both old and new CPUs.
+	 */
+	.ifne __VIRTUAL_MASK_SHIFT - 47
+	.error "virtual address width changed -- sysret checks need update"
+	.endif
+	shr $__VIRTUAL_MASK_SHIFT, %rcx
+	jnz opportunistic_sysret_failed
+
+	cmpq $__USER_CS,(CS-R11)(%rsp)		/* CS must match SYSRET */
+	jne opportunistic_sysret_failed
+
+	movq (R11-ARGOFFSET)(%rsp), %r11
+	cmpq %r11,(EFLAGS-ARGOFFSET)(%rsp)	/* R11 == RFLAGS */
+	jne opportunistic_sysret_failed
+
+	testq $X86_EFLAGS_RF,%r11		/* sysret can't restore RF */
+	jnz opportunistic_sysret_failed
+
+	/* nothing to check for RSP */
+
+	cmpq $__USER_DS,(SS-ARGOFFSET)(%rsp)	/* SS must match SYSRET */
+	jne opportunistic_sysret_failed
+
+	/*
+	 * We win!  This label is here just for ease of understanding
+	 * perf profiles.  Nothing jumps here.
+	 */
+irq_return_via_sysret:
+	CFI_REMEMBER_STATE
+	RESTORE_ARGS 1,8,1
+	movq (RSP-RIP)(%rsp),%rsp
+	USERGS_SYSRET64
+	CFI_RESTORE_STATE
+
+opportunistic_sysret_failed:
 	SWAPGS
 	jmp restore_args
 


* [tip:x86/asm] x86_64, entry: Remove the syscall exit audit and schedule optimizations
  2015-01-17  0:19 ` [PATCH v2 3/3] x86_64,entry: Remove the syscall exit audit and schedule optimizations Andy Lutomirski
@ 2015-02-04  6:02   ` tip-bot for Andy Lutomirski
  0 siblings, 0 replies; 9+ messages in thread
From: tip-bot for Andy Lutomirski @ 2015-02-04  6:02 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: bp, luto, hpa, mingo, linux-kernel, tglx

Commit-ID:  96b6352c12711d5c0bb7157f49c92580248e8146
Gitweb:     http://git.kernel.org/tip/96b6352c12711d5c0bb7157f49c92580248e8146
Author:     Andy Lutomirski <luto@amacapital.net>
AuthorDate: Mon, 7 Jul 2014 11:37:17 -0700
Committer:  Andy Lutomirski <luto@amacapital.net>
CommitDate: Sun, 1 Feb 2015 04:03:02 -0800

x86_64, entry: Remove the syscall exit audit and schedule optimizations

We used to optimize rescheduling and audit on syscall exit.  Now
that the full slow path is reasonably fast, remove these
optimizations.  Syscall exit auditing is now handled exclusively by
syscall_trace_leave.

This adds something like 10ns to the previously optimized paths on
my computer, presumably due mostly to SAVE_REST / RESTORE_REST.

I think that we should eventually replace both the syscall and
non-paranoid interrupt exit slow paths with a pair of C functions
along the lines of the syscall entry hooks.

Link: http://lkml.kernel.org/r/22f2aa4a0361707a5cfb1de9d45260b39965dead.1421453410.git.luto@amacapital.net
Acked-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
---
 arch/x86/kernel/entry_64.S | 52 +++++-----------------------------------------
 1 file changed, 5 insertions(+), 47 deletions(-)

diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index eeab4cf..db13655 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -361,15 +361,12 @@ system_call_fastpath:
  * Has incomplete stack frame and undefined top of stack.
  */
 ret_from_sys_call:
-	movl $_TIF_ALLWORK_MASK,%edi
-	/* edi:	flagmask */
-sysret_check:
+	testl $_TIF_ALLWORK_MASK,TI_flags+THREAD_INFO(%rsp,RIP-ARGOFFSET)
+	jnz int_ret_from_sys_call_fixup	/* Go to the slow path */
+
 	LOCKDEP_SYS_EXIT
 	DISABLE_INTERRUPTS(CLBR_NONE)
 	TRACE_IRQS_OFF
-	movl TI_flags+THREAD_INFO(%rsp,RIP-ARGOFFSET),%edx
-	andl %edi,%edx
-	jnz  sysret_careful
 	CFI_REMEMBER_STATE
 	/*
 	 * sysretq will re-enable interrupts:
@@ -383,49 +380,10 @@ sysret_check:
 	USERGS_SYSRET64
 
 	CFI_RESTORE_STATE
-	/* Handle reschedules */
-	/* edx:	work, edi: workmask */
-sysret_careful:
-	bt $TIF_NEED_RESCHED,%edx
-	jnc sysret_signal
-	TRACE_IRQS_ON
-	ENABLE_INTERRUPTS(CLBR_NONE)
-	pushq_cfi %rdi
-	SCHEDULE_USER
-	popq_cfi %rdi
-	jmp sysret_check
 
-	/* Handle a signal */
-sysret_signal:
-	TRACE_IRQS_ON
-	ENABLE_INTERRUPTS(CLBR_NONE)
-#ifdef CONFIG_AUDITSYSCALL
-	bt $TIF_SYSCALL_AUDIT,%edx
-	jc sysret_audit
-#endif
-	/*
-	 * We have a signal, or exit tracing or single-step.
-	 * These all wind up with the iret return path anyway,
-	 * so just join that path right now.
-	 */
+int_ret_from_sys_call_fixup:
 	FIXUP_TOP_OF_STACK %r11, -ARGOFFSET
-	jmp int_check_syscall_exit_work
-
-#ifdef CONFIG_AUDITSYSCALL
-	/*
-	 * Return fast path for syscall audit.  Call __audit_syscall_exit()
-	 * directly and then jump back to the fast path with TIF_SYSCALL_AUDIT
-	 * masked off.
-	 */
-sysret_audit:
-	movq RAX-ARGOFFSET(%rsp),%rsi	/* second arg, syscall return value */
-	cmpq $-MAX_ERRNO,%rsi	/* is it < -MAX_ERRNO? */
-	setbe %al		/* 1 if so, 0 if not */
-	movzbl %al,%edi		/* zero-extend that into %edi */
-	call __audit_syscall_exit
-	movl $(_TIF_ALLWORK_MASK & ~_TIF_SYSCALL_AUDIT),%edi
-	jmp sysret_check
-#endif	/* CONFIG_AUDITSYSCALL */
+	jmp int_ret_from_sys_call
 
 	/* Do syscall tracing */
 tracesys:


* Fwd: [tip:x86/asm] x86_64, entry: Use sysret to return to userspace when possible
  2015-02-04  6:01   ` [tip:x86/asm] x86_64, entry: " tip-bot for Andy Lutomirski
@ 2015-02-06 18:56     ` Andy Lutomirski
  0 siblings, 0 replies; 9+ messages in thread
From: Andy Lutomirski @ 2015-02-06 18:56 UTC (permalink / raw)
  To: kvm list

In case you're interested, this change (queued for 3.20) should cut a
couple hundred cycles off of kvm heavyweight exits.

--Andy


---------- Forwarded message ----------
From: tip-bot for Andy Lutomirski <tipbot@zytor.com>
Date: Tue, Feb 3, 2015 at 10:01 PM
Subject: [tip:x86/asm] x86_64, entry:  Use sysret to return to
userspace when possible
To: linux-tip-commits@vger.kernel.org
Cc: tglx@linutronix.de, luto@amacapital.net,
linux-kernel@vger.kernel.org, mingo@kernel.org, hpa@zytor.com


Commit-ID:  2a23c6b8a9c42620182a2d2cfc7c16f6ff8c42b4
Gitweb:     http://git.kernel.org/tip/2a23c6b8a9c42620182a2d2cfc7c16f6ff8c42b4
Author:     Andy Lutomirski <luto@amacapital.net>
AuthorDate: Tue, 22 Jul 2014 12:46:50 -0700
Committer:  Andy Lutomirski <luto@amacapital.net>
CommitDate: Sun, 1 Feb 2015 04:03:01 -0800

x86_64, entry: Use sysret to return to userspace when possible

The x86_64 entry code currently jumps through complex and
inconsistent hoops to try to minimize the impact of syscall exit
work.  For a true fast-path syscall, almost nothing needs to be
done, so returning is just a check for exit work and sysret.  For a
full slow-path return from a syscall, the C exit hook is invoked if
needed and we join the iret path.

Using iret to return to userspace is very slow, so the entry code
has accumulated various special cases to try to do certain forms of
exit work without invoking iret.  This is error-prone, since it
duplicates assembly code paths, and it's dangerous, since sysret
can malfunction in interesting ways if used carelessly.  It's
also inefficient, since a lot of useful cases aren't optimized
and therefore force an iret out of a combination of paranoia and
the fact that no one has bothered to write even more asm code
to avoid it.

I would argue that this approach is backwards.  Rather than trying
to avoid the iret path, we should instead try to make the iret path
fast.  Under a specific set of conditions, iret is unnecessary.  In
particular, if RIP==RCX, RFLAGS==R11, RIP is canonical, RF is not
set, and both SS and CS are as expected, then
movq 32(%rsp),%rsp;sysret does the same thing as iret.  This set of
conditions is nearly always satisfied on return from syscalls, and
it can even occasionally be satisfied on return from an irq.

Even with the careful checks for sysret applicability, this cuts
nearly 80ns off of the overhead from syscalls with unoptimized exit
work.  This includes tracing and context tracking, and any return
that invokes KVM's user return notifier.  For example, the cost of
getpid with CONFIG_CONTEXT_TRACKING_FORCE=y drops from ~360ns to
~280ns on my computer.

This may allow the removal and even eventual conversion to C
of a respectable amount of exit asm.

This may require further tweaking to give the full benefit on Xen.

It may be worthwhile to adjust signal delivery and exec to try to hit
the sysret path.

This does not optimize returns to 32-bit userspace.  Making the same
optimization for CS == __USER32_CS is conceptually straightforward,
but it will require some tedious code to handle the differences
between sysretl and sysexitl.

Link: http://lkml.kernel.org/r/71428f63e681e1b4aa1a781e3ef7c27f027d1103.1421453410.git.luto@amacapital.net
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
---
 arch/x86/kernel/entry_64.S | 54 ++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 54 insertions(+)

diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index 501212f..eeab4cf 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -794,6 +794,60 @@ retint_swapgs:             /* return to user-space */
         */
        DISABLE_INTERRUPTS(CLBR_ANY)
        TRACE_IRQS_IRETQ
+
+       /*
+        * Try to use SYSRET instead of IRET if we're returning to
+        * a completely clean 64-bit userspace context.
+        */
+       movq (RCX-R11)(%rsp), %rcx
+       cmpq %rcx,(RIP-R11)(%rsp)               /* RCX == RIP */
+       jne opportunistic_sysret_failed
+
+       /*
+        * On Intel CPUs, sysret with non-canonical RCX/RIP will #GP
+        * in kernel space.  This essentially lets the user take over
+        * the kernel, since userspace controls RSP.  It's not worth
+        * testing for canonicalness exactly -- this check detects any
+        * of the 17 high bits set, which is true for non-canonical
+        * or kernel addresses.  (This will pessimize vsyscall=native.
+        * Big deal.)
+        *
+        * If virtual addresses ever become wider, this will need
+        * to be updated to remain correct on both old and new CPUs.
+        */
+       .ifne __VIRTUAL_MASK_SHIFT - 47
+       .error "virtual address width changed -- sysret checks need update"
+       .endif
+       shr $__VIRTUAL_MASK_SHIFT, %rcx
+       jnz opportunistic_sysret_failed
+
+       cmpq $__USER_CS,(CS-R11)(%rsp)          /* CS must match SYSRET */
+       jne opportunistic_sysret_failed
+
+       movq (R11-ARGOFFSET)(%rsp), %r11
+       cmpq %r11,(EFLAGS-ARGOFFSET)(%rsp)      /* R11 == RFLAGS */
+       jne opportunistic_sysret_failed
+
+       testq $X86_EFLAGS_RF,%r11               /* sysret can't restore RF */
+       jnz opportunistic_sysret_failed
+
+       /* nothing to check for RSP */
+
+       cmpq $__USER_DS,(SS-ARGOFFSET)(%rsp)    /* SS must match SYSRET */
+       jne opportunistic_sysret_failed
+
+       /*
+        * We win!  This label is here just for ease of understanding
+        * perf profiles.  Nothing jumps here.
+        */
+irq_return_via_sysret:
+       CFI_REMEMBER_STATE
+       RESTORE_ARGS 1,8,1
+       movq (RSP-RIP)(%rsp),%rsp
+       USERGS_SYSRET64
+       CFI_RESTORE_STATE
+
+opportunistic_sysret_failed:
        SWAPGS
        jmp restore_args



-- 
Andy Lutomirski
AMA Capital Management, LLC


