* [PATCH v2 0/3] x86_64,entry: Rearrange the syscall exit optimizations
@ 2015-01-17 0:19 Andy Lutomirski
2015-01-17 0:19 ` [PATCH v2 1/3] x86_64,entry: Fix RCX for traced syscalls Andy Lutomirski
` (3 more replies)
0 siblings, 4 replies; 9+ messages in thread
From: Andy Lutomirski @ 2015-01-17 0:19 UTC (permalink / raw)
To: x86, linux-kernel, Linus Torvalds
Cc: Frédéric Weisbecker, Oleg Nesterov, kvm list,
Borislav Petkov, Rik van Riel, Andy Lutomirski
Linus, I suspect you'll either like or hate this series. Or maybe
you'll think it's crazy but you'll like it anyway. I'm curious
which of those is the case. :)
The syscall exit asm is a big mess. There's a really fast path, a
moderately fast path (with a hard-coded optimization for audit), and
the really slow path. The result is that this code is very hard to
work with. Some asm paths are much slower than they should be
(context tracking is a major offender), but no one really wants to
add even more asm to speed them up.
This series takes a different, unorthodox approach. Rather than trying
to avoid entering the very slow iret path, it adds a way back out of the
iret path. The result is a dramatic speedup for context tracking, user
return notification, and similar code, at the cost of a few lines of
tricky asm. Nonetheless, it's barely a net addition of asm code,
because we get to remove the fast path optimizations for audit and
rescheduling.
Thoughts? If this works, it opens the door for a lot of further
consolidation of the exit code.
Note: patch 1 in this series has been floating around on the list
for quite a while. It's mandatory for this series to work, because
the buglet that it fixes almost completely defeats the optimization
that I'm introducing.
Note: Some optimization along these lines is almost certainly needed
to get anything resembling acceptable performance out of the FPU changes
that Rik is working on.
Andy Lutomirski (3):
x86_64,entry: Fix RCX for traced syscalls
x86_64,entry: Use sysret to return to userspace when possible
x86_64,entry: Remove the syscall exit audit and schedule optimizations
arch/x86/kernel/entry_64.S | 109 +++++++++++++++++++++++++--------------------
1 file changed, 61 insertions(+), 48 deletions(-)
--
2.1.0
^ permalink raw reply [flat|nested] 9+ messages in thread
* [PATCH v2 1/3] x86_64,entry: Fix RCX for traced syscalls
2015-01-17 0:19 [PATCH v2 0/3] x86_64,entry: Rearrange the syscall exit optimizations Andy Lutomirski
@ 2015-01-17 0:19 ` Andy Lutomirski
2015-01-17 10:13 ` [tip:x86/asm] x86_64 entry: Fix RCX for ptraced syscalls tip-bot for Andy Lutomirski
2015-01-17 0:19 ` [PATCH v2 2/3] x86_64,entry: Use sysret to return to userspace when possible Andy Lutomirski
` (2 subsequent siblings)
3 siblings, 1 reply; 9+ messages in thread
From: Andy Lutomirski @ 2015-01-17 0:19 UTC (permalink / raw)
To: x86, linux-kernel, Linus Torvalds
Cc: Frédéric Weisbecker, Oleg Nesterov, kvm list,
Borislav Petkov, Rik van Riel, Andy Lutomirski
The int_ret_from_sys_call and syscall tracing code disagree with
the sysret path as to the value of RCX.
The Intel SDM, the AMD APM, and my laptop all agree that sysret
returns with RCX == RIP. The syscall tracing code does not respect
this property.
For example, this program:
#include <stdio.h>
int main()
{
extern const char syscall_rip[];
unsigned long rcx = 1;
unsigned long orig_rcx = rcx;
asm ("mov $-1, %%eax\n\t"
"syscall\n\t"
"syscall_rip:"
: "+c" (rcx) : : "r11");
printf("syscall: RCX = %lX RIP = %lX orig RCX = %lx\n",
rcx, (unsigned long)syscall_rip, orig_rcx);
return 0;
}
prints:
syscall: RCX = 400556 RIP = 400556 orig RCX = 1
Running it under strace gives this instead:
syscall: RCX = FFFFFFFFFFFFFFFF RIP = 400556 orig RCX = 1
This changes FIXUP_TOP_OF_STACK to match sysret, causing the test to
show RCX == RIP even under strace.
It looks like this is a partial revert of:
88e4bc32686e [PATCH] x86-64 architecture specific sync for 2.5.8
from the historic git tree.
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
---
arch/x86/kernel/entry_64.S | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index 7d59df23e5bb..501212f14c87 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -143,7 +143,8 @@ ENDPROC(native_usergs_sysret64)
movq \tmp,RSP+\offset(%rsp)
movq $__USER_DS,SS+\offset(%rsp)
movq $__USER_CS,CS+\offset(%rsp)
- movq $-1,RCX+\offset(%rsp)
+ movq RIP+\offset(%rsp),\tmp /* get rip */
+ movq \tmp,RCX+\offset(%rsp) /* copy it to rcx as sysret would do */
movq R11+\offset(%rsp),\tmp /* get eflags */
movq \tmp,EFLAGS+\offset(%rsp)
.endm
--
2.1.0
* [PATCH v2 2/3] x86_64,entry: Use sysret to return to userspace when possible
2015-01-17 0:19 [PATCH v2 0/3] x86_64,entry: Rearrange the syscall exit optimizations Andy Lutomirski
2015-01-17 0:19 ` [PATCH v2 1/3] x86_64,entry: Fix RCX for traced syscalls Andy Lutomirski
@ 2015-01-17 0:19 ` Andy Lutomirski
2015-02-04 6:01 ` [tip:x86/asm] x86_64, entry: " tip-bot for Andy Lutomirski
2015-01-17 0:19 ` [PATCH v2 3/3] x86_64,entry: Remove the syscall exit audit and schedule optimizations Andy Lutomirski
2015-01-17 11:15 ` [PATCH v2 0/3] x86_64,entry: Rearrange the syscall exit optimizations Linus Torvalds
3 siblings, 1 reply; 9+ messages in thread
From: Andy Lutomirski @ 2015-01-17 0:19 UTC (permalink / raw)
To: x86, linux-kernel, Linus Torvalds
Cc: Frédéric Weisbecker, Oleg Nesterov, kvm list,
Borislav Petkov, Rik van Riel, Andy Lutomirski
The x86_64 entry code currently jumps through complex and
inconsistent hoops to try to minimize the impact of syscall exit
work. For a true fast-path syscall, almost nothing needs to be
done, so returning is just a check for exit work and sysret. For a
full slow-path return from a syscall, the C exit hook is invoked if
needed and we join the iret path.
Using iret to return to userspace is very slow, so the entry code
has accumulated various special cases to try to do certain forms of
exit work without invoking iret. This is error-prone, since it
duplicates assembly code paths, and it's dangerous, since sysret
can malfunction in interesting ways if used carelessly. It's
also inefficient, since a lot of useful cases aren't optimized
and therefore force an iret out of a combination of paranoia and
the fact that no one has bothered to write even more asm code
to avoid it.
I would argue that this approach is backwards. Rather than trying
to avoid the iret path, we should instead try to make the iret path
fast. Under a specific set of conditions, iret is unnecessary. In
particular, if RIP==RCX, RFLAGS==R11, RIP is canonical, RF is not
set, and both SS and CS are as expected, then
movq 32(%rsp),%rsp;sysret does the same thing as iret. This set of
conditions is nearly always satisfied on return from syscalls, and
it can even occasionally be satisfied on return from an irq.
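The conditions enumerated above can be sketched as a predicate in C. This is
a hypothetical illustration, not the kernel's actual code: the struct, field
names, and selector constants (0x33 for __USER_CS, 0x2b for __USER_DS on
x86_64) are stand-ins for the saved register frame the asm actually inspects.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed x86_64 values; labeled stand-ins, not kernel definitions. */
#define USER_CS            0x33UL
#define USER_DS            0x2bUL
#define EFLAGS_RF          (1UL << 16)
#define VIRTUAL_MASK_SHIFT 47       /* 48-bit virtual addresses */

struct regs {
	uint64_t rip, rcx, r11, rflags, cs, ss;
};

/* True iff "movq 32(%rsp),%rsp; sysret" would behave exactly like iret. */
static bool sysret_ok(const struct regs *r)
{
	if (r->rcx != r->rip)             /* sysret loads RIP from RCX */
		return false;
	if (r->rcx >> VIRTUAL_MASK_SHIFT) /* non-canonical or kernel RIP */
		return false;
	if (r->cs != USER_CS || r->ss != USER_DS)
		return false;
	if (r->r11 != r->rflags)          /* sysret loads RFLAGS from R11 */
		return false;
	if (r->r11 & EFLAGS_RF)           /* sysret can't restore RF */
		return false;
	return true;
}
```

Any failed check falls back to the ordinary iret path, so the predicate only
has to be conservative, never exact.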
Even with the careful checks for sysret applicability, this cuts
nearly 80ns off of the overhead from syscalls with unoptimized exit
work. This includes tracing and context tracking, and any return
that invokes KVM's user return notifier. For example, the cost of
getpid with CONFIG_CONTEXT_TRACKING_FORCE=y drops from ~360ns to
~280ns on my computer.
This may allow the removal and even eventual conversion to C
of a respectable amount of exit asm.
This may require further tweaking to give the full benefit on Xen.
It may be worthwhile to adjust signal delivery and exec to try to
hit the sysret path.
This does not optimize returns to 32-bit userspace. Making the same
optimization for CS == __USER32_CS is conceptually straightforward,
but it will require some tedious code to handle the differences
between sysretl and sysexitl.
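The patch's canonicalness shortcut (see the comment in the diff below about
"any of the 17 high bits set") reduces to a single shift. A small C
illustration, assuming the 48-bit virtual address layout the patch asserts
at build time:

```c
#include <stdint.h>

/*
 * Shifting right by __VIRTUAL_MASK_SHIFT (47 here) leaves a nonzero
 * value iff any of the 17 high bits is set, which is true for every
 * non-canonical address and every kernel address.  That over-rejects
 * some canonical addresses (e.g. the vsyscall page), which only costs
 * a fallback to iret.
 */
static int high_bits_set(uint64_t addr)
{
	return (addr >> 47) != 0;
}
```

This is why an exact canonical-address test isn't needed: rejecting too much
is harmless, while accepting a non-canonical RCX would let sysret #GP in
kernel space with a user-controlled RSP.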
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
---
arch/x86/kernel/entry_64.S | 54 ++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 54 insertions(+)
diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index 501212f14c87..eeab4cf8b2c9 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -794,6 +794,60 @@ retint_swapgs: /* return to user-space */
*/
DISABLE_INTERRUPTS(CLBR_ANY)
TRACE_IRQS_IRETQ
+
+ /*
+ * Try to use SYSRET instead of IRET if we're returning to
+ * a completely clean 64-bit userspace context.
+ */
+ movq (RCX-R11)(%rsp), %rcx
+ cmpq %rcx,(RIP-R11)(%rsp) /* RCX == RIP */
+ jne opportunistic_sysret_failed
+
+ /*
+ * On Intel CPUs, sysret with non-canonical RCX/RIP will #GP
+ * in kernel space. This essentially lets the user take over
+ * the kernel, since userspace controls RSP. It's not worth
+ * testing for canonicalness exactly -- this check detects any
+ * of the 17 high bits set, which is true for non-canonical
+ * or kernel addresses. (This will pessimize vsyscall=native.
+ * Big deal.)
+ *
+ * If virtual addresses ever become wider, this will need
+ * to be updated to remain correct on both old and new CPUs.
+ */
+ .ifne __VIRTUAL_MASK_SHIFT - 47
+ .error "virtual address width changed -- sysret checks need update"
+ .endif
+ shr $__VIRTUAL_MASK_SHIFT, %rcx
+ jnz opportunistic_sysret_failed
+
+ cmpq $__USER_CS,(CS-R11)(%rsp) /* CS must match SYSRET */
+ jne opportunistic_sysret_failed
+
+ movq (R11-ARGOFFSET)(%rsp), %r11
+ cmpq %r11,(EFLAGS-ARGOFFSET)(%rsp) /* R11 == RFLAGS */
+ jne opportunistic_sysret_failed
+
+ testq $X86_EFLAGS_RF,%r11 /* sysret can't restore RF */
+ jnz opportunistic_sysret_failed
+
+ /* nothing to check for RSP */
+
+ cmpq $__USER_DS,(SS-ARGOFFSET)(%rsp) /* SS must match SYSRET */
+ jne opportunistic_sysret_failed
+
+ /*
+ * We win! This label is here just for ease of understanding
+ * perf profiles. Nothing jumps here.
+ */
+irq_return_via_sysret:
+ CFI_REMEMBER_STATE
+ RESTORE_ARGS 1,8,1
+ movq (RSP-RIP)(%rsp),%rsp
+ USERGS_SYSRET64
+ CFI_RESTORE_STATE
+
+opportunistic_sysret_failed:
SWAPGS
jmp restore_args
--
2.1.0
* [PATCH v2 3/3] x86_64,entry: Remove the syscall exit audit and schedule optimizations
2015-01-17 0:19 [PATCH v2 0/3] x86_64,entry: Rearrange the syscall exit optimizations Andy Lutomirski
2015-01-17 0:19 ` [PATCH v2 1/3] x86_64,entry: Fix RCX for traced syscalls Andy Lutomirski
2015-01-17 0:19 ` [PATCH v2 2/3] x86_64,entry: Use sysret to return to userspace when possible Andy Lutomirski
@ 2015-01-17 0:19 ` Andy Lutomirski
2015-02-04 6:02 ` [tip:x86/asm] x86_64, entry: " tip-bot for Andy Lutomirski
2015-01-17 11:15 ` [PATCH v2 0/3] x86_64,entry: Rearrange the syscall exit optimizations Linus Torvalds
3 siblings, 1 reply; 9+ messages in thread
From: Andy Lutomirski @ 2015-01-17 0:19 UTC (permalink / raw)
To: x86, linux-kernel, Linus Torvalds
Cc: Frédéric Weisbecker, Oleg Nesterov, kvm list,
Borislav Petkov, Rik van Riel, Andy Lutomirski
We used to optimize rescheduling and audit on syscall exit. Now
that the full slow path is reasonably fast, remove these
optimizations. Syscall exit auditing is now handled exclusively by
syscall_trace_leave.
This adds something like 10ns to the previously optimized paths on
my computer, presumably due mostly to SAVE_REST / RESTORE_REST.
I think that we should eventually replace both the syscall and
non-paranoid interrupt exit slow paths with a pair of C functions
along the lines of the syscall entry hooks.
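A minimal sketch of what such a C exit hook might look like. Everything here
is invented for illustration: the flag names echo the kernel's TIF_* bits,
but the global flag word and the helper functions are hypothetical stand-ins
for the real schedule/signal/audit work done with the register frame in hand.

```c
/* Invented stand-ins; not kernel definitions. */
#define TIF_NEED_RESCHED  (1u << 0)
#define TIF_SIGPENDING    (1u << 1)
#define TIF_SYSCALL_AUDIT (1u << 2)
#define EXIT_WORK_MASK    (TIF_NEED_RESCHED | TIF_SIGPENDING | TIF_SYSCALL_AUDIT)

static unsigned int thread_flags;	/* stand-in for TI_flags */

static void schedule_user(void) { thread_flags &= ~TIF_NEED_RESCHED; }
static void handle_signal(void) { thread_flags &= ~TIF_SIGPENDING; }
static void audit_exit(void)    { thread_flags &= ~TIF_SYSCALL_AUDIT; }

/*
 * Loop until no exit work remains, mirroring what the removed asm
 * fast-path branches did: each work item may raise new work, so we
 * re-check the mask after every pass.
 */
static void exit_to_usermode_loop(void)
{
	while (thread_flags & EXIT_WORK_MASK) {
		if (thread_flags & TIF_NEED_RESCHED)
			schedule_user();
		if (thread_flags & TIF_SIGPENDING)
			handle_signal();
		if (thread_flags & TIF_SYSCALL_AUDIT)
			audit_exit();
	}
}
```

The point of the loop structure is that one C function can subsume all the
special-cased asm branches (sysret_careful, sysret_signal, sysret_audit)
that this patch removes.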
Acked-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
---
arch/x86/kernel/entry_64.S | 52 +++++-----------------------------------------
1 file changed, 5 insertions(+), 47 deletions(-)
diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index eeab4cf8b2c9..db13655c3a2a 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -361,15 +361,12 @@ system_call_fastpath:
* Has incomplete stack frame and undefined top of stack.
*/
ret_from_sys_call:
- movl $_TIF_ALLWORK_MASK,%edi
- /* edi: flagmask */
-sysret_check:
+ testl $_TIF_ALLWORK_MASK,TI_flags+THREAD_INFO(%rsp,RIP-ARGOFFSET)
jnz int_ret_from_sys_call_fixup /* Go to the slow path */
+
LOCKDEP_SYS_EXIT
DISABLE_INTERRUPTS(CLBR_NONE)
TRACE_IRQS_OFF
- movl TI_flags+THREAD_INFO(%rsp,RIP-ARGOFFSET),%edx
- andl %edi,%edx
- jnz sysret_careful
CFI_REMEMBER_STATE
/*
* sysretq will re-enable interrupts:
@@ -383,49 +380,10 @@ sysret_check:
USERGS_SYSRET64
CFI_RESTORE_STATE
- /* Handle reschedules */
- /* edx: work, edi: workmask */
-sysret_careful:
- bt $TIF_NEED_RESCHED,%edx
- jnc sysret_signal
- TRACE_IRQS_ON
- ENABLE_INTERRUPTS(CLBR_NONE)
- pushq_cfi %rdi
- SCHEDULE_USER
- popq_cfi %rdi
- jmp sysret_check
- /* Handle a signal */
-sysret_signal:
- TRACE_IRQS_ON
- ENABLE_INTERRUPTS(CLBR_NONE)
-#ifdef CONFIG_AUDITSYSCALL
- bt $TIF_SYSCALL_AUDIT,%edx
- jc sysret_audit
-#endif
- /*
- * We have a signal, or exit tracing or single-step.
- * These all wind up with the iret return path anyway,
- * so just join that path right now.
- */
+int_ret_from_sys_call_fixup:
FIXUP_TOP_OF_STACK %r11, -ARGOFFSET
- jmp int_check_syscall_exit_work
-
-#ifdef CONFIG_AUDITSYSCALL
- /*
- * Return fast path for syscall audit. Call __audit_syscall_exit()
- * directly and then jump back to the fast path with TIF_SYSCALL_AUDIT
- * masked off.
- */
-sysret_audit:
- movq RAX-ARGOFFSET(%rsp),%rsi /* second arg, syscall return value */
- cmpq $-MAX_ERRNO,%rsi /* is it < -MAX_ERRNO? */
- setbe %al /* 1 if so, 0 if not */
- movzbl %al,%edi /* zero-extend that into %edi */
- call __audit_syscall_exit
- movl $(_TIF_ALLWORK_MASK & ~_TIF_SYSCALL_AUDIT),%edi
- jmp sysret_check
-#endif /* CONFIG_AUDITSYSCALL */
+ jmp int_ret_from_sys_call
/* Do syscall tracing */
tracesys:
--
2.1.0
* [tip:x86/asm] x86_64 entry: Fix RCX for ptraced syscalls
2015-01-17 0:19 ` [PATCH v2 1/3] x86_64,entry: Fix RCX for traced syscalls Andy Lutomirski
@ 2015-01-17 10:13 ` tip-bot for Andy Lutomirski
0 siblings, 0 replies; 9+ messages in thread
From: tip-bot for Andy Lutomirski @ 2015-01-17 10:13 UTC (permalink / raw)
To: linux-tip-commits
Cc: fweisbec, luto, mingo, hpa, bp, oleg, tglx, linux-kernel, torvalds, riel
Commit-ID: 0fcedc8631ec28ca25d3c0b116e8fa0c19dd5f6d
Gitweb: http://git.kernel.org/tip/0fcedc8631ec28ca25d3c0b116e8fa0c19dd5f6d
Author: Andy Lutomirski <luto@amacapital.net>
AuthorDate: Fri, 16 Jan 2015 16:19:27 -0800
Committer: Ingo Molnar <mingo@kernel.org>
CommitDate: Sat, 17 Jan 2015 11:02:53 +0100
x86_64 entry: Fix RCX for ptraced syscalls
The int_ret_from_sys_call and syscall tracing code disagree
with the sysret path as to the value of RCX.
The Intel SDM, the AMD APM, and my laptop all agree that sysret
returns with RCX == RIP. The syscall tracing code does not
respect this property.
For example, this program:
#include <stdio.h>
int main()
{
extern const char syscall_rip[];
unsigned long rcx = 1;
unsigned long orig_rcx = rcx;
asm ("mov $-1, %%eax\n\t"
"syscall\n\t"
"syscall_rip:"
: "+c" (rcx) : : "r11");
printf("syscall: RCX = %lX RIP = %lX orig RCX = %lx\n",
rcx, (unsigned long)syscall_rip, orig_rcx);
return 0;
}
prints:
syscall: RCX = 400556 RIP = 400556 orig RCX = 1
Running it under strace gives this instead:
syscall: RCX = FFFFFFFFFFFFFFFF RIP = 400556 orig RCX = 1
This changes FIXUP_TOP_OF_STACK to match sysret, causing the
test to show RCX == RIP even under strace.
It looks like this is a partial revert of:
88e4bc32686e ("[PATCH] x86-64 architecture specific sync for 2.5.8")
from the historic git tree.
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
Cc: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Borislav Petkov <bp@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/c9a418c3dc3993cb88bb7773800225fd318a4c67.1421453410.git.luto@amacapital.net
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
arch/x86/kernel/entry_64.S | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index 9ebaf63..c653dc4 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -143,7 +143,8 @@ ENDPROC(native_usergs_sysret64)
movq \tmp,RSP+\offset(%rsp)
movq $__USER_DS,SS+\offset(%rsp)
movq $__USER_CS,CS+\offset(%rsp)
- movq $-1,RCX+\offset(%rsp)
+ movq RIP+\offset(%rsp),\tmp /* get rip */
+ movq \tmp,RCX+\offset(%rsp) /* copy it to rcx as sysret would do */
movq R11+\offset(%rsp),\tmp /* get eflags */
movq \tmp,EFLAGS+\offset(%rsp)
.endm
* Re: [PATCH v2 0/3] x86_64,entry: Rearrange the syscall exit optimizations
2015-01-17 0:19 [PATCH v2 0/3] x86_64,entry: Rearrange the syscall exit optimizations Andy Lutomirski
` (2 preceding siblings ...)
2015-01-17 0:19 ` [PATCH v2 3/3] x86_64,entry: Remove the syscall exit audit and schedule optimizations Andy Lutomirski
@ 2015-01-17 11:15 ` Linus Torvalds
3 siblings, 0 replies; 9+ messages in thread
From: Linus Torvalds @ 2015-01-17 11:15 UTC (permalink / raw)
To: Andy Lutomirski
Cc: the arch/x86 maintainers, Linux Kernel Mailing List,
Frédéric Weisbecker, Oleg Nesterov, kvm list,
Borislav Petkov, Rik van Riel
On Sat, Jan 17, 2015 at 1:19 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> Linus, I suspect you'll either like or hate this series. Or maybe
> you'll think it's crazy but you'll like it anyway. I'm curious
> which of those is the case. :)
I have no hugely strong reaction to the patches, but it seems to be a
good simplification of our model, in addition to allowing sysret for
more cases. So Ack, as far as I'm concerned.
Linus
* [tip:x86/asm] x86_64, entry: Use sysret to return to userspace when possible
2015-01-17 0:19 ` [PATCH v2 2/3] x86_64,entry: Use sysret to return to userspace when possible Andy Lutomirski
@ 2015-02-04 6:01 ` tip-bot for Andy Lutomirski
2015-02-06 18:56 ` Fwd: " Andy Lutomirski
0 siblings, 1 reply; 9+ messages in thread
From: tip-bot for Andy Lutomirski @ 2015-02-04 6:01 UTC (permalink / raw)
To: linux-tip-commits; +Cc: tglx, luto, linux-kernel, mingo, hpa
Commit-ID: 2a23c6b8a9c42620182a2d2cfc7c16f6ff8c42b4
Gitweb: http://git.kernel.org/tip/2a23c6b8a9c42620182a2d2cfc7c16f6ff8c42b4
Author: Andy Lutomirski <luto@amacapital.net>
AuthorDate: Tue, 22 Jul 2014 12:46:50 -0700
Committer: Andy Lutomirski <luto@amacapital.net>
CommitDate: Sun, 1 Feb 2015 04:03:01 -0800
x86_64, entry: Use sysret to return to userspace when possible
The x86_64 entry code currently jumps through complex and
inconsistent hoops to try to minimize the impact of syscall exit
work. For a true fast-path syscall, almost nothing needs to be
done, so returning is just a check for exit work and sysret. For a
full slow-path return from a syscall, the C exit hook is invoked if
needed and we join the iret path.
Using iret to return to userspace is very slow, so the entry code
has accumulated various special cases to try to do certain forms of
exit work without invoking iret. This is error-prone, since it
duplicates assembly code paths, and it's dangerous, since sysret
can malfunction in interesting ways if used carelessly. It's
also inefficient, since a lot of useful cases aren't optimized
and therefore force an iret out of a combination of paranoia and
the fact that no one has bothered to write even more asm code
to avoid it.
I would argue that this approach is backwards. Rather than trying
to avoid the iret path, we should instead try to make the iret path
fast. Under a specific set of conditions, iret is unnecessary. In
particular, if RIP==RCX, RFLAGS==R11, RIP is canonical, RF is not
set, and both SS and CS are as expected, then
movq 32(%rsp),%rsp;sysret does the same thing as iret. This set of
conditions is nearly always satisfied on return from syscalls, and
it can even occasionally be satisfied on return from an irq.
Even with the careful checks for sysret applicability, this cuts
nearly 80ns off of the overhead from syscalls with unoptimized exit
work. This includes tracing and context tracking, and any return
that invokes KVM's user return notifier. For example, the cost of
getpid with CONFIG_CONTEXT_TRACKING_FORCE=y drops from ~360ns to
~280ns on my computer.
This may allow the removal and even eventual conversion to C
of a respectable amount of exit asm.
This may require further tweaking to give the full benefit on Xen.
It may be worthwhile to adjust signal delivery and exec to try to
hit the sysret path.
This does not optimize returns to 32-bit userspace. Making the same
optimization for CS == __USER32_CS is conceptually straightforward,
but it will require some tedious code to handle the differences
between sysretl and sysexitl.
Link: http://lkml.kernel.org/r/71428f63e681e1b4aa1a781e3ef7c27f027d1103.1421453410.git.luto@amacapital.net
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
---
arch/x86/kernel/entry_64.S | 54 ++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 54 insertions(+)
diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index 501212f..eeab4cf 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -794,6 +794,60 @@ retint_swapgs: /* return to user-space */
*/
DISABLE_INTERRUPTS(CLBR_ANY)
TRACE_IRQS_IRETQ
+
+ /*
+ * Try to use SYSRET instead of IRET if we're returning to
+ * a completely clean 64-bit userspace context.
+ */
+ movq (RCX-R11)(%rsp), %rcx
+ cmpq %rcx,(RIP-R11)(%rsp) /* RCX == RIP */
+ jne opportunistic_sysret_failed
+
+ /*
+ * On Intel CPUs, sysret with non-canonical RCX/RIP will #GP
+ * in kernel space. This essentially lets the user take over
+ * the kernel, since userspace controls RSP. It's not worth
+ * testing for canonicalness exactly -- this check detects any
+ * of the 17 high bits set, which is true for non-canonical
+ * or kernel addresses. (This will pessimize vsyscall=native.
+ * Big deal.)
+ *
+ * If virtual addresses ever become wider, this will need
+ * to be updated to remain correct on both old and new CPUs.
+ */
+ .ifne __VIRTUAL_MASK_SHIFT - 47
+ .error "virtual address width changed -- sysret checks need update"
+ .endif
+ shr $__VIRTUAL_MASK_SHIFT, %rcx
+ jnz opportunistic_sysret_failed
+
+ cmpq $__USER_CS,(CS-R11)(%rsp) /* CS must match SYSRET */
+ jne opportunistic_sysret_failed
+
+ movq (R11-ARGOFFSET)(%rsp), %r11
+ cmpq %r11,(EFLAGS-ARGOFFSET)(%rsp) /* R11 == RFLAGS */
+ jne opportunistic_sysret_failed
+
+ testq $X86_EFLAGS_RF,%r11 /* sysret can't restore RF */
+ jnz opportunistic_sysret_failed
+
+ /* nothing to check for RSP */
+
+ cmpq $__USER_DS,(SS-ARGOFFSET)(%rsp) /* SS must match SYSRET */
+ jne opportunistic_sysret_failed
+
+ /*
+ * We win! This label is here just for ease of understanding
+ * perf profiles. Nothing jumps here.
+ */
+irq_return_via_sysret:
+ CFI_REMEMBER_STATE
+ RESTORE_ARGS 1,8,1
+ movq (RSP-RIP)(%rsp),%rsp
+ USERGS_SYSRET64
+ CFI_RESTORE_STATE
+
+opportunistic_sysret_failed:
SWAPGS
jmp restore_args
* [tip:x86/asm] x86_64, entry: Remove the syscall exit audit and schedule optimizations
2015-01-17 0:19 ` [PATCH v2 3/3] x86_64,entry: Remove the syscall exit audit and schedule optimizations Andy Lutomirski
@ 2015-02-04 6:02 ` tip-bot for Andy Lutomirski
0 siblings, 0 replies; 9+ messages in thread
From: tip-bot for Andy Lutomirski @ 2015-02-04 6:02 UTC (permalink / raw)
To: linux-tip-commits; +Cc: bp, luto, hpa, mingo, linux-kernel, tglx
Commit-ID: 96b6352c12711d5c0bb7157f49c92580248e8146
Gitweb: http://git.kernel.org/tip/96b6352c12711d5c0bb7157f49c92580248e8146
Author: Andy Lutomirski <luto@amacapital.net>
AuthorDate: Mon, 7 Jul 2014 11:37:17 -0700
Committer: Andy Lutomirski <luto@amacapital.net>
CommitDate: Sun, 1 Feb 2015 04:03:02 -0800
x86_64, entry: Remove the syscall exit audit and schedule optimizations
We used to optimize rescheduling and audit on syscall exit. Now
that the full slow path is reasonably fast, remove these
optimizations. Syscall exit auditing is now handled exclusively by
syscall_trace_leave.
This adds something like 10ns to the previously optimized paths on
my computer, presumably due mostly to SAVE_REST / RESTORE_REST.
I think that we should eventually replace both the syscall and
non-paranoid interrupt exit slow paths with a pair of C functions
along the lines of the syscall entry hooks.
Link: http://lkml.kernel.org/r/22f2aa4a0361707a5cfb1de9d45260b39965dead.1421453410.git.luto@amacapital.net
Acked-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Andy Lutomirski <luto@amacapital.net>
---
arch/x86/kernel/entry_64.S | 52 +++++-----------------------------------------
1 file changed, 5 insertions(+), 47 deletions(-)
diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index eeab4cf..db13655 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -361,15 +361,12 @@ system_call_fastpath:
* Has incomplete stack frame and undefined top of stack.
*/
ret_from_sys_call:
- movl $_TIF_ALLWORK_MASK,%edi
- /* edi: flagmask */
-sysret_check:
+ testl $_TIF_ALLWORK_MASK,TI_flags+THREAD_INFO(%rsp,RIP-ARGOFFSET)
jnz int_ret_from_sys_call_fixup /* Go to the slow path */
+
LOCKDEP_SYS_EXIT
DISABLE_INTERRUPTS(CLBR_NONE)
TRACE_IRQS_OFF
- movl TI_flags+THREAD_INFO(%rsp,RIP-ARGOFFSET),%edx
- andl %edi,%edx
- jnz sysret_careful
CFI_REMEMBER_STATE
/*
* sysretq will re-enable interrupts:
@@ -383,49 +380,10 @@ sysret_check:
USERGS_SYSRET64
CFI_RESTORE_STATE
- /* Handle reschedules */
- /* edx: work, edi: workmask */
-sysret_careful:
- bt $TIF_NEED_RESCHED,%edx
- jnc sysret_signal
- TRACE_IRQS_ON
- ENABLE_INTERRUPTS(CLBR_NONE)
- pushq_cfi %rdi
- SCHEDULE_USER
- popq_cfi %rdi
- jmp sysret_check
- /* Handle a signal */
-sysret_signal:
- TRACE_IRQS_ON
- ENABLE_INTERRUPTS(CLBR_NONE)
-#ifdef CONFIG_AUDITSYSCALL
- bt $TIF_SYSCALL_AUDIT,%edx
- jc sysret_audit
-#endif
- /*
- * We have a signal, or exit tracing or single-step.
- * These all wind up with the iret return path anyway,
- * so just join that path right now.
- */
+int_ret_from_sys_call_fixup:
FIXUP_TOP_OF_STACK %r11, -ARGOFFSET
- jmp int_check_syscall_exit_work
-
-#ifdef CONFIG_AUDITSYSCALL
- /*
- * Return fast path for syscall audit. Call __audit_syscall_exit()
- * directly and then jump back to the fast path with TIF_SYSCALL_AUDIT
- * masked off.
- */
-sysret_audit:
- movq RAX-ARGOFFSET(%rsp),%rsi /* second arg, syscall return value */
- cmpq $-MAX_ERRNO,%rsi /* is it < -MAX_ERRNO? */
- setbe %al /* 1 if so, 0 if not */
- movzbl %al,%edi /* zero-extend that into %edi */
- call __audit_syscall_exit
- movl $(_TIF_ALLWORK_MASK & ~_TIF_SYSCALL_AUDIT),%edi
- jmp sysret_check
-#endif /* CONFIG_AUDITSYSCALL */
+ jmp int_ret_from_sys_call
/* Do syscall tracing */
tracesys:
* Fwd: [tip:x86/asm] x86_64, entry: Use sysret to return to userspace when possible
2015-02-04 6:01 ` [tip:x86/asm] x86_64, entry: " tip-bot for Andy Lutomirski
@ 2015-02-06 18:56 ` Andy Lutomirski
0 siblings, 0 replies; 9+ messages in thread
From: Andy Lutomirski @ 2015-02-06 18:56 UTC (permalink / raw)
To: kvm list
In case you're interested, this change (queued for 3.20) should cut a
couple hundred cycles off of kvm heavyweight exits.
--Andy
--
Andy Lutomirski
AMA Capital Management, LLC