All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/3] x86_64,entry: Rearrange the syscall exit optimizations
@ 2014-11-07 23:58 Andy Lutomirski
  2014-11-07 23:58 ` [PATCH 1/3] x86_64,entry: Fix RCX for traced syscalls Andy Lutomirski
                   ` (3 more replies)
  0 siblings, 4 replies; 17+ messages in thread
From: Andy Lutomirski @ 2014-11-07 23:58 UTC (permalink / raw)
  To: x86, linux-kernel
  Cc: Frédéric Weisbecker, Oleg Nesterov, kvm list, Andy Lutomirski

The syscall exit asm is a big mess.  There's a really fast path, some
kind of fast path code (with a hard-coded optimization for audit), and
the really slow path.  The result is that it's very hard to work with
this code.  There are some asm paths that are much slower than they
should be (context tracking is a major offender), but no one really
wants to add even more asm to speed them up.

This series takes a different, unorthodox approach.  Rather than trying
to avoid entering the very slow iret path, it adds a way back out of the
iret path.  The result is a dramatic speedup for context tracking, user
return notification, and similar code, as the cost of a few lines of
tricky asm.  Nonetheless, it's barely a net addition of asm code,
because we get to remove the fast path optimizations for audit and
rescheduling.

Thoughts?  If this works, it opens the door for a lot of further
consolidation of the exit code.

Note: patch 1 in this series has been floating around on the list
for quite a while.  It's mandatory for this series to work, because
the buglet that it fixes almost completely defeats the optimization
that I'm introducing.

Andy Lutomirski (3):
  x86_64,entry: Fix RCX for traced syscalls
  x86_64,entry: Use sysret to return to userspace when possible
  x86_64,entry: Remove the syscall exit audit and schedule optimizations

 arch/x86/kernel/entry_64.S | 103 ++++++++++++++++++++++++---------------------
 1 file changed, 55 insertions(+), 48 deletions(-)

-- 
1.9.3


^ permalink raw reply	[flat|nested] 17+ messages in thread

* [PATCH 1/3] x86_64,entry: Fix RCX for traced syscalls
  2014-11-07 23:58 [PATCH 0/3] x86_64,entry: Rearrange the syscall exit optimizations Andy Lutomirski
@ 2014-11-07 23:58 ` Andy Lutomirski
  2015-01-05 12:59   ` Borislav Petkov
  2014-11-07 23:58 ` [PATCH 2/3] x86_64,entry: Use sysret to return to userspace when possible Andy Lutomirski
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 17+ messages in thread
From: Andy Lutomirski @ 2014-11-07 23:58 UTC (permalink / raw)
  To: x86, linux-kernel
  Cc: Frédéric Weisbecker, Oleg Nesterov, kvm list, Andy Lutomirski

The int_ret_from_sys_call and syscall tracing code disagrees with
the sysret path as to the value of RCX.

The Intel SDM, the AMD APM, and my laptop all agree that sysret
returns with RCX == RIP.  The syscall tracing code does not respect
this property.

For example, this program:

int main()
{
	extern const char syscall_rip[];
	unsigned long rcx = 1;
	unsigned long orig_rcx = rcx;
	asm ("mov $-1, %%eax\n\t"
	     "syscall\n\t"
	     "syscall_rip:"
	     : "+c" (rcx) : : "r11");
	printf("syscall: RCX = %lX  RIP = %lX  orig RCX = %lx\n",
	       rcx, (unsigned long)syscall_rip, orig_rcx);
	return 0;
}

prints:
syscall: RCX = 400556  RIP = 400556  orig RCX = 1

Running it under strace gives this instead:
syscall: RCX = FFFFFFFFFFFFFFFF  RIP = 400556  orig RCX = 1

This changes FIXUP_TOP_OF_STACK to match sysret, causing the test to
show RCX == RIP even under strace.

Signed-off-by: Andy Lutomirski <luto@amacapital.net>
---
 arch/x86/kernel/entry_64.S | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index df088bb03fb3..3710b8241945 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -143,7 +143,8 @@ ENDPROC(native_usergs_sysret64)
 	movq \tmp,RSP+\offset(%rsp)
 	movq $__USER_DS,SS+\offset(%rsp)
 	movq $__USER_CS,CS+\offset(%rsp)
-	movq $-1,RCX+\offset(%rsp)
+	movq RIP+\offset(%rsp),\tmp  /* get rip */
+	movq \tmp,RCX+\offset(%rsp)  /* copy it to rcx as sysret would do */
 	movq R11+\offset(%rsp),\tmp  /* get eflags */
 	movq \tmp,EFLAGS+\offset(%rsp)
 	.endm
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH 2/3] x86_64,entry: Use sysret to return to userspace when possible
  2014-11-07 23:58 [PATCH 0/3] x86_64,entry: Rearrange the syscall exit optimizations Andy Lutomirski
  2014-11-07 23:58 ` [PATCH 1/3] x86_64,entry: Fix RCX for traced syscalls Andy Lutomirski
@ 2014-11-07 23:58 ` Andy Lutomirski
  2015-01-08 12:29   ` Borislav Petkov
  2015-01-09 10:40   ` Borislav Petkov
  2014-11-07 23:58 ` [PATCH 3/3] x86_64,entry: Remove the syscall exit audit and schedule optimizations Andy Lutomirski
  2014-12-04  1:42 ` [PATCH 0/3] x86_64,entry: Rearrange the syscall exit optimizations Andy Lutomirski
  3 siblings, 2 replies; 17+ messages in thread
From: Andy Lutomirski @ 2014-11-07 23:58 UTC (permalink / raw)
  To: x86, linux-kernel
  Cc: Frédéric Weisbecker, Oleg Nesterov, kvm list, Andy Lutomirski

The x86_64 entry code currently jumps through complex and
inconsisnent hoops to try to minimize the impact of syscall exit
work.  For a true fast-path syscall, almost nothing needs to be
done, so returning is just a check for exit work and sysret.  For a
full slow-path return from a syscall, the C exit hook is invoked if
needed and we join the iret path.

Using iret to return to userspace is very slow, so the entry code
has accumulated various special cases to try to do certain forms of
exit work without invoking iret.  This is error-prone, since it
duplicates assembly code paths, and it's dangerous, since sysret
can malfunction in interesting ways if used carelessly.  It's
also inefficient, since a lot of useful cases aren't optimized
and therefore force an iret out of a combination of paranoia and
the fact that no one has bothered to write even more asm code
to avoid it.

I would argue that this approach is backwards.  Rather than
trying to avoid the iret path, we should instead try to make
the iret path fast.  Under a specific set of conditions, iret
is unnecessary.  In particular, if RIP==RCX, RFLAGS==R11, RIP is canonical, RF is not set, and both
SS and CS are as expected, then movq 32(%rsp),%rsp;sysret does the
same thing as iret.  This set of conditions is nearly always satisfied
on return from syscalls, and it can even occasionally be satisfied on
return from an irq.

Even with the careful checks for sysret applicability, this cuts
nearly 80ns off of the overhead from syscalls with unoptimized exit
work.  This includes tracing and context tracking, and any return
that invokes KVM's user return notifier.  For example, the cost of
getpid with CONFIG_CONTEXT_TRACKING_FORCE=y drops from ~360ns to
~280ns on my computer.

This may allow the removal and even eventual conversion to C
of a respectable amount of exit asm.

This may require further tweaking to give the full benefit on Xen.

It may be worthwhile to adjust signal delivery and exec to try hit
the sysret path.

This does not optimize returns to 32-bit userspace.  Making the same
optimization for CS == __USER32_CS is conceptually straightforward,
but it will require some tedious code to handle the differences
between sysretl and sysexitl.

Signed-off-by: Andy Lutomirski <luto@amacapital.net>
---
 arch/x86/kernel/entry_64.S | 48 ++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 48 insertions(+)

diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index 3710b8241945..a5afdf0f7fa4 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -804,6 +804,54 @@ retint_swapgs:		/* return to user-space */
 	 */
 	DISABLE_INTERRUPTS(CLBR_ANY)
 	TRACE_IRQS_IRETQ
+
+	/*
+	 * Try to use SYSRET instead of IRET if we're returning to
+	 * a completely clean 64-bit userspace context.
+	 */
+	movq (RCX-R11)(%rsp), %rcx
+	cmpq %rcx,(RIP-R11)(%rsp)		/* RCX == RIP */
+	jne opportunistic_sysret_failed
+
+	/*
+	 * On Intel CPUs, sysret with non-canonical RCX/RIP will #GP
+	 * in kernel space.  This essentially lets the user take over
+	 * the kernel, since userspace controls RSP.  It's not worth
+	 * testing for canonicalness exactly -- this check detects any
+	 * of the 17 high bits set, which is true for non-canonical
+	 * or kernel addresses.  (This will pessimize vsyscall=native.
+	 * Big deal.)
+	 */
+	shr $47, %rcx
+	jnz opportunistic_sysret_failed
+
+	cmpq $__USER_CS,(CS-R11)(%rsp)		/* CS must match SYSRET */
+	jne opportunistic_sysret_failed
+
+	movq (R11-R11)(%rsp), %r11
+	cmpq %r11,(EFLAGS-R11)(%rsp)		/* R11 == RFLAGS */
+	jne opportunistic_sysret_failed
+
+	testq $X86_EFLAGS_RF,%r11		/* sysret can't restore RF */
+	jnz opportunistic_sysret_failed
+
+	/* nothing to check for RSP */
+
+	cmpq $__USER_DS,(SS-R11)(%rsp)		/* SS must match SYSRET */
+	jne opportunistic_sysret_failed
+
+	/*
+	 * We win!  This label is here just for ease of understanding
+	 * perf profiles.  Nothing jumps here.
+	 */
+irq_return_via_sysret:
+	CFI_REMEMBER_STATE
+	RESTORE_ARGS 1,8,1
+	movq (RSP-RIP)(%rsp),%rsp
+	USERGS_SYSRET64
+	CFI_RESTORE_STATE
+
+opportunistic_sysret_failed:
 	SWAPGS
 	jmp restore_args
 
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* [PATCH 3/3] x86_64,entry: Remove the syscall exit audit and schedule optimizations
  2014-11-07 23:58 [PATCH 0/3] x86_64,entry: Rearrange the syscall exit optimizations Andy Lutomirski
  2014-11-07 23:58 ` [PATCH 1/3] x86_64,entry: Fix RCX for traced syscalls Andy Lutomirski
  2014-11-07 23:58 ` [PATCH 2/3] x86_64,entry: Use sysret to return to userspace when possible Andy Lutomirski
@ 2014-11-07 23:58 ` Andy Lutomirski
  2015-01-09 15:53   ` Borislav Petkov
  2014-12-04  1:42 ` [PATCH 0/3] x86_64,entry: Rearrange the syscall exit optimizations Andy Lutomirski
  3 siblings, 1 reply; 17+ messages in thread
From: Andy Lutomirski @ 2014-11-07 23:58 UTC (permalink / raw)
  To: x86, linux-kernel
  Cc: Frédéric Weisbecker, Oleg Nesterov, kvm list, Andy Lutomirski

We used to optimize rescheduling and audit on syscall exit.  Now that
the full slow path is reasonably fast, remove these optimizations.

This adds something like 10ns to the previously optimized paths on my
computer, presumably due mostly to SAVE_REST / RESTORE_REST.

I think that we should eventually replace both the syscall and
non-paranoid interrupt exit slow paths with a pair of C functions
along the lines of the syscall entry hooks.

Signed-off-by: Andy Lutomirski <luto@amacapital.net>
---
 arch/x86/kernel/entry_64.S | 52 +++++-----------------------------------------
 1 file changed, 5 insertions(+), 47 deletions(-)

diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index a5afdf0f7fa4..222dc5c45ac3 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -427,15 +427,12 @@ system_call_fastpath:
  * Has incomplete stack frame and undefined top of stack.
  */
 ret_from_sys_call:
-	movl $_TIF_ALLWORK_MASK,%edi
-	/* edi:	flagmask */
-sysret_check:
+	testl $_TIF_ALLWORK_MASK,TI_flags+THREAD_INFO(%rsp,RIP-ARGOFFSET)
+	jnz int_ret_from_sys_call_fixup	/* Go the the slow path */
+
 	LOCKDEP_SYS_EXIT
 	DISABLE_INTERRUPTS(CLBR_NONE)
 	TRACE_IRQS_OFF
-	movl TI_flags+THREAD_INFO(%rsp,RIP-ARGOFFSET),%edx
-	andl %edi,%edx
-	jnz  sysret_careful
 	CFI_REMEMBER_STATE
 	/*
 	 * sysretq will re-enable interrupts:
@@ -449,49 +446,10 @@ sysret_check:
 	USERGS_SYSRET64
 
 	CFI_RESTORE_STATE
-	/* Handle reschedules */
-	/* edx:	work, edi: workmask */
-sysret_careful:
-	bt $TIF_NEED_RESCHED,%edx
-	jnc sysret_signal
-	TRACE_IRQS_ON
-	ENABLE_INTERRUPTS(CLBR_NONE)
-	pushq_cfi %rdi
-	SCHEDULE_USER
-	popq_cfi %rdi
-	jmp sysret_check
 
-	/* Handle a signal */
-sysret_signal:
-	TRACE_IRQS_ON
-	ENABLE_INTERRUPTS(CLBR_NONE)
-#ifdef CONFIG_AUDITSYSCALL
-	bt $TIF_SYSCALL_AUDIT,%edx
-	jc sysret_audit
-#endif
-	/*
-	 * We have a signal, or exit tracing or single-step.
-	 * These all wind up with the iret return path anyway,
-	 * so just join that path right now.
-	 */
+int_ret_from_sys_call_fixup:
 	FIXUP_TOP_OF_STACK %r11, -ARGOFFSET
-	jmp int_check_syscall_exit_work
-
-#ifdef CONFIG_AUDITSYSCALL
-	/*
-	 * Return fast path for syscall audit.  Call __audit_syscall_exit()
-	 * directly and then jump back to the fast path with TIF_SYSCALL_AUDIT
-	 * masked off.
-	 */
-sysret_audit:
-	movq RAX-ARGOFFSET(%rsp),%rsi	/* second arg, syscall return value */
-	cmpq $-MAX_ERRNO,%rsi	/* is it < -MAX_ERRNO? */
-	setbe %al		/* 1 if so, 0 if not */
-	movzbl %al,%edi		/* zero-extend that into %edi */
-	call __audit_syscall_exit
-	movl $(_TIF_ALLWORK_MASK & ~_TIF_SYSCALL_AUDIT),%edi
-	jmp sysret_check
-#endif	/* CONFIG_AUDITSYSCALL */
+	jmp int_ret_from_sys_call
 
 	/* Do syscall tracing */
 tracesys:
-- 
1.9.3


^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH 0/3] x86_64,entry: Rearrange the syscall exit optimizations
  2014-11-07 23:58 [PATCH 0/3] x86_64,entry: Rearrange the syscall exit optimizations Andy Lutomirski
                   ` (2 preceding siblings ...)
  2014-11-07 23:58 ` [PATCH 3/3] x86_64,entry: Remove the syscall exit audit and schedule optimizations Andy Lutomirski
@ 2014-12-04  1:42 ` Andy Lutomirski
  3 siblings, 0 replies; 17+ messages in thread
From: Andy Lutomirski @ 2014-12-04  1:42 UTC (permalink / raw)
  To: X86 ML, linux-kernel, Thomas Gleixner, H. Peter Anvin,
	Eric Paris, Richard Guy Briggs
  Cc: Frédéric Weisbecker, Oleg Nesterov, kvm list, Andy Lutomirski

[adding potentially interested people]

On Fri, Nov 7, 2014 at 3:58 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> The syscall exit asm is a big mess.  There's a really fast path, some
> kind of fast path code (with a hard-coded optimization for audit), and
> the really slow path.  The result is that it's very hard to work with
> this code.  There are some asm paths that are much slower than they
> should be (context tracking is a major offender), but no one really
> wants to add even more asm to speed them up.
>
> This series takes a different, unorthodox approach.  Rather than trying
> to avoid entering the very slow iret path, it adds a way back out of the
> iret path.  The result is a dramatic speedup for context tracking, user
> return notification, and similar code, as the cost of a few lines of
> tricky asm.  Nonetheless, it's barely a net addition of asm code,
> because we get to remove the fast path optimizations for audit and
> rescheduling.
>
> Thoughts?  If this works, it opens the door for a lot of further
> consolidation of the exit code.
>
> Note: patch 1 in this series has been floating around on the list
> for quite a while.  It's mandatory for this series to work, because
> the buglet that it fixes almost completely defeats the optimization
> that I'm introducing.

It turns out that sysret_audit may be rather buggy.  I think it leaves
edx and edi in a confused state, and it interacts badly with
SCHEDULE_USER if context tracking is on.  My preferred long-term
solution is to delete sysret_audit entirely, which this patch set
does.  Can you (x86 people and people who, for reasons that escape me,
enjoy reviewing this stuff) take a look?

This clearly isn't 3.18 material, and it may want to soak in -next
(can -tip do that?  I can do it myself, I suppose), but it might also
be a good idea to try to do this for 3.19 to get rid of sysret_audit.

For those who haven't followed all the recent threads: the asm that's
deleted in patch 3 currently has a nasty RCU + context tracking +
audit bug that has become much easier to trigger as a result of the
seccomp changes in 3.18.  This isn't directly a bug in the seccomp
changes -- it's just that the seccomp changes make it much easier to
cause the offending ask to be executed.

--Andy

>
> Andy Lutomirski (3):
>   x86_64,entry: Fix RCX for traced syscalls
>   x86_64,entry: Use sysret to return to userspace when possible
>   x86_64,entry: Remove the syscall exit audit and schedule optimizations
>
>  arch/x86/kernel/entry_64.S | 103 ++++++++++++++++++++++++---------------------
>  1 file changed, 55 insertions(+), 48 deletions(-)
>
> --
> 1.9.3
>



-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 1/3] x86_64,entry: Fix RCX for traced syscalls
  2014-11-07 23:58 ` [PATCH 1/3] x86_64,entry: Fix RCX for traced syscalls Andy Lutomirski
@ 2015-01-05 12:59   ` Borislav Petkov
  2015-01-05 20:31     ` Andy Lutomirski
  0 siblings, 1 reply; 17+ messages in thread
From: Borislav Petkov @ 2015-01-05 12:59 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: x86, linux-kernel, Frédéric Weisbecker, Oleg Nesterov,
	kvm list

On Fri, Nov 07, 2014 at 03:58:17PM -0800, Andy Lutomirski wrote:
> The int_ret_from_sys_call and syscall tracing code disagrees with
> the sysret path as to the value of RCX.
> 
> The Intel SDM, the AMD APM, and my laptop all agree that sysret
> returns with RCX == RIP.  The syscall tracing code does not respect
> this property.
> 
> For example, this program:
> 
> int main()
> {
> 	extern const char syscall_rip[];
> 	unsigned long rcx = 1;
> 	unsigned long orig_rcx = rcx;
> 	asm ("mov $-1, %%eax\n\t"
> 	     "syscall\n\t"
> 	     "syscall_rip:"
> 	     : "+c" (rcx) : : "r11");
> 	printf("syscall: RCX = %lX  RIP = %lX  orig RCX = %lx\n",
> 	       rcx, (unsigned long)syscall_rip, orig_rcx);
> 	return 0;
> }
> 
> prints:
> syscall: RCX = 400556  RIP = 400556  orig RCX = 1
> 
> Running it under strace gives this instead:
> syscall: RCX = FFFFFFFFFFFFFFFF  RIP = 400556  orig RCX = 1

I can trigger the same even without tracing it:

syscall: RCX = FFFFFFFFFFFFFFFF  RIP = 40052C  orig RCX = 1

> This changes FIXUP_TOP_OF_STACK to match sysret, causing the test to
> show RCX == RIP even under strace.
> 
> Signed-off-by: Andy Lutomirski <luto@amacapital.net>
> ---
>  arch/x86/kernel/entry_64.S | 3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
> index df088bb03fb3..3710b8241945 100644
> --- a/arch/x86/kernel/entry_64.S
> +++ b/arch/x86/kernel/entry_64.S
> @@ -143,7 +143,8 @@ ENDPROC(native_usergs_sysret64)
>  	movq \tmp,RSP+\offset(%rsp)
>  	movq $__USER_DS,SS+\offset(%rsp)
>  	movq $__USER_CS,CS+\offset(%rsp)
> -	movq $-1,RCX+\offset(%rsp)
> +	movq RIP+\offset(%rsp),\tmp  /* get rip */
> +	movq \tmp,RCX+\offset(%rsp)  /* copy it to rcx as sysret would do */
>  	movq R11+\offset(%rsp),\tmp  /* get eflags */
>  	movq \tmp,EFLAGS+\offset(%rsp)
>  	.endm
> --

For some reason this patch is causing ata resets on by box, see the
end of this mail. So something's not kosher yet. If I boot the kernel
without it, it all seems ok.

Btw, this change got introduced in 2002 where it used to return rIP in
%rcx before, but it got changed to return -1 for rIP for some reason.

commit af53c7a2c81399b805b6d4eff887401a5e50feef
Author: Andi Kleen <ak@muc.de>
Date:   Fri Apr 19 20:23:17 2002 -0700

    [PATCH] x86-64 architecture specific sync for 2.5.8
    
    This patch brings 2.5.8 in sync with the x86-64 2.4 development tree again
    (excluding device drivers)
    
    It has lots of bug fixes and enhancements. It only touches architecture
    specific files.

...

diff --git a/arch/x86_64/kernel/entry.S b/arch/x86_64/kernel/entry.S
index 6b98b90891f4..16c6e3faf5a7 100644
--- a/arch/x86_64/kernel/entry.S
+++ b/arch/x86_64/kernel/entry.S
@@ -5,7 +5,7 @@
  *  Copyright (C) 2000, 2001, 2002  Andi Kleen SuSE Labs
  *  Copyright (C) 2000  Pavel Machek <pavel@suse.cz>
  * 
- *  $Id: entry.S,v 1.66 2001/11/11 17:47:47 ak Exp $           
+ *  $Id$
  */
 
 /*
@@ -39,8 +39,7 @@
 #include <asm/msr.h>
 #include <asm/unistd.h>
 #include <asm/thread_info.h>
-       
-#define RIP_SYMBOL_NAME(x) x(%rip)
+#include <asm/hw_irq.h>
 
        .code64
 
@@ -67,8 +66,7 @@
        movq    \tmp,RSP(%rsp)
        movq    $__USER_DS,SS(%rsp)
        movq    $__USER_CS,CS(%rsp)
-       movq    RCX(%rsp),\tmp  /* get return address */
-       movq    \tmp,RIP(%rsp)
+       movq    $-1,RCX(%rsp)
        movq    R11(%rsp),\tmp  /* get eflags */
        movq    \tmp,EFLAGS(%rsp)
        .endm
@@ -76,8 +74,6 @@
        .macro RESTORE_TOP_OF_STACK tmp,offset=0
        movq   RSP-\offset(%rsp),\tmp
        movq   \tmp,PDAREF(pda_oldrsp)
-       movq   RIP-\offset(%rsp),\tmp
-       movq   \tmp,RCX-\offset(%rsp)
        movq   EFLAGS-\offset(%rsp),\tmp
        movq   \tmp,R11-\offset(%rsp)
        .endm

---

[  180.059170] ata1.00: exception Emask 0x0 SAct 0x7fffffff SErr 0x0 action 0x6 frozen
[  180.066873] ata1.00: failed command: WRITE FPDMA QUEUED
[  180.072158] ata1.00: cmd 61/08:00:a8:ac:d9/00:00:23:00:00/40 tag 0 ncq 4096 out
[  180.072158]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[  180.086912] ata1.00: status: { DRDY }
[  180.090591] ata1.00: failed command: WRITE FPDMA QUEUED
[  180.095846] ata1.00: cmd 61/08:08:18:ae:d9/00:00:23:00:00/40 tag 1 ncq 4096 out
[  180.095846]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[  180.110603] ata1.00: status: { DRDY }
[  180.114283] ata1.00: failed command: WRITE FPDMA QUEUED
[  180.119539] ata1.00: cmd 61/10:10:f0:b1:d9/00:00:23:00:00/40 tag 2 ncq 8192 out
[  180.119539]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[  180.134292] ata1.00: status: { DRDY }
[  180.137973] ata1.00: failed command: WRITE FPDMA QUEUED
[  180.143226] ata1.00: cmd 61/08:18:00:98:18/00:00:1d:00:00/40 tag 3 ncq 4096 out
[  180.143226]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[  180.158105] ata1.00: status: { DRDY }
[  180.161809] ata1.00: failed command: WRITE FPDMA QUEUED
[  180.167071] ata1.00: cmd 61/10:20:18:98:18/00:00:1d:00:00/40 tag 4 ncq 8192 out
[  180.167071]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[  180.181822] ata1.00: status: { DRDY }
[  180.185503] ata1.00: failed command: WRITE FPDMA QUEUED
[  180.190756] ata1.00: cmd 61/a0:28:e0:7c:5d/25:00:1d:00:00/40 tag 5 ncq 4931584 out
[  180.190756]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[  180.205770] ata1.00: status: { DRDY }
[  180.209448] ata1.00: failed command: WRITE FPDMA QUEUED
[  180.214692] ata1.00: cmd 61/80:30:80:a2:5d/4a:00:1d:00:00/40 tag 6 ncq 9764864 out
[  180.214692]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[  180.229701] ata1.00: status: { DRDY }
[  180.233381] ata1.00: failed command: WRITE FPDMA QUEUED
[  180.238631] ata1.00: cmd 61/08:38:00:ed:5d/22:00:1d:00:00/40 tag 7 ncq 4460544 out
[  180.238631]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[  180.253647] ata1.00: status: { DRDY }
[  180.257321] ata1.00: failed command: WRITE FPDMA QUEUED
[  180.262576] ata1.00: cmd 61/08:40:38:b8:18/00:00:1d:00:00/40 tag 8 ncq 4096 out
[  180.262576]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[  180.277321] ata1.00: status: { DRDY }
[  180.281007] ata1.00: failed command: WRITE FPDMA QUEUED
[  180.286263] ata1.00: cmd 61/10:48:78:b8:18/00:00:1d:00:00/40 tag 9 ncq 8192 out
[  180.286263]          res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[  180.301013] ata1.00: status: { DRDY }
[  180.304693] ata1.00: failed command: WRITE FPDMA QUEUED
[  180.309947] ata1.00: cmd 61/08:50:60:98:d8/00:00:1d:00:00/40 tag 10 ncq 4096 out
[  180.309947]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[  180.324781] ata1.00: status: { DRDY }
[  180.328461] ata1.00: failed command: WRITE FPDMA QUEUED
[  180.333713] ata1.00: cmd 61/08:58:18:f8:a1/00:00:20:00:00/40 tag 11 ncq 4096 out
[  180.333713]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[  180.348550] ata1.00: status: { DRDY }
[  180.352235] ata1.00: failed command: WRITE FPDMA QUEUED
[  180.357484] ata1.00: cmd 61/08:60:00:88:a0/00:00:20:00:00/40 tag 12 ncq 4096 out
[  180.357484]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[  180.372318] ata1.00: status: { DRDY }
[  180.375997] ata1.00: failed command: WRITE FPDMA QUEUED
[  180.381251] ata1.00: cmd 61/48:68:c8:ac:dc/00:00:22:00:00/40 tag 13 ncq 36864 out
[  180.381251]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[  180.396203] ata1.00: status: { DRDY }
[  180.399882] ata1.00: failed command: WRITE FPDMA QUEUED
[  180.405134] ata1.00: cmd 61/08:70:40:c8:9d/00:00:20:00:00/40 tag 14 ncq 4096 out
[  180.405134]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[  180.419993] ata1.00: status: { DRDY }
[  180.423674] ata1.00: failed command: WRITE FPDMA QUEUED
[  180.428920] ata1.00: cmd 61/08:78:58:a7:9c/00:00:20:00:00/40 tag 15 ncq 4096 out
[  180.428920]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[  180.443989] ata1.00: status: { DRDY }
[  180.447892] ata1.00: failed command: WRITE FPDMA QUEUED
[  180.453367] ata1.00: cmd 61/08:80:28:c5:dd/00:00:25:00:00/40 tag 16 ncq 4096 out
[  180.453367]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[  180.469053] ata1.00: status: { DRDY }
[  180.472950] ata1.00: failed command: READ FPDMA QUEUED
[  180.478330] ata1.00: cmd 60/20:88:00:42:3c/00:00:1d:00:00/40 tag 17 ncq 16384 in
[  180.478330]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[  180.494011] ata1.00: status: { DRDY }
[  180.497910] ata1.00: failed command: WRITE FPDMA QUEUED
[  180.503378] ata1.00: cmd 61/10:90:80:98:d8/00:00:23:00:00/40 tag 18 ncq 8192 out
[  180.503378]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[  180.519072] ata1.00: status: { DRDY }
[  180.522970] ata1.00: failed command: WRITE FPDMA QUEUED
[  180.528434] ata1.00: cmd 61/08:98:88:99:d8/00:00:23:00:00/40 tag 19 ncq 4096 out
[  180.528434]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[  180.544106] ata1.00: status: { DRDY }
[  180.547997] ata1.00: failed command: WRITE FPDMA QUEUED
[  180.553462] ata1.00: cmd 61/08:a0:c8:99:d8/00:00:23:00:00/40 tag 20 ncq 4096 out
[  180.553462]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[  180.569137] ata1.00: status: { DRDY }
[  180.573030] ata1.00: failed command: WRITE FPDMA QUEUED
[  180.578508] ata1.00: cmd 61/08:a8:e0:9c:d8/00:00:23:00:00/40 tag 21 ncq 4096 out
[  180.578508]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[  180.594221] ata1.00: status: { DRDY }
[  180.598126] ata1.00: failed command: WRITE FPDMA QUEUED
[  180.603591] ata1.00: cmd 61/10:b0:18:9f:d8/00:00:23:00:00/40 tag 22 ncq 8192 out
[  180.603591]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[  180.619266] ata1.00: status: { DRDY }
[  180.623157] ata1.00: failed command: WRITE FPDMA QUEUED
[  180.628614] ata1.00: cmd 61/08:b8:78:b2:d8/00:00:23:00:00/40 tag 23 ncq 4096 out
[  180.628614]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[  180.644286] ata1.00: status: { DRDY }
[  180.648178] ata1.00: failed command: WRITE FPDMA QUEUED
[  180.653642] ata1.00: cmd 61/08:c0:b8:a3:d9/00:00:23:00:00/40 tag 24 ncq 4096 out
[  180.653642]          res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[  180.669316] ata1.00: status: { DRDY }
[  180.673205] ata1.00: failed command: WRITE FPDMA QUEUED
[  180.678667] ata1.00: cmd 61/08:c8:00:a5:d9/00:00:23:00:00/40 tag 25 ncq 4096 out
[  180.678667]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[  180.694340] ata1.00: status: { DRDY }
[  180.698231] ata1.00: failed command: WRITE FPDMA QUEUED
[  180.703690] ata1.00: cmd 61/08:d0:78:a9:d9/00:00:23:00:00/40 tag 26 ncq 4096 out
[  180.703690]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[  180.719368] ata1.00: status: { DRDY }
[  180.723258] ata1.00: failed command: WRITE FPDMA QUEUED
[  180.728720] ata1.00: cmd 61/08:d8:10:aa:d9/00:00:23:00:00/40 tag 27 ncq 4096 out
[  180.728720]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[  180.744390] ata1.00: status: { DRDY }
[  180.748276] ata1.00: failed command: WRITE FPDMA QUEUED
[  180.753737] ata1.00: cmd 61/08:e0:20:aa:d9/00:00:23:00:00/40 tag 28 ncq 4096 out
[  180.753737]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[  180.769414] ata1.00: status: { DRDY }
[  180.773303] ata1.00: failed command: WRITE FPDMA QUEUED
[  180.778760] ata1.00: cmd 61/10:e8:20:ab:d9/00:00:23:00:00/40 tag 29 ncq 8192 out
[  180.778760]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[  180.794433] ata1.00: status: { DRDY }
[  180.798323] ata1.00: failed command: WRITE FPDMA QUEUED
[  180.803785] ata1.00: cmd 61/08:f0:90:ac:d9/00:00:23:00:00/40 tag 30 ncq 4096 out
[  180.803785]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[  180.819455] ata1.00: status: { DRDY }
[  180.823358] ata1: hard resetting link
[  181.131630] ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[  181.140035] ata1.00: configured for UDMA/133
[  181.144549] ata1.00: device reported invalid CHS sector 0
[  181.150207] ata1.00: device reported invalid CHS sector 0
[  181.155843] ata1.00: device reported invalid CHS sector 0
[  181.161481] ata1.00: device reported invalid CHS sector 0
[  181.167147] ata1.00: device reported invalid CHS sector 0
[  181.172782] ata1.00: device reported invalid CHS sector 0
[  181.178407] ata1.00: device reported invalid CHS sector 0
[  181.184033] ata1.00: device reported invalid CHS sector 0
[  181.189655] ata1.00: device reported invalid CHS sector 0
[  181.195267] ata1.00: device reported invalid CHS sector 0
[  181.200882] ata1.00: device reported invalid CHS sector 0
[  181.206489] ata1.00: device reported invalid CHS sector 0
[  181.212107] ata1.00: device reported invalid CHS sector 0
[  181.217720] ata1.00: device reported invalid CHS sector 0
[  181.223326] ata1.00: device reported invalid CHS sector 0
[  181.228936] ata1.00: device reported invalid CHS sector 0
[  181.234533] ata1.00: device reported invalid CHS sector 0
[  181.240134] ata1.00: device reported invalid CHS sector 0
[  181.245726] ata1.00: device reported invalid CHS sector 0
[  181.251323] ata1.00: device reported invalid CHS sector 0
[  181.256913] ata1.00: device reported invalid CHS sector 0
[  181.262505] ata1.00: device reported invalid CHS sector 0
[  181.268087] ata1.00: device reported invalid CHS sector 0
[  181.273672] ata1.00: device reported invalid CHS sector 0
[  181.279257] ata1.00: device reported invalid CHS sector 0
[  181.284836] ata1.00: device reported invalid CHS sector 0
[  181.290416] ata1.00: device reported invalid CHS sector 0
[  181.295988] ata1.00: device reported invalid CHS sector 0
[  181.301571] ata1.00: device reported invalid CHS sector 0
[  181.307139] ata1.00: device reported invalid CHS sector 0
[  181.312708] ata1.00: device reported invalid CHS sector 0
[  181.318426] ata1: EH complete

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

^ permalink raw reply related	[flat|nested] 17+ messages in thread

* Re: [PATCH 1/3] x86_64,entry: Fix RCX for traced syscalls
  2015-01-05 12:59   ` Borislav Petkov
@ 2015-01-05 20:31     ` Andy Lutomirski
  2015-01-06 15:34       ` Borislav Petkov
  0 siblings, 1 reply; 17+ messages in thread
From: Andy Lutomirski @ 2015-01-05 20:31 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: X86 ML, linux-kernel, Frédéric Weisbecker,
	Oleg Nesterov, kvm list

On Mon, Jan 5, 2015 at 4:59 AM, Borislav Petkov <bp@alien8.de> wrote:
> On Fri, Nov 07, 2014 at 03:58:17PM -0800, Andy Lutomirski wrote:
>> The int_ret_from_sys_call and syscall tracing code disagrees with
>> the sysret path as to the value of RCX.
>>
>> The Intel SDM, the AMD APM, and my laptop all agree that sysret
>> returns with RCX == RIP.  The syscall tracing code does not respect
>> this property.
>>
>> For example, this program:
>>
>> int main()
>> {
>>       extern const char syscall_rip[];
>>       unsigned long rcx = 1;
>>       unsigned long orig_rcx = rcx;
>>       asm ("mov $-1, %%eax\n\t"
>>            "syscall\n\t"
>>            "syscall_rip:"
>>            : "+c" (rcx) : : "r11");
>>       printf("syscall: RCX = %lX  RIP = %lX  orig RCX = %lx\n",
>>              rcx, (unsigned long)syscall_rip, orig_rcx);
>>       return 0;
>> }
>>
>> prints:
>> syscall: RCX = 400556  RIP = 400556  orig RCX = 1
>>
>> Running it under strace gives this instead:
>> syscall: RCX = FFFFFFFFFFFFFFFF  RIP = 400556  orig RCX = 1
>
> I can trigger the same even without tracing it:
>
> syscall: RCX = FFFFFFFFFFFFFFFF  RIP = 40052C  orig RCX = 1

Do you have context tracking on?

>
>> This changes FIXUP_TOP_OF_STACK to match sysret, causing the test to
>> show RCX == RIP even under strace.
>>
>> Signed-off-by: Andy Lutomirski <luto@amacapital.net>
>> ---
>>  arch/x86/kernel/entry_64.S | 3 ++-
>>  1 file changed, 2 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
>> index df088bb03fb3..3710b8241945 100644
>> --- a/arch/x86/kernel/entry_64.S
>> +++ b/arch/x86/kernel/entry_64.S
>> @@ -143,7 +143,8 @@ ENDPROC(native_usergs_sysret64)
>>       movq \tmp,RSP+\offset(%rsp)
>>       movq $__USER_DS,SS+\offset(%rsp)
>>       movq $__USER_CS,CS+\offset(%rsp)
>> -     movq $-1,RCX+\offset(%rsp)
>> +     movq RIP+\offset(%rsp),\tmp  /* get rip */
>> +     movq \tmp,RCX+\offset(%rsp)  /* copy it to rcx as sysret would do */
>>       movq R11+\offset(%rsp),\tmp  /* get eflags */
>>       movq \tmp,EFLAGS+\offset(%rsp)
>>       .endm
>> --
>
> For some reason this patch is causing ata resets on by box, see the
> end of this mail. So something's not kosher yet. If I boot the kernel
> without it, it all seems ok.
>
> Btw, this change got introduced in 2002 where it used to return rIP in
> %rcx before, but it got changed to return -1 for rIP for some reason.


Thanks!  I assume that's in the historical tree?

[...]

>
> ---
>
> [  180.059170] ata1.00: exception Emask 0x0 SAct 0x7fffffff SErr 0x0 action 0x6 frozen
> [  180.066873] ata1.00: failed command: WRITE FPDMA QUEUED
> [  180.072158] ata1.00: cmd 61/08:00:a8:ac:d9/00:00:23:00:00/40 tag 0 ncq 4096 out
> [  180.072158]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)

That's really weird.  The only thing I can think of is that somehow we
returned to user mode without enabling interrupts.  This leads me to
wonder: why do we save eflags in the R11 pt_regs slot?  This seems
entirely backwards, not to mention that it accounts for two
instructions in each of FIXUP_TOP_OF_STACK and RESTORE_TOP_OF_STACK
for no apparently reason whatsoever.

Can you send the full output from syscall_exit_regs_64 from here:

https://gitorious.org/linux-test-utils/linux-clock-tests/source/34884122b6ebe81d9b96e3e5128b6d6d95082c6e:

with the patch applied (assuming it even gets that far for you)?  I
see results like:

[NOTE]    syscall ffff: orig RCX = 1  ss = 2b  orig_ss = 6b  flags =
217  orig_flags = 217

which seems fine.

Are you seeing this with the whole series applied or with only this patch?

--Andy

> [  180.086912] ata1.00: status: { DRDY }
> [  180.090591] ata1.00: failed command: WRITE FPDMA QUEUED
> [  180.095846] ata1.00: cmd 61/08:08:18:ae:d9/00:00:23:00:00/40 tag 1 ncq 4096 out
> [  180.095846]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
> [  180.110603] ata1.00: status: { DRDY }
> [  180.114283] ata1.00: failed command: WRITE FPDMA QUEUED
> [  180.119539] ata1.00: cmd 61/10:10:f0:b1:d9/00:00:23:00:00/40 tag 2 ncq 8192 out
> [  180.119539]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
> [  180.134292] ata1.00: status: { DRDY }
> [  180.137973] ata1.00: failed command: WRITE FPDMA QUEUED
> [  180.143226] ata1.00: cmd 61/08:18:00:98:18/00:00:1d:00:00/40 tag 3 ncq 4096 out
> [  180.143226]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
> [  180.158105] ata1.00: status: { DRDY }
> [  180.161809] ata1.00: failed command: WRITE FPDMA QUEUED
> [  180.167071] ata1.00: cmd 61/10:20:18:98:18/00:00:1d:00:00/40 tag 4 ncq 8192 out
> [  180.167071]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
> [  180.181822] ata1.00: status: { DRDY }
> [  180.185503] ata1.00: failed command: WRITE FPDMA QUEUED
> [  180.190756] ata1.00: cmd 61/a0:28:e0:7c:5d/25:00:1d:00:00/40 tag 5 ncq 4931584 out
> [  180.190756]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
> [  180.205770] ata1.00: status: { DRDY }
> [  180.209448] ata1.00: failed command: WRITE FPDMA QUEUED
> [  180.214692] ata1.00: cmd 61/80:30:80:a2:5d/4a:00:1d:00:00/40 tag 6 ncq 9764864 out
> [  180.214692]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
> [  180.229701] ata1.00: status: { DRDY }
> [  180.233381] ata1.00: failed command: WRITE FPDMA QUEUED
> [  180.238631] ata1.00: cmd 61/08:38:00:ed:5d/22:00:1d:00:00/40 tag 7 ncq 4460544 out
> [  180.238631]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
> [  180.253647] ata1.00: status: { DRDY }
> [  180.257321] ata1.00: failed command: WRITE FPDMA QUEUED
> [  180.262576] ata1.00: cmd 61/08:40:38:b8:18/00:00:1d:00:00/40 tag 8 ncq 4096 out
> [  180.262576]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
> [  180.277321] ata1.00: status: { DRDY }
> [  180.281007] ata1.00: failed command: WRITE FPDMA QUEUED
> [  180.286263] ata1.00: cmd 61/10:48:78:b8:18/00:00:1d:00:00/40 tag 9 ncq 8192 out
> [  180.286263]          res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
> [  180.301013] ata1.00: status: { DRDY }
> [  180.304693] ata1.00: failed command: WRITE FPDMA QUEUED
> [  180.309947] ata1.00: cmd 61/08:50:60:98:d8/00:00:1d:00:00/40 tag 10 ncq 4096 out
> [  180.309947]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
> [  180.324781] ata1.00: status: { DRDY }
> [  180.328461] ata1.00: failed command: WRITE FPDMA QUEUED
> [  180.333713] ata1.00: cmd 61/08:58:18:f8:a1/00:00:20:00:00/40 tag 11 ncq 4096 out
> [  180.333713]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
> [  180.348550] ata1.00: status: { DRDY }
> [  180.352235] ata1.00: failed command: WRITE FPDMA QUEUED
> [  180.357484] ata1.00: cmd 61/08:60:00:88:a0/00:00:20:00:00/40 tag 12 ncq 4096 out
> [  180.357484]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
> [  180.372318] ata1.00: status: { DRDY }
> [  180.375997] ata1.00: failed command: WRITE FPDMA QUEUED
> [  180.381251] ata1.00: cmd 61/48:68:c8:ac:dc/00:00:22:00:00/40 tag 13 ncq 36864 out
> [  180.381251]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
> [  180.396203] ata1.00: status: { DRDY }
> [  180.399882] ata1.00: failed command: WRITE FPDMA QUEUED
> [  180.405134] ata1.00: cmd 61/08:70:40:c8:9d/00:00:20:00:00/40 tag 14 ncq 4096 out
> [  180.405134]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
> [  180.419993] ata1.00: status: { DRDY }
> [  180.423674] ata1.00: failed command: WRITE FPDMA QUEUED
> [  180.428920] ata1.00: cmd 61/08:78:58:a7:9c/00:00:20:00:00/40 tag 15 ncq 4096 out
> [  180.428920]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
> [  180.443989] ata1.00: status: { DRDY }
> [  180.447892] ata1.00: failed command: WRITE FPDMA QUEUED
> [  180.453367] ata1.00: cmd 61/08:80:28:c5:dd/00:00:25:00:00/40 tag 16 ncq 4096 out
> [  180.453367]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
> [  180.469053] ata1.00: status: { DRDY }
> [  180.472950] ata1.00: failed command: READ FPDMA QUEUED
> [  180.478330] ata1.00: cmd 60/20:88:00:42:3c/00:00:1d:00:00/40 tag 17 ncq 16384 in
> [  180.478330]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
> [  180.494011] ata1.00: status: { DRDY }
> [  180.497910] ata1.00: failed command: WRITE FPDMA QUEUED
> [  180.503378] ata1.00: cmd 61/10:90:80:98:d8/00:00:23:00:00/40 tag 18 ncq 8192 out
> [  180.503378]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
> [  180.519072] ata1.00: status: { DRDY }
> [  180.522970] ata1.00: failed command: WRITE FPDMA QUEUED
> [  180.528434] ata1.00: cmd 61/08:98:88:99:d8/00:00:23:00:00/40 tag 19 ncq 4096 out
> [  180.528434]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
> [  180.544106] ata1.00: status: { DRDY }
> [  180.547997] ata1.00: failed command: WRITE FPDMA QUEUED
> [  180.553462] ata1.00: cmd 61/08:a0:c8:99:d8/00:00:23:00:00/40 tag 20 ncq 4096 out
> [  180.553462]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
> [  180.569137] ata1.00: status: { DRDY }
> [  180.573030] ata1.00: failed command: WRITE FPDMA QUEUED
> [  180.578508] ata1.00: cmd 61/08:a8:e0:9c:d8/00:00:23:00:00/40 tag 21 ncq 4096 out
> [  180.578508]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
> [  180.594221] ata1.00: status: { DRDY }
> [  180.598126] ata1.00: failed command: WRITE FPDMA QUEUED
> [  180.603591] ata1.00: cmd 61/10:b0:18:9f:d8/00:00:23:00:00/40 tag 22 ncq 8192 out
> [  180.603591]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
> [  180.619266] ata1.00: status: { DRDY }
> [  180.623157] ata1.00: failed command: WRITE FPDMA QUEUED
> [  180.628614] ata1.00: cmd 61/08:b8:78:b2:d8/00:00:23:00:00/40 tag 23 ncq 4096 out
> [  180.628614]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
> [  180.644286] ata1.00: status: { DRDY }
> [  180.648178] ata1.00: failed command: WRITE FPDMA QUEUED
> [  180.653642] ata1.00: cmd 61/08:c0:b8:a3:d9/00:00:23:00:00/40 tag 24 ncq 4096 out
> [  180.653642]          res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
> [  180.669316] ata1.00: status: { DRDY }
> [  180.673205] ata1.00: failed command: WRITE FPDMA QUEUED
> [  180.678667] ata1.00: cmd 61/08:c8:00:a5:d9/00:00:23:00:00/40 tag 25 ncq 4096 out
> [  180.678667]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
> [  180.694340] ata1.00: status: { DRDY }
> [  180.698231] ata1.00: failed command: WRITE FPDMA QUEUED
> [  180.703690] ata1.00: cmd 61/08:d0:78:a9:d9/00:00:23:00:00/40 tag 26 ncq 4096 out
> [  180.703690]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
> [  180.719368] ata1.00: status: { DRDY }
> [  180.723258] ata1.00: failed command: WRITE FPDMA QUEUED
> [  180.728720] ata1.00: cmd 61/08:d8:10:aa:d9/00:00:23:00:00/40 tag 27 ncq 4096 out
> [  180.728720]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
> [  180.744390] ata1.00: status: { DRDY }
> [  180.748276] ata1.00: failed command: WRITE FPDMA QUEUED
> [  180.753737] ata1.00: cmd 61/08:e0:20:aa:d9/00:00:23:00:00/40 tag 28 ncq 4096 out
> [  180.753737]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
> [  180.769414] ata1.00: status: { DRDY }
> [  180.773303] ata1.00: failed command: WRITE FPDMA QUEUED
> [  180.778760] ata1.00: cmd 61/10:e8:20:ab:d9/00:00:23:00:00/40 tag 29 ncq 8192 out
> [  180.778760]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
> [  180.794433] ata1.00: status: { DRDY }
> [  180.798323] ata1.00: failed command: WRITE FPDMA QUEUED
> [  180.803785] ata1.00: cmd 61/08:f0:90:ac:d9/00:00:23:00:00/40 tag 30 ncq 4096 out
> [  180.803785]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
> [  180.819455] ata1.00: status: { DRDY }
> [  180.823358] ata1: hard resetting link
> [  181.131630] ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
> [  181.140035] ata1.00: configured for UDMA/133
> [  181.144549] ata1.00: device reported invalid CHS sector 0
> [  181.150207] ata1.00: device reported invalid CHS sector 0
> [  181.155843] ata1.00: device reported invalid CHS sector 0
> [  181.161481] ata1.00: device reported invalid CHS sector 0
> [  181.167147] ata1.00: device reported invalid CHS sector 0
> [  181.172782] ata1.00: device reported invalid CHS sector 0
> [  181.178407] ata1.00: device reported invalid CHS sector 0
> [  181.184033] ata1.00: device reported invalid CHS sector 0
> [  181.189655] ata1.00: device reported invalid CHS sector 0
> [  181.195267] ata1.00: device reported invalid CHS sector 0
> [  181.200882] ata1.00: device reported invalid CHS sector 0
> [  181.206489] ata1.00: device reported invalid CHS sector 0
> [  181.212107] ata1.00: device reported invalid CHS sector 0
> [  181.217720] ata1.00: device reported invalid CHS sector 0
> [  181.223326] ata1.00: device reported invalid CHS sector 0
> [  181.228936] ata1.00: device reported invalid CHS sector 0
> [  181.234533] ata1.00: device reported invalid CHS sector 0
> [  181.240134] ata1.00: device reported invalid CHS sector 0
> [  181.245726] ata1.00: device reported invalid CHS sector 0
> [  181.251323] ata1.00: device reported invalid CHS sector 0
> [  181.256913] ata1.00: device reported invalid CHS sector 0
> [  181.262505] ata1.00: device reported invalid CHS sector 0
> [  181.268087] ata1.00: device reported invalid CHS sector 0
> [  181.273672] ata1.00: device reported invalid CHS sector 0
> [  181.279257] ata1.00: device reported invalid CHS sector 0
> [  181.284836] ata1.00: device reported invalid CHS sector 0
> [  181.290416] ata1.00: device reported invalid CHS sector 0
> [  181.295988] ata1.00: device reported invalid CHS sector 0
> [  181.301571] ata1.00: device reported invalid CHS sector 0
> [  181.307139] ata1.00: device reported invalid CHS sector 0
> [  181.312708] ata1.00: device reported invalid CHS sector 0
> [  181.318426] ata1: EH complete
>
> --
> Regards/Gruss,
>     Boris.
>
> Sent from a fat crate under my desk. Formatting is fine.
> --



-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 1/3] x86_64,entry: Fix RCX for traced syscalls
  2015-01-05 20:31     ` Andy Lutomirski
@ 2015-01-06 15:34       ` Borislav Petkov
  2015-01-06 18:43         ` Andy Lutomirski
  0 siblings, 1 reply; 17+ messages in thread
From: Borislav Petkov @ 2015-01-06 15:34 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: X86 ML, linux-kernel, Frédéric Weisbecker,
	Oleg Nesterov, kvm list

On Mon, Jan 05, 2015 at 12:31:15PM -0800, Andy Lutomirski wrote:
> Do you have context tracking on?

Yap, it is enabled for whatever reason:
CONFIG_CONTEXT_TRACKING=y
CONFIG_CONTEXT_TRACKING_FORCE=y
CONFIG_HAVE_CONTEXT_TRACKING=y

> I assume that's in the historical tree?

Yeah.

> > [  180.059170] ata1.00: exception Emask 0x0 SAct 0x7fffffff SErr 0x0 action 0x6 frozen
> > [  180.066873] ata1.00: failed command: WRITE FPDMA QUEUED
> > [  180.072158] ata1.00: cmd 61/08:00:a8:ac:d9/00:00:23:00:00/40 tag 0 ncq 4096 out
> > [  180.072158]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
> 
> That's really weird.  The only thing I can think of is that somehow we
> returned to user mode without enabling interrupts.

Right, considering FIXUP_TOP_OF_STACK is used in a bunch of cases in
entry_64.S, no wonder it corrupts something.

> This leads me to wonder: why do we save eflags in the R11 pt_regs
> slot?

That: "If executed in 64-bit mode, SYSRET loads the lower-32 RFLAGS bits
from R11[31:0] and clears the upper 32 RFLAGS bits."

> This seems entirely backwards, not to mention that it accounts for two
> instructions in each of FIXUP_TOP_OF_STACK and RESTORE_TOP_OF_STACK
> for no apparently reason whatsoever.

> Can you send the full output from syscall_exit_regs_64 from here:
> 
> https://gitorious.org/linux-test-utils/linux-clock-tests/source/34884122b6ebe81d9b96e3e5128b6d6d95082c6e:
> 
> with the patch applied (assuming it even gets that far for you)?  I
> see results like:
> 
> [NOTE]    syscall ffff: orig RCX = 1  ss = 2b  orig_ss = 6b  flags =
> 217  orig_flags = 217
> 
> which seems fine.

./syscall_exit_regs_64
[OK]    int80 ffff: AX = ffffffffffffffda
[OK]    int80 40000000: AX = ffffffffffffffda
[OK]    syscall ffff: RCX = 400962  RIP = 400962
[OK]    syscall ffff: AX = ffffffffffffffda
[NOTE]  syscall ffff: orig RCX = 1  ss = 2b  orig_ss = 6b  flags = 217  orig_flags = 217
[OK]    syscall 40000000: RCX = 400962  RIP = 400962
[FAIL]  syscall 40000000: AX = fffffffffffffff7
[NOTE]  syscall 40000000: orig RCX = 1  ss = 2b  orig_ss = 6b  flags = 217  orig_flags = 217
[OK]    syscall(ffff): ret = -1, errno = 38

> Are you seeing this with the whole series applied or with only this patch?

I applied this patch only and started seeing those. Then I booted in the
previous kernel and tried to repro but it didn't trigger.

I'll try hammering on the kernel *without* your patch to see whether I
can trigger it somehow...

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 1/3] x86_64,entry: Fix RCX for traced syscalls
  2015-01-06 15:34       ` Borislav Petkov
@ 2015-01-06 18:43         ` Andy Lutomirski
  2015-01-06 19:00           ` Borislav Petkov
  0 siblings, 1 reply; 17+ messages in thread
From: Andy Lutomirski @ 2015-01-06 18:43 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Frederic Weisbecker, linux-kernel, X86 ML, kvm list, Oleg Nesterov

On Jan 6, 2015 7:34 AM, "Borislav Petkov" <bp@alien8.de> wrote:
>
> On Mon, Jan 05, 2015 at 12:31:15PM -0800, Andy Lutomirski wrote:
> > Do you have context tracking on?
>
> Yap, it is enabled for whatever reason:
> CONFIG_CONTEXT_TRACKING=y
> CONFIG_CONTEXT_TRACKING_FORCE=y
> CONFIG_HAVE_CONTEXT_TRACKING=y

I'll boot a kernel like this on bare metal and see what shakes loose.

>
> > I assume that's in the historical tree?
>
> Yeah.
>
> > > [  180.059170] ata1.00: exception Emask 0x0 SAct 0x7fffffff SErr 0x0 action 0x6 frozen
> > > [  180.066873] ata1.00: failed command: WRITE FPDMA QUEUED
> > > [  180.072158] ata1.00: cmd 61/08:00:a8:ac:d9/00:00:23:00:00/40 tag 0 ncq 4096 out
> > > [  180.072158]          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
> >
> > That's really weird.  The only thing I can think of is that somehow we
> > returned to user mode without enabling interrupts.
>
> Right, considering FIXUP_TOP_OF_STACK is used in a bunch of cases in
> entry_64.S, no wonder it corrupts something.
>
> > This leads me to wonder: why do we save eflags in the R11 pt_regs
> > slot?
>
> That: "If executed in 64-bit mode, SYSRET loads the lower-32 RFLAGS bits
> from R11[31:0] and clears the upper 32 RFLAGS bits."

Sure, but the code would be simpler if we shoved that value in the EFLAGS slot.

>
> > This seems entirely backwards, not to mention that it accounts for two
> > instructions in each of FIXUP_TOP_OF_STACK and RESTORE_TOP_OF_STACK
> > for no apparently reason whatsoever.
>
> > Can you send the full output from syscall_exit_regs_64 from here:
> >
> > https://gitorious.org/linux-test-utils/linux-clock-tests/source/34884122b6ebe81d9b96e3e5128b6d6d95082c6e:
> >
> > with the patch applied (assuming it even gets that far for you)?  I
> > see results like:
> >
> > [NOTE]    syscall ffff: orig RCX = 1  ss = 2b  orig_ss = 6b  flags =
> > 217  orig_flags = 217
> >
> > which seems fine.
>
> ./syscall_exit_regs_64
> [OK]    int80 ffff: AX = ffffffffffffffda
> [OK]    int80 40000000: AX = ffffffffffffffda
> [OK]    syscall ffff: RCX = 400962  RIP = 400962
> [OK]    syscall ffff: AX = ffffffffffffffda
> [NOTE]  syscall ffff: orig RCX = 1  ss = 2b  orig_ss = 6b  flags = 217  orig_flags = 217
> [OK]    syscall 40000000: RCX = 400962  RIP = 400962
> [FAIL]  syscall 40000000: AX = fffffffffffffff7
> [NOTE]  syscall 40000000: orig RCX = 1  ss = 2b  orig_ss = 6b  flags = 217  orig_flags = 217
> [OK]    syscall(ffff): ret = -1, errno = 38
>
> > Are you seeing this with the whole series applied or with only this patch?
>
> I applied this patch only and started seeing those. Then I booted in the
> previous kernel and tried to repro but it didn't trigger.
>
> I'll try hammering on the kernel *without* your patch to see whether I
> can trigger it somehow...

Hmm.  I added and pushed a test for fork, but that didn't turn
anything up.  And I don't see any bugs in the code.

I booted 3.18 plus this patch with context tracking forced on on my
laptop, and something seems to have gone wrong.

--Andy

>
> --
> Regards/Gruss,
>     Boris.
>
> Sent from a fat crate under my desk. Formatting is fine.
> --

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 1/3] x86_64,entry: Fix RCX for traced syscalls
  2015-01-06 18:43         ` Andy Lutomirski
@ 2015-01-06 19:00           ` Borislav Petkov
  2015-01-07 15:55             ` Borislav Petkov
  0 siblings, 1 reply; 17+ messages in thread
From: Borislav Petkov @ 2015-01-06 19:00 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Frederic Weisbecker, linux-kernel, X86 ML, kvm list, Oleg Nesterov

On Tue, Jan 06, 2015 at 10:43:57AM -0800, Andy Lutomirski wrote:
> Sure, but the code would be simpler if we shoved that value in the
> EFLAGS slot.

There probably is some reason for that but it's not like we can change
it :-)

> Hmm.  I added and pushed a test for fork, but that didn't turn
> anything up.  And I don't see any bugs in the code.
> 
> I booted 3.18 plus this patch with context tracking forced on on my
> laptop, and something seems to have gone wrong.

Well, I ran it on -rc2 + tip/master and it did trigger sporadically
yesterday but was unable to trigger something today.

I'll redo the whole games tomorrow.

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 1/3] x86_64,entry: Fix RCX for traced syscalls
  2015-01-06 19:00           ` Borislav Petkov
@ 2015-01-07 15:55             ` Borislav Petkov
  0 siblings, 0 replies; 17+ messages in thread
From: Borislav Petkov @ 2015-01-07 15:55 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Frederic Weisbecker, linux-kernel, X86 ML, kvm list, Oleg Nesterov

On Tue, Jan 06, 2015 at 08:00:41PM +0100, Borislav Petkov wrote:
> I'll redo the whole games tomorrow.

Ok I can't reproduce today either. Let's ascribe it to this particular
test box being funny or something in 3.19-rcs.

We can look at it if it happens again.

Thanks.

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 2/3] x86_64,entry: Use sysret to return to userspace when possible
  2014-11-07 23:58 ` [PATCH 2/3] x86_64,entry: Use sysret to return to userspace when possible Andy Lutomirski
@ 2015-01-08 12:29   ` Borislav Petkov
  2015-01-08 13:57     ` Borislav Petkov
  2015-01-10 21:05     ` Andy Lutomirski
  2015-01-09 10:40   ` Borislav Petkov
  1 sibling, 2 replies; 17+ messages in thread
From: Borislav Petkov @ 2015-01-08 12:29 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: x86, linux-kernel, Frédéric Weisbecker, Oleg Nesterov,
	kvm list

On Fri, Nov 07, 2014 at 03:58:18PM -0800, Andy Lutomirski wrote:
> diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
> index 3710b8241945..a5afdf0f7fa4 100644
> --- a/arch/x86/kernel/entry_64.S
> +++ b/arch/x86/kernel/entry_64.S
> @@ -804,6 +804,54 @@ retint_swapgs:		/* return to user-space */

Ok, so retint_swapgs is also on the error_exit path.

What you're basically proposing is to use SYSRET on exceptions exit
too AFAICT. And while I don't see anything wrong with the patch, you
probably need to run this by more people like tip guys + Linus just in
case. We can't allow ourselves to leak stuff here.

>  	 */
>  	DISABLE_INTERRUPTS(CLBR_ANY)
>  	TRACE_IRQS_IRETQ
> +
> +	/*
> +	 * Try to use SYSRET instead of IRET if we're returning to
> +	 * a completely clean 64-bit userspace context.
> +	 */
> +	movq (RCX-R11)(%rsp), %rcx
> +	cmpq %rcx,(RIP-R11)(%rsp)		/* RCX == RIP */
> +	jne opportunistic_sysret_failed
> +
> +	/*
> +	 * On Intel CPUs, sysret with non-canonical RCX/RIP will #GP
> +	 * in kernel space.  This essentially lets the user take over
> +	 * the kernel, since userspace controls RSP.  It's not worth
> +	 * testing for canonicalness exactly -- this check detects any
> +	 * of the 17 high bits set, which is true for non-canonical
> +	 * or kernel addresses.  (This will pessimize vsyscall=native.
> +	 * Big deal.)
> +	 */
> +	shr $47, %rcx

	shr $__VIRTUAL_MASK_SHIFT, %rcx

I guess, in case someone decides to play with the address space again
and forgets this naked bit here.

> +	jnz opportunistic_sysret_failed
> +
> +	cmpq $__USER_CS,(CS-R11)(%rsp)		/* CS must match SYSRET */
> +	jne opportunistic_sysret_failed
> +
> +	movq (R11-R11)(%rsp), %r11
> +	cmpq %r11,(EFLAGS-R11)(%rsp)		/* R11 == RFLAGS */
> +	jne opportunistic_sysret_failed
> +
> +	testq $X86_EFLAGS_RF,%r11		/* sysret can't restore RF */
> +	jnz opportunistic_sysret_failed
> +
> +	/* nothing to check for RSP */
> +
> +	cmpq $__USER_DS,(SS-R11)(%rsp)		/* SS must match SYSRET */
> +	jne opportunistic_sysret_failed
> +
> +	/*
> +	 * We win!  This label is here just for ease of understanding
> +	 * perf profiles.  Nothing jumps here.
> +	 */
> +irq_return_via_sysret:
> +	CFI_REMEMBER_STATE
> +	RESTORE_ARGS 1,8,1
> +	movq (RSP-RIP)(%rsp),%rsp
> +	USERGS_SYSRET64
> +	CFI_RESTORE_STATE
> +
> +opportunistic_sysret_failed:
>  	SWAPGS
>  	jmp restore_args

Ok, dammit, it happened again:

...
[   13.480778] BTRFS info (device sda9): disk space caching is enabled
[   13.487270] BTRFS: has skinny extents
[   14.368392] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
[   15.928679] e1000e: eth0 NIC Link is Up 100 Mbps Full Duplex, Flow Control: Rx/Tx
[   15.936406] e1000e 0000:00:19.0 eth0: 10/100 speed: disabling TSO
[   15.942879] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[  115.065408] ata1.00: exception Emask 0x0 SAct 0x7fd80000 SErr 0x0 action 0x6 frozen
[  115.073159] ata1.00: failed command: WRITE FPDMA QUEUED
[  115.078459] ata1.00: cmd 61/80:98:c0:e7:35/4a:00:1f:00:00/40 tag 19 ncq 9764864 out
         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[  115.093623] ata1.00: status: { DRDY }
[  115.097314] ata1.00: failed command: WRITE FPDMA QUEUED
[  115.102569] ata1.00: cmd 61/30:a0:40:32:36/20:00:1f:00:00/40 tag 20 ncq 4218880 out
         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[  115.117668] ata1.00: status: { DRDY }
[  115.121351] ata1.00: failed command: WRITE FPDMA QUEUED
[  115.126602] ata1.00: cmd 61/80:b0:80:f7:37/20:00:1f:00:00/40 tag 22 ncq 4259840 out
         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[  115.141701] ata1.00: status: { DRDY }
[  115.145389] ata1.00: failed command: WRITE FPDMA QUEUED
[  115.150638] ata1.00: cmd 61/90:b8:70:52:36/03:00:1f:00:00/40 tag 23 ncq 466944 out
         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[  115.165682] ata1.00: status: { DRDY }
[  115.169357] ata1.00: failed command: WRITE FPDMA QUEUED
[  115.174617] ata1.00: cmd 61/c0:c0:00:58:36/39:00:1f:00:00/40 tag 24 ncq 7569408 out
         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[  115.189713] ata1.00: status: { DRDY }
[  115.193400] ata1.00: failed command: WRITE FPDMA QUEUED
[  115.198650] ata1.00: cmd 61/80:c8:c0:91:36/4b:00:1f:00:00/40 tag 25 ncq 9895936 out
         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[  115.213755] ata1.00: status: { DRDY }
[  115.217431] ata1.00: failed command: WRITE FPDMA QUEUED
[  115.222692] ata1.00: cmd 61/80:d0:40:dd:36/4a:00:1f:00:00/40 tag 26 ncq 9764864 out
         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[  115.237788] ata1.00: status: { DRDY }
[  115.241479] ata1.00: failed command: WRITE FPDMA QUEUED
[  115.246723] ata1.00: cmd 61/40:d8:c0:27:37/30:00:1f:00:00/40 tag 27 ncq 6324224 out
         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[  115.261825] ata1.00: status: { DRDY }
[  115.265519] ata1.00: failed command: READ FPDMA QUEUED
[  115.270683] ata1.00: cmd 60/08:e0:40:98:18/00:00:1f:00:00/40 tag 28 ncq 4096 in
         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[  115.285432] ata1.00: status: { DRDY }
[  115.289113] ata1.00: failed command: WRITE FPDMA QUEUED
[  115.294367] ata1.00: cmd 61/00:e8:00:58:37/15:00:1f:00:00/40 tag 29 ncq 2752512 out
         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[  115.309463] ata1.00: status: { DRDY }
[  115.313149] ata1.00: failed command: WRITE FPDMA QUEUED
[  115.318399] ata1.00: cmd 61/00:f0:00:6d:37/2b:00:1f:00:00/40 tag 30 ncq 5636096 out
         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[  115.333503] ata1.00: status: { DRDY }
[  115.337201] ata1: hard resetting link
[  115.645895] ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[  115.743776] ata1.00: configured for UDMA/133
[  115.748074] ata1.00: device reported invalid CHS sector 0
[  115.753516] ata1.00: device reported invalid CHS sector 0
[  115.758947] ata1.00: device reported invalid CHS sector 0
[  115.764383] ata1.00: device reported invalid CHS sector 0
[  115.769825] ata1.00: device reported invalid CHS sector 0
[  115.775260] ata1.00: device reported invalid CHS sector 0
[  115.780689] ata1.00: device reported invalid CHS sector 0
[  115.786123] ata1.00: device reported invalid CHS sector 0
[  115.791563] ata1.00: device reported invalid CHS sector 0
[  115.796998] ata1.00: device reported invalid CHS sector 0
[  115.802431] ata1.00: device reported invalid CHS sector 0
[  115.807914] ata1: EH complete
[  146.085052] ata1.00: exception Emask 0x0 SAct 0x77c SErr 0x0 action 0x6 frozen
[  146.092320] ata1.00: failed command: READ FPDMA QUEUED
[  146.097489] ata1.00: cmd 60/08:10:40:98:18/00:00:1f:00:00/40 tag 2 ncq 4096 in
         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[  146.112367] ata1.00: status: { DRDY }
[  146.116244] ata1.00: failed command: WRITE FPDMA QUEUED
[  146.121696] ata1.00: cmd 61/40:18:c0:27:37/30:00:1f:00:00/40 tag 3 ncq 6324224 out
         res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[  146.137389] ata1.00: status: { DRDY }
[  146.141267] ata1.00: failed command: WRITE FPDMA QUEUED
[  146.146710] ata1.00: cmd 61/80:20:40:dd:36/4a:00:1f:00:00/40 tag 4 ncq 9764864 out
         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[  146.162395] ata1.00: status: { DRDY }
[  146.166269] ata1.00: failed command: WRITE FPDMA QUEUED
[  146.171723] ata1.00: cmd 61/80:28:c0:91:36/4b:00:1f:00:00/40 tag 5 ncq 9895936 out
         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[  146.187402] ata1.00: status: { DRDY }
[  146.191278] ata1.00: failed command: WRITE FPDMA QUEUED
[  146.196718] ata1.00: cmd 61/c0:30:00:58:36/39:00:1f:00:00/40 tag 6 ncq 7569408 out
         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[  146.212399] ata1.00: status: { DRDY }
[  146.216275] ata1.00: failed command: WRITE FPDMA QUEUED
[  146.221723] ata1.00: cmd 61/80:40:80:f7:37/20:00:1f:00:00/40 tag 8 ncq 4259840 out
         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[  146.237407] ata1.00: status: { DRDY }
[  146.241280] ata1.00: failed command: WRITE FPDMA QUEUED
[  146.246725] ata1.00: cmd 61/30:48:40:32:36/20:00:1f:00:00/40 tag 9 ncq 4218880 out
         res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[  146.262407] ata1.00: status: { DRDY }
[  146.266282] ata1.00: failed command: WRITE FPDMA QUEUED
[  146.271731] ata1.00: cmd 61/80:50:c0:e7:35/4a:00:1f:00:00/40 tag 10 ncq 9764864 out
         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[  146.287498] ata1.00: status: { DRDY }
[  146.291371] ata1: hard resetting link
[  146.599768] ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[  146.608680] ata1.00: configured for UDMA/133
[  146.613180] ata1.00: device reported invalid CHS sector 0
[  146.618807] ata1.00: device reported invalid CHS sector 0
[  146.624430] ata1.00: device reported invalid CHS sector 0
[  146.630048] ata1.00: device reported invalid CHS sector 0
[  146.635658] ata1.00: device reported invalid CHS sector 0
[  146.641270] ata1.00: device reported invalid CHS sector 0
[  146.646881] ata1.00: device reported invalid CHS sector 0
[  146.652484] ata1.00: device reported invalid CHS sector 0
[  146.658122] ata1: EH complete
[  177.110908] ata1.00: exception Emask 0x0 SAct 0x7f800 SErr 0x0 action 0x6 frozen
[  177.118525] ata1.00: failed command: WRITE FPDMA QUEUED
[  177.123960] ata1.00: cmd 61/80:58:c0:e7:35/4a:00:1f:00:00/40 tag 11 ncq 9764864 out
         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[  177.139559] ata1.00: status: { DRDY }
[  177.143419] ata1.00: failed command: WRITE FPDMA QUEUED
[  177.148849] ata1.00: cmd 61/30:60:40:32:36/20:00:1f:00:00/40 tag 12 ncq 4218880 out
         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[  177.164454] ata1.00: status: { DRDY }
[  177.168311] ata1.00: failed command: WRITE FPDMA QUEUED
[  177.173747] ata1.00: cmd 61/80:68:80:f7:37/20:00:1f:00:00/40 tag 13 ncq 4259840 out
         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[  177.189387] ata1.00: status: { DRDY }
[  177.193254] ata1.00: failed command: WRITE FPDMA QUEUED
[  177.198691] ata1.00: cmd 61/c0:70:00:58:36/39:00:1f:00:00/40 tag 14 ncq 7569408 out
         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[  177.214430] ata1.00: status: { DRDY }
[  177.218304] ata1.00: failed command: WRITE FPDMA QUEUED
[  177.223755] ata1.00: cmd 61/80:78:c0:91:36/4b:00:1f:00:00/40 tag 15 ncq 9895936 out
         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[  177.239575] ata1.00: status: { DRDY }
[  177.243460] ata1.00: failed command: WRITE FPDMA QUEUED
[  177.248908] ata1.00: cmd 61/80:80:40:dd:36/4a:00:1f:00:00/40 tag 16 ncq 9764864 out
         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[  177.264743] ata1.00: status: { DRDY }
[  177.268622] ata1.00: failed command: WRITE FPDMA QUEUED
[  177.274075] ata1.00: cmd 61/40:88:c0:27:37/30:00:1f:00:00/40 tag 17 ncq 6324224 out
         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[  177.289913] ata1.00: status: { DRDY }
[  177.293795] ata1.00: failed command: READ FPDMA QUEUED
[  177.299153] ata1.00: cmd 60/08:90:40:98:18/00:00:1f:00:00/40 tag 18 ncq 4096 in
         res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[  177.314633] ata1.00: status: { DRDY }
[  177.318509] ata1: hard resetting link
[  177.626616] ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[  177.968639] ata1.00: configured for UDMA/133
[  177.973609] ata1.00: device reported invalid CHS sector 0
[  177.979669] ata1.00: device reported invalid CHS sector 0
[  177.985723] ata1.00: device reported invalid CHS sector 0
[  177.991371] ata1.00: device reported invalid CHS sector 0
[  177.997008] ata1.00: device reported invalid CHS sector 0
[  178.002641] ata1.00: device reported invalid CHS sector 0
[  178.008260] ata1.00: device reported invalid CHS sector 0
[  178.013886] ata1.00: device reported invalid CHS sector 0
[  178.019558] ata1: EH complete

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 2/3] x86_64,entry: Use sysret to return to userspace when possible
  2015-01-08 12:29   ` Borislav Petkov
@ 2015-01-08 13:57     ` Borislav Petkov
  2015-01-10 21:05     ` Andy Lutomirski
  1 sibling, 0 replies; 17+ messages in thread
From: Borislav Petkov @ 2015-01-08 13:57 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: x86, linux-kernel, Frédéric Weisbecker, Oleg Nesterov,
	kvm list

On Thu, Jan 08, 2015 at 01:29:28PM +0100, Borislav Petkov wrote:
> Ok, dammit, it happened again:

Running -rc+ 2without your first two patches doesn't trigger it. Well,
I don't know what workload even triggered it, it used to happen during
system upgrade. I left the box without your patches to build the kernel
in a loop and went to lunch.

Now I'm back and it all still looks good.

I'll try running only with this second patch, i.e. the ret-to-user
SYSRET speedup thing. See what happens.

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 2/3] x86_64,entry: Use sysret to return to userspace when possible
  2014-11-07 23:58 ` [PATCH 2/3] x86_64,entry: Use sysret to return to userspace when possible Andy Lutomirski
  2015-01-08 12:29   ` Borislav Petkov
@ 2015-01-09 10:40   ` Borislav Petkov
  1 sibling, 0 replies; 17+ messages in thread
From: Borislav Petkov @ 2015-01-09 10:40 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: x86, linux-kernel, Frédéric Weisbecker, Oleg Nesterov,
	kvm list

On Fri, Nov 07, 2014 at 03:58:18PM -0800, Andy Lutomirski wrote:
> +	/*
> +	 * Try to use SYSRET instead of IRET if we're returning to
> +	 * a completely clean 64-bit userspace context.
> +	 */
> +	movq (RCX-R11)(%rsp), %rcx
> +	cmpq %rcx,(RIP-R11)(%rsp)		/* RCX == RIP */
> +	jne opportunistic_sysret_failed
> +
> +	/*
> +	 * On Intel CPUs, sysret with non-canonical RCX/RIP will #GP
> +	 * in kernel space.  This essentially lets the user take over
> +	 * the kernel, since userspace controls RSP.  It's not worth
> +	 * testing for canonicalness exactly -- this check detects any
> +	 * of the 17 high bits set, which is true for non-canonical
> +	 * or kernel addresses.  (This will pessimize vsyscall=native.
> +	 * Big deal.)
> +	 */
> +	shr $47, %rcx
> +	jnz opportunistic_sysret_failed
> +
> +	cmpq $__USER_CS,(CS-R11)(%rsp)		/* CS must match SYSRET */
> +	jne opportunistic_sysret_failed
> +
> +	movq (R11-R11)(%rsp), %r11
> +	cmpq %r11,(EFLAGS-R11)(%rsp)		/* R11 == RFLAGS */
> +	jne opportunistic_sysret_failed
> +
> +	testq $X86_EFLAGS_RF,%r11		/* sysret can't restore RF */
> +	jnz opportunistic_sysret_failed
> +
> +	/* nothing to check for RSP */
> +
> +	cmpq $__USER_DS,(SS-R11)(%rsp)		/* SS must match SYSRET */
> +	jne opportunistic_sysret_failed

Btw, Denys' R11->ARGOFFSET fix makes sense here too - using ARGOFFSET
instead of R11 would make this here clearer.

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 3/3] x86_64,entry: Remove the syscall exit audit and schedule optimizations
  2014-11-07 23:58 ` [PATCH 3/3] x86_64,entry: Remove the syscall exit audit and schedule optimizations Andy Lutomirski
@ 2015-01-09 15:53   ` Borislav Petkov
  2015-01-09 16:08     ` Andy Lutomirski
  0 siblings, 1 reply; 17+ messages in thread
From: Borislav Petkov @ 2015-01-09 15:53 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: x86, linux-kernel, Frédéric Weisbecker, Oleg Nesterov,
	kvm list

On Fri, Nov 07, 2014 at 03:58:19PM -0800, Andy Lutomirski wrote:
> We used to optimize rescheduling and audit on syscall exit.  Now that
> the full slow path is reasonably fast, remove these optimizations.
> 
> This adds something like 10ns to the previously optimized paths on my
> computer, presumably due mostly to SAVE_REST / RESTORE_REST.
> 
> I think that we should eventually replace both the syscall and
> non-paranoid interrupt exit slow paths with a pair of C functions
> along the lines of the syscall entry hooks.
> 
> Signed-off-by: Andy Lutomirski <luto@amacapital.net>
> ---
>  arch/x86/kernel/entry_64.S | 52 +++++-----------------------------------------
>  1 file changed, 5 insertions(+), 47 deletions(-)
> 
> diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
> index a5afdf0f7fa4..222dc5c45ac3 100644
> --- a/arch/x86/kernel/entry_64.S
> +++ b/arch/x86/kernel/entry_64.S
> @@ -427,15 +427,12 @@ system_call_fastpath:
>   * Has incomplete stack frame and undefined top of stack.
>   */
>  ret_from_sys_call:
> -	movl $_TIF_ALLWORK_MASK,%edi
> -	/* edi:	flagmask */
> -sysret_check:
> +	testl $_TIF_ALLWORK_MASK,TI_flags+THREAD_INFO(%rsp,RIP-ARGOFFSET)
> +	jnz int_ret_from_sys_call_fixup	/* Go the the slow path */
> +
>  	LOCKDEP_SYS_EXIT
>  	DISABLE_INTERRUPTS(CLBR_NONE)
>  	TRACE_IRQS_OFF
> -	movl TI_flags+THREAD_INFO(%rsp,RIP-ARGOFFSET),%edx
> -	andl %edi,%edx
> -	jnz  sysret_careful
>  	CFI_REMEMBER_STATE
>  	/*
>  	 * sysretq will re-enable interrupts:
> @@ -449,49 +446,10 @@ sysret_check:
>  	USERGS_SYSRET64
>  
>  	CFI_RESTORE_STATE
> -	/* Handle reschedules */
> -	/* edx:	work, edi: workmask */
> -sysret_careful:
> -	bt $TIF_NEED_RESCHED,%edx
> -	jnc sysret_signal
> -	TRACE_IRQS_ON
> -	ENABLE_INTERRUPTS(CLBR_NONE)
> -	pushq_cfi %rdi
> -	SCHEDULE_USER
> -	popq_cfi %rdi
> -	jmp sysret_check
>  
> -	/* Handle a signal */
> -sysret_signal:
> -	TRACE_IRQS_ON
> -	ENABLE_INTERRUPTS(CLBR_NONE)
> -#ifdef CONFIG_AUDITSYSCALL
> -	bt $TIF_SYSCALL_AUDIT,%edx
> -	jc sysret_audit
> -#endif
> -	/*
> -	 * We have a signal, or exit tracing or single-step.
> -	 * These all wind up with the iret return path anyway,
> -	 * so just join that path right now.
> -	 */
> +int_ret_from_sys_call_fixup:
>  	FIXUP_TOP_OF_STACK %r11, -ARGOFFSET
> -	jmp int_check_syscall_exit_work
> -
> -#ifdef CONFIG_AUDITSYSCALL
> -	/*
> -	 * Return fast path for syscall audit.  Call __audit_syscall_exit()
> -	 * directly and then jump back to the fast path with TIF_SYSCALL_AUDIT
> -	 * masked off.
> -	 */
> -sysret_audit:
> -	movq RAX-ARGOFFSET(%rsp),%rsi	/* second arg, syscall return value */
> -	cmpq $-MAX_ERRNO,%rsi	/* is it < -MAX_ERRNO? */
> -	setbe %al		/* 1 if so, 0 if not */
> -	movzbl %al,%edi		/* zero-extend that into %edi */
> -	call __audit_syscall_exit
> -	movl $(_TIF_ALLWORK_MASK & ~_TIF_SYSCALL_AUDIT),%edi
> -	jmp sysret_check
> -#endif	/* CONFIG_AUDITSYSCALL */
> +	jmp int_ret_from_sys_call

Ok, dumb question: what is replacing that audit functionality now?
Or are we remaining with the single path down syscall_trace_leave()?
AFAICT, the intention is to direct everything to int_ret_from_sys_call
after the syscall has been done so that paths can get simplified. Which
is a good thing, AFAICT.

Acked-by: Borislav Petkov <bp@suse.de>

-- 
Regards/Gruss,
    Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 3/3] x86_64,entry: Remove the syscall exit audit and schedule optimizations
  2015-01-09 15:53   ` Borislav Petkov
@ 2015-01-09 16:08     ` Andy Lutomirski
  0 siblings, 0 replies; 17+ messages in thread
From: Andy Lutomirski @ 2015-01-09 16:08 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: X86 ML, linux-kernel, Frédéric Weisbecker,
	Oleg Nesterov, kvm list

On Fri, Jan 9, 2015 at 7:53 AM, Borislav Petkov <bp@alien8.de> wrote:
> On Fri, Nov 07, 2014 at 03:58:19PM -0800, Andy Lutomirski wrote:
>> We used to optimize rescheduling and audit on syscall exit.  Now that
>> the full slow path is reasonably fast, remove these optimizations.
>>
>> This adds something like 10ns to the previously optimized paths on my
>> computer, presumably due mostly to SAVE_REST / RESTORE_REST.
>>
>> I think that we should eventually replace both the syscall and
>> non-paranoid interrupt exit slow paths with a pair of C functions
>> along the lines of the syscall entry hooks.
>>
>> Signed-off-by: Andy Lutomirski <luto@amacapital.net>
>> ---
>>  arch/x86/kernel/entry_64.S | 52 +++++-----------------------------------------
>>  1 file changed, 5 insertions(+), 47 deletions(-)
>>
>> diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
>> index a5afdf0f7fa4..222dc5c45ac3 100644
>> --- a/arch/x86/kernel/entry_64.S
>> +++ b/arch/x86/kernel/entry_64.S
>> @@ -427,15 +427,12 @@ system_call_fastpath:
>>   * Has incomplete stack frame and undefined top of stack.
>>   */
>>  ret_from_sys_call:
>> -     movl $_TIF_ALLWORK_MASK,%edi
>> -     /* edi: flagmask */
>> -sysret_check:
>> +     testl $_TIF_ALLWORK_MASK,TI_flags+THREAD_INFO(%rsp,RIP-ARGOFFSET)
>> +     jnz int_ret_from_sys_call_fixup /* Go the the slow path */
>> +
>>       LOCKDEP_SYS_EXIT
>>       DISABLE_INTERRUPTS(CLBR_NONE)
>>       TRACE_IRQS_OFF
>> -     movl TI_flags+THREAD_INFO(%rsp,RIP-ARGOFFSET),%edx
>> -     andl %edi,%edx
>> -     jnz  sysret_careful
>>       CFI_REMEMBER_STATE
>>       /*
>>        * sysretq will re-enable interrupts:
>> @@ -449,49 +446,10 @@ sysret_check:
>>       USERGS_SYSRET64
>>
>>       CFI_RESTORE_STATE
>> -     /* Handle reschedules */
>> -     /* edx: work, edi: workmask */
>> -sysret_careful:
>> -     bt $TIF_NEED_RESCHED,%edx
>> -     jnc sysret_signal
>> -     TRACE_IRQS_ON
>> -     ENABLE_INTERRUPTS(CLBR_NONE)
>> -     pushq_cfi %rdi
>> -     SCHEDULE_USER
>> -     popq_cfi %rdi
>> -     jmp sysret_check
>>
>> -     /* Handle a signal */
>> -sysret_signal:
>> -     TRACE_IRQS_ON
>> -     ENABLE_INTERRUPTS(CLBR_NONE)
>> -#ifdef CONFIG_AUDITSYSCALL
>> -     bt $TIF_SYSCALL_AUDIT,%edx
>> -     jc sysret_audit
>> -#endif
>> -     /*
>> -      * We have a signal, or exit tracing or single-step.
>> -      * These all wind up with the iret return path anyway,
>> -      * so just join that path right now.
>> -      */
>> +int_ret_from_sys_call_fixup:
>>       FIXUP_TOP_OF_STACK %r11, -ARGOFFSET
>> -     jmp int_check_syscall_exit_work
>> -
>> -#ifdef CONFIG_AUDITSYSCALL
>> -     /*
>> -      * Return fast path for syscall audit.  Call __audit_syscall_exit()
>> -      * directly and then jump back to the fast path with TIF_SYSCALL_AUDIT
>> -      * masked off.
>> -      */
>> -sysret_audit:
>> -     movq RAX-ARGOFFSET(%rsp),%rsi   /* second arg, syscall return value */
>> -     cmpq $-MAX_ERRNO,%rsi   /* is it < -MAX_ERRNO? */
>> -     setbe %al               /* 1 if so, 0 if not */
>> -     movzbl %al,%edi         /* zero-extend that into %edi */
>> -     call __audit_syscall_exit
>> -     movl $(_TIF_ALLWORK_MASK & ~_TIF_SYSCALL_AUDIT),%edi
>> -     jmp sysret_check
>> -#endif       /* CONFIG_AUDITSYSCALL */
>> +     jmp int_ret_from_sys_call
>
> Ok, dumb question: what is replacing that audit functionality now?
> Or are we remaining with the single path down syscall_trace_leave()?
> AFAICT, the intention is to direct everything to int_ret_from_sys_call
> after the syscall has been done so that paths can get simplified. Which
> is a good thing, AFAICT.

Exactly.  int_ret_from_sys_call calls into the audit code in
syscall_trace_leave.  IIRC the current code sometimes calls the audit
exit hook twice, which seems like a bug to me.

--Andy

>
> Acked-by: Borislav Petkov <bp@suse.de>
>
> --
> Regards/Gruss,
>     Boris.
>
> Sent from a fat crate under my desk. Formatting is fine.
> --



-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH 2/3] x86_64,entry: Use sysret to return to userspace when possible
  2015-01-08 12:29   ` Borislav Petkov
  2015-01-08 13:57     ` Borislav Petkov
@ 2015-01-10 21:05     ` Andy Lutomirski
  1 sibling, 0 replies; 17+ messages in thread
From: Andy Lutomirski @ 2015-01-10 21:05 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: X86 ML, linux-kernel, Frédéric Weisbecker,
	Oleg Nesterov, kvm list

On Thu, Jan 8, 2015 at 4:29 AM, Borislav Petkov <bp@alien8.de> wrote:
> On Fri, Nov 07, 2014 at 03:58:18PM -0800, Andy Lutomirski wrote:
>> diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
>> index 3710b8241945..a5afdf0f7fa4 100644
>> --- a/arch/x86/kernel/entry_64.S
>> +++ b/arch/x86/kernel/entry_64.S
>> @@ -804,6 +804,54 @@ retint_swapgs:           /* return to user-space */
>
> Ok, so retint_swapgs is also on the error_exit path.
>
> What you're basically proposing is to use SYSRET on exceptions exit
> too AFAICT. And while I don't see anything wrong with the patch, you
> probably need to run this by more people like tip guys + Linus just in
> case. We can't allow ourselves to leak stuff here.

I'll cc Linus et all on v2.

>
>>        */
>>       DISABLE_INTERRUPTS(CLBR_ANY)
>>       TRACE_IRQS_IRETQ
>> +
>> +     /*
>> +      * Try to use SYSRET instead of IRET if we're returning to
>> +      * a completely clean 64-bit userspace context.
>> +      */
>> +     movq (RCX-R11)(%rsp), %rcx
>> +     cmpq %rcx,(RIP-R11)(%rsp)               /* RCX == RIP */
>> +     jne opportunistic_sysret_failed
>> +
>> +     /*
>> +      * On Intel CPUs, sysret with non-canonical RCX/RIP will #GP
>> +      * in kernel space.  This essentially lets the user take over
>> +      * the kernel, since userspace controls RSP.  It's not worth
>> +      * testing for canonicalness exactly -- this check detects any
>> +      * of the 17 high bits set, which is true for non-canonical
>> +      * or kernel addresses.  (This will pessimize vsyscall=native.
>> +      * Big deal.)
>> +      */
>> +     shr $47, %rcx
>
>         shr $__VIRTUAL_MASK_SHIFT, %rcx
>
> I guess, in case someone decides to play with the address space again
> and forgets this naked bit here.
>

I'll probably add a build-time assertion that __VIRTUAL_MASK_SHIFT ==
47 instead.  If we ever support CPUs with an extra level of page
tables, we'll probably need to patch the instruction, since we have a
security hole if that shift ever exceeds 47 on existing CPUs.

--Andy

>> +     jnz opportunistic_sysret_failed
>> +
>> +     cmpq $__USER_CS,(CS-R11)(%rsp)          /* CS must match SYSRET */
>> +     jne opportunistic_sysret_failed
>> +
>> +     movq (R11-R11)(%rsp), %r11
>> +     cmpq %r11,(EFLAGS-R11)(%rsp)            /* R11 == RFLAGS */
>> +     jne opportunistic_sysret_failed
>> +
>> +     testq $X86_EFLAGS_RF,%r11               /* sysret can't restore RF */
>> +     jnz opportunistic_sysret_failed
>> +
>> +     /* nothing to check for RSP */
>> +
>> +     cmpq $__USER_DS,(SS-R11)(%rsp)          /* SS must match SYSRET */
>> +     jne opportunistic_sysret_failed
>> +
>> +     /*
>> +      * We win!  This label is here just for ease of understanding
>> +      * perf profiles.  Nothing jumps here.
>> +      */
>> +irq_return_via_sysret:
>> +     CFI_REMEMBER_STATE
>> +     RESTORE_ARGS 1,8,1
>> +     movq (RSP-RIP)(%rsp),%rsp
>> +     USERGS_SYSRET64
>> +     CFI_RESTORE_STATE
>> +
>> +opportunistic_sysret_failed:
>>       SWAPGS
>>       jmp restore_args
>
> Ok, dammit, it happened again:
>
> ...
> [   13.480778] BTRFS info (device sda9): disk space caching is enabled
> [   13.487270] BTRFS: has skinny extents
> [   14.368392] IPv6: ADDRCONF(NETDEV_UP): eth0: link is not ready
> [   15.928679] e1000e: eth0 NIC Link is Up 100 Mbps Full Duplex, Flow Control: Rx/Tx
> [   15.936406] e1000e 0000:00:19.0 eth0: 10/100 speed: disabling TSO
> [   15.942879] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
> [  115.065408] ata1.00: exception Emask 0x0 SAct 0x7fd80000 SErr 0x0 action 0x6 frozen
> [  115.073159] ata1.00: failed command: WRITE FPDMA QUEUED
> [  115.078459] ata1.00: cmd 61/80:98:c0:e7:35/4a:00:1f:00:00/40 tag 19 ncq 9764864 out
>          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
> [  115.093623] ata1.00: status: { DRDY }
> [  115.097314] ata1.00: failed command: WRITE FPDMA QUEUED
> [  115.102569] ata1.00: cmd 61/30:a0:40:32:36/20:00:1f:00:00/40 tag 20 ncq 4218880 out
>          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
> [  115.117668] ata1.00: status: { DRDY }
> [  115.121351] ata1.00: failed command: WRITE FPDMA QUEUED
> [  115.126602] ata1.00: cmd 61/80:b0:80:f7:37/20:00:1f:00:00/40 tag 22 ncq 4259840 out
>          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
> [  115.141701] ata1.00: status: { DRDY }
> [  115.145389] ata1.00: failed command: WRITE FPDMA QUEUED
> [  115.150638] ata1.00: cmd 61/90:b8:70:52:36/03:00:1f:00:00/40 tag 23 ncq 466944 out
>          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
> [  115.165682] ata1.00: status: { DRDY }
> [  115.169357] ata1.00: failed command: WRITE FPDMA QUEUED
> [  115.174617] ata1.00: cmd 61/c0:c0:00:58:36/39:00:1f:00:00/40 tag 24 ncq 7569408 out
>          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
> [  115.189713] ata1.00: status: { DRDY }
> [  115.193400] ata1.00: failed command: WRITE FPDMA QUEUED
> [  115.198650] ata1.00: cmd 61/80:c8:c0:91:36/4b:00:1f:00:00/40 tag 25 ncq 9895936 out
>          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
> [  115.213755] ata1.00: status: { DRDY }
> [  115.217431] ata1.00: failed command: WRITE FPDMA QUEUED
> [  115.222692] ata1.00: cmd 61/80:d0:40:dd:36/4a:00:1f:00:00/40 tag 26 ncq 9764864 out
>          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
> [  115.237788] ata1.00: status: { DRDY }
> [  115.241479] ata1.00: failed command: WRITE FPDMA QUEUED
> [  115.246723] ata1.00: cmd 61/40:d8:c0:27:37/30:00:1f:00:00/40 tag 27 ncq 6324224 out
>          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
> [  115.261825] ata1.00: status: { DRDY }
> [  115.265519] ata1.00: failed command: READ FPDMA QUEUED
> [  115.270683] ata1.00: cmd 60/08:e0:40:98:18/00:00:1f:00:00/40 tag 28 ncq 4096 in
>          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
> [  115.285432] ata1.00: status: { DRDY }
> [  115.289113] ata1.00: failed command: WRITE FPDMA QUEUED
> [  115.294367] ata1.00: cmd 61/00:e8:00:58:37/15:00:1f:00:00/40 tag 29 ncq 2752512 out
>          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
> [  115.309463] ata1.00: status: { DRDY }
> [  115.313149] ata1.00: failed command: WRITE FPDMA QUEUED
> [  115.318399] ata1.00: cmd 61/00:f0:00:6d:37/2b:00:1f:00:00/40 tag 30 ncq 5636096 out
>          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
> [  115.333503] ata1.00: status: { DRDY }
> [  115.337201] ata1: hard resetting link
> [  115.645895] ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
> [  115.743776] ata1.00: configured for UDMA/133
> [  115.748074] ata1.00: device reported invalid CHS sector 0
> [  115.753516] ata1.00: device reported invalid CHS sector 0
> [  115.758947] ata1.00: device reported invalid CHS sector 0
> [  115.764383] ata1.00: device reported invalid CHS sector 0
> [  115.769825] ata1.00: device reported invalid CHS sector 0
> [  115.775260] ata1.00: device reported invalid CHS sector 0
> [  115.780689] ata1.00: device reported invalid CHS sector 0
> [  115.786123] ata1.00: device reported invalid CHS sector 0
> [  115.791563] ata1.00: device reported invalid CHS sector 0
> [  115.796998] ata1.00: device reported invalid CHS sector 0
> [  115.802431] ata1.00: device reported invalid CHS sector 0
> [  115.807914] ata1: EH complete
> [  146.085052] ata1.00: exception Emask 0x0 SAct 0x77c SErr 0x0 action 0x6 frozen
> [  146.092320] ata1.00: failed command: READ FPDMA QUEUED
> [  146.097489] ata1.00: cmd 60/08:10:40:98:18/00:00:1f:00:00/40 tag 2 ncq 4096 in
>          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
> [  146.112367] ata1.00: status: { DRDY }
> [  146.116244] ata1.00: failed command: WRITE FPDMA QUEUED
> [  146.121696] ata1.00: cmd 61/40:18:c0:27:37/30:00:1f:00:00/40 tag 3 ncq 6324224 out
>          res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
> [  146.137389] ata1.00: status: { DRDY }
> [  146.141267] ata1.00: failed command: WRITE FPDMA QUEUED
> [  146.146710] ata1.00: cmd 61/80:20:40:dd:36/4a:00:1f:00:00/40 tag 4 ncq 9764864 out
>          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
> [  146.162395] ata1.00: status: { DRDY }
> [  146.166269] ata1.00: failed command: WRITE FPDMA QUEUED
> [  146.171723] ata1.00: cmd 61/80:28:c0:91:36/4b:00:1f:00:00/40 tag 5 ncq 9895936 out
>          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
> [  146.187402] ata1.00: status: { DRDY }
> [  146.191278] ata1.00: failed command: WRITE FPDMA QUEUED
> [  146.196718] ata1.00: cmd 61/c0:30:00:58:36/39:00:1f:00:00/40 tag 6 ncq 7569408 out
>          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
> [  146.212399] ata1.00: status: { DRDY }
> [  146.216275] ata1.00: failed command: WRITE FPDMA QUEUED
> [  146.221723] ata1.00: cmd 61/80:40:80:f7:37/20:00:1f:00:00/40 tag 8 ncq 4259840 out
>          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
> [  146.237407] ata1.00: status: { DRDY }
> [  146.241280] ata1.00: failed command: WRITE FPDMA QUEUED
> [  146.246725] ata1.00: cmd 61/30:48:40:32:36/20:00:1f:00:00/40 tag 9 ncq 4218880 out
>          res 40/00:01:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
> [  146.262407] ata1.00: status: { DRDY }
> [  146.266282] ata1.00: failed command: WRITE FPDMA QUEUED
> [  146.271731] ata1.00: cmd 61/80:50:c0:e7:35/4a:00:1f:00:00/40 tag 10 ncq 9764864 out
>          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
> [  146.287498] ata1.00: status: { DRDY }
> [  146.291371] ata1: hard resetting link
> [  146.599768] ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
> [  146.608680] ata1.00: configured for UDMA/133
> [  146.613180] ata1.00: device reported invalid CHS sector 0
> [  146.618807] ata1.00: device reported invalid CHS sector 0
> [  146.624430] ata1.00: device reported invalid CHS sector 0
> [  146.630048] ata1.00: device reported invalid CHS sector 0
> [  146.635658] ata1.00: device reported invalid CHS sector 0
> [  146.641270] ata1.00: device reported invalid CHS sector 0
> [  146.646881] ata1.00: device reported invalid CHS sector 0
> [  146.652484] ata1.00: device reported invalid CHS sector 0
> [  146.658122] ata1: EH complete
> [  177.110908] ata1.00: exception Emask 0x0 SAct 0x7f800 SErr 0x0 action 0x6 frozen
> [  177.118525] ata1.00: failed command: WRITE FPDMA QUEUED
> [  177.123960] ata1.00: cmd 61/80:58:c0:e7:35/4a:00:1f:00:00/40 tag 11 ncq 9764864 out
>          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
> [  177.139559] ata1.00: status: { DRDY }
> [  177.143419] ata1.00: failed command: WRITE FPDMA QUEUED
> [  177.148849] ata1.00: cmd 61/30:60:40:32:36/20:00:1f:00:00/40 tag 12 ncq 4218880 out
>          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
> [  177.164454] ata1.00: status: { DRDY }
> [  177.168311] ata1.00: failed command: WRITE FPDMA QUEUED
> [  177.173747] ata1.00: cmd 61/80:68:80:f7:37/20:00:1f:00:00/40 tag 13 ncq 4259840 out
>          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
> [  177.189387] ata1.00: status: { DRDY }
> [  177.193254] ata1.00: failed command: WRITE FPDMA QUEUED
> [  177.198691] ata1.00: cmd 61/c0:70:00:58:36/39:00:1f:00:00/40 tag 14 ncq 7569408 out
>          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
> [  177.214430] ata1.00: status: { DRDY }
> [  177.218304] ata1.00: failed command: WRITE FPDMA QUEUED
> [  177.223755] ata1.00: cmd 61/80:78:c0:91:36/4b:00:1f:00:00/40 tag 15 ncq 9895936 out
>          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
> [  177.239575] ata1.00: status: { DRDY }
> [  177.243460] ata1.00: failed command: WRITE FPDMA QUEUED
> [  177.248908] ata1.00: cmd 61/80:80:40:dd:36/4a:00:1f:00:00/40 tag 16 ncq 9764864 out
>          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
> [  177.264743] ata1.00: status: { DRDY }
> [  177.268622] ata1.00: failed command: WRITE FPDMA QUEUED
> [  177.274075] ata1.00: cmd 61/40:88:c0:27:37/30:00:1f:00:00/40 tag 17 ncq 6324224 out
>          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
> [  177.289913] ata1.00: status: { DRDY }
> [  177.293795] ata1.00: failed command: READ FPDMA QUEUED
> [  177.299153] ata1.00: cmd 60/08:90:40:98:18/00:00:1f:00:00/40 tag 18 ncq 4096 in
>          res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
> [  177.314633] ata1.00: status: { DRDY }
> [  177.318509] ata1: hard resetting link
> [  177.626616] ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
> [  177.968639] ata1.00: configured for UDMA/133
> [  177.973609] ata1.00: device reported invalid CHS sector 0
> [  177.979669] ata1.00: device reported invalid CHS sector 0
> [  177.985723] ata1.00: device reported invalid CHS sector 0
> [  177.991371] ata1.00: device reported invalid CHS sector 0
> [  177.997008] ata1.00: device reported invalid CHS sector 0
> [  178.002641] ata1.00: device reported invalid CHS sector 0
> [  178.008260] ata1.00: device reported invalid CHS sector 0
> [  178.013886] ata1.00: device reported invalid CHS sector 0
> [  178.019558] ata1: EH complete
>
> --
> Regards/Gruss,
>     Boris.
>
> Sent from a fat crate under my desk. Formatting is fine.
> --



-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2015-01-10 21:05 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-11-07 23:58 [PATCH 0/3] x86_64,entry: Rearrange the syscall exit optimizations Andy Lutomirski
2014-11-07 23:58 ` [PATCH 1/3] x86_64,entry: Fix RCX for traced syscalls Andy Lutomirski
2015-01-05 12:59   ` Borislav Petkov
2015-01-05 20:31     ` Andy Lutomirski
2015-01-06 15:34       ` Borislav Petkov
2015-01-06 18:43         ` Andy Lutomirski
2015-01-06 19:00           ` Borislav Petkov
2015-01-07 15:55             ` Borislav Petkov
2014-11-07 23:58 ` [PATCH 2/3] x86_64,entry: Use sysret to return to userspace when possible Andy Lutomirski
2015-01-08 12:29   ` Borislav Petkov
2015-01-08 13:57     ` Borislav Petkov
2015-01-10 21:05     ` Andy Lutomirski
2015-01-09 10:40   ` Borislav Petkov
2014-11-07 23:58 ` [PATCH 3/3] x86_64,entry: Remove the syscall exit audit and schedule optimizations Andy Lutomirski
2015-01-09 15:53   ` Borislav Petkov
2015-01-09 16:08     ` Andy Lutomirski
2014-12-04  1:42 ` [PATCH 0/3] x86_64,entry: Rearrange the syscall exit optimizations Andy Lutomirski

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.