linux-kernel.vger.kernel.org archive mirror
* [PATCH] x86/asm/entry/64: better check for canonical address
@ 2015-03-26 12:42 Denys Vlasenko
  2015-03-26 18:45 ` Andy Lutomirski
                   ` (3 more replies)
  0 siblings, 4 replies; 32+ messages in thread
From: Denys Vlasenko @ 2015-03-26 12:42 UTC (permalink / raw)
  To: Andy Lutomirski; +Cc: Denys Vlasenko, Borislav Petkov, x86, linux-kernel

This change makes the check exact (no more false positives
on kernel addresses).

It isn't really important to be fully correct here -
almost all addresses we'll ever see will be userspace ones,
but OTOH it looks to be cheap enough:
the new code uses two more ALU ops but preserves %rcx,
allowing us to avoid reloading it from pt_regs->cx again.
On disassembly level, the changes are:

cmp %rcx,0x80(%rsp) -> mov 0x80(%rsp),%r11; cmp %rcx,%r11
shr $0x2f,%rcx      -> shl $0x10,%rcx; sar $0x10,%rcx; cmp %rcx,%r11
mov 0x58(%rsp),%rcx -> (eliminated)

Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
CC: Borislav Petkov <bp@alien8.de>
CC: x86@kernel.org
CC: linux-kernel@vger.kernel.org
---

Andy, I'm undecided myself on the merits of doing this.
If you like it, feel free to take it in your tree.
I trimmed the CC list to avoid bothering too many people with this trivial
and quite possibly "useless churn"-class change.

 arch/x86/kernel/entry_64.S | 23 ++++++++++++-----------
 1 file changed, 12 insertions(+), 11 deletions(-)

diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index bf9afad..a36d04d 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -688,26 +688,27 @@ retint_swapgs:		/* return to user-space */
 	 * a completely clean 64-bit userspace context.
 	 */
 	movq RCX(%rsp),%rcx
-	cmpq %rcx,RIP(%rsp)		/* RCX == RIP */
+	movq RIP(%rsp),%r11
+	cmpq %rcx,%r11			/* RCX == RIP */
 	jne opportunistic_sysret_failed
 
 	/*
 	 * On Intel CPUs, sysret with non-canonical RCX/RIP will #GP
 	 * in kernel space.  This essentially lets the user take over
-	 * the kernel, since userspace controls RSP.  It's not worth
-	 * testing for canonicalness exactly -- this check detects any
-	 * of the 17 high bits set, which is true for non-canonical
-	 * or kernel addresses.  (This will pessimize vsyscall=native.
-	 * Big deal.)
+	 * the kernel, since userspace controls RSP.
 	 *
-	 * If virtual addresses ever become wider, this will need
+	 * If width of "canonical tail" ever become variable, this will need
 	 * to be updated to remain correct on both old and new CPUs.
 	 */
 	.ifne __VIRTUAL_MASK_SHIFT - 47
 	.error "virtual address width changed -- sysret checks need update"
 	.endif
-	shr $__VIRTUAL_MASK_SHIFT, %rcx
-	jnz opportunistic_sysret_failed
+	/* Change top 16 bits to be a sign-extension of the rest */
+	shl	$(64 - (__VIRTUAL_MASK_SHIFT+1)), %rcx
+	sar	$(64 - (__VIRTUAL_MASK_SHIFT+1)), %rcx
+	/* If this changed %rcx, it was not canonical */
+	cmpq	%rcx, %r11
+	jne	opportunistic_sysret_failed
 
 	cmpq $__USER_CS,CS(%rsp)	/* CS must match SYSRET */
 	jne opportunistic_sysret_failed
@@ -730,8 +731,8 @@ retint_swapgs:		/* return to user-space */
 	 */
 irq_return_via_sysret:
 	CFI_REMEMBER_STATE
-	/* r11 is already restored (see code above) */
-	RESTORE_C_REGS_EXCEPT_R11
+	/* rcx and r11 are already restored (see code above) */
+	RESTORE_C_REGS_EXCEPT_RCX_R11
 	movq RSP(%rsp),%rsp
 	USERGS_SYSRET64
 	CFI_RESTORE_STATE
-- 
1.8.1.4



* Re: [PATCH] x86/asm/entry/64: better check for canonical address
  2015-03-26 12:42 [PATCH] x86/asm/entry/64: better check for canonical address Denys Vlasenko
@ 2015-03-26 18:45 ` Andy Lutomirski
  2015-03-27  8:57   ` Borislav Petkov
  2015-03-30 14:27   ` Denys Vlasenko
  2015-03-27  8:11 ` Ingo Molnar
                   ` (2 subsequent siblings)
  3 siblings, 2 replies; 32+ messages in thread
From: Andy Lutomirski @ 2015-03-26 18:45 UTC (permalink / raw)
  To: Denys Vlasenko; +Cc: Borislav Petkov, X86 ML, linux-kernel

On Thu, Mar 26, 2015 at 5:42 AM, Denys Vlasenko <dvlasenk@redhat.com> wrote:
> This change makes the check exact (no more false positives
> on kernel addresses).
>
> It isn't really important to be fully correct here -
> almost all addresses we'll ever see will be userspace ones,
> but OTOH it looks to be cheap enough:
> the new code uses two more ALU ops but preserves %rcx,
> allowing to not reload it from pt_regs->cx again.
> On disassembly level, the changes are:
>
> cmp %rcx,0x80(%rsp) -> mov 0x80(%rsp),%r11; cmp %rcx,%r11
> shr $0x2f,%rcx      -> shl $0x10,%rcx; sar $0x10,%rcx; cmp %rcx,%r11
> mov 0x58(%rsp),%rcx -> (eliminated)
>
> Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
> CC: Borislav Petkov <bp@alien8.de>
> CC: x86@kernel.org
> CC: linux-kernel@vger.kernel.org
> ---
>
> Andy, I'd undecided myself on the merits of doing this.
> If you like it, feel free to take it in your tree.
> I trimmed CC list to not bother too many people with this trivial
> and quite possibly "useless churn"-class change.

I suspect that the two added ALU ops are free for all practical
purposes, and the performance of this path isn't *that* critical.

If anyone is running with vsyscall=native because they need the
performance, then this would be a big win.  Otherwise I don't have a
real preference.  Anyone else have any thoughts here?

Let me just run through the math quickly to make sure I believe all the numbers:

Canonical addresses either start with 17 zeros or 17 ones.

In the old code, we checked that the top (64-47) = 17 bits were all
zero.  We did this by shifting right by 47 bits and making sure that
nothing was left.

In the new code, we're shifting left by (64 - 48) = 16 bits and then
arithmetic-shifting right by the same amount, thus propagating the
17th-highest bit (bit 47) to all positions above it.  If we get the same
value we started with, then we're good to go.

So it looks okay to me.
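
For anyone who prefers to check this in C rather than on paper, below is a
quick user-space sketch of the two checks (illustrative only, not kernel
code; it assumes the usual gcc/clang two's-complement behaviour for the
cast and the arithmetic right shift, which is what the sar instruction
does anyway):

#include <stdint.h>
#include <stdio.h>

#define VA_SHIFT 47	/* stands in for __VIRTUAL_MASK_SHIFT */

/* Old check: reject if any of the top 17 bits is set
 * (this also rejects perfectly canonical kernel addresses). */
static int old_check_ok(uint64_t rip)
{
	return (rip >> VA_SHIFT) == 0;
}

/* New check: shl 16 + sar 16, i.e. sign-extend from bit 47,
 * then compare the result with the original value. */
static int new_check_ok(uint64_t rip)
{
	int64_t sext = (int64_t)(rip << (64 - (VA_SHIFT + 1)));

	sext >>= 64 - (VA_SHIFT + 1);
	return (uint64_t)sext == rip;
}

int main(void)
{
	static const uint64_t samples[] = {
		0x00007fffffffe000ULL,	/* canonical user address   */
		0xffffffff81000000ULL,	/* canonical kernel address */
		0x0000800000000000ULL,	/* non-canonical            */
		0xffff7fffffffffffULL,	/* non-canonical            */
	};
	unsigned int i;

	for (i = 0; i < sizeof(samples) / sizeof(samples[0]); i++)
		printf("%#018llx  old:%d  new:%d\n",
		       (unsigned long long)samples[i],
		       old_check_ok(samples[i]), new_check_ok(samples[i]));
	return 0;
}

The kernel-text row is the interesting one: the old check rejects it (the
false positive the patch removes), the new check accepts it, and both
reject the two non-canonical values.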

IOW, the new code extends the optimization correctly to one more case
(native vsyscalls or the really weird corner case of returns to
emulated vsyscalls, although that should basically never happen) at
the cost of two probably-free ALU ops.

--Andy

>
>  arch/x86/kernel/entry_64.S | 23 ++++++++++++-----------
>  1 file changed, 12 insertions(+), 11 deletions(-)
>
> diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
> index bf9afad..a36d04d 100644
> --- a/arch/x86/kernel/entry_64.S
> +++ b/arch/x86/kernel/entry_64.S
> @@ -688,26 +688,27 @@ retint_swapgs:            /* return to user-space */
>          * a completely clean 64-bit userspace context.
>          */
>         movq RCX(%rsp),%rcx
> -       cmpq %rcx,RIP(%rsp)             /* RCX == RIP */
> +       movq RIP(%rsp),%r11
> +       cmpq %rcx,%r11                  /* RCX == RIP */
>         jne opportunistic_sysret_failed
>
>         /*
>          * On Intel CPUs, sysret with non-canonical RCX/RIP will #GP
>          * in kernel space.  This essentially lets the user take over
> -        * the kernel, since userspace controls RSP.  It's not worth
> -        * testing for canonicalness exactly -- this check detects any
> -        * of the 17 high bits set, which is true for non-canonical
> -        * or kernel addresses.  (This will pessimize vsyscall=native.
> -        * Big deal.)
> +        * the kernel, since userspace controls RSP.
>          *
> -        * If virtual addresses ever become wider, this will need
> +        * If width of "canonical tail" ever become variable, this will need
>          * to be updated to remain correct on both old and new CPUs.
>          */
>         .ifne __VIRTUAL_MASK_SHIFT - 47
>         .error "virtual address width changed -- sysret checks need update"
>         .endif
> -       shr $__VIRTUAL_MASK_SHIFT, %rcx
> -       jnz opportunistic_sysret_failed
> +       /* Change top 16 bits to be a sign-extension of the rest */
> +       shl     $(64 - (__VIRTUAL_MASK_SHIFT+1)), %rcx
> +       sar     $(64 - (__VIRTUAL_MASK_SHIFT+1)), %rcx
> +       /* If this changed %rcx, it was not canonical */
> +       cmpq    %rcx, %r11
> +       jne     opportunistic_sysret_failed
>
>         cmpq $__USER_CS,CS(%rsp)        /* CS must match SYSRET */
>         jne opportunistic_sysret_failed
> @@ -730,8 +731,8 @@ retint_swapgs:              /* return to user-space */
>          */
>  irq_return_via_sysret:
>         CFI_REMEMBER_STATE
> -       /* r11 is already restored (see code above) */
> -       RESTORE_C_REGS_EXCEPT_R11
> +       /* rcx and r11 are already restored (see code above) */
> +       RESTORE_C_REGS_EXCEPT_RCX_R11
>         movq RSP(%rsp),%rsp
>         USERGS_SYSRET64
>         CFI_RESTORE_STATE
> --
> 1.8.1.4
>



-- 
Andy Lutomirski
AMA Capital Management, LLC


* Re: [PATCH] x86/asm/entry/64: better check for canonical address
  2015-03-26 12:42 [PATCH] x86/asm/entry/64: better check for canonical address Denys Vlasenko
  2015-03-26 18:45 ` Andy Lutomirski
@ 2015-03-27  8:11 ` Ingo Molnar
  2015-03-27 10:45   ` Denys Vlasenko
  2015-03-27 11:27 ` Brian Gerst
  2015-04-02 17:37 ` Denys Vlasenko
  3 siblings, 1 reply; 32+ messages in thread
From: Ingo Molnar @ 2015-03-27  8:11 UTC (permalink / raw)
  To: Denys Vlasenko; +Cc: Andy Lutomirski, Borislav Petkov, x86, linux-kernel


* Denys Vlasenko <dvlasenk@redhat.com> wrote:

> This change makes the check exact (no more false positives
> on kernel addresses).
> 
> It isn't really important to be fully correct here -
> almost all addresses we'll ever see will be userspace ones,
> but OTOH it looks to be cheap enough:
> the new code uses two more ALU ops but preserves %rcx,
> allowing to not reload it from pt_regs->cx again.
> On disassembly level, the changes are:
> 
> cmp %rcx,0x80(%rsp) -> mov 0x80(%rsp),%r11; cmp %rcx,%r11
> shr $0x2f,%rcx      -> shl $0x10,%rcx; sar $0x10,%rcx; cmp %rcx,%r11
> mov 0x58(%rsp),%rcx -> (eliminated)
> 
> Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
> CC: Borislav Petkov <bp@alien8.de>
> CC: x86@kernel.org
> CC: linux-kernel@vger.kernel.org
> ---
> 
> Andy, I'd undecided myself on the merits of doing this.
> If you like it, feel free to take it in your tree.
> I trimmed CC list to not bother too many people with this trivial
> and quite possibly "useless churn"-class change.
> 
>  arch/x86/kernel/entry_64.S | 23 ++++++++++++-----------
>  1 file changed, 12 insertions(+), 11 deletions(-)
> 
> diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
> index bf9afad..a36d04d 100644
> --- a/arch/x86/kernel/entry_64.S
> +++ b/arch/x86/kernel/entry_64.S
> @@ -688,26 +688,27 @@ retint_swapgs:		/* return to user-space */
>  	 * a completely clean 64-bit userspace context.
>  	 */
>  	movq RCX(%rsp),%rcx
> -	cmpq %rcx,RIP(%rsp)		/* RCX == RIP */
> +	movq RIP(%rsp),%r11
> +	cmpq %rcx,%r11			/* RCX == RIP */
>  	jne opportunistic_sysret_failed

Btw., in the normal syscall entry path, RIP(%rsp) == RCX(%rsp), 
because we set up pt_regs like that - and at this point RIP/RCX is 
guaranteed to be canonical, right?

So if there's a mismatch generated, it's the kernel's doing.

Why don't we detect those cases where a new return address is created 
(ptrace, exec, etc.), check for canonicalness and add a TIF flag for 
it (and add it to the work mask) and execute the IRET from the slow 
path?

We already have a work-mask branch.

That would allow the removal of all these checks and canonization from 
the fast return path! We could go straight to the SYSRET...

The frequency of exec() and ptrace() is 2-3 orders of magnitude lower 
than the frequency of system calls, so this would be well worth it.

Am I missing anything?

Thanks,

	Ingo


* Re: [PATCH] x86/asm/entry/64: better check for canonical address
  2015-03-26 18:45 ` Andy Lutomirski
@ 2015-03-27  8:57   ` Borislav Petkov
  2015-03-30 14:27   ` Denys Vlasenko
  1 sibling, 0 replies; 32+ messages in thread
From: Borislav Petkov @ 2015-03-27  8:57 UTC (permalink / raw)
  To: Andy Lutomirski; +Cc: Denys Vlasenko, X86 ML, linux-kernel

On Thu, Mar 26, 2015 at 11:45:19AM -0700, Andy Lutomirski wrote:
> I suspect that the two added ALU ops are free for all practical
> purposes, and the performance of this path isn't *that* critical.
> 
> If anyone is running with vsyscall=native because they need the
> performance, then this would be a big win.  Otherwise I don't have a
> real preference.  Anyone else have any thoughts here?
> 
> Let me just run through the math quickly to make sure I believe all the numbers:
> 
> Canonical addresses either start with 17 zeros or 17 ones.
> 
> In the old code, we checked that the top (64-47) = 17 bits were all
> zero.  We did this by shifting right by 47 bits and making sure that
> nothing was left.
> 
> In the new code, we're shifting left by (64 - 48) = 16 bits and then
> signed shifting right by the same amount, this propagating the 17th
> highest bit to all positions to its left.  If we get the same value we
> started with, then we're good to go.
> 
> So it looks okay to me.
> 
> IOW, the new code extends the optimization correctly to one more case
> (native vsyscalls or the really weird corner case of returns to
> emulated vsyscalls, although that should basically never happen) at
> the cost of two probably-free ALU ops.

If we're going to apply this, I'd like this text in the commit message
please.

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.
--


* Re: [PATCH] x86/asm/entry/64: better check for canonical address
  2015-03-27  8:11 ` Ingo Molnar
@ 2015-03-27 10:45   ` Denys Vlasenko
  2015-03-27 11:17     ` Ingo Molnar
  0 siblings, 1 reply; 32+ messages in thread
From: Denys Vlasenko @ 2015-03-27 10:45 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Andy Lutomirski, Borislav Petkov, x86, linux-kernel

On 03/27/2015 09:11 AM, Ingo Molnar wrote:
> 
> * Denys Vlasenko <dvlasenk@redhat.com> wrote:
> 
>> This change makes the check exact (no more false positives
>> on kernel addresses).
>>
>> It isn't really important to be fully correct here -
>> almost all addresses we'll ever see will be userspace ones,
>> but OTOH it looks to be cheap enough:
>> the new code uses two more ALU ops but preserves %rcx,
>> allowing to not reload it from pt_regs->cx again.
>> On disassembly level, the changes are:
>>
>> cmp %rcx,0x80(%rsp) -> mov 0x80(%rsp),%r11; cmp %rcx,%r11
>> shr $0x2f,%rcx      -> shl $0x10,%rcx; sar $0x10,%rcx; cmp %rcx,%r11
>> mov 0x58(%rsp),%rcx -> (eliminated)
>>
>> Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
>> CC: Borislav Petkov <bp@alien8.de>
>> CC: x86@kernel.org
>> CC: linux-kernel@vger.kernel.org
>> ---
>>
>> Andy, I'd undecided myself on the merits of doing this.
>> If you like it, feel free to take it in your tree.
>> I trimmed CC list to not bother too many people with this trivial
>> and quite possibly "useless churn"-class change.
>>
>>  arch/x86/kernel/entry_64.S | 23 ++++++++++++-----------
>>  1 file changed, 12 insertions(+), 11 deletions(-)
>>
>> diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
>> index bf9afad..a36d04d 100644
>> --- a/arch/x86/kernel/entry_64.S
>> +++ b/arch/x86/kernel/entry_64.S
>> @@ -688,26 +688,27 @@ retint_swapgs:		/* return to user-space */
>>  	 * a completely clean 64-bit userspace context.
>>  	 */
>>  	movq RCX(%rsp),%rcx
>> -	cmpq %rcx,RIP(%rsp)		/* RCX == RIP */
>> +	movq RIP(%rsp),%r11
>> +	cmpq %rcx,%r11			/* RCX == RIP */
>>  	jne opportunistic_sysret_failed
> 
> Btw., in the normal syscall entry path, RIP(%rsp) == RCX(%rsp), 
> because we set up pt_regs like that - and at this point RIP/RCX is 
> guaranteed to be canonical, right?
> 
> So if there's a mismatch generated, it's the kernel's doing.

This is an optimization on the IRET exit code path.

We go here if we know that pt_regs may have been modified by e.g. ptrace.

I think we also go here even on interrupt return.
(Granted, the chances that RCX was the same as RIP at the moment of the
interrupt are slim, but we would still check for that and (ab)use SYSRET
if it looks like it will work.)


> Why don't we detect those cases where a new return address is created 
> (ptrace, exec, etc.), check for canonicalness and add a TIF flag for 
> it (and add it to the work mask) and execute the IRET from the slow 
> path?
> 
> We already have a work-mask branch.
> 
> That would allow the removal of all these checks and canonization from 
> the fast return path! We could go straight to the SYSRET...

The point is, this is not a fast return path.

It's a "let's try to use fast SYSRET instead of IRET" path.


> The frequency of exec() and ptrace() is 2-3 orders of magnitude lower 
> than the frequency of system calls, so this would be well worth it.

On untraced system calls, we don't come here. We go to SYSRET
without these checks.



* Re: [PATCH] x86/asm/entry/64: better check for canonical address
  2015-03-27 10:45   ` Denys Vlasenko
@ 2015-03-27 11:17     ` Ingo Molnar
  2015-03-27 11:28       ` Brian Gerst
  0 siblings, 1 reply; 32+ messages in thread
From: Ingo Molnar @ 2015-03-27 11:17 UTC (permalink / raw)
  To: Denys Vlasenko; +Cc: Andy Lutomirski, Borislav Petkov, x86, linux-kernel


* Denys Vlasenko <dvlasenk@redhat.com> wrote:

> On 03/27/2015 09:11 AM, Ingo Molnar wrote:
> > 
> > * Denys Vlasenko <dvlasenk@redhat.com> wrote:
> > 
> >> This change makes the check exact (no more false positives
> >> on kernel addresses).
> >>
> >> It isn't really important to be fully correct here -
> >> almost all addresses we'll ever see will be userspace ones,
> >> but OTOH it looks to be cheap enough:
> >> the new code uses two more ALU ops but preserves %rcx,
> >> allowing to not reload it from pt_regs->cx again.
> >> On disassembly level, the changes are:
> >>
> >> cmp %rcx,0x80(%rsp) -> mov 0x80(%rsp),%r11; cmp %rcx,%r11
> >> shr $0x2f,%rcx      -> shl $0x10,%rcx; sar $0x10,%rcx; cmp %rcx,%r11
> >> mov 0x58(%rsp),%rcx -> (eliminated)
> >>
> >> Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
> >> CC: Borislav Petkov <bp@alien8.de>
> >> CC: x86@kernel.org
> >> CC: linux-kernel@vger.kernel.org
> >> ---
> >>
> >> Andy, I'd undecided myself on the merits of doing this.
> >> If you like it, feel free to take it in your tree.
> >> I trimmed CC list to not bother too many people with this trivial
> >> and quite possibly "useless churn"-class change.
> >>
> >>  arch/x86/kernel/entry_64.S | 23 ++++++++++++-----------
> >>  1 file changed, 12 insertions(+), 11 deletions(-)
> >>
> >> diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
> >> index bf9afad..a36d04d 100644
> >> --- a/arch/x86/kernel/entry_64.S
> >> +++ b/arch/x86/kernel/entry_64.S
> >> @@ -688,26 +688,27 @@ retint_swapgs:		/* return to user-space */
> >>  	 * a completely clean 64-bit userspace context.
> >>  	 */
> >>  	movq RCX(%rsp),%rcx
> >> -	cmpq %rcx,RIP(%rsp)		/* RCX == RIP */
> >> +	movq RIP(%rsp),%r11
> >> +	cmpq %rcx,%r11			/* RCX == RIP */
> >>  	jne opportunistic_sysret_failed
> > 
> > Btw., in the normal syscall entry path, RIP(%rsp) == RCX(%rsp), 
> > because we set up pt_regs like that - and at this point RIP/RCX is 
> > guaranteed to be canonical, right?
> > 
> > So if there's a mismatch generated, it's the kernel's doing.
> 
> This is an optimization on IRET exit code path.
> 
> We go here if we know that pt_regs can be modified by .e.g. ptrace.
>
> I think we also go here even on interrupt return.

Yeah, missed that, this would kill any flag based approach.

> (Granted, chances that RCX was the same as RIP at the moment of 
> interrupt are slim, but we still would check that and (ab)use SYSRET 
> if it looks like it'll work).

Btw., there's a neat trick we could do: in the HLT, MWAIT and 
ACPI-idle code we could attempt to set up RCX to match RIP, to trigger 
this optimization in the common 'irq interrupted the idle task' case?

Thanks,

	Ingo


* Re: [PATCH] x86/asm/entry/64: better check for canonical address
  2015-03-26 12:42 [PATCH] x86/asm/entry/64: better check for canonical address Denys Vlasenko
  2015-03-26 18:45 ` Andy Lutomirski
  2015-03-27  8:11 ` Ingo Molnar
@ 2015-03-27 11:27 ` Brian Gerst
  2015-03-27 11:31   ` Ingo Molnar
  2015-04-02 17:37 ` Denys Vlasenko
  3 siblings, 1 reply; 32+ messages in thread
From: Brian Gerst @ 2015-03-27 11:27 UTC (permalink / raw)
  To: Denys Vlasenko
  Cc: Andy Lutomirski, Borislav Petkov, the arch/x86 maintainers,
	Linux Kernel Mailing List

On Thu, Mar 26, 2015 at 8:42 AM, Denys Vlasenko <dvlasenk@redhat.com> wrote:
> This change makes the check exact (no more false positives
> on kernel addresses).
>
> It isn't really important to be fully correct here -
> almost all addresses we'll ever see will be userspace ones,
> but OTOH it looks to be cheap enough:
> the new code uses two more ALU ops but preserves %rcx,
> allowing to not reload it from pt_regs->cx again.
> On disassembly level, the changes are:
>
> cmp %rcx,0x80(%rsp) -> mov 0x80(%rsp),%r11; cmp %rcx,%r11
> shr $0x2f,%rcx      -> shl $0x10,%rcx; sar $0x10,%rcx; cmp %rcx,%r11
> mov 0x58(%rsp),%rcx -> (eliminated)
>
> Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
> CC: Borislav Petkov <bp@alien8.de>
> CC: x86@kernel.org
> CC: linux-kernel@vger.kernel.org
> ---
>
> Andy, I'd undecided myself on the merits of doing this.
> If you like it, feel free to take it in your tree.
> I trimmed CC list to not bother too many people with this trivial
> and quite possibly "useless churn"-class change.
>
>  arch/x86/kernel/entry_64.S | 23 ++++++++++++-----------
>  1 file changed, 12 insertions(+), 11 deletions(-)
>
> diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
> index bf9afad..a36d04d 100644
> --- a/arch/x86/kernel/entry_64.S
> +++ b/arch/x86/kernel/entry_64.S
> @@ -688,26 +688,27 @@ retint_swapgs:            /* return to user-space */
>          * a completely clean 64-bit userspace context.
>          */
>         movq RCX(%rsp),%rcx
> -       cmpq %rcx,RIP(%rsp)             /* RCX == RIP */
> +       movq RIP(%rsp),%r11
> +       cmpq %rcx,%r11                  /* RCX == RIP */
>         jne opportunistic_sysret_failed
>
>         /*
>          * On Intel CPUs, sysret with non-canonical RCX/RIP will #GP
>          * in kernel space.  This essentially lets the user take over
> -        * the kernel, since userspace controls RSP.  It's not worth
> -        * testing for canonicalness exactly -- this check detects any
> -        * of the 17 high bits set, which is true for non-canonical
> -        * or kernel addresses.  (This will pessimize vsyscall=native.
> -        * Big deal.)
> +        * the kernel, since userspace controls RSP.
>          *
> -        * If virtual addresses ever become wider, this will need
> +        * If width of "canonical tail" ever become variable, this will need
>          * to be updated to remain correct on both old and new CPUs.
>          */
>         .ifne __VIRTUAL_MASK_SHIFT - 47
>         .error "virtual address width changed -- sysret checks need update"
>         .endif
> -       shr $__VIRTUAL_MASK_SHIFT, %rcx
> -       jnz opportunistic_sysret_failed
> +       /* Change top 16 bits to be a sign-extension of the rest */
> +       shl     $(64 - (__VIRTUAL_MASK_SHIFT+1)), %rcx
> +       sar     $(64 - (__VIRTUAL_MASK_SHIFT+1)), %rcx
> +       /* If this changed %rcx, it was not canonical */
> +       cmpq    %rcx, %r11
> +       jne     opportunistic_sysret_failed
>
>         cmpq $__USER_CS,CS(%rsp)        /* CS must match SYSRET */
>         jne opportunistic_sysret_failed

Would it be possible to skip this check entirely on AMD processors?
It's my understanding that AMD correctly issues the #GP from CPL3,
causing a stack switch.

Looking at the AMD docs, sysret doesn't even check for a canonical
address.  The #GP is probably from the instruction fetch at the
non-canonical address instead of from sysret itself.

--
Brian Gerst


* Re: [PATCH] x86/asm/entry/64: better check for canonical address
  2015-03-27 11:17     ` Ingo Molnar
@ 2015-03-27 11:28       ` Brian Gerst
  2015-03-27 11:34         ` Ingo Molnar
  0 siblings, 1 reply; 32+ messages in thread
From: Brian Gerst @ 2015-03-27 11:28 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Denys Vlasenko, Andy Lutomirski, Borislav Petkov,
	the arch/x86 maintainers, Linux Kernel Mailing List

On Fri, Mar 27, 2015 at 7:17 AM, Ingo Molnar <mingo@kernel.org> wrote:
>
> * Denys Vlasenko <dvlasenk@redhat.com> wrote:
>
>> On 03/27/2015 09:11 AM, Ingo Molnar wrote:
>> >
>> > * Denys Vlasenko <dvlasenk@redhat.com> wrote:
>> >
>> >> This change makes the check exact (no more false positives
>> >> on kernel addresses).
>> >>
>> >> It isn't really important to be fully correct here -
>> >> almost all addresses we'll ever see will be userspace ones,
>> >> but OTOH it looks to be cheap enough:
>> >> the new code uses two more ALU ops but preserves %rcx,
>> >> allowing to not reload it from pt_regs->cx again.
>> >> On disassembly level, the changes are:
>> >>
>> >> cmp %rcx,0x80(%rsp) -> mov 0x80(%rsp),%r11; cmp %rcx,%r11
>> >> shr $0x2f,%rcx      -> shl $0x10,%rcx; sar $0x10,%rcx; cmp %rcx,%r11
>> >> mov 0x58(%rsp),%rcx -> (eliminated)
>> >>
>> >> Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
>> >> CC: Borislav Petkov <bp@alien8.de>
>> >> CC: x86@kernel.org
>> >> CC: linux-kernel@vger.kernel.org
>> >> ---
>> >>
>> >> Andy, I'd undecided myself on the merits of doing this.
>> >> If you like it, feel free to take it in your tree.
>> >> I trimmed CC list to not bother too many people with this trivial
>> >> and quite possibly "useless churn"-class change.
>> >>
>> >>  arch/x86/kernel/entry_64.S | 23 ++++++++++++-----------
>> >>  1 file changed, 12 insertions(+), 11 deletions(-)
>> >>
>> >> diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
>> >> index bf9afad..a36d04d 100644
>> >> --- a/arch/x86/kernel/entry_64.S
>> >> +++ b/arch/x86/kernel/entry_64.S
>> >> @@ -688,26 +688,27 @@ retint_swapgs:               /* return to user-space */
>> >>     * a completely clean 64-bit userspace context.
>> >>     */
>> >>    movq RCX(%rsp),%rcx
>> >> -  cmpq %rcx,RIP(%rsp)             /* RCX == RIP */
>> >> +  movq RIP(%rsp),%r11
>> >> +  cmpq %rcx,%r11                  /* RCX == RIP */
>> >>    jne opportunistic_sysret_failed
>> >
>> > Btw., in the normal syscall entry path, RIP(%rsp) == RCX(%rsp),
>> > because we set up pt_regs like that - and at this point RIP/RCX is
>> > guaranteed to be canonical, right?
>> >
>> > So if there's a mismatch generated, it's the kernel's doing.
>>
>> This is an optimization on IRET exit code path.
>>
>> We go here if we know that pt_regs can be modified by .e.g. ptrace.
>>
>> I think we also go here even on interrupt return.
>
> Yeah, missed that, this would kill any flag based approach.
>
>> (Granted, chances that RCX was the same as RIP at the moment of
>> interrupt are slim, but we still would check that and (ab)use SYSRET
>> if it looks like it'll work).
>
> Btw., there's a neat trick we could do: in the HLT, MWAIT and
> ACPI-idle code we could attempt to set up RCX to match RIP, to trigger
> this optimization in the common 'irq interrupted the idle task' case?

sysret only returns to CPL3.

--
Brian Gerst


* Re: [PATCH] x86/asm/entry/64: better check for canonical address
  2015-03-27 11:27 ` Brian Gerst
@ 2015-03-27 11:31   ` Ingo Molnar
  2015-03-27 21:37     ` Andy Lutomirski
  0 siblings, 1 reply; 32+ messages in thread
From: Ingo Molnar @ 2015-03-27 11:31 UTC (permalink / raw)
  To: Brian Gerst
  Cc: Denys Vlasenko, Andy Lutomirski, Borislav Petkov,
	the arch/x86 maintainers, Linux Kernel Mailing List


* Brian Gerst <brgerst@gmail.com> wrote:

> On Thu, Mar 26, 2015 at 8:42 AM, Denys Vlasenko <dvlasenk@redhat.com> wrote:
> > This change makes the check exact (no more false positives
> > on kernel addresses).
> >
> > It isn't really important to be fully correct here -
> > almost all addresses we'll ever see will be userspace ones,
> > but OTOH it looks to be cheap enough:
> > the new code uses two more ALU ops but preserves %rcx,
> > allowing to not reload it from pt_regs->cx again.
> > On disassembly level, the changes are:
> >
> > cmp %rcx,0x80(%rsp) -> mov 0x80(%rsp),%r11; cmp %rcx,%r11
> > shr $0x2f,%rcx      -> shl $0x10,%rcx; sar $0x10,%rcx; cmp %rcx,%r11
> > mov 0x58(%rsp),%rcx -> (eliminated)
> >
> > Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
> > CC: Borislav Petkov <bp@alien8.de>
> > CC: x86@kernel.org
> > CC: linux-kernel@vger.kernel.org
> > ---
> >
> > Andy, I'd undecided myself on the merits of doing this.
> > If you like it, feel free to take it in your tree.
> > I trimmed CC list to not bother too many people with this trivial
> > and quite possibly "useless churn"-class change.
> >
> >  arch/x86/kernel/entry_64.S | 23 ++++++++++++-----------
> >  1 file changed, 12 insertions(+), 11 deletions(-)
> >
> > diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
> > index bf9afad..a36d04d 100644
> > --- a/arch/x86/kernel/entry_64.S
> > +++ b/arch/x86/kernel/entry_64.S
> > @@ -688,26 +688,27 @@ retint_swapgs:            /* return to user-space */
> >          * a completely clean 64-bit userspace context.
> >          */
> >         movq RCX(%rsp),%rcx
> > -       cmpq %rcx,RIP(%rsp)             /* RCX == RIP */
> > +       movq RIP(%rsp),%r11
> > +       cmpq %rcx,%r11                  /* RCX == RIP */
> >         jne opportunistic_sysret_failed
> >
> >         /*
> >          * On Intel CPUs, sysret with non-canonical RCX/RIP will #GP
> >          * in kernel space.  This essentially lets the user take over
> > -        * the kernel, since userspace controls RSP.  It's not worth
> > -        * testing for canonicalness exactly -- this check detects any
> > -        * of the 17 high bits set, which is true for non-canonical
> > -        * or kernel addresses.  (This will pessimize vsyscall=native.
> > -        * Big deal.)
> > +        * the kernel, since userspace controls RSP.
> >          *
> > -        * If virtual addresses ever become wider, this will need
> > +        * If width of "canonical tail" ever become variable, this will need
> >          * to be updated to remain correct on both old and new CPUs.
> >          */
> >         .ifne __VIRTUAL_MASK_SHIFT - 47
> >         .error "virtual address width changed -- sysret checks need update"
> >         .endif
> > -       shr $__VIRTUAL_MASK_SHIFT, %rcx
> > -       jnz opportunistic_sysret_failed
> > +       /* Change top 16 bits to be a sign-extension of the rest */
> > +       shl     $(64 - (__VIRTUAL_MASK_SHIFT+1)), %rcx
> > +       sar     $(64 - (__VIRTUAL_MASK_SHIFT+1)), %rcx
> > +       /* If this changed %rcx, it was not canonical */
> > +       cmpq    %rcx, %r11
> > +       jne     opportunistic_sysret_failed
> >
> >         cmpq $__USER_CS,CS(%rsp)        /* CS must match SYSRET */
> >         jne opportunistic_sysret_failed
> 
> Would it be possible to to skip this check entirely on AMD 
> processors? It's my understanding that AMD correctly issues the #GP 
> from CPL3, causing a stack switch.

This needs a testcase I suspect.

> Looking at the AMD docs, sysret doesn't even check for a canonical 
> address.  The #GP is probably from the instruction fetch at the 
> non-canonical address instead of from sysret itself.

I suspect it's similar to what would happen if we tried a RET to a 
non-canonical address: the fetch fails and the JMP gets the #GP?

In that sense it's the fault of the return instruction.

Thanks,

	Ingo


* Re: [PATCH] x86/asm/entry/64: better check for canonical address
  2015-03-27 11:28       ` Brian Gerst
@ 2015-03-27 11:34         ` Ingo Molnar
  2015-03-27 12:14           ` Denys Vlasenko
  0 siblings, 1 reply; 32+ messages in thread
From: Ingo Molnar @ 2015-03-27 11:34 UTC (permalink / raw)
  To: Brian Gerst
  Cc: Denys Vlasenko, Andy Lutomirski, Borislav Petkov,
	the arch/x86 maintainers, Linux Kernel Mailing List


* Brian Gerst <brgerst@gmail.com> wrote:

> > Btw., there's a neat trick we could do: in the HLT, MWAIT and 
> > ACPI-idle code we could attempt to set up RCX to match RIP, to 
> > trigger this optimization in the common 'irq interrupted the idle 
> > task' case?
> 
> sysret only returns to CPL3.

Indeed, an IRET ought to be pretty cheap for same-ring interrupt 
returns in any case.

Thanks,

	Ingo


* Re: [PATCH] x86/asm/entry/64: better check for canonical address
  2015-03-27 11:34         ` Ingo Molnar
@ 2015-03-27 12:14           ` Denys Vlasenko
  2015-03-27 12:16             ` Ingo Molnar
  0 siblings, 1 reply; 32+ messages in thread
From: Denys Vlasenko @ 2015-03-27 12:14 UTC (permalink / raw)
  To: Ingo Molnar, Brian Gerst
  Cc: Andy Lutomirski, Borislav Petkov, the arch/x86 maintainers,
	Linux Kernel Mailing List

[-- Attachment #1: Type: text/plain, Size: 1612 bytes --]

On 03/27/2015 12:34 PM, Ingo Molnar wrote:
> 
> * Brian Gerst <brgerst@gmail.com> wrote:
> 
>>> Btw., there's a neat trick we could do: in the HLT, MWAIT and 
>>> ACPI-idle code we could attempt to set up RCX to match RIP, to 
>>> trigger this optimization in the common 'irq interrupted the idle 
>>> task' case?
>>
>> sysret only returns to CPL3.
> 
> Indeed, an IRET ought to be pretty cheap for same-ring interrupt 
> returns in any case.

Unfortunately, it is not. Try the attached program.

On this CPU, 1 ns ~= 3 cycles.

$ ./timing_test64 callret
10000 loops in 0.00008s = 7.87 nsec/loop for callret
100000 loops in 0.00076s = 7.56 nsec/loop for callret
1000000 loops in 0.00548s = 5.48 nsec/loop for callret
10000000 loops in 0.02882s = 2.88 nsec/loop for callret
100000000 loops in 0.18334s = 1.83 nsec/loop for callret
200000000 loops in 0.36051s = 1.80 nsec/loop for callret
400000000 loops in 0.71632s = 1.79 nsec/loop for callret

Near call + near ret = 5 cycles

$ ./timing_test64 lret
10000 loops in 0.00034s = 33.95 nsec/loop for lret
100000 loops in 0.00328s = 32.83 nsec/loop for lret
1000000 loops in 0.04541s = 45.41 nsec/loop for lret
10000000 loops in 0.32130s = 32.13 nsec/loop for lret
20000000 loops in 0.64191s = 32.10 nsec/loop for lret

push my_cs + push next_label + far ret = ~90 cycles

$ ./timing_test64 iret
10000 loops in 0.00344s = 343.90 nsec/loop for iret
100000 loops in 0.01890s = 188.97 nsec/loop for iret
1000000 loops in 0.08228s = 82.28 nsec/loop for iret
10000000 loops in 0.77910s = 77.91 nsec/loop for iret

This is the "same-ring interrupt return". ~230 cycles!  :(



[-- Attachment #2: timing_test.c --]
[-- Type: text/x-csrc, Size: 6919 bytes --]

// To be unaffected by random cacheline placement, use generous "align":
//
// i686-gcc -O2 -Wall -falign-loops=32 -falign-jumps=32 -falign-labels=32 -static
// x86_64-gcc -O2 -Wall -falign-loops=32 -falign-jumps=32 -falign-labels=32 -static

#include <inttypes.h>
#include <stdlib.h>
#include <unistd.h>
#include <string.h>
#include <time.h>
#include <sys/time.h>
#include <sys/syscall.h>
#include <stdio.h>

#if !defined(__i386__)
#define get_sysenter_addr() 0
#else
#include <elf.h>
long sysenter_addr;
long get_sysenter_addr(char **envp)
{
	Elf32_auxv_t *auxv;
	while (*envp++ != NULL)
		continue;
	for (auxv = (void *)envp; auxv->a_type != AT_NULL; auxv++)
		if( auxv->a_type == AT_SYSINFO)
			return (sysenter_addr = auxv->a_un.a_val);
	fprintf(stderr, "AT_SYSINFO not supplied, can't test\n");
	exit(0); /* this is not a failure */
}

void sysenter_getpid(void)
{
	asm volatile(
	"\n"   "	mov	$20,%eax" // GETPID
	"\n"   "	call	*sysenter_addr"
	);
}
#endif

#if defined(__i386__)
#define L_or_Q "l"
#define E_or_R "e"
#else
#define L_or_Q "q"
#define E_or_R "r"
#endif

asm (
"\n"   "	.text"
"\n"   "ret__:	ret"
);

int main(int argc, char **argv, char **envp)
{
	struct timespec start, end;
	unsigned long long duration;
	size_t loops, i;
	const char *mode;

	if (argc < 2) {
		printf("Usage: timing_test [MILLIONS_OF_ITERATIONS] MODE\n");
		return 1;
	}
	mode = argv[2];
	if (!mode) {
		mode = argv[1];
		loops = 10*1000;
	} else {
		loops = (size_t)atol(argv[1]) * 1000000;
	}

 again:
	if (!strcmp(mode, "nothing")) {
		clock_gettime(CLOCK_MONOTONIC, &start);
	        for (i = loops; i != 0; i--) {
			asm volatile ("# nothing");
		}
	} else if (!strcmp(mode, "nop")) {
		clock_gettime(CLOCK_MONOTONIC, &start);
	        for (i = loops; i != 0; i--) {
			asm volatile ("nop");
		}
	} else if (!strcmp(mode, "rdtsc")) {
		clock_gettime(CLOCK_MONOTONIC, &start);
	        for (i = loops; i != 0; i--) {
			unsigned int a, d;
			asm volatile ("rdtsc" : "=a" (a), "=d" (d));
		}
	} else if (!strcmp(mode, "lfence_rdtsc")) {
		clock_gettime(CLOCK_MONOTONIC, &start);
	        for (i = loops; i != 0; i--) {
			unsigned int a, d;
			asm volatile ("lfence;rdtsc" : "=a" (a), "=d" (d));
		}
	} else if (!strcmp(mode, "lfence_rdtsc_lfence")) {
		clock_gettime(CLOCK_MONOTONIC, &start);
	        for (i = loops; i != 0; i--) {
			unsigned int a, d;
			asm volatile ("");
			asm volatile ("lfence;rdtsc;lfence" : "=a" (a), "=d" (d));
		}
	} else if (!strcmp(mode, "mfence_rdtsc_mfence")) {
		clock_gettime(CLOCK_MONOTONIC, &start);
	        for (i = loops; i != 0; i--) {
			unsigned int a, d;
			asm volatile ("mfence;rdtsc;mfence" : "=a" (a), "=d" (d));
		}
	} else if (!strcmp(mode, "rdtscp")) {
		clock_gettime(CLOCK_MONOTONIC, &start);
	        for (i = loops; i != 0; i--) {
			unsigned int a, c, d;
			asm volatile ("rdtscp" : "=a" (a), "=c" (c), "=d" (d));
		}
	} else if (!strcmp(mode, "gettimeofday")) {
		struct timeval tv;
		clock_gettime(CLOCK_MONOTONIC, &start);
	        for (i = loops; i != 0; i--)
			gettimeofday(&tv, 0);
	} else if (!strcmp(mode, "getpid")) {
		clock_gettime(CLOCK_MONOTONIC, &start);
	        for (i = loops; i != 0; i--)
			syscall(SYS_getpid);
#if defined(__i386__)
	} else if (!strcmp(mode, "sysenter_getpid")) {
		get_sysenter_addr(envp);
		clock_gettime(CLOCK_MONOTONIC, &start);
	        for (i = loops; i != 0; i--)
			sysenter_getpid();
	} else if (!strcmp(mode, "iret")) {
		/* "push cs" is itself a bit expensive, moving it out of loop */
		long saved_cs;
		asm volatile ("mov %%cs,%0" : "=r" (saved_cs));
		clock_gettime(CLOCK_MONOTONIC, &start);
	        for (i = loops; i != 0; i--) {
			asm volatile (
			"\n"   "	push	$0"	// flags
			"\n"   "	push	%0"	// cs
			"\n"   "	push	$1f"	// ip
			"\n"   "	iret"
			"\n"   "1:"
			:
			: "r" (saved_cs)
			);
		}
#endif
#if defined(__x86_64__)
	} else if (!strcmp(mode, "iret")) {
		long saved_cs;
		long saved_ss;
		asm volatile ("mov %%cs,%0" : "=r" (saved_cs));
		asm volatile ("mov %%ss,%0" : "=r" (saved_ss));
		clock_gettime(CLOCK_MONOTONIC, &start);
	        for (i = loops; i != 0; i--) {
			asm volatile (
			"\n"   "	mov	%%rsp,%%rax"
			"\n"   "	push	%0"	// ss
			"\n"   "	push	%%rax"	// sp
			"\n"   "	push	$0"	// flags
			"\n"   "	push	%1"	// cs
			"\n"   "	push	$1f"	// ip
			"\n"   "	iretq"
			"\n"   "1:"
			:
			: "r" (saved_ss), "r" (saved_cs)
			: "ax"
			);
		}
#endif
	} else if (!strcmp(mode, "lret")) {
		/* "push cs" is itself a bit expensive, moving it out of loop */
		long saved_cs;
		asm volatile ("mov %%cs,%0" : "=r" (saved_cs));
		clock_gettime(CLOCK_MONOTONIC, &start);
	        for (i = loops; i != 0; i--) {
			asm volatile (
			"\n"   "	push	%0"
			"\n"   "	push	$1f"
			"\n"   "	lret"L_or_Q
			"\n"   "1:"
			:
			: "r" (saved_cs)
			);
		}
	} else if (!strcmp(mode, "callret")) {
		clock_gettime(CLOCK_MONOTONIC, &start);
	        for (i = loops; i != 0; i--) {
			asm volatile ("call ret__");
		}
	} else if (!strcmp(mode, "ret")) {
		/* This is useful to measure delays due to
		 * return stack branch prediction not working
		 * (we aren't using paired call/rets here, as CPU expects).
		 * I observed "callret" test above being 4 times faster than this:
		 */
		clock_gettime(CLOCK_MONOTONIC, &start);
	        for (i = loops; i != 0; i--) {
			asm volatile (
			"\n"   "	push	$1f"
			"\n"   "	ret"
			"\n"   "1:"
			);
		}
	} else if (!strcmp(mode, "loadss")) {
		long saved_ss;
		asm volatile ("mov %%ss,%0" : "=r" (saved_ss));
		clock_gettime(CLOCK_MONOTONIC, &start);
	        for (i = loops; i != 0; i--) {
			asm volatile ("mov %0,%%ss" : : "r" (saved_ss));
		}
	} else if (!strcmp(mode, "pushf")) {
		clock_gettime(CLOCK_MONOTONIC, &start);
	        for (i = loops; i != 0; i--) {
			asm volatile (
			"\n"   "	pushf"
			"\n"   "	pop	%%"E_or_R"ax"
			:
			:
			: "ax"
			);
		}
	} else if (!strcmp(mode, "popf")) {
		long flags;
		asm volatile (
		"\n"   "	pushf"
		"\n"   "	pop	%0"
		: "=r" (flags)
		);
		clock_gettime(CLOCK_MONOTONIC, &start);
	        for (i = loops; i != 0; i--) {
			asm volatile (
			"\n"   "	push	%0"
			"\n"   "	popf"
			:
			: "r" (flags)
			: "ax"
			);
		}
	} else if (!strcmp(mode, "rdpmc")) {
		// Unlikely to work.
		unsigned int eax, edx;
		unsigned int ecx = 0;
		clock_gettime(CLOCK_MONOTONIC, &start);
	        for (i = loops; i != 0; i--)
			asm volatile ("rdpmc" : "=a" (eax), "=d" (edx) : "c" (ecx));
	} else {
		printf("Unknown mode %s\n", mode);
		return 1;
	}

	clock_gettime(CLOCK_MONOTONIC, &end);
	duration = (1000*1000*1000ULL * end.tv_sec + end.tv_nsec)
	         - (1000*1000*1000ULL * start.tv_sec + start.tv_nsec);
	printf("%lu loops in %.5fs = %.2f nsec/loop for %s\n",
		(unsigned long)loops, (double)duration * 1e-9,
		(double)duration / loops,
		mode
	);
	if (!argv[2]) {
		if (duration < 90*1000*1000) {
			loops *= 10;
			goto again;
		}
		if (duration < 490*1000*1000) {
			loops *= 2;
			goto again;
		}
	}
	return 0;
}


* Re: [PATCH] x86/asm/entry/64: better check for canonical address
  2015-03-27 12:14           ` Denys Vlasenko
@ 2015-03-27 12:16             ` Ingo Molnar
  2015-03-27 12:31               ` Denys Vlasenko
  0 siblings, 1 reply; 32+ messages in thread
From: Ingo Molnar @ 2015-03-27 12:16 UTC (permalink / raw)
  To: Denys Vlasenko
  Cc: Brian Gerst, Andy Lutomirski, Borislav Petkov,
	the arch/x86 maintainers, Linux Kernel Mailing List,
	Linus Torvalds


* Denys Vlasenko <dvlasenk@redhat.com> wrote:

> > Indeed, an IRET ought to be pretty cheap for same-ring interrupt 
> > returns in any case.
> 
> Unfortunately, it is not. Try attached program.
> 
> On this CPU, 1 ns ~= 3 cycles.
> 
> $ ./timing_test64 callret
> 10000 loops in 0.00008s = 7.87 nsec/loop for callret
> 100000 loops in 0.00076s = 7.56 nsec/loop for callret
> 1000000 loops in 0.00548s = 5.48 nsec/loop for callret
> 10000000 loops in 0.02882s = 2.88 nsec/loop for callret
> 100000000 loops in 0.18334s = 1.83 nsec/loop for callret
> 200000000 loops in 0.36051s = 1.80 nsec/loop for callret
> 400000000 loops in 0.71632s = 1.79 nsec/loop for callret
> 
> Near call + near ret = 5 cycles
> 
> $ ./timing_test64 lret
> 10000 loops in 0.00034s = 33.95 nsec/loop for lret
> 100000 loops in 0.00328s = 32.83 nsec/loop for lret
> 1000000 loops in 0.04541s = 45.41 nsec/loop for lret
> 10000000 loops in 0.32130s = 32.13 nsec/loop for lret
> 20000000 loops in 0.64191s = 32.10 nsec/loop for lret
> 
> push my_cs + push next_label + far ret = ~90 cycles
> 
> $ ./timing_test64 iret
> 10000 loops in 0.00344s = 343.90 nsec/loop for iret
> 100000 loops in 0.01890s = 188.97 nsec/loop for iret
> 1000000 loops in 0.08228s = 82.28 nsec/loop for iret
> 10000000 loops in 0.77910s = 77.91 nsec/loop for iret
> 
> This is the "same-ring interrupt return". ~230 cycles!  :(

Ugh, that's really expensive! Why is that so? Same-ring irqs are 
supposedly a lot simpler.

Thanks,

	Ingo



* Re: [PATCH] x86/asm/entry/64: better check for canonical address
  2015-03-27 12:16             ` Ingo Molnar
@ 2015-03-27 12:31               ` Denys Vlasenko
  2015-03-28  9:11                 ` Ingo Molnar
  0 siblings, 1 reply; 32+ messages in thread
From: Denys Vlasenko @ 2015-03-27 12:31 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Brian Gerst, Andy Lutomirski, Borislav Petkov,
	the arch/x86 maintainers, Linux Kernel Mailing List,
	Linus Torvalds

On 03/27/2015 01:16 PM, Ingo Molnar wrote:
>>> Indeed, an IRET ought to be pretty cheap for same-ring interrupt 
>>> returns in any case.
>>
>> Unfortunately, it is not. Try attached program.
>>
>> On this CPU, 1 ns ~= 3 cycles.
>>
>> $ ./timing_test64 callret
>> 10000 loops in 0.00008s = 7.87 nsec/loop for callret
>> 100000 loops in 0.00076s = 7.56 nsec/loop for callret
>> 1000000 loops in 0.00548s = 5.48 nsec/loop for callret
>> 10000000 loops in 0.02882s = 2.88 nsec/loop for callret
>> 100000000 loops in 0.18334s = 1.83 nsec/loop for callret
>> 200000000 loops in 0.36051s = 1.80 nsec/loop for callret
>> 400000000 loops in 0.71632s = 1.79 nsec/loop for callret
>>
>> Near call + near ret = 5 cycles
>>
>> $ ./timing_test64 lret
>> 10000 loops in 0.00034s = 33.95 nsec/loop for lret
>> 100000 loops in 0.00328s = 32.83 nsec/loop for lret
>> 1000000 loops in 0.04541s = 45.41 nsec/loop for lret
>> 10000000 loops in 0.32130s = 32.13 nsec/loop for lret
>> 20000000 loops in 0.64191s = 32.10 nsec/loop for lret
>>
>> push my_cs + push next_label + far ret = ~90 cycles
>>
>> $ ./timing_test64 iret
>> 10000 loops in 0.00344s = 343.90 nsec/loop for iret
>> 100000 loops in 0.01890s = 188.97 nsec/loop for iret
>> 1000000 loops in 0.08228s = 82.28 nsec/loop for iret
>> 10000000 loops in 0.77910s = 77.91 nsec/loop for iret
>>
>> This is the "same-ring interrupt return". ~230 cycles!  :(
> 
> Ugh, that's really expensive! Why is that so? Same-ring irqs are 
> supposedly a lot simpler.

Descriptor checks for restored CS and SS,
checking canonical-ness of RIP,
supporting "return to TSS" (flags.NT bit),
"return to VM86" (flags.VM bit),
complex logic around restoring RFLAGS
  ("don't allow CPL3 to be able to disable interrupts...
  ...unless their flags.IOPL is 3." Gasp)
return to 16-bit code ("do not touch high 16 bits")

All of this is a giant PITA to encode in microcode.



* Re: [PATCH] x86/asm/entry/64: better check for canonical address
  2015-03-27 11:31   ` Ingo Molnar
@ 2015-03-27 21:37     ` Andy Lutomirski
  0 siblings, 0 replies; 32+ messages in thread
From: Andy Lutomirski @ 2015-03-27 21:37 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Brian Gerst, Denys Vlasenko, Borislav Petkov,
	the arch/x86 maintainers, Linux Kernel Mailing List

On Fri, Mar 27, 2015 at 4:31 AM, Ingo Molnar <mingo@kernel.org> wrote:
>
> * Brian Gerst <brgerst@gmail.com> wrote:
>
>> On Thu, Mar 26, 2015 at 8:42 AM, Denys Vlasenko <dvlasenk@redhat.com> wrote:
>> > This change makes the check exact (no more false positives
>> > on kernel addresses).
>> >
>> > It isn't really important to be fully correct here -
>> > almost all addresses we'll ever see will be userspace ones,
>> > but OTOH it looks to be cheap enough:
>> > the new code uses two more ALU ops but preserves %rcx,
>> > allowing to not reload it from pt_regs->cx again.
>> > On disassembly level, the changes are:
>> >
>> > cmp %rcx,0x80(%rsp) -> mov 0x80(%rsp),%r11; cmp %rcx,%r11
>> > shr $0x2f,%rcx      -> shl $0x10,%rcx; sar $0x10,%rcx; cmp %rcx,%r11
>> > mov 0x58(%rsp),%rcx -> (eliminated)
>> >
>> > Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
>> > CC: Borislav Petkov <bp@alien8.de>
>> > CC: x86@kernel.org
>> > CC: linux-kernel@vger.kernel.org
>> > ---
>> >
>> > Andy, I'd undecided myself on the merits of doing this.
>> > If you like it, feel free to take it in your tree.
>> > I trimmed CC list to not bother too many people with this trivial
>> > and quite possibly "useless churn"-class change.
>> >
>> >  arch/x86/kernel/entry_64.S | 23 ++++++++++++-----------
>> >  1 file changed, 12 insertions(+), 11 deletions(-)
>> >
>> > diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
>> > index bf9afad..a36d04d 100644
>> > --- a/arch/x86/kernel/entry_64.S
>> > +++ b/arch/x86/kernel/entry_64.S
>> > @@ -688,26 +688,27 @@ retint_swapgs:            /* return to user-space */
>> >          * a completely clean 64-bit userspace context.
>> >          */
>> >         movq RCX(%rsp),%rcx
>> > -       cmpq %rcx,RIP(%rsp)             /* RCX == RIP */
>> > +       movq RIP(%rsp),%r11
>> > +       cmpq %rcx,%r11                  /* RCX == RIP */
>> >         jne opportunistic_sysret_failed
>> >
>> >         /*
>> >          * On Intel CPUs, sysret with non-canonical RCX/RIP will #GP
>> >          * in kernel space.  This essentially lets the user take over
>> > -        * the kernel, since userspace controls RSP.  It's not worth
>> > -        * testing for canonicalness exactly -- this check detects any
>> > -        * of the 17 high bits set, which is true for non-canonical
>> > -        * or kernel addresses.  (This will pessimize vsyscall=native.
>> > -        * Big deal.)
>> > +        * the kernel, since userspace controls RSP.
>> >          *
>> > -        * If virtual addresses ever become wider, this will need
>> > +        * If width of "canonical tail" ever become variable, this will need
>> >          * to be updated to remain correct on both old and new CPUs.
>> >          */
>> >         .ifne __VIRTUAL_MASK_SHIFT - 47
>> >         .error "virtual address width changed -- sysret checks need update"
>> >         .endif
>> > -       shr $__VIRTUAL_MASK_SHIFT, %rcx
>> > -       jnz opportunistic_sysret_failed
>> > +       /* Change top 16 bits to be a sign-extension of the rest */
>> > +       shl     $(64 - (__VIRTUAL_MASK_SHIFT+1)), %rcx
>> > +       sar     $(64 - (__VIRTUAL_MASK_SHIFT+1)), %rcx
>> > +       /* If this changed %rcx, it was not canonical */
>> > +       cmpq    %rcx, %r11
>> > +       jne     opportunistic_sysret_failed
>> >
>> >         cmpq $__USER_CS,CS(%rsp)        /* CS must match SYSRET */
>> >         jne opportunistic_sysret_failed
>>
>> Would it be possible to to skip this check entirely on AMD
>> processors? It's my understanding that AMD correctly issues the #GP
>> from CPL3, causing a stack switch.
>
> This needs a testcase I suspect.

IMO one decent way to write the test case would be to extend the
sigreturn test I just submitted.  For each n, do raise(SIGUSR1), then
change RCX and RIP to 2^n.  Return and catch the SIGSEGV, then restore
the original RIP.  Repeat with 2^n replaced with 2^n-1 and ~(2^n-1).

The only real trick is that we need to make sure that there's no
actual executable code at any of these addresses.
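
Not the actual sigreturn selftest, just a minimal user-space sketch of that
idea (x86-64 + glibc assumed; the probe list, the signal round-trip and the
"nothing executable mapped there" problem are all handled very naively):

#define _GNU_SOURCE
#include <signal.h>
#include <stdint.h>
#include <stdio.h>
#include <ucontext.h>

static volatile greg_t saved_rip, saved_rcx;
static volatile uint64_t probe;

static void usr1_handler(int sig, siginfo_t *si, void *ctx_void)
{
	ucontext_t *ctx = ctx_void;

	/* Remember where we really were, then "return" to the probe address */
	saved_rip = ctx->uc_mcontext.gregs[REG_RIP];
	saved_rcx = ctx->uc_mcontext.gregs[REG_RCX];
	ctx->uc_mcontext.gregs[REG_RIP] = (greg_t)probe;
	ctx->uc_mcontext.gregs[REG_RCX] = (greg_t)probe;
}

static void fault_handler(int sig, siginfo_t *si, void *ctx_void)
{
	ucontext_t *ctx = ctx_void;

	/* Undo the damage so we resume right after raise() */
	ctx->uc_mcontext.gregs[REG_RIP] = saved_rip;
	ctx->uc_mcontext.gregs[REG_RCX] = saved_rcx;
}

int main(void)
{
	struct sigaction sa = { .sa_flags = SA_SIGINFO };
	int n, i;

	sa.sa_sigaction = usr1_handler;
	sigaction(SIGUSR1, &sa, NULL);
	sa.sa_sigaction = fault_handler;
	sigaction(SIGSEGV, &sa, NULL);
	sigaction(SIGBUS, &sa, NULL);
	sigaction(SIGILL, &sa, NULL);

	for (n = 0; n < 64; n++) {
		uint64_t base = 1ULL << n;
		uint64_t probes[3] = { base, base - 1, ~(base - 1) };

		for (i = 0; i < 3; i++) {
			probe = probes[i];
			/*
			 * If the probe happens to land on mapped, executable
			 * memory (e.g. 0x400000 in a non-PIE binary), all bets
			 * are off - that is exactly the "no executable code
			 * there" problem mentioned above.
			 */
			raise(SIGUSR1);	/* should come back via a fault */
			printf("survived probe %#018llx\n",
			       (unsigned long long)probe);
		}
	}
	return 0;
}

A real test would also want to record which signal arrives and what state it
reports, since that is where the Intel vs. AMD sysret difference discussed
in this thread would show up.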

--Andy

>
>> Looking at the AMD docs, sysret doesn't even check for a canonical
>> address.  The #GP is probably from the instruction fetch at the
>> non-canonical address instead of from sysret itself.
>
> I suspect it's similar to what would happen if we tried a RET to a
> non-canonical address: the fetch fails and the JMP gets the #GP?
>
> In that sense it's the fault of the return instruction.
>
> Thanks,
>
>         Ingo



-- 
Andy Lutomirski
AMA Capital Management, LLC


* Re: [PATCH] x86/asm/entry/64: better check for canonical address
  2015-03-27 12:31               ` Denys Vlasenko
@ 2015-03-28  9:11                 ` Ingo Molnar
  2015-03-29 19:36                   ` Denys Vlasenko
  0 siblings, 1 reply; 32+ messages in thread
From: Ingo Molnar @ 2015-03-28  9:11 UTC (permalink / raw)
  To: Denys Vlasenko
  Cc: Brian Gerst, Andy Lutomirski, Borislav Petkov,
	the arch/x86 maintainers, Linux Kernel Mailing List,
	Linus Torvalds


* Denys Vlasenko <dvlasenk@redhat.com> wrote:

> On 03/27/2015 01:16 PM, Ingo Molnar wrote:
> >>> Indeed, an IRET ought to be pretty cheap for same-ring interrupt 
> >>> returns in any case.
> >>
> >> Unfortunately, it is not. Try attached program.
> >>
> >> On this CPU, 1 ns ~= 3 cycles.
> >>
> >> $ ./timing_test64 callret
> >> 10000 loops in 0.00008s = 7.87 nsec/loop for callret
> >> 100000 loops in 0.00076s = 7.56 nsec/loop for callret
> >> 1000000 loops in 0.00548s = 5.48 nsec/loop for callret
> >> 10000000 loops in 0.02882s = 2.88 nsec/loop for callret
> >> 100000000 loops in 0.18334s = 1.83 nsec/loop for callret
> >> 200000000 loops in 0.36051s = 1.80 nsec/loop for callret
> >> 400000000 loops in 0.71632s = 1.79 nsec/loop for callret
> >>
> >> Near call + near ret = 5 cycles
> >>
> >> $ ./timing_test64 lret
> >> 10000 loops in 0.00034s = 33.95 nsec/loop for lret
> >> 100000 loops in 0.00328s = 32.83 nsec/loop for lret
> >> 1000000 loops in 0.04541s = 45.41 nsec/loop for lret
> >> 10000000 loops in 0.32130s = 32.13 nsec/loop for lret
> >> 20000000 loops in 0.64191s = 32.10 nsec/loop for lret
> >>
> >> push my_cs + push next_label + far ret = ~90 cycles
> >>
> >> $ ./timing_test64 iret
> >> 10000 loops in 0.00344s = 343.90 nsec/loop for iret
> >> 100000 loops in 0.01890s = 188.97 nsec/loop for iret
> >> 1000000 loops in 0.08228s = 82.28 nsec/loop for iret
> >> 10000000 loops in 0.77910s = 77.91 nsec/loop for iret
> >>
> >> This is the "same-ring interrupt return". ~230 cycles!  :(
> > 
> > Ugh, that's really expensive! Why is that so? Same-ring irqs are 
> > supposedly a lot simpler.
> 
> Descriptor checks for restored CS and SS,
> checking canonical-ness of RIP,
> supporting "return to TSS" (flags.NT bit),
> "return to VM86" (flags.VM bit),
> complex logic around restoring RFLAGS
>   ("don't allow CPL3 to be able to disable interrupts...
>   ...unless their flags.IOPL is 3." Gasp)
> return to 16-bit code ("do not touch high 16 bits")
> 
> All of this is a giant PITA to encode in microcode.

I guess they could optimize it by adding a single "I am a modern OS 
executing regular userspace" flag to the descriptor [or expressing the 
same as a separate instruction], to avoid all that legacy crap that 
won't trigger on like 99.999999% of systems ...

Thanks,

	Ingo


* Re: [PATCH] x86/asm/entry/64: better check for canonical address
  2015-03-28  9:11                 ` Ingo Molnar
@ 2015-03-29 19:36                   ` Denys Vlasenko
  2015-03-29 21:12                     ` Andy Lutomirski
  2015-03-31 16:43                     ` Ingo Molnar
  0 siblings, 2 replies; 32+ messages in thread
From: Denys Vlasenko @ 2015-03-29 19:36 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Denys Vlasenko, Brian Gerst, Andy Lutomirski, Borislav Petkov,
	the arch/x86 maintainers, Linux Kernel Mailing List,
	Linus Torvalds

On Sat, Mar 28, 2015 at 10:11 AM, Ingo Molnar <mingo@kernel.org> wrote:
>> >> $ ./timing_test64 iret
>> >> 10000 loops in 0.00344s = 343.90 nsec/loop for iret
>> >> 100000 loops in 0.01890s = 188.97 nsec/loop for iret
>> >> 1000000 loops in 0.08228s = 82.28 nsec/loop for iret
>> >> 10000000 loops in 0.77910s = 77.91 nsec/loop for iret
>> >>
>> >> This is the "same-ring interrupt return". ~230 cycles!  :(
>> >
>> > Ugh, that's really expensive! Why is that so? Same-ring irqs are
>> > supposedly a lot simpler.
>>
>> Descriptor checks for restored CS and SS,
>> checking canonical-ness of RIP,
>> supporting "return to TSS" (flags.NT bit),
>> "return to VM86" (flags.VM bit),
>> complex logic around restoring RFLAGS
>>   ("don't allow CPL3 to be able to disable interrupts...
>>   ...unless their flags.IOPL is 3." Gasp)
>> return to 16-bit code ("do not touch high 16 bits")
>>
>> All of this is a giant PITA to encode in microcode.
>
> I guess they could optimize it by adding a single "I am a modern OS
> executing regular userspace" flag to the descriptor [or expressing the
> same as a separate instruction], to avoid all that legacy crap that
> won't trigger on like 99.999999% of systems ...

Yes, that would be a useful addition. Interrupt servicing on x86
takes a non-negligible hit because of IRET slowness.

Specifically, a CPL0-only IRET_FAST insn which uses the same stack layout
as IRET, but makes the following assumptions:

* The restored SS and CS are 0-based, 4G-limit segments.
   (as usual, in 64-bit mode limits are ignored).
* CS is read/execute, SS is read/write.
* The CPL to return to is equal to (CS & 3).

This would mean that IRET_FAST would not need to read descriptors
from the GDT/LDT. It only needs to read values from the stack.

It would be capable of returning both to CPL0 and CPL3 - IOW,
usable for returning from interrupts both to userspace and kernelspace.

* FLAGS.NT is ignored (as if it is 0). IOW, no task returns.
* pt_regs->FLAGS.VM is not restored, but set to 0.
   IOW, no vm86.
* Extend this to other flags as well, if it makes the return faster.
   We can have a separate code path which restores AC, DF, IF, TF, RF and IOPL
   in the unlikely event they are "unusual". So it's okay
   if IRET_FAST just sets them to 0 (1 for IF).

The instruction would need a way to indicate whether the returned-to code
is 64-bit or 32-bit.
It could probably use the same approach SYSRET{O,L} uses:
with REX.W, the return is to 64-bit code; without it, to 32-bit code.

Interrupt return then can check pt_regs->cs and use
IRETL_FAST if it is USER32_CS; use IRETQ_FAST if it is USER_CS
or KERNEL_CS; otherwise, fall back to slow but "universal" IRETQ.
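
For illustration only -- IRETQ_FAST/IRETL_FAST are made-up mnemonics and
the labels are placeholders, nothing that exists in entry_64.S -- the
dispatch could look roughly like:

	/* Hypothetical: pick the return insn based on the saved CS */
	cmpq	$__USER32_CS, CS(%rsp)	/* returning to 32-bit userspace? */
	je	return_via_iretl_fast
	cmpq	$__USER_CS, CS(%rsp)	/* returning to 64-bit userspace? */
	je	return_via_iretq_fast
	cmpq	$__KERNEL_CS, CS(%rsp)	/* returning to the kernel? */
	je	return_via_iretq_fast
	/* anything unusual: fall back to the slow but universal IRETQ */
	jmp	return_via_iretq_slow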

Do we have contacts at Intel to petition for this? :D

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] x86/asm/entry/64: better check for canonical address
  2015-03-29 19:36                   ` Denys Vlasenko
@ 2015-03-29 21:12                     ` Andy Lutomirski
  2015-03-29 21:46                       ` Denys Vlasenko
  2015-03-31 16:43                     ` Ingo Molnar
  1 sibling, 1 reply; 32+ messages in thread
From: Andy Lutomirski @ 2015-03-29 21:12 UTC (permalink / raw)
  To: Denys Vlasenko
  Cc: Ingo Molnar, Denys Vlasenko, Brian Gerst, Borislav Petkov,
	the arch/x86 maintainers, Linux Kernel Mailing List,
	Linus Torvalds

On Sun, Mar 29, 2015 at 12:36 PM, Denys Vlasenko
<vda.linux@googlemail.com> wrote:
> On Sat, Mar 28, 2015 at 10:11 AM, Ingo Molnar <mingo@kernel.org> wrote:
>>> >> $ ./timing_test64 iret
>>> >> 10000 loops in 0.00344s = 343.90 nsec/loop for iret
>>> >> 100000 loops in 0.01890s = 188.97 nsec/loop for iret
>>> >> 1000000 loops in 0.08228s = 82.28 nsec/loop for iret
>>> >> 10000000 loops in 0.77910s = 77.91 nsec/loop for iret
>>> >>
>>> >> This is the "same-ring interrupt return". ~230 cycles!  :(
>>> >
>>> > Ugh, that's really expensive! Why is that so? Same-ring irqs are
>>> > supposedly a lot simpler.
>>>
>>> Descriptor checks for restored CS and SS,
>>> checking canonical-ness of RIP,
>>> supporting "return to TSS" (flags.NT bit),
>>> "return to VM86" (flags.VM bit),
>>> complex logic around restoring RFLAGS
>>>   ("don't allow CPL3 to be able to disable interrupts...
>>>   ...unless their flags.IOPL is 3." Gasp)
>>> return to 16-bit code ("do not touch high 16 bits")
>>>
>>> All of this is a giant PITA to encode in microcode.
>>
>> I guess they could optimize it by adding a single "I am a modern OS
>> executing regular userspace" flag to the descriptor [or expressing the
>> same as a separate instruction], to avoid all that legacy crap that
>> won't trigger on like 99.999999% of systems ...
>
> Yes, that would be a useful addition. Interrupt servicing on x86
> takes a non-negligible hit because of IRET slowness.
>
> Specifically, a CPL0-only IRET_FAST insn which uses the same stack layout
> as IRET, but makes the following assumptions:
>
> * The restored SS and CS are 0-based, 4G-limit segments.
>    (as usual, in 64-bit mode limits are ignored).
> * CS is read/execute, SS is read/write.
> * The CPL to return to is equal to (CS & 3).
>
> This would mean that IRET_FAST would not need to read descriptors
> from GDT/LDT. It only needs to read values from stack.
>
> It would be capable of returning both to CPL0 and CPL3 - iow,
> usable for returning from interrupts both to userpace and kernelspace.
>
> * FLAGS.NT is ignored (as if it is 0). IOW, no task returns.
> * pt_regs->FLAGS.VM is not restored, but set to 0.
>    IOW, no vm86.
> * Extend this to other flags as well, if it makes return faster.
>    We can have a separate code which restores AC,DF,IF,TF,RF,IOPL
>    in the unlikely event they are "unusual". So it's okay
>    if IRET_FAST just sets them to 0 (1 for IF).
>
> The instruction would need a differentiator whether returned-to code
> is 64-bit or 32-bit.
> Then it probably can use the same approach SYSRET{O,L} uses:
> with REX.W, return is to 64-bit; without it, return is to 32-bit.
>
> Interrupt return then can check pt_regs->cs and use
> IRETL_FAST if it is USER32_CS; use IRETQ_FAST if it is USER_CS
> or KERNEL_CS; otherwise, fall back to slow but "universal" IRETQ.
>
> Do we have contacts at Intel to petition for this? :D

Some of us do and have petitioned :)

--Andy

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] x86/asm/entry/64: better check for canonical address
  2015-03-29 21:12                     ` Andy Lutomirski
@ 2015-03-29 21:46                       ` Denys Vlasenko
  0 siblings, 0 replies; 32+ messages in thread
From: Denys Vlasenko @ 2015-03-29 21:46 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Ingo Molnar, Denys Vlasenko, Brian Gerst, Borislav Petkov,
	the arch/x86 maintainers, Linux Kernel Mailing List,
	Linus Torvalds

On Sun, Mar 29, 2015 at 11:12 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> On Sun, Mar 29, 2015 at 12:36 PM, Denys Vlasenko
> <vda.linux@googlemail.com> wrote:
>> The instruction would need a differentiator whether returned-to code
>> is 64-bit or 32-bit.
>> Then it probably can use the same approach SYSRET{O,L} uses:
>> with REX.W, return is to 64-bit; without it, return is to 32-bit.
>>
>> Interrupt return then can check pt_regs->cs and use
>> IRETL_FAST if it is USER32_CS; use IRETQ_FAST if it is USER_CS
>> or KERNEL_CS; otherwise, fall back to slow but "universal" IRETQ.

Hmm. In fact, since we'd need such checks anyway, the instructions
can be even simpler: they don't even need to check CPL,
that can be hardcoded too. We'd need four instructions then:
return to 64-bit and to 32-bit code, to CPL0 and to CPL3.


>> Do we have contacts at Intel to petition for this? :D
>
> Some of us do and have petitioned :)

And what did Intel say?

If there's any interest in doing this, Intel had better *actually* talk to us
before they commit to implementing it. Their track record
in implementing "fast syscalls" is nothing to write home about:
SYSENTER is a design disaster; SYSRET is buggy.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] x86/asm/entry/64: better check for canonical address
  2015-03-26 18:45 ` Andy Lutomirski
  2015-03-27  8:57   ` Borislav Petkov
@ 2015-03-30 14:27   ` Denys Vlasenko
  2015-03-30 14:30     ` Andy Lutomirski
  1 sibling, 1 reply; 32+ messages in thread
From: Denys Vlasenko @ 2015-03-30 14:27 UTC (permalink / raw)
  To: Andy Lutomirski; +Cc: Borislav Petkov, X86 ML, linux-kernel

On 03/26/2015 07:45 PM, Andy Lutomirski wrote:
> On Thu, Mar 26, 2015 at 5:42 AM, Denys Vlasenko <dvlasenk@redhat.com> wrote:
>> This change makes the check exact (no more false positives
>> on kernel addresses).
>>
>> It isn't really important to be fully correct here -
>> almost all addresses we'll ever see will be userspace ones,
>> but OTOH it looks to be cheap enough:
>> the new code uses two more ALU ops but preserves %rcx,
>> allowing to not reload it from pt_regs->cx again.
>> On disassembly level, the changes are:
>>
>> cmp %rcx,0x80(%rsp) -> mov 0x80(%rsp),%r11; cmp %rcx,%r11
>> shr $0x2f,%rcx      -> shl $0x10,%rcx; sar $0x10,%rcx; cmp %rcx,%r11
>> mov 0x58(%rsp),%rcx -> (eliminated)
>>
>> Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
>> CC: Borislav Petkov <bp@alien8.de>
>> CC: x86@kernel.org
>> CC: linux-kernel@vger.kernel.org
>> ---
>>
>> Andy, I'd undecided myself on the merits of doing this.
>> If you like it, feel free to take it in your tree.
>> I trimmed CC list to not bother too many people with this trivial
>> and quite possibly "useless churn"-class change.
> 
> I suspect that the two added ALU ops are free for all practical
> purposes, and the performance of this path isn't *that* critical.
> 
> If anyone is running with vsyscall=native because they need the
> performance, then this would be a big win.  Otherwise I don't have a
> real preference.  Anyone else have any thoughts here?
> 
> Let me just run through the math quickly to make sure I believe all the numbers:
> 
> Canonical addresses either start with 17 zeros or 17 ones.
> 
> In the old code, we checked that the top (64-47) = 17 bits were all
> zero.  We did this by shifting right by 47 bits and making sure that
> nothing was left.
> 
> In the new code, we're shifting left by (64 - 48) = 16 bits and then
> signed shifting right by the same amount, this propagating the 17th
> highest bit to all positions to its left.  If we get the same value we
> started with, then we're good to go.
> 
> So it looks okay to me.


So please take it into your tree :)


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] x86/asm/entry/64: better check for canonical address
  2015-03-30 14:27   ` Denys Vlasenko
@ 2015-03-30 14:30     ` Andy Lutomirski
  2015-03-30 14:45       ` Andy Lutomirski
  0 siblings, 1 reply; 32+ messages in thread
From: Andy Lutomirski @ 2015-03-30 14:30 UTC (permalink / raw)
  To: Denys Vlasenko; +Cc: Borislav Petkov, X86 ML, linux-kernel

On Mon, Mar 30, 2015 at 7:27 AM, Denys Vlasenko <dvlasenk@redhat.com> wrote:
> On 03/26/2015 07:45 PM, Andy Lutomirski wrote:
>> On Thu, Mar 26, 2015 at 5:42 AM, Denys Vlasenko <dvlasenk@redhat.com> wrote:
>>> This change makes the check exact (no more false positives
>>> on kernel addresses).
>>>
>>> It isn't really important to be fully correct here -
>>> almost all addresses we'll ever see will be userspace ones,
>>> but OTOH it looks to be cheap enough:
>>> the new code uses two more ALU ops but preserves %rcx,
>>> allowing to not reload it from pt_regs->cx again.
>>> On disassembly level, the changes are:
>>>
>>> cmp %rcx,0x80(%rsp) -> mov 0x80(%rsp),%r11; cmp %rcx,%r11
>>> shr $0x2f,%rcx      -> shl $0x10,%rcx; sar $0x10,%rcx; cmp %rcx,%r11
>>> mov 0x58(%rsp),%rcx -> (eliminated)
>>>
>>> Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
>>> CC: Borislav Petkov <bp@alien8.de>
>>> CC: x86@kernel.org
>>> CC: linux-kernel@vger.kernel.org
>>> ---
>>>
>>> Andy, I'd undecided myself on the merits of doing this.
>>> If you like it, feel free to take it in your tree.
>>> I trimmed CC list to not bother too many people with this trivial
>>> and quite possibly "useless churn"-class change.
>>
>> I suspect that the two added ALU ops are free for all practical
>> purposes, and the performance of this path isn't *that* critical.
>>
>> If anyone is running with vsyscall=native because they need the
>> performance, then this would be a big win.  Otherwise I don't have a
>> real preference.  Anyone else have any thoughts here?
>>
>> Let me just run through the math quickly to make sure I believe all the numbers:
>>
>> Canonical addresses either start with 17 zeros or 17 ones.
>>
>> In the old code, we checked that the top (64-47) = 17 bits were all
>> zero.  We did this by shifting right by 47 bits and making sure that
>> nothing was left.
>>
>> In the new code, we're shifting left by (64 - 48) = 16 bits and then
>> signed shifting right by the same amount, this propagating the 17th
>> highest bit to all positions to its left.  If we get the same value we
>> started with, then we're good to go.
>>
>> So it looks okay to me.
>
>
> So please take it into your tree :)
>

Will do, but not until later this week because I'm on vacation and I'm
allocating about ten minutes to using the computer :)  Or maybe Ingo
will beat me.

-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] x86/asm/entry/64: better check for canonical address
  2015-03-30 14:30     ` Andy Lutomirski
@ 2015-03-30 14:45       ` Andy Lutomirski
  0 siblings, 0 replies; 32+ messages in thread
From: Andy Lutomirski @ 2015-03-30 14:45 UTC (permalink / raw)
  To: Denys Vlasenko; +Cc: Borislav Petkov, X86 ML, linux-kernel

On Mon, Mar 30, 2015 at 7:30 AM, Andy Lutomirski <luto@amacapital.net> wrote:
> On Mon, Mar 30, 2015 at 7:27 AM, Denys Vlasenko <dvlasenk@redhat.com> wrote:
>> On 03/26/2015 07:45 PM, Andy Lutomirski wrote:
>>> On Thu, Mar 26, 2015 at 5:42 AM, Denys Vlasenko <dvlasenk@redhat.com> wrote:
>>>> This change makes the check exact (no more false positives
>>>> on kernel addresses).
>>>>
>>>> It isn't really important to be fully correct here -
>>>> almost all addresses we'll ever see will be userspace ones,
>>>> but OTOH it looks to be cheap enough:
>>>> the new code uses two more ALU ops but preserves %rcx,
>>>> allowing to not reload it from pt_regs->cx again.
>>>> On disassembly level, the changes are:
>>>>
>>>> cmp %rcx,0x80(%rsp) -> mov 0x80(%rsp),%r11; cmp %rcx,%r11
>>>> shr $0x2f,%rcx      -> shl $0x10,%rcx; sar $0x10,%rcx; cmp %rcx,%r11
>>>> mov 0x58(%rsp),%rcx -> (eliminated)
>>>>
>>>> Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
>>>> CC: Borislav Petkov <bp@alien8.de>
>>>> CC: x86@kernel.org
>>>> CC: linux-kernel@vger.kernel.org
>>>> ---
>>>>
>>>> Andy, I'd undecided myself on the merits of doing this.
>>>> If you like it, feel free to take it in your tree.
>>>> I trimmed CC list to not bother too many people with this trivial
>>>> and quite possibly "useless churn"-class change.
>>>
>>> I suspect that the two added ALU ops are free for all practical
>>> purposes, and the performance of this path isn't *that* critical.
>>>
>>> If anyone is running with vsyscall=native because they need the
>>> performance, then this would be a big win.  Otherwise I don't have a
>>> real preference.  Anyone else have any thoughts here?
>>>
>>> Let me just run through the math quickly to make sure I believe all the numbers:
>>>
>>> Canonical addresses either start with 17 zeros or 17 ones.
>>>
>>> In the old code, we checked that the top (64-47) = 17 bits were all
>>> zero.  We did this by shifting right by 47 bits and making sure that
>>> nothing was left.
>>>
>>> In the new code, we're shifting left by (64 - 48) = 16 bits and then
>>> signed shifting right by the same amount, this propagating the 17th
>>> highest bit to all positions to its left.  If we get the same value we
>>> started with, then we're good to go.
>>>
>>> So it looks okay to me.
>>
>>
>> So please take it into your tree :)
>>
>
> Will do, but not until later this week because I'm on vacation and I'm
> allocating about ten minutes to using the computer :)  Or maybe Ingo
> will beat me.

Actually, before I do that, want to send a test case?  I don't think
it's that important (or easy) to test performance, but testing for
oopses is good.  Basing off of this:

https://git.kernel.org/cgit/linux/kernel/git/luto/linux.git/commit/?h=x86/entry&id=eeac7de873439bfb5cf49b04119f510fcbd5c040

might be reasonable, but it's also entirely optional -- it's just how
I would approach it.

--Andy

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] x86/asm/entry/64: better check for canonical address
  2015-03-29 19:36                   ` Denys Vlasenko
  2015-03-29 21:12                     ` Andy Lutomirski
@ 2015-03-31 16:43                     ` Ingo Molnar
  2015-03-31 17:08                       ` Andy Lutomirski
  1 sibling, 1 reply; 32+ messages in thread
From: Ingo Molnar @ 2015-03-31 16:43 UTC (permalink / raw)
  To: Denys Vlasenko
  Cc: Denys Vlasenko, Brian Gerst, Andy Lutomirski, Borislav Petkov,
	the arch/x86 maintainers, Linux Kernel Mailing List,
	Linus Torvalds


* Denys Vlasenko <vda.linux@googlemail.com> wrote:

> > I guess they could optimize it by adding a single "I am a modern 
> > OS executing regular userspace" flag to the descriptor [or 
> > expressing the same as a separate instruction], to avoid all that 
> > legacy crap that won't trigger on like 99.999999% of systems ...
> 
> Yes, that would be a useful addition. Interrupt servicing on x86 
> takes a non-negligible hit because of IRET slowness.

But ... to react to your other patch: detecting the common easy case 
and doing a POPF+RET ourselves ought to be pretty good as well?

But only if ptregs->rip != the magic RET itself, to avoid recursion.

Even with all those extra checks it should still be much faster.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] x86/asm/entry/64: better check for canonical address
  2015-03-31 16:43                     ` Ingo Molnar
@ 2015-03-31 17:08                       ` Andy Lutomirski
  2015-03-31 17:31                         ` Denys Vlasenko
  0 siblings, 1 reply; 32+ messages in thread
From: Andy Lutomirski @ 2015-03-31 17:08 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Denys Vlasenko, Denys Vlasenko, Brian Gerst, Borislav Petkov,
	the arch/x86 maintainers, Linux Kernel Mailing List,
	Linus Torvalds

On Tue, Mar 31, 2015 at 9:43 AM, Ingo Molnar <mingo@kernel.org> wrote:
>
> * Denys Vlasenko <vda.linux@googlemail.com> wrote:
>
>> > I guess they could optimize it by adding a single "I am a modern
>> > OS executing regular userspace" flag to the descriptor [or
>> > expressing the same as a separate instruction], to avoid all that
>> > legacy crap that won't trigger on like 99.999999% of systems ...
>>
>> Yes, that would be a useful addition. Interrupt servicing on x86
>> takes a non-negligible hit because of IRET slowness.
>
> But ... to react to your other patch: detecting the common easy case
> and doing a POPF+RET ourselves ought to be pretty good as well?
>
> But only if ptregs->rip != the magic RET itself, to avoid recursion.
>
> Even with all those extra checks it should still be much faster.
>

I have a smallish preference for doing sti;ret instead, because that
keeps the funny special case entirely localized to the NMI code
instead of putting it in the IRQ exit path.  I suspect that the
performance loss is at most a cycle or two (we're adding a branch, but
sti itself is quite fast).

That being said, I could easily be convinced otherwise.  We'd
certainly get better code coverage if the special case were in the IRQ
exit path, since it would presumably happen fairly frequently.  Heck,
executing write(anything, BAD_USER_POINTER, 1) in a tight loop would
probably hit that case very quickly if there's any network traffic.

--Andy

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] x86/asm/entry/64: better check for canonical address
  2015-03-31 17:08                       ` Andy Lutomirski
@ 2015-03-31 17:31                         ` Denys Vlasenko
  0 siblings, 0 replies; 32+ messages in thread
From: Denys Vlasenko @ 2015-03-31 17:31 UTC (permalink / raw)
  To: Andy Lutomirski, Ingo Molnar
  Cc: Denys Vlasenko, Brian Gerst, Borislav Petkov,
	the arch/x86 maintainers, Linux Kernel Mailing List,
	Linus Torvalds

On 03/31/2015 07:08 PM, Andy Lutomirski wrote:
> On Tue, Mar 31, 2015 at 9:43 AM, Ingo Molnar <mingo@kernel.org> wrote:
>>
>> * Denys Vlasenko <vda.linux@googlemail.com> wrote:
>>
>>>> I guess they could optimize it by adding a single "I am a modern
>>>> OS executing regular userspace" flag to the descriptor [or
>>>> expressing the same as a separate instruction], to avoid all that
>>>> legacy crap that won't trigger on like 99.999999% of systems ...
>>>
>>> Yes, that would be a useful addition. Interrupt servicing on x86
>>> takes a non-negligible hit because of IRET slowness.
>>
>> But ... to react to your other patch: detecting the common easy case
>> and doing a POPF+RET ourselves ought to be pretty good as well?
>>
>> But only if ptregs->rip != the magic RET itself, to avoid recursion.
>>
>> Even with all those extra checks it should still be much faster.
>>
> 
> I have a smallish preference for doing sti;ret instead, because that
> keeps the funny special case entirely localized to the NMI code
> instead of putting it in the IRQ exit path.  I suspect that the
> performance loss is at most a cycle or two (we're adding a branch, but
> sti itself is quite fast).
> 
> That being said, I could easily be convinced otherwise.

Let me try to convince you. sti is 6 cycles.

The patch atop your code would be:

 	movq RIP-ARGOFFSET(%rsp), %rcx
+	cmp $magic_ret, %rcx
+	je  real_iret
-	btr $9, %rdi
 	movq %rdi, (%rsi)
 	movq %rcx, 8(%rsi)
 	movq %rsi, ORIG_RAX-ARGOFFSET(%rsp)
 	popq_cfi %r11
 	popq_cfi %r10
 	popq_cfi %r9
 	popq_cfi %r8
 	popq_cfi %rax
 	popq_cfi %rcx
 	popq_cfi %rdx
 	popq_cfi %rsi
 	popq_cfi %rdi
 	popq %rsp
-	jc 1f
	popfq_cfi
+magic_ret:
	retq
-1:
-	popfq_cfi
-	sti
-	retq

It's a clear (albeit small) win: the branch is almost never taken,
and we do not need sti.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] x86/asm/entry/64: better check for canonical address
  2015-03-26 12:42 [PATCH] x86/asm/entry/64: better check for canonical address Denys Vlasenko
                   ` (2 preceding siblings ...)
  2015-03-27 11:27 ` Brian Gerst
@ 2015-04-02 17:37 ` Denys Vlasenko
  2015-04-02 18:10   ` Ingo Molnar
  3 siblings, 1 reply; 32+ messages in thread
From: Denys Vlasenko @ 2015-04-02 17:37 UTC (permalink / raw)
  To: Andy Lutomirski; +Cc: Denys Vlasenko, Borislav Petkov, x86, linux-kernel

On 03/26/2015 01:42 PM, Denys Vlasenko wrote:
> This change makes the check exact (no more false positives
> on kernel addresses).
> 
> It isn't really important to be fully correct here -
> almost all addresses we'll ever see will be userspace ones,
> but OTOH it looks to be cheap enough:
> the new code uses two more ALU ops but preserves %rcx,
> allowing to not reload it from pt_regs->cx again.
> On disassembly level, the changes are:
> 
> cmp %rcx,0x80(%rsp) -> mov 0x80(%rsp),%r11; cmp %rcx,%r11
> shr $0x2f,%rcx      -> shl $0x10,%rcx; sar $0x10,%rcx; cmp %rcx,%r11
> mov 0x58(%rsp),%rcx -> (eliminated)



>  	.ifne __VIRTUAL_MASK_SHIFT - 47
>  	.error "virtual address width changed -- sysret checks need update"
>  	.endif
> -	shr $__VIRTUAL_MASK_SHIFT, %rcx
> -	jnz opportunistic_sysret_failed
> +	/* Change top 16 bits to be a sign-extension of the rest */
> +	shl	$(64 - (__VIRTUAL_MASK_SHIFT+1)), %rcx
> +	sar	$(64 - (__VIRTUAL_MASK_SHIFT+1)), %rcx
> +	/* If this changed %rcx, it was not canonical */
> +	cmpq	%rcx, %r11
> +	jne	opportunistic_sysret_failed


Another thing we can do here is to just canonicalize the address.
IOW: the same code as above, but without the last two insns.
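
That is, just (sketch):

	/* Canonicalize: make the top 16 bits a sign-extension of bit 47 */
	shl	$(64 - (__VIRTUAL_MASK_SHIFT+1)), %rcx
	sar	$(64 - (__VIRTUAL_MASK_SHIFT+1)), %rcx
	/* ...no cmpq/jne -- just proceed to SYSRET with the fixed-up %rcx */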

The difference would be that if userspace gives us a bogus,
non-canonical return address, we would return to a different address
instead of delivering SIGSEGV.

There are no security implications in doing this as far as I can see,
and no sane program uses non-canonical addresses.
Apart from there being no legitimate need to do so, it's also quite
complicated to achieve.

So it should not break any real-world cases.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] x86/asm/entry/64: better check for canonical address
  2015-04-02 17:37 ` Denys Vlasenko
@ 2015-04-02 18:10   ` Ingo Molnar
  0 siblings, 0 replies; 32+ messages in thread
From: Ingo Molnar @ 2015-04-02 18:10 UTC (permalink / raw)
  To: Denys Vlasenko
  Cc: Andy Lutomirski, Borislav Petkov, x86, linux-kernel, Linus Torvalds


* Denys Vlasenko <dvlasenk@redhat.com> wrote:

> On 03/26/2015 01:42 PM, Denys Vlasenko wrote:
> > This change makes the check exact (no more false positives
> > on kernel addresses).
> > 
> > It isn't really important to be fully correct here -
> > almost all addresses we'll ever see will be userspace ones,
> > but OTOH it looks to be cheap enough:
> > the new code uses two more ALU ops but preserves %rcx,
> > allowing to not reload it from pt_regs->cx again.
> > On disassembly level, the changes are:
> > 
> > cmp %rcx,0x80(%rsp) -> mov 0x80(%rsp),%r11; cmp %rcx,%r11
> > shr $0x2f,%rcx      -> shl $0x10,%rcx; sar $0x10,%rcx; cmp %rcx,%r11
> > mov 0x58(%rsp),%rcx -> (eliminated)
> 
> 
> 
> >  	.ifne __VIRTUAL_MASK_SHIFT - 47
> >  	.error "virtual address width changed -- sysret checks need update"
> >  	.endif
> > -	shr $__VIRTUAL_MASK_SHIFT, %rcx
> > -	jnz opportunistic_sysret_failed
> > +	/* Change top 16 bits to be a sign-extension of the rest */
> > +	shl	$(64 - (__VIRTUAL_MASK_SHIFT+1)), %rcx
> > +	sar	$(64 - (__VIRTUAL_MASK_SHIFT+1)), %rcx
> > +	/* If this changed %rcx, it was not canonical */
> > +	cmpq	%rcx, %r11
> > +	jne	opportunistic_sysret_failed
> 
> 
> Another thing we can do here is to just canonicalize the address.
> IOW: same code as above but without last two insns.
> 
> The difference would be that if userspace gives us bogus,
> noncanonical return address, we would return to a different address
> instead of SIGSEGV.

So in general it's better to be proactive with such things and 
generate an error as early as possible, making it easier to debug 
user-space bugs.

> There is no security implications in doing this as far as I can see,
> and no sane program uses noncanonical addresses.
> Apart from not having any legitimate need to do so, it's also quite
> complicated to achieve.
> 
> So it should not break any real-world cases.

It would make user-space debugging harder though, so it's a quality of
implementation issue.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] x86/asm/entry/64: better check for canonical address
  2015-04-23 15:49       ` Borislav Petkov
@ 2015-04-23 15:52         ` Andy Lutomirski
  0 siblings, 0 replies; 32+ messages in thread
From: Andy Lutomirski @ 2015-04-23 15:52 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Denys Vlasenko, Ingo Molnar, Linus Torvalds, Steven Rostedt,
	H. Peter Anvin, Oleg Nesterov, Frederic Weisbecker,
	Alexei Starovoitov, Will Drewry, Kees Cook, X86 ML, linux-kernel

On Thu, Apr 23, 2015 at 8:49 AM, Borislav Petkov <bp@alien8.de> wrote:
> On Thu, Apr 23, 2015 at 08:41:15AM -0700, Andy Lutomirski wrote:
>> I was rather vague there.  Let me try again:
>>
>> If anyone in the AMD camp really cared, we could add a new bug flag
>> X86_BUG_SYSRET_NEEDS_CANONICAL_RCX and set it on Intel chips only, so
>> we could use alternatives to patch out the check when running on
>> sensible AMD hardware.  This would speed the slow path up by a couple
>> of cycles on AMD chips.
>>
>> Does that make more sense?  We could call it
>> X86_BUG_SYSRET_NEEDS_CANONICAL_RIP if that makes more sense.
>
> Actually "...NEEDS_CANONICAL_RCX" makes more sense as this is what we're
> going to patch out eventually, if it makes sense - the RIP canonicalness
> test is being done as part of SYSRET, just RCX is not being tested.
>
> Tell you what - how about I perf stat this first by commenting out that
> couple of instructions on AMD to see whether it brings anything.
>
> Got an idea for a workload other than a kernel build? :-)
>
> Although a kernel build should do a lot of syscalls too...

Kernel build should be fine.  Or "timing_test_64 10 sys_enosys 1" or
"perf_self_monitor" (warning: WIP).  Make sure you either have context
tracking forced on or something else (audit?) that forces the slow
path, though, or you won't see it at all.

https://git.kernel.org/cgit/linux/kernel/git/luto/misc-tests.git/

--Andy

>
> Thanks.
>
> --
> Regards/Gruss,
>     Boris.
>
> ECO tip #101: Trim your mails when you reply.
> --



-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] x86/asm/entry/64: better check for canonical address
  2015-04-23 15:41     ` Andy Lutomirski
@ 2015-04-23 15:49       ` Borislav Petkov
  2015-04-23 15:52         ` Andy Lutomirski
  0 siblings, 1 reply; 32+ messages in thread
From: Borislav Petkov @ 2015-04-23 15:49 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Denys Vlasenko, Ingo Molnar, Linus Torvalds, Steven Rostedt,
	H. Peter Anvin, Oleg Nesterov, Frederic Weisbecker,
	Alexei Starovoitov, Will Drewry, Kees Cook, X86 ML, linux-kernel

On Thu, Apr 23, 2015 at 08:41:15AM -0700, Andy Lutomirski wrote:
> I was rather vague there.  Let me try again:
> 
> If anyone in the AMD camp really cared, we could add a new bug flag
> X86_BUG_SYSRET_NEEDS_CANONICAL_RCX and set it on Intel chips only, so
> we could use alternatives to patch out the check when running on
> sensible AMD hardware.  This would speed the slow path up by a couple
> of cycles on AMD chips.
> 
> Does that make more sense?  We could call it
> X86_BUG_SYSRET_NEEDS_CANONICAL_RIP if that makes more sense.

Actually "...NEEDS_CANONICAL_RCX" makes more sense as this is what we're
going to patch out eventually, if it makes sense - the RIP canonicalness
test is being done as part of SYSRET, just RCX is not being tested.

Tell you what - how about I perf stat this first by commenting out that
couple of instructions on AMD to see whether it brings anything.

Got an idea for a workload other than a kernel build? :-)

Although a kernel build should do a lot of syscalls too...

Thanks.

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.
--

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] x86/asm/entry/64: better check for canonical address
  2015-04-23 15:10   ` Borislav Petkov
@ 2015-04-23 15:41     ` Andy Lutomirski
  2015-04-23 15:49       ` Borislav Petkov
  0 siblings, 1 reply; 32+ messages in thread
From: Andy Lutomirski @ 2015-04-23 15:41 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Denys Vlasenko, Ingo Molnar, Linus Torvalds, Steven Rostedt,
	H. Peter Anvin, Oleg Nesterov, Frederic Weisbecker,
	Alexei Starovoitov, Will Drewry, Kees Cook, X86 ML, linux-kernel

On Thu, Apr 23, 2015 at 8:10 AM, Borislav Petkov <bp@alien8.de> wrote:
> On Tue, Apr 21, 2015 at 11:08:42AM -0700, Andy Lutomirski wrote:
>> I'll take a full implementation of what Intel says over probably
>> unmeasurable performance.  If anyone in the AMD camp really cared, we
>> could add X86_BUG_SYSRET_NEEDS_CANONICAL_RCX and use alternatives to
>> patch this out on AMD.  I doubt this would buy us much.
>
> Err, why do we care if RCX is canonical when executing SYSRET?
>
> The RIP canonicalness test is being done anyway and we read RIP from
> RCX. What am I missing?

I was rather vague there.  Let me try again:

If anyone in the AMD camp really cared, we could add a new bug flag
X86_BUG_SYSRET_NEEDS_CANONICAL_RCX and set it on Intel chips only, so
we could use alternatives to patch out the check when running on
sensible AMD hardware.  This would speed the slow path up by a couple
of cycles on AMD chips.
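
As a very rough sketch (the bug bit is made up, and the exact ALTERNATIVE
arguments would need checking), the entry code could wrap the test like this:

	/*
	 * By default, jump over the explicit test.  On CPUs marked with
	 * the (hypothetical) bug bit, the jmp is patched out with NOPs
	 * and the test below runs.
	 */
	ALTERNATIVE "jmp 1f", "", X86_BUG_SYSRET_NEEDS_CANONICAL_RCX
	shl	$(64 - (__VIRTUAL_MASK_SHIFT+1)), %rcx
	sar	$(64 - (__VIRTUAL_MASK_SHIFT+1)), %rcx
	cmpq	%rcx, %r11		/* %r11 holds the saved RIP */
	jne	opportunistic_sysret_failed
1: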

Does that make more sense?  We could call it
X86_BUG_SYSRET_NEEDS_CANONICAL_RIP if that makes more sense.

--Andy

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] x86/asm/entry/64: better check for canonical address
  2015-04-21 18:08 ` Andy Lutomirski
@ 2015-04-23 15:10   ` Borislav Petkov
  2015-04-23 15:41     ` Andy Lutomirski
  0 siblings, 1 reply; 32+ messages in thread
From: Borislav Petkov @ 2015-04-23 15:10 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Denys Vlasenko, Ingo Molnar, Linus Torvalds, Steven Rostedt,
	H. Peter Anvin, Oleg Nesterov, Frederic Weisbecker,
	Alexei Starovoitov, Will Drewry, Kees Cook, X86 ML, linux-kernel

On Tue, Apr 21, 2015 at 11:08:42AM -0700, Andy Lutomirski wrote:
> I'll take a full implementation of what Intel says over probably
> unmeasurable performance.  If anyone in the AMD camp really cared, we
> could add X86_BUG_SYSRET_NEEDS_CANONICAL_RCX and use alternatives to
> patch this out on AMD.  I doubt this would buy us much.

Err, why do we care if RCX is canonical when executing SYSRET?

The RIP canonicalness test is being done anyway and we read RIP from
RCX. What am I missing?

Oh, and then there's this:

http://lists.xen.org/archives/html/xen-announce/2012-06/msg00001.html

and more specifically:

"AMD have issued the following statement:

   AMD processors' SYSRET behavior is such that a non-canonical address
   in RCX does not generate a #GP while in CPL0. We have verified this
   with our architecture team, with our design team, and have performed
   tests that verified this on silicon. Therefore, this privilege
   escalation exposure is not applicable to any AMD processor.
"

Oh, and look at one of the Xen fixes, hahaha! (at the end). Intel faults
in ring0 due to a non-canonical RCX, but with the user RSP already loaded. Lovely.

---
x86_64: Do not execute sysret with a non-canonical return address

Check for non-canonical guest RIP before attempting to execute sysret.
If sysret is executed with a non-canonical value in RCX, Intel CPUs
take the fault in ring0, but we will necessarily already have switched
to the user's stack pointer.

This is a security vulnerability, XSA-7 / CVE-2012-0217.

Signed-off-by: Jan Beulich <JBeulich@suse.com>
Signed-off-by: Ian Campbell <Ian.Campbell@citrix.com>
Signed-off-by: Ian Jackson <ian.jackson@eu.citrix.com>
Tested-by: Ian Campbell <Ian.Campbell@citrix.com>
Acked-by: Keir Fraser <keir.xen@gmail.com>
Committed-by: Ian Jackson <ian.jackson@eu.citrix.com>

diff -r 340062faf298 -r ad87903fdca1 xen/arch/x86/x86_64/entry.S
--- a/xen/arch/x86/x86_64/entry.S	Wed May 23 11:06:49 2012 +0100
+++ b/xen/arch/x86/x86_64/entry.S	Thu May 24 11:02:35 2012 +0100
@@ -40,6 +40,13 @@ restore_all_guest:
         testw $TRAP_syscall,4(%rsp)
         jz    iret_exit_to_guest
 
+        /* Don't use SYSRET path if the return address is not canonical. */
+        movq  8(%rsp),%rcx
+        sarq  $47,%rcx
+        incl  %ecx
+        cmpl  $1,%ecx
+        ja    .Lforce_iret
+
         addq  $8,%rsp
         popq  %rcx                    # RIP
         popq  %r11                    # CS
@@ -50,6 +57,10 @@ restore_all_guest:
         sysretq
 1:      sysretl
 
+.Lforce_iret:
+        /* Mimic SYSRET behavior. */
+        movq  8(%rsp),%rcx            # RIP
+        movq  24(%rsp),%r11           # RFLAGS
         ALIGN
 /* No special register assumptions. */
 iret_exit_to_guest:
---

Thanks.

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.
--

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH] x86/asm/entry/64: better check for canonical address
  2015-04-21 16:27 Denys Vlasenko
@ 2015-04-21 18:08 ` Andy Lutomirski
  2015-04-23 15:10   ` Borislav Petkov
  0 siblings, 1 reply; 32+ messages in thread
From: Andy Lutomirski @ 2015-04-21 18:08 UTC (permalink / raw)
  To: Denys Vlasenko
  Cc: Ingo Molnar, Linus Torvalds, Steven Rostedt, Borislav Petkov,
	H. Peter Anvin, Oleg Nesterov, Frederic Weisbecker,
	Alexei Starovoitov, Will Drewry, Kees Cook, X86 ML, linux-kernel

On Tue, Apr 21, 2015 at 9:27 AM, Denys Vlasenko <dvlasenk@redhat.com> wrote:
> This change makes the check exact (no more false positives
> on "negative" addresses).

Acked-by: Andy Lutomirski <luto@kernel.org>

I'll take a full implementation of what Intel says over probably
unmeasurable performance.  If anyone in the AMD camp really cared, we
could add X86_BUG_SYSRET_NEEDS_CANONICAL_RCX and use alternatives to
patch this out on AMD.  I doubt this would buy us much.

--Andy

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [PATCH] x86/asm/entry/64: better check for canonical address
@ 2015-04-21 16:27 Denys Vlasenko
  2015-04-21 18:08 ` Andy Lutomirski
  0 siblings, 1 reply; 32+ messages in thread
From: Denys Vlasenko @ 2015-04-21 16:27 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Denys Vlasenko, Linus Torvalds, Steven Rostedt, Borislav Petkov,
	H. Peter Anvin, Andy Lutomirski, Oleg Nesterov,
	Frederic Weisbecker, Alexei Starovoitov, Will Drewry, Kees Cook,
	x86, linux-kernel

This change makes the check exact (no more false positives
on "negative" addresses).

It isn't really important to be fully correct here -
almost all addresses we'll ever see will be userspace ones,
but OTOH it looks to be cheap enough:
the new code uses two more ALU ops but preserves %rcx,
allowing to not reload it from pt_regs->cx again.
On disassembly level, the changes are:

cmp %rcx,0x80(%rsp) -> mov 0x80(%rsp),%r11; cmp %rcx,%r11
shr $0x2f,%rcx      -> shl $0x10,%rcx; sar $0x10,%rcx; cmp %rcx,%r11
mov 0x58(%rsp),%rcx -> (eliminated)

On 03/26/2015 07:45 PM, Andy Lutomirski wrote:
> I suspect that the two added ALU ops are free for all practical
> purposes, and the performance of this path isn't *that* critical.
>
> If anyone is running with vsyscall=native because they need the
> performance, then this would be a big win.  Otherwise I don't have a
> real preference.  Anyone else have any thoughts here?
>
> Let me just run through the math quickly to make sure I believe all the numbers:
>
> Canonical addresses either start with 17 zeros or 17 ones.
>
> In the old code, we checked that the top (64-47) = 17 bits were all
> zero.  We did this by shifting right by 47 bits and making sure that
> nothing was left.
>
> In the new code, we're shifting left by (64 - 48) = 16 bits and then
> signed shifting right by the same amount, this propagating the 17th
> highest bit to all positions to its left.  If we get the same value we
> started with, then we're good to go.
>
> So it looks okay to me.
>
> IOW, the new code extends the optimization correctly to one more case
> (native vsyscalls or the really weird corner case of returns to
> emulated vsyscalls, although that should basically never happen) at
> the cost of two probably-free ALU ops.

Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
CC: Linus Torvalds <torvalds@linux-foundation.org>
CC: Steven Rostedt <rostedt@goodmis.org>
CC: Ingo Molnar <mingo@kernel.org>
CC: Borislav Petkov <bp@alien8.de>
CC: "H. Peter Anvin" <hpa@zytor.com>
CC: Andy Lutomirski <luto@amacapital.net>
CC: Oleg Nesterov <oleg@redhat.com>
CC: Frederic Weisbecker <fweisbec@gmail.com>
CC: Alexei Starovoitov <ast@plumgrid.com>
CC: Will Drewry <wad@chromium.org>
CC: Kees Cook <keescook@chromium.org>
CC: x86@kernel.org
CC: linux-kernel@vger.kernel.org
---

Changes since last submission: expanded commit message with Andy's reply
as requested by Ingo.

 arch/x86/kernel/entry_64.S | 23 ++++++++++++-----------
 1 file changed, 12 insertions(+), 11 deletions(-)

diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index 3bdfdcd..e952f6b 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -410,26 +410,27 @@ syscall_return:
 	 * a completely clean 64-bit userspace context.
 	 */
 	movq RCX(%rsp),%rcx
-	cmpq %rcx,RIP(%rsp)		/* RCX == RIP */
+	movq RIP(%rsp),%r11
+	cmpq %rcx,%r11			/* RCX == RIP */
 	jne opportunistic_sysret_failed
 
 	/*
 	 * On Intel CPUs, SYSRET with non-canonical RCX/RIP will #GP
 	 * in kernel space.  This essentially lets the user take over
-	 * the kernel, since userspace controls RSP.  It's not worth
-	 * testing for canonicalness exactly -- this check detects any
-	 * of the 17 high bits set, which is true for non-canonical
-	 * or kernel addresses.  (This will pessimize vsyscall=native.
-	 * Big deal.)
+	 * the kernel, since userspace controls RSP.
 	 *
-	 * If virtual addresses ever become wider, this will need
+	 * If width of "canonical tail" ever becomes variable, this will need
 	 * to be updated to remain correct on both old and new CPUs.
 	 */
 	.ifne __VIRTUAL_MASK_SHIFT - 47
 	.error "virtual address width changed -- SYSRET checks need update"
 	.endif
-	shr $__VIRTUAL_MASK_SHIFT, %rcx
-	jnz opportunistic_sysret_failed
+	/* Change top 16 bits to be the sign-extension of the 47th bit */
+	shl	$(64 - (__VIRTUAL_MASK_SHIFT+1)), %rcx
+	sar	$(64 - (__VIRTUAL_MASK_SHIFT+1)), %rcx
+	/* If this changed %rcx, it was not canonical */
+	cmpq	%rcx, %r11
+	jne	opportunistic_sysret_failed
 
 	cmpq $__USER_CS,CS(%rsp)	/* CS must match SYSRET */
 	jne opportunistic_sysret_failed
@@ -466,8 +467,8 @@ syscall_return:
 	 */
 syscall_return_via_sysret:
 	CFI_REMEMBER_STATE
-	/* r11 is already restored (see code above) */
-	RESTORE_C_REGS_EXCEPT_R11
+	/* rcx and r11 are already restored (see code above) */
+	RESTORE_C_REGS_EXCEPT_RCX_R11
 	movq RSP(%rsp),%rsp
 	USERGS_SYSRET64
 	CFI_RESTORE_STATE
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 32+ messages in thread

end of thread, other threads:[~2015-04-23 15:52 UTC | newest]

Thread overview: 32+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-03-26 12:42 [PATCH] x86/asm/entry/64: better check for canonical address Denys Vlasenko
2015-03-26 18:45 ` Andy Lutomirski
2015-03-27  8:57   ` Borislav Petkov
2015-03-30 14:27   ` Denys Vlasenko
2015-03-30 14:30     ` Andy Lutomirski
2015-03-30 14:45       ` Andy Lutomirski
2015-03-27  8:11 ` Ingo Molnar
2015-03-27 10:45   ` Denys Vlasenko
2015-03-27 11:17     ` Ingo Molnar
2015-03-27 11:28       ` Brian Gerst
2015-03-27 11:34         ` Ingo Molnar
2015-03-27 12:14           ` Denys Vlasenko
2015-03-27 12:16             ` Ingo Molnar
2015-03-27 12:31               ` Denys Vlasenko
2015-03-28  9:11                 ` Ingo Molnar
2015-03-29 19:36                   ` Denys Vlasenko
2015-03-29 21:12                     ` Andy Lutomirski
2015-03-29 21:46                       ` Denys Vlasenko
2015-03-31 16:43                     ` Ingo Molnar
2015-03-31 17:08                       ` Andy Lutomirski
2015-03-31 17:31                         ` Denys Vlasenko
2015-03-27 11:27 ` Brian Gerst
2015-03-27 11:31   ` Ingo Molnar
2015-03-27 21:37     ` Andy Lutomirski
2015-04-02 17:37 ` Denys Vlasenko
2015-04-02 18:10   ` Ingo Molnar
2015-04-21 16:27 Denys Vlasenko
2015-04-21 18:08 ` Andy Lutomirski
2015-04-23 15:10   ` Borislav Petkov
2015-04-23 15:41     ` Andy Lutomirski
2015-04-23 15:49       ` Borislav Petkov
2015-04-23 15:52         ` Andy Lutomirski

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).