All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH urgent v2] x86, asm: Disable opportunistic SYSRET if regs->flags has TF set
@ 2015-04-01 21:26 Andy Lutomirski
  2015-04-02  6:21 ` Borislav Petkov
                   ` (2 more replies)
  0 siblings, 3 replies; 13+ messages in thread
From: Andy Lutomirski @ 2015-04-01 21:26 UTC (permalink / raw)
  To: Ingo Molnar, x86, linux-kernel
  Cc: Borislav Petkov, Denys Vlasenko, Andy Lutomirski

When I wrote the opportunistic SYSRET code, I missed an important
difference between SYSRET and IRET.  Both instructions are capable
of setting EFLAGS.TF, but they behave differently when doing so.
IRET will not issue a #DB trap after execution when it sets TF This
is critical -- otherwise you'd never be able to make forward
progress when returning to userspace.  SYSRET, on the other hand,
will trap with #DB immediately after returning to CPL3, and the next
instruction will never execute.

This breaks anything that opportunistically SYSRETs to a user
context with TF set.  For example, running this code with TF set and
a SIGTRAP handler loaded never gets past post_nop.

	extern unsigned char post_nop[];
	asm volatile ("pushfq\n\t"
		      "popq %%r11\n\t"
		      "nop\n\t"
		      "post_nop:"
		      : : "c" (post_nop) : "r11");

In my defense, I can't find this documented in the AMD or Intel
manual.

Fix it by using IRET to restore TF.

Fixes: 2a23c6b8a9c4 x86_64, entry: Use sysret to return to userspace when possible
Signed-off-by: Andy Lutomirski <luto@kernel.org>
---

This affects 4.0-rc as well as -tip.  A full test case lives here:

https://git.kernel.org/cgit/linux/kernel/git/luto/misc-tests.git/

It's called single_step_syscall_64.

On Intel systems, the 32-bit version of that test fails for unrelated
reasons, but that's not a regression, and fixing it will be much more
intrusive.

Changes from v1:
 - Remove mention of testl from changelog.
 - Improve comment per Denys' suggestion.

arch/x86/kernel/entry_64.S | 16 +++++++++++++++-
 1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index 750c6efcb718..537716380959 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -715,7 +715,21 @@ retint_swapgs:		/* return to user-space */
 	cmpq %r11,EFLAGS(%rsp)		/* R11 == RFLAGS */
 	jne opportunistic_sysret_failed
 
-	testq $X86_EFLAGS_RF,%r11	/* sysret can't restore RF */
+	/*
+	 * SYSRET can't restore RF.  SYSRET can restore TF, but unlike IRET,
+	 * restoring TF results in a trap from userspace immediately after
+	 * SYSRET.  This would cause an infinite loop whenever #DB happens
+	 * with register state that satisfies the opportunistic SYSRET
+	 * conditions.  For example, single-stepping this user code:
+	 *
+	 *           movq $stuck_here,%rcx
+	 *           pushfq
+	 *           popq %r11
+	 *   stuck_here:
+	 *
+	 * would never get past stuck_here.
+	 */
+	testq $(X86_EFLAGS_RF|X86_EFLAGS_TF),%r11
 	jnz opportunistic_sysret_failed
 
 	/* nothing to check for RSP */
-- 
2.3.0


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH urgent v2] x86, asm: Disable opportunistic SYSRET if regs->flags has TF set
  2015-04-01 21:26 [PATCH urgent v2] x86, asm: Disable opportunistic SYSRET if regs->flags has TF set Andy Lutomirski
@ 2015-04-02  6:21 ` Borislav Petkov
  2015-04-02  9:07 ` Ingo Molnar
  2015-04-02 12:32 ` [tip:x86/urgent] x86/asm/entry/64: " tip-bot for Andy Lutomirski
  2 siblings, 0 replies; 13+ messages in thread
From: Borislav Petkov @ 2015-04-02  6:21 UTC (permalink / raw)
  To: Andy Lutomirski; +Cc: Ingo Molnar, x86, linux-kernel, Denys Vlasenko

On Wed, Apr 01, 2015 at 02:26:34PM -0700, Andy Lutomirski wrote:
> When I wrote the opportunistic SYSRET code, I missed an important
> difference between SYSRET and IRET.  Both instructions are capable
> of setting EFLAGS.TF, but they behave differently when doing so.
> IRET will not issue a #DB trap after execution when it sets TF This
> is critical -- otherwise you'd never be able to make forward
> progress when returning to userspace.  SYSRET, on the other hand,
> will trap with #DB immediately after returning to CPL3, and the next
> instruction will never execute.
> 
> This breaks anything that opportunistically SYSRETs to a user
> context with TF set.  For example, running this code with TF set and
> a SIGTRAP handler loaded never gets past post_nop.
> 
> 	extern unsigned char post_nop[];
> 	asm volatile ("pushfq\n\t"
> 		      "popq %%r11\n\t"
> 		      "nop\n\t"
> 		      "post_nop:"
> 		      : : "c" (post_nop) : "r11");
> 
> In my defense, I can't find this documented in the AMD or Intel
> manual.
> 
> Fix it by using IRET to restore TF.
> 
> Fixes: 2a23c6b8a9c4 x86_64, entry: Use sysret to return to userspace when possible
> Signed-off-by: Andy Lutomirski <luto@kernel.org>
> ---
> 
> This affects 4.0-rc as well as -tip.  A full test case lives here:
> 
> https://git.kernel.org/cgit/linux/kernel/git/luto/misc-tests.git/
> 
> It's called single_step_syscall_64.
> 
> On Intel systems, the 32-bit version of that test fails for unrelated
> reasons, but that's not a regression, and fixing it will be much more
> intrusive.
> 
> Changes from v1:
>  - Remove mention of testl from changelog.
>  - Improve comment per Denys' suggestion.
> 
> arch/x86/kernel/entry_64.S | 16 +++++++++++++++-
>  1 file changed, 15 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
> index 750c6efcb718..537716380959 100644
> --- a/arch/x86/kernel/entry_64.S
> +++ b/arch/x86/kernel/entry_64.S
> @@ -715,7 +715,21 @@ retint_swapgs:		/* return to user-space */
>  	cmpq %r11,EFLAGS(%rsp)		/* R11 == RFLAGS */
>  	jne opportunistic_sysret_failed
>  
> -	testq $X86_EFLAGS_RF,%r11	/* sysret can't restore RF */
> +	/*
> +	 * SYSRET can't restore RF.  SYSRET can restore TF, but unlike IRET,
> +	 * restoring TF results in a trap from userspace immediately after
> +	 * SYSRET.  This would cause an infinite loop whenever #DB happens
> +	 * with register state that satisfies the opportunistic SYSRET
> +	 * conditions.  For example, single-stepping this user code:
> +	 *
> +	 *           movq $stuck_here,%rcx
> +	 *           pushfq
> +	 *           popq %r11
> +	 *   stuck_here:
> +	 *
> +	 * would never get past stuck_here.
> +	 */
> +	testq $(X86_EFLAGS_RF|X86_EFLAGS_TF),%r11
>  	jnz opportunistic_sysret_failed
>  
>  	/* nothing to check for RSP */

Acked-by: Borislav Petkov <bp@suse.de>

-- 
Regards/Gruss,
    Boris.

ECO tip #101: Trim your mails when you reply.
--

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH urgent v2] x86, asm: Disable opportunistic SYSRET if regs->flags has TF set
  2015-04-01 21:26 [PATCH urgent v2] x86, asm: Disable opportunistic SYSRET if regs->flags has TF set Andy Lutomirski
  2015-04-02  6:21 ` Borislav Petkov
@ 2015-04-02  9:07 ` Ingo Molnar
  2015-04-02 10:07   ` Denys Vlasenko
  2015-04-02 12:32 ` [tip:x86/urgent] x86/asm/entry/64: " tip-bot for Andy Lutomirski
  2 siblings, 1 reply; 13+ messages in thread
From: Ingo Molnar @ 2015-04-02  9:07 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: x86, linux-kernel, Borislav Petkov, Denys Vlasenko,
	Linus Torvalds, Borislav Petkov, Brian Gerst


* Andy Lutomirski <luto@kernel.org> wrote:

> When I wrote the opportunistic SYSRET code, I missed an important
> difference between SYSRET and IRET.  Both instructions are capable
> of setting EFLAGS.TF, but they behave differently when doing so.
> IRET will not issue a #DB trap after execution when it sets TF This
> is critical -- otherwise you'd never be able to make forward
> progress when returning to userspace.  SYSRET, on the other hand,
> will trap with #DB immediately after returning to CPL3, and the next
> instruction will never execute.
> 
> This breaks anything that opportunistically SYSRETs to a user
> context with TF set.  For example, running this code with TF set and
> a SIGTRAP handler loaded never gets past post_nop.
> 
> 	extern unsigned char post_nop[];
> 	asm volatile ("pushfq\n\t"
> 		      "popq %%r11\n\t"
> 		      "nop\n\t"
> 		      "post_nop:"
> 		      : : "c" (post_nop) : "r11");
> 
> In my defense, I can't find this documented in the AMD or Intel
> manual.
> 
> Fix it by using IRET to restore TF.
> 
> Fixes: 2a23c6b8a9c4 x86_64, entry: Use sysret to return to userspace when possible
> Signed-off-by: Andy Lutomirski <luto@kernel.org>
> ---
> 
> This affects 4.0-rc as well as -tip.  A full test case lives here:
> 
> https://git.kernel.org/cgit/linux/kernel/git/luto/misc-tests.git/
> 
> It's called single_step_syscall_64.
> 
> On Intel systems, the 32-bit version of that test fails for unrelated
> reasons, but that's not a regression, and fixing it will be much more
> intrusive.
> 
> Changes from v1:
>  - Remove mention of testl from changelog.
>  - Improve comment per Denys' suggestion.
> 
> arch/x86/kernel/entry_64.S | 16 +++++++++++++++-
>  1 file changed, 15 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
> index 750c6efcb718..537716380959 100644
> --- a/arch/x86/kernel/entry_64.S
> +++ b/arch/x86/kernel/entry_64.S
> @@ -715,7 +715,21 @@ retint_swapgs:		/* return to user-space */
>  	cmpq %r11,EFLAGS(%rsp)		/* R11 == RFLAGS */
>  	jne opportunistic_sysret_failed
>  
> -	testq $X86_EFLAGS_RF,%r11	/* sysret can't restore RF */
> +	/*
> +	 * SYSRET can't restore RF.  SYSRET can restore TF, but unlike IRET,
> +	 * restoring TF results in a trap from userspace immediately after
> +	 * SYSRET.  This would cause an infinite loop whenever #DB happens
> +	 * with register state that satisfies the opportunistic SYSRET
> +	 * conditions.  For example, single-stepping this user code:
> +	 *
> +	 *           movq $stuck_here,%rcx
> +	 *           pushfq
> +	 *           popq %r11
> +	 *   stuck_here:
> +	 *
> +	 * would never get past stuck_here.
> +	 */
> +	testq $(X86_EFLAGS_RF|X86_EFLAGS_TF),%r11
>  	jnz opportunistic_sysret_failed

So I merged this as it's an obvious bugfix, but in hindsight I'm 
really uneasy about the whole opportunistic SYSRET concept: it appears 
that the chance that %rcx matches return-%rip is astronomical - this 
is why this bug wasn't noticed live so far.

So should we really be doing this?

It invites fragility and despite being clever, it adds average 
overhead to interrupt returns to user-space: the chance of the 
optimization hitting and helping us are much lower than the always 
paid for cost of doing the tests for it...

So please defend it.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH urgent v2] x86, asm: Disable opportunistic SYSRET if regs->flags has TF set
  2015-04-02  9:07 ` Ingo Molnar
@ 2015-04-02 10:07   ` Denys Vlasenko
  2015-04-02 10:37     ` Ingo Molnar
  0 siblings, 1 reply; 13+ messages in thread
From: Denys Vlasenko @ 2015-04-02 10:07 UTC (permalink / raw)
  To: Ingo Molnar, Andy Lutomirski
  Cc: x86, linux-kernel, Borislav Petkov, Linus Torvalds,
	Borislav Petkov, Brian Gerst

On 04/02/2015 11:07 AM, Ingo Molnar wrote:
> 
> * Andy Lutomirski <luto@kernel.org> wrote:
> 
>> When I wrote the opportunistic SYSRET code, I missed an important
>> difference between SYSRET and IRET.  Both instructions are capable
>> of setting EFLAGS.TF, but they behave differently when doing so.
>> IRET will not issue a #DB trap after execution when it sets TF This
>> is critical -- otherwise you'd never be able to make forward
>> progress when returning to userspace.  SYSRET, on the other hand,
>> will trap with #DB immediately after returning to CPL3, and the next
>> instruction will never execute.
>>
>> This breaks anything that opportunistically SYSRETs to a user
>> context with TF set.  For example, running this code with TF set and
>> a SIGTRAP handler loaded never gets past post_nop.
>>
>> 	extern unsigned char post_nop[];
>> 	asm volatile ("pushfq\n\t"
>> 		      "popq %%r11\n\t"
>> 		      "nop\n\t"
>> 		      "post_nop:"
>> 		      : : "c" (post_nop) : "r11");
>>
>> In my defense, I can't find this documented in the AMD or Intel
>> manual.
>>
>> Fix it by using IRET to restore TF.
>>
>> Fixes: 2a23c6b8a9c4 x86_64, entry: Use sysret to return to userspace when possible
>> Signed-off-by: Andy Lutomirski <luto@kernel.org>
>> ---
>>
>> This affects 4.0-rc as well as -tip.  A full test case lives here:
>>
>> https://git.kernel.org/cgit/linux/kernel/git/luto/misc-tests.git/
>>
>> It's called single_step_syscall_64.
>>
>> On Intel systems, the 32-bit version of that test fails for unrelated
>> reasons, but that's not a regression, and fixing it will be much more
>> intrusive.
>>
>> Changes from v1:
>>  - Remove mention of testl from changelog.
>>  - Improve comment per Denys' suggestion.
>>
>> arch/x86/kernel/entry_64.S | 16 +++++++++++++++-
>>  1 file changed, 15 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
>> index 750c6efcb718..537716380959 100644
>> --- a/arch/x86/kernel/entry_64.S
>> +++ b/arch/x86/kernel/entry_64.S
>> @@ -715,7 +715,21 @@ retint_swapgs:		/* return to user-space */
>>  	cmpq %r11,EFLAGS(%rsp)		/* R11 == RFLAGS */
>>  	jne opportunistic_sysret_failed
>>  
>> -	testq $X86_EFLAGS_RF,%r11	/* sysret can't restore RF */
>> +	/*
>> +	 * SYSRET can't restore RF.  SYSRET can restore TF, but unlike IRET,
>> +	 * restoring TF results in a trap from userspace immediately after
>> +	 * SYSRET.  This would cause an infinite loop whenever #DB happens
>> +	 * with register state that satisfies the opportunistic SYSRET
>> +	 * conditions.  For example, single-stepping this user code:
>> +	 *
>> +	 *           movq $stuck_here,%rcx
>> +	 *           pushfq
>> +	 *           popq %r11
>> +	 *   stuck_here:
>> +	 *
>> +	 * would never get past stuck_here.
>> +	 */
>> +	testq $(X86_EFLAGS_RF|X86_EFLAGS_TF),%r11
>>  	jnz opportunistic_sysret_failed
> 
> So I merged this as it's an obvious bugfix, but in hindsight I'm 
> really uneasy about the whole opportunistic SYSRET concept: it appears 
> that the chance that %rcx matches return-%rip is astronomical - this 
> is why this bug wasn't noticed live so far.
> 
> So should we really be doing this?

Andy does this not for the off-chance that userspace's RCX
is equal to return address and R11 == RFLAGS. The chances of that
are astronomically small.

This code path triggers when ptrace/audit/seccomp is active.
Instead of torturing ourselves trying to not divert into IRET return,
now code is steered that way. But then immediately before
actual IRET, we check again: "do we really need IRET?"
IOW "did ptrace really touch pt_regs->ss? ->flags? ->rip? ->rcx?"
which in vast majority of cases will not be true.

Since paranoiacs of Selinux managed to convince many distros
to run with Selinux enabled, many machines in fact run
with at least some on audit machinery *always active*.

-- 
vda


^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH urgent v2] x86, asm: Disable opportunistic SYSRET if regs->flags has TF set
  2015-04-02 10:07   ` Denys Vlasenko
@ 2015-04-02 10:37     ` Ingo Molnar
  2015-04-02 11:14       ` Brian Gerst
  0 siblings, 1 reply; 13+ messages in thread
From: Ingo Molnar @ 2015-04-02 10:37 UTC (permalink / raw)
  To: Denys Vlasenko
  Cc: Andy Lutomirski, x86, linux-kernel, Borislav Petkov,
	Linus Torvalds, Borislav Petkov, Brian Gerst


* Denys Vlasenko <dvlasenk@redhat.com> wrote:

> On 04/02/2015 11:07 AM, Ingo Molnar wrote:
> > 
> > * Andy Lutomirski <luto@kernel.org> wrote:
> > 
> >> When I wrote the opportunistic SYSRET code, I missed an important
> >> difference between SYSRET and IRET.  Both instructions are capable
> >> of setting EFLAGS.TF, but they behave differently when doing so.
> >> IRET will not issue a #DB trap after execution when it sets TF This
> >> is critical -- otherwise you'd never be able to make forward
> >> progress when returning to userspace.  SYSRET, on the other hand,
> >> will trap with #DB immediately after returning to CPL3, and the next
> >> instruction will never execute.
> >>
> >> This breaks anything that opportunistically SYSRETs to a user
> >> context with TF set.  For example, running this code with TF set and
> >> a SIGTRAP handler loaded never gets past post_nop.
> >>
> >> 	extern unsigned char post_nop[];
> >> 	asm volatile ("pushfq\n\t"
> >> 		      "popq %%r11\n\t"
> >> 		      "nop\n\t"
> >> 		      "post_nop:"
> >> 		      : : "c" (post_nop) : "r11");
> >>
> >> In my defense, I can't find this documented in the AMD or Intel
> >> manual.
> >>
> >> Fix it by using IRET to restore TF.
> >>
> >> Fixes: 2a23c6b8a9c4 x86_64, entry: Use sysret to return to userspace when possible
> >> Signed-off-by: Andy Lutomirski <luto@kernel.org>
> >> ---
> >>
> >> This affects 4.0-rc as well as -tip.  A full test case lives here:
> >>
> >> https://git.kernel.org/cgit/linux/kernel/git/luto/misc-tests.git/
> >>
> >> It's called single_step_syscall_64.
> >>
> >> On Intel systems, the 32-bit version of that test fails for unrelated
> >> reasons, but that's not a regression, and fixing it will be much more
> >> intrusive.
> >>
> >> Changes from v1:
> >>  - Remove mention of testl from changelog.
> >>  - Improve comment per Denys' suggestion.
> >>
> >> arch/x86/kernel/entry_64.S | 16 +++++++++++++++-
> >>  1 file changed, 15 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
> >> index 750c6efcb718..537716380959 100644
> >> --- a/arch/x86/kernel/entry_64.S
> >> +++ b/arch/x86/kernel/entry_64.S
> >> @@ -715,7 +715,21 @@ retint_swapgs:		/* return to user-space */
> >>  	cmpq %r11,EFLAGS(%rsp)		/* R11 == RFLAGS */
> >>  	jne opportunistic_sysret_failed
> >>  
> >> -	testq $X86_EFLAGS_RF,%r11	/* sysret can't restore RF */
> >> +	/*
> >> +	 * SYSRET can't restore RF.  SYSRET can restore TF, but unlike IRET,
> >> +	 * restoring TF results in a trap from userspace immediately after
> >> +	 * SYSRET.  This would cause an infinite loop whenever #DB happens
> >> +	 * with register state that satisfies the opportunistic SYSRET
> >> +	 * conditions.  For example, single-stepping this user code:
> >> +	 *
> >> +	 *           movq $stuck_here,%rcx
> >> +	 *           pushfq
> >> +	 *           popq %r11
> >> +	 *   stuck_here:
> >> +	 *
> >> +	 * would never get past stuck_here.
> >> +	 */
> >> +	testq $(X86_EFLAGS_RF|X86_EFLAGS_TF),%r11
> >>  	jnz opportunistic_sysret_failed
> > 
> > So I merged this as it's an obvious bugfix, but in hindsight I'm 
> > really uneasy about the whole opportunistic SYSRET concept: it appears 
> > that the chance that %rcx matches return-%rip is astronomical - this 
> > is why this bug wasn't noticed live so far.
> > 
> > So should we really be doing this?
> 
> Andy does this not for the off-chance that userspace's RCX is equal 
> to return address and R11 == RFLAGS. The chances of that are 
> astronomically small.
> 
> This code path triggers when ptrace/audit/seccomp is active. Instead 
> of torturing ourselves trying to not divert into IRET return, now 
> code is steered that way. But then immediately before actual IRET, 
> we check again: "do we really need IRET?" IOW "did ptrace really 
> touch pt_regs->ss? ->flags? ->rip? ->rcx?" which in vast majority of 
> cases will not be true.

I keep forgetting about that, my test systems have the audit muck 
turned off ;-)

Fair enough - and it's sensible to share the IRET path between 
interrupts and complex-return system calls, even though the check
is unnecessary overhead for the pure interrupt return path...

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH urgent v2] x86, asm: Disable opportunistic SYSRET if regs->flags has TF set
  2015-04-02 10:37     ` Ingo Molnar
@ 2015-04-02 11:14       ` Brian Gerst
  2015-04-02 12:24         ` Denys Vlasenko
  0 siblings, 1 reply; 13+ messages in thread
From: Brian Gerst @ 2015-04-02 11:14 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Denys Vlasenko, Andy Lutomirski, the arch/x86 maintainers,
	Linux Kernel Mailing List, Borislav Petkov, Linus Torvalds,
	Borislav Petkov

On Thu, Apr 2, 2015 at 6:37 AM, Ingo Molnar <mingo@kernel.org> wrote:
>
> * Denys Vlasenko <dvlasenk@redhat.com> wrote:
>
>> On 04/02/2015 11:07 AM, Ingo Molnar wrote:
>> >
>> > * Andy Lutomirski <luto@kernel.org> wrote:
>> >
>> >> When I wrote the opportunistic SYSRET code, I missed an important
>> >> difference between SYSRET and IRET.  Both instructions are capable
>> >> of setting EFLAGS.TF, but they behave differently when doing so.
>> >> IRET will not issue a #DB trap after execution when it sets TF This
>> >> is critical -- otherwise you'd never be able to make forward
>> >> progress when returning to userspace.  SYSRET, on the other hand,
>> >> will trap with #DB immediately after returning to CPL3, and the next
>> >> instruction will never execute.
>> >>
>> >> This breaks anything that opportunistically SYSRETs to a user
>> >> context with TF set.  For example, running this code with TF set and
>> >> a SIGTRAP handler loaded never gets past post_nop.
>> >>
>> >>    extern unsigned char post_nop[];
>> >>    asm volatile ("pushfq\n\t"
>> >>                  "popq %%r11\n\t"
>> >>                  "nop\n\t"
>> >>                  "post_nop:"
>> >>                  : : "c" (post_nop) : "r11");
>> >>
>> >> In my defense, I can't find this documented in the AMD or Intel
>> >> manual.
>> >>
>> >> Fix it by using IRET to restore TF.
>> >>
>> >> Fixes: 2a23c6b8a9c4 x86_64, entry: Use sysret to return to userspace when possible
>> >> Signed-off-by: Andy Lutomirski <luto@kernel.org>
>> >> ---
>> >>
>> >> This affects 4.0-rc as well as -tip.  A full test case lives here:
>> >>
>> >> https://git.kernel.org/cgit/linux/kernel/git/luto/misc-tests.git/
>> >>
>> >> It's called single_step_syscall_64.
>> >>
>> >> On Intel systems, the 32-bit version of that test fails for unrelated
>> >> reasons, but that's not a regression, and fixing it will be much more
>> >> intrusive.
>> >>
>> >> Changes from v1:
>> >>  - Remove mention of testl from changelog.
>> >>  - Improve comment per Denys' suggestion.
>> >>
>> >> arch/x86/kernel/entry_64.S | 16 +++++++++++++++-
>> >>  1 file changed, 15 insertions(+), 1 deletion(-)
>> >>
>> >> diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
>> >> index 750c6efcb718..537716380959 100644
>> >> --- a/arch/x86/kernel/entry_64.S
>> >> +++ b/arch/x86/kernel/entry_64.S
>> >> @@ -715,7 +715,21 @@ retint_swapgs:                /* return to user-space */
>> >>    cmpq %r11,EFLAGS(%rsp)          /* R11 == RFLAGS */
>> >>    jne opportunistic_sysret_failed
>> >>
>> >> -  testq $X86_EFLAGS_RF,%r11       /* sysret can't restore RF */
>> >> +  /*
>> >> +   * SYSRET can't restore RF.  SYSRET can restore TF, but unlike IRET,
>> >> +   * restoring TF results in a trap from userspace immediately after
>> >> +   * SYSRET.  This would cause an infinite loop whenever #DB happens
>> >> +   * with register state that satisfies the opportunistic SYSRET
>> >> +   * conditions.  For example, single-stepping this user code:
>> >> +   *
>> >> +   *           movq $stuck_here,%rcx
>> >> +   *           pushfq
>> >> +   *           popq %r11
>> >> +   *   stuck_here:
>> >> +   *
>> >> +   * would never get past stuck_here.
>> >> +   */
>> >> +  testq $(X86_EFLAGS_RF|X86_EFLAGS_TF),%r11
>> >>    jnz opportunistic_sysret_failed
>> >
>> > So I merged this as it's an obvious bugfix, but in hindsight I'm
>> > really uneasy about the whole opportunistic SYSRET concept: it appears
>> > that the chance that %rcx matches return-%rip is astronomical - this
>> > is why this bug wasn't noticed live so far.
>> >
>> > So should we really be doing this?
>>
>> Andy does this not for the off-chance that userspace's RCX is equal
>> to return address and R11 == RFLAGS. The chances of that are
>> astronomically small.
>>
>> This code path triggers when ptrace/audit/seccomp is active. Instead
>> of torturing ourselves trying to not divert into IRET return, now
>> code is steered that way. But then immediately before actual IRET,
>> we check again: "do we really need IRET?" IOW "did ptrace really
>> touch pt_regs->ss? ->flags? ->rip? ->rcx?" which in vast majority of
>> cases will not be true.
>
> I keep forgetting about that, my test systems have the audit muck
> turned off ;-)
>
> Fair enough - and it's sensible to share the IRET path between
> interrupts and complex-return system calls, even though the check
> is unnecessary overhead for the pure interrupt return path...


Maybe we could reintroduce TIF_IRET for this purpose instead of
(ab)using TIF_NOTIFY_RESUME.  Then we would only do the opportunistic
check for those cases (ptrace, audit, exec, sigreturn, etc.), and skip
it for interrupts.

--
Brian Gerst

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH urgent v2] x86, asm: Disable opportunistic SYSRET if regs->flags has TF set
  2015-04-02 11:14       ` Brian Gerst
@ 2015-04-02 12:24         ` Denys Vlasenko
  2015-04-02 12:31           ` Ingo Molnar
  0 siblings, 1 reply; 13+ messages in thread
From: Denys Vlasenko @ 2015-04-02 12:24 UTC (permalink / raw)
  To: Brian Gerst, Ingo Molnar
  Cc: Andy Lutomirski, the arch/x86 maintainers,
	Linux Kernel Mailing List, Borislav Petkov, Linus Torvalds,
	Borislav Petkov

On 04/02/2015 01:14 PM, Brian Gerst wrote:
>>>> So I merged this as it's an obvious bugfix, but in hindsight I'm
>>>> really uneasy about the whole opportunistic SYSRET concept: it appears
>>>> that the chance that %rcx matches return-%rip is astronomical - this
>>>> is why this bug wasn't noticed live so far.
>>>>
>>>> So should we really be doing this?
>>>
>>> Andy does this not for the off-chance that userspace's RCX is equal
>>> to return address and R11 == RFLAGS. The chances of that are
>>> astronomically small.
>>>
>>> This code path triggers when ptrace/audit/seccomp is active. Instead
>>> of torturing ourselves trying to not divert into IRET return, now
>>> code is steered that way. But then immediately before actual IRET,
>>> we check again: "do we really need IRET?" IOW "did ptrace really
>>> touch pt_regs->ss? ->flags? ->rip? ->rcx?" which in vast majority of
>>> cases will not be true.
>>
>> I keep forgetting about that, my test systems have the audit muck
>> turned off ;-)
>>
>> Fair enough - and it's sensible to share the IRET path between
>> interrupts and complex-return system calls, even though the check
>> is unnecessary overhead for the pure interrupt return path...
> 
> 
> Maybe we could reintroduce TIF_IRET for this purpose instead of
> (ab)using TIF_NOTIFY_RESUME.  Then we would only do the opportunistic
> check for those cases (ptrace, audit, exec, sigreturn, etc.), and skip
> it for interrupts.

The very first check in the existing code, pt_regs->cx == pt_regs->ip,
will fail for interrupt returns.

You hardly can save anything by placing a (ti->flags & TIF_TRY_SYSRET)
check in front of it, it's almost as expensive.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH urgent v2] x86, asm: Disable opportunistic SYSRET if regs->flags has TF set
  2015-04-02 12:24         ` Denys Vlasenko
@ 2015-04-02 12:31           ` Ingo Molnar
  2015-04-02 12:59             ` Denys Vlasenko
  2015-04-02 14:26             ` Andy Lutomirski
  0 siblings, 2 replies; 13+ messages in thread
From: Ingo Molnar @ 2015-04-02 12:31 UTC (permalink / raw)
  To: Denys Vlasenko
  Cc: Brian Gerst, Andy Lutomirski, the arch/x86 maintainers,
	Linux Kernel Mailing List, Borislav Petkov, Linus Torvalds,
	Borislav Petkov


* Denys Vlasenko <dvlasenk@redhat.com> wrote:

> On 04/02/2015 01:14 PM, Brian Gerst wrote:
> >>>> So I merged this as it's an obvious bugfix, but in hindsight I'm
> >>>> really uneasy about the whole opportunistic SYSRET concept: it appears
> >>>> that the chance that %rcx matches return-%rip is astronomical - this
> >>>> is why this bug wasn't noticed live so far.
> >>>>
> >>>> So should we really be doing this?
> >>>
> >>> Andy does this not for the off-chance that userspace's RCX is equal
> >>> to return address and R11 == RFLAGS. The chances of that are
> >>> astronomically small.
> >>>
> >>> This code path triggers when ptrace/audit/seccomp is active. Instead
> >>> of torturing ourselves trying to not divert into IRET return, now
> >>> code is steered that way. But then immediately before actual IRET,
> >>> we check again: "do we really need IRET?" IOW "did ptrace really
> >>> touch pt_regs->ss? ->flags? ->rip? ->rcx?" which in vast majority of
> >>> cases will not be true.
> >>
> >> I keep forgetting about that, my test systems have the audit muck
> >> turned off ;-)
> >>
> >> Fair enough - and it's sensible to share the IRET path between
> >> interrupts and complex-return system calls, even though the check
> >> is unnecessary overhead for the pure interrupt return path...
> > 
> > 
> > Maybe we could reintroduce TIF_IRET for this purpose instead of
> > (ab)using TIF_NOTIFY_RESUME.  Then we would only do the opportunistic
> > check for those cases (ptrace, audit, exec, sigreturn, etc.), and skip
> > it for interrupts.
> 
> The very first check in the existing code, pt_regs->cx == 
> pt_regs->ip, will fail for interrupt returns.
> 
> You hardly can save anything by placing a (ti->flags & 
> TIF_TRY_SYSRET) check in front of it, it's almost as expensive.

Well, what I was thinking of was to have a pure irq (well, async 
context) return path, not shared with the weird-syscall-IRET return 
path at all ...

It would be open coded, not obfuscated via macros.

That way AFAICS the upsides are:

  - it's easier to read (and maintain) what goes on in which case.
    '*intr*' labels would truly identify interrupt return related 
    processing, for a change!

  - we can optimize in a more directed fashion - like here

... while the downsides are:

  - more code
  - a (small) chance of a fix going to one path while not the other.

How much extra code would it be?

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 13+ messages in thread

* [tip:x86/urgent] x86/asm/entry/64: Disable opportunistic SYSRET if regs->flags has TF set
  2015-04-01 21:26 [PATCH urgent v2] x86, asm: Disable opportunistic SYSRET if regs->flags has TF set Andy Lutomirski
  2015-04-02  6:21 ` Borislav Petkov
  2015-04-02  9:07 ` Ingo Molnar
@ 2015-04-02 12:32 ` tip-bot for Andy Lutomirski
  2 siblings, 0 replies; 13+ messages in thread
From: tip-bot for Andy Lutomirski @ 2015-04-02 12:32 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: luto, mingo, bp, bp, linux-kernel, tglx, brgerst, hpa, torvalds,
	dvlasenk

Commit-ID:  7ea24169097d3d3a3eab2dcc5773bc43fd5593e7
Gitweb:     http://git.kernel.org/tip/7ea24169097d3d3a3eab2dcc5773bc43fd5593e7
Author:     Andy Lutomirski <luto@kernel.org>
AuthorDate: Wed, 1 Apr 2015 14:26:34 -0700
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Thu, 2 Apr 2015 11:09:54 +0200

x86/asm/entry/64: Disable opportunistic SYSRET if regs->flags has TF set

When I wrote the opportunistic SYSRET code, I missed an important difference
between SYSRET and IRET.

Both instructions are capable of setting EFLAGS.TF, but they behave differently
when doing so:

 - IRET will not issue a #DB trap after execution when it sets TF.
   This is critical -- otherwise you'd never be able to make forward progress when
   returning to userspace.

 - SYSRET, on the other hand, will trap with #DB immediately after
   returning to CPL3, and the next instruction will never execute.

This breaks anything that opportunistically SYSRETs to a user
context with TF set.  For example, running this code with TF set
and a SIGTRAP handler loaded never gets past 'post_nop':

	extern unsigned char post_nop[];
	asm volatile ("pushfq\n\t"
		      "popq %%r11\n\t"
		      "nop\n\t"
		      "post_nop:"
		      : : "c" (post_nop) : "r11");

In my defense, I can't find this documented in the AMD or Intel manual.

Fix it by using IRET to restore TF.

Signed-off-by: Andy Lutomirski <luto@kernel.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Borislav Petkov <bp@suse.de>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Denys Vlasenko <dvlasenk@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Fixes: 2a23c6b8a9c4 ("x86_64, entry: Use sysret to return to userspace when possible")
Link: http://lkml.kernel.org/r/9472f1ca4c19a38ecda45bba9c91b7168135fcfa.1427923514.git.luto@kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/x86/kernel/entry_64.S | 16 +++++++++++++++-
 1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index 2babb39..f0095a7 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -799,7 +799,21 @@ retint_swapgs:		/* return to user-space */
 	cmpq %r11,(EFLAGS-ARGOFFSET)(%rsp)	/* R11 == RFLAGS */
 	jne opportunistic_sysret_failed
 
-	testq $X86_EFLAGS_RF,%r11		/* sysret can't restore RF */
+	/*
+	 * SYSRET can't restore RF.  SYSRET can restore TF, but unlike IRET,
+	 * restoring TF results in a trap from userspace immediately after
+	 * SYSRET.  This would cause an infinite loop whenever #DB happens
+	 * with register state that satisfies the opportunistic SYSRET
+	 * conditions.  For example, single-stepping this user code:
+	 *
+	 *           movq $stuck_here,%rcx
+	 *           pushfq
+	 *           popq %r11
+	 *   stuck_here:
+	 *
+	 * would never get past 'stuck_here'.
+	 */
+	testq $(X86_EFLAGS_RF|X86_EFLAGS_TF), %r11
 	jnz opportunistic_sysret_failed
 
 	/* nothing to check for RSP */

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH urgent v2] x86, asm: Disable opportunistic SYSRET if regs->flags has TF set
  2015-04-02 12:31           ` Ingo Molnar
@ 2015-04-02 12:59             ` Denys Vlasenko
  2015-04-02 15:49               ` Denys Vlasenko
  2015-04-02 14:26             ` Andy Lutomirski
  1 sibling, 1 reply; 13+ messages in thread
From: Denys Vlasenko @ 2015-04-02 12:59 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Brian Gerst, Andy Lutomirski, the arch/x86 maintainers,
	Linux Kernel Mailing List, Borislav Petkov, Linus Torvalds,
	Borislav Petkov

On 04/02/2015 02:31 PM, Ingo Molnar wrote:
> 
> * Denys Vlasenko <dvlasenk@redhat.com> wrote:
> 
>> On 04/02/2015 01:14 PM, Brian Gerst wrote:
>>>>>> So I merged this as it's an obvious bugfix, but in hindsight I'm
>>>>>> really uneasy about the whole opportunistic SYSRET concept: it appears
>>>>>> that the chance that %rcx matches return-%rip is astronomical - this
>>>>>> is why this bug wasn't noticed live so far.
>>>>>>
>>>>>> So should we really be doing this?
>>>>>
>>>>> Andy does this not for the off-chance that userspace's RCX is equal
>>>>> to return address and R11 == RFLAGS. The chances of that are
>>>>> astronomically small.
>>>>>
>>>>> This code path triggers when ptrace/audit/seccomp is active. Instead
>>>>> of torturing ourselves trying to not divert into IRET return, now
>>>>> code is steered that way. But then immediately before actual IRET,
>>>>> we check again: "do we really need IRET?" IOW "did ptrace really
>>>>> touch pt_regs->ss? ->flags? ->rip? ->rcx?" which in vast majority of
>>>>> cases will not be true.
>>>>
>>>> I keep forgetting about that, my test systems have the audit muck
>>>> turned off ;-)
>>>>
>>>> Fair enough - and it's sensible to share the IRET path between
>>>> interrupts and complex-return system calls, even though the check
>>>> is unnecessary overhead for the pure interrupt return path...
>>>
>>>
>>> Maybe we could reintroduce TIF_IRET for this purpose instead of
>>> (ab)using TIF_NOTIFY_RESUME.  Then we would only do the opportunistic
>>> check for those cases (ptrace, audit, exec, sigreturn, etc.), and skip
>>> it for interrupts.
>>
>> The very first check in the existing code, pt_regs->cx == 
>> pt_regs->ip, will fail for interrupt returns.
>>
>> You hardly can save anything by placing a (ti->flags & 
>> TIF_TRY_SYSRET) check in front of it, it's almost as expensive.
> 
> Well, what I was thinking of was to have a pure irq (well, async 
> context) return path, not shared with the weird-syscall-IRET return 
> path at all ...
> 
> It would be open coded, not obfuscated via macros.
> 
> That way AFAICS the upsides are:
> 
>   - it's easier to read (and maintain) what goes on in which case.
>     '*intr*' labels would truly identify interrupt return related 
>     processing, for a change!

Re labels: I fully agree they need cleanup (mass rename).

Something along the lines of

int_ret_from_sys_call -> return_from_syscall
int_with_check   -> sysret_check_workmask_in_edi
int_careful      -> sysret_check_NEED_RESCHED
int_very_careful -> sysret_check_SYSCALL_EXIT
int_signal       -> sysret_check_DO_NOTIFY_MASK
int_restore_rest -> sysret_next_check

ret_from_intr    -> return_from_intr
retint_with_reschedule -> intr_check_WORK_MASK
retint_check     -> intr_check_workmask_in_edi
retint_careful   -> intr_check_NEED_RESCHED
retint_signal    -> intr_check_DO_NOTIFY_MASK

retint_swapgs    -> return_from_syscall_or_intr
irq_return_via_sysret -> return_via_sysret

retint_kernel    -> intr_check_preempt
restore_args     -> restore_c_regs
irq_return       -> return_via_iret


and then your proposal can be rephrased as "let's stop
merging sysret and intr code paths at retint_swapgs".

Makes sense. It would entail some code duplication,
but the code will be easier to maintain.

>   - we can optimize in a more directed fashion - like here
> 
> ... while the downsides are:
> 
>   - more code
>   - a (small) chance of a fix going to one path while not the other.
> 
> How much extra code would it be?

A screenful or two.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH urgent v2] x86, asm: Disable opportunistic SYSRET if regs->flags has TF set
  2015-04-02 12:31           ` Ingo Molnar
  2015-04-02 12:59             ` Denys Vlasenko
@ 2015-04-02 14:26             ` Andy Lutomirski
  1 sibling, 0 replies; 13+ messages in thread
From: Andy Lutomirski @ 2015-04-02 14:26 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Denys Vlasenko, Brian Gerst, Andy Lutomirski,
	the arch/x86 maintainers, Linux Kernel Mailing List,
	Borislav Petkov, Linus Torvalds, Borislav Petkov

On Thu, Apr 2, 2015 at 5:31 AM, Ingo Molnar <mingo@kernel.org> wrote:
>
> * Denys Vlasenko <dvlasenk@redhat.com> wrote:
>
>> On 04/02/2015 01:14 PM, Brian Gerst wrote:
>> >>>> So I merged this as it's an obvious bugfix, but in hindsight I'm
>> >>>> really uneasy about the whole opportunistic SYSRET concept: it appears
>> >>>> that the chance that %rcx matches return-%rip is astronomical - this
>> >>>> is why this bug wasn't noticed live so far.
>> >>>>
>> >>>> So should we really be doing this?
>> >>>
>> >>> Andy does this not for the off-chance that userspace's RCX is equal
>> >>> to return address and R11 == RFLAGS. The chances of that are
>> >>> astronomically small.
>> >>>
>> >>> This code path triggers when ptrace/audit/seccomp is active. Instead
>> >>> of torturing ourselves trying to not divert into IRET return, now
>> >>> code is steered that way. But then immediately before actual IRET,
>> >>> we check again: "do we really need IRET?" IOW "did ptrace really
>> >>> touch pt_regs->ss? ->flags? ->rip? ->rcx?" which in vast majority of
>> >>> cases will not be true.
>> >>
>> >> I keep forgetting about that, my test systems have the audit muck
>> >> turned off ;-)
>> >>
>> >> Fair enough - and it's sensible to share the IRET path between
>> >> interrupts and complex-return system calls, even though the check
>> >> is unnecessary overhead for the pure interrupt return path...
>> >
>> >
>> > Maybe we could reintroduce TIF_IRET for this purpose instead of
>> > (ab)using TIF_NOTIFY_RESUME.  Then we would only do the opportunistic
>> > check for those cases (ptrace, audit, exec, sigreturn, etc.), and skip
>> > it for interrupts.
>>
>> The very first check in the existing code, pt_regs->cx ==
>> pt_regs->ip, will fail for interrupt returns.
>>
>> You hardly can save anything by placing a (ti->flags &
>> TIF_TRY_SYSRET) check in front of it, it's almost as expensive.
>
> Well, what I was thinking of was to have a pure irq (well, async
> context) return path, not shared with the weird-syscall-IRET return
> path at all ...
>
> It would be open coded, not obfuscated via macros.
>
> That way AFAICS the upsides are:
>
>   - it's easier to read (and maintain) what goes on in which case.
>     '*intr*' labels would truly identify interrupt return related
>     processing, for a change!
>
>   - we can optimize in a more directed fashion - like here
>
> ... while the downsides are:
>
>   - more code
>   - a (small) chance of a fix going to one path while not the other.
>
> How much extra code would it be?

Negative if we did it right, perhaps.

I think the best approach is a complete rewrite, not an attempt to
incrementally improve it.  The current code is held together by gotos
and bailing wire, and I'm surprised it works at all.  Some of it seems
to work by accident AFAICT.  For example, the sysret audit "fast path"
that I deleted as part of the opportunistic sysret work was quite
buggy AFAICT, but no one who looks at this code cares about audit, and
it's nearly impossible to even tell what the code is supposed to do.

Linus tried to rewrite some of it last year, but it was incomplete.
Here's my vague inventory of what the exit paths need to do on return
to userspace:

 - syscall_trace_leave [1] (syscall only)

 - context tracking if TIF_NOHZ (all exits)

 - one-shot work.  These are things that must be done if the flags are
set, and doing it clears the flags.  These flags need to be checked
with IRQs off, and IRQs cannot be re-enabled afterwards.  This
includes signal delivery, scheduling, and user return notifiers.  (all
exits)

 - uprobes.  Maybe this counts as one-shot work.  (all exits, I assume)

 - Check for sysret applicability (only important for syscalls)

 - espfix, iret, unless we're using sysret (all exits)

There's also the special case of interrupt returns to kernel mode.
That should schedule in preemptable kernels.

To me, this suggests that we should have a total of four asm exit paths:

 - syscall return.  IMO this should call a single C function, check
the sysret conditions, and jump to the espfix code.

 - paranoid return to kernel.  This should just return after a
possible swapgs.  (We do this now.)

 - interrupt/exception return.  IMO this should call a single C
function and jump to the espfix code. [2]

 - NMI -- this is its own crazy thing.

All of the check/careful/very_careful crap can just go.  Good riddance.


[1] syscall_trace_leave contains context tracking hooks that are
AFAICT completely unnecessary.  That's almost okay, though -- other
similarly silly context tracking hooks fix up the mess.  This might
explain part of why context tracking is so incredibly slow.

[2] I don't even think we need the retint_careful vs retint_kernel
distinction in asm.  If we ditch that, we should be able to replace it
with something like:

call prepare_intr_exception_return
testl $ebx,$ebx
jz 1f
SWAPGS
1f:
INTERRUPT_RETURN

Of course, prepare_intr_exception_return would need to DTRT if we're
returning to kernel mode, but that's easy.

--Andy

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH urgent v2] x86, asm: Disable opportunistic SYSRET if regs->flags has TF set
  2015-04-02 12:59             ` Denys Vlasenko
@ 2015-04-02 15:49               ` Denys Vlasenko
  2015-04-02 16:08                 ` Ingo Molnar
  0 siblings, 1 reply; 13+ messages in thread
From: Denys Vlasenko @ 2015-04-02 15:49 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Brian Gerst, Andy Lutomirski, the arch/x86 maintainers,
	Linux Kernel Mailing List, Borislav Petkov, Linus Torvalds,
	Borislav Petkov

On 04/02/2015 02:59 PM, Denys Vlasenko wrote:
> On 04/02/2015 02:31 PM, Ingo Molnar wrote:
>>   - we can optimize in a more directed fashion - like here
>>
>> ... while the downsides are:
>>
>>   - more code
>>   - a (small) chance of a fix going to one path while not the other.
>>
>> How much extra code would it be?
> 
> A screenful or two.

I took a stab at it:

   text	   data	    bss	    dec	    hex	filename
  12530	      0	      0	  12530	   30f2	entry_64.o2
  12562	      0	      0	  12562	   3112	entry_64.o

The patch does two steps:

(1) copy-pastes "retint_swapgs:" code into syscall handling code,
the copy is under "syscall_return:" label.

(2) remove "opportunistic sysret" code from "retint_swapgs" code block,
since now it won't be reached by syscall return. This in fact removes
most of the code in question.

Lightly run-tested so far.

Ingo, do you want this in a proper patch form?
-- 
vda


diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
index 7bc097a..5ea2dd1 100644
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -354,8 +354,8 @@ GLOBAL(int_with_check)
 	movl TI_flags(%rcx),%edx
 	andl %edi,%edx
 	jnz   int_careful
-	andl    $~TS_COMPAT,TI_status(%rcx)
-	jmp   retint_swapgs
+	andl	$~TS_COMPAT,TI_status(%rcx)
+	jmp	syscall_return

 	/* Either reschedule or signal or syscall exit tracking needed. */
 	/* First do a reschedule test. */
@@ -399,9 +399,86 @@ int_restore_rest:
 	DISABLE_INTERRUPTS(CLBR_NONE)
 	TRACE_IRQS_OFF
 	jmp int_with_check
+
+syscall_return:
+	/* The iretq could re-enable interrupts: */
+	DISABLE_INTERRUPTS(CLBR_ANY)
+	TRACE_IRQS_IRETQ
+
+	/*
+	 * Try to use SYSRET instead of IRET if we're returning to
+	 * a completely clean 64-bit userspace context.
+	 */
+	movq RCX(%rsp),%rcx
+	cmpq %rcx,RIP(%rsp)		/* RCX == RIP */
+	jne opportunistic_sysret_failed
+
+	/*
+	 * On Intel CPUs, sysret with non-canonical RCX/RIP will #GP
+	 * in kernel space.  This essentially lets the user take over
+	 * the kernel, since userspace controls RSP.  It's not worth
+	 * testing for canonicalness exactly -- this check detects any
+	 * of the 17 high bits set, which is true for non-canonical
+	 * or kernel addresses.  (This will pessimize vsyscall=native.
+	 * Big deal.)
+	 *
+	 * If virtual addresses ever become wider, this will need
+	 * to be updated to remain correct on both old and new CPUs.
+	 */
+	.ifne __VIRTUAL_MASK_SHIFT - 47
+	.error "virtual address width changed -- sysret checks need update"
+	.endif
+	shr $__VIRTUAL_MASK_SHIFT, %rcx
+	jnz opportunistic_sysret_failed
+
+	cmpq $__USER_CS,CS(%rsp)	/* CS must match SYSRET */
+	jne opportunistic_sysret_failed
+
+	movq R11(%rsp),%r11
+	cmpq %r11,EFLAGS(%rsp)		/* R11 == RFLAGS */
+	jne opportunistic_sysret_failed
+
+	/*
+	 * SYSRET can't restore RF.  SYSRET can restore TF, but unlike IRET,
+	 * restoring TF results in a trap from userspace immediately after
+	 * SYSRET.  This would cause an infinite loop whenever #DB happens
+	 * with register state that satisfies the opportunistic SYSRET
+	 * conditions.  For example, single-stepping this user code:
+	 *
+	 *           movq $stuck_here,%rcx
+	 *           pushfq
+	 *           popq %r11
+	 *   stuck_here:
+	 *
+	 * would never get past 'stuck_here'.
+	 */
+	testq $(X86_EFLAGS_RF|X86_EFLAGS_TF), %r11
+	jnz opportunistic_sysret_failed
+
+	/* nothing to check for RSP */
+
+	cmpq $__USER_DS,SS(%rsp)	/* SS must match SYSRET */
+	jne opportunistic_sysret_failed
+
+	/*
+	 * We win!  This label is here just for ease of understanding
+	 * perf profiles.  Nothing jumps here.
+	 */
+syscall_return_via_sysret:
+	CFI_REMEMBER_STATE
+	/* r11 is already restored (see code above) */
+	RESTORE_C_REGS_EXCEPT_R11
+	movq RSP(%rsp),%rsp
+	USERGS_SYSRET64
+	CFI_RESTORE_STATE
+
+opportunistic_sysret_failed:
+	SWAPGS
+	jmp restore_args
 	CFI_ENDPROC
 END(system_call)

+
 	.macro FORK_LIKE func
 ENTRY(stub_\func)
 	CFI_STARTPROC
@@ -672,74 +749,6 @@ retint_swapgs:		/* return to user-space */
 	DISABLE_INTERRUPTS(CLBR_ANY)
 	TRACE_IRQS_IRETQ

-	/*
-	 * Try to use SYSRET instead of IRET if we're returning to
-	 * a completely clean 64-bit userspace context.
-	 */
-	movq RCX(%rsp),%rcx
-	cmpq %rcx,RIP(%rsp)		/* RCX == RIP */
-	jne opportunistic_sysret_failed
-
-	/*
-	 * On Intel CPUs, sysret with non-canonical RCX/RIP will #GP
-	 * in kernel space.  This essentially lets the user take over
-	 * the kernel, since userspace controls RSP.  It's not worth
-	 * testing for canonicalness exactly -- this check detects any
-	 * of the 17 high bits set, which is true for non-canonical
-	 * or kernel addresses.  (This will pessimize vsyscall=native.
-	 * Big deal.)
-	 *
-	 * If virtual addresses ever become wider, this will need
-	 * to be updated to remain correct on both old and new CPUs.
-	 */
-	.ifne __VIRTUAL_MASK_SHIFT - 47
-	.error "virtual address width changed -- sysret checks need update"
-	.endif
-	shr $__VIRTUAL_MASK_SHIFT, %rcx
-	jnz opportunistic_sysret_failed
-
-	cmpq $__USER_CS,CS(%rsp)	/* CS must match SYSRET */
-	jne opportunistic_sysret_failed
-
-	movq R11(%rsp),%r11
-	cmpq %r11,EFLAGS(%rsp)		/* R11 == RFLAGS */
-	jne opportunistic_sysret_failed
-
-	/*
-	 * SYSRET can't restore RF.  SYSRET can restore TF, but unlike IRET,
-	 * restoring TF results in a trap from userspace immediately after
-	 * SYSRET.  This would cause an infinite loop whenever #DB happens
-	 * with register state that satisfies the opportunistic SYSRET
-	 * conditions.  For example, single-stepping this user code:
-	 *
-	 *           movq $stuck_here,%rcx
-	 *           pushfq
-	 *           popq %r11
-	 *   stuck_here:
-	 *
-	 * would never get past 'stuck_here'.
-	 */
-	testq $(X86_EFLAGS_RF|X86_EFLAGS_TF), %r11
-	jnz opportunistic_sysret_failed
-
-	/* nothing to check for RSP */
-
-	cmpq $__USER_DS,SS(%rsp)	/* SS must match SYSRET */
-	jne opportunistic_sysret_failed
-
-	/*
-	 * We win!  This label is here just for ease of understanding
-	 * perf profiles.  Nothing jumps here.
-	 */
-irq_return_via_sysret:
-	CFI_REMEMBER_STATE
-	/* r11 is already restored (see code above) */
-	RESTORE_C_REGS_EXCEPT_R11
-	movq RSP(%rsp),%rsp
-	USERGS_SYSRET64
-	CFI_RESTORE_STATE
-
-opportunistic_sysret_failed:
 	SWAPGS
 	jmp restore_args


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH urgent v2] x86, asm: Disable opportunistic SYSRET if regs->flags has TF set
  2015-04-02 15:49               ` Denys Vlasenko
@ 2015-04-02 16:08                 ` Ingo Molnar
  0 siblings, 0 replies; 13+ messages in thread
From: Ingo Molnar @ 2015-04-02 16:08 UTC (permalink / raw)
  To: Denys Vlasenko
  Cc: Brian Gerst, Andy Lutomirski, the arch/x86 maintainers,
	Linux Kernel Mailing List, Borislav Petkov, Linus Torvalds,
	Borislav Petkov


* Denys Vlasenko <dvlasenk@redhat.com> wrote:

> On 04/02/2015 02:59 PM, Denys Vlasenko wrote:
> > On 04/02/2015 02:31 PM, Ingo Molnar wrote:
> >>   - we can optimize in a more directed fashion - like here
> >>
> >> ... while the downsides are:
> >>
> >>   - more code
> >>   - a (small) chance of a fix going to one path while not the other.
> >>
> >> How much extra code would it be?
> > 
> > A screenful or two.
> 
> I took a stab at it:
> 
>    text	   data	    bss	    dec	    hex	filename
>   12530	      0	      0	  12530	   30f2	entry_64.o2
>   12562	      0	      0	  12562	   3112	entry_64.o
> 
> The patch does two steps:
> 
> (1) copy-pastes "retint_swapgs:" code into syscall handling code,
> the copy is under "syscall_return:" label.
> 
> (2) remove "opportunistic sysret" code from "retint_swapgs" code block,
> since now it won't be reached by syscall return. This in fact removes
> most of the code in question.
> 
> Lightly run-tested so far.
> 
> Ingo, do you want this in a proper patch form?

Yeah, that looks good to me (only lightly reviewed).

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2015-04-02 16:08 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-04-01 21:26 [PATCH urgent v2] x86, asm: Disable opportunistic SYSRET if regs->flags has TF set Andy Lutomirski
2015-04-02  6:21 ` Borislav Petkov
2015-04-02  9:07 ` Ingo Molnar
2015-04-02 10:07   ` Denys Vlasenko
2015-04-02 10:37     ` Ingo Molnar
2015-04-02 11:14       ` Brian Gerst
2015-04-02 12:24         ` Denys Vlasenko
2015-04-02 12:31           ` Ingo Molnar
2015-04-02 12:59             ` Denys Vlasenko
2015-04-02 15:49               ` Denys Vlasenko
2015-04-02 16:08                 ` Ingo Molnar
2015-04-02 14:26             ` Andy Lutomirski
2015-04-02 12:32 ` [tip:x86/urgent] x86/asm/entry/64: " tip-bot for Andy Lutomirski

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.