linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] x86: entry_32.S: change ESPFIX test to not touch PT_OLDSS(%esp)
@ 2015-03-09 14:05 Denys Vlasenko
  2015-03-09 14:18 ` Andy Lutomirski
  2015-03-09 16:08 ` Linus Torvalds
  0 siblings, 2 replies; 24+ messages in thread
From: Denys Vlasenko @ 2015-03-09 14:05 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Denys Vlasenko, Linus Torvalds, Steven Rostedt, Ingo Molnar,
	Borislav Petkov, H. Peter Anvin, Oleg Nesterov,
	Frederic Weisbecker, Alexei Starovoitov, Will Drewry, Kees Cook,
	x86, linux-kernel

Old code was trying to avoid having three branch insns,
but instead it has a chain of six insns where each insn
depends on previos one.

And it was touching PT_OLDSS(%esp) unconditionally, even when it may
contain bogus data. Elsewhere we have to jump thru hoops
just to make sure here PT_OLDSS(%esp) is at least in a valid page.

All this just to have one branch instead of three?

The new code simply checks each condition.
All three checks can run in parallel on an out-of-order CPU.
Most of the time, none of branches will be taken.

Comparison of object code:
    Old:
     1e6:   8b 44 24 38             mov    0x38(%esp),%eax
     1ea:   8a 64 24 40             mov    0x40(%esp),%ah
     1ee:   8a 44 24 34             mov    0x34(%esp),%al
     1f2:   25 03 04 02 00          and    $0x20403,%eax
     1f7:   3d 03 04 00 00          cmp    $0x403,%eax
     1fc:   74 0f                   je     20d <ldt_ss>
    New:
     1e6:   0f ba 64 24 38 11       btl    $0x11,0x38(%esp)
     1ec:   72 0e                   jb     1fc <restore_nocheck>
     1ee:   f6 44 24 34 03          testb  $0x3,0x34(%esp)
     1f3:   74 07                   je     1fc <restore_nocheck>
     1f5:   f6 44 24 40 04          testb  $0x4,0x40(%esp)
     1fa:   75 0f                   jne    20b <ldt_ss>

Patch is run-tested.

Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
CC: Linus Torvalds <torvalds@linux-foundation.org>
CC: Steven Rostedt <rostedt@goodmis.org>
CC: Ingo Molnar <mingo@kernel.org>
CC: Borislav Petkov <bp@alien8.de>
CC: "H. Peter Anvin" <hpa@zytor.com>
CC: Andy Lutomirski <luto@amacapital.net>
CC: Oleg Nesterov <oleg@redhat.com>
CC: Frederic Weisbecker <fweisbec@gmail.com>
CC: Alexei Starovoitov <ast@plumgrid.com>
CC: Will Drewry <wad@chromium.org>
CC: Kees Cook <keescook@chromium.org>
CC: x86@kernel.org
CC: linux-kernel@vger.kernel.org
---
 arch/x86/kernel/entry_32.S | 17 ++++++++---------
 1 file changed, 8 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kernel/entry_32.S b/arch/x86/kernel/entry_32.S
index e33ba51..0a4996b 100644
--- a/arch/x86/kernel/entry_32.S
+++ b/arch/x86/kernel/entry_32.S
@@ -516,16 +516,15 @@ restore_all:
 	TRACE_IRQS_IRET
 restore_all_notrace:
 #ifdef CONFIG_X86_ESPFIX32
-	movl PT_EFLAGS(%esp), %eax	# mix EFLAGS, SS and CS
-	# Warning: PT_OLDSS(%esp) contains the wrong/random values if we
-	# are returning to the kernel.
-	# See comments in process.c:copy_thread() for details.
-	movb PT_OLDSS(%esp), %ah
-	movb PT_CS(%esp), %al
-	andl $(X86_EFLAGS_VM | (SEGMENT_TI_MASK << 8) | SEGMENT_RPL_MASK), %eax
-	cmpl $((SEGMENT_LDT << 8) | USER_RPL), %eax
 	CFI_REMEMBER_STATE
-	je ldt_ss			# returning to user-space with LDT SS
+	btl	$X86_EFLAGS_VM_BIT,PT_EFLAGS(%esp)
+	jc	restore_nocheck		# VM set, not it
+	testb	$3,PT_CS(%esp)
+	jz	restore_nocheck		# CPL0, not it
+	# Note: we access PT_OLDSS only when we know it exists.
+	# If PT_CS is from CPL0, it does not.
+	testb	$SEGMENT_TI_MASK,PT_OLDSS(%esp)
+	jnz	ldt_ss			# returning to user-space with LDT SS
 #endif
 restore_nocheck:
 	RESTORE_REGS 4			# skip orig_eax/error_code
-- 
1.8.1.4


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH] x86: entry_32.S: change ESPFIX test to not touch PT_OLDSS(%esp)
  2015-03-09 14:05 [PATCH] x86: entry_32.S: change ESPFIX test to not touch PT_OLDSS(%esp) Denys Vlasenko
@ 2015-03-09 14:18 ` Andy Lutomirski
  2015-03-09 15:00   ` Denys Vlasenko
  2015-03-09 16:08 ` Linus Torvalds
  1 sibling, 1 reply; 24+ messages in thread
From: Andy Lutomirski @ 2015-03-09 14:18 UTC (permalink / raw)
  To: Denys Vlasenko
  Cc: Linus Torvalds, Steven Rostedt, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, Oleg Nesterov, Frederic Weisbecker,
	Alexei Starovoitov, Will Drewry, Kees Cook, X86 ML, linux-kernel

On Mon, Mar 9, 2015 at 7:05 AM, Denys Vlasenko <dvlasenk@redhat.com> wrote:
> Old code was trying to avoid having three branch insns,
> but instead it has a chain of six insns where each insn
> depends on previos one.
>
> And it was touching PT_OLDSS(%esp) unconditionally, even when it may
> contain bogus data. Elsewhere we have to jump thru hoops
> just to make sure here PT_OLDSS(%esp) is at least in a valid page.
>
> All this just to have one branch instead of three?
>
> The new code simply checks each condition.
> All three checks can run in parallel on an out-of-order CPU.
> Most of the time, none of branches will be taken.
>
> Comparison of object code:
>     Old:
>      1e6:   8b 44 24 38             mov    0x38(%esp),%eax
>      1ea:   8a 64 24 40             mov    0x40(%esp),%ah
>      1ee:   8a 44 24 34             mov    0x34(%esp),%al
>      1f2:   25 03 04 02 00          and    $0x20403,%eax
>      1f7:   3d 03 04 00 00          cmp    $0x403,%eax
>      1fc:   74 0f                   je     20d <ldt_ss>
>     New:
>      1e6:   0f ba 64 24 38 11       btl    $0x11,0x38(%esp)
>      1ec:   72 0e                   jb     1fc <restore_nocheck>
>      1ee:   f6 44 24 34 03          testb  $0x3,0x34(%esp)
>      1f3:   74 07                   je     1fc <restore_nocheck>
>      1f5:   f6 44 24 40 04          testb  $0x4,0x40(%esp)
>      1fa:   75 0f                   jne    20b <ldt_ss>
>
> Patch is run-tested.
>
> Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
> CC: Linus Torvalds <torvalds@linux-foundation.org>
> CC: Steven Rostedt <rostedt@goodmis.org>
> CC: Ingo Molnar <mingo@kernel.org>
> CC: Borislav Petkov <bp@alien8.de>
> CC: "H. Peter Anvin" <hpa@zytor.com>
> CC: Andy Lutomirski <luto@amacapital.net>
> CC: Oleg Nesterov <oleg@redhat.com>
> CC: Frederic Weisbecker <fweisbec@gmail.com>
> CC: Alexei Starovoitov <ast@plumgrid.com>
> CC: Will Drewry <wad@chromium.org>
> CC: Kees Cook <keescook@chromium.org>
> CC: x86@kernel.org
> CC: linux-kernel@vger.kernel.org
> ---
>  arch/x86/kernel/entry_32.S | 17 ++++++++---------
>  1 file changed, 8 insertions(+), 9 deletions(-)
>
> diff --git a/arch/x86/kernel/entry_32.S b/arch/x86/kernel/entry_32.S
> index e33ba51..0a4996b 100644
> --- a/arch/x86/kernel/entry_32.S
> +++ b/arch/x86/kernel/entry_32.S
> @@ -516,16 +516,15 @@ restore_all:
>         TRACE_IRQS_IRET
>  restore_all_notrace:
>  #ifdef CONFIG_X86_ESPFIX32
> -       movl PT_EFLAGS(%esp), %eax      # mix EFLAGS, SS and CS
> -       # Warning: PT_OLDSS(%esp) contains the wrong/random values if we
> -       # are returning to the kernel.
> -       # See comments in process.c:copy_thread() for details.
> -       movb PT_OLDSS(%esp), %ah
> -       movb PT_CS(%esp), %al
> -       andl $(X86_EFLAGS_VM | (SEGMENT_TI_MASK << 8) | SEGMENT_RPL_MASK), %eax
> -       cmpl $((SEGMENT_LDT << 8) | USER_RPL), %eax
>         CFI_REMEMBER_STATE
> -       je ldt_ss                       # returning to user-space with LDT SS
> +       btl     $X86_EFLAGS_VM_BIT,PT_EFLAGS(%esp)
> +       jc      restore_nocheck         # VM set, not it

This seems useless.  In vm86 mode, espfix should work fine (even if
pointlessly), CS won't have the two low bits set, and SS won't
reference the LDT because it's not a selector at all.

That being said, what ends up in the high bits of esp when we iret to
vm86 mode?  Do we actually need espfix on all returns to vm86 mode?

Your patch passes my sigreturn test, so it at least results in
functional espvix32 in the non-vm86 case.

--Andy

> +       testb   $3,PT_CS(%esp)
> +       jz      restore_nocheck         # CPL0, not it
> +       # Note: we access PT_OLDSS only when we know it exists.
> +       # If PT_CS is from CPL0, it does not.
> +       testb   $SEGMENT_TI_MASK,PT_OLDSS(%esp)
> +       jnz     ldt_ss                  # returning to user-space with LDT SS
>  #endif
>  restore_nocheck:
>         RESTORE_REGS 4                  # skip orig_eax/error_code
> --
> 1.8.1.4
>



-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] x86: entry_32.S: change ESPFIX test to not touch PT_OLDSS(%esp)
  2015-03-09 14:18 ` Andy Lutomirski
@ 2015-03-09 15:00   ` Denys Vlasenko
  2015-03-09 15:09     ` Andy Lutomirski
  2015-03-09 15:13     ` Ingo Molnar
  0 siblings, 2 replies; 24+ messages in thread
From: Denys Vlasenko @ 2015-03-09 15:00 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Denys Vlasenko, Linus Torvalds, Steven Rostedt, Ingo Molnar,
	Borislav Petkov, H. Peter Anvin, Oleg Nesterov,
	Frederic Weisbecker, Alexei Starovoitov, Will Drewry, Kees Cook,
	X86 ML, linux-kernel

On Mon, Mar 9, 2015 at 3:18 PM, Andy Lutomirski <luto@amacapital.net> wrote:
> Do we actually need espfix on all returns to vm86 mode?

No, the current code (and my new version) does *not* do
espfix for vm86. It's not needed (apparently).

>> +       btl     $X86_EFLAGS_VM_BIT,PT_EFLAGS(%esp)
>> +       jc      restore_nocheck         # VM set, not it
>
> This seems useless.  In vm86 mode, espfix should work fine (even if
> pointlessly), CS won't have the two low bits set, and SS won't
> reference the LDT because it's not a selector at all.

You seem to suggest we can drop VM flag test.

If we do that, the tests for CS and SS will work on bogus data.
I.e. they will semi-randomly rouse execution through espfix.

Which will probably work correctly, but IIRC espfix does crazy stuff
which is likely to be slow.

What we definitely should do here is at least frame this check with
"#ifdef CONFIG_VM86".

> That being said, what ends up in the high bits of esp when we iret to
> vm86 mode?

I don't know. I guess it's time to write an actual vm86 testcase :)

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] x86: entry_32.S: change ESPFIX test to not touch PT_OLDSS(%esp)
  2015-03-09 15:00   ` Denys Vlasenko
@ 2015-03-09 15:09     ` Andy Lutomirski
  2015-03-09 19:31       ` Denys Vlasenko
  2015-03-09 15:13     ` Ingo Molnar
  1 sibling, 1 reply; 24+ messages in thread
From: Andy Lutomirski @ 2015-03-09 15:09 UTC (permalink / raw)
  To: Denys Vlasenko
  Cc: Denys Vlasenko, Linus Torvalds, Steven Rostedt, Ingo Molnar,
	Borislav Petkov, H. Peter Anvin, Oleg Nesterov,
	Frederic Weisbecker, Alexei Starovoitov, Will Drewry, Kees Cook,
	X86 ML, linux-kernel

On Mon, Mar 9, 2015 at 8:00 AM, Denys Vlasenko <vda.linux@googlemail.com> wrote:
> On Mon, Mar 9, 2015 at 3:18 PM, Andy Lutomirski <luto@amacapital.net> wrote:
>> Do we actually need espfix on all returns to vm86 mode?
>
> No, the current code (and my new version) does *not* do
> espfix for vm86. It's not needed (apparently).
>
>>> +       btl     $X86_EFLAGS_VM_BIT,PT_EFLAGS(%esp)
>>> +       jc      restore_nocheck         # VM set, not it
>>
>> This seems useless.  In vm86 mode, espfix should work fine (even if
>> pointlessly), CS won't have the two low bits set, and SS won't
>> reference the LDT because it's not a selector at all.
>
> You seem to suggest we can drop VM flag test.
>
> If we do that, the tests for CS and SS will work on bogus data.
> I.e. they will semi-randomly rouse execution through espfix.
>

Mmm, right.  My bad, that test is needed.

> Which will probably work correctly, but IIRC espfix does crazy stuff
> which is likely to be slow.
>
> What we definitely should do here is at least frame this check with
> "#ifdef CONFIG_VM86".
>
>> That being said, what ends up in the high bits of esp when we iret to
>> vm86 mode?
>
> I don't know. I guess it's time to write an actual vm86 testcase :)

Ick.  I can try...

Anyway, you've convinced me that your patch is good.  I queued it up.

-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] x86: entry_32.S: change ESPFIX test to not touch PT_OLDSS(%esp)
  2015-03-09 15:00   ` Denys Vlasenko
  2015-03-09 15:09     ` Andy Lutomirski
@ 2015-03-09 15:13     ` Ingo Molnar
  2015-03-09 15:18       ` Andy Lutomirski
  2015-03-09 15:47       ` Steven Rostedt
  1 sibling, 2 replies; 24+ messages in thread
From: Ingo Molnar @ 2015-03-09 15:13 UTC (permalink / raw)
  To: Denys Vlasenko
  Cc: Andy Lutomirski, Denys Vlasenko, Linus Torvalds, Steven Rostedt,
	Borislav Petkov, H. Peter Anvin, Oleg Nesterov,
	Frederic Weisbecker, Alexei Starovoitov, Will Drewry, Kees Cook,
	X86 ML, linux-kernel


* Denys Vlasenko <vda.linux@googlemail.com> wrote:

> > That being said, what ends up in the high bits of esp when we iret 
> > to vm86 mode?
> 
> I don't know. I guess it's time to write an actual vm86 testcase :)

Btw., could you please stick it and your other testcases into 
tools/x86/tests/ or so? Having such a testsuite would ease maintenance 
considerably.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] x86: entry_32.S: change ESPFIX test to not touch PT_OLDSS(%esp)
  2015-03-09 15:13     ` Ingo Molnar
@ 2015-03-09 15:18       ` Andy Lutomirski
  2015-03-09 15:47       ` Steven Rostedt
  1 sibling, 0 replies; 24+ messages in thread
From: Andy Lutomirski @ 2015-03-09 15:18 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Denys Vlasenko, Denys Vlasenko, Linus Torvalds, Steven Rostedt,
	Borislav Petkov, H. Peter Anvin, Oleg Nesterov,
	Frederic Weisbecker, Alexei Starovoitov, Will Drewry, Kees Cook,
	X86 ML, linux-kernel

On Mon, Mar 9, 2015 at 8:13 AM, Ingo Molnar <mingo@kernel.org> wrote:
>
> * Denys Vlasenko <vda.linux@googlemail.com> wrote:
>
>> > That being said, what ends up in the high bits of esp when we iret
>> > to vm86 mode?
>>
>> I don't know. I guess it's time to write an actual vm86 testcase :)
>
> Btw., could you please stick it and your other testcases into
> tools/x86/tests/ or so? Having such a testsuite would ease maintenance
> considerably.
>

My excuse used to be that the test actively exploited really nasty
security issues.  Then it was that I was waiting for the
infrastructure to stabilize a bit.  I'll go see how it's looking now.

--Andy

> Thanks,
>
>         Ingo



-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] x86: entry_32.S: change ESPFIX test to not touch PT_OLDSS(%esp)
  2015-03-09 15:13     ` Ingo Molnar
  2015-03-09 15:18       ` Andy Lutomirski
@ 2015-03-09 15:47       ` Steven Rostedt
  2015-03-09 15:54         ` Ingo Molnar
  1 sibling, 1 reply; 24+ messages in thread
From: Steven Rostedt @ 2015-03-09 15:47 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Denys Vlasenko, Andy Lutomirski, Denys Vlasenko, Linus Torvalds,
	Borislav Petkov, H. Peter Anvin, Oleg Nesterov,
	Frederic Weisbecker, Alexei Starovoitov, Will Drewry, Kees Cook,
	X86 ML, linux-kernel, Shuah Khan

On Mon, 9 Mar 2015 16:13:55 +0100
Ingo Molnar <mingo@kernel.org> wrote:


> Btw., could you please stick it and your other testcases into 
> tools/x86/tests/ or so? Having such a testsuite would ease maintenance 
> considerably.

Should this be part of tools/testing/selftests?

-- Steve

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] x86: entry_32.S: change ESPFIX test to not touch PT_OLDSS(%esp)
  2015-03-09 15:47       ` Steven Rostedt
@ 2015-03-09 15:54         ` Ingo Molnar
  0 siblings, 0 replies; 24+ messages in thread
From: Ingo Molnar @ 2015-03-09 15:54 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Denys Vlasenko, Andy Lutomirski, Denys Vlasenko, Linus Torvalds,
	Borislav Petkov, H. Peter Anvin, Oleg Nesterov,
	Frederic Weisbecker, Alexei Starovoitov, Will Drewry, Kees Cook,
	X86 ML, linux-kernel, Shuah Khan


* Steven Rostedt <rostedt@goodmis.org> wrote:

> On Mon, 9 Mar 2015 16:13:55 +0100
> Ingo Molnar <mingo@kernel.org> wrote:
> 
> > Btw., could you please stick it and your other testcases into 
> > tools/x86/tests/ or so? Having such a testsuite would ease 
> > maintenance considerably.
> 
> Should this be part of tools/testing/selftests?

Yeah.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] x86: entry_32.S: change ESPFIX test to not touch PT_OLDSS(%esp)
  2015-03-09 14:05 [PATCH] x86: entry_32.S: change ESPFIX test to not touch PT_OLDSS(%esp) Denys Vlasenko
  2015-03-09 14:18 ` Andy Lutomirski
@ 2015-03-09 16:08 ` Linus Torvalds
  2015-03-09 16:28   ` Denys Vlasenko
  2015-03-09 17:42   ` H. Peter Anvin
  1 sibling, 2 replies; 24+ messages in thread
From: Linus Torvalds @ 2015-03-09 16:08 UTC (permalink / raw)
  To: Denys Vlasenko
  Cc: Andy Lutomirski, Steven Rostedt, Ingo Molnar, Borislav Petkov,
	H. Peter Anvin, Oleg Nesterov, Frederic Weisbecker,
	Alexei Starovoitov, Will Drewry, Kees Cook,
	the arch/x86 maintainers, Linux Kernel Mailing List

On Mon, Mar 9, 2015 at 7:05 AM, Denys Vlasenko <dvlasenk@redhat.com> wrote:
>     New:
>      1e6:   0f ba 64 24 38 11       btl    $0x11,0x38(%esp)

btl? Really?

Why isn't that just

    testb $2,0x3a(%esp)

which is both smaller and quite a bit faster on older machines.

Sure, the btl is easier to explain in the source code, but instead of this:

> +       btl     $X86_EFLAGS_VM_BIT,PT_EFLAGS(%esp)

you'd have to add a comment, like

    testb $2, PT_EFLAGS+2(%esp)  # X86_EFLAGS_VM_BIT

or something.

Or just at least *partially* do what we used to do, and make it all be

    movb  PT_EFLAGS+2(%esp),%al
    andb $2,%al
    orb PT_CS(%esp),%al
    testb $3,%al
    je restore_nocheck
    testb $SEGMENT_TI_MASK,PT_OLDSS(%esp)
    jne ldt_ss

which still avoids looking at SS unless needed, and is smaller and
faster than the btl, afaik.

                       Linus

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] x86: entry_32.S: change ESPFIX test to not touch PT_OLDSS(%esp)
  2015-03-09 16:08 ` Linus Torvalds
@ 2015-03-09 16:28   ` Denys Vlasenko
  2015-03-09 16:44     ` Linus Torvalds
  2015-03-09 17:42   ` H. Peter Anvin
  1 sibling, 1 reply; 24+ messages in thread
From: Denys Vlasenko @ 2015-03-09 16:28 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Denys Vlasenko, Andy Lutomirski, Steven Rostedt, Ingo Molnar,
	Borislav Petkov, H. Peter Anvin, Oleg Nesterov,
	Frederic Weisbecker, Alexei Starovoitov, Will Drewry, Kees Cook,
	the arch/x86 maintainers, Linux Kernel Mailing List

On Mon, Mar 9, 2015 at 5:08 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Mon, Mar 9, 2015 at 7:05 AM, Denys Vlasenko <dvlasenk@redhat.com> wrote:
>>     New:
>>      1e6:   0f ba 64 24 38 11       btl    $0x11,0x38(%esp)
>
> btl? Really?
>
> Why isn't that just
>
>     testb $2,0x3a(%esp)
>
> which is both smaller and quite a bit faster on older machines.

Ok. (I thought people won't be happy about this.)

> Or just at least *partially* do what we used to do, and make it all be
>
>     movb  PT_EFLAGS+2(%esp),%al
>     andb $2,%al
>     orb PT_CS(%esp),%al
>     testb $3,%al
>     je restore_nocheck

These insns must run serially. 4 cycles.

        test
        branch
        test
        branch

can execute both test/branch'es out-of-order in parallel.
2 cycles.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] x86: entry_32.S: change ESPFIX test to not touch PT_OLDSS(%esp)
  2015-03-09 16:28   ` Denys Vlasenko
@ 2015-03-09 16:44     ` Linus Torvalds
  2015-03-09 17:44       ` H. Peter Anvin
  0 siblings, 1 reply; 24+ messages in thread
From: Linus Torvalds @ 2015-03-09 16:44 UTC (permalink / raw)
  To: Denys Vlasenko
  Cc: Denys Vlasenko, Andy Lutomirski, Steven Rostedt, Ingo Molnar,
	Borislav Petkov, H. Peter Anvin, Oleg Nesterov,
	Frederic Weisbecker, Alexei Starovoitov, Will Drewry, Kees Cook,
	the arch/x86 maintainers, Linux Kernel Mailing List

On Mon, Mar 9, 2015 at 9:28 AM, Denys Vlasenko <vda.linux@googlemail.com> wrote:
>
> can execute both test/branch'es out-of-order in parallel.

Assuming it predicts perfectly, yes, and the fallthrough is the default case.

Which is *probably* true her, at least often. And at least for the VM
bit. So it's likely good to have three branches. The unpredictable one
is likely the CS low bit test, which with interrupts in the idle
routine will possibly get a lot of noise from kernel returns too.

The *old* code likely predicted perfectly (because with the cmp we
would care about the LDT SS bit only if the other bits were set, which
is correct).

And remember: those zero-cost out-of-order branches turn quite
expensive if they *ever* mispredict. Even a 5% mispredict rate is
likely to mean "it's better to have a data dependency chain".

So it could easily go either way. I'm not convinced the old code is bad at all.

                             Linus

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] x86: entry_32.S: change ESPFIX test to not touch PT_OLDSS(%esp)
  2015-03-09 16:08 ` Linus Torvalds
  2015-03-09 16:28   ` Denys Vlasenko
@ 2015-03-09 17:42   ` H. Peter Anvin
  2015-03-09 17:45     ` Andy Lutomirski
  1 sibling, 1 reply; 24+ messages in thread
From: H. Peter Anvin @ 2015-03-09 17:42 UTC (permalink / raw)
  To: Linus Torvalds, Denys Vlasenko
  Cc: Andy Lutomirski, Steven Rostedt, Ingo Molnar, Borislav Petkov,
	Oleg Nesterov, Frederic Weisbecker, Alexei Starovoitov,
	Will Drewry, Kees Cook, the arch/x86 maintainers,
	Linux Kernel Mailing List

On 03/09/2015 09:08 AM, Linus Torvalds wrote:
> 
> Sure, the btl is easier to explain in the source code, but instead of this:
> 
>> +       btl     $X86_EFLAGS_VM_BIT,PT_EFLAGS(%esp)
> 
> you'd have to add a comment, like
> 
>     testb $2, PT_EFLAGS+2(%esp)  # X86_EFLAGS_VM_BIT
> 
> or something.
> 

Maybe:

	testb $(X86_EFLAGS_VM-16), PT_EFLAGS+2(%esp)

> Or just at least *partially* do what we used to do, and make it all be
> 
>     movb  PT_EFLAGS+2(%esp),%al
>     andb $2,%al
>     orb PT_CS(%esp),%al
>     testb $3,%al
>     je restore_nocheck
>     testb $SEGMENT_TI_MASK,PT_OLDSS(%esp)
>     jne ldt_ss
> 
> which still avoids looking at SS unless needed, and is smaller and
> faster than the btl, afaik.

The question is if avoiding looking at a field on the stack matters at all.

	-hpa



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] x86: entry_32.S: change ESPFIX test to not touch PT_OLDSS(%esp)
  2015-03-09 16:44     ` Linus Torvalds
@ 2015-03-09 17:44       ` H. Peter Anvin
  2015-03-09 19:13         ` Andy Lutomirski
  0 siblings, 1 reply; 24+ messages in thread
From: H. Peter Anvin @ 2015-03-09 17:44 UTC (permalink / raw)
  To: Linus Torvalds, Denys Vlasenko
  Cc: Denys Vlasenko, Andy Lutomirski, Steven Rostedt, Ingo Molnar,
	Borislav Petkov, Oleg Nesterov, Frederic Weisbecker,
	Alexei Starovoitov, Will Drewry, Kees Cook,
	the arch/x86 maintainers, Linux Kernel Mailing List

On 03/09/2015 09:44 AM, Linus Torvalds wrote:
> 
> And remember: those zero-cost out-of-order branches turn quite
> expensive if they *ever* mispredict. Even a 5% mispredict rate is
> likely to mean "it's better to have a data dependency chain".
> 
> So it could easily go either way. I'm not convinced the old code is bad at all.
> 

I'm inclined to side with Linus here.  I'm hesitant to change this based
on pure speculation.

To answer Andy's question: I do believe we need espfix for V86 mode as well.

	-hpa



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] x86: entry_32.S: change ESPFIX test to not touch PT_OLDSS(%esp)
  2015-03-09 17:42   ` H. Peter Anvin
@ 2015-03-09 17:45     ` Andy Lutomirski
  2015-03-09 17:59       ` Linus Torvalds
  0 siblings, 1 reply; 24+ messages in thread
From: Andy Lutomirski @ 2015-03-09 17:45 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Linus Torvalds, Denys Vlasenko, Steven Rostedt, Ingo Molnar,
	Borislav Petkov, Oleg Nesterov, Frederic Weisbecker,
	Alexei Starovoitov, Will Drewry, Kees Cook,
	the arch/x86 maintainers, Linux Kernel Mailing List

On Mon, Mar 9, 2015 at 10:42 AM, H. Peter Anvin <hpa@zytor.com> wrote:
> On 03/09/2015 09:08 AM, Linus Torvalds wrote:
>>
>> Sure, the btl is easier to explain in the source code, but instead of this:
>>
>>> +       btl     $X86_EFLAGS_VM_BIT,PT_EFLAGS(%esp)
>>
>> you'd have to add a comment, like
>>
>>     testb $2, PT_EFLAGS+2(%esp)  # X86_EFLAGS_VM_BIT
>>
>> or something.
>>
>
> Maybe:
>
>         testb $(X86_EFLAGS_VM-16), PT_EFLAGS+2(%esp)
>
>> Or just at least *partially* do what we used to do, and make it all be
>>
>>     movb  PT_EFLAGS+2(%esp),%al
>>     andb $2,%al
>>     orb PT_CS(%esp),%al
>>     testb $3,%al
>>     je restore_nocheck
>>     testb $SEGMENT_TI_MASK,PT_OLDSS(%esp)
>>     jne ldt_ss
>>
>> which still avoids looking at SS unless needed, and is smaller and
>> faster than the btl, afaik.
>
> The question is if avoiding looking at a field on the stack matters at all.

It does for silly reasons.

If sp0 is set to the very top of the stack, then an NMI immediately
after sysenter will have OLDSS off the top of the stack, and reading
it can crash.  This is why 32-bit kernels have a (buggy!) 8 byte
offset in sp0.

An alternative would be to fix the bug, but I still think it's ugly.

--Andy

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] x86: entry_32.S: change ESPFIX test to not touch PT_OLDSS(%esp)
  2015-03-09 17:45     ` Andy Lutomirski
@ 2015-03-09 17:59       ` Linus Torvalds
  2015-03-09 18:04         ` Andy Lutomirski
  0 siblings, 1 reply; 24+ messages in thread
From: Linus Torvalds @ 2015-03-09 17:59 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: H. Peter Anvin, Denys Vlasenko, Steven Rostedt, Ingo Molnar,
	Borislav Petkov, Oleg Nesterov, Frederic Weisbecker,
	Alexei Starovoitov, Will Drewry, Kees Cook,
	the arch/x86 maintainers, Linux Kernel Mailing List

On Mon, Mar 9, 2015 at 10:45 AM, Andy Lutomirski <luto@amacapital.net> wrote:
>
> If sp0 is set to the very top of the stack, then an NMI immediately
> after sysenter will have OLDSS off the top of the stack, and reading
> it can crash.  This is why 32-bit kernels have a (buggy!) 8 byte
> offset in sp0.

So I think that for sysenter, we *should* have that 8-byte buffer.

Not in general for sp0, but for MSR_IA32_SYSENTER_ESP (which is sp1, afaik).

Just make the rule be that you can never ever have a kernel stack
frame that doesn't contain room for ss/sp at the top.

We have various code that looks at and touches "pt_regs" anyway, and
accesses things out for debugging/oopsing/tracing etc. Let's not make
the rule be that you cannot look at regs->ss without checking various
random other fields first.

                         Linus

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] x86: entry_32.S: change ESPFIX test to not touch PT_OLDSS(%esp)
  2015-03-09 17:59       ` Linus Torvalds
@ 2015-03-09 18:04         ` Andy Lutomirski
  2015-03-09 18:16           ` Linus Torvalds
  0 siblings, 1 reply; 24+ messages in thread
From: Andy Lutomirski @ 2015-03-09 18:04 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: H. Peter Anvin, Denys Vlasenko, Steven Rostedt, Ingo Molnar,
	Borislav Petkov, Oleg Nesterov, Frederic Weisbecker,
	Alexei Starovoitov, Will Drewry, Kees Cook,
	the arch/x86 maintainers, Linux Kernel Mailing List

On Mon, Mar 9, 2015 at 10:59 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Mon, Mar 9, 2015 at 10:45 AM, Andy Lutomirski <luto@amacapital.net> wrote:
>>
>> If sp0 is set to the very top of the stack, then an NMI immediately
>> after sysenter will have OLDSS off the top of the stack, and reading
>> it can crash.  This is why 32-bit kernels have a (buggy!) 8 byte
>> offset in sp0.
>
> So I think that for sysenter, we *should* have that 8-byte buffer.
>
> Not in general for sp0, but for MSR_IA32_SYSENTER_ESP (which is sp1, afaik).

Bah.  I'm slightly wrong here.  It's not NMI immediately after
sysenter (that's handled by the emergency stack gunk, although we
might need to fix that too).  It's this:

ENTRY(ia32_sysenter_target)
    CFI_STARTPROC simple
    CFI_SIGNAL_FRAME
    CFI_DEF_CFA esp, 0
    CFI_REGISTER esp, ebp
    movl TSS_sysenter_sp0(%esp),%esp
sysenter_past_esp:
    /*
     * Interrupts are disabled here, but we can't trace it until
     * enough kernel state to call TRACE_IRQS_OFF can be called - but
     * we immediately enable interrupts at that point anyway.
     */

--> NMI here

    pushl_cfi $__USER_DS

--> or here

    /*CFI_REL_OFFSET ss, 0*/
    pushl_cfi %ebp

We could fix that locally by loading sp0 - 8 atomically (w/ percpu
help) and populating the top two words with mov instead of push.

One option would be to change the NMI entry code to move itself down 8
bytes if this happens (came from kernel mode or sp == sp0 - 12,
perhaps).  Or we could suck it up and keep that 8 byte offset (and fix
the cpu_tss initializer bug that threw me off in the first place).

--Andy

>
> Just make the rule be that you can never ever have a kernel stack
> frame that doesn't contain room for ss/sp at the top.
>
> We have various code that looks at and touches "pt_regs" anyway, and
> accesses things out for debugging/oopsing/tracing etc. Let's not make
> the rule be that you cannot look at regs->ss without checking various
> random other fields first.
>
>                          Linus



-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] x86: entry_32.S: change ESPFIX test to not touch PT_OLDSS(%esp)
  2015-03-09 18:04         ` Andy Lutomirski
@ 2015-03-09 18:16           ` Linus Torvalds
  2015-03-09 18:32             ` Denys Vlasenko
  2015-03-09 18:36             ` Andy Lutomirski
  0 siblings, 2 replies; 24+ messages in thread
From: Linus Torvalds @ 2015-03-09 18:16 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: H. Peter Anvin, Denys Vlasenko, Steven Rostedt, Ingo Molnar,
	Borislav Petkov, Oleg Nesterov, Frederic Weisbecker,
	Alexei Starovoitov, Will Drewry, Kees Cook,
	the arch/x86 maintainers, Linux Kernel Mailing List

On Mon, Mar 9, 2015 at 11:04 AM, Andy Lutomirski <luto@amacapital.net> wrote:
>
> One option would be to change the NMI entry code to move itself down 8
> bytes if this happens (came from kernel mode or sp == sp0 - 12,
> perhaps).

Hmm. That whole code currently depends on the stack setup being just a
single instruction (the move to esp). And that simplifies things, I'd
like to keep it that way.

I'd *much* rather just keep the 8-byte padding. What was so
problematic with that? It worked. It's been around forever. Removing
it is the bug.

                        Linus

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] x86: entry_32.S: change ESPFIX test to not touch PT_OLDSS(%esp)
  2015-03-09 18:16           ` Linus Torvalds
@ 2015-03-09 18:32             ` Denys Vlasenko
  2015-03-09 18:36             ` Andy Lutomirski
  1 sibling, 0 replies; 24+ messages in thread
From: Denys Vlasenko @ 2015-03-09 18:32 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andy Lutomirski, H. Peter Anvin, Denys Vlasenko, Steven Rostedt,
	Ingo Molnar, Borislav Petkov, Oleg Nesterov, Frederic Weisbecker,
	Alexei Starovoitov, Will Drewry, Kees Cook,
	the arch/x86 maintainers, Linux Kernel Mailing List

On Mon, Mar 9, 2015 at 7:16 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Mon, Mar 9, 2015 at 11:04 AM, Andy Lutomirski <luto@amacapital.net> wrote:
>>
>> One option would be to change the NMI entry code to move itself down 8
>> bytes if this happens (came from kernel mode or sp == sp0 - 12,
>> perhaps).
>
> Hmm. That whole code currently depends on the stack setup being just a
> single instruction (the move to esp). And that simplifies things, I'd
> like to keep it that way.
>
> I'd *much* rather just keep the 8-byte padding. What was so
> problematic with that? It worked. It's been around forever. Removing
> it is the bug.

Unfortunately, looks like keeping the gap is the safest thing to do.

Hopes that tss.sp0 was true top of the stack, sans "a few little problems",
died so quickly :/
(and we didn't even touch vm86 yet).

We probably better off to give that 8 a name. Now it's just another
magic constant in the entry.S maze.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] x86: entry_32.S: change ESPFIX test to not touch PT_OLDSS(%esp)
  2015-03-09 18:16           ` Linus Torvalds
  2015-03-09 18:32             ` Denys Vlasenko
@ 2015-03-09 18:36             ` Andy Lutomirski
  2015-03-10  6:25               ` Ingo Molnar
  1 sibling, 1 reply; 24+ messages in thread
From: Andy Lutomirski @ 2015-03-09 18:36 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: H. Peter Anvin, Denys Vlasenko, Steven Rostedt, Ingo Molnar,
	Borislav Petkov, Oleg Nesterov, Frederic Weisbecker,
	Alexei Starovoitov, Will Drewry, Kees Cook,
	the arch/x86 maintainers, Linux Kernel Mailing List

On Mon, Mar 9, 2015 at 11:16 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Mon, Mar 9, 2015 at 11:04 AM, Andy Lutomirski <luto@amacapital.net> wrote:
>>
>> One option would be to change the NMI entry code to move itself down 8
>> bytes if this happens (came from kernel mode or sp == sp0 - 12,
>> perhaps).
>
> Hmm. That whole code currently depends on the stack setup being just a
> single instruction (the move to esp). And that simplifies things, I'd
> like to keep it that way.
>
> I'd *much* rather just keep the 8-byte padding. What was so
> problematic with that? It worked. It's been around forever. Removing
> it is the bug.

Let's at least fix it, then.  processor.h has:

#define INIT_TSS  {                              \
    .x86_tss = {                              \
        .sp0        = sizeof(init_stack) + (long)&init_stack, \

(moved in -tip)

That's bogus, and this bogosity is why I broke 32-bit -next in the
first place: I assumed it was correct.

I'll get it if no one beats me to it.

--Andy

>
>                         Linus



-- 
Andy Lutomirski
AMA Capital Management, LLC

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] x86: entry_32.S: change ESPFIX test to not touch PT_OLDSS(%esp)
  2015-03-09 17:44       ` H. Peter Anvin
@ 2015-03-09 19:13         ` Andy Lutomirski
  2015-03-09 19:26           ` H. Peter Anvin
  0 siblings, 1 reply; 24+ messages in thread
From: Andy Lutomirski @ 2015-03-09 19:13 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Linus Torvalds, Denys Vlasenko, Denys Vlasenko, Steven Rostedt,
	Ingo Molnar, Borislav Petkov, Oleg Nesterov, Frederic Weisbecker,
	Alexei Starovoitov, Will Drewry, Kees Cook,
	the arch/x86 maintainers, Linux Kernel Mailing List

[-- Attachment #1: Type: text/plain, Size: 717 bytes --]

On Mon, Mar 9, 2015 at 10:44 AM, H. Peter Anvin <hpa@zytor.com> wrote:
> On 03/09/2015 09:44 AM, Linus Torvalds wrote:
>>
>> And remember: those zero-cost out-of-order branches turn quite
>> expensive if they *ever* mispredict. Even a 5% mispredict rate is
>> likely to mean "it's better to have a data dependency chain".
>>
>> So it could easily go either way. I'm not convinced the old code is bad at all.
>>
>
> I'm inclined to side with Linus here.  I'm hesitant to change this based
> on pure speculation.
>
> To answer Andy's question: I do believe we need espfix for V86 mode as well.
>

I think we don't.  Did I screw up my test?

--Andy

>         -hpa
>
>



-- 
Andy Lutomirski
AMA Capital Management, LLC

[-- Attachment #2: vm86regs.c --]
[-- Type: text/x-csrc, Size: 1740 bytes --]

/*
 * vm86 regs test.
 * Copyright (c) 2014-2015 Andrew Lutomirski.
 *
 * This tests that vm86 regs work as expected.
 *
 * GPL v2.
 */

#define _GNU_SOURCE

#include <time.h>
#include <stdlib.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <stdio.h>
#include <string.h>
#include <inttypes.h>
#include <sys/mman.h>
#include <sys/signal.h>
#include <sys/ucontext.h>
#include <asm/ldt.h>
#include <err.h>
#include <setjmp.h>
#include <stddef.h>
#include <stdbool.h>
#include <sys/user.h>
#include <errno.h>

#include <asm/vm86.h>

static void sethandler(int sig, void (*handler)(int, siginfo_t *, void *),
		       int flags)
{
	struct sigaction sa;

	memset(&sa, 0, sizeof(sa));
	sa.sa_sigaction = handler;
	sa.sa_flags = SA_SIGINFO | flags;
	sigemptyset(&sa.sa_mask);
	if (sigaction(sig, &sa, 0))
		err(1, "sigaction");
}

static void sigsegv_vm86(int sig, siginfo_t *info, void *ctx_void)
{
	ucontext_t *ctx = (ucontext_t*)ctx_void;

	printf("Back from vm86.  EIP = %lx\n",
	       (unsigned long)ctx->uc_mcontext.gregs[REG_EIP]);
	
}

static void test_vm86(unsigned short cs, unsigned short ss)
{
	struct vm86plus_struct v86, req_v86;
	long ret;

	memset(&v86, 0, sizeof(v86));

	v86.regs.eip = 0;
	v86.regs.cs = cs;
	v86.regs.ss = ss;
	v86.regs.esp = 0xbaadf00d;

	req_v86 = v86;

	printf("[RUN]\tcs = 0x%hx, ss = 0x%hx\n", cs, ss);

	ret = syscall(SYS_vm86, VM86_ENTER, &v86);

	if (ret == -1 && errno == ENOSYS) {
		printf("[SKIP]\tvm86 not supported\n");
		return;
	}

	printf("[OK]\tSurvived vm86 roundtrip.  esp = %lx, should be %lx\n", v86.regs.esp, req_v86.regs.esp);
}

int main(void)
{
	sethandler(SIGSEGV, sigsegv_vm86, SA_ONSTACK);
	test_vm86(0, 0);
	test_vm86(0, 3);
	test_vm86(3, 0);
	test_vm86(3, 3);
	return 0;
}

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] x86: entry_32.S: change ESPFIX test to not touch PT_OLDSS(%esp)
  2015-03-09 19:13         ` Andy Lutomirski
@ 2015-03-09 19:26           ` H. Peter Anvin
  2015-03-09 19:51             ` Andy Lutomirski
  0 siblings, 1 reply; 24+ messages in thread
From: H. Peter Anvin @ 2015-03-09 19:26 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Linus Torvalds, Denys Vlasenko, Denys Vlasenko, Steven Rostedt,
	Ingo Molnar, Borislav Petkov, Oleg Nesterov, Frederic Weisbecker,
	Alexei Starovoitov, Will Drewry, Kees Cook,
	the arch/x86 maintainers, Linux Kernel Mailing List

On 03/09/2015 12:13 PM, Andy Lutomirski wrote:
> On Mon, Mar 9, 2015 at 10:44 AM, H. Peter Anvin <hpa@zytor.com> wrote:
>> On 03/09/2015 09:44 AM, Linus Torvalds wrote:
>>>
>>> And remember: those zero-cost out-of-order branches turn quite
>>> expensive if they *ever* mispredict. Even a 5% mispredict rate is
>>> likely to mean "it's better to have a data dependency chain".
>>>
>>> So it could easily go either way. I'm not convinced the old code is bad at all.
>>>
>>
>> I'm inclined to side with Linus here.  I'm hesitant to change this based
>> on pure speculation.
>>
>> To answer Andy's question: I do believe we need espfix for V86 mode as well.
>>
> 
> I think we don't.  Did I screw up my test?
> 

I don't see how your test executes V86 mode code at all, since there
seems to be nothing mapped at the bottom of memory?

	-hpa



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] x86: entry_32.S: change ESPFIX test to not touch PT_OLDSS(%esp)
  2015-03-09 15:09     ` Andy Lutomirski
@ 2015-03-09 19:31       ` Denys Vlasenko
  0 siblings, 0 replies; 24+ messages in thread
From: Denys Vlasenko @ 2015-03-09 19:31 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Denys Vlasenko, Linus Torvalds, Steven Rostedt, Ingo Molnar,
	Borislav Petkov, H. Peter Anvin, Oleg Nesterov,
	Frederic Weisbecker, Alexei Starovoitov, Will Drewry, Kees Cook,
	X86 ML, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 398 bytes --]

>> What we definitely should do here is at least frame this check with
>>> That being said, what ends up in the high bits of esp when we iret to
>>> vm86 mode?
>>
>> I don't know. I guess it's time to write an actual vm86 testcase :)
>
> Ick.  I can try...

I found an example which runs small bit of 16-bit code using vm86 machinery.
Tried in 32-bit kernel under qemu, it worked: printed "Hello".

[-- Attachment #2: vm86.c --]
[-- Type: text/x-csrc, Size: 4700 bytes --]

/*
 * Adaped from: runcom version 0.1 (c) 2003 Fabrice Bellar
 * "Simple example of use of vm86: launch a basic .com DOS executable"
 *
 * gcc -m32 -Os -Wall -static vm86.c -ovm86
 */
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <inttypes.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <signal.h>
#include <linux/unistd.h>
#include <asm/vm86.h>
#include <sys/vm86.h>

//#define SIGTEST

#define COM_BASE_ADDR    0x10100

static inline void set_bit(uint8_t *a, unsigned int bit)
{
    a[bit / 8] |= (1 << (bit % 8));
}

static inline uint8_t *seg_to_linear(unsigned int seg, unsigned int reg)
{
    return (uint8_t *)((seg << 4) + (reg & 0xffff));
}

static inline void pushw(struct vm86_regs *r, int val)
{
    r->esp = (r->esp & ~0xffff) | ((r->esp - 2) & 0xffff);
    *(uint16_t *)seg_to_linear(r->ss, r->esp) = val;
}

void dump_regs(struct vm86_regs *r)
{
    fprintf(stderr,
            "AX=%08lx BX=%08lx CX=%08lx DX=%08lx\n"
            "SI=%08lx DI=%08lx BP=%08lx SP=%08lx\n"
            "IP=%08lx FL=%08lx\n"
            "CS=%04x DS=%04x ES=%04x SS=%04x FS=%04x GS=%04x\n",
            r->eax, r->ebx, r->ecx, r->edx, r->esi, r->edi, r->ebp, r->esp,
            r->eip, r->eflags,
            r->cs, r->ds, r->es, r->ss, r->fs, r->gs);
}

#ifdef SIGTEST
void alarm_handler(int sig)
{
    fprintf(stderr, "alarm signal=%d\n", sig);
    alarm(1);
}
#endif

extern char code16;
extern char code16_end;

int main(int argc, char **argv)
{
    uint8_t *vm86_mem;
    int ret, seg;
    struct vm86plus_struct ctx;
    struct vm86_regs *r;

    vm86_mem = mmap((void *)0x00000000, 0x110000,
                    PROT_WRITE | PROT_READ | PROT_EXEC,
                    MAP_FIXED | MAP_ANON | MAP_PRIVATE, -1, 0);
    if (vm86_mem == MAP_FAILED) {
        perror("mmap");
        exit(1);
    }
#ifdef SIGTEST
    {
        struct sigaction act;

        act.sa_handler = alarm_handler;
        sigemptyset(&act.sa_mask);
        act.sa_flags = 0;
        sigaction(SIGALRM, &act, NULL);
        alarm(1);
    }
#endif

    /* load 16-bit code at COM_BASE_ADDR */
    memcpy(vm86_mem + COM_BASE_ADDR, &code16, &code16_end - &code16);

    memset(&ctx, 0, sizeof(ctx));
    /* init basic registers */
    r = &ctx.regs;
    r->eip = 0x100;
    r->esp = 0xfffe;
    seg = (COM_BASE_ADDR - 0x100) >> 4;
    r->cs = seg;
    r->ss = seg;
    r->ds = seg;
    r->es = seg;
    r->fs = seg;
    r->gs = seg;
    r->eflags = 1 << 19; //EFLAGS.VIF

    set_bit((uint8_t *)&ctx.int_revectored, 0x21);
    /* put return code */
    *seg_to_linear(r->cs, 0) = 0xb4; /* mov ah, $0 */
    *seg_to_linear(r->cs, 1) = 0x00;
    *seg_to_linear(r->cs, 2) = 0xcd; /* int $0x21 */
    *seg_to_linear(r->cs, 3) = 0x21;
    pushw(&ctx.regs, 0x0000);

    for(;;) {
        ret = vm86(VM86_ENTER, &ctx);
        switch(VM86_TYPE(ret)) {
        case VM86_INTx:
            {
                int int_num, ah;

                int_num = VM86_ARG(ret);
                if (int_num != 0x21)
                    goto unknown_int;
                ah = (r->eax >> 8) & 0xff;
                switch(ah) {
                case 0x00: /* exit */
                    exit(0);
                case 0x02: /* write char */
                    {
                        uint8_t c = r->edx;
                        write(1, &c, 1);
                    }
                    break;
                case 0x09: /* write string */
                    {
			int ptr = r->edx;
                        uint8_t c;
                        for(;;) {
                            c = *seg_to_linear(r->ds, ptr++);
                            if (c == '$')
                                break;
                            write(1, &c, 1);
                        }
                        r->eax = (r->eax & ~0xff) | '$';
                    }
                    break;
                default:
unknown_int:
                    fprintf(stderr, "unsupported int 0x%02x\n", int_num);
                    dump_regs(&ctx.regs);
                    //                    exit(1);
                }
            }
            break;
        case VM86_SIGNAL:
            /* a signal came, we just ignore that */
            break;
        case VM86_STI:
            break;
        default:
            fprintf(stderr, "unhandled vm86 return code (0x%x)\n", ret);
            dump_regs(&ctx.regs);
            exit(1);
        }
    }
}

void code()
{
	asm volatile("\n"
	"	.code16""\n"

	"code16:""\n"
	"	mov	$(0x100+msg-code16),%dx""\n"
	"	mov	$0x09,%ah""\n"
	"	int	$0x21""\n"
	"	ret""\n"

	"msg:""\n"
	"	.string	\"Hello\"""\n"
	"	.byte	10""\n"
	"	.string	\"$\"""\n"

	"code16_end:""\n"
	"	.code32""\n"
	);
}

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] x86: entry_32.S: change ESPFIX test to not touch PT_OLDSS(%esp)
  2015-03-09 19:26           ` H. Peter Anvin
@ 2015-03-09 19:51             ` Andy Lutomirski
  0 siblings, 0 replies; 24+ messages in thread
From: Andy Lutomirski @ 2015-03-09 19:51 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Linus Torvalds, Denys Vlasenko, Denys Vlasenko, Steven Rostedt,
	Ingo Molnar, Borislav Petkov, Oleg Nesterov, Frederic Weisbecker,
	Alexei Starovoitov, Will Drewry, Kees Cook,
	the arch/x86 maintainers, Linux Kernel Mailing List

On Mon, Mar 9, 2015 at 12:26 PM, H. Peter Anvin <hpa@zytor.com> wrote:
> On 03/09/2015 12:13 PM, Andy Lutomirski wrote:
>> On Mon, Mar 9, 2015 at 10:44 AM, H. Peter Anvin <hpa@zytor.com> wrote:
>>> On 03/09/2015 09:44 AM, Linus Torvalds wrote:
>>>>
>>>> And remember: those zero-cost out-of-order branches turn quite
>>>> expensive if they *ever* mispredict. Even a 5% mispredict rate is
>>>> likely to mean "it's better to have a data dependency chain".
>>>>
>>>> So it could easily go either way. I'm not convinced the old code is bad at all.
>>>>
>>>
>>> I'm inclined to side with Linus here.  I'm hesitant to change this based
>>> on pure speculation.
>>>
>>> To answer Andy's question: I do believe we need espfix for V86 mode as well.
>>>
>>
>> I think we don't.  Did I screw up my test?
>>
>
> I don't see how your test executes V86 mode code at all, since there
> seems to be nothing mapped at the bottom of memory?

It executes the implicit #PF at the bottom of memory :)

Seriously, though, I think that IRET doesn't check whether the
instruction we're returning to can be fetched, so IRET will complete
successfully and then we'll get #PF.  The resulting SIGSEGV kicks us
out of vm86 mode, and, assuming that the kernel isn't buggy (hah!) the
v86 state will get saved back to the vm86plus_struct and we can see if
esp got corrupted.

--Andy

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] x86: entry_32.S: change ESPFIX test to not touch PT_OLDSS(%esp)
  2015-03-09 18:36             ` Andy Lutomirski
@ 2015-03-10  6:25               ` Ingo Molnar
  0 siblings, 0 replies; 24+ messages in thread
From: Ingo Molnar @ 2015-03-10  6:25 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Linus Torvalds, H. Peter Anvin, Denys Vlasenko, Steven Rostedt,
	Borislav Petkov, Oleg Nesterov, Frederic Weisbecker,
	Alexei Starovoitov, Will Drewry, Kees Cook,
	the arch/x86 maintainers, Linux Kernel Mailing List


* Andy Lutomirski <luto@amacapital.net> wrote:

> On Mon, Mar 9, 2015 at 11:16 AM, Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> > On Mon, Mar 9, 2015 at 11:04 AM, Andy Lutomirski <luto@amacapital.net> wrote:
> >>
> >> One option would be to change the NMI entry code to move itself down 8
> >> bytes if this happens (came from kernel mode or sp == sp0 - 12,
> >> perhaps).
> >
> > Hmm. That whole code currently depends on the stack setup being just a
> > single instruction (the move to esp). And that simplifies things, I'd
> > like to keep it that way.
> >
> > I'd *much* rather just keep the 8-byte padding. What was so
> > problematic with that? It worked. It's been around forever. Removing
> > it is the bug.
> 
> Let's at least fix it, then.  processor.h has:
> 
> #define INIT_TSS  {                              \
>     .x86_tss = {                              \
>         .sp0        = sizeof(init_stack) + (long)&init_stack, \
> 
> (moved in -tip)
> 
> That's bogus, and this bogosity is why I broke 32-bit -next in the
> first place: I assumed it was correct.
> 
> I'll get it if no one beats me to it.

Please do and please post patches ASAP so that I can move tip:x86/asm 
back into a correct state again.

Thanks,

	Ingo

^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2015-03-10  6:25 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-03-09 14:05 [PATCH] x86: entry_32.S: change ESPFIX test to not touch PT_OLDSS(%esp) Denys Vlasenko
2015-03-09 14:18 ` Andy Lutomirski
2015-03-09 15:00   ` Denys Vlasenko
2015-03-09 15:09     ` Andy Lutomirski
2015-03-09 19:31       ` Denys Vlasenko
2015-03-09 15:13     ` Ingo Molnar
2015-03-09 15:18       ` Andy Lutomirski
2015-03-09 15:47       ` Steven Rostedt
2015-03-09 15:54         ` Ingo Molnar
2015-03-09 16:08 ` Linus Torvalds
2015-03-09 16:28   ` Denys Vlasenko
2015-03-09 16:44     ` Linus Torvalds
2015-03-09 17:44       ` H. Peter Anvin
2015-03-09 19:13         ` Andy Lutomirski
2015-03-09 19:26           ` H. Peter Anvin
2015-03-09 19:51             ` Andy Lutomirski
2015-03-09 17:42   ` H. Peter Anvin
2015-03-09 17:45     ` Andy Lutomirski
2015-03-09 17:59       ` Linus Torvalds
2015-03-09 18:04         ` Andy Lutomirski
2015-03-09 18:16           ` Linus Torvalds
2015-03-09 18:32             ` Denys Vlasenko
2015-03-09 18:36             ` Andy Lutomirski
2015-03-10  6:25               ` Ingo Molnar

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).