linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* selftests/x86/fsgsbase_64 test problem
@ 2018-01-26 15:36 Dan Rue
  2018-01-26 16:22 ` Andy Lutomirski
  0 siblings, 1 reply; 16+ messages in thread
From: Dan Rue @ 2018-01-26 15:36 UTC (permalink / raw)
  To: Shuah Khan, Ingo Molnar, Andy Lutomirski, Dmitry Safonov,
	Borislav Petkov, linux-kselftest, linux-kernel

We've noticed that fsgsbase_64 can fail intermittently with the
following error:

        [RUN]   ARCH_SET_GS(0x0) and clear gs, then schedule to 0x1
                Before schedule, set selector to 0x1
                other thread: ARCH_SET_GS(0x1) -- sel is 0x0
        [FAIL]  GS/BASE changed from 0x1/0x0 to 0x0/0x0

This can be reliably reproduced by running fsgsbase_64 in a loop. i.e.

    for i in $(seq 1 10000); do ./fsgsbase_64 || break; done

This problem isn't new - I've reproduced it on latest mainline and every
release going back to v4.12 (I did not try earlier). This was tested on
a Supermicro board with a Xeon E3-1220 as well as an Intel Nuc with an
i3-5010U.

Thanks,
Dan

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: selftests/x86/fsgsbase_64 test problem
  2018-01-26 15:36 selftests/x86/fsgsbase_64 test problem Dan Rue
@ 2018-01-26 16:22 ` Andy Lutomirski
  2018-01-26 18:59   ` Andy Lutomirski
  0 siblings, 1 reply; 16+ messages in thread
From: Andy Lutomirski @ 2018-01-26 16:22 UTC (permalink / raw)
  To: Dan Rue
  Cc: Shuah Khan, Ingo Molnar, Andy Lutomirski, Dmitry Safonov,
	Borislav Petkov, open list:KERNEL SELFTEST FRAMEWORK, LKML

On Fri, Jan 26, 2018 at 7:36 AM, Dan Rue <dan.rue@linaro.org> wrote:
>
> We've noticed that fsgsbase_64 can fail intermittently with the
> following error:
>
>         [RUN]   ARCH_SET_GS(0x0) and clear gs, then schedule to 0x1
>                 Before schedule, set selector to 0x1
>                 other thread: ARCH_SET_GS(0x1) -- sel is 0x0
>         [FAIL]  GS/BASE changed from 0x1/0x0 to 0x0/0x0
>
> This can be reliably reproduced by running fsgsbase_64 in a loop. i.e.
>
>     for i in $(seq 1 10000); do ./fsgsbase_64 || break; done
>
> This problem isn't new - I've reproduced it on latest mainline and every
> release going back to v4.12 (I did not try earlier). This was tested on
> a Supermicro board with a Xeon E3-1220 as well as an Intel Nuc with an
> i3-5010U.
>

Hmm, I can reproduce it, too.  I'll look in a bit.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: selftests/x86/fsgsbase_64 test problem
  2018-01-26 16:22 ` Andy Lutomirski
@ 2018-01-26 18:59   ` Andy Lutomirski
  2018-01-26 19:46     ` Andy Lutomirski
  0 siblings, 1 reply; 16+ messages in thread
From: Andy Lutomirski @ 2018-01-26 18:59 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Dan Rue, Shuah Khan, Ingo Molnar, Dmitry Safonov,
	Borislav Petkov, open list:KERNEL SELFTEST FRAMEWORK, LKML

On Fri, Jan 26, 2018 at 8:22 AM, Andy Lutomirski <luto@kernel.org> wrote:
> On Fri, Jan 26, 2018 at 7:36 AM, Dan Rue <dan.rue@linaro.org> wrote:
>>
>> We've noticed that fsgsbase_64 can fail intermittently with the
>> following error:
>>
>>         [RUN]   ARCH_SET_GS(0x0) and clear gs, then schedule to 0x1
>>                 Before schedule, set selector to 0x1
>>                 other thread: ARCH_SET_GS(0x1) -- sel is 0x0
>>         [FAIL]  GS/BASE changed from 0x1/0x0 to 0x0/0x0
>>
>> This can be reliably reproduced by running fsgsbase_64 in a loop. i.e.
>>
>>     for i in $(seq 1 10000); do ./fsgsbase_64 || break; done
>>
>> This problem isn't new - I've reproduced it on latest mainline and every
>> release going back to v4.12 (I did not try earlier). This was tested on
>> a Supermicro board with a Xeon E3-1220 as well as an Intel Nuc with an
>> i3-5010U.
>>
>
> Hmm, I can reproduce it, too.  I'll look in a bit.

I'm triggering a different error, and I think what's going on is that
the kernel doesn't currently re-save GSBASE when a task switches out
and that task has save gsbase != 0 and in-register GS == 0.  This is
arguably a bug, but it's not an infoleak, and fixing it could be a wee
bit expensive.  I'm not sure what, if anything, to do about this.  I
suppose I could add some gross perf hackery to the test to detect this
case and suppress the error.

I can also trigger the problem you're seeing, and I don't know what's
up.  It may be related to and old problem I've seen that causes signal
delivery to sometimes corrupt %gs.  It's deterministic, but it depends
in some odd way on register state.  I can currently reproduce that
issue 100% of the time, and I'm trying to see if I can figure out
what's happening.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: selftests/x86/fsgsbase_64 test problem
  2018-01-26 18:59   ` Andy Lutomirski
@ 2018-01-26 19:46     ` Andy Lutomirski
  2018-01-26 22:38       ` Andy Lutomirski
  2018-01-26 22:51       ` H. Peter Anvin
  0 siblings, 2 replies; 16+ messages in thread
From: Andy Lutomirski @ 2018-01-26 19:46 UTC (permalink / raw)
  To: Andy Lutomirski, H. Peter Anvin
  Cc: Dan Rue, Shuah Khan, Ingo Molnar, Dmitry Safonov,
	Borislav Petkov, open list:KERNEL SELFTEST FRAMEWORK, LKML

On Fri, Jan 26, 2018 at 10:59 AM, Andy Lutomirski <luto@kernel.org> wrote:
> On Fri, Jan 26, 2018 at 8:22 AM, Andy Lutomirski <luto@kernel.org> wrote:
>> On Fri, Jan 26, 2018 at 7:36 AM, Dan Rue <dan.rue@linaro.org> wrote:
>>>
>>> We've noticed that fsgsbase_64 can fail intermittently with the
>>> following error:
>>>
>>>         [RUN]   ARCH_SET_GS(0x0) and clear gs, then schedule to 0x1
>>>                 Before schedule, set selector to 0x1
>>>                 other thread: ARCH_SET_GS(0x1) -- sel is 0x0
>>>         [FAIL]  GS/BASE changed from 0x1/0x0 to 0x0/0x0
>>>
>>> This can be reliably reproduced by running fsgsbase_64 in a loop. i.e.
>>>
>>>     for i in $(seq 1 10000); do ./fsgsbase_64 || break; done
>>>
>>> This problem isn't new - I've reproduced it on latest mainline and every
>>> release going back to v4.12 (I did not try earlier). This was tested on
>>> a Supermicro board with a Xeon E3-1220 as well as an Intel Nuc with an
>>> i3-5010U.
>>>
>>
>> Hmm, I can reproduce it, too.  I'll look in a bit.
>
> I'm triggering a different error, and I think what's going on is that
> the kernel doesn't currently re-save GSBASE when a task switches out
> and that task has save gsbase != 0 and in-register GS == 0.  This is
> arguably a bug, but it's not an infoleak, and fixing it could be a wee
> bit expensive.  I'm not sure what, if anything, to do about this.  I
> suppose I could add some gross perf hackery to the test to detect this
> case and suppress the error.
>
> I can also trigger the problem you're seeing, and I don't know what's
> up.  It may be related to and old problem I've seen that causes signal
> delivery to sometimes corrupt %gs.  It's deterministic, but it depends
> in some odd way on register state.  I can currently reproduce that
> issue 100% of the time, and I'm trying to see if I can figure out
> what's happening.

I think it's a CPU bug, and I'm a bit mystified.  I can trigger the
following, plausibly related issue:

Write a program that writes %gs = 1.
Run that program under gdb
break in which %gs == 1
display/x $gs
si

Under QEMU TCG, gs stays equal to 1.  On native or KVM, on Skylake, it
changes to 0.

On KVM or native, I do not observe do_debug getting called with %gs ==
1.  On TCG, I do.  I don't think that's precisely the problem that's
causing the test to fail, since the test doesn't use TF or ptrace, but
I wouldn't be shocked if it's related.

hpa, any insight?

(NB: if you want to play with this as I've described it, you may need
to make invalid_selector() in ptrace.c always return false.  The
current implementation is too strict and causes problems.)

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: selftests/x86/fsgsbase_64 test problem
  2018-01-26 19:46     ` Andy Lutomirski
@ 2018-01-26 22:38       ` Andy Lutomirski
  2018-01-26 22:42         ` Andy Lutomirski
  2018-01-26 22:56         ` Borislav Petkov
  2018-01-26 22:51       ` H. Peter Anvin
  1 sibling, 2 replies; 16+ messages in thread
From: Andy Lutomirski @ 2018-01-26 22:38 UTC (permalink / raw)
  To: Andy Lutomirski, Borislav Petkov
  Cc: H. Peter Anvin, Dan Rue, Shuah Khan, Ingo Molnar, Dmitry Safonov,
	open list:KERNEL SELFTEST FRAMEWORK, LKML

[-- Attachment #1: Type: text/plain, Size: 5170 bytes --]

On Fri, Jan 26, 2018 at 11:46 AM, Andy Lutomirski <luto@kernel.org> wrote:
> On Fri, Jan 26, 2018 at 10:59 AM, Andy Lutomirski <luto@kernel.org> wrote:
>> On Fri, Jan 26, 2018 at 8:22 AM, Andy Lutomirski <luto@kernel.org> wrote:
>>> On Fri, Jan 26, 2018 at 7:36 AM, Dan Rue <dan.rue@linaro.org> wrote:
>>>>
>>>> We've noticed that fsgsbase_64 can fail intermittently with the
>>>> following error:
>>>>
>>>>         [RUN]   ARCH_SET_GS(0x0) and clear gs, then schedule to 0x1
>>>>                 Before schedule, set selector to 0x1
>>>>                 other thread: ARCH_SET_GS(0x1) -- sel is 0x0
>>>>         [FAIL]  GS/BASE changed from 0x1/0x0 to 0x0/0x0
>>>>
>>>> This can be reliably reproduced by running fsgsbase_64 in a loop. i.e.
>>>>
>>>>     for i in $(seq 1 10000); do ./fsgsbase_64 || break; done
>>>>
>>>> This problem isn't new - I've reproduced it on latest mainline and every
>>>> release going back to v4.12 (I did not try earlier). This was tested on
>>>> a Supermicro board with a Xeon E3-1220 as well as an Intel Nuc with an
>>>> i3-5010U.
>>>>
>>>
>>> Hmm, I can reproduce it, too.  I'll look in a bit.
>>
>> I'm triggering a different error, and I think what's going on is that
>> the kernel doesn't currently re-save GSBASE when a task switches out
>> and that task has save gsbase != 0 and in-register GS == 0.  This is
>> arguably a bug, but it's not an infoleak, and fixing it could be a wee
>> bit expensive.  I'm not sure what, if anything, to do about this.  I
>> suppose I could add some gross perf hackery to the test to detect this
>> case and suppress the error.
>>
>> I can also trigger the problem you're seeing, and I don't know what's
>> up.  It may be related to and old problem I've seen that causes signal
>> delivery to sometimes corrupt %gs.  It's deterministic, but it depends
>> in some odd way on register state.  I can currently reproduce that
>> issue 100% of the time, and I'm trying to see if I can figure out
>> what's happening.
>
> I think it's a CPU bug, and I'm a bit mystified.  I can trigger the
> following, plausibly related issue:
>
> Write a program that writes %gs = 1.
> Run that program under gdb
> break in which %gs == 1
> display/x $gs
> si
>
> Under QEMU TCG, gs stays equal to 1.  On native or KVM, on Skylake, it
> changes to 0.
>
> On KVM or native, I do not observe do_debug getting called with %gs ==
> 1.  On TCG, I do.  I don't think that's precisely the problem that's
> causing the test to fail, since the test doesn't use TF or ptrace, but
> I wouldn't be shocked if it's related.
>
> hpa, any insight?
>
> (NB: if you want to play with this as I've described it, you may need
> to make invalid_selector() in ptrace.c always return false.  The
> current implementation is too strict and causes problems.)

Much simpler test.  Run the attached program (gs1).  It more or less
just sets %gs to 1 and spins until it stops being 1.  Do it on a
kernel with the attached patch applied.  I see stuff like this:

# ./gs1
PID = 129
[   15.703015] pid 129 saved gs = 1
[   15.703517] pid 129 loaded gs = 1
[   15.703973] pid 129 prepare_exit_to_usermode: gs = 1
ax = 0, cx = 0, dx = 0

So we're interrupting the program, switching out, switching back in,
setting %gs to 1, observing that %gs is *still* 1 in
prepare_exit_to_usermode(), returning to usermode, and observing %gs
== 0.

Presumably what's happening is that the IRET microcode matches the
SDM's pseudocode, which says:

RETURN-TO-OUTER-PRIVILEGE-LEVEL:
...
FOR each SegReg in (ES, FS, GS, and DS)
  DO
    tempDesc ← descriptor cache for SegReg (* hidden part of segment register *)
    IF tempDesc(DPL) < CPL AND tempDesc(Type) is data or non-conforming code
    THEN (* Segment register invalid *)
      SegReg ← NULL;
    FI;
  OD;

But this is very odd.  The actual permission checks (in the docs for MOV) are:

IF DS, ES, FS, or GS is loaded with non-NULL selector
THEN
  IF segment selector index is outside descriptor table limits
  or segment is not a data or readable code segment
  or ((segment is a data or nonconforming code segment)
  or ((RPL > DPL) and (CPL > DPL))
    THEN #GP(selector); FI;

^^^^
This makes no sense.  This says that the data segments cannot be
loaded with MOV.  Empirically, it seems like MOV works if CPL <= DPL
and RPL <= DPL, but I haven't checked that hard.

  IF segment not marked present
    THEN #NP(selector);
  ELSE
    SegmentRegister ← segment selector;
    SegmentRegister ← segment descriptor; FI;
  FI;

  IF DS, ES, FS, or GS is loaded with NULL selector
  THEN
    SegmentRegister ← segment selector;
    SegmentRegister ← segment descriptor;
    ^^^^
    wtf?  There is no "segment descriptor".  Presumably what actually
gets written to segment.DPL is nonsense.
  FI;

Anyway, I think it's nonsense that user code can load a selector using
MOV that is, in turn, rejected by IRET.  I don't suppose Intel would
consider fixing this going forward.

Borislav, any chance you could run the attached program on an AMD
machine to see what it does?

[-- Attachment #2: gs1.c --]
[-- Type: text/x-csrc, Size: 588 bytes --]

#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

int main()
{
	unsigned short ax, cx, dx;
	printf("PID = %d\n", (int)getpid());
	asm volatile ("mov %[one], %%gs\n\t"
		      "1:\n\t"
		      "mov %%gs, %%eax\n\t"
		      "mov %%gs, %%ecx\n\t"
		      "mov %%gs, %%edx\n\t"
		      "cmpw $1, %%ax\n\tjne 2f\n\t"
		      "cmpw $1, %%cx\n\tjne 2f\n\t"
		      "cmpw $1, %%dx\n\tjne 2f\n\t"
		      "jmp 1b\n\t"
		      "2:"
		      : "=a" (ax), "=c" (cx), "=d" (dx)
		      : [one] "rm" ((unsigned short)1));
	printf("ax = %hx, cx = %hx, dx = %hx\n", ax, cx, dx);
	return 0;
}

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: selftests/x86/fsgsbase_64 test problem
  2018-01-26 22:38       ` Andy Lutomirski
@ 2018-01-26 22:42         ` Andy Lutomirski
  2018-01-28 19:21           ` Andy Lutomirski
  2018-01-26 22:56         ` Borislav Petkov
  1 sibling, 1 reply; 16+ messages in thread
From: Andy Lutomirski @ 2018-01-26 22:42 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Borislav Petkov, H. Peter Anvin, Dan Rue, Shuah Khan,
	Ingo Molnar, Dmitry Safonov, open list:KERNEL SELFTEST FRAMEWORK,
	LKML

On Fri, Jan 26, 2018 at 2:38 PM, Andy Lutomirski <luto@kernel.org> wrote:
> On Fri, Jan 26, 2018 at 11:46 AM, Andy Lutomirski <luto@kernel.org> wrote:
>> On Fri, Jan 26, 2018 at 10:59 AM, Andy Lutomirski <luto@kernel.org> wrote:
>>> On Fri, Jan 26, 2018 at 8:22 AM, Andy Lutomirski <luto@kernel.org> wrote:
>>>> On Fri, Jan 26, 2018 at 7:36 AM, Dan Rue <dan.rue@linaro.org> wrote:
>>>>>
>>>>> We've noticed that fsgsbase_64 can fail intermittently with the
>>>>> following error:
>>>>>
>>>>>         [RUN]   ARCH_SET_GS(0x0) and clear gs, then schedule to 0x1
>>>>>                 Before schedule, set selector to 0x1
>>>>>                 other thread: ARCH_SET_GS(0x1) -- sel is 0x0
>>>>>         [FAIL]  GS/BASE changed from 0x1/0x0 to 0x0/0x0
>>>>>
>>>>> This can be reliably reproduced by running fsgsbase_64 in a loop. i.e.
>>>>>
>>>>>     for i in $(seq 1 10000); do ./fsgsbase_64 || break; done
>>>>>
>>>>> This problem isn't new - I've reproduced it on latest mainline and every
>>>>> release going back to v4.12 (I did not try earlier). This was tested on
>>>>> a Supermicro board with a Xeon E3-1220 as well as an Intel Nuc with an
>>>>> i3-5010U.
>>>>>
>>>>
>>>> Hmm, I can reproduce it, too.  I'll look in a bit.
>>>
>>> I'm triggering a different error, and I think what's going on is that
>>> the kernel doesn't currently re-save GSBASE when a task switches out
>>> and that task has save gsbase != 0 and in-register GS == 0.  This is
>>> arguably a bug, but it's not an infoleak, and fixing it could be a wee
>>> bit expensive.  I'm not sure what, if anything, to do about this.  I
>>> suppose I could add some gross perf hackery to the test to detect this
>>> case and suppress the error.
>>>
>>> I can also trigger the problem you're seeing, and I don't know what's
>>> up.  It may be related to and old problem I've seen that causes signal
>>> delivery to sometimes corrupt %gs.  It's deterministic, but it depends
>>> in some odd way on register state.  I can currently reproduce that
>>> issue 100% of the time, and I'm trying to see if I can figure out
>>> what's happening.
>>
>> I think it's a CPU bug, and I'm a bit mystified.  I can trigger the
>> following, plausibly related issue:
>>
>> Write a program that writes %gs = 1.
>> Run that program under gdb
>> break in which %gs == 1
>> display/x $gs
>> si
>>
>> Under QEMU TCG, gs stays equal to 1.  On native or KVM, on Skylake, it
>> changes to 0.
>>
>> On KVM or native, I do not observe do_debug getting called with %gs ==
>> 1.  On TCG, I do.  I don't think that's precisely the problem that's
>> causing the test to fail, since the test doesn't use TF or ptrace, but
>> I wouldn't be shocked if it's related.
>>
>> hpa, any insight?
>>
>> (NB: if you want to play with this as I've described it, you may need
>> to make invalid_selector() in ptrace.c always return false.  The
>> current implementation is too strict and causes problems.)
>
> Much simpler test.  Run the attached program (gs1).  It more or less
> just sets %gs to 1 and spins until it stops being 1.  Do it on a
> kernel with the attached patch applied.  I see stuff like this:
>
> # ./gs1
> PID = 129
> [   15.703015] pid 129 saved gs = 1
> [   15.703517] pid 129 loaded gs = 1
> [   15.703973] pid 129 prepare_exit_to_usermode: gs = 1
> ax = 0, cx = 0, dx = 0
>
> So we're interrupting the program, switching out, switching back in,
> setting %gs to 1, observing that %gs is *still* 1 in
> prepare_exit_to_usermode(), returning to usermode, and observing %gs
> == 0.
>
> Presumably what's happening is that the IRET microcode matches the
> SDM's pseudocode, which says:
>
> RETURN-TO-OUTER-PRIVILEGE-LEVEL:
> ...
> FOR each SegReg in (ES, FS, GS, and DS)
>   DO
>     tempDesc ← descriptor cache for SegReg (* hidden part of segment register *)
>     IF tempDesc(DPL) < CPL AND tempDesc(Type) is data or non-conforming code
>     THEN (* Segment register invalid *)
>       SegReg ← NULL;
>     FI;
>   OD;
>
> But this is very odd.  The actual permission checks (in the docs for MOV) are:
>
> IF DS, ES, FS, or GS is loaded with non-NULL selector
> THEN
>   IF segment selector index is outside descriptor table limits
>   or segment is not a data or readable code segment
>   or ((segment is a data or nonconforming code segment)
>   or ((RPL > DPL) and (CPL > DPL))
>     THEN #GP(selector); FI;
>
> ^^^^
> This makes no sense.  This says that the data segments cannot be
> loaded with MOV.  Empirically, it seems like MOV works if CPL <= DPL
> and RPL <= DPL, but I haven't checked that hard.

Surely Intel meant:

... or ((segment is a data segment or nonconforming code segment) and
((RPL > DPL) or (CPL > DPL))

This would be consistent with the AMD APM #GP condition of "The DS,
ES, FS, or GS register was loaded and the segment pointed to was a
data or non-conforming code segment, but the RPL or CPL was greater
than the DPL."

>
>   IF segment not marked present
>     THEN #NP(selector);
>   ELSE
>     SegmentRegister ← segment selector;
>     SegmentRegister ← segment descriptor; FI;
>   FI;
>
>   IF DS, ES, FS, or GS is loaded with NULL selector
>   THEN
>     SegmentRegister ← segment selector;
>     SegmentRegister ← segment descriptor;
>     ^^^^
>     wtf?  There is no "segment descriptor".  Presumably what actually
> gets written to segment.DPL is nonsense.
>   FI;

I think the bug is here.  I think that, when writing a NULL selector
to DS, ES, FS, or GS, Intel CPUs incorrectly set DPL == RPL, whereas
they should set DPL to 3.

>
> Anyway, I think it's nonsense that user code can load a selector using
> MOV that is, in turn, rejected by IRET.  I don't suppose Intel would
> consider fixing this going forward.
>
> Borislav, any chance you could run the attached program on an AMD
> machine to see what it does?

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: selftests/x86/fsgsbase_64 test problem
  2018-01-26 19:46     ` Andy Lutomirski
  2018-01-26 22:38       ` Andy Lutomirski
@ 2018-01-26 22:51       ` H. Peter Anvin
  1 sibling, 0 replies; 16+ messages in thread
From: H. Peter Anvin @ 2018-01-26 22:51 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Dan Rue, Shuah Khan, Ingo Molnar, Dmitry Safonov,
	Borislav Petkov, open list:KERNEL SELFTEST FRAMEWORK, LKML

On 01/26/18 11:46, Andy Lutomirski wrote:
> 
> Under QEMU TCG, gs stays equal to 1.  On native or KVM, on Skylake, it
> changes to 0.
> 
> On KVM or native, I do not observe do_debug getting called with %gs ==
> 1.  On TCG, I do.  I don't think that's precisely the problem that's
> causing the test to fail, since the test doesn't use TF or ptrace, but
> I wouldn't be shocked if it's related.
> 
> hpa, any insight?
> 
> (NB: if you want to play with this as I've described it, you may need
> to make invalid_selector() in ptrace.c always return false.  The
> current implementation is too strict and causes problems.)
> 

Looking at it.  I want to grok this in the general context of fsgsbase
as well, of course.

	-hpa

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: selftests/x86/fsgsbase_64 test problem
  2018-01-26 22:38       ` Andy Lutomirski
  2018-01-26 22:42         ` Andy Lutomirski
@ 2018-01-26 22:56         ` Borislav Petkov
  2018-01-28 19:21           ` Andy Lutomirski
  1 sibling, 1 reply; 16+ messages in thread
From: Borislav Petkov @ 2018-01-26 22:56 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: H. Peter Anvin, Dan Rue, Shuah Khan, Ingo Molnar, Dmitry Safonov,
	open list:KERNEL SELFTEST FRAMEWORK, LKML

On Fri, Jan 26, 2018 at 02:38:28PM -0800, Andy Lutomirski wrote:
> Borislav, any chance you could run the attached program on an AMD
> machine to see what it does?

[boris@pd: /tmp> ./gs1
PID = 3420
ax = 0, cx = 0, dx = 0

This is on 4.15-rc7-ish + tip/master from 2 wks ago.

-- 
Regards/Gruss,
    Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: selftests/x86/fsgsbase_64 test problem
  2018-01-26 22:42         ` Andy Lutomirski
@ 2018-01-28 19:21           ` Andy Lutomirski
  2018-01-29  9:13             ` H. Peter Anvin
  0 siblings, 1 reply; 16+ messages in thread
From: Andy Lutomirski @ 2018-01-28 19:21 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Borislav Petkov, H. Peter Anvin, Dan Rue, Shuah Khan,
	Ingo Molnar, Dmitry Safonov, open list:KERNEL SELFTEST FRAMEWORK,
	LKML

On Fri, Jan 26, 2018 at 2:42 PM, Andy Lutomirski <luto@kernel.org> wrote:
> On Fri, Jan 26, 2018 at 2:38 PM, Andy Lutomirski <luto@kernel.org> wrote:
>> On Fri, Jan 26, 2018 at 11:46 AM, Andy Lutomirski <luto@kernel.org> wrote:
>>> On Fri, Jan 26, 2018 at 10:59 AM, Andy Lutomirski <luto@kernel.org> wrote:
>>>> On Fri, Jan 26, 2018 at 8:22 AM, Andy Lutomirski <luto@kernel.org> wrote:
>>>>> On Fri, Jan 26, 2018 at 7:36 AM, Dan Rue <dan.rue@linaro.org> wrote:
>>>>>>
>>>>>> We've noticed that fsgsbase_64 can fail intermittently with the
>>>>>> following error:
>>>>>>
>>>>>>         [RUN]   ARCH_SET_GS(0x0) and clear gs, then schedule to 0x1
>>>>>>                 Before schedule, set selector to 0x1
>>>>>>                 other thread: ARCH_SET_GS(0x1) -- sel is 0x0
>>>>>>         [FAIL]  GS/BASE changed from 0x1/0x0 to 0x0/0x0
>>>>>>
>>>>>> This can be reliably reproduced by running fsgsbase_64 in a loop. i.e.
>>>>>>
>>>>>>     for i in $(seq 1 10000); do ./fsgsbase_64 || break; done
>>>>>>
>>>>>> This problem isn't new - I've reproduced it on latest mainline and every
>>>>>> release going back to v4.12 (I did not try earlier). This was tested on
>>>>>> a Supermicro board with a Xeon E3-1220 as well as an Intel Nuc with an
>>>>>> i3-5010U.
>>>>>>
>>>>>
>>>>> Hmm, I can reproduce it, too.  I'll look in a bit.
>>>>
>>>> I'm triggering a different error, and I think what's going on is that
>>>> the kernel doesn't currently re-save GSBASE when a task switches out
>>>> and that task has save gsbase != 0 and in-register GS == 0.  This is
>>>> arguably a bug, but it's not an infoleak, and fixing it could be a wee
>>>> bit expensive.  I'm not sure what, if anything, to do about this.  I
>>>> suppose I could add some gross perf hackery to the test to detect this
>>>> case and suppress the error.
>>>>
>>>> I can also trigger the problem you're seeing, and I don't know what's
>>>> up.  It may be related to and old problem I've seen that causes signal
>>>> delivery to sometimes corrupt %gs.  It's deterministic, but it depends
>>>> in some odd way on register state.  I can currently reproduce that
>>>> issue 100% of the time, and I'm trying to see if I can figure out
>>>> what's happening.
>>>
>>> I think it's a CPU bug, and I'm a bit mystified.  I can trigger the
>>> following, plausibly related issue:
>>>
>>> Write a program that writes %gs = 1.
>>> Run that program under gdb
>>> break in which %gs == 1
>>> display/x $gs
>>> si
>>>
>>> Under QEMU TCG, gs stays equal to 1.  On native or KVM, on Skylake, it
>>> changes to 0.
>>>
>>> On KVM or native, I do not observe do_debug getting called with %gs ==
>>> 1.  On TCG, I do.  I don't think that's precisely the problem that's
>>> causing the test to fail, since the test doesn't use TF or ptrace, but
>>> I wouldn't be shocked if it's related.
>>>
>>> hpa, any insight?
>>>
>>> (NB: if you want to play with this as I've described it, you may need
>>> to make invalid_selector() in ptrace.c always return false.  The
>>> current implementation is too strict and causes problems.)
>>
>> Much simpler test.  Run the attached program (gs1).  It more or less
>> just sets %gs to 1 and spins until it stops being 1.  Do it on a
>> kernel with the attached patch applied.  I see stuff like this:
>>
>> # ./gs1
>> PID = 129
>> [   15.703015] pid 129 saved gs = 1
>> [   15.703517] pid 129 loaded gs = 1
>> [   15.703973] pid 129 prepare_exit_to_usermode: gs = 1
>> ax = 0, cx = 0, dx = 0
>>
>> So we're interrupting the program, switching out, switching back in,
>> setting %gs to 1, observing that %gs is *still* 1 in
>> prepare_exit_to_usermode(), returning to usermode, and observing %gs
>> == 0.
>>
>> Presumably what's happening is that the IRET microcode matches the
>> SDM's pseudocode, which says:
>>
>> RETURN-TO-OUTER-PRIVILEGE-LEVEL:
>> ...
>> FOR each SegReg in (ES, FS, GS, and DS)
>>   DO
>>     tempDesc ← descriptor cache for SegReg (* hidden part of segment register *)
>>     IF tempDesc(DPL) < CPL AND tempDesc(Type) is data or non-conforming code
>>     THEN (* Segment register invalid *)
>>       SegReg ← NULL;
>>     FI;
>>   OD;
>>
>> But this is very odd.  The actual permission checks (in the docs for MOV) are:
>>
>> IF DS, ES, FS, or GS is loaded with non-NULL selector
>> THEN
>>   IF segment selector index is outside descriptor table limits
>>   or segment is not a data or readable code segment
>>   or ((segment is a data or nonconforming code segment)
>>   or ((RPL > DPL) and (CPL > DPL))
>>     THEN #GP(selector); FI;
>>
>> ^^^^
>> This makes no sense.  This says that the data segments cannot be
>> loaded with MOV.  Empirically, it seems like MOV works if CPL <= DPL
>> and RPL <= DPL, but I haven't checked that hard.
>
> Surely Intel meant:
>
> ... or ((segment is a data segment or nonconforming code segment) and
> ((RPL > DPL) or (CPL > DPL))
>
> This would be consistent with the AMD APM #GP condition of "The DS,
> ES, FS, or GS register was loaded and the segment pointed to was a
> data or non-conforming code segment, but the RPL or CPL was greater
> than the DPL."
>
>>
>>   IF segment not marked present
>>     THEN #NP(selector);
>>   ELSE
>>     SegmentRegister ← segment selector;
>>     SegmentRegister ← segment descriptor; FI;
>>   FI;
>>
>>   IF DS, ES, FS, or GS is loaded with NULL selector
>>   THEN
>>     SegmentRegister ← segment selector;
>>     SegmentRegister ← segment descriptor;
>>     ^^^^
>>     wtf?  There is no "segment descriptor".  Presumably what actually
>> gets written to segment.DPL is nonsense.
>>   FI;
>
> I think the bug is here.  I think that, when writing a NULL selector
> to DS, ES, FS, or GS, Intel CPUs incorrectly set DPL == RPL, whereas
> they should set DPL to 3.

As an experiment, I did this:

 DEFINE_PER_CPU_PAGE_ALIGNED(struct gdt_page, gdt_page) = { .gdt = {
+       [0] = { .dpl = 3, },
+

This had no apparent effect.  I was hoping that maybe loading NULL
into a selector would copy DPL from from gdt[0], but it seems like it
doesn't.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: selftests/x86/fsgsbase_64 test problem
  2018-01-26 22:56         ` Borislav Petkov
@ 2018-01-28 19:21           ` Andy Lutomirski
  0 siblings, 0 replies; 16+ messages in thread
From: Andy Lutomirski @ 2018-01-28 19:21 UTC (permalink / raw)
  To: Borislav Petkov
  Cc: Andy Lutomirski, H. Peter Anvin, Dan Rue, Shuah Khan,
	Ingo Molnar, Dmitry Safonov, open list:KERNEL SELFTEST FRAMEWORK,
	LKML

On Fri, Jan 26, 2018 at 2:56 PM, Borislav Petkov <bp@alien8.de> wrote:
> On Fri, Jan 26, 2018 at 02:38:28PM -0800, Andy Lutomirski wrote:
>> Borislav, any chance you could run the attached program on an AMD
>> machine to see what it does?
>
> [boris@pd: /tmp> ./gs1
> PID = 3420
> ax = 0, cx = 0, dx = 0
>
> This is on 4.15-rc7-ish + tip/master from 2 wks ago.

Phooey :(

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: selftests/x86/fsgsbase_64 test problem
  2018-01-28 19:21           ` Andy Lutomirski
@ 2018-01-29  9:13             ` H. Peter Anvin
  2018-01-29 16:37               ` Andy Lutomirski
  0 siblings, 1 reply; 16+ messages in thread
From: H. Peter Anvin @ 2018-01-29  9:13 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Borislav Petkov, Dan Rue, Shuah Khan, Ingo Molnar,
	Dmitry Safonov, open list:KERNEL SELFTEST FRAMEWORK, LKML

On 01/28/18 11:21, Andy Lutomirski wrote:
>>
>> I think the bug is here.  I think that, when writing a NULL selector
>> to DS, ES, FS, or GS, Intel CPUs incorrectly set DPL == RPL, whereas
>> they should set DPL to 3.
> 
> As an experiment, I did this:
> 
>  DEFINE_PER_CPU_PAGE_ALIGNED(struct gdt_page, gdt_page) = { .gdt = {
> +       [0] = { .dpl = 3, },
> +
> 
> This had no apparent effect.  I was hoping that maybe loading NULL
> into a selector would copy DPL from from gdt[0], but it seems like it
> doesn't.
> 

GDT[0] doesn't actually exist.  It is pretty much scratch space (I have
suggested using it for the gsbase once all those issues get sorted out,
because it lets the paranoid code do something like:

	rdgsbase %rax
	push %rax	/* Save old gsbase */
	push %rax	/* Reserve space on stack */
	sgdt -2(%rsp)	/* We don't care about the limit */
	pop %rax	/* %rax <- gdtbase */
	mov (%rax),%rax	/* GDT[0] holds the gsbase for this cpu */
	wrgsbase %rax

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: selftests/x86/fsgsbase_64 test problem
  2018-01-29  9:13             ` H. Peter Anvin
@ 2018-01-29 16:37               ` Andy Lutomirski
  2018-01-29 18:12                 ` H. Peter Anvin
  0 siblings, 1 reply; 16+ messages in thread
From: Andy Lutomirski @ 2018-01-29 16:37 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Andy Lutomirski, Borislav Petkov, Dan Rue, Shuah Khan,
	Ingo Molnar, Dmitry Safonov, open list:KERNEL SELFTEST FRAMEWORK,
	LKML

On Mon, Jan 29, 2018 at 1:13 AM, H. Peter Anvin <hpa@zytor.com> wrote:
> On 01/28/18 11:21, Andy Lutomirski wrote:
>>>
>>> I think the bug is here.  I think that, when writing a NULL selector
>>> to DS, ES, FS, or GS, Intel CPUs incorrectly set DPL == RPL, whereas
>>> they should set DPL to 3.
>>
>> As an experiment, I did this:
>>
>>  DEFINE_PER_CPU_PAGE_ALIGNED(struct gdt_page, gdt_page) = { .gdt = {
>> +       [0] = { .dpl = 3, },
>> +
>>
>> This had no apparent effect.  I was hoping that maybe loading NULL
>> into a selector would copy DPL from from gdt[0], but it seems like it
>> doesn't.
>>
>
> GDT[0] doesn't actually exist.

That's what I thought, too, and the SDM does say that, but the SDM
says all kinds of not-quite-correct things about segmentation.

> It is pretty much scratch space (I have
> suggested using it for the gsbase once all those issues get sorted out,
> because it lets the paranoid code do something like:
>
>         rdgsbase %rax
>         push %rax       /* Save old gsbase */
>         push %rax       /* Reserve space on stack */
>         sgdt -2(%rsp)   /* We don't care about the limit */
>         pop %rax        /* %rax <- gdtbase */
>         mov (%rax),%rax /* GDT[0] holds the gsbase for this cpu */
>         wrgsbase %rax

That will utterly suck on non-UMIP machines that have
hypervisor-provided UMIP emulation.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: selftests/x86/fsgsbase_64 test problem
  2018-01-29 16:37               ` Andy Lutomirski
@ 2018-01-29 18:12                 ` H. Peter Anvin
  2018-01-29 18:26                   ` Andy Lutomirski
  0 siblings, 1 reply; 16+ messages in thread
From: H. Peter Anvin @ 2018-01-29 18:12 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Borislav Petkov, Dan Rue, Shuah Khan, Ingo Molnar,
	Dmitry Safonov, open list:KERNEL SELFTEST FRAMEWORK, LKML

On 01/29/18 08:37, Andy Lutomirski wrote:
> 
> That's what I thought, too, and the SDM does say that, but the SDM
> says all kinds of not-quite-correct things about segmentation.
> 
>> It is pretty much scratch space (I have
>> suggested using it for the gsbase once all those issues get sorted out,
>> because it lets the paranoid code do something like:
>>
>>         rdgsbase %rax
>>         push %rax       /* Save old gsbase */
>>         push %rax       /* Reserve space on stack */
>>         sgdt -2(%rsp)   /* We don't care about the limit */
>>         pop %rax        /* %rax <- gdtbase */
>>         mov (%rax),%rax /* GDT[0] holds the gsbase for this cpu */
>>         wrgsbase %rax
> 
> That will utterly suck on non-UMIP machines that have
> hypervisor-provided UMIP emulation.
> 

Is that a valid thing to optimize for, especially given that paranoid
entries aren't the most common anyway?

	-hpa

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: selftests/x86/fsgsbase_64 test problem
  2018-01-29 18:12                 ` H. Peter Anvin
@ 2018-01-29 18:26                   ` Andy Lutomirski
  2018-01-29 18:30                     ` H. Peter Anvin
  0 siblings, 1 reply; 16+ messages in thread
From: Andy Lutomirski @ 2018-01-29 18:26 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Andy Lutomirski, Borislav Petkov, Dan Rue, Shuah Khan,
	Ingo Molnar, Dmitry Safonov, open list:KERNEL SELFTEST FRAMEWORK,
	LKML



> On Jan 29, 2018, at 10:12 AM, H. Peter Anvin <hpa@zytor.com> wrote:
> 
>> On 01/29/18 08:37, Andy Lutomirski wrote:
>> 
>> That's what I thought, too, and the SDM does say that, but the SDM
>> says all kinds of not-quite-correct things about segmentation.
>> 
>>> It is pretty much scratch space (I have
>>> suggested using it for the gsbase once all those issues get sorted out,
>>> because it lets the paranoid code do something like:
>>> 
>>>        rdgsbase %rax
>>>        push %rax       /* Save old gsbase */
>>>        push %rax       /* Reserve space on stack */
>>>        sgdt -2(%rsp)   /* We don't care about the limit */
>>>        pop %rax        /* %rax <- gdtbase */
>>>        mov (%rax),%rax /* GDT[0] holds the gsbase for this cpu */
>>>        wrgsbase %rax
>> 
>> That will utterly suck on non-UMIP machines that have
>> hypervisor-provided UMIP emulation.
>> 
> 
> Is that a valid thing to optimize for, especially given that paranoid
> entries aren't the most common anyway?
> 

A bunch of people seem to care about NMI performance for perf.  And the current patch set works without this trick.

FWIW, if we switch all entries to the entry text trampoline, we get direct percpu access for free.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: selftests/x86/fsgsbase_64 test problem
  2018-01-29 18:26                   ` Andy Lutomirski
@ 2018-01-29 18:30                     ` H. Peter Anvin
  2018-02-27 22:59                       ` Dan Rue
  0 siblings, 1 reply; 16+ messages in thread
From: H. Peter Anvin @ 2018-01-29 18:30 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andy Lutomirski, Borislav Petkov, Dan Rue, Shuah Khan,
	Ingo Molnar, Dmitry Safonov, open list:KERNEL SELFTEST FRAMEWORK,
	LKML

On 01/29/18 10:26, Andy Lutomirski wrote:
>>>
>>> That will utterly suck on non-UMIP machines that have
>>> hypervisor-provided UMIP emulation.
>>
>> Is that a valid thing to optimize for, especially given that paranoid
>> entries aren't the most common anyway?
> 
> A bunch of people seem to care about NMI performance for perf.
>

That wasn't really the question...

> And the current patch set works without this trick.

But I believe the tricks it uses are fragile.

> FWIW, if we switch all entries to the entry text trampoline, we get direct percpu access for free.

That might be a better option.

	-hpa


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: selftests/x86/fsgsbase_64 test problem
  2018-01-29 18:30                     ` H. Peter Anvin
@ 2018-02-27 22:59                       ` Dan Rue
  0 siblings, 0 replies; 16+ messages in thread
From: Dan Rue @ 2018-02-27 22:59 UTC (permalink / raw)
  To: H. Peter Anvin
  Cc: Andy Lutomirski, Andy Lutomirski, Borislav Petkov, Shuah Khan,
	Ingo Molnar, Dmitry Safonov, open list:KERNEL SELFTEST FRAMEWORK,
	LKML

On Mon, Jan 29, 2018 at 10:30:05AM -0800, H. Peter Anvin wrote:
> On 01/29/18 10:26, Andy Lutomirski wrote:
> >>>
> >>> That will utterly suck on non-UMIP machines that have
> >>> hypervisor-provided UMIP emulation.
> >>
> >> Is that a valid thing to optimize for, especially given that paranoid
> >> entries aren't the most common anyway?
> > 
> > A bunch of people seem to care about NMI performance for perf.
> >
> 
> That wasn't really the question...
> 
> > And the current patch set works without this trick.
> 
> But I believe the tricks it uses are fragile.
> 
> > FWIW, if we switch all entries to the entry text trampoline, we get direct percpu access for free.
> 
> That might be a better option.

Has there been any conclusion to this thread? I can still reproduce the
issue on mainline and next.

Thanks,
Dan

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2018-02-27 23:00 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-01-26 15:36 selftests/x86/fsgsbase_64 test problem Dan Rue
2018-01-26 16:22 ` Andy Lutomirski
2018-01-26 18:59   ` Andy Lutomirski
2018-01-26 19:46     ` Andy Lutomirski
2018-01-26 22:38       ` Andy Lutomirski
2018-01-26 22:42         ` Andy Lutomirski
2018-01-28 19:21           ` Andy Lutomirski
2018-01-29  9:13             ` H. Peter Anvin
2018-01-29 16:37               ` Andy Lutomirski
2018-01-29 18:12                 ` H. Peter Anvin
2018-01-29 18:26                   ` Andy Lutomirski
2018-01-29 18:30                     ` H. Peter Anvin
2018-02-27 22:59                       ` Dan Rue
2018-01-26 22:56         ` Borislav Petkov
2018-01-28 19:21           ` Andy Lutomirski
2018-01-26 22:51       ` H. Peter Anvin

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).