* selftests/x86/fsgsbase_64 test problem @ 2018-01-26 15:36 Dan Rue 2018-01-26 16:22 ` Andy Lutomirski 0 siblings, 1 reply; 16+ messages in thread From: Dan Rue @ 2018-01-26 15:36 UTC (permalink / raw) To: Shuah Khan, Ingo Molnar, Andy Lutomirski, Dmitry Safonov, Borislav Petkov, linux-kselftest, linux-kernel We've noticed that fsgsbase_64 can fail intermittently with the following error: [RUN] ARCH_SET_GS(0x0) and clear gs, then schedule to 0x1 Before schedule, set selector to 0x1 other thread: ARCH_SET_GS(0x1) -- sel is 0x0 [FAIL] GS/BASE changed from 0x1/0x0 to 0x0/0x0 This can be reliably reproduced by running fsgsbase_64 in a loop. i.e. for i in $(seq 1 10000); do ./fsgsbase_64 || break; done This problem isn't new - I've reproduced it on latest mainline and every release going back to v4.12 (I did not try earlier). This was tested on a Supermicro board with a Xeon E3-1220 as well as an Intel Nuc with an i3-5010U. Thanks, Dan ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: selftests/x86/fsgsbase_64 test problem 2018-01-26 15:36 selftests/x86/fsgsbase_64 test problem Dan Rue @ 2018-01-26 16:22 ` Andy Lutomirski 2018-01-26 18:59 ` Andy Lutomirski 0 siblings, 1 reply; 16+ messages in thread From: Andy Lutomirski @ 2018-01-26 16:22 UTC (permalink / raw) To: Dan Rue Cc: Shuah Khan, Ingo Molnar, Andy Lutomirski, Dmitry Safonov, Borislav Petkov, open list:KERNEL SELFTEST FRAMEWORK, LKML On Fri, Jan 26, 2018 at 7:36 AM, Dan Rue <dan.rue@linaro.org> wrote: > > We've noticed that fsgsbase_64 can fail intermittently with the > following error: > > [RUN] ARCH_SET_GS(0x0) and clear gs, then schedule to 0x1 > Before schedule, set selector to 0x1 > other thread: ARCH_SET_GS(0x1) -- sel is 0x0 > [FAIL] GS/BASE changed from 0x1/0x0 to 0x0/0x0 > > This can be reliably reproduced by running fsgsbase_64 in a loop. i.e. > > for i in $(seq 1 10000); do ./fsgsbase_64 || break; done > > This problem isn't new - I've reproduced it on latest mainline and every > release going back to v4.12 (I did not try earlier). This was tested on > a Supermicro board with a Xeon E3-1220 as well as an Intel Nuc with an > i3-5010U. > Hmm, I can reproduce it, too. I'll look in a bit. ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: selftests/x86/fsgsbase_64 test problem 2018-01-26 16:22 ` Andy Lutomirski @ 2018-01-26 18:59 ` Andy Lutomirski 2018-01-26 19:46 ` Andy Lutomirski 0 siblings, 1 reply; 16+ messages in thread From: Andy Lutomirski @ 2018-01-26 18:59 UTC (permalink / raw) To: Andy Lutomirski Cc: Dan Rue, Shuah Khan, Ingo Molnar, Dmitry Safonov, Borislav Petkov, open list:KERNEL SELFTEST FRAMEWORK, LKML On Fri, Jan 26, 2018 at 8:22 AM, Andy Lutomirski <luto@kernel.org> wrote: > On Fri, Jan 26, 2018 at 7:36 AM, Dan Rue <dan.rue@linaro.org> wrote: >> >> We've noticed that fsgsbase_64 can fail intermittently with the >> following error: >> >> [RUN] ARCH_SET_GS(0x0) and clear gs, then schedule to 0x1 >> Before schedule, set selector to 0x1 >> other thread: ARCH_SET_GS(0x1) -- sel is 0x0 >> [FAIL] GS/BASE changed from 0x1/0x0 to 0x0/0x0 >> >> This can be reliably reproduced by running fsgsbase_64 in a loop. i.e. >> >> for i in $(seq 1 10000); do ./fsgsbase_64 || break; done >> >> This problem isn't new - I've reproduced it on latest mainline and every >> release going back to v4.12 (I did not try earlier). This was tested on >> a Supermicro board with a Xeon E3-1220 as well as an Intel Nuc with an >> i3-5010U. >> > > Hmm, I can reproduce it, too. I'll look in a bit. I'm triggering a different error, and I think what's going on is that the kernel doesn't currently re-save GSBASE when a task switches out and that task has save gsbase != 0 and in-register GS == 0. This is arguably a bug, but it's not an infoleak, and fixing it could be a wee bit expensive. I'm not sure what, if anything, to do about this. I suppose I could add some gross perf hackery to the test to detect this case and suppress the error. I can also trigger the problem you're seeing, and I don't know what's up. It may be related to and old problem I've seen that causes signal delivery to sometimes corrupt %gs. It's deterministic, but it depends in some odd way on register state. I can currently reproduce that issue 100% of the time, and I'm trying to see if I can figure out what's happening. ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: selftests/x86/fsgsbase_64 test problem 2018-01-26 18:59 ` Andy Lutomirski @ 2018-01-26 19:46 ` Andy Lutomirski 2018-01-26 22:38 ` Andy Lutomirski 2018-01-26 22:51 ` H. Peter Anvin 0 siblings, 2 replies; 16+ messages in thread From: Andy Lutomirski @ 2018-01-26 19:46 UTC (permalink / raw) To: Andy Lutomirski, H. Peter Anvin Cc: Dan Rue, Shuah Khan, Ingo Molnar, Dmitry Safonov, Borislav Petkov, open list:KERNEL SELFTEST FRAMEWORK, LKML On Fri, Jan 26, 2018 at 10:59 AM, Andy Lutomirski <luto@kernel.org> wrote: > On Fri, Jan 26, 2018 at 8:22 AM, Andy Lutomirski <luto@kernel.org> wrote: >> On Fri, Jan 26, 2018 at 7:36 AM, Dan Rue <dan.rue@linaro.org> wrote: >>> >>> We've noticed that fsgsbase_64 can fail intermittently with the >>> following error: >>> >>> [RUN] ARCH_SET_GS(0x0) and clear gs, then schedule to 0x1 >>> Before schedule, set selector to 0x1 >>> other thread: ARCH_SET_GS(0x1) -- sel is 0x0 >>> [FAIL] GS/BASE changed from 0x1/0x0 to 0x0/0x0 >>> >>> This can be reliably reproduced by running fsgsbase_64 in a loop. i.e. >>> >>> for i in $(seq 1 10000); do ./fsgsbase_64 || break; done >>> >>> This problem isn't new - I've reproduced it on latest mainline and every >>> release going back to v4.12 (I did not try earlier). This was tested on >>> a Supermicro board with a Xeon E3-1220 as well as an Intel Nuc with an >>> i3-5010U. >>> >> >> Hmm, I can reproduce it, too. I'll look in a bit. > > I'm triggering a different error, and I think what's going on is that > the kernel doesn't currently re-save GSBASE when a task switches out > and that task has save gsbase != 0 and in-register GS == 0. This is > arguably a bug, but it's not an infoleak, and fixing it could be a wee > bit expensive. I'm not sure what, if anything, to do about this. I > suppose I could add some gross perf hackery to the test to detect this > case and suppress the error. > > I can also trigger the problem you're seeing, and I don't know what's > up. It may be related to and old problem I've seen that causes signal > delivery to sometimes corrupt %gs. It's deterministic, but it depends > in some odd way on register state. I can currently reproduce that > issue 100% of the time, and I'm trying to see if I can figure out > what's happening. I think it's a CPU bug, and I'm a bit mystified. I can trigger the following, plausibly related issue: Write a program that writes %gs = 1. Run that program under gdb break in which %gs == 1 display/x $gs si Under QEMU TCG, gs stays equal to 1. On native or KVM, on Skylake, it changes to 0. On KVM or native, I do not observe do_debug getting called with %gs == 1. On TCG, I do. I don't think that's precisely the problem that's causing the test to fail, since the test doesn't use TF or ptrace, but I wouldn't be shocked if it's related. hpa, any insight? (NB: if you want to play with this as I've described it, you may need to make invalid_selector() in ptrace.c always return false. The current implementation is too strict and causes problems.) ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: selftests/x86/fsgsbase_64 test problem 2018-01-26 19:46 ` Andy Lutomirski @ 2018-01-26 22:38 ` Andy Lutomirski 2018-01-26 22:42 ` Andy Lutomirski 2018-01-26 22:56 ` Borislav Petkov 2018-01-26 22:51 ` H. Peter Anvin 1 sibling, 2 replies; 16+ messages in thread From: Andy Lutomirski @ 2018-01-26 22:38 UTC (permalink / raw) To: Andy Lutomirski, Borislav Petkov Cc: H. Peter Anvin, Dan Rue, Shuah Khan, Ingo Molnar, Dmitry Safonov, open list:KERNEL SELFTEST FRAMEWORK, LKML [-- Attachment #1: Type: text/plain, Size: 5170 bytes --] On Fri, Jan 26, 2018 at 11:46 AM, Andy Lutomirski <luto@kernel.org> wrote: > On Fri, Jan 26, 2018 at 10:59 AM, Andy Lutomirski <luto@kernel.org> wrote: >> On Fri, Jan 26, 2018 at 8:22 AM, Andy Lutomirski <luto@kernel.org> wrote: >>> On Fri, Jan 26, 2018 at 7:36 AM, Dan Rue <dan.rue@linaro.org> wrote: >>>> >>>> We've noticed that fsgsbase_64 can fail intermittently with the >>>> following error: >>>> >>>> [RUN] ARCH_SET_GS(0x0) and clear gs, then schedule to 0x1 >>>> Before schedule, set selector to 0x1 >>>> other thread: ARCH_SET_GS(0x1) -- sel is 0x0 >>>> [FAIL] GS/BASE changed from 0x1/0x0 to 0x0/0x0 >>>> >>>> This can be reliably reproduced by running fsgsbase_64 in a loop. i.e. >>>> >>>> for i in $(seq 1 10000); do ./fsgsbase_64 || break; done >>>> >>>> This problem isn't new - I've reproduced it on latest mainline and every >>>> release going back to v4.12 (I did not try earlier). This was tested on >>>> a Supermicro board with a Xeon E3-1220 as well as an Intel Nuc with an >>>> i3-5010U. >>>> >>> >>> Hmm, I can reproduce it, too. I'll look in a bit. >> >> I'm triggering a different error, and I think what's going on is that >> the kernel doesn't currently re-save GSBASE when a task switches out >> and that task has save gsbase != 0 and in-register GS == 0. This is >> arguably a bug, but it's not an infoleak, and fixing it could be a wee >> bit expensive. I'm not sure what, if anything, to do about this. I >> suppose I could add some gross perf hackery to the test to detect this >> case and suppress the error. >> >> I can also trigger the problem you're seeing, and I don't know what's >> up. It may be related to and old problem I've seen that causes signal >> delivery to sometimes corrupt %gs. It's deterministic, but it depends >> in some odd way on register state. I can currently reproduce that >> issue 100% of the time, and I'm trying to see if I can figure out >> what's happening. > > I think it's a CPU bug, and I'm a bit mystified. I can trigger the > following, plausibly related issue: > > Write a program that writes %gs = 1. > Run that program under gdb > break in which %gs == 1 > display/x $gs > si > > Under QEMU TCG, gs stays equal to 1. On native or KVM, on Skylake, it > changes to 0. > > On KVM or native, I do not observe do_debug getting called with %gs == > 1. On TCG, I do. I don't think that's precisely the problem that's > causing the test to fail, since the test doesn't use TF or ptrace, but > I wouldn't be shocked if it's related. > > hpa, any insight? > > (NB: if you want to play with this as I've described it, you may need > to make invalid_selector() in ptrace.c always return false. The > current implementation is too strict and causes problems.) Much simpler test. Run the attached program (gs1). It more or less just sets %gs to 1 and spins until it stops being 1. Do it on a kernel with the attached patch applied. I see stuff like this: # ./gs1 PID = 129 [ 15.703015] pid 129 saved gs = 1 [ 15.703517] pid 129 loaded gs = 1 [ 15.703973] pid 129 prepare_exit_to_usermode: gs = 1 ax = 0, cx = 0, dx = 0 So we're interrupting the program, switching out, switching back in, setting %gs to 1, observing that %gs is *still* 1 in prepare_exit_to_usermode(), returning to usermode, and observing %gs == 0. Presumably what's happening is that the IRET microcode matches the SDM's pseudocode, which says: RETURN-TO-OUTER-PRIVILEGE-LEVEL: ... FOR each SegReg in (ES, FS, GS, and DS) DO tempDesc ← descriptor cache for SegReg (* hidden part of segment register *) IF tempDesc(DPL) < CPL AND tempDesc(Type) is data or non-conforming code THEN (* Segment register invalid *) SegReg ← NULL; FI; OD; But this is very odd. The actual permission checks (in the docs for MOV) are: IF DS, ES, FS, or GS is loaded with non-NULL selector THEN IF segment selector index is outside descriptor table limits or segment is not a data or readable code segment or ((segment is a data or nonconforming code segment) or ((RPL > DPL) and (CPL > DPL)) THEN #GP(selector); FI; ^^^^ This makes no sense. This says that the data segments cannot be loaded with MOV. Empirically, it seems like MOV works if CPL <= DPL and RPL <= DPL, but I haven't checked that hard. IF segment not marked present THEN #NP(selector); ELSE SegmentRegister ← segment selector; SegmentRegister ← segment descriptor; FI; FI; IF DS, ES, FS, or GS is loaded with NULL selector THEN SegmentRegister ← segment selector; SegmentRegister ← segment descriptor; ^^^^ wtf? There is no "segment descriptor". Presumably what actually gets written to segment.DPL is nonsense. FI; Anyway, I think it's nonsense that user code can load a selector using MOV that is, in turn, rejected by IRET. I don't suppose Intel would consider fixing this going forward. Borislav, any chance you could run the attached program on an AMD machine to see what it does? [-- Attachment #2: gs1.c --] [-- Type: text/x-csrc, Size: 588 bytes --] #include <stdio.h> #include <sys/types.h> #include <unistd.h> int main() { unsigned short ax, cx, dx; printf("PID = %d\n", (int)getpid()); asm volatile ("mov %[one], %%gs\n\t" "1:\n\t" "mov %%gs, %%eax\n\t" "mov %%gs, %%ecx\n\t" "mov %%gs, %%edx\n\t" "cmpw $1, %%ax\n\tjne 2f\n\t" "cmpw $1, %%cx\n\tjne 2f\n\t" "cmpw $1, %%dx\n\tjne 2f\n\t" "jmp 1b\n\t" "2:" : "=a" (ax), "=c" (cx), "=d" (dx) : [one] "rm" ((unsigned short)1)); printf("ax = %hx, cx = %hx, dx = %hx\n", ax, cx, dx); return 0; } ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: selftests/x86/fsgsbase_64 test problem 2018-01-26 22:38 ` Andy Lutomirski @ 2018-01-26 22:42 ` Andy Lutomirski 2018-01-28 19:21 ` Andy Lutomirski 2018-01-26 22:56 ` Borislav Petkov 1 sibling, 1 reply; 16+ messages in thread From: Andy Lutomirski @ 2018-01-26 22:42 UTC (permalink / raw) To: Andy Lutomirski Cc: Borislav Petkov, H. Peter Anvin, Dan Rue, Shuah Khan, Ingo Molnar, Dmitry Safonov, open list:KERNEL SELFTEST FRAMEWORK, LKML On Fri, Jan 26, 2018 at 2:38 PM, Andy Lutomirski <luto@kernel.org> wrote: > On Fri, Jan 26, 2018 at 11:46 AM, Andy Lutomirski <luto@kernel.org> wrote: >> On Fri, Jan 26, 2018 at 10:59 AM, Andy Lutomirski <luto@kernel.org> wrote: >>> On Fri, Jan 26, 2018 at 8:22 AM, Andy Lutomirski <luto@kernel.org> wrote: >>>> On Fri, Jan 26, 2018 at 7:36 AM, Dan Rue <dan.rue@linaro.org> wrote: >>>>> >>>>> We've noticed that fsgsbase_64 can fail intermittently with the >>>>> following error: >>>>> >>>>> [RUN] ARCH_SET_GS(0x0) and clear gs, then schedule to 0x1 >>>>> Before schedule, set selector to 0x1 >>>>> other thread: ARCH_SET_GS(0x1) -- sel is 0x0 >>>>> [FAIL] GS/BASE changed from 0x1/0x0 to 0x0/0x0 >>>>> >>>>> This can be reliably reproduced by running fsgsbase_64 in a loop. i.e. >>>>> >>>>> for i in $(seq 1 10000); do ./fsgsbase_64 || break; done >>>>> >>>>> This problem isn't new - I've reproduced it on latest mainline and every >>>>> release going back to v4.12 (I did not try earlier). This was tested on >>>>> a Supermicro board with a Xeon E3-1220 as well as an Intel Nuc with an >>>>> i3-5010U. >>>>> >>>> >>>> Hmm, I can reproduce it, too. I'll look in a bit. >>> >>> I'm triggering a different error, and I think what's going on is that >>> the kernel doesn't currently re-save GSBASE when a task switches out >>> and that task has save gsbase != 0 and in-register GS == 0. This is >>> arguably a bug, but it's not an infoleak, and fixing it could be a wee >>> bit expensive. I'm not sure what, if anything, to do about this. I >>> suppose I could add some gross perf hackery to the test to detect this >>> case and suppress the error. >>> >>> I can also trigger the problem you're seeing, and I don't know what's >>> up. It may be related to and old problem I've seen that causes signal >>> delivery to sometimes corrupt %gs. It's deterministic, but it depends >>> in some odd way on register state. I can currently reproduce that >>> issue 100% of the time, and I'm trying to see if I can figure out >>> what's happening. >> >> I think it's a CPU bug, and I'm a bit mystified. I can trigger the >> following, plausibly related issue: >> >> Write a program that writes %gs = 1. >> Run that program under gdb >> break in which %gs == 1 >> display/x $gs >> si >> >> Under QEMU TCG, gs stays equal to 1. On native or KVM, on Skylake, it >> changes to 0. >> >> On KVM or native, I do not observe do_debug getting called with %gs == >> 1. On TCG, I do. I don't think that's precisely the problem that's >> causing the test to fail, since the test doesn't use TF or ptrace, but >> I wouldn't be shocked if it's related. >> >> hpa, any insight? >> >> (NB: if you want to play with this as I've described it, you may need >> to make invalid_selector() in ptrace.c always return false. The >> current implementation is too strict and causes problems.) > > Much simpler test. Run the attached program (gs1). It more or less > just sets %gs to 1 and spins until it stops being 1. Do it on a > kernel with the attached patch applied. I see stuff like this: > > # ./gs1 > PID = 129 > [ 15.703015] pid 129 saved gs = 1 > [ 15.703517] pid 129 loaded gs = 1 > [ 15.703973] pid 129 prepare_exit_to_usermode: gs = 1 > ax = 0, cx = 0, dx = 0 > > So we're interrupting the program, switching out, switching back in, > setting %gs to 1, observing that %gs is *still* 1 in > prepare_exit_to_usermode(), returning to usermode, and observing %gs > == 0. > > Presumably what's happening is that the IRET microcode matches the > SDM's pseudocode, which says: > > RETURN-TO-OUTER-PRIVILEGE-LEVEL: > ... > FOR each SegReg in (ES, FS, GS, and DS) > DO > tempDesc ← descriptor cache for SegReg (* hidden part of segment register *) > IF tempDesc(DPL) < CPL AND tempDesc(Type) is data or non-conforming code > THEN (* Segment register invalid *) > SegReg ← NULL; > FI; > OD; > > But this is very odd. The actual permission checks (in the docs for MOV) are: > > IF DS, ES, FS, or GS is loaded with non-NULL selector > THEN > IF segment selector index is outside descriptor table limits > or segment is not a data or readable code segment > or ((segment is a data or nonconforming code segment) > or ((RPL > DPL) and (CPL > DPL)) > THEN #GP(selector); FI; > > ^^^^ > This makes no sense. This says that the data segments cannot be > loaded with MOV. Empirically, it seems like MOV works if CPL <= DPL > and RPL <= DPL, but I haven't checked that hard. Surely Intel meant: ... or ((segment is a data segment or nonconforming code segment) and ((RPL > DPL) or (CPL > DPL)) This would be consistent with the AMD APM #GP condition of "The DS, ES, FS, or GS register was loaded and the segment pointed to was a data or non-conforming code segment, but the RPL or CPL was greater than the DPL." > > IF segment not marked present > THEN #NP(selector); > ELSE > SegmentRegister ← segment selector; > SegmentRegister ← segment descriptor; FI; > FI; > > IF DS, ES, FS, or GS is loaded with NULL selector > THEN > SegmentRegister ← segment selector; > SegmentRegister ← segment descriptor; > ^^^^ > wtf? There is no "segment descriptor". Presumably what actually > gets written to segment.DPL is nonsense. > FI; I think the bug is here. I think that, when writing a NULL selector to DS, ES, FS, or GS, Intel CPUs incorrectly set DPL == RPL, whereas they should set DPL to 3. > > Anyway, I think it's nonsense that user code can load a selector using > MOV that is, in turn, rejected by IRET. I don't suppose Intel would > consider fixing this going forward. > > Borislav, any chance you could run the attached program on an AMD > machine to see what it does? ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: selftests/x86/fsgsbase_64 test problem 2018-01-26 22:42 ` Andy Lutomirski @ 2018-01-28 19:21 ` Andy Lutomirski 2018-01-29 9:13 ` H. Peter Anvin 0 siblings, 1 reply; 16+ messages in thread From: Andy Lutomirski @ 2018-01-28 19:21 UTC (permalink / raw) To: Andy Lutomirski Cc: Borislav Petkov, H. Peter Anvin, Dan Rue, Shuah Khan, Ingo Molnar, Dmitry Safonov, open list:KERNEL SELFTEST FRAMEWORK, LKML On Fri, Jan 26, 2018 at 2:42 PM, Andy Lutomirski <luto@kernel.org> wrote: > On Fri, Jan 26, 2018 at 2:38 PM, Andy Lutomirski <luto@kernel.org> wrote: >> On Fri, Jan 26, 2018 at 11:46 AM, Andy Lutomirski <luto@kernel.org> wrote: >>> On Fri, Jan 26, 2018 at 10:59 AM, Andy Lutomirski <luto@kernel.org> wrote: >>>> On Fri, Jan 26, 2018 at 8:22 AM, Andy Lutomirski <luto@kernel.org> wrote: >>>>> On Fri, Jan 26, 2018 at 7:36 AM, Dan Rue <dan.rue@linaro.org> wrote: >>>>>> >>>>>> We've noticed that fsgsbase_64 can fail intermittently with the >>>>>> following error: >>>>>> >>>>>> [RUN] ARCH_SET_GS(0x0) and clear gs, then schedule to 0x1 >>>>>> Before schedule, set selector to 0x1 >>>>>> other thread: ARCH_SET_GS(0x1) -- sel is 0x0 >>>>>> [FAIL] GS/BASE changed from 0x1/0x0 to 0x0/0x0 >>>>>> >>>>>> This can be reliably reproduced by running fsgsbase_64 in a loop. i.e. >>>>>> >>>>>> for i in $(seq 1 10000); do ./fsgsbase_64 || break; done >>>>>> >>>>>> This problem isn't new - I've reproduced it on latest mainline and every >>>>>> release going back to v4.12 (I did not try earlier). This was tested on >>>>>> a Supermicro board with a Xeon E3-1220 as well as an Intel Nuc with an >>>>>> i3-5010U. >>>>>> >>>>> >>>>> Hmm, I can reproduce it, too. I'll look in a bit. >>>> >>>> I'm triggering a different error, and I think what's going on is that >>>> the kernel doesn't currently re-save GSBASE when a task switches out >>>> and that task has save gsbase != 0 and in-register GS == 0. This is >>>> arguably a bug, but it's not an infoleak, and fixing it could be a wee >>>> bit expensive. I'm not sure what, if anything, to do about this. I >>>> suppose I could add some gross perf hackery to the test to detect this >>>> case and suppress the error. >>>> >>>> I can also trigger the problem you're seeing, and I don't know what's >>>> up. It may be related to and old problem I've seen that causes signal >>>> delivery to sometimes corrupt %gs. It's deterministic, but it depends >>>> in some odd way on register state. I can currently reproduce that >>>> issue 100% of the time, and I'm trying to see if I can figure out >>>> what's happening. >>> >>> I think it's a CPU bug, and I'm a bit mystified. I can trigger the >>> following, plausibly related issue: >>> >>> Write a program that writes %gs = 1. >>> Run that program under gdb >>> break in which %gs == 1 >>> display/x $gs >>> si >>> >>> Under QEMU TCG, gs stays equal to 1. On native or KVM, on Skylake, it >>> changes to 0. >>> >>> On KVM or native, I do not observe do_debug getting called with %gs == >>> 1. On TCG, I do. I don't think that's precisely the problem that's >>> causing the test to fail, since the test doesn't use TF or ptrace, but >>> I wouldn't be shocked if it's related. >>> >>> hpa, any insight? >>> >>> (NB: if you want to play with this as I've described it, you may need >>> to make invalid_selector() in ptrace.c always return false. The >>> current implementation is too strict and causes problems.) >> >> Much simpler test. Run the attached program (gs1). It more or less >> just sets %gs to 1 and spins until it stops being 1. Do it on a >> kernel with the attached patch applied. I see stuff like this: >> >> # ./gs1 >> PID = 129 >> [ 15.703015] pid 129 saved gs = 1 >> [ 15.703517] pid 129 loaded gs = 1 >> [ 15.703973] pid 129 prepare_exit_to_usermode: gs = 1 >> ax = 0, cx = 0, dx = 0 >> >> So we're interrupting the program, switching out, switching back in, >> setting %gs to 1, observing that %gs is *still* 1 in >> prepare_exit_to_usermode(), returning to usermode, and observing %gs >> == 0. >> >> Presumably what's happening is that the IRET microcode matches the >> SDM's pseudocode, which says: >> >> RETURN-TO-OUTER-PRIVILEGE-LEVEL: >> ... >> FOR each SegReg in (ES, FS, GS, and DS) >> DO >> tempDesc ← descriptor cache for SegReg (* hidden part of segment register *) >> IF tempDesc(DPL) < CPL AND tempDesc(Type) is data or non-conforming code >> THEN (* Segment register invalid *) >> SegReg ← NULL; >> FI; >> OD; >> >> But this is very odd. The actual permission checks (in the docs for MOV) are: >> >> IF DS, ES, FS, or GS is loaded with non-NULL selector >> THEN >> IF segment selector index is outside descriptor table limits >> or segment is not a data or readable code segment >> or ((segment is a data or nonconforming code segment) >> or ((RPL > DPL) and (CPL > DPL)) >> THEN #GP(selector); FI; >> >> ^^^^ >> This makes no sense. This says that the data segments cannot be >> loaded with MOV. Empirically, it seems like MOV works if CPL <= DPL >> and RPL <= DPL, but I haven't checked that hard. > > Surely Intel meant: > > ... or ((segment is a data segment or nonconforming code segment) and > ((RPL > DPL) or (CPL > DPL)) > > This would be consistent with the AMD APM #GP condition of "The DS, > ES, FS, or GS register was loaded and the segment pointed to was a > data or non-conforming code segment, but the RPL or CPL was greater > than the DPL." > >> >> IF segment not marked present >> THEN #NP(selector); >> ELSE >> SegmentRegister ← segment selector; >> SegmentRegister ← segment descriptor; FI; >> FI; >> >> IF DS, ES, FS, or GS is loaded with NULL selector >> THEN >> SegmentRegister ← segment selector; >> SegmentRegister ← segment descriptor; >> ^^^^ >> wtf? There is no "segment descriptor". Presumably what actually >> gets written to segment.DPL is nonsense. >> FI; > > I think the bug is here. I think that, when writing a NULL selector > to DS, ES, FS, or GS, Intel CPUs incorrectly set DPL == RPL, whereas > they should set DPL to 3. As an experiment, I did this: DEFINE_PER_CPU_PAGE_ALIGNED(struct gdt_page, gdt_page) = { .gdt = { + [0] = { .dpl = 3, }, + This had no apparent effect. I was hoping that maybe loading NULL into a selector would copy DPL from from gdt[0], but it seems like it doesn't. ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: selftests/x86/fsgsbase_64 test problem 2018-01-28 19:21 ` Andy Lutomirski @ 2018-01-29 9:13 ` H. Peter Anvin 2018-01-29 16:37 ` Andy Lutomirski 0 siblings, 1 reply; 16+ messages in thread From: H. Peter Anvin @ 2018-01-29 9:13 UTC (permalink / raw) To: Andy Lutomirski Cc: Borislav Petkov, Dan Rue, Shuah Khan, Ingo Molnar, Dmitry Safonov, open list:KERNEL SELFTEST FRAMEWORK, LKML On 01/28/18 11:21, Andy Lutomirski wrote: >> >> I think the bug is here. I think that, when writing a NULL selector >> to DS, ES, FS, or GS, Intel CPUs incorrectly set DPL == RPL, whereas >> they should set DPL to 3. > > As an experiment, I did this: > > DEFINE_PER_CPU_PAGE_ALIGNED(struct gdt_page, gdt_page) = { .gdt = { > + [0] = { .dpl = 3, }, > + > > This had no apparent effect. I was hoping that maybe loading NULL > into a selector would copy DPL from from gdt[0], but it seems like it > doesn't. > GDT[0] doesn't actually exist. It is pretty much scratch space (I have suggested using it for the gsbase once all those issues get sorted out, because it lets the paranoid code do something like: rdgsbase %rax push %rax /* Save old gsbase */ push %rax /* Reserve space on stack */ sgdt -2(%rsp) /* We don't care about the limit */ pop %rax /* %rax <- gdtbase */ mov (%rax),%rax /* GDT[0] holds the gsbase for this cpu */ wrgsbase %rax ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: selftests/x86/fsgsbase_64 test problem 2018-01-29 9:13 ` H. Peter Anvin @ 2018-01-29 16:37 ` Andy Lutomirski 2018-01-29 18:12 ` H. Peter Anvin 0 siblings, 1 reply; 16+ messages in thread From: Andy Lutomirski @ 2018-01-29 16:37 UTC (permalink / raw) To: H. Peter Anvin Cc: Andy Lutomirski, Borislav Petkov, Dan Rue, Shuah Khan, Ingo Molnar, Dmitry Safonov, open list:KERNEL SELFTEST FRAMEWORK, LKML On Mon, Jan 29, 2018 at 1:13 AM, H. Peter Anvin <hpa@zytor.com> wrote: > On 01/28/18 11:21, Andy Lutomirski wrote: >>> >>> I think the bug is here. I think that, when writing a NULL selector >>> to DS, ES, FS, or GS, Intel CPUs incorrectly set DPL == RPL, whereas >>> they should set DPL to 3. >> >> As an experiment, I did this: >> >> DEFINE_PER_CPU_PAGE_ALIGNED(struct gdt_page, gdt_page) = { .gdt = { >> + [0] = { .dpl = 3, }, >> + >> >> This had no apparent effect. I was hoping that maybe loading NULL >> into a selector would copy DPL from from gdt[0], but it seems like it >> doesn't. >> > > GDT[0] doesn't actually exist. That's what I thought, too, and the SDM does say that, but the SDM says all kinds of not-quite-correct things about segmentation. > It is pretty much scratch space (I have > suggested using it for the gsbase once all those issues get sorted out, > because it lets the paranoid code do something like: > > rdgsbase %rax > push %rax /* Save old gsbase */ > push %rax /* Reserve space on stack */ > sgdt -2(%rsp) /* We don't care about the limit */ > pop %rax /* %rax <- gdtbase */ > mov (%rax),%rax /* GDT[0] holds the gsbase for this cpu */ > wrgsbase %rax That will utterly suck on non-UMIP machines that have hypervisor-provided UMIP emulation. ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: selftests/x86/fsgsbase_64 test problem 2018-01-29 16:37 ` Andy Lutomirski @ 2018-01-29 18:12 ` H. Peter Anvin 2018-01-29 18:26 ` Andy Lutomirski 0 siblings, 1 reply; 16+ messages in thread From: H. Peter Anvin @ 2018-01-29 18:12 UTC (permalink / raw) To: Andy Lutomirski Cc: Borislav Petkov, Dan Rue, Shuah Khan, Ingo Molnar, Dmitry Safonov, open list:KERNEL SELFTEST FRAMEWORK, LKML On 01/29/18 08:37, Andy Lutomirski wrote: > > That's what I thought, too, and the SDM does say that, but the SDM > says all kinds of not-quite-correct things about segmentation. > >> It is pretty much scratch space (I have >> suggested using it for the gsbase once all those issues get sorted out, >> because it lets the paranoid code do something like: >> >> rdgsbase %rax >> push %rax /* Save old gsbase */ >> push %rax /* Reserve space on stack */ >> sgdt -2(%rsp) /* We don't care about the limit */ >> pop %rax /* %rax <- gdtbase */ >> mov (%rax),%rax /* GDT[0] holds the gsbase for this cpu */ >> wrgsbase %rax > > That will utterly suck on non-UMIP machines that have > hypervisor-provided UMIP emulation. > Is that a valid thing to optimize for, especially given that paranoid entries aren't the most common anyway? -hpa ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: selftests/x86/fsgsbase_64 test problem 2018-01-29 18:12 ` H. Peter Anvin @ 2018-01-29 18:26 ` Andy Lutomirski 2018-01-29 18:30 ` H. Peter Anvin 0 siblings, 1 reply; 16+ messages in thread From: Andy Lutomirski @ 2018-01-29 18:26 UTC (permalink / raw) To: H. Peter Anvin Cc: Andy Lutomirski, Borislav Petkov, Dan Rue, Shuah Khan, Ingo Molnar, Dmitry Safonov, open list:KERNEL SELFTEST FRAMEWORK, LKML > On Jan 29, 2018, at 10:12 AM, H. Peter Anvin <hpa@zytor.com> wrote: > >> On 01/29/18 08:37, Andy Lutomirski wrote: >> >> That's what I thought, too, and the SDM does say that, but the SDM >> says all kinds of not-quite-correct things about segmentation. >> >>> It is pretty much scratch space (I have >>> suggested using it for the gsbase once all those issues get sorted out, >>> because it lets the paranoid code do something like: >>> >>> rdgsbase %rax >>> push %rax /* Save old gsbase */ >>> push %rax /* Reserve space on stack */ >>> sgdt -2(%rsp) /* We don't care about the limit */ >>> pop %rax /* %rax <- gdtbase */ >>> mov (%rax),%rax /* GDT[0] holds the gsbase for this cpu */ >>> wrgsbase %rax >> >> That will utterly suck on non-UMIP machines that have >> hypervisor-provided UMIP emulation. >> > > Is that a valid thing to optimize for, especially given that paranoid > entries aren't the most common anyway? > A bunch of people seem to care about NMI performance for perf. And the current patch set works without this trick. FWIW, if we switch all entries to the entry text trampoline, we get direct percpu access for free. ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: selftests/x86/fsgsbase_64 test problem 2018-01-29 18:26 ` Andy Lutomirski @ 2018-01-29 18:30 ` H. Peter Anvin 2018-02-27 22:59 ` Dan Rue 0 siblings, 1 reply; 16+ messages in thread From: H. Peter Anvin @ 2018-01-29 18:30 UTC (permalink / raw) To: Andy Lutomirski Cc: Andy Lutomirski, Borislav Petkov, Dan Rue, Shuah Khan, Ingo Molnar, Dmitry Safonov, open list:KERNEL SELFTEST FRAMEWORK, LKML On 01/29/18 10:26, Andy Lutomirski wrote: >>> >>> That will utterly suck on non-UMIP machines that have >>> hypervisor-provided UMIP emulation. >> >> Is that a valid thing to optimize for, especially given that paranoid >> entries aren't the most common anyway? > > A bunch of people seem to care about NMI performance for perf. > That wasn't really the question... > And the current patch set works without this trick. But I believe the tricks it uses are fragile. > FWIW, if we switch all entries to the entry text trampoline, we get direct percpu access for free. That might be a better option. -hpa ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: selftests/x86/fsgsbase_64 test problem 2018-01-29 18:30 ` H. Peter Anvin @ 2018-02-27 22:59 ` Dan Rue 0 siblings, 0 replies; 16+ messages in thread From: Dan Rue @ 2018-02-27 22:59 UTC (permalink / raw) To: H. Peter Anvin Cc: Andy Lutomirski, Andy Lutomirski, Borislav Petkov, Shuah Khan, Ingo Molnar, Dmitry Safonov, open list:KERNEL SELFTEST FRAMEWORK, LKML On Mon, Jan 29, 2018 at 10:30:05AM -0800, H. Peter Anvin wrote: > On 01/29/18 10:26, Andy Lutomirski wrote: > >>> > >>> That will utterly suck on non-UMIP machines that have > >>> hypervisor-provided UMIP emulation. > >> > >> Is that a valid thing to optimize for, especially given that paranoid > >> entries aren't the most common anyway? > > > > A bunch of people seem to care about NMI performance for perf. > > > > That wasn't really the question... > > > And the current patch set works without this trick. > > But I believe the tricks it uses are fragile. > > > FWIW, if we switch all entries to the entry text trampoline, we get direct percpu access for free. > > That might be a better option. Has there been any conclusion to this thread? I can still reproduce the issue on mainline and next. Thanks, Dan ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: selftests/x86/fsgsbase_64 test problem 2018-01-26 22:38 ` Andy Lutomirski 2018-01-26 22:42 ` Andy Lutomirski @ 2018-01-26 22:56 ` Borislav Petkov 2018-01-28 19:21 ` Andy Lutomirski 1 sibling, 1 reply; 16+ messages in thread From: Borislav Petkov @ 2018-01-26 22:56 UTC (permalink / raw) To: Andy Lutomirski Cc: H. Peter Anvin, Dan Rue, Shuah Khan, Ingo Molnar, Dmitry Safonov, open list:KERNEL SELFTEST FRAMEWORK, LKML On Fri, Jan 26, 2018 at 02:38:28PM -0800, Andy Lutomirski wrote: > Borislav, any chance you could run the attached program on an AMD > machine to see what it does? [boris@pd: /tmp> ./gs1 PID = 3420 ax = 0, cx = 0, dx = 0 This is on 4.15-rc7-ish + tip/master from 2 wks ago. -- Regards/Gruss, Boris. Good mailing practices for 400: avoid top-posting and trim the reply. ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: selftests/x86/fsgsbase_64 test problem 2018-01-26 22:56 ` Borislav Petkov @ 2018-01-28 19:21 ` Andy Lutomirski 0 siblings, 0 replies; 16+ messages in thread From: Andy Lutomirski @ 2018-01-28 19:21 UTC (permalink / raw) To: Borislav Petkov Cc: Andy Lutomirski, H. Peter Anvin, Dan Rue, Shuah Khan, Ingo Molnar, Dmitry Safonov, open list:KERNEL SELFTEST FRAMEWORK, LKML On Fri, Jan 26, 2018 at 2:56 PM, Borislav Petkov <bp@alien8.de> wrote: > On Fri, Jan 26, 2018 at 02:38:28PM -0800, Andy Lutomirski wrote: >> Borislav, any chance you could run the attached program on an AMD >> machine to see what it does? > > [boris@pd: /tmp> ./gs1 > PID = 3420 > ax = 0, cx = 0, dx = 0 > > This is on 4.15-rc7-ish + tip/master from 2 wks ago. Phooey :( ^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: selftests/x86/fsgsbase_64 test problem 2018-01-26 19:46 ` Andy Lutomirski 2018-01-26 22:38 ` Andy Lutomirski @ 2018-01-26 22:51 ` H. Peter Anvin 1 sibling, 0 replies; 16+ messages in thread From: H. Peter Anvin @ 2018-01-26 22:51 UTC (permalink / raw) To: Andy Lutomirski Cc: Dan Rue, Shuah Khan, Ingo Molnar, Dmitry Safonov, Borislav Petkov, open list:KERNEL SELFTEST FRAMEWORK, LKML On 01/26/18 11:46, Andy Lutomirski wrote: > > Under QEMU TCG, gs stays equal to 1. On native or KVM, on Skylake, it > changes to 0. > > On KVM or native, I do not observe do_debug getting called with %gs == > 1. On TCG, I do. I don't think that's precisely the problem that's > causing the test to fail, since the test doesn't use TF or ptrace, but > I wouldn't be shocked if it's related. > > hpa, any insight? > > (NB: if you want to play with this as I've described it, you may need > to make invalid_selector() in ptrace.c always return false. The > current implementation is too strict and causes problems.) > Looking at it. I want to grok this in the general context of fsgsbase as well, of course. -hpa ^ permalink raw reply [flat|nested] 16+ messages in thread
end of thread, other threads:[~2018-02-27 23:00 UTC | newest] Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2018-01-26 15:36 selftests/x86/fsgsbase_64 test problem Dan Rue 2018-01-26 16:22 ` Andy Lutomirski 2018-01-26 18:59 ` Andy Lutomirski 2018-01-26 19:46 ` Andy Lutomirski 2018-01-26 22:38 ` Andy Lutomirski 2018-01-26 22:42 ` Andy Lutomirski 2018-01-28 19:21 ` Andy Lutomirski 2018-01-29 9:13 ` H. Peter Anvin 2018-01-29 16:37 ` Andy Lutomirski 2018-01-29 18:12 ` H. Peter Anvin 2018-01-29 18:26 ` Andy Lutomirski 2018-01-29 18:30 ` H. Peter Anvin 2018-02-27 22:59 ` Dan Rue 2018-01-26 22:56 ` Borislav Petkov 2018-01-28 19:21 ` Andy Lutomirski 2018-01-26 22:51 ` H. Peter Anvin
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).