* FSGSBASE ABI considerations @ 2017-07-31 3:05 Andy Lutomirski 2017-07-31 4:38 ` Linus Torvalds ` (3 more replies) 0 siblings, 4 replies; 13+ messages in thread From: Andy Lutomirski @ 2017-07-31 3:05 UTC (permalink / raw) To: Bae, Chang Seok, X86 ML, linux-kernel, Linus Torvalds, Borislav Petkov, Brian Gerst, Stas Sergeev Hi all- Chang wants to get the FSGSBASE patches in. Here's a bit on a brain dump on what I think the relevant considerations are and why I haven't sent out my patches. ----- Background ----- Setting CR4.FSGSBASE has two major advantages and one major disadvantage. The major advantages are: - We can avoid some WRMSR instructions in the context switch path, which makes a rather large difference. - User code can use the new RD/WR FS/GS BASE instructions. Apparently some users really want this for, umm, userspace threading. Think Java. The major disadvantage is that user code can use the new instructions. Now userspace is going to do totally stupid shite like writing some nonzero value to GS and then doing WRGSBASE or like linking some idiotic library that uses WRGSBASE into a perfectly innocent program like dosemu2 and resulting in utterly nonsensical descriptor state. In Windows, supposedly the scheduler reserves the right to do arbitrarily awful things to you if you use WRFSBASE or WRGSBASE inappropriately. Andi took a similar approach in his original FSGSBASE patches. I think this is wrong and we need to have sensible, documented, and tested behavior for what happens when you use the new instructions. For simplicity, the text below talks about WRGSBASE and ignores WRFSBASE. The ABI considerations are identical, even if the kernel implementation details are different. ----- Requirements ----- In my book, there's only one sensible choice for what happens when you are scheduled out and back in on a Linux system with FSGSBASE enabled: all of your descriptors end up *exactly* the way they were when you scheduled out. ptrace users need to keep working. It would be nice if existing gdb versions continue to work right when user code uses WRGSBASE, but it might be okay if a new ptrace interface is needed. The existing regset ABI is exactly backwards from what it needs to be to make this easy. ----- interaction with modify_ldt() ----- The first sticking point we'll hit is modify_ldt() and, in particular, what happens if you call modify_ldt() to change the base of a segment that is ioaded into gs by another thread in the same mm. Our current behavior here is nonsensical: on 32-bit kernels, FS would be fully refreshed on other threads and GS might be depending on compiler options. On 64-bit kernels, neither FS nor GS is immediately refreshed. Historically, we didn't refresh anything reliably. On the bright side, this means that existing modify_ldt() users are (AFAIK) tolerant of somewhat crazy behavior. On an FSGSBASE-enabled system, I think we need to provide deterministic, documented, tested behavior. I can think of three plausible choices: 1a. modify_ldt() immediately updates FSBASE and GSBASE all threads that reference the modified selector. 1b. modify_ldt() immediatley updates FSBASE and GSBASE on all threads that reference the LDT. 2. modify_ldt() leaves FSBASE and GSBASE alone on all threads. (2) is trivial to implement, whereas (1a) and (1b) are a bit nasty to implement when FSGSBASE is on. The tricky bit is that 32-bit kernels can't do (2), so, if we want modify_ldt() to behave the same on 32-bit and 64-bit kernels, we're stuck with (1). (I think we can implement (2) with acceptable performance on 64-bit non-FSGSBASE kernels if we wanted to.) Thoughts? ----- Interaction with ptrace ----- struct user_regs_struct looks like this: ... unsigned long fs_base; unsigned long gs_base; unsigned long ds; unsigned long es; unsigned long fs; unsigned long gs; ... This means that, when gdb saves away a regset and reloads it using PTRACE_SETREGS or similar, the effect is to load gs_base and then load gs. If gs != 0, this will blow away gs_base. Without FSGSBASE, this doesn't matter so much. With FSGSBASE, it means that using gdb to do, say, 'print func()' may corrupt gsbase. What, if anything, should we do about this? One option would be to make gs_base be accurate all the time (it currently isn't) and teach PTRACE_SETREGS to restore in the opposite order despite the struct layout. Thoughts? ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: FSGSBASE ABI considerations 2017-07-31 3:05 FSGSBASE ABI considerations Andy Lutomirski @ 2017-07-31 4:38 ` Linus Torvalds 2017-07-31 14:14 ` Andy Lutomirski 2017-07-31 10:55 ` Kirill A. Shutemov ` (2 subsequent siblings) 3 siblings, 1 reply; 13+ messages in thread From: Linus Torvalds @ 2017-07-31 4:38 UTC (permalink / raw) To: Andy Lutomirski Cc: Bae, Chang Seok, X86 ML, linux-kernel, Borislav Petkov, Brian Gerst, Stas Sergeev On Sun, Jul 30, 2017 at 8:05 PM, Andy Lutomirski <luto@kernel.org> wrote: > > This means that, when gdb saves away a regset and reloads it using > PTRACE_SETREGS or similar, the effect is to load gs_base and then load > gs. If gs != 0, this will blow away gs_base. Without FSGSBASE, this > doesn't matter so much. With FSGSBASE, it means that using gdb to do, > say, 'print func()' may corrupt gsbase. > > What, if anything, should we do about this? One option would be to > make gs_base be accurate all the time (it currently isn't) and teach > PTRACE_SETREGS to restore in the opposite order despite the struct > layout. I do not think that ordering should ever matter. If it does, it means that you've designed something. We already screwed that up with the msr interface, can we try to not do it again? Could we perhaps do something like: - every process starts out with CR4.FSGSBASE cleared - if we get an #UD due to the process using the {rd|wr}{gs|fs}base instructions, we enable FSGSBASE and mark the process as using those instructions. - once a process is marked as FSGSBASE, the kernel prioritizes FSGSBASE. We'll still save/restore the selector too, but every time we restore the selector, we will first do a rd*base, and then do a wr*base afterwards IOW, the "selector" ends up being meaningless after people have used fsgsbase. It is saved and restored as a _value_, but it has no effect what-so-ever on the actual base pointer. Yes, it's modal, but at least you don't end up in some situation where it matters whether you write the selector first or not. Hmm? Linus ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: FSGSBASE ABI considerations 2017-07-31 4:38 ` Linus Torvalds @ 2017-07-31 14:14 ` Andy Lutomirski 0 siblings, 0 replies; 13+ messages in thread From: Andy Lutomirski @ 2017-07-31 14:14 UTC (permalink / raw) To: Linus Torvalds Cc: Andy Lutomirski, Bae, Chang Seok, X86 ML, linux-kernel, Borislav Petkov, Brian Gerst, Stas Sergeev On Sun, Jul 30, 2017 at 9:38 PM, Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Sun, Jul 30, 2017 at 8:05 PM, Andy Lutomirski <luto@kernel.org> wrote: >> >> This means that, when gdb saves away a regset and reloads it using >> PTRACE_SETREGS or similar, the effect is to load gs_base and then load >> gs. If gs != 0, this will blow away gs_base. Without FSGSBASE, this >> doesn't matter so much. With FSGSBASE, it means that using gdb to do, >> say, 'print func()' may corrupt gsbase. >> >> What, if anything, should we do about this? One option would be to >> make gs_base be accurate all the time (it currently isn't) and teach >> PTRACE_SETREGS to restore in the opposite order despite the struct >> layout. > > I do not think that ordering should ever matter. If it does, it means > that you've designed something. We already screwed that up with the > msr interface, can we try to not do it again? > > Could we perhaps do something like: > > - every process starts out with CR4.FSGSBASE cleared > > - if we get an #UD due to the process using the {rd|wr}{gs|fs}base > instructions, we enable FSGSBASE and mark the process as using those > instructions. > > - once a process is marked as FSGSBASE, the kernel prioritizes > FSGSBASE. We'll still save/restore the selector too, but every time we > restore the selector, we will first do a rd*base, and then do a > wr*base afterwards > > IOW, the "selector" ends up being meaningless after people have used > fsgsbase. It is saved and restored as a _value_, but it has no effect > what-so-ever on the actual base pointer. > > Yes, it's modal, but at least you don't end up in some situation where > it matters whether you write the selector first or not. > > Hmm? I hadn't thought of that approach. I have three very different objections. - The only reason I think that FSGSBASE is worth supporting at all is that it provides a fairly dramatic speedup to context switches by getting rid of the awful serializing WRMSR. I tend to consider the actual exposure of the instructions to userspace to be more trouble than it's worth. But, with your approach, we may only get the speedup when running SPECJava Environmentally Friendly Threads, and we'll lose it again due to all the CR4 writes, and that would make me want to just drop the whole thing. - The modal approach makes the modify_ldt() consistency issue go away, but it doesn't help with ptrace, I think, because, with ptrace, we care about the debugger, not the debuggee. - glibc will probably be daft and start using WRGSBASE instead of arch_prctl and this whole idea may become irrelevant. All that being said, we might be able to get away with treating the selector and the base totally separately no matter what. I've searched a bit, and I haven't come up with anything that needs modify_ldt() to behave synchronously, presumably because its behavior used to be so utterly erratic that user code always had to follow modify_ldt() by an explicit segment write. The only thing that cares about ptrace that I've spotted and that do anything more complicated than reading the state and writing it back out the same way it found it is stuff like gdb's 'print $gs = 43', and I find it hard to believe that there are gdb scripts that do that and need to be supported for compatibility. --Andy ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: FSGSBASE ABI considerations 2017-07-31 3:05 FSGSBASE ABI considerations Andy Lutomirski 2017-07-31 4:38 ` Linus Torvalds @ 2017-07-31 10:55 ` Kirill A. Shutemov 2017-07-31 14:16 ` Andy Lutomirski 2017-07-31 21:23 ` Bae, Chang Seok 2017-08-07 8:06 ` Stas Sergeev 3 siblings, 1 reply; 13+ messages in thread From: Kirill A. Shutemov @ 2017-07-31 10:55 UTC (permalink / raw) To: Andy Lutomirski Cc: Bae, Chang Seok, X86 ML, linux-kernel, Linus Torvalds, Borislav Petkov, Brian Gerst, Stas Sergeev On Sun, Jul 30, 2017 at 08:05:43PM -0700, Andy Lutomirski wrote: > Hi all- > > Chang wants to get the FSGSBASE patches in. Here's a bit on a brain > dump on what I think the relevant considerations are and why I haven't > sent out my patches. I'm not sure if it would be relevant input for the descussion, but there's another issue with FSGSBASE vs LA57: http://lkml.kernel.org/r/9ddf602b-6c8b-8c1e-ab46-07ed12366593@redhat.com -- Kirill A. Shutemov ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: FSGSBASE ABI considerations 2017-07-31 10:55 ` Kirill A. Shutemov @ 2017-07-31 14:16 ` Andy Lutomirski 0 siblings, 0 replies; 13+ messages in thread From: Andy Lutomirski @ 2017-07-31 14:16 UTC (permalink / raw) To: Kirill A. Shutemov Cc: Andy Lutomirski, Bae, Chang Seok, X86 ML, linux-kernel, Linus Torvalds, Borislav Petkov, Brian Gerst, Stas Sergeev On Mon, Jul 31, 2017 at 3:55 AM, Kirill A. Shutemov <kirill@shutemov.name> wrote: > On Sun, Jul 30, 2017 at 08:05:43PM -0700, Andy Lutomirski wrote: >> Hi all- >> >> Chang wants to get the FSGSBASE patches in. Here's a bit on a brain >> dump on what I think the relevant considerations are and why I haven't >> sent out my patches. > > I'm not sure if it would be relevant input for the descussion, but there's > another issue with FSGSBASE vs LA57: > > http://lkml.kernel.org/r/9ddf602b-6c8b-8c1e-ab46-07ed12366593@redhat.com Clearly we should support FSGSBASE on Linux, have Paolo blacklist it on LA57 machines, and get Java to start using it just to pressure Intel to clean up this mess. Sigh, the degree to which CPU engineers don't understand software is sometimes mind-boggling. ^ permalink raw reply [flat|nested] 13+ messages in thread
* RE: FSGSBASE ABI considerations 2017-07-31 3:05 FSGSBASE ABI considerations Andy Lutomirski 2017-07-31 4:38 ` Linus Torvalds 2017-07-31 10:55 ` Kirill A. Shutemov @ 2017-07-31 21:23 ` Bae, Chang Seok 2017-08-07 16:14 ` Andy Lutomirski 2017-08-07 8:06 ` Stas Sergeev 3 siblings, 1 reply; 13+ messages in thread From: Bae, Chang Seok @ 2017-07-31 21:23 UTC (permalink / raw) To: Andy Lutomirski Cc: X86 ML, linux-kernel, Linus Torvalds, Borislav Petkov, Brian Gerst, Stas Sergeev > On an FSGSBASE-enabled system, I think we need to provide deterministic, documented, tested behavior. I can think of three plausible choices: > 1a. modify_ldt() immediately updates FSBASE and GSBASE all threads that reference the modified selector. > 1b. modify_ldt() immediatley updates FSBASE and GSBASE on all threads that reference the LDT. > 2. modify_ldt() leaves FSBASE and GSBASE alone on all threads. > (2) is trivial to implement, whereas (1a) and (1b) are a bit nasty to implement when FSGSBASE is on. > The tricky bit is that 32-bit kernels can't do (2), so, if we want modify_ldt() to behave the same on 32-bit and 64-bit kernels, we're stuck with (1). While implementing (1) is still unclear for context switch, here is one idea for (1b): - thread struct has new entry for ldt pointer that last seen - modify_ldt happens - ldtr upated for active threads via IPI - for inactive threads being scheduled in, ldtr updated before __switch_to - in __switch_to, read ldtr by sldt and compare the new ldt pointer sldt is ucode that likely takes only a couple cycles - mostly matched given modify_ldt is rare - unmatched, don't write gsbase if gs indicating LDT > (I think we can implement (2) with acceptable performance on 64-bit non-FSGSBASE kernels if we wanted to.) Nonetheless, with Andy's argument for (1), (2) might be straightforward assuming that user code already followed the legacy around modify_ldt in 64bit. ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: FSGSBASE ABI considerations 2017-07-31 21:23 ` Bae, Chang Seok @ 2017-08-07 16:14 ` Andy Lutomirski 0 siblings, 0 replies; 13+ messages in thread From: Andy Lutomirski @ 2017-08-07 16:14 UTC (permalink / raw) To: Bae, Chang Seok Cc: Andy Lutomirski, X86 ML, linux-kernel, Linus Torvalds, Borislav Petkov, Brian Gerst, Stas Sergeev On Jul 31, 2017, at 5:23 PM, Bae, Chang Seok <chang.seok.bae@intel.com> wrote: >> On an FSGSBASE-enabled system, I think we need to provide deterministic, documented, tested behavior. I can think of three plausible choices: >> 1a. modify_ldt() immediately updates FSBASE and GSBASE all threads that reference the modified selector. >> 1b. modify_ldt() immediatley updates FSBASE and GSBASE on all threads that reference the LDT. >> 2. modify_ldt() leaves FSBASE and GSBASE alone on all threads. >> (2) is trivial to implement, whereas (1a) and (1b) are a bit nasty to implement when FSGSBASE is on. > >> The tricky bit is that 32-bit kernels can't do (2), so, if we want modify_ldt() to behave the same on 32-bit and 64-bit kernels, we're stuck with (1). > > While implementing (1) is still unclear for context switch, here is one idea for (1b): > - thread struct has new entry for ldt pointer that last seen > - modify_ldt happens > - ldtr upated for active threads via IPI > - for inactive threads being scheduled in, ldtr updated before __switch_to > - in __switch_to, read ldtr by sldt and compare the new ldt pointer > sldt is ucode that likely takes only a couple cycles > - mostly matched given modify_ldt is rare > - unmatched, don't write gsbase if gs indicating LDT That won't be reliable -- LDTR could change more than once and be reused between context switches. If we went this route, I think we'd put a u64 version in ldt_struct. We'd also need to audit and fix up every access to thread.fs/gsbase. ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: FSGSBASE ABI considerations 2017-07-31 3:05 FSGSBASE ABI considerations Andy Lutomirski ` (2 preceding siblings ...) 2017-07-31 21:23 ` Bae, Chang Seok @ 2017-08-07 8:06 ` Stas Sergeev 2017-08-07 16:20 ` Andy Lutomirski 3 siblings, 1 reply; 13+ messages in thread From: Stas Sergeev @ 2017-08-07 8:06 UTC (permalink / raw) To: Andy Lutomirski, Bae, Chang Seok, X86 ML, linux-kernel, Linus Torvalds, Borislav Petkov, Brian Gerst, Bart Oldeman Hello. 31.07.2017 06:05, Andy Lutomirski пишет: > - User code can use the new RD/WR FS/GS BASE instructions. > Apparently some users really want this for, umm, userspace threading. > Think Java. I wonder how java avoids the lack of the user-space continuations support while getting the userspace threading. (swapcontext() calls to kernel for sigprocmask()) > The major disadvantage is that user code can use the new instructions. > Now userspace is going to do totally stupid shite like writing some > nonzero value to GS and then doing WRGSBASE or like linking some > idiotic library that uses WRGSBASE into a perfectly innocent program > like dosemu2 and resulting in utterly nonsensical descriptor state. I don't think this can represent the problem, at least not for dosemu1/2. dosemu2 does the full context switch via a sighandler, dosemu1 uses iret with manually changing all registers before jumping to compatibility mode. I don't think any state changes done in long mode, can affect the state after jump to compatibility mode. > ----- interaction with modify_ldt() ----- > > The first sticking point we'll hit is modify_ldt() and, in particular, > what happens if you call modify_ldt() to change the base of a segment > that is ioaded into gs by another thread in the same mm. > > Our current behavior here is nonsensical: on 32-bit kernels, FS would > be fully refreshed on other threads and GS might be depending on > compiler options. On 64-bit kernels, neither FS nor GS is immediately > refreshed. Historically, we didn't refresh anything reliably. On the > bright side, this means that existing modify_ldt() users are (AFAIK) > tolerant of somewhat crazy behavior. > > On an FSGSBASE-enabled system, I think we need to provide > deterministic, documented, tested behavior. I can think of three > plausible choices: > > 1a. modify_ldt() immediately updates FSBASE and GSBASE all threads > that reference the modified selector. > > 1b. modify_ldt() immediatley updates FSBASE and GSBASE on all threads > that reference the LDT. Does 1b mean that any call to modify_ldt(), even the read call, will reset all bases to the ones of LDT? I think this is the half-step. It clearly shows that you don't want such state to ever exist, but why not to go a step further and just make the bases to be reset not only by any unrelated modify_ldt() call, but always on schedule? You can state that using wrgsbase on non-zero selector is invalid, reset it to LDT state and maybe send a signal to the program so that it knows it did something wrong. This may sound too rough, but I really don't see how it differs from resetting all LDT bases on some unrelated modify_ldt() that was done for read, not write. Or you may want to reset selector to 0 rather than base to LDT. > 2. modify_ldt() leaves FSBASE and GSBASE alone on all threads. > > (2) is trivial to implement, whereas (1a) and (1b) are a bit nasty to > implement when FSGSBASE is on. > > The tricky bit is that 32-bit kernels can't do (2), so, if we want But do we have fsgsbase on 32bit kernels at all? I think it works only in long mode, no? I really tried to google some extensive description on this feature, but failed. > modify_ldt() to behave the same on 32-bit and 64-bit kernels, we're > stuck with (1). If you mean 1a, then to me it looks like a lot of efforts for something no one ever needs. > Thoughts? I am far from the kernel development so my thoughts may be naive, but IMHO you should just disallow this by some means (like by doing a fixup on schedule() and sending a signal). No one will suffer, people will just write 0 to segreg first. Note that such a problem can be provoked by the fact that the sighandler does not reset the segregs to their default values, and someone may simply forget to reset it to 0. You need to remind him to do so rather than to invent the tricky code to do something theoretically correct. ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: FSGSBASE ABI considerations 2017-08-07 8:06 ` Stas Sergeev @ 2017-08-07 16:20 ` Andy Lutomirski 2017-08-07 16:49 ` Christopher Lameter ` (2 more replies) 0 siblings, 3 replies; 13+ messages in thread From: Andy Lutomirski @ 2017-08-07 16:20 UTC (permalink / raw) To: Stas Sergeev Cc: Andy Lutomirski, Bae, Chang Seok, X86 ML, linux-kernel, Linus Torvalds, Borislav Petkov, Brian Gerst, Bart Oldeman On Mon, Aug 7, 2017 at 1:06 AM, Stas Sergeev <stsp@list.ru> wrote: > Hello. > > 31.07.2017 06:05, Andy Lutomirski пишет: >> >> - User code can use the new RD/WR FS/GS BASE instructions. >> Apparently some users really want this for, umm, userspace threading. >> Think Java. > > I wonder how java avoids the lack of the user-space > continuations support while getting the userspace threading. > (swapcontext() calls to kernel for sigprocmask()) > >> The major disadvantage is that user code can use the new instructions. >> Now userspace is going to do totally stupid shite like writing some >> nonzero value to GS and then doing WRGSBASE or like linking some >> idiotic library that uses WRGSBASE into a perfectly innocent program >> like dosemu2 and resulting in utterly nonsensical descriptor state. > > I don't think this can represent the problem, at least not > for dosemu1/2. dosemu2 does the full context switch via > a sighandler, dosemu1 uses iret with manually changing > all registers before jumping to compatibility mode. I don't > think any state changes done in long mode, can affect the > state after jump to compatibility mode. Hmm, right. DOSEMU could get tripped up on the way back to long mode, and we've discussed this a little bit before, but this is certainly manageable. > >> ----- interaction with modify_ldt() ----- >> >> The first sticking point we'll hit is modify_ldt() and, in particular, >> what happens if you call modify_ldt() to change the base of a segment >> that is ioaded into gs by another thread in the same mm. >> >> Our current behavior here is nonsensical: on 32-bit kernels, FS would >> be fully refreshed on other threads and GS might be depending on >> compiler options. On 64-bit kernels, neither FS nor GS is immediately >> refreshed. Historically, we didn't refresh anything reliably. On the >> bright side, this means that existing modify_ldt() users are (AFAIK) >> tolerant of somewhat crazy behavior. >> >> On an FSGSBASE-enabled system, I think we need to provide >> deterministic, documented, tested behavior. I can think of three >> plausible choices: >> >> 1a. modify_ldt() immediately updates FSBASE and GSBASE all threads >> that reference the modified selector. >> >> 1b. modify_ldt() immediatley updates FSBASE and GSBASE on all threads >> that reference the LDT. > > Does 1b mean that any call to modify_ldt(), even the > read call, will reset all bases to the ones of LDT? Nah, just writes. Doing it this way makes the tracking easier, since we don't need to keep track of which selectors have been changed. Note that 1a and 1b are indistinguishable to any user program that doesn't use WRFSBASE or WRGSBASE, though. > I think > this is the half-step. It clearly shows that you don't want > such state to ever exist, but why not to go a step further > and just make the bases to be reset not only by any > unrelated modify_ldt() call, but always on schedule? > You can state that using wrgsbase on non-zero selector > is invalid, reset it to LDT state and maybe send a signal > to the program so that it knows it did something wrong. > This may sound too rough, but I really don't see how it > differs from resetting all LDT bases on some unrelated > modify_ldt() that was done for read, not write. > Or you may want to reset selector to 0 rather than > base to LDT. Windows does something sort of like this (I think), but I don't like this solution. I fully expect that someone will write a program that does: old = rdgsbase(); wrgsbase(new); call_very_fast_function(); wrgsbase(old); This will work if GS == 0, which is fine. The problem is that it will *also* work if GS != 0 with very high probability, especially if this code sequence is right after some operation that sleeps. And then we'll get random crashes with very low probability, depending on where the scheduler hits. > >> 2. modify_ldt() leaves FSBASE and GSBASE alone on all threads. >> >> (2) is trivial to implement, whereas (1a) and (1b) are a bit nasty to >> implement when FSGSBASE is on. >> >> The tricky bit is that 32-bit kernels can't do (2), so, if we want > > But do we have fsgsbase on 32bit kernels at all? No, and we don't have MSR_FS_BASE, etc either, so the scheduler basically can't preserve the base across a context switch. > I think it works only in long mode, no? > I really tried to google some extensive description > on this feature, but failed. > >> modify_ldt() to behave the same on 32-bit and 64-bit kernels, we're >> stuck with (1). > > If you mean 1a, then to me it looks like a lot of efforts > for something no one ever needs. > >> Thoughts? > > I am far from the kernel development so my thoughts > may be naive, but IMHO you should just disallow this > by some means (like by doing a fixup on schedule() and > sending a signal). No one will suffer, people will just > write 0 to segreg first. Note that such a problem can > be provoked by the fact that the sighandler does not > reset the segregs to their default values, and someone > may simply forget to reset it to 0. You need to remind > him to do so rather than to invent the tricky code to > do something theoretically correct. I would *love* to disallow it. The problem is that I don't believe it to be possible in a way that doesn't cause more problems than it solves. --Andy ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: FSGSBASE ABI considerations 2017-08-07 16:20 ` Andy Lutomirski @ 2017-08-07 16:49 ` Christopher Lameter 2017-08-07 17:35 ` Linus Torvalds 2017-08-07 21:32 ` Stas Sergeev 2 siblings, 0 replies; 13+ messages in thread From: Christopher Lameter @ 2017-08-07 16:49 UTC (permalink / raw) To: Andy Lutomirski Cc: Stas Sergeev, Bae, Chang Seok, X86 ML, linux-kernel, Linus Torvalds, Borislav Petkov, Brian Gerst, Bart Oldeman I hope this will finally enable thread local support to work in a sane way in gcc so that we can actually use it in kernel space and get rid of all the this_cpu_xxx() macros? And thread local RMVs primitives may actually be provided by gcc and be usable in user space so that we can write user space code with effective cpu local variable access? ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: FSGSBASE ABI considerations 2017-08-07 16:20 ` Andy Lutomirski 2017-08-07 16:49 ` Christopher Lameter @ 2017-08-07 17:35 ` Linus Torvalds 2017-08-07 19:07 ` Andy Lutomirski 2017-08-07 21:32 ` Stas Sergeev 2 siblings, 1 reply; 13+ messages in thread From: Linus Torvalds @ 2017-08-07 17:35 UTC (permalink / raw) To: Andy Lutomirski Cc: Stas Sergeev, Bae, Chang Seok, X86 ML, linux-kernel, Borislav Petkov, Brian Gerst, Bart Oldeman On Mon, Aug 7, 2017 at 9:20 AM, Andy Lutomirski <luto@kernel.org> wrote: > > Windows does something sort of like this (I think), but I don't like > this solution. I fully expect that someone will write a program that > does: > > old = rdgsbase(); > wrgsbase(new); > call_very_fast_function(); > wrgsbase(old); > > This will work if GS == 0, which is fine. The problem is that it will > *also* work if GS != 0 with very high probability, especially if this > code sequence is right after some operation that sleeps. And then > we'll get random crashes with very low probability, depending on where > the scheduler hits. It will work reliably if you just make the scheduler save/restore the base rather than the selector. I really think you need to walk away from the "selector is meaningful" model. Yes, yes, it's the legacy model, but it's the *insane* model. So screw the selector. It doesn't matter. We'll need to save/restore the value, but that's it. What we *really* save and restore is just the base pointer. Why do you care so much about the selector? If people *don't* use the fsgsbase, then the selector and the base of the segment will always match anyway (modulo the system calls that actually change the gdt/ldt, and we can just sat that *then* selectors matter). And if people *do* use fsgsbase, then the selector is by definition not important. So just make the scheduler save the base first, and restore it last. End of problem. Your user-space code above just works. There is no race, i doesn't matter one whit whether GS is 0 ir not, there simply is no problem. So just what is the problem you're trying to solve? Linus ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: FSGSBASE ABI considerations 2017-08-07 17:35 ` Linus Torvalds @ 2017-08-07 19:07 ` Andy Lutomirski 0 siblings, 0 replies; 13+ messages in thread From: Andy Lutomirski @ 2017-08-07 19:07 UTC (permalink / raw) To: Linus Torvalds Cc: Andy Lutomirski, Stas Sergeev, Bae, Chang Seok, X86 ML, linux-kernel, Borislav Petkov, Brian Gerst, Bart Oldeman On Mon, Aug 7, 2017 at 10:35 AM, Linus Torvalds <torvalds@linux-foundation.org> wrote: > On Mon, Aug 7, 2017 at 9:20 AM, Andy Lutomirski <luto@kernel.org> wrote: >> >> Windows does something sort of like this (I think), but I don't like >> this solution. I fully expect that someone will write a program that >> does: >> >> old = rdgsbase(); >> wrgsbase(new); >> call_very_fast_function(); >> wrgsbase(old); >> >> This will work if GS == 0, which is fine. The problem is that it will >> *also* work if GS != 0 with very high probability, especially if this >> code sequence is right after some operation that sleeps. And then >> we'll get random crashes with very low probability, depending on where >> the scheduler hits. > > It will work reliably if you just make the scheduler save/restore the > base rather than the selector. > > I really think you need to walk away from the "selector is meaningful" > model. Yes, yes, it's the legacy model, but it's the *insane* model. > > So screw the selector. It doesn't matter. We'll need to save/restore > the value, but that's it. What we *really* save and restore is just > the base pointer. > > Why do you care so much about the selector? If people *don't* use the > fsgsbase, then the selector and the base of the segment will always > match anyway (modulo the system calls that actually change the > gdt/ldt, and we can just sat that *then* selectors matter). > > And if people *do* use fsgsbase, then the selector is by definition > not important. > > So just make the scheduler save the base first, and restore it last. > End of problem. Your user-space code above just works. There is no > race, i doesn't matter one whit whether GS is 0 ir not, there simply > is no problem. I agree completely. The scheduler should do exactly this and, with my patches applied, it does. > > So just what is the problem you're trying to solve? > I'm trying to avoid a situation where we implement that policy and the interaction with modify_ldt() becomes very strange. Linux has a long history of having ill-defined semantics x86_64, and I don't want to make it worse. If we *just* change the way the scheduler works, then we end up with modify_ldt() behaving determinstically on IVB+ and behaving deterministically on 32-bit kernels, but having that deterministic behavior be *different*. This makes me rather unhappy about the whole situation. Also, I don't want to break gdb, and even telling whether a change breaks gdb is an incredible PITA. Whern GDB saves and restores a context, it currently restores the base first and the selector second, and I have no idea whether gdb expects restoring the selector to update the base. ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: FSGSBASE ABI considerations 2017-08-07 16:20 ` Andy Lutomirski 2017-08-07 16:49 ` Christopher Lameter 2017-08-07 17:35 ` Linus Torvalds @ 2017-08-07 21:32 ` Stas Sergeev 2 siblings, 0 replies; 13+ messages in thread From: Stas Sergeev @ 2017-08-07 21:32 UTC (permalink / raw) To: Andy Lutomirski Cc: Bae, Chang Seok, X86 ML, linux-kernel, Linus Torvalds, Borislav Petkov, Brian Gerst, Bart Oldeman 07.08.2017 19:20, Andy Lutomirski пишет: >> I think >> this is the half-step. It clearly shows that you don't want >> such state to ever exist, but why not to go a step further >> and just make the bases to be reset not only by any >> unrelated modify_ldt() call, but always on schedule? >> You can state that using wrgsbase on non-zero selector >> is invalid, reset it to LDT state and maybe send a signal >> to the program so that it knows it did something wrong. >> This may sound too rough, but I really don't see how it >> differs from resetting all LDT bases on some unrelated >> modify_ldt() that was done for read, not write. >> Or you may want to reset selector to 0 rather than >> base to LDT. > Windows does something sort of like this (I think), but I don't like > this solution. I fully expect that someone will write a program that > does: > > old = rdgsbase(); > wrgsbase(new); > call_very_fast_function(); > wrgsbase(old); > > This will work if GS == 0, which is fine. The problem is that it will > *also* work if GS != 0 with very high probability, especially if this > code sequence is right after some operation that sleeps. And then > we'll get random crashes with very low probability, depending on where > the scheduler hits. So, as Linus already pointed, if the fixup is to zero out the selector, then this will still work fine. >> I am far from the kernel development so my thoughts >> may be naive, but IMHO you should just disallow this >> by some means (like by doing a fixup on schedule() and >> sending a signal). No one will suffer, people will just >> write 0 to segreg first. Note that such a problem can >> be provoked by the fact that the sighandler does not >> reset the segregs to their default values, and someone >> may simply forget to reset it to 0. You need to remind >> him to do so rather than to invent the tricky code to >> do something theoretically correct. > I would *love* to disallow it. The problem is that I don't believe it > to be possible in a way that doesn't cause more problems than it > solves. I wonder if sending a signal (after doing a fixup) is too much of a punishment? > I'm trying to avoid a situation where we implement that policy and the > interaction with modify_ldt() becomes very strange. IMHO if you do the fixup on schedule (like setting the selector to zero), then the interaction with modify_ldt() is completely avoided, i.e. modify_ldt() should then never special-case the threads that did wrgsbase. So if something inconsistent comes out, then it was likely there already without wrgsbase. ^ permalink raw reply [flat|nested] 13+ messages in thread
end of thread, other threads:[~2017-08-07 22:14 UTC | newest] Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed) -- links below jump to the message on this page -- 2017-07-31 3:05 FSGSBASE ABI considerations Andy Lutomirski 2017-07-31 4:38 ` Linus Torvalds 2017-07-31 14:14 ` Andy Lutomirski 2017-07-31 10:55 ` Kirill A. Shutemov 2017-07-31 14:16 ` Andy Lutomirski 2017-07-31 21:23 ` Bae, Chang Seok 2017-08-07 16:14 ` Andy Lutomirski 2017-08-07 8:06 ` Stas Sergeev 2017-08-07 16:20 ` Andy Lutomirski 2017-08-07 16:49 ` Christopher Lameter 2017-08-07 17:35 ` Linus Torvalds 2017-08-07 19:07 ` Andy Lutomirski 2017-08-07 21:32 ` Stas Sergeev
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).