linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* FSGSBASE ABI considerations
@ 2017-07-31  3:05 Andy Lutomirski
  2017-07-31  4:38 ` Linus Torvalds
                   ` (3 more replies)
  0 siblings, 4 replies; 13+ messages in thread
From: Andy Lutomirski @ 2017-07-31  3:05 UTC (permalink / raw)
  To: Bae, Chang Seok, X86 ML, linux-kernel, Linus Torvalds,
	Borislav Petkov, Brian Gerst, Stas Sergeev

Hi all-

Chang wants to get the FSGSBASE patches in.  Here's a bit on a brain
dump on what I think the relevant considerations are and why I haven't
sent out my patches.

----- Background -----

Setting CR4.FSGSBASE has two major advantages and one major
disadvantage.  The major advantages are:

 - We can avoid some WRMSR instructions in the context switch path,
which makes a rather large difference.

 - User code can use the new RD/WR FS/GS BASE instructions.
Apparently some users really want this for, umm, userspace threading.
Think Java.

The major disadvantage is that user code can use the new instructions.
Now userspace is going to do totally stupid shite like writing some
nonzero value to GS and then doing WRGSBASE or like linking some
idiotic library that uses WRGSBASE into a perfectly innocent program
like dosemu2 and resulting in utterly nonsensical descriptor state.

In Windows, supposedly the scheduler reserves the right to do
arbitrarily awful things to you if you use WRFSBASE or WRGSBASE
inappropriately.  Andi took a similar approach in his original
FSGSBASE patches.  I think this is wrong and we need to have sensible,
documented, and tested behavior for what happens when you use the new
instructions.

For simplicity, the text below talks about WRGSBASE and ignores
WRFSBASE.  The ABI considerations are identical, even if the kernel
implementation details are different.

----- Requirements -----

In my book, there's only one sensible choice for what happens when you
are scheduled out and back in on a Linux system with FSGSBASE enabled:
all of your descriptors end up *exactly* the way they were when you
scheduled out.

ptrace users need to keep working.  It would be nice if existing gdb
versions continue to work right when user code uses WRGSBASE, but it
might be okay if a new ptrace interface is needed.  The existing
regset ABI is exactly backwards from what it needs to be to make this
easy.

----- interaction with modify_ldt() -----

The first sticking point we'll hit is modify_ldt() and, in particular,
what happens if you call modify_ldt() to change the base of a segment
that is ioaded into gs by another thread in the same mm.

Our current behavior here is nonsensical: on 32-bit kernels, FS would
be fully refreshed on other threads and GS might be depending on
compiler options.  On 64-bit kernels, neither FS nor GS is immediately
refreshed.  Historically, we didn't refresh anything reliably.  On the
bright side, this means that existing modify_ldt() users are (AFAIK)
tolerant of somewhat crazy behavior.

On an FSGSBASE-enabled system, I think we need to provide
deterministic, documented, tested behavior.  I can think of three
plausible choices:

1a. modify_ldt() immediately updates FSBASE and GSBASE all threads
that reference the modified selector.

1b. modify_ldt() immediatley updates FSBASE and GSBASE on all threads
that reference the LDT.

2. modify_ldt() leaves FSBASE and GSBASE alone on all threads.

(2) is trivial to implement, whereas (1a) and (1b) are a bit nasty to
implement when FSGSBASE is on.

The tricky bit is that 32-bit kernels can't do (2), so, if we want
modify_ldt() to behave the same on 32-bit and 64-bit kernels, we're
stuck with (1).  (I think we can implement (2) with acceptable
performance on 64-bit non-FSGSBASE kernels if we wanted to.)

Thoughts?

----- Interaction with ptrace -----

struct user_regs_struct looks like this:

        ...
        unsigned long   fs_base;
        unsigned long   gs_base;
        unsigned long   ds;
        unsigned long   es;
        unsigned long   fs;
        unsigned long   gs;
        ...

This means that, when gdb saves away a regset and reloads it using
PTRACE_SETREGS or similar, the effect is to load gs_base and then load
gs.  If gs != 0, this will blow away gs_base.  Without FSGSBASE, this
doesn't matter so much.  With FSGSBASE, it means that using gdb to do,
say, 'print func()' may corrupt gsbase.

What, if anything, should we do about this?  One option would be to
make gs_base be accurate all the time (it currently isn't) and teach
PTRACE_SETREGS to restore in the opposite order despite the struct
layout.

Thoughts?

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: FSGSBASE ABI considerations
  2017-07-31  3:05 FSGSBASE ABI considerations Andy Lutomirski
@ 2017-07-31  4:38 ` Linus Torvalds
  2017-07-31 14:14   ` Andy Lutomirski
  2017-07-31 10:55 ` Kirill A. Shutemov
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 13+ messages in thread
From: Linus Torvalds @ 2017-07-31  4:38 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Bae, Chang Seok, X86 ML, linux-kernel, Borislav Petkov,
	Brian Gerst, Stas Sergeev

On Sun, Jul 30, 2017 at 8:05 PM, Andy Lutomirski <luto@kernel.org> wrote:
>
> This means that, when gdb saves away a regset and reloads it using
> PTRACE_SETREGS or similar, the effect is to load gs_base and then load
> gs.  If gs != 0, this will blow away gs_base.  Without FSGSBASE, this
> doesn't matter so much.  With FSGSBASE, it means that using gdb to do,
> say, 'print func()' may corrupt gsbase.
>
> What, if anything, should we do about this?  One option would be to
> make gs_base be accurate all the time (it currently isn't) and teach
> PTRACE_SETREGS to restore in the opposite order despite the struct
> layout.

I do not think that ordering should ever matter. If it does, it means
that you've designed something. We already screwed that up with the
msr interface, can we try to not do it again?

Could we perhaps do something like:

 - every process starts out with CR4.FSGSBASE cleared

 - if we get an #UD due to the process using the {rd|wr}{gs|fs}base
instructions,  we enable FSGSBASE and mark the process as using those
instructions.

 - once a process is marked as FSGSBASE, the kernel prioritizes
FSGSBASE. We'll still save/restore the selector too, but every time we
restore the selector, we will first do a rd*base, and then do a
wr*base afterwards

IOW, the "selector" ends up being meaningless after people have used
fsgsbase. It is saved and restored as a _value_, but it has no effect
what-so-ever on the actual base pointer.

Yes, it's modal, but at least you don't end up in some situation where
it matters whether you write the selector first or not.

Hmm?

                 Linus

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: FSGSBASE ABI considerations
  2017-07-31  3:05 FSGSBASE ABI considerations Andy Lutomirski
  2017-07-31  4:38 ` Linus Torvalds
@ 2017-07-31 10:55 ` Kirill A. Shutemov
  2017-07-31 14:16   ` Andy Lutomirski
  2017-07-31 21:23 ` Bae, Chang Seok
  2017-08-07  8:06 ` Stas Sergeev
  3 siblings, 1 reply; 13+ messages in thread
From: Kirill A. Shutemov @ 2017-07-31 10:55 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Bae, Chang Seok, X86 ML, linux-kernel, Linus Torvalds,
	Borislav Petkov, Brian Gerst, Stas Sergeev

On Sun, Jul 30, 2017 at 08:05:43PM -0700, Andy Lutomirski wrote:
> Hi all-
> 
> Chang wants to get the FSGSBASE patches in.  Here's a bit on a brain
> dump on what I think the relevant considerations are and why I haven't
> sent out my patches.

I'm not sure if it would be relevant input for the descussion, but there's
another issue with FSGSBASE vs LA57:

http://lkml.kernel.org/r/9ddf602b-6c8b-8c1e-ab46-07ed12366593@redhat.com

-- 
 Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: FSGSBASE ABI considerations
  2017-07-31  4:38 ` Linus Torvalds
@ 2017-07-31 14:14   ` Andy Lutomirski
  0 siblings, 0 replies; 13+ messages in thread
From: Andy Lutomirski @ 2017-07-31 14:14 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andy Lutomirski, Bae, Chang Seok, X86 ML, linux-kernel,
	Borislav Petkov, Brian Gerst, Stas Sergeev

On Sun, Jul 30, 2017 at 9:38 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Sun, Jul 30, 2017 at 8:05 PM, Andy Lutomirski <luto@kernel.org> wrote:
>>
>> This means that, when gdb saves away a regset and reloads it using
>> PTRACE_SETREGS or similar, the effect is to load gs_base and then load
>> gs.  If gs != 0, this will blow away gs_base.  Without FSGSBASE, this
>> doesn't matter so much.  With FSGSBASE, it means that using gdb to do,
>> say, 'print func()' may corrupt gsbase.
>>
>> What, if anything, should we do about this?  One option would be to
>> make gs_base be accurate all the time (it currently isn't) and teach
>> PTRACE_SETREGS to restore in the opposite order despite the struct
>> layout.
>
> I do not think that ordering should ever matter. If it does, it means
> that you've designed something. We already screwed that up with the
> msr interface, can we try to not do it again?
>
> Could we perhaps do something like:
>
>  - every process starts out with CR4.FSGSBASE cleared
>
>  - if we get an #UD due to the process using the {rd|wr}{gs|fs}base
> instructions,  we enable FSGSBASE and mark the process as using those
> instructions.
>
>  - once a process is marked as FSGSBASE, the kernel prioritizes
> FSGSBASE. We'll still save/restore the selector too, but every time we
> restore the selector, we will first do a rd*base, and then do a
> wr*base afterwards
>
> IOW, the "selector" ends up being meaningless after people have used
> fsgsbase. It is saved and restored as a _value_, but it has no effect
> what-so-ever on the actual base pointer.
>
> Yes, it's modal, but at least you don't end up in some situation where
> it matters whether you write the selector first or not.
>
> Hmm?

I hadn't thought of that approach.  I have three very different objections.

 - The only reason I think that FSGSBASE is worth supporting at all is
that it provides a fairly dramatic speedup to context switches by
getting rid of the awful serializing WRMSR.  I tend to consider the
actual exposure of the instructions to userspace to be more trouble
than it's worth.  But, with your approach, we may only get the speedup
when running SPECJava Environmentally Friendly Threads, and we'll lose
it again due to all the CR4 writes, and that would make me want to
just drop the whole thing.

 - The modal approach makes the modify_ldt() consistency issue go
away, but it doesn't help with ptrace, I think, because, with ptrace,
we care about the debugger, not the debuggee.

 - glibc will probably be daft and start using WRGSBASE instead of
arch_prctl and this whole idea may become irrelevant.

All that being said, we might be able to get away with treating the
selector and the base totally separately no matter what.  I've
searched a bit, and I haven't come up with anything that needs
modify_ldt() to behave synchronously, presumably because its behavior
used to be so utterly erratic that user code always had to follow
modify_ldt() by an explicit segment write.  The only thing that cares
about ptrace that I've spotted and that do anything more complicated
than reading the state and writing it back out the same way it found
it is stuff like gdb's 'print $gs = 43', and I find it hard to believe
that there are gdb scripts that do that and need to be supported for
compatibility.

--Andy

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: FSGSBASE ABI considerations
  2017-07-31 10:55 ` Kirill A. Shutemov
@ 2017-07-31 14:16   ` Andy Lutomirski
  0 siblings, 0 replies; 13+ messages in thread
From: Andy Lutomirski @ 2017-07-31 14:16 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: Andy Lutomirski, Bae, Chang Seok, X86 ML, linux-kernel,
	Linus Torvalds, Borislav Petkov, Brian Gerst, Stas Sergeev

On Mon, Jul 31, 2017 at 3:55 AM, Kirill A. Shutemov
<kirill@shutemov.name> wrote:
> On Sun, Jul 30, 2017 at 08:05:43PM -0700, Andy Lutomirski wrote:
>> Hi all-
>>
>> Chang wants to get the FSGSBASE patches in.  Here's a bit on a brain
>> dump on what I think the relevant considerations are and why I haven't
>> sent out my patches.
>
> I'm not sure if it would be relevant input for the descussion, but there's
> another issue with FSGSBASE vs LA57:
>
> http://lkml.kernel.org/r/9ddf602b-6c8b-8c1e-ab46-07ed12366593@redhat.com

Clearly we should support FSGSBASE on Linux, have Paolo blacklist it
on LA57 machines, and get Java to start using it just to pressure
Intel to clean up this mess.

Sigh, the degree to which CPU engineers don't understand software is
sometimes mind-boggling.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* RE: FSGSBASE ABI considerations
  2017-07-31  3:05 FSGSBASE ABI considerations Andy Lutomirski
  2017-07-31  4:38 ` Linus Torvalds
  2017-07-31 10:55 ` Kirill A. Shutemov
@ 2017-07-31 21:23 ` Bae, Chang Seok
  2017-08-07 16:14   ` Andy Lutomirski
  2017-08-07  8:06 ` Stas Sergeev
  3 siblings, 1 reply; 13+ messages in thread
From: Bae, Chang Seok @ 2017-07-31 21:23 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: X86 ML, linux-kernel, Linus Torvalds, Borislav Petkov,
	Brian Gerst, Stas Sergeev

> On an FSGSBASE-enabled system, I think we need to provide deterministic, documented, tested behavior.  I can think of three plausible choices:
> 1a. modify_ldt() immediately updates FSBASE and GSBASE all threads that reference the modified selector.
> 1b. modify_ldt() immediatley updates FSBASE and GSBASE on all threads that reference the LDT.
> 2. modify_ldt() leaves FSBASE and GSBASE alone on all threads.
> (2) is trivial to implement, whereas (1a) and (1b) are a bit nasty to implement when FSGSBASE is on.

> The tricky bit is that 32-bit kernels can't do (2), so, if we want modify_ldt() to behave the same on 32-bit and 64-bit kernels, we're stuck with (1).  

While implementing (1) is still unclear for context switch, here is one idea for (1b):
- thread struct has new entry for ldt pointer that last seen
- modify_ldt happens
- ldtr upated for active threads via IPI
- for inactive threads being scheduled in, ldtr updated before __switch_to
- in __switch_to, read ldtr by sldt and compare the new ldt pointer
	sldt is ucode that likely takes only a couple cycles
- mostly matched given modify_ldt is rare
- unmatched, don't write gsbase if gs indicating LDT

> (I think we can implement (2) with acceptable performance on 64-bit non-FSGSBASE kernels if we wanted to.)
Nonetheless, with Andy's argument for (1), (2) might be straightforward assuming that user code already followed the legacy around modify_ldt in 64bit.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: FSGSBASE ABI considerations
  2017-07-31  3:05 FSGSBASE ABI considerations Andy Lutomirski
                   ` (2 preceding siblings ...)
  2017-07-31 21:23 ` Bae, Chang Seok
@ 2017-08-07  8:06 ` Stas Sergeev
  2017-08-07 16:20   ` Andy Lutomirski
  3 siblings, 1 reply; 13+ messages in thread
From: Stas Sergeev @ 2017-08-07  8:06 UTC (permalink / raw)
  To: Andy Lutomirski, Bae, Chang Seok, X86 ML, linux-kernel,
	Linus Torvalds, Borislav Petkov, Brian Gerst, Bart Oldeman

Hello.

31.07.2017 06:05, Andy Lutomirski пишет:
>   - User code can use the new RD/WR FS/GS BASE instructions.
> Apparently some users really want this for, umm, userspace threading.
> Think Java.
I wonder how java avoids the lack of the user-space
continuations support while getting the userspace threading.
(swapcontext() calls to kernel for sigprocmask())

> The major disadvantage is that user code can use the new instructions.
> Now userspace is going to do totally stupid shite like writing some
> nonzero value to GS and then doing WRGSBASE or like linking some
> idiotic library that uses WRGSBASE into a perfectly innocent program
> like dosemu2 and resulting in utterly nonsensical descriptor state.
I don't think this can represent the problem, at least not
for dosemu1/2. dosemu2 does the full context switch via
a sighandler, dosemu1 uses iret with manually changing
all registers before jumping to compatibility mode. I don't
think any state changes done in long mode, can affect the
state after jump to compatibility mode.

> ----- interaction with modify_ldt() -----
>
> The first sticking point we'll hit is modify_ldt() and, in particular,
> what happens if you call modify_ldt() to change the base of a segment
> that is ioaded into gs by another thread in the same mm.
>
> Our current behavior here is nonsensical: on 32-bit kernels, FS would
> be fully refreshed on other threads and GS might be depending on
> compiler options.  On 64-bit kernels, neither FS nor GS is immediately
> refreshed.  Historically, we didn't refresh anything reliably.  On the
> bright side, this means that existing modify_ldt() users are (AFAIK)
> tolerant of somewhat crazy behavior.
>
> On an FSGSBASE-enabled system, I think we need to provide
> deterministic, documented, tested behavior.  I can think of three
> plausible choices:
>
> 1a. modify_ldt() immediately updates FSBASE and GSBASE all threads
> that reference the modified selector.
>
> 1b. modify_ldt() immediatley updates FSBASE and GSBASE on all threads
> that reference the LDT.
Does 1b mean that any call to modify_ldt(), even the
read call, will reset all bases to the ones of LDT? I think
this is the half-step. It clearly shows that you don't want
such state to ever exist, but why not to go a step further
and just make the bases to be reset not only by any
unrelated modify_ldt() call, but always on schedule?
You can state that using wrgsbase on non-zero selector
is invalid, reset it to LDT state and maybe send a signal
to the program so that it knows it did something wrong.
This may sound too rough, but I really don't see how it
differs from resetting all LDT bases on some unrelated
modify_ldt() that was done for read, not write.
Or you may want to reset selector to 0 rather than
base to LDT.

> 2. modify_ldt() leaves FSBASE and GSBASE alone on all threads.
>
> (2) is trivial to implement, whereas (1a) and (1b) are a bit nasty to
> implement when FSGSBASE is on.
>
> The tricky bit is that 32-bit kernels can't do (2), so, if we want
But do we have fsgsbase on 32bit kernels at all?
I think it works only in long mode, no?
I really tried to google some extensive description
on this feature, but failed.

> modify_ldt() to behave the same on 32-bit and 64-bit kernels, we're
> stuck with (1).
If you mean 1a, then to me it looks like a lot of efforts
for something no one ever needs.

> Thoughts?
I am far from the kernel development so my thoughts
may be naive, but IMHO you should just disallow this
by some means (like by doing a fixup on schedule() and
sending a signal). No one will suffer, people will just
write 0 to segreg first. Note that such a problem can
be provoked by the fact that the sighandler does not
reset the segregs to their default values, and someone
may simply forget to reset it to 0. You need to remind
him to do so rather than to invent the tricky code to
do something theoretically correct.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: FSGSBASE ABI considerations
  2017-07-31 21:23 ` Bae, Chang Seok
@ 2017-08-07 16:14   ` Andy Lutomirski
  0 siblings, 0 replies; 13+ messages in thread
From: Andy Lutomirski @ 2017-08-07 16:14 UTC (permalink / raw)
  To: Bae, Chang Seok
  Cc: Andy Lutomirski, X86 ML, linux-kernel, Linus Torvalds,
	Borislav Petkov, Brian Gerst, Stas Sergeev

On Jul 31, 2017, at 5:23 PM, Bae, Chang Seok <chang.seok.bae@intel.com> wrote:

>> On an FSGSBASE-enabled system, I think we need to provide deterministic, documented, tested behavior.  I can think of three plausible choices:
>> 1a. modify_ldt() immediately updates FSBASE and GSBASE all threads that reference the modified selector.
>> 1b. modify_ldt() immediatley updates FSBASE and GSBASE on all threads that reference the LDT.
>> 2. modify_ldt() leaves FSBASE and GSBASE alone on all threads.
>> (2) is trivial to implement, whereas (1a) and (1b) are a bit nasty to implement when FSGSBASE is on.
>
>> The tricky bit is that 32-bit kernels can't do (2), so, if we want modify_ldt() to behave the same on 32-bit and 64-bit kernels, we're stuck with (1).
>
> While implementing (1) is still unclear for context switch, here is one idea for (1b):
> - thread struct has new entry for ldt pointer that last seen
> - modify_ldt happens
> - ldtr upated for active threads via IPI
> - for inactive threads being scheduled in, ldtr updated before __switch_to
> - in __switch_to, read ldtr by sldt and compare the new ldt pointer
>    sldt is ucode that likely takes only a couple cycles
> - mostly matched given modify_ldt is rare
> - unmatched, don't write gsbase if gs indicating LDT

That won't be reliable -- LDTR could change more than once and be
reused between context switches.  If we went this route, I think we'd
put a u64 version in ldt_struct.  We'd also need to audit and fix up
every access to thread.fs/gsbase.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: FSGSBASE ABI considerations
  2017-08-07  8:06 ` Stas Sergeev
@ 2017-08-07 16:20   ` Andy Lutomirski
  2017-08-07 16:49     ` Christopher Lameter
                       ` (2 more replies)
  0 siblings, 3 replies; 13+ messages in thread
From: Andy Lutomirski @ 2017-08-07 16:20 UTC (permalink / raw)
  To: Stas Sergeev
  Cc: Andy Lutomirski, Bae, Chang Seok, X86 ML, linux-kernel,
	Linus Torvalds, Borislav Petkov, Brian Gerst, Bart Oldeman

On Mon, Aug 7, 2017 at 1:06 AM, Stas Sergeev <stsp@list.ru> wrote:
> Hello.
>
> 31.07.2017 06:05, Andy Lutomirski пишет:
>>
>>   - User code can use the new RD/WR FS/GS BASE instructions.
>> Apparently some users really want this for, umm, userspace threading.
>> Think Java.
>
> I wonder how java avoids the lack of the user-space
> continuations support while getting the userspace threading.
> (swapcontext() calls to kernel for sigprocmask())
>
>> The major disadvantage is that user code can use the new instructions.
>> Now userspace is going to do totally stupid shite like writing some
>> nonzero value to GS and then doing WRGSBASE or like linking some
>> idiotic library that uses WRGSBASE into a perfectly innocent program
>> like dosemu2 and resulting in utterly nonsensical descriptor state.
>
> I don't think this can represent the problem, at least not
> for dosemu1/2. dosemu2 does the full context switch via
> a sighandler, dosemu1 uses iret with manually changing
> all registers before jumping to compatibility mode. I don't
> think any state changes done in long mode, can affect the
> state after jump to compatibility mode.

Hmm, right.  DOSEMU could get tripped up on the way back to long mode,
and we've discussed this a little bit before, but this is certainly
manageable.

>
>> ----- interaction with modify_ldt() -----
>>
>> The first sticking point we'll hit is modify_ldt() and, in particular,
>> what happens if you call modify_ldt() to change the base of a segment
>> that is ioaded into gs by another thread in the same mm.
>>
>> Our current behavior here is nonsensical: on 32-bit kernels, FS would
>> be fully refreshed on other threads and GS might be depending on
>> compiler options.  On 64-bit kernels, neither FS nor GS is immediately
>> refreshed.  Historically, we didn't refresh anything reliably.  On the
>> bright side, this means that existing modify_ldt() users are (AFAIK)
>> tolerant of somewhat crazy behavior.
>>
>> On an FSGSBASE-enabled system, I think we need to provide
>> deterministic, documented, tested behavior.  I can think of three
>> plausible choices:
>>
>> 1a. modify_ldt() immediately updates FSBASE and GSBASE all threads
>> that reference the modified selector.
>>
>> 1b. modify_ldt() immediatley updates FSBASE and GSBASE on all threads
>> that reference the LDT.
>
> Does 1b mean that any call to modify_ldt(), even the
> read call, will reset all bases to the ones of LDT?

Nah, just writes.  Doing it this way makes the tracking easier, since
we don't need to keep track of which selectors have been changed.

Note that 1a and 1b are indistinguishable to any user program that
doesn't use WRFSBASE or WRGSBASE, though.

> I think
> this is the half-step. It clearly shows that you don't want
> such state to ever exist, but why not to go a step further
> and just make the bases to be reset not only by any
> unrelated modify_ldt() call, but always on schedule?
> You can state that using wrgsbase on non-zero selector
> is invalid, reset it to LDT state and maybe send a signal
> to the program so that it knows it did something wrong.
> This may sound too rough, but I really don't see how it
> differs from resetting all LDT bases on some unrelated
> modify_ldt() that was done for read, not write.
> Or you may want to reset selector to 0 rather than
> base to LDT.

Windows does something sort of like this (I think), but I don't like
this solution.  I fully expect that someone will write a program that
does:

old = rdgsbase();
wrgsbase(new);
call_very_fast_function();
wrgsbase(old);

This will work if GS == 0, which is fine.  The problem is that it will
*also* work if GS != 0 with very high probability, especially if this
code sequence is right after some operation that sleeps.  And then
we'll get random crashes with very low probability, depending on where
the scheduler hits.

>
>> 2. modify_ldt() leaves FSBASE and GSBASE alone on all threads.
>>
>> (2) is trivial to implement, whereas (1a) and (1b) are a bit nasty to
>> implement when FSGSBASE is on.
>>
>> The tricky bit is that 32-bit kernels can't do (2), so, if we want
>
> But do we have fsgsbase on 32bit kernels at all?

No, and we don't have MSR_FS_BASE, etc either, so the scheduler
basically can't preserve the base across a context switch.

> I think it works only in long mode, no?
> I really tried to google some extensive description
> on this feature, but failed.
>
>> modify_ldt() to behave the same on 32-bit and 64-bit kernels, we're
>> stuck with (1).
>
> If you mean 1a, then to me it looks like a lot of efforts
> for something no one ever needs.
>
>> Thoughts?
>
> I am far from the kernel development so my thoughts
> may be naive, but IMHO you should just disallow this
> by some means (like by doing a fixup on schedule() and
> sending a signal). No one will suffer, people will just
> write 0 to segreg first. Note that such a problem can
> be provoked by the fact that the sighandler does not
> reset the segregs to their default values, and someone
> may simply forget to reset it to 0. You need to remind
> him to do so rather than to invent the tricky code to
> do something theoretically correct.

I would *love* to disallow it.  The problem is that I don't believe it
to be possible in a way that doesn't cause more problems than it
solves.

--Andy

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: FSGSBASE ABI considerations
  2017-08-07 16:20   ` Andy Lutomirski
@ 2017-08-07 16:49     ` Christopher Lameter
  2017-08-07 17:35     ` Linus Torvalds
  2017-08-07 21:32     ` Stas Sergeev
  2 siblings, 0 replies; 13+ messages in thread
From: Christopher Lameter @ 2017-08-07 16:49 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Stas Sergeev, Bae, Chang Seok, X86 ML, linux-kernel,
	Linus Torvalds, Borislav Petkov, Brian Gerst, Bart Oldeman

I hope this will finally enable thread local support to work in a sane way
in gcc so that we can actually use it in kernel space and get rid of all
the this_cpu_xxx() macros?

And thread local RMVs primitives may actually be provided by gcc and
be usable in user space so that we can write user space code with
effective cpu local variable access?

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: FSGSBASE ABI considerations
  2017-08-07 16:20   ` Andy Lutomirski
  2017-08-07 16:49     ` Christopher Lameter
@ 2017-08-07 17:35     ` Linus Torvalds
  2017-08-07 19:07       ` Andy Lutomirski
  2017-08-07 21:32     ` Stas Sergeev
  2 siblings, 1 reply; 13+ messages in thread
From: Linus Torvalds @ 2017-08-07 17:35 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Stas Sergeev, Bae, Chang Seok, X86 ML, linux-kernel,
	Borislav Petkov, Brian Gerst, Bart Oldeman

On Mon, Aug 7, 2017 at 9:20 AM, Andy Lutomirski <luto@kernel.org> wrote:
>
> Windows does something sort of like this (I think), but I don't like
> this solution.  I fully expect that someone will write a program that
> does:
>
> old = rdgsbase();
> wrgsbase(new);
> call_very_fast_function();
> wrgsbase(old);
>
> This will work if GS == 0, which is fine.  The problem is that it will
> *also* work if GS != 0 with very high probability, especially if this
> code sequence is right after some operation that sleeps.  And then
> we'll get random crashes with very low probability, depending on where
> the scheduler hits.

It will work reliably if you just make the scheduler save/restore the
base rather than the selector.

I really think you need to walk away from the "selector is meaningful"
model. Yes, yes, it's the legacy model, but it's the *insane* model.

So screw the selector. It doesn't matter. We'll need to save/restore
the value, but that's it. What we *really* save and restore is just
the base pointer.

Why do you care so much about the selector? If people *don't* use the
fsgsbase, then the selector and the base of the segment will always
match anyway (modulo the system calls that actually change the
gdt/ldt, and we can just sat that *then* selectors matter).

And if people *do* use fsgsbase, then the selector is by definition
not important.

So just make the scheduler save the base first, and restore it last.
End of problem. Your user-space code above just works. There is no
race, i doesn't matter one whit whether GS is 0 ir not, there simply
is no problem.

So just what is the problem you're trying to solve?

               Linus

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: FSGSBASE ABI considerations
  2017-08-07 17:35     ` Linus Torvalds
@ 2017-08-07 19:07       ` Andy Lutomirski
  0 siblings, 0 replies; 13+ messages in thread
From: Andy Lutomirski @ 2017-08-07 19:07 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andy Lutomirski, Stas Sergeev, Bae, Chang Seok, X86 ML,
	linux-kernel, Borislav Petkov, Brian Gerst, Bart Oldeman

On Mon, Aug 7, 2017 at 10:35 AM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Mon, Aug 7, 2017 at 9:20 AM, Andy Lutomirski <luto@kernel.org> wrote:
>>
>> Windows does something sort of like this (I think), but I don't like
>> this solution.  I fully expect that someone will write a program that
>> does:
>>
>> old = rdgsbase();
>> wrgsbase(new);
>> call_very_fast_function();
>> wrgsbase(old);
>>
>> This will work if GS == 0, which is fine.  The problem is that it will
>> *also* work if GS != 0 with very high probability, especially if this
>> code sequence is right after some operation that sleeps.  And then
>> we'll get random crashes with very low probability, depending on where
>> the scheduler hits.
>
> It will work reliably if you just make the scheduler save/restore the
> base rather than the selector.
>
> I really think you need to walk away from the "selector is meaningful"
> model. Yes, yes, it's the legacy model, but it's the *insane* model.
>
> So screw the selector. It doesn't matter. We'll need to save/restore
> the value, but that's it. What we *really* save and restore is just
> the base pointer.
>
> Why do you care so much about the selector? If people *don't* use the
> fsgsbase, then the selector and the base of the segment will always
> match anyway (modulo the system calls that actually change the
> gdt/ldt, and we can just sat that *then* selectors matter).
>
> And if people *do* use fsgsbase, then the selector is by definition
> not important.
>
> So just make the scheduler save the base first, and restore it last.
> End of problem. Your user-space code above just works. There is no
> race, i doesn't matter one whit whether GS is 0 ir not, there simply
> is no problem.

I agree completely.  The scheduler should do exactly this and, with my
patches applied, it does.

>
> So just what is the problem you're trying to solve?
>

I'm trying to avoid a situation where we implement that policy and the
interaction with modify_ldt() becomes very strange.  Linux has a long
history of having ill-defined semantics x86_64, and I don't want to
make it worse.

If we *just* change the way the scheduler works, then we end up with
modify_ldt() behaving determinstically on IVB+ and behaving
deterministically on 32-bit kernels, but having that deterministic
behavior be *different*.  This makes me rather unhappy about the whole
situation.

Also, I don't want to break gdb, and even telling whether a change
breaks gdb is an incredible PITA.  Whern GDB saves and restores a
context, it currently restores the base first and the selector second,
and I have no idea whether gdb expects restoring the selector to
update the base.

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: FSGSBASE ABI considerations
  2017-08-07 16:20   ` Andy Lutomirski
  2017-08-07 16:49     ` Christopher Lameter
  2017-08-07 17:35     ` Linus Torvalds
@ 2017-08-07 21:32     ` Stas Sergeev
  2 siblings, 0 replies; 13+ messages in thread
From: Stas Sergeev @ 2017-08-07 21:32 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Bae, Chang Seok, X86 ML, linux-kernel, Linus Torvalds,
	Borislav Petkov, Brian Gerst, Bart Oldeman

07.08.2017 19:20, Andy Lutomirski пишет:
>> I think
>> this is the half-step. It clearly shows that you don't want
>> such state to ever exist, but why not to go a step further
>> and just make the bases to be reset not only by any
>> unrelated modify_ldt() call, but always on schedule?
>> You can state that using wrgsbase on non-zero selector
>> is invalid, reset it to LDT state and maybe send a signal
>> to the program so that it knows it did something wrong.
>> This may sound too rough, but I really don't see how it
>> differs from resetting all LDT bases on some unrelated
>> modify_ldt() that was done for read, not write.
>> Or you may want to reset selector to 0 rather than
>> base to LDT.
> Windows does something sort of like this (I think), but I don't like
> this solution.  I fully expect that someone will write a program that
> does:
>
> old = rdgsbase();
> wrgsbase(new);
> call_very_fast_function();
> wrgsbase(old);
>
> This will work if GS == 0, which is fine.  The problem is that it will
> *also* work if GS != 0 with very high probability, especially if this
> code sequence is right after some operation that sleeps.  And then
> we'll get random crashes with very low probability, depending on where
> the scheduler hits.
So, as Linus already pointed, if the fixup is to
zero out the selector, then this will still work fine.


>> I am far from the kernel development so my thoughts
>> may be naive, but IMHO you should just disallow this
>> by some means (like by doing a fixup on schedule() and
>> sending a signal). No one will suffer, people will just
>> write 0 to segreg first. Note that such a problem can
>> be provoked by the fact that the sighandler does not
>> reset the segregs to their default values, and someone
>> may simply forget to reset it to 0. You need to remind
>> him to do so rather than to invent the tricky code to
>> do something theoretically correct.
> I would *love* to disallow it.  The problem is that I don't believe it
> to be possible in a way that doesn't cause more problems than it
> solves.
I wonder if sending a signal (after doing a fixup)
is too much of a punishment?

> I'm trying to avoid a situation where we implement that policy and the
> interaction with modify_ldt() becomes very strange.
IMHO if you do the fixup on schedule (like setting
the selector to zero), then the interaction with
modify_ldt() is completely avoided, i.e. modify_ldt()
should then never special-case the threads that
did wrgsbase. So if something inconsistent comes
out, then it was likely there already without wrgsbase.

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2017-08-07 22:14 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2017-07-31  3:05 FSGSBASE ABI considerations Andy Lutomirski
2017-07-31  4:38 ` Linus Torvalds
2017-07-31 14:14   ` Andy Lutomirski
2017-07-31 10:55 ` Kirill A. Shutemov
2017-07-31 14:16   ` Andy Lutomirski
2017-07-31 21:23 ` Bae, Chang Seok
2017-08-07 16:14   ` Andy Lutomirski
2017-08-07  8:06 ` Stas Sergeev
2017-08-07 16:20   ` Andy Lutomirski
2017-08-07 16:49     ` Christopher Lameter
2017-08-07 17:35     ` Linus Torvalds
2017-08-07 19:07       ` Andy Lutomirski
2017-08-07 21:32     ` Stas Sergeev

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).