linux-kernel.vger.kernel.org archive mirror
* KAISER memory layout (Re: [PATCH 06/23] x86, kaiser: introduce user-mapped percpu areas)
@ 2017-11-02  9:41 Andy Lutomirski
  2017-11-02 11:48 ` Thomas Gleixner
  2017-11-02 16:36 ` Dave Hansen
  0 siblings, 2 replies; 7+ messages in thread
From: Andy Lutomirski @ 2017-11-02  9:41 UTC (permalink / raw)
  To: Dave Hansen
  Cc: linux-kernel, linux-mm, moritz.lipp, Daniel Gruss,
	michael.schwarz, Andrew Lutomirski, Linus Torvalds, Kees Cook,
	Hugh Dickins, X86 ML, Borislav Petkov, Josh Poimboeuf

On Tue, Oct 31, 2017 at 3:31 PM, Dave Hansen
<dave.hansen@linux.intel.com> wrote:
>
> These patches are based on work from a team at Graz University of
> Technology posted here: https://github.com/IAIK/KAISER
>

I think we're far enough along here that it may be time to nail down
the memory layout for real.  I propose the following:

The user tables will contain the following:

 - The GDT array.
 - The IDT.
 - The vsyscall page.  We can make this be _PAGE_USER.
 - The TSS.
 - The per-cpu entry stack.  Let's make it one page with guard pages
on either side.  This can replace rsp_scratch.
 - cpu_current_top_of_stack.  This could be in the same page as the TSS.
 - The entry text.
 - The percpu IST (aka "EXCEPTION") stacks.

That's it.
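
As a rough illustration of the list above, the per-CPU part of such a user-visible area could be pictured as one page-granular block. This is only a sketch under stated assumptions: the struct name, the field names, the page size constant, and the number of IST stacks are all hypothetical and do not come from any posted patch. The GDT, IDT, and entry text are left out because they are not per-CPU stacks or scratch data.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096u
#define SKETCH_N_IST     4u   /* hypothetical number of IST stacks */

/* Hypothetical per-CPU block that would appear in the user page tables. */
struct sketch_user_mapped_cpu_area {
	uint8_t guard_low[SKETCH_PAGE_SIZE];   /* left unmapped: catches entry stack overflow */
	uint8_t entry_stack[SKETCH_PAGE_SIZE]; /* one-page per-cpu entry stack */
	uint8_t guard_high[SKETCH_PAGE_SIZE];  /* left unmapped: catches underflow */
	uint8_t tss_page[SKETCH_PAGE_SIZE];    /* TSS, possibly with cpu_current_top_of_stack */
	uint8_t ist_stacks[SKETCH_N_IST][SKETCH_PAGE_SIZE];
};

/* The entry stack must start on a page boundary for the guard pages to work. */
static_assert(offsetof(struct sketch_user_mapped_cpu_area, entry_stack)
	      % SKETCH_PAGE_SIZE == 0, "entry stack must be page aligned");

/* Returns nonzero if the byte offset into the block falls in a guard page. */
int sketch_is_guard_page(size_t offset)
{
	size_t page = offset / SKETCH_PAGE_SIZE;

	return page == offsetof(struct sketch_user_mapped_cpu_area, guard_low) / SKETCH_PAGE_SIZE ||
	       page == offsetof(struct sketch_user_mapped_cpu_area, guard_high) / SKETCH_PAGE_SIZE;
}
```

Because x86 stacks grow down, an overflow of the entry stack in this picture would run into guard_low and fault instead of silently corrupting a neighbouring object.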

We can either try to move all of the above into the fixmap or we can
have the user tables be sparse a la Dave's current approach.  If we do
it the latter way, I think we'll want to add a mechanism to have holes
in the percpu space to give the entry stack a guard page.

I would *much* prefer moving everything into the fixmap, but that's a
wee bit awkward because we can't address per-cpu data in the fixmap
using %gs, which makes the SYSCALL code awkward.  But we could alias
the SYSCALL entry text itself per-cpu into the fixmap, which lets us
use %rip-relative addressing, which is quite nice.
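
To make the %rip-relative point concrete, here is a minimal sketch of the address arithmetic, assuming a fixmap that grows down from a fixed top address and gives each CPU the same number of consecutive slots. The top address, slot counts, and slot numbers are arbitrary placeholders, and every sketch_* name is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT        12
#define SKETCH_FIXADDR_TOP       0xffffffffff5ff000ULL /* arbitrary placeholder */
#define SKETCH_FIRST_PERCPU_SLOT 16u  /* hypothetical first per-CPU index */
#define SKETCH_SLOTS_PER_CPU     8u   /* hypothetical slots per CPU */
#define SKETCH_SLOT_ENTRY_TEXT   0u   /* per-CPU alias of the SYSCALL entry text */
#define SKETCH_SLOT_ENTRY_DATA   1u   /* per-CPU data the entry text needs */

/* Fixmap indices count downwards from the top of the fixmap. */
uint64_t sketch_fix_to_virt(unsigned int idx)
{
	return SKETCH_FIXADDR_TOP - ((uint64_t)idx << SKETCH_PAGE_SHIFT);
}

/* Fixmap index of a given slot in a given CPU's block. */
unsigned int sketch_percpu_fix_index(unsigned int cpu, unsigned int slot)
{
	return SKETCH_FIRST_PERCPU_SLOT + cpu * SKETCH_SLOTS_PER_CPU + slot;
}

/* Distance from a CPU's entry text alias to that CPU's data page. */
int64_t sketch_text_to_data_delta(unsigned int cpu)
{
	uint64_t text = sketch_fix_to_virt(sketch_percpu_fix_index(cpu, SKETCH_SLOT_ENTRY_TEXT));
	uint64_t data = sketch_fix_to_virt(sketch_percpu_fix_index(cpu, SKETCH_SLOT_ENTRY_DATA));

	return (int64_t)(data - text);
}
```

The delta is the same for every CPU, so the aliased entry text can carry one fixed %rip-relative displacement and still reach its own CPU's data without touching %gs.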

So I guess my preference is to actually try the fixmap approach.  We
give the TSS the same aliasing treatment we gave the GDT, and I can
try to make the entry trampoline work through the fixmap and thus not
need %gs-based addressing until CR3 gets updated.  (This actually
saves several cycles of latency.)

What do you all think?

I'll deal with the LDT separately.  It will either live in the
fixmap-like region or it will live at the top of the user address
space.

[1] https://git.kernel.org/pub/scm/linux/kernel/git/luto/linux.git/log/?h=x86/entry_consolidation


* Re: KAISER memory layout (Re: [PATCH 06/23] x86, kaiser: introduce user-mapped percpu areas)
  2017-11-02  9:41 KAISER memory layout (Re: [PATCH 06/23] x86, kaiser: introduce user-mapped percpu areas) Andy Lutomirski
@ 2017-11-02 11:48 ` Thomas Gleixner
  2017-11-02 12:00   ` Andy Lutomirski
  2017-11-02 16:36 ` Dave Hansen
  1 sibling, 1 reply; 7+ messages in thread
From: Thomas Gleixner @ 2017-11-02 11:48 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Dave Hansen, linux-kernel, linux-mm, moritz.lipp, Daniel Gruss,
	michael.schwarz, Linus Torvalds, Kees Cook, Hugh Dickins, X86 ML,
	Borislav Petkov, Josh Poimboeuf

On Thu, 2 Nov 2017, Andy Lutomirski wrote:
> I think we're far enough along here that it may be time to nail down
> the memory layout for real.  I propose the following:
> 
> The user tables will contain the following:
> 
>  - The GDT array.
>  - The IDT.
>  - The vsyscall page.  We can make this be _PAGE_USER.

I'd rather remove it for the KAISER case.

>  - The TSS.
>  - The per-cpu entry stack.  Let's make it one page with guard pages
> on either side.  This can replace rsp_scratch.
>  - cpu_current_top_of_stack.  This could be in the same page as the TSS.
>  - The entry text.
>  - The percpu IST (aka "EXCEPTION") stacks.

Do you really want to put the full exception stacks into that user mapping?
I think we should not do that. There are two options:

  1) Always use the per-cpu entry stack and switch to the proper IST after
     the CR3 fixup

  2) Have separate per-cpu entry stacks for the ISTs and switch to the real
     ones after the CR3 fixup.

> We can either try to move all of the above into the fixmap or we can
> have the user tables be sparse a la Dave's current approach.  If we do
> it the latter way, I think we'll want to add a mechanism to have holes
> in the percpu space to give the entry stack a guard page.
> 
> I would *much* prefer moving everything into the fixmap, but that's a
> wee bit awkward because we can't address per-cpu data in the fixmap
> using %gs, which makes the SYSCALL code awkward.  But we could alias
> the SYSCALL entry text itself per-cpu into the fixmap, which lets us
> use %rip-relative addressing, which is quite nice.
>
> So I guess my preference is to actually try the fixmap approach.  We
> give the TSS the same aliasing treatment we gave the GDT, and I can
> try to make the entry trampoline work through the fixmap and thus not
> need %gs-based addressing until CR3 gets updated.  (This actually
> saves several cycles of latency.)

Makes a lot of sense.

Thanks,

	tglx


* Re: KAISER memory layout (Re: [PATCH 06/23] x86, kaiser: introduce user-mapped percpu areas)
  2017-11-02 11:48 ` Thomas Gleixner
@ 2017-11-02 12:00   ` Andy Lutomirski
  2017-11-02 12:45     ` Thomas Gleixner
  0 siblings, 1 reply; 7+ messages in thread
From: Andy Lutomirski @ 2017-11-02 12:00 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Andy Lutomirski, Dave Hansen, linux-kernel, linux-mm,
	moritz.lipp, Daniel Gruss, michael.schwarz, Linus Torvalds,
	Kees Cook, Hugh Dickins, X86 ML, Borislav Petkov, Josh Poimboeuf



> On Nov 2, 2017, at 12:48 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
> 
>> On Thu, 2 Nov 2017, Andy Lutomirski wrote:
>> I think we're far enough along here that it may be time to nail down
>> the memory layout for real.  I propose the following:
>> 
>> The user tables will contain the following:
>> 
>> - The GDT array.
>> - The IDT.
>> - The vsyscall page.  We can make this be _PAGE_USER.
> 
> I'd rather remove it for the KAISER case.
> 
>> - The TSS.
>> - The per-cpu entry stack.  Let's make it one page with guard pages
>> on either side.  This can replace rsp_scratch.
>> - cpu_current_top_of_stack.  This could be in the same page as the TSS.
>> - The entry text.
>> - The percpu IST (aka "EXCEPTION") stacks.
> 
> Do you really want to put the full exception stacks into that user mapping?
> I think we should not do that. There are two options:
> 
>  1) Always use the per-cpu entry stack and switch to the proper IST after
>     the CR3 fixup

Can't -- it's microcode, not software, that does that switch.

> 
>  2) Have separate per-cpu entry stacks for the ISTs and switch to the real
>     ones after the CR3 fixup.

How is that simpler?



* Re: KAISER memory layout (Re: [PATCH 06/23] x86, kaiser: introduce user-mapped percpu areas)
  2017-11-02 12:00   ` Andy Lutomirski
@ 2017-11-02 12:45     ` Thomas Gleixner
  2017-11-02 15:36       ` Andy Lutomirski
  0 siblings, 1 reply; 7+ messages in thread
From: Thomas Gleixner @ 2017-11-02 12:45 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andy Lutomirski, Dave Hansen, linux-kernel, linux-mm,
	moritz.lipp, Daniel Gruss, michael.schwarz, Linus Torvalds,
	Kees Cook, Hugh Dickins, X86 ML, Borislav Petkov, Josh Poimboeuf

On Thu, 2 Nov 2017, Andy Lutomirski wrote:
> > On Nov 2, 2017, at 12:48 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
> > 
> >> On Thu, 2 Nov 2017, Andy Lutomirski wrote:
> >> I think we're far enough along here that it may be time to nail down
> >> the memory layout for real.  I propose the following:
> >> 
> >> The user tables will contain the following:
> >> 
> >> - The GDT array.
> >> - The IDT.
> >> - The vsyscall page.  We can make this be _PAGE_USER.
> > 
> > I'd rather remove it for the KAISER case.
> > 
> >> - The TSS.
> >> - The per-cpu entry stack.  Let's make it one page with guard pages
> >> on either side.  This can replace rsp_scratch.
> >> - cpu_current_top_of_stack.  This could be in the same page as the TSS.
> >> - The entry text.
> >> - The percpu IST (aka "EXCEPTION") stacks.
> > 
> > Do you really want to put the full exception stacks into that user mapping?
> > I think we should not do that. There are two options:
> > 
> >  1) Always use the per-cpu entry stack and switch to the proper IST after
> >     the CR3 fixup
> 
> Can't -- it's microcode, not software, that does that switch.

Well, yes. The microcode does the stack switch to the ISTs, but software
tells it to do so. We write the IDT, IIRC.
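
For reference, this is roughly where software "tells it to do so": each 64-bit IDT gate carries a 3-bit IST index, and a non-zero index makes the CPU load the stack pointer from the matching IST slot in the TSS when the vector fires. The sketch below follows the architectural 16-byte gate layout; the struct and function names are hypothetical and this is not the kernel's own definition.

```c
#include <assert.h>
#include <stdint.h>

/* 16-byte x86-64 interrupt gate, fields in architectural order. */
struct sketch_idt_gate {
	uint16_t offset_low;   /* handler address bits 15:0 */
	uint16_t selector;     /* code segment selector */
	uint8_t  ist;          /* bits 2:0: IST index, 0 = no IST stack switch */
	uint8_t  type_attr;    /* 0x8e: present, DPL 0, interrupt gate */
	uint16_t offset_mid;   /* handler address bits 31:16 */
	uint32_t offset_high;  /* handler address bits 63:32 */
	uint32_t reserved;
};

static_assert(sizeof(struct sketch_idt_gate) == 16, "gate must be 16 bytes");

/* Build a gate for the given handler; ist_index 0 means "do not use an IST". */
struct sketch_idt_gate sketch_make_gate(uint64_t handler, uint16_t selector,
					unsigned int ist_index)
{
	struct sketch_idt_gate g;

	g.offset_low  = (uint16_t)(handler & 0xffff);
	g.selector    = selector;
	g.ist         = (uint8_t)(ist_index & 0x7);
	g.type_attr   = 0x8e;
	g.offset_mid  = (uint16_t)((handler >> 16) & 0xffff);
	g.offset_high = (uint32_t)(handler >> 32);
	g.reserved    = 0;
	return g;
}

/* Reassemble the handler address from the three offset fields. */
uint64_t sketch_gate_handler(struct sketch_idt_gate g)
{
	return (uint64_t)g.offset_low |
	       ((uint64_t)g.offset_mid << 16) |
	       ((uint64_t)g.offset_high << 32);
}
```

Option 1 above would amount to writing gates with an IST index of 0 and doing the switch to the real exception stack in software after the CR3 fixup.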

> >  2) Have separate per-cpu entry stacks for the ISTs and switch to the real
> >     ones after the CR3 fixup.
> 
> How is that simpler?

Simpler is not the question. I want to avoid mapping the whole IST stacks.

Thanks,

	tglx


* Re: KAISER memory layout (Re: [PATCH 06/23] x86, kaiser: introduce user-mapped percpu areas)
  2017-11-02 12:45     ` Thomas Gleixner
@ 2017-11-02 15:36       ` Andy Lutomirski
  2017-11-02 16:03         ` Thomas Gleixner
  0 siblings, 1 reply; 7+ messages in thread
From: Andy Lutomirski @ 2017-11-02 15:36 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Andy Lutomirski, Dave Hansen, linux-kernel, linux-mm,
	moritz.lipp, Daniel Gruss, michael.schwarz, Linus Torvalds,
	Kees Cook, Hugh Dickins, X86 ML, Borislav Petkov, Josh Poimboeuf



> On Nov 2, 2017, at 1:45 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
> 
> On Thu, 2 Nov 2017, Andy Lutomirski wrote:
>>> On Nov 2, 2017, at 12:48 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
>>> 
>>>> On Thu, 2 Nov 2017, Andy Lutomirski wrote:
>>>> I think we're far enough along here that it may be time to nail down
>>>> the memory layout for real.  I propose the following:
>>>> 
>>>> The user tables will contain the following:
>>>> 
>>>> - The GDT array.
>>>> - The IDT.
>>>> - The vsyscall page.  We can make this be _PAGE_USER.
>>> 
>>> I'd rather remove it for the KAISER case.
>>> 
>>>> - The TSS.
>>>> - The per-cpu entry stack.  Let's make it one page with guard pages
>>>> on either side.  This can replace rsp_scratch.
>>>> - cpu_current_top_of_stack.  This could be in the same page as the TSS.
>>>> - The entry text.
>>>> - The percpu IST (aka "EXCEPTION") stacks.
>>> 
>>> Do you really want to put the full exception stacks into that user mapping?
>>> I think we should not do that. There are two options:
>>> 
>>> 1) Always use the per-cpu entry stack and switch to the proper IST after
>>>    the CR3 fixup
>> 
>> Can't -- it's microcode, not software, that does that switch.
> 
> Well, yes. The microcode does the stack switch to the ISTs, but software
> tells it to do so. We write the IDT, IIRC.
> 
>>> 2) Have separate per-cpu entry stacks for the ISTs and switch to the real
>>>    ones after the CR3 fixup.
>> 
>> How is that simpler?
> 
> Simpler is not the question. I want to avoid mapping the whole IST stacks.
> 

OK, let's see.  We can have the IDT be different in the user tables and
the kernel tables.  The user IDT could have IST-less entry stubs that do
their own CR3 switch and then bounce to the IST stack.  I don't see why
this wouldn't work aside from requiring a substantially larger entry
stack, but I'm also not convinced it's worth the added complexity.  The
NMI code would certainly need some careful thought to convince ourselves
that it would still be correct.  #DF would be, um, interesting because of
the silly ESPFIX64 thing.

My inclination would be to deal with this later.  For the first upstream
version, we map the IST stacks.  Later on, we have a separate user IDT
that does whatever it needs to do.

The argument to the contrary would be that Dave's CR3 code *and* my entry
stack crap get simpler if all the CR3 switches happen in special stubs.

The argument against *that* is that this erase_kstack crap might also
benefit from the magic stack switch.  OTOH that's the *exit* stack, which
is totally independent.

FWIW, I want to get rid of the #DB and #BP stacks entirely, but that does
not deserve to block this series, I think.



* Re: KAISER memory layout (Re: [PATCH 06/23] x86, kaiser: introduce user-mapped percpu areas)
  2017-11-02 15:36       ` Andy Lutomirski
@ 2017-11-02 16:03         ` Thomas Gleixner
  0 siblings, 0 replies; 7+ messages in thread
From: Thomas Gleixner @ 2017-11-02 16:03 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Andy Lutomirski, Dave Hansen, linux-kernel, linux-mm,
	moritz.lipp, Daniel Gruss, michael.schwarz, Linus Torvalds,
	Kees Cook, Hugh Dickins, X86 ML, Borislav Petkov, Josh Poimboeuf

On Thu, 2 Nov 2017, Andy Lutomirski wrote:
> > On Nov 2, 2017, at 1:45 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
> > Simpler is not the question. I want to avoid mapping the whole IST stacks.
> > 
> 
> OK, let's see.  We can have the IDT be different in the user tables and
> the kernel tables.  The user IDT could have IST-less entry stubs that do
> their own CR3 switch and then bounce to the IST stack.  I don't see why
> this wouldn't work aside from requiring a substantially larger entry
> stack, but I'm also not convinced it's worth the added complexity.  The
> NMI code would certainly need some careful thought to convince ourselves
> that it would still be correct.  #DF would be, um, interesting because of
> the silly ESPFIX64 thing.

> My inclination would be to deal with this later.  For the first upstream
> version, we map the IST stacks.  Later on, we have a separate user IDT
> that does whatever it needs to do.
>
> The argument to the contrary would be that Dave's CR3 code *and* my entry
> stack crap get simpler if all the CR3 switches happen in special stubs.
>
> The argument against *that* is that this erase_kstack crap might also
> benefit from the magic stack switch.  OTOH that's the *exit* stack, which
> is totally independent.

My initial thought was: Always use IST stub stacks for entry and exit.

So the entry/exit stubs deal with the CR3 stuff and also with the extra
magic for espfix and nested NMIs, etc. Once that is done, you just flip
over to the relevant kernel internal stack and switch back to the user
visible one on return. Haven't thought that through completely, but in my
naive view it made stuff simpler.

> FWIW, I want to get rid of the #DB and #BP stacks entirely, but that does
> not deserve to block this series, I think.

Agreed.

Thanks,

	tglx


* Re: KAISER memory layout (Re: [PATCH 06/23] x86, kaiser: introduce user-mapped percpu areas)
  2017-11-02  9:41 KAISER memory layout (Re: [PATCH 06/23] x86, kaiser: introduce user-mapped percpu areas) Andy Lutomirski
  2017-11-02 11:48 ` Thomas Gleixner
@ 2017-11-02 16:36 ` Dave Hansen
  1 sibling, 0 replies; 7+ messages in thread
From: Dave Hansen @ 2017-11-02 16:36 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: linux-kernel, linux-mm, moritz.lipp, Daniel Gruss,
	michael.schwarz, Linus Torvalds, Kees Cook, Hugh Dickins, X86 ML,
	Borislav Petkov, Josh Poimboeuf

On 11/02/2017 02:41 AM, Andy Lutomirski wrote:
> 
>  - The GDT array.
>  - The IDT.
>  - The vsyscall page.  We can make this be _PAGE_USER.
>  - The TSS.
>  - The per-cpu entry stack.  Let's make it one page with guard pages
> on either side.  This can replace rsp_scratch.
>  - cpu_current_top_of_stack.  This could be in the same page as the TSS.
>  - The entry text.
>  - The percpu IST (aka "EXCEPTION") stacks.
> 
> That's it.

The PEBS/BTS buffers need it too, I think:

https://git.kernel.org/pub/scm/linux/kernel/git/daveh/x86-kaiser.git/commit/?h=kaiser-414rc6-20171031&id=97a334906d7853a8109b295ef94f3991418d0c07


