Re: KAISER memory layout (Re: [PATCH 06/23] x86, kaiser: introduce user-mapped percpu areas)

From: Andy Lutomirski <luto@amacapital.net>
To: Thomas Gleixner <tglx@linutronix.de>
Cc: Andy Lutomirski <luto@kernel.org>,
	Dave Hansen <dave.hansen@linux.intel.com>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"linux-mm@kvack.org" <linux-mm@kvack.org>,
	moritz.lipp@iaik.tugraz.at,
	Daniel Gruss <daniel.gruss@iaik.tugraz.at>,
	michael.schwarz@iaik.tugraz.at,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Kees Cook <keescook@google.com>, Hugh Dickins <hughd@google.com>,
	X86 ML <x86@kernel.org>, Borislav Petkov <bp@alien8.de>,
	Josh Poimboeuf <jpoimboe@redhat.com>
Subject: Re: KAISER memory layout (Re: [PATCH 06/23] x86, kaiser: introduce user-mapped percpu areas)
Date: Thu, 2 Nov 2017 16:36:29 +0100
Message-ID: <65E6D547-2871-4D93-9E10-24C31DB10269@amacapital.net>
In-Reply-To: <alpine.DEB.2.20.1711021343380.2090@nanos>

> On Nov 2, 2017, at 1:45 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
> 
> On Thu, 2 Nov 2017, Andy Lutomirski wrote:
>>> On Nov 2, 2017, at 12:48 PM, Thomas Gleixner <tglx@linutronix.de> wrote:
>>> 
>>>> On Thu, 2 Nov 2017, Andy Lutomirski wrote:
>>>> I think we're far enough along here that it may be time to nail down
>>>> the memory layout for real.  I propose the following:
>>>> 
>>>> The user tables will contain the following:
>>>> 
>>>> - The GDT array.
>>>> - The IDT.
>>>> - The vsyscall page.  We can make this be _PAGE_USER.
>>> 
>>> I rather remove it for the kaiser case.
>>> 
>>>> - The TSS.
>>>> - The per-cpu entry stack.  Let's make it one page with guard pages
>>>> on either side.  This can replace rsp_scratch.
>>>> - cpu_current_top_of_stack.  This could be in the same page as the TSS.
>>>> - The entry text.
>>>> - The percpu IST (aka "EXCEPTION") stacks.
>>> 
>>> Do you really want to put the full exception stacks into that user mapping?
>>> I think we should not do that. There are two options:
>>> 
>>> 1) Always use the per-cpu entry stack and switch to the proper IST after
>>>    the CR3 fixup
>> 
>> Can't -- it's microcode, not software, that does that switch.
> 
> Well, yes. The micro code does the stack switch to ISTs but software tells
> it to do so. We write the IDT IIRC.
> 
>>> 2) Have separate per-cpu entry stacks for the ISTs and switch to the real
>>>    ones after the CR3 fixup.
>> 
>> How is that simpler?
> 
> Simpler is not the question. I want to avoid mapping the whole IST stacks.
> 

OK, let's see.  We can have the IDT be different in the user tables and the kernel tables.  The user IDT could have IST-less entry stubs that do their own CR3 switch and then bounce to the IST stack.  I don't see why this wouldn't work aside from requiring a substantially larger entry stack, but I'm also not convinced it's worth the added complexity.  The NMI code would certainly need some careful thought to convince ourselves that it would still be correct.  #DF would be, um, interesting because of the silly ESPFIX64 thing.
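
To make the two-IDT idea concrete, here is a rough sketch.  It is illustrative only and not kernel code: the gate layout is the architectural 64-bit interrupt gate, but every name (struct idt_gate, make_gate, install_vector, the SKETCH_* constants) is made up for this example, and the handler and stub addresses are whatever the caller passes in.  The kernel table keeps the IST index, so the CPU switches stacks itself; the user table uses IST 0 and points at a stub, which would have to switch CR3 and move to the IST stack in software.

```c
#include <assert.h>
#include <stdint.h>

/* x86-64 interrupt gate descriptor: 16 bytes, architectural layout. */
struct idt_gate {
	uint16_t offset_low;	/* handler address bits 0..15 */
	uint16_t selector;	/* code segment selector */
	uint8_t  ist;		/* bits 0..2: IST index, 0 = no IST switch */
	uint8_t  type_attr;	/* present bit, DPL, gate type */
	uint16_t offset_mid;	/* handler address bits 16..31 */
	uint32_t offset_high;	/* handler address bits 32..63 */
	uint32_t reserved;
};
_Static_assert(sizeof(struct idt_gate) == 16, "gate must be 16 bytes");

/* Illustrative values: kernel code selector and a present DPL-0
 * 64-bit interrupt gate. */
#define SKETCH_KERNEL_CS	0x10
#define SKETCH_GATE_INTR	0x8E

/* Pack a handler address and IST index into a gate descriptor. */
static struct idt_gate make_gate(uint64_t handler, unsigned int ist)
{
	struct idt_gate g;

	assert(ist <= 7);	/* the IST field is only 3 bits wide */
	g.offset_low  = (uint16_t)(handler & 0xffff);
	g.selector    = SKETCH_KERNEL_CS;
	g.ist         = (uint8_t)ist;
	g.type_attr   = SKETCH_GATE_INTR;
	g.offset_mid  = (uint16_t)((handler >> 16) & 0xffff);
	g.offset_high = (uint32_t)(handler >> 32);
	g.reserved    = 0;
	return g;
}

/* Reassemble the handler address stored in a gate. */
static uint64_t gate_handler(const struct idt_gate *g)
{
	return (uint64_t)g->offset_low |
	       ((uint64_t)g->offset_mid << 16) |
	       ((uint64_t)g->offset_high << 32);
}

/*
 * Install one vector in both tables.  The kernel IDT entry keeps its
 * IST index.  The user IDT entry is IST-less and targets a stub that
 * (hypothetically) switches CR3 and then bounces to the IST stack.
 */
static void install_vector(struct idt_gate *kernel_idt,
			   struct idt_gate *user_idt,
			   unsigned int vec, unsigned int ist,
			   uint64_t real_handler, uint64_t user_stub)
{
	kernel_idt[vec] = make_gate(real_handler, ist);
	user_idt[vec]   = make_gate(user_stub, 0);
}
```

Since the user-table entry has IST 0, the CPU would deliver the exception on the ordinary entry stack when it arrives from user mode, which is where the "substantially larger entry stack" requirement comes from.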

My inclination would be to deal with this later.  For the first upstream version, we map the IST stacks.  Later on, we have a separate user IDT that does whatever it needs to do.

The argument to the contrary would be that Dave's CR3 code *and* my entry stack crap get simpler if all the CR3 switches happen in special stubs.

The argument against *that* is that this erase_kstack crap might also benefit from the magic stack switch.  OTOH that's the *exit* stack, which is totally independent.

FWIW, I want to get rid of the #DB and #BP stacks entirely, but that does not deserve to block this series, I think.

> Thanks,
> 
>    tglx