Re: Lazy FPU restoration / moving kernel_fpu_end() to context switch

From: Rik van Riel <riel@surriel.com>
To: Andy Lutomirski <luto@kernel.org>,
	"Jason A. Donenfeld" <Jason@zx2c4.com>,
	Dave Hansen <dave.hansen@linux.intel.com>
Cc: LKML <linux-kernel@vger.kernel.org>, X86 ML <x86@kernel.org>
Subject: Re: Lazy FPU restoration / moving kernel_fpu_end() to context switch
Date: Fri, 15 Jun 2018 14:40:02 -0400	[thread overview]
Message-ID: <1529088002.7898.156.camel@surriel.com> (raw)
In-Reply-To: <CALCETrU+2mBPDfkBz1i_GT1EOJau+mzj4yOK8N0UnT2pGjiUWQ@mail.gmail.com>

[-- Attachment #1: Type: text/plain, Size: 3746 bytes --]

On Fri, 2018-06-15 at 11:31 -0700, Andy Lutomirski wrote:
> On Fri, Jun 15, 2018 at 6:11 AM Jason A. Donenfeld <Jason@zx2c4.com>
> wrote:
> > 
> > Hi Andy & folks,
> > 
> > Lots of crypto routines look like this:
> > 
> > kernel_fpu_begin();
> > encrypt();
> > kernel_fpu_end();
> > 
> > If you call such a routine twice, you get:
> > 
> > kernel_fpu_begin();
> > encrypt();
> > kernel_fpu_end();
> > kernel_fpu_begin();
> > encrypt();
> > kernel_fpu_end();
> > 
> > In a loop this looks like:
> > 
> > for (thing) {
> >   kernel_fpu_begin();
> >   encrypt(thing);
> >   kernel_fpu_end();
> > }
> > 
> > This is obviously very bad, because begin() and end() are slow, so
> > WireGuard does the obvious:
> > 
> > kernel_fpu_begin();
> > for (thing)
> >   encrypt(thing);
> > kernel_fpu_end();
> > 
> > This is fine and well, and the crypto API I'm working on will
> > enable
> > this to be done in a clear way, but I do wonder if maybe this is
> > not
> > something that should be happening at the level of the caller, but
> > rather in the fpu functions themselves. Namely, what are your
> > thoughts
> > on modifying kernel_fpu_end() so that it doesn't actually restore
> > the
> > state, but just reenables preemption and marks that on the next
> > context switch, the state should be restored? Then, essentially,
> > kernel_fpu_begin() and end() become free after the first usage of
> > kernel_fpu_begin().
> > 
> > Is this something feasible? I know that performance-wise, I'm
> > really
> > gaining a lot from hoisting those functions out of the loops, and
> > API
> > wise, it'd be slightly simpler to implement if I didn't have to all
> > for that hoisting.
> 
> Hi Jason-
> 
> Funny you asked.  This has been discussed a few times, although not
> quite in the form you imagined.  The idea that we've tossed around is
> to restore FPU state on return to user mode.  Roughly, we'd introduce
> a new thread flag TIF_FPU_UNLOADED (name TBD).
> prepare_exit_to_usermode() would notice this flag, copy the fpstate
> to
> fpregs, and clear the flag.  (Or maybe exit_to_usermode_loop() -- No
> one has quite thought it through, but I think it should be outside
> the
> loop.)  We'd update all the FPU accessors to understand the flag.
> We'd probably have a helper like:
> 
> void unload_user_fpstate(void)
> 
> that would do nothing if TIF_FPU_UNLOADED is set.  If
> TIF_FPU_UNLOADED
> is clear, it would move the state to memory and set TIF_FPU_UNLOADED.
> We'd call unload_user_fpstate() during context switch in the prev
> task
> context and in kernel_fpu_begin().
> 
> The one major complication I know of is that PKRU is different from
> all other FPU state in that it needs to be loaded when *kernel* code
> does any user memory access.  So we'd need to make sure that context
> switches load the new task's PKRU eagerly.  Using WRPKRU is easy,
> but,
> unless we do something very clever, actually finding PKRU in the
> in-memory fpstate image may be slightly nontrivial.

IIRC the in-kernel FPU state always has every FPU field
at a constant location.

> I suppose we could eagerly load the new FPU state on context
> switches,
> but I'm not sure I see any point.  We'd still have to load it when we
> switch back to user mode, so it would be slower but not necessarily
> any simpler.
> 
> I'd love to review patches to make this change, but I don't have the
> bandwidth to write them right now.

I can take a look at this, I am already looking at
some context switch code right now, anyway.

I also should still have the TIF_FPU_UNLOADED patches
(or whatever the flag was called) code around somewhere.

-- 
All Rights Reversed.

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 488 bytes --]