From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-kernel-owner@vger.kernel.org>
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S1752355AbcJCVga (ORCPT <rfc822;w@1wt.eu>);
        Mon, 3 Oct 2016 17:36:30 -0400
Received: from mail-vk0-f43.google.com ([209.85.213.43]:34000 "EHLO
        mail-vk0-f43.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S1751469AbcJCVgY (ORCPT
        <rfc822;linux-kernel@vger.kernel.org>);
        Mon, 3 Oct 2016 17:36:24 -0400
MIME-Version: 1.0
In-Reply-To: <1475529679.4622.27.camel@redhat.com>
References: <1475353895-22175-1-git-send-email-riel@redhat.com>
 <1475353895-22175-3-git-send-email-riel@redhat.com> <CALCETrWnhA4Gd5AxDEJz7d-VQNDN7O8ayyna7fz3=HNtNnARxQ@mail.gmail.com>
 <1475366900.21644.8.camel@redhat.com> <CALCETrWzYa1F7ucq=gMUhEbY+=uqSAaOgw1qZA3VkKrCo_7XOg@mail.gmail.com>
 <1475529679.4622.27.camel@redhat.com>
From: Andy Lutomirski <luto@amacapital.net>
Date: Mon, 3 Oct 2016 14:36:02 -0700
Message-ID: <CALCETrX_A71XYJxSxprd_ujHFUK_M7UfmCXeNnP1iGx17CSu8A@mail.gmail.com>
Subject: Re: [PATCH RFC 2/5] x86,fpu: delay FPU register loading until switch
 to userspace
To: Rik van Riel <riel@redhat.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>,
        "linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
        X86 ML <x86@kernel.org>, Thomas Gleixner <tglx@linutronix.de>,
        Paolo Bonzini <pbonzini@redhat.com>, Ingo Molnar <mingo@redhat.com>,
        Andrew Lutomirski <luto@kernel.org>, "H. Peter Anvin" <hpa@zytor.com>,
        Borislav Petkov <bp@suse.de>
Content-Type: text/plain; charset=UTF-8
Sender: linux-kernel-owner@vger.kernel.org
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org

On Mon, Oct 3, 2016 at 2:21 PM, Rik van Riel <riel@redhat.com> wrote:
> On Mon, 2016-10-03 at 13:54 -0700, Andy Lutomirski wrote:
>> On Sat, Oct 1, 2016 at 5:08 PM, Rik van Riel <riel@redhat.com> wrote:
>> >
>> > On Sat, 2016-10-01 at 16:44 -0700, Andy Lutomirski wrote:
>> > >
>> > > On Sat, Oct 1, 2016 at 1:31 PM,  <riel@redhat.com> wrote:
>> > > >
>> > > >
>> > > >  #define CREATE_TRACE_POINTS
>> > > >  #include <trace/events/syscalls.h>
>> > > > @@ -197,6 +198,9 @@ __visible inline void
>> > > > prepare_exit_to_usermode(struct pt_regs *regs)
>> > > >         if (unlikely(cached_flags &
>> > > > EXIT_TO_USERMODE_LOOP_FLAGS))
>> > > >                 exit_to_usermode_loop(regs, cached_flags);
>> > > >
>> > > > +       if (unlikely(test_and_clear_thread_flag(TIF_LOAD_FPU)))
>> > > > +               switch_fpu_return();
>> > > > +
>> > > How about:
>> > >
>> > > if (unlikely(...)) {
>> > >   exit_to_usermode_loop(regs, cached_flags);
>> > >   cached_flags = READ_ONCE(ti->flags);

Like this?

>> > > }
>> > >
>> > > if (ti->flags & _TIF_LOAD_FPU) {
>> > >   clear_thread_flag(TIF_LOAD_FPU);
>> > >   switch_fpu_return();
>> > > }
>> > It looks like test_thread_flag should be fine for avoiding
>> > atomics, too?
>> >
>> >   if (unlikely(test_thread_flag(TIF_LOAD_FPU))) {
>> >       clear_thread_flag(TIF_LOAD_FPU);
>> >       switch_fpu_return();
>> >   }
>> True, but we've got that cached_flags variable already...
>
> The cached flag may no longer be valid after
> exit_to_usermode_loop() returns, because that
> code is preemptible.
>
> We have to reload a fresh ti->flags after
> preemption is disabled and irqs are off.

See above...

>
>> One of my objections to the current code structure is that I find it
>> quite hard to keep track of all of the possible states of the fpu
>> that
>> we have.  Off the top of my head, we have:
>>
>> 1. In-cpu state is valid and in-memory state is invalid.  (This is
>> the
>> case when we're running user code, for example.)
>> 2. In-memory state is valid.
>> 2a. In-cpu state is *also* valid for some particular task.  In this
>> state, no one may write to either FPU state copy for that task lest
>> they get out of sync.
>> 2b. In-cpu state is not valid.
>
> It gets more fun with ptrace. Look at xfpregs_get
> and xfpregs_set, which call fpu__activate_fpstate_read
> and fpu__activate_fpstate_write respectively.
>
> It looks like those functions already deal with the
> cases you outline above.

Hmm, maybe they could be used.  The names are IMO atrocious.  I had to
read the docs and the code to figure out WTF "activating" the
"fpstate" even meant.

>
>> Rik, your patches further complicate this by making it possible to
>> have the in-cpu state be valid for a non-current task even when
>> non-idle.
>
> However, we never need to examine the in-cpu state for
> a task that is not currently running, or being scheduled
> to in the context switch code.
>
> We always save the FPU register state when a task is
> context switched out, so for a non-current task, we
> still only need to consider whether the in-memory state
> is valid or not.
>
> The in-cpu state does not matter.
>
> fpu__activate_fpstate_write will disable lazy restore,
> and ensure a full restore from memory.

Something has to fix up owner_ctx so that the previous thread doesn't
think its CPU state is still valid.  I don't think
activate_fpstate_write does it, because nothing should (I think) be
calling it as a mattor of course on context switch.

>
>>  This is why I'm thinking that we should maybe revisit the
>> API a bit to make this clearer and to avoid scattering around all the
>> state manipulation quite so much.  I propose, as a straw-man:
>>
>> user_fpregs_copy_to_cpu() -- Makes the CPU state be valid for
>> current.
>> After calling this, we're in state 1 or 2a.
>>
>> user_fpregs_move_to_cpu() -- Makes the CPU state be valid and
>> writable
>> for current.  Invalidates any in-memory copy.  After calling this,
>> we're in state 1.
>
> How can we distinguish between these two?
>
> If the task is running, it can change its FPU
> registers at any time, without any obvious thing
> to mark the transition from state 2a to state 1.

That's true now, but I don't think it's really true any more with your
patches.  If a task is running in *kernel* mode, then I think that
pretty much all of the combinations are possible.  When the task
actually enters user mode, we have to make sure we're in state 1 so
the CPU regs belong to the task and are writable.

>
> It sounds like the first function you name would
> only be useful if we prevented a switch to user
> space, which could immediately do whatever it wants
> with the registers.
>
> Is there a use case for user_fpregs_copy_to_cpu()?

With PKRU, any user memory access would need
user_fpregs_copy_to_cpu().  I think we should get rid of this and just
switch to using RDPKRU and WRPKRU to eagerly update PKRU in every
context switch to avoid this.  Other than that, I can't think of any
use cases off the top of my head unless the signal code wanted to play
games with this stuff.

>
>> user_fpregs_copy_to_memory() -- Makes the in-memory state be valid
>> for
>> current.  After calling this, we're in state 2a or 2b.
>>
>> user_fpregs_move_to_memory() -- Makes the in-memory state be valid
>> and
>> writable for current.  After calling this, we're in state 2b.
>
> We can only ensure that the in memory copy stays
> valid, by stopping the task from writing to the
> in-CPU copy.
>
> This sounds a lot like fpu__activate_fpstate_read
>
> What we need after copying the contents to memory is
> a way to invalidate the in-CPU copy, if changes were
> made to the in-memory copy.

My proposal is that we invalidate the CPU state *before* writing the
in-memory state, not after.

>
> This is already done by fpu__activate_fpstate_write
>
>> Scheduling out calls user_fpregs_copy_to_memory(), as do all the MPX
>> and PKRU memory state readers.  MPX and PKRU will do
>> user_fpregs_move_to_memory() before writing to memory.
>> kernel_fpu_begin() does user_fpregs_move_to_memory().
>
> This is done by switch_fpu_prepare.
>
> Can you think of any other place in the kernel where we
> need a 1 -> 2a transition, besides context switching and
> ptrace/xfpregs_get?

Anything else that tries to read task xstate from memory, i.e. MPX and
PKRU.  (Although if we switch to eager-switched PKRU, then PKRU stops
mattering for this purpose.)

Actually, I don't see any way your patches can be compatible with PKRU
without switching to eager-switched PKRU.

>
>> There are a couple ways we could deal with TIF_LOAD_FPU in this
>> scheme.  My instinct would be to rename it TIF_REFRESH_FPU, set it
>> unconditionally on schedule-in *and* on user_fpregs_xxx_to_memory(),
>> and make the handler do user_fpregs_move_to_cpu().
>
> I can see where you are coming from, but given that
> there is no way to run a task besides context switching
> to it, I am not sure why we would need to do anything
> besides disabling the lazy restore (indicating that the
> in-CPU state is stale) anywhere else?
>
> If user_fpregs_move_to_cpu determines that the in-register
> state is still valid, it can skip the restore.
>
> This is what switch_fpu_finish does after my first patch,
> and switch_fpu_prepare later in my series.

I think my main objection is that there are too many functions like
this outside the fpu/ directory that all interact in complicated ways.
I think that activate_fpststate_read/write are along the right lines,
but we should start using them and their friends everywhere.