From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-kernel-owner@vger.kernel.org>
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S1754208AbcJCUy3 (ORCPT <rfc822;w@1wt.eu>);
        Mon, 3 Oct 2016 16:54:29 -0400
Received: from mail-vk0-f43.google.com ([209.85.213.43]:36188 "EHLO
        mail-vk0-f43.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S1753874AbcJCUy0 (ORCPT
        <rfc822;linux-kernel@vger.kernel.org>);
        Mon, 3 Oct 2016 16:54:26 -0400
MIME-Version: 1.0
In-Reply-To: <1475366900.21644.8.camel@redhat.com>
References: <1475353895-22175-1-git-send-email-riel@redhat.com>
 <1475353895-22175-3-git-send-email-riel@redhat.com> <CALCETrWnhA4Gd5AxDEJz7d-VQNDN7O8ayyna7fz3=HNtNnARxQ@mail.gmail.com>
 <1475366900.21644.8.camel@redhat.com>
From: Andy Lutomirski <luto@amacapital.net>
Date: Mon, 3 Oct 2016 13:54:02 -0700
Message-ID: <CALCETrWzYa1F7ucq=gMUhEbY+=uqSAaOgw1qZA3VkKrCo_7XOg@mail.gmail.com>
Subject: Re: [PATCH RFC 2/5] x86,fpu: delay FPU register loading until switch
 to userspace
To: Rik van Riel <riel@redhat.com>,
        Dave Hansen <dave.hansen@linux.intel.com>
Cc: "linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
        X86 ML <x86@kernel.org>, Thomas Gleixner <tglx@linutronix.de>,
        Paolo Bonzini <pbonzini@redhat.com>, Ingo Molnar <mingo@redhat.com>,
        Andrew Lutomirski <luto@kernel.org>, "H. Peter Anvin" <hpa@zytor.com>,
        Borislav Petkov <bp@suse.de>
Content-Type: text/plain; charset=UTF-8
Sender: linux-kernel-owner@vger.kernel.org
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org

On Sat, Oct 1, 2016 at 5:08 PM, Rik van Riel <riel@redhat.com> wrote:
> On Sat, 2016-10-01 at 16:44 -0700, Andy Lutomirski wrote:
>> On Sat, Oct 1, 2016 at 1:31 PM,  <riel@redhat.com> wrote:
>> >
>> >  #define CREATE_TRACE_POINTS
>> >  #include <trace/events/syscalls.h>
>> > @@ -197,6 +198,9 @@ __visible inline void
>> > prepare_exit_to_usermode(struct pt_regs *regs)
>> >         if (unlikely(cached_flags & EXIT_TO_USERMODE_LOOP_FLAGS))
>> >                 exit_to_usermode_loop(regs, cached_flags);
>> >
>> > +       if (unlikely(test_and_clear_thread_flag(TIF_LOAD_FPU)))
>> > +               switch_fpu_return();
>> > +
>>
>> How about:
>>
>> if (unlikely(...)) {
>>   exit_to_usermode_loop(regs, cached_flags);
>>   cached_flags = READ_ONCE(ti->flags);
>> }
>>
>> if (ti->flags & _TIF_LOAD_FPU) {
>>   clear_thread_flag(TIF_LOAD_FPU);
>>   switch_fpu_return();
>> }
>
> It looks like test_thread_flag should be fine for avoiding
> atomics, too?
>
>   if (unlikely(test_thread_flag(TIF_LOAD_FPU))) {
>       clear_thread_flag(TIF_LOAD_FPU);
>       switch_fpu_return();
>   }

True, but we've got that cached_flags variable already...

>
>
>> > +static inline void switch_fpu_finish(void)
>> >  {
>> > +       set_thread_flag(TIF_LOAD_FPU);
>> >  }
>>
>> I can imagine this causing problems with kernel code that accesses
>> current's FPU state, e.g. get_xsave_field_ptr().  I wonder if it
>> would
>> make sense to make your changes deeper into the FPU core innards so
>> that, for example, we'd have explicit functions that cause the
>
> Whereabouts do you have in mind?
>
> I like your general idea, but am not sure quite where
> to put it.

Somewhere in arch/x86/kernel/fpu/ perhaps?

One of my objections to the current code structure is that I find it
quite hard to keep track of all of the possible states of the fpu that
we have.  Off the top of my head, we have:

1. In-cpu state is valid and in-memory state is invalid.  (This is the
case when we're running user code, for example.)
2. In-memory state is valid.
2a. In-cpu state is *also* valid for some particular task.  In this
state, no one may write to either FPU state copy for that task lest
they get out of sync.
2b. In-cpu state is not valid.

Rik, your patches further complicate this by making it possible to
have the in-cpu state be valid for a non-current task even when
non-idle.  This is why I'm thinking that we should maybe revisit the
API a bit to make this clearer and to avoid scattering around all the
state manipulation quite so much.  I propose, as a straw-man:

user_fpregs_copy_to_cpu() -- Makes the CPU state be valid for current.
After calling this, we're in state 1 or 2a.

user_fpregs_move_to_cpu() -- Makes the CPU state be valid and writable
for current.  Invalidates any in-memory copy.  After calling this,
we're in state 1.

user_fpregs_copy_to_memory() -- Makes the in-memory state be valid for
current.  After calling this, we're in state 2a or 2b.

user_fpregs_move_to_memory() -- Makes the in-memory state be valid and
writable for current.  After calling this, we're in state 2b.

The xxx_to_cpu() functions will make sure that, if the cpu state was
valid for a different task, it stops being valid for that different
task.

Scheduling out calls user_fpregs_copy_to_memory(), as do all the MPX
and PKRU memory state readers.  MPX and PKRU will do
user_fpregs_move_to_memory() before writing to memory.
kernel_fpu_begin() does user_fpregs_move_to_memory().

There are a couple ways we could deal with TIF_LOAD_FPU in this
scheme.  My instinct would be to rename it TIF_REFRESH_FPU, set it
unconditionally on schedule-in *and* on user_fpregs_xxx_to_memory(),
and make the handler do user_fpregs_move_to_cpu().  This avoids
needing to deal with complications in which we're in state 2a and we
land in actual user code and then forget that the in-memory copy is
out of date.   It also means we get TIF_REFRESH_FPU for free for
newly-forked threads.

>> in-memory state for current to be up-to-date and readable, to cause
>> the in-memory state to be up-to-date and writable (which is the same
>> thing + TIF_FPU_LOAD + whatever other bookkeeping), and causing the
>> in-CPU state to be up-to-date (possibly readable and writable).
>> TIF_LOAD_FPU would trigger the latter.
>>
>> I've often found it confusing that fpu__save by itself has somewhat
>> ill-defined effects.
>
> Fully agreed on both. I guess that means we want a few
> different functions:
>
> 1) initialize FPU state, both in memory and in registers
>
> 2) cause FPU state in registers to be updated from memory,
>    if necessary
>
> 3) cause FPU state in memory to be updated from registers,
>    if necessary
>
> The latter functions could use fpu->fpregs_active to see
> whether any action is required, and be called from various
> places in the code.
>
> The signal code in particular is doing some strange things
> that should probably be hidden away in FPU specific functions.
>
>>
>> >
>> > +/*
>> > + * Set up the userspace FPU context before returning to userspace.
>> > + */
>> > +void switch_fpu_return(void)
>> > +{
>> > +       struct fpu *fpu = &current->thread.fpu;
>> > +       bool preload;
>> > +       /*
>> > +        * If the task has used the math, pre-load the FPU on xsave
>> > processors
>> > +        * or if the past 5 consecutive context-switches used math.
>> > +        */
>> > +       preload = static_cpu_has(X86_FEATURE_FPU) &&
>> > +                 fpu->fpstate_active &&
>> > +                 (use_eager_fpu() || fpu->counter > 5);
>> > +
>> > +       if (preload) {
>> > +               prefetch(&fpu->state);
>> > +               fpu->counter++;
>> > +               __fpregs_activate(fpu);
>> > +               trace_x86_fpu_regs_activated(fpu);
>> > +
>> > +               /* Don't change CR0.TS if we just switch! */
>> > +               if (!__this_cpu_read(fpu_active)) {
>> > +                       __fpregs_activate_hw();
>> > +                       __this_cpu_write(fpu_active, true);
>> > +               }
>>
>> We should just finish getting rid of all TS uses.
>
> Agreed. I will rebase on top of your FPU changes, that
> will make things easier for everybody.
>
> --
> All Rights Reversed.


-- 
Andy Lutomirski
AMA Capital Management, LLC