Subject: Re: [PATCH RFC 2/5] x86,fpu: delay FPU register loading until switch to userspace
From: Rik van Riel
To: Andy Lutomirski, Dave Hansen
Cc: linux-kernel@vger.kernel.org, X86 ML, Thomas Gleixner, Paolo Bonzini,
    Ingo Molnar, Andrew Lutomirski, "H. Peter Anvin", Borislav Petkov
Date: Mon, 03 Oct 2016 17:21:19 -0400

On Mon, 2016-10-03 at 13:54 -0700, Andy Lutomirski wrote:
> On Sat, Oct 1, 2016 at 5:08 PM, Rik van Riel wrote:
> >
> > On Sat, 2016-10-01 at 16:44 -0700, Andy Lutomirski wrote:
> > >
> > > On Sat, Oct 1, 2016 at 1:31 PM,  wrote:
> > > >
> > > >
> > > >  #define CREATE_TRACE_POINTS
> > > >  #include
> > > > @@ -197,6 +198,9 @@ __visible inline void
> > > > prepare_exit_to_usermode(struct pt_regs *regs)
> > > >         if (unlikely(cached_flags &
> > > > EXIT_TO_USERMODE_LOOP_FLAGS))
> > > >                 exit_to_usermode_loop(regs, cached_flags);
> > > >
> > > > +       if (unlikely(test_and_clear_thread_flag(TIF_LOAD_FPU)))
> > > > +               switch_fpu_return();
> > > > +
> > > How about:
> > >
> > > if (unlikely(...)) {
> > >   exit_to_usermode_loop(regs, cached_flags);
> > >   cached_flags = READ_ONCE(ti->flags);
> > > }
> > >
> > > if (ti->flags & _TIF_LOAD_FPU) {
> > >   clear_thread_flag(TIF_LOAD_FPU);
> > >   switch_fpu_return();
> > > }
> > It looks like test_thread_flag should be fine for avoiding
> > atomics, too?
> >
> >   if (unlikely(test_thread_flag(TIF_LOAD_FPU))) {
> >       clear_thread_flag(TIF_LOAD_FPU);
> >       switch_fpu_return();
> >   }
> True, but we've got that cached_flags variable already...

The cached flags may no longer be valid after exit_to_usermode_loop()
returns, because that code is preemptible. We have to reload a fresh
ti->flags after preemption is disabled and irqs are off.

> One of my objections to the current code structure is that I find it
> quite hard to keep track of all of the possible states of the FPU
> that we have.  Off the top of my head, we have:
>
> 1. In-cpu state is valid and in-memory state is invalid.  (This is
> the case when we're running user code, for example.)
> 2. In-memory state is valid.
> 2a. In-cpu state is *also* valid for some particular task.  In this
> state, no one may write to either FPU state copy for that task lest
> they get out of sync.
> 2b. In-cpu state is not valid.

It gets more fun with ptrace. Look at xfpregs_get and xfpregs_set,
which call fpu__activate_fpstate_read and fpu__activate_fpstate_write
respectively.
It looks like those functions already deal with the cases you
outline above.

> Rik, your patches further complicate this by making it possible to
> have the in-cpu state be valid for a non-current task even when
> non-idle.

However, we never need to examine the in-cpu state for a task that is
not currently running, or about to be scheduled in by the context
switch code.

We always save the FPU register state when a task is context switched
out, so for a non-current task we still only need to consider whether
the in-memory state is valid or not. The in-cpu state does not matter.

fpu__activate_fpstate_write will disable lazy restore, and ensure a
full restore from memory.

>  This is why I'm thinking that we should maybe revisit the
> API a bit to make this clearer and to avoid scattering around all the
> state manipulation quite so much.  I propose, as a straw-man:
>
> user_fpregs_copy_to_cpu() -- Makes the CPU state be valid for
> current.  After calling this, we're in state 1 or 2a.
>
> user_fpregs_move_to_cpu() -- Makes the CPU state be valid and
> writable for current.  Invalidates any in-memory copy.  After
> calling this, we're in state 1.

How can we distinguish between these two? If the task is running, it
can change its FPU registers at any time, with nothing to mark the
transition from state 2a to state 1.

It sounds like the first function you name would only be useful if we
prevented a switch to user space, which could otherwise immediately do
whatever it wants with the registers.

Is there a use case for user_fpregs_copy_to_cpu()?

> user_fpregs_copy_to_memory() -- Makes the in-memory state be valid
> for current.  After calling this, we're in state 2a or 2b.
>
> user_fpregs_move_to_memory() -- Makes the in-memory state be valid
> and writable for current.  After calling this, we're in state 2b.
We can only ensure that the in-memory copy stays valid by stopping
the task from writing to the in-CPU copy.

This sounds a lot like fpu__activate_fpstate_read.

What we need after copying the contents to memory is a way to
invalidate the in-CPU copy if changes were made to the in-memory
copy. This is already done by fpu__activate_fpstate_write.

> Scheduling out calls user_fpregs_copy_to_memory(), as do all the MPX
> and PKRU memory state readers.  MPX and PKRU will do
> user_fpregs_move_to_memory() before writing to memory.
> kernel_fpu_begin() does user_fpregs_move_to_memory().

This is done by switch_fpu_prepare.

Can you think of any other place in the kernel where we need a
1 -> 2a transition, besides context switching and ptrace/xfpregs_get?

> There are a couple ways we could deal with TIF_LOAD_FPU in this
> scheme.  My instinct would be to rename it TIF_REFRESH_FPU, set it
> unconditionally on schedule-in *and* on user_fpregs_xxx_to_memory(),
> and make the handler do user_fpregs_move_to_cpu().

I can see where you are coming from, but given that there is no way
to run a task besides context switching to it, I am not sure why we
would need to do anything besides disabling the lazy restore
(indicating that the in-CPU state is stale) anywhere else.

If user_fpregs_move_to_cpu determines that the in-register state is
still valid, it can skip the restore. This is what switch_fpu_finish
does after my first patch, and switch_fpu_prepare later in my series.
-- 
All rights reversed