From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752355AbcJCVga (ORCPT ); Mon, 3 Oct 2016 17:36:30 -0400 Received: from mail-vk0-f43.google.com ([209.85.213.43]:34000 "EHLO mail-vk0-f43.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751469AbcJCVgY (ORCPT ); Mon, 3 Oct 2016 17:36:24 -0400 MIME-Version: 1.0 In-Reply-To: <1475529679.4622.27.camel@redhat.com> References: <1475353895-22175-1-git-send-email-riel@redhat.com> <1475353895-22175-3-git-send-email-riel@redhat.com> <1475366900.21644.8.camel@redhat.com> <1475529679.4622.27.camel@redhat.com> From: Andy Lutomirski Date: Mon, 3 Oct 2016 14:36:02 -0700 Message-ID: Subject: Re: [PATCH RFC 2/5] x86,fpu: delay FPU register loading until switch to userspace To: Rik van Riel Cc: Dave Hansen , "linux-kernel@vger.kernel.org" , X86 ML , Thomas Gleixner , Paolo Bonzini , Ingo Molnar , Andrew Lutomirski , "H. Peter Anvin" , Borislav Petkov Content-Type: text/plain; charset=UTF-8 Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Mon, Oct 3, 2016 at 2:21 PM, Rik van Riel wrote: > On Mon, 2016-10-03 at 13:54 -0700, Andy Lutomirski wrote: >> On Sat, Oct 1, 2016 at 5:08 PM, Rik van Riel wrote: >> > >> > On Sat, 2016-10-01 at 16:44 -0700, Andy Lutomirski wrote: >> > > >> > > On Sat, Oct 1, 2016 at 1:31 PM, wrote: >> > > > >> > > > >> > > > #define CREATE_TRACE_POINTS >> > > > #include >> > > > @@ -197,6 +198,9 @@ __visible inline void >> > > > prepare_exit_to_usermode(struct pt_regs *regs) >> > > > if (unlikely(cached_flags & >> > > > EXIT_TO_USERMODE_LOOP_FLAGS)) >> > > > exit_to_usermode_loop(regs, cached_flags); >> > > > >> > > > + if (unlikely(test_and_clear_thread_flag(TIF_LOAD_FPU))) >> > > > + switch_fpu_return(); >> > > > + >> > > How about: >> > > >> > > if (unlikely(...)) { >> > > exit_to_usermode_loop(regs, cached_flags); >> > > cached_flags = READ_ONCE(ti->flags); Like this? >> > > } >> > > >> > > if (ti->flags & _TIF_LOAD_FPU) { >> > > clear_thread_flag(TIF_LOAD_FPU); >> > > switch_fpu_return(); >> > > } >> > It looks like test_thread_flag should be fine for avoiding >> > atomics, too? >> > >> > if (unlikely(test_thread_flag(TIF_LOAD_FPU))) { >> > clear_thread_flag(TIF_LOAD_FPU); >> > switch_fpu_return(); >> > } >> True, but we've got that cached_flags variable already... > > The cached flag may no longer be valid after > exit_to_usermode_loop() returns, because that > code is preemptible. > > We have to reload a fresh ti->flags after > preemption is disabled and irqs are off. See above... > >> One of my objections to the current code structure is that I find it >> quite hard to keep track of all of the possible states of the fpu >> that >> we have. Off the top of my head, we have: >> >> 1. In-cpu state is valid and in-memory state is invalid. (This is >> the >> case when we're running user code, for example.) >> 2. In-memory state is valid. >> 2a. In-cpu state is *also* valid for some particular task. In this >> state, no one may write to either FPU state copy for that task lest >> they get out of sync. >> 2b. In-cpu state is not valid. > > It gets more fun with ptrace. Look at xfpregs_get > and xfpregs_set, which call fpu__activate_fpstate_read > and fpu__activate_fpstate_write respectively. > > It looks like those functions already deal with the > cases you outline above. Hmm, maybe they could be used. The names are IMO atrocious. I had to read the docs and the code to figure out WTF "activating" the "fpstate" even meant. > >> Rik, your patches further complicate this by making it possible to >> have the in-cpu state be valid for a non-current task even when >> non-idle. > > However, we never need to examine the in-cpu state for > a task that is not currently running, or being scheduled > to in the context switch code. > > We always save the FPU register state when a task is > context switched out, so for a non-current task, we > still only need to consider whether the in-memory state > is valid or not. > > The in-cpu state does not matter. > > fpu__activate_fpstate_write will disable lazy restore, > and ensure a full restore from memory. Something has to fix up owner_ctx so that the previous thread doesn't think its CPU state is still valid. I don't think activate_fpstate_write does it, because nothing should (I think) be calling it as a mattor of course on context switch. > >> This is why I'm thinking that we should maybe revisit the >> API a bit to make this clearer and to avoid scattering around all the >> state manipulation quite so much. I propose, as a straw-man: >> >> user_fpregs_copy_to_cpu() -- Makes the CPU state be valid for >> current. >> After calling this, we're in state 1 or 2a. >> >> user_fpregs_move_to_cpu() -- Makes the CPU state be valid and >> writable >> for current. Invalidates any in-memory copy. After calling this, >> we're in state 1. > > How can we distinguish between these two? > > If the task is running, it can change its FPU > registers at any time, without any obvious thing > to mark the transition from state 2a to state 1. That's true now, but I don't think it's really true any more with your patches. If a task is running in *kernel* mode, then I think that pretty much all of the combinations are possible. When the task actually enters user mode, we have to make sure we're in state 1 so the CPU regs belong to the task and are writable. > > It sounds like the first function you name would > only be useful if we prevented a switch to user > space, which could immediately do whatever it wants > with the registers. > > Is there a use case for user_fpregs_copy_to_cpu()? With PKRU, any user memory access would need user_fpregs_copy_to_cpu(). I think we should get rid of this and just switch to using RDPKRU and WRPKRU to eagerly update PKRU in every context switch to avoid this. Other than that, I can't think of any use cases off the top of my head unless the signal code wanted to play games with this stuff. > >> user_fpregs_copy_to_memory() -- Makes the in-memory state be valid >> for >> current. After calling this, we're in state 2a or 2b. >> >> user_fpregs_move_to_memory() -- Makes the in-memory state be valid >> and >> writable for current. After calling this, we're in state 2b. > > We can only ensure that the in memory copy stays > valid, by stopping the task from writing to the > in-CPU copy. > > This sounds a lot like fpu__activate_fpstate_read > > What we need after copying the contents to memory is > a way to invalidate the in-CPU copy, if changes were > made to the in-memory copy. My proposal is that we invalidate the CPU state *before* writing the in-memory state, not after. > > This is already done by fpu__activate_fpstate_write > >> Scheduling out calls user_fpregs_copy_to_memory(), as do all the MPX >> and PKRU memory state readers. MPX and PKRU will do >> user_fpregs_move_to_memory() before writing to memory. >> kernel_fpu_begin() does user_fpregs_move_to_memory(). > > This is done by switch_fpu_prepare. > > Can you think of any other place in the kernel where we > need a 1 -> 2a transition, besides context switching and > ptrace/xfpregs_get? Anything else that tries to read task xstate from memory, i.e. MPX and PKRU. (Although if we switch to eager-switched PKRU, then PKRU stops mattering for this purpose.) Actually, I don't see any way your patches can be compatible with PKRU without switching to eager-switched PKRU. > >> There are a couple ways we could deal with TIF_LOAD_FPU in this >> scheme. My instinct would be to rename it TIF_REFRESH_FPU, set it >> unconditionally on schedule-in *and* on user_fpregs_xxx_to_memory(), >> and make the handler do user_fpregs_move_to_cpu(). > > I can see where you are coming from, but given that > there is no way to run a task besides context switching > to it, I am not sure why we would need to do anything > besides disabling the lazy restore (indicating that the > in-CPU state is stale) anywhere else? > > If user_fpregs_move_to_cpu determines that the in-register > state is still valid, it can skip the restore. > > This is what switch_fpu_finish does after my first patch, > and switch_fpu_prepare later in my series. I think my main objection is that there are too many functions like this outside the fpu/ directory that all interact in complicated ways. I think that activate_fpststate_read/write are along the right lines, but we should start using them and their friends everywhere.