From: Andy Lutomirski
Date: Sat, 1 Oct 2016 16:44:19 -0700
Subject: Re: [PATCH RFC 2/5] x86,fpu: delay FPU register loading until switch to userspace
To: Rik van Riel
Cc: "linux-kernel@vger.kernel.org", X86 ML, Thomas Gleixner,
	Paolo Bonzini, Ingo Molnar, Andrew Lutomirski, "H. Peter Anvin",
	Dave Hansen, Borislav Petkov
In-Reply-To: <1475353895-22175-3-git-send-email-riel@redhat.com>
References: <1475353895-22175-1-git-send-email-riel@redhat.com>
	<1475353895-22175-3-git-send-email-riel@redhat.com>

On Sat, Oct 1, 2016 at 1:31 PM,  <riel@redhat.com> wrote:
> From: Rik van Riel
>
> Delay the loading of FPU registers until a process switches back to
> userspace. This allows us to skip FPU saving & restoring for kernel
> threads, the idle task, and tasks that are spinning in kernel space.
>
> It also allows us to not repeatedly save & restore the userspace FPU
> context on repeated invocations of kernel_fpu_start & kernel_fpu_end.
>
> Not overwriting the FPU state of a task unless we need to also allows
> us to be lazier about restoring it, in a future patch.
>
> Signed-off-by: Rik van Riel
> ---
>  arch/x86/entry/common.c             |  4 ++++
>  arch/x86/include/asm/fpu/api.h      |  5 +++++
>  arch/x86/include/asm/fpu/internal.h | 44 +++++++++----------------------------
>  arch/x86/include/asm/thread_info.h  |  4 +++-
>  arch/x86/kernel/fpu/core.c          | 17 ++++++++------
>  arch/x86/kernel/process.c           | 35 +++++++++++++++++++++++++++++
>  arch/x86/kernel/process_32.c        |  5 ++---
>  arch/x86/kernel/process_64.c        |  5 ++---
>  8 files changed, 71 insertions(+), 48 deletions(-)
>
> diff --git a/arch/x86/entry/common.c b/arch/x86/entry/common.c
> index 1433f6b4607d..a69bbefa3408 100644
> --- a/arch/x86/entry/common.c
> +++ b/arch/x86/entry/common.c
> @@ -27,6 +27,7 @@
>  #include
>  #include
>  #include
> +#include
>
>  #define CREATE_TRACE_POINTS
>  #include
>
> @@ -197,6 +198,9 @@ __visible inline void prepare_exit_to_usermode(struct pt_regs *regs)
>  	if (unlikely(cached_flags & EXIT_TO_USERMODE_LOOP_FLAGS))
>  		exit_to_usermode_loop(regs, cached_flags);
>
> +	if (unlikely(test_and_clear_thread_flag(TIF_LOAD_FPU)))
> +		switch_fpu_return();
> +

How about:

	if (unlikely(...)) {
		exit_to_usermode_loop(regs, cached_flags);
		cached_flags = READ_ONCE(ti->flags);
	}

	if (ti->flags & _TIF_LOAD_FPU) {
		clear_thread_flag(TIF_LOAD_FPU);
		switch_fpu_return();
	}

or something along those lines.

The issue is that test_and_clear_thread_flag is unconditionally
atomic, which means that it will slow down the common fast case by
quite a bit.

Alternatively, you could try to jam it into thread_struct::status,
which would let you avoid an atomic operation when you clear it, but
you'd still need an atomic op to set it, so this might not be a win.
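Something along these lines, perhaps -- completely untested, and
TS_LOAD_FPU and its exact location are invented for the example, not
existing interfaces:

	/* hypothetical bit in a (mostly) task-private status word */
	#define TS_LOAD_FPU	0x0010

	/* exit-to-usermode path: plain, non-atomic clear */
	if (unlikely(current->thread.status & TS_LOAD_FPU)) {
		current->thread.status &= ~TS_LOAD_FPU;
		switch_fpu_return();
	}

The clear side then avoids the LOCK'd read-modify-write; the set side
(done at context-switch time in this patch) is where it probably still
has to be atomic, which is why it may not end up being a win.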
> +static inline void switch_fpu_finish(void)
>  {
> -	bool preload;
> -	/*
> -	 * If the task has used the math, pre-load the FPU on xsave processors
> -	 * or if the past 5 consecutive context-switches used math.
> -	 */
> -	preload = static_cpu_has(X86_FEATURE_FPU) &&
> -		  new_fpu->fpstate_active &&
> -		  (use_eager_fpu() || new_fpu->counter > 5);
> -
> -	if (preload) {
> -		prefetch(&new_fpu->state);
> -		new_fpu->counter++;
> -		__fpregs_activate(new_fpu);
> -		trace_x86_fpu_regs_activated(new_fpu);
> -
> -		/* Don't change CR0.TS if we just switch! */
> -		if (!__this_cpu_read(fpu_active)) {
> -			__fpregs_activate_hw();
> -			__this_cpu_write(fpu_active, true);
> -		}
> -
> -		copy_kernel_to_fpregs(&new_fpu->state);
> -	} else if (__this_cpu_read(fpu_active)) {
> -		__this_cpu_write(fpu_active, false);
> -		__fpregs_deactivate_hw();
> -	}
> +	set_thread_flag(TIF_LOAD_FPU);
>  }

I can imagine this causing problems with kernel code that accesses
current's FPU state, e.g. get_xsave_field_ptr().

I wonder if it would make sense to push your changes deeper into the
FPU core innards, so that, for example, we'd have explicit functions
that cause the in-memory state for current to be up-to-date and
readable, that cause the in-memory state to be up-to-date and writable
(which is the same thing, plus TIF_LOAD_FPU and whatever other
bookkeeping is needed), and that cause the in-CPU state to be
up-to-date (possibly readable and writable).  TIF_LOAD_FPU would
trigger the latter.

I've often found it confusing that fpu__save by itself has somewhat
ill-defined effects.

> +/*
> + * Set up the userspace FPU context before returning to userspace.
> + */
> +void switch_fpu_return(void)
> +{
> +	struct fpu *fpu = &current->thread.fpu;
> +	bool preload;
> +	/*
> +	 * If the task has used the math, pre-load the FPU on xsave processors
> +	 * or if the past 5 consecutive context-switches used math.
> +	 */
> +	preload = static_cpu_has(X86_FEATURE_FPU) &&
> +		  fpu->fpstate_active &&
> +		  (use_eager_fpu() || fpu->counter > 5);
> +
> +	if (preload) {
> +		prefetch(&fpu->state);
> +		fpu->counter++;
> +		__fpregs_activate(fpu);
> +		trace_x86_fpu_regs_activated(fpu);
> +
> +		/* Don't change CR0.TS if we just switch! */
> +		if (!__this_cpu_read(fpu_active)) {
> +			__fpregs_activate_hw();
> +			__this_cpu_write(fpu_active, true);
> +		}

We should just finish getting rid of all TS uses.
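(Going back to the "explicit functions" point above, here is roughly
the shape of API I have in mind.  Just a sketch -- all of the names
are invented, nothing like this exists today:)

	/* make current's in-memory fpstate valid for reading */
	void fpu__sync_fpstate_for_read(struct fpu *fpu);

	/*
	 * same, plus mark the registers as needing a reload before we
	 * return to userspace (TIF_LOAD_FPU + any other bookkeeping),
	 * so the in-memory copy can be modified safely
	 */
	void fpu__sync_fpstate_for_write(struct fpu *fpu);

	/*
	 * make the in-CPU register state current again; this is what
	 * TIF_LOAD_FPU would trigger on the way out to userspace
	 */
	void fpu__load_fpregs(struct fpu *fpu);

Then things like get_xsave_field_ptr() could say explicitly which of
those they need instead of implicitly depending on where the state
happens to live.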