From: Andy Lutomirski
Date: Sat, 21 Feb 2015 18:18:01 -0800
Subject: Re: [RFC PATCH] x86, fpu: Use eagerfpu by default on all CPUs
To: "Maciej W. Rozycki"
Cc: Borislav Petkov, Ingo Molnar, Oleg Nesterov, Rik van Riel, X86 ML,
 "linux-kernel@vger.kernel.org", Linus Torvalds

On Sat, Feb 21, 2015 at 4:34 PM, Maciej W. Rozycki wrote:
> On Sat, 21 Feb 2015, Borislav Petkov wrote:
>
>> Provided I've not made a mistake, this leads me to think that this
>> simple workload and pretty much everything else uses the FPU through
>> glibc which does the SSE memcpy and so on. Which basically kills the
>> whole idea behind lazy FPU as practically you don't really encounter
>> workloads nowadays which don't use the FPU thanks to glibc and the
>> lazy strategy doesn't really bring anything.
>>
>> Which would then mean, we don't really need the lazy handling as
>> userspace is making it eager, so to speak, for us.
>
> Please correct me if I'm wrong, but it looks to me like you're
> confusing lazy FPU context allocation and lazy FPU context switching.
> These build on the same hardware principles, but they are different
> concepts.
> Your "userspace is making it eager" statement in the context of glibc
> using SSE for `memcpy' is certainly true for lazy FPU context
> allocation, however I wouldn't be so sure about lazy FPU context
> switching, and a kernel compilation (or in fact any compilation) does
> not appear to be a representative benchmark to me.  I am sure lots of
> software won't be calling `memcpy' all the time, so there should be
> context switches between which the FPU is not referred to at all.

That's true.  The question is whether there are enough of them, and
whether twiddling TS is cheap enough, for lazy switching to be worth
it.

> Additionally I believe long-executing FPU instructions (i.e.
> transcendentals) can take advantage of continuing to execute in
> parallel where the context has already been switched rather than
> stalling an eager FPU context switch until the FPU instruction has
> completed.

It seems highly unlikely to me that a slow FPU instruction can retire
*after* a subsequent fxsave, which would need to happen for this to
work.

> And last but not least, why does the handling of CR0.TS traps have to
> be complicated?  It does not look like rocket science to me, it
> should be a mere handful of instructions, and the time required to
> move the two FP contexts out from and in to the FPU respectively
> should dominate processing time.  Where quoted, the optimisation
> manual states 250 cycles for FXSAVE and FXRSTOR combined.

The TS traps aren't complicated -- they're just really slow.  I think
that each of setting *and* clearing TS serializes and takes well over
a hundred cycles.  A #NM exception (the thing that happens if you try
to use the FPU with TS set) serializes and does all kinds of slow
things, so it takes many hundreds of cycles.  The handler needs to
clear TS (another hundred cycles or more), load the FPU state
(actually rather fast on modern CPUs), and then IRET back to userspace
(hundreds of cycles).  This adds up to a lot of cycles.
A round trip through an exception handler seems to be thousands of
cycles.

> And of course you can install the right handler (i.e. FSAVE vs
> FXSAVE) at bootstrap depending on processor features, so you don't
> have to do all the run-time checks on every trap.  You can even
> optimise the FSAVE handler away at build time if you know it won't
> ever be used, based on the minimal supported processor family
> selected.

None of this matters.  A couple of branches in an #NM handler are
completely lost in the noise.

> Do you happen to know, or can you determine, how much time (in clock
> cycles) a CR0.TS trap itself takes, including any time required to
> preserve the execution state in the handler, such as pushing/popping
> GPRs to/from the stack (as opposed to processing time spent on moving
> the FP contexts back and forth)?  Is there no room for improvement
> left there?  How many task scheduling slots, say per million, must be
> poking at the FPU for eager FPU context switching to take advantage
> over the lazy one?

Thousands of cycles.  Considerably worse in an SVM guest.  x86
exception handling sucks.

--Andy