Date: Mon, 23 Feb 2015 21:17:42 +0000 (GMT)
From: "Maciej W. Rozycki"
To: Andy Lutomirski
Cc: Borislav Petkov, Ingo Molnar, Oleg Nesterov, Rik van Riel, X86 ML,
    "linux-kernel@vger.kernel.org", Linus Torvalds
Subject: Re: [RFC PATCH] x86, fpu: Use eagerfpu by default on all CPUs

On Sat, 21 Feb 2015, Andy Lutomirski wrote:

> > Additionally I believe long-executing FPU instructions (i.e.
> > transcendentals) can take advantage of continuing to execute in
> > parallel where the context has already been switched rather than
> > stalling an eager FPU context switch until the FPU instruction has
> > completed.
>
> It seems highly unlikely to me that a slow FPU instruction can retire
> *after* a subsequent fxsave, which would need to happen for this to
> work.

I meant something else -- a slow FPU instruction can retire after a task
has been switched where the FP context has been left intact, i.e. in the
lazy FP context switching case, where only the MMU context and GPRs have
been replaced.  Whereas in the eager FP context switching case you can
get through to FXSAVE while a slow FPU instruction hasn't completed yet
(e.g. one started just as preemption was about to happen).  Obviously
that FXSAVE will have to stall until the FPU instruction has completed
(IIRC the i486 aborted transcendental instructions on any
exceptions/interrupts instead, leading to the risk of process starvation
in heavily interrupt-loaded systems, but I also believe that was fixed
as of the Pentium).

Though if, as you say, the mere taking of a trap/interrupt gate can take
hundreds of cycles, perhaps indeed no FPU instruction will execute
*that* long on modern silicon.

> > And last but not least, why does the handling of CR0.TS traps have
> > to be complicated?  It does not look like rocket science to me, it
> > should be a mere handful of instructions, the time required to move
> > the two FP contexts out from and in to the FPU respectively should
> > dominate processing time.  Where quoted the optimisation manual
> > states 250 cycles for FXSAVE and FXRSTOR combined.
>
> The TS traps aren't complicated -- they're just really slow.  I think
> that each of setting *and* clearing TS serializes and takes well over
> a hundred cycles.  A #NM interrupt (the thing that happens if you try
> to use the FPU with TS set) serializes and does all kinds of slow
> things, so it takes many hundreds of cycles.  The handler needs to
> clear TS (another hundred cycles or more), load the FPU state
> (actually rather fast on modern CPUs), and then IRET back to userspace
> (hundreds of cycles).  This adds up to a lot of cycles.  A round trip
> through an exception handler seems to be thousands of cycles.

That sucks wet goat farts! :(  I have to admit I have moved a bit away
from the x86 world and didn't realise things had become so bad.

Some 10 years ago taking a trap or interrupt gate would need some 30
cycles (of course task gates are unusable for anything that does not
absolutely require them, such as #DF; their usability for anything real
ended with the 80286 or suchlike).  Similarly an IRET to reverse the
actions taken.  That was already rather bad, but understandable: after
all, the CPU had to read the gate descriptor, access the TSS, switch
both the CS and SS descriptors, etc.

What I don't understand is why CLTS, a dedicated instruction that
avoids the need to access the whole of CR0 (which again can
understandably be costly, because of the grouping of all the important
bits there), has to be so slow.  It flips a single bit down there and
does not need to serialise anything, as any instruction down the
pipeline it could affect would trigger a #NM anyway!  And there's an
IRET somewhere on the way too, before the instruction that originally
triggered the fault is reexecuted.  And why the heck, over all these
years, hasn't a mechanism similar to SYSENTER and its bunch of
complementing MSRs been invented for the common exceptions, to avoid
all this gate-descriptor dance!
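For reference, the whole lazy path under discussion amounts to roughly
the following (a compile-only, ring-0 sketch; `struct task', `fpu_owner'
and `handle_nm' are illustrative stand-ins, not the actual Linux
identifiers):

struct fpstate {
	unsigned char buf[512] __attribute__((aligned(16)));
};

struct task {
	struct fpstate fp;
};

static struct task *fpu_owner;	/* task whose state sits in the FPU */

/* Context switch: leave the FPU untouched, just arm the #NM trap. */
static inline void stts(void)
{
	unsigned long cr0;

	asm volatile("mov %%cr0, %0" : "=r" (cr0));
	asm volatile("mov %0, %%cr0" : : "r" (cr0 | 8));  /* set CR0.TS */
}

/* #NM handler body: the FP context only moves when the FPU is used. */
void handle_nm(struct task *current)
{
	asm volatile("clts");	/* the 100+ cycle CLTS discussed above */
	if (fpu_owner != current) {
		if (fpu_owner)
			asm volatile("fxsave %0" : "=m" (fpu_owner->fp));
		asm volatile("fxrstor %0" : : "m" (current->fp));
		fpu_owner = current;
	}
	/* the IRET back then reexecutes the FP instruction that
	   trapped, this time with TS clear */
}

Per the figures you quote it's the trap entry, CLTS and IRET around
this handful of instructions that dominate, not the FXSAVE/FXRSTOR
pair.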
> > And of course you can install the right handler (i.e. FSAVE vs
> > FXSAVE) at bootstrap depending on processor features, you don't
> > have to do all the run-time check on every trap.  You can even
> > optimise the FSAVE handler away at the build time if you know it
> > won't ever be used based on the minimal supported processor family
> > selected.
>
> None of this matters.  A couple of branches in an #NM handler are
> completely lost in the noise.

Agreed, given what you state, completely understood.

> > Do you happen to know or can determine how much time (in clock
> > cycles) a CR0.TS trap itself takes, including any time required to
> > preserve the execution state in the handler such as pushing/popping
> > GPRs to/from the stack (as opposed to processing time spent on
> > moving the FP contexts back and forth)?  Is there no room for
> > improvement left there?  How many task scheduling slots say per
> > million must be there poking at the FPU for eager FPU context
> > switching to take advantage over lazy one?
>
> Thousands of cycles.  Considerably worse in an SVM guest.  x86
> exception handling sucks.

I must have been spoilt by the MIPS exception handling.  Taking an
exception on a MIPS processor is virtually instantaneous, just like
retiring another instruction.  Of course there's a cost equivalent to a
branch misprediction: you need to invalidate the whole pipeline.  So
depending on how many stages you have there, you can expect a latency
of say 3-7 clocks.

Granted, on a MIPS processor taking an exception does not change much
-- it switches into kernel mode (one bit set in a control register, a
special kernel-mode-override bit dedicated to exception handling),
saves the old PC (another control register updated, called the
Exception PC or EPC) and loads the PC with the exception vector.  All
the rest is left to the kernel.  Which is good!

The same goes for ERET, the exception return instruction -- it merely
loads the PC back from EPC and clears the kernel-mode-override bit in
the other control register.  More recently it also serves the purpose
of an instruction hazard barrier, which you'd call serialisation, the
strongest kind provided in the MIPS architecture (in older architecture
revisions you had to take care of any hazards caused by preceding
instructions that could affect user-mode execution by inserting the
right number of NOPs before ERET, possibly taking other instructions
already executed since the origin of the hazard into account).  So
rather than 3-7 clocks that could be 20 or so, though usually many
fewer.

A while ago I cooperated with the hardware team in adding an extra
instruction to the architecture under the assumption that it would be
emulated on legacy hardware, by taking the RI or Reserved Instruction
exception (the equivalent of x86's #UD) and doing the rest there.
Another assumption was that a fast path would be taken for this single
instruction and all the handling done in assembly, without even
reaching the usual C-language RI handlers that we've accumulated over
the years.  Mind that on MIPS processors exceptions first have to be
decoded and dispatched to individual handlers; it's not that individual
exception classes have individual vectors as on x86 -- there's only
one!  And you need to update EPC too or you'd be trapping back.
Finally the instruction itself had to be decoded, so instruction memory
had to be read and compared against the pattern expected.  To make a
long story short, I was able to squeeze all the handling into some 30
cycles, with a slight variation across different processor
implementations.
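In rough C terms (an illustrative sketch only -- the actual fast path
is hand-written assembly, and the CP0 accessors and handler names here
are made up for the example rather than a real API) that handling looks
like:

#include <stdint.h>

#define EXCCODE(cause)	(((cause) >> 2) & 0x1f)	/* Cause.ExcCode field */
#define EXC_RI		10			/* Reserved Instruction */

extern uint32_t read_c0_cause(void);		/* CP0 Cause register */
extern uintptr_t read_c0_epc(void);		/* CP0 EPC register */
extern void write_c0_epc(uintptr_t epc);
extern void slow_path(uint32_t exccode);	/* the usual C handlers */
extern int emulate_new_insn(uint32_t insn);	/* the added instruction */

void exception_dispatch(void)
{
	uint32_t cause = read_c0_cause();

	/* One vector only: decode and dispatch in software.  (A real
	   handler would also have to mind Cause.BD, i.e. a faulting
	   instruction sitting in a branch delay slot.) */
	if (EXCCODE(cause) == EXC_RI) {
		/* Fetch and decode the faulting instruction at EPC. */
		uintptr_t epc = read_c0_epc();
		uint32_t insn = *(uint32_t *)epc;

		if (emulate_new_insn(insn)) {
			/* Step over it, or ERET would trap back. */
			write_c0_epc(epc + 4);
			return;		/* ERET on the way out */
		}
	}
	slow_path(EXCCODE(cause));	/* everything else, in C */
}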
How much different!  Oh well, some further benchmarking is still
needed, but given the circumstances I suppose the good old design will
have to go after all, sigh...

Thanks for your input!

  Maciej