Date: Mon, 23 Feb 2015 21:17:42 +0000 (GMT)
From: "Maciej W. Rozycki"
To: Andy Lutomirski
Cc: Borislav Petkov, Ingo Molnar, Oleg Nesterov, Rik van Riel, X86 ML,
    "linux-kernel@vger.kernel.org", Linus Torvalds
Subject: Re: [RFC PATCH] x86, fpu: Use eagerfpu by default on all CPUs

On Sat, 21 Feb 2015, Andy Lutomirski wrote:

> > Additionally I believe long-executing FPU instructions (i.e.
> > transcendentals) can take advantage of continuing to execute in
> > parallel where the context has already been switched rather than
> > stalling an eager FPU context switch until the FPU instruction has
> > completed.
>
> It seems highly unlikely to me that a slow FPU instruction can retire
> *after* a subsequent fxsave, which would need to happen for this to
> work.

I meant something else -- a slow FPU instruction can retire after a task
has been switched where the FP context has been left intact, i.e. in the
lazy FP context switching case, where only the MMU context and GPRs have
been replaced.  Whereas in the eager FP context switching case you can
get through to FXSAVE while a slow FPU instruction hasn't completed yet
(e.g. one started just as preemption was about to happen).  Obviously
that FXSAVE will have to stall until the FPU instruction has completed
(IIRC the i486 aborted transcendental instructions on any
exceptions/interrupts instead, leading to the risk of process starvation
in heavily interrupt-loaded systems, but I also believe that was fixed
as of the Pentium).

Though if, as you say, the mere taking of a trap/interrupt gate can take
hundreds of cycles, perhaps indeed no FPU instruction will execute
*that* long on modern silicon.

> > And last but not least, why does the handling of CR0.TS traps have
> > to be complicated?  It does not look like rocket science to me, it
> > should be a mere handful of instructions, the time required to move
> > the two FP contexts out from and in to the FPU respectively should
> > dominate processing time.  Where quoted the optimisation manual
> > states 250 cycles for FXSAVE and FXRSTOR combined.
>
> The TS traps aren't complicated -- they're just really slow.  I think
> that each of setting *and* clearing TS serializes and takes well over
> a hundred cycles.  A #NM interrupt (the thing that happens if you try
> to use the FPU with TS set) serializes and does all kinds of slow
> things, so it takes many hundreds of cycles.  The handler needs to
> clear TS (another hundred cycles or more), load the FPU state
> (actually rather fast on modern CPUs), and then IRET back to userspace
> (hundreds of cycles).  This adds up to a lot of cycles.  A round trip
> through an exception handler seems to be thousands of cycles.

That sucks wet goat farts! :(  I have to admit I have moved a bit away
from the x86 world and didn't realise things had become so bad.

Some 10 years ago taking a trap or interrupt gate would need some 30
cycles (of course task gates are unusable for anything that does not
absolutely require them, such as #DF; their usability for anything real
ended with the 80286 or suchlike).  Similarly an IRET to reverse the
actions taken.  That was already rather bad, but understandable: after
all, the CPU had to read the gate descriptor, access the TSS, switch
both the CS and SS descriptors, etc.

What I don't understand is why CLTS, a dedicated instruction that
avoids the need to access the whole of CR0 (which again can
understandably be costly, because of the grouping of all the important
bits there), has to be so slow.  It flips a single bit down there and
does not need to serialise anything, as any instruction down the
pipeline it could affect would trigger a #NM anyway!  And there's an
IRET somewhere on the way too, before the instruction that originally
triggered the fault is reexecuted.  And why the heck, over all these
years, hasn't a mechanism similar to SYSENTER and its bunch of
complementing MSRs been invented for the common exceptions, to avoid
all this gate-descriptor dance!
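For reference, the whole lazy path under discussion amounts to roughly
the following (a compile-only, ring-0 sketch; `struct task', `fpu_owner'
and `handle_nm' are illustrative stand-ins, not the actual Linux
identifiers):

struct fpstate {
	unsigned char buf[512] __attribute__((aligned(16)));
};

struct task {
	struct fpstate fp;
};

static struct task *fpu_owner;	/* task whose state sits in the FPU */

/* Context switch: leave the FPU untouched, just arm the #NM trap. */
static inline void stts(void)
{
	unsigned long cr0;

	asm volatile("mov %%cr0, %0" : "=r" (cr0));
	asm volatile("mov %0, %%cr0" : : "r" (cr0 | 8));  /* set CR0.TS */
}

/* #NM handler body: the FP context only moves when the FPU is used. */
void handle_nm(struct task *current)
{
	asm volatile("clts");	/* the 100+ cycle CLTS discussed above */
	if (fpu_owner != current) {
		if (fpu_owner)
			asm volatile("fxsave %0" : "=m" (fpu_owner->fp));
		asm volatile("fxrstor %0" : : "m" (current->fp));
		fpu_owner = current;
	}
	/* the IRET back then reexecutes the FP instruction that
	   trapped, this time with TS clear */
}

Per the figures you quote it's the trap entry, CLTS and IRET around
this handful of instructions that dominate, not the FXSAVE/FXRSTOR
pair.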
> > And of course you can install the right handler (i.e. FSAVE vs
> > FXSAVE) at bootstrap depending on processor features, you don't
> > have to do all the run-time check on every trap.  You can even
> > optimise the FSAVE handler away at the build time if you know it
> > won't ever be used based on the minimal supported processor family
> > selected.
>
> None of this matters.  A couple of branches in an #NM handler are
> completely lost in the noise.

Agreed, given what you state, completely understood.

> > Do you happen to know or can determine how much time (in clock
> > cycles) a CR0.TS trap itself takes, including any time required to
> > preserve the execution state in the handler such as pushing/popping
> > GPRs to/from the stack (as opposed to processing time spent on
> > moving the FP contexts back and forth)?  Is there no room for
> > improvement left there?  How many task scheduling slots say per
> > million must be there poking at the FPU for eager FPU context
> > switching to take advantage over lazy one?
>
> Thousands of cycles.  Considerably worse in an SVM guest.  x86
> exception handling sucks.

I must have been spoilt by the MIPS exception handling.  Taking an
exception on a MIPS processor is virtually instantaneous, just like
retiring another instruction.  Of course there's a cost equivalent to a
branch misprediction: you need to invalidate the whole pipeline.  So
depending on how many stages you have there, you can expect a latency
of say 3-7 clocks.

Granted, on a MIPS processor taking an exception does not change much
-- it switches into kernel mode (one bit set in a control register, a
special kernel-mode-override bit dedicated to exception handling),
saves the old PC (another control register updated, called the
Exception PC or EPC) and loads the PC with the exception vector.  All
the rest is left to the kernel.  Which is good!

The same goes for ERET, the exception return instruction -- it merely
loads the PC back from EPC and clears the kernel-mode-override bit in
the other control register.  More recently it also serves the purpose
of an instruction hazard barrier, which you'd call serialisation, the
strongest kind provided in the MIPS architecture (in older architecture
revisions you had to take care of any hazards caused by preceding
instructions that could affect user-mode execution by inserting the
right number of NOPs before ERET, possibly taking other instructions
already executed since the origin of the hazard into account).  So
rather than 3-7 clocks that could be 20 or so, though usually many
fewer.

A while ago I cooperated with the hardware team in adding an extra
instruction to the architecture under the assumption that it would be
emulated on legacy hardware, by taking the RI or Reserved Instruction
exception (the equivalent of x86's #UD) and doing the rest there.
Another assumption was that a fast path would be taken for this single
instruction and all the handling done in assembly, without even
reaching the usual C-language RI handlers that we've accumulated over
the years.  Mind that on MIPS processors exceptions first have to be
decoded and dispatched to individual handlers; it's not that individual
exception classes have individual vectors as on x86 -- there's only
one!  And you need to update EPC too or you'd be trapping back.
Finally the instruction itself had to be decoded, so instruction memory
had to be read and compared against the pattern expected.  To make a
long story short, I was able to squeeze all the handling into some 30
cycles, with a slight variation across different processor
implementations.
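In rough C terms (an illustrative sketch only -- the actual fast path
is hand-written assembly, and the CP0 accessors and handler names here
are made up for the example rather than a real API) that handling looks
like:

#include <stdint.h>

#define EXCCODE(cause)	(((cause) >> 2) & 0x1f)	/* Cause.ExcCode field */
#define EXC_RI		10			/* Reserved Instruction */

extern uint32_t read_c0_cause(void);		/* CP0 Cause register */
extern uintptr_t read_c0_epc(void);		/* CP0 EPC register */
extern void write_c0_epc(uintptr_t epc);
extern void slow_path(uint32_t exccode);	/* the usual C handlers */
extern int emulate_new_insn(uint32_t insn);	/* the added instruction */

void exception_dispatch(void)
{
	uint32_t cause = read_c0_cause();

	/* One vector only: decode and dispatch in software.  (A real
	   handler would also have to mind Cause.BD, i.e. a faulting
	   instruction sitting in a branch delay slot.) */
	if (EXCCODE(cause) == EXC_RI) {
		/* Fetch and decode the faulting instruction at EPC. */
		uintptr_t epc = read_c0_epc();
		uint32_t insn = *(uint32_t *)epc;

		if (emulate_new_insn(insn)) {
			/* Step over it, or ERET would trap back. */
			write_c0_epc(epc + 4);
			return;		/* ERET on the way out */
		}
	}
	slow_path(EXCCODE(cause));	/* everything else, in C */
}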
How much different!  Oh well, some further benchmarking is still
needed, but given the circumstances I suppose the good old design will
have to go after all, sigh...

Thanks for your input!

  Maciej