From: "Maciej W. Rozycki" <macro@linux-mips.org>
To: Andy Lutomirski <luto@amacapital.net>
Cc: Borislav Petkov <bp@alien8.de>, Ingo Molnar <mingo@kernel.org>,
	Oleg Nesterov <oleg@redhat.com>, Rik van Riel <riel@redhat.com>,
	X86 ML <x86@kernel.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	Linus Torvalds <torvalds@linux-foundation.org>
Subject: Re: [RFC PATCH] x86, fpu: Use eagerfpu by default on all CPUs
Date: Mon, 23 Feb 2015 21:17:42 +0000 (GMT)
Message-ID: <alpine.LFD.2.11.1502232014300.17311@eddie.linux-mips.org>
In-Reply-To: <CALCETrU=9Kvq82fBRfw9RLxzyj=LhnLzGV+vWtH+etpqypLatg@mail.gmail.com>

On Sat, 21 Feb 2015, Andy Lutomirski wrote:

> >  Additionally I believe long-executing FPU instructions (i.e.
> > transcendentals) can take advantage of continuing to execute in parallel
> > where the context has already been switched rather than stalling an eager
> > FPU context switch until the FPU instruction has completed.
> 
> It seems highly unlikely to me that a slow FPU instruction can retire
> *after* a subsequent fxsave, which would need to happen for this to
> work.

 I meant something else -- a slow FPU instruction can retire after a task 
has been switched where the FP context has been left intact, i.e. in the 
lazy FP context switching case, where only the MMU context and GPRs have 
been replaced.  Whereas in the eager FP context switching case you can get 
through to FXSAVE while a slow FPU instruction hasn't completed yet (e.g. 
started just as preemption was about to happen).

 Obviously that FXSAVE will have to stall until the FPU instruction has
completed (IIRC the i486 instead aborted transcendental instructions on any
exception/interrupt, leading to the risk of process starvation in heavily
interrupt-loaded systems, but I believe that has been fixed as of the
Pentium).  Though if, as you say, merely taking a trap/interrupt gate can
take hundreds of cycles, perhaps indeed no FPU instruction will execute
*that* long on modern silicon.

> >  And last but not least, why does the handling of CR0.TS traps have to be
> > complicated?  It does not look like rocket science to me, it should be a
> > mere handful of instructions, the time required to move the two FP
> > contexts out from and in to the FPU respectively should dominate
> > processing time.  Where quoted the optimisation manual states 250 cycles
> > for FXSAVE and FXRSTOR combined.
> 
> The TS traps aren't complicated -- they're just really slow.  I think
> that each of setting *and* clearing TS serializes and takes well over
> a hundred cycles.  A #NM interrupt (the thing that happens if you try
> to use the FPU with TS set) serializes and does all kinds of slow
> things, so it takes many hundreds of cycles.  The handler needs to
> clear TS (another hundred cycles or more), load the FPU state
> (actually rather fast on modern CPUs), and then IRET back to userspace
> (hundreds of cycles).  This adds up to a lot of cycles.  A round trip
> through an exception handler seems to be thousands of cycles.

 That sucks wet goat farts! :(

 I have to admit I have moved away a bit from the x86 world and didn't
realise things had become so bad.  Some 10 years ago or so taking a trap
or interrupt gate would need some 30 cycles (of course task gates are
unusable for anything that does not absolutely require them, such as a #DF;
their usability for anything real ended with the 80286 or suchlike).
Similarly for an IRET to reverse the actions taken.  That was already
rather bad, but understandable; after all the CPU had to read the gate
descriptor, access the TSS, switch both CS and SS descriptors, etc.

 What I don't understand is why CLTS, a dedicated instruction that avoids
the need to access the whole of CR0 (which again can understandably be
costly, because of the grouping of all the important bits there), has to
be so slow.  It flips a single bit down there and does not need to
serialise anything, as any instruction down the pipeline it could affect
would trigger a #NM anyway!  And there's an IRET somewhere on the way too,
before the instruction that originally triggered the fault is
reexecuted.

 And why the heck, over all these years, hasn't a mechanism similar to
SYSENTER and its bunch of complementing MSRs been invented for the common
exceptions, to avoid all this gate descriptor dance?

> >  And of course you can install the right handler (i.e. FSAVE vs FXSAVE) at
> > bootstrap depending on processor features, you don't have to do all the
> > run-time check on every trap.  You can even optimise the FSAVE handler
> > away at the build time if you know it won't ever be used based on the
> > minimal supported processor family selected.
> 
> None of this matters.  A couple of branches in an #NM handler are
> completely lost in the noise.

 Agreed, given what you state, completely understood.

> >  Do you happen to know or can determine how much time (in clock cycles) a
> > CR0.TS trap itself takes, including any time required to preserve the
> > execution state in the handler such as pushing/popping GPRs to/from the
> > stack (as opposed to processing time spent on moving the FP contexts back
> > and forth)?  Is there no room for improvement left there?  How many task
> > scheduling slots say per million must be there poking at the FPU for eager
> > FPU context switching to take advantage over lazy one?
> 
> Thousands of cycles.  Considerably worse in an SVM guest.  x86
> exception handling sucks.
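
 The break-even question quoted above can be put as rough arithmetic.
What follows is only a back-of-the-envelope sketch: the 250-cycle figure
for FXSAVE and FXRSTOR combined is the one quoted from the optimisation
manual, while the 100-cycle cost of setting TS and the 2000-cycle #NM
round trip are placeholder values assumed from the "well over a hundred"
and "thousands of cycles" estimates, not measurements.  The function
names are illustrative.

```python
# Back-of-the-envelope cost model for one context switch, in clock cycles.
FP_MOVE = 250         # FXSAVE + FXRSTOR combined, figure quoted in the thread
TS_SET = 100          # assumed: cost of setting CR0.TS at switch time
NM_ROUND_TRIP = 2000  # assumed: #NM entry + CLTS + IRET back to userspace


def eager_cost():
    """Eager switching always moves both FP contexts."""
    return FP_MOVE


def lazy_cost(p_fpu):
    """Lazy switching sets TS on every switch; with probability p_fpu the
    task touches the FPU in its slot and pays the trap plus the move."""
    return TS_SET + p_fpu * (NM_ROUND_TRIP + FP_MOVE)


def break_even_fraction():
    """Fraction of FPU-using slots at which both schemes cost the same."""
    return (FP_MOVE - TS_SET) / (NM_ROUND_TRIP + FP_MOVE)


if __name__ == "__main__":
    p = break_even_fraction()
    print(f"break-even: about {p * 1e6:.0f} FPU-using slots per million")
```

 Under these assumed numbers lazy switching only pays off while fewer
than about one scheduling slot in fifteen touches the FPU; measured
values would have to be plugged in to get a real answer.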

 I must have been spoilt by the MIPS exception handling.  Taking an
exception on a MIPS processor is virtually instantaneous, just like
retiring another instruction.  Of course there's a cost equivalent to
a branch misprediction, as you need to flush the whole pipeline.  So
depending on how many stages you have there, you can expect a latency of
say 3-7 clocks.

 Granted, on a MIPS processor taking an exception does not change much -- 
it switches into the kernel mode (1 bit set in a control register, a 
special kernel-mode-override bit dedicated to exception handling), saves 
the old PC (another control register updated; called Exception PC or EPC) 
and loads the PC with the exception vector.  All the rest is left to the 
kernel.  Which is good!

 The same goes for ERET, the exception return instruction -- it merely
loads the PC back from EPC and clears the kernel-mode-override bit in the 
other control register.  More recently it also serves the purpose of an 
instruction hazard barrier, which you'd call synchronisation, the 
strongest kind provided in the MIPS architecture (in older architecture 
revisions you had to take care of any hazards caused by preceding 
instructions that could affect user-mode execution, by inserting the right 
number of NOPs before ERET, possibly taking other instructions already 
executed since the origin of the hazard into account).  So rather than 3-7 
clocks that could be 20 or so, though usually much fewer.
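
 The entry and return sequence described above is small enough to model
in a few lines.  This is a toy sketch, not a description of any
particular core: it tracks only the three effects named in the text and
leaves out everything else the hardware does (the Cause register, branch
delay slots, hazard barriers).  The kernel-mode-override bit is taken to
be Status.EXL and the vector to be the general exception vector with
Status.BEV clear; the class and method names are made up for
illustration.

```python
# Toy model of MIPS exception entry and ERET as described above.
STATUS_EXL = 1 << 1          # Status.EXL, the kernel-mode-override bit
GENERAL_VECTOR = 0x80000180  # general exception vector, Status.BEV = 0


class Cpu:
    def __init__(self, pc):
        self.pc = pc
        self.status = 0
        self.epc = 0

    def take_exception(self):
        """Entry: save the old PC in EPC, set EXL, load the vector."""
        self.epc = self.pc
        self.status |= STATUS_EXL
        self.pc = GENERAL_VECTOR

    def eret(self):
        """Return: reload the PC from EPC and clear EXL."""
        self.pc = self.epc
        self.status &= ~STATUS_EXL
```

 Everything beyond this, such as saving GPRs or switching stacks, is
software's job in the handler, which is why the hardware part can be so
cheap.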

 A while ago I cooperated with the hardware team in adding an extra 
instruction to the architecture under the assumption that it will be 
emulated on legacy hardware, by taking the RI or Reserved Instruction 
exception (the equivalent to x86's #UD) and doing the rest there.  
Another assumption was a fast path would be taken for this single 
instruction and all the handling done in assembly, without even reaching 
the usual C-language RI handlers that we've accumulated over the years.  

 Mind that exceptions actually have to be decoded and dispatched to
individual handlers on MIPS processors first; it's not as if individual
exception classes had individual vectors like with x86 -- there's only
one!  And you need to advance EPC too or you'd trap right back on return.
Finally the instruction itself had to be decoded, so instruction memory
had to be read and compared against the pattern expected.
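
 The steps just listed can be sketched as follows.  This is a model of
the control flow only, not the actual assembly fast path: the instruction
encoding (INSN_MASK and INSN_MATCH) is hypothetical, since the
instruction is not named here, and the callback names are made up.  The
sketch also assumes the faulting instruction is not in a branch delay
slot, which a real handler would have to check via Cause.BD.

```python
# Toy model of the fast path: dispatch on Cause.ExcCode, match the
# instruction word at EPC, emulate it, then step EPC past it.
EXC_CODE_SHIFT = 2
EXC_CODE_MASK = 0x1F
EXC_RI = 10                 # Reserved Instruction exception code

# Hypothetical encoding of the new instruction (placeholder values).
INSN_MASK = 0xFC00003F
INSN_MATCH = 0x7000003E


def handle_exception(cause, epc, memory, emulate, slow_path):
    """Return the EPC value to resume at with ERET."""
    exc_code = (cause >> EXC_CODE_SHIFT) & EXC_CODE_MASK
    if exc_code != EXC_RI:
        return slow_path(cause, epc)      # some other exception class
    insn = memory[epc]                    # read the faulting instruction
    if (insn & INSN_MASK) != INSN_MATCH:
        return slow_path(cause, epc)      # a genuinely reserved opcode
    emulate(insn)
    return epc + 4                        # skip it, or we'd trap again
```

 Here memory is simply a mapping from address to instruction word, and
slow_path stands in for the usual C-language handlers.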

 To make a long story short I was able to squeeze all the handling into
some 30 cycles, with a slight variation across different processor
implementations.  What a difference!

 Oh well, some further benchmarking is still needed, but given the
circumstances I suppose the good old design will have to go after all,
sigh...  Thanks for your input!

  Maciej
