Re: [RFC PATCH] x86, fpu: Use eagerfpu by default on all CPUs

From: Andy Lutomirski <luto@amacapital.net>
To: "Maciej W. Rozycki" <macro@linux-mips.org>
Cc: Rik van Riel <riel@redhat.com>, Borislav Petkov <bp@alien8.de>,
	Ingo Molnar <mingo@kernel.org>, Oleg Nesterov <oleg@redhat.com>,
	X86 ML <x86@kernel.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	Linus Torvalds <torvalds@linux-foundation.org>
Subject: Re: [RFC PATCH] x86, fpu: Use eagerfpu by default on all CPUs
Date: Mon, 23 Feb 2015 18:31:57 -0800
Message-ID: <CALCETrV5bD=L2OOtDE1TbHcVMc7Ahhs5YQtx6jTmArM08wqPjg@mail.gmail.com>
In-Reply-To: <alpine.LFD.2.11.1502240112070.17311@eddie.linux-mips.org>

On Mon, Feb 23, 2015 at 6:14 PM, Maciej W. Rozycki <macro@linux-mips.org> wrote:
> On Mon, 23 Feb 2015, Andy Lutomirski wrote:
>
>> >> After a context switch, the instructions from the old task are no
>> >> longer in the pipeline.
>> >
>> >  I'd say it's implementation-specific.  As I mentioned the i486 aborted
>> > any transcendental x87 instruction in progress upon taking an exception or
>> > interrupt.  That was a model like you refer to, but as I also mentioned it
>> > had its shortcomings.
>>
>> IRET is serializing, according to the docs (I think) and according
>> to the Intel engineers I asked (I'm absolutely certain about this
>> part).  So FPU ops are entirely done at the end of a normal context
>> switch.
>
>  No question about the serialising property of IRET, it has been like this
> since the original Pentium implementation.  Do you have an architecture
> specification reference to back up your claim though as far as the FPU is
> concerned?  I'm asking because I am genuinely curious.
>
>  The x87 case is so special, there isn't anything there really that is
> externally observable or should be affected by IRET or any other
> synchronisation barriers apart from WAIT (or a waiting x87 instruction)
> that has been there for this purpose since forever.  And it would defeat
> some documented benefits of running the FP pipeline in parallel.

It's plausible that this is special, but I doubt it.  Especially since
this optimization would be nuts post-SSE2.

>
>  And certainly such synchronisation didn't happen in the old days.
>
>> We also always save the FPU context on every context switch away from
>> a task that used the FPU, even in lazy mode.  This is because we might
>> switch the task back in on a different CPU, and we don't want to use
>> an IPI to move the FPU context.
>
>  That's an interesting case too, although not necessarily related.  If you
> say that we always save the FP context eagerly for the purpose of process
> migration, then sure, that invalidates any benefit we'd have from letting
> the x87 proceed.
>
>  However I can see different ways to address this case avoiding the need
> of eager FP context saving or an IPI:
>
> 1. We could bind any currently suspended process with an unsaved FP
>    context to the CPU it last executed on.

This would be insane.

>
> 2. We could mark such a process for migration next time and let it execute
>    on the CPU that holds its FP context once more, and then save the FP
>    context eagerly on the way out.

This would be worse than insane.  Now, in order to wake such a process
on a different CPU, we'd have to force a *context switch* on the
source CPU.  Now we're replacing a few hundred cycles at worst for a
transcendental function with at least 10k cycles (at a guess) and
possibly many orders of magnitude more if locks are held, plus
priority issues, plus total craziness.

>
> In some cases a lazily retained FP context would be preempted by another
> process before the process in question would resume anyway.  In this case
> any temporary binding to a CPU could be given up.
>
>> Given that we're only talking about old CPUs here, I sincerely doubt
>> that there's any relevant case in which an fxsave can usefully wait
>> for a long-running transcendental op to finish while we continue doing
>> useful work.  *Especially* since there will almost certainly be
>> several more mfences or atomic ops before the end of the context
>> switch, even if we're lucky enough to complete the context switching
>> using sysret.
>
>  I am not sure what you mean by FXSAVE usefully waiting for an op, please
> elaborate.  At the point you've reached FXSAVE and an earlier x87
> instruction hasn't completed, you've already lost.  The pipeline will be
> stalled until the x87 instruction has completed and it can be hundreds of
> cycles.  My point therefore has been about not executing FXSAVE for
> the old task until absolutely necessary, which with lazy FP context
> switching would be at the next x87 (or SSE) instruction reached by the new
> task.
>
>  Likewise I don't see why MFENCE or an atomic operation should affect the
> execution of say FSINCOS.  Whether the results of FSINCOS arrive before
> or after MFENCE, etc. are not externally observable.

FSINCOS; FXSAVE; MFENCE had better serialize all the way, no matter
what weird architectural crud is going on.

>
>  And I'm not sure if this all affects old CPUs only -- I don't know how
> much x87 software is out there, but after all these years I'd expect quite
> some.  Sure, lots of this can be recompiled to use SSE instead, but not
> all, and even where it is feasible, that's an extra burden for people,
> beyond say a routine hardware or Linux distribution or for that matter
> lone kernel upgrade.  Therefore I think we need to be careful not to
> pessimise things for a subset of people too much and ideally at all.
>
>  And to be clear, I am not against removing lazy FP context switching per
> se.  I am just emphasizing to be careful with that and be absolutely sure
> that it does not cause excessive harm.

We're talking about the unusual case in which we context switch within
~100 cycles of a legacy transcendental operation, and, even so,
there's *still* no regression, since we don't optimize this case
today.
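
To make the "no regression" point concrete, here is a toy model in C.
It is not kernel code, and every name in it is made up for
illustration.  It only encodes what was said above: both modes save
the FPU state when switching away from a task that used the FPU, and
the modes differ only in when the incoming task's state is restored
(at switch-in for eager, on a first-use trap for lazy).

```c
#include <stdbool.h>

/* Toy model only: hypothetical names, no relation to real kernel symbols. */
struct task {
    bool used_fpu;      /* task touched the FPU since it was switched in */
    int  fpu_saves;     /* FXSAVE-like saves done when switching away */
    int  fpu_restores;  /* restores of this task's FPU state */
    int  fpu_traps;     /* first-use traps taken (lazy mode only) */
};

enum fpu_mode { FPU_LAZY, FPU_EAGER };

/* Switch away from prev and into next. */
static void toy_switch(struct task *prev, struct task *next,
                       enum fpu_mode mode)
{
    /*
     * Both modes save here, so that prev can later be switched in on
     * a different CPU without an IPI to fetch its FPU state.
     */
    if (prev->used_fpu) {
        prev->fpu_saves++;
        prev->used_fpu = false;
    }

    /* Eager mode restores the incoming task's state right away. */
    if (mode == FPU_EAGER)
        next->fpu_restores++;
}

/* The task executes its first FPU instruction after being switched in. */
static void toy_first_fpu_use(struct task *t, enum fpu_mode mode)
{
    /* Lazy mode takes a trap and restores the state only now. */
    if (mode == FPU_LAZY) {
        t->fpu_traps++;
        t->fpu_restores++;
    }
    t->used_fpu = true;
}

/* Number of saves charged to the outgoing task for one switch. */
static int toy_saves_for_one_switch(enum fpu_mode mode)
{
    struct task a = { .used_fpu = true };
    struct task b = { 0 };

    toy_switch(&a, &b, mode);
    toy_first_fpu_use(&b, mode);
    return a.fpu_saves;
}
```

In this model toy_saves_for_one_switch() returns the same count for
FPU_LAZY and FPU_EAGER, so a save that has to wait for a pending
transcendental op costs the same in both modes.  Only the restore
side moves.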

--Andy