From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <tglx@linutronix.de>
Received: from p4fea4eb5.dip0.t-ipconnect.de ([79.234.78.181] helo=nanos)
	by Galois.linutronix.de with esmtpsa (TLS1.2:DHE_RSA_AES_256_CBC_SHA256:256)
	(Exim 4.80)
	(envelope-from <tglx@linutronix.de>)
	id 1fMeeQ-000329-2s
	for speck@linutronix.de; Sat, 26 May 2018 21:14:50 +0200
Date: Sat, 26 May 2018 21:14:49 +0200 (CEST)
From: Thomas Gleixner <tglx@linutronix.de>
Subject: Re: L1D-Fault KVM mitigation
In-Reply-To: <d2029ba2-bdad-5bb9-596d-f22a9bfa5b9a@linux.intel.com>
Message-ID: <alpine.DEB.2.21.1805251152400.1581@nanos.tec.linutronix.de>
References: <20180424090630.wlghmrpasn7v7wbn@suse.de>
 <20180424093537.GC4064@hirez.programming.kicks-ass.net>
 <1524563292.8691.38.camel@infradead.org>
 <20180424110445.GU4043@hirez.programming.kicks-ass.net>
 <1527068745.8186.89.camel@infradead.org>
 <20180524094526.GE12198@hirez.programming.kicks-ass.net>
 <alpine.DEB.2.21.1805241201510.1577@nanos.tec.linutronix.de>
 <alpine.DEB.2.21.1805241729520.1577@nanos.tec.linutronix.de>
 <d2029ba2-bdad-5bb9-596d-f22a9bfa5b9a@linux.intel.com>
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: 7bit
To: speck@linutronix.de
List-ID: <speck.linutronix.de>

On Thu, 24 May 2018, speck for Tim Chen wrote:

>> Now with the whole gang scheduling the numbers I heard through the
>> grapevine are in the range of factor 130, i.e. 13k% for a simple boot from
>> disk image. 13 minutes instead of 6 seconds...

> The performance is highly dependent on how often we VM exit.

That's pretty obvious.

> Working with Peter Z on his prototype, the performance ranges from
> no regression for a network loop back, ~20% regression for kernel compile
> to ~100% regression on File IO.

These numbers are not that interesting when you do not provide comparisons
vs. single threaded. See below.

> PIO brings out the worse aspect of the synchronization overhead as we VM
> exit on every dword PIO read in, and the kernel and initrd image was
> about 50 MB for the experiment, and led to 13 min of load time.
>
> We may need to do the co-scheduling only when VM exit rate is low, and
> turn off the SMT when VM exit rate becomes too high.

You cannot do that during runtime. That will destroy placement schemes and
whatever. The SMT off decision needs to be done at a quiescent moment,
i.e. before starting VMs.

The PIO case _IS_ interesting because it highlights the problem with the
synchronization overhead. And it does not matter at all whether you VMEXIT
because of a PIO access or due to any other reason. So even if you optimize
it then you still have a gazillion of vm_exits on boot. The simple boot
tests I did have ~250k vm_exits in 5 seconds and only half of them are PIO.

Removing the PIO access makes the boot faster because you avoid 50% of the
vmexits, but the rest of the vmexits will still get a massive overhead,
unless you have a scenario where two vCPUs of a guest are runnable and
ready to enter at the same time and vmexit at the same time. Any other
scenario will lose due to the busy waiting synchronization overhead. Just
look at traces and do the math.
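
For the PIO numbers quoted above, the math can be sketched as follows. This
is only a back-of-the-envelope estimate: it assumes "50 MB" means 50 MiB,
that every dword read causes exactly one VM exit, and that the whole 13
minute load time is spent on those exits, so the per-exit figure is an upper
bound. All names are illustrative.

```python
# Figures quoted in the thread; treating MB as MiB is an assumption.
IMAGE_BYTES = 50 * 1024 * 1024   # kernel + initrd, "about 50 MB"
DWORD_BYTES = 4                  # one VM exit per dword PIO read
LOAD_TIME_S = 13 * 60            # "13 min of load time"
BASELINE_S = 6                   # boot time quoted for the unsynchronized case

# Number of PIO-triggered VM exits needed to pull in the image.
pio_exits = IMAGE_BYTES // DWORD_BYTES

# Upper bound on the cost per exit: all load time attributed to the exits.
cost_per_exit_us = LOAD_TIME_S / pio_exits * 1e6

# Slowdown factor relative to the quoted 6 second boot.
slowdown = LOAD_TIME_S / BASELINE_S

print(f"exits={pio_exits} cost<={cost_per_exit_us:.1f}us slowdown={slowdown:.0f}x")
```

Under these assumptions the per-exit cost lands in the tens of microseconds,
and removing the PIO exits only removes that share of the total; every
remaining exit still pays a comparable price.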

I did the following test:

 - Two CPUs (siblings) on the host (HSW-EX) fully isolated

 - One guest with two vCPUs affine to the isolated host CPUs. idle=poll on
   the guest command line to avoid the single vCPU case.

 - No L1 Flush

 - Running a kernel compile on the guest in the regular virtio disk backed
   filesystem. Modified the build script to stop before the final linkage
   because that is single threaded.

Time:  88 seconds

vmexits:   vCPU0         86,218
           vCPU1         85,703
           total        171,921

That's about 2 vmexits per ms.
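
The rate follows directly from the counts and the wall time. A minimal
sketch of that arithmetic, with illustrative names:

```python
# vmexit counts and wall time taken from the measurement above
VMEXITS = {"vCPU0": 86_218, "vCPU1": 85_703}
ELAPSED_S = 88


def vmexits_per_ms(counts, elapsed_s):
    """Average number of vmexits per millisecond across all vCPUs."""
    return sum(counts.values()) / (elapsed_s * 1000)


total = sum(VMEXITS.values())
rate = vmexits_per_ms(VMEXITS, ELAPSED_S)
print(f"total={total} rate={rate:.2f}/ms")
```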

Running the same compile single threaded (offlining vCPU1 in the guest)
increases the time to 107 seconds.

    107 / 88  = 1.22

I.e. it's roughly 20% slower than the run using two threads. That means it
is the same slowdown as having two threads synchronized (your number).

So if I take the above example and assume that the overhead of
synchronization is ~20% then the average vmenter/vmexit time is close to
50us.
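
One reading that reproduces the ~50us figure is sketched here. It assumes
the 20% overhead applies to the 88 second run and is spread evenly over
every vmexit and the vmenter that pairs with it. Both the even spread and
the names are assumptions made for illustration.

```python
TOTAL_VMEXITS = 86_218 + 85_703   # both vCPUs, from the measurement above
RUNTIME_S = 88                    # two-thread compile time
ASSUMED_OVERHEAD = 0.20           # assumption: ~20% synchronization overhead


def overhead_per_transition_us(exits, runtime_s, overhead_fraction):
    """Spread the assumed overhead evenly over all vmenter/vmexit transitions."""
    overhead_s = runtime_s * overhead_fraction
    transitions = 2 * exits       # each vmexit is paired with one vmenter
    return overhead_s / transitions * 1e6


# Slowdown of the single threaded run relative to the two-thread run.
single_vs_dual = 107 / 88

per_transition = overhead_per_transition_us(
    TOTAL_VMEXITS, RUNTIME_S, ASSUMED_OVERHEAD
)
print(f"slowdown={single_vs_dual:.2f} overhead={per_transition:.1f}us/transition")
```

Counting only the exits instead of both transitions would double the
per-event figure, so the result is sensitive to that choice.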

Next I did an experiment with synchronizing the vmenter/vmexit. It's
probably more stupid than what you have as the overhead I observe is way
higher, but then I don't know how and what you tested exactly, so it's hard
to compare.

Nevertheless it gave me very interesting insights via tracing the
synchronization mechanics. The interesting thing is that roughly
synchronous vmexits on both vCPUs are rather cheap. The slightly async ones
make the big difference and at some points in the trace the stuff starts to
ping pong in and out of guest mode without really making progress for a
while. So there is not only the overhead itself, it's a timing dependent
overhead which can accumulate rather fast. And there is absolutely nothing
you can do about that.

So I can see the usefulness for the scenarios which David Woodhouse described
where vCPU and host CPU have a fixed relationship and the guests exit once
in a while. But that should really be done with ucode assistance which
avoids all the nasty synchronization hackery more or less completely.

But if anyone believes that the gang scheduling scheme with full software
synchronization can be applied to random usecases, then he's probably
working for the marketing department and authoring the L1 terminal fuckup
press release and whitepaper.

I'm surely open to a surprisingly clever trick which makes this all work, but
I certainly won't hold my breath.

Thanks,

        tglx