From: Thomas Gleixner <tglx@linutronix.de>
To: speck@linutronix.de
Subject: Re: L1D-Fault KVM mitigation
Date: Thu, 24 May 2018 17:04:54 +0200 (CEST)
Message-ID: <alpine.DEB.2.21.1805241201510.1577@nanos.tec.linutronix.de>
In-Reply-To: <20180524094526.GE12198@hirez.programming.kicks-ass.net>

On Thu, 24 May 2018, speck for Peter Zijlstra wrote:
> On Wed, May 23, 2018 at 10:45:45AM +0100, speck for David Woodhouse wrote:
> > The microcode trick just makes it a lot easier because we don't
> > have to *explicitly* pause the sibling vCPUs and manage their state on
> > every vmexit/entry. And avoids potential race conditions with managing
> > that in software.
> 
> Yes, it would certainly help and avoid a fair bit of ugly. It would, for
> instance, avoid having to modify irq_enter() / irq_exit(), which would
> otherwise be required (and possibly leak all data touched up until that
> point is reached).
> 
> But even with all that, adding L1-flush to every VMENTER will hurt lots.
> Consider for example the PIO emulation used when booting a guest from a
> disk image. That causes VMEXIT/VMENTER at stupendous rates.

Just did a test on a SKL client where I have ucode. It does not have HT, so
it's not suffering from any HT side effects when the L1D is flushed.

Boot time from a disk image is ~1s, measured from the first vCPU enter.

With the L1D flush on VMENTER the boot time is about 5-10% slower, and that
early boot path has lots of PIO operations.
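
To make that concrete: under PIO, every access to the device's I/O ports
traps to the hypervisor, gets emulated, and re-enters the guest. A minimal
illustrative sketch, not taken from this thread (the ATA port number and the
ioperm() setup are only there to make it compile and run as root in a guest;
reading the data port without a prior READ command returns junk, the point
is purely the trap rate):

    #include <stdio.h>
    #include <stdint.h>
    #include <sys/io.h>       /* ioperm(), inw() - x86 glibc */

    #define ATA_DATA 0x1F0    /* primary ATA data port */

    int main(void)
    {
        uint16_t buf[256];

        if (ioperm(ATA_DATA, 8, 1)) {   /* grant port access, needs root */
            perror("ioperm");
            return 1;
        }

        /*
         * One 512-byte sector read via legacy PIO is 256 word-sized
         * port reads. When emulated, each inw() is one VMEXIT/VMENTER
         * round trip - and with an L1D flush on every VMENTER, that is
         * 256 L1D flushes per sector before status polling is counted.
         */
        for (int i = 0; i < 256; i++)
            buf[i] = inw(ATA_DATA);

        printf("read %zu bytes via PIO\n", sizeof(buf));
        return 0;
    }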

For a kernel build the L1D Flush has an overhead of < 1%.

Netperf guest-to-host shows a slight throughput drop in the 2% range.
Host-to-guest surprisingly goes up by ~3%. Fun stuff!

Now I isolated two host CPUs and pinned the two vCPUs to them to be able to
measure the overhead. Running cyclictest with a period of 25us in the guest
on an isolated guest CPU and monitoring the behaviour with perf on the host
for the corresponding host CPU gives:

No Flush                          Flush

1.31 insn per cycle               1.14 insn per cycle

2e6 L1-dcache-load-misses/sec     26e6 L1-dcache-load-misses/sec

In that simple test the L1D misses go up by a factor of 13.
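
For reference, numbers like the above can be gathered with plain
'perf stat -e L1-dcache-load-misses -C <cpu>' on the host. A minimal
programmatic sketch via perf_event_open(2), assuming a host CPU number
and measurement window that are placeholders rather than the setup used
here (counting a whole CPU needs root or perf_event_paranoid <= 0):

    #include <linux/perf_event.h>
    #include <sys/syscall.h>
    #include <unistd.h>
    #include <string.h>
    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        struct perf_event_attr attr;
        uint64_t count;
        int fd;

        memset(&attr, 0, sizeof(attr));
        attr.size = sizeof(attr);
        attr.type = PERF_TYPE_HW_CACHE;
        attr.config = PERF_COUNT_HW_CACHE_L1D |
                      (PERF_COUNT_HW_CACHE_OP_READ << 8) |
                      (PERF_COUNT_HW_CACHE_RESULT_MISS << 16);

        /* pid = -1, cpu = 2: count everything on host CPU 2 */
        fd = syscall(SYS_perf_event_open, &attr, -1, 2, -1, 0);
        if (fd < 0) {
            perror("perf_event_open");
            return 1;
        }

        sleep(10);                  /* measurement window */

        read(fd, &count, sizeof(count));
        printf("L1-dcache-load-misses: %llu (%.1f e6/sec)\n",
               (unsigned long long)count, count / 10e6);
        return 0;
    }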

Now, with the full gang scheduling approach, the numbers I heard through the
grapevine are in the range of a factor of 130, i.e. 13000%, for a simple boot
from a disk image: 13 minutes instead of 6 seconds...

The direction is not surprising at all, though the magnitude is way higher
than I expected. I don't see a realistic chance for vmexit-heavy workloads
to work with that synchronization scheme at all, whether it's ucode-assisted
or not.
 
The only workload types which will ever benefit from that co-scheduling
stuff are CPU-bound workloads which more or less never vmexit. But are those
workloads really workloads which benefit from HT? Compute workloads tend to
use floating point or vector instructions, which are not really HT friendly
because the two siblings compete for the shared FP/vector execution units.
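
One quick way to check the HT friendliness of an FP-heavy load on a given
machine, as a hedged sketch: pin two threads either on sibling hyperthreads
or on separate cores and compare the combined runtime. The CPU numbers are
assumptions (0/1 as siblings, 0/2 as separate cores; verify against
/sys/devices/system/cpu/cpu*/topology/thread_siblings_list), and four
mul/add chains per thread may not fully saturate every core's FP ports:

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <stdio.h>
    #include <time.h>

    #define ITERS 200000000UL

    static volatile double sink;   /* defeat dead-code elimination */

    static void *fp_load(void *arg)
    {
        int cpu = *(int *)arg;
        cpu_set_t set;

        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

        /* Independent mul/add chains to keep the FP ports busy */
        double a = 1.0, b = 1.0, c = 1.0, d = 1.0;
        for (unsigned long i = 0; i < ITERS; i++) {
            a = a * 1.0000001 + 1e-9;
            b = b * 1.0000002 + 1e-9;
            c = c * 1.0000003 + 1e-9;
            d = d * 1.0000004 + 1e-9;
        }
        sink = a + b + c + d;
        return NULL;
    }

    static double run_pair(int cpu_a, int cpu_b)
    {
        struct timespec t0, t1;
        pthread_t a, b;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        pthread_create(&a, NULL, fp_load, &cpu_a);
        pthread_create(&b, NULL, fp_load, &cpu_b);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        clock_gettime(CLOCK_MONOTONIC, &t1);
        return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    }

    int main(void)
    {
        printf("HT siblings (0,1):    %.2fs\n", run_pair(0, 1));
        printf("separate cores (0,2): %.2fs\n", run_pair(0, 2));
        return 0;
    }

If the sibling pairing is not meaningfully slower than two full cores, the
workload was not FP-throughput-bound to begin with.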

Can the virt folks who know what runs on their cloud offerings please shed
some light on this? Has anyone done a proper analysis of cloud workloads and
their behaviour on HT and their vmexit rates?

Thanks,

	tglx

