From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: from p4fea4eb5.dip0.t-ipconnect.de ([79.234.78.181] helo=nanos)
	by Galois.linutronix.de with esmtpsa (TLS1.2:DHE_RSA_AES_256_CBC_SHA256:256)
	(Exim 4.80)
	(envelope-from )
	id 1fMeeQ-000329-2s
	for speck@linutronix.de; Sat, 26 May 2018 21:14:50 +0200
Date: Sat, 26 May 2018 21:14:49 +0200 (CEST)
From: Thomas Gleixner
Subject: Re: L1D-Fault KVM mitigation
In-Reply-To:
Message-ID:
References: <20180424090630.wlghmrpasn7v7wbn@suse.de>
	<20180424093537.GC4064@hirez.programming.kicks-ass.net>
	<1524563292.8691.38.camel@infradead.org>
	<20180424110445.GU4043@hirez.programming.kicks-ass.net>
	<1527068745.8186.89.camel@infradead.org>
	<20180524094526.GE12198@hirez.programming.kicks-ass.net>
MIME-Version: 1.0
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: 7bit
To: speck@linutronix.de
List-ID:

On Thu, 24 May 2018, speck for Tim Chen wrote:

>> Now with the whole gang scheduling the numbers I heard through the
>> grapevine are in the range of factor 130, i.e. 13k% for a simple boot from
>> disk image. 13 minutes instead of 6 seconds...

> The performance is highly dependent on how often we VM exit.

That's pretty obvious.

> Working with Peter Z on his prototype, the performance ranges from
> no regression for a network loop back, ~20% regression for kernel compile
> to ~100% regression on File IO.

These numbers are not that interesting when you do not provide comparisons
vs. single threaded. See below.

> PIO brings out the worse aspect of the synchronization overhead as we VM
> exit on every dword PIO read in, and the kernel and initrd image was
> about 50 MB for the experiment, and led to 13 min of load time.
>
> We may need to do the co-scheduling only when VM exit rate is low, and
> turn off the SMT when VM exit rate becomes too high.

You cannot do that during runtime. That will destroy placement schemes and
whatever. The SMT off decision needs to be done at a quiescent moment,
i.e. before starting VMs.
The PIO case _IS_ interesting because it highlights the problem with the
synchronization overhead. And it does not matter at all whether you VMEXIT
because of a PIO access or due to any other reason. So even if you optimize
it then you still have a gazillion of vm_exits on boot.

The simple boot tests I did have ~250k vm_exits in 5 seconds and only half
of them are PIO. Removing the PIO access makes the boot faster because you
avoid 50% of the vmexits, but the rest of the vmexits will still get a
massive overhead, unless you have a scenario where two vCPUs of a guest are
runnable and ready to enter at the same time and vmexit at the same
time. Any other scenario will lose due to the busy waiting synchronization
overhead. Just look at traces and do the math.

I did the following test:

 - Two CPUs (siblings) on the host (HSW-EX) fully isolated

 - One guest with two vCPUs affine to the isolated host CPUs.

   idle=poll on the guest command line to avoid the single vCPU case.

 - No L1 Flush

 - Running a kernel compile on the guest in the regular virtio disk backed
   filesystem. Modified the build script to stop before the final linkage
   because that is single threaded.

   Time:      88 seconds
   vmexits:   vCPU0     86.218
              vCPU1     85.703
              total    171.921

   That's about 2 vmexits per ms.

Running the same compile single threaded (offlining vCPU1 in the guest)
increases the time to 107 seconds.

   107 / 88 = 1.22

I.e. it's about 20% slower than the one using two threads. That means that it
is the same slowdown as having two threads synchronized (your number).

So if I take the above example and assume that the overhead of
synchronization is ~20% then the average vmenter/vmexit time is close to
50us.

Next I did an experiment with synchronizing the vmenter/vmexit. It's
probably more stupid than what you have as the overhead I observe is way
higher, but then I don't know how and what you tested exactly, so it's hard
to compare. Nevertheless it gave me very interesting insights via tracing
the synchronization mechanics.
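The arithmetic from the kernel compile test above can be re-run as a quick
sketch. It reads the "." in the vmexit counts as a thousands separator
(which matches "about 2 vmexits per ms"), and it assumes that the ~20%
overhead is spread evenly over one vmenter and one vmexit per recorded
vmexit. That split is only one way to arrive at the ~50us figure; the mail
does not spell out the exact calculation, and the variable names are made
up for illustration.

```python
# Figures quoted in the mail. The "." in the vmexit counts is read as a
# thousands separator (assumption, consistent with "about 2 vmexits per ms").
TIME_TWO_THREADS_S = 88
TIME_ONE_THREAD_S = 107
VMEXITS_VCPU0 = 86_218
VMEXITS_VCPU1 = 85_703
ASSUMED_SYNC_OVERHEAD = 0.20  # the ~20% synchronization overhead assumed in the mail

total_vmexits = VMEXITS_VCPU0 + VMEXITS_VCPU1
exits_per_ms = total_vmexits / (TIME_TWO_THREADS_S * 1000)
slowdown = TIME_ONE_THREAD_S / TIME_TWO_THREADS_S

# Assumption (not stated in the mail): the overhead is spread evenly over
# every guest transition, counting one vmenter plus one vmexit per vmexit.
overhead_s = ASSUMED_SYNC_OVERHEAD * TIME_TWO_THREADS_S
transitions = 2 * total_vmexits
per_transition_us = overhead_s / transitions * 1e6

print(f"total vmexits:            {total_vmexits}")
print(f"vmexits per ms:           {exits_per_ms:.2f}")
print(f"single thread slowdown:   {slowdown:.2f}")
print(f"overhead per transition:  {per_transition_us:.1f} us")
```

Under these assumptions the per-transition cost lands in the same ballpark
as the ~50us mentioned above.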
The interesting thing is that halfway synchronous vmexits on both vCPUs are
rather cheap. The slightly async ones make the big difference and at some
points in the trace the stuff starts to ping pong in and out of guest mode
without really making progress for a while. So there is not only the
overhead itself, it's timing dependent overhead which can accumulate rather
fast. And there is absolutely nothing you can do about that.

So I can see the usefulness for scenarios which David Woodhouse described
where vCPU and host CPU have a fixed relationship and the guests exit once
in a while. But that should really be done with ucode assistance which
avoids all the nasty synchronization hackery more or less completely.

But if anyone believes that the gang scheduling scheme with full software
synchronization can be applied to random usecases, then he's probably
working for the marketing department and authoring the L1 terminal fuckup
press release and whitepaper.

I'm surely open for a surprisingly clever trick which makes this all work,
but I certainly won't hold my breath.

Thanks,

	tglx