On 05/24/2018 04:18 PM, speck for Tim Chen wrote:
> On 05/24/2018 08:33 AM, speck for Thomas Gleixner wrote:
>> On Thu, 24 May 2018, speck for Thomas Gleixner wrote:
>>> On Thu, 24 May 2018, speck for Peter Zijlstra wrote:
>>>> On Wed, May 23, 2018 at 10:45:45AM +0100, speck for David Woodhouse wrote:
>>>>> The microcode trick just makes it a lot easier because we don't
>>>>> have to *explicitly* pause the sibling vCPUs and manage their state on
>>>>> every vmexit/entry. And avoids potential race conditions with managing
>>>>> that in software.
>>>>
>>>> Yes, it would certainly help and avoid a fair bit of ugly. It would, for
>>>> instance, avoid having to modify irq_enter() / irq_exit(), which would
>>>> otherwise be required (and possibly leak all data touched up until that
>>>> point is reached).
>>>>
>>>> But even with all that, adding L1-flush to every VMENTER will hurt lots.
>>>> Consider for example the PIO emulation used when booting a guest from a
>>>> disk image. That causes VMEXIT/VMENTER at stupendous rates.
>>>
>>> Just did a test on a SKL client where I have the ucode. It does not have HT,
>>> so it's not suffering from any HT side effects when the L1D is flushed.
>>>
>>> Boot time from a disk image is ~1s measured from the first vcpu enter.
>>>
>>> With the L1D flush on vmenter the boot time is about 5-10% slower, and that
>>> has lots of PIO operations in the early boot.
>>>
>>> For a kernel build the L1D flush has an overhead of < 1%.
>>>
>>> Netperf guest to host has a slight drop of the throughput in the 2%
>>> range. Host to guest surprisingly goes up by ~3%. Fun stuff!
>>>
>>> Now I isolated two host CPUs and pinned the two vCPUs on them to be able to
>>> measure the overhead. Running cyclictest with a period of 25us in the guest
>>> on an isolated guest CPU and monitoring the behaviour with perf on the host
>>> for the corresponding host CPU gives:
>>>
>>>     No Flush                          Flush
>>>
>>>     1.31 insn per cycle               1.14 insn per cycle
>>>
>>>     2e6 L1-dcache-load-misses/sec     26e6 L1-dcache-load-misses/sec
>>>
>>> In that simple test the L1D misses go up by a factor of 13.
>>>
>>> Now with the whole gang scheduling, the numbers I heard through the
>>> grapevine are in the range of a factor of 130, i.e. 13k% for a simple boot
>>> from a disk image. 13 minutes instead of 6 seconds...
>
> The performance is highly dependent on how often we VM exit.
> Working with Peter Z on his prototype, the performance impact ranges from
> no regression for network loopback, through ~20% regression for a kernel
> compile, to ~100% regression on file I/O. PIO brings out the worst aspect
> of the synchronization overhead, as we VM exit on every dword PIO read-in;
> the kernel and initrd image was about 50 MB for the experiment, which led
> to a 13 minute load time.
>
> We may need to do the co-scheduling only when the VM exit rate is low, and
> turn off SMT when the VM exit rate becomes too high.
>
> (Note: I haven't added the L1 flush on VM entry to my experiment; that is
> on the todo list.)

As a post note, I added in the L1 flush and the performance numbers pretty
much stay the same. So the synchronization overhead is dominant and the L1
flush overhead is secondary.

Tim
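
For concreteness, a minimal sketch of the kind of per-VMENTER L1D flush being
measured above -- assuming the microcode update exposes the flush as a write to
IA32_FLUSH_CMD (MSR 0x10b, bit 0 = L1D_FLUSH), with a software fallback that
walks a buffer large enough to displace the L1D. The feature-flag name, helper
name and 64K buffer size are illustrative assumptions, not the actual patch:

	#include <linux/compiler.h>
	#include <asm/cpufeature.h>
	#include <asm/msr.h>

	/* Flush-command MSR the new ucode is expected to expose */
	#ifndef MSR_IA32_FLUSH_CMD
	#define MSR_IA32_FLUSH_CMD	0x0000010b
	#define L1D_FLUSH		(1ULL << 0)
	#endif

	/* Fallback buffer, comfortably larger than a 32K L1D (size illustrative) */
	#define L1D_FLUSH_BUF_SIZE	(64 * 1024)
	static u8 l1d_flush_buf[L1D_FLUSH_BUF_SIZE] __aligned(4096);

	static void flush_l1d_before_vmenter(void)
	{
		int i;

		/* Preferred: single MSR write provided by the ucode update */
		if (static_cpu_has(X86_FEATURE_FLUSH_L1D)) {
			wrmsrl(MSR_IA32_FLUSH_CMD, L1D_FLUSH);
			return;
		}

		/* Fallback: touch one byte per cache line to displace the L1D */
		for (i = 0; i < L1D_FLUSH_BUF_SIZE; i += 64)
			READ_ONCE(l1d_flush_buf[i]);
	}

Either variant would be called on the VMENTER path, which is why the cost
scales directly with the exit rate reported in the numbers above.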