On 05/24/2018 04:18 PM, speck for Tim Chen wrote:
> On 05/24/2018 08:33 AM, speck for Thomas Gleixner wrote:
>> On Thu, 24 May 2018, speck for Thomas Gleixner wrote:
>>> On Thu, 24 May 2018, speck for Peter Zijlstra wrote:
>>>> On Wed, May 23, 2018 at 10:45:45AM +0100, speck for David Woodhouse wrote:
>>>>> The microcode trick just makes it a lot easier because we don't
>>>>> have to *explicitly* pause the sibling vCPUs and manage their state on
>>>>> every vmexit/entry. And avoids potential race conditions with managing
>>>>> that in software.
>>>>
>>>> Yes, it would certainly help and avoid a fair bit of ugly. It would, for
>>>> instance, avoid having to modify irq_enter() / irq_exit(), which would
>>>> otherwise be required (and possibly leak all data touched up until that
>>>> point is reached).
>>>>
>>>> But even with all that, adding L1-flush to every VMENTER will hurt lots.
>>>> Consider for example the PIO emulation used when booting a guest from a
>>>> disk image. That causes VMEXIT/VMENTER at stupendous rates.
>>>
>>> Just did a test on a SKL client where I have the ucode. It does not have HT,
>>> so it's not suffering from any HT side effects when the L1D is flushed.
>>>
>>> Boot time from a disk image is ~1s measured from the first vcpu enter.
>>>
>>> With the L1D flush on vmenter the boot time is about 5-10% slower, and that
>>> has lots of PIO operations in the early boot.
>>>
>>> For a kernel build the L1D flush has an overhead of < 1%.
>>>
>>> Netperf guest to host has a slight drop of the throughput in the 2%
>>> range. Host to guest surprisingly goes up by ~3%. Fun stuff!
>>>
>>> Now I isolated two host CPUs and pinned the two vCPUs on them to be able to
>>> measure the overhead. Running cyclictest with a period of 25us in the guest
>>> on an isolated guest CPU and monitoring the behaviour with perf on the host
>>> for the corresponding host CPU gives:
>>>
>>>     No Flush                          Flush
>>>
>>>     1.31 insn per cycle               1.14 insn per cycle
>>>
>>>     2e6 L1-dcache-load-misses/sec     26e6 L1-dcache-load-misses/sec
>>>
>>> In that simple test the L1D misses go up by a factor of 13.
>>>
>>> Now with the whole gang scheduling, the numbers I heard through the
>>> grapevine are in the range of a factor of 130, i.e. 13k% for a simple boot
>>> from a disk image. 13 minutes instead of 6 seconds...
>
> The performance is highly dependent on how often we VM exit.
> Working with Peter Z on his prototype, the performance impact ranges from
> no regression for network loopback, through ~20% regression for a kernel
> compile, to ~100% regression on file I/O. PIO brings out the worst aspect
> of the synchronization overhead, as we VM exit on every dword PIO read-in;
> the kernel and initrd image was about 50 MB for the experiment, which led
> to a 13 minute load time.
>
> We may need to do the co-scheduling only when the VM exit rate is low, and
> turn off SMT when the VM exit rate becomes too high.
>
> (Note: I haven't added the L1 flush on VM entry to my experiment; that is
> on the todo list.)

As a post note, I added in the L1 flush and the performance numbers pretty
much stay the same. So the synchronization overhead is dominant and the L1
flush overhead is secondary.

Tim
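
For concreteness, a minimal sketch of the kind of per-VMENTER L1D flush being
measured above -- assuming the microcode update exposes the flush as a write to
IA32_FLUSH_CMD (MSR 0x10b, bit 0 = L1D_FLUSH), with a software fallback that
walks a buffer large enough to displace the L1D. The feature-flag name, helper
name and 64K buffer size are illustrative assumptions, not the actual patch:

	#include <linux/compiler.h>
	#include <asm/cpufeature.h>
	#include <asm/msr.h>

	/* Flush-command MSR the new ucode is expected to expose */
	#ifndef MSR_IA32_FLUSH_CMD
	#define MSR_IA32_FLUSH_CMD	0x0000010b
	#define L1D_FLUSH		(1ULL << 0)
	#endif

	/* Fallback buffer, comfortably larger than a 32K L1D (size illustrative) */
	#define L1D_FLUSH_BUF_SIZE	(64 * 1024)
	static u8 l1d_flush_buf[L1D_FLUSH_BUF_SIZE] __aligned(4096);

	static void flush_l1d_before_vmenter(void)
	{
		int i;

		/* Preferred: single MSR write provided by the ucode update */
		if (static_cpu_has(X86_FEATURE_FLUSH_L1D)) {
			wrmsrl(MSR_IA32_FLUSH_CMD, L1D_FLUSH);
			return;
		}

		/* Fallback: touch one byte per cache line to displace the L1D */
		for (i = 0; i < L1D_FLUSH_BUF_SIZE; i += 64)
			READ_ONCE(l1d_flush_buf[i]);
	}

Either variant would be called on the VMENTER path, which is why the cost
scales directly with the exit rate reported in the numbers above.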