From: Paolo Bonzini <pbonzini@redhat.com>
To: Andrea Arcangeli <aarcange@redhat.com>
Cc: kvm@vger.kernel.org, linux-kernel@vger.kernel.org,
	Vitaly Kuznetsov <vkuznets@redhat.com>,
	Sean Christopherson <sean.j.christopherson@intel.com>
Subject: Re: [PATCH 12/14] KVM: retpolines: x86: eliminate retpoline from vmx.c exit handlers
Date: Wed, 16 Oct 2019 00:22:31 +0200
Message-ID: <f375049a-6a45-c0df-a377-66418c8eb7e8@redhat.com>
In-Reply-To: <20191015203516.GF331@redhat.com>

On 15/10/19 22:35, Andrea Arcangeli wrote:
> On Tue, Oct 15, 2019 at 09:46:58PM +0200, Paolo Bonzini wrote:
>> On 15/10/19 18:49, Andrea Arcangeli wrote:
>>> On Tue, Oct 15, 2019 at 10:28:39AM +0200, Paolo Bonzini wrote:
>>>> If you're including EXIT_REASON_EPT_MISCONFIG (MMIO access) then you
>>>> should include EXIT_REASON_IO_INSTRUCTION too.  Depending on the devices
>>>> that are in the guest, the doorbell register might be MMIO or PIO.
>>>
>>> The fact that outb/inb devices exist isn't the question here. The
>>> question you should clarify is: which of the PIO devices is as
>>> performance critical as MMIO with virtio/vhost?
>>
>> virtio 0.9 uses PIO.
> 
> 0.9 is a 12-year-old protocol replaced several years ago.

Oh come on.  0.9 is not 12 years old.  virtio 1.0 is 3.5 years old
(March 2016).  Anything older than 2017 is going to use 0.9.

> Your idea that HLT is certainly a slow path is only correct if
> you assume the host is idle, but the host is never idle if you use
> virt for consolidation.
>
> I have several workloads, including eBPF tracing, not related to
> interrupts (that in turn cannot be mitigated by NAPI), that schedule
> frequently and hit 100k+ HLT vmexits per second while the host is
> anything but idle. There's no need for a hardware interrupt to wake up
> tasks and schedule in the guest; scheduler IPIs and timers are more
> than enough.
>
> All that matters is how many vmexits per second there are; everything
> else, including "why" they happen and what those vmexits mean for the
> guest, is irrelevant, or it would be relevant only if the host was
> guaranteed to be idle, but there's no such guarantee.

Your tables give:

        Samples   Samples%  Time%     Min Time  Max Time       Avg Time
HLT     101128    75.33%    99.66%    0.43us    901000.66us    310.88us
HLT     118474    19.11%    95.88%    0.33us    707693.05us    43.56us

If "avg time" means the average time to serve an HLT vmexit, I don't
understand how you can have an average time of 0.3ms (1/3000th of a
second) and 100000 samples per second.  Can you explain that to me?
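
To make the mismatch concrete, here is a rough sketch of the arithmetic in
Python.  It takes the two HLT rows from the table as-is and assumes (this
is not stated anywhere in the thread) that "Avg time" covers the whole
exit and that the exits are serialized on a single vCPU.  The function
names are made up for illustration.

```python
# Rows copied from the table: (samples, average time per HLT exit in us).
rows = [(101128, 310.88), (118474, 43.56)]


def total_seconds(samples, avg_us):
    """Total time accounted to the exits: samples * average, in seconds."""
    return samples * avg_us / 1e6


def max_exits_per_second(avg_us):
    """Upper bound on back-to-back exits per second.

    Assumption: one vCPU, exits never overlap.
    """
    return 1e6 / avg_us


for samples, avg_us in rows:
    print(f"{samples} exits * {avg_us}us = "
          f"{total_seconds(samples, avg_us):.2f}s total, "
          f"at most {max_exits_per_second(avg_us):.0f} exits/s per vCPU")
```

Under those assumptions the first row accounts for roughly 31 seconds of
HLT time and the second for roughly 5 seconds, so neither fits in one
second on one vCPU; the samples must span several vCPUs, a longer
interval, or both.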

Anyway, if the average time is indeed 310us and 43us, it is orders of
magnitude more than the time spent executing a retpoline.  That time
will be spent in an indirect branch miss (retpoline) instead of doing
while(!kvm_vcpu_check_block()), but it doesn't change anything.
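
A minimal sketch of that comparison, again in Python.  The retpoline cost
below is a hypothetical placeholder, not a measurement from this thread;
substitute whatever figure you measure on your hardware.  Only the two
average times come from the table.

```python
# Hypothetical placeholder for the cost of one retpoline, in microseconds.
# This number is an assumption for illustration, not a measured value.
ASSUMED_RETPOLINE_US = 0.03


def retpoline_share(avg_exit_us, retpoline_us=ASSUMED_RETPOLINE_US):
    """Fraction of the average HLT exit time taken by one retpoline."""
    return retpoline_us / avg_exit_us


for avg_us in (310.88, 43.56):  # "Avg time" column of the two HLT rows
    print(f"avg exit {avg_us}us: retpoline is "
          f"{retpoline_share(avg_us) * 100:.4f}% of it")
```

With any retpoline cost in the tens of nanoseconds the share stays well
below 1% for both rows, which is the sense in which it "doesn't change
anything".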

>>> I'm pretty sure HLT/EXTERNAL_INTERRUPT/PENDING_INTERRUPT should be
>>> included.
>>> I also wonder if VMCALL should be added; certain loads hit fairly
>>> frequent VMCALL, but none of the ones I benchmarked.
>>
>> I agree for external interrupt and pending interrupt, and VMCALL is fine
>> too.  In addition I'd add I/O instructions which are useful for some
>> guests and also for benchmarking (e.g. vmexit.flat has both IN and OUT
>> tests).
> 
> Isn't it faster to use cpuid for benchmarking? I mean we don't want to
> pay for more than one branch for benchmarking (even cpuid is
> questionable in the long term, but for now it's handy to have),

outl is more or less the same as cpuid and vmcall.  You can measure it
with vmexit.flat.  inl is slower.

> and unlike inb/outb, cpuid runs occasionally in all real life workloads
> (including in guest userland) so between inb/outb, I'd rather prefer
> to use cpuid as the benchmark vector because at least it has a chance
> to help real workloads a bit too.

Again: what is the real workload that does thousands of CPUIDs per second?

Paolo