From: Shivam Kumar <shivam.kumar1@nutanix.com>
To: Sean Christopherson <seanjc@google.com>, Marc Zyngier <maz@kernel.org>
Cc: pbonzini@redhat.com, james.morse@arm.com,
	borntraeger@linux.ibm.com, david@redhat.com, kvm@vger.kernel.org,
	Shaju Abraham <shaju.abraham@nutanix.com>,
	Manish Mishra <manish.mishra@nutanix.com>,
	Anurag Madnawat <anurag.madnawat@nutanix.com>
Subject: Re: [PATCH v4 1/4] KVM: Implement dirty quota-based throttling of vcpus
Date: Tue, 5 Jul 2022 12:51:01 +0530	[thread overview]
Message-ID: <2e5198b3-54ea-010e-c418-f98054befe1b@nutanix.com> (raw)
In-Reply-To: <b75013cb-0d40-569a-8a31-8ebb7cf6c541@nutanix.com>



On 30/05/22 4:35 pm, Shivam Kumar wrote:
> 
>>> It just shows that the proposed API is pretty bad, because instead of
>>> working as a credit, it works as a ceiling, based on a value that is
>>> dependent on the vcpu's previous state (forcing userspace to recompute
>>> the next quota on each exit),
>> My understanding is that the intended use case is dependent on previous
>> vCPU state by design.  E.g. a vCPU that is aggressively dirtying memory
>> will be given a different quota than a vCPU that has historically not
>> dirtied much memory.
>>
>> Even if the quotas are fixed, "recomputing" the new quota is simply:
>>
>>     run->dirty_quota = run->dirty_quota_exit.count + PER_VCPU_DIRTY_QUOTA
>>
>> The main motivation for the ceiling approach is that it's simpler for
>> KVM to implement since KVM never writes vcpu->run->dirty_quota.  E.g.
>> userspace may observe a "spurious" exit if KVM reads the quota right
>> before it's changed by userspace, but KVM doesn't have to guard against
>> clobbering the quota.
>>
>> A credit based implementation would require (a) snapshotting the credit
>> at some point during KVM_RUN, (b) disallowing changing the quota credit
>> while the vCPU is running, or (c) an expensive atomic sequence
>> (expensive on x86 at least) to avoid clobbering an update from
>> userspace.  (a) is undesirable because it delays recognition of the new
>> quota, especially if KVM doesn't re-read the quota in the tight run
>> loop.  And AIUI, changing the quota while the vCPU is running is a
>> desired use case, so that rules out (b).  The cost of (c) is not the
>> end of the world, but IMO the benefits are marginal.
>>
>> E.g. if we go with a request-based implementation where
>> kvm_vcpu_check_dirty_quota() is called in mark_page_dirty_in_slot(),
>> then the ceiling-based approach is:
>>
>>    static void kvm_vcpu_is_dirty_quota_exhausted(struct kvm_vcpu *vcpu)
>>    {
>>        u64 dirty_quota = READ_ONCE(vcpu->run->dirty_quota);
>>
>>        if (!dirty_quota || (vcpu->stat.generic.pages_dirtied < dirty_quota))
>>            return;
>>
>>        /*
>>         * Snapshot the quota to report it to userspace.  The dirty
>>         * count will be captured when the request is processed.
>>         */
>>        vcpu->dirty_quota = dirty_quota;
>>        kvm_make_request(KVM_REQ_DIRTY_QUOTA_EXIT, vcpu);
>>    }
>>
>> whereas the credit-based approach would need to be something like:
>>
>>    static void kvm_vcpu_is_dirty_quota_exhausted(struct kvm_vcpu *vcpu)
>>    {
>>        u64 old_dirty_quota;
>>
>>        if (!READ_ONCE(vcpu->run->dirty_quota_enabled))
>>            return;
>>
>>        old_dirty_quota = atomic64_fetch_add_unless(&vcpu->run->dirty_quota,
>>                                                    -1, 0);
>>
>>        /* Quota is not yet exhausted, or was already exhausted. */
>>        if (old_dirty_quota != 1)
>>            return;
>>
>>        /*
>>         * The dirty count will be re-captured in dirty_count when the
>>         * request is processed so that userspace can compute the number
>>         * of pages that were dirtied after the quota was exhausted.
>>         */
>>        vcpu->dirty_count_at_exhaustion = vcpu->stat.generic.pages_dirtied;
>>        kvm_make_request(KVM_REQ_DIRTY_QUOTA_EXIT, vcpu);
>>    }
>>
>>> and with undocumented, arbitrary limits as a bonus.
>> Eh, documenting that userspace needs to be aware of a
>> theoretically-possible wrap is easy enough.  If we're truly worried
>> about a wrap scenario, it'd be trivial to add a flag/ioctl() to let
>> userspace reset vcpu->stat.generic.pages_dirtied.
> Thank you so much Marc and Sean for the feedback. We can implement an 
> ioctl to reset 'pages_dirtied' and mention in the documentation that 
> userspace should reset it after each migration (regardless of whether 
> the migration fails/succeeds). I hope this will address the concern raised.
> 
> We moved away from the credit-based approach due to multiple atomic
> reads. For now, we are sticking to not changing the quota while the
> vcpu is running. But we need to track the pages dirtied at the time of
> the last quota update for the credit-based approach. We also need to
> share this with userspace, as pages_dirtied can overflow in certain
> circumstances and we might have to adjust the throttling accordingly.
> With this, the check would look like:
> 
> u64 dirty_quota = READ_ONCE(vcpu->run->dirty_quota);
> u64 last_pages_dirtied = READ_ONCE(vcpu->run->last_pages_dirtied);
> u64 pages_dirtied = vcpu->stat.generic.pages_dirtied;
> 
> if (pages_dirtied - last_pages_dirtied < dirty_quota)
>     return;
> 
> /* code to set the exit reason here */
> 
> Please post your suggestions. Thanks.

Hi, here's a summary of what needs to be changed and what should be kept 
as it is (purely my opinion based on the discussions we have had so far):

i) Moving the dirty quota check to mark_page_dirty_in_slot() and using
KVM requests in the dirty quota check. I hope that the ceiling-based
approach, with proper documentation and an ioctl exposed for resetting
'dirty_quota' and 'pages_dirtied', is good enough; a rough sketch
follows after this list. Please post your suggestions if you think
otherwise.
ii) The change in VMX's handle_invalid_guest_state() remains as it is.
iii) For now, we are lazily updating the dirty quota, i.e. we update it
only when the vcpu exits to userspace with the exit reason
KVM_EXIT_DIRTY_QUOTA_EXHAUSTED. So, we don't need an additional variable
to capture the dirty quota at the time of the check. This may change in
the future though.
iv) Adding a KVM capability for dirty quota so that userspace can detect
whether this feature is available; a userspace sketch also follows after
this list. Please let me know if we can do it differently than having a
KVM cap.
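
To make point (i) concrete, here is a minimal sketch of how the ceiling
check could hang off mark_page_dirty_in_slot() via a KVM request, along
the lines of Sean's snippet above. This is only a sketch of the intended
shape, not the final patch: the exact hook point and the
dirty_quota_exit.quota field are assumptions on my part and may look
different in v5.

    /* Called from mark_page_dirty_in_slot() after pages_dirtied is bumped. */
    static void kvm_vcpu_check_dirty_quota(struct kvm_vcpu *vcpu)
    {
        u64 dirty_quota = READ_ONCE(vcpu->run->dirty_quota);

        /* A dirty quota of zero means throttling is disabled. */
        if (!dirty_quota || vcpu->stat.generic.pages_dirtied < dirty_quota)
            return;

        /*
         * Defer the exit to the request path so that the exit reason
         * and counts are filled in from a safe context.
         */
        kvm_make_request(KVM_REQ_DIRTY_QUOTA_EXIT, vcpu);
    }

    /* In the arch run loop, while handling requests before guest entry: */
    if (kvm_check_request(KVM_REQ_DIRTY_QUOTA_EXIT, vcpu)) {
        struct kvm_run *run = vcpu->run;

        run->exit_reason = KVM_EXIT_DIRTY_QUOTA_EXHAUSTED;
        run->dirty_quota_exit.count = vcpu->stat.generic.pages_dirtied;
        run->dirty_quota_exit.quota = run->dirty_quota;
        return 0;    /* exit to userspace */
    }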

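On the userspace side, points (i) and (iv) would then be used roughly as
below. KVM_CHECK_EXTENSION is the standard capability probe;
KVM_CAP_DIRTY_QUOTA and KVM_RESET_DIRTY_QUOTA are placeholder names for
illustration (not final), and PER_VCPU_DIRTY_QUOTA is whatever per-vcpu
quota the migration policy picks:

    /* Per vcpu; 'run' is the mmap'ed struct kvm_run for vcpu_fd. */
    if (!ioctl(kvm_fd, KVM_CHECK_EXTENSION, KVM_CAP_DIRTY_QUOTA))
        return;    /* dirty quota throttling not supported */

    run->dirty_quota = PER_VCPU_DIRTY_QUOTA;    /* arm the first ceiling */

    while (ioctl(vcpu_fd, KVM_RUN, 0) == 0) {
        if (run->exit_reason == KVM_EXIT_DIRTY_QUOTA_EXHAUSTED) {
            /* Throttle, then grant the next quota as a new ceiling. */
            run->dirty_quota = run->dirty_quota_exit.count +
                               PER_VCPU_DIRTY_QUOTA;
            continue;
        }
        /* ... handle other exit reasons ... */
    }

    /* After migration completes or fails, reset the dirty stats. */
    ioctl(vcpu_fd, KVM_RESET_DIRTY_QUOTA, 0);
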
I'm happy to send v5 if the above points look good to you. If you have 
further suggestions, please let me know. I'm looking forward to them. 
Thank you so much for the review so far.

