Re: [PATCH v4 1/4] KVM: Implement dirty quota-based throttling of vcpus

From: Sean Christopherson <seanjc@google.com>
To: Peter Xu <peterx@redhat.com>
Cc: Shivam Kumar <shivam.kumar1@nutanix.com>,
	Marc Zyngier <maz@kernel.org>,
	pbonzini@redhat.com, james.morse@arm.com,
	borntraeger@linux.ibm.com, david@redhat.com, kvm@vger.kernel.org,
	Shaju Abraham <shaju.abraham@nutanix.com>,
	Manish Mishra <manish.mishra@nutanix.com>,
	Anurag Madnawat <anurag.madnawat@nutanix.com>
Subject: Re: [PATCH v4 1/4] KVM: Implement dirty quota-based throttling of vcpus
Date: Fri, 15 Jul 2022 16:23:54 +0000
Message-ID: <YtGUmsavkoTBjQTU@google.com>
In-Reply-To: <YtCWW2OfbI4+r1L3@xz-m1.local>

On Thu, Jul 14, 2022, Peter Xu wrote:
> On Thu, Jul 14, 2022 at 08:48:04PM +0000, Sean Christopherson wrote:
> > On Thu, Jul 14, 2022, Peter Xu wrote:
> > > Hi, Shivam,
> > > 
> > > On Tue, Jul 05, 2022 at 12:51:01PM +0530, Shivam Kumar wrote:
> > > > Hi, here's a summary of what needs to be changed and what should be kept as
> > > > it is (purely my opinion based on the discussions we have had so far):
> > > > 
> > > > i) Moving the dirty quota check to mark_page_dirty_in_slot. Use kvm requests
> > > > in dirty quota check. I hope that the ceiling-based approach, with proper
> > > > documentation and an ioctl exposed for resetting 'dirty_quota' and
> > > > 'pages_dirtied', is good enough. Please post your suggestions if you think
> > > > otherwise.
> > > 
> > > An ioctl just for this could be an overkill to me.
> > >
> > > Currently you expose only "quota" to kvm_run, then when vmexit you have
> > > exit fields contain both "quota" and "count".  I always think it's a bit
> > > redundant.
> > > 
> > > What I'm thinking is:
> > > 
> > >   (1) Expose both "quota" and "count" in kvm_run, then:
> > > 
> > >       "quota" should only be written by userspace and read by kernel.
> > >       "count" should only be written by kernel and read by the userspace. [*]
> > > 
> > >       [*] One special case is when the userspace found that there's risk of
> > >       quota & count overflow, then the userspace:
> > > 
> > >         - Kick the vcpu out (so the kernel won't write to "count" anymore)
> > >         - Update both "quota" and "count" to safe values
> > >         - Resume the KVM_RUN
> > > 
> > >   (2) When quota reached, we don't need to copy quota/count in vmexit
> > >       fields, since the userspace can read the realtime values in kvm_run.
> > > 
> > > Would this work?
> > 
> > Technically, yes, practically speaking, no.  If KVM doesn't provide the quota
> > that _KVM_ saw at the time of exit, then there's no sane way to audit KVM exits
> > due to KVM_EXIT_DIRTY_QUOTA_EXHAUSTED.  Providing the quota ensures userspace sees
> > sane, coherent data if there's a race between KVM checking the quota and userspace
> > updating the quota.  If KVM doesn't provide the quota, then userspace can see an
> > exit with "count < quota".
> 
> This is a rare false positive, which should be acceptable in this case (the
> same as vmexit with count==quota but we just planned to boost the quota),
> IMHO it's better than always kicking the vcpu, since the overhead for such
> a false positive is only a vmexit and nothing else.

Oh, we're in complete agreement on that front.  I'm only objecting to forcing
userspace to read the realtime quota+count.  I want KVM to provide a snapshot of
the quota+count so that if there's a KVM bug, e.g. KVM spuriously exits, then
there is zero ambiguity as the quota+count in the kvm_run exit field will hold
invalid/garbage data.  Without a snapshot, if there were a bug where KVM spuriously
exited, root causing or even detecting the bug would be difficult if userspace is
dynamically updating the quota as changing the quota would have destroyed the
evidence of KVM's bug.
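
To make the snapshot argument concrete, here is a minimal userspace-only sketch.
It is not the code from the patch: struct fake_run, check_dirty_quota() and
exit_is_sane() are hypothetical stand-ins for kvm_run, KVM's quota check and a
userspace audit.  The point is only that the exit fields record what the
checker saw, so a later quota update cannot hide a bogus exit.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the relevant bits of kvm_run. */
struct fake_run {
	uint64_t dirty_quota;		/* written by userspace, read by "KVM" */
	struct {
		uint64_t count;		/* pages dirtied, as seen at exit time */
		uint64_t quota;		/* quota, as seen at exit time */
	} dirty_quota_exit;
};

/*
 * "KVM" side: read the quota once, and if it is exhausted, record the
 * values the check was based on.  Returns true if an exit is needed.
 */
static bool check_dirty_quota(struct fake_run *run, uint64_t pages_dirtied)
{
	uint64_t quota = run->dirty_quota;

	if (pages_dirtied < quota)
		return false;

	run->dirty_quota_exit.count = pages_dirtied;
	run->dirty_quota_exit.quota = quota;
	return true;
}

/*
 * Userspace side: audit the exit using only the snapshot.  A snapshot
 * with count < quota would point at a spurious exit.
 */
static bool exit_is_sane(const struct fake_run *run)
{
	return run->dirty_quota_exit.count >= run->dirty_quota_exit.quota;
}
```

If userspace bumps run->dirty_quota after the exit, exit_is_sane() still
evaluates the values that triggered the exit, whereas comparing against the
live quota would report "count < quota".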

It's unlikely we'll ever have such a bug, but filling the exit fields is cheap, and
because it's a union, the "redundant" fields don't consume extra space in kvm_run.
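
As a rough illustration of the "no extra space" point, assuming a simplified
layout (these structs are made up and only mimic the shape of the exit union
in kvm_run, including its padding member):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout without the dirty quota exit fields. */
struct fake_run_without {
	uint32_t exit_reason;
	union {
		struct { uint64_t hardware_exit_reason; } hw;
		char padding[256];	/* fixes the size of the union */
	};
};

/* Same layout with a new union member added for the dirty quota exit. */
struct fake_run_with {
	uint32_t exit_reason;
	union {
		struct { uint64_t hardware_exit_reason; } hw;
		struct { uint64_t count; uint64_t quota; } dirty_quota_exit;
		char padding[256];
	};
};

/* The new member overlays the padding, so the struct does not grow. */
static_assert(sizeof(struct fake_run_with) == sizeof(struct fake_run_without),
	      "adding a union member must not change the struct size");
```

The size stays the same as long as the new member fits within the padding.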

And the reasoning behind not having kvm_run.dirty_count is that it's fully
redundant if KVM provides a stat, and IMO such a stat will be quite helpful for
things beyond dirty quotas, e.g. being able to see which vCPUs are dirtying memory
from the command line for debug purposes.

> > Even if userspace is ok with such races, it will be extremely difficult to detect
> > KVM issues if we mess something up because such behavior would have to be allowed
> > by KVM's ABI.
> 
> Could you elaborate?  We have quite a few places sharing these between
> user/kernel on kvm_run, no?

I think I answered this above, let me know if I didn't.