Re: [PATCH v2 11/27] KVM: x86/mmu: Zap only the relevant pages when removing a memslot

From: Sean Christopherson <seanjc@google.com>
To: Alexander Graf <graf@amazon.com>
Cc: "Alex Williamson" <alex.williamson@redhat.com>,
	"Radim Krčmář" <rkrcmar@redhat.com>,
	kvm@vger.kernel.org, "Xiao Guangrong" <guangrong.xiao@gmail.com>,
	"Chandrasekaran, Siddharth" <sidcha@amazon.de>,
	"Paolo Bonzini" <pbonzini@redhat.com>
Subject: Re: [PATCH v2 11/27] KVM: x86/mmu: Zap only the relevant pages when removing a memslot
Date: Mon, 24 Oct 2022 15:55:55 +0000	[thread overview]
Message-ID: <Y1a1i9vbJ/pVmV9r@google.com> (raw)
In-Reply-To: <490509f6-ae1a-4fc8-42a1-b037d6bffada@amazon.com>

On Mon, Oct 24, 2022, Alexander Graf wrote:
> Hey Sean,
> 
> On 21.10.22 21:40, Sean Christopherson wrote:
> > 
> > On Thu, Oct 20, 2022, Alexander Graf wrote:
> > > On 20.10.22 22:37, Sean Christopherson wrote:
> > > > On Thu, Oct 20, 2022, Alexander Graf wrote:
> > > > > On 26.06.20 19:32, Sean Christopherson wrote:
> > > > > > /cast <thread necromancy>
> > > > > > 
> > > > > > On Tue, Aug 20, 2019 at 01:03:19PM -0700, Sean Christopherson wrote:
> > > > > [...]
> > > > > 
> > > > > > I don't think any of this explains the pass-through GPU issue.  But, we
> > > > > > have a few use cases where zapping the entire MMU is undesirable, so I'm
> > > > > > going to retry upstreaming this patch as with per-VM opt-in.  I wanted to
> > > > > > set the record straight for posterity before doing so.
> > > > > Hey Sean,
> > > > > 
> > > > > Did you ever get around to upstream or rework the zap optimization? The way
> > > > > I read current upstream, a memslot change still always wipes all SPTEs, not
> > > > > only the ones that were changed.
> > > > Nope, I've more or less given up hope on zapping only the deleted/moved memslot.
> > > > TDX (and SNP?) will preserve SPTEs for guest private memory, but they're very
> > > > much a special case.
> > > > 
> > > > Do you have use case and/or issue that doesn't play nice with the "zap all" behavior?
> > > 
> > > Yeah, we're looking at adding support for the Hyper-V VSM extensions which
> > > Windows uses to implement Credential Guard. With that, the guest gets access
> > > to hypercalls that allow it to set reduced permissions for arbitrary gfns.
> > > To ensure that user space has full visibility into those for live migration,
> > > memory slots to model access would be a great fit. But it means we'd do
> > > ~100k memslot modifications on boot.
> > Oof.  100k memslot updates is going to be painful irrespective of flushing.  And
> > memslots (in their current form) won't work if the guest can drop executable
> > permissions.
> > 
> > Assuming KVM needs to support a KVM_MEM_NO_EXEC flag, rather than trying to solve
> > the "KVM flushes everything on memslot deletion", I think we should instead
> > properly support toggling KVM_MEM_READONLY (and KVM_MEM_NO_EXEC) without forcing
> > userspace to delete the memslot.  Commit 75d61fbcf563 ("KVM: set_memory_region:
> 
> 
> That would be a cute acceleration for the case where we have to change
> permissions for a full slot. Unfortunately, the bulk of the changes are slot
> splits.

Ah, right, the guest will be operating on per-page granularity.

> We already built a prototype implementation of an atomic memslot update
> ioctl that allows us to keep other vCPUs running while we do the
> delete/create/create/create operation.

Please weigh in with your use case on a relevant upstream discussion regarding
"atomic" memslot updates[*].  I suspect we'll end up with a different solution
for this use case (see below), but we should at least capture all potential use
cases and ideas for modifying memslots without pausing vCPUs.

[*] https://lore.kernel.org/all/20220909104506.738478-1-eesposit@redhat.com

> But even with that, we see up to 30 min boot times for larger guests that
> most of the time are stuck in zapping pages.

Out of curiosity, did you measure runtime performance?  I would expect some amount
of runtime overhead as well dut to fragmenting memslots to that degree.

> I guess we have 2 options to make this viable:
> 
>   1) Optimize memslot splits + modifications to a point where they're fast
> enough
>   2) Add a different, faster mechanism on top of memslots for page granular
> permission bits

#2 crossed my mind as well.  This is actually nearly identical to the confidential
VM use case, where KVM needs to handle guest-initiated conversions of memory between
"private" and "shared" on a per-page granularity.  The proposed solution for that
is indeed a layer on top of memslots[*], which we arrived at in no small part because
splitting memslots was going to be a bottleneck.

Extending the proposed mem_attr_array to support additional state should be quite
easy.  The framework is all there, KVM just needs a few extra flags values, e.g.

	KVM_MEM_ATTR_SHARED	BIT(0)
	KVM_MEM_ATTR_READONLY	BIT(1)
	KVM_MEM_ATTR_NOEXEC	BIT(2)

and then new ioctls to expose the functionality to userspace.  Actually, if we
want to go this route, it might even make sense to define new a generic MEM_ATTR
ioctl() right away instead of repurposing KVM_MEMORY_ENCRYPT_(UN)REG_REGION for
the private vs. shared use case.

[*] https://lore.kernel.org/all/20220915142913.2213336-6-chao.p.peng@linux.intel.com