Re: [PATCH v2 11/27] KVM: x86/mmu: Zap only the relevant pages when removing a memslot

From: Alexander Graf <graf@amazon.com>
To: Sean Christopherson <seanjc@google.com>
Cc: "Alex Williamson" <alex.williamson@redhat.com>,
	"Radim Krčmář" <rkrcmar@redhat.com>,
	kvm@vger.kernel.org, "Xiao Guangrong" <guangrong.xiao@gmail.com>,
	"Chandrasekaran, Siddharth" <sidcha@amazon.de>,
	"Paolo Bonzini" <pbonzini@redhat.com>
Subject: Re: [PATCH v2 11/27] KVM: x86/mmu: Zap only the relevant pages when removing a memslot
Date: Mon, 24 Oct 2022 08:12:22 +0200
Message-ID: <490509f6-ae1a-4fc8-42a1-b037d6bffada@amazon.com>
In-Reply-To: <Y1L1t6Qw2CaLwJk3@google.com>

Hey Sean,

On 21.10.22 21:40, Sean Christopherson wrote:
>
> On Thu, Oct 20, 2022, Alexander Graf wrote:
>> On 20.10.22 22:37, Sean Christopherson wrote:
>>> On Thu, Oct 20, 2022, Alexander Graf wrote:
>>>> On 26.06.20 19:32, Sean Christopherson wrote:
>>>>> /cast <thread necromancy>
>>>>>
>>>>> On Tue, Aug 20, 2019 at 01:03:19PM -0700, Sean Christopherson wrote:
>>>> [...]
>>>>
>>>>> I don't think any of this explains the pass-through GPU issue.  But, we
>>>>> have a few use cases where zapping the entire MMU is undesirable, so I'm
>>>>> going to retry upstreaming this patch as with per-VM opt-in.  I wanted to
>>>>> set the record straight for posterity before doing so.
>>>> Hey Sean,
>>>>
>>>> Did you ever get around to upstream or rework the zap optimization? The way
>>>> I read current upstream, a memslot change still always wipes all SPTEs, not
>>>> only the ones that were changed.
>>> Nope, I've more or less given up hope on zapping only the deleted/moved memslot.
>>> TDX (and SNP?) will preserve SPTEs for guest private memory, but they're very
>>> much a special case.
>>>
>>> Do you have use case and/or issue that doesn't play nice with the "zap all" behavior?
>>
>> Yeah, we're looking at adding support for the Hyper-V VSM extensions which
>> Windows uses to implement Credential Guard. With that, the guest gets access
>> to hypercalls that allow it to set reduced permissions for arbitrary gfns.
>> To ensure that user space has full visibility into those for live migration,
>> memory slots to model access would be a great fit. But it means we'd do
>> ~100k memslot modifications on boot.
> Oof.  100k memslot updates is going to be painful irrespective of flushing.  And
> memslots (in their current form) won't work if the guest can drop executable
> permissions.
>
> Assuming KVM needs to support a KVM_MEM_NO_EXEC flag, rather than trying to solve
> the "KVM flushes everything on memslot deletion", I think we should instead
> properly support toggling KVM_MEM_READONLY (and KVM_MEM_NO_EXEC) without forcing
> userspace to delete the memslot.  Commit 75d61fbcf563 ("KVM: set_memory_region:

That would be a cute acceleration for the case where we have to change 
permissions for a full slot. Unfortunately, the bulk of the changes are 
slot splits. Let me explain with numbers from the boot of a 1-vCPU, 8 GB
Windows Server 2019 guest:

GFN permission modification requests: 46294
Unique GFNs: 21200

That means on boot, we start off with a few huge memslots for guest RAM. 
Then down the road, we need to change permissions for individual pages 
inside these larger regions. The obvious option for that is a memslot 
split - delete, create, create, create. Now we have 2 large memslots and 
1 that only spans a single page.
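
To make the split concrete, here is a minimal user-space sketch of the
bookkeeping. It is not KVM code: the `Slot` type, the string permission
flags and `split_slot()` are hypothetical names made up for illustration,
and it only models which slots exist before and after, not the ioctls.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Slot:
    """Hypothetical stand-in for a memslot: a gfn range plus permissions."""
    base_gfn: int
    npages: int
    flags: str  # illustrative only, e.g. "rwx" or "r-x"


def split_slot(slot: Slot, gfn: int, new_flags: str) -> List[Slot]:
    """Return the slots that replace `slot` when one page changes permissions.

    Models "delete, create, create, create": the old slot goes away and up to
    three new ones take its place. Empty pieces (gfn at either edge of the
    slot) are dropped, so the result has two or three entries, or one if the
    slot was a single page to begin with.
    """
    end = slot.base_gfn + slot.npages
    if not (slot.base_gfn <= gfn < end):
        raise ValueError("gfn is outside the slot")
    pieces = [
        Slot(slot.base_gfn, gfn - slot.base_gfn, slot.flags),  # pages before
        Slot(gfn, 1, new_flags),                               # the one page
        Slot(gfn + 1, end - (gfn + 1), slot.flags),            # pages after
    ]
    return [p for p in pieces if p.npages > 0]


# Assumption: one slot covering 8 GB of guest RAM with 4 KiB pages.
ram = Slot(base_gfn=0, npages=(8 << 30) // 4096, flags="rwx")
for piece in split_slot(ram, gfn=1000, new_flags="r-x"):
    print(piece)
```

The point of the sketch is that a single-page permission change touches
the whole surrounding slot, which is why each split is so expensive as
long as a memslot deletion zaps everything.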

Later in the boot process, Windows sometimes also toggles permissions
for pages that it already split off earlier. That's the case the modify
optimization you described in the previous email would cover. But that's
only about half the requests. The other half are memslot split requests.
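
As a rough cross-check of the "about half" claim, the arithmetic can be
sketched as follows. It assumes that the first request for a GFN always
needs a split and every later request for the same GFN is an in-place
toggle, and that each split costs one delete plus three creates; both are
simplifications, not measured values.

```python
requests = 46294     # GFN permission modification requests (numbers above)
unique_gfns = 21200  # unique GFNs touched

# Assumption: first touch of a GFN = split, any repeat = in-place toggle.
splits = unique_gfns
toggles = requests - unique_gfns          # 25094

split_share = splits / requests           # roughly 0.46
toggle_share = toggles / requests         # roughly 0.54

# Assumption: every split is 1 delete + 3 creates, every toggle is 1 update.
split_ops = splits * 4                    # 84800
total_ops = split_ops + toggles           # 109894

print(f"splits:  {splits} ({split_share:.1%} of requests)")
print(f"toggles: {toggles} ({toggle_share:.1%} of requests)")
print(f"memslot operations under these assumptions: {total_ops}")
```

Under these assumptions the total lands in the same ballpark as the ~100k
memslot modifications mentioned earlier in the thread, though that figure
may well have been derived differently.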

We already built a prototype implementation of an atomic memslot update
ioctl that allows us to keep other vCPUs running while we do the
delete/create/create/create operation. But even with that, we see boot
times of up to 30 minutes for larger guests, and most of that time is
spent zapping pages.

I guess we have 2 options to make this viable:

   1) Optimize memslot splits + modifications to a point where they're 
fast enough
   2) Add a different, faster mechanism on top of memslots for page 
granular permission bits

Also sorry for not posting the underlying credguard and atomic memslot 
patches yet. I wanted to kick off this conversation before sending them 
out - they're still too raw for upstream review atm :).

Thanks,

Alex
