kvm.vger.kernel.org archive mirror
From: Michael Roth <michael.roth@amd.com>
To: Sean Christopherson <seanjc@google.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>, <kvm@vger.kernel.org>,
	<linux-kernel@vger.kernel.org>,
	Binbin Wu <binbin.wu@linux.intel.com>
Subject: Re: [PATCH 06/13] KVM: Disallow hugepages for incompatible gmem bindings, but let 'em succeed
Date: Fri, 22 Sep 2023 17:42:10 -0500	[thread overview]
Message-ID: <20230922224210.6klwbphnsk5j2wft@amd.com> (raw)
In-Reply-To: <20230921203331.3746712-7-seanjc@google.com>

On Thu, Sep 21, 2023 at 01:33:23PM -0700, Sean Christopherson wrote:
> Remove the restriction that a guest_memfd instance that supports hugepages
> can *only* be bound by memslots that are 100% compatible with hugepage
> mappings, and instead force KVM to use an order-0 mapping if the binding
> isn't compatible with hugepages.
> 
> The intent of the draconian binding restriction was purely to simplify the
> guest_memfd implementation, e.g. to avoid repeating the existing logic in
> KVM x86 for precisely tracking which GFNs support hugepages.  But
> checking that the binding's offset and size are compatible is just as easy
> to do when KVM wants to create a mapping.
> 
> And on the other hand, completely rejecting bindings that are incompatible
> with hugepages makes it practically impossible for userspace to use a
> single guest_memfd instance for all guest memory, e.g. on x86 it would be
> impossible to skip the legacy VGA hole while still allowing hugepage
> mappings for the rest of guest memory.
> 
> Suggested-by: Michael Roth <michael.roth@amd.com>
> Link: https://lore.kernel.org/all/20230918163647.m6bjgwusc7ww5tyu@amd.com
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>  virt/kvm/guest_mem.c | 54 ++++++++++++++++++++++----------------------
>  1 file changed, 27 insertions(+), 27 deletions(-)
> 
> diff --git a/virt/kvm/guest_mem.c b/virt/kvm/guest_mem.c
> index 68528e9cddd7..4f3a313f5532 100644
> --- a/virt/kvm/guest_mem.c
> +++ b/virt/kvm/guest_mem.c
> @@ -434,20 +434,6 @@ static int __kvm_gmem_create(struct kvm *kvm, loff_t size, u64 flags,
>  	return err;
>  }
>  
> -static bool kvm_gmem_is_valid_size(loff_t size, u64 flags)
> -{
> -	if (size < 0 || !PAGE_ALIGNED(size))
> -		return false;
> -
> -#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> -	if ((flags & KVM_GUEST_MEMFD_ALLOW_HUGEPAGE) &&
> -	    !IS_ALIGNED(size, HPAGE_PMD_SIZE))
> -		return false;
> -#endif
> -
> -	return true;
> -}
> -
>  int kvm_gmem_create(struct kvm *kvm, struct kvm_create_guest_memfd *args)
>  {
>  	loff_t size = args->size;
> @@ -460,9 +446,15 @@ int kvm_gmem_create(struct kvm *kvm, struct kvm_create_guest_memfd *args)
>  	if (flags & ~valid_flags)
>  		return -EINVAL;
>  
> -	if (!kvm_gmem_is_valid_size(size, flags))
> +	if (size < 0 || !PAGE_ALIGNED(size))
>  		return -EINVAL;
>  
> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +	if ((flags & KVM_GUEST_MEMFD_ALLOW_HUGEPAGE) &&
> +	    !IS_ALIGNED(size, HPAGE_PMD_SIZE))
> +		return -EINVAL;
> +#endif
> +
>  	return __kvm_gmem_create(kvm, size, flags, kvm_gmem_mnt);
>  }
>  
> @@ -470,7 +462,7 @@ int kvm_gmem_bind(struct kvm *kvm, struct kvm_memory_slot *slot,
>  		  unsigned int fd, loff_t offset)
>  {
>  	loff_t size = slot->npages << PAGE_SHIFT;
> -	unsigned long start, end, flags;
> +	unsigned long start, end;
>  	struct kvm_gmem *gmem;
>  	struct inode *inode;
>  	struct file *file;
> @@ -489,16 +481,9 @@ int kvm_gmem_bind(struct kvm *kvm, struct kvm_memory_slot *slot,
>  		goto err;
>  
>  	inode = file_inode(file);
> -	flags = (unsigned long)inode->i_private;
>  
> -	/*
> -	 * For simplicity, require the offset into the file and the size of the
> -	 * memslot to be aligned to the largest possible page size used to back
> -	 * the file (same as the size of the file itself).
> -	 */
> -	if (!kvm_gmem_is_valid_size(offset, flags) ||
> -	    !kvm_gmem_is_valid_size(size, flags))
> -		goto err;
> +	if (offset < 0 || !PAGE_ALIGNED(offset))
> +		return -EINVAL;
>  
>  	if (offset + size > i_size_read(inode))
>  		goto err;
> @@ -599,8 +584,23 @@ int kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
>  	page = folio_file_page(folio, index);
>  
>  	*pfn = page_to_pfn(page);
> -	if (max_order)
> -		*max_order = compound_order(compound_head(page));
> +	if (!max_order)
> +		goto success;
> +
> +	*max_order = compound_order(compound_head(page));
> +	if (!*max_order)
> +		goto success;
> +
> +	/*
> +	 * For simplicity, allow mapping a hugepage if and only if the entire
> +	 * binding is compatible, i.e. don't bother supporting mapping interior
> +	 * sub-ranges with hugepages (unless userspace comes up with a *really*
> +	 * strong use case for needing hugepages within unaligned bindings).
> +	 */
> +	if (!IS_ALIGNED(slot->gmem.pgoff, 1ull << *max_order) ||
> +	    !IS_ALIGNED(slot->npages, 1ull << *max_order))
> +		*max_order = 0;

Thanks for working this in. Unfortunately on x86 the bulk of guest memory
ends up getting slotted directly above the legacy regions at GFN 0x100, so
the associated slot still ends up failing these alignment checks if it
tries to match the gmemfd offsets up with the shared RAM/memfd offsets.

I tried to work around it in userspace by padding the gmemfd offset of
each slot to the next 2M boundary, but that also requires dynamically
growing the gmemfd inode to account for the padding of each new slot and
it gets ugly enough that I'm not sure it's any better than your
suggested alternative of using a unique gmemfd for each slot.

But what if we relax the check to simply make sure that any large folio
is fully contained by the range of the slot it is bound to? It *seems*
like that would still avoid stuff like mapping 2M pages in the NPT (or
setting up 2M RMP table entries) that aren't fully contained by a slot
while still allowing the bulk of guest memory to get mapped as 2M. Are
there other edge cases to consider?

The following seems to work for a basic 16GB SNP guest at least:

diff --git a/virt/kvm/guest_mem.c b/virt/kvm/guest_mem.c
index 9109bf5751ee..e73128d4ebc2 100644
--- a/virt/kvm/guest_mem.c
+++ b/virt/kvm/guest_mem.c
@@ -618,6 +618,7 @@ int __kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
                       gfn_t gfn, kvm_pfn_t *pfn, int *max_order, bool prep)
 {
        pgoff_t index = gfn - slot->base_gfn + slot->gmem.pgoff;
+       pgoff_t huge_index;
        struct kvm_gmem *gmem;
        struct folio *folio;
        struct page *page;
@@ -662,9 +663,12 @@ int __kvm_gmem_get_pfn(struct kvm *kvm, struct kvm_memory_slot *slot,
         * sub-ranges with hugepages (unless userspace comes up with a *really*
         * strong use case for needing hugepages within unaligned bindings).
         */
-       if (!IS_ALIGNED(slot->gmem.pgoff, 1ull << *max_order) ||
-           !IS_ALIGNED(slot->npages, 1ull << *max_order))
+       huge_index = round_down(index, 1ull << *max_order);
+       if (huge_index < ALIGN(slot->gmem.pgoff, 1ull << *max_order) ||
+           huge_index + (1ull << *max_order) > slot->gmem.pgoff + slot->npages) {
+               pr_debug("%s: GFN %llx failed alignment checks\n", __func__, gfn);
                *max_order = 0;
+       }
 success:
        r = 0;

-Mike

> +success:
>  	r = 0;
>  
>  out_unlock:
> -- 
> 2.42.0.515.g380fc7ccd1-goog
> 

Thread overview: 23+ messages
2023-09-21 20:33 [PATCH 00/13] KVM: guest_memfd fixes Sean Christopherson
2023-09-21 20:33 ` [PATCH 01/13] KVM: Assert that mmu_invalidate_in_progress *never* goes negative Sean Christopherson
2023-09-21 20:33 ` [PATCH 02/13] KVM: Actually truncate the inode when doing PUNCH_HOLE for guest_memfd Sean Christopherson
2023-09-21 20:33 ` [PATCH 03/13] KVM: WARN if *any* MMU invalidation sequence doesn't add a range Sean Christopherson
2023-09-21 20:33 ` [PATCH 04/13] KVM: WARN if there are dangling MMU invalidations at VM destruction Sean Christopherson
2023-09-27  3:16   ` Binbin Wu
2023-09-28 18:11     ` Sean Christopherson
2023-09-21 20:33 ` [PATCH 05/13] KVM: Fix MMU invalidation bookkeeping in guest_memfd Sean Christopherson
2023-09-21 20:33 ` [PATCH 06/13] KVM: Disallow hugepages for incompatible gmem bindings, but let 'em succeed Sean Christopherson
2023-09-22 22:42   ` Michael Roth [this message]
2023-09-28 18:31     ` Sean Christopherson
2023-10-02 15:53       ` Michael Roth
2023-10-02 16:49         ` Sean Christopherson
2023-09-21 20:33 ` [PATCH 07/13] KVM: x86/mmu: Track PRIVATE impact on hugepage mappings for all memslots Sean Christopherson
2023-09-27  6:01   ` Binbin Wu
2023-09-27 14:25     ` Sean Christopherson
2023-09-21 20:33 ` [PATCH 08/13] KVM: x86/mmu: Zap shared-only memslots when private attribute changes Sean Christopherson
2023-09-21 20:33 ` [PATCH 09/13] KVM: Always add relevant ranges to invalidation set when changing attributes Sean Christopherson
2023-09-21 20:33 ` [PATCH 10/13] KVM: x86/mmu: Drop repeated add() of to-be-invalidated range Sean Christopherson
2023-09-21 20:33 ` [PATCH 11/13] KVM: selftests: Refactor private mem conversions to prep for punch_hole test Sean Christopherson
2023-09-21 20:33 ` [PATCH 12/13] KVM: selftests: Add a "pure" PUNCH_HOLE on guest_memfd testcase Sean Christopherson
2023-09-21 20:33 ` [PATCH 13/13] KVM: Rename guest_mem.c to guest_memfd.c Sean Christopherson
2023-09-29  2:22 ` [PATCH 00/13] KVM: guest_memfd fixes Sean Christopherson
