From: Peter Xu <peterx@redhat.com>
To: David Matlack <dmatlack@google.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>,
	Marc Zyngier <maz@kernel.org>,
	Huacai Chen <chenhuacai@kernel.org>,
	Aleksandar Markovic <aleksandar.qemu.devel@gmail.com>,
	Anup Patel <anup@brainfault.org>,
	Paul Walmsley <paul.walmsley@sifive.com>,
	Palmer Dabbelt <palmer@dabbelt.com>,
	Albert Ou <aou@eecs.berkeley.edu>,
	Sean Christopherson <seanjc@google.com>,
	Andrew Jones <drjones@redhat.com>,
	Ben Gardon <bgardon@google.com>,
	maciej.szmigiero@oracle.com,
	"moderated list:KERNEL VIRTUAL MACHINE FOR ARM64 (KVM/arm64)" 
	<kvmarm@lists.cs.columbia.edu>,
	"open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips)" 
	<linux-mips@vger.kernel.org>,
	"open list:KERNEL VIRTUAL MACHINE FOR MIPS (KVM/mips)" 
	<kvm@vger.kernel.org>,
	"open list:KERNEL VIRTUAL MACHINE FOR RISC-V (KVM/riscv)" 
	<kvm-riscv@lists.infradead.org>,
	Peter Feiner <pfeiner@google.com>
Subject: Re: [PATCH v2 20/26] KVM: x86/mmu: Extend Eager Page Splitting to the shadow MMU
Date: Wed, 16 Mar 2022 18:26:46 +0800
Message-ID: <YjG7Zh4zwTDsO3L1@xz-m1.local>
In-Reply-To: <20220311002528.2230172-21-dmatlack@google.com>

On Fri, Mar 11, 2022 at 12:25:22AM +0000, David Matlack wrote:
> Extend KVM's eager page splitting to also split huge pages that are
> mapped by the shadow MMU. Specifically, walk through the rmap splitting
> all 1GiB pages to 2MiB pages, and splitting all 2MiB pages to 4KiB
> pages.
> 
> Splitting huge pages mapped by the shadow MMU requires dealing with some
> extra complexity beyond that of the TDP MMU:
> 
> (1) The shadow MMU has a limit on the number of shadow pages that are
>     allowed to be allocated. So, as a policy, Eager Page Splitting
>     refuses to split if there are KVM_MIN_FREE_MMU_PAGES or fewer
>     pages available.
> 
> (2) Huge pages may be mapped by indirect shadow pages which have the
>     possibility of being unsync. As a policy we opt not to split such
>     pages as their translation may no longer be valid.
> 
> (3) Splitting a huge page may end up re-using an existing lower level
>     shadow page table. This is unlike the TDP MMU, which always allocates
>     new shadow page tables when splitting.  This commit does *not*
>     handle such aliasing and opts not to split such huge pages.
> 
> (4) When installing the lower level SPTEs, they must be added to the
>     rmap which may require allocating additional pte_list_desc structs.
>     This commit does *not* handle such cases and instead opts to leave
>     such lower-level SPTEs non-present. In this situation TLBs must be
>     flushed before dropping the MMU lock as a portion of the huge page
>     region is being unmapped.
> 
> Suggested-by: Peter Feiner <pfeiner@google.com>
> [ This commit is based off of the original implementation of Eager Page
>   Splitting from Peter in Google's kernel from 2016. ]
> Signed-off-by: David Matlack <dmatlack@google.com>
> ---
>  .../admin-guide/kernel-parameters.txt         |   3 -
>  arch/x86/kvm/mmu/mmu.c                        | 307 ++++++++++++++++++
>  2 files changed, 307 insertions(+), 3 deletions(-)
> 
> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> index 05161afd7642..495f6ac53801 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -2360,9 +2360,6 @@
>  			the KVM_CLEAR_DIRTY ioctl, and only for the pages being
>  			cleared.
>  
> -			Eager page splitting currently only supports splitting
> -			huge pages mapped by the TDP MMU.
> -
>  			Default is Y (on).
>  
>  	kvm.enable_vmware_backdoor=[KVM] Support VMware backdoor PV interface.
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 926ddfaa9e1a..dd56b5b9624f 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -727,6 +727,11 @@ static void mmu_free_memory_caches(struct kvm_vcpu *vcpu)
>  
>  static struct pte_list_desc *mmu_alloc_pte_list_desc(struct kvm_mmu_memory_cache *cache)
>  {
> +	static const gfp_t gfp_nocache = GFP_ATOMIC | __GFP_ACCOUNT | __GFP_ZERO;
> +
> +	if (WARN_ON_ONCE(!cache))
> +		return kmem_cache_alloc(pte_list_desc_cache, gfp_nocache);
> +

I also think this doesn't belong in this patch.  Maybe it'll be more
suitable for the earlier rmap_add() rework patch, or maybe it can be
dropped directly if it should never trigger at all; then we just die hard
below when dereferencing the NULL cache.
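
If it should indeed never trigger, the helper could simply be (untested):

	static struct pte_list_desc *
	mmu_alloc_pte_list_desc(struct kvm_mmu_memory_cache *cache)
	{
		/* A NULL cache is a caller bug; the oops will be loud enough. */
		return kvm_mmu_memory_cache_alloc(cache);
	}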

>  	return kvm_mmu_memory_cache_alloc(cache);
>  }
>  
> @@ -743,6 +748,28 @@ static gfn_t kvm_mmu_page_get_gfn(struct kvm_mmu_page *sp, int index)
>  	return sp->gfn + (index << ((sp->role.level - 1) * PT64_LEVEL_BITS));
>  }
>  
> +static gfn_t sptep_to_gfn(u64 *sptep)
> +{
> +	struct kvm_mmu_page *sp = sptep_to_sp(sptep);
> +
> +	return kvm_mmu_page_get_gfn(sp, sptep - sp->spt);
> +}
> +
> +static unsigned int kvm_mmu_page_get_access(struct kvm_mmu_page *sp, int index)
> +{
> +	if (!sp->role.direct)
> +		return sp->shadowed_translation[index].access;
> +
> +	return sp->role.access;
> +}
> +
> +static unsigned int sptep_to_access(u64 *sptep)
> +{
> +	struct kvm_mmu_page *sp = sptep_to_sp(sptep);
> +
> +	return kvm_mmu_page_get_access(sp, sptep - sp->spt);
> +}
> +
>  static void kvm_mmu_page_set_gfn_access(struct kvm_mmu_page *sp, int index,
>  					gfn_t gfn, u32 access)
>  {
> @@ -912,6 +939,9 @@ static int pte_list_add(struct kvm_mmu_memory_cache *cache, u64 *spte,
>  	return count;
>  }
>  
> +static struct kvm_rmap_head *gfn_to_rmap(gfn_t gfn, int level,
> +					 const struct kvm_memory_slot *slot);
> +
>  static void
>  pte_list_desc_remove_entry(struct kvm_rmap_head *rmap_head,
>  			   struct pte_list_desc *desc, int i,
> @@ -2125,6 +2155,23 @@ static struct kvm_mmu_page *__kvm_mmu_find_shadow_page(struct kvm *kvm,
>  	return sp;
>  }
>  
> +static struct kvm_mmu_page *kvm_mmu_find_direct_sp(struct kvm *kvm, gfn_t gfn,
> +						   union kvm_mmu_page_role role)
> +{
> +	struct kvm_mmu_page *sp;
> +	LIST_HEAD(invalid_list);
> +
> +	BUG_ON(!role.direct);
> +
> +	sp = __kvm_mmu_find_shadow_page(kvm, gfn, role, &invalid_list);
> +
> +	/* Direct SPs are never unsync. */
> +	WARN_ON_ONCE(sp && sp->unsync);
> +
> +	kvm_mmu_commit_zap_page(kvm, &invalid_list);
> +	return sp;
> +}
> +
>  /*
>   * Looks up an existing SP for the given gfn and role if one exists. The
>   * return SP is guaranteed to be synced.
> @@ -6063,12 +6110,266 @@ void kvm_mmu_slot_remove_write_access(struct kvm *kvm,
>  		kvm_arch_flush_remote_tlbs_memslot(kvm, memslot);
>  }
>  
> +static int prepare_to_split_huge_page(struct kvm *kvm,
> +				      const struct kvm_memory_slot *slot,
> +				      u64 *huge_sptep,
> +				      struct kvm_mmu_page **spp,
> +				      bool *flush,
> +				      bool *dropped_lock)
> +{
> +	int r = 0;
> +
> +	*dropped_lock = false;
> +
> +	if (kvm_mmu_available_pages(kvm) <= KVM_MIN_FREE_MMU_PAGES)
> +		return -ENOSPC;
> +
> +	if (need_resched() || rwlock_needbreak(&kvm->mmu_lock))
> +		goto drop_lock;
> +

It's not immediately clear to me whether *spp can already be set when we
enter this function.  Some sanity check might be nice?
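
E.g. something like this at the top of the function (untested):

	/* Assumption: no caller passes in a pre-populated *spp. */
	WARN_ON_ONCE(*spp);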

> +	*spp = kvm_mmu_alloc_direct_sp_for_split(true);
> +	if (r)
> +		goto drop_lock;
> +
> +	return 0;
> +
> +drop_lock:
> +	if (*flush)
> +		kvm_arch_flush_remote_tlbs_memslot(kvm, slot);
> +
> +	*flush = false;
> +	*dropped_lock = true;
> +
> +	write_unlock(&kvm->mmu_lock);
> +	cond_resched();
> +	*spp = kvm_mmu_alloc_direct_sp_for_split(false);
> +	if (!*spp)
> +		r = -ENOMEM;
> +	write_lock(&kvm->mmu_lock);
> +
> +	return r;
> +}
> +
> +static struct kvm_mmu_page *kvm_mmu_get_sp_for_split(struct kvm *kvm,
> +						     const struct kvm_memory_slot *slot,
> +						     u64 *huge_sptep,
> +						     struct kvm_mmu_page **spp)
> +{
> +	struct kvm_mmu_page *split_sp;
> +	union kvm_mmu_page_role role;
> +	unsigned int access;
> +	gfn_t gfn;
> +
> +	gfn = sptep_to_gfn(huge_sptep);
> +	access = sptep_to_access(huge_sptep);
> +
> +	/*
> +	 * Huge page splitting always uses direct shadow pages since we are
> +	 * directly mapping the huge page GFN region with smaller pages.
> +	 */
> +	role = kvm_mmu_child_role(huge_sptep, true, access);
> +	split_sp = kvm_mmu_find_direct_sp(kvm, gfn, role);
> +
> +	/*
> +	 * Opt not to split if the lower-level SP already exists. This requires
> +	 * more complex handling as the SP may be already partially filled in
> +	 * and may need extra pte_list_desc structs to update parent_ptes.
> +	 */
> +	if (split_sp)
> +		return NULL;

This smells tricky..

Firstly we're trying to look up any existing SP that already shadows this
huge page in split form, with the access bits fetched from the shadowed
translation cache (so without the hugepage NX effect).  However, could
those pages be mapped with permissions different from the current huge
mapping?

IIUC all of this exists because we can't allocate pte_list_desc structs
here, and we want to make sure no pte list grows beyond length 1.

But I also see the pte_list check below...

> +
> +	swap(split_sp, *spp);
> +	init_shadow_page(kvm, split_sp, slot, gfn, role);
> +	trace_kvm_mmu_get_page(split_sp, true);
> +
> +	return split_sp;
> +}
> +
> +static int kvm_mmu_split_huge_page(struct kvm *kvm,
> +				   const struct kvm_memory_slot *slot,
> +				   u64 *huge_sptep, struct kvm_mmu_page **spp,
> +				   bool *flush)
> +
> +{
> +	struct kvm_mmu_page *split_sp;
> +	u64 huge_spte, split_spte;
> +	int split_level, index;
> +	unsigned int access;
> +	u64 *split_sptep;
> +	gfn_t split_gfn;
> +
> +	split_sp = kvm_mmu_get_sp_for_split(kvm, slot, huge_sptep, spp);
> +	if (!split_sp)
> +		return -EOPNOTSUPP;
> +
> +	/*
> +	 * Since we did not allocate pte_list_desc structs for the split, we
> +	 * cannot add a new parent SPTE to parent_ptes. This should never happen
> +	 * in practice though since this is a fresh SP.
> +	 *
> +	 * Note, this makes it safe to pass NULL to __link_shadow_page() below.
> +	 */
> +	if (WARN_ON_ONCE(split_sp->parent_ptes.val))
> +		return -EINVAL;
> +
> +	huge_spte = READ_ONCE(*huge_sptep);
> +
> +	split_level = split_sp->role.level;
> +	access = split_sp->role.access;
> +
> +	for (index = 0; index < PT64_ENT_PER_PAGE; index++) {
> +		split_sptep = &split_sp->spt[index];
> +		split_gfn = kvm_mmu_page_get_gfn(split_sp, index);
> +
> +		BUG_ON(is_shadow_present_pte(*split_sptep));
> +
> +		/*
> +		 * Since we did not allocate pte_list_desc structs for the
> +		 * split, we can't add a new SPTE that maps this GFN.
> +		 * Skipping this SPTE means we're only partially mapping the
> +		 * huge page, which means we'll need to flush TLBs before
> +		 * dropping the MMU lock.
> +		 *
> +		 * Note, this makes it safe to pass NULL to __rmap_add() below.
> +		 */
> +		if (gfn_to_rmap(split_gfn, split_level, slot)->val) {
> +			*flush = true;
> +			continue;
> +		}

... here.

IIUC this check alone should already cover all the cases, and it directly
expresses what we care about: not growing any rmap beyond length 1.
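
For reference, my understanding of the rmap_head encoding in
pte_list_add():

	/*
	 * rmap_head->val == 0:    no SPTE maps this gfn yet
	 * bit 0 clear, val != 0:  val is the one and only SPTE pointer
	 * bit 0 set:              val points to a pte_list_desc, i.e.
	 *                         multiple SPTEs already map this gfn
	 */

So any non-zero val means adding one more SPTE may need a pte_list_desc,
which is exactly what this check refuses to do.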

> +
> +		split_spte = make_huge_page_split_spte(
> +				huge_spte, split_level + 1, index, access);
> +
> +		mmu_spte_set(split_sptep, split_spte);
> +		__rmap_add(kvm, NULL, slot, split_sptep, split_gfn, access);

__rmap_add() with a NULL cache pointer is weird.. same as
__link_shadow_page() below.

I'll stop here for now I guess..  Have you considered having the rmap
allocation ready from the start, rather than adding this intermediate step
now and only filling it in later?  Because all of this looks hackish to
me..  It's also possible that I missed something important; if so, please
shoot.
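
Something like the following is roughly what I mean (untested; the cache
field and the constant are made up, nothing like them exists yet):

	/*
	 * Hypothetical: keep a per-VM cache of pte_list_desc structs and
	 * top it up before walking the rmaps, so that __rmap_add() and
	 * __link_shadow_page() never see a NULL cache.
	 */
	r = kvm_mmu_topup_memory_cache(&kvm->arch.split_desc_cache,
				       SPLIT_DESC_CACHE_MIN_NR_OBJECTS);
	if (r)
		return r;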

Thanks,

> +	}
> +
> +	/*
> +	 * Replace the huge spte with a pointer to the populated lower level
> +	 * page table. Since we are making this change without a TLB flush vCPUs
> +	 * will see a mix of the split mappings and the original huge mapping,
> +	 * depending on what's currently in their TLB. This is fine from a
> +	 * correctness standpoint since the translation will either be identical
> +	 * or non-present. To account for non-present mappings, the TLB will be
> +	 * flushed prior to dropping the MMU lock.
> +	 */
> +	__drop_large_spte(kvm, huge_sptep, false);
> +	__link_shadow_page(NULL, huge_sptep, split_sp);
> +
> +	return 0;
> +}

-- 
Peter Xu

