* [PATCH v3 0/8] KVM: x86: Apply NX mitigation more precisely
@ 2022-08-05 23:05 Sean Christopherson
  2022-08-05 23:05 ` [PATCH v3 1/8] KVM: x86/mmu: Bug the VM if KVM attempts to double count an NX huge page Sean Christopherson
                   ` (7 more replies)
  0 siblings, 8 replies; 27+ messages in thread
From: Sean Christopherson @ 2022-08-05 23:05 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: kvm, linux-kernel, David Matlack, Yan Zhao, Mingwei Zhang, Ben Gardon

Precisely track (via kvm_mmu_page) whether a non-huge page is being
forced due to the NX huge page mitigation, and use that information to
avoid unnecessarily forcing smaller page sizes in
disallowed_hugepage_adjust().

KVM incorrectly assumes that the NX huge page mitigation is the only
scenario where KVM will create a non-leaf page instead of a huge page.
As a result, if the original source of huge page incompatibility goes
away, the NX mitigation is enabled, and KVM encounters a present shadow
page when attempting to install a huge page, KVM will force a smaller page
regardless of whether or not a smaller page is actually necessary to
satisfy the NX huge page mitigation.

Unnecessarily forcing small pages can result in degraded guest performance,
especially on larger VMs.  The bug was originally discovered when testing
dirty log performance, as KVM would leave small pages lying around when
zapping collapsible SPTEs.  That case was inadvertently fixed by commit
5ba7c4c6d1c7 ("KVM: x86/MMU: Zap non-leaf SPTEs when disabling dirty
logging"), but other scenarios are still affected, e.g. KVM will not
rebuild a huge page if the mmu_notifier zaps a range of PTEs because the
primary MMU is creating a huge page.
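
For reference, the gist of the behavior change in
disallowed_hugepage_adjust() is sketched below.  This is illustrative
only, not the applied diff (see patch 8 for the real change); the
function and field names are the ones introduced by this series, and
the memory barrier is omitted for brevity.

void disallowed_hugepage_adjust(struct kvm_page_fault *fault, u64 spte,
				int cur_level)
{
	if (cur_level > PG_LEVEL_4K &&
	    cur_level == fault->goal_level &&
	    is_shadow_present_pte(spte) &&
	    !is_large_pte(spte)) {
		/*
		 * Old behavior: unconditionally force a smaller page.
		 * New behavior: force a smaller page only if the existing
		 * shadow page was created due to the NX huge page
		 * mitigation.
		 */
		if (!spte_to_child_sp(spte)->nx_huge_page_disallowed)
			return;

		fault->pfn |= fault->gfn & (KVM_PAGES_PER_HPAGE(cur_level) -
					    KVM_PAGES_PER_HPAGE(cur_level - 1));
		fault->goal_level--;
	}
}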

v3:
 - Bug the VM if KVM attempts to double account a shadow page that
   disallows a NX huge page. [David]
 - Split the rename to separate patch. [Paolo]
 - Rename more NX huge page variables/functions. [David]
 - Combine and tweak the comments about enforcing the NX huge page
   mitigation for non-paging MMUs. [Paolo, David]
 - Call out that the shadow MMU holds mmu_lock for write and doesn't need
   to manually handle memory ordering when accounting NX huge pages. [David]
 - Add a smp_rmb() when unlinking shadow pages in the TDP MMU.
 - Rename spte_to_sp() to spte_to_child_sp(). [David]
 - Collect reviews. [David]
 - Tweak the changelog for the final patch to call out that precise
   accounting addresses real world performance bugs. [Paolo]
 - Reword the changelog for the patch to (almost) always tag disallowed
   NX huge pages, and call out that it doesn't fix the TDP MMU. [David]

v2: Rebase, tweak a changelog accordingly.

v1: https://lore.kernel.org/all/20220409003847.819686-1-seanjc@google.com

Mingwei Zhang (1):
  KVM: x86/mmu: explicitly check nx_hugepage in
    disallowed_hugepage_adjust()

Sean Christopherson (7):
  KVM: x86/mmu: Bug the VM if KVM attempts to double count an NX huge
    page
  KVM: x86/mmu: Tag disallowed NX huge pages even if they're not tracked
  KVM: x86/mmu: Rename NX huge pages fields/functions for consistency
  KVM: x86/mmu: Properly account NX huge page workaround for nonpaging
    MMUs
  KVM: x86/mmu: Set disallowed_nx_huge_page in TDP MMU before setting
    SPTE
  KVM: x86/mmu: Track the number of TDP MMU pages, but not the actual
    pages
  KVM: x86/mmu: Add helper to convert SPTE value to its shadow page

 arch/x86/include/asm/kvm_host.h |  19 ++---
 arch/x86/kvm/mmu/mmu.c          | 123 +++++++++++++++++++++-----------
 arch/x86/kvm/mmu/mmu_internal.h |  33 ++++-----
 arch/x86/kvm/mmu/paging_tmpl.h  |   6 +-
 arch/x86/kvm/mmu/spte.c         |  12 ++++
 arch/x86/kvm/mmu/spte.h         |  17 +++++
 arch/x86/kvm/mmu/tdp_mmu.c      |  59 ++++++++++-----
 arch/x86/kvm/mmu/tdp_mmu.h      |   2 +
 8 files changed, 178 insertions(+), 93 deletions(-)


base-commit: 93472b79715378a2386598d6632c654a2223267b
-- 
2.37.1.559.g78731f0fdb-goog



* [PATCH v3 1/8] KVM: x86/mmu: Bug the VM if KVM attempts to double count an NX huge page
  2022-08-05 23:05 [PATCH v3 0/8] KVM: x86: Apply NX mitigation more precisely Sean Christopherson
@ 2022-08-05 23:05 ` Sean Christopherson
  2022-08-14  0:53   ` Mingwei Zhang
  2022-08-05 23:05 ` [PATCH v3 2/8] KVM: x86/mmu: Tag disallowed NX huge pages even if they're not tracked Sean Christopherson
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 27+ messages in thread
From: Sean Christopherson @ 2022-08-05 23:05 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: kvm, linux-kernel, David Matlack, Yan Zhao, Mingwei Zhang, Ben Gardon

WARN and kill the VM if KVM attempts to double count an NX huge page,
i.e. attempts to re-tag a shadow page with "NX huge page disallowed".
KVM does NX huge page accounting only when linking a new shadow page, and
it should be impossible for a new shadow page to be already accounted.
E.g. even in the TDP MMU case, where vCPUs can race to install a new
shadow page, only the "winner" will account the installed page.

Kill the VM instead of continuing on, as either KVM has an egregious
bug, e.g. didn't zero-initialize the data, or there's host data
corruption, in which case carrying on is dangerous, e.g. it could cause
silent data corruption in the guest.
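
For context, KVM_BUG_ON() behaves roughly as sketched below (paraphrased
from memory of include/linux/kvm_host.h, not taken from this series): it
WARNs once, marks the VM as bugged so it gets killed, and hands the
condition back so the caller can bail out.

/* Rough approximation; see include/linux/kvm_host.h for the real macro. */
#define KVM_BUG_ON(cond, kvm)					\
({								\
	bool __ret = !!(cond);					\
								\
	if (WARN_ON_ONCE(__ret && !(kvm)->vm_bugged))		\
		kvm_vm_bugged(kvm);	/* kill the VM */	\
	unlikely(__ret);					\
})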

Reported-by: David Matlack <dmatlack@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/mmu/mmu.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 3e1317325e1f..36b898dbde91 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -804,7 +804,7 @@ static void account_shadowed(struct kvm *kvm, struct kvm_mmu_page *sp)
 
 void account_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp)
 {
-	if (sp->lpage_disallowed)
+	if (KVM_BUG_ON(sp->lpage_disallowed, kvm))
 		return;
 
 	++kvm->stat.nx_lpage_splits;
-- 
2.37.1.559.g78731f0fdb-goog



* [PATCH v3 2/8] KVM: x86/mmu: Tag disallowed NX huge pages even if they're not tracked
  2022-08-05 23:05 [PATCH v3 0/8] KVM: x86: Apply NX mitigation more precisely Sean Christopherson
  2022-08-05 23:05 ` [PATCH v3 1/8] KVM: x86/mmu: Bug the VM if KVM attempts to double count an NX huge page Sean Christopherson
@ 2022-08-05 23:05 ` Sean Christopherson
  2022-08-14  0:53   ` Mingwei Zhang
  2022-08-05 23:05 ` [PATCH v3 3/8] KVM: x86/mmu: Rename NX huge pages fields/functions for consistency Sean Christopherson
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 27+ messages in thread
From: Sean Christopherson @ 2022-08-05 23:05 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: kvm, linux-kernel, David Matlack, Yan Zhao, Mingwei Zhang, Ben Gardon

Tag shadow pages that cannot be replaced with an NX huge page regardless
of whether or not zapping the page would allow KVM to immediately create
a huge page, e.g. because something else prevents creating a huge page.

I.e. track pages that are disallowed from being NX huge pages regardless
of whether or not the page could have been huge at the time of fault.
KVM currently tracks pages that were disallowed from being huge due to
the NX workaround if and only if the page could otherwise be huge.  But
that fails to handle the scenario where whatever restriction prevented
KVM from installing a huge page goes away, e.g. if dirty logging is
disabled, the host mapping level changes, etc...

Failure to tag shadow pages appropriately could theoretically lead to
false negatives, e.g. if a fetch fault requests a small page and thus
isn't tracked, and a read/write fault later requests a huge page, KVM
will not reject the huge page as it should.

To avoid yet another flag, initialize the list_head and use list_empty()
to determine whether or not a page is on the list of NX huge pages that
should be recovered.
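
The list_empty()-as-a-flag idiom relied on here is sketched below using
generic <linux/list.h> calls; the struct and function names are made up
for illustration and are not part of this patch.

#include <linux/list.h>

struct example_sp {
	struct list_head link;	/* self-linked, i.e. "empty", when untracked */
};

static void example_init(struct example_sp *sp)
{
	INIT_LIST_HEAD(&sp->link);		/* starts out untracked */
}

static void example_track(struct list_head *head, struct example_sp *sp)
{
	if (list_empty(&sp->link))		/* not already on the list */
		list_add_tail(&sp->link, head);
}

static void example_untrack(struct example_sp *sp)
{
	if (!list_empty(&sp->link))
		list_del_init(&sp->link);	/* re-init so list_empty() is true again */
}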

Note, the TDP MMU accounting is still flawed as fixing the TDP MMU is
more involved due to mmu_lock being held for read.  This will be
addressed in a future commit.

Fixes: 5bcaf3e1715f ("KVM: x86/mmu: Account NX huge page disallowed iff huge page was requested")
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/mmu/mmu.c          | 27 +++++++++++++++++++--------
 arch/x86/kvm/mmu/mmu_internal.h | 10 +++++++++-
 arch/x86/kvm/mmu/paging_tmpl.h  |  6 +++---
 arch/x86/kvm/mmu/tdp_mmu.c      |  4 +++-
 4 files changed, 34 insertions(+), 13 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 36b898dbde91..55dac44f3397 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -802,15 +802,20 @@ static void account_shadowed(struct kvm *kvm, struct kvm_mmu_page *sp)
 		kvm_flush_remote_tlbs_with_address(kvm, gfn, 1);
 }
 
-void account_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp)
+void account_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp,
+			  bool nx_huge_page_possible)
 {
-	if (KVM_BUG_ON(sp->lpage_disallowed, kvm))
+	if (KVM_BUG_ON(!list_empty(&sp->lpage_disallowed_link), kvm))
+		return;
+
+	sp->lpage_disallowed = true;
+
+	if (!nx_huge_page_possible)
 		return;
 
 	++kvm->stat.nx_lpage_splits;
 	list_add_tail(&sp->lpage_disallowed_link,
 		      &kvm->arch.lpage_disallowed_mmu_pages);
-	sp->lpage_disallowed = true;
 }
 
 static void unaccount_shadowed(struct kvm *kvm, struct kvm_mmu_page *sp)
@@ -832,9 +837,13 @@ static void unaccount_shadowed(struct kvm *kvm, struct kvm_mmu_page *sp)
 
 void unaccount_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp)
 {
-	--kvm->stat.nx_lpage_splits;
 	sp->lpage_disallowed = false;
-	list_del(&sp->lpage_disallowed_link);
+
+	if (list_empty(&sp->lpage_disallowed_link))
+		return;
+
+	--kvm->stat.nx_lpage_splits;
+	list_del_init(&sp->lpage_disallowed_link);
 }
 
 static struct kvm_memory_slot *
@@ -2115,6 +2124,8 @@ static struct kvm_mmu_page *kvm_mmu_alloc_shadow_page(struct kvm *kvm,
 
 	set_page_private(virt_to_page(sp->spt), (unsigned long)sp);
 
+	INIT_LIST_HEAD(&sp->lpage_disallowed_link);
+
 	/*
 	 * active_mmu_pages must be a FIFO list, as kvm_zap_obsolete_pages()
 	 * depends on valid pages being added to the head of the list.  See
@@ -3112,9 +3123,9 @@ static int __direct_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 			continue;
 
 		link_shadow_page(vcpu, it.sptep, sp);
-		if (fault->is_tdp && fault->huge_page_disallowed &&
-		    fault->req_level >= it.level)
-			account_huge_nx_page(vcpu->kvm, sp);
+		if (fault->is_tdp && fault->huge_page_disallowed)
+			account_huge_nx_page(vcpu->kvm, sp,
+					     fault->req_level >= it.level);
 	}
 
 	if (WARN_ON_ONCE(it.level != fault->goal_level))
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index 582def531d4d..cca1ad75d096 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -100,6 +100,13 @@ struct kvm_mmu_page {
 		};
 	};
 
+	/*
+	 * Tracks shadow pages that, if zapped, would allow KVM to create an NX
+	 * huge page.  A shadow page will have lpage_disallowed set but not be
+	 * on the list if a huge page is disallowed for other reasons, e.g.
+	 * because KVM is shadowing a PTE at the same gfn, the memslot isn't
+	 * properly aligned, etc...
+	 */
 	struct list_head lpage_disallowed_link;
 #ifdef CONFIG_X86_32
 	/*
@@ -315,7 +322,8 @@ void disallowed_hugepage_adjust(struct kvm_page_fault *fault, u64 spte, int cur_
 
 void *mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
 
-void account_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp);
+void account_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp,
+			  bool nx_huge_page_possible);
 void unaccount_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp);
 
 #endif /* __KVM_X86_MMU_INTERNAL_H */
diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index f5958071220c..e450f49f2225 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -713,9 +713,9 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
 			continue;
 
 		link_shadow_page(vcpu, it.sptep, sp);
-		if (fault->huge_page_disallowed &&
-		    fault->req_level >= it.level)
-			account_huge_nx_page(vcpu->kvm, sp);
+		if (fault->huge_page_disallowed)
+			account_huge_nx_page(vcpu->kvm, sp,
+					     fault->req_level >= it.level);
 	}
 
 	if (WARN_ON_ONCE(it.level != fault->goal_level))
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index bf2ccf9debca..903d0d3497b6 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -284,6 +284,8 @@ static struct kvm_mmu_page *tdp_mmu_alloc_sp(struct kvm_vcpu *vcpu)
 static void tdp_mmu_init_sp(struct kvm_mmu_page *sp, tdp_ptep_t sptep,
 			    gfn_t gfn, union kvm_mmu_page_role role)
 {
+	INIT_LIST_HEAD(&sp->lpage_disallowed_link);
+
 	set_page_private(virt_to_page(sp->spt), (unsigned long)sp);
 
 	sp->role = role;
@@ -1130,7 +1132,7 @@ static int tdp_mmu_link_sp(struct kvm *kvm, struct tdp_iter *iter,
 	spin_lock(&kvm->arch.tdp_mmu_pages_lock);
 	list_add(&sp->link, &kvm->arch.tdp_mmu_pages);
 	if (account_nx)
-		account_huge_nx_page(kvm, sp);
+		account_huge_nx_page(kvm, sp, true);
 	spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
 
 	return 0;
-- 
2.37.1.559.g78731f0fdb-goog



* [PATCH v3 3/8] KVM: x86/mmu: Rename NX huge pages fields/functions for consistency
  2022-08-05 23:05 [PATCH v3 0/8] KVM: x86: Apply NX mitigation more precisely Sean Christopherson
  2022-08-05 23:05 ` [PATCH v3 1/8] KVM: x86/mmu: Bug the VM if KVM attempts to double count an NX huge page Sean Christopherson
  2022-08-05 23:05 ` [PATCH v3 2/8] KVM: x86/mmu: Tag disallowed NX huge pages even if they're not tracked Sean Christopherson
@ 2022-08-05 23:05 ` Sean Christopherson
  2022-08-14  1:12   ` Mingwei Zhang
  2022-08-05 23:05 ` [PATCH v3 4/8] KVM: x86/mmu: Properly account NX huge page workaround for nonpaging MMUs Sean Christopherson
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 27+ messages in thread
From: Sean Christopherson @ 2022-08-05 23:05 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: kvm, linux-kernel, David Matlack, Yan Zhao, Mingwei Zhang, Ben Gardon

Rename most of the variables/functions involved in the NX huge page
mitigation to provide consistency, e.g. lpage vs huge page, and NX huge
vs huge NX, and also to provide clarity, e.g. to make it obvious the flag
applies only to the NX huge page mitigation, not to any condition that
prevents creating a huge page.

Leave the nx_lpage_splits stat alone as the name is ABI and thus set in
stone.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/include/asm/kvm_host.h |  8 ++--
 arch/x86/kvm/mmu/mmu.c          | 70 +++++++++++++++++----------------
 arch/x86/kvm/mmu/mmu_internal.h | 22 +++++++----
 arch/x86/kvm/mmu/paging_tmpl.h  |  2 +-
 arch/x86/kvm/mmu/tdp_mmu.c      |  8 ++--
 5 files changed, 59 insertions(+), 51 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index e8281d64a431..5634347e5d05 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1143,7 +1143,7 @@ struct kvm_arch {
 	struct hlist_head mmu_page_hash[KVM_NUM_MMU_PAGES];
 	struct list_head active_mmu_pages;
 	struct list_head zapped_obsolete_pages;
-	struct list_head lpage_disallowed_mmu_pages;
+	struct list_head possible_nx_huge_pages;
 	struct kvm_page_track_notifier_node mmu_sp_tracker;
 	struct kvm_page_track_notifier_head track_notifier_head;
 	/*
@@ -1259,7 +1259,7 @@ struct kvm_arch {
 	bool sgx_provisioning_allowed;
 
 	struct kvm_pmu_event_filter __rcu *pmu_event_filter;
-	struct task_struct *nx_lpage_recovery_thread;
+	struct task_struct *nx_huge_page_recovery_thread;
 
 #ifdef CONFIG_X86_64
 	/*
@@ -1304,8 +1304,8 @@ struct kvm_arch {
 	 *  - tdp_mmu_roots (above)
 	 *  - tdp_mmu_pages (above)
 	 *  - the link field of struct kvm_mmu_pages used by the TDP MMU
-	 *  - lpage_disallowed_mmu_pages
-	 *  - the lpage_disallowed_link field of struct kvm_mmu_pages used
+	 *  - possible_nx_huge_pages;
+	 *  - the possible_nx_huge_page_link field of struct kvm_mmu_pages used
 	 *    by the TDP MMU
 	 * It is acceptable, but not necessary, to acquire this lock when
 	 * the thread holds the MMU lock in write mode.
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 55dac44f3397..53d0dafa68ff 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -802,20 +802,20 @@ static void account_shadowed(struct kvm *kvm, struct kvm_mmu_page *sp)
 		kvm_flush_remote_tlbs_with_address(kvm, gfn, 1);
 }
 
-void account_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp,
+void account_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp,
 			  bool nx_huge_page_possible)
 {
-	if (KVM_BUG_ON(!list_empty(&sp->lpage_disallowed_link), kvm))
+	if (KVM_BUG_ON(!list_empty(&sp->possible_nx_huge_page_link), kvm))
 		return;
 
-	sp->lpage_disallowed = true;
+	sp->nx_huge_page_disallowed = true;
 
 	if (!nx_huge_page_possible)
 		return;
 
 	++kvm->stat.nx_lpage_splits;
-	list_add_tail(&sp->lpage_disallowed_link,
-		      &kvm->arch.lpage_disallowed_mmu_pages);
+	list_add_tail(&sp->possible_nx_huge_page_link,
+		      &kvm->arch.possible_nx_huge_pages);
 }
 
 static void unaccount_shadowed(struct kvm *kvm, struct kvm_mmu_page *sp)
@@ -835,15 +835,15 @@ static void unaccount_shadowed(struct kvm *kvm, struct kvm_mmu_page *sp)
 	kvm_mmu_gfn_allow_lpage(slot, gfn);
 }
 
-void unaccount_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp)
+void unaccount_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp)
 {
-	sp->lpage_disallowed = false;
+	sp->nx_huge_page_disallowed = false;
 
-	if (list_empty(&sp->lpage_disallowed_link))
+	if (list_empty(&sp->possible_nx_huge_page_link))
 		return;
 
 	--kvm->stat.nx_lpage_splits;
-	list_del_init(&sp->lpage_disallowed_link);
+	list_del_init(&sp->possible_nx_huge_page_link);
 }
 
 static struct kvm_memory_slot *
@@ -2124,7 +2124,7 @@ static struct kvm_mmu_page *kvm_mmu_alloc_shadow_page(struct kvm *kvm,
 
 	set_page_private(virt_to_page(sp->spt), (unsigned long)sp);
 
-	INIT_LIST_HEAD(&sp->lpage_disallowed_link);
+	INIT_LIST_HEAD(&sp->possible_nx_huge_page_link);
 
 	/*
 	 * active_mmu_pages must be a FIFO list, as kvm_zap_obsolete_pages()
@@ -2483,8 +2483,8 @@ static bool __kvm_mmu_prepare_zap_page(struct kvm *kvm,
 		zapped_root = !is_obsolete_sp(kvm, sp);
 	}
 
-	if (sp->lpage_disallowed)
-		unaccount_huge_nx_page(kvm, sp);
+	if (sp->nx_huge_page_disallowed)
+		unaccount_nx_huge_page(kvm, sp);
 
 	sp->role.invalid = 1;
 
@@ -3124,7 +3124,7 @@ static int __direct_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 
 		link_shadow_page(vcpu, it.sptep, sp);
 		if (fault->is_tdp && fault->huge_page_disallowed)
-			account_huge_nx_page(vcpu->kvm, sp,
+			account_nx_huge_page(vcpu->kvm, sp,
 					     fault->req_level >= it.level);
 	}
 
@@ -5981,7 +5981,7 @@ int kvm_mmu_init_vm(struct kvm *kvm)
 
 	INIT_LIST_HEAD(&kvm->arch.active_mmu_pages);
 	INIT_LIST_HEAD(&kvm->arch.zapped_obsolete_pages);
-	INIT_LIST_HEAD(&kvm->arch.lpage_disallowed_mmu_pages);
+	INIT_LIST_HEAD(&kvm->arch.possible_nx_huge_pages);
 	spin_lock_init(&kvm->arch.mmu_unsync_pages_lock);
 
 	r = kvm_mmu_init_tdp_mmu(kvm);
@@ -6699,7 +6699,7 @@ static int set_nx_huge_pages(const char *val, const struct kernel_param *kp)
 			kvm_mmu_zap_all_fast(kvm);
 			mutex_unlock(&kvm->slots_lock);
 
-			wake_up_process(kvm->arch.nx_lpage_recovery_thread);
+			wake_up_process(kvm->arch.nx_huge_page_recovery_thread);
 		}
 		mutex_unlock(&kvm_lock);
 	}
@@ -6825,7 +6825,7 @@ static int set_nx_huge_pages_recovery_param(const char *val, const struct kernel
 		mutex_lock(&kvm_lock);
 
 		list_for_each_entry(kvm, &vm_list, vm_list)
-			wake_up_process(kvm->arch.nx_lpage_recovery_thread);
+			wake_up_process(kvm->arch.nx_huge_page_recovery_thread);
 
 		mutex_unlock(&kvm_lock);
 	}
@@ -6833,7 +6833,7 @@ static int set_nx_huge_pages_recovery_param(const char *val, const struct kernel
 	return err;
 }
 
-static void kvm_recover_nx_lpages(struct kvm *kvm)
+static void kvm_recover_nx_huge_pages(struct kvm *kvm)
 {
 	unsigned long nx_lpage_splits = kvm->stat.nx_lpage_splits;
 	int rcu_idx;
@@ -6856,23 +6856,25 @@ static void kvm_recover_nx_lpages(struct kvm *kvm)
 	ratio = READ_ONCE(nx_huge_pages_recovery_ratio);
 	to_zap = ratio ? DIV_ROUND_UP(nx_lpage_splits, ratio) : 0;
 	for ( ; to_zap; --to_zap) {
-		if (list_empty(&kvm->arch.lpage_disallowed_mmu_pages))
+		if (list_empty(&kvm->arch.possible_nx_huge_pages))
 			break;
 
 		/*
 		 * We use a separate list instead of just using active_mmu_pages
-		 * because the number of lpage_disallowed pages is expected to
-		 * be relatively small compared to the total.
+		 * because the number of shadow pages that can be replaced with an
+		 * NX huge page is expected to be relatively small compared to
+		 * the total number of shadow pages.  And because the TDP MMU
+		 * doesn't use active_mmu_pages.
 		 */
-		sp = list_first_entry(&kvm->arch.lpage_disallowed_mmu_pages,
+		sp = list_first_entry(&kvm->arch.possible_nx_huge_pages,
 				      struct kvm_mmu_page,
-				      lpage_disallowed_link);
-		WARN_ON_ONCE(!sp->lpage_disallowed);
+				      possible_nx_huge_page_link);
+		WARN_ON_ONCE(!sp->nx_huge_page_disallowed);
 		if (is_tdp_mmu_page(sp)) {
 			flush |= kvm_tdp_mmu_zap_sp(kvm, sp);
 		} else {
 			kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list);
-			WARN_ON_ONCE(sp->lpage_disallowed);
+			WARN_ON_ONCE(sp->nx_huge_page_disallowed);
 		}
 
 		if (need_resched() || rwlock_needbreak(&kvm->mmu_lock)) {
@@ -6893,7 +6895,7 @@ static void kvm_recover_nx_lpages(struct kvm *kvm)
 	srcu_read_unlock(&kvm->srcu, rcu_idx);
 }
 
-static long get_nx_lpage_recovery_timeout(u64 start_time)
+static long get_nx_huge_page_recovery_timeout(u64 start_time)
 {
 	bool enabled;
 	uint period;
@@ -6904,19 +6906,19 @@ static long get_nx_lpage_recovery_timeout(u64 start_time)
 		       : MAX_SCHEDULE_TIMEOUT;
 }
 
-static int kvm_nx_lpage_recovery_worker(struct kvm *kvm, uintptr_t data)
+static int kvm_nx_huge_page_recovery_worker(struct kvm *kvm, uintptr_t data)
 {
 	u64 start_time;
 	long remaining_time;
 
 	while (true) {
 		start_time = get_jiffies_64();
-		remaining_time = get_nx_lpage_recovery_timeout(start_time);
+		remaining_time = get_nx_huge_page_recovery_timeout(start_time);
 
 		set_current_state(TASK_INTERRUPTIBLE);
 		while (!kthread_should_stop() && remaining_time > 0) {
 			schedule_timeout(remaining_time);
-			remaining_time = get_nx_lpage_recovery_timeout(start_time);
+			remaining_time = get_nx_huge_page_recovery_timeout(start_time);
 			set_current_state(TASK_INTERRUPTIBLE);
 		}
 
@@ -6925,7 +6927,7 @@ static int kvm_nx_lpage_recovery_worker(struct kvm *kvm, uintptr_t data)
 		if (kthread_should_stop())
 			return 0;
 
-		kvm_recover_nx_lpages(kvm);
+		kvm_recover_nx_huge_pages(kvm);
 	}
 }
 
@@ -6933,17 +6935,17 @@ int kvm_mmu_post_init_vm(struct kvm *kvm)
 {
 	int err;
 
-	err = kvm_vm_create_worker_thread(kvm, kvm_nx_lpage_recovery_worker, 0,
+	err = kvm_vm_create_worker_thread(kvm, kvm_nx_huge_page_recovery_worker, 0,
 					  "kvm-nx-lpage-recovery",
-					  &kvm->arch.nx_lpage_recovery_thread);
+					  &kvm->arch.nx_huge_page_recovery_thread);
 	if (!err)
-		kthread_unpark(kvm->arch.nx_lpage_recovery_thread);
+		kthread_unpark(kvm->arch.nx_huge_page_recovery_thread);
 
 	return err;
 }
 
 void kvm_mmu_pre_destroy_vm(struct kvm *kvm)
 {
-	if (kvm->arch.nx_lpage_recovery_thread)
-		kthread_stop(kvm->arch.nx_lpage_recovery_thread);
+	if (kvm->arch.nx_huge_page_recovery_thread)
+		kthread_stop(kvm->arch.nx_huge_page_recovery_thread);
 }
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index cca1ad75d096..67879459a25c 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -57,7 +57,13 @@ struct kvm_mmu_page {
 	bool tdp_mmu_page;
 	bool unsync;
 	u8 mmu_valid_gen;
-	bool lpage_disallowed; /* Can't be replaced by an equiv large page */
+
+	 /*
+	  * The shadow page can't be replaced by an equivalent huge page
+	  * because it is being used to map an executable page in the guest
+	  * and the NX huge page mitigation is enabled.
+	  */
+	bool nx_huge_page_disallowed;
 
 	/*
 	 * The following two entries are used to key the shadow page in the
@@ -102,12 +108,12 @@ struct kvm_mmu_page {
 
 	/*
 	 * Tracks shadow pages that, if zapped, would allow KVM to create an NX
-	 * huge page.  A shadow page will have lpage_disallowed set but not be
-	 * on the list if a huge page is disallowed for other reasons, e.g.
-	 * because KVM is shadowing a PTE at the same gfn, the memslot isn't
-	 * properly aligned, etc...
+	 * huge page.  A shadow page will have nx_huge_page_disallowed set but
+	 * not be on the list if a huge page is disallowed for other reasons,
+	 * e.g. because KVM is shadowing a PTE at the same gfn, the memslot
+	 * isn't properly aligned, etc...
 	 */
-	struct list_head lpage_disallowed_link;
+	struct list_head possible_nx_huge_page_link;
 #ifdef CONFIG_X86_32
 	/*
 	 * Used out of the mmu-lock to avoid reading spte values while an
@@ -322,8 +328,8 @@ void disallowed_hugepage_adjust(struct kvm_page_fault *fault, u64 spte, int cur_
 
 void *mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
 
-void account_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp,
+void account_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp,
 			  bool nx_huge_page_possible);
-void unaccount_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp);
+void unaccount_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp);
 
 #endif /* __KVM_X86_MMU_INTERNAL_H */
diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index e450f49f2225..259c0f019f09 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -714,7 +714,7 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
 
 		link_shadow_page(vcpu, it.sptep, sp);
 		if (fault->huge_page_disallowed)
-			account_huge_nx_page(vcpu->kvm, sp,
+			account_nx_huge_page(vcpu->kvm, sp,
 					     fault->req_level >= it.level);
 	}
 
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 903d0d3497b6..0e94182c87be 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -284,7 +284,7 @@ static struct kvm_mmu_page *tdp_mmu_alloc_sp(struct kvm_vcpu *vcpu)
 static void tdp_mmu_init_sp(struct kvm_mmu_page *sp, tdp_ptep_t sptep,
 			    gfn_t gfn, union kvm_mmu_page_role role)
 {
-	INIT_LIST_HEAD(&sp->lpage_disallowed_link);
+	INIT_LIST_HEAD(&sp->possible_nx_huge_page_link);
 
 	set_page_private(virt_to_page(sp->spt), (unsigned long)sp);
 
@@ -392,8 +392,8 @@ static void tdp_mmu_unlink_sp(struct kvm *kvm, struct kvm_mmu_page *sp,
 		lockdep_assert_held_write(&kvm->mmu_lock);
 
 	list_del(&sp->link);
-	if (sp->lpage_disallowed)
-		unaccount_huge_nx_page(kvm, sp);
+	if (sp->nx_huge_page_disallowed)
+		unaccount_nx_huge_page(kvm, sp);
 
 	if (shared)
 		spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
@@ -1132,7 +1132,7 @@ static int tdp_mmu_link_sp(struct kvm *kvm, struct tdp_iter *iter,
 	spin_lock(&kvm->arch.tdp_mmu_pages_lock);
 	list_add(&sp->link, &kvm->arch.tdp_mmu_pages);
 	if (account_nx)
-		account_huge_nx_page(kvm, sp, true);
+		account_nx_huge_page(kvm, sp, true);
 	spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
 
 	return 0;
-- 
2.37.1.559.g78731f0fdb-goog



* [PATCH v3 4/8] KVM: x86/mmu: Properly account NX huge page workaround for nonpaging MMUs
  2022-08-05 23:05 [PATCH v3 0/8] KVM: x86: Apply NX mitigation more precisely Sean Christopherson
                   ` (2 preceding siblings ...)
  2022-08-05 23:05 ` [PATCH v3 3/8] KVM: x86/mmu: Rename NX huge pages fields/functions for consistency Sean Christopherson
@ 2022-08-05 23:05 ` Sean Christopherson
  2022-08-16 21:25   ` Mingwei Zhang
  2022-08-05 23:05 ` [PATCH v3 5/8] KVM: x86/mmu: Set disallowed_nx_huge_page in TDP MMU before setting SPTE Sean Christopherson
                   ` (3 subsequent siblings)
  7 siblings, 1 reply; 27+ messages in thread
From: Sean Christopherson @ 2022-08-05 23:05 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: kvm, linux-kernel, David Matlack, Yan Zhao, Mingwei Zhang, Ben Gardon

Account and track NX huge pages for nonpaging MMUs so that a future
enhancement to precisely check if a shadow page can't be replaced by a NX
huge page doesn't get false positives.  Without correct tracking, KVM can
get stuck in a loop if an instruction is fetching and writing data on the
same huge page, e.g. KVM installs a small executable page on the fetch
fault, replaces it with an NX huge page on the write fault, and faults
again on the fetch.

Alternatively, and perhaps ideally, KVM would simply not enforce the
workaround for nonpaging MMUs.  The guest has no page tables to abuse
and KVM is guaranteed to switch to a different MMU on CR0.PG being
toggled, so there are no security or performance concerns.  However, getting
make_spte() to play nice now and in the future is unnecessarily complex.

In the current code base, make_spte() can enforce the mitigation if TDP
is enabled or the MMU is indirect, but make_spte() may not always have a
vCPU/MMU to work with, e.g. if KVM were to support in-line huge page
promotion when disabling dirty logging.

Without a vCPU/MMU, KVM could either pass in the correct information
and/or derive it from the shadow page, but the former is ugly and the
latter subtly non-trivial due to the possibility of direct shadow pages
in indirect MMUs.  Given that using shadow paging with an unpaged guest
is far from top priority _and_ has been subjected to the workaround since
its inception, keep it simple and just fix the accounting glitch.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: David Matlack <dmatlack@google.com>
---
 arch/x86/kvm/mmu/mmu.c  |  2 +-
 arch/x86/kvm/mmu/spte.c | 12 ++++++++++++
 2 files changed, 13 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 53d0dafa68ff..345b6b22ab68 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3123,7 +3123,7 @@ static int __direct_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 			continue;
 
 		link_shadow_page(vcpu, it.sptep, sp);
-		if (fault->is_tdp && fault->huge_page_disallowed)
+		if (fault->huge_page_disallowed)
 			account_nx_huge_page(vcpu->kvm, sp,
 					     fault->req_level >= it.level);
 	}
diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
index 7314d27d57a4..52186b795bce 100644
--- a/arch/x86/kvm/mmu/spte.c
+++ b/arch/x86/kvm/mmu/spte.c
@@ -147,6 +147,18 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
 	if (!prefetch)
 		spte |= spte_shadow_accessed_mask(spte);
 
+	/*
+	 * For simplicity, enforce the NX huge page mitigation even if not
+	 * strictly necessary.  KVM could ignore the mitigation if paging is
+	 * disabled in the guest, as the guest doesn't have any page tables to
+	 * abuse.  But to safely ignore the mitigation, KVM would have to
+	 * ensure a new MMU is loaded (or all shadow pages zapped) when CR0.PG
+	 * is toggled on, and that's a net negative for performance when TDP is
+	 * enabled.  When TDP is disabled, KVM will always switch to a new MMU
+	 * when CR0.PG is toggled, but leveraging that to ignore the mitigation
+	 * would tie make_spte() further to vCPU/MMU state, and add complexity
+	 * just to optimize a mode that is anything but performance critical.
+	 */
 	if (level > PG_LEVEL_4K && (pte_access & ACC_EXEC_MASK) &&
 	    is_nx_huge_page_enabled(vcpu->kvm)) {
 		pte_access &= ~ACC_EXEC_MASK;
-- 
2.37.1.559.g78731f0fdb-goog



* [PATCH v3 5/8] KVM: x86/mmu: Set disallowed_nx_huge_page in TDP MMU before setting SPTE
  2022-08-05 23:05 [PATCH v3 0/8] KVM: x86: Apply NX mitigation more precisely Sean Christopherson
                   ` (3 preceding siblings ...)
  2022-08-05 23:05 ` [PATCH v3 4/8] KVM: x86/mmu: Properly account NX huge page workaround for nonpaging MMUs Sean Christopherson
@ 2022-08-05 23:05 ` Sean Christopherson
  2022-08-09  3:26   ` Yan Zhao
  2022-08-05 23:05 ` [PATCH v3 6/8] KVM: x86/mmu: Track the number of TDP MMU pages, but not the actual pages Sean Christopherson
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 27+ messages in thread
From: Sean Christopherson @ 2022-08-05 23:05 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: kvm, linux-kernel, David Matlack, Yan Zhao, Mingwei Zhang, Ben Gardon

Set nx_huge_page_disallowed in TDP MMU shadow pages before making the SP
visible to other readers, i.e. before setting its SPTE.  This will allow
KVM to query the flag when determining if a shadow page can be replaced
by a NX huge page without violating the rules of the mitigation.

Note, the shadow/legacy MMU holds mmu_lock for write, so it's impossible
for another CPU to see a shadow page without an up-to-date
nx_huge_page_disallowed, i.e. only the TDP MMU needs the complicated
dance.
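
The resulting publish/observe ordering can be summarized as follows
(condensed from the diff in this patch, not a literal copy):

/* Writer, kvm_tdp_mmu_map(), mmu_lock held for read: */
sp->nx_huge_page_disallowed = fault->huge_page_disallowed;
smp_wmb();				/* flag visible before the SP is */
tdp_mmu_link_sp(kvm, &iter, sp, true);	/* install the SPTE, publishing sp */

/* Reader, e.g. tdp_mmu_unlink_sp(), possibly on another CPU: */
/* ...observe the present non-leaf SPTE and derive sp from it... */
smp_rmb();				/* read the flag only after seeing the SP */
if (sp->nx_huge_page_disallowed) {
	sp->nx_huge_page_disallowed = false;
	untrack_possible_nx_huge_page(kvm, sp);
}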

Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: David Matlack <dmatlack@google.com>
---
 arch/x86/kvm/mmu/mmu.c          | 28 +++++++++++++-------
 arch/x86/kvm/mmu/mmu_internal.h |  5 ++--
 arch/x86/kvm/mmu/tdp_mmu.c      | 46 +++++++++++++++++++++++----------
 3 files changed, 53 insertions(+), 26 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 345b6b22ab68..f81ddedbe2f7 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -802,22 +802,25 @@ static void account_shadowed(struct kvm *kvm, struct kvm_mmu_page *sp)
 		kvm_flush_remote_tlbs_with_address(kvm, gfn, 1);
 }
 
-void account_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp,
-			  bool nx_huge_page_possible)
+void track_possible_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp)
 {
 	if (KVM_BUG_ON(!list_empty(&sp->possible_nx_huge_page_link), kvm))
 		return;
 
-	sp->nx_huge_page_disallowed = true;
-
-	if (!nx_huge_page_possible)
-		return;
-
 	++kvm->stat.nx_lpage_splits;
 	list_add_tail(&sp->possible_nx_huge_page_link,
 		      &kvm->arch.possible_nx_huge_pages);
 }
 
+static void account_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp,
+				 bool nx_huge_page_possible)
+{
+	sp->nx_huge_page_disallowed = true;
+
+	if (nx_huge_page_possible)
+		track_possible_nx_huge_page(kvm, sp);
+}
+
 static void unaccount_shadowed(struct kvm *kvm, struct kvm_mmu_page *sp)
 {
 	struct kvm_memslots *slots;
@@ -835,10 +838,8 @@ static void unaccount_shadowed(struct kvm *kvm, struct kvm_mmu_page *sp)
 	kvm_mmu_gfn_allow_lpage(slot, gfn);
 }
 
-void unaccount_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp)
+void untrack_possible_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp)
 {
-	sp->nx_huge_page_disallowed = false;
-
 	if (list_empty(&sp->possible_nx_huge_page_link))
 		return;
 
@@ -846,6 +847,13 @@ void unaccount_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp)
 	list_del_init(&sp->possible_nx_huge_page_link);
 }
 
+static void unaccount_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp)
+{
+	sp->nx_huge_page_disallowed = false;
+
+	untrack_possible_nx_huge_page(kvm, sp);
+}
+
 static struct kvm_memory_slot *
 gfn_to_memslot_dirty_bitmap(struct kvm_vcpu *vcpu, gfn_t gfn,
 			    bool no_dirty_log)
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index 67879459a25c..22152241bd29 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -328,8 +328,7 @@ void disallowed_hugepage_adjust(struct kvm_page_fault *fault, u64 spte, int cur_
 
 void *mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
 
-void account_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp,
-			  bool nx_huge_page_possible);
-void unaccount_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp);
+void track_possible_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp);
+void untrack_possible_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp);
 
 #endif /* __KVM_X86_MMU_INTERNAL_H */
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 0e94182c87be..34994ca3d45b 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -392,8 +392,19 @@ static void tdp_mmu_unlink_sp(struct kvm *kvm, struct kvm_mmu_page *sp,
 		lockdep_assert_held_write(&kvm->mmu_lock);
 
 	list_del(&sp->link);
-	if (sp->nx_huge_page_disallowed)
-		unaccount_nx_huge_page(kvm, sp);
+
+	/*
+	 * Ensure nx_huge_page_disallowed is read after observing the present
+	 * shadow page.  A different vCPU may have _just_ finished installing
+	 * the shadow page if mmu_lock is held for read.  Pairs with the
+	 * smp_wmb() in kvm_tdp_mmu_map().
+	 */
+	smp_rmb();
+
+	if (sp->nx_huge_page_disallowed) {
+		sp->nx_huge_page_disallowed = false;
+		untrack_possible_nx_huge_page(kvm, sp);
+	}
 
 	if (shared)
 		spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
@@ -1107,16 +1118,13 @@ static int tdp_mmu_map_handle_target_level(struct kvm_vcpu *vcpu,
  * @kvm: kvm instance
  * @iter: a tdp_iter instance currently on the SPTE that should be set
  * @sp: The new TDP page table to install.
- * @account_nx: True if this page table is being installed to split a
- *              non-executable huge page.
  * @shared: This operation is running under the MMU lock in read mode.
  *
  * Returns: 0 if the new page table was installed. Non-0 if the page table
  *          could not be installed (e.g. the atomic compare-exchange failed).
  */
 static int tdp_mmu_link_sp(struct kvm *kvm, struct tdp_iter *iter,
-			   struct kvm_mmu_page *sp, bool account_nx,
-			   bool shared)
+			   struct kvm_mmu_page *sp, bool shared)
 {
 	u64 spte = make_nonleaf_spte(sp->spt, !kvm_ad_enabled());
 	int ret = 0;
@@ -1131,8 +1139,6 @@ static int tdp_mmu_link_sp(struct kvm *kvm, struct tdp_iter *iter,
 
 	spin_lock(&kvm->arch.tdp_mmu_pages_lock);
 	list_add(&sp->link, &kvm->arch.tdp_mmu_pages);
-	if (account_nx)
-		account_nx_huge_page(kvm, sp, true);
 	spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
 
 	return 0;
@@ -1145,6 +1151,7 @@ static int tdp_mmu_link_sp(struct kvm *kvm, struct tdp_iter *iter,
 int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 {
 	struct kvm_mmu *mmu = vcpu->arch.mmu;
+	struct kvm *kvm = vcpu->kvm;
 	struct tdp_iter iter;
 	struct kvm_mmu_page *sp;
 	int ret;
@@ -1181,9 +1188,6 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 		}
 
 		if (!is_shadow_present_pte(iter.old_spte)) {
-			bool account_nx = fault->huge_page_disallowed &&
-					  fault->req_level >= iter.level;
-
 			/*
 			 * If SPTE has been frozen by another thread, just
 			 * give up and retry, avoiding unnecessary page table
@@ -1195,10 +1199,26 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 			sp = tdp_mmu_alloc_sp(vcpu);
 			tdp_mmu_init_child_sp(sp, &iter);
 
-			if (tdp_mmu_link_sp(vcpu->kvm, &iter, sp, account_nx, true)) {
+			sp->nx_huge_page_disallowed = fault->huge_page_disallowed;
+
+			/*
+			 * Ensure nx_huge_page_disallowed is visible before the
+			 * SP is marked present, as mmu_lock is held for read.
+			 * Pairs with the smp_rmb() in tdp_mmu_unlink_sp().
+			 */
+			smp_wmb();
+
+			if (tdp_mmu_link_sp(kvm, &iter, sp, true)) {
 				tdp_mmu_free_sp(sp);
 				break;
 			}
+
+			if (fault->huge_page_disallowed &&
+			    fault->req_level >= iter.level) {
+				spin_lock(&kvm->arch.tdp_mmu_pages_lock);
+				track_possible_nx_huge_page(kvm, sp);
+				spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
+			}
 		}
 	}
 
@@ -1486,7 +1506,7 @@ static int tdp_mmu_split_huge_page(struct kvm *kvm, struct tdp_iter *iter,
 	 * correctness standpoint since the translation will be the same either
 	 * way.
 	 */
-	ret = tdp_mmu_link_sp(kvm, iter, sp, false, shared);
+	ret = tdp_mmu_link_sp(kvm, iter, sp, shared);
 	if (ret)
 		goto out;
 
-- 
2.37.1.559.g78731f0fdb-goog



* [PATCH v3 6/8] KVM: x86/mmu: Track the number of TDP MMU pages, but not the actual pages
  2022-08-05 23:05 [PATCH v3 0/8] KVM: x86: Apply NX mitigation more precisely Sean Christopherson
                   ` (4 preceding siblings ...)
  2022-08-05 23:05 ` [PATCH v3 5/8] KVM: x86/mmu: Set disallowed_nx_huge_page in TDP MMU before setting SPTE Sean Christopherson
@ 2022-08-05 23:05 ` Sean Christopherson
  2022-08-05 23:05 ` [PATCH v3 7/8] KVM: x86/mmu: Add helper to convert SPTE value to its shadow page Sean Christopherson
  2022-08-05 23:05 ` [PATCH v3 8/8] KVM: x86/mmu: explicitly check nx_hugepage in disallowed_hugepage_adjust() Sean Christopherson
  7 siblings, 0 replies; 27+ messages in thread
From: Sean Christopherson @ 2022-08-05 23:05 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: kvm, linux-kernel, David Matlack, Yan Zhao, Mingwei Zhang, Ben Gardon

Track the number of TDP MMU "shadow" pages instead of tracking the pages
themselves. With the NX huge page list manipulation moved out of the common
linking flow, eliminating the list-based tracking means the happy path of
adding a shadow page doesn't need to acquire a spinlock and can instead
inc/dec an atomic.

Keep the tracking as the WARN during TDP MMU teardown on leaked shadow
pages is very, very useful for detecting KVM bugs.
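
Concretely, the list manipulation in the link/unlink paths collapses to
a bare counter, roughly (condensed from the diff below, not a literal
copy):

/* Link (kvm_tdp_mmu_map() path): no spinlock needed on the happy path. */
atomic64_inc(&kvm->arch.tdp_mmu_pages);

/* Unlink (tdp_mmu_unlink_sp()): */
atomic64_dec(&kvm->arch.tdp_mmu_pages);

/* Teardown (kvm_mmu_uninit_tdp_mmu()): still catches leaked shadow pages. */
WARN_ON(atomic64_read(&kvm->arch.tdp_mmu_pages));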

Tracking the number of pages will also make it trivial to expose the
counter to userspace as a stat in the future, which may or may not be
desirable.

Note, the TDP MMU needs to use a separate counter (and stat if that ever
comes to be) from the existing n_used_mmu_pages. The TDP MMU doesn't bother
supporting the shrinker nor does it honor KVM_SET_NR_MMU_PAGES (because the
TDP MMU consumes so few pages relative to shadow paging), and including TDP
MMU pages in that counter would break both the shrinker and shadow MMUs,
e.g. if a VM is using nested TDP.

Cc: Yan Zhao <yan.y.zhao@intel.com>
Reviewed-by: Mingwei Zhang <mizhang@google.com>
Reviewed-by: David Matlack <dmatlack@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/include/asm/kvm_host.h | 11 +++--------
 arch/x86/kvm/mmu/tdp_mmu.c      | 28 +++++++++++++---------------
 2 files changed, 16 insertions(+), 23 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 5634347e5d05..7ac0c5612319 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1271,6 +1271,9 @@ struct kvm_arch {
 	 */
 	bool tdp_mmu_enabled;
 
+	/* The number of TDP MMU pages across all roots. */
+	atomic64_t tdp_mmu_pages;
+
 	/*
 	 * List of struct kvm_mmu_pages being used as roots.
 	 * All struct kvm_mmu_pages in the list should have
@@ -1291,18 +1294,10 @@ struct kvm_arch {
 	 */
 	struct list_head tdp_mmu_roots;
 
-	/*
-	 * List of struct kvmp_mmu_pages not being used as roots.
-	 * All struct kvm_mmu_pages in the list should have
-	 * tdp_mmu_page set and a tdp_mmu_root_count of 0.
-	 */
-	struct list_head tdp_mmu_pages;
-
 	/*
 	 * Protects accesses to the following fields when the MMU lock
 	 * is held in read mode:
 	 *  - tdp_mmu_roots (above)
-	 *  - tdp_mmu_pages (above)
 	 *  - the link field of struct kvm_mmu_pages used by the TDP MMU
 	 *  - possible_nx_huge_pages;
 	 *  - the possible_nx_huge_page_link field of struct kvm_mmu_pages used
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 34994ca3d45b..526d38704e5c 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -29,7 +29,6 @@ int kvm_mmu_init_tdp_mmu(struct kvm *kvm)
 	kvm->arch.tdp_mmu_enabled = true;
 	INIT_LIST_HEAD(&kvm->arch.tdp_mmu_roots);
 	spin_lock_init(&kvm->arch.tdp_mmu_pages_lock);
-	INIT_LIST_HEAD(&kvm->arch.tdp_mmu_pages);
 	kvm->arch.tdp_mmu_zap_wq = wq;
 	return 1;
 }
@@ -54,7 +53,7 @@ void kvm_mmu_uninit_tdp_mmu(struct kvm *kvm)
 	/* Also waits for any queued work items.  */
 	destroy_workqueue(kvm->arch.tdp_mmu_zap_wq);
 
-	WARN_ON(!list_empty(&kvm->arch.tdp_mmu_pages));
+	WARN_ON(atomic64_read(&kvm->arch.tdp_mmu_pages));
 	WARN_ON(!list_empty(&kvm->arch.tdp_mmu_roots));
 
 	/*
@@ -386,12 +385,7 @@ static void handle_changed_spte_dirty_log(struct kvm *kvm, int as_id, gfn_t gfn,
 static void tdp_mmu_unlink_sp(struct kvm *kvm, struct kvm_mmu_page *sp,
 			      bool shared)
 {
-	if (shared)
-		spin_lock(&kvm->arch.tdp_mmu_pages_lock);
-	else
-		lockdep_assert_held_write(&kvm->mmu_lock);
-
-	list_del(&sp->link);
+	atomic64_dec(&kvm->arch.tdp_mmu_pages);
 
 	/*
 	 * Ensure nx_huge_page_disallowed is read after observing the present
@@ -401,10 +395,16 @@ static void tdp_mmu_unlink_sp(struct kvm *kvm, struct kvm_mmu_page *sp,
 	 */
 	smp_rmb();
 
-	if (sp->nx_huge_page_disallowed) {
-		sp->nx_huge_page_disallowed = false;
-		untrack_possible_nx_huge_page(kvm, sp);
-	}
+	if (!sp->nx_huge_page_disallowed)
+		return;
+
+	if (shared)
+		spin_lock(&kvm->arch.tdp_mmu_pages_lock);
+	else
+		lockdep_assert_held_write(&kvm->mmu_lock);
+
+	sp->nx_huge_page_disallowed = false;
+	untrack_possible_nx_huge_page(kvm, sp);
 
 	if (shared)
 		spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
@@ -1137,9 +1137,7 @@ static int tdp_mmu_link_sp(struct kvm *kvm, struct tdp_iter *iter,
 		tdp_mmu_set_spte(kvm, iter, spte);
 	}
 
-	spin_lock(&kvm->arch.tdp_mmu_pages_lock);
-	list_add(&sp->link, &kvm->arch.tdp_mmu_pages);
-	spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
+	atomic64_inc(&kvm->arch.tdp_mmu_pages);
 
 	return 0;
 }
-- 
2.37.1.559.g78731f0fdb-goog



* [PATCH v3 7/8] KVM: x86/mmu: Add helper to convert SPTE value to its shadow page
  2022-08-05 23:05 [PATCH v3 0/8] KVM: x86: Apply NX mitigation more precisely Sean Christopherson
                   ` (5 preceding siblings ...)
  2022-08-05 23:05 ` [PATCH v3 6/8] KVM: x86/mmu: Track the number of TDP MMU pages, but not the actual pages Sean Christopherson
@ 2022-08-05 23:05 ` Sean Christopherson
  2022-08-05 23:05 ` [PATCH v3 8/8] KVM: x86/mmu: explicitly check nx_hugepage in disallowed_hugepage_adjust() Sean Christopherson
  7 siblings, 0 replies; 27+ messages in thread
From: Sean Christopherson @ 2022-08-05 23:05 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: kvm, linux-kernel, David Matlack, Yan Zhao, Mingwei Zhang, Ben Gardon

Add a helper to convert a SPTE to its shadow page to deduplicate a
variety of flows and hopefully avoid future bugs, e.g. if KVM attempts to
get the shadow page for a SPTE without dropping high bits.

Opportunistically add a comment in mmu_free_root_page() documenting why
it treats the root HPA as a SPTE.

No functional change intended.
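
The resulting call-site pattern (taken from the diff below) is:

/* Before: every caller masks off non-address bits by hand. */
child = to_shadow_page(spte & SPTE_BASE_ADDR_MASK);

/* After: the masking lives in a single helper. */
child = spte_to_child_sp(spte);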

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/mmu/mmu.c          | 17 ++++++++++-------
 arch/x86/kvm/mmu/mmu_internal.h | 12 ------------
 arch/x86/kvm/mmu/spte.h         | 17 +++++++++++++++++
 arch/x86/kvm/mmu/tdp_mmu.h      |  2 ++
 4 files changed, 29 insertions(+), 19 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index f81ddedbe2f7..1442129c85e0 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1796,7 +1796,7 @@ static int __mmu_unsync_walk(struct kvm_mmu_page *sp,
 			continue;
 		}
 
-		child = to_shadow_page(ent & SPTE_BASE_ADDR_MASK);
+		child = spte_to_child_sp(ent);
 
 		if (child->unsync_children) {
 			if (mmu_pages_add(pvec, child, i))
@@ -2355,7 +2355,7 @@ static void validate_direct_spte(struct kvm_vcpu *vcpu, u64 *sptep,
 		 * so we should update the spte at this point to get
 		 * a new sp with the correct access.
 		 */
-		child = to_shadow_page(*sptep & SPTE_BASE_ADDR_MASK);
+		child = spte_to_child_sp(*sptep);
 		if (child->role.access == direct_access)
 			return;
 
@@ -2376,7 +2376,7 @@ static int mmu_page_zap_pte(struct kvm *kvm, struct kvm_mmu_page *sp,
 		if (is_last_spte(pte, sp->role.level)) {
 			drop_spte(kvm, spte);
 		} else {
-			child = to_shadow_page(pte & SPTE_BASE_ADDR_MASK);
+			child = spte_to_child_sp(pte);
 			drop_parent_pte(child, spte);
 
 			/*
@@ -2815,7 +2815,7 @@ static int mmu_set_spte(struct kvm_vcpu *vcpu, struct kvm_memory_slot *slot,
 			struct kvm_mmu_page *child;
 			u64 pte = *sptep;
 
-			child = to_shadow_page(pte & SPTE_BASE_ADDR_MASK);
+			child = spte_to_child_sp(pte);
 			drop_parent_pte(child, sptep);
 			flush = true;
 		} else if (pfn != spte_to_pfn(*sptep)) {
@@ -3427,7 +3427,11 @@ static void mmu_free_root_page(struct kvm *kvm, hpa_t *root_hpa,
 	if (!VALID_PAGE(*root_hpa))
 		return;
 
-	sp = to_shadow_page(*root_hpa & SPTE_BASE_ADDR_MASK);
+	/*
+	 * The "root" may be a special root, e.g. a PAE entry, treat it as a
+	 * SPTE to ensure any non-PA bits are dropped.
+	 */
+	sp = spte_to_child_sp(*root_hpa);
 	if (WARN_ON(!sp))
 		return;
 
@@ -3912,8 +3916,7 @@ void kvm_mmu_sync_roots(struct kvm_vcpu *vcpu)
 		hpa_t root = vcpu->arch.mmu->pae_root[i];
 
 		if (IS_VALID_PAE_ROOT(root)) {
-			root &= SPTE_BASE_ADDR_MASK;
-			sp = to_shadow_page(root);
+			sp = spte_to_child_sp(root);
 			mmu_sync_children(vcpu, sp, true);
 		}
 	}
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index 22152241bd29..dbaf6755c5a7 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -133,18 +133,6 @@ struct kvm_mmu_page {
 
 extern struct kmem_cache *mmu_page_header_cache;
 
-static inline struct kvm_mmu_page *to_shadow_page(hpa_t shadow_page)
-{
-	struct page *page = pfn_to_page(shadow_page >> PAGE_SHIFT);
-
-	return (struct kvm_mmu_page *)page_private(page);
-}
-
-static inline struct kvm_mmu_page *sptep_to_sp(u64 *sptep)
-{
-	return to_shadow_page(__pa(sptep));
-}
-
 static inline int kvm_mmu_role_as_id(union kvm_mmu_page_role role)
 {
 	return role.smm ? 1 : 0;
diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
index cabe3fbb4f39..37aa4a9c3d75 100644
--- a/arch/x86/kvm/mmu/spte.h
+++ b/arch/x86/kvm/mmu/spte.h
@@ -207,6 +207,23 @@ static inline int spte_index(u64 *sptep)
  */
 extern u64 __read_mostly shadow_nonpresent_or_rsvd_lower_gfn_mask;
 
+static inline struct kvm_mmu_page *to_shadow_page(hpa_t shadow_page)
+{
+	struct page *page = pfn_to_page((shadow_page) >> PAGE_SHIFT);
+
+	return (struct kvm_mmu_page *)page_private(page);
+}
+
+static inline struct kvm_mmu_page *spte_to_child_sp(u64 spte)
+{
+	return to_shadow_page(spte & SPTE_BASE_ADDR_MASK);
+}
+
+static inline struct kvm_mmu_page *sptep_to_sp(u64 *sptep)
+{
+	return to_shadow_page(__pa(sptep));
+}
+
 static inline bool is_mmio_spte(u64 spte)
 {
 	return (spte & shadow_mmio_mask) == shadow_mmio_value &&
diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
index c163f7cc23ca..d3714200b932 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.h
+++ b/arch/x86/kvm/mmu/tdp_mmu.h
@@ -5,6 +5,8 @@
 
 #include <linux/kvm_host.h>
 
+#include "spte.h"
+
 hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu);
 
 __must_check static inline bool kvm_tdp_mmu_get_root(struct kvm_mmu_page *root)
-- 
2.37.1.559.g78731f0fdb-goog



* [PATCH v3 8/8] KVM: x86/mmu: explicitly check nx_hugepage in disallowed_hugepage_adjust()
  2022-08-05 23:05 [PATCH v3 0/8] KVM: x86: Apply NX mitigation more precisely Sean Christopherson
                   ` (6 preceding siblings ...)
  2022-08-05 23:05 ` [PATCH v3 7/8] KVM: x86/mmu: Add helper to convert SPTE value to its shadow page Sean Christopherson
@ 2022-08-05 23:05 ` Sean Christopherson
  2022-08-09 12:57   ` Paolo Bonzini
  7 siblings, 1 reply; 27+ messages in thread
From: Sean Christopherson @ 2022-08-05 23:05 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: kvm, linux-kernel, David Matlack, Yan Zhao, Mingwei Zhang, Ben Gardon

From: Mingwei Zhang <mizhang@google.com>

Explicitly check if a NX huge page is disallowed when determining if a
page fault needs to be forced to use a smaller sized page.  KVM currently
assumes that the NX huge page mitigation is the only scenario where KVM
will force a shadow page instead of a huge page, and so unnecessarily
keeps an existing shadow page instead of replacing it with a huge page.

Any scenario that causes KVM to zap leaf SPTEs may result in having a SP
that can be made huge without violating the NX huge page mitigation.
E.g. prior to commit 5ba7c4c6d1c7 ("KVM: x86/MMU: Zap non-leaf SPTEs when
disabling dirty logging"), KVM would keep shadow pages after disabling
dirty logging due to a live migration being canceled, resulting in
degraded performance due to running with 4KiB pages instead of huge pages.

Although the dirty logging case is "fixed", that fix is coincidental,
i.e. is an implementation detail, and there are other scenarios where KVM
will zap leaf SPTEs.  E.g. zapping leaf SPTEs in response to a host page
migration (mmu_notifier invalidation) to create a huge page would yield a
similar result; KVM would see the shadow-present non-leaf SPTE and assume
a huge page is disallowed.

Fixes: b8e8c8303ff2 ("kvm: mmu: ITLB_MULTIHIT mitigation")
Reviewed-by: Ben Gardon <bgardon@google.com>
Reviewed-by: David Matlack <dmatlack@google.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
[sean: add barrier comments, use spte_to_child_sp(), massage changelog]
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/mmu/mmu.c     | 17 +++++++++++++++--
 arch/x86/kvm/mmu/tdp_mmu.c |  3 ++-
 2 files changed, 17 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 1442129c85e0..3ddfc82868fd 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3090,6 +3090,19 @@ void disallowed_hugepage_adjust(struct kvm_page_fault *fault, u64 spte, int cur_
 	    cur_level == fault->goal_level &&
 	    is_shadow_present_pte(spte) &&
 	    !is_large_pte(spte)) {
+		u64 page_mask;
+
+		/*
+		 * Ensure nx_huge_page_disallowed is read after checking for a
+		 * present shadow page.  A different vCPU may be concurrently
+		 * installing the shadow page if mmu_lock is held for read.
+		 * Pairs with the smp_wmb() in kvm_tdp_mmu_map().
+		 */
+		smp_rmb();
+
+		if (!spte_to_child_sp(spte)->nx_huge_page_disallowed)
+			return;
+
 		/*
 		 * A small SPTE exists for this pfn, but FNAME(fetch)
 		 * and __direct_map would like to create a large PTE
@@ -3097,8 +3110,8 @@ void disallowed_hugepage_adjust(struct kvm_page_fault *fault, u64 spte, int cur_
 		 * patching back for them into pfn the next 9 bits of
 		 * the address.
 		 */
-		u64 page_mask = KVM_PAGES_PER_HPAGE(cur_level) -
-				KVM_PAGES_PER_HPAGE(cur_level - 1);
+		page_mask = KVM_PAGES_PER_HPAGE(cur_level) -
+			    KVM_PAGES_PER_HPAGE(cur_level - 1);
 		fault->pfn |= fault->gfn & page_mask;
 		fault->goal_level--;
 	}
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 526d38704e5c..c5314ca95e08 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1202,7 +1202,8 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 			/*
 			 * Ensure nx_huge_page_disallowed is visible before the
 			 * SP is marked present, as mmu_lock is held for read.
-			 * Pairs with the smp_rmb() in tdp_mmu_unlink_sp().
+			 * Pairs with the smp_rmb() in tdp_mmu_unlink_sp() and
+			 * in disallowed_hugepage_adjust().
 			 */
 			smp_wmb();
 
-- 
2.37.1.559.g78731f0fdb-goog



* Re: [PATCH v3 5/8] KVM: x86/mmu: Set disallowed_nx_huge_page in TDP MMU before setting SPTE
  2022-08-05 23:05 ` [PATCH v3 5/8] KVM: x86/mmu: Set disallowed_nx_huge_page in TDP MMU before setting SPTE Sean Christopherson
@ 2022-08-09  3:26   ` Yan Zhao
  2022-08-09 12:49     ` Paolo Bonzini
  0 siblings, 1 reply; 27+ messages in thread
From: Yan Zhao @ 2022-08-09  3:26 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, kvm, linux-kernel, David Matlack, Mingwei Zhang,
	Ben Gardon

On Fri, Aug 05, 2022 at 11:05:10PM +0000, Sean Christopherson wrote:
> Set nx_huge_page_disallowed in TDP MMU shadow pages before making the SP
> visible to other readers, i.e. before setting its SPTE.  This will allow
> KVM to query the flag when determining if a shadow page can be replaced
> by a NX huge page without violating the rules of the mitigation.
> 
> Note, the shadow/legacy MMU holds mmu_lock for write, so it's impossible
> for another CPU to see a shadow page without an up-to-date
> nx_huge_page_disallowed, i.e. only the TDP MMU needs the complicated
> dance.
> 
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> Reviewed-by: David Matlack <dmatlack@google.com>
> ---
>  arch/x86/kvm/mmu/mmu.c          | 28 +++++++++++++-------
>  arch/x86/kvm/mmu/mmu_internal.h |  5 ++--
>  arch/x86/kvm/mmu/tdp_mmu.c      | 46 +++++++++++++++++++++++----------
>  3 files changed, 53 insertions(+), 26 deletions(-)
>
<snip>

> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 0e94182c87be..34994ca3d45b 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -392,8 +392,19 @@ static void tdp_mmu_unlink_sp(struct kvm *kvm, struct kvm_mmu_page *sp,
>  		lockdep_assert_held_write(&kvm->mmu_lock);
>  
>  	list_del(&sp->link);
> -	if (sp->nx_huge_page_disallowed)
> -		unaccount_nx_huge_page(kvm, sp);
> +
> +	/*
> +	 * Ensure nx_huge_page_disallowed is read after observing the present
> +	 * shadow page.  A different vCPU may have _just_ finished installing
> +	 * the shadow page if mmu_lock is held for read.  Pairs with the
> +	 * smp_wmb() in kvm_tdp_mmu_map().
> +	 */
> +	smp_rmb();
hi Sean,

I understand this smp_rmb() is intended to prevent the reading of
p->nx_huge_page_disallowed from happening before it's set to true in
kvm_tdp_mmu_map(). Is this understanding right?

If it's true, then do we also need the smp_rmb() for read of sp->gfn in
handle_removed_pt()? (or maybe for other fields in sp in other places?)

Thanks
Yan

> +
> +	if (sp->nx_huge_page_disallowed) {
> +		sp->nx_huge_page_disallowed = false;
> +		untrack_possible_nx_huge_page(kvm, sp);
> +	}
>  
>  	if (shared)
>  		spin_unlock(&kvm->arch.tdp_mmu_pages_lock);





^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v3 5/8] KVM: x86/mmu: Set disallowed_nx_huge_page in TDP MMU before setting SPTE
  2022-08-09  3:26   ` Yan Zhao
@ 2022-08-09 12:49     ` Paolo Bonzini
  2022-08-09 14:44       ` Sean Christopherson
  0 siblings, 1 reply; 27+ messages in thread
From: Paolo Bonzini @ 2022-08-09 12:49 UTC (permalink / raw)
  To: Yan Zhao, Sean Christopherson
  Cc: kvm, linux-kernel, David Matlack, Mingwei Zhang, Ben Gardon

On 8/9/22 05:26, Yan Zhao wrote:
> hi Sean,
> 
> I understand this smp_rmb() is intended to prevent the reading of
> sp->nx_huge_page_disallowed from happening before it's set to true in
> kvm_tdp_mmu_map(). Is this understanding right?
> 
> If it's true, then do we also need the smp_rmb() for read of sp->gfn in
> handle_removed_pt()? (or maybe for other fields in sp in other places?)

No, in that case the barrier is provided by rcu_dereference().  In fact, 
I am not sure the barriers are needed in this patch either (but the 
comments are :)):

- the write barrier is certainly not needed because it is implicit in 
tdp_mmu_set_spte_atomic's cmpxchg64

- the read barrier _should_ also be provided by rcu_dereference(pt), but 
I'm not 100% sure about that. The reasoning is that you have

(1)	iter->old_spte = READ_ONCE(*rcu_dereference(iter->sptep));
	...
(2)	tdp_ptep_t pt = spte_to_child_pt(old_spte, level);
(3)	struct kvm_mmu_page *sp = sptep_to_sp(rcu_dereference(pt));
	...
(4)	if (sp->nx_huge_page_disallowed) {

and (4) is definitely ordered after (1) thanks to the READ_ONCE hidden 
within (3) and the data dependency from old_spte to sp.

Paolo


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v3 8/8] KVM: x86/mmu: explicitly check nx_hugepage in disallowed_hugepage_adjust()
  2022-08-05 23:05 ` [PATCH v3 8/8] KVM: x86/mmu: explicitly check nx_hugepage in disallowed_hugepage_adjust() Sean Christopherson
@ 2022-08-09 12:57   ` Paolo Bonzini
  2022-08-09 14:49     ` Sean Christopherson
  0 siblings, 1 reply; 27+ messages in thread
From: Paolo Bonzini @ 2022-08-09 12:57 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: kvm, linux-kernel, David Matlack, Yan Zhao, Mingwei Zhang, Ben Gardon

On 8/6/22 01:05, Sean Christopherson wrote:
>   	    !is_large_pte(spte)) {
> +		u64 page_mask;
> +
> +		/*
> +		 * Ensure nx_huge_page_disallowed is read after checking for a
> +		 * present shadow page.  A different vCPU may be concurrently
> +		 * installing the shadow page if mmu_lock is held for read.
> +		 * Pairs with the smp_wmb() in kvm_tdp_mmu_map().
> +		 */
> +		smp_rmb();
> +
> +		if (!spte_to_child_sp(spte)->nx_huge_page_disallowed)
> +			return;
> +

I wonder if the barrier shouldn't be simply in to_shadow_page(), i.e. 
always assume in the TDP MMU code that sp->xyz is read after the SPTE 
that points to that struct kvm_mmu_page.

Paolo


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v3 5/8] KVM: x86/mmu: Set disallowed_nx_huge_page in TDP MMU before setting SPTE
  2022-08-09 12:49     ` Paolo Bonzini
@ 2022-08-09 14:44       ` Sean Christopherson
  2022-08-09 14:48         ` Paolo Bonzini
  0 siblings, 1 reply; 27+ messages in thread
From: Sean Christopherson @ 2022-08-09 14:44 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Yan Zhao, kvm, linux-kernel, David Matlack, Mingwei Zhang, Ben Gardon

On Tue, Aug 09, 2022, Paolo Bonzini wrote:
> On 8/9/22 05:26, Yan Zhao wrote:
> > hi Sean,
> > 
> > I understand this smp_rmb() is intended to prevent the reading of
> > sp->nx_huge_page_disallowed from happening before it's set to true in
> > kvm_tdp_mmu_map(). Is this understanding right?
> > 
> > If it's true, then do we also need the smp_rmb() for read of sp->gfn in
> > handle_removed_pt()? (or maybe for other fields in sp in other places?)
> 
> No, in that case the barrier is provided by rcu_dereference().  In fact, I
> am not sure the barriers are needed in this patch either (but the comments
> are :)):

Yeah, I'm 99% certain the barriers aren't strictly required, but I didn't love the
idea of depending on other implementation details for the barriers.  Of course I
completely overlooked the fact that all other sp fields would need the same
barriers...

> - the write barrier is certainly not needed because it is implicit in
> tdp_mmu_set_spte_atomic's cmpxchg64
> 
> - the read barrier _should_ also be provided by rcu_dereference(pt), but I'm
> not 100% sure about that. The reasoning is that you have
> 
> (1)	iter->old_spte = READ_ONCE(*rcu_dereference(iter->sptep));
> 	...
> (2)	tdp_ptep_t pt = spte_to_child_pt(old_spte, level);
> (3)	struct kvm_mmu_page *sp = sptep_to_sp(rcu_dereference(pt));
> 	...
> (4)	if (sp->nx_huge_page_disallowed) {
> 
> and (4) is definitely ordered after (1) thanks to the READ_ONCE hidden
> within (3) and the data dependency from old_spte to sp.

Yes, I think that's correct.  Callers must verify the SPTE is present before getting
the associated child shadow page.  KVM does have instances where a shadow page is
retrieved from the SPTE _pointer_, but that's the parent shadow page, i.e. isn't
guarded by the SPTE being present.

	struct kvm_mmu_page *sp = sptep_to_sp(rcu_dereference(iter->sptep));

Something like this as a separate patch?

diff --git a/arch/x86/kvm/mmu/tdp_iter.h b/arch/x86/kvm/mmu/tdp_iter.h
index f0af385c56e0..9d982ccf4567 100644
--- a/arch/x86/kvm/mmu/tdp_iter.h
+++ b/arch/x86/kvm/mmu/tdp_iter.h
@@ -13,6 +13,12 @@
  * to be zapped while holding mmu_lock for read, and to allow TLB flushes to be
  * batched without having to collect the list of zapped SPs.  Flows that can
  * remove SPs must service pending TLB flushes prior to dropping RCU protection.
+ *
+ * The READ_ONCE() ensures that, if the SPTE points at a child shadow page, all
+ * fields in struct kvm_mmu_page will be read after the caller observes the
+ * present SPTE (KVM must check that the SPTE is present before following the
+ * SPTE's pfn to its associated shadow page).  Pairs with the implicit memory
+ * barrier in tdp_mmu_set_spte_atomic().
  */
 static inline u64 kvm_tdp_mmu_read_spte(tdp_ptep_t sptep)
 {
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index bf2ccf9debca..ca50296e3696 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -645,6 +645,11 @@ static inline int tdp_mmu_set_spte_atomic(struct kvm *kvm,
        lockdep_assert_held_read(&kvm->mmu_lock);

        /*
+        * The atomic CMPXCHG64 provides an implicit memory barrier and ensures
+        * that, if the SPTE points at a shadow page, all struct kvm_mmu_page
+        * fields are visible to readers before the SPTE is marked present.
+        * Pairs with ordering guarantees provided by kvm_tdp_mmu_read_spte().
+        *
         * Note, fast_pf_fix_direct_spte() can also modify TDP MMU SPTEs and
         * does not hold the mmu_lock.
         */

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [PATCH v3 5/8] KVM: x86/mmu: Set disallowed_nx_huge_page in TDP MMU before setting SPTE
  2022-08-09 14:44       ` Sean Christopherson
@ 2022-08-09 14:48         ` Paolo Bonzini
  2022-08-09 15:05           ` Sean Christopherson
  0 siblings, 1 reply; 27+ messages in thread
From: Paolo Bonzini @ 2022-08-09 14:48 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Yan Zhao, kvm, linux-kernel, David Matlack, Mingwei Zhang, Ben Gardon

On 8/9/22 16:44, Sean Christopherson wrote:
> On Tue, Aug 09, 2022, Paolo Bonzini wrote:
>> On 8/9/22 05:26, Yan Zhao wrote:
>>> hi Sean,
>>>
>>> I understand this smp_rmb() is intended to prevent the reading of
>>> sp->nx_huge_page_disallowed from happening before it's set to true in
>>> kvm_tdp_mmu_map(). Is this understanding right?
>>>
>>> If it's true, then do we also need the smp_rmb() for read of sp->gfn in
>>> handle_removed_pt()? (or maybe for other fields in sp in other places?)
>>
>> No, in that case the barrier is provided by rcu_dereference().  In fact, I
>> am not sure the barriers are needed in this patch either (but the comments
>> are :)):
> 
> Yeah, I'm 99% certain the barriers aren't strictly required, but I didn't love the
> idea of depending on other implementation details for the barriers.  Of course I
> completely overlooked the fact that all other sp fields would need the same
> barriers...
> 
>> - the write barrier is certainly not needed because it is implicit in
>> tdp_mmu_set_spte_atomic's cmpxchg64
>>
>> - the read barrier _should_ also be provided by rcu_dereference(pt), but I'm
>> not 100% sure about that. The reasoning is that you have
>>
>> (1)	iter->old_spte = READ_ONCE(*rcu_dereference(iter->sptep));
>> 	...
>> (2)	tdp_ptep_t pt = spte_to_child_pt(old_spte, level);
>> (3)	struct kvm_mmu_page *sp = sptep_to_sp(rcu_dereference(pt));
>> 	...
>> (4)	if (sp->nx_huge_page_disallowed) {
>>
>> and (4) is definitely ordered after (1) thanks to the READ_ONCE hidden
>> within (3) and the data dependency from old_spte to sp.
> 
> Yes, I think that's correct.  Callers must verify the SPTE is present before getting
> the associated child shadow page.  KVM does have instances where a shadow page is
> retrieved from the SPTE _pointer_, but that's the parent shadow page, i.e. isn't
> guarded by the SPTE being present.
> 
> 	struct kvm_mmu_page *sp = sptep_to_sp(rcu_dereference(iter->sptep));
> 
> Something like this as a separate patch?

Would you resubmit without the memory barriers then?

> diff --git a/arch/x86/kvm/mmu/tdp_iter.h b/arch/x86/kvm/mmu/tdp_iter.h
> index f0af385c56e0..9d982ccf4567 100644
> --- a/arch/x86/kvm/mmu/tdp_iter.h
> +++ b/arch/x86/kvm/mmu/tdp_iter.h
> @@ -13,6 +13,12 @@
>    * to be zapped while holding mmu_lock for read, and to allow TLB flushes to be
>    * batched without having to collect the list of zapped SPs.  Flows that can
>    * remove SPs must service pending TLB flushes prior to dropping RCU protection.
> + *
> + * The READ_ONCE() ensures that, if the SPTE points at a child shadow page, all
> + * fields in struct kvm_mmu_page will be read after the caller observes the
> + * present SPTE (KVM must check that the SPTE is present before following the
> + * SPTE's pfn to its associated shadow page).  Pairs with the implicit memory

I guess you mean both the shadow page table itself and the struct 
kvm_mmu_page?  Or do you think to_shadow_page() should have a smp_rmb()?

Paolo


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v3 8/8] KVM: x86/mmu: explicitly check nx_hugepage in disallowed_hugepage_adjust()
  2022-08-09 12:57   ` Paolo Bonzini
@ 2022-08-09 14:49     ` Sean Christopherson
  0 siblings, 0 replies; 27+ messages in thread
From: Sean Christopherson @ 2022-08-09 14:49 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: kvm, linux-kernel, David Matlack, Yan Zhao, Mingwei Zhang, Ben Gardon

On Tue, Aug 09, 2022, Paolo Bonzini wrote:
> On 8/6/22 01:05, Sean Christopherson wrote:
> >   	    !is_large_pte(spte)) {
> > +		u64 page_mask;
> > +
> > +		/*
> > +		 * Ensure nx_huge_page_disallowed is read after checking for a
> > +		 * present shadow page.  A different vCPU may be concurrently
> > +		 * installing the shadow page if mmu_lock is held for read.
> > +		 * Pairs with the smp_wmb() in kvm_tdp_mmu_map().
> > +		 */
> > +		smp_rmb();
> > +
> > +		if (!spte_to_child_sp(spte)->nx_huge_page_disallowed)
> > +			return;
> > +
> 
> I wonder if the barrier shouldn't be simply in to_shadow_page(), i.e. always
> assume in the TDP MMU code that sp->xyz is read after the SPTE that points
> to that struct kvm_mmu_page.

If we can get away with it, I'd prefer to rely on the READ_ONCE() in
kvm_tdp_mmu_read_spte() and required ordering of:

	READ_ONCE() => PRESENT => spte_to_child_sp()

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v3 5/8] KVM: x86/mmu: Set disallowed_nx_huge_page in TDP MMU before setting SPTE
  2022-08-09 14:48         ` Paolo Bonzini
@ 2022-08-09 15:05           ` Sean Christopherson
  0 siblings, 0 replies; 27+ messages in thread
From: Sean Christopherson @ 2022-08-09 15:05 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Yan Zhao, kvm, linux-kernel, David Matlack, Mingwei Zhang, Ben Gardon

On Tue, Aug 09, 2022, Paolo Bonzini wrote:
> On 8/9/22 16:44, Sean Christopherson wrote:
> > On Tue, Aug 09, 2022, Paolo Bonzini wrote:
> > > and (4) is definitely ordered after (1) thanks to the READ_ONCE hidden
> > > within (3) and the data dependency from old_spte to sp.
> > 
> > Yes, I think that's correct.  Callers must verify the SPTE is present before getting
> > the associated child shadow page.  KVM does have instances where a shadow page is
> > retrieved from the SPTE _pointer_, but that's the parent shadow page, i.e. isn't
> > guarded by the SPTE being present.
> > 
> > 	struct kvm_mmu_page *sp = sptep_to_sp(rcu_dereference(iter->sptep));
> > 
> > Something like this as a separate patch?
> 
> Would you resubmit without the memory barriers then?

Ya.

> > diff --git a/arch/x86/kvm/mmu/tdp_iter.h b/arch/x86/kvm/mmu/tdp_iter.h
> > index f0af385c56e0..9d982ccf4567 100644
> > --- a/arch/x86/kvm/mmu/tdp_iter.h
> > +++ b/arch/x86/kvm/mmu/tdp_iter.h
> > @@ -13,6 +13,12 @@
> >    * to be zapped while holding mmu_lock for read, and to allow TLB flushes to be
> >    * batched without having to collect the list of zapped SPs.  Flows that can
> >    * remove SPs must service pending TLB flushes prior to dropping RCU protection.
> > + *
> > + * The READ_ONCE() ensures that, if the SPTE points at a child shadow page, all
> > + * fields in struct kvm_mmu_page will be read after the caller observes the
> > + * present SPTE (KVM must check that the SPTE is present before following the
> > + * SPTE's pfn to its associated shadow page).  Pairs with the implicit memory
> 
> I guess you mean both the shadow page table itself and the struct
> kvm_mmu_page?

Yes.  It's a bug if KVM ever consumes a SPTE's pfn (read the next level of SPTEs
or pfn_to_page() to get kvm_mmu_page) without first checking that the SPTE is
MMU-present.  

> Or do you think to_shadow_page() should have a smp_rmb()?

I believe we're ok.  If any SP fields are toggled after the SP is marked present,
then we'd need separate barriers, e.g. similar to is_unsync_root() +
mmu_try_to_unsync_pages() (ignoring that we've since added a spinlock in the "try
to unsync" path).

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v3 1/8] KVM: x86/mmu: Bug the VM if KVM attempts to double count an NX huge page
  2022-08-05 23:05 ` [PATCH v3 1/8] KVM: x86/mmu: Bug the VM if KVM attempts to double count an NX huge page Sean Christopherson
@ 2022-08-14  0:53   ` Mingwei Zhang
  0 siblings, 0 replies; 27+ messages in thread
From: Mingwei Zhang @ 2022-08-14  0:53 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, kvm, linux-kernel, David Matlack, Yan Zhao, Ben Gardon

On Fri, Aug 05, 2022, Sean Christopherson wrote:
> WARN and kill the VM if KVM attempts to double count an NX huge page,
> i.e. attempts to re-tag a shadow page with "NX huge page disallowed".
> KVM does NX huge page accounting only when linking a new shadow page, and
> it should be impossible for a new shadow page to be already accounted.
> E.g. even in the TDP MMU case, where vCPUs can race to install a new
> shadow page, only the "winner" will account the installed page.
> 
> Kill the VM instead of continuing on as either KVM has an egregious bug,
> e.g. didn't zero-initialize the data, or there's host data corruption, in
> which carrying on is dangerous, e.g. could cause silent data corruption
> in the guest.
> 
> Reported-by: David Matlack <dmatlack@google.com>
> Signed-off-by: Sean Christopherson <seanjc@google.com>

Reviewed-by: Mingwei Zhang <mizhang@google.com>
> ---
>  arch/x86/kvm/mmu/mmu.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 3e1317325e1f..36b898dbde91 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -804,7 +804,7 @@ static void account_shadowed(struct kvm *kvm, struct kvm_mmu_page *sp)
>  
>  void account_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp)
>  {
> -	if (sp->lpage_disallowed)
> +	if (KVM_BUG_ON(sp->lpage_disallowed, kvm))
>  		return;
>  
>  	++kvm->stat.nx_lpage_splits;
> -- 
> 2.37.1.559.g78731f0fdb-goog
> 

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v3 2/8] KVM: x86/mmu: Tag disallowed NX huge pages even if they're not tracked
  2022-08-05 23:05 ` [PATCH v3 2/8] KVM: x86/mmu: Tag disallowed NX huge pages even if they're not tracked Sean Christopherson
@ 2022-08-14  0:53   ` Mingwei Zhang
  0 siblings, 0 replies; 27+ messages in thread
From: Mingwei Zhang @ 2022-08-14  0:53 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, kvm, linux-kernel, David Matlack, Yan Zhao, Ben Gardon

On Fri, Aug 05, 2022, Sean Christopherson wrote:
> Tag shadow pages that cannot be replaced with an NX huge page regardless
> of whether or not zapping the page would allow KVM to immediately create
> a huge page, e.g. because something else prevents creating a huge page.
> 
> I.e. track pages that are disallowed from being NX huge pages regardless
> of whether or not the page could have been huge at the time of fault.
> KVM currently tracks pages that were disallowed from being huge due to
> the NX workaround if and only if the page could otherwise be huge.  But
> that fails to handle the scenario where whatever restriction prevented
> KVM from installing a huge page goes away, e.g. if dirty logging is
> disabled, the host mapping level changes, etc...
> 
> Failure to tag shadow pages appropriately could theoretically lead to
> false negatives, e.g. if a fetch fault requests a small page and thus
> isn't tracked, and a read/write fault later requests a huge page, KVM
> will not reject the huge page as it should.
> 
> To avoid yet another flag, initialize the list_head and use list_empty()
> to determine whether or not a page is on the list of NX huge pages that
> should be recovered.
> 
> Note, the TDP MMU accounting is still flawed as fixing the TDP MMU is
> more involved due to mmu_lock being held for read.  This will be
> addressed in a future commit.
> 
> Fixes: 5bcaf3e1715f ("KVM: x86/mmu: Account NX huge page disallowed iff huge page was requested")
> Signed-off-by: Sean Christopherson <seanjc@google.com>

Reviewed-by: Mingwei Zhang <mizhang@google.com>
> ---
>  arch/x86/kvm/mmu/mmu.c          | 27 +++++++++++++++++++--------
>  arch/x86/kvm/mmu/mmu_internal.h | 10 +++++++++-
>  arch/x86/kvm/mmu/paging_tmpl.h  |  6 +++---
>  arch/x86/kvm/mmu/tdp_mmu.c      |  4 +++-
>  4 files changed, 34 insertions(+), 13 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 36b898dbde91..55dac44f3397 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -802,15 +802,20 @@ static void account_shadowed(struct kvm *kvm, struct kvm_mmu_page *sp)
>  		kvm_flush_remote_tlbs_with_address(kvm, gfn, 1);
>  }
>  
> -void account_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp)
> +void account_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp,
> +			  bool nx_huge_page_possible)
>  {
> -	if (KVM_BUG_ON(sp->lpage_disallowed, kvm))
> +	if (KVM_BUG_ON(!list_empty(&sp->lpage_disallowed_link), kvm))
> +		return;
> +
> +	sp->lpage_disallowed = true;
> +
> +	if (!nx_huge_page_possible)
>  		return;
>  
>  	++kvm->stat.nx_lpage_splits;
>  	list_add_tail(&sp->lpage_disallowed_link,
>  		      &kvm->arch.lpage_disallowed_mmu_pages);
> -	sp->lpage_disallowed = true;
>  }
>  
>  static void unaccount_shadowed(struct kvm *kvm, struct kvm_mmu_page *sp)
> @@ -832,9 +837,13 @@ static void unaccount_shadowed(struct kvm *kvm, struct kvm_mmu_page *sp)
>  
>  void unaccount_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp)
>  {
> -	--kvm->stat.nx_lpage_splits;
>  	sp->lpage_disallowed = false;
> -	list_del(&sp->lpage_disallowed_link);
> +
> +	if (list_empty(&sp->lpage_disallowed_link))
> +		return;
> +
> +	--kvm->stat.nx_lpage_splits;
> +	list_del_init(&sp->lpage_disallowed_link);
>  }
>  
>  static struct kvm_memory_slot *
> @@ -2115,6 +2124,8 @@ static struct kvm_mmu_page *kvm_mmu_alloc_shadow_page(struct kvm *kvm,
>  
>  	set_page_private(virt_to_page(sp->spt), (unsigned long)sp);
>  
> +	INIT_LIST_HEAD(&sp->lpage_disallowed_link);
> +
>  	/*
>  	 * active_mmu_pages must be a FIFO list, as kvm_zap_obsolete_pages()
>  	 * depends on valid pages being added to the head of the list.  See
> @@ -3112,9 +3123,9 @@ static int __direct_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
>  			continue;
>  
>  		link_shadow_page(vcpu, it.sptep, sp);
> -		if (fault->is_tdp && fault->huge_page_disallowed &&
> -		    fault->req_level >= it.level)
> -			account_huge_nx_page(vcpu->kvm, sp);
> +		if (fault->is_tdp && fault->huge_page_disallowed)
> +			account_huge_nx_page(vcpu->kvm, sp,
> +					     fault->req_level >= it.level);
>  	}
>  
>  	if (WARN_ON_ONCE(it.level != fault->goal_level))
> diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> index 582def531d4d..cca1ad75d096 100644
> --- a/arch/x86/kvm/mmu/mmu_internal.h
> +++ b/arch/x86/kvm/mmu/mmu_internal.h
> @@ -100,6 +100,13 @@ struct kvm_mmu_page {
>  		};
>  	};
>  
> +	/*
> +	 * Tracks shadow pages that, if zapped, would allow KVM to create an NX
> +	 * huge page.  A shadow page will have lpage_disallowed set but not be
> +	 * on the list if a huge page is disallowed for other reasons, e.g.
> +	 * because KVM is shadowing a PTE at the same gfn, the memslot isn't
> +	 * properly aligned, etc...
> +	 */
>  	struct list_head lpage_disallowed_link;
>  #ifdef CONFIG_X86_32
>  	/*
> @@ -315,7 +322,8 @@ void disallowed_hugepage_adjust(struct kvm_page_fault *fault, u64 spte, int cur_
>  
>  void *mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
>  
> -void account_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp);
> +void account_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp,
> +			  bool nx_huge_page_possible);
>  void unaccount_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp);
>  
>  #endif /* __KVM_X86_MMU_INTERNAL_H */
> diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
> index f5958071220c..e450f49f2225 100644
> --- a/arch/x86/kvm/mmu/paging_tmpl.h
> +++ b/arch/x86/kvm/mmu/paging_tmpl.h
> @@ -713,9 +713,9 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
>  			continue;
>  
>  		link_shadow_page(vcpu, it.sptep, sp);
> -		if (fault->huge_page_disallowed &&
> -		    fault->req_level >= it.level)
> -			account_huge_nx_page(vcpu->kvm, sp);
> +		if (fault->huge_page_disallowed)
> +			account_huge_nx_page(vcpu->kvm, sp,
> +					     fault->req_level >= it.level);
>  	}
>  
>  	if (WARN_ON_ONCE(it.level != fault->goal_level))
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index bf2ccf9debca..903d0d3497b6 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -284,6 +284,8 @@ static struct kvm_mmu_page *tdp_mmu_alloc_sp(struct kvm_vcpu *vcpu)
>  static void tdp_mmu_init_sp(struct kvm_mmu_page *sp, tdp_ptep_t sptep,
>  			    gfn_t gfn, union kvm_mmu_page_role role)
>  {
> +	INIT_LIST_HEAD(&sp->lpage_disallowed_link);
> +
>  	set_page_private(virt_to_page(sp->spt), (unsigned long)sp);
>  
>  	sp->role = role;
> @@ -1130,7 +1132,7 @@ static int tdp_mmu_link_sp(struct kvm *kvm, struct tdp_iter *iter,
>  	spin_lock(&kvm->arch.tdp_mmu_pages_lock);
>  	list_add(&sp->link, &kvm->arch.tdp_mmu_pages);
>  	if (account_nx)
> -		account_huge_nx_page(kvm, sp);
> +		account_huge_nx_page(kvm, sp, true);
>  	spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
>  
>  	return 0;
> -- 
> 2.37.1.559.g78731f0fdb-goog
> 

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v3 3/8] KVM: x86/mmu: Rename NX huge pages fields/functions for consistency
  2022-08-05 23:05 ` [PATCH v3 3/8] KVM: x86/mmu: Rename NX huge pages fields/functions for consistency Sean Christopherson
@ 2022-08-14  1:12   ` Mingwei Zhang
  2022-08-15 21:54     ` Sean Christopherson
  0 siblings, 1 reply; 27+ messages in thread
From: Mingwei Zhang @ 2022-08-14  1:12 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, kvm, linux-kernel, David Matlack, Yan Zhao, Ben Gardon

On Fri, Aug 05, 2022, Sean Christopherson wrote:
> Rename most of the variables/functions involved in the NX huge page
> mitigation to provide consistency, e.g. lpage vs huge page, and NX huge
> vs huge NX, and also to provide clarity, e.g. to make it obvious the flag
> applies only to the NX huge page mitigation, not to any condition that
> prevents creating a huge page.
> 
> Leave the nx_lpage_splits stat alone as the name is ABI and thus set in
> stone.
> 
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>  arch/x86/include/asm/kvm_host.h |  8 ++--
>  arch/x86/kvm/mmu/mmu.c          | 70 +++++++++++++++++----------------
>  arch/x86/kvm/mmu/mmu_internal.h | 22 +++++++----
>  arch/x86/kvm/mmu/paging_tmpl.h  |  2 +-
>  arch/x86/kvm/mmu/tdp_mmu.c      |  8 ++--
>  5 files changed, 59 insertions(+), 51 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index e8281d64a431..5634347e5d05 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1143,7 +1143,7 @@ struct kvm_arch {
>  	struct hlist_head mmu_page_hash[KVM_NUM_MMU_PAGES];
>  	struct list_head active_mmu_pages;
>  	struct list_head zapped_obsolete_pages;
> -	struct list_head lpage_disallowed_mmu_pages;
> +	struct list_head possible_nx_huge_pages;

Honestly, I am struggling to understand this one. 'possible_*' indicates
that there are other possibilities. But what are those possibilities? I
feel this name is more confusing than the original one. Maybe just keep
the original name?

>  	struct kvm_page_track_notifier_node mmu_sp_tracker;
>  	struct kvm_page_track_notifier_head track_notifier_head;
>  	/*
> @@ -1259,7 +1259,7 @@ struct kvm_arch {
>  	bool sgx_provisioning_allowed;
>  
>  	struct kvm_pmu_event_filter __rcu *pmu_event_filter;
> -	struct task_struct *nx_lpage_recovery_thread;
> +	struct task_struct *nx_huge_page_recovery_thread;
>  
>  #ifdef CONFIG_X86_64
>  	/*
> @@ -1304,8 +1304,8 @@ struct kvm_arch {
>  	 *  - tdp_mmu_roots (above)
>  	 *  - tdp_mmu_pages (above)
>  	 *  - the link field of struct kvm_mmu_pages used by the TDP MMU
> -	 *  - lpage_disallowed_mmu_pages
> -	 *  - the lpage_disallowed_link field of struct kvm_mmu_pages used
> +	 *  - possible_nx_huge_pages;
> +	 *  - the possible_nx_huge_page_link field of struct kvm_mmu_pages used
>  	 *    by the TDP MMU
>  	 * It is acceptable, but not necessary, to acquire this lock when
>  	 * the thread holds the MMU lock in write mode.
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 55dac44f3397..53d0dafa68ff 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -802,20 +802,20 @@ static void account_shadowed(struct kvm *kvm, struct kvm_mmu_page *sp)
>  		kvm_flush_remote_tlbs_with_address(kvm, gfn, 1);
>  }
>  
> -void account_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp,
> +void account_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp,
>  			  bool nx_huge_page_possible)
>  {
> -	if (KVM_BUG_ON(!list_empty(&sp->lpage_disallowed_link), kvm))
> +	if (KVM_BUG_ON(!list_empty(&sp->possible_nx_huge_page_link), kvm))
>  		return;
>  
> -	sp->lpage_disallowed = true;
> +	sp->nx_huge_page_disallowed = true;
>  
>  	if (!nx_huge_page_possible)
>  		return;
>  
>  	++kvm->stat.nx_lpage_splits;
> -	list_add_tail(&sp->lpage_disallowed_link,
> -		      &kvm->arch.lpage_disallowed_mmu_pages);
> +	list_add_tail(&sp->possible_nx_huge_page_link,
> +		      &kvm->arch.possible_nx_huge_pages);
>  }
>  
>  static void unaccount_shadowed(struct kvm *kvm, struct kvm_mmu_page *sp)
> @@ -835,15 +835,15 @@ static void unaccount_shadowed(struct kvm *kvm, struct kvm_mmu_page *sp)
>  	kvm_mmu_gfn_allow_lpage(slot, gfn);
>  }
>  
> -void unaccount_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp)
> +void unaccount_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp)
>  {
> -	sp->lpage_disallowed = false;
> +	sp->nx_huge_page_disallowed = false;
>  
> -	if (list_empty(&sp->lpage_disallowed_link))
> +	if (list_empty(&sp->possible_nx_huge_page_link))
>  		return;
>  
>  	--kvm->stat.nx_lpage_splits;
> -	list_del_init(&sp->lpage_disallowed_link);
> +	list_del_init(&sp->possible_nx_huge_page_link);
>  }
>  
>  static struct kvm_memory_slot *
> @@ -2124,7 +2124,7 @@ static struct kvm_mmu_page *kvm_mmu_alloc_shadow_page(struct kvm *kvm,
>  
>  	set_page_private(virt_to_page(sp->spt), (unsigned long)sp);
>  
> -	INIT_LIST_HEAD(&sp->lpage_disallowed_link);
> +	INIT_LIST_HEAD(&sp->possible_nx_huge_page_link);
>  
>  	/*
>  	 * active_mmu_pages must be a FIFO list, as kvm_zap_obsolete_pages()
> @@ -2483,8 +2483,8 @@ static bool __kvm_mmu_prepare_zap_page(struct kvm *kvm,
>  		zapped_root = !is_obsolete_sp(kvm, sp);
>  	}
>  
> -	if (sp->lpage_disallowed)
> -		unaccount_huge_nx_page(kvm, sp);
> +	if (sp->nx_huge_page_disallowed)
> +		unaccount_nx_huge_page(kvm, sp);
>  
>  	sp->role.invalid = 1;
>  
> @@ -3124,7 +3124,7 @@ static int __direct_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
>  
>  		link_shadow_page(vcpu, it.sptep, sp);
>  		if (fault->is_tdp && fault->huge_page_disallowed)
> -			account_huge_nx_page(vcpu->kvm, sp,
> +			account_nx_huge_page(vcpu->kvm, sp,
>  					     fault->req_level >= it.level);
>  	}
>  
> @@ -5981,7 +5981,7 @@ int kvm_mmu_init_vm(struct kvm *kvm)
>  
>  	INIT_LIST_HEAD(&kvm->arch.active_mmu_pages);
>  	INIT_LIST_HEAD(&kvm->arch.zapped_obsolete_pages);
> -	INIT_LIST_HEAD(&kvm->arch.lpage_disallowed_mmu_pages);
> +	INIT_LIST_HEAD(&kvm->arch.possible_nx_huge_pages);
>  	spin_lock_init(&kvm->arch.mmu_unsync_pages_lock);
>  
>  	r = kvm_mmu_init_tdp_mmu(kvm);
> @@ -6699,7 +6699,7 @@ static int set_nx_huge_pages(const char *val, const struct kernel_param *kp)
>  			kvm_mmu_zap_all_fast(kvm);
>  			mutex_unlock(&kvm->slots_lock);
>  
> -			wake_up_process(kvm->arch.nx_lpage_recovery_thread);
> +			wake_up_process(kvm->arch.nx_huge_page_recovery_thread);
>  		}
>  		mutex_unlock(&kvm_lock);
>  	}
> @@ -6825,7 +6825,7 @@ static int set_nx_huge_pages_recovery_param(const char *val, const struct kernel
>  		mutex_lock(&kvm_lock);
>  
>  		list_for_each_entry(kvm, &vm_list, vm_list)
> -			wake_up_process(kvm->arch.nx_lpage_recovery_thread);
> +			wake_up_process(kvm->arch.nx_huge_page_recovery_thread);
>  
>  		mutex_unlock(&kvm_lock);
>  	}
> @@ -6833,7 +6833,7 @@ static int set_nx_huge_pages_recovery_param(const char *val, const struct kernel
>  	return err;
>  }
>  
> -static void kvm_recover_nx_lpages(struct kvm *kvm)
> +static void kvm_recover_nx_huge_pages(struct kvm *kvm)
>  {
>  	unsigned long nx_lpage_splits = kvm->stat.nx_lpage_splits;
>  	int rcu_idx;
> @@ -6856,23 +6856,25 @@ static void kvm_recover_nx_lpages(struct kvm *kvm)
>  	ratio = READ_ONCE(nx_huge_pages_recovery_ratio);
>  	to_zap = ratio ? DIV_ROUND_UP(nx_lpage_splits, ratio) : 0;
>  	for ( ; to_zap; --to_zap) {
> -		if (list_empty(&kvm->arch.lpage_disallowed_mmu_pages))
> +		if (list_empty(&kvm->arch.possible_nx_huge_pages))
>  			break;
>  
>  		/*
>  		 * We use a separate list instead of just using active_mmu_pages
> -		 * because the number of lpage_disallowed pages is expected to
> -		 * be relatively small compared to the total.
> +		 * because the number of shadow pages that can be replaced with an
> +		 * NX huge page is expected to be relatively small compared to
> +		 * the total number of shadow pages.  And because the TDP MMU
> +		 * doesn't use active_mmu_pages.
>  		 */
> -		sp = list_first_entry(&kvm->arch.lpage_disallowed_mmu_pages,
> +		sp = list_first_entry(&kvm->arch.possible_nx_huge_pages,
>  				      struct kvm_mmu_page,
> -				      lpage_disallowed_link);
> -		WARN_ON_ONCE(!sp->lpage_disallowed);
> +				      possible_nx_huge_page_link);
> +		WARN_ON_ONCE(!sp->nx_huge_page_disallowed);
>  		if (is_tdp_mmu_page(sp)) {
>  			flush |= kvm_tdp_mmu_zap_sp(kvm, sp);
>  		} else {
>  			kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list);
> -			WARN_ON_ONCE(sp->lpage_disallowed);
> +			WARN_ON_ONCE(sp->nx_huge_page_disallowed);
>  		}
>  
>  		if (need_resched() || rwlock_needbreak(&kvm->mmu_lock)) {
> @@ -6893,7 +6895,7 @@ static void kvm_recover_nx_lpages(struct kvm *kvm)
>  	srcu_read_unlock(&kvm->srcu, rcu_idx);
>  }
>  
> -static long get_nx_lpage_recovery_timeout(u64 start_time)
> +static long get_nx_huge_page_recovery_timeout(u64 start_time)
>  {
>  	bool enabled;
>  	uint period;
> @@ -6904,19 +6906,19 @@ static long get_nx_lpage_recovery_timeout(u64 start_time)
>  		       : MAX_SCHEDULE_TIMEOUT;
>  }
>  
> -static int kvm_nx_lpage_recovery_worker(struct kvm *kvm, uintptr_t data)
> +static int kvm_nx_huge_page_recovery_worker(struct kvm *kvm, uintptr_t data)
>  {
>  	u64 start_time;
>  	long remaining_time;
>  
>  	while (true) {
>  		start_time = get_jiffies_64();
> -		remaining_time = get_nx_lpage_recovery_timeout(start_time);
> +		remaining_time = get_nx_huge_page_recovery_timeout(start_time);
>  
>  		set_current_state(TASK_INTERRUPTIBLE);
>  		while (!kthread_should_stop() && remaining_time > 0) {
>  			schedule_timeout(remaining_time);
> -			remaining_time = get_nx_lpage_recovery_timeout(start_time);
> +			remaining_time = get_nx_huge_page_recovery_timeout(start_time);
>  			set_current_state(TASK_INTERRUPTIBLE);
>  		}
>  
> @@ -6925,7 +6927,7 @@ static int kvm_nx_lpage_recovery_worker(struct kvm *kvm, uintptr_t data)
>  		if (kthread_should_stop())
>  			return 0;
>  
> -		kvm_recover_nx_lpages(kvm);
> +		kvm_recover_nx_huge_pages(kvm);
>  	}
>  }
>  
> @@ -6933,17 +6935,17 @@ int kvm_mmu_post_init_vm(struct kvm *kvm)
>  {
>  	int err;
>  
> -	err = kvm_vm_create_worker_thread(kvm, kvm_nx_lpage_recovery_worker, 0,
> +	err = kvm_vm_create_worker_thread(kvm, kvm_nx_huge_page_recovery_worker, 0,
>  					  "kvm-nx-lpage-recovery",
> -					  &kvm->arch.nx_lpage_recovery_thread);
> +					  &kvm->arch.nx_huge_page_recovery_thread);
>  	if (!err)
> -		kthread_unpark(kvm->arch.nx_lpage_recovery_thread);
> +		kthread_unpark(kvm->arch.nx_huge_page_recovery_thread);
>  
>  	return err;
>  }
>  
>  void kvm_mmu_pre_destroy_vm(struct kvm *kvm)
>  {
> -	if (kvm->arch.nx_lpage_recovery_thread)
> -		kthread_stop(kvm->arch.nx_lpage_recovery_thread);
> +	if (kvm->arch.nx_huge_page_recovery_thread)
> +		kthread_stop(kvm->arch.nx_huge_page_recovery_thread);
>  }
> diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> index cca1ad75d096..67879459a25c 100644
> --- a/arch/x86/kvm/mmu/mmu_internal.h
> +++ b/arch/x86/kvm/mmu/mmu_internal.h
> @@ -57,7 +57,13 @@ struct kvm_mmu_page {
>  	bool tdp_mmu_page;
>  	bool unsync;
>  	u8 mmu_valid_gen;
> -	bool lpage_disallowed; /* Can't be replaced by an equiv large page */
> +
> +	 /*
> +	  * The shadow page can't be replaced by an equivalent huge page
> +	  * because it is being used to map an executable page in the guest
> +	  * and the NX huge page mitigation is enabled.
> +	  */
> +	bool nx_huge_page_disallowed;
>  
>  	/*
>  	 * The following two entries are used to key the shadow page in the
> @@ -102,12 +108,12 @@ struct kvm_mmu_page {
>  
>  	/*
>  	 * Tracks shadow pages that, if zapped, would allow KVM to create an NX
> -	 * huge page.  A shadow page will have lpage_disallowed set but not be
> -	 * on the list if a huge page is disallowed for other reasons, e.g.
> -	 * because KVM is shadowing a PTE at the same gfn, the memslot isn't
> -	 * properly aligned, etc...
> +	 * huge page.  A shadow page will have nx_huge_page_disallowed set but
> +	 * not be on the list if a huge page is disallowed for other reasons,
> +	 * e.g. because KVM is shadowing a PTE at the same gfn, the memslot
> +	 * isn't properly aligned, etc...
>  	 */
> -	struct list_head lpage_disallowed_link;
> +	struct list_head possible_nx_huge_page_link;
>  #ifdef CONFIG_X86_32
>  	/*
>  	 * Used out of the mmu-lock to avoid reading spte values while an
> @@ -322,8 +328,8 @@ void disallowed_hugepage_adjust(struct kvm_page_fault *fault, u64 spte, int cur_
>  
>  void *mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
>  
> -void account_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp,
> +void account_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp,
>  			  bool nx_huge_page_possible);
> -void unaccount_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp);
> +void unaccount_nx_huge_page(struct kvm *kvm, struct kvm_mmu_page *sp);
>  
>  #endif /* __KVM_X86_MMU_INTERNAL_H */
> diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
> index e450f49f2225..259c0f019f09 100644
> --- a/arch/x86/kvm/mmu/paging_tmpl.h
> +++ b/arch/x86/kvm/mmu/paging_tmpl.h
> @@ -714,7 +714,7 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
>  
>  		link_shadow_page(vcpu, it.sptep, sp);
>  		if (fault->huge_page_disallowed)
> -			account_huge_nx_page(vcpu->kvm, sp,
> +			account_nx_huge_page(vcpu->kvm, sp,
>  					     fault->req_level >= it.level);
>  	}
>  
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 903d0d3497b6..0e94182c87be 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -284,7 +284,7 @@ static struct kvm_mmu_page *tdp_mmu_alloc_sp(struct kvm_vcpu *vcpu)
>  static void tdp_mmu_init_sp(struct kvm_mmu_page *sp, tdp_ptep_t sptep,
>  			    gfn_t gfn, union kvm_mmu_page_role role)
>  {
> -	INIT_LIST_HEAD(&sp->lpage_disallowed_link);
> +	INIT_LIST_HEAD(&sp->possible_nx_huge_page_link);
>  
>  	set_page_private(virt_to_page(sp->spt), (unsigned long)sp);
>  
> @@ -392,8 +392,8 @@ static void tdp_mmu_unlink_sp(struct kvm *kvm, struct kvm_mmu_page *sp,
>  		lockdep_assert_held_write(&kvm->mmu_lock);
>  
>  	list_del(&sp->link);
> -	if (sp->lpage_disallowed)
> -		unaccount_huge_nx_page(kvm, sp);
> +	if (sp->nx_huge_page_disallowed)
> +		unaccount_nx_huge_page(kvm, sp);
>  
>  	if (shared)
>  		spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
> @@ -1132,7 +1132,7 @@ static int tdp_mmu_link_sp(struct kvm *kvm, struct tdp_iter *iter,
>  	spin_lock(&kvm->arch.tdp_mmu_pages_lock);
>  	list_add(&sp->link, &kvm->arch.tdp_mmu_pages);
>  	if (account_nx)
> -		account_huge_nx_page(kvm, sp, true);
> +		account_nx_huge_page(kvm, sp, true);
>  	spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
>  
>  	return 0;
> -- 
> 2.37.1.559.g78731f0fdb-goog
> 

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v3 3/8] KVM: x86/mmu: Rename NX huge pages fields/functions for consistency
  2022-08-14  1:12   ` Mingwei Zhang
@ 2022-08-15 21:54     ` Sean Christopherson
  2022-08-16 21:09       ` Mingwei Zhang
  0 siblings, 1 reply; 27+ messages in thread
From: Sean Christopherson @ 2022-08-15 21:54 UTC (permalink / raw)
  To: Mingwei Zhang
  Cc: Paolo Bonzini, kvm, linux-kernel, David Matlack, Yan Zhao, Ben Gardon

On Sun, Aug 14, 2022, Mingwei Zhang wrote:
> On Fri, Aug 05, 2022, Sean Christopherson wrote:
> > Rename most of the variables/functions involved in the NX huge page
> > mitigation to provide consistency, e.g. lpage vs huge page, and NX huge
> > vs huge NX, and also to provide clarity, e.g. to make it obvious the flag
> > applies only to the NX huge page mitigation, not to any condition that
> > prevents creating a huge page.
> > 
> > Leave the nx_lpage_splits stat alone as the name is ABI and thus set in
> > stone.
> > 
> > Signed-off-by: Sean Christopherson <seanjc@google.com>
> > ---
> >  arch/x86/include/asm/kvm_host.h |  8 ++--
> >  arch/x86/kvm/mmu/mmu.c          | 70 +++++++++++++++++----------------
> >  arch/x86/kvm/mmu/mmu_internal.h | 22 +++++++----
> >  arch/x86/kvm/mmu/paging_tmpl.h  |  2 +-
> >  arch/x86/kvm/mmu/tdp_mmu.c      |  8 ++--
> >  5 files changed, 59 insertions(+), 51 deletions(-)
> > 
> > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > index e8281d64a431..5634347e5d05 100644
> > --- a/arch/x86/include/asm/kvm_host.h
> > +++ b/arch/x86/include/asm/kvm_host.h
> > @@ -1143,7 +1143,7 @@ struct kvm_arch {
> >  	struct hlist_head mmu_page_hash[KVM_NUM_MMU_PAGES];
> >  	struct list_head active_mmu_pages;
> >  	struct list_head zapped_obsolete_pages;
> > -	struct list_head lpage_disallowed_mmu_pages;
> > +	struct list_head possible_nx_huge_pages;
> 
> Honestly, I am struggling to understand this one. 'possible_*' indicates
> that there are other possibilities. But what are those possibilities?

No, possible is being used as an adjective in this case.  possible_nx_huge_pages
is the list of shadow pages for which it's possible to replace the shadow page
with an NX huge page.

The noun version would yield a name like nx_huge_page_possibilities.

> I feel this name is more confusing than the original one. Maybe just keep

Ignoring lpage => huge_page, the current name is terribly inaccurate.  The list
doesn't contain all disallowed huge pages, nor does it even contain all disallowed
NX huge pages, it specifically tracks shadow pages that might be able to be
replaced with an NX huge page.

I'm open to other names, but whatever we choose should be paired with
account_nx_huge_page()'s param that is currently named "nx_huge_page_possible".

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v3 3/8] KVM: x86/mmu: Rename NX huge pages fields/functions for consistency
  2022-08-15 21:54     ` Sean Christopherson
@ 2022-08-16 21:09       ` Mingwei Zhang
  2022-08-17 16:13         ` Sean Christopherson
  0 siblings, 1 reply; 27+ messages in thread
From: Mingwei Zhang @ 2022-08-16 21:09 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, kvm, linux-kernel, David Matlack, Yan Zhao, Ben Gardon

On Mon, Aug 15, 2022, Sean Christopherson wrote:
> On Sun, Aug 14, 2022, Mingwei Zhang wrote:
> > On Fri, Aug 05, 2022, Sean Christopherson wrote:
> > > Rename most of the variables/functions involved in the NX huge page
> > > mitigation to provide consistency, e.g. lpage vs huge page, and NX huge
> > > vs huge NX, and also to provide clarity, e.g. to make it obvious the flag
> > > applies only to the NX huge page mitigation, not to any condition that
> > > prevents creating a huge page.
> > > 
> > > Leave the nx_lpage_splits stat alone as the name is ABI and thus set in
> > > stone.
> > > 
> > > Signed-off-by: Sean Christopherson <seanjc@google.com>
> > > ---
> > >  arch/x86/include/asm/kvm_host.h |  8 ++--
> > >  arch/x86/kvm/mmu/mmu.c          | 70 +++++++++++++++++----------------
> > >  arch/x86/kvm/mmu/mmu_internal.h | 22 +++++++----
> > >  arch/x86/kvm/mmu/paging_tmpl.h  |  2 +-
> > >  arch/x86/kvm/mmu/tdp_mmu.c      |  8 ++--
> > >  5 files changed, 59 insertions(+), 51 deletions(-)
> > > 
> > > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > > index e8281d64a431..5634347e5d05 100644
> > > --- a/arch/x86/include/asm/kvm_host.h
> > > +++ b/arch/x86/include/asm/kvm_host.h
> > > @@ -1143,7 +1143,7 @@ struct kvm_arch {
> > >  	struct hlist_head mmu_page_hash[KVM_NUM_MMU_PAGES];
> > >  	struct list_head active_mmu_pages;
> > >  	struct list_head zapped_obsolete_pages;
> > > -	struct list_head lpage_disallowed_mmu_pages;
> > > +	struct list_head possible_nx_huge_pages;
> > 
> > Honestly, I am struggling to understand this one. 'possible_*' indicates
> > that there are other possibilities. But what are those possibilities?
> 
> No, possible is being used as an adjective in this case.  possible_nx_huge_pages
> is the list of shadow pages for which it's possible to replace the shadow page
> with an NX huge page.
> 
> The noun version would yield a name like nx_huge_page_possibilities.

Right, but these shadow pages are not NX huge pages, right? IIUC, they
are pages to be zapped due to NX huge pages, aren't they?

`nx_huge_page_disallowed` is easy to understand because it literally says
'nx_huge_page is not allowed', which is correct.

But this one, it says 'possible nx_huge_pages', but they are not
nx huge pages at all. Instead, they are 'shadow pages that are replaced
with nx_huge_pages'. So that's why updates to this list are done together
with the nx_lpage_splits stat.

Please correct me if I am wrong. I am still struggling to understand the
meaning of these variables.

>
> > I feel this name is more confusing than the original one. Maybe just keep
> 
> Ignoring lpage => huge_page, the current name is terribly inaccurate.  The list
> doesn't contain all disallowed huge pages, nor does it even contain all disallowed
> NX huge pages, it specifically tracks shadow pages that might be able to be
> replaced with an NX huge page.
> 
> I'm open to other names, but whatever we choose should be paired with
> account_nx_huge_page()'s param that is currently named "nx_huge_page_possible".

How about mmu_pages_replaced_by_nx_huge,
mmu_pages_replaced_by_possible_nx_huge or something starting with
possible_pages_, pages_ instead of possible_nx_huge_pages?


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v3 4/8] KVM: x86/mmu: Properly account NX huge page workaround for nonpaging MMUs
  2022-08-05 23:05 ` [PATCH v3 4/8] KVM: x86/mmu: Properly account NX huge page workaround for nonpaging MMUs Sean Christopherson
@ 2022-08-16 21:25   ` Mingwei Zhang
  0 siblings, 0 replies; 27+ messages in thread
From: Mingwei Zhang @ 2022-08-16 21:25 UTC (permalink / raw)
  To: Sean Christopherson, y
  Cc: Paolo Bonzini, kvm, linux-kernel, David Matlack, Yan Zhao, Ben Gardon

On Fri, Aug 05, 2022, Sean Christopherson wrote:
> Account and track NX huge pages for nonpaging MMUs so that a future
> enhancement to precisely check if a shadow page can't be replaced by a NX
> huge page doesn't get false positives.  Without correct tracking, KVM can
> get stuck in a loop if an instruction is fetching and writing data on the
> same huge page, e.g. KVM installs a small executable page on the fetch
> fault, replaces it with an NX huge page on the write fault, and faults
> again on the fetch.
> 
> Alternatively, and perhaps ideally, KVM would simply not enforce the
> workaround for nonpaging MMUs.  The guest has no page tables to abuse
> and KVM is guaranteed to switch to a different MMU on CR0.PG being
> toggled so there's no security or performance concerns.  However, getting
> make_spte() to play nice now and in the future is unnecessarily complex.
> 
> In the current code base, make_spte() can enforce the mitigation if TDP
> is enabled or the MMU is indirect, but make_spte() may not always have a
> vCPU/MMU to work with, e.g. if KVM were to support in-line huge page
> promotion when disabling dirty logging.
> 
> Without a vCPU/MMU, KVM could either pass in the correct information
> and/or derive it from the shadow page, but the former is ugly and the
> latter subtly non-trivial due to the possibility of direct shadow pages
> in indirect MMUs.  Given that using shadow paging with an unpaged guest
> is far from top priority _and_ has been subjected to the workaround since
> its inception, keep it simple and just fix the accounting glitch.
> 
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> Reviewed-by: David Matlack <dmatlack@google.com>
Reviewed-by: Mingwei Zhang <mizhang@google.com>
> ---
>  arch/x86/kvm/mmu/mmu.c  |  2 +-
>  arch/x86/kvm/mmu/spte.c | 12 ++++++++++++
>  2 files changed, 13 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 53d0dafa68ff..345b6b22ab68 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -3123,7 +3123,7 @@ static int __direct_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
>  			continue;
>  
>  		link_shadow_page(vcpu, it.sptep, sp);
> -		if (fault->is_tdp && fault->huge_page_disallowed)
> +		if (fault->huge_page_disallowed)
>  			account_nx_huge_page(vcpu->kvm, sp,
>  					     fault->req_level >= it.level);
>  	}
> diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
> index 7314d27d57a4..52186b795bce 100644
> --- a/arch/x86/kvm/mmu/spte.c
> +++ b/arch/x86/kvm/mmu/spte.c
> @@ -147,6 +147,18 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
>  	if (!prefetch)
>  		spte |= spte_shadow_accessed_mask(spte);
>  
> +	/*
> +	 * For simplicity, enforce the NX huge page mitigation even if not
> +	 * strictly necessary.  KVM could ignore the mitigation if paging is
> +	 * disabled in the guest, as the guest doesn't have any page tables to
> +	 * abuse.  But to safely ignore the mitigation, KVM would have to
> +	 * ensure a new MMU is loaded (or all shadow pages zapped) when CR0.PG
> +	 * is toggled on, and that's a net negative for performance when TDP is
> +	 * enabled.  When TDP is disabled, KVM will always switch to a new MMU
> +	 * when CR0.PG is toggled, but leveraging that to ignore the mitigation
> +	 * would tie make_spte() further to vCPU/MMU state, and add complexity
> +	 * just to optimize a mode that is anything but performance critical.
> +	 */
>  	if (level > PG_LEVEL_4K && (pte_access & ACC_EXEC_MASK) &&
>  	    is_nx_huge_page_enabled(vcpu->kvm)) {
>  		pte_access &= ~ACC_EXEC_MASK;
> -- 
> 2.37.1.559.g78731f0fdb-goog
> 

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v3 3/8] KVM: x86/mmu: Rename NX huge pages fields/functions for consistency
  2022-08-16 21:09       ` Mingwei Zhang
@ 2022-08-17 16:13         ` Sean Christopherson
  2022-08-18 22:13           ` Mingwei Zhang
  0 siblings, 1 reply; 27+ messages in thread
From: Sean Christopherson @ 2022-08-17 16:13 UTC (permalink / raw)
  To: Mingwei Zhang
  Cc: Paolo Bonzini, kvm, linux-kernel, David Matlack, Yan Zhao, Ben Gardon

On Tue, Aug 16, 2022, Mingwei Zhang wrote:
> On Mon, Aug 15, 2022, Sean Christopherson wrote:
> > On Sun, Aug 14, 2022, Mingwei Zhang wrote:
> > > On Fri, Aug 05, 2022, Sean Christopherson wrote:
> > > > Rename most of the variables/functions involved in the NX huge page
> > > > mitigation to provide consistency, e.g. lpage vs huge page, and NX huge
> > > > vs huge NX, and also to provide clarity, e.g. to make it obvious the flag
> > > > applies only to the NX huge page mitigation, not to any condition that
> > > > prevents creating a huge page.
> > > > 
> > > > Leave the nx_lpage_splits stat alone as the name is ABI and thus set in
> > > > stone.
> > > > 
> > > > Signed-off-by: Sean Christopherson <seanjc@google.com>
> > > > ---
> > > >  arch/x86/include/asm/kvm_host.h |  8 ++--
> > > >  arch/x86/kvm/mmu/mmu.c          | 70 +++++++++++++++++----------------
> > > >  arch/x86/kvm/mmu/mmu_internal.h | 22 +++++++----
> > > >  arch/x86/kvm/mmu/paging_tmpl.h  |  2 +-
> > > >  arch/x86/kvm/mmu/tdp_mmu.c      |  8 ++--
> > > >  5 files changed, 59 insertions(+), 51 deletions(-)
> > > > 
> > > > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > > > index e8281d64a431..5634347e5d05 100644
> > > > --- a/arch/x86/include/asm/kvm_host.h
> > > > +++ b/arch/x86/include/asm/kvm_host.h
> > > > @@ -1143,7 +1143,7 @@ struct kvm_arch {
> > > >  	struct hlist_head mmu_page_hash[KVM_NUM_MMU_PAGES];
> > > >  	struct list_head active_mmu_pages;
> > > >  	struct list_head zapped_obsolete_pages;
> > > > -	struct list_head lpage_disallowed_mmu_pages;
> > > > +	struct list_head possible_nx_huge_pages;
> > > 
> > > Honestly, I am struggling to understand this one. 'possible_*' indicates
> > > that there are other possibilities. But what are those possibilities?
> > 
> > No, possible is being used as an adjective in this case.  possible_nx_huge_pages
> > is the list of shadow pages for which it's possible to replace the shadow page
> > with an NX huge page.
> > 
> > The noun version would yield a name like nx_huge_page_possiblities.
> 
> Right, but these shadow pages are not NX huge pages, right? IIUC, they
> are pages to be zapped due to NX huge pages, aren't they?

Yes, they are shadow pages that the NX recovery thread should zap, but the reason
they should be zapped is because (a) the shadow page has at least one execute child
SPTE, (b) zapping the shadow page will also zap its child SPTEs, and (c) eliminating
all executable child SPTEs means KVM _might_ be able to instantiate an NX huge page.

> `nx_huge_page_disallowed` is easy to understand because it literally says
> 'nx_huge_page is not allowed', which is correct.

No, it's not correct.  The list isn't simply the set of shadow pages that disallow
NX huge pages, it's the set of shadow pages that disallow NX huge pages _and_ that
can possibly be replaced by an NX huge page if the shadow page and all its
(executable) children go away.

> But this one, it says 'possible nx_huge_pages', but they are not
> nx huge pages at all.

Yes, but they _can be_ NX huge pages, hence the "possible".  A super verbose name
would be something like mmu_pages_that_can_possibly_be_replaced_by_nx_huge_pages.

> Instead, they are 'shadow pages that are replaced with nx_huge_pages'. So
> that's why updates to this list are done together with the nx_lpage_splits stat.
> 
> Please correct me if I am wrong. I am still struggling to understand the
> meaning of these variables.
> 
> >
> > > I feel this name is more confusing than the original one. Maybe just keep
> > 
> > Ignoring lpage => huge_page, the current name is terribly inaccurate.  The list
> > doesn't contain all disallowed huge pages, nor does it even contain all disallowed
> > NX huge pages, it specifically tracks shadow pages that might be able to be
> > replaced with an NX huge page.
> > 
> > I'm open to other names, but whatever we choose should be paired with
> > account_nx_huge_page()'s param that is currently named "nx_huge_page_possible".
> 
> How about mmu_pages_replaced_by_nx_huge,

"replaced" is past tense in this usage, and these are definitely not shadow pages
that have already been replaced by NX huge pages.

> mmu_pages_replaced_by_possible_nx_huge or something starting with

Same issue with "replaced".

> possible_pages_, pages_ instead of possible_nx_huge_pages?

Rephrasing, I think you're asking that the variable have "mmu_pages" somewhere in
the name to clarify that it's a list of kvm_mmu_page structs.  I agree that ideally
it would have "mmu_pages" somewhere in there, but I also think that having both
"nx_huge_pages" and "mmu_pages" in the name makes it unreadable, i.e. we have to
choose one use of "pages".

The only alternative I can come up with that isn't completely gross is
possible_nx_huge_mmu_pages.  I can't tell if that's less confusing or more
confusing.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH v3 3/8] KVM: x86/mmu: Rename NX huge pages fields/functions for consistency
  2022-08-17 16:13         ` Sean Christopherson
@ 2022-08-18 22:13           ` Mingwei Zhang
  2022-08-18 23:45             ` Sean Christopherson
  0 siblings, 1 reply; 27+ messages in thread
From: Mingwei Zhang @ 2022-08-18 22:13 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, kvm, linux-kernel, David Matlack, Yan Zhao, Ben Gardon

On Wed, Aug 17, 2022, Sean Christopherson wrote:
> On Tue, Aug 16, 2022, Mingwei Zhang wrote:
> > On Mon, Aug 15, 2022, Sean Christopherson wrote:
> > > On Sun, Aug 14, 2022, Mingwei Zhang wrote:
> > > > On Fri, Aug 05, 2022, Sean Christopherson wrote:
> > > > > Rename most of the variables/functions involved in the NX huge page
> > > > > mitigation to provide consistency, e.g. lpage vs huge page, and NX huge
> > > > > vs huge NX, and also to provide clarity, e.g. to make it obvious the flag
> > > > > applies only to the NX huge page mitigation, not to any condition that
> > > > > prevents creating a huge page.
> > > > > 
> > > > > Leave the nx_lpage_splits stat alone as the name is ABI and thus set in
> > > > > stone.
> > > > > 
> > > > > Signed-off-by: Sean Christopherson <seanjc@google.com>
> > > > > ---
> > > > >  arch/x86/include/asm/kvm_host.h |  8 ++--
> > > > >  arch/x86/kvm/mmu/mmu.c          | 70 +++++++++++++++++----------------
> > > > >  arch/x86/kvm/mmu/mmu_internal.h | 22 +++++++----
> > > > >  arch/x86/kvm/mmu/paging_tmpl.h  |  2 +-
> > > > >  arch/x86/kvm/mmu/tdp_mmu.c      |  8 ++--
> > > > >  5 files changed, 59 insertions(+), 51 deletions(-)
> > > > > 
> > > > > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > > > > index e8281d64a431..5634347e5d05 100644
> > > > > --- a/arch/x86/include/asm/kvm_host.h
> > > > > +++ b/arch/x86/include/asm/kvm_host.h
> > > > > @@ -1143,7 +1143,7 @@ struct kvm_arch {
> > > > >  	struct hlist_head mmu_page_hash[KVM_NUM_MMU_PAGES];
> > > > >  	struct list_head active_mmu_pages;
> > > > >  	struct list_head zapped_obsolete_pages;
> > > > > -	struct list_head lpage_disallowed_mmu_pages;
> > > > > +	struct list_head possible_nx_huge_pages;
> > > > 
> > > > Honestly, I am struggling to understand this one. 'possible_*' indicates
> > > > that there are other possibilities. But what are those possibilities?
> > > 
> > > No, possible is being used as an adjective in this case.  possible_nx_huge_pages
> > > is the list of shadow pages for which it's possible to replace the shadow page
> > > with an NX huge page.
> > > 
> > > The noun version would yield a name like nx_huge_page_possibilities.
> > 
> > Right, but these shadow pages are not NX huge pages, right? IIUC, they
> > are pages to be zapped due to NX huge pages, aren't they?
> 
> Yes, they are shadow pages that the NX recovery thread should zap, but the reason
> they should be zapped is because (a) the shadow page has at least one executable child
> SPTE, (b) zapping the shadow page will also zap its child SPTEs, and (c) eliminating
> all executable child SPTEs means KVM _might_ be able to instantiate an NX huge page.
> 

oh, I scratched my head and finally got your point. hmm. So the shadow
pages are the 'blockers' to (re)creating an NX huge page because of at
least one present executable child SPTE. So, really, whether these
shadow pages themselves are NX huge or not does not really matter. All
we need to know is that they will be zapped in the future to help make
recovery of an NX huge page possible.


> > `nx_huge_page_disallowed` is easy to understand because it literally says
> > 'nx_huge_page is not allowed', which is correct.
> 
> No, it's not correct.  The list isn't simply the set of shadow pages that disallow
> NX huge pages, it's the set of shadow pages that disallow NX huge pages _and_ that
> can possibly be replaced by an NX huge page if the shadow page and all its
> (executable) children go away.
> 

hmm, I think this naming is correct. The flag is used to talk to the
'fault handler' to say 'hey, don't create an nx huge page, stupid'. Of
course, it is also used by the 'nx huge recovery thread', but the
recovery thread will only check it for sanity purposes, which really does
not matter, i.e., the thread will zap the pages anyway.

> > But this one, it says 'possible nx_huge_pages', but they are not
> > nx huge pages at all.
> 
> Yes, but they _can be_ NX huge pages, hence the "possible".  A super verbose name
> would be something like mmu_pages_that_can_possibly_be_replaced_by_nx_huge_pages.
> 

I can make a dramatic example of why 'possible' may not help:

/* Flag that decides something important. */
bool possible_one;

The information we (readers) gain from reading the above is _0_.

With that, since you already mentioned the name:
'mmu_pages_that_can_possibly_be_replaced_by_nx_huge_pages',
why can't we shorten it by using 'mmu_pages_to_recover_nx_huge' or
'pages_to_recover_nx_huge'? 'recover' is the word that immediately
connects with the 'recovery thread', which I think makes more sense in
terms of readability.

> > Instead, they are 'shadow pages that are replaced with nx_huge_pages'. So
> > that's why updates to this list are done together with the nx_lpage_splits stat.
> > 
> > Please correct me if I am wrong. I am still struggling to understand the
> > meaning of these variables.
> > 
> > >
> > > > I feel this name is more confusing than the original one. Maybe just keep
> > > 
> > > Ignoring lpage => huge_page, the current name is terribly inaccurate.  The list
> > > doesn't contain all disallowed huge pages, nor does it even contain all disallowed
> > > NX huge pages, it specifically tracks shadow pages that might be able to be
> > > replaced with an NX huge page.
> > > 
> > > I'm open to other names, but whatever we choose should be paired with
> > > account_nx_huge_page()'s param that is currently named "nx_huge_page_possible".
> > 
> > How about mmu_pages_replaced_by_nx_huge,
> 
> "replaced" is past tense in this usage, and these are definitely not shadow pages
> that have already been replaced by NX huge pages.
> 
> > mmu_pages_replaced_by_possible_nx_huge or something starting with
> 
> Same issue with "replaced".
> 
> > possible_pages_, pages_ instead of possible_nx_huge_pages?
> 
> Rephrasing, I think you're asking that the variable have "mmu_pages" somewhere in
> the name to clarify that it's a list of kvm_mmu_page structs.  I agree that ideally
> it would have "mmu_pages" somewhere in there, but I also think that having both
> "nx_huge_pages" and "mmu_pages" in the name makes it unreadable, i.e. we have to
> choose one use of "pages".
> 
> The only alternative I can come up with that isn't completely gross is
> possible_nx_huge_mmu_pages.  I can't tell if that's less confusing or more
> confusing.

Aside from that.

Reviewed-by: Mingwei Zhang <mizhang@google.com>

* Re: [PATCH v3 3/8] KVM: x86/mmu: Rename NX huge pages fields/functions for consistency
  2022-08-18 22:13           ` Mingwei Zhang
@ 2022-08-18 23:45             ` Sean Christopherson
  2022-08-19 18:30               ` Mingwei Zhang
  0 siblings, 1 reply; 27+ messages in thread
From: Sean Christopherson @ 2022-08-18 23:45 UTC (permalink / raw)
  To: Mingwei Zhang
  Cc: Paolo Bonzini, kvm, linux-kernel, David Matlack, Yan Zhao, Ben Gardon

On Thu, Aug 18, 2022, Mingwei Zhang wrote:
> On Wed, Aug 17, 2022, Sean Christopherson wrote:
> > Yes, they are shadow pages that the NX recovery thread should zap, but the reason
> > they should be zapped is because (a) the shadow page has at least one executable child
> > SPTE, (b) zapping the shadow page will also zap its child SPTEs, and (c) eliminating
> > all executable child SPTEs means KVM _might_ be able to instantiate an NX huge page.
> > 
> 
> oh, I scratched my head and finally got your point. hmm. So the shadow
> pages are the 'blockers' to (re)creating an NX huge page because of at
> least one present executable child SPTE. So, really, whether these
> shadow pages themselves are NX huge or not does not really matter. All
> we need to know is that they will be zapped in the future to help make
> recovery of an NX huge page possible.

More precisely, we want to zap shadow pages with executable children if and only
if they can _possibly_ be replaced with an NX huge page.  The "possibly" is saying
that zapping _may or may not_ result in an NX huge page.  And it also conveys that
pages that _cannot_ be replaced with an NX huge page are not on the list.

If the guest is still using any part of the huge page for execution, then KVM can't
create an NX huge page (or it may temporarily create one and then zap it when the
guest takes an executable fault), but KVM can't know that until it zaps and the
guest takes a fault.  Thus, possibly.
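
If it helps, here's a stripped-down sketch of the recovery side (locking,
batching, and the TDP MMU case omitted; the link field name is
approximate) that shows where the "possibly" comes in:

  static void recover_possible_nx_huge_pages(struct kvm *kvm)
  {
          struct kvm_mmu_page *sp;
          LIST_HEAD(invalid_list);

          while (!list_empty(&kvm->arch.possible_nx_huge_pages)) {
                  sp = list_first_entry(&kvm->arch.possible_nx_huge_pages,
                                        struct kvm_mmu_page,
                                        possible_nx_huge_page_link);
                  WARN_ON_ONCE(!sp->nx_huge_page_disallowed);

                  /*
                   * Zapping drops the executable child SPTEs.  Whether an
                   * NX huge page actually materializes is decided later by
                   * the fault handler, and only if the guest no longer
                   * executes from the region.
                   */
                  kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list);
          }

          kvm_mmu_commit_zap_page(kvm, &invalid_list);
  }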

> > > `nx_huge_page_disallowed` is easy to understand because it literally says
> > > 'nx_huge_page is not allowed', which is correct.
> > 
> > No, it's not correct.  The list isn't simply the set of shadow pages that disallow
> > NX huge pages, it's the set of shadow pages that disallow NX huge pages _and_ that
> > can possibly be replaced by an NX huge page if the shadow page and all its
> > (executable) children go away.
> > 
> 
> hmm, I think this naming is correct. The flag is used to talk to the
> 'fault handler' to say 'hey, don't create an nx huge page, stupid'. Of
> course, it is also used by the 'nx huge recovery thread', but the
> recovery thread will only check it for sanity purposes, which really does
> not matter, i.e., the thread will zap the pages anyway.

Ah, sorry, I thought you were suggesting "nx_huge_page_disallowed" for the list
name, but you were talking about the flag.  Yes, 100% agree that the flag is
appropriately named.

> > > But this one, it says 'possible nx_huge_pages', but they are not
> > > nx huge pages at all.
> > 
> > Yes, but they _can be_ NX huge pages, hence the "possible".  A super verbose name
> > would be something like mmu_pages_that_can_possibly_be_replaced_by_nx_huge_pages.
> > 
> 
> I can make a dramatic example of why 'possible' may not help:
> 
> /* Flag that decides something important. */
> bool possible_one;
> 
> The information we (readers) gain from reading the above is _0_.

But that's only half the story.  If we also had an associated flag

  bool one_disallowed;

a.k.a. nx_huge_page_disallowed, then when viewed together, readers know that the
existence of this struct disallows "one", but that structs with one_disallowed=true
_and_ possible_one=true _might_ be converted to "one", whereas structs with
possible_one=false _cannot_ be converted to "one".
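
Or, as a toy example in code (purely illustrative, nothing to do with the
actual KVM structures):

  struct widget {
          bool one_disallowed;    /* must not be "one" right now */
          bool possible_one;      /* ...but might become "one" later */
  };

  static bool is_recovery_candidate(const struct widget *w)
  {
          /*
           * Only widgets that are both disallowed _and_ possible are worth
           * revisiting; possible_one without one_disallowed doesn't occur.
           */
          return w->one_disallowed && w->possible_one;
  }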

> With that, since you already mentioned the name:
> 'mmu_pages_that_can_possibly_be_replaced_by_nx_huge_pages',
> why can't we shorten it by using 'mmu_pages_to_recover_nx_huge' or
> 'pages_to_recover_nx_huge'? 'recover' is the word that immediately
> connects with the 'recovery thread', which I think makes more sense in
> terms of readability.

mmu_pages_to_recover_nx_huge doesn't capture that recovery isn't guaranteed.
IMO it also does a poor job of capturing _why_ pages are on the list, i.e. a
reader knows they are pages that will be "recovered", but it doesn't clarify that
they'll be recovered/zapped because KVM might be able to replace them with NX
huge pages.  In other words, it doesn't help the reader understand why some, but
not all, nx_huge_page_disallowed pages are on the recovery list.

* Re: [PATCH v3 3/8] KVM: x86/mmu: Rename NX huge pages fields/functions for consistency
  2022-08-18 23:45             ` Sean Christopherson
@ 2022-08-19 18:30               ` Mingwei Zhang
  2022-08-20  1:04                 ` Mingwei Zhang
  0 siblings, 1 reply; 27+ messages in thread
From: Mingwei Zhang @ 2022-08-19 18:30 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, kvm, linux-kernel, David Matlack, Yan Zhao, Ben Gardon

On Thu, Aug 18, 2022, Sean Christopherson wrote:
> On Thu, Aug 18, 2022, Mingwei Zhang wrote:
> > On Wed, Aug 17, 2022, Sean Christopherson wrote:
> > > Yes, they are shadow pages that the NX recovery thread should zap, but the reason
> > > they should be zapped is because (a) the shadow page has at least one executable child
> > > SPTE, (b) zapping the shadow page will also zap its child SPTEs, and (c) eliminating
> > > all executable child SPTEs means KVM _might_ be able to instantiate an NX huge page.
> > > 
> > 
> > oh, I scratched my head and finally got your point. hmm. So the shadow
> > pages are the 'blockers' to (re)creating an NX huge page because of at
> > least one present executable child SPTE. So, really, whether these
> > shadow pages themselves are NX huge or not does not really matter. All
> > we need to know is that they will be zapped in the future to help make
> > recovery of an NX huge page possible.
> 
> More precisely, we want to zap shadow pages with executable children if and only
> if they can _possibly_ be replaced with an NX huge page.  The "possibly" is saying
> that zapping _may or may not_ result in an NX huge page.  And it also conveys that
> pages that _cannot_ be replaced with an NX huge page are not on the list.
> 
> If the guest is still using any part of the huge page for execution, then KVM can't
> create an NX huge page (or it may temporarily create one and then zap it when the
> guest takes an executable fault), but KVM can't know that until it zaps and the
> guest takes a fault.  Thus, possibly.
> 

Right, I think 'possible' is definitely a correct name for that. In
general, using 'possible' can cover the complexity while keeping the
description correct. My only comment here is that 'possible_' might
require extra comments in the code to be more developer friendly.

But overall, now that I remember what the problem was, I no longer
think this naming is an issue for me; it's just that the name could
be better.

> > With that, since you already mentioned the name:
> > 'mmu_pages_that_can_possibly_be_replaced_by_nx_huge_pages',
> > why can't we shorten it by using 'mmu_pages_to_recover_nx_huge' or
> > 'pages_to_recover_nx_huge'? 'recover' is the word that immediately
> > connects with the 'recovery thread', which I think makes more sense in
> > terms of readability.
> 
> mmu_pages_to_recover_nx_huge doesn't capture that recovery isn't guaranteed.
> IMO it also does a poor job of capturing _why_ pages are on the list, i.e. a
> reader knows they are pages that will be "recovered", but it doesn't clarify that
> they'll be recovered/zapped because KVM might be able to replace them with NX
> huge pages.  In other words, it doesn't help the reader understand why some, but
> not all, nx_huge_page_disallowed pages are on the recovery list.

I think you are right that the name does not call out 'why' the pages
are on the list. But on the other hand, I am not sure how much it could
help clarify the situation just by reading the list name. I would
propose we spell out the conditions using (flag, list) pairs:

(nx_huge_page_disallowed, possible_nx_huge_pages)

case (true,  in_list):     mitigation for multi-hit iTLB.
case (true,  not_in_list): dirty logging disabled; address misalignment; guest did not turn on paging.
case (false, in_list):     not possible.
case (false, not_in_list): Any other situation where KVM manipulates SPTEs.

Maybe this should be in the commit message of the previous patch.


* Re: [PATCH v3 3/8] KVM: x86/mmu: Rename NX huge pages fields/functions for consistency
  2022-08-19 18:30               ` Mingwei Zhang
@ 2022-08-20  1:04                 ` Mingwei Zhang
  0 siblings, 0 replies; 27+ messages in thread
From: Mingwei Zhang @ 2022-08-20  1:04 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, kvm, LKML, David Matlack, Yan Zhao, Ben Gardon

> (nx_huge_page_disallowed, possible_nx_huge_pages)
>
> case (true,  in_list):     mitigation for multi-hit iTLB.
> case (true,  not_in_list): dirty logging disabled; address misalignment; guest did not turn on paging.
> case (false, in_list):     not possible.
> case (false, not_in_list): Any other situation where KVM manipulates SPTEs.
>

I made a mistake; it should be:

case (true,  in_list): mitigation for multi-hit iTLB.
case (true,  not_in_list): address misalignment; guest did not turn on paging.
case (false, in_list): not possible.
case (false, not_in_list): dirty logging disabled; any other situation
where KVM zaps small SPTEs in place of a huge one.

Anyway, this is not a blocking comment. But I just don't want to leave
a mistake in the conversation.

Thanks.
-Mingwei

end of thread, other threads:[~2022-08-20  1:04 UTC | newest]

Thread overview: 27+ messages
2022-08-05 23:05 [PATCH v3 0/8] KVM: x86: Apply NX mitigation more precisely Sean Christopherson
2022-08-05 23:05 ` [PATCH v3 1/8] KVM: x86/mmu: Bug the VM if KVM attempts to double count an NX huge page Sean Christopherson
2022-08-14  0:53   ` Mingwei Zhang
2022-08-05 23:05 ` [PATCH v3 2/8] KVM: x86/mmu: Tag disallowed NX huge pages even if they're not tracked Sean Christopherson
2022-08-14  0:53   ` Mingwei Zhang
2022-08-05 23:05 ` [PATCH v3 3/8] KVM: x86/mmu: Rename NX huge pages fields/functions for consistency Sean Christopherson
2022-08-14  1:12   ` Mingwei Zhang
2022-08-15 21:54     ` Sean Christopherson
2022-08-16 21:09       ` Mingwei Zhang
2022-08-17 16:13         ` Sean Christopherson
2022-08-18 22:13           ` Mingwei Zhang
2022-08-18 23:45             ` Sean Christopherson
2022-08-19 18:30               ` Mingwei Zhang
2022-08-20  1:04                 ` Mingwei Zhang
2022-08-05 23:05 ` [PATCH v3 4/8] KVM: x86/mmu: Properly account NX huge page workaround for nonpaging MMUs Sean Christopherson
2022-08-16 21:25   ` Mingwei Zhang
2022-08-05 23:05 ` [PATCH v3 5/8] KVM: x86/mmu: Set disallowed_nx_huge_page in TDP MMU before setting SPTE Sean Christopherson
2022-08-09  3:26   ` Yan Zhao
2022-08-09 12:49     ` Paolo Bonzini
2022-08-09 14:44       ` Sean Christopherson
2022-08-09 14:48         ` Paolo Bonzini
2022-08-09 15:05           ` Sean Christopherson
2022-08-05 23:05 ` [PATCH v3 6/8] KVM: x86/mmu: Track the number of TDP MMU pages, but not the actual pages Sean Christopherson
2022-08-05 23:05 ` [PATCH v3 7/8] KVM: x86/mmu: Add helper to convert SPTE value to its shadow page Sean Christopherson
2022-08-05 23:05 ` [PATCH v3 8/8] KVM: x86/mmu: explicitly check nx_hugepage in disallowed_hugepage_adjust() Sean Christopherson
2022-08-09 12:57   ` Paolo Bonzini
2022-08-09 14:49     ` Sean Christopherson
