* [PATCH v3 0/6] Verify dirty logging works properly with page stats
@ 2022-04-01  6:36 Mingwei Zhang
  2022-04-01  6:36 ` [PATCH v3 1/6] KVM: x86/mmu: Set lpage_disallowed in TDP MMU before setting SPTE Mingwei Zhang
                   ` (5 more replies)
  0 siblings, 6 replies; 12+ messages in thread
From: Mingwei Zhang @ 2022-04-01  6:36 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel, Mingwei Zhang, Yosry Ahmed,
	Ben Gardon, David Matlack, Jing Zhang, Peter Xu

This patch set aims to fix a bug in which KVM incorrectly treats a large
page as an NX huge page. The bug prevents the guest VM from regaining large
pages and causes a performance issue. We fix the bug by explicitly checking
the lpage_disallowed field in the shadow page.  Moreover, to fix the bug
properly for the TDP MMU, we integrate two patches from Sean that ensure
lpage_disallowed is updated in the shadow page before the SPTE is made
visible to the guest.
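
The core of the fix (patch 3) is an explicit check of lpage_disallowed in
disallowed_hugepage_adjust(); condensed from that patch, the check is
roughly:

	struct kvm_mmu_page *sp;

	sp = to_shadow_page(spte & PT64_BASE_ADDR_MASK);

	/* Prevent the lpage_disallowed read from moving ahead. */
	smp_rmb();

	/* Not an NX-split page table: a huge page can be re-installed. */
	if (!sp->lpage_disallowed)
		return;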

To verify that the bug is fixed, we use dirty logging as the testing target
and dirty_log_perf_test as the selftest binary. By adding code that checks
the page stats from the per-VM stats interface, we verified that VMs regain
large pages after dirty logging is disabled. We also verified that the bug
exists when running with unpatched kernels.
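
For example, the selftest assertions added in patch 6 (via the
vm_get_single_stat() helper introduced in patch 5) boil down to checks like
the following after the final iteration with dirty logging disabled:

	if (p->backing_src == VM_MEM_SRC_ANONYMOUS_THP ||
	    p->backing_src == VM_MEM_SRC_ANONYMOUS_HUGETLB_2MB)
		TEST_ASSERT(vm_get_single_stat(vm, "pages_2m") != 0,
			    "2M page is zero");
	if (p->backing_src == VM_MEM_SRC_ANONYMOUS_HUGETLB_1GB)
		TEST_ASSERT(vm_get_single_stat(vm, "pages_1g") != 0,
			    "1G page is zero");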

To make the selftest work properly with the per-VM stats interface, we
borrow two patches from Ben's series: "[PATCH 00/13] KVM: x86: Add a
cap to disable NX hugepages on a VM" [1].

[1] https://lore.kernel.org/all/20220310164532.1821490-2-bgardon@google.com/T/


v2 -> v3:
 - Update lpage_disallowed before making the SPTE visible. [seanjc]
 - Add tdp_mmu_pages stats. [seanjc]
 - Update comments in the selftest. [bgardon]

v2: https://lore.kernel.org/lkml/20220323184915.1335049-1-mizhang@google.com/T/

v1 -> v2:
 - Update the commit message. [dmatlack]
 - Update the comments in patch 3/4 to clarify the motivation. [bgardon]
 - Add another iteration in dirty_log_perf_test to regain pages [bgardon]


Ben Gardon (2):
  KVM: selftests: Dump VM stats in binary stats test
  KVM: selftests: Test reading a single stat

Mingwei Zhang (2):
  KVM: x86/mmu: explicitly check nx_hugepage in
    disallowed_hugepage_adjust()
  selftests: KVM: use page stats to check if dirty logging works
    properly

Sean Christopherson (2):
  KVM: x86/mmu: Set lpage_disallowed in TDP MMU before setting SPTE
  KVM: x86/mmu: Track the number of TDP MMU pages, but not the actual
    pages

 arch/x86/include/asm/kvm_host.h               |  11 +-
 arch/x86/kvm/mmu/mmu.c                        |  28 ++-
 arch/x86/kvm/mmu/mmu_internal.h               |   2 +-
 arch/x86/kvm/mmu/tdp_mmu.c                    |  36 ++--
 .../selftests/kvm/dirty_log_perf_test.c       |  53 +++++
 .../selftests/kvm/include/kvm_util_base.h     |   2 +
 .../selftests/kvm/kvm_binary_stats_test.c     |   6 +
 tools/testing/selftests/kvm/lib/kvm_util.c    | 196 ++++++++++++++++++
 8 files changed, 303 insertions(+), 31 deletions(-)

-- 
2.35.1.1094.g7c7d902a7c-goog



* [PATCH v3 1/6] KVM: x86/mmu: Set lpage_disallowed in TDP MMU before setting SPTE
  2022-04-01  6:36 [PATCH v3 0/6] Verify dirty logging works properly with page stats Mingwei Zhang
@ 2022-04-01  6:36 ` Mingwei Zhang
  2022-04-04 17:58   ` Ben Gardon
  2022-04-01  6:36 ` [PATCH v3 2/6] KVM: x86/mmu: Track the number of TDP MMU pages, but not the actual pages Mingwei Zhang
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 12+ messages in thread
From: Mingwei Zhang @ 2022-04-01  6:36 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel, Mingwei Zhang, Yosry Ahmed,
	Ben Gardon, David Matlack, Jing Zhang, Peter Xu

From: Sean Christopherson <seanjc@google.com>

Set lpage_disallowed in TDP MMU shadow pages before making the SP visible
to other readers, i.e. before setting its SPTE.  This will allow KVM to
query lpage_disallowed when determining if a shadow page can be replaced
by a NX huge page without violating the rules of the mitigation.

Reviewed-by: Mingwei Zhang <mizhang@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/mmu/mmu.c          | 14 ++++++++++----
 arch/x86/kvm/mmu/mmu_internal.h |  2 +-
 arch/x86/kvm/mmu/tdp_mmu.c      | 20 ++++++++++++--------
 3 files changed, 23 insertions(+), 13 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 1361eb4599b4..5cb845fae56e 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -812,14 +812,20 @@ static void account_shadowed(struct kvm *kvm, struct kvm_mmu_page *sp)
 	kvm_mmu_gfn_disallow_lpage(slot, gfn);
 }
 
-void account_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp)
+void __account_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp)
 {
-	if (sp->lpage_disallowed)
-		return;
-
 	++kvm->stat.nx_lpage_splits;
 	list_add_tail(&sp->lpage_disallowed_link,
 		      &kvm->arch.lpage_disallowed_mmu_pages);
+}
+
+static void account_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp)
+{
+	if (sp->lpage_disallowed)
+		return;
+
+	__account_huge_nx_page(kvm, sp);
+
 	sp->lpage_disallowed = true;
 }
 
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index 1bff453f7cbe..4a0087efa1e3 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -168,7 +168,7 @@ void disallowed_hugepage_adjust(struct kvm_page_fault *fault, u64 spte, int cur_
 
 void *mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
 
-void account_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp);
+void __account_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp);
 void unaccount_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp);
 
 #endif /* __KVM_X86_MMU_INTERNAL_H */
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index b3b6426725d4..f05423545e6d 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1122,16 +1122,13 @@ static int tdp_mmu_map_handle_target_level(struct kvm_vcpu *vcpu,
  * @kvm: kvm instance
  * @iter: a tdp_iter instance currently on the SPTE that should be set
  * @sp: The new TDP page table to install.
- * @account_nx: True if this page table is being installed to split a
- *              non-executable huge page.
  * @shared: This operation is running under the MMU lock in read mode.
  *
  * Returns: 0 if the new page table was installed. Non-0 if the page table
  *          could not be installed (e.g. the atomic compare-exchange failed).
  */
 static int tdp_mmu_link_sp(struct kvm *kvm, struct tdp_iter *iter,
-			   struct kvm_mmu_page *sp, bool account_nx,
-			   bool shared)
+			   struct kvm_mmu_page *sp, bool shared)
 {
 	u64 spte = make_nonleaf_spte(sp->spt, !shadow_accessed_mask);
 	int ret = 0;
@@ -1146,8 +1143,6 @@ static int tdp_mmu_link_sp(struct kvm *kvm, struct tdp_iter *iter,
 
 	spin_lock(&kvm->arch.tdp_mmu_pages_lock);
 	list_add(&sp->link, &kvm->arch.tdp_mmu_pages);
-	if (account_nx)
-		account_huge_nx_page(kvm, sp);
 	spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
 
 	return 0;
@@ -1160,6 +1155,7 @@ static int tdp_mmu_link_sp(struct kvm *kvm, struct tdp_iter *iter,
 int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 {
 	struct kvm_mmu *mmu = vcpu->arch.mmu;
+	struct kvm *kvm = vcpu->kvm;
 	struct tdp_iter iter;
 	struct kvm_mmu_page *sp;
 	int ret;
@@ -1210,10 +1206,18 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 			sp = tdp_mmu_alloc_sp(vcpu);
 			tdp_mmu_init_child_sp(sp, &iter);
 
-			if (tdp_mmu_link_sp(vcpu->kvm, &iter, sp, account_nx, true)) {
+			sp->lpage_disallowed = account_nx;
+
+			if (tdp_mmu_link_sp(kvm, &iter, sp, true)) {
 				tdp_mmu_free_sp(sp);
 				break;
 			}
+
+			if (account_nx) {
+				spin_lock(&kvm->arch.tdp_mmu_pages_lock);
+				__account_huge_nx_page(kvm, sp);
+				spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
+			}
 		}
 	}
 
@@ -1501,7 +1505,7 @@ static int tdp_mmu_split_huge_page(struct kvm *kvm, struct tdp_iter *iter,
 	 * correctness standpoint since the translation will be the same either
 	 * way.
 	 */
-	ret = tdp_mmu_link_sp(kvm, iter, sp, false, shared);
+	ret = tdp_mmu_link_sp(kvm, iter, sp, shared);
 	if (ret)
 		goto out;
 
-- 
2.35.1.1094.g7c7d902a7c-goog



* [PATCH v3 2/6] KVM: x86/mmu: Track the number of TDP MMU pages, but not the actual pages
  2022-04-01  6:36 [PATCH v3 0/6] Verify dirty logging works properly with page stats Mingwei Zhang
  2022-04-01  6:36 ` [PATCH v3 1/6] KVM: x86/mmu: Set lpage_disallowed in TDP MMU before setting SPTE Mingwei Zhang
@ 2022-04-01  6:36 ` Mingwei Zhang
  2022-04-01 14:22   ` Sean Christopherson
  2022-04-01  6:36 ` [PATCH v3 3/6] KVM: x86/mmu: explicitly check nx_hugepage in disallowed_hugepage_adjust() Mingwei Zhang
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 12+ messages in thread
From: Mingwei Zhang @ 2022-04-01  6:36 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel, Mingwei Zhang, Yosry Ahmed,
	Ben Gardon, David Matlack, Jing Zhang, Peter Xu

From: Sean Christopherson <seanjc@google.com>

Track the number of TDP MMU "shadow" pages instead of tracking the pages
themselves. With the NX huge page list manipulation moved out of the common
linking flow, eliminating the list-based tracking means the happy path of
adding a shadow page doesn't need to acquire a spinlock and can instead
inc/dec an atomic.

Keep the tracking as the WARN during TDP MMU teardown on leaked shadow
pages is very, very useful for detecting KVM bugs.

Tracking the number of pages will also make it trivial to expose the
counter to userspace as a stat in the future, which may or may not be
desirable.

Note, the TDP MMU needs to use a separate counter (and stat if that ever
comes to be) from the existing n_used_mmu_pages. The TDP MMU doesn't bother
supporting the shrinker nor does it honor KVM_SET_NR_MMU_PAGES (because the
TDP MMU consumes so few pages relative to shadow paging), and including TDP
MMU pages in that counter would break both the shrinker and shadow MMUs,
e.g. if a VM is using nested TDP.

Cc: Yosry Ahmed <yosryahmed@google.com>
Reviewed-by: Mingwei Zhang <mizhang@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/include/asm/kvm_host.h | 11 +++--------
 arch/x86/kvm/mmu/tdp_mmu.c      | 16 ++++++++--------
 2 files changed, 11 insertions(+), 16 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 9694dd5e6ccc..d0dd5ed2e209 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1192,6 +1192,9 @@ struct kvm_arch {
 	 */
 	bool tdp_mmu_enabled;
 
+	/* The number of TDP MMU pages across all roots. */
+	atomic64_t tdp_mmu_pages;
+
 	/*
 	 * List of struct kvm_mmu_pages being used as roots.
 	 * All struct kvm_mmu_pages in the list should have
@@ -1212,18 +1215,10 @@ struct kvm_arch {
 	 */
 	struct list_head tdp_mmu_roots;
 
-	/*
-	 * List of struct kvmp_mmu_pages not being used as roots.
-	 * All struct kvm_mmu_pages in the list should have
-	 * tdp_mmu_page set and a tdp_mmu_root_count of 0.
-	 */
-	struct list_head tdp_mmu_pages;
-
 	/*
 	 * Protects accesses to the following fields when the MMU lock
 	 * is held in read mode:
 	 *  - tdp_mmu_roots (above)
-	 *  - tdp_mmu_pages (above)
 	 *  - the link field of struct kvm_mmu_pages used by the TDP MMU
 	 *  - lpage_disallowed_mmu_pages
 	 *  - the lpage_disallowed_link field of struct kvm_mmu_pages used
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index f05423545e6d..5ca78a89d8ed 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -24,7 +24,6 @@ bool kvm_mmu_init_tdp_mmu(struct kvm *kvm)
 
 	INIT_LIST_HEAD(&kvm->arch.tdp_mmu_roots);
 	spin_lock_init(&kvm->arch.tdp_mmu_pages_lock);
-	INIT_LIST_HEAD(&kvm->arch.tdp_mmu_pages);
 	kvm->arch.tdp_mmu_zap_wq =
 		alloc_workqueue("kvm", WQ_UNBOUND|WQ_MEM_RECLAIM|WQ_CPU_INTENSIVE, 0);
 
@@ -51,7 +50,7 @@ void kvm_mmu_uninit_tdp_mmu(struct kvm *kvm)
 	flush_workqueue(kvm->arch.tdp_mmu_zap_wq);
 	destroy_workqueue(kvm->arch.tdp_mmu_zap_wq);
 
-	WARN_ON(!list_empty(&kvm->arch.tdp_mmu_pages));
+	WARN_ON(atomic64_read(&kvm->arch.tdp_mmu_pages));
 	WARN_ON(!list_empty(&kvm->arch.tdp_mmu_roots));
 
 	/*
@@ -381,14 +380,17 @@ static void handle_changed_spte_dirty_log(struct kvm *kvm, int as_id, gfn_t gfn,
 static void tdp_mmu_unlink_sp(struct kvm *kvm, struct kvm_mmu_page *sp,
 			      bool shared)
 {
+	atomic64_dec(&kvm->arch.tdp_mmu_pages);
+
+	if (!sp->lpage_disallowed)
+		return;
+
 	if (shared)
 		spin_lock(&kvm->arch.tdp_mmu_pages_lock);
 	else
 		lockdep_assert_held_write(&kvm->mmu_lock);
 
-	list_del(&sp->link);
-	if (sp->lpage_disallowed)
-		unaccount_huge_nx_page(kvm, sp);
+	unaccount_huge_nx_page(kvm, sp);
 
 	if (shared)
 		spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
@@ -1141,9 +1143,7 @@ static int tdp_mmu_link_sp(struct kvm *kvm, struct tdp_iter *iter,
 		tdp_mmu_set_spte(kvm, iter, spte);
 	}
 
-	spin_lock(&kvm->arch.tdp_mmu_pages_lock);
-	list_add(&sp->link, &kvm->arch.tdp_mmu_pages);
-	spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
+	atomic64_inc(&kvm->arch.tdp_mmu_pages);
 
 	return 0;
 }
-- 
2.35.1.1094.g7c7d902a7c-goog



* [PATCH v3 3/6] KVM: x86/mmu: explicitly check nx_hugepage in disallowed_hugepage_adjust()
  2022-04-01  6:36 [PATCH v3 0/6] Verify dirty logging works properly with page stats Mingwei Zhang
  2022-04-01  6:36 ` [PATCH v3 1/6] KVM: x86/mmu: Set lpage_disallowed in TDP MMU before setting SPTE Mingwei Zhang
  2022-04-01  6:36 ` [PATCH v3 2/6] KVM: x86/mmu: Track the number of TDP MMU pages, but not the actual pages Mingwei Zhang
@ 2022-04-01  6:36 ` Mingwei Zhang
  2022-04-01 14:33   ` Sean Christopherson
  2022-04-01  6:36 ` [PATCH v3 4/6] KVM: selftests: Dump VM stats in binary stats test Mingwei Zhang
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 12+ messages in thread
From: Mingwei Zhang @ 2022-04-01  6:36 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel, Mingwei Zhang, Yosry Ahmed,
	Ben Gardon, David Matlack, Jing Zhang, Peter Xu

Explicitly check if a NX huge page is disallowed when determining if a page
fault needs to be forced to use a smaller sized page. KVM incorrectly
assumes that the NX huge page mitigation is the only scenario where KVM
will create a shadow page instead of a huge page. Any scenario that causes
KVM to zap leaf SPTEs may result in having a SP that can be made huge
without violating the NX huge page mitigation. E.g. disabling of dirty
logging, zapping from mmu_notifier due to page migration, guest MTRR
changes that affect the viability of a huge page, etc...

Fixes: b8e8c8303ff2 ("kvm: mmu: ITLB_MULTIHIT mitigation")

Reviewed-by: Ben Gardon <bgardon@google.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
---
 arch/x86/kvm/mmu/mmu.c | 14 ++++++++++++--
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 5cb845fae56e..033609e8b332 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -2896,6 +2896,16 @@ void disallowed_hugepage_adjust(struct kvm_page_fault *fault, u64 spte, int cur_
 	    cur_level == fault->goal_level &&
 	    is_shadow_present_pte(spte) &&
 	    !is_large_pte(spte)) {
+		struct kvm_mmu_page *sp;
+		u64 page_mask;
+
+		sp = to_shadow_page(spte & PT64_BASE_ADDR_MASK);
+
+		/* Prevent lpage_disallowed read from moving ahead. */
+		smp_rmb();
+
+		if (!sp->lpage_disallowed)
+			return;
 		/*
 		 * A small SPTE exists for this pfn, but FNAME(fetch)
 		 * and __direct_map would like to create a large PTE
@@ -2903,8 +2913,8 @@ void disallowed_hugepage_adjust(struct kvm_page_fault *fault, u64 spte, int cur_
 		 * patching back for them into pfn the next 9 bits of
 		 * the address.
 		 */
-		u64 page_mask = KVM_PAGES_PER_HPAGE(cur_level) -
-				KVM_PAGES_PER_HPAGE(cur_level - 1);
+		page_mask = KVM_PAGES_PER_HPAGE(cur_level) -
+			KVM_PAGES_PER_HPAGE(cur_level - 1);
 		fault->pfn |= fault->gfn & page_mask;
 		fault->goal_level--;
 	}
-- 
2.35.1.1094.g7c7d902a7c-goog



* [PATCH v3 4/6] KVM: selftests: Dump VM stats in binary stats test
  2022-04-01  6:36 [PATCH v3 0/6] Verify dirty logging works properly with page stats Mingwei Zhang
                   ` (2 preceding siblings ...)
  2022-04-01  6:36 ` [PATCH v3 3/6] KVM: x86/mmu: explicitly check nx_hugepage in disallowed_hugepage_adjust() Mingwei Zhang
@ 2022-04-01  6:36 ` Mingwei Zhang
  2022-04-01  6:36 ` [PATCH v3 5/6] KVM: selftests: Test reading a single stat Mingwei Zhang
  2022-04-01  6:36 ` [PATCH v3 6/6] selftests: KVM: use page stats to check if dirty logging works properly Mingwei Zhang
  5 siblings, 0 replies; 12+ messages in thread
From: Mingwei Zhang @ 2022-04-01  6:36 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel, Mingwei Zhang, Yosry Ahmed,
	Ben Gardon, David Matlack, Jing Zhang, Peter Xu

From: Ben Gardon <bgardon@google.com>

Add kvm_util library functions to read KVM stats through the binary
stats interface and then dump them to stdout when running the binary
stats test. Subsequent commits will extend the kvm_util code and use it
to make assertions in a test for NX hugepages.

CC: Jing Zhang <jingzhangos@google.com>
Signed-off-by: Ben Gardon <bgardon@google.com>
---
 .../selftests/kvm/include/kvm_util_base.h     |   1 +
 .../selftests/kvm/kvm_binary_stats_test.c     |   3 +
 tools/testing/selftests/kvm/lib/kvm_util.c    | 143 ++++++++++++++++++
 3 files changed, 147 insertions(+)

diff --git a/tools/testing/selftests/kvm/include/kvm_util_base.h b/tools/testing/selftests/kvm/include/kvm_util_base.h
index 92cef0ffb19e..c5f4a67772cb 100644
--- a/tools/testing/selftests/kvm/include/kvm_util_base.h
+++ b/tools/testing/selftests/kvm/include/kvm_util_base.h
@@ -400,6 +400,7 @@ void assert_on_unhandled_exception(struct kvm_vm *vm, uint32_t vcpuid);
 
 int vm_get_stats_fd(struct kvm_vm *vm);
 int vcpu_get_stats_fd(struct kvm_vm *vm, uint32_t vcpuid);
+void dump_vm_stats(struct kvm_vm *vm);
 
 uint32_t guest_get_vcpuid(void);
 
diff --git a/tools/testing/selftests/kvm/kvm_binary_stats_test.c b/tools/testing/selftests/kvm/kvm_binary_stats_test.c
index 17f65d514915..afc4701ce8dd 100644
--- a/tools/testing/selftests/kvm/kvm_binary_stats_test.c
+++ b/tools/testing/selftests/kvm/kvm_binary_stats_test.c
@@ -174,6 +174,9 @@ static void vm_stats_test(struct kvm_vm *vm)
 	stats_test(stats_fd);
 	close(stats_fd);
 	TEST_ASSERT(fcntl(stats_fd, F_GETFD) == -1, "Stats fd not freed");
+
+	/* Dump VM stats */
+	dump_vm_stats(vm);
 }
 
 static void vcpu_stats_test(struct kvm_vm *vm, int vcpu_id)
diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c b/tools/testing/selftests/kvm/lib/kvm_util.c
index 1665a220abcb..4d21c3b46780 100644
--- a/tools/testing/selftests/kvm/lib/kvm_util.c
+++ b/tools/testing/selftests/kvm/lib/kvm_util.c
@@ -2556,3 +2556,146 @@ int vcpu_get_stats_fd(struct kvm_vm *vm, uint32_t vcpuid)
 
 	return ioctl(vcpu->fd, KVM_GET_STATS_FD, NULL);
 }
+
+/* Caller is responsible for freeing the returned kvm_stats_header. */
+static struct kvm_stats_header *read_vm_stats_header(int stats_fd)
+{
+	struct kvm_stats_header *header;
+	ssize_t ret;
+
+	/* Read kvm stats header */
+	header = malloc(sizeof(*header));
+	TEST_ASSERT(header, "Allocate memory for stats header");
+
+	ret = read(stats_fd, header, sizeof(*header));
+	TEST_ASSERT(ret == sizeof(*header), "Read stats header");
+
+	return header;
+}
+
+static void dump_header(int stats_fd, struct kvm_stats_header *header)
+{
+	ssize_t ret;
+	char *id;
+
+	printf("flags: %u\n", header->flags);
+	printf("name size: %u\n", header->name_size);
+	printf("num_desc: %u\n", header->num_desc);
+	printf("id_offset: %u\n", header->id_offset);
+	printf("desc_offset: %u\n", header->desc_offset);
+	printf("data_offset: %u\n", header->data_offset);
+
+	/* Read kvm stats id string */
+	id = malloc(header->name_size);
+	TEST_ASSERT(id, "Allocate memory for id string");
+	ret = pread(stats_fd, id, header->name_size, header->id_offset);
+	TEST_ASSERT(ret == header->name_size, "Read id string");
+
+	printf("id: %s\n", id);
+
+	free(id);
+}
+
+static ssize_t stats_desc_size(struct kvm_stats_header *header)
+{
+	return sizeof(struct kvm_stats_desc) + header->name_size;
+}
+
+/* Caller is responsible for freeing the returned kvm_stats_desc. */
+static struct kvm_stats_desc *read_vm_stats_desc(int stats_fd,
+						 struct kvm_stats_header *header)
+{
+	struct kvm_stats_desc *stats_desc;
+	size_t size_desc;
+	ssize_t ret;
+
+	size_desc = header->num_desc * stats_desc_size(header);
+
+	/* Allocate memory for stats descriptors */
+	stats_desc = malloc(size_desc);
+	TEST_ASSERT(stats_desc, "Allocate memory for stats descriptors");
+
+	/* Read kvm stats descriptors */
+	ret = pread(stats_fd, stats_desc, size_desc, header->desc_offset);
+	TEST_ASSERT(ret == size_desc, "Read KVM stats descriptors");
+
+	return stats_desc;
+}
+
+/* Caller is responsible for freeing the memory *data. */
+static int read_stat_data(int stats_fd, struct kvm_stats_header *header,
+			  struct kvm_stats_desc *desc, uint64_t **data)
+{
+	u64 *stats_data;
+	ssize_t ret;
+
+	stats_data = malloc(desc->size * sizeof(*stats_data));
+
+	ret = pread(stats_fd, stats_data, desc->size * sizeof(*stats_data),
+		    header->data_offset + desc->offset);
+
+	/* ret is in bytes. */
+	ret = ret / sizeof(*stats_data);
+
+	TEST_ASSERT(ret == desc->size,
+		    "Read data of KVM stats: %s", desc->name);
+
+	*data = stats_data;
+
+	return ret;
+}
+
+static void dump_stat(int stats_fd, struct kvm_stats_header *header,
+		      struct kvm_stats_desc *desc)
+{
+	u64 *stats_data;
+	ssize_t ret;
+	int i;
+
+	printf("\tflags: %u\n", desc->flags);
+	printf("\texponent: %u\n", desc->exponent);
+	printf("\tsize: %u\n", desc->size);
+	printf("\toffset: %u\n", desc->offset);
+	printf("\tbucket_size: %u\n", desc->bucket_size);
+	printf("\tname: %s\n", (char *)&desc->name);
+
+	ret = read_stat_data(stats_fd, header, desc, &stats_data);
+
+	printf("\tdata: %lu", *stats_data);
+	for (i = 1; i < ret; i++)
+		printf(", %lu", *(stats_data + i));
+	printf("\n\n");
+
+	free(stats_data);
+}
+
+void dump_vm_stats(struct kvm_vm *vm)
+{
+	struct kvm_stats_desc *stats_desc;
+	struct kvm_stats_header *header;
+	struct kvm_stats_desc *desc;
+	size_t size_desc;
+	int stats_fd;
+	int i;
+
+	stats_fd = vm_get_stats_fd(vm);
+
+	header = read_vm_stats_header(stats_fd);
+	dump_header(stats_fd, header);
+
+	stats_desc = read_vm_stats_desc(stats_fd, header);
+
+	size_desc = stats_desc_size(header);
+
+	/* Read kvm stats data one by one */
+	for (i = 0; i < header->num_desc; ++i) {
+		desc = (void *)stats_desc + (i * size_desc);
+		dump_stat(stats_fd, header, desc);
+	}
+
+	free(stats_desc);
+	free(header);
+
+	close(stats_fd);
+}
+
-- 
2.35.1.1094.g7c7d902a7c-goog



* [PATCH v3 5/6] KVM: selftests: Test reading a single stat
  2022-04-01  6:36 [PATCH v3 0/6] Verify dirty logging works properly with page stats Mingwei Zhang
                   ` (3 preceding siblings ...)
  2022-04-01  6:36 ` [PATCH v3 4/6] KVM: selftests: Dump VM stats in binary stats test Mingwei Zhang
@ 2022-04-01  6:36 ` Mingwei Zhang
  2022-04-01  6:36 ` [PATCH v3 6/6] selftests: KVM: use page stats to check if dirty logging works properly Mingwei Zhang
  5 siblings, 0 replies; 12+ messages in thread
From: Mingwei Zhang @ 2022-04-01  6:36 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel, Mingwei Zhang, Yosry Ahmed,
	Ben Gardon, David Matlack, Jing Zhang, Peter Xu

From: Ben Gardon <bgardon@google.com>

Retrieve the value of a single stat by name in the binary stats test to
ensure the kvm_util library functions work.

CC: Jing Zhang <jingzhangos@google.com>
Signed-off-by: Ben Gardon <bgardon@google.com>
---
 .../selftests/kvm/include/kvm_util_base.h     |  1 +
 .../selftests/kvm/kvm_binary_stats_test.c     |  3 ++
 tools/testing/selftests/kvm/lib/kvm_util.c    | 53 +++++++++++++++++++
 3 files changed, 57 insertions(+)

diff --git a/tools/testing/selftests/kvm/include/kvm_util_base.h b/tools/testing/selftests/kvm/include/kvm_util_base.h
index c5f4a67772cb..09ee70c0df26 100644
--- a/tools/testing/selftests/kvm/include/kvm_util_base.h
+++ b/tools/testing/selftests/kvm/include/kvm_util_base.h
@@ -401,6 +401,7 @@ void assert_on_unhandled_exception(struct kvm_vm *vm, uint32_t vcpuid);
 int vm_get_stats_fd(struct kvm_vm *vm);
 int vcpu_get_stats_fd(struct kvm_vm *vm, uint32_t vcpuid);
 void dump_vm_stats(struct kvm_vm *vm);
+uint64_t vm_get_single_stat(struct kvm_vm *vm, const char *stat_name);
 
 uint32_t guest_get_vcpuid(void);
 
diff --git a/tools/testing/selftests/kvm/kvm_binary_stats_test.c b/tools/testing/selftests/kvm/kvm_binary_stats_test.c
index afc4701ce8dd..97bde355f105 100644
--- a/tools/testing/selftests/kvm/kvm_binary_stats_test.c
+++ b/tools/testing/selftests/kvm/kvm_binary_stats_test.c
@@ -177,6 +177,9 @@ static void vm_stats_test(struct kvm_vm *vm)
 
 	/* Dump VM stats */
 	dump_vm_stats(vm);
+
+	/* Read a single stat. */
+	printf("remote_tlb_flush: %lu\n", vm_get_single_stat(vm, "remote_tlb_flush"));
 }
 
 static void vcpu_stats_test(struct kvm_vm *vm, int vcpu_id)
diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c b/tools/testing/selftests/kvm/lib/kvm_util.c
index 4d21c3b46780..1d3493d7fd55 100644
--- a/tools/testing/selftests/kvm/lib/kvm_util.c
+++ b/tools/testing/selftests/kvm/lib/kvm_util.c
@@ -2699,3 +2699,56 @@ void dump_vm_stats(struct kvm_vm *vm)
 	close(stats_fd);
 }
 
+static int vm_get_stat_data(struct kvm_vm *vm, const char *stat_name,
+			    uint64_t **data)
+{
+	struct kvm_stats_desc *stats_desc;
+	struct kvm_stats_header *header;
+	struct kvm_stats_desc *desc;
+	size_t size_desc;
+	int stats_fd;
+	int ret = -EINVAL;
+	int i;
+
+	*data = NULL;
+
+	stats_fd = vm_get_stats_fd(vm);
+
+	header = read_vm_stats_header(stats_fd);
+
+	stats_desc = read_vm_stats_desc(stats_fd, header);
+
+	size_desc = stats_desc_size(header);
+
+	/* Read kvm stats data one by one */
+	for (i = 0; i < header->num_desc; ++i) {
+		desc = (void *)stats_desc + (i * size_desc);
+
+		if (strcmp(desc->name, stat_name))
+			continue;
+
+		ret = read_stat_data(stats_fd, header, desc, data);
+	}
+
+	free(stats_desc);
+	free(header);
+
+	close(stats_fd);
+
+	return ret;
+}
+
+uint64_t vm_get_single_stat(struct kvm_vm *vm, const char *stat_name)
+{
+	uint64_t *data;
+	uint64_t value;
+	int ret;
+
+	ret = vm_get_stat_data(vm, stat_name, &data);
+	TEST_ASSERT(ret == 1, "Stat %s expected to have 1 element, but has %d",
+		    stat_name, ret);
+	value = *data;
+	free(data);
+	return value;
+}
+
-- 
2.35.1.1094.g7c7d902a7c-goog



* [PATCH v3 6/6] selftests: KVM: use page stats to check if dirty logging works properly
  2022-04-01  6:36 [PATCH v3 0/6] Verify dirty logging works properly with page stats Mingwei Zhang
                   ` (4 preceding siblings ...)
  2022-04-01  6:36 ` [PATCH v3 5/6] KVM: selftests: Test reading a single stat Mingwei Zhang
@ 2022-04-01  6:36 ` Mingwei Zhang
  5 siblings, 0 replies; 12+ messages in thread
From: Mingwei Zhang @ 2022-04-01  6:36 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel, Mingwei Zhang, Yosry Ahmed,
	Ben Gardon, David Matlack, Jing Zhang, Peter Xu

When dirty logging is enabled, KVM remaps all accessed pages in the NPT/EPT
at 4K. This property can be used to check whether the page stats metrics
work properly in the KVM MMU. At the same time, the logic can be used the
other way around: using page stats to verify that dirty logging really
splits all huge pages. Moreover, when dirty logging is disabled, KVM zaps
the corresponding SPTEs, and we can check whether the large pages come back
when the guest touches the pages again.

So add page stats checking to the dirty logging performance selftest. In
particular, add checks in four locations:
 - just after the VM is created;
 - after populating memory into the VM but before enabling dirty logging;
 - after finishing dirty logging but before disabling it;
 - after the final iteration once dirty logging has been disabled.

Tested using commands:
 - ./dirty_log_perf_test -s anonymous_hugetlb_1gb
 - ./dirty_log_perf_test -s anonymous_hugetlb_2mb
 - ./dirty_log_perf_test -s anonymous_thp

Cc: Sean Christopherson <seanjc@google.com>
Cc: David Matlack <dmatlack@google.com>
Cc: Jing Zhang <jingzhangos@google.com>
Cc: Peter Xu <peterx@redhat.com>

Suggested-by: Ben Gardon <bgardon@google.com>
Reviewed-by: Ben Gardon <bgardon@google.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
---
 .../selftests/kvm/dirty_log_perf_test.c       | 53 +++++++++++++++++++
 1 file changed, 53 insertions(+)

diff --git a/tools/testing/selftests/kvm/dirty_log_perf_test.c b/tools/testing/selftests/kvm/dirty_log_perf_test.c
index c9d9e513ca04..dd48aabfff5c 100644
--- a/tools/testing/selftests/kvm/dirty_log_perf_test.c
+++ b/tools/testing/selftests/kvm/dirty_log_perf_test.c
@@ -25,6 +25,10 @@
 #define GICR_BASE_GPA			0x80A0000ULL
 #endif
 
+#ifdef __x86_64__
+#include "processor.h"
+#endif
+
 /* How many host loops to run by default (one KVM_GET_DIRTY_LOG for each loop)*/
 #define TEST_HOST_LOOP_N		2UL
 
@@ -191,6 +195,14 @@ static void run_test(enum vm_guest_mode mode, void *arg)
 				 p->slots, p->backing_src,
 				 p->partition_vcpu_memory_access);
 
+#ifdef __x86_64__
+	TEST_ASSERT(vm_get_single_stat(vm, "pages_4k") == 0,
+		    "4K page is non zero");
+	TEST_ASSERT(vm_get_single_stat(vm, "pages_2m") == 0,
+		    "2M page is non zero");
+	TEST_ASSERT(vm_get_single_stat(vm, "pages_1g") == 0,
+		    "1G page is non zero");
+#endif
 	perf_test_set_wr_fract(vm, p->wr_fract);
 
 	guest_num_pages = (nr_vcpus * guest_percpu_mem_size) >> vm_get_page_shift(vm);
@@ -232,6 +244,17 @@ static void run_test(enum vm_guest_mode mode, void *arg)
 	pr_info("Populate memory time: %ld.%.9lds\n",
 		ts_diff.tv_sec, ts_diff.tv_nsec);
 
+#ifdef __x86_64__
+	TEST_ASSERT(vm_get_single_stat(vm, "pages_4k") != 0,
+		    "4K page is zero");
+	if (p->backing_src == VM_MEM_SRC_ANONYMOUS_THP ||
+	    p->backing_src == VM_MEM_SRC_ANONYMOUS_HUGETLB_2MB)
+		TEST_ASSERT(vm_get_single_stat(vm, "pages_2m") != 0,
+			    "2M page is zero");
+	if (p->backing_src == VM_MEM_SRC_ANONYMOUS_HUGETLB_1GB)
+		TEST_ASSERT(vm_get_single_stat(vm, "pages_1g") != 0,
+			    "1G page is zero");
+#endif
 	/* Enable dirty logging */
 	clock_gettime(CLOCK_MONOTONIC, &start);
 	enable_dirty_logging(vm, p->slots);
@@ -277,6 +300,14 @@ static void run_test(enum vm_guest_mode mode, void *arg)
 				iteration, ts_diff.tv_sec, ts_diff.tv_nsec);
 		}
 	}
+#ifdef __x86_64__
+	TEST_ASSERT(vm_get_single_stat(vm, "pages_4k") != 0,
+		    "4K page is zero after dirty logging");
+	TEST_ASSERT(vm_get_single_stat(vm, "pages_2m") == 0,
+		    "2M page is non-zero after dirty logging");
+	TEST_ASSERT(vm_get_single_stat(vm, "pages_1g") == 0,
+		    "1G page is non-zero after dirty logging");
+#endif
 
 	/* Disable dirty logging */
 	clock_gettime(CLOCK_MONOTONIC, &start);
@@ -285,6 +316,28 @@ static void run_test(enum vm_guest_mode mode, void *arg)
 	pr_info("Disabling dirty logging time: %ld.%.9lds\n",
 		ts_diff.tv_sec, ts_diff.tv_nsec);
 
+	/*
+	 * Increment iteration to run the vcpus again to ensure all pages come
+	 * back.
+	 */
+	iteration++;
+	pr_info("Starting the final iteration to get all pages back.\n");
+	for (vcpu_id = 0; vcpu_id < nr_vcpus; vcpu_id++) {
+		while (READ_ONCE(vcpu_last_completed_iteration[vcpu_id])
+		       != iteration)
+			;
+	}
+
+#ifdef __x86_64__
+	if (p->backing_src == VM_MEM_SRC_ANONYMOUS_THP ||
+	    p->backing_src == VM_MEM_SRC_ANONYMOUS_HUGETLB_2MB)
+		TEST_ASSERT(vm_get_single_stat(vm, "pages_2m") != 0,
+			    "2M page is zero");
+	if (p->backing_src == VM_MEM_SRC_ANONYMOUS_HUGETLB_1GB)
+		TEST_ASSERT(vm_get_single_stat(vm, "pages_1g") != 0,
+			    "1G page is zero");
+#endif
+
 	/* Tell the vcpu thread to quit */
 	host_quit = true;
 	perf_test_join_vcpu_threads(nr_vcpus);
-- 
2.35.1.1094.g7c7d902a7c-goog



* Re: [PATCH v3 2/6] KVM: x86/mmu: Track the number of TDP MMU pages, but not the actual pages
  2022-04-01  6:36 ` [PATCH v3 2/6] KVM: x86/mmu: Track the number of TDP MMU pages, but not the actual pages Mingwei Zhang
@ 2022-04-01 14:22   ` Sean Christopherson
  0 siblings, 0 replies; 12+ messages in thread
From: Sean Christopherson @ 2022-04-01 14:22 UTC (permalink / raw)
  To: Mingwei Zhang
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel, Yosry Ahmed, Ben Gardon,
	David Matlack, Jing Zhang, Peter Xu

[-- Attachment #1: Type: text/plain, Size: 665 bytes --]

On Fri, Apr 01, 2022, Mingwei Zhang wrote:
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index f05423545e6d..5ca78a89d8ed 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -24,7 +24,6 @@ bool kvm_mmu_init_tdp_mmu(struct kvm *kvm)
>  
>  	INIT_LIST_HEAD(&kvm->arch.tdp_mmu_roots);
>  	spin_lock_init(&kvm->arch.tdp_mmu_pages_lock);
> -	INIT_LIST_HEAD(&kvm->arch.tdp_mmu_pages);
>  	kvm->arch.tdp_mmu_zap_wq =
>  		alloc_workqueue("kvm", WQ_UNBOUND|WQ_MEM_RECLAIM|WQ_CPU_INTENSIVE, 0);

This has a minor conflict with kvm/next and kvm/queue, attached a new version since
you'll need a v3 (foreshadowing...).

[-- Attachment #2: 0001-KVM-x86-mmu-Track-the-number-of-TDP-MMU-pages-but-no.patch --]
[-- Type: text/x-diff, Size: 4419 bytes --]

From 3d0c2796b22c75e3ab279cbcd3036c44500a6744 Mon Sep 17 00:00:00 2001
From: Sean Christopherson <seanjc@google.com>
Date: Fri, 1 Apr 2022 06:36:32 +0000
Subject: [PATCH] KVM: x86/mmu: Track the number of TDP MMU pages, but not the
 actual pages

Track the number of TDP MMU "shadow" pages instead of tracking the pages
themselves. With the NX huge page list manipulation moved out of the common
linking flow, elminating the list-based tracking means the happy path of
adding a shadow page doesn't need to acquire a spinlock and can instead
inc/dec an atomic.

Keep the tracking as the WARN during TDP MMU teardown on leaked shadow
pages is very, very useful for detecting KVM bugs.

Tracking the number of pages will also make it trivial to expose the
counter to userspace as a stat in the future, which may or may not be
desirable.

Note, the TDP MMU needs to use a separate counter (and stat if that ever
comes to be) from the existing n_used_mmu_pages. The TDP MMU doesn't bother
supporting the shrinker nor does it honor KVM_SET_NR_MMU_PAGES (because the
TDP MMU consumes so few pages relative to shadow paging), and including TDP
MMU pages in that counter would break both the shrinker and shadow MMUs,
e.g. if a VM is using nested TDP.

Cc: Yosry Ahmed <yosryahmed@google.com>
Reviewed-by: Mingwei Zhang <mizhang@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/include/asm/kvm_host.h | 11 +++--------
 arch/x86/kvm/mmu/tdp_mmu.c      | 16 ++++++++--------
 2 files changed, 11 insertions(+), 16 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 676705ad1e23..8a80056cf841 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1192,6 +1192,9 @@ struct kvm_arch {
 	 */
 	bool tdp_mmu_enabled;
 
+	/* The number of TDP MMU pages across all roots. */
+	atomic64_t tdp_mmu_pages;
+
 	/*
 	 * List of struct kvm_mmu_pages being used as roots.
 	 * All struct kvm_mmu_pages in the list should have
@@ -1212,18 +1215,10 @@ struct kvm_arch {
 	 */
 	struct list_head tdp_mmu_roots;
 
-	/*
-	 * List of struct kvmp_mmu_pages not being used as roots.
-	 * All struct kvm_mmu_pages in the list should have
-	 * tdp_mmu_page set and a tdp_mmu_root_count of 0.
-	 */
-	struct list_head tdp_mmu_pages;
-
 	/*
 	 * Protects accesses to the following fields when the MMU lock
 	 * is held in read mode:
 	 *  - tdp_mmu_roots (above)
-	 *  - tdp_mmu_pages (above)
 	 *  - the link field of struct kvm_mmu_pages used by the TDP MMU
 	 *  - lpage_disallowed_mmu_pages
 	 *  - the lpage_disallowed_link field of struct kvm_mmu_pages used
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index d6417740646a..1f44826cf794 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -29,7 +29,6 @@ int kvm_mmu_init_tdp_mmu(struct kvm *kvm)
 	kvm->arch.tdp_mmu_enabled = true;
 	INIT_LIST_HEAD(&kvm->arch.tdp_mmu_roots);
 	spin_lock_init(&kvm->arch.tdp_mmu_pages_lock);
-	INIT_LIST_HEAD(&kvm->arch.tdp_mmu_pages);
 	kvm->arch.tdp_mmu_zap_wq = wq;
 	return 1;
 }
@@ -54,7 +53,7 @@ void kvm_mmu_uninit_tdp_mmu(struct kvm *kvm)
 	flush_workqueue(kvm->arch.tdp_mmu_zap_wq);
 	destroy_workqueue(kvm->arch.tdp_mmu_zap_wq);
 
-	WARN_ON(!list_empty(&kvm->arch.tdp_mmu_pages));
+	WARN_ON(atomic64_read(&kvm->arch.tdp_mmu_pages));
 	WARN_ON(!list_empty(&kvm->arch.tdp_mmu_roots));
 
 	/*
@@ -384,14 +383,17 @@ static void handle_changed_spte_dirty_log(struct kvm *kvm, int as_id, gfn_t gfn,
 static void tdp_mmu_unlink_sp(struct kvm *kvm, struct kvm_mmu_page *sp,
 			      bool shared)
 {
+	atomic64_dec(&kvm->arch.tdp_mmu_pages);
+
+	if (!sp->lpage_disallowed)
+		return;
+
 	if (shared)
 		spin_lock(&kvm->arch.tdp_mmu_pages_lock);
 	else
 		lockdep_assert_held_write(&kvm->mmu_lock);
 
-	list_del(&sp->link);
-	if (sp->lpage_disallowed)
-		unaccount_huge_nx_page(kvm, sp);
+	unaccount_huge_nx_page(kvm, sp);
 
 	if (shared)
 		spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
@@ -1119,9 +1121,7 @@ static int tdp_mmu_link_sp(struct kvm *kvm, struct tdp_iter *iter,
 		tdp_mmu_set_spte(kvm, iter, spte);
 	}
 
-	spin_lock(&kvm->arch.tdp_mmu_pages_lock);
-	list_add(&sp->link, &kvm->arch.tdp_mmu_pages);
-	spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
+	atomic64_inc(&kvm->arch.tdp_mmu_pages);
 
 	return 0;
 }

base-commit: 88d1a6895b71b580b8fc98c23824aafa49d4948e
-- 
2.35.1.1094.g7c7d902a7c-goog



* Re: [PATCH v3 3/6] KVM: x86/mmu: explicitly check nx_hugepage in disallowed_hugepage_adjust()
  2022-04-01  6:36 ` [PATCH v3 3/6] KVM: x86/mmu: explicitly check nx_hugepage in disallowed_hugepage_adjust() Mingwei Zhang
@ 2022-04-01 14:33   ` Sean Christopherson
  0 siblings, 0 replies; 12+ messages in thread
From: Sean Christopherson @ 2022-04-01 14:33 UTC (permalink / raw)
  To: Mingwei Zhang
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel, Yosry Ahmed, Ben Gardon,
	David Matlack, Jing Zhang, Peter Xu

On Fri, Apr 01, 2022, Mingwei Zhang wrote:
> Explicitly check if a NX huge page is disallowed when determining if a page
> fault needs to be forced to use a smaller sized page. KVM incorrectly
> assumes that the NX huge page mitigation is the only scenario where KVM
> will create a shadow page instead of a huge page. Any scenario that causes
> KVM to zap leaf SPTEs may result in having a SP that can be made huge
> without violating the NX huge page mitigation. E.g. disabling of dirty
> logging, zapping from mmu_notifier due to page migration, guest MTRR
> changes that affect the viability of a huge page, etc...
> 
> Fixes: b8e8c8303ff2 ("kvm: mmu: ITLB_MULTIHIT mitigation")
> 
> Reviewed-by: Ben Gardon <bgardon@google.com>

Reviewed-by (and Acked-by and Tested-by) tags should be dropped when there is a
non-trivial, functional, and/or semantic change between versions.  It's often a
judgment call, but in this case I would definitely drop Ben's review.  It's usually
better to err on the side of caution for this rule/guideline.

> Signed-off-by: Mingwei Zhang <mizhang@google.com>
> ---
>  arch/x86/kvm/mmu/mmu.c | 14 ++++++++++++--
>  1 file changed, 12 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 5cb845fae56e..033609e8b332 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -2896,6 +2896,16 @@ void disallowed_hugepage_adjust(struct kvm_page_fault *fault, u64 spte, int cur_
>  	    cur_level == fault->goal_level &&
>  	    is_shadow_present_pte(spte) &&
>  	    !is_large_pte(spte)) {
> +		struct kvm_mmu_page *sp;
> +		u64 page_mask;
> +
> +		sp = to_shadow_page(spte & PT64_BASE_ADDR_MASK);
> +
> +		/* Prevent lpage_disallowed read from moving ahead. */
> +		smp_rmb();

smp_rmb() needs to be paired with e.g. smp_wmb(), and the comment is supposed to
tie the two together (the reason checkpatch warns about comments is to try and
prevent sprinkling bogus barriers everywhere).  These types of barriers always need
to be paired with something, as both the reader(s) and the writer need to have
specific, coordinated ordering.  E.g. reading lpage_disallowed after checking for
a present SPTE is meaningless if the writer doesn't set lpage_disallowed before
setting the SPTE present (which is the purpose of patch 1).  I didn't add the
smp_wmb() in patch 1 because it's not necessary as of that patch, it's only when
the concurrent reader comes along in this patch that the barrier becomes necessary.

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 9f73c175845e..041b1ab838f0 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -2901,7 +2901,12 @@ void disallowed_hugepage_adjust(struct kvm_page_fault *fault, u64 spte, int cur_

                sp = to_shadow_page(spte & PT64_BASE_ADDR_MASK);

-               /* Prevent lpage_disallowed read from moving ahead. */
+               /*
+                * Ensure lpage_disallowed is read after checking for a present
+                * SPTE.  A different vCPU may be concurrently installing the
+                * shadow page if mmu_lock is held for read.  Pairs with the
+                * smp_wmb() in kvm_tdp_mmu_map().
+                */
                smp_rmb();

                if (!sp->lpage_disallowed)
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 1f44826cf794..2422684f1dc3 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1186,6 +1186,13 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)

                        sp->lpage_disallowed = account_nx;

+                       /*
+                        * Ensure lpage_disallowed is visible before the SP is
+                        * marked present, as mmu_lock is held for read.  Pairs
+                        * with the smp_rmb() in disallowed_hugepage_adjust().
+                        */
+                       smp_wmb();
+
                        if (tdp_mmu_link_sp(kvm, &iter, sp, true)) {
                                tdp_mmu_free_sp(sp);
                                break;



* Re: [PATCH v3 1/6] KVM: x86/mmu: Set lpage_disallowed in TDP MMU before setting SPTE
  2022-04-01  6:36 ` [PATCH v3 1/6] KVM: x86/mmu: Set lpage_disallowed in TDP MMU before setting SPTE Mingwei Zhang
@ 2022-04-04 17:58   ` Ben Gardon
  2022-04-04 18:19     ` Sean Christopherson
  0 siblings, 1 reply; 12+ messages in thread
From: Ben Gardon @ 2022-04-04 17:58 UTC (permalink / raw)
  To: Mingwei Zhang
  Cc: Paolo Bonzini, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, kvm, LKML, Yosry Ahmed, David Matlack,
	Jing Zhang, Peter Xu

On Thu, Mar 31, 2022 at 11:36 PM Mingwei Zhang <mizhang@google.com> wrote:
>
> From: Sean Christopherson <seanjc@google.com>
>
> Set lpage_disallowed in TDP MMU shadow pages before making the SP visible
> to other readers, i.e. before setting its SPTE.  This will allow KVM to
> query lpage_disallowed when determining if a shadow page can be replaced
> by a NX huge page without violating the rules of the mitigation.
>
> Reviewed-by: Mingwei Zhang <mizhang@google.com>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>  arch/x86/kvm/mmu/mmu.c          | 14 ++++++++++----
>  arch/x86/kvm/mmu/mmu_internal.h |  2 +-
>  arch/x86/kvm/mmu/tdp_mmu.c      | 20 ++++++++++++--------
>  3 files changed, 23 insertions(+), 13 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 1361eb4599b4..5cb845fae56e 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -812,14 +812,20 @@ static void account_shadowed(struct kvm *kvm, struct kvm_mmu_page *sp)
>         kvm_mmu_gfn_disallow_lpage(slot, gfn);
>  }
>
> -void account_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp)
> +void __account_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp)
>  {
> -       if (sp->lpage_disallowed)
> -               return;
> -
>         ++kvm->stat.nx_lpage_splits;
>         list_add_tail(&sp->lpage_disallowed_link,
>                       &kvm->arch.lpage_disallowed_mmu_pages);
> +}
> +
> +static void account_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp)
> +{
> +       if (sp->lpage_disallowed)
> +               return;
> +
> +       __account_huge_nx_page(kvm, sp);
> +
>         sp->lpage_disallowed = true;
>  }
>
> diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> index 1bff453f7cbe..4a0087efa1e3 100644
> --- a/arch/x86/kvm/mmu/mmu_internal.h
> +++ b/arch/x86/kvm/mmu/mmu_internal.h
> @@ -168,7 +168,7 @@ void disallowed_hugepage_adjust(struct kvm_page_fault *fault, u64 spte, int cur_
>
>  void *mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
>
> -void account_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp);
> +void __account_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp);

I believe we need to modify the usage of this function in
paging_tmpl.h as well, at which point there should be no users of
account_huge_nx_page, so we can just modify the function directly
instead of adding a __helper.
(Disregard if the source I was looking at was out of date. Lots of
churn in this code recently.)


>  void unaccount_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp);
>
>  #endif /* __KVM_X86_MMU_INTERNAL_H */
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index b3b6426725d4..f05423545e6d 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -1122,16 +1122,13 @@ static int tdp_mmu_map_handle_target_level(struct kvm_vcpu *vcpu,
>   * @kvm: kvm instance
>   * @iter: a tdp_iter instance currently on the SPTE that should be set
>   * @sp: The new TDP page table to install.
> - * @account_nx: True if this page table is being installed to split a
> - *              non-executable huge page.
>   * @shared: This operation is running under the MMU lock in read mode.
>   *
>   * Returns: 0 if the new page table was installed. Non-0 if the page table
>   *          could not be installed (e.g. the atomic compare-exchange failed).
>   */
>  static int tdp_mmu_link_sp(struct kvm *kvm, struct tdp_iter *iter,
> -                          struct kvm_mmu_page *sp, bool account_nx,
> -                          bool shared)
> +                          struct kvm_mmu_page *sp, bool shared)
>  {
>         u64 spte = make_nonleaf_spte(sp->spt, !shadow_accessed_mask);
>         int ret = 0;
> @@ -1146,8 +1143,6 @@ static int tdp_mmu_link_sp(struct kvm *kvm, struct tdp_iter *iter,
>
>         spin_lock(&kvm->arch.tdp_mmu_pages_lock);
>         list_add(&sp->link, &kvm->arch.tdp_mmu_pages);
> -       if (account_nx)
> -               account_huge_nx_page(kvm, sp);
>         spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
>
>         return 0;
> @@ -1160,6 +1155,7 @@ static int tdp_mmu_link_sp(struct kvm *kvm, struct tdp_iter *iter,
>  int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
>  {
>         struct kvm_mmu *mmu = vcpu->arch.mmu;
> +       struct kvm *kvm = vcpu->kvm;
>         struct tdp_iter iter;
>         struct kvm_mmu_page *sp;
>         int ret;
> @@ -1210,10 +1206,18 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
>                         sp = tdp_mmu_alloc_sp(vcpu);
>                         tdp_mmu_init_child_sp(sp, &iter);
>
> -                       if (tdp_mmu_link_sp(vcpu->kvm, &iter, sp, account_nx, true)) {
> +                       sp->lpage_disallowed = account_nx;
> +
> +                       if (tdp_mmu_link_sp(kvm, &iter, sp, true)) {
>                                 tdp_mmu_free_sp(sp);
>                                 break;
>                         }
> +
> +                       if (account_nx) {
> +                               spin_lock(&kvm->arch.tdp_mmu_pages_lock);
> +                               __account_huge_nx_page(kvm, sp);
> +                               spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
> +                       }
>                 }
>         }
>
> @@ -1501,7 +1505,7 @@ static int tdp_mmu_split_huge_page(struct kvm *kvm, struct tdp_iter *iter,
>          * correctness standpoint since the translation will be the same either
>          * way.
>          */
> -       ret = tdp_mmu_link_sp(kvm, iter, sp, false, shared);
> +       ret = tdp_mmu_link_sp(kvm, iter, sp, shared);
>         if (ret)
>                 goto out;
>
> --
> 2.35.1.1094.g7c7d902a7c-goog
>


* Re: [PATCH v3 1/6] KVM: x86/mmu: Set lpage_disallowed in TDP MMU before setting SPTE
  2022-04-04 17:58   ` Ben Gardon
@ 2022-04-04 18:19     ` Sean Christopherson
  2022-04-04 18:54       ` Ben Gardon
  0 siblings, 1 reply; 12+ messages in thread
From: Sean Christopherson @ 2022-04-04 18:19 UTC (permalink / raw)
  To: Ben Gardon
  Cc: Mingwei Zhang, Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, kvm, LKML, Yosry Ahmed, David Matlack,
	Jing Zhang, Peter Xu

On Mon, Apr 04, 2022, Ben Gardon wrote:
> On Thu, Mar 31, 2022 at 11:36 PM Mingwei Zhang <mizhang@google.com> wrote:
> > diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> > index 1bff453f7cbe..4a0087efa1e3 100644
> > --- a/arch/x86/kvm/mmu/mmu_internal.h
> > +++ b/arch/x86/kvm/mmu/mmu_internal.h
> > @@ -168,7 +168,7 @@ void disallowed_hugepage_adjust(struct kvm_page_fault *fault, u64 spte, int cur_
> >
> >  void *mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
> >
> > -void account_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp);
> > +void __account_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp);
> 
> I believe we need to modify the usage of this function in
> paging_tmpl.h as well, at which point there should be no users of
> account_huge_nx_page, so we can just modify the function directly
> instead of adding a __helper.
> (Disregard if the source I was looking at was out of date. Lots of
> churn in this code recently.)

paging_tmpl.h is shadow paging only, i.e. will always handle page faults with
mmu_lock held for write and it also needs the check for sp->lpage_disallowed
already being set.  Only the TDP MMU code is special in that (a) it holds mmu_lock
for read and (b) never reuses shadow pages when inserting into the page tables.

Or did I completely misunderstand what you meant by "need to modify the usage"?


* Re: [PATCH v3 1/6] KVM: x86/mmu: Set lpage_disallowed in TDP MMU before setting SPTE
  2022-04-04 18:19     ` Sean Christopherson
@ 2022-04-04 18:54       ` Ben Gardon
  0 siblings, 0 replies; 12+ messages in thread
From: Ben Gardon @ 2022-04-04 18:54 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Mingwei Zhang, Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, kvm, LKML, Yosry Ahmed, David Matlack,
	Jing Zhang, Peter Xu

On Mon, Apr 4, 2022 at 11:20 AM Sean Christopherson <seanjc@google.com> wrote:
>
> On Mon, Apr 04, 2022, Ben Gardon wrote:
> > On Thu, Mar 31, 2022 at 11:36 PM Mingwei Zhang <mizhang@google.com> wrote:
> > > diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> > > index 1bff453f7cbe..4a0087efa1e3 100644
> > > --- a/arch/x86/kvm/mmu/mmu_internal.h
> > > +++ b/arch/x86/kvm/mmu/mmu_internal.h
> > > @@ -168,7 +168,7 @@ void disallowed_hugepage_adjust(struct kvm_page_fault *fault, u64 spte, int cur_
> > >
> > >  void *mmu_memory_cache_alloc(struct kvm_mmu_memory_cache *mc);
> > >
> > > -void account_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp);
> > > +void __account_huge_nx_page(struct kvm *kvm, struct kvm_mmu_page *sp);
> >
> > I believe we need to modify the usage of this function in
> > paging_tmpl.h as well, at which point there should be no users of
> > account_huge_nx_page, so we can just modify the function directly
> > instead of adding a __helper.
> > (Disregard if the source I was looking at was out of date. Lots of
> > churn in this code recently.)
>
> paging_tmpl.h is shadow paging only, i.e. will always handled page faults with
> mmu_lock held for write and it also needs the check for sp->lpage_disallowed
> already being set.  Only the TDP MMU code is special in that (a) it holds mmu_lock
> for read and (b) never reuses shadow pages when inserting into the page tables.
>
> Or did I completely misunderstand what you meant by "need to modify the usage"?

Ah right duh. For some reason I thought we were modifying __direct_map
in this commit too. Nevermind, no change needed.

