* [PATCH] KVM: x86/mmu: Do not create SPTEs for GFNs that exceed host.MAXPHYADDR
@ 2022-04-28 23:34 Sean Christopherson
  2022-04-29 10:36 ` Paolo Bonzini
  2022-05-02 11:12 ` Kai Huang
  0 siblings, 2 replies; 20+ messages in thread
From: Sean Christopherson @ 2022-04-28 23:34 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel, Maxim Levitsky, Ben Gardon,
	David Matlack

Disallow memslots and MMIO SPTEs whose gpa range would exceed the host's
MAXPHYADDR, i.e. don't create SPTEs for gfns that exceed host.MAXPHYADDR.
The TDP MMU bounds its zapping based on host.MAXPHYADDR, and so if the
guest, possibly with help from userspace, manages to coerce KVM into
creating a SPTE for an "impossible" gfn, KVM will leak the associated
shadow pages (page tables).
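
For illustration, a minimal userspace sketch (hypothetical values, not part of
this series) of a memslot that the new check rejects: on a host with
MAXPHYADDR=46, a slot whose last gfn exceeds kvm_mmu_max_gfn_host() now fails
KVM_SET_USER_MEMORY_REGION with EINVAL.

  #include <linux/kvm.h>
  #include <sys/ioctl.h>

  /* Illustrative only: vm_fd is an existing KVM VM fd. */
  static int set_impossible_slot(int vm_fd)
  {
  	struct kvm_userspace_memory_region region = {
  		.slot            = 1,
  		.flags           = 0,
  		/* gpa 2^46: its gfn (2^34) is just past kvm_mmu_max_gfn_host()
  		 * on a host with MAXPHYADDR=46. */
  		.guest_phys_addr = 1ULL << 46,
  		.memory_size     = 0x1000,
  		.userspace_addr  = 0,	/* would point at real backing memory */
  	};

  	/* With this patch, the ioctl fails with errno == EINVAL. */
  	return ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region);
  }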

On bare metal, encountering an impossible gpa in the page fault path is
well and truly impossible, barring CPU bugs, as the CPU will signal #PF
during the gva=>gpa translation (or a similar failure when stuffing a
physical address into e.g. the VMCS/VMCB).  But if KVM is running as a VM
itself, the MAXPHYADDR enumerated to KVM may not be the actual MAXPHYADDR
of the underlying hardware, in which case the hardware will not fault on
the illegal-from-KVM's-perspective gpa.

Alternatively, KVM could continue allowing the dodgy behavior and simply
zap the max possible range.  But, for hosts with MAXPHYADDR < 52, that's
a (minor) waste of cycles, and more importantly, KVM can't reasonably
support impossible memslots when running on bare metal (or with an
accurate MAXPHYADDR as a VM).  Note, limiting the overhead by checking if
KVM is running as a guest is not a safe option as the host isn't required
to announce itself to the guest in any way, e.g. doesn't need to set the
HYPERVISOR CPUID bit.
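
(Purely for illustration, the kind of check being ruled out; boot_cpu_has() and
X86_FEATURE_HYPERVISOR are the existing kernel helpers, nothing added here:)

  	/*
  	 * Unsafe: a host can run KVM as a guest without ever setting
  	 * CPUID.1:ECX.HYPERVISOR, so "false" here does not prove that the
  	 * enumerated MAXPHYADDR matches the real hardware's.
  	 */
  	bool maybe_guest = boot_cpu_has(X86_FEATURE_HYPERVISOR);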

A second alternative to disallowing the memslot behavior would be to
disallow creating a VM with guest.MAXPHYADDR > host.MAXPHYADDR.  That
restriction is undesirable as there are legitimate use cases for doing
so, e.g. using the highest host.MAXPHYADDR out of a pool of heterogeneous
systems so that VMs can be migrated between hosts with different
MAXPHYADDRs without running afoul of the allow_smaller_maxphyaddr mess.

Opportunistically make the now-common kvm_mmu_max_gfn_host() inclusive
instead of exclusive.  The inclusive approach is somewhat silly as the
memslot and TDP MMU code want an exclusive value, but the name implies
the returned value is inclusive, and the MMIO path needs an inclusive
check.
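
To make the inclusive/exclusive split concrete (numbers purely illustrative,
assuming shadow_phys_bits == 46 and PAGE_SHIFT == 12):

  	kvm_mmu_max_gfn_host()           == (1ULL << (46 - 12)) - 1   == 0x3ffffffff  /* inclusive, MMIO check    */
  	tdp_mmu_max_exclusive_gfn_host() == kvm_mmu_max_gfn_host() + 1 == 0x400000000  /* exclusive, memslots/zap */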

  WARNING: CPU: 10 PID: 1122 at arch/x86/kvm/mmu/tdp_mmu.c:57
                                kvm_mmu_uninit_tdp_mmu+0x4b/0x60 [kvm]
  Modules linked in: kvm_intel kvm irqbypass
  CPU: 10 PID: 1122 Comm: set_memory_regi Tainted: G        W         5.18.0-rc1+ #293
  Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
  RIP: 0010:kvm_mmu_uninit_tdp_mmu+0x4b/0x60 [kvm]
  Call Trace:
   <TASK>
   kvm_arch_destroy_vm+0x130/0x1b0 [kvm]
   kvm_destroy_vm+0x162/0x2d0 [kvm]
   kvm_vm_release+0x1d/0x30 [kvm]
   __fput+0x82/0x240
   task_work_run+0x5b/0x90
   exit_to_user_mode_prepare+0xd2/0xe0
   syscall_exit_to_user_mode+0x1d/0x40
   entry_SYSCALL_64_after_hwframe+0x44/0xae
   </TASK>

Fixes: faaf05b00aec ("kvm: x86/mmu: Support zapping SPTEs in the TDP MMU")
Fixes: 524a1e4e381f ("KVM: x86/mmu: Don't leak non-leaf SPTEs when zapping all SPTEs")
Cc: stable@vger.kernel.org
Cc: Maxim Levitsky <mlevitsk@redhat.com>
Cc: Ben Gardon <bgardon@google.com>
Cc: David Matlack <dmatlack@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---

Maxim, I didn't add you as Reported-by because I'm not confident this is
actually the cause of the bug you're hitting.  I wasn't able to reproduce
using your ipi_stress test; I found this horror via inspection and proved
it by hacking a selftest.

That said, I'm holding out hope that this does fix your issue.  Your
config doesn't specify host-phys-bits and IIUC you haven't been able to
repro on bare metal, which fits the kvm.MAXPHYADDR < cpu.MAXPHYADDR
scenario.  In theory, if the test went rogue it could access a bad gfn and
cause KVM to insert an MMIO SPTE.  Fingers crossed :-)

If it does fix your case, a Reported-by for both you and Syzbot is
definitely in order.

 arch/x86/kvm/mmu.h         | 21 +++++++++++++++++++++
 arch/x86/kvm/mmu/mmu.c     | 10 ++++++++--
 arch/x86/kvm/mmu/spte.h    |  6 ------
 arch/x86/kvm/mmu/tdp_mmu.c | 15 ++++++++-------
 arch/x86/kvm/x86.c         |  6 +++++-
 5 files changed, 42 insertions(+), 16 deletions(-)

diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 671cfeccf04e..d291ff3065dc 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -65,6 +65,27 @@ static __always_inline u64 rsvd_bits(int s, int e)
 	return ((2ULL << (e - s)) - 1) << s;
 }
 
+/*
+ * The number of non-reserved physical address bits irrespective of features
+ * that repurpose legal bits, e.g. MKTME.
+ */
+extern u8 __read_mostly shadow_phys_bits;
+
+static inline gfn_t kvm_mmu_max_gfn_host(void)
+{
+	/*
+	 * Disallow SPTEs (via memslots or cached MMIO) whose gfn would exceed
+	 * host.MAXPHYADDR.  Assuming KVM is running on bare metal, guest
+	 * accesses beyond host.MAXPHYADDR will hit a #PF(RSVD) and never hit
+	 * an EPT Violation/Misconfig / #NPF, and so KVM will never install a
+	 * SPTE for such addresses.  That doesn't hold true if KVM is running
+	 * as a VM itself, e.g. if the MAXPHYADDR KVM sees is less than
+	 * hardware's real MAXPHYADDR, but since KVM can't honor such behavior
+	 * on bare metal, disallow it entirely to simplify e.g. the TDP MMU.
+	 */
+	return (1ULL << (shadow_phys_bits - PAGE_SHIFT)) - 1;
+}
+
 void kvm_mmu_set_mmio_spte_mask(u64 mmio_value, u64 mmio_mask, u64 access_mask);
 void kvm_mmu_set_ept_masks(bool has_ad_bits, bool has_exec_only);
 
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 904f0faff218..c10ae25135e3 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3010,9 +3010,15 @@ static bool handle_abnormal_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fa
 		/*
 		 * If MMIO caching is disabled, emulate immediately without
 		 * touching the shadow page tables as attempting to install an
-		 * MMIO SPTE will just be an expensive nop.
+		 * MMIO SPTE will just be an expensive nop.  Do not cache MMIO
+		 * whose gfn is greater than host.MAXPHYADDR, any guest that
+		 * generates such gfns is either malicious or in the weeds.
+		 * Note, it's possible to observe a gfn > host.MAXPHYADDR if
+		 * and only if host.MAXPHYADDR is inaccurate with respect to
+		 * hardware behavior, e.g. if KVM itself is running as a VM.
 		 */
-		if (unlikely(!enable_mmio_caching)) {
+		if (unlikely(!enable_mmio_caching) ||
+		    unlikely(fault->gfn > kvm_mmu_max_gfn_host())) {
 			*ret_val = RET_PF_EMULATE;
 			return true;
 		}
diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
index ad8ce3c5d083..43eceb821b31 100644
--- a/arch/x86/kvm/mmu/spte.h
+++ b/arch/x86/kvm/mmu/spte.h
@@ -203,12 +203,6 @@ static inline bool is_removed_spte(u64 spte)
  */
 extern u64 __read_mostly shadow_nonpresent_or_rsvd_lower_gfn_mask;
 
-/*
- * The number of non-reserved physical address bits irrespective of features
- * that repurpose legal bits, e.g. MKTME.
- */
-extern u8 __read_mostly shadow_phys_bits;
-
 static inline bool is_mmio_spte(u64 spte)
 {
 	return (spte & shadow_mmio_mask) == shadow_mmio_value &&
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 566548a3efa7..a81994ad43de 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -815,14 +815,15 @@ static inline bool __must_check tdp_mmu_iter_cond_resched(struct kvm *kvm,
 	return iter->yielded;
 }
 
-static inline gfn_t tdp_mmu_max_gfn_host(void)
+static inline gfn_t tdp_mmu_max_exclusive_gfn_host(void)
 {
 	/*
-	 * Bound TDP MMU walks at host.MAXPHYADDR, guest accesses beyond that
-	 * will hit a #PF(RSVD) and never hit an EPT Violation/Misconfig / #NPF,
-	 * and so KVM will never install a SPTE for such addresses.
+	 * Bound TDP MMU walks at host.MAXPHYADDR.  KVM disallows memslots with
+	 * a gpa range that would exceed the max gfn, and KVM does not create
+	 * MMIO SPTEs for "impossible" gfns, instead sending such accesses down
+	 * the slow emulation path every time.
 	 */
-	return 1ULL << (shadow_phys_bits - PAGE_SHIFT);
+	return kvm_mmu_max_gfn_host() + 1;
 }
 
 static void __tdp_mmu_zap_root(struct kvm *kvm, struct kvm_mmu_page *root,
@@ -830,7 +831,7 @@ static void __tdp_mmu_zap_root(struct kvm *kvm, struct kvm_mmu_page *root,
 {
 	struct tdp_iter iter;
 
-	gfn_t end = tdp_mmu_max_gfn_host();
+	gfn_t end = tdp_mmu_max_exclusive_gfn_host();
 	gfn_t start = 0;
 
 	for_each_tdp_pte_min_level(iter, root, zap_level, start, end) {
@@ -923,7 +924,7 @@ static bool tdp_mmu_zap_leafs(struct kvm *kvm, struct kvm_mmu_page *root,
 {
 	struct tdp_iter iter;
 
-	end = min(end, tdp_mmu_max_gfn_host());
+	end = min(end, tdp_mmu_max_exclusive_gfn_host());
 
 	lockdep_assert_held_write(&kvm->mmu_lock);
 
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 951d0a78ccda..26ea5999d72b 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -12078,8 +12078,12 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
 				   struct kvm_memory_slot *new,
 				   enum kvm_mr_change change)
 {
-	if (change == KVM_MR_CREATE || change == KVM_MR_MOVE)
+	if (change == KVM_MR_CREATE || change == KVM_MR_MOVE) {
+		if ((new->base_gfn + new->npages - 1) > kvm_mmu_max_gfn_host())
+			return -EINVAL;
+
 		return kvm_alloc_memslot_metadata(kvm, new);
+	}
 
 	if (change == KVM_MR_FLAGS_ONLY)
 		memcpy(&new->arch, &old->arch, sizeof(old->arch));

base-commit: 2a39d8b39bffdaf1a4223d0d22f07baee154c8f3
-- 
2.36.0.464.gb9c8b46e94-goog


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [PATCH] KVM: x86/mmu: Do not create SPTEs for GFNs that exceed host.MAXPHYADDR
  2022-04-28 23:34 [PATCH] KVM: x86/mmu: Do not create SPTEs for GFNs that exceed host.MAXPHYADDR Sean Christopherson
@ 2022-04-29 10:36 ` Paolo Bonzini
  2022-04-29 14:24   ` Sean Christopherson
  2022-05-02 11:12 ` Kai Huang
  1 sibling, 1 reply; 20+ messages in thread
From: Paolo Bonzini @ 2022-04-29 10:36 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel, kvm,
	linux-kernel, Maxim Levitsky, Ben Gardon, David Matlack

On 4/29/22 01:34, Sean Christopherson wrote:

> +static inline gfn_t kvm_mmu_max_gfn_host(void)
> +{
> +	/*
> +	 * Disallow SPTEs (via memslots or cached MMIO) whose gfn would exceed
> +	 * host.MAXPHYADDR.  Assuming KVM is running on bare metal, guest
> +	 * accesses beyond host.MAXPHYADDR will hit a #PF(RSVD) and never hit
> +	 * an EPT Violation/Misconfig / #NPF, and so KVM will never install a
> +	 * SPTE for such addresses.  That doesn't hold true if KVM is running
> +	 * as a VM itself, e.g. if the MAXPHYADDR KVM sees is less than
> +	 * hardware's real MAXPHYADDR, but since KVM can't honor such behavior
> +	 * on bare metal, disallow it entirely to simplify e.g. the TDP MMU.
> +	 */
> +	return (1ULL << (shadow_phys_bits - PAGE_SHIFT)) - 1;

The host.MAXPHYADDR however does not matter if EPT/NPT is not in use, because
the shadow paging fault path can accept any gfn.
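
(For illustration, assuming PAGE_SHIFT == 12: without EPT/NPT the bound is the
architectural 52-bit limit rather than the host's, i.e.

	max gfn = (1ULL << (52 - 12)) - 1 == 0xffffffffff

which is what the tdp_enabled check in the helper below encodes.)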

> -static inline gfn_t tdp_mmu_max_gfn_host(void)
> +static inline gfn_t tdp_mmu_max_exclusive_gfn_host(void)
>   {
>   	/*
> -	 * Bound TDP MMU walks at host.MAXPHYADDR, guest accesses beyond that
> -	 * will hit a #PF(RSVD) and never hit an EPT Violation/Misconfig / #NPF,
> -	 * and so KVM will never install a SPTE for such addresses.
> +	 * Bound TDP MMU walks at host.MAXPHYADDR.  KVM disallows memslots with
> +	 * a gpa range that would exceed the max gfn, and KVM does not create
> +	 * MMIO SPTEs for "impossible" gfns, instead sending such accesses down
> +	 * the slow emulation path every time.
>   	 */
> -	return 1ULL << (shadow_phys_bits - PAGE_SHIFT);
> +	return kvm_mmu_max_gfn_host() + 1;
>   }

Slightly nicer name, tdp_mmu_max_gfn_exclusive().  It has to be the host
one because EPT/NPT is in use, but it doesn't really matter.

> +		 * whose gfn is greater than host.MAXPHYADDR, any guest that
> +		 * generates such gfns is either malicious or in the weeds.
> +		 * Note, it's possible to observe a gfn > host.MAXPHYADDR if
> +		 * and only if host.MAXPHYADDR is inaccurate with respect to
> +		 * hardware behavior, e.g. if KVM itself is running as a VM.

I don't think maliciousness is particularly likely, and "in the weeds" implies
L2 is the buggy one.  Slightly more accurate:

         * whose gfn is greater than host.MAXPHYADDR, any guest that
         * generates such gfns is running nested and is being tricked
         * by L0 userspace (you can observe gfn > L1.MAXPHYADDR if
         * and only if L1's MAXPHYADDR is inaccurate with respect to
         * the hardware's).

Putting everything together and rebasing on top of kvm/master:

diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index e6cae6f22683..dba275d323a7 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -65,6 +65,30 @@ static __always_inline u64 rsvd_bits(int s, int e)
  	return ((2ULL << (e - s)) - 1) << s;
  }
  
+/*
+ * The number of non-reserved physical address bits irrespective of features
+ * that repurpose legal bits, e.g. MKTME.
+ */
+extern u8 __read_mostly shadow_phys_bits;
+
+static inline gfn_t kvm_mmu_max_gfn(void)
+{
+	/*
+	 * Note that this uses the host MAXPHYADDR, not the guest's.
+	 * EPT/NPT cannot support GPAs that would exceed host.MAXPHYADDR;
+	 * assuming KVM is running on bare metal, guest accesses beyond
+	 * host.MAXPHYADDR will hit a #PF(RSVD) and never cause a vmexit
+	 * (either EPT Violation/Misconfig or #NPF), and so KVM will never
+	 * install a SPTE for such addresses.  If KVM is running as a VM
+	 * itself, on the other hand, it might see a MAXPHYADDR that is less
+	 * than hardware's real MAXPHYADDR.  Using the host MAXPHYADDR
+	 * disallows such SPTEs entirely and simplifies the TDP MMU.
+	 */
+	int max_gpa_bits = likely(tdp_enabled) ? shadow_phys_bits : 52;
+
+	return (1ULL << (max_gpa_bits - PAGE_SHIFT)) - 1;
+}
+
  void kvm_mmu_set_mmio_spte_mask(u64 mmio_value, u64 mmio_mask, u64 access_mask);
  void kvm_mmu_set_ept_masks(bool has_ad_bits, bool has_exec_only);
  
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index af7910a46c12..7b632a4f81cb 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3033,9 +3033,15 @@ static bool handle_abnormal_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fa
  		/*
  		 * If MMIO caching is disabled, emulate immediately without
  		 * touching the shadow page tables as attempting to install an
-		 * MMIO SPTE will just be an expensive nop.
+		 * MMIO SPTE will just be an expensive nop.  Do not cache MMIO
+		 * whose gfn is greater than host.MAXPHYADDR, any guest that
+		 * generates such gfns is running nested and is being tricked
+		 * by L0 userspace (you can observe gfn > L1.MAXPHYADDR if
+		 * and only if L1's MAXPHYADDR is inaccurate with respect to
+		 * the hardware's).
  		 */
-		if (unlikely(!shadow_mmio_value)) {
+		if (unlikely(!shadow_mmio_value) ||
+		    unlikely(fault->gfn > kvm_mmu_max_gfn())) {
  			*ret_val = RET_PF_EMULATE;
  			return true;
  		}
diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
index 73f12615416f..e4abeb5df1b1 100644
--- a/arch/x86/kvm/mmu/spte.h
+++ b/arch/x86/kvm/mmu/spte.h
@@ -201,12 +201,6 @@ static inline bool is_removed_spte(u64 spte)
   */
  extern u64 __read_mostly shadow_nonpresent_or_rsvd_lower_gfn_mask;
  
-/*
- * The number of non-reserved physical address bits irrespective of features
- * that repurpose legal bits, e.g. MKTME.
- */
-extern u8 __read_mostly shadow_phys_bits;
-
  static inline bool is_mmio_spte(u64 spte)
  {
  	return (spte & shadow_mmio_mask) == shadow_mmio_value &&
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index c472769e0300..edc68538819b 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -815,14 +815,15 @@ static inline bool __must_check tdp_mmu_iter_cond_resched(struct kvm *kvm,
  	return iter->yielded;
  }
  
-static inline gfn_t tdp_mmu_max_gfn_host(void)
+static inline gfn_t tdp_mmu_max_gfn_exclusive(void)
  {
  	/*
-	 * Bound TDP MMU walks at host.MAXPHYADDR, guest accesses beyond that
-	 * will hit a #PF(RSVD) and never hit an EPT Violation/Misconfig / #NPF,
-	 * and so KVM will never install a SPTE for such addresses.
+	 * Bound TDP MMU walks at host.MAXPHYADDR.  KVM disallows memslots with
+	 * a gpa range that would exceed the max gfn, and KVM does not create
+	 * MMIO SPTEs for "impossible" gfns, instead sending such accesses down
+	 * the slow emulation path every time.
  	 */
-	return 1ULL << (shadow_phys_bits - PAGE_SHIFT);
+	return kvm_mmu_max_gfn() + 1;
  }
  
  static void __tdp_mmu_zap_root(struct kvm *kvm, struct kvm_mmu_page *root,
@@ -830,7 +831,7 @@ static void __tdp_mmu_zap_root(struct kvm *kvm, struct kvm_mmu_page *root,
  {
  	struct tdp_iter iter;
  
-	gfn_t end = tdp_mmu_max_gfn_host();
+	gfn_t end = tdp_mmu_max_gfn_exclusive();
  	gfn_t start = 0;
  
  	for_each_tdp_pte_min_level(iter, root, zap_level, start, end) {
@@ -923,7 +924,7 @@ static bool tdp_mmu_zap_leafs(struct kvm *kvm, struct kvm_mmu_page *root,
  {
  	struct tdp_iter iter;
  
-	end = min(end, tdp_mmu_max_gfn_host());
+	end = min(end, tdp_mmu_max_gfn_exclusive());
  
  	lockdep_assert_held_write(&kvm->mmu_lock);
  
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 278b2fdd3590..015ecc249c2e 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -11994,8 +11994,12 @@ int kvm_arch_prepare_memory_region(struct kvm *kvm,
  				   struct kvm_memory_slot *new,
  				   enum kvm_mr_change change)
  {
-	if (change == KVM_MR_CREATE || change == KVM_MR_MOVE)
+	if (change == KVM_MR_CREATE || change == KVM_MR_MOVE) {
+		if ((new->base_gfn + new->npages - 1) > kvm_mmu_max_gfn())
+			return -EINVAL;
+
  		return kvm_alloc_memslot_metadata(kvm, new);
+	}
  
  	if (change == KVM_MR_FLAGS_ONLY)
  		memcpy(&new->arch, &old->arch, sizeof(old->arch));


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [PATCH] KVM: x86/mmu: Do not create SPTEs for GFNs that exceed host.MAXPHYADDR
  2022-04-29 10:36 ` Paolo Bonzini
@ 2022-04-29 14:24   ` Sean Christopherson
  2022-04-29 14:37     ` Paolo Bonzini
  0 siblings, 1 reply; 20+ messages in thread
From: Sean Christopherson @ 2022-04-29 14:24 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel, kvm,
	linux-kernel, Maxim Levitsky, Ben Gardon, David Matlack

On Fri, Apr 29, 2022, Paolo Bonzini wrote:
> On 4/29/22 01:34, Sean Christopherson wrote:
> 
> > +static inline gfn_t kvm_mmu_max_gfn_host(void)
> > +{
> > +	/*
> > +	 * Disallow SPTEs (via memslots or cached MMIO) whose gfn would exceed
> > +	 * host.MAXPHYADDR.  Assuming KVM is running on bare metal, guest
> > +	 * accesses beyond host.MAXPHYADDR will hit a #PF(RSVD) and never hit
> > +	 * an EPT Violation/Misconfig / #NPF, and so KVM will never install a
> > +	 * SPTE for such addresses.  That doesn't hold true if KVM is running
> > +	 * as a VM itself, e.g. if the MAXPHYADDR KVM sees is less than
> > +	 * hardware's real MAXPHYADDR, but since KVM can't honor such behavior
> > +	 * on bare metal, disallow it entirely to simplify e.g. the TDP MMU.
> > +	 */
> > +	return (1ULL << (shadow_phys_bits - PAGE_SHIFT)) - 1;
> 
> The host.MAXPHYADDR however does not matter if EPT/NPT is not in use, because
> the shadow paging fault path can accept any gfn.

... 

> diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
> index e6cae6f22683..dba275d323a7 100644
> --- a/arch/x86/kvm/mmu.h
> +++ b/arch/x86/kvm/mmu.h
> @@ -65,6 +65,30 @@ static __always_inline u64 rsvd_bits(int s, int e)
>  	return ((2ULL << (e - s)) - 1) << s;
>  }
> +/*
> + * The number of non-reserved physical address bits irrespective of features
> + * that repurpose legal bits, e.g. MKTME.
> + */
> +extern u8 __read_mostly shadow_phys_bits;
> +
> +static inline gfn_t kvm_mmu_max_gfn(void)
> +{
> +	/*
> +	 * Note that this uses the host MAXPHYADDR, not the guest's.
> +	 * EPT/NPT cannot support GPAs that would exceed host.MAXPHYADDR;
> +	 * assuming KVM is running on bare metal, guest accesses beyond
> +	 * host.MAXPHYADDR will hit a #PF(RSVD) and never cause a vmexit
> +	 * (either EPT Violation/Misconfig or #NPF), and so KVM will never
> +	 * install a SPTE for such addresses.  If KVM is running as a VM
> +	 * itself, on the other hand, it might see a MAXPHYADDR that is less
> +	 * than hardware's real MAXPHYADDR.  Using the host MAXPHYADDR
> +	 * disallows such SPTEs entirely and simplifies the TDP MMU.
> +	 */
> +	int max_gpa_bits = likely(tdp_enabled) ? shadow_phys_bits : 52;

I don't love the divergent memslot behavior, but it's technically correct, so I
can't really argue.  Do we want to "officially" document the memslot behavior?

> +
> +	return (1ULL << (max_gpa_bits - PAGE_SHIFT)) - 1;
> +}
> +
>  void kvm_mmu_set_mmio_spte_mask(u64 mmio_value, u64 mmio_mask, u64 access_mask);
>  void kvm_mmu_set_ept_masks(bool has_ad_bits, bool has_exec_only);

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] KVM: x86/mmu: Do not create SPTEs for GFNs that exceed host.MAXPHYADDR
  2022-04-29 14:24   ` Sean Christopherson
@ 2022-04-29 14:37     ` Paolo Bonzini
  2022-04-29 14:42       ` Sean Christopherson
  0 siblings, 1 reply; 20+ messages in thread
From: Paolo Bonzini @ 2022-04-29 14:37 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel, kvm,
	linux-kernel, Maxim Levitsky, Ben Gardon, David Matlack

On 4/29/22 16:24, Sean Christopherson wrote:
> I don't love the divergent memslot behavior, but it's technically correct, so I
> can't really argue.  Do we want to "officially" document the memslot behavior?
> 

I don't know what you mean by officially document, but at least I have 
relied on it to test KVM's MAXPHYADDR=52 cases before such hardware 
existed. :)

Paolo


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] KVM: x86/mmu: Do not create SPTEs for GFNs that exceed host.MAXPHYADDR
  2022-04-29 14:37     ` Paolo Bonzini
@ 2022-04-29 14:42       ` Sean Christopherson
  2022-04-29 14:50         ` Paolo Bonzini
  0 siblings, 1 reply; 20+ messages in thread
From: Sean Christopherson @ 2022-04-29 14:42 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel, kvm,
	linux-kernel, Maxim Levitsky, Ben Gardon, David Matlack

On Fri, Apr 29, 2022, Paolo Bonzini wrote:
> On 4/29/22 16:24, Sean Christopherson wrote:
> > I don't love the divergent memslot behavior, but it's technically correct, so I
> > can't really argue.  Do we want to "officially" document the memslot behavior?
> > 
> 
> I don't know what you mean by officially document,

Something in kvm/api.rst under KVM_SET_USER_MEMORY_REGION.

> but at least I have relied on it to test KVM's MAXPHYADDR=52 cases before
> such hardware existed.  :)

Ah, that's a very good reason to support this for shadow paging.  Maybe throw
something about testing in the changelog?  Without considering the testing angle,
it looks like KVM supports max=52 for !TDP just because it can, because practically
speaking there's unlikely to be a use case for exposing that much memory to a
guest when using shadow paging.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] KVM: x86/mmu: Do not create SPTEs for GFNs that exceed host.MAXPHYADDR
  2022-04-29 14:42       ` Sean Christopherson
@ 2022-04-29 14:50         ` Paolo Bonzini
  2022-04-29 16:01           ` Sean Christopherson
  0 siblings, 1 reply; 20+ messages in thread
From: Paolo Bonzini @ 2022-04-29 14:50 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel, kvm,
	linux-kernel, Maxim Levitsky, Ben Gardon, David Matlack

On 4/29/22 16:42, Sean Christopherson wrote:
> On Fri, Apr 29, 2022, Paolo Bonzini wrote:
>> On 4/29/22 16:24, Sean Christopherson wrote:
>>> I don't love the divergent memslot behavior, but it's technically correct, so I
>>> can't really argue.  Do we want to "officially" document the memslot behavior?
>>>
>>
>> I don't know what you mean by officially document,
> 
> Something in kvm/api.rst under KVM_SET_USER_MEMORY_REGION.

Not sure if the API documentation is the best place because userspace 
does not know whether shadow paging is on (except indirectly through 
other capabilities, perhaps)?

It could even be programmatic, such as returning 52 for 
CPUID[0x80000008].  A nested KVM on L1 would not be able to use the 
#PF(RSVD) trick to detect MMIO faults.  That's not a big price to pay, 
however I'm not sure it's a good idea in general...

Paolo

> 
>> but at least I have relied on it to test KVM's MAXPHYADDR=52 cases before
>> such hardware existed.  :)
> 
> Ah, that's a very good reason to support this for shadow paging.  Maybe throw
> something about testing in the changelog?  Without considering the testing angle,
> it looks like KVM supports max=52 for !TDP just because it can, because practically
> speaking there's unlikely to be a use case for exposing that much memory to a
> guest when using shadow paging.
> 


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] KVM: x86/mmu: Do not create SPTEs for GFNs that exceed host.MAXPHYADDR
  2022-04-29 14:50         ` Paolo Bonzini
@ 2022-04-29 16:01           ` Sean Christopherson
  2022-05-01 14:28             ` Maxim Levitsky
  0 siblings, 1 reply; 20+ messages in thread
From: Sean Christopherson @ 2022-04-29 16:01 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel, kvm,
	linux-kernel, Maxim Levitsky, Ben Gardon, David Matlack

On Fri, Apr 29, 2022, Paolo Bonzini wrote:
> On 4/29/22 16:42, Sean Christopherson wrote:
> > On Fri, Apr 29, 2022, Paolo Bonzini wrote:
> > > On 4/29/22 16:24, Sean Christopherson wrote:
> > > > I don't love the divergent memslot behavior, but it's technically correct, so I
> > > > can't really argue.  Do we want to "officially" document the memslot behavior?
> > > > 
> > > 
> > > I don't know what you mean by officially document,
> > 
> > Something in kvm/api.rst under KVM_SET_USER_MEMORY_REGION.
> 
> Not sure if the API documentation is the best place because userspace does
> not know whether shadow paging is on (except indirectly through other
> capabilities, perhaps)?

Hrm, true, it's not like the userspace VMM can rewrite itself at runtime.

> It could even be programmatic, such as returning 52 for CPUID[0x80000008].
> A nested KVM on L1 would not be able to use the #PF(RSVD) trick to detect
> MMIO faults.  That's not a big price to pay, however I'm not sure it's a
> good idea in general...

Agreed, messing with CPUID is likely to end in tears.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] KVM: x86/mmu: Do not create SPTEs for GFNs that exceed host.MAXPHYADDR
  2022-04-29 16:01           ` Sean Christopherson
@ 2022-05-01 14:28             ` Maxim Levitsky
  2022-05-01 14:32               ` Maxim Levitsky
  0 siblings, 1 reply; 20+ messages in thread
From: Maxim Levitsky @ 2022-05-01 14:28 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel, kvm,
	linux-kernel, Ben Gardon, David Matlack

On Fri, 2022-04-29 at 16:01 +0000, Sean Christopherson wrote:
> On Fri, Apr 29, 2022, Paolo Bonzini wrote:
> > On 4/29/22 16:42, Sean Christopherson wrote:
> > > On Fri, Apr 29, 2022, Paolo Bonzini wrote:
> > > > On 4/29/22 16:24, Sean Christopherson wrote:
> > > > > I don't love the divergent memslot behavior, but it's technically correct, so I
> > > > > can't really argue.  Do we want to "officially" document the memslot behavior?
> > > > > 
> > > > 
> > > > I don't know what you mean by officially document,
> > > 
> > > Something in kvm/api.rst under KVM_SET_USER_MEMORY_REGION.
> > 
> > Not sure if the API documentation is the best place because userspace does
> > not know whether shadow paging is on (except indirectly through other
> > capabilities, perhaps)?
> 
> Hrm, true, it's not like the userspace VMM can rewrite itself at runtime.
> 
> > It could even be programmatic, such as returning 52 for CPUID[0x80000008].
> > A nested KVM on L1 would not be able to use the #PF(RSVD) trick to detect
> > MMIO faults.  That's not a big price to pay, however I'm not sure it's a
> > good idea in general...
> 
> Agreed, messing with CPUID is likely to end in tears.
> 


Also I can reproduce it all the way to 5.14 kernel (last kernel I have installed in this VM).

I tested kvm/queue as of today, sadly I still see the warning.

[mlevitsk@fedora34 ~]$[   35.205241] ------------[ cut here ]------------
[   35.207156] WARNING: CPU: 6 PID: 3236 at arch/x86/kvm/mmu/tdp_mmu.c:46 kvm_mmu_uninit_tdp_mmu+0x47/0x50 [kvm]
[   35.211468] Modules linked in: uinput snd_seq_dummy snd_hrtimer xt_MASQUERADE xt_conntrack ipt_REJECT ip6table_filter ip6_tables iptable_mangle iptable_nat nf_nat bridge rpcsec_gss_krb5 auth_rpcgss
nfsv4 dns_resolver nfs lockd grace fscache netfs rfkill sunrpc vfat fat snd_hda_codec_generic snd_hda_intel snd_intel_dspcfg snd_hda_codec kvm_amd snd_hwdep ccp snd_hda_core rng_core snd_seq kvm
snd_seq_device snd_pcm joydev irqbypass snd_timer input_leds snd lpc_ich virtio_input mfd_core pcspkr efi_pstore rtc_cmos button ext4 mbcache jbd2 hid_generic usbhid hid virtio_gpu virtio_dma_buf
drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops cec virtio_net net_failover drm virtio_console failover i2c_core virtio_blk crc32_pclmul xhci_pci crc32c_intel xhci_hcd virtio_pci
virtio_pci_modern_dev virtio_ring virtio dm_mirror dm_region_hash dm_log fuse ipv6 autofs4
[   35.248745] CPU: 6 PID: 3236 Comm: CPU 2/KVM Not tainted 5.14.0.stable #90
[   35.251559] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
[   35.255011] RIP: 0010:kvm_mmu_uninit_tdp_mmu+0x47/0x50 [kvm]
[   35.257531] Code: 48 89 e5 48 39 c2 75 21 48 8b 87 b0 91 00 00 48 81 c7 b0 91 00 00 48 39 f8 75 08 e8 b3 7c cd e0 5d c3 c3 90 0f 0b 90 eb f2 90 <0f> 0b 90 eb d9 0f 1f 40 00 0f 1f 44 00 00 55 b8 ff
ff ff ff 48 89
[   35.265355] RSP: 0018:ffffc90001f6fc28 EFLAGS: 00010283
[   35.267659] RAX: ffffc90001f5a1c0 RBX: 0000000000000008 RCX: 0000000000000000
[   35.270823] RDX: ffff888114168958 RSI: ffff888115636ac0 RDI: ffffc90001f51000
[   35.273769] RBP: ffffc90001f6fc28 R08: 0000000000004802 R09: 0000000000000000
[   35.276595] R10: 00000000000001cd R11: 0000000000000018 R12: ffffc90001f51000
[   35.279470] R13: ffffc90001f51998 R14: ffff8881001d3060 R15: dead000000000100
[   35.282314] FS:  0000000000000000(0000) GS:ffff88846ef80000(0000) knlGS:0000000000000000
[   35.285594] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   35.287943] CR2: 0000000000000000 CR3: 0000000002a0b000 CR4: 0000000000350ee0
[   35.290979] Call Trace:
[   35.292082]  kvm_mmu_uninit_vm+0x22/0x30 [kvm]
[   35.293909]  kvm_arch_destroy_vm+0x18f/0x200 [kvm]
[   35.295884]  kvm_destroy_vm+0x164/0x250 [kvm]
[   35.297680]  kvm_put_kvm+0x26/0x40 [kvm]
[   35.299309]  kvm_vm_release+0x22/0x30 [kvm]
[   35.301088]  __fput+0x94/0x240
[   35.302338]  ____fput+0xe/0x10
[   35.303599]  task_work_run+0x63/0xa0
[   35.305083]  do_exit+0x353/0x9d0
[   35.306470]  do_group_exit+0x3b/0xa0
[   35.307882]  get_signal+0x163/0x850
[   35.309403]  arch_do_signal_or_restart+0xf3/0x7c0
[   35.311390]  exit_to_user_mode_prepare+0x112/0x1f0
[   35.313374]  syscall_exit_to_user_mode+0x18/0x40
[   35.315244]  do_syscall_64+0x44/0xb0
[   35.316819]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[   35.318874] RIP: 0033:0x7f51e5f8b0ab
[   35.320395] Code: Unable to access opcode bytes at RIP 0x7f51e5f8b081.
[   35.322985] RSP: 002b:00007f50dbdfd5c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[   35.326015] RAX: fffffffffffffffc RBX: 000055df487dd0a0 RCX: 00007f51e5f8b0ab
[   35.328914] RDX: 0000000000000000 RSI: 000000000000ae80 RDI: 000000000000000e
[   35.332162] RBP: 00007f50dbdfd6c0 R08: 000055df46521e60 R09: 00007ffcd47ed080
[   35.335172] R10: 00007ffcd47ed090 R11: 0000000000000246 R12: 00007ffcd4653f2e
[   35.338302] R13: 00007ffcd4653f2f R14: 0000000000000000 R15: 00007f50dbdff640
[   35.341320] ---[ end trace fa01d10f9909874f ]---


Oh, well, I will now switch to vanilla L0 kernel, just in case, and see where to go from this point.

Best regards,
	Maxim Levitsky



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] KVM: x86/mmu: Do not create SPTEs for GFNs that exceed host.MAXPHYADDR
  2022-05-01 14:28             ` Maxim Levitsky
@ 2022-05-01 14:32               ` Maxim Levitsky
  2022-05-02  7:59                 ` Maxim Levitsky
  0 siblings, 1 reply; 20+ messages in thread
From: Maxim Levitsky @ 2022-05-01 14:32 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel, kvm,
	linux-kernel, Ben Gardon, David Matlack

On Sun, 2022-05-01 at 17:28 +0300, Maxim Levitsky wrote:
> On Fri, 2022-04-29 at 16:01 +0000, Sean Christopherson wrote:
> > On Fri, Apr 29, 2022, Paolo Bonzini wrote:
> > > On 4/29/22 16:42, Sean Christopherson wrote:
> > > > On Fri, Apr 29, 2022, Paolo Bonzini wrote:
> > > > > On 4/29/22 16:24, Sean Christopherson wrote:
> > > > > > I don't love the divergent memslot behavior, but it's technically correct, so I
> > > > > > can't really argue.  Do we want to "officially" document the memslot behavior?
> > > > > > 
> > > > > 
> > > > > I don't know what you mean by officially document,
> > > > 
> > > > Something in kvm/api.rst under KVM_SET_USER_MEMORY_REGION.
> > > 
> > > Not sure if the API documentation is the best place because userspace does
> > > not know whether shadow paging is on (except indirectly through other
> > > capabilities, perhaps)?
> > 
> > Hrm, true, it's not like the userspace VMM can rewrite itself at runtime.
> > 
> > > It could even be programmatic, such as returning 52 for CPUID[0x80000008].
> > > A nested KVM on L1 would not be able to use the #PF(RSVD) trick to detect
> > > MMIO faults.  That's not a big price to pay, however I'm not sure it's a
> > > good idea in general...
> > 
> > Agreed, messing with CPUID is likely to end in tears.
> > 
> 
> Also I can reproduce it all the way to 5.14 kernel (last kernel I have installed in this VM).
> 
> I tested kvm/queue as of today, sadly I still see the warning.

Due to a race, the above statements are out of order ;-)

Best regards,
	Maxim Levitsky


> 
> [mlevitsk@fedora34 ~]$[   35.205241] ------------[ cut here ]------------
> [   35.207156] WARNING: CPU: 6 PID: 3236 at arch/x86/kvm/mmu/tdp_mmu.c:46 kvm_mmu_uninit_tdp_mmu+0x47/0x50 [kvm]
> [   35.211468] Modules linked in: uinput snd_seq_dummy snd_hrtimer xt_MASQUERADE xt_conntrack ipt_REJECT ip6table_filter ip6_tables iptable_mangle iptable_nat nf_nat bridge rpcsec_gss_krb5 auth_rpcgss
> nfsv4 dns_resolver nfs lockd grace fscache netfs rfkill sunrpc vfat fat snd_hda_codec_generic snd_hda_intel snd_intel_dspcfg snd_hda_codec kvm_amd snd_hwdep ccp snd_hda_core rng_core snd_seq kvm
> snd_seq_device snd_pcm joydev irqbypass snd_timer input_leds snd lpc_ich virtio_input mfd_core pcspkr efi_pstore rtc_cmos button ext4 mbcache jbd2 hid_generic usbhid hid virtio_gpu virtio_dma_buf
> drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops cec virtio_net net_failover drm virtio_console failover i2c_core virtio_blk crc32_pclmul xhci_pci crc32c_intel xhci_hcd virtio_pci
> virtio_pci_modern_dev virtio_ring virtio dm_mirror dm_region_hash dm_log fuse ipv6 autofs4
> [   35.248745] CPU: 6 PID: 3236 Comm: CPU 2/KVM Not tainted 5.14.0.stable #90
> [   35.251559] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
> [   35.255011] RIP: 0010:kvm_mmu_uninit_tdp_mmu+0x47/0x50 [kvm]
> [   35.257531] Code: 48 89 e5 48 39 c2 75 21 48 8b 87 b0 91 00 00 48 81 c7 b0 91 00 00 48 39 f8 75 08 e8 b3 7c cd e0 5d c3 c3 90 0f 0b 90 eb f2 90 <0f> 0b 90 eb d9 0f 1f 40 00 0f 1f 44 00 00 55 b8 ff
> ff ff ff 48 89
> [   35.265355] RSP: 0018:ffffc90001f6fc28 EFLAGS: 00010283
> [   35.267659] RAX: ffffc90001f5a1c0 RBX: 0000000000000008 RCX: 0000000000000000
> [   35.270823] RDX: ffff888114168958 RSI: ffff888115636ac0 RDI: ffffc90001f51000
> [   35.273769] RBP: ffffc90001f6fc28 R08: 0000000000004802 R09: 0000000000000000
> [   35.276595] R10: 00000000000001cd R11: 0000000000000018 R12: ffffc90001f51000
> [   35.279470] R13: ffffc90001f51998 R14: ffff8881001d3060 R15: dead000000000100
> [   35.282314] FS:  0000000000000000(0000) GS:ffff88846ef80000(0000) knlGS:0000000000000000
> [   35.285594] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [   35.287943] CR2: 0000000000000000 CR3: 0000000002a0b000 CR4: 0000000000350ee0
> [   35.290979] Call Trace:
> [   35.292082]  kvm_mmu_uninit_vm+0x22/0x30 [kvm]
> [   35.293909]  kvm_arch_destroy_vm+0x18f/0x200 [kvm]
> [   35.295884]  kvm_destroy_vm+0x164/0x250 [kvm]
> [   35.297680]  kvm_put_kvm+0x26/0x40 [kvm]
> [   35.299309]  kvm_vm_release+0x22/0x30 [kvm]
> [   35.301088]  __fput+0x94/0x240
> [   35.302338]  ____fput+0xe/0x10
> [   35.303599]  task_work_run+0x63/0xa0
> [   35.305083]  do_exit+0x353/0x9d0
> [   35.306470]  do_group_exit+0x3b/0xa0
> [   35.307882]  get_signal+0x163/0x850
> [   35.309403]  arch_do_signal_or_restart+0xf3/0x7c0
> [   35.311390]  exit_to_user_mode_prepare+0x112/0x1f0
> [   35.313374]  syscall_exit_to_user_mode+0x18/0x40
> [   35.315244]  do_syscall_64+0x44/0xb0
> [   35.316819]  entry_SYSCALL_64_after_hwframe+0x44/0xae
> [   35.318874] RIP: 0033:0x7f51e5f8b0ab
> [   35.320395] Code: Unable to access opcode bytes at RIP 0x7f51e5f8b081.
> [   35.322985] RSP: 002b:00007f50dbdfd5c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
> [   35.326015] RAX: fffffffffffffffc RBX: 000055df487dd0a0 RCX: 00007f51e5f8b0ab
> [   35.328914] RDX: 0000000000000000 RSI: 000000000000ae80 RDI: 000000000000000e
> [   35.332162] RBP: 00007f50dbdfd6c0 R08: 000055df46521e60 R09: 00007ffcd47ed080
> [   35.335172] R10: 00007ffcd47ed090 R11: 0000000000000246 R12: 00007ffcd4653f2e
> [   35.338302] R13: 00007ffcd4653f2f R14: 0000000000000000 R15: 00007f50dbdff640
> [   35.341320] ---[ end trace fa01d10f9909874f ]---
> 
> 
> Oh, well, I will now switch to vanilla L0 kernel, just in case, and see where to go from this point.
> 
> Best regards,
> 	Maxim Levitsky
> 



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] KVM: x86/mmu: Do not create SPTEs for GFNs that exceed host.MAXPHYADDR
  2022-05-01 14:32               ` Maxim Levitsky
@ 2022-05-02  7:59                 ` Maxim Levitsky
  2022-05-02  8:56                   ` Maxim Levitsky
  0 siblings, 1 reply; 20+ messages in thread
From: Maxim Levitsky @ 2022-05-02  7:59 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel, kvm,
	linux-kernel, Ben Gardon, David Matlack

[-- Attachment #1: Type: text/plain, Size: 6593 bytes --]

On Sun, 2022-05-01 at 17:32 +0300, Maxim Levitsky wrote:
> On Sun, 2022-05-01 at 17:28 +0300, Maxim Levitsky wrote:
> > On Fri, 2022-04-29 at 16:01 +0000, Sean Christopherson wrote:
> > > On Fri, Apr 29, 2022, Paolo Bonzini wrote:
> > > > On 4/29/22 16:42, Sean Christopherson wrote:
> > > > > On Fri, Apr 29, 2022, Paolo Bonzini wrote:
> > > > > > On 4/29/22 16:24, Sean Christopherson wrote:
> > > > > > > I don't love the divergent memslot behavior, but it's technically correct, so I
> > > > > > > can't really argue.  Do we want to "officially" document the memslot behavior?
> > > > > > > 
> > > > > > 
> > > > > > I don't know what you mean by officially document,
> > > > > 
> > > > > Something in kvm/api.rst under KVM_SET_USER_MEMORY_REGION.
> > > > 
> > > > Not sure if the API documentation is the best place because userspace does
> > > > not know whether shadow paging is on (except indirectly through other
> > > > capabilities, perhaps)?
> > > 
> > > Hrm, true, it's not like the userspace VMM can rewrite itself at runtime.
> > > 
> > > > It could even be programmatic, such as returning 52 for CPUID[0x80000008].
> > > > A nested KVM on L1 would not be able to use the #PF(RSVD) trick to detect
> > > > MMIO faults.  That's not a big price to pay, however I'm not sure it's a
> > > > good idea in general...
> > > 
> > > Agreed, messing with CPUID is likely to end in tears.
> > > 
> > 
> > Also I can reproduce it all the way to 5.14 kernel (last kernel I have installed in this VM).
> > 
> > I tested kvm/queue as of today, sadly I still see the warning.
> 
> Due to a race, the above statements are out of order ;-)


So further investigation shows that the trigger for this *is* cpu_pm=on :(

So this is enough to trigger the warning when run in the guest:

qemu-system-x86_64  -nodefaults  -vnc none -serial stdio -machine accel=kvm -kernel x86/dummy.flat -machine kernel-irqchip=on -smp 8 -m 1g -cpu host -overcommit cpu-pm=on


'-smp 8' is needed, and the more vCPUs the more often the warning appears.


Due to non atomic memslot update bug, I use patched qemu version, with an attached hack,
to pause/resume vcpus around the memslot update it does, but even without this hack,
you can just ctrl+c the test after it gets the KVM internal error, and 
then tdp mmu memory leak warning shows up (not always but very often).


Oh, and if I run the above command on the bare metal, it  never terminates. Must be due to preemption,
qemu shows being stuck in kvm_vcpu_block. AVIC disabled, kvm/queue.
Bugs, bugs, and features :)

Best regards,
	Maxim Levitsky


> 
> Best regards,
> 	Maxim Levitsky
> 
> 
> > [mlevitsk@fedora34 ~]$[   35.205241] ------------[ cut here ]------------
> > [   35.207156] WARNING: CPU: 6 PID: 3236 at arch/x86/kvm/mmu/tdp_mmu.c:46 kvm_mmu_uninit_tdp_mmu+0x47/0x50 [kvm]
> > [   35.211468] Modules linked in: uinput snd_seq_dummy snd_hrtimer xt_MASQUERADE xt_conntrack ipt_REJECT ip6table_filter ip6_tables iptable_mangle iptable_nat nf_nat bridge rpcsec_gss_krb5 auth_rpcgss
> > nfsv4 dns_resolver nfs lockd grace fscache netfs rfkill sunrpc vfat fat snd_hda_codec_generic snd_hda_intel snd_intel_dspcfg snd_hda_codec kvm_amd snd_hwdep ccp snd_hda_core rng_core snd_seq kvm
> > snd_seq_device snd_pcm joydev irqbypass snd_timer input_leds snd lpc_ich virtio_input mfd_core pcspkr efi_pstore rtc_cmos button ext4 mbcache jbd2 hid_generic usbhid hid virtio_gpu virtio_dma_buf
> > drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops cec virtio_net net_failover drm virtio_console failover i2c_core virtio_blk crc32_pclmul xhci_pci crc32c_intel xhci_hcd virtio_pci
> > virtio_pci_modern_dev virtio_ring virtio dm_mirror dm_region_hash dm_log fuse ipv6 autofs4
> > [   35.248745] CPU: 6 PID: 3236 Comm: CPU 2/KVM Not tainted 5.14.0.stable #90
> > [   35.251559] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
> > [   35.255011] RIP: 0010:kvm_mmu_uninit_tdp_mmu+0x47/0x50 [kvm]
> > [   35.257531] Code: 48 89 e5 48 39 c2 75 21 48 8b 87 b0 91 00 00 48 81 c7 b0 91 00 00 48 39 f8 75 08 e8 b3 7c cd e0 5d c3 c3 90 0f 0b 90 eb f2 90 <0f> 0b 90 eb d9 0f 1f 40 00 0f 1f 44 00 00 55 b8 ff
> > ff ff ff 48 89
> > [   35.265355] RSP: 0018:ffffc90001f6fc28 EFLAGS: 00010283
> > [   35.267659] RAX: ffffc90001f5a1c0 RBX: 0000000000000008 RCX: 0000000000000000
> > [   35.270823] RDX: ffff888114168958 RSI: ffff888115636ac0 RDI: ffffc90001f51000
> > [   35.273769] RBP: ffffc90001f6fc28 R08: 0000000000004802 R09: 0000000000000000
> > [   35.276595] R10: 00000000000001cd R11: 0000000000000018 R12: ffffc90001f51000
> > [   35.279470] R13: ffffc90001f51998 R14: ffff8881001d3060 R15: dead000000000100
> > [   35.282314] FS:  0000000000000000(0000) GS:ffff88846ef80000(0000) knlGS:0000000000000000
> > [   35.285594] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [   35.287943] CR2: 0000000000000000 CR3: 0000000002a0b000 CR4: 0000000000350ee0
> > [   35.290979] Call Trace:
> > [   35.292082]  kvm_mmu_uninit_vm+0x22/0x30 [kvm]
> > [   35.293909]  kvm_arch_destroy_vm+0x18f/0x200 [kvm]
> > [   35.295884]  kvm_destroy_vm+0x164/0x250 [kvm]
> > [   35.297680]  kvm_put_kvm+0x26/0x40 [kvm]
> > [   35.299309]  kvm_vm_release+0x22/0x30 [kvm]
> > [   35.301088]  __fput+0x94/0x240
> > [   35.302338]  ____fput+0xe/0x10
> > [   35.303599]  task_work_run+0x63/0xa0
> > [   35.305083]  do_exit+0x353/0x9d0
> > [   35.306470]  do_group_exit+0x3b/0xa0
> > [   35.307882]  get_signal+0x163/0x850
> > [   35.309403]  arch_do_signal_or_restart+0xf3/0x7c0
> > [   35.311390]  exit_to_user_mode_prepare+0x112/0x1f0
> > [   35.313374]  syscall_exit_to_user_mode+0x18/0x40
> > [   35.315244]  do_syscall_64+0x44/0xb0
> > [   35.316819]  entry_SYSCALL_64_after_hwframe+0x44/0xae
> > [   35.318874] RIP: 0033:0x7f51e5f8b0ab
> > [   35.320395] Code: Unable to access opcode bytes at RIP 0x7f51e5f8b081.
> > [   35.322985] RSP: 002b:00007f50dbdfd5c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
> > [   35.326015] RAX: fffffffffffffffc RBX: 000055df487dd0a0 RCX: 00007f51e5f8b0ab
> > [   35.328914] RDX: 0000000000000000 RSI: 000000000000ae80 RDI: 000000000000000e
> > [   35.332162] RBP: 00007f50dbdfd6c0 R08: 000055df46521e60 R09: 00007ffcd47ed080
> > [   35.335172] R10: 00007ffcd47ed090 R11: 0000000000000246 R12: 00007ffcd4653f2e
> > [   35.338302] R13: 00007ffcd4653f2f R14: 0000000000000000 R15: 00007f50dbdff640
> > [   35.341320] ---[ end trace fa01d10f9909874f ]---
> > 
> > 
> > Oh, well, I will now switch to vanilla L0 kernel, just in case, and see where to go from this point.
> > 
> > Best regards,
> > 	Maxim Levitsky
> > 


[-- Attachment #2: 0001-Pause-all-VCPUs-on-host-bridge-memregion-changes.patch --]
[-- Type: text/x-patch, Size: 2114 bytes --]

From 5abf99c76e76fa2f5c72c51db1443e7032423998 Mon Sep 17 00:00:00 2001
From: Maxim Levitsky <mlevitsk@redhat.com>
Date: Thu, 20 Jan 2022 18:00:07 +0200
Subject: [PATCH] Pause all VCPUs on host bridge memregion changes

This is temp hack to avoid cpu_pm=on failures
---
 hw/pci-host/i440fx.c | 5 +++++
 hw/pci-host/q35.c    | 5 +++++
 2 files changed, 10 insertions(+)

diff --git a/hw/pci-host/i440fx.c b/hw/pci-host/i440fx.c
index e08716142b..03384b410e 100644
--- a/hw/pci-host/i440fx.c
+++ b/hw/pci-host/i440fx.c
@@ -36,6 +36,7 @@
 #include "qapi/visitor.h"
 #include "qemu/error-report.h"
 #include "qom/object.h"
+#include "sysemu/cpus.h"
 
 /*
  * I440FX chipset data sheet.
@@ -69,6 +70,8 @@ static void i440fx_update_memory_mappings(PCII440FXState *d)
     int i;
     PCIDevice *pd = PCI_DEVICE(d);
 
+    pause_all_vcpus();
+
     memory_region_transaction_begin();
     for (i = 0; i < ARRAY_SIZE(d->pam_regions); i++) {
         pam_update(&d->pam_regions[i], i,
@@ -79,6 +82,8 @@ static void i440fx_update_memory_mappings(PCII440FXState *d)
     memory_region_set_enabled(&d->smram,
                               pd->config[I440FX_SMRAM] & SMRAM_G_SMRAME);
     memory_region_transaction_commit();
+
+    resume_all_vcpus();
 }
 
 
diff --git a/hw/pci-host/q35.c b/hw/pci-host/q35.c
index ab5a47aff5..02831f5248 100644
--- a/hw/pci-host/q35.c
+++ b/hw/pci-host/q35.c
@@ -37,6 +37,7 @@
 #include "qapi/error.h"
 #include "qapi/visitor.h"
 #include "qemu/module.h"
+#include "sysemu/cpus.h"
 
 /****************************************************************************
  * Q35 host
@@ -471,6 +472,8 @@ static void mch_write_config(PCIDevice *d,
 {
     MCHPCIState *mch = MCH_PCI_DEVICE(d);
 
+    pause_all_vcpus();
+
     pci_default_write_config(d, address, val, len);
 
     if (ranges_overlap(address, len, MCH_HOST_BRIDGE_PAM0,
@@ -496,6 +499,8 @@ static void mch_write_config(PCIDevice *d,
     if (ranges_overlap(address, len, MCH_HOST_BRIDGE_F_SMBASE, 1)) {
         mch_update_smbase_smram(mch);
     }
+
+    resume_all_vcpus();
 }
 
 static void mch_update(MCHPCIState *mch)
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [PATCH] KVM: x86/mmu: Do not create SPTEs for GFNs that exceed host.MAXPHYADDR
  2022-05-02  7:59                 ` Maxim Levitsky
@ 2022-05-02  8:56                   ` Maxim Levitsky
  2022-05-02 16:51                     ` Sean Christopherson
  0 siblings, 1 reply; 20+ messages in thread
From: Maxim Levitsky @ 2022-05-02  8:56 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel, kvm,
	linux-kernel, Ben Gardon, David Matlack

On Mon, 2022-05-02 at 10:59 +0300, Maxim Levitsky wrote:
> On Sun, 2022-05-01 at 17:32 +0300, Maxim Levitsky wrote:
> > On Sun, 2022-05-01 at 17:28 +0300, Maxim Levitsky wrote:
> > > On Fri, 2022-04-29 at 16:01 +0000, Sean Christopherson wrote:
> > > > On Fri, Apr 29, 2022, Paolo Bonzini wrote:
> > > > > On 4/29/22 16:42, Sean Christopherson wrote:
> > > > > > On Fri, Apr 29, 2022, Paolo Bonzini wrote:
> > > > > > > On 4/29/22 16:24, Sean Christopherson wrote:
> > > > > > > > I don't love the divergent memslot behavior, but it's technically correct, so I
> > > > > > > > can't really argue.  Do we want to "officially" document the memslot behavior?
> > > > > > > > 
> > > > > > > 
> > > > > > > I don't know what you mean by officially document,
> > > > > > 
> > > > > > Something in kvm/api.rst under KVM_SET_USER_MEMORY_REGION.
> > > > > 
> > > > > Not sure if the API documentation is the best place because userspace does
> > > > > not know whether shadow paging is on (except indirectly through other
> > > > > capabilities, perhaps)?
> > > > 
> > > > Hrm, true, it's not like the userspace VMM can rewrite itself at runtime.
> > > > 
> > > > > It could even be programmatic, such as returning 52 for CPUID[0x80000008].
> > > > > A nested KVM on L1 would not be able to use the #PF(RSVD) trick to detect
> > > > > MMIO faults.  That's not a big price to pay, however I'm not sure it's a
> > > > > good idea in general...
> > > > 
> > > > Agreed, messing with CPUID is likely to end in tears.
> > > > 
> > > 
> > > Also I can reproduce it all the way to 5.14 kernel (last kernel I have installed in this VM).
> > > 
> > > I tested kvm/queue as of today, sadly I still see the warning.
> > 
> > Due to a race, the above statements are out of order ;-)
> 
> So further investigation shows that the trigger for this *is* cpu_pm=on :(
> 
> So this is enough to trigger the warning when run in the guest:
> 
> qemu-system-x86_64  -nodefaults  -vnc none -serial stdio -machine accel=kvm -kernel x86/dummy.flat -machine kernel-irqchip=on -smp 8 -m 1g -cpu host -overcommit cpu-pm=on
> 
> 
> '-smp 8' is needed, and the more vCPUs the more often the warning appears.
> 
> 
> Due to non atomic memslot update bug, I use patched qemu version, with an attached hack,
> to pause/resume vcpus around the memslot update it does, but even without this hack,
> you can just ctrl+c the test after it gets the KVM internal error, and 
> then tdp mmu memory leak warning shows up (not always but very often).
> 
> 
> Oh, and if I run the above command on the bare metal, it  never terminates. Must be due to preemption,
> qemu shows being stuck in kvm_vcpu_block. AVIC disabled, kvm/queue.
> Bugs, bugs, and features :)

All right, at least that was because I removed the '-device isa-debug-exit,iobase=0xf4,iosize=0x4',
which is apparently used by KVM unit tests to signal exit from the VM.

Best regards,
	Maxim Levitsky

> 
> Best regards,
> 	Maxim Levitsky
> 
> 
> > Best regards,
> > 	Maxim Levitsky
> > 
> > 
> > > [mlevitsk@fedora34 ~]$[   35.205241] ------------[ cut here ]------------
> > > [   35.207156] WARNING: CPU: 6 PID: 3236 at arch/x86/kvm/mmu/tdp_mmu.c:46 kvm_mmu_uninit_tdp_mmu+0x47/0x50 [kvm]
> > > [   35.211468] Modules linked in: uinput snd_seq_dummy snd_hrtimer xt_MASQUERADE xt_conntrack ipt_REJECT ip6table_filter ip6_tables iptable_mangle iptable_nat nf_nat bridge rpcsec_gss_krb5 auth_rpcgss
> > > nfsv4 dns_resolver nfs lockd grace fscache netfs rfkill sunrpc vfat fat snd_hda_codec_generic snd_hda_intel snd_intel_dspcfg snd_hda_codec kvm_amd snd_hwdep ccp snd_hda_core rng_core snd_seq kvm
> > > snd_seq_device snd_pcm joydev irqbypass snd_timer input_leds snd lpc_ich virtio_input mfd_core pcspkr efi_pstore rtc_cmos button ext4 mbcache jbd2 hid_generic usbhid hid virtio_gpu virtio_dma_buf
> > > drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops cec virtio_net net_failover drm virtio_console failover i2c_core virtio_blk crc32_pclmul xhci_pci crc32c_intel xhci_hcd virtio_pci
> > > virtio_pci_modern_dev virtio_ring virtio dm_mirror dm_region_hash dm_log fuse ipv6 autofs4
> > > [   35.248745] CPU: 6 PID: 3236 Comm: CPU 2/KVM Not tainted 5.14.0.stable #90
> > > [   35.251559] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
> > > [   35.255011] RIP: 0010:kvm_mmu_uninit_tdp_mmu+0x47/0x50 [kvm]
> > > [   35.257531] Code: 48 89 e5 48 39 c2 75 21 48 8b 87 b0 91 00 00 48 81 c7 b0 91 00 00 48 39 f8 75 08 e8 b3 7c cd e0 5d c3 c3 90 0f 0b 90 eb f2 90 <0f> 0b 90 eb d9 0f 1f 40 00 0f 1f 44 00 00 55 b8 ff
> > > ff ff ff 48 89
> > > [   35.265355] RSP: 0018:ffffc90001f6fc28 EFLAGS: 00010283
> > > [   35.267659] RAX: ffffc90001f5a1c0 RBX: 0000000000000008 RCX: 0000000000000000
> > > [   35.270823] RDX: ffff888114168958 RSI: ffff888115636ac0 RDI: ffffc90001f51000
> > > [   35.273769] RBP: ffffc90001f6fc28 R08: 0000000000004802 R09: 0000000000000000
> > > [   35.276595] R10: 00000000000001cd R11: 0000000000000018 R12: ffffc90001f51000
> > > [   35.279470] R13: ffffc90001f51998 R14: ffff8881001d3060 R15: dead000000000100
> > > [   35.282314] FS:  0000000000000000(0000) GS:ffff88846ef80000(0000) knlGS:0000000000000000
> > > [   35.285594] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > [   35.287943] CR2: 0000000000000000 CR3: 0000000002a0b000 CR4: 0000000000350ee0
> > > [   35.290979] Call Trace:
> > > [   35.292082]  kvm_mmu_uninit_vm+0x22/0x30 [kvm]
> > > [   35.293909]  kvm_arch_destroy_vm+0x18f/0x200 [kvm]
> > > [   35.295884]  kvm_destroy_vm+0x164/0x250 [kvm]
> > > [   35.297680]  kvm_put_kvm+0x26/0x40 [kvm]
> > > [   35.299309]  kvm_vm_release+0x22/0x30 [kvm]
> > > [   35.301088]  __fput+0x94/0x240
> > > [   35.302338]  ____fput+0xe/0x10
> > > [   35.303599]  task_work_run+0x63/0xa0
> > > [   35.305083]  do_exit+0x353/0x9d0
> > > [   35.306470]  do_group_exit+0x3b/0xa0
> > > [   35.307882]  get_signal+0x163/0x850
> > > [   35.309403]  arch_do_signal_or_restart+0xf3/0x7c0
> > > [   35.311390]  exit_to_user_mode_prepare+0x112/0x1f0
> > > [   35.313374]  syscall_exit_to_user_mode+0x18/0x40
> > > [   35.315244]  do_syscall_64+0x44/0xb0
> > > [   35.316819]  entry_SYSCALL_64_after_hwframe+0x44/0xae
> > > [   35.318874] RIP: 0033:0x7f51e5f8b0ab
> > > [   35.320395] Code: Unable to access opcode bytes at RIP 0x7f51e5f8b081.
> > > [   35.322985] RSP: 002b:00007f50dbdfd5c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
> > > [   35.326015] RAX: fffffffffffffffc RBX: 000055df487dd0a0 RCX: 00007f51e5f8b0ab
> > > [   35.328914] RDX: 0000000000000000 RSI: 000000000000ae80 RDI: 000000000000000e
> > > [   35.332162] RBP: 00007f50dbdfd6c0 R08: 000055df46521e60 R09: 00007ffcd47ed080
> > > [   35.335172] R10: 00007ffcd47ed090 R11: 0000000000000246 R12: 00007ffcd4653f2e
> > > [   35.338302] R13: 00007ffcd4653f2f R14: 0000000000000000 R15: 00007f50dbdff640
> > > [   35.341320] ---[ end trace fa01d10f9909874f ]---
> > > 
> > > 
> > > Oh, well, I will now switch to vanilla L0 kernel, just in case, and see where to go from this point.
> > > 
> > > Best regards,
> > > 	Maxim Levitsky
> > > 



^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [PATCH] KVM: x86/mmu: Do not create SPTEs for GFNs that exceed host.MAXPHYADDR
  2022-04-28 23:34 [PATCH] KVM: x86/mmu: Do not create SPTEs for GFNs that exceed host.MAXPHYADDR Sean Christopherson
  2022-04-29 10:36 ` Paolo Bonzini
@ 2022-05-02 11:12 ` Kai Huang
  2022-05-02 11:52   ` Paolo Bonzini
  1 sibling, 1 reply; 20+ messages in thread
From: Kai Huang @ 2022-05-02 11:12 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel, kvm,
	linux-kernel, Maxim Levitsky, Ben Gardon, David Matlack

On Thu, 2022-04-28 at 23:34 +0000, Sean Christopherson wrote:
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 904f0faff218..c10ae25135e3 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -3010,9 +3010,15 @@ static bool handle_abnormal_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fa
>  		/*
>  		 * If MMIO caching is disabled, emulate immediately without
>  		 * touching the shadow page tables as attempting to install an
> -		 * MMIO SPTE will just be an expensive nop.
> +		 * MMIO SPTE will just be an expensive nop.  Do not cache MMIO
> +		 * whose gfn is greater than host.MAXPHYADDR, any guest that
> +		 * generates such gfns is either malicious or in the weeds.
> +		 * Note, it's possible to observe a gfn > host.MAXPHYADDR if
> +		 * and only if host.MAXPHYADDR is inaccurate with respect to
> +		 * hardware behavior, e.g. if KVM itself is running as a VM.
>  		 */
> -		if (unlikely(!enable_mmio_caching)) {
> +		if (unlikely(!enable_mmio_caching) ||
> +		    unlikely(fault->gfn > kvm_mmu_max_gfn_host())) {

Shouldn't we check fault->gfn against cpuid_maxphyaddr(vcpu) instead of
kvm_mmu_max_gfn_host() here?

>  			*ret_val = RET_PF_EMULATE;
>  			return true;
>  		}


-- 
Thanks,
-Kai




* Re: [PATCH] KVM: x86/mmu: Do not create SPTEs for GFNs that exceed host.MAXPHYADDR
  2022-05-02 11:12 ` Kai Huang
@ 2022-05-02 11:52   ` Paolo Bonzini
  0 siblings, 0 replies; 20+ messages in thread
From: Paolo Bonzini @ 2022-05-02 11:52 UTC (permalink / raw)
  To: Kai Huang, Sean Christopherson
  Cc: Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel, kvm,
	linux-kernel, Maxim Levitsky, Ben Gardon, David Matlack

On 5/2/22 13:12, Kai Huang wrote:
>> -		if (unlikely(!enable_mmio_caching)) {
>> +		if (unlikely(!enable_mmio_caching) ||
>> +		    unlikely(fault->gfn > kvm_mmu_max_gfn_host())) {
> Shouldn't we check fault->gfn against cpuid_maxphyaddr(vcpu) instead of
> kvm_mmu_max_gfn_host() here?

No, the point of this check is to handle the case where 
kvm_mmu_max_gfn_host() is smaller than cpuid_maxphyaddr().
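
For illustration, with hypothetical numbers (not from the patch): say the guest
CPUID reports MAXPHYADDR = 52 while the host only supports MAXPHYADDR = 46.  A
gfn for gpa 2^48 then passes a cpuid_maxphyaddr() check but is still impossible
for the host to map, which is exactly the case the new check has to catch:

	/* Hypothetical values, purely for illustration. */
	gfn_t guest_max_gfn = (1ULL << (52 - PAGE_SHIFT)) - 1;	/* cpuid_maxphyaddr() */
	gfn_t host_max_gfn  = (1ULL << (46 - PAGE_SHIFT)) - 1;	/* kvm_mmu_max_gfn_host() */
	gfn_t gfn           = 1ULL << (48 - PAGE_SHIFT);	/* gpa 2^48 from the guest */

	/*
	 * gfn <= guest_max_gfn, so a cpuid_maxphyaddr() check would happily
	 * install an MMIO SPTE, but gfn > host_max_gfn, so the TDP MMU would
	 * never zap that SPTE and its page tables would be leaked.
	 */
	if (gfn > host_max_gfn)
		*ret_val = RET_PF_EMULATE;	/* emulate, don't cache an MMIO SPTE */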

Paolo



* Re: [PATCH] KVM: x86/mmu: Do not create SPTEs for GFNs that exceed host.MAXPHYADDR
  2022-05-02  8:56                   ` Maxim Levitsky
@ 2022-05-02 16:51                     ` Sean Christopherson
  2022-05-03  9:12                       ` Maxim Levitsky
  0 siblings, 1 reply; 20+ messages in thread
From: Sean Christopherson @ 2022-05-02 16:51 UTC (permalink / raw)
  To: Maxim Levitsky
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel, Ben Gardon, David Matlack

On Mon, May 02, 2022, Maxim Levitsky wrote:
> On Mon, 2022-05-02 at 10:59 +0300, Maxim Levitsky wrote:
> > > > Also I can reproduce it all the way to 5.14 kernel (last kernel I have installed in this VM).
> > > > 
> > > > I tested kvm/queue as of today, sadly I still see the warning.
> > > 
> > > Due to a race, the above statements are out of order ;-)
> > 
> > So further investigation shows that the trigger for this *is* cpu_pm=on :(
> > 
> > So this is enough to trigger the warning when run in the guest:
> > 
> > qemu-system-x86_64  -nodefaults  -vnc none -serial stdio -machine accel=kvm
> > -kernel x86/dummy.flat -machine kernel-irqchip=on -smp 8 -m 1g -cpu host
> > -overcommit cpu-pm=on
> > 
> > 
> > '-smp 8' is needed, and the more vCPUs the more often the warning appears.
> > 
> > 
> > Due to a non-atomic memslot update bug, I use a patched qemu version, with an
> > attached hack, to pause/resume vcpus around the memslot update it does, but
> > even without this hack you can just ctrl+c the test after it gets the KVM
> > internal error, and then the tdp mmu memory leak warning shows up (not always,
> > but very often).
> > 
> > 
> > Oh, and if I run the above command on the bare metal, it never terminates.
> > Must be due to preemption; qemu appears to be stuck in kvm_vcpu_block. AVIC
> > disabled, kvm/queue.  Bugs, bugs, and features :)
> 
> All right, at least that was because I removed the '-device isa-debug-exit,iobase=0xf4,iosize=0x4',
> which is apparently used by KVM unit tests to signal exit from the VM.

Can you provide your QEMU command line for running your L1 VM?  And your L0 and L1
Kconfigs too?  I've tried both the dummy and ipi_stress tests on a variety of hardware,
kernels, QEMUs, etc..., with no luck.


* Re: [PATCH] KVM: x86/mmu: Do not create SPTEs for GFNs that exceed host.MAXPHYADDR
  2022-05-02 16:51                     ` Sean Christopherson
@ 2022-05-03  9:12                       ` Maxim Levitsky
  2022-05-03 15:12                         ` Maxim Levitsky
  0 siblings, 1 reply; 20+ messages in thread
From: Maxim Levitsky @ 2022-05-03  9:12 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel, Ben Gardon, David Matlack

[-- Attachment #1: Type: text/plain, Size: 2373 bytes --]

On Mon, 2022-05-02 at 16:51 +0000, Sean Christopherson wrote:
> On Mon, May 02, 2022, Maxim Levitsky wrote:
> > On Mon, 2022-05-02 at 10:59 +0300, Maxim Levitsky wrote:
> > > > > Also I can reproduce it all the way to 5.14 kernel (last kernel I have installed in this VM).
> > > > > 
> > > > > I tested kvm/queue as of today, sadly I still see the warning.
> > > > 
> > > > Due to a race, the above statements are out of order ;-)
> > > 
> > > So further investigation shows that the trigger for this *is* cpu_pm=on :(
> > > 
> > > So this is enough to trigger the warning when run in the guest:
> > > 
> > > qemu-system-x86_64  -nodefaults  -vnc none -serial stdio -machine accel=kvm
> > > -kernel x86/dummy.flat -machine kernel-irqchip=on -smp 8 -m 1g -cpu host
> > > -overcommit cpu-pm=on
> > > 
> > > 
> > > '-smp 8' is needed, and the more vCPUs the more often the warning appears.
> > > 
> > > 
> > > Due to a non-atomic memslot update bug, I use a patched qemu version, with an
> > > attached hack, to pause/resume vcpus around the memslot update it does, but
> > > even without this hack you can just ctrl+c the test after it gets the KVM
> > > internal error, and then the tdp mmu memory leak warning shows up (not always,
> > > but very often).
> > > 
> > > 
> > > Oh, and if I run the above command on the bare metal, it never terminates.
> > > Must be due to preemption; qemu appears to be stuck in kvm_vcpu_block. AVIC
> > > disabled, kvm/queue.  Bugs, bugs, and features :)
> > 
> > All right, at least that was because I removed the '-device isa-debug-exit,iobase=0xf4,iosize=0x4',
> > which is apparently used by KVM unit tests to signal exit from the VM.
> 
> Can you provide your QEMU command line for running your L1 VM?  And your L0 and L1
> Kconfigs too?  I've tried both the dummy and ipi_stress tests on a variety of hardware,
> kernels, QEMUs, etc..., with no luck.
> 

So now both L0 and L1 run almost pure kvm/queue
(commit 2764011106d0436cb44702cfb0981339d68c3509)

I have some local patches but they are not relevant to KVM at all, more
like various tweaks to sensors, a sad hack for yet another regression
in AMDGPU, etc.

The config and qemu command line attached.

AVIC disabled in L0, L0 qemu is from master upstream.
Bug reproduces too well IMHO, almost always.

For reference the warning is printed in L1's dmesg.



Best regards,
	Maxim Levitsky

[-- Attachment #2: Type: application/x-config, Size: 147589 bytes --]

[-- Attachment #3: Type: application/x-config, Size: 157078 bytes --]

[-- Attachment #4: cmdline.log --]
[-- Type: text/x-log, Size: 3800 bytes --]

/home/mlevitsk/.build/home/mlevitsk/Qemu/master/unstable/output/bin/qemu-system-x86_64
-smp 8
-name debug-threads=on
-pidfile /run/vmspawn/fedora30_m72p75h8//qemu.pid
-accel kvm
-nodefaults
-display none
-name guest=/home/mlevitsk/.build/home/mlevitsk/Qemu/master@unstable@Fedora34,debug-threads=on
-uuid c4aa9ea6-942a-11ea-ae1e-8c16456df72f
-qmp tcp:0:3001,server,nowait
-machine kernel-irqchip=on
-machine q35,sata=off,usb=off,vmport=on,smbus=off,smm=off
-rtc base=utc,clock=host
-no-hpet
-device pcie-root-port,slot=0,id=rport.0,bus=pcie.0,addr=0x1c.0x0,multifunction=on
-device pcie-root-port,slot=1,id=rport.1,bus=pcie.0,addr=0x1c.0x1
-device pcie-root-port,slot=2,id=rport.2,bus=pcie.0,addr=0x1c.0x2
-device pcie-root-port,slot=3,id=rport.3,bus=pcie.0,addr=0x1c.0x3
-device pcie-root-port,slot=4,id=rport.4,bus=pcie.0,addr=0x1c.0x4
-device pcie-root-port,slot=5,id=rport.5,bus=pcie.0,addr=0x1c.0x5
-device pcie-root-port,slot=6,id=rport.6,bus=pcie.0,addr=0x1c.0x6
-device pcie-root-port,slot=7,id=rport.7,bus=pcie.0,addr=0x1c.0x7
-machine smm=on
-blockdev node-name=flash0,driver=file,filename=/home/mlevitsk/FIRMWARE/ovmf/build-unstable/smm3264/OVMF_CODE.fd,read-only=on
-blockdev node-name=flash1,driver=file,filename=.fw/.ovmf_vars.fd
-machine pflash0=flash0,pflash1=flash1
-global driver=cfi.pflash01,property=secure,value=on
-boot menu=on,strict=on,splash-time=1000
-global kvm-pit.lost_tick_policy=discard
-smp maxcpus=64,sockets=1,cores=32,threads=2
-cpu host,host-cache-info,invtsc,topoext,x2apic=off,-hv-vapic,kvm-pv-ipi=off,+hv-avic
-overcommit mem-lock=on,cpu-pm=off
-m 16G,maxmem=64G
-object iothread,id=iothread0
-drive if=none,id=os_image,file=./disk_s1.qcow2,aio=native,discard=unmap,cache=none
-device virtio-blk,id=scsi-ctrl,bus=rport.0,drive=os_image,iothread=iothread0,bootindex=1,num-queues=8
-device virtio-vga,max_outputs=1,id=gpu1,bus=rport.1
-display gtk,window-close=off
-netdev tap,id=tap0,vhost=off,ifname=tap0_Fedora34,script=no,downscript=no
-device virtio-net-pci,id=net0,mac=02:00:00:21:CB:01,netdev=tap0,bus=rport.2,disable-legacy=on
-audiodev pa,id=pulseaudio0,server=/run/user/103992/pulse/native,timer-period=2000,out.mixing-engine=off,out.fixed-settings=off,out.buffer-length=50000
-device ich9-intel-hda,id=sound0,msi=on,bus=pcie.0,addr=0x1f.0x4
-device hda-micro,id=sound0-codec0,bus=sound0.0,cad=0,audiodev=pulseaudio0
-device qemu-xhci,id=usb0,bus=pcie.0,addr=0x1e.0x0,p3=16,p2=16
-device usb-tablet,id=auto_id22
-device virtio-keyboard-pci,disable-legacy=on,bus=rport.3,addr=00.0,id=auto_id23
-device virtio-serial-pci,id=virtio-serial0,bus=rport.4,disable-legacy=on
-chardev socket,id=chr_qga,path=/run/vmspawn/fedora30_m72p75h8//guest_agent.socket,server=on,wait=no
-device virtserialport,bus=virtio-serial0.0,nr=1,chardev=chr_qga,name=org.qemu.guest_agent.0,id=auto_id24
-chardev qemu-vdagent,id=ch1,name=vdagent,clipboard=on,mouse=off
-device virtserialport,bus=virtio-serial0.0,nr=4,chardev=ch1,name=com.redhat.spice.0,id=auto_id25
-chardev socket,path=/run/vmspawn/fedora30_m72p75h8//hmp_monitor.socket,id=internal_hmp_monitor_socket_chardev,server=on,wait=off
-mon chardev=internal_hmp_monitor_socket_chardev,mode=readline
-chardev socket,path=/run/vmspawn/fedora30_m72p75h8//qmp_monitor.socket,id=internal_qmp_monitor_socket_chardev,server=on,wait=off
-mon chardev=internal_qmp_monitor_socket_chardev,mode=control
-chardev socket,path=/run/vmspawn/fedora30_m72p75h8//serial.socket,id=internal_serial0_chardev,server=on,logfile=/home/mlevitsk/.shared/VM/fedora34/.logs/serial.log,wait=off
-device isa-serial,chardev=internal_serial0_chardev,index=0,id=auto_id28
-chardev file,path=/home/mlevitsk/.shared/VM/fedora34/.logs/firmware.log,id=internal_debugcon0_chardev
-device isa-debugcon,chardev=internal_debugcon0_chardev,iobase=1026,id=auto_id29


* Re: [PATCH] KVM: x86/mmu: Do not create SPTEs for GFNs that exceed host.MAXPHYADDR
  2022-05-03  9:12                       ` Maxim Levitsky
@ 2022-05-03 15:12                         ` Maxim Levitsky
  2022-05-03 20:30                           ` Sean Christopherson
  0 siblings, 1 reply; 20+ messages in thread
From: Maxim Levitsky @ 2022-05-03 15:12 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel, Ben Gardon, David Matlack

On Tue, 2022-05-03 at 12:12 +0300, Maxim Levitsky wrote:
> On Mon, 2022-05-02 at 16:51 +0000, Sean Christopherson wrote:
> > On Mon, May 02, 2022, Maxim Levitsky wrote:
> > > On Mon, 2022-05-02 at 10:59 +0300, Maxim Levitsky wrote:
> > > > > > Also I can reproduce it all the way to 5.14 kernel (last kernel I have installed in this VM).
> > > > > > 
> > > > > > I tested kvm/queue as of today, sadly I still see the warning.
> > > > > 
> > > > > Due to a race, the above statements are out of order ;-)
> > > > 
> > > > So further investigation shows that the trigger for this *is* cpu_pm=on :(
> > > > 
> > > > So this is enough to trigger the warning when run in the guest:
> > > > 
> > > > qemu-system-x86_64  -nodefaults  -vnc none -serial stdio -machine accel=kvm
> > > > -kernel x86/dummy.flat -machine kernel-irqchip=on -smp 8 -m 1g -cpu host
> > > > -overcommit cpu-pm=on
> > > > 
> > > > 
> > > > '-smp 8' is needed, and the more vCPUs the more often the warning appears.
> > > > 
> > > > 
> > > > Due to a non-atomic memslot update bug, I use a patched qemu version, with an
> > > > attached hack, to pause/resume vcpus around the memslot update it does, but
> > > > even without this hack you can just ctrl+c the test after it gets the KVM
> > > > internal error, and then the tdp mmu memory leak warning shows up (not always,
> > > > but very often).
> > > > 
> > > > 
> > > > Oh, and if I run the above command on the bare metal, it never terminates.
> > > > Must be due to preemption; qemu appears to be stuck in kvm_vcpu_block. AVIC
> > > > disabled, kvm/queue.  Bugs, bugs, and features :)
> > > 
> > > All right, at least that was because I removed the '-device isa-debug-exit,iobase=0xf4,iosize=0x4',
> > > which is apparently used by KVM unit tests to signal exit from the VM.
> > 
> > Can you provide your QEMU command line for running your L1 VM?  And your L0 and L1
> > Kconfigs too?  I've tried both the dummy and ipi_stress tests on a variety of hardware,
> > kernels, QEMUs, etc..., with no luck.
> > 
> 
> So now both L0 and L1 run almost pure kvm/queue
> (commit 2764011106d0436cb44702cfb0981339d68c3509)
> 
> I have some local patches but they are not relevant to KVM at all, more
> like various tweaks to sensors, a sad hack for yet another regression
> in AMDGPU, etc.
> 
> The config and qemu command line attached.
> 
> AVIC disabled in L0, L0 qemu is from master upstream.
> Bug reproduces too well IMHO, almost always.
> 
> For reference the warning is printed in L1's dmesg.

Tested this without any preemption in L0 and L1 - bug still reproduces just fine.
(kvm/queue)

Best regards,
	Maxim Levitsky

> 
> 
> 
> Best regards,
> 	Maxim Levitsky




* Re: [PATCH] KVM: x86/mmu: Do not create SPTEs for GFNs that exceed host.MAXPHYADDR
  2022-05-03 15:12                         ` Maxim Levitsky
@ 2022-05-03 20:30                           ` Sean Christopherson
  2022-05-04 12:08                             ` Maxim Levitsky
  0 siblings, 1 reply; 20+ messages in thread
From: Sean Christopherson @ 2022-05-03 20:30 UTC (permalink / raw)
  To: Maxim Levitsky
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel, Ben Gardon, David Matlack

On Tue, May 03, 2022, Maxim Levitsky wrote:
> On Tue, 2022-05-03 at 12:12 +0300, Maxim Levitsky wrote:
> > On Mon, 2022-05-02 at 16:51 +0000, Sean Christopherson wrote:
> > > On Mon, May 02, 2022, Maxim Levitsky wrote:
> > > > On Mon, 2022-05-02 at 10:59 +0300, Maxim Levitsky wrote:
> > > > > > > Also I can reproduce it all the way to 5.14 kernel (last kernel I have installed in this VM).
> > > > > > > 
> > > > > > > I tested kvm/queue as of today, sadly I still see the warning.
> > > > > > 
> > > > > > Due to a race, the above statements are out of order ;-)
> > > > > 
> > > > > So further investigation shows that the trigger for this *is* cpu_pm=on :(
> > > > > 
> > > > > So this is enough to trigger the warning when run in the guest:
> > > > > 
> > > > > qemu-system-x86_64  -nodefaults  -vnc none -serial stdio -machine accel=kvm
> > > > > -kernel x86/dummy.flat -machine kernel-irqchip=on -smp 8 -m 1g -cpu host
> > > > > -overcommit cpu-pm=on

...

> > > > All right, at least that was because I removed the '-device isa-debug-exit,iobase=0xf4,iosize=0x4',
> > > > which is apparently used by KVM unit tests to signal exit from the VM.
> > > 
> > > Can you provide your QEMU command line for running your L1 VM?  And your L0 and L1
> > > Kconfigs too?  I've tried both the dummy and ipi_stress tests on a variety of hardware,
> > > kernels, QEMUs, etc..., with no luck.
> > 
> > So now both L0 and L1 run almost pure kvm/queue
> > (commit 2764011106d0436cb44702cfb0981339d68c3509)
> > 
> > I have some local patches but they are not relevant to KVM at all, more
> > like various tweaks to sensors, a sad hack for yet another regression
> > in AMDGPU, etc.
> > 
> > The config and qemu command line attached.
> > 
> > AVIC disabled in L0, L0 qemu is from master upstream.
> > Bug reproduces too well IMHO, almost always.
> > 
> > For reference the warning is printed in L1's dmesg.
> 
> Tested this without any preemption in L0 and L1 - bug still reproduces just fine.
> (kvm/queue)

Well, I officially give up; I'm out of ideas to try and repro this on my end.  To
try and narrow the search, maybe try processing "all" possible gfns and see if that
makes the leak go away?

diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 7e258cc94152..a354490939ec 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -84,9 +84,7 @@ static inline gfn_t kvm_mmu_max_gfn(void)
         * than hardware's real MAXPHYADDR.  Using the host MAXPHYADDR
         * disallows such SPTEs entirely and simplifies the TDP MMU.
         */
-       int max_gpa_bits = likely(tdp_enabled) ? shadow_phys_bits : 52;
-
-       return (1ULL << (max_gpa_bits - PAGE_SHIFT)) - 1;
+       return (1ULL << (52 - PAGE_SHIFT)) - 1;
 }

 static inline u8 kvm_get_shadow_phys_bits(void)



* Re: [PATCH] KVM: x86/mmu: Do not create SPTEs for GFNs that exceed host.MAXPHYADDR
  2022-05-03 20:30                           ` Sean Christopherson
@ 2022-05-04 12:08                             ` Maxim Levitsky
  2022-05-04 14:47                               ` Sean Christopherson
  2022-05-04 19:11                               ` Paolo Bonzini
  0 siblings, 2 replies; 20+ messages in thread
From: Maxim Levitsky @ 2022-05-04 12:08 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel, Ben Gardon, David Matlack

On Tue, 2022-05-03 at 20:30 +0000, Sean Christopherson wrote:
> On Tue, May 03, 2022, Maxim Levitsky wrote:
> > On Tue, 2022-05-03 at 12:12 +0300, Maxim Levitsky wrote:
> > > On Mon, 2022-05-02 at 16:51 +0000, Sean Christopherson wrote:
> > > > On Mon, May 02, 2022, Maxim Levitsky wrote:
> > > > > On Mon, 2022-05-02 at 10:59 +0300, Maxim Levitsky wrote:
> > > > > > > > Also I can reproduce it all the way to 5.14 kernel (last kernel I have installed in this VM).
> > > > > > > > 
> > > > > > > > I tested kvm/queue as of today, sadly I still see the warning.
> > > > > > > 
> > > > > > > Due to a race, the above statements are out of order ;-)
> > > > > > 
> > > > > > So further investigation shows that the trigger for this *is* cpu_pm=on :(
> > > > > > 
> > > > > > So this is enough to trigger the warning when run in the guest:
> > > > > > 
> > > > > > qemu-system-x86_64  -nodefaults  -vnc none -serial stdio -machine accel=kvm
> > > > > > -kernel x86/dummy.flat -machine kernel-irqchip=on -smp 8 -m 1g -cpu host
> > > > > > -overcommit cpu-pm=on
> 
> ...
> 
> > > > > All right, at least that was because I removed the '-device isa-debug-exit,iobase=0xf4,iosize=0x4',
> > > > > which is apparently used by KVM unit tests to signal exit from the VM.
> > > > 
> > > > Can you provide your QEMU command line for running your L1 VM?  And your L0 and L1
> > > > Kconfigs too?  I've tried both the dummy and ipi_stress tests on a variety of hardware,
> > > > kernels, QEMUs, etc..., with no luck.
> > > 
> > > So now both L0 and L1 run almost pure kvm/queue
> > > (commit 2764011106d0436cb44702cfb0981339d68c3509)
> > > 
> > > I have some local patches but they are not relevant to KVM at all, more
> > > like various tweaks to sensors, a sad hack for yet another regression
> > > in AMDGPU, etc.
> > > 
> > > The config and qemu command line attached.
> > > 
> > > AVIC disabled in L0, L0 qemu is from master upstream.
> > > Bug reproduces too well IMHO, almost always.
> > > 
> > > For reference the warning is printed in L1's dmesg.
> > 
> > Tested this without any preemption in L0 and L1 - bug still reproduces just fine.
> > (kvm/queue)
> 
> Well, I officially give up; I'm out of ideas to try and repro this on my end.  To
> try and narrow the search, maybe try processing "all" possible gfns and see if that
> makes the leak go away?
> 
> diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
> index 7e258cc94152..a354490939ec 100644
> --- a/arch/x86/kvm/mmu.h
> +++ b/arch/x86/kvm/mmu.h
> @@ -84,9 +84,7 @@ static inline gfn_t kvm_mmu_max_gfn(void)
>          * than hardware's real MAXPHYADDR.  Using the host MAXPHYADDR
>          * disallows such SPTEs entirely and simplifies the TDP MMU.
>          */
> -       int max_gpa_bits = likely(tdp_enabled) ? shadow_phys_bits : 52;
> -
> -       return (1ULL << (max_gpa_bits - PAGE_SHIFT)) - 1;
> +       return (1ULL << (52 - PAGE_SHIFT)) - 1;
>  }
> 
>  static inline u8 kvm_get_shadow_phys_bits(void)
> 

Nope, still reproduces.

I'll think on how to trace this; maybe that will give me some ideas.
Anything useful to dump from the mmu pages that are still not freed at that point?

Also do you test on AMD? I test on my 3970X.


Best regards,
	Maxim Levitsky



* Re: [PATCH] KVM: x86/mmu: Do not create SPTEs for GFNs that exceed host.MAXPHYADDR
  2022-05-04 12:08                             ` Maxim Levitsky
@ 2022-05-04 14:47                               ` Sean Christopherson
  2022-05-04 19:11                               ` Paolo Bonzini
  1 sibling, 0 replies; 20+ messages in thread
From: Sean Christopherson @ 2022-05-04 14:47 UTC (permalink / raw)
  To: Maxim Levitsky
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel, Ben Gardon, David Matlack

On Wed, May 04, 2022, Maxim Levitsky wrote:
> On Tue, 2022-05-03 at 20:30 +0000, Sean Christopherson wrote:
> > Well, I officially give up; I'm out of ideas to try and repro this on my end.  To
> > try and narrow the search, maybe try processing "all" possible gfns and see if that
> > makes the leak go away?
> > 
> > diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
> > index 7e258cc94152..a354490939ec 100644
> > --- a/arch/x86/kvm/mmu.h
> > +++ b/arch/x86/kvm/mmu.h
> > @@ -84,9 +84,7 @@ static inline gfn_t kvm_mmu_max_gfn(void)
> >          * than hardware's real MAXPHYADDR.  Using the host MAXPHYADDR
> >          * disallows such SPTEs entirely and simplifies the TDP MMU.
> >          */
> > -       int max_gpa_bits = likely(tdp_enabled) ? shadow_phys_bits : 52;
> > -
> > -       return (1ULL << (max_gpa_bits - PAGE_SHIFT)) - 1;
> > +       return (1ULL << (52 - PAGE_SHIFT)) - 1;
> >  }
> > 
> >  static inline u8 kvm_get_shadow_phys_bits(void)
> > 
> 
> Nope, still reproduces.
> 
> I'll think on how to trace this; maybe that will give me some ideas.
> Anything useful to dump from the mmu pages that are still not freed at that point?

Dumping the role and gfn is most likely to be useful.  Assuming you aren't seeing
this WARN too:

	WARN_ON(!list_empty(&kvm->arch.tdp_mmu_roots));

then it's not a VM refcounting problem.  The bugs thus far have been tied to the
gfn in some way, e.g. skipping back-to-back entries, the MAXPHYADDR thing.  But I
don't have any ideas for why such a simple test would generate unique behavior.
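
If it helps, a minimal sketch of such a dump (hypothetical: the print site and
the tdp_mmu_pages/sp->link bookkeeping below are assumptions, not a posted
patch) could be something like:

	/* e.g. in kvm_mmu_uninit_tdp_mmu(), right before the WARN_ON()s: */
	struct kvm_mmu_page *sp;

	list_for_each_entry(sp, &kvm->arch.tdp_mmu_pages, link)
		pr_warn("leaked TDP MMU SP: gfn=0x%llx role=0x%x\n",
			sp->gfn, sp->role.word);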

> Also do you test on AMD? I test on my 3970X.

Yep, I've tried Rome and Milan, and CLX (or maybe SKX?) and HSW on the Intel side.


* Re: [PATCH] KVM: x86/mmu: Do not create SPTEs for GFNs that exceed host.MAXPHYADDR
  2022-05-04 12:08                             ` Maxim Levitsky
  2022-05-04 14:47                               ` Sean Christopherson
@ 2022-05-04 19:11                               ` Paolo Bonzini
  1 sibling, 0 replies; 20+ messages in thread
From: Paolo Bonzini @ 2022-05-04 19:11 UTC (permalink / raw)
  To: Maxim Levitsky, Sean Christopherson
  Cc: Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel, kvm,
	linux-kernel, Ben Gardon, David Matlack

On 5/4/22 14:08, Maxim Levitsky wrote:
> Nope, still reproduces.
> 
> I'll think on how to trace this; maybe that will give me some ideas.
> Anything useful to dump from the mmu pages that are still not freed at that point?

Perhaps you can dump the gpa and TSC of the allocation, and also print 
the TSC right before destroy_workqueue?
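
Roughly along these lines, just as a sketch (the alloc_tsc field and the print
sites below are hypothetical, only meant to show the idea):

	/* Hypothetical debug field added to struct kvm_mmu_page: */
	u64 alloc_tsc;

	/* Recorded wherever the shadow page is allocated: */
	sp->alloc_tsc = rdtsc();

	/* And right before destroy_workqueue(), for each page still around: */
	pr_warn("leaked SP: gfn=0x%llx alloc_tsc=%llu now_tsc=%llu\n",
		sp->gfn, sp->alloc_tsc, rdtsc());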

Paolo



end of thread, other threads: [~2022-05-04 19:12 UTC | newest]

Thread overview: 20+ messages
2022-04-28 23:34 [PATCH] KVM: x86/mmu: Do not create SPTEs for GFNs that exceed host.MAXPHYADDR Sean Christopherson
2022-04-29 10:36 ` Paolo Bonzini
2022-04-29 14:24   ` Sean Christopherson
2022-04-29 14:37     ` Paolo Bonzini
2022-04-29 14:42       ` Sean Christopherson
2022-04-29 14:50         ` Paolo Bonzini
2022-04-29 16:01           ` Sean Christopherson
2022-05-01 14:28             ` Maxim Levitsky
2022-05-01 14:32               ` Maxim Levitsky
2022-05-02  7:59                 ` Maxim Levitsky
2022-05-02  8:56                   ` Maxim Levitsky
2022-05-02 16:51                     ` Sean Christopherson
2022-05-03  9:12                       ` Maxim Levitsky
2022-05-03 15:12                         ` Maxim Levitsky
2022-05-03 20:30                           ` Sean Christopherson
2022-05-04 12:08                             ` Maxim Levitsky
2022-05-04 14:47                               ` Sean Christopherson
2022-05-04 19:11                               ` Paolo Bonzini
2022-05-02 11:12 ` Kai Huang
2022-05-02 11:52   ` Paolo Bonzini
