* [PATCH v3 00/11] KVM: x86/mmu: refine memtype related mmu zap
@ 2023-06-16  2:31 Yan Zhao
  2023-06-16  2:32 ` [PATCH v3 01/11] KVM: x86/mmu: helpers to return if KVM honors guest MTRRs Yan Zhao
                   ` (11 more replies)
  0 siblings, 12 replies; 37+ messages in thread
From: Yan Zhao @ 2023-06-16  2:31 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: pbonzini, seanjc, chao.gao, kai.huang, robert.hoo.linux, Yan Zhao

This series refines the mmu zaps caused by EPT memory type updates when
guest MTRRs are honored.

The first 5 patches revolve around helper functions that check whether
KVM TDP honors guest MTRRs, so that TDP zaps and page fault max_level
reduction are applied only to TDPs that honor guest MTRRs.

- The 5th patch triggers zapping of TDP leaf entries when the non-coherent
  DMA device count goes from 0 to 1 or from 1 to 0.

The last 6 patches are fixes and optimizations for the mmu zaps that
happen when guest MTRRs are honored. Those mmu zaps are usually triggered
from all vCPUs in bursts on all GFN ranges, intending to remove stale
memtypes of TDP entries.

- The 6th patch performs the TDP zap when CR0.CD toggles and when guest
  MTRRs are updated under CR0.CD=0.

- The 7th-8th patches refine KVM_X86_QUIRK_CD_NW_CLEARED by removing the
  IPAT bit from the EPT memtype when CR0.CD=1 and guest MTRRs are honored.

- The 9th-11th patches optimize the mmu zap when guest MTRRs are honored
  by serializing vCPUs' gfn zap requests and calculating precise
  fine-grained ranges to zap (a condensed sketch of the scheme follows
  this list).
  They are put in mtrr.c because the optimizations apply only when guest
  MTRRs are honored and because calculating fine-grained ranges requires
  reading guest MTRRs.
  Calls to kvm_unmap_gfn_range() are not included in the optimization,
  because they are not triggered from all vCPUs in bursts and not all of
  them are blockable. They usually happen at memslot removal and thus do
  not affect the mmu zaps when guest MTRRs are honored. Also, current
  performance data shows no observable difference in mmu zap performance
  when auto NUMA balancing, which triggers kvm_unmap_gfn_range(), is
  turned on or off.
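
A condensed sketch of the serialization scheme (simplified from patches
9-11 below; both functions are introduced in patch 9):

	/* Every vCPU queues the range(s) it wants zapped, then exactly
	 * one vCPU drains the shared queue while the others wait for
	 * the zap to finish.
	 */
	kvm_add_mtrr_zap_list(kvm, range);	/* sorted insert + dedup */
	kvm_zap_or_wait_mtrr_zap_list(kvm);	/* one zapper at a time */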

Reference performance data for the last 6 patches is shown below:

Base: base code before patch 6
C6-8: includes base code + patches 6 + 7 + 8
      patch 6: move TDP zaps from guest MTRRs update to CR0.CD toggling
      patch 7: drop IPAT in memtype when CD=1 for
               KVM_X86_QUIRK_CD_NW_CLEARED
      patch 8: move vmx code to get EPT memtype when CR0.CD=1 to x86 common
               code
C9:   includes C6-8 + patch 9
      patch 9: serialize vCPUs to zap gfn when guest MTRRs are honored
C10:  includes C9 + patch 10
      patch 10: fine-grained gfn zap when guest MTRRs are honored
C11:  includes C10 + patch 11
      patch 11: split a single gfn zap range when guest MTRRs are honored

vCPU count: 8,  guest memory: 16G
Physical CPU frequency: 3100 MHz

     |              OVMF            |             Seabios          |
     | EPT zap cycles | EPT zap cnt | EPT zap cycles | EPT zap cnt |
Base |    3444.97M    |      84     |      61.29M    |      50     |
C6-8 |    4343.68M    |      74     |     503.04M    |      42     |*
 C9  |     261.45M    |      74     |     106.64M    |      42     |
 C10 |     157.42M    |      74     |      71.04M    |      42     |
 C11 |      33.95M    |      74     |      24.04M    |      42     |

* With C6-8, the EPT zap cnt is reduced because some MTRR updates happen
  under CR0.CD=1 and no longer trigger zaps.
  EPT zap cycles increase a bit (especially for Seabios) because
  concurrency is more intense when CR0.CD toggles than when guest MTRRs
  are updated.
  (patches 7/8 have negligible performance impact)

Changelog:
v2 --> v3:
1. Updated patch 1 about definition of honor guest MTRRs helper. (Sean)
2. Added patch 2 to use honor guest MTRRs helper in kvm_tdp_page_fault().
   (Sean)
3. Removed unnecessary calculation of MTRR ranges.
   (Chao Gao, Kai Huang, Sean)
4. Updated patches 3-5 to use the helper. (Chao Gao, Kai Huang, Sean)
5. Added patches 6,7 to reposition TDP zap and drop IPAT bit. (Sean)
6. Added patch 8 to prepare for patch 10's memtype calculation when
   CR0.CD=1.
7. Added patches 9-11 to speed up MTRR updates / CR0.CD toggles when
   guest MTRRs are honored. (Sean)
8. Dropped the per-VM based MTRRs from v2. (Sean)

v1 --> v2:
1. Added a helper to skip the non-EPT case in patch 1.
2. Added patch 2 to skip mmu zap when guest CR0_CD changes if EPT is not
   enabled. (Chao Gao)
3. Added patch 3 to skip mmu zap when guest MTRR changes if EPT is not
   enabled.
4. Do not mention TDX in patch 4 as the code is not merged yet (Chao Gao)
5. Added patches 5-6 to reduce EPT zap during guest bootup.

v2:
https://lore.kernel.org/all/20230509134825.1523-1-yan.y.zhao@intel.com/

v1:
https://lore.kernel.org/all/20230508034700.7686-1-yan.y.zhao@intel.com/

Yan Zhao (11):
  KVM: x86/mmu: helpers to return if KVM honors guest MTRRs
  KVM: x86/mmu: Use KVM honors guest MTRRs helper in
    kvm_tdp_page_fault()
  KVM: x86/mmu: Use KVM honors guest MTRRs helper when CR0.CD toggles
  KVM: x86/mmu: Use KVM honors guest MTRRs helper when update mtrr
  KVM: x86/mmu: zap KVM TDP when noncoherent DMA assignment starts/stops
  KVM: x86/mmu: move TDP zaps from guest MTRRs update to CR0.CD toggling
  KVM: VMX: drop IPAT in memtype when CD=1 for
    KVM_X86_QUIRK_CD_NW_CLEARED
  KVM: x86: move vmx code to get EPT memtype when CR0.CD=1 to x86 common
    code
  KVM: x86/mmu: serialize vCPUs to zap gfn when guest MTRRs are honored
  KVM: x86/mmu: fine-grained gfn zap when guest MTRRs are honored
  KVM: x86/mmu: split a single gfn zap range when guest MTRRs are
    honored

 arch/x86/include/asm/kvm_host.h |   4 +
 arch/x86/kvm/mmu.h              |   7 +
 arch/x86/kvm/mmu/mmu.c          |  18 +-
 arch/x86/kvm/mtrr.c             | 286 +++++++++++++++++++++++++++++++-
 arch/x86/kvm/vmx/vmx.c          |  11 +-
 arch/x86/kvm/x86.c              |  25 ++-
 arch/x86/kvm/x86.h              |   2 +
 7 files changed, 333 insertions(+), 20 deletions(-)


base-commit: 24ff4c08e5bbdd7399d45f940f10fed030dfadda
-- 
2.17.1



* [PATCH v3 01/11] KVM: x86/mmu: helpers to return if KVM honors guest MTRRs
  2023-06-16  2:31 [PATCH v3 00/11] KVM: x86/mmu: refine memtype related mmu zap Yan Zhao
@ 2023-06-16  2:32 ` Yan Zhao
  2023-06-16  2:34 ` [PATCH v3 02/11] KVM: x86/mmu: Use KVM honors guest MTRRs helper in kvm_tdp_page_fault() Yan Zhao
                   ` (10 subsequent siblings)
  11 siblings, 0 replies; 37+ messages in thread
From: Yan Zhao @ 2023-06-16  2:32 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: pbonzini, seanjc, chao.gao, kai.huang, robert.hoo.linux, Yan Zhao

Add helpers to check if KVM honors guest MTRRs.
The inner helper __kvm_mmu_honors_guest_mtrrs() is also provided to
outside callers so they can check whether guest MTRRs were honored
before non-coherent DMA stopped.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
 arch/x86/kvm/mmu.h     |  7 +++++++
 arch/x86/kvm/mmu/mmu.c | 15 +++++++++++++++
 2 files changed, 22 insertions(+)

diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 92d5a1924fc1..38bd449226f6 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -235,6 +235,13 @@ static inline u8 permission_fault(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
 	return -(u32)fault & errcode;
 }
 
+bool __kvm_mmu_honors_guest_mtrrs(struct kvm *kvm, bool vm_has_noncoherent_dma);
+
+static inline bool kvm_mmu_honors_guest_mtrrs(struct kvm *kvm)
+{
+	return __kvm_mmu_honors_guest_mtrrs(kvm, kvm_arch_has_noncoherent_dma(kvm));
+}
+
 void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end);
 
 int kvm_arch_write_log_dirty(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 1e5db621241f..b4f89f015c37 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4516,6 +4516,21 @@ static int kvm_tdp_mmu_page_fault(struct kvm_vcpu *vcpu,
 }
 #endif
 
+bool __kvm_mmu_honors_guest_mtrrs(struct kvm *kvm, bool vm_has_noncoherent_dma)
+{
+	/*
+	 * If the TDP is enabled, the host MTRRs are ignored by TDP
+	 * (shadow_memtype_mask is non-zero), and the VM has non-coherent DMA
+	 * (DMA doesn't snoop CPU caches), KVM's ABI is to honor the memtype
+	 * from the guest's MTRRs so that guest accesses to memory that is
+	 * DMA'd aren't cached against the guest's wishes.
+	 *
+	 * Note, KVM may still ultimately ignore guest MTRRs for certain PFNs,
+	 * e.g. KVM will force UC memtype for host MMIO.
+	 */
+	return vm_has_noncoherent_dma && tdp_enabled && shadow_memtype_mask;
+}
+
 int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 {
 	/*
-- 
2.17.1



* [PATCH v3 02/11] KVM: x86/mmu: Use KVM honors guest MTRRs helper in kvm_tdp_page_fault()
  2023-06-16  2:31 [PATCH v3 00/11] KVM: x86/mmu: refine memtype related mmu zap Yan Zhao
  2023-06-16  2:32 ` [PATCH v3 01/11] KVM: x86/mmu: helpers to return if KVM honors guest MTRRs Yan Zhao
@ 2023-06-16  2:34 ` Yan Zhao
  2023-06-16  2:35 ` [PATCH v3 03/11] KVM: x86/mmu: Use KVM honors guest MTRRs helper when CR0.CD toggles Yan Zhao
                   ` (9 subsequent siblings)
  11 siblings, 0 replies; 37+ messages in thread
From: Yan Zhao @ 2023-06-16  2:34 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: pbonzini, seanjc, chao.gao, kai.huang, robert.hoo.linux, Yan Zhao

Let kvm_tdp_page_fault() use the helper kvm_mmu_honors_guest_mtrrs() to
decide if it needs to consult guest MTRRs to check GFN range consistency.

No functional change intended.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
 arch/x86/kvm/mmu/mmu.c | 11 ++---------
 1 file changed, 2 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index b4f89f015c37..7f52bbe013b3 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4536,16 +4536,9 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 	/*
 	 * If the guest's MTRRs may be used to compute the "real" memtype,
 	 * restrict the mapping level to ensure KVM uses a consistent memtype
-	 * across the entire mapping.  If the host MTRRs are ignored by TDP
-	 * (shadow_memtype_mask is non-zero), and the VM has non-coherent DMA
-	 * (DMA doesn't snoop CPU caches), KVM's ABI is to honor the memtype
-	 * from the guest's MTRRs so that guest accesses to memory that is
-	 * DMA'd aren't cached against the guest's wishes.
-	 *
-	 * Note, KVM may still ultimately ignore guest MTRRs for certain PFNs,
-	 * e.g. KVM will force UC memtype for host MMIO.
+	 * across the entire mapping.
 	 */
-	if (shadow_memtype_mask && kvm_arch_has_noncoherent_dma(vcpu->kvm)) {
+	if (kvm_mmu_honors_guest_mtrrs(vcpu->kvm)) {
 		for ( ; fault->max_level > PG_LEVEL_4K; --fault->max_level) {
 			int page_num = KVM_PAGES_PER_HPAGE(fault->max_level);
 			gfn_t base = gfn_round_for_level(fault->gfn,
-- 
2.17.1



* [PATCH v3 03/11] KVM: x86/mmu: Use KVM honors guest MTRRs helper when CR0.CD toggles
  2023-06-16  2:31 [PATCH v3 00/11] KVM: x86/mmu: refine memtype related mmu zap Yan Zhao
  2023-06-16  2:32 ` [PATCH v3 01/11] KVM: x86/mmu: helpers to return if KVM honors guest MTRRs Yan Zhao
  2023-06-16  2:34 ` [PATCH v3 02/11] KVM: x86/mmu: Use KVM honors guest MTRRs helper in kvm_tdp_page_fault() Yan Zhao
@ 2023-06-16  2:35 ` Yan Zhao
  2023-06-28 21:59   ` Sean Christopherson
  2023-06-16  2:36 ` [PATCH v3 04/11] KVM: x86/mmu: Use KVM honors guest MTRRs helper when update mtrr Yan Zhao
                   ` (8 subsequent siblings)
  11 siblings, 1 reply; 37+ messages in thread
From: Yan Zhao @ 2023-06-16  2:35 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: pbonzini, seanjc, chao.gao, kai.huang, robert.hoo.linux, Yan Zhao

Call the helper to check if guest MTRRs are honored by the KVM MMU before
zapping, as the value of guest CR0.CD only affects the memory types of
KVM TDP when guest MTRRs are honored.

Suggested-by: Chao Gao <chao.gao@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
 arch/x86/kvm/x86.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 9e7186864542..6693daeb5686 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -942,7 +942,7 @@ void kvm_post_set_cr0(struct kvm_vcpu *vcpu, unsigned long old_cr0, unsigned lon
 		kvm_mmu_reset_context(vcpu);
 
 	if (((cr0 ^ old_cr0) & X86_CR0_CD) &&
-	    kvm_arch_has_noncoherent_dma(vcpu->kvm) &&
+	    kvm_mmu_honors_guest_mtrrs(vcpu->kvm) &&
 	    !kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_CD_NW_CLEARED))
 		kvm_zap_gfn_range(vcpu->kvm, 0, ~0ULL);
 }
-- 
2.17.1



* [PATCH v3 04/11] KVM: x86/mmu: Use KVM honors guest MTRRs helper when update mtrr
  2023-06-16  2:31 [PATCH v3 00/11] KVM: x86/mmu: refine memtype related mmu zap Yan Zhao
                   ` (2 preceding siblings ...)
  2023-06-16  2:35 ` [PATCH v3 03/11] KVM: x86/mmu: Use KVM honors guest MTRRs helper when CR0.CD toggles Yan Zhao
@ 2023-06-16  2:36 ` Yan Zhao
  2023-06-28 22:08   ` Sean Christopherson
  2023-06-16  2:37 ` [PATCH v3 05/11] KVM: x86/mmu: zap KVM TDP when noncoherent DMA assignment starts/stops Yan Zhao
                   ` (7 subsequent siblings)
  11 siblings, 1 reply; 37+ messages in thread
From: Yan Zhao @ 2023-06-16  2:36 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: pbonzini, seanjc, chao.gao, kai.huang, robert.hoo.linux, Yan Zhao

Call the helper to check if guest MTRRs are honored by the KVM MMU before
doing the calculation and zapping.
Guest MTRRs only affect TDP memtypes when TDP honors guest MTRRs; there's
no point in doing the calculation and zapping otherwise.

Suggested-by: Chao Gao <chao.gao@intel.com>
Suggested-by: Sean Christopherson <seanjc@google.com>
Cc: Kai Huang <kai.huang@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
 arch/x86/kvm/mtrr.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kvm/mtrr.c b/arch/x86/kvm/mtrr.c
index 3eb6e7f47e96..a67c28a56417 100644
--- a/arch/x86/kvm/mtrr.c
+++ b/arch/x86/kvm/mtrr.c
@@ -320,7 +320,7 @@ static void update_mtrr(struct kvm_vcpu *vcpu, u32 msr)
 	struct kvm_mtrr *mtrr_state = &vcpu->arch.mtrr_state;
 	gfn_t start, end;
 
-	if (!tdp_enabled || !kvm_arch_has_noncoherent_dma(vcpu->kvm))
+	if (!kvm_mmu_honors_guest_mtrrs(vcpu->kvm))
 		return;
 
 	if (!mtrr_is_enabled(mtrr_state) && msr != MSR_MTRRdefType)
-- 
2.17.1



* [PATCH v3 05/11] KVM: x86/mmu: zap KVM TDP when noncoherent DMA assignment starts/stops
  2023-06-16  2:31 [PATCH v3 00/11] KVM: x86/mmu: refine memtype related mmu zap Yan Zhao
                   ` (3 preceding siblings ...)
  2023-06-16  2:36 ` [PATCH v3 04/11] KVM: x86/mmu: Use KVM honors guest MTRRs helper when update mtrr Yan Zhao
@ 2023-06-16  2:37 ` Yan Zhao
  2023-06-16  2:37 ` [PATCH v3 06/11] KVM: x86/mmu: move TDP zaps from guest MTRRs update to CR0.CD toggling Yan Zhao
                   ` (6 subsequent siblings)
  11 siblings, 0 replies; 37+ messages in thread
From: Yan Zhao @ 2023-06-16  2:37 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: pbonzini, seanjc, chao.gao, kai.huang, robert.hoo.linux, Yan Zhao

Zap KVM TDP when noncoherent DMA assignment starts (noncoherent dma count
transitions from 0 to 1) or stops (noncoherent dma count transitions from
1 to 0). Before the zap, test if guest MTRRs are to be honored after the
assignment starts or were honored before the assignment stops.

When there's no noncoherent DMA device, the EPT memory type is
((MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT).

When there are noncoherent DMA devices, the EPT memory type needs to
honor guest CR0.CD and MTRR settings.

So, when the noncoherent DMA count transitions between 0 and 1, EPT leaf
entries need to be zapped to clear stale memory types.

This issue might be hidden when the device is statically assigned, with
VFIO adding/removing MMIO regions of the noncoherent DMA devices several
times during guest boot, as the current KVM MMU calls
kvm_mmu_zap_all_fast() on memslot removal.

But if the device is hot-plugged, or if the guest has mmio_always_on for
the device, its MMIO regions may only be added once, and then there's no
path that zaps the EPT entries to clear the stale memory types.

Therefore, do the EPT zapping when noncoherent assignment starts/stops to
ensure stale entries are cleaned away.

Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
 arch/x86/kvm/x86.c | 20 ++++++++++++++++++--
 1 file changed, 18 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 6693daeb5686..ac9548efa76f 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -13164,15 +13164,31 @@ bool noinstr kvm_arch_has_assigned_device(struct kvm *kvm)
 }
 EXPORT_SYMBOL_GPL(kvm_arch_has_assigned_device);
 
+static void kvm_noncoherent_dma_assignment_start_or_stop(struct kvm *kvm)
+{
+	/*
+	 * Non-coherent DMA assignment and de-assignment will affect
+	 * whether KVM honors guest MTRRs and cause changes in memtypes
+	 * in TDP.
+	 * So, specify the second parameter as true here to indicate
+	 * non-coherent DMAs are/were involved and TDP zap might be
+	 * necessary.
+	 */
+	if (__kvm_mmu_honors_guest_mtrrs(kvm, true))
+		kvm_zap_gfn_range(kvm, gpa_to_gfn(0), gpa_to_gfn(~0ULL));
+}
+
 void kvm_arch_register_noncoherent_dma(struct kvm *kvm)
 {
-	atomic_inc(&kvm->arch.noncoherent_dma_count);
+	if (atomic_inc_return(&kvm->arch.noncoherent_dma_count) == 1)
+		kvm_noncoherent_dma_assignment_start_or_stop(kvm);
 }
 EXPORT_SYMBOL_GPL(kvm_arch_register_noncoherent_dma);
 
 void kvm_arch_unregister_noncoherent_dma(struct kvm *kvm)
 {
-	atomic_dec(&kvm->arch.noncoherent_dma_count);
+	if (!atomic_dec_return(&kvm->arch.noncoherent_dma_count))
+		kvm_noncoherent_dma_assignment_start_or_stop(kvm);
 }
 EXPORT_SYMBOL_GPL(kvm_arch_unregister_noncoherent_dma);
 
-- 
2.17.1



* [PATCH v3 06/11] KVM: x86/mmu: move TDP zaps from guest MTRRs update to CR0.CD toggling
  2023-06-16  2:31 [PATCH v3 00/11] KVM: x86/mmu: refine memtype related mmu zap Yan Zhao
                   ` (4 preceding siblings ...)
  2023-06-16  2:37 ` [PATCH v3 05/11] KVM: x86/mmu: zap KVM TDP when noncoherent DMA assignment starts/stops Yan Zhao
@ 2023-06-16  2:37 ` Yan Zhao
  2023-06-16  2:38 ` [PATCH v3 07/11] KVM: VMX: drop IPAT in memtype when CD=1 for KVM_X86_QUIRK_CD_NW_CLEARED Yan Zhao
                   ` (5 subsequent siblings)
  11 siblings, 0 replies; 37+ messages in thread
From: Yan Zhao @ 2023-06-16  2:37 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: pbonzini, seanjc, chao.gao, kai.huang, robert.hoo.linux, Yan Zhao

If guest MTRRs are honored, always zap TDP when CR0.CD toggles and don't
do it when guest MTRRs are updated under CR0.CD=1.

This is because CR0.CD=1 takes precedence over guest MTRRs in deciding
TDP memory types; TDP memtypes do not change when guest MTRRs are updated
under CR0.CD=1.

Instead, always do the TDP zapping when CR0.CD toggles, because even with
the quirk KVM_X86_QUIRK_CD_NW_CLEARED, TDP memory types may change after
guest CR0.CD toggles.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
 arch/x86/kvm/mtrr.c | 3 +++
 arch/x86/kvm/x86.c  | 3 +--
 2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/mtrr.c b/arch/x86/kvm/mtrr.c
index a67c28a56417..3ce58734ad22 100644
--- a/arch/x86/kvm/mtrr.c
+++ b/arch/x86/kvm/mtrr.c
@@ -323,6 +323,9 @@ static void update_mtrr(struct kvm_vcpu *vcpu, u32 msr)
 	if (!kvm_mmu_honors_guest_mtrrs(vcpu->kvm))
 		return;
 
+	if (kvm_is_cr0_bit_set(vcpu, X86_CR0_CD))
+		return;
+
 	if (!mtrr_is_enabled(mtrr_state) && msr != MSR_MTRRdefType)
 		return;
 
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index ac9548efa76f..32cc8bfaa5f1 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -942,8 +942,7 @@ void kvm_post_set_cr0(struct kvm_vcpu *vcpu, unsigned long old_cr0, unsigned lon
 		kvm_mmu_reset_context(vcpu);
 
 	if (((cr0 ^ old_cr0) & X86_CR0_CD) &&
-	    kvm_mmu_honors_guest_mtrrs(vcpu->kvm) &&
-	    !kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_CD_NW_CLEARED))
+	    kvm_mmu_honors_guest_mtrrs(vcpu->kvm))
 		kvm_zap_gfn_range(vcpu->kvm, 0, ~0ULL);
 }
 EXPORT_SYMBOL_GPL(kvm_post_set_cr0);
-- 
2.17.1



* [PATCH v3 07/11] KVM: VMX: drop IPAT in memtype when CD=1 for KVM_X86_QUIRK_CD_NW_CLEARED
  2023-06-16  2:31 [PATCH v3 00/11] KVM: x86/mmu: refine memtype related mmu zap Yan Zhao
                   ` (5 preceding siblings ...)
  2023-06-16  2:37 ` [PATCH v3 06/11] KVM: x86/mmu: move TDP zaps from guest MTRRs update to CR0.CD toggling Yan Zhao
@ 2023-06-16  2:38 ` Yan Zhao
  2023-06-20  2:42   ` Chao Gao
  2023-06-16  2:38 ` [PATCH v3 08/11] KVM: x86: move vmx code to get EPT memtype when CR0.CD=1 to x86 common code Yan Zhao
                   ` (4 subsequent siblings)
  11 siblings, 1 reply; 37+ messages in thread
From: Yan Zhao @ 2023-06-16  2:38 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: pbonzini, seanjc, chao.gao, kai.huang, robert.hoo.linux, Yan Zhao

For KVM_X86_QUIRK_CD_NW_CLEARED, remove the ignore-PAT (IPAT) bit in the
EPT memory type when the cache is disabled and non-coherent DMA is
present.

Before this patch, with the quirk KVM_X86_QUIRK_CD_NW_CLEARED, WB + IPAT
is returned as the EPT memory type when the guest cache is disabled.
Removing the IPAT bit in this patch allows the effective memory type to
honor guest PAT values as well, which can make the effective memory type
stronger than WB, as WB is the weakest memtype. However, this change is
acceptable as it doesn't make the memory type weaker; without this quirk,
the original memtype for cache disabled is UC + IPAT.
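
As an illustration of the behavior change (a sketch; the exact combining
rules for the EPT memtype and guest PAT are those defined by the SDM):

	/*
	 * EPT MT = WB, IPAT = 1: guest PAT is ignored, the effective
	 *                        memtype is always WB.
	 * EPT MT = WB, IPAT = 0: guest PAT is honored, e.g. a guest
	 *                        PAT of UC yields an effective UC.
	 */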

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
 arch/x86/kvm/vmx/vmx.c | 9 +++------
 1 file changed, 3 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 0ecf4be2c6af..c1e93678cea4 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7548,8 +7548,6 @@ static int vmx_vm_init(struct kvm *kvm)
 
 static u8 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
 {
-	u8 cache;
-
 	/* We wanted to honor guest CD/MTRR/PAT, but doing so could result in
 	 * memory aliases with conflicting memory types and sometimes MCEs.
 	 * We have to be careful as to what are honored and when.
@@ -7576,11 +7574,10 @@ static u8 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
 
 	if (kvm_read_cr0_bits(vcpu, X86_CR0_CD)) {
 		if (kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_CD_NW_CLEARED))
-			cache = MTRR_TYPE_WRBACK;
+			return MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT;
 		else
-			cache = MTRR_TYPE_UNCACHABLE;
-
-		return (cache << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT;
+			return (MTRR_TYPE_UNCACHABLE << VMX_EPT_MT_EPTE_SHIFT) |
+				VMX_EPT_IPAT_BIT;
 	}
 
 	return kvm_mtrr_get_guest_memory_type(vcpu, gfn) << VMX_EPT_MT_EPTE_SHIFT;
-- 
2.17.1



* [PATCH v3 08/11] KVM: x86: move vmx code to get EPT memtype when CR0.CD=1 to x86 common code
  2023-06-16  2:31 [PATCH v3 00/11] KVM: x86/mmu: refine memtype related mmu zap Yan Zhao
                   ` (6 preceding siblings ...)
  2023-06-16  2:38 ` [PATCH v3 07/11] KVM: VMX: drop IPAT in memtype when CD=1 for KVM_X86_QUIRK_CD_NW_CLEARED Yan Zhao
@ 2023-06-16  2:38 ` Yan Zhao
  2023-06-28 22:57   ` Sean Christopherson
  2023-06-16  2:39 ` [PATCH v3 09/11] KVM: x86/mmu: serialize vCPUs to zap gfn when guest MTRRs are honored Yan Zhao
                   ` (3 subsequent siblings)
  11 siblings, 1 reply; 37+ messages in thread
From: Yan Zhao @ 2023-06-16  2:38 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: pbonzini, seanjc, chao.gao, kai.huang, robert.hoo.linux, Yan Zhao

Move the code in vmx.c that gets the cache-disabled memtype when
non-coherent DMA is present to x86 common code.

This is a preparation patch for the later implementation of fine-grained
gfn zap for CR0.CD toggles when guest MTRRs are honored.

No functional change intended.

Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
 arch/x86/kvm/mtrr.c    | 19 +++++++++++++++++++
 arch/x86/kvm/vmx/vmx.c | 10 +++++-----
 arch/x86/kvm/x86.h     |  1 +
 3 files changed, 25 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/mtrr.c b/arch/x86/kvm/mtrr.c
index 3ce58734ad22..b35dd0bc9cad 100644
--- a/arch/x86/kvm/mtrr.c
+++ b/arch/x86/kvm/mtrr.c
@@ -721,3 +721,22 @@ bool kvm_mtrr_check_gfn_range_consistency(struct kvm_vcpu *vcpu, gfn_t gfn,
 
 	return type == mtrr_default_type(mtrr_state);
 }
+
+void kvm_mtrr_get_cd_memory_type(struct kvm_vcpu *vcpu, u8 *type, bool *ipat)
+{
+	/*
+	 * this routine is supposed to be called when guest mtrrs are honored
+	 */
+	if (unlikely(!kvm_mmu_honors_guest_mtrrs(vcpu->kvm))) {
+		*type = MTRR_TYPE_WRBACK;
+		*ipat = true;
+	} else if (unlikely(!kvm_check_has_quirk(vcpu->kvm,
+			    KVM_X86_QUIRK_CD_NW_CLEARED))) {
+		*type = MTRR_TYPE_UNCACHABLE;
+		*ipat = true;
+	} else {
+		*type = MTRR_TYPE_WRBACK;
+		*ipat = false;
+	}
+}
+EXPORT_SYMBOL_GPL(kvm_mtrr_get_cd_memory_type);
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index c1e93678cea4..6414c5a6e892 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7573,11 +7573,11 @@ static u8 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
 		return (MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT;
 
 	if (kvm_read_cr0_bits(vcpu, X86_CR0_CD)) {
-		if (kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_CD_NW_CLEARED))
-			return MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT;
-		else
-			return (MTRR_TYPE_UNCACHABLE << VMX_EPT_MT_EPTE_SHIFT) |
-				VMX_EPT_IPAT_BIT;
+		bool ipat;
+		u8 cache;
+
+		kvm_mtrr_get_cd_memory_type(vcpu, &cache, &ipat);
+		return cache << VMX_EPT_MT_EPTE_SHIFT | (ipat ? VMX_EPT_IPAT_BIT : 0);
 	}
 
 	return kvm_mtrr_get_guest_memory_type(vcpu, gfn) << VMX_EPT_MT_EPTE_SHIFT;
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index 82e3dafc5453..9781b4b32d68 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -313,6 +313,7 @@ int kvm_mtrr_set_msr(struct kvm_vcpu *vcpu, u32 msr, u64 data);
 int kvm_mtrr_get_msr(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata);
 bool kvm_mtrr_check_gfn_range_consistency(struct kvm_vcpu *vcpu, gfn_t gfn,
 					  int page_num);
+void kvm_mtrr_get_cd_memory_type(struct kvm_vcpu *vcpu, u8 *type, bool *ipat);
 bool kvm_vector_hashing_enabled(void);
 void kvm_fixup_and_inject_pf_error(struct kvm_vcpu *vcpu, gva_t gva, u16 error_code);
 int x86_decode_emulated_instruction(struct kvm_vcpu *vcpu, int emulation_type,
-- 
2.17.1



* [PATCH v3 09/11] KVM: x86/mmu: serialize vCPUs to zap gfn when guest MTRRs are honored
  2023-06-16  2:31 [PATCH v3 00/11] KVM: x86/mmu: refine memtype related mmu zap Yan Zhao
                   ` (7 preceding siblings ...)
  2023-06-16  2:38 ` [PATCH v3 08/11] KVM: x86: move vmx code to get EPT memtype when CR0.CD=1 to x86 common code Yan Zhao
@ 2023-06-16  2:39 ` Yan Zhao
  2023-06-16  7:45   ` Yuan Yao
  2023-06-28 23:00   ` Sean Christopherson
  2023-06-16  2:41 ` [PATCH v3 10/11] KVM: x86/mmu: fine-grained gfn zap " Yan Zhao
                   ` (2 subsequent siblings)
  11 siblings, 2 replies; 37+ messages in thread
From: Yan Zhao @ 2023-06-16  2:39 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: pbonzini, seanjc, chao.gao, kai.huang, robert.hoo.linux, Yan Zhao

Serialize concurrent and repeated calls of kvm_zap_gfn_range() from every
vCPU for CR0.CD toggles and MTRR updates when guest MTRRs are honored.

During guest boot-up, if guest MTRRs are honored by TDP, TDP zaps are
triggered several times by each vCPU for CR0.CD toggles and MTRR updates.
This takes unexpectedly long CPU cycles because of contention on
kvm->mmu_lock.

Therefore, introduce a mtrr_zap_list to remove duplicated zaps and an
atomic mtrr_zapping to allow only one vCPU to do the real zap work at a
time.
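
A worked example of the queue ordering implemented below (hypothetical
GFN values):

	/*
	 * kvm_add_mtrr_zap_list() keeps the list sorted by length
	 * ascending, then by start descending, and frees exact
	 * duplicates:
	 *
	 *   enqueue {start=16, end=24}  -> [{16,24}]
	 *   enqueue {start=8,  end=16}  -> [{16,24}, {8,16}]
	 *   enqueue {start=0,  end=256} -> [{16,24}, {8,16}, {0,256}]
	 *   enqueue {start=8,  end=16}  -> freed, already queued
	 *
	 * The large, slow-to-zap range is dequeued last, giving
	 * duplicates of it more time to show up and be dropped.
	 */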

Suggested-by: Sean Christopherson <seanjc@google.com>
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
 arch/x86/include/asm/kvm_host.h |   4 +
 arch/x86/kvm/mtrr.c             | 141 +++++++++++++++++++++++++++++++-
 arch/x86/kvm/x86.c              |   2 +-
 arch/x86/kvm/x86.h              |   1 +
 4 files changed, 146 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 28bd38303d70..8da1517a1513 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1444,6 +1444,10 @@ struct kvm_arch {
 	 */
 #define SPLIT_DESC_CACHE_MIN_NR_OBJECTS (SPTE_ENT_PER_PAGE + 1)
 	struct kvm_mmu_memory_cache split_desc_cache;
+
+	struct list_head mtrr_zap_list;
+	spinlock_t mtrr_zap_list_lock;
+	atomic_t mtrr_zapping;
 };
 
 struct kvm_vm_stat {
diff --git a/arch/x86/kvm/mtrr.c b/arch/x86/kvm/mtrr.c
index b35dd0bc9cad..688748e3a4d2 100644
--- a/arch/x86/kvm/mtrr.c
+++ b/arch/x86/kvm/mtrr.c
@@ -25,6 +25,8 @@
 #define IA32_MTRR_DEF_TYPE_FE		(1ULL << 10)
 #define IA32_MTRR_DEF_TYPE_TYPE_MASK	(0xff)
 
+static void kvm_mtrr_zap_gfn_range(struct kvm_vcpu *vcpu,
+				   gfn_t gfn_start, gfn_t gfn_end);
 static bool is_mtrr_base_msr(unsigned int msr)
 {
 	/* MTRR base MSRs use even numbers, masks use odd numbers. */
@@ -341,7 +343,7 @@ static void update_mtrr(struct kvm_vcpu *vcpu, u32 msr)
 		var_mtrr_range(var_mtrr_msr_to_range(vcpu, msr), &start, &end);
 	}
 
-	kvm_zap_gfn_range(vcpu->kvm, gpa_to_gfn(start), gpa_to_gfn(end));
+	kvm_mtrr_zap_gfn_range(vcpu, gpa_to_gfn(start), gpa_to_gfn(end));
 }
 
 static bool var_mtrr_range_is_valid(struct kvm_mtrr_range *range)
@@ -437,6 +439,11 @@ int kvm_mtrr_get_msr(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata)
 void kvm_vcpu_mtrr_init(struct kvm_vcpu *vcpu)
 {
 	INIT_LIST_HEAD(&vcpu->arch.mtrr_state.head);
+
+	if (vcpu->vcpu_id == 0) {
+		spin_lock_init(&vcpu->kvm->arch.mtrr_zap_list_lock);
+		INIT_LIST_HEAD(&vcpu->kvm->arch.mtrr_zap_list);
+	}
 }
 
 struct mtrr_iter {
@@ -740,3 +747,135 @@ void kvm_mtrr_get_cd_memory_type(struct kvm_vcpu *vcpu, u8 *type, bool *ipat)
 	}
 }
 EXPORT_SYMBOL_GPL(kvm_mtrr_get_cd_memory_type);
+
+struct mtrr_zap_range {
+	gfn_t start;
+	/* end is exclusive */
+	gfn_t end;
+	struct list_head node;
+};
+
+static void kvm_clear_mtrr_zap_list(struct kvm *kvm)
+{
+	struct list_head *head = &kvm->arch.mtrr_zap_list;
+	struct mtrr_zap_range *tmp, *n;
+
+	spin_lock(&kvm->arch.mtrr_zap_list_lock);
+	list_for_each_entry_safe(tmp, n, head, node) {
+		list_del(&tmp->node);
+		kfree(tmp);
+	}
+	spin_unlock(&kvm->arch.mtrr_zap_list_lock);
+}
+
+/*
+ * Add @range into kvm->arch.mtrr_zap_list and sort the list in
+ * "length" ascending + "start" descending order, so that
+ * ranges consuming more zap cycles can be dequeued later and their
+ * chances of being found duplicated are increased.
+ */
+static void kvm_add_mtrr_zap_list(struct kvm *kvm, struct mtrr_zap_range *range)
+{
+	struct list_head *head = &kvm->arch.mtrr_zap_list;
+	u64 len = range->end - range->start;
+	struct mtrr_zap_range *cur, *n;
+	bool added = false;
+
+	spin_lock(&kvm->arch.mtrr_zap_list_lock);
+
+	if (list_empty(head)) {
+		list_add(&range->node, head);
+		spin_unlock(&kvm->arch.mtrr_zap_list_lock);
+		return;
+	}
+
+	list_for_each_entry_safe(cur, n, head, node) {
+		u64 cur_len = cur->end - cur->start;
+
+		if (len < cur_len)
+			break;
+
+		if (len > cur_len)
+			continue;
+
+		if (range->start > cur->start)
+			break;
+
+		if (range->start < cur->start)
+			continue;
+
+		/* equal len & start, no need to add */
+		added = true;
+		kfree(range);
+		break;
+	}
+
+	if (!added)
+		list_add_tail(&range->node, &cur->node);
+
+	spin_unlock(&kvm->arch.mtrr_zap_list_lock);
+}
+
+static void kvm_zap_mtrr_zap_list(struct kvm *kvm)
+{
+	struct list_head *head = &kvm->arch.mtrr_zap_list;
+	struct mtrr_zap_range *cur = NULL;
+
+	spin_lock(&kvm->arch.mtrr_zap_list_lock);
+
+	while (!list_empty(head)) {
+		u64 start, end;
+
+		cur = list_first_entry(head, typeof(*cur), node);
+		start = cur->start;
+		end = cur->end;
+		list_del(&cur->node);
+		kfree(cur);
+		spin_unlock(&kvm->arch.mtrr_zap_list_lock);
+
+		kvm_zap_gfn_range(kvm, start, end);
+
+		spin_lock(&kvm->arch.mtrr_zap_list_lock);
+	}
+
+	spin_unlock(&kvm->arch.mtrr_zap_list_lock);
+}
+
+static void kvm_zap_or_wait_mtrr_zap_list(struct kvm *kvm)
+{
+	if (atomic_cmpxchg_acquire(&kvm->arch.mtrr_zapping, 0, 1) == 0) {
+		kvm_zap_mtrr_zap_list(kvm);
+		atomic_set_release(&kvm->arch.mtrr_zapping, 0);
+		return;
+	}
+
+	while (atomic_read(&kvm->arch.mtrr_zapping))
+		cpu_relax();
+}
+
+static void kvm_mtrr_zap_gfn_range(struct kvm_vcpu *vcpu,
+				   gfn_t gfn_start, gfn_t gfn_end)
+{
+	struct mtrr_zap_range *range;
+
+	range = kmalloc(sizeof(*range), GFP_KERNEL_ACCOUNT);
+	if (!range)
+		goto fail;
+
+	range->start = gfn_start;
+	range->end = gfn_end;
+
+	kvm_add_mtrr_zap_list(vcpu->kvm, range);
+
+	kvm_zap_or_wait_mtrr_zap_list(vcpu->kvm);
+	return;
+
+fail:
+	kvm_clear_mtrr_zap_list(vcpu->kvm);
+	kvm_zap_gfn_range(vcpu->kvm, gfn_start, gfn_end);
+}
+
+void kvm_zap_gfn_range_on_cd_toggle(struct kvm_vcpu *vcpu)
+{
+	return kvm_mtrr_zap_gfn_range(vcpu, gpa_to_gfn(0), gpa_to_gfn(~0ULL));
+}
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 32cc8bfaa5f1..74aac14a3c0b 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -943,7 +943,7 @@ void kvm_post_set_cr0(struct kvm_vcpu *vcpu, unsigned long old_cr0, unsigned lon
 
 	if (((cr0 ^ old_cr0) & X86_CR0_CD) &&
 	    kvm_mmu_honors_guest_mtrrs(vcpu->kvm))
-		kvm_zap_gfn_range(vcpu->kvm, 0, ~0ULL);
+		kvm_zap_gfn_range_on_cd_toggle(vcpu);
 }
 EXPORT_SYMBOL_GPL(kvm_post_set_cr0);
 
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index 9781b4b32d68..be946aba2bf0 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -314,6 +314,7 @@ int kvm_mtrr_get_msr(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata);
 bool kvm_mtrr_check_gfn_range_consistency(struct kvm_vcpu *vcpu, gfn_t gfn,
 					  int page_num);
 void kvm_mtrr_get_cd_memory_type(struct kvm_vcpu *vcpu, u8 *type, bool *ipat);
+void kvm_zap_gfn_range_on_cd_toggle(struct kvm_vcpu *vcpu);
 bool kvm_vector_hashing_enabled(void);
 void kvm_fixup_and_inject_pf_error(struct kvm_vcpu *vcpu, gva_t gva, u16 error_code);
 int x86_decode_emulated_instruction(struct kvm_vcpu *vcpu, int emulation_type,
-- 
2.17.1



* [PATCH v3 10/11] KVM: x86/mmu: fine-grained gfn zap when guest MTRRs are honored
  2023-06-16  2:31 [PATCH v3 00/11] KVM: x86/mmu: refine memtype related mmu zap Yan Zhao
                   ` (8 preceding siblings ...)
  2023-06-16  2:39 ` [PATCH v3 09/11] KVM: x86/mmu: serialize vCPUs to zap gfn when guest MTRRs are honored Yan Zhao
@ 2023-06-16  2:41 ` Yan Zhao
  2023-06-16  2:42 ` [PATCH v3 11/11] KVM: x86/mmu: split a single gfn zap range " Yan Zhao
  2023-06-28 23:02 ` [PATCH v3 00/11] KVM: x86/mmu: refine memtype related mmu zap Sean Christopherson
  11 siblings, 0 replies; 37+ messages in thread
From: Yan Zhao @ 2023-06-16  2:41 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: pbonzini, seanjc, chao.gao, kai.huang, robert.hoo.linux, Yan Zhao

Find out guest MTRR ranges of non-default type and only zap those ranges
when guest MTRRs are enabled and the MTRR default type equals the memtype
of CR0.CD=1.

This allows fine-grained and faster gfn zaps because of precise and
shorter zap ranges, and increases the chances for concurrent vCPUs to
find existing ranges to zap in the zap list.
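
A hypothetical example of the resulting behavior:

	/*
	 * With KVM_X86_QUIRK_CD_NW_CLEARED, the CR0.CD=1 memtype is WB
	 * without IPAT.  If the guest's MTRR default type is also WB,
	 * only the fixed/variable MTRR ranges whose type != WB are
	 * queued for zapping; if the default type were e.g. UC,
	 * everything would be zapped instead.
	 */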

Incidentally, fix a typo in the original comment.

Suggested-by: Sean Christopherson <seanjc@google.com>
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
 arch/x86/kvm/mtrr.c | 108 +++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 106 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/mtrr.c b/arch/x86/kvm/mtrr.c
index 688748e3a4d2..e2a097822a62 100644
--- a/arch/x86/kvm/mtrr.c
+++ b/arch/x86/kvm/mtrr.c
@@ -179,7 +179,7 @@ static struct fixed_mtrr_segment fixed_seg_table[] = {
 	{
 		.start = 0xc0000,
 		.end = 0x100000,
-		.range_shift = 12, /* 12K */
+		.range_shift = 12, /* 4K */
 		.range_start = 24,
 	}
 };
@@ -816,6 +816,67 @@ static void kvm_add_mtrr_zap_list(struct kvm *kvm, struct mtrr_zap_range *range)
 	spin_unlock(&kvm->arch.mtrr_zap_list_lock);
 }
 
+/*
+ * Fixed ranges are only 256 pages in total.
+ * After balancing between reducing overhead of zap multiple ranges
+ * and increasing chances of finding duplicated ranges,
+ * just add fixed mtrr ranges as a whole to the mtrr zap list
+ * if memory type of one of them is not the specified type.
+ */
+static int prepare_zaplist_fixed_mtrr_of_non_type(struct kvm_vcpu *vcpu, u8 type)
+{
+	struct kvm_mtrr *mtrr_state = &vcpu->arch.mtrr_state;
+	struct mtrr_zap_range *range;
+	int index, seg_end;
+	u8 mem_type;
+
+	for (index = 0; index < KVM_NR_FIXED_MTRR_REGION; index++) {
+		mem_type = mtrr_state->fixed_ranges[index];
+
+		if (mem_type == type)
+			continue;
+
+		range = kmalloc(sizeof(*range), GFP_KERNEL_ACCOUNT);
+		if (!range)
+			return -ENOMEM;
+
+		seg_end = ARRAY_SIZE(fixed_seg_table) - 1;
+		range->start = gpa_to_gfn(fixed_seg_table[0].start);
+		range->end = gpa_to_gfn(fixed_seg_table[seg_end].end);
+		kvm_add_mtrr_zap_list(vcpu->kvm, range);
+		break;
+	}
+	return 0;
+}
+
+/*
+ * Add var mtrr ranges to the mtrr zap list
+ * if its memory type does not equal to type
+ */
+static int prepare_zaplist_var_mtrr_of_non_type(struct kvm_vcpu *vcpu, u8 type)
+{
+	struct kvm_mtrr *mtrr_state = &vcpu->arch.mtrr_state;
+	struct mtrr_zap_range *range;
+	struct kvm_mtrr_range *tmp;
+	u8 mem_type;
+
+	list_for_each_entry(tmp, &mtrr_state->head, node) {
+		mem_type = tmp->base & 0xff;
+		if (mem_type == type)
+			continue;
+
+		range = kmalloc(sizeof(*range), GFP_KERNEL_ACCOUNT);
+		if (!range)
+			return -ENOMEM;
+
+		var_mtrr_range(tmp, &range->start, &range->end);
+		range->start = gpa_to_gfn(range->start);
+		range->end = gpa_to_gfn(range->end);
+		kvm_add_mtrr_zap_list(vcpu->kvm, range);
+	}
+	return 0;
+}
+
 static void kvm_zap_mtrr_zap_list(struct kvm *kvm)
 {
 	struct list_head *head = &kvm->arch.mtrr_zap_list;
@@ -875,7 +936,50 @@ static void kvm_mtrr_zap_gfn_range(struct kvm_vcpu *vcpu,
 	kvm_zap_gfn_range(vcpu->kvm, gfn_start, gfn_end);
 }
 
+/*
+ * Zap GFN ranges when CR0.CD toggles between 0 and 1.
+ * With noncoherent DMA present,
+ * when CR0.CD=1, TDP memtype is WB or UC + IPAT;
+ * when CR0.CD=0, TDP memtype is determined by guest MTRR.
+ * Therefore, if the cache disabled memtype is different from default memtype
+ * in guest MTRR, everything is zapped;
+ * if the cache disabled memtype is equal to default memtype in guest MTRR,
+ * only MTRR ranges of non-default-memtype are required to be zapped.
+ */
 void kvm_zap_gfn_range_on_cd_toggle(struct kvm_vcpu *vcpu)
 {
-	return kvm_mtrr_zap_gfn_range(vcpu, gpa_to_gfn(0), gpa_to_gfn(~0ULL));
+	struct kvm_mtrr *mtrr_state = &vcpu->arch.mtrr_state;
+	bool mtrr_enabled = mtrr_is_enabled(mtrr_state);
+	u8 default_type;
+	u8 cd_type;
+	bool ipat;
+
+	kvm_mtrr_get_cd_memory_type(vcpu, &cd_type, &ipat);
+
+	default_type = mtrr_enabled ? mtrr_default_type(mtrr_state) :
+		       mtrr_disabled_type(vcpu);
+
+	if (cd_type != default_type || ipat)
+		return kvm_mtrr_zap_gfn_range(vcpu, gpa_to_gfn(0), gpa_to_gfn(~0ULL));
+
+	/*
+	 * If mtrr is not enabled, it will go to zap all above if the default
+	 * type does not equal to cd_type;
+	 * Or it has no need to zap if the default type equals to cd_type.
+	 */
+	if (mtrr_enabled) {
+		if (prepare_zaplist_fixed_mtrr_of_non_type(vcpu, default_type))
+			goto fail;
+
+		if (prepare_zaplist_var_mtrr_of_non_type(vcpu, default_type))
+			goto fail;
+
+		kvm_zap_or_wait_mtrr_zap_list(vcpu->kvm);
+	}
+	return;
+fail:
+	kvm_clear_mtrr_zap_list(vcpu->kvm);
+	/* resort to zapping all on failure */
+	kvm_zap_gfn_range(vcpu->kvm, 0, ~0ULL);
+	return;
 }
-- 
2.17.1



* [PATCH v3 11/11] KVM: x86/mmu: split a single gfn zap range when guest MTRRs are honored
  2023-06-16  2:31 [PATCH v3 00/11] KVM: x86/mmu: refine memtype related mmu zap Yan Zhao
                   ` (9 preceding siblings ...)
  2023-06-16  2:41 ` [PATCH v3 10/11] KVM: x86/mmu: fine-grained gfn zap " Yan Zhao
@ 2023-06-16  2:42 ` Yan Zhao
  2023-06-28 23:02 ` [PATCH v3 00/11] KVM: x86/mmu: refine memtype related mmu zap Sean Christopherson
  11 siblings, 0 replies; 37+ messages in thread
From: Yan Zhao @ 2023-06-16  2:42 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: pbonzini, seanjc, chao.gao, kai.huang, robert.hoo.linux, Yan Zhao

Split a single gfn zap range (specifically the range [0, ~0UL)) into
smaller ranges according to the current memslot layout when guest MTRRs
are honored.

Though vCPUs have been serialized to perform kvm_zap_gfn_range() for MTRR
updates and CR0.CD toggles, the rescheduling cost caused by contention is
still huge when concurrent page faults take read locks on kvm->mmu_lock.

Splitting a single huge zap range according to the actual memslot layout
can reduce unnecessary traversal and scheduling cost in the TDP MMU.
It can also increase the chances for larger ranges to find existing
ranges to zap in the zap list.
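
A hypothetical example, assuming two memslots:

	/*
	 * With memslots covering GFNs [0, 0x80000) and
	 * [0x100000, 0x140000), the single request [0, ~0UL) is queued
	 * as two ranges, {0, 0x80000} and {0x100000, 0x140000}, instead
	 * of one range spanning the whole address space.
	 */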

Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
 arch/x86/kvm/mtrr.c | 31 +++++++++++++++++++++++++------
 1 file changed, 25 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/mtrr.c b/arch/x86/kvm/mtrr.c
index e2a097822a62..b83abd14ccb1 100644
--- a/arch/x86/kvm/mtrr.c
+++ b/arch/x86/kvm/mtrr.c
@@ -917,21 +917,40 @@ static void kvm_zap_or_wait_mtrr_zap_list(struct kvm *kvm)
 static void kvm_mtrr_zap_gfn_range(struct kvm_vcpu *vcpu,
 				   gfn_t gfn_start, gfn_t gfn_end)
 {
+	int idx = srcu_read_lock(&vcpu->kvm->srcu);
+	const struct kvm_memory_slot *memslot;
 	struct mtrr_zap_range *range;
+	struct kvm_memslot_iter iter;
+	struct kvm_memslots *slots;
+	gfn_t start, end;
+	int i;
+
+	for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
+		slots = __kvm_memslots(vcpu->kvm, i);
+		kvm_for_each_memslot_in_gfn_range(&iter, slots, gfn_start, gfn_end) {
+			memslot = iter.slot;
+			start = max(gfn_start, memslot->base_gfn);
+			end = min(gfn_end, memslot->base_gfn + memslot->npages);
+			if (WARN_ON_ONCE(start >= end))
+				continue;
 
-	range = kmalloc(sizeof(*range), GFP_KERNEL_ACCOUNT);
-	if (!range)
-		goto fail;
+			range = kmalloc(sizeof(*range), GFP_KERNEL_ACCOUNT);
+			if (!range)
+				goto fail;
 
-	range->start = gfn_start;
-	range->end = gfn_end;
+			range->start = start;
+			range->end = end;
 
-	kvm_add_mtrr_zap_list(vcpu->kvm, range);
+			kvm_add_mtrr_zap_list(vcpu->kvm, range);
+		}
+	}
+	srcu_read_unlock(&vcpu->kvm->srcu, idx);
 
 	kvm_zap_or_wait_mtrr_zap_list(vcpu->kvm);
 	return;
 
 fail:
+	srcu_read_unlock(&vcpu->kvm->srcu, idx);
 	kvm_clear_mtrr_zap_list(vcpu->kvm);
 	kvm_zap_gfn_range(vcpu->kvm, gfn_start, gfn_end);
 }
-- 
2.17.1



* Re: [PATCH v3 09/11] KVM: x86/mmu: serialize vCPUs to zap gfn when guest MTRRs are honored
  2023-06-16  7:45   ` Yuan Yao
@ 2023-06-16  7:37     ` Yan Zhao
  2023-06-16  8:09       ` Yuan Yao
  0 siblings, 1 reply; 37+ messages in thread
From: Yan Zhao @ 2023-06-16  7:37 UTC (permalink / raw)
  To: Yuan Yao
  Cc: kvm, linux-kernel, pbonzini, seanjc, chao.gao, kai.huang,
	robert.hoo.linux

On Fri, Jun 16, 2023 at 03:45:50PM +0800, Yuan Yao wrote:
> > +/*
> > + * Add @range into kvm->arch.mtrr_zap_list and sort the list in
> > + * "length" ascending + "start" descending order, so that
> > + * ranges consuming more zap cycles can be dequeued later and their
> > + * chances of being found duplicated are increased.
> > + */
> > +static void kvm_add_mtrr_zap_list(struct kvm *kvm, struct mtrr_zap_range *range)
> > +{
> > +	struct list_head *head = &kvm->arch.mtrr_zap_list;
> > +	u64 len = range->end - range->start;
> > +	struct mtrr_zap_range *cur, *n;
> > +	bool added = false;
> > +
> > +	spin_lock(&kvm->arch.mtrr_zap_list_lock);
> > +
> > +	if (list_empty(head)) {
> > +		list_add(&range->node, head);
> > +		spin_unlock(&kvm->arch.mtrr_zap_list_lock);
> > +		return;
> > +	}
> > +
> > +	list_for_each_entry_safe(cur, n, head, node) {
> > +		u64 cur_len = cur->end - cur->start;
> > +
> > +		if (len < cur_len)
> > +			break;
> > +
> > +		if (len > cur_len)
> > +			continue;
> > +
> > +		if (range->start > cur->start)
> > +			break;
> > +
> > +		if (range->start < cur->start)
> > +			continue;
> > +
> > +		/* equal len & start, no need to add */
> > +		added = true;
> 
> Possible/worth to ignore the range already covered
> by queued range ?

I may not get you correctly, but
the "added" here means a queued range with exactly the same start + len
was found, so we free and drop adding the new range here.

> 
> > +		kfree(range);
> > +		break;
> > +	}
> > +
> > +	if (!added)
> > +		list_add_tail(&range->node, &cur->node);
> > +
> > +	spin_unlock(&kvm->arch.mtrr_zap_list_lock);
> > +}
> > +
> > +static void kvm_zap_mtrr_zap_list(struct kvm *kvm)
> > +{
> > +	struct list_head *head = &kvm->arch.mtrr_zap_list;
> > +	struct mtrr_zap_range *cur = NULL;
> > +
> > +	spin_lock(&kvm->arch.mtrr_zap_list_lock);
> > +
> > +	while (!list_empty(head)) {
> > +		u64 start, end;
> > +
> > +		cur = list_first_entry(head, typeof(*cur), node);
> > +		start = cur->start;
> > +		end = cur->end;
> > +		list_del(&cur->node);
> > +		kfree(cur);
> > +		spin_unlock(&kvm->arch.mtrr_zap_list_lock);
> > +
> > +		kvm_zap_gfn_range(kvm, start, end);
> > +
> > +		spin_lock(&kvm->arch.mtrr_zap_list_lock);
> > +	}
> > +
> > +	spin_unlock(&kvm->arch.mtrr_zap_list_lock);
> > +}
> > +
> > +static void kvm_zap_or_wait_mtrr_zap_list(struct kvm *kvm)
> > +{
> > +	if (atomic_cmpxchg_acquire(&kvm->arch.mtrr_zapping, 0, 1) == 0) {
> > +		kvm_zap_mtrr_zap_list(kvm);
> > +		atomic_set_release(&kvm->arch.mtrr_zapping, 0);
> > +		return;
> > +	}
> > +
> > +	while (atomic_read(&kvm->arch.mtrr_zapping))
> > +		cpu_relax();
> > +}
> > +
> > +static void kvm_mtrr_zap_gfn_range(struct kvm_vcpu *vcpu,
> > +				   gfn_t gfn_start, gfn_t gfn_end)
> > +{
> > +	struct mtrr_zap_range *range;
> > +
> > +	range = kmalloc(sizeof(*range), GFP_KERNEL_ACCOUNT);
> > +	if (!range)
> > +		goto fail;
> > +
> > +	range->start = gfn_start;
> > +	range->end = gfn_end;
> > +
> > +	kvm_add_mtrr_zap_list(vcpu->kvm, range);
> > +
> > +	kvm_zap_or_wait_mtrr_zap_list(vcpu->kvm);
> > +	return;
> > +
> > +fail:
> > +	kvm_clear_mtrr_zap_list(vcpu->kvm);
> A very small chance of a race condition that incorrectly
> clears queued ranges which have not been zapped by another thread?
> Like below:
> 
> Thread A                         |  Thread B
> kvm_add_mtrr_zap_list()          |
>                                  |  kvm_clear_mtrr_zap_list()
> kvm_zap_or_wait_mtrr_zap_list()  |
> 
> Calling kvm_clear_mtrr_zap_list() here looks unnecessary; other
> threads (B here) who put things in the queue will take care of them.

> > +   kvm_zap_gfn_range(vcpu->kvm, gfn_start, gfn_end); 

Yes, if gfn_start and gfn_end here are not 0 and ~0ULL, the
kvm_clear_mtrr_zap_list() is not necessary.
Though in reality, they are always 0-~0ULL, I agree dropping the
kvm_clear_mtrr_zap_list() here is better.

Thanks!


* Re: [PATCH v3 09/11] KVM: x86/mmu: serialize vCPUs to zap gfn when guest MTRRs are honored
  2023-06-16  2:39 ` [PATCH v3 09/11] KVM: x86/mmu: serialize vCPUs to zap gfn when guest MTRRs are honored Yan Zhao
@ 2023-06-16  7:45   ` Yuan Yao
  2023-06-16  7:37     ` Yan Zhao
  2023-06-28 23:00   ` Sean Christopherson
  1 sibling, 1 reply; 37+ messages in thread
From: Yuan Yao @ 2023-06-16  7:45 UTC (permalink / raw)
  To: Yan Zhao
  Cc: kvm, linux-kernel, pbonzini, seanjc, chao.gao, kai.huang,
	robert.hoo.linux

On Fri, Jun 16, 2023 at 10:39:45AM +0800, Yan Zhao wrote:
> Serialize concurrent and repeated calls of kvm_zap_gfn_range() from every
> vCPU for CR0.CD toggles and MTRR updates when guest MTRRs are honored.
>
> During guest boot-up, if guest MTRRs are honored by TDP, TDP zaps are
> triggered several times by each vCPU for CR0.CD toggles and MTRR updates.
> This takes unexpectedly long CPU cycles because of contention on
> kvm->mmu_lock.
>
> Therefore, introduce a mtrr_zap_list to remove duplicated zaps and an
> atomic mtrr_zapping to allow only one vCPU to do the real zap work at a
> time.
>
> Suggested-by: Sean Christopherson <seanjc@google.com>
> Co-developed-by: Sean Christopherson <seanjc@google.com>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> ---
>  arch/x86/include/asm/kvm_host.h |   4 +
>  arch/x86/kvm/mtrr.c             | 141 +++++++++++++++++++++++++++++++-
>  arch/x86/kvm/x86.c              |   2 +-
>  arch/x86/kvm/x86.h              |   1 +
>  4 files changed, 146 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 28bd38303d70..8da1517a1513 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1444,6 +1444,10 @@ struct kvm_arch {
>  	 */
>  #define SPLIT_DESC_CACHE_MIN_NR_OBJECTS (SPTE_ENT_PER_PAGE + 1)
>  	struct kvm_mmu_memory_cache split_desc_cache;
> +
> +	struct list_head mtrr_zap_list;
> +	spinlock_t mtrr_zap_list_lock;
> +	atomic_t mtrr_zapping;
>  };
>
>  struct kvm_vm_stat {
> diff --git a/arch/x86/kvm/mtrr.c b/arch/x86/kvm/mtrr.c
> index b35dd0bc9cad..688748e3a4d2 100644
> --- a/arch/x86/kvm/mtrr.c
> +++ b/arch/x86/kvm/mtrr.c
> @@ -25,6 +25,8 @@
>  #define IA32_MTRR_DEF_TYPE_FE		(1ULL << 10)
>  #define IA32_MTRR_DEF_TYPE_TYPE_MASK	(0xff)
>
> +static void kvm_mtrr_zap_gfn_range(struct kvm_vcpu *vcpu,
> +				   gfn_t gfn_start, gfn_t gfn_end);
>  static bool is_mtrr_base_msr(unsigned int msr)
>  {
>  	/* MTRR base MSRs use even numbers, masks use odd numbers. */
> @@ -341,7 +343,7 @@ static void update_mtrr(struct kvm_vcpu *vcpu, u32 msr)
>  		var_mtrr_range(var_mtrr_msr_to_range(vcpu, msr), &start, &end);
>  	}
>
> -	kvm_zap_gfn_range(vcpu->kvm, gpa_to_gfn(start), gpa_to_gfn(end));
> +	kvm_mtrr_zap_gfn_range(vcpu, gpa_to_gfn(start), gpa_to_gfn(end));
>  }
>
>  static bool var_mtrr_range_is_valid(struct kvm_mtrr_range *range)
> @@ -437,6 +439,11 @@ int kvm_mtrr_get_msr(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata)
>  void kvm_vcpu_mtrr_init(struct kvm_vcpu *vcpu)
>  {
>  	INIT_LIST_HEAD(&vcpu->arch.mtrr_state.head);
> +
> +	if (vcpu->vcpu_id == 0) {
> +		spin_lock_init(&vcpu->kvm->arch.mtrr_zap_list_lock);
> +		INIT_LIST_HEAD(&vcpu->kvm->arch.mtrr_zap_list);
> +	}
>  }
>
>  struct mtrr_iter {
> @@ -740,3 +747,135 @@ void kvm_mtrr_get_cd_memory_type(struct kvm_vcpu *vcpu, u8 *type, bool *ipat)
>  	}
>  }
>  EXPORT_SYMBOL_GPL(kvm_mtrr_get_cd_memory_type);
> +
> +struct mtrr_zap_range {
> +	gfn_t start;
> +	/* end is exclusive */
> +	gfn_t end;
> +	struct list_head node;
> +};
> +
> +static void kvm_clear_mtrr_zap_list(struct kvm *kvm)
> +{
> +	struct list_head *head = &kvm->arch.mtrr_zap_list;
> +	struct mtrr_zap_range *tmp, *n;
> +
> +	spin_lock(&kvm->arch.mtrr_zap_list_lock);
> +	list_for_each_entry_safe(tmp, n, head, node) {
> +		list_del(&tmp->node);
> +		kfree(tmp);
> +	}
> +	spin_unlock(&kvm->arch.mtrr_zap_list_lock);
> +}
> +
> +/*
> + * Add @range into kvm->arch.mtrr_zap_list and sort the list in
> + * "length" ascending + "start" descending order, so that
> + * ranges consuming more zap cycles can be dequeued later and their
> + * chances of being found duplicated are increased.
> + */
> +static void kvm_add_mtrr_zap_list(struct kvm *kvm, struct mtrr_zap_range *range)
> +{
> +	struct list_head *head = &kvm->arch.mtrr_zap_list;
> +	u64 len = range->end - range->start;
> +	struct mtrr_zap_range *cur, *n;
> +	bool added = false;
> +
> +	spin_lock(&kvm->arch.mtrr_zap_list_lock);
> +
> +	if (list_empty(head)) {
> +		list_add(&range->node, head);
> +		spin_unlock(&kvm->arch.mtrr_zap_list_lock);
> +		return;
> +	}
> +
> +	list_for_each_entry_safe(cur, n, head, node) {
> +		u64 cur_len = cur->end - cur->start;
> +
> +		if (len < cur_len)
> +			break;
> +
> +		if (len > cur_len)
> +			continue;
> +
> +		if (range->start > cur->start)
> +			break;
> +
> +		if (range->start < cur->start)
> +			continue;
> +
> +		/* equal len & start, no need to add */
> +		added = true;

Possible/worth to ignore the range already covered
by queued range ?

> +		kfree(range);
> +		break;
> +	}
> +
> +	if (!added)
> +		list_add_tail(&range->node, &cur->node);
> +
> +	spin_unlock(&kvm->arch.mtrr_zap_list_lock);
> +}
> +
> +static void kvm_zap_mtrr_zap_list(struct kvm *kvm)
> +{
> +	struct list_head *head = &kvm->arch.mtrr_zap_list;
> +	struct mtrr_zap_range *cur = NULL;
> +
> +	spin_lock(&kvm->arch.mtrr_zap_list_lock);
> +
> +	while (!list_empty(head)) {
> +		u64 start, end;
> +
> +		cur = list_first_entry(head, typeof(*cur), node);
> +		start = cur->start;
> +		end = cur->end;
> +		list_del(&cur->node);
> +		kfree(cur);
> +		spin_unlock(&kvm->arch.mtrr_zap_list_lock);
> +
> +		kvm_zap_gfn_range(kvm, start, end);
> +
> +		spin_lock(&kvm->arch.mtrr_zap_list_lock);
> +	}
> +
> +	spin_unlock(&kvm->arch.mtrr_zap_list_lock);
> +}
> +
> +static void kvm_zap_or_wait_mtrr_zap_list(struct kvm *kvm)
> +{
> +	if (atomic_cmpxchg_acquire(&kvm->arch.mtrr_zapping, 0, 1) == 0) {
> +		kvm_zap_mtrr_zap_list(kvm);
> +		atomic_set_release(&kvm->arch.mtrr_zapping, 0);
> +		return;
> +	}
> +
> +	while (atomic_read(&kvm->arch.mtrr_zapping))
> +		cpu_relax();
> +}
> +
> +static void kvm_mtrr_zap_gfn_range(struct kvm_vcpu *vcpu,
> +				   gfn_t gfn_start, gfn_t gfn_end)
> +{
> +	struct mtrr_zap_range *range;
> +
> +	range = kmalloc(sizeof(*range), GFP_KERNEL_ACCOUNT);
> +	if (!range)
> +		goto fail;
> +
> +	range->start = gfn_start;
> +	range->end = gfn_end;
> +
> +	kvm_add_mtrr_zap_list(vcpu->kvm, range);
> +
> +	kvm_zap_or_wait_mtrr_zap_list(vcpu->kvm);
> +	return;
> +
> +fail:
> +	kvm_clear_mtrr_zap_list(vcpu->kvm);

A very small chance of a race condition that incorrectly
clears queued ranges which have not been zapped by another thread?
Like below:

Thread A                         |  Thread B
kvm_add_mtrr_zap_list()          |
                                 |  kvm_clear_mtrr_zap_list()
kvm_zap_or_wait_mtrr_zap_list()  |

Calling kvm_clear_mtrr_zap_list() here looks unnecessary; other
threads (B here) who put things in the queue will take care of them.

> +	kvm_zap_gfn_range(vcpu->kvm, gfn_start, gfn_end);
> +}
> +
> +void kvm_zap_gfn_range_on_cd_toggle(struct kvm_vcpu *vcpu)
> +{
> +	return kvm_mtrr_zap_gfn_range(vcpu, gpa_to_gfn(0), gpa_to_gfn(~0ULL));
> +}
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 32cc8bfaa5f1..74aac14a3c0b 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -943,7 +943,7 @@ void kvm_post_set_cr0(struct kvm_vcpu *vcpu, unsigned long old_cr0, unsigned lon
>
>  	if (((cr0 ^ old_cr0) & X86_CR0_CD) &&
>  	    kvm_mmu_honors_guest_mtrrs(vcpu->kvm))
> -		kvm_zap_gfn_range(vcpu->kvm, 0, ~0ULL);

>  }
>  EXPORT_SYMBOL_GPL(kvm_post_set_cr0);
>
> diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
> index 9781b4b32d68..be946aba2bf0 100644
> --- a/arch/x86/kvm/x86.h
> +++ b/arch/x86/kvm/x86.h
> @@ -314,6 +314,7 @@ int kvm_mtrr_get_msr(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata);
>  bool kvm_mtrr_check_gfn_range_consistency(struct kvm_vcpu *vcpu, gfn_t gfn,
>  					  int page_num);
>  void kvm_mtrr_get_cd_memory_type(struct kvm_vcpu *vcpu, u8 *type, bool *ipat);
> +void kvm_zap_gfn_range_on_cd_toggle(struct kvm_vcpu *vcpu);
>  bool kvm_vector_hashing_enabled(void);
>  void kvm_fixup_and_inject_pf_error(struct kvm_vcpu *vcpu, gva_t gva, u16 error_code);
>  int x86_decode_emulated_instruction(struct kvm_vcpu *vcpu, int emulation_type,
> --
> 2.17.1
>


* Re: [PATCH v3 09/11] KVM: x86/mmu: serialize vCPUs to zap gfn when guest MTRRs are honored
  2023-06-16  8:09       ` Yuan Yao
@ 2023-06-16  7:50         ` Yan Zhao
  0 siblings, 0 replies; 37+ messages in thread
From: Yan Zhao @ 2023-06-16  7:50 UTC (permalink / raw)
  To: Yuan Yao
  Cc: kvm, linux-kernel, pbonzini, seanjc, chao.gao, kai.huang,
	robert.hoo.linux

On Fri, Jun 16, 2023 at 04:09:17PM +0800, Yuan Yao wrote:
> On Fri, Jun 16, 2023 at 03:37:29PM +0800, Yan Zhao wrote:
> > On Fri, Jun 16, 2023 at 03:45:50PM +0800, Yuan Yao wrote:
> > > > +/*
> > > > + * Add @range into kvm->arch.mtrr_zap_list and sort the list in
> > > > + * "length" ascending + "start" descending order, so that
> > > > + * ranges consuming more zap cycles can be dequeued later and their
> > > > + * chances of being found duplicated are increased.
> > > > + */
> > > > +static void kvm_add_mtrr_zap_list(struct kvm *kvm, struct mtrr_zap_range *range)
> > > > +{
> > > > +	struct list_head *head = &kvm->arch.mtrr_zap_list;
> > > > +	u64 len = range->end - range->start;
> > > > +	struct mtrr_zap_range *cur, *n;
> > > > +	bool added = false;
> > > > +
> > > > +	spin_lock(&kvm->arch.mtrr_zap_list_lock);
> > > > +
> > > > +	if (list_empty(head)) {
> > > > +		list_add(&range->node, head);
> > > > +		spin_unlock(&kvm->arch.mtrr_zap_list_lock);
> > > > +		return;
> > > > +	}
> > > > +
> > > > +	list_for_each_entry_safe(cur, n, head, node) {
> > > > +		u64 cur_len = cur->end - cur->start;
> > > > +
> > > > +		if (len < cur_len)
> > > > +			break;
> > > > +
> > > > +		if (len > cur_len)
> > > > +			continue;
> > > > +
> > > > +		if (range->start > cur->start)
> > > > +			break;
> > > > +
> > > > +		if (range->start < cur->start)
> > > > +			continue;
> > > > +
> > > > +		/* equal len & start, no need to add */
> > > > +		added = true;
> > >
> > > Is it possible/worth it to ignore a range already covered
> > > by a queued range?
> >
> > I may not get you correctly, but
> > the "added" here means a queued range with exactly the same start + len was
> > found, so the new range is freed and not added.
> 
> I mean drop adding the three Bs below if A is already in the queue:
> 
> |------A--------|
> |----B----|
> 
> |------A--------|
>       |----B----|
> 
> |------A--------|
>   |----B----|
> 
Oh, I implemented it that way in my first version.
But it complicates the logic and increases the time spent holding the spinlock.
And since the zaps caused by MTRR updates and CR0.CD toggles usually queue
ranges that are duplicated across vCPUs and non-overlapping within one vCPU
(e.g., on a CR0.CD toggle every vCPU queues the identical full 0 - ~0ULL
range, so only the first insertion survives and the rest hit the equal
len & start case and are freed), I turned to this simpler implementation
in the end.
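
For reference, a rough sketch of the coverage check being discussed
(hypothetical, deliberately not what this series does).  Note that with the
"length ascending" sort, a longer covering range sorts after a shorter new
one, so the insertion scan could no longer break early:

	/* Hypothetical: drop @range if an already-queued range covers it. */
	list_for_each_entry(cur, head, node) {
		if (range->start >= cur->start && range->end <= cur->end) {
			kfree(range);
			added = true;
			break;
		}
	}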

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v3 09/11] KVM: x86/mmu: serialize vCPUs to zap gfn when guest MTRRs are honored
  2023-06-16  7:37     ` Yan Zhao
@ 2023-06-16  8:09       ` Yuan Yao
  2023-06-16  7:50         ` Yan Zhao
  0 siblings, 1 reply; 37+ messages in thread
From: Yuan Yao @ 2023-06-16  8:09 UTC (permalink / raw)
  To: Yan Zhao
  Cc: kvm, linux-kernel, pbonzini, seanjc, chao.gao, kai.huang,
	robert.hoo.linux

On Fri, Jun 16, 2023 at 03:37:29PM +0800, Yan Zhao wrote:
> On Fri, Jun 16, 2023 at 03:45:50PM +0800, Yuan Yao wrote:
> > > +/*
> > > + * Add @range into kvm->arch.mtrr_zap_list and sort the list in
> > > + * "length" ascending + "start" descending order, so that
> > > + * ranges consuming more zap cycles can be dequeued later and their
> > > + * chances of being found duplicated are increased.
> > > + */
> > > +static void kvm_add_mtrr_zap_list(struct kvm *kvm, struct mtrr_zap_range *range)
> > > +{
> > > +	struct list_head *head = &kvm->arch.mtrr_zap_list;
> > > +	u64 len = range->end - range->start;
> > > +	struct mtrr_zap_range *cur, *n;
> > > +	bool added = false;
> > > +
> > > +	spin_lock(&kvm->arch.mtrr_zap_list_lock);
> > > +
> > > +	if (list_empty(head)) {
> > > +		list_add(&range->node, head);
> > > +		spin_unlock(&kvm->arch.mtrr_zap_list_lock);
> > > +		return;
> > > +	}
> > > +
> > > +	list_for_each_entry_safe(cur, n, head, node) {
> > > +		u64 cur_len = cur->end - cur->start;
> > > +
> > > +		if (len < cur_len)
> > > +			break;
> > > +
> > > +		if (len > cur_len)
> > > +			continue;
> > > +
> > > +		if (range->start > cur->start)
> > > +			break;
> > > +
> > > +		if (range->start < cur->start)
> > > +			continue;
> > > +
> > > +		/* equal len & start, no need to add */
> > > +		added = true;
> >
> > Is it possible/worth it to ignore a range already covered
> > by a queued range?
>
> I may not get you correctly, but
> the "added" here means a queued range with exactly the same start + len was
> found, so the new range is freed and not added.

I mean drop adding the three Bs below if A is already in the queue:

|------A--------|
|----B----|

|------A--------|
      |----B----|

|------A--------|
  |----B----|

>
> >
> > > +		kfree(range);
> > > +		break;
> > > +	}
> > > +
> > > +	if (!added)
> > > +		list_add_tail(&range->node, &cur->node);
> > > +
> > > +	spin_unlock(&kvm->arch.mtrr_zap_list_lock);
> > > +}
> > > +
> > > +static void kvm_zap_mtrr_zap_list(struct kvm *kvm)
> > > +{
> > > +	struct list_head *head = &kvm->arch.mtrr_zap_list;
> > > +	struct mtrr_zap_range *cur = NULL;
> > > +
> > > +	spin_lock(&kvm->arch.mtrr_zap_list_lock);
> > > +
> > > +	while (!list_empty(head)) {
> > > +		u64 start, end;
> > > +
> > > +		cur = list_first_entry(head, typeof(*cur), node);
> > > +		start = cur->start;
> > > +		end = cur->end;
> > > +		list_del(&cur->node);
> > > +		kfree(cur);
> > > +		spin_unlock(&kvm->arch.mtrr_zap_list_lock);
> > > +
> > > +		kvm_zap_gfn_range(kvm, start, end);
> > > +
> > > +		spin_lock(&kvm->arch.mtrr_zap_list_lock);
> > > +	}
> > > +
> > > +	spin_unlock(&kvm->arch.mtrr_zap_list_lock);
> > > +}
> > > +
> > > +static void kvm_zap_or_wait_mtrr_zap_list(struct kvm *kvm)
> > > +{
> > > +	if (atomic_cmpxchg_acquire(&kvm->arch.mtrr_zapping, 0, 1) == 0) {
> > > +		kvm_zap_mtrr_zap_list(kvm);
> > > +		atomic_set_release(&kvm->arch.mtrr_zapping, 0);
> > > +		return;
> > > +	}
> > > +
> > > +	while (atomic_read(&kvm->arch.mtrr_zapping))
> > > +		cpu_relax();
> > > +}
> > > +
> > > +static void kvm_mtrr_zap_gfn_range(struct kvm_vcpu *vcpu,
> > > +				   gfn_t gfn_start, gfn_t gfn_end)
> > > +{
> > > +	struct mtrr_zap_range *range;
> > > +
> > > +	range = kmalloc(sizeof(*range), GFP_KERNEL_ACCOUNT);
> > > +	if (!range)
> > > +		goto fail;
> > > +
> > > +	range->start = gfn_start;
> > > +	range->end = gfn_end;
> > > +
> > > +	kvm_add_mtrr_zap_list(vcpu->kvm, range);
> > > +
> > > +	kvm_zap_or_wait_mtrr_zap_list(vcpu->kvm);
> > > +	return;
> > > +
> > > +fail:
> > > +	kvm_clear_mtrr_zap_list(vcpu->kvm);
> > Isn't there a very-small-chance race condition that incorrectly
> > clears queued ranges which have not yet been zapped by another thread?
> > Like below:
> >
> > Thread A                         |  Thread B
> > kvm_add_mtrr_zap_list()          |
> >                                  |  kvm_clear_mtrr_zap_list()
> > kvm_zap_or_wait_mtrr_zap_list()  |
> >
> > Calling kvm_clear_mtrr_zap_list() here looks unnecessary; other
> > threads (B here) who put things in the queue will take care of them.
>
> > > +   kvm_zap_gfn_range(vcpu->kvm, gfn_start, gfn_end);
>
> Yes, if gfn_start and gfn_end here are not 0 and ~0ULL, the
> kvm_clear_mtrr_zap_list() is not necessary.
> Though in reality, they are always 0-~0ULL, I agree dropping the
> kvm_clear_mtrr_zap_list() here is better.
>
> Thanks!
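
I.e., the failure path would then collapse to something like (sketch only):

	fail:
		/*
		 * Allocation failed: synchronously zap just this range.
		 * Ranges queued by other vCPUs stay on the list and are
		 * drained by whoever queued them.
		 */
		kvm_zap_gfn_range(vcpu->kvm, gfn_start, gfn_end);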

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v3 07/11] KVM: VMX: drop IPAT in memtype when CD=1 for KVM_X86_QUIRK_CD_NW_CLEARED
  2023-06-20  2:42   ` Chao Gao
@ 2023-06-20  2:34     ` Yan Zhao
  2023-06-20  3:34       ` Chao Gao
  2023-06-25  7:14       ` Xiaoyao Li
  2023-06-20  3:17     ` Yan Zhao
  1 sibling, 2 replies; 37+ messages in thread
From: Yan Zhao @ 2023-06-20  2:34 UTC (permalink / raw)
  To: Chao Gao; +Cc: kvm, linux-kernel, pbonzini, seanjc, kai.huang, robert.hoo.linux

On Tue, Jun 20, 2023 at 10:42:57AM +0800, Chao Gao wrote:
> On Fri, Jun 16, 2023 at 10:38:15AM +0800, Yan Zhao wrote:
> >For KVM_X86_QUIRK_CD_NW_CLEARED, remove the ignore PAT bit in EPT memory
> >types when cache is disabled and non-coherent DMA are present.
> >
> >With the quirk KVM_X86_QUIRK_CD_NW_CLEARED, WB + IPAT are returned as the
> >EPT memory type when guest cache is disabled before this patch.
> >Removing the IPAT bit in this patch will allow effective memory type to
> >honor PAT values as well, which will make the effective memory type
> 
> Given guest sets CR0.CD, what's the point of honoring (guest) PAT? e.g.,
> which guests can benefit from this change?
This patch is actually a preparation for later patch 10 to implement
fine-grained zap.
If when CR0.CD=1 the EPT type is WB + IPAT, and
when CR0.CD=0 + mtrr enabled, EPT type is WB or UC or ..., which are
without IPAT, then we have to always zap all EPT entries.

Given removing the IPAT bit when CR0.CD=1 only makes the quirk
KVM_X86_QUIRK_CD_NW_CLEARED more strict (meaning it could be WC/UC... if
the guest PAT overwrites it), it's still acceptable.

On the other hand, from the guest's point of view, it sees the GPA is UC
with CR0.CD=1. So, if we want to align host EPT memtype with guest's
view, we have to drop the quirk KVM_X86_QUIRK_CD_NW_CLEARED, which,
however, would impact guest bootup performance dramatically and is not
done for any VM in practice.

So, we remove the IPAT bit when CR0.CD=1 with the quirk
KVM_X86_QUIRK_CD_NW_CLEARED to make it stricter and to enable later
optimization.
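
To make it concrete, per the SDM's EPT/PAT combining rules when IPAT is
clear, and with the WB from the quirk as the EPT memtype (illustrative
only):

	guest PAT WB  ->  effective WB  (the common case, no change)
	guest PAT UC  ->  effective UC
	guest PAT WC  ->  effective WC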

> 
> >stronger than WB as WB is the weakest memtype. However, this change is
> >acceptable as it doesn't make the memory type weaker,
> 
> >consider without
> >this quirk, the original memtype for cache disabled is UC + IPAT.
> 
> This change impacts only the case where the quirk is enabled. Maybe you
> mean:
> 
> _with_ the quirk, the original memtype for cache disabled is _WB_ + IPAT.
> 
Uh, I mean originally without the quirk, UC + IPAT should be returned,
which is the correct value to return.

I can explain the full background more clearly in the next version.

Thanks

> 
> >
> >Suggested-by: Sean Christopherson <seanjc@google.com>
> >Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> >---
> > arch/x86/kvm/vmx/vmx.c | 9 +++------
> > 1 file changed, 3 insertions(+), 6 deletions(-)
> >
> >diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> >index 0ecf4be2c6af..c1e93678cea4 100644
> >--- a/arch/x86/kvm/vmx/vmx.c
> >+++ b/arch/x86/kvm/vmx/vmx.c
> >@@ -7548,8 +7548,6 @@ static int vmx_vm_init(struct kvm *kvm)
> > 
> > static u8 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
> > {
> >-	u8 cache;
> >-
> > 	/* We wanted to honor guest CD/MTRR/PAT, but doing so could result in
> > 	 * memory aliases with conflicting memory types and sometimes MCEs.
> > 	 * We have to be careful as to what are honored and when.
> >@@ -7576,11 +7574,10 @@ static u8 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
> > 
> > 	if (kvm_read_cr0_bits(vcpu, X86_CR0_CD)) {
> > 		if (kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_CD_NW_CLEARED))
> >-			cache = MTRR_TYPE_WRBACK;
> >+			return MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT;
> 
> Shouldn't KVM honor guest MTRRs as well? i.e., as if CR0.CD isn't set.
> 
> > 		else
> >-			cache = MTRR_TYPE_UNCACHABLE;
> >-
> >-		return (cache << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT;
> >+			return (MTRR_TYPE_UNCACHABLE << VMX_EPT_MT_EPTE_SHIFT) |
> >+				VMX_EPT_IPAT_BIT;
> > 	}
> > 
> > 	return kvm_mtrr_get_guest_memory_type(vcpu, gfn) << VMX_EPT_MT_EPTE_SHIFT;
> >-- 
> >2.17.1
> >

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v3 07/11] KVM: VMX: drop IPAT in memtype when CD=1 for KVM_X86_QUIRK_CD_NW_CLEARED
  2023-06-16  2:38 ` [PATCH v3 07/11] KVM: VMX: drop IPAT in memtype when CD=1 for KVM_X86_QUIRK_CD_NW_CLEARED Yan Zhao
@ 2023-06-20  2:42   ` Chao Gao
  2023-06-20  2:34     ` Yan Zhao
  2023-06-20  3:17     ` Yan Zhao
  0 siblings, 2 replies; 37+ messages in thread
From: Chao Gao @ 2023-06-20  2:42 UTC (permalink / raw)
  To: Yan Zhao; +Cc: kvm, linux-kernel, pbonzini, seanjc, kai.huang, robert.hoo.linux

On Fri, Jun 16, 2023 at 10:38:15AM +0800, Yan Zhao wrote:
>For KVM_X86_QUIRK_CD_NW_CLEARED, remove the ignore PAT bit in EPT memory
>types when cache is disabled and non-coherent DMA are present.
>
>With the quirk KVM_X86_QUIRK_CD_NW_CLEARED, WB + IPAT are returned as the
>EPT memory type when guest cache is disabled before this patch.
>Removing the IPAT bit in this patch will allow effective memory type to
>honor PAT values as well, which will make the effective memory type

Given guest sets CR0.CD, what's the point of honoring (guest) PAT? e.g.,
which guests can benefit from this change?

>stronger than WB as WB is the weakest memtype. However, this change is
>acceptable as it doesn't make the memory type weaker,

>consider without
>this quirk, the original memtype for cache disabled is UC + IPAT.

This change impacts only the case where the quirk is enabled. Maybe you
mean:

_with_ the quirk, the original memtype for cache disabled is _WB_ + IPAT.


>
>Suggested-by: Sean Christopherson <seanjc@google.com>
>Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
>---
> arch/x86/kvm/vmx/vmx.c | 9 +++------
> 1 file changed, 3 insertions(+), 6 deletions(-)
>
>diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
>index 0ecf4be2c6af..c1e93678cea4 100644
>--- a/arch/x86/kvm/vmx/vmx.c
>+++ b/arch/x86/kvm/vmx/vmx.c
>@@ -7548,8 +7548,6 @@ static int vmx_vm_init(struct kvm *kvm)
> 
> static u8 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
> {
>-	u8 cache;
>-
> 	/* We wanted to honor guest CD/MTRR/PAT, but doing so could result in
> 	 * memory aliases with conflicting memory types and sometimes MCEs.
> 	 * We have to be careful as to what are honored and when.
>@@ -7576,11 +7574,10 @@ static u8 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
> 
> 	if (kvm_read_cr0_bits(vcpu, X86_CR0_CD)) {
> 		if (kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_CD_NW_CLEARED))
>-			cache = MTRR_TYPE_WRBACK;
>+			return MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT;

Shouldn't KVM honor guest MTRRs as well? i.e., as if CR0.CD isn't set.

> 		else
>-			cache = MTRR_TYPE_UNCACHABLE;
>-
>-		return (cache << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT;
>+			return (MTRR_TYPE_UNCACHABLE << VMX_EPT_MT_EPTE_SHIFT) |
>+				VMX_EPT_IPAT_BIT;
> 	}
> 
> 	return kvm_mtrr_get_guest_memory_type(vcpu, gfn) << VMX_EPT_MT_EPTE_SHIFT;
>-- 
>2.17.1
>

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v3 07/11] KVM: VMX: drop IPAT in memtype when CD=1 for KVM_X86_QUIRK_CD_NW_CLEARED
  2023-06-20  2:42   ` Chao Gao
  2023-06-20  2:34     ` Yan Zhao
@ 2023-06-20  3:17     ` Yan Zhao
  1 sibling, 0 replies; 37+ messages in thread
From: Yan Zhao @ 2023-06-20  3:17 UTC (permalink / raw)
  To: Chao Gao; +Cc: kvm, linux-kernel, pbonzini, seanjc, kai.huang, robert.hoo.linux

On Tue, Jun 20, 2023 at 10:42:57AM +0800, Chao Gao wrote:
> >
> >diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> >index 0ecf4be2c6af..c1e93678cea4 100644
> >--- a/arch/x86/kvm/vmx/vmx.c
> >+++ b/arch/x86/kvm/vmx/vmx.c
> >@@ -7548,8 +7548,6 @@ static int vmx_vm_init(struct kvm *kvm)
> > 
> > static u8 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
> > {
> >-	u8 cache;
> >-
> > 	/* We wanted to honor guest CD/MTRR/PAT, but doing so could result in
> > 	 * memory aliases with conflicting memory types and sometimes MCEs.
> > 	 * We have to be careful as to what are honored and when.
> >@@ -7576,11 +7574,10 @@ static u8 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
> > 
> > 	if (kvm_read_cr0_bits(vcpu, X86_CR0_CD)) {
> > 		if (kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_CD_NW_CLEARED))
> >-			cache = MTRR_TYPE_WRBACK;
> >+			return MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT;
> 
> Shouldn't KVM honor guest MTRRs as well? i.e., as if CR0.CD isn't set.

I proposed to return guest mtrr type when CR0.CD=1.
https://lore.kernel.org/all/ZHDZpHYQhPtkNnQe@google.com/

Sean rejected it because before guest MTRRs are enabled for the first time,
we have to return WB with the quirk, otherwise the VM bootup will be super
slow.
Or maybe we can return WB when guest MTRRs are not enabled, and the guest
MTRR type when guest MTRRs are enabled, to work around it. e.g.

if (kvm_read_cr0_bits(vcpu, X86_CR0_CD)) {
	if (kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_CD_NW_CLEARED)) {
		cache = !mtrr_is_enabled(mtrr_state) ? MTRR_TYPE_WRBACK :
			kvm_mtrr_get_guest_memory_type(vcpu, gfn);
		return cache << VMX_EPT_MT_EPTE_SHIFT;
	}
	...
}

But Sean does not like piling too much on top of the quirk and I think it is right
to keep the quirk simple.

"In short, I am steadfastly against any change that piles more arbitrary behavior
functional behavior on top of the quirk, especially when that behavior relies on
heuristics to try and guess what the guest is doing."

> 
> > 		else
> >-			cache = MTRR_TYPE_UNCACHABLE;
> >-
> >-		return (cache << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT;
> >+			return (MTRR_TYPE_UNCACHABLE << VMX_EPT_MT_EPTE_SHIFT) |
> >+				VMX_EPT_IPAT_BIT;
> > 	}
> > 
> > 	return kvm_mtrr_get_guest_memory_type(vcpu, gfn) << VMX_EPT_MT_EPTE_SHIFT;
> >-- 
> >2.17.1
> >

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v3 07/11] KVM: VMX: drop IPAT in memtype when CD=1 for KVM_X86_QUIRK_CD_NW_CLEARED
  2023-06-20  3:34       ` Chao Gao
@ 2023-06-20  3:19         ` Yan Zhao
  0 siblings, 0 replies; 37+ messages in thread
From: Yan Zhao @ 2023-06-20  3:19 UTC (permalink / raw)
  To: Chao Gao; +Cc: kvm, linux-kernel, pbonzini, seanjc, kai.huang, robert.hoo.linux

On Tue, Jun 20, 2023 at 11:34:43AM +0800, Chao Gao wrote:
> On Tue, Jun 20, 2023 at 10:34:29AM +0800, Yan Zhao wrote:
> >On Tue, Jun 20, 2023 at 10:42:57AM +0800, Chao Gao wrote:
> >> On Fri, Jun 16, 2023 at 10:38:15AM +0800, Yan Zhao wrote:
> >> >For KVM_X86_QUIRK_CD_NW_CLEARED, remove the ignore PAT bit in EPT memory
> >> >types when cache is disabled and non-coherent DMA are present.
> >> >
> >> >With the quirk KVM_X86_QUIRK_CD_NW_CLEARED, WB + IPAT are returned as the
> >> >EPT memory type when guest cache is disabled before this patch.
> >> >Removing the IPAT bit in this patch will allow effective memory type to
> >> >honor PAT values as well, which will make the effective memory type
> >> 
> >> Given guest sets CR0.CD, what's the point of honoring (guest) PAT? e.g.,
> >> which guests can benefit from this change?
> >This patch is actually a preparation for later patch 10 to implement
> >fine-grained zap.
> >If when CR0.CD=1 the EPT type is WB + IPAT, and
> >when CR0.CD=0 + mtrr enabled, EPT type is WB or UC or ..., which are
> >without IPAT, then we have to always zap all EPT entries.
> 
> OK. The goal is to reduce the cost of toggling CR0.CD. The key is if KVM sets
> the IPAT, then when CR0.CD is cleared by guest, KVM has to zap _all_ EPT entries
> at least to clear IPAT.
> 
> Can kvm honor guest MTRRs as well when CR0.CD=1 && with the quirk? then later
> clearing CR0.CD needn't zap _any_ EPT entry. But the optimization is exactly the
> one removed in patch 6. Maybe I miss something important.
I replied to it in another mail.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v3 07/11] KVM: VMX: drop IPAT in memtype when CD=1 for KVM_X86_QUIRK_CD_NW_CLEARED
  2023-06-20  2:34     ` Yan Zhao
@ 2023-06-20  3:34       ` Chao Gao
  2023-06-20  3:19         ` Yan Zhao
  2023-06-25  7:14       ` Xiaoyao Li
  1 sibling, 1 reply; 37+ messages in thread
From: Chao Gao @ 2023-06-20  3:34 UTC (permalink / raw)
  To: Yan Zhao; +Cc: kvm, linux-kernel, pbonzini, seanjc, kai.huang, robert.hoo.linux

On Tue, Jun 20, 2023 at 10:34:29AM +0800, Yan Zhao wrote:
>On Tue, Jun 20, 2023 at 10:42:57AM +0800, Chao Gao wrote:
>> On Fri, Jun 16, 2023 at 10:38:15AM +0800, Yan Zhao wrote:
>> >For KVM_X86_QUIRK_CD_NW_CLEARED, remove the ignore PAT bit in EPT memory
>> >types when cache is disabled and non-coherent DMA are present.
>> >
>> >With the quirk KVM_X86_QUIRK_CD_NW_CLEARED, WB + IPAT are returned as the
>> >EPT memory type when guest cache is disabled before this patch.
>> >Removing the IPAT bit in this patch will allow effective memory type to
>> >honor PAT values as well, which will make the effective memory type
>> 
>> Given guest sets CR0.CD, what's the point of honoring (guest) PAT? e.g.,
>> which guests can benefit from this change?
>This patch is actually a preparation for later patch 10 to implement
>fine-grained zap.
>If when CR0.CD=1 the EPT type is WB + IPAT, and
>when CR0.CD=0 + mtrr enabled, EPT type is WB or UC or ..., which are
>without IPAT, then we have to always zap all EPT entries.

OK. The goal is to reduce the cost of toggling CR0.CD. The key is if KVM sets
the IPAT, then when CR0.CD is cleared by guest, KVM has to zap _all_ EPT entries
at least to clear IPAT.

Can kvm honor guest MTRRs as well when CR0.CD=1 && with the quirk? then later
clearing CR0.CD needn't zap _any_ EPT entry. But the optimization is exactly the
one removed in patch 6. Maybe I miss something important.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v3 07/11] KVM: VMX: drop IPAT in memtype when CD=1 for KVM_X86_QUIRK_CD_NW_CLEARED
  2023-06-20  2:34     ` Yan Zhao
  2023-06-20  3:34       ` Chao Gao
@ 2023-06-25  7:14       ` Xiaoyao Li
  2023-06-26  0:08         ` Yan Zhao
  1 sibling, 1 reply; 37+ messages in thread
From: Xiaoyao Li @ 2023-06-25  7:14 UTC (permalink / raw)
  To: Yan Zhao, Chao Gao
  Cc: kvm, linux-kernel, pbonzini, seanjc, kai.huang, robert.hoo.linux

On 6/20/2023 10:34 AM, Yan Zhao wrote:
> On Tue, Jun 20, 2023 at 10:42:57AM +0800, Chao Gao wrote:
>> On Fri, Jun 16, 2023 at 10:38:15AM +0800, Yan Zhao wrote:
>>> For KVM_X86_QUIRK_CD_NW_CLEARED, remove the ignore PAT bit in EPT memory
>>> types when cache is disabled and non-coherent DMA are present.
>>>
>>> With the quirk KVM_X86_QUIRK_CD_NW_CLEARED, WB + IPAT are returned as the
>>> EPT memory type when guest cache is disabled before this patch.
>>> Removing the IPAT bit in this patch will allow effective memory type to
>>> honor PAT values as well, which will make the effective memory type
>> Given guest sets CR0.CD, what's the point of honoring (guest) PAT? e.g.,
>> which guests can benefit from this change?
> This patch is actually a preparation for later patch 10 to implement
> fine-grained zap.
> If when CR0.CD=1 the EPT type is WB + IPAT, and
> when CR0.CD=0 + mtrr enabled, EPT type is WB or UC or ..., which are
> without IPAT, then we have to always zap all EPT entries.
> 
> Given removing the IPAT bit when CR0.CD=1 only makes the quirk
> KVM_X86_QUIRK_CD_NW_CLEARED more strict (meaning it could be WC/UC... if
> the guest PAT overwrites it), it's still acceptable.

Per my understanding, the reason why KVM had KVM_X86_QUIRK_CD_NW_CLEARED 
is to ensure the memory type is WB to achieve better boot performance 
for old OVMF.

you need to justify the original purpose is not broken by this patch.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v3 07/11] KVM: VMX: drop IPAT in memtype when CD=1 for KVM_X86_QUIRK_CD_NW_CLEARED
  2023-06-25  7:14       ` Xiaoyao Li
@ 2023-06-26  0:08         ` Yan Zhao
  2023-06-26  3:40           ` Yuan Yao
  0 siblings, 1 reply; 37+ messages in thread
From: Yan Zhao @ 2023-06-26  0:08 UTC (permalink / raw)
  To: Xiaoyao Li
  Cc: Chao Gao, kvm, linux-kernel, pbonzini, seanjc, kai.huang,
	robert.hoo.linux

On Sun, Jun 25, 2023 at 03:14:37PM +0800, Xiaoyao Li wrote:
> On 6/20/2023 10:34 AM, Yan Zhao wrote:
> > On Tue, Jun 20, 2023 at 10:42:57AM +0800, Chao Gao wrote:
> > > On Fri, Jun 16, 2023 at 10:38:15AM +0800, Yan Zhao wrote:
> > > > For KVM_X86_QUIRK_CD_NW_CLEARED, remove the ignore PAT bit in EPT memory
> > > > types when cache is disabled and non-coherent DMA are present.
> > > > 
> > > > With the quirk KVM_X86_QUIRK_CD_NW_CLEARED, WB + IPAT are returned as the
> > > > EPT memory type when guest cache is disabled before this patch.
> > > > Removing the IPAT bit in this patch will allow effective memory type to
> > > > honor PAT values as well, which will make the effective memory type
> > > Given guest sets CR0.CD, what's the point of honoring (guest) PAT? e.g.,
> > > which guests can benefit from this change?
> > This patch is actually a preparation for later patch 10 to implement
> > fine-grained zap.
> > If when CR0.CD=1 the EPT type is WB + IPAT, and
> > when CR0.CD=0 + mtrr enabled, EPT type is WB or UC or ..., which are
> > without IPAT, then we have to always zap all EPT entries.
> > 
> > Given removing the IPAT bit when CR0.CD=1 only makes the quirk
> > KVM_X86_QUIRK_CD_NW_CLEARED more strict (meaning it could be WC/UC... if
> > the guest PAT overwrites it), it's still acceptable.
> 
> Per my understanding, the reason why KVM had KVM_X86_QUIRK_CD_NW_CLEARED is
> to ensure the memory type is WB to achieve better boot performance for old
> OVMF.
It works well for OVMF c9e5618f84b0cb54a9ac2d7604f7b7e7859b45a7,
which is Apr 14 2015.


> you need to justify the original purpose is not broken by this patch.

Hmm, to dig into the history, the reason for this quirk is explained below:

commit fb279950ba02e3210a16b11ecfa8871f3ee0ca49
Author: Xiao Guangrong <guangrong.xiao@intel.com>
Date:   Thu Jul 16 03:25:56 2015 +0800

    KVM: vmx: obey KVM_QUIRK_CD_NW_CLEARED

    OVMF depends on WB to boot fast, because it only clears caches after
    it has set up MTRRs---which is too late.

    Let's do writeback if CR0.CD is set to make it happy, similar to what
    SVM is already doing.


which means WB is only a must for fast boot before OVMF has set up MTRRs.
During that period, PAT defaults to WB.

After OVMF sets up MTRRs, according to the definition of no-fill cache
mode, "Strict memory ordering is not enforced unless the MTRRs are
disabled and/or all memory is referenced as uncached", it's valid to
honor PAT in no-fill cache mode.
Besides, if the guest explicitly claims UC via PAT, why should KVM return
WB?
In other words, if it's still slow due to a UC value in guest PAT,
that's better fixed in the guest than worked around in KVM.





^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v3 07/11] KVM: VMX: drop IPAT in memtype when CD=1 for KVM_X86_QUIRK_CD_NW_CLEARED
  2023-06-26  3:40           ` Yuan Yao
@ 2023-06-26  3:38             ` Yan Zhao
  0 siblings, 0 replies; 37+ messages in thread
From: Yan Zhao @ 2023-06-26  3:38 UTC (permalink / raw)
  To: Yuan Yao
  Cc: Xiaoyao Li, Chao Gao, kvm, linux-kernel, pbonzini, seanjc,
	kai.huang, robert.hoo.linux

On Mon, Jun 26, 2023 at 11:40:28AM +0800, Yuan Yao wrote:
> On Mon, Jun 26, 2023 at 08:08:20AM +0800, Yan Zhao wrote:
> > On Sun, Jun 25, 2023 at 03:14:37PM +0800, Xiaoyao Li wrote:
> > > On 6/20/2023 10:34 AM, Yan Zhao wrote:
> > > > On Tue, Jun 20, 2023 at 10:42:57AM +0800, Chao Gao wrote:
> > > > > On Fri, Jun 16, 2023 at 10:38:15AM +0800, Yan Zhao wrote:
> > > > > > For KVM_X86_QUIRK_CD_NW_CLEARED, remove the ignore PAT bit in EPT memory
> > > > > > types when cache is disabled and non-coherent DMA are present.
> > > > > >
> > > > > > With the quirk KVM_X86_QUIRK_CD_NW_CLEARED, WB + IPAT are returned as the
> > > > > > EPT memory type when guest cache is disabled before this patch.
> > > > > > Removing the IPAT bit in this patch will allow effective memory type to
> > > > > > honor PAT values as well, which will make the effective memory type
> > > > > Given guest sets CR0.CD, what's the point of honoring (guest) PAT? e.g.,
> > > > > which guests can benefit from this change?
> > > > This patch is actually a preparation for later patch 10 to implement
> > > > fine-grained zap.
> > > > If when CR0.CD=1 the EPT type is WB + IPAT, and
> > > > when CR0.CD=0 + mtrr enabled, EPT type is WB or UC or ..., which are
> > > > without IPAT, then we have to always zap all EPT entries.
> > > >
> > > > Given removing the IPAT bit when CR0.CD=1 only makes the quirk
> > > > KVM_X86_QUIRK_CD_NW_CLEARED more strict (meaning it could be WC/UC... if
> > > > the guest PAT overwrites it), it's still acceptable.
> > >
> > > Per my understanding, the reason why KVM had KVM_X86_QUIRK_CD_NW_CLEARED is
> > > to ensure the memory type is WB to achieve better boot performance for old
> > > OVMF.
> > It works well for OVMF c9e5618f84b0cb54a9ac2d7604f7b7e7859b45a7,
> > which is Apr 14 2015.
> >
> >
> > > you need to justify the original purpose is not broken by this patch.
> >
> > Hmm, to dig into the history, the reason for this quirk is explained below:
> >
> > commit fb279950ba02e3210a16b11ecfa8871f3ee0ca49
> > Author: Xiao Guangrong <guangrong.xiao@intel.com>
> > Date:   Thu Jul 16 03:25:56 2015 +0800
> >
> >     KVM: vmx: obey KVM_QUIRK_CD_NW_CLEARED
> >
> >     OVMF depends on WB to boot fast, because it only clears caches after
> >     it has set up MTRRs---which is too late.
> >
> >     Let's do writeback if CR0.CD is set to make it happy, similar to what
> >     SVM is already doing.
> >
> >
> > which means WB is only a must for fast boot before OVMF has set up MTRRs.
> > During that period, PAT defaults to WB.
> >
> > After OVMF sets up MTRRs, according to the definition of no-fill cache
> > mode, "Strict memory ordering is not enforced unless the MTRRs are
> > disabled and/or all memory is referenced as uncached", it's valid to
> > honor PAT in no-fill cache mode.
> 
> Does it also mean that honoring PAT in such no-fill cache mode should
> also happen for the non-quirk case? e.g. the effective memory type can be
> WC if EPT is UC + guest PAT is WC for CD=1.
No. Only the quirk KVM_X86_QUIRK_CD_NW_CLEARED indicates no-fill cache
mode (CD=1 and NW=0).
Without the quirk, UC + IPAT is desired.

> 
> > Besides, if the guest explicitly claims UC via PAT, why should KVM return
> > WB?
> > In other words, if it's still slow due to a UC value in guest PAT,
> > that's better fixed in the guest than worked around in KVM.
> 
> the quirk may not work after this patch if the guest PAT is
> stronger than WB for CD=1; we don't know if any guest "works correctly" based
> on this quirk, I hope none do. How about highlighting this in the commit message
At least for Seabios and OVMF, the PAT is WB by default.
Even after MTRRs are enabled, if there are UC ranges, they are small in size
and are desired to be UC.
So, I think it's ok.

> explicitly?
Will try to explain the background and possible influence.

Thanks

> 
> Also I agree that such issue should be fixed in guest not in KVM.
> 

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v3 07/11] KVM: VMX: drop IPAT in memtype when CD=1 for KVM_X86_QUIRK_CD_NW_CLEARED
  2023-06-26  0:08         ` Yan Zhao
@ 2023-06-26  3:40           ` Yuan Yao
  2023-06-26  3:38             ` Yan Zhao
  0 siblings, 1 reply; 37+ messages in thread
From: Yuan Yao @ 2023-06-26  3:40 UTC (permalink / raw)
  To: Yan Zhao
  Cc: Xiaoyao Li, Chao Gao, kvm, linux-kernel, pbonzini, seanjc,
	kai.huang, robert.hoo.linux

On Mon, Jun 26, 2023 at 08:08:20AM +0800, Yan Zhao wrote:
> On Sun, Jun 25, 2023 at 03:14:37PM +0800, Xiaoyao Li wrote:
> > On 6/20/2023 10:34 AM, Yan Zhao wrote:
> > > On Tue, Jun 20, 2023 at 10:42:57AM +0800, Chao Gao wrote:
> > > > On Fri, Jun 16, 2023 at 10:38:15AM +0800, Yan Zhao wrote:
> > > > > For KVM_X86_QUIRK_CD_NW_CLEARED, remove the ignore PAT bit in EPT memory
> > > > > types when cache is disabled and non-coherent DMA are present.
> > > > >
> > > > > With the quirk KVM_X86_QUIRK_CD_NW_CLEARED, WB + IPAT are returned as the
> > > > > EPT memory type when guest cache is disabled before this patch.
> > > > > Removing the IPAT bit in this patch will allow effective memory type to
> > > > > honor PAT values as well, which will make the effective memory type
> > > > Given guest sets CR0.CD, what's the point of honoring (guest) PAT? e.g.,
> > > > which guests can benefit from this change?
> > > This patch is actually a preparation for later patch 10 to implement
> > > fine-grained zap.
> > > If when CR0.CD=1 the EPT type is WB + IPAT, and
> > > when CR0.CD=0 + mtrr enabled, EPT type is WB or UC or ..., which are
> > > without IPAT, then we have to always zap all EPT entries.
> > >
> > > Given removing the IPAT bit when CR0.CD=1 only makes the quirk
> > > KVM_X86_QUIRK_CD_NW_CLEARED more strict (meaning it could be WC/UC... if
> > > the guest PAT overwrites it), it's still acceptable.
> >
> > Per my understanding, the reason why KVM had KVM_X86_QUIRK_CD_NW_CLEARED is
> > to ensure the memory type is WB to achieve better boot performance for old
> > OVMF.
> It works well for OVMF c9e5618f84b0cb54a9ac2d7604f7b7e7859b45a7,
> which is Apr 14 2015.
>
>
> > you need to justify the original purpose is not broken by this patch.
>
> Hmm, to dig into the history, the reason for this quirk is explained below:
>
> commit fb279950ba02e3210a16b11ecfa8871f3ee0ca49
> Author: Xiao Guangrong <guangrong.xiao@intel.com>
> Date:   Thu Jul 16 03:25:56 2015 +0800
>
>     KVM: vmx: obey KVM_QUIRK_CD_NW_CLEARED
>
>     OVMF depends on WB to boot fast, because it only clears caches after
>     it has set up MTRRs---which is too late.
>
>     Let's do writeback if CR0.CD is set to make it happy, similar to what
>     SVM is already doing.
>
>
> which means WB is only a must for fast boot before OVMF has set up MTRRs.
> During that period, PAT defaults to WB.
>
> After OVMF sets up MTRRs, according to the definition of no-fill cache
> mode, "Strict memory ordering is not enforced unless the MTRRs are
> disabled and/or all memory is referenced as uncached", it's valid to
> honor PAT in no-fill cache mode.

Does it also mean that honoring PAT in such no-fill cache mode should
also happen for the non-quirk case? e.g. the effective memory type can be
WC if EPT is UC + guest PAT is WC for CD=1.

> Besides, if the guest explicitly claims UC via PAT, why should KVM return
> WB?
> In other words, if it's still slow due to a UC value in guest PAT,
> that's better fixed in the guest than worked around in KVM.

the quirk may not work after this patch if the guest PAT is
stronger than WB for CD=1; we don't know if any guest "works correctly" based
on this quirk, I hope none do. How about highlighting this in the commit message
explicitly?

Also I agree that such issue should be fixed in guest not in KVM.

>
>
>
>

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v3 03/11] KVM: x86/mmu: Use KVM honors guest MTRRs helper when CR0.CD toggles
  2023-06-16  2:35 ` [PATCH v3 03/11] KVM: x86/mmu: Use KVM honors guest MTRRs helper when CR0.CD toggles Yan Zhao
@ 2023-06-28 21:59   ` Sean Christopherson
  2023-06-29  1:42     ` Yan Zhao
  0 siblings, 1 reply; 37+ messages in thread
From: Sean Christopherson @ 2023-06-28 21:59 UTC (permalink / raw)
  To: Yan Zhao
  Cc: kvm, linux-kernel, pbonzini, chao.gao, kai.huang, robert.hoo.linux

On Fri, Jun 16, 2023, Yan Zhao wrote:
> Call helper to check if guest MTRRs are honored by KVM MMU before zapping,

Nit, state the effect, not what the code literally does.  The important part is
that the end result is that KVM will zap if and only if guest MTRRs are being
honored, e.g.

  Zap SPTEs when CR0.CD is toggled if and only if KVM's MMU is honoring
  guest MTRRs, which is the only time that KVM incorporates the guest's
  CR0.CD into the final memtype.

> as values of guest CR0.CD will only affect memory types of KVM TDP when
> guest MTRRs are honored.
> 
> Suggested-by: Chao Gao <chao.gao@intel.com>
> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> ---
>  arch/x86/kvm/x86.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 9e7186864542..6693daeb5686 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -942,7 +942,7 @@ void kvm_post_set_cr0(struct kvm_vcpu *vcpu, unsigned long old_cr0, unsigned lon
>  		kvm_mmu_reset_context(vcpu);
>  
>  	if (((cr0 ^ old_cr0) & X86_CR0_CD) &&
> -	    kvm_arch_has_noncoherent_dma(vcpu->kvm) &&
> +	    kvm_mmu_honors_guest_mtrrs(vcpu->kvm) &&
>  	    !kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_CD_NW_CLEARED))
>  		kvm_zap_gfn_range(vcpu->kvm, 0, ~0ULL);
>  }
> -- 
> 2.17.1
> 

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v3 04/11] KVM: x86/mmu: Use KVM honors guest MTRRs helper when update mtrr
  2023-06-16  2:36 ` [PATCH v3 04/11] KVM: x86/mmu: Use KVM honors guest MTRRs helper when update mtrr Yan Zhao
@ 2023-06-28 22:08   ` Sean Christopherson
  0 siblings, 0 replies; 37+ messages in thread
From: Sean Christopherson @ 2023-06-28 22:08 UTC (permalink / raw)
  To: Yan Zhao
  Cc: kvm, linux-kernel, pbonzini, chao.gao, kai.huang, robert.hoo.linux

On Fri, Jun 16, 2023, Yan Zhao wrote:
> Call helper to check if guest MTRRs are honored by KVM MMU before
> calculation and zapping.

Same comment here about stating the effect, not the literal code change.

> Guest MTRRs only affect TDP memtypes when TDP honors guest MTRRs, there's
> no meaning to do the calculation and zapping otherwise.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v3 08/11] KVM: x86: move vmx code to get EPT memtype when CR0.CD=1 to x86 common code
  2023-06-16  2:38 ` [PATCH v3 08/11] KVM: x86: move vmx code to get EPT memtype when CR0.CD=1 to x86 common code Yan Zhao
@ 2023-06-28 22:57   ` Sean Christopherson
  2023-06-29  0:55     ` Yan Zhao
  0 siblings, 1 reply; 37+ messages in thread
From: Sean Christopherson @ 2023-06-28 22:57 UTC (permalink / raw)
  To: Yan Zhao
  Cc: kvm, linux-kernel, pbonzini, chao.gao, kai.huang, robert.hoo.linux

On Fri, Jun 16, 2023, Yan Zhao wrote:
> Move code in vmx.c to get cache disabled memtype when non-coherent DMA
> present to x86 common code.
> 
> This is the preparation patch for later implementation of fine-grained gfn
> zap for CR0.CD toggles when guest MTRRs are honored.
> 
> No functional change intended.
> 
> Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> ---
>  arch/x86/kvm/mtrr.c    | 19 +++++++++++++++++++
>  arch/x86/kvm/vmx/vmx.c | 10 +++++-----
>  arch/x86/kvm/x86.h     |  1 +
>  3 files changed, 25 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/x86/kvm/mtrr.c b/arch/x86/kvm/mtrr.c
> index 3ce58734ad22..b35dd0bc9cad 100644
> --- a/arch/x86/kvm/mtrr.c
> +++ b/arch/x86/kvm/mtrr.c
> @@ -721,3 +721,22 @@ bool kvm_mtrr_check_gfn_range_consistency(struct kvm_vcpu *vcpu, gfn_t gfn,
>  
>  	return type == mtrr_default_type(mtrr_state);
>  }
> +
> +void kvm_mtrr_get_cd_memory_type(struct kvm_vcpu *vcpu, u8 *type, bool *ipat)

Hmm, I'm not convinced that this logic is subtle enough to warrant a common
helper with out params (I *really* don't like out params :-) ).

UC, or more specifically CR0.CD=1 on VMX without the quirk, is a super special case,
because to faithfully emulate "No Fill" mode, KVM needs to ignore guest PAT (stupid WC).

I don't love having the same logic/assumptions in multiple places, but the CR0.CD=1
behavior is so rigidly tied to what KVM must do that I think trying to provide a
common helper makes the code more complex than it needs to be.

If we open code the logic in the MTRR helper, then I think it can be distilled
down to:

	struct kvm_mtrr *mtrr_state = &vcpu->arch.mtrr_state;
	bool mtrr_enabled = mtrr_is_enabled(mtrr_state);
	u8 default_type;

	/*
	 * Faithfully emulating CR0.CD=1 on VMX requires ignoring guest PAT, as
	 * WC in the PAT overrides UC in the MTRRs.  Zap all SPTEs so that KVM
	 * will once again start honoring guest PAT.
	 */
	if (!kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_CD_NW_CLEARED))
		return kvm_mtrr_zap_gfn_range(vcpu, gpa_to_gfn(0), gpa_to_gfn(~0ULL));

	default_type = mtrr_enabled ? mtrr_default_type(mtrr_state) :
				      mtrr_disabled_type(vcpu);

	if (default_type != MTRR_TYPE_WRBACK)
		return kvm_mtrr_zap_gfn_range(vcpu, gpa_to_gfn(0), gpa_to_gfn(~0ULL));

	if (mtrr_enabled) {
		if (gather_non_wb_fixed_mtrrs(vcpu, MTRR_TYPE_WRBACK))
			goto fail;

		if (gather_non_wb_var_mtrrs(vcpu, MTRR_TYPE_WRBACK))
			goto fail;

		kvm_zap_or_wait_mtrrs(vcpu->kvm);
	}

and this patch goes away.

> +{
> +	/*
> +	 * this routine is supposed to be called when guest mtrrs are honored
> +	 */
> +	if (unlikely(!kvm_mmu_honors_guest_mtrrs(vcpu->kvm))) {

I don't think this is worth checking, e.g. this would be WARN-worthy if it weren't
for an otherwise benign race with device (un)assignment.

> +		*type = MTRR_TYPE_WRBACK;
> +		*ipat = true;

> +	} else if (unlikely(!kvm_check_has_quirk(vcpu->kvm,

Eh, drop the "unlikely()" annotations.  AIUI, they almost never provide actual
performance benefits, and I dislike unnecessarily speculating on what userspace
is doing when it comes to code (though I 100% agree that this is definitely unlikely)

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v3 09/11] KVM: x86/mmu: serialize vCPUs to zap gfn when guest MTRRs are honored
  2023-06-16  2:39 ` [PATCH v3 09/11] KVM: x86/mmu: serialize vCPUs to zap gfn when guest MTRRs are honored Yan Zhao
  2023-06-16  7:45   ` Yuan Yao
@ 2023-06-28 23:00   ` Sean Christopherson
  2023-06-29  1:51     ` Yan Zhao
  1 sibling, 1 reply; 37+ messages in thread
From: Sean Christopherson @ 2023-06-28 23:00 UTC (permalink / raw)
  To: Yan Zhao
  Cc: kvm, linux-kernel, pbonzini, chao.gao, kai.huang, robert.hoo.linux

On Fri, Jun 16, 2023, Yan Zhao wrote:
> diff --git a/arch/x86/kvm/mtrr.c b/arch/x86/kvm/mtrr.c
> index b35dd0bc9cad..688748e3a4d2 100644
> --- a/arch/x86/kvm/mtrr.c
> +++ b/arch/x86/kvm/mtrr.c
> @@ -25,6 +25,8 @@
>  #define IA32_MTRR_DEF_TYPE_FE		(1ULL << 10)
>  #define IA32_MTRR_DEF_TYPE_TYPE_MASK	(0xff)
>  
> +static void kvm_mtrr_zap_gfn_range(struct kvm_vcpu *vcpu,
> +				   gfn_t gfn_start, gfn_t gfn_end);
>  static bool is_mtrr_base_msr(unsigned int msr)
>  {
>  	/* MTRR base MSRs use even numbers, masks use odd numbers. */
> @@ -341,7 +343,7 @@ static void update_mtrr(struct kvm_vcpu *vcpu, u32 msr)
>  		var_mtrr_range(var_mtrr_msr_to_range(vcpu, msr), &start, &end);
>  	}
>  
> -	kvm_zap_gfn_range(vcpu->kvm, gpa_to_gfn(start), gpa_to_gfn(end));
> +	kvm_mtrr_zap_gfn_range(vcpu, gpa_to_gfn(start), gpa_to_gfn(end));
>  }
>  
>  static bool var_mtrr_range_is_valid(struct kvm_mtrr_range *range)
> @@ -437,6 +439,11 @@ int kvm_mtrr_get_msr(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata)
>  void kvm_vcpu_mtrr_init(struct kvm_vcpu *vcpu)
>  {
>  	INIT_LIST_HEAD(&vcpu->arch.mtrr_state.head);
> +
> +	if (vcpu->vcpu_id == 0) {

Eww.  This is actually unsafe, because kvm_arch_vcpu_create() is invoked without
holding kvm->lock.  Oh, and vcpu_id is userspace controlled, so it's *very*
unsafe.  Just initialize these in kvm_arch_init_vm().

> +		spin_lock_init(&vcpu->kvm->arch.mtrr_zap_list_lock);
> +		INIT_LIST_HEAD(&vcpu->kvm->arch.mtrr_zap_list);
> +	}
>  }

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v3 00/11] KVM: x86/mmu: refine memtype related mmu zap
  2023-06-16  2:31 [PATCH v3 00/11] KVM: x86/mmu: refine memtype related mmu zap Yan Zhao
                   ` (10 preceding siblings ...)
  2023-06-16  2:42 ` [PATCH v3 11/11] KVM: x86/mmu: split a single gfn zap range " Yan Zhao
@ 2023-06-28 23:02 ` Sean Christopherson
  2023-07-14  7:11   ` Yan Zhao
  11 siblings, 1 reply; 37+ messages in thread
From: Sean Christopherson @ 2023-06-28 23:02 UTC (permalink / raw)
  To: Yan Zhao
  Cc: kvm, linux-kernel, pbonzini, chao.gao, kai.huang, robert.hoo.linux

On Fri, Jun 16, 2023, Yan Zhao wrote:
> This series refines mmu zap caused by EPT memory type update when guest
> MTRRs are honored.

...

> Yan Zhao (11):
>   KVM: x86/mmu: helpers to return if KVM honors guest MTRRs
>   KVM: x86/mmu: Use KVM honors guest MTRRs helper in
>     kvm_tdp_page_fault()
>   KVM: x86/mmu: Use KVM honors guest MTRRs helper when CR0.CD toggles
>   KVM: x86/mmu: Use KVM honors guest MTRRs helper when update mtrr
>   KVM: x86/mmu: zap KVM TDP when noncoherent DMA assignment starts/stops
>   KVM: x86/mmu: move TDP zaps from guest MTRRs update to CR0.CD toggling
>   KVM: VMX: drop IPAT in memtype when CD=1 for
>     KVM_X86_QUIRK_CD_NW_CLEARED
>   KVM: x86: move vmx code to get EPT memtype when CR0.CD=1 to x86 common
>     code
>   KVM: x86/mmu: serialize vCPUs to zap gfn when guest MTRRs are honored
>   KVM: x86/mmu: fine-grained gfn zap when guest MTRRs are honored
>   KVM: x86/mmu: split a single gfn zap range when guest MTRRs are
>     honored

I got through the easy patches, I'll circle back for the last few patches in a
few weeks (probably 3+ weeks at this point).

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v3 08/11] KVM: x86: move vmx code to get EPT memtype when CR0.CD=1 to x86 common code
  2023-06-28 22:57   ` Sean Christopherson
@ 2023-06-29  0:55     ` Yan Zhao
  2023-06-29 20:42       ` Sean Christopherson
  0 siblings, 1 reply; 37+ messages in thread
From: Yan Zhao @ 2023-06-29  0:55 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: kvm, linux-kernel, pbonzini, chao.gao, kai.huang, robert.hoo.linux

On Wed, Jun 28, 2023 at 03:57:09PM -0700, Sean Christopherson wrote:
> On Fri, Jun 16, 2023, Yan Zhao wrote:
> > Move code in vmx.c to get cache disabled memtype when non-coherent DMA
> > present to x86 common code.
> > 
> > This is the preparation patch for later implementation of fine-grained gfn
> > zap for CR0.CD toggles when guest MTRRs are honored.
> > 
> > No functional change intended.
> > 
> > Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> > ---
> >  arch/x86/kvm/mtrr.c    | 19 +++++++++++++++++++
> >  arch/x86/kvm/vmx/vmx.c | 10 +++++-----
> >  arch/x86/kvm/x86.h     |  1 +
> >  3 files changed, 25 insertions(+), 5 deletions(-)
> > 
> > diff --git a/arch/x86/kvm/mtrr.c b/arch/x86/kvm/mtrr.c
> > index 3ce58734ad22..b35dd0bc9cad 100644
> > --- a/arch/x86/kvm/mtrr.c
> > +++ b/arch/x86/kvm/mtrr.c
> > @@ -721,3 +721,22 @@ bool kvm_mtrr_check_gfn_range_consistency(struct kvm_vcpu *vcpu, gfn_t gfn,
> >  
> >  	return type == mtrr_default_type(mtrr_state);
> >  }
> > +
> > +void kvm_mtrr_get_cd_memory_type(struct kvm_vcpu *vcpu, u8 *type, bool *ipat)
> 
> Hmm, I'm not convinced that this logic is subtle enough to warrant a common
I added this patch because the memtype to use under CR0.CD=1 is determined by
vmx specific code (i.e. vmx.c), while mtrr.c is common code for x86.

I don't know if it's good to assume what vmx.c will return, as in the open code below
(e.g. if someone added the IPAT bit for CR0.CD=1 under the quirk and forgot
to update the code here, we would actually need to zap everything rather than
zap only non-WB ranges).

That's why I want to introduce a helper and let vmx.c call into it.
(sorry, I didn't write a good commit message to explain the real intent).

> helper with out params (I *really* don't like out params :-) ).
I don't like the out params either. :)
I just don't know a better way to return the ipat in the helper.
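
The only alternative I can think of is returning a small struct instead of
out params, e.g. (purely a sketch with hypothetical names, mirroring the
memtypes from patch 7):

	struct kvm_cd_memtype {
		u8 type;
		bool ipat;
	};

	static struct kvm_cd_memtype kvm_mtrr_get_cd_memtype(struct kvm_vcpu *vcpu)
	{
		/* Without the quirk, faithfully emulate no-fill mode: UC + IPAT. */
		if (!kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_CD_NW_CLEARED))
			return (struct kvm_cd_memtype){ .type = MTRR_TYPE_UNCACHABLE,
							.ipat = true };

		/* With the quirk, WB without IPAT so guest PAT is honored. */
		return (struct kvm_cd_memtype){ .type = MTRR_TYPE_WRBACK };
	}

Though that doesn't address the concern about sharing the logic itself.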

> 
> UC, or more specifically CR0.CD=1 on VMX without the quirk, is a super special case,
> because to faithfully emulate "No Fill" mode, KVM needs to ignore guest PAT (stupid WC).
> 
> I don't love having the same logic/assumptions in multiple places, but the CR0.CD=1
> behavior is so rigidly tied to what KVM must do that I think trying to provide a
> common helper makes the code more complex than it needs to be.
> 
> If we open code the logic in the MTRR helper, then I think it can be distilled
> down to:
> 
> 	struct kvm_mtrr *mtrr_state = &vcpu->arch.mtrr_state;
> 	bool mtrr_enabled = mtrr_is_enabled(mtrr_state);
> 	u8 default_type;
> 
> 	/*
> 	 * Faithfully emulating CR0.CD=1 on VMX requires ignoring guest PAT, as
> 	 * WC in the PAT overrides UC in the MTRRs.  Zap all SPTEs so that KVM
> 	 * will once again start honoring guest PAT.
> 	 */
> 	if (!kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_CD_NW_CLEARED))
> 		return kvm_mtrr_zap_gfn_range(vcpu, gpa_to_gfn(0), gpa_to_gfn(~0ULL));
> 
> 	default_type = mtrr_enabled ? mtrr_default_type(mtrr_state) :
> 				      mtrr_disabled_type(vcpu);
> 
> 	if (default_type != MTRR_TYPE_WRBACK)
> 		return kvm_mtrr_zap_gfn_range(vcpu, gpa_to_gfn(0), gpa_to_gfn(~0ULL));
> 
> 	if (mtrr_enabled) {
> 		if (gather_non_wb_fixed_mtrrs(vcpu, MTRR_TYPE_WRBACK))
> 			goto fail;
> 
> 		if (gather_non_wb_var_mtrrs(vcpu, MTRR_TYPE_WRBACK))
> 			goto fail;
> 
> 		kvm_zap_or_wait_mtrrs(vcpu->kvm);
> 	}
> 
> and this patch goes away.
> 
> > +{
> > +	/*
> > +	 * this routine is supposed to be called when guest mtrrs are honored
> > +	 */
> > +	if (unlikely(!kvm_mmu_honors_guest_mtrrs(vcpu->kvm))) {
> 
> I don't think this is worth checking, e.g. this would be WARN-worthy if it weren't
> for an otherwise benign race with device (un)assignment.
Yes, WARN_ON is a better way.

> 
> > +		*type = MTRR_TYPE_WRBACK;
> > +		*ipat = true;
> 
> > +	} else if (unlikely(!kvm_check_has_quirk(vcpu->kvm,
> 
> Eh, drop the "unlikely()" annotations.  AIUI, they almost never provide actual
> performance benefits, and I dislike unnecessarily speculating on what userspace
> is doing when it comes to code (though I 100% agree that this is definitely unlikely)
It makes sense!

Thanks!

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v3 03/11] KVM: x86/mmu: Use KVM honors guest MTRRs helper when CR0.CD toggles
  2023-06-28 21:59   ` Sean Christopherson
@ 2023-06-29  1:42     ` Yan Zhao
  0 siblings, 0 replies; 37+ messages in thread
From: Yan Zhao @ 2023-06-29  1:42 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: kvm, linux-kernel, pbonzini, chao.gao, kai.huang, robert.hoo.linux

On Wed, Jun 28, 2023 at 02:59:00PM -0700, Sean Christopherson wrote:
> On Fri, Jun 16, 2023, Yan Zhao wrote:
> > Call helper to check if guest MTRRs are honored by KVM MMU before zapping,
> 
> Nit, state the effect, not what the code literally does.  The important part is
> that the end result is that KVM will zap if and only if guest MTRRs are being
> honored, e.g.
> 
>   Zap SPTEs when CR0.CD is toggled if and only if KVM's MMU is honoring
>   guest MTRRs, which is the only time that KVM incorporates the guest's
>   CR0.CD into the final memtype.
> 
Thanks! Will update it. 

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v3 09/11] KVM: x86/mmu: serialize vCPUs to zap gfn when guest MTRRs are honored
  2023-06-28 23:00   ` Sean Christopherson
@ 2023-06-29  1:51     ` Yan Zhao
  0 siblings, 0 replies; 37+ messages in thread
From: Yan Zhao @ 2023-06-29  1:51 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: kvm, linux-kernel, pbonzini, chao.gao, kai.huang, robert.hoo.linux

On Wed, Jun 28, 2023 at 04:00:55PM -0700, Sean Christopherson wrote:
> On Fri, Jun 16, 2023, Yan Zhao wrote:
> > diff --git a/arch/x86/kvm/mtrr.c b/arch/x86/kvm/mtrr.c
> > index b35dd0bc9cad..688748e3a4d2 100644
> > --- a/arch/x86/kvm/mtrr.c
> > +++ b/arch/x86/kvm/mtrr.c
> > @@ -25,6 +25,8 @@
> >  #define IA32_MTRR_DEF_TYPE_FE		(1ULL << 10)
> >  #define IA32_MTRR_DEF_TYPE_TYPE_MASK	(0xff)
> >  
> > +static void kvm_mtrr_zap_gfn_range(struct kvm_vcpu *vcpu,
> > +				   gfn_t gfn_start, gfn_t gfn_end);
> >  static bool is_mtrr_base_msr(unsigned int msr)
> >  {
> >  	/* MTRR base MSRs use even numbers, masks use odd numbers. */
> > @@ -341,7 +343,7 @@ static void update_mtrr(struct kvm_vcpu *vcpu, u32 msr)
> >  		var_mtrr_range(var_mtrr_msr_to_range(vcpu, msr), &start, &end);
> >  	}
> >  
> > -	kvm_zap_gfn_range(vcpu->kvm, gpa_to_gfn(start), gpa_to_gfn(end));
> > +	kvm_mtrr_zap_gfn_range(vcpu, gpa_to_gfn(start), gpa_to_gfn(end));
> >  }
> >  
> >  static bool var_mtrr_range_is_valid(struct kvm_mtrr_range *range)
> > @@ -437,6 +439,11 @@ int kvm_mtrr_get_msr(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata)
> >  void kvm_vcpu_mtrr_init(struct kvm_vcpu *vcpu)
> >  {
> >  	INIT_LIST_HEAD(&vcpu->arch.mtrr_state.head);
> > +
> > +	if (vcpu->vcpu_id == 0) {
> 
> Eww.  This is actually unsafe, because kvm_arch_vcpu_create() is invoked without
> holding kvm->lock.  Oh, and vcpu_id is userspace controlled, so it's *very*
> unsafe.  Just initialize these in kvm_arch_init_vm().
Will do. Thanks!

> 
> > +		spin_lock_init(&vcpu->kvm->arch.mtrr_zap_list_lock);
> > +		INIT_LIST_HEAD(&vcpu->kvm->arch.mtrr_zap_list);
> > +	}
> >  }

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v3 08/11] KVM: x86: move vmx code to get EPT memtype when CR0.CD=1 to x86 common code
  2023-06-29  0:55     ` Yan Zhao
@ 2023-06-29 20:42       ` Sean Christopherson
  2023-06-30  7:49         ` Yan Zhao
  0 siblings, 1 reply; 37+ messages in thread
From: Sean Christopherson @ 2023-06-29 20:42 UTC (permalink / raw)
  To: Yan Zhao
  Cc: kvm, linux-kernel, pbonzini, chao.gao, kai.huang, robert.hoo.linux

On Thu, Jun 29, 2023, Yan Zhao wrote:
> On Wed, Jun 28, 2023 at 03:57:09PM -0700, Sean Christopherson wrote:
> > On Fri, Jun 16, 2023, Yan Zhao wrote:
> > > Move code in vmx.c to get cache disabled memtype when non-coherent DMA
> > > present to x86 common code.
> > > 
> > > This is the preparation patch for later implementation of fine-grained gfn
> > > zap for CR0.CD toggles when guest MTRRs are honored.
> > > 
> > > No functional change intended.
> > > 
> > > Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> > > ---
> > >  arch/x86/kvm/mtrr.c    | 19 +++++++++++++++++++
> > >  arch/x86/kvm/vmx/vmx.c | 10 +++++-----
> > >  arch/x86/kvm/x86.h     |  1 +
> > >  3 files changed, 25 insertions(+), 5 deletions(-)
> > > 
> > > diff --git a/arch/x86/kvm/mtrr.c b/arch/x86/kvm/mtrr.c
> > > index 3ce58734ad22..b35dd0bc9cad 100644
> > > --- a/arch/x86/kvm/mtrr.c
> > > +++ b/arch/x86/kvm/mtrr.c
> > > @@ -721,3 +721,22 @@ bool kvm_mtrr_check_gfn_range_consistency(struct kvm_vcpu *vcpu, gfn_t gfn,
> > >  
> > >  	return type == mtrr_default_type(mtrr_state);
> > >  }
> > > +
> > > +void kvm_mtrr_get_cd_memory_type(struct kvm_vcpu *vcpu, u8 *type, bool *ipat)
> > 
> > Hmm, I'm not convinced that this logic is subtle enough to warrant a common
> I added this patch because the memtype to use under CR0.CD=1 is determined by
> vmx specific code (i.e. vmx.c), while mtrr.c is common code for x86.
> 
> I don't know if it's good to assume what vmx.c will return, as in the open code below
> (e.g. if someone added the IPAT bit for CR0.CD=1 under the quirk and forgot
> to update the code here, we would actually need to zap everything rather than
> zap only non-WB ranges).
> 
> That's why I want to introduce a helper and let vmx.c call into it.
> (sorry, I didn't write a good commit message to explain the real intent).

No need to apologize, I fully understood the intent.  I'm just not convinced that
the risk of us screwing up this particular case is worth the extra layers of crud
that are necessary to let VMX and MTRRs share the core logic.

Absent emulating CR0.CD=1 with UC, setting IPAT is complete nonsense when KVM is
honoring the guest memtype.

I 100% agree that splitting the logic is less than ideal, but providing a common
helper feels forced and IMO yields significantly less readable code.  And exporting
kvm_mtrr_get_cd_memory_type() only adds to the confusion because calling it on
SVM, which can't fully ignore gPAT, is also nonsensical.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v3 08/11] KVM: x86: move vmx code to get EPT memtype when CR0.CD=1 to x86 common code
  2023-06-29 20:42       ` Sean Christopherson
@ 2023-06-30  7:49         ` Yan Zhao
  2023-07-14  7:00           ` Yan Zhao
  0 siblings, 1 reply; 37+ messages in thread
From: Yan Zhao @ 2023-06-30  7:49 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: kvm, linux-kernel, pbonzini, chao.gao, kai.huang, robert.hoo.linux

On Thu, Jun 29, 2023 at 01:42:46PM -0700, Sean Christopherson wrote:
> On Thu, Jun 29, 2023, Yan Zhao wrote:
> > On Wed, Jun 28, 2023 at 03:57:09PM -0700, Sean Christopherson wrote:
> > > On Fri, Jun 16, 2023, Yan Zhao wrote:
...
> > > > +void kvm_mtrr_get_cd_memory_type(struct kvm_vcpu *vcpu, u8 *type, bool *ipat)
> > > 
> > > Hmm, I'm not convinced that this logic is subtle enough to warrant a common
> > I added this patch because the memtype to use under CR0.CD=1 is determined by
> > vmx specific code (i.e. vmx.c), while mtrr.c is common code for x86.
> > 
> > I don't know if it's good to assume what vmx.c will return, as in the open code below
> > (e.g. if someone added the IPAT bit for CR0.CD=1 under the quirk and forgot
> > to update the code here, we would actually need to zap everything rather than
> > zap only non-WB ranges).
> > 
> > That's why I want to introduce a helper and let vmx.c call into it.
> > (sorry, I didn't write a good commit message to explain the real intent).
> 
> No need to apologize, I fully understood the intent.  I'm just not convinced that
> the risk of us screwing up this particular case is worth the extra layers of crud
> that are necessary to let VMX and MTRRs share the core logic.
> 
> Absent emulating CR0.CD=1 with UC, setting IPAT is complete nonsense when KVM is
> honoring the guest memtype.
Yes, I'm just paranoid :)

> 
> I 100% agree that splitting the logic is less than ideal, but providing a common
> helper feels forced and IMO yields significantly less readable code.  And exporting
> kvm_mtrr_get_cd_memory_type() only adds to the confusion because calling it on
> SVM, which can't fully ignore gPAT, is also nonsensical.
Ok, I get your concern now. You are right.
It looks like the easiest way now is to add comments in VMX to caution that
changing the memtype used when non-coherent DMA is present and CR0.CD=1 may
require updating the GFN zap code.
Or, do you think it's worth adding a new callback in kvm_x86_ops, e.g.
static_call_cond(kvm_x86_get_cd_mt_honor_guest_mtrr)()?
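
(For reference, a rough sketch of the optional-callback alternative floated
above, using KVM's existing KVM_X86_OP_OPTIONAL/static_call_cond machinery;
the op name is taken from the suggestion and nothing like this was merged:)

/* arch/x86/include/asm/kvm-x86-ops.h */
KVM_X86_OP_OPTIONAL(get_cd_mt_honor_guest_mtrr)

/* arch/x86/include/asm/kvm_host.h, inside struct kvm_x86_ops */
void (*get_cd_mt_honor_guest_mtrr)(struct kvm_vcpu *vcpu,
				   u8 *type, bool *ipat);

/* call site in mtrr.c; a no-op if the vendor module leaves it NULL */
static_call_cond(kvm_x86_get_cd_mt_honor_guest_mtrr)(vcpu, &type, &ipat);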

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v3 08/11] KVM: x86: move vmx code to get EPT memtype when CR0.CD=1 to x86 common code
  2023-06-30  7:49         ` Yan Zhao
@ 2023-07-14  7:00           ` Yan Zhao
  0 siblings, 0 replies; 37+ messages in thread
From: Yan Zhao @ 2023-07-14  7:00 UTC (permalink / raw)
  To: Sean Christopherson, kvm, linux-kernel, pbonzini, chao.gao,
	kai.huang, robert.hoo.linux

On Fri, Jun 30, 2023 at 03:49:10PM +0800, Yan Zhao wrote:
> On Thu, Jun 29, 2023 at 01:42:46PM -0700, Sean Christopherson wrote:
> > On Thu, Jun 29, 2023, Yan Zhao wrote:
> > > On Wed, Jun 28, 2023 at 03:57:09PM -0700, Sean Christopherson wrote:
> > > > On Fri, Jun 16, 2023, Yan Zhao wrote:
> ...
> > > > > +void kvm_mtrr_get_cd_memory_type(struct kvm_vcpu *vcpu, u8 *type, bool *ipat)
> > > > 
> > > > Hmm, I'm not convinced that this logic is subtle enough to warrant a common
> > > I added this patch because the memtype to use under CR0.CD=1 is determined by
> > > vmx-specific code (i.e. vmx.c), while mtrr.c is common code for x86.
> > > 
> > > I don't know if it's good to assume what vmx.c will return, as the open-coded
> > > version below does.
> > > (e.g. if someone added the IPAT bit for CR0.CD=1 under the quirk and forgot
> > > to update the code here, we would actually need to zap everything rather than
> > > only non-WB ranges).
> > > 
> > > That's why I want to introduce a helper and let vmx.c call into it.
> > > (sorry, I didn't write a good commit message to explain the real intent).
> > 
> > No need to apologize, I fully understood the intent.  I'm just not convinced that
> > the risk of us screwing up this particular case is worth the extra layers of crud
> > that are necessary to let VMX and MTRRs share the core logic.
> > 
> > Absent emulating CR0.CD=1 with UC, setting IPAT is complete nonsense when KVM is
> > honoring the guest memtype.
> Yes, I'm just paranoid :)
> 
> > 
> > I 100% agree that splitting the logic is less than ideal, but providing a common
> > helper feels forced and IMO yields significantly less readable code.  And exporting
What about renaming it to kvm_honors_guest_mtrrs_get_cd_memtype()?
Then it only needs to be called when guest MTRRs are honored, which provides
a kind of enforcement: if other x86 implementations (besides VMX/SVM) ever
want to honor guest MTRRs, they will use the same memtype when CR0.CD=1.
(I know such implementations may never appear, or you may prefer not to
update the code until they do.)
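
(A sketch of the enforcement idea, assuming the kvm_mmu_honors_guest_mtrrs()
helper from patch 1 and the quirk memtypes discussed above; the WARN is
illustrative:)

void kvm_honors_guest_mtrrs_get_cd_memtype(struct kvm_vcpu *vcpu,
					   u8 *type, bool *ipat)
{
	/* Callers must honor guest MTRRs, so they agree on the CD=1 memtype. */
	if (WARN_ON_ONCE(!kvm_mmu_honors_guest_mtrrs(vcpu->kvm)))
		return;

	if (kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_CD_NW_CLEARED)) {
		*type = MTRR_TYPE_WRBACK;	/* IPAT clear under the quirk */
		*ipat = false;
	} else {
		*type = MTRR_TYPE_UNCACHABLE;	/* force UC, ignore guest PAT */
		*ipat = true;
	}
}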

I tried this approach in v4 here:
https://lore.kernel.org/all/20230714065356.20620-1-yan.y.zhao@intel.com/.
Feel free to ask me to drop it if you still don't like it :)

> > kvm_mtrr_get_cd_memory_type() only adds to the confusion because calling it on
> > SVM, which can't fully ignore gPAT, is also nonsensical.
> Ok, I get your concern now. You are right.
> It looks like the easiest way now is to add comments in VMX to caution that
> changing the memtype used when non-coherent DMA is present and CR0.CD=1 may
> require updating the GFN zap code.
> Or, do you think it's worth adding a new callback in kvm_x86_ops, e.g.
> static_call_cond(kvm_x86_get_cd_mt_honor_guest_mtrr)()?

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCH v3 00/11] KVM: x86/mmu: refine memtype related mmu zap
  2023-06-28 23:02 ` [PATCH v3 00/11] KVM: x86/mmu: refine memtype related mmu zap Sean Christopherson
@ 2023-07-14  7:11   ` Yan Zhao
  0 siblings, 0 replies; 37+ messages in thread
From: Yan Zhao @ 2023-07-14  7:11 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: kvm, linux-kernel, pbonzini, chao.gao, kai.huang, robert.hoo.linux

On Wed, Jun 28, 2023 at 04:02:05PM -0700, Sean Christopherson wrote:
> On Fri, Jun 16, 2023, Yan Zhao wrote:
> > This series refines mmu zap caused by EPT memory type update when guest
> > MTRRs are honored.
> 
> ...
> 
> > Yan Zhao (11):
> >   KVM: x86/mmu: helpers to return if KVM honors guest MTRRs
> >   KVM: x86/mmu: Use KVM honors guest MTRRs helper in
> >     kvm_tdp_page_fault()
> >   KVM: x86/mmu: Use KVM honors guest MTRRs helper when CR0.CD toggles
> >   KVM: x86/mmu: Use KVM honors guest MTRRs helper when update mtrr
> >   KVM: x86/mmu: zap KVM TDP when noncoherent DMA assignment starts/stops
> >   KVM: x86/mmu: move TDP zaps from guest MTRRs update to CR0.CD toggling
> >   KVM: VMX: drop IPAT in memtype when CD=1 for
> >     KVM_X86_QUIRK_CD_NW_CLEARED
> >   KVM: x86: move vmx code to get EPT memtype when CR0.CD=1 to x86 common
> >     code
> >   KVM: x86/mmu: serialize vCPUs to zap gfn when guest MTRRs are honored
> >   KVM: x86/mmu: fine-grained gfn zap when guest MTRRs are honored
> >   KVM: x86/mmu: split a single gfn zap range when guest MTRRs are
> >     honored
> 
> I got through the easy patches, I'll circle back for the last few patches in a
> few weeks (probably 3+ weeks at this point).
Thanks for the heads-up.
I've addressed almost all of the comments on v3, except for where to get
the memtype for CR0.CD=1; feel free to decline my new proposal in v4 for
that, as explained in another mail :)
v4 is available here:
https://lore.kernel.org/all/20230714064656.20147-1-yan.y.zhao@intel.com/
Please review the new version directly.

Thanks!

^ permalink raw reply	[flat|nested] 37+ messages in thread

end of thread, other threads:[~2023-07-14  7:37 UTC | newest]

Thread overview: 37+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-06-16  2:31 [PATCH v3 00/11] KVM: x86/mmu: refine memtype related mmu zap Yan Zhao
2023-06-16  2:32 ` [PATCH v3 01/11] KVM: x86/mmu: helpers to return if KVM honors guest MTRRs Yan Zhao
2023-06-16  2:34 ` [PATCH v3 02/11] KVM: x86/mmu: Use KVM honors guest MTRRs helper in kvm_tdp_page_fault() Yan Zhao
2023-06-16  2:35 ` [PATCH v3 03/11] KVM: x86/mmu: Use KVM honors guest MTRRs helper when CR0.CD toggles Yan Zhao
2023-06-28 21:59   ` Sean Christopherson
2023-06-29  1:42     ` Yan Zhao
2023-06-16  2:36 ` [PATCH v3 04/11] KVM: x86/mmu: Use KVM honors guest MTRRs helper when update mtrr Yan Zhao
2023-06-28 22:08   ` Sean Christopherson
2023-06-16  2:37 ` [PATCH v3 05/11] KVM: x86/mmu: zap KVM TDP when noncoherent DMA assignment starts/stops Yan Zhao
2023-06-16  2:37 ` [PATCH v3 06/11] KVM: x86/mmu: move TDP zaps from guest MTRRs update to CR0.CD toggling Yan Zhao
2023-06-16  2:38 ` [PATCH v3 07/11] KVM: VMX: drop IPAT in memtype when CD=1 for KVM_X86_QUIRK_CD_NW_CLEARED Yan Zhao
2023-06-20  2:42   ` Chao Gao
2023-06-20  2:34     ` Yan Zhao
2023-06-20  3:34       ` Chao Gao
2023-06-20  3:19         ` Yan Zhao
2023-06-25  7:14       ` Xiaoyao Li
2023-06-26  0:08         ` Yan Zhao
2023-06-26  3:40           ` Yuan Yao
2023-06-26  3:38             ` Yan Zhao
2023-06-20  3:17     ` Yan Zhao
2023-06-16  2:38 ` [PATCH v3 08/11] KVM: x86: move vmx code to get EPT memtype when CR0.CD=1 to x86 common code Yan Zhao
2023-06-28 22:57   ` Sean Christopherson
2023-06-29  0:55     ` Yan Zhao
2023-06-29 20:42       ` Sean Christopherson
2023-06-30  7:49         ` Yan Zhao
2023-07-14  7:00           ` Yan Zhao
2023-06-16  2:39 ` [PATCH v3 09/11] KVM: x86/mmu: serialize vCPUs to zap gfn when guest MTRRs are honored Yan Zhao
2023-06-16  7:45   ` Yuan Yao
2023-06-16  7:37     ` Yan Zhao
2023-06-16  8:09       ` Yuan Yao
2023-06-16  7:50         ` Yan Zhao
2023-06-28 23:00   ` Sean Christopherson
2023-06-29  1:51     ` Yan Zhao
2023-06-16  2:41 ` [PATCH v3 10/11] KVM: x86/mmu: fine-grained gfn zap " Yan Zhao
2023-06-16  2:42 ` [PATCH v3 11/11] KVM: x86/mmu: split a single gfn zap range " Yan Zhao
2023-06-28 23:02 ` [PATCH v3 00/11] KVM: x86/mmu: refine memtype related mmu zap Sean Christopherson
2023-07-14  7:11   ` Yan Zhao

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).