* [PATCH v2 0/6] KVM: x86/mmu: refine memtype related mmu zap
@ 2023-05-09 13:48 Yan Zhao
  2023-05-09 13:50 ` [PATCH v2 1/6] KVM: x86/mmu: add a new mmu zap helper to indicate memtype changes Yan Zhao
                   ` (6 more replies)
  0 siblings, 7 replies; 41+ messages in thread
From: Yan Zhao @ 2023-05-09 13:48 UTC (permalink / raw)
  To: kvm, linux-kernel; +Cc: pbonzini, seanjc, Yan Zhao

This series refines mmu zap caused by EPT memory type update.

The first 3 patches only do the mmu zap when the target is to update the
memory type of EPT entries, by introducing a helper in patch 1 to skip
non-EPT cases.

The 4th patch triggers zapping of EPT leaf entries when the non-coherent
DMA device count goes from 0 to 1 or from 1 to 0.

The 5th-6th patches reduce the EPT zap count by introducing a per-VM
guest MTRR and zapping EPT entries only when this per-VM guest MTRR
changes.

Changelog:
v1 --> v2:
1. Added a helper in patch 1 to skip the non-EPT case
2. Added patch 2 to skip mmu zap when guest CR0_CD changes if EPT is not
   enabled. (Chao Gao)
3. Added patch 3 to skip mmu zap when guest MTRR changes if EPT is not
   enabled.
4. Do not mention TDX in patch 4 as the code is not merged yet (Chao Gao)
5. Added patches 5-6 to reduce EPT zap during guest bootup.

v1:
https://lore.kernel.org/all/20230508034700.7686-1-yan.y.zhao@intel.com/

Yan Zhao (6):
  KVM: x86/mmu: add a new mmu zap helper to indicate memtype changes
  KVM: x86/mmu: only zap EPT when guest CR0_CD changes
  KVM: x86/mmu: only zap EPT when guest MTRR changes
  KVM: x86/mmu: Zap all EPT leaf entries according noncoherent DMA count
  KVM: x86: Keep a per-VM MTRR state
  KVM: x86/mmu: use per-VM based MTRR for EPT

 arch/x86/include/asm/kvm_host.h |   3 +
 arch/x86/kvm/mmu.h              |   1 +
 arch/x86/kvm/mmu/mmu.c          |  18 ++++-
 arch/x86/kvm/mtrr.c             | 112 +++++++++++++++++++++++++-------
 arch/x86/kvm/vmx/vmx.c          |   2 +-
 arch/x86/kvm/x86.c              |  10 ++-
 arch/x86/kvm/x86.h              |   6 +-
 7 files changed, 122 insertions(+), 30 deletions(-)


base-commit: 5c291b93e5d665380dbecc6944973583f9565ee5
-- 
2.17.1



* [PATCH v2 1/6] KVM: x86/mmu: add a new mmu zap helper to indicate memtype changes
  2023-05-09 13:48 [PATCH v2 0/6] KVM: x86/mmu: refine memtype related mmu zap Yan Zhao
@ 2023-05-09 13:50 ` Yan Zhao
  2023-05-10  5:30   ` Chao Gao
  2023-05-09 13:51 ` [PATCH v2 2/6] KVM: x86/mmu: only zap EPT when guest CR0_CD changes Yan Zhao
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 41+ messages in thread
From: Yan Zhao @ 2023-05-09 13:50 UTC (permalink / raw)
  To: kvm, linux-kernel; +Cc: pbonzini, seanjc, Yan Zhao

Add a helper to indicate that the kvm_zap_gfn_range() request is to update
memory type.

Then the zap can be avoided in cases:
1. TDP is not enabled.
2. EPT is not enabled.

This is because only memory type of EPT leaf entries are subjected to
change when noncoherent DMA/guest CR0.CD/guest MTRR settings change.

Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
 arch/x86/kvm/mmu.h     |  1 +
 arch/x86/kvm/mmu/mmu.c | 16 ++++++++++++++++
 2 files changed, 17 insertions(+)

diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 92d5a1924fc1..a04577afbc71 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -236,6 +236,7 @@ static inline u8 permission_fault(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
 }
 
 void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end);
+void kvm_zap_gfn_for_memtype(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end);
 
 int kvm_arch_write_log_dirty(struct kvm_vcpu *vcpu);
 
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index c8961f45e3b1..2706754794d1 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -6272,6 +6272,22 @@ static bool kvm_rmap_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_e
 	return flush;
 }
 
+/*
+ * Invalidate (zap) TDP SPTEs that cover GFNs from gfn_start and up to gfn_end
+ * (not including it) for reason of memory type being updated.
+ */
+void kvm_zap_gfn_for_memtype(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
+{
+	/* Currently only memory type of EPT leaf entries are affected by
+	 * guest CR0.CD and guest MTRR.
+	 * So skip invalidation (zap) in other cases
+	 */
+	if (!shadow_memtype_mask)
+		return;
+
+	kvm_zap_gfn_range(kvm, gpa_to_gfn(0), gpa_to_gfn(~0ULL));
+}
+
 /*
  * Invalidate (zap) SPTEs that cover GFNs from gfn_start and up to gfn_end
  * (not including it)
-- 
2.17.1



* [PATCH v2 2/6] KVM: x86/mmu: only zap EPT when guest CR0_CD changes
  2023-05-09 13:48 [PATCH v2 0/6] KVM: x86/mmu: refine memtype related mmu zap Yan Zhao
  2023-05-09 13:50 ` [PATCH v2 1/6] KVM: x86/mmu: add a new mmu zap helper to indicate memtype changes Yan Zhao
@ 2023-05-09 13:51 ` Yan Zhao
  2023-05-09 13:51 ` [PATCH v2 3/6] KVM: x86/mmu: only zap EPT when guest MTRR changes Yan Zhao
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 41+ messages in thread
From: Yan Zhao @ 2023-05-09 13:51 UTC (permalink / raw)
  To: kvm, linux-kernel; +Cc: pbonzini, seanjc, Yan Zhao

Call the new helper kvm_zap_gfn_for_memtype() to skip the mmu zap if EPT
is not enabled.

The guest CR0.CD value affects the memory type of EPT leaf entries when
noncoherent DMA is present. But the mmu zap is not necessary if EPT is
not enabled.

Suggested-by: Chao Gao <chao.gao@intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
 arch/x86/kvm/x86.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index e7f78fe79b32..ed1e3939bd05 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -942,7 +942,7 @@ void kvm_post_set_cr0(struct kvm_vcpu *vcpu, unsigned long old_cr0, unsigned lon
 	if (((cr0 ^ old_cr0) & X86_CR0_CD) &&
 	    kvm_arch_has_noncoherent_dma(vcpu->kvm) &&
 	    !kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_CD_NW_CLEARED))
-		kvm_zap_gfn_range(vcpu->kvm, 0, ~0ULL);
+		kvm_zap_gfn_for_memtype(vcpu->kvm, gpa_to_gfn(0), gpa_to_gfn(~0ULL));
 }
 EXPORT_SYMBOL_GPL(kvm_post_set_cr0);
 
-- 
2.17.1



* [PATCH v2 3/6] KVM: x86/mmu: only zap EPT when guest MTRR changes
  2023-05-09 13:48 [PATCH v2 0/6] KVM: x86/mmu: refine memtype related mmu zap Yan Zhao
  2023-05-09 13:50 ` [PATCH v2 1/6] KVM: x86/mmu: add a new mmu zap helper to indicate memtype changes Yan Zhao
  2023-05-09 13:51 ` [PATCH v2 2/6] KVM: x86/mmu: only zap EPT when guest CR0_CD changes Yan Zhao
@ 2023-05-09 13:51 ` Yan Zhao
  2023-05-10  5:39   ` Chao Gao
  2023-05-09 13:52 ` [PATCH v2 4/6] KVM: x86/mmu: Zap all EPT leaf entries according noncoherent DMA count Yan Zhao
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 41+ messages in thread
From: Yan Zhao @ 2023-05-09 13:51 UTC (permalink / raw)
  To: kvm, linux-kernel; +Cc: pbonzini, seanjc, Yan Zhao

Call new helper kvm_zap_gfn_for_memtype() to skip zap mmu if EPT is not
enabled.

When guest MTRR changes and it's desired to zap TDP entries to remove
stale mappings, only do it when EPT is enabled, because only memory type
of EPT leaf is affected by guest MTRR with noncoherent DMA present.

Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
 arch/x86/kvm/mtrr.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kvm/mtrr.c b/arch/x86/kvm/mtrr.c
index 9fac1ec03463..62ebb9978156 100644
--- a/arch/x86/kvm/mtrr.c
+++ b/arch/x86/kvm/mtrr.c
@@ -330,7 +330,7 @@ static void update_mtrr(struct kvm_vcpu *vcpu, u32 msr)
 		var_mtrr_range(&mtrr_state->var_ranges[index], &start, &end);
 	}
 
-	kvm_zap_gfn_range(vcpu->kvm, gpa_to_gfn(start), gpa_to_gfn(end));
+	kvm_zap_gfn_for_memtype(vcpu->kvm, gpa_to_gfn(start), gpa_to_gfn(end));
 }
 
 static bool var_mtrr_range_is_valid(struct kvm_mtrr_range *range)
-- 
2.17.1



* [PATCH v2 4/6] KVM: x86/mmu: Zap all EPT leaf entries according noncoherent DMA count
  2023-05-09 13:48 [PATCH v2 0/6] KVM: x86/mmu: refine memtype related mmu zap Yan Zhao
                   ` (2 preceding siblings ...)
  2023-05-09 13:51 ` [PATCH v2 3/6] KVM: x86/mmu: only zap EPT when guest MTRR changes Yan Zhao
@ 2023-05-09 13:52 ` Yan Zhao
  2023-05-09 13:53 ` [PATCH v2 5/6] KVM: x86: Keep a per-VM MTRR state Yan Zhao
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 41+ messages in thread
From: Yan Zhao @ 2023-05-09 13:52 UTC (permalink / raw)
  To: kvm, linux-kernel; +Cc: pbonzini, seanjc, Yan Zhao

Zap all EPT leaf entries when noncoherent DMA count goes from 0 to 1, or
from 1 to 0.

When there is no noncoherent DMA device, the EPT memory type is
((MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT).

When there are noncoherent DMA devices, the EPT memory type needs to
honor guest CR0.CD and MTRR settings.

So, if noncoherent DMA count changes between 0 and 1, EPT leaf entries
need to be zapped to clear stale memory type.
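
For illustration, the difference boils down to something like the
following (a simplified sketch only, not the exact vmx_get_mt_mask()
code; among other things it ignores the KVM_X86_QUIRK_CD_NW_CLEARED
quirk and host MMIO handling):

	static u8 ept_memtype_sketch(struct kvm_vcpu *vcpu, gfn_t gfn)
	{
		if (!kvm_arch_has_noncoherent_dma(vcpu->kvm))
			/* No noncoherent DMA: force WB, ignore guest PAT/MTRR. */
			return (MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT) |
			       VMX_EPT_IPAT_BIT;

		/* Noncoherent DMA present: honor guest CR0.CD and MTRRs. */
		if (kvm_read_cr0_bits(vcpu, X86_CR0_CD))
			return MTRR_TYPE_UNCACHABLE << VMX_EPT_MT_EPTE_SHIFT;

		return kvm_mtrr_get_guest_memory_type(vcpu, gfn) <<
		       VMX_EPT_MT_EPTE_SHIFT;
	}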

This issue might be hidden when the device is statically assigned, as
VFIO adds/removes the MMIO regions of the noncoherent DMA devices
several times during guest boot, and the current KVM MMU calls
kvm_mmu_zap_all_fast() on memslot removal.

But if the device is hot-plugged, or if the guest has mmio_always_on for
the device, its MMIO regions may only be added once, and then there is
no path to zap the EPT entries and clear the stale memory type.

Therefore, zap all EPT leaf entries when the present/non-present state
of noncoherent DMA devices changes, to ensure stale entries are cleaned
away.

Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
 arch/x86/kvm/x86.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index ed1e3939bd05..48b683a305b3 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -13145,13 +13145,15 @@ EXPORT_SYMBOL_GPL(kvm_arch_has_assigned_device);
 
 void kvm_arch_register_noncoherent_dma(struct kvm *kvm)
 {
-	atomic_inc(&kvm->arch.noncoherent_dma_count);
+	if (atomic_inc_return(&kvm->arch.noncoherent_dma_count) == 1)
+		kvm_zap_gfn_for_memtype(kvm, gpa_to_gfn(0), gpa_to_gfn(~0ULL));
 }
 EXPORT_SYMBOL_GPL(kvm_arch_register_noncoherent_dma);
 
 void kvm_arch_unregister_noncoherent_dma(struct kvm *kvm)
 {
-	atomic_dec(&kvm->arch.noncoherent_dma_count);
+	if (!atomic_dec_return(&kvm->arch.noncoherent_dma_count))
+		kvm_zap_gfn_for_memtype(kvm, gpa_to_gfn(0), gpa_to_gfn(~0ULL));
 }
 EXPORT_SYMBOL_GPL(kvm_arch_unregister_noncoherent_dma);
 
-- 
2.17.1



* [PATCH v2 5/6] KVM: x86: Keep a per-VM MTRR state
  2023-05-09 13:48 [PATCH v2 0/6] KVM: x86/mmu: refine memtype related mmu zap Yan Zhao
                   ` (3 preceding siblings ...)
  2023-05-09 13:52 ` [PATCH v2 4/6] KVM: x86/mmu: Zap all EPT leaf entries according noncoherent DMA count Yan Zhao
@ 2023-05-09 13:53 ` Yan Zhao
  2023-05-10 17:23   ` David Matlack
  2023-05-21  3:44   ` Robert Hoo
  2023-05-09 13:53 ` [PATCH v2 6/6] KVM: x86/mmu: use per-VM based MTRR for EPT Yan Zhao
  2023-05-24  0:15 ` [PATCH v2 0/6] KVM: x86/mmu: refine memtype related mmu zap Sean Christopherson
  6 siblings, 2 replies; 41+ messages in thread
From: Yan Zhao @ 2023-05-09 13:53 UTC (permalink / raw)
  To: kvm, linux-kernel; +Cc: pbonzini, seanjc, Yan Zhao

Keep a per-VM MTRR state and point it to the MTRR state of vCPU 0.

This is a preparation patch for KVM to reference a per-VM guest MTRR
to decide memory type of EPT leaf entries when noncoherent DMA is present.

Though each vCPU has its own MTRR state, MTRR states should be
consistent across each VM, which is demanded as in Intel's SDM
"In a multiprocessor system using a processor in the P6 family or a more
recent family, each processor MUST use the identical MTRR memory map so
that software will have a consistent view of memory."

Therefore, when memory type of EPT leaf entry needs to honor guest MTRR,
a per-VM version of guest MTRR can be referenced.

Each vCPU still has its own MTRR state field to keep guest rdmsr()
returning the right value when there's lag of MTRR update for each vCPU.
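
As an illustration (a sketch only; the actual consumer side is added in
the next patch), a reader of the per-VM state is expected to dereference
it under kvm->srcu:

	int idx = srcu_read_lock(&kvm->srcu);
	struct kvm_mtrr *mtrr_state =
		srcu_dereference(kvm->arch.mtrr_state, &kvm->srcu);

	if (mtrr_state) {
		/* ... derive the memory type from the per-VM MTRR state ... */
	}
	srcu_read_unlock(&kvm->srcu, idx);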

Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
 arch/x86/include/asm/kvm_host.h |  3 +++
 arch/x86/kvm/mtrr.c             | 22 ++++++++++++++++++++++
 arch/x86/kvm/x86.c              |  2 ++
 arch/x86/kvm/x86.h              |  2 ++
 4 files changed, 29 insertions(+)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 2865c3cb3501..a2b6b1e1548f 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1444,6 +1444,9 @@ struct kvm_arch {
 	 */
 #define SPLIT_DESC_CACHE_MIN_NR_OBJECTS (SPTE_ENT_PER_PAGE + 1)
 	struct kvm_mmu_memory_cache split_desc_cache;
+
+	struct kvm_mtrr *mtrr_state;
+	bool has_mtrr;
 };
 
 struct kvm_vm_stat {
diff --git a/arch/x86/kvm/mtrr.c b/arch/x86/kvm/mtrr.c
index 62ebb9978156..1ae80c756797 100644
--- a/arch/x86/kvm/mtrr.c
+++ b/arch/x86/kvm/mtrr.c
@@ -438,6 +438,28 @@ void kvm_vcpu_mtrr_init(struct kvm_vcpu *vcpu)
 	INIT_LIST_HEAD(&vcpu->arch.mtrr_state.head);
 }
 
+void kvm_mtrr_init(struct kvm_vcpu *vcpu)
+{
+	struct kvm *kvm = vcpu->kvm;
+
+	if (vcpu->vcpu_id)
+		return;
+
+	rcu_assign_pointer(kvm->arch.mtrr_state, &vcpu->arch.mtrr_state);
+	kvm->arch.has_mtrr = guest_cpuid_has(vcpu, X86_FEATURE_MTRR);
+}
+
+void kvm_mtrr_destroy(struct kvm_vcpu *vcpu)
+{
+	struct kvm *kvm = vcpu->kvm;
+
+	if (vcpu->vcpu_id)
+		return;
+
+	rcu_assign_pointer(kvm->arch.mtrr_state, NULL);
+	synchronize_srcu_expedited(&kvm->srcu);
+}
+
 struct mtrr_iter {
 	/* input fields. */
 	struct kvm_mtrr *mtrr_state;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 48b683a305b3..b8aa18031877 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -11879,6 +11879,7 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
 	vcpu->arch.msr_platform_info = MSR_PLATFORM_INFO_CPUID_FAULT;
 	kvm_xen_init_vcpu(vcpu);
 	kvm_vcpu_mtrr_init(vcpu);
+	kvm_mtrr_init(vcpu);
 	vcpu_load(vcpu);
 	kvm_set_tsc_khz(vcpu, vcpu->kvm->arch.default_tsc_khz);
 	kvm_vcpu_reset(vcpu, false);
@@ -11948,6 +11949,7 @@ void kvm_arch_vcpu_destroy(struct kvm_vcpu *vcpu)
 	kvfree(vcpu->arch.cpuid_entries);
 	if (!lapic_in_kernel(vcpu))
 		static_branch_dec(&kvm_has_noapic_vcpu);
+	kvm_mtrr_destroy(vcpu);
 }
 
 void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index c544602d07a3..d0a7e50de739 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -308,6 +308,8 @@ void kvm_deliver_exception_payload(struct kvm_vcpu *vcpu,
 				   struct kvm_queued_exception *ex);
 
 void kvm_vcpu_mtrr_init(struct kvm_vcpu *vcpu);
+void kvm_mtrr_init(struct kvm_vcpu *vcpu);
+void kvm_mtrr_destroy(struct kvm_vcpu *vcpu);
 u8 kvm_mtrr_get_guest_memory_type(struct kvm_vcpu *vcpu, gfn_t gfn);
 bool kvm_mtrr_valid(struct kvm_vcpu *vcpu, u32 msr, u64 data);
 int kvm_mtrr_set_msr(struct kvm_vcpu *vcpu, u32 msr, u64 data);
-- 
2.17.1



* [PATCH v2 6/6] KVM: x86/mmu: use per-VM based MTRR for EPT
  2023-05-09 13:48 [PATCH v2 0/6] KVM: x86/mmu: refine memtype related mmu zap Yan Zhao
                   ` (4 preceding siblings ...)
  2023-05-09 13:53 ` [PATCH v2 5/6] KVM: x86: Keep a per-VM MTRR state Yan Zhao
@ 2023-05-09 13:53 ` Yan Zhao
  2023-05-24  0:15 ` [PATCH v2 0/6] KVM: x86/mmu: refine memtype related mmu zap Sean Christopherson
  6 siblings, 0 replies; 41+ messages in thread
From: Yan Zhao @ 2023-05-09 13:53 UTC (permalink / raw)
  To: kvm, linux-kernel; +Cc: pbonzini, seanjc, Yan Zhao

When KVM mmu checks guest MTRR, check the per-VM one, and only zap EPT
if the per-VM MTRR (the MTRR of vCPU 0) changes.

Before this patch, if there are noncoherent DMA devices, the EPT
violation handler references the guest MTRR state of the vCPU causing
the violation, and EPT leaf entries are zapped whenever any vCPU's MTRR
settings change.
But as one EPT leaf entry can only have one memory type, it may still
cause problems if vCPUs have different MTRR states.
So, insane guests without consistent MTRR state across vCPUs will only
cause problems for themselves.

Therefore, this patch switches to using the per-VM MTRR and only zaps
EPT when this per-VM MTRR changes, which avoids several EPT zaps during
guest boot.

Reference data (averaged over 10 guest boots) is shown below:

Physical CPU frequency: 3100 MHz
       | vCPU cnt |  memory | EPT zap cnt | EPT zap cycles | bootup time
before |     8    |   2G    |    84       |     4164.57M   |   19.38s
after  |     8    |   2G    |    14       |       16.07M   |   18.83s
before |     8    |  16G    |    84       |     4163.38M   |   24.51s
after  |     8    |  16G    |    14       |       16.68M   |   23.94s

Legends:
before:         before this patch
after:          after this patch
vCPU cnt:       guest vCPU count of a VM
memory:         guest memory size
EPT zap cnt:    the count of EPT zaps caused by update_mtrr() during
                guest boot
EPT zap cycles: the CPU cycles spent on EPT zaps caused by update_mtrr()
                during guest boot
bootup time:    guest bootup time, measured from starting QEMU to guest
                rc.local

Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
---
 arch/x86/kvm/mmu/mmu.c |  2 +-
 arch/x86/kvm/mtrr.c    | 88 +++++++++++++++++++++++++++++++-----------
 arch/x86/kvm/vmx/vmx.c |  2 +-
 arch/x86/kvm/x86.h     |  4 +-
 4 files changed, 70 insertions(+), 26 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 2706754794d1..4b05ce1f0241 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4532,7 +4532,7 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 			gfn_t base = gfn_round_for_level(fault->gfn,
 							 fault->max_level);
 
-			if (kvm_mtrr_check_gfn_range_consistency(vcpu, base, page_num))
+			if (kvm_mtrr_check_gfn_range_consistency(vcpu->kvm, base, page_num))
 				break;
 		}
 	}
diff --git a/arch/x86/kvm/mtrr.c b/arch/x86/kvm/mtrr.c
index 1ae80c756797..9be8ed40e226 100644
--- a/arch/x86/kvm/mtrr.c
+++ b/arch/x86/kvm/mtrr.c
@@ -105,7 +105,7 @@ static u8 mtrr_default_type(struct kvm_mtrr *mtrr_state)
 	return mtrr_state->deftype & IA32_MTRR_DEF_TYPE_TYPE_MASK;
 }
 
-static u8 mtrr_disabled_type(struct kvm_vcpu *vcpu)
+static u8 mtrr_disabled_type(struct kvm *kvm)
 {
 	/*
 	 * Intel SDM 11.11.2.2: all MTRRs are disabled when
@@ -117,10 +117,7 @@ static u8 mtrr_disabled_type(struct kvm_vcpu *vcpu)
 	 * enable MTRRs and it is obviously undesirable to run the
 	 * guest entirely with UC memory and we use WB.
 	 */
-	if (guest_cpuid_has(vcpu, X86_FEATURE_MTRR))
-		return MTRR_TYPE_UNCACHABLE;
-	else
-		return MTRR_TYPE_WRBACK;
+	return kvm->arch.has_mtrr ? MTRR_TYPE_UNCACHABLE : MTRR_TYPE_WRBACK;
 }
 
 /*
@@ -310,6 +307,12 @@ static void update_mtrr(struct kvm_vcpu *vcpu, u32 msr)
 	gfn_t start, end;
 	int index;
 
+	/*  MTRR is consistency between all the processors in the system
+	 *  so just update the TDP according to MTRR settings in vcpu0
+	 */
+	if (vcpu->vcpu_id)
+		return;
+
 	if (msr == MSR_IA32_CR_PAT || !tdp_enabled ||
 	      !kvm_arch_has_noncoherent_dma(vcpu->kvm))
 		return;
@@ -635,10 +638,11 @@ static void mtrr_lookup_next(struct mtrr_iter *iter)
 	for (mtrr_lookup_init(_iter_, _mtrr_, _gpa_start_, _gpa_end_); \
 	     mtrr_lookup_okay(_iter_); mtrr_lookup_next(_iter_))
 
-u8 kvm_mtrr_get_guest_memory_type(struct kvm_vcpu *vcpu, gfn_t gfn)
+u8 kvm_mtrr_get_guest_memory_type(struct kvm *kvm, gfn_t gfn)
 {
-	struct kvm_mtrr *mtrr_state = &vcpu->arch.mtrr_state;
+	struct kvm_mtrr *mtrr_state;
 	struct mtrr_iter iter;
+	int srcu_idx;
 	u64 start, end;
 	int type = -1;
 	const int wt_wb_mask = (1 << MTRR_TYPE_WRBACK)
@@ -647,6 +651,16 @@ u8 kvm_mtrr_get_guest_memory_type(struct kvm_vcpu *vcpu, gfn_t gfn)
 	start = gfn_to_gpa(gfn);
 	end = start + PAGE_SIZE;
 
+	srcu_idx = srcu_read_lock(&kvm->srcu);
+	mtrr_state = srcu_dereference(kvm->arch.mtrr_state, &kvm->srcu);
+	/* kvm mtrr_state points to mtrr_state of vcpu0.
+	 * should not reach here unless vcpu0 is destroyed
+	 */
+	if (WARN_ON(!mtrr_state)) {
+		type = mtrr_disabled_type(kvm);
+		goto out;
+	}
+
 	mtrr_for_each_mem_type(&iter, mtrr_state, start, end) {
 		int curr_type = iter.mem_type;
 
@@ -694,12 +708,16 @@ u8 kvm_mtrr_get_guest_memory_type(struct kvm_vcpu *vcpu, gfn_t gfn)
 		return MTRR_TYPE_WRBACK;
 	}
 
-	if (iter.mtrr_disabled)
-		return mtrr_disabled_type(vcpu);
+	if (iter.mtrr_disabled) {
+		type = mtrr_disabled_type(kvm);
+		goto out;
+	}
 
 	/* not contained in any MTRRs. */
-	if (type == -1)
-		return mtrr_default_type(mtrr_state);
+	if (type == -1) {
+		type = mtrr_default_type(mtrr_state);
+		goto out;
+	}
 
 	/*
 	 * We just check one page, partially covered by MTRRs is
@@ -707,38 +725,64 @@ u8 kvm_mtrr_get_guest_memory_type(struct kvm_vcpu *vcpu, gfn_t gfn)
 	 */
 	WARN_ON(iter.partial_map);
 
+out:
+	srcu_read_unlock(&kvm->srcu, srcu_idx);
 	return type;
 }
 EXPORT_SYMBOL_GPL(kvm_mtrr_get_guest_memory_type);
 
-bool kvm_mtrr_check_gfn_range_consistency(struct kvm_vcpu *vcpu, gfn_t gfn,
+bool kvm_mtrr_check_gfn_range_consistency(struct kvm *kvm, gfn_t gfn,
 					  int page_num)
 {
-	struct kvm_mtrr *mtrr_state = &vcpu->arch.mtrr_state;
+	struct kvm_mtrr *mtrr_state;
 	struct mtrr_iter iter;
+	int srcu_idx;
 	u64 start, end;
 	int type = -1;
+	int ret;
 
 	start = gfn_to_gpa(gfn);
 	end = gfn_to_gpa(gfn + page_num);
+
+	srcu_idx = srcu_read_lock(&kvm->srcu);
+	mtrr_state = srcu_dereference(kvm->arch.mtrr_state, &kvm->srcu);
+	/* kvm mtrr_state points to mtrr_state of vcpu0.
+	 * should not reach here unless vcpu0 is destroyed
+	 */
+	if (WARN_ON(!mtrr_state)) {
+		ret = true;
+		goto out;
+	}
+
 	mtrr_for_each_mem_type(&iter, mtrr_state, start, end) {
 		if (type == -1) {
 			type = iter.mem_type;
 			continue;
 		}
 
-		if (type != iter.mem_type)
-			return false;
+		if (type != iter.mem_type) {
+			ret = false;
+			goto out;
+		}
 	}
 
-	if (iter.mtrr_disabled)
-		return true;
+	if (iter.mtrr_disabled) {
+		ret = true;
+		goto out;
+	}
 
-	if (!iter.partial_map)
-		return true;
+	if (!iter.partial_map) {
+		ret = true;
+		goto out;
+	}
 
-	if (type == -1)
-		return true;
+	if (type == -1) {
+		ret = true;
+		goto out;
+	}
 
-	return type == mtrr_default_type(mtrr_state);
+	ret = (type == mtrr_default_type(mtrr_state));
+out:
+	srcu_read_unlock(&kvm->srcu, srcu_idx);
+	return ret;
 }
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 44fb619803b8..2ae9d5f3da99 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7540,7 +7540,7 @@ static u8 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
 		return (cache << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT;
 	}
 
-	return kvm_mtrr_get_guest_memory_type(vcpu, gfn) << VMX_EPT_MT_EPTE_SHIFT;
+	return kvm_mtrr_get_guest_memory_type(vcpu->kvm, gfn) << VMX_EPT_MT_EPTE_SHIFT;
 }
 
 static void vmcs_set_secondary_exec_control(struct vcpu_vmx *vmx, u32 new_ctl)
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index d0a7e50de739..a7acfeacbc04 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -310,11 +310,11 @@ void kvm_deliver_exception_payload(struct kvm_vcpu *vcpu,
 void kvm_vcpu_mtrr_init(struct kvm_vcpu *vcpu);
 void kvm_mtrr_init(struct kvm_vcpu *vcpu);
 void kvm_mtrr_destroy(struct kvm_vcpu *vcpu);
-u8 kvm_mtrr_get_guest_memory_type(struct kvm_vcpu *vcpu, gfn_t gfn);
+u8 kvm_mtrr_get_guest_memory_type(struct kvm *kvm, gfn_t gfn);
 bool kvm_mtrr_valid(struct kvm_vcpu *vcpu, u32 msr, u64 data);
 int kvm_mtrr_set_msr(struct kvm_vcpu *vcpu, u32 msr, u64 data);
 int kvm_mtrr_get_msr(struct kvm_vcpu *vcpu, u32 msr, u64 *pdata);
-bool kvm_mtrr_check_gfn_range_consistency(struct kvm_vcpu *vcpu, gfn_t gfn,
+bool kvm_mtrr_check_gfn_range_consistency(struct kvm *kvm, gfn_t gfn,
 					  int page_num);
 bool kvm_vector_hashing_enabled(void);
 void kvm_fixup_and_inject_pf_error(struct kvm_vcpu *vcpu, gva_t gva, u16 error_code);
-- 
2.17.1



* Re: [PATCH v2 1/6] KVM: x86/mmu: add a new mmu zap helper to indicate memtype changes
  2023-05-09 13:50 ` [PATCH v2 1/6] KVM: x86/mmu: add a new mmu zap helper to indicate memtype changes Yan Zhao
@ 2023-05-10  5:30   ` Chao Gao
  2023-05-10  8:06     ` Yan Zhao
  0 siblings, 1 reply; 41+ messages in thread
From: Chao Gao @ 2023-05-10  5:30 UTC (permalink / raw)
  To: Yan Zhao; +Cc: kvm, linux-kernel, pbonzini, seanjc

On Tue, May 09, 2023 at 09:50:06PM +0800, Yan Zhao wrote:
>Add a helper to indicate that the kvm_zap_gfn_range() request is to update
>memory type.
>
>Then the zap can be avoided in cases:
>1. TDP is not enabled.
>2. EPT is not enabled.
>
>This is because only memory type of EPT leaf entries are subjected to
>change when noncoherent DMA/guest CR0.CD/guest MTRR settings change.
>
>Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
>---
> arch/x86/kvm/mmu.h     |  1 +
> arch/x86/kvm/mmu/mmu.c | 16 ++++++++++++++++
> 2 files changed, 17 insertions(+)
>
>diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
>index 92d5a1924fc1..a04577afbc71 100644
>--- a/arch/x86/kvm/mmu.h
>+++ b/arch/x86/kvm/mmu.h
>@@ -236,6 +236,7 @@ static inline u8 permission_fault(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
> }
> 
> void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end);
>+void kvm_zap_gfn_for_memtype(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end);
> 
> int kvm_arch_write_log_dirty(struct kvm_vcpu *vcpu);
> 
>diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
>index c8961f45e3b1..2706754794d1 100644
>--- a/arch/x86/kvm/mmu/mmu.c
>+++ b/arch/x86/kvm/mmu/mmu.c
>@@ -6272,6 +6272,22 @@ static bool kvm_rmap_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_e
> 	return flush;
> }
> 
>+/*
>+ * Invalidate (zap) TDP SPTEs that cover GFNs from gfn_start and up to gfn_end
>+ * (not including it) for reason of memory type being updated.
>+ */
>+void kvm_zap_gfn_for_memtype(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
>+{
>+	/* Currently only memory type of EPT leaf entries are affected by
>+	 * guest CR0.CD and guest MTRR.
>+	 * So skip invalidation (zap) in other cases
>+	 */
>+	if (!shadow_memtype_mask)

Do you need to check kvm_arch_has_noncoherent_dma()? if yes, maybe just extract
the first if() statement and its associated comment from kvm_tdp_page_fault()?

>+		return;
>+
>+	kvm_zap_gfn_range(kvm, gpa_to_gfn(0), gpa_to_gfn(~0ULL));

This should be:

	kvm_zap_gfn_range(kvm, gfn_start, gfn_end);

>+}
>+
> /*
>  * Invalidate (zap) SPTEs that cover GFNs from gfn_start and up to gfn_end
>  * (not including it)
>-- 
>2.17.1
>


* Re: [PATCH v2 3/6] KVM: x86/mmu: only zap EPT when guest MTRR changes
  2023-05-09 13:51 ` [PATCH v2 3/6] KVM: x86/mmu: only zap EPT when guest MTRR changes Yan Zhao
@ 2023-05-10  5:39   ` Chao Gao
  2023-05-10  8:00     ` Yan Zhao
  0 siblings, 1 reply; 41+ messages in thread
From: Chao Gao @ 2023-05-10  5:39 UTC (permalink / raw)
  To: Yan Zhao; +Cc: kvm, linux-kernel, pbonzini, seanjc

On Tue, May 09, 2023 at 09:51:43PM +0800, Yan Zhao wrote:
>Call new helper kvm_zap_gfn_for_memtype() to skip zap mmu if EPT is not
>enabled.
>
>When guest MTRR changes and it's desired to zap TDP entries to remove
>stale mappings, only do it when EPT is enabled, because only memory type
>of EPT leaf is affected by guest MTRR with noncoherent DMA present.
>
>Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
>---
> arch/x86/kvm/mtrr.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
>diff --git a/arch/x86/kvm/mtrr.c b/arch/x86/kvm/mtrr.c
>index 9fac1ec03463..62ebb9978156 100644
>--- a/arch/x86/kvm/mtrr.c
>+++ b/arch/x86/kvm/mtrr.c
>@@ -330,7 +330,7 @@ static void update_mtrr(struct kvm_vcpu *vcpu, u32 msr)
> 		var_mtrr_range(&mtrr_state->var_ranges[index], &start, &end);
> 	}
> 
>-	kvm_zap_gfn_range(vcpu->kvm, gpa_to_gfn(start), gpa_to_gfn(end));
>+	kvm_zap_gfn_for_memtype(vcpu->kvm, gpa_to_gfn(start), gpa_to_gfn(end));

I am wondering if the check of shadow_memtype_mask (now inside the
kvm_zap_gfn_for_memtype()) should be moved to the beginning of update_mtrr().
Because if EPT isn't enabled, computing @start/@end is useless and can be
skipped.

> }
> 
> static bool var_mtrr_range_is_valid(struct kvm_mtrr_range *range)
>-- 
>2.17.1
>


* Re: [PATCH v2 3/6] KVM: x86/mmu: only zap EPT when guest MTRR changes
  2023-05-10  5:39   ` Chao Gao
@ 2023-05-10  8:00     ` Yan Zhao
  2023-05-10 10:54       ` Huang, Kai
  0 siblings, 1 reply; 41+ messages in thread
From: Yan Zhao @ 2023-05-10  8:00 UTC (permalink / raw)
  To: Chao Gao; +Cc: kvm, linux-kernel, pbonzini, seanjc

On Wed, May 10, 2023 at 01:39:25PM +0800, Chao Gao wrote:
> On Tue, May 09, 2023 at 09:51:43PM +0800, Yan Zhao wrote:
> >Call new helper kvm_zap_gfn_for_memtype() to skip zap mmu if EPT is not
> >enabled.
> >
> >When guest MTRR changes and it's desired to zap TDP entries to remove
> >stale mappings, only do it when EPT is enabled, because only memory type
> >of EPT leaf is affected by guest MTRR with noncoherent DMA present.
> >
> >Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> >---
> > arch/x86/kvm/mtrr.c | 2 +-
> > 1 file changed, 1 insertion(+), 1 deletion(-)
> >
> >diff --git a/arch/x86/kvm/mtrr.c b/arch/x86/kvm/mtrr.c
> >index 9fac1ec03463..62ebb9978156 100644
> >--- a/arch/x86/kvm/mtrr.c
> >+++ b/arch/x86/kvm/mtrr.c
> >@@ -330,7 +330,7 @@ static void update_mtrr(struct kvm_vcpu *vcpu, u32 msr)
> > 		var_mtrr_range(&mtrr_state->var_ranges[index], &start, &end);
> > 	}
> > 
> >-	kvm_zap_gfn_range(vcpu->kvm, gpa_to_gfn(start), gpa_to_gfn(end));
> >+	kvm_zap_gfn_for_memtype(vcpu->kvm, gpa_to_gfn(start), gpa_to_gfn(end));
> 
> I am wondering if the check of shadow_memtype_mask (now inside the
> kvm_zap_gfn_for_memtype()) should be moved to the beginning of update_mtrr().
> Because if EPT isn't enabled, computing @start/@end is useless and can be
> skipped.

No, shadow_memtype_mask is not accessible in mtrr.c.
It's better to confine it in mmu related files, e.g. mmu/mmu.c,
mmu/spte.c

> 
> > }
> > 
> > static bool var_mtrr_range_is_valid(struct kvm_mtrr_range *range)
> >-- 
> >2.17.1
> >


* Re: [PATCH v2 1/6] KVM: x86/mmu: add a new mmu zap helper to indicate memtype changes
  2023-05-10  5:30   ` Chao Gao
@ 2023-05-10  8:06     ` Yan Zhao
  2023-05-23 22:51       ` Sean Christopherson
  0 siblings, 1 reply; 41+ messages in thread
From: Yan Zhao @ 2023-05-10  8:06 UTC (permalink / raw)
  To: Chao Gao; +Cc: kvm, linux-kernel, pbonzini, seanjc

On Wed, May 10, 2023 at 01:30:29PM +0800, Chao Gao wrote:
> On Tue, May 09, 2023 at 09:50:06PM +0800, Yan Zhao wrote:
> >Add a helper to indicate that the kvm_zap_gfn_range() request is to update
> >memory type.
> >
> >Then the zap can be avoided in cases:
> >1. TDP is not enabled.
> >2. EPT is not enabled.
> >
> >This is because only memory type of EPT leaf entries are subjected to
> >change when noncoherent DMA/guest CR0.CD/guest MTRR settings change.
> >
> >Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> >---
> > arch/x86/kvm/mmu.h     |  1 +
> > arch/x86/kvm/mmu/mmu.c | 16 ++++++++++++++++
> > 2 files changed, 17 insertions(+)
> >
> >diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
> >index 92d5a1924fc1..a04577afbc71 100644
> >--- a/arch/x86/kvm/mmu.h
> >+++ b/arch/x86/kvm/mmu.h
> >@@ -236,6 +236,7 @@ static inline u8 permission_fault(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
> > }
> > 
> > void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end);
> >+void kvm_zap_gfn_for_memtype(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end);
> > 
> > int kvm_arch_write_log_dirty(struct kvm_vcpu *vcpu);
> > 
> >diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> >index c8961f45e3b1..2706754794d1 100644
> >--- a/arch/x86/kvm/mmu/mmu.c
> >+++ b/arch/x86/kvm/mmu/mmu.c
> >@@ -6272,6 +6272,22 @@ static bool kvm_rmap_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_e
> > 	return flush;
> > }
> > 
> >+/*
> >+ * Invalidate (zap) TDP SPTEs that cover GFNs from gfn_start and up to gfn_end
> >+ * (not including it) for reason of memory type being updated.
> >+ */
> >+void kvm_zap_gfn_for_memtype(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
> >+{
> >+	/* Currently only memory type of EPT leaf entries are affected by
> >+	 * guest CR0.CD and guest MTRR.
> >+	 * So skip invalidation (zap) in other cases
> >+	 */
> >+	if (!shadow_memtype_mask)
> 
> Do you need to check kvm_arch_has_noncoherent_dma()? if yes, maybe just extract
> the first if() statement and its associated comment from kvm_tdp_page_fault()?
>
This kvm_zap_gfn_for_memtype() helper will be called in
kvm_arch_unregister_noncoherent_dma() as well when noncoherent_dma_count
goes from 1 to 0.
So, checking kvm_arch_has_noncoherent_dma() is not desired.

> >+		return;
> >+
> >+	kvm_zap_gfn_range(kvm, gpa_to_gfn(0), gpa_to_gfn(~0ULL));
> 
> This should be:
> 
> 	kvm_zap_gfn_range(kvm, gfn_start, gfn_end);
>
Oops. sorry about this mistake. Will fix it in next version.

> >+}
> >+
> > /*
> >  * Invalidate (zap) SPTEs that cover GFNs from gfn_start and up to gfn_end
> >  * (not including it)
> >-- 
> >2.17.1
> >


* Re: [PATCH v2 3/6] KVM: x86/mmu: only zap EPT when guest MTRR changes
  2023-05-10  8:00     ` Yan Zhao
@ 2023-05-10 10:54       ` Huang, Kai
  2023-05-11  0:15         ` Yan Zhao
  0 siblings, 1 reply; 41+ messages in thread
From: Huang, Kai @ 2023-05-10 10:54 UTC (permalink / raw)
  To: Zhao, Yan Y, Gao, Chao; +Cc: kvm, pbonzini, linux-kernel, Christopherson,, Sean

On Wed, 2023-05-10 at 16:00 +0800, Yan Zhao wrote:
> On Wed, May 10, 2023 at 01:39:25PM +0800, Chao Gao wrote:
> > On Tue, May 09, 2023 at 09:51:43PM +0800, Yan Zhao wrote:
> > > Call new helper kvm_zap_gfn_for_memtype() to skip zap mmu if EPT is not
> > > enabled.
> > > 
> > > When guest MTRR changes and it's desired to zap TDP entries to remove
> > > stale mappings, only do it when EPT is enabled, because only memory type
> > > of EPT leaf is affected by guest MTRR with noncoherent DMA present.
> > > 
> > > Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> > > ---
> > > arch/x86/kvm/mtrr.c | 2 +-
> > > 1 file changed, 1 insertion(+), 1 deletion(-)
> > > 
> > > diff --git a/arch/x86/kvm/mtrr.c b/arch/x86/kvm/mtrr.c
> > > index 9fac1ec03463..62ebb9978156 100644
> > > --- a/arch/x86/kvm/mtrr.c
> > > +++ b/arch/x86/kvm/mtrr.c
> > > @@ -330,7 +330,7 @@ static void update_mtrr(struct kvm_vcpu *vcpu, u32 msr)
> > > 		var_mtrr_range(&mtrr_state->var_ranges[index], &start, &end);
> > > 	}
> > > 
> > > -	kvm_zap_gfn_range(vcpu->kvm, gpa_to_gfn(start), gpa_to_gfn(end));
> > > +	kvm_zap_gfn_for_memtype(vcpu->kvm, gpa_to_gfn(start), gpa_to_gfn(end));
> > 
> > I am wondering if the check of shadow_memtype_mask (now inside the
> > kvm_zap_gfn_for_memtype()) should be moved to the beginning of update_mtrr().
> > Because if EPT isn't enabled, computing @start/@end is useless and can be
> > skipped.
> 
> No, shadow_memtype_mask is not accessible in mtrr.c.
> It's better to confine it in mmu related files, e.g. mmu/mmu.c,
> mmu/spte.c
> 

No, I think it should be shadow_memtype_mask.

Conceptually, after commit 38bf9d7bf277 ("KVM: x86/mmu: Add shadow mask for
effective host MTRR memtype"), I believe in update_mtrr() the check:

	if (msr == MSR_IA32_CR_PAT || !tdp_enabled ||
              !kvm_arch_has_noncoherent_dma(vcpu->kvm))
                return;

should be:

	if (msr == MSR_IA32_CR_PAT || !shadow_memtype_mask ||
              !kvm_arch_has_noncoherent_dma(vcpu->kvm))
                return;

Because the intention of 'shadow_memtype_mask' is to use it as a dedicated
variable to represent hardware has capability to override the host MTRRs (which
is the case for EPT), instead of relying on TDP being enabled.

That being said, to me it's more reasonable to have a separate patch to replace
the '!tdp_enabled' check with '!shadow_memtype_mask' in update_mtrr(), perhaps
with a Fixes tag for commit 38bf9d7bf277.

The kvm_zap_gfn_range() call should remain unchanged.

For the problem that shadow_memtype_mask isn't accessible in mtrr.c I think you
can include "mmu/spte.h"?

> 


* Re: [PATCH v2 5/6] KVM: x86: Keep a per-VM MTRR state
  2023-05-09 13:53 ` [PATCH v2 5/6] KVM: x86: Keep a per-VM MTRR state Yan Zhao
@ 2023-05-10 17:23   ` David Matlack
  2023-05-21  3:44   ` Robert Hoo
  1 sibling, 0 replies; 41+ messages in thread
From: David Matlack @ 2023-05-10 17:23 UTC (permalink / raw)
  To: Yan Zhao; +Cc: kvm, linux-kernel, pbonzini, seanjc

On Tue, May 09, 2023 at 09:53:00PM +0800, Yan Zhao wrote:
> +void kvm_mtrr_init(struct kvm_vcpu *vcpu)
> +{
> +	struct kvm *kvm = vcpu->kvm;
> +
> +	if (vcpu->vcpu_id)
> +		return;

vcpu_id is provided by userspace so I don't think there's any guarantee
that a vCPU with vcpu_id == 0 exists.
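
For reference (an illustrative userspace snippet, not from this series),
the vcpu_id is simply whatever userspace passes to KVM_CREATE_VCPU, so
nothing forces an id-0 vCPU to ever be created:

	/* Userspace picks the id; it does not have to start at 0. */
	int vcpu_fd = ioctl(vm_fd, KVM_CREATE_VCPU, 7);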


* Re: [PATCH v2 3/6] KVM: x86/mmu: only zap EPT when guest MTRR changes
  2023-05-10 10:54       ` Huang, Kai
@ 2023-05-11  0:15         ` Yan Zhao
  2023-05-11  2:42           ` Huang, Kai
  0 siblings, 1 reply; 41+ messages in thread
From: Yan Zhao @ 2023-05-11  0:15 UTC (permalink / raw)
  To: Huang, Kai; +Cc: Gao, Chao, kvm, pbonzini, linux-kernel, Christopherson,, Sean

On Wed, May 10, 2023 at 06:54:51PM +0800, Huang, Kai wrote:
> On Wed, 2023-05-10 at 16:00 +0800, Yan Zhao wrote:
> > On Wed, May 10, 2023 at 01:39:25PM +0800, Chao Gao wrote:
> > > On Tue, May 09, 2023 at 09:51:43PM +0800, Yan Zhao wrote:
> > > > Call new helper kvm_zap_gfn_for_memtype() to skip zap mmu if EPT is not
> > > > enabled.
> > > > 
> > > > When guest MTRR changes and it's desired to zap TDP entries to remove
> > > > stale mappings, only do it when EPT is enabled, because only memory type
> > > > of EPT leaf is affected by guest MTRR with noncoherent DMA present.
> > > > 
> > > > Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> > > > ---
> > > > arch/x86/kvm/mtrr.c | 2 +-
> > > > 1 file changed, 1 insertion(+), 1 deletion(-)
> > > > 
> > > > diff --git a/arch/x86/kvm/mtrr.c b/arch/x86/kvm/mtrr.c
> > > > index 9fac1ec03463..62ebb9978156 100644
> > > > --- a/arch/x86/kvm/mtrr.c
> > > > +++ b/arch/x86/kvm/mtrr.c
> > > > @@ -330,7 +330,7 @@ static void update_mtrr(struct kvm_vcpu *vcpu, u32 msr)
> > > > 		var_mtrr_range(&mtrr_state->var_ranges[index], &start, &end);
> > > > 	}
> > > > 
> > > > -	kvm_zap_gfn_range(vcpu->kvm, gpa_to_gfn(start), gpa_to_gfn(end));
> > > > +	kvm_zap_gfn_for_memtype(vcpu->kvm, gpa_to_gfn(start), gpa_to_gfn(end));
> > > 
> > > I am wondering if the check of shadow_memtype_mask (now inside the
> > > kvm_zap_gfn_for_memtype()) should be moved to the beginning of update_mtrr().
> > > Because if EPT isn't enabled, computing @start/@end is useless and can be
> > > skipped.
> > 
> > No, shadow_memtype_mask is not accessible in mtrr.c.
> > It's better to confine it in mmu related files, e.g. mmu/mmu.c,
> > mmu/spte.c
> > 
> 
> No, I think it should be shadow_memtype_mask.
> 
> Conceptually, after commit 38bf9d7bf277 ("KVM: x86/mmu: Add shadow mask for
> effective host MTRR memtype"), I believe in update_mtrr() the check:
> 
> 	if (msr == MSR_IA32_CR_PAT || !tdp_enabled ||
>               !kvm_arch_has_noncoherent_dma(vcpu->kvm))
>                 return;
> 
> should be:
> 
> 	if (msr == MSR_IA32_CR_PAT || !shadow_memtype_mask ||
>               !kvm_arch_has_noncoherent_dma(vcpu->kvm))
>                 return;
> 
> Because the intention of 'shadow_memtype_mask' is to use it as a dedicated
> variable to represent hardware has capability to override the host MTRRs (which
> is the case for EPT), instead of relying on TDP being enabled.
> 
> That being said, to me it's more reasonable to have a separate patch to replace
> the '!tdp_enabled' check with '!shadow_memtype_mask' in update_mtrr(), perhaps
> with a Fixes tag for commit 38bf9d7bf277.
> 
> The kvm_zap_gfn_range() call should remain unchanged.
> 
> For the problem that shadow_memtype_mask isn't accessible in mtrr.c I think you
> can include "mmu/spte.h"?
> 

I agree shadow_memtype_mask is the right value to check but it's indeed
internal to kvm mmu.
Including "mmu/spte.h" will further include "mmu/mmu_internal.h"

What about introduce kvm_mmu_memtye_effective() instead as below?

diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index a04577afbc71..173960b1827d 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -254,6 +254,8 @@ static inline bool kvm_shadow_root_allocated(struct kvm *kvm)
        return smp_load_acquire(&kvm->arch.shadow_root_allocated);
 }
 
+bool kvm_mmu_memtye_effective(struct kvm *kvm);
+
 #ifdef CONFIG_X86_64
 extern bool tdp_mmu_enabled;
 #else
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 2706754794d1..ff7752d158d7 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -6288,6 +6288,10 @@ void kvm_zap_gfn_for_memtype(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
        kvm_zap_gfn_range(kvm, gpa_to_gfn(0), gpa_to_gfn(~0ULL));
 }
 
+bool kvm_mmu_memtye_effective(struct kvm *kvm)
+{
+       return shadow_memtype_mask;
+}
 /*
  * Invalidate (zap) SPTEs that cover GFNs from gfn_start and up to gfn_end
  * (not including it)
diff --git a/arch/x86/kvm/mtrr.c b/arch/x86/kvm/mtrr.c
index 5ce5bc0c4971..b6bd147d22a0 100644
--- a/arch/x86/kvm/mtrr.c
+++ b/arch/x86/kvm/mtrr.c
@@ -313,7 +313,7 @@ static void update_mtrr(struct kvm_vcpu *vcpu, u32 msr)
        int cnt;
        unsigned long long all;
 
-       if (msr == MSR_IA32_CR_PAT || !tdp_enabled ||
+       if (msr == MSR_IA32_CR_PAT || !kvm_mmu_memtye_effective(vcpu->kvm) ||
              !kvm_arch_has_noncoherent_dma(vcpu->kvm))
                return;



* Re: [PATCH v2 3/6] KVM: x86/mmu: only zap EPT when guest MTRR changes
  2023-05-11  2:42           ` Huang, Kai
@ 2023-05-11  2:31             ` Yan Zhao
  2023-05-11  3:05               ` Huang, Kai
  0 siblings, 1 reply; 41+ messages in thread
From: Yan Zhao @ 2023-05-11  2:31 UTC (permalink / raw)
  To: Huang, Kai; +Cc: kvm, pbonzini, linux-kernel, Gao, Chao, Christopherson,, Sean

On Thu, May 11, 2023 at 10:42:13AM +0800, Huang, Kai wrote:
> > > 
> > 
> > I agree shadow_memtype_mask is the right value to check but it's indeed
> > internal to kvm mmu.
> > Including "mmu/spte.h" will further include "mmu/mmu_internal.h"
> > 
> > What about introduce kvm_mmu_memtye_effective() instead as below?
> > 
> 
> [...]
> 
> >  
> > +bool kvm_mmu_memtye_effective(struct kvm *kvm)
> > +{
> > +       return shadow_memtype_mask;
> > +}
> 
> I don't think it's a good name.  shadow_memtype_mask doesn't mean any actual
> effective memtype, but just represent those bits that KVM can manipulate to get
> an effective memtype for the guest access.
> 
> Instead, it should represent the hardware capability to be able to emulate
> guest's MTRR independent from host's MTRR.  So, perhaps something like:
> 
> 	bool kvm_mmu_guest_mtrr_emulatable();

What about kvm_mmu_cap_effective_guest_mtrr()?
Guest MTRR is always emulated with vMTRR.

> 
> It's better if Sean can provide input here.
>
> (Btw, off this topic, I think it's even more reasonable to make
> shadow_memtype_mask only be meaningful when tdp_enabled is true, because the
> capability to override host MTRR and emulate guest MTRR should be only available
> when secondary MMU is turned on).
>  


* Re: [PATCH v2 3/6] KVM: x86/mmu: only zap EPT when guest MTRR changes
  2023-05-11  0:15         ` Yan Zhao
@ 2023-05-11  2:42           ` Huang, Kai
  2023-05-11  2:31             ` Yan Zhao
  0 siblings, 1 reply; 41+ messages in thread
From: Huang, Kai @ 2023-05-11  2:42 UTC (permalink / raw)
  To: Zhao, Yan Y; +Cc: kvm, pbonzini, linux-kernel, Gao, Chao, Christopherson,, Sean

> > 
> 
> I agree shadow_memtype_mask is the right value to check but it's indeed
> internal to kvm mmu.
> Including "mmu/spte.h" will further include "mmu/mmu_internal.h"
> 
> What about introduce kvm_mmu_memtye_effective() instead as below?
> 

[...]

>  
> +bool kvm_mmu_memtye_effective(struct kvm *kvm)
> +{
> +       return shadow_memtype_mask;
> +}

I don't think it's a good name.  shadow_memtype_mask doesn't mean any actual
effective memtype, but just represent those bits that KVM can manipulate to get
an effective memtype for the guest access.

Instead, it should represent the hardware capability to be able to emulate
guest's MTRR independent from host's MTRR.  So, perhaps something like:

	bool kvm_mmu_guest_mtrr_emulatable();

It's better if Sean can provide input here.

(Btw, off this topic, I think it's even more reasonable to make
shadow_memtype_mask only be meaningful when tdp_enabled is true, because the
capability to override host MTRR and emulate guest MTRR should be only available
when secondary MMU is turned on).
 


* Re: [PATCH v2 3/6] KVM: x86/mmu: only zap EPT when guest MTRR changes
  2023-05-11  2:31             ` Yan Zhao
@ 2023-05-11  3:05               ` Huang, Kai
  0 siblings, 0 replies; 41+ messages in thread
From: Huang, Kai @ 2023-05-11  3:05 UTC (permalink / raw)
  To: Zhao, Yan Y; +Cc: kvm, pbonzini, linux-kernel, Gao, Chao, Christopherson,, Sean

On Thu, 2023-05-11 at 10:31 +0800, Zhao, Yan Y wrote:
> On Thu, May 11, 2023 at 10:42:13AM +0800, Huang, Kai wrote:
> > > > 
> > > 
> > > I agree shadow_memtype_mask is the right value to check but it's indeed
> > > internal to kvm mmu.
> > > Including "mmu/spte.h" will further include "mmu/mmu_internal.h"
> > > 
> > > What about introduce kvm_mmu_memtye_effective() instead as below?
> > > 
> > 
> > [...]
> > 
> > >  
> > > +bool kvm_mmu_memtye_effective(struct kvm *kvm)
> > > +{
> > > +       return shadow_memtype_mask;
> > > +}
> > 
> > I don't think it's a good name.  shadow_memtype_mask doesn't mean any actual
> > effective memtype, but just represent those bits that KVM can manipulate to get
> > an effective memtype for the guest access.
> > 
> > Instead, it should represent the hardware capability to be able to emulate
> > guest's MTRR independent from host's MTRR.  So, perhaps something like:
> > 
> > 	bool kvm_mmu_guest_mtrr_emulatable();
> 
> What about kvm_mmu_cap_effective_guest_mtrr()?
> Guest MTRR is always emulated with vMTRR.

Fine to me, but again it's better if Sean can provide input.

> 
> > 
> > It's better if Sean can provide input here.
> > 
> > (Btw, off this topic, I think it's even more reasonable to make
> > shadow_memtype_mask only be meaningful when tdp_enabled is true, because the
> > capability to override host MTRR and emulate guest MTRR should be only available
> > when secondary MMU is turned on).
> >  



* Re: [PATCH v2 5/6] KVM: x86: Keep a per-VM MTRR state
  2023-05-09 13:53 ` [PATCH v2 5/6] KVM: x86: Keep a per-VM MTRR state Yan Zhao
  2023-05-10 17:23   ` David Matlack
@ 2023-05-21  3:44   ` Robert Hoo
  2023-05-23  6:21     ` Yan Zhao
  1 sibling, 1 reply; 41+ messages in thread
From: Robert Hoo @ 2023-05-21  3:44 UTC (permalink / raw)
  To: Yan Zhao, kvm, linux-kernel; +Cc: pbonzini, seanjc

On 5/9/2023 9:53 PM, Yan Zhao wrote:
> Keep a per-VM MTRR state and point it to the MTRR state of vCPU 0.
> 
> This is a preparation patch for KVM to reference a per-VM guest MTRR
> to decide memory type of EPT leaf entries when noncoherent DMA is present.
> 
> Though each vCPU has its own MTRR state, MTRR states should be
> consistent across each VM, which is demanded as in Intel's SDM
> "In a multiprocessor system using a processor in the P6 family or a more
> recent family, each processor MUST use the identical MTRR memory map so
> that software will have a consistent view of memory."
> 
> Therefore, when memory type of EPT leaf entry needs to honor guest MTRR,
> a per-VM version of guest MTRR can be referenced.
> 
> Each vCPU still has its own MTRR state field to keep guest rdmsr()
> returning the right value when there's lag of MTRR update for each vCPU.
> 
Can we get rid of the per-vCPU MTRR state copies and just have this 
per-VM state only? That would simplify the implementation and avoid the 
hazard of inconsistency among per-vCPU MTRR states.

I see in SDM, it notes:
"In multiple processor systems, the operating system must maintain MTRR 
consistency between all the processors in the system (that is, all 
processors must use the same MTRR values). The P6 and more recent processor 
families provide no hardware support for maintaining this consistency."

leaving each vCPU's MTRR is just to fully mimic HW?



* Re: [PATCH v2 5/6] KVM: x86: Keep a per-VM MTRR state
  2023-05-21  3:44   ` Robert Hoo
@ 2023-05-23  6:21     ` Yan Zhao
  2023-05-24  0:13       ` Sean Christopherson
  2023-05-25  7:21       ` Robert Hoo
  0 siblings, 2 replies; 41+ messages in thread
From: Yan Zhao @ 2023-05-23  6:21 UTC (permalink / raw)
  To: Robert Hoo; +Cc: kvm, linux-kernel, pbonzini, seanjc

On Sun, May 21, 2023 at 11:44:36AM +0800, Robert Hoo wrote:
> On 5/9/2023 9:53 PM, Yan Zhao wrote:
> > Keep a per-VM MTRR state and point it to the MTRR state of vCPU 0.
> > 
> > This is a preparation patch for KVM to reference a per-VM guest MTRR
> > to decide memory type of EPT leaf entries when noncoherent DMA is present.
> > 
> > Though each vCPU has its own MTRR state, MTRR states should be
> > consistent across each VM, which is demanded as in Intel's SDM
> > "In a multiprocessor system using a processor in the P6 family or a more
> > recent family, each processor MUST use the identical MTRR memory map so
> > that software will have a consistent view of memory."
> > 
> > Therefore, when memory type of EPT leaf entry needs to honor guest MTRR,
> > a per-VM version of guest MTRR can be referenced.
> > 
> > Each vCPU still has its own MTRR state field to keep guest rdmsr()
> > returning the right value when there's lag of MTRR update for each vCPU.
> > 
> Can we get rid of the per-vCPU MTRR state copies and just have this
> per-VM state only? That would simplify the implementation and avoid the
> hazard of inconsistency among per-vCPU MTRR states.
> 
> I see in SDM, it notes:
> "In multiple processor systems, the operating system must maintain MTRR
> consistency between all the processors in the system (that is, all
> processors must use the same MTRR values). The P6 and more recent processor
> families provide no hardware support for maintaining this consistency."
> 
> leaving each vCPU's MTRR is just to fully mimic HW?
>
Yes, leaving each vCPU's MTRR state is just to mimic HW.

As also suggested in the SDM, the guest OS manipulates MTRRs in this way:

for each online CPU {
	disable MTRR
	update fixed/var MTRR ranges
	enable MTRR
}

The guest OS accesses memory only after this full pattern completes.

So, I think there should be no "hazard of inconsistency among per-vCPU
MTRR states".

The reason I want a per-VM MTRR state is to reduce unnecessary EPT
zaps, which cost quite a lot of CPU cycles even when the EPT is empty.

In this patch, the per-VM MTRR pointer points to vCPU 0's MTRR state,
which saves the memory of keeping a separate copy.
But I found out that it only works when vCPU 0 (the boot processor) is
always online (which is not true for x86 under some configurations).

I'll try to find the lowest online vCPU and keep a per-VM copy of the
MTRR state in the next version.
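
Untested sketch of the idea (picking the lowest-id vCPU as the per-VM
MTRR source), just for illustration:

	static struct kvm_vcpu *lowest_vcpu(struct kvm *kvm)
	{
		struct kvm_vcpu *vcpu, *lowest = NULL;
		unsigned long i;

		kvm_for_each_vcpu(i, vcpu, kvm) {
			if (!lowest || vcpu->vcpu_id < lowest->vcpu_id)
				lowest = vcpu;
		}
		return lowest;
	}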

Thanks!




* Re: [PATCH v2 1/6] KVM: x86/mmu: add a new mmu zap helper to indicate memtype changes
  2023-05-10  8:06     ` Yan Zhao
@ 2023-05-23 22:51       ` Sean Christopherson
  2023-05-24  2:22         ` Yan Zhao
  0 siblings, 1 reply; 41+ messages in thread
From: Sean Christopherson @ 2023-05-23 22:51 UTC (permalink / raw)
  To: Yan Zhao; +Cc: Chao Gao, kvm, linux-kernel, pbonzini

On Wed, May 10, 2023, Yan Zhao wrote:
> On Wed, May 10, 2023 at 01:30:29PM +0800, Chao Gao wrote:
> > On Tue, May 09, 2023 at 09:50:06PM +0800, Yan Zhao wrote:
> > >Add a helper to indicate that the kvm_zap_gfn_range() request is to update
> > >memory type.
> > >
> > >Then the zap can be avoided in cases:
> > >1. TDP is not enabled.
> > >2. EPT is not enabled.
> > >
> > >This is because only memory type of EPT leaf entries are subjected to
> > >change when noncoherent DMA/guest CR0.CD/guest MTRR settings change.
> > >
> > >Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
> > >---
> > > arch/x86/kvm/mmu.h     |  1 +
> > > arch/x86/kvm/mmu/mmu.c | 16 ++++++++++++++++
> > > 2 files changed, 17 insertions(+)
> > >
> > >diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
> > >index 92d5a1924fc1..a04577afbc71 100644
> > >--- a/arch/x86/kvm/mmu.h
> > >+++ b/arch/x86/kvm/mmu.h
> > >@@ -236,6 +236,7 @@ static inline u8 permission_fault(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
> > > }
> > > 
> > > void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end);
> > >+void kvm_zap_gfn_for_memtype(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end);
> > > 
> > > int kvm_arch_write_log_dirty(struct kvm_vcpu *vcpu);
> > > 
> > >diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > >index c8961f45e3b1..2706754794d1 100644
> > >--- a/arch/x86/kvm/mmu/mmu.c
> > >+++ b/arch/x86/kvm/mmu/mmu.c
> > >@@ -6272,6 +6272,22 @@ static bool kvm_rmap_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_e
> > > 	return flush;
> > > }
> > > 
> > >+/*
> > >+ * Invalidate (zap) TDP SPTEs that cover GFNs from gfn_start and up to gfn_end
> > >+ * (not including it) for reason of memory type being updated.
> > >+ */
> > >+void kvm_zap_gfn_for_memtype(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
> > >+{
> > >+	/* Currently only memory type of EPT leaf entries are affected by
> > >+	 * guest CR0.CD and guest MTRR.
> > >+	 * So skip invalidation (zap) in other cases
> > >+	 */
> > >+	if (!shadow_memtype_mask)
> > 
> > Do you need to check kvm_arch_has_noncoherent_dma()? if yes, maybe just extract
> > the first if() statement and its associated comment from kvm_tdp_page_fault()?
> >
> This kvm_zap_gfn_for_memtype() helper will be called in
> kvm_arch_unregister_noncoherent_dma() as well when noncoherent_dma_count
> goes from 1 to 0.
> So, checking kvm_arch_has_noncoherent_dma() is not desired.
> 
> > >+		return;
> > >+
> > >+	kvm_zap_gfn_range(kvm, gpa_to_gfn(0), gpa_to_gfn(~0ULL));
> > 
> > This should be:
> > 
> > 	kvm_zap_gfn_range(kvm, gfn_start, gfn_end);
> >
> Oops. sorry about this mistake. Will fix it in next version.

Rather than add a helper to zap "for memtype", add a helper to query if KVM honors
guest MTRRs.  The "for memtype" isn't intuitive and ends up being incomplete.
That will also allow for deduplicating other code (and a comment).  Waiting until
the zap also wastes cycles in the MTRR case; there's no need to compute start and
end if the zap will ultimately be skipped.

E.g. over two or three patches:

---
 arch/x86/kvm/mmu.h     |  1 +
 arch/x86/kvm/mmu/mmu.c | 19 ++++++++++++++-----
 arch/x86/kvm/mtrr.c    |  2 +-
 arch/x86/kvm/x86.c     |  2 +-
 4 files changed, 17 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 92d5a1924fc1..c3c50ec9d9db 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -103,6 +103,7 @@ static inline u8 kvm_get_shadow_phys_bits(void)
 void kvm_mmu_set_mmio_spte_mask(u64 mmio_value, u64 mmio_mask, u64 access_mask);
 void kvm_mmu_set_me_spte_mask(u64 me_value, u64 me_mask);
 void kvm_mmu_set_ept_masks(bool has_ad_bits, bool has_exec_only);
+bool kvm_mmu_honors_guest_mtrrs(struct kvm *kvm);
 
 void kvm_init_mmu(struct kvm_vcpu *vcpu);
 void kvm_init_shadow_npt_mmu(struct kvm_vcpu *vcpu, unsigned long cr0,
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index c8961f45e3b1..a2b1723bc6a4 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4512,12 +4512,10 @@ static int kvm_tdp_mmu_page_fault(struct kvm_vcpu *vcpu,
 }
 #endif
 
-int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
+bool kvm_mmu_honors_guest_mtrrs(struct kvm *kvm)
 {
 	/*
-	 * If the guest's MTRRs may be used to compute the "real" memtype,
-	 * restrict the mapping level to ensure KVM uses a consistent memtype
-	 * across the entire mapping.  If the host MTRRs are ignored by TDP
+	 * If the TDP is enabled, the host MTRRs are ignored by TDP
 	 * (shadow_memtype_mask is non-zero), and the VM has non-coherent DMA
 	 * (DMA doesn't snoop CPU caches), KVM's ABI is to honor the memtype
 	 * from the guest's MTRRs so that guest accesses to memory that is
@@ -4526,7 +4524,18 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 	 * Note, KVM may still ultimately ignore guest MTRRs for certain PFNs,
 	 * e.g. KVM will force UC memtype for host MMIO.
 	 */
-	if (shadow_memtype_mask && kvm_arch_has_noncoherent_dma(vcpu->kvm)) {
+	return tdp_enabled && shadow_memtype_mask &&
+	       kvm_arch_has_noncoherent_dma(kvm);
+}
+
+int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
+{
+	/*
+	 * If the guest's MTRRs may be used to compute the "real" memtype,
+	 * restrict the mapping level to ensure KVM uses a consistent memtype
+	 * across the entire mapping.
+	 */
+	if (kvm_mmu_honors_guest_mtrrs(vcpu->kvm)) {
 		for ( ; fault->max_level > PG_LEVEL_4K; --fault->max_level) {
 			int page_num = KVM_PAGES_PER_HPAGE(fault->max_level);
 			gfn_t base = gfn_round_for_level(fault->gfn,
diff --git a/arch/x86/kvm/mtrr.c b/arch/x86/kvm/mtrr.c
index 3eb6e7f47e96..a67c28a56417 100644
--- a/arch/x86/kvm/mtrr.c
+++ b/arch/x86/kvm/mtrr.c
@@ -320,7 +320,7 @@ static void update_mtrr(struct kvm_vcpu *vcpu, u32 msr)
 	struct kvm_mtrr *mtrr_state = &vcpu->arch.mtrr_state;
 	gfn_t start, end;
 
-	if (!tdp_enabled || !kvm_arch_has_noncoherent_dma(vcpu->kvm))
+	if (!kvm_mmu_honors_guest_mtrrs(vcpu->kvm))
 		return;
 
 	if (!mtrr_is_enabled(mtrr_state) && msr != MSR_MTRRdefType)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 076d50f6321b..977dceb7ba7e 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -942,7 +942,7 @@ void kvm_post_set_cr0(struct kvm_vcpu *vcpu, unsigned long old_cr0, unsigned lon
 		kvm_mmu_reset_context(vcpu);
 
 	if (((cr0 ^ old_cr0) & X86_CR0_CD) &&
-	    kvm_arch_has_noncoherent_dma(vcpu->kvm) &&
+	    kvm_mmu_honors_guest_mtrrs(vcpu->kvm) &&
 	    !kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_CD_NW_CLEARED))
 		kvm_zap_gfn_range(vcpu->kvm, 0, ~0ULL);
 }

base-commit: 8d4a1b4011c125b1510254b724433aaae746ce14
-- 


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* Re: [PATCH v2 5/6] KVM: x86: Keep a per-VM MTRR state
  2023-05-23  6:21     ` Yan Zhao
@ 2023-05-24  0:13       ` Sean Christopherson
  2023-05-24 11:03         ` Yan Zhao
  2023-05-25  7:21       ` Robert Hoo
  1 sibling, 1 reply; 41+ messages in thread
From: Sean Christopherson @ 2023-05-24  0:13 UTC (permalink / raw)
  To: Yan Zhao; +Cc: Robert Hoo, kvm, linux-kernel, pbonzini

On Tue, May 23, 2023, Yan Zhao wrote:
> On Sun, May 21, 2023 at 11:44:36AM +0800, Robert Hoo wrote:
> > On 5/9/2023 9:53 PM, Yan Zhao wrote:
> > > Keep a per-VM MTRR state and point it to the MTRR state of vCPU 0.
> > > 
> > > This is a preparation patch for KVM to reference a per-VM guest MTRR
> > > to decide memory type of EPT leaf entries when noncoherent DMA is present.
> > > 
> > > Though each vCPU has its own MTRR state, MTRR states should be
> > > consistent across each VM, which is demanded as in Intel's SDM
> > > "In a multiprocessor system using a processor in the P6 family or a more
> > > recent family, each processor MUST use the identical MTRR memory map so
> > > that software will have a consistent view of memory."
> > > 
> > > Therefore, when memory type of EPT leaf entry needs to honor guest MTRR,
> > > a per-VM version of guest MTRR can be referenced.
> > > 
> > > Each vCPU still has its own MTRR state field to keep guest rdmsr()
> > > returning the right value when there's lag of MTRR update for each vCPU.
> > > 
> > Can we get rid of per-vCPU MTRR state copies and just have this per-VM state
> > only? therefore can simplify implementation and avoid hazard of
> > inconsistency among per-VPU MTRR states.
> > 
> > I see in SDM, it notes:
> > "In multiple processor systems, the operating system must maintain MTRR
> > consistency between all the processors in the system (that is, all
> > processors must use the same MTRR values). The P6 and more recent processor
> > families provide no hardware support for maintaining this consistency."
> > 
> > leaving each vCPU's MTRR is just to fully mimic HW?
> >
> Yes, leaving each vCPU's MTRR to mimic HW.
> 
> As also suggested in SDM, the guest OS manipulates MTRRs in this way:
> 
> for each online CPUs {
> 	disable MTRR
> 	update fixed/var MTRR ranges
> 	enable MTRR
> }
> Guest OS needs to access memory only after this full pattern.

FWIW, Linux doesn't use that approach.  Linux instead only puts the cache
into no-fill mode (CR0.CD=1) when modifying MTRRs.  OVMF does both (and apparently
doesn't optimize for self-snoop?).

> So, I think there should not have "hazard of inconsistency among per-VPU MTRR
> states".

Probably not, but despite the SDM's assertion that software "MUST" keep things
consistent, that's not actually reality.  E.g. a careful kernel that partitions
memory doesn't need to strictly keep MTRRs consistent.  Ditto for scenarios where
CPUs are offline (for a variety of definitions of "offline"), in which case only
online CPUs need to have consistent MTRRs, e.g. before APs are brought up, MTRRs
are obviously inconsistent.

> I want to have per-VM MTRR state is because I want to reduce unnessary EPT
> zap, which costs quite a lot cpu cycles even when the EPT is empty.
> 
> In this patch, per-VM MTRR pointer is used to point to vCPU 0's MTRR state,
> so that it can save some memory to keep the MTRR state.
> But I found out that it would only work when vCPU 0 (boot processor) is
> always online (which is not true for x86 under some configration).
> 
> I'll try to find out lowest online vCPU

Picking a single vCPU will always be subject to edge cases.  E.g. I can see something
like kexec() "offlining" KVM's chosen vCPU and then having problems because KVM
ignores MTRR updates from other vCPUs in the new kernel.

One idea would be to let all vCPUs write the per-VM state, and zap if and only if
the MTRRs are actually different.  But as above, I'm on the fence about diverging
from what hardware actually does with respect to MTRRs.

Argh, stupid quirks.  I was going to suggest we take advantage of the guest needing
to either disable MTRRs or put the cache into no-fill mode to avoid zapping SPTEs,
i.e. skip the zap if CR0.CD=1, but KVM's implementation of the quirk is all kinds
of messed up.  KVM ignores MTRRs and guest PAT when CR0.CD=1, but then doesn't zap
when CR0.CD is cleared:

	if (((cr0 ^ old_cr0) & X86_CR0_CD) &&
	    kvm_mmu_honors_guest_mtrrs(vcpu->kvm) &&
	    !kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_CD_NW_CLEARED))
		kvm_zap_gfn_range(vcpu->kvm, 0, ~0ULL);

which means that either KVM is relying on later MTRR changes to zap the bogus
WB SPTEs, or more likely things "work" because KVM doesn't install SPTEs for the
ranges that end up being DMA'd while CR0.CD is set.

*sigh*  What an absolutely asinine quirk.  KVM should never have taken it on to
workaround OVMF's stupidity.  Good gravy, that thing was even Cc'd stable@.  Yeesh.

Aha!  Idea.  There should be no need to ignore guest PAT when CR0.CD=1 and the
quirk is enabled.  Even OVMF must be smart enough to map its code as WB in its page
tables.  And there is no need to zap SPTEs when CR0.CD is _set_, because SPTEs
created while CR0.CD=0 would have honored MTRRs.  Lastly, DMA when CR0.CD=1 and
the quirk is enabled simply can't work since KVM historically ignores everything
from the guest and maps all memory WB.

So, if KVM zaps SPTEs when CR0.CD is cleared (even when the quirk is enabled),
then KVM can skip the MTRR zaps when CR0.CD=1 because KVM is ignoring the MTRRs
and will zap when CR0.CD is cleared.  And to avoid regressing the CR0.CD case,
if KVM honors guest PAT for the bizarro CR0.CD quirk, then KVM only needs to
zap non-WB MTRR ranges when CR0.CD is cleared.  Since WB is weak, looking for
non-WB MTRR ranges doesn't need to actually resolve the memtype, i.e. can be
simple and just process MTRRs one by one.

Did that make sense?  Minus the code to identify non-WB MTRR ranges (and much
needed comments), the below is what I'm thinking.  If more intelligent zapping
provides the desired performance improvements, then I think/hope we avoid trying
to play games with MTRRs.

---
 arch/x86/kvm/mtrr.c    | 19 +++++++++++++++++++
 arch/x86/kvm/vmx/vmx.c |  8 ++------
 arch/x86/kvm/x86.c     |  6 ++----
 3 files changed, 23 insertions(+), 10 deletions(-)

diff --git a/arch/x86/kvm/mtrr.c b/arch/x86/kvm/mtrr.c
index a67c28a56417..e700c230268b 100644
--- a/arch/x86/kvm/mtrr.c
+++ b/arch/x86/kvm/mtrr.c
@@ -323,6 +323,9 @@ static void update_mtrr(struct kvm_vcpu *vcpu, u32 msr)
 	if (!kvm_mmu_honors_guest_mtrrs(vcpu->kvm))
 		return;
 
+	if (kvm_is_cr0_bit_set(vcpu, X86_CR0_CD))
+		return;
+
 	if (!mtrr_is_enabled(mtrr_state) && msr != MSR_MTRRdefType)
 		return;
 
@@ -375,6 +378,22 @@ static void set_var_mtrr_msr(struct kvm_vcpu *vcpu, u32 msr, u64 data)
 	}
 }
 
+void kvm_mtrr_post_set_cr0(struct kvm_vcpu *vcpu, unsigned long cr0)
+{
+	if (cr0 & X86_CR0_CD)
+		return;
+
+	if (!kvm_mmu_honors_guest_mtrrs(vcpu->kvm))
+		return;
+
+	if (!kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_CD_NW_CLEARED)) {
+		kvm_zap_gfn_range(vcpu->kvm, 0, ~0ULL);
+		return;
+	}
+
+	<zap non-WB memory>;
+}
+
 int kvm_mtrr_set_msr(struct kvm_vcpu *vcpu, u32 msr, u64 data)
 {
 	int index;
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 2d9d155691a7..962abc17afc0 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7502,8 +7502,6 @@ static int vmx_vm_init(struct kvm *kvm)
 
 static u8 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
 {
-	u8 cache;
-
 	/* We wanted to honor guest CD/MTRR/PAT, but doing so could result in
 	 * memory aliases with conflicting memory types and sometimes MCEs.
 	 * We have to be careful as to what are honored and when.
@@ -7530,11 +7528,9 @@ static u8 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
 
 	if (kvm_read_cr0_bits(vcpu, X86_CR0_CD)) {
 		if (kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_CD_NW_CLEARED))
-			cache = MTRR_TYPE_WRBACK;
+			return MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT;
 		else
-			cache = MTRR_TYPE_UNCACHABLE;
-
-		return (cache << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT;
+			return (MTRR_TYPE_UNCACHABLE << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT;
 	}
 
 	return kvm_mtrr_get_guest_memory_type(vcpu, gfn) << VMX_EPT_MT_EPTE_SHIFT;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 977dceb7ba7e..3c085b6f9e3c 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -941,10 +941,8 @@ void kvm_post_set_cr0(struct kvm_vcpu *vcpu, unsigned long old_cr0, unsigned lon
 	if ((cr0 ^ old_cr0) & KVM_MMU_CR0_ROLE_BITS)
 		kvm_mmu_reset_context(vcpu);
 
-	if (((cr0 ^ old_cr0) & X86_CR0_CD) &&
-	    kvm_mmu_honors_guest_mtrrs(vcpu->kvm) &&
-	    !kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_CD_NW_CLEARED))
-		kvm_zap_gfn_range(vcpu->kvm, 0, ~0ULL);
+	if ((cr0 ^ old_cr0) & X86_CR0_CD)
+		kvm_mtrr_post_set_cr0(vcpu, cr0);
 }
 EXPORT_SYMBOL_GPL(kvm_post_set_cr0);
 

base-commit: 7e998f35c57cb0a2c35f8545b54ab5f4f5a83841
-- 


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* Re: [PATCH v2 0/6] KVM: x86/mmu: refine memtype related mmu zap
  2023-05-09 13:48 [PATCH v2 0/6] KVM: x86/mmu: refine memtype related mmu zap Yan Zhao
                   ` (5 preceding siblings ...)
  2023-05-09 13:53 ` [PATCH v2 6/6] KVM: x86/mmu: use per-VM based MTRR for EPT Yan Zhao
@ 2023-05-24  0:15 ` Sean Christopherson
  2023-05-24 11:04   ` Yan Zhao
  6 siblings, 1 reply; 41+ messages in thread
From: Sean Christopherson @ 2023-05-24  0:15 UTC (permalink / raw)
  To: Yan Zhao; +Cc: kvm, linux-kernel, pbonzini

On Tue, May 09, 2023, Yan Zhao wrote:
> This series refines mmu zap caused by EPT memory type update.
> Yan Zhao (6):
>   KVM: x86/mmu: add a new mmu zap helper to indicate memtype changes
>   KVM: x86/mmu: only zap EPT when guest CR0_CD changes
>   KVM: x86/mmu: only zap EPT when guest MTRR changes
>   KVM: x86/mmu: Zap all EPT leaf entries according noncoherent DMA count
>   KVM: x86: Keep a per-VM MTRR state
>   KVM: x86/mmu: use per-VM based MTRR for EPT
> 
>  arch/x86/include/asm/kvm_host.h |   3 +
>  arch/x86/kvm/mmu.h              |   1 +
>  arch/x86/kvm/mmu/mmu.c          |  18 ++++-
>  arch/x86/kvm/mtrr.c             | 112 +++++++++++++++++++++++++-------
>  arch/x86/kvm/vmx/vmx.c          |   2 +-
>  arch/x86/kvm/x86.c              |  10 ++-
>  arch/x86/kvm/x86.h              |   6 +-
>  7 files changed, 122 insertions(+), 30 deletions(-)
> 
> 
> base-commit: 5c291b93e5d665380dbecc6944973583f9565ee5
> -- 

To save us both a bit of pain, can you base the next version on top of my PAT
cleanup[*]?  I am planning on applying that series "soon".  Thanks!

[*] https://lore.kernel.org/all/20230511233351.635053-1-seanjc@google.com 

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v2 1/6] KVM: x86/mmu: add a new mmu zap helper to indicate memtype changes
  2023-05-23 22:51       ` Sean Christopherson
@ 2023-05-24  2:22         ` Yan Zhao
  2023-05-24 14:50           ` Sean Christopherson
  0 siblings, 1 reply; 41+ messages in thread
From: Yan Zhao @ 2023-05-24  2:22 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: Chao Gao, kvm, linux-kernel, pbonzini

On Tue, May 23, 2023 at 03:51:49PM -0700, Sean Christopherson wrote:
> Rather than add a helper to zap "for memtype", add a helper to query if KVM honors
> guest MTRRs.  The "for memtype" isn't intuitive and ends up being incomplete.
> That will also allow for deduplicating other code (and a comment).  Waiting until
> the zap also wastes cycles in the MTRR case, there's no need to compute start and
> end if the zap will ultimately be skipped.
Agreed.

> 
> E.g. over two or three patches:
> 
> ---
>  arch/x86/kvm/mmu.h     |  1 +
>  arch/x86/kvm/mmu/mmu.c | 19 ++++++++++++++-----
>  arch/x86/kvm/mtrr.c    |  2 +-
>  arch/x86/kvm/x86.c     |  2 +-
>  4 files changed, 17 insertions(+), 7 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
> index 92d5a1924fc1..c3c50ec9d9db 100644
> --- a/arch/x86/kvm/mmu.h
> +++ b/arch/x86/kvm/mmu.h
> @@ -103,6 +103,7 @@ static inline u8 kvm_get_shadow_phys_bits(void)
>  void kvm_mmu_set_mmio_spte_mask(u64 mmio_value, u64 mmio_mask, u64 access_mask);
>  void kvm_mmu_set_me_spte_mask(u64 me_value, u64 me_mask);
>  void kvm_mmu_set_ept_masks(bool has_ad_bits, bool has_exec_only);
> +bool kvm_mmu_honors_guest_mtrrs(struct kvm *kvm);
>  
>  void kvm_init_mmu(struct kvm_vcpu *vcpu);
>  void kvm_init_shadow_npt_mmu(struct kvm_vcpu *vcpu, unsigned long cr0,
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index c8961f45e3b1..a2b1723bc6a4 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -4512,12 +4512,10 @@ static int kvm_tdp_mmu_page_fault(struct kvm_vcpu *vcpu,
>  }
>  #endif
>  
> -int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> +bool kvm_mmu_honors_guest_mtrrs(struct kvm *kvm)
>  {
>  	/*
> -	 * If the guest's MTRRs may be used to compute the "real" memtype,
> -	 * restrict the mapping level to ensure KVM uses a consistent memtype
> -	 * across the entire mapping.  If the host MTRRs are ignored by TDP
> +	 * If the TDP is enabled, the host MTRRs are ignored by TDP
>  	 * (shadow_memtype_mask is non-zero), and the VM has non-coherent DMA
>  	 * (DMA doesn't snoop CPU caches), KVM's ABI is to honor the memtype
>  	 * from the guest's MTRRs so that guest accesses to memory that is
> @@ -4526,7 +4524,18 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
>  	 * Note, KVM may still ultimately ignore guest MTRRs for certain PFNs,
>  	 * e.g. KVM will force UC memtype for host MMIO.
>  	 */
> -	if (shadow_memtype_mask && kvm_arch_has_noncoherent_dma(vcpu->kvm)) {
> +	return tdp_enabled && shadow_memtype_mask &&
> +	       kvm_arch_has_noncoherent_dma(kvm);
> +}
> +
> +int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> +{
> +	/*
> +	 * If the guest's MTRRs may be used to compute the "real" memtype,
> +	 * restrict the mapping level to ensure KVM uses a consistent memtype
> +	 * across the entire mapping.
> +	 */
> +	if (kvm_mmu_honors_guest_mtrrs(vcpu->kvm)) {
>  		for ( ; fault->max_level > PG_LEVEL_4K; --fault->max_level) {
>  			int page_num = KVM_PAGES_PER_HPAGE(fault->max_level);
>  			gfn_t base = gfn_round_for_level(fault->gfn,
> diff --git a/arch/x86/kvm/mtrr.c b/arch/x86/kvm/mtrr.c
> index 3eb6e7f47e96..a67c28a56417 100644
> --- a/arch/x86/kvm/mtrr.c
> +++ b/arch/x86/kvm/mtrr.c
> @@ -320,7 +320,7 @@ static void update_mtrr(struct kvm_vcpu *vcpu, u32 msr)
>  	struct kvm_mtrr *mtrr_state = &vcpu->arch.mtrr_state;
>  	gfn_t start, end;
>  
> -	if (!tdp_enabled || !kvm_arch_has_noncoherent_dma(vcpu->kvm))
> +	if (!kvm_mmu_honors_guest_mtrrs(vcpu->kvm))
Could we also add another helper kvm_mmu_cap_honors_guest_mtrrs(), which
does not check kvm_arch_has_noncoherent_dma()?

+static inline bool kvm_mmu_cap_honors_guest_mtrrs(struct kvm *kvm)
+{
+       return !!shadow_memtype_mask;
+}

This is because in patch 4 I plan to do the EPT zap when
noncoherent_dma_count goes from 1 to 0.

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 41d7bb51a297..ad0c43d7f532 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -13146,13 +13146,19 @@ EXPORT_SYMBOL_GPL(kvm_arch_has_assigned_device);

 void kvm_arch_register_noncoherent_dma(struct kvm *kvm)
 {
-       atomic_inc(&kvm->arch.noncoherent_dma_count);
+       if (atomic_inc_return(&kvm->arch.noncoherent_dma_count) == 1) {
+               if (kvm_mmu_cap_honors_guest_mtrrs(kvm))
+                       kvm_zap_gfn_range(kvm, 0, ~0ULL);
+       }
 }
 EXPORT_SYMBOL_GPL(kvm_arch_register_noncoherent_dma);

 void kvm_arch_unregister_noncoherent_dma(struct kvm *kvm)
 {
-       atomic_dec(&kvm->arch.noncoherent_dma_count);
+       if (!atomic_dec_return(&kvm->arch.noncoherent_dma_count)) {
+               if (kvm_mmu_cap_honors_guest_mtrrs(kvm))
+                       kvm_zap_gfn_range(kvm, 0, ~0ULL);
+       }
 }
 EXPORT_SYMBOL_GPL(kvm_arch_unregister_noncoherent_dma);


>  		return;
>  
>  	if (!mtrr_is_enabled(mtrr_state) && msr != MSR_MTRRdefType)
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 076d50f6321b..977dceb7ba7e 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -942,7 +942,7 @@ void kvm_post_set_cr0(struct kvm_vcpu *vcpu, unsigned long old_cr0, unsigned lon
>  		kvm_mmu_reset_context(vcpu);
>  
>  	if (((cr0 ^ old_cr0) & X86_CR0_CD) &&
> -	    kvm_arch_has_noncoherent_dma(vcpu->kvm) &&
> +	    kvm_mmu_honors_guest_mtrrs(vcpu->kvm) &&
>  	    !kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_CD_NW_CLEARED))
>  		kvm_zap_gfn_range(vcpu->kvm, 0, ~0ULL);
>  }
> 
> base-commit: 8d4a1b4011c125b1510254b724433aaae746ce14
> -- 
> 

^ permalink raw reply related	[flat|nested] 41+ messages in thread

* Re: [PATCH v2 5/6] KVM: x86: Keep a per-VM MTRR state
  2023-05-24  0:13       ` Sean Christopherson
@ 2023-05-24 11:03         ` Yan Zhao
  2023-05-24 18:21           ` Sean Christopherson
  0 siblings, 1 reply; 41+ messages in thread
From: Yan Zhao @ 2023-05-24 11:03 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: Robert Hoo, kvm, linux-kernel, pbonzini

On Tue, May 23, 2023 at 05:13:38PM -0700, Sean Christopherson wrote:
> On Tue, May 23, 2023, Yan Zhao wrote:
> > On Sun, May 21, 2023 at 11:44:36AM +0800, Robert Hoo wrote:
> > > On 5/9/2023 9:53 PM, Yan Zhao wrote:
> > > > Keep a per-VM MTRR state and point it to the MTRR state of vCPU 0.
> > > > 
> > > > This is a preparation patch for KVM to reference a per-VM guest MTRR
> > > > to decide memory type of EPT leaf entries when noncoherent DMA is present.
> > > > 
> > > > Though each vCPU has its own MTRR state, MTRR states should be
> > > > consistent across each VM, which is demanded as in Intel's SDM
> > > > "In a multiprocessor system using a processor in the P6 family or a more
> > > > recent family, each processor MUST use the identical MTRR memory map so
> > > > that software will have a consistent view of memory."
> > > > 
> > > > Therefore, when memory type of EPT leaf entry needs to honor guest MTRR,
> > > > a per-VM version of guest MTRR can be referenced.
> > > > 
> > > > Each vCPU still has its own MTRR state field to keep guest rdmsr()
> > > > returning the right value when there's lag of MTRR update for each vCPU.
> > > > 
> > > Can we get rid of per-vCPU MTRR state copies and just have this per-VM state
> > > only? therefore can simplify implementation and avoid hazard of
> > > inconsistency among per-VPU MTRR states.
> > > 
> > > I see in SDM, it notes:
> > > "In multiple processor systems, the operating system must maintain MTRR
> > > consistency between all the processors in the system (that is, all
> > > processors must use the same MTRR values). The P6 and more recent processor
> > > families provide no hardware support for maintaining this consistency."
> > > 
> > > leaving each vCPU's MTRR is just to fully mimic HW?
> > >
> > Yes, leaving each vCPU's MTRR to mimic HW.
> > 
> > As also suggested in SDM, the guest OS manipulates MTRRs in this way:
> > 
> > for each online CPUs {
> > 	disable MTRR
> > 	update fixed/var MTRR ranges
> > 	enable MTRR
> > }
> > Guest OS needs to access memory only after this full pattern.
> 
> FWIW, that Linux doesn't use that approach.  Linux instead only puts the cache
> into no-fill mode (CR0.CD=1) when modifying MTRRs.  OVMF does both (and apparently
> doesn't optimize for self-snoop?).
I think Linux also follows this pattern.
This is the call trace I found out in my environment.
cache_cpu_init
    cache_disable
        write_cr0 to CD=1, NW=0
        mtrr_disable
    mtrr_generic_set_state
        mtrr_wrmsr to fixed/var ranges
    cache_enable
        mtrr_enable
        write_cr0(read_cr0() & ~X86_CR0_CD);

For PAT not enabled in guest,
arch_phys_wc_add
    mtrr_add
        mtrr_add_page
            set_mtrr_cpuslocked
                mtrr_rendezvous_handler on each online cpu
                    generic_set_mtrr
                        cache_disable
                            write_cr0 to CD=1, NW=0
                            mtrr_disable
                        mtrr_wrmsr
                        cache_enable


For OVMF and Seabios, I added dumps in the host to observe their MTRR behaviors:
1. OVMF updates MTRRs in the below sequence
(1) vCPU 0
    CR0.CD=1
    MTRR disable
    MTRR update
    MTRR enable
    CR0.CD=0
(2) vCPU 1-n
    CR0.CD=1
    MTRR disable
(3) vCPU 1-n    
    MTRR update
(4) vCPU 1-n
    MTRR enable
    CR0.CD=0

2. Seabios can update MTRRs both when CR0.CD=0 and when CR0.CD=1, and follows the below
sequence on each CPU:
    MTRR disable
    MTRR update
    MTRR enable
I found it can update MTRRs even when CR0.CD=0, to the values below:
MSR_MTRRfix16K_80000(0x258): 0x606060606060606
MSR_MTRRfix4K_C0000(0x268): 0x505050505050505
...
MSR_MTRRfix4K_F8000(0x26f): 0x505050505050505
MTRRphysBase_MSR(0): 0xc0000000
MTRRphysMask_MSR(0): 0x3fffc0000800

> 
> > So, I think there should not have "hazard of inconsistency among per-VPU MTRR
> > states".
> 
> Probably not, but despite the SDM's assertion that software "MUST" keep things
> consistent, that's not actually reality.  E.g. a careful kernel that partitions
> memory doesn't need to strictly keep MTRRs consistent.  Ditto for scenarios where
> CPUs are offline (for a variety of definitions of "offline"), in which case only
> online CPUs need to have consistent MTRRs, e.g. before APs are brought up, MTRRs
> are obviously inconsistent.
>
Yes. By "no inconsistency", I mean online vCPUs(who trigger active memory accesses)
need to all in a state that their MTRRs are consistent.
Offline vCPUs, as they will not access memory, it doesn't matter.
But once they are back to online, their MTRRs need to be in sync with online vCPUs.

> > I want to have per-VM MTRR state is because I want to reduce unnessary EPT
> > zap, which costs quite a lot cpu cycles even when the EPT is empty.
> > 
> > In this patch, per-VM MTRR pointer is used to point to vCPU 0's MTRR state,
> > so that it can save some memory to keep the MTRR state.
> > But I found out that it would only work when vCPU 0 (boot processor) is
> > always online (which is not true for x86 under some configration).
> > 
> > I'll try to find out lowest online vCPU
> 
> Picking a single vCPU will always be subject to edge cases.  E.g. I can see something
> like kexec() "offlining" KVM's chosen vCPU and then having problems because KVM
> ignores MTRR updates from other vCPUs in the new kernel.
>
Not familiar with kexec().
I wanted to trap APIC_SPIV and find the lowest online vCPU id by checking
apic->sw_enabled status. Then only the MTRR state of the vCPU whose id equals
the lowest online vCPU id can be written to the per-VM MTRR state.
Will that work for kexec()?
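Roughly what I have in mind is something like below (a rough sketch only; the
helper name is made up, and locking/ordering concerns are ignored):

static struct kvm_vcpu *kvm_lowest_online_vcpu(struct kvm *kvm)
{
	struct kvm_vcpu *vcpu;
	unsigned long i;

	/*
	 * kvm_for_each_vcpu() walks vCPUs in index order, so the first vCPU
	 * with a software-enabled local APIC (APIC_SPIV bit 8, i.e.
	 * apic->sw_enabled) is the lowest "online" one.
	 */
	kvm_for_each_vcpu(i, vcpu, kvm) {
		if (lapic_in_kernel(vcpu) && vcpu->arch.apic->sw_enabled)
			return vcpu;
	}

	return NULL;
}

and only MTRR writes from that vCPU would be copied into the per-VM state.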

> One idea would be to let all vCPUs write the per-VM state, and zap if and only if
> the MTRRs are actually different.  But as above, I'm on the fence about diverging
> from what hardware actually does with respect to MTRRs.
> 
> Argh, stupid quirks.  I was going to suggest we take advantage of the guest needing
> to either disable MTRRs or put the cache into no-fill mode to avoid zapping SPTEs,
> i.e. skip the zap if CR0.CD=1, but KVM's implementation of the quirk is all kinds
> of messed up.  KVM ignores MTRRs and guest PAT when CR0.CD=1, but then doesn't zap
> when CR0.CD is cleared:
> 
> 	if (((cr0 ^ old_cr0) & X86_CR0_CD) &&
> 	    kvm_mmu_honors_guest_mtrrs(vcpu->kvm) &&
> 	    !kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_CD_NW_CLEARED))
> 		kvm_zap_gfn_range(vcpu->kvm, 0, ~0ULL);
> 
> which means that either KVM is relying on later MTRR changes to zap the bogus
> WB SPTEs, or more likely things "work" because KVM doesn't install SPTEs for the
> ranges that end up being DMA'd while CR0.CD is set.
> 
> *sigh*  What an absolutely asinine quirk.  KVM should never have taken it on to
> workaround OVMF's stupidity.  Good gravy, that thing was even Cc'd stable@.  Yeesh.
> 
> Aha!  Idea.  There's should be no need to ignore guest PAT when CR0.CD=1 and the
> quirk is enabled.  Even OVMF must be smart enough to map its code as WB in its page
> tables.  And there is no need to zap SPTEs when CR0.CD is _set_, because SPTEs
> created while CR0.CD=0 would have honored MTRRs.  Lastly, DMA when CR0.CD=1 and
> the quirk is enabled simply can't work since KVM historically ignores everything
> from the guest and maps all memory WB.
> 
> So, if KVM zaps SPTEs when CR0.CD is cleared (even when the quirk is enabled),
> then KVM can skip the MTRR zaps when CR0.CD=1 because KVM is ignoring the MTRRs
> and will zap when CR0.CD is cleared.  And to avoid regressing the CR0.CD case,
> if KVM honors guest PAT for the bizarro CR0.CD quirk, then KVM only needs to
> zap non-WB MTRR ranges when CR0.CD is cleared.  Since WB is weak, looking for
> non-WB MTRR ranges doesn't need to actually resolve the memtype, i.e. can be simple
> and just
> process MTRRs one by one.
> 
> Did that make sense?  Minus the code to identify non-WB MTRR ranges (and much
> needed comments), the below is what I'm thinking.  If more intelligent zapping
> provides the desired performance improvements, then I think/hope we avoid trying
> to play games with MTRRs.
> 
> ---
>  arch/x86/kvm/mtrr.c    | 19 +++++++++++++++++++
>  arch/x86/kvm/vmx/vmx.c |  8 ++------
>  arch/x86/kvm/x86.c     |  6 ++----
>  3 files changed, 23 insertions(+), 10 deletions(-)
> 
> diff --git a/arch/x86/kvm/mtrr.c b/arch/x86/kvm/mtrr.c
> index a67c28a56417..e700c230268b 100644
> --- a/arch/x86/kvm/mtrr.c
> +++ b/arch/x86/kvm/mtrr.c
> @@ -323,6 +323,9 @@ static void update_mtrr(struct kvm_vcpu *vcpu, u32 msr)
>  	if (!kvm_mmu_honors_guest_mtrrs(vcpu->kvm))
>  		return;
>  
> +	if (kvm_is_cr0_bit_set(vcpu, X86_CR0_CD))
> +		return;
This will always make update_mtrr() return here for Linux and OVMF. 
> +
>  	if (!mtrr_is_enabled(mtrr_state) && msr != MSR_MTRRdefType)
>  		return;
>  
> @@ -375,6 +378,22 @@ static void set_var_mtrr_msr(struct kvm_vcpu *vcpu, u32 msr, u64 data)
>  	}
>  }
>  
> +void kvm_mtrr_post_set_cr0(struct kvm_vcpu *vcpu, unsigned long cr0)
> +{
> +	if (cr0 & X86_CR0_CD)
> +		return;
> +
> +	if (!kvm_mmu_honors_guest_mtrrs(vcpu->kvm))
> +		return;
> +
> +	if (!kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_CD_NW_CLEARED)) {
> +		kvm_zap_gfn_range(vcpu->kvm, 0, ~0ULL);
> +		return;
> +	}
> +
> +	<zap non-WB memory>;
This zap looks like it will happen on each vCPU. Then only half of the zaps are
saved compared to the original count of zaps in update_mtrr().
And pieces of non-WB memory might produce more kvm_zap_gfn_range() calls?


> +}
> +
>  int kvm_mtrr_set_msr(struct kvm_vcpu *vcpu, u32 msr, u64 data)
>  {
>  	int index;
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 2d9d155691a7..962abc17afc0 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -7502,8 +7502,6 @@ static int vmx_vm_init(struct kvm *kvm)
>  
>  static u8 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
>  {
> -	u8 cache;
> -
>  	/* We wanted to honor guest CD/MTRR/PAT, but doing so could result in
>  	 * memory aliases with conflicting memory types and sometimes MCEs.
>  	 * We have to be careful as to what are honored and when.
> @@ -7530,11 +7528,9 @@ static u8 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
>  
>  	if (kvm_read_cr0_bits(vcpu, X86_CR0_CD)) {
>  		if (kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_CD_NW_CLEARED))
> -			cache = MTRR_TYPE_WRBACK;
> +			return MTRR_TYPE_WRBACK;
>  		else
> -			cache = MTRR_TYPE_UNCACHABLE;
> -
> -		return (cache << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT;
> +			return (MTRR_TYPE_UNCACHABLE << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT;
>  	}
>  
>  	return kvm_mtrr_get_guest_memory_type(vcpu, gfn) << VMX_EPT_MT_EPTE_SHIFT;
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 977dceb7ba7e..3c085b6f9e3c 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -941,10 +941,8 @@ void kvm_post_set_cr0(struct kvm_vcpu *vcpu, unsigned long old_cr0, unsigned lon
>  	if ((cr0 ^ old_cr0) & KVM_MMU_CR0_ROLE_BITS)
>  		kvm_mmu_reset_context(vcpu);
>  
> -	if (((cr0 ^ old_cr0) & X86_CR0_CD) &&
> -	    kvm_mmu_honors_guest_mtrrs(vcpu->kvm) &&
> -	    !kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_CD_NW_CLEARED))
> -		kvm_zap_gfn_range(vcpu->kvm, 0, ~0ULL);
> +	if ((cr0 ^ old_cr0) & X86_CR0_CD))
> +		kvm_mtrr_post_set_cr0(vcpu, cr0);
>  }
>  EXPORT_SYMBOL_GPL(kvm_post_set_cr0);
>  
> 
> base-commit: 7e998f35c57cb0a2c35f8545b54ab5f4f5a83841
> -- 
> 

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v2 0/6] KVM: x86/mmu: refine memtype related mmu zap
  2023-05-24  0:15 ` [PATCH v2 0/6] KVM: x86/mmu: refine memtype related mmu zap Sean Christopherson
@ 2023-05-24 11:04   ` Yan Zhao
  0 siblings, 0 replies; 41+ messages in thread
From: Yan Zhao @ 2023-05-24 11:04 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: kvm, linux-kernel, pbonzini

On Tue, May 23, 2023 at 05:15:52PM -0700, Sean Christopherson wrote:
> On Tue, May 09, 2023, Yan Zhao wrote:
> > This series refines mmu zap caused by EPT memory type update.
> > Yan Zhao (6):
> >   KVM: x86/mmu: add a new mmu zap helper to indicate memtype changes
> >   KVM: x86/mmu: only zap EPT when guest CR0_CD changes
> >   KVM: x86/mmu: only zap EPT when guest MTRR changes
> >   KVM: x86/mmu: Zap all EPT leaf entries according noncoherent DMA count
> >   KVM: x86: Keep a per-VM MTRR state
> >   KVM: x86/mmu: use per-VM based MTRR for EPT
> > 
> >  arch/x86/include/asm/kvm_host.h |   3 +
> >  arch/x86/kvm/mmu.h              |   1 +
> >  arch/x86/kvm/mmu/mmu.c          |  18 ++++-
> >  arch/x86/kvm/mtrr.c             | 112 +++++++++++++++++++++++++-------
> >  arch/x86/kvm/vmx/vmx.c          |   2 +-
> >  arch/x86/kvm/x86.c              |  10 ++-
> >  arch/x86/kvm/x86.h              |   6 +-
> >  7 files changed, 122 insertions(+), 30 deletions(-)
> > 
> > 
> > base-commit: 5c291b93e5d665380dbecc6944973583f9565ee5
> > -- 
> 
> To save us both a bit of pain, can you base the next version on top of my PAT
> cleanup[*]?  I am planning on applying that series "soon".  Thanks!
> 
> [*] https://lore.kernel.org/all/20230511233351.635053-1-seanjc@google.com 
Sure, will do it!

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v2 1/6] KVM: x86/mmu: add a new mmu zap helper to indicate memtype changes
  2023-05-24  2:22         ` Yan Zhao
@ 2023-05-24 14:50           ` Sean Christopherson
  2023-05-25 10:14             ` Yan Zhao
  0 siblings, 1 reply; 41+ messages in thread
From: Sean Christopherson @ 2023-05-24 14:50 UTC (permalink / raw)
  To: Yan Zhao; +Cc: Chao Gao, kvm, linux-kernel, pbonzini

On Wed, May 24, 2023, Yan Zhao wrote:
> On Tue, May 23, 2023 at 03:51:49PM -0700, Sean Christopherson wrote:
> > diff --git a/arch/x86/kvm/mtrr.c b/arch/x86/kvm/mtrr.c
> > index 3eb6e7f47e96..a67c28a56417 100644
> > --- a/arch/x86/kvm/mtrr.c
> > +++ b/arch/x86/kvm/mtrr.c
> > @@ -320,7 +320,7 @@ static void update_mtrr(struct kvm_vcpu *vcpu, u32 msr)
> >  	struct kvm_mtrr *mtrr_state = &vcpu->arch.mtrr_state;
> >  	gfn_t start, end;
> >  
> > -	if (!tdp_enabled || !kvm_arch_has_noncoherent_dma(vcpu->kvm))
> > +	if (!kvm_mmu_honors_guest_mtrrs(vcpu->kvm))
> Could we also add another helper kvm_mmu_cap_honors_guest_mtrrs(), which
> does not check kvm_arch_has_noncoherent_dma()?
> 
> +static inline bool kvm_mmu_cap_honors_guest_mtrrs(struct kvm *kvm)
> +{
> +       return !!shadow_memtype_mask;
> +}
> 
> This is because in patch 4 I plan to do the EPT zap when
> noncoherent_dma_count goes from 1 to 0.

Hrm, the 1->0 transition is annoying.  Rather than trying to capture the "everything
except non-coherent DMA" aspect, what about this?

mmu.c:

bool __kvm_mmu_honors_guest_mtrrs(struct kvm *kvm, bool vm_has_noncoherent_dma)
{
	/*
	 * If the TDP is enabled, the host MTRRs are ignored by TDP
	 * (shadow_memtype_mask is non-zero), and the VM has non-coherent DMA
	 * (DMA doesn't snoop CPU caches), KVM's ABI is to honor the memtype
	 * from the guest's MTRRs so that guest accesses to memory that is
	 * DMA'd aren't cached against the guest's wishes.
	 *
	 * Note, KVM may still ultimately ignore guest MTRRs for certain PFNs,
	 * e.g. KVM will force UC memtype for host MMIO.
	 */
	return vm_has_noncoherent_dma && tdp_enabled && shadow_memtype_mask;
}

mmu.h:

bool __kvm_mmu_honors_guest_mtrrs(struct kvm *kvm, bool vm_has_noncoherent_dma);

static inline bool kvm_mmu_honors_guest_mtrrs(struct kvm *kvm)
{
	
	return __kvm_mmu_honors_guest_mtrrs(kvm, kvm_arch_has_noncoherent_dma(kvm));
}

> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 41d7bb51a297..ad0c43d7f532 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -13146,13 +13146,19 @@ EXPORT_SYMBOL_GPL(kvm_arch_has_assigned_device);
> 
>  void kvm_arch_register_noncoherent_dma(struct kvm *kvm)
>  {
> -       atomic_inc(&kvm->arch.noncoherent_dma_count);
> +       if (atomic_inc_return(&kvm->arch.noncoherent_dma_count) == 1) {
> +               if (kvm_mmu_cap_honors_guest_mtrrs(kvm))
> +                       kvm_zap_gfn_range(kvm, 0, ~0ULL);

No need for multiple if statements.  Though rather than have identical code in
both the start/end paths, how about this?  That provides a single location for a
comment.  Or maybe first/last instead of start/end?

static void kvm_noncoherent_dma_start_or_end(struct kvm *kvm)
{
	/* comment goes here. */
	if (__kvm_mmu_honors_guest_mtrrs(kvm, true))
		kvm_zap_gfn_range(kvm, 0, ~0ULL);
}

void kvm_arch_register_noncoherent_dma(struct kvm *kvm)
{
	if (atomic_inc_return(&kvm->arch.noncoherent_dma_count) == 1)
		kvm_noncoherent_dma_start_or_end(kvm);
}
EXPORT_SYMBOL_GPL(kvm_arch_register_noncoherent_dma);

void kvm_arch_unregister_noncoherent_dma(struct kvm *kvm)
{
	if (!atomic_dec_return(&kvm->arch.noncoherent_dma_count))
		kvm_noncoherent_dma_start_or_end(kvm);
}
EXPORT_SYMBOL_GPL(kvm_arch_unregister_noncoherent_dma);


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v2 5/6] KVM: x86: Keep a per-VM MTRR state
  2023-05-24 11:03         ` Yan Zhao
@ 2023-05-24 18:21           ` Sean Christopherson
  2023-05-25 10:09             ` Yan Zhao
  0 siblings, 1 reply; 41+ messages in thread
From: Sean Christopherson @ 2023-05-24 18:21 UTC (permalink / raw)
  To: Yan Zhao; +Cc: Robert Hoo, kvm, linux-kernel, pbonzini

On Wed, May 24, 2023, Yan Zhao wrote:
> On Tue, May 23, 2023 at 05:13:38PM -0700, Sean Christopherson wrote:
> > On Tue, May 23, 2023, Yan Zhao wrote:
> > > As also suggested in SDM, the guest OS manipulates MTRRs in this way:
> > > 
> > > for each online CPUs {
> > > 	disable MTRR
> > > 	update fixed/var MTRR ranges
> > > 	enable MTRR
> > > }
> > > Guest OS needs to access memory only after this full pattern.
> > 
> > FWIW, that Linux doesn't use that approach.  Linux instead only puts the cache
> > into no-fill mode (CR0.CD=1) when modifying MTRRs.  OVMF does both (and apparently
> > doesn't optimize for self-snoop?).
> I think Linux also follows this patten.
> This is the call trace I found out in my environment.
> cache_cpu_init
>     cache_disable
>         write_cr0 to CD=1, NW=0
>         mtrr_disable

Huh, I somehow missed this call to mtrr_disable().  I distinctly remember looking
at this helper, no idea how I missed that.

>     mtrr_generic_set_state
>         mtrr_wrmsr to fixed/var ranges
>     cache_enable
>         mtrr_enable
>         write_cr0(read_cr0() & ~X86_CR0_CD);
> 

...

> > Picking a single vCPU will always be subject to edge cases.  E.g. I can see something
> > like kexec() "offlining" KVM's chosen vCPU and then having problems because KVM
> > ignores MTRR updates from other vCPUs in the new kernel.
> >
> Not familiar with kexec().
> I wanted to trap APIC_SPIV and finding the lowest online vCPU id by checking
> apic->sw_enabled status. Then only MTRR state of vCPU whose id equals to
> the lowest online vCPU id can be written to per-VM MTRR state.
> Will that work for kexec()?

kexec() allows "booting" into a new kernel without transitioning through BIOS
and without doing a full reboot.  The scenario I'm worried about is if the new
kernel offlines KVM's chosen CPU for whatever reason and also modifies MTRRs.  I
think it's an extremely unlikely scenario, but my point is that selecting a single
vCPU to control the MTRRs works if and only if that vCPU stays online for the
entire lifetime of the VM, which KVM can't guarantee.

> > One idea would be to let all vCPUs write the per-VM state, and zap if and only if
> > the MTRRs are actually different.  But as above, I'm on the fence about diverging
> > from what hardware actually does with respect to MTRRs.

...

> > So, if KVM zaps SPTEs when CR0.CD is cleared (even when the quirk is enabled),
> > then KVM can skip the MTRR zaps when CR0.CD=1 because KVM is ignoring the MTRRs
> > and will zap when CR0.CD is cleared.  And to avoid regressing the CR0.CD case,
> > if KVM honors guest PAT for the bizarro CR0.CD quirk, then KVM only needs to
> > zap non-WB MTRR ranges when CR0.CD is cleared.  Since WB is weak, looking for
> > non-WB MTRR ranges doesn't need to actually resolve the memtype, i.e. can be simple
> > and just
> > process MTRRs one by one.
> > 
> > Did that make sense?  Minus the code to identify non-WB MTRR ranges (and much
> > needed comments), the below is what I'm thinking.  If more intelligent zapping
> > provides the desired performance improvements, then I think/hope we avoid trying
> > to play games with MTRRs.
> > 
> > ---
> >  arch/x86/kvm/mtrr.c    | 19 +++++++++++++++++++
> >  arch/x86/kvm/vmx/vmx.c |  8 ++------
> >  arch/x86/kvm/x86.c     |  6 ++----
> >  3 files changed, 23 insertions(+), 10 deletions(-)
> > 
> > diff --git a/arch/x86/kvm/mtrr.c b/arch/x86/kvm/mtrr.c
> > index a67c28a56417..e700c230268b 100644
> > --- a/arch/x86/kvm/mtrr.c
> > +++ b/arch/x86/kvm/mtrr.c
> > @@ -323,6 +323,9 @@ static void update_mtrr(struct kvm_vcpu *vcpu, u32 msr)
> >  	if (!kvm_mmu_honors_guest_mtrrs(vcpu->kvm))
> >  		return;
> >  
> > +	if (kvm_is_cr0_bit_set(vcpu, X86_CR0_CD))
> > +		return;
> This will always make update_mtrr() return here for Linux and OVMF. 

Yes, that's the intent.  If the CR0.CD quirk is _disabled_, then all SPTEs are
UC, i.e. there's no need to zap.  If the quirk is enabled (the common case,
unfortunately), then KVM isn't honoring MTRRs, i.e. non-coherent DMA won't function
properly, and so zapping SPTEs is pointless.  And in both cases, KVM will zap
relevant SPTEs when CR0.CD is cleared.

And this is actually already the behavior for all MTRRs except for MSR_MTRRdefType
due to the !mtrr_is_enabled() clause below.

> >  	if (!mtrr_is_enabled(mtrr_state) && msr != MSR_MTRRdefType)
> >  		return;
> >  
> > @@ -375,6 +378,22 @@ static void set_var_mtrr_msr(struct kvm_vcpu *vcpu, u32 msr, u64 data)
> >  	}
> >  }
> >  
> > +void kvm_mtrr_post_set_cr0(struct kvm_vcpu *vcpu, unsigned long cr0)
> > +{
> > +	if (cr0 & X86_CR0_CD)
> > +		return;
> > +
> > +	if (!kvm_mmu_honors_guest_mtrrs(vcpu->kvm))
> > +		return;
> > +
> > +	if (!kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_CD_NW_CLEARED)) {
> > +		kvm_zap_gfn_range(vcpu->kvm, 0, ~0ULL);
> > +		return;
> > +	}
> > +
> > +	<zap non-WB memory>;
> This zap looks will happen on each vCPU.

Yes, on CR0.CD 1=>0 transition.

> Then only half of zaps are saved compared to the original count of zaps in
> update_mtrr().

I don't think the current code is functionally correct though, even if we define
"correct" to only apply when CR0.CD=0 (because the CR0.CD=1 case when the quirk
can't possibly be correct).  Keeping stale WB+IGMT entries that were installed
while CR0.CD=1 is broken.  As mentioned in my previous mail, I suspect it works
only because KVM doesn't install SPTEs for DMA'd ranges.  FWIW, the WB memtype
is unlikely to be a problem, but setting IGMT definitely is.  Though of course
we can and should fix the IGMT thing separately.

> And pieces of no-WB memory might produce more kvm_zap_gfn_range()?

Yes, but they'll be much, much more precise.  And the bajillion fixed MTRRs can
be batched into a single zap by sacrificing a tiny amount of precision, i.e. zap
from the lowest non-WB fixed MTRR to the highest.  Crucially, assuming BIOS and
the kernel aren't doing something bizarre, that should preserve the SPTEs for the
code the guest is executing from (0 - VGA hole should be WB).
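FWIW, identifying that span should only take a few lines, e.g. something like this
(purely illustrative, not wired up to the existing mtrr.c segment helpers; the table
is just the architectural fixed-range layout, and the input is assumed to be the 88
entries kept in kvm_mtrr.fixed_ranges):

static void mtrr_fixed_non_wb_span(const u8 *types, u64 *start, u64 *end)
{
	/* Architectural fixed-range layout: base, chunk size, number of ranges. */
	static const struct {
		u64 base;
		u64 size;
		int nr;
	} segs[] = {
		{ 0x00000, 0x10000,  8 },	/* MTRRfix64K_00000 */
		{ 0x80000,  0x4000, 16 },	/* MTRRfix16K_80000, MTRRfix16K_A0000 */
		{ 0xc0000,  0x1000, 64 },	/* MTRRfix4K_C0000 .. MTRRfix4K_F8000 */
	};
	u64 lo = ~0ull, hi = 0;
	int i, j, r = 0;

	for (i = 0; i < ARRAY_SIZE(segs); i++) {
		for (j = 0; j < segs[i].nr; j++, r++) {
			u64 base = segs[i].base + j * segs[i].size;

			if (types[r] == MTRR_TYPE_WRBACK)
				continue;
			if (base < lo)
				lo = base;
			if (base + segs[i].size > hi)
				hi = base + segs[i].size;
		}
	}

	/* lo > hi means every fixed range is WB, i.e. nothing to zap. */
	*start = lo;
	*end = hi;
}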

And if we want to squeeze more performance out of this path, there are other
optimizations we can make.  E.g. I'm guessing one of the problems, perhaps even
_the_ problem, is that there's contention on mmu_lock in kvm_zap_gfn_range() when
bringing up APs, which is likely why you observe slow downs even when there are
no SPTEs to zap.  A thought for handling that would be to do something akin to
kvm_recalculate_apic_map().  E.g. instead of having every vCPU zap, track (a)
if a zap is in-progress and (b) the ranges being zapped.  There will still be
lock contention to add ranges, but it should be fairly short in duration compared
to zapping all possible SPTEs (current behavior).

Something like

	struct tbd *range = NULL;
	bool do_zap = false;

	<gather non-wb ranges>

	for_each_non_wb_range(...) {
		if (!range)
			range = kmalloc(...);

		spin_lock(&kvm->arch.mtrr_zap_lock);
		if (<range not in list>) {
			list_add_tail(&range->list, &kvm->arch.mtrr_zap_list);
			range = NULL;

			if (!kvm->arch.mtrr_zapping) {
				do_zap = true;
				kvm->arch.mtrr_zapping = true;
			}
		}
		spin_unlock(&kvm->arch.mtrr_zap_lock);
	}

	kfree(range);

	if (do_zap) {
		spin_lock(&kvm->arch.mtrr_zap_lock);

		while (!list_empty(&kvm->arch.mtrr_zap_list)) {
			struct tbd *range;

			range = list_first_entry(&kvm->arch.mtrr_zap_list, struct tbd, list);
			list_del(&range->list);

			spin_unlock(&kvm->arch.mtrr_zap_lock);

			kvm_zap_gfn_range(..., range->start, range->end);
			kfree(range);
		
			spin_lock(&kvm->arch.mtrr_zap_lock);
		}

		/* Clear under lock to ensure new entries are processed. */
		kvm->arch.mtrr_zapping = false;

		spin_unlock(&kvm->arch.mtrr_zap_lock);
	}

	/* Wait for the zap to complete. */
	while (READ_ONCE(kvm->arch.mtrr_zapping))
		cpu_relax();

and I'm sure someone that's better at lockless programming could optimize that
even further if necessary, e.g. by checking for "range in list" outside of the
spinlock.

E.g. in my OVMF based VM, the precise zap would target only two ranges, the VGA
hole and the 32-bit PCI hole:

  kvm: vCPU0 default type = 6
  kvm: vCPU0 fixed MTRR range 0 - 15 == 6
  kvm: vCPU0 fixed MTRR range 16 - 87 == 0
  kvm: vCPU0 variable MTRR range 80000000 - 100000000  = 0

That bumps up to three ranges for the kernel, which adds what I suspect is the
64-bit PCI hole:

  kvm: vCPU0 default type = 6                                                   
  kvm: vCPU0 fixed MTRR range 0 - 15 == 6                                       
  kvm: vCPU0 fixed MTRR range 16 - 87 == 0
  kvm: vCPU0 variable MTRR range 800000000 - 1000000000  = 0
  kvm: vCPU0 variable MTRR range 80000000 - 100000000  = 0

The above is distilled information from this hack-a-print:

diff --git a/arch/x86/kvm/mtrr.c b/arch/x86/kvm/mtrr.c
index 9fac1ec03463..6259c7a4bcd3 100644
--- a/arch/x86/kvm/mtrr.c
+++ b/arch/x86/kvm/mtrr.c
@@ -304,12 +304,42 @@ static void var_mtrr_range(struct kvm_mtrr_range *range, u64 *start, u64 *end)
        *end = (*start | ~mask) + 1;
 }
 
+
+static bool var_mtrr_range_is_valid(struct kvm_mtrr_range *range)
+{
+       return (range->mask & (1 << 11)) != 0;
+}
+
 static void update_mtrr(struct kvm_vcpu *vcpu, u32 msr)
 {
        struct kvm_mtrr *mtrr_state = &vcpu->arch.mtrr_state;
        gfn_t start, end;
        int index;
 
+       if (mtrr_is_enabled(mtrr_state) && msr == MSR_MTRRdefType) {
+               struct kvm_mtrr_range *range;
+               int i;
+
+               pr_warn("vCPU%u default type = %u\n", vcpu->vcpu_idx, (u8)(mtrr_state->deftype & 0xff));
+
+               if (!fixed_mtrr_is_enabled(mtrr_state)) {
+                       pr_warn("vCPU%u fixed MTRRs disabled\n", vcpu->vcpu_idx);
+               } else {
+                       for (i = 0; i < ARRAY_SIZE(mtrr_state->fixed_ranges); i++)
+                               pr_warn("vCPU%u fixed MTRR range %u == %u\n",
+                                       vcpu->vcpu_idx, i, mtrr_state->fixed_ranges[i]);
+               }
+
+               list_for_each_entry(range, &mtrr_state->head, node) {
+                       if (!var_mtrr_range_is_valid(range))
+                               continue;
+
+                       var_mtrr_range(range, &start, &end);
+                       pr_warn("vCPU%d variable MTRR range %llx - %llx  = %u\n",
+                               vcpu->vcpu_idx, start, end, (u8)(range->base & 0xff));
+               }
+       }
+
        if (msr == MSR_IA32_CR_PAT || !tdp_enabled ||
              !kvm_arch_has_noncoherent_dma(vcpu->kvm))
                return;
@@ -333,11 +363,6 @@ static void update_mtrr(struct kvm_vcpu *vcpu, u32 msr)
        kvm_zap_gfn_range(vcpu->kvm, gpa_to_gfn(start), gpa_to_gfn(end));
 }
 
-static bool var_mtrr_range_is_valid(struct kvm_mtrr_range *range)
-{
-       return (range->mask & (1 << 11)) != 0;
-}
-
 static void set_var_mtrr_msr(struct kvm_vcpu *vcpu, u32 msr, u64 data)
 {
        struct kvm_mtrr *mtrr_state = &vcpu->arch.mtrr_state;


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* Re: [PATCH v2 5/6] KVM: x86: Keep a per-VM MTRR state
  2023-05-23  6:21     ` Yan Zhao
  2023-05-24  0:13       ` Sean Christopherson
@ 2023-05-25  7:21       ` Robert Hoo
  2023-05-25 15:00         ` Sean Christopherson
  1 sibling, 1 reply; 41+ messages in thread
From: Robert Hoo @ 2023-05-25  7:21 UTC (permalink / raw)
  To: Yan Zhao; +Cc: kvm, linux-kernel, pbonzini, seanjc

On 5/23/2023 2:21 PM, Yan Zhao wrote:
[...]
> Yes, leaving each vCPU's MTRR to mimic HW.
> 
> As also suggested in SDM, the guest OS manipulates MTRRs in this way:
> 
> for each online CPUs {
> 	disable MTRR
> 	update fixed/var MTRR ranges
> 	enable MTRR
> }
> Guest OS needs to access memory only after this full pattern.
> 
Thanks for confirmation and clarification.

> So, I think there should not have "hazard of inconsistency among per-VPU MTRR
> states".
> 
> I want to have per-VM MTRR state is because I want to reduce unnessary EPT
> zap, which costs quite a lot cpu cycles even when the EPT is empty.
> 
> In this patch, per-VM MTRR pointer is used to point to vCPU 0's MTRR state,
> so that it can save some memory to keep the MTRR state.
> But I found out that it would only work when vCPU 0 (boot processor) is
> always online (which is not true for x86 under some configration).
> 
> I'll try to find out lowest online vCPU and keep a per-VM copy of MTRR state
> in next version.
> 
> Thanks!
> 

IIUC, your saving comes from skipping the intermediate states during boot, when
APs go through setting MTRRs, which would cause SPTE zaps before this
patch set.

MHO was, now that you ignore other vCPUs' MTRR settings (unless they differ
from the BP's MTRRs?), why not let each vCPU's MTRR set/update be written
directly to the per-VM MTRR state (if it differs from the current value).
It's the guest OS/BIOS's responsibility to keep the consistency anyway. And
even if the malfunction caused by the inconsistency might differ from that
of native, the SDM doesn't clearly state what the malfunction should be, does it?
That is to say, does anyone know whether, when inconsistency happens, it causes
that logical processor to malfunction or in fact impacts the global MTRR
settings? If the latter, I think keeping only the per-VM MTRR state aligns
with native.

BTW, with regard to KVM_X86_QUIRK_CD_NW_CLEARED, I see svm honors it before
clearing CR0.CD/NW in the hardware CR0, while vmx doesn't.

svm_set_cr0():

	if (kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_CD_NW_CLEARED))
		hcr0 &= ~(X86_CR0_CD | X86_CR0_NW);


vmx_set_cr0():

	hw_cr0 = (cr0 & ~KVM_VM_CR0_ALWAYS_OFF);

Perhaps the vmx side can be fixed in passing?

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v2 5/6] KVM: x86: Keep a per-VM MTRR state
  2023-05-24 18:21           ` Sean Christopherson
@ 2023-05-25 10:09             ` Yan Zhao
  2023-05-25 14:53               ` Sean Christopherson
  0 siblings, 1 reply; 41+ messages in thread
From: Yan Zhao @ 2023-05-25 10:09 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: Robert Hoo, kvm, linux-kernel, pbonzini

On Wed, May 24, 2023 at 11:21:09AM -0700, Sean Christopherson wrote:

...

> > > Picking a single vCPU will always be subject to edge cases.  E.g. I can see something
> > > like kexec() "offlining" KVM's chosen vCPU and then having problems because KVM
> > > ignores MTRR updates from other vCPUs in the new kernel.
> > >
> > Not familiar with kexec().
> > I wanted to trap APIC_SPIV and finding the lowest online vCPU id by checking
> > apic->sw_enabled status. Then only MTRR state of vCPU whose id equals to
> > the lowest online vCPU id can be written to per-VM MTRR state.
> > Will that work for kexec()?
> 
> kexec() allows "booting" into a new kernel without transitioning through BIOS
> and without doing a full reboot.  The scenario I'm worried about is if the new
> kernel offlines KVM's chosen CPU for whatever reason and also modifies MTRRs.  I
> think it's an extremely unlikely scenario, but my point is that selecting a single
> vCPU to control the MTRRs works if and only if that vCPU stays online for the
> entire lifetime of the VM, which KVM can't guarantee.
> 
> > > One idea would be to let all vCPUs write the per-VM state, and zap if and only if
> > > the MTRRs are actually different.  But as above, I'm on the fence about diverging
> > > from what hardware actually does with respect to MTRRs.
> 
> ...
> 
> > > So, if KVM zaps SPTEs when CR0.CD is cleared (even when the quirk is enabled),
> > > then KVM can skip the MTRR zaps when CR0.CD=1 because KVM is ignoring the MTRRs
> > > and will zap when CR0.CD is cleared.  And to avoid regressing the CR0.CD case,
> > > if KVM honors guest PAT for the bizarro CR0.CD quirk, then KVM only needs to
> > > zap non-WB MTRR ranges when CR0.CD is cleared.  Since WB is weak, looking for
Not only the non-WB ranges need to be zapped.
Think about a scenario:
(1) one fixed MTRR range is UC initially. Its EPT entry memtype is UC, too.
(2) after CR0.CD=1, without zap, its EPT entry memtype is still UC.
(3) then guest performs MTRR disable, no zap too.
(4) guest updates this fixed MTRR range to WB, and performs MTRR enable.
(5) CR0.CD=0. we need to zap this fixed range to update the EPT memtype to WB.

Does it make sense?

> > > non-WB MTRR ranges doesn't need to actually resolve the memtype, i.e. can be simple
> > > and just
> > > process MTRRs one by one.
> > > 
> > > Did that make sense?  Minus the code to identify non-WB MTRR ranges (and much
> > > needed comments), the below is what I'm thinking.  If more intelligent zapping
> > > provides the desired performance improvements, then I think/hope we avoid trying
> > > to play games with MTRRs.
> > > 
> > > ---
> > >  arch/x86/kvm/mtrr.c    | 19 +++++++++++++++++++
> > >  arch/x86/kvm/vmx/vmx.c |  8 ++------
> > >  arch/x86/kvm/x86.c     |  6 ++----
> > >  3 files changed, 23 insertions(+), 10 deletions(-)
> > > 
> > > diff --git a/arch/x86/kvm/mtrr.c b/arch/x86/kvm/mtrr.c
> > > index a67c28a56417..e700c230268b 100644
> > > --- a/arch/x86/kvm/mtrr.c
> > > +++ b/arch/x86/kvm/mtrr.c
> > > @@ -323,6 +323,9 @@ static void update_mtrr(struct kvm_vcpu *vcpu, u32 msr)
> > >  	if (!kvm_mmu_honors_guest_mtrrs(vcpu->kvm))
> > >  		return;
> > >  
> > > +	if (kvm_is_cr0_bit_set(vcpu, X86_CR0_CD))
> > > +		return;
> > This will always make update_mtrr() return here for Linux and OVMF. 
> 
> Yes, that's the intent.  If the CR0.CD quirk is _disabled_, then all SPTEs are
> UC, i.e. there's no need to zap.  If the quirk is enabled (the common case,
> unfortunately), then KVM isn't honoring MTRRs, i.e. non-coherent DMA won't function
> properly, and so zapping SPTEs is pointless.  And in both cases, KVM will zap
> relevant SPTEs when CR0.CD is cleared.
My worry is that if the quirk is enabled,
(1) when CR0.CD=1, the EPT memtype is WB.
(2) when MTRR is further disabled,
    a. with EPT zap, the new EPT memtype is UC, effective memtype is UC
    b. without EPT zap, some EPT memtype is still WB, effective memtype is guest PAT type
However, from the guest driver's point of view, the memtype is UC, because its MTRR is disabled.

So, if we don't zap EPT when the guest disables MTRRs, and if there's
non-coherent DMA ongoing which requires the guest driver to flush caches to
ensure cache coherency (though I don't think this scenario will happen in
reality), the guest driver may not do the right thing: it thinks the memory
is UC, which needs no cache flush, while the effective memtype is WB.

Previously, (i.e. in current upstream implementation),  
(1) when CR0.CD=1, the EPT memtype is WB + ignore guest PAT.
(2) when MTRR is further disabled,
    EPT is zapped, new EPT memtype is UC, effective memtype is UC.
The guest device driver can flush cache well for non-coherent DMA.

So, though the quirk will make current upstream code malfunction when
CR0.CD is set alone, it still functions well if MTRR is also disabled.

But I doubt if we need to care about this extreme condition of
non-coherent DMA that happens when CR0.CD=1 and MTRR is disabled.


> 
> And this is actually already the behavior for all MTRRs execpt for MSR_MTRRdefType
> due to the !mtrr_is_enabled() clause below.
> 
> > >  	if (!mtrr_is_enabled(mtrr_state) && msr != MSR_MTRRdefType)
> > >  		return;
> > >  
> > > @@ -375,6 +378,22 @@ static void set_var_mtrr_msr(struct kvm_vcpu *vcpu, u32 msr, u64 data)
> > >  	}
> > >  }
> > >  
> > > +void kvm_mtrr_post_set_cr0(struct kvm_vcpu *vcpu, unsigned long cr0)
> > > +{
> > > +	if (cr0 & X86_CR0_CD)
> > > +		return;
> > > +
> > > +	if (!kvm_mmu_honors_guest_mtrrs(vcpu->kvm))
> > > +		return;
> > > +
> > > +	if (!kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_CD_NW_CLEARED)) {
> > > +		kvm_zap_gfn_range(vcpu->kvm, 0, ~0ULL);
> > > +		return;
> > > +	}
> > > +
> > > +	<zap non-WB memory>;
> > This zap looks will happen on each vCPU.
> 
> Yes, on CR0.CD 1=>0 transition.
> 
> > Then only half of zaps are saved compared to the original count of zaps in
> > update_mtrr().
> 
> I don't think the current code is functionally correct though, even if we define
> "correct" to only apply when CR0.CD=0 (because the CR0.CD=1 case when the quirk
> can't possibly be correct).  Keeping stale WB+IGMT entries that were installed
> while CR0.CD=1 is broken.  As mentioned in my previous mail, I suspect it works
> only because KVM doesn't install SPTEs for DMA'd ranges.  FWIW, the WB memtype
> is unlikely to be a problem, but setting IGMT definitely is.  Though of course
> we can and should fix the IGMT thing separately.
The current code may also not function well for non-coherent DMA when the guest
vCPU has no X86_FEATURE_MTRR, as it returns WB (without IPAT) as the EPT memtype.
The effective memtype is then the guest PAT type, but from the guest's point of
view it's all UC (see pat_x_mtrr_type() in Linux).
But I guess we don't need to mind this case?


> 
> > And pieces of no-WB memory might produce more kvm_zap_gfn_range()?
> 
> Yes, but they'll be much, much more precise.  And the bajillion fixed MTRRs can
Could we instead keep a per-VM copy of the MTRRs?
Then, if a vCPU modifies an MTRR entry and we find it differs from the one
in the per-VM copy, we mark that entry dirty. When CR0.CD=0, we just zap
those dirty ranges (or just zap everything if there are dirty entries).
In theory, only one vCPU will trigger this zap in each round of MTRR updates.
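Roughly something like the sketch below (only to illustrate the idea;
kvm->arch.vm_mtrr, vm_mtrr_write() and vm_mtrr_zap_if_dirty() are made-up
names, not existing KVM code):

struct kvm_vm_mtrr {
        spinlock_t lock;
        u64 deftype;
        u64 fixed[11];                  /* raw MTRRfix* MSR values */
        u64 var[2 * KVM_NR_VAR_MTRR];   /* raw base/mask MSR values */
        bool dirty;
};

/* Called from kvm_mtrr_set_msr(); @field is the matching slot in the copy. */
static void vm_mtrr_write(struct kvm *kvm, u64 *field, u64 data)
{
        struct kvm_vm_mtrr *m = &kvm->arch.vm_mtrr;

        spin_lock(&m->lock);
        if (*field != data) {
                *field = data;
                m->dirty = true;        /* the effective guest MTRRs changed */
        }
        spin_unlock(&m->lock);
}

/* Called on CR0.CD 1=>0: at most one zap per round of MTRR updates. */
static void vm_mtrr_zap_if_dirty(struct kvm *kvm)
{
        struct kvm_vm_mtrr *m = &kvm->arch.vm_mtrr;
        bool zap;

        spin_lock(&m->lock);
        zap = m->dirty;
        m->dirty = false;
        spin_unlock(&m->lock);

        if (zap)
                kvm_zap_gfn_range(kvm, 0, ~0ULL);
}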

> be batched into a single zap by sacrificing a tiny amount of precision, i.e. zap
> from the lowest non-WB fixed MTRR to the highest.  Crucially, assuming BIOS and
> the kernel aren't doing something bizarre, that should preserve the SPTEs for the
> code the guest is executing from (0 - VGA hole should be WB).
> 
> And if we want to squeeze more performance out of this path, there are other
> optimizations we can make.  E.g. I'm guessing one of the problems, perhaps even
> _the_ problem, is that there's contention on mmu_lock in kvm_zap_gfn_range() when
> bringing up APs, which is likely why you observe slow downs even when there are
> no SPTEs to zap.  A thought for handling that would be to do something akin to
> kvm_recalculate_apic_map().  E.g. instead of having every vCPU zap, track (a)
> if a zap is in-progress and (b) the ranges being zapped.  There will still be
> lock contention to add ranges, but it should be fairly short in duration compared
> to zapping all possible SPTEs (current behavior).
Yes, I guess this one is more performant, too :)

> 
> Something like
> 
> 	struct tbd *range = NULL;
> 	bool do_zap = false;
> 
> 	<gather non-wb ranges>
> 
> 	for_each_non_wb_range(...) {
> 		if (!range)
> 			range = kmalloc(...);
> 
> 		spin_lock(&kvm->arch.mtrr_zap_lock);
> 		if (<range not in list>) {
> 			list_add_tail(&range->list, &kvm->arch.mtrr_zap_list);
> 			range = NULL;
> 
> 			if (!kvm->arch.mtrr_zapping) {
> 				do_zap = true;
> 				kvm->arch.mtrr_zapping = true;
> 			}
> 		}
> 		spin_unlock(&kvm->arch.mtrr_zap_lock);
> 	}
> 
> 	kfree(zap_entry);
> 
> 	if (do_zap) {
> 		spin_lock(&kvm->arch.mtrr_zap_lock);
> 
> 		while (!list_empty(&kvm->arch.mtrr_zap_list)) {
> 			struct tbd *range;
> 
> 			range = list_first_entry(&kvm->arch.mtrr_zap_list);
> 			list_del(range->list);
> 
> 			spin_unlock(&kvm->arch.mtrr_zap_lock);
> 
> 			kvm_zap_gfn_range(..., range->start, range->end);
> 			kfree(range);
> 		
> 			spin_lock(&kvm->arch.mtrr_zap_lock);
> 		}
> 
> 		/* Clear under lock to ensure new entries are processed. */
> 		kvm->arch.mtrr_zapping = false;
> 
> 		spin_unlock(&kvm->arch.mtrr_zap_lock);
> 	}
> 
> 	/* Wait for the zap to complete. */
> 	while (READ_ONCE(kvm->arch.mtrr_zapping))
> 		cpu_relax();
> 
> and I'm sure someone that's better at lockless programming could optimize that
> even further if necessary, e.g. by checking for "range in list" outside of the
> spinlock.
> 
> E.g. in my OVMF based VM, the precise zap would target only two ranges, the VGA
> hole and the 32-bit PCI:
> 
>   kvm: vCPU0 default type = 6
>   kvm: vCPU0 fixed MTRR range 0 - 15 == 6
>   kvm: vCPU0 fixed MTRR range 16 - 87 == 0
>   kvm: vCPU0 variable MTRR range 80000000 - 100000000  = 0
> 
> That bumps up to three ranges for the kernel, which adds what I suspect is the
> 64-bit PCI hole:
> 
>   kvm: vCPU0 default type = 6                                                   
>   kvm: vCPU0 fixed MTRR range 0 - 15 == 6                                       
>   kvm: vCPU0 fixed MTRR range 16 - 87 == 0
>   kvm: vCPU0 variable MTRR range 800000000 - 1000000000  = 0
>   kvm: vCPU0 variable MTRR range 80000000 - 100000000  = 0
> 
> The above is distilled information from this hack-a-print:
> 
> diff --git a/arch/x86/kvm/mtrr.c b/arch/x86/kvm/mtrr.c
> index 9fac1ec03463..6259c7a4bcd3 100644
> --- a/arch/x86/kvm/mtrr.c
> +++ b/arch/x86/kvm/mtrr.c
> @@ -304,12 +304,42 @@ static void var_mtrr_range(struct kvm_mtrr_range *range, u64 *start, u64 *end)
>         *end = (*start | ~mask) + 1;
>  }
>  
> +
> +static bool var_mtrr_range_is_valid(struct kvm_mtrr_range *range)
> +{
> +       return (range->mask & (1 << 11)) != 0;
> +}
> +
>  static void update_mtrr(struct kvm_vcpu *vcpu, u32 msr)
>  {
>         struct kvm_mtrr *mtrr_state = &vcpu->arch.mtrr_state;
>         gfn_t start, end;
>         int index;
>  
> +       if (mtrr_is_enabled(mtrr_state) && msr == MSR_MTRRdefType) {
> +               struct kvm_mtrr_range *range;
> +               int i;
> +
> +               pr_warn("vCPU%u default type = %u\n", vcpu->vcpu_idx, (u8)(mtrr_state->deftype & 0xff));
> +
> +               if (!fixed_mtrr_is_enabled(mtrr_state)) {
> +                       pr_warn("vCPU%u fixed MTRRs disabled\n", vcpu->vcpu_idx);
> +               } else {
> +                       for (i = 0; i < ARRAY_SIZE(mtrr_state->fixed_ranges); i++)
> +                               pr_warn("vCPU%u fixed MTRR range %u == %u\n",
> +                                       vcpu->vcpu_idx, i, mtrr_state->fixed_ranges[i]);
> +               }
> +
> +               list_for_each_entry(range, &mtrr_state->head, node) {
> +                       if (!var_mtrr_range_is_valid(range))
> +                               continue;
> +
> +                       var_mtrr_range(range, &start, &end);
> +                       pr_warn("vCPU%d variable MTRR range %llx - %llx  = %u\n",
> +                               vcpu->vcpu_idx, start, end, (u8)(range->base & 0xff));
> +               }
> +       }
> +
>         if (msr == MSR_IA32_CR_PAT || !tdp_enabled ||
>               !kvm_arch_has_noncoherent_dma(vcpu->kvm))
>                 return;
> @@ -333,11 +363,6 @@ static void update_mtrr(struct kvm_vcpu *vcpu, u32 msr)
>         kvm_zap_gfn_range(vcpu->kvm, gpa_to_gfn(start), gpa_to_gfn(end));
>  }
>  
> -static bool var_mtrr_range_is_valid(struct kvm_mtrr_range *range)
> -{
> -       return (range->mask & (1 << 11)) != 0;
> -}
> -
>  static void set_var_mtrr_msr(struct kvm_vcpu *vcpu, u32 msr, u64 data)
>  {
>         struct kvm_mtrr *mtrr_state = &vcpu->arch.mtrr_state;
> 

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v2 1/6] KVM: x86/mmu: add a new mmu zap helper to indicate memtype changes
  2023-05-24 14:50           ` Sean Christopherson
@ 2023-05-25 10:14             ` Yan Zhao
  2023-05-25 15:54               ` Sean Christopherson
  0 siblings, 1 reply; 41+ messages in thread
From: Yan Zhao @ 2023-05-25 10:14 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: Chao Gao, kvm, linux-kernel, pbonzini

On Wed, May 24, 2023 at 07:50:24AM -0700, Sean Christopherson wrote:
> On Wed, May 24, 2023, Yan Zhao wrote:
> > On Tue, May 23, 2023 at 03:51:49PM -0700, Sean Christopherson wrote:
> > > diff --git a/arch/x86/kvm/mtrr.c b/arch/x86/kvm/mtrr.c
> > > index 3eb6e7f47e96..a67c28a56417 100644
> > > --- a/arch/x86/kvm/mtrr.c
> > > +++ b/arch/x86/kvm/mtrr.c
> > > @@ -320,7 +320,7 @@ static void update_mtrr(struct kvm_vcpu *vcpu, u32 msr)
> > >  	struct kvm_mtrr *mtrr_state = &vcpu->arch.mtrr_state;
> > >  	gfn_t start, end;
> > >  
> > > -	if (!tdp_enabled || !kvm_arch_has_noncoherent_dma(vcpu->kvm))
> > > +	if (!kvm_mmu_honors_guest_mtrrs(vcpu->kvm))
> > Could we also add another helper kvm_mmu_cap_honors_guest_mtrrs(), which
> > does not check kvm_arch_has_noncoherent_dma()?
> > 
> > +static inline bool kvm_mmu_cap_honors_guest_mtrrs(struct kvm *kvm)
> > +{
> > +       return !!shadow_memtype_mask;
> > +}
> > 
> > This is because in patch 4 I plan to do the EPT zap when
> > noncoherent_dma_count goes from 1 to 0.
> 
> Hrm, the 1->0 transition is annoying.  Rather than trying to capture the "everything
> except non-coherent DMA" aspect, what about this?
> 
> mmu.c:
> 
> bool __kvm_mmu_honors_guest_mtrrs(struct kvm *kvm, bool vm_has_noncoherent_dma)
> {
> 	/*
> 	 * If the TDP is enabled, the host MTRRs are ignored by TDP
> 	 * (shadow_memtype_mask is non-zero), and the VM has non-coherent DMA
> 	 * (DMA doesn't snoop CPU caches), KVM's ABI is to honor the memtype
> 	 * from the guest's MTRRs so that guest accesses to memory that is
> 	 * DMA'd aren't cached against the guest's wishes.
> 	 *
> 	 * Note, KVM may still ultimately ignore guest MTRRs for certain PFNs,
> 	 * e.g. KVM will force UC memtype for host MMIO.
> 	 */
> 	return vm_has_noncoherent_dma && tdp_enabled && shadow_memtype_mask;
> }
> 
> mmu.h:
> 
> bool __kvm_mmu_honors_guest_mtrrs(struct kvm *kvm, bool vm_has_noncoherent_dma);
> 
> static inline bool kvm_mmu_honors_guest_mtrrs(struct kvm *kvm)
> {
> 	
> 	return __kvm_mmu_honors_guest_mtrrs(kvm, kvm_arch_has_noncoherent_dma(kvm));
> }

This should work and it centralizes the comments into one place, though I dislike
having to pass true as vm_has_noncoherent_dma in the 1->0 transition case. :)

> 
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index 41d7bb51a297..ad0c43d7f532 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -13146,13 +13146,19 @@ EXPORT_SYMBOL_GPL(kvm_arch_has_assigned_device);
> > 
> >  void kvm_arch_register_noncoherent_dma(struct kvm *kvm)
> >  {
> > -       atomic_inc(&kvm->arch.noncoherent_dma_count);
> > +       if (atomic_inc_return(&kvm->arch.noncoherent_dma_count) == 1) {
> > +               if (kvm_mmu_cap_honors_guest_mtrrs(kvm))
> > +                       kvm_zap_gfn_range(kvm, 0, ~0ULL);
> 
> No need for multiple if statements.  Though rather than have identical code in
> both the start/end paths, how about this?  That provides a single location for a
> comment.  Or maybe first/last instead of start/end?
> 
> static void kvm_noncoherent_dma_start_or_end(struct kvm *kvm)
What does start_or_end or first_or_last stand for? 

> {
> 	/* comment goes here. */
> 	if (__kvm_mmu_honors_guest_mtrrs(kvm, true))
> 		kvm_zap_gfn_range(kvm, 0, ~0ULL);
> }
> 
> void kvm_arch_register_noncoherent_dma(struct kvm *kvm)
> {
> 	if (atomic_inc_return(&kvm->arch.noncoherent_dma_count) == 1)
> 		kvm_noncoherent_dma_start_or_end(kvm);
> }
> EXPORT_SYMBOL_GPL(kvm_arch_register_noncoherent_dma);
> 
> void kvm_arch_unregister_noncoherent_dma(struct kvm *kvm)
> {
> 	if (!atomic_dec_return(&kvm->arch.noncoherent_dma_count))
> 		kvm_noncoherent_dma_start_or_end(kvm);
> }
> EXPORT_SYMBOL_GPL(kvm_arch_unregister_noncoherent_dma);
> 

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v2 5/6] KVM: x86: Keep a per-VM MTRR state
  2023-05-25 10:09             ` Yan Zhao
@ 2023-05-25 14:53               ` Sean Christopherson
  2023-05-26  7:54                 ` Yan Zhao
  0 siblings, 1 reply; 41+ messages in thread
From: Sean Christopherson @ 2023-05-25 14:53 UTC (permalink / raw)
  To: Yan Zhao; +Cc: Robert Hoo, kvm, linux-kernel, pbonzini

On Thu, May 25, 2023, Yan Zhao wrote:
> On Wed, May 24, 2023 at 11:21:09AM -0700, Sean Christopherson wrote:
> 
> ...
> 
> > > > Picking a single vCPU will always be subject to edge cases.  E.g. I can see something
> > > > like kexec() "offlining" KVM's chosen vCPU and then having problems because KVM
> > > > ignores MTRR updates from other vCPUs in the new kernel.
> > > >
> > > Not familiar with kexec().
> > > I wanted to trap APIC_SPIV and finding the lowest online vCPU id by checking
> > > apic->sw_enabled status. Then only MTRR state of vCPU whose id equals to
> > > the lowest online vCPU id can be written to per-VM MTRR state.
> > > Will that work for kexec()?
> > 
> > kexec() allows "booting" into a new kernel without transitioning through BIOS
> > and without doing a full reboot.  The scenario I'm worried about is if the new
> > kernel offlines KVM's chosen CPU for whatever reason and also modifies MTRRs.  I
> > think it's an extremely unlikely scenario, but my point is that selecting a single
> > vCPU to control the MTRRs works if and only if that vCPU stays online for the
> > entire lifetime of the VM, which KVM can't guarantee.
> > 
> > > > One idea would be to let all vCPUs write the per-VM state, and zap if and only if
> > > > the MTRRs are actually different.  But as above, I'm on the fence about diverging
> > > > from what hardware actually does with respect to MTRRs.
> > 
> > ...
> > 
> > > > So, if KVM zaps SPTEs when CR0.CD is cleared (even when the quirk is enabled),
> > > > then KVM can skip the MTRR zaps when CR0.CD=1 because KVM is ignoring the MTRRs
> > > > and will zap when CR0.CD is cleared.  And to avoid regressing the CR0.CD case,
> > > > if KVM honors guest PAT for the bizarro CR0.CD quirk, then KVM only needs to
> > > > zap non-WB MTRR ranges when CR0.CD is cleared.  Since WB is weak, looking for
> Not only non-WB ranges are required to be zapped.
> Think about a scenario:
> (1) one fixed MTRR range is UC initially. Its EPT entry memtype is UC, too.
> (2) after CR0.CD=1, without zap, its EPT entry memtype is still UC.
> (3) then guest performs MTRR disable, no zap too.
> (4) guest updates this fixed MTRR range to WB, and performs MTRR enable.
> (5) CR0.CD=0. we need to zap this fixed range to update the EPT memtype to WB.

KVM installs WB memtype when the quirk is enabled, thus no need to zap.  KVM
currently zaps everything when the quirk is disabled, and I'm not proposing that
be changed.

> Does it make sense?
> 
> > > > non-WB MTRR ranges doesn't need to actually resolve the memtype, i.e. can be simple
> > > > and just
> > > > process MTRRs one by one.
> > > > 
> > > > Did that make sense?  Minus the code to identify non-WB MTRR ranges (and much
> > > > needed comments), the below is what I'm thinking.  If more intelligent zapping
> > > > provides the desired performance improvements, then I think/hope we avoid trying
> > > > to play games with MTRRs.
> > > > 
> > > > ---
> > > >  arch/x86/kvm/mtrr.c    | 19 +++++++++++++++++++
> > > >  arch/x86/kvm/vmx/vmx.c |  8 ++------
> > > >  arch/x86/kvm/x86.c     |  6 ++----
> > > >  3 files changed, 23 insertions(+), 10 deletions(-)
> > > > 
> > > > diff --git a/arch/x86/kvm/mtrr.c b/arch/x86/kvm/mtrr.c
> > > > index a67c28a56417..e700c230268b 100644
> > > > --- a/arch/x86/kvm/mtrr.c
> > > > +++ b/arch/x86/kvm/mtrr.c
> > > > @@ -323,6 +323,9 @@ static void update_mtrr(struct kvm_vcpu *vcpu, u32 msr)
> > > >  	if (!kvm_mmu_honors_guest_mtrrs(vcpu->kvm))
> > > >  		return;
> > > >  
> > > > +	if (kvm_is_cr0_bit_set(vcpu, X86_CR0_CD))
> > > > +		return;
> > > This will always make update_mtrr() return here for Linux and OVMF. 
> > 
> > Yes, that's the intent.  If the CR0.CD quirk is _disabled_, then all SPTEs are
> > UC, i.e. there's no need to zap.  If the quirk is enabled (the common case,
> > unfortunately), then KVM isn't honoring MTRRs, i.e. non-coherent DMA won't function
> > properly, and so zapping SPTEs is pointless.  And in both cases, KVM will zap
> > relevant SPTEs when CR0.CD is cleared.
> My worry is that if the quirk is enabled,
> (1) when CR0.CD=1, the EPT memtype is WB.
> (2) when MTRR is further disabled,
>     a. with EPT zap, the new EPT memtype is UC, effective memtype is UC
>     b. without EPT zap, some EPT memtype is still WB, effective memtype is guest PAT type
> However, in guest driver's point of view, the memtype is UC, because its MTRR is disabled.

The quirk throws all of that out the window.  As the quirk is implemented in KVM,
i.e. what guests have been running with for years, the MTRRs are completely
ignored when CR0.CD is set.

> So, if we don't zap EPT when guest disables MTRR, and if there's
> non-coherent DMA on-going which requires guest driver to flush caches to
> ensure cache coherency (though I don't think this scenario will happen in
> reality), guest driver may not do the right thing as it thinks the memory is
> UC which has no need to flush cache while the effective memtype is WB.
> 
> Previously, (i.e. in current upstream implementation),  
> (1) when CR0.CD=1, the EPT memtype is WB + ignore guest PAT.
> (2) when MTRR is further disabled,
>     EPT is zapped, new EPT memtype is UC, effective memtype is UC.

No, because after zapping, CR0.CD=1, and thus when KVM re-installs SPTEs the guest
will get WB memory for _everything_ once again.

> The guest device driver can flush cache well for non-coherent DMA.
>
> So, though the quirk will make current upstream code malfunction when
> CR0.CD is set alone, it still functions well if MTRR is also disabled.

Not unless I'm missing something.  I don't see any way for the guest to get UC
memory when CR0.CD=1 and the quirk is enabled.  The behavior is nonsensical IMO,
but that's been KVM's quirky behavior for years.

	if (is_mmio)
		return MTRR_TYPE_UNCACHABLE << VMX_EPT_MT_EPTE_SHIFT;

	if (!kvm_arch_has_noncoherent_dma(vcpu->kvm))
		return (MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT;

	if (kvm_read_cr0_bits(vcpu, X86_CR0_CD)) {
		if (kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_CD_NW_CLEARED))
			cache = MTRR_TYPE_WRBACK;  <===== Overrides MTRR settings
		else
			cache = MTRR_TYPE_UNCACHABLE;

		return (cache << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT;
	}

	return kvm_mtrr_get_guest_memory_type(vcpu, gfn) << VMX_EPT_MT_EPTE_SHIFT;

> But I doubt if we need to care about this extreme condition of
> non-coherent DMA that happens when CR0.CD=1 and MTRR is disabled.

It doesn't actually require non-coherent DMA within the CR0.CD=1 window though.
A SPTE installed while CR0.CD=1 could overlap a *future* non-coherent DMA range,
i.e. not zapping when CR0.CD is cleared would leave a stale WB SPTE.  This is
especially likely with hugepages, e.g. if KVM can create 1GiB mappings.

> > > This zap looks will happen on each vCPU.
> > 
> > Yes, on CR0.CD 1=>0 transition.
> > 
> > > Then only half of zaps are saved compared to the original count of zaps in
> > > update_mtrr().
> > 
> > I don't think the current code is functionally correct though, even if we define
> > "correct" to only apply when CR0.CD=0 (because the CR0.CD=1 case when the quirk
> > can't possibly be correct).  Keeping stale WB+IGMT entries that were installed
> > while CR0.CD=1 is broken.  As mentioned in my previous mail, I suspect it works
> > only because KVM doesn't install SPTEs for DMA'd ranges.  FWIW, the WB memtype
> > is unlikely to be a problem, but setting IGMT definitely is.  Though of course
> > we can and should fix the IGMT thing separately.
> Current code may also not function well for non-coherent DMA when guest vCPU has
> no X86_FEATURE_MTRR, as it returns WB (without IPAT) as EPT memtype.
> Then the effective memtype is guest PAT, but from guest's point of view, it's all
> UC. (see pat_x_mtrr_type() in Linux's code).
> But I guess we don't need to mind this case?

Yes, we can ignore !X86_FEATURE_MTRR guests.  If someone wants to run such a
setup with non-coherent DMA, they're welcome to send patches and write a lengthy
explanation of why they want to run such insanity :-)

> > > And pieces of no-WB memory might produce more kvm_zap_gfn_range()?
> > 
> > Yes, but they'll be much, much more precise.  And the bajillion fixed MTRRs can
> Could we instead keep a per-VM copy of MTRR?
> Then, if a vCPU modifies an MTRR entry and we find it's different from that
> in the per-VM copy, we mark that entry dirty. When CD0=0, we just zap
> those dirty ranges. (or just zap all if there are dirty entries)
> In theory, only one vCPU will trigger this zap in each round of MTRR update.

I don't think tracking "dirty" MTRRs relative to a per-VM copy will work because
of the weird behavior of the quirk.  E.g. if the below happens, vCPU1's WB SPTE
won't be zapped because MTRRx wasn't "dirty".

     vCPU0         vCPU1
     ----------    ----------
1.   MTRRx = UC 
2.   CR0.CD = 1    CR0.CD=1
3.                 SPTE = WB
4.                 MTRRx = UC
5.                 CR0.CD=0

Another problem is that while only one vCPU will trigger the zap, KVM needs to
stall all vCPUs that dirtied MTRRs *relative to the original per-VM state*.  E.g.
if the per-VM state for MTRRx is WB and both vCPU1 and vCPU2 dirty an MTRR and
both clear CR0.CD, then KVM needs to stall vCPU1 and vCPU2 regardless of which
vCPU actually does the zapping.

And when handling the transition from dirty=>clean, KVM would need to prevent
writes to MTRRs while grabbing the snapshot of dirty MTRRs, i.e. to-be-zapped
ranges, in order to provide stable view of the clean per-VM copy.

In other words, the net effect will essentially be what I sketched out with the
precise zapping, but with the added downside of serializing writes to MTRRs while
transitioning the per-VM state from dirty=>clean.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v2 5/6] KVM: x86: Keep a per-VM MTRR state
  2023-05-25  7:21       ` Robert Hoo
@ 2023-05-25 15:00         ` Sean Christopherson
  2023-05-26  1:49           ` Robert Hoo
  0 siblings, 1 reply; 41+ messages in thread
From: Sean Christopherson @ 2023-05-25 15:00 UTC (permalink / raw)
  To: Robert Hoo; +Cc: Yan Zhao, kvm, linux-kernel, pbonzini

On Thu, May 25, 2023, Robert Hoo wrote:
> On 5/23/2023 2:21 PM, Yan Zhao wrote:
> IIUC, your savings come from skipping the intermediate states during boot, when
> the APs go through setting MTRRs, which would cause SPTE zaps before this
> patch set.
> 
> MHO was: now that your approach ignores other vCPUs' MTRR settings (unless they
> differ from the BP's MTRRs?), why not let each vCPU's MTRR set/update go directly
> into the per-VM MTRR state (if it differs from the current value)? It's the guest
> OS/BIOS's responsibility to keep them consistent anyway. And even if the
> malfunction caused by an inconsistency differs from that on native hardware, the
> SDM doesn't clearly state what the malfunction should be, does it?
> That's to say, does anyone know whether, when an inconsistency happens, only that
> logical processor malfunctions, or whether it in fact impacts the global MTRR
> settings? If the latter, I think keeping only the per-VM MTRR state aligns
> with native behavior.

The MTRRs are not system wide or per-package though, they are per logical CPU.
Yes, they "need" to be consistent with respect to one another, but only when the
CPU is actually accessing memory.  This is a big reason why trying to track MTRRs
as a per-VM asset in KVM is so difficult/messy.  Software doesn't rendezvous all
CPUs and then do the write on just one CPU, each CPU does its own writes more or
less independently.

> BTW, with regard to KVM_X86_QUIRK_CD_NW_CLEARED, I see svm honors it (by clearing
> CR0.CD/NW in the hardware CR0) while vmx doesn't.
> 
> svm_set_cr0():
> 
> 	if (kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_CD_NW_CLEARED))
> 		hcr0 &= ~(X86_CR0_CD | X86_CR0_NW);
> 
> 
> vmx_set_cr0():
> 
> 	hw_cr0 = (cr0 & ~KVM_VM_CR0_ALWAYS_OFF);
> 
> Perhaps the vmx side can be fixed in passing?

Sadly, no.  SVM and VMX manage guest memtype completely differently.  VMX doesn't
allow CR0.CD=1 when VMX is enabled, and so KVM needs to emulate CR0.CD via the EPT
memtype.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v2 1/6] KVM: x86/mmu: add a new mmu zap helper to indicate memtype changes
  2023-05-25 10:14             ` Yan Zhao
@ 2023-05-25 15:54               ` Sean Christopherson
  2023-05-30  1:32                 ` Yan Zhao
  2023-05-30  9:48                 ` Yan Zhao
  0 siblings, 2 replies; 41+ messages in thread
From: Sean Christopherson @ 2023-05-25 15:54 UTC (permalink / raw)
  To: Yan Zhao; +Cc: Chao Gao, kvm, linux-kernel, pbonzini

On Thu, May 25, 2023, Yan Zhao wrote:
> On Wed, May 24, 2023 at 07:50:24AM -0700, Sean Christopherson wrote:
> > bool __kvm_mmu_honors_guest_mtrrs(struct kvm *kvm, bool vm_has_noncoherent_dma);
> > 
> > static inline bool kvm_mmu_honors_guest_mtrrs(struct kvm *kvm)
> > {
> > 	
> > 	return __kvm_mmu_honors_guest_mtrrs(kvm, kvm_arch_has_noncoherent_dma(kvm));
> > }
> 
> This should work and it centralizes the comments into one place, though I dislike
> having to pass true as vm_has_noncoherent_dma in case of 1->0 transition. :)

Yeah, I don't love it either, but the whole 1=>0 transition is awkward.  FWIW,
KVM doesn't strictly need to zap in that case since the guest isn't relying on
WB for functionality, i.e. we could just skip it.

> > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > > index 41d7bb51a297..ad0c43d7f532 100644
> > > --- a/arch/x86/kvm/x86.c
> > > +++ b/arch/x86/kvm/x86.c
> > > @@ -13146,13 +13146,19 @@ EXPORT_SYMBOL_GPL(kvm_arch_has_assigned_device);
> > > 
> > >  void kvm_arch_register_noncoherent_dma(struct kvm *kvm)
> > >  {
> > > -       atomic_inc(&kvm->arch.noncoherent_dma_count);
> > > +       if (atomic_inc_return(&kvm->arch.noncoherent_dma_count) == 1) {
> > > +               if (kvm_mmu_cap_honors_guest_mtrrs(kvm))
> > > +                       kvm_zap_gfn_range(kvm, 0, ~0ULL);
> > 
> > No need for multiple if statements.  Though rather than have identical code in
> > both the start/end paths, how about this?  That provides a single location for a
> > comment.  Or maybe first/last instead of start/end?
> > 
> > static void kvm_noncoherent_dma_start_or_end(struct kvm *kvm)
> What does start_or_end or first_or_last stand for? 

Start/End of device (un)assignment, or First/Last device (un)assigned.  Definitely
feel free to pick a better name.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v2 5/6] KVM: x86: Keep a per-VM MTRR state
  2023-05-25 15:00         ` Sean Christopherson
@ 2023-05-26  1:49           ` Robert Hoo
  0 siblings, 0 replies; 41+ messages in thread
From: Robert Hoo @ 2023-05-26  1:49 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: Yan Zhao, kvm, linux-kernel, pbonzini

On 5/25/2023 11:00 PM, Sean Christopherson wrote:
[...]
> The MTRRs are not system wide or per-package though, they are per logical CPU.
> Yes, they "need" to be consistent with respect to one another, but only when the
> CPU is actually accessing memory.  This is a big reason why trying to track MTRRs
> as a per-VM asset in KVM is so difficult/messy.  Software doesn't rendezvous all
> CPUs and then do the write on just one CPU, each CPU does its own writes more or
> less independently.

Ah, got it, thanks!

(Some per-logical-processor things seem to be just a shadow of one consolidated 
global setting, e.g. CR0.CD. Others are really separate settings of each logical 
processor, e.g. the MTRRs above. Really unfathomable 😅)
> 
>> BTW, with regard to KVM_X86_QUIRK_CD_NW_CLEARED, I see svm honors it while
>> vmx doesn't before it clear CR0.CD/NW.
>>
>> svm_set_cr0():
>>
>> 	if (kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_CD_NW_CLEARED))
>> 		hcr0 &= ~(X86_CR0_CD | X86_CR0_NW);
>>
>>
>> vmx_set_cr0():
>>
>> 	hw_cr0 = (cr0 & ~KVM_VM_CR0_ALWAYS_OFF);
>>
>> Perhaps vmx side can be fixed passingly?
> 
> Sadly, no.  SVM and VMX manage guest memtype completely differently.  VMX doesn't
> allow CR0.CD=1 when VMX is enabled, and so KVM needs to emulate CR0.CD via the EPT
> memtype.

OK, got it, thanks. I wasn't aware of this from the SDM.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v2 5/6] KVM: x86: Keep a per-VM MTRR state
  2023-05-25 14:53               ` Sean Christopherson
@ 2023-05-26  7:54                 ` Yan Zhao
  2023-05-26 16:09                   ` Sean Christopherson
  0 siblings, 1 reply; 41+ messages in thread
From: Yan Zhao @ 2023-05-26  7:54 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: Robert Hoo, kvm, linux-kernel, pbonzini

On Thu, May 25, 2023 at 07:53:41AM -0700, Sean Christopherson wrote:
> > > > > So, if KVM zaps SPTEs when CR0.CD is cleared (even when the quirk is enabled),
> > > > > then KVM can skip the MTRR zaps when CR0.CD=1 because KVM is ignoring the MTRRs
> > > > > and will zap when CR0.CD is cleared.  And to avoid regressing the CR0.CD case,
> > > > > if KVM honors guest PAT for the bizarro CR0.CD quirk, then KVM only needs to
> > > > > zap non-WB MTRR ranges when CR0.CD is cleared.  Since WB is weak, looking for
> > Not only non-WB ranges are required to be zapped.
> > Think about a scenario:
> > (1) one fixed MTRR range is UC initially. Its EPT entry memtype is UC, too.
> > (2) after CR0.CD=1, without zap, its EPT entry memtype is still UC.
> > (3) then guest performs MTRR disable, no zap too.
> > (4) guest updates this fixed MTRR range to WB, and performs MTRR enable.
> > (5) CR0.CD=0. we need to zap this fixed range to update the EPT memtype to WB.
> 
> KVM installs WB memtype when the quirk is enabled, thus no need to zap.  KVM
> currently zaps everything when the quirk is disabled, and I'm not proposing that
> be changed.
I didn't explain it clearly.

(1) one fixed MTRR range is UC initially. Its EPT entry memtype is UC, too. ==> the EPT entry has been created here
(2) after CR0.CD=1, because of the quirk there is no zap, so the created EPT entry's memtype is still UC.
(3) then the guest disables MTRRs; no zap either, according to our change.
(4) the guest updates this fixed MTRR range to WB, and re-enables MTRRs.
(5) CR0.CD=0. We need to zap this fixed range to update the EPT memtype to WB. ==> we also need to zap WB ranges.


> It doesn't actually require non-coherent DMA within the CR0.CD=1 window though.
If we don't need to mind non-coherent DMA within the window from CR0.CD=1 to CR0.CD=0,
do you think it's a good idea to do it this way...

(1) when CR0.CD=1, return guest mtrr type. 

--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7532,12 +7532,13 @@ static u8 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
                return (MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT;

        if (kvm_read_cr0_bits(vcpu, X86_CR0_CD)) {
-               if (kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_CD_NW_CLEARED))
-                       cache = MTRR_TYPE_WRBACK;
-               else
+               if (kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_CD_NW_CLEARED)) {
+                       cache = kvm_mtrr_get_guest_memory_type(vcpu, gfn);
+                       return cache << VMX_EPT_MT_EPTE_SHIFT;
+               } else {
                        cache = MTRR_TYPE_UNCACHABLE;
-
-               return (cache << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT;
+                       return (cache << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT;
+               }
        }


(2) return MTRR_TYPE_WRBACK if the guest MTRRs have not yet been enabled even once
u8 kvm_mtrr_get_guest_memory_type(struct kvm_vcpu *vcpu, gfn_t gfn)
@@ -628,13 +635,23 @@ u8 kvm_mtrr_get_guest_memory_type(struct kvm_vcpu *vcpu, gfn_t gfn)
        struct mtrr_iter iter;
        u64 start, end;
        int type = -1;
        const int wt_wb_mask = (1 << MTRR_TYPE_WRBACK)
                               | (1 << MTRR_TYPE_WRTHROUGH);
 
+       if (!mtrr_state->enabled_once)
+               return MTRR_TYPE_WRBACK;
+


(3) when CR0.CD = 1 and the quirk is on, return MTRR type as if MTRR is enabled
+       ignore_disable = kvm_read_cr0_bits(vcpu, X86_CR0_CD) &&
+                        kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_CD_NW_CLEARED);

-static void mtrr_lookup_start(struct mtrr_iter *iter)
+static void mtrr_lookup_start(struct mtrr_iter *iter, bool ignore_disable)
 {
-       if (!mtrr_is_enabled(iter->mtrr_state)) {
+       if (!ignore_disable && !mtrr_is_enabled(iter->mtrr_state)) {
                iter->mtrr_disabled = true;
                return;
        }

(4) Then we only need to do an EPT zap when MTRRs are enabled for the first time,
and when MTRR fixed/var entries have changed, either at MTRR enable or at CR0.CD=0
(I prefer at MTRR enable, as SeaBIOS may do the MTRR disable/update/enable sequence
while CR0.CD=0).


Besides, according to the commit message introducing KVM_QUIRK_CD_NW_CLEARED,
we can return MTRR_TYPE_WRBACK for CR0.CD=1 only when the MTRRs have not yet been
enabled even once. (And I tried it, it works!)

commit fb279950ba02e3210a16b11ecfa8871f3ee0ca49
Author: Xiao Guangrong <guangrong.xiao@intel.com>
Date:   Thu Jul 16 03:25:56 2015 +0800

    KVM: vmx: obey KVM_QUIRK_CD_NW_CLEARED

    OVMF depends on WB to boot fast, because it only clears caches after
    it has set up MTRRs---which is too late.

    Let's do writeback if CR0.CD is set to make it happy, similar to what
    SVM is already doing.
 
> Yes, we can ignore !X86_FEATURE_MTRR guests.  If someone wants to run such a
> setup with non-coherent DMA, they're welcome to send patches and write a lengthy
> explanation of why they want to run such insanity :-)
great :)

> 
> > > > And pieces of no-WB memory might produce more kvm_zap_gfn_range()?
> > > 
> > > Yes, but they'll be much, much more precise.  And the bajillion fixed MTRRs can
> > Could we instead keep a per-VM copy of MTRR?
> > Then, if a vCPU modifies an MTRR entry and we find it's different from that
> > in the per-VM copy, we mark that entry dirty. When CD0=0, we just zap
> > those dirty ranges. (or just zap all if there are dirty entries)
> > In theory, only one vCPU will trigger this zap in each round of MTRR update.
> 
> I don't think tracking "dirty" MTRRs releative to a per-VM copy will work because
> of the weird behavior of the quirk.  E.g. if the below happens, vCPU1's WB SPTE
> won't be zapped because MTRRx wasn't "dirty".
>
By returning the MTRR type (as if MTRRs were enabled) for CR0.CD=1, we can avoid
zapping entries in this window. Agree?

>      vCPU0         vCPU1
>      ----------    ----------
> 1.   MTRRx = UC 
> 2.   CR0.CD = 1    CR0.CD=1
> 3.                 SPTE = WB
> 4.                 MTRRx = UC
> 5.                 CR0.CD=0
> 
> Another problem is that while only one vCPU will trigger the zap, KVM needs to
> stall all vCPUs that dirtied MTRRs *relative to the original per-VM state*.  E.g.
> if the per-VM state for MTRRx is WB and both vCPU1 and vCPU2 dirty an MTRR and
> both clear CR0.CD, then KVM needs to stall vCPU1 and vCPU2 regardless of which
> vCPU actually does the zapping.
> 
> And when handling the transition from dirty=>clean, KVM would need to prevent
> writes to MTRRs while grabbing the snapshot of dirty MTRRs, i.e. to-be-zapped
> ranges, in order to provide stable view of the clean per-VM copy.
> 
> In other words, the net effect will essentially be what I sketched out with the
> precise zapping, but with the added downside of serializing writes to MTRRs while
> transitioning the per-VM state from dirty=>clean.
Yes, the net effect will essentially be what you sketched out in your last mail.

I can try to figure out which way is more efficient before sending out the next
version if you agree with the above direction, i.e.
(1) do not consider non-coherent DMA when CR0.CD=1
(2) return MTRR_TYPE_WRBACK for CR0.CD=1 only when the MTRRs have never been enabled
(3) return the MTRR type as if MTRRs were enabled for CR0.CD=1

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v2 5/6] KVM: x86: Keep a per-VM MTRR state
  2023-05-26  7:54                 ` Yan Zhao
@ 2023-05-26 16:09                   ` Sean Christopherson
  2023-05-30  1:19                     ` Yan Zhao
  0 siblings, 1 reply; 41+ messages in thread
From: Sean Christopherson @ 2023-05-26 16:09 UTC (permalink / raw)
  To: Yan Zhao; +Cc: Robert Hoo, kvm, linux-kernel, pbonzini

On Fri, May 26, 2023, Yan Zhao wrote:
> On Thu, May 25, 2023 at 07:53:41AM -0700, Sean Christopherson wrote:
> > > > > > So, if KVM zaps SPTEs when CR0.CD is cleared (even when the quirk is enabled),
> > > > > > then KVM can skip the MTRR zaps when CR0.CD=1 because KVM is ignoring the MTRRs
> > > > > > and will zap when CR0.CD is cleared.  And to avoid regressing the CR0.CD case,
> > > > > > if KVM honors guest PAT for the bizarro CR0.CD quirk, then KVM only needs to
> > > > > > zap non-WB MTRR ranges when CR0.CD is cleared.  Since WB is weak, looking for
> > > Not only non-WB ranges are required to be zapped.
> > > Think about a scenario:
> > > (1) one fixed MTRR range is UC initially. Its EPT entry memtype is UC, too.
> > > (2) after CR0.CD=1, without zap, its EPT entry memtype is still UC.
> > > (3) then guest performs MTRR disable, no zap too.
> > > (4) guest updates this fixed MTRR range to WB, and performs MTRR enable.
> > > (5) CR0.CD=0. we need to zap this fixed range to update the EPT memtype to WB.
> > 
> > KVM installs WB memtype when the quirk is enabled, thus no need to zap.  KVM
> > currently zaps everything when the quirk is disabled, and I'm not proposing that
> > be changed.
> I didn't explain it clearly.
> 
> (1) one fixed MTRR range is UC initially. Its EPT entry memtype is UC, too. ==> EPT entry has been created here
> (2) after CR0.CD=1, because of the quirk, no zap, the created EPT entry memtype is still UC.
> (3) then guest performs MTRR disable, no zap too, according to our change.
> (4) guest updates this fixed MTRR range to WB, and performs MTRR enable.
> (5) CR0.CD=0. we need to zap this fixed range to update the EPT memtype to WB.==>we also need to zap WB range.

Ugh, right.  But that case can be handled by zapping non-WB ranges on CR0.CD being
set.  Hmm, and KVM would need to zap everything if CR0.CD is toggled with MTRRs
disabled, but I don't think OVMF ever does that.  Zapping on CR0.CD toggling would
likely introduce a small performance hit for SeaBIOS due to SeaBIOS clearing
CR0.CD immediately after SIPI, i.e. with MTRRs disabled, but that's arguably a
correctness fix since the quirk means KVM incorrectly runs the vCPU with WB SPTEs
until MTRRs are programmed.

With precise+batched zapping, zapping non-WB ranges even when CR0.CD is set should
still be a healthy performance boost for OVMF.

> > It doesn't actually require non-coherent DMA within the CR0.CD=1 window though.
> If we don't need to mind non-coherent DMA within the window CR0.CD=1 to CR0.CD=0,
> do you think it's a good idea to do in this way...
> 
> (1) when CR0.CD=1, return guest mtrr type. 
> 
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -7532,12 +7532,13 @@ static u8 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
>                 return (MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT;
> 
>         if (kvm_read_cr0_bits(vcpu, X86_CR0_CD)) {
> -               if (kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_CD_NW_CLEARED))
> -                       cache = MTRR_TYPE_WRBACK;
> -               else
> +               if (kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_CD_NW_CLEARED)) {
> +                       cache = kvm_mtrr_get_guest_memory_type(vcpu, gfn);
> +                       return cache << VMX_EPT_MT_EPTE_SHIFT;
> +               } else {
>                         cache = MTRR_TYPE_UNCACHABLE;
> -
> -               return (cache << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT;
> +                       return (cache << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT;
> +               }
>         }
> 
> 
> (2) return MTRR_TYPE_WRBACK if guest mtrr has not been enabled for once
> u8 kvm_mtrr_get_guest_memory_type(struct kvm_vcpu *vcpu, gfn_t gfn)
> @@ -628,13 +635,23 @@ u8 kvm_mtrr_get_guest_memory_type(struct kvm_vcpu *vcpu, gfn_t gfn)
>         struct mtrr_iter iter;
>         u64 start, end;
>         int type = -1;
>         const int wt_wb_mask = (1 << MTRR_TYPE_WRBACK)
>                                | (1 << MTRR_TYPE_WRTHROUGH);
>  
> +       if (!mtrr_state->enabled_once)
> +               return MTRR_TYPE_WRBACK;

No, because this assumes too many things about the guest, and completely falls
apart if the guest reboots.

> (3) when CR0.CD = 1 and the quirk is on, return MTRR type as if MTRR is enabled
> +       ignore_disable = kvm_read_cr0_bits(vcpu, X86_CR0_CD) &&
> +                        kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_CD_NW_CLEARED);
> 
> -static void mtrr_lookup_start(struct mtrr_iter *iter)
> +static void mtrr_lookup_start(struct mtrr_iter *iter, bool ignore_disable)
>  {
> -       if (!mtrr_is_enabled(iter->mtrr_state)) {
> +       if (!ignore_disable && !mtrr_is_enabled(iter->mtrr_state)) {
>                 iter->mtrr_disabled = true;
>                 return;
>         }
> 
> (4) Then we only need to do EPT zap when MTRR is enabled for the first time
> and when MTRR fixed/var entries are changed at enabling MTRR or at CR0.CD=0
> (I prefer at enabling MTRR, as seabios may do the MTRR disable/update/enable when
> CR0.CD=0)
> 
> 
> Besides, accoding to the commit message introducing KVM_QUIRK_CD_NW_CLEARED,

I am not willing to lean on the historical commit messages for the quirk to
justify any change.  I'm not at all convinced that there was any meaningful thought
put into ensuring correctness.

> we can return MTRR_TYPE_WRBACK for CR0.CD=1 only when MTRR is not enbled for
> once. (And I tried, it works!)

On your system, for your setup.  The quirk terrifies me because it likely affects
every KVM-based VM out there (I can't find a single VMM that disables the quirk).
These changes are limited to VMs with noncoherent DMA, but still.

In short, I am steadfastly against any change that piles more arbitrary
functional behavior on top of the quirk, especially when that behavior relies on
heuristics to try and guess what the guest is doing.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v2 5/6] KVM: x86: Keep a per-VM MTRR state
  2023-05-26 16:09                   ` Sean Christopherson
@ 2023-05-30  1:19                     ` Yan Zhao
  0 siblings, 0 replies; 41+ messages in thread
From: Yan Zhao @ 2023-05-30  1:19 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: Robert Hoo, kvm, linux-kernel, pbonzini

On Fri, May 26, 2023 at 09:09:08AM -0700, Sean Christopherson wrote:
> On Fri, May 26, 2023, Yan Zhao wrote:
> > On Thu, May 25, 2023 at 07:53:41AM -0700, Sean Christopherson wrote:
> > > > > > > So, if KVM zaps SPTEs when CR0.CD is cleared (even when the quirk is enabled),
> > > > > > > then KVM can skip the MTRR zaps when CR0.CD=1 because KVM is ignoring the MTRRs
> > > > > > > and will zap when CR0.CD is cleared.  And to avoid regressing the CR0.CD case,
> > > > > > > if KVM honors guest PAT for the bizarro CR0.CD quirk, then KVM only needs to
> > > > > > > zap non-WB MTRR ranges when CR0.CD is cleared.  Since WB is weak, looking for
> > > > Not only non-WB ranges are required to be zapped.
> > > > Think about a scenario:
> > > > (1) one fixed MTRR range is UC initially. Its EPT entry memtype is UC, too.
> > > > (2) after CR0.CD=1, without zap, its EPT entry memtype is still UC.
> > > > (3) then guest performs MTRR disable, no zap too.
> > > > (4) guest updates this fixed MTRR range to WB, and performs MTRR enable.
> > > > (5) CR0.CD=0. we need to zap this fixed range to update the EPT memtype to WB.
> > > 
> > > KVM installs WB memtype when the quirk is enabled, thus no need to zap.  KVM
> > > currently zaps everything when the quirk is disabled, and I'm not proposing that
> > > be changed.
> > I didn't explain it clearly.
> > 
> > (1) one fixed MTRR range is UC initially. Its EPT entry memtype is UC, too. ==> EPT entry has been created here
> > (2) after CR0.CD=1, because of the quirk, no zap, the created EPT entry memtype is still UC.
> > (3) then guest performs MTRR disable, no zap too, according to our change.
> > (4) guest updates this fixed MTRR range to WB, and performs MTRR enable.
> > (5) CR0.CD=0. we need to zap this fixed range to update the EPT memtype to WB.==>we also need to zap WB range.
> 
> Ugh, right.  But that case can be handled by zapping non-WB ranges on CR0.CD being
> set.  Hmm, and KVM would need to zap everything if CR0.CD is toggled with MTRRs
Ok. Actually, even with the "zap everything always" change below, I didn't observe
any performance degradation on OVMF on my side, because we skip the EPT zap in
update_mtrr() when CR0.CD=1.
(In fact there are 3 fewer zaps for vCPU 0 and 1 fewer zap for the APs, as initially
several MTRRs are disabled while CR0.CD=1 without an accompanying CR0.CD toggle.)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index dadf80fb85e9..ad3c4dc9d7bf 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -941,9 +941,9 @@ void kvm_post_set_cr0(struct kvm_vcpu *vcpu, unsigned long old_cr0, unsigned lon

        if (((cr0 ^ old_cr0) & X86_CR0_CD) &&
-           kvm_mmu_honors_guest_mtrr(vcpu->kvm) &&
-           !kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_CD_NW_CLEARED)
+           kvm_mmu_honors_guest_mtrr(vcpu->kvm))
                kvm_zap_gfn_range(vcpu->kvm, 0, ~0ULL);
 }
 EXPORT_SYMBOL_GPL(kvm_post_set_cr0);



> disabled, but I don't think OVMF ever does that.  Zapping on CR0.CD toggling would
> likely introduce a small performance hit for SeaBIOS due to SeaBIOS clearing
> CR0.CD immediately after SIPI, i.e. with MTRRs disabled, but that's arguably a
For SeaBIOS, I also observed one less EPT zap for each vCPU, because
there are 3 more MTRR toggles than CR0.CD toggles for each vCPU, and only 2 MTRR
toggles happen when CR0.CD=0.

> correctness fix since the quirk means KVM incorrectly runs the vCPU with WB SPTEs
> until MTRRs are programmed.

> With precise+batched zapping, zapping non-WB ranges even when CR0.CD is set should
> still be a healthy performance boost for OVMF.
Ok. I'll try to implement it this way in the next version.


> > (2) return MTRR_TYPE_WRBACK if guest mtrr has not been enabled for once
> > u8 kvm_mtrr_get_guest_memory_type(struct kvm_vcpu *vcpu, gfn_t gfn)
> > @@ -628,13 +635,23 @@ u8 kvm_mtrr_get_guest_memory_type(struct kvm_vcpu *vcpu, gfn_t gfn)
> >         struct mtrr_iter iter;
> >         u64 start, end;
> >         int type = -1;
> >         const int wt_wb_mask = (1 << MTRR_TYPE_WRBACK)
> >                                | (1 << MTRR_TYPE_WRTHROUGH);
> >  
> > +       if (!mtrr_state->enabled_once)
> > +               return MTRR_TYPE_WRBACK;
> 
> No, because this assumes too many things about the guest, and completely falls
> apart if the guest reboots.
Ah, right.

> I am not willing to lean on the historical commit messages for the quirk to
> justify any change.  I'm not at all convinced that there was any meaningful thought
> put into ensuring correctness.
> 
> > we can return MTRR_TYPE_WRBACK for CR0.CD=1 only when MTRR is not enbled for
> > once. (And I tried, it works!)
> 
> On your system, for your setup.  The quirk terrifies me because it likely affects
> every KVM-based VM out there (I can't find a single VMM that disables the quirk).
> These changes are limited to VMs with noncoherent DMA, but still.
> 
> In short, I am steadfastly against any change that piles more arbitrary
> functional behavior on top of the quirk, especially when that behavior relies on
> heuristics to try and guess what the guest is doing.

Ok, yes, that's the right thing to insist on.

^ permalink raw reply related	[flat|nested] 41+ messages in thread

* Re: [PATCH v2 1/6] KVM: x86/mmu: add a new mmu zap helper to indicate memtype changes
  2023-05-25 15:54               ` Sean Christopherson
@ 2023-05-30  1:32                 ` Yan Zhao
  2023-05-30  9:48                 ` Yan Zhao
  1 sibling, 0 replies; 41+ messages in thread
From: Yan Zhao @ 2023-05-30  1:32 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: Chao Gao, kvm, linux-kernel, pbonzini

On Thu, May 25, 2023 at 08:54:15AM -0700, Sean Christopherson wrote:
> On Thu, May 25, 2023, Yan Zhao wrote:
> > On Wed, May 24, 2023 at 07:50:24AM -0700, Sean Christopherson wrote:
> > > bool __kvm_mmu_honors_guest_mtrrs(struct kvm *kvm, bool vm_has_noncoherent_dma);
> > > 
> > > static inline bool kvm_mmu_honors_guest_mtrrs(struct kvm *kvm)
> > > {
> > > 	
> > > 	return __kvm_mmu_honors_guest_mtrrs(kvm, kvm_arch_has_noncoherent_dma(kvm));
> > > }
> > 
> > This should work and it centralizes the comments into one place, though I dislike
> > having to pass true as vm_has_noncoherent_dma in case of 1->0 transition. :)
> 
> Yeah, I don't love it either, but the whole 1=>0 transition is awkward.  FWIW,
> KVM doesn't strictly need to zap in that case since the guest isn't relying on
> WB for functionality, i.e. we could just skip it.
I think zapping on the 1=>0 transition is still useful.
E.g. if the non-coherent DMA device is unassigned
(1) while CR0.CD=1 (with KVM_X86_QUIRK_CD_NW_CLEARED not enabled), or
(2) while CR0.CD=0 and MTRRs are disabled,
it's better to zap the UC ranges for better performance.


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v2 1/6] KVM: x86/mmu: add a new mmu zap helper to indicate memtype changes
  2023-05-25 15:54               ` Sean Christopherson
  2023-05-30  1:32                 ` Yan Zhao
@ 2023-05-30  9:48                 ` Yan Zhao
  2023-05-30 23:51                   ` Sean Christopherson
  1 sibling, 1 reply; 41+ messages in thread
From: Yan Zhao @ 2023-05-30  9:48 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: Chao Gao, kvm, linux-kernel, pbonzini

On Thu, May 25, 2023 at 08:54:15AM -0700, Sean Christopherson wrote:
> > > static void kvm_noncoherent_dma_start_or_end(struct kvm *kvm)
> > What does start_or_end or first_or_last stand for? 
> 
> Start/End of device (un)assignment, or First/Last device (un)assigned.  Definitely
> feel free to pick a better name.
Hi Sean,
start_or_end gave me the first impression that it stands for gfn_start/gfn_end,
so for now I changed it to start/stop to avoid confusion.

+static void kvm_noncoherent_dma_assignment_start_or_stop(struct kvm *kvm)
+{
+       /*
+        * Zapping the KVM TDP is only necessary if guest MTRRs are honored.
+        * Whether a VM has non-coherent DMA affects whether KVM honors guest
+        * MTRRs, and hence the memory type used in the TDP, so pass true here
+        * to check whether guest MTRRs are honored after the assignment starts,
+        * or were honored before the assignment stops.
+        */
+       if (kvm_mmu_honors_guest_mtrrs(kvm, true))
+               kvm_zap_gfn_range(kvm, 0, ~0ULL);
+}
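
For reference, the callers would stay as in your earlier sketch, with only the
helper name changed, i.e. (sketch):

void kvm_arch_register_noncoherent_dma(struct kvm *kvm)
{
        /* only the 0=>1 transition can change whether guest MTRRs are honored */
        if (atomic_inc_return(&kvm->arch.noncoherent_dma_count) == 1)
                kvm_noncoherent_dma_assignment_start_or_stop(kvm);
}
EXPORT_SYMBOL_GPL(kvm_arch_register_noncoherent_dma);

void kvm_arch_unregister_noncoherent_dma(struct kvm *kvm)
{
        /* ... and likewise the 1=>0 transition on unassignment */
        if (!atomic_dec_return(&kvm->arch.noncoherent_dma_count))
                kvm_noncoherent_dma_assignment_start_or_stop(kvm);
}
EXPORT_SYMBOL_GPL(kvm_arch_unregister_noncoherent_dma);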

And I combined __kvm_mmu_honors_guest_mtrrs() into kvm_mmu_honors_guest_mtrrs().
Not sure if you like it :)

+/*
+ * Returns whether KVM honors guest MTRRs
+ * @override_vm_has_noncoherent_dma: Allow caller to override non-coherent DMA
+ *                                   status returned from
+ *                                   kvm_arch_has_noncoherent_dma()
+ */
+bool kvm_mmu_honors_guest_mtrrs(struct kvm *kvm,
+                               bool override_vm_has_noncoherent_dma)
+{
+       bool noncoherent_dma = override_vm_has_noncoherent_dma ? true :
+                              kvm_arch_has_noncoherent_dma(kvm);
+
+       /*
+        * If the TDP is enabled, the host MTRRs are ignored by TDP
+        * (shadow_memtype_mask is non-zero), and the VM has non-coherent DMA
+        * (DMA doesn't snoop CPU caches), KVM's ABI is to honor the memtype
+        * from the guest's MTRRs so that guest accesses to memory that is
+        * DMA'd aren't cached against the guest's wishes.
+        *
+        * Note, KVM may still ultimately ignore guest MTRRs for certain PFNs,
+        * e.g. KVM will force UC memtype for host MMIO.
+        */
+       return noncoherent_dma && tdp_enabled && shadow_memtype_mask;
+}
+

Thanks
Yan

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v2 1/6] KVM: x86/mmu: add a new mmu zap helper to indicate memtype changes
  2023-05-30  9:48                 ` Yan Zhao
@ 2023-05-30 23:51                   ` Sean Christopherson
  2023-05-31  0:18                     ` Yan Zhao
  0 siblings, 1 reply; 41+ messages in thread
From: Sean Christopherson @ 2023-05-30 23:51 UTC (permalink / raw)
  To: Yan Zhao; +Cc: Chao Gao, kvm, linux-kernel, pbonzini

On Tue, May 30, 2023, Yan Zhao wrote:
> On Thu, May 25, 2023 at 08:54:15AM -0700, Sean Christopherson wrote:
> And I combined the __kvm_mmu_honors_guest_mtrrs() into
> kvm_mmu_honors_guest_mtrrs(). Not sure if you like it :)

I would prefer to provide a separate inner helper, mainly so that the common
case callers don't need to pass %false.  I don't love passing bools, but it's
tolerable for a one-off use of an inner helper.

> +/*
> + * Returns if KVM honors guest MTRRs
> + * @override_vm_has_noncoherent_dma: Allow caller to override non-coherent DMA
> + *                                   status returned from
> + *                                   kvm_arch_has_noncoherent_dma()
> + */
> +bool kvm_mmu_honors_guest_mtrrs(struct kvm *kvm,
> +                               bool override_vm_has_noncoherent_dma)
> +{
> +       bool noncoherent_dma = override_vm_has_noncoherent_dma ? true :
> +                              kvm_arch_has_noncoherent_dma(kvm);

The "override" name is confusing, e.g. it won't be clear when it's safe/correct
for a new caller to override kvm_arch_has_noncoherent_dma().  If we go with a
single helper, I could live with:

bool kvm_mmu_honors_guest_mtrrs(struct kvm *kvm, bool stopping_noncoherent_dma)
{
	bool noncoherent_dma = stopping_noncoherent_dma ||
			       kvm_arch_has_noncoherent_dma(kvm);

	...
}

but that makes it awkward to use common code for start+stop assignment, and as
above there are three "normal" callers that would have to pass magic %false
values regardless of the name.
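
E.g., restating the earlier two-helper sketch just to make the split explicit: the
common-case callers (e.g. update_mtrr() and kvm_post_set_cr0()) use the wrapper,
and only the device (un)assignment path touches the inner helper:

/* mmu.h */
bool __kvm_mmu_honors_guest_mtrrs(struct kvm *kvm, bool vm_has_noncoherent_dma);

static inline bool kvm_mmu_honors_guest_mtrrs(struct kvm *kvm)
{
        return __kvm_mmu_honors_guest_mtrrs(kvm, kvm_arch_has_noncoherent_dma(kvm));
}

/* x86.c, device (un)assignment, where the DMA count is mid-transition */
static void kvm_noncoherent_dma_start_or_stop(struct kvm *kvm)
{
        if (__kvm_mmu_honors_guest_mtrrs(kvm, true))
                kvm_zap_gfn_range(kvm, 0, ~0ULL);
}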

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v2 1/6] KVM: x86/mmu: add a new mmu zap helper to indicate memtype changes
  2023-05-30 23:51                   ` Sean Christopherson
@ 2023-05-31  0:18                     ` Yan Zhao
  0 siblings, 0 replies; 41+ messages in thread
From: Yan Zhao @ 2023-05-31  0:18 UTC (permalink / raw)
  To: Sean Christopherson; +Cc: Chao Gao, kvm, linux-kernel, pbonzini

On Tue, May 30, 2023 at 04:51:25PM -0700, Sean Christopherson wrote:
> On Tue, May 30, 2023, Yan Zhao wrote:
> > On Thu, May 25, 2023 at 08:54:15AM -0700, Sean Christopherson wrote:
> > And I combined the __kvm_mmu_honors_guest_mtrrs() into
> > kvm_mmu_honors_guest_mtrrs(). Not sure if you like it :)
> 
> I would prefer to provide a separater inner helper, mainly so that the common
> case callers don't need to pass %false.  I don't love passing bools, but it's
> tolerable for a one-off use of an inner helper.

Ok. Will follow this way.
It's reasonable :)

> > +/*
> > + * Returns if KVM honors guest MTRRs
> > + * @override_vm_has_noncoherent_dma: Allow caller to override non-coherent DMA
> > + *                                   status returned from
> > + *                                   kvm_arch_has_noncoherent_dma()
> > + */
> > +bool kvm_mmu_honors_guest_mtrrs(struct kvm *kvm,
> > +                               bool override_vm_has_noncoherent_dma)
> > +{
> > +       bool noncoherent_dma = override_vm_has_noncoherent_dma ? true :
> > +                              kvm_arch_has_noncoherent_dma(kvm);
> 
> The "override" name is confusing, e.g. it won't be clear when it's safe/correct
> for a new caller to override kvm_arch_has_noncoherent_dma().  If we go with a
> single helper, I could live with:
> 
> bool kvm_mmu_honors_guest_mtrrs(struct kvm *kvm, bool stopping_noncoherent_dma)
> {
> 	bool noncoherent_dma = stopping_noncoherent_dma ||
> 			       kvm_arch_has_noncoherent_dma(kvm);
> 
> 	...
> }
> 
> but that makes it awkward to use common code for start+stop assignment, and as
> above there are three "normal" callers that would have to pass magic %false
> values regardless of the name.

^ permalink raw reply	[flat|nested] 41+ messages in thread

end of thread, other threads:[~2023-05-31  0:44 UTC | newest]

Thread overview: 41+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-05-09 13:48 [PATCH v2 0/6] KVM: x86/mmu: refine memtype related mmu zap Yan Zhao
2023-05-09 13:50 ` [PATCH v2 1/6] KVM: x86/mmu: add a new mmu zap helper to indicate memtype changes Yan Zhao
2023-05-10  5:30   ` Chao Gao
2023-05-10  8:06     ` Yan Zhao
2023-05-23 22:51       ` Sean Christopherson
2023-05-24  2:22         ` Yan Zhao
2023-05-24 14:50           ` Sean Christopherson
2023-05-25 10:14             ` Yan Zhao
2023-05-25 15:54               ` Sean Christopherson
2023-05-30  1:32                 ` Yan Zhao
2023-05-30  9:48                 ` Yan Zhao
2023-05-30 23:51                   ` Sean Christopherson
2023-05-31  0:18                     ` Yan Zhao
2023-05-09 13:51 ` [PATCH v2 2/6] KVM: x86/mmu: only zap EPT when guest CR0_CD changes Yan Zhao
2023-05-09 13:51 ` [PATCH v2 3/6] KVM: x86/mmu: only zap EPT when guest MTRR changes Yan Zhao
2023-05-10  5:39   ` Chao Gao
2023-05-10  8:00     ` Yan Zhao
2023-05-10 10:54       ` Huang, Kai
2023-05-11  0:15         ` Yan Zhao
2023-05-11  2:42           ` Huang, Kai
2023-05-11  2:31             ` Yan Zhao
2023-05-11  3:05               ` Huang, Kai
2023-05-09 13:52 ` [PATCH v2 4/6] KVM: x86/mmu: Zap all EPT leaf entries according noncoherent DMA count Yan Zhao
2023-05-09 13:53 ` [PATCH v2 5/6] KVM: x86: Keep a per-VM MTRR state Yan Zhao
2023-05-10 17:23   ` David Matlack
2023-05-21  3:44   ` Robert Hoo
2023-05-23  6:21     ` Yan Zhao
2023-05-24  0:13       ` Sean Christopherson
2023-05-24 11:03         ` Yan Zhao
2023-05-24 18:21           ` Sean Christopherson
2023-05-25 10:09             ` Yan Zhao
2023-05-25 14:53               ` Sean Christopherson
2023-05-26  7:54                 ` Yan Zhao
2023-05-26 16:09                   ` Sean Christopherson
2023-05-30  1:19                     ` Yan Zhao
2023-05-25  7:21       ` Robert Hoo
2023-05-25 15:00         ` Sean Christopherson
2023-05-26  1:49           ` Robert Hoo
2023-05-09 13:53 ` [PATCH v2 6/6] KVM: x86/mmu: use per-VM based MTRR for EPT Yan Zhao
2023-05-24  0:15 ` [PATCH v2 0/6] KVM: x86/mmu: refine memtype related mmu zap Sean Christopherson
2023-05-24 11:04   ` Yan Zhao
