* [RFC PATCH v3 00/19] RFC: nested AVIC
@ 2022-04-27 20:02 Maxim Levitsky
  2022-04-27 20:02 ` [RFC PATCH v3 01/19] KVM: x86: document AVIC/APICv inhibit reasons Maxim Levitsky
                   ` (18 more replies)
  0 siblings, 19 replies; 57+ messages in thread
From: Maxim Levitsky @ 2022-04-27 20:02 UTC (permalink / raw)
  To: kvm
  Cc: Wanpeng Li, Vitaly Kuznetsov, Jani Nikula, Paolo Bonzini,
	Tvrtko Ursulin, Rodrigo Vivi, Zhenyu Wang, Joonas Lahtinen,
	Tom Lendacky, Ingo Molnar, David Airlie, Thomas Gleixner,
	Dave Hansen, x86, intel-gfx, Sean Christopherson, Daniel Vetter,
	Borislav Petkov, Joerg Roedel, linux-kernel, Jim Mattson,
	Zhi Wang, Brijesh Singh, H. Peter Anvin, intel-gvt-dev,
	dri-devel, Maxim Levitsky

This is V3 of my nested AVIC patches.

I fixed a few more bugs, and I also split the code into smaller patches.

Review is welcome!

Best regards,
	Maxim Levitsky

Maxim Levitsky (19):
  KVM: x86: document AVIC/APICv inhibit reasons
  KVM: x86: inhibit APICv/AVIC when the guest and/or host changes apic
    id/base from the defaults.
  KVM: x86: SVM: remove avic's broken code that updated APIC ID
  KVM: x86: mmu: allow to enable write tracking externally
  x86: KVMGT: use kvm_page_track_write_tracking_enable
  KVM: x86: mmu: add gfn_in_memslot helper
  KVM: x86: mmu: tweak fast path for emulation of access to nested NPT
    pages
  KVM: x86: SVM: move avic state to separate struct
  KVM: x86: nSVM: add nested AVIC tracepoints
  KVM: x86: nSVM: implement AVIC's physid/logid table access helpers
  KVM: x86: nSVM: implement shadowing of AVIC's physical id table
  KVM: x86: nSVM: make nested AVIC physid write tracking be aware of the
    host scheduling
  KVM: x86: nSVM: wire nested AVIC to nested guest entry/exit
  KVM: x86: rename .set_apic_access_page_addr to reload_apic_access_page
  KVM: x86: nSVM: add code to reload AVIC physid table when it is
    invalidated
  KVM: x86: nSVM: implement support for nested AVIC vmexits
  KVM: x86: nSVM: implement nested AVIC doorbell emulation
  KVM: x86: SVM/nSVM: add optional non strict AVIC doorbell mode
  KVM: x86: nSVM: expose the nested AVIC to the guest

 arch/x86/include/asm/kvm-x86-ops.h    |   2 +-
 arch/x86/include/asm/kvm_host.h       |  23 +-
 arch/x86/include/asm/kvm_page_track.h |   1 +
 arch/x86/kvm/Kconfig                  |   3 -
 arch/x86/kvm/lapic.c                  |  25 +-
 arch/x86/kvm/lapic.h                  |   8 +
 arch/x86/kvm/mmu.h                    |   8 +-
 arch/x86/kvm/mmu/mmu.c                |  21 +-
 arch/x86/kvm/mmu/page_track.c         |  10 +-
 arch/x86/kvm/svm/avic.c               | 985 +++++++++++++++++++++++---
 arch/x86/kvm/svm/nested.c             | 141 +++-
 arch/x86/kvm/svm/svm.c                |  39 +-
 arch/x86/kvm/svm/svm.h                | 166 ++++-
 arch/x86/kvm/trace.h                  | 157 +++-
 arch/x86/kvm/vmx/vmx.c                |   8 +-
 arch/x86/kvm/x86.c                    |  19 +-
 drivers/gpu/drm/i915/Kconfig          |   1 -
 drivers/gpu/drm/i915/gvt/kvmgt.c      |   5 +
 include/linux/kvm_host.h              |  10 +-
 19 files changed, 1507 insertions(+), 125 deletions(-)

-- 
2.26.3




* [RFC PATCH v3 01/19] KVM: x86: document AVIC/APICv inhibit reasons
  2022-04-27 20:02 [RFC PATCH v3 00/19] RFC: nested AVIC Maxim Levitsky
@ 2022-04-27 20:02 ` Maxim Levitsky
  2022-05-18 15:56   ` Sean Christopherson
  2022-04-27 20:02 ` [RFC PATCH v3 02/19] KVM: x86: inhibit APICv/AVIC when the guest and/or host changes apic id/base from the defaults Maxim Levitsky
                   ` (17 subsequent siblings)
  18 siblings, 1 reply; 57+ messages in thread
From: Maxim Levitsky @ 2022-04-27 20:02 UTC (permalink / raw)
  To: kvm
  Cc: Wanpeng Li, Vitaly Kuznetsov, Jani Nikula, Paolo Bonzini,
	Tvrtko Ursulin, Rodrigo Vivi, Zhenyu Wang, Joonas Lahtinen,
	Tom Lendacky, Ingo Molnar, David Airlie, Thomas Gleixner,
	Dave Hansen, x86, intel-gfx, Sean Christopherson, Daniel Vetter,
	Borislav Petkov, Joerg Roedel, linux-kernel, Jim Mattson,
	Zhi Wang, Brijesh Singh, H. Peter Anvin, intel-gvt-dev,
	dri-devel, Maxim Levitsky

These days there are too many AVIC/APICv inhibit
reasons, and it doesn't hurt to have some documentation
for them.

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
---
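For reference, a minimal sketch of how an inhibit reason is toggled at
runtime, using the kvm_set_apicv_inhibit()/kvm_clear_apicv_inhibit()
helpers KVM already provides (the PIT case below is only an
illustration of the pattern, not a hunk from this patch):

	/* e.g. inhibit AVIC while i8254 're-inject' mode is in use */
	kvm_set_apicv_inhibit(kvm, APICV_INHIBIT_REASON_PIT_REINJ);

	/* ... and lift the inhibit once the reason no longer applies */
	kvm_clear_apicv_inhibit(kvm, APICV_INHIBIT_REASON_PIT_REINJ);
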
 arch/x86/include/asm/kvm_host.h | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index f164c6c1514a4..63eae00625bda 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1046,14 +1046,29 @@ struct kvm_x86_msr_filter {
 };
 
 enum kvm_apicv_inhibit {
+	/* APICv/AVIC is disabled by module param and/or not supported in hardware */
 	APICV_INHIBIT_REASON_DISABLE,
+	/* APICv/AVIC is inhibited because the AutoEOI feature is used by a Hyper-V guest */
 	APICV_INHIBIT_REASON_HYPERV,
+	/* AVIC is inhibited on a CPU because it runs a nested guest */
 	APICV_INHIBIT_REASON_NESTED,
+	/* AVIC is inhibited while KVM waits for an IRQ window (AVIC doesn't support this) */
 	APICV_INHIBIT_REASON_IRQWIN,
+	/*
+	 * AVIC is inhibited because i8254 're-inject' mode is used,
+	 * which needs the EOI intercept that AVIC doesn't support
+	 */
 	APICV_INHIBIT_REASON_PIT_REINJ,
+	/* AVIC is inhibited because the guest has x2APIC in its CPUID */
 	APICV_INHIBIT_REASON_X2APIC,
+	/* AVIC/APICv is inhibited because KVM_GUESTDBG_BLOCKIRQ was enabled */
 	APICV_INHIBIT_REASON_BLOCKIRQ,
+	/*
+	 * AVIC/APICv is inhibited because the guest hasn't yet
+	 * enabled a kernel/split irqchip
+	 */
 	APICV_INHIBIT_REASON_ABSENT,
+	/* AVIC is disabled because SEV doesn't support it */
 	APICV_INHIBIT_REASON_SEV,
 };
 
-- 
2.26.3



* [RFC PATCH v3 02/19] KVM: x86: inhibit APICv/AVIC when the guest and/or host changes apic id/base from the defaults.
  2022-04-27 20:02 [RFC PATCH v3 00/19] RFC: nested AVIC Maxim Levitsky
  2022-04-27 20:02 ` [RFC PATCH v3 01/19] KVM: x86: document AVIC/APICv inhibit reasons Maxim Levitsky
@ 2022-04-27 20:02 ` Maxim Levitsky
  2022-05-18  8:28   ` Chao Gao
  2022-05-19 16:06   ` Sean Christopherson
  2022-04-27 20:02 ` [RFC PATCH v3 03/19] KVM: x86: SVM: remove avic's broken code that updated APIC ID Maxim Levitsky
                   ` (16 subsequent siblings)
  18 siblings, 2 replies; 57+ messages in thread
From: Maxim Levitsky @ 2022-04-27 20:02 UTC (permalink / raw)
  To: kvm
  Cc: Wanpeng Li, Vitaly Kuznetsov, Jani Nikula, Paolo Bonzini,
	Tvrtko Ursulin, Rodrigo Vivi, Zhenyu Wang, Joonas Lahtinen,
	Tom Lendacky, Ingo Molnar, David Airlie, Thomas Gleixner,
	Dave Hansen, x86, intel-gfx, Sean Christopherson, Daniel Vetter,
	Borislav Petkov, Joerg Roedel, linux-kernel, Jim Mattson,
	Zhi Wang, Brijesh Singh, H. Peter Anvin, intel-gvt-dev,
	dri-devel, Maxim Levitsky

Neither of these settings should be changed by the guest, and it is
a burden to support them in the acceleration code, so just inhibit
APICv/AVIC instead.

Also add a boolean 'apic_id_changed' to indicate whether the APIC ID
was ever changed.

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
---
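The new check boils down to the following (a simplified sketch of the
hunks below, not the exact code):

	/* in xAPIC mode, the APIC ID must stay equal to vcpu_id */
	if (!kvm_apic_has_initial_apic_id(apic)) {
		pr_warn_once("APIC ID change is unsupported by KVM");
		kvm_set_apicv_inhibit(vcpu->kvm,
				      APICV_INHIBIT_REASON_RO_SETTINGS);
		/* remembered for use by later patches in the series */
		vcpu->kvm->arch.apic_id_changed = true;
	}
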
 arch/x86/include/asm/kvm_host.h |  3 +++
 arch/x86/kvm/lapic.c            | 25 ++++++++++++++++++++++---
 arch/x86/kvm/lapic.h            |  8 ++++++++
 3 files changed, 33 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 63eae00625bda..636df87542555 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1070,6 +1070,8 @@ enum kvm_apicv_inhibit {
 	APICV_INHIBIT_REASON_ABSENT,
 	/* AVIC is disabled because SEV doesn't support it */
 	APICV_INHIBIT_REASON_SEV,
+	/* APIC ID and/or APIC base was changed by the guest */
+	APICV_INHIBIT_REASON_RO_SETTINGS,
 };
 
 struct kvm_arch {
@@ -1258,6 +1260,7 @@ struct kvm_arch {
 	hpa_t	hv_root_tdp;
 	spinlock_t hv_root_tdp_lock;
 #endif
+	bool apic_id_changed;
 };
 
 struct kvm_vm_stat {
diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 66b0eb0bda94e..8996675b3ef4c 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -2038,6 +2038,19 @@ static void apic_manage_nmi_watchdog(struct kvm_lapic *apic, u32 lvt0_val)
 	}
 }
 
+static void kvm_lapic_check_initial_apic_id(struct kvm_lapic *apic)
+{
+	if (kvm_apic_has_initial_apic_id(apic))
+		return;
+
+	pr_warn_once("APIC ID change is unsupported by KVM");
+
+	kvm_set_apicv_inhibit(apic->vcpu->kvm,
+			APICV_INHIBIT_REASON_RO_SETTINGS);
+
+	apic->vcpu->kvm->arch.apic_id_changed = true;
+}
+
 static int kvm_lapic_reg_write(struct kvm_lapic *apic, u32 reg, u32 val)
 {
 	int ret = 0;
@@ -2046,9 +2059,11 @@ static int kvm_lapic_reg_write(struct kvm_lapic *apic, u32 reg, u32 val)
 
 	switch (reg) {
 	case APIC_ID:		/* Local APIC ID */
-		if (!apic_x2apic_mode(apic))
+		if (!apic_x2apic_mode(apic)) {
+
 			kvm_apic_set_xapic_id(apic, val >> 24);
-		else
+			kvm_lapic_check_initial_apic_id(apic);
+		} else
 			ret = 1;
 		break;
 
@@ -2335,8 +2350,11 @@ void kvm_lapic_set_base(struct kvm_vcpu *vcpu, u64 value)
 			     MSR_IA32_APICBASE_BASE;
 
 	if ((value & MSR_IA32_APICBASE_ENABLE) &&
-	     apic->base_address != APIC_DEFAULT_PHYS_BASE)
+	     apic->base_address != APIC_DEFAULT_PHYS_BASE) {
+		kvm_set_apicv_inhibit(apic->vcpu->kvm,
+				APICV_INHIBIT_REASON_RO_SETTINGS);
 		pr_warn_once("APIC base relocation is unsupported by KVM");
+	}
 }
 
 void kvm_apic_update_apicv(struct kvm_vcpu *vcpu)
@@ -2649,6 +2667,7 @@ static int kvm_apic_state_fixup(struct kvm_vcpu *vcpu,
 		}
 	}
 
+	kvm_lapic_check_initial_apic_id(vcpu->arch.apic);
 	return 0;
 }
 
diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
index 4e4f8a22754f9..b9c406d383080 100644
--- a/arch/x86/kvm/lapic.h
+++ b/arch/x86/kvm/lapic.h
@@ -252,4 +252,12 @@ static inline u8 kvm_xapic_id(struct kvm_lapic *apic)
 	return kvm_lapic_get_reg(apic, APIC_ID) >> 24;
 }
 
+static inline bool kvm_apic_has_initial_apic_id(struct kvm_lapic *apic)
+{
+	if (apic_x2apic_mode(apic))
+		return true;
+
+	return kvm_xapic_id(apic) == apic->vcpu->vcpu_id;
+}
+
 #endif
-- 
2.26.3



* [RFC PATCH v3 03/19] KVM: x86: SVM: remove avic's broken code that updated APIC ID
  2022-04-27 20:02 [RFC PATCH v3 00/19] RFC: nested AVIC Maxim Levitsky
  2022-04-27 20:02 ` [RFC PATCH v3 01/19] KVM: x86: document AVIC/APICv inhibit reasons Maxim Levitsky
  2022-04-27 20:02 ` [RFC PATCH v3 02/19] KVM: x86: inhibit APICv/AVIC when the guest and/or host changes apic id/base from the defaults Maxim Levitsky
@ 2022-04-27 20:02 ` Maxim Levitsky
  2022-05-19 16:10   ` Sean Christopherson
  2022-04-27 20:02 ` [RFC PATCH v3 04/19] KVM: x86: mmu: allow to enable write tracking externally Maxim Levitsky
                   ` (15 subsequent siblings)
  18 siblings, 1 reply; 57+ messages in thread
From: Maxim Levitsky @ 2022-04-27 20:02 UTC (permalink / raw)
  To: kvm
  Cc: Wanpeng Li, Vitaly Kuznetsov, Jani Nikula, Paolo Bonzini,
	Tvrtko Ursulin, Rodrigo Vivi, Zhenyu Wang, Joonas Lahtinen,
	Tom Lendacky, Ingo Molnar, David Airlie, Thomas Gleixner,
	Dave Hansen, x86, intel-gfx, Sean Christopherson, Daniel Vetter,
	Borislav Petkov, Joerg Roedel, linux-kernel, Jim Mattson,
	Zhi Wang, Brijesh Singh, H. Peter Anvin, intel-gvt-dev,
	dri-devel, Maxim Levitsky

AVIC is now inhibited if the guest changes the APIC ID, so remove
the broken code that tried to handle APIC ID updates.

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
---
 arch/x86/kvm/svm/avic.c | 35 -----------------------------------
 1 file changed, 35 deletions(-)

diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
index 54fe03714f8a6..1102421668a11 100644
--- a/arch/x86/kvm/svm/avic.c
+++ b/arch/x86/kvm/svm/avic.c
@@ -508,35 +508,6 @@ static int avic_handle_ldr_update(struct kvm_vcpu *vcpu)
 	return ret;
 }
 
-static int avic_handle_apic_id_update(struct kvm_vcpu *vcpu)
-{
-	u64 *old, *new;
-	struct vcpu_svm *svm = to_svm(vcpu);
-	u32 id = kvm_xapic_id(vcpu->arch.apic);
-
-	if (vcpu->vcpu_id == id)
-		return 0;
-
-	old = avic_get_physical_id_entry(vcpu, vcpu->vcpu_id);
-	new = avic_get_physical_id_entry(vcpu, id);
-	if (!new || !old)
-		return 1;
-
-	/* We need to move physical_id_entry to new offset */
-	*new = *old;
-	*old = 0ULL;
-	to_svm(vcpu)->avic_physical_id_cache = new;
-
-	/*
-	 * Also update the guest physical APIC ID in the logical
-	 * APIC ID table entry if already setup the LDR.
-	 */
-	if (svm->ldr_reg)
-		avic_handle_ldr_update(vcpu);
-
-	return 0;
-}
-
 static void avic_handle_dfr_update(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_svm *svm = to_svm(vcpu);
@@ -555,10 +526,6 @@ static int avic_unaccel_trap_write(struct kvm_vcpu *vcpu)
 				AVIC_UNACCEL_ACCESS_OFFSET_MASK;
 
 	switch (offset) {
-	case APIC_ID:
-		if (avic_handle_apic_id_update(vcpu))
-			return 0;
-		break;
 	case APIC_LDR:
 		if (avic_handle_ldr_update(vcpu))
 			return 0;
@@ -650,8 +617,6 @@ int avic_init_vcpu(struct vcpu_svm *svm)
 
 void avic_apicv_post_state_restore(struct kvm_vcpu *vcpu)
 {
-	if (avic_handle_apic_id_update(vcpu) != 0)
-		return;
 	avic_handle_dfr_update(vcpu);
 	avic_handle_ldr_update(vcpu);
 }
-- 
2.26.3



* [RFC PATCH v3 04/19] KVM: x86: mmu: allow to enable write tracking externally
  2022-04-27 20:02 [RFC PATCH v3 00/19] RFC: nested AVIC Maxim Levitsky
                   ` (2 preceding siblings ...)
  2022-04-27 20:02 ` [RFC PATCH v3 03/19] KVM: x86: SVM: remove avic's broken code that updated APIC ID Maxim Levitsky
@ 2022-04-27 20:02 ` Maxim Levitsky
  2022-05-19 16:27   ` Sean Christopherson
  2022-05-19 16:37   ` Sean Christopherson
  2022-04-27 20:03 ` [RFC PATCH v3 05/19] x86: KVMGT: use kvm_page_track_write_tracking_enable Maxim Levitsky
                   ` (14 subsequent siblings)
  18 siblings, 2 replies; 57+ messages in thread
From: Maxim Levitsky @ 2022-04-27 20:02 UTC (permalink / raw)
  To: kvm
  Cc: Wanpeng Li, Vitaly Kuznetsov, Jani Nikula, Paolo Bonzini,
	Tvrtko Ursulin, Rodrigo Vivi, Zhenyu Wang, Joonas Lahtinen,
	Tom Lendacky, Ingo Molnar, David Airlie, Thomas Gleixner,
	Dave Hansen, x86, intel-gfx, Sean Christopherson, Daniel Vetter,
	Borislav Petkov, Joerg Roedel, linux-kernel, Jim Mattson,
	Zhi Wang, Brijesh Singh, H. Peter Anvin, intel-gvt-dev,
	dri-devel, Maxim Levitsky

This will be used to enable write tracking from the nested AVIC code,
and can also be used by the GVT-g module to enable write tracking
only when it is actually needed, as opposed to enabling it
unconditionally whenever the module is compiled in.

No functional change intended.

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
---
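The lockless readers stay safe thanks to the acquire/release pairing
inherited from the shadow-root code; a sketch of the ordering
contract, using the names introduced below:

	/* writer (mmu_enable_write_tracking), under slots_arch_lock: */
	/* ... allocate rmaps and write-tracking metadata ... */
	smp_store_release(&kvm->arch.mmu_page_tracking_enabled, true);

	/* lockless reader (mmu_page_tracking_enabled): */
	if (smp_load_acquire(&kvm->arch.mmu_page_tracking_enabled)) {
		/* guaranteed to observe the metadata written above */
	}
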
 arch/x86/include/asm/kvm_host.h       |  2 +-
 arch/x86/include/asm/kvm_page_track.h |  1 +
 arch/x86/kvm/mmu.h                    |  8 +++++---
 arch/x86/kvm/mmu/mmu.c                | 17 ++++++++++-------
 arch/x86/kvm/mmu/page_track.c         | 10 ++++++++--
 5 files changed, 25 insertions(+), 13 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 636df87542555..fc7df778a3d71 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1254,7 +1254,7 @@ struct kvm_arch {
 	 * is used as one input when determining whether certain memslot
 	 * related allocations are necessary.
 	 */
-	bool shadow_root_allocated;
+	bool mmu_page_tracking_enabled;
 
 #if IS_ENABLED(CONFIG_HYPERV)
 	hpa_t	hv_root_tdp;
diff --git a/arch/x86/include/asm/kvm_page_track.h b/arch/x86/include/asm/kvm_page_track.h
index eb186bc57f6a9..955a5ae07b10e 100644
--- a/arch/x86/include/asm/kvm_page_track.h
+++ b/arch/x86/include/asm/kvm_page_track.h
@@ -50,6 +50,7 @@ int kvm_page_track_init(struct kvm *kvm);
 void kvm_page_track_cleanup(struct kvm *kvm);
 
 bool kvm_page_track_write_tracking_enabled(struct kvm *kvm);
+int kvm_page_track_write_tracking_enable(struct kvm *kvm);
 int kvm_page_track_write_tracking_alloc(struct kvm_memory_slot *slot);
 
 void kvm_page_track_free_memslot(struct kvm_memory_slot *slot);
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 671cfeccf04e9..44d15551f7156 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -269,7 +269,7 @@ int kvm_arch_write_log_dirty(struct kvm_vcpu *vcpu);
 int kvm_mmu_post_init_vm(struct kvm *kvm);
 void kvm_mmu_pre_destroy_vm(struct kvm *kvm);
 
-static inline bool kvm_shadow_root_allocated(struct kvm *kvm)
+static inline bool mmu_page_tracking_enabled(struct kvm *kvm)
 {
 	/*
 	 * Read shadow_root_allocated before related pointers. Hence, threads
@@ -277,9 +277,11 @@ static inline bool kvm_shadow_root_allocated(struct kvm *kvm)
 	 * see the pointers. Pairs with smp_store_release in
 	 * mmu_first_shadow_root_alloc.
 	 */
-	return smp_load_acquire(&kvm->arch.shadow_root_allocated);
+	return smp_load_acquire(&kvm->arch.mmu_page_tracking_enabled);
 }
 
+int mmu_enable_write_tracking(struct kvm *kvm);
+
 #ifdef CONFIG_X86_64
 static inline bool is_tdp_mmu_enabled(struct kvm *kvm) { return kvm->arch.tdp_mmu_enabled; }
 #else
@@ -288,7 +290,7 @@ static inline bool is_tdp_mmu_enabled(struct kvm *kvm) { return false; }
 
 static inline bool kvm_memslots_have_rmaps(struct kvm *kvm)
 {
-	return !is_tdp_mmu_enabled(kvm) || kvm_shadow_root_allocated(kvm);
+	return !is_tdp_mmu_enabled(kvm) || mmu_page_tracking_enabled(kvm);
 }
 
 static inline gfn_t gfn_to_index(gfn_t gfn, gfn_t base_gfn, int level)
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 904f0faff2186..fb744616bf7df 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3389,7 +3389,7 @@ static int mmu_alloc_direct_roots(struct kvm_vcpu *vcpu)
 	return r;
 }
 
-static int mmu_first_shadow_root_alloc(struct kvm *kvm)
+int mmu_enable_write_tracking(struct kvm *kvm)
 {
 	struct kvm_memslots *slots;
 	struct kvm_memory_slot *slot;
@@ -3399,21 +3399,20 @@ static int mmu_first_shadow_root_alloc(struct kvm *kvm)
 	 * Check if this is the first shadow root being allocated before
 	 * taking the lock.
 	 */
-	if (kvm_shadow_root_allocated(kvm))
+	if (mmu_page_tracking_enabled(kvm))
 		return 0;
 
 	mutex_lock(&kvm->slots_arch_lock);
 
 	/* Recheck, under the lock, whether this is the first shadow root. */
-	if (kvm_shadow_root_allocated(kvm))
+	if (mmu_page_tracking_enabled(kvm))
 		goto out_unlock;
 
 	/*
 	 * Check if anything actually needs to be allocated, e.g. all metadata
 	 * will be allocated upfront if TDP is disabled.
 	 */
-	if (kvm_memslots_have_rmaps(kvm) &&
-	    kvm_page_track_write_tracking_enabled(kvm))
+	if (kvm_memslots_have_rmaps(kvm) && mmu_page_tracking_enabled(kvm))
 		goto out_success;
 
 	for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
@@ -3443,7 +3442,7 @@ static int mmu_first_shadow_root_alloc(struct kvm *kvm)
 	 * all the related pointers are set.
 	 */
 out_success:
-	smp_store_release(&kvm->arch.shadow_root_allocated, true);
+	smp_store_release(&kvm->arch.mmu_page_tracking_enabled, true);
 
 out_unlock:
 	mutex_unlock(&kvm->slots_arch_lock);
@@ -3480,7 +3479,7 @@ static int mmu_alloc_shadow_roots(struct kvm_vcpu *vcpu)
 		}
 	}
 
-	r = mmu_first_shadow_root_alloc(vcpu->kvm);
+	r = mmu_enable_write_tracking(vcpu->kvm);
 	if (r)
 		return r;
 
@@ -5753,6 +5752,10 @@ int kvm_mmu_init_vm(struct kvm *kvm)
 	node->track_write = kvm_mmu_pte_write;
 	node->track_flush_slot = kvm_mmu_invalidate_zap_pages_in_memslot;
 	kvm_page_track_register_notifier(kvm, node);
+
+	if (IS_ENABLED(CONFIG_KVM_EXTERNAL_WRITE_TRACKING) || !tdp_enabled)
+		mmu_enable_write_tracking(kvm);
+
 	return 0;
 }
 
diff --git a/arch/x86/kvm/mmu/page_track.c b/arch/x86/kvm/mmu/page_track.c
index 2e09d1b6249f3..8857d629036d7 100644
--- a/arch/x86/kvm/mmu/page_track.c
+++ b/arch/x86/kvm/mmu/page_track.c
@@ -21,10 +21,16 @@
 
 bool kvm_page_track_write_tracking_enabled(struct kvm *kvm)
 {
-	return IS_ENABLED(CONFIG_KVM_EXTERNAL_WRITE_TRACKING) ||
-	       !tdp_enabled || kvm_shadow_root_allocated(kvm);
+	return mmu_page_tracking_enabled(kvm);
 }
 
+int kvm_page_track_write_tracking_enable(struct kvm *kvm)
+{
+	return mmu_enable_write_tracking(kvm);
+}
+EXPORT_SYMBOL_GPL(kvm_page_track_write_tracking_enable);
+
+
 void kvm_page_track_free_memslot(struct kvm_memory_slot *slot)
 {
 	int i;
-- 
2.26.3



* [RFC PATCH v3 05/19] x86: KVMGT: use kvm_page_track_write_tracking_enable
  2022-04-27 20:02 [RFC PATCH v3 00/19] RFC: nested AVIC Maxim Levitsky
                   ` (3 preceding siblings ...)
  2022-04-27 20:02 ` [RFC PATCH v3 04/19] KVM: x86: mmu: allow to enable write tracking externally Maxim Levitsky
@ 2022-04-27 20:03 ` Maxim Levitsky
  2022-05-19 16:38   ` Sean Christopherson
  2022-04-27 20:03 ` [RFC PATCH v3 06/19] KVM: x86: mmu: add gfn_in_memslot helper Maxim Levitsky
                   ` (13 subsequent siblings)
  18 siblings, 1 reply; 57+ messages in thread
From: Maxim Levitsky @ 2022-04-27 20:03 UTC (permalink / raw)
  To: kvm
  Cc: Wanpeng Li, Vitaly Kuznetsov, Jani Nikula, Paolo Bonzini,
	Tvrtko Ursulin, Rodrigo Vivi, Zhenyu Wang, Joonas Lahtinen,
	Tom Lendacky, Ingo Molnar, David Airlie, Thomas Gleixner,
	Dave Hansen, x86, intel-gfx, Sean Christopherson, Daniel Vetter,
	Borislav Petkov, Joerg Roedel, linux-kernel, Jim Mattson,
	Zhi Wang, Brijesh Singh, H. Peter Anvin, intel-gvt-dev,
	dri-devel, Maxim Levitsky

This allows enabling write tracking only when KVMGT is actually
used, so it carries no penalty otherwise.

Tested by booting a VM with a kvmgt mdev device.

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
---
 arch/x86/kvm/Kconfig             | 3 ---
 arch/x86/kvm/mmu/mmu.c           | 2 +-
 drivers/gpu/drm/i915/Kconfig     | 1 -
 drivers/gpu/drm/i915/gvt/kvmgt.c | 5 +++++
 4 files changed, 6 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index e3cbd77061364..41341905d3734 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -126,7 +126,4 @@ config KVM_XEN
 
 	  If in doubt, say "N".
 
-config KVM_EXTERNAL_WRITE_TRACKING
-	bool
-
 endif # VIRTUALIZATION
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index fb744616bf7df..633a3138d68e1 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -5753,7 +5753,7 @@ int kvm_mmu_init_vm(struct kvm *kvm)
 	node->track_flush_slot = kvm_mmu_invalidate_zap_pages_in_memslot;
 	kvm_page_track_register_notifier(kvm, node);
 
-	if (IS_ENABLED(CONFIG_KVM_EXTERNAL_WRITE_TRACKING) || !tdp_enabled)
+	if (!tdp_enabled)
 		mmu_enable_write_tracking(kvm);
 
 	return 0;
diff --git a/drivers/gpu/drm/i915/Kconfig b/drivers/gpu/drm/i915/Kconfig
index 98c5450b8eacc..7d8346f4bae11 100644
--- a/drivers/gpu/drm/i915/Kconfig
+++ b/drivers/gpu/drm/i915/Kconfig
@@ -130,7 +130,6 @@ config DRM_I915_GVT_KVMGT
 	depends on DRM_I915_GVT
 	depends on KVM
 	depends on VFIO_MDEV
-	select KVM_EXTERNAL_WRITE_TRACKING
 	default n
 	help
 	  Choose this option if you want to enable KVMGT support for
diff --git a/drivers/gpu/drm/i915/gvt/kvmgt.c b/drivers/gpu/drm/i915/gvt/kvmgt.c
index 057ec44901045..4c62ab3ef245d 100644
--- a/drivers/gpu/drm/i915/gvt/kvmgt.c
+++ b/drivers/gpu/drm/i915/gvt/kvmgt.c
@@ -1933,6 +1933,7 @@ static int kvmgt_guest_init(struct mdev_device *mdev)
 	struct intel_vgpu *vgpu;
 	struct kvmgt_vdev *vdev;
 	struct kvm *kvm;
+	int ret;
 
 	vgpu = mdev_get_drvdata(mdev);
 	if (handle_valid(vgpu->handle))
@@ -1948,6 +1949,10 @@ static int kvmgt_guest_init(struct mdev_device *mdev)
 	if (__kvmgt_vgpu_exist(vgpu, kvm))
 		return -EEXIST;
 
+	ret = kvm_page_track_write_tracking_enable(kvm);
+	if (ret)
+		return ret;
+
 	info = vzalloc(sizeof(struct kvmgt_guest_info));
 	if (!info)
 		return -ENOMEM;
-- 
2.26.3



* [RFC PATCH v3 06/19] KVM: x86: mmu: add gfn_in_memslot helper
  2022-04-27 20:02 [RFC PATCH v3 00/19] RFC: nested AVIC Maxim Levitsky
                   ` (4 preceding siblings ...)
  2022-04-27 20:03 ` [RFC PATCH v3 05/19] x86: KVMGT: use kvm_page_track_write_tracking_enable Maxim Levitsky
@ 2022-04-27 20:03 ` Maxim Levitsky
  2022-05-19 16:43   ` Sean Christopherson
  2022-04-27 20:03 ` [RFC PATCH v3 07/19] KVM: x86: mmu: tweak fast path for emulation of access to nested NPT pages Maxim Levitsky
                   ` (12 subsequent siblings)
  18 siblings, 1 reply; 57+ messages in thread
From: Maxim Levitsky @ 2022-04-27 20:03 UTC (permalink / raw)
  To: kvm
  Cc: Wanpeng Li, Vitaly Kuznetsov, Jani Nikula, Paolo Bonzini,
	Tvrtko Ursulin, Rodrigo Vivi, Zhenyu Wang, Joonas Lahtinen,
	Tom Lendacky, Ingo Molnar, David Airlie, Thomas Gleixner,
	Dave Hansen, x86, intel-gfx, Sean Christopherson, Daniel Vetter,
	Borislav Petkov, Joerg Roedel, linux-kernel, Jim Mattson,
	Zhi Wang, Brijesh Singh, H. Peter Anvin, intel-gvt-dev,
	dri-devel, Maxim Levitsky

This is a tiny refactoring that makes it a bit cleaner to check
whether a GPA/GFN is within a memslot.

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
---
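Example use, a sketch of the kind of check this enables (the nested
AVIC code later in the series does this when a memslot is flushed):

	/* does an arbitrary tracked gpa fall into this memslot? */
	if (gfn_in_memslot(slot, gpa_to_gfn(gpa))) {
		/* gpa lies within [base_gfn, base_gfn + npages) */
	}
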
 include/linux/kvm_host.h | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 252ee4a61b58b..12e261559070b 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1580,6 +1580,13 @@ int kvm_request_irq_source_id(struct kvm *kvm);
 void kvm_free_irq_source_id(struct kvm *kvm, int irq_source_id);
 bool kvm_arch_irqfd_allowed(struct kvm *kvm, struct kvm_irqfd *args);
 
+
+static inline bool gfn_in_memslot(struct kvm_memory_slot *slot, gfn_t gfn)
+{
+	return (gfn >= slot->base_gfn && gfn < slot->base_gfn + slot->npages);
+}
+
+
 /*
  * Returns a pointer to the memslot if it contains gfn.
  * Otherwise returns NULL.
@@ -1590,12 +1597,13 @@ try_get_memslot(struct kvm_memory_slot *slot, gfn_t gfn)
 	if (!slot)
 		return NULL;
 
-	if (gfn >= slot->base_gfn && gfn < slot->base_gfn + slot->npages)
+	if (gfn_in_memslot(slot, gfn))
 		return slot;
 	else
 		return NULL;
 }
 
+
 /*
  * Returns a pointer to the memslot that contains gfn. Otherwise returns NULL.
  *
-- 
2.26.3



* [RFC PATCH v3 07/19] KVM: x86: mmu: tweak fast path for emulation of access to nested NPT pages
  2022-04-27 20:02 [RFC PATCH v3 00/19] RFC: nested AVIC Maxim Levitsky
                   ` (5 preceding siblings ...)
  2022-04-27 20:03 ` [RFC PATCH v3 06/19] KVM: x86: mmu: add gfn_in_memslot helper Maxim Levitsky
@ 2022-04-27 20:03 ` Maxim Levitsky
  2022-04-27 20:03 ` [RFC PATCH v3 08/19] KVM: x86: SVM: move avic state to separate struct Maxim Levitsky
                   ` (11 subsequent siblings)
  18 siblings, 0 replies; 57+ messages in thread
From: Maxim Levitsky @ 2022-04-27 20:03 UTC (permalink / raw)
  To: kvm
  Cc: Wanpeng Li, Vitaly Kuznetsov, Jani Nikula, Paolo Bonzini,
	Tvrtko Ursulin, Rodrigo Vivi, Zhenyu Wang, Joonas Lahtinen,
	Tom Lendacky, Ingo Molnar, David Airlie, Thomas Gleixner,
	Dave Hansen, x86, intel-gfx, Sean Christopherson, Daniel Vetter,
	Borislav Petkov, Joerg Roedel, linux-kernel, Jim Mattson,
	Zhi Wang, Brijesh Singh, H. Peter Anvin, intel-gvt-dev,
	dri-devel, Maxim Levitsky

If a non-leaf MMU page is write-tracked externally for some reason,
which can in theory happen if it was previously used as a nested AVIC
physid page, then this code enters an endless loop of page faults:
unprotecting the MMU page does not remove the write tracking, nor is
the write-tracking callback invoked, because there is no MMU page at
this address.

Fix this by taking the fast path only if we actually succeeded in
zapping an MMU page.

Fixes: 147277540bbc5 ("kvm: svm: Add support for additional SVM NPF error codes")
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
---
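Roughly, the broken flow was (a pseudocode sketch of the reasoning
above, not actual code):

	write to externally write-tracked gfn  ->  NPT page fault
	kvm_mmu_page_fault():
		kvm_mmu_unprotect_page();  /* no shadow page there, zaps nothing */
		return 1;                  /* re-execute the faulting instruction */
	the write faults again  ->  endless loop

With the fix, when nothing was zapped we fall through to the emulation
path, which completes the write and invokes the write-tracking
notifiers.
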
 arch/x86/kvm/mmu/mmu.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 633a3138d68e1..8f77d41e7fd80 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -5341,8 +5341,8 @@ int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 error_code,
 	 */
 	if (vcpu->arch.mmu->root_role.direct &&
 	    (error_code & PFERR_NESTED_GUEST_PAGE) == PFERR_NESTED_GUEST_PAGE) {
-		kvm_mmu_unprotect_page(vcpu->kvm, gpa_to_gfn(cr2_or_gpa));
-		return 1;
+		if (kvm_mmu_unprotect_page(vcpu->kvm, gpa_to_gfn(cr2_or_gpa)))
+			return 1;
 	}
 
 	/*
-- 
2.26.3



* [RFC PATCH v3 08/19] KVM: x86: SVM: move avic state to separate struct
  2022-04-27 20:02 [RFC PATCH v3 00/19] RFC: nested AVIC Maxim Levitsky
                   ` (6 preceding siblings ...)
  2022-04-27 20:03 ` [RFC PATCH v3 07/19] KVM: x86: mmu: tweak fast path for emulation of access to nested NPT pages Maxim Levitsky
@ 2022-04-27 20:03 ` Maxim Levitsky
  2022-04-27 20:03 ` [RFC PATCH v3 09/19] KVM: x86: nSVM: add nested AVIC tracepoints Maxim Levitsky
                   ` (10 subsequent siblings)
  18 siblings, 0 replies; 57+ messages in thread
From: Maxim Levitsky @ 2022-04-27 20:03 UTC (permalink / raw)
  To: kvm
  Cc: Wanpeng Li, Vitaly Kuznetsov, Jani Nikula, Paolo Bonzini,
	Tvrtko Ursulin, Rodrigo Vivi, Zhenyu Wang, Joonas Lahtinen,
	Tom Lendacky, Ingo Molnar, David Airlie, Thomas Gleixner,
	Dave Hansen, x86, intel-gfx, Sean Christopherson, Daniel Vetter,
	Borislav Petkov, Joerg Roedel, linux-kernel, Jim Mattson,
	Zhi Wang, Brijesh Singh, H. Peter Anvin, intel-gvt-dev,
	dri-devel, Maxim Levitsky

This will make the code a bit easier to read when nested AVIC support
is added.

No functional change intended.

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
---
 arch/x86/kvm/svm/avic.c | 51 +++++++++++++++++++++++------------------
 arch/x86/kvm/svm/svm.h  | 14 ++++++-----
 2 files changed, 37 insertions(+), 28 deletions(-)

diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
index 1102421668a11..e5cbbb97fbab6 100644
--- a/arch/x86/kvm/svm/avic.c
+++ b/arch/x86/kvm/svm/avic.c
@@ -69,6 +69,8 @@ int avic_ga_log_notifier(u32 ga_tag)
 	unsigned long flags;
 	struct kvm_svm *kvm_svm;
 	struct kvm_vcpu *vcpu = NULL;
+	struct kvm_svm_avic *avic;
+
 	u32 vm_id = AVIC_GATAG_TO_VMID(ga_tag);
 	u32 vcpu_id = AVIC_GATAG_TO_VCPUID(ga_tag);
 
@@ -76,9 +78,13 @@ int avic_ga_log_notifier(u32 ga_tag)
 	trace_kvm_avic_ga_log(vm_id, vcpu_id);
 
 	spin_lock_irqsave(&svm_vm_data_hash_lock, flags);
-	hash_for_each_possible(svm_vm_data_hash, kvm_svm, hnode, vm_id) {
-		if (kvm_svm->avic_vm_id != vm_id)
+	hash_for_each_possible(svm_vm_data_hash, avic, hnode, vm_id) {
+
+
+		if (avic->vm_id != vm_id)
 			continue;
+
+		kvm_svm = container_of(avic, struct kvm_svm, avic);
 		vcpu = kvm_get_vcpu_by_id(&kvm_svm->kvm, vcpu_id);
 		break;
 	}
@@ -98,18 +104,18 @@ int avic_ga_log_notifier(u32 ga_tag)
 void avic_vm_destroy(struct kvm *kvm)
 {
 	unsigned long flags;
-	struct kvm_svm *kvm_svm = to_kvm_svm(kvm);
+	struct kvm_svm_avic *avic = &to_kvm_svm(kvm)->avic;
 
 	if (!enable_apicv)
 		return;
 
-	if (kvm_svm->avic_logical_id_table_page)
-		__free_page(kvm_svm->avic_logical_id_table_page);
-	if (kvm_svm->avic_physical_id_table_page)
-		__free_page(kvm_svm->avic_physical_id_table_page);
+	if (avic->logical_id_table_page)
+		__free_page(avic->logical_id_table_page);
+	if (avic->physical_id_table_page)
+		__free_page(avic->physical_id_table_page);
 
 	spin_lock_irqsave(&svm_vm_data_hash_lock, flags);
-	hash_del(&kvm_svm->hnode);
+	hash_del(&avic->hnode);
 	spin_unlock_irqrestore(&svm_vm_data_hash_lock, flags);
 }
 
@@ -117,10 +123,9 @@ int avic_vm_init(struct kvm *kvm)
 {
 	unsigned long flags;
 	int err = -ENOMEM;
-	struct kvm_svm *kvm_svm = to_kvm_svm(kvm);
-	struct kvm_svm *k2;
 	struct page *p_page;
 	struct page *l_page;
+	struct kvm_svm_avic *avic = &to_kvm_svm(kvm)->avic;
 	u32 vm_id;
 
 	if (!enable_apicv)
@@ -131,14 +136,14 @@ int avic_vm_init(struct kvm *kvm)
 	if (!p_page)
 		goto free_avic;
 
-	kvm_svm->avic_physical_id_table_page = p_page;
+	avic->physical_id_table_page = p_page;
 
 	/* Allocating logical APIC ID table (4KB) */
 	l_page = alloc_page(GFP_KERNEL_ACCOUNT | __GFP_ZERO);
 	if (!l_page)
 		goto free_avic;
 
-	kvm_svm->avic_logical_id_table_page = l_page;
+	avic->logical_id_table_page = l_page;
 
 	spin_lock_irqsave(&svm_vm_data_hash_lock, flags);
  again:
@@ -149,13 +154,15 @@ int avic_vm_init(struct kvm *kvm)
 	}
 	/* Is it still in use? Only possible if wrapped at least once */
 	if (next_vm_id_wrapped) {
-		hash_for_each_possible(svm_vm_data_hash, k2, hnode, vm_id) {
-			if (k2->avic_vm_id == vm_id)
+		struct kvm_svm_avic *avic2;
+
+		hash_for_each_possible(svm_vm_data_hash, avic2, hnode, vm_id) {
+			if (avic2->vm_id == vm_id)
 				goto again;
 		}
 	}
-	kvm_svm->avic_vm_id = vm_id;
-	hash_add(svm_vm_data_hash, &kvm_svm->hnode, kvm_svm->avic_vm_id);
+	avic->vm_id = vm_id;
+	hash_add(svm_vm_data_hash, &avic->hnode, avic->vm_id);
 	spin_unlock_irqrestore(&svm_vm_data_hash_lock, flags);
 
 	return 0;
@@ -169,8 +176,8 @@ void avic_init_vmcb(struct vcpu_svm *svm, struct vmcb *vmcb)
 {
 	struct kvm_svm *kvm_svm = to_kvm_svm(svm->vcpu.kvm);
 	phys_addr_t bpa = __sme_set(page_to_phys(svm->avic_backing_page));
-	phys_addr_t lpa = __sme_set(page_to_phys(kvm_svm->avic_logical_id_table_page));
-	phys_addr_t ppa = __sme_set(page_to_phys(kvm_svm->avic_physical_id_table_page));
+	phys_addr_t lpa = __sme_set(page_to_phys(kvm_svm->avic.logical_id_table_page));
+	phys_addr_t ppa = __sme_set(page_to_phys(kvm_svm->avic.physical_id_table_page));
 
 	vmcb->control.avic_backing_page = bpa & AVIC_HPA_MASK;
 	vmcb->control.avic_logical_id = lpa & AVIC_HPA_MASK;
@@ -193,7 +200,7 @@ static u64 *avic_get_physical_id_entry(struct kvm_vcpu *vcpu,
 	if (index >= AVIC_MAX_PHYSICAL_ID_COUNT)
 		return NULL;
 
-	avic_physical_id_table = page_address(kvm_svm->avic_physical_id_table_page);
+	avic_physical_id_table = page_address(kvm_svm->avic.physical_id_table_page);
 
 	return &avic_physical_id_table[index];
 }
@@ -296,7 +303,7 @@ static int avic_kick_target_vcpus_fast(struct kvm *kvm, struct kvm_lapic *source
 	int dest_mode = icrl & APIC_DEST_MASK;
 	int shorthand = icrl & APIC_SHORT_MASK;
 	struct kvm_svm *kvm_svm = to_kvm_svm(kvm);
-	u32 *avic_logical_id_table = page_address(kvm_svm->avic_logical_id_table_page);
+	u32 *avic_logical_id_table = page_address(kvm_svm->avic.logical_id_table_page);
 
 	if (shorthand != APIC_DEST_NOSHORT)
 		return -EINVAL;
@@ -453,7 +460,7 @@ static u32 *avic_get_logical_id_entry(struct kvm_vcpu *vcpu, u32 ldr, bool flat)
 		index = (cluster << 2) + apic;
 	}
 
-	logical_apic_id_table = (u32 *) page_address(kvm_svm->avic_logical_id_table_page);
+	logical_apic_id_table = (u32 *) page_address(kvm_svm->avic.logical_id_table_page);
 
 	return &logical_apic_id_table[index];
 }
@@ -803,7 +810,7 @@ int avic_pi_update_irte(struct kvm *kvm, unsigned int host_irq,
 			/* Try to enable guest_mode in IRTE */
 			pi.base = __sme_set(page_to_phys(svm->avic_backing_page) &
 					    AVIC_HPA_MASK);
-			pi.ga_tag = AVIC_GATAG(to_kvm_svm(kvm)->avic_vm_id,
+			pi.ga_tag = AVIC_GATAG(to_kvm_svm(kvm)->avic.vm_id,
 						     svm->vcpu.vcpu_id);
 			pi.is_guest_mode = true;
 			pi.vcpu_data = &vcpu_info;
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index 32220a1b0ea20..6fcb164a6ee4a 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -88,15 +88,17 @@ struct kvm_sev_info {
 	atomic_t migration_in_progress;
 };
 
-struct kvm_svm {
-	struct kvm kvm;
 
-	/* Struct members for AVIC */
-	u32 avic_vm_id;
-	struct page *avic_logical_id_table_page;
-	struct page *avic_physical_id_table_page;
+struct kvm_svm_avic {
+	u32 vm_id;
+	struct page *logical_id_table_page;
+	struct page *physical_id_table_page;
 	struct hlist_node hnode;
+};
 
+struct kvm_svm {
+	struct kvm kvm;
+	struct kvm_svm_avic avic;
 	struct kvm_sev_info sev_info;
 };
 
-- 
2.26.3



* [RFC PATCH v3 09/19] KVM: x86: nSVM: add nested AVIC tracepoints
  2022-04-27 20:02 [RFC PATCH v3 00/19] RFC: nested AVIC Maxim Levitsky
                   ` (7 preceding siblings ...)
  2022-04-27 20:03 ` [RFC PATCH v3 08/19] KVM: x86: SVM: move avic state to separate struct Maxim Levitsky
@ 2022-04-27 20:03 ` Maxim Levitsky
  2022-04-27 20:03 ` [RFC PATCH v3 10/19] KVM: x86: nSVM: implement AVIC's physid/logid table access helpers Maxim Levitsky
                   ` (9 subsequent siblings)
  18 siblings, 0 replies; 57+ messages in thread
From: Maxim Levitsky @ 2022-04-27 20:03 UTC (permalink / raw)
  To: kvm
  Cc: Wanpeng Li, Vitaly Kuznetsov, Jani Nikula, Paolo Bonzini,
	Tvrtko Ursulin, Rodrigo Vivi, Zhenyu Wang, Joonas Lahtinen,
	Tom Lendacky, Ingo Molnar, David Airlie, Thomas Gleixner,
	Dave Hansen, x86, intel-gfx, Sean Christopherson, Daniel Vetter,
	Borislav Petkov, Joerg Roedel, linux-kernel, Jim Mattson,
	Zhi Wang, Brijesh Singh, H. Peter Anvin, intel-gvt-dev,
	dri-devel, Maxim Levitsky

Add a few tracepoints that will be used to debug and profile
the nested AVIC.

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
---
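The tracepoints are emitted from the AVIC code through the usual
generated helpers; a sketch of the call sites that later patches in
the series add:

	trace_kvm_avic_physid_table_alloc(gfn_to_gpa(gfn));
	trace_kvm_avic_physid_table_write(gpa, bytes);
	trace_kvm_avic_physid_update_vcpu_guest(new_l1_apicid, l0_apicid);

As with the other KVM tracepoints, they can then be enabled at
runtime through the 'kvm' event group in tracefs.
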
 arch/x86/kvm/trace.h | 157 ++++++++++++++++++++++++++++++++++++++++++-
 arch/x86/kvm/x86.c   |  13 ++++
 2 files changed, 169 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/trace.h b/arch/x86/kvm/trace.h
index de47625175692..f7ddba5ae06a5 100644
--- a/arch/x86/kvm/trace.h
+++ b/arch/x86/kvm/trace.h
@@ -1385,7 +1385,7 @@ TRACE_EVENT(kvm_apicv_accept_irq,
 );
 
 /*
- * Tracepoint for AMD AVIC
+ * Tracepoints for AMD AVIC
  */
 TRACE_EVENT(kvm_avic_incomplete_ipi,
 	    TP_PROTO(u32 vcpu, u32 icrh, u32 icrl, u32 id, u32 index),
@@ -1479,6 +1479,161 @@ TRACE_EVENT(kvm_avic_kick_vcpu_slowpath,
 		  __entry->icrh, __entry->icrl, __entry->index)
 );
 
+TRACE_EVENT(kvm_avic_physid_table_alloc,
+	    TP_PROTO(u64 gpa),
+	    TP_ARGS(gpa),
+
+	TP_STRUCT__entry(
+		__field(u64, gpa)
+	),
+
+	TP_fast_assign(
+		__entry->gpa = gpa;
+	),
+
+	TP_printk("table at gpa 0x%llx",
+		  __entry->gpa)
+);
+
+
+TRACE_EVENT(kvm_avic_physid_table_free,
+	    TP_PROTO(u64 gpa),
+	    TP_ARGS(gpa),
+
+	TP_STRUCT__entry(
+		__field(u64, gpa)
+	),
+
+	TP_fast_assign(
+		__entry->gpa = gpa;
+	),
+
+	TP_printk("table at gpa 0x%llx",
+		  __entry->gpa)
+);
+
+TRACE_EVENT(kvm_avic_physid_table_reload,
+	    TP_PROTO(u64 gpa, int nentries, int new_nentries),
+	    TP_ARGS(gpa, nentries, new_nentries),
+
+	TP_STRUCT__entry(
+		__field(u64, gpa)
+		__field(int, nentries)
+		__field(int, new_nentries)
+	),
+
+	TP_fast_assign(
+		__entry->gpa = gpa;
+		__entry->nentries = nentries;
+		__entry->new_nentries = new_nentries;
+	),
+
+	TP_printk("table at gpa 0x%llx, nentries %d -> %d",
+		  __entry->gpa, __entry->nentries, __entry->new_nentries)
+);
+
+TRACE_EVENT(kvm_avic_physid_table_write,
+	    TP_PROTO(u64 gpa, int bytes),
+	    TP_ARGS(gpa, bytes),
+
+	TP_STRUCT__entry(
+		__field(u64, gpa)
+		__field(int, bytes)
+	),
+
+	TP_fast_assign(
+		__entry->gpa = gpa;
+		__entry->bytes = bytes;
+	),
+
+	TP_printk("gpa 0x%llx, write of %d bytes",
+		  __entry->gpa, __entry->bytes)
+);
+
+TRACE_EVENT(kvm_avic_physid_update_vcpu_host,
+	    TP_PROTO(int vcpu_id, int cpu_id, int n),
+	    TP_ARGS(vcpu_id, cpu_id, n),
+
+	TP_STRUCT__entry(
+		__field(int, vcpu_id)
+		__field(int, cpu_id)
+		__field(int, n)
+	),
+
+	TP_fast_assign(
+		__entry->vcpu_id = vcpu_id;
+		__entry->cpu_id = cpu_id;
+		__entry->n = n;
+	),
+
+	TP_printk("l1 vcpu %d -> l0 cpu %d (%d entries)",
+		  __entry->vcpu_id, __entry->cpu_id, __entry->n)
+);
+
+TRACE_EVENT(kvm_avic_physid_update_vcpu_guest,
+	    TP_PROTO(int vcpu_id, int cpu_id),
+	    TP_ARGS(vcpu_id, cpu_id),
+
+	TP_STRUCT__entry(
+		__field(int, vcpu_id)
+		__field(int, cpu_id)
+	),
+
+	TP_fast_assign(
+		__entry->vcpu_id = vcpu_id;
+		__entry->cpu_id = cpu_id;
+	),
+
+	TP_printk("l1 vcpu %d -> l0 cpu %d",
+		  __entry->vcpu_id, __entry->cpu_id)
+);
+
+TRACE_EVENT(kvm_avic_nested_doorbell,
+	    TP_PROTO(int source_l1_apicid, int target_l1_apicid, bool target_nested,
+			    bool target_running),
+	    TP_ARGS(source_l1_apicid, target_l1_apicid, target_nested,
+			    target_running),
+
+	TP_STRUCT__entry(
+		__field(int, source_l1_apicid)
+		__field(int, target_l1_apicid)
+		__field(bool, target_nested)
+		__field(bool, target_running)
+	),
+
+	TP_fast_assign(
+		__entry->source_l1_apicid = source_l1_apicid;
+		__entry->target_l1_apicid = target_l1_apicid;
+		__entry->target_nested = target_nested;
+		__entry->target_running = target_running;
+	),
+
+	TP_printk("source %d target %d (nested: %d, running %d)",
+		  __entry->source_l1_apicid, __entry->target_l1_apicid,
+		  __entry->target_nested, __entry->target_running)
+);
+
+TRACE_EVENT(kvm_avic_nested_kick_vcpu,
+	    TP_PROTO(int source_l1_apic_id, int target_l2_apic_id, int target_l1_apic_id),
+	    TP_ARGS(source_l1_apic_id, target_l2_apic_id, target_l1_apic_id),
+
+	TP_STRUCT__entry(
+		__field(int, source_l1_apic_id)
+		__field(int, target_l2_apic_id)
+		__field(int, target_l1_apic_id)
+	),
+
+	TP_fast_assign(
+		__entry->source_l1_apic_id = source_l1_apic_id;
+		__entry->target_l2_apic_id = target_l2_apic_id;
+		__entry->target_l1_apic_id = target_l1_apic_id;
+	),
+
+	TP_printk("source l1 apic id: %d target l2 apic id: %d target l1 apic_id: %d",
+		  __entry->source_l1_apic_id, __entry->target_l2_apic_id,
+		  __entry->target_l1_apic_id)
+);
+
 TRACE_EVENT(kvm_hv_timer_state,
 		TP_PROTO(unsigned int vcpu_id, unsigned int hv_timer_in_use),
 		TP_ARGS(vcpu_id, hv_timer_in_use),
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 951d0a78ccdae..d2f73ce87a1e3 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -13063,10 +13063,23 @@ EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_write_tsc_offset);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_ple_window_update);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_pml_full);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_pi_irte_update);
+
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_avic_unaccelerated_access);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_avic_incomplete_ipi);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_avic_ga_log);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_avic_kick_vcpu_slowpath);
+
+EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_avic_physid_table_alloc);
+EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_avic_physid_table_free);
+EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_avic_physid_table_reload);
+EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_avic_physid_table_write);
+
+EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_avic_physid_update_vcpu_host);
+EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_avic_physid_update_vcpu_guest);
+
+EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_avic_nested_doorbell);
+EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_avic_nested_kick_vcpu);
+
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_apicv_accept_irq);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_vmgexit_enter);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_vmgexit_exit);
-- 
2.26.3



* [RFC PATCH v3 10/19] KVM: x86: nSVM: implement AVIC's physid/logid table access helpers
  2022-04-27 20:02 [RFC PATCH v3 00/19] RFC: nested AVIC Maxim Levitsky
                   ` (8 preceding siblings ...)
  2022-04-27 20:03 ` [RFC PATCH v3 09/19] KVM: x86: nSVM: add nested AVIC tracepoints Maxim Levitsky
@ 2022-04-27 20:03 ` Maxim Levitsky
  2022-04-27 20:03 ` [RFC PATCH v3 11/19] KVM: x86: nSVM: implement shadowing of AVIC's physical id table Maxim Levitsky
                   ` (8 subsequent siblings)
  18 siblings, 0 replies; 57+ messages in thread
From: Maxim Levitsky @ 2022-04-27 20:03 UTC (permalink / raw)
  To: kvm
  Cc: Wanpeng Li, Vitaly Kuznetsov, Jani Nikula, Paolo Bonzini,
	Tvrtko Ursulin, Rodrigo Vivi, Zhenyu Wang, Joonas Lahtinen,
	Tom Lendacky, Ingo Molnar, David Airlie, Thomas Gleixner,
	Dave Hansen, x86, intel-gfx, Sean Christopherson, Daniel Vetter,
	Borislav Petkov, Joerg Roedel, linux-kernel, Jim Mattson,
	Zhi Wang, Brijesh Singh, H. Peter Anvin, intel-gvt-dev,
	dri-devel, Maxim Levitsky

Implement a few helpers for manipulating the AVIC's physical and
logical ID table entries.

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
---
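A round trip through the helpers looks like this (a minimal sketch;
it assumes backing_page_hpa and l0_apic_id fit within their
respective masks):

	u64 entry = 0;

	/* mark the entry valid and point it at the APIC backing page */
	physid_entry_set_backing_table(&entry, backing_page_hpa);

	/* mark it running on a host (L0) APIC ID; -1 means not running */
	physid_entry_set_apicid(&entry, l0_apic_id);

	WARN_ON(physid_entry_get_backing_table(entry) != backing_page_hpa);
	WARN_ON(physid_entry_get_apicid(entry) != l0_apic_id);
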
 arch/x86/kvm/svm/svm.h | 45 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 45 insertions(+)

diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index 6fcb164a6ee4a..dfca4c06e2071 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -628,6 +628,51 @@ void avic_vcpu_unblocking(struct kvm_vcpu *vcpu);
 void avic_ring_doorbell(struct kvm_vcpu *vcpu);
 unsigned long avic_vcpu_get_apicv_inhibit_reasons(struct kvm_vcpu *vcpu);
 
+#define INVALID_BACKING_PAGE	(~(u64)0)
+
+static inline u64 physid_entry_get_backing_table(u64 entry)
+{
+	if (!(entry & AVIC_PHYSICAL_ID_ENTRY_VALID_MASK))
+		return INVALID_BACKING_PAGE;
+	return entry & AVIC_PHYSICAL_ID_ENTRY_BACKING_PAGE_MASK;
+}
+
+static inline int physid_entry_get_apicid(u64 entry)
+{
+	if (!(entry & AVIC_PHYSICAL_ID_ENTRY_VALID_MASK))
+		return -1;
+	if (!(entry & AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK))
+		return -1;
+
+	return entry & AVIC_PHYSICAL_ID_ENTRY_HOST_PHYSICAL_ID_MASK;
+}
+
+static inline int logid_get_physid(u64 entry)
+{
+	if (!(entry & AVIC_LOGICAL_ID_ENTRY_VALID_BIT))
+		return -1;
+	return entry & AVIC_LOGICAL_ID_ENTRY_GUEST_PHYSICAL_ID_MASK;
+}
+
+static inline void physid_entry_set_backing_table(u64 *entry, u64 value)
+{
+	*entry &= ~AVIC_PHYSICAL_ID_ENTRY_BACKING_PAGE_MASK;
+	*entry |= (AVIC_PHYSICAL_ID_ENTRY_VALID_MASK | value);
+}
+
+static inline void physid_entry_set_apicid(u64 *entry, int value)
+{
+	WARN_ON(!(*entry & AVIC_PHYSICAL_ID_ENTRY_VALID_MASK));
+
+	*entry &= ~AVIC_PHYSICAL_ID_ENTRY_HOST_PHYSICAL_ID_MASK;
+
+	if (value == -1)
+		*entry &= ~(AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK);
+	else
+		*entry |= (AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK | value);
+}
+
+
 /* sev.c */
 
 #define GHCB_VERSION_MAX	1ULL
-- 
2.26.3



* [RFC PATCH v3 11/19] KVM: x86: nSVM: implement shadowing of AVIC's physical id table
  2022-04-27 20:02 [RFC PATCH v3 00/19] RFC: nested AVIC Maxim Levitsky
                   ` (9 preceding siblings ...)
  2022-04-27 20:03 ` [RFC PATCH v3 10/19] KVM: x86: nSVM: implement AVIC's physid/logid table access helpers Maxim Levitsky
@ 2022-04-27 20:03 ` Maxim Levitsky
  2022-04-27 20:03 ` [RFC PATCH v3 12/19] KVM: x86: nSVM: make nested AVIC physid write tracking be aware of the host scheduling Maxim Levitsky
                   ` (7 subsequent siblings)
  18 siblings, 0 replies; 57+ messages in thread
From: Maxim Levitsky @ 2022-04-27 20:03 UTC (permalink / raw)
  To: kvm
  Cc: Wanpeng Li, Vitaly Kuznetsov, Jani Nikula, Paolo Bonzini,
	Tvrtko Ursulin, Rodrigo Vivi, Zhenyu Wang, Joonas Lahtinen,
	Tom Lendacky, Ingo Molnar, David Airlie, Thomas Gleixner,
	Dave Hansen, x86, intel-gfx, Sean Christopherson, Daniel Vetter,
	Borislav Petkov, Joerg Roedel, linux-kernel, Jim Mattson,
	Zhi Wang, Brijesh Singh, H. Peter Anvin, intel-gvt-dev,
	dri-devel, Maxim Levitsky

Implement the shadow physical ID table and its write-tracking code,
which will soon be used by the nested AVIC.

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
---
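The intended lifecycle of a shadow table is (a sketch; the wiring
into nested VM entry/exit comes in later patches of the series):

	/* nested VM entry: take a reference, allocating on first use */
	t = avic_physid_shadow_table_get(vcpu, gfn);

	/* shadow the first nentries guest entries, write-track the gfn */
	avic_physid_shadow_table_sync(vcpu, t, nentries);

	/* run the guest with t->shadow_table_hpa as its physid table */

	/* nested VM exit / table switch: drop the reference */
	avic_physid_shadow_table_put(vcpu->kvm, t);
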
 arch/x86/kvm/svm/avic.c | 461 +++++++++++++++++++++++++++++++++++++++-
 arch/x86/kvm/svm/svm.h  |  71 +++++++
 2 files changed, 524 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
index e5cbbb97fbab6..f462b7e48e3ca 100644
--- a/arch/x86/kvm/svm/avic.c
+++ b/arch/x86/kvm/svm/avic.c
@@ -51,6 +51,433 @@ static u32 next_vm_id = 0;
 static bool next_vm_id_wrapped = 0;
 static DEFINE_SPINLOCK(svm_vm_data_hash_lock);
 
+
+static inline struct kvm_vcpu *avic_vcpu_by_l1_apicid(struct kvm *kvm,
+						      int l1_apicid)
+{
+	WARN_ON(l1_apicid == -1);
+	return kvm_get_vcpu_by_id(kvm, l1_apicid);
+}
+
+static void avic_physid_shadow_entry_set_vcpu(struct kvm *kvm,
+					      struct avic_physid_table *t,
+					      int n,
+					      int new_l1_apicid)
+{
+	struct avic_physid_entry_descr *e = &t->entries[n];
+	u64 sentry = READ_ONCE(*e->sentry);
+	u64 old_sentry = sentry;
+	struct kvm_vcpu *new_vcpu = NULL;
+	int l0_apicid = -1;
+
+	WARN_ON(!test_bit(n, t->valid_entires));
+
+	if (!list_empty(&e->link))
+		list_del_init(&e->link);
+
+	if (new_l1_apicid != -1)
+		new_vcpu = avic_vcpu_by_l1_apicid(kvm, new_l1_apicid);
+
+	if (new_vcpu)
+		l0_apicid = kvm_cpu_get_apicid(new_vcpu->cpu);
+
+	physid_entry_set_apicid(&sentry, l0_apicid);
+
+	trace_kvm_avic_physid_update_vcpu_guest(new_l1_apicid, l0_apicid);
+
+	if (sentry != old_sentry)
+		WRITE_ONCE(*e->sentry, sentry);
+}
+
+static void avic_physid_shadow_entry_create(struct kvm *kvm,
+					    struct avic_physid_table *t,
+					    int n,
+					    u64 gentry)
+{
+	struct avic_physid_entry_descr *e = &t->entries[n];
+	struct page *backing_page;
+	u64 backing_page_gpa = physid_entry_get_backing_table(gentry);
+	int l1_apic_id = physid_entry_get_apicid(gentry);
+	hpa_t backing_page_hpa;
+	u64 sentry = 0;
+
+
+	if (backing_page_gpa == INVALID_BACKING_PAGE)
+		return;
+
+	/* Pin the APIC backing page */
+	backing_page = gfn_to_page(kvm, gpa_to_gfn(backing_page_gpa));
+
+	if (is_error_page(backing_page))
+		/* Invalid GPA in the guest entry - point to a dummy entry */
+		backing_page_hpa = t->dummy_page_hpa;
+	else
+		backing_page_hpa = page_to_phys(backing_page);
+
+	physid_entry_set_backing_table(&sentry, backing_page_hpa);
+
+	e->gentry = gentry;
+	*e->sentry = sentry;
+
+	if (test_and_set_bit(n, t->valid_entires))
+		WARN_ON(1);
+
+	if (backing_page_hpa != t->dummy_page_hpa)
+		avic_physid_shadow_entry_set_vcpu(kvm, t, n, l1_apic_id);
+}
+
+static void avic_physid_shadow_entry_remove(struct kvm *kvm,
+					   struct avic_physid_table *t,
+					   int n)
+{
+	struct avic_physid_entry_descr *e = &t->entries[n];
+	hpa_t backing_page_hpa;
+
+	if (!test_and_clear_bit(n, t->valid_entires))
+		WARN_ON(1);
+
+	/* Release the APIC backing page */
+	backing_page_hpa = physid_entry_get_backing_table(*e->sentry);
+
+	if (backing_page_hpa != t->dummy_page_hpa)
+		kvm_release_pfn_dirty(backing_page_hpa >> PAGE_SHIFT);
+
+	if (!list_empty(&e->link))
+		list_del_init(&e->link);
+
+	e->gentry = 0;
+	*e->sentry = 0;
+}
+
+
+static bool
+avic_physid_shadow_table_setup_write_tracking(struct kvm *kvm,
+					      struct avic_physid_table *t,
+					      bool enable)
+{
+	struct kvm_memory_slot *slot;
+
+	write_lock(&kvm->mmu_lock);
+	slot = gfn_to_memslot(kvm, t->gfn);
+	if (!slot) {
+		write_unlock(&kvm->mmu_lock);
+		return false;
+	}
+
+	if (enable)
+		kvm_slot_page_track_add_page(kvm, slot, t->gfn, KVM_PAGE_TRACK_WRITE);
+	else
+		kvm_slot_page_track_remove_page(kvm, slot, t->gfn, KVM_PAGE_TRACK_WRITE);
+	write_unlock(&kvm->mmu_lock);
+	return true;
+}
+
+static void
+avic_physid_shadow_table_erase(struct kvm *kvm, struct avic_physid_table *t)
+{
+	int i;
+
+	if (!t->nentries)
+		return;
+
+	avic_physid_shadow_table_setup_write_tracking(kvm, t, false);
+
+	for_each_set_bit(i, t->valid_entires, AVIC_MAX_PHYSICAL_ID_COUNT)
+		avic_physid_shadow_entry_remove(kvm, t, i);
+
+	t->nentries = 0;
+	t->flood_count = 0;
+}
+
+static struct avic_physid_table *
+avic_physid_shadow_table_alloc(struct kvm *kvm, gfn_t gfn)
+{
+	struct avic_physid_entry_descr *e;
+	struct avic_physid_table *t;
+	struct kvm_svm *kvm_svm = to_kvm_svm(kvm);
+	u64 *shadow_table_address;
+	int i;
+
+	if (kvm_page_track_write_tracking_enable(kvm))
+		return NULL;
+
+	lockdep_assert_held(&kvm_svm->avic.tables_lock);
+
+	t = kzalloc(sizeof(*t), GFP_KERNEL_ACCOUNT);
+	if (!t)
+		return NULL;
+
+	t->shadow_table = alloc_page(GFP_KERNEL_ACCOUNT|__GFP_ZERO);
+	if (!t->shadow_table)
+		goto err_free_table;
+
+	shadow_table_address = page_address(t->shadow_table);
+	t->shadow_table_hpa = __sme_set(page_to_phys(t->shadow_table));
+
+	for (i = 0; i < ARRAY_SIZE(t->entries); i++) {
+		e = &t->entries[i];
+		e->sentry = &shadow_table_address[i];
+		e->gentry = 0;
+		INIT_LIST_HEAD(&e->link);
+	}
+
+	t->gfn = gfn;
+	t->refcount = 1;
+
+	list_add_tail(&t->link, &kvm_svm->avic.physid_tables);
+
+	t->dummy_page_hpa = page_to_phys(kvm_svm->avic.invalid_physid_page);
+
+	trace_kvm_avic_physid_table_alloc(gfn_to_gpa(gfn));
+	return t;
+
+err_free_table:
+	kfree(t);
+	return NULL;
+}
+
+static void
+avic_physid_shadow_table_free(struct kvm *kvm, struct avic_physid_table *t)
+{
+	struct kvm_svm *kvm_svm = to_kvm_svm(kvm);
+
+	lockdep_assert_held(&kvm_svm->avic.tables_lock);
+
+	WARN_ON(t->refcount);
+
+	avic_physid_shadow_table_erase(kvm, t);
+
+	trace_kvm_avic_physid_table_free(gfn_to_gpa(t->gfn));
+
+	hlist_del(&t->hash_link);
+	list_del(&t->link);
+	__free_page(t->shadow_table);
+	kfree(t);
+}
+
+static struct avic_physid_table *
+__avic_physid_shadow_table_get(struct hlist_head *head, gfn_t gfn)
+{
+	struct avic_physid_table *t;
+
+	hlist_for_each_entry(t, head, hash_link)
+		if (t->gfn == gfn) {
+			t->refcount++;
+			return t;
+		}
+	return NULL;
+}
+
+struct avic_physid_table *
+avic_physid_shadow_table_get(struct kvm_vcpu *vcpu, gfn_t gfn)
+{
+	struct kvm_svm *kvm_svm = to_kvm_svm(vcpu->kvm);
+	struct hlist_head *hlist;
+	struct avic_physid_table *t;
+
+	mutex_lock(&kvm_svm->avic.tables_lock);
+
+	hlist = &kvm_svm->avic.physid_gpa_hash[avic_physid_hash(gfn)];
+	t = __avic_physid_shadow_table_get(hlist, gfn);
+	if (!t) {
+		t = avic_physid_shadow_table_alloc(vcpu->kvm, gfn);
+		if (!t)
+			goto out_unlock;
+		hlist_add_head(&t->hash_link, hlist);
+	}
+out_unlock:
+	mutex_unlock(&kvm_svm->avic.tables_lock);
+	return t;
+}
+
+static void
+__avic_physid_shadow_table_put(struct kvm *kvm, struct avic_physid_table *t)
+{
+	WARN_ON(t->refcount <= 0);
+	if (--t->refcount == 0)
+		avic_physid_shadow_table_free(kvm, t);
+}
+
+void avic_physid_shadow_table_put(struct kvm *kvm, struct avic_physid_table *t)
+{
+	struct kvm_svm *kvm_svm = to_kvm_svm(kvm);
+
+	mutex_lock(&kvm_svm->avic.tables_lock);
+	__avic_physid_shadow_table_put(kvm, t);
+	mutex_unlock(&kvm_svm->avic.tables_lock);
+}
+
+static void avic_physid_shadow_table_invalidate(struct kvm *kvm,
+		struct avic_physid_table *t)
+{
+	struct kvm_svm *kvm_svm = to_kvm_svm(kvm);
+
+	lockdep_assert_held(&kvm_svm->avic.tables_lock);
+	avic_physid_shadow_table_erase(kvm, t);
+}
+
+int avic_physid_shadow_table_sync(struct kvm_vcpu *vcpu,
+				  struct avic_physid_table *t, int nentries)
+{
+	struct kvm_svm *kvm_svm = to_kvm_svm(vcpu->kvm);
+	struct kvm_host_map map;
+	u64 *gentries;
+	int i;
+	int ret = 0;
+
+	mutex_lock(&kvm_svm->avic.tables_lock);
+
+	if (t->nentries >= nentries)
+		goto out_unlock;
+
+
+	trace_kvm_avic_physid_table_reload(gfn_to_gpa(t->gfn), t->nentries, nentries);
+
+	if (t->nentries == 0) {
+		if (!avic_physid_shadow_table_setup_write_tracking(vcpu->kvm, t, true)) {
+			ret = -EFAULT;
+			goto out_unlock;
+		}
+	}
+
+	if (kvm_vcpu_map(vcpu, t->gfn, &map)) {
+		ret = -EFAULT;
+		goto out_unlock;
+	}
+
+	gentries = (u64 *)map.hva;
+
+	for (i = t->nentries ; i < nentries ; i++)
+		avic_physid_shadow_entry_create(vcpu->kvm, t, i, gentries[i]);
+
+	/* publish the table before setting nentries */
+	wmb();
+	WRITE_ONCE(t->nentries, nentries);
+
+	kvm_vcpu_unmap(vcpu, &map, false);
+out_unlock:
+	mutex_unlock(&kvm_svm->avic.tables_lock);
+	return ret;
+}
+
+static void avic_physid_shadow_table_track_write(struct kvm_vcpu *vcpu,
+						 gpa_t gpa,
+						 const u8 *new,
+						 int bytes,
+						 struct kvm_page_track_notifier_node *node)
+{
+	struct kvm_svm *kvm_svm = to_kvm_svm(vcpu->kvm);
+	struct hlist_head *hlist;
+	struct avic_physid_table *t;
+	gfn_t gfn = gpa_to_gfn(gpa);
+	unsigned int page_offset = offset_in_page(gpa);
+	unsigned int entry_offset = page_offset & 0x7;
+	int first = page_offset / sizeof(u64);
+	int last = (page_offset + bytes - 1) / sizeof(u64);
+	u64 new_entry, old_entry;
+	int l1_apic_id;
+
+	if (WARN_ON_ONCE(bytes == 0))
+		return;
+
+	mutex_lock(&kvm_svm->avic.tables_lock);
+
+	hlist = &kvm_svm->avic.physid_gpa_hash[avic_physid_hash(gfn)];
+	t = __avic_physid_shadow_table_get(hlist, gfn);
+
+	if (!t)
+		goto out_unlock;
+
+	trace_kvm_avic_physid_table_write(gpa, bytes);
+
+	/*
+	 * Update policy:
+	 *
+	 * A write is allowed in place only if it targets a single entry,
+	 * that entry had a valid backing page on the last VM entry with
+	 * this page, and the write touches only the is_running and/or
+	 * apic_id part of the entry.
+	 *
+	 * Writes beyond the known number of entries are ignored, to support
+	 * the case where the guest adds entries at the end of the page
+	 * while hotplugging a CPU.
+	 *
+	 * All other writes, which are not supposed to happen while the
+	 * page is in use, invalidate the page; it is then re-read as a
+	 * whole the next time a vCPU uses it for VM entry.
+	 */
+
+	if (first >= t->nentries)
+		goto out_table_put;
+
+	if (first != last || !test_bit(first, t->valid_entires))
+		goto invalidate;
+
+	/* update the entry with written bytes */
+	old_entry = t->entries[first].gentry;
+	new_entry = old_entry;
+	memcpy(((u8 *)&new_entry) + entry_offset, new, bytes);
+
+	/* If the backing page changed, invalidate the whole page */
+	if (physid_entry_get_backing_table(old_entry) !=
+				physid_entry_get_backing_table(new_entry))
+		goto invalidate;
+
+	/*
+	 * Detect write flooding of a physid page that might no longer
+	 * be used as such by the guest.
+	 */
+	if (!atomic_read(&t->usecount)) {
+		if (++t->flood_count > t->nentries * AVIC_PHYSID_FLOOD_COUNT)
+			goto invalidate;
+	} else {
+		t->flood_count = 0;
+	}
+
+	/* Update the backing cpu */
+	l1_apic_id = physid_entry_get_apicid(new_entry);
+	avic_physid_shadow_entry_set_vcpu(vcpu->kvm, t, first, l1_apic_id);
+	t->entries[first].gentry = new_entry;
+	goto out_table_put;
+invalidate:
+	avic_physid_shadow_table_invalidate(vcpu->kvm, t);
+out_table_put:
+	__avic_physid_shadow_table_put(vcpu->kvm, t);
+out_unlock:
+	mutex_unlock(&kvm_svm->avic.tables_lock);
+}
+
+static void avic_physid_shadow_table_flush_memslot(struct kvm *kvm,
+						   struct kvm_memory_slot *slot,
+						   struct kvm_page_track_notifier_node *node)
+{
+	struct kvm_svm *kvm_svm = to_kvm_svm(kvm);
+	struct avic_physid_table *t, *n;
+	int i;
+
+	mutex_lock(&kvm_svm->avic.tables_lock);
+
+	list_for_each_entry_safe(t, n, &kvm_svm->avic.physid_tables, link) {
+
+		if (gfn_in_memslot(slot, t->gfn)) {
+			avic_physid_shadow_table_invalidate(kvm, t);
+			continue;
+		}
+
+		for_each_set_bit(i, t->valid_entires, AVIC_MAX_PHYSICAL_ID_COUNT) {
+			u64 gentry = t->entries[i].gentry;
+			gpa_t gpa = physid_entry_get_backing_table(gentry);
+
+			if (gfn_in_memslot(slot, gpa_to_gfn(gpa))) {
+				avic_physid_shadow_table_invalidate(kvm, t);
+				break;
+			}
+		}
+	}
+	mutex_unlock(&kvm_svm->avic.tables_lock);
+}
+
 /*
  * This is a wrapper of struct amd_iommu_ir_data.
  */
@@ -113,18 +540,22 @@ void avic_vm_destroy(struct kvm *kvm)
 		__free_page(avic->logical_id_table_page);
 	if (avic->physical_id_table_page)
 		__free_page(avic->physical_id_table_page);
+	if (avic->invalid_physid_page)
+		__free_page(avic->invalid_physid_page);
 
 	spin_lock_irqsave(&svm_vm_data_hash_lock, flags);
 	hash_del(&avic->hnode);
 	spin_unlock_irqrestore(&svm_vm_data_hash_lock, flags);
+
+	kvm_page_track_unregister_notifier(kvm, &avic->write_tracker);
 }
 
 int avic_vm_init(struct kvm *kvm)
 {
 	unsigned long flags;
 	int err = -ENOMEM;
-	struct page *p_page;
-	struct page *l_page;
+	struct page *page;
 	struct kvm_svm_avic *avic = &to_kvm_svm(kvm)->avic;
 	u32 vm_id;
 
@@ -132,18 +563,25 @@ int avic_vm_init(struct kvm *kvm)
 		return 0;
 
 	/* Allocating physical APIC ID table (4KB) */
-	p_page = alloc_page(GFP_KERNEL_ACCOUNT | __GFP_ZERO);
-	if (!p_page)
+	page = alloc_page(GFP_KERNEL_ACCOUNT | __GFP_ZERO);
+	if (!page)
 		goto free_avic;
 
-	avic->physical_id_table_page = p_page;
+	avic->physical_id_table_page = page;
 
 	/* Allocating logical APIC ID table (4KB) */
-	l_page = alloc_page(GFP_KERNEL_ACCOUNT | __GFP_ZERO);
-	if (!l_page)
+	page = alloc_page(GFP_KERNEL_ACCOUNT | __GFP_ZERO);
+	if (!page)
 		goto free_avic;
 
-	avic->logical_id_table_page = l_page;
+	avic->logical_id_table_page = page;
+
+	/* Allocating a dummy page for invalid nested avic physid entries */
+	page = alloc_page(GFP_KERNEL_ACCOUNT | __GFP_ZERO);
+	if (!page)
+		goto free_avic;
+
+	avic->invalid_physid_page = page;
 
 	spin_lock_irqsave(&svm_vm_data_hash_lock, flags);
  again:
@@ -165,6 +603,13 @@ int avic_vm_init(struct kvm *kvm)
 	hash_add(svm_vm_data_hash, &avic->hnode, avic->vm_id);
 	spin_unlock_irqrestore(&svm_vm_data_hash_lock, flags);
 
+	mutex_init(&avic->tables_lock);
+	INIT_LIST_HEAD(&avic->physid_tables);
+
+	avic->write_tracker.track_write = avic_physid_shadow_table_track_write;
+	avic->write_tracker.track_flush_slot = avic_physid_shadow_table_flush_memslot;
+
+	kvm_page_track_register_notifier(kvm, &avic->write_tracker);
 	return 0;
 
 free_avic:
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index dfca4c06e2071..fc15e1f938793 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -18,6 +18,7 @@
 #include <linux/kvm_types.h>
 #include <linux/kvm_host.h>
 #include <linux/bits.h>
+#include <linux/hash.h>
 
 #include <asm/svm.h>
 #include <asm/sev-common.h>
@@ -89,13 +90,33 @@ struct kvm_sev_info {
 };
 
 
+#define AVIC_PHYSID_HASH_SHIFT 8
+#define AVIC_PHYSID_HASH_SIZE (1 << AVIC_PHYSID_HASH_SHIFT)
+
 struct kvm_svm_avic {
 	u32 vm_id;
 	struct page *logical_id_table_page;
 	struct page *physical_id_table_page;
 	struct hlist_node hnode;
+
+	struct mutex tables_lock;
+
+	/* List of all shadow tables */
+	struct list_head physid_tables;
+
+	/* GPA hash table to find a shadow table via its GPA */
+	struct hlist_head physid_gpa_hash[AVIC_PHYSID_HASH_SIZE];
+
+	struct kvm_page_track_notifier_node write_tracker;
+
+	struct page *invalid_physid_page;
 };
 
+static __always_inline unsigned int avic_physid_hash(gfn_t gfn)
+{
+	return hash_64(gfn, AVIC_PHYSID_HASH_SHIFT);
+}
+
 struct kvm_svm {
 	struct kvm kvm;
 	struct kvm_svm_avic avic;
@@ -147,6 +168,49 @@ struct vmcb_ctrl_area_cached {
 	u8 reserved_sw[32];
 };
 
+struct avic_physid_entry_descr {
+	struct list_head link;
+
+	/* cached value of guest entry */
+	u64  gentry;
+
+	/* shadow table entry pointer */
+	u64 *sentry;
+};
+
+#define AVIC_PHYSID_FLOOD_COUNT 1000
+
+struct avic_physid_table {
+	/* Membership in the list of all shadow tables */
+	struct list_head link;
+
+	/* Membership in the GPA hash of all shadow tables */
+	struct hlist_node hash_link;
+
+	/* GFN of the table in guest memory */
+	gfn_t gfn;
+
+	/* Number of shadowed entries, and a bitmap of the valid ones */
+	int nentries;
+	DECLARE_BITMAP(valid_entires, AVIC_MAX_PHYSICAL_ID_COUNT);
+
+	struct avic_physid_entry_descr entries[AVIC_MAX_PHYSICAL_ID_COUNT];
+
+	/* Guest visible shadow table */
+	struct page *shadow_table;
+	hpa_t shadow_table_hpa;
+	hpa_t dummy_page_hpa;
+
+	/* Number of vCPUs that hold a reference to this table */
+	int refcount;
+
+	/* Number of vCPUs that are in guest mode and use this table */
+	atomic_t usecount;
+
+	/* Number of writes to this page between uses of it */
+	int flood_count;
+};
+
 struct svm_nested_state {
 	struct kvm_vmcb_info vmcb02;
 	u64 hsave_msr;
@@ -628,6 +692,13 @@ void avic_vcpu_unblocking(struct kvm_vcpu *vcpu);
 void avic_ring_doorbell(struct kvm_vcpu *vcpu);
 unsigned long avic_vcpu_get_apicv_inhibit_reasons(struct kvm_vcpu *vcpu);
 
+struct avic_physid_table *
+avic_physid_shadow_table_get(struct kvm_vcpu *vcpu, gfn_t gfn);
+void avic_physid_shadow_table_put(struct kvm *kvm, struct avic_physid_table *t);
+int avic_physid_shadow_table_sync(struct kvm_vcpu *vcpu,
+				  struct avic_physid_table *t, int nentries);
+
 #define INVALID_BACKING_PAGE	(~(u64)0)
 
 static inline u64 physid_entry_get_backing_table(u64 entry)
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [RFC PATCH v3 12/19] KVM: x86: nSVM: make nested AVIC physid write tracking be aware of the host scheduling
  2022-04-27 20:02 [RFC PATCH v3 00/19] RFC: nested AVIC Maxim Levitsky
                   ` (10 preceding siblings ...)
  2022-04-27 20:03 ` [RFC PATCH v3 11/19] KVM: x86: nSVM: implement shadowing of AVIC's physical id table Maxim Levitsky
@ 2022-04-27 20:03 ` Maxim Levitsky
  2022-04-27 20:03 ` [RFC PATCH v3 13/19] KVM: x86: nSVM: wire nested AVIC to nested guest entry/exit Maxim Levitsky
                   ` (6 subsequent siblings)
  18 siblings, 0 replies; 57+ messages in thread
From: Maxim Levitsky @ 2022-04-27 20:03 UTC (permalink / raw)
  To: kvm
  Cc: Wanpeng Li, Vitaly Kuznetsov, Jani Nikula, Paolo Bonzini,
	Tvrtko Ursulin, Rodrigo Vivi, Zhenyu Wang, Joonas Lahtinen,
	Tom Lendacky, Ingo Molnar, David Airlie, Thomas Gleixner,
	Dave Hansen, x86, intel-gfx, Sean Christopherson, Daniel Vetter,
	Borislav Petkov, Joerg Roedel, linux-kernel, Jim Mattson,
	Zhi Wang, Brijesh Singh, H. Peter Anvin, intel-gvt-dev,
	dri-devel, Maxim Levitsky

For each vCPU:

  - Store a linked list of all shadow physical id entries
    which address it.

  - Update those entries when this vCPU is scheduled
    in/out.

  - Update this list when physid tables are modified by
    other means (guest write and/or table sync).

To avoid races with the vCPU scheduling path, use a (raw) spinlock;
the updates run from the sched-in/sched-out hooks with preemption
disabled.
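
A condensed, illustrative sketch of the scheduling hook this adds
(the function name here is made up; the real code is
avic_update_peer_physid_entries() in the diff below):

	static void nested_avic_sched_hook(struct kvm_vcpu *vcpu, int cpu)
	{
		struct kvm_svm *kvm_svm = to_kvm_svm(vcpu->kvm);
		struct avic_physid_entry_descr *e;
		int l0_apicid = (cpu != -1) ? kvm_cpu_get_apicid(cpu) : -1;
		unsigned long flags;

		/* Serialize against guest writes and table sync. */
		raw_spin_lock_irqsave(&kvm_svm->avic.table_entries_lock, flags);

		/*
		 * Point every shadow physid entry that targets this vCPU
		 * at its new physical CPU, or mark it as not running.
		 */
		list_for_each_entry(e, &to_svm(vcpu)->nested.physid_ref_entries, link) {
			u64 sentry = READ_ONCE(*e->sentry);

			physid_entry_set_apicid(&sentry, l0_apicid);
			WRITE_ONCE(*e->sentry, sentry);
		}

		raw_spin_unlock_irqrestore(&kvm_svm->avic.table_entries_lock, flags);
	}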

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
---
 arch/x86/kvm/svm/avic.c | 113 +++++++++++++++++++++++++++++++++++++---
 arch/x86/kvm/svm/svm.c  |   7 +++
 arch/x86/kvm/svm/svm.h  |  10 ++++
 3 files changed, 122 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
index f462b7e48e3ca..34da9fabd5194 100644
--- a/arch/x86/kvm/svm/avic.c
+++ b/arch/x86/kvm/svm/avic.c
@@ -67,8 +67,12 @@ static void avic_physid_shadow_entry_set_vcpu(struct kvm *kvm,
 	struct avic_physid_entry_descr *e = &t->entries[n];
 	u64 sentry = READ_ONCE(*e->sentry);
 	u64 old_sentry = sentry;
+	struct kvm_svm *kvm_svm = to_kvm_svm(kvm);
 	struct kvm_vcpu *new_vcpu = NULL;
 	int l0_apicid = -1;
+	unsigned long flags;
+
+	raw_spin_lock_irqsave(&kvm_svm->avic.table_entries_lock, flags);
 
 	WARN_ON(!test_bit(n, t->valid_entires));
 
@@ -79,6 +83,9 @@ static void avic_physid_shadow_entry_set_vcpu(struct kvm *kvm,
 		new_vcpu = avic_vcpu_by_l1_apicid(kvm, new_l1_apicid);
 
 	if (new_vcpu)
+		list_add_tail(&e->link, &to_svm(new_vcpu)->nested.physid_ref_entries);
+
+	if (new_vcpu && to_svm(new_vcpu)->nested_avic_active)
 		l0_apicid = kvm_cpu_get_apicid(new_vcpu->cpu);
 
 	physid_entry_set_apicid(&sentry, l0_apicid);
@@ -87,6 +94,8 @@ static void avic_physid_shadow_entry_set_vcpu(struct kvm *kvm,
 
 	if (sentry != old_sentry)
 		WRITE_ONCE(*e->sentry, sentry);
+
+	raw_spin_unlock_irqrestore(&kvm_svm->avic.table_entries_lock, flags);
 }
 
 static void avic_physid_shadow_entry_create(struct kvm *kvm,
@@ -131,7 +140,11 @@ static void avic_physid_shadow_entry_remove(struct kvm *kvm,
 					   int n)
 {
 	struct avic_physid_entry_descr *e = &t->entries[n];
+	struct kvm_svm *kvm_svm = to_kvm_svm(kvm);
 	hpa_t backing_page_hpa;
+	unsigned long flags;
+
+	raw_spin_lock_irqsave(&kvm_svm->avic.table_entries_lock, flags);
 
 	if (!test_and_clear_bit(n, t->valid_entires))
 		WARN_ON(1);
@@ -147,8 +160,49 @@ static void avic_physid_shadow_entry_remove(struct kvm *kvm,
 
 	e->gentry = 0;
 	*e->sentry = 0;
+
+	raw_spin_unlock_irqrestore(&kvm_svm->avic.table_entries_lock, flags);
 }
 
+static void avic_update_peer_physid_entries(struct kvm_vcpu *vcpu, int cpu)
+{
+	/*
+	 * Update all shadow physid tables that contain entries
+	 * referencing this vCPU with its new physical location.
+	 */
+	struct kvm_svm *kvm_svm = to_kvm_svm(vcpu->kvm);
+	struct vcpu_svm *vcpu_svm = to_svm(vcpu);
+	struct avic_physid_entry_descr *e;
+	int updated_nentries = 0;
+	int l0_apicid = -1;
+	unsigned long flags;
+	bool new_active = cpu != -1;
+
+	if (cpu != -1)
+		l0_apicid = kvm_cpu_get_apicid(cpu);
+
+	raw_spin_lock_irqsave(&kvm_svm->avic.table_entries_lock, flags);
+
+	list_for_each_entry(e, &vcpu_svm->nested.physid_ref_entries, link) {
+		u64 sentry = READ_ONCE(*e->sentry);
+		u64 old_sentry = sentry;
+
+		physid_entry_set_apicid(&sentry, l0_apicid);
+
+		if (sentry != old_sentry) {
+			updated_nentries++;
+			WRITE_ONCE(*e->sentry, sentry);
+		}
+	}
+
+	if (updated_nentries)
+		trace_kvm_avic_physid_update_vcpu_host(vcpu->vcpu_id,
+						       l0_apicid, updated_nentries);
+
+	vcpu_svm->nested_avic_active = new_active;
+
+	raw_spin_unlock_irqrestore(&kvm_svm->avic.table_entries_lock, flags);
+}
 
 static bool
 avic_physid_shadow_table_setup_write_tracking(struct kvm *kvm,
@@ -603,6 +657,7 @@ int avic_vm_init(struct kvm *kvm)
 	hash_add(svm_vm_data_hash, &avic->hnode, avic->vm_id);
 	spin_unlock_irqrestore(&svm_vm_data_hash_lock, flags);
 
+	raw_spin_lock_init(&avic->table_entries_lock);
 	mutex_init(&avic->tables_lock);
 	INIT_LIST_HEAD(&avic->physid_tables);
 
@@ -1428,9 +1483,51 @@ static void avic_vcpu_load(struct kvm_vcpu *vcpu)
 static void avic_vcpu_put(struct kvm_vcpu *vcpu)
 {
 	preempt_disable();
-
 	__avic_vcpu_put(vcpu);
+	preempt_enable();
+}
+
 
+void __nested_avic_load(struct kvm_vcpu *vcpu, int cpu)
+{
+	struct vcpu_svm *svm = to_svm(vcpu);
+
+	lockdep_assert_preemption_disabled();
+
+	/*
+	 * For the same reason as in __avic_vcpu_load, there is no
+	 * need to load the nested AVIC state when this vCPU is blocking.
+	 */
+	if (kvm_vcpu_is_blocking(vcpu))
+		return;
+
+	if (svm->nested.initialized)
+		avic_update_peer_physid_entries(vcpu, cpu);
+}
+
+void __nested_avic_put(struct kvm_vcpu *vcpu)
+{
+	struct vcpu_svm *svm = to_svm(vcpu);
+
+	lockdep_assert_preemption_disabled();
+
+	if (svm->nested.initialized)
+		avic_update_peer_physid_entries(vcpu, -1);
+}
+
+void nested_avic_load(struct kvm_vcpu *vcpu)
+{
+	int cpu = get_cpu();
+
+	WARN_ON(cpu != vcpu->cpu);
+	__nested_avic_load(vcpu, cpu);
+	put_cpu();
+}
+
+void nested_avic_put(struct kvm_vcpu *vcpu)
+{
+	preempt_disable();
+	__nested_avic_put(vcpu);
 	preempt_enable();
 }
 
@@ -1468,9 +1565,6 @@ void avic_refresh_apicv_exec_ctrl(struct kvm_vcpu *vcpu)
 
 void avic_vcpu_blocking(struct kvm_vcpu *vcpu)
 {
-	if (!kvm_vcpu_apicv_active(vcpu))
-		return;
-
        /*
         * Unload the AVIC when the vCPU is about to block, _before_
         * the vCPU actually blocks.
@@ -1484,13 +1578,16 @@ void avic_vcpu_blocking(struct kvm_vcpu *vcpu)
         * IRR and reading IsRunning; the lack of this barrier might be
         * the cause of errata #1235).
         */
-	avic_vcpu_put(vcpu);
+	if (kvm_vcpu_apicv_active(vcpu))
+		avic_vcpu_put(vcpu);
+
+	nested_avic_put(vcpu);
 }
 
 void avic_vcpu_unblocking(struct kvm_vcpu *vcpu)
 {
-	if (!kvm_vcpu_apicv_active(vcpu))
-		return;
+	if (kvm_vcpu_apicv_active(vcpu))
+		avic_vcpu_load(vcpu);
 
-	avic_vcpu_load(vcpu);
+	nested_avic_load(vcpu);
 }
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 75b4f3ac8b1a0..76fbee2c8c5d7 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -1302,6 +1302,8 @@ static int svm_vcpu_create(struct kvm_vcpu *vcpu)
 
 	svm->guest_state_loaded = false;
 
+	INIT_LIST_HEAD(&svm->nested.physid_ref_entries);
+
 	return 0;
 
 error_free_vmsa_page:
@@ -1391,8 +1393,11 @@ static void svm_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 		sd->current_vmcb = svm->vmcb;
 		indirect_branch_prediction_barrier();
 	}
+
 	if (kvm_vcpu_apicv_active(vcpu))
 		__avic_vcpu_load(vcpu, cpu);
+
+	__nested_avic_load(vcpu, cpu);
 }
 
 static void svm_vcpu_put(struct kvm_vcpu *vcpu)
@@ -1400,6 +1405,8 @@ static void svm_vcpu_put(struct kvm_vcpu *vcpu)
 	if (kvm_vcpu_apicv_active(vcpu))
 		__avic_vcpu_put(vcpu);
 
+	__nested_avic_put(vcpu);
+
 	svm_prepare_host_switch(vcpu);
 
 	++vcpu->stat.host_state_reload;
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index fc15e1f938793..401449dbce65d 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -99,6 +99,7 @@ struct kvm_svm_avic {
 	struct page *physical_id_table_page;
 	struct hlist_node hnode;
 
+	raw_spinlock_t table_entries_lock;
 	struct mutex tables_lock;
 
 	/* List of all shadow tables */
@@ -244,6 +245,9 @@ struct svm_nested_state {
 	 * on its side.
 	 */
 	bool force_msr_bitmap_recalc;
+
+	/* All AVIC shadow PID table entry descriptors that reference this vCPU */
+	struct list_head physid_ref_entries;
 };
 
 struct vcpu_sev_es_state {
@@ -311,6 +315,7 @@ struct vcpu_svm {
 	u32 dfr_reg;
 	struct page *avic_backing_page;
 	u64 *avic_physical_id_cache;
+	bool nested_avic_active;
 
 	/*
 	 * Per-vcpu list of struct amd_svm_iommu_ir:
@@ -678,6 +683,11 @@ int avic_unaccelerated_access_interception(struct kvm_vcpu *vcpu);
 int avic_init_vcpu(struct vcpu_svm *svm);
 void __avic_vcpu_load(struct kvm_vcpu *vcpu, int cpu);
 void __avic_vcpu_put(struct kvm_vcpu *vcpu);
+void __nested_avic_load(struct kvm_vcpu *vcpu, int cpu);
+void __nested_avic_put(struct kvm_vcpu *vcpu);
+void nested_avic_load(struct kvm_vcpu *vcpu);
+void nested_avic_put(struct kvm_vcpu *vcpu);
+
 void avic_apicv_post_state_restore(struct kvm_vcpu *vcpu);
 void avic_set_virtual_apic_mode(struct kvm_vcpu *vcpu);
 void avic_refresh_apicv_exec_ctrl(struct kvm_vcpu *vcpu);
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [RFC PATCH v3 13/19] KVM: x86: nSVM: wire nested AVIC to nested guest entry/exit
  2022-04-27 20:02 [RFC PATCH v3 00/19] RFC: nested AVIC Maxim Levitsky
                   ` (11 preceding siblings ...)
  2022-04-27 20:03 ` [RFC PATCH v3 12/19] KVM: x86: nSVM: make nested AVIC physid write tracking be aware of the host scheduling Maxim Levitsky
@ 2022-04-27 20:03 ` Maxim Levitsky
  2022-04-27 20:03 ` [RFC PATCH v3 14/19] KVM: x86: rename .set_apic_access_page_addr to reload_apic_access_page Maxim Levitsky
                   ` (5 subsequent siblings)
  18 siblings, 0 replies; 57+ messages in thread
From: Maxim Levitsky @ 2022-04-27 20:03 UTC (permalink / raw)
  To: kvm
  Cc: Wanpeng Li, Vitaly Kuznetsov, Jani Nikula, Paolo Bonzini,
	Tvrtko Ursulin, Rodrigo Vivi, Zhenyu Wang, Joonas Lahtinen,
	Tom Lendacky, Ingo Molnar, David Airlie, Thomas Gleixner,
	Dave Hansen, x86, intel-gfx, Sean Christopherson, Daniel Vetter,
	Borislav Petkov, Joerg Roedel, linux-kernel, Jim Mattson,
	Zhi Wang, Brijesh Singh, H. Peter Anvin, intel-gvt-dev,
	dri-devel, Maxim Levitsky

  * Pass through the guest's AVIC pages that can be passed through:
     - logical id table
     - AVIC backing page

  * Pass through AVIC's MMIO range
     - the nested guest is responsible for marking it RW
       in its NPT tables.

  * Write-track the physical id page
     - all peers' AVIC backing pages are pinned
       as long as the shadow table is not invalidated/freed.

  * Cache the guest's AVIC settings.

  * Add the SDM-mandated changes to emulated VM enter/exit
    (see the sketch after this list).

Note that nested AVIC still can't be enabled, so this
code has no effect yet.
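
For orientation, the VM-entry side boils down to the following
condensed, illustrative sketch (the function name is made up and the
error unwinding and shadow-table reuse checks are elided; the real
code is nested_vmcb02_prepare_avic() in the diff below):

	static bool nested_avic_prepare_sketch(struct vcpu_svm *svm)
	{
		struct vmcb *vmcb02 = svm->nested.vmcb02.ptr;
		gfn_t physid_gfn = gpa_to_gfn(svm->nested.ctl.avic_physical_id &
					      AVIC_HPA_MASK);
		int nentries = svm->nested.ctl.avic_physical_id &
			       AVIC_PHYSICAL_ID_TABLE_SIZE_MASK;
		struct avic_physid_table *t;

		if (!nested_avic_in_use(&svm->vcpu))
			return true;

		/* Pin L2's APIC backing page and logical id table. */
		if (kvm_vcpu_map(&svm->vcpu,
				 gpa_to_gfn(svm->nested.ctl.avic_backing_page & AVIC_HPA_MASK),
				 &svm->nested.l2_apic_access_page) ||
		    kvm_vcpu_map(&svm->vcpu,
				 gpa_to_gfn(svm->nested.ctl.avic_logical_id & AVIC_HPA_MASK),
				 &svm->nested.l2_logical_id_table))
			return false;

		/* Get a write-tracked shadow of the guest's physid table. */
		t = avic_physid_shadow_table_get(&svm->vcpu, physid_gfn);
		if (!t || avic_physid_shadow_table_sync(&svm->vcpu, t, nentries) < 0)
			return false;

		/* L2 runs with AVIC pointed at host-physical shadow pages. */
		vmcb02->control.avic_backing_page =
			pfn_to_hpa(svm->nested.l2_apic_access_page.pfn);
		vmcb02->control.avic_logical_id =
			pfn_to_hpa(svm->nested.l2_logical_id_table.pfn);
		vmcb02->control.avic_physical_id = t->shadow_table_hpa | nentries;
		vmcb02->control.int_ctl |= AVIC_ENABLE_MASK;
		vmcb_mark_dirty(vmcb02, VMCB_AVIC);
		return true;
	}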

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
---
 arch/x86/kvm/svm/avic.c   |  51 ++++++++++++++-
 arch/x86/kvm/svm/nested.c | 127 +++++++++++++++++++++++++++++++++++++-
 arch/x86/kvm/svm/svm.c    |   2 +
 arch/x86/kvm/svm/svm.h    |  24 +++++++
 4 files changed, 199 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
index 34da9fabd5194..e6ec525a88625 100644
--- a/arch/x86/kvm/svm/avic.c
+++ b/arch/x86/kvm/svm/avic.c
@@ -59,6 +59,18 @@ static inline struct kvm_vcpu *avic_vcpu_by_l1_apicid(struct kvm *kvm,
 	return kvm_get_vcpu_by_id(kvm, l1_apicid);
 }
 
+static u32 nested_avic_get_reg(struct kvm_vcpu *vcpu, int reg_off)
+{
+	struct vcpu_svm *svm = to_svm(vcpu);
+
+	void *nested_apic_regs = svm->nested.l2_apic_access_page.hva;
+
+	if (WARN_ON_ONCE(!nested_apic_regs))
+		return 0;
+
+	return *((u32 *) (nested_apic_regs + reg_off));
+}
+
 static void avic_physid_shadow_entry_set_vcpu(struct kvm *kvm,
 					      struct avic_physid_table *t,
 					      int n,
@@ -531,6 +543,20 @@ static void avic_physid_shadow_table_flush_memslot(struct kvm *kvm,
 	mutex_unlock(&kvm_svm->avic.tables_lock);
 }
 
+void avic_free_nested(struct kvm_vcpu *vcpu)
+{
+	struct avic_physid_table *t;
+	struct vcpu_svm *svm = to_svm(vcpu);
+
+	t = svm->nested.l2_physical_id_table;
+	if (t) {
+		avic_physid_shadow_table_put(vcpu->kvm, t);
+		svm->nested.l2_physical_id_table = NULL;
+	}
+
+	kvm_vcpu_unmap(vcpu, &svm->nested.l2_apic_access_page, true);
+	kvm_vcpu_unmap(vcpu, &svm->nested.l2_logical_id_table, true);
+}
 
 /*
  * This is a wrapper of struct amd_iommu_ir_data.
@@ -586,10 +612,18 @@ void avic_vm_destroy(struct kvm *kvm)
 {
 	unsigned long flags;
 	struct kvm_svm_avic *avic = &to_kvm_svm(kvm)->avic;
+	unsigned long i;
+	struct kvm_vcpu *vcpu;
 
 	if (!enable_apicv)
 		return;
 
+	kvm_for_each_vcpu(i, vcpu, kvm) {
+		vcpu_load(vcpu);
+		avic_free_nested(vcpu);
+		vcpu_put(vcpu);
+	}
+
 	if (avic->logical_id_table_page)
 		__free_page(avic->logical_id_table_page);
 	if (avic->physical_id_table_page)
@@ -1501,7 +1535,7 @@ void __nested_avic_load(struct kvm_vcpu *vcpu, int cpu)
 	if (kvm_vcpu_is_blocking(vcpu))
 		return;
 
-	if (svm->nested.initialized)
+	if (svm->nested.initialized && svm->avic_enabled)
 		avic_update_peer_physid_entries(vcpu, cpu);
 }
 
@@ -1511,7 +1545,7 @@ void __nested_avic_put(struct kvm_vcpu *vcpu)
 
 	lockdep_assert_preemption_disabled();
 
-	if (svm->nested.initialized)
+	if (svm->nested.initialized && svm->avic_enabled)
 		avic_update_peer_physid_entries(vcpu, -1);
 }
 
@@ -1591,3 +1625,16 @@ void avic_vcpu_unblocking(struct kvm_vcpu *vcpu)
 
 	nested_avic_load(vcpu);
 }
+
+bool avic_nested_has_interrupt(struct kvm_vcpu *vcpu)
+{
+	int off;
+
+	if (!nested_avic_in_use(vcpu))
+		return false;
+
+	for (off = 0x10; off < 0x80; off += 0x10)
+		if (nested_avic_get_reg(vcpu, APIC_IRR + off))
+			return true;
+	return false;
+}
diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
index bed5e1692cef0..eb5e9b600e052 100644
--- a/arch/x86/kvm/svm/nested.c
+++ b/arch/x86/kvm/svm/nested.c
@@ -387,6 +387,14 @@ void __nested_copy_vmcb_control_to_cache(struct kvm_vcpu *vcpu,
 		memcpy(to->reserved_sw, from->reserved_sw,
 		       sizeof(struct hv_enlightenments));
 	}
+
+	/* copy avic related settings only when it is enabled */
+	if (from->int_ctl & AVIC_ENABLE_MASK) {
+		to->avic_vapic_bar      = from->avic_vapic_bar;
+		to->avic_backing_page   = from->avic_backing_page;
+		to->avic_logical_id     = from->avic_logical_id;
+		to->avic_physical_id    = from->avic_physical_id;
+	}
 }
 
 void nested_copy_vmcb_control_to_cache(struct vcpu_svm *svm,
@@ -539,6 +547,79 @@ void nested_vmcb02_compute_g_pat(struct vcpu_svm *svm)
 	svm->nested.vmcb02.ptr->save.g_pat = svm->vmcb01.ptr->save.g_pat;
 }
 
+
+static bool nested_vmcb02_prepare_avic(struct vcpu_svm *svm)
+{
+	struct vmcb *vmcb02 = svm->nested.vmcb02.ptr;
+	struct avic_physid_table *t = svm->nested.l2_physical_id_table;
+	gfn_t physid_gfn;
+	int physid_nentries;
+
+	if (!nested_avic_in_use(&svm->vcpu))
+		return true;
+
+	if (svm->vcpu.kvm->arch.apic_id_changed) {
+		/* If the guest played with the APIC ID, it will keep both pieces */
+		kvm_vm_bugged(svm->vcpu.kvm);
+		return false;
+	}
+
+	if (kvm_vcpu_map(&svm->vcpu, gpa_to_gfn(svm->nested.ctl.avic_backing_page & AVIC_HPA_MASK),
+			  &svm->nested.l2_apic_access_page))
+		goto error;
+
+	if (kvm_vcpu_map(&svm->vcpu, gpa_to_gfn(svm->nested.ctl.avic_logical_id & AVIC_HPA_MASK),
+			  &svm->nested.l2_logical_id_table))
+		goto error_unmap_backing_page;
+
+	physid_gfn = gpa_to_gfn(svm->nested.ctl.avic_physical_id &
+		     AVIC_HPA_MASK);
+	physid_nentries = svm->nested.ctl.avic_physical_id &
+			AVIC_PHYSICAL_ID_TABLE_SIZE_MASK;
+
+	if (t && t->gfn != physid_gfn) {
+		avic_physid_shadow_table_put(svm->vcpu.kvm, t);
+		svm->nested.l2_physical_id_table = NULL;
+	}
+
+	if (!svm->nested.l2_physical_id_table) {
+		t = avic_physid_shadow_table_get(&svm->vcpu, physid_gfn);
+		if (!t)
+			goto error_unmap_logical_id_table;
+		svm->nested.l2_physical_id_table = t;
+	}
+
+	atomic_inc(&t->usecount);
+
+	if (t->nentries < physid_nentries)
+		if (avic_physid_shadow_table_sync(&svm->vcpu, t, physid_nentries) < 0)
+			goto error_put_table;
+
+	/* Everything is set up, we can enable AVIC */
+	vmcb02->control.avic_vapic_bar =
+		svm->nested.ctl.avic_vapic_bar & VMCB_AVIC_APIC_BAR_MASK;
+	vmcb02->control.avic_backing_page =
+		pfn_to_hpa(svm->nested.l2_apic_access_page.pfn);
+	vmcb02->control.avic_logical_id =
+		pfn_to_hpa(svm->nested.l2_logical_id_table.pfn);
+	vmcb02->control.avic_physical_id =
+		(svm->nested.l2_physical_id_table->shadow_table_hpa) | physid_nentries;
+
+	vmcb02->control.int_ctl |= AVIC_ENABLE_MASK;
+	vmcb_mark_dirty(vmcb02, VMCB_AVIC);
+	return true;
+
+error_put_table:
+	avic_physid_shadow_table_put(svm->vcpu.kvm, t);
+	svm->nested.l2_physical_id_table = NULL;
+error_unmap_logical_id_table:
+	kvm_vcpu_unmap(&svm->vcpu, &svm->nested.l2_logical_id_table, false);
+error_unmap_backing_page:
+	kvm_vcpu_unmap(&svm->vcpu, &svm->nested.l2_apic_access_page, false);
+error:
+	return false;
+}
+
 static void nested_vmcb02_prepare_save(struct vcpu_svm *svm, struct vmcb *vmcb12)
 {
 	bool new_vmcb12 = false;
@@ -627,6 +708,17 @@ static void nested_vmcb02_prepare_control(struct vcpu_svm *svm)
 	else
 		int_ctl_vmcb01_bits |= (V_GIF_MASK | V_GIF_ENABLE_MASK);
 
+	if (nested_avic_in_use(vcpu)) {
+		/*
+		 * Enabling AVIC implicitly disables the
+		 * V_IRQ, V_INTR_PRIO, V_IGN_TPR, and V_INTR_VECTOR
+		 * fields in the VMCB Control Word.
+		 */
+		int_ctl_vmcb12_bits &= ~V_IRQ_INJECTION_BITS_MASK;
+	}
+
 	/* Copied from vmcb01.  msrpm_base can be overwritten later.  */
 	vmcb02->control.nested_ctl = vmcb01->control.nested_ctl;
 	vmcb02->control.iopm_base_pa = vmcb01->control.iopm_base_pa;
@@ -829,7 +921,10 @@ int nested_svm_vmrun(struct kvm_vcpu *vcpu)
 	if (enter_svm_guest_mode(vcpu, vmcb12_gpa, vmcb12, true))
 		goto out_exit_err;
 
-	if (nested_svm_vmrun_msrpm(svm))
+	if (!nested_svm_vmrun_msrpm(svm))
+		goto out_exit_err;
+
+	if (nested_vmcb02_prepare_avic(svm))
 		goto out;
 
 out_exit_err:
@@ -956,6 +1051,15 @@ int nested_svm_vmexit(struct vcpu_svm *svm)
 
 	nested_svm_copy_common_state(svm->nested.vmcb02.ptr, svm->vmcb01.ptr);
 
+	if (nested_avic_in_use(vcpu)) {
+		struct avic_physid_table *t = svm->nested.l2_physical_id_table;
+
+		kvm_vcpu_unmap(vcpu, &svm->nested.l2_apic_access_page, true);
+		kvm_vcpu_unmap(vcpu, &svm->nested.l2_logical_id_table, true);
+
+		atomic_dec(&t->usecount);
+	}
+
 	svm_switch_vmcb(svm, &svm->vmcb01);
 
 	if (unlikely(svm->lbrv_enabled && (svm->nested.ctl.virt_ext & LBR_CTL_ENABLE_MASK))) {
@@ -1069,6 +1173,7 @@ int svm_allocate_nested(struct vcpu_svm *svm)
 	svm_vcpu_init_msrpm(&svm->vcpu, svm->nested.msrpm);
 
 	svm->nested.initialized = true;
+	nested_avic_load(&svm->vcpu);
 	return 0;
 
 err_free_vmcb02:
@@ -1078,6 +1183,8 @@ int svm_allocate_nested(struct vcpu_svm *svm)
 
 void svm_free_nested(struct vcpu_svm *svm)
 {
+	struct kvm_vcpu *vcpu = &svm->vcpu;
+
 	if (!svm->nested.initialized)
 		return;
 
@@ -1096,6 +1203,11 @@ void svm_free_nested(struct vcpu_svm *svm)
 	 */
 	svm->nested.last_vmcb12_gpa = INVALID_GPA;
 
+	if (svm->avic_enabled) {
+		nested_avic_put(vcpu);
+		avic_free_nested(vcpu);
+	}
+
 	svm->nested.initialized = false;
 }
 
@@ -1116,8 +1228,10 @@ void svm_leave_nested(struct kvm_vcpu *vcpu)
 
 		nested_svm_uninit_mmu_context(vcpu);
 		vmcb_mark_all_dirty(svm->vmcb);
-	}
 
+		kvm_vcpu_unmap(vcpu, &svm->nested.l2_apic_access_page, true);
+		kvm_vcpu_unmap(vcpu, &svm->nested.l2_logical_id_table, true);
+	}
 	kvm_clear_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu);
 }
 
@@ -1423,6 +1537,13 @@ static void nested_copy_vmcb_cache_to_control(struct vmcb_control_area *dst,
 	dst->pause_filter_count   = from->pause_filter_count;
 	dst->pause_filter_thresh  = from->pause_filter_thresh;
 	/* 'clean' and 'reserved_sw' are not changed by KVM */
+
+	if (from->int_ctl & AVIC_ENABLE_MASK) {
+		dst->avic_vapic_bar      = from->avic_vapic_bar;
+		dst->avic_backing_page   = from->avic_backing_page;
+		dst->avic_logical_id     = from->avic_logical_id;
+		dst->avic_physical_id    = from->avic_physical_id;
+	}
 }
 
 static int svm_get_nested_state(struct kvm_vcpu *vcpu,
@@ -1644,7 +1765,7 @@ static bool svm_get_nested_state_pages(struct kvm_vcpu *vcpu)
 		if (CC(!load_pdptrs(vcpu, vcpu->arch.cr3)))
 			return false;
 
-	if (!nested_svm_vmrun_msrpm(svm)) {
+	if (!nested_svm_vmrun_msrpm(svm) || !nested_vmcb02_prepare_avic(svm)) {
 		vcpu->run->exit_reason = KVM_EXIT_INTERNAL_ERROR;
 		vcpu->run->internal.suberror =
 			KVM_INTERNAL_ERROR_EMULATION;
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 76fbee2c8c5d7..a39bb0b27a51d 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -4680,6 +4680,7 @@ static struct kvm_x86_ops svm_x86_ops __initdata = {
 	.refresh_apicv_exec_ctrl = avic_refresh_apicv_exec_ctrl,
 	.check_apicv_inhibit_reasons = avic_check_apicv_inhibit_reasons,
 	.apicv_post_state_restore = avic_apicv_post_state_restore,
+	.guest_apic_has_interrupt = avic_nested_has_interrupt,
 
 	.get_mt_mask = svm_get_mt_mask,
 	.get_exit_info = svm_get_exit_info,
@@ -4931,6 +4932,7 @@ static __init int svm_hardware_setup(void)
 		svm_x86_ops.vcpu_blocking = NULL;
 		svm_x86_ops.vcpu_unblocking = NULL;
 		svm_x86_ops.vcpu_get_apicv_inhibit_reasons = NULL;
+		svm_x86_ops.guest_apic_has_interrupt = NULL;
 	}
 
 	if (vls) {
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index 401449dbce65d..17fcc09cf4be1 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -167,6 +167,11 @@ struct vmcb_ctrl_area_cached {
 	u64 virt_ext;
 	u32 clean;
 	u8 reserved_sw[32];
+
+	u64 avic_vapic_bar;
+	u64 avic_backing_page;
+	u64 avic_logical_id;
+	u64 avic_physical_id;
 };
 
 struct avic_physid_entry_descr {
@@ -248,6 +253,10 @@ struct svm_nested_state {
 
 	/* All AVIC shadow PID table entry descriptors that reference this vCPU */
 	struct list_head physid_ref_entries;
+
+	struct kvm_host_map l2_apic_access_page;
+	struct kvm_host_map l2_logical_id_table;
+	struct avic_physid_table *l2_physical_id_table;
 };
 
 struct vcpu_sev_es_state {
@@ -310,6 +319,7 @@ struct vcpu_svm {
 	bool pause_filter_enabled         : 1;
 	bool pause_threshold_enabled      : 1;
 	bool vgif_enabled                 : 1;
+	bool avic_enabled                 : 1;
 
 	u32 ldr_reg;
 	u32 dfr_reg;
@@ -701,6 +711,8 @@ void avic_vcpu_blocking(struct kvm_vcpu *vcpu);
 void avic_vcpu_unblocking(struct kvm_vcpu *vcpu);
 void avic_ring_doorbell(struct kvm_vcpu *vcpu);
 unsigned long avic_vcpu_get_apicv_inhibit_reasons(struct kvm_vcpu *vcpu);
+void avic_free_nested(struct kvm_vcpu *vcpu);
+bool avic_nested_has_interrupt(struct kvm_vcpu *vcpu);
 
 struct avic_physid_table *
 avic_physid_shadow_table_get(struct kvm_vcpu *vcpu, gfn_t gfn);
@@ -708,6 +720,18 @@ void avic_physid_shadow_table_put(struct kvm *kvm, struct avic_physid_table *t);
 int avic_physid_shadow_table_sync(struct kvm_vcpu *vcpu,
 				  struct avic_physid_table *t, int nentries);
 
+static inline bool nested_avic_in_use(struct kvm_vcpu *vcpu)
+{
+	struct vcpu_svm *vcpu_svm = to_svm(vcpu);
+
+	if (!vcpu_svm->avic_enabled)
+		return false;
+
+	if (!nested_npt_enabled(vcpu_svm))
+		return false;
+
+	return vcpu_svm->nested.ctl.int_ctl & AVIC_ENABLE_MASK;
+}
 
 #define INVALID_BACKING_PAGE	(~(u64)0)
 
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [RFC PATCH v3 14/19] KVM: x86: rename .set_apic_access_page_addr to reload_apic_access_page
  2022-04-27 20:02 [RFC PATCH v3 00/19] RFC: nested AVIC Maxim Levitsky
                   ` (12 preceding siblings ...)
  2022-04-27 20:03 ` [RFC PATCH v3 13/19] KVM: x86: nSVM: wire nested AVIC to nested guest entry/exit Maxim Levitsky
@ 2022-04-27 20:03 ` Maxim Levitsky
  2022-05-19 16:55   ` Sean Christopherson
  2022-04-27 20:03 ` [RFC PATCH v3 15/19] KVM: x86: nSVM: add code to reload AVIC physid table when it is invalidated Maxim Levitsky
                   ` (4 subsequent siblings)
  18 siblings, 1 reply; 57+ messages in thread
From: Maxim Levitsky @ 2022-04-27 20:03 UTC (permalink / raw)
  To: kvm
  Cc: Wanpeng Li, Vitaly Kuznetsov, Jani Nikula, Paolo Bonzini,
	Tvrtko Ursulin, Rodrigo Vivi, Zhenyu Wang, Joonas Lahtinen,
	Tom Lendacky, Ingo Molnar, David Airlie, Thomas Gleixner,
	Dave Hansen, x86, intel-gfx, Sean Christopherson, Daniel Vetter,
	Borislav Petkov, Joerg Roedel, linux-kernel, Jim Mattson,
	Zhi Wang, Brijesh Singh, H. Peter Anvin, intel-gvt-dev,
	dri-devel, Maxim Levitsky

This will be used on SVM to reload the shadow pages of the AVIC
physid tables.

No functional change intended.

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
---
 arch/x86/include/asm/kvm-x86-ops.h | 2 +-
 arch/x86/include/asm/kvm_host.h    | 3 +--
 arch/x86/kvm/vmx/vmx.c             | 8 ++++----
 arch/x86/kvm/x86.c                 | 6 +++---
 4 files changed, 9 insertions(+), 10 deletions(-)

diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
index 96e4e9842dfc6..997edb7453ac2 100644
--- a/arch/x86/include/asm/kvm-x86-ops.h
+++ b/arch/x86/include/asm/kvm-x86-ops.h
@@ -82,7 +82,7 @@ KVM_X86_OP_OPTIONAL(hwapic_isr_update)
 KVM_X86_OP_OPTIONAL_RET0(guest_apic_has_interrupt)
 KVM_X86_OP_OPTIONAL(load_eoi_exitmap)
 KVM_X86_OP_OPTIONAL(set_virtual_apic_mode)
-KVM_X86_OP_OPTIONAL(set_apic_access_page_addr)
+KVM_X86_OP_OPTIONAL(reload_apic_pages)
 KVM_X86_OP(deliver_interrupt)
 KVM_X86_OP_OPTIONAL(sync_pir_to_irr)
 KVM_X86_OP_OPTIONAL_RET0(set_tss_addr)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index fc7df778a3d71..52fa04c3108b1 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1436,7 +1436,7 @@ struct kvm_x86_ops {
 	bool (*guest_apic_has_interrupt)(struct kvm_vcpu *vcpu);
 	void (*load_eoi_exitmap)(struct kvm_vcpu *vcpu, u64 *eoi_exit_bitmap);
 	void (*set_virtual_apic_mode)(struct kvm_vcpu *vcpu);
-	void (*set_apic_access_page_addr)(struct kvm_vcpu *vcpu);
+	void (*reload_apic_pages)(struct kvm_vcpu *vcpu);
 	void (*deliver_interrupt)(struct kvm_lapic *apic, int delivery_mode,
 				  int trig_mode, int vector);
 	int (*sync_pir_to_irr)(struct kvm_vcpu *vcpu);
@@ -1909,7 +1909,6 @@ int kvm_cpu_has_extint(struct kvm_vcpu *v);
 int kvm_arch_interrupt_allowed(struct kvm_vcpu *vcpu);
 int kvm_cpu_get_interrupt(struct kvm_vcpu *v);
 void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event);
-
 int kvm_pv_send_ipi(struct kvm *kvm, unsigned long ipi_bitmap_low,
 		    unsigned long ipi_bitmap_high, u32 min,
 		    unsigned long icr, int op_64_bit);
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index cf8581978bce3..7defd31703c61 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -6339,7 +6339,7 @@ void vmx_set_virtual_apic_mode(struct kvm_vcpu *vcpu)
 	vmx_update_msr_bitmap_x2apic(vcpu);
 }
 
-static void vmx_set_apic_access_page_addr(struct kvm_vcpu *vcpu)
+static void vmx_reload_apic_access_page(struct kvm_vcpu *vcpu)
 {
 	struct page *page;
 
@@ -7777,7 +7777,7 @@ static struct kvm_x86_ops vmx_x86_ops __initdata = {
 	.enable_irq_window = vmx_enable_irq_window,
 	.update_cr8_intercept = vmx_update_cr8_intercept,
 	.set_virtual_apic_mode = vmx_set_virtual_apic_mode,
-	.set_apic_access_page_addr = vmx_set_apic_access_page_addr,
+	.reload_apic_pages = vmx_reload_apic_access_page,
 	.refresh_apicv_exec_ctrl = vmx_refresh_apicv_exec_ctrl,
 	.load_eoi_exitmap = vmx_load_eoi_exitmap,
 	.apicv_post_state_restore = vmx_apicv_post_state_restore,
@@ -7940,12 +7940,12 @@ static __init int hardware_setup(void)
 		enable_vnmi = 0;
 
 	/*
-	 * set_apic_access_page_addr() is used to reload apic access
+	 * kvm_vcpu_reload_apic_pages() is used to reload apic access
 	 * page upon invalidation.  No need to do anything if not
 	 * using the APIC_ACCESS_ADDR VMCS field.
 	 */
 	if (!flexpriority_enabled)
-		vmx_x86_ops.set_apic_access_page_addr = NULL;
+		vmx_x86_ops.reload_apic_pages = NULL;
 
 	if (!cpu_has_vmx_tpr_shadow())
 		vmx_x86_ops.update_cr8_intercept = NULL;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index d2f73ce87a1e3..ad744ab99734c 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -9949,12 +9949,12 @@ void kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
 		kvm_make_all_cpus_request(kvm, KVM_REQ_APIC_PAGE_RELOAD);
 }
 
-static void kvm_vcpu_reload_apic_access_page(struct kvm_vcpu *vcpu)
+static void kvm_vcpu_reload_apic_pages(struct kvm_vcpu *vcpu)
 {
 	if (!lapic_in_kernel(vcpu))
 		return;
 
-	static_call_cond(kvm_x86_set_apic_access_page_addr)(vcpu);
+	static_call_cond(kvm_x86_reload_apic_pages)(vcpu);
 }
 
 void __kvm_request_immediate_exit(struct kvm_vcpu *vcpu)
@@ -10071,7 +10071,7 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 		if (kvm_check_request(KVM_REQ_LOAD_EOI_EXITMAP, vcpu))
 			vcpu_load_eoi_exitmap(vcpu);
 		if (kvm_check_request(KVM_REQ_APIC_PAGE_RELOAD, vcpu))
-			kvm_vcpu_reload_apic_access_page(vcpu);
+			kvm_vcpu_reload_apic_pages(vcpu);
 		if (kvm_check_request(KVM_REQ_HV_CRASH, vcpu)) {
 			vcpu->run->exit_reason = KVM_EXIT_SYSTEM_EVENT;
 			vcpu->run->system_event.type = KVM_SYSTEM_EVENT_CRASH;
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [RFC PATCH v3 15/19] KVM: x86: nSVM: add code to reload AVIC physid table when it is invalidated
  2022-04-27 20:02 [RFC PATCH v3 00/19] RFC: nested AVIC Maxim Levitsky
                   ` (13 preceding siblings ...)
  2022-04-27 20:03 ` [RFC PATCH v3 14/19] KVM: x86: rename .set_apic_access_page_addr to reload_apic_access_page Maxim Levitsky
@ 2022-04-27 20:03 ` Maxim Levitsky
  2022-04-27 20:03 ` [RFC PATCH v3 16/19] KVM: x86: nSVM: implement support for nested AVIC vmexits Maxim Levitsky
                   ` (3 subsequent siblings)
  18 siblings, 0 replies; 57+ messages in thread
From: Maxim Levitsky @ 2022-04-27 20:03 UTC (permalink / raw)
  To: kvm
  Cc: Wanpeng Li, Vitaly Kuznetsov, Jani Nikula, Paolo Bonzini,
	Tvrtko Ursulin, Rodrigo Vivi, Zhenyu Wang, Joonas Lahtinen,
	Tom Lendacky, Ingo Molnar, David Airlie, Thomas Gleixner,
	Dave Hansen, x86, intel-gfx, Sean Christopherson, Daniel Vetter,
	Borislav Petkov, Joerg Roedel, linux-kernel, Jim Mattson,
	Zhi Wang, Brijesh Singh, H. Peter Anvin, intel-gvt-dev,
	dri-devel, Maxim Levitsky

An AVIC table invalidation is not supposed to happen often, and can
only happen when the guest does something suspicious, such as:

  - It places a physid page in a memslot that is enabled/disabled,
    and memslot flushing happens.

  - It tries to update the APIC backing page addresses - the guest
    has no reason to touch these, and doing so on real hardware will
    likely lead to unpredictable results.

  - It writes to reserved bits of a tracked page.

  - It write-floods a physid table while no vCPU is using it
    (the page has likely been reused to contain something else at
    that point).

All of the above raises a KVM_REQ_APIC_PAGE_RELOAD request on all
vCPUs, which kicks them out of guest mode; the first vCPU to reach
the handler re-creates the entries of the physid page, and the
others notice this and do nothing.
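
Condensed, illustrative sketch of both sides of that flow (function
names are made up; the real hunks appear in the diff below):

	/* Invalidation side, under tables_lock: kick everyone out first. */
	static void invalidate_sketch(struct kvm *kvm, struct avic_physid_table *t)
	{
		kvm_make_all_cpus_request(kvm, KVM_REQ_APIC_PAGE_RELOAD);
		avic_physid_shadow_table_erase(kvm, t);
	}

	/*
	 * Handler side, reached via the .reload_apic_pages op: only a vCPU
	 * that is in guest mode with nested AVIC in use re-creates entries.
	 */
	static void reload_sketch(struct kvm_vcpu *vcpu)
	{
		struct vcpu_svm *svm = to_svm(vcpu);
		struct avic_physid_table *t = svm->nested.l2_physical_id_table;
		int nentries = svm->nested.ctl.avic_physical_id &
			       AVIC_PHYSICAL_ID_TABLE_SIZE_MASK;

		if (t && is_guest_mode(vcpu) && nested_avic_in_use(vcpu))
			avic_physid_shadow_table_sync(vcpu, t, nentries);
	}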

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
---
 arch/x86/kvm/svm/avic.c | 13 +++++++++++++
 arch/x86/kvm/svm/svm.c  |  1 +
 arch/x86/kvm/svm/svm.h  |  1 +
 3 files changed, 15 insertions(+)

diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
index e6ec525a88625..f13ca1e7b2845 100644
--- a/arch/x86/kvm/svm/avic.c
+++ b/arch/x86/kvm/svm/avic.c
@@ -379,6 +379,7 @@ static void avic_physid_shadow_table_invalidate(struct kvm *kvm,
 	struct kvm_svm *kvm_svm = to_kvm_svm(kvm);
 
 	lockdep_assert_held(&kvm_svm->avic.tables_lock);
+	kvm_make_all_cpus_request(kvm, KVM_REQ_APIC_PAGE_RELOAD);
 	avic_physid_shadow_table_erase(kvm, t);
 }
 
@@ -1638,3 +1639,15 @@ bool avic_nested_has_interrupt(struct kvm_vcpu *vcpu)
 			return true;
 	return false;
 }
+
+void avic_reload_apic_pages(struct kvm_vcpu *vcpu)
+{
+	struct vcpu_svm *vcpu_svm = to_svm(vcpu);
+	struct avic_physid_table *t = vcpu_svm->nested.l2_physical_id_table;
+
+	int nentries = vcpu_svm->nested.ctl.avic_physical_id &
+			AVIC_PHYSICAL_ID_TABLE_SIZE_MASK;
+
+	if (t && is_guest_mode(vcpu) && nested_avic_in_use(vcpu))
+		avic_physid_shadow_table_sync(vcpu, t, nentries);
+}
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index a39bb0b27a51d..d96a73931d1e5 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -4677,6 +4677,7 @@ static struct kvm_x86_ops svm_x86_ops __initdata = {
 	.enable_nmi_window = svm_enable_nmi_window,
 	.enable_irq_window = svm_enable_irq_window,
 	.update_cr8_intercept = svm_update_cr8_intercept,
+	.reload_apic_pages = avic_reload_apic_pages,
 	.refresh_apicv_exec_ctrl = avic_refresh_apicv_exec_ctrl,
 	.check_apicv_inhibit_reasons = avic_check_apicv_inhibit_reasons,
 	.apicv_post_state_restore = avic_apicv_post_state_restore,
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index 17fcc09cf4be1..93fd9d6f5fd85 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -711,6 +711,7 @@ void avic_vcpu_blocking(struct kvm_vcpu *vcpu);
 void avic_vcpu_unblocking(struct kvm_vcpu *vcpu);
 void avic_ring_doorbell(struct kvm_vcpu *vcpu);
 unsigned long avic_vcpu_get_apicv_inhibit_reasons(struct kvm_vcpu *vcpu);
+void avic_reload_apic_pages(struct kvm_vcpu *vcpu);
 void avic_free_nested(struct kvm_vcpu *vcpu);
 bool avic_nested_has_interrupt(struct kvm_vcpu *vcpu);
 
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [RFC PATCH v3 16/19] KVM: x86: nSVM: implement support for nested AVIC vmexits
  2022-04-27 20:02 [RFC PATCH v3 00/19] RFC: nested AVIC Maxim Levitsky
                   ` (14 preceding siblings ...)
  2022-04-27 20:03 ` [RFC PATCH v3 15/19] KVM: x86: nSVM: add code to reload AVIC physid table when it is invalidated Maxim Levitsky
@ 2022-04-27 20:03 ` Maxim Levitsky
  2022-04-27 20:03 ` [RFC PATCH v3 17/19] KVM: x86: nSVM: implement nested AVIC doorbell emulation Maxim Levitsky
                   ` (2 subsequent siblings)
  18 siblings, 0 replies; 57+ messages in thread
From: Maxim Levitsky @ 2022-04-27 20:03 UTC (permalink / raw)
  To: kvm
  Cc: Wanpeng Li, Vitaly Kuznetsov, Jani Nikula, Paolo Bonzini,
	Tvrtko Ursulin, Rodrigo Vivi, Zhenyu Wang, Joonas Lahtinen,
	Tom Lendacky, Ingo Molnar, David Airlie, Thomas Gleixner,
	Dave Hansen, x86, intel-gfx, Sean Christopherson, Daniel Vetter,
	Borislav Petkov, Joerg Roedel, linux-kernel, Jim Mattson,
	Zhi Wang, Brijesh Singh, H. Peter Anvin, intel-gvt-dev,
	dri-devel, Maxim Levitsky

* SVM_EXIT_AVIC_UNACCELERATED_ACCESS is always forwarded to L1.

* SVM_EXIT_AVIC_INCOMPLETE_IPI is hidden from the guest if:

  - is_running was false in the shadow physid page because L1's vCPU
    was scheduled out - in this case, the vCPU is woken up and will
    process the nested AVIC state on its next VM entry.

  - an invalid physical address of an AVIC backing page was present
    in the guest's physid page, which KVM translates to the valid
    physical address of a dummy page with is_running=false.

    If this condition happens,
    an AVIC_IPI_FAILURE_INVALID_BACKING_PAGE VM exit is injected
    into the nested hypervisor.

* Note that a single SVM_EXIT_AVIC_INCOMPLETE_IPI VM exit can happen
  due to a host related and a guest related reason at the same time:

  For example, if a broadcast IPI was attempted and some shadow
  physid entries had 'is_running=false' set by the guest,
  while others had it set to false due to scheduled-out L1 vCPUs.

  To support this case, all relevant entries of the guest's physical
  and logical id tables are checked, and both the host related
  actions (e.g. wakeup) and the guest VM exit reflection are done
  (see the sketch below).
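
Condensed, illustrative sketch of the per-target decision (the
function name and the 'report_index' out-parameter are made up; the
real code is avic_kick_target_vcpu_nested_physical() in the diff
below):

	static void kick_one_sketch(struct kvm *kvm, u64 gentry,
				    int l2_apic_id, int *report_index)
	{
		int l1_apicid = physid_entry_get_apicid(gentry);

		if (l1_apicid == -1) {
			/*
			 * Guest-visible reason: is_running is false in L1's
			 * own table, so the VM exit must be reflected to L1
			 * (only the first offending index is recorded).
			 */
			if (*report_index == -1)
				*report_index = l2_apic_id;
		} else {
			/*
			 * Host-only reason: L1's vCPU was merely scheduled
			 * out; wake it up and hide this VM exit from L1.
			 */
			struct kvm_vcpu *target =
				avic_vcpu_by_l1_apicid(kvm, l1_apicid);

			if (target)
				kvm_vcpu_wake_up(target);
		}
	}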

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
---
 arch/x86/kvm/svm/avic.c   | 204 +++++++++++++++++++++++++++++++++++++-
 arch/x86/kvm/svm/nested.c |  14 +++
 2 files changed, 216 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
index f13ca1e7b2845..e8c53fd77f0b1 100644
--- a/arch/x86/kvm/svm/avic.c
+++ b/arch/x86/kvm/svm/avic.c
@@ -917,6 +917,164 @@ static void avic_kick_target_vcpus(struct kvm *kvm, struct kvm_lapic *source,
 	}
 }
 
+static void
+avic_kick_target_vcpu_nested_physical(struct vcpu_svm *svm,
+				      int target_l2_apic_id,
+				      int *index,
+				      bool *invalid_page)
+{
+	u64 gentry, sentry;
+	int target_l1_apicid;
+	struct avic_physid_table *t = svm->nested.l2_physical_id_table;
+
+	if (WARN_ON_ONCE(!t))
+		return;
+
+	/*
+	 * This shouldn't normally happen because this condition
+	 * should cause AVIC_IPI_FAILURE_INVALID_TARGET vmexit,
+	 * however the guest can change the page and trigger this.
+	 */
+	if (target_l2_apic_id >= t->nentries)
+		return;
+
+	gentry = t->entries[target_l2_apic_id].gentry;
+	sentry = *t->entries[target_l2_apic_id].sentry;
+
+	/* Same reasoning as above */
+	if (!(gentry & AVIC_PHYSICAL_ID_ENTRY_VALID_MASK))
+		return;
+
+	/*
+	 * This races against the guest updating is_running bit.
+	 *
+	 * Race itself happens on real hardware as well, and the guest
+	 * must use the correct means to avoid it.
+	 *
+	 * AVIC hardware already set IRR and should have done memory
+	 * barrier, and then found out that is_running is false
+	 * in shadow physid table.
+	 *
+	 * We are doing another is_running check (in the guest physid table)
+	 * to complete it, thus no additional memory barrier is needed.
+	 */
+
+	target_l1_apicid = physid_entry_get_apicid(gentry);
+
+	if (target_l1_apicid == -1) {
+
+		/* is_running is false, need to vmexit to the guest */
+		if (*index == -1) {
+			u64 backing_page_phys = physid_entry_get_backing_table(sentry);
+
+			*index = target_l2_apic_id;
+			if (backing_page_phys == t->dummy_page_hpa)
+				*invalid_page = true;
+		}
+	} else {
+		/* Wake up the target vCPU and hide the VM exit from the guest */
+		struct kvm_vcpu *target = avic_vcpu_by_l1_apicid(svm->vcpu.kvm, target_l1_apicid);
+
+		if (target && target != &svm->vcpu)
+			kvm_vcpu_wake_up(target);
+	}
+
+	trace_kvm_avic_nested_kick_vcpu(svm->vcpu.vcpu_id,
+					target_l2_apic_id,
+					target_l1_apicid);
+}
+
+static void
+avic_kick_target_vcpus_nested_logical(struct vcpu_svm *svm, unsigned long dest,
+				      int *index, bool *invalid_page)
+{
+	int logical_id;
+	u8 cluster = 0;
+	u64 *logical_id_table = (u64 *)svm->nested.l2_logical_id_table.hva;
+	int physical_index = -1;
+
+	if (WARN_ON_ONCE(!logical_id_table))
+		return;
+
+	if (nested_avic_get_reg(&svm->vcpu, APIC_DFR) == APIC_DFR_CLUSTER) {
+		if (dest >= 0x40)
+			return;
+		cluster = dest & 0x3C;
+		dest &= 0x3;
+	}
+
+	for_each_set_bit(logical_id, &dest, 8) {
+		int logical_index = cluster | logical_id;
+		u64 log_gentry = logical_id_table[logical_index];
+		int l2_apicid = logid_get_physid(log_gentry);
+
+		/*
+		 * Should not happen, as in this case AVIC should VM exit
+		 * with 'invalid target'.
+		 * However the guest can change the entry behind KVM's
+		 * back, thus ignore this case.
+		 */
+		if (l2_apicid == -1)
+			continue;
+
+		avic_kick_target_vcpu_nested_physical(svm, l2_apicid,
+						      &physical_index,
+						      invalid_page);
+
+		/* Reported index is the index of the logical entry in this case */
+		if (physical_index != -1)
+			*index = logical_index;
+	}
+}
+
+static void
+avic_kick_target_vcpus_nested_broadcast(struct vcpu_svm *svm,
+					int *index, bool *invalid_page)
+{
+	struct avic_physid_table *t = svm->nested.l2_physical_id_table;
+	int l2_apicid;
+
+	/*
+	 * This races against the guest changing the valid bit in the physid
+	 * table and/or increasing number of entries of the table.
+	 *
+	 * In both cases the race would happen on real hardware as well,
+	 * thus this code can avoid synchronization vs write tracking.
+	 */
+	for_each_set_bit(l2_apicid, t->valid_entires, AVIC_MAX_PHYSICAL_ID_COUNT)
+		avic_kick_target_vcpu_nested_physical(svm, l2_apicid,
+						      index, invalid_page);
+}
+
+static void avic_kick_target_vcpus_nested(struct kvm_vcpu *vcpu,
+					struct kvm_lapic *source,
+					u32 icrl, u32 icrh,
+					int *index, bool *invalid_page)
+{
+	struct vcpu_svm *svm = to_svm(vcpu);
+	int dest = GET_APIC_DEST_FIELD(icrh);
+
+	switch (icrl & APIC_SHORT_MASK) {
+	case APIC_DEST_NOSHORT:
+		if (dest == 0xFF)
+			avic_kick_target_vcpus_nested_broadcast(svm,
+					index, invalid_page);
+		else if (icrl & APIC_DEST_MASK)
+			avic_kick_target_vcpus_nested_logical(svm, dest,
+					index, invalid_page);
+		else
+			avic_kick_target_vcpu_nested_physical(svm, dest,
+					index, invalid_page);
+		break;
+	case APIC_DEST_ALLINC:
+	case APIC_DEST_ALLBUT:
+		avic_kick_target_vcpus_nested_broadcast(svm, index, invalid_page);
+		break;
+	case APIC_DEST_SELF:
+		break;
+	}
+}
+
 int avic_incomplete_ipi_interception(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_svm *svm = to_svm(vcpu);
@@ -924,10 +1082,20 @@ int avic_incomplete_ipi_interception(struct kvm_vcpu *vcpu)
 	u32 icrl = svm->vmcb->control.exit_info_1;
 	u32 id = svm->vmcb->control.exit_info_2 >> 32;
 	u32 index = svm->vmcb->control.exit_info_2 & 0x1FF;
+	int nindex = -1;
+	bool invalid_page = false;
+
 	struct kvm_lapic *apic = vcpu->arch.apic;
 
 	trace_kvm_avic_incomplete_ipi(vcpu->vcpu_id, icrh, icrl, id, index);
 
+	if (is_guest_mode(&svm->vcpu)) {
+		if (WARN_ON_ONCE(!nested_avic_in_use(vcpu)))
+			return 1;
+		if (WARN_ON_ONCE(!svm->nested.l2_physical_id_table))
+			return 1;
+	}
+
 	switch (id) {
 	case AVIC_IPI_FAILURE_INVALID_INT_TYPE:
 		/*
@@ -939,23 +1107,49 @@ int avic_incomplete_ipi_interception(struct kvm_vcpu *vcpu)
 		 * which case KVM needs to emulate the ICR write as well in
 		 * order to clear the BUSY flag.
 		 */
+		if (is_guest_mode(&svm->vcpu)) {
+			nested_svm_vmexit(svm);
+			break;
+		}
+
 		if (icrl & APIC_ICR_BUSY)
 			kvm_apic_write_nodecode(vcpu, APIC_ICR);
 		else
 			kvm_apic_send_ipi(apic, icrl, icrh);
 		break;
 	case AVIC_IPI_FAILURE_TARGET_NOT_RUNNING:
 		/*
 		 * At this point, we expect that the AVIC HW has already
 		 * set the appropriate IRR bits on the valid target
 		 * vcpus. So, we just need to kick the appropriate vcpu.
+		 *
+		 * If nested KVM might also need to reflect the VM exit to
+		 * the guest.
 		 */
-		avic_kick_target_vcpus(vcpu->kvm, apic, icrl, icrh, index);
+		if (!is_guest_mode(&svm->vcpu)) {
+			avic_kick_target_vcpus(vcpu->kvm, apic, icrl, icrh, index);
+			break;
+		}
+
+		avic_kick_target_vcpus_nested(vcpu, apic, icrl, icrh,
+					      &nindex, &invalid_page);
+		if (nindex != -1) {
+			if (invalid_page)
+				id = AVIC_IPI_FAILURE_INVALID_BACKING_PAGE;
+
+			svm->vmcb->control.exit_info_2 =  ((u64)id << 32) | nindex;
+			nested_svm_vmexit(svm);
+		}
 		break;
 	case AVIC_IPI_FAILURE_INVALID_TARGET:
+		if (is_guest_mode(&svm->vcpu))
+			nested_svm_vmexit(svm);
+		else
+			WARN_ON_ONCE(1);
 		break;
 	case AVIC_IPI_FAILURE_INVALID_BACKING_PAGE:
-		WARN_ONCE(1, "Invalid backing page\n");
+		WARN_ON_ONCE(1);
 		break;
 	default:
 		pr_err("Unknown IPI interception\n");
@@ -1064,9 +1258,13 @@ static void avic_handle_dfr_update(struct kvm_vcpu *vcpu)
 
 static int avic_unaccel_trap_write(struct kvm_vcpu *vcpu)
 {
+	struct vcpu_svm *svm = to_svm(vcpu);
 	u32 offset = to_svm(vcpu)->vmcb->control.exit_info_1 &
 				AVIC_UNACCEL_ACCESS_OFFSET_MASK;
 
+	if (WARN_ON_ONCE(is_guest_mode(&svm->vcpu)))
+		return 0;
+
 	switch (offset) {
 	case APIC_LDR:
 		if (avic_handle_ldr_update(vcpu))
@@ -1124,6 +1322,8 @@ int avic_unaccelerated_access_interception(struct kvm_vcpu *vcpu)
 		     AVIC_UNACCEL_ACCESS_WRITE_MASK;
 	bool trap = is_avic_unaccelerated_access_trap(offset);
 
+	WARN_ON_ONCE(is_guest_mode(&svm->vcpu));
+
 	trace_kvm_avic_unaccelerated_access(vcpu->vcpu_id, offset,
 					    trap, write, vector);
 	if (trap) {
diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
index eb5e9b600e052..decc665d7cc69 100644
--- a/arch/x86/kvm/svm/nested.c
+++ b/arch/x86/kvm/svm/nested.c
@@ -1320,6 +1320,20 @@ static int nested_svm_intercept(struct vcpu_svm *svm)
 		vmexit = NESTED_EXIT_DONE;
 		break;
 	}
+	case SVM_EXIT_AVIC_UNACCELERATED_ACCESS: {
+		/*
+		 * Unaccelerated AVIC access is always reflected.
+		 * Also there is no intercept bit for it.
+		 */
+		vmexit = NESTED_EXIT_DONE;
+		break;
+	}
+	case SVM_EXIT_AVIC_INCOMPLETE_IPI:
+		/*
+		 * Doesn't have an intercept bit; the host needs to check
+		 * whether to reflect it to the guest or handle it itself.
+		 */
+		break;
 	default: {
 		if (vmcb12_is_intercept(&svm->nested.ctl, exit_code))
 			vmexit = NESTED_EXIT_DONE;
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [RFC PATCH v3 17/19] KVM: x86: nSVM: implement nested AVIC doorbell emulation
  2022-04-27 20:02 [RFC PATCH v3 00/19] RFC: nested AVIC Maxim Levitsky
                   ` (15 preceding siblings ...)
  2022-04-27 20:03 ` [RFC PATCH v3 16/19] KVM: x86: nSVM: implement support for nested AVIC vmexits Maxim Levitsky
@ 2022-04-27 20:03 ` Maxim Levitsky
  2022-04-27 20:03 ` [RFC PATCH v3 18/19] KVM: x86: SVM/nSVM: add optional non strict AVIC doorbell mode Maxim Levitsky
  2022-04-27 20:03 ` [RFC PATCH v3 19/19] KVM: x86: nSVM: expose the nested AVIC to the guest Maxim Levitsky
  18 siblings, 0 replies; 57+ messages in thread
From: Maxim Levitsky @ 2022-04-27 20:03 UTC (permalink / raw)
  To: kvm
  Cc: Wanpeng Li, Vitaly Kuznetsov, Jani Nikula, Paolo Bonzini,
	Tvrtko Ursulin, Rodrigo Vivi, Zhenyu Wang, Joonas Lahtinen,
	Tom Lendacky, Ingo Molnar, David Airlie, Thomas Gleixner,
	Dave Hansen, x86, intel-gfx, Sean Christopherson, Daniel Vetter,
	Borislav Petkov, Joerg Roedel, linux-kernel, Jim Mattson,
	Zhi Wang, Brijesh Singh, H. Peter Anvin, intel-gvt-dev,
	dri-devel, Maxim Levitsky

Implement emulation of the AVIC doorbell MSR
(MSR_AMD64_SVM_AVIC_DOORBELL) for nested AVIC.
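
The emulation boils down to one decision per doorbell write; a
condensed, illustrative sketch (the function name is made up, and
'target' is the vCPU resolved from the written L1 APIC ID, as in
avic_emulate_doorbell_write() below):

	static void doorbell_sketch(struct kvm_vcpu *target)
	{
		/* A doorbell can't affect a target that isn't nested. */
		if (!is_guest_mode(target))
			return;

		if (READ_ONCE(target->mode) == IN_GUEST_MODE)
			/* Forward to the pCPU the target currently occupies. */
			wrmsr(MSR_AMD64_SVM_AVIC_DOORBELL,
			      kvm_cpu_get_apicid(READ_ONCE(target->cpu)), 0);
		else
			/* Possibly halted: wake it so it sees the new IRR bits. */
			kvm_vcpu_wake_up(target);
	}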

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
---
 arch/x86/kvm/svm/avic.c | 49 +++++++++++++++++++++++++++++++++++++++++
 arch/x86/kvm/svm/svm.c  |  2 ++
 arch/x86/kvm/svm/svm.h  |  1 +
 3 files changed, 52 insertions(+)

diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
index e8c53fd77f0b1..149df26e17462 100644
--- a/arch/x86/kvm/svm/avic.c
+++ b/arch/x86/kvm/svm/avic.c
@@ -1165,6 +1165,55 @@ unsigned long avic_vcpu_get_apicv_inhibit_reasons(struct kvm_vcpu *vcpu)
 	return 0;
 }
 
+int avic_emulate_doorbell_write(struct kvm_vcpu *vcpu, u64 data)
+{
+	int source_l1_apicid = vcpu->vcpu_id;
+	int target_l1_apicid = data & AVIC_DOORBELL_PHYSICAL_ID_MASK;
+	bool target_running, target_nested;
+	struct kvm_vcpu *target;
+	struct vcpu_svm *svm = to_svm(vcpu);
+
+	if (!svm->avic_enabled || (data & ~AVIC_DOORBELL_PHYSICAL_ID_MASK))
+		return 1;
+
+	target = avic_vcpu_by_l1_apicid(vcpu->kvm, target_l1_apicid);
+	if (!target)
+		/* Guest bug: targeting invalid APIC ID. */
+		return 0;
+
+	target_running = READ_ONCE(target->mode) == IN_GUEST_MODE;
+	target_nested = is_guest_mode(target);
+
+	trace_kvm_avic_nested_doorbell(source_l1_apicid, target_l1_apicid,
+				       target_nested, target_running);
+
+	/*
+	 * The target is not in nested mode, thus the doorbell doesn't affect it.
+	 * If it just became nested after is_guest_mode was checked,
+	 * it means that it just processed AVIC state and KVM doesn't need
+	 * to send it another doorbell.
+	 */
+	if (!target_nested)
+		return 0;
+
+	/*
+	 * If the target vCPU is in guest mode, kick the real doorbell.
+	 * Otherwise KVM needs to try to wake it up if it was sleeping.
+	 *
+	 * If the target is no longer in guest mode (it just exited),
+	 * it will either start to halt, notice the pending IRR bits
+	 * beforehand and cancel the halt, or it will enter guest mode
+	 * again and notice the IRR bits there.
+	 */
+	if (target_running)
+		wrmsr(MSR_AMD64_SVM_AVIC_DOORBELL,
+		      kvm_cpu_get_apicid(READ_ONCE(target->cpu)), 0);
+	else
+		kvm_vcpu_wake_up(target);
+
+	return 0;
+}
+
 static u32 *avic_get_logical_id_entry(struct kvm_vcpu *vcpu, u32 ldr, bool flat)
 {
 	struct kvm_svm *kvm_svm = to_kvm_svm(vcpu->kvm);
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index d96a73931d1e5..b31bab832360e 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -2772,6 +2772,8 @@ static int svm_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr)
 	u32 ecx = msr->index;
 	u64 data = msr->data;
 	switch (ecx) {
+	case MSR_AMD64_SVM_AVIC_DOORBELL:
+		return avic_emulate_doorbell_write(vcpu, data);
 	case MSR_AMD64_TSC_RATIO:
 
 		if (!svm->tsc_scaling_enabled) {
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index 93fd9d6f5fd85..14e2c5c451cad 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -714,6 +714,7 @@ unsigned long avic_vcpu_get_apicv_inhibit_reasons(struct kvm_vcpu *vcpu);
 void avic_reload_apic_pages(struct kvm_vcpu *vcpu);
 void avic_free_nested(struct kvm_vcpu *vcpu);
 bool avic_nested_has_interrupt(struct kvm_vcpu *vcpu);
+int avic_emulate_doorbell_write(struct kvm_vcpu *vcpu, u64 data);
 
 struct avic_physid_table *
 avic_physid_shadow_table_get(struct kvm_vcpu *vcpu, gfn_t gfn);
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [RFC PATCH v3 18/19] KVM: x86: SVM/nSVM: add optional non strict AVIC doorbell mode
  2022-04-27 20:02 [RFC PATCH v3 00/19] RFC: nested AVIC Maxim Levitsky
                   ` (16 preceding siblings ...)
  2022-04-27 20:03 ` [RFC PATCH v3 17/19] KVM: x86: nSVM: implement nested AVIC doorbell emulation Maxim Levitsky
@ 2022-04-27 20:03 ` Maxim Levitsky
  2022-04-27 20:03 ` [RFC PATCH v3 19/19] KVM: x86: nSVM: expose the nested AVIC to the guest Maxim Levitsky
  18 siblings, 0 replies; 57+ messages in thread
From: Maxim Levitsky @ 2022-04-27 20:03 UTC (permalink / raw)
  To: kvm
  Cc: Wanpeng Li, Vitaly Kuznetsov, Jani Nikula, Paolo Bonzini,
	Tvrtko Ursulin, Rodrigo Vivi, Zhenyu Wang, Joonas Lahtinen,
	Tom Lendacky, Ingo Molnar, David Airlie, Thomas Gleixner,
	Dave Hansen, x86, intel-gfx, Sean Christopherson, Daniel Vetter,
	Borislav Petkov, Joerg Roedel, linux-kernel, Jim Mattson,
	Zhi Wang, Brijesh Singh, H. Peter Anvin, intel-gvt-dev,
	dri-devel, Maxim Levitsky

By default, a vCPU's peers can send it doorbell messages only while
that vCPU is assigned (loaded) to a physical CPU.

When doorbell messages are not allowed, all of the vCPU's peers take
VM exits instead, which is suboptimal when the vCPU is not halted and
is therefore only temporarily not running in guest mode, e.g. because
it was scheduled out and/or had a userspace VM exit.

In this case the peers can't make the vCPU enter guest mode any
faster, so the VM exits they take achieve nothing.

Therefore this patch introduces a new non-strict mode, disabled by
default and enabled by setting the avic_doorbell_strict kvm_amd
module parameter to 0. When this mode is enabled and a vCPU is
scheduled out but not halted, its peers can continue sending doorbell
messages to the last physical CPU on which the vCPU was running.

Security-wise, in this mode a malicious guest with a compromised
guest kernel can, in some cases, slow down whatever is running on the
last physical CPU where one of its vCPUs ran, by spamming that CPU
with doorbell messages (hammering on ICR) from another of its vCPUs.

Thus this mode is disabled by default.

However, if the admin policy is a 1:1 vCPU/pCPU mapping, this mode
can be useful to avoid VM exits when a vCPU takes a userspace VM exit
and the like.

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
---
 arch/x86/kvm/svm/avic.c | 16 +++++++++-------
 arch/x86/kvm/svm/svm.c  | 25 +++++++++++++++++++++----
 2 files changed, 30 insertions(+), 11 deletions(-)

diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
index 149df26e17462..4bf0f00f13c12 100644
--- a/arch/x86/kvm/svm/avic.c
+++ b/arch/x86/kvm/svm/avic.c
@@ -1704,7 +1704,7 @@ avic_update_iommu_vcpu_affinity(struct kvm_vcpu *vcpu, int cpu, bool r)
 
 void __avic_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 {
-	u64 entry;
+	u64 old_entry, new_entry;
 	int h_physical_id = kvm_cpu_get_apicid(cpu);
 	struct vcpu_svm *svm = to_svm(vcpu);
 
@@ -1723,14 +1723,16 @@ void __avic_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 	if (kvm_vcpu_is_blocking(vcpu))
 		return;
 
-	entry = READ_ONCE(*(svm->avic_physical_id_cache));
-	WARN_ON(entry & AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK);
+	old_entry = READ_ONCE(*(svm->avic_physical_id_cache));
+	new_entry = old_entry;
 
-	entry &= ~AVIC_PHYSICAL_ID_ENTRY_HOST_PHYSICAL_ID_MASK;
-	entry |= (h_physical_id & AVIC_PHYSICAL_ID_ENTRY_HOST_PHYSICAL_ID_MASK);
-	entry |= AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK;
+	new_entry &= ~AVIC_PHYSICAL_ID_ENTRY_HOST_PHYSICAL_ID_MASK;
+	new_entry |= (h_physical_id & AVIC_PHYSICAL_ID_ENTRY_HOST_PHYSICAL_ID_MASK);
+	new_entry |= AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK;
+
+	if (old_entry != new_entry)
+		WRITE_ONCE(*(svm->avic_physical_id_cache), new_entry);
 
-	WRITE_ONCE(*(svm->avic_physical_id_cache), entry);
 	avic_update_iommu_vcpu_affinity(vcpu, h_physical_id, true);
 }
 
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index b31bab832360e..099329711ad13 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -191,6 +191,10 @@ module_param(avic, bool, 0444);
 static bool force_avic;
 module_param_unsafe(force_avic, bool, 0444);
 
+static bool avic_doorbell_strict = true;
+module_param(avic_doorbell_strict, bool, 0444);
+
+
 bool __read_mostly dump_invalid_vmcb;
 module_param(dump_invalid_vmcb, bool, 0644);
 
@@ -1402,10 +1406,23 @@ static void svm_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 
 static void svm_vcpu_put(struct kvm_vcpu *vcpu)
 {
-	if (kvm_vcpu_apicv_active(vcpu))
-		__avic_vcpu_put(vcpu);
-
-	__nested_avic_put(vcpu);
+	/*
+	 * Forbid this vCPU's peers from sending it doorbell messages,
+	 * unless the non-strict doorbell mode is used.
+	 *
+	 * In the non-strict mode, doorbell messages are forbidden only
+	 * when a vCPU blocks, since that is the only case in which an
+	 * IPI must be intercepted for correctness, to wake the vCPU up.
+	 *
+	 * However this reduces the isolation of the guest, since a flood
+	 * of spurious doorbell messages can slow down a CPU running
+	 * another task while this vCPU is scheduled out.
+	 */
+	if (avic_doorbell_strict) {
+		if (kvm_vcpu_apicv_active(vcpu))
+			__avic_vcpu_put(vcpu);
+		__nested_avic_put(vcpu);
+	}
 
 	svm_prepare_host_switch(vcpu);
 
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* [RFC PATCH v3 19/19] KVM: x86: nSVM: expose the nested AVIC to the guest
  2022-04-27 20:02 [RFC PATCH v3 00/19] RFC: nested AVIC Maxim Levitsky
                   ` (17 preceding siblings ...)
  2022-04-27 20:03 ` [RFC PATCH v3 18/19] KVM: x86: SVM/nSVM: add optional non strict AVIC doorbell mode Maxim Levitsky
@ 2022-04-27 20:03 ` Maxim Levitsky
  18 siblings, 0 replies; 57+ messages in thread
From: Maxim Levitsky @ 2022-04-27 20:03 UTC (permalink / raw)
  To: kvm
  Cc: Wanpeng Li, Vitaly Kuznetsov, Jani Nikula, Paolo Bonzini,
	Tvrtko Ursulin, Rodrigo Vivi, Zhenyu Wang, Joonas Lahtinen,
	Tom Lendacky, Ingo Molnar, David Airlie, Thomas Gleixner,
	Dave Hansen, x86, intel-gfx, Sean Christopherson, Daniel Vetter,
	Borislav Petkov, Joerg Roedel, linux-kernel, Jim Mattson,
	Zhi Wang, Brijesh Singh, H. Peter Anvin, intel-gvt-dev,
	dri-devel, Maxim Levitsky

This patch enables the nested AVIC support and exposes the AVIC
feature bit to the guest.
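
For reference, the guest can sanity-check the new bit with something
along these lines (illustrative only, not part of this patch; AVIC is
architecturally CPUID Fn8000_000A_EDX[13]):

	#include <cpuid.h>

	/* Return 1 if the (virtual) CPU advertises AVIC, 0 otherwise. */
	static int cpu_has_avic(void)
	{
		unsigned int eax, ebx, ecx, edx;

		if (!__get_cpuid(0x8000000a, &eax, &ebx, &ecx, &edx))
			return 0;

		return !!(edx & (1u << 13));	/* Fn8000_000A EDX bit 13 */
	}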

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
---
 arch/x86/kvm/svm/svm.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 099329711ad13..431281ccc40ef 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -4087,6 +4087,9 @@ static void svm_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
 		if (guest_cpuid_has(vcpu, X86_FEATURE_X2APIC))
 			kvm_set_apicv_inhibit(kvm, APICV_INHIBIT_REASON_X2APIC);
 	}
+
+	svm->avic_enabled = enable_apicv && guest_cpuid_has(vcpu, X86_FEATURE_AVIC);
+
 	init_vmcb_after_set_cpuid(vcpu);
 }
 
@@ -4827,6 +4830,9 @@ static __init void svm_set_cpu_caps(void)
 		if (vgif)
 			kvm_cpu_cap_set(X86_FEATURE_VGIF);
 
+		if (enable_apicv)
+			kvm_cpu_cap_set(X86_FEATURE_AVIC);
+
 		/* Nested VM can receive #VMEXIT instead of triggering #GP */
 		kvm_cpu_cap_set(X86_FEATURE_SVME_ADDR_CHK);
 	}
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH v3 02/19] KVM: x86: inhibit APICv/AVIC when the guest and/or host changes apic id/base from the defaults.
  2022-04-27 20:02 ` [RFC PATCH v3 02/19] KVM: x86: inhibit APICv/AVIC when the guest and/or host changes apic id/base from the defaults Maxim Levitsky
@ 2022-05-18  8:28   ` Chao Gao
  2022-05-18  9:50     ` Maxim Levitsky
  2022-05-19 16:06   ` Sean Christopherson
  1 sibling, 1 reply; 57+ messages in thread
From: Chao Gao @ 2022-05-18  8:28 UTC (permalink / raw)
  To: Maxim Levitsky
  Cc: kvm, Wanpeng Li, Vitaly Kuznetsov, Jani Nikula, Paolo Bonzini,
	Tvrtko Ursulin, Rodrigo Vivi, Zhenyu Wang, Joonas Lahtinen,
	Tom Lendacky, Ingo Molnar, David Airlie, Thomas Gleixner,
	Dave Hansen, x86, intel-gfx, Sean Christopherson, Daniel Vetter,
	Borislav Petkov, Joerg Roedel, linux-kernel, Jim Mattson,
	Zhi Wang, Brijesh Singh, H. Peter Anvin, intel-gvt-dev,
	dri-devel

On Wed, Apr 27, 2022 at 11:02:57PM +0300, Maxim Levitsky wrote:
>Neither of these settings should be changed by the guest and it is
>a burden to support it in the acceleration code, so just inhibit
>it instead.
>
>Also add a boolean 'apic_id_changed' to indicate if apic id ever changed.
>
>Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
>---
> arch/x86/include/asm/kvm_host.h |  3 +++
> arch/x86/kvm/lapic.c            | 25 ++++++++++++++++++++++---
> arch/x86/kvm/lapic.h            |  8 ++++++++
> 3 files changed, 33 insertions(+), 3 deletions(-)
>
>diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
>index 63eae00625bda..636df87542555 100644
>--- a/arch/x86/include/asm/kvm_host.h
>+++ b/arch/x86/include/asm/kvm_host.h
>@@ -1070,6 +1070,8 @@ enum kvm_apicv_inhibit {
> 	APICV_INHIBIT_REASON_ABSENT,
> 	/* AVIC is disabled because SEV doesn't support it */
> 	APICV_INHIBIT_REASON_SEV,
>+	/* APIC ID and/or APIC base was changed by the guest */
>+	APICV_INHIBIT_REASON_RO_SETTINGS,

You need to add it to check_apicv_inhibit_reasons as well.

> };
> 
> struct kvm_arch {
>@@ -1258,6 +1260,7 @@ struct kvm_arch {
> 	hpa_t	hv_root_tdp;
> 	spinlock_t hv_root_tdp_lock;
> #endif
>+	bool apic_id_changed;

What's the value of this boolean? No one reads it.

> };
> 
> struct kvm_vm_stat {
>diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
>index 66b0eb0bda94e..8996675b3ef4c 100644
>--- a/arch/x86/kvm/lapic.c
>+++ b/arch/x86/kvm/lapic.c
>@@ -2038,6 +2038,19 @@ static void apic_manage_nmi_watchdog(struct kvm_lapic *apic, u32 lvt0_val)
> 	}
> }
> 
>+static void kvm_lapic_check_initial_apic_id(struct kvm_lapic *apic)
>+{
>+	if (kvm_apic_has_initial_apic_id(apic))
>+		return;
>+
>+	pr_warn_once("APIC ID change is unsupported by KVM");

It is misleading because changing xAPIC ID is supported by KVM; it just
isn't compatible with APICv. Probably this pr_warn_once() should be
removed.

>+
>+	kvm_set_apicv_inhibit(apic->vcpu->kvm,
>+			APICV_INHIBIT_REASON_RO_SETTINGS);

The indentation here looks incorrect to me.
	kvm_set_apicv_inhibit(apic->vcpu->kvm,
			      APICV_INHIBIT_REASON_RO_SETTINGS);

>+
>+	apic->vcpu->kvm->arch.apic_id_changed = true;
>+}
>+
> static int kvm_lapic_reg_write(struct kvm_lapic *apic, u32 reg, u32 val)
> {
> 	int ret = 0;
>@@ -2046,9 +2059,11 @@ static int kvm_lapic_reg_write(struct kvm_lapic *apic, u32 reg, u32 val)
> 
> 	switch (reg) {
> 	case APIC_ID:		/* Local APIC ID */
>-		if (!apic_x2apic_mode(apic))
>+		if (!apic_x2apic_mode(apic)) {
>+
> 			kvm_apic_set_xapic_id(apic, val >> 24);
>-		else
>+			kvm_lapic_check_initial_apic_id(apic);
>+		} else
> 			ret = 1;
> 		break;
> 
>@@ -2335,8 +2350,11 @@ void kvm_lapic_set_base(struct kvm_vcpu *vcpu, u64 value)
> 			     MSR_IA32_APICBASE_BASE;
> 
> 	if ((value & MSR_IA32_APICBASE_ENABLE) &&
>-	     apic->base_address != APIC_DEFAULT_PHYS_BASE)
>+	     apic->base_address != APIC_DEFAULT_PHYS_BASE) {
>+		kvm_set_apicv_inhibit(apic->vcpu->kvm,
>+				APICV_INHIBIT_REASON_RO_SETTINGS);
> 		pr_warn_once("APIC base relocation is unsupported by KVM");
>+	}
> }
> 
> void kvm_apic_update_apicv(struct kvm_vcpu *vcpu)
>@@ -2649,6 +2667,7 @@ static int kvm_apic_state_fixup(struct kvm_vcpu *vcpu,
> 		}
> 	}
> 
>+	kvm_lapic_check_initial_apic_id(vcpu->arch.apic);
> 	return 0;
> }
> 
>diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
>index 4e4f8a22754f9..b9c406d383080 100644
>--- a/arch/x86/kvm/lapic.h
>+++ b/arch/x86/kvm/lapic.h
>@@ -252,4 +252,12 @@ static inline u8 kvm_xapic_id(struct kvm_lapic *apic)
> 	return kvm_lapic_get_reg(apic, APIC_ID) >> 24;
> }
> 
>+static inline bool kvm_apic_has_initial_apic_id(struct kvm_lapic *apic)
>+{
>+	if (apic_x2apic_mode(apic))
>+		return true;

I suggest warning of x2apic mode:
	if (WARN_ON_ONCE(apic_x2apic_mode(apic)))

Because it is weird that callers care about initial apic id when apic is
in x2apic mode.
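
I.e., something like this (sketch; assuming the helper's xAPIC path
compares kvm_xapic_id() against vcpu_id, as the rest of the patch
implies):

	static inline bool kvm_apic_has_initial_apic_id(struct kvm_lapic *apic)
	{
		if (WARN_ON_ONCE(apic_x2apic_mode(apic)))
			return true;

		return kvm_xapic_id(apic) == apic->vcpu->vcpu_id;
	}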

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH v3 02/19] KVM: x86: inhibit APICv/AVIC when the guest and/or host changes apic id/base from the defaults.
  2022-05-18  8:28   ` Chao Gao
@ 2022-05-18  9:50     ` Maxim Levitsky
  2022-05-18 11:51       ` Chao Gao
  2022-05-18 15:39       ` Sean Christopherson
  0 siblings, 2 replies; 57+ messages in thread
From: Maxim Levitsky @ 2022-05-18  9:50 UTC (permalink / raw)
  To: Chao Gao
  Cc: kvm, Wanpeng Li, Vitaly Kuznetsov, Jani Nikula, Paolo Bonzini,
	Tvrtko Ursulin, Rodrigo Vivi, Zhenyu Wang, Joonas Lahtinen,
	Tom Lendacky, Ingo Molnar, David Airlie, Thomas Gleixner,
	Dave Hansen, x86, intel-gfx, Sean Christopherson, Daniel Vetter,
	Borislav Petkov, Joerg Roedel, linux-kernel, Jim Mattson,
	Zhi Wang, Brijesh Singh, H. Peter Anvin, intel-gvt-dev,
	dri-devel

On Wed, 2022-05-18 at 16:28 +0800, Chao Gao wrote:
> On Wed, Apr 27, 2022 at 11:02:57PM +0300, Maxim Levitsky wrote:
> > Neither of these settings should be changed by the guest and it is
> > a burden to support it in the acceleration code, so just inhibit
> > it instead.
> > 
> > Also add a boolean 'apic_id_changed' to indicate if apic id ever changed.
> > 
> > Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
> > ---
> > arch/x86/include/asm/kvm_host.h |  3 +++
> > arch/x86/kvm/lapic.c            | 25 ++++++++++++++++++++++---
> > arch/x86/kvm/lapic.h            |  8 ++++++++
> > 3 files changed, 33 insertions(+), 3 deletions(-)
> > 
> > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > index 63eae00625bda..636df87542555 100644
> > --- a/arch/x86/include/asm/kvm_host.h
> > +++ b/arch/x86/include/asm/kvm_host.h
> > @@ -1070,6 +1070,8 @@ enum kvm_apicv_inhibit {
> > 	APICV_INHIBIT_REASON_ABSENT,
> > 	/* AVIC is disabled because SEV doesn't support it */
> > 	APICV_INHIBIT_REASON_SEV,
> > +	/* APIC ID and/or APIC base was changed by the guest */
> > +	APICV_INHIBIT_REASON_RO_SETTINGS,
> 
> You need to add it to check_apicv_inhibit_reasons as well.
True, forgot about it.

> 
> > };
> > 
> > struct kvm_arch {
> > @@ -1258,6 +1260,7 @@ struct kvm_arch {
> > 	hpa_t	hv_root_tdp;
> > 	spinlock_t hv_root_tdp_lock;
> > #endif
> > +	bool apic_id_changed;
> 
> What's the value of this boolean? No one reads it.

I use it in later patches to kill the guest during nested VM entry 
if it attempts to use nested AVIC after any vCPU changed APIC ID.

I mentioned this boolean in the commit description.

This boolean avoids the need to go over all vCPUs and checking
if they still have the initial apic id.
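
Roughly, the later patch does something like this on the nested VMRUN
path (sketch, not the final code):

	/* if vmcb12 wants to use the nested AVIC */
	if (vcpu->kvm->arch.apic_id_changed) {
		kvm_vm_bugged(vcpu->kvm);	/* kill the guest */
		return 1;	/* exact return value depends on the call site */
	}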

In the future maybe we can introduce a more generic 'taint'
bitmap with various flags like that, indicating that the guest
did something unexpected.

BTW, the other option in regard to the nested AVIC is just to ignore this issue completely.
The code itself always uses vcpu_id's, thus regardless of when/how often the guest changes
its apic ids, my code would just use the initial APIC ID values consistently.

In this case I won't need this boolean.

> 
> > };
> > 
> > struct kvm_vm_stat {
> > diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> > index 66b0eb0bda94e..8996675b3ef4c 100644
> > --- a/arch/x86/kvm/lapic.c
> > +++ b/arch/x86/kvm/lapic.c
> > @@ -2038,6 +2038,19 @@ static void apic_manage_nmi_watchdog(struct kvm_lapic *apic, u32 lvt0_val)
> > 	}
> > }
> > 
> > +static void kvm_lapic_check_initial_apic_id(struct kvm_lapic *apic)
> > +{
> > +	if (kvm_apic_has_initial_apic_id(apic))
> > +		return;
> > +
> > +	pr_warn_once("APIC ID change is unsupported by KVM");
> 
> It is misleading because changing xAPIC ID is supported by KVM; it just
> isn't compatible with APICv. Probably this pr_warn_once() should be
> removed.

Honestly, since nobody uses this feature, I am not sure whether to call
it supported; I am sure that KVM has more bugs in regard to using a
non-standard APIC ID. This warning might hopefully make someone
complain about it if this feature is actually used somewhere.

> 
> > +
> > +	kvm_set_apicv_inhibit(apic->vcpu->kvm,
> > +			APICV_INHIBIT_REASON_RO_SETTINGS);
> 
> The indentation here looks incorrect to me.
> 	kvm_set_apicv_inhibit(apic->vcpu->kvm,
> 			      APICV_INHIBIT_REASON_RO_SETTINGS);

True, will fix.

> 
> > +
> > +	apic->vcpu->kvm->arch.apic_id_changed = true;
> > +}
> > +
> > static int kvm_lapic_reg_write(struct kvm_lapic *apic, u32 reg, u32 val)
> > {
> > 	int ret = 0;
> > @@ -2046,9 +2059,11 @@ static int kvm_lapic_reg_write(struct kvm_lapic *apic, u32 reg, u32 val)
> > 
> > 	switch (reg) {
> > 	case APIC_ID:		/* Local APIC ID */
> > -		if (!apic_x2apic_mode(apic))
> > +		if (!apic_x2apic_mode(apic)) {
> > +
> > 			kvm_apic_set_xapic_id(apic, val >> 24);
> > -		else
> > +			kvm_lapic_check_initial_apic_id(apic);
> > +		} else
> > 			ret = 1;
> > 		break;
> > 
> > @@ -2335,8 +2350,11 @@ void kvm_lapic_set_base(struct kvm_vcpu *vcpu, u64 value)
> > 			     MSR_IA32_APICBASE_BASE;
> > 
> > 	if ((value & MSR_IA32_APICBASE_ENABLE) &&
> > -	     apic->base_address != APIC_DEFAULT_PHYS_BASE)
> > +	     apic->base_address != APIC_DEFAULT_PHYS_BASE) {
> > +		kvm_set_apicv_inhibit(apic->vcpu->kvm,
> > +				APICV_INHIBIT_REASON_RO_SETTINGS);
> > 		pr_warn_once("APIC base relocation is unsupported by KVM");
> > +	}
> > }
> > 
> > void kvm_apic_update_apicv(struct kvm_vcpu *vcpu)
> > @@ -2649,6 +2667,7 @@ static int kvm_apic_state_fixup(struct kvm_vcpu *vcpu,
> > 		}
> > 	}
> > 
> > +	kvm_lapic_check_initial_apic_id(vcpu->arch.apic);
> > 	return 0;
> > }
> > 
> > diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
> > index 4e4f8a22754f9..b9c406d383080 100644
> > --- a/arch/x86/kvm/lapic.h
> > +++ b/arch/x86/kvm/lapic.h
> > @@ -252,4 +252,12 @@ static inline u8 kvm_xapic_id(struct kvm_lapic *apic)
> > 	return kvm_lapic_get_reg(apic, APIC_ID) >> 24;
> > }
> > 
> > +static inline bool kvm_apic_has_initial_apic_id(struct kvm_lapic *apic)
> > +{
> > +	if (apic_x2apic_mode(apic))
> > +		return true;
> 
> I suggest warning of x2apic mode:
> 	if (WARN_ON_ONCE(apic_x2apic_mode(apic)))
> 
> Because it is weird that callers care about initial apic id when apic is
> in x2apic mode.

Yes, but due to something I don't agree with, yet gave up arguing
about, the KVM userspace API kind of supports setting APIC ID !=
initial APIC ID even in x2apic mode, and disallowing it is considered
an API breakage, therefore this case is possible.

This case should still trigger a warning in kvm_lapic_check_initial_apic_id.

Best regards,
	Maxim Levitsky


> 



^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH v3 02/19] KVM: x86: inhibit APICv/AVIC when the guest and/or host changes apic id/base from the defaults.
  2022-05-18  9:50     ` Maxim Levitsky
@ 2022-05-18 11:51       ` Chao Gao
  2022-05-18 12:36         ` Maxim Levitsky
  2022-05-18 15:39       ` Sean Christopherson
  1 sibling, 1 reply; 57+ messages in thread
From: Chao Gao @ 2022-05-18 11:51 UTC (permalink / raw)
  To: Maxim Levitsky
  Cc: kvm, Wanpeng Li, Vitaly Kuznetsov, Jani Nikula, Paolo Bonzini,
	Tvrtko Ursulin, Rodrigo Vivi, Zhenyu Wang, Joonas Lahtinen,
	Tom Lendacky, Ingo Molnar, David Airlie, Thomas Gleixner,
	Dave Hansen, x86, intel-gfx, Sean Christopherson, Daniel Vetter,
	Borislav Petkov, Joerg Roedel, linux-kernel, Jim Mattson,
	Zhi Wang, Brijesh Singh, H. Peter Anvin, intel-gvt-dev,
	dri-devel

On Wed, May 18, 2022 at 12:50:27PM +0300, Maxim Levitsky wrote:
>> > struct kvm_arch {
>> > @@ -1258,6 +1260,7 @@ struct kvm_arch {
>> > 	hpa_t	hv_root_tdp;
>> > 	spinlock_t hv_root_tdp_lock;
>> > #endif
>> > +	bool apic_id_changed;
>> 
>> What's the value of this boolean? No one reads it.
>
>I use it in later patches to kill the guest during nested VM entry 
>if it attempts to use nested AVIC after any vCPU changed APIC ID.
>
>I mentioned this boolean in the commit description.
>
>This boolean avoids the need to go over all vCPUs and checking
>if they still have the initial apic id.

Do you want to kill the guest if the APIC base got changed? If yes,
you can check whether APICV_INHIBIT_REASON_RO_SETTINGS is set and
drop the boolean.
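
I.e. something like this instead of the new field (sketch; assumes the
inhibit bits keep living in kvm->arch.apicv_inhibit_reasons):

	if (test_bit(APICV_INHIBIT_REASON_RO_SETTINGS,
		     &kvm->arch.apicv_inhibit_reasons))
		return -EINVAL;	/* or however the guest gets killed */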

>
>In the future maybe we can introduce a more generic 'taint'
>bitmap with various flags like that, indicating that the guest
>did something unexpected.
>
>BTW, the other option in regard to the nested AVIC is just to ignore this issue completely.
>The code itself always uses vcpu_id's, thus regardless of when/how often the guest changes
>its apic ids, my code would just use the initial APIC ID values consistently.
>
>In this case I won't need this boolean.
>
>> 
>> > };
>> > 
>> > struct kvm_vm_stat {
>> > diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
>> > index 66b0eb0bda94e..8996675b3ef4c 100644
>> > --- a/arch/x86/kvm/lapic.c
>> > +++ b/arch/x86/kvm/lapic.c
>> > @@ -2038,6 +2038,19 @@ static void apic_manage_nmi_watchdog(struct kvm_lapic *apic, u32 lvt0_val)
>> > 	}
>> > }
>> > 
>> > +static void kvm_lapic_check_initial_apic_id(struct kvm_lapic *apic)
>> > +{
>> > +	if (kvm_apic_has_initial_apic_id(apic))
>> > +		return;
>> > +
>> > +	pr_warn_once("APIC ID change is unsupported by KVM");
>> 
>> It is misleading because changing xAPIC ID is supported by KVM; it just
>> isn't compatible with APICv. Probably this pr_warn_once() should be
>> removed.
>
>Honestly since nobody uses this feature, I am not sure if to call this supported,
>I am sure that KVM has more bugs in regard of using non standard APIC ID.
>This warning might hopefuly make someone complain about it if this
>feature is actually used somewhere.

Now I get you. This is fine with me.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH v3 02/19] KVM: x86: inhibit APICv/AVIC when the guest and/or host changes apic id/base from the defaults.
  2022-05-18 11:51       ` Chao Gao
@ 2022-05-18 12:36         ` Maxim Levitsky
  0 siblings, 0 replies; 57+ messages in thread
From: Maxim Levitsky @ 2022-05-18 12:36 UTC (permalink / raw)
  To: Chao Gao
  Cc: kvm, Wanpeng Li, Vitaly Kuznetsov, Jani Nikula, Paolo Bonzini,
	Tvrtko Ursulin, Rodrigo Vivi, Zhenyu Wang, Joonas Lahtinen,
	Tom Lendacky, Ingo Molnar, David Airlie, Thomas Gleixner,
	Dave Hansen, x86, intel-gfx, Sean Christopherson, Daniel Vetter,
	Borislav Petkov, Joerg Roedel, linux-kernel, Jim Mattson,
	Zhi Wang, Brijesh Singh, H. Peter Anvin, intel-gvt-dev,
	dri-devel

On Wed, 2022-05-18 at 19:51 +0800, Chao Gao wrote:
> On Wed, May 18, 2022 at 12:50:27PM +0300, Maxim Levitsky wrote:
> > > > struct kvm_arch {
> > > > @@ -1258,6 +1260,7 @@ struct kvm_arch {
> > > > 	hpa_t	hv_root_tdp;
> > > > 	spinlock_t hv_root_tdp_lock;
> > > > #endif
> > > > +	bool apic_id_changed;
> > > 
> > > What's the value of this boolean? No one reads it.
> > 
> > I use it in later patches to kill the guest during nested VM entry 
> > if it attempts to use nested AVIC after any vCPU changed APIC ID.
> > 
> > I mentioned this boolean in the commit description.
> > 
> > This boolean avoids the need to go over all vCPUs and checking
> > if they still have the initial apic id.
> 
> Do you want to kill the guest if APIC base got changed? If yes,
> you can check if APICV_INHIBIT_REASON_RO_SETTINGS is set and save
> the boolean.

Yep, I threw in the APIC base just because I can. It doesn't matter to
my nested AVIC logic at all, but since it is also something that guests
don't change, I also don't mind if this leads to an inhibit and to
killing the guest if it attempts to use the nested AVIC.

That boolean should have the same value as the APICV_INHIBIT_REASON_RO_SETTINGS
inhibit, so yes I can instead check if the inhibit is active.

I don't know if that is cleaner than the boolean though; an individual
inhibit value is currently not something that anybody uses in logic.

Best regards,
	Maxim Levitsky


> 
> > In the future maybe we can introduce a more generic 'taint'
> > bitmap with various flags like that, indicating that the guest
> > did something unexpected.
> > 
> > BTW, the other option in regard to the nested AVIC is just to ignore this issue completely.
> > The code itself always uses vcpu_id's, thus regardless of when/how often the guest changes
> > its apic ids, my code would just use the initial APIC ID values consistently.
> > 
> > In this case I won't need this boolean.
> > 
> > > > };
> > > > 
> > > > struct kvm_vm_stat {
> > > > diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> > > > index 66b0eb0bda94e..8996675b3ef4c 100644
> > > > --- a/arch/x86/kvm/lapic.c
> > > > +++ b/arch/x86/kvm/lapic.c
> > > > @@ -2038,6 +2038,19 @@ static void apic_manage_nmi_watchdog(struct kvm_lapic *apic, u32 lvt0_val)
> > > > 	}
> > > > }
> > > > 
> > > > +static void kvm_lapic_check_initial_apic_id(struct kvm_lapic *apic)
> > > > +{
> > > > +	if (kvm_apic_has_initial_apic_id(apic))
> > > > +		return;
> > > > +
> > > > +	pr_warn_once("APIC ID change is unsupported by KVM");
> > > 
> > > It is misleading because changing xAPIC ID is supported by KVM; it just
> > > isn't compatible with APICv. Probably this pr_warn_once() should be
> > > removed.
> > 
> > Honestly since nobody uses this feature, I am not sure if to call this supported,
> > I am sure that KVM has more bugs in regard of using non standard APIC ID.
> > This warning might hopefuly make someone complain about it if this
> > feature is actually used somewhere.
> 
> Now I got you. It is fine to me.
> 



^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH v3 02/19] KVM: x86: inhibit APICv/AVIC when the guest and/or host changes apic id/base from the defaults.
  2022-05-18  9:50     ` Maxim Levitsky
  2022-05-18 11:51       ` Chao Gao
@ 2022-05-18 15:39       ` Sean Christopherson
  2022-05-18 17:15         ` Maxim Levitsky
  1 sibling, 1 reply; 57+ messages in thread
From: Sean Christopherson @ 2022-05-18 15:39 UTC (permalink / raw)
  To: Maxim Levitsky
  Cc: Chao Gao, kvm, Wanpeng Li, Vitaly Kuznetsov, Jani Nikula,
	Paolo Bonzini, Tvrtko Ursulin, Rodrigo Vivi, Zhenyu Wang,
	Joonas Lahtinen, Tom Lendacky, Ingo Molnar, David Airlie,
	Thomas Gleixner, Dave Hansen, x86, intel-gfx, Daniel Vetter,
	Borislav Petkov, Joerg Roedel, linux-kernel, Jim Mattson,
	Zhi Wang, Brijesh Singh, H. Peter Anvin, intel-gvt-dev,
	dri-devel

On Wed, May 18, 2022, Maxim Levitsky wrote:
> On Wed, 2022-05-18 at 16:28 +0800, Chao Gao wrote:
> > > struct kvm_arch {
> > > @@ -1258,6 +1260,7 @@ struct kvm_arch {
> > > 	hpa_t	hv_root_tdp;
> > > 	spinlock_t hv_root_tdp_lock;
> > > #endif
> > > +	bool apic_id_changed;
> > 
> > What's the value of this boolean? No one reads it.
> 
> I use it in later patches to kill the guest during nested VM entry 
> if it attempts to use nested AVIC after any vCPU changed APIC ID.

Then the flag should be introduced in the later patch, because (a) it's dead code
if that patch is never merged and (b) it's impossible to review this patch for
correctness without seeing the usage, e.g. setting apic_id_changed isn't guarded
with a lock and so the usage may or may not be susceptible to races.

> > > +	apic->vcpu->kvm->arch.apic_id_changed = true;
> > > +}
> > > +

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH v3 01/19] KVM: x86: document AVIC/APICv inhibit reasons
  2022-04-27 20:02 ` [RFC PATCH v3 01/19] KVM: x86: document AVIC/APICv inhibit reasons Maxim Levitsky
@ 2022-05-18 15:56   ` Sean Christopherson
  2022-05-18 17:13     ` Maxim Levitsky
  0 siblings, 1 reply; 57+ messages in thread
From: Sean Christopherson @ 2022-05-18 15:56 UTC (permalink / raw)
  To: Maxim Levitsky
  Cc: kvm, Wanpeng Li, Vitaly Kuznetsov, Jani Nikula, Paolo Bonzini,
	Tvrtko Ursulin, Rodrigo Vivi, Zhenyu Wang, Joonas Lahtinen,
	Tom Lendacky, Ingo Molnar, David Airlie, Thomas Gleixner,
	Dave Hansen, x86, intel-gfx, Daniel Vetter, Borislav Petkov,
	Joerg Roedel, linux-kernel, Jim Mattson, Zhi Wang, Brijesh Singh,
	H. Peter Anvin, intel-gvt-dev, dri-devel

On Wed, Apr 27, 2022, Maxim Levitsky wrote:
> These days there are too many AVIC/APICv inhibit
> reasons, and it doesn't hurt to have some documentation
> for them.

Please wrap at ~75 chars.

> Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
> ---
>  arch/x86/include/asm/kvm_host.h | 15 +++++++++++++++
>  1 file changed, 15 insertions(+)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index f164c6c1514a4..63eae00625bda 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1046,14 +1046,29 @@ struct kvm_x86_msr_filter {
>  };
>  
>  enum kvm_apicv_inhibit {
> +	/* APICv/AVIC is disabled by module param and/or not supported in hardware */

Rather than tag every one as APICv vs. AVIC, what about reorganizing the enums so
that the common vs. AVIC flags are bundled together?  And then the redundant info
in the comments about "XYZ is inhibited" can go away too, i.e. the individual
comments can focus on explaining what triggers the inhibit (and for some, why that
action is incompatible with APIC virtualization).

E.g.
	/***************************************************************/
	/* INHIBITs are relevant to both Intel's APICv and AMD's AVIC. */
	/***************************************************************/

	/* APIC/AVIC is unsupported and/or disabled via module param. */
	APICV_INHIBIT_REASON_DISABLE,

	/* The local APIC is not in-kernel.  See KVM_CREATE_IRQCHIP. */
	APICV_INHIBIT_REASON_ABSENT,

	/*
	 * At least one IRQ vector is configured for HyperV's AutoEOI, which
	 * requires manually injecting the IRQ to do EOI on behalf of the guest.
	 */
	APICV_INHIBIT_REASON_HYPERV,
	

	/**********************************************/
	/* INHIBITs relevant only to AMD's AVIC. */
	/**********************************************/

>  	APICV_INHIBIT_REASON_DISABLE,
> +	/* APICv/AVIC is inhibited because AutoEOI feature is being used by a HyperV guest*/
>  	APICV_INHIBIT_REASON_HYPERV,
> +	/* AVIC is inhibited on a CPU because it runs a nested guest */
>  	APICV_INHIBIT_REASON_NESTED,
> +	/* AVIC is inhibited due to wait for an irq window (AVIC doesn't support this) */
>  	APICV_INHIBIT_REASON_IRQWIN,
> +	/*
> +	 * AVIC is inhibited because i8254 're-inject' mode is used
> +	 * which needs EOI intercept which AVIC doesn't support
> +	 */
>  	APICV_INHIBIT_REASON_PIT_REINJ,
> +	/* AVIC is inhibited because the guest has x2apic in its CPUID*/
>  	APICV_INHIBIT_REASON_X2APIC,
> +	/* AVIC/APICv is inhibited because KVM_GUESTDBG_BLOCKIRQ was enabled */
>  	APICV_INHIBIT_REASON_BLOCKIRQ,
> +	/*
> +	 * AVIC/APICv is inhibited because the guest didn't yet

s/guest/userspace

> +	 * enable kernel/split irqchip
> +	 */
>  	APICV_INHIBIT_REASON_ABSENT,
> +	/* AVIC is disabled because SEV doesn't support it */
>  	APICV_INHIBIT_REASON_SEV,
>  };
>  
> -- 
> 2.26.3
> 

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH v3 01/19] KVM: x86: document AVIC/APICv inhibit reasons
  2022-05-18 15:56   ` Sean Christopherson
@ 2022-05-18 17:13     ` Maxim Levitsky
  0 siblings, 0 replies; 57+ messages in thread
From: Maxim Levitsky @ 2022-05-18 17:13 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: kvm, Wanpeng Li, Vitaly Kuznetsov, Jani Nikula, Paolo Bonzini,
	Tvrtko Ursulin, Rodrigo Vivi, Zhenyu Wang, Joonas Lahtinen,
	Tom Lendacky, Ingo Molnar, David Airlie, Thomas Gleixner,
	Dave Hansen, x86, intel-gfx, Daniel Vetter, Borislav Petkov,
	Joerg Roedel, linux-kernel, Jim Mattson, Zhi Wang, Brijesh Singh,
	H. Peter Anvin, intel-gvt-dev, dri-devel

On Wed, 2022-05-18 at 15:56 +0000, Sean Christopherson wrote:
> On Wed, Apr 27, 2022, Maxim Levitsky wrote:
> > These days there are too many AVIC/APICv inhibit
> > reasons, and it doesn't hurt to have some documentation
> > for them.
> 
> Please wrap at ~75 chars.
> 
> > Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
> > ---
> >  arch/x86/include/asm/kvm_host.h | 15 +++++++++++++++
> >  1 file changed, 15 insertions(+)
> > 
> > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > index f164c6c1514a4..63eae00625bda 100644
> > --- a/arch/x86/include/asm/kvm_host.h
> > +++ b/arch/x86/include/asm/kvm_host.h
> > @@ -1046,14 +1046,29 @@ struct kvm_x86_msr_filter {
> >  };
> >  
> >  enum kvm_apicv_inhibit {
> > +	/* APICv/AVIC is disabled by module param and/or not supported in hardware */
> 
> Rather than tag every one as APICv vs. AVIC, what about reorganizing the enums so
> that the common vs. AVIC flags are bundled together?  And then the redundant info
> in the comments about "XYZ is inhibited" can go away too, i.e. the individual
> comments can focus on explaining what triggers the inhibit (and for some, why that
> action is incompatible with APIC virtualization).

Very good idea, will do!

Best regards,
	Maxim Levitsky

> 
> E.g.
> 	/***************************************************************/
> 	/* INHIBITs are relevant to both Intel's APICv and AMD's AVIC. */
> 	/***************************************************************/
> 
> 	/* APIC/AVIC is unsupported and/or disabled via module param. */
> 	APICV_INHIBIT_REASON_DISABLE,
> 
> 	/* The local APIC is not in-kernel.  See KVM_CREATE_IRQCHIP. */
> 	APICV_INHIBIT_REASON_ABSENT,
> 
> 	/*
> 	 * At least one IRQ vector is configured for HyperV's AutoEOI, which
> 	 * requires manually injecting the IRQ to do EOI on behalf of the guest.
> 	 */
> 	APICV_INHIBIT_REASON_HYPERV,
> 	
> 
> 	/**********************************************/
> 	/* INHIBITs relevant only to AMD's AVIC. */
> 	/**********************************************/
> 
> >  	APICV_INHIBIT_REASON_DISABLE,
> > +	/* APICv/AVIC is inhibited because AutoEOI feature is being used by a HyperV guest*/
> >  	APICV_INHIBIT_REASON_HYPERV,
> > +	/* AVIC is inhibited on a CPU because it runs a nested guest */
> >  	APICV_INHIBIT_REASON_NESTED,
> > +	/* AVIC is inhibited due to wait for an irq window (AVIC doesn't support this) */
> >  	APICV_INHIBIT_REASON_IRQWIN,
> > +	/*
> > +	 * AVIC is inhibited because i8254 're-inject' mode is used
> > +	 * which needs EOI intercept which AVIC doesn't support
> > +	 */
> >  	APICV_INHIBIT_REASON_PIT_REINJ,
> > +	/* AVIC is inhibited because the guest has x2apic in its CPUID*/
> >  	APICV_INHIBIT_REASON_X2APIC,
> > +	/* AVIC/APICv is inhibited because KVM_GUESTDBG_BLOCKIRQ was enabled */
> >  	APICV_INHIBIT_REASON_BLOCKIRQ,
> > +	/*
> > +	 * AVIC/APICv is inhibited because the guest didn't yet
> 
> s/guest/userspace
> 
> > +	 * enable kernel/split irqchip
> > +	 */
> >  	APICV_INHIBIT_REASON_ABSENT,
> > +	/* AVIC is disabled because SEV doesn't support it */
> >  	APICV_INHIBIT_REASON_SEV,
> >  };
> >  
> > -- 
> > 2.26.3
> > 



^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH v3 02/19] KVM: x86: inhibit APICv/AVIC when the guest and/or host changes apic id/base from the defaults.
  2022-05-18 15:39       ` Sean Christopherson
@ 2022-05-18 17:15         ` Maxim Levitsky
  0 siblings, 0 replies; 57+ messages in thread
From: Maxim Levitsky @ 2022-05-18 17:15 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Chao Gao, kvm, Wanpeng Li, Vitaly Kuznetsov, Jani Nikula,
	Paolo Bonzini, Tvrtko Ursulin, Rodrigo Vivi, Zhenyu Wang,
	Joonas Lahtinen, Tom Lendacky, Ingo Molnar, David Airlie,
	Thomas Gleixner, Dave Hansen, x86, intel-gfx, Daniel Vetter,
	Borislav Petkov, Joerg Roedel, linux-kernel, Jim Mattson,
	Zhi Wang, Brijesh Singh, H. Peter Anvin, intel-gvt-dev,
	dri-devel

On Wed, 2022-05-18 at 15:39 +0000, Sean Christopherson wrote:
> On Wed, May 18, 2022, Maxim Levitsky wrote:
> > On Wed, 2022-05-18 at 16:28 +0800, Chao Gao wrote:
> > > > struct kvm_arch {
> > > > @@ -1258,6 +1260,7 @@ struct kvm_arch {
> > > > 	hpa_t	hv_root_tdp;
> > > > 	spinlock_t hv_root_tdp_lock;
> > > > #endif
> > > > +	bool apic_id_changed;
> > > 
> > > What's the value of this boolean? No one reads it.
> > 
> > I use it in later patches to kill the guest during nested VM entry 
> > if it attempts to use nested AVIC after any vCPU changed APIC ID.
> 
> Then the flag should be introduced in the later patch, because (a) it's dead code
> if that patch is never merged and (b) it's impossible to review this patch for
> correctness without seeing the usage, e.g. setting apic_id_changed isn't guarded
> with a lock and so the usage may or may not be susceptible to races.

I can't disagree with you on this, this was just somewhat a hack I wasn't sure
(and not yet 100% sure I will move forward with) so I cut this corner.

Thanks for the review!

Best regards,
	Maxim Levitsky

> 
> > > > +	apic->vcpu->kvm->arch.apic_id_changed = true;
> > > > +}
> > > > +



^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH v3 02/19] KVM: x86: inhibit APICv/AVIC when the guest and/or host changes apic id/base from the defaults.
  2022-04-27 20:02 ` [RFC PATCH v3 02/19] KVM: x86: inhibit APICv/AVIC when the guest and/or host changes apic id/base from the defaults Maxim Levitsky
  2022-05-18  8:28   ` Chao Gao
@ 2022-05-19 16:06   ` Sean Christopherson
  2022-05-22  9:03     ` Maxim Levitsky
  2022-06-23  9:44     ` Maxim Levitsky
  1 sibling, 2 replies; 57+ messages in thread
From: Sean Christopherson @ 2022-05-19 16:06 UTC (permalink / raw)
  To: Maxim Levitsky
  Cc: kvm, Wanpeng Li, Vitaly Kuznetsov, Jani Nikula, Paolo Bonzini,
	Tvrtko Ursulin, Rodrigo Vivi, Zhenyu Wang, Joonas Lahtinen,
	Tom Lendacky, Ingo Molnar, David Airlie, Thomas Gleixner,
	Dave Hansen, x86, intel-gfx, Daniel Vetter, Borislav Petkov,
	Joerg Roedel, linux-kernel, Jim Mattson, Zhi Wang, Brijesh Singh,
	H. Peter Anvin, intel-gvt-dev, dri-devel

On Wed, Apr 27, 2022, Maxim Levitsky wrote:
> Neither of these settings should be changed by the guest and it is
> a burden to support it in the acceleration code, so just inhibit
> it instead.
> 
> Also add a boolean 'apic_id_changed' to indicate if apic id ever changed.
> 
> Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
> ---
>  arch/x86/include/asm/kvm_host.h |  3 +++
>  arch/x86/kvm/lapic.c            | 25 ++++++++++++++++++++++---
>  arch/x86/kvm/lapic.h            |  8 ++++++++
>  3 files changed, 33 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 63eae00625bda..636df87542555 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1070,6 +1070,8 @@ enum kvm_apicv_inhibit {
>  	APICV_INHIBIT_REASON_ABSENT,
>  	/* AVIC is disabled because SEV doesn't support it */
>  	APICV_INHIBIT_REASON_SEV,
> +	/* APIC ID and/or APIC base was changed by the guest */

I don't see any reason to inhibit APICv if the APIC base is changed.  KVM has
never supported that, and disabling APICv won't "fix" anything.

Ignoring that is a minor simplification, but also allows for a more intuitive
name, e.g.

	APICV_INHIBIT_REASON_APIC_ID_MODIFIED,

The inhibit also needs to be added avic_check_apicv_inhibit_reasons() and
vmx_check_apicv_inhibit_reasons().

> +	APICV_INHIBIT_REASON_RO_SETTINGS,
>  };
>  
>  struct kvm_arch {
> @@ -1258,6 +1260,7 @@ struct kvm_arch {
>  	hpa_t	hv_root_tdp;
>  	spinlock_t hv_root_tdp_lock;
>  #endif
> +	bool apic_id_changed;
>  };
>  
>  struct kvm_vm_stat {
> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> index 66b0eb0bda94e..8996675b3ef4c 100644
> --- a/arch/x86/kvm/lapic.c
> +++ b/arch/x86/kvm/lapic.c
> @@ -2038,6 +2038,19 @@ static void apic_manage_nmi_watchdog(struct kvm_lapic *apic, u32 lvt0_val)
>  	}
>  }
>  
> +static void kvm_lapic_check_initial_apic_id(struct kvm_lapic *apic)

The "check" part is misleading/confusing.  "check" helpers usually query and return
state.  I assume you avoided "changed" because the ID may or may not actually be
changing.  Maybe kvm_apic_id_updated()?  Ah, better idea.  What about
kvm_lapic_xapic_id_updated()?  See below for reasoning.

> +{
> +	if (kvm_apic_has_initial_apic_id(apic))

Rather than add a single-use helper, invoke the helper from kvm_apic_state_fixup()
in the !x2APIC path, then this can KVM_BUG_ON() x2APIC to help document that KVM
should never allow the ID to change for x2APIC.

> +		return;
> +
> +	pr_warn_once("APIC ID change is unsupported by KVM");

It's supported (modulo x2APIC shenanigans), otherwise KVM wouldn't need to disable
APICv.

> +	kvm_set_apicv_inhibit(apic->vcpu->kvm,
> +			APICV_INHIBIT_REASON_RO_SETTINGS);
> +
> +	apic->vcpu->kvm->arch.apic_id_changed = true;
> +}
> +
>  static int kvm_lapic_reg_write(struct kvm_lapic *apic, u32 reg, u32 val)
>  {
>  	int ret = 0;
> @@ -2046,9 +2059,11 @@ static int kvm_lapic_reg_write(struct kvm_lapic *apic, u32 reg, u32 val)
>  
>  	switch (reg) {
>  	case APIC_ID:		/* Local APIC ID */
> -		if (!apic_x2apic_mode(apic))
> +		if (!apic_x2apic_mode(apic)) {
> +

Spurious newline.

>  			kvm_apic_set_xapic_id(apic, val >> 24);
> -		else
> +			kvm_lapic_check_initial_apic_id(apic);
> +		} else

Needs curly braces for both paths.

>  			ret = 1;
>  		break;
>  

E.g.

---
 arch/x86/include/asm/kvm_host.h |  1 +
 arch/x86/kvm/lapic.c            | 21 +++++++++++++++++++--
 arch/x86/kvm/svm/avic.c         |  3 ++-
 arch/x86/kvm/vmx/vmx.c          |  3 ++-
 4 files changed, 24 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index d895d25c5b2f..d888fa1bae77 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1071,6 +1071,7 @@ enum kvm_apicv_inhibit {
 	APICV_INHIBIT_REASON_BLOCKIRQ,
 	APICV_INHIBIT_REASON_ABSENT,
 	APICV_INHIBIT_REASON_SEV,
+	APICV_INHIBIT_REASON_APIC_ID_MODIFIED,
 };

 struct kvm_arch {
diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 5fd678c90288..6fe8f20f03d8 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -2039,6 +2039,19 @@ static void apic_manage_nmi_watchdog(struct kvm_lapic *apic, u32 lvt0_val)
 	}
 }

+static void kvm_lapic_xapic_id_updated(struct kvm_lapic *apic)
+{
+	struct kvm *kvm = apic->vcpu->kvm;
+
+	if (KVM_BUG_ON(apic_x2apic_mode(apic), kvm))
+		return;
+
+	if (kvm_xapic_id(apic) == apic->vcpu->vcpu_id)
+		return;
+
+	kvm_set_apicv_inhibit(kvm, APICV_INHIBIT_REASON_APIC_ID_MODIFIED);
+}
+
 static int kvm_lapic_reg_write(struct kvm_lapic *apic, u32 reg, u32 val)
 {
 	int ret = 0;
@@ -2047,10 +2060,12 @@ static int kvm_lapic_reg_write(struct kvm_lapic *apic, u32 reg, u32 val)

 	switch (reg) {
 	case APIC_ID:		/* Local APIC ID */
-		if (!apic_x2apic_mode(apic))
+		if (!apic_x2apic_mode(apic)) {
 			kvm_apic_set_xapic_id(apic, val >> 24);
-		else
+			kvm_lapic_xapic_id_updated(apic);
+		} else {
 			ret = 1;
+		}
 		break;

 	case APIC_TASKPRI:
@@ -2665,6 +2680,8 @@ static int kvm_apic_state_fixup(struct kvm_vcpu *vcpu,
 			icr = __kvm_lapic_get_reg64(s->regs, APIC_ICR);
 			__kvm_lapic_set_reg(s->regs, APIC_ICR2, icr >> 32);
 		}
+	} else {
+		kvm_lapic_xapic_id_updated(vcpu->arch.apic);
 	}

 	return 0;
diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
index 54fe03714f8a..239c3e8b1f3f 100644
--- a/arch/x86/kvm/svm/avic.c
+++ b/arch/x86/kvm/svm/avic.c
@@ -910,7 +910,8 @@ bool avic_check_apicv_inhibit_reasons(enum kvm_apicv_inhibit reason)
 			  BIT(APICV_INHIBIT_REASON_PIT_REINJ) |
 			  BIT(APICV_INHIBIT_REASON_X2APIC) |
 			  BIT(APICV_INHIBIT_REASON_BLOCKIRQ) |
-			  BIT(APICV_INHIBIT_REASON_SEV);
+			  BIT(APICV_INHIBIT_REASON_SEV) |
+			  BIT(APICV_INHIBIT_REASON_APIC_ID_MODIFIED);

 	return supported & BIT(reason);
 }
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index b06eafa5884d..941adade21ea 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7818,7 +7818,8 @@ static bool vmx_check_apicv_inhibit_reasons(enum kvm_apicv_inhibit reason)
 	ulong supported = BIT(APICV_INHIBIT_REASON_DISABLE) |
 			  BIT(APICV_INHIBIT_REASON_ABSENT) |
 			  BIT(APICV_INHIBIT_REASON_HYPERV) |
-			  BIT(APICV_INHIBIT_REASON_BLOCKIRQ);
+			  BIT(APICV_INHIBIT_REASON_BLOCKIRQ) |
+			  BIT(APICV_INHIBIT_REASON_APIC_ID_MODIFIED);

 	return supported & BIT(reason);
 }

base-commit: 6ab6e3842d18e4529fa524fb6c668ae8a8bf54f4
--


^ permalink raw reply related	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH v3 03/19] KVM: x86: SVM: remove avic's broken code that updated APIC ID
  2022-04-27 20:02 ` [RFC PATCH v3 03/19] KVM: x86: SVM: remove avic's broken code that updated APIC ID Maxim Levitsky
@ 2022-05-19 16:10   ` Sean Christopherson
  2022-05-22  9:01     ` Maxim Levitsky
  0 siblings, 1 reply; 57+ messages in thread
From: Sean Christopherson @ 2022-05-19 16:10 UTC (permalink / raw)
  To: Maxim Levitsky
  Cc: kvm, Wanpeng Li, Vitaly Kuznetsov, Jani Nikula, Paolo Bonzini,
	Tvrtko Ursulin, Rodrigo Vivi, Zhenyu Wang, Joonas Lahtinen,
	Tom Lendacky, Ingo Molnar, David Airlie, Thomas Gleixner,
	Dave Hansen, x86, intel-gfx, Daniel Vetter, Borislav Petkov,
	Joerg Roedel, linux-kernel, Jim Mattson, Zhi Wang, Brijesh Singh,
	H. Peter Anvin, intel-gvt-dev, dri-devel

On Wed, Apr 27, 2022, Maxim Levitsky wrote:
> AVIC is now inhibited if the guest changes apic id, thus remove
> that broken code.

Can you explicitly call out what's broken?  Just something short on the code not
handling the scenario where APIC ID is changed back to vcpu_id to help future
archaeologists.  I forget if there are other bugs...

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH v3 04/19] KVM: x86: mmu: allow to enable write tracking externally
  2022-04-27 20:02 ` [RFC PATCH v3 04/19] KVM: x86: mmu: allow to enable write tracking externally Maxim Levitsky
@ 2022-05-19 16:27   ` Sean Christopherson
  2022-05-22 10:21     ` Maxim Levitsky
  2022-05-19 16:37   ` Sean Christopherson
  1 sibling, 1 reply; 57+ messages in thread
From: Sean Christopherson @ 2022-05-19 16:27 UTC (permalink / raw)
  To: Maxim Levitsky
  Cc: kvm, Wanpeng Li, Vitaly Kuznetsov, Jani Nikula, Paolo Bonzini,
	Tvrtko Ursulin, Rodrigo Vivi, Zhenyu Wang, Joonas Lahtinen,
	Tom Lendacky, Ingo Molnar, David Airlie, Thomas Gleixner,
	Dave Hansen, x86, intel-gfx, Daniel Vetter, Borislav Petkov,
	Joerg Roedel, linux-kernel, Jim Mattson, Zhi Wang, Brijesh Singh,
	H. Peter Anvin, intel-gvt-dev, dri-devel

On Wed, Apr 27, 2022, Maxim Levitsky wrote:
> This will be used to enable write tracking from nested AVIC code
> and can also be used to enable write tracking in GVT-g module
> when it actually uses it as opposed to always enabling it,
> when the module is compiled in the kernel.

Wrap at ~75.

> No functional change intended.
> 
> Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
> ---
>  arch/x86/include/asm/kvm_host.h       |  2 +-
>  arch/x86/include/asm/kvm_page_track.h |  1 +
>  arch/x86/kvm/mmu.h                    |  8 +++++---
>  arch/x86/kvm/mmu/mmu.c                | 17 ++++++++++-------
>  arch/x86/kvm/mmu/page_track.c         | 10 ++++++++--
>  5 files changed, 25 insertions(+), 13 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 636df87542555..fc7df778a3d71 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1254,7 +1254,7 @@ struct kvm_arch {
>  	 * is used as one input when determining whether certain memslot
>  	 * related allocations are necessary.
>  	 */

The above comment needs to be rewritten.

> -	bool shadow_root_allocated;
> +	bool mmu_page_tracking_enabled;
>  #if IS_ENABLED(CONFIG_HYPERV)
>  	hpa_t	hv_root_tdp;
> diff --git a/arch/x86/include/asm/kvm_page_track.h b/arch/x86/include/asm/kvm_page_track.h
> index eb186bc57f6a9..955a5ae07b10e 100644
> --- a/arch/x86/include/asm/kvm_page_track.h
> +++ b/arch/x86/include/asm/kvm_page_track.h
> @@ -50,6 +50,7 @@ int kvm_page_track_init(struct kvm *kvm);
>  void kvm_page_track_cleanup(struct kvm *kvm);
>  
>  bool kvm_page_track_write_tracking_enabled(struct kvm *kvm);
> +int kvm_page_track_write_tracking_enable(struct kvm *kvm);
>  int kvm_page_track_write_tracking_alloc(struct kvm_memory_slot *slot);
>  
>  void kvm_page_track_free_memslot(struct kvm_memory_slot *slot);
> diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
> index 671cfeccf04e9..44d15551f7156 100644
> --- a/arch/x86/kvm/mmu.h
> +++ b/arch/x86/kvm/mmu.h
> @@ -269,7 +269,7 @@ int kvm_arch_write_log_dirty(struct kvm_vcpu *vcpu);
>  int kvm_mmu_post_init_vm(struct kvm *kvm);
>  void kvm_mmu_pre_destroy_vm(struct kvm *kvm);
>  
> -static inline bool kvm_shadow_root_allocated(struct kvm *kvm)
> +static inline bool mmu_page_tracking_enabled(struct kvm *kvm)
>  {
>  	/*
>  	 * Read shadow_root_allocated before related pointers. Hence, threads
> @@ -277,9 +277,11 @@ static inline bool kvm_shadow_root_allocated(struct kvm *kvm)
>  	 * see the pointers. Pairs with smp_store_release in
>  	 * mmu_first_shadow_root_alloc.
>  	 */

This comment also needs to be rewritten.

> -	return smp_load_acquire(&kvm->arch.shadow_root_allocated);
> +	return smp_load_acquire(&kvm->arch.mmu_page_tracking_enabled);
>  }

...

> diff --git a/arch/x86/kvm/mmu/page_track.c b/arch/x86/kvm/mmu/page_track.c
> index 2e09d1b6249f3..8857d629036d7 100644
> --- a/arch/x86/kvm/mmu/page_track.c
> +++ b/arch/x86/kvm/mmu/page_track.c
> @@ -21,10 +21,16 @@
>  
>  bool kvm_page_track_write_tracking_enabled(struct kvm *kvm)

This can be static, it's now used only by page_track.c.

>  {
> -	return IS_ENABLED(CONFIG_KVM_EXTERNAL_WRITE_TRACKING) ||
> -	       !tdp_enabled || kvm_shadow_root_allocated(kvm);
> +	return mmu_page_tracking_enabled(kvm);
>  }
>  
> +int kvm_page_track_write_tracking_enable(struct kvm *kvm)

This is too similar to the "enabled" version; "kvm_page_track_enable_write_tracking()"
would maintain namespacing and be less confusing.

Hmm, I'd probably vote to make this a "static inline" in kvm_page_track.h, and
rename mmu_enable_write_tracking() to kvm_mmu_enable_write_tracking and export.
Not a strong preference, just feels silly to export a one-liner.
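
I.e. something like this (sketch):

	/* in kvm_page_track.h, with kvm_mmu_enable_write_tracking() exported */
	static inline int kvm_page_track_enable_write_tracking(struct kvm *kvm)
	{
		return kvm_mmu_enable_write_tracking(kvm);
	}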

> +{
> +	return mmu_enable_write_tracking(kvm);
> +}
> +EXPORT_SYMBOL_GPL(kvm_page_track_write_tracking_enable);
> +
> +
>  void kvm_page_track_free_memslot(struct kvm_memory_slot *slot)
>  {
>  	int i;
> -- 
> 2.26.3
> 

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH v3 04/19] KVM: x86: mmu: allow to enable write tracking externally
  2022-04-27 20:02 ` [RFC PATCH v3 04/19] KVM: x86: mmu: allow to enable write tracking externally Maxim Levitsky
  2022-05-19 16:27   ` Sean Christopherson
@ 2022-05-19 16:37   ` Sean Christopherson
  2022-05-22 10:22     ` Maxim Levitsky
  1 sibling, 1 reply; 57+ messages in thread
From: Sean Christopherson @ 2022-05-19 16:37 UTC (permalink / raw)
  To: Maxim Levitsky
  Cc: kvm, Wanpeng Li, Vitaly Kuznetsov, Jani Nikula, Paolo Bonzini,
	Tvrtko Ursulin, Rodrigo Vivi, Zhenyu Wang, Joonas Lahtinen,
	Tom Lendacky, Ingo Molnar, David Airlie, Thomas Gleixner,
	Dave Hansen, x86, intel-gfx, Daniel Vetter, Borislav Petkov,
	Joerg Roedel, linux-kernel, Jim Mattson, Zhi Wang, Brijesh Singh,
	H. Peter Anvin, intel-gvt-dev, dri-devel

On Wed, Apr 27, 2022, Maxim Levitsky wrote:
> @@ -5753,6 +5752,10 @@ int kvm_mmu_init_vm(struct kvm *kvm)
>  	node->track_write = kvm_mmu_pte_write;
>  	node->track_flush_slot = kvm_mmu_invalidate_zap_pages_in_memslot;
>  	kvm_page_track_register_notifier(kvm, node);

Can you add a patch to move this call to kvm_page_track_register_notifier() into
mmu_enable_write_tracking(), and simultaneously add a WARN in the register path
that page tracking is enabled?

Oh, actually, a better idea. Add an inner __kvm_page_track_register_notifier()
that is not exported and thus used only by KVM, invoke mmu_enable_write_tracking()
from the exported kvm_page_track_register_notifier(), and then do the above.
That will require modifying KVMGT and KVM in a single patch, but that's ok.

That will avoid any possibility of an external user failing to enabling tracking
before registering its notifier, and also avoids bikeshedding over what to do with
the one-line wrapper to enable tracking.
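
E.g. something like this (sketch; assumes the exported helper grows an
int return so that a failure to enable tracking can be propagated):

	int kvm_page_track_register_notifier(struct kvm *kvm,
					     struct kvm_page_track_notifier_node *n)
	{
		int r = kvm_mmu_enable_write_tracking(kvm);

		if (r)
			return r;

		__kvm_page_track_register_notifier(kvm, n);
		return 0;
	}
	EXPORT_SYMBOL_GPL(kvm_page_track_register_notifier);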

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH v3 05/19] x86: KVMGT: use kvm_page_track_write_tracking_enable
  2022-04-27 20:03 ` [RFC PATCH v3 05/19] x86: KVMGT: use kvm_page_track_write_tracking_enable Maxim Levitsky
@ 2022-05-19 16:38   ` Sean Christopherson
  0 siblings, 0 replies; 57+ messages in thread
From: Sean Christopherson @ 2022-05-19 16:38 UTC (permalink / raw)
  To: Maxim Levitsky
  Cc: kvm, Wanpeng Li, Vitaly Kuznetsov, Jani Nikula, Paolo Bonzini,
	Tvrtko Ursulin, Rodrigo Vivi, Zhenyu Wang, Joonas Lahtinen,
	Tom Lendacky, Ingo Molnar, David Airlie, Thomas Gleixner,
	Dave Hansen, x86, intel-gfx, Daniel Vetter, Borislav Petkov,
	Joerg Roedel, linux-kernel, Jim Mattson, Zhi Wang, Brijesh Singh,
	H. Peter Anvin, intel-gvt-dev, dri-devel

On Wed, Apr 27, 2022, Maxim Levitsky wrote:
> @@ -1948,6 +1949,10 @@ static int kvmgt_guest_init(struct mdev_device *mdev)
>  	if (__kvmgt_vgpu_exist(vgpu, kvm))
>  		return -EEXIST;
>  
> +	ret = kvm_page_track_write_tracking_enable(kvm);
> +	if (ret)
> +		return ret;

If for some reason my idea to enable tracking during kvm_page_track_register_notifier()
doesn't pan out, it's probably worth adding a comment saying that enabling write
tracking can't be undone.
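
E.g. (sketch of such a comment):

	/*
	 * Enabling write tracking for this VM is a one-way street: there
	 * is no way to disable it again for the lifetime of the VM.
	 */
	ret = kvm_page_track_write_tracking_enable(kvm);
	if (ret)
		return ret;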

> +
>  	info = vzalloc(sizeof(struct kvmgt_guest_info));
>  	if (!info)
>  		return -ENOMEM;
> -- 
> 2.26.3
> 

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH v3 06/19] KVM: x86: mmu: add gfn_in_memslot helper
  2022-04-27 20:03 ` [RFC PATCH v3 06/19] KVM: x86: mmu: add gfn_in_memslot helper Maxim Levitsky
@ 2022-05-19 16:43   ` Sean Christopherson
  2022-05-22 10:22     ` Maxim Levitsky
  2022-05-22 12:12     ` Maxim Levitsky
  0 siblings, 2 replies; 57+ messages in thread
From: Sean Christopherson @ 2022-05-19 16:43 UTC (permalink / raw)
  To: Maxim Levitsky
  Cc: kvm, Wanpeng Li, Vitaly Kuznetsov, Jani Nikula, Paolo Bonzini,
	Tvrtko Ursulin, Rodrigo Vivi, Zhenyu Wang, Joonas Lahtinen,
	Tom Lendacky, Ingo Molnar, David Airlie, Thomas Gleixner,
	Dave Hansen, x86, intel-gfx, Daniel Vetter, Borislav Petkov,
	Joerg Roedel, linux-kernel, Jim Mattson, Zhi Wang, Brijesh Singh,
	H. Peter Anvin, intel-gvt-dev, dri-devel

On Wed, Apr 27, 2022, Maxim Levitsky wrote:
> This is a tiny refactoring, and can be useful to check
> if a GPA/GFN is within a memslot a bit more cleanly.

This doesn't explain the actual motivation, which is to use the new helper from
arch code.

> Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
> ---
>  include/linux/kvm_host.h | 10 +++++++++-
>  1 file changed, 9 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 252ee4a61b58b..12e261559070b 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -1580,6 +1580,13 @@ int kvm_request_irq_source_id(struct kvm *kvm);
>  void kvm_free_irq_source_id(struct kvm *kvm, int irq_source_id);
>  bool kvm_arch_irqfd_allowed(struct kvm *kvm, struct kvm_irqfd *args);
>  
> +
> +static inline bool gfn_in_memslot(struct kvm_memory_slot *slot, gfn_t gfn)
> +{
> +	return (gfn >= slot->base_gfn && gfn < slot->base_gfn + slot->npages);
> +}
> +

Spurious newline.

> +
>  /*
>   * Returns a pointer to the memslot if it contains gfn.
>   * Otherwise returns NULL.
> @@ -1590,12 +1597,13 @@ try_get_memslot(struct kvm_memory_slot *slot, gfn_t gfn)
>  	if (!slot)
>  		return NULL;
>  
> -	if (gfn >= slot->base_gfn && gfn < slot->base_gfn + slot->npages)
> +	if (gfn_in_memslot(slot, gfn))
>  		return slot;
>  	else
>  		return NULL;

At this point, maybe:

	if (!slot || !gfn_in_memslot(slot, gfn))
		return NULL;

	return slot;

>  }
>  
> +
>  /*
>   * Returns a pointer to the memslot that contains gfn. Otherwise returns NULL.
>   *
> -- 
> 2.26.3
> 

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH v3 14/19] KVM: x86: rename .set_apic_access_page_addr to reload_apic_access_page
  2022-04-27 20:03 ` [RFC PATCH v3 14/19] KVM: x86: rename .set_apic_access_page_addr to reload_apic_access_page Maxim Levitsky
@ 2022-05-19 16:55   ` Sean Christopherson
  2022-05-22 10:22     ` Maxim Levitsky
  0 siblings, 1 reply; 57+ messages in thread
From: Sean Christopherson @ 2022-05-19 16:55 UTC (permalink / raw)
  To: Maxim Levitsky
  Cc: kvm, Wanpeng Li, Vitaly Kuznetsov, Jani Nikula, Paolo Bonzini,
	Tvrtko Ursulin, Rodrigo Vivi, Zhenyu Wang, Joonas Lahtinen,
	Tom Lendacky, Ingo Molnar, David Airlie, Thomas Gleixner,
	Dave Hansen, x86, intel-gfx, Daniel Vetter, Borislav Petkov,
	Joerg Roedel, linux-kernel, Jim Mattson, Zhi Wang, Brijesh Singh,
	H. Peter Anvin, intel-gvt-dev, dri-devel

On Wed, Apr 27, 2022, Maxim Levitsky wrote:
> This will be used on SVM to reload shadow page of the AVIC physid table
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index d2f73ce87a1e3..ad744ab99734c 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -9949,12 +9949,12 @@ void kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
>  		kvm_make_all_cpus_request(kvm, KVM_REQ_APIC_PAGE_RELOAD);
>  }
>  
> -static void kvm_vcpu_reload_apic_access_page(struct kvm_vcpu *vcpu)
> +static void kvm_vcpu_reload_apic_pages(struct kvm_vcpu *vcpu)
>  {
>  	if (!lapic_in_kernel(vcpu))
>  		return;
>  
> -	static_call_cond(kvm_x86_set_apic_access_page_addr)(vcpu);
> +	static_call_cond(kvm_x86_reload_apic_pages)(vcpu);
>  }
>  
>  void __kvm_request_immediate_exit(struct kvm_vcpu *vcpu)
> @@ -10071,7 +10071,7 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
>  		if (kvm_check_request(KVM_REQ_LOAD_EOI_EXITMAP, vcpu))
>  			vcpu_load_eoi_exitmap(vcpu);
>  		if (kvm_check_request(KVM_REQ_APIC_PAGE_RELOAD, vcpu))
> -			kvm_vcpu_reload_apic_access_page(vcpu);
> +			kvm_vcpu_reload_apic_pages(vcpu);

My vote is to add a new request and new kvm_x86_ops hook instead of piggybacking
KVM_REQ_APIC_PAGE_RELOAD.  The usage in kvm_arch_mmu_notifier_invalidate_range()
very subtly relies on the memslot and vma being allocated/controlled by KVM.

The use in avic_physid_shadow_table_flush_memslot() is too similar in that it
also deals with memslot changes, but at the same time is _very_ different in that
it's dealing with user controlled memslots.
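
E.g. something like this (completely untested sketch; the request number and
the hook name are made up):

	/* arch/x86/include/asm/kvm_host.h */
	#define KVM_REQ_AVIC_PHYSID_RELOAD	KVM_ARCH_REQ(31)

	/* vcpu_enter_guest(), keeping the two paths separate */
	if (kvm_check_request(KVM_REQ_APIC_PAGE_RELOAD, vcpu))
		kvm_vcpu_reload_apic_access_page(vcpu);
	if (kvm_check_request(KVM_REQ_AVIC_PHYSID_RELOAD, vcpu))
		static_call_cond(kvm_x86_reload_avic_physid_table)(vcpu);

with avic_physid_shadow_table_flush_memslot() raising the new request instead
of piggybacking KVM_REQ_APIC_PAGE_RELOAD.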

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH v3 03/19] KVM: x86: SVM: remove avic's broken code that updated APIC ID
  2022-05-19 16:10   ` Sean Christopherson
@ 2022-05-22  9:01     ` Maxim Levitsky
  2022-05-23 17:19       ` Sean Christopherson
  0 siblings, 1 reply; 57+ messages in thread
From: Maxim Levitsky @ 2022-05-22  9:01 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: kvm, Wanpeng Li, Vitaly Kuznetsov, Jani Nikula, Paolo Bonzini,
	Tvrtko Ursulin, Rodrigo Vivi, Zhenyu Wang, Joonas Lahtinen,
	Tom Lendacky, Ingo Molnar, David Airlie, Thomas Gleixner,
	Dave Hansen, x86, intel-gfx, Daniel Vetter, Borislav Petkov,
	Joerg Roedel, linux-kernel, Jim Mattson, Zhi Wang, Brijesh Singh,
	H. Peter Anvin, intel-gvt-dev, dri-devel

On Thu, 2022-05-19 at 16:10 +0000, Sean Christopherson wrote:
> On Wed, Apr 27, 2022, Maxim Levitsky wrote:
> > AVIC is now inhibited if the guest changes apic id, thus remove
> > that broken code.
> 
> Can you explicitly call out what's broken?  Just something short on the code not
> handling the scenario where APIC ID is changed back to vcpu_id to help future
> archaeologists.  I forget if there are other bugs...
> 


Well, avic_handle_apic_id_update is called each time AVIC is uninhibited,
because while it is inhibited, the AVIC code doesn't track changes to the APIC ID and such.

Also there are many ways in which it is broken, for example:

1. a CPU can't move its APIC ID to a free slot due to the (!new) check

2. If an APIC ID is moved to a used slot, then the CPU that owned the overwritten
slot can no longer move its own ID correctly, since the slot is no longer its own, not to mention races.


BTW, if you see value in it, I can fix this code instead - a lock + going over all the APIC IDs
should be quite easy to implement. In case of two vCPUs using the same APIC ID,
I can write a non-present entry to the table, so that neither can be addressed,
hoping that the situation is only temporary.

Same can be done for IPIv.
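
Roughly what I have in mind (a very rough, untested sketch; the lock and the
physid table helpers are hypothetical, and this assumes xAPIC IDs, i.e. <= 255):

	static void avic_rebuild_physical_id_table(struct kvm *kvm)
	{
		DECLARE_BITMAP(used, 256);
		DECLARE_BITMAP(clashed, 256);
		struct kvm_vcpu *vcpu;
		unsigned long i;

		bitmap_zero(used, 256);
		bitmap_zero(clashed, 256);

		/* hypothetical per-VM lock protecting the physical id table */
		mutex_lock(&to_kvm_svm(kvm)->avic_table_lock);

		/* first pass: find APIC IDs claimed by more than one vCPU */
		kvm_for_each_vcpu(i, vcpu, kvm)
			if (test_and_set_bit(kvm_xapic_id(vcpu->arch.apic), used))
				set_bit(kvm_xapic_id(vcpu->arch.apic), clashed);

		/* second pass: write the entries, leaving clashed slots
		 * non-present so that no vCPU can be addressed through them */
		kvm_for_each_vcpu(i, vcpu, kvm) {
			u32 id = kvm_xapic_id(vcpu->arch.apic);

			if (test_bit(id, clashed))
				avic_clear_physid_entry(kvm, id);	/* hypothetical */
			else
				avic_write_physid_entry(kvm, id, vcpu);	/* hypothetical */
		}

		mutex_unlock(&to_kvm_svm(kvm)->avic_table_lock);
	}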

Best regards,
	Maxim Levitsky


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH v3 02/19] KVM: x86: inhibit APICv/AVIC when the guest and/or host changes apic id/base from the defaults.
  2022-05-19 16:06   ` Sean Christopherson
@ 2022-05-22  9:03     ` Maxim Levitsky
  2022-05-22 14:47       ` Jim Mattson
  2022-06-23  9:44     ` Maxim Levitsky
  1 sibling, 1 reply; 57+ messages in thread
From: Maxim Levitsky @ 2022-05-22  9:03 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: kvm, Wanpeng Li, Vitaly Kuznetsov, Jani Nikula, Paolo Bonzini,
	Tvrtko Ursulin, Rodrigo Vivi, Zhenyu Wang, Joonas Lahtinen,
	Tom Lendacky, Ingo Molnar, David Airlie, Thomas Gleixner,
	Dave Hansen, x86, intel-gfx, Daniel Vetter, Borislav Petkov,
	Joerg Roedel, linux-kernel, Jim Mattson, Zhi Wang, Brijesh Singh,
	H. Peter Anvin, intel-gvt-dev, dri-devel

On Thu, 2022-05-19 at 16:06 +0000, Sean Christopherson wrote:
> On Wed, Apr 27, 2022, Maxim Levitsky wrote:
> > Neither of these settings should be changed by the guest and it is
> > a burden to support it in the acceleration code, so just inhibit
> > it instead.
> > 
> > Also add a boolean 'apic_id_changed' to indicate if apic id ever changed.
> > 
> > Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
> > ---
> >  arch/x86/include/asm/kvm_host.h |  3 +++
> >  arch/x86/kvm/lapic.c            | 25 ++++++++++++++++++++++---
> >  arch/x86/kvm/lapic.h            |  8 ++++++++
> >  3 files changed, 33 insertions(+), 3 deletions(-)
> > 
> > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > index 63eae00625bda..636df87542555 100644
> > --- a/arch/x86/include/asm/kvm_host.h
> > +++ b/arch/x86/include/asm/kvm_host.h
> > @@ -1070,6 +1070,8 @@ enum kvm_apicv_inhibit {
> >  	APICV_INHIBIT_REASON_ABSENT,
> >  	/* AVIC is disabled because SEV doesn't support it */
> >  	APICV_INHIBIT_REASON_SEV,
> > +	/* APIC ID and/or APIC base was changed by the guest */
> 
> I don't see any reason to inhibit APICv if the APIC base is changed.  KVM has
> never supported that, and disabling APICv won't "fix" anything.

I kind of tacked the APIC base onto this just to be a good citizen.

In theory, currently, if the guest changes the APIC base, neither APICv
nor AVIC will even notice, so the guest will still be able to access both the
default APIC base and the new one, which is kind of wrong.

Inhibiting APICv/AVIC in this case makes it better and it is very cheap to do.

If you still think that it shouldn't be done, I'll remove it.


> 
> Ignoring that is a minor simplification, but also allows for a more intuitive
> name, e.g.
> 
> 	APICV_INHIBIT_REASON_APIC_ID_MODIFIED,
> 
> The inhibit also needs to be added to avic_check_apicv_inhibit_reasons() and
> vmx_check_apicv_inhibit_reasons().
> 
> > +	APICV_INHIBIT_REASON_RO_SETTINGS,

> >  };
> >  
> >  struct kvm_arch {
> > @@ -1258,6 +1260,7 @@ struct kvm_arch {
> >  	hpa_t	hv_root_tdp;
> >  	spinlock_t hv_root_tdp_lock;
> >  #endif
> > +	bool apic_id_changed;
> >  };
> >  
> >  struct kvm_vm_stat {
> > diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> > index 66b0eb0bda94e..8996675b3ef4c 100644
> > --- a/arch/x86/kvm/lapic.c
> > +++ b/arch/x86/kvm/lapic.c
> > @@ -2038,6 +2038,19 @@ static void apic_manage_nmi_watchdog(struct kvm_lapic *apic, u32 lvt0_val)
> >  	}
> >  }
> >  
> > +static void kvm_lapic_check_initial_apic_id(struct kvm_lapic *apic)
> 
> The "check" part is misleading/confusing.  "check" helpers usually query and return
> state.  I assume you avoided "changed" because the ID may or may not actually be
> changing.  Maybe kvm_apic_id_updated()?  Ah, better idea.  What about
> kvm_lapic_xapic_id_updated()?  See below for reasoning.

This is a very good idea!

> 
> > +{
> > +	if (kvm_apic_has_initial_apic_id(apic))
> 
> Rather than add a single-use helper, invoke the helper from kvm_apic_state_fixup()
> in the !x2APIC path, then this can KVM_BUG_ON() x2APIC to help document that KVM
> should never allow the ID to change for x2APIC.

Yes, but we do allow a non-default x2apic id via the userspace API - I wasn't able to convince
you to remove this :)

> 
> > +		return;
> > +
> > +	pr_warn_once("APIC ID change is unsupported by KVM");
> 
> It's supported (modulo x2APIC shenanigans), otherwise KVM wouldn't need to disable
> APICv.

Here, as I said, it would be nice to see that warning if someone complains.
Fact is that the AVIC code was totally broken in this regard, and there are probably
more such bugs, so it would be nice to see if anybody complains.

If you insist, I'll remove this warning.

> 
> > +	kvm_set_apicv_inhibit(apic->vcpu->kvm,
> > +			APICV_INHIBIT_REASON_RO_SETTINGS);
> > +
> > +	apic->vcpu->kvm->arch.apic_id_changed = true;
> > +}
> > +
> >  static int kvm_lapic_reg_write(struct kvm_lapic *apic, u32 reg, u32 val)
> >  {
> >  	int ret = 0;
> > @@ -2046,9 +2059,11 @@ static int kvm_lapic_reg_write(struct kvm_lapic *apic, u32 reg, u32 val)
> >  
> >  	switch (reg) {
> >  	case APIC_ID:		/* Local APIC ID */
> > -		if (!apic_x2apic_mode(apic))
> > +		if (!apic_x2apic_mode(apic)) {
> > +
> 
> Spurious newline.
Will fix.
> 
> >  			kvm_apic_set_xapic_id(apic, val >> 24);
> > -		else
> > +			kvm_lapic_check_initial_apic_id(apic);
> > +		} else
> 
> Needs curly braces for both paths.
Will fix.
> 
> >  			ret = 1;
> >  		break;
> >  
> 
> E.g.
> 
> ---
>  arch/x86/include/asm/kvm_host.h |  1 +
>  arch/x86/kvm/lapic.c            | 21 +++++++++++++++++++--
>  arch/x86/kvm/svm/avic.c         |  3 ++-
>  arch/x86/kvm/vmx/vmx.c          |  3 ++-
>  4 files changed, 24 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index d895d25c5b2f..d888fa1bae77 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1071,6 +1071,7 @@ enum kvm_apicv_inhibit {
>  	APICV_INHIBIT_REASON_BLOCKIRQ,
>  	APICV_INHIBIT_REASON_ABSENT,
>  	APICV_INHIBIT_REASON_SEV,
> +	APICV_INHIBIT_REASON_APIC_ID_MODIFIED,
>  };
> 
>  struct kvm_arch {
> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> index 5fd678c90288..6fe8f20f03d8 100644
> --- a/arch/x86/kvm/lapic.c
> +++ b/arch/x86/kvm/lapic.c
> @@ -2039,6 +2039,19 @@ static void apic_manage_nmi_watchdog(struct kvm_lapic *apic, u32 lvt0_val)
>  	}
>  }
> 
> +static void kvm_lapic_xapic_id_updated(struct kvm_lapic *apic)
> +{
> +	struct kvm *kvm = apic->vcpu->kvm;
> +
> +	if (KVM_BUG_ON(apic_x2apic_mode(apic), kvm))
> +		return;
> +
> +	if (kvm_xapic_id(apic) == apic->vcpu->vcpu_id)
> +		return;
> +
> +	kvm_set_apicv_inhibit(kvm, APICV_INHIBIT_REASON_APIC_ID_MODIFIED);
> +}
> +
>  static int kvm_lapic_reg_write(struct kvm_lapic *apic, u32 reg, u32 val)
>  {
>  	int ret = 0;
> @@ -2047,10 +2060,12 @@ static int kvm_lapic_reg_write(struct kvm_lapic *apic, u32 reg, u32 val)
> 
>  	switch (reg) {
>  	case APIC_ID:		/* Local APIC ID */
> -		if (!apic_x2apic_mode(apic))
> +		if (!apic_x2apic_mode(apic)) {
>  			kvm_apic_set_xapic_id(apic, val >> 24);
> -		else
> +			kvm_lapic_xapic_id_updated(apic);
> +		} else {
>  			ret = 1;
> +		}
>  		break;
> 
>  	case APIC_TASKPRI:
> @@ -2665,6 +2680,8 @@ static int kvm_apic_state_fixup(struct kvm_vcpu *vcpu,
>  			icr = __kvm_lapic_get_reg64(s->regs, APIC_ICR);
>  			__kvm_lapic_set_reg(s->regs, APIC_ICR2, icr >> 32);
>  		}
> +	} else {
> +		kvm_lapic_xapic_id_updated(vcpu->arch.apic);
>  	}
> 
>  	return 0;
> diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
> index 54fe03714f8a..239c3e8b1f3f 100644
> --- a/arch/x86/kvm/svm/avic.c
> +++ b/arch/x86/kvm/svm/avic.c
> @@ -910,7 +910,8 @@ bool avic_check_apicv_inhibit_reasons(enum kvm_apicv_inhibit reason)
>  			  BIT(APICV_INHIBIT_REASON_PIT_REINJ) |
>  			  BIT(APICV_INHIBIT_REASON_X2APIC) |
>  			  BIT(APICV_INHIBIT_REASON_BLOCKIRQ) |
> -			  BIT(APICV_INHIBIT_REASON_SEV);
> +			  BIT(APICV_INHIBIT_REASON_SEV) |
> +			  BIT(APICV_INHIBIT_REASON_APIC_ID_MODIFIED);
> 
>  	return supported & BIT(reason);
>  }
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index b06eafa5884d..941adade21ea 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -7818,7 +7818,8 @@ static bool vmx_check_apicv_inhibit_reasons(enum kvm_apicv_inhibit reason)
>  	ulong supported = BIT(APICV_INHIBIT_REASON_DISABLE) |
>  			  BIT(APICV_INHIBIT_REASON_ABSENT) |
>  			  BIT(APICV_INHIBIT_REASON_HYPERV) |
> -			  BIT(APICV_INHIBIT_REASON_BLOCKIRQ);
> +			  BIT(APICV_INHIBIT_REASON_BLOCKIRQ) |
> +			  BIT(APICV_INHIBIT_REASON_APIC_ID_MODIFIED);
> 
>  	return supported & BIT(reason);
>  }
> 
> base-commit: 6ab6e3842d18e4529fa524fb6c668ae8a8bf54f4



Best regards,
	Thanks for the review,
		Maxim Levitsky
> --
> 



^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH v3 04/19] KVM: x86: mmu: allow to enable write tracking externally
  2022-05-19 16:27   ` Sean Christopherson
@ 2022-05-22 10:21     ` Maxim Levitsky
  0 siblings, 0 replies; 57+ messages in thread
From: Maxim Levitsky @ 2022-05-22 10:21 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: kvm, Wanpeng Li, Vitaly Kuznetsov, Jani Nikula, Paolo Bonzini,
	Tvrtko Ursulin, Rodrigo Vivi, Zhenyu Wang, Joonas Lahtinen,
	Tom Lendacky, Ingo Molnar, David Airlie, Thomas Gleixner,
	Dave Hansen, x86, intel-gfx, Daniel Vetter, Borislav Petkov,
	Joerg Roedel, linux-kernel, Jim Mattson, Zhi Wang, Brijesh Singh,
	H. Peter Anvin, intel-gvt-dev, dri-devel

On Thu, 2022-05-19 at 16:27 +0000, Sean Christopherson wrote:
> On Wed, Apr 27, 2022, Maxim Levitsky wrote:
> > This will be used to enable write tracking from nested AVIC code
> > and can also be used to enable write tracking in GVT-g module
> > when it actually uses it as opposed to always enabling it,
> > when the module is compiled in the kernel.
> 
> Wrap at ~75.
Well, checkpatch.pl didn't complain, so I didn't notice.

> 
> > No functional change intended.
> > 
> > Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
> > ---
> >  arch/x86/include/asm/kvm_host.h       |  2 +-
> >  arch/x86/include/asm/kvm_page_track.h |  1 +
> >  arch/x86/kvm/mmu.h                    |  8 +++++---
> >  arch/x86/kvm/mmu/mmu.c                | 17 ++++++++++-------
> >  arch/x86/kvm/mmu/page_track.c         | 10 ++++++++--
> >  5 files changed, 25 insertions(+), 13 deletions(-)
> > 
> > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > index 636df87542555..fc7df778a3d71 100644
> > --- a/arch/x86/include/asm/kvm_host.h
> > +++ b/arch/x86/include/asm/kvm_host.h
> > @@ -1254,7 +1254,7 @@ struct kvm_arch {
> >  	 * is used as one input when determining whether certain memslot
> >  	 * related allocations are necessary.
> >  	 */
> 
> The above comment needs to be rewritten.
Good catch, thanks a lot!!

> 
> > -	bool shadow_root_allocated;
> > +	bool mmu_page_tracking_enabled;
> >  #if IS_ENABLED(CONFIG_HYPERV)
> >  	hpa_t	hv_root_tdp;
> > diff --git a/arch/x86/include/asm/kvm_page_track.h b/arch/x86/include/asm/kvm_page_track.h
> > index eb186bc57f6a9..955a5ae07b10e 100644
> > --- a/arch/x86/include/asm/kvm_page_track.h
> > +++ b/arch/x86/include/asm/kvm_page_track.h
> > @@ -50,6 +50,7 @@ int kvm_page_track_init(struct kvm *kvm);
> >  void kvm_page_track_cleanup(struct kvm *kvm);
> >  
> >  bool kvm_page_track_write_tracking_enabled(struct kvm *kvm);
> > +int kvm_page_track_write_tracking_enable(struct kvm *kvm);
> >  int kvm_page_track_write_tracking_alloc(struct kvm_memory_slot *slot);
> >  
> >  void kvm_page_track_free_memslot(struct kvm_memory_slot *slot);
> > diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
> > index 671cfeccf04e9..44d15551f7156 100644
> > --- a/arch/x86/kvm/mmu.h
> > +++ b/arch/x86/kvm/mmu.h
> > @@ -269,7 +269,7 @@ int kvm_arch_write_log_dirty(struct kvm_vcpu *vcpu);
> >  int kvm_mmu_post_init_vm(struct kvm *kvm);
> >  void kvm_mmu_pre_destroy_vm(struct kvm *kvm);
> >  
> > -static inline bool kvm_shadow_root_allocated(struct kvm *kvm)
> > +static inline bool mmu_page_tracking_enabled(struct kvm *kvm)
> >  {
> >  	/*
> >  	 * Read shadow_root_allocated before related pointers. Hence, threads
> > @@ -277,9 +277,11 @@ static inline bool kvm_shadow_root_allocated(struct kvm *kvm)
> >  	 * see the pointers. Pairs with smp_store_release in
> >  	 * mmu_first_shadow_root_alloc.
> >  	 */
> 
> This comment also needs to be rewritten.
Also thanks a lot - next time I'll check the comments more carefully.

> 
> > -	return smp_load_acquire(&kvm->arch.shadow_root_allocated);
> > +	return smp_load_acquire(&kvm->arch.mmu_page_tracking_enabled);
> >  }
> 
> ...
> 
> > diff --git a/arch/x86/kvm/mmu/page_track.c b/arch/x86/kvm/mmu/page_track.c
> > index 2e09d1b6249f3..8857d629036d7 100644
> > --- a/arch/x86/kvm/mmu/page_track.c
> > +++ b/arch/x86/kvm/mmu/page_track.c
> > @@ -21,10 +21,16 @@
> >  
> >  bool kvm_page_track_write_tracking_enabled(struct kvm *kvm)
> 
> This can be static, it's now used only by page_track.c.
I'll fix this.
> 
> >  {
> > -	return IS_ENABLED(CONFIG_KVM_EXTERNAL_WRITE_TRACKING) ||
> > -	       !tdp_enabled || kvm_shadow_root_allocated(kvm);
> > +	return mmu_page_tracking_enabled(kvm);
> >  }
> >  
> > +int kvm_page_track_write_tracking_enable(struct kvm *kvm)
> 
> This is too similar to the "enabled" version; "kvm_page_track_enable_write_tracking()"
> would maintain namespacing and be less confusing.
Makes sense, thanks, will do!

> 
> Hmm, I'd probably vote to make this a "static inline" in kvm_page_track.h, and
> rename mmu_enable_write_tracking() to kvm_mmu_enable_write_tracking and export.
> Not a strong preference, just feels silly to export a one-liner.

The sole reason I did it this way is that 'page_track.c' then contains all the interfaces
that an external user of write tracking needs to use.

> 
> > +{
> > +	return mmu_enable_write_tracking(kvm);
> > +}
> > +EXPORT_SYMBOL_GPL(kvm_page_track_write_tracking_enable);
> > +
> > +
> >  void kvm_page_track_free_memslot(struct kvm_memory_slot *slot)
> >  {
> >  	int i;
> > -- 
> > 2.26.3
> > 

Best regards,
	Maxim Levitsky




^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH v3 04/19] KVM: x86: mmu: allow to enable write tracking externally
  2022-05-19 16:37   ` Sean Christopherson
@ 2022-05-22 10:22     ` Maxim Levitsky
  2022-07-20 14:42       ` Maxim Levitsky
  0 siblings, 1 reply; 57+ messages in thread
From: Maxim Levitsky @ 2022-05-22 10:22 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: kvm, Wanpeng Li, Vitaly Kuznetsov, Jani Nikula, Paolo Bonzini,
	Tvrtko Ursulin, Rodrigo Vivi, Zhenyu Wang, Joonas Lahtinen,
	Tom Lendacky, Ingo Molnar, David Airlie, Thomas Gleixner,
	Dave Hansen, x86, intel-gfx, Daniel Vetter, Borislav Petkov,
	Joerg Roedel, linux-kernel, Jim Mattson, Zhi Wang, Brijesh Singh,
	H. Peter Anvin, intel-gvt-dev, dri-devel

On Thu, 2022-05-19 at 16:37 +0000, Sean Christopherson wrote:
> On Wed, Apr 27, 2022, Maxim Levitsky wrote:
> > @@ -5753,6 +5752,10 @@ int kvm_mmu_init_vm(struct kvm *kvm)
> >  	node->track_write = kvm_mmu_pte_write;
> >  	node->track_flush_slot = kvm_mmu_invalidate_zap_pages_in_memslot;
> >  	kvm_page_track_register_notifier(kvm, node);
> 
> Can you add a patch to move this call to kvm_page_track_register_notifier() into
> mmu_enable_write_tracking(), and simultaneously add a WARN in the register path
> that page tracking is enabled?
> 
> Oh, actually, a better idea. Add an inner __kvm_page_track_register_notifier()
> that is not exported and thus used only by KVM, invoke mmu_enable_write_tracking()
> from the exported kvm_page_track_register_notifier(), and then do the above.
> That will require modifying KVMGT and KVM in a single patch, but that's ok.
> 
> That will avoid any possibility of an external user failing to enabling tracking
> before registering its notifier, and also avoids bikeshedding over what to do with
> the one-line wrapper to enable tracking.
> 

This is a good idea as well, especially looking at kvmgt and seeing that
it registers the page track notifier when the vGPU is opened.

I'll do this in the next series.
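
If I understood you correctly, roughly like this (untested sketch):

	/* page_track.c: inner version, not exported - used only by KVM's mmu,
	 * which enables write tracking separately when a shadow root is
	 * allocated */
	void __kvm_page_track_register_notifier(struct kvm *kvm,
					struct kvm_page_track_notifier_node *n)
	{
		struct kvm_page_track_notifier_head *head =
					&kvm->arch.track_notifier_head;

		write_lock(&kvm->mmu_lock);
		hlist_add_head_rcu(&n->node, &head->track_notifier_list);
		write_unlock(&kvm->mmu_lock);
	}

	/* exported version for external users (KVMGT): also turns on write
	 * tracking, so a user can't forget to enable it first */
	int kvm_page_track_register_notifier(struct kvm *kvm,
					struct kvm_page_track_notifier_node *n)
	{
		int r = mmu_enable_write_tracking(kvm);

		if (r)
			return r;

		__kvm_page_track_register_notifier(kvm, n);
		return 0;
	}
	EXPORT_SYMBOL_GPL(kvm_page_track_register_notifier);

(the exported one growing an int return value is what forces KVMGT and KVM to
be modified in a single patch, as you said)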

Thanks for the review!

Best regards,
	Maxim Levitsky


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH v3 06/19] KVM: x86: mmu: add gfn_in_memslot helper
  2022-05-19 16:43   ` Sean Christopherson
@ 2022-05-22 10:22     ` Maxim Levitsky
  2022-05-22 12:12     ` Maxim Levitsky
  1 sibling, 0 replies; 57+ messages in thread
From: Maxim Levitsky @ 2022-05-22 10:22 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: kvm, Wanpeng Li, Vitaly Kuznetsov, Jani Nikula, Paolo Bonzini,
	Tvrtko Ursulin, Rodrigo Vivi, Zhenyu Wang, Joonas Lahtinen,
	Tom Lendacky, Ingo Molnar, David Airlie, Thomas Gleixner,
	Dave Hansen, x86, intel-gfx, Daniel Vetter, Borislav Petkov,
	Joerg Roedel, linux-kernel, Jim Mattson, Zhi Wang, Brijesh Singh,
	H. Peter Anvin, intel-gvt-dev, dri-devel

On Thu, 2022-05-19 at 16:43 +0000, Sean Christopherson wrote:
> On Wed, Apr 27, 2022, Maxim Levitsky wrote:
> > This is a tiny refactoring, and can be useful to check
> > if a GPA/GFN is within a memslot a bit more cleanly.
> 
> This doesn't explain the actual motivation, which is to use the new helper from
> arch code.
I'll add this in the next version.
> 
> > Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
> > ---
> >  include/linux/kvm_host.h | 10 +++++++++-
> >  1 file changed, 9 insertions(+), 1 deletion(-)
> > 
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index 252ee4a61b58b..12e261559070b 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -1580,6 +1580,13 @@ int kvm_request_irq_source_id(struct kvm *kvm);
> >  void kvm_free_irq_source_id(struct kvm *kvm, int irq_source_id);
> >  bool kvm_arch_irqfd_allowed(struct kvm *kvm, struct kvm_irqfd *args);
> >  
> > +
> > +static inline bool gfn_in_memslot(struct kvm_memory_slot *slot, gfn_t gfn)
> > +{
> > +	return (gfn >= slot->base_gfn && gfn < slot->base_gfn + slot->npages);
> > +}
> > +
> 
> Spurious newline.
> 
> > +
> >  /*
> >   * Returns a pointer to the memslot if it contains gfn.
> >   * Otherwise returns NULL.
> > @@ -1590,12 +1597,13 @@ try_get_memslot(struct kvm_memory_slot *slot, gfn_t gfn)
> >  	if (!slot)
> >  		return NULL;
> >  
> > -	if (gfn >= slot->base_gfn && gfn < slot->base_gfn + slot->npages)
> > +	if (gfn_in_memslot(slot, gfn))
> >  		return slot;
> >  	else
> >  		return NULL;
> 
> At this point, maybe:

No objections.

Thanks for the review.

Best regards,
	Maxim Levitsky

> 
> 	if (!slot || !gfn_in_memslot(slot, gfn))
> 		return NULL;
> 
> 	return slot;
> 
> >  }
> >  
> > +
> >  /*
> >   * Returns a pointer to the memslot that contains gfn. Otherwise returns NULL.
> >   *
> > -- 
> > 2.26.3
> > 



^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH v3 14/19] KVM: x86: rename .set_apic_access_page_addr to reload_apic_access_page
  2022-05-19 16:55   ` Sean Christopherson
@ 2022-05-22 10:22     ` Maxim Levitsky
  0 siblings, 0 replies; 57+ messages in thread
From: Maxim Levitsky @ 2022-05-22 10:22 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: kvm, Wanpeng Li, Vitaly Kuznetsov, Jani Nikula, Paolo Bonzini,
	Tvrtko Ursulin, Rodrigo Vivi, Zhenyu Wang, Joonas Lahtinen,
	Tom Lendacky, Ingo Molnar, David Airlie, Thomas Gleixner,
	Dave Hansen, x86, intel-gfx, Daniel Vetter, Borislav Petkov,
	Joerg Roedel, linux-kernel, Jim Mattson, Zhi Wang, Brijesh Singh,
	H. Peter Anvin, intel-gvt-dev, dri-devel

On Thu, 2022-05-19 at 16:55 +0000, Sean Christopherson wrote:
> On Wed, Apr 27, 2022, Maxim Levitsky wrote:
> > This will be used on SVM to reload the shadow page of the AVIC physid table
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index d2f73ce87a1e3..ad744ab99734c 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -9949,12 +9949,12 @@ void kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
> >  		kvm_make_all_cpus_request(kvm, KVM_REQ_APIC_PAGE_RELOAD);
> >  }
> >  
> > -static void kvm_vcpu_reload_apic_access_page(struct kvm_vcpu *vcpu)
> > +static void kvm_vcpu_reload_apic_pages(struct kvm_vcpu *vcpu)
> >  {
> >  	if (!lapic_in_kernel(vcpu))
> >  		return;
> >  
> > -	static_call_cond(kvm_x86_set_apic_access_page_addr)(vcpu);
> > +	static_call_cond(kvm_x86_reload_apic_pages)(vcpu);
> >  }
> >  
> >  void __kvm_request_immediate_exit(struct kvm_vcpu *vcpu)
> > @@ -10071,7 +10071,7 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
> >  		if (kvm_check_request(KVM_REQ_LOAD_EOI_EXITMAP, vcpu))
> >  			vcpu_load_eoi_exitmap(vcpu);
> >  		if (kvm_check_request(KVM_REQ_APIC_PAGE_RELOAD, vcpu))
> > -			kvm_vcpu_reload_apic_access_page(vcpu);
> > +			kvm_vcpu_reload_apic_pages(vcpu);
> 
> My vote is to add a new request and new kvm_x86_ops hook instead of piggybacking
> KVM_REQ_APIC_PAGE_RELOAD.  The usage in kvm_arch_mmu_notifier_invalidate_range()
> very subtly relies on the memslot and vma being allocated/controlled by KVM.
> 
> The use in avic_physid_shadow_table_flush_memslot() is too similar in that it
> also deals with memslot changes, but at the same time is _very_ different in that
> it's dealing with user controlled memslots.
> 

No objections, will do.

Best regards,
	Maxim Levitsky


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH v3 06/19] KVM: x86: mmu: add gfn_in_memslot helper
  2022-05-19 16:43   ` Sean Christopherson
  2022-05-22 10:22     ` Maxim Levitsky
@ 2022-05-22 12:12     ` Maxim Levitsky
  1 sibling, 0 replies; 57+ messages in thread
From: Maxim Levitsky @ 2022-05-22 12:12 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: kvm, Wanpeng Li, Vitaly Kuznetsov, Jani Nikula, Paolo Bonzini,
	Tvrtko Ursulin, Rodrigo Vivi, Zhenyu Wang, Joonas Lahtinen,
	Tom Lendacky, Ingo Molnar, David Airlie, Thomas Gleixner,
	Dave Hansen, x86, intel-gfx, Daniel Vetter, Borislav Petkov,
	Joerg Roedel, linux-kernel, Jim Mattson, Zhi Wang, Brijesh Singh,
	H. Peter Anvin, intel-gvt-dev, dri-devel

On Thu, 2022-05-19 at 16:43 +0000, Sean Christopherson wrote:
> On Wed, Apr 27, 2022, Maxim Levitsky wrote:
> > This is a tiny refactoring, and can be useful to check
> > if a GPA/GFN is within a memslot a bit more cleanly.
> 
> This doesn't explain the actual motivation, which is to use the new helper from
> arch code.
I'll add this in the next version.
> 
> > Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
> > ---
> >  include/linux/kvm_host.h | 10 +++++++++-
> >  1 file changed, 9 insertions(+), 1 deletion(-)
> > 
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index 252ee4a61b58b..12e261559070b 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -1580,6 +1580,13 @@ int kvm_request_irq_source_id(struct kvm *kvm);
> >  void kvm_free_irq_source_id(struct kvm *kvm, int irq_source_id);
> >  bool kvm_arch_irqfd_allowed(struct kvm *kvm, struct kvm_irqfd *args);
> >  
> > +
> > +static inline bool gfn_in_memslot(struct kvm_memory_slot *slot, gfn_t gfn)
> > +{
> > +	return (gfn >= slot->base_gfn && gfn < slot->base_gfn + slot->npages);
> > +}
> > +
> 
> Spurious newline.
> 
> > +
> >  /*
> >   * Returns a pointer to the memslot if it contains gfn.
> >   * Otherwise returns NULL.
> > @@ -1590,12 +1597,13 @@ try_get_memslot(struct kvm_memory_slot *slot, gfn_t gfn)
> >  	if (!slot)
> >  		return NULL;
> >  
> > -	if (gfn >= slot->base_gfn && gfn < slot->base_gfn + slot->npages)
> > +	if (gfn_in_memslot(slot, gfn))
> >  		return slot;
> >  	else
> >  		return NULL;
> 
> At this point, maybe:

No objections.

Thanks for the review.

Best regards,
	Maxim Levitsky

> 
> 	if (!slot || !gfn_in_memslot(slot, gfn))
> 		return NULL;
> 
> 	return slot;
> 
> >  }
> >  
> > +
> >  /*
> >   * Returns a pointer to the memslot that contains gfn. Otherwise returns NULL.
> >   *
> > -- 
> > 2.26.3
> > 



^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH v3 02/19] KVM: x86: inhibit APICv/AVIC when the guest and/or host changes apic id/base from the defaults.
  2022-05-22  9:03     ` Maxim Levitsky
@ 2022-05-22 14:47       ` Jim Mattson
  2022-05-23  6:50         ` Maxim Levitsky
  0 siblings, 1 reply; 57+ messages in thread
From: Jim Mattson @ 2022-05-22 14:47 UTC (permalink / raw)
  To: Maxim Levitsky
  Cc: Sean Christopherson, kvm, Wanpeng Li, Vitaly Kuznetsov,
	Jani Nikula, Paolo Bonzini, Tvrtko Ursulin, Rodrigo Vivi,
	Zhenyu Wang, Joonas Lahtinen, Tom Lendacky, Ingo Molnar,
	David Airlie, Thomas Gleixner, Dave Hansen, x86, intel-gfx,
	Daniel Vetter, Borislav Petkov, Joerg Roedel, linux-kernel,
	Zhi Wang, Brijesh Singh, H. Peter Anvin, intel-gvt-dev,
	dri-devel

On Sun, May 22, 2022 at 2:03 AM Maxim Levitsky <mlevitsk@redhat.com> wrote:
>
> On Thu, 2022-05-19 at 16:06 +0000, Sean Christopherson wrote:
> > On Wed, Apr 27, 2022, Maxim Levitsky wrote:
> > > Neither of these settings should be changed by the guest and it is
> > > a burden to support it in the acceleration code, so just inhibit
> > > it instead.
> > >
> > > Also add a boolean 'apic_id_changed' to indicate if apic id ever changed.
> > >
> > > Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
> > > ---
> > > +           return;
> > > +
> > > +   pr_warn_once("APIC ID change is unsupported by KVM");
> >
> > It's supported (modulo x2APIC shenanigans), otherwise KVM wouldn't need to disable
> > APICv.
>
> Here, as I said, it would be nice to see that warning if someone complains.
> Fact is that the AVIC code was totally broken in this regard, and there are probably
> more such bugs, so it would be nice to see if anybody complains.
>
> If you insist, I'll remove this warning.

This may be fine for a hobbyist, but it's a terrible API in an
enterprise environment. To be honest, I have no way of propagating
this warning from /var/log/messages on a particular host to a
potentially impacted customer. Worse, if they're not the first
impacted customer since the last host reboot, there's no warning to
propagate. I suppose I could just tell every later customer, "Your VM
was scheduled to run on a host that previously reported, 'APIC ID
change is unsupported by KVM.' If you notice any unusual behavior,
that might be the reason for it," but that isn't going to inspire
confidence. I could schedule a drain and reboot of the host, but that
defeats the whole point of the "_once" suffix.

I know that there's a long history of doing this in KVM, but I'd like
to ask that we:
a) stop piling on
b) start fixing the existing uses

If KVM cannot emulate a perfectly valid operation, an exit to
userspace with KVM_EXIT_INTERNAL_ERROR is warranted. Perhaps for
operations that we suspect KVM might get wrong, we should have a new
userspace exit: KVM_EXIT_WARNING?

I'm not saying that you should remove the warning. I'm just asking
that it be augmented with a direct signal to userspace that KVM may no
longer be reliable.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH v3 02/19] KVM: x86: inhibit APICv/AVIC when the guest and/or host changes apic id/base from the defaults.
  2022-05-22 14:47       ` Jim Mattson
@ 2022-05-23  6:50         ` Maxim Levitsky
  2022-05-23 17:22           ` Jim Mattson
  2022-05-23 17:31           ` Sean Christopherson
  0 siblings, 2 replies; 57+ messages in thread
From: Maxim Levitsky @ 2022-05-23  6:50 UTC (permalink / raw)
  To: Jim Mattson
  Cc: Sean Christopherson, kvm, Wanpeng Li, Vitaly Kuznetsov,
	Jani Nikula, Paolo Bonzini, Tvrtko Ursulin, Rodrigo Vivi,
	Zhenyu Wang, Joonas Lahtinen, Tom Lendacky, Ingo Molnar,
	David Airlie, Thomas Gleixner, Dave Hansen, x86, intel-gfx,
	Daniel Vetter, Borislav Petkov, Joerg Roedel, linux-kernel,
	Zhi Wang, Brijesh Singh, H. Peter Anvin, intel-gvt-dev,
	dri-devel

On Sun, 2022-05-22 at 07:47 -0700, Jim Mattson wrote:
> On Sun, May 22, 2022 at 2:03 AM Maxim Levitsky <mlevitsk@redhat.com> wrote:
> > On Thu, 2022-05-19 at 16:06 +0000, Sean Christopherson wrote:
> > > On Wed, Apr 27, 2022, Maxim Levitsky wrote:
> > > > Neither of these settings should be changed by the guest and it is
> > > > a burden to support it in the acceleration code, so just inhibit
> > > > it instead.
> > > > 
> > > > Also add a boolean 'apic_id_changed' to indicate if apic id ever changed.
> > > > 
> > > > Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
> > > > ---
> > > > +           return;
> > > > +
> > > > +   pr_warn_once("APIC ID change is unsupported by KVM");
> > > 
> > > It's supported (modulo x2APIC shenanigans), otherwise KVM wouldn't need to disable
> > > APICv.
> > 
> > Here, as I said, it would be nice to see that warning if someone complains.
> > Fact is that the AVIC code was totally broken in this regard, and there are probably
> > more such bugs, so it would be nice to see if anybody complains.
> > 
> > If you insist, I'll remove this warning.
> 
> This may be fine for a hobbyist, but it's a terrible API in an
> enterprise environment. To be honest, I have no way of propagating
> this warning from /var/log/messages on a particular host to a
> potentially impacted customer. Worse, if they're not the first
> impacted customer since the last host reboot, there's no warning to
> propagate. I suppose I could just tell every later customer, "Your VM
> was scheduled to run on a host that previously reported, 'APIC ID
> change is unsupported by KVM.' If you notice any unusual behavior,
> that might be the reason for it," but that isn't going to inspire
> confidence. I could schedule a drain and reboot of the host, but that
> defeats the whole point of the "_once" suffix.

Mostly agree, and I have already read a few discussions about exactly this;
those warnings are mostly useless, but they are used in the
cases where we don't have the courage to just exit with KVM_EXIT_INTERNAL_ERROR.

I do not think though that the warning is completely useless,
as we often have the kernel log of the target machine when things go wrong,
so *we* can notice it.
In other words, a kernel warning is mostly useless but better than nothing.

About KVM_EXIT_WARNING, this is IMHO a very good idea, probably combined
with some form of taint flag, which could be read by qemu and then shown
over hmp/qmp interfaces.
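
E.g. something minimal like this on the uapi side (hypothetical sketch; the
exit number and the names are made up, nothing like this exists today):

	/* include/uapi/linux/kvm.h */
	#define KVM_EXIT_WARNING	37

	/* new member of the kvm_run exit union, filled in before the exit;
	 * 'reason' would be a new KVM_WARN_* code, e.g. one saying "the guest
	 * changed its APIC ID, APICv/AVIC is now inhibited" */
	struct {
		__u64 reason;
		__u64 flags;	/* reserved, must be zero */
	} warning;

Userspace that doesn't care can just resume the vCPU, while qemu could latch it
into a taint flag and expose it via hmp/qmp.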

Best regards,
	Maxim levitsky


> 
> I know that there's a long history of doing this in KVM, but I'd like
> to ask that we:
> a) stop piling on
> b) start fixing the existing uses
> 
> If KVM cannot emulate a perfectly valid operation, an exit to
> userspace with KVM_EXIT_INTERNAL_ERROR is warranted. Perhaps for
> operations that we suspect KVM might get wrong, we should have a new
> userspace exit: KVM_EXIT_WARNING?
> 
> I'm not saying that you should remove the warning. I'm just asking
> that it be augmented with a direct signal to userspace that KVM may no
> longer be reliable.
> 



^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH v3 03/19] KVM: x86: SVM: remove avic's broken code that updated APIC ID
  2022-05-22  9:01     ` Maxim Levitsky
@ 2022-05-23 17:19       ` Sean Christopherson
  0 siblings, 0 replies; 57+ messages in thread
From: Sean Christopherson @ 2022-05-23 17:19 UTC (permalink / raw)
  To: Maxim Levitsky
  Cc: kvm, Wanpeng Li, Vitaly Kuznetsov, Jani Nikula, Paolo Bonzini,
	Tvrtko Ursulin, Rodrigo Vivi, Zhenyu Wang, Joonas Lahtinen,
	Tom Lendacky, Ingo Molnar, David Airlie, Thomas Gleixner,
	Dave Hansen, x86, intel-gfx, Daniel Vetter, Borislav Petkov,
	Joerg Roedel, linux-kernel, Jim Mattson, Zhi Wang, Brijesh Singh,
	H. Peter Anvin, intel-gvt-dev, dri-devel

On Sun, May 22, 2022, Maxim Levitsky wrote:
> On Thu, 2022-05-19 at 16:10 +0000, Sean Christopherson wrote:
> > On Wed, Apr 27, 2022, Maxim Levitsky wrote:
> > > AVIC is now inhibited if the guest changes apic id, thus remove
> > > that broken code.
> > 
> > Can you explicitly call out what's broken?  Just something short on the code not
> > handling the scenario where APIC ID is changed back to vcpu_id to help future
> > archaeologists.  I forget if there are other bugs...
> > 
> 
> 
> Well, avic_handle_apic_id_update is called each time AVIC is uninhibited,
> because while it is inhibited, the AVIC code doesn't track changes to the APIC ID and such.
> 
> Also there are many ways in which it is broken, for example:
> 
> 1. a CPU can't move its APIC ID to a free slot due to the (!new) check
> 
> 2. If an APIC ID is moved to a used slot, then the CPU that owned the overwritten
> slot can no longer move its own ID correctly, since the slot is no longer its own, not to mention races.

The more the merrier :-)  Any/all of those examples are great, just so long as it's
obvious to future readers that the code truly is busted.

> BTW, if you see value in it, I can fix this code instead - a lock + going over all the APIC IDs
> should be quite easy to implement. In case of two vCPUs using the same APIC ID,
> I can write a non-present entry to the table, so that neither can be addressed,
> hoping that the situation is only temporary.

Very strong "no", let's keep this as simple as possible without outright killing
the guest or breaking ABI.  Disabling APICv/AVIC is perfect.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH v3 02/19] KVM: x86: inhibit APICv/AVIC when the guest and/or host changes apic id/base from the defaults.
  2022-05-23  6:50         ` Maxim Levitsky
@ 2022-05-23 17:22           ` Jim Mattson
  2022-05-23 17:31           ` Sean Christopherson
  1 sibling, 0 replies; 57+ messages in thread
From: Jim Mattson @ 2022-05-23 17:22 UTC (permalink / raw)
  To: Maxim Levitsky
  Cc: Sean Christopherson, kvm, Wanpeng Li, Vitaly Kuznetsov,
	Jani Nikula, Paolo Bonzini, Tvrtko Ursulin, Rodrigo Vivi,
	Zhenyu Wang, Joonas Lahtinen, Tom Lendacky, Ingo Molnar,
	David Airlie, Thomas Gleixner, Dave Hansen, x86, intel-gfx,
	Daniel Vetter, Borislav Petkov, Joerg Roedel, linux-kernel,
	Zhi Wang, Brijesh Singh, H. Peter Anvin, intel-gvt-dev,
	dri-devel

On Sun, May 22, 2022 at 11:50 PM Maxim Levitsky <mlevitsk@redhat.com> wrote:
>
> On Sun, 2022-05-22 at 07:47 -0700, Jim Mattson wrote:
> > On Sun, May 22, 2022 at 2:03 AM Maxim Levitsky <mlevitsk@redhat.com> wrote:
> > > On Thu, 2022-05-19 at 16:06 +0000, Sean Christopherson wrote:
> > > > On Wed, Apr 27, 2022, Maxim Levitsky wrote:
> > > > > Neither of these settings should be changed by the guest and it is
> > > > > a burden to support it in the acceleration code, so just inhibit
> > > > > it instead.
> > > > >
> > > > > Also add a boolean 'apic_id_changed' to indicate if apic id ever changed.
> > > > >
> > > > > Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
> > > > > ---
> > > > > +           return;
> > > > > +
> > > > > +   pr_warn_once("APIC ID change is unsupported by KVM");
> > > >
> > > > It's supported (modulo x2APIC shenanigans), otherwise KVM wouldn't need to disable
> > > > APICv.
> > >
> > > Here, as I said, it would be nice to see that warning if someone complains.
> > > Fact is that the AVIC code was totally broken in this regard, and there are probably
> > > more such bugs, so it would be nice to see if anybody complains.
> > >
> > > If you insist, I'll remove this warning.
> >
> > This may be fine for a hobbyist, but it's a terrible API in an
> > enterprise environment. To be honest, I have no way of propagating
> > this warning from /var/log/messages on a particular host to a
> > potentially impacted customer. Worse, if they're not the first
> > impacted customer since the last host reboot, there's no warning to
> > propagate. I suppose I could just tell every later customer, "Your VM
> > was scheduled to run on a host that previously reported, 'APIC ID
> > change is unsupported by KVM.' If you notice any unusual behavior,
> > that might be the reason for it," but that isn't going to inspire
> > confidence. I could schedule a drain and reboot of the host, but that
> > defeats the whole point of the "_once" suffix.
>
> Mostly agree, and I have already read a few discussions about exactly this;
> those warnings are mostly useless, but they are used in the
> cases where we don't have the courage to just exit with KVM_EXIT_INTERNAL_ERROR.
>
> I do not think though that the warning is completely useless,
> as we often have the kernel log of the target machine when things go wrong,
> so *we* can notice it.
> In other words, a kernel warning is mostly useless but better than nothing.

I don't know how this works for you, but *we* are rarely involved when
things go wrong. :-(

> About KVM_EXIT_WARNING, this is IMHO a very good idea, probably combined
> with some form of taint flag, which could be read by qemu and then shown
> over hmp/qmp interfaces.
>
> Best regards,
>         Maxim levitsky
>
>
> >
> > I know that there's a long history of doing this in KVM, but I'd like
> > to ask that we:
> > a) stop piling on
> > b) start fixing the existing uses
> >
> > If KVM cannot emulate a perfectly valid operation, an exit to
> > userspace with KVM_EXIT_INTERNAL_ERROR is warranted. Perhaps for
> > operations that we suspect KVM might get wrong, we should have a new
> > userspace exit: KVM_EXIT_WARNING?
> >
> > I'm not saying that you should remove the warning. I'm just asking
> > that it be augmented with a direct signal to userspace that KVM may no
> > longer be reliable.
> >
>
>

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH v3 02/19] KVM: x86: inhibit APICv/AVIC when the guest and/or host changes apic id/base from the defaults.
  2022-05-23  6:50         ` Maxim Levitsky
  2022-05-23 17:22           ` Jim Mattson
@ 2022-05-23 17:31           ` Sean Christopherson
  1 sibling, 0 replies; 57+ messages in thread
From: Sean Christopherson @ 2022-05-23 17:31 UTC (permalink / raw)
  To: Maxim Levitsky
  Cc: Jim Mattson, kvm, Wanpeng Li, Vitaly Kuznetsov, Jani Nikula,
	Paolo Bonzini, Tvrtko Ursulin, Rodrigo Vivi, Zhenyu Wang,
	Joonas Lahtinen, Tom Lendacky, Ingo Molnar, David Airlie,
	Thomas Gleixner, Dave Hansen, x86, intel-gfx, Daniel Vetter,
	Borislav Petkov, Joerg Roedel, linux-kernel, Zhi Wang,
	Brijesh Singh, H. Peter Anvin, intel-gvt-dev, dri-devel

On Mon, May 23, 2022, Maxim Levitsky wrote:
> On Sun, 2022-05-22 at 07:47 -0700, Jim Mattson wrote:
> > On Sun, May 22, 2022 at 2:03 AM Maxim Levitsky <mlevitsk@redhat.com> wrote:
> > > On Thu, 2022-05-19 at 16:06 +0000, Sean Christopherson wrote:
> > > > On Wed, Apr 27, 2022, Maxim Levitsky wrote:
> > > > > Neither of these settings should be changed by the guest and it is
> > > > > a burden to support it in the acceleration code, so just inhibit
> > > > > it instead.
> > > > > 
> > > > > Also add a boolean 'apic_id_changed' to indicate if apic id ever changed.
> > > > > 
> > > > > Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
> > > > > ---
> > > > > +           return;
> > > > > +
> > > > > +   pr_warn_once("APIC ID change is unsupported by KVM");
> > > > 
> > > > It's supported (modulo x2APIC shenanigans), otherwise KVM wouldn't need to disable
> > > > APICv.
> > > 
> > > Here, as I said, it would be nice to see that warning if someone complains.
> > > Fact is that the AVIC code was totally broken in this regard, and there are probably
> > > more such bugs, so it would be nice to see if anybody complains.
> > > 
> > > If you insist, I'll remove this warning.
> > 
> > This may be fine for a hobbyist, but it's a terrible API in an
> > enterprise environment. To be honest, I have no way of propagating
> > this warning from /var/log/messages on a particular host to a
> > potentially impacted customer. Worse, if they're not the first
> > impacted customer since the last host reboot, there's no warning to
> > propagate. I suppose I could just tell every later customer, "Your VM
> > was scheduled to run on a host that previously reported, 'APIC ID
> > change is unsupported by KVM.' If you notice any unusual behavior,
> > that might be the reason for it," but that isn't going to inspire
> > confidence. I could schedule a drain and reboot of the host, but that
> > defeats the whole point of the "_once" suffix.
> 
> Mostly agree, and I have already read a few discussions about exactly this;
> those warnings are mostly useless, but they are used in the
> cases where we don't have the courage to just exit with KVM_EXIT_INTERNAL_ERROR.
> 
> I do not think though that the warning is completely useless,
> as we often have the kernel log of the target machine when things go wrong,
> so *we* can notice it.
> In other words, a kernel warning is mostly useless but better than nothing.

IMO, it's worse than doing nothing.  Us developers become desensitized to the
kernel message due to running tests, the existence of these messages propagates
the notion that they are a good thing (and we keep rehashing these discussions...),
users may not realize it's a _once() printk and so think they _aren't_ affected
when re-running a workload, etc...

And in this case, "APIC ID change is unsupported by KVM" is partly wrong.  KVM
fully models Intel's behavior where the ID change isn't carried across x2APIC
enabling; the only unsupported behavior is that the guest will lose APICv
acceleration.

> About KVM_EXIT_WARNING, this is IMHO a very good idea, probably combined
> with some form of taint flag, which could be read by qemu and then shown
> over hmp/qmp interfaces.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH v3 02/19] KVM: x86: inhibit APICv/AVIC when the guest and/or host changes apic id/base from the defaults.
  2022-05-19 16:06   ` Sean Christopherson
  2022-05-22  9:03     ` Maxim Levitsky
@ 2022-06-23  9:44     ` Maxim Levitsky
  1 sibling, 0 replies; 57+ messages in thread
From: Maxim Levitsky @ 2022-06-23  9:44 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: kvm, Wanpeng Li, Vitaly Kuznetsov, Jani Nikula, Paolo Bonzini,
	Tvrtko Ursulin, Rodrigo Vivi, Zhenyu Wang, Joonas Lahtinen,
	Tom Lendacky, Ingo Molnar, David Airlie, Thomas Gleixner,
	Dave Hansen, x86, intel-gfx, Daniel Vetter, Borislav Petkov,
	Joerg Roedel, linux-kernel, Jim Mattson, Zhi Wang, Brijesh Singh,
	H. Peter Anvin, intel-gvt-dev, dri-devel

On Thu, 2022-05-19 at 16:06 +0000, Sean Christopherson wrote:
> On Wed, Apr 27, 2022, Maxim Levitsky wrote:
> > Neither of these settings should be changed by the guest and it is
> > a burden to support it in the acceleration code, so just inhibit
> > it instead.
> > 
> > Also add a boolean 'apic_id_changed' to indicate if apic id ever changed.
> > 
> > Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
> > ---
> >  arch/x86/include/asm/kvm_host.h |  3 +++
> >  arch/x86/kvm/lapic.c            | 25 ++++++++++++++++++++++---
> >  arch/x86/kvm/lapic.h            |  8 ++++++++
> >  3 files changed, 33 insertions(+), 3 deletions(-)
> > 
> > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > index 63eae00625bda..636df87542555 100644
> > --- a/arch/x86/include/asm/kvm_host.h
> > +++ b/arch/x86/include/asm/kvm_host.h
> > @@ -1070,6 +1070,8 @@ enum kvm_apicv_inhibit {
> >  	APICV_INHIBIT_REASON_ABSENT,
> >  	/* AVIC is disabled because SEV doesn't support it */
> >  	APICV_INHIBIT_REASON_SEV,
> > +	/* APIC ID and/or APIC base was changed by the guest */
> 
> I don't see any reason to inhibit APICv if the APIC base is changed.  KVM has
> never supported that, and disabling APICv won't "fix" anything.
> 
> Ignoring that is a minor simplification, but also allows for a more intuitive
> name, e.g.
> 
> 	APICV_INHIBIT_REASON_APIC_ID_MODIFIED,
> 
> The inhibit also needs to be added to avic_check_apicv_inhibit_reasons() and
> vmx_check_apicv_inhibit_reasons().
> 
> > +	APICV_INHIBIT_REASON_RO_SETTINGS,
> >  };
> >  
> >  struct kvm_arch {
> > @@ -1258,6 +1260,7 @@ struct kvm_arch {
> >  	hpa_t	hv_root_tdp;
> >  	spinlock_t hv_root_tdp_lock;
> >  #endif
> > +	bool apic_id_changed;
> >  };
> >  
> >  struct kvm_vm_stat {
> > diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> > index 66b0eb0bda94e..8996675b3ef4c 100644
> > --- a/arch/x86/kvm/lapic.c
> > +++ b/arch/x86/kvm/lapic.c
> > @@ -2038,6 +2038,19 @@ static void apic_manage_nmi_watchdog(struct kvm_lapic *apic, u32 lvt0_val)
> >  	}
> >  }
> >  
> > +static void kvm_lapic_check_initial_apic_id(struct kvm_lapic *apic)
> 
> The "check" part is misleading/confusing.  "check" helpers usually query and return
> state.  I assume you avoided "changed" because the ID may or may not actually be
> changing.  Maybe kvm_apic_id_updated()?  Ah, better idea.  What about
> kvm_lapic_xapic_id_updated()?  See below for reasoning.
> 
> > +{
> > +	if (kvm_apic_has_initial_apic_id(apic))
> 
> Rather than add a single-use helper, invoke the helper from kvm_apic_state_fixup()
> in the !x2APIC path, then this can KVM_BUG_ON() x2APIC to help document that KVM
> should never allow the ID to change for x2APIC.
> 
> > +		return;
> > +
> > +	pr_warn_once("APIC ID change is unsupported by KVM");
> 
> It's supported (modulo x2APIC shenanigans), otherwise KVM wouldn't need to disable
> APICv.
> 
> > +	kvm_set_apicv_inhibit(apic->vcpu->kvm,
> > +			APICV_INHIBIT_REASON_RO_SETTINGS);
> > +
> > +	apic->vcpu->kvm->arch.apic_id_changed = true;
> > +}
> > +
> >  static int kvm_lapic_reg_write(struct kvm_lapic *apic, u32 reg, u32 val)
> >  {
> >  	int ret = 0;
> > @@ -2046,9 +2059,11 @@ static int kvm_lapic_reg_write(struct kvm_lapic *apic, u32 reg, u32 val)
> >  
> >  	switch (reg) {
> >  	case APIC_ID:		/* Local APIC ID */
> > -		if (!apic_x2apic_mode(apic))
> > +		if (!apic_x2apic_mode(apic)) {
> > +
> 
> Spurious newline.
> 
> >  			kvm_apic_set_xapic_id(apic, val >> 24);
> > -		else
> > +			kvm_lapic_check_initial_apic_id(apic);
> > +		} else
> 
> Needs curly braces for both paths.
> 
> >  			ret = 1;
> >  		break;
> >  
> 
> E.g.
> 
> ---
>  arch/x86/include/asm/kvm_host.h |  1 +
>  arch/x86/kvm/lapic.c            | 21 +++++++++++++++++++--
>  arch/x86/kvm/svm/avic.c         |  3 ++-
>  arch/x86/kvm/vmx/vmx.c          |  3 ++-
>  4 files changed, 24 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index d895d25c5b2f..d888fa1bae77 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1071,6 +1071,7 @@ enum kvm_apicv_inhibit {
>  	APICV_INHIBIT_REASON_BLOCKIRQ,
>  	APICV_INHIBIT_REASON_ABSENT,
>  	APICV_INHIBIT_REASON_SEV,
> +	APICV_INHIBIT_REASON_APIC_ID_MODIFIED,
>  };
> 
>  struct kvm_arch {
> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> index 5fd678c90288..6fe8f20f03d8 100644
> --- a/arch/x86/kvm/lapic.c
> +++ b/arch/x86/kvm/lapic.c
> @@ -2039,6 +2039,19 @@ static void apic_manage_nmi_watchdog(struct kvm_lapic *apic, u32 lvt0_val)
>  	}
>  }
> 
> +static void kvm_lapic_xapic_id_updated(struct kvm_lapic *apic)
> +{
> +	struct kvm *kvm = apic->vcpu->kvm;
> +
> +	if (KVM_BUG_ON(apic_x2apic_mode(apic), kvm))
> +		return;
> +
> +	if (kvm_xapic_id(apic) == apic->vcpu->vcpu_id)
> +		return;
> +
> +	kvm_set_apicv_inhibit(kvm, APICV_INHIBIT_REASON_APIC_ID_MODIFIED);
> +}
> +
>  static int kvm_lapic_reg_write(struct kvm_lapic *apic, u32 reg, u32 val)
>  {
>  	int ret = 0;
> @@ -2047,10 +2060,12 @@ static int kvm_lapic_reg_write(struct kvm_lapic *apic, u32 reg, u32 val)
> 
>  	switch (reg) {
>  	case APIC_ID:		/* Local APIC ID */
> -		if (!apic_x2apic_mode(apic))
> +		if (!apic_x2apic_mode(apic)) {
>  			kvm_apic_set_xapic_id(apic, val >> 24);
> -		else
> +			kvm_lapic_xapic_id_updated(apic);
> +		} else {
>  			ret = 1;
> +		}
>  		break;
> 
>  	case APIC_TASKPRI:
> @@ -2665,6 +2680,8 @@ static int kvm_apic_state_fixup(struct kvm_vcpu *vcpu,
>  			icr = __kvm_lapic_get_reg64(s->regs, APIC_ICR);
>  			__kvm_lapic_set_reg(s->regs, APIC_ICR2, icr >> 32);
>  		}
> +	} else {
> +		kvm_lapic_xapic_id_updated(vcpu->arch.apic);
>  	}
> 
>  	return 0;
> diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
> index 54fe03714f8a..239c3e8b1f3f 100644
> --- a/arch/x86/kvm/svm/avic.c
> +++ b/arch/x86/kvm/svm/avic.c
> @@ -910,7 +910,8 @@ bool avic_check_apicv_inhibit_reasons(enum kvm_apicv_inhibit reason)
>  			  BIT(APICV_INHIBIT_REASON_PIT_REINJ) |
>  			  BIT(APICV_INHIBIT_REASON_X2APIC) |
>  			  BIT(APICV_INHIBIT_REASON_BLOCKIRQ) |
> -			  BIT(APICV_INHIBIT_REASON_SEV);
> +			  BIT(APICV_INHIBIT_REASON_SEV) |
> +			  BIT(APICV_INHIBIT_REASON_APIC_ID_MODIFIED);
> 
>  	return supported & BIT(reason);
>  }
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index b06eafa5884d..941adade21ea 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -7818,7 +7818,8 @@ static bool vmx_check_apicv_inhibit_reasons(enum kvm_apicv_inhibit reason)
>  	ulong supported = BIT(APICV_INHIBIT_REASON_DISABLE) |
>  			  BIT(APICV_INHIBIT_REASON_ABSENT) |
>  			  BIT(APICV_INHIBIT_REASON_HYPERV) |
> -			  BIT(APICV_INHIBIT_REASON_BLOCKIRQ);
> +			  BIT(APICV_INHIBIT_REASON_BLOCKIRQ) |
> +			  BIT(APICV_INHIBIT_REASON_APIC_ID_MODIFIED);
> 
>  	return supported & BIT(reason);
>  }
> 
> base-commit: 6ab6e3842d18e4529fa524fb6c668ae8a8bf54f4
> --
> 


Hi Sean!

So, I decided to stop being lazy and to understand how KVM actually treats the whole thing:


- kvm_apic_set_xapic_id - called when the apic id changes, either by a guest write,
  a cpu reset, or x2apic being disabled due to a write to the apic base msr.
  The apic register is updated, and the apic map is recalculated


- kvm_apic_set_x2apic_id - called only when an apic base write (guest or userspace)
  enables x2apic. The caller uses vcpu->vcpu_id explicitly


- kvm_apic_state_fixup - called when the apic state is uploaded by userspace; has a
  check for the x2apic api. Also triggers an apic map update


- kvm_recalculate_apic_map
  this updates the apic map that we use in IPI emulation.
  - the xapic id (aka APIC_ID >> 24) is only used for APICs which are not in x2apic mode.
  - x2apic ids (aka vcpu->vcpu_id) are used for all APICs which are in x2apic mode,
    and also, as a hack, for an apic with vcpu_id > 255: its x2apic id is still
    put in the map even if it is not in x2apic mode
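
In pseudo-code, the per-vCPU part of the map update boils down to roughly this
(paraphrasing the above; not the literal upstream code):

	u32 xapic_id = kvm_xapic_id(apic);	/* APIC_ID >> 24 */
	u32 x2apic_id = kvm_x2apic_id(apic);	/* == vcpu->vcpu_id */

	/* the x2apic id is used for apics in x2apic mode, and also for
	 * vcpu_id > 255 even in xapic mode (the hotplug hack) */
	if (apic_x2apic_mode(apic) || x2apic_id > 0xff)
		map->phys_map[x2apic_id] = apic;

	/* the xapic id is only used for apics not in x2apic mode */
	if (!apic_x2apic_mode(apic))
		map->phys_map[xapic_id] = apic;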


Conclusions:

- Practically speaking, when an apic is in x2apic mode, even if userspace uploaded a
non-standard APIC_ID, it is ignored, and just read back (garbage in - garbage out)

- A non-standard APIC ID is lost when switching to x2apic mode.




Best regards,
	Maxim Levitsky



PS: sending this so this info is not lost.

Thankfully my APICv inhibit patch got accepted upstream,
so that's one issue less to deal with.



^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH v3 04/19] KVM: x86: mmu: allow to enable write tracking externally
  2022-05-22 10:22     ` Maxim Levitsky
@ 2022-07-20 14:42       ` Maxim Levitsky
  2022-07-25 16:08         ` Sean Christopherson
  0 siblings, 1 reply; 57+ messages in thread
From: Maxim Levitsky @ 2022-07-20 14:42 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: kvm, Wanpeng Li, Vitaly Kuznetsov, Jani Nikula, Paolo Bonzini,
	Tvrtko Ursulin, Rodrigo Vivi, Zhenyu Wang, Joonas Lahtinen,
	Tom Lendacky, Ingo Molnar, David Airlie, Thomas Gleixner,
	Dave Hansen, x86, intel-gfx, Daniel Vetter, Borislav Petkov,
	Joerg Roedel, linux-kernel, Jim Mattson, Zhi Wang, Brijesh Singh,
	H. Peter Anvin, intel-gvt-dev, dri-devel

On Sun, 2022-05-22 at 13:22 +0300, Maxim Levitsky wrote:
> On Thu, 2022-05-19 at 16:37 +0000, Sean Christopherson wrote:
> > On Wed, Apr 27, 2022, Maxim Levitsky wrote:
> > > @@ -5753,6 +5752,10 @@ int kvm_mmu_init_vm(struct kvm *kvm)
> > >         node->track_write = kvm_mmu_pte_write;
> > >         node->track_flush_slot = kvm_mmu_invalidate_zap_pages_in_memslot;
> > >         kvm_page_track_register_notifier(kvm, node);
> > 
> > Can you add a patch to move this call to kvm_page_track_register_notifier() into
> > mmu_enable_write_tracking(), and simultaneously add a WARN in the register path
> > that page tracking is enabled?
> > 
> > Oh, actually, a better idea. Add an inner __kvm_page_track_register_notifier()
> > that is not exported and thus used only by KVM, invoke mmu_enable_write_tracking()
> > from the exported kvm_page_track_register_notifier(), and then do the above.
> > That will require modifying KVMGT and KVM in a single patch, but that's ok.
> > 
> > That will avoid any possibility of an external user failing to enabling tracking
> > before registering its notifier, and also avoids bikeshedding over what to do with
> > the one-line wrapper to enable tracking.
> > 
> 
> This is a good idea as well, especially looking at kvmgt and seeing that
> it registers the page track notifier when the vGPU is opened.
> 
> I'll do this in the next series.
> 
> Thanks for the review!

After putting some thought into this, I am not 100% sure anymore that I want to do it this way.

Let me explain the current state of things:

For mmu:
- the write tracking notifier is registered on VM initialization (that is, pretty much always),
and if it is called because write tracking was enabled due to some other reason
(currently only KVMGT), it checks the number of shadow mmu pages and, if zero, bails out.

- write tracking is enabled when a shadow root is allocated.

This can be kept as is by using __kvm_page_track_register_notifier as you suggested.

For KVMGT:
- both write tracking and the notifier are enabled when a vGPU mdev device is first opened.
That 'works' only because KVMGT doesn't allow assigning more than one mdev to the same VM,
thus the per-VM notifier and the write tracking for that VM are enabled at the same time.


Now for nested AVIC, this is what I would like to do:

- just like mmu, I prefer to register the write tracking notifier when the VM is created.
- just like mmu, write tracking should only be enabled when nested AVIC is actually used for the
  first time, so that write tracking is not always enabled when you just boot a VM with nested AVIC supported,
  since the VM might not use nested at all.

Thus I either need to use __kvm_page_track_register_notifier too for AVIC (and thus need to export it)
or I need to have a boolean (nested_avic_was_used_once) and register the write tracking
notifier only when it is false, and do it not on VM creation but on the first attempt to use nested AVIC.

Do you think this is worth it? I mean there is some value in registering the notifier only when needed
(this way it is not called for nothing) but it does complicate things a bit.
 
I can also stash this boolean (like 'bool registered;') into the 'struct kvm_page_track_notifier_node',
and thus allow kvm_page_track_register_notifier to be called more than once -
then I can also get rid of __kvm_page_track_register_notifier.

What do you think about this?
 
Best regards,
	Maxim Levitsky


> 
> Best regards,
>         Maxim Levitsky



^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH v3 04/19] KVM: x86: mmu: allow to enable write tracking externally
  2022-07-20 14:42       ` Maxim Levitsky
@ 2022-07-25 16:08         ` Sean Christopherson
  2022-07-28  7:46           ` Maxim Levitsky
  0 siblings, 1 reply; 57+ messages in thread
From: Sean Christopherson @ 2022-07-25 16:08 UTC (permalink / raw)
  To: Maxim Levitsky
  Cc: kvm, Wanpeng Li, Vitaly Kuznetsov, Jani Nikula, Paolo Bonzini,
	Tvrtko Ursulin, Rodrigo Vivi, Zhenyu Wang, Joonas Lahtinen,
	Tom Lendacky, Ingo Molnar, David Airlie, Thomas Gleixner,
	Dave Hansen, x86, intel-gfx, Daniel Vetter, Borislav Petkov,
	Joerg Roedel, linux-kernel, Jim Mattson, Zhi Wang, Brijesh Singh,
	H. Peter Anvin, intel-gvt-dev, dri-devel

On Wed, Jul 20, 2022, Maxim Levitsky wrote:
> On Sun, 2022-05-22 at 13:22 +0300, Maxim Levitsky wrote:
> > On Thu, 2022-05-19 at 16:37 +0000, Sean Christopherson wrote:
> > > On Wed, Apr 27, 2022, Maxim Levitsky wrote:
> > > > @@ -5753,6 +5752,10 @@ int kvm_mmu_init_vm(struct kvm *kvm)
> Now for nested AVIC, this is what I would like to do:
>  
> - just like mmu, I prefer to register the write tracking notifier when the
>   VM is created.
>
> - just like mmu, write tracking should only be enabled when nested AVIC is
>   actually used for the first time, so that write tracking is not always
>   enabled when you just boot a VM with nested AVIC supported, since the VM
>   might not use nesting at all.
>  
> Thus I either need to use the __kvm_page_track_register_notifier too for AVIC
> (and thus need to export it) or I need to have a boolean
> (nested_avic_was_used_once) and register the write tracking notifier only
> when it is false and do it not on VM creation but on the first attempt to use
> nested AVIC.
>  
> Do you think this is worth it? I mean there is some value in registering the
> notifier only when needed (this way it is not called for nothing) but it does
> complicate things a bit.

Compared to everything else that you're doing in the nested AVIC code, refcounting
the shared kvm_page_track_notifier_node object is a trivial amount of complexity.

And on that topic, do you have performance numbers to justify using a single
shared node?  E.g. if every table instance has its own notifier, then no additional
refcounting is needed.  It's not obvious that a shared node will provide better
performance, e.g. if there are only a handful of AVIC tables being shadowed, then
a linear walk of all nodes is likely fast enough, and doesn't bring the risk of
a write potentially being stalled due to having to acquire a VM-scoped mutex.

> I can also stash this boolean (like 'bool registered;') into the 'struct
> kvm_page_track_notifier_node',  and thus allow the
> kvm_page_track_register_notifier to be called more than once -  then I can
> also get rid of __kvm_page_track_register_notifier. 

No, allowing redundant registration without proper refcounting leads to pain,
e.g. X registers, Y registers, X unregisters, kaboom.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH v3 04/19] KVM: x86: mmu: allow to enable write tracking externally
  2022-07-25 16:08         ` Sean Christopherson
@ 2022-07-28  7:46           ` Maxim Levitsky
  2022-08-01 15:53             ` Maxim Levitsky
  2022-08-01 17:20             ` Sean Christopherson
  0 siblings, 2 replies; 57+ messages in thread
From: Maxim Levitsky @ 2022-07-28  7:46 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: kvm, Wanpeng Li, Vitaly Kuznetsov, Jani Nikula, Paolo Bonzini,
	Tvrtko Ursulin, Rodrigo Vivi, Zhenyu Wang, Joonas Lahtinen,
	Tom Lendacky, Ingo Molnar, David Airlie, Thomas Gleixner,
	Dave Hansen, x86, intel-gfx, Daniel Vetter, Borislav Petkov,
	Joerg Roedel, linux-kernel, Jim Mattson, Zhi Wang, Brijesh Singh,
	H. Peter Anvin, intel-gvt-dev, dri-devel

On Mon, 2022-07-25 at 16:08 +0000, Sean Christopherson wrote:
> On Wed, Jul 20, 2022, Maxim Levitsky wrote:
> > On Sun, 2022-05-22 at 13:22 +0300, Maxim Levitsky wrote:
> > > On Thu, 2022-05-19 at 16:37 +0000, Sean Christopherson wrote:
> > > > On Wed, Apr 27, 2022, Maxim Levitsky wrote:
> > > > > @@ -5753,6 +5752,10 @@ int kvm_mmu_init_vm(struct kvm *kvm)
> > Now for nested AVIC, this is what I would like to do:
> >  
> > - just like mmu, I prefer to register the write tracking notifier when the
> >   VM is created.
> > 
> > - just like mmu, write tracking should only be enabled when nested AVIC is
> >   actually used for the first time, so that write tracking is not always
> >   enabled when you just boot a VM with nested AVIC supported, since the VM
> >   might not use nesting at all.
> >  
> > Thus I either need to use the __kvm_page_track_register_notifier too for AVIC
> > (and thus need to export it) or I need to have a boolean
> > (nested_avic_was_used_once) and register the write tracking notifier only
> > when it is false and do it not on VM creation but on the first attempt to use
> > nested AVIC.
> >  
> > Do you think this is worth it? I mean there is some value in registering the
> > notifier only when needed (this way it is not called for nothing) but it does
> > complicate things a bit.
> 
> Compared to everything else that you're doing in the nested AVIC code, refcounting
> the shared kvm_page_track_notifier_node object is a trivial amount of complexity.
Makes sense.

> 
> And on that topic, do you have performance numbers to justify using a single
> shared node?  E.g. if every table instance has its own notifier, then no additional
> refcounting is needed. 

The thing is that KVM goes over the list of notifiers and calls them for every write from the emulator,
in fact even just for an mmio write, and when you enable write tracking on a page,
you just write protect the page and add a mark in the page track array, which roughly means 

'don't install a spte, don't install a mmio spte, but just emulate the page fault if it hits this page'

So adding more than the bare minimum to this list seems just a bit wrong.
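
For reference, enabling write tracking of a single gfn boils down to roughly this
(a sketch modeled on the KVMGT usage; error handling omitted):

#include <linux/kvm_host.h>
#include <asm/kvm_page_track.h>

static void track_gfn(struct kvm *kvm, gfn_t gfn)
{
	struct kvm_memory_slot *slot;
	int idx;

	idx = srcu_read_lock(&kvm->srcu);
	slot = gfn_to_memslot(kvm, gfn);
	if (slot) {
		write_lock(&kvm->mmu_lock);
		/* write protect the page and mark it in the track array */
		kvm_slot_page_track_add_page(kvm, slot, gfn,
					     KVM_PAGE_TRACK_WRITE);
		write_unlock(&kvm->mmu_lock);
	}
	srcu_read_unlock(&kvm->srcu, idx);
}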


>  It's not obvious that a shared node will provide better
> performance, e.g. if there are only a handful of AVIC tables being shadowed, then
> a linear walk of all nodes is likely fast enough, and doesn't bring the risk of
> a write potentially being stalled due to having to acquire a VM-scoped mutex.

The thing is that if I register multiple notifiers, they will all be called anyway,
but yes, I can use container_of to discover which table the notifier belongs to,
instead of having a hash table where I look up the GFN of the fault.

In practice the above means that all the shadow physid tables will be in a linear
list of notifiers, so I could indeed avoid a per-VM mutex on the write tracking;
however, for simplicity I will probably still need it, because I do modify the page,
and having a per-physid-table mutex complicates things.
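
Something like this, I guess (just a sketch; the struct and field names are made up):

struct avic_physid_table {
	struct kvm_page_track_notifier_node node;
	struct hlist_node hnode;	/* hash table linkage */
	gfn_t gfn;			/* gfn of the guest physid table */
	/* ... shadow entries, refcount, etc ... */
};

static void physid_table_track_write(struct kvm_vcpu *vcpu, gpa_t gpa,
				     const u8 *new, int bytes,
				     struct kvm_page_track_notifier_node *node)
{
	struct avic_physid_table *t =
		container_of(node, struct avic_physid_table, node);

	/* the notifier fires for every emulated write; filter by gfn */
	if (gpa_to_gfn(gpa) != t->gfn)
		return;

	/* update/invalidate the shadow entries touched by this write */
}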

Currently in my code the locking is very simple and somewhat dumb, but the performance
is very good, because the code isn't executed often; most of the time the AVIC hardware
works alone without any VM exits.

Once the code is accepted upstream, it's one of the things that can be improved.


Note though that I still need a hash table and a mutex, because on each VM entry
the guest can use a different physid table, so I need to look it up, and create it
if not found, which requires read/write access to the hash table and thus a mutex.
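
The lookup-or-create path on nested VM entry is roughly this (a sketch; the 'avic'
struct, its 'tables_lock' mutex and the hash table are my made-up names, and error
handling is trimmed):

#include <linux/hashtable.h>
#include "svm.h"

static struct avic_physid_table *
avic_physid_table_get(struct kvm_vcpu *vcpu, gfn_t gfn)
{
	struct kvm_svm *kvm_svm = to_kvm_svm(vcpu->kvm);
	struct avic_physid_table *t;

	mutex_lock(&kvm_svm->avic.tables_lock);

	/* reuse an existing shadow table for this guest physid page */
	hash_for_each_possible(kvm_svm->avic.physid_tables, t, hnode, gfn) {
		if (t->gfn == gfn) {
			t->refcount++;
			goto out;
		}
	}

	t = kzalloc(sizeof(*t), GFP_KERNEL_ACCOUNT);
	if (t) {
		t->gfn = gfn;
		t->refcount = 1;
		hash_add(kvm_svm->avic.physid_tables, &t->hnode, gfn);
	}
out:
	mutex_unlock(&kvm_svm->avic.tables_lock);
	return t;
}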



> 
> > I can also stash this boolean (like 'bool registered;') into the 'struct
> > kvm_page_track_notifier_node',  and thus allow the
> > kvm_page_track_register_notifier to be called more that once -  then I can
> > also get rid of __kvm_page_track_register_notifier. 
> 
> No, allowing redundant registration without proper refcounting leads to pain,
> e.g. X registers, Y registers, X unregisters, kaboom.
> 

True, but then what about adding a refcount to 'struct kvm_page_track_notifier_node'
instead of a boolean, and allowing redundant registration? 
Probably not worth it, in which case I am OK with adding a refcount to my avic code.

Or maybe just scrap the whole thing and leave registration and activation of the
write tracking as two separate things? Honestly, now that looks like the cleanest
solution.

Best regards,
	Maxim Levitsky


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH v3 04/19] KVM: x86: mmu: allow to enable write tracking externally
  2022-07-28  7:46           ` Maxim Levitsky
@ 2022-08-01 15:53             ` Maxim Levitsky
  2022-08-01 17:20             ` Sean Christopherson
  1 sibling, 0 replies; 57+ messages in thread
From: Maxim Levitsky @ 2022-08-01 15:53 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: kvm, Wanpeng Li, Vitaly Kuznetsov, Jani Nikula, Paolo Bonzini,
	Tvrtko Ursulin, Rodrigo Vivi, Zhenyu Wang, Joonas Lahtinen,
	Tom Lendacky, Ingo Molnar, David Airlie, Thomas Gleixner,
	Dave Hansen, x86, intel-gfx, Daniel Vetter, Borislav Petkov,
	Joerg Roedel, linux-kernel, Jim Mattson, Zhi Wang, Brijesh Singh,
	H. Peter Anvin, intel-gvt-dev, dri-devel

On Thu, 2022-07-28 at 10:46 +0300, Maxim Levitsky wrote:
> On Mon, 2022-07-25 at 16:08 +0000, Sean Christopherson wrote:
> > On Wed, Jul 20, 2022, Maxim Levitsky wrote:
> > > On Sun, 2022-05-22 at 13:22 +0300, Maxim Levitsky wrote:
> > > > On Thu, 2022-05-19 at 16:37 +0000, Sean Christopherson wrote:
> > > > > On Wed, Apr 27, 2022, Maxim Levitsky wrote:
> > > > > > @@ -5753,6 +5752,10 @@ int kvm_mmu_init_vm(struct kvm *kvm)
> > > Now for nested AVIC, this is what I would like to do:
> > >  
> > > - just like mmu, I prefer to register the write tracking notifier when the
> > >   VM is created.
> > > 
> > > - just like mmu, write tracking should only be enabled when nested AVIC is
> > >   actually used for the first time, so that write tracking is not always
> > >   enabled when you just boot a VM with nested AVIC supported, since the VM
> > >   might not use nesting at all.
> > >  
> > > Thus I either need to use the __kvm_page_track_register_notifier too for AVIC
> > > (and thus need to export it) or I need to have a boolean
> > > (nested_avic_was_used_once) and register the write tracking notifier only
> > > when it is false and do it not on VM creation but on the first attempt to use
> > > nested AVIC.
> > >  
> > > Do you think this is worth it? I mean there is some value in registering the
> > > notifier only when needed (this way it is not called for nothing) but it does
> > > complicate things a bit.
> > 
> > Compared to everything else that you're doing in the nested AVIC code, refcounting
> > the shared kvm_page_track_notifier_node object is a trivial amount of complexity.
> Makes sense.
> 
> > And on that topic, do you have performance numbers to justify using a single
> > shared node?  E.g. if every table instance has its own notifier, then no additional
> > refcounting is needed. 
> 
> The thing is that KVM goes over the list of notifiers and calls them for every write from the emulator,
> in fact even just for an mmio write, and when you enable write tracking on a page,
> you just write protect the page and add a mark in the page track array, which roughly means 
> 
> 'don't install a spte, don't install a mmio spte, but just emulate the page fault if it hits this page'
> 
> So adding more than the bare minimum to this list seems just a bit wrong.
> 
> 
> >  It's not obvious that a shared node will provide better
> > performance, e.g. if there are only a handful of AVIC tables being shadowed, then
> > a linear walk of all nodes is likely fast enough, and doesn't bring the risk of
> > a write potentially being stalled due to having to acquire a VM-scoped mutex.
> 
> The thing is that if I register multiple notifiers, they will all be called anyway,
> but yes, I can use container_of to discover which table the notifier belongs to,
> instead of having a hash table where I look up the GFN of the fault.
> 
> In practice the above means that all the shadow physid tables will be in a linear
> list of notifiers, so I could indeed avoid a per-VM mutex on the write tracking,
> however for simplicity I probably will still need it because I do modify the page,
> and having a per-physid-table mutex complicates things.
> 
> Currently in my code the locking is very simple and somewhat dumb, but the performance
> is very good, because the code isn't executed often; most of the time the AVIC hardware
> works alone without any VM exits.
> 
> Once the code is accepted upstream, it's one of the things that can be improved.
> 
> 
> Note though that I still need a hash table and a mutex, because on each VM entry
> the guest can use a different physid table, so I need to look it up, and create it
> if not found, which requires read/write access to the hash table and thus a mutex.
> 
> 
> 
> > > I can also stash this boolean (like 'bool registered;') into the 'struct
> > > kvm_page_track_notifier_node',  and thus allow the
> > > kvm_page_track_register_notifier to be called more than once -  then I can
> > > also get rid of __kvm_page_track_register_notifier. 
> > 
> > No, allowing redundant registration without proper refcounting leads to pain,
> > e.g. X registers, Y registers, X unregisters, kaboom.
> > 
> 
> True, but then what about adding a refcount to 'struct kvm_page_track_notifier_node'
> instead of a boolean, and allowing redundant registration? 
> Probably not worth it, in which case I am OK with adding a refcount to my avic code.
> 
> Or maybe just scrap the whole thing and leave registration and activation of the
> write tracking as two separate things? Honestly, now that looks like the cleanest
> solution.


Kind ping on this. Do you still want me to enable write tracking on notifier registration,
or should I scrap the idea?


Best regards,
	Maxim Levitsky
> 
> Best regards,
> 	Maxim Levitsky



^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: [RFC PATCH v3 04/19] KVM: x86: mmu: allow to enable write tracking externally
  2022-07-28  7:46           ` Maxim Levitsky
  2022-08-01 15:53             ` Maxim Levitsky
@ 2022-08-01 17:20             ` Sean Christopherson
  2022-08-08 13:13               ` Nested AVIC design (was:Re: [RFC PATCH v3 04/19] KVM: x86: mmu: allow to enable write tracking externally) Maxim Levitsky
  1 sibling, 1 reply; 57+ messages in thread
From: Sean Christopherson @ 2022-08-01 17:20 UTC (permalink / raw)
  To: Maxim Levitsky
  Cc: kvm, Wanpeng Li, Vitaly Kuznetsov, Jani Nikula, Paolo Bonzini,
	Tvrtko Ursulin, Rodrigo Vivi, Zhenyu Wang, Joonas Lahtinen,
	Tom Lendacky, Ingo Molnar, David Airlie, Thomas Gleixner,
	Dave Hansen, x86, intel-gfx, Daniel Vetter, Borislav Petkov,
	Joerg Roedel, linux-kernel, Jim Mattson, Zhi Wang, Brijesh Singh,
	H. Peter Anvin, intel-gvt-dev, dri-devel

On Thu, Jul 28, 2022, Maxim Levitsky wrote:
> On Mon, 2022-07-25 at 16:08 +0000, Sean Christopherson wrote:
> > On Wed, Jul 20, 2022, Maxim Levitsky wrote:
> > And on that topic, do you have performance numbers to justify using a single
> > shared node?  E.g. if every table instance has its own notifier, then no additional
> > refcounting is needed. 
> 
> The thing is that KVM goes over the list of notifiers and calls them for
> every write from the emulator, in fact even just for an mmio write, and when you
> enable write tracking on a page, you just write protect the page and add a
> mark in the page track array, which roughly means 
> 
> 'don't install a spte, don't install a mmio spte, but just emulate the page fault if it hits this page'
> 
> So adding more than the bare minimum to this list seems just a bit wrong.

Hmm, I see what you're saying.  To some extent, having a minimal page tracker
implementation is just that, an implementation detail.  But for better or worse,
the existing API effectively pushes range checking to the callers.  I agree that
breaking from that pattern would be odd.

> >  It's not obvious that a shared node will provide better performance, e.g.
> >  if there are only a handful of AVIC tables being shadowed, then a linear
> >  walk of all nodes is likely fast enough, and doesn't bring the risk of a
> >  write potentially being stalled due to having to acquire a VM-scoped
> >  mutex.
> 
> The thing is that if I register multiple notifiers, they will all be called anyway,
> but yes, I can use container_of to discover which table the notifier belongs to,
> instead of having a hash table where I look up the GFN of the fault.
> 
> In practice the above means that all the shadow physid tables will be in a linear
> list of notifiers, so I could indeed avoid a per-VM mutex on the write tracking,
> however for simplicity I probably will still need it because I do modify the page,
> and having a per-physid-table mutex complicates things.
> 
> Currently in my code the locking is very simple and somewhat dumb, but the performance
> is very good, because the code isn't executed often; most of the time the AVIC hardware
> works alone without any VM exits.

Yes, but because the code isn't executed often, pretty much any solution will
provide good performance.

> Once the code is accepted upstream, it's one of the things that can be improved.
> 
> Note though that I still need a hash table and a mutex, because on each VM entry
> the guest can use a different physid table, so I need to look it up, and create it
> if not found, which requires read/write access to the hash table and thus a mutex.

One of the points I'm trying to make is that a hash table isn't strictly required.
E.g. if I understand the update rules correctly, I believe tables can be tracked
via an RCU-protected list, with vCPUs taking a spinlock and doing synchronize_rcu()
when adding/removing a table.  That would avoid having to take any "real" locks in
the page track notifier.
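
Something like this (completely untested; list/lock/helper names are all made up):

#include <linux/rculist.h>
#include "svm.h"

static void avic_track_write(struct kvm_vcpu *vcpu, gpa_t gpa, const u8 *new,
			     int bytes, struct kvm_page_track_notifier_node *node)
{
	struct kvm_svm *kvm_svm = to_kvm_svm(vcpu->kvm);
	struct avic_physid_table *t;
	gfn_t gfn = gpa_to_gfn(gpa);

	/* lockless walk; writers use tables_lock + synchronize_rcu() */
	rcu_read_lock();
	list_for_each_entry_rcu(t, &kvm_svm->avic.tables, link) {
		if (t->gfn == gfn)
			avic_physid_table_write(t, gpa, new, bytes);
	}
	rcu_read_unlock();
}

static void avic_physid_table_remove(struct kvm_svm *kvm_svm,
				     struct avic_physid_table *t)
{
	spin_lock(&kvm_svm->avic.tables_lock);
	list_del_rcu(&t->link);
	spin_unlock(&kvm_svm->avic.tables_lock);

	/* ensure no walker can still see 't' before freeing it */
	synchronize_rcu();
	avic_physid_table_free(t);
}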

The VM-scoped mutex worries me as it will be a bottleneck if L1 is running multiple
L2 VMs.  E.g. if L1 is frequently switching vmcs12 and thus avic_physical_id, then
nested VMRUN will effectively get serialized.  That is mitigated to some extent by
an RCU-protected list, as a sane L1 will use a single table for each L2, and so a
vCPU will need to add/remove a table if and only if it's the first/last vCPU to
start/stop running an L2 VM.

> > > I can also stash this boolean (like 'bool registered;') into the 'struct
> > > kvm_page_track_notifier_node',  and thus allow the
> > kvm_page_track_register_notifier to be called more than once -  then I can
> > > also get rid of __kvm_page_track_register_notifier. 
> > 
> > No, allowing redundant registration without proper refcounting leads to pain,
> > e.g. X registers, Y registers, X unregisters, kaboom.
> > 
> 
> True, but then what about adding a refcount to 'struct kvm_page_track_notifier_node'
> instead of a boolean, and allowing redundant registration?
> Probably not worth it, in which case I am OK with adding a refcount to my avic code.

Ya, I would rather force AVIC to do the refcounting.  Existing users don't need a
refcount, and doing the refcounting in AVIC code means kvm_page_track_notifier_node
can WARN on redundant registration, i.e. can sanity check the AVIC code to some
extent.
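
E.g. something like (totally untested; 'track_users' and friends are made-up names):

static void avic_physid_tracking_get(struct kvm *kvm)
{
	struct kvm_svm *kvm_svm = to_kvm_svm(kvm);

	mutex_lock(&kvm_svm->avic.tables_lock);
	/* register the shared notifier only for the first user */
	if (!kvm_svm->avic.track_users++)
		kvm_page_track_register_notifier(kvm, &kvm_svm->avic.track_node);
	mutex_unlock(&kvm_svm->avic.tables_lock);
}

static void avic_physid_tracking_put(struct kvm *kvm)
{
	struct kvm_svm *kvm_svm = to_kvm_svm(kvm);

	mutex_lock(&kvm_svm->avic.tables_lock);
	/* unregister when the last user goes away */
	if (!--kvm_svm->avic.track_users)
		kvm_page_track_unregister_notifier(kvm, &kvm_svm->avic.track_node);
	mutex_unlock(&kvm_svm->avic.tables_lock);
}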

> Or maybe just scrap the whole thing and leave registration and
> activation of the write tracking as two separate things? Honestly, now that
> looks like the cleanest solution.

It's the easiest, but IMO it's not the cleanest.  Allowing notifiers to be
registered without tracking being enabled is undesirable, especially since we know
we can prevent it.

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Nested AVIC design (was:Re: [RFC PATCH v3 04/19] KVM: x86: mmu: allow to enable write tracking externally)
  2022-08-01 17:20             ` Sean Christopherson
@ 2022-08-08 13:13               ` Maxim Levitsky
  2022-09-29 22:38                 ` Sean Christopherson
  0 siblings, 1 reply; 57+ messages in thread
From: Maxim Levitsky @ 2022-08-08 13:13 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: kvm, Wanpeng Li, Vitaly Kuznetsov, Jani Nikula, Paolo Bonzini,
	Tvrtko Ursulin, Rodrigo Vivi, Zhenyu Wang, Joonas Lahtinen,
	Tom Lendacky, Ingo Molnar, David Airlie, Thomas Gleixner,
	Dave Hansen, x86, intel-gfx, Daniel Vetter, Borislav Petkov,
	Joerg Roedel, linux-kernel, Jim Mattson, Zhi Wang, Brijesh Singh,
	H. Peter Anvin, intel-gvt-dev, dri-devel

On Mon, 2022-08-01 at 17:20 +0000, Sean Christopherson wrote:
> On Thu, Jul 28, 2022, Maxim Levitsky wrote:
> > On Mon, 2022-07-25 at 16:08 +0000, Sean Christopherson wrote:
> > > On Wed, Jul 20, 2022, Maxim Levitsky wrote:
> > > And on that topic, do you have performance numbers to justify using a single
> > > shared node?  E.g. if every table instance has its own notifier, then no additional
> > > refcounting is needed. 
> > 
> > The thing is that KVM goes over the list of notifiers and calls them for
> > every write from the emulator, in fact even just for an mmio write, and when you
> > enable write tracking on a page, you just write protect the page and add a
> > mark in the page track array, which roughly means 
> > 
> > 'don't install a spte, don't install a mmio spte, but just emulate the page fault if it hits this page'
> > 
> > So adding more than the bare minimum to this list seems just a bit wrong.
> 
> Hmm, I see what you're saying.  To some extent, having a minimal page tracker
> implementation is just that, an implementation detail.  But for better or worse,
> the existing API effectively pushes range checking to the callers.  I agree that
> breaking from that pattern would be odd.
> 
> > >  It's not obvious that a shared node will provide better performance, e.g.
> > >  if there are only a handful of AVIC tables being shadowed, then a linear
> > >  walk of all nodes is likely fast enough, and doesn't bring the risk of a
> > >  write potentially being stalled due to having to acquire a VM-scoped
> > >  mutex.
> > 
> > The thing is that if I register multiple notifiers, they will all be called anyway,
> > but yes, I can use container_of to discover which table the notifier belongs to,
> > instead of having a hash table where I look up the GFN of the fault.
> > 
> > In practice the above means that all the shadow physid tables will be in a linear
> > list of notifiers, so I could indeed avoid a per-VM mutex on the write tracking,
> > however for simplicity I probably will still need it because I do modify the page,
> > and having a per-physid-table mutex complicates things.
> > 
> > Currently in my code the locking is very simple and somewhat dumb, but the performance
> > is very good, because the code isn't executed often; most of the time the AVIC hardware
> > works alone without any VM exits.
> 
> Yes, but because the code isn't executed often, pretty much any solution will
> provide good performance.
> 
> > Once the code is accepted upstream, it's one of the things that can be improved.
> > 
> > Note though that I still need a hash table and a mutex, because on each VM entry
> > the guest can use a different physid table, so I need to look it up, and create it
> > if not found, which requires read/write access to the hash table and thus a mutex.
> 
> One of the points I'm trying to make is that a hash table isn't strictly required.
> E.g. if I understand the update rules correctly, I believe tables can be tracked
> via an RCU-protected list, with vCPUs taking a spinlock and doing synchronize_rcu()
> when adding/removing a table.  That would avoid having to take any "real" locks in
> the page track notifier.
> 
> The VM-scoped mutex worries me as it will be a bottleneck if L1 is running multiple
> L2 VMs.  E.g. if L1 is frequently switching vmcs12 and thus avic_physical_id, then
> nested VMRUN will effectively get serialized.  That is mitigated to some extent by
> an RCU-protected list, as a sane L1 will use a single table for each L2, and so a
> vCPU will need to add/remove a table if and only if it's the first/last vCPU to
> start/stop running an L2 VM.

Hi Sean, Paolo, and everyone else who wants to review my nested AVIC work.
 
I would like to explain the design choices for locking and the life cycle of the shadow physid tables, and I hope
that this will make it easier for you to review my code and/or make some suggestions on how to improve it.
 
=====================================================================================================================
Explanation of the AVIC physid page (AVIC physical ID table)
=====================================================================================================================
 
This table gives a vCPU enough knowledge of its peers to send them IPIs without a VM exit.
 
A vCPU doesn't use this table to send IPIs to itself or to process its own interrupts from its own
IRR/ISR; it accesses its APIC backing page directly.
 
This table contains an entry for each vCPU, and each entry contains 2 things:
 
1. The physical address of the peer vCPU's APIC backing page, so that when sending an IPI to this vCPU,
   the sender can set the corresponding bit in the IRR in this page (this differs from APICv, which uses PIR bitmaps).
 
   NOTE1: There is also a 'V' (valid) bit attached to the address - when clear, the whole entry is invalid,
   and trying to use it will trigger a VM exit for unicast IPIs, while for broadcast interrupts the
   entry will simply be ignored.
 
   NOTE2: This part of the entry is not supposed to change during the lifetime of a VM.
 
2. The apic id of the physical CPU where this vCPU is running (for a nested guest's table, this is an L1 APIC id).
 
   This field allows AVIC to ring the doorbell on the target physical CPU to make its AVIC process the 
   incoming interrupt request.
 
   It also has an 'IR' (is running) bit, which when clear indicates that the target vCPU is not running anywhere,
   thus the field's content is not valid.
 
   - This field is supposed to be changed by L1 once in a while, when it either migrates
     the L2's vCPUs around and/or schedules them in/out.
 
   - Write tracking of the guest physid table ensures that the shadow physid table is kept up to date.
 
   - In addition to that, the L1 vCPUs can be migrated and/or scheduled in/out, which also 
     leads to an update of the shadow table
     (similar to how mmu notifiers need to update the shadow tables, not because of a guest-initiated 
     change but due to a host-triggered change).
 
 
- All vCPUs of a nested VM are supposed to share the same physid page, and the page is supposed to contain
  entries such that each entry points to a unique apic backing page and contains the L1 physical apic id
  on which this nested vCPU currently runs (or has is_running=0, meaning that this vCPU is scheduled out).
 
- The number of entries in the physid table (aka the max guest apic id) is not specified in the table itself,
  but rather is given in the vmcb that references it (also, all vmcbs of a guest should have the same value).
 
NOTE: while I say 'supposed', I understand that a malicious guest will try to bend each of these
  assumptions, and AFAIK I do handle (but often in a slow way) all these unusual cases while
  still following the AVIC spec.
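 
For reference, building a shadow entry from a pinned backing page and an L0 physical apic
id boils down to something like this (a sketch; the masks are the existing AVIC defines
from asm/svm.h, modulo renames):
 
#include <linux/kvm_host.h>
#include <asm/svm.h>

static u64 avic_make_physid_entry(hpa_t backing_page_hpa, u32 host_apic_id,
				  bool is_running)
{
	/* bits 51:12 point to the pinned apic backing page */
	u64 entry = backing_page_hpa & AVIC_PHYSICAL_ID_ENTRY_BACKING_PAGE_MASK;

	entry |= AVIC_PHYSICAL_ID_ENTRY_VALID_MASK;

	/* the host apic id is only meaningful while is_running is set */
	if (is_running)
		entry |= (host_apic_id &
			  AVIC_PHYSICAL_ID_ENTRY_HOST_PHYSICAL_ID_MASK) |
			 AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK;

	return entry;
}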
 
=====================================================================================================================
Lifecycle of the shadow physid pages
=====================================================================================================================
 
- An empty shadow physid page is created when a nested entry with AVIC is attempted with a new physid table.
  The new shadow physid table has 0 entries, thus it needs to be synced.
 
- On each VM entry, if the vCPU's shadow physid table is not NULL but is not synced, then all the entries in the
  table are created (synced):
 
  - the apic backing page pointed to by the entry is pinned in RAM, and its real physical address is written 
    to the shadow entry
 
  - the L1 vCPU in the entry, when valid (is_running=1), is translated to an L0 apic id based on which physical
    CPU the L1 vCPU runs on, and the value is written to the shadow entry.
 
- On nested VM exit, pretty much nothing is done in regard to the shadow physid tables:
  the vCPU keeps its shadow physid table, and its shadow entries are still valid and point to pinned apic backing pages.
 
- Once L1 is running, if it is about to schedule an L2 vCPU out, it can toggle the is_running bit, which will trigger
  write tracking and update the shadow physid table.
 
 
- On another nested VM entry with the *same* physid table, nothing happens
  (unless for some reason the guest increased the number of entries; then the new entries are synced, which
  is very rare - practically it can only happen on nested CPU hotplug).
 
- On another nested VM entry with a different physid table:
 
  - The current table's refcount is decreased, and the table is freed if it reaches 0. Freeing triggers unpinning of
    all guest apic backing pages referenced by the table.
 
    This relatively simple approach means that if L1 switches a lot between nested guests, and these guests don't
    have many vCPUs, it would be possible for all nested vCPUs to switch to one physid page and then to another,
    thus triggering freeing of the first page and creation of the second, and then vice versa.
 
    In my testing that doesn't happen that often, unless there is quite some oversubscription and/or double nesting
    (which leads to L1 running two guests and switching between them like crazy).
 
    The design choice was made to avoid keeping a cache of physid tables (like the mmu does) and shrinking it once in
    a while.
 
    The problem with such a cache is that each inactive physid table in it (whose guest page might very well have 
    been reused for something else already) would keep all the apic backing pages its entries reference pinned 
    in guest memory.
 
    With this design choice, the maximum number of shadow physid tables is the number of vCPUs.
 
  - The new physid table is looked up in the hash table, and created if not found there.
 
 
- When a vCPU disables nesting (clears EFER.SVME) and/or the VM is shut down, the physid table that belongs to it
  has its refcount decreased as well, which can also lead to it being freed.
  
  So when L1 fully disables nesting (which in the KVM case means that it destroyed all its VMs), all shadow physid
  pages will be freed.
 
 
- When an L1 vCPU is migrated across physical cpus and/or scheduled in/out, all shadow physid table entries which
  reference this vCPU are updated.
 
  NOTE: usually there will be zero or one such entries, but if this L1 vCPU is oversubscribed, it is possible 
  for two physid tables to contain entries that reference this vCPU, if two guests run on it at almost the 
  same time. 
 
  This can't happen if the guest hypervisor is KVM, because KVM always unloads the previous vCPU before it loads the
  next one, which leads to setting is_running to 0 on the previous vCPU.
 
  In case of double nesting, KVM also clears the is_running bit of the L1 guest before running L2.
 
  A linked list of just these entries is kept in each L1 vCPU, and it is protected from races vs write
  tracking by a spinlock.
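 
The load/put path then looks roughly like this (a sketch; 'sentry' is a pointer into
the shadow table, 'physid_ref_entries' is the per-vCPU linked list, and the lock name
is made up; avic_make_physid_entry() is the helper from the sketch above):
 
static void avic_update_peer_physid_entries(struct kvm_vcpu *vcpu, int cpu)
{
	struct kvm_svm *kvm_svm = to_kvm_svm(vcpu->kvm);
	struct avic_physid_entry_ref *e;
	bool is_running = cpu != -1;
	unsigned long flags;

	spin_lock_irqsave(&kvm_svm->avic.table_entries_lock, flags);

	/* update every shadow entry that references this L1 vCPU */
	list_for_each_entry(e, &to_svm(vcpu)->nested.physid_ref_entries, link)
		WRITE_ONCE(*e->sentry,
			   avic_make_physid_entry(e->backing_page_hpa,
						  is_running ?
						  kvm_cpu_get_apicid(cpu) : 0,
						  is_running));

	spin_unlock_irqrestore(&kvm_svm->avic.table_entries_lock, flags);
}
 
This would be called from the vCPU load hook with the new physical cpu, and from the
put hook with cpu == -1 to clear is_running.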
 
=====================================================================================================================
Locking in the nested AVIC
=====================================================================================================================
 
First of all, I use two locks.
 
1. A per-VM mutex that roughly shares the same purpose as 'kvm->mmu_lock': it protects the hash table, and also just  
   serializes some operations.
 
2. A per-VM spinlock which protects access to the physical-CPU portion of the physid tables. It is either taken with the
   mutex held, or taken alone.
 
The choice of two locks is a bit odd, and I might be able to get away with only a single spinlock.
 
Let me now explain how the locking is used and how it compares with kvm's mmu lock:
 
======================================
-> Nested VM entry
======================================
 
  mutex -> spinlock
 
  The mutex ensures that KVM doesn't race against another nested VM entry which is also trying to create the 
  shadow physid page.
 
  The spinlock ensures that we don't race with a vCPU being scheduled in/out and updating the is_running bits.
 
  kvm's mmu:
        - kvm_mmu_load is called when the current mmu root is invalid
        - the mmu lock is taken, and a new mmu root page is created, or an existing one is looked up in the hash table
 
======================================
-> VM entry
======================================
 
  mutex -> spinlock
 
  (done only when KVM_REQ_APIC_PAGE_RELOAD is pending)
 
  Very similar to the nested VM entry; in practice this will happen *very rarely*, because it can only happen if a 
  memslot that *contains* the page got flushed, or if write tracking detected an unusual write to the page
  (like an update of the avic backing page address).
 
  kvm's mmu:
	- kvm_mmu_load is called when the current mmu root is invalid
	- the mmu lock is taken, and a new mmu root page is created, or an existing one is looked up in the hash table
 
======================================
-> Write tracking <-
======================================
 
   mutex -> spinlock
 
   Also like the above. 
 
   - If the write only touches the is_running part of an entry, only the is_running bits in the shadow physid 
     table are updated.
 
   - Otherwise, all entries in the table are erased and the KVM_REQ_APIC_PAGE_RELOAD request is raised, which ensures 
     that if the table is used on another CPU, it will be re-synced before being used again.
     
     That is also very rare, unless the guest stopped using the page as a physid page, in which case
     the page will just be dropped by the vCPUs which still reference it but don't use it.
 
   kvm’s mmu:
 
   - kvm_mmu_pte_write is called
 
   - the mmu lock is taken, and the write is applied to the affected sptes
 
   - if an unaligned write / write flooding is detected, the page is zapped
 
   - zapped root pages, since they could still be in use by other cpus, are removed from 
     the list of active pages, and KVM_REQ_MMU_FREE_OBSOLETE_ROOTS is raised
 
   - KVM_REQ_MMU_FREE_OBSOLETE_ROOTS makes each vcpu get rid of its mmu root if it was zapped, which later leads
     to 'kvm_mmu_load' creating a new root shadow page
 
     (this is similar to raising KVM_REQ_APIC_PAGE_RELOAD)
 
======================================
-> Memslot flush <-
======================================
 
    mutex -> spinlock
 
   - A memslot flush happens very rarely, and leads to erasing all shadow physid tables in the memslot
     and raising KVM_REQ_APIC_PAGE_RELOAD, which, if some vCPUs use these pages, will make them re-sync them.
 
   kvm’s mmu:
       kvm_mmu_invalidate_zap_pages_in_memslot is called, which
	   - takes the mmu lock
	   - zaps *all* the shadow pages (kvm_mmu_zap_all_fast)
	   - raises KVM_REQ_MMU_FREE_OBSOLETE_ROOTS to re-create all the current mmu roots
 
======================================
-> L1 vCPU schedule in/out <-
======================================
 
   *only the spinlock is taken*
 
    Here KVM only updates the is_running bit in the shadow physid tables that reference this vCPU, using the linked 
    list of these entries.
 
    This can be optimized to avoid taking the spinlock when the linked list is empty, using the correct memory barriers.
 
    kvm mmu: No equivalent.
 
======================================
-> Unaccelerated IPI emulation <-
======================================
 
   * no locks are taken *
 
   The guest physid table is read to determine the guest's value of the is_running bit. This is done without locking vs
   write tracking, because the guest itself must ensure that it uses locking or barriers to avoid a race here.
 
======================================
-> Nested doorbell emulation <-
======================================
 
   * no locks are taken *
 
   Thankfully the code doesn't need the physid table at all; it just needs to translate the L1 apic id to the L0 
   apic id and ring the real doorbell.
 
=====================================================================================================================
Ideas for improvement:
=====================================================================================================================
 
1. Stop pinning the avic backing pages. 
 
   While these pages have to be pinned when a vCPU uses them directly, they don't have to be pinned when 
   a physid table references them, if they could instead be pinned on demand.
 
   Many of these tables might not be used anymore, and until KVM finds out, these backing pages will be pinned 
   for nothing.
 
   The problem with this is that keeping the 'V' (valid/present) bit off in the shadow table isn't suitable for 
   on-demand access to these entries like one would do in paging. The reason: when sending a broadcast interrupt
   through AVIC, it ignores the non-valid entries and doesn't VMexit, which makes sense but ruins the plan.
 
   However, there is a way to overcome this: a valid shadow physid table entry is created which points to a 'dummy' 
   page and doesn't have the 'is_running' bit set. 
 
   For such an entry, AVIC will set the IRR bits in that dummy page and then signal an unaccelerated IPI vm exit,
   and then KVM can detect the condition, locate and swap in the AVIC backing page, and write the bit there manually,
   by looking at what was written in the ICR (which is thankfully in the vm exit info field).
 
   This, together with hooking into the mmu notifiers to erase shadow physid entries when an apic backing page is 
   swapped out, should make it work (see the sketch below).

   The downside of this is that I would have to emulate more of the AVIC: I would have to set the IRR bits manually
   in the apic backing pages I just pinned.
   And I would need a hash to track all the avic backing pages, so that when I get an mmu notifier notification, 
   I can know that a page is an apic backing page, and I would also need to know which physid tables reference it 
   (I need a sort of 'rmap' for this).
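 
   A sketch of what such a 'dummy' entry would look like ('dummy_page' would be a page
   allocated per VM just for this; masks as in the earlier sketch):
 
static u64 avic_make_dummy_physid_entry(struct page *dummy_page)
{
	/*
	 * Valid, so that broadcast IPIs don't skip the entry, but with
	 * is_running=0 and a dummy backing page: AVIC sets the IRR bit in
	 * the dummy page and signals an unaccelerated IPI vm exit, which
	 * KVM can then emulate against the real backing page.
	 */
	return (page_to_phys(dummy_page) &
		AVIC_PHYSICAL_ID_ENTRY_BACKING_PAGE_MASK) |
	       AVIC_PHYSICAL_ID_ENTRY_VALID_MASK;
}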
 
2. Use just a spinlock.
 
  - I have to use a spinlock, because it is the only locking primitive that can be used from the L1 vCPU load/put
    functions, which are called from schedule().
 
  - I can avoid using the mutex. It is currently needed because allocation of a physid table can sleep, and also
    because pinning the avic backing pages and accessing the guest physid table can sleep as well; with only a
    spinlock, I would take it just in the short critical sections where the is_running bits in the shadow table
    are updated, and nowhere else.
 
    KVM's mmu avoids the first issue by having a pre-allocated cache of mmu pages, and for the second issue, 
    it either uses atomic guest access functions and retries if they fail (i.e. need to sleep), or pre-reads the 
    values (like in the mmu page walk struct), then takes the mmu spinlock, and then uses the pre-read values.
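 
    E.g. the sync path could pre-read the guest entry before taking the spinlock
    (a sketch; avic_update_shadow_entry() is a made-up helper, and the pinning of
    the backing page would also happen before the lock is taken):
 
static int avic_sync_physid_entry(struct kvm_vcpu *vcpu,
				  struct avic_physid_table *t, int idx)
{
	unsigned long flags;
	u64 gentry;
	int ret;

	/* may sleep, so do it before taking the spinlock */
	ret = kvm_read_guest(vcpu->kvm, gfn_to_gpa(t->gfn) + idx * 8,
			     &gentry, sizeof(gentry));
	if (ret)
		return ret;

	spin_lock_irqsave(&to_kvm_svm(vcpu->kvm)->avic.table_entries_lock, flags);
	avic_update_shadow_entry(t, idx, gentry);
	spin_unlock_irqrestore(&to_kvm_svm(vcpu->kvm)->avic.table_entries_lock, flags);
	return 0;
}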

 
Your feedback, ideas, and of course review of the patches are very welcome!
 
Best regards,
	Maxim Levitsky


> 
> > > > I can also stash this boolean (like 'bool registered;') into the 'struct
> > > > kvm_page_track_notifier_node',  and thus allow the
> > > > kvm_page_track_register_notifier to be called more than once -  then I can
> > > > also get rid of __kvm_page_track_register_notifier. 
> > > 
> > > No, allowing redundant registration without proper refcounting leads to pain,
> > > e.g. X registers, Y registers, X unregisters, kaboom.
> > > 
> > 
> > True, but then what about adding a refcount to 'struct kvm_page_track_notifier_node'
> > instead of a boolean, and allowing redundant registration?
> > Probably not worth it, in which case I am OK with adding a refcount to my avic code.
> 
> Ya, I would rather force AVIC to do the refcounting.  Existing users don't need a
> refcount, and doing the refcounting in AVIC code means kvm_page_track_notifier_node
> can WARN on redundant registration, i.e. can sanity check the AVIC code to some
> extent.
> 
> > Or maybe just scrap the whole thing and leave registration and
> > activation of the write tracking as two separate things? Honestly, now that
> > looks like the cleanest solution.
> 
> It's the easiest, but IMO it's not the cleanest.  Allowing notifiers to be
> registered without tracking being enabled is undesirable, especially since we know
> we can prevent it.
> 






^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Nested AVIC design (was:Re: [RFC PATCH v3 04/19] KVM: x86: mmu: allow to enable write tracking externally)
  2022-08-08 13:13               ` Nested AVIC design (was:Re: [RFC PATCH v3 04/19] KVM: x86: mmu: allow to enable write tracking externally) Maxim Levitsky
@ 2022-09-29 22:38                 ` Sean Christopherson
  2022-10-03  7:27                   ` Maxim Levitsky
  0 siblings, 1 reply; 57+ messages in thread
From: Sean Christopherson @ 2022-09-29 22:38 UTC (permalink / raw)
  To: Maxim Levitsky
  Cc: kvm, Wanpeng Li, Vitaly Kuznetsov, Jani Nikula, Paolo Bonzini,
	Tvrtko Ursulin, Rodrigo Vivi, Zhenyu Wang, Joonas Lahtinen,
	Tom Lendacky, Ingo Molnar, David Airlie, Thomas Gleixner,
	Dave Hansen, x86, intel-gfx, Daniel Vetter, Borislav Petkov,
	Joerg Roedel, linux-kernel, Jim Mattson, Zhi Wang, Brijesh Singh,
	H. Peter Anvin, intel-gvt-dev, dri-devel

On Mon, Aug 08, 2022, Maxim Levitsky wrote:
> Hi Sean, Paolo, and everyone else who wants to review my nested AVIC work.

Before we dive deep into design details, I think we should first decide whether
or not nested AVIC is worth pursuing/supporting.

  - Rome has a ucode/silicon bug with no known workaround and no anticipated fix[*];
    AMD's recommended "workaround" is to disable AVIC.
  - AVIC is not available in Milan, which may or may not be related to the
    aforementioned bug.
  - AVIC is making a comeback on Zen4, but Zen4 comes with x2AVIC.
  - x2APIC is likely going to become ubiquitous, e.g. Intel is effectively
    requiring x2APIC to fudge around xAPIC bugs.
  - It's actually quite realistic to effectively force the guest to use x2APIC,
    at least if it's a Linux guest.  E.g. turn x2APIC on in BIOS, which is often
    (always?) controlled by the host, and Linux will use x2APIC.

In other words, given that AVIC is well on its way to becoming a "legacy" feature,
IMO there needs to be a fairly strong use case to justify taking on this much code
and complexity.  ~1500 lines of code to support a feature that has historically
been buggy _without_ nested support is going to require a non-trivial amount of
effort to review, stabilize, and maintain.

[*] 1235 "Guest With AVIC (Advanced Virtual Interrupt Controller) Enabled May Fail
    to Process IPI (Inter-Processor Interrupt) Until Guest Is Re-Scheduled" in
    https://www.amd.com/system/files/TechDocs/56323-PUB_1.00.pdf

^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Nested AVIC design (was:Re: [RFC PATCH v3 04/19] KVM: x86: mmu: allow to enable write tracking externally)
  2022-09-29 22:38                 ` Sean Christopherson
@ 2022-10-03  7:27                   ` Maxim Levitsky
  2022-11-10  0:47                     ` Sean Christopherson
  0 siblings, 1 reply; 57+ messages in thread
From: Maxim Levitsky @ 2022-10-03  7:27 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: kvm, Wanpeng Li, Vitaly Kuznetsov, Jani Nikula, Paolo Bonzini,
	Tvrtko Ursulin, Rodrigo Vivi, Zhenyu Wang, Joonas Lahtinen,
	Tom Lendacky, Ingo Molnar, David Airlie, Thomas Gleixner,
	Dave Hansen, x86, intel-gfx, Daniel Vetter, Borislav Petkov,
	Joerg Roedel, linux-kernel, Jim Mattson, Zhi Wang, Brijesh Singh,
	H. Peter Anvin, intel-gvt-dev, dri-devel

On Thu, 2022-09-29 at 22:38 +0000, Sean Christopherson wrote:
> On Mon, Aug 08, 2022, Maxim Levitsky wrote:
> > Hi Sean, Paolo, and everyone else who wants to review my nested AVIC work.
> 
> Before we dive deep into design details, I think we should first decide whether
> or not nested AVIC is worth pursuing/supporting.
> 
>   - Rome has a ucode/silicon bug with no known workaround and no anticipated fix[*];
>     AMD's recommended "workaround" is to disable AVIC.
>   - AVIC is not available in Milan, which may or may not be related to the
>     aforementioned bug.
>   - AVIC is making a comeback on Zen4, but Zen4 comes with x2AVIC.
>   - x2APIC is likely going to become ubiquitous, e.g. Intel is effectively
>     requiring x2APIC to fudge around xAPIC bugs.
>   - It's actually quite realistic to effectively force the guest to use x2APIC,
>     at least if it's a Linux guest.  E.g. turn x2APIC on in BIOS, which is often
>     (always?) controlled by the host, and Linux will use x2APIC.
> 
> In other words, given that AVIC is well on its way to becoming a "legacy" feature,
> IMO there needs to be a fairly strong use case to justify taking on this much code
> and complexity.  ~1500 lines of code to support a feature that has historically
> been buggy _without_ nested support is going to require a non-trivial amount of
> effort to review, stabilize, and maintain.
> 
> [*] 1235 "Guest With AVIC (Advanced Virtual Interrupt Controller) Enabled May Fail
>     to Process IPI (Inter-Processor Interrupt) Until Guest Is Re-Scheduled" in
>     https://www.amd.com/system/files/TechDocs/56323-PUB_1.00.pdf
> 

I am afraid that you are mixing things up:

x2avic is just a minor addition to AVIC; it is still, for
all practical purposes, the same feature.

 
1. AVIC is indeed kind of broken on Zen2, but AFAIK for all practical purposes,
   including nesting, it works fine; the erratum only shows up in a unit test and/or
   under very specific workloads (most of the time a delayed wakeup doesn't cause a hang).
   Still, I agree that for production, Zen2 should not have AVIC enabled.
 

2. Zen3 does indeed have AVIC soft disabled in CPUID. AFAIK it works just fine,
   but I understand that customers won't use it against AMD's guidance.
 
 
3. On Zen4, AVIC is fully enabled and also extended to support x2apic mode.
   The fact that AVIC was extended to support X2apic mode also shows that AMD
   is committed to supporting it.
 
 
My nested AVIC code technically doesn't expose x2avic to the guest, but that
is pretty much trivial to add (I am only waiting to get my hands on a Zen4 machine
to do it), and even in its current form it would work just fine if the host
uses normal AVIC
(or even doesn't use AVIC at all - the nested AVIC code works just fine
even if the host has its AVIC inhibited for some reason).
 
Adding nested x2avic support is literally about not passing through that MMIO address,
enabling the x2avic bit in int_ctl, and opening up access to the x2apic msrs.
Plus, I need to make some minor changes in the unaccelerated IPI handler, dealing
with the read-only logical ID and such.
 
Physid tables, apic backing pages, doorbell emulation - 
everything is pretty much unchanged.
 
So AVIC is anything but a legacy feature, and my nested AVIC code will support
both nested AVIC and nested x2AVIC.
 
Best regards,
	Maxim Levitsky
 
 


^ permalink raw reply	[flat|nested] 57+ messages in thread

* Re: Nested AVIC design (was:Re: [RFC PATCH v3 04/19] KVM: x86: mmu: allow to enable write tracking externally)
  2022-10-03  7:27                   ` Maxim Levitsky
@ 2022-11-10  0:47                     ` Sean Christopherson
  0 siblings, 0 replies; 57+ messages in thread
From: Sean Christopherson @ 2022-11-10  0:47 UTC (permalink / raw)
  To: Maxim Levitsky
  Cc: kvm, Wanpeng Li, Vitaly Kuznetsov, Jani Nikula, Paolo Bonzini,
	Tvrtko Ursulin, Rodrigo Vivi, Zhenyu Wang, Joonas Lahtinen,
	Tom Lendacky, Ingo Molnar, David Airlie, Thomas Gleixner,
	Dave Hansen, x86, intel-gfx, Daniel Vetter, Borislav Petkov,
	Joerg Roedel, linux-kernel, Jim Mattson, Zhi Wang, Brijesh Singh,
	H. Peter Anvin, intel-gvt-dev, dri-devel

Sorry for the super slow reply, I don't have a good excuse other than that I needed to
take a break from AVIC code...

On Mon, Oct 03, 2022, Maxim Levitsky wrote:
> On Thu, 2022-09-29 at 22:38 +0000, Sean Christopherson wrote:
> > On Mon, Aug 08, 2022, Maxim Levitsky wrote:
> > > Hi Sean, Paolo, and everyone else who wants to review my nested AVIC work.
> > 
> > Before we dive deep into design details, I think we should first decide whether
> > or not nested AVIC is worth pursuing/supporting.
> > 
> >   - Rome has a ucode/silicon bug with no known workaround and no anticipated fix[*];
> >     AMD's recommended "workaround" is to disable AVIC.
> >   - AVIC is not available in Milan, which may or may not be related to the
> >     aforementioned bug.
> >   - AVIC is making a comeback on Zen4, but Zen4 comes with x2AVIC.
> >   - x2APIC is likely going to become ubiquitous, e.g. Intel is effectively
> >     requiring x2APIC to fudge around xAPIC bugs.
> >   - It's actually quite realistic to effectively force the guest to use x2APIC,
> >     at least if it's a Linux guest.  E.g. turn x2APIC on in BIOS, which is often
> >     (always?) controlled by the host, and Linux will use x2APIC.
> > 
> > In other words, given that AVIC is well on its way to becoming a "legacy" feature,
> > IMO there needs to be a fairly strong use case to justify taking on this much code
> > and complexity.  ~1500 lines of code to support a feature that has historically
> > been buggy _without_ nested support is going to require a non-trivial amount of
> > effort to review, stabilize, and maintain.
> > 
> > [*] 1235 "Guest With AVIC (Advanced Virtual Interrupt Controller) Enabled May Fail
> >     to Process IPI (Inter-Processor Interrupt) Until Guest Is Re-Scheduled" in
> >     https://www.amd.com/system/files/TechDocs/56323-PUB_1.00.pdf
> > 
> 
> I am afraid that you are mixing things up:
> 
> x2avic is just a minor addition to AVIC; it is still, for
> all practical purposes, the same feature.

...

> Physid tables, apic backing pages, doorbell emulation, 
> everything is pretty much unchanged.

Ya, it finally clicked for me that KVM would need to shadow the physical ID
tables irrespective of x2APIC.

I'm still very hesitant to support full virtualization of nested (x2)AVIC.  The
complexity and amount of code is daunting, and nSVM has lower-hanging fruit that
we should pick before going after full nested (x2)AVIC, e.g. SVM's TLB flushing
needs a serious overhaul.  And if we go through the pain for SVM, I think we'd
probably want to come up with a solution that can be at least shared with
VMX's IPI virtualization.

As an intermediate step, can we expose (x2)AVIC to L2 without any shadowing?
E.g. run all L2s with a single dummy physical ID table and emulate IPIs in KVM?

If that works, that seems like a logical first step even if we want to eventually
support nested IPI virtualization.

^ permalink raw reply	[flat|nested] 57+ messages in thread

end of thread, other threads:[~2022-11-10  0:47 UTC | newest]

Thread overview: 57+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-04-27 20:02 [RFC PATCH v3 00/19] RFC: nested AVIC Maxim Levitsky
2022-04-27 20:02 ` [RFC PATCH v3 01/19] KVM: x86: document AVIC/APICv inhibit reasons Maxim Levitsky
2022-05-18 15:56   ` Sean Christopherson
2022-05-18 17:13     ` Maxim Levitsky
2022-04-27 20:02 ` [RFC PATCH v3 02/19] KVM: x86: inhibit APICv/AVIC when the guest and/or host changes apic id/base from the defaults Maxim Levitsky
2022-05-18  8:28   ` Chao Gao
2022-05-18  9:50     ` Maxim Levitsky
2022-05-18 11:51       ` Chao Gao
2022-05-18 12:36         ` Maxim Levitsky
2022-05-18 15:39       ` Sean Christopherson
2022-05-18 17:15         ` Maxim Levitsky
2022-05-19 16:06   ` Sean Christopherson
2022-05-22  9:03     ` Maxim Levitsky
2022-05-22 14:47       ` Jim Mattson
2022-05-23  6:50         ` Maxim Levitsky
2022-05-23 17:22           ` Jim Mattson
2022-05-23 17:31           ` Sean Christopherson
2022-06-23  9:44     ` Maxim Levitsky
2022-04-27 20:02 ` [RFC PATCH v3 03/19] KVM: x86: SVM: remove avic's broken code that updated APIC ID Maxim Levitsky
2022-05-19 16:10   ` Sean Christopherson
2022-05-22  9:01     ` Maxim Levitsky
2022-05-23 17:19       ` Sean Christopherson
2022-04-27 20:02 ` [RFC PATCH v3 04/19] KVM: x86: mmu: allow to enable write tracking externally Maxim Levitsky
2022-05-19 16:27   ` Sean Christopherson
2022-05-22 10:21     ` Maxim Levitsky
2022-05-19 16:37   ` Sean Christopherson
2022-05-22 10:22     ` Maxim Levitsky
2022-07-20 14:42       ` Maxim Levitsky
2022-07-25 16:08         ` Sean Christopherson
2022-07-28  7:46           ` Maxim Levitsky
2022-08-01 15:53             ` Maxim Levitsky
2022-08-01 17:20             ` Sean Christopherson
2022-08-08 13:13               ` Nested AVIC design (was:Re: [RFC PATCH v3 04/19] KVM: x86: mmu: allow to enable write tracking externally) Maxim Levitsky
2022-09-29 22:38                 ` Sean Christopherson
2022-10-03  7:27                   ` Maxim Levitsky
2022-11-10  0:47                     ` Sean Christopherson
2022-04-27 20:03 ` [RFC PATCH v3 05/19] x86: KVMGT: use kvm_page_track_write_tracking_enable Maxim Levitsky
2022-05-19 16:38   ` Sean Christopherson
2022-04-27 20:03 ` [RFC PATCH v3 06/19] KVM: x86: mmu: add gfn_in_memslot helper Maxim Levitsky
2022-05-19 16:43   ` Sean Christopherson
2022-05-22 10:22     ` Maxim Levitsky
2022-05-22 12:12     ` Maxim Levitsky
2022-04-27 20:03 ` [RFC PATCH v3 07/19] KVM: x86: mmu: tweak fast path for emulation of access to nested NPT pages Maxim Levitsky
2022-04-27 20:03 ` [RFC PATCH v3 08/19] KVM: x86: SVM: move avic state to separate struct Maxim Levitsky
2022-04-27 20:03 ` [RFC PATCH v3 09/19] KVM: x86: nSVM: add nested AVIC tracepoints Maxim Levitsky
2022-04-27 20:03 ` [RFC PATCH v3 10/19] KVM: x86: nSVM: implement AVIC's physid/logid table access helpers Maxim Levitsky
2022-04-27 20:03 ` [RFC PATCH v3 11/19] KVM: x86: nSVM: implement shadowing of AVIC's physical id table Maxim Levitsky
2022-04-27 20:03 ` [RFC PATCH v3 12/19] KVM: x86: nSVM: make nested AVIC physid write tracking be aware of the host scheduling Maxim Levitsky
2022-04-27 20:03 ` [RFC PATCH v3 13/19] KVM: x86: nSVM: wire nested AVIC to nested guest entry/exit Maxim Levitsky
2022-04-27 20:03 ` [RFC PATCH v3 14/19] KVM: x86: rename .set_apic_access_page_addr to reload_apic_access_page Maxim Levitsky
2022-05-19 16:55   ` Sean Christopherson
2022-05-22 10:22     ` Maxim Levitsky
2022-04-27 20:03 ` [RFC PATCH v3 15/19] KVM: x86: nSVM: add code to reload AVIC physid table when it is invalidated Maxim Levitsky
2022-04-27 20:03 ` [RFC PATCH v3 16/19] KVM: x86: nSVM: implement support for nested AVIC vmexits Maxim Levitsky
2022-04-27 20:03 ` [RFC PATCH v3 17/19] KVM: x86: nSVM: implement nested AVIC doorbell emulation Maxim Levitsky
2022-04-27 20:03 ` [RFC PATCH v3 18/19] KVM: x86: SVM/nSVM: add optional non strict AVIC doorbell mode Maxim Levitsky
2022-04-27 20:03 ` [RFC PATCH v3 19/19] KVM: x86: nSVM: expose the nested AVIC to the guest Maxim Levitsky
