* [PATCH v2 0/8] My AVIC patch queue
@ 2021-07-13 14:20 Maxim Levitsky
  2021-07-13 14:20 ` [PATCH v2 1/8] KVM: SVM: svm_set_vintr don't warn if AVIC is active but is about to be deactivated Maxim Levitsky
                   ` (8 more replies)
  0 siblings, 9 replies; 32+ messages in thread
From: Maxim Levitsky @ 2021-07-13 14:20 UTC (permalink / raw)
  To: kvm
  Cc: open list:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	Jim Mattson, Joerg Roedel, Borislav Petkov, Vitaly Kuznetsov,
	Wanpeng Li, Paolo Bonzini, Thomas Gleixner, H. Peter Anvin,
	Ingo Molnar, maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	Sean Christopherson, Maxim Levitsky

Hi!

This is a series of bugfixes to the AVIC dynamic inhibition, made while
trying to fix as many bugs in this area as possible and to make the
conditional AVIC+SynIC enablement work.

* Patches 1-4 address an issue of a possible
  mismatch between the AVIC inhibit state and the AVIC enable state on all vCPUs.

  Since AVIC's state is changed via a request, there is a window during which
  the two states differ, which can lead to various warnings and errors.

  There was an earlier attempt to fix this by changing the AVIC enable state
  on the current vCPU immediately when the AVIC inhibit request is created;
  however, while this fixes the common case, it actually hides the issue deeper,
  because on all vCPUs but the current one the two states can still
  mismatch until KVM_REQ_APICV_UPDATE is processed on each of them
  (a sketch of this flow follows below).

  My take on this is to fix the places where the mismatch causes the
  issues instead, and then drop the special case of toggling the AVIC right
  away in kvm_request_apicv_update.

  V2: I rewrote the commit description for the patch that touches
    AVIC inhibition in the nested case.
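
  For illustration, here is a simplified sketch of the flow that creates
  this window (using names from the code touched by this series; this is
  an outline of the logic, not the actual implementation):

	void kvm_request_apicv_update(struct kvm *kvm, bool activate, ulong bit)
	{
		/* (1) the per-VM inhibit state changes here ... */
		if (activate)
			clear_bit(bit, &kvm->arch.apicv_inhibit_reasons);
		else
			set_bit(bit, &kvm->arch.apicv_inhibit_reasons);

		kvm_make_all_cpus_request(kvm, KVM_REQ_APICV_UPDATE);

		/*
		 * (2) ... but each vCPU's apicv_active flag only changes
		 * when that vCPU processes KVM_REQ_APICV_UPDATE in
		 * vcpu_enter_guest, so until then kvm_apicv_activated(kvm)
		 * and kvm_vcpu_apicv_active(vcpu) can disagree on every
		 * vCPU that hasn't re-entered the guest yet.
		 */
	}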

* Patches 5-6 in this series fix a race condition which can cause
  a lost write from a guest to the APIC when the APIC write races with
  AVIC un-inhibition, and add a warning to catch this problem
  if it ever re-emerges.

  V2: I re-implemented this with a mutex.

* Patch 7 fixes yet another issue I found in the AVIC inhibit code:
  currently avic_vcpu_load/avic_vcpu_put are called on userspace entry/exit
  from KVM (i.e. from vcpu_load/vcpu_put), and these functions update the
  "is running" bit in the AVIC physical ID remap table and update the
  target vCPU in the IOMMU code.

  However, neither of these functions does anything when AVIC is inhibited,
  thus the "is running" bit will be kept enabled during the exit to userspace.
  This shouldn't be a big issue as the caller
  doesn't use the AVIC while it is inhibited, but it is still inconsistent
  and can trigger the warning about this in avic_vcpu_load.

  To be on the safe side I think it makes sense to call
  avic_vcpu_put/avic_vcpu_load when inhibiting/uninhibiting the AVIC.
  This ensures that the work these functions do is matched.

* Patch 8 is the patch from Vitaly about allowing AVIC with SynIC
  as long as the guest doesn't use the AutoEOI feature. I only slightly
  changed it to drop the SRCU lock around the call to kvm_request_apicv_update
  and to also expose the AutoEOI CPUID bit regardless of AVIC enablement.

  Despite being the last patch in this series, this patch
  doesn't depend on the other fixes.

Best regards,
	Maxim Levitsky

Maxim Levitsky (7):
  KVM: SVM: svm_set_vintr don't warn if AVIC is active but is about to
    be deactivated
  KVM: SVM: tweak warning about enabled AVIC on nested entry
  KVM: SVM: use vmcb01 in svm_refresh_apicv_exec_ctrl
  KVM: x86: APICv: drop immediate APICv disablement on current vCPU
  KVM: x86: APICv: fix race in kvm_request_apicv_update on SVM
  KVM: SVM: add warning for mismatch between AVIC state and AVIC access
    page state
  KVM: SVM: call avic_vcpu_load/avic_vcpu_put when enabling/disabling
    AVIC

Vitaly Kuznetsov (1):
  KVM: x86: hyper-v: Deactivate APICv only when AutoEOI feature is in
    use

 arch/x86/include/asm/kvm_host.h |  3 ++
 arch/x86/kvm/hyperv.c           | 34 ++++++++++++++++----
 arch/x86/kvm/svm/avic.c         | 45 ++++++++++++++------------
 arch/x86/kvm/svm/nested.c       |  2 +-
 arch/x86/kvm/svm/svm.c          | 18 ++++++++---
 arch/x86/kvm/x86.c              | 57 ++++++++++++++++++---------------
 include/linux/kvm_host.h        |  1 +
 virt/kvm/kvm_main.c             |  1 +
 8 files changed, 103 insertions(+), 58 deletions(-)

-- 
2.26.3




* [PATCH v2 1/8] KVM: SVM: svm_set_vintr don't warn if AVIC is active but is about to be deactivated
  2021-07-13 14:20 [PATCH v2 0/8] My AVIC patch queue Maxim Levitsky
@ 2021-07-13 14:20 ` Maxim Levitsky
  2021-07-13 14:20 ` [PATCH v2 2/8] KVM: SVM: tweak warning about enabled AVIC on nested entry Maxim Levitsky
                   ` (7 subsequent siblings)
  8 siblings, 0 replies; 32+ messages in thread
From: Maxim Levitsky @ 2021-07-13 14:20 UTC (permalink / raw)
  To: kvm
  Cc: open list:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	Jim Mattson, Joerg Roedel, Borislav Petkov, Vitaly Kuznetsov,
	Wanpeng Li, Paolo Bonzini, Thomas Gleixner, H. Peter Anvin,
	Ingo Molnar, maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	Sean Christopherson, Maxim Levitsky

It is possible for the AVIC inhibit state and the AVIC active state to be
mismatched. Currently we disable AVIC right away on the vCPU which started
the AVIC inhibit request, thus this warning doesn't trigger, but at least
in theory, if svm_set_vintr is called at the same time on multiple vCPUs,
the warning can happen.

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
---
 arch/x86/kvm/svm/svm.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 12c06ea28f5c..d4d14e318ab5 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -1560,8 +1560,11 @@ static void svm_set_vintr(struct vcpu_svm *svm)
 {
 	struct vmcb_control_area *control;
 
-	/* The following fields are ignored when AVIC is enabled */
-	WARN_ON(kvm_vcpu_apicv_active(&svm->vcpu));
+	/*
+	 * The following fields are ignored when AVIC is enabled
+	 */
+	WARN_ON(kvm_apicv_activated(svm->vcpu.kvm));
+
 	svm_set_intercept(svm, INTERCEPT_VINTR);
 
 	/*
-- 
2.26.3



* [PATCH v2 2/8] KVM: SVM: tweak warning about enabled AVIC on nested entry
  2021-07-13 14:20 [PATCH v2 0/8] My AVIC patch queue Maxim Levitsky
  2021-07-13 14:20 ` [PATCH v2 1/8] KVM: SVM: svm_set_vintr don't warn if AVIC is active but is about to be deactivated Maxim Levitsky
@ 2021-07-13 14:20 ` Maxim Levitsky
  2021-07-13 14:20 ` [PATCH v2 3/8] KVM: SVM: use vmcb01 in svm_refresh_apicv_exec_ctrl Maxim Levitsky
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 32+ messages in thread
From: Maxim Levitsky @ 2021-07-13 14:20 UTC (permalink / raw)
  To: kvm
  Cc: open list:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	Jim Mattson, Joerg Roedel, Borislav Petkov, Vitaly Kuznetsov,
	Wanpeng Li, Paolo Bonzini, Thomas Gleixner, H. Peter Anvin,
	Ingo Molnar, maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	Sean Christopherson, Maxim Levitsky

It is possible that AVIC was requested to be disabled but
is not yet disabled, e.g. if the nested entry is done right
after svm_vcpu_after_set_cpuid.

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
---
 arch/x86/kvm/svm/nested.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
index dca20f949b63..253847f7d9aa 100644
--- a/arch/x86/kvm/svm/nested.c
+++ b/arch/x86/kvm/svm/nested.c
@@ -505,7 +505,7 @@ static void nested_vmcb02_prepare_control(struct vcpu_svm *svm)
 	 * Also covers avic_vapic_bar, avic_backing_page, avic_logical_id,
 	 * avic_physical_id.
 	 */
-	WARN_ON(svm->vmcb01.ptr->control.int_ctl & AVIC_ENABLE_MASK);
+	WARN_ON(kvm_apicv_activated(svm->vcpu.kvm));
 
 	/* Copied from vmcb01.  msrpm_base can be overwritten later.  */
 	svm->vmcb->control.nested_ctl = svm->vmcb01.ptr->control.nested_ctl;
-- 
2.26.3



* [PATCH v2 3/8] KVM: SVM: use vmcb01 in svm_refresh_apicv_exec_ctrl
  2021-07-13 14:20 [PATCH v2 0/8] My AVIC patch queue Maxim Levitsky
  2021-07-13 14:20 ` [PATCH v2 1/8] KVM: SVM: svm_set_vintr don't warn if AVIC is active but is about to be deactivated Maxim Levitsky
  2021-07-13 14:20 ` [PATCH v2 2/8] KVM: SVM: tweak warning about enabled AVIC on nested entry Maxim Levitsky
@ 2021-07-13 14:20 ` Maxim Levitsky
  2021-07-13 14:20 ` [PATCH v2 4/8] KVM: x86: APICv: drop immediate APICv disablement on current vCPU Maxim Levitsky
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 32+ messages in thread
From: Maxim Levitsky @ 2021-07-13 14:20 UTC (permalink / raw)
  To: kvm
  Cc: open list:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	Jim Mattson, Joerg Roedel, Borislav Petkov, Vitaly Kuznetsov,
	Wanpeng Li, Paolo Bonzini, Thomas Gleixner, H. Peter Anvin,
	Ingo Molnar, maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	Sean Christopherson, Maxim Levitsky

Currently, when SVM is enabled in the guest CPUID, AVIC is inhibited as soon
as the guest CPUID is set.

AVIC happens to be fully disabled on all vCPUs by the time any guest
entry starts (which matters because after a migration the first entry
can already be a nested entry).

The reason is that currently we disable AVIC right away on the vCPU from
which kvm_request_apicv_update was called, and in this case it happens to
be called on all vCPUs (by svm_vcpu_after_set_cpuid).

After we stop doing this, AVIC will end up being disabled only when
KVM_REQ_APICV_UPDATE is processed, which can be after we are done switching
to the nested guest.

Fix this by just using vmcb01 in svm_refresh_apicv_exec_ctrl for AVIC
(which is the right thing to do anyway).

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
---
 arch/x86/kvm/svm/avic.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
index a9abed054cd5..64d1e1b6a305 100644
--- a/arch/x86/kvm/svm/avic.c
+++ b/arch/x86/kvm/svm/avic.c
@@ -646,7 +646,7 @@ static int svm_set_pi_irte_mode(struct kvm_vcpu *vcpu, bool activate)
 void svm_refresh_apicv_exec_ctrl(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_svm *svm = to_svm(vcpu);
-	struct vmcb *vmcb = svm->vmcb;
+	struct vmcb *vmcb = svm->vmcb01.ptr;
 	bool activated = kvm_vcpu_apicv_active(vcpu);
 
 	if (!enable_apicv)
-- 
2.26.3



* [PATCH v2 4/8] KVM: x86: APICv: drop immediate APICv disablement on current vCPU
  2021-07-13 14:20 [PATCH v2 0/8] My AVIC patch queue Maxim Levitsky
                   ` (2 preceding siblings ...)
  2021-07-13 14:20 ` [PATCH v2 3/8] KVM: SVM: use vmcb01 in svm_refresh_apicv_exec_ctrl Maxim Levitsky
@ 2021-07-13 14:20 ` Maxim Levitsky
  2021-07-13 14:20 ` [PATCH v2 5/8] KVM: x86: APICv: fix race in kvm_request_apicv_update on SVM Maxim Levitsky
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 32+ messages in thread
From: Maxim Levitsky @ 2021-07-13 14:20 UTC (permalink / raw)
  To: kvm
  Cc: open list:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	Jim Mattson, Joerg Roedel, Borislav Petkov, Vitaly Kuznetsov,
	Wanpeng Li, Paolo Bonzini, Thomas Gleixner, H. Peter Anvin,
	Ingo Molnar, maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	Sean Christopherson, Maxim Levitsky

The special case of disabling APICv on the current vCPU right away in
kvm_request_apicv_update doesn't bring much benefit vs raising
KVM_REQ_APICV_UPDATE on it instead, since this request will be processed
on the next entry to the guest.
(The comment about waiting for another #VMEXIT is wrong.)

It also hides various assumptions that the APICv enable state matches
the APICv inhibit state, as this special case only makes those states
match on the current vCPU.

Previous patches fixed a few such assumptions, so now it should be safe
to drop this special case.

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
---
 arch/x86/kvm/x86.c | 12 +-----------
 1 file changed, 1 insertion(+), 11 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 76dae88cf524..29b92f6cbad4 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -9204,7 +9204,6 @@ EXPORT_SYMBOL_GPL(kvm_vcpu_update_apicv);
  */
 void kvm_request_apicv_update(struct kvm *kvm, bool activate, ulong bit)
 {
-	struct kvm_vcpu *except;
 	unsigned long old, new, expected;
 
 	if (!kvm_x86_ops.check_apicv_inhibit_reasons ||
@@ -9230,16 +9229,7 @@ void kvm_request_apicv_update(struct kvm *kvm, bool activate, ulong bit)
 	if (kvm_x86_ops.pre_update_apicv_exec_ctrl)
 		static_call(kvm_x86_pre_update_apicv_exec_ctrl)(kvm, activate);
 
-	/*
-	 * Sending request to update APICV for all other vcpus,
-	 * while update the calling vcpu immediately instead of
-	 * waiting for another #VMEXIT to handle the request.
-	 */
-	except = kvm_get_running_vcpu();
-	kvm_make_all_cpus_request_except(kvm, KVM_REQ_APICV_UPDATE,
-					 except);
-	if (except)
-		kvm_vcpu_update_apicv(except);
+	kvm_make_all_cpus_request(kvm, KVM_REQ_APICV_UPDATE);
 }
 EXPORT_SYMBOL_GPL(kvm_request_apicv_update);
 
-- 
2.26.3



* [PATCH v2 5/8] KVM: x86: APICv: fix race in kvm_request_apicv_update on SVM
  2021-07-13 14:20 [PATCH v2 0/8] My AVIC patch queue Maxim Levitsky
                   ` (3 preceding siblings ...)
  2021-07-13 14:20 ` [PATCH v2 4/8] KVM: x86: APICv: drop immediate APICv disablement on current vCPU Maxim Levitsky
@ 2021-07-13 14:20 ` Maxim Levitsky
  2021-07-26 22:34   ` Paolo Bonzini
  2021-07-13 14:20 ` [PATCH v2 6/8] KVM: SVM: add warning for mismatch between AVIC state and AVIC access page state Maxim Levitsky
                   ` (3 subsequent siblings)
  8 siblings, 1 reply; 32+ messages in thread
From: Maxim Levitsky @ 2021-07-13 14:20 UTC (permalink / raw)
  To: kvm
  Cc: open list:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	Jim Mattson, Joerg Roedel, Borislav Petkov, Vitaly Kuznetsov,
	Wanpeng Li, Paolo Bonzini, Thomas Gleixner, H. Peter Anvin,
	Ingo Molnar, maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	Sean Christopherson, Maxim Levitsky

Currently on SVM, kvm_request_apicv_update calls
'pre_update_apicv_exec_ctrl' without doing any synchronization,
and that function toggles the APIC_ACCESS_PAGE_PRIVATE_MEMSLOT.

If there is a mismatch between that memslot state and the AVIC state
on one of the vCPUs, an APIC mmio write can be lost:

For example:

VCPU0: enable the APIC_ACCESS_PAGE_PRIVATE_MEMSLOT
VCPU1: write to an APIC mmio register.

Since AVIC is still disabled on VCPU1, the access will not be intercepted
by it, and neither will it cause an MMIO fault; rather, it will just update
the dummy page mapped into the APIC_ACCESS_PAGE_PRIVATE_MEMSLOT.

Fix that by adding a lock guarding the AVIC state changes, and by carefully
ordering the operations of kvm_request_apicv_update to avoid this race:

1. Take the lock
2. Send KVM_REQ_APICV_UPDATE
3. Update the apic inhibit reason
4. Release the lock

This ensures that at (2) all vCPUs are kicked out of guest mode,
but don't yet see the new AVIC state.
Then, only after (4), can all other vCPUs update their AVIC state and resume.

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
---
 arch/x86/kvm/x86.c       | 39 ++++++++++++++++++++++++---------------
 include/linux/kvm_host.h |  1 +
 virt/kvm/kvm_main.c      |  1 +
 3 files changed, 26 insertions(+), 15 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 29b92f6cbad4..a91e35b92447 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -9180,6 +9180,8 @@ void kvm_vcpu_update_apicv(struct kvm_vcpu *vcpu)
 	if (!lapic_in_kernel(vcpu))
 		return;
 
+	mutex_lock(&vcpu->kvm->apicv_update_lock);
+
 	vcpu->arch.apicv_active = kvm_apicv_activated(vcpu->kvm);
 	kvm_apic_update_apicv(vcpu);
 	static_call(kvm_x86_refresh_apicv_exec_ctrl)(vcpu);
@@ -9192,6 +9194,8 @@ void kvm_vcpu_update_apicv(struct kvm_vcpu *vcpu)
 	 */
 	if (!vcpu->arch.apicv_active)
 		kvm_make_request(KVM_REQ_EVENT, vcpu);
+
+	mutex_unlock(&vcpu->kvm->apicv_update_lock);
 }
 EXPORT_SYMBOL_GPL(kvm_vcpu_update_apicv);
 
@@ -9204,32 +9208,34 @@ EXPORT_SYMBOL_GPL(kvm_vcpu_update_apicv);
  */
 void kvm_request_apicv_update(struct kvm *kvm, bool activate, ulong bit)
 {
-	unsigned long old, new, expected;
+	unsigned long old, new;
 
 	if (!kvm_x86_ops.check_apicv_inhibit_reasons ||
 	    !static_call(kvm_x86_check_apicv_inhibit_reasons)(bit))
 		return;
 
-	old = READ_ONCE(kvm->arch.apicv_inhibit_reasons);
-	do {
-		expected = new = old;
-		if (activate)
-			__clear_bit(bit, &new);
-		else
-			__set_bit(bit, &new);
-		if (new == old)
-			break;
-		old = cmpxchg(&kvm->arch.apicv_inhibit_reasons, expected, new);
-	} while (old != expected);
+	mutex_lock(&kvm->apicv_update_lock);
+
+	old = new = kvm->arch.apicv_inhibit_reasons;
+	if (activate)
+		__clear_bit(bit, &new);
+	else
+		__set_bit(bit, &new);
+
+	kvm->arch.apicv_inhibit_reasons = new;
 
 	if (!!old == !!new)
-		return;
+		goto out;
 
 	trace_kvm_apicv_update_request(activate, bit);
+
+	kvm_make_all_cpus_request(kvm, KVM_REQ_APICV_UPDATE);
+
 	if (kvm_x86_ops.pre_update_apicv_exec_ctrl)
 		static_call(kvm_x86_pre_update_apicv_exec_ctrl)(kvm, activate);
 
-	kvm_make_all_cpus_request(kvm, KVM_REQ_APICV_UPDATE);
+out:
+	mutex_unlock(&kvm->apicv_update_lock);
 }
 EXPORT_SYMBOL_GPL(kvm_request_apicv_update);
 
@@ -9436,8 +9442,11 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 		 */
 		if (kvm_check_request(KVM_REQ_HV_STIMER, vcpu))
 			kvm_hv_process_stimers(vcpu);
-		if (kvm_check_request(KVM_REQ_APICV_UPDATE, vcpu))
+		if (kvm_check_request(KVM_REQ_APICV_UPDATE, vcpu)) {
+			srcu_read_unlock(&vcpu->kvm->srcu, vcpu->srcu_idx);
 			kvm_vcpu_update_apicv(vcpu);
+			vcpu->srcu_idx = srcu_read_lock(&vcpu->kvm->srcu);
+		}
 		if (kvm_check_request(KVM_REQ_APF_READY, vcpu))
 			kvm_check_async_pf_completion(vcpu);
 		if (kvm_check_request(KVM_REQ_MSR_FILTER_CHANGED, vcpu))
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 37cbb56ccd09..0364d35d43dc 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -524,6 +524,7 @@ struct kvm {
 #endif /* KVM_HAVE_MMU_RWLOCK */
 
 	struct mutex slots_lock;
+	struct mutex apicv_update_lock;
 
 	/*
 	 * Protects the arch-specific fields of struct kvm_memory_slots in
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index ed4d1581d502..ba5d5d9ebc64 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -943,6 +943,7 @@ static struct kvm *kvm_create_vm(unsigned long type)
 	mutex_init(&kvm->irq_lock);
 	mutex_init(&kvm->slots_lock);
 	mutex_init(&kvm->slots_arch_lock);
+	mutex_init(&kvm->apicv_update_lock);
 	INIT_LIST_HEAD(&kvm->devices);
 
 	BUILD_BUG_ON(KVM_MEM_SLOTS_NUM > SHRT_MAX);
-- 
2.26.3



* [PATCH v2 6/8] KVM: SVM: add warning for mismatch between AVIC state and AVIC access page state
  2021-07-13 14:20 [PATCH v2 0/8] My AVIC patch queue Maxim Levitsky
                   ` (4 preceding siblings ...)
  2021-07-13 14:20 ` [PATCH v2 5/8] KVM: x86: APICv: fix race in kvm_request_apicv_update on SVM Maxim Levitsky
@ 2021-07-13 14:20 ` Maxim Levitsky
  2021-07-13 14:20 ` [PATCH v2 7/8] KVM: SVM: call avic_vcpu_load/avic_vcpu_put when enabling/disabling AVIC Maxim Levitsky
                   ` (2 subsequent siblings)
  8 siblings, 0 replies; 32+ messages in thread
From: Maxim Levitsky @ 2021-07-13 14:20 UTC (permalink / raw)
  To: kvm
  Cc: open list:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	Jim Mattson, Joerg Roedel, Borislav Petkov, Vitaly Kuznetsov,
	Wanpeng Li, Paolo Bonzini, Thomas Gleixner, H. Peter Anvin,
	Ingo Molnar, maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	Sean Christopherson, Maxim Levitsky

It is never a good idea to enter a guest when the AVIC state doesn't match
the state of the AVIC MMIO page.

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
---
 arch/x86/kvm/svm/svm.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index d4d14e318ab5..d14be8aba2d7 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -3783,6 +3783,9 @@ static __no_kcsan fastpath_t svm_vcpu_run(struct kvm_vcpu *vcpu)
 
 	pre_svm_run(vcpu);
 
+	WARN_ON_ONCE(vcpu->kvm->arch.apic_access_memslot_enabled !=
+		     kvm_vcpu_apicv_active(vcpu));
+
 	sync_lapic_to_cr8(vcpu);
 
 	if (unlikely(svm->asid != svm->vmcb->control.asid)) {
-- 
2.26.3



* [PATCH v2 7/8] KVM: SVM: call avic_vcpu_load/avic_vcpu_put when enabling/disabling AVIC
  2021-07-13 14:20 [PATCH v2 0/8] My AVIC patch queue Maxim Levitsky
                   ` (5 preceding siblings ...)
  2021-07-13 14:20 ` [PATCH v2 6/8] KVM: SVM: add warning for mismatch between AVIC state and AVIC access page state Maxim Levitsky
@ 2021-07-13 14:20 ` Maxim Levitsky
  2021-07-13 14:20 ` [PATCH v2 8/8] KVM: x86: hyper-v: Deactivate APICv only when AutoEOI feature is in use Maxim Levitsky
  2021-07-26 17:24 ` [PATCH v2 0/8] My AVIC patch queue Paolo Bonzini
  8 siblings, 0 replies; 32+ messages in thread
From: Maxim Levitsky @ 2021-07-13 14:20 UTC (permalink / raw)
  To: kvm
  Cc: open list:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	Jim Mattson, Joerg Roedel, Borislav Petkov, Vitaly Kuznetsov,
	Wanpeng Li, Paolo Bonzini, Thomas Gleixner, H. Peter Anvin,
	Ingo Molnar, maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	Sean Christopherson, Maxim Levitsky

Currently it is possible to have the following scenario:

1. AVIC is disabled by svm_refresh_apicv_exec_ctrl
2. svm_vcpu_blocking calls avic_vcpu_put which does nothing
3. svm_vcpu_unblocking enables the AVIC (due to KVM_REQ_APICV_UPDATE)
   and then calls avic_vcpu_load
4. a warning is triggered in avic_vcpu_load since
   AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK was never cleared

While it is possible to just remove the warning, it seems more robust
to fully disable/enable AVIC in svm_refresh_apicv_exec_ctrl by calling
avic_vcpu_load/avic_vcpu_put.

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
---
 arch/x86/kvm/svm/avic.c | 43 ++++++++++++++++++++++-------------------
 arch/x86/kvm/svm/svm.c  |  8 ++++++--
 arch/x86/kvm/x86.c      | 10 ++++++++--
 3 files changed, 37 insertions(+), 24 deletions(-)

diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
index 64d1e1b6a305..0e4370fe0868 100644
--- a/arch/x86/kvm/svm/avic.c
+++ b/arch/x86/kvm/svm/avic.c
@@ -80,6 +80,28 @@ enum avic_ipi_failure_cause {
 	AVIC_IPI_FAILURE_INVALID_BACKING_PAGE,
 };
 
+
+static void __avic_set_running(struct kvm_vcpu *vcpu, bool is_run)
+{
+	if (is_run)
+		avic_vcpu_load(vcpu, vcpu->cpu);
+	else
+		avic_vcpu_put(vcpu);
+}
+
+/*
+ * This function is called during VCPU halt/unhalt.
+ */
+static void avic_set_running(struct kvm_vcpu *vcpu, bool is_run)
+{
+	struct vcpu_svm *svm = to_svm(vcpu);
+
+	svm->avic_is_running = is_run;
+
+	if (kvm_vcpu_apicv_active(vcpu))
+		__avic_set_running(vcpu, is_run);
+}
+
 /* Note:
  * This function is called from IOMMU driver to notify
  * SVM to schedule in a particular vCPU of a particular VM.
@@ -667,6 +689,7 @@ void svm_refresh_apicv_exec_ctrl(struct kvm_vcpu *vcpu)
 	}
 	vmcb_mark_dirty(vmcb, VMCB_AVIC);
 
+	__avic_set_running(vcpu, activated);
 	svm_set_pi_irte_mode(vcpu, activated);
 }
 
@@ -960,9 +983,6 @@ void avic_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 	int h_physical_id = kvm_cpu_get_apicid(cpu);
 	struct vcpu_svm *svm = to_svm(vcpu);
 
-	if (!kvm_vcpu_apicv_active(vcpu))
-		return;
-
 	/*
 	 * Since the host physical APIC id is 8 bits,
 	 * we can support host APIC ID upto 255.
@@ -990,9 +1010,6 @@ void avic_vcpu_put(struct kvm_vcpu *vcpu)
 	u64 entry;
 	struct vcpu_svm *svm = to_svm(vcpu);
 
-	if (!kvm_vcpu_apicv_active(vcpu))
-		return;
-
 	entry = READ_ONCE(*(svm->avic_physical_id_cache));
 	if (entry & AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK)
 		avic_update_iommu_vcpu_affinity(vcpu, -1, 0);
@@ -1001,20 +1018,6 @@ void avic_vcpu_put(struct kvm_vcpu *vcpu)
 	WRITE_ONCE(*(svm->avic_physical_id_cache), entry);
 }
 
-/*
- * This function is called during VCPU halt/unhalt.
- */
-static void avic_set_running(struct kvm_vcpu *vcpu, bool is_run)
-{
-	struct vcpu_svm *svm = to_svm(vcpu);
-
-	svm->avic_is_running = is_run;
-	if (is_run)
-		avic_vcpu_load(vcpu, vcpu->cpu);
-	else
-		avic_vcpu_put(vcpu);
-}
-
 void svm_vcpu_blocking(struct kvm_vcpu *vcpu)
 {
 	avic_set_running(vcpu, false);
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index d14be8aba2d7..2c291dfa0e04 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -1505,12 +1505,16 @@ static void svm_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 		sd->current_vmcb = svm->vmcb;
 		indirect_branch_prediction_barrier();
 	}
-	avic_vcpu_load(vcpu, cpu);
+
+	if (kvm_vcpu_apicv_active(vcpu))
+		avic_vcpu_load(vcpu, cpu);
 }
 
 static void svm_vcpu_put(struct kvm_vcpu *vcpu)
 {
-	avic_vcpu_put(vcpu);
+	if (kvm_vcpu_apicv_active(vcpu))
+		avic_vcpu_put(vcpu);
+
 	svm_prepare_host_switch(vcpu);
 
 	++vcpu->stat.host_state_reload;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index a91e35b92447..213d332664e2 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -9177,12 +9177,18 @@ void kvm_make_scan_ioapic_request(struct kvm *kvm)
 
 void kvm_vcpu_update_apicv(struct kvm_vcpu *vcpu)
 {
+	bool activate;
+
 	if (!lapic_in_kernel(vcpu))
 		return;
 
 	mutex_lock(&vcpu->kvm->apicv_update_lock);
 
-	vcpu->arch.apicv_active = kvm_apicv_activated(vcpu->kvm);
+	activate = kvm_apicv_activated(vcpu->kvm);
+	if (vcpu->arch.apicv_active == activate)
+		goto out;
+
+	vcpu->arch.apicv_active = activate;
 	kvm_apic_update_apicv(vcpu);
 	static_call(kvm_x86_refresh_apicv_exec_ctrl)(vcpu);
 
@@ -9194,7 +9200,7 @@ void kvm_vcpu_update_apicv(struct kvm_vcpu *vcpu)
 	 */
 	if (!vcpu->arch.apicv_active)
 		kvm_make_request(KVM_REQ_EVENT, vcpu);
-
+out:
 	mutex_unlock(&vcpu->kvm->apicv_update_lock);
 }
 EXPORT_SYMBOL_GPL(kvm_vcpu_update_apicv);
-- 
2.26.3



* [PATCH v2 8/8] KVM: x86: hyper-v: Deactivate APICv only when AutoEOI feature is in use
  2021-07-13 14:20 [PATCH v2 0/8] My AVIC patch queue Maxim Levitsky
                   ` (6 preceding siblings ...)
  2021-07-13 14:20 ` [PATCH v2 7/8] KVM: SVM: call avic_vcpu_load/avic_vcpu_put when enabling/disabling AVIC Maxim Levitsky
@ 2021-07-13 14:20 ` Maxim Levitsky
  2021-07-18 12:13   ` Maxim Levitsky
  2021-07-26 17:24 ` [PATCH v2 0/8] My AVIC patch queue Paolo Bonzini
  8 siblings, 1 reply; 32+ messages in thread
From: Maxim Levitsky @ 2021-07-13 14:20 UTC (permalink / raw)
  To: kvm
  Cc: open list:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	Jim Mattson, Joerg Roedel, Borislav Petkov, Vitaly Kuznetsov,
	Wanpeng Li, Paolo Bonzini, Thomas Gleixner, H. Peter Anvin,
	Ingo Molnar, maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	Sean Christopherson, Maxim Levitsky

From: Vitaly Kuznetsov <vkuznets@redhat.com>

APICV_INHIBIT_REASON_HYPERV is currently unconditionally forced upon
SynIC activation as SynIC's AutoEOI is incompatible with APICv/AVIC. It is,
however, possible to track whether the feature was actually used by the
guest and only inhibit APICv/AVIC when needed.

TLFS suggests a dedicated 'HV_DEPRECATING_AEOI_RECOMMENDED' flag to let
Windows know that AutoEOI feature should be avoided. While it's up to
KVM userspace to set the flag, KVM can help a bit by exposing global
APICv/AVIC enablement: in case APICv/AVIC usage is impossible, AutoEOI
is still preferred.

Maxim:
   - added SRCU lock drop around call to kvm_request_apicv_update
   - always set HV_DEPRECATING_AEOI_RECOMMENDED in kvm_get_hv_cpuid,
     since this feature can be used regardless of AVIC

Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
---
 arch/x86/include/asm/kvm_host.h |  3 +++
 arch/x86/kvm/hyperv.c           | 34 +++++++++++++++++++++++++++------
 2 files changed, 31 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index e11d64aa0bcd..f900dca58af8 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -956,6 +956,9 @@ struct kvm_hv {
 	/* How many vCPUs have VP index != vCPU index */
 	atomic_t num_mismatched_vp_indexes;
 
+	/* How many SynICs use 'AutoEOI' feature */
+	atomic_t synic_auto_eoi_used;
+
 	struct hv_partition_assist_pg *hv_pa_pg;
 	struct kvm_hv_syndbg hv_syndbg;
 };
diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c
index b07592ca92f0..6bf47a583d0e 100644
--- a/arch/x86/kvm/hyperv.c
+++ b/arch/x86/kvm/hyperv.c
@@ -85,9 +85,22 @@ static bool synic_has_vector_auto_eoi(struct kvm_vcpu_hv_synic *synic,
 	return false;
 }
 
+
+static void synic_toggle_avic(struct kvm_vcpu *vcpu, bool activate)
+{
+	srcu_read_unlock(&vcpu->kvm->srcu, vcpu->srcu_idx);
+	kvm_request_apicv_update(vcpu->kvm, activate,
+			APICV_INHIBIT_REASON_HYPERV);
+	vcpu->srcu_idx = srcu_read_lock(&vcpu->kvm->srcu);
+}
+
 static void synic_update_vector(struct kvm_vcpu_hv_synic *synic,
 				int vector)
 {
+	struct kvm_vcpu *vcpu = hv_synic_to_vcpu(synic);
+	struct kvm_hv *hv = to_kvm_hv(vcpu->kvm);
+	int auto_eoi_old, auto_eoi_new;
+
 	if (vector < HV_SYNIC_FIRST_VALID_VECTOR)
 		return;
 
@@ -96,10 +109,23 @@ static void synic_update_vector(struct kvm_vcpu_hv_synic *synic,
 	else
 		__clear_bit(vector, synic->vec_bitmap);
 
+	auto_eoi_old = bitmap_weight(synic->auto_eoi_bitmap, 256);
+
 	if (synic_has_vector_auto_eoi(synic, vector))
 		__set_bit(vector, synic->auto_eoi_bitmap);
 	else
 		__clear_bit(vector, synic->auto_eoi_bitmap);
+
+	auto_eoi_new = bitmap_weight(synic->auto_eoi_bitmap, 256);
+
+	/* Hyper-V SynIC auto EOI SINTs are not compatible with APICV */
+	if (!auto_eoi_old && auto_eoi_new) {
+		if (atomic_inc_return(&hv->synic_auto_eoi_used) == 1)
+			synic_toggle_avic(vcpu, false);
+	} else if (!auto_eoi_new && auto_eoi_old) {
+		if (atomic_dec_return(&hv->synic_auto_eoi_used) == 0)
+			synic_toggle_avic(vcpu, true);
+	}
 }
 
 static int synic_set_sint(struct kvm_vcpu_hv_synic *synic, int sint,
@@ -933,12 +959,6 @@ int kvm_hv_activate_synic(struct kvm_vcpu *vcpu, bool dont_zero_synic_pages)
 
 	synic = to_hv_synic(vcpu);
 
-	/*
-	 * Hyper-V SynIC auto EOI SINT's are
-	 * not compatible with APICV, so request
-	 * to deactivate APICV permanently.
-	 */
-	kvm_request_apicv_update(vcpu->kvm, false, APICV_INHIBIT_REASON_HYPERV);
 	synic->active = true;
 	synic->dont_zero_synic_pages = dont_zero_synic_pages;
 	synic->control = HV_SYNIC_CONTROL_ENABLE;
@@ -2466,6 +2486,8 @@ int kvm_get_hv_cpuid(struct kvm_vcpu *vcpu, struct kvm_cpuid2 *cpuid,
 				ent->eax |= HV_X64_ENLIGHTENED_VMCS_RECOMMENDED;
 			if (!cpu_smt_possible())
 				ent->eax |= HV_X64_NO_NONARCH_CORESHARING;
+
+			ent->eax |= HV_DEPRECATING_AEOI_RECOMMENDED;
 			/*
 			 * Default number of spinlock retry attempts, matches
 			 * HyperV 2016.
-- 
2.26.3



* Re: [PATCH v2 8/8] KVM: x86: hyper-v: Deactivate APICv only when AutoEOI feature is in use
  2021-07-13 14:20 ` [PATCH v2 8/8] KVM: x86: hyper-v: Deactivate APICv only when AutoEOI feature is in use Maxim Levitsky
@ 2021-07-18 12:13   ` Maxim Levitsky
  2021-07-19  7:47     ` Vitaly Kuznetsov
  2021-07-19 18:49     ` Sean Christopherson
  0 siblings, 2 replies; 32+ messages in thread
From: Maxim Levitsky @ 2021-07-18 12:13 UTC (permalink / raw)
  To: kvm
  Cc: open list:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	Jim Mattson, Joerg Roedel, Borislav Petkov, Vitaly Kuznetsov,
	Wanpeng Li, Paolo Bonzini, Thomas Gleixner, H. Peter Anvin,
	Ingo Molnar, maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	Sean Christopherson

On Tue, 2021-07-13 at 17:20 +0300, Maxim Levitsky wrote:
> From: Vitaly Kuznetsov <vkuznets@redhat.com>
> 
> APICV_INHIBIT_REASON_HYPERV is currently unconditionally forced upon
> SynIC activation as SynIC's AutoEOI is incompatible with APICv/AVIC. It is,
> however, possible to track whether the feature was actually used by the
> guest and only inhibit APICv/AVIC when needed.
> 
> TLFS suggests a dedicated 'HV_DEPRECATING_AEOI_RECOMMENDED' flag to let
> Windows know that AutoEOI feature should be avoided. While it's up to
> KVM userspace to set the flag, KVM can help a bit by exposing global
> APICv/AVIC enablement: in case APICv/AVIC usage is impossible, AutoEOI
> is still preferred.

> 
> Maxim:
>    - added SRCU lock drop around call to kvm_request_apicv_update
>    - always set HV_DEPRECATING_AEOI_RECOMMENDED in kvm_get_hv_cpuid,
>      since this feature can be used regardless of AVIC
> 
> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
> Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
> ---
>  arch/x86/include/asm/kvm_host.h |  3 +++
>  arch/x86/kvm/hyperv.c           | 34 +++++++++++++++++++++++++++------
>  2 files changed, 31 insertions(+), 6 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index e11d64aa0bcd..f900dca58af8 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -956,6 +956,9 @@ struct kvm_hv {
>  	/* How many vCPUs have VP index != vCPU index */
>  	atomic_t num_mismatched_vp_indexes;
>  
> +	/* How many SynICs use 'AutoEOI' feature */
> +	atomic_t synic_auto_eoi_used;
> +
>  	struct hv_partition_assist_pg *hv_pa_pg;
>  	struct kvm_hv_syndbg hv_syndbg;
>  };
> diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c
> index b07592ca92f0..6bf47a583d0e 100644
> --- a/arch/x86/kvm/hyperv.c
> +++ b/arch/x86/kvm/hyperv.c
> @@ -85,9 +85,22 @@ static bool synic_has_vector_auto_eoi(struct kvm_vcpu_hv_synic *synic,
>  	return false;
>  }
>  
> +
> +static void synic_toggle_avic(struct kvm_vcpu *vcpu, bool activate)
> +{
> +	srcu_read_unlock(&vcpu->kvm->srcu, vcpu->srcu_idx);
> +	kvm_request_apicv_update(vcpu->kvm, activate,
> +			APICV_INHIBIT_REASON_HYPERV);
> +	vcpu->srcu_idx = srcu_read_lock(&vcpu->kvm->srcu);
> +}

Well, it turns out that this patch still doesn't work (this weekend
I found out that all my AVIC-enabled VMs hang on reboot).

I finally found out what prompted me back then to make the srcu lock drop
in synic_update_vector conditional on whether the write was done
by the host.

It turns out that while KVM_SET_MSRS does take the kvm->srcu lock,
it stores the returned srcu index in a local variable and not
in vcpu->srcu_idx, thus the lock drop in synic_toggle_avic
doesn't work.

So it is likely that I have seen it not work, and blamed
KVM_SET_MSRS for not taking the srcu lock, which was a wrong assumption.

I am more inclined to fix this by just tracking if we hold the srcu
lock on each VCPU manually, just as we track the srcu index anyway,
and then kvm_request_apicv_update can use this to drop the srcu
lock when needed, roughly as sketched below.
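
Something along these lines (just a sketch; 'srcu_is_locked' is the
hypothetical new field, and the helper names are made up):

	static inline void vcpu_srcu_lock(struct kvm_vcpu *vcpu)
	{
		vcpu->srcu_idx = srcu_read_lock(&vcpu->kvm->srcu);
		vcpu->srcu_is_locked = true;
	}

	static inline void vcpu_srcu_unlock(struct kvm_vcpu *vcpu)
	{
		srcu_read_unlock(&vcpu->kvm->srcu, vcpu->srcu_idx);
		vcpu->srcu_is_locked = false;
	}

	/* synic_toggle_avic could then drop the lock only when held: */
	static void synic_toggle_avic(struct kvm_vcpu *vcpu, bool activate)
	{
		bool locked = vcpu->srcu_is_locked;

		if (locked)
			vcpu_srcu_unlock(vcpu);
		kvm_request_apicv_update(vcpu->kvm, activate,
				APICV_INHIBIT_REASON_HYPERV);
		if (locked)
			vcpu_srcu_lock(vcpu);
	}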


> +
>  static void synic_update_vector(struct kvm_vcpu_hv_synic *synic,
>  				int vector)
>  {
> +	struct kvm_vcpu *vcpu = hv_synic_to_vcpu(synic);
> +	struct kvm_hv *hv = to_kvm_hv(vcpu->kvm);
> +	int auto_eoi_old, auto_eoi_new;
> +
>  	if (vector < HV_SYNIC_FIRST_VALID_VECTOR)
>  		return;
>  
> @@ -96,10 +109,23 @@ static void synic_update_vector(struct kvm_vcpu_hv_synic *synic,
>  	else
>  		__clear_bit(vector, synic->vec_bitmap);
>  
> +	auto_eoi_old = bitmap_weight(synic->auto_eoi_bitmap, 256);
> +
>  	if (synic_has_vector_auto_eoi(synic, vector))
>  		__set_bit(vector, synic->auto_eoi_bitmap);
>  	else
>  		__clear_bit(vector, synic->auto_eoi_bitmap);
> +
> +	auto_eoi_new = bitmap_weight(synic->auto_eoi_bitmap, 256);
> +
> +	/* Hyper-V SynIC auto EOI SINTs are not compatible with APICV */
> +	if (!auto_eoi_old && auto_eoi_new) {
> +		if (atomic_inc_return(&hv->synic_auto_eoi_used) == 1)
> +			synic_toggle_avic(vcpu, false);
> +	} else if (!auto_eoi_new && auto_eoi_old) {
> +		if (atomic_dec_return(&hv->synic_auto_eoi_used) == 0)
> +			synic_toggle_avic(vcpu, true);
> +	}
>  }
>  
>  static int synic_set_sint(struct kvm_vcpu_hv_synic *synic, int sint,
> @@ -933,12 +959,6 @@ int kvm_hv_activate_synic(struct kvm_vcpu *vcpu, bool dont_zero_synic_pages)
>  
>  	synic = to_hv_synic(vcpu);
>  
> -	/*
> -	 * Hyper-V SynIC auto EOI SINT's are
> -	 * not compatible with APICV, so request
> -	 * to deactivate APICV permanently.
> -	 */
> -	kvm_request_apicv_update(vcpu->kvm, false, APICV_INHIBIT_REASON_HYPERV);
>  	synic->active = true;
>  	synic->dont_zero_synic_pages = dont_zero_synic_pages;
>  	synic->control = HV_SYNIC_CONTROL_ENABLE;
> @@ -2466,6 +2486,8 @@ int kvm_get_hv_cpuid(struct kvm_vcpu *vcpu, struct kvm_cpuid2 *cpuid,
>  				ent->eax |= HV_X64_ENLIGHTENED_VMCS_RECOMMENDED;
>  			if (!cpu_smt_possible())
>  				ent->eax |= HV_X64_NO_NONARCH_CORESHARING;
> +
> +			ent->eax |= HV_DEPRECATING_AEOI_RECOMMENDED;

Vitaly, I would like to hear your opinion on this change I also made to your code.
I think that we should always expose HV_DEPRECATING_AEOI_RECOMMENDED as a supported
HV cpuid bit regardless of AVIC, so that qemu could set it regardless of AVIC
in the kernel, even if this is not optimal.


Best regards,
	Maxim Levitsky

>  			/*
>  			 * Default number of spinlock retry attempts, matches
>  			 * HyperV 2016.




* Re: [PATCH v2 8/8] KVM: x86: hyper-v: Deactivate APICv only when AutoEOI feature is in use
  2021-07-18 12:13   ` Maxim Levitsky
@ 2021-07-19  7:47     ` Vitaly Kuznetsov
  2021-07-19  9:00       ` Maxim Levitsky
  2021-07-19 18:49     ` Sean Christopherson
  1 sibling, 1 reply; 32+ messages in thread
From: Vitaly Kuznetsov @ 2021-07-19  7:47 UTC (permalink / raw)
  To: Maxim Levitsky, kvm
  Cc: open list:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	Jim Mattson, Joerg Roedel, Borislav Petkov, Wanpeng Li,
	Paolo Bonzini, Thomas Gleixner, H. Peter Anvin, Ingo Molnar,
	maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	Sean Christopherson

Maxim Levitsky <mlevitsk@redhat.com> writes:

> On Tue, 2021-07-13 at 17:20 +0300, Maxim Levitsky wrote:
>> From: Vitaly Kuznetsov <vkuznets@redhat.com>
>> 
>> APICV_INHIBIT_REASON_HYPERV is currently unconditionally forced upon
>> SynIC activation as SynIC's AutoEOI is incompatible with APICv/AVIC. It is,
>> however, possible to track whether the feature was actually used by the
>> guest and only inhibit APICv/AVIC when needed.
>> 
>> TLFS suggests a dedicated 'HV_DEPRECATING_AEOI_RECOMMENDED' flag to let
>> Windows know that AutoEOI feature should be avoided. While it's up to
>> KVM userspace to set the flag, KVM can help a bit by exposing global
>> APICv/AVIC enablement: in case APICv/AVIC usage is impossible, AutoEOI
>> is still preferred.
>
>> 
>> Maxim:
>>    - added SRCU lock drop around call to kvm_request_apicv_update
>>    - always set HV_DEPRECATING_AEOI_RECOMMENDED in kvm_get_hv_cpuid,
>>      since this feature can be used regardless of AVIC
>> 
>> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
>> Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
>> ---
>>  arch/x86/include/asm/kvm_host.h |  3 +++
>>  arch/x86/kvm/hyperv.c           | 34 +++++++++++++++++++++++++++------
>>  2 files changed, 31 insertions(+), 6 deletions(-)
>> 
>> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
>> index e11d64aa0bcd..f900dca58af8 100644
>> --- a/arch/x86/include/asm/kvm_host.h
>> +++ b/arch/x86/include/asm/kvm_host.h
>> @@ -956,6 +956,9 @@ struct kvm_hv {
>>  	/* How many vCPUs have VP index != vCPU index */
>>  	atomic_t num_mismatched_vp_indexes;
>>  
>> +	/* How many SynICs use 'AutoEOI' feature */
>> +	atomic_t synic_auto_eoi_used;
>> +
>>  	struct hv_partition_assist_pg *hv_pa_pg;
>>  	struct kvm_hv_syndbg hv_syndbg;
>>  };
>> diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c
>> index b07592ca92f0..6bf47a583d0e 100644
>> --- a/arch/x86/kvm/hyperv.c
>> +++ b/arch/x86/kvm/hyperv.c
>> @@ -85,9 +85,22 @@ static bool synic_has_vector_auto_eoi(struct kvm_vcpu_hv_synic *synic,
>>  	return false;
>>  }
>>  
>> +
>> +static void synic_toggle_avic(struct kvm_vcpu *vcpu, bool activate)
>> +{
>> +	srcu_read_unlock(&vcpu->kvm->srcu, vcpu->srcu_idx);
>> +	kvm_request_apicv_update(vcpu->kvm, activate,
>> +			APICV_INHIBIT_REASON_HYPERV);
>> +	vcpu->srcu_idx = srcu_read_lock(&vcpu->kvm->srcu);
>> +}
>
> Well turns out that this patch still doesn't work (on this
> weekend I found out that all my AVIC enabled VMs hang on reboot).
>
> I finally found out what prompted me back then to make srcu lock drop
> in synic_update_vector conditional on whether the write was done
> by the host.
>  
> Turns out that while KVM_SET_MSRS does take the kvm->srcu lock,
> it stores the returned srcu index in a local variable and not
> in vcpu->srcu_idx, thus the lock drop in synic_toggle_avic
> doesn't work.
>  
> So it is likely that I have seen it not work, and blamed 
> KVM_SET_MSRS for not taking the srcu lock which was a wrong assumption.
>  
> I am more inclined to fix this by just tracking if we hold the srcu
> lock on each VCPU manually, just as we track the srcu index anyway,
> and then kvm_request_apicv_update can use this to drop the srcu
> lock when needed.
>

Would it be possible to use some magic value in 'vcpu->srcu_idx' and not
introduce a new 'srcu_is_locked' flag?

>
>> +
>>  static void synic_update_vector(struct kvm_vcpu_hv_synic *synic,
>>  				int vector)
>>  {
>> +	struct kvm_vcpu *vcpu = hv_synic_to_vcpu(synic);
>> +	struct kvm_hv *hv = to_kvm_hv(vcpu->kvm);
>> +	int auto_eoi_old, auto_eoi_new;
>> +
>>  	if (vector < HV_SYNIC_FIRST_VALID_VECTOR)
>>  		return;
>>  
>> @@ -96,10 +109,23 @@ static void synic_update_vector(struct kvm_vcpu_hv_synic *synic,
>>  	else
>>  		__clear_bit(vector, synic->vec_bitmap);
>>  
>> +	auto_eoi_old = bitmap_weight(synic->auto_eoi_bitmap, 256);
>> +
>>  	if (synic_has_vector_auto_eoi(synic, vector))
>>  		__set_bit(vector, synic->auto_eoi_bitmap);
>>  	else
>>  		__clear_bit(vector, synic->auto_eoi_bitmap);
>> +
>> +	auto_eoi_new = bitmap_weight(synic->auto_eoi_bitmap, 256);
>> +
>> +	/* Hyper-V SynIC auto EOI SINTs are not compatible with APICV */
>> +	if (!auto_eoi_old && auto_eoi_new) {
>> +		if (atomic_inc_return(&hv->synic_auto_eoi_used) == 1)
>> +			synic_toggle_avic(vcpu, false);
>> +	} else if (!auto_eoi_new && auto_eoi_old) {
>> +		if (atomic_dec_return(&hv->synic_auto_eoi_used) == 0)
>> +			synic_toggle_avic(vcpu, true);
>> +	}
>>  }
>>  
>>  static int synic_set_sint(struct kvm_vcpu_hv_synic *synic, int sint,
>> @@ -933,12 +959,6 @@ int kvm_hv_activate_synic(struct kvm_vcpu *vcpu, bool dont_zero_synic_pages)
>>  
>>  	synic = to_hv_synic(vcpu);
>>  
>> -	/*
>> -	 * Hyper-V SynIC auto EOI SINT's are
>> -	 * not compatible with APICV, so request
>> -	 * to deactivate APICV permanently.
>> -	 */
>> -	kvm_request_apicv_update(vcpu->kvm, false, APICV_INHIBIT_REASON_HYPERV);
>>  	synic->active = true;
>>  	synic->dont_zero_synic_pages = dont_zero_synic_pages;
>>  	synic->control = HV_SYNIC_CONTROL_ENABLE;
>> @@ -2466,6 +2486,8 @@ int kvm_get_hv_cpuid(struct kvm_vcpu *vcpu, struct kvm_cpuid2 *cpuid,
>>  				ent->eax |= HV_X64_ENLIGHTENED_VMCS_RECOMMENDED;
>>  			if (!cpu_smt_possible())
>>  				ent->eax |= HV_X64_NO_NONARCH_CORESHARING;
>> +
>> +			ent->eax |= HV_DEPRECATING_AEOI_RECOMMENDED;
>
> Vitaly, I would like to hear your opinion on this change I also made to your code.
> I think that we should always expose HV_DEPRECATING_AEOI_RECOMMENDED as a supported
> HV cpuid bit regardless of AVIC, so that qemu could set it regardless of AVIC
> in the kernel, even if this is not optimal.

Generally, I'm OK with the change. The meaning of the bit changes
slightly depending on whether we expose it conditionally or not.

If HV_DEPRECATING_AEOI_RECOMMENDED is always exposed it just means that
the patch in question is in, nothing else. VMM has to figure out if
APICv/AVIC is supported by some other means.

If HV_DEPRECATING_AEOI_RECOMMENDED is exposed conditionally, then VMM
can be stupid and solely rely on this information by just copying the
bit to user visible CPUIDs (e.g. QEMU's 'hv-passthrough' mode).

-- 
Vitaly



* Re: [PATCH v2 8/8] KVM: x86: hyper-v: Deactivate APICv only when AutoEOI feature is in use
  2021-07-19  7:47     ` Vitaly Kuznetsov
@ 2021-07-19  9:00       ` Maxim Levitsky
  2021-07-19  9:23         ` Vitaly Kuznetsov
  0 siblings, 1 reply; 32+ messages in thread
From: Maxim Levitsky @ 2021-07-19  9:00 UTC (permalink / raw)
  To: Vitaly Kuznetsov, kvm
  Cc: open list:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	Jim Mattson, Joerg Roedel, Borislav Petkov, Wanpeng Li,
	Paolo Bonzini, Thomas Gleixner, H. Peter Anvin, Ingo Molnar,
	maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	Sean Christopherson

On Mon, 2021-07-19 at 09:47 +0200, Vitaly Kuznetsov wrote:
> Maxim Levitsky <mlevitsk@redhat.com> writes:
> 
> > On Tue, 2021-07-13 at 17:20 +0300, Maxim Levitsky wrote:
> > > From: Vitaly Kuznetsov <vkuznets@redhat.com>
> > > 
> > > APICV_INHIBIT_REASON_HYPERV is currently unconditionally forced upon
> > > SynIC activation as SynIC's AutoEOI is incompatible with APICv/AVIC. It is,
> > > however, possible to track whether the feature was actually used by the
> > > guest and only inhibit APICv/AVIC when needed.
> > > 
> > > TLFS suggests a dedicated 'HV_DEPRECATING_AEOI_RECOMMENDED' flag to let
> > > Windows know that AutoEOI feature should be avoided. While it's up to
> > > KVM userspace to set the flag, KVM can help a bit by exposing global
> > > APICv/AVIC enablement: in case APICv/AVIC usage is impossible, AutoEOI
> > > is still preferred.
> > > Maxim:
> > >    - added SRCU lock drop around call to kvm_request_apicv_update
> > >    - always set HV_DEPRECATING_AEOI_RECOMMENDED in kvm_get_hv_cpuid,
> > >      since this feature can be used regardless of AVIC
> > > 
> > > Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
> > > Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
> > > ---
> > >  arch/x86/include/asm/kvm_host.h |  3 +++
> > >  arch/x86/kvm/hyperv.c           | 34 +++++++++++++++++++++++++++------
> > >  2 files changed, 31 insertions(+), 6 deletions(-)
> > > 
> > > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > > index e11d64aa0bcd..f900dca58af8 100644
> > > --- a/arch/x86/include/asm/kvm_host.h
> > > +++ b/arch/x86/include/asm/kvm_host.h
> > > @@ -956,6 +956,9 @@ struct kvm_hv {
> > >  	/* How many vCPUs have VP index != vCPU index */
> > >  	atomic_t num_mismatched_vp_indexes;
> > >  
> > > +	/* How many SynICs use 'AutoEOI' feature */
> > > +	atomic_t synic_auto_eoi_used;
> > > +
> > >  	struct hv_partition_assist_pg *hv_pa_pg;
> > >  	struct kvm_hv_syndbg hv_syndbg;
> > >  };
> > > diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c
> > > index b07592ca92f0..6bf47a583d0e 100644
> > > --- a/arch/x86/kvm/hyperv.c
> > > +++ b/arch/x86/kvm/hyperv.c
> > > @@ -85,9 +85,22 @@ static bool synic_has_vector_auto_eoi(struct kvm_vcpu_hv_synic *synic,
> > >  	return false;
> > >  }
> > >  
> > > +
> > > +static void synic_toggle_avic(struct kvm_vcpu *vcpu, bool activate)
> > > +{
> > > +	srcu_read_unlock(&vcpu->kvm->srcu, vcpu->srcu_idx);
> > > +	kvm_request_apicv_update(vcpu->kvm, activate,
> > > +			APICV_INHIBIT_REASON_HYPERV);
> > > +	vcpu->srcu_idx = srcu_read_lock(&vcpu->kvm->srcu);
> > > +}
> > 
> > Well turns out that this patch still doesn't work (on this
> > weekend I found out that all my AVIC enabled VMs hang on reboot).
> > 
> > I finally found out what prompted me back then to make srcu lock drop
> > in synic_update_vector conditional on whether the write was done
> > by the host.
> >  
> > Turns out that while KVM_SET_MSRS does take the kvm->srcu lock,
> > it stores the returned srcu index in a local variable and not
> > in vcpu->srcu_idx, thus the lock drop in synic_toggle_avic
> > doesn't work.
> >  
> > So it is likely that I have seen it not work, and blamed 
> > KVM_SET_MSRS for not taking the srcu lock which was a wrong assumption.
> >  
> > I am more inclined to fix this by just tracking if we hold the srcu
> > lock on each VCPU manually, just as we track the srcu index anyway,
> > and then kvm_request_apicv_update can use this to drop the srcu
> > lock when needed.
> > 
> 
> Would it be possible to use some magic value in 'vcpu->srcu_idx' and not
> introduce a new 'srcu_is_locked' flag?

Well, currently the returned index value from srcu_read_lock is opaque
(and we have two SRCU implementations; both, I think, return small positive
numbers, but I haven't studied them in depth).

We can ask the people that maintain SRCU to reserve a number (like -1)
for this.
I'd probably first add the 'srcu_is_locked' flag though, and then as a
follow-up patch remove it if they agree.
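
E.g. something along these lines (again, just a sketch; the sentinel value
is hypothetical and would need the SRCU maintainers' blessing):

	/* hypothetical value reserved by the SRCU implementation */
	#define SRCU_IDX_NOT_HELD	(-1)

	static inline bool vcpu_srcu_held(struct kvm_vcpu *vcpu)
	{
		return vcpu->srcu_idx != SRCU_IDX_NOT_HELD;
	}

	/* the unlock helper would then reset the index instead of
	 * clearing a separate flag: */
	static inline void vcpu_srcu_unlock(struct kvm_vcpu *vcpu)
	{
		srcu_read_unlock(&vcpu->kvm->srcu, vcpu->srcu_idx);
		vcpu->srcu_idx = SRCU_IDX_NOT_HELD;
	}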

> 
> > > +
> > >  static void synic_update_vector(struct kvm_vcpu_hv_synic *synic,
> > >  				int vector)
> > >  {
> > > +	struct kvm_vcpu *vcpu = hv_synic_to_vcpu(synic);
> > > +	struct kvm_hv *hv = to_kvm_hv(vcpu->kvm);
> > > +	int auto_eoi_old, auto_eoi_new;
> > > +
> > >  	if (vector < HV_SYNIC_FIRST_VALID_VECTOR)
> > >  		return;
> > >  
> > > @@ -96,10 +109,23 @@ static void synic_update_vector(struct kvm_vcpu_hv_synic *synic,
> > >  	else
> > >  		__clear_bit(vector, synic->vec_bitmap);
> > >  
> > > +	auto_eoi_old = bitmap_weight(synic->auto_eoi_bitmap, 256);
> > > +
> > >  	if (synic_has_vector_auto_eoi(synic, vector))
> > >  		__set_bit(vector, synic->auto_eoi_bitmap);
> > >  	else
> > >  		__clear_bit(vector, synic->auto_eoi_bitmap);
> > > +
> > > +	auto_eoi_new = bitmap_weight(synic->auto_eoi_bitmap, 256);
> > > +
> > > +	/* Hyper-V SynIC auto EOI SINTs are not compatible with APICV */
> > > +	if (!auto_eoi_old && auto_eoi_new) {
> > > +		if (atomic_inc_return(&hv->synic_auto_eoi_used) == 1)
> > > +			synic_toggle_avic(vcpu, false);
> > > +	} else if (!auto_eoi_new && auto_eoi_old) {
> > > +		if (atomic_dec_return(&hv->synic_auto_eoi_used) == 0)
> > > +			synic_toggle_avic(vcpu, true);
> > > +	}
> > >  }
> > >  
> > >  static int synic_set_sint(struct kvm_vcpu_hv_synic *synic, int sint,
> > > @@ -933,12 +959,6 @@ int kvm_hv_activate_synic(struct kvm_vcpu *vcpu, bool dont_zero_synic_pages)
> > >  
> > >  	synic = to_hv_synic(vcpu);
> > >  
> > > -	/*
> > > -	 * Hyper-V SynIC auto EOI SINT's are
> > > -	 * not compatible with APICV, so request
> > > -	 * to deactivate APICV permanently.
> > > -	 */
> > > -	kvm_request_apicv_update(vcpu->kvm, false, APICV_INHIBIT_REASON_HYPERV);
> > >  	synic->active = true;
> > >  	synic->dont_zero_synic_pages = dont_zero_synic_pages;
> > >  	synic->control = HV_SYNIC_CONTROL_ENABLE;
> > > @@ -2466,6 +2486,8 @@ int kvm_get_hv_cpuid(struct kvm_vcpu *vcpu, struct kvm_cpuid2 *cpuid,
> > >  				ent->eax |= HV_X64_ENLIGHTENED_VMCS_RECOMMENDED;
> > >  			if (!cpu_smt_possible())
> > >  				ent->eax |= HV_X64_NO_NONARCH_CORESHARING;
> > > +
> > > +			ent->eax |= HV_DEPRECATING_AEOI_RECOMMENDED;
> > 
> > Vitaly, I would like to hear your opinion on this change I also made to your code.
> > I think that we should always expose HV_DEPRECATING_AEOI_RECOMMENDED as a supported
> > HV cpuid bit regardless of AVIC, so that qemu could set it regardless of AVIC
> > in the kernel, even if this is not optimal.
> 
> Generally, I'm OK with the change. The meaning of the bit changes
> slightly depending on whether we expose it conditionally or not.
> 
> If HV_DEPRECATING_AEOI_RECOMMENDED is always exposed it just means that
> the patch in question is in, nothing else. VMM has to figure out if
> APICv/AVIC is supported by some other means.
> 
> If HV_DEPRECATING_AEOI_RECOMMENDED is exposed conditionally, then VMM
> can be stupid and solely rely on this information by just copying the
> bit to user visible CPUIDs (e.g. QEMU's 'hv-passthrough' mode).

I understand what you mean.
In regard to AVIC, though, 'hv-passthrough' doesn't work at all
anyway, since it also makes qemu enable the hv-vapic bit, which
makes the guest not use the APIC....

So it seems that not even qemu but a management layer above it should
figure out that it wants AVIC, and then enable all the bits that 'should'
make it work, though AVIC might still not work for some reason.

Currently these are the flags that are needed to make AVIC work on Windows:

 -cpu -svm,-x2apic,-hv-vapic,hv-aeoi-deprecation (I added this locally)
 -global kvm-pit.lost_tick_policy=discard

Thanks,
Best regards,
	Maxim Levitsky
 

> 




* Re: [PATCH v2 8/8] KVM: x86: hyper-v: Deactivate APICv only when AutoEOI feature is in use
  2021-07-19  9:00       ` Maxim Levitsky
@ 2021-07-19  9:23         ` Vitaly Kuznetsov
  2021-07-19  9:58           ` Maxim Levitsky
  0 siblings, 1 reply; 32+ messages in thread
From: Vitaly Kuznetsov @ 2021-07-19  9:23 UTC (permalink / raw)
  To: Maxim Levitsky, kvm
  Cc: open list:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	Jim Mattson, Joerg Roedel, Borislav Petkov, Wanpeng Li,
	Paolo Bonzini, Thomas Gleixner, H. Peter Anvin, Ingo Molnar,
	maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	Sean Christopherson

Maxim Levitsky <mlevitsk@redhat.com> writes:

> On Mon, 2021-07-19 at 09:47 +0200, Vitaly Kuznetsov wrote:
>> Maxim Levitsky <mlevitsk@redhat.com> writes:
>> 
>> > On Tue, 2021-07-13 at 17:20 +0300, Maxim Levitsky wrote:
>> > > From: Vitaly Kuznetsov <vkuznets@redhat.com>
>> > > 
>> > > APICV_INHIBIT_REASON_HYPERV is currently unconditionally forced upon
>> > > SynIC activation as SynIC's AutoEOI is incompatible with APICv/AVIC. It is,
>> > > however, possible to track whether the feature was actually used by the
>> > > guest and only inhibit APICv/AVIC when needed.
>> > > 
>> > > TLFS suggests a dedicated 'HV_DEPRECATING_AEOI_RECOMMENDED' flag to let
>> > > Windows know that AutoEOI feature should be avoided. While it's up to
>> > > KVM userspace to set the flag, KVM can help a bit by exposing global
>> > > APICv/AVIC enablement: in case APICv/AVIC usage is impossible, AutoEOI
>> > > is still preferred.
>> > > Maxim:
>> > >    - added SRCU lock drop around call to kvm_request_apicv_update
>> > >    - always set HV_DEPRECATING_AEOI_RECOMMENDED in kvm_get_hv_cpuid,
>> > >      since this feature can be used regardless of AVIC
>> > > 
>> > > Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
>> > > Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
>> > > ---
>> > >  arch/x86/include/asm/kvm_host.h |  3 +++
>> > >  arch/x86/kvm/hyperv.c           | 34 +++++++++++++++++++++++++++------
>> > >  2 files changed, 31 insertions(+), 6 deletions(-)
>> > > 
>> > > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
>> > > index e11d64aa0bcd..f900dca58af8 100644
>> > > --- a/arch/x86/include/asm/kvm_host.h
>> > > +++ b/arch/x86/include/asm/kvm_host.h
>> > > @@ -956,6 +956,9 @@ struct kvm_hv {
>> > >  	/* How many vCPUs have VP index != vCPU index */
>> > >  	atomic_t num_mismatched_vp_indexes;
>> > >  
>> > > +	/* How many SynICs use 'AutoEOI' feature */
>> > > +	atomic_t synic_auto_eoi_used;
>> > > +
>> > >  	struct hv_partition_assist_pg *hv_pa_pg;
>> > >  	struct kvm_hv_syndbg hv_syndbg;
>> > >  };
>> > > diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c
>> > > index b07592ca92f0..6bf47a583d0e 100644
>> > > --- a/arch/x86/kvm/hyperv.c
>> > > +++ b/arch/x86/kvm/hyperv.c
>> > > @@ -85,9 +85,22 @@ static bool synic_has_vector_auto_eoi(struct kvm_vcpu_hv_synic *synic,
>> > >  	return false;
>> > >  }
>> > >  
>> > > +
>> > > +static void synic_toggle_avic(struct kvm_vcpu *vcpu, bool activate)
>> > > +{
>> > > +	srcu_read_unlock(&vcpu->kvm->srcu, vcpu->srcu_idx);
>> > > +	kvm_request_apicv_update(vcpu->kvm, activate,
>> > > +			APICV_INHIBIT_REASON_HYPERV);
>> > > +	vcpu->srcu_idx = srcu_read_lock(&vcpu->kvm->srcu);
>> > > +}
>> > 
>> > Well turns out that this patch still doesn't work (on this
>> > weekend I found out that all my AVIC enabled VMs hang on reboot).
>> > 
>> > I finally found out what prompted me back then to make srcu lock drop
>> > in synic_update_vector conditional on whether the write was done
>> > by the host.
>> >  
>> > Turns out that while KVM_SET_MSRS does take the kvm->srcu lock,
>> > it stores the returned srcu index in a local variable and not
>> > in vcpu->srcu_idx, thus the lock drop in synic_toggle_avic
>> > doesn't work.
>> >  
>> > So it is likely that I have seen it not work, and blamed 
>> > KVM_SET_MSRS for not taking the srcu lock which was a wrong assumption.
>> >  
>> > I am more inclined to fix this by just tracking if we hold the srcu
>> > lock on each VCPU manually, just as we track the srcu index anyway,
>> > and then kvm_request_apicv_update can use this to drop the srcu
>> > lock when needed.
>> > 
>> 
>> Would it be possible to use some magic value in 'vcpu->srcu_idx' and not
>> introduce a new 'srcu_is_locked' flag?
>
> Well, currently the returned index value from srcu_read_lock is opaque 
> (and we have two SRCU implementations and both I think return small positive numbers, 
> but I haven't studied them in depth).
>  
> We can ask the people that maintain SRCU to reserve a number (like -1)
> or so.
> I'll probably first add the 'srcu_is_locked' flag though, and then as a follow-up
> patch remove it if they agree.
>

Ah, OK. BTW, I've just discovered srcu_read_lock_held() which sounds
like the function we need but unfortunately it is not.

-- 
Vitaly


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2 8/8] KVM: x86: hyper-v: Deactivate APICv only when AutoEOI feature is in use
  2021-07-19  9:23         ` Vitaly Kuznetsov
@ 2021-07-19  9:58           ` Maxim Levitsky
  0 siblings, 0 replies; 32+ messages in thread
From: Maxim Levitsky @ 2021-07-19  9:58 UTC (permalink / raw)
  To: Vitaly Kuznetsov, kvm
  Cc: open list:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	Jim Mattson, Joerg Roedel, Borislav Petkov, Wanpeng Li,
	Paolo Bonzini, Thomas Gleixner, H. Peter Anvin, Ingo Molnar,
	maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	Sean Christopherson

On Mon, 2021-07-19 at 11:23 +0200, Vitaly Kuznetsov wrote:
> Maxim Levitsky <mlevitsk@redhat.com> writes:
> 
> > On Mon, 2021-07-19 at 09:47 +0200, Vitaly Kuznetsov wrote:
> > > Maxim Levitsky <mlevitsk@redhat.com> writes:
> > > 
> > > > On Tue, 2021-07-13 at 17:20 +0300, Maxim Levitsky wrote:
> > > > > From: Vitaly Kuznetsov <vkuznets@redhat.com>
> > > > > 
> > > > > APICV_INHIBIT_REASON_HYPERV is currently unconditionally forced upon
> > > > > SynIC activation as SynIC's AutoEOI is incompatible with APICv/AVIC. It is,
> > > > > however, possible to track whether the feature was actually used by the
> > > > > guest and only inhibit APICv/AVIC when needed.
> > > > > 
> > > > > TLFS suggests a dedicated 'HV_DEPRECATING_AEOI_RECOMMENDED' flag to let
> > > > > Windows know that AutoEOI feature should be avoided. While it's up to
> > > > > KVM userspace to set the flag, KVM can help a bit by exposing global
> > > > > APICv/AVIC enablement: in case APICv/AVIC usage is impossible, AutoEOI
> > > > > is still preferred.
> > > > > Maxim:
> > > > >    - added SRCU lock drop around call to kvm_request_apicv_update
> > > > >    - always set HV_DEPRECATING_AEOI_RECOMMENDED in kvm_get_hv_cpuid,
> > > > >      since this feature can be used regardless of AVIC
> > > > > 
> > > > > Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
> > > > > Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
> > > > > ---
> > > > >  arch/x86/include/asm/kvm_host.h |  3 +++
> > > > >  arch/x86/kvm/hyperv.c           | 34 +++++++++++++++++++++++++++------
> > > > >  2 files changed, 31 insertions(+), 6 deletions(-)
> > > > > 
> > > > > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > > > > index e11d64aa0bcd..f900dca58af8 100644
> > > > > --- a/arch/x86/include/asm/kvm_host.h
> > > > > +++ b/arch/x86/include/asm/kvm_host.h
> > > > > @@ -956,6 +956,9 @@ struct kvm_hv {
> > > > >  	/* How many vCPUs have VP index != vCPU index */
> > > > >  	atomic_t num_mismatched_vp_indexes;
> > > > >  
> > > > > +	/* How many SynICs use 'AutoEOI' feature */
> > > > > +	atomic_t synic_auto_eoi_used;
> > > > > +
> > > > >  	struct hv_partition_assist_pg *hv_pa_pg;
> > > > >  	struct kvm_hv_syndbg hv_syndbg;
> > > > >  };
> > > > > diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c
> > > > > index b07592ca92f0..6bf47a583d0e 100644
> > > > > --- a/arch/x86/kvm/hyperv.c
> > > > > +++ b/arch/x86/kvm/hyperv.c
> > > > > @@ -85,9 +85,22 @@ static bool synic_has_vector_auto_eoi(struct kvm_vcpu_hv_synic *synic,
> > > > >  	return false;
> > > > >  }
> > > > >  
> > > > > +
> > > > > +static void synic_toggle_avic(struct kvm_vcpu *vcpu, bool activate)
> > > > > +{
> > > > > +	srcu_read_unlock(&vcpu->kvm->srcu, vcpu->srcu_idx);
> > > > > +	kvm_request_apicv_update(vcpu->kvm, activate,
> > > > > +			APICV_INHIBIT_REASON_HYPERV);
> > > > > +	vcpu->srcu_idx = srcu_read_lock(&vcpu->kvm->srcu);
> > > > > +}
> > > > 
> > > > Well turns out that this patch still doesn't work (on this
> > > > weekend I found out that all my AVIC enabled VMs hang on reboot).
> > > > 
> > > > I finally found out what prompted me back then to make srcu lock drop
> > > > in synic_update_vector conditional on whether the write was done
> > > > by the host.
> > > >  
> > > > Turns out that while KVM_SET_MSRS does take the kvm->srcu lock,
> > > > it stores the returned srcu index in a local variable and not
> > > > in vcpu->srcu_idx, thus the lock drop in synic_toggle_avic
> > > > doesn't work.
> > > >  
> > > > So it is likely that I saw it not working back then and wrongly blamed
> > > > KVM_SET_MSRS for not taking the srcu lock.
> > > >  
> > > > I am more inclined to fix this by just tracking if we hold the srcu
> > > > lock on each VCPU manually, just as we track the srcu index anyway,
> > > > and then kvm_request_apicv_update can use this to drop the srcu
> > > > lock when needed.
> > > > 
> > > 
> > > Would it be possible to use some magic value in 'vcpu->srcu_idx' and not
> > > introduce a new 'srcu_is_locked' flag?
> > 
> > Well, currently the returned index value from srcu_read_lock is opaque 
> > (and we have two SRCU implementations and both I think return small positive numbers, 
> > but I haven't studied them in depth).
> >  
> > We can ask the people that maintain SRCU to reserve a number (like -1)
> > or so.
> > I'll probably first add the 'srcu_is_locked' flag though, and then as a follow-up
> > patch remove it if they agree.
> > 
> 
> Ah, OK. BTW, I've just discovered srcu_read_lock_held() which sounds
> like the function we need but unfortunately it is not.

Yea, exactly this. :-(
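
(For illustration, the tracking I have in mind would look roughly like this;
'srcu_is_locked' is a hypothetical field, nothing like it exists yet:)

	static void vcpu_srcu_read_lock(struct kvm_vcpu *vcpu)
	{
		vcpu->srcu_idx = srcu_read_lock(&vcpu->kvm->srcu);
		vcpu->srcu_is_locked = true;
	}

	static void vcpu_srcu_read_unlock(struct kvm_vcpu *vcpu)
	{
		srcu_read_unlock(&vcpu->kvm->srcu, vcpu->srcu_idx);
		vcpu->srcu_is_locked = false;
	}

If all lock/unlock sites went through such helpers, kvm_request_apicv_update
could drop and re-take the lock only when vcpu->srcu_is_locked is actually set.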

Best regards,
	Maxim Levitsky

> 



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2 8/8] KVM: x86: hyper-v: Deactivate APICv only when AutoEOI feature is in use
  2021-07-18 12:13   ` Maxim Levitsky
  2021-07-19  7:47     ` Vitaly Kuznetsov
@ 2021-07-19 18:49     ` Sean Christopherson
  2021-07-20  9:40       ` Maxim Levitsky
                         ` (2 more replies)
  1 sibling, 3 replies; 32+ messages in thread
From: Sean Christopherson @ 2021-07-19 18:49 UTC (permalink / raw)
  To: Maxim Levitsky
  Cc: kvm, open list:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	Jim Mattson, Joerg Roedel, Borislav Petkov, Vitaly Kuznetsov,
	Wanpeng Li, Paolo Bonzini, Thomas Gleixner, H. Peter Anvin,
	Ingo Molnar, maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT)

On Sun, Jul 18, 2021, Maxim Levitsky wrote:
> I am more inclined to fix this by just tracking if we hold the srcu
> lock on each VCPU manually, just as we track the srcu index anyway,
> and then kvm_request_apicv_update can use this to drop the srcu
> lock when needed.

The entire approach of dynamically adding/removing the memslot seems doomed to
failure, and is likely responsible for the performance issues with AVIC, e.g. a
single vCPU temporarily inhibiting AVIC will zap all SPTEs _twice_; on disable
and again on re-enable.

Rather than pile on more gunk, what about special casing the APIC access page
memslot in try_async_pf()?  E.g. zap the GFN in avic_update_access_page() when
disabling (and bounce through kvm_{inc,dec}_notifier_count()), and have the page
fault path skip directly to MMIO emulation without caching the MMIO info.  It'd
also give us a good excuse to rename try_async_pf() :-)

If lack of MMIO caching is a performance problem, an alternative solution would
be to allow caching but add a helper to zap the MMIO SPTE and request all vCPUs to
clear their cache.

It's all a bit gross, especially hijacking the mmu_notifier path, but IMO it'd be
less awful than the current memslot+SRCU mess.

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index f4d35289f59e..ea434d76cfb0 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3767,9 +3767,9 @@ static bool kvm_arch_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
                                  kvm_vcpu_gfn_to_hva(vcpu, gfn), &arch);
 }

-static bool try_async_pf(struct kvm_vcpu *vcpu, bool prefault, gfn_t gfn,
-                        gpa_t cr2_or_gpa, kvm_pfn_t *pfn, hva_t *hva,
-                        bool write, bool *writable)
+static bool kvm_faultin_pfn(struct kvm_vcpu *vcpu, bool prefault, gfn_t gfn,
+                           gpa_t cr2_or_gpa, kvm_pfn_t *pfn, hva_t *hva,
+                           bool write, bool *writable, int *r)
 {
        struct kvm_memory_slot *slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
        bool async;
@@ -3780,13 +3780,26 @@ static bool try_async_pf(struct kvm_vcpu *vcpu, bool prefault, gfn_t gfn,
         * be zapped before KVM inserts a new MMIO SPTE for the gfn.
         */
        if (slot && (slot->flags & KVM_MEMSLOT_INVALID))
-               return true;
+               goto out_retry;

-       /* Don't expose private memslots to L2. */
-       if (is_guest_mode(vcpu) && !kvm_is_visible_memslot(slot)) {
-               *pfn = KVM_PFN_NOSLOT;
-               *writable = false;
-               return false;
+       if (!kvm_is_visible_memslot(slot)) {
+               /* Don't expose private memslots to L2. */
+               if (is_guest_mode(vcpu)) {
+                       *pfn = KVM_PFN_NOSLOT;
+                       *writable = false;
+                       return false;
+               }
+               /*
+                * If the APIC access page exists but is disabled, go directly
+                * to emulation without caching the MMIO access or creating a
+                * MMIO SPTE.  That way the cache doesn't need to be purged
+                * when the AVIC is re-enabled.
+                */
+               if (slot->id == APIC_ACCESS_PAGE_PRIVATE_MEMSLOT &&
+                   !vcpu->kvm->arch.apic_access_memslot_enabled) {
+                       *r = RET_PF_EMULATE;
+                       return true;
+               }
        }

        async = false;
@@ -3800,14 +3813,19 @@ static bool try_async_pf(struct kvm_vcpu *vcpu, bool prefault, gfn_t gfn,
                if (kvm_find_async_pf_gfn(vcpu, gfn)) {
                        trace_kvm_async_pf_doublefault(cr2_or_gpa, gfn);
                        kvm_make_request(KVM_REQ_APF_HALT, vcpu);
-                       return true;
-               } else if (kvm_arch_setup_async_pf(vcpu, cr2_or_gpa, gfn))
-                       return true;
+                       goto out_retry;
+               } else if (kvm_arch_setup_async_pf(vcpu, cr2_or_gpa, gfn)) {
+                       goto out_retry;
+               }
        }

        *pfn = __gfn_to_pfn_memslot(slot, gfn, false, NULL,
                                    write, writable, hva);
        return false;
+
+out_retry:
+       *r = RET_PF_RETRY;
+       return true;
 }

 static int direct_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
@@ -3839,9 +3857,9 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
        mmu_seq = vcpu->kvm->mmu_notifier_seq;
        smp_rmb();

-       if (try_async_pf(vcpu, prefault, gfn, gpa, &pfn, &hva,
-                        write, &map_writable))
-               return RET_PF_RETRY;
+       if (kvm_faultin_pfn(vcpu, prefault, gfn, gpa, &pfn, &hva, write,
+                           &map_writable, &r))
+               return r;

        if (handle_abnormal_pfn(vcpu, is_tdp ? 0 : gpa, gfn, pfn, ACC_ALL, &r))
                return r;
diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index 490a028ddabe..9747124b877d 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -881,9 +881,9 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gpa_t addr, u32 error_code,
        mmu_seq = vcpu->kvm->mmu_notifier_seq;
        smp_rmb();
 
-       if (try_async_pf(vcpu, prefault, walker.gfn, addr, &pfn, &hva,
-                        write_fault, &map_writable))
-               return RET_PF_RETRY;
+       if (kvm_faultin_pfn(vcpu, prefault, walker.gfn, addr, &pfn, &hva,
+                           write_fault, &map_writable, &r))
+               return r;
 
        if (handle_abnormal_pfn(vcpu, addr, walker.gfn, pfn, walker.pte_access, &r))
                return r;

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: [PATCH v2 8/8] KVM: x86: hyper-v: Deactivate APICv only when AutoEOI feature is in use
  2021-07-19 18:49     ` Sean Christopherson
@ 2021-07-20  9:40       ` Maxim Levitsky
  2021-07-22  9:12       ` KVM's support for non default APIC base Maxim Levitsky
  2021-07-22 17:35       ` [PATCH v2 8/8] KVM: x86: hyper-v: Deactivate APICv only when AutoEOI feature is in use Maxim Levitsky
  2 siblings, 0 replies; 32+ messages in thread
From: Maxim Levitsky @ 2021-07-20  9:40 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: kvm, open list:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	Jim Mattson, Joerg Roedel, Borislav Petkov, Vitaly Kuznetsov,
	Wanpeng Li, Paolo Bonzini, Thomas Gleixner, H. Peter Anvin,
	Ingo Molnar, maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT)

On Mon, 2021-07-19 at 18:49 +0000, Sean Christopherson wrote:
> On Sun, Jul 18, 2021, Maxim Levitsky wrote:
> > I am more inclined to fix this by just tracking if we hold the srcu
> > lock on each VCPU manually, just as we track the srcu index anyway,
> > and then kvm_request_apicv_update can use this to drop the srcu
> > lock when needed.
> 
> The entire approach of dynamically adding/removing the memslot seems doomed to
> failure, and is likely responsible for the performance issues with AVIC, e.g. a
> single vCPU temporarily inhibiting AVIC will zap all SPTEs _twice_; on disable
> and again on re-enable.
100% agree, especially about the fact that this approach is doomed to fail.
 
There are two reasons why I didn't explore the direction you propose:
 
(when I talked about this on IRC with Paolo, I did suggest trying to explore 
this direction once)
 
One is that practically speaking AVIC inhibition is not something that happens often,
and in all of my tests it didn't happen after the VM had finished booting.
 
There are the following cases when AVIC is inhibited:
 
1. Configurations that are impossible to support
 
   (PIT reinject mode, x2apic mode, AutoEOI mode of Hyper-V SynIC)
 
   With the theoretical exception of the SynIC AutoEOI mode, this happens
   only once, when qemu configures the VM.
 
   In all of my tests the SynIC AutoEOI mode was also enabled once on each vCPU
   and stayed that way.
 
 
2. Configurations that are not yet supported (nesting)
   This will go away eventually, and it also fires only once, at least currently.
 
3. Dynamically, when the guest dares to use the local APIC's virtual wire mode
   to deliver PIC-connected interrupts via the LAPIC. This can't be done using
   AVIC, because in this mode the interrupt must not affect local APIC registers
   (IRR, ISR, etc.), so a normal EVENTINJ injection has to be done instead.
   This sometimes requires detecting the start of an interrupt window, which is
   not possible while AVIC is enabled, because AVIC makes the CPU ignore the
   so-called virtual interrupts that we inject and intercept exactly to detect it.

   This case only happens on Windows and in OVMF (both are silly enough to use
   this mode), and thankfully Windows only uses it during boot.
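
(For reference, these cases map to the following inhibit reasons, as defined in
arch/x86/include/asm/kvm_host.h at the time of writing:)

	#define APICV_INHIBIT_REASON_DISABLE    0	/* disabled globally */
	#define APICV_INHIBIT_REASON_HYPERV     1	/* case 1: SynIC AutoEOI */
	#define APICV_INHIBIT_REASON_NESTED     2	/* case 2: nesting */
	#define APICV_INHIBIT_REASON_IRQWIN     3	/* case 3: interrupt window */
	#define APICV_INHIBIT_REASON_PIT_REINJ  4	/* case 1: PIT reinject mode */
	#define APICV_INHIBIT_REASON_X2APIC     5	/* case 1: x2apic */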

Thus even a gross performance issue isn't a problem yet, but it would indeed be
nice to eliminate it in case we ever need to deal with a guest that somehow
ends up enabling and disabling AVIC many times per second.

> 
> Rather than pile on more gunk, what about special casing the APIC access page
> memslot in try_async_pf()?  E.g. zap the GFN in avic_update_access_page() when
> disabling (and bounce through kvm_{inc,dec}_notifier_count()), and have the page
> fault path skip directly to MMIO emulation without caching the MMIO info.  It'd
> also give us a good excuse to rename try_async_pf() :-)

I understand what you mean about renaming try_async_pf :-)
 
As for the other reason, it is simple: fear ;-)
While I know KVM's MMU an order of magnitude better than I did a year ago,
I still don't have the confidence needed to add hacks to it.
 
I do agree that a special AVIC hack in the MMU as you propose is miles better
than the dynamic disable hack of the AVIC memslot.

> 
> If lack of MMIO caching is a performance problem, an alternative solution would
> be to allow caching but add a helper to zap the MMIO SPTE and request all vCPUs to
> clear their cache.

In theory this can be an issue: LAPIC MMIO has the ICR register, which is split
into two 32-bit halves, so lack of caching should hurt performance.
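
(To illustrate: sending a single IPI in xAPIC mode takes two 32-bit MMIO writes,
so without a cached MMIO SPTE every IPI would take the slow page-fault path
twice. Register offsets as in arch/x86/include/asm/apicdef.h; xapic_send_ipi is
just an illustrative helper:)

	#define APIC_ICR	0x300	/* ICR low: vector, delivery mode, ... */
	#define APIC_ICR2	0x310	/* ICR high: destination */

	static void xapic_send_ipi(void __iomem *apic_base, u8 dest, u32 icr_low)
	{
		/* The destination half must be written first... */
		writel((u32)dest << 24, apic_base + APIC_ICR2);
		/* ...and this second write actually triggers the IPI. */
		writel(icr_low, apic_base + APIC_ICR);
	}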

With that said, a management layer (e.g. libvirt) these days always enables X2APIC,
and KVM exposes it as supported in CPUID regardless of host support for it,
as it still has better performance than unaccelerated xAPIC.
 
This means a management layer can be expected to disable X2APIC only when it
enables AVIC and sets everything up so that AVIC is actually used, and therefore
no performance impact will be felt.
 
(The time during which the AVIC could be dynamically inhibited is negligible,
as I explained above.)

> 
> It's all a bit gross, especially hijacking the mmu_notifier path, but IMO it'd be
> less awful than the current memslot+SRCU mess.
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index f4d35289f59e..ea434d76cfb0 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -3767,9 +3767,9 @@ static bool kvm_arch_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
>                                   kvm_vcpu_gfn_to_hva(vcpu, gfn), &arch);
>  }
> 
> -static bool try_async_pf(struct kvm_vcpu *vcpu, bool prefault, gfn_t gfn,
> -                        gpa_t cr2_or_gpa, kvm_pfn_t *pfn, hva_t *hva,
> -                        bool write, bool *writable)
> +static bool kvm_faultin_pfn(struct kvm_vcpu *vcpu, bool prefault, gfn_t gfn,
> +                           gpa_t cr2_or_gpa, kvm_pfn_t *pfn, hva_t *hva,
> +                           bool write, bool *writable, int *r)
>  {
>         struct kvm_memory_slot *slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
>         bool async;
> @@ -3780,13 +3780,26 @@ static bool try_async_pf(struct kvm_vcpu *vcpu, bool prefault, gfn_t gfn,
>          * be zapped before KVM inserts a new MMIO SPTE for the gfn.
>          */
>         if (slot && (slot->flags & KVM_MEMSLOT_INVALID))
> -               return true;
> +               goto out_retry;
> 
> -       /* Don't expose private memslots to L2. */
> -       if (is_guest_mode(vcpu) && !kvm_is_visible_memslot(slot)) {
> -               *pfn = KVM_PFN_NOSLOT;
> -               *writable = false;
> -               return false;
> +       if (!kvm_is_visible_memslot(slot)) {
> +               /* Don't expose private memslots to L2. */
> +               if (is_guest_mode(vcpu)) {
> +                       *pfn = KVM_PFN_NOSLOT;
> +                       *writable = false;
> +                       return false;
> +               }
> +               /*
> +                * If the APIC access page exists but is disabled, go directly
> +                * to emulation without caching the MMIO access or creating a
> +                * MMIO SPTE.  That way the cache doesn't need to be purged
> +                * when the AVIC is re-enabled.
> +                */
> +               if (slot->id == APIC_ACCESS_PAGE_PRIVATE_MEMSLOT &&
> +                   !vcpu->kvm->arch.apic_access_memslot_enabled) {
> +                       *r = RET_PF_EMULATE;
> +                       return true;
> +               }
>         }
> 
>         async = false;
> @@ -3800,14 +3813,19 @@ static bool try_async_pf(struct kvm_vcpu *vcpu, bool prefault, gfn_t gfn,
>                 if (kvm_find_async_pf_gfn(vcpu, gfn)) {
>                         trace_kvm_async_pf_doublefault(cr2_or_gpa, gfn);
>                         kvm_make_request(KVM_REQ_APF_HALT, vcpu);
> -                       return true;
> -               } else if (kvm_arch_setup_async_pf(vcpu, cr2_or_gpa, gfn))
> -                       return true;
> +                       goto out_retry;
> +               } else if (kvm_arch_setup_async_pf(vcpu, cr2_or_gpa, gfn)) {
> +                       goto out_retry;
> +               }
>         }
> 
>         *pfn = __gfn_to_pfn_memslot(slot, gfn, false, NULL,
>                                     write, writable, hva);
>         return false;
> +
> +out_retry:
> +       *r = RET_PF_RETRY;
> +       return true;
>  }
> 
>  static int direct_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
> @@ -3839,9 +3857,9 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
>         mmu_seq = vcpu->kvm->mmu_notifier_seq;
>         smp_rmb();
> 
> -       if (try_async_pf(vcpu, prefault, gfn, gpa, &pfn, &hva,
> -                        write, &map_writable))
> -               return RET_PF_RETRY;
> +       if (kvm_faultin_pfn(vcpu, prefault, gfn, gpa, &pfn, &hva, write,
> +                           &map_writable, &r))
> +               return r;
> 
>         if (handle_abnormal_pfn(vcpu, is_tdp ? 0 : gpa, gfn, pfn, ACC_ALL, &r))
>                 return r;
> diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
> index 490a028ddabe..9747124b877d 100644
> --- a/arch/x86/kvm/mmu/paging_tmpl.h
> +++ b/arch/x86/kvm/mmu/paging_tmpl.h
> @@ -881,9 +881,9 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gpa_t addr, u32 error_code,
>         mmu_seq = vcpu->kvm->mmu_notifier_seq;
>         smp_rmb();
>  
> -       if (try_async_pf(vcpu, prefault, walker.gfn, addr, &pfn, &hva,
> -                        write_fault, &map_writable))
> -               return RET_PF_RETRY;
> +       if (kvm_faultin_pfn(vcpu, prefault, walker.gfn, addr, &pfn, &hva,
> +                           write_fault, &map_writable, &r))
> +               return r;
>  
>         if (handle_abnormal_pfn(vcpu, addr, walker.gfn, pfn, walker.pte_access, &r))
>                 return r;
> 


The above approach looks very good.
Paolo, what do you think?

Best regards,
	Maxim Levitsky



^ permalink raw reply	[flat|nested] 32+ messages in thread

* KVM's support for non default APIC base
  2021-07-19 18:49     ` Sean Christopherson
  2021-07-20  9:40       ` Maxim Levitsky
@ 2021-07-22  9:12       ` Maxim Levitsky
  2021-08-02  9:20         ` Maxim Levitsky
  2021-08-06 21:55         ` Sean Christopherson
  2021-07-22 17:35       ` [PATCH v2 8/8] KVM: x86: hyper-v: Deactivate APICv only when AutoEOI feature is in use Maxim Levitsky
  2 siblings, 2 replies; 32+ messages in thread
From: Maxim Levitsky @ 2021-07-22  9:12 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: kvm, open list:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	Jim Mattson, Joerg Roedel, Borislav Petkov, Vitaly Kuznetsov,
	Wanpeng Li, Paolo Bonzini, Thomas Gleixner, H. Peter Anvin,
	Ingo Molnar, maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT)

On Mon, 2021-07-19 at 18:49 +0000, Sean Christopherson wrote:
> On Sun, Jul 18, 2021, Maxim Levitsky wrote:
> > I am more inclined to fix this by just tracking if we hold the srcu
> > lock on each VCPU manually, just as we track the srcu index anyway,
> > and then kvm_request_apicv_update can use this to drop the srcu
> > lock when needed.
> 
> The entire approach of dynamically adding/removing the memslot seems doomed to
> failure, and is likely responsible for the performance issues with AVIC, e.g. a
> single vCPU temporarily inhibiting AVIC will zap all SPTEs _twice_; on disable
> and again on re-enable.
> 
> Rather than pile on more gunk, what about special casing the APIC access page
> memslot in try_async_pf()?  E.g. zap the GFN in avic_update_access_page() when
> disabling (and bounce through kvm_{inc,dec}_notifier_count()), and have the page
> fault path skip directly to MMIO emulation without caching the MMIO info.  It'd
> also give us a good excuse to rename try_async_pf() :-)
> 
> If lack of MMIO caching is a performance problem, an alternative solution would
> be to allow caching but add a helper to zap the MMIO SPTE and request all vCPUs to
> clear their cache.
> 
> It's all a bit gross, especially hijacking the mmu_notifier path, but IMO it'd be
> less awful than the current memslot+SRCU mess

Hi Sean, Paolo and everyone else:

I am exploring the approach that you proposed and I noticed that we have very
inconsistent code handling APIC base relocation for the in-kernel local APIC.
I know that APIC base relocation is not supported, and I have nothing against
that as long as VMs don't use the feature, but I do want the code to be consistent.

I studied the code involved in this mess and would like to hear your opinion:

There are basically 3 modes of operation of the in-kernel local APIC:

Regular unaccelerated local APIC:

-> The APIC MMIO base address is stored in 'apic->base_address' and updated in
   kvm_lapic_set_base, which is called on an MSR write of 'MSR_IA32_APICBASE'
   (by both SVM and VMX).
   The value is even migrated.

-> APIC MMIO reads/writes are done from the MMU when we access the MMIO page:
	vcpu_mmio_write always calls apic_mmio_write, which checks whether the write
	is in the apic->base_address page and if so forwards it to the local APIC
	with the offset within this page (a simplified sketch follows this list).

-> The APIC MMIO area has to actually be MMIO for 'apic_mmio_write' to be called,
   thus it must contain no guest memslots.
   If the guest relocates the APIC base to somewhere we have a memslot,
   the memslot will take priority, while on real hardware the LAPIC is likely
   to take priority.
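
(A condensed sketch of that dispatch, simplified from arch/x86/kvm/lapic.c with
the length/alignment handling trimmed:)

	static int apic_mmio_write(struct kvm_vcpu *vcpu, struct kvm_io_device *this,
				   gpa_t address, int len, const void *data)
	{
		struct kvm_lapic *apic = to_lapic(this);
		unsigned int offset = address - apic->base_address;
		u32 val;

		/* Not the page at apic->base_address -> not handled here */
		if (!apic_mmio_in_range(apic, address))
			return -EOPNOTSUPP;

		memcpy(&val, data, sizeof(val));
		/* APIC registers are 32 bits wide, on 16-byte boundaries */
		kvm_lapic_reg_write(apic, offset & 0xff0, val);
		return 0;
	}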

APICv:

-> The default APIC MMIO base (0xfee00000) is covered by a dummy page which is
   allocated from qemu's process using __x86_set_memory_region.
   
   This is done in alloc_apic_access_page, which is called on vCPU creation
   (but only once, while the memslot is not yet enabled).

-> To avoid pinning this page into qemu's memory, the reference to it
   is dropped in alloc_apic_access_page
   (see commit c24ae0dcd3e8695efa43e71704d1fc4bc7e29e9b).

-> kvm_arch_mmu_notifier_invalidate_range checks whether we are invalidating GPA
   0xfee00000 and if so raises the KVM_REQ_APIC_PAGE_RELOAD request.

-> kvm_vcpu_reload_apic_access_page handles the KVM_REQ_APIC_PAGE_RELOAD request by calling
   kvm_x86_ops.set_apic_access_page_addr which is only implemented on VMX
   (vmx_set_apic_access_page_addr)

-> vmx_set_apic_access_page_addr does gfn_to_page on the 0xfee00000 GPA,
   and if the result is valid, writes the physical address of this page to the
   APIC_ACCESS_ADDR vmcs field.

   (This is a major difference from AVIC - AVIC's avic_vapic_bar is a *GPA*,
   while APIC_ACCESS_ADDR is a host physical address which the hypervisor is
   supposed to map at the APIC MMIO GPA using EPT.)

   Note that if we hit an error here, we might end up with an invalid
   APIC_ACCESS_ADDR field.

-> Writes to the HPA of that special page (which has a GPA of 0xfee00000 and is
   mapped via EPT) go to APICv or cause special VM exits
   (EXIT_REASON_APIC_ACCESS, EXIT_REASON_APIC_WRITE):

   * EXIT_REASON_APIC_ACCESS (used for the older, limited 'flexpriority' mode
     which practically only emulates the TPR) actually emulates the instruction
     to learn the written value, but we have a special case in vcpu_is_mmio_gpa
     which makes the emulation treat accesses to the default APIC base as MMIO.
   
   * EXIT_REASON_APIC_WRITE is a trap VMexit which comes with full APICv; since
     it also carries the offset in the exit qualification and the value is
     already in the APIC page, this info is just passed to kvm_lapic_reg_write.


-> If the APIC base is relocated, APICv isn't aware of it, and writes to the
   new APIC base (assuming we have no memslots covering it) will go through
   standard APIC MMIO emulation and *might* create a mess.
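
(A condensed sketch of that reload path, simplified from
vmx_set_apic_access_page_addr with the nested and control-field checks trimmed:)

	static void vmx_set_apic_access_page_addr(struct kvm_vcpu *vcpu)
	{
		struct page *page;

		/* Resolve the fixed GPA 0xfee00000 to a host page... */
		page = gfn_to_page(vcpu->kvm, APIC_DEFAULT_PHYS_BASE >> PAGE_SHIFT);
		if (is_error_page(page))
			return;	/* may leave a stale APIC_ACCESS_ADDR behind */

		/* ...and program its *host physical* address into the VMCS. */
		vmcs_write64(APIC_ACCESS_ADDR, page_to_phys(page));

		/* Don't pin the page: the MMU notifier raises
		 * KVM_REQ_APIC_PAGE_RELOAD and we run again. */
		put_page(page);
	}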

AVIC:

-> The default APIC MMIO base (0xfee00000)
   is also covered by a dummy page which is allocated from qemu's process using
   __x86_set_memory_region in avic_update_access_page, which is also called on
   vCPU creation (also only once) and from SVM's dynamic AVIC inhibition.

-> The reference to this page is not dropped, thus there is no
   KVM_REQ_APIC_PAGE_RELOAD handler.
   I think we should do the same as we do for APICv here?

-> avic_vapic_bar is a GPA and thus contains 0xfee00000, but writes to
   MSR_IA32_APICBASE do update it (via avic_update_vapic_bar, which is called
   from the MSR_IA32_APICBASE write handler in SVM code; a sketch follows this
   list);

   thus if the guest relocates the APIC base to a writable memory page, AVIC
   would actually happen to work (the opposite of the stock xAPIC handling,
   which only works when the APIC base is in an MMIO area).

-> Writes to the GPA in avic_vapic_bar are first checked in the NPT (but the
   HPA written there is ignored), and then either go to AVIC or cause
   SVM_EXIT_AVIC_UNACCELERATED_ACCESS, which carries the offset of the write
   in exit_info_1
   (there is also SVM_EXIT_AVIC_INCOMPLETE_IPI, which is raised on some ICR writes).
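
(For reference, this is roughly all the SVM side does with it, from
arch/x86/kvm/svm/avic.c - note that the value written is the guest physical
address, untranslated:)

	void avic_update_vapic_bar(struct vcpu_svm *svm, u64 data)
	{
		/* Store the guest's APIC base *GPA* straight into the VMCB */
		svm->vmcb->control.avic_vapic_bar = data & VMCB_AVIC_APIC_BAR_MASK;
		vmcb_mark_dirty(svm->vmcb, VMCB_AVIC);
	}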


As far as I know, the only good reason to relocate the APIC base is to access
it from real mode, which is not something modern BIOSes do these days.

I vote to make it read-only (#GP on a MSR_IA32_APICBASE write that sets a
non-default base while the APIC is enabled) and remove all remains of the
support for a variable APIC base
(we already have a warning when the APIC base is set to a non-default value).
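
(A minimal sketch of the check I have in mind for the MSR_IA32_APICBASE write
path; purely hypothetical, not merged code:)

	static bool apic_base_value_allowed(u64 new)
	{
		/* Allow disabling the APIC, refuse to relocate it */
		if (!(new & MSR_IA32_APICBASE_ENABLE))
			return true;
		return (new & MSR_IA32_APICBASE_BASE) == APIC_DEFAULT_PHYS_BASE;
	}

kvm_set_apic_base would then return 1 (and thus inject #GP) whenever this
returns false.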


Best regards,
	Maxim Levitsky



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2 8/8] KVM: x86: hyper-v: Deactivate APICv only when AutoEOI feature is in use
  2021-07-19 18:49     ` Sean Christopherson
  2021-07-20  9:40       ` Maxim Levitsky
  2021-07-22  9:12       ` KVM's support for non default APIC base Maxim Levitsky
@ 2021-07-22 17:35       ` Maxim Levitsky
  2021-07-22 19:06         ` Sean Christopherson
  2 siblings, 1 reply; 32+ messages in thread
From: Maxim Levitsky @ 2021-07-22 17:35 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: kvm, open list:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	Jim Mattson, Joerg Roedel, Borislav Petkov, Vitaly Kuznetsov,
	Wanpeng Li, Paolo Bonzini, Thomas Gleixner, H. Peter Anvin,
	Ingo Molnar, maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT)

On Mon, 2021-07-19 at 18:49 +0000, Sean Christopherson wrote:
> On Sun, Jul 18, 2021, Maxim Levitsky wrote:
> > I am more inclined to fix this by just tracking if we hold the srcu
> > lock on each VCPU manually, just as we track the srcu index anyway,
> > and then kvm_request_apicv_update can use this to drop the srcu
> > lock when needed.
> 
> The entire approach of dynamically adding/removing the memslot seems doomed to
> failure, and is likely responsible for the performance issues with AVIC, e.g. a
> single vCPU temporarily inhibiting AVIC will zap all SPTEs _twice_; on disable
> and again on re-enable.
> 
> Rather than pile on more gunk, what about special casing the APIC access page
> memslot in try_async_pf()?  E.g. zap the GFN in avic_update_access_page() when
> disabling (and bounce through kvm_{inc,dec}_notifier_count()), and have the page
> fault path skip directly to MMIO emulation without caching the MMIO info.  It'd
> also give us a good excuse to rename try_async_pf() :-)
> 
> If lack of MMIO caching is a performance problem, an alternative solution would
> be to allow caching but add a helper to zap the MMIO SPTE and request all vCPUs to
> clear their cache.
> 
> It's all a bit gross, especially hijacking the mmu_notifier path, but IMO it'd be
> less awful than the current memslot+SRCU mess.

Hi!

I am testing your approach and it actually works very well! I can't seem to break it.

Could you explain why I need to do something with kvm_{inc,dec}_notifier_count()?

I just use your patch as is, plus the changes that are needed in kvm_request_apicv_update.

With AVIC, unlike APICv, the page in this memslot is still pinned in physical
memory (AVIC misses the code that APICv has in this regard).
Unpinning it should be done in the future though.

Best regards,
	Maxim Levitsky

> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index f4d35289f59e..ea434d76cfb0 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -3767,9 +3767,9 @@ static bool kvm_arch_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
>                                   kvm_vcpu_gfn_to_hva(vcpu, gfn), &arch);
>  }
> 
> -static bool try_async_pf(struct kvm_vcpu *vcpu, bool prefault, gfn_t gfn,
> -                        gpa_t cr2_or_gpa, kvm_pfn_t *pfn, hva_t *hva,
> -                        bool write, bool *writable)
> +static bool kvm_faultin_pfn(struct kvm_vcpu *vcpu, bool prefault, gfn_t gfn,
> +                           gpa_t cr2_or_gpa, kvm_pfn_t *pfn, hva_t *hva,
> +                           bool write, bool *writable, int *r)
>  {
>         struct kvm_memory_slot *slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
>         bool async;
> @@ -3780,13 +3780,26 @@ static bool try_async_pf(struct kvm_vcpu *vcpu, bool prefault, gfn_t gfn,
>          * be zapped before KVM inserts a new MMIO SPTE for the gfn.
>          */
>         if (slot && (slot->flags & KVM_MEMSLOT_INVALID))
> -               return true;
> +               goto out_retry;
> 
> -       /* Don't expose private memslots to L2. */
> -       if (is_guest_mode(vcpu) && !kvm_is_visible_memslot(slot)) {
> -               *pfn = KVM_PFN_NOSLOT;
> -               *writable = false;
> -               return false;
> +       if (!kvm_is_visible_memslot(slot)) {
> +               /* Don't expose private memslots to L2. */
> +               if (is_guest_mode(vcpu)) {
> +                       *pfn = KVM_PFN_NOSLOT;
> +                       *writable = false;
> +                       return false;
> +               }
> +               /*
> +                * If the APIC access page exists but is disabled, go directly
> +                * to emulation without caching the MMIO access or creating a
> +                * MMIO SPTE.  That way the cache doesn't need to be purged
> +                * when the AVIC is re-enabled.
> +                */
> +               if (slot->id == APIC_ACCESS_PAGE_PRIVATE_MEMSLOT &&
> +                   !vcpu->kvm->arch.apic_access_memslot_enabled) {
> +                       *r = RET_PF_EMULATE;
> +                       return true;
> +               }
>         }
> 
>         async = false;
> @@ -3800,14 +3813,19 @@ static bool try_async_pf(struct kvm_vcpu *vcpu, bool prefault, gfn_t gfn,
>                 if (kvm_find_async_pf_gfn(vcpu, gfn)) {
>                         trace_kvm_async_pf_doublefault(cr2_or_gpa, gfn);
>                         kvm_make_request(KVM_REQ_APF_HALT, vcpu);
> -                       return true;
> -               } else if (kvm_arch_setup_async_pf(vcpu, cr2_or_gpa, gfn))
> -                       return true;
> +                       goto out_retry;
> +               } else if (kvm_arch_setup_async_pf(vcpu, cr2_or_gpa, gfn)) {
> +                       goto out_retry;
> +               }
>         }
> 
>         *pfn = __gfn_to_pfn_memslot(slot, gfn, false, NULL,
>                                     write, writable, hva);
>         return false;
> +
> +out_retry:
> +       *r = RET_PF_RETRY;
> +       return true;
>  }
> 
>  static int direct_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
> @@ -3839,9 +3857,9 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
>         mmu_seq = vcpu->kvm->mmu_notifier_seq;
>         smp_rmb();
> 
> -       if (try_async_pf(vcpu, prefault, gfn, gpa, &pfn, &hva,
> -                        write, &map_writable))
> -               return RET_PF_RETRY;
> +       if (kvm_faultin_pfn(vcpu, prefault, gfn, gpa, &pfn, &hva, write,
> +                           &map_writable, &r))
> +               return r;
> 
>         if (handle_abnormal_pfn(vcpu, is_tdp ? 0 : gpa, gfn, pfn, ACC_ALL, &r))
>                 return r;
> diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
> index 490a028ddabe..9747124b877d 100644
> --- a/arch/x86/kvm/mmu/paging_tmpl.h
> +++ b/arch/x86/kvm/mmu/paging_tmpl.h
> @@ -881,9 +881,9 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, gpa_t addr, u32 error_code,
>         mmu_seq = vcpu->kvm->mmu_notifier_seq;
>         smp_rmb();
>  
> -       if (try_async_pf(vcpu, prefault, walker.gfn, addr, &pfn, &hva,
> -                        write_fault, &map_writable))
> -               return RET_PF_RETRY;
> +       if (kvm_faultin_pfn(vcpu, prefault, walker.gfn, addr, &pfn, &hva,
> +                           write_fault, &map_writable, &r))
> +               return r;
>  
>         if (handle_abnormal_pfn(vcpu, addr, walker.gfn, pfn, walker.pte_access, &r))
>                 return r;
> 



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2 8/8] KVM: x86: hyper-v: Deactivate APICv only when AutoEOI feature is in use
  2021-07-22 17:35       ` [PATCH v2 8/8] KVM: x86: hyper-v: Deactivate APICv only when AutoEOI feature is in use Maxim Levitsky
@ 2021-07-22 19:06         ` Sean Christopherson
  2021-07-27 13:05           ` Maxim Levitsky
  0 siblings, 1 reply; 32+ messages in thread
From: Sean Christopherson @ 2021-07-22 19:06 UTC (permalink / raw)
  To: Maxim Levitsky
  Cc: kvm, open list:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	Jim Mattson, Joerg Roedel, Borislav Petkov, Vitaly Kuznetsov,
	Wanpeng Li, Paolo Bonzini, Thomas Gleixner, H. Peter Anvin,
	Ingo Molnar, maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	Ben Gardon

[-- Attachment #1: Type: text/plain, Size: 5113 bytes --]

+Ben

On Thu, Jul 22, 2021, Maxim Levitsky wrote:
> On Mon, 2021-07-19 at 18:49 +0000, Sean Christopherson wrote:
> > On Sun, Jul 18, 2021, Maxim Levitsky wrote:
> > > I am more inclined to fix this by just tracking if we hold the srcu
> > > lock on each VCPU manually, just as we track the srcu index anyway,
> > > and then kvm_request_apicv_update can use this to drop the srcu
> > > lock when needed.
> > 
> > The entire approach of dynamically adding/removing the memslot seems doomed to
> > failure, and is likely responsible for the performance issues with AVIC, e.g. a
> > single vCPU temporarily inhibiting AVIC will zap all SPTEs _twice_; on disable
> > and again on re-enable.
> > 
> > Rather than pile on more gunk, what about special casing the APIC access page
> > memslot in try_async_pf()?  E.g. zap the GFN in avic_update_access_page() when
> > disabling (and bounce through kvm_{inc,dec}_notifier_count()), and have the page
> > fault path skip directly to MMIO emulation without caching the MMIO info.  It'd
> > also give us a good excuse to rename try_async_pf() :-)
> > 
> > If lack of MMIO caching is a performance problem, an alternative solution would
> > be to allow caching but add a helper to zap the MMIO SPTE and request all vCPUs to
> > clear their cache.
> > 
> > It's all a bit gross, especially hijacking the mmu_notifier path, but IMO it'd be
> > less awful than the current memslot+SRCU mess.
> 
> Hi!
> 
> I am testing your approach and it actually works very well! I can't seem to break it.
> 
> Could you explain why I need to do something with kvm_{inc,dec}_notifier_count()?

Glad you asked, there's one more change needed.  kvm_zap_gfn_range() currently
takes mmu_lock for read, but it needs to take mmu_lock for write for this case
(more on why below).

The existing users, update_mtrr() and kvm_post_set_cr0(), are a bit sketchy.  The
whole thing is a grey area because KVM is trying to ensure it honors the guest's
UC memtype for non-coherent DMA, but the inputs (CR0 and MTRRs) are per-vCPU,
i.e. for it to work correctly, the guest has to ensure all running vCPUs do the
same transition.  So in practice there's likely no observable bug, but it also
means that taking mmu_lock for read is likely pointless, because for things to
work the guest has to serialize all running vCPUs.

Ben, any objection to taking mmu_lock for write in kvm_zap_gfn_range()?  It would
effectively revert commit 6103bc074048 ("KVM: x86/mmu: Allow zap gfn range to
operate under the mmu read lock"); see attached patch.  And we could even bump
the notifier count in that helper, e.g. on top of the attached:

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index b607e8763aa2..7174058e982b 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -5568,6 +5568,8 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)

        write_lock(&kvm->mmu_lock);

+       kvm_inc_notifier_count(kvm, gfn_start, gfn_end);
+
        if (kvm_memslots_have_rmaps(kvm)) {
                for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
                        slots = __kvm_memslots(kvm, i);
@@ -5598,6 +5600,8 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
        if (flush)
                kvm_flush_remote_tlbs_with_address(kvm, gfn_start, gfn_end);

+       kvm_dec_notifier_count(kvm, gfn_start, gfn_end);
+
        write_unlock(&kvm->mmu_lock);
 }




Back to Maxim's original question...

Elevating mmu_notifier_count and bumping mmu_notifier_seq will handle the case
where APICv is being disabled while a different vCPU is concurrently faulting in a
new mapping for the APIC page.  E.g. it handles this race:

 vCPU0                                 vCPU1
                                       apic_access_memslot_enabled = true;
                                       #NPF on APIC
                                       apic_access_memslot_enabled==true, proceed with #NPF
 apic_access_memslot_enabled = false
 kvm_zap_gfn_range(APIC);
                                       __direct_map(APIC)

 mov [APIC], 0 <-- succeeds, but KVM wants to intercept to emulate



The elevated mmu_notifier_count and/or changed mmu_notifier_seq will cause vCPU1
to bail and resume the guest without fixing the #NPF.  After acquiring mmu_lock,
vCPU1 will see the elevated mmu_notifier_count (if kvm_zap_gfn_range() is about
to be called, or just finished) and/or a modified mmu_notifier_seq (after the
count was decremented).

This is why kvm_zap_gfn_range() needs to take mmu_lock for write.  If it's allowed
to run in parallel with the page fault handler, there's no guarantee that the
correct apic_access_memslot_enabled will be observed.

	if (is_tdp_mmu_fault)
		read_lock(&vcpu->kvm->mmu_lock);
	else
		write_lock(&vcpu->kvm->mmu_lock);

	if (!is_noslot_pfn(pfn) && mmu_notifier_retry_hva(vcpu->kvm, mmu_seq, hva)) <--- look here!
		goto out_unlock;

	if (is_tdp_mmu_fault)
		r = kvm_tdp_mmu_map(vcpu, gpa, error_code, map_writable, max_level,
				    pfn, prefault);
	else
		r = __direct_map(vcpu, gpa, error_code, map_writable, max_level, pfn,
				 prefault, is_tdp);

[-- Attachment #2: 0001-Revert-KVM-x86-mmu-Allow-zap-gfn-range-to-operate-un.patch --]
[-- Type: text/x-diff, Size: 4672 bytes --]

From 18c4f6445231dfeb355772d1de09d933fe8ef8cf Mon Sep 17 00:00:00 2001
From: Sean Christopherson <seanjc@google.com>
Date: Thu, 22 Jul 2021 11:59:15 -0700
Subject: [PATCH] Revert "KVM: x86/mmu: Allow zap gfn range to operate under
 the mmu read lock"

Sean needs to write a changelog.

This effectively reverts commit 6103bc074048876794fa6d21fd8989331690ccbd.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/mmu/mmu.c     | 19 ++++++++-----------
 arch/x86/kvm/mmu/tdp_mmu.c | 15 ++++-----------
 arch/x86/kvm/mmu/tdp_mmu.h | 11 ++++-------
 3 files changed, 16 insertions(+), 29 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index b888385d1933..b607e8763aa2 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -5566,8 +5566,9 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
 	int i;
 	bool flush = false;
 
+	write_lock(&kvm->mmu_lock);
+
 	if (kvm_memslots_have_rmaps(kvm)) {
-		write_lock(&kvm->mmu_lock);
 		for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
 			slots = __kvm_memslots(kvm, i);
 			kvm_for_each_memslot(memslot, slots) {
@@ -5586,22 +5587,18 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
 		}
 		if (flush)
 			kvm_flush_remote_tlbs_with_address(kvm, gfn_start, gfn_end);
-		write_unlock(&kvm->mmu_lock);
 	}
 
 	if (is_tdp_mmu_enabled(kvm)) {
-		flush = false;
-
-		read_lock(&kvm->mmu_lock);
 		for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++)
 			flush = kvm_tdp_mmu_zap_gfn_range(kvm, i, gfn_start,
-							  gfn_end, flush, true);
-		if (flush)
-			kvm_flush_remote_tlbs_with_address(kvm, gfn_start,
-							   gfn_end);
-
-		read_unlock(&kvm->mmu_lock);
+							  gfn_end, flush);
 	}
+
+	if (flush)
+		kvm_flush_remote_tlbs_with_address(kvm, gfn_start, gfn_end);
+
+	write_unlock(&kvm->mmu_lock);
 }
 
 static bool slot_rmap_write_protect(struct kvm *kvm,
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index caac4ddb46df..16f4d91deb56 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -773,21 +773,15 @@ static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
  * non-root pages mapping GFNs strictly within that range. Returns true if
  * SPTEs have been cleared and a TLB flush is needed before releasing the
  * MMU lock.
- *
- * If shared is true, this thread holds the MMU lock in read mode and must
- * account for the possibility that other threads are modifying the paging
- * structures concurrently. If shared is false, this thread should hold the
- * MMU in write mode.
  */
 bool __kvm_tdp_mmu_zap_gfn_range(struct kvm *kvm, int as_id, gfn_t start,
-				 gfn_t end, bool can_yield, bool flush,
-				 bool shared)
+				 gfn_t end, bool can_yield, bool flush)
 {
 	struct kvm_mmu_page *root;
 
-	for_each_tdp_mmu_root_yield_safe(kvm, root, as_id, shared)
+	for_each_tdp_mmu_root_yield_safe(kvm, root, as_id, false)
 		flush = zap_gfn_range(kvm, root, start, end, can_yield, flush,
-				      shared);
+				      false);
 
 	return flush;
 }
@@ -799,8 +793,7 @@ void kvm_tdp_mmu_zap_all(struct kvm *kvm)
 	int i;
 
 	for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++)
-		flush = kvm_tdp_mmu_zap_gfn_range(kvm, i, 0, max_gfn,
-						  flush, false);
+		flush = kvm_tdp_mmu_zap_gfn_range(kvm, i, 0, max_gfn, flush);
 
 	if (flush)
 		kvm_flush_remote_tlbs(kvm);
diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
index 1cae4485b3bc..ec8b012488fc 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.h
+++ b/arch/x86/kvm/mmu/tdp_mmu.h
@@ -20,14 +20,11 @@ void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root,
 			  bool shared);
 
 bool __kvm_tdp_mmu_zap_gfn_range(struct kvm *kvm, int as_id, gfn_t start,
-				 gfn_t end, bool can_yield, bool flush,
-				 bool shared);
+				 gfn_t end, bool can_yield, bool flush);
 static inline bool kvm_tdp_mmu_zap_gfn_range(struct kvm *kvm, int as_id,
-					     gfn_t start, gfn_t end, bool flush,
-					     bool shared)
+					     gfn_t start, gfn_t end, bool flush)
 {
-	return __kvm_tdp_mmu_zap_gfn_range(kvm, as_id, start, end, true, flush,
-					   shared);
+	return __kvm_tdp_mmu_zap_gfn_range(kvm, as_id, start, end, true, flush);
 }
 static inline bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
 {
@@ -44,7 +41,7 @@ static inline bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
 	 */
 	lockdep_assert_held_write(&kvm->mmu_lock);
 	return __kvm_tdp_mmu_zap_gfn_range(kvm, kvm_mmu_page_as_id(sp),
-					   sp->gfn, end, false, false, false);
+					   sp->gfn, end, false, false);
 }
 
 void kvm_tdp_mmu_zap_all(struct kvm *kvm);
-- 
2.32.0.432.gabb21c7263-goog


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: [PATCH v2 0/8] My AVIC patch queue
  2021-07-13 14:20 [PATCH v2 0/8] My AVIC patch queue Maxim Levitsky
                   ` (7 preceding siblings ...)
  2021-07-13 14:20 ` [PATCH v2 8/8] KVM: x86: hyper-v: Deactivate APICv only when AutoEOI feature is in use Maxim Levitsky
@ 2021-07-26 17:24 ` Paolo Bonzini
  8 siblings, 0 replies; 32+ messages in thread
From: Paolo Bonzini @ 2021-07-26 17:24 UTC (permalink / raw)
  To: Maxim Levitsky, kvm
  Cc: open list:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	Jim Mattson, Joerg Roedel, Borislav Petkov, Vitaly Kuznetsov,
	Wanpeng Li, Thomas Gleixner, H. Peter Anvin, Ingo Molnar,
	maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	Sean Christopherson

On 13/07/21 16:20, Maxim Levitsky wrote:
> Hi!
> 
> This is a series of bugfixes to the AVIC dynamic inhibition, which was
> made while trying to fix bugs as much as possible, in this area and trying
> to make the AVIC+SYNIC conditional enablement work.
> 
> * Patches 1-4 address an issue of possible
>    mismatch between the AVIC inhibit state and AVIC enable state on all vCPUs.
> 
>    Since AVICs state is changed via a request there is a window during which
>    the states differ which can lead to various warnings and errors.
> 
>    There was an earlier attempt to fix this by changing the AVIC enable state
>    on the current vCPU immediately when the AVIC inhibit request is created,
>    however while this fixes the common case, it actually hides the issue deeper,
>    because on all other vCPUs but current one, the two states can still
>    mismatch till the KVM_REQ_APICV_UPDATE is processed on each of them.
> 
>    My take on this is to fix the places where the mismatch causes the
>    issues instead and then drop the special case of toggling the AVIC right
>    away in kvm_request_apicv_update.
> 
>    V2: I rewrote the commit description for the patch that touches
>      avic inhibition in nested case.
> 
> * Patches 5-6 in this series fix a race condition which can cause
>    a lost write from a guest to APIC when the APIC write races
>    the AVIC un-inhibition, and add a warning to catch this problem
>    if it re-emerges again.
> 
>    V2: I re-implemented this with a mutex in V2.
> 
> * Patch 7 is a fix for yet another issue I found in the AVIC inhibit code:
>    Currently avic_vcpu_load/avic_vcpu_put are called on userspace entry/exit
>    from KVM (aka kvm_vcpu_get/kvm_vcpu_put), and these functions update the
>    "is running" bit in the AVIC physical ID remap table and update the
>    target vCPU in iommu code.
> 
>    However both of these functions don't do anything when AVIC is inhibited
>    thus the "is running" bit will be kept enabled during exit to userspace.
>    This shouldn't be a big issue as the caller
>    doesn't use the AVIC when inhibited but still inconsistent and can trigger
>    a warning about this in avic_vcpu_load.
> 
>    To be on the safe side I think it makes sense to call
>    avic_vcpu_put/avic_vcpu_load when inhibiting/uninhibiting the AVIC.
>    This will ensure that the work these functions do is matched.
> 
> * Patch 8 is the patch from Vitaly about allowing AVIC with SynIC
>    as long as the guest doesn’t use the AutoEOI feature. I only slightly
>    changed it to drop the SRCU lock around call to kvm_request_apicv_update
>    and also expose the AutoEOI cpuid bit regardless of AVIC enablement.
> 
>    Despite the fact that this is the last patch in this series, this patch
>    doesn't depend on the other fixes.
> 
> Best regards,
> 	Maxim Levitsky
> 
> Maxim Levitsky (7):
>    KVM: SVM: svm_set_vintr don't warn if AVIC is active but is about to
>      be deactivated
>    KVM: SVM: tweak warning about enabled AVIC on nested entry
>    KVM: SVM: use vmcb01 in svm_refresh_apicv_exec_ctrl
>    KVM: x86: APICv: drop immediate APICv disablement on current vCPU
>    KVM: x86: APICv: fix race in kvm_request_apicv_update on SVM
> >    KVM: SVM: add warning for mismatch between AVIC state and AVIC access
>      page state
>    KVM: SVM: call avic_vcpu_load/avic_vcpu_put when enabling/disabling
>      AVIC
> 
> Vitaly Kuznetsov (1):
>    KVM: x86: hyper-v: Deactivate APICv only when AutoEOI feature is in
>      use
> 
>   arch/x86/include/asm/kvm_host.h |  3 ++
>   arch/x86/kvm/hyperv.c           | 34 ++++++++++++++++----
>   arch/x86/kvm/svm/avic.c         | 45 ++++++++++++++------------
>   arch/x86/kvm/svm/nested.c       |  2 +-
>   arch/x86/kvm/svm/svm.c          | 18 ++++++++---
>   arch/x86/kvm/x86.c              | 57 ++++++++++++++++++---------------
>   include/linux/kvm_host.h        |  1 +
>   virt/kvm/kvm_main.c             |  1 +
>   8 files changed, 103 insertions(+), 58 deletions(-)
> 

Queued patches 1-4, thanks.

Paolo


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2 5/8] KVM: x86: APICv: fix race in kvm_request_apicv_update on SVM
  2021-07-13 14:20 ` [PATCH v2 5/8] KVM: x86: APICv: fix race in kvm_request_apicv_update on SVM Maxim Levitsky
@ 2021-07-26 22:34   ` Paolo Bonzini
  2021-07-27 13:22     ` Maxim Levitsky
  0 siblings, 1 reply; 32+ messages in thread
From: Paolo Bonzini @ 2021-07-26 22:34 UTC (permalink / raw)
  To: Maxim Levitsky, kvm
  Cc: open list:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	Jim Mattson, Joerg Roedel, Borislav Petkov, Vitaly Kuznetsov,
	Wanpeng Li, Thomas Gleixner, H. Peter Anvin, Ingo Molnar,
	maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	Sean Christopherson

On 13/07/21 16:20, Maxim Levitsky wrote:
> +	mutex_lock(&vcpu->kvm->apicv_update_lock);
> +
>  	vcpu->arch.apicv_active = kvm_apicv_activated(vcpu->kvm);
>  	kvm_apic_update_apicv(vcpu);
>  	static_call(kvm_x86_refresh_apicv_exec_ctrl)(vcpu);
> @@ -9246,6 +9248,8 @@ void kvm_vcpu_update_apicv(struct kvm_vcpu *vcpu)
>  	 */
>  	if (!vcpu->arch.apicv_active)
>  		kvm_make_request(KVM_REQ_EVENT, vcpu);
> +
> +	mutex_unlock(&vcpu->kvm->apicv_update_lock);

Does this whole piece of code need the lock/unlock?  Does it work and/or 
make sense to do the unlock immediately after mutex_lock()?  That would make
it clearer that the mutex is being used to synchronize against the requestor.
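
Something like this, i.e. using the mutex purely as a synchronization point
(sketch only, based on the code above):

	void kvm_vcpu_update_apicv(struct kvm_vcpu *vcpu)
	{
		/* Wait for a concurrent kvm_request_apicv_update()... */
		mutex_lock(&vcpu->kvm->apicv_update_lock);
		mutex_unlock(&vcpu->kvm->apicv_update_lock);

		/* ...then do the actual update outside the lock. */
		vcpu->arch.apicv_active = kvm_apicv_activated(vcpu->kvm);
		kvm_apic_update_apicv(vcpu);
		static_call(kvm_x86_refresh_apicv_exec_ctrl)(vcpu);

		if (!vcpu->arch.apicv_active)
			kvm_make_request(KVM_REQ_EVENT, vcpu);
	}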

> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index ed4d1581d502..ba5d5d9ebc64 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -943,6 +943,7 @@ static struct kvm *kvm_create_vm(unsigned long type)
>   	mutex_init(&kvm->irq_lock);
>   	mutex_init(&kvm->slots_lock);
>   	mutex_init(&kvm->slots_arch_lock);
> +	mutex_init(&kvm->apicv_update_lock);
>   	INIT_LIST_HEAD(&kvm->devices);
>   
>   	BUILD_BUG_ON(KVM_MEM_SLOTS_NUM > SHRT_MAX);
> 

Please add comments above fields that are protected by this lock 
(anything but apicv_inhibit_reasons?), and especially move it to kvm->arch.
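
E.g. something along these lines (illustration only; exact wording up to you):

	/* in struct kvm_arch: */
	/*
	 * Protects apicv_inhibit_reasons; taken by kvm_request_apicv_update()
	 * and by vCPUs processing KVM_REQ_APICV_UPDATE.
	 */
	struct mutex apicv_update_lock;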

Paolo


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2 8/8] KVM: x86: hyper-v: Deactivate APICv only when AutoEOI feature is in use
  2021-07-22 19:06         ` Sean Christopherson
@ 2021-07-27 13:05           ` Maxim Levitsky
  2021-07-27 17:48             ` Ben Gardon
  0 siblings, 1 reply; 32+ messages in thread
From: Maxim Levitsky @ 2021-07-27 13:05 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: kvm, open list:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	Jim Mattson, Joerg Roedel, Borislav Petkov, Vitaly Kuznetsov,
	Wanpeng Li, Paolo Bonzini, Thomas Gleixner, H. Peter Anvin,
	Ingo Molnar, maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	Ben Gardon

On Thu, 2021-07-22 at 19:06 +0000, Sean Christopherson wrote:
> +Ben
> 
> On Thu, Jul 22, 2021, Maxim Levitsky wrote:
> > On Mon, 2021-07-19 at 18:49 +0000, Sean Christopherson wrote:
> > > On Sun, Jul 18, 2021, Maxim Levitsky wrote:
> > > > I am more inclined to fix this by just tracking if we hold the srcu
> > > > lock on each VCPU manually, just as we track the srcu index anyway,
> > > > and then kvm_request_apicv_update can use this to drop the srcu
> > > > lock when needed.
> > > 
> > > The entire approach of dynamically adding/removing the memslot seems doomed to
> > > failure, and is likely responsible for the performance issues with AVIC, e.g. a
> > > single vCPU temporarily inhibiting AVIC will zap all SPTEs _twice_; on disable
> > > and again on re-enable.
> > > 
> > > Rather than pile on more gunk, what about special casing the APIC access page
> > > memslot in try_async_pf()?  E.g. zap the GFN in avic_update_access_page() when
> > > disabling (and bounce through kvm_{inc,dec}_notifier_count()), and have the page
> > > fault path skip directly to MMIO emulation without caching the MMIO info.  It'd
> > > also give us a good excuse to rename try_async_pf() :-)
> > > 
> > > If lack of MMIO caching is a performance problem, an alternative solution would
> > > be to allow caching but add a helper to zap the MMIO SPTE and request all vCPUs to
> > > clear their cache.
> > > 
> > > It's all a bit gross, especially hijacking the mmu_notifier path, but IMO it'd be
> > > less awful than the current memslot+SRCU mess.
> > 
> > Hi!
> > 
> > I am testing your approach and it actually works very well! I can't seem to break it.
> > 
> > Could you explain why I need to do something with kvm_{inc,dec}_notifier_count()?
> 
> Glad you asked, there's one more change needed.  kvm_zap_gfn_range() currently
> takes mmu_lock for read, but it needs to take mmu_lock for write for this case
> (more way below).
> 
> The existing users, update_mtrr() and kvm_post_set_cr0(), are a bit sketchy.  The
> whole thing is a grey area because KVM is trying to ensure it honors the guest's
> UC memtype for non-coherent DMA, but the inputs (CR0 and MTRRs) are per-vCPU,
> i.e. for it to work correctly, the guest has to ensure all running vCPUs do the
> same transition.  So in practice there's likely no observable bug, but it also
> means that taking mmu_lock for read is likely pointless, because for things to
> work the guest has to serialize all running vCPUs.
> 
> Ben, any objection to taking mmu_lock for write in kvm_zap_gfn_range()?  It would
> effectively revert commit 6103bc074048 ("KVM: x86/mmu: Allow zap gfn range to
> operate under the mmu read lock"); see attached patch.  And we could even bump
> the notifier count in that helper, e.g. on top of the attached:
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index b607e8763aa2..7174058e982b 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -5568,6 +5568,8 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
> 
>         write_lock(&kvm->mmu_lock);
> 
> +       kvm_inc_notifier_count(kvm, gfn_start, gfn_end);
> +
>         if (kvm_memslots_have_rmaps(kvm)) {
>                 for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
>                         slots = __kvm_memslots(kvm, i);
> @@ -5598,6 +5600,8 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
>         if (flush)
>                 kvm_flush_remote_tlbs_with_address(kvm, gfn_start, gfn_end);
> 
> +       kvm_dec_notifier_count(kvm, gfn_start, gfn_end);
> +
>         write_unlock(&kvm->mmu_lock);
>  }
> 

I understand what you mean now. I thought that I needed to change the code of
kvm_inc_notifier_count/kvm_dec_notifier_count.




> 
> 
> 
> Back to Maxim's original question...
> 
> Elevating mmu_notifier_count and bumping mmu_notifier_seq will handle the case
> where APICv is being disabled while a different vCPU is concurrently faulting in a
> new mapping for the APIC page.  E.g. it handles this race:
> 
>  vCPU0                                 vCPU1
>                                        apic_access_memslot_enabled = true;
>  			               #NPF on APIC
> 			               apic_access_memslot_enabled==true, proceed with #NPF
>  apic_access_memslot_enabled = false 
>  kvm_zap_gfn_range(APIC);
>                                        __direct_map(APIC)
> 
>  mov [APIC], 0 <-- succeeds, but KVM wants to intercept to emulate

I understand this now. I guess this can't happen with the original memslot disable,
which I guess has the needed locking and flushing to avoid this.
(I didn't study the code in depth though.)

> 
> 
> 
> The elevated mmu_notifier_count and/or changed mmu_notifier_seq will cause vCPU1
> to bail and resume the guest without fixing the #NPF.  After acquiring mmu_lock,
> vCPU1 will see the elevated mmu_notifier_count (if kvm_zap_gfn_range() is about
> to be called, or just finished) and/or a modified mmu_notifier_seq (after the
> count was decremented).
> 
> This is why kvm_zap_gfn_range() needs to take mmu_lock for write.  If it's allowed
> to run in parallel with the page fault handler, there's no guarantee that the
> correct apic_access_memslot_enabled will be observed.

I understand now.
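
(For reference, the retry check you point at below boils down to roughly
this; I am paraphrasing mmu_notifier_retry_hva() from
include/linux/kvm_host.h, so the details might be slightly off:)

	/* a zap is in progress and covers our hva: bail */
	if (unlikely(kvm->mmu_notifier_count) &&
	    hva >= kvm->mmu_notifier_range_start &&
	    hva < kvm->mmu_notifier_range_end)
		return 1;
	/* a zap completed after we sampled mmu_seq: bail */
	if (kvm->mmu_notifier_seq != mmu_seq)
		return 1;
	return 0;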

So, Paolo, Ben Gardon, what do you think? Do you think this approach is feasible?
Do you agree to revert the usage of the read lock?

I will post a new series using this approach very soon, since I already have
most of the code done.

Best regards,
	Maxim Levitsky

> 
> 	if (is_tdp_mmu_fault)
> 		read_lock(&vcpu->kvm->mmu_lock);
> 	else
> 		write_lock(&vcpu->kvm->mmu_lock);
> 
> 	if (!is_noslot_pfn(pfn) && mmu_notifier_retry_hva(vcpu->kvm, mmu_seq, hva)) <--- look here!
> 		goto out_unlock;
> 
> 	if (is_tdp_mmu_fault)
> 		r = kvm_tdp_mmu_map(vcpu, gpa, error_code, map_writable, max_level,
> 				    pfn, prefault);
> 	else
> 		r = __direct_map(vcpu, gpa, error_code, map_writable, max_level, pfn,
> 				 prefault, is_tdp);



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2 5/8] KVM: x86: APICv: fix race in kvm_request_apicv_update on SVM
  2021-07-26 22:34   ` Paolo Bonzini
@ 2021-07-27 13:22     ` Maxim Levitsky
  0 siblings, 0 replies; 32+ messages in thread
From: Maxim Levitsky @ 2021-07-27 13:22 UTC (permalink / raw)
  To: Paolo Bonzini, kvm
  Cc: open list:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	Jim Mattson, Joerg Roedel, Borislav Petkov, Vitaly Kuznetsov,
	Wanpeng Li, Thomas Gleixner, H. Peter Anvin, Ingo Molnar,
	maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	Sean Christopherson

On Tue, 2021-07-27 at 00:34 +0200, Paolo Bonzini wrote:
> On 13/07/21 16:20, Maxim Levitsky wrote:
> > +	mutex_lock(&vcpu->kvm->apicv_update_lock);
> > +
> >  	vcpu->arch.apicv_active = kvm_apicv_activated(vcpu->kvm);
> >  	kvm_apic_update_apicv(vcpu);
> >  	static_call(kvm_x86_refresh_apicv_exec_ctrl)(vcpu);
> > @@ -9246,6 +9248,8 @@ void kvm_vcpu_update_apicv(struct kvm_vcpu *vcpu)
> >  	 */
> >  	if (!vcpu->arch.apicv_active)
> >  		kvm_make_request(KVM_REQ_EVENT, vcpu);
> > +
> > +	mutex_unlock(&vcpu->kvm->apicv_update_lock);
> 
> Does this whole piece of code need the lock/unlock?  Does it work and/or 
> make sense to do the unlock immediately after mutex_lock()?  This makes 
> it clearer that the mutex is being used to synchronize against the requestor.

Yes, I do need to hold the mutex for the whole duration.

The requester does the following (a code sketch follows the list):

1. Take the mutex
 
2. Kick all the vCPUs out of guest mode with KVM_REQ_APICV_UPDATE.
   At that point all these vCPUs will be (or soon will be) stuck on the mutex
   and guaranteed to be outside of guest mode,
   which is exactly what I need to keep them from entering guest
   mode as long as the AVIC's memslot state is not up to date.

3. Update kvm->arch.apicv_inhibit_reasons. I removed the cmpxchg loop
   since it is now protected under the lock anyway.
   This step doesn't have to be done at this point, but should be done while the mutex is held
   so that there is no need for cmpxchg and such.

   This itself isn't the justification for the mutex.
 
4. Update the memslot
 
5. Release the mutex.
   Only now are all other vCPUs permitted to enter guest mode again
   (since only now is the memslot up to date),
   and they will also update their per-vCPU AVIC enablement prior to entering it.
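
Putting the above together, the requester side would look roughly like
this (an untested sketch: it assumes the mutex is moved to kvm->arch as
Paolo asked, and that the memslot update stays behind the existing
pre_update_apicv_exec_ctrl hook):

void kvm_request_apicv_update(struct kvm *kvm, bool activate, ulong bit)
{
	mutex_lock(&kvm->arch.apicv_update_lock);

	/*
	 * Step 2: kick all vCPUs out of guest mode; before re-entering
	 * the guest they will block on the mutex in
	 * kvm_vcpu_update_apicv().
	 */
	kvm_make_all_cpus_request(kvm, KVM_REQ_APICV_UPDATE);

	/* Step 3: no cmpxchg loop needed, the mutex serializes us. */
	if (activate)
		kvm->arch.apicv_inhibit_reasons &= ~BIT(bit);
	else
		kvm->arch.apicv_inhibit_reasons |= BIT(bit);

	/* Step 4: update the APIC access page memslot. */
	static_call(kvm_x86_pre_update_apicv_exec_ctrl)(kvm, activate);

	/* Step 5: only now can the vCPUs re-enter guest mode. */
	mutex_unlock(&kvm->arch.apicv_update_lock);
}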
 
 
I think it might be possible to avoid the mutex, but I am not sure if this is worth it:
 
First of all, the above sync sequence is only really needed when we enable AVIC.

(Because from the moment we enable the memslot to the moment the vCPU enables AVIC,
it must not be in guest mode, as otherwise it will access the dummy page in the memslot
without a VMexit, instead of going through AVIC acceleration/VMexits.)
 
The other way around is OK: if we disable the memslot and a vCPU still has AVIC enabled,
it will just get a page fault, which will be correctly emulated as an APIC read/write by
the MMIO page fault handling.
 
If I had a guarantee that enabling the memslot (either as it is done today, or
using kvm_zap_gfn_range, which I strongly prefer) would always raise a TLB
flush request on all vCPUs, then I could (ab)use that request to update the local
AVIC state.
 
Or I could just always check whether the local AVIC state matches the memslot,
and update it prior to guest mode entry if it doesn't.
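
E.g., hypothetically, early in vcpu_enter_guest():

	/* sketch: sync the local AVIC state with the memslot before entry */
	if (vcpu->arch.apicv_active != kvm_apicv_activated(vcpu->kvm))
		kvm_vcpu_update_apicv(vcpu);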
 
I still think I would prefer a mutex to be 100% sure that there are no races,
since the whole AVIC disablement isn't something that is done often anyway.
 
Best regards,
	Maxim Levitsky

> 
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index ed4d1581d502..ba5d5d9ebc64 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -943,6 +943,7 @@ static struct kvm *kvm_create_vm(unsigned long type)
> >   	mutex_init(&kvm->irq_lock);
> >   	mutex_init(&kvm->slots_lock);
> >   	mutex_init(&kvm->slots_arch_lock);
> > +	mutex_init(&kvm->apicv_update_lock);
> >   	INIT_LIST_HEAD(&kvm->devices);
> >   
> >   	BUILD_BUG_ON(KVM_MEM_SLOTS_NUM > SHRT_MAX);
> > 
> 
> Please add comments above fields that are protected by this lock 
> (anything but apicv_inhibit_reasons?), and especially move it to kvm->arch.
I agree, I will do this.
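
Something along these lines, I assume:

	/*
	 * Protects apicv_inhibit_reasons and, indirectly, the APIC
	 * access page memslot; held for the whole duration of APICv
	 * (de)activation.
	 */
	struct mutex apicv_update_lock;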


> 
> Paolo



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2 8/8] KVM: x86: hyper-v: Deactivate APICv only when AutoEOI feature is in use
  2021-07-27 13:05           ` Maxim Levitsky
@ 2021-07-27 17:48             ` Ben Gardon
  2021-07-27 18:17               ` Sean Christopherson
  0 siblings, 1 reply; 32+ messages in thread
From: Ben Gardon @ 2021-07-27 17:48 UTC (permalink / raw)
  To: Maxim Levitsky
  Cc: Sean Christopherson, kvm,
	open list:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	Jim Mattson, Joerg Roedel, Borislav Petkov, Vitaly Kuznetsov,
	Wanpeng Li, Paolo Bonzini, Thomas Gleixner, H. Peter Anvin,
	Ingo Molnar, maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT)

On Tue, Jul 27, 2021 at 6:06 AM Maxim Levitsky <mlevitsk@redhat.com> wrote:
>
> On Thu, 2021-07-22 at 19:06 +0000, Sean Christopherson wrote:
> > +Ben
> >
> > On Thu, Jul 22, 2021, Maxim Levitsky wrote:
> > > On Mon, 2021-07-19 at 18:49 +0000, Sean Christopherson wrote:
> > > > On Sun, Jul 18, 2021, Maxim Levitsky wrote:
> > > > > I am more inclined to fix this by just tracking if we hold the srcu
> > > > > lock on each VCPU manually, just as we track the srcu index anyway,
> > > > > and then kvm_request_apicv_update can use this to drop the srcu
> > > > > lock when needed.
> > > >
> > > > The entire approach of dynamically adding/removing the memslot seems doomed to
> > > > failure, and is likely responsible for the performance issues with AVIC, e.g. a
> > > > single vCPU temporarily inhibiting AVIC will zap all SPTEs _twice_; on disable
> > > > and again on re-enable.
> > > >
> > > > Rather than pile on more gunk, what about special casing the APIC access page
> > > > memslot in try_async_pf()?  E.g. zap the GFN in avic_update_access_page() when
> > > > disabling (and bounce through kvm_{inc,dec}_notifier_count()), and have the page
> > > > fault path skip directly to MMIO emulation without caching the MMIO info.  It'd
> > > > also give us a good excuse to rename try_async_pf() :-)
> > > >
> > > > If lack of MMIO caching is a performance problem, an alternative solution would
> > > > be to allow caching but add a helper to zap the MMIO SPTE and request all vCPUs to
> > > > clear their cache.
> > > >
> > > > It's all a bit gross, especially hijacking the mmu_notifier path, but IMO it'd be
> > > > less awful than the current memslot+SRCU mess.
> > >
> > > Hi!
> > >
> > > I am testing your approach and it actually works very well! I can't seem to break it.
> > >
> > > Could you explain why I need to do something with kvm_{inc,dec}_notifier_count()?
> >
> > Glad you asked, there's one more change needed.  kvm_zap_gfn_range() currently
> > takes mmu_lock for read, but it needs to take mmu_lock for write for this case
> > (more way below).
> >
> > The existing users, update_mtrr() and kvm_post_set_cr0(), are a bit sketchy.  The
> > whole thing is a grey area because KVM is trying to ensure it honors the guest's
> > UC memtype for non-coherent DMA, but the inputs (CR0 and MTRRs) are per-vCPU,
> > i.e. for it to work correctly, the guest has to ensure all running vCPUs do the
> > same transition.  So in practice there's likely no observable bug, but it also
> > means that taking mmu_lock for read is likely pointless, because for things to
> > work the guest has to serialize all running vCPUs.
> >
> > Ben, any objection to taking mmu_lock for write in kvm_zap_gfn_range()?  It would
> > effectively revert commit 6103bc074048 ("KVM: x86/mmu: Allow zap gfn range to
> > operate under the mmu read lock"); see attached patch.  And we could even bump
> > the notifier count in that helper, e.g. on top of the attached:
> >
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index b607e8763aa2..7174058e982b 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -5568,6 +5568,8 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
> >
> >         write_lock(&kvm->mmu_lock);
> >
> > +       kvm_inc_notifier_count(kvm, gfn_start, gfn_end);
> > +
> >         if (kvm_memslots_have_rmaps(kvm)) {
> >                 for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
> >                         slots = __kvm_memslots(kvm, i);
> > @@ -5598,6 +5600,8 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
> >         if (flush)
> >                 kvm_flush_remote_tlbs_with_address(kvm, gfn_start, gfn_end);
> >
> > +       kvm_dec_notifier_count(kvm, gfn_start, gfn_end);
> > +
> >         write_unlock(&kvm->mmu_lock);
> >  }
> >
>
> I understand what you mean now. I thought that I needed to change the code of
> kvm_inc_notifier_count/kvm_dec_notifier_count.
>
>
>
>
> >
> >
> >
> > Back to Maxim's original question...
> >
> > Elevating mmu_notifier_count and bumping mmu_notifier_seq will handle the case
> > where APICv is being disabled while a different vCPU is concurrently faulting in a
> > new mapping for the APIC page.  E.g. it handles this race:
> >
> >  vCPU0                                 vCPU1
> >                                        apic_access_memslot_enabled = true;
> >                                      #NPF on APIC
> >                                      apic_access_memslot_enabled==true, proceed with #NPF
> >  apic_access_memslot_enabled = false
> >  kvm_zap_gfn_range(APIC);
> >                                        __direct_map(APIC)
> >
> >  mov [APIC], 0 <-- succeeds, but KVM wants to intercept to emulate
>
> I understand this now. I guess this can't happen with the original memslot disable,
> which I guess has the needed locking and flushing to avoid this.
> (I didn't study the code in depth though.)
>
> >
> >
> >
> > The elevated mmu_notifier_count and/or changed mmu_notifier_seq will cause vCPU1
> > to bail and resume the guest without fixing the #NPF.  After acquiring mmu_lock,
> > vCPU1 will see the elevated mmu_notifier_count (if kvm_zap_gfn_range() is about
> > to be called, or just finished) and/or a modified mmu_notifier_seq (after the
> > count was decremented).
> >
> > This is why kvm_zap_gfn_range() needs to take mmu_lock for write.  If it's allowed
> > to run in parallel with the page fault handler, there's no guarantee that the
> > correct apic_access_memslot_enabled will be observed.
>
> I understand now.
>
> So, Paolo, Ben Gardon, what do you think? Do you think this approach is feasible?
> Do you agree to revert the usage of the read lock?
>
> I will post a new series using this approach very soon, since I already have
> most of the code done.
>
> Best regards,
>         Maxim Levitsky

From reading through this thread, it seems like switching from read
lock to write lock is only necessary for a small range of GFNs (i.e.
the APIC access page), is that correct?
My initial reaction was that switching kvm_zap_gfn_range back to the
write lock would be terrible for performance, but given its only two
callers, I think it would actually be fine.
If you do that though, you should pass shared=false to
kvm_tdp_mmu_zap_gfn_range in that function, so that it knows it's
operating with exclusive access to the MMU lock.
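
I.e. the call site in kvm_zap_gfn_range would become something like this
(argument order from memory, so treat it as a sketch):

	flush = kvm_tdp_mmu_zap_gfn_range(kvm, i, gfn_start, gfn_end,
					  flush, false /* shared */);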

>
> >
> >       if (is_tdp_mmu_fault)
> >               read_lock(&vcpu->kvm->mmu_lock);
> >       else
> >               write_lock(&vcpu->kvm->mmu_lock);
> >
> >       if (!is_noslot_pfn(pfn) && mmu_notifier_retry_hva(vcpu->kvm, mmu_seq, hva)) <--- look here!
> >               goto out_unlock;
> >
> >       if (is_tdp_mmu_fault)
> >               r = kvm_tdp_mmu_map(vcpu, gpa, error_code, map_writable, max_level,
> >                                   pfn, prefault);
> >       else
> >               r = __direct_map(vcpu, gpa, error_code, map_writable, max_level, pfn,
> >                                prefault, is_tdp);
>
>

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2 8/8] KVM: x86: hyper-v: Deactivate APICv only when AutoEOI feature is in use
  2021-07-27 17:48             ` Ben Gardon
@ 2021-07-27 18:17               ` Sean Christopherson
  2021-07-29 14:10                 ` Maxim Levitsky
  0 siblings, 1 reply; 32+ messages in thread
From: Sean Christopherson @ 2021-07-27 18:17 UTC (permalink / raw)
  To: Ben Gardon
  Cc: Maxim Levitsky, kvm,
	open list:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	Jim Mattson, Joerg Roedel, Borislav Petkov, Vitaly Kuznetsov,
	Wanpeng Li, Paolo Bonzini, Thomas Gleixner, H. Peter Anvin,
	Ingo Molnar, maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT)

On Tue, Jul 27, 2021, Ben Gardon wrote:
> On Tue, Jul 27, 2021 at 6:06 AM Maxim Levitsky <mlevitsk@redhat.com> wrote:
> >
> > On Thu, 2021-07-22 at 19:06 +0000, Sean Christopherson wrote:
> > > The elevated mmu_notifier_count and/or changed mmu_notifier_seq will cause vCPU1
> > > to bail and resume the guest without fixing the #NPF.  After acquiring mmu_lock,
> > > vCPU1 will see the elevated mmu_notifier_count (if kvm_zap_gfn_range() is about
> > > to be called, or just finished) and/or a modified mmu_notifier_seq (after the
> > > count was decremented).
> > >
> > > This is why kvm_zap_gfn_range() needs to take mmu_lock for write.  If it's allowed
> > > to run in parallel with the page fault handler, there's no guarantee that the
> > > correct apic_access_memslot_enabled will be observed.
> >
> > I understand now.
> >
> > So, Paolo, Ben Gardon, what do you think? Do you think this approach is feasible?
> > Do you agree to revert the usage of the read lock?
> >
> > I will post a new series using this approach very soon, since I already have
> > most of the code done.
> >
> > Best regards,
> >         Maxim Levitsky
> 
> From reading through this thread, it seems like switching from read
> lock to write lock is only necessary for a small range of GFNs (i.e.
> the APIC access page), is that correct?

For the APICv case, yes, literally a single GFN (the default APIC base).

> My initial reaction was that switching kvm_zap_gfn_range back to the
> write lock would be terrible for performance, but given its only two
> callers, I think it would actually be fine.

And more importantly, the two callers are gated by kvm_arch_has_noncoherent_dma()
and are very rare flows for the guest (updating MTRRs, toggling CR0.CD).
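
I.e. both call sites are of the rough form (sketch):

	if (kvm_arch_has_noncoherent_dma(vcpu->kvm))
		kvm_zap_gfn_range(vcpu->kvm, gfn_start, gfn_end);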

> If you do that though, you should pass shared=false to
> kvm_tdp_mmu_zap_gfn_range in that function, so that it knows it's
> operating with exclusive access to the MMU lock.

Ya, my suggested revert was to drop @shared entirely since kvm_zap_gfn_range() is
the only caller that passes @shared=true.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [PATCH v2 8/8] KVM: x86: hyper-v: Deactivate APICv only when AutoEOI feature is in use
  2021-07-27 18:17               ` Sean Christopherson
@ 2021-07-29 14:10                 ` Maxim Levitsky
  0 siblings, 0 replies; 32+ messages in thread
From: Maxim Levitsky @ 2021-07-29 14:10 UTC (permalink / raw)
  To: Sean Christopherson, Ben Gardon
  Cc: kvm, open list:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	Jim Mattson, Joerg Roedel, Borislav Petkov, Vitaly Kuznetsov,
	Wanpeng Li, Paolo Bonzini, Thomas Gleixner, H. Peter Anvin,
	Ingo Molnar, maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT)

On Tue, 2021-07-27 at 18:17 +0000, Sean Christopherson wrote:
> On Tue, Jul 27, 2021, Ben Gardon wrote:
> > On Tue, Jul 27, 2021 at 6:06 AM Maxim Levitsky <mlevitsk@redhat.com> wrote:
> > > On Thu, 2021-07-22 at 19:06 +0000, Sean Christopherson wrote:
> > > > The elevated mmu_notifier_count and/or changed mmu_notifier_seq will cause vCPU1
> > > > to bail and resume the guest without fixing the #NPF.  After acquiring mmu_lock,
> > > > vCPU1 will see the elevated mmu_notifier_count (if kvm_zap_gfn_range() is about
> > > > to be called, or just finished) and/or a modified mmu_notifier_seq (after the
> > > > count was decremented).
> > > > 
> > > > This is why kvm_zap_gfn_range() needs to take mmu_lock for write.  If it's allowed
> > > > to run in parallel with the page fault handler, there's no guarantee that the
> > > > correct apic_access_memslot_enabled will be observed.
> > > 
> > > I understand now.
> > > 
> > > So, Paolo, Ben Gardon, what do you think? Do you think this approach is feasible?
> > > Do you agree to revert the usage of the read lock?
> > >
> > > I will post a new series using this approach very soon, since I already have
> > > most of the code done.
> > > 
> > > Best regards,
> > >         Maxim Levitsky
> > 
> > From reading through this thread, it seems like switching from read
> > lock to write lock is only necessary for a small range of GFNs (i.e.
> > the APIC access page), is that correct?
> 
> For the APICv case, yes, literally a single GFN (the default APIC base).
> 
> > My initial reaction was that switching kvm_zap_gfn_range back to the
> > write lock would be terrible for performance, but given its only two
> > callers, I think it would actually be fine.
> 
> And more importantly, the two callers are gated by kvm_arch_has_noncoherent_dma()
> and are very rare flows for the guest (updating MTRRs, toggling CR0.CD).
> 
> > If you do that though, you should pass shared=false to
> > kvm_tdp_mmu_zap_gfn_range in that function, so that it knows it's
> > operating with exclusive access to the MMU lock.
> 
> Ya, my suggested revert was to drop @shared entirely since kvm_zap_gfn_range() is
> the only caller that passes @shared=true.
> 


Just one question:
Should I submit the patches for the MMU changes that you described,
with my AVIC patches on top of them?

Should I worry about the new TDP MMU?

Best regards,
	Maxim Levitsky


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: KVM's support for non default APIC base
  2021-07-22  9:12       ` KVM's support for non default APIC base Maxim Levitsky
@ 2021-08-02  9:20         ` Maxim Levitsky
  2021-08-06 21:55         ` Sean Christopherson
  1 sibling, 0 replies; 32+ messages in thread
From: Maxim Levitsky @ 2021-08-02  9:20 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: kvm, open list:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	Jim Mattson, Joerg Roedel, Borislav Petkov, Vitaly Kuznetsov,
	Wanpeng Li, Paolo Bonzini, Thomas Gleixner, H. Peter Anvin,
	Ingo Molnar, maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT)

On Thu, 2021-07-22 at 12:12 +0300, Maxim Levitsky wrote:
> On Mon, 2021-07-19 at 18:49 +0000, Sean Christopherson wrote:
> > On Sun, Jul 18, 2021, Maxim Levitsky wrote:
> > > I am more inclined to fix this by just tracking if we hold the srcu
> > > lock on each VCPU manually, just as we track the srcu index anyway,
> > > and then kvm_request_apicv_update can use this to drop the srcu
> > > lock when needed.
> > 
> > The entire approach of dynamically adding/removing the memslot seems doomed to
> > failure, and is likely responsible for the performance issues with AVIC, e.g. a
> > single vCPU temporarily inhibiting AVIC will zap all SPTEs _twice_; on disable
> > and again on re-enable.
> > 
> > Rather than pile on more gunk, what about special casing the APIC access page
> > memslot in try_async_pf()?  E.g. zap the GFN in avic_update_access_page() when
> > disabling (and bounce through kvm_{inc,dec}_notifier_count()), and have the page
> > fault path skip directly to MMIO emulation without caching the MMIO info.  It'd
> > also give us a good excuse to rename try_async_pf() :-)
> > 
> > If lack of MMIO caching is a performance problem, an alternative solution would
> > be to allow caching but add a helper to zap the MMIO SPTE and request all vCPUs to
> > clear their cache.
> > 
> > It's all a bit gross, especially hijacking the mmu_notifier path, but IMO it'd be
> > less awful than the current memslot+SRCU mess
> 
> Hi Sean, Paolo and everyone else:
> 
> I am exploring the approach that you proposed and I noticed that we have very inconsistent
> code that handles the APIC base relocation for the in-kernel local APIC.
> I do know that APIC base relocation is not supported, and I don't have anything against
> this as long as VMs don't use that feature, but I do want this to be consistent.
> 
> I did a study of the code that is involved in this mess and I would like to hear your opinion:
> 
> There are basically 3 modes of operation of the in-kernel local APIC:
> 
> Regular unaccelerated local apic:
> 
> -> The APIC MMIO base address is stored at 'apic->base_address', and updated in
>    kvm_lapic_set_base, which is called from the MSR write handler of 'MSR_IA32_APICBASE'
>    (by both SVM and VMX).
>    The value is even migrated.
> 
> -> APIC MMIO read/write is done from the MMU, when we access the MMIO page:
> 	vcpu_mmio_write always calls apic_mmio_write, which checks if the write is in
> 	the apic->base_address page and, if so, forwards the write to the local APIC with the offset
> 	in this page.
> 
> -> APIC MMIO area has to be MMIO for 'apic_mmio_write' to be called,
>    thus must contain no guest memslots.
>    If the guest relocates the APIC base somewhere where we have a memslot, 
>    memslot will take priority, while on real hardware, LAPIC is likely to
>    take priority.
> 
> APICv:
> 
> -> The default apic MMIO base (0xfee00000) is covered by a dummy page which is
>    allocated from qemu's process using __x86_set_memory_region.
>    
>    This is done once in alloc_apic_access_page which is called on vcpu creation,
>    (but only once when the memslot is not yet enabled)
> 
> -> to avoid pinning this page into qemu's memory, the reference to it
>    is dropped in alloc_apic_access_page.
>    (see commit c24ae0dcd3e8695efa43e71704d1fc4bc7e29e9b)
> 
> -> kvm_arch_mmu_notifier_invalidate_range -> checks if we are invalidating GPA 0xfee00000
>    and, if so, raises a KVM_REQ_APIC_PAGE_RELOAD request.
> 
> -> kvm_vcpu_reload_apic_access_page handles the KVM_REQ_APIC_PAGE_RELOAD request by calling
>    kvm_x86_ops.set_apic_access_page_addr which is only implemented on VMX
>    (vmx_set_apic_access_page_addr)
> 
> -> vmx_set_apic_access_page_addr does gfn_to_page on 0xfee00000 GPA,
>    and if the result is valid, writes the physical address of this page to APIC_ACCESS_ADDR vmcs field.
> 
>    (This is a major difference from the AVIC - AVIC's avic_vapic_bar is *GPA*, while APIC_ACCESS_ADDR
>    is host physical address which the hypervisor is supposed to map at APIC MMIO GPA using EPT)
> 
>    Note that if we have an error here, we might end up with an invalid APIC_ACCESS_ADDR field.
> 
> -> writes to the HPA of that special page (which has a GPA of 0xfee00000, and is mapped via EPT) go to
>    APICv or cause special VM exits: (EXIT_REASON_APIC_ACCESS, EXIT_REASON_APIC_WRITE)
> 
>    * EXIT_REASON_APIC_ACCESS (which is used for the older, limited 'flexpriority' mode, which in practice only emulates the TPR)
>      actually emulates the instruction to know the written value,
>      but we have a special case in vcpu_is_mmio_gpa which makes the emulation treat the access to the default
>      apic base as MMIO.
>    
>    * EXIT_REASON_APIC_WRITE is a trap VMexit which comes with full APICv, and since it also has offset
>      qualification and the value is already in the apic page, this info is just passed to kvm_lapic_reg_write
> 
> 
> -> If the APIC base is relocated, the APICv isn't aware of it, and the writes to the new APIC base
>    (assuming that we have no memslots covering it) will go through standard APIC MMIO emulation,
>    and *might* create a mess.
> 
> AVIC:
> 
> -> The default apic MMIO base (0xfee00000) 
>    is also covered by a dummy page which is allocated from qemu's process using __x86_set_memory_region 
>    in avic_update_access_page, which is also called on vcpu creation (also only once),
>    and from SVM's dynamic AVIC inhibition.
> 
> -> The reference to this page is not dropped, thus there is no KVM_REQ_APIC_PAGE_RELOAD handler.
>    I think we should do the same as we do for APICv here?
> 
> -> avic_vapic_bar is a GPA and thus contains 0xfee00000, but writes to MSR_IA32_APICBASE do update it
>    (via avic_update_vapic_bar, which is called from the MSR_IA32_APICBASE write in SVM code);
> 
>    thus if the guest relocates the APIC base to a writable memory page, AVIC would actually happen to work.
>    (opposite from the stock xAPIC handling, which only works when the APIC base is in an MMIO area.)
> 
> -> writes to the GPA in avic_vapic_bar are first checked in NPT (but the HPA written there is ignored),
>    and then either go to AVIC or cause SVM_EXIT_AVIC_UNACCELERATED_ACCESS which has offset of the write
>    in the exit_info_1
>    (there is also SVM_EXIT_AVIC_INCOMPLETE_IPI which is called on some ICR writes)
> 
> 
> As far as I know the only good reason to relocate APIC base is to access it from the real mode
> which is not something that is done these days by modern BIOSes.
> 
> I vote to make it read only (#GP on MSR_IA32_APICBASE write when non default base is set and apic enabled) 
> and remove all remains of the support for variable APIC base.
> (we already have a warning when APIC base is set to non default value)
> 
> 
> Best regards,
> 	Maxim Levitsky
> 
Ping.

Best regards,
	Maxim Levitsky


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: KVM's support for non default APIC base
  2021-07-22  9:12       ` KVM's support for non default APIC base Maxim Levitsky
  2021-08-02  9:20         ` Maxim Levitsky
@ 2021-08-06 21:55         ` Sean Christopherson
  2021-08-09  9:40           ` Maxim Levitsky
  1 sibling, 1 reply; 32+ messages in thread
From: Sean Christopherson @ 2021-08-06 21:55 UTC (permalink / raw)
  To: Maxim Levitsky
  Cc: kvm, open list:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	Jim Mattson, Joerg Roedel, Borislav Petkov, Vitaly Kuznetsov,
	Wanpeng Li, Paolo Bonzini, Thomas Gleixner, H. Peter Anvin,
	Ingo Molnar, maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT)

On Thu, Jul 22, 2021, Maxim Levitsky wrote:
> On Mon, 2021-07-19 at 18:49 +0000, Sean Christopherson wrote:
> > On Sun, Jul 18, 2021, Maxim Levitsky wrote:
> -> APIC MMIO area has to be MMIO for 'apic_mmio_write' to be called,
>    thus must contain no guest memslots.
>    If the guest relocates the APIC base somewhere where we have a memslot, 
>    memslot will take priority, while on real hardware, LAPIC is likely to
>    take priority.

Yep.  The thing that really bites us is that other vCPUs should still be able to
access the memory defined by the memslot, e.g. to make it work we'd have to run
the vCPU with a completely different MMU root.

> As far as I know the only good reason to relocate APIC base is to access it
> from the real mode which is not something that is done these days by modern
> BIOSes.
> 
> I vote to make it read only (#GP on MSR_IA32_APICBASE write when non default
> base is set and apic enabled) and remove all remains of the support for
> variable APIC base.

Making up our own behavior is almost never the right approach.  E.g. _best_ case
scenario for an unexpected #GP is the guest immediately terminates.  Worst case
scenario is the guest eats the #GP and continues on, which is basically the status
quo, except it's guaranteed to now work, whereas today's behavior can at least let
the guest function, for some definitions of "function".

I think the only viable "solution" is to exit to userspace on the guilty WRMSR.
Whether or not we can do that without breaking userspace is probably the big
question.  Fully emulating APIC base relocation would be a tremendous amount of
effort and complexity for practically zero benefit.
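
E.g. a rough sketch that reuses the existing internal-error plumbing
(the wiring through the actual WRMSR path is omitted):

	/* on an attempt to move the APIC base away from the default */
	vcpu->run->exit_reason = KVM_EXIT_INTERNAL_ERROR;
	vcpu->run->internal.suberror = KVM_INTERNAL_ERROR_EMULATION;
	vcpu->run->internal.ndata = 0;
	return 0;	/* bounce the WRMSR to userspace */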

> (we already have a warning when APIC base is set to non default value)

FWIW, that warning is worthless because it's _once(), i.e. won't help detect a
misbehaving guest unless it's the first guest to misbehave on a particular
instantiation of KVM.   _ratelimited() would improve the situation, but not
completely eliminate the possibility of a misbehaving guest going unnoticed.
Anything else isn't an option because it's obviously guest triggerable.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: KVM's support for non default APIC base
  2021-08-06 21:55         ` Sean Christopherson
@ 2021-08-09  9:40           ` Maxim Levitsky
  2021-08-09 15:57             ` Sean Christopherson
  2021-08-09 16:47             ` Jim Mattson
  0 siblings, 2 replies; 32+ messages in thread
From: Maxim Levitsky @ 2021-08-09  9:40 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: kvm, open list:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	Jim Mattson, Joerg Roedel, Borislav Petkov, Vitaly Kuznetsov,
	Wanpeng Li, Paolo Bonzini, Thomas Gleixner, H. Peter Anvin,
	Ingo Molnar, maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT)

On Fri, 2021-08-06 at 21:55 +0000, Sean Christopherson wrote:
> On Thu, Jul 22, 2021, Maxim Levitsky wrote:
> > On Mon, 2021-07-19 at 18:49 +0000, Sean Christopherson wrote:
> > > On Sun, Jul 18, 2021, Maxim Levitsky wrote:
> > -> APIC MMIO area has to be MMIO for 'apic_mmio_write' to be called,
> >    thus must contain no guest memslots.
> >    If the guest relocates the APIC base somewhere where we have a memslot, 
> >    memslot will take priority, while on real hardware, LAPIC is likely to
> >    take priority.
> 
> Yep.  The thing that really bites us is that other vCPUs should still be able to
> access the memory defined by the memslot, e.g. to make it work we'd have to run
> the vCPU with a completely different MMU root.
That is something I hadn't taken into account.
The complexity of supporting this indeed isn't worth it.

> 
> > As far as I know the only good reason to relocate APIC base is to access it
> > from the real mode which is not something that is done these days by modern
> > BIOSes.
> > 
> > I vote to make it read only (#GP on MSR_IA32_APICBASE write when non default
> > base is set and apic enabled) and remove all remains of the support for
> > variable APIC base.
> 
> Making up our own behavior is almost never the right approach.  E.g. _best_ case
> scenario for an unexpected #GP is the guest immediately terminates.  Worst case
> scenario is the guest eats the #GP and continues on, which is basically the status
> quo, except it's guaranteed to now work, whereas today's behavior can at least let
> the guest function, for some definitions of "function".

Well, at least Intel's PRM does state that APIC base relocation is not guaranteed
to work on all CPUs, so giving the guest a #GP is like telling it that the current CPU doesn't
support it. In theory, a very well behaved guest can catch the exception and
fall back to the default base.

I don't understand what you mean by 'guaranteed to now work'. If the guest
ignores this #GP and still thinks that APIC base relocation worked, it is its fault.
A well-behaved guest should never assume that an MSR write that failed with a #GP
worked.


> 
> I think the only viable "solution" is to exit to userspace on the guilty WRMSR.
> Whether or not we can do that without breaking userspace is probably the big
> question.  Fully emulating APIC base relocation would be a tremendous amount of
> effort and complexity for practically zero benefit.

I have nothing against this as well although I kind of like the #GP approach a bit more, 
and knowing that there are barely any reasons
to relocate the APIC base, and that it doesn't work well, there is a good chance
that no one does it anyway (except our kvm unit tests, but that isn't an issue).

> 
> > (we already have a warning when APIC base is set to non default value)
> 
> FWIW, that warning is worthless because it's _once(), i.e. won't help detect a
> misbehaving guest unless it's the first guest to misbehave on a particular
> instantiation of KVM.   _ratelimited() would improve the situation, but not
> completely eliminate the possibility of a misbehaving guest going unnoticed.
> Anything else isn't an option because it's obviously guest triggerable.

100% agree.

I'll say I would first make it _ratelimited() for a few KVM versions, and then,
if nobody complains, make it a KVM internal error / #GP, and remove all the leftovers
from the code that pretend that it can work.

And add a comment explaining *why*, as you explained, supporting APIC base relocation
isn't worth it.
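
Concretely, the first step would just be changing the existing warning
(assuming it is the pr_warn_once() in kvm_lapic_set_base()):

	if ((value & MSR_IA32_APICBASE_ENABLE) &&
	    apic->base_address != APIC_DEFAULT_PHYS_BASE)
		pr_warn_ratelimited("APIC base relocation is unsupported by KVM");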

Best regards,
	Maxim Levitsky

> 



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: KVM's support for non default APIC base
  2021-08-09  9:40           ` Maxim Levitsky
@ 2021-08-09 15:57             ` Sean Christopherson
  2021-08-09 16:47             ` Jim Mattson
  1 sibling, 0 replies; 32+ messages in thread
From: Sean Christopherson @ 2021-08-09 15:57 UTC (permalink / raw)
  To: Maxim Levitsky
  Cc: kvm, open list:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	Jim Mattson, Joerg Roedel, Borislav Petkov, Vitaly Kuznetsov,
	Wanpeng Li, Paolo Bonzini, Thomas Gleixner, H. Peter Anvin,
	Ingo Molnar, maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT)

On Mon, Aug 09, 2021, Maxim Levitsky wrote:
> On Fri, 2021-08-06 at 21:55 +0000, Sean Christopherson wrote:
> > Making up our own behavior is almost never the right approach.  E.g. _best_ case
> > scenario for an unexpected #GP is the guest immediately terminates.  Worst case
> > scenario is the guest eats the #GP and continues on, which is basically the status
> > quo, except it's guaranteed to now work, whereas todays behavior can at least let
> > the guest function, for some definitions of "function".
> 
> Well, at least the Intel's PRM does state that APIC base relocation is not guaranteed
> to work on all CPUs, so giving the guest a #GP is like telling it that current CPU doesn't
> support it. In theory, a very well behaving guest can catch the exception and
> fail back to the default base.
> 
> I don't understand what you mean by 'guaranteed to now work'. If the guest

Doh, typo, it should be "not", i.e. "guaranteed to not work".  As in, allowing the
unsupported WRMSR could work depending on what features KVM is using and what the
guest is doing, whereas injecting #GP is guaranteed to break the guest.

> ignores this #GP and still thinks that APIC base relocation worked, it is its fault.
> A well-behaved guest should never assume that an MSR write that failed with a #GP
> worked.
> 
> > 
> > I think the only viable "solution" is to exit to userspace on the guilty WRMSR.
> > Whether or not we can do that without breaking userspace is probably the big
> > question.  Fully emulating APIC base relocation would be a tremendous amount of
> > effort and complexity for practically zero benefit.
> 
> I have nothing against this as well although I kind of like the #GP approach
> a bit more, and knowing that there are barely any reasons to relocate the
> APIC base, and that it doesn't work well, there is a good chance that no one
> does it anyway (except our kvm unit tests, but that isn't an issue).

Injecting an exception that architecturally should not happen is simply not
acceptable.  Silently (and partially) ignoring the WRMSR isn't acceptable either,
but we can't travel back in time to fix that so we're stuck with it unless we can
change the behavior without anyone complaining.

> > > (we already have a warning when APIC base is set to non default value)
> > 
> > FWIW, that warning is worthless because it's _once(), i.e. won't help detect a
> > misbehaving guest unless it's the first guest to misbehave on a particular
> > instantiation of KVM.   _ratelimited() would improve the situation, but not
> > completely eliminate the possibility of a misbehaving guest going unnoticed.
> > Anything else isn't an option because it's obviously guest triggerable.
> 
> 100% agree.
> 
> I'll say I would first make it _ratelimited() for a few KVM versions, and then,
> if nobody complains, make it a KVM internal error / #GP, and remove all the
> leftovers from the code that pretend that it can work.

I don't see any point in temporarily making it _ratelimited(): (1) the odds of someone
running a guest that relies on APIC base relocation are very low, (2) the odds of
that someone noticing a _ratelimited() and not a _once() are even lower, and (3) the
odds of that prompting a bug report are even lower still.

> And add a comment explaining *why*, as you explained, supporting APIC base relocation
> isn't worth it.
> 
> Best regards,
> 	Maxim Levitsky
> 
> > 
> 
> 

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: KVM's support for non default APIC base
  2021-08-09  9:40           ` Maxim Levitsky
  2021-08-09 15:57             ` Sean Christopherson
@ 2021-08-09 16:47             ` Jim Mattson
  2021-08-10 20:42               ` Maxim Levitsky
  1 sibling, 1 reply; 32+ messages in thread
From: Jim Mattson @ 2021-08-09 16:47 UTC (permalink / raw)
  To: Maxim Levitsky
  Cc: Sean Christopherson, kvm,
	open list:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	Joerg Roedel, Borislav Petkov, Vitaly Kuznetsov, Wanpeng Li,
	Paolo Bonzini, Thomas Gleixner, H. Peter Anvin, Ingo Molnar,
	maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT)

On Mon, Aug 9, 2021 at 2:40 AM Maxim Levitsky <mlevitsk@redhat.com> wrote:
>
> On Fri, 2021-08-06 at 21:55 +0000, Sean Christopherson wrote:
> > On Thu, Jul 22, 2021, Maxim Levitsky wrote:
> > > On Mon, 2021-07-19 at 18:49 +0000, Sean Christopherson wrote:
> > > > On Sun, Jul 18, 2021, Maxim Levitsky wrote:
> > > -> APIC MMIO area has to be MMIO for 'apic_mmio_write' to be called,
> > >    thus must contain no guest memslots.
> > >    If the guest relocates the APIC base somewhere where we have a memslot,
> > >    memslot will take priority, while on real hardware, LAPIC is likely to
> > >    take priority.
> >
> > Yep.  The thing that really bites us is that other vCPUs should still be able to
> > access the memory defined by the memslot, e.g. to make it work we'd have to run
> > the vCPU with a completely different MMU root.
> That is something I hadn't taken into account.
> The complexity of supporting this indeed isn't worth it.
>
> >
> > > As far as I know the only good reason to relocate APIC base is to access it
> > > from the real mode which is not something that is done these days by modern
> > > BIOSes.
> > >
> > > I vote to make it read only (#GP on MSR_IA32_APICBASE write when non default
> > > base is set and apic enabled) and remove all remains of the support for
> > > variable APIC base.
> >
> > Making up our own behavior is almost never the right approach.  E.g. _best_ case
> > scenario for an unexpected #GP is the guest immediately terminates.  Worst case
> > scenario is the guest eats the #GP and continues on, which is basically the status
> > quo, except it's guaranteed to now work, whereas today's behavior can at least let
> > the guest function, for some definitions of "function".
>
> Well, at least Intel's PRM does state that APIC base relocation is not guaranteed
> to work on all CPUs, so giving the guest a #GP is like telling it that the current CPU doesn't
> support it. In theory, a very well behaved guest can catch the exception and
> fall back to the default base.
>
> I don't understand what you mean by 'guaranteed to now work'. If the guest
> ignores this #GP and still thinks that APIC base relocation worked, it is its fault.
> A well-behaved guest should never assume that an MSR write that failed with a #GP
> worked.
>
>
> >
> > I think the only viable "solution" is to exit to userspace on the guilty WRMSR.
> > Whether or not we can do that without breaking userspace is probably the big
> > question.  Fully emulating APIC base relocation would be a tremendous amount of
> > effort and complexity for practically zero benefit.
>
> I have nothing against this as well although I kind of like the #GP approach a bit more,
> and knowing that there are barely any reasons
> to relocate the APIC base, and that it doesn't work well, there is a good chance
> that no one does it anyway (except our kvm unit tests, but that isn't an issue).
>
> >
> > > (we already have a warning when APIC base is set to non default value)
> >
> > FWIW, that warning is worthless because it's _once(), i.e. won't help detect a
> > misbehaving guest unless it's the first guest to misbehave on a particular
> > instantiation of KVM.   _ratelimited() would improve the situation, but not
> > completely eliminate the possibility of a misbehaving guest going unnoticed.
> > Anything else isn't an option because it's obviously guest triggerable.
>
> 100% agree.
>
> I'll say I would first make it _ratelimited() for a few KVM versions, and then,
> if nobody complains, make it a KVM internal error / #GP, and remove all the leftovers
> from the code that pretend that it can work.

Printing things to syslog is not very helpful. Any time that kvm
violates the architectural specification, it should provide
information about the emulation error to userspace.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: KVM's support for non default APIC base
  2021-08-09 16:47             ` Jim Mattson
@ 2021-08-10 20:42               ` Maxim Levitsky
  0 siblings, 0 replies; 32+ messages in thread
From: Maxim Levitsky @ 2021-08-10 20:42 UTC (permalink / raw)
  To: Jim Mattson
  Cc: Sean Christopherson, kvm,
	open list:X86 ARCHITECTURE (32-BIT AND 64-BIT),
	Joerg Roedel, Borislav Petkov, Vitaly Kuznetsov, Wanpeng Li,
	Paolo Bonzini, Thomas Gleixner, H. Peter Anvin, Ingo Molnar,
	maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT)

On Mon, 2021-08-09 at 09:47 -0700, Jim Mattson wrote:
> On Mon, Aug 9, 2021 at 2:40 AM Maxim Levitsky <mlevitsk@redhat.com> wrote:
> > On Fri, 2021-08-06 at 21:55 +0000, Sean Christopherson wrote:
> > > On Thu, Jul 22, 2021, Maxim Levitsky wrote:
> > > > On Mon, 2021-07-19 at 18:49 +0000, Sean Christopherson wrote:
> > > > > On Sun, Jul 18, 2021, Maxim Levitsky wrote:
> > > > -> APIC MMIO area has to be MMIO for 'apic_mmio_write' to be called,
> > > >    thus must contain no guest memslots.
> > > >    If the guest relocates the APIC base somewhere where we have a memslot,
> > > >    memslot will take priority, while on real hardware, LAPIC is likely to
> > > >    take priority.
> > > 
> > > Yep.  The thing that really bites us is that other vCPUs should still be able to
> > > access the memory defined by the memslot, e.g. to make it work we'd have to run
> > > the vCPU with a completely different MMU root.
> > That is something I hadn't taken into account.
> > The complexity of supporting this indeed isn't worth it.
> > 
> > > > As far as I know the only good reason to relocate APIC base is to access it
> > > > from the real mode which is not something that is done these days by modern
> > > > BIOSes.
> > > > 
> > > > I vote to make it read only (#GP on MSR_IA32_APICBASE write when non default
> > > > base is set and apic enabled) and remove all remains of the support for
> > > > variable APIC base.
> > > 
> > > Making up our own behavior is almost never the right approach.  E.g. _best_ case
> > > scenario for an unexpected #GP is the guest immediately terminates.  Worst case
> > > scenario is the guest eats the #GP and continues on, which is basically the status
> > > quo, except it's guaranteed to now work, whereas today's behavior can at least let
> > > the guest function, for some definitions of "function".
> > 
> > Well, at least Intel's PRM does state that APIC base relocation is not guaranteed
> > to work on all CPUs, so giving the guest a #GP is like telling it that the current CPU doesn't
> > support it. In theory, a very well behaved guest can catch the exception and
> > fall back to the default base.
> > 
> > I don't understand what you mean by 'guaranteed to now work'. If the guest
> > ignores this #GP and still thinks that APIC base relocation worked, it is its fault.
> > A well-behaved guest should never assume that an MSR write that failed with a #GP
> > worked.
> > 
> > 
> > > I think the only viable "solution" is to exit to userspace on the guilty WRMSR.
> > > Whether or not we can do that without breaking userspace is probably the big
> > > question.  Fully emulating APIC base relocation would be a tremendous amount of
> > > effort and complexity for practically zero benefit.
> > 
> > I have nothing against this as well although I kind of like the #GP approach a bit more,
> > and knowing that there are barely any reasons
> > to relocate the APIC base, and that it doesn't work well, there is a good chance
> > that no one does it anyway (except our kvm unit tests, but that isn't an issue).
> > 
> > > > (we already have a warning when APIC base is set to non default value)
> > > 
> > > FWIW, that warning is worthless because it's _once(), i.e. won't help detect a
> > > misbehaving guest unless it's the first guest to misbehave on a particular
> > > instantiation of KVM.   _ratelimited() would improve the situation, but not
> > > completely eliminate the possibility of a misbehaving guest going unnoticed.
> > > Anything else isn't an option because it's obviously guest triggerable.
> > 
> > 100% agree.
> > 
> > I'll say I would first make it _ratelimited() for a few KVM versions, and then,
> > if nobody complains, make it a KVM internal error / #GP, and remove all the leftovers
> > from the code that pretend that it can work.
> 
> Printing things to syslog is not very helpful. Any time that kvm
> violates the architectural specification, it should provide
> information about the emulation error to userspace.
> 
Paolo, what do you think?

My personal opinion is that we should indeed cause a KVM internal error
on all attempts to change the APIC base.

If someone complains, then we can look at their use-case.

My view is that any half-working feature is bound to bitrot
and cause harm and confusion.

Best regards,
	Maxim Levitsky



^ permalink raw reply	[flat|nested] 32+ messages in thread

end of thread, other threads:[~2021-08-10 20:42 UTC | newest]

Thread overview: 32+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-07-13 14:20 [PATCH v2 0/8] My AVIC patch queue Maxim Levitsky
2021-07-13 14:20 ` [PATCH v2 1/8] KVM: SVM: svm_set_vintr don't warn if AVIC is active but is about to be deactivated Maxim Levitsky
2021-07-13 14:20 ` [PATCH v2 2/8] KVM: SVM: tweak warning about enabled AVIC on nested entry Maxim Levitsky
2021-07-13 14:20 ` [PATCH v2 3/8] KVM: SVM: use vmcb01 in svm_refresh_apicv_exec_ctrl Maxim Levitsky
2021-07-13 14:20 ` [PATCH v2 4/8] KVM: x86: APICv: drop immediate APICv disablement on current vCPU Maxim Levitsky
2021-07-13 14:20 ` [PATCH v2 5/8] KVM: x86: APICv: fix race in kvm_request_apicv_update on SVM Maxim Levitsky
2021-07-26 22:34   ` Paolo Bonzini
2021-07-27 13:22     ` Maxim Levitsky
2021-07-13 14:20 ` [PATCH v2 6/8] KVM: SVM: add warning for mistmatch between AVIC state and AVIC access page state Maxim Levitsky
2021-07-13 14:20 ` [PATCH v2 7/8] KVM: SVM: call avic_vcpu_load/avic_vcpu_put when enabling/disabling AVIC Maxim Levitsky
2021-07-13 14:20 ` [PATCH v2 8/8] KVM: x86: hyper-v: Deactivate APICv only when AutoEOI feature is in use Maxim Levitsky
2021-07-18 12:13   ` Maxim Levitsky
2021-07-19  7:47     ` Vitaly Kuznetsov
2021-07-19  9:00       ` Maxim Levitsky
2021-07-19  9:23         ` Vitaly Kuznetsov
2021-07-19  9:58           ` Maxim Levitsky
2021-07-19 18:49     ` Sean Christopherson
2021-07-20  9:40       ` Maxim Levitsky
2021-07-22  9:12       ` KVM's support for non default APIC base Maxim Levitsky
2021-08-02  9:20         ` Maxim Levitsky
2021-08-06 21:55         ` Sean Christopherson
2021-08-09  9:40           ` Maxim Levitsky
2021-08-09 15:57             ` Sean Christopherson
2021-08-09 16:47             ` Jim Mattson
2021-08-10 20:42               ` Maxim Levitsky
2021-07-22 17:35       ` [PATCH v2 8/8] KVM: x86: hyper-v: Deactivate APICv only when AutoEOI feature is in use Maxim Levitsky
2021-07-22 19:06         ` Sean Christopherson
2021-07-27 13:05           ` Maxim Levitsky
2021-07-27 17:48             ` Ben Gardon
2021-07-27 18:17               ` Sean Christopherson
2021-07-29 14:10                 ` Maxim Levitsky
2021-07-26 17:24 ` [PATCH v2 0/8] My AVIC patch queue Paolo Bonzini
