* [PATCH v3 00/26] KVM: x86: Halt and APICv overhaul
@ 2021-12-08  1:52 Sean Christopherson
  2021-12-08  1:52 ` [PATCH v3 01/26] KVM: fix avic_set_running for preemptable kernels Sean Christopherson
                   ` (28 more replies)
  0 siblings, 29 replies; 50+ messages in thread
From: Sean Christopherson @ 2021-12-08  1:52 UTC (permalink / raw)
  To: Paolo Bonzini, Joerg Roedel
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Suravee Suthikulpanit, kvm, iommu, linux-kernel, Maxim Levitsky

Overhaul and cleanup APIC virtualization (Posted Interrupts on Intel VMX,
AVIC on AMD SVM) to streamline things as much as possible, remove a bunch
of cruft, and document the lurking gotchas along the way.

Patch 01 is a fix from Paolo that's already been merged but hasn't made
its way to kvm/queue.  It's included here to avoid a number of conflicts.

Based on kvm/queue, commit 1cf84614b04a ("KVM: x86: Exit to ...")

v3:
 - Rebase to kvm/queue (and drop non-x86 patches as they've been queued). 
 - Redo AVIC patches, sadly the vcpu_(un)blocking() hooks need to stay.
 - Add a patch to fix a missing (documentation-only) barrier in nested
   posted interrupt delivery. [Paolo]
 - Collect reviews.

v2:
 - https://lore.kernel.org/all/20211009021236.4122790-1-seanjc@google.com/
 - Collect reviews. [Christian, David]
 - Add patch to move arm64 WFI functionality out of hooks. [Marc]
 - Add RISC-V to the fun.
 - Add all the APICv fun.

v1: https://lkml.kernel.org/r/20210925005528.1145584-1-seanjc@google.com

Paolo Bonzini (1):
  KVM: fix avic_set_running for preemptable kernels

Sean Christopherson (25):
  KVM: nVMX: Ensure vCPU honors event request if posting nested IRQ
    fails
  KVM: VMX: Clean up PI pre/post-block WARNs
  KVM: VMX: Handle PI wakeup shenanigans during vcpu_put/load
  KVM: Drop unused kvm_vcpu.pre_pcpu field
  KVM: Move x86 VMX's posted interrupt list_head to vcpu_vmx
  KVM: VMX: Move preemption timer <=> hrtimer dance to common x86
  KVM: x86: Unexport LAPIC's switch_to_{hv,sw}_timer() helpers
  KVM: x86: Remove defunct pre_block/post_block kvm_x86_ops hooks
  KVM: SVM: Signal AVIC doorbell iff vCPU is in guest mode
  KVM: SVM: Don't bother checking for "running" AVIC when kicking for
    IPIs
  KVM: SVM: Remove unnecessary APICv/AVIC update in vCPU unblocking path
  KVM: SVM: Use kvm_vcpu_is_blocking() in AVIC load to handle preemption
  KVM: SVM: Skip AVIC and IRTE updates when loading blocking vCPU
  iommu/amd: KVM: SVM: Use pCPU to infer IsRun state for IRTE
  KVM: VMX: Don't do full kick when triggering posted interrupt "fails"
  KVM: VMX: Wake vCPU when delivering posted IRQ even if vCPU == this
    vCPU
  KVM: VMX: Pass desired vector instead of bool for triggering posted
    IRQ
  KVM: VMX: Fold fallback path into triggering posted IRQ helper
  KVM: VMX: Don't do full kick when handling posted interrupt wakeup
  KVM: SVM: Drop AVIC's intermediate avic_set_running() helper
  KVM: SVM: Move svm_hardware_setup() and its helpers below svm_x86_ops
  KVM: SVM: Nullify vcpu_(un)blocking() hooks if AVIC is disabled
  KVM: x86: Skip APICv update if APICv is disabled at the module level
  KVM: x86: Drop NULL check on kvm_x86_ops.check_apicv_inhibit_reasons
  KVM: x86: Unexport __kvm_request_apicv_update()

 arch/x86/include/asm/kvm-x86-ops.h |   2 -
 arch/x86/include/asm/kvm_host.h    |  12 -
 arch/x86/kvm/hyperv.c              |   3 +
 arch/x86/kvm/lapic.c               |   2 -
 arch/x86/kvm/svm/avic.c            | 116 ++++---
 arch/x86/kvm/svm/svm.c             | 479 ++++++++++++++---------------
 arch/x86/kvm/svm/svm.h             |  16 +-
 arch/x86/kvm/vmx/posted_intr.c     | 234 +++++++-------
 arch/x86/kvm/vmx/posted_intr.h     |   8 +-
 arch/x86/kvm/vmx/vmx.c             |  66 ++--
 arch/x86/kvm/vmx/vmx.h             |   3 +
 arch/x86/kvm/x86.c                 |  41 ++-
 drivers/iommu/amd/iommu.c          |   6 +-
 include/linux/amd-iommu.h          |   6 +-
 include/linux/kvm_host.h           |   3 -
 virt/kvm/kvm_main.c                |   3 -
 16 files changed, 510 insertions(+), 490 deletions(-)

-- 
2.34.1.400.ga245620fadb-goog


^ permalink raw reply	[flat|nested] 50+ messages in thread

* [PATCH v3 01/26] KVM: fix avic_set_running for preemptable kernels
  2021-12-08  1:52 [PATCH v3 00/26] KVM: x86: Halt and APICv overhaul Sean Christopherson
@ 2021-12-08  1:52 ` Sean Christopherson
  2021-12-08  1:52 ` [PATCH v3 02/26] KVM: nVMX: Ensure vCPU honors event request if posting nested IRQ fails Sean Christopherson
                   ` (27 subsequent siblings)
  28 siblings, 0 replies; 50+ messages in thread
From: Sean Christopherson @ 2021-12-08  1:52 UTC (permalink / raw)
  To: Paolo Bonzini, Joerg Roedel
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Suravee Suthikulpanit, kvm, iommu, linux-kernel, Maxim Levitsky

From: Paolo Bonzini <pbonzini@redhat.com>

avic_set_running() passes the current CPU to avic_vcpu_load(), albeit
via vcpu->cpu rather than smp_processor_id().  If the thread is migrated
while avic_set_running runs, the call to avic_vcpu_load() can use a stale
value for the processor id.  Avoid this by blocking preemption over the
entire execution of avic_set_running().
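
As an illustration of the pattern the fix relies on (a sketch with a
hypothetical helper, not part of the patch): get_cpu() returns the
current processor id with preemption disabled, so the id cannot be
invalidated by migration before the matching put_cpu().

  static void use_stable_cpu_sketch(struct kvm_vcpu *vcpu)
  {
  	int cpu = get_cpu();	/* disables preemption */

  	/* @cpu cannot go stale here, migration is impossible. */
  	WARN_ON(cpu != vcpu->cpu);

  	put_cpu();		/* re-enables preemption */
  }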

Reported-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
Not-signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/svm/avic.c | 16 +++++++++-------
 1 file changed, 9 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
index 1d8ada5b01f8..d7132a4551a2 100644
--- a/arch/x86/kvm/svm/avic.c
+++ b/arch/x86/kvm/svm/avic.c
@@ -997,16 +997,18 @@ void avic_vcpu_put(struct kvm_vcpu *vcpu)
 static void avic_set_running(struct kvm_vcpu *vcpu, bool is_run)
 {
 	struct vcpu_svm *svm = to_svm(vcpu);
+	int cpu = get_cpu();
 
+	WARN_ON(cpu != vcpu->cpu);
 	svm->avic_is_running = is_run;
 
-	if (!kvm_vcpu_apicv_active(vcpu))
-		return;
-
-	if (is_run)
-		avic_vcpu_load(vcpu, vcpu->cpu);
-	else
-		avic_vcpu_put(vcpu);
+	if (kvm_vcpu_apicv_active(vcpu)) {
+		if (is_run)
+			avic_vcpu_load(vcpu, cpu);
+		else
+			avic_vcpu_put(vcpu);
+	}
+	put_cpu();
 }
 
 void svm_vcpu_blocking(struct kvm_vcpu *vcpu)
-- 
2.34.1.400.ga245620fadb-goog


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH v3 02/26] KVM: nVMX: Ensure vCPU honors event request if posting nested IRQ fails
  2021-12-08  1:52 [PATCH v3 00/26] KVM: x86: Halt and APICv overhaul Sean Christopherson
  2021-12-08  1:52 ` [PATCH v3 01/26] KVM: fix avic_set_running for preemptable kernels Sean Christopherson
@ 2021-12-08  1:52 ` Sean Christopherson
  2021-12-08  1:52 ` [PATCH v3 03/26] KVM: VMX: Clean up PI pre/post-block WARNs Sean Christopherson
                   ` (26 subsequent siblings)
  28 siblings, 0 replies; 50+ messages in thread
From: Sean Christopherson @ 2021-12-08  1:52 UTC (permalink / raw)
  To: Paolo Bonzini, Joerg Roedel
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Suravee Suthikulpanit, kvm, iommu, linux-kernel, Maxim Levitsky

Add a memory barrier between writing vcpu->requests and reading
vcpu->mode to ensure the read is ordered after the write when
(potentially) delivering an IRQ to L2 via nested posted interrupt.  If
the request were to be completed after reading vcpu->mode, it would be
possible for the target vCPU to enter the guest without posting the
interrupt and without handling the event request.

Note, the barrier is only for documentation since atomic operations are
serializing on x86.
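
Sketch of the pairing, simplified from this patch (sender) and from
vcpu_enter_guest() in arch/x86/kvm/x86.c (receiver):

  /* Sender: deliver the nested posted IRQ. */
  kvm_make_request(KVM_REQ_EVENT, vcpu);	/* store to vcpu->requests */
  smp_mb__after_atomic();			/* order store vs. load */
  if (READ_ONCE(vcpu->mode) != IN_GUEST_MODE)
  	kvm_vcpu_kick(vcpu);

  /* Receiver: guest entry. */
  vcpu->mode = IN_GUEST_MODE;		/* store to vcpu->mode */
  smp_mb__after_srcu_read_unlock();	/* order store vs. load */
  if (kvm_request_pending(vcpu))		/* load from vcpu->requests */
  	/* abort the entry and service the request */;

With both barriers in place, at least one side is guaranteed to observe
the other side's store.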

Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Fixes: 6b6977117f50 ("KVM: nVMX: Fix races when sending nested PI while dest enters/leaves L2")
Fixes: 705699a13994 ("KVM: nVMX: Enable nested posted interrupt processing")
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/vmx/vmx.c | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index efcc5a58abbc..a94f0fb80fd4 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -3941,6 +3941,19 @@ static int vmx_deliver_nested_posted_interrupt(struct kvm_vcpu *vcpu,
 		 */
 		vmx->nested.pi_pending = true;
 		kvm_make_request(KVM_REQ_EVENT, vcpu);
+
+		/*
+		 * This pairs with the smp_mb_*() after setting vcpu->mode in
+		 * vcpu_enter_guest() to guarantee the vCPU sees the event
+		 * request if triggering a posted interrupt "fails" because
+		 * vcpu->mode != IN_GUEST_MODE.  The extra barrier is needed as
+		 * the smp_wmb() in kvm_make_request() only ensures everything
+		 * done before making the request is visible when the request
+		 * is visible, it doesn't ensure ordering between the store to
+		 * vcpu->requests and the load from vcpu->mode.
+		 */
+		smp_mb__after_atomic();
+
 		/* the PIR and ON have been set by L1. */
 		if (!kvm_vcpu_trigger_posted_interrupt(vcpu, true))
 			kvm_vcpu_kick(vcpu);
@@ -3974,6 +3987,12 @@ static int vmx_deliver_posted_interrupt(struct kvm_vcpu *vcpu, int vector)
 	if (pi_test_and_set_on(&vmx->pi_desc))
 		return 0;
 
+	/*
+	 * The implied barrier in pi_test_and_set_on() pairs with the smp_mb_*()
+	 * after setting vcpu->mode in vcpu_enter_guest(), thus the vCPU is
+	 * guaranteed to see PID.ON=1 and sync the PIR to IRR if triggering a
+	 * posted interrupt "fails" because vcpu->mode != IN_GUEST_MODE.
+	 */
 	if (vcpu != kvm_get_running_vcpu() &&
 	    !kvm_vcpu_trigger_posted_interrupt(vcpu, false))
 		kvm_vcpu_kick(vcpu);
-- 
2.34.1.400.ga245620fadb-goog


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH v3 03/26] KVM: VMX: Clean up PI pre/post-block WARNs
  2021-12-08  1:52 [PATCH v3 00/26] KVM: x86: Halt and APICv overhaul Sean Christopherson
  2021-12-08  1:52 ` [PATCH v3 01/26] KVM: fix avic_set_running for preemptable kernels Sean Christopherson
  2021-12-08  1:52 ` [PATCH v3 02/26] KVM: nVMX: Ensure vCPU honors event request if posting nested IRQ fails Sean Christopherson
@ 2021-12-08  1:52 ` Sean Christopherson
  2021-12-08  1:52 ` [PATCH v3 04/26] KVM: VMX: Handle PI wakeup shenanigans during vcpu_put/load Sean Christopherson
                   ` (25 subsequent siblings)
  28 siblings, 0 replies; 50+ messages in thread
From: Sean Christopherson @ 2021-12-08  1:52 UTC (permalink / raw)
  To: Paolo Bonzini, Joerg Roedel
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Suravee Suthikulpanit, kvm, iommu, linux-kernel, Maxim Levitsky

Move the WARN sanity checks out of the PI descriptor update loop so as
not to spam the kernel log if the condition is violated and the update
takes multiple attempts due to another writer.  This also eliminates a
few extra uops from the retry path.

Technically not checking every attempt could mean KVM will now fail to
WARN in a scenario that would have failed before, but any such failure
would be inherently racy as some other agent (CPU or device) would have
to concurrently modify the PI descriptor.

Add a helper to handle the actual write and more importantly to document
why the write may need to be retried.
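
The helper wraps the standard lock-free read/modify/write shape; as a
generic sketch (compute_update() is a hypothetical stand-in for the
callers' modifications):

  u64 old, new;

  do {
  	/* Take a fresh snapshot on every attempt. */
  	old = READ_ONCE(pi_desc->control);
  	new = compute_update(old);
  } while (cmpxchg64(&pi_desc->control, old, new) != old);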

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/vmx/posted_intr.c | 35 ++++++++++++++++++++++------------
 1 file changed, 23 insertions(+), 12 deletions(-)

diff --git a/arch/x86/kvm/vmx/posted_intr.c b/arch/x86/kvm/vmx/posted_intr.c
index 4db2b14ee7c6..88c53c521094 100644
--- a/arch/x86/kvm/vmx/posted_intr.c
+++ b/arch/x86/kvm/vmx/posted_intr.c
@@ -34,6 +34,20 @@ static inline struct pi_desc *vcpu_to_pi_desc(struct kvm_vcpu *vcpu)
 	return &(to_vmx(vcpu)->pi_desc);
 }
 
+static int pi_try_set_control(struct pi_desc *pi_desc, u64 old, u64 new)
+{
+	/*
+	 * PID.ON can be set at any time by a different vCPU or by hardware,
+	 * e.g. a device.  PID.control must be written atomically, and the
+	 * update must be retried with a fresh snapshot if an ON change causes
+	 * the cmpxchg to fail.
+	 */
+	if (cmpxchg64(&pi_desc->control, old, new) != old)
+		return -EBUSY;
+
+	return 0;
+}
+
 void vmx_vcpu_pi_load(struct kvm_vcpu *vcpu, int cpu)
 {
 	struct pi_desc *pi_desc = vcpu_to_pi_desc(vcpu);
@@ -74,8 +88,7 @@ void vmx_vcpu_pi_load(struct kvm_vcpu *vcpu, int cpu)
 
 		new.ndst = dest;
 		new.sn = 0;
-	} while (cmpxchg64(&pi_desc->control, old.control,
-			   new.control) != old.control);
+	} while (pi_try_set_control(pi_desc, old.control, new.control));
 
 after_clear_sn:
 
@@ -128,17 +141,17 @@ static void __pi_post_block(struct kvm_vcpu *vcpu)
 	if (!x2apic_mode)
 		dest = (dest << 8) & 0xFF00;
 
+	WARN(pi_desc->nv != POSTED_INTR_WAKEUP_VECTOR,
+	     "Wakeup handler not enabled while the vCPU was blocking");
+
 	do {
 		old.control = new.control = READ_ONCE(pi_desc->control);
-		WARN(old.nv != POSTED_INTR_WAKEUP_VECTOR,
-		     "Wakeup handler not enabled while the VCPU is blocked\n");
 
 		new.ndst = dest;
 
 		/* set 'NV' to 'notification vector' */
 		new.nv = POSTED_INTR_VECTOR;
-	} while (cmpxchg64(&pi_desc->control, old.control,
-			   new.control) != old.control);
+	} while (pi_try_set_control(pi_desc, old.control, new.control));
 
 	vcpu->pre_pcpu = -1;
 }
@@ -173,17 +186,15 @@ int pi_pre_block(struct kvm_vcpu *vcpu)
 		      &per_cpu(blocked_vcpu_on_cpu, vcpu->cpu));
 	spin_unlock(&per_cpu(blocked_vcpu_on_cpu_lock, vcpu->cpu));
 
+	WARN(pi_desc->sn == 1,
+	     "Posted Interrupt Suppress Notification set before blocking");
+
 	do {
 		old.control = new.control = READ_ONCE(pi_desc->control);
 
-		WARN((pi_desc->sn == 1),
-		     "Warning: SN field of posted-interrupts "
-		     "is set before blocking\n");
-
 		/* set 'NV' to 'wakeup vector' */
 		new.nv = POSTED_INTR_WAKEUP_VECTOR;
-	} while (cmpxchg64(&pi_desc->control, old.control,
-			   new.control) != old.control);
+	} while (pi_try_set_control(pi_desc, old.control, new.control));
 
 	/* We should not block the vCPU if an interrupt is posted for it.  */
 	if (pi_test_on(pi_desc))
-- 
2.34.1.400.ga245620fadb-goog


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH v3 04/26] KVM: VMX: Handle PI wakeup shenanigans during vcpu_put/load
  2021-12-08  1:52 [PATCH v3 00/26] KVM: x86: Halt and APICv overhaul Sean Christopherson
                   ` (2 preceding siblings ...)
  2021-12-08  1:52 ` [PATCH v3 03/26] KVM: VMX: Clean up PI pre/post-block WARNs Sean Christopherson
@ 2021-12-08  1:52 ` Sean Christopherson
  2021-12-08  1:52 ` [PATCH v3 05/26] KVM: Drop unused kvm_vcpu.pre_pcpu field Sean Christopherson
                   ` (24 subsequent siblings)
  28 siblings, 0 replies; 50+ messages in thread
From: Sean Christopherson @ 2021-12-08  1:52 UTC (permalink / raw)
  To: Paolo Bonzini, Joerg Roedel
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Suravee Suthikulpanit, kvm, iommu, linux-kernel, Maxim Levitsky

Move the posted interrupt pre/post_block logic into vcpu_put/load
respectively, using kvm_vcpu_is_blocking() to determine whether or
not the wakeup handler needs to be set (and unset).  This avoids updating
the PI descriptor if halt-polling is successful, reduces the number of
touchpoints for updating the descriptor, and eliminates the confusing
behavior of intentionally leaving a "stale" PI.NDST when a blocking vCPU
is scheduled back in after preemption.

The downside is that KVM will do the PID update twice if the vCPU is
preempted after prepare_to_rcuwait() but before schedule(), but that's a
rare case (and non-existent on !PREEMPT kernels).

The notable wart is the need to send a self-IPI on the wakeup vector if
an outstanding notification is pending after configuring the wakeup
vector.  Ideally, KVM would just do a kvm_vcpu_wake_up() in this case,
but the scheduler doesn't support waking a task from its preemption
notifier callback, i.e. while the task is smack dab in the middle of
being scheduled out.

Note, setting the wakeup vector before halt-polling is not necessary as
the pending IRQ will be recorded in the PIR and detected as a blocking-
breaking condition by kvm_vcpu_has_events() -> vmx_sync_pir_to_irr().
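
Condensed view of the resulting put-side policy (a sketch only; the
real logic is vmx_vcpu_pi_put() in the diff below):

  /* vCPU is being scheduled out, i.e. vcpu_put() is in progress. */
  if (kvm_vcpu_is_blocking(vcpu) && !vmx_interrupt_blocked(vcpu))
  	pi_enable_wakeup_handler(vcpu);	/* NV = wakeup vector */

  /* Can be true even when blocking, e.g. preempted mid-block. */
  if (vcpu->preempted)
  	pi_set_sn(pi_desc);		/* suppress notifications */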

Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
---
 arch/x86/kvm/vmx/posted_intr.c | 150 ++++++++++++++-------------------
 arch/x86/kvm/vmx/posted_intr.h |   8 +-
 arch/x86/kvm/vmx/vmx.c         |   5 --
 3 files changed, 70 insertions(+), 93 deletions(-)

diff --git a/arch/x86/kvm/vmx/posted_intr.c b/arch/x86/kvm/vmx/posted_intr.c
index 88c53c521094..a1776d186225 100644
--- a/arch/x86/kvm/vmx/posted_intr.c
+++ b/arch/x86/kvm/vmx/posted_intr.c
@@ -52,6 +52,7 @@ void vmx_vcpu_pi_load(struct kvm_vcpu *vcpu, int cpu)
 {
 	struct pi_desc *pi_desc = vcpu_to_pi_desc(vcpu);
 	struct pi_desc old, new;
+	unsigned long flags;
 	unsigned int dest;
 
 	/*
@@ -62,23 +63,34 @@ void vmx_vcpu_pi_load(struct kvm_vcpu *vcpu, int cpu)
 	if (!enable_apicv || !lapic_in_kernel(vcpu))
 		return;
 
-	/* Nothing to do if PI.SN and PI.NDST both have the desired value. */
-	if (!pi_test_sn(pi_desc) && vcpu->cpu == cpu)
+	/*
+	 * If the vCPU wasn't on the wakeup list and wasn't migrated, then the
+	 * full update can be skipped as neither the vector nor the destination
+	 * needs to be changed.
+	 */
+	if (pi_desc->nv != POSTED_INTR_WAKEUP_VECTOR && vcpu->cpu == cpu) {
+		/*
+		 * Clear SN if it was set due to being preempted.  Again, do
+		 * this even if there is no assigned device for simplicity.
+		 */
+		if (pi_test_and_clear_sn(pi_desc))
+			goto after_clear_sn;
 		return;
+	}
+
+	local_irq_save(flags);
 
 	/*
-	 * If the 'nv' field is POSTED_INTR_WAKEUP_VECTOR, do not change
-	 * PI.NDST: pi_post_block is the one expected to change PID.NDST and the
-	 * wakeup handler expects the vCPU to be on the blocked_vcpu_list that
-	 * matches PI.NDST. Otherwise, a vcpu may not be able to be woken up
-	 * correctly.
+	 * If the vCPU was waiting for wakeup, remove the vCPU from the wakeup
+	 * list of the _previous_ pCPU, which will not be the same as the
+	 * current pCPU if the task was migrated.
 	 */
-	if (pi_desc->nv == POSTED_INTR_WAKEUP_VECTOR || vcpu->cpu == cpu) {
-		pi_clear_sn(pi_desc);
-		goto after_clear_sn;
+	if (pi_desc->nv == POSTED_INTR_WAKEUP_VECTOR) {
+		spin_lock(&per_cpu(blocked_vcpu_on_cpu_lock, vcpu->cpu));
+		list_del(&vcpu->blocked_vcpu_list);
+		spin_unlock(&per_cpu(blocked_vcpu_on_cpu_lock, vcpu->cpu));
 	}
 
-	/* The full case.  Set the new destination and clear SN. */
 	dest = cpu_physical_id(cpu);
 	if (!x2apic_mode)
 		dest = (dest << 8) & 0xFF00;
@@ -86,10 +98,22 @@ void vmx_vcpu_pi_load(struct kvm_vcpu *vcpu, int cpu)
 	do {
 		old.control = new.control = READ_ONCE(pi_desc->control);
 
+		/*
+		 * Clear SN (as above) and refresh the destination APIC ID to
+		 * handle task migration (@cpu != vcpu->cpu).
+		 */
 		new.ndst = dest;
 		new.sn = 0;
+
+		/*
+		 * Restore the notification vector; in the blocking case, the
+		 * descriptor was modified on "put" to use the wakeup vector.
+		 */
+		new.nv = POSTED_INTR_VECTOR;
 	} while (pi_try_set_control(pi_desc, old.control, new.control));
 
+	local_irq_restore(flags);
+
 after_clear_sn:
 
 	/*
@@ -111,83 +135,24 @@ static bool vmx_can_use_vtd_pi(struct kvm *kvm)
 		irq_remapping_cap(IRQ_POSTING_CAP);
 }
 
-void vmx_vcpu_pi_put(struct kvm_vcpu *vcpu)
-{
-	struct pi_desc *pi_desc = vcpu_to_pi_desc(vcpu);
-
-	if (!vmx_can_use_vtd_pi(vcpu->kvm))
-		return;
-
-	/* Set SN when the vCPU is preempted */
-	if (vcpu->preempted)
-		pi_set_sn(pi_desc);
-}
-
-static void __pi_post_block(struct kvm_vcpu *vcpu)
-{
-	struct pi_desc *pi_desc = vcpu_to_pi_desc(vcpu);
-	struct pi_desc old, new;
-	unsigned int dest;
-
-	/*
-	 * Remove the vCPU from the wakeup list of the _previous_ pCPU, which
-	 * will not be the same as the current pCPU if the task was migrated.
-	 */
-	spin_lock(&per_cpu(blocked_vcpu_on_cpu_lock, vcpu->pre_pcpu));
-	list_del(&vcpu->blocked_vcpu_list);
-	spin_unlock(&per_cpu(blocked_vcpu_on_cpu_lock, vcpu->pre_pcpu));
-
-	dest = cpu_physical_id(vcpu->cpu);
-	if (!x2apic_mode)
-		dest = (dest << 8) & 0xFF00;
-
-	WARN(pi_desc->nv != POSTED_INTR_WAKEUP_VECTOR,
-	     "Wakeup handler not enabled while the vCPU was blocking");
-
-	do {
-		old.control = new.control = READ_ONCE(pi_desc->control);
-
-		new.ndst = dest;
-
-		/* set 'NV' to 'notification vector' */
-		new.nv = POSTED_INTR_VECTOR;
-	} while (pi_try_set_control(pi_desc, old.control, new.control));
-
-	vcpu->pre_pcpu = -1;
-}
-
 /*
- * This routine does the following things for vCPU which is going
- * to be blocked if VT-d PI is enabled.
- * - Store the vCPU to the wakeup list, so when interrupts happen
- *   we can find the right vCPU to wake up.
- * - Change the Posted-interrupt descriptor as below:
- *      'NV' <-- POSTED_INTR_WAKEUP_VECTOR
- * - If 'ON' is set during this process, which means at least one
- *   interrupt is posted for this vCPU, we cannot block it, in
- *   this case, return 1, otherwise, return 0.
- *
+ * Put the vCPU on this pCPU's list of vCPUs that needs to be awakened and set
+ * WAKEUP as the notification vector in the PI descriptor.
  */
-int pi_pre_block(struct kvm_vcpu *vcpu)
+static void pi_enable_wakeup_handler(struct kvm_vcpu *vcpu)
 {
-	struct pi_desc old, new;
 	struct pi_desc *pi_desc = vcpu_to_pi_desc(vcpu);
+	struct pi_desc old, new;
 	unsigned long flags;
 
-	if (!vmx_can_use_vtd_pi(vcpu->kvm) ||
-	    vmx_interrupt_blocked(vcpu))
-		return 0;
-
 	local_irq_save(flags);
 
-	vcpu->pre_pcpu = vcpu->cpu;
 	spin_lock(&per_cpu(blocked_vcpu_on_cpu_lock, vcpu->cpu));
 	list_add_tail(&vcpu->blocked_vcpu_list,
 		      &per_cpu(blocked_vcpu_on_cpu, vcpu->cpu));
 	spin_unlock(&per_cpu(blocked_vcpu_on_cpu_lock, vcpu->cpu));
 
-	WARN(pi_desc->sn == 1,
-	     "Posted Interrupt Suppress Notification set before blocking");
+	WARN(pi_desc->sn, "PI descriptor SN field set before blocking");
 
 	do {
 		old.control = new.control = READ_ONCE(pi_desc->control);
@@ -196,24 +161,37 @@ int pi_pre_block(struct kvm_vcpu *vcpu)
 		new.nv = POSTED_INTR_WAKEUP_VECTOR;
 	} while (pi_try_set_control(pi_desc, old.control, new.control));
 
-	/* We should not block the vCPU if an interrupt is posted for it.  */
-	if (pi_test_on(pi_desc))
-		__pi_post_block(vcpu);
+	/*
+	 * Send a wakeup IPI to this CPU if an interrupt may have been posted
+	 * before the notification vector was updated, in which case the IRQ
+	 * will arrive on the non-wakeup vector.  An IPI is needed as calling
+	 * try_to_wake_up() from ->sched_out() isn't allowed (IRQs are not
+	 * enabled until it is safe to call try_to_wake_up() on the task being
+	 * scheduled out).
+	 */
+	if (pi_test_on(&new))
+		apic->send_IPI_self(POSTED_INTR_WAKEUP_VECTOR);
 
 	local_irq_restore(flags);
-	return (vcpu->pre_pcpu == -1);
 }
 
-void pi_post_block(struct kvm_vcpu *vcpu)
+void vmx_vcpu_pi_put(struct kvm_vcpu *vcpu)
 {
-	unsigned long flags;
+	struct pi_desc *pi_desc = vcpu_to_pi_desc(vcpu);
 
-	if (vcpu->pre_pcpu == -1)
+	if (!vmx_can_use_vtd_pi(vcpu->kvm))
 		return;
 
-	local_irq_save(flags);
-	__pi_post_block(vcpu);
-	local_irq_restore(flags);
+	if (kvm_vcpu_is_blocking(vcpu) && !vmx_interrupt_blocked(vcpu))
+		pi_enable_wakeup_handler(vcpu);
+
+	/*
+	 * Set SN when the vCPU is preempted.  Note, the vCPU can both be seen
+	 * as blocking and preempted, e.g. if it's preempted between setting
+	 * its wait state and manually scheduling out.
+	 */
+	if (vcpu->preempted)
+		pi_set_sn(pi_desc);
 }
 
 /*
@@ -254,7 +232,7 @@ bool pi_has_pending_interrupt(struct kvm_vcpu *vcpu)
  * Bail out of the block loop if the VM has an assigned
  * device, but the blocking vCPU didn't reconfigure the
  * PI.NV to the wakeup vector, i.e. the assigned device
- * came along after the initial check in pi_pre_block().
+ * came along after the initial check in vmx_vcpu_pi_put().
  */
 void vmx_pi_start_assignment(struct kvm *kvm)
 {
diff --git a/arch/x86/kvm/vmx/posted_intr.h b/arch/x86/kvm/vmx/posted_intr.h
index 36ae035f14aa..eb14e76b84ef 100644
--- a/arch/x86/kvm/vmx/posted_intr.h
+++ b/arch/x86/kvm/vmx/posted_intr.h
@@ -40,6 +40,12 @@ static inline bool pi_test_and_clear_on(struct pi_desc *pi_desc)
 			(unsigned long *)&pi_desc->control);
 }
 
+static inline bool pi_test_and_clear_sn(struct pi_desc *pi_desc)
+{
+	return test_and_clear_bit(POSTED_INTR_SN,
+			(unsigned long *)&pi_desc->control);
+}
+
 static inline bool pi_test_and_set_pir(int vector, struct pi_desc *pi_desc)
 {
 	return test_and_set_bit(vector, (unsigned long *)pi_desc->pir);
@@ -88,8 +94,6 @@ static inline bool pi_test_sn(struct pi_desc *pi_desc)
 
 void vmx_vcpu_pi_load(struct kvm_vcpu *vcpu, int cpu);
 void vmx_vcpu_pi_put(struct kvm_vcpu *vcpu);
-int pi_pre_block(struct kvm_vcpu *vcpu);
-void pi_post_block(struct kvm_vcpu *vcpu);
 void pi_wakeup_handler(void);
 void __init pi_init_cpu(int cpu);
 bool pi_has_pending_interrupt(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index a94f0fb80fd4..0254d7f64698 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7442,9 +7442,6 @@ void vmx_update_cpu_dirty_logging(struct kvm_vcpu *vcpu)
 
 static int vmx_pre_block(struct kvm_vcpu *vcpu)
 {
-	if (pi_pre_block(vcpu))
-		return 1;
-
 	if (kvm_lapic_hv_timer_in_use(vcpu))
 		kvm_lapic_switch_to_sw_timer(vcpu);
 
@@ -7455,8 +7452,6 @@ static void vmx_post_block(struct kvm_vcpu *vcpu)
 {
 	if (kvm_x86_ops.set_hv_timer)
 		kvm_lapic_switch_to_hv_timer(vcpu);
-
-	pi_post_block(vcpu);
 }
 
 static void vmx_setup_mce(struct kvm_vcpu *vcpu)
-- 
2.34.1.400.ga245620fadb-goog


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH v3 05/26] KVM: Drop unused kvm_vcpu.pre_pcpu field
  2021-12-08  1:52 [PATCH v3 00/26] KVM: x86: Halt and APICv overhaul Sean Christopherson
                   ` (3 preceding siblings ...)
  2021-12-08  1:52 ` [PATCH v3 04/26] KVM: VMX: Handle PI wakeup shenanigans during vcpu_put/load Sean Christopherson
@ 2021-12-08  1:52 ` Sean Christopherson
  2021-12-08  1:52 ` [PATCH v3 06/26] KVM: Move x86 VMX's posted interrupt list_head to vcpu_vmx Sean Christopherson
                   ` (23 subsequent siblings)
  28 siblings, 0 replies; 50+ messages in thread
From: Sean Christopherson @ 2021-12-08  1:52 UTC (permalink / raw)
  To: Paolo Bonzini, Joerg Roedel
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Suravee Suthikulpanit, kvm, iommu, linux-kernel, Maxim Levitsky

Remove kvm_vcpu.pre_pcpu as it no longer has any users.  No functional
change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
---
 include/linux/kvm_host.h | 1 -
 virt/kvm/kvm_main.c      | 1 -
 2 files changed, 2 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index a745efe389ab..30cc1327065c 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -305,7 +305,6 @@ struct kvm_vcpu {
 	u64 requests;
 	unsigned long guest_debug;
 
-	int pre_pcpu;
 	struct list_head blocked_vcpu_list;
 
 	struct mutex mutex;
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 863112783ed9..d3e86f246e1c 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -427,7 +427,6 @@ static void kvm_vcpu_init(struct kvm_vcpu *vcpu, struct kvm *kvm, unsigned id)
 #endif
 	kvm_async_pf_vcpu_init(vcpu);
 
-	vcpu->pre_pcpu = -1;
 	INIT_LIST_HEAD(&vcpu->blocked_vcpu_list);
 
 	kvm_vcpu_set_in_spin_loop(vcpu, false);
-- 
2.34.1.400.ga245620fadb-goog


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH v3 06/26] KVM: Move x86 VMX's posted interrupt list_head to vcpu_vmx
  2021-12-08  1:52 [PATCH v3 00/26] KVM: x86: Halt and APICv overhaul Sean Christopherson
                   ` (4 preceding siblings ...)
  2021-12-08  1:52 ` [PATCH v3 05/26] KVM: Drop unused kvm_vcpu.pre_pcpu field Sean Christopherson
@ 2021-12-08  1:52 ` Sean Christopherson
  2021-12-08  1:52 ` [PATCH v3 07/26] KVM: VMX: Move preemption timer <=> hrtimer dance to common x86 Sean Christopherson
                   ` (22 subsequent siblings)
  28 siblings, 0 replies; 50+ messages in thread
From: Sean Christopherson @ 2021-12-08  1:52 UTC (permalink / raw)
  To: Paolo Bonzini, Joerg Roedel
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Suravee Suthikulpanit, kvm, iommu, linux-kernel, Maxim Levitsky

Move the seemingly generic blocked_vcpu_list from kvm_vcpu to vcpu_vmx,
and rename the list and all associated variables to clarify that it
tracks the set of vCPUs that need to be poked on a posted interrupt to
the wakeup vector.  The list is not used to track _all_ vCPUs that are
blocking, and the term "blocked" can be misleading as it may refer to a
blocking condition in the host or the guest, whereas the PI wakeup case
is specifically for the vCPUs that are actively blocking from within
the guest.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
---
 arch/x86/kvm/vmx/posted_intr.c | 39 +++++++++++++++++-----------------
 arch/x86/kvm/vmx/vmx.c         |  2 ++
 arch/x86/kvm/vmx/vmx.h         |  3 +++
 include/linux/kvm_host.h       |  2 --
 virt/kvm/kvm_main.c            |  2 --
 5 files changed, 25 insertions(+), 23 deletions(-)

diff --git a/arch/x86/kvm/vmx/posted_intr.c b/arch/x86/kvm/vmx/posted_intr.c
index a1776d186225..023a6b9b0fa4 100644
--- a/arch/x86/kvm/vmx/posted_intr.c
+++ b/arch/x86/kvm/vmx/posted_intr.c
@@ -19,7 +19,7 @@
  * wake the target vCPUs.  vCPUs are removed from the list and the notification
  * vector is reset when the vCPU is scheduled in.
  */
-static DEFINE_PER_CPU(struct list_head, blocked_vcpu_on_cpu);
+static DEFINE_PER_CPU(struct list_head, wakeup_vcpus_on_cpu);
 /*
  * Protect the per-CPU list with a per-CPU spinlock to handle task migration.
  * When a blocking vCPU is awakened _and_ migrated to a different pCPU, the
@@ -27,7 +27,7 @@ static DEFINE_PER_CPU(struct list_head, blocked_vcpu_on_cpu);
  * CPU.  IRQs must be disabled when taking this lock, otherwise deadlock will
  * occur if a wakeup IRQ arrives and attempts to acquire the lock.
  */
-static DEFINE_PER_CPU(spinlock_t, blocked_vcpu_on_cpu_lock);
+static DEFINE_PER_CPU(spinlock_t, wakeup_vcpus_on_cpu_lock);
 
 static inline struct pi_desc *vcpu_to_pi_desc(struct kvm_vcpu *vcpu)
 {
@@ -51,6 +51,7 @@ static int pi_try_set_control(struct pi_desc *pi_desc, u64 old, u64 new)
 void vmx_vcpu_pi_load(struct kvm_vcpu *vcpu, int cpu)
 {
 	struct pi_desc *pi_desc = vcpu_to_pi_desc(vcpu);
+	struct vcpu_vmx *vmx = to_vmx(vcpu);
 	struct pi_desc old, new;
 	unsigned long flags;
 	unsigned int dest;
@@ -86,9 +87,9 @@ void vmx_vcpu_pi_load(struct kvm_vcpu *vcpu, int cpu)
 	 * current pCPU if the task was migrated.
 	 */
 	if (pi_desc->nv == POSTED_INTR_WAKEUP_VECTOR) {
-		spin_lock(&per_cpu(blocked_vcpu_on_cpu_lock, vcpu->cpu));
-		list_del(&vcpu->blocked_vcpu_list);
-		spin_unlock(&per_cpu(blocked_vcpu_on_cpu_lock, vcpu->cpu));
+		spin_lock(&per_cpu(wakeup_vcpus_on_cpu_lock, vcpu->cpu));
+		list_del(&vmx->pi_wakeup_list);
+		spin_unlock(&per_cpu(wakeup_vcpus_on_cpu_lock, vcpu->cpu));
 	}
 
 	dest = cpu_physical_id(cpu);
@@ -142,15 +143,16 @@ static bool vmx_can_use_vtd_pi(struct kvm *kvm)
 static void pi_enable_wakeup_handler(struct kvm_vcpu *vcpu)
 {
 	struct pi_desc *pi_desc = vcpu_to_pi_desc(vcpu);
+	struct vcpu_vmx *vmx = to_vmx(vcpu);
 	struct pi_desc old, new;
 	unsigned long flags;
 
 	local_irq_save(flags);
 
-	spin_lock(&per_cpu(blocked_vcpu_on_cpu_lock, vcpu->cpu));
-	list_add_tail(&vcpu->blocked_vcpu_list,
-		      &per_cpu(blocked_vcpu_on_cpu, vcpu->cpu));
-	spin_unlock(&per_cpu(blocked_vcpu_on_cpu_lock, vcpu->cpu));
+	spin_lock(&per_cpu(wakeup_vcpus_on_cpu_lock, vcpu->cpu));
+	list_add_tail(&vmx->pi_wakeup_list,
+		      &per_cpu(wakeup_vcpus_on_cpu, vcpu->cpu));
+	spin_unlock(&per_cpu(wakeup_vcpus_on_cpu_lock, vcpu->cpu));
 
 	WARN(pi_desc->sn, "PI descriptor SN field set before blocking");
 
@@ -199,24 +201,23 @@ void vmx_vcpu_pi_put(struct kvm_vcpu *vcpu)
  */
 void pi_wakeup_handler(void)
 {
-	struct kvm_vcpu *vcpu;
 	int cpu = smp_processor_id();
+	struct vcpu_vmx *vmx;
 
-	spin_lock(&per_cpu(blocked_vcpu_on_cpu_lock, cpu));
-	list_for_each_entry(vcpu, &per_cpu(blocked_vcpu_on_cpu, cpu),
-			blocked_vcpu_list) {
-		struct pi_desc *pi_desc = vcpu_to_pi_desc(vcpu);
+	spin_lock(&per_cpu(wakeup_vcpus_on_cpu_lock, cpu));
+	list_for_each_entry(vmx, &per_cpu(wakeup_vcpus_on_cpu, cpu),
+			    pi_wakeup_list) {
 
-		if (pi_test_on(pi_desc))
-			kvm_vcpu_kick(vcpu);
+		if (pi_test_on(&vmx->pi_desc))
+			kvm_vcpu_kick(&vmx->vcpu);
 	}
-	spin_unlock(&per_cpu(blocked_vcpu_on_cpu_lock, cpu));
+	spin_unlock(&per_cpu(wakeup_vcpus_on_cpu_lock, cpu));
 }
 
 void __init pi_init_cpu(int cpu)
 {
-	INIT_LIST_HEAD(&per_cpu(blocked_vcpu_on_cpu, cpu));
-	spin_lock_init(&per_cpu(blocked_vcpu_on_cpu_lock, cpu));
+	INIT_LIST_HEAD(&per_cpu(wakeup_vcpus_on_cpu, cpu));
+	spin_lock_init(&per_cpu(wakeup_vcpus_on_cpu_lock, cpu));
 }
 
 bool pi_has_pending_interrupt(struct kvm_vcpu *vcpu)
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 0254d7f64698..64932cc3e4e8 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -6837,6 +6837,8 @@ static int vmx_create_vcpu(struct kvm_vcpu *vcpu)
 	BUILD_BUG_ON(offsetof(struct vcpu_vmx, vcpu) != 0);
 	vmx = to_vmx(vcpu);
 
+	INIT_LIST_HEAD(&vmx->pi_wakeup_list);
+
 	err = -ENOMEM;
 
 	vmx->vpid = allocate_vpid();
diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index f978699480e3..fc52721018ce 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -308,6 +308,9 @@ struct vcpu_vmx {
 	/* Posted interrupt descriptor */
 	struct pi_desc pi_desc;
 
+	/* Used if this vCPU is waiting for PI notification wakeup. */
+	struct list_head pi_wakeup_list;
+
 	/* Support for a guest hypervisor (nested VMX) */
 	struct nested_vmx nested;
 
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 30cc1327065c..2d17381b5210 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -305,8 +305,6 @@ struct kvm_vcpu {
 	u64 requests;
 	unsigned long guest_debug;
 
-	struct list_head blocked_vcpu_list;
-
 	struct mutex mutex;
 	struct kvm_run *run;
 
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index d3e86f246e1c..84b4af38a2ec 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -427,8 +427,6 @@ static void kvm_vcpu_init(struct kvm_vcpu *vcpu, struct kvm *kvm, unsigned id)
 #endif
 	kvm_async_pf_vcpu_init(vcpu);
 
-	INIT_LIST_HEAD(&vcpu->blocked_vcpu_list);
-
 	kvm_vcpu_set_in_spin_loop(vcpu, false);
 	kvm_vcpu_set_dy_eligible(vcpu, false);
 	vcpu->preempted = false;
-- 
2.34.1.400.ga245620fadb-goog


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH v3 07/26] KVM: VMX: Move preemption timer <=> hrtimer dance to common x86
  2021-12-08  1:52 [PATCH v3 00/26] KVM: x86: Halt and APICv overhaul Sean Christopherson
                   ` (5 preceding siblings ...)
  2021-12-08  1:52 ` [PATCH v3 06/26] KVM: Move x86 VMX's posted interrupt list_head to vcpu_vmx Sean Christopherson
@ 2021-12-08  1:52 ` Sean Christopherson
  2023-03-29 12:34   ` Tudor Ambarus
  2021-12-08  1:52 ` [PATCH v3 08/26] KVM: x86: Unexport LAPIC's switch_to_{hv,sw}_timer() helpers Sean Christopherson
                   ` (21 subsequent siblings)
  28 siblings, 1 reply; 50+ messages in thread
From: Sean Christopherson @ 2021-12-08  1:52 UTC (permalink / raw)
  To: Paolo Bonzini, Joerg Roedel
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Suravee Suthikulpanit, kvm, iommu, linux-kernel, Maxim Levitsky

Handle the switch to/from the hypervisor/software timer when a vCPU is
blocking in common x86 instead of in VMX.  Even though VMX is the only
user of a hypervisor timer, the logic and all functions involved are
generic x86 (unless future CPUs do something completely different and
implement a hypervisor timer that runs regardless of mode).

Handling the switch in common x86 will allow for the elimination of the
pre/post_block hooks, and also lets KVM switch back to the hypervisor
timer if and only if it was in use (without additional params).  Add a
comment explaining why the switch cannot be deferred to kvm_sched_out()
or kvm_vcpu_block().
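
The resulting ordering, sketched with the names used in the diff below:

  bool hv_timer = kvm_lapic_hv_timer_in_use(vcpu);

  /* Park the hypervisor timer so halt-polling can see its expiry. */
  if (hv_timer)
  	kvm_lapic_switch_to_sw_timer(vcpu);

  kvm_vcpu_halt(vcpu);		/* halt-poll, then block */

  /* Switch back iff the hypervisor timer was actually in use. */
  if (hv_timer)
  	kvm_lapic_switch_to_hv_timer(vcpu);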

Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
---
 arch/x86/kvm/vmx/vmx.c |  6 +-----
 arch/x86/kvm/x86.c     | 21 +++++++++++++++++++++
 2 files changed, 22 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 64932cc3e4e8..a5ba9a069f5d 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7444,16 +7444,12 @@ void vmx_update_cpu_dirty_logging(struct kvm_vcpu *vcpu)
 
 static int vmx_pre_block(struct kvm_vcpu *vcpu)
 {
-	if (kvm_lapic_hv_timer_in_use(vcpu))
-		kvm_lapic_switch_to_sw_timer(vcpu);
-
 	return 0;
 }
 
 static void vmx_post_block(struct kvm_vcpu *vcpu)
 {
-	if (kvm_x86_ops.set_hv_timer)
-		kvm_lapic_switch_to_hv_timer(vcpu);
+
 }
 
 static void vmx_setup_mce(struct kvm_vcpu *vcpu)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index a05a26471f19..f13a61830d9d 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -10041,8 +10041,21 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 
 static inline int vcpu_block(struct kvm *kvm, struct kvm_vcpu *vcpu)
 {
+	bool hv_timer;
+
 	if (!kvm_arch_vcpu_runnable(vcpu) &&
 	    (!kvm_x86_ops.pre_block || static_call(kvm_x86_pre_block)(vcpu) == 0)) {
+		/*
+		 * Switch to the software timer before halt-polling/blocking as
+		 * the guest's timer may be a break event for the vCPU, and the
+		 * hypervisor timer runs only when the CPU is in guest mode.
+		 * Switch before halt-polling so that KVM recognizes an expired
+		 * timer before blocking.
+		 */
+		hv_timer = kvm_lapic_hv_timer_in_use(vcpu);
+		if (hv_timer)
+			kvm_lapic_switch_to_sw_timer(vcpu);
+
 		srcu_read_unlock(&kvm->srcu, vcpu->srcu_idx);
 		if (vcpu->arch.mp_state == KVM_MP_STATE_HALTED)
 			kvm_vcpu_halt(vcpu);
@@ -10050,6 +10063,9 @@ static inline int vcpu_block(struct kvm *kvm, struct kvm_vcpu *vcpu)
 			kvm_vcpu_block(vcpu);
 		vcpu->srcu_idx = srcu_read_lock(&kvm->srcu);
 
+		if (hv_timer)
+			kvm_lapic_switch_to_hv_timer(vcpu);
+
 		if (kvm_x86_ops.post_block)
 			static_call(kvm_x86_post_block)(vcpu);
 
@@ -10244,6 +10260,11 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
 			r = -EINTR;
 			goto out;
 		}
+		/*
+		 * It should be impossible for the hypervisor timer to be in
+		 * use before KVM has ever run the vCPU.
+		 */
+		WARN_ON_ONCE(kvm_lapic_hv_timer_in_use(vcpu));
 		kvm_vcpu_block(vcpu);
 		if (kvm_apic_accept_events(vcpu) < 0) {
 			r = 0;
-- 
2.34.1.400.ga245620fadb-goog


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH v3 08/26] KVM: x86: Unexport LAPIC's switch_to_{hv,sw}_timer() helpers
  2021-12-08  1:52 [PATCH v3 00/26] KVM: x86: Halt and APICv overhaul Sean Christopherson
                   ` (6 preceding siblings ...)
  2021-12-08  1:52 ` [PATCH v3 07/26] KVM: VMX: Move preemption timer <=> hrtimer dance to common x86 Sean Christopherson
@ 2021-12-08  1:52 ` Sean Christopherson
  2021-12-08  1:52 ` [PATCH v3 09/26] KVM: x86: Remove defunct pre_block/post_block kvm_x86_ops hooks Sean Christopherson
                   ` (20 subsequent siblings)
  28 siblings, 0 replies; 50+ messages in thread
From: Sean Christopherson @ 2021-12-08  1:52 UTC (permalink / raw)
  To: Paolo Bonzini, Joerg Roedel
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Suravee Suthikulpanit, kvm, iommu, linux-kernel, Maxim Levitsky

Unexport switch_to_{hv,sw}_timer() now that common x86 handles the
transitions.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
---
 arch/x86/kvm/lapic.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 40270d7bc597..d4d875cbf05c 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -1947,7 +1947,6 @@ void kvm_lapic_switch_to_hv_timer(struct kvm_vcpu *vcpu)
 {
 	restart_apic_timer(vcpu->arch.apic);
 }
-EXPORT_SYMBOL_GPL(kvm_lapic_switch_to_hv_timer);
 
 void kvm_lapic_switch_to_sw_timer(struct kvm_vcpu *vcpu)
 {
@@ -1959,7 +1958,6 @@ void kvm_lapic_switch_to_sw_timer(struct kvm_vcpu *vcpu)
 		start_sw_timer(apic);
 	preempt_enable();
 }
-EXPORT_SYMBOL_GPL(kvm_lapic_switch_to_sw_timer);
 
 void kvm_lapic_restart_hv_timer(struct kvm_vcpu *vcpu)
 {
-- 
2.34.1.400.ga245620fadb-goog


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH v3 09/26] KVM: x86: Remove defunct pre_block/post_block kvm_x86_ops hooks
  2021-12-08  1:52 [PATCH v3 00/26] KVM: x86: Halt and APICv overhaul Sean Christopherson
                   ` (7 preceding siblings ...)
  2021-12-08  1:52 ` [PATCH v3 08/26] KVM: x86: Unexport LAPIC's switch_to_{hv,sw}_timer() helpers Sean Christopherson
@ 2021-12-08  1:52 ` Sean Christopherson
  2021-12-08  1:52 ` [PATCH v3 10/26] KVM: SVM: Signal AVIC doorbell iff vCPU is in guest mode Sean Christopherson
                   ` (19 subsequent siblings)
  28 siblings, 0 replies; 50+ messages in thread
From: Sean Christopherson @ 2021-12-08  1:52 UTC (permalink / raw)
  To: Paolo Bonzini, Joerg Roedel
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Suravee Suthikulpanit, kvm, iommu, linux-kernel, Maxim Levitsky

Drop kvm_x86_ops' pre/post_block() now that all implementations are nops.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
---
 arch/x86/include/asm/kvm-x86-ops.h |  2 --
 arch/x86/include/asm/kvm_host.h    | 12 ------------
 arch/x86/kvm/vmx/vmx.c             | 13 -------------
 arch/x86/kvm/x86.c                 |  6 +-----
 4 files changed, 1 insertion(+), 32 deletions(-)

diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
index cefe1d81e2e8..c2b007171abd 100644
--- a/arch/x86/include/asm/kvm-x86-ops.h
+++ b/arch/x86/include/asm/kvm-x86-ops.h
@@ -96,8 +96,6 @@ KVM_X86_OP(handle_exit_irqoff)
 KVM_X86_OP_NULL(request_immediate_exit)
 KVM_X86_OP(sched_in)
 KVM_X86_OP_NULL(update_cpu_dirty_logging)
-KVM_X86_OP_NULL(pre_block)
-KVM_X86_OP_NULL(post_block)
 KVM_X86_OP_NULL(vcpu_blocking)
 KVM_X86_OP_NULL(vcpu_unblocking)
 KVM_X86_OP_NULL(update_pi_irte)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index e41ad1ead721..fb62ed94e52f 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1447,18 +1447,6 @@ struct kvm_x86_ops {
 	const struct kvm_pmu_ops *pmu_ops;
 	const struct kvm_x86_nested_ops *nested_ops;
 
-	/*
-	 * Architecture specific hooks for vCPU blocking due to
-	 * HLT instruction.
-	 * Returns for .pre_block():
-	 *    - 0 means continue to block the vCPU.
-	 *    - 1 means we cannot block the vCPU since some event
-	 *        happens during this period, such as, 'ON' bit in
-	 *        posted-interrupts descriptor is set.
-	 */
-	int (*pre_block)(struct kvm_vcpu *vcpu);
-	void (*post_block)(struct kvm_vcpu *vcpu);
-
 	void (*vcpu_blocking)(struct kvm_vcpu *vcpu);
 	void (*vcpu_unblocking)(struct kvm_vcpu *vcpu);
 
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index a5ba9a069f5d..1b8804d93776 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7442,16 +7442,6 @@ void vmx_update_cpu_dirty_logging(struct kvm_vcpu *vcpu)
 		secondary_exec_controls_clearbit(vmx, SECONDARY_EXEC_ENABLE_PML);
 }
 
-static int vmx_pre_block(struct kvm_vcpu *vcpu)
-{
-	return 0;
-}
-
-static void vmx_post_block(struct kvm_vcpu *vcpu)
-{
-
-}
-
 static void vmx_setup_mce(struct kvm_vcpu *vcpu)
 {
 	if (vcpu->arch.mcg_cap & MCG_LMCE_P)
@@ -7650,9 +7640,6 @@ static struct kvm_x86_ops vmx_x86_ops __initdata = {
 	.cpu_dirty_log_size = PML_ENTITY_NUM,
 	.update_cpu_dirty_logging = vmx_update_cpu_dirty_logging,
 
-	.pre_block = vmx_pre_block,
-	.post_block = vmx_post_block,
-
 	.pmu_ops = &intel_pmu_ops,
 	.nested_ops = &vmx_nested_ops,
 
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index f13a61830d9d..4a2341e4ff30 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -10043,8 +10043,7 @@ static inline int vcpu_block(struct kvm *kvm, struct kvm_vcpu *vcpu)
 {
 	bool hv_timer;
 
-	if (!kvm_arch_vcpu_runnable(vcpu) &&
-	    (!kvm_x86_ops.pre_block || static_call(kvm_x86_pre_block)(vcpu) == 0)) {
+	if (!kvm_arch_vcpu_runnable(vcpu)) {
 		/*
 		 * Switch to the software timer before halt-polling/blocking as
 		 * the guest's timer may be a break event for the vCPU, and the
@@ -10066,9 +10065,6 @@ static inline int vcpu_block(struct kvm *kvm, struct kvm_vcpu *vcpu)
 		if (hv_timer)
 			kvm_lapic_switch_to_hv_timer(vcpu);
 
-		if (kvm_x86_ops.post_block)
-			static_call(kvm_x86_post_block)(vcpu);
-
 		if (!kvm_check_request(KVM_REQ_UNHALT, vcpu))
 			return 1;
 	}
-- 
2.34.1.400.ga245620fadb-goog


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH v3 10/26] KVM: SVM: Signal AVIC doorbell iff vCPU is in guest mode
  2021-12-08  1:52 [PATCH v3 00/26] KVM: x86: Halt and APICv overhaul Sean Christopherson
                   ` (8 preceding siblings ...)
  2021-12-08  1:52 ` [PATCH v3 09/26] KVM: x86: Remove defunct pre_block/post_block kvm_x86_ops hooks Sean Christopherson
@ 2021-12-08  1:52 ` Sean Christopherson
  2021-12-08  1:52 ` [PATCH v3 11/26] KVM: SVM: Don't bother checking for "running" AVIC when kicking for IPIs Sean Christopherson
                   ` (18 subsequent siblings)
  28 siblings, 0 replies; 50+ messages in thread
From: Sean Christopherson @ 2021-12-08  1:52 UTC (permalink / raw)
  To: Paolo Bonzini, Joerg Roedel
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Suravee Suthikulpanit, kvm, iommu, linux-kernel, Maxim Levitsky

Signal the AVIC doorbell iff the vCPU is running in the guest.  If the vCPU
is not IN_GUEST_MODE, it's guaranteed to pick up any pending IRQs on the
next VMRUN, which unconditionally processes the vIRR.

Add comments to document the logic.
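
The delivery cases, condensed into a sketch of the code added below:

  if (vcpu->mode == IN_GUEST_MODE) {
  	int cpu = READ_ONCE(vcpu->cpu);

  	/* Tell hardware to process the vIRR on the remote pCPU. */
  	if (cpu != get_cpu())
  		wrmsrl(SVM_AVIC_DOORBELL, kvm_cpu_get_apicid(cpu));
  	put_cpu();
  } else {
  	/*
  	 * Not in the guest: the next VMRUN processes the vIRR, and a
  	 * blocking vCPU must be awakened to reach VMRUN.
  	 */
  	kvm_vcpu_wake_up(vcpu);
  }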

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/svm/avic.c | 22 ++++++++++++++++++++--
 1 file changed, 20 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
index d7132a4551a2..e2e4e9f04458 100644
--- a/arch/x86/kvm/svm/avic.c
+++ b/arch/x86/kvm/svm/avic.c
@@ -672,9 +672,22 @@ int svm_deliver_avic_intr(struct kvm_vcpu *vcpu, int vec)
 		return -1;
 
 	kvm_lapic_set_irr(vec, vcpu->arch.apic);
+
+	/*
+	 * Pairs with the smp_mb_*() after setting vcpu->mode in
+	 * vcpu_enter_guest() to ensure the write to the vIRR is ordered before
+	 * the read of vcpu->mode, which guarantees that either VMRUN will see
+	 * and process the new vIRR entry, or that the below code will signal
+	 * the doorbell if the vCPU is already running in the guest.
+	 */
 	smp_mb__after_atomic();
 
-	if (avic_vcpu_is_running(vcpu)) {
+	/*
+	 * Signal the doorbell to tell hardware to inject the IRQ if the vCPU
+	 * is in the guest.  If the vCPU is not in the guest, hardware will
+	 * automatically process AVIC interrupts at VMRUN.
+	 */
+	if (vcpu->mode == IN_GUEST_MODE) {
 		int cpu = READ_ONCE(vcpu->cpu);
 
 		/*
@@ -688,8 +701,13 @@ int svm_deliver_avic_intr(struct kvm_vcpu *vcpu, int vec)
 		if (cpu != get_cpu())
 			wrmsrl(SVM_AVIC_DOORBELL, kvm_cpu_get_apicid(cpu));
 		put_cpu();
-	} else
+	} else {
+		/*
+		 * Wake the vCPU if it was blocking.  KVM will then detect the
+		 * pending IRQ when checking if the vCPU has a wake event.
+		 */
 		kvm_vcpu_wake_up(vcpu);
+	}
 
 	return 0;
 }
-- 
2.34.1.400.ga245620fadb-goog


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH v3 11/26] KVM: SVM: Don't bother checking for "running" AVIC when kicking for IPIs
  2021-12-08  1:52 [PATCH v3 00/26] KVM: x86: Halt and APICv overhaul Sean Christopherson
                   ` (9 preceding siblings ...)
  2021-12-08  1:52 ` [PATCH v3 10/26] KVM: SVM: Signal AVIC doorbell iff vCPU is in guest mode Sean Christopherson
@ 2021-12-08  1:52 ` Sean Christopherson
  2021-12-08  1:52 ` [PATCH v3 12/26] KVM: SVM: Remove unnecessary APICv/AVIC update in vCPU unblocking path Sean Christopherson
                   ` (17 subsequent siblings)
  28 siblings, 0 replies; 50+ messages in thread
From: Sean Christopherson @ 2021-12-08  1:52 UTC (permalink / raw)
  To: Paolo Bonzini, Joerg Roedel
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Suravee Suthikulpanit, kvm, iommu, linux-kernel, Maxim Levitsky

Drop the avic_vcpu_is_running() check when waking vCPUs in response to a
VM-Exit due to incomplete IPI delivery.  The check isn't wrong per se, but
it's not 100% accurate in the sense that it doesn't guarantee that the vCPU
was one of the vCPUs that didn't receive the IPI.

The check isn't required for correctness as blocking == !running in this
context.

From a performance perspective, waking a live task is not expensive as the
only moderately costly operation is a locked operation to temporarily
disable preemption.  And if that is indeed a performance issue,
kvm_vcpu_is_blocking() would be a better check than poking into the AVIC.
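
No such check is added by this patch, but it would look something like
this hypothetical sketch:

  kvm_for_each_vcpu(i, vcpu, kvm) {
  	if (kvm_vcpu_is_blocking(vcpu) &&
  	    kvm_apic_match_dest(vcpu, source, icrl & APIC_SHORT_MASK,
  				GET_APIC_DEST_FIELD(icrh),
  				icrl & APIC_DEST_MASK))
  		kvm_vcpu_wake_up(vcpu);
  }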

Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
---
 arch/x86/kvm/svm/avic.c | 15 +++++++++------
 arch/x86/kvm/svm/svm.h  | 11 -----------
 2 files changed, 9 insertions(+), 17 deletions(-)

diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
index e2e4e9f04458..37575b71cdf3 100644
--- a/arch/x86/kvm/svm/avic.c
+++ b/arch/x86/kvm/svm/avic.c
@@ -295,13 +295,16 @@ static void avic_kick_target_vcpus(struct kvm *kvm, struct kvm_lapic *source,
 	struct kvm_vcpu *vcpu;
 	unsigned long i;
 
+	/*
+	 * Wake any target vCPUs that are blocking, i.e. waiting for a wake
+	 * event.  There's no need to signal doorbells, as hardware has handled
+	 * vCPUs that were in guest at the time of the IPI, and vCPUs that have
+	 * since entered the guest will have processed pending IRQs at VMRUN.
+	 */
 	kvm_for_each_vcpu(i, vcpu, kvm) {
-		bool m = kvm_apic_match_dest(vcpu, source,
-					     icrl & APIC_SHORT_MASK,
-					     GET_APIC_DEST_FIELD(icrh),
-					     icrl & APIC_DEST_MASK);
-
-		if (m && !avic_vcpu_is_running(vcpu))
+		if (kvm_apic_match_dest(vcpu, source, icrl & APIC_SHORT_MASK,
+					GET_APIC_DEST_FIELD(icrh),
+					icrl & APIC_DEST_MASK))
 			kvm_vcpu_wake_up(vcpu);
 	}
 }
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index 9f153c59f2c8..ca51d6dfc8e6 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -574,17 +574,6 @@ extern struct kvm_x86_nested_ops svm_nested_ops;
 
 #define VMCB_AVIC_APIC_BAR_MASK		0xFFFFFFFFFF000ULL
 
-static inline bool avic_vcpu_is_running(struct kvm_vcpu *vcpu)
-{
-	struct vcpu_svm *svm = to_svm(vcpu);
-	u64 *entry = svm->avic_physical_id_cache;
-
-	if (!entry)
-		return false;
-
-	return (READ_ONCE(*entry) & AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK);
-}
-
 int avic_ga_log_notifier(u32 ga_tag);
 void avic_vm_destroy(struct kvm *kvm);
 int avic_vm_init(struct kvm *kvm);
-- 
2.34.1.400.ga245620fadb-goog


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH v3 12/26] KVM: SVM: Remove unnecessary APICv/AVIC update in vCPU unblocking path
  2021-12-08  1:52 [PATCH v3 00/26] KVM: x86: Halt and APICv overhaul Sean Christopherson
                   ` (10 preceding siblings ...)
  2021-12-08  1:52 ` [PATCH v3 11/26] KVM: SVM: Don't bother checking for "running" AVIC when kicking for IPIs Sean Christopherson
@ 2021-12-08  1:52 ` Sean Christopherson
  2021-12-08  1:52 ` [PATCH v3 13/26] KVM: SVM: Use kvm_vcpu_is_blocking() in AVIC load to handle preemption Sean Christopherson
                   ` (16 subsequent siblings)
  28 siblings, 0 replies; 50+ messages in thread
From: Sean Christopherson @ 2021-12-08  1:52 UTC (permalink / raw)
  To: Paolo Bonzini, Joerg Roedel
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Suravee Suthikulpanit, kvm, iommu, linux-kernel, Maxim Levitsky

Remove handling of KVM_REQ_APICV_UPDATE from svm_vcpu_unblocking(), it's
no longer needed as it was made obsolete by commit df7e4827c549 ("KVM:
SVM: call avic_vcpu_load/avic_vcpu_put when enabling/disabling AVIC").
Prior to that commit, the manual check was necessary to ensure the AVIC
stuff was updated by avic_set_running() when a request to enable APICv
became pending while the vCPU was blocking, as the request handling
itself would not do the update.  But, as evidenced by the commit, that
logic was flawed and subject to various races.

Now that svm_refresh_apicv_exec_ctrl() does avic_vcpu_load/put() in
response to an APICv status change, drop the manual check in the
unblocking path.

Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/svm/avic.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
index 37575b71cdf3..16e4ebd980a2 100644
--- a/arch/x86/kvm/svm/avic.c
+++ b/arch/x86/kvm/svm/avic.c
@@ -1039,7 +1039,5 @@ void svm_vcpu_blocking(struct kvm_vcpu *vcpu)
 
 void svm_vcpu_unblocking(struct kvm_vcpu *vcpu)
 {
-	if (kvm_check_request(KVM_REQ_APICV_UPDATE, vcpu))
-		kvm_vcpu_update_apicv(vcpu);
 	avic_set_running(vcpu, true);
 }
-- 
2.34.1.400.ga245620fadb-goog


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH v3 13/26] KVM: SVM: Use kvm_vcpu_is_blocking() in AVIC load to handle preemption
  2021-12-08  1:52 [PATCH v3 00/26] KVM: x86: Halt and APICv overhaul Sean Christopherson
                   ` (11 preceding siblings ...)
  2021-12-08  1:52 ` [PATCH v3 12/26] KVM: SVM: Remove unnecessary APICv/AVIC update in vCPU unblocking path Sean Christopherson
@ 2021-12-08  1:52 ` Sean Christopherson
  2021-12-08  1:52 ` [PATCH v3 14/26] KVM: SVM: Skip AVIC and IRTE updates when loading blocking vCPU Sean Christopherson
                   ` (15 subsequent siblings)
  28 siblings, 0 replies; 50+ messages in thread
From: Sean Christopherson @ 2021-12-08  1:52 UTC (permalink / raw)
  To: Paolo Bonzini, Joerg Roedel
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Suravee Suthikulpanit, kvm, iommu, linux-kernel, Maxim Levitsky

Use kvm_vcpu_is_blocking() to determine whether or not the vCPU should be
marked running during avic_vcpu_load().  Drop avic_is_running, which
really should have been named "vcpu_is_not_blocking", as it tracked if
the vCPU was blocking, not if it was actually running, e.g. it was set
during svm_create_vcpu() when the vCPU was obviously not running.

This is technically a teeny tiny functional change, as the vCPU will be
marked IsRunning=1 on being reloaded if the vCPU is preempted between
svm_vcpu_blocking() and prepare_to_rcuwait().  But that's a benign change
as the vCPU will be marked IsRunning=0 when KVM voluntarily schedules out
the vCPU.
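
The load path then keys solely off the generic blocking state; sketched
from the diff below:

  entry &= ~AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK;
  if (!kvm_vcpu_is_blocking(vcpu))
  	entry |= AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK;
  WRITE_ONCE(*(svm->avic_physical_id_cache), entry);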

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/svm/avic.c | 14 +++++++++-----
 arch/x86/kvm/svm/svm.c  |  6 ------
 arch/x86/kvm/svm/svm.h  |  1 -
 3 files changed, 9 insertions(+), 12 deletions(-)

diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
index 16e4ebd980a2..dc0cbe500106 100644
--- a/arch/x86/kvm/svm/avic.c
+++ b/arch/x86/kvm/svm/avic.c
@@ -974,6 +974,7 @@ void avic_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 {
 	u64 entry;
 	/* ID = 0xff (broadcast), ID > 0xff (reserved) */
+	bool is_blocking = kvm_vcpu_is_blocking(vcpu);
 	int h_physical_id = kvm_cpu_get_apicid(cpu);
 	struct vcpu_svm *svm = to_svm(vcpu);
 
@@ -991,12 +992,17 @@ void avic_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 	entry |= (h_physical_id & AVIC_PHYSICAL_ID_ENTRY_HOST_PHYSICAL_ID_MASK);
 
 	entry &= ~AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK;
-	if (svm->avic_is_running)
+
+	/*
+	 * Don't mark the vCPU as running if it's blocking, i.e. if the vCPU is
+	 * preempted after svm_vcpu_blocking() but before KVM voluntarily
+	 * schedules out the vCPU.
+	 */
+	if (!is_blocking)
 		entry |= AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK;
 
 	WRITE_ONCE(*(svm->avic_physical_id_cache), entry);
-	avic_update_iommu_vcpu_affinity(vcpu, h_physical_id,
-					svm->avic_is_running);
+	avic_update_iommu_vcpu_affinity(vcpu, h_physical_id, !is_blocking);
 }
 
 void avic_vcpu_put(struct kvm_vcpu *vcpu)
@@ -1017,11 +1023,9 @@ void avic_vcpu_put(struct kvm_vcpu *vcpu)
  */
 static void avic_set_running(struct kvm_vcpu *vcpu, bool is_run)
 {
-	struct vcpu_svm *svm = to_svm(vcpu);
 	int cpu = get_cpu();
 
 	WARN_ON(cpu != vcpu->cpu);
-	svm->avic_is_running = is_run;
 
 	if (kvm_vcpu_apicv_active(vcpu)) {
 		if (is_run)
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 208566f63bce..dde0106ffc47 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -1444,12 +1444,6 @@ static int svm_create_vcpu(struct kvm_vcpu *vcpu)
 	if (err)
 		goto error_free_vmsa_page;
 
-	/* We initialize this flag to true to make sure that the is_running
-	 * bit would be set the first time the vcpu is loaded.
-	 */
-	if (irqchip_in_kernel(vcpu->kvm) && kvm_apicv_activated(vcpu->kvm))
-		svm->avic_is_running = true;
-
 	svm->msrpm = svm_vcpu_alloc_msrpm();
 	if (!svm->msrpm) {
 		err = -ENOMEM;
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index ca51d6dfc8e6..83ced47fa9b9 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -226,7 +226,6 @@ struct vcpu_svm {
 	u32 dfr_reg;
 	struct page *avic_backing_page;
 	u64 *avic_physical_id_cache;
-	bool avic_is_running;
 
 	/*
 	 * Per-vcpu list of struct amd_svm_iommu_ir:
-- 
2.34.1.400.ga245620fadb-goog


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH v3 14/26] KVM: SVM: Skip AVIC and IRTE updates when loading blocking vCPU
  2021-12-08  1:52 [PATCH v3 00/26] KVM: x86: Halt and APICv overhaul Sean Christopherson
                   ` (12 preceding siblings ...)
  2021-12-08  1:52 ` [PATCH v3 13/26] KVM: SVM: Use kvm_vcpu_is_blocking() in AVIC load to handle preemption Sean Christopherson
@ 2021-12-08  1:52 ` Sean Christopherson
  2021-12-08  1:52 ` [PATCH v3 15/26] iommu/amd: KVM: SVM: Use pCPU to infer IsRun state for IRTE Sean Christopherson
                   ` (14 subsequent siblings)
  28 siblings, 0 replies; 50+ messages in thread
From: Sean Christopherson @ 2021-12-08  1:52 UTC (permalink / raw)
  To: Paolo Bonzini, Joerg Roedel
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Suravee Suthikulpanit, kvm, iommu, linux-kernel, Maxim Levitsky

Don't bother updating the Physical APIC table or IRTE when loading a vCPU
that is blocking, i.e. won't be marked IsRun{ning}=1, as the pCPU is
queried if and only if IsRunning is '1'.  If the vCPU was migrated, the
new pCPU will be picked up when avic_vcpu_load() is called by
svm_vcpu_unblocking().
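
As a rough sketch of the resulting flow (illustrative only):

	vCPU blocks:      avic_vcpu_put()    IsRunning=0
	preempt/reload:   avic_vcpu_load()   kvm_vcpu_is_blocking(), bail
	vCPU unblocks:    avic_vcpu_load()   stuff new pCPU, IsRunning=1

Skipping the update for a blocking vCPU is safe precisely because
hardware ignores the pCPU fields in the Physical APIC table entry and
the IRTE while IsRunning is '0'.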

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/svm/avic.c | 33 +++++++++++++++++++--------------
 1 file changed, 19 insertions(+), 14 deletions(-)

diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
index dc0cbe500106..0c6dfd85b3bb 100644
--- a/arch/x86/kvm/svm/avic.c
+++ b/arch/x86/kvm/svm/avic.c
@@ -974,7 +974,6 @@ void avic_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 {
 	u64 entry;
 	/* ID = 0xff (broadcast), ID > 0xff (reserved) */
-	bool is_blocking = kvm_vcpu_is_blocking(vcpu);
 	int h_physical_id = kvm_cpu_get_apicid(cpu);
 	struct vcpu_svm *svm = to_svm(vcpu);
 
@@ -985,24 +984,25 @@ void avic_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 	if (WARN_ON(h_physical_id > AVIC_PHYSICAL_ID_ENTRY_HOST_PHYSICAL_ID_MASK))
 		return;
 
+	/*
+	 * No need to update anything if the vCPU is blocking, i.e. if the vCPU
+	 * is being scheduled in after being preempted.  The CPU entries in the
+	 * Physical APIC table and IRTE are consumed iff IsRun{ning} is '1'.
+	 * If the vCPU was migrated, its new CPU value will be stuffed when the
+	 * vCPU unblocks.
+	 */
+	if (kvm_vcpu_is_blocking(vcpu))
+		return;
+
 	entry = READ_ONCE(*(svm->avic_physical_id_cache));
 	WARN_ON(entry & AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK);
 
 	entry &= ~AVIC_PHYSICAL_ID_ENTRY_HOST_PHYSICAL_ID_MASK;
 	entry |= (h_physical_id & AVIC_PHYSICAL_ID_ENTRY_HOST_PHYSICAL_ID_MASK);
-
-	entry &= ~AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK;
-
-	/*
-	 * Don't mark the vCPU as running if it's blocking, i.e. if the vCPU is
-	 * preempted after svm_vcpu_blocking() but before KVM voluntarily
-	 * schedules out the vCPU.
-	 */
-	if (!is_blocking)
-		entry |= AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK;
+	entry |= AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK;
 
 	WRITE_ONCE(*(svm->avic_physical_id_cache), entry);
-	avic_update_iommu_vcpu_affinity(vcpu, h_physical_id, !is_blocking);
+	avic_update_iommu_vcpu_affinity(vcpu, h_physical_id, true);
 }
 
 void avic_vcpu_put(struct kvm_vcpu *vcpu)
@@ -1011,8 +1011,12 @@ void avic_vcpu_put(struct kvm_vcpu *vcpu)
 	struct vcpu_svm *svm = to_svm(vcpu);
 
 	entry = READ_ONCE(*(svm->avic_physical_id_cache));
-	if (entry & AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK)
-		avic_update_iommu_vcpu_affinity(vcpu, -1, 0);
+
+	/* Nothing to do if IsRunning == '0' due to vCPU blocking. */
+	if (!(entry & AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK))
+		return;
+
+	avic_update_iommu_vcpu_affinity(vcpu, -1, 0);
 
 	entry &= ~AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK;
 	WRITE_ONCE(*(svm->avic_physical_id_cache), entry);
@@ -1043,5 +1047,6 @@ void svm_vcpu_blocking(struct kvm_vcpu *vcpu)
 
 void svm_vcpu_unblocking(struct kvm_vcpu *vcpu)
 {
+
 	avic_set_running(vcpu, true);
 }
-- 
2.34.1.400.ga245620fadb-goog


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH v3 15/26] iommu/amd: KVM: SVM: Use pCPU to infer IsRun state for IRTE
  2021-12-08  1:52 [PATCH v3 00/26] KVM: x86: Halt and APICv overhaul Sean Christopherson
                   ` (13 preceding siblings ...)
  2021-12-08  1:52 ` [PATCH v3 14/26] KVM: SVM: Skip AVIC and IRTE updates when loading blocking vCPU Sean Christopherson
@ 2021-12-08  1:52 ` Sean Christopherson
  2021-12-08  1:52 ` [PATCH v3 16/26] KVM: VMX: Don't do full kick when triggering posted interrupt "fails" Sean Christopherson
                   ` (13 subsequent siblings)
  28 siblings, 0 replies; 50+ messages in thread
From: Sean Christopherson @ 2021-12-08  1:52 UTC (permalink / raw)
  To: Paolo Bonzini, Joerg Roedel
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Suravee Suthikulpanit, kvm, iommu, linux-kernel, Maxim Levitsky

Now that the one and only caller of amd_iommu_update_ga() passes in
"is_run == (cpu >= 0)" in all paths, infer IRT.vAPIC.IsRun from @cpu
instead of having the caller pass redundant information.

No functional change intended.
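
E.g. the avic.c call sites reduce to the following convention (a sketch
mirroring the diff below):

	/* cpu >= 0: vCPU is loaded on that pCPU, IRTE gets IsRun=1. */
	avic_update_iommu_vcpu_affinity(vcpu, h_physical_id);

	/* cpu < 0: vCPU is being put, IRTE gets IsRun=0. */
	avic_update_iommu_vcpu_affinity(vcpu, -1);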

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/svm/avic.c   | 8 ++++----
 drivers/iommu/amd/iommu.c | 6 ++++--
 include/linux/amd-iommu.h | 6 ++----
 3 files changed, 10 insertions(+), 10 deletions(-)

diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
index 0c6dfd85b3bb..88b3c315b34f 100644
--- a/arch/x86/kvm/svm/avic.c
+++ b/arch/x86/kvm/svm/avic.c
@@ -941,7 +941,7 @@ bool svm_check_apicv_inhibit_reasons(ulong bit)
 
 
 static inline int
-avic_update_iommu_vcpu_affinity(struct kvm_vcpu *vcpu, int cpu, bool r)
+avic_update_iommu_vcpu_affinity(struct kvm_vcpu *vcpu, int cpu)
 {
 	int ret = 0;
 	unsigned long flags;
@@ -961,7 +961,7 @@ avic_update_iommu_vcpu_affinity(struct kvm_vcpu *vcpu, int cpu, bool r)
 		goto out;
 
 	list_for_each_entry(ir, &svm->ir_list, node) {
-		ret = amd_iommu_update_ga(cpu, r, ir->data);
+		ret = amd_iommu_update_ga(cpu, ir->data);
 		if (ret)
 			break;
 	}
@@ -1002,7 +1002,7 @@ void avic_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 	entry |= AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK;
 
 	WRITE_ONCE(*(svm->avic_physical_id_cache), entry);
-	avic_update_iommu_vcpu_affinity(vcpu, h_physical_id, true);
+	avic_update_iommu_vcpu_affinity(vcpu, h_physical_id);
 }
 
 void avic_vcpu_put(struct kvm_vcpu *vcpu)
@@ -1016,7 +1016,7 @@ void avic_vcpu_put(struct kvm_vcpu *vcpu)
 	if (!(entry & AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK))
 		return;
 
-	avic_update_iommu_vcpu_affinity(vcpu, -1, 0);
+	avic_update_iommu_vcpu_affinity(vcpu, -1);
 
 	entry &= ~AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK;
 	WRITE_ONCE(*(svm->avic_physical_id_cache), entry);
diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c
index 461f1844ed1f..2affb42fa319 100644
--- a/drivers/iommu/amd/iommu.c
+++ b/drivers/iommu/amd/iommu.c
@@ -3522,7 +3522,7 @@ int amd_iommu_create_irq_domain(struct amd_iommu *iommu)
 	return 0;
 }
 
-int amd_iommu_update_ga(int cpu, bool is_run, void *data)
+int amd_iommu_update_ga(int cpu, void *data)
 {
 	unsigned long flags;
 	struct amd_iommu *iommu;
@@ -3552,8 +3552,10 @@ int amd_iommu_update_ga(int cpu, bool is_run, void *data)
 						APICID_TO_IRTE_DEST_LO(cpu);
 			ref->hi.fields.destination =
 						APICID_TO_IRTE_DEST_HI(cpu);
+			ref->lo.fields_vapic.is_run = true;
+		} else {
+			ref->lo.fields_vapic.is_run = false;
 		}
-		ref->lo.fields_vapic.is_run = is_run;
 		barrier();
 	}
 
diff --git a/include/linux/amd-iommu.h b/include/linux/amd-iommu.h
index 58e6c3806c09..005aa3ada2e9 100644
--- a/include/linux/amd-iommu.h
+++ b/include/linux/amd-iommu.h
@@ -165,8 +165,7 @@ static inline int amd_iommu_detect(void) { return -ENODEV; }
 /* IOMMU AVIC Function */
 extern int amd_iommu_register_ga_log_notifier(int (*notifier)(u32));
 
-extern int
-amd_iommu_update_ga(int cpu, bool is_run, void *data);
+extern int amd_iommu_update_ga(int cpu, void *data);
 
 extern int amd_iommu_activate_guest_mode(void *data);
 extern int amd_iommu_deactivate_guest_mode(void *data);
@@ -179,8 +178,7 @@ amd_iommu_register_ga_log_notifier(int (*notifier)(u32))
 	return 0;
 }
 
-static inline int
-amd_iommu_update_ga(int cpu, bool is_run, void *data)
+static inline int amd_iommu_update_ga(int cpu, void *data)
 {
 	return 0;
 }
-- 
2.34.1.400.ga245620fadb-goog


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH v3 16/26] KVM: VMX: Don't do full kick when triggering posted interrupt "fails"
  2021-12-08  1:52 [PATCH v3 00/26] KVM: x86: Halt and APICv overhaul Sean Christopherson
                   ` (14 preceding siblings ...)
  2021-12-08  1:52 ` [PATCH v3 15/26] iommu/amd: KVM: SVM: Use pCPU to infer IsRun state for IRTE Sean Christopherson
@ 2021-12-08  1:52 ` Sean Christopherson
  2021-12-08  1:52 ` [PATCH v3 17/26] KVM: VMX: Wake vCPU when delivering posted IRQ even if vCPU == this vCPU Sean Christopherson
                   ` (12 subsequent siblings)
  28 siblings, 0 replies; 50+ messages in thread
From: Sean Christopherson @ 2021-12-08  1:52 UTC (permalink / raw)
  To: Paolo Bonzini, Joerg Roedel
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Suravee Suthikulpanit, kvm, iommu, linux-kernel, Maxim Levitsky

Replace the full "kick" with just the "wake" in the fallback path when
triggering a virtual interrupt via a posted interrupt fails because the
vCPU is not IN_GUEST_MODE.  If the vCPU transitions into guest mode
between the check and the wake, then it's guaranteed to see the pending
interrupt, as KVM syncs the PIR to the IRR (and onto GUEST_RVI) after
setting IN_GUEST_MODE.  Kicking the vCPU in this case is nothing more
than an unnecessary VM-Exit (and host IRQ).

Opportunistically update comments to explain the various ordering rules
and barriers at play.
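
For reference, the store/load pairing that makes the bare wake
sufficient, sketched from the code being commented (not new logic):

	Sender (deliver posted IRQ)       Target vCPU (vcpu_enter_guest())

	set PIR bit and PID.ON            vcpu->mode = IN_GUEST_MODE;
	smp_mb__after_atomic();           smp_mb__after_srcu_read_unlock();
	if (vcpu->mode == IN_GUEST_MODE)  <sync PIR to IRR, sees PID.ON>
		send notification IPI
	else
		kvm_vcpu_wake_up(vcpu);

Either the sender observes IN_GUEST_MODE and sends the IPI, or the
target's PIR=>IRR sync observes the posted interrupt; a full kick buys
nothing in either case.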

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/vmx/vmx.c | 4 ++--
 arch/x86/kvm/x86.c     | 9 +++++----
 2 files changed, 7 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 1b8804d93776..fa90eacbf7e2 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -3956,7 +3956,7 @@ static int vmx_deliver_nested_posted_interrupt(struct kvm_vcpu *vcpu,
 
 		/* the PIR and ON have been set by L1. */
 		if (!kvm_vcpu_trigger_posted_interrupt(vcpu, true))
-			kvm_vcpu_kick(vcpu);
+			kvm_vcpu_wake_up(vcpu);
 		return 0;
 	}
 	return -1;
@@ -3995,7 +3995,7 @@ static int vmx_deliver_posted_interrupt(struct kvm_vcpu *vcpu, int vector)
 	 */
 	if (vcpu != kvm_get_running_vcpu() &&
 	    !kvm_vcpu_trigger_posted_interrupt(vcpu, false))
-		kvm_vcpu_kick(vcpu);
+		kvm_vcpu_wake_up(vcpu);
 
 	return 0;
 }
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 4a2341e4ff30..abf99b77883e 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -9887,10 +9887,11 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 	smp_mb__after_srcu_read_unlock();
 
 	/*
-	 * This handles the case where a posted interrupt was
-	 * notified with kvm_vcpu_kick.  Assigned devices can
-	 * use the POSTED_INTR_VECTOR even if APICv is disabled,
-	 * so do it even if !kvm_vcpu_apicv_active(vcpu).
+	 * Process pending posted interrupts to handle the case where the
+	 * notification IRQ arrived in the host, or was never sent (because the
+	 * target vCPU wasn't running).  Do this regardless of the vCPU's APICv
+	 * status, as KVM doesn't update assigned devices when APICv is inhibited,
+	 * i.e. they can post interrupts even if APICv is temporarily disabled.
 	 */
 	if (kvm_lapic_enabled(vcpu))
 		static_call_cond(kvm_x86_sync_pir_to_irr)(vcpu);
-- 
2.34.1.400.ga245620fadb-goog


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH v3 17/26] KVM: VMX: Wake vCPU when delivering posted IRQ even if vCPU == this vCPU
  2021-12-08  1:52 [PATCH v3 00/26] KVM: x86: Halt and APICv overhaul Sean Christopherson
                   ` (15 preceding siblings ...)
  2021-12-08  1:52 ` [PATCH v3 16/26] KVM: VMX: Don't do full kick when triggering posted interrupt "fails" Sean Christopherson
@ 2021-12-08  1:52 ` Sean Christopherson
  2021-12-08  1:52 ` [PATCH v3 18/26] KVM: VMX: Pass desired vector instead of bool for triggering posted IRQ Sean Christopherson
                   ` (11 subsequent siblings)
  28 siblings, 0 replies; 50+ messages in thread
From: Sean Christopherson @ 2021-12-08  1:52 UTC (permalink / raw)
  To: Paolo Bonzini, Joerg Roedel
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Suravee Suthikulpanit, kvm, iommu, linux-kernel, Maxim Levitsky

Drop a check that guards triggering a posted interrupt on the currently
running vCPU, and more importantly guards waking the target vCPU if
triggering a posted interrupt fails because the vCPU isn't IN_GUEST_MODE.
The "do nothing" logic when "vcpu == running_vcpu" works only because KVM
doesn't have a path to ->deliver_posted_interrupt() from asynchronous
context, e.g. if apic_timer_expired() were changed to always go down the
posted interrupt path for APICv, or if the IN_GUEST_MODE check in
kvm_use_posted_timer_interrupt() were dropped, and the hrtimer fired in
kvm_vcpu_block() after the final kvm_vcpu_check_block() check, the vCPU
would be scheduled() out without being awakened, i.e. would "miss" the
timer interrupt.

One could argue that invoking kvm_apic_local_deliver() from (soft) IRQ
context for the current running vCPU should be illegal, but nothing in
KVM actually enforces that rule.  There's also no strong obvious benefit
to making such behavior illegal, e.g. checking IN_GUEST_MODE and calling
kvm_vcpu_wake_up() is at worst marginally more costly than querying the
current running vCPU.

Lastly, this aligns the non-nested and nested usage of triggering posted
interrupts, and will allow for additional cleanups.
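
A sketch of the hypothetical race described above (assuming posted
interrupt delivery from asynchronous context, which KVM doesn't
currently do):

	vCPU task                         hrtimer (softirq, same pCPU)

	kvm_vcpu_block()
	  kvm_vcpu_check_block()          <- vIRR clean, nothing pending
	                                  apic_timer_expired()
	                                    ->deliver_posted_interrupt()
	                                      <- old code: vcpu is the
	                                         running vCPU, do nothing
	  schedule()                      <- sleeps, wakeup never sent,
	                                     timer interrupt is "missed"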

Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
---
 arch/x86/kvm/vmx/vmx.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index fa90eacbf7e2..0eac98589472 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -3993,8 +3993,7 @@ static int vmx_deliver_posted_interrupt(struct kvm_vcpu *vcpu, int vector)
 	 * guaranteed to see PID.ON=1 and sync the PIR to IRR if triggering a
 	 * posted interrupt "fails" because vcpu->mode != IN_GUEST_MODE.
 	 */
-	if (vcpu != kvm_get_running_vcpu() &&
-	    !kvm_vcpu_trigger_posted_interrupt(vcpu, false))
+	if (!kvm_vcpu_trigger_posted_interrupt(vcpu, false))
 		kvm_vcpu_wake_up(vcpu);
 
 	return 0;
-- 
2.34.1.400.ga245620fadb-goog


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH v3 18/26] KVM: VMX: Pass desired vector instead of bool for triggering posted IRQ
  2021-12-08  1:52 [PATCH v3 00/26] KVM: x86: Halt and APICv overhaul Sean Christopherson
                   ` (16 preceding siblings ...)
  2021-12-08  1:52 ` [PATCH v3 17/26] KVM: VMX: Wake vCPU when delivering posted IRQ even if vCPU == this vCPU Sean Christopherson
@ 2021-12-08  1:52 ` Sean Christopherson
  2021-12-08  1:52 ` [PATCH v3 19/26] KVM: VMX: Fold fallback path into triggering posted IRQ helper Sean Christopherson
                   ` (10 subsequent siblings)
  28 siblings, 0 replies; 50+ messages in thread
From: Sean Christopherson @ 2021-12-08  1:52 UTC (permalink / raw)
  To: Paolo Bonzini, Joerg Roedel
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Suravee Suthikulpanit, kvm, iommu, linux-kernel, Maxim Levitsky

Refactor the posted interrupt helper to take the desired notification
vector instead of a bool so that the callers are self-documenting.

No functional change intended.
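
E.g. the nested call site goes from a mystery boolean to an explicit
vector (taken from the diff below):

	Before:  kvm_vcpu_trigger_posted_interrupt(vcpu, true);
	After:   kvm_vcpu_trigger_posted_interrupt(vcpu, POSTED_INTR_NESTED_VECTOR);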

Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
---
 arch/x86/kvm/vmx/vmx.c | 8 +++-----
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 0eac98589472..ff309ebe9f2c 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -3890,11 +3890,9 @@ static void vmx_msr_filter_changed(struct kvm_vcpu *vcpu)
 }
 
 static inline bool kvm_vcpu_trigger_posted_interrupt(struct kvm_vcpu *vcpu,
-						     bool nested)
+						     int pi_vec)
 {
 #ifdef CONFIG_SMP
-	int pi_vec = nested ? POSTED_INTR_NESTED_VECTOR : POSTED_INTR_VECTOR;
-
 	if (vcpu->mode == IN_GUEST_MODE) {
 		/*
 		 * The vector of interrupt to be delivered to vcpu had
@@ -3955,7 +3953,7 @@ static int vmx_deliver_nested_posted_interrupt(struct kvm_vcpu *vcpu,
 		smp_mb__after_atomic();
 
 		/* the PIR and ON have been set by L1. */
-		if (!kvm_vcpu_trigger_posted_interrupt(vcpu, true))
+		if (!kvm_vcpu_trigger_posted_interrupt(vcpu, POSTED_INTR_NESTED_VECTOR))
 			kvm_vcpu_wake_up(vcpu);
 		return 0;
 	}
@@ -3993,7 +3991,7 @@ static int vmx_deliver_posted_interrupt(struct kvm_vcpu *vcpu, int vector)
 	 * guaranteed to see PID.ON=1 and sync the PIR to IRR if triggering a
 	 * posted interrupt "fails" because vcpu->mode != IN_GUEST_MODE.
 	 */
-	if (!kvm_vcpu_trigger_posted_interrupt(vcpu, false))
+	if (!kvm_vcpu_trigger_posted_interrupt(vcpu, POSTED_INTR_VECTOR))
 		kvm_vcpu_wake_up(vcpu);
 
 	return 0;
-- 
2.34.1.400.ga245620fadb-goog


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH v3 19/26] KVM: VMX: Fold fallback path into triggering posted IRQ helper
  2021-12-08  1:52 [PATCH v3 00/26] KVM: x86: Halt and APICv overhaul Sean Christopherson
                   ` (17 preceding siblings ...)
  2021-12-08  1:52 ` [PATCH v3 18/26] KVM: VMX: Pass desired vector instead of bool for triggering posted IRQ Sean Christopherson
@ 2021-12-08  1:52 ` Sean Christopherson
  2021-12-08  1:52 ` [PATCH v3 20/26] KVM: VMX: Don't do full kick when handling posted interrupt wakeup Sean Christopherson
                   ` (9 subsequent siblings)
  28 siblings, 0 replies; 50+ messages in thread
From: Sean Christopherson @ 2021-12-08  1:52 UTC (permalink / raw)
  To: Paolo Bonzini, Joerg Roedel
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Suravee Suthikulpanit, kvm, iommu, linux-kernel, Maxim Levitsky

Move the fallback "wake_up" path into the posted interrupt triggering
helper now that the nested and non-nested paths are identical.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
---
 arch/x86/kvm/vmx/vmx.c | 18 ++++++++++--------
 1 file changed, 10 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index ff309ebe9f2c..9153f5f5d424 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -3889,7 +3889,7 @@ static void vmx_msr_filter_changed(struct kvm_vcpu *vcpu)
 	pt_update_intercept_for_msr(vcpu);
 }
 
-static inline bool kvm_vcpu_trigger_posted_interrupt(struct kvm_vcpu *vcpu,
+static inline void kvm_vcpu_trigger_posted_interrupt(struct kvm_vcpu *vcpu,
 						     int pi_vec)
 {
 #ifdef CONFIG_SMP
@@ -3920,10 +3920,15 @@ static inline bool kvm_vcpu_trigger_posted_interrupt(struct kvm_vcpu *vcpu,
 		 */
 
 		apic->send_IPI_mask(get_cpu_mask(vcpu->cpu), pi_vec);
-		return true;
+		return;
 	}
 #endif
-	return false;
+	/*
+	 * The vCPU isn't in the guest; wake the vCPU in case it is blocking,
+	 * otherwise do nothing as KVM will grab the highest priority pending
+	 * IRQ via ->sync_pir_to_irr() in vcpu_enter_guest().
+	 */
+	kvm_vcpu_wake_up(vcpu);
 }
 
 static int vmx_deliver_nested_posted_interrupt(struct kvm_vcpu *vcpu,
@@ -3953,8 +3958,7 @@ static int vmx_deliver_nested_posted_interrupt(struct kvm_vcpu *vcpu,
 		smp_mb__after_atomic();
 
 		/* the PIR and ON have been set by L1. */
-		if (!kvm_vcpu_trigger_posted_interrupt(vcpu, POSTED_INTR_NESTED_VECTOR))
-			kvm_vcpu_wake_up(vcpu);
+		kvm_vcpu_trigger_posted_interrupt(vcpu, POSTED_INTR_NESTED_VECTOR);
 		return 0;
 	}
 	return -1;
@@ -3991,9 +3995,7 @@ static int vmx_deliver_posted_interrupt(struct kvm_vcpu *vcpu, int vector)
 	 * guaranteed to see PID.ON=1 and sync the PIR to IRR if triggering a
 	 * posted interrupt "fails" because vcpu->mode != IN_GUEST_MODE.
 	 */
-	if (!kvm_vcpu_trigger_posted_interrupt(vcpu, POSTED_INTR_VECTOR))
-		kvm_vcpu_wake_up(vcpu);
-
+	kvm_vcpu_trigger_posted_interrupt(vcpu, POSTED_INTR_VECTOR);
 	return 0;
 }
 
-- 
2.34.1.400.ga245620fadb-goog


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH v3 20/26] KVM: VMX: Don't do full kick when handling posted interrupt wakeup
  2021-12-08  1:52 [PATCH v3 00/26] KVM: x86: Halt and APICv overhaul Sean Christopherson
                   ` (18 preceding siblings ...)
  2021-12-08  1:52 ` [PATCH v3 19/26] KVM: VMX: Fold fallback path into triggering posted IRQ helper Sean Christopherson
@ 2021-12-08  1:52 ` Sean Christopherson
  2021-12-08  1:52 ` [PATCH v3 21/26] KVM: SVM: Drop AVIC's intermediate avic_set_running() helper Sean Christopherson
                   ` (8 subsequent siblings)
  28 siblings, 0 replies; 50+ messages in thread
From: Sean Christopherson @ 2021-12-08  1:52 UTC (permalink / raw)
  To: Paolo Bonzini, Joerg Roedel
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Suravee Suthikulpanit, kvm, iommu, linux-kernel, Maxim Levitsky

When waking vCPUs in the posted interrupt wakeup handling, do exactly
that and no more.  There is no need to kick the vCPU as the wakeup
handler just needs to get the vCPU task running, and if it's in the guest
then it's definitely running.
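
Roughly, the distinction being exploited (a sketch of the helpers'
behavior, not their exact implementations):

	kvm_vcpu_kick(vcpu):
	  - wake the vCPU's task if it is blocking, and
	  - IPI the target pCPU to force a VM-Exit if the vCPU is
	    IN_GUEST_MODE

	kvm_vcpu_wake_up(vcpu):
	  - only wake the vCPU's task if it is blocking

The wakeup handler runs only for vCPUs that put themselves on the
wakeup list before blocking, so the VM-Exit half of the kick is pure
overhead.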

Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
---
 arch/x86/kvm/vmx/posted_intr.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kvm/vmx/posted_intr.c b/arch/x86/kvm/vmx/posted_intr.c
index 023a6b9b0fa4..f4169c009400 100644
--- a/arch/x86/kvm/vmx/posted_intr.c
+++ b/arch/x86/kvm/vmx/posted_intr.c
@@ -209,7 +209,7 @@ void pi_wakeup_handler(void)
 			    pi_wakeup_list) {
 
 		if (pi_test_on(&vmx->pi_desc))
-			kvm_vcpu_kick(&vmx->vcpu);
+			kvm_vcpu_wake_up(&vmx->vcpu);
 	}
 	spin_unlock(&per_cpu(wakeup_vcpus_on_cpu_lock, cpu));
 }
-- 
2.34.1.400.ga245620fadb-goog


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH v3 21/26] KVM: SVM: Drop AVIC's intermediate avic_set_running() helper
  2021-12-08  1:52 [PATCH v3 00/26] KVM: x86: Halt and APICv overhaul Sean Christopherson
                   ` (19 preceding siblings ...)
  2021-12-08  1:52 ` [PATCH v3 20/26] KVM: VMX: Don't do full kick when handling posted interrupt wakeup Sean Christopherson
@ 2021-12-08  1:52 ` Sean Christopherson
  2021-12-08 14:43   ` Paolo Bonzini
  2021-12-08  1:52 ` [PATCH v3 22/26] KVM: SVM: Move svm_hardware_setup() and its helpers below svm_x86_ops Sean Christopherson
                   ` (7 subsequent siblings)
  28 siblings, 1 reply; 50+ messages in thread
From: Sean Christopherson @ 2021-12-08  1:52 UTC (permalink / raw)
  To: Paolo Bonzini, Joerg Roedel
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Suravee Suthikulpanit, kvm, iommu, linux-kernel, Maxim Levitsky

Drop avic_set_running() in favor of calling avic_vcpu_{load,put}()
directly, and modify the block+put path to use preempt_disable/enable()
instead of get/put_cpu(), as it doesn't actually care about the current
pCPU associated with the vCPU.  Opportunistically add lockdep assertions
as being preempted in avic_vcpu_put() would lead to consuming stale data,
even though doing so _in the current code base_ would not be fatal.

Add a much-needed comment explaining why svm_vcpu_blocking() needs to
unload the AVIC and update the IRTE _before_ the vCPU starts blocking.
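
The required ordering, sketched (the sender-side barrier is the
hardware's responsibility on interrupt delivery):

	vCPU (about to block)             Sender (posting an IRQ)

	IsRunning = 0                     set vIRR bit
	smp_mb()                          <barrier on delivery>
	final vIRR pass via               read IsRunning
	  kvm_vcpu_check_block()            1 => ring the doorbell
	                                    0 => take an exit/wake path
	                                         that awakens the vCPU

Either the vCPU's final vIRR pass observes the IRQ, or the sender
observes IsRunning=0 and the vCPU is awakened by software.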

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/svm/avic.c | 50 ++++++++++++++++++++++++-----------------
 1 file changed, 30 insertions(+), 20 deletions(-)

diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
index 88b3c315b34f..dd0d688bc342 100644
--- a/arch/x86/kvm/svm/avic.c
+++ b/arch/x86/kvm/svm/avic.c
@@ -977,6 +977,8 @@ void avic_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 	int h_physical_id = kvm_cpu_get_apicid(cpu);
 	struct vcpu_svm *svm = to_svm(vcpu);
 
+	lockdep_assert_preemption_disabled();
+
 	/*
 	 * Since the host physical APIC id is 8 bits,
 	 * we can support host APIC ID upto 255.
@@ -1010,6 +1012,8 @@ void avic_vcpu_put(struct kvm_vcpu *vcpu)
 	u64 entry;
 	struct vcpu_svm *svm = to_svm(vcpu);
 
+	lockdep_assert_preemption_disabled();
+
 	entry = READ_ONCE(*(svm->avic_physical_id_cache));
 
 	/* Nothing to do if IsRunning == '0' due to vCPU blocking. */
@@ -1022,31 +1026,37 @@ void avic_vcpu_put(struct kvm_vcpu *vcpu)
 	WRITE_ONCE(*(svm->avic_physical_id_cache), entry);
 }
 
-/*
- * This function is called during VCPU halt/unhalt.
- */
-static void avic_set_running(struct kvm_vcpu *vcpu, bool is_run)
-{
-	int cpu = get_cpu();
-
-	WARN_ON(cpu != vcpu->cpu);
-
-	if (kvm_vcpu_apicv_active(vcpu)) {
-		if (is_run)
-			avic_vcpu_load(vcpu, cpu);
-		else
-			avic_vcpu_put(vcpu);
-	}
-	put_cpu();
-}
-
 void svm_vcpu_blocking(struct kvm_vcpu *vcpu)
 {
-	avic_set_running(vcpu, false);
+	if (!kvm_vcpu_apicv_active(vcpu))
+		return;
+
+	preempt_disable();
+
+	/*
+	 * Unload the AVIC when the vCPU is about to block, _before_ the vCPU
+	 * actually blocks.  The vCPU needs to be marked IsRunning=0 before the
+	 * final pass over the vIRR via kvm_vcpu_check_block().  Any IRQs that
+	 * arrive before IsRunning=0 will not signal the doorbell, i.e. it's
+	 * KVM's responsibility to ensure there are no pending IRQs in the vIRR
+	 * after IsRunning is cleared, prior to scheduling out the vCPU.
+	 */
+	avic_vcpu_put(vcpu);
+
+	preempt_enable();
 }
 
 void svm_vcpu_unblocking(struct kvm_vcpu *vcpu)
 {
+	int cpu;
 
-	avic_set_running(vcpu, true);
+	if (!kvm_vcpu_apicv_active(vcpu))
+		return;
+
+	cpu = get_cpu();
+	WARN_ON(cpu != vcpu->cpu);
+
+	avic_vcpu_load(vcpu, cpu);
+
+	put_cpu();
 }
-- 
2.34.1.400.ga245620fadb-goog


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH v3 22/26] KVM: SVM: Move svm_hardware_setup() and its helpers below svm_x86_ops
  2021-12-08  1:52 [PATCH v3 00/26] KVM: x86: Halt and APICv overhaul Sean Christopherson
                   ` (20 preceding siblings ...)
  2021-12-08  1:52 ` [PATCH v3 21/26] KVM: SVM: Drop AVIC's intermediate avic_set_running() helper Sean Christopherson
@ 2021-12-08  1:52 ` Sean Christopherson
  2021-12-08  1:52 ` [PATCH v3 23/26] KVM: SVM: Nullify vcpu_(un)blocking() hooks if AVIC is disabled Sean Christopherson
                   ` (6 subsequent siblings)
  28 siblings, 0 replies; 50+ messages in thread
From: Sean Christopherson @ 2021-12-08  1:52 UTC (permalink / raw)
  To: Paolo Bonzini, Joerg Roedel
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Suravee Suthikulpanit, kvm, iommu, linux-kernel, Maxim Levitsky

Move svm_hardware_setup() below svm_x86_ops so that KVM can modify ops
during setup, e.g. the vcpu_(un)blocking hooks can be nullified if AVIC
is disabled or unsupported.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/svm/svm.c | 466 ++++++++++++++++++++---------------------
 1 file changed, 233 insertions(+), 233 deletions(-)

diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index dde0106ffc47..98f7c454a784 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -873,47 +873,6 @@ static void shrink_ple_window(struct kvm_vcpu *vcpu)
 	}
 }
 
-/*
- * The default MMIO mask is a single bit (excluding the present bit),
- * which could conflict with the memory encryption bit. Check for
- * memory encryption support and override the default MMIO mask if
- * memory encryption is enabled.
- */
-static __init void svm_adjust_mmio_mask(void)
-{
-	unsigned int enc_bit, mask_bit;
-	u64 msr, mask;
-
-	/* If there is no memory encryption support, use existing mask */
-	if (cpuid_eax(0x80000000) < 0x8000001f)
-		return;
-
-	/* If memory encryption is not enabled, use existing mask */
-	rdmsrl(MSR_AMD64_SYSCFG, msr);
-	if (!(msr & MSR_AMD64_SYSCFG_MEM_ENCRYPT))
-		return;
-
-	enc_bit = cpuid_ebx(0x8000001f) & 0x3f;
-	mask_bit = boot_cpu_data.x86_phys_bits;
-
-	/* Increment the mask bit if it is the same as the encryption bit */
-	if (enc_bit == mask_bit)
-		mask_bit++;
-
-	/*
-	 * If the mask bit location is below 52, then some bits above the
-	 * physical addressing limit will always be reserved, so use the
-	 * rsvd_bits() function to generate the mask. This mask, along with
-	 * the present bit, will be used to generate a page fault with
-	 * PFER.RSV = 1.
-	 *
-	 * If the mask bit location is 52 (or above), then clear the mask.
-	 */
-	mask = (mask_bit < 52) ? rsvd_bits(mask_bit, 51) | PT_PRESENT_MASK : 0;
-
-	kvm_mmu_set_mmio_spte_mask(mask, mask, PT_WRITABLE_MASK | PT_USER_MASK);
-}
-
 static void svm_hardware_teardown(void)
 {
 	int cpu;
@@ -928,198 +887,6 @@ static void svm_hardware_teardown(void)
 	iopm_base = 0;
 }
 
-static __init void svm_set_cpu_caps(void)
-{
-	kvm_set_cpu_caps();
-
-	supported_xss = 0;
-
-	/* CPUID 0x80000001 and 0x8000000A (SVM features) */
-	if (nested) {
-		kvm_cpu_cap_set(X86_FEATURE_SVM);
-
-		if (nrips)
-			kvm_cpu_cap_set(X86_FEATURE_NRIPS);
-
-		if (npt_enabled)
-			kvm_cpu_cap_set(X86_FEATURE_NPT);
-
-		if (tsc_scaling)
-			kvm_cpu_cap_set(X86_FEATURE_TSCRATEMSR);
-
-		/* Nested VM can receive #VMEXIT instead of triggering #GP */
-		kvm_cpu_cap_set(X86_FEATURE_SVME_ADDR_CHK);
-	}
-
-	/* CPUID 0x80000008 */
-	if (boot_cpu_has(X86_FEATURE_LS_CFG_SSBD) ||
-	    boot_cpu_has(X86_FEATURE_AMD_SSBD))
-		kvm_cpu_cap_set(X86_FEATURE_VIRT_SSBD);
-
-	/* AMD PMU PERFCTR_CORE CPUID */
-	if (pmu && boot_cpu_has(X86_FEATURE_PERFCTR_CORE))
-		kvm_cpu_cap_set(X86_FEATURE_PERFCTR_CORE);
-
-	/* CPUID 0x8000001F (SME/SEV features) */
-	sev_set_cpu_caps();
-}
-
-static __init int svm_hardware_setup(void)
-{
-	int cpu;
-	struct page *iopm_pages;
-	void *iopm_va;
-	int r;
-	unsigned int order = get_order(IOPM_SIZE);
-
-	/*
-	 * NX is required for shadow paging and for NPT if the NX huge pages
-	 * mitigation is enabled.
-	 */
-	if (!boot_cpu_has(X86_FEATURE_NX)) {
-		pr_err_ratelimited("NX (Execute Disable) not supported\n");
-		return -EOPNOTSUPP;
-	}
-	kvm_enable_efer_bits(EFER_NX);
-
-	iopm_pages = alloc_pages(GFP_KERNEL, order);
-
-	if (!iopm_pages)
-		return -ENOMEM;
-
-	iopm_va = page_address(iopm_pages);
-	memset(iopm_va, 0xff, PAGE_SIZE * (1 << order));
-	iopm_base = page_to_pfn(iopm_pages) << PAGE_SHIFT;
-
-	init_msrpm_offsets();
-
-	supported_xcr0 &= ~(XFEATURE_MASK_BNDREGS | XFEATURE_MASK_BNDCSR);
-
-	if (boot_cpu_has(X86_FEATURE_FXSR_OPT))
-		kvm_enable_efer_bits(EFER_FFXSR);
-
-	if (tsc_scaling) {
-		if (!boot_cpu_has(X86_FEATURE_TSCRATEMSR)) {
-			tsc_scaling = false;
-		} else {
-			pr_info("TSC scaling supported\n");
-			kvm_has_tsc_control = true;
-			kvm_max_tsc_scaling_ratio = TSC_RATIO_MAX;
-			kvm_tsc_scaling_ratio_frac_bits = 32;
-		}
-	}
-
-	tsc_aux_uret_slot = kvm_add_user_return_msr(MSR_TSC_AUX);
-
-	/* Check for pause filtering support */
-	if (!boot_cpu_has(X86_FEATURE_PAUSEFILTER)) {
-		pause_filter_count = 0;
-		pause_filter_thresh = 0;
-	} else if (!boot_cpu_has(X86_FEATURE_PFTHRESHOLD)) {
-		pause_filter_thresh = 0;
-	}
-
-	if (nested) {
-		printk(KERN_INFO "kvm: Nested Virtualization enabled\n");
-		kvm_enable_efer_bits(EFER_SVME | EFER_LMSLE);
-	}
-
-	/*
-	 * KVM's MMU doesn't support using 2-level paging for itself, and thus
-	 * NPT isn't supported if the host is using 2-level paging since host
-	 * CR4 is unchanged on VMRUN.
-	 */
-	if (!IS_ENABLED(CONFIG_X86_64) && !IS_ENABLED(CONFIG_X86_PAE))
-		npt_enabled = false;
-
-	if (!boot_cpu_has(X86_FEATURE_NPT))
-		npt_enabled = false;
-
-	/* Force VM NPT level equal to the host's paging level */
-	kvm_configure_mmu(npt_enabled, get_npt_level(),
-			  get_npt_level(), PG_LEVEL_1G);
-	pr_info("kvm: Nested Paging %sabled\n", npt_enabled ? "en" : "dis");
-
-	/* Note, SEV setup consumes npt_enabled. */
-	sev_hardware_setup();
-
-	svm_hv_hardware_setup();
-
-	svm_adjust_mmio_mask();
-
-	for_each_possible_cpu(cpu) {
-		r = svm_cpu_init(cpu);
-		if (r)
-			goto err;
-	}
-
-	if (nrips) {
-		if (!boot_cpu_has(X86_FEATURE_NRIPS))
-			nrips = false;
-	}
-
-	enable_apicv = avic = avic && npt_enabled && boot_cpu_has(X86_FEATURE_AVIC);
-
-	if (enable_apicv) {
-		pr_info("AVIC enabled\n");
-
-		amd_iommu_register_ga_log_notifier(&avic_ga_log_notifier);
-	}
-
-	if (vls) {
-		if (!npt_enabled ||
-		    !boot_cpu_has(X86_FEATURE_V_VMSAVE_VMLOAD) ||
-		    !IS_ENABLED(CONFIG_X86_64)) {
-			vls = false;
-		} else {
-			pr_info("Virtual VMLOAD VMSAVE supported\n");
-		}
-	}
-
-	if (boot_cpu_has(X86_FEATURE_SVME_ADDR_CHK))
-		svm_gp_erratum_intercept = false;
-
-	if (vgif) {
-		if (!boot_cpu_has(X86_FEATURE_VGIF))
-			vgif = false;
-		else
-			pr_info("Virtual GIF supported\n");
-	}
-
-	if (lbrv) {
-		if (!boot_cpu_has(X86_FEATURE_LBRV))
-			lbrv = false;
-		else
-			pr_info("LBR virtualization supported\n");
-	}
-
-	if (!pmu)
-		pr_info("PMU virtualization is disabled\n");
-
-	svm_set_cpu_caps();
-
-	/*
-	 * It seems that on AMD processors PTE's accessed bit is
-	 * being set by the CPU hardware before the NPF vmexit.
-	 * This is not expected behaviour and our tests fail because
-	 * of it.
-	 * A workaround here is to disable support for
-	 * GUEST_MAXPHYADDR < HOST_MAXPHYADDR if NPT is enabled.
-	 * In this case userspace can know if there is support using
-	 * KVM_CAP_SMALLER_MAXPHYADDR extension and decide how to handle
-	 * it
-	 * If future AMD CPU models change the behaviour described above,
-	 * this variable can be changed accordingly
-	 */
-	allow_smaller_maxphyaddr = !npt_enabled;
-
-	return 0;
-
-err:
-	svm_hardware_teardown();
-	return r;
-}
-
 static void init_seg(struct vmcb_seg *seg)
 {
 	seg->selector = 0;
@@ -4714,6 +4481,239 @@ static struct kvm_x86_ops svm_x86_ops __initdata = {
 	.vcpu_deliver_sipi_vector = svm_vcpu_deliver_sipi_vector,
 };
 
+/*
+ * The default MMIO mask is a single bit (excluding the present bit),
+ * which could conflict with the memory encryption bit. Check for
+ * memory encryption support and override the default MMIO mask if
+ * memory encryption is enabled.
+ */
+static __init void svm_adjust_mmio_mask(void)
+{
+	unsigned int enc_bit, mask_bit;
+	u64 msr, mask;
+
+	/* If there is no memory encryption support, use existing mask */
+	if (cpuid_eax(0x80000000) < 0x8000001f)
+		return;
+
+	/* If memory encryption is not enabled, use existing mask */
+	rdmsrl(MSR_AMD64_SYSCFG, msr);
+	if (!(msr & MSR_AMD64_SYSCFG_MEM_ENCRYPT))
+		return;
+
+	enc_bit = cpuid_ebx(0x8000001f) & 0x3f;
+	mask_bit = boot_cpu_data.x86_phys_bits;
+
+	/* Increment the mask bit if it is the same as the encryption bit */
+	if (enc_bit == mask_bit)
+		mask_bit++;
+
+	/*
+	 * If the mask bit location is below 52, then some bits above the
+	 * physical addressing limit will always be reserved, so use the
+	 * rsvd_bits() function to generate the mask. This mask, along with
+	 * the present bit, will be used to generate a page fault with
+	 * PFER.RSV = 1.
+	 *
+	 * If the mask bit location is 52 (or above), then clear the mask.
+	 */
+	mask = (mask_bit < 52) ? rsvd_bits(mask_bit, 51) | PT_PRESENT_MASK : 0;
+
+	kvm_mmu_set_mmio_spte_mask(mask, mask, PT_WRITABLE_MASK | PT_USER_MASK);
+}
+
+static __init void svm_set_cpu_caps(void)
+{
+	kvm_set_cpu_caps();
+
+	supported_xss = 0;
+
+	/* CPUID 0x80000001 and 0x8000000A (SVM features) */
+	if (nested) {
+		kvm_cpu_cap_set(X86_FEATURE_SVM);
+
+		if (nrips)
+			kvm_cpu_cap_set(X86_FEATURE_NRIPS);
+
+		if (npt_enabled)
+			kvm_cpu_cap_set(X86_FEATURE_NPT);
+
+		if (tsc_scaling)
+			kvm_cpu_cap_set(X86_FEATURE_TSCRATEMSR);
+
+		/* Nested VM can receive #VMEXIT instead of triggering #GP */
+		kvm_cpu_cap_set(X86_FEATURE_SVME_ADDR_CHK);
+	}
+
+	/* CPUID 0x80000008 */
+	if (boot_cpu_has(X86_FEATURE_LS_CFG_SSBD) ||
+	    boot_cpu_has(X86_FEATURE_AMD_SSBD))
+		kvm_cpu_cap_set(X86_FEATURE_VIRT_SSBD);
+
+	/* AMD PMU PERFCTR_CORE CPUID */
+	if (pmu && boot_cpu_has(X86_FEATURE_PERFCTR_CORE))
+		kvm_cpu_cap_set(X86_FEATURE_PERFCTR_CORE);
+
+	/* CPUID 0x8000001F (SME/SEV features) */
+	sev_set_cpu_caps();
+}
+
+static __init int svm_hardware_setup(void)
+{
+	int cpu;
+	struct page *iopm_pages;
+	void *iopm_va;
+	int r;
+	unsigned int order = get_order(IOPM_SIZE);
+
+	/*
+	 * NX is required for shadow paging and for NPT if the NX huge pages
+	 * mitigation is enabled.
+	 */
+	if (!boot_cpu_has(X86_FEATURE_NX)) {
+		pr_err_ratelimited("NX (Execute Disable) not supported\n");
+		return -EOPNOTSUPP;
+	}
+	kvm_enable_efer_bits(EFER_NX);
+
+	iopm_pages = alloc_pages(GFP_KERNEL, order);
+
+	if (!iopm_pages)
+		return -ENOMEM;
+
+	iopm_va = page_address(iopm_pages);
+	memset(iopm_va, 0xff, PAGE_SIZE * (1 << order));
+	iopm_base = page_to_pfn(iopm_pages) << PAGE_SHIFT;
+
+	init_msrpm_offsets();
+
+	supported_xcr0 &= ~(XFEATURE_MASK_BNDREGS | XFEATURE_MASK_BNDCSR);
+
+	if (boot_cpu_has(X86_FEATURE_FXSR_OPT))
+		kvm_enable_efer_bits(EFER_FFXSR);
+
+	if (tsc_scaling) {
+		if (!boot_cpu_has(X86_FEATURE_TSCRATEMSR)) {
+			tsc_scaling = false;
+		} else {
+			pr_info("TSC scaling supported\n");
+			kvm_has_tsc_control = true;
+			kvm_max_tsc_scaling_ratio = TSC_RATIO_MAX;
+			kvm_tsc_scaling_ratio_frac_bits = 32;
+		}
+	}
+
+	tsc_aux_uret_slot = kvm_add_user_return_msr(MSR_TSC_AUX);
+
+	/* Check for pause filtering support */
+	if (!boot_cpu_has(X86_FEATURE_PAUSEFILTER)) {
+		pause_filter_count = 0;
+		pause_filter_thresh = 0;
+	} else if (!boot_cpu_has(X86_FEATURE_PFTHRESHOLD)) {
+		pause_filter_thresh = 0;
+	}
+
+	if (nested) {
+		printk(KERN_INFO "kvm: Nested Virtualization enabled\n");
+		kvm_enable_efer_bits(EFER_SVME | EFER_LMSLE);
+	}
+
+	/*
+	 * KVM's MMU doesn't support using 2-level paging for itself, and thus
+	 * NPT isn't supported if the host is using 2-level paging since host
+	 * CR4 is unchanged on VMRUN.
+	 */
+	if (!IS_ENABLED(CONFIG_X86_64) && !IS_ENABLED(CONFIG_X86_PAE))
+		npt_enabled = false;
+
+	if (!boot_cpu_has(X86_FEATURE_NPT))
+		npt_enabled = false;
+
+	/* Force VM NPT level equal to the host's paging level */
+	kvm_configure_mmu(npt_enabled, get_npt_level(),
+			  get_npt_level(), PG_LEVEL_1G);
+	pr_info("kvm: Nested Paging %sabled\n", npt_enabled ? "en" : "dis");
+
+	/* Note, SEV setup consumes npt_enabled. */
+	sev_hardware_setup();
+
+	svm_hv_hardware_setup();
+
+	svm_adjust_mmio_mask();
+
+	for_each_possible_cpu(cpu) {
+		r = svm_cpu_init(cpu);
+		if (r)
+			goto err;
+	}
+
+	if (nrips) {
+		if (!boot_cpu_has(X86_FEATURE_NRIPS))
+			nrips = false;
+	}
+
+	enable_apicv = avic = avic && npt_enabled && boot_cpu_has(X86_FEATURE_AVIC);
+
+	if (enable_apicv) {
+		pr_info("AVIC enabled\n");
+
+		amd_iommu_register_ga_log_notifier(&avic_ga_log_notifier);
+	}
+
+	if (vls) {
+		if (!npt_enabled ||
+		    !boot_cpu_has(X86_FEATURE_V_VMSAVE_VMLOAD) ||
+		    !IS_ENABLED(CONFIG_X86_64)) {
+			vls = false;
+		} else {
+			pr_info("Virtual VMLOAD VMSAVE supported\n");
+		}
+	}
+
+	if (boot_cpu_has(X86_FEATURE_SVME_ADDR_CHK))
+		svm_gp_erratum_intercept = false;
+
+	if (vgif) {
+		if (!boot_cpu_has(X86_FEATURE_VGIF))
+			vgif = false;
+		else
+			pr_info("Virtual GIF supported\n");
+	}
+
+	if (lbrv) {
+		if (!boot_cpu_has(X86_FEATURE_LBRV))
+			lbrv = false;
+		else
+			pr_info("LBR virtualization supported\n");
+	}
+
+	if (!pmu)
+		pr_info("PMU virtualization is disabled\n");
+
+	svm_set_cpu_caps();
+
+	/*
+	 * It seems that on AMD processors PTE's accessed bit is
+	 * being set by the CPU hardware before the NPF vmexit.
+	 * This is not expected behaviour and our tests fail because
+	 * of it.
+	 * A workaround here is to disable support for
+	 * GUEST_MAXPHYADDR < HOST_MAXPHYADDR if NPT is enabled.
+	 * In this case userspace can know if there is support using
+	 * KVM_CAP_SMALLER_MAXPHYADDR extension and decide how to handle
+	 * it
+	 * If future AMD CPU models change the behaviour described above,
+	 * this variable can be changed accordingly
+	 */
+	allow_smaller_maxphyaddr = !npt_enabled;
+
+	return 0;
+
+err:
+	svm_hardware_teardown();
+	return r;
+}
+
 static struct kvm_x86_init_ops svm_init_ops __initdata = {
 	.cpu_has_kvm_support = has_svm,
 	.disabled_by_bios = is_disabled,
-- 
2.34.1.400.ga245620fadb-goog


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH v3 23/26] KVM: SVM: Nullify vcpu_(un)blocking() hooks if AVIC is disabled
  2021-12-08  1:52 [PATCH v3 00/26] KVM: x86: Halt and APICv overhaul Sean Christopherson
                   ` (21 preceding siblings ...)
  2021-12-08  1:52 ` [PATCH v3 22/26] KVM: SVM: Move svm_hardware_setup() and its helpers below svm_x86_ops Sean Christopherson
@ 2021-12-08  1:52 ` Sean Christopherson
  2021-12-08  1:52 ` [PATCH v3 24/26] KVM: x86: Skip APICv update if APICv is disabled at the module level Sean Christopherson
                   ` (5 subsequent siblings)
  28 siblings, 0 replies; 50+ messages in thread
From: Sean Christopherson @ 2021-12-08  1:52 UTC (permalink / raw)
  To: Paolo Bonzini, Joerg Roedel
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Suravee Suthikulpanit, kvm, iommu, linux-kernel, Maxim Levitsky

Nullify svm_x86_ops.vcpu_(un)blocking if AVIC/APICv is disabled, as the
hooks are necessary only to clear the vCPU's IsRunning entry in the
Physical APIC ID table and to update IRTE entries if the VM has a
pass-through device attached.

Opportunistically rename the helpers to clarify their AVIC relationship.
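
Nullifying the hooks is safe because optional kvm_x86_ops callbacks are
invoked via static_call_cond(), which degrades to a nop when the hook is
NULL, e.g. (a sketch of the existing x86 wiring, not new code):

	static inline void kvm_arch_vcpu_blocking(struct kvm_vcpu *vcpu)
	{
		/* Nop if svm_x86_ops.vcpu_blocking was nullified. */
		static_call_cond(kvm_x86_vcpu_blocking)(vcpu);
	}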

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/svm/avic.c | 4 ++--
 arch/x86/kvm/svm/svm.c  | 7 +++++--
 arch/x86/kvm/svm/svm.h  | 4 ++--
 3 files changed, 9 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/svm/avic.c b/arch/x86/kvm/svm/avic.c
index dd0d688bc342..26ed5325c593 100644
--- a/arch/x86/kvm/svm/avic.c
+++ b/arch/x86/kvm/svm/avic.c
@@ -1026,7 +1026,7 @@ void avic_vcpu_put(struct kvm_vcpu *vcpu)
 	WRITE_ONCE(*(svm->avic_physical_id_cache), entry);
 }
 
-void svm_vcpu_blocking(struct kvm_vcpu *vcpu)
+void avic_vcpu_blocking(struct kvm_vcpu *vcpu)
 {
 	if (!kvm_vcpu_apicv_active(vcpu))
 		return;
@@ -1046,7 +1046,7 @@ void svm_vcpu_blocking(struct kvm_vcpu *vcpu)
 	preempt_enable();
 }
 
-void svm_vcpu_unblocking(struct kvm_vcpu *vcpu)
+void avic_vcpu_unblocking(struct kvm_vcpu *vcpu)
 {
 	int cpu;
 
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 98f7c454a784..e57e6857e063 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -4370,8 +4370,8 @@ static struct kvm_x86_ops svm_x86_ops __initdata = {
 	.prepare_guest_switch = svm_prepare_guest_switch,
 	.vcpu_load = svm_vcpu_load,
 	.vcpu_put = svm_vcpu_put,
-	.vcpu_blocking = svm_vcpu_blocking,
-	.vcpu_unblocking = svm_vcpu_unblocking,
+	.vcpu_blocking = avic_vcpu_blocking,
+	.vcpu_unblocking = avic_vcpu_unblocking,
 
 	.update_exception_bitmap = svm_update_exception_bitmap,
 	.get_msr_feature = svm_get_msr_feature,
@@ -4658,6 +4658,9 @@ static __init int svm_hardware_setup(void)
 		pr_info("AVIC enabled\n");
 
 		amd_iommu_register_ga_log_notifier(&avic_ga_log_notifier);
+	} else {
+		svm_x86_ops.vcpu_blocking = NULL;
+		svm_x86_ops.vcpu_unblocking = NULL;
 	}
 
 	if (vls) {
diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h
index 83ced47fa9b9..daa8ca84afcc 100644
--- a/arch/x86/kvm/svm/svm.h
+++ b/arch/x86/kvm/svm/svm.h
@@ -593,8 +593,8 @@ int svm_deliver_avic_intr(struct kvm_vcpu *vcpu, int vec);
 bool svm_dy_apicv_has_pending_interrupt(struct kvm_vcpu *vcpu);
 int svm_update_pi_irte(struct kvm *kvm, unsigned int host_irq,
 		       uint32_t guest_irq, bool set);
-void svm_vcpu_blocking(struct kvm_vcpu *vcpu);
-void svm_vcpu_unblocking(struct kvm_vcpu *vcpu);
+void avic_vcpu_blocking(struct kvm_vcpu *vcpu);
+void avic_vcpu_unblocking(struct kvm_vcpu *vcpu);
 
 /* sev.c */
 
-- 
2.34.1.400.ga245620fadb-goog


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH v3 24/26] KVM: x86: Skip APICv update if APICv is disabled at the module level
  2021-12-08  1:52 [PATCH v3 00/26] KVM: x86: Halt and APICv overhaul Sean Christopherson
                   ` (22 preceding siblings ...)
  2021-12-08  1:52 ` [PATCH v3 23/26] KVM: SVM: Nullify vcpu_(un)blocking() hooks if AVIC is disabled Sean Christopherson
@ 2021-12-08  1:52 ` Sean Christopherson
  2021-12-08  1:52 ` [PATCH v3 25/26] KVM: x86: Drop NULL check on kvm_x86_ops.check_apicv_inhibit_reasons Sean Christopherson
                   ` (4 subsequent siblings)
  28 siblings, 0 replies; 50+ messages in thread
From: Sean Christopherson @ 2021-12-08  1:52 UTC (permalink / raw)
  To: Paolo Bonzini, Joerg Roedel
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Suravee Suthikulpanit, kvm, iommu, linux-kernel, Maxim Levitsky

Bail from the APICv update paths _before_ taking apicv_update_lock if
APICv is disabled at the module level.  kvm_request_apicv_update() in
particular is invoked from multiple paths that can be reached without
APICv being enabled, e.g. svm_enable_irq_window(), and taking the
rw_sem for write when APICv is disabled may introduce unnecessary
contention and stalls.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/hyperv.c | 3 +++
 arch/x86/kvm/x86.c    | 3 +++
 2 files changed, 6 insertions(+)

diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c
index 7179fa645eda..175c1bace091 100644
--- a/arch/x86/kvm/hyperv.c
+++ b/arch/x86/kvm/hyperv.c
@@ -112,6 +112,9 @@ static void synic_update_vector(struct kvm_vcpu_hv_synic *synic,
 	if (!!auto_eoi_old == !!auto_eoi_new)
 		return;
 
+	if (!enable_apicv)
+		return;
+
 	down_write(&vcpu->kvm->arch.apicv_update_lock);
 
 	if (auto_eoi_new)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index abf99b77883e..c804cc39c90d 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -9602,6 +9602,9 @@ EXPORT_SYMBOL_GPL(__kvm_request_apicv_update);
 
 void kvm_request_apicv_update(struct kvm *kvm, bool activate, ulong bit)
 {
+	if (!enable_apicv)
+		return;
+
 	down_write(&kvm->arch.apicv_update_lock);
 	__kvm_request_apicv_update(kvm, activate, bit);
 	up_write(&kvm->arch.apicv_update_lock);
-- 
2.34.1.400.ga245620fadb-goog


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH v3 25/26] KVM: x86: Drop NULL check on kvm_x86_ops.check_apicv_inhibit_reasons
  2021-12-08  1:52 [PATCH v3 00/26] KVM: x86: Halt and APICv overhaul Sean Christopherson
                   ` (23 preceding siblings ...)
  2021-12-08  1:52 ` [PATCH v3 24/26] KVM: x86: Skip APICv update if APICv is disabled at the module level Sean Christopherson
@ 2021-12-08  1:52 ` Sean Christopherson
  2021-12-08  1:52 ` [PATCH v3 26/26] KVM: x86: Unexport __kvm_request_apicv_update() Sean Christopherson
                   ` (3 subsequent siblings)
  28 siblings, 0 replies; 50+ messages in thread
From: Sean Christopherson @ 2021-12-08  1:52 UTC (permalink / raw)
  To: Paolo Bonzini, Joerg Roedel
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Suravee Suthikulpanit, kvm, iommu, linux-kernel, Maxim Levitsky

Drop the useless NULL check on kvm_x86_ops.check_apicv_inhibit_reasons
when handling an APICv update; both VMX and SVM unconditionally implement
the helper and leave it non-NULL even if APICv is disabled at the module
level.  The latter is a moot point now that __kvm_request_apicv_update()
is called if and only if enable_apicv is true.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/x86.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index c804cc39c90d..fc52b97d6aa1 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -9564,8 +9564,7 @@ void __kvm_request_apicv_update(struct kvm *kvm, bool activate, ulong bit)
 
 	lockdep_assert_held_write(&kvm->arch.apicv_update_lock);
 
-	if (!kvm_x86_ops.check_apicv_inhibit_reasons ||
-	    !static_call(kvm_x86_check_apicv_inhibit_reasons)(bit))
+	if (!static_call(kvm_x86_check_apicv_inhibit_reasons)(bit))
 		return;
 
 	old = new = kvm->arch.apicv_inhibit_reasons;
-- 
2.34.1.400.ga245620fadb-goog


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* [PATCH v3 26/26] KVM: x86: Unexport __kvm_request_apicv_update()
  2021-12-08  1:52 [PATCH v3 00/26] KVM: x86: Halt and APICv overhaul Sean Christopherson
                   ` (24 preceding siblings ...)
  2021-12-08  1:52 ` [PATCH v3 25/26] KVM: x86: Drop NULL check on kvm_x86_ops.check_apicv_inhibit_reasons Sean Christopherson
@ 2021-12-08  1:52 ` Sean Christopherson
  2021-12-08  9:04 ` [PATCH v3 00/26] KVM: x86: Halt and APICv overhaul Paolo Bonzini
                   ` (2 subsequent siblings)
  28 siblings, 0 replies; 50+ messages in thread
From: Sean Christopherson @ 2021-12-08  1:52 UTC (permalink / raw)
  To: Paolo Bonzini, Joerg Roedel
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Suravee Suthikulpanit, kvm, iommu, linux-kernel, Maxim Levitsky

Unexport __kvm_request_apicv_update(); it's not used by vendor code and
should never be.  The only reason it's exposed at all
is because Hyper-V's SynIC needs to track how many auto-EOIs are in use,
and it's convenient to use apicv_update_lock to guard that tracking.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/x86.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index fc52b97d6aa1..0774cf4ccd88 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -9597,7 +9597,6 @@ void __kvm_request_apicv_update(struct kvm *kvm, bool activate, ulong bit)
 	} else
 		kvm->arch.apicv_inhibit_reasons = new;
 }
-EXPORT_SYMBOL_GPL(__kvm_request_apicv_update);
 
 void kvm_request_apicv_update(struct kvm *kvm, bool activate, ulong bit)
 {
-- 
2.34.1.400.ga245620fadb-goog


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* Re: [PATCH v3 00/26] KVM: x86: Halt and APICv overhaul
  2021-12-08  1:52 [PATCH v3 00/26] KVM: x86: Halt and APICv overhaul Sean Christopherson
                   ` (25 preceding siblings ...)
  2021-12-08  1:52 ` [PATCH v3 26/26] KVM: x86: Unexport __kvm_request_apicv_update() Sean Christopherson
@ 2021-12-08  9:04 ` Paolo Bonzini
  2021-12-08 14:51 ` Paolo Bonzini
  2021-12-08 23:00 ` Maxim Levitsky
  28 siblings, 0 replies; 50+ messages in thread
From: Paolo Bonzini @ 2021-12-08  9:04 UTC (permalink / raw)
  To: Sean Christopherson, Joerg Roedel
  Cc: Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Suravee Suthikulpanit,
	kvm, iommu, linux-kernel, Maxim Levitsky

On 12/8/21 02:52, Sean Christopherson wrote:
> Overhaul and cleanup APIC virtualization (Posted Interrupts on Intel VMX,
> AVIC on AMD SVM) to streamline things as much as possible, remove a bunch
> of cruft, and document the lurking gotchas along the way.
> 
> Patch 01 is a fix from Paolo that's already been merged but hasn't made
> its way to kvm/queue.  It's included here to avoid a number of conflicts.
> 
> Based on kvm/queue, commit 1cf84614b04a ("KVM: x86: Exit to ...")

Thanks, I've now finally pushed to kvm/next.  Debug kernels look like 
they have a few issues that caused my tests to fail, and I wanted to 
make sure they weren't related to the sizable number of patches already 
in the queue.

Paolo

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v3 21/26] KVM: SVM: Drop AVIC's intermediate avic_set_running() helper
  2021-12-08  1:52 ` [PATCH v3 21/26] KVM: SVM: Drop AVIC's intermediate avic_set_running() helper Sean Christopherson
@ 2021-12-08 14:43   ` Paolo Bonzini
  2021-12-08 15:03     ` Maxim Levitsky
  2021-12-08 15:43     ` Sean Christopherson
  0 siblings, 2 replies; 50+ messages in thread
From: Paolo Bonzini @ 2021-12-08 14:43 UTC (permalink / raw)
  To: Sean Christopherson, Joerg Roedel
  Cc: Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Suravee Suthikulpanit,
	kvm, iommu, linux-kernel, Maxim Levitsky

On 12/8/21 02:52, Sean Christopherson wrote:
> +	/*
> +	 * Unload the AVIC when the vCPU is about to block, _before_ the vCPU
> +	 * actually blocks.  The vCPU needs to be marked IsRunning=0 before the
> +	 * final pass over the vIRR via kvm_vcpu_check_block().  Any IRQs that
> +	 * arrive before IsRunning=0 will not signal the doorbell, i.e. it's
> +	 * KVM's responsibility to ensure there are no pending IRQs in the vIRR
> +	 * after IsRunning is cleared, prior to scheduling out the vCPU.

I prefer to phrase this around paired memory barriers and the usual 
store/smp_mb/load lockless idiom:

	/*
	 * Unload the AVIC when the vCPU is about to block, _before_
	 * the vCPU actually blocks.
	 *
	 * Any IRQs that arrive before IsRunning=0 will not cause an
	 * incomplete IPI vmexit on the source, therefore vIRR will also
	 * be checked by kvm_vcpu_check_block() before blocking.  The
	 * memory barrier implicit in set_current_state orders writing
	 * IsRunning=0 before reading the vIRR.  The processor needs a
	 * matching memory barrier on interrupt delivery between writing
	 * IRR and reading IsRunning; the lack of this barrier might be
	 * the cause of erratum #1235.
	 */

Is there any nuance that I am missing?

Paolo

> +	 */
> +	avic_vcpu_put(vcpu);
> +


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v3 00/26] KVM: x86: Halt and APICv overhaul
  2021-12-08  1:52 [PATCH v3 00/26] KVM: x86: Halt and APICv overhaul Sean Christopherson
                   ` (26 preceding siblings ...)
  2021-12-08  9:04 ` [PATCH v3 00/26] KVM: x86: Halt and APICv overhaul Paolo Bonzini
@ 2021-12-08 14:51 ` Paolo Bonzini
  2021-12-08 23:00 ` Maxim Levitsky
  28 siblings, 0 replies; 50+ messages in thread
From: Paolo Bonzini @ 2021-12-08 14:51 UTC (permalink / raw)
  To: Sean Christopherson, Joerg Roedel
  Cc: Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Suravee Suthikulpanit,
	kvm, iommu, linux-kernel, Maxim Levitsky

On 12/8/21 02:52, Sean Christopherson wrote:
> Overhaul and cleanup APIC virtualization (Posted Interrupts on Intel VMX,
> AVIC on AMD SVM) to streamline things as much as possible, remove a bunch
> of cruft, and document the lurking gotchas along the way.
> 
> Patch 01 is a fix from Paolo that's already been merged but hasn't made
> its way to kvm/queue.  It's included here to avoid a number of conflicts.
> 
> Based on kvm/queue, commit 1cf84614b04a ("KVM: x86: Exit to ...")

Queued, thanks; patches 24-26 for 5.16 and the rest for 5.17.

Just one nit: please tune the language to use slightly fewer idiomatic 
phrases, as they can be a bit taxing on non-native speakers.  I for one 
enjoy learning a few new words, and it even adds some "personality" to 
the remote interactions, but it probably distracts people who aren't 
too proficient in English.

Paolo

> v3:
>   - Rebase to kvm/queue (and drop non-x86 patches as they've been queued).
>   - Redo AVIC patches, sadly the vcpu_(un)blocking() hooks need to stay.
>   - Add a patch to fix a missing (documentation-only) barrier in nested
>     posted interrupt delivery. [Paolo]
>   - Collect reviews.
> 
> v2:
>   - https://lore.kernel.org/all/20211009021236.4122790-1-seanjc@google.com/
>   - Collect reviews. [Christian, David]
>   - Add patch to move arm64 WFI functionality out of hooks. [Marc]
>   - Add RISC-V to the fun.
>   - Add all the APICv fun.
> 
> v1: https://lkml.kernel.org/r/20210925005528.1145584-1-seanjc@google.com
> 
> Paolo Bonzini (1):
>    KVM: fix avic_set_running for preemptable kernels
> 
> Sean Christopherson (25):
>    KVM: nVMX: Ensure vCPU honors event request if posting nested IRQ
>      fails
>    KVM: VMX: Clean up PI pre/post-block WARNs
>    KVM: VMX: Handle PI wakeup shenanigans during vcpu_put/load
>    KVM: Drop unused kvm_vcpu.pre_pcpu field
>    KVM: Move x86 VMX's posted interrupt list_head to vcpu_vmx
>    KVM: VMX: Move preemption timer <=> hrtimer dance to common x86
>    KVM: x86: Unexport LAPIC's switch_to_{hv,sw}_timer() helpers
>    KVM: x86: Remove defunct pre_block/post_block kvm_x86_ops hooks
>    KVM: SVM: Signal AVIC doorbell iff vCPU is in guest mode
>    KVM: SVM: Don't bother checking for "running" AVIC when kicking for
>      IPIs
>    KVM: SVM: Remove unnecessary APICv/AVIC update in vCPU unblocking path
>    KVM: SVM: Use kvm_vcpu_is_blocking() in AVIC load to handle preemption
>    KVM: SVM: Skip AVIC and IRTE updates when loading blocking vCPU
>    iommu/amd: KVM: SVM: Use pCPU to infer IsRun state for IRTE
>    KVM: VMX: Don't do full kick when triggering posted interrupt "fails"
>    KVM: VMX: Wake vCPU when delivering posted IRQ even if vCPU == this
>      vCPU
>    KVM: VMX: Pass desired vector instead of bool for triggering posted
>      IRQ
>    KVM: VMX: Fold fallback path into triggering posted IRQ helper
>    KVM: VMX: Don't do full kick when handling posted interrupt wakeup
>    KVM: SVM: Drop AVIC's intermediate avic_set_running() helper
>    KVM: SVM: Move svm_hardware_setup() and its helpers below svm_x86_ops
>    KVM: SVM: Nullify vcpu_(un)blocking() hooks if AVIC is disabled
>    KVM: x86: Skip APICv update if APICv is disable at the module level
>    KVM: x86: Drop NULL check on kvm_x86_ops.check_apicv_inhibit_reasons
>    KVM: x86: Unexport __kvm_request_apicv_update()
> 
>   arch/x86/include/asm/kvm-x86-ops.h |   2 -
>   arch/x86/include/asm/kvm_host.h    |  12 -
>   arch/x86/kvm/hyperv.c              |   3 +
>   arch/x86/kvm/lapic.c               |   2 -
>   arch/x86/kvm/svm/avic.c            | 116 ++++---
>   arch/x86/kvm/svm/svm.c             | 479 ++++++++++++++---------------
>   arch/x86/kvm/svm/svm.h             |  16 +-
>   arch/x86/kvm/vmx/posted_intr.c     | 234 +++++++-------
>   arch/x86/kvm/vmx/posted_intr.h     |   8 +-
>   arch/x86/kvm/vmx/vmx.c             |  66 ++--
>   arch/x86/kvm/vmx/vmx.h             |   3 +
>   arch/x86/kvm/x86.c                 |  41 ++-
>   drivers/iommu/amd/iommu.c          |   6 +-
>   include/linux/amd-iommu.h          |   6 +-
>   include/linux/kvm_host.h           |   3 -
>   virt/kvm/kvm_main.c                |   3 -
>   16 files changed, 510 insertions(+), 490 deletions(-)
> 


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v3 21/26] KVM: SVM: Drop AVIC's intermediate avic_set_running() helper
  2021-12-08 14:43   ` Paolo Bonzini
@ 2021-12-08 15:03     ` Maxim Levitsky
  2021-12-08 15:43     ` Sean Christopherson
  1 sibling, 0 replies; 50+ messages in thread
From: Maxim Levitsky @ 2021-12-08 15:03 UTC (permalink / raw)
  To: Paolo Bonzini, Sean Christopherson, Joerg Roedel
  Cc: Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Suravee Suthikulpanit,
	kvm, iommu, linux-kernel

On Wed, 2021-12-08 at 15:43 +0100, Paolo Bonzini wrote:
> On 12/8/21 02:52, Sean Christopherson wrote:
> > +	/*
> > +	 * Unload the AVIC when the vCPU is about to block, _before_ the vCPU
> > +	 * actually blocks.  The vCPU needs to be marked IsRunning=0 before the
> > +	 * final pass over the vIRR via kvm_vcpu_check_block().  Any IRQs that
> > +	 * arrive before IsRunning=0 will not signal the doorbell, i.e. it's
> > +	 * KVM's responsibility to ensure there are no pending IRQs in the vIRR
> > +	 * after IsRunning is cleared, prior to scheduling out the vCPU.
> 
> I prefer to phrase this around paired memory barriers and the usual 
> store/smp_mb/load lockless idiom:
> 
> 	/*
> 	 * Unload the AVIC when the vCPU is about to block, _before_
> 	 * the vCPU actually blocks.
> 	 *
> 	 * Any IRQs that arrive before IsRunning=0 will not cause an
> 	 * incomplete IPI vmexit on the source, therefore vIRR will also
> 	 * be checked by kvm_vcpu_check_block() before blocking.  The
> 	 * memory barrier implicit in set_current_state orders writing

If I understand correctly, this is a full memory barrier and not only a write barrier?

 
Also, just to document it: I found out that the lack of a subsequent vIRR check
in 'KVM: SVM: Unconditionally mark AVIC as running on vCPU load (with APICv)'
is what made AVIC totally unusable on my systems.
That patch would set is_running right in the middle of schedule(), with no
vIRR check done afterwards.
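
For reference, the blocking flow that makes this safe looks roughly like
the following (a hand-written sketch of the generic halt path, not the
literal kvm_main.c code):

	static void vcpu_block_sketch(struct kvm_vcpu *vcpu)
	{
		kvm_arch_vcpu_blocking(vcpu);		/* AVIC: IsRunning = 0 */

		for (;;) {
			/* set_current_state() implies a full smp_mb(). */
			set_current_state(TASK_INTERRUPTIBLE);

			/* The mandatory vIRR scan after IsRunning = 0. */
			if (kvm_vcpu_check_block(vcpu) < 0)
				break;

			schedule();
		}
		__set_current_state(TASK_RUNNING);

		kvm_arch_vcpu_unblocking(vcpu);		/* AVIC: IsRunning = 1 */
	}

Flipping is_running back to 1 from inside schedule() bypasses that final
scan, which is exactly the hole described above.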
 
A small update on my adventures with AVIC: on the two Milan machines I got my
hands on, AVIC is disabled in CPUID on both, but it seems to work. None of my
reproducers manage to hit that errata, and on top of that I have a set of
patches that make AVIC co-exist with nesting; it appears to work while stress
tested with my KVM unit test, which I updated to run a nested guest on one of
the vCPUs. I have mostly been testing the second machine this week, though.
 
I'll post my patches as soon as I rebase them on top of this patch series,
after I review it.
I'll post the unit test soon too.
 
Still, my gut feeling is that the errata is still there - I am waiting for
AMD to provide any info they can on this.


Best regards,
	Maxim Levitsky


> 	 * IsRunning=0 before reading the vIRR.  The processor needs a
> 	 * matching memory barrier on interrupt delivery between writing
> 	 * IRR and reading IsRunning; the lack of this barrier might be
> 	 * the cause of errata #1235).
> 	 */
> 
> Is there any nuance that I am missing?
> 
> Paolo
> 
> > +	 */
> > +	avic_vcpu_put(vcpu);
> > +



^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v3 21/26] KVM: SVM: Drop AVIC's intermediate avic_set_running() helper
  2021-12-08 14:43   ` Paolo Bonzini
  2021-12-08 15:03     ` Maxim Levitsky
@ 2021-12-08 15:43     ` Sean Christopherson
  1 sibling, 0 replies; 50+ messages in thread
From: Sean Christopherson @ 2021-12-08 15:43 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Joerg Roedel, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Suravee Suthikulpanit, kvm, iommu, linux-kernel, Maxim Levitsky

On Wed, Dec 08, 2021, Paolo Bonzini wrote:
> On 12/8/21 02:52, Sean Christopherson wrote:
> > +	/*
> > +	 * Unload the AVIC when the vCPU is about to block, _before_ the vCPU
> > +	 * actually blocks.  The vCPU needs to be marked IsRunning=0 before the
> > +	 * final pass over the vIRR via kvm_vcpu_check_block().  Any IRQs that
> > +	 * arrive before IsRunning=0 will not signal the doorbell, i.e. it's
> > +	 * KVM's responsibility to ensure there are no pending IRQs in the vIRR
> > +	 * after IsRunning is cleared, prior to scheduling out the vCPU.
> 
> I prefer to phrase this around paired memory barriers and the usual
> store/smp_mb/load lockless idiom:

I've no objection to that; my goal is/was purely to emphasize the need to manually
process the vIRR after clearing IsRunning.
 
> 	/*
> 	 * Unload the AVIC when the vCPU is about to block, _before_
> 	 * the vCPU actually blocks.
> 	 *
> 	 * Any IRQs that arrive before IsRunning=0 will not cause an
> 	 * incomplete IPI vmexit on the source,

It's not just IPIs; the GA log will suffer the same fate.  That's why I
didn't mention incomplete VM-Exits.  I'm certainly not opposed to that clarification,
but I don't want readers to walk away thinking this is only a problem for IPIs.

> therefore vIRR will also

"s/vIRR will/the vIRR must" to make it abundantly clear that checking the vIRR
is effectively a hard requirement.

> 	 * be checked by kvm_vcpu_check_block() before blocking.  The
> 	 * memory barrier implicit in set_current_state orders writing

set_current_state()

> 	 * IsRunning=0 before reading the vIRR.  The processor needs a
> 	 * matching memory barrier on interrupt delivery between writing
> 	 * IRR and reading IsRunning; the lack of this barrier might be

Missing the opening parenthesis.

> 	 * the cause of errata #1235).
> 	 */

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v3 00/26] KVM: x86: Halt and APICv overhaul
  2021-12-08  1:52 [PATCH v3 00/26] KVM: x86: Halt and APICv overhaul Sean Christopherson
                   ` (27 preceding siblings ...)
  2021-12-08 14:51 ` Paolo Bonzini
@ 2021-12-08 23:00 ` Maxim Levitsky
  2021-12-08 23:16   ` Maxim Levitsky
  2021-12-08 23:43   ` Sean Christopherson
  28 siblings, 2 replies; 50+ messages in thread
From: Maxim Levitsky @ 2021-12-08 23:00 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini, Joerg Roedel
  Cc: Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Suravee Suthikulpanit,
	kvm, iommu, linux-kernel

On Wed, 2021-12-08 at 01:52 +0000, Sean Christopherson wrote:
> Overhaul and cleanup APIC virtualization (Posted Interrupts on Intel VMX,
> AVIC on AMD SVM) to streamline things as much as possible, remove a bunch
> of cruft, and document the lurking gotchas along the way.
> 
> Patch 01 is a fix from Paolo that's already been merged but hasn't made
> its way to kvm/queue.  It's included here to avoid a number of conflicts.
> 
> Based on kvm/queue, commit 1cf84614b04a ("KVM: x86: Exit to ...")
> 
> v3:
>  - Rebase to kvm/queue (and drop non-x86 patches as they've been queued). 
>  - Redo AVIC patches, sadly the vcpu_(un)blocking() hooks need to stay.
>  - Add a patch to fix a missing (documentation-only) barrier in nested
>    posted interrupt delivery. [Paolo]
>  - Collect reviews.
> 
> v2:
>  - https://lore.kernel.org/all/20211009021236.4122790-1-seanjc@google.com/
>  - Collect reviews. [Christian, David]
>  - Add patch to move arm64 WFI functionality out of hooks. [Marc]
>  - Add RISC-V to the fun.
>  - Add all the APICv fun.
> 
> v1: https://lkml.kernel.org/r/20210925005528.1145584-1-seanjc@google.com
> 
> Paolo Bonzini (1):
>   KVM: fix avic_set_running for preemptable kernels
> 
> Sean Christopherson (25):
>   KVM: nVMX: Ensure vCPU honors event request if posting nested IRQ
>     fails
>   KVM: VMX: Clean up PI pre/post-block WARNs
>   KVM: VMX: Handle PI wakeup shenanigans during vcpu_put/load
>   KVM: Drop unused kvm_vcpu.pre_pcpu field
>   KVM: Move x86 VMX's posted interrupt list_head to vcpu_vmx
>   KVM: VMX: Move preemption timer <=> hrtimer dance to common x86
>   KVM: x86: Unexport LAPIC's switch_to_{hv,sw}_timer() helpers
>   KVM: x86: Remove defunct pre_block/post_block kvm_x86_ops hooks
>   KVM: SVM: Signal AVIC doorbell iff vCPU is in guest mode
>   KVM: SVM: Don't bother checking for "running" AVIC when kicking for
>     IPIs
>   KVM: SVM: Remove unnecessary APICv/AVIC update in vCPU unblocking path
>   KVM: SVM: Use kvm_vcpu_is_blocking() in AVIC load to handle preemption
>   KVM: SVM: Skip AVIC and IRTE updates when loading blocking vCPU
>   iommu/amd: KVM: SVM: Use pCPU to infer IsRun state for IRTE
>   KVM: VMX: Don't do full kick when triggering posted interrupt "fails"
>   KVM: VMX: Wake vCPU when delivering posted IRQ even if vCPU == this
>     vCPU
>   KVM: VMX: Pass desired vector instead of bool for triggering posted
>     IRQ
>   KVM: VMX: Fold fallback path into triggering posted IRQ helper
>   KVM: VMX: Don't do full kick when handling posted interrupt wakeup
>   KVM: SVM: Drop AVIC's intermediate avic_set_running() helper
>   KVM: SVM: Move svm_hardware_setup() and its helpers below svm_x86_ops
>   KVM: SVM: Nullify vcpu_(un)blocking() hooks if AVIC is disabled
>   KVM: x86: Skip APICv update if APICv is disable at the module level
>   KVM: x86: Drop NULL check on kvm_x86_ops.check_apicv_inhibit_reasons
>   KVM: x86: Unexport __kvm_request_apicv_update()
> 
>  arch/x86/include/asm/kvm-x86-ops.h |   2 -
>  arch/x86/include/asm/kvm_host.h    |  12 -
>  arch/x86/kvm/hyperv.c              |   3 +
>  arch/x86/kvm/lapic.c               |   2 -
>  arch/x86/kvm/svm/avic.c            | 116 ++++---
>  arch/x86/kvm/svm/svm.c             | 479 ++++++++++++++---------------
>  arch/x86/kvm/svm/svm.h             |  16 +-
>  arch/x86/kvm/vmx/posted_intr.c     | 234 +++++++-------
>  arch/x86/kvm/vmx/posted_intr.h     |   8 +-
>  arch/x86/kvm/vmx/vmx.c             |  66 ++--
>  arch/x86/kvm/vmx/vmx.h             |   3 +
>  arch/x86/kvm/x86.c                 |  41 ++-
>  drivers/iommu/amd/iommu.c          |   6 +-
>  include/linux/amd-iommu.h          |   6 +-
>  include/linux/kvm_host.h           |   3 -
>  virt/kvm/kvm_main.c                |   3 -
>  16 files changed, 510 insertions(+), 490 deletions(-)
> 

Probably just luck (I can't reproduce this anymore), but while running some
kvm unit tests with this patch series (and a few of my patches for AVIC
co-existence, which shouldn't affect this) I got this

(warning about is_running already set)

Dec 08 22:53:26 amdlaptop kernel: ------------[ cut here ]------------
Dec 08 22:53:26 amdlaptop kernel: WARNING: CPU: 3 PID: 72804 at arch/x86/kvm/svm/avic.c:1045 avic_vcpu_load+0xe3/0x100 [kvm_amd]
Dec 08 22:53:26 amdlaptop kernel: Modules linked in: kvm_amd(O) ccp rng_core kvm(O) irqbypass xt_conntrack ip6table_filter ip6_tables snd_soc_dmic snd_acp3x_>
Dec 08 22:53:26 amdlaptop kernel:  r8169 realtek 8250_pci usbmon nbd fuse autofs4 [last unloaded: rng_core]
Dec 08 22:53:26 amdlaptop kernel: CPU: 3 PID: 72804 Comm: qemu-system-i38 Tainted: G           O      5.16.0-rc4.unstable #6
Dec 08 22:53:26 amdlaptop kernel: Hardware name: LENOVO 20UF001CUS/20UF001CUS, BIOS R1CET65W(1.34 ) 06/17/2021
Dec 08 22:53:26 amdlaptop kernel: RIP: 0010:avic_vcpu_load+0xe3/0x100 [kvm_amd]
Dec 08 22:53:26 amdlaptop kernel: Code: 0d 9f e0 85 c0 74 e8 4c 89 f6 4c 89 ff e8 a5 99 f4 e0 5b 41 5c 41 5d 41 5e 41 5f 5d c3 0f 0b 5b 41 5c 41 5d 41 5e 41 >
Dec 08 22:53:26 amdlaptop kernel: RSP: 0018:ffffc9000b17bba8 EFLAGS: 00010247
Dec 08 22:53:26 amdlaptop kernel: RAX: 6f63203a756d6571 RBX: ffff888106194740 RCX: ffff88812e7ac000
Dec 08 22:53:26 amdlaptop kernel: RDX: ffff8883ff6c0000 RSI: 0000000000000003 RDI: 0000000000000003
Dec 08 22:53:26 amdlaptop kernel: RBP: ffffc9000b17bbd0 R08: ffff888106194740 R09: 0000000000000000
Dec 08 22:53:26 amdlaptop kernel: R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000003
Dec 08 22:53:26 amdlaptop kernel: R13: 0000000000000003 R14: ffff88810023b060 R15: dead000000000100
Dec 08 22:53:26 amdlaptop kernel: FS:  0000000000000000(0000) GS:ffff8883ff6c0000(0000) knlGS:0000000000000000
Dec 08 22:53:26 amdlaptop kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Dec 08 22:53:26 amdlaptop kernel: CR2: 00005587e812f958 CR3: 0000000105f31000 CR4: 0000000000350ee0
Dec 08 22:53:26 amdlaptop kernel: DR0: 00000000004008da DR1: 0000000000000000 DR2: 0000000000000000
Dec 08 22:53:26 amdlaptop kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Dec 08 22:53:26 amdlaptop kernel: Call Trace:
Dec 08 22:53:26 amdlaptop kernel:  <TASK>
Dec 08 22:53:26 amdlaptop kernel:  svm_vcpu_load+0x56/0x60 [kvm_amd]
Dec 08 22:53:26 amdlaptop kernel:  kvm_arch_vcpu_load+0x32/0x210 [kvm]
Dec 08 22:53:26 amdlaptop kernel:  vcpu_load+0x34/0x40 [kvm]
Dec 08 22:53:26 amdlaptop kernel:  kvm_arch_destroy_vm+0xd4/0x1c0 [kvm]
Dec 08 22:53:26 amdlaptop kernel:  kvm_destroy_vm+0x163/0x250 [kvm]
Dec 08 22:53:26 amdlaptop kernel:  kvm_put_kvm+0x26/0x40 [kvm]
Dec 08 22:53:26 amdlaptop kernel:  kvm_vm_release+0x22/0x30 [kvm]
Dec 08 22:53:26 amdlaptop kernel:  __fput+0x94/0x250
Dec 08 22:53:26 amdlaptop kernel:  ____fput+0xe/0x10
Dec 08 22:53:26 amdlaptop kernel:  task_work_run+0x63/0xa0
Dec 08 22:53:26 amdlaptop kernel:  do_exit+0x358/0xa30
Dec 08 22:53:26 amdlaptop kernel:  do_group_exit+0x3b/0xa0
Dec 08 22:53:26 amdlaptop kernel:  get_signal+0x15b/0x880
Dec 08 22:53:26 amdlaptop kernel:  ? _copy_to_user+0x20/0x30
Dec 08 22:53:26 amdlaptop kernel:  ? put_timespec64+0x3d/0x60
Dec 08 22:53:26 amdlaptop kernel:  arch_do_signal_or_restart+0x106/0x740
Dec 08 22:53:26 amdlaptop kernel:  ? hrtimer_nanosleep+0x9f/0x120
Dec 08 22:53:26 amdlaptop kernel:  ? __hrtimer_init+0xd0/0xd0
Dec 08 22:53:26 amdlaptop kernel:  exit_to_user_mode_prepare+0x112/0x1f0
Dec 08 22:53:26 amdlaptop kernel:  syscall_exit_to_user_mode+0x17/0x40
Dec 08 22:53:26 amdlaptop kernel:  do_syscall_64+0x42/0x80
Dec 08 22:53:26 amdlaptop kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xae
Dec 08 22:53:26 amdlaptop kernel: RIP: 0033:0x7f537abb13b5
Dec 08 22:53:26 amdlaptop kernel: Code: Unable to access opcode bytes at RIP 0x7f537abb138b.
Dec 08 22:53:26 amdlaptop kernel: RSP: 002b:00007f5376a39680 EFLAGS: 00000293 ORIG_RAX: 00000000000000e6
Dec 08 22:53:26 amdlaptop kernel: RAX: fffffffffffffdfc RBX: 00007f5376a396d0 RCX: 00007f537abb13b5
Dec 08 22:53:26 amdlaptop kernel: RDX: 00007f5376a396d0 RSI: 0000000000000000 RDI: 0000000000000000
Dec 08 22:53:26 amdlaptop kernel: RBP: 00007f5376a396c0 R08: 0000000000000000 R09: 0000000000000000
Dec 08 22:53:26 amdlaptop kernel: R10: 00007f5376a396c0 R11: 0000000000000293 R12: 00007f5376a3b640
Dec 08 22:53:26 amdlaptop kernel: R13: 0000000000000002 R14: 00007f537ab66880 R15: 0000000000000000
Dec 08 22:53:26 amdlaptop kernel:  </TASK>
Dec 08 22:53:26 amdlaptop kernel: ---[ end trace 676058aaf29d0267 ]---


I'll post my patches tomorrow, after some more testing.

Best regards,
	Maxim Levitsky


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v3 00/26] KVM: x86: Halt and APICv overhaul
  2021-12-08 23:00 ` Maxim Levitsky
@ 2021-12-08 23:16   ` Maxim Levitsky
  2021-12-08 23:34     ` Maxim Levitsky
                       ` (2 more replies)
  2021-12-08 23:43   ` Sean Christopherson
  1 sibling, 3 replies; 50+ messages in thread
From: Maxim Levitsky @ 2021-12-08 23:16 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini, Joerg Roedel
  Cc: Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Suravee Suthikulpanit,
	kvm, iommu, linux-kernel

On Thu, 2021-12-09 at 01:00 +0200, Maxim Levitsky wrote:
> On Wed, 2021-12-08 at 01:52 +0000, Sean Christopherson wrote:
> > Overhaul and cleanup APIC virtualization (Posted Interrupts on Intel VMX,
> > AVIC on AMD SVM) to streamline things as much as possible, remove a bunch
> > of cruft, and document the lurking gotchas along the way.
> > 
> > Patch 01 is a fix from Paolo that's already been merged but hasn't made
> > its way to kvm/queue.  It's included here to avoid a number of conflicts.
> > 
> > Based on kvm/queue, commit 1cf84614b04a ("KVM: x86: Exit to ...")
> > 
> > v3:
> >  - Rebase to kvm/queue (and drop non-x86 patches as they've been queued). 
> >  - Redo AVIC patches, sadly the vcpu_(un)blocking() hooks need to stay.
> >  - Add a patch to fix a missing (documentation-only) barrier in nested
> >    posted interrupt delivery. [Paolo]
> >  - Collect reviews.
> > 
> > v2:
> >  - https://lore.kernel.org/all/20211009021236.4122790-1-seanjc@google.com/
> >  - Collect reviews. [Christian, David]
> >  - Add patch to move arm64 WFI functionality out of hooks. [Marc]
> >  - Add RISC-V to the fun.
> >  - Add all the APICv fun.
> > 
> > v1: https://lkml.kernel.org/r/20210925005528.1145584-1-seanjc@google.com
> > 
> > Paolo Bonzini (1):
> >   KVM: fix avic_set_running for preemptable kernels
> > 
> > Sean Christopherson (25):
> >   KVM: nVMX: Ensure vCPU honors event request if posting nested IRQ
> >     fails
> >   KVM: VMX: Clean up PI pre/post-block WARNs
> >   KVM: VMX: Handle PI wakeup shenanigans during vcpu_put/load
> >   KVM: Drop unused kvm_vcpu.pre_pcpu field
> >   KVM: Move x86 VMX's posted interrupt list_head to vcpu_vmx
> >   KVM: VMX: Move preemption timer <=> hrtimer dance to common x86
> >   KVM: x86: Unexport LAPIC's switch_to_{hv,sw}_timer() helpers
> >   KVM: x86: Remove defunct pre_block/post_block kvm_x86_ops hooks
> >   KVM: SVM: Signal AVIC doorbell iff vCPU is in guest mode
> >   KVM: SVM: Don't bother checking for "running" AVIC when kicking for
> >     IPIs
> >   KVM: SVM: Remove unnecessary APICv/AVIC update in vCPU unblocking path
> >   KVM: SVM: Use kvm_vcpu_is_blocking() in AVIC load to handle preemption
> >   KVM: SVM: Skip AVIC and IRTE updates when loading blocking vCPU
> >   iommu/amd: KVM: SVM: Use pCPU to infer IsRun state for IRTE
> >   KVM: VMX: Don't do full kick when triggering posted interrupt "fails"
> >   KVM: VMX: Wake vCPU when delivering posted IRQ even if vCPU == this
> >     vCPU
> >   KVM: VMX: Pass desired vector instead of bool for triggering posted
> >     IRQ
> >   KVM: VMX: Fold fallback path into triggering posted IRQ helper
> >   KVM: VMX: Don't do full kick when handling posted interrupt wakeup
> >   KVM: SVM: Drop AVIC's intermediate avic_set_running() helper
> >   KVM: SVM: Move svm_hardware_setup() and its helpers below svm_x86_ops
> >   KVM: SVM: Nullify vcpu_(un)blocking() hooks if AVIC is disabled
> >   KVM: x86: Skip APICv update if APICv is disable at the module level
> >   KVM: x86: Drop NULL check on kvm_x86_ops.check_apicv_inhibit_reasons
> >   KVM: x86: Unexport __kvm_request_apicv_update()
> > 
> >  arch/x86/include/asm/kvm-x86-ops.h |   2 -
> >  arch/x86/include/asm/kvm_host.h    |  12 -
> >  arch/x86/kvm/hyperv.c              |   3 +
> >  arch/x86/kvm/lapic.c               |   2 -
> >  arch/x86/kvm/svm/avic.c            | 116 ++++---
> >  arch/x86/kvm/svm/svm.c             | 479 ++++++++++++++---------------
> >  arch/x86/kvm/svm/svm.h             |  16 +-
> >  arch/x86/kvm/vmx/posted_intr.c     | 234 +++++++-------
> >  arch/x86/kvm/vmx/posted_intr.h     |   8 +-
> >  arch/x86/kvm/vmx/vmx.c             |  66 ++--
> >  arch/x86/kvm/vmx/vmx.h             |   3 +
> >  arch/x86/kvm/x86.c                 |  41 ++-
> >  drivers/iommu/amd/iommu.c          |   6 +-
> >  include/linux/amd-iommu.h          |   6 +-
> >  include/linux/kvm_host.h           |   3 -
> >  virt/kvm/kvm_main.c                |   3 -
> >  16 files changed, 510 insertions(+), 490 deletions(-)
> > 
> 
> Probably just luck (I can't reproduce this anymore), but while running some
> kvm unit tests with this patch series (and a few of my patches for AVIC
> co-existence, which shouldn't affect this) I got this
> 
> (warning about is_running already set)
> 
> Dec 08 22:53:26 amdlaptop kernel: ------------[ cut here ]------------
> Dec 08 22:53:26 amdlaptop kernel: WARNING: CPU: 3 PID: 72804 at arch/x86/kvm/svm/avic.c:1045 avic_vcpu_load+0xe3/0x100 [kvm_amd]
> Dec 08 22:53:26 amdlaptop kernel: Modules linked in: kvm_amd(O) ccp rng_core kvm(O) irqbypass xt_conntrack ip6table_filter ip6_tables snd_soc_dmic snd_acp3x_>
> Dec 08 22:53:26 amdlaptop kernel:  r8169 realtek 8250_pci usbmon nbd fuse autofs4 [last unloaded: rng_core]
> Dec 08 22:53:26 amdlaptop kernel: CPU: 3 PID: 72804 Comm: qemu-system-i38 Tainted: G           O      5.16.0-rc4.unstable #6
> Dec 08 22:53:26 amdlaptop kernel: Hardware name: LENOVO 20UF001CUS/20UF001CUS, BIOS R1CET65W(1.34 ) 06/17/2021
> Dec 08 22:53:26 amdlaptop kernel: RIP: 0010:avic_vcpu_load+0xe3/0x100 [kvm_amd]
> Dec 08 22:53:26 amdlaptop kernel: Code: 0d 9f e0 85 c0 74 e8 4c 89 f6 4c 89 ff e8 a5 99 f4 e0 5b 41 5c 41 5d 41 5e 41 5f 5d c3 0f 0b 5b 41 5c 41 5d 41 5e 41 >
> Dec 08 22:53:26 amdlaptop kernel: RSP: 0018:ffffc9000b17bba8 EFLAGS: 00010247
> Dec 08 22:53:26 amdlaptop kernel: RAX: 6f63203a756d6571 RBX: ffff888106194740 RCX: ffff88812e7ac000
> Dec 08 22:53:26 amdlaptop kernel: RDX: ffff8883ff6c0000 RSI: 0000000000000003 RDI: 0000000000000003
> Dec 08 22:53:26 amdlaptop kernel: RBP: ffffc9000b17bbd0 R08: ffff888106194740 R09: 0000000000000000
> Dec 08 22:53:26 amdlaptop kernel: R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000003
> Dec 08 22:53:26 amdlaptop kernel: R13: 0000000000000003 R14: ffff88810023b060 R15: dead000000000100
> Dec 08 22:53:26 amdlaptop kernel: FS:  0000000000000000(0000) GS:ffff8883ff6c0000(0000) knlGS:0000000000000000
> Dec 08 22:53:26 amdlaptop kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> Dec 08 22:53:26 amdlaptop kernel: CR2: 00005587e812f958 CR3: 0000000105f31000 CR4: 0000000000350ee0
> Dec 08 22:53:26 amdlaptop kernel: DR0: 00000000004008da DR1: 0000000000000000 DR2: 0000000000000000
> Dec 08 22:53:26 amdlaptop kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> Dec 08 22:53:26 amdlaptop kernel: Call Trace:
> Dec 08 22:53:26 amdlaptop kernel:  <TASK>
> Dec 08 22:53:26 amdlaptop kernel:  svm_vcpu_load+0x56/0x60 [kvm_amd]
> Dec 08 22:53:26 amdlaptop kernel:  kvm_arch_vcpu_load+0x32/0x210 [kvm]
> Dec 08 22:53:26 amdlaptop kernel:  vcpu_load+0x34/0x40 [kvm]
> Dec 08 22:53:26 amdlaptop kernel:  kvm_arch_destroy_vm+0xd4/0x1c0 [kvm]
> Dec 08 22:53:26 amdlaptop kernel:  kvm_destroy_vm+0x163/0x250 [kvm]
> Dec 08 22:53:26 amdlaptop kernel:  kvm_put_kvm+0x26/0x40 [kvm]
> Dec 08 22:53:26 amdlaptop kernel:  kvm_vm_release+0x22/0x30 [kvm]
> Dec 08 22:53:26 amdlaptop kernel:  __fput+0x94/0x250
> Dec 08 22:53:26 amdlaptop kernel:  ____fput+0xe/0x10
> Dec 08 22:53:26 amdlaptop kernel:  task_work_run+0x63/0xa0
> Dec 08 22:53:26 amdlaptop kernel:  do_exit+0x358/0xa30
> Dec 08 22:53:26 amdlaptop kernel:  do_group_exit+0x3b/0xa0
> Dec 08 22:53:26 amdlaptop kernel:  get_signal+0x15b/0x880
> Dec 08 22:53:26 amdlaptop kernel:  ? _copy_to_user+0x20/0x30
> Dec 08 22:53:26 amdlaptop kernel:  ? put_timespec64+0x3d/0x60
> Dec 08 22:53:26 amdlaptop kernel:  arch_do_signal_or_restart+0x106/0x740
> Dec 08 22:53:26 amdlaptop kernel:  ? hrtimer_nanosleep+0x9f/0x120
> Dec 08 22:53:26 amdlaptop kernel:  ? __hrtimer_init+0xd0/0xd0
> Dec 08 22:53:26 amdlaptop kernel:  exit_to_user_mode_prepare+0x112/0x1f0
> Dec 08 22:53:26 amdlaptop kernel:  syscall_exit_to_user_mode+0x17/0x40
> Dec 08 22:53:26 amdlaptop kernel:  do_syscall_64+0x42/0x80
> Dec 08 22:53:26 amdlaptop kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xae
> Dec 08 22:53:26 amdlaptop kernel: RIP: 0033:0x7f537abb13b5
> Dec 08 22:53:26 amdlaptop kernel: Code: Unable to access opcode bytes at RIP 0x7f537abb138b.
> Dec 08 22:53:26 amdlaptop kernel: RSP: 002b:00007f5376a39680 EFLAGS: 00000293 ORIG_RAX: 00000000000000e6
> Dec 08 22:53:26 amdlaptop kernel: RAX: fffffffffffffdfc RBX: 00007f5376a396d0 RCX: 00007f537abb13b5
> Dec 08 22:53:26 amdlaptop kernel: RDX: 00007f5376a396d0 RSI: 0000000000000000 RDI: 0000000000000000
> Dec 08 22:53:26 amdlaptop kernel: RBP: 00007f5376a396c0 R08: 0000000000000000 R09: 0000000000000000
> Dec 08 22:53:26 amdlaptop kernel: R10: 00007f5376a396c0 R11: 0000000000000293 R12: 00007f5376a3b640
> Dec 08 22:53:26 amdlaptop kernel: R13: 0000000000000002 R14: 00007f537ab66880 R15: 0000000000000000
> Dec 08 22:53:26 amdlaptop kernel:  </TASK>
> Dec 08 22:53:26 amdlaptop kernel: ---[ end trace 676058aaf29d0267 ]---


Also got this while trying a VM with a passed-through device:

[mlevitsk@amdlaptop ~]$[   34.926140] usb 5-3: reset full-speed USB device number 3 using xhci_hcd
[   42.583661] FAT-fs (mmcblk0p1): Volume was not properly unmounted. Some data may be corrupt. Please run fsck.
[  363.562173] VFIO - User Level meta-driver version: 0.3
[  365.160357] vfio-pci 0000:03:00.0: vfio_ecap_init: hiding ecap 0x1e@0x154
[  384.138110] BUG: kernel NULL pointer dereference, address: 0000000000000021
[  384.154039] #PF: supervisor read access in kernel mode
[  384.165645] #PF: error_code(0x0000) - not-present page
[  384.177254] PGD 16da9d067 P4D 16da9d067 PUD 13ad1a067 PMD 0 
[  384.190036] Oops: 0000 [#1] SMP
[  384.197117] CPU: 3 PID: 14403 Comm: CPU 3/KVM Tainted: G           O      5.16.0-rc4.unstable #6
[  384.216978] Hardware name: LENOVO 20UF001CUS/20UF001CUS, BIOS R1CET65W(1.34 ) 06/17/2021
[  384.235258] RIP: 0010:amd_iommu_update_ga+0x32/0x160
[  384.246469] Code: e5 41 57 41 56 41 55 41 54 53 48 83 ec 28 65 48 8b 04 25 28 00 00 00 48 89 45 d0 31 c0 83 3d 04 75 2a 01 02 0f 85 bd 00 00 00 <4c> 8b 62 20 48 8b 4a 18 4d 85 e4 0f 84 ca 00 00 00
48 85 c9 0f 84
[  384.288932] RSP: 0018:ffffc9000036fca0 EFLAGS: 00010046
[  384.300727] RAX: 0000000000000000 RBX: ffff88810b68ab60 RCX: ffff8881667a6018
[  384.316850] RDX: 0000000000000001 RSI: ffff888107476b00 RDI: 0000000000000003
[  384.332973] RBP: ffffc9000036fcf0 R08: 0000000000000010 R09: 0000000000000010
[  384.349096] R10: 0000000000000bb8 R11: ffffffff82a0fe80 R12: 0000000000000003
[  384.365219] R13: ffff88836b98ea68 R14: 0000000000000202 R15: ffff88836b98ea78
[  384.381338] FS:  00007f18d9ffb640(0000) GS:ffff8883ff6c0000(0000) knlGS:0000000000000000
[  384.399622] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  384.412597] CR2: 0000000000000021 CR3: 00000001617c3000 CR4: 0000000000350ee0
[  384.428717] Call Trace:
[  384.434224]  <TASK>
[  384.438951]  ? svm_set_msr+0x349/0x740 [kvm_amd]
[  384.449374]  avic_vcpu_load+0xbc/0x100 [kvm_amd]
[  384.459795]  svm_vcpu_unblocking+0x2d/0x40 [kvm_amd]
[  384.471011]  kvm_vcpu_block+0x71/0x90 [kvm]
[  384.480479]  kvm_vcpu_halt+0x3b/0x390 [kvm]
[  384.489950]  kvm_arch_vcpu_ioctl_run+0xa81/0x17f0 [kvm]
[  384.501781]  kvm_vcpu_ioctl+0x284/0x6c0 [kvm]
[  384.511640]  ? vfio_pci_rw+0x6b/0xa0 [vfio_pci_core]
[  384.522856]  ? vfio_pci_core_write+0x1c/0x20 [vfio_pci_core]
[  384.535642]  ? vfio_device_fops_write+0x1f/0x30 [vfio]
[  384.547243]  __x64_sys_ioctl+0x8e/0xc0
[  384.555701]  do_syscall_64+0x35/0x80
[  384.563765]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[  384.575170] RIP: 0033:0x7f18f4c4b39b
[  384.583241] Code: ff ff ff 85 c0 79 9b 49 c7 c4 ff ff ff ff 5b 5d 4c 89 e0 41 5c c3 66 0f 1f 84 00 00 00 00 00 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 5d 2a 0f 00 f7
d8 64 89 01 48
[  384.625709] RSP: 002b:00007f18d9ff95c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[  384.639715] [drm] Fence fallback timer expired on ring gfx
[  384.642816] RAX: ffffffffffffffda RBX: 00000000017aef60 RCX: 00007f18f4c4b39b
[  384.642817] RDX: 0000000000000000 RSI: 000000000000ae80 RDI: 0000000000000026
[  384.687520] RBP: 00007f18d9ff96c0 R08: 0000000000daf570 R09: 000000000000ffff
[  384.703646] R10: 0000070000000082 R11: 0000000000000246 R12: 00007f18d9ffb640
[  384.719765] R13: 0000000000000000 R14: 00007f18f4bd1880 R15: 0000000000000000
[  384.735896]  </TASK>
[  384.740812] Modules linked in: vfio_pci vfio_pci_core vfio_virqfd vfio_iommu_type1 vfio xt_conntrack ip6table_filter ip6_tables snd_ctl_led snd_hda_codec_realtek snd_hda_codec_generic
snd_hda_codec_hdmi snd_hda_intel snd_intel_dspcfg snd_usb_audio snd_acp3x_pdm_dma snd_soc_dmic snd_acp3x_rn tps6598x snd_hda_codec snd_usbmidi_lib kvm_amd(O) btusb iwlmvm regmap_i2c snd_hda_core btrtl
snd_soc_core wmi_bmof snd_hwdep tpm_crb kvm(O) btbcm irqbypass snd_rawmidi ftdi_sio btintel pcspkr psmouse snd_rn_pci_acp3x bfq k10temp i2c_piix4 thinkpad_acpi snd_pcm i2c_multi_instantiate iwlwifi
tpm_tis tpm_tis_core wmi platform_profile rtc_cmos i2c_designware_platform i2c_designware_core vendor_reset(O) dm_crypt hid_generic usbhid mmc_block amdgpu drm_ttm_helper ttm gpu_sched i2c_algo_bit
drm_kms_helper cfbfillrect syscopyarea cfbimgblt sysfillrect sysimgblt fb_sys_fops cfbcopyarea xhci_pci nvme rtsx_pci_sdmmc ucsi_acpi typec_ucsi mmc_core atkbd ccp rtsx_pci libps2 drm nvme_core
mfd_core sp5100_tco
[  384.740851]  rng_core xhci_hcd drm_panel_orientation_quirks t10_pi typec i8042 pinctrl_amd r8169 realtek 8250_pci usbmon nbd fuse autofs4
[  384.965149] CR2: 0000000000000021
[  384.972622] ---[ end trace 23949f0862c2ac65 ]---
[  384.983042] RIP: 0010:amd_iommu_update_ga+0x32/0x160
[  384.994255] Code: e5 41 57 41 56 41 55 41 54 53 48 83 ec 28 65 48 8b 04 25 28 00 00 00 48 89 45 d0 31 c0 83 3d 04 75 2a 01 02 0f 85 bd 00 00 00 <4c> 8b 62 20 48 8b 4a 18 4d 85 e4 0f 84 ca 00 00 00
48 85 c9 0f 84
[  385.036720] RSP: 0018:ffffc9000036fca0 EFLAGS: 00010046
[  385.048518] RAX: 0000000000000000 RBX: ffff88810b68ab60 RCX: ffff8881667a6018
[  385.064646] RDX: 0000000000000001 RSI: ffff888107476b00 RDI: 0000000000000003
[  385.080765] RBP: ffffc9000036fcf0 R08: 0000000000000010 R09: 0000000000000010
[  385.096886] R10: 0000000000000bb8 R11: ffffffff82a0fe80 R12: 0000000000000003
[  385.113004] R13: ffff88836b98ea68 R14: 0000000000000202 R15: ffff88836b98ea78
[  385.129128] FS:  00007f18d9ffb640(0000) GS:ffff8883ff6c0000(0000) knlGS:0000000000000000
[  385.147407] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  385.160385] CR2: 0000000000000021 CR3: 00000001617c3000 CR4: 0000000000350ee0
[  385.172739] [drm] Fence fallback timer expired on ring sdma0
> 


Oh well, tomorrow I'll see what I can do with these.

Best regards,
	Maxim Levitsky
> 
> I'll post my patches tomorrow, after some more testing.
> 
> Best regards,
> 	Maxim Levitsky



^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v3 00/26] KVM: x86: Halt and APICv overhaul
  2021-12-08 23:16   ` Maxim Levitsky
@ 2021-12-08 23:34     ` Maxim Levitsky
  2021-12-09  0:04       ` Sean Christopherson
  2021-12-09  0:02     ` Sean Christopherson
  2021-12-09  1:37     ` Sean Christopherson
  2 siblings, 1 reply; 50+ messages in thread
From: Maxim Levitsky @ 2021-12-08 23:34 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini, Joerg Roedel
  Cc: Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Suravee Suthikulpanit,
	kvm, iommu, linux-kernel

On Thu, 2021-12-09 at 01:16 +0200, Maxim Levitsky wrote:
> On Thu, 2021-12-09 at 01:00 +0200, Maxim Levitsky wrote:
> > On Wed, 2021-12-08 at 01:52 +0000, Sean Christopherson wrote:
> > > Overhaul and cleanup APIC virtualization (Posted Interrupts on Intel VMX,
> > > AVIC on AMD SVM) to streamline things as much as possible, remove a bunch
> > > of cruft, and document the lurking gotchas along the way.
> > > 
> > > Patch 01 is a fix from Paolo that's already been merged but hasn't made
> > > its way to kvm/queue.  It's included here to avoid a number of conflicts.
> > > 
> > > Based on kvm/queue, commit 1cf84614b04a ("KVM: x86: Exit to ...")
> > > 
> > > v3:
> > >  - Rebase to kvm/queue (and drop non-x86 patches as they've been queued). 
> > >  - Redo AVIC patches, sadly the vcpu_(un)blocking() hooks need to stay.
> > >  - Add a patch to fix a missing (documentation-only) barrier in nested
> > >    posted interrupt delivery. [Paolo]
> > >  - Collect reviews.
> > > 
> > > v2:
> > >  - https://lore.kernel.org/all/20211009021236.4122790-1-seanjc@google.com/
> > >  - Collect reviews. [Christian, David]
> > >  - Add patch to move arm64 WFI functionality out of hooks. [Marc]
> > >  - Add RISC-V to the fun.
> > >  - Add all the APICv fun.
> > > 
> > > v1: https://lkml.kernel.org/r/20210925005528.1145584-1-seanjc@google.com
> > > 
> > > Paolo Bonzini (1):
> > >   KVM: fix avic_set_running for preemptable kernels
> > > 
> > > Sean Christopherson (25):
> > >   KVM: nVMX: Ensure vCPU honors event request if posting nested IRQ
> > >     fails
> > >   KVM: VMX: Clean up PI pre/post-block WARNs
> > >   KVM: VMX: Handle PI wakeup shenanigans during vcpu_put/load
> > >   KVM: Drop unused kvm_vcpu.pre_pcpu field
> > >   KVM: Move x86 VMX's posted interrupt list_head to vcpu_vmx
> > >   KVM: VMX: Move preemption timer <=> hrtimer dance to common x86
> > >   KVM: x86: Unexport LAPIC's switch_to_{hv,sw}_timer() helpers
> > >   KVM: x86: Remove defunct pre_block/post_block kvm_x86_ops hooks
> > >   KVM: SVM: Signal AVIC doorbell iff vCPU is in guest mode
> > >   KVM: SVM: Don't bother checking for "running" AVIC when kicking for
> > >     IPIs
> > >   KVM: SVM: Remove unnecessary APICv/AVIC update in vCPU unblocking path
> > >   KVM: SVM: Use kvm_vcpu_is_blocking() in AVIC load to handle preemption
> > >   KVM: SVM: Skip AVIC and IRTE updates when loading blocking vCPU
> > >   iommu/amd: KVM: SVM: Use pCPU to infer IsRun state for IRTE
> > >   KVM: VMX: Don't do full kick when triggering posted interrupt "fails"
> > >   KVM: VMX: Wake vCPU when delivering posted IRQ even if vCPU == this
> > >     vCPU
> > >   KVM: VMX: Pass desired vector instead of bool for triggering posted
> > >     IRQ
> > >   KVM: VMX: Fold fallback path into triggering posted IRQ helper
> > >   KVM: VMX: Don't do full kick when handling posted interrupt wakeup
> > >   KVM: SVM: Drop AVIC's intermediate avic_set_running() helper
> > >   KVM: SVM: Move svm_hardware_setup() and its helpers below svm_x86_ops
> > >   KVM: SVM: Nullify vcpu_(un)blocking() hooks if AVIC is disabled
> > >   KVM: x86: Skip APICv update if APICv is disable at the module level
> > >   KVM: x86: Drop NULL check on kvm_x86_ops.check_apicv_inhibit_reasons
> > >   KVM: x86: Unexport __kvm_request_apicv_update()
> > > 
> > >  arch/x86/include/asm/kvm-x86-ops.h |   2 -
> > >  arch/x86/include/asm/kvm_host.h    |  12 -
> > >  arch/x86/kvm/hyperv.c              |   3 +
> > >  arch/x86/kvm/lapic.c               |   2 -
> > >  arch/x86/kvm/svm/avic.c            | 116 ++++---
> > >  arch/x86/kvm/svm/svm.c             | 479 ++++++++++++++---------------
> > >  arch/x86/kvm/svm/svm.h             |  16 +-
> > >  arch/x86/kvm/vmx/posted_intr.c     | 234 +++++++-------
> > >  arch/x86/kvm/vmx/posted_intr.h     |   8 +-
> > >  arch/x86/kvm/vmx/vmx.c             |  66 ++--
> > >  arch/x86/kvm/vmx/vmx.h             |   3 +
> > >  arch/x86/kvm/x86.c                 |  41 ++-
> > >  drivers/iommu/amd/iommu.c          |   6 +-
> > >  include/linux/amd-iommu.h          |   6 +-
> > >  include/linux/kvm_host.h           |   3 -
> > >  virt/kvm/kvm_main.c                |   3 -
> > >  16 files changed, 510 insertions(+), 490 deletions(-)
> > > 
> > 
> > Probably just luck (I can't reproduce this anymore), but while running some
> > kvm unit tests with this patch series (and a few of my patches for AVIC
> > co-existence, which shouldn't affect this) I got this
> > 
> > (warning about is_running already set)
> > 
> > Dec 08 22:53:26 amdlaptop kernel: ------------[ cut here ]------------
> > Dec 08 22:53:26 amdlaptop kernel: WARNING: CPU: 3 PID: 72804 at arch/x86/kvm/svm/avic.c:1045 avic_vcpu_load+0xe3/0x100 [kvm_amd]
> > Dec 08 22:53:26 amdlaptop kernel: Modules linked in: kvm_amd(O) ccp rng_core kvm(O) irqbypass xt_conntrack ip6table_filter ip6_tables snd_soc_dmic snd_acp3x_>
> > Dec 08 22:53:26 amdlaptop kernel:  r8169 realtek 8250_pci usbmon nbd fuse autofs4 [last unloaded: rng_core]
> > Dec 08 22:53:26 amdlaptop kernel: CPU: 3 PID: 72804 Comm: qemu-system-i38 Tainted: G           O      5.16.0-rc4.unstable #6
> > Dec 08 22:53:26 amdlaptop kernel: Hardware name: LENOVO 20UF001CUS/20UF001CUS, BIOS R1CET65W(1.34 ) 06/17/2021
> > Dec 08 22:53:26 amdlaptop kernel: RIP: 0010:avic_vcpu_load+0xe3/0x100 [kvm_amd]
> > Dec 08 22:53:26 amdlaptop kernel: Code: 0d 9f e0 85 c0 74 e8 4c 89 f6 4c 89 ff e8 a5 99 f4 e0 5b 41 5c 41 5d 41 5e 41 5f 5d c3 0f 0b 5b 41 5c 41 5d 41 5e 41 >
> > Dec 08 22:53:26 amdlaptop kernel: RSP: 0018:ffffc9000b17bba8 EFLAGS: 00010247
> > Dec 08 22:53:26 amdlaptop kernel: RAX: 6f63203a756d6571 RBX: ffff888106194740 RCX: ffff88812e7ac000
> > Dec 08 22:53:26 amdlaptop kernel: RDX: ffff8883ff6c0000 RSI: 0000000000000003 RDI: 0000000000000003
> > Dec 08 22:53:26 amdlaptop kernel: RBP: ffffc9000b17bbd0 R08: ffff888106194740 R09: 0000000000000000
> > Dec 08 22:53:26 amdlaptop kernel: R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000003
> > Dec 08 22:53:26 amdlaptop kernel: R13: 0000000000000003 R14: ffff88810023b060 R15: dead000000000100
> > Dec 08 22:53:26 amdlaptop kernel: FS:  0000000000000000(0000) GS:ffff8883ff6c0000(0000) knlGS:0000000000000000
> > Dec 08 22:53:26 amdlaptop kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > Dec 08 22:53:26 amdlaptop kernel: CR2: 00005587e812f958 CR3: 0000000105f31000 CR4: 0000000000350ee0
> > Dec 08 22:53:26 amdlaptop kernel: DR0: 00000000004008da DR1: 0000000000000000 DR2: 0000000000000000
> > Dec 08 22:53:26 amdlaptop kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> > Dec 08 22:53:26 amdlaptop kernel: Call Trace:
> > Dec 08 22:53:26 amdlaptop kernel:  <TASK>
> > Dec 08 22:53:26 amdlaptop kernel:  svm_vcpu_load+0x56/0x60 [kvm_amd]
> > Dec 08 22:53:26 amdlaptop kernel:  kvm_arch_vcpu_load+0x32/0x210 [kvm]
> > Dec 08 22:53:26 amdlaptop kernel:  vcpu_load+0x34/0x40 [kvm]
> > Dec 08 22:53:26 amdlaptop kernel:  kvm_arch_destroy_vm+0xd4/0x1c0 [kvm]
> > Dec 08 22:53:26 amdlaptop kernel:  kvm_destroy_vm+0x163/0x250 [kvm]
> > Dec 08 22:53:26 amdlaptop kernel:  kvm_put_kvm+0x26/0x40 [kvm]
> > Dec 08 22:53:26 amdlaptop kernel:  kvm_vm_release+0x22/0x30 [kvm]
> > Dec 08 22:53:26 amdlaptop kernel:  __fput+0x94/0x250
> > Dec 08 22:53:26 amdlaptop kernel:  ____fput+0xe/0x10
> > Dec 08 22:53:26 amdlaptop kernel:  task_work_run+0x63/0xa0
> > Dec 08 22:53:26 amdlaptop kernel:  do_exit+0x358/0xa30
> > Dec 08 22:53:26 amdlaptop kernel:  do_group_exit+0x3b/0xa0
> > Dec 08 22:53:26 amdlaptop kernel:  get_signal+0x15b/0x880
> > Dec 08 22:53:26 amdlaptop kernel:  ? _copy_to_user+0x20/0x30
> > Dec 08 22:53:26 amdlaptop kernel:  ? put_timespec64+0x3d/0x60
> > Dec 08 22:53:26 amdlaptop kernel:  arch_do_signal_or_restart+0x106/0x740
> > Dec 08 22:53:26 amdlaptop kernel:  ? hrtimer_nanosleep+0x9f/0x120
> > Dec 08 22:53:26 amdlaptop kernel:  ? __hrtimer_init+0xd0/0xd0
> > Dec 08 22:53:26 amdlaptop kernel:  exit_to_user_mode_prepare+0x112/0x1f0
> > Dec 08 22:53:26 amdlaptop kernel:  syscall_exit_to_user_mode+0x17/0x40
> > Dec 08 22:53:26 amdlaptop kernel:  do_syscall_64+0x42/0x80
> > Dec 08 22:53:26 amdlaptop kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xae
> > Dec 08 22:53:26 amdlaptop kernel: RIP: 0033:0x7f537abb13b5
> > Dec 08 22:53:26 amdlaptop kernel: Code: Unable to access opcode bytes at RIP 0x7f537abb138b.
> > Dec 08 22:53:26 amdlaptop kernel: RSP: 002b:00007f5376a39680 EFLAGS: 00000293 ORIG_RAX: 00000000000000e6
> > Dec 08 22:53:26 amdlaptop kernel: RAX: fffffffffffffdfc RBX: 00007f5376a396d0 RCX: 00007f537abb13b5
> > Dec 08 22:53:26 amdlaptop kernel: RDX: 00007f5376a396d0 RSI: 0000000000000000 RDI: 0000000000000000
> > Dec 08 22:53:26 amdlaptop kernel: RBP: 00007f5376a396c0 R08: 0000000000000000 R09: 0000000000000000
> > Dec 08 22:53:26 amdlaptop kernel: R10: 00007f5376a396c0 R11: 0000000000000293 R12: 00007f5376a3b640
> > Dec 08 22:53:26 amdlaptop kernel: R13: 0000000000000002 R14: 00007f537ab66880 R15: 0000000000000000
> > Dec 08 22:53:26 amdlaptop kernel:  </TASK>
> > Dec 08 22:53:26 amdlaptop kernel: ---[ end trace 676058aaf29d0267 ]---
> 
> Also got this while trying a VM with a passed-through device:
> 
> [mlevitsk@amdlaptop ~]$[   34.926140] usb 5-3: reset full-speed USB device number 3 using xhci_hcd
> [   42.583661] FAT-fs (mmcblk0p1): Volume was not properly unmounted. Some data may be corrupt. Please run fsck.
> [  363.562173] VFIO - User Level meta-driver version: 0.3
> [  365.160357] vfio-pci 0000:03:00.0: vfio_ecap_init: hiding ecap 0x1e@0x154
> [  384.138110] BUG: kernel NULL pointer dereference, address: 0000000000000021
> [  384.154039] #PF: supervisor read access in kernel mode
> [  384.165645] #PF: error_code(0x0000) - not-present page
> [  384.177254] PGD 16da9d067 P4D 16da9d067 PUD 13ad1a067 PMD 0 
> [  384.190036] Oops: 0000 [#1] SMP
> [  384.197117] CPU: 3 PID: 14403 Comm: CPU 3/KVM Tainted: G           O      5.16.0-rc4.unstable #6
> [  384.216978] Hardware name: LENOVO 20UF001CUS/20UF001CUS, BIOS R1CET65W(1.34 ) 06/17/2021
> [  384.235258] RIP: 0010:amd_iommu_update_ga+0x32/0x160
> [  384.246469] Code: e5 41 57 41 56 41 55 41 54 53 48 83 ec 28 65 48 8b 04 25 28 00 00 00 48 89 45 d0 31 c0 83 3d 04 75 2a 01 02 0f 85 bd 00 00 00 <4c> 8b 62 20 48 8b 4a 18 4d 85 e4 0f 84 ca 00 00 00
> 48 85 c9 0f 84
> [  384.288932] RSP: 0018:ffffc9000036fca0 EFLAGS: 00010046
> [  384.300727] RAX: 0000000000000000 RBX: ffff88810b68ab60 RCX: ffff8881667a6018
> [  384.316850] RDX: 0000000000000001 RSI: ffff888107476b00 RDI: 0000000000000003
> [  384.332973] RBP: ffffc9000036fcf0 R08: 0000000000000010 R09: 0000000000000010
> [  384.349096] R10: 0000000000000bb8 R11: ffffffff82a0fe80 R12: 0000000000000003
> [  384.365219] R13: ffff88836b98ea68 R14: 0000000000000202 R15: ffff88836b98ea78
> [  384.381338] FS:  00007f18d9ffb640(0000) GS:ffff8883ff6c0000(0000) knlGS:0000000000000000
> [  384.399622] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  384.412597] CR2: 0000000000000021 CR3: 00000001617c3000 CR4: 0000000000350ee0
> [  384.428717] Call Trace:
> [  384.434224]  <TASK>
> [  384.438951]  ? svm_set_msr+0x349/0x740 [kvm_amd]
> [  384.449374]  avic_vcpu_load+0xbc/0x100 [kvm_amd]
> [  384.459795]  svm_vcpu_unblocking+0x2d/0x40 [kvm_amd]
> [  384.471011]  kvm_vcpu_block+0x71/0x90 [kvm]
> [  384.480479]  kvm_vcpu_halt+0x3b/0x390 [kvm]
> [  384.489950]  kvm_arch_vcpu_ioctl_run+0xa81/0x17f0 [kvm]
> [  384.501781]  kvm_vcpu_ioctl+0x284/0x6c0 [kvm]
> [  384.511640]  ? vfio_pci_rw+0x6b/0xa0 [vfio_pci_core]
> [  384.522856]  ? vfio_pci_core_write+0x1c/0x20 [vfio_pci_core]
> [  384.535642]  ? vfio_device_fops_write+0x1f/0x30 [vfio]
> [  384.547243]  __x64_sys_ioctl+0x8e/0xc0
> [  384.555701]  do_syscall_64+0x35/0x80
> [  384.563765]  entry_SYSCALL_64_after_hwframe+0x44/0xae
> [  384.575170] RIP: 0033:0x7f18f4c4b39b
> [  384.583241] Code: ff ff ff 85 c0 79 9b 49 c7 c4 ff ff ff ff 5b 5d 4c 89 e0 41 5c c3 66 0f 1f 84 00 00 00 00 00 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 5d 2a 0f 00 f7
> d8 64 89 01 48
> [  384.625709] RSP: 002b:00007f18d9ff95c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
> [  384.639715] [drm] Fence fallback timer expired on ring gfx
> [  384.642816] RAX: ffffffffffffffda RBX: 00000000017aef60 RCX: 00007f18f4c4b39b
> [  384.642817] RDX: 0000000000000000 RSI: 000000000000ae80 RDI: 0000000000000026
> [  384.687520] RBP: 00007f18d9ff96c0 R08: 0000000000daf570 R09: 000000000000ffff
> [  384.703646] R10: 0000070000000082 R11: 0000000000000246 R12: 00007f18d9ffb640
> [  384.719765] R13: 0000000000000000 R14: 00007f18f4bd1880 R15: 0000000000000000
> [  384.735896]  </TASK>
> [  384.740812] Modules linked in: vfio_pci vfio_pci_core vfio_virqfd vfio_iommu_type1 vfio xt_conntrack ip6table_filter ip6_tables snd_ctl_led snd_hda_codec_realtek snd_hda_codec_generic
> snd_hda_codec_hdmi snd_hda_intel snd_intel_dspcfg snd_usb_audio snd_acp3x_pdm_dma snd_soc_dmic snd_acp3x_rn tps6598x snd_hda_codec snd_usbmidi_lib kvm_amd(O) btusb iwlmvm regmap_i2c snd_hda_core btrtl
> snd_soc_core wmi_bmof snd_hwdep tpm_crb kvm(O) btbcm irqbypass snd_rawmidi ftdi_sio btintel pcspkr psmouse snd_rn_pci_acp3x bfq k10temp i2c_piix4 thinkpad_acpi snd_pcm i2c_multi_instantiate iwlwifi
> tpm_tis tpm_tis_core wmi platform_profile rtc_cmos i2c_designware_platform i2c_designware_core vendor_reset(O) dm_crypt hid_generic usbhid mmc_block amdgpu drm_ttm_helper ttm gpu_sched i2c_algo_bit
> drm_kms_helper cfbfillrect syscopyarea cfbimgblt sysfillrect sysimgblt fb_sys_fops cfbcopyarea xhci_pci nvme rtsx_pci_sdmmc ucsi_acpi typec_ucsi mmc_core atkbd ccp rtsx_pci libps2 drm nvme_core
> mfd_core sp5100_tco
> [  384.740851]  rng_core xhci_hcd drm_panel_orientation_quirks t10_pi typec i8042 pinctrl_amd r8169 realtek 8250_pci usbmon nbd fuse autofs4
> [  384.965149] CR2: 0000000000000021
> [  384.972622] ---[ end trace 23949f0862c2ac65 ]---
> [  384.983042] RIP: 0010:amd_iommu_update_ga+0x32/0x160
> [  384.994255] Code: e5 41 57 41 56 41 55 41 54 53 48 83 ec 28 65 48 8b 04 25 28 00 00 00 48 89 45 d0 31 c0 83 3d 04 75 2a 01 02 0f 85 bd 00 00 00 <4c> 8b 62 20 48 8b 4a 18 4d 85 e4 0f 84 ca 00 00 00
> 48 85 c9 0f 84
> [  385.036720] RSP: 0018:ffffc9000036fca0 EFLAGS: 00010046
> [  385.048518] RAX: 0000000000000000 RBX: ffff88810b68ab60 RCX: ffff8881667a6018
> [  385.064646] RDX: 0000000000000001 RSI: ffff888107476b00 RDI: 0000000000000003
> [  385.080765] RBP: ffffc9000036fcf0 R08: 0000000000000010 R09: 0000000000000010
> [  385.096886] R10: 0000000000000bb8 R11: ffffffff82a0fe80 R12: 0000000000000003
> [  385.113004] R13: ffff88836b98ea68 R14: 0000000000000202 R15: ffff88836b98ea78
> [  385.129128] FS:  00007f18d9ffb640(0000) GS:ffff8883ff6c0000(0000) knlGS:0000000000000000
> [  385.147407] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  385.160385] CR2: 0000000000000021 CR3: 00000001617c3000 CR4: 0000000000350ee0
> [  385.172739] [drm] Fence fallback timer expired on ring sdma0
> 
> Oh well, tomorrow I'll see what I can do with these.

Host crash while running a 32-bit VM with another 32-bit VM nested in it:


[  751.182290] BUG: kernel NULL pointer dereference, address: 0000000000000025
[  751.198234] #PF: supervisor read access in kernel mode
[  751.209982] #PF: error_code(0x0000) - not-present page
[  751.221733] PGD 3720f9067 P4D 3720f9067 PUD 3720f8067 PMD 0 
[  751.234682] Oops: 0000 [#1] SMP
[  751.241857] CPU: 8 PID: 54050 Comm: CPU 8/KVM Tainted: G           O      5.16.0-rc4.unstable #6
[  751.261960] Hardware name: LENOVO 20UF001CUS/20UF001CUS, BIOS R1CET65W(1.34 ) 06/17/2021
[  751.280475] RIP: 0010:is_page_fault_stale.isra.0+0x2a/0xa0 [kvm]
[  751.294246] Code: 0f 1f 44 00 00 55 48 8b 46 38 41 b8 01 00 00 00 48 be 00 00 00 00 00 ea ff ff 48 c1 e8 0c 48 c1 e0 06 48 89 e5 48 8b 44 30 28 <f6> 40 25 08 75 25 80 78 20 00 74 24 48 83 7a 20 00
74 31 48 83 bf
[  751.337236] RSP: 0018:ffffc900020afa90 EFLAGS: 00010216
[  751.349186] RAX: 0000000000000000 RBX: ffffc900020afc88 RCX: 00000000000000a9
[  751.365517] RDX: ffffc900020afc88 RSI: ffffea0000000000 RDI: ffffc90001fe9000
[  751.381851] RBP: ffffc900020afa90 R08: 0000000000000001 R09: ffffc900020afcb8
[  751.398173] R10: 0000000000000000 R11: 0000000000000000 R12: 00000000000000a9
[  751.414496] R13: ffffc90001fe9000 R14: ffff888176048000 R15: 0000000000000000
[  751.430820] FS:  00007fd5f1dfb640(0000) GS:ffff8883ff800000(0000) knlGS:0000000000000000
[  751.449349] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  751.462484] CR2: 0000000000000025 CR3: 00000003720fc000 CR4: 0000000000350ee0
[  751.478800] Call Trace:
[  751.484381]  <TASK>
[  751.489165]  paging64_page_fault+0x20d/0x990 [kvm]
[  751.500141]  ? emulator_read_write+0xda/0x1a0 [kvm]
[  751.511320]  ? linearize.isra.0+0x82/0x280 [kvm]
[  751.521898]  ? emulator_write_emulated+0x15/0x20 [kvm]
[  751.533673]  ? segmented_write.isra.0+0x5b/0x80 [kvm]
[  751.545249]  ? kvm_mmu_free_roots+0x80/0x170 [kvm]
[  751.556225]  kvm_mmu_page_fault+0x108/0x7f0 [kvm]
[  751.567002]  ? kvm_vcpu_read_guest+0xb8/0x120 [kvm]
[  751.578176]  ? kvm_mmu_get_page+0x253/0x5d0 [kvm]
[  751.588951]  ? kvm_mmu_sync_roots+0x188/0x1d0 [kvm]
[  751.600134]  npf_interception+0x47/0xa0 [kvm_amd]
[  751.610886]  svm_invoke_exit_handler+0x9d/0xe0 [kvm_amd]
[  751.623034]  handle_exit+0xb8/0x210 [kvm_amd]
[  751.632993]  kvm_arch_vcpu_ioctl_run+0xdac/0x17f0 [kvm]
[  751.644970]  ? kvm_vm_ioctl_irq_line+0x27/0x40 [kvm]
[  751.656344]  ? _copy_to_user+0x20/0x30
[  751.664914]  ? kvm_vm_ioctl+0x279/0xdb0 [kvm]
[  751.674894]  kvm_vcpu_ioctl+0x284/0x6c0 [kvm]
[  751.684876]  __x64_sys_ioctl+0x8e/0xc0
[  751.693444]  do_syscall_64+0x35/0x80
[  751.701617]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[  751.713171] RIP: 0033:0x7fd7d47c839b
[  751.721347] Code: ff ff ff 85 c0 79 9b 49 c7 c4 ff ff ff ff 5b 5d 4c 89 e0 41 5c c3 66 0f 1f 84 00 00 00 00 00 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 5d 2a 0f 00 f7
d8 64 89 01 48
[  751.764347] RSP: 002b:00007fd5f1df95c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[  751.781674] RAX: ffffffffffffffda RBX: 0000000003002350 RCX: 00007fd7d47c839b
[  751.798000] RDX: 0000000000000000 RSI: 000000000000ae80 RDI: 000000000000002a
[  751.814326] RBP: 00007fd5f1df96c0 R08: 0000000000daf570 R09: 00000000000000ff
[  751.830648] R10: 000000000082a1bb R11: 0000000000000246 R12: 00007fd5f1dfb640
[  751.846977] R13: 0000000000000000 R14: 00007fd7d474e880 R15: 0000000000000000
[  751.863310]  </TASK>
[  751.868292] Modules linked in: tun xt_conntrack ip6table_filter ip6_tables snd_ctl_led snd_hda_codec_realtek snd_hda_codec_generic snd_hda_codec_hdmi snd_hda_intel snd_intel_dspcfg snd_usb_audio
snd_usbmidi_lib snd_hda_codec btusb kvm_amd(O) snd_acp3x_pdm_dma snd_soc_dmic snd_acp3x_rn iwlmvm btrtl tps6598x snd_hwdep btbcm regmap_i2c snd_soc_core wmi_bmof kvm(O) snd_hda_core irqbypass pcspkr
psmouse ftdi_sio tpm_crb k10temp snd_rawmidi btintel i2c_multi_instantiate i2c_piix4 snd_rn_pci_acp3x iwlwifi bfq snd_pcm tpm_tis thinkpad_acpi tpm_tis_core wmi platform_profile rtc_cmos
i2c_designware_platform i2c_designware_core vendor_reset(O) dm_crypt hid_generic usbhid mmc_block amdgpu drm_ttm_helper ttm gpu_sched i2c_algo_bit drm_kms_helper cfbfillrect syscopyarea cfbimgblt
sysfillrect rtsx_pci_sdmmc sysimgblt fb_sys_fops cfbcopyarea mmc_core xhci_pci ccp atkbd rtsx_pci nvme libps2 drm ucsi_acpi rng_core mfd_core xhci_hcd drm_panel_orientation_quirks sp5100_tco
typec_ucsi nvme_core typec t10_pi
[  751.868330]  i8042 pinctrl_amd r8169 realtek 8250_pci usbmon nbd fuse autofs4
[  752.084873] CR2: 0000000000000025
[  752.092439] ---[ end trace cf646c318ffb08e5 ]---
[  752.102994] RIP: 0010:is_page_fault_stale.isra.0+0x2a/0xa0 [kvm]
[  752.116760] Code: 0f 1f 44 00 00 55 48 8b 46 38 41 b8 01 00 00 00 48 be 00 00 00 00 00 ea ff ff 48 c1 e8 0c 48 c1 e0 06 48 89 e5 48 8b 44 30 28 <f6> 40 25 08 75 25 80 78 20 00 74 24 48 83 7a 20 00
74 31 48 83 bf
[  752.159774] RSP: 0018:ffffc900020afa90 EFLAGS: 00010216
[  752.171722] RAX: 0000000000000000 RBX: ffffc900020afc88 RCX: 00000000000000a9
[  752.188053] RDX: ffffc900020afc88 RSI: ffffea0000000000 RDI: ffffc90001fe9000
[  752.204382] RBP: ffffc900020afa90 R08: 0000000000000001 R09: ffffc900020afcb8
[  752.2

Oh well, this one is not related to the patch series, but noting it so I don't forget:
I need to do some thorough testing on all the VMs I use.

Best regards,
	Maxim Levitsky


> 
> Best regards,
> 	Maxim Levitsky
> > I'll post my patches tomorrow, after some more testing.
> > 
> > Best regards,
> > 	Maxim Levitsky



^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v3 00/26] KVM: x86: Halt and APICv overhaul
  2021-12-08 23:00 ` Maxim Levitsky
  2021-12-08 23:16   ` Maxim Levitsky
@ 2021-12-08 23:43   ` Sean Christopherson
  2021-12-09  6:34     ` Maxim Levitsky
  1 sibling, 1 reply; 50+ messages in thread
From: Sean Christopherson @ 2021-12-08 23:43 UTC (permalink / raw)
  To: Maxim Levitsky
  Cc: Paolo Bonzini, Joerg Roedel, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Suravee Suthikulpanit, kvm, iommu, linux-kernel

On Thu, Dec 09, 2021, Maxim Levitsky wrote:
> >   KVM: SVM: Remove unnecessary APICv/AVIC update in vCPU unblocking path

...

> Probably just luck (I can't reproduce this anymore), but while running some
> kvm unit tests with this patch series (and a few of my patches for AVIC
> co-existence, which shouldn't affect this) I got this
> 
> (warning about is_running already set)

My best guess would be the above commit that dropped the handling in the unblock
path, but I haven't been able to concoct a scenario where avic_physical_id_cache
can get out of sync with respect to kvm_vcpu_apicv_active().
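
For context, the "is_running already set" warning is the IsRunning sanity
check in avic_vcpu_load(); a rough reconstruction using the 5.16-era
field/mask names (simplified, not necessarily the exact upstream lines):

	u64 entry = READ_ONCE(*(svm->avic_physical_id_cache));

	/* IsRunning must not already be set when loading the vCPU. */
	WARN_ON(entry & AVIC_PHYSICAL_ID_ENTRY_IS_RUNNING_MASK);

i.e. the WARN firing means some earlier put/blocking path left the bit
set, which is why the dropped unblock-path update is the prime suspect.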

> Dec 08 22:53:26 amdlaptop kernel: ------------[ cut here ]------------
> Dec 08 22:53:26 amdlaptop kernel: WARNING: CPU: 3 PID: 72804 at arch/x86/kvm/svm/avic.c:1045 avic_vcpu_load+0xe3/0x100 [kvm_amd]
> Dec 08 22:53:26 amdlaptop kernel: Modules linked in: kvm_amd(O) ccp rng_core kvm(O) irqbypass xt_conntrack ip6table_filter ip6_tables snd_soc_dmic snd_acp3x_>
> Dec 08 22:53:26 amdlaptop kernel:  r8169 realtek 8250_pci usbmon nbd fuse autofs4 [last unloaded: rng_core]
> Dec 08 22:53:26 amdlaptop kernel: CPU: 3 PID: 72804 Comm: qemu-system-i38 Tainted: G           O      5.16.0-rc4.unstable #6
> Dec 08 22:53:26 amdlaptop kernel: Hardware name: LENOVO 20UF001CUS/20UF001CUS, BIOS R1CET65W(1.34 ) 06/17/2021
> Dec 08 22:53:26 amdlaptop kernel: RIP: 0010:avic_vcpu_load+0xe3/0x100 [kvm_amd]
> Dec 08 22:53:26 amdlaptop kernel: Code: 0d 9f e0 85 c0 74 e8 4c 89 f6 4c 89 ff e8 a5 99 f4 e0 5b 41 5c 41 5d 41 5e 41 5f 5d c3 0f 0b 5b 41 5c 41 5d 41 5e 41 >
> Dec 08 22:53:26 amdlaptop kernel: RSP: 0018:ffffc9000b17bba8 EFLAGS: 00010247
> Dec 08 22:53:26 amdlaptop kernel: RAX: 6f63203a756d6571 RBX: ffff888106194740 RCX: ffff88812e7ac000
> Dec 08 22:53:26 amdlaptop kernel: RDX: ffff8883ff6c0000 RSI: 0000000000000003 RDI: 0000000000000003
> Dec 08 22:53:26 amdlaptop kernel: RBP: ffffc9000b17bbd0 R08: ffff888106194740 R09: 0000000000000000
> Dec 08 22:53:26 amdlaptop kernel: R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000003
> Dec 08 22:53:26 amdlaptop kernel: R13: 0000000000000003 R14: ffff88810023b060 R15: dead000000000100
> Dec 08 22:53:26 amdlaptop kernel: FS:  0000000000000000(0000) GS:ffff8883ff6c0000(0000) knlGS:0000000000000000
> Dec 08 22:53:26 amdlaptop kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> Dec 08 22:53:26 amdlaptop kernel: CR2: 00005587e812f958 CR3: 0000000105f31000 CR4: 0000000000350ee0
> Dec 08 22:53:26 amdlaptop kernel: DR0: 00000000004008da DR1: 0000000000000000 DR2: 0000000000000000
> Dec 08 22:53:26 amdlaptop kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> Dec 08 22:53:26 amdlaptop kernel: Call Trace:
> Dec 08 22:53:26 amdlaptop kernel:  <TASK>
> Dec 08 22:53:26 amdlaptop kernel:  svm_vcpu_load+0x56/0x60 [kvm_amd]
> Dec 08 22:53:26 amdlaptop kernel:  kvm_arch_vcpu_load+0x32/0x210 [kvm]
> Dec 08 22:53:26 amdlaptop kernel:  vcpu_load+0x34/0x40 [kvm]
> Dec 08 22:53:26 amdlaptop kernel:  kvm_arch_destroy_vm+0xd4/0x1c0 [kvm]
> Dec 08 22:53:26 amdlaptop kernel:  kvm_destroy_vm+0x163/0x250 [kvm]
> Dec 08 22:53:26 amdlaptop kernel:  kvm_put_kvm+0x26/0x40 [kvm]
> Dec 08 22:53:26 amdlaptop kernel:  kvm_vm_release+0x22/0x30 [kvm]
> Dec 08 22:53:26 amdlaptop kernel:  __fput+0x94/0x250
> Dec 08 22:53:26 amdlaptop kernel:  ____fput+0xe/0x10
> Dec 08 22:53:26 amdlaptop kernel:  task_work_run+0x63/0xa0
> Dec 08 22:53:26 amdlaptop kernel:  do_exit+0x358/0xa30
> Dec 08 22:53:26 amdlaptop kernel:  do_group_exit+0x3b/0xa0
> Dec 08 22:53:26 amdlaptop kernel:  get_signal+0x15b/0x880
> Dec 08 22:53:26 amdlaptop kernel:  ? _copy_to_user+0x20/0x30
> Dec 08 22:53:26 amdlaptop kernel:  ? put_timespec64+0x3d/0x60
> Dec 08 22:53:26 amdlaptop kernel:  arch_do_signal_or_restart+0x106/0x740
> Dec 08 22:53:26 amdlaptop kernel:  ? hrtimer_nanosleep+0x9f/0x120
> Dec 08 22:53:26 amdlaptop kernel:  ? __hrtimer_init+0xd0/0xd0
> Dec 08 22:53:26 amdlaptop kernel:  exit_to_user_mode_prepare+0x112/0x1f0
> Dec 08 22:53:26 amdlaptop kernel:  syscall_exit_to_user_mode+0x17/0x40
> Dec 08 22:53:26 amdlaptop kernel:  do_syscall_64+0x42/0x80
> Dec 08 22:53:26 amdlaptop kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xae
> Dec 08 22:53:26 amdlaptop kernel: RIP: 0033:0x7f537abb13b5
> Dec 08 22:53:26 amdlaptop kernel: Code: Unable to access opcode bytes at RIP 0x7f537abb138b.
> Dec 08 22:53:26 amdlaptop kernel: RSP: 002b:00007f5376a39680 EFLAGS: 00000293 ORIG_RAX: 00000000000000e6
> Dec 08 22:53:26 amdlaptop kernel: RAX: fffffffffffffdfc RBX: 00007f5376a396d0 RCX: 00007f537abb13b5
> Dec 08 22:53:26 amdlaptop kernel: RDX: 00007f5376a396d0 RSI: 0000000000000000 RDI: 0000000000000000
> Dec 08 22:53:26 amdlaptop kernel: RBP: 00007f5376a396c0 R08: 0000000000000000 R09: 0000000000000000
> Dec 08 22:53:26 amdlaptop kernel: R10: 00007f5376a396c0 R11: 0000000000000293 R12: 00007f5376a3b640
> Dec 08 22:53:26 amdlaptop kernel: R13: 0000000000000002 R14: 00007f537ab66880 R15: 0000000000000000
> Dec 08 22:53:26 amdlaptop kernel:  </TASK>
> Dec 08 22:53:26 amdlaptop kernel: ---[ end trace 676058aaf29d0267 ]---
> 
> 
> I'll post my patches tomorrow, after some more testing.
> 
> Best regards,
> 	Maxim Levitsky
> 

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v3 00/26] KVM: x86: Halt and APICv overhaul
  2021-12-08 23:16   ` Maxim Levitsky
  2021-12-08 23:34     ` Maxim Levitsky
@ 2021-12-09  0:02     ` Sean Christopherson
  2021-12-09 14:29       ` Paolo Bonzini
  2021-12-09  1:37     ` Sean Christopherson
  2 siblings, 1 reply; 50+ messages in thread
From: Sean Christopherson @ 2021-12-09  0:02 UTC (permalink / raw)
  To: Maxim Levitsky
  Cc: Paolo Bonzini, Joerg Roedel, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Suravee Suthikulpanit, kvm, iommu, linux-kernel

On Thu, Dec 09, 2021, Maxim Levitsky wrote:
> Also got this while trying a VM with a passed-through device:
> 
> [mlevitsk@amdlaptop ~]$[   34.926140] usb 5-3: reset full-speed USB device number 3 using xhci_hcd
> [   42.583661] FAT-fs (mmcblk0p1): Volume was not properly unmounted. Some data may be corrupt. Please run fsck.
> [  363.562173] VFIO - User Level meta-driver version: 0.3
> [  365.160357] vfio-pci 0000:03:00.0: vfio_ecap_init: hiding ecap 0x1e@0x154
> [  384.138110] BUG: kernel NULL pointer dereference, address: 0000000000000021
> [  384.154039] #PF: supervisor read access in kernel mode
> [  384.165645] #PF: error_code(0x0000) - not-present page
> [  384.177254] PGD 16da9d067 P4D 16da9d067 PUD 13ad1a067 PMD 0 
> [  384.190036] Oops: 0000 [#1] SMP
> [  384.197117] CPU: 3 PID: 14403 Comm: CPU 3/KVM Tainted: G           O      5.16.0-rc4.unstable #6
> [  384.216978] Hardware name: LENOVO 20UF001CUS/20UF001CUS, BIOS R1CET65W(1.34 ) 06/17/2021
> [  384.235258] RIP: 0010:amd_iommu_update_ga+0x32/0x160
> [  384.246469] Code: <4c> 8b 62 20 48 8b 4a 18 4d 85 e4 0f 84 ca 00 00 00 48 85 c9 0f 84
> [  384.288932] RSP: 0018:ffffc9000036fca0 EFLAGS: 00010046
> [  384.300727] RAX: 0000000000000000 RBX: ffff88810b68ab60 RCX: ffff8881667a6018
> [  384.316850] RDX: 0000000000000001 RSI: ffff888107476b00 RDI: 0000000000000003

RDX, a.k.a. ir_data is NULL.  This check in svm_ir_list_add() 

	if (pi->ir_data && (pi->prev_ga_tag != 0)) {

implies pi->ir_data can be NULL, but neither avic_update_iommu_vcpu_affinity()
nor amd_iommu_update_ga() check ir->data for NULL.

amd_ir_set_vcpu_affinity() returns "success" without clearing pi.is_guest_mode

	/* Note:
	 * This device has never been set up for guest mode.
	 * we should not modify the IRTE
	 */
	if (!dev_data || !dev_data->use_vapic)
		return 0;

so it's plausible svm_ir_list_add() could add to the list with a NULL pi->ir_data.

But none of the relevant code has seen any meaningful changes since 5.15, so odds
are good I broke something :-/
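
To make the hazard concrete, the consumer side is roughly the following
loop (a paraphrased sketch of avic_update_iommu_vcpu_affinity(), not the
verbatim source):

	list_for_each_entry(ir, &svm->ir_list, node) {
		/*
		 * ir->data is handed straight to the IOMMU with no NULL
		 * check, and amd_iommu_update_ga() dereferences it right
		 * away, consistent with the low-address fault above.
		 */
		ret = amd_iommu_update_ga(cpu, r, ir->data);
		if (ret)
			break;
	}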

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v3 00/26] KVM: x86: Halt and APICv overhaul
  2021-12-08 23:34     ` Maxim Levitsky
@ 2021-12-09  0:04       ` Sean Christopherson
  2021-12-09  6:36         ` Maxim Levitsky
  0 siblings, 1 reply; 50+ messages in thread
From: Sean Christopherson @ 2021-12-09  0:04 UTC (permalink / raw)
  To: Maxim Levitsky
  Cc: Paolo Bonzini, Joerg Roedel, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Suravee Suthikulpanit, kvm, iommu, linux-kernel

On Thu, Dec 09, 2021, Maxim Levitsky wrote:
> Host crash while running a 32-bit VM and another 32-bit VM nested in it:
> 
> [  751.182290] BUG: kernel NULL pointer dereference, address: 0000000000000025
> [  751.198234] #PF: supervisor read access in kernel mode
> [  751.209982] #PF: error_code(0x0000) - not-present page
> [  751.221733] PGD 3720f9067 P4D 3720f9067 PUD 3720f8067 PMD 0 
> [  751.234682] Oops: 0000 [#1] SMP
> [  751.241857] CPU: 8 PID: 54050 Comm: CPU 8/KVM Tainted: G           O      5.16.0-rc4.unstable #6
> [  751.261960] Hardware name: LENOVO 20UF001CUS/20UF001CUS, BIOS R1CET65W(1.34 ) 06/17/2021
> [  751.280475] RIP: 0010:is_page_fault_stale.isra.0+0x2a/0xa0 [kvm]

...

> Oh well, not related to the patch series, but noting it here so I don't forget:
> I need to do some thorough testing on all the VMs I use.

This is my goof, I'll post a fix shortly.

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v3 00/26] KVM: x86: Halt and APICv overhaul
  2021-12-08 23:16   ` Maxim Levitsky
  2021-12-08 23:34     ` Maxim Levitsky
  2021-12-09  0:02     ` Sean Christopherson
@ 2021-12-09  1:37     ` Sean Christopherson
  2021-12-09  6:31       ` Maxim Levitsky
  2 siblings, 1 reply; 50+ messages in thread
From: Sean Christopherson @ 2021-12-09  1:37 UTC (permalink / raw)
  To: Maxim Levitsky
  Cc: Paolo Bonzini, Joerg Roedel, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Suravee Suthikulpanit, kvm, iommu, linux-kernel

On Thu, Dec 09, 2021, Maxim Levitsky wrote:
> On Thu, 2021-12-09 at 01:00 +0200, Maxim Levitsky wrote:
> > Probably just luck (can't reproduce this anymore), but
> > while running some kvm unit tests with this patch series (and a few of my patches
> > for AVIC co-existence, which shouldn't affect this) I got this
> > 
> > (warning about is_running already set)

...
 
> Also got this while trying a VM with a passed-through device:

A tangentially related question: have you seen any mysterious crashes on your AMD
system?  I've been bisecting (well, attempting to bisect) bizarre crashes that
AFAICT showed up between v5.15 and v5.16-rc2.  Things like runqueues being NULL
deep in the scheduler when a CPU is coming out of idle.  I _think_ the issues have
been fixed as of v5.16-rc4, but I don't have a good reproducer so bisecting in
either direction has been a complete mess.  I've reproduced on multiple AMD hosts,
but never on an Intel system.  I have a sinking feeling that the issue is
relatively unique to our systems :-/

And a request: any testing and bug fixes you can throw at the AVIC changes would be
greatly appreciated.  I've been partially blocked on testing the AVIC stuff for the
better part of the week.  If the crashes I'm seeing have been resolved, then I should
be able to help hunt down the issues, but if not...

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v3 00/26] KVM: x86: Halt and APICv overhaul
  2021-12-09  1:37     ` Sean Christopherson
@ 2021-12-09  6:31       ` Maxim Levitsky
  0 siblings, 0 replies; 50+ messages in thread
From: Maxim Levitsky @ 2021-12-09  6:31 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Joerg Roedel, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Suravee Suthikulpanit, kvm, iommu, linux-kernel

On Thu, 2021-12-09 at 01:37 +0000, Sean Christopherson wrote:
> On Thu, Dec 09, 2021, Maxim Levitsky wrote:
> > On Thu, 2021-12-09 at 01:00 +0200, Maxim Levitsky wrote:
> > > Probably just luck (can't reproduce this anymore), but
> > > while running some kvm unit tests with this patch series (and a few of my patches
> > > for AVIC co-existence, which shouldn't affect this) I got this
> > > 
> > > (warning about is_running already set)
> 
> ...
>  
> > Also got this while trying a VM with a passed-through device:
> 
> A tangentially related question: have you seen any mysterious crashes on your AMD
> system?  I've been bisecting (well, attempting to bisect) bizarre crashes that
> AFAICT showed up between v5.15 and v5.16-rc2.  Things like runqueues being NULL
> deep in the scheduler when a CPU is coming out of idle.  I _think_ the issues have
> been fixed as of v5.16-rc4, but I don't have a good reproducer so bisecting in
> either direction has been a complete mess.  I've reproduced on multiple AMD hosts,
> but never on an Intel system.  I have a sinking feeling that the issue is
> relatively unique to our systems :-/

I did have my 3970X lock up hard once on 5.16-rc2. It locked up completely, without
even sending anything over the PCIe serial port card.

I don't remember what I was doing during the crash, but I probably had some VMs running.

Since then it hasn't happened again, and I have been running 5.16-rc3 for some time
with Paolo's kvm/queue merged, along with my own patches.

> 
> And a request: any testing and bug fixes you can throw at the AVIC changes would be
> greatly appreciated.  I've been partially blocked on testing the AVIC stuff for the
> better part of the week.  If the crashes I'm seeing have been resolved, then I should
> be able to help hunt down the issues, but if not...
> 

This is what I started doing last evening, and I'll go through all of the usual testing
I do soon.

Best regards,
	Maxim Levitsky


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v3 00/26] KVM: x86: Halt and APICv overhaul
  2021-12-08 23:43   ` Sean Christopherson
@ 2021-12-09  6:34     ` Maxim Levitsky
  0 siblings, 0 replies; 50+ messages in thread
From: Maxim Levitsky @ 2021-12-09  6:34 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Joerg Roedel, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Suravee Suthikulpanit, kvm, iommu, linux-kernel

On Wed, 2021-12-08 at 23:43 +0000, Sean Christopherson wrote:
> On Thu, Dec 09, 2021, Maxim Levitsky wrote:
> > >   KVM: SVM: Remove unnecessary APICv/AVIC update in vCPU unblocking path
> 
> ...
> 
> > Probably just luck (can't reproduce this anymore), but
> > while running some kvm unit tests with this patch series (and a few of my patches
> > for AVIC co-existence, which shouldn't affect this) I got this
> > 
> > (warning about is_running already set)
> 
> My best guess would be the above commit that dropped the handling in the unblock
> path, but I haven't been able to concoct a scenario where avic_physical_id_cache
> can get out of sync with respect to kvm_vcpu_apicv_active().

Same here. It seems like the unit test is being killed by a timeout signal; I tried
to reproduce it that way, but no luck so far.

Best regards,
	Maxim Levitsky

> 
> > Dec 08 22:53:26 amdlaptop kernel: ------------[ cut here ]------------
> > Dec 08 22:53:26 amdlaptop kernel: WARNING: CPU: 3 PID: 72804 at arch/x86/kvm/svm/avic.c:1045 avic_vcpu_load+0xe3/0x100 [kvm_amd]
> > Dec 08 22:53:26 amdlaptop kernel: Modules linked in: kvm_amd(O) ccp rng_core kvm(O) irqbypass xt_conntrack ip6table_filter ip6_tables snd_soc_dmic snd_acp3x_>
> > Dec 08 22:53:26 amdlaptop kernel:  r8169 realtek 8250_pci usbmon nbd fuse autofs4 [last unloaded: rng_core]
> > Dec 08 22:53:26 amdlaptop kernel: CPU: 3 PID: 72804 Comm: qemu-system-i38 Tainted: G           O      5.16.0-rc4.unstable #6
> > Dec 08 22:53:26 amdlaptop kernel: Hardware name: LENOVO 20UF001CUS/20UF001CUS, BIOS R1CET65W(1.34 ) 06/17/2021
> > Dec 08 22:53:26 amdlaptop kernel: RIP: 0010:avic_vcpu_load+0xe3/0x100 [kvm_amd]
> > Dec 08 22:53:26 amdlaptop kernel: Code: 0d 9f e0 85 c0 74 e8 4c 89 f6 4c 89 ff e8 a5 99 f4 e0 5b 41 5c 41 5d 41 5e 41 5f 5d c3 0f 0b 5b 41 5c 41 5d 41 5e 41 >
> > Dec 08 22:53:26 amdlaptop kernel: RSP: 0018:ffffc9000b17bba8 EFLAGS: 00010247
> > Dec 08 22:53:26 amdlaptop kernel: RAX: 6f63203a756d6571 RBX: ffff888106194740 RCX: ffff88812e7ac000
> > Dec 08 22:53:26 amdlaptop kernel: RDX: ffff8883ff6c0000 RSI: 0000000000000003 RDI: 0000000000000003
> > Dec 08 22:53:26 amdlaptop kernel: RBP: ffffc9000b17bbd0 R08: ffff888106194740 R09: 0000000000000000
> > Dec 08 22:53:26 amdlaptop kernel: R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000003
> > Dec 08 22:53:26 amdlaptop kernel: R13: 0000000000000003 R14: ffff88810023b060 R15: dead000000000100
> > Dec 08 22:53:26 amdlaptop kernel: FS:  0000000000000000(0000) GS:ffff8883ff6c0000(0000) knlGS:0000000000000000
> > Dec 08 22:53:26 amdlaptop kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > Dec 08 22:53:26 amdlaptop kernel: CR2: 00005587e812f958 CR3: 0000000105f31000 CR4: 0000000000350ee0
> > Dec 08 22:53:26 amdlaptop kernel: DR0: 00000000004008da DR1: 0000000000000000 DR2: 0000000000000000
> > Dec 08 22:53:26 amdlaptop kernel: DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> > Dec 08 22:53:26 amdlaptop kernel: Call Trace:
> > Dec 08 22:53:26 amdlaptop kernel:  <TASK>
> > Dec 08 22:53:26 amdlaptop kernel:  svm_vcpu_load+0x56/0x60 [kvm_amd]
> > Dec 08 22:53:26 amdlaptop kernel:  kvm_arch_vcpu_load+0x32/0x210 [kvm]
> > Dec 08 22:53:26 amdlaptop kernel:  vcpu_load+0x34/0x40 [kvm]
> > Dec 08 22:53:26 amdlaptop kernel:  kvm_arch_destroy_vm+0xd4/0x1c0 [kvm]
> > Dec 08 22:53:26 amdlaptop kernel:  kvm_destroy_vm+0x163/0x250 [kvm]
> > Dec 08 22:53:26 amdlaptop kernel:  kvm_put_kvm+0x26/0x40 [kvm]
> > Dec 08 22:53:26 amdlaptop kernel:  kvm_vm_release+0x22/0x30 [kvm]
> > Dec 08 22:53:26 amdlaptop kernel:  __fput+0x94/0x250
> > Dec 08 22:53:26 amdlaptop kernel:  ____fput+0xe/0x10
> > Dec 08 22:53:26 amdlaptop kernel:  task_work_run+0x63/0xa0
> > Dec 08 22:53:26 amdlaptop kernel:  do_exit+0x358/0xa30
> > Dec 08 22:53:26 amdlaptop kernel:  do_group_exit+0x3b/0xa0
> > Dec 08 22:53:26 amdlaptop kernel:  get_signal+0x15b/0x880
> > Dec 08 22:53:26 amdlaptop kernel:  ? _copy_to_user+0x20/0x30
> > Dec 08 22:53:26 amdlaptop kernel:  ? put_timespec64+0x3d/0x60
> > Dec 08 22:53:26 amdlaptop kernel:  arch_do_signal_or_restart+0x106/0x740
> > Dec 08 22:53:26 amdlaptop kernel:  ? hrtimer_nanosleep+0x9f/0x120
> > Dec 08 22:53:26 amdlaptop kernel:  ? __hrtimer_init+0xd0/0xd0
> > Dec 08 22:53:26 amdlaptop kernel:  exit_to_user_mode_prepare+0x112/0x1f0
> > Dec 08 22:53:26 amdlaptop kernel:  syscall_exit_to_user_mode+0x17/0x40
> > Dec 08 22:53:26 amdlaptop kernel:  do_syscall_64+0x42/0x80
> > Dec 08 22:53:26 amdlaptop kernel:  entry_SYSCALL_64_after_hwframe+0x44/0xae
> > Dec 08 22:53:26 amdlaptop kernel: RIP: 0033:0x7f537abb13b5
> > Dec 08 22:53:26 amdlaptop kernel: Code: Unable to access opcode bytes at RIP 0x7f537abb138b.
> > Dec 08 22:53:26 amdlaptop kernel: RSP: 002b:00007f5376a39680 EFLAGS: 00000293 ORIG_RAX: 00000000000000e6
> > Dec 08 22:53:26 amdlaptop kernel: RAX: fffffffffffffdfc RBX: 00007f5376a396d0 RCX: 00007f537abb13b5
> > Dec 08 22:53:26 amdlaptop kernel: RDX: 00007f5376a396d0 RSI: 0000000000000000 RDI: 0000000000000000
> > Dec 08 22:53:26 amdlaptop kernel: RBP: 00007f5376a396c0 R08: 0000000000000000 R09: 0000000000000000
> > Dec 08 22:53:26 amdlaptop kernel: R10: 00007f5376a396c0 R11: 0000000000000293 R12: 00007f5376a3b640
> > Dec 08 22:53:26 amdlaptop kernel: R13: 0000000000000002 R14: 00007f537ab66880 R15: 0000000000000000
> > Dec 08 22:53:26 amdlaptop kernel:  </TASK>
> > Dec 08 22:53:26 amdlaptop kernel: ---[ end trace 676058aaf29d0267 ]---
> > 
> > 
> > I'll post my patches tomorrow, after some more testing.
> > 
> > Best regards,
> > 	Maxim Levitsky
> > 



^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v3 00/26] KVM: x86: Halt and APICv overhaul
  2021-12-09  0:04       ` Sean Christopherson
@ 2021-12-09  6:36         ` Maxim Levitsky
  0 siblings, 0 replies; 50+ messages in thread
From: Maxim Levitsky @ 2021-12-09  6:36 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Joerg Roedel, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Suravee Suthikulpanit, kvm, iommu, linux-kernel

On Thu, 2021-12-09 at 00:04 +0000, Sean Christopherson wrote:
> On Thu, Dec 09, 2021, Maxim Levitsky wrote:
> > Host crash while running a 32-bit VM and another 32-bit VM nested in it:
> > 
> > [  751.182290] BUG: kernel NULL pointer dereference, address: 0000000000000025
> > [  751.198234] #PF: supervisor read access in kernel mode
> > [  751.209982] #PF: error_code(0x0000) - not-present page
> > [  751.221733] PGD 3720f9067 P4D 3720f9067 PUD 3720f8067 PMD 0 
> > [  751.234682] Oops: 0000 [#1] SMP
> > [  751.241857] CPU: 8 PID: 54050 Comm: CPU 8/KVM Tainted: G           O      5.16.0-rc4.unstable #6
> > [  751.261960] Hardware name: LENOVO 20UF001CUS/20UF001CUS, BIOS R1CET65W(1.34 ) 06/17/2021
> > [  751.280475] RIP: 0010:is_page_fault_stale.isra.0+0x2a/0xa0 [kvm]
> 
> ...
> 
> > Oh well, not related to the patch series, but noting it here so I don't forget:
> > I need to do some thorough testing on all the VMs I use.
> 
> This is my goof, I'll post a fix shortly.
> 
Thanks!

Best regards,
	Maxim Levitsky


^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v3 00/26] KVM: x86: Halt and APICv overhaul
  2021-12-09  0:02     ` Sean Christopherson
@ 2021-12-09 14:29       ` Paolo Bonzini
  2021-12-09 14:48         ` Maxim Levitsky
  0 siblings, 1 reply; 50+ messages in thread
From: Paolo Bonzini @ 2021-12-09 14:29 UTC (permalink / raw)
  To: Sean Christopherson, Maxim Levitsky
  Cc: Joerg Roedel, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Suravee Suthikulpanit, kvm, iommu, linux-kernel

On 12/9/21 01:02, Sean Christopherson wrote:
> RDX, a.k.a. ir_data is NULL.  This check in svm_ir_list_add()
> 
> 	if (pi->ir_data && (pi->prev_ga_tag != 0)) {
> 
> implies pi->ir_data can be NULL, but neither avic_update_iommu_vcpu_affinity()
> nor amd_iommu_update_ga() check ir->data for NULL.
> 
> amd_ir_set_vcpu_affinity() returns "success" without clearing pi.is_guest_mode
> 
> 	/* Note:
> 	 * This device has never been set up for guest mode.
> 	 * we should not modify the IRTE
> 	 */
> 	if (!dev_data || !dev_data->use_vapic)
> 		return 0;
> 
> so it's plausible svm_ir_list_add() could add to the list with a NULL pi->ir_data.
> 
> But none of the relevant code has seen any meaningful changes since 5.15, so odds
> are good I broke something :-/
> 

Ok, I'll take this.

Paolo

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v3 00/26] KVM: x86: Halt and APICv overhaul
  2021-12-09 14:29       ` Paolo Bonzini
@ 2021-12-09 14:48         ` Maxim Levitsky
  2021-12-09 15:45           ` Sean Christopherson
  0 siblings, 1 reply; 50+ messages in thread
From: Maxim Levitsky @ 2021-12-09 14:48 UTC (permalink / raw)
  To: Paolo Bonzini, Sean Christopherson
  Cc: Joerg Roedel, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Suravee Suthikulpanit, kvm, iommu, linux-kernel

On Thu, 2021-12-09 at 15:29 +0100, Paolo Bonzini wrote:
> On 12/9/21 01:02, Sean Christopherson wrote:
> > RDX, a.k.a. ir_data is NULL.  This check in svm_ir_list_add()
> > 
> > 	if (pi->ir_data && (pi->prev_ga_tag != 0)) {
> > 
> > implies pi->ir_data can be NULL, but neither avic_update_iommu_vcpu_affinity()
> > nor amd_iommu_update_ga() check ir->data for NULL.
> > 
> > amd_ir_set_vcpu_affinity() returns "success" without clearing pi.is_guest_mode
> > 
> > 	/* Note:
> > 	 * This device has never been set up for guest mode.
> > 	 * we should not modify the IRTE
> > 	 */
> > 	if (!dev_data || !dev_data->use_vapic)
> > 		return 0;
> > 
> > so it's plausible svm_ir_list_add() could add to the list with a NULL pi->ir_data.
> > 
> > But none of the relevant code has seen any meaningful changes since 5.15, so odds
> > are good I broke something :-/

Doesn't reproduce here yet even with my iommu changes :-(
Oh well.

Best regards,
	Maxim Levitsky


> > 
> 
> Ok, I'll take this.
> 
> Paolo
> 



^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v3 00/26] KVM: x86: Halt and APICv overhaul
  2021-12-09 14:48         ` Maxim Levitsky
@ 2021-12-09 15:45           ` Sean Christopherson
  2021-12-09 16:03             ` Maxim Levitsky
  0 siblings, 1 reply; 50+ messages in thread
From: Sean Christopherson @ 2021-12-09 15:45 UTC (permalink / raw)
  To: Maxim Levitsky
  Cc: Paolo Bonzini, Joerg Roedel, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Suravee Suthikulpanit, kvm, iommu, linux-kernel

On Thu, Dec 09, 2021, Maxim Levitsky wrote:
> On Thu, 2021-12-09 at 15:29 +0100, Paolo Bonzini wrote:
> > On 12/9/21 01:02, Sean Christopherson wrote:
> > > RDX, a.k.a. ir_data is NULL.  This check in svm_ir_list_add()
> > > 
> > > 	if (pi->ir_data && (pi->prev_ga_tag != 0)) {
> > > 
> > > implies pi->ir_data can be NULL, but neither avic_update_iommu_vcpu_affinity()
> > > nor amd_iommu_update_ga() check ir->data for NULL.
> > > 
> > > amd_ir_set_vcpu_affinity() returns "success" without clearing pi.is_guest_mode
> > > 
> > > 	/* Note:
> > > 	 * This device has never been set up for guest mode.
> > > 	 * we should not modify the IRTE
> > > 	 */
> > > 	if (!dev_data || !dev_data->use_vapic)
> > > 		return 0;
> > > 
> > > so it's plausible svm_ir_list_add() could add to the list with a NULL pi->ir_data.
> > > 
> > > But none of the relevant code has seen any meaningful changes since 5.15, so odds
> > > are good I broke something :-/
> 
> Doesn't reproduce here yet even with my iommu changes :-(
> Oh well.

Hmm, which suggests it could be an existing corner case.

Based on the above, this seems prudent and correct:

@@ -747,7 +754,7 @@ static int svm_ir_list_add(struct vcpu_svm *svm, struct amd_iommu_pi_data *pi)
         * so we need to check here if it's already been * added
         * to the ir_list.
         */
-       if (pi->ir_data && (pi->prev_ga_tag != 0)) {
+       if (pi->prev_ga_tag != 0) {
                struct kvm *kvm = svm->vcpu.kvm;
                u32 vcpu_id = AVIC_GATAG_TO_VCPUID(pi->prev_ga_tag);
                struct kvm_vcpu *prev_vcpu = kvm_get_vcpu_by_id(kvm, vcpu_id);
@@ -877,7 +884,7 @@ int svm_update_pi_irte(struct kvm *kvm, unsigned int host_irq,
                         * we can reference to them directly when we update vcpu
                         * scheduling information in IOMMU irte.
                         */
-                       if (!ret && pi.is_guest_mode)
+                       if (!ret && pi.is_guest_mode && pi.ir_data)
                                svm_ir_list_add(svm, &pi);
                } else {
                        /* Use legacy mode in IRTE */
@@ -898,7 +905,7 @@ int svm_update_pi_irte(struct kvm *kvm, unsigned int host_irq,
                         * was cached. If so, we need to clean up the per-vcpu
                         * ir_list.
                         */
-                       if (!ret && pi.prev_ga_tag) {
+                       if (!ret && pi.prev_ga_tag && !WARN_ON(!pi.ir_data)) {
                                int id = AVIC_GATAG_TO_VCPUID(pi.prev_ga_tag);
                                struct kvm_vcpu *vcpu;



^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v3 00/26] KVM: x86: Halt and APICv overhaul
  2021-12-09 15:45           ` Sean Christopherson
@ 2021-12-09 16:03             ` Maxim Levitsky
  0 siblings, 0 replies; 50+ messages in thread
From: Maxim Levitsky @ 2021-12-09 16:03 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Joerg Roedel, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Suravee Suthikulpanit, kvm, iommu, linux-kernel

On Thu, 2021-12-09 at 15:45 +0000, Sean Christopherson wrote:
> On Thu, Dec 09, 2021, Maxim Levitsky wrote:
> > On Thu, 2021-12-09 at 15:29 +0100, Paolo Bonzini wrote:
> > > On 12/9/21 01:02, Sean Christopherson wrote:
> > > > RDX, a.k.a. ir_data is NULL.  This check in svm_ir_list_add()
> > > > 
> > > > 	if (pi->ir_data && (pi->prev_ga_tag != 0)) {
> > > > 
> > > > implies pi->ir_data can be NULL, but neither avic_update_iommu_vcpu_affinity()
> > > > nor amd_iommu_update_ga() check ir->data for NULL.
> > > > 
> > > > amd_ir_set_vcpu_affinity() returns "success" without clearing pi.is_guest_mode
> > > > 
> > > > 	/* Note:
> > > > 	 * This device has never been set up for guest mode.
> > > > 	 * we should not modify the IRTE
> > > > 	 */
> > > > 	if (!dev_data || !dev_data->use_vapic)
> > > > 		return 0;
> > > > 
> > > > so it's plausible svm_ir_list_add() could add to the list with a NULL pi->ir_data.
> > > > 
> > > > But none of the relevant code has seen any meaningful changes since 5.15, so odds
> > > > are good I broke something :-/
> > 
> > Doesn't reproduce here yet even with my iommu changes :-(
> > Oh well.
> 
> Hmm, which suggests it could be an existing corner case.

Could very well be!
Next Sunday I'll take a closer look at the AMD iommu code, and see if I can spot more bugs in it.

Best regards,
	Maxim Levitsky

> 
> Based on the above, this seems prudent and correct:
> 
> @@ -747,7 +754,7 @@ static int svm_ir_list_add(struct vcpu_svm *svm, struct amd_iommu_pi_data *pi)
>          * so we need to check here if it's already been * added
>          * to the ir_list.
>          */
> -       if (pi->ir_data && (pi->prev_ga_tag != 0)) {
> +       if (pi->prev_ga_tag != 0) {
>                 struct kvm *kvm = svm->vcpu.kvm;
>                 u32 vcpu_id = AVIC_GATAG_TO_VCPUID(pi->prev_ga_tag);
>                 struct kvm_vcpu *prev_vcpu = kvm_get_vcpu_by_id(kvm, vcpu_id);
> @@ -877,7 +884,7 @@ int svm_update_pi_irte(struct kvm *kvm, unsigned int host_irq,
>                          * we can reference to them directly when we update vcpu
>                          * scheduling information in IOMMU irte.
>                          */
> -                       if (!ret && pi.is_guest_mode)
> +                       if (!ret && pi.is_guest_mode && pi.ir_data)
>                                 svm_ir_list_add(svm, &pi);
>                 } else {
>                         /* Use legacy mode in IRTE */
> @@ -898,7 +905,7 @@ int svm_update_pi_irte(struct kvm *kvm, unsigned int host_irq,
>                          * was cached. If so, we need to clean up the per-vcpu
>                          * ir_list.
>                          */
> -                       if (!ret && pi.prev_ga_tag) {
> +                       if (!ret && pi.prev_ga_tag && !WARN_ON(!pi.ir_data)) {
>                                 int id = AVIC_GATAG_TO_VCPUID(pi.prev_ga_tag);
>                                 struct kvm_vcpu *vcpu;
> 
> 



^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v3 07/26] KVM: VMX: Move preemption timer <=> hrtimer dance to common x86
  2021-12-08  1:52 ` [PATCH v3 07/26] KVM: VMX: Move preemption timer <=> hrtimer dance to common x86 Sean Christopherson
@ 2023-03-29 12:34   ` Tudor Ambarus
  2023-03-29 13:47     ` Paolo Bonzini
  0 siblings, 1 reply; 50+ messages in thread
From: Tudor Ambarus @ 2023-03-29 12:34 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini, Joerg Roedel
  Cc: Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Suravee Suthikulpanit,
	kvm, iommu, linux-kernel, Maxim Levitsky, Lee Jones

Hi!

On 12/8/21 01:52, Sean Christopherson wrote:
> Handle the switch to/from the hypervisor/software timer when a vCPU is
> blocking in common x86 instead of in VMX.  Even though VMX is the only
> user of a hypervisor timer, the logic and all functions involved are
> generic x86 (unless future CPUs do something completely different and
> implement a hypervisor timer that runs regardless of mode).
> 
> Handling the switch in common x86 will allow for the elimination of the
> pre/post_blocks hooks, and also lets KVM switch back to the hypervisor
> timer if and only if it was in use (without additional params).  Add a
> comment explaining why the switch cannot be deferred to kvm_sched_out()
> or kvm_vcpu_block().
> 
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
> ---
>  arch/x86/kvm/vmx/vmx.c |  6 +-----
>  arch/x86/kvm/x86.c     | 21 +++++++++++++++++++++
>  2 files changed, 22 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 64932cc3e4e8..a5ba9a069f5d 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -7444,16 +7444,12 @@ void vmx_update_cpu_dirty_logging(struct kvm_vcpu *vcpu)
>  
>  static int vmx_pre_block(struct kvm_vcpu *vcpu)
>  {
> -	if (kvm_lapic_hv_timer_in_use(vcpu))
> -		kvm_lapic_switch_to_sw_timer(vcpu);
> -
>  	return 0;
>  }
>  
>  static void vmx_post_block(struct kvm_vcpu *vcpu)
>  {
> -	if (kvm_x86_ops.set_hv_timer)
> -		kvm_lapic_switch_to_hv_timer(vcpu);
> +
>  }
>  

This patch fixes the bug reported at:
LINK:
https://syzkaller.appspot.com/bug?id=489beb3d76ef14cc6cd18125782dc6f86051a605

One may find the strace at:
LINK: https://syzkaller.appspot.com/text?tag=CrashLog&x=1798b54ec80000
and the C reproducer at:
LINK: https://syzkaller.appspot.com/text?tag=ReproC&x=10365781c80000

Since I have no experience with kvm, it would be helpful if one of you could
provide some guidance. Do you think it is worth backporting this patch
to stable (together with its prerequisite patches), or shall I try to
get familiar with the code and provide a less invasive fix?

Thanks!
ta

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v3 07/26] KVM: VMX: Move preemption timer <=> hrtimer dance to common x86
  2023-03-29 12:34   ` Tudor Ambarus
@ 2023-03-29 13:47     ` Paolo Bonzini
  2023-03-29 15:22       ` Tudor Ambarus
  0 siblings, 1 reply; 50+ messages in thread
From: Paolo Bonzini @ 2023-03-29 13:47 UTC (permalink / raw)
  To: Tudor Ambarus, Sean Christopherson, Joerg Roedel
  Cc: Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Suravee Suthikulpanit,
	kvm, iommu, linux-kernel, Maxim Levitsky, Lee Jones

On 3/29/23 14:34, Tudor Ambarus wrote:
> This patch fixes the bug reported at:
> LINK:
> https://syzkaller.appspot.com/bug?id=489beb3d76ef14cc6cd18125782dc6f86051a605
> 
> One may find the strace at:
> LINK:https://syzkaller.appspot.com/text?tag=CrashLog&x=1798b54ec80000
> and the c reproducer at:
> LINK:https://syzkaller.appspot.com/text?tag=ReproC&x=10365781c80000
> 
> Since I have no experience with kvm, it would be helpful if one of you could
> provide some guidance. Do you think it is worth backporting this patch
> to stable (together with its prerequisite patches), or shall I try to
> get familiar with the code and provide a less invasive fix?

I think it is enough to fix the conflicts in vmx_pre_block and
vmx_post_block, there are no prerequisites:

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 0718658268fe..895069038856 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7577,17 +7577,11 @@ static int vmx_pre_block(struct kvm_vcpu *vcpu)
  	if (pi_pre_block(vcpu))
  		return 1;
  
-	if (kvm_lapic_hv_timer_in_use(vcpu))
-		kvm_lapic_switch_to_sw_timer(vcpu);
-
  	return 0;
  }
  
  static void vmx_post_block(struct kvm_vcpu *vcpu)
  {
-	if (kvm_x86_ops.set_hv_timer)
-		kvm_lapic_switch_to_hv_timer(vcpu);
-
  	pi_post_block(vcpu);
  }
  
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index fcfa3fedf84f..4eca3ec38afd 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -10022,12 +10022,28 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
  
  static inline int vcpu_block(struct kvm *kvm, struct kvm_vcpu *vcpu)
  {
+	bool hv_timer;
+
  	if (!kvm_arch_vcpu_runnable(vcpu) &&
  	    (!kvm_x86_ops.pre_block || static_call(kvm_x86_pre_block)(vcpu) == 0)) {
+		/*
+		 * Switch to the software timer before halt-polling/blocking as
+		 * the guest's timer may be a break event for the vCPU, and the
+		 * hypervisor timer runs only when the CPU is in guest mode.
+		 * Switch before halt-polling so that KVM recognizes an expired
+		 * timer before blocking.
+		 */
+		hv_timer = kvm_lapic_hv_timer_in_use(vcpu);
+		if (hv_timer)
+			kvm_lapic_switch_to_sw_timer(vcpu);
+
  		srcu_read_unlock(&kvm->srcu, vcpu->srcu_idx);
  		kvm_vcpu_block(vcpu);
  		vcpu->srcu_idx = srcu_read_lock(&kvm->srcu);
  
+		if (hv_timer)
+			kvm_lapic_switch_to_hv_timer(vcpu);
+
  		if (kvm_x86_ops.post_block)
  			static_call(kvm_x86_post_block)(vcpu);
  
@@ -10266,6 +10282,11 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
  			r = -EINTR;
  			goto out;
  		}
+		/*
+		 * It should be impossible for the hypervisor timer to be in
+		 * use before KVM has ever run the vCPU.
+		 */
+		WARN_ON_ONCE(kvm_lapic_hv_timer_in_use(vcpu));
  		kvm_vcpu_block(vcpu);
  		if (kvm_apic_accept_events(vcpu) < 0) {
  			r = 0;

The fix is due to the second "if" changing from
kvm_x86_ops.set_hv_timer to hv_timer.
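
Condensed, the behavioral difference in the restore path is (names taken
from the diff above):

	/*
	 * Before: keyed off hardware support, so a vCPU that never had
	 * the hypervisor timer in use could be switched onto it after
	 * blocking.
	 */
	if (kvm_x86_ops.set_hv_timer)
		kvm_lapic_switch_to_hv_timer(vcpu);

	/* After: restore the hypervisor timer only if it was actually in use. */
	if (hv_timer)
		kvm_lapic_switch_to_hv_timer(vcpu);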

Paolo


^ permalink raw reply related	[flat|nested] 50+ messages in thread

* Re: [PATCH v3 07/26] KVM: VMX: Move preemption timer <=> hrtimer dance to common x86
  2023-03-29 13:47     ` Paolo Bonzini
@ 2023-03-29 15:22       ` Tudor Ambarus
  2023-03-30  7:12         ` Tudor Ambarus
  0 siblings, 1 reply; 50+ messages in thread
From: Tudor Ambarus @ 2023-03-29 15:22 UTC (permalink / raw)
  To: Paolo Bonzini, Sean Christopherson, Joerg Roedel
  Cc: Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Suravee Suthikulpanit,
	kvm, iommu, linux-kernel, Maxim Levitsky, Lee Jones



On 3/29/23 14:47, Paolo Bonzini wrote:

Hi, Paolo!

> On 3/29/23 14:34, Tudor Ambarus wrote:
>> This patch fixes the bug reported at:
>> LINK:
>> https://syzkaller.appspot.com/bug?id=489beb3d76ef14cc6cd18125782dc6f86051a605
>>
>> One may find the strace at:
>> LINK:https://syzkaller.appspot.com/text?tag=CrashLog&x=1798b54ec80000
>> and the c reproducer at:
>> LINK:https://syzkaller.appspot.com/text?tag=ReproC&x=10365781c80000
>>
>> Since I have no experience with kvm, it would be helpful if one of you could
>> provide some guidance. Do you think it is worth backporting this patch
>> to stable (together with its prerequisite patches), or shall I try to
>> get familiar with the code and provide a less invasive fix?
> 
> I think it is enough to fix the conflicts in vmx_pre_block and
> vmx_post_block, there are no prerequisites:
> 
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 0718658268fe..895069038856 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -7577,17 +7577,11 @@ static int vmx_pre_block(struct kvm_vcpu *vcpu)
>      if (pi_pre_block(vcpu))
>          return 1;
>  
> -    if (kvm_lapic_hv_timer_in_use(vcpu))
> -        kvm_lapic_switch_to_sw_timer(vcpu);
> -
>      return 0;
>  }
>  
>  static void vmx_post_block(struct kvm_vcpu *vcpu)
>  {
> -    if (kvm_x86_ops.set_hv_timer)
> -        kvm_lapic_switch_to_hv_timer(vcpu);
> -
>      pi_post_block(vcpu);
>  }
>  
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index fcfa3fedf84f..4eca3ec38afd 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -10022,12 +10022,28 @@ static int vcpu_enter_guest(struct kvm_vcpu
> *vcpu)
>  
>  static inline int vcpu_block(struct kvm *kvm, struct kvm_vcpu *vcpu)
>  {
> +    bool hv_timer;
> +
>      if (!kvm_arch_vcpu_runnable(vcpu) &&
>          (!kvm_x86_ops.pre_block || static_call(kvm_x86_pre_block)(vcpu)
> == 0)) {
> +        /*
> +         * Switch to the software timer before halt-polling/blocking as
> +         * the guest's timer may be a break event for the vCPU, and the
> +         * hypervisor timer runs only when the CPU is in guest mode.
> +         * Switch before halt-polling so that KVM recognizes an expired
> +         * timer before blocking.
> +         */
> +        hv_timer = kvm_lapic_hv_timer_in_use(vcpu);
> +        if (hv_timer)
> +            kvm_lapic_switch_to_sw_timer(vcpu);
> +
>          srcu_read_unlock(&kvm->srcu, vcpu->srcu_idx);
>          kvm_vcpu_block(vcpu);
>          vcpu->srcu_idx = srcu_read_lock(&kvm->srcu);
>  
> +        if (hv_timer)
> +            kvm_lapic_switch_to_hv_timer(vcpu);
> +
>          if (kvm_x86_ops.post_block)
>              static_call(kvm_x86_post_block)(vcpu);
>  
> @@ -10266,6 +10282,11 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
>              r = -EINTR;
>              goto out;
>          }
> +        /*
> +         * It should be impossible for the hypervisor timer to be in
> +         * use before KVM has ever run the vCPU.
> +         */
> +        WARN_ON_ONCE(kvm_lapic_hv_timer_in_use(vcpu));
>          kvm_vcpu_block(vcpu);
>          if (kvm_apic_accept_events(vcpu) < 0) {
>              r = 0;
> 
> The fix is due to the second "if" changing from
> kvm_x86_ops.set_hv_timer to hv_timer.
> 

Thanks for the prompt answer! I fixed the conflicts as per your
suggestion and tested the patch with the reproducer on top of
stable/linux-5.15.y, and I can confirm the reproducer is silenced. I sent the
patch proposal (with you in To:) at:
https://lore.kernel.org/all/20230329151747.2938509-1-tudor.ambarus@linaro.org/T/#u
Feel free to ACK/NACK it.

Cheers,
ta

^ permalink raw reply	[flat|nested] 50+ messages in thread

* Re: [PATCH v3 07/26] KVM: VMX: Move preemption timer <=> hrtimer dance to common x86
  2023-03-29 15:22       ` Tudor Ambarus
@ 2023-03-30  7:12         ` Tudor Ambarus
  0 siblings, 0 replies; 50+ messages in thread
From: Tudor Ambarus @ 2023-03-30  7:12 UTC (permalink / raw)
  To: Paolo Bonzini, Sean Christopherson, Joerg Roedel
  Cc: Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Suravee Suthikulpanit,
	kvm, iommu, linux-kernel, Maxim Levitsky, Lee Jones

Hi, Paolo,

On 3/29/23 16:22, Tudor Ambarus wrote:
> 
> 
> On 3/29/23 14:47, Paolo Bonzini wrote:
> 
> Hi, Paolo!
> 
>> On 3/29/23 14:34, Tudor Ambarus wrote:
>>> This patch fixes the bug reported at:
>>> LINK:
>>> https://syzkaller.appspot.com/bug?id=489beb3d76ef14cc6cd18125782dc6f86051a605
>>>
>>> One may find the strace at:
>>> LINK:https://syzkaller.appspot.com/text?tag=CrashLog&x=1798b54ec80000
>>> and the c reproducer at:
>>> LINK:https://syzkaller.appspot.com/text?tag=ReproC&x=10365781c80000
>>>
>>> Since I have no experience with kvm, it would be helpful if one of you could
>>> provide some guidance. Do you think it is worth backporting this patch
>>> to stable (together with its prerequisite patches), or shall I try to
>>> get familiar with the code and provide a less invasive fix?
>>
>> I think it is enough to fix the conflicts in vmx_pre_block and
>> vmx_post_block, there are no prerequisites:
>>
>> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
>> index 0718658268fe..895069038856 100644
>> --- a/arch/x86/kvm/vmx/vmx.c
>> +++ b/arch/x86/kvm/vmx/vmx.c
>> @@ -7577,17 +7577,11 @@ static int vmx_pre_block(struct kvm_vcpu *vcpu)
>>      if (pi_pre_block(vcpu))
>>          return 1;
>>  
>> -    if (kvm_lapic_hv_timer_in_use(vcpu))
>> -        kvm_lapic_switch_to_sw_timer(vcpu);
>> -
>>      return 0;
>>  }
>>  
>>  static void vmx_post_block(struct kvm_vcpu *vcpu)
>>  {
>> -    if (kvm_x86_ops.set_hv_timer)
>> -        kvm_lapic_switch_to_hv_timer(vcpu);
>> -
>>      pi_post_block(vcpu);
>>  }
>>  
>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>> index fcfa3fedf84f..4eca3ec38afd 100644
>> --- a/arch/x86/kvm/x86.c
>> +++ b/arch/x86/kvm/x86.c
>> @@ -10022,12 +10022,28 @@ static int vcpu_enter_guest(struct kvm_vcpu
>> *vcpu)
>>  
>>  static inline int vcpu_block(struct kvm *kvm, struct kvm_vcpu *vcpu)
>>  {
>> +    bool hv_timer;
>> +
>>      if (!kvm_arch_vcpu_runnable(vcpu) &&
>>          (!kvm_x86_ops.pre_block || static_call(kvm_x86_pre_block)(vcpu)
>> == 0)) {
>> +        /*
>> +         * Switch to the software timer before halt-polling/blocking as
>> +         * the guest's timer may be a break event for the vCPU, and the
>> +         * hypervisor timer runs only when the CPU is in guest mode.
>> +         * Switch before halt-polling so that KVM recognizes an expired
>> +         * timer before blocking.
>> +         */
>> +        hv_timer = kvm_lapic_hv_timer_in_use(vcpu);
>> +        if (hv_timer)
>> +            kvm_lapic_switch_to_sw_timer(vcpu);
>> +
>>          srcu_read_unlock(&kvm->srcu, vcpu->srcu_idx);
>>          kvm_vcpu_block(vcpu);
>>          vcpu->srcu_idx = srcu_read_lock(&kvm->srcu);
>>  
>> +        if (hv_timer)
>> +            kvm_lapic_switch_to_hv_timer(vcpu);
>> +
>>          if (kvm_x86_ops.post_block)
>>              static_call(kvm_x86_post_block)(vcpu);
>>  
>> @@ -10266,6 +10282,11 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
>>              r = -EINTR;
>>              goto out;
>>          }
>> +        /*
>> +         * It should be impossible for the hypervisor timer to be in
>> +         * use before KVM has ever run the vCPU.
>> +         */
>> +        WARN_ON_ONCE(kvm_lapic_hv_timer_in_use(vcpu));
>>          kvm_vcpu_block(vcpu);
>>          if (kvm_apic_accept_events(vcpu) < 0) {
>>              r = 0;
>>
>> The fix is due to the second "if" changing from
>> kvm_x86_ops.set_hv_timer to hv_timer.
>>
> 
> Thanks for the prompt answer! I fixed the conflicts as per your
> suggestion and tested the patch with the reproducer on top of
> stable/linux-5.15.y, and I can confirm the reproducer is silenced. I sent the
> patch proposal (with you in To:) at:
> https://lore.kernel.org/all/20230329151747.2938509-1-tudor.ambarus@linaro.org/T/#u
> Feel free to ACK/NACK it.
> 

I wanted to let you know that I've tried the reproducer on
stable/linux-5.{10, 4}.y and the bug is not hit. The only long-term
maintained kernel that seems affected is 5.15, so I backported the patch
just for it. If you think it would be good to backport the patch further,
or you want me to run some other tests, I'll be glad to do so.

Thanks for the guidance!
ta

^ permalink raw reply	[flat|nested] 50+ messages in thread

end of thread, other threads:[~2023-03-30  7:12 UTC | newest]

Thread overview: 50+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-12-08  1:52 [PATCH v3 00/26] KVM: x86: Halt and APICv overhaul Sean Christopherson
2021-12-08  1:52 ` [PATCH v3 01/26] KVM: fix avic_set_running for preemptable kernels Sean Christopherson
2021-12-08  1:52 ` [PATCH v3 02/26] KVM: nVMX: Ensure vCPU honors event request if posting nested IRQ fails Sean Christopherson
2021-12-08  1:52 ` [PATCH v3 03/26] KVM: VMX: Clean up PI pre/post-block WARNs Sean Christopherson
2021-12-08  1:52 ` [PATCH v3 04/26] KVM: VMX: Handle PI wakeup shenanigans during vcpu_put/load Sean Christopherson
2021-12-08  1:52 ` [PATCH v3 05/26] KVM: Drop unused kvm_vcpu.pre_pcpu field Sean Christopherson
2021-12-08  1:52 ` [PATCH v3 06/26] KVM: Move x86 VMX's posted interrupt list_head to vcpu_vmx Sean Christopherson
2021-12-08  1:52 ` [PATCH v3 07/26] KVM: VMX: Move preemption timer <=> hrtimer dance to common x86 Sean Christopherson
2023-03-29 12:34   ` Tudor Ambarus
2023-03-29 13:47     ` Paolo Bonzini
2023-03-29 15:22       ` Tudor Ambarus
2023-03-30  7:12         ` Tudor Ambarus
2021-12-08  1:52 ` [PATCH v3 08/26] KVM: x86: Unexport LAPIC's switch_to_{hv,sw}_timer() helpers Sean Christopherson
2021-12-08  1:52 ` [PATCH v3 09/26] KVM: x86: Remove defunct pre_block/post_block kvm_x86_ops hooks Sean Christopherson
2021-12-08  1:52 ` [PATCH v3 10/26] KVM: SVM: Signal AVIC doorbell iff vCPU is in guest mode Sean Christopherson
2021-12-08  1:52 ` [PATCH v3 11/26] KVM: SVM: Don't bother checking for "running" AVIC when kicking for IPIs Sean Christopherson
2021-12-08  1:52 ` [PATCH v3 12/26] KVM: SVM: Remove unnecessary APICv/AVIC update in vCPU unblocking path Sean Christopherson
2021-12-08  1:52 ` [PATCH v3 13/26] KVM: SVM: Use kvm_vcpu_is_blocking() in AVIC load to handle preemption Sean Christopherson
2021-12-08  1:52 ` [PATCH v3 14/26] KVM: SVM: Skip AVIC and IRTE updates when loading blocking vCPU Sean Christopherson
2021-12-08  1:52 ` [PATCH v3 15/26] iommu/amd: KVM: SVM: Use pCPU to infer IsRun state for IRTE Sean Christopherson
2021-12-08  1:52 ` [PATCH v3 16/26] KVM: VMX: Don't do full kick when triggering posted interrupt "fails" Sean Christopherson
2021-12-08  1:52 ` [PATCH v3 17/26] KVM: VMX: Wake vCPU when delivering posted IRQ even if vCPU == this vCPU Sean Christopherson
2021-12-08  1:52 ` [PATCH v3 18/26] KVM: VMX: Pass desired vector instead of bool for triggering posted IRQ Sean Christopherson
2021-12-08  1:52 ` [PATCH v3 19/26] KVM: VMX: Fold fallback path into triggering posted IRQ helper Sean Christopherson
2021-12-08  1:52 ` [PATCH v3 20/26] KVM: VMX: Don't do full kick when handling posted interrupt wakeup Sean Christopherson
2021-12-08  1:52 ` [PATCH v3 21/26] KVM: SVM: Drop AVIC's intermediate avic_set_running() helper Sean Christopherson
2021-12-08 14:43   ` Paolo Bonzini
2021-12-08 15:03     ` Maxim Levitsky
2021-12-08 15:43     ` Sean Christopherson
2021-12-08  1:52 ` [PATCH v3 22/26] KVM: SVM: Move svm_hardware_setup() and its helpers below svm_x86_ops Sean Christopherson
2021-12-08  1:52 ` [PATCH v3 23/26] KVM: SVM: Nullify vcpu_(un)blocking() hooks if AVIC is disabled Sean Christopherson
2021-12-08  1:52 ` [PATCH v3 24/26] KVM: x86: Skip APICv update if APICv is disable at the module level Sean Christopherson
2021-12-08  1:52 ` [PATCH v3 25/26] KVM: x86: Drop NULL check on kvm_x86_ops.check_apicv_inhibit_reasons Sean Christopherson
2021-12-08  1:52 ` [PATCH v3 26/26] KVM: x86: Unexport __kvm_request_apicv_update() Sean Christopherson
2021-12-08  9:04 ` [PATCH v3 00/26] KVM: x86: Halt and APICv overhaul Paolo Bonzini
2021-12-08 14:51 ` Paolo Bonzini
2021-12-08 23:00 ` Maxim Levitsky
2021-12-08 23:16   ` Maxim Levitsky
2021-12-08 23:34     ` Maxim Levitsky
2021-12-09  0:04       ` Sean Christopherson
2021-12-09  6:36         ` Maxim Levitsky
2021-12-09  0:02     ` Sean Christopherson
2021-12-09 14:29       ` Paolo Bonzini
2021-12-09 14:48         ` Maxim Levitsky
2021-12-09 15:45           ` Sean Christopherson
2021-12-09 16:03             ` Maxim Levitsky
2021-12-09  1:37     ` Sean Christopherson
2021-12-09  6:31       ` Maxim Levitsky
2021-12-08 23:43   ` Sean Christopherson
2021-12-09  6:34     ` Maxim Levitsky
