linux-kernel.vger.kernel.org archive mirror
* [PATCH 0/4] KVM: x86: APICv fixes
@ 2022-04-16  3:42 Sean Christopherson
  2022-04-16  3:42 ` [PATCH 1/4] KVM: x86: Tag APICv DISABLE inhibit, not ABSENT, if APICv is disabled Sean Christopherson
                   ` (3 more replies)
  0 siblings, 4 replies; 12+ messages in thread
From: Sean Christopherson @ 2022-04-16  3:42 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel, Gaoning Pan, Yongkang Jia,
	Maxim Levitsky

Patch 1 is a brown paper bag fix for a copy+paste bug (thankfully
introduced this cycle).

Patch 2 fixes an nVMX + APICv bug where KVM essentially corrupts vmcs02 if
an APICv update arrives while L2 is running.  Found when testing a slight
variation of patch 3.

Patch 3 fixes a race where an APICv update that occurs while a vCPU is
being created will fail to notify that vCPU due to it not yet being
visible to the updater.

Patch 4 is a minor optimization/cleanup found by inspection when digging
into everything else.

Sean Christopherson (4):
  KVM: x86: Tag APICv DISABLE inhibit, not ABSENT, if APICv is disabled
  KVM: nVMX: Defer APICv updates while L2 is active until L1 is active
  KVM: x86: Pend KVM_REQ_APICV_UPDATE during vCPU creation to fix a race
  KVM: x86: Skip KVM_GUESTDBG_BLOCKIRQ APICv update if APICv is disabled

 arch/x86/kvm/vmx/nested.c |  5 +++++
 arch/x86/kvm/vmx/vmx.c    |  5 +++++
 arch/x86/kvm/vmx/vmx.h    |  1 +
 arch/x86/kvm/x86.c        | 20 ++++++++++++++++++--
 4 files changed, 29 insertions(+), 2 deletions(-)


base-commit: 150866cd0ec871c765181d145aa0912628289c8a
-- 
2.36.0.rc0.470.gd361397f0d-goog


^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH 1/4] KVM: x86: Tag APICv DISABLE inhibit, not ABSENT, if APICv is disabled
  2022-04-16  3:42 [PATCH 0/4] KVM: x86: APICv fixes Sean Christopherson
@ 2022-04-16  3:42 ` Sean Christopherson
  2022-04-18  8:37   ` Maxim Levitsky
  2022-04-16  3:42 ` [PATCH 2/4] KVM: nVMX: Defer APICv updates while L2 is active until L1 is active Sean Christopherson
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 12+ messages in thread
From: Sean Christopherson @ 2022-04-16  3:42 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel, Gaoning Pan, Yongkang Jia,
	Maxim Levitsky

Set the DISABLE inhibit, not the ABSENT inhibit, if APICv is disabled via
module param.  A recent refactoring to add a wrapper for setting/clearing
inhibits unintentionally changed the flag, probably due to a copy+paste
goof.

Fixes: 4f4c4a3ee53c ("KVM: x86: Trace all APICv inhibit changes and capture overall status")
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/x86.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index ab336f7c82e4..753296902535 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -9159,7 +9159,7 @@ static void kvm_apicv_init(struct kvm *kvm)
 
 	if (!enable_apicv)
 		set_or_clear_apicv_inhibit(inhibits,
-					   APICV_INHIBIT_REASON_ABSENT, true);
+					   APICV_INHIBIT_REASON_DISABLE, true);
 }
 
 static void kvm_sched_yield(struct kvm_vcpu *vcpu, unsigned long dest_id)
-- 
2.36.0.rc0.470.gd361397f0d-goog



* [PATCH 2/4] KVM: nVMX: Defer APICv updates while L2 is active until L1 is active
  2022-04-16  3:42 [PATCH 0/4] KVM: x86: APICv fixes Sean Christopherson
  2022-04-16  3:42 ` [PATCH 1/4] KVM: x86: Tag APICv DISABLE inhibit, not ABSENT, if APICv is disabled Sean Christopherson
@ 2022-04-16  3:42 ` Sean Christopherson
  2022-04-18 12:49   ` Maxim Levitsky
  2022-04-16  3:42 ` [PATCH 3/4] KVM: x86: Pend KVM_REQ_APICV_UPDATE during vCPU creation to fix a race Sean Christopherson
  2022-04-16  3:42 ` [PATCH 4/4] KVM: x86: Skip KVM_GUESTDBG_BLOCKIRQ APICv update if APICv is disabled Sean Christopherson
  3 siblings, 1 reply; 12+ messages in thread
From: Sean Christopherson @ 2022-04-16  3:42 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel, Gaoning Pan, Yongkang Jia,
	Maxim Levitsky

Defer APICv updates that occur while L2 is active until nested VM-Exit,
i.e. until L1 regains control.  vmx_refresh_apicv_exec_ctrl() assumes L1
is active and (a) stomps all over vmcs02 and (b) neglects to ever update
vmcs01.  E.g. if vmcs12 doesn't enable the TPR shadow for L2 (and thus no
APICv controls), L1 performs nested VM-Enter with APICv inhibited, and
APICv becomes uninhibited while L2 is active, KVM will set various APICv
controls in vmcs02 and trigger a failed VM-Entry.  The kicker is that,
unless running with nested_early_check=1, KVM blames L1 and chaos ensues.

The obvious elephant in the room is whether or not KVM needs to disallow
APICv in L2 if it is inhibited in L1.  Luckily, that's largely a moot
point as the only dynamic inhibit conditions that affect VMX are Hyper-V
and IRQ blocking.  IRQ blocking is firmly a debug-only feature, and L1
probably deserves whatever mess it gets by enabling AutoEOI while running
L2 with APICv enabled.

Lack of dynamic toggling is also why this scenario is all but impossible
to encounter in KVM's current form.  But a future patch will pend an
APICv update request _during_ vCPU creation to plug a race where a vCPU
that's being created doesn't get included in the "all vCPUs request"
because it's not yet visible to other vCPUs.  If userspace restores L2
after VM creation (hello, KVM selftests), the first KVM_RUN will occur
while L2 is active and thus service the APICv update request made during
vCPU creation.

Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/vmx/nested.c | 5 +++++
 arch/x86/kvm/vmx/vmx.c    | 5 +++++
 arch/x86/kvm/vmx/vmx.h    | 1 +
 3 files changed, 11 insertions(+)

diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index a6688663da4d..f5cb18e00e78 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -4640,6 +4640,11 @@ void nested_vmx_vmexit(struct kvm_vcpu *vcpu, u32 vm_exit_reason,
 		kvm_make_request(KVM_REQ_APIC_PAGE_RELOAD, vcpu);
 	}
 
+	if (vmx->nested.update_vmcs01_apicv_status) {
+		vmx->nested.update_vmcs01_apicv_status = false;
+		kvm_make_request(KVM_REQ_APICV_UPDATE, vcpu);
+	}
+
 	if ((vm_exit_reason != -1) &&
 	    (enable_shadow_vmcs || evmptr_is_valid(vmx->nested.hv_evmcs_vmptr)))
 		vmx->nested.need_vmcs12_to_shadow_sync = true;
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index cf8581978bce..4c407a34b11e 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -4174,6 +4174,11 @@ static void vmx_refresh_apicv_exec_ctrl(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
 
+	if (is_guest_mode(vcpu)) {
+		vmx->nested.update_vmcs01_apicv_status = true;
+		return;
+	}
+
 	pin_controls_set(vmx, vmx_pin_based_exec_ctrl(vmx));
 	if (cpu_has_secondary_exec_ctrls()) {
 		if (kvm_vcpu_apicv_active(vcpu))
diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index 9c6bfcd84008..b98c7e96697a 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -183,6 +183,7 @@ struct nested_vmx {
 	bool change_vmcs01_virtual_apic_mode;
 	bool reload_vmcs01_apic_access_page;
 	bool update_vmcs01_cpu_dirty_logging;
+	bool update_vmcs01_apicv_status;
 
 	/*
 	 * Enlightened VMCS has been enabled. It does not mean that L1 has to
-- 
2.36.0.rc0.470.gd361397f0d-goog



* [PATCH 3/4] KVM: x86: Pend KVM_REQ_APICV_UPDATE during vCPU creation to fix a race
  2022-04-16  3:42 [PATCH 0/4] KVM: x86: APICv fixes Sean Christopherson
  2022-04-16  3:42 ` [PATCH 1/4] KVM: x86: Tag APICv DISABLE inhibit, not ABSENT, if APICv is disabled Sean Christopherson
  2022-04-16  3:42 ` [PATCH 2/4] KVM: nVMX: Defer APICv updates while L2 is active until L1 is active Sean Christopherson
@ 2022-04-16  3:42 ` Sean Christopherson
  2022-04-18 12:49   ` Maxim Levitsky
  2022-04-16  3:42 ` [PATCH 4/4] KVM: x86: Skip KVM_GUESTDBG_BLOCKIRQ APICv update if APICv is disabled Sean Christopherson
  3 siblings, 1 reply; 12+ messages in thread
From: Sean Christopherson @ 2022-04-16  3:42 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel, Gaoning Pan, Yongkang Jia,
	Maxim Levitsky

Make a KVM_REQ_APICV_UPDATE request when creating a vCPU with an
in-kernel local APIC and APICv enabled at the module level.  Consuming
kvm_apicv_activated() and stuffing vcpu->arch.apicv_active directly can
race with __kvm_set_or_clear_apicv_inhibit(), as vCPU creation happens
before the vCPU is fully onlined, i.e. it won't get the request made to
"all" vCPUs.  If APICv is globally inhibited between setting apicv_active
and onlining the vCPU, the vCPU will end up running with APICv enabled
and trigger KVM's sanity check.

Mark APICv as active during vCPU creation if APICv is enabled at the
module level, both to be optimistic about its final state, e.g. to avoid
additional VMWRITEs on VMX, and because there are likely bugs lurking
since KVM checks apicv_active in multiple vCPU creation paths.  While
keeping the current behavior of consuming kvm_apicv_activated() is
arguably safer from a regression perspective, force apicv_active so that
vCPU creation runs with deterministic state and so that any bugs are
found sooner rather than later, i.e. not when some crazy race condition
is hit.

  WARNING: CPU: 0 PID: 484 at arch/x86/kvm/x86.c:9877 vcpu_enter_guest+0x2ae3/0x3ee0 arch/x86/kvm/x86.c:9877
  Modules linked in:
  CPU: 0 PID: 484 Comm: syz-executor361 Not tainted 5.16.13 #2
  Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1~cloud0 04/01/2014
  RIP: 0010:vcpu_enter_guest+0x2ae3/0x3ee0 arch/x86/kvm/x86.c:9877
  Call Trace:
   <TASK>
   vcpu_run arch/x86/kvm/x86.c:10039 [inline]
   kvm_arch_vcpu_ioctl_run+0x337/0x15e0 arch/x86/kvm/x86.c:10234
   kvm_vcpu_ioctl+0x4d2/0xc80 arch/x86/kvm/../../../virt/kvm/kvm_main.c:3727
   vfs_ioctl fs/ioctl.c:51 [inline]
   __do_sys_ioctl fs/ioctl.c:874 [inline]
   __se_sys_ioctl fs/ioctl.c:860 [inline]
   __x64_sys_ioctl+0x16d/0x1d0 fs/ioctl.c:860
   do_syscall_x64 arch/x86/entry/common.c:50 [inline]
   do_syscall_64+0x38/0x90 arch/x86/entry/common.c:80
   entry_SYSCALL_64_after_hwframe+0x44/0xae

The bug was hit by a syzkaller spamming VM creation with 2 vCPUs and a
call to KVM_SET_GUEST_DEBUG.

  r0 = openat$kvm(0xffffffffffffff9c, &(0x7f0000000000), 0x0, 0x0)
  r1 = ioctl$KVM_CREATE_VM(r0, 0xae01, 0x0)
  ioctl$KVM_CAP_SPLIT_IRQCHIP(r1, 0x4068aea3, &(0x7f0000000000)) (async)
  r2 = ioctl$KVM_CREATE_VCPU(r1, 0xae41, 0x0) (async)
  r3 = ioctl$KVM_CREATE_VCPU(r1, 0xae41, 0x400000000000002)
  ioctl$KVM_SET_GUEST_DEBUG(r3, 0x4048ae9b, &(0x7f00000000c0)={0x5dda9c14aa95f5c5})
  ioctl$KVM_RUN(r2, 0xae80, 0x0)

Reported-by: Gaoning Pan <pgn@zju.edu.cn>
Reported-by: Yongkang Jia <kangel@zju.edu.cn>
Fixes: 8df14af42f00 ("kvm: x86: Add support for dynamic APICv activation")
Cc: stable@vger.kernel.org
Cc: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/x86.c | 15 ++++++++++++++-
 1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 753296902535..09a270cc1c8f 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -11259,8 +11259,21 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
 		r = kvm_create_lapic(vcpu, lapic_timer_advance_ns);
 		if (r < 0)
 			goto fail_mmu_destroy;
-		if (kvm_apicv_activated(vcpu->kvm))
+
+		/*
+		 * Defer evaluating inhibits until the vCPU is first run, as
+		 * this vCPU will not get notified of any changes until this
+		 * vCPU is visible to other vCPUs (marked online and added to
+		 * the set of vCPUs).  Opportunistically mark APICv active as
 +		 * the set of vCPUs).  Opportunistically mark APICv active as
+		 * Ignore the current per-VM APICv state so that vCPU creation
 +		 * is guaranteed to run with a deterministic value; the request
+		 * will ensure the vCPU gets the correct state before VM-Entry.
+		 */
+		if (enable_apicv) {
 			vcpu->arch.apicv_active = true;
+			kvm_make_request(KVM_REQ_APICV_UPDATE, vcpu);
+		}
 	} else
 		static_branch_inc(&kvm_has_noapic_vcpu);
 
-- 
2.36.0.rc0.470.gd361397f0d-goog



* [PATCH 4/4] KVM: x86: Skip KVM_GUESTDBG_BLOCKIRQ APICv update if APICv is disabled
  2022-04-16  3:42 [PATCH 0/4] KVM: x86: APICv fixes Sean Christopherson
                   ` (2 preceding siblings ...)
  2022-04-16  3:42 ` [PATCH 3/4] KVM: x86: Pend KVM_REQ_APICV_UPDATE during vCPU creation to fix a race Sean Christopherson
@ 2022-04-16  3:42 ` Sean Christopherson
  2022-04-18 12:50   ` Maxim Levitsky
  3 siblings, 1 reply; 12+ messages in thread
From: Sean Christopherson @ 2022-04-16  3:42 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel, Gaoning Pan, Yongkang Jia,
	Maxim Levitsky

Skip the APICv inhibit update for KVM_GUESTDBG_BLOCKIRQ if APICv is
disabled at the module level to avoid having to acquire the mutex and
potentially process all vCPUs. The DISABLE inhibit will (barring bugs)
never be lifted, so piling on more inhibits is unnecessary.

Fixes: cae72dcc3b21 ("KVM: x86: inhibit APICv when KVM_GUESTDBG_BLOCKIRQ active")
Cc: Maxim Levitsky <mlevitsk@redhat.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/x86.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 09a270cc1c8f..16c5fa7d165d 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -11048,6 +11048,9 @@ static void kvm_arch_vcpu_guestdbg_update_apicv_inhibit(struct kvm *kvm)
 	struct kvm_vcpu *vcpu;
 	unsigned long i;
 
+	if (!enable_apicv)
+		return;
+
 	down_write(&kvm->arch.apicv_update_lock);
 
 	kvm_for_each_vcpu(i, vcpu, kvm) {
-- 
2.36.0.rc0.470.gd361397f0d-goog



* Re: [PATCH 1/4] KVM: x86: Tag APICv DISABLE inhibit, not ABSENT, if APICv is disabled
  2022-04-16  3:42 ` [PATCH 1/4] KVM: x86: Tag APICv DISABLE inhibit, not ABSENT, if APICv is disabled Sean Christopherson
@ 2022-04-18  8:37   ` Maxim Levitsky
  0 siblings, 0 replies; 12+ messages in thread
From: Maxim Levitsky @ 2022-04-18  8:37 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel, kvm,
	linux-kernel, Gaoning Pan, Yongkang Jia

On Sat, 2022-04-16 at 03:42 +0000, Sean Christopherson wrote:
> Set the DISABLE inhibit, not the ABSENT inhibit, if APICv is disabled via
> module param.  A recent refactoring to add a wrapper for setting/clearing
> inhibits unintentionally changed the flag, probably due to a copy+paste
> goof.
> 
> Fixes: 4f4c4a3ee53c ("KVM: x86: Trace all APICv inhibit changes and capture overall status")
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>  arch/x86/kvm/x86.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index ab336f7c82e4..753296902535 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -9159,7 +9159,7 @@ static void kvm_apicv_init(struct kvm *kvm)
>  
>  	if (!enable_apicv)
>  		set_or_clear_apicv_inhibit(inhibits,
> -					   APICV_INHIBIT_REASON_ABSENT, true);
> +					   APICV_INHIBIT_REASON_DISABLE, true);
>  }
>  
>  static void kvm_sched_yield(struct kvm_vcpu *vcpu, unsigned long dest_id)

So ABSENT means that userspace didn't enable it, and DISABLE means the kernel module param disabled it.
I didn't follow the patches that touched those, but it feels like we could use a single inhibit reason for both,
or at least pick better names for this.  APICV_INHIBIT_REASON_ABSENT doesn't sound good to me.

Having said that, the patch is OK.

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>

Best regards,
	Maxim Levitsky



* Re: [PATCH 2/4] KVM: nVMX: Defer APICv updates while L2 is active until L1 is active
  2022-04-16  3:42 ` [PATCH 2/4] KVM: nVMX: Defer APICv updates while L2 is active until L1 is active Sean Christopherson
@ 2022-04-18 12:49   ` Maxim Levitsky
  2022-04-18 15:35     ` Sean Christopherson
  0 siblings, 1 reply; 12+ messages in thread
From: Maxim Levitsky @ 2022-04-18 12:49 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel, kvm,
	linux-kernel, Gaoning Pan, Yongkang Jia

On Sat, 2022-04-16 at 03:42 +0000, Sean Christopherson wrote:
> Defer APICv updates that occur while L2 is active until nested VM-Exit,
> i.e. until L1 regains control.  vmx_refresh_apicv_exec_ctrl() assumes L1
> is active and (a) stomps all over vmcs02 and (b) neglects to ever update
> vmcs01.  E.g. if vmcs12 doesn't enable the TPR shadow for L2 (and thus no
> APICv controls), L1 performs nested VM-Enter with APICv inhibited, and
> APICv becomes uninhibited while L2 is active, KVM will set various APICv
> controls in vmcs02 and trigger a failed VM-Entry.  The kicker is that,
> unless running with nested_early_check=1, KVM blames L1 and chaos ensues.
> 
> The obvious elephant in the room is whether or not KVM needs to disallow
> APICv in L2 if it is inhibited in L1.  Luckily, that's largely a moot
> point as the only dynamic inhibit conditions that affect VMX are Hyper-V
> and IRQ blocking.  IRQ blocking is firmly a debug-only feature, and L1
> probably deserves whatever mess it gets by enabling AutoEOI while running
> L2 with APICv enabled.

Let me explain:
 
a vCPU either runs L1 or it "sort of" runs both L1 and L2 in regard to
how it sends/receives interrupts: 

Sending interrupts while nested: 
   While in guest mode, a vCPU can in theory use both L2's and L1's APIC 
   (if L2 has access to L1's APIC mmio/msrs). 

Receiving interrupts while nested: 
 
   While in guest mode, a vCPU can receive interrupts that target either L1 or the L2 code it "runs".
   The latter will hopefully cause it to jump to the interrupt handler without a VMexit,
   while the former will cause a VMexit and let L1 run again.
 
   On AVIC, interrupt requests can come from either L1 or L2, while on APICv 
   they have to come from L1, because APICv doesn't yet support nested IPI virtualization.
 
 
Another thing to note is that, pretty much, APICv/AVIC inhibition is for 
L1's use of APICv/AVIC only.

In fact I wouldn't object to renaming the inhibition functions to state this explicitly.
 
When L2 uses APICv/AVIC, we just safely pass its usage through to the real hardware.

If we were to need to inhibit it, we would have to emulate APICv/AVIC so that L1 would
still think that it can use it - thankfully there is no need for that.


So how it all works together:
 
Sending interrupts while nested:
 
   - only L2's avic/apicv is accelerated because its settings are in vmcb/vmcs02.
 
   - if L1 allows L2 to access its APIC, then it is accessed through regular MMIO/MSR 
     interception and should still work.
     (this reminds me to write a unit test for this)
 
 
Receiving interrupts while nested:
 
  - Interrupts that target L2 are received via APICv/AVIC which is active in vmcb/vmcs02
 
    This APICv/AVIC should never be inhibited because it merely works on behalf of L1's hypervisor,
    which should decide if it wants to inhibit it.
 
    In fact even if L1's APICv/AVIC is always inhibited (e.g. when L1 uses x2apic on AMD),
    L2 should still be able to use APICv/AVIC just fine - and it does work *very fine* with my nested AVIC code.
 
    Another way to say it is that the whole APICv/AVIC inhibition mechanism is 
    in fact L1's APICv/AVIC inhibition,
    and there is nothing wrong with L1's APICv/AVIC being inhibited while L2 is running:
 
    That just means that vCPUs which don't run L2 will now stop using AVIC/APICv.
    It will also update the private memslot, which also only L1 uses, while L2 will continue
    using its APICv/AVIC as if nothing happened (it doesn't use this private memslot,
    but should itself map the apic mmio in its own EPT/NPT tables).
 
 
 - Interrupts that target L1: here there is a big difference between APICv and AVIC:

   On APICv, despite the fact that vmcs02 uses L2's APICv, we can still receive them
   as normal interrupts, due to the trick that we use different posted interrupt vectors.

   So for receiving, L1's APICv still "sort of works" while L2 is running,
   and with the help of the interrupt handler will cause L2 to vmexit to L1 when such
   an interrupt is received.
 
   That is why on APICv there is no need to inhibit L1's APICv when entering the nested guest,
   because from a peer's POV, this vCPU's APICv is running just fine.
 

   On AVIC however, there is only one doorbell, and this is why I have to inhibit the AVIC
   while entering the nested guest, although the update it does to vmcb01 is not really needed,

   but what is needed is to inhibit this vCPU's peers from sending it interrupts via AVIC
   (this includes clearing is_running bit in physid table, in IOMMU, and also clearing
   per-vcpu avic enabled bit, so that KVM doesn't send IPIs to this vCPU as well).
 
 
 
Having said all that, delaying L1's APICv inhibition to the next nested VM exit should not
cause any issues, as vmcs01 is not active at all while running nested,
and APICv inhibition is all about changing vmcs01 settings.

(That should be revised when IPI virtualization is merged).
 
On AVIC I don't bother with that, as inhibition explicitly updates vmcb01 and 
L1's physid page and should not have any impact on L2.
 
 
So I am not against this patch, but please update the commit message to note that it is 
all right for L1's APICv to be inhibited/uninhibited while running nested, even though it
is not likely to happen with APICv.
 
Best regards,
	Maxim Levitsky
 


> 
> Lack of dynamic toggling is also why this scenario is all but impossible
> to encounter in KVM's current form.  But a future patch will pend an
> APICv update request _during_ vCPU creation to plug a race where a vCPU
> that's being created doesn't get included in the "all vCPUs request"
> because it's not yet visible to other vCPUs.  If userspace restores L2
> after VM creation (hello, KVM selftests), the first KVM_RUN will occur
> while L2 is active and thus service the APICv update request made during
> vCPU creation.
> 
> Cc: stable@vger.kernel.org
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>  arch/x86/kvm/vmx/nested.c | 5 +++++
>  arch/x86/kvm/vmx/vmx.c    | 5 +++++
>  arch/x86/kvm/vmx/vmx.h    | 1 +
>  3 files changed, 11 insertions(+)
> 
> diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> index a6688663da4d..f5cb18e00e78 100644
> --- a/arch/x86/kvm/vmx/nested.c
> +++ b/arch/x86/kvm/vmx/nested.c
> @@ -4640,6 +4640,11 @@ void nested_vmx_vmexit(struct kvm_vcpu *vcpu, u32 vm_exit_reason,
>  		kvm_make_request(KVM_REQ_APIC_PAGE_RELOAD, vcpu);
>  	}
>  
> +	if (vmx->nested.update_vmcs01_apicv_status) {
> +		vmx->nested.update_vmcs01_apicv_status = false;
> +		kvm_make_request(KVM_REQ_APICV_UPDATE, vcpu);
> +	}
> +
>  	if ((vm_exit_reason != -1) &&
>  	    (enable_shadow_vmcs || evmptr_is_valid(vmx->nested.hv_evmcs_vmptr)))
>  		vmx->nested.need_vmcs12_to_shadow_sync = true;
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index cf8581978bce..4c407a34b11e 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -4174,6 +4174,11 @@ static void vmx_refresh_apicv_exec_ctrl(struct kvm_vcpu *vcpu)
>  {
>  	struct vcpu_vmx *vmx = to_vmx(vcpu);
>  
> +	if (is_guest_mode(vcpu)) {
> +		vmx->nested.update_vmcs01_apicv_status = true;
> +		return;
> +	}
> +
>  	pin_controls_set(vmx, vmx_pin_based_exec_ctrl(vmx));
>  	if (cpu_has_secondary_exec_ctrls()) {
>  		if (kvm_vcpu_apicv_active(vcpu))
> diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
> index 9c6bfcd84008..b98c7e96697a 100644
> --- a/arch/x86/kvm/vmx/vmx.h
> +++ b/arch/x86/kvm/vmx/vmx.h
> @@ -183,6 +183,7 @@ struct nested_vmx {
>  	bool change_vmcs01_virtual_apic_mode;
>  	bool reload_vmcs01_apic_access_page;
>  	bool update_vmcs01_cpu_dirty_logging;
> +	bool update_vmcs01_apicv_status;
>  
>  	/*
>  	 * Enlightened VMCS has been enabled. It does not mean that L1 has to




* Re: [PATCH 3/4] KVM: x86: Pend KVM_REQ_APICV_UPDATE during vCPU creation to fix a race
  2022-04-16  3:42 ` [PATCH 3/4] KVM: x86: Pend KVM_REQ_APICV_UPDATE during vCPU creation to fix a race Sean Christopherson
@ 2022-04-18 12:49   ` Maxim Levitsky
  0 siblings, 0 replies; 12+ messages in thread
From: Maxim Levitsky @ 2022-04-18 12:49 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel, kvm,
	linux-kernel, Gaoning Pan, Yongkang Jia

On Sat, 2022-04-16 at 03:42 +0000, Sean Christopherson wrote:
> Make a KVM_REQ_APICV_UPDATE request when creating a vCPU with an
> in-kernel local APIC and APICv enabled at the module level.  Consuming
> kvm_apicv_activated() and stuffing vcpu->arch.apicv_active directly can
> race with __kvm_set_or_clear_apicv_inhibit(), as vCPU creation happens
> before the vCPU is fully onlined, i.e. it won't get the request made to
> "all" vCPUs.  If APICv is globally inhibited between setting apicv_active
> and onlining the vCPU, the vCPU will end up running with APICv enabled
> and trigger KVM's sanity check.
> 
> Mark APICv as active during vCPU creation if APICv is enabled at the
> module level, both to be optimistic about its final state, e.g. to avoid
> additional VMWRITEs on VMX, and because there are likely bugs lurking
> since KVM checks apicv_active in multiple vCPU creation paths.  While
> keeping the current behavior of consuming kvm_apicv_activated() is
> arguably safer from a regression perspective, force apicv_active so that
> vCPU creation runs with deterministic state and so that any bugs are
> found sooner rather than later, i.e. not when some crazy race condition
> is hit.
> 
>   WARNING: CPU: 0 PID: 484 at arch/x86/kvm/x86.c:9877 vcpu_enter_guest+0x2ae3/0x3ee0 arch/x86/kvm/x86.c:9877

I told you that this warning catches bugs. I am not disappointed!

>   Modules linked in:
>   CPU: 0 PID: 484 Comm: syz-executor361 Not tainted 5.16.13 #2
>   Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1~cloud0 04/01/2014
>   RIP: 0010:vcpu_enter_guest+0x2ae3/0x3ee0 arch/x86/kvm/x86.c:9877
>   Call Trace:
>    <TASK>
>    vcpu_run arch/x86/kvm/x86.c:10039 [inline]
>    kvm_arch_vcpu_ioctl_run+0x337/0x15e0 arch/x86/kvm/x86.c:10234
>    kvm_vcpu_ioctl+0x4d2/0xc80 arch/x86/kvm/../../../virt/kvm/kvm_main.c:3727
>    vfs_ioctl fs/ioctl.c:51 [inline]
>    __do_sys_ioctl fs/ioctl.c:874 [inline]
>    __se_sys_ioctl fs/ioctl.c:860 [inline]
>    __x64_sys_ioctl+0x16d/0x1d0 fs/ioctl.c:860
>    do_syscall_x64 arch/x86/entry/common.c:50 [inline]
>    do_syscall_64+0x38/0x90 arch/x86/entry/common.c:80
>    entry_SYSCALL_64_after_hwframe+0x44/0xae
> 
> The bug was hit by a syzkaller spamming VM creation with 2 vCPUs and a
> call to KVM_SET_GUEST_DEBUG.
> 
>   r0 = openat$kvm(0xffffffffffffff9c, &(0x7f0000000000), 0x0, 0x0)
>   r1 = ioctl$KVM_CREATE_VM(r0, 0xae01, 0x0)
>   ioctl$KVM_CAP_SPLIT_IRQCHIP(r1, 0x4068aea3, &(0x7f0000000000)) (async)
>   r2 = ioctl$KVM_CREATE_VCPU(r1, 0xae41, 0x0) (async)
>   r3 = ioctl$KVM_CREATE_VCPU(r1, 0xae41, 0x400000000000002)
>   ioctl$KVM_SET_GUEST_DEBUG(r3, 0x4048ae9b, &(0x7f00000000c0)={0x5dda9c14aa95f5c5})
>   ioctl$KVM_RUN(r2, 0xae80, 0x0)
> 
> Reported-by: Gaoning Pan <pgn@zju.edu.cn>
> Reported-by: Yongkang Jia <kangel@zju.edu.cn>
> Fixes: 8df14af42f00 ("kvm: x86: Add support for dynamic APICv activation")
> Cc: stable@vger.kernel.org
> Cc: Maxim Levitsky <mlevitsk@redhat.com>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>  arch/x86/kvm/x86.c | 15 ++++++++++++++-
>  1 file changed, 14 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 753296902535..09a270cc1c8f 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -11259,8 +11259,21 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
>  		r = kvm_create_lapic(vcpu, lapic_timer_advance_ns);
>  		if (r < 0)
>  			goto fail_mmu_destroy;
> -		if (kvm_apicv_activated(vcpu->kvm))
> +
> +		/*
> +		 * Defer evaluating inhibits until the vCPU is first run, as
> +		 * this vCPU will not get notified of any changes until this
> +		 * vCPU is visible to other vCPUs (marked online and added to
> +		 * the set of vCPUs).  Opportunistically mark APICv active as
> +		 * the set of vCPUs).  Opportunistically mark APICv active as
> +		 * Ignore the current per-VM APICv state so that vCPU creation
> +		 * is guaranteed to run with a deterministic value; the request
> +		 * will ensure the vCPU gets the correct state before VM-Entry.
> +		 */
> +		if (enable_apicv) {
>  			vcpu->arch.apicv_active = true;
> +			kvm_make_request(KVM_REQ_APICV_UPDATE, vcpu);
> +		}
>  	} else
>  		static_branch_inc(&kvm_has_noapic_vcpu);
>  


Makes sense.

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>

Best regards,
	Maxim Levitsky



* Re: [PATCH 4/4] KVM: x86: Skip KVM_GUESTDBG_BLOCKIRQ APICv update if APICv is disabled
  2022-04-16  3:42 ` [PATCH 4/4] KVM: x86: Skip KVM_GUESTDBG_BLOCKIRQ APICv update if APICv is disabled Sean Christopherson
@ 2022-04-18 12:50   ` Maxim Levitsky
  0 siblings, 0 replies; 12+ messages in thread
From: Maxim Levitsky @ 2022-04-18 12:50 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel, kvm,
	linux-kernel, Gaoning Pan, Yongkang Jia

On Sat, 2022-04-16 at 03:42 +0000, Sean Christopherson wrote:
> Skip the APICv inhibit update for KVM_GUESTDBG_BLOCKIRQ if APICv is
> disabled at the module level to avoid having to acquire the mutex and
> potentially process all vCPUs. The DISABLE inhibit will (barring bugs)
> never be lifted, so piling on more inhibits is unnecessary.
> 
> Fixes: cae72dcc3b21 ("KVM: x86: inhibit APICv when KVM_GUESTDBG_BLOCKIRQ active")
> Cc: Maxim Levitsky <mlevitsk@redhat.com>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>  arch/x86/kvm/x86.c | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 09a270cc1c8f..16c5fa7d165d 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -11048,6 +11048,9 @@ static void kvm_arch_vcpu_guestdbg_update_apicv_inhibit(struct kvm *kvm)
>  	struct kvm_vcpu *vcpu;
>  	unsigned long i;
>  
> +	if (!enable_apicv)
> +		return;
> +
>  	down_write(&kvm->arch.apicv_update_lock);
>  
>  	kvm_for_each_vcpu(i, vcpu, kvm) {

Makes sense as a precaution.

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>


Best regards,
	Maxim Levitsky



* Re: [PATCH 2/4] KVM: nVMX: Defer APICv updates while L2 is active until L1 is active
  2022-04-18 12:49   ` Maxim Levitsky
@ 2022-04-18 15:35     ` Sean Christopherson
  2022-04-19  6:44       ` Maxim Levitsky
  0 siblings, 1 reply; 12+ messages in thread
From: Sean Christopherson @ 2022-04-18 15:35 UTC (permalink / raw)
  To: Maxim Levitsky
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel, Gaoning Pan, Yongkang Jia

On Mon, Apr 18, 2022, Maxim Levitsky wrote:
> On Sat, 2022-04-16 at 03:42 +0000, Sean Christopherson wrote:
> When L2 uses APICv/AVIC, we just safely pass its usage through to the real hardware.
> 
> If we were to need to inhibit it, we would have to emulate APICv/AVIC so that L1 would
> still think that it can use it - thankfully there is no need for that.

What if L1 passes through IRQs and all MSRs to L2?  E.g. if L2 activates Auto EOI
via WRMSR, then arguably it is KVM's responsibility to disable APICv in vmcs02 _and_
vmcs01 in order to handle the Auto EOI properly.  L1 isn't expecting a VM-Exit, so
KVM can't safely punt to L1 even if conceptually we think that it's L1's problem.

It's a contrived scenario, but technically possible.


* Re: [PATCH 2/4] KVM: nVMX: Defer APICv updates while L2 is active until L1 is active
  2022-04-18 15:35     ` Sean Christopherson
@ 2022-04-19  6:44       ` Maxim Levitsky
  2022-04-19 15:57         ` Sean Christopherson
  0 siblings, 1 reply; 12+ messages in thread
From: Maxim Levitsky @ 2022-04-19  6:44 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel, Gaoning Pan, Yongkang Jia

On Mon, 2022-04-18 at 15:35 +0000, Sean Christopherson wrote:
> On Mon, Apr 18, 2022, Maxim Levitsky wrote:
> > On Sat, 2022-04-16 at 03:42 +0000, Sean Christopherson wrote:
> > When L2 uses APICv/AVIC, we just safely pass its usage through to the real hardware.
> > 
> > If we were to need to inhibit it, we would have to emulate APICv/AVIC so that L1 would
> > still think that it can use it - thankfully there is no need for that.
> 
> What if L1 passes through IRQs and all MSRs to L2? 

KVM absolutely should inhibit L1's AVIC/APICv in this case if L2 triggers the AUTOEOI
inhibit via a write to a non-intercepted MSR, and I don't see why it wouldn't work.

It should not affect L2's APICv/AVIC though; only L1 can decide that it wants to inhibit that,
and it has all the means to do so without any help from L0.


In regard to APICv, though, neither vmcs02 nor vmcs01 should need to be touched on a vCPU
that does this:
 
 
- vmcs02 can't have APICv enabled, because passthrough of interrupts thankfully
  conflicts with APICv (virtual interrupt delivery depends on intercepting interrupts),
  and even if that were false, it would contain L2's APICv settings, which should
  continue to work as usual.
 
- vmcs01 isn't active on this vCPU, so there is no need to touch it. vmcs01 of other vCPUs
  that don't run nested does need to be updated to have APICv inhibited,
  which should work just fine unless there are bugs in KVM's APICv.
 
 
- Posted interrupts that target L1 will be delivered as normal interrupts and cause KVM to
  inject them into L2 (because there is no interception)
 
- Posted interrupts that target L2 can't even happen because APICv can't be enabled in L1

 
In regard to SVM, in theory you can have interrupt interception disabled and yet have AVIC enabled
in both L0 and L1, but it should work just fine as well:
 
In this case, while running nested:
 
L2, if it has direct access to L1's AVIC, will be able to write there, but all writes will
be ordinary MMIO writes that work as usual, just not accelerated.
 
Interrupts that target L1 will not go through AVIC but will be delivered via normal IPIs,
because L1's AVIC is inhibited locally on this vCPU, and then KVM will inject them into L2.
 
Interrupts that target L2 will go through nested AVIC and land there just fine.
 
AVIC inhibition/uninhibition in this case has no effect on this vCPU, as L1's AVIC is already
inhibited locally on this vCPU, and my code will correctly do nothing in this case.

Yes, in theory, when L1 doesn't use AVIC for L2 and passes through all interrupts to L2,
I could have set up vmcb02 to use L0's AVIC and let interrupts that target L1 go to L2
via AVIC, but it is just not worth the complexity this adds, and it might not even
work correctly in some cases, since L1 still has its own APIC, even if it doesn't
really receive interrupts while L2 is running.

 
IMHO it all adds up.
 
Best regards,
	Maxim Levitsky



>   L1 isn't expecting a VM-Exit, so
> KVM can't safely punt to L1 even if conceptually we think that it's L1's problem.
> 
> It's a contrived scenario, but technically possible.
> 




* Re: [PATCH 2/4] KVM: nVMX: Defer APICv updates while L2 is active until L1 is active
  2022-04-19  6:44       ` Maxim Levitsky
@ 2022-04-19 15:57         ` Sean Christopherson
  0 siblings, 0 replies; 12+ messages in thread
From: Sean Christopherson @ 2022-04-19 15:57 UTC (permalink / raw)
  To: Maxim Levitsky
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel, Gaoning Pan, Yongkang Jia

On Tue, Apr 19, 2022, Maxim Levitsky wrote:
> On Mon, 2022-04-18 at 15:35 +0000, Sean Christopherson wrote:
> > On Mon, Apr 18, 2022, Maxim Levitsky wrote:
> > > On Sat, 2022-04-16 at 03:42 +0000, Sean Christopherson wrote:
> > > When L2 uses APICv/AVIC, we just safely pass its usage through to the real hardware.
> > > 
> > > If we were to need to inhibit it, we would have to emulate APICv/AVIC so that L1 would
> > > still think that it can use it - thankfully there is no need for that.
> > 
> > What if L1 passes through IRQs and all MSRs to L2? 

...

> - vmcs02 can't have APICv enabled, because passthrough of interrupts thankfully
>   conflicts with APICv (virtual interrupt delivery depends on intercepting interrupts)
>   and even if that was false, it would have contained L2's APICv settings which should
>   continue to work as usual.

Ah, this was the critical piece I was forgetting.  I'll tweak the changelog and
post a new version.

Thanks!


end of thread, other threads:[~2022-04-19 15:57 UTC | newest]

Thread overview: 12+ messages
2022-04-16  3:42 [PATCH 0/4] KVM: x86: APICv fixes Sean Christopherson
2022-04-16  3:42 ` [PATCH 1/4] KVM: x86: Tag APICv DISABLE inhibit, not ABSENT, if APICv is disabled Sean Christopherson
2022-04-18  8:37   ` Maxim Levitsky
2022-04-16  3:42 ` [PATCH 2/4] KVM: nVMX: Defer APICv updates while L2 is active until L1 is active Sean Christopherson
2022-04-18 12:49   ` Maxim Levitsky
2022-04-18 15:35     ` Sean Christopherson
2022-04-19  6:44       ` Maxim Levitsky
2022-04-19 15:57         ` Sean Christopherson
2022-04-16  3:42 ` [PATCH 3/4] KVM: x86: Pend KVM_REQ_APICV_UPDATE during vCPU creation to fix a race Sean Christopherson
2022-04-18 12:49   ` Maxim Levitsky
2022-04-16  3:42 ` [PATCH 4/4] KVM: x86: Skip KVM_GUESTDBG_BLOCKIRQ APICv update if APICv is disabled Sean Christopherson
2022-04-18 12:50   ` Maxim Levitsky
