All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH 0/5] Fix a race between posted interrupt delivery and migration in a nested VM
@ 2022-08-02 23:07 Mingwei Zhang
  2022-08-02 23:07 ` [PATCH 1/5] KVM: x86: Get vmcs12 pages before checking pending interrupts Mingwei Zhang
                   ` (4 more replies)
  0 siblings, 5 replies; 28+ messages in thread
From: Mingwei Zhang @ 2022-08-02 23:07 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel, Oliver Upton, Mingwei Zhang

This patch set aims to fix a race condition between posted interrupt
delivery and migration for a nested VM. In particular, we proves that when
a nested vCPU is halted and just migrated, it will lose a posted
interrupt from another vCPU in the same VM.

The patches consist of 1 kernel change which is the fix and the rest of
the changes generate a selftest that articulates such a racing scenario
to prove the existence of the race. In summary, running this test on an
unpatched kernel will generate a warning [1] and with that, we proves that
there is the loss of a posted interrupt. Note, the warning will only happen
once per reboot, since it is a WARN_ON_ONCE.

[1] The kernel warning happens at arch/x86/kvm/vmx/vmx.c:

static bool vmx_guest_apic_has_interrupt(struct kvm_vcpu *vcpu)
{
	...
	if (WARN_ON_ONCE(!is_guest_mode(vcpu)) ||
		!nested_cpu_has_vid(get_vmcs12(vcpu)) ||
		WARN_ON_ONCE(!vmx->nested.virtual_apic_map.gfn)) <= HERE
		return false;
	...
}

The dump is there:

[237880.809453] ------------[ cut here ]------------
[237880.809455] WARNING: CPU: 21 PID: 112454 at
arch/x86/kvm/vmx/vmx.c:3973 vmx_guest_apic_has_interrupt+0x79/0xe0
[kvm_intel]
[237880.809469] Modules linked in: kvm_intel vfat fat i2c_mux_pca954x
i2c_mux spidev cdc_acm xhci_pci xhci_hcd sha3_generic gq(O)
[237880.809479] CPU: 21 PID: 112454 Comm: vmx_migrate_pi_ Tainted: G S
O      5.19.0-smp-DEV #2
		......
[237880.809484] RIP: 0010:vmx_guest_apic_has_interrupt+0x79/0xe0
[kvm_intel]
[237880.809491] Code: c6 76 2d 41 81 e6 f0 00 00 00 48 8b 83 68 25 00 00
b9 f0 00 00 00 23 88 a0 00 00 00 44 39 f1 0f 92 c0 eb c0 0f 0b 31 c0 eb
ba <0f> 0b 31 c0 eb b4 80 3d 41 c9 02 00 00 74 39 48 c7 c7 18 f3 12 c0
[237880.809493] RSP: 0018:ffff88815c9e7d80 EFLAGS: 00010246
[237880.809495] RAX: ffff88813acbd000 RBX: ffff8881943ec9c0 RCX:
00000000ffffffff
[237880.809497] RDX: 0000000000000000 RSI: ffff8881d8676000 RDI:
ffff8881943ec9c0
[237880.809499] RBP: ffff88815c9e7d90 R08: ffff88815c9e7ce8 R09:
ffff88815c9e7cf0
[237880.809500] R10: 0000000000000000 R11: 0000000000000000 R12:
0000000000009aa8
[237880.809501] R13: ffff8881943ec9c0 R14: ffff8881943ed101 R15:
ffff8881943ec9c0
[237880.809503] FS:  00000000006283c0(0000) GS:ffff88af80740000(0000)
knlGS:0000000000000000
[237880.809505] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[237880.809506] CR2: 00007f9314b4f001 CR3: 00000001cd7b0005 CR4:
00000000003726e0
[237880.809508] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
0000000000000000
[237880.809509] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
0000000000000400
[237880.809511] Call Trace:
[237880.809512]  <TASK>
[237880.809514]  kvm_vcpu_has_events+0xe1/0x150
[237880.809519]  vcpu_run+0xee/0x2c0
[237880.809523]  kvm_arch_vcpu_ioctl_run+0x355/0x610
[237880.809526]  kvm_vcpu_ioctl+0x551/0x610
[237880.809531]  ? do_futex+0xc8/0x160
[237880.809537]  __se_sys_ioctl+0x77/0xc0
[237880.809541]  __x64_sys_ioctl+0x1d/0x20
[237880.809543]  do_syscall_64+0x44/0xa0
[237880.809549]  ? irqentry_exit+0x12/0x30
[237880.809552]  entry_SYSCALL_64_after_hwframe+0x46/0xb0
[237880.809555] RIP: 0033:0x471777
...
[237880.809570]  </TASK>
[237880.809571] ---[ end trace 0000000000000000 ]---



Jim Mattson (1):
  selftests: KVM: Test if posted interrupt delivery race with migration

Mingwei Zhang (3):
  selftests: KVM/x86: Add APIC state into kvm_x86_state
  selftests: KVM: Introduce vcpu_run_interruptable()
  selftests: KVM: Add support for posted interrupt handling in L2

Oliver Upton (1):
  kvm: x86: get vmcs12 pages before checking pending interrupts

 arch/x86/kvm/x86.c                            |  17 ++
 tools/testing/selftests/kvm/.gitignore        |   1 +
 tools/testing/selftests/kvm/Makefile          |   1 +
 .../selftests/kvm/include/kvm_util_base.h     |  12 +
 .../selftests/kvm/include/x86_64/processor.h  |   1 +
 .../selftests/kvm/include/x86_64/vmx.h        |  10 +
 tools/testing/selftests/kvm/lib/kvm_util.c    |  11 +
 .../selftests/kvm/lib/x86_64/processor.c      |   2 +
 tools/testing/selftests/kvm/lib/x86_64/vmx.c  |  16 +
 .../kvm/x86_64/vmx_migrate_pi_pending.c       | 289 ++++++++++++++++++
 10 files changed, 360 insertions(+)
 create mode 100644 tools/testing/selftests/kvm/x86_64/vmx_migrate_pi_pending.c

-- 
2.37.1.455.g008518b4e5-goog


^ permalink raw reply	[flat|nested] 28+ messages in thread

* [PATCH 1/5] KVM: x86: Get vmcs12 pages before checking pending interrupts
  2022-08-02 23:07 [PATCH 0/5] Fix a race between posted interrupt delivery and migration in a nested VM Mingwei Zhang
@ 2022-08-02 23:07 ` Mingwei Zhang
  2022-08-03 10:08   ` Maxim Levitsky
  2022-08-03 17:18   ` Paolo Bonzini
  2022-08-02 23:07 ` [PATCH 2/5] selftests: KVM/x86: Fix vcpu_{save,load}_state() by adding APIC state into kvm_x86_state Mingwei Zhang
                   ` (3 subsequent siblings)
  4 siblings, 2 replies; 28+ messages in thread
From: Mingwei Zhang @ 2022-08-02 23:07 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel, Oliver Upton, Mingwei Zhang

From: Oliver Upton <oupton@google.com>

vmx_guest_apic_has_interrupts implicitly depends on the virtual APIC
page being present + mapped into the kernel address space. However, with
demand paging we break this dependency, as the KVM_REQ_GET_VMCS12_PAGES
event isn't assessed before entering vcpu_block.

Fix this by getting vmcs12 pages before inspecting the guest's APIC
page. Note that upstream does not have this issue, as they will directly
get the vmcs12 pages on vmlaunch/vmresume instead of relying on the
event request mechanism. However, the upstream approach is problematic,
as the vmcs12 pages will not be present if a live migration occurred
before checking the virtual APIC page.

Signed-off-by: Oliver Upton <oupton@google.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
---
 arch/x86/kvm/x86.c | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 5366f884e9a7..1d3d8127aaea 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -10599,6 +10599,23 @@ static inline int vcpu_block(struct kvm_vcpu *vcpu)
 {
 	bool hv_timer;
 
+	/*
+	 * We must first get the vmcs12 pages before checking for interrupts
+	 * that might unblock the guest if L1 is using virtual-interrupt
+	 * delivery.
+	 */
+	if (kvm_check_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu)) {
+		/*
+		 * If we have to ask user-space to post-copy a page,
+		 * then we have to keep trying to get all of the
+		 * VMCS12 pages until we succeed.
+		 */
+		if (unlikely(!kvm_x86_ops.nested_ops->get_nested_state_pages(vcpu))) {
+			kvm_make_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu);
+			return 0;
+		}
+	}
+
 	if (!kvm_arch_vcpu_runnable(vcpu)) {
 		/*
 		 * Switch to the software timer before halt-polling/blocking as
-- 
2.37.1.455.g008518b4e5-goog


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH 2/5] selftests: KVM/x86: Fix vcpu_{save,load}_state() by adding APIC state into kvm_x86_state
  2022-08-02 23:07 [PATCH 0/5] Fix a race between posted interrupt delivery and migration in a nested VM Mingwei Zhang
  2022-08-02 23:07 ` [PATCH 1/5] KVM: x86: Get vmcs12 pages before checking pending interrupts Mingwei Zhang
@ 2022-08-02 23:07 ` Mingwei Zhang
  2022-08-03 18:44   ` Oliver Upton
  2022-08-02 23:07 ` [PATCH 3/5] selftests: KVM: Introduce vcpu_run_interruptable() Mingwei Zhang
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 28+ messages in thread
From: Mingwei Zhang @ 2022-08-02 23:07 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel, Oliver Upton, Mingwei Zhang

Fix vcpu_{save,load}_state() by adding APIC state into kvm_x86_state and
properly save/restore it in vcpu_{save,load}_state(). When vcpu resets,
APIC state become software disabled in kernel and thus the corresponding
vCPU is not able to receive posted interrupts [1].  So, add APIC
save/restore in userspace in selftest library code.

[1] commit 97222cc83163 ("KVM: Emulate local APIC in kernel").

Cc: Jim Mattson <jmattson@google.com>

Signed-off-by: Mingwei Zhang <mizhang@google.com>
---
 tools/testing/selftests/kvm/include/kvm_util_base.h    | 10 ++++++++++
 tools/testing/selftests/kvm/include/x86_64/processor.h |  1 +
 tools/testing/selftests/kvm/lib/x86_64/processor.c     |  2 ++
 3 files changed, 13 insertions(+)

diff --git a/tools/testing/selftests/kvm/include/kvm_util_base.h b/tools/testing/selftests/kvm/include/kvm_util_base.h
index 24fde97f6121..ac883b8eab57 100644
--- a/tools/testing/selftests/kvm/include/kvm_util_base.h
+++ b/tools/testing/selftests/kvm/include/kvm_util_base.h
@@ -457,6 +457,16 @@ static inline void vcpu_fpu_set(struct kvm_vcpu *vcpu, struct kvm_fpu *fpu)
 	vcpu_ioctl(vcpu, KVM_SET_FPU, fpu);
 }
 
+static inline void vcpu_apic_get(struct kvm_vcpu *vcpu, struct kvm_lapic_state *apic)
+{
+	vcpu_ioctl(vcpu, KVM_GET_LAPIC, apic);
+}
+
+static inline void vcpu_apic_set(struct kvm_vcpu *vcpu, struct kvm_lapic_state *apic)
+{
+	vcpu_ioctl(vcpu, KVM_SET_LAPIC, apic);
+}
+
 static inline int __vcpu_get_reg(struct kvm_vcpu *vcpu, uint64_t id, void *addr)
 {
 	struct kvm_one_reg reg = { .id = id, .addr = (uint64_t)addr };
diff --git a/tools/testing/selftests/kvm/include/x86_64/processor.h b/tools/testing/selftests/kvm/include/x86_64/processor.h
index 45edf45821d0..bf5f874709a4 100644
--- a/tools/testing/selftests/kvm/include/x86_64/processor.h
+++ b/tools/testing/selftests/kvm/include/x86_64/processor.h
@@ -225,6 +225,7 @@ struct kvm_x86_state {
 		struct kvm_nested_state nested;
 		char nested_[16384];
 	};
+	struct kvm_lapic_state apic;
 	struct kvm_msrs msrs;
 };
 
diff --git a/tools/testing/selftests/kvm/lib/x86_64/processor.c b/tools/testing/selftests/kvm/lib/x86_64/processor.c
index f35626df1dea..d18da71654b6 100644
--- a/tools/testing/selftests/kvm/lib/x86_64/processor.c
+++ b/tools/testing/selftests/kvm/lib/x86_64/processor.c
@@ -980,6 +980,7 @@ struct kvm_x86_state *vcpu_save_state(struct kvm_vcpu *vcpu)
 	vcpu_msrs_get(vcpu, &state->msrs);
 
 	vcpu_debugregs_get(vcpu, &state->debugregs);
+	vcpu_apic_get(vcpu, &state->apic);
 
 	return state;
 }
@@ -997,6 +998,7 @@ void vcpu_load_state(struct kvm_vcpu *vcpu, struct kvm_x86_state *state)
 	vcpu_mp_state_set(vcpu, &state->mp_state);
 	vcpu_debugregs_set(vcpu, &state->debugregs);
 	vcpu_regs_set(vcpu, &state->regs);
+	vcpu_apic_set(vcpu, &state->apic);
 
 	if (state->nested.size)
 		vcpu_nested_state_set(vcpu, &state->nested);
-- 
2.37.1.455.g008518b4e5-goog


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH 3/5] selftests: KVM: Introduce vcpu_run_interruptable()
  2022-08-02 23:07 [PATCH 0/5] Fix a race between posted interrupt delivery and migration in a nested VM Mingwei Zhang
  2022-08-02 23:07 ` [PATCH 1/5] KVM: x86: Get vmcs12 pages before checking pending interrupts Mingwei Zhang
  2022-08-02 23:07 ` [PATCH 2/5] selftests: KVM/x86: Fix vcpu_{save,load}_state() by adding APIC state into kvm_x86_state Mingwei Zhang
@ 2022-08-02 23:07 ` Mingwei Zhang
  2022-08-25  0:15   ` Sean Christopherson
  2022-08-02 23:07 ` [PATCH 4/5] selftests: KVM: Add support for posted interrupt handling in L2 Mingwei Zhang
  2022-08-02 23:07 ` [PATCH 5/5] selftests: KVM: Test if posted interrupt delivery race with migration Mingwei Zhang
  4 siblings, 1 reply; 28+ messages in thread
From: Mingwei Zhang @ 2022-08-02 23:07 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel, Oliver Upton, Mingwei Zhang

Introduce vcpu_run_interruptable() to allow selftests execute their own
code when a vcpu is kicked out of KVM_RUN on receiving a POSIX signal.

This function gives selftests the flexibility to execute their own logic
especially when a vCPU is halted. Note that vcpu_run_complete_io() almost
achieves the same effect. However, it explicitly disallows the case of
returning to caller when errno is EINTR.

Signed-off-by: Mingwei Zhang <mizhang@google.com>
---
 tools/testing/selftests/kvm/include/kvm_util_base.h |  2 ++
 tools/testing/selftests/kvm/lib/kvm_util.c          | 11 +++++++++++
 2 files changed, 13 insertions(+)

diff --git a/tools/testing/selftests/kvm/include/kvm_util_base.h b/tools/testing/selftests/kvm/include/kvm_util_base.h
index ac883b8eab57..cfb91c63d8c3 100644
--- a/tools/testing/selftests/kvm/include/kvm_util_base.h
+++ b/tools/testing/selftests/kvm/include/kvm_util_base.h
@@ -398,6 +398,8 @@ static inline int __vcpu_run(struct kvm_vcpu *vcpu)
 	return __vcpu_ioctl(vcpu, KVM_RUN, NULL);
 }
 
+int vcpu_run_interruptable(struct kvm_vcpu *vcpu);
+
 void vcpu_run_complete_io(struct kvm_vcpu *vcpu);
 struct kvm_reg_list *vcpu_get_reg_list(struct kvm_vcpu *vcpu);
 
diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c b/tools/testing/selftests/kvm/lib/kvm_util.c
index 9889fe0d8919..aca418ce4e8c 100644
--- a/tools/testing/selftests/kvm/lib/kvm_util.c
+++ b/tools/testing/selftests/kvm/lib/kvm_util.c
@@ -1406,6 +1406,17 @@ void vm_create_irqchip(struct kvm_vm *vm)
 	vm->has_irqchip = true;
 }
 
+int vcpu_run_interruptable(struct kvm_vcpu *vcpu)
+{
+	int rc;
+
+	rc = __vcpu_run(vcpu);
+
+	vcpu->run->immediate_exit = 0;
+
+	return rc;
+}
+
 int _vcpu_run(struct kvm_vcpu *vcpu)
 {
 	int rc;
-- 
2.37.1.455.g008518b4e5-goog


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH 4/5] selftests: KVM: Add support for posted interrupt handling in L2
  2022-08-02 23:07 [PATCH 0/5] Fix a race between posted interrupt delivery and migration in a nested VM Mingwei Zhang
                   ` (2 preceding siblings ...)
  2022-08-02 23:07 ` [PATCH 3/5] selftests: KVM: Introduce vcpu_run_interruptable() Mingwei Zhang
@ 2022-08-02 23:07 ` Mingwei Zhang
  2022-08-25  0:16   ` Sean Christopherson
  2022-08-02 23:07 ` [PATCH 5/5] selftests: KVM: Test if posted interrupt delivery race with migration Mingwei Zhang
  4 siblings, 1 reply; 28+ messages in thread
From: Mingwei Zhang @ 2022-08-02 23:07 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel, Oliver Upton, Mingwei Zhang

Add support for posted interrupt handling in L2. This is done by adding
needed data structures in vmx_pages and APIs to allow an L2 receive posted
interrupts

Cc: Jim Mattson <jmattson@google.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
---
 tools/testing/selftests/kvm/include/x86_64/vmx.h | 10 ++++++++++
 tools/testing/selftests/kvm/lib/x86_64/vmx.c     | 16 ++++++++++++++++
 2 files changed, 26 insertions(+)

diff --git a/tools/testing/selftests/kvm/include/x86_64/vmx.h b/tools/testing/selftests/kvm/include/x86_64/vmx.h
index 7d8c980317f7..3449ae8ab282 100644
--- a/tools/testing/selftests/kvm/include/x86_64/vmx.h
+++ b/tools/testing/selftests/kvm/include/x86_64/vmx.h
@@ -579,6 +579,14 @@ struct vmx_pages {
 	void *apic_access_hva;
 	uint64_t apic_access_gpa;
 	void *apic_access;
+
+	void *virtual_apic_hva;
+	uint64_t virtual_apic_gpa;
+	void *virtual_apic;
+
+	void *posted_intr_desc_hva;
+	uint64_t posted_intr_desc_gpa;
+	void *posted_intr_desc;
 };
 
 union vmx_basic {
@@ -622,5 +630,7 @@ void nested_identity_map_1g(struct vmx_pages *vmx, struct kvm_vm *vm,
 void prepare_eptp(struct vmx_pages *vmx, struct kvm_vm *vm,
 		  uint32_t eptp_memslot);
 void prepare_virtualize_apic_accesses(struct vmx_pages *vmx, struct kvm_vm *vm);
+void prepare_virtual_apic(struct vmx_pages *vmx, struct kvm_vm *vm);
+void prepare_posted_intr_desc(struct vmx_pages *vmx, struct kvm_vm *vm);
 
 #endif /* SELFTEST_KVM_VMX_H */
diff --git a/tools/testing/selftests/kvm/lib/x86_64/vmx.c b/tools/testing/selftests/kvm/lib/x86_64/vmx.c
index 80a568c439b8..7d65bee37b9f 100644
--- a/tools/testing/selftests/kvm/lib/x86_64/vmx.c
+++ b/tools/testing/selftests/kvm/lib/x86_64/vmx.c
@@ -556,3 +556,19 @@ void prepare_virtualize_apic_accesses(struct vmx_pages *vmx, struct kvm_vm *vm)
 	vmx->apic_access_hva = addr_gva2hva(vm, (uintptr_t)vmx->apic_access);
 	vmx->apic_access_gpa = addr_gva2gpa(vm, (uintptr_t)vmx->apic_access);
 }
+
+void prepare_virtual_apic(struct vmx_pages *vmx, struct kvm_vm *vm)
+{
+	vmx->virtual_apic = (void *)vm_vaddr_alloc_page(vm);
+	vmx->virtual_apic_hva = addr_gva2hva(vm, (uintptr_t)vmx->virtual_apic);
+	vmx->virtual_apic_gpa = addr_gva2gpa(vm, (uintptr_t)vmx->virtual_apic);
+}
+
+void prepare_posted_intr_desc(struct vmx_pages *vmx, struct kvm_vm *vm)
+{
+	vmx->posted_intr_desc = (void *)vm_vaddr_alloc_page(vm);
+	vmx->posted_intr_desc_hva =
+		addr_gva2hva(vm, (uintptr_t)vmx->posted_intr_desc);
+	vmx->posted_intr_desc_gpa =
+		addr_gva2gpa(vm, (uintptr_t)vmx->posted_intr_desc);
+}
-- 
2.37.1.455.g008518b4e5-goog


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* [PATCH 5/5] selftests: KVM: Test if posted interrupt delivery race with migration
  2022-08-02 23:07 [PATCH 0/5] Fix a race between posted interrupt delivery and migration in a nested VM Mingwei Zhang
                   ` (3 preceding siblings ...)
  2022-08-02 23:07 ` [PATCH 4/5] selftests: KVM: Add support for posted interrupt handling in L2 Mingwei Zhang
@ 2022-08-02 23:07 ` Mingwei Zhang
  4 siblings, 0 replies; 28+ messages in thread
From: Mingwei Zhang @ 2022-08-02 23:07 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel, Oliver Upton, Mingwei Zhang

From: Jim Mattson <jmattson@google.com>

Test if posted interrupt delivery race with migration. Add a selftest
to demonstrate a race condition between migration and posted
interrupt for a nested VM. The consequence of this race condition causes
the loss of a posted interrupt for a nested vCPU after migration and
triggers a warning for unpatched kernel.

The selftest demonstrates that if a L2 vCPU is in halted state before
migration, then after migration, it is not able to receive a posted
interrupt from another vCPU within the same VM.

The fundamental problem is deeply buried in the kernel logic where
vcpu_block() will directly check vmcs12 related mappings before having a
valid vmcs12 ready.  Because of that, it fails to process the posted
interrupt and triggers the warning in vmx_guest_apic_has_interrupt()

static bool vmx_guest_apic_has_interrupt(struct kvm_vcpu *vcpu)
{
	...
	if (WARN_ON_ONCE(!is_guest_mode(vcpu)) ||
		!nested_cpu_has_vid(get_vmcs12(vcpu)) ||
		WARN_ON_ONCE(!vmx->nested.virtual_apic_map.gfn)) <= HERE
		return false;
	...
}

Signed-off-by: Jim Mattson <jmattson@google.com>
Co-developed-by: Mingwei Zhang <mizhang@google.com>
Signed-off-by: Mingwei Zhang <mizhang@google.com>
---
 tools/testing/selftests/kvm/.gitignore        |   1 +
 tools/testing/selftests/kvm/Makefile          |   1 +
 .../kvm/x86_64/vmx_migrate_pi_pending.c       | 289 ++++++++++++++++++
 3 files changed, 291 insertions(+)
 create mode 100644 tools/testing/selftests/kvm/x86_64/vmx_migrate_pi_pending.c

diff --git a/tools/testing/selftests/kvm/.gitignore b/tools/testing/selftests/kvm/.gitignore
index d625a3f83780..749b2be5b23c 100644
--- a/tools/testing/selftests/kvm/.gitignore
+++ b/tools/testing/selftests/kvm/.gitignore
@@ -51,6 +51,7 @@
 /x86_64/vmx_exception_with_invalid_guest_state
 /x86_64/vmx_invalid_nested_guest_state
 /x86_64/vmx_msrs_test
+/x86_64/vmx_migrate_pi_pending
 /x86_64/vmx_preemption_timer_test
 /x86_64/vmx_set_nested_state_test
 /x86_64/vmx_tsc_adjust_test
diff --git a/tools/testing/selftests/kvm/Makefile b/tools/testing/selftests/kvm/Makefile
index 690b499c3471..2d32416237db 100644
--- a/tools/testing/selftests/kvm/Makefile
+++ b/tools/testing/selftests/kvm/Makefile
@@ -109,6 +109,7 @@ TEST_GEN_PROGS_x86_64 += x86_64/vmx_dirty_log_test
 TEST_GEN_PROGS_x86_64 += x86_64/vmx_exception_with_invalid_guest_state
 TEST_GEN_PROGS_x86_64 += x86_64/vmx_msrs_test
 TEST_GEN_PROGS_x86_64 += x86_64/vmx_invalid_nested_guest_state
+TEST_GEN_PROGS_x86_64 += x86_64/vmx_migrate_pi_pending
 TEST_GEN_PROGS_x86_64 += x86_64/vmx_set_nested_state_test
 TEST_GEN_PROGS_x86_64 += x86_64/vmx_tsc_adjust_test
 TEST_GEN_PROGS_x86_64 += x86_64/vmx_nested_tsc_scaling_test
diff --git a/tools/testing/selftests/kvm/x86_64/vmx_migrate_pi_pending.c b/tools/testing/selftests/kvm/x86_64/vmx_migrate_pi_pending.c
new file mode 100644
index 000000000000..f1498621eb9a
--- /dev/null
+++ b/tools/testing/selftests/kvm/x86_64/vmx_migrate_pi_pending.c
@@ -0,0 +1,289 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * vmx_migrate_pi_pending
+ *
+ * Copyright (C) 2022, Google, LLC.
+ *
+ * Deliver a nested posted interrupt between migration and the first
+ * KVM_RUN on the target.
+ */
+
+#include "test_util.h"
+#include "kvm_util.h"
+#include "processor.h"
+#include "vmx.h"
+
+#include <string.h>
+#include <sys/ioctl.h>
+#include <linux/bitmap.h>
+#include <pthread.h>
+#include <signal.h>
+
+#include "kselftest.h"
+
+#define VCPU_ID0 0
+#define VCPU_ID1 1
+#define PI_ON_BIT 256
+#define PI_NV    0x42
+#define L2_INTR  0x71
+
+enum {
+	PORT_L0_EXIT = 0x2000,
+};
+
+/* The virtual machine object. */
+struct vmx_pages *vmx;
+
+static struct kvm_vm *vm;
+static struct kvm_vcpu *vcpu0;
+static struct kvm_vcpu *vcpu1;
+bool vcpu0_can_run = true;
+bool vcpu0_running;
+bool pi_executed;
+pthread_t pthread_cpu0;
+pthread_t pthread_cpu1;
+
+static void vcpu0_ipi_handler(struct ex_regs *regs)
+{
+	 asm volatile("inb %%dx, %%al"
+		      : : [port] "d" (PORT_L0_EXIT) : "rax");
+	asm volatile("vmcall");
+}
+
+static void l2_vcpu0_guest_code(void)
+{
+	asm volatile("cli");
+	asm volatile("sti; nop; hlt");
+}
+
+static void l1_vcpu0_guest_code(struct vmx_pages *vmx_pages)
+{
+#define L2_GUEST_STACK_SIZE 64
+	unsigned long l2_vcpu0_guest_stack[L2_GUEST_STACK_SIZE];
+	uint32_t control;
+
+	x2apic_enable();
+
+	GUEST_ASSERT(prepare_for_vmx_operation(vmx_pages));
+	GUEST_ASSERT(load_vmcs(vmx_pages));
+
+	/* Prepare the VMCS for L2 execution. */
+	prepare_vmcs(vmx_pages, l2_vcpu0_guest_code,
+		     &l2_vcpu0_guest_stack[L2_GUEST_STACK_SIZE]);
+	control = vmreadz(PIN_BASED_VM_EXEC_CONTROL);
+	control |= (PIN_BASED_EXT_INTR_MASK |
+		    PIN_BASED_POSTED_INTR);
+	vmwrite(PIN_BASED_VM_EXEC_CONTROL, control);
+	control = vmreadz(CPU_BASED_VM_EXEC_CONTROL);
+	control |= (CPU_BASED_TPR_SHADOW |
+		    CPU_BASED_ACTIVATE_SECONDARY_CONTROLS);
+	vmwrite(CPU_BASED_VM_EXEC_CONTROL, control);
+	control = vmreadz(SECONDARY_VM_EXEC_CONTROL);
+	control |= (SECONDARY_EXEC_VIRTUALIZE_X2APIC_MODE |
+		    SECONDARY_EXEC_VIRTUAL_INTR_DELIVERY);
+	vmwrite(SECONDARY_VM_EXEC_CONTROL, control);
+	control = vmreadz(VM_EXIT_CONTROLS);
+	control |= VM_EXIT_ACK_INTR_ON_EXIT;
+	vmwrite(VM_EXIT_CONTROLS, control);
+	vmwrite(VIRTUAL_APIC_PAGE_ADDR, vmx_pages->virtual_apic_gpa);
+	vmwrite(POSTED_INTR_DESC_ADDR, vmx_pages->posted_intr_desc_gpa);
+	vmwrite(POSTED_INTR_NV, PI_NV);
+
+	GUEST_ASSERT(!vmlaunch());
+	GUEST_ASSERT(vmreadz(VM_EXIT_REASON) == EXIT_REASON_VMCALL);
+	GUEST_ASSERT(!test_bit(PI_ON_BIT, (void *)vmx_pages->posted_intr_desc));
+	GUEST_DONE();
+}
+
+static void post_intr(u8 vector, void *pi_desc)
+{
+	set_bit(vector, pi_desc);
+	set_bit(PI_ON_BIT, pi_desc);
+}
+
+static void l1_vcpu1_guest_code(void *vcpu0_pi_desc)
+{
+	post_intr(L2_INTR, vcpu0_pi_desc);
+	x2apic_enable();
+	x2apic_write_reg(APIC_ICR, ((u64)VCPU_ID0 << 32) |
+			 APIC_DEST_PHYSICAL | APIC_DM_FIXED | PI_NV);
+	GUEST_DONE();
+}
+
+static void save_restore_vm(struct kvm_vm *vm)
+{
+	struct kvm_regs regs1 = {}, regs2 = {};
+	struct kvm_x86_state *state;
+
+	state = vcpu_save_state(vcpu0);
+	vcpu_regs_get(vcpu0, &regs1);
+
+	kvm_vm_release(vm);
+
+	/* Restore state in a new VM.  */
+	vcpu0 = vm_recreate_with_one_vcpu(vm);
+	vcpu_load_state(vcpu0, state);
+	kvm_x86_state_cleanup(state);
+
+	vcpu_regs_get(vcpu0, &regs2);
+	TEST_ASSERT(!memcmp(&regs1, &regs2, sizeof(regs2)),
+		    "vcpu0: Unexpected register values after vcpu_load_state; rdi: %lx rsi: %lx",
+		    (ulong) regs2.rdi, (ulong) regs2.rsi);
+}
+
+void *create_and_run_vcpu1(void *arg)
+{
+	struct ucall uc;
+	struct kvm_run *run;
+	struct kvm_mp_state vcpu0_mp_state;
+
+	pthread_cpu1 = pthread_self();
+
+	/* Keep trying to kick out vcpu0 until it is in halted state. */
+	for (;;) {
+		WRITE_ONCE(vcpu0_can_run, true);
+		sleep(0.1);
+		WRITE_ONCE(vcpu0_can_run, false);
+		pthread_kill(pthread_cpu0, SIGUSR1);
+		printf("vcpu1: Sent SIGUSR1 to vcpu0\n");
+
+		while (READ_ONCE(vcpu0_running))
+			;
+
+		vcpu_mp_state_get(vcpu0, &vcpu0_mp_state);
+		if (vcpu0_mp_state.mp_state == KVM_MP_STATE_HALTED)
+			break;
+	}
+
+	printf("vcpu1: Kicked out vcpu0 and ensure vcpu0 is halted\n");
+
+	/* Use save_restore_vm() to simulate a VM migration. */
+	save_restore_vm(vm);
+
+	printf("vcpu1: Finished save and restore vm.\n");
+	vcpu1 = vm_vcpu_add(vm, VCPU_ID1, l1_vcpu1_guest_code);
+	vcpu_args_set(vcpu1, 1, vmx->posted_intr_desc);
+
+	/* Start an L1 in vcpu1 and send a posted interrupt to halted L2 in vcpu0. */
+	for (;;) {
+		run = vcpu1->run;
+		vcpu_run(vcpu1);
+
+		TEST_ASSERT(run->exit_reason == KVM_EXIT_IO,
+			    "vcpu1: Got exit_reason other than KVM_EXIT_IO: %u (%s)\n",
+			    run->exit_reason,
+			    exit_reason_str(run->exit_reason));
+
+		switch (get_ucall(vcpu1, &uc)) {
+		case UCALL_ABORT:
+			TEST_FAIL("%s", (const char *)uc.args[0]);
+			/* NOT REACHED */
+		case UCALL_DONE:
+			printf("vcpu1: Successfully send a posted interrupt to vcpu0\n");
+			goto done;
+		default:
+			TEST_FAIL("vcpu1: Unknown ucall %lu", uc.cmd);
+		}
+	}
+
+done:
+	/*
+	 * Allow vcpu0 resume execution from L0 userspace and check if the
+	 * posted interrupt get executed.
+	 */
+	WRITE_ONCE(vcpu0_can_run, true);
+	sleep(1);
+	TEST_ASSERT(READ_ONCE(pi_executed),
+		    "vcpu0 did not execute the posted interrupt.\n");
+
+	return NULL;
+}
+
+void sig_handler(int signum, siginfo_t *info, void *context)
+{
+	TEST_ASSERT(pthread_self() == pthread_cpu0,
+		    "Incorrect receiver of the signal, expect pthread_cpu0: "
+		    "%lu, but get: %lu\n", pthread_cpu0, pthread_self());
+	printf("vcpu0: Execute sighandler for signal: %d\n", signum);
+}
+
+int main(int argc, char *argv[])
+{
+	vm_vaddr_t vmx_pages_gva;
+	struct sigaction sig_action;
+	struct sigaction old_action;
+
+	memset(&sig_action, 0, sizeof(sig_action));
+	sig_action.sa_sigaction = sig_handler;
+	sig_action.sa_flags = SA_RESTART | SA_SIGINFO;
+	sigemptyset(&sig_action.sa_mask);
+	sigaction(SIGUSR1, &sig_action, &old_action);
+
+	pthread_cpu0 = pthread_self();
+	printf("vcpu0: Finish setup signal handler for SIGUSR1\n");
+
+	TEST_REQUIRE(kvm_has_cap(KVM_CAP_NESTED_STATE));
+	TEST_REQUIRE(kvm_cpu_has(X86_FEATURE_VMX));
+
+	vm = vm_create_with_one_vcpu(&vcpu0, (void *)l1_vcpu0_guest_code);
+
+	vm_init_descriptor_tables(vm);
+	vcpu_init_descriptor_tables(vcpu0);
+	vm_install_exception_handler(vm, L2_INTR, vcpu0_ipi_handler);
+
+	/* Allocate VMX pages and shared descriptors (vmx_pages). */
+	vmx = vcpu_alloc_vmx(vm, &vmx_pages_gva);
+	prepare_virtual_apic(vmx, vm);
+	prepare_posted_intr_desc(vmx, vm);
+	vcpu_args_set(vcpu0, 1, vmx_pages_gva);
+
+	pthread_create(&pthread_cpu1, NULL, create_and_run_vcpu1, NULL);
+
+	for (;;) {
+		struct kvm_run *run = vcpu0->run;
+		struct ucall uc;
+		int rc;
+
+		while (!READ_ONCE(vcpu0_can_run))
+			;
+
+		WRITE_ONCE(vcpu0_running, true);
+
+		rc = vcpu_run_interruptable(vcpu0);
+
+		/*
+		 * When vCPU is kicked out by a signal, ensure a consistent vCPU
+		 * state to prepare for migration before setting the
+		 * vcpu_running flag to false.
+		 */
+		if (rc == -1 && run->exit_reason == KVM_EXIT_INTR) {
+			vcpu_run_complete_io(vcpu0);
+
+			WRITE_ONCE(vcpu0_running, false);
+
+			continue;
+		}
+
+		WRITE_ONCE(vcpu0_running, false);
+
+		if (run->io.port == PORT_L0_EXIT) {
+			printf("vcpu0: Executed the posted interrupt\n");
+			WRITE_ONCE(pi_executed, true);
+			continue;
+		}
+
+		switch (get_ucall(vcpu0, &uc)) {
+		case UCALL_ABORT:
+			TEST_FAIL("%s", (const char *)uc.args[0]);
+			/* NOT REACHED */
+		case UCALL_DONE:
+			goto done;
+		default:
+			TEST_FAIL("vcpu0: Unknown ucall %lu", uc.cmd);
+		}
+	}
+
+done:
+	kvm_vm_free(vm);
+	return 0;
+}
-- 
2.37.1.455.g008518b4e5-goog


^ permalink raw reply related	[flat|nested] 28+ messages in thread

* Re: [PATCH 1/5] KVM: x86: Get vmcs12 pages before checking pending interrupts
  2022-08-02 23:07 ` [PATCH 1/5] KVM: x86: Get vmcs12 pages before checking pending interrupts Mingwei Zhang
@ 2022-08-03 10:08   ` Maxim Levitsky
  2022-08-03 16:45     ` Mingwei Zhang
  2022-08-03 17:18   ` Paolo Bonzini
  1 sibling, 1 reply; 28+ messages in thread
From: Maxim Levitsky @ 2022-08-03 10:08 UTC (permalink / raw)
  To: Mingwei Zhang, Paolo Bonzini
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel, Oliver Upton

On Tue, 2022-08-02 at 23:07 +0000, Mingwei Zhang wrote:
> From: Oliver Upton <oupton@google.com>
> 
> vmx_guest_apic_has_interrupts implicitly depends on the virtual APIC
> page being present + mapped into the kernel address space. However, with
> demand paging we break this dependency, as the KVM_REQ_GET_VMCS12_PAGES
> event isn't assessed before entering vcpu_block.
> 
> Fix this by getting vmcs12 pages before inspecting the guest's APIC
> page. Note that upstream does not have this issue, as they will directly
> get the vmcs12 pages on vmlaunch/vmresume instead of relying on the
> event request mechanism. However, the upstream approach is problematic,
> as the vmcs12 pages will not be present if a live migration occurred
> before checking the virtual APIC page.

Since this patch is intended for upstream, I don't fully understand
the meaning of the above paragraph.


> 
> Signed-off-by: Oliver Upton <oupton@google.com>
> Signed-off-by: Mingwei Zhang <mizhang@google.com>
> ---
>  arch/x86/kvm/x86.c | 17 +++++++++++++++++
>  1 file changed, 17 insertions(+)
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 5366f884e9a7..1d3d8127aaea 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -10599,6 +10599,23 @@ static inline int vcpu_block(struct kvm_vcpu *vcpu)
>  {
>  	bool hv_timer;
>  
> +	/*
> +	 * We must first get the vmcs12 pages before checking for interrupts
> +	 * that might unblock the guest if L1 is using virtual-interrupt
> +	 * delivery.
> +	 */
> +	if (kvm_check_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu)) {
> +		/*
> +		 * If we have to ask user-space to post-copy a page,
> +		 * then we have to keep trying to get all of the
> +		 * VMCS12 pages until we succeed.
> +		 */
> +		if (unlikely(!kvm_x86_ops.nested_ops->get_nested_state_pages(vcpu))) {
> +			kvm_make_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu);
> +			return 0;
> +		}
> +	}
> +
>  	if (!kvm_arch_vcpu_runnable(vcpu)) {
>  		/*
>  		 * Switch to the software timer before halt-polling/blocking as


If I understand correctly, you are saying that if apic backing page is migrated in post copy
then 'get_nested_state_pages' will return false and thus fail?

AFAIK both SVM and VMX versions of 'get_nested_state_pages' assume that this is not the case
for many things like MSR bitmaps and such - they always uses non atomic versions
of guest memory access like 'kvm_vcpu_read_guest' and 'kvm_vcpu_map' which
supposed to block if they attempt to access HVA which is not present, and then
userfaultd should take over and wake them up.

If that still fails, nested VM entry is usually failed, and/or the whole VM
is crashed with 'KVM_EXIT_INTERNAL_ERROR'.

Anything I missed? 

Best regards,
	Maxim Levitsky


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 1/5] KVM: x86: Get vmcs12 pages before checking pending interrupts
  2022-08-03 10:08   ` Maxim Levitsky
@ 2022-08-03 16:45     ` Mingwei Zhang
  2022-08-03 17:00       ` Mingwei Zhang
  0 siblings, 1 reply; 28+ messages in thread
From: Mingwei Zhang @ 2022-08-03 16:45 UTC (permalink / raw)
  To: Maxim Levitsky
  Cc: Paolo Bonzini, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, kvm, linux-kernel, Oliver Upton

On Wed, Aug 03, 2022, Maxim Levitsky wrote:
> On Tue, 2022-08-02 at 23:07 +0000, Mingwei Zhang wrote:
> > From: Oliver Upton <oupton@google.com>
> > 
> > vmx_guest_apic_has_interrupts implicitly depends on the virtual APIC
> > page being present + mapped into the kernel address space. However, with
> > demand paging we break this dependency, as the KVM_REQ_GET_VMCS12_PAGES
> > event isn't assessed before entering vcpu_block.
> > 
> > Fix this by getting vmcs12 pages before inspecting the guest's APIC
> > page. Note that upstream does not have this issue, as they will directly
> > get the vmcs12 pages on vmlaunch/vmresume instead of relying on the
> > event request mechanism. However, the upstream approach is problematic,
> > as the vmcs12 pages will not be present if a live migration occurred
> > before checking the virtual APIC page.
> 
> Since this patch is intended for upstream, I don't fully understand
> the meaning of the above paragraph.

My apology. Some of the statement needs to be updated, which I should do
before sending. But I think the point here is that there is a missing
get_nested_state_pages() call here within vcpu_block() when there is the
request of KVM_REQ_GET_NESTED_STATE_PAGES.

> 
> 
> > 
> > Signed-off-by: Oliver Upton <oupton@google.com>
> > Signed-off-by: Mingwei Zhang <mizhang@google.com>
> > ---
> >  arch/x86/kvm/x86.c | 17 +++++++++++++++++
> >  1 file changed, 17 insertions(+)
> > 
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index 5366f884e9a7..1d3d8127aaea 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -10599,6 +10599,23 @@ static inline int vcpu_block(struct kvm_vcpu *vcpu)
> >  {
> >  	bool hv_timer;
> >  
> > +	/*
> > +	 * We must first get the vmcs12 pages before checking for interrupts
> > +	 * that might unblock the guest if L1 is using virtual-interrupt
> > +	 * delivery.
> > +	 */
> > +	if (kvm_check_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu)) {
> > +		/*
> > +		 * If we have to ask user-space to post-copy a page,
> > +		 * then we have to keep trying to get all of the
> > +		 * VMCS12 pages until we succeed.
> > +		 */
> > +		if (unlikely(!kvm_x86_ops.nested_ops->get_nested_state_pages(vcpu))) {
> > +			kvm_make_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu);
> > +			return 0;
> > +		}
> > +	}
> > +
> >  	if (!kvm_arch_vcpu_runnable(vcpu)) {
> >  		/*
> >  		 * Switch to the software timer before halt-polling/blocking as
> 
> 
> If I understand correctly, you are saying that if apic backing page is migrated in post copy
> then 'get_nested_state_pages' will return false and thus fail?

What I mean is that when the vCPU was halted and then migrated in this
case, KVM did not call get_nested_state_pages() before getting into
kvm_arch_vcpu_runnable(). This function checks the apic backing page and
fails on that check and triggered the warning.
> 
> AFAIK both SVM and VMX versions of 'get_nested_state_pages' assume that this is not the case
> for many things like MSR bitmaps and such - they always uses non atomic versions
> of guest memory access like 'kvm_vcpu_read_guest' and 'kvm_vcpu_map' which
> supposed to block if they attempt to access HVA which is not present, and then
> userfaultd should take over and wake them up.

You are right here.
> 
> If that still fails, nested VM entry is usually failed, and/or the whole VM
> is crashed with 'KVM_EXIT_INTERNAL_ERROR'.
> 
> Anything I missed? 
> 
> Best regards,
> 	Maxim Levitsky
> 

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 1/5] KVM: x86: Get vmcs12 pages before checking pending interrupts
  2022-08-03 16:45     ` Mingwei Zhang
@ 2022-08-03 17:00       ` Mingwei Zhang
  2022-08-03 18:36         ` Oliver Upton
  0 siblings, 1 reply; 28+ messages in thread
From: Mingwei Zhang @ 2022-08-03 17:00 UTC (permalink / raw)
  To: Maxim Levitsky
  Cc: Paolo Bonzini, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, kvm, linux-kernel, Oliver Upton

On Wed, Aug 03, 2022, Mingwei Zhang wrote:
> On Wed, Aug 03, 2022, Maxim Levitsky wrote:
> > On Tue, 2022-08-02 at 23:07 +0000, Mingwei Zhang wrote:
> > > From: Oliver Upton <oupton@google.com>
> > > 
> > > vmx_guest_apic_has_interrupts implicitly depends on the virtual APIC
> > > page being present + mapped into the kernel address space. However, with
> > > demand paging we break this dependency, as the KVM_REQ_GET_VMCS12_PAGES
> > > event isn't assessed before entering vcpu_block.
> > > 
> > > Fix this by getting vmcs12 pages before inspecting the guest's APIC
> > > page. Note that upstream does not have this issue, as they will directly
> > > get the vmcs12 pages on vmlaunch/vmresume instead of relying on the
> > > event request mechanism. However, the upstream approach is problematic,
> > > as the vmcs12 pages will not be present if a live migration occurred
> > > before checking the virtual APIC page.
> > 
> > Since this patch is intended for upstream, I don't fully understand
> > the meaning of the above paragraph.
> 
> My apology. Some of the statement needs to be updated, which I should do
> before sending. But I think the point here is that there is a missing
> get_nested_state_pages() call here within vcpu_block() when there is the
> request of KVM_REQ_GET_NESTED_STATE_PAGES.
> 
> > 
> > 
> > > 
> > > Signed-off-by: Oliver Upton <oupton@google.com>
> > > Signed-off-by: Mingwei Zhang <mizhang@google.com>
> > > ---
> > >  arch/x86/kvm/x86.c | 17 +++++++++++++++++
> > >  1 file changed, 17 insertions(+)
> > > 
> > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > > index 5366f884e9a7..1d3d8127aaea 100644
> > > --- a/arch/x86/kvm/x86.c
> > > +++ b/arch/x86/kvm/x86.c
> > > @@ -10599,6 +10599,23 @@ static inline int vcpu_block(struct kvm_vcpu *vcpu)
> > >  {
> > >  	bool hv_timer;
> > >  
> > > +	/*
> > > +	 * We must first get the vmcs12 pages before checking for interrupts
> > > +	 * that might unblock the guest if L1 is using virtual-interrupt
> > > +	 * delivery.
> > > +	 */
> > > +	if (kvm_check_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu)) {
> > > +		/*
> > > +		 * If we have to ask user-space to post-copy a page,
> > > +		 * then we have to keep trying to get all of the
> > > +		 * VMCS12 pages until we succeed.
> > > +		 */
> > > +		if (unlikely(!kvm_x86_ops.nested_ops->get_nested_state_pages(vcpu))) {
> > > +			kvm_make_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu);
> > > +			return 0;
> > > +		}
> > > +	}
> > > +
> > >  	if (!kvm_arch_vcpu_runnable(vcpu)) {
> > >  		/*
> > >  		 * Switch to the software timer before halt-polling/blocking as
> > 
> > 
> > If I understand correctly, you are saying that if apic backing page is migrated in post copy
> > then 'get_nested_state_pages' will return false and thus fail?
> 
> What I mean is that when the vCPU was halted and then migrated in this
> case, KVM did not call get_nested_state_pages() before getting into
> kvm_arch_vcpu_runnable(). This function checks the apic backing page and
> fails on that check and triggered the warning.
> > 
> > AFAIK both SVM and VMX versions of 'get_nested_state_pages' assume that this is not the case
> > for many things like MSR bitmaps and such - they always uses non atomic versions
> > of guest memory access like 'kvm_vcpu_read_guest' and 'kvm_vcpu_map' which
> > supposed to block if they attempt to access HVA which is not present, and then
> > userfaultd should take over and wake them up.
> 
> You are right here.
> > 
> > If that still fails, nested VM entry is usually failed, and/or the whole VM
> > is crashed with 'KVM_EXIT_INTERNAL_ERROR'.
> > 

Ah, I think I understand what you are saying. hmm, so basically the
patch here is to continuously request vmcs12 pages if failed. But what
you are saying is that we just need to call 'get_nested_state_pages'
once. If it fails, then the VM fails to work. Let me double check and
get back.

> > Anything I missed? 
> > 
> > Best regards,
> > 	Maxim Levitsky
> > 

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 1/5] KVM: x86: Get vmcs12 pages before checking pending interrupts
  2022-08-02 23:07 ` [PATCH 1/5] KVM: x86: Get vmcs12 pages before checking pending interrupts Mingwei Zhang
  2022-08-03 10:08   ` Maxim Levitsky
@ 2022-08-03 17:18   ` Paolo Bonzini
  2022-08-03 17:51     ` Mingwei Zhang
  1 sibling, 1 reply; 28+ messages in thread
From: Paolo Bonzini @ 2022-08-03 17:18 UTC (permalink / raw)
  To: Mingwei Zhang
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel, Oliver Upton

On 8/3/22 01:07, Mingwei Zhang wrote:
> +	/*
> +	 * We must first get the vmcs12 pages before checking for interrupts
> +	 * that might unblock the guest if L1 is using virtual-interrupt
> +	 * delivery.
> +	 */
> +	if (kvm_check_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu)) {
> +		/*
> +		 * If we have to ask user-space to post-copy a page,
> +		 * then we have to keep trying to get all of the
> +		 * VMCS12 pages until we succeed.
> +		 */
> +		if (unlikely(!kvm_x86_ops.nested_ops->get_nested_state_pages(vcpu))) {
> +			kvm_make_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu);
> +			return 0;
> +		}
> +	}
> +

I think request handling (except for KVM_REQ_EVENT) could be more 
generically moved from vcpu_enter_guest() to vcpu_run().

Paolo


^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 1/5] KVM: x86: Get vmcs12 pages before checking pending interrupts
  2022-08-03 17:18   ` Paolo Bonzini
@ 2022-08-03 17:51     ` Mingwei Zhang
  2022-08-03 19:34       ` Maxim Levitsky
  0 siblings, 1 reply; 28+ messages in thread
From: Mingwei Zhang @ 2022-08-03 17:51 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, LKML, Oliver Upton

On Wed, Aug 3, 2022 at 10:18 AM Paolo Bonzini <pbonzini@redhat.com> wrote:
>
> On 8/3/22 01:07, Mingwei Zhang wrote:
> > +     /*
> > +      * We must first get the vmcs12 pages before checking for interrupts
> > +      * that might unblock the guest if L1 is using virtual-interrupt
> > +      * delivery.
> > +      */
> > +     if (kvm_check_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu)) {
> > +             /*
> > +              * If we have to ask user-space to post-copy a page,
> > +              * then we have to keep trying to get all of the
> > +              * VMCS12 pages until we succeed.
> > +              */
> > +             if (unlikely(!kvm_x86_ops.nested_ops->get_nested_state_pages(vcpu))) {
> > +                     kvm_make_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu);
> > +                     return 0;
> > +             }
> > +     }
> > +
>
> I think request handling (except for KVM_REQ_EVENT) could be more
> generically moved from vcpu_enter_guest() to vcpu_run().

Yeah, sounds good to me. I can come up with an updated version. At
least, I will remove the repeat request here.

>
> Paolo
>

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 1/5] KVM: x86: Get vmcs12 pages before checking pending interrupts
  2022-08-03 17:00       ` Mingwei Zhang
@ 2022-08-03 18:36         ` Oliver Upton
  0 siblings, 0 replies; 28+ messages in thread
From: Oliver Upton @ 2022-08-03 18:36 UTC (permalink / raw)
  To: Mingwei Zhang
  Cc: Maxim Levitsky, Paolo Bonzini, Sean Christopherson,
	Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel, kvm,
	linux-kernel

On Wed, Aug 03, 2022 at 05:00:26PM +0000, Mingwei Zhang wrote:
> On Wed, Aug 03, 2022, Mingwei Zhang wrote:
> > On Wed, Aug 03, 2022, Maxim Levitsky wrote:
> > > On Tue, 2022-08-02 at 23:07 +0000, Mingwei Zhang wrote:
> > > > From: Oliver Upton <oupton@google.com>
> > > > 
> > > > vmx_guest_apic_has_interrupts implicitly depends on the virtual APIC
> > > > page being present + mapped into the kernel address space. However, with
> > > > demand paging we break this dependency, as the KVM_REQ_GET_VMCS12_PAGES
> > > > event isn't assessed before entering vcpu_block.
> > > > 
> > > > Fix this by getting vmcs12 pages before inspecting the guest's APIC
> > > > page. Note that upstream does not have this issue, as they will directly
> > > > get the vmcs12 pages on vmlaunch/vmresume instead of relying on the
> > > > event request mechanism. However, the upstream approach is problematic,
> > > > as the vmcs12 pages will not be present if a live migration occurred
> > > > before checking the virtual APIC page.
> > > 
> > > Since this patch is intended for upstream, I don't fully understand
> > > the meaning of the above paragraph.
> > 
> > My apology. Some of the statement needs to be updated, which I should do
> > before sending. But I think the point here is that there is a missing
> > get_nested_state_pages() call here within vcpu_block() when there is the
> > request of KVM_REQ_GET_NESTED_STATE_PAGES.

This was my poorly written changelog, sorry about that Mingwei :)

> > > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > > > index 5366f884e9a7..1d3d8127aaea 100644
> > > > --- a/arch/x86/kvm/x86.c
> > > > +++ b/arch/x86/kvm/x86.c
> > > > @@ -10599,6 +10599,23 @@ static inline int vcpu_block(struct kvm_vcpu *vcpu)
> > > >  {
> > > >  	bool hv_timer;
> > > >  
> > > > +	/*
> > > > +	 * We must first get the vmcs12 pages before checking for interrupts
> > > > +	 * that might unblock the guest if L1 is using virtual-interrupt
> > > > +	 * delivery.
> > > > +	 */
> > > > +	if (kvm_check_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu)) {
> > > > +		/*
> > > > +		 * If we have to ask user-space to post-copy a page,
> > > > +		 * then we have to keep trying to get all of the
> > > > +		 * VMCS12 pages until we succeed.
> > > > +		 */
> > > > +		if (unlikely(!kvm_x86_ops.nested_ops->get_nested_state_pages(vcpu))) {
> > > > +			kvm_make_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu);
> > > > +			return 0;
> > > > +		}
> > > > +	}
> > > > +
> > > >  	if (!kvm_arch_vcpu_runnable(vcpu)) {
> > > >  		/*
> > > >  		 * Switch to the software timer before halt-polling/blocking as
> > > 
> > > 
> > > If I understand correctly, you are saying that if apic backing page is migrated in post copy
> > > then 'get_nested_state_pages' will return false and thus fail?
> > 
> > What I mean is that when the vCPU was halted and then migrated in this
> > case, KVM did not call get_nested_state_pages() before getting into
> > kvm_arch_vcpu_runnable(). This function checks the apic backing page and
> > fails on that check and triggered the warning.
> > > 
> > > AFAIK both SVM and VMX versions of 'get_nested_state_pages' assume that this is not the case
> > > for many things like MSR bitmaps and such - they always uses non atomic versions
> > > of guest memory access like 'kvm_vcpu_read_guest' and 'kvm_vcpu_map' which
> > > supposed to block if they attempt to access HVA which is not present, and then
> > > userfaultd should take over and wake them up.
> > 
> > You are right here.
> > > 
> > > If that still fails, nested VM entry is usually failed, and/or the whole VM
> > > is crashed with 'KVM_EXIT_INTERNAL_ERROR'.
> > > 
> 
> Ah, I think I understand what you are saying. hmm, so basically the
> patch here is to continuously request vmcs12 pages if failed. But what
> you are saying is that we just need to call 'get_nested_state_pages'
> once. If it fails, then the VM fails to work. Let me double check and
> get back.

IIRC the reason we reset the request on a nonzero return was because our
local implementation of the VMX hook was non-blocking and would bail on
the first page that needed to be demanded from the source. So, you
effectively keep hitting the request until all the pages are pulled in.

--
Thanks,
Oliver



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 2/5] selftests: KVM/x86: Fix vcpu_{save,load}_state() by adding APIC state into kvm_x86_state
  2022-08-02 23:07 ` [PATCH 2/5] selftests: KVM/x86: Fix vcpu_{save,load}_state() by adding APIC state into kvm_x86_state Mingwei Zhang
@ 2022-08-03 18:44   ` Oliver Upton
  2022-08-03 19:21     ` Sean Christopherson
  0 siblings, 1 reply; 28+ messages in thread
From: Oliver Upton @ 2022-08-03 18:44 UTC (permalink / raw)
  To: Mingwei Zhang
  Cc: Paolo Bonzini, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, kvm, linux-kernel

Hi Mingwei,

On Tue, Aug 02, 2022 at 11:07:15PM +0000, Mingwei Zhang wrote:
> Fix vcpu_{save,load}_state() by adding APIC state into kvm_x86_state and
> properly save/restore it in vcpu_{save,load}_state(). When vcpu resets,
> APIC state become software disabled in kernel and thus the corresponding
> vCPU is not able to receive posted interrupts [1].  So, add APIC
> save/restore in userspace in selftest library code.

Of course, there are no hard rules around it but IMO a changelog is
easier to grok if it first describes the what/why of the problem, then
afterwards how it is fixed by the commit.

> [1] commit 97222cc83163 ("KVM: Emulate local APIC in kernel").

What is the reason for the citation here?

> Cc: Jim Mattson <jmattson@google.com>
> 

nit: no newline between footers.

> Signed-off-by: Mingwei Zhang <mizhang@google.com>

--
Thanks,
Oliver

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 2/5] selftests: KVM/x86: Fix vcpu_{save,load}_state() by adding APIC state into kvm_x86_state
  2022-08-03 18:44   ` Oliver Upton
@ 2022-08-03 19:21     ` Sean Christopherson
  2022-08-03 23:55       ` Oliver Upton
  0 siblings, 1 reply; 28+ messages in thread
From: Sean Christopherson @ 2022-08-03 19:21 UTC (permalink / raw)
  To: Oliver Upton
  Cc: Mingwei Zhang, Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, kvm, linux-kernel

KVM: selftests: for the shortlog.

On Wed, Aug 03, 2022, Oliver Upton wrote:
> Hi Mingwei,
> 
> On Tue, Aug 02, 2022 at 11:07:15PM +0000, Mingwei Zhang wrote:
> > Fix vcpu_{save,load}_state() by adding APIC state into kvm_x86_state and
> > properly save/restore it in vcpu_{save,load}_state(). When vcpu resets,
> > APIC state become software disabled in kernel and thus the corresponding
> > vCPU is not able to receive posted interrupts [1].  So, add APIC
> > save/restore in userspace in selftest library code.
> 
> Of course, there are no hard rules around it but IMO a changelog is
> easier to grok if it first describes the what/why of the problem, then
> afterwards how it is fixed by the commit.

I strongly disagree.  :-)  To some extent, it's a personal preference, e.g. I
find it easier to understand the details (why something is a problem) if I have
the extra context of how a problem is fixed (or: what code was broken).

But beyond personal preference, there are less subjective reasons for stating
what a patch does before diving into details.  First and foremost, what code is
actually being changed is the most important information, and so that information
should be easy to find.  Changelogs that bury the "what's actually changing" in a
one-liner after 3+ paragraphs of background make it very hard to find that information.

Maybe for initial review one could argue that "what's the bug" is more important,
but for skimming logs and git archeology, the gory details matter less and less.
E.g. when doing a series of "git blame", the details of each change along the way
are useless, the details only matter for the culprit; I just want to quickly
determine whether or not a commit might be of interest.

Another argument for stating "what's changing" first is that it's almost always
possible to state "what's changing" in a single sentence.  Conversely, all but the
most simple bugs require multiple sentences or paragraphs to fully describe the
problem.  If both the "what's changing" and "what's the bug" are super short then
the order doesn't matter.  But if one is shorter (almost always the "what's changing),
then covering the shorter one first is advantageous because it's less of an
inconvenience for readers/reviewers that have a strict ordering preference.  E.g.
having to skip one sentence to get to the stuff you care about is less painful than
me having to skip three paragraphs to get to the stuff that I care about.

I think the underlying problem with this changelog (and the shortlog) is that it's
too literal about what is being fixed.  Shortlogs and changelogs shouldn't be
play-by-play descriptions of the code changes, they should be abstractions of the
problem and the fix.  E.g. 

  KVM: selftests: Save/restore vAPIC state in "migration" tests
  
  Save/restore vAPIC state as part of vCPU save/load so that it's preserved
  across VM "migration".  This will allow testing that posted interrupts
  are properly handled across VM migration.

With that, the first sentence covers both the "what's changing" and provides a
high-level description of the "bug" it's fixing.  And the second sentence covers
(a) "why do we want this patch", (b) "why wasn't this a problem before", and (c)
"what's the urgency of this patch".

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 1/5] KVM: x86: Get vmcs12 pages before checking pending interrupts
  2022-08-03 17:51     ` Mingwei Zhang
@ 2022-08-03 19:34       ` Maxim Levitsky
  2022-08-25  0:11         ` Sean Christopherson
  0 siblings, 1 reply; 28+ messages in thread
From: Maxim Levitsky @ 2022-08-03 19:34 UTC (permalink / raw)
  To: Mingwei Zhang, Paolo Bonzini
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, LKML, Oliver Upton

On Wed, 2022-08-03 at 10:51 -0700, Mingwei Zhang wrote:
> On Wed, Aug 3, 2022 at 10:18 AM Paolo Bonzini <pbonzini@redhat.com> wrote:
> > On 8/3/22 01:07, Mingwei Zhang wrote:
> > > +     /*
> > > +      * We must first get the vmcs12 pages before checking for interrupts
> > > +      * that might unblock the guest if L1 is using virtual-interrupt
> > > +      * delivery.
> > > +      */
> > > +     if (kvm_check_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu)) {
> > > +             /*
> > > +              * If we have to ask user-space to post-copy a page,
> > > +              * then we have to keep trying to get all of the
> > > +              * VMCS12 pages until we succeed.
> > > +              */
> > > +             if (unlikely(!kvm_x86_ops.nested_ops->get_nested_state_pages(vcpu))) {
> > > +                     kvm_make_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu);
> > > +                     return 0;
> > > +             }
> > > +     }
> > > +
> > 
> > I think request handling (except for KVM_REQ_EVENT) could be more
> > generically moved from vcpu_enter_guest() to vcpu_run().
> 
> Yeah, sounds good to me. I can come up with an updated version. At
> least, I will remove the repeat request here.


Now it all makes sense. I do think that KVM_REQ_GET_NESTED_STATE_PAGES processing
when the vCPU is halted is indeed missing.

This reminds me that I would be *very* happy to remove the KVM_REQ_GET_NESTED_STATE_PAGES,
if by any chance there is an agreement to do so upstream.
This is yet another reason to do so to be honest.
Just my 0.2 cents of course.


Best regards,
	Maxim Levitsky



PS:
This also reminded me of an related issue:
https://lore.kernel.org/lkml/20220427173758.517087-1-pbonzini@redhat.com/T/
Any update on it?

> 
> > Paolo
> > 



^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 2/5] selftests: KVM/x86: Fix vcpu_{save,load}_state() by adding APIC state into kvm_x86_state
  2022-08-03 19:21     ` Sean Christopherson
@ 2022-08-03 23:55       ` Oliver Upton
  0 siblings, 0 replies; 28+ messages in thread
From: Oliver Upton @ 2022-08-03 23:55 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Mingwei Zhang, Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, kvm, linux-kernel

On Wed, Aug 03, 2022 at 07:21:43PM +0000, Sean Christopherson wrote:
> KVM: selftests: for the shortlog.
> 
> On Wed, Aug 03, 2022, Oliver Upton wrote:
> > Hi Mingwei,
> > 
> > On Tue, Aug 02, 2022 at 11:07:15PM +0000, Mingwei Zhang wrote:
> > > Fix vcpu_{save,load}_state() by adding APIC state into kvm_x86_state and
> > > properly save/restore it in vcpu_{save,load}_state(). When vcpu resets,
> > > APIC state become software disabled in kernel and thus the corresponding
> > > vCPU is not able to receive posted interrupts [1].  So, add APIC
> > > save/restore in userspace in selftest library code.
> > 
> > Of course, there are no hard rules around it but IMO a changelog is
> > easier to grok if it first describes the what/why of the problem, then
> > afterwards how it is fixed by the commit.
> 
> I strongly disagree.  :-)  To some extent, it's a personal preference, e.g. I
> find it easier to understand the details (why something is a problem) if I have
> the extra context of how a problem is fixed (or: what code was broken).
> 

Sorry, what I wrote definitely was asking for strict ordering. Thank you
for rightly calling that out.

My actual issue if I had been bothered to articulate it well was that $WHAT
was effectively restated in different terms which can be confusing.
Where possible, atomically addressing what, why and how can lead to a
crisper changelog.

[...]

>   KVM: selftests: Save/restore vAPIC state in "migration" tests
>   
>   Save/restore vAPIC state as part of vCPU save/load so that it's preserved
>   across VM "migration".  This will allow testing that posted interrupts
>   are properly handled across VM migration.
> 
> With that, the first sentence covers both the "what's changing" and provides a
> high-level description of the "bug" it's fixing.  And the second sentence covers
> (a) "why do we want this patch", (b) "why wasn't this a problem before", and (c)
> "what's the urgency of this patch".

LGTM.

--
Thanks,
Oliver

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 1/5] KVM: x86: Get vmcs12 pages before checking pending interrupts
  2022-08-03 19:34       ` Maxim Levitsky
@ 2022-08-25  0:11         ` Sean Christopherson
  2022-08-25  2:51           ` Jim Mattson
  0 siblings, 1 reply; 28+ messages in thread
From: Sean Christopherson @ 2022-08-25  0:11 UTC (permalink / raw)
  To: Maxim Levitsky
  Cc: Mingwei Zhang, Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, kvm, LKML, Oliver Upton

On Wed, Aug 03, 2022, Maxim Levitsky wrote:
> On Wed, 2022-08-03 at 10:51 -0700, Mingwei Zhang wrote:
> > On Wed, Aug 3, 2022 at 10:18 AM Paolo Bonzini <pbonzini@redhat.com> wrote:
> > > On 8/3/22 01:07, Mingwei Zhang wrote:
> > > > +     /*
> > > > +      * We must first get the vmcs12 pages before checking for interrupts
> > > > +      * that might unblock the guest if L1 is using virtual-interrupt
> > > > +      * delivery.
> > > > +      */
> > > > +     if (kvm_check_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu)) {
> > > > +             /*
> > > > +              * If we have to ask user-space to post-copy a page,
> > > > +              * then we have to keep trying to get all of the
> > > > +              * VMCS12 pages until we succeed.
> > > > +              */
> > > > +             if (unlikely(!kvm_x86_ops.nested_ops->get_nested_state_pages(vcpu))) {
> > > > +                     kvm_make_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu);
> > > > +                     return 0;
> > > > +             }
> > > > +     }
> > > > +
> > > 
> > > I think request handling (except for KVM_REQ_EVENT) could be more
> > > generically moved from vcpu_enter_guest() to vcpu_run().
> > 
> > Yeah, sounds good to me. I can come up with an updated version. At
> > least, I will remove the repeat request here.
> 
> Now it all makes sense. I do think that KVM_REQ_GET_NESTED_STATE_PAGES processing
> when the vCPU is halted is indeed missing.
> 
> This reminds me that I would be *very* happy to remove the KVM_REQ_GET_NESTED_STATE_PAGES,
> if by any chance there is an agreement to do so upstream.
> This is yet another reason to do so to be honest.
> Just my 0.2 cents of course.

+100

@google folks, what would it take for us to mark KVM_REQ_GET_NESTED_STATE_PAGES
as deprecated in upstream and stop accepting patches/fixes?  IIUC, when we eventually
move to userfaultfd, all this goes away, i.e. we do want to ditch this at some point.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 3/5] selftests: KVM: Introduce vcpu_run_interruptable()
  2022-08-02 23:07 ` [PATCH 3/5] selftests: KVM: Introduce vcpu_run_interruptable() Mingwei Zhang
@ 2022-08-25  0:15   ` Sean Christopherson
  2022-08-28 19:21     ` Mingwei Zhang
  0 siblings, 1 reply; 28+ messages in thread
From: Sean Christopherson @ 2022-08-25  0:15 UTC (permalink / raw)
  To: Mingwei Zhang
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel, Oliver Upton

On Tue, Aug 02, 2022, Mingwei Zhang wrote:
> Introduce vcpu_run_interruptable() to allow selftests execute their own
> code when a vcpu is kicked out of KVM_RUN on receiving a POSIX signal.

But that's what __vcpu_run() is for.  Clearing "immediate_exit" after KVM_RUN does
not scream "interruptible" to me.

There's only one user after this series, just clear vcpu->run->immediate_exit
manually in that test (a comment on _why_ it's cleared would be helpful).

> +int vcpu_run_interruptable(struct kvm_vcpu *vcpu)
> +{
> +	int rc;
> +
> +	rc = __vcpu_run(vcpu);
> +
> +	vcpu->run->immediate_exit = 0;
> +
> +	return rc;
> +}
> +
>  int _vcpu_run(struct kvm_vcpu *vcpu)
>  {
>  	int rc;
> -- 
> 2.37.1.455.g008518b4e5-goog
> 

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 4/5] selftests: KVM: Add support for posted interrupt handling in L2
  2022-08-02 23:07 ` [PATCH 4/5] selftests: KVM: Add support for posted interrupt handling in L2 Mingwei Zhang
@ 2022-08-25  0:16   ` Sean Christopherson
  2022-08-28 19:29     ` Mingwei Zhang
  0 siblings, 1 reply; 28+ messages in thread
From: Sean Christopherson @ 2022-08-25  0:16 UTC (permalink / raw)
  To: Mingwei Zhang
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel, Oliver Upton

On Tue, Aug 02, 2022, Mingwei Zhang wrote:
> Add support for posted interrupt handling in L2. This is done by adding
> needed data structures in vmx_pages and APIs to allow an L2 receive posted
> interrupts
> 
> Cc: Jim Mattson <jmattson@google.com>
> Signed-off-by: Mingwei Zhang <mizhang@google.com>
> ---
>  tools/testing/selftests/kvm/include/x86_64/vmx.h | 10 ++++++++++
>  tools/testing/selftests/kvm/lib/x86_64/vmx.c     | 16 ++++++++++++++++
>  2 files changed, 26 insertions(+)
> 
> diff --git a/tools/testing/selftests/kvm/include/x86_64/vmx.h b/tools/testing/selftests/kvm/include/x86_64/vmx.h
> index 7d8c980317f7..3449ae8ab282 100644
> --- a/tools/testing/selftests/kvm/include/x86_64/vmx.h
> +++ b/tools/testing/selftests/kvm/include/x86_64/vmx.h
> @@ -579,6 +579,14 @@ struct vmx_pages {
>  	void *apic_access_hva;
>  	uint64_t apic_access_gpa;
>  	void *apic_access;
> +
> +	void *virtual_apic_hva;
> +	uint64_t virtual_apic_gpa;
> +	void *virtual_apic;
> +
> +	void *posted_intr_desc_hva;
> +	uint64_t posted_intr_desc_gpa;
> +	void *posted_intr_desc;
>  };
>  
>  union vmx_basic {
> @@ -622,5 +630,7 @@ void nested_identity_map_1g(struct vmx_pages *vmx, struct kvm_vm *vm,
>  void prepare_eptp(struct vmx_pages *vmx, struct kvm_vm *vm,
>  		  uint32_t eptp_memslot);
>  void prepare_virtualize_apic_accesses(struct vmx_pages *vmx, struct kvm_vm *vm);
> +void prepare_virtual_apic(struct vmx_pages *vmx, struct kvm_vm *vm);
> +void prepare_posted_intr_desc(struct vmx_pages *vmx, struct kvm_vm *vm);
>  
>  #endif /* SELFTEST_KVM_VMX_H */
> diff --git a/tools/testing/selftests/kvm/lib/x86_64/vmx.c b/tools/testing/selftests/kvm/lib/x86_64/vmx.c
> index 80a568c439b8..7d65bee37b9f 100644
> --- a/tools/testing/selftests/kvm/lib/x86_64/vmx.c
> +++ b/tools/testing/selftests/kvm/lib/x86_64/vmx.c
> @@ -556,3 +556,19 @@ void prepare_virtualize_apic_accesses(struct vmx_pages *vmx, struct kvm_vm *vm)
>  	vmx->apic_access_hva = addr_gva2hva(vm, (uintptr_t)vmx->apic_access);
>  	vmx->apic_access_gpa = addr_gva2gpa(vm, (uintptr_t)vmx->apic_access);
>  }
> +
> +void prepare_virtual_apic(struct vmx_pages *vmx, struct kvm_vm *vm)
> +{
> +	vmx->virtual_apic = (void *)vm_vaddr_alloc_page(vm);
> +	vmx->virtual_apic_hva = addr_gva2hva(vm, (uintptr_t)vmx->virtual_apic);
> +	vmx->virtual_apic_gpa = addr_gva2gpa(vm, (uintptr_t)vmx->virtual_apic);
> +}
> +
> +void prepare_posted_intr_desc(struct vmx_pages *vmx, struct kvm_vm *vm)
> +{
> +	vmx->posted_intr_desc = (void *)vm_vaddr_alloc_page(vm);
> +	vmx->posted_intr_desc_hva =
> +		addr_gva2hva(vm, (uintptr_t)vmx->posted_intr_desc);
> +	vmx->posted_intr_desc_gpa =
> +		addr_gva2gpa(vm, (uintptr_t)vmx->posted_intr_desc);

Let these poke out, or capture the field in a local var.  Actually, posted_intr_desc
is a void *, is casting even necessary?


> +}
> -- 
> 2.37.1.455.g008518b4e5-goog
> 

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 1/5] KVM: x86: Get vmcs12 pages before checking pending interrupts
  2022-08-25  0:11         ` Sean Christopherson
@ 2022-08-25  2:51           ` Jim Mattson
  2022-08-25 14:40             ` Sean Christopherson
  0 siblings, 1 reply; 28+ messages in thread
From: Jim Mattson @ 2022-08-25  2:51 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Maxim Levitsky, Mingwei Zhang, Paolo Bonzini, Vitaly Kuznetsov,
	Wanpeng Li, Joerg Roedel, kvm, LKML, Oliver Upton

On Wed, Aug 24, 2022 at 5:11 PM Sean Christopherson <seanjc@google.com> wrote:

> @google folks, what would it take for us to mark KVM_REQ_GET_NESTED_STATE_PAGES
> as deprecated in upstream and stop accepting patches/fixes?  IIUC, when we eventually
> move to userfaultfd, all this goes away, i.e. we do want to ditch this at some point.

Userfaultfd is a red herring. There were two reasons that we needed
this when nested live migration was implemented:
1) our netlink socket mechanism for funneling remote page requests to
a userspace listener was broken.
2) we were not necessarily prepared to deal with remote page requests
during VM setup.

(1) has long since been fixed. Though our preference is to exit from
KVM_RUN and get the vCPU thread to request the remote page itself, we
are now capable of queuing a remote page request with a separate
listener thread and blocking in the kernel until the page is received.
I believe that mechanism is functionally equivalent to userfaultfd,
though not as elegant.
I don't know about (2). I'm not sure when the listener thread is set
up, relative to all of the other setup steps. Eliminating
KVM_REQ_GET_NESTED_STATE_PAGES means that userspace must be prepared
to fetch a remote page by the first call to KVM_SET_NESTED_STATE. The
same is true when using userfaultfd.

These new ordering constraints represent a UAPI breakage, but we don't
seem to be as concerned about that as we once were. Maybe that's a
good thing. Can we get rid of all of the superseded ioctls, like
KVM_SET_CPUID, while we're at it?

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 1/5] KVM: x86: Get vmcs12 pages before checking pending interrupts
  2022-08-25  2:51           ` Jim Mattson
@ 2022-08-25 14:40             ` Sean Christopherson
  2022-08-25 17:16               ` Mingwei Zhang
  0 siblings, 1 reply; 28+ messages in thread
From: Sean Christopherson @ 2022-08-25 14:40 UTC (permalink / raw)
  To: Jim Mattson
  Cc: Maxim Levitsky, Mingwei Zhang, Paolo Bonzini, Vitaly Kuznetsov,
	Wanpeng Li, Joerg Roedel, kvm, LKML, Oliver Upton

On Wed, Aug 24, 2022, Jim Mattson wrote:
> On Wed, Aug 24, 2022 at 5:11 PM Sean Christopherson <seanjc@google.com> wrote:
> 
> > @google folks, what would it take for us to mark KVM_REQ_GET_NESTED_STATE_PAGES
> > as deprecated in upstream and stop accepting patches/fixes?  IIUC, when we eventually
> > move to userfaultfd, all this goes away, i.e. we do want to ditch this at some point.
> 
> Userfaultfd is a red herring. There were two reasons that we needed
> this when nested live migration was implemented:
> 1) our netlink socket mechanism for funneling remote page requests to
> a userspace listener was broken.
> 2) we were not necessarily prepared to deal with remote page requests
> during VM setup.
> 
> (1) has long since been fixed. Though our preference is to exit from
> KVM_RUN and get the vCPU thread to request the remote page itself, we
> are now capable of queuing a remote page request with a separate
> listener thread and blocking in the kernel until the page is received.
> I believe that mechanism is functionally equivalent to userfaultfd,
> though not as elegant.
> I don't know about (2). I'm not sure when the listener thread is set
> up, relative to all of the other setup steps. Eliminating
> KVM_REQ_GET_NESTED_STATE_PAGES means that userspace must be prepared
> to fetch a remote page by the first call to KVM_SET_NESTED_STATE. The
> same is true when using userfaultfd.
> 
> These new ordering constraints represent a UAPI breakage, but we don't
> seem to be as concerned about that as we once were. Maybe that's a
> good thing. Can we get rid of all of the superseded ioctls, like
> KVM_SET_CPUID, while we're at it?

I view KVM_REQ_GET_NESTED_STATE_PAGES as a special case.  We are likely the only
users, we can (eventually) wean ourselves off the feature, and we can carry
internal patches (which we are obviously already carrying) until we transition
away.  And unlike KVM_SET_CPUID and other ancient ioctls() that are largely
forgotten, this feature is likely to be a maintenance burden as long as it exists.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 1/5] KVM: x86: Get vmcs12 pages before checking pending interrupts
  2022-08-25 14:40             ` Sean Christopherson
@ 2022-08-25 17:16               ` Mingwei Zhang
  2022-08-25 17:53                 ` Sean Christopherson
  0 siblings, 1 reply; 28+ messages in thread
From: Mingwei Zhang @ 2022-08-25 17:16 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Jim Mattson, Maxim Levitsky, Paolo Bonzini, Vitaly Kuznetsov,
	Wanpeng Li, Joerg Roedel, kvm, LKML, Oliver Upton

On Thu, Aug 25, 2022 at 7:41 AM Sean Christopherson <seanjc@google.com> wrote:
>
> On Wed, Aug 24, 2022, Jim Mattson wrote:
> > On Wed, Aug 24, 2022 at 5:11 PM Sean Christopherson <seanjc@google.com> wrote:
> >
> > > @google folks, what would it take for us to mark KVM_REQ_GET_NESTED_STATE_PAGES
> > > as deprecated in upstream and stop accepting patches/fixes?  IIUC, when we eventually
> > > move to userfaultfd, all this goes away, i.e. we do want to ditch this at some point.
> >
> > Userfaultfd is a red herring. There were two reasons that we needed
> > this when nested live migration was implemented:
> > 1) our netlink socket mechanism for funneling remote page requests to
> > a userspace listener was broken.
> > 2) we were not necessarily prepared to deal with remote page requests
> > during VM setup.
> >
> > (1) has long since been fixed. Though our preference is to exit from
> > KVM_RUN and get the vCPU thread to request the remote page itself, we
> > are now capable of queuing a remote page request with a separate
> > listener thread and blocking in the kernel until the page is received.
> > I believe that mechanism is functionally equivalent to userfaultfd,
> > though not as elegant.
> > I don't know about (2). I'm not sure when the listener thread is set
> > up, relative to all of the other setup steps. Eliminating
> > KVM_REQ_GET_NESTED_STATE_PAGES means that userspace must be prepared
> > to fetch a remote page by the first call to KVM_SET_NESTED_STATE. The
> > same is true when using userfaultfd.
> >
> > These new ordering constraints represent a UAPI breakage, but we don't
> > seem to be as concerned about that as we once were. Maybe that's a
> > good thing. Can we get rid of all of the superseded ioctls, like
> > KVM_SET_CPUID, while we're at it?
>
> I view KVM_REQ_GET_NESTED_STATE_PAGES as a special case.  We are likely the only
> users, we can (eventually) wean ourselves off the feature, and we can carry
> internal patches (which we are obviously already carrying) until we transition
> away.  And unlike KVM_SET_CPUID and other ancient ioctls() that are largely
> forgotten, this feature is likely to be a maintenance burden as long as it exists.

KVM_REQ_GET_NESTED_STATE_PAGES has been uniformly used in
KVM_SET_NESTED_STATE ioctl in VMX (including eVMCS) and SVM, it is
basically a two-step setting up of a nested state mechanism.

We can change that, but this may have side effects and I think this
usage case has nothing to do with demand paging.

I noticed that nested_vmx_enter_non_root_mode() is called in
KVM_SET_NESTED_STATE in VMX, while in SVM implementation, it is simply
just a kvm_make_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu);

hmm... so is the nested_vmx_enter_non_root_mode() call in vmx
KVM_SET_NESTED_STATE ioctl() still necessary? I am thinking that
because the same function is called again in nested_vmx_run().

Thanks.
-Mingwei

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 1/5] KVM: x86: Get vmcs12 pages before checking pending interrupts
  2022-08-25 17:16               ` Mingwei Zhang
@ 2022-08-25 17:53                 ` Sean Christopherson
  2022-08-25 20:35                   ` Mingwei Zhang
  0 siblings, 1 reply; 28+ messages in thread
From: Sean Christopherson @ 2022-08-25 17:53 UTC (permalink / raw)
  To: Mingwei Zhang
  Cc: Jim Mattson, Maxim Levitsky, Paolo Bonzini, Vitaly Kuznetsov,
	Wanpeng Li, Joerg Roedel, kvm, LKML, Oliver Upton

On Thu, Aug 25, 2022, Mingwei Zhang wrote:
> On Thu, Aug 25, 2022 at 7:41 AM Sean Christopherson <seanjc@google.com> wrote:
> >
> > On Wed, Aug 24, 2022, Jim Mattson wrote:
> > > On Wed, Aug 24, 2022 at 5:11 PM Sean Christopherson <seanjc@google.com> wrote:
> > >
> > > > @google folks, what would it take for us to mark KVM_REQ_GET_NESTED_STATE_PAGES
> > > > as deprecated in upstream and stop accepting patches/fixes?  IIUC, when we eventually
> > > > move to userfaultfd, all this goes away, i.e. we do want to ditch this at some point.
> > >
> > > Userfaultfd is a red herring. There were two reasons that we needed
> > > this when nested live migration was implemented:
> > > 1) our netlink socket mechanism for funneling remote page requests to
> > > a userspace listener was broken.
> > > 2) we were not necessarily prepared to deal with remote page requests
> > > during VM setup.
> > >
> > > (1) has long since been fixed. Though our preference is to exit from
> > > KVM_RUN and get the vCPU thread to request the remote page itself, we
> > > are now capable of queuing a remote page request with a separate
> > > listener thread and blocking in the kernel until the page is received.
> > > I believe that mechanism is functionally equivalent to userfaultfd,
> > > though not as elegant.
> > > I don't know about (2). I'm not sure when the listener thread is set
> > > up, relative to all of the other setup steps. Eliminating
> > > KVM_REQ_GET_NESTED_STATE_PAGES means that userspace must be prepared
> > > to fetch a remote page by the first call to KVM_SET_NESTED_STATE. The
> > > same is true when using userfaultfd.
> > >
> > > These new ordering constraints represent a UAPI breakage, but we don't
> > > seem to be as concerned about that as we once were. Maybe that's a
> > > good thing. Can we get rid of all of the superseded ioctls, like
> > > KVM_SET_CPUID, while we're at it?
> >
> > I view KVM_REQ_GET_NESTED_STATE_PAGES as a special case.  We are likely the only
> > users, we can (eventually) wean ourselves off the feature, and we can carry
> > internal patches (which we are obviously already carrying) until we transition
> > away.  And unlike KVM_SET_CPUID and other ancient ioctls() that are largely
> > forgotten, this feature is likely to be a maintenance burden as long as it exists.
> 
> KVM_REQ_GET_NESTED_STATE_PAGES has been uniformly used in
> KVM_SET_NESTED_STATE ioctl in VMX (including eVMCS) and SVM, it is
> basically a two-step setting up of a nested state mechanism.
> 
> We can change that, but this may have side effects and I think this
> usage case has nothing to do with demand paging.

There are two uses of KVM_REQ_GET_NESTED_STATE_PAGES:

  1. Defer loads when leaving SMM.

  2: Defer loads for KVM_SET_NESTED_STATE.

#1 is fully solvable without a request, e.g. split ->leave_smm() into two helpers,
one that restores whatever metadata is needed before restoring from SMRAM, and
a second to load guest virtualization state that _after_ restoring all other guest
state from SMRAM.

#2 is done because of the reasons Jim listed above, which are specific to demand
paging (including userfaultfd).  There might be some interactions with other
ioctls() (KVM_SET_SREGS?) that are papered over by the request, but that can be
solved without a full request since only the first KVM_RUN after KVM_SET_NESTED_STATE
needs to refresh things (though ideally we'd avoid that).

In other words, if the demand paging use case goes away, then KVM can get rid of
KVM_REQ_GET_NESTED_STATE_PAGES.  

> KVM_SET_NESTED_STATE in VMX, while in SVM implementation, it is simply
> just a kvm_make_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu);

svm_set_nested_state() very rougly open codes enter_svm_guest_mode().  VMX could
do the same, but that may or may not be a net positive.

> hmm... so is the nested_vmx_enter_non_root_mode() call in vmx
> KVM_SET_NESTED_STATE ioctl() still necessary? I am thinking that
> because the same function is called again in nested_vmx_run().

nested_vmx_run() is used only to emulate VMLAUNCH/VMRESUME and wont' be invoked
if the vCPU is already running L2 at the time of migration.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 1/5] KVM: x86: Get vmcs12 pages before checking pending interrupts
  2022-08-25 17:53                 ` Sean Christopherson
@ 2022-08-25 20:35                   ` Mingwei Zhang
  2022-08-25 20:37                     ` Mingwei Zhang
  2022-08-25 23:21                     ` Jim Mattson
  0 siblings, 2 replies; 28+ messages in thread
From: Mingwei Zhang @ 2022-08-25 20:35 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Jim Mattson, Maxim Levitsky, Paolo Bonzini, Vitaly Kuznetsov,
	Wanpeng Li, Joerg Roedel, kvm, LKML, Oliver Upton

> There are two uses of KVM_REQ_GET_NESTED_STATE_PAGES:
>
>   1. Defer loads when leaving SMM.
>
>   2: Defer loads for KVM_SET_NESTED_STATE.
>
> #1 is fully solvable without a request, e.g. split ->leave_smm() into two helpers,
> one that restores whatever metadata is needed before restoring from SMRAM, and
> a second to load guest virtualization state that _after_ restoring all other guest
> state from SMRAM.
>
> #2 is done because of the reasons Jim listed above, which are specific to demand
> paging (including userfaultfd).  There might be some interactions with other
> ioctls() (KVM_SET_SREGS?) that are papered over by the request, but that can be
> solved without a full request since only the first KVM_RUN after KVM_SET_NESTED_STATE
> needs to refresh things (though ideally we'd avoid that).

Ack on the fact that the 2-step process is specific to demand paging.

Currently, KVM_SET_NESTED_STATE is a two-step process in which the 1st
step does not require vmcs12 to be ready. So, I am thinking about what
it means to deprecate KVM_REQ_GET_NESTED_STATE_PAGES?

For case #2, I think there might be two options if we deprecate it:

 - Ensuring vmcs12 ready during the call to
ioctl(KVM_SET_NESTED_STATE). This requires, as Jim mentioned, that the
thread who is listening to the remote page request ready to serve
before this call (this is true regardless of uffd based or Google base
demand paging). We definitely can solve this ordering problem, but
that is beyond KVM scope. It basically requires our userspace to
cooperate.

 - Ensuring vmcs12 ready before vmenter. This basically defers the
vmcs12 checks to the last second. I think this might be a better one.
However, isn't it the same as the original implementation, i.e.,
instead of using KVM_REQ_GET_NESTED_STATE_PAGES, we have to use some
other flags to tell KVM to load a vmcs12?

Thanks.
-Mingwei
>
> In other words, if the demand paging use case goes away, then KVM can get rid of
> KVM_REQ_GET_NESTED_STATE_PAGES.
>
> > KVM_SET_NESTED_STATE in VMX, while in SVM implementation, it is simply
> > just a kvm_make_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu);
>
> svm_set_nested_state() very rougly open codes enter_svm_guest_mode().  VMX could
> do the same, but that may or may not be a net positive.
>
> > hmm... so is the nested_vmx_enter_non_root_mode() call in vmx
> > KVM_SET_NESTED_STATE ioctl() still necessary? I am thinking that
> > because the same function is called again in nested_vmx_run().
>
> nested_vmx_run() is used only to emulate VMLAUNCH/VMRESUME and wont' be invoked
> if the vCPU is already running L2 at the time of migration.

Ack.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 1/5] KVM: x86: Get vmcs12 pages before checking pending interrupts
  2022-08-25 20:35                   ` Mingwei Zhang
@ 2022-08-25 20:37                     ` Mingwei Zhang
  2022-08-25 23:21                     ` Jim Mattson
  1 sibling, 0 replies; 28+ messages in thread
From: Mingwei Zhang @ 2022-08-25 20:37 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Jim Mattson, Maxim Levitsky, Paolo Bonzini, Vitaly Kuznetsov,
	Wanpeng Li, Joerg Roedel, kvm, LKML, Oliver Upton

> Currently, KVM_SET_NESTED_STATE is a two-step process in which the 1st
> step does not require vmcs12 to be ready. So, I am thinking about what
> it means to deprecate KVM_REQ_GET_NESTED_STATE_PAGES?

typo:

"KVM_SET_NESTED_STATE is a two-step process" -> "Launching a nested VM
is a two-step process".

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 1/5] KVM: x86: Get vmcs12 pages before checking pending interrupts
  2022-08-25 20:35                   ` Mingwei Zhang
  2022-08-25 20:37                     ` Mingwei Zhang
@ 2022-08-25 23:21                     ` Jim Mattson
  1 sibling, 0 replies; 28+ messages in thread
From: Jim Mattson @ 2022-08-25 23:21 UTC (permalink / raw)
  To: Mingwei Zhang
  Cc: Sean Christopherson, Maxim Levitsky, Paolo Bonzini,
	Vitaly Kuznetsov, Wanpeng Li, Joerg Roedel, kvm, LKML,
	Oliver Upton

On Thu, Aug 25, 2022 at 1:35 PM Mingwei Zhang <mizhang@google.com> wrote:
>
> > There are two uses of KVM_REQ_GET_NESTED_STATE_PAGES:
> >
> >   1. Defer loads when leaving SMM.
> >
> >   2: Defer loads for KVM_SET_NESTED_STATE.
> >
> > #1 is fully solvable without a request, e.g. split ->leave_smm() into two helpers,
> > one that restores whatever metadata is needed before restoring from SMRAM, and
> > a second to load guest virtualization state that _after_ restoring all other guest
> > state from SMRAM.
> >
> > #2 is done because of the reasons Jim listed above, which are specific to demand
> > paging (including userfaultfd).  There might be some interactions with other
> > ioctls() (KVM_SET_SREGS?) that are papered over by the request, but that can be
> > solved without a full request since only the first KVM_RUN after KVM_SET_NESTED_STATE
> > needs to refresh things (though ideally we'd avoid that).
>
> Ack on the fact that the 2-step process is specific to demand paging.
>
> Currently, KVM_SET_NESTED_STATE is a two-step process in which the 1st
> step does not require vmcs12 to be ready. So, I am thinking about what
> it means to deprecate KVM_REQ_GET_NESTED_STATE_PAGES?
>
> For case #2, I think there might be two options if we deprecate it:
>
>  - Ensuring vmcs12 ready during the call to
> ioctl(KVM_SET_NESTED_STATE). This requires, as Jim mentioned, that the
> thread who is listening to the remote page request ready to serve
> before this call (this is true regardless of uffd based or Google base
> demand paging). We definitely can solve this ordering problem, but
> that is beyond KVM scope. It basically requires our userspace to
> cooperate.

The vmcs12 isn't the problem, since its contents were loaded into L0
kernel memory at VMPTRLD. The problem is the structures hanging off of
the vmcs12, like the posted interrupt descriptor. The new constraints
need to be documented, and every user space VMM has to follow them
before we can eliminate KVM_REQ_GET_NESTED_STATE_PAGES.

>  - Ensuring vmcs12 ready before vmenter. This basically defers the
> vmcs12 checks to the last second. I think this might be a better one.
> However, isn't it the same as the original implementation, i.e.,
> instead of using KVM_REQ_GET_NESTED_STATE_PAGES, we have to use some
> other flags to tell KVM to load a vmcs12?

Again, the vmcs12 is not a problem, since its contents are already
cached in L0 kernel memory. Accesses to the structures hanging off of
the vmcs12 are already handled *during KVM_RUN* by existing demand
paging mechanisms.

> Thanks.
> -Mingwei
> >
> > In other words, if the demand paging use case goes away, then KVM can get rid of
> > KVM_REQ_GET_NESTED_STATE_PAGES.
> >
> > > KVM_SET_NESTED_STATE in VMX, while in SVM implementation, it is simply
> > > just a kvm_make_request(KVM_REQ_GET_NESTED_STATE_PAGES, vcpu);
> >
> > svm_set_nested_state() very rougly open codes enter_svm_guest_mode().  VMX could
> > do the same, but that may or may not be a net positive.
> >
> > > hmm... so is the nested_vmx_enter_non_root_mode() call in vmx
> > > KVM_SET_NESTED_STATE ioctl() still necessary? I am thinking that
> > > because the same function is called again in nested_vmx_run().
> >
> > nested_vmx_run() is used only to emulate VMLAUNCH/VMRESUME and wont' be invoked
> > if the vCPU is already running L2 at the time of migration.
>
> Ack.

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 3/5] selftests: KVM: Introduce vcpu_run_interruptable()
  2022-08-25  0:15   ` Sean Christopherson
@ 2022-08-28 19:21     ` Mingwei Zhang
  0 siblings, 0 replies; 28+ messages in thread
From: Mingwei Zhang @ 2022-08-28 19:21 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel, Oliver Upton

On Thu, Aug 25, 2022, Sean Christopherson wrote:
> On Tue, Aug 02, 2022, Mingwei Zhang wrote:
> > Introduce vcpu_run_interruptable() to allow selftests execute their own
> > code when a vcpu is kicked out of KVM_RUN on receiving a POSIX signal.
> 
> But that's what __vcpu_run() is for.  Clearing "immediate_exit" after KVM_RUN does
> not scream "interruptible" to me.
> 
> There's only one user after this series, just clear vcpu->run->immediate_exit
> manually in that test (a comment on _why_ it's cleared would be helpful).
> 
hmm. good point. __vcpu_run() I thought it was internal function private
to kvm_util.c, but now after your selftest refactoring, this is useable.
Will use __vcpu_run() instead.

> > +int vcpu_run_interruptable(struct kvm_vcpu *vcpu)
> > +{
> > +	int rc;
> > +
> > +	rc = __vcpu_run(vcpu);
> > +
> > +	vcpu->run->immediate_exit = 0;
> > +
> > +	return rc;
> > +}
> > +
> >  int _vcpu_run(struct kvm_vcpu *vcpu)
> >  {
> >  	int rc;
> > -- 
> > 2.37.1.455.g008518b4e5-goog
> > 

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [PATCH 4/5] selftests: KVM: Add support for posted interrupt handling in L2
  2022-08-25  0:16   ` Sean Christopherson
@ 2022-08-28 19:29     ` Mingwei Zhang
  0 siblings, 0 replies; 28+ messages in thread
From: Mingwei Zhang @ 2022-08-28 19:29 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel, Oliver Upton

On Thu, Aug 25, 2022, Sean Christopherson wrote:
> > +
> > +void prepare_virtual_apic(struct vmx_pages *vmx, struct kvm_vm *vm)
> > +{
> > +	vmx->virtual_apic = (void *)vm_vaddr_alloc_page(vm);
> > +	vmx->virtual_apic_hva = addr_gva2hva(vm, (uintptr_t)vmx->virtual_apic);
> > +	vmx->virtual_apic_gpa = addr_gva2gpa(vm, (uintptr_t)vmx->virtual_apic);
> > +}
> > +
> > +void prepare_posted_intr_desc(struct vmx_pages *vmx, struct kvm_vm *vm)
> > +{
> > +	vmx->posted_intr_desc = (void *)vm_vaddr_alloc_page(vm);
> > +	vmx->posted_intr_desc_hva =
> > +		addr_gva2hva(vm, (uintptr_t)vmx->posted_intr_desc);
> > +	vmx->posted_intr_desc_gpa =
> > +		addr_gva2gpa(vm, (uintptr_t)vmx->posted_intr_desc);
> 
> Let these poke out, or capture the field in a local var.  Actually, posted_intr_desc
> is a void *, is casting even necessary?
> 
Will do. Yeah, casting is necessary here because add_ga2gpa takes
'vm_vaddr_t' which is the 'long unsigned int', so we have to cast it.
> 
> > +}
> > -- 
> > 2.37.1.455.g008518b4e5-goog
> > 

^ permalink raw reply	[flat|nested] 28+ messages in thread

end of thread, other threads:[~2022-08-28 19:30 UTC | newest]

Thread overview: 28+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-08-02 23:07 [PATCH 0/5] Fix a race between posted interrupt delivery and migration in a nested VM Mingwei Zhang
2022-08-02 23:07 ` [PATCH 1/5] KVM: x86: Get vmcs12 pages before checking pending interrupts Mingwei Zhang
2022-08-03 10:08   ` Maxim Levitsky
2022-08-03 16:45     ` Mingwei Zhang
2022-08-03 17:00       ` Mingwei Zhang
2022-08-03 18:36         ` Oliver Upton
2022-08-03 17:18   ` Paolo Bonzini
2022-08-03 17:51     ` Mingwei Zhang
2022-08-03 19:34       ` Maxim Levitsky
2022-08-25  0:11         ` Sean Christopherson
2022-08-25  2:51           ` Jim Mattson
2022-08-25 14:40             ` Sean Christopherson
2022-08-25 17:16               ` Mingwei Zhang
2022-08-25 17:53                 ` Sean Christopherson
2022-08-25 20:35                   ` Mingwei Zhang
2022-08-25 20:37                     ` Mingwei Zhang
2022-08-25 23:21                     ` Jim Mattson
2022-08-02 23:07 ` [PATCH 2/5] selftests: KVM/x86: Fix vcpu_{save,load}_state() by adding APIC state into kvm_x86_state Mingwei Zhang
2022-08-03 18:44   ` Oliver Upton
2022-08-03 19:21     ` Sean Christopherson
2022-08-03 23:55       ` Oliver Upton
2022-08-02 23:07 ` [PATCH 3/5] selftests: KVM: Introduce vcpu_run_interruptable() Mingwei Zhang
2022-08-25  0:15   ` Sean Christopherson
2022-08-28 19:21     ` Mingwei Zhang
2022-08-02 23:07 ` [PATCH 4/5] selftests: KVM: Add support for posted interrupt handling in L2 Mingwei Zhang
2022-08-25  0:16   ` Sean Christopherson
2022-08-28 19:29     ` Mingwei Zhang
2022-08-02 23:07 ` [PATCH 5/5] selftests: KVM: Test if posted interrupt delivery race with migration Mingwei Zhang

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.