* [PATCH v2 00/21] KVM: x86: Event/exception fixes and cleanups
From: Sean Christopherson @ 2022-06-14 20:47 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel, Oliver Upton, Peter Shier

The main goal of this series is to fix KVM's longstanding bug of not
honoring L1's exception intercepts when handling an exception that
occurs during delivery of a different exception.  E.g. if L0 and L1 are
using shadow paging, and L2 hits a #PF, and then hits another #PF while
vectoring the first #PF due to _L1_ not having a shadow page for the IDT,
KVM needs to check L1's intercepts before morphing the #PF => #PF => #DF
so that the #PF is routed to L1, not injected into L2 as a #DF.
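
As a rough illustration (a standalone sketch, not KVM's actual code; all
names below are invented), the fix boils down to consulting L1's
intercepts before the two faults are merged into a #DF:

	#include <stdio.h>

	#define PF_VECTOR	14
	#define DF_VECTOR	8

	/* Stand-in for vmcs12's exception bitmap. */
	struct l1_intercepts {
		unsigned int exception_bitmap;
	};

	/* Buggy ordering: morph #PF => #DF first, consult L1 too late. */
	static int route_buggy(const struct l1_intercepts *l1, int prev, int cur)
	{
		if (prev == PF_VECTOR && cur == PF_VECTOR)
			cur = DF_VECTOR;
		return (l1->exception_bitmap & (1u << cur)) ? -1 : cur;
	}

	/* Fixed ordering: route the second #PF to L1 before morphing. */
	static int route_fixed(const struct l1_intercepts *l1, int prev, int cur)
	{
		if (l1->exception_bitmap & (1u << cur))
			return -1;		/* VM-Exit(#PF) to L1 */
		if (prev == PF_VECTOR && cur == PF_VECTOR)
			return DF_VECTOR;	/* inject #DF into L2 */
		return cur;
	}

	int main(void)
	{
		struct l1_intercepts l1 = { .exception_bitmap = 1u << PF_VECTOR };

		/* buggy: 8 (#DF goes to L2); fixed: -1 (#PF exits to L1). */
		printf("buggy=%d fixed=%d\n",
		       route_buggy(&l1, PF_VECTOR, PF_VECTOR),
		       route_fixed(&l1, PF_VECTOR, PF_VECTOR));
		return 0;
	}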

nVMX has hacked around the bug for years by overriding the #PF injector
for shadow paging to go straight to VM-Exit, and nSVM has started doing
the same.  The hacks mostly work, but they're incomplete, confusing, and
lead to other hacky code, e.g. bailing from the emulator because #PF
injection forced a VM-Exit and suddenly KVM is back in L1.

Everything leading up to that is related fixes and cleanups I encountered
along the way; some through code inspection, some through tests.

v2:
  - Rebased to kvm/queue (commit 8baacf67c76c) + selftests CPUID
    overhaul.
    https://lore.kernel.org/all/20220614200707.3315957-1-seanjc@google.com
  - Treat KVM_REQ_TRIPLE_FAULT as a pending exception.

v1: https://lore.kernel.org/all/20220311032801.3467418-1-seanjc@google.com

Sean Christopherson (21):
  KVM: nVMX: Unconditionally purge queued/injected events on nested
    "exit"
  KVM: VMX: Drop bits 31:16 when shoving exception error code into VMCS
  KVM: x86: Don't check for code breakpoints when emulating on exception
  KVM: nVMX: Treat General Detect #DB (DR7.GD=1) as fault-like
  KVM: nVMX: Prioritize TSS T-flag #DBs over Monitor Trap Flag
  KVM: x86: Treat #DBs from the emulator as fault-like (code and
    DR7.GD=1)
  KVM: x86: Use DR7_GD macro instead of open coding check in emulator
  KVM: nVMX: Ignore SIPI that arrives in L2 when vCPU is not in WFS
  KVM: nVMX: Unconditionally clear mtf_pending on nested VM-Exit
  KVM: VMX: Inject #PF on ENCLS as "emulated" #PF
  KVM: x86: Rename kvm_x86_ops.queue_exception to inject_exception
  KVM: x86: Make kvm_queued_exception a properly named, visible struct
  KVM: x86: Formalize blocking of nested pending exceptions
  KVM: x86: Use kvm_queue_exception_e() to queue #DF
  KVM: x86: Hoist nested event checks above event injection logic
  KVM: x86: Evaluate ability to inject SMI/NMI/IRQ after potential
    VM-Exit
  KVM: x86: Morph pending exceptions to pending VM-Exits at queue time
  KVM: x86: Treat pending TRIPLE_FAULT requests as pending exceptions
  KVM: VMX: Update MTF and ICEBP comments to document KVM's subtle
    behavior
  KVM: selftests: Use uapi header to get VMX and SVM exit reasons/codes
  KVM: selftests: Add an x86-only test to verify nested exception
    queueing

 arch/x86/include/asm/kvm-x86-ops.h            |   2 +-
 arch/x86/include/asm/kvm_host.h               |  35 +-
 arch/x86/kvm/emulate.c                        |   3 +-
 arch/x86/kvm/svm/nested.c                     | 102 ++---
 arch/x86/kvm/svm/svm.c                        |  18 +-
 arch/x86/kvm/vmx/nested.c                     | 319 +++++++++-----
 arch/x86/kvm/vmx/sgx.c                        |   2 +-
 arch/x86/kvm/vmx/vmx.c                        |  53 ++-
 arch/x86/kvm/x86.c                            | 404 +++++++++++-------
 arch/x86/kvm/x86.h                            |  11 +-
 tools/testing/selftests/kvm/.gitignore        |   1 +
 tools/testing/selftests/kvm/Makefile          |   1 +
 .../selftests/kvm/include/x86_64/svm_util.h   |   7 +-
 .../selftests/kvm/include/x86_64/vmx.h        |  51 +--
 .../kvm/x86_64/nested_exceptions_test.c       | 295 +++++++++++++
 15 files changed, 886 insertions(+), 418 deletions(-)
 create mode 100644 tools/testing/selftests/kvm/x86_64/nested_exceptions_test.c


base-commit: 816967202161955f398ce379f9cbbedcb1eb03cb
-- 
2.36.1.476.g0c4daa206d-goog



* [PATCH v2 01/21] KVM: nVMX: Unconditionally purge queued/injected events on nested "exit"
From: Sean Christopherson @ 2022-06-14 20:47 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel, Oliver Upton, Peter Shier

Drop pending exceptions and events queued for re-injection when leaving
nested guest mode, even if the "exit" is due to VM-Fail, SMI, or forced
by host userspace.  Failure to purge events could result in an event
belonging to L2 being injected into L1.

This _should_ never happen for VM-Fail as all events should be blocked by
nested_run_pending, but it's possible if KVM, not the L1 hypervisor, is
the source of VM-Fail when running vmcs02.

SMI is a nop (barring unknown bugs) as recognition of SMI and thus entry
to SMM is blocked by pending exceptions and re-injected events.

Forced exit is definitely buggy, but has likely gone unnoticed because
userspace probably follows the forced exit with KVM_SET_VCPU_EVENTS (or
some other ioctl() that purges the queue).
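
For reference, a minimal sketch of the userspace pattern that likely
hides the bug (the flow and vcpu_fd are assumed, not taken from this
patch):

	#include <string.h>
	#include <sys/ioctl.h>
	#include <linux/kvm.h>

	static void reset_events_after_forced_exit(int vcpu_fd)
	{
		struct kvm_vcpu_events events;

		/* ... userspace has just forced the vCPU out of L2 ... */

		/*
		 * Setting zeroed events overwrites the exception queue,
		 * purging anything KVM incorrectly left behind.
		 */
		memset(&events, 0, sizeof(events));
		ioctl(vcpu_fd, KVM_SET_VCPU_EVENTS, &events);
	}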

Fixes: 4f350c6dbcb9 ("kvm: nVMX: Handle deferred early VMLAUNCH/VMRESUME failure properly")
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/vmx/nested.c | 19 +++++++++++--------
 1 file changed, 11 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index 7d8cd0ebcc75..ee6f27dffdba 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -4263,14 +4263,6 @@ static void prepare_vmcs12(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12,
 			nested_vmx_abort(vcpu,
 					 VMX_ABORT_SAVE_GUEST_MSR_FAIL);
 	}
-
-	/*
-	 * Drop what we picked up for L2 via vmx_complete_interrupts. It is
-	 * preserved above and would only end up incorrectly in L1.
-	 */
-	vcpu->arch.nmi_injected = false;
-	kvm_clear_exception_queue(vcpu);
-	kvm_clear_interrupt_queue(vcpu);
 }
 
 /*
@@ -4609,6 +4601,17 @@ void nested_vmx_vmexit(struct kvm_vcpu *vcpu, u32 vm_exit_reason,
 		WARN_ON_ONCE(nested_early_check);
 	}
 
+	/*
+	 * Drop events/exceptions that were queued for re-injection to L2
+	 * (picked up via vmx_complete_interrupts()), as well as exceptions
+	 * that were pending for L2.  Note, this must NOT be hoisted above
+	 * prepare_vmcs12(), as events/exceptions queued for re-injection
+	 * need to be captured in vmcs12 (see vmcs12_save_pending_event()).
+	 */
+	vcpu->arch.nmi_injected = false;
+	kvm_clear_exception_queue(vcpu);
+	kvm_clear_interrupt_queue(vcpu);
+
 	vmx_switch_vmcs(vcpu, &vmx->vmcs01);
 
 	/* Update any VMCS fields that might have changed while L2 ran */
-- 
2.36.1.476.g0c4daa206d-goog



* [PATCH v2 02/21] KVM: VMX: Drop bits 31:16 when shoving exception error code into VMCS
From: Sean Christopherson @ 2022-06-14 20:47 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel, Oliver Upton, Peter Shier

Deliberately truncate the exception error code when shoving it into the
VMCS (VM-Entry field for vmcs01 and vmcs02, VM-Exit field for vmcs12).
Intel CPUs are incapable of handling 32-bit error codes and will never
generate an error code with bits 31:16, but userspace can provide an
arbitrary error code via KVM_SET_VCPU_EVENTS.  Failure to drop the bits
on exception injection results in failed VM-Entry, as VMX disallows
setting bits 31:16.  Setting the bits on VM-Exit would at best confuse
L1, and at worst induce a nested VM-Entry failure, e.g. if L1 decided to
reinject the exception back into L2.
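
A trivial standalone illustration of the truncation (not KVM code):

	#include <stdint.h>
	#include <stdio.h>

	int main(void)
	{
		/* Bogus bits 31:16, e.g. shoved in via KVM_SET_VCPU_EVENTS. */
		uint32_t error_code = 0xdead0001;

		/* The (u16) cast keeps only bits 15:0, mimicking hardware. */
		printf("0x%x\n", (uint16_t)error_code);	/* prints 0x1 */
		return 0;
	}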

Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/vmx/nested.c |  9 ++++++++-
 arch/x86/kvm/vmx/vmx.c    | 11 ++++++++++-
 2 files changed, 18 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index ee6f27dffdba..33ffc8bcf9cd 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -3833,7 +3833,14 @@ static void nested_vmx_inject_exception_vmexit(struct kvm_vcpu *vcpu,
 	u32 intr_info = nr | INTR_INFO_VALID_MASK;
 
 	if (vcpu->arch.exception.has_error_code) {
-		vmcs12->vm_exit_intr_error_code = vcpu->arch.exception.error_code;
+		/*
+		 * Intel CPUs will never generate an error code with bits 31:16
+		 * set, and more importantly VMX disallows setting bits 31:16
+		 * in the injected error code for VM-Entry.  Drop the bits to
+		 * mimic hardware and avoid inducing failure on nested VM-Entry
+		 * if L1 chooses to inject the exception back to L2.
+		 */
+		vmcs12->vm_exit_intr_error_code = (u16)vcpu->arch.exception.error_code;
 		intr_info |= INTR_INFO_DELIVER_CODE_MASK;
 	}
 
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 5e14e4c40007..ec98992024e2 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -1621,7 +1621,16 @@ static void vmx_queue_exception(struct kvm_vcpu *vcpu)
 	kvm_deliver_exception_payload(vcpu);
 
 	if (has_error_code) {
-		vmcs_write32(VM_ENTRY_EXCEPTION_ERROR_CODE, error_code);
+		/*
+		 * Despite the error code being architecturally defined as 32
+		 * bits, and the VMCS field being 32 bits, Intel CPUs and thus
+		 * VMX don't actually support setting bits 31:16.  Hardware
+		 * will (should) never provide a bogus error code, but KVM's
+		 * ABI lets userspace shove in arbitrary 32-bit values.  Drop
+		 * the upper bits to avoid VM-Fail; losing information that
+		 * doesn't really exist is preferable to killing the VM.
+		 */
+		vmcs_write32(VM_ENTRY_EXCEPTION_ERROR_CODE, (u16)error_code);
 		intr_info |= INTR_INFO_DELIVER_CODE_MASK;
 	}
 
-- 
2.36.1.476.g0c4daa206d-goog



* [PATCH v2 03/21] KVM: x86: Don't check for code breakpoints when emulating on exception
From: Sean Christopherson @ 2022-06-14 20:47 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel, Oliver Upton, Peter Shier

Don't check for code breakpoints during instruction emulation if the
emulation was triggered by exception interception.  Code breakpoints are
the highest priority fault-like exception, and KVM only emulates on
exceptions that are fault-like.  Thus, if hardware signaled a different
exception, then the vCPU has already passed the stage of checking for
hardware breakpoints.

This is likely a glorified nop in terms of functionality; the change is
mostly for clarity, and is technically an optimization.  Intel's SDM
explicitly states that vmcs.GUEST_RFLAGS.RF on exception interception is
the same as the value that would have been saved on the stack had the
exception not been intercepted, i.e. will be '1' due to all fault-like
exceptions setting RF to '1'.  AMD says "guest state saved ... is the
processor state as of the moment the intercept triggers", but that begs
the question, "when does the intercept trigger?".

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/x86.c | 21 ++++++++++++++++++---
 1 file changed, 18 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 2318a99139fa..c5db31b4bd6f 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -8364,8 +8364,24 @@ int kvm_skip_emulated_instruction(struct kvm_vcpu *vcpu)
 }
 EXPORT_SYMBOL_GPL(kvm_skip_emulated_instruction);
 
-static bool kvm_vcpu_check_code_breakpoint(struct kvm_vcpu *vcpu, int *r)
+static bool kvm_vcpu_check_code_breakpoint(struct kvm_vcpu *vcpu,
+					   int emulation_type, int *r)
 {
+	WARN_ON_ONCE(emulation_type & EMULTYPE_NO_DECODE);
+
+	/*
+	 * Do not check for code breakpoints if hardware has already done the
+	 * checks, as inferred from the emulation type.  On NO_DECODE and SKIP,
+	 * the instruction has passed all exception checks, and all intercepted
+	 * exceptions that trigger emulation have lower priority than code
+	 * breakpoints, i.e. the fact that the intercepted exception occurred
+	 * means any code breakpoints have already been serviced.
+	 */
+	if (emulation_type & (EMULTYPE_NO_DECODE | EMULTYPE_SKIP |
+			      EMULTYPE_TRAP_UD | EMULTYPE_TRAP_UD_FORCED |
+			      EMULTYPE_VMWARE_GP | EMULTYPE_PF))
+		return false;
+
 	if (unlikely(vcpu->guest_debug & KVM_GUESTDBG_USE_HW_BP) &&
 	    (vcpu->arch.guest_debug_dr7 & DR7_BP_EN_MASK)) {
 		struct kvm_run *kvm_run = vcpu->run;
@@ -8487,8 +8503,7 @@ int x86_emulate_instruction(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
 		 * are fault-like and are higher priority than any faults on
 		 * the code fetch itself.
 		 */
-		if (!(emulation_type & EMULTYPE_SKIP) &&
-		    kvm_vcpu_check_code_breakpoint(vcpu, &r))
+		if (kvm_vcpu_check_code_breakpoint(vcpu, emulation_type, &r))
 			return r;
 
 		r = x86_decode_emulated_instruction(vcpu, emulation_type,
-- 
2.36.1.476.g0c4daa206d-goog



* [PATCH v2 04/21] KVM: nVMX: Treat General Detect #DB (DR7.GD=1) as fault-like
From: Sean Christopherson @ 2022-06-14 20:47 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel, Oliver Upton, Peter Shier

Exclude General Detect #DBs, which have fault-like behavior but also have
a non-zero payload (DR6.BD=1), from nVMX's handling of pending debug
traps.  Opportunistically rewrite the comment to better document what is
being checked, i.e. "has a non-zero payload" vs. "has a payload", and to
call out the many caveats surrounding #DBs that KVM dodges one way or
another.
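
The gist of the new check, as a standalone sketch (DR6 bit positions are
per the SDM; the helper loosely mirrors vmx_get_pending_dbg_trap(), the
rest is illustrative):

	#include <stdio.h>

	#define DR6_BD	(1UL << 13)	/* debug register access detected */
	#define DR6_B0	(1UL << 0)	/* breakpoint 0 condition detected */

	/* Payload bits other than DR6.BD => (probably) a trap-like #DB. */
	static unsigned long pending_dbg_trap(unsigned long payload)
	{
		return payload & ~DR6_BD;
	}

	int main(void)
	{
		/* Data breakpoint: non-zero, treated as a pending trap. */
		printf("data bp:        %#lx\n", pending_dbg_trap(DR6_B0));
		/* General Detect: zero, i.e. fault-like, excluded. */
		printf("general detect: %#lx\n", pending_dbg_trap(DR6_BD));
		return 0;
	}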

Cc: Oliver Upton <oupton@google.com>
Cc: Peter Shier <pshier@google.com>
Fixes: 684c0422da71 ("KVM: nVMX: Handle pending #DB when injecting INIT VM-exit")
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/vmx/nested.c | 36 +++++++++++++++++++++++++-----------
 1 file changed, 25 insertions(+), 11 deletions(-)

diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index 33ffc8bcf9cd..61bc80fc4cfa 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -3857,16 +3857,29 @@ static void nested_vmx_inject_exception_vmexit(struct kvm_vcpu *vcpu,
 }
 
 /*
- * Returns true if a debug trap is pending delivery.
+ * Returns true if a debug trap is (likely) pending delivery.  Infer the class
+ * of a #DB (trap-like vs. fault-like) from the exception payload (to-be-DR6).
+ * Using the payload is flawed because code breakpoints (fault-like) and data
+ * breakpoints (trap-like) set the same bits in DR6 (breakpoint detected), i.e.
+ * this will return false positives if a to-be-injected code breakpoint #DB is
+ * pending (from KVM's perspective, but not "pending" across an instruction
+ * boundary).  ICEBP, a.k.a. INT1, is also not reflected here even though it
+ * too is trap-like.
  *
- * In KVM, debug traps bear an exception payload. As such, the class of a #DB
- * exception may be inferred from the presence of an exception payload.
+ * KVM "works" despite these flaws as ICEBP isn't currently supported by the
+ * emulator, Monitor Trap Flag is not marked pending on intercepted #DBs (the
+ * #DB has already happened), and MTF isn't marked pending on code breakpoints
+ * from the emulator (because such #DBs are fault-like and thus don't trigger
+ * actions that fire on instruction retire).
  */
-static inline bool vmx_pending_dbg_trap(struct kvm_vcpu *vcpu)
+static inline unsigned long vmx_get_pending_dbg_trap(struct kvm_vcpu *vcpu)
 {
-	return vcpu->arch.exception.pending &&
-			vcpu->arch.exception.nr == DB_VECTOR &&
-			vcpu->arch.exception.payload;
+	if (!vcpu->arch.exception.pending ||
+	    vcpu->arch.exception.nr != DB_VECTOR)
+		return 0;
+
+	/* General Detect #DBs are always fault-like. */
+	return vcpu->arch.exception.payload & ~DR6_BD;
 }
 
 /*
@@ -3878,9 +3891,10 @@ static inline bool vmx_pending_dbg_trap(struct kvm_vcpu *vcpu)
  */
 static void nested_vmx_update_pending_dbg(struct kvm_vcpu *vcpu)
 {
-	if (vmx_pending_dbg_trap(vcpu))
-		vmcs_writel(GUEST_PENDING_DBG_EXCEPTIONS,
-			    vcpu->arch.exception.payload);
+	unsigned long pending_dbg = vmx_get_pending_dbg_trap(vcpu);
+
+	if (pending_dbg)
+		vmcs_writel(GUEST_PENDING_DBG_EXCEPTIONS, pending_dbg);
 }
 
 static bool nested_vmx_preemption_timer_pending(struct kvm_vcpu *vcpu)
@@ -3937,7 +3951,7 @@ static int vmx_check_nested_events(struct kvm_vcpu *vcpu)
 	 * while delivering the pending exception.
 	 */
 
-	if (vcpu->arch.exception.pending && !vmx_pending_dbg_trap(vcpu)) {
+	if (vcpu->arch.exception.pending && !vmx_get_pending_dbg_trap(vcpu)) {
 		if (vmx->nested.nested_run_pending)
 			return -EBUSY;
 		if (!nested_vmx_check_exception(vcpu, &exit_qual))
-- 
2.36.1.476.g0c4daa206d-goog



* [PATCH v2 05/21] KVM: nVMX: Prioritize TSS T-flag #DBs over Monitor Trap Flag
From: Sean Christopherson @ 2022-06-14 20:47 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel, Oliver Upton, Peter Shier

Service TSS T-flag #DBs prior to pending MTFs, as such #DBs are higher
priority than MTF.  KVM itself doesn't emulate TSS #DBs, and any such
exceptions injected from L1 will be handled by hardware (or morphed to
a fault-like exception if injection fails), but theoretically userspace
could pend a TSS T-flag #DB in conjunction with a pending MTF.

Note, there's no known use case this fixes, it's purely to be technically
correct with respect to Intel's SDM.
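
Extending the sketch from patch 04, the reordering amounts to also
masking off DR6.BT when deciding whether a pending exception outranks
MTF (bit position per the SDM; the rest is illustrative):

	#include <stdio.h>

	#define DR6_BT	(1UL << 15)	/* TSS T-flag, trap-like but > MTF */

	/*
	 * pending_dbg is vmx_get_pending_dbg_trap()'s result, i.e. the #DB
	 * payload with DR6.BD already masked off (zero for non-#DB
	 * exceptions).  Service the pending exception ahead of MTF unless
	 * it's a trap-like #DB other than a TSS T-flag #DB.
	 */
	static int handle_before_mtf(unsigned long pending_dbg)
	{
		return !(pending_dbg & ~DR6_BT);
	}

	int main(void)
	{
		printf("TSS T-flag #DB: %d\n", handle_before_mtf(DR6_BT));   /* 1 */
		printf("data bp #DB:    %d\n", handle_before_mtf(1UL << 0)); /* 0 */
		return 0;
	}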

Cc: Oliver Upton <oupton@google.com>
Cc: Peter Shier <pshier@google.com>
Fixes: 5ef8acbdd687 ("KVM: nVMX: Emulate MTF when performing instruction emulation")
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/vmx/nested.c | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index 61bc80fc4cfa..e794791a6bdd 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -3943,15 +3943,17 @@ static int vmx_check_nested_events(struct kvm_vcpu *vcpu)
 	}
 
 	/*
-	 * Process any exceptions that are not debug traps before MTF.
+	 * Process exceptions that are higher priority than Monitor Trap Flag:
+	 * fault-like exceptions, TSS T flag #DB (not emulated by KVM, but
+	 * could theoretically come in from userspace), and ICEBP (INT1).
 	 *
 	 * Note that only a pending nested run can block a pending exception.
 	 * Otherwise an injected NMI/interrupt should either be
 	 * lost or delivered to the nested hypervisor in the IDT_VECTORING_INFO,
 	 * while delivering the pending exception.
 	 */
-
-	if (vcpu->arch.exception.pending && !vmx_get_pending_dbg_trap(vcpu)) {
+	if (vcpu->arch.exception.pending &&
+	    !(vmx_get_pending_dbg_trap(vcpu) & ~DR6_BT)) {
 		if (vmx->nested.nested_run_pending)
 			return -EBUSY;
 		if (!nested_vmx_check_exception(vcpu, &exit_qual))
-- 
2.36.1.476.g0c4daa206d-goog



* [PATCH v2 06/21] KVM: x86: Treat #DBs from the emulator as fault-like (code and DR7.GD=1)
From: Sean Christopherson @ 2022-06-14 20:47 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel, Oliver Upton, Peter Shier

Add a dedicated "exception type" for #DBs, as #DBs can be fault-like or
trap-like depending on the sub-type of #DB, and effectively defer the
decision of what to do with the #DB to the caller.

For the emulator's two calls to exception_type(), treat the #DB as
fault-like, as the emulator handles only code breakpoint and general
detect #DBs, both of which are fault-like.

For event injection, which uses exception_type() to determine whether to
set EFLAGS.RF=1 on the stack, keep the current behavior of not setting
RF=1 for #DBs.  Intel and AMD explicitly state RF isn't set on code #DBs,
so exempting them by failing the "== EXCPT_FAULT" check is correct.  The only
other fault-like #DB is General Detect, and despite Intel and AMD both
strongly implying (through omission) that General Detect #DBs should set
RF=1, hardware (multiple generations of both Intel and AMD), in fact does
not.  Through insider knowledge, extreme foresight, sheer dumb luck, or
some combination thereof, KVM correctly handled RF for General Detect #DBs.
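
A minimal sketch of the injection-side decision being preserved (the
EXCPT_* constants mirror the patch, their values are illustrative):

	#include <stdio.h>

	enum { EXCPT_FAULT, EXCPT_TRAP, EXCPT_ABORT, EXCPT_INTERRUPT, EXCPT_DB };

	/* RF=1 is pushed only for fault-class exceptions, never for #DBs. */
	static int sets_rf(int type)
	{
		return type == EXCPT_FAULT;
	}

	int main(void)
	{
		printf("#GP sets RF: %d\n", sets_rf(EXCPT_FAULT)); /* 1 */
		printf("#DB sets RF: %d\n", sets_rf(EXCPT_DB));    /* 0, even if fault-like */
		return 0;
	}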

Fixes: 38827dbd3fb8 ("KVM: x86: Do not update EFLAGS on faulting emulation")
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/x86.c | 27 +++++++++++++++++++++++++--
 1 file changed, 25 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index c5db31b4bd6f..7c3ce601bdcc 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -529,6 +529,7 @@ static int exception_class(int vector)
 #define EXCPT_TRAP		1
 #define EXCPT_ABORT		2
 #define EXCPT_INTERRUPT		3
+#define EXCPT_DB		4
 
 static int exception_type(int vector)
 {
@@ -539,8 +540,14 @@ static int exception_type(int vector)
 
 	mask = 1 << vector;
 
-	/* #DB is trap, as instruction watchpoints are handled elsewhere */
-	if (mask & ((1 << DB_VECTOR) | (1 << BP_VECTOR) | (1 << OF_VECTOR)))
+	/*
+	 * #DBs can be trap-like or fault-like, the caller must check other CPU
+	 * state, e.g. DR6, to determine whether a #DB is a trap or fault.
+	 */
+	if (mask & (1 << DB_VECTOR))
+		return EXCPT_DB;
+
+	if (mask & ((1 << BP_VECTOR) | (1 << OF_VECTOR)))
 		return EXCPT_TRAP;
 
 	if (mask & ((1 << DF_VECTOR) | (1 << MC_VECTOR)))
@@ -8632,6 +8639,12 @@ int x86_emulate_instruction(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
 		unsigned long rflags = static_call(kvm_x86_get_rflags)(vcpu);
 		toggle_interruptibility(vcpu, ctxt->interruptibility);
 		vcpu->arch.emulate_regs_need_sync_to_vcpu = false;
+
+		/*
+		 * Note, EXCPT_DB is assumed to be fault-like as the emulator
+		 * only supports code breakpoints and general detect #DB, both
+		 * of which are fault-like.
+		 */
 		if (!ctxt->have_exception ||
 		    exception_type(ctxt->exception.vector) == EXCPT_TRAP) {
 			kvm_pmu_trigger_event(vcpu, PERF_COUNT_HW_INSTRUCTIONS);
@@ -9546,6 +9559,16 @@ static int inject_pending_event(struct kvm_vcpu *vcpu, bool *req_immediate_exit)
 
 	/* try to inject new event if pending */
 	if (vcpu->arch.exception.pending) {
+		/*
+		 * Fault-class exceptions, except #DBs, set RF=1 in the RFLAGS
+		 * value pushed on the stack.  Trap-like exceptions and all #DBs
+		 * leave RF as-is (KVM follows Intel's behavior in this regard;
+		 * AMD states that code breakpoint #DBs explicitly set RF=0).
+		 *
+		 * Note, most versions of Intel's SDM and AMD's APM incorrectly
+		 * describe the behavior of General Detect #DBs, which are
+		 * fault-like.  They do _not_ set RF, a la code breakpoints.
+		 */
 		if (exception_type(vcpu->arch.exception.nr) == EXCPT_FAULT)
 			__kvm_set_rflags(vcpu, kvm_get_rflags(vcpu) |
 					     X86_EFLAGS_RF);
-- 
2.36.1.476.g0c4daa206d-goog



* [PATCH v2 07/21] KVM: x86: Use DR7_GD macro instead of open coding check in emulator
From: Sean Christopherson @ 2022-06-14 20:47 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel, Oliver Upton, Peter Shier

Use DR7_GD in the emulator instead of open coding the check, and drop a
comically wrong comment.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/emulate.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index 39ea9138224c..bf499716d9d3 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -4182,8 +4182,7 @@ static int check_dr7_gd(struct x86_emulate_ctxt *ctxt)
 
 	ctxt->ops->get_dr(ctxt, 7, &dr7);
 
-	/* Check if DR7.Global_Enable is set */
-	return dr7 & (1 << 13);
+	return dr7 & DR7_GD;
 }
 
 static int check_dr_read(struct x86_emulate_ctxt *ctxt)
-- 
2.36.1.476.g0c4daa206d-goog



* [PATCH v2 08/21] KVM: nVMX: Ignore SIPI that arrives in L2 when vCPU is not in WFS
From: Sean Christopherson @ 2022-06-14 20:47 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel, Oliver Upton, Peter Shier

Fall through to handling other pending exception/events for L2 if SIPI
is pending while the CPU is not in Wait-for-SIPI.  KVM correctly ignores
the event, but incorrectly returns immediately, e.g. a SIPI coincident
with another event could lead to KVM incorrectly routing the event to L1
instead of L2.

Fixes: bf0cd88ce363 ("KVM: x86: emulate wait-for-SIPI and SIPI-VMExit")
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/vmx/nested.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index e794791a6bdd..d080bfca16ef 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -3936,10 +3936,12 @@ static int vmx_check_nested_events(struct kvm_vcpu *vcpu)
 			return -EBUSY;
 
 		clear_bit(KVM_APIC_SIPI, &apic->pending_events);
-		if (vcpu->arch.mp_state == KVM_MP_STATE_INIT_RECEIVED)
+		if (vcpu->arch.mp_state == KVM_MP_STATE_INIT_RECEIVED) {
 			nested_vmx_vmexit(vcpu, EXIT_REASON_SIPI_SIGNAL, 0,
 						apic->sipi_vector & 0xFFUL);
-		return 0;
+			return 0;
+		}
+		/* Fallthrough, the SIPI is completely ignored. */
 	}
 
 	/*
-- 
2.36.1.476.g0c4daa206d-goog



* [PATCH v2 09/21] KVM: nVMX: Unconditionally clear mtf_pending on nested VM-Exit
From: Sean Christopherson @ 2022-06-14 20:47 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel, Oliver Upton, Peter Shier

Clear mtf_pending on nested VM-Exit instead of handling the clear on a
case-by-case basis in vmx_check_nested_events().  The pending MTF should
never survive nested VM-Exit, as it is a property of KVM's run of the
current L2, i.e. should never affect the next L2 run by L1.  In practice,
this is likely a nop as getting to L1 with nested_run_pending is
impossible, and KVM doesn't correctly handle morphing a pending exception
that occurs on a prior injected exception (need for re-injected exception
being the other case where MTF isn't cleared).  However, KVM will
hopefully soon correctly deal with a pending exception on top of an
injected exception.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/vmx/nested.c | 16 +++++++---------
 1 file changed, 7 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index d080bfca16ef..7b644513c82b 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -3909,16 +3909,8 @@ static int vmx_check_nested_events(struct kvm_vcpu *vcpu)
 	unsigned long exit_qual;
 	bool block_nested_events =
 	    vmx->nested.nested_run_pending || kvm_event_needs_reinjection(vcpu);
-	bool mtf_pending = vmx->nested.mtf_pending;
 	struct kvm_lapic *apic = vcpu->arch.apic;
 
-	/*
-	 * Clear the MTF state. If a higher priority VM-exit is delivered first,
-	 * this state is discarded.
-	 */
-	if (!block_nested_events)
-		vmx->nested.mtf_pending = false;
-
 	if (lapic_in_kernel(vcpu) &&
 		test_bit(KVM_APIC_INIT, &apic->pending_events)) {
 		if (block_nested_events)
@@ -3927,6 +3919,9 @@ static int vmx_check_nested_events(struct kvm_vcpu *vcpu)
 		clear_bit(KVM_APIC_INIT, &apic->pending_events);
 		if (vcpu->arch.mp_state != KVM_MP_STATE_INIT_RECEIVED)
 			nested_vmx_vmexit(vcpu, EXIT_REASON_INIT_SIGNAL, 0, 0);
+
+		/* MTF is discarded if the vCPU is in WFS. */
+		vmx->nested.mtf_pending = false;
 		return 0;
 	}
 
@@ -3964,7 +3959,7 @@ static int vmx_check_nested_events(struct kvm_vcpu *vcpu)
 		return 0;
 	}
 
-	if (mtf_pending) {
+	if (vmx->nested.mtf_pending) {
 		if (block_nested_events)
 			return -EBUSY;
 		nested_vmx_update_pending_dbg(vcpu);
@@ -4562,6 +4557,9 @@ void nested_vmx_vmexit(struct kvm_vcpu *vcpu, u32 vm_exit_reason,
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
 	struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
 
+	/* Pending MTF traps are discarded on VM-Exit. */
+	vmx->nested.mtf_pending = false;
+
 	/* trying to cancel vmlaunch/vmresume is a bug */
 	WARN_ON_ONCE(vmx->nested.nested_run_pending);
 
-- 
2.36.1.476.g0c4daa206d-goog



* [PATCH v2 10/21] KVM: VMX: Inject #PF on ENCLS as "emulated" #PF
From: Sean Christopherson @ 2022-06-14 20:47 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel, Oliver Upton, Peter Shier

Treat #PFs that occur during emulation of ENCLS as, wait for it, emulated
page faults.  Practically speaking, this is a glorified nop as the
exception is never of the nested flavor, and it's extremely unlikely the
guest is relying on the side effect of an implicit INVLPG on the faulting
address.

Fixes: 70210c044b4e ("KVM: VMX: Add SGX ENCLS[ECREATE] handler to enforce CPUID restrictions")
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/vmx/sgx.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kvm/vmx/sgx.c b/arch/x86/kvm/vmx/sgx.c
index 35e7ec91ae86..966cfa228f2a 100644
--- a/arch/x86/kvm/vmx/sgx.c
+++ b/arch/x86/kvm/vmx/sgx.c
@@ -129,7 +129,7 @@ static int sgx_inject_fault(struct kvm_vcpu *vcpu, gva_t gva, int trapnr)
 		ex.address = gva;
 		ex.error_code_valid = true;
 		ex.nested_page_fault = false;
-		kvm_inject_page_fault(vcpu, &ex);
+		kvm_inject_emulated_page_fault(vcpu, &ex);
 	} else {
 		kvm_inject_gp(vcpu, 0);
 	}
-- 
2.36.1.476.g0c4daa206d-goog



* [PATCH v2 11/21] KVM: x86: Rename kvm_x86_ops.queue_exception to inject_exception
From: Sean Christopherson @ 2022-06-14 20:47 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel, Oliver Upton, Peter Shier

Rename the kvm_x86_ops hook for exception injection to better reflect
reality, and to align with pretty much every other related function name
in KVM.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/include/asm/kvm-x86-ops.h | 2 +-
 arch/x86/include/asm/kvm_host.h    | 2 +-
 arch/x86/kvm/svm/svm.c             | 4 ++--
 arch/x86/kvm/vmx/vmx.c             | 4 ++--
 arch/x86/kvm/x86.c                 | 2 +-
 5 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
index 6f2f1affbb78..a42e2d9b04fe 100644
--- a/arch/x86/include/asm/kvm-x86-ops.h
+++ b/arch/x86/include/asm/kvm-x86-ops.h
@@ -67,7 +67,7 @@ KVM_X86_OP(get_interrupt_shadow)
 KVM_X86_OP(patch_hypercall)
 KVM_X86_OP(inject_irq)
 KVM_X86_OP(inject_nmi)
-KVM_X86_OP(queue_exception)
+KVM_X86_OP(inject_exception)
 KVM_X86_OP(cancel_injection)
 KVM_X86_OP(interrupt_allowed)
 KVM_X86_OP(nmi_allowed)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 7e98b2876380..16a7f91cdf75 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1505,7 +1505,7 @@ struct kvm_x86_ops {
 				unsigned char *hypercall_addr);
 	void (*inject_irq)(struct kvm_vcpu *vcpu, bool reinjected);
 	void (*inject_nmi)(struct kvm_vcpu *vcpu);
-	void (*queue_exception)(struct kvm_vcpu *vcpu);
+	void (*inject_exception)(struct kvm_vcpu *vcpu);
 	void (*cancel_injection)(struct kvm_vcpu *vcpu);
 	int (*interrupt_allowed)(struct kvm_vcpu *vcpu, bool for_injection);
 	int (*nmi_allowed)(struct kvm_vcpu *vcpu, bool for_injection);
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index c6cca0ce127b..ca39f76ca44b 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -430,7 +430,7 @@ static int svm_update_soft_interrupt_rip(struct kvm_vcpu *vcpu)
 	return 0;
 }
 
-static void svm_queue_exception(struct kvm_vcpu *vcpu)
+static void svm_inject_exception(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_svm *svm = to_svm(vcpu);
 	unsigned nr = vcpu->arch.exception.nr;
@@ -4761,7 +4761,7 @@ static struct kvm_x86_ops svm_x86_ops __initdata = {
 	.patch_hypercall = svm_patch_hypercall,
 	.inject_irq = svm_inject_irq,
 	.inject_nmi = svm_inject_nmi,
-	.queue_exception = svm_queue_exception,
+	.inject_exception = svm_inject_exception,
 	.cancel_injection = svm_cancel_injection,
 	.interrupt_allowed = svm_interrupt_allowed,
 	.nmi_allowed = svm_nmi_allowed,
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index ec98992024e2..26b863c78a9f 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -1610,7 +1610,7 @@ static void vmx_clear_hlt(struct kvm_vcpu *vcpu)
 		vmcs_write32(GUEST_ACTIVITY_STATE, GUEST_ACTIVITY_ACTIVE);
 }
 
-static void vmx_queue_exception(struct kvm_vcpu *vcpu)
+static void vmx_inject_exception(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
 	unsigned nr = vcpu->arch.exception.nr;
@@ -7993,7 +7993,7 @@ static struct kvm_x86_ops vmx_x86_ops __initdata = {
 	.patch_hypercall = vmx_patch_hypercall,
 	.inject_irq = vmx_inject_irq,
 	.inject_nmi = vmx_inject_nmi,
-	.queue_exception = vmx_queue_exception,
+	.inject_exception = vmx_inject_exception,
 	.cancel_injection = vmx_cancel_injection,
 	.interrupt_allowed = vmx_interrupt_allowed,
 	.nmi_allowed = vmx_nmi_allowed,
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 7c3ce601bdcc..b63421d511c5 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -9504,7 +9504,7 @@ static void kvm_inject_exception(struct kvm_vcpu *vcpu)
 
 	if (vcpu->arch.exception.error_code && !is_protmode(vcpu))
 		vcpu->arch.exception.error_code = false;
-	static_call(kvm_x86_queue_exception)(vcpu);
+	static_call(kvm_x86_inject_exception)(vcpu);
 }
 
 static int inject_pending_event(struct kvm_vcpu *vcpu, bool *req_immediate_exit)
-- 
2.36.1.476.g0c4daa206d-goog



* [PATCH v2 12/21] KVM: x86: Make kvm_queued_exception a properly named, visible struct
From: Sean Christopherson @ 2022-06-14 20:47 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel, Oliver Upton, Peter Shier

Move the definition of "struct kvm_queued_exception" out of kvm_vcpu_arch
in anticipation of adding a second instance in kvm_vcpu_arch to handle
exceptions that occur when vectoring an injected exception and are
morphed to VM-Exit instead of leading to #DF.

Opportunistically take advantage of the churn to rename "nr" to "vector".

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/include/asm/kvm_host.h | 23 +++++-----
 arch/x86/kvm/svm/nested.c       | 45 ++++++++++---------
 arch/x86/kvm/svm/svm.c          | 14 +++---
 arch/x86/kvm/vmx/nested.c       | 42 +++++++++--------
 arch/x86/kvm/vmx/vmx.c          | 20 ++++-----
 arch/x86/kvm/x86.c              | 80 ++++++++++++++++-----------------
 arch/x86/kvm/x86.h              |  3 +-
 7 files changed, 111 insertions(+), 116 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 16a7f91cdf75..7f321d53a7e9 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -640,6 +640,17 @@ struct kvm_vcpu_xen {
 	struct timer_list poll_timer;
 };
 
+struct kvm_queued_exception {
+	bool pending;
+	bool injected;
+	bool has_error_code;
+	u8 vector;
+	u32 error_code;
+	unsigned long payload;
+	bool has_payload;
+	u8 nested_apf;
+};
+
 struct kvm_vcpu_arch {
 	/*
 	 * rip and regs accesses must go through
@@ -739,16 +750,8 @@ struct kvm_vcpu_arch {
 
 	u8 event_exit_inst_len;
 
-	struct kvm_queued_exception {
-		bool pending;
-		bool injected;
-		bool has_error_code;
-		u8 nr;
-		u32 error_code;
-		unsigned long payload;
-		bool has_payload;
-		u8 nested_apf;
-	} exception;
+	/* Exceptions to be injected to the guest. */
+	struct kvm_queued_exception exception;
 
 	struct kvm_queued_interrupt {
 		bool injected;
diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
index 83bae1f2eeb8..471d40e97890 100644
--- a/arch/x86/kvm/svm/nested.c
+++ b/arch/x86/kvm/svm/nested.c
@@ -462,7 +462,7 @@ static void nested_save_pending_event_to_vmcb12(struct vcpu_svm *svm,
 	unsigned int nr;
 
 	if (vcpu->arch.exception.injected) {
-		nr = vcpu->arch.exception.nr;
+		nr = vcpu->arch.exception.vector;
 		exit_int_info = nr | SVM_EVTINJ_VALID | SVM_EVTINJ_TYPE_EXEPT;
 
 		if (vcpu->arch.exception.has_error_code) {
@@ -1299,42 +1299,43 @@ int nested_svm_check_permissions(struct kvm_vcpu *vcpu)
 
 static bool nested_exit_on_exception(struct vcpu_svm *svm)
 {
-	unsigned int nr = svm->vcpu.arch.exception.nr;
+	unsigned int vector = svm->vcpu.arch.exception.vector;
 
-	return (svm->nested.ctl.intercepts[INTERCEPT_EXCEPTION] & BIT(nr));
+	return (svm->nested.ctl.intercepts[INTERCEPT_EXCEPTION] & BIT(vector));
 }
 
-static void nested_svm_inject_exception_vmexit(struct vcpu_svm *svm)
+static void nested_svm_inject_exception_vmexit(struct kvm_vcpu *vcpu)
 {
-	unsigned int nr = svm->vcpu.arch.exception.nr;
+	struct kvm_queued_exception *ex = &vcpu->arch.exception;
+	struct vcpu_svm *svm = to_svm(vcpu);
 	struct vmcb *vmcb = svm->vmcb;
 
-	vmcb->control.exit_code = SVM_EXIT_EXCP_BASE + nr;
+	vmcb->control.exit_code = SVM_EXIT_EXCP_BASE + ex->vector;
 	vmcb->control.exit_code_hi = 0;
 
-	if (svm->vcpu.arch.exception.has_error_code)
-		vmcb->control.exit_info_1 = svm->vcpu.arch.exception.error_code;
+	if (ex->has_error_code)
+		vmcb->control.exit_info_1 = ex->error_code;
 
 	/*
 	 * EXITINFO2 is undefined for all exception intercepts other
 	 * than #PF.
 	 */
-	if (nr == PF_VECTOR) {
-		if (svm->vcpu.arch.exception.nested_apf)
-			vmcb->control.exit_info_2 = svm->vcpu.arch.apf.nested_apf_token;
-		else if (svm->vcpu.arch.exception.has_payload)
-			vmcb->control.exit_info_2 = svm->vcpu.arch.exception.payload;
+	if (ex->vector == PF_VECTOR) {
+		if (ex->has_payload)
+			vmcb->control.exit_info_2 = ex->payload;
 		else
-			vmcb->control.exit_info_2 = svm->vcpu.arch.cr2;
-	} else if (nr == DB_VECTOR) {
+			vmcb->control.exit_info_2 = vcpu->arch.cr2;
+	} else if (ex->vector == DB_VECTOR) {
 		/* See inject_pending_event.  */
-		kvm_deliver_exception_payload(&svm->vcpu);
-		if (svm->vcpu.arch.dr7 & DR7_GD) {
-			svm->vcpu.arch.dr7 &= ~DR7_GD;
-			kvm_update_dr7(&svm->vcpu);
+		kvm_deliver_exception_payload(vcpu, ex);
+
+		if (vcpu->arch.dr7 & DR7_GD) {
+			vcpu->arch.dr7 &= ~DR7_GD;
+			kvm_update_dr7(vcpu);
 		}
-	} else
-		WARN_ON(svm->vcpu.arch.exception.has_payload);
+	} else {
+		WARN_ON(ex->has_payload);
+	}
 
 	nested_svm_vmexit(svm);
 }
@@ -1372,7 +1373,7 @@ static int svm_check_nested_events(struct kvm_vcpu *vcpu)
                         return -EBUSY;
 		if (!nested_exit_on_exception(svm))
 			return 0;
-		nested_svm_inject_exception_vmexit(svm);
+		nested_svm_inject_exception_vmexit(vcpu);
 		return 0;
 	}
 
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index ca39f76ca44b..6b80046a014f 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -432,22 +432,20 @@ static int svm_update_soft_interrupt_rip(struct kvm_vcpu *vcpu)
 
 static void svm_inject_exception(struct kvm_vcpu *vcpu)
 {
+	struct kvm_queued_exception *ex = &vcpu->arch.exception;
 	struct vcpu_svm *svm = to_svm(vcpu);
-	unsigned nr = vcpu->arch.exception.nr;
-	bool has_error_code = vcpu->arch.exception.has_error_code;
-	u32 error_code = vcpu->arch.exception.error_code;
 
-	kvm_deliver_exception_payload(vcpu);
+	kvm_deliver_exception_payload(vcpu, ex);
 
-	if (kvm_exception_is_soft(nr) &&
+	if (kvm_exception_is_soft(ex->vector) &&
 	    svm_update_soft_interrupt_rip(vcpu))
 		return;
 
-	svm->vmcb->control.event_inj = nr
+	svm->vmcb->control.event_inj = ex->vector
 		| SVM_EVTINJ_VALID
-		| (has_error_code ? SVM_EVTINJ_VALID_ERR : 0)
+		| (ex->has_error_code ? SVM_EVTINJ_VALID_ERR : 0)
 		| SVM_EVTINJ_TYPE_EXEPT;
-	svm->vmcb->control.event_inj_err = error_code;
+	svm->vmcb->control.event_inj_err = ex->error_code;
 }
 
 static void svm_init_erratum_383(void)
diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index 7b644513c82b..fafdcbfeca1f 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -445,29 +445,27 @@ static bool nested_vmx_is_page_fault_vmexit(struct vmcs12 *vmcs12,
  */
 static int nested_vmx_check_exception(struct kvm_vcpu *vcpu, unsigned long *exit_qual)
 {
+	struct kvm_queued_exception *ex = &vcpu->arch.exception;
 	struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
-	unsigned int nr = vcpu->arch.exception.nr;
-	bool has_payload = vcpu->arch.exception.has_payload;
-	unsigned long payload = vcpu->arch.exception.payload;
 
-	if (nr == PF_VECTOR) {
-		if (vcpu->arch.exception.nested_apf) {
+	if (ex->vector == PF_VECTOR) {
+		if (ex->nested_apf) {
 			*exit_qual = vcpu->arch.apf.nested_apf_token;
 			return 1;
 		}
-		if (nested_vmx_is_page_fault_vmexit(vmcs12,
-						    vcpu->arch.exception.error_code)) {
-			*exit_qual = has_payload ? payload : vcpu->arch.cr2;
+		if (nested_vmx_is_page_fault_vmexit(vmcs12, ex->error_code)) {
+			*exit_qual = ex->has_payload ? ex->payload : vcpu->arch.cr2;
 			return 1;
 		}
-	} else if (vmcs12->exception_bitmap & (1u << nr)) {
-		if (nr == DB_VECTOR) {
-			if (!has_payload) {
-				payload = vcpu->arch.dr6;
-				payload &= ~DR6_BT;
-				payload ^= DR6_ACTIVE_LOW;
+	} else if (vmcs12->exception_bitmap & (1u << ex->vector)) {
+		if (ex->vector == DB_VECTOR) {
+			if (ex->has_payload) {
+				*exit_qual = ex->payload;
+			} else {
+				*exit_qual = vcpu->arch.dr6;
+				*exit_qual &= ~DR6_BT;
+				*exit_qual ^= DR6_ACTIVE_LOW;
 			}
-			*exit_qual = payload;
 		} else
 			*exit_qual = 0;
 		return 1;
@@ -3724,7 +3722,7 @@ static void vmcs12_save_pending_event(struct kvm_vcpu *vcpu,
 	     is_double_fault(exit_intr_info))) {
 		vmcs12->idt_vectoring_info_field = 0;
 	} else if (vcpu->arch.exception.injected) {
-		nr = vcpu->arch.exception.nr;
+		nr = vcpu->arch.exception.vector;
 		idt_vectoring = nr | VECTORING_INFO_VALID_MASK;
 
 		if (kvm_exception_is_soft(nr)) {
@@ -3828,11 +3826,11 @@ static int vmx_complete_nested_posted_interrupt(struct kvm_vcpu *vcpu)
 static void nested_vmx_inject_exception_vmexit(struct kvm_vcpu *vcpu,
 					       unsigned long exit_qual)
 {
+	struct kvm_queued_exception *ex = &vcpu->arch.exception;
+	u32 intr_info = ex->vector | INTR_INFO_VALID_MASK;
 	struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
-	unsigned int nr = vcpu->arch.exception.nr;
-	u32 intr_info = nr | INTR_INFO_VALID_MASK;
 
-	if (vcpu->arch.exception.has_error_code) {
+	if (ex->has_error_code) {
 		/*
 		 * Intel CPUs will never generate an error code with bits 31:16
 		 * set, and more importantly VMX disallows setting bits 31:16
@@ -3840,11 +3838,11 @@ static void nested_vmx_inject_exception_vmexit(struct kvm_vcpu *vcpu,
 		 * mimic hardware and avoid inducing failure on nested VM-Entry
 		 * if L1 chooses to inject the exception back to L2.
 		 */
-		vmcs12->vm_exit_intr_error_code = (u16)vcpu->arch.exception.error_code;
+		vmcs12->vm_exit_intr_error_code = (u16)ex->error_code;
 		intr_info |= INTR_INFO_DELIVER_CODE_MASK;
 	}
 
-	if (kvm_exception_is_soft(nr))
+	if (kvm_exception_is_soft(ex->vector))
 		intr_info |= INTR_TYPE_SOFT_EXCEPTION;
 	else
 		intr_info |= INTR_TYPE_HARD_EXCEPTION;
@@ -3875,7 +3873,7 @@ static void nested_vmx_inject_exception_vmexit(struct kvm_vcpu *vcpu,
 static inline unsigned long vmx_get_pending_dbg_trap(struct kvm_vcpu *vcpu)
 {
 	if (!vcpu->arch.exception.pending ||
-	    vcpu->arch.exception.nr != DB_VECTOR)
+	    vcpu->arch.exception.vector != DB_VECTOR)
 		return 0;
 
 	/* General Detect #DBs are always fault-like. */
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 26b863c78a9f..7ef5659a1bbd 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -1585,7 +1585,7 @@ static void vmx_update_emulated_instruction(struct kvm_vcpu *vcpu)
 	 */
 	if (nested_cpu_has_mtf(vmcs12) &&
 	    (!vcpu->arch.exception.pending ||
-	     vcpu->arch.exception.nr == DB_VECTOR))
+	     vcpu->arch.exception.vector == DB_VECTOR))
 		vmx->nested.mtf_pending = true;
 	else
 		vmx->nested.mtf_pending = false;
@@ -1612,15 +1612,13 @@ static void vmx_clear_hlt(struct kvm_vcpu *vcpu)
 
 static void vmx_inject_exception(struct kvm_vcpu *vcpu)
 {
+	struct kvm_queued_exception *ex = &vcpu->arch.exception;
+	u32 intr_info = ex->vector | INTR_INFO_VALID_MASK;
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
-	unsigned nr = vcpu->arch.exception.nr;
-	bool has_error_code = vcpu->arch.exception.has_error_code;
-	u32 error_code = vcpu->arch.exception.error_code;
-	u32 intr_info = nr | INTR_INFO_VALID_MASK;
 
-	kvm_deliver_exception_payload(vcpu);
+	kvm_deliver_exception_payload(vcpu, ex);
 
-	if (has_error_code) {
+	if (ex->has_error_code) {
 		/*
 		 * Despite the error code being architecturally defined as 32
 		 * bits, and the VMCS field being 32 bits, Intel CPUs and thus
@@ -1630,21 +1628,21 @@ static void vmx_inject_exception(struct kvm_vcpu *vcpu)
 		 * the upper bits to avoid VM-Fail; losing information that
 		 * doesn't really exist is preferable to killing the VM.
 		 */
-		vmcs_write32(VM_ENTRY_EXCEPTION_ERROR_CODE, (u16)error_code);
+		vmcs_write32(VM_ENTRY_EXCEPTION_ERROR_CODE, (u16)ex->error_code);
 		intr_info |= INTR_INFO_DELIVER_CODE_MASK;
 	}
 
 	if (vmx->rmode.vm86_active) {
 		int inc_eip = 0;
-		if (kvm_exception_is_soft(nr))
+		if (kvm_exception_is_soft(ex->vector))
 			inc_eip = vcpu->arch.event_exit_inst_len;
-		kvm_inject_realmode_interrupt(vcpu, nr, inc_eip);
+		kvm_inject_realmode_interrupt(vcpu, ex->vector, inc_eip);
 		return;
 	}
 
 	WARN_ON_ONCE(vmx->emulation_required);
 
-	if (kvm_exception_is_soft(nr)) {
+	if (kvm_exception_is_soft(ex->vector)) {
 		vmcs_write32(VM_ENTRY_INSTRUCTION_LEN,
 			     vmx->vcpu.arch.event_exit_inst_len);
 		intr_info |= INTR_TYPE_SOFT_EXCEPTION;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index b63421d511c5..511c0c8af80e 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -557,16 +557,13 @@ static int exception_type(int vector)
 	return EXCPT_FAULT;
 }
 
-void kvm_deliver_exception_payload(struct kvm_vcpu *vcpu)
+void kvm_deliver_exception_payload(struct kvm_vcpu *vcpu,
+				   struct kvm_queued_exception *ex)
 {
-	unsigned nr = vcpu->arch.exception.nr;
-	bool has_payload = vcpu->arch.exception.has_payload;
-	unsigned long payload = vcpu->arch.exception.payload;
-
-	if (!has_payload)
+	if (!ex->has_payload)
 		return;
 
-	switch (nr) {
+	switch (ex->vector) {
 	case DB_VECTOR:
 		/*
 		 * "Certain debug exceptions may clear bit 0-3.  The
@@ -591,8 +588,8 @@ void kvm_deliver_exception_payload(struct kvm_vcpu *vcpu)
 		 * So they need to be flipped for DR6.
 		 */
 		vcpu->arch.dr6 |= DR6_ACTIVE_LOW;
-		vcpu->arch.dr6 |= payload;
-		vcpu->arch.dr6 ^= payload & DR6_ACTIVE_LOW;
+		vcpu->arch.dr6 |= ex->payload;
+		vcpu->arch.dr6 ^= ex->payload & DR6_ACTIVE_LOW;
 
 		/*
 		 * The #DB payload is defined as compatible with the 'pending
@@ -603,12 +600,12 @@ void kvm_deliver_exception_payload(struct kvm_vcpu *vcpu)
 		vcpu->arch.dr6 &= ~BIT(12);
 		break;
 	case PF_VECTOR:
-		vcpu->arch.cr2 = payload;
+		vcpu->arch.cr2 = ex->payload;
 		break;
 	}
 
-	vcpu->arch.exception.has_payload = false;
-	vcpu->arch.exception.payload = 0;
+	ex->has_payload = false;
+	ex->payload = 0;
 }
 EXPORT_SYMBOL_GPL(kvm_deliver_exception_payload);
 
@@ -647,17 +644,18 @@ static void kvm_multiple_exception(struct kvm_vcpu *vcpu,
 			vcpu->arch.exception.injected = false;
 		}
 		vcpu->arch.exception.has_error_code = has_error;
-		vcpu->arch.exception.nr = nr;
+		vcpu->arch.exception.vector = nr;
 		vcpu->arch.exception.error_code = error_code;
 		vcpu->arch.exception.has_payload = has_payload;
 		vcpu->arch.exception.payload = payload;
 		if (!is_guest_mode(vcpu))
-			kvm_deliver_exception_payload(vcpu);
+			kvm_deliver_exception_payload(vcpu,
+						      &vcpu->arch.exception);
 		return;
 	}
 
 	/* to check exception */
-	prev_nr = vcpu->arch.exception.nr;
+	prev_nr = vcpu->arch.exception.vector;
 	if (prev_nr == DF_VECTOR) {
 		/* triple fault -> shutdown */
 		kvm_make_request(KVM_REQ_TRIPLE_FAULT, vcpu);
@@ -675,7 +673,7 @@ static void kvm_multiple_exception(struct kvm_vcpu *vcpu,
 		vcpu->arch.exception.pending = true;
 		vcpu->arch.exception.injected = false;
 		vcpu->arch.exception.has_error_code = true;
-		vcpu->arch.exception.nr = DF_VECTOR;
+		vcpu->arch.exception.vector = DF_VECTOR;
 		vcpu->arch.exception.error_code = 0;
 		vcpu->arch.exception.has_payload = false;
 		vcpu->arch.exception.payload = 0;
@@ -4886,25 +4884,24 @@ static int kvm_vcpu_ioctl_x86_set_mce(struct kvm_vcpu *vcpu,
 static void kvm_vcpu_ioctl_x86_get_vcpu_events(struct kvm_vcpu *vcpu,
 					       struct kvm_vcpu_events *events)
 {
+	struct kvm_queued_exception *ex = &vcpu->arch.exception;
+
 	process_nmi(vcpu);
 
 	if (kvm_check_request(KVM_REQ_SMI, vcpu))
 		process_smi(vcpu);
 
 	/*
-	 * In guest mode, payload delivery should be deferred,
-	 * so that the L1 hypervisor can intercept #PF before
-	 * CR2 is modified (or intercept #DB before DR6 is
-	 * modified under nVMX). Unless the per-VM capability,
-	 * KVM_CAP_EXCEPTION_PAYLOAD, is set, we may not defer the delivery of
-	 * an exception payload and handle after a KVM_GET_VCPU_EVENTS. Since we
-	 * opportunistically defer the exception payload, deliver it if the
-	 * capability hasn't been requested before processing a
-	 * KVM_GET_VCPU_EVENTS.
+	 * In guest mode, payload delivery should be deferred if the exception
+	 * will be intercepted by L1, e.g. KVM should not modify CR2 if L1
+	 * intercepts #PF, ditto for DR6 and #DBs.  If the per-VM capability,
+	 * KVM_CAP_EXCEPTION_PAYLOAD, is not set, userspace may or may not
+	 * propagate the payload and so it cannot be safely deferred.  Deliver
+	 * the payload if the capability hasn't been requested.
 	 */
 	if (!vcpu->kvm->arch.exception_payload_enabled &&
-	    vcpu->arch.exception.pending && vcpu->arch.exception.has_payload)
-		kvm_deliver_exception_payload(vcpu);
+	    ex->pending && ex->has_payload)
+		kvm_deliver_exception_payload(vcpu, ex);
 
 	/*
 	 * The API doesn't provide the instruction length for software
@@ -4912,26 +4909,25 @@ static void kvm_vcpu_ioctl_x86_get_vcpu_events(struct kvm_vcpu *vcpu,
 	 * isn't advanced, we should expect to encounter the exception
 	 * again.
 	 */
-	if (kvm_exception_is_soft(vcpu->arch.exception.nr)) {
+	if (kvm_exception_is_soft(ex->vector)) {
 		events->exception.injected = 0;
 		events->exception.pending = 0;
 	} else {
-		events->exception.injected = vcpu->arch.exception.injected;
-		events->exception.pending = vcpu->arch.exception.pending;
+		events->exception.injected = ex->injected;
+		events->exception.pending = ex->pending;
 		/*
 		 * For ABI compatibility, deliberately conflate
 		 * pending and injected exceptions when
 		 * KVM_CAP_EXCEPTION_PAYLOAD isn't enabled.
 		 */
 		if (!vcpu->kvm->arch.exception_payload_enabled)
-			events->exception.injected |=
-				vcpu->arch.exception.pending;
+			events->exception.injected |= ex->pending;
 	}
-	events->exception.nr = vcpu->arch.exception.nr;
-	events->exception.has_error_code = vcpu->arch.exception.has_error_code;
-	events->exception.error_code = vcpu->arch.exception.error_code;
-	events->exception_has_payload = vcpu->arch.exception.has_payload;
-	events->exception_payload = vcpu->arch.exception.payload;
+	events->exception.nr = ex->vector;
+	events->exception.has_error_code = ex->has_error_code;
+	events->exception.error_code = ex->error_code;
+	events->exception_has_payload = ex->has_payload;
+	events->exception_payload = ex->payload;
 
 	events->interrupt.injected =
 		vcpu->arch.interrupt.injected && !vcpu->arch.interrupt.soft;
@@ -5003,7 +4999,7 @@ static int kvm_vcpu_ioctl_x86_set_vcpu_events(struct kvm_vcpu *vcpu,
 	process_nmi(vcpu);
 	vcpu->arch.exception.injected = events->exception.injected;
 	vcpu->arch.exception.pending = events->exception.pending;
-	vcpu->arch.exception.nr = events->exception.nr;
+	vcpu->arch.exception.vector = events->exception.nr;
 	vcpu->arch.exception.has_error_code = events->exception.has_error_code;
 	vcpu->arch.exception.error_code = events->exception.error_code;
 	vcpu->arch.exception.has_payload = events->exception_has_payload;
@@ -9497,7 +9493,7 @@ int kvm_check_nested_events(struct kvm_vcpu *vcpu)
 
 static void kvm_inject_exception(struct kvm_vcpu *vcpu)
 {
-	trace_kvm_inj_exception(vcpu->arch.exception.nr,
+	trace_kvm_inj_exception(vcpu->arch.exception.vector,
 				vcpu->arch.exception.has_error_code,
 				vcpu->arch.exception.error_code,
 				vcpu->arch.exception.injected);
@@ -9569,12 +9565,12 @@ static int inject_pending_event(struct kvm_vcpu *vcpu, bool *req_immediate_exit)
 		 * describe the behavior of General Detect #DBs, which are
 		 * fault-like.  They do _not_ set RF, a la code breakpoints.
 		 */
-		if (exception_type(vcpu->arch.exception.nr) == EXCPT_FAULT)
+		if (exception_type(vcpu->arch.exception.vector) == EXCPT_FAULT)
 			__kvm_set_rflags(vcpu, kvm_get_rflags(vcpu) |
 					     X86_EFLAGS_RF);
 
-		if (vcpu->arch.exception.nr == DB_VECTOR) {
-			kvm_deliver_exception_payload(vcpu);
+		if (vcpu->arch.exception.vector == DB_VECTOR) {
+			kvm_deliver_exception_payload(vcpu, &vcpu->arch.exception);
 			if (vcpu->arch.dr7 & DR7_GD) {
 				vcpu->arch.dr7 &= ~DR7_GD;
 				kvm_update_dr7(vcpu);
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index 501b884b8cc4..dc2af0146220 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -286,7 +286,8 @@ int kvm_write_guest_virt_system(struct kvm_vcpu *vcpu,
 
 int handle_ud(struct kvm_vcpu *vcpu);
 
-void kvm_deliver_exception_payload(struct kvm_vcpu *vcpu);
+void kvm_deliver_exception_payload(struct kvm_vcpu *vcpu,
+				   struct kvm_queued_exception *ex);
 
 void kvm_vcpu_mtrr_init(struct kvm_vcpu *vcpu);
 u8 kvm_mtrr_get_guest_memory_type(struct kvm_vcpu *vcpu, gfn_t gfn);
-- 
2.36.1.476.g0c4daa206d-goog


^ permalink raw reply related	[flat|nested] 78+ messages in thread

* [PATCH v2 13/21] KVM: x86: Formalize blocking of nested pending exceptions
  2022-06-14 20:47 [PATCH v2 00/21] KVM: x86: Event/exception fixes and cleanups Sean Christopherson
                   ` (11 preceding siblings ...)
  2022-06-14 20:47 ` [PATCH v2 12/21] KVM: x86: Make kvm_queued_exception a properly named, visible struct Sean Christopherson
@ 2022-06-14 20:47 ` Sean Christopherson
  2022-07-06 12:04   ` Maxim Levitsky
  2022-06-14 20:47 ` [PATCH v2 14/21] KVM: x86: Use kvm_queue_exception_e() to queue #DF Sean Christopherson
                   ` (9 subsequent siblings)
  22 siblings, 1 reply; 78+ messages in thread
From: Sean Christopherson @ 2022-06-14 20:47 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel, Oliver Upton, Peter Shier

Capture nested_run_pending as block_nested_exceptions so that the logic
of why exceptions are blocked only needs to be documented once instead of
at every place that employs the logic.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/svm/nested.c | 20 ++++++++++----------
 arch/x86/kvm/vmx/nested.c | 23 ++++++++++++-----------
 2 files changed, 22 insertions(+), 21 deletions(-)

diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
index 471d40e97890..460161e67ce5 100644
--- a/arch/x86/kvm/svm/nested.c
+++ b/arch/x86/kvm/svm/nested.c
@@ -1347,10 +1347,16 @@ static inline bool nested_exit_on_init(struct vcpu_svm *svm)
 
 static int svm_check_nested_events(struct kvm_vcpu *vcpu)
 {
-	struct vcpu_svm *svm = to_svm(vcpu);
-	bool block_nested_events =
-		kvm_event_needs_reinjection(vcpu) || svm->nested.nested_run_pending;
 	struct kvm_lapic *apic = vcpu->arch.apic;
+	struct vcpu_svm *svm = to_svm(vcpu);
+	/*
+	 * Only a pending nested run blocks a pending exception.  If there is a
+	 * previously injected event, the pending exception occurred while said
+	 * event was being delivered and thus needs to be handled.
+	 */
+	bool block_nested_exceptions = svm->nested.nested_run_pending;
+	bool block_nested_events = block_nested_exceptions ||
+				   kvm_event_needs_reinjection(vcpu);
 
 	if (lapic_in_kernel(vcpu) &&
 	    test_bit(KVM_APIC_INIT, &apic->pending_events)) {
@@ -1363,13 +1369,7 @@ static int svm_check_nested_events(struct kvm_vcpu *vcpu)
 	}
 
 	if (vcpu->arch.exception.pending) {
-		/*
-		 * Only a pending nested run can block a pending exception.
-		 * Otherwise an injected NMI/interrupt should either be
-		 * lost or delivered to the nested hypervisor in the EXITINTINFO
-		 * vmcb field, while delivering the pending exception.
-		 */
-		if (svm->nested.nested_run_pending)
+		if (block_nested_exceptions)
                         return -EBUSY;
 		if (!nested_exit_on_exception(svm))
 			return 0;
diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index fafdcbfeca1f..50fe66f0cc1b 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -3903,11 +3903,17 @@ static bool nested_vmx_preemption_timer_pending(struct kvm_vcpu *vcpu)
 
 static int vmx_check_nested_events(struct kvm_vcpu *vcpu)
 {
-	struct vcpu_vmx *vmx = to_vmx(vcpu);
-	unsigned long exit_qual;
-	bool block_nested_events =
-	    vmx->nested.nested_run_pending || kvm_event_needs_reinjection(vcpu);
 	struct kvm_lapic *apic = vcpu->arch.apic;
+	struct vcpu_vmx *vmx = to_vmx(vcpu);
+	unsigned long exit_qual;
+	/*
+	 * Only a pending nested run blocks a pending exception.  If there is a
+	 * previously injected event, the pending exception occurred while said
+	 * event was being delivered and thus needs to be handled.
+	 */
+	bool block_nested_exceptions = vmx->nested.nested_run_pending;
+	bool block_nested_events = block_nested_exceptions ||
+				   kvm_event_needs_reinjection(vcpu);
 
 	if (lapic_in_kernel(vcpu) &&
 		test_bit(KVM_APIC_INIT, &apic->pending_events)) {
@@ -3941,15 +3947,10 @@ static int vmx_check_nested_events(struct kvm_vcpu *vcpu)
 	 * Process exceptions that are higher priority than Monitor Trap Flag:
 	 * fault-like exceptions, TSS T flag #DB (not emulated by KVM, but
 	 * could theoretically come in from userspace), and ICEBP (INT1).
-	 *
-	 * Note that only a pending nested run can block a pending exception.
-	 * Otherwise an injected NMI/interrupt should either be
-	 * lost or delivered to the nested hypervisor in the IDT_VECTORING_INFO,
-	 * while delivering the pending exception.
 	 */
 	if (vcpu->arch.exception.pending &&
 	    !(vmx_get_pending_dbg_trap(vcpu) & ~DR6_BT)) {
-		if (vmx->nested.nested_run_pending)
+		if (block_nested_exceptions)
 			return -EBUSY;
 		if (!nested_vmx_check_exception(vcpu, &exit_qual))
 			goto no_vmexit;
@@ -3966,7 +3967,7 @@ static int vmx_check_nested_events(struct kvm_vcpu *vcpu)
 	}
 
 	if (vcpu->arch.exception.pending) {
-		if (vmx->nested.nested_run_pending)
+		if (block_nested_exceptions)
 			return -EBUSY;
 		if (!nested_vmx_check_exception(vcpu, &exit_qual))
 			goto no_vmexit;
-- 
2.36.1.476.g0c4daa206d-goog


^ permalink raw reply related	[flat|nested] 78+ messages in thread

* [PATCH v2 14/21] KVM: x86: Use kvm_queue_exception_e() to queue #DF
  2022-06-14 20:47 [PATCH v2 00/21] KVM: x86: Event/exception fixes and cleanups Sean Christopherson
                   ` (12 preceding siblings ...)
  2022-06-14 20:47 ` [PATCH v2 13/21] KVM: x86: Formalize blocking of nested pending exceptions Sean Christopherson
@ 2022-06-14 20:47 ` Sean Christopherson
  2022-07-06 12:04   ` Maxim Levitsky
  2022-06-14 20:47 ` [PATCH v2 15/21] KVM: x86: Hoist nested event checks above event injection logic Sean Christopherson
                   ` (8 subsequent siblings)
  22 siblings, 1 reply; 78+ messages in thread
From: Sean Christopherson @ 2022-06-14 20:47 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel, Oliver Upton, Peter Shier

Queue #DF by recursing on kvm_multiple_exception() by way of
kvm_queue_exception_e() instead of open coding the behavior.  This will
allow KVM to Just Work when a future commit moves exception interception
checks (for L2 => L1) into kvm_multiple_exception().

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/x86.c | 21 +++++++++------------
 1 file changed, 9 insertions(+), 12 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 511c0c8af80e..e45465075005 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -663,25 +663,22 @@ static void kvm_multiple_exception(struct kvm_vcpu *vcpu,
 	}
 	class1 = exception_class(prev_nr);
 	class2 = exception_class(nr);
-	if ((class1 == EXCPT_CONTRIBUTORY && class2 == EXCPT_CONTRIBUTORY)
-		|| (class1 == EXCPT_PF && class2 != EXCPT_BENIGN)) {
+	if ((class1 == EXCPT_CONTRIBUTORY && class2 == EXCPT_CONTRIBUTORY) ||
+	    (class1 == EXCPT_PF && class2 != EXCPT_BENIGN)) {
 		/*
-		 * Generate double fault per SDM Table 5-5.  Set
-		 * exception.pending = true so that the double fault
-		 * can trigger a nested vmexit.
+		 * Synthesize #DF.  Clear the previously injected or pending
+		 * exception so as not to incorrectly trigger shutdown.
 		 */
-		vcpu->arch.exception.pending = true;
 		vcpu->arch.exception.injected = false;
-		vcpu->arch.exception.has_error_code = true;
-		vcpu->arch.exception.vector = DF_VECTOR;
-		vcpu->arch.exception.error_code = 0;
-		vcpu->arch.exception.has_payload = false;
-		vcpu->arch.exception.payload = 0;
-	} else
+		vcpu->arch.exception.pending = false;
+
+		kvm_queue_exception_e(vcpu, DF_VECTOR, 0);
+	} else {
 		/* replace previous exception with a new one in a hope
 		   that instruction re-execution will regenerate lost
 		   exception */
 		goto queue;
+	}
 }
 
 void kvm_queue_exception(struct kvm_vcpu *vcpu, unsigned nr)
-- 
2.36.1.476.g0c4daa206d-goog


^ permalink raw reply related	[flat|nested] 78+ messages in thread

* [PATCH v2 15/21] KVM: x86: Hoist nested event checks above event injection logic
  2022-06-14 20:47 [PATCH v2 00/21] KVM: x86: Event/exception fixes and cleanups Sean Christopherson
                   ` (13 preceding siblings ...)
  2022-06-14 20:47 ` [PATCH v2 14/21] KVM: x86: Use kvm_queue_exception_e() to queue #DF Sean Christopherson
@ 2022-06-14 20:47 ` Sean Christopherson
  2022-07-06 12:05   ` Maxim Levitsky
  2022-06-14 20:47 ` [PATCH v2 16/21] KVM: x86: Evaluate ability to inject SMI/NMI/IRQ after potential VM-Exit Sean Christopherson
                   ` (7 subsequent siblings)
  22 siblings, 1 reply; 78+ messages in thread
From: Sean Christopherson @ 2022-06-14 20:47 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel, Oliver Upton, Peter Shier

Perform nested event checks before re-injecting exceptions/events into
L2.  If a pending exception causes VM-Exit to L1, re-injecting events
into vmcs02 is premature and wasted effort.  Take care to ensure events
that need to be re-injected are still re-injected if checking for nested
events "fails", i.e. if KVM needs to force an immediate entry+exit to
complete the to-be-re-injected event.

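Condensed, the reordered flow is as follows (an illustrative sketch of the
hunk below, not the verbatim code):

	/* 1) Nested events first; a VM-Exit to L1 supersedes re-injection. */
	r = is_guest_mode(vcpu) ? kvm_check_nested_events(vcpu) : 0;

	/* 2) Re-inject in-flight events even if an immediate exit is needed. */
	if (vcpu->arch.exception.injected)
		kvm_inject_exception(vcpu);
	else if (vcpu->arch.nmi_injected)
		static_call(kvm_x86_inject_nmi)(vcpu);
	else if (vcpu->arch.interrupt.injected)
		static_call(kvm_x86_inject_irq)(vcpu, true);

	/* 3) Only then honor a "force immediate entry+exit" request. */
	if (r < 0)
		goto out;
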
Keep the "can_inject" logic the same for now; it too can be pushed below
the nested checks, but is a slightly riskier change (see past bugs about
events not being properly purged on nested VM-Exit).

Add and/or modify comments to better document the various interactions.
Of note is the comment regarding "blocking" previously injected NMIs and
IRQs if an exception is pending.  The old comment isn't wrong strictly
speaking, but it failed to capture the reason why the logic even exists.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/x86.c | 89 +++++++++++++++++++++++++++-------------------
 1 file changed, 53 insertions(+), 36 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index e45465075005..930de833aa2b 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -9502,53 +9502,70 @@ static void kvm_inject_exception(struct kvm_vcpu *vcpu)
 
 static int inject_pending_event(struct kvm_vcpu *vcpu, bool *req_immediate_exit)
 {
+	bool can_inject = !kvm_event_needs_reinjection(vcpu);
 	int r;
-	bool can_inject = true;
 
-	/* try to reinject previous events if any */
+	/*
+	 * Process nested events first, as nested VM-Exit supersedes event
+	 * re-injection.  If there's an event queued for re-injection, it will
+	 * be saved into the appropriate vmc{b,s}12 fields on nested VM-Exit.
+	 */
+	if (is_guest_mode(vcpu))
+		r = kvm_check_nested_events(vcpu);
+	else
+		r = 0;
 
-	if (vcpu->arch.exception.injected) {
+	/*
+	 * Re-inject exceptions and events *especially* if immediate entry+exit
+	 * to/from L2 is needed, as any event that has already been injected
+	 * into L2 needs to complete its lifecycle before injecting a new event.
+	 *
+	 * Don't re-inject an NMI or interrupt if there is a pending exception.
+	 * This collision arises if an exception occurred while vectoring the
+	 * injected event, KVM intercepted said exception, and KVM ultimately
+	 * determined the fault belongs to the guest and queues the exception
+	 * for injection back into the guest.
+	 *
+	 * "Injected" interrupts can also collide with pending exceptions if
+	 * userspace ignores the "ready for injection" flag and blindly queues
+	 * an interrupt.  In that case, prioritizing the exception is correct,
+	 * as the exception "occurred" before the exit to userspace.  Trap-like
+	 * exceptions, e.g. most #DBs, have higher priority than interrupts.
+	 * And while fault-like exceptions, e.g. #GP and #PF, are the lowest
+	 * priority, they're only generated (pended) during instruction
+	 * execution, and interrupts are recognized at instruction boundaries.
+	 * Thus a pending fault-like exception means the fault occurred on the
+	 * *previous* instruction and must be serviced prior to recognizing any
+	 * new events in order to fully complete the previous instruction.
+	 */
+	if (vcpu->arch.exception.injected)
 		kvm_inject_exception(vcpu);
-		can_inject = false;
-	}
+	else if (vcpu->arch.exception.pending)
+		; /* see above */
+	else if (vcpu->arch.nmi_injected)
+		static_call(kvm_x86_inject_nmi)(vcpu);
+	else if (vcpu->arch.interrupt.injected)
+		static_call(kvm_x86_inject_irq)(vcpu, true);
+
 	/*
-	 * Do not inject an NMI or interrupt if there is a pending
-	 * exception.  Exceptions and interrupts are recognized at
-	 * instruction boundaries, i.e. the start of an instruction.
-	 * Trap-like exceptions, e.g. #DB, have higher priority than
-	 * NMIs and interrupts, i.e. traps are recognized before an
-	 * NMI/interrupt that's pending on the same instruction.
-	 * Fault-like exceptions, e.g. #GP and #PF, are the lowest
-	 * priority, but are only generated (pended) during instruction
-	 * execution, i.e. a pending fault-like exception means the
-	 * fault occurred on the *previous* instruction and must be
-	 * serviced prior to recognizing any new events in order to
-	 * fully complete the previous instruction.
+	 * Exceptions that morph to VM-Exits are handled above, and pending
+	 * exceptions on top of injected exceptions that do not VM-Exit should
+	 * either morph to #DF or, sadly, override the injected exception.
 	 */
-	else if (!vcpu->arch.exception.pending) {
-		if (vcpu->arch.nmi_injected) {
-			static_call(kvm_x86_inject_nmi)(vcpu);
-			can_inject = false;
-		} else if (vcpu->arch.interrupt.injected) {
-			static_call(kvm_x86_inject_irq)(vcpu, true);
-			can_inject = false;
-		}
-	}
-
 	WARN_ON_ONCE(vcpu->arch.exception.injected &&
 		     vcpu->arch.exception.pending);
 
 	/*
-	 * Call check_nested_events() even if we reinjected a previous event
-	 * in order for caller to determine if it should require immediate-exit
-	 * from L2 to L1 due to pending L1 events which require exit
-	 * from L2 to L1.
+	 * Bail if immediate entry+exit to/from the guest is needed to complete
+	 * nested VM-Enter or event re-injection so that a different pending
+	 * event can be serviced (or if KVM needs to exit to userspace).
+	 *
+	 * Otherwise, continue processing events even if VM-Exit occurred.  The
+	 * VM-Exit will have cleared exceptions that were meant for L2, but
+	 * there may now be events that can be injected into L1.
 	 */
-	if (is_guest_mode(vcpu)) {
-		r = kvm_check_nested_events(vcpu);
-		if (r < 0)
-			goto out;
-	}
+	if (r < 0)
+		goto out;
 
 	/* try to inject new event if pending */
 	if (vcpu->arch.exception.pending) {
-- 
2.36.1.476.g0c4daa206d-goog


^ permalink raw reply related	[flat|nested] 78+ messages in thread

* [PATCH v2 16/21] KVM: x86: Evaluate ability to inject SMI/NMI/IRQ after potential VM-Exit
  2022-06-14 20:47 [PATCH v2 00/21] KVM: x86: Event/exception fixes and cleanups Sean Christopherson
                   ` (14 preceding siblings ...)
  2022-06-14 20:47 ` [PATCH v2 15/21] KVM: x86: Hoist nested event checks above event injection logic Sean Christopherson
@ 2022-06-14 20:47 ` Sean Christopherson
  2022-07-06 12:05   ` Maxim Levitsky
  2022-06-14 20:47 ` [PATCH v2 17/21] KVM: x86: Morph pending exceptions to pending VM-Exits at queue time Sean Christopherson
                   ` (6 subsequent siblings)
  22 siblings, 1 reply; 78+ messages in thread
From: Sean Christopherson @ 2022-06-14 20:47 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel, Oliver Upton, Peter Shier

Determine whether or not new events can be injected after checking nested
events.  If a VM-Exit occurred during nested event handling, any previous
event that needed re-injection is gone from KVM's perspective; the event
is captured in the vmc*12 VM-Exit information, but doesn't exist in terms
of what needs to be done for entry to L1.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/x86.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 930de833aa2b..1a301a1730a5 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -9502,7 +9502,7 @@ static void kvm_inject_exception(struct kvm_vcpu *vcpu)
 
 static int inject_pending_event(struct kvm_vcpu *vcpu, bool *req_immediate_exit)
 {
-	bool can_inject = !kvm_event_needs_reinjection(vcpu);
+	bool can_inject;
 	int r;
 
 	/*
@@ -9567,7 +9567,13 @@ static int inject_pending_event(struct kvm_vcpu *vcpu, bool *req_immediate_exit)
 	if (r < 0)
 		goto out;
 
-	/* try to inject new event if pending */
+	/*
+	 * New events, other than exceptions, cannot be injected if KVM needs
+	 * to re-inject a previous event.  See above comments on re-injecting
+	 * for why pending exceptions get priority.
+	 */
+	can_inject = !kvm_event_needs_reinjection(vcpu);
+
 	if (vcpu->arch.exception.pending) {
 		/*
 		 * Fault-class exceptions, except #DBs, set RF=1 in the RFLAGS
-- 
2.36.1.476.g0c4daa206d-goog


^ permalink raw reply related	[flat|nested] 78+ messages in thread

* [PATCH v2 17/21] KVM: x86: Morph pending exceptions to pending VM-Exits at queue time
  2022-06-14 20:47 [PATCH v2 00/21] KVM: x86: Event/exception fixes and cleanups Sean Christopherson
                   ` (15 preceding siblings ...)
  2022-06-14 20:47 ` [PATCH v2 16/21] KVM: x86: Evaluate ability to inject SMI/NMI/IRQ after potential VM-Exit Sean Christopherson
@ 2022-06-14 20:47 ` Sean Christopherson
  2022-07-06 12:15   ` Maxim Levitsky
  2022-06-14 20:47 ` [PATCH v2 18/21] KVM: x86: Treat pending TRIPLE_FAULT requests as pending exceptions Sean Christopherson
                   ` (5 subsequent siblings)
  22 siblings, 1 reply; 78+ messages in thread
From: Sean Christopherson @ 2022-06-14 20:47 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel, Oliver Upton, Peter Shier

Morph pending exceptions to pending VM-Exits (due to interception) when
the exception is queued instead of waiting until nested events are
checked at VM-Entry.  This fixes a longstanding bug where KVM fails to
handle an exception that occurs during delivery of a previous exception,
KVM (L0) and L1 both want to intercept the exception (e.g. #PF for shadow
paging), and KVM determines that the exception is in the guest's domain,
i.e. queues the new exception for L2.  Deferring the interception check
causes KVM to escalate various combinations of injected+pending exceptions
to double fault (#DF) without consulting L1's interception desires, and
ends up injecting a spurious #DF into L2.

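At queue time, the new check boils down to the following (condensed from
the x86.c hunk below, shown here to make the long diff easier to navigate):

	/* In kvm_multiple_exception(), before touching vcpu->arch.exception. */
	if (!reinject && is_guest_mode(vcpu) &&
	    kvm_x86_ops.nested_ops->is_exception_vmexit(vcpu, nr, error_code)) {
		kvm_queue_exception_vmexit(vcpu, nr, has_error, error_code,
					   has_payload, payload);
		return;
	}
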
KVM has fudged around the issue for #PF by special casing emulated #PF
injection for shadow paging, but the underlying issue is not unique to
shadow paging in L0, e.g. if KVM is intercepting #PF because the guest
has a smaller maxphyaddr and L1 (but not L0) is using shadow paging.
Other exceptions are affected as well, e.g. if KVM is intercepting #GP
for one of SVM's workarounds or for the VMware backdoor emulation stuff.
The other cases have gone unnoticed because the #DF is spurious if and
only if L1 resolves the exception, e.g. KVM's goofs go unnoticed if L1
would have injected #DF anyways.

The hack-a-fix has also led to ugly code, e.g. bailing from the emulator
if #PF injection forced a nested VM-Exit and the emulator finds itself
back in L1.  Allowing for direct-to-VM-Exit queueing also neatly solves
the async #PF in L2 mess; no need to set a magic flag and token, simply
queue a #PF nested VM-Exit.

Deal with event migration by flagging that a pending exception was queued
by userspace and check for interception at the next KVM_RUN, e.g. so that
KVM does the right thing regardless of the order in which userspace
restores nested state vs. event state.

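For example, both (hypothetical) destination-side restore orders now work;
the ioctl names are the real KVM API, while vcpu_fd/nested/events are
placeholders:

	/* Order A: nested state first, then events. */
	ioctl(vcpu_fd, KVM_SET_NESTED_STATE, &nested);
	ioctl(vcpu_fd, KVM_SET_VCPU_EVENTS, &events);

	/* Order B: events first, then nested state. */
	ioctl(vcpu_fd, KVM_SET_VCPU_EVENTS, &events);
	ioctl(vcpu_fd, KVM_SET_NESTED_STATE, &nested);

	/*
	 * Either way, the interception check is deferred to the next KVM_RUN,
	 * i.e. runs only after all state has been restored.
	 */
	ioctl(vcpu_fd, KVM_RUN, 0);
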
When "getting" events from userspace, simply drop any pending exception
that is destined to be intercepted if there is also an injected exception
to be migrated.  Ideally, KVM would migrate both events, but that would
require new ABI, and practically speaking losing the event is unlikely to
be noticed, let alone fatal.  The injected exception is captured, RIP
still points at the original faulting instruction, etc...  So either the
injection on the target will trigger the same intercepted exception, or
the source of the intercepted exception was transient and/or
non-deterministic, thus dropping it is ok-ish.

Opportunistically add a gigantic comment above vmx_check_nested_events()
to document the priorities of all known events on Intel CPUs.  Kudos to
Jim Mattson for doing the hard work of collecting and interpreting the
priorities from various locations throughout the SDM (because putting
them all in one place in the SDM would be too easy).

Fixes: a04aead144fd ("KVM: nSVM: fix running nested guests when npt=0")
Fixes: feaf0c7dc473 ("KVM: nVMX: Do not generate #DF if #PF happens during exception delivery into L2")
Cc: Jim Mattson <jmattson@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/include/asm/kvm_host.h |  12 +-
 arch/x86/kvm/svm/nested.c       |  41 ++----
 arch/x86/kvm/vmx/nested.c       | 220 +++++++++++++++++++++-----------
 arch/x86/kvm/vmx/vmx.c          |   6 +-
 arch/x86/kvm/x86.c              | 159 ++++++++++++++++-------
 arch/x86/kvm/x86.h              |   7 +
 6 files changed, 287 insertions(+), 158 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 7f321d53a7e9..3bf7fdeeb25c 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -648,7 +648,6 @@ struct kvm_queued_exception {
 	u32 error_code;
 	unsigned long payload;
 	bool has_payload;
-	u8 nested_apf;
 };
 
 struct kvm_vcpu_arch {
@@ -750,8 +749,12 @@ struct kvm_vcpu_arch {
 
 	u8 event_exit_inst_len;
 
+	bool exception_from_userspace;
+
 	/* Exceptions to be injected to the guest. */
 	struct kvm_queued_exception exception;
+	/* Exception VM-Exits to be synthesized to L1. */
+	struct kvm_queued_exception exception_vmexit;
 
 	struct kvm_queued_interrupt {
 		bool injected;
@@ -861,7 +864,6 @@ struct kvm_vcpu_arch {
 		u32 id;
 		bool send_user_only;
 		u32 host_apf_flags;
-		unsigned long nested_apf_token;
 		bool delivery_as_pf_vmexit;
 		bool pageready_pending;
 	} apf;
@@ -1618,9 +1620,9 @@ struct kvm_x86_ops {
 
 struct kvm_x86_nested_ops {
 	void (*leave_nested)(struct kvm_vcpu *vcpu);
+	bool (*is_exception_vmexit)(struct kvm_vcpu *vcpu, u8 vector,
+				    u32 error_code);
 	int (*check_events)(struct kvm_vcpu *vcpu);
-	bool (*handle_page_fault_workaround)(struct kvm_vcpu *vcpu,
-					     struct x86_exception *fault);
 	bool (*hv_timer_pending)(struct kvm_vcpu *vcpu);
 	void (*triple_fault)(struct kvm_vcpu *vcpu);
 	int (*get_state)(struct kvm_vcpu *vcpu,
@@ -1847,7 +1849,7 @@ void kvm_queue_exception_p(struct kvm_vcpu *vcpu, unsigned nr, unsigned long pay
 void kvm_requeue_exception(struct kvm_vcpu *vcpu, unsigned nr);
 void kvm_requeue_exception_e(struct kvm_vcpu *vcpu, unsigned nr, u32 error_code);
 void kvm_inject_page_fault(struct kvm_vcpu *vcpu, struct x86_exception *fault);
-bool kvm_inject_emulated_page_fault(struct kvm_vcpu *vcpu,
+void kvm_inject_emulated_page_fault(struct kvm_vcpu *vcpu,
 				    struct x86_exception *fault);
 bool kvm_require_cpl(struct kvm_vcpu *vcpu, int required_cpl);
 bool kvm_require_dr(struct kvm_vcpu *vcpu, int dr);
diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
index 460161e67ce5..4075deefd132 100644
--- a/arch/x86/kvm/svm/nested.c
+++ b/arch/x86/kvm/svm/nested.c
@@ -55,28 +55,6 @@ static void nested_svm_inject_npf_exit(struct kvm_vcpu *vcpu,
 	nested_svm_vmexit(svm);
 }
 
-static bool nested_svm_handle_page_fault_workaround(struct kvm_vcpu *vcpu,
-						    struct x86_exception *fault)
-{
-	struct vcpu_svm *svm = to_svm(vcpu);
-	struct vmcb *vmcb = svm->vmcb;
-
- 	WARN_ON(!is_guest_mode(vcpu));
-
-	if (vmcb12_is_intercept(&svm->nested.ctl,
-				INTERCEPT_EXCEPTION_OFFSET + PF_VECTOR) &&
-	    !WARN_ON_ONCE(svm->nested.nested_run_pending)) {
-	     	vmcb->control.exit_code = SVM_EXIT_EXCP_BASE + PF_VECTOR;
-		vmcb->control.exit_code_hi = 0;
-		vmcb->control.exit_info_1 = fault->error_code;
-		vmcb->control.exit_info_2 = fault->address;
-		nested_svm_vmexit(svm);
-		return true;
-	}
-
-	return false;
-}
-
 static u64 nested_svm_get_tdp_pdptr(struct kvm_vcpu *vcpu, int index)
 {
 	struct vcpu_svm *svm = to_svm(vcpu);
@@ -1297,16 +1275,17 @@ int nested_svm_check_permissions(struct kvm_vcpu *vcpu)
 	return 0;
 }
 
-static bool nested_exit_on_exception(struct vcpu_svm *svm)
+static bool nested_svm_is_exception_vmexit(struct kvm_vcpu *vcpu, u8 vector,
+					   u32 error_code)
 {
-	unsigned int vector = svm->vcpu.arch.exception.vector;
+	struct vcpu_svm *svm = to_svm(vcpu);
 
 	return (svm->nested.ctl.intercepts[INTERCEPT_EXCEPTION] & BIT(vector));
 }
 
 static void nested_svm_inject_exception_vmexit(struct kvm_vcpu *vcpu)
 {
-	struct kvm_queued_exception *ex = &vcpu->arch.exception;
+	struct kvm_queued_exception *ex = &vcpu->arch.exception_vmexit;
 	struct vcpu_svm *svm = to_svm(vcpu);
 	struct vmcb *vmcb = svm->vmcb;
 
@@ -1368,15 +1347,19 @@ static int svm_check_nested_events(struct kvm_vcpu *vcpu)
 		return 0;
 	}
 
-	if (vcpu->arch.exception.pending) {
+	if (vcpu->arch.exception_vmexit.pending) {
 		if (block_nested_exceptions)
                         return -EBUSY;
-		if (!nested_exit_on_exception(svm))
-			return 0;
 		nested_svm_inject_exception_vmexit(vcpu);
 		return 0;
 	}
 
+	if (vcpu->arch.exception.pending) {
+		if (block_nested_exceptions)
+			return -EBUSY;
+		return 0;
+	}
+
 	if (vcpu->arch.smi_pending && !svm_smi_blocked(vcpu)) {
 		if (block_nested_events)
 			return -EBUSY;
@@ -1714,8 +1697,8 @@ static bool svm_get_nested_state_pages(struct kvm_vcpu *vcpu)
 
 struct kvm_x86_nested_ops svm_nested_ops = {
 	.leave_nested = svm_leave_nested,
+	.is_exception_vmexit = nested_svm_is_exception_vmexit,
 	.check_events = svm_check_nested_events,
-	.handle_page_fault_workaround = nested_svm_handle_page_fault_workaround,
 	.triple_fault = nested_svm_triple_fault,
 	.get_nested_state_pages = svm_get_nested_state_pages,
 	.get_state = svm_get_nested_state,
diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index 50fe66f0cc1b..53f6ea15081d 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -438,59 +438,22 @@ static bool nested_vmx_is_page_fault_vmexit(struct vmcs12 *vmcs12,
 	return inequality ^ bit;
 }
 
-
-/*
- * KVM wants to inject page-faults which it got to the guest. This function
- * checks whether in a nested guest, we need to inject them to L1 or L2.
- */
-static int nested_vmx_check_exception(struct kvm_vcpu *vcpu, unsigned long *exit_qual)
-{
-	struct kvm_queued_exception *ex = &vcpu->arch.exception;
-	struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
-
-	if (ex->vector == PF_VECTOR) {
-		if (ex->nested_apf) {
-			*exit_qual = vcpu->arch.apf.nested_apf_token;
-			return 1;
-		}
-		if (nested_vmx_is_page_fault_vmexit(vmcs12, ex->error_code)) {
-			*exit_qual = ex->has_payload ? ex->payload : vcpu->arch.cr2;
-			return 1;
-		}
-	} else if (vmcs12->exception_bitmap & (1u << ex->vector)) {
-		if (ex->vector == DB_VECTOR) {
-			if (ex->has_payload) {
-				*exit_qual = ex->payload;
-			} else {
-				*exit_qual = vcpu->arch.dr6;
-				*exit_qual &= ~DR6_BT;
-				*exit_qual ^= DR6_ACTIVE_LOW;
-			}
-		} else
-			*exit_qual = 0;
-		return 1;
-	}
-
-	return 0;
-}
-
-static bool nested_vmx_handle_page_fault_workaround(struct kvm_vcpu *vcpu,
-						    struct x86_exception *fault)
+static bool nested_vmx_is_exception_vmexit(struct kvm_vcpu *vcpu, u8 vector,
+					   u32 error_code)
 {
 	struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
 
-	WARN_ON(!is_guest_mode(vcpu));
+	/*
+	 * Drop bits 31:16 of the error code when performing the #PF mask+match
+	 * check.  All VMCS fields involved are 32 bits, but Intel CPUs never
+	 * set bits 31:16 and VMX disallows setting bits 31:16 in the injected
+	 * error code.  Including the to-be-dropped bits in the check might
+	 * result in an "impossible" or missed exit from L1's perspective.
+	 */
+	if (vector == PF_VECTOR)
+		return nested_vmx_is_page_fault_vmexit(vmcs12, (u16)error_code);
 
-	if (nested_vmx_is_page_fault_vmexit(vmcs12, fault->error_code) &&
-	    !WARN_ON_ONCE(to_vmx(vcpu)->nested.nested_run_pending)) {
-		vmcs12->vm_exit_intr_error_code = fault->error_code;
-		nested_vmx_vmexit(vcpu, EXIT_REASON_EXCEPTION_NMI,
-				  PF_VECTOR | INTR_TYPE_HARD_EXCEPTION |
-				  INTR_INFO_DELIVER_CODE_MASK | INTR_INFO_VALID_MASK,
-				  fault->address);
-		return true;
-	}
-	return false;
+	return (vmcs12->exception_bitmap & (1u << vector));
 }
 
 static int nested_vmx_check_io_bitmap_controls(struct kvm_vcpu *vcpu,
@@ -3823,12 +3786,24 @@ static int vmx_complete_nested_posted_interrupt(struct kvm_vcpu *vcpu)
 	return -ENXIO;
 }
 
-static void nested_vmx_inject_exception_vmexit(struct kvm_vcpu *vcpu,
-					       unsigned long exit_qual)
+static void nested_vmx_inject_exception_vmexit(struct kvm_vcpu *vcpu)
 {
-	struct kvm_queued_exception *ex = &vcpu->arch.exception;
+	struct kvm_queued_exception *ex = &vcpu->arch.exception_vmexit;
 	u32 intr_info = ex->vector | INTR_INFO_VALID_MASK;
 	struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
+	unsigned long exit_qual;
+
+	if (ex->has_payload) {
+		exit_qual = ex->payload;
+	} else if (ex->vector == PF_VECTOR) {
+		exit_qual = vcpu->arch.cr2;
+	} else if (ex->vector == DB_VECTOR) {
+		exit_qual = vcpu->arch.dr6;
+		exit_qual &= ~DR6_BT;
+		exit_qual ^= DR6_ACTIVE_LOW;
+	} else {
+		exit_qual = 0;
+	}
 
 	if (ex->has_error_code) {
 		/*
@@ -3870,14 +3845,24 @@ static void nested_vmx_inject_exception_vmexit(struct kvm_vcpu *vcpu,
  * from the emulator (because such #DBs are fault-like and thus don't trigger
  * actions that fire on instruction retire).
  */
-static inline unsigned long vmx_get_pending_dbg_trap(struct kvm_vcpu *vcpu)
+static unsigned long vmx_get_pending_dbg_trap(struct kvm_queued_exception *ex)
 {
-	if (!vcpu->arch.exception.pending ||
-	    vcpu->arch.exception.vector != DB_VECTOR)
+	if (!ex->pending || ex->vector != DB_VECTOR)
 		return 0;
 
 	/* General Detect #DBs are always fault-like. */
-	return vcpu->arch.exception.payload & ~DR6_BD;
+	return ex->payload & ~DR6_BD;
+}
+
+/*
+ * Returns true if there's a pending #DB exception that is lower priority than
+ * a pending Monitor Trap Flag VM-Exit.  TSS T-flag #DBs are not emulated by
+ * KVM, but could theoretically be injected by userspace.  Note, this code is
+ * imperfect, see above.
+ */
+static bool vmx_is_low_priority_db_trap(struct kvm_queued_exception *ex)
+{
+	return vmx_get_pending_dbg_trap(ex) & ~DR6_BT;
 }
 
 /*
@@ -3889,8 +3874,9 @@ static inline unsigned long vmx_get_pending_dbg_trap(struct kvm_vcpu *vcpu)
  */
 static void nested_vmx_update_pending_dbg(struct kvm_vcpu *vcpu)
 {
-	unsigned long pending_dbg = vmx_get_pending_dbg_trap(vcpu);
+	unsigned long pending_dbg;
 
+	pending_dbg = vmx_get_pending_dbg_trap(&vcpu->arch.exception);
 	if (pending_dbg)
 		vmcs_writel(GUEST_PENDING_DBG_EXCEPTIONS, pending_dbg);
 }
@@ -3901,11 +3887,93 @@ static bool nested_vmx_preemption_timer_pending(struct kvm_vcpu *vcpu)
 	       to_vmx(vcpu)->nested.preemption_timer_expired;
 }
 
+/*
+ * Per the Intel SDM's table "Priority Among Concurrent Events", with minor
+ * edits to fill in missing examples, e.g. #DB due to split-lock accesses,
+ * and less minor edits to splice in the priority of VMX Non-Root specific
+ * events, e.g. MTF and NMI/INTR-window exiting.
+ *
+ * 1 Hardware Reset and Machine Checks
+ *	- RESET
+ *	- Machine Check
+ *
+ * 2 Trap on Task Switch
+ *	- T flag in TSS is set (on task switch)
+ *
+ * 3 External Hardware Interventions
+ *	- FLUSH
+ *	- STOPCLK
+ *	- SMI
+ *	- INIT
+ *
+ * 3.5 Monitor Trap Flag (MTF) VM-exit[1]
+ *
+ * 4 Traps on Previous Instruction
+ *	- Breakpoints
+ *	- Trap-class Debug Exceptions (#DB due to TF flag set, data/I-O
+ *	  breakpoint, or #DB due to a split-lock access)
+ *
+ * 4.3	VMX-preemption timer expired VM-exit[2]
+ *
+ * 4.6	NMI-window exiting VM-exit[3]
+ *
+ * 5 Nonmaskable Interrupts (NMI)
+ *
+ * 5.5 Interrupt-window exiting VM-exit and Virtual-interrupt delivery[4]
+ *
+ * 6 Maskable Hardware Interrupts
+ *
+ * 7 Code Breakpoint Fault
+ *
+ * 8 Faults from Fetching Next Instruction
+ *	- Code-Segment Limit Violation
+ *	- Code Page Fault
+ *	- Control protection exception (missing ENDBRANCH at target of indirect
+ *					call or jump)
+ *
+ * 9 Faults from Decoding Next Instruction
+ *	- Instruction length > 15 bytes
+ *	- Invalid Opcode
+ *	- Coprocessor Not Available
+ *
+ *10 Faults on Executing Instruction
+ *	- Overflow
+ *	- Bound error
+ *	- Invalid TSS
+ *	- Segment Not Present
+ *	- Stack fault
+ *	- General Protection
+ *	- Data Page Fault
+ *	- Alignment Check
+ *	- x86 FPU Floating-point exception
+ *	- SIMD floating-point exception
+ *	- Virtualization exception
+ *	- Control protection exception
+ *
+ * [1] Per the "Monitor Trap Flag" section: System-management interrupts (SMIs),
+ *     INIT signals, and higher priority events take priority over MTF VM exits.
+ *     MTF VM exits take priority over debug-trap exceptions and lower priority
+ *     events.
+ *
+ * [2] Debug-trap exceptions and higher priority events take priority over VM exits
+ *     caused by the VMX-preemption timer.  VM exits caused by the VMX-preemption
+ *     timer take priority over VM exits caused by the "NMI-window exiting"
+ *     VM-execution control and lower priority events.
+ *
+ * [3] Debug-trap exceptions and higher priority events take priority over VM exits
+ *     caused by "NMI-window exiting".  VM exits caused by this control take
+ *     priority over non-maskable interrupts (NMIs) and lower priority events.
+ *
+ * [4] Virtual-interrupt delivery has the same priority as that of VM exits due to
+ *     the 1-setting of the "interrupt-window exiting" VM-execution control.  Thus,
+ *     non-maskable interrupts (NMIs) and higher priority events take priority over
+ *     delivery of a virtual interrupt; delivery of a virtual interrupt takes
+ *     priority over external interrupts and lower priority events.
+ */
 static int vmx_check_nested_events(struct kvm_vcpu *vcpu)
 {
 	struct kvm_lapic *apic = vcpu->arch.apic;
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
-	unsigned long exit_qual;
 	/*
 	 * Only a pending nested run blocks a pending exception.  If there is a
 	 * previously injected event, the pending exception occurred while said
@@ -3943,19 +4011,20 @@ static int vmx_check_nested_events(struct kvm_vcpu *vcpu)
 		/* Fallthrough, the SIPI is completely ignored. */
 	}
 
-	/*
-	 * Process exceptions that are higher priority than Monitor Trap Flag:
-	 * fault-like exceptions, TSS T flag #DB (not emulated by KVM, but
-	 * could theoretically come in from userspace), and ICEBP (INT1).
-	 */
+	if (vcpu->arch.exception_vmexit.pending &&
+	    !vmx_is_low_priority_db_trap(&vcpu->arch.exception_vmexit)) {
+		if (block_nested_exceptions)
+			return -EBUSY;
+
+		nested_vmx_inject_exception_vmexit(vcpu);
+		return 0;
+	}
+
 	if (vcpu->arch.exception.pending &&
-	    !(vmx_get_pending_dbg_trap(vcpu) & ~DR6_BT)) {
+	    !vmx_is_low_priority_db_trap(&vcpu->arch.exception)) {
 		if (block_nested_exceptions)
 			return -EBUSY;
-		if (!nested_vmx_check_exception(vcpu, &exit_qual))
-			goto no_vmexit;
-		nested_vmx_inject_exception_vmexit(vcpu, exit_qual);
-		return 0;
+		goto no_vmexit;
 	}
 
 	if (vmx->nested.mtf_pending) {
@@ -3966,13 +4035,18 @@ static int vmx_check_nested_events(struct kvm_vcpu *vcpu)
 		return 0;
 	}
 
+	if (vcpu->arch.exception_vmexit.pending) {
+		if (block_nested_exceptions)
+			return -EBUSY;
+
+		nested_vmx_inject_exception_vmexit(vcpu);
+		return 0;
+	}
+
 	if (vcpu->arch.exception.pending) {
 		if (block_nested_exceptions)
 			return -EBUSY;
-		if (!nested_vmx_check_exception(vcpu, &exit_qual))
-			goto no_vmexit;
-		nested_vmx_inject_exception_vmexit(vcpu, exit_qual);
-		return 0;
+		goto no_vmexit;
 	}
 
 	if (nested_vmx_preemption_timer_pending(vcpu)) {
@@ -6863,8 +6937,8 @@ __init int nested_vmx_hardware_setup(int (*exit_handlers[])(struct kvm_vcpu *))
 
 struct kvm_x86_nested_ops vmx_nested_ops = {
 	.leave_nested = vmx_leave_nested,
+	.is_exception_vmexit = nested_vmx_is_exception_vmexit,
 	.check_events = vmx_check_nested_events,
-	.handle_page_fault_workaround = nested_vmx_handle_page_fault_workaround,
 	.hv_timer_pending = nested_vmx_preemption_timer_pending,
 	.triple_fault = nested_vmx_triple_fault,
 	.get_state = vmx_get_nested_state,
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 7ef5659a1bbd..3591fdf7ecf9 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -1585,7 +1585,9 @@ static void vmx_update_emulated_instruction(struct kvm_vcpu *vcpu)
 	 */
 	if (nested_cpu_has_mtf(vmcs12) &&
 	    (!vcpu->arch.exception.pending ||
-	     vcpu->arch.exception.vector == DB_VECTOR))
+	     vcpu->arch.exception.vector == DB_VECTOR) &&
+	    (!vcpu->arch.exception_vmexit.pending ||
+	     vcpu->arch.exception_vmexit.vector == DB_VECTOR))
 		vmx->nested.mtf_pending = true;
 	else
 		vmx->nested.mtf_pending = false;
@@ -5624,7 +5626,7 @@ static bool vmx_emulation_required_with_pending_exception(struct kvm_vcpu *vcpu)
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
 
 	return vmx->emulation_required && !vmx->rmode.vm86_active &&
-	       (vcpu->arch.exception.pending || vcpu->arch.exception.injected);
+	       (kvm_is_exception_pending(vcpu) || vcpu->arch.exception.injected);
 }
 
 static int handle_invalid_guest_state(struct kvm_vcpu *vcpu)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 1a301a1730a5..63ee79da50df 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -609,6 +609,21 @@ void kvm_deliver_exception_payload(struct kvm_vcpu *vcpu,
 }
 EXPORT_SYMBOL_GPL(kvm_deliver_exception_payload);
 
+static void kvm_queue_exception_vmexit(struct kvm_vcpu *vcpu, unsigned int vector,
+				       bool has_error_code, u32 error_code,
+				       bool has_payload, unsigned long payload)
+{
+	struct kvm_queued_exception *ex = &vcpu->arch.exception_vmexit;
+
+	ex->vector = vector;
+	ex->injected = false;
+	ex->pending = true;
+	ex->has_error_code = has_error_code;
+	ex->error_code = error_code;
+	ex->has_payload = has_payload;
+	ex->payload = payload;
+}
+
 static void kvm_multiple_exception(struct kvm_vcpu *vcpu,
 		unsigned nr, bool has_error, u32 error_code,
 	        bool has_payload, unsigned long payload, bool reinject)
@@ -618,18 +633,31 @@ static void kvm_multiple_exception(struct kvm_vcpu *vcpu,
 
 	kvm_make_request(KVM_REQ_EVENT, vcpu);
 
+	/*
+	 * If the exception is destined for L2 and isn't being reinjected,
+	 * morph it to a VM-Exit if L1 wants to intercept the exception.  A
+	 * previously injected exception is not checked because it was checked
+	 * when it was originally queued, and re-checking is incorrect if _L1_
+	 * injected the exception, in which case it's exempt from interception.
+	 */
+	if (!reinject && is_guest_mode(vcpu) &&
+	    kvm_x86_ops.nested_ops->is_exception_vmexit(vcpu, nr, error_code)) {
+		kvm_queue_exception_vmexit(vcpu, nr, has_error, error_code,
+					   has_payload, payload);
+		return;
+	}
+
 	if (!vcpu->arch.exception.pending && !vcpu->arch.exception.injected) {
 	queue:
 		if (reinject) {
 			/*
-			 * On vmentry, vcpu->arch.exception.pending is only
-			 * true if an event injection was blocked by
-			 * nested_run_pending.  In that case, however,
-			 * vcpu_enter_guest requests an immediate exit,
-			 * and the guest shouldn't proceed far enough to
-			 * need reinjection.
+			 * On VM-Entry, an exception can be pending if and only
+			 * if event injection was blocked by nested_run_pending.
+			 * In that case, however, vcpu_enter_guest() requests an
+			 * immediate exit, and the guest shouldn't proceed far
+			 * enough to need reinjection.
 			 */
-			WARN_ON_ONCE(vcpu->arch.exception.pending);
+			WARN_ON_ONCE(kvm_is_exception_pending(vcpu));
 			vcpu->arch.exception.injected = true;
 			if (WARN_ON_ONCE(has_payload)) {
 				/*
@@ -732,20 +760,22 @@ static int complete_emulated_insn_gp(struct kvm_vcpu *vcpu, int err)
 void kvm_inject_page_fault(struct kvm_vcpu *vcpu, struct x86_exception *fault)
 {
 	++vcpu->stat.pf_guest;
-	vcpu->arch.exception.nested_apf =
-		is_guest_mode(vcpu) && fault->async_page_fault;
-	if (vcpu->arch.exception.nested_apf) {
-		vcpu->arch.apf.nested_apf_token = fault->address;
-		kvm_queue_exception_e(vcpu, PF_VECTOR, fault->error_code);
-	} else {
+
+	/*
+	 * Async #PF in L2 is always forwarded to L1 as a VM-Exit regardless of
+	 * whether or not L1 wants to intercept "regular" #PF.
+	 */
+	if (is_guest_mode(vcpu) && fault->async_page_fault)
+		kvm_queue_exception_vmexit(vcpu, PF_VECTOR,
+					   true, fault->error_code,
+					   true, fault->address);
+	else
 		kvm_queue_exception_e_p(vcpu, PF_VECTOR, fault->error_code,
 					fault->address);
-	}
 }
 EXPORT_SYMBOL_GPL(kvm_inject_page_fault);
 
-/* Returns true if the page fault was immediately morphed into a VM-Exit. */
-bool kvm_inject_emulated_page_fault(struct kvm_vcpu *vcpu,
+void kvm_inject_emulated_page_fault(struct kvm_vcpu *vcpu,
 				    struct x86_exception *fault)
 {
 	struct kvm_mmu *fault_mmu;
@@ -763,26 +793,7 @@ bool kvm_inject_emulated_page_fault(struct kvm_vcpu *vcpu,
 		kvm_mmu_invalidate_gva(vcpu, fault_mmu, fault->address,
 				       fault_mmu->root.hpa);
 
-	/*
-	 * A workaround for KVM's bad exception handling.  If KVM injected an
-	 * exception into L2, and L2 encountered a #PF while vectoring the
-	 * injected exception, manually check to see if L1 wants to intercept
-	 * #PF, otherwise queuing the #PF will lead to #DF or a lost exception.
-	 * In all other cases, defer the check to nested_ops->check_events(),
-	 * which will correctly handle priority (this does not).  Note, other
-	 * exceptions, e.g. #GP, are theoretically affected, #PF is simply the
-	 * most problematic, e.g. when L0 and L1 are both intercepting #PF for
-	 * shadow paging.
-	 *
-	 * TODO: Rewrite exception handling to track injected and pending
-	 *       (VM-Exit) exceptions separately.
-	 */
-	if (unlikely(vcpu->arch.exception.injected && is_guest_mode(vcpu)) &&
-	    kvm_x86_ops.nested_ops->handle_page_fault_workaround(vcpu, fault))
-		return true;
-
 	fault_mmu->inject_page_fault(vcpu, fault);
-	return false;
 }
 EXPORT_SYMBOL_GPL(kvm_inject_emulated_page_fault);
 
@@ -4752,7 +4763,7 @@ static int kvm_vcpu_ready_for_interrupt_injection(struct kvm_vcpu *vcpu)
 	return (kvm_arch_interrupt_allowed(vcpu) &&
 		kvm_cpu_accept_dm_intr(vcpu) &&
 		!kvm_event_needs_reinjection(vcpu) &&
-		!vcpu->arch.exception.pending);
+		!kvm_is_exception_pending(vcpu));
 }
 
 static int kvm_vcpu_ioctl_interrupt(struct kvm_vcpu *vcpu,
@@ -4881,13 +4892,27 @@ static int kvm_vcpu_ioctl_x86_set_mce(struct kvm_vcpu *vcpu,
 static void kvm_vcpu_ioctl_x86_get_vcpu_events(struct kvm_vcpu *vcpu,
 					       struct kvm_vcpu_events *events)
 {
-	struct kvm_queued_exception *ex = &vcpu->arch.exception;
+	struct kvm_queued_exception *ex;
 
 	process_nmi(vcpu);
 
 	if (kvm_check_request(KVM_REQ_SMI, vcpu))
 		process_smi(vcpu);
 
+	/*
+	 * KVM's ABI only allows for one exception to be migrated.  Luckily,
+	 * the only time there can be two queued exceptions is if there's a
+	 * non-exiting _injected_ exception, and a pending exiting exception.
+	 * In that case, ignore the VM-Exiting exception as it's an extension
+	 * of the injected exception.
+	 */
+	if (vcpu->arch.exception_vmexit.pending &&
+	    !vcpu->arch.exception.pending &&
+	    !vcpu->arch.exception.injected)
+		ex = &vcpu->arch.exception_vmexit;
+	else
+		ex = &vcpu->arch.exception;
+
 	/*
 	 * In guest mode, payload delivery should be deferred if the exception
 	 * will be intercepted by L1, e.g. KVM should not modify CR2 if L1
@@ -4994,6 +5019,19 @@ static int kvm_vcpu_ioctl_x86_set_vcpu_events(struct kvm_vcpu *vcpu,
 		return -EINVAL;
 
 	process_nmi(vcpu);
+
+	/*
+	 * Flag that userspace is stuffing an exception; the next KVM_RUN will
+	 * morph the exception to a VM-Exit if appropriate.  Do this only for
+	 * pending exceptions; already-injected exceptions are not subject to
+	 * interception.  Note, userspace that conflates pending and injected
+	 * is hosed, and will incorrectly convert an injected exception into a
+	 * pending exception, which in turn may cause a spurious VM-Exit.
+	 */
+	vcpu->arch.exception_from_userspace = events->exception.pending;
+
+	vcpu->arch.exception_vmexit.pending = false;
+
 	vcpu->arch.exception.injected = events->exception.injected;
 	vcpu->arch.exception.pending = events->exception.pending;
 	vcpu->arch.exception.vector = events->exception.nr;
@@ -7977,18 +8015,17 @@ static void toggle_interruptibility(struct kvm_vcpu *vcpu, u32 mask)
 	}
 }
 
-static bool inject_emulated_exception(struct kvm_vcpu *vcpu)
+static void inject_emulated_exception(struct kvm_vcpu *vcpu)
 {
 	struct x86_emulate_ctxt *ctxt = vcpu->arch.emulate_ctxt;
+
 	if (ctxt->exception.vector == PF_VECTOR)
-		return kvm_inject_emulated_page_fault(vcpu, &ctxt->exception);
-
-	if (ctxt->exception.error_code_valid)
+		kvm_inject_emulated_page_fault(vcpu, &ctxt->exception);
+	else if (ctxt->exception.error_code_valid)
 		kvm_queue_exception_e(vcpu, ctxt->exception.vector,
 				      ctxt->exception.error_code);
 	else
 		kvm_queue_exception(vcpu, ctxt->exception.vector);
-	return false;
 }
 
 static struct x86_emulate_ctxt *alloc_emulate_ctxt(struct kvm_vcpu *vcpu)
@@ -8601,8 +8638,7 @@ int x86_emulate_instruction(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
 
 	if (ctxt->have_exception) {
 		r = 1;
-		if (inject_emulated_exception(vcpu))
-			return r;
+		inject_emulated_exception(vcpu);
 	} else if (vcpu->arch.pio.count) {
 		if (!vcpu->arch.pio.in) {
 			/* FIXME: return into emulator if single-stepping.  */
@@ -9540,7 +9576,7 @@ static int inject_pending_event(struct kvm_vcpu *vcpu, bool *req_immediate_exit)
 	 */
 	if (vcpu->arch.exception.injected)
 		kvm_inject_exception(vcpu);
-	else if (vcpu->arch.exception.pending)
+	else if (kvm_is_exception_pending(vcpu))
 		; /* see above */
 	else if (vcpu->arch.nmi_injected)
 		static_call(kvm_x86_inject_nmi)(vcpu);
@@ -9567,6 +9603,14 @@ static int inject_pending_event(struct kvm_vcpu *vcpu, bool *req_immediate_exit)
 	if (r < 0)
 		goto out;
 
+	/*
+	 * A pending exception VM-Exit should either result in nested VM-Exit
+	 * or force an immediate re-entry and exit to/from L2, and exception
+	 * VM-Exits cannot be injected (flag should _never_ be set).
+	 */
+	WARN_ON_ONCE(vcpu->arch.exception_vmexit.injected ||
+		     vcpu->arch.exception_vmexit.pending);
+
 	/*
 	 * New events, other than exceptions, cannot be injected if KVM needs
 	 * to re-inject a previous event.  See above comments on re-injecting
@@ -9666,7 +9710,7 @@ static int inject_pending_event(struct kvm_vcpu *vcpu, bool *req_immediate_exit)
 	    kvm_x86_ops.nested_ops->hv_timer_pending(vcpu))
 		*req_immediate_exit = true;
 
-	WARN_ON(vcpu->arch.exception.pending);
+	WARN_ON(kvm_is_exception_pending(vcpu));
 	return 0;
 
 out:
@@ -10680,6 +10724,7 @@ static void kvm_put_guest_fpu(struct kvm_vcpu *vcpu)
 
 int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
 {
+	struct kvm_queued_exception *ex = &vcpu->arch.exception;
 	struct kvm_run *kvm_run = vcpu->run;
 	int r;
 
@@ -10738,6 +10783,21 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
 		}
 	}
 
+	/*
+	 * If userspace set a pending exception and L2 is active, convert it to
+	 * a pending VM-Exit if L1 wants to intercept the exception.
+	 */
+	if (vcpu->arch.exception_from_userspace && is_guest_mode(vcpu) &&
+	    kvm_x86_ops.nested_ops->is_exception_vmexit(vcpu, ex->vector,
+							ex->error_code)) {
+		kvm_queue_exception_vmexit(vcpu, ex->vector,
+					   ex->has_error_code, ex->error_code,
+					   ex->has_payload, ex->payload);
+		ex->injected = false;
+		ex->pending = false;
+	}
+	vcpu->arch.exception_from_userspace = false;
+
 	if (unlikely(vcpu->arch.complete_userspace_io)) {
 		int (*cui)(struct kvm_vcpu *) = vcpu->arch.complete_userspace_io;
 		vcpu->arch.complete_userspace_io = NULL;
@@ -10842,6 +10902,7 @@ static void __set_regs(struct kvm_vcpu *vcpu, struct kvm_regs *regs)
 	kvm_set_rflags(vcpu, regs->rflags | X86_EFLAGS_FIXED);
 
 	vcpu->arch.exception.pending = false;
+	vcpu->arch.exception_vmexit.pending = false;
 
 	kvm_make_request(KVM_REQ_EVENT, vcpu);
 }
@@ -11209,7 +11270,7 @@ int kvm_arch_vcpu_ioctl_set_guest_debug(struct kvm_vcpu *vcpu,
 
 	if (dbg->control & (KVM_GUESTDBG_INJECT_DB | KVM_GUESTDBG_INJECT_BP)) {
 		r = -EBUSY;
-		if (vcpu->arch.exception.pending)
+		if (kvm_is_exception_pending(vcpu))
 			goto out;
 		if (dbg->control & KVM_GUESTDBG_INJECT_DB)
 			kvm_queue_exception(vcpu, DB_VECTOR);
@@ -12387,7 +12448,7 @@ static inline bool kvm_vcpu_has_events(struct kvm_vcpu *vcpu)
 	if (vcpu->arch.pv.pv_unhalted)
 		return true;
 
-	if (vcpu->arch.exception.pending)
+	if (kvm_is_exception_pending(vcpu))
 		return true;
 
 	if (kvm_test_request(KVM_REQ_NMI, vcpu) ||
@@ -12641,7 +12702,7 @@ bool kvm_can_do_async_pf(struct kvm_vcpu *vcpu)
 {
 	if (unlikely(!lapic_in_kernel(vcpu) ||
 		     kvm_event_needs_reinjection(vcpu) ||
-		     vcpu->arch.exception.pending))
+		     kvm_is_exception_pending(vcpu)))
 		return false;
 
 	if (kvm_hlt_in_guest(vcpu->kvm) && !kvm_can_deliver_async_pf(vcpu))
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index dc2af0146220..eee259e387d3 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -82,10 +82,17 @@ static inline unsigned int __shrink_ple_window(unsigned int val,
 void kvm_service_local_tlb_flush_requests(struct kvm_vcpu *vcpu);
 int kvm_check_nested_events(struct kvm_vcpu *vcpu);
 
+static inline bool kvm_is_exception_pending(struct kvm_vcpu *vcpu)
+{
+	return vcpu->arch.exception.pending ||
+	       vcpu->arch.exception_vmexit.pending;
+}
+
 static inline void kvm_clear_exception_queue(struct kvm_vcpu *vcpu)
 {
 	vcpu->arch.exception.pending = false;
 	vcpu->arch.exception.injected = false;
+	vcpu->arch.exception_vmexit.pending = false;
 }
 
 static inline void kvm_queue_interrupt(struct kvm_vcpu *vcpu, u8 vector,
-- 
2.36.1.476.g0c4daa206d-goog


^ permalink raw reply related	[flat|nested] 78+ messages in thread

* [PATCH v2 18/21] KVM: x86: Treat pending TRIPLE_FAULT requests as pending exceptions
  2022-06-14 20:47 [PATCH v2 00/21] KVM: x86: Event/exception fixes and cleanups Sean Christopherson
                   ` (16 preceding siblings ...)
  2022-06-14 20:47 ` [PATCH v2 17/21] KVM: x86: Morph pending exceptions to pending VM-Exits at queue time Sean Christopherson
@ 2022-06-14 20:47 ` Sean Christopherson
  2022-07-06 12:16   ` Maxim Levitsky
  2022-06-14 20:47 ` [PATCH v2 19/21] KVM: VMX: Update MTF and ICEBP comments to document KVM's subtle behavior Sean Christopherson
                   ` (4 subsequent siblings)
  22 siblings, 1 reply; 78+ messages in thread
From: Sean Christopherson @ 2022-06-14 20:47 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel, Oliver Upton, Peter Shier

Treat pending TRIPLE_FAULTS as pending exceptions.  A triple fault is an
exception for all intents and purposes; it's just not tracked as such
because there's no vector associated with the exception.  E.g. if userspace
were to set vcpu->request_interrupt_window while running L2 and L2 hit a
triple fault, a triple fault nested VM-Exit should be synthesized to L1
before exiting to userspace with KVM_EXIT_IRQ_WINDOW_OPEN.

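A minimal sketch of the userspace side of that scenario (the ioctl and the
exit reason are the real API; vcpu_fd and the mmap'd run-struct setup are
omitted/hypothetical):

	struct kvm_run *run = vcpu_run;		/* mmap of the vCPU fd */

	run->request_interrupt_window = 1;
	ioctl(vcpu_fd, KVM_RUN, 0);		/* L2 hits a triple fault */

	/*
	 * With this change, the triple fault nested VM-Exit is synthesized to
	 * L1 before KVM bails, so userspace observes the exit with L1 active.
	 */
	assert(run->exit_reason == KVM_EXIT_IRQ_WINDOW_OPEN);
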
Link: https://lore.kernel.org/all/YoVHAIGcFgJit1qp@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/x86.c | 3 ---
 arch/x86/kvm/x86.h | 3 ++-
 2 files changed, 2 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 63ee79da50df..8e54a074b7ff 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -12477,9 +12477,6 @@ static inline bool kvm_vcpu_has_events(struct kvm_vcpu *vcpu)
 	if (kvm_xen_has_pending_events(vcpu))
 		return true;
 
-	if (kvm_test_request(KVM_REQ_TRIPLE_FAULT, vcpu))
-		return true;
-
 	return false;
 }
 
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index eee259e387d3..078765287ec6 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -85,7 +85,8 @@ int kvm_check_nested_events(struct kvm_vcpu *vcpu);
 static inline bool kvm_is_exception_pending(struct kvm_vcpu *vcpu)
 {
 	return vcpu->arch.exception.pending ||
-	       vcpu->arch.exception_vmexit.pending;
+	       vcpu->arch.exception_vmexit.pending ||
+	       kvm_test_request(KVM_REQ_TRIPLE_FAULT, vcpu);
 }
 
 static inline void kvm_clear_exception_queue(struct kvm_vcpu *vcpu)
-- 
2.36.1.476.g0c4daa206d-goog


^ permalink raw reply related	[flat|nested] 78+ messages in thread

* [PATCH v2 19/21] KVM: VMX: Update MTF and ICEBP comments to document KVM's subtle behavior
  2022-06-14 20:47 [PATCH v2 00/21] KVM: x86: Event/exception fixes and cleanups Sean Christopherson
                   ` (17 preceding siblings ...)
  2022-06-14 20:47 ` [PATCH v2 18/21] KVM: x86: Treat pending TRIPLE_FAULT requests as pending exceptions Sean Christopherson
@ 2022-06-14 20:47 ` Sean Christopherson
  2022-06-14 20:47 ` [PATCH v2 20/21] KVM: selftests: Use uapi header to get VMX and SVM exit reasons/codes Sean Christopherson
                   ` (3 subsequent siblings)
  22 siblings, 0 replies; 78+ messages in thread
From: Sean Christopherson @ 2022-06-14 20:47 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel, Oliver Upton, Peter Shier

Document the oddities of ICEBP interception (trap-like #DB is intercepted
as a fault-like exception), and how using VMX's inner "skip" helper
deliberately bypasses the pending MTF and single-step #DB logic.

No functional change intended.
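
For reference, the ICEBP handling being documented boils down to this
snippet in handle_exception_nmi() (abbreviated here, not a literal copy):

	case DB_VECTOR:
		...
		if (is_icebp(intr_info))
			WARN_ON(!skip_emulated_instruction(vcpu));

where skip_emulated_instruction() is the inner VMX helper that advances
RIP and clears STI/MOVSS blocking without the single-step #DB and MTF
updates performed on the common skip path.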

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/vmx/vmx.c | 16 +++++++++++-----
 1 file changed, 11 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 3591fdf7ecf9..91b8e171f232 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -1578,9 +1578,13 @@ static void vmx_update_emulated_instruction(struct kvm_vcpu *vcpu)
 
 	/*
 	 * Per the SDM, MTF takes priority over debug-trap exceptions besides
-	 * T-bit traps. As instruction emulation is completed (i.e. at the
-	 * instruction boundary), any #DB exception pending delivery must be a
-	 * debug-trap. Record the pending MTF state to be delivered in
+	 * TSS T-bit traps and ICEBP (INT1).  KVM doesn't emulate T-bit traps
+	 * or ICEBP (in the emulator proper), and skipping of ICEBP after an
+	 * intercepted #DB deliberately avoids single-step #DB and MTF updates
+	 * as ICEBP is higher priority than both.  As instruction emulation is
+	 * completed at this point (i.e. KVM is at the instruction boundary),
+	 * any #DB exception pending delivery must be a debug-trap of lower
+	 * priority than MTF.  Record the pending MTF state to be delivered in
 	 * vmx_check_nested_events().
 	 */
 	if (nested_cpu_has_mtf(vmcs12) &&
@@ -5071,8 +5075,10 @@ static int handle_exception_nmi(struct kvm_vcpu *vcpu)
 			 * instruction.  ICEBP generates a trap-like #DB, but
 			 * despite its interception control being tied to #DB,
 			 * is an instruction intercept, i.e. the VM-Exit occurs
-			 * on the ICEBP itself.  Note, skipping ICEBP also
-			 * clears STI and MOVSS blocking.
+			 * on the ICEBP itself.  Use the inner "skip" helper to
+			 * avoid single-step #DB and MTF updates, as ICEBP is
+			 * higher priority.  Note, skipping ICEBP still clears
+			 * STI and MOVSS blocking.
 			 *
 			 * For all other #DBs, set vmcs.PENDING_DBG_EXCEPTIONS.BS
 			 * if single-step is enabled in RFLAGS and STI or MOVSS
-- 
2.36.1.476.g0c4daa206d-goog


^ permalink raw reply related	[flat|nested] 78+ messages in thread

* [PATCH v2 20/21] KVM: selftests: Use uapi header to get VMX and SVM exit reasons/codes
  2022-06-14 20:47 [PATCH v2 00/21] KVM: x86: Event/exception fixes and cleanups Sean Christopherson
                   ` (18 preceding siblings ...)
  2022-06-14 20:47 ` [PATCH v2 19/21] KVM: VMX: Update MTF and ICEBP comments to document KVM's subtle behavior Sean Christopherson
@ 2022-06-14 20:47 ` Sean Christopherson
  2022-07-06 12:16   ` Maxim Levitsky
  2022-06-14 20:47 ` [PATCH v2 21/21] KVM: selftests: Add an x86-only test to verify nested exception queueing Sean Christopherson
                   ` (2 subsequent siblings)
  22 siblings, 1 reply; 78+ messages in thread
From: Sean Christopherson @ 2022-06-14 20:47 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel, Oliver Upton, Peter Shier

Include the vmx.h and svm.h uapi headers that KVM so kindly provides
instead of manually defining all the same exit reasons/codes.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 .../selftests/kvm/include/x86_64/svm_util.h   |  7 +--
 .../selftests/kvm/include/x86_64/vmx.h        | 51 +------------------
 2 files changed, 4 insertions(+), 54 deletions(-)

diff --git a/tools/testing/selftests/kvm/include/x86_64/svm_util.h b/tools/testing/selftests/kvm/include/x86_64/svm_util.h
index a339b537a575..7aee6244ab6a 100644
--- a/tools/testing/selftests/kvm/include/x86_64/svm_util.h
+++ b/tools/testing/selftests/kvm/include/x86_64/svm_util.h
@@ -9,15 +9,12 @@
 #ifndef SELFTEST_KVM_SVM_UTILS_H
 #define SELFTEST_KVM_SVM_UTILS_H
 
+#include <asm/svm.h>
+
 #include <stdint.h>
 #include "svm.h"
 #include "processor.h"
 
-#define SVM_EXIT_EXCP_BASE	0x040
-#define SVM_EXIT_HLT		0x078
-#define SVM_EXIT_MSR		0x07c
-#define SVM_EXIT_VMMCALL	0x081
-
 struct svm_test_data {
 	/* VMCB */
 	struct vmcb *vmcb; /* gva */
diff --git a/tools/testing/selftests/kvm/include/x86_64/vmx.h b/tools/testing/selftests/kvm/include/x86_64/vmx.h
index 99fa1410964c..e4206f69b716 100644
--- a/tools/testing/selftests/kvm/include/x86_64/vmx.h
+++ b/tools/testing/selftests/kvm/include/x86_64/vmx.h
@@ -8,6 +8,8 @@
 #ifndef SELFTEST_KVM_VMX_H
 #define SELFTEST_KVM_VMX_H
 
+#include <asm/vmx.h>
+
 #include <stdint.h>
 #include "processor.h"
 #include "apic.h"
@@ -100,55 +102,6 @@
 #define VMX_EPT_VPID_CAP_AD_BITS		0x00200000
 
 #define EXIT_REASON_FAILED_VMENTRY	0x80000000
-#define EXIT_REASON_EXCEPTION_NMI	0
-#define EXIT_REASON_EXTERNAL_INTERRUPT	1
-#define EXIT_REASON_TRIPLE_FAULT	2
-#define EXIT_REASON_INTERRUPT_WINDOW	7
-#define EXIT_REASON_NMI_WINDOW		8
-#define EXIT_REASON_TASK_SWITCH		9
-#define EXIT_REASON_CPUID		10
-#define EXIT_REASON_HLT			12
-#define EXIT_REASON_INVD		13
-#define EXIT_REASON_INVLPG		14
-#define EXIT_REASON_RDPMC		15
-#define EXIT_REASON_RDTSC		16
-#define EXIT_REASON_VMCALL		18
-#define EXIT_REASON_VMCLEAR		19
-#define EXIT_REASON_VMLAUNCH		20
-#define EXIT_REASON_VMPTRLD		21
-#define EXIT_REASON_VMPTRST		22
-#define EXIT_REASON_VMREAD		23
-#define EXIT_REASON_VMRESUME		24
-#define EXIT_REASON_VMWRITE		25
-#define EXIT_REASON_VMOFF		26
-#define EXIT_REASON_VMON		27
-#define EXIT_REASON_CR_ACCESS		28
-#define EXIT_REASON_DR_ACCESS		29
-#define EXIT_REASON_IO_INSTRUCTION	30
-#define EXIT_REASON_MSR_READ		31
-#define EXIT_REASON_MSR_WRITE		32
-#define EXIT_REASON_INVALID_STATE	33
-#define EXIT_REASON_MWAIT_INSTRUCTION	36
-#define EXIT_REASON_MONITOR_INSTRUCTION 39
-#define EXIT_REASON_PAUSE_INSTRUCTION	40
-#define EXIT_REASON_MCE_DURING_VMENTRY	41
-#define EXIT_REASON_TPR_BELOW_THRESHOLD 43
-#define EXIT_REASON_APIC_ACCESS		44
-#define EXIT_REASON_EOI_INDUCED		45
-#define EXIT_REASON_EPT_VIOLATION	48
-#define EXIT_REASON_EPT_MISCONFIG	49
-#define EXIT_REASON_INVEPT		50
-#define EXIT_REASON_RDTSCP		51
-#define EXIT_REASON_PREEMPTION_TIMER	52
-#define EXIT_REASON_INVVPID		53
-#define EXIT_REASON_WBINVD		54
-#define EXIT_REASON_XSETBV		55
-#define EXIT_REASON_APIC_WRITE		56
-#define EXIT_REASON_INVPCID		58
-#define EXIT_REASON_PML_FULL		62
-#define EXIT_REASON_XSAVES		63
-#define EXIT_REASON_XRSTORS		64
-#define LAST_EXIT_REASON		64
 
 enum vmcs_field {
 	VIRTUAL_PROCESSOR_ID		= 0x00000000,
-- 
2.36.1.476.g0c4daa206d-goog


^ permalink raw reply related	[flat|nested] 78+ messages in thread

* [PATCH v2 21/21] KVM: selftests: Add an x86-only test to verify nested exception queueing
  2022-06-14 20:47 [PATCH v2 00/21] KVM: x86: Event/exception fixes and cleanups Sean Christopherson
                   ` (19 preceding siblings ...)
  2022-06-14 20:47 ` [PATCH v2 20/21] KVM: selftests: Use uapi header to get VMX and SVM exit reasons/codes Sean Christopherson
@ 2022-06-14 20:47 ` Sean Christopherson
  2022-07-06 12:17   ` Maxim Levitsky
  2022-06-16 13:16 ` [PATCH v2 00/21] KVM: x86: Event/exception fixes and cleanups Maxim Levitsky
  2022-06-29 11:16 ` Maxim Levitsky
  22 siblings, 1 reply; 78+ messages in thread
From: Sean Christopherson @ 2022-06-14 20:47 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel, Oliver Upton, Peter Shier

Add a test to verify that KVM_{G,S}ET_EVENTS play nice with pending vs.
injected exceptions when an exception is being queued for L2, and that
KVM correctly handles L1's exception intercept wants.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 tools/testing/selftests/kvm/.gitignore        |   1 +
 tools/testing/selftests/kvm/Makefile          |   1 +
 .../kvm/x86_64/nested_exceptions_test.c       | 295 ++++++++++++++++++
 3 files changed, 297 insertions(+)
 create mode 100644 tools/testing/selftests/kvm/x86_64/nested_exceptions_test.c

diff --git a/tools/testing/selftests/kvm/.gitignore b/tools/testing/selftests/kvm/.gitignore
index 0ab0e255d292..7c8adb8cff83 100644
--- a/tools/testing/selftests/kvm/.gitignore
+++ b/tools/testing/selftests/kvm/.gitignore
@@ -27,6 +27,7 @@
 /x86_64/hyperv_svm_test
 /x86_64/max_vcpuid_cap_test
 /x86_64/mmio_warning_test
+/x86_64/nested_exceptions_test
 /x86_64/platform_info_test
 /x86_64/pmu_event_filter_test
 /x86_64/set_boot_cpu_id
diff --git a/tools/testing/selftests/kvm/Makefile b/tools/testing/selftests/kvm/Makefile
index 2ca5400220b9..6db2dd5eca96 100644
--- a/tools/testing/selftests/kvm/Makefile
+++ b/tools/testing/selftests/kvm/Makefile
@@ -83,6 +83,7 @@ TEST_GEN_PROGS_x86_64 += x86_64/hyperv_svm_test
 TEST_GEN_PROGS_x86_64 += x86_64/kvm_clock_test
 TEST_GEN_PROGS_x86_64 += x86_64/kvm_pv_test
 TEST_GEN_PROGS_x86_64 += x86_64/mmio_warning_test
+TEST_GEN_PROGS_x86_64 += x86_64/nested_exceptions_test
 TEST_GEN_PROGS_x86_64 += x86_64/platform_info_test
 TEST_GEN_PROGS_x86_64 += x86_64/pmu_event_filter_test
 TEST_GEN_PROGS_x86_64 += x86_64/set_boot_cpu_id
diff --git a/tools/testing/selftests/kvm/x86_64/nested_exceptions_test.c b/tools/testing/selftests/kvm/x86_64/nested_exceptions_test.c
new file mode 100644
index 000000000000..ac33835f78f4
--- /dev/null
+++ b/tools/testing/selftests/kvm/x86_64/nested_exceptions_test.c
@@ -0,0 +1,295 @@
+// SPDX-License-Identifier: GPL-2.0-only
+#define _GNU_SOURCE /* for program_invocation_short_name */
+
+#include "test_util.h"
+#include "kvm_util.h"
+#include "processor.h"
+#include "vmx.h"
+#include "svm_util.h"
+
+#define L2_GUEST_STACK_SIZE 256
+
+/*
+ * Arbitrary, never shoved into KVM/hardware, just need to avoid conflict with
+ * the "real" exceptions used, #SS/#GP/#DF (12/13/8).
+ */
+#define FAKE_TRIPLE_FAULT_VECTOR	0xaa
+
+/* Arbitrary 32-bit error code injected by this test. */
+#define SS_ERROR_CODE 0xdeadbeef
+
+/*
+ * Bit '0' is set on Intel if the exception occurs while delivering a previous
+ * event/exception.  AMD's wording is ambiguous, but presumably the bit is set
+ * if the exception occurs while delivering an external event, e.g. NMI or INTR,
+ * but not for exceptions that occur when delivering other exceptions or
+ * software interrupts.
+ *
+ * Note, Intel's name for it, "External event", is misleading and much more
+ * aligned with AMD's behavior, but the SDM is quite clear on its behavior.
+ */
+#define ERROR_CODE_EXT_FLAG	BIT(0)
+
+/*
+ * Bit '1' is set if the fault occurred when looking up a descriptor in the
+ * IDT, which is the case here as the IDT is empty/NULL.
+ */
+#define ERROR_CODE_IDT_FLAG	BIT(1)
+
+/*
+ * The #GP that occurs when vectoring #SS should show the index into the IDT
+ * for #SS, plus have the "IDT flag" set.
+ */
+#define GP_ERROR_CODE_AMD ((SS_VECTOR * 8) | ERROR_CODE_IDT_FLAG)
+#define GP_ERROR_CODE_INTEL ((SS_VECTOR * 8) | ERROR_CODE_IDT_FLAG | ERROR_CODE_EXT_FLAG)
+
+/*
+ * Intel and AMD both shove '0' into the error code on #DF, regardless of what
+ * led to the double fault.
+ */
+#define DF_ERROR_CODE 0
+
+#define INTERCEPT_SS		(BIT_ULL(SS_VECTOR))
+#define INTERCEPT_SS_DF		(INTERCEPT_SS | BIT_ULL(DF_VECTOR))
+#define INTERCEPT_SS_GP_DF	(INTERCEPT_SS_DF | BIT_ULL(GP_VECTOR))
+
+static void l2_ss_pending_test(void)
+{
+	GUEST_SYNC(SS_VECTOR);
+}
+
+static void l2_ss_injected_gp_test(void)
+{
+	GUEST_SYNC(GP_VECTOR);
+}
+
+static void l2_ss_injected_df_test(void)
+{
+	GUEST_SYNC(DF_VECTOR);
+}
+
+static void l2_ss_injected_tf_test(void)
+{
+	GUEST_SYNC(FAKE_TRIPLE_FAULT_VECTOR);
+}
+
+static void svm_run_l2(struct svm_test_data *svm, void *l2_code, int vector,
+		       uint32_t error_code)
+{
+	struct vmcb *vmcb = svm->vmcb;
+	struct vmcb_control_area *ctrl = &vmcb->control;
+
+	vmcb->save.rip = (u64)l2_code;
+	run_guest(vmcb, svm->vmcb_gpa);
+
+	if (vector == FAKE_TRIPLE_FAULT_VECTOR)
+		return;
+
+	GUEST_ASSERT_EQ(ctrl->exit_code, (SVM_EXIT_EXCP_BASE + vector));
+	GUEST_ASSERT_EQ(ctrl->exit_info_1, error_code);
+}
+
+static void l1_svm_code(struct svm_test_data *svm)
+{
+	struct vmcb_control_area *ctrl = &svm->vmcb->control;
+	unsigned long l2_guest_stack[L2_GUEST_STACK_SIZE];
+
+	generic_svm_setup(svm, NULL, &l2_guest_stack[L2_GUEST_STACK_SIZE]);
+	svm->vmcb->save.idtr.limit = 0;
+	ctrl->intercept |= BIT_ULL(INTERCEPT_SHUTDOWN);
+
+	ctrl->intercept_exceptions = INTERCEPT_SS_GP_DF;
+	svm_run_l2(svm, l2_ss_pending_test, SS_VECTOR, SS_ERROR_CODE);
+	svm_run_l2(svm, l2_ss_injected_gp_test, GP_VECTOR, GP_ERROR_CODE_AMD);
+
+	ctrl->intercept_exceptions = INTERCEPT_SS_DF;
+	svm_run_l2(svm, l2_ss_injected_df_test, DF_VECTOR, DF_ERROR_CODE);
+
+	ctrl->intercept_exceptions = INTERCEPT_SS;
+	svm_run_l2(svm, l2_ss_injected_tf_test, FAKE_TRIPLE_FAULT_VECTOR, 0);
+	GUEST_ASSERT_EQ(ctrl->exit_code, SVM_EXIT_SHUTDOWN);
+
+	GUEST_DONE();
+}
+
+static void vmx_run_l2(void *l2_code, int vector, uint32_t error_code)
+{
+	GUEST_ASSERT(!vmwrite(GUEST_RIP, (u64)l2_code));
+
+	GUEST_ASSERT_EQ(vector == SS_VECTOR ? vmlaunch() : vmresume(), 0);
+
+	if (vector == FAKE_TRIPLE_FAULT_VECTOR)
+		return;
+
+	GUEST_ASSERT_EQ(vmreadz(VM_EXIT_REASON), EXIT_REASON_EXCEPTION_NMI);
+	GUEST_ASSERT_EQ((vmreadz(VM_EXIT_INTR_INFO) & 0xff), vector);
+	GUEST_ASSERT_EQ(vmreadz(VM_EXIT_INTR_ERROR_CODE), error_code);
+}
+
+static void l1_vmx_code(struct vmx_pages *vmx)
+{
+	unsigned long l2_guest_stack[L2_GUEST_STACK_SIZE];
+
+	GUEST_ASSERT_EQ(prepare_for_vmx_operation(vmx), true);
+
+	GUEST_ASSERT_EQ(load_vmcs(vmx), true);
+
+	prepare_vmcs(vmx, NULL, &l2_guest_stack[L2_GUEST_STACK_SIZE]);
+	GUEST_ASSERT_EQ(vmwrite(GUEST_IDTR_LIMIT, 0), 0);
+
+	/*
+	 * VMX disallows injecting an exception with error_code[31:16] != 0,
+	 * and hardware will never generate a VM-Exit with bits 31:16 set.
+	 * KVM should likewise truncate the "bad" userspace value.
+	 */
+	GUEST_ASSERT_EQ(vmwrite(EXCEPTION_BITMAP, INTERCEPT_SS_GP_DF), 0);
+	vmx_run_l2(l2_ss_pending_test, SS_VECTOR, (u16)SS_ERROR_CODE);
+	vmx_run_l2(l2_ss_injected_gp_test, GP_VECTOR, GP_ERROR_CODE_INTEL);
+
+	GUEST_ASSERT_EQ(vmwrite(EXCEPTION_BITMAP, INTERCEPT_SS_DF), 0);
+	vmx_run_l2(l2_ss_injected_df_test, DF_VECTOR, DF_ERROR_CODE);
+
+	GUEST_ASSERT_EQ(vmwrite(EXCEPTION_BITMAP, INTERCEPT_SS), 0);
+	vmx_run_l2(l2_ss_injected_tf_test, FAKE_TRIPLE_FAULT_VECTOR, 0);
+	GUEST_ASSERT_EQ(vmreadz(VM_EXIT_REASON), EXIT_REASON_TRIPLE_FAULT);
+
+	GUEST_DONE();
+}
+
+static void __attribute__((__flatten__)) l1_guest_code(void *test_data)
+{
+	if (this_cpu_has(X86_FEATURE_SVM))
+		l1_svm_code(test_data);
+	else
+		l1_vmx_code(test_data);
+}
+
+static void assert_ucall_vector(struct kvm_vcpu *vcpu, int vector)
+{
+	struct kvm_run *run = vcpu->run;
+	struct ucall uc;
+
+	TEST_ASSERT(run->exit_reason == KVM_EXIT_IO,
+		    "Unexpected exit reason: %u (%s)",
+		    run->exit_reason, exit_reason_str(run->exit_reason));
+
+	switch (get_ucall(vcpu, &uc)) {
+	case UCALL_SYNC:
+		TEST_ASSERT(vector == uc.args[1],
+			    "Expected L2 to ask for %d, got %ld", vector, uc.args[1]);
+		break;
+	case UCALL_DONE:
+		TEST_ASSERT(vector == -1,
+			    "Expected L2 to ask for %d, L2 says it's done", vector);
+		break;
+	case UCALL_ABORT:
+		TEST_FAIL("%s at %s:%ld (0x%lx != 0x%lx)",
+			  (const char *)uc.args[0], __FILE__, uc.args[1],
+			  uc.args[2], uc.args[3]);
+		break;
+	default:
+		TEST_FAIL("Expected L2 to ask for %d, got unexpected ucall %lu", vector, uc.cmd);
+	}
+}
+
+static void queue_ss_exception(struct kvm_vcpu *vcpu, bool inject)
+{
+	struct kvm_vcpu_events events;
+
+	vcpu_events_get(vcpu, &events);
+
+	TEST_ASSERT(!events.exception.pending,
+		    "Vector %d unexpectedly pending", events.exception.nr);
+	TEST_ASSERT(!events.exception.injected,
+		    "Vector %d unexpectedly injected", events.exception.nr);
+
+	events.flags = KVM_VCPUEVENT_VALID_PAYLOAD;
+	events.exception.pending = !inject;
+	events.exception.injected = inject;
+	events.exception.nr = SS_VECTOR;
+	events.exception.has_error_code = true;
+	events.exception.error_code = SS_ERROR_CODE;
+	vcpu_events_set(vcpu, &events);
+}
+
+/*
+ * Verify KVM_{G,S}ET_EVENTS play nice with pending vs. injected exceptions
+ * when an exception is being queued for L2.  Specifically, verify that KVM
+ * honors L1 exception intercept controls when a #SS is pending/injected,
+ * triggers a #GP on vectoring the #SS, morphs to #DF if #GP isn't intercepted
+ * by L1, and finally causes (nested) SHUTDOWN if #DF isn't intercepted by L1.
+ */
+int main(int argc, char *argv[])
+{
+	vm_vaddr_t nested_test_data_gva;
+	struct kvm_vcpu_events events;
+	struct kvm_vcpu *vcpu;
+	struct kvm_vm *vm;
+
+	TEST_REQUIRE(kvm_has_cap(KVM_CAP_EXCEPTION_PAYLOAD));
+	TEST_REQUIRE(kvm_cpu_has(X86_FEATURE_SVM) || kvm_cpu_has(X86_FEATURE_VMX));
+
+	vm = vm_create_with_one_vcpu(&vcpu, l1_guest_code);
+	vm_enable_cap(vm, KVM_CAP_EXCEPTION_PAYLOAD, -2ul);
+
+	if (kvm_cpu_has(X86_FEATURE_SVM))
+		vcpu_alloc_svm(vm, &nested_test_data_gva);
+	else
+		vcpu_alloc_vmx(vm, &nested_test_data_gva);
+
+	vcpu_args_set(vcpu, 1, nested_test_data_gva);
+
+	/* Run L1 => L2.  L2 should sync and request #SS. */
+	vcpu_run(vcpu);
+	assert_ucall_vector(vcpu, SS_VECTOR);
+
+	/* Pend #SS and request immediate exit.  #SS should still be pending. */
+	queue_ss_exception(vcpu, false);
+	vcpu->run->immediate_exit = true;
+	vcpu_run_complete_io(vcpu);
+
+	/* Verify the pending event comes back out the same as it went in. */
+	vcpu_events_get(vcpu, &events);
+	ASSERT_EQ(events.flags & KVM_VCPUEVENT_VALID_PAYLOAD,
+		  KVM_VCPUEVENT_VALID_PAYLOAD);
+	ASSERT_EQ(events.exception.pending, true);
+	ASSERT_EQ(events.exception.nr, SS_VECTOR);
+	ASSERT_EQ(events.exception.has_error_code, true);
+	ASSERT_EQ(events.exception.error_code, SS_ERROR_CODE);
+
+	/*
+	 * Run for real with the pending #SS, L1 should get a VM-Exit due to
+	 * #SS interception and re-enter L2 to request #GP (via injected #SS).
+	 */
+	vcpu->run->immediate_exit = false;
+	vcpu_run(vcpu);
+	assert_ucall_vector(vcpu, GP_VECTOR);
+
+	/*
+	 * Inject #SS, the #SS should bypass interception and cause #GP, which
+	 * L1 should intercept before KVM morphs it to #DF.  L1 should then
+	 * disable #GP interception and run L2 to request #DF (via #SS => #GP).
+	 */
+	queue_ss_exception(vcpu, true);
+	vcpu_run(vcpu);
+	assert_ucall_vector(vcpu, DF_VECTOR);
+
+	/*
+	 * Inject #SS, the #SS should bypass interception and cause #GP, which
+	 * L1 is no longer intercepting, and so sees a #DF VM-Exit.  L1 then
+	 * disables #DF interception and runs L2 to request a fake triple fault.
+	 */
+	queue_ss_exception(vcpu, true);
+	vcpu_run(vcpu);
+	assert_ucall_vector(vcpu, FAKE_TRIPLE_FAULT_VECTOR);
+
+	/*
+	 * Inject #SS yet again.  L1 is not intercepting #GP or #DF, and so
+	 * should see nested TRIPLE_FAULT / SHUTDOWN.
+	 */
+	queue_ss_exception(vcpu, true);
+	vcpu_run(vcpu);
+	assert_ucall_vector(vcpu, -1);
+
+	kvm_vm_free(vm);
+}
-- 
2.36.1.476.g0c4daa206d-goog


^ permalink raw reply related	[flat|nested] 78+ messages in thread

* Re: [PATCH v2 00/21] KVM: x86: Event/exception fixes and cleanups
  2022-06-14 20:47 [PATCH v2 00/21] KVM: x86: Event/exception fixes and cleanups Sean Christopherson
                   ` (20 preceding siblings ...)
  2022-06-14 20:47 ` [PATCH v2 21/21] KVM: selftests: Add an x86-only test to verify nested exception queueing Sean Christopherson
@ 2022-06-16 13:16 ` Maxim Levitsky
  2022-06-29 11:16 ` Maxim Levitsky
  22 siblings, 0 replies; 78+ messages in thread
From: Maxim Levitsky @ 2022-06-16 13:16 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel, kvm,
	linux-kernel, Oliver Upton, Peter Shier

On Tue, 2022-06-14 at 20:47 +0000, Sean Christopherson wrote:
> The main goal of this series is to fix KVM's longstanding bug of not
> honoring L1's exception intercepts wants when handling an exception that
> occurs during delivery of a different exception.  E.g. if L0 and L1 are
> using shadow paging, and L2 hits a #PF, and then hits another #PF while
> vectoring the first #PF due to _L1_ not having a shadow page for the IDT,
> KVM needs to check L1's intercepts before morphing the #PF => #PF => #DF
> so that the #PF is routed to L1, not injected into L2 as a #DF.
> 
> nVMX has hacked around the bug for years by overriding the #PF injector
> for shadow paging to go straight to VM-Exit, and nSVM has started doing
> the same.  The hacks mostly work, but they're incomplete, confusing, and
> lead to other hacky code, e.g. bailing from the emulator because #PF
> injection forced a VM-Exit and suddenly KVM is back in L1.
> 
> Everything leading up to that are related fixes and cleanups I encountered
> along the way; some through code inspection, some through tests.
> 
> v2:
>   - Rebased to kvm/queue (commit 8baacf67c76c) + selftests CPUID
>     overhaul.
>     https://lore.kernel.org/all/20220614200707.3315957-1-seanjc@google.com
>   - Treat KVM_REQ_TRIPLE_FAULT as a pending exception.
> 
> v1: https://lore.kernel.org/all/20220311032801.3467418-1-seanjc@google.com
> 
> Sean Christopherson (21):
>   KVM: nVMX: Unconditionally purge queued/injected events on nested
>     "exit"
>   KVM: VMX: Drop bits 31:16 when shoving exception error code into VMCS
>   KVM: x86: Don't check for code breakpoints when emulating on exception
>   KVM: nVMX: Treat General Detect #DB (DR7.GD=1) as fault-like
>   KVM: nVMX: Prioritize TSS T-flag #DBs over Monitor Trap Flag
>   KVM: x86: Treat #DBs from the emulator as fault-like (code and
>     DR7.GD=1)
>   KVM: x86: Use DR7_GD macro instead of open coding check in emulator
>   KVM: nVMX: Ignore SIPI that arrives in L2 when vCPU is not in WFS
>   KVM: nVMX: Unconditionally clear mtf_pending on nested VM-Exit
>   KVM: VMX: Inject #PF on ENCLS as "emulated" #PF
>   KVM: x86: Rename kvm_x86_ops.queue_exception to inject_exception
>   KVM: x86: Make kvm_queued_exception a properly named, visible struct
>   KVM: x86: Formalize blocking of nested pending exceptions
>   KVM: x86: Use kvm_queue_exception_e() to queue #DF
>   KVM: x86: Hoist nested event checks above event injection logic
>   KVM: x86: Evaluate ability to inject SMI/NMI/IRQ after potential
>     VM-Exit
>   KVM: x86: Morph pending exceptions to pending VM-Exits at queue time
>   KVM: x86: Treat pending TRIPLE_FAULT requests as pending exceptions
>   KVM: VMX: Update MTF and ICEBP comments to document KVM's subtle
>     behavior
>   KVM: selftests: Use uapi header to get VMX and SVM exit reasons/codes
>   KVM: selftests: Add an x86-only test to verify nested exception
>     queueing
> 
>  arch/x86/include/asm/kvm-x86-ops.h            |   2 +-
>  arch/x86/include/asm/kvm_host.h               |  35 +-
>  arch/x86/kvm/emulate.c                        |   3 +-
>  arch/x86/kvm/svm/nested.c                     | 102 ++---
>  arch/x86/kvm/svm/svm.c                        |  18 +-
>  arch/x86/kvm/vmx/nested.c                     | 319 +++++++++-----
>  arch/x86/kvm/vmx/sgx.c                        |   2 +-
>  arch/x86/kvm/vmx/vmx.c                        |  53 ++-
>  arch/x86/kvm/x86.c                            | 404 +++++++++++-------
>  arch/x86/kvm/x86.h                            |  11 +-
>  tools/testing/selftests/kvm/.gitignore        |   1 +
>  tools/testing/selftests/kvm/Makefile          |   1 +
>  .../selftests/kvm/include/x86_64/svm_util.h   |   7 +-
>  .../selftests/kvm/include/x86_64/vmx.h        |  51 +--
>  .../kvm/x86_64/nested_exceptions_test.c       | 295 +++++++++++++
>  15 files changed, 886 insertions(+), 418 deletions(-)
>  create mode 100644 tools/testing/selftests/kvm/x86_64/nested_exceptions_test.c
> 
> 
> base-commit: 816967202161955f398ce379f9cbbedcb1eb03cb

Next week I will review all of this patch series.

Best regards,
	Maxim Levitsky


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v2 01/21] KVM: nVMX: Unconditionally purge queued/injected events on nested "exit"
  2022-06-14 20:47 ` [PATCH v2 01/21] KVM: nVMX: Unconditionally purge queued/injected events on nested "exit" Sean Christopherson
@ 2022-06-16 23:47   ` Jim Mattson
  2022-07-06 11:40   ` Maxim Levitsky
  1 sibling, 0 replies; 78+ messages in thread
From: Jim Mattson @ 2022-06-16 23:47 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Joerg Roedel, kvm,
	linux-kernel, Oliver Upton, Peter Shier

On Tue, Jun 14, 2022 at 1:47 PM Sean Christopherson <seanjc@google.com> wrote:
>
> Drop pending exceptions and events queued for re-injection when leaving
> nested guest mode, even if the "exit" is due to VM-Fail, SMI, or forced
> by host userspace.  Failure to purge events could result in an event
> belonging to L2 being injected into L1.
>
> This _should_ never happen for VM-Fail as all events should be blocked by
> nested_run_pending, but it's possible if KVM, not the L1 hypervisor, is
> the source of VM-Fail when running vmcs02.
>
> SMI is a nop (barring unknown bugs) as recognition of SMI and thus entry
> to SMM is blocked by pending exceptions and re-injected events.
>
> Forced exit is definitely buggy, but has likely gone unnoticed because
> userspace probably follows the forced exit with KVM_SET_VCPU_EVENTS (or
> some other ioctl() that purges the queue).
>
> Fixes: 4f350c6dbcb9 ("kvm: nVMX: Handle deferred early VMLAUNCH/VMRESUME failure properly")
> Cc: stable@vger.kernel.org
> Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Jim Mattson <jmattson@google.com>

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v2 00/21] KVM: x86: Event/exception fixes and cleanups
  2022-06-14 20:47 [PATCH v2 00/21] KVM: x86: Event/exception fixes and cleanups Sean Christopherson
                   ` (21 preceding siblings ...)
  2022-06-16 13:16 ` [PATCH v2 00/21] KVM: x86: Event/exception fixes and cleanups Maxim Levitsky
@ 2022-06-29 11:16 ` Maxim Levitsky
  2022-06-29 13:42   ` Jim Mattson
  2022-06-29 15:53   ` Jim Mattson
  22 siblings, 2 replies; 78+ messages in thread
From: Maxim Levitsky @ 2022-06-29 11:16 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel, kvm,
	linux-kernel, Oliver Upton, Peter Shier

On Tue, 2022-06-14 at 20:47 +0000, Sean Christopherson wrote:
> The main goal of this series is to fix KVM's longstanding bug of not
> honoring L1's exception intercepts wants when handling an exception that
> occurs during delivery of a different exception.  E.g. if L0 and L1 are
> using shadow paging, and L2 hits a #PF, and then hits another #PF while
> vectoring the first #PF due to _L1_ not having a shadow page for the IDT,
> KVM needs to check L1's intercepts before morphing the #PF => #PF => #DF
> so that the #PF is routed to L1, not injected into L2 as a #DF.
> 
> nVMX has hacked around the bug for years by overriding the #PF injector
> for shadow paging to go straight to VM-Exit, and nSVM has started doing
> the same.  The hacks mostly work, but they're incomplete, confusing, and
> lead to other hacky code, e.g. bailing from the emulator because #PF
> injection forced a VM-Exit and suddenly KVM is back in L1.
> 
> Everything leading up to that are related fixes and cleanups I encountered
> along the way; some through code inspection, some through tests.
> 
> v2:
>   - Rebased to kvm/queue (commit 8baacf67c76c) + selftests CPUID
>     overhaul.
>     https://lore.kernel.org/all/20220614200707.3315957-1-seanjc@google.com
>   - Treat KVM_REQ_TRIPLE_FAULT as a pending exception.
> 
> v1: https://lore.kernel.org/all/20220311032801.3467418-1-seanjc@google.com
> 
> Sean Christopherson (21):
>   KVM: nVMX: Unconditionally purge queued/injected events on nested
>     "exit"
>   KVM: VMX: Drop bits 31:16 when shoving exception error code into VMCS
>   KVM: x86: Don't check for code breakpoints when emulating on exception
>   KVM: nVMX: Treat General Detect #DB (DR7.GD=1) as fault-like
>   KVM: nVMX: Prioritize TSS T-flag #DBs over Monitor Trap Flag
>   KVM: x86: Treat #DBs from the emulator as fault-like (code and
>     DR7.GD=1)
>   KVM: x86: Use DR7_GD macro instead of open coding check in emulator
>   KVM: nVMX: Ignore SIPI that arrives in L2 when vCPU is not in WFS
>   KVM: nVMX: Unconditionally clear mtf_pending on nested VM-Exit
>   KVM: VMX: Inject #PF on ENCLS as "emulated" #PF
>   KVM: x86: Rename kvm_x86_ops.queue_exception to inject_exception
>   KVM: x86: Make kvm_queued_exception a properly named, visible struct
>   KVM: x86: Formalize blocking of nested pending exceptions
>   KVM: x86: Use kvm_queue_exception_e() to queue #DF
>   KVM: x86: Hoist nested event checks above event injection logic
>   KVM: x86: Evaluate ability to inject SMI/NMI/IRQ after potential
>     VM-Exit
>   KVM: x86: Morph pending exceptions to pending VM-Exits at queue time
>   KVM: x86: Treat pending TRIPLE_FAULT requests as pending exceptions
>   KVM: VMX: Update MTF and ICEBP comments to document KVM's subtle
>     behavior
>   KVM: selftests: Use uapi header to get VMX and SVM exit reasons/codes
>   KVM: selftests: Add an x86-only test to verify nested exception
>     queueing
> 
>  arch/x86/include/asm/kvm-x86-ops.h            |   2 +-
>  arch/x86/include/asm/kvm_host.h               |  35 +-
>  arch/x86/kvm/emulate.c                        |   3 +-
>  arch/x86/kvm/svm/nested.c                     | 102 ++---
>  arch/x86/kvm/svm/svm.c                        |  18 +-
>  arch/x86/kvm/vmx/nested.c                     | 319 +++++++++-----
>  arch/x86/kvm/vmx/sgx.c                        |   2 +-
>  arch/x86/kvm/vmx/vmx.c                        |  53 ++-
>  arch/x86/kvm/x86.c                            | 404 +++++++++++-------
>  arch/x86/kvm/x86.h                            |  11 +-
>  tools/testing/selftests/kvm/.gitignore        |   1 +
>  tools/testing/selftests/kvm/Makefile          |   1 +
>  .../selftests/kvm/include/x86_64/svm_util.h   |   7 +-
>  .../selftests/kvm/include/x86_64/vmx.h        |  51 +--
>  .../kvm/x86_64/nested_exceptions_test.c       | 295 +++++++++++++
>  15 files changed, 886 insertions(+), 418 deletions(-)
>  create mode 100644 tools/testing/selftests/kvm/x86_64/nested_exceptions_test.c
> 
> 
> base-commit: 816967202161955f398ce379f9cbbedcb1eb03cb

Hi Sean and everyone!
 
 
Before I continue reviewing the patch series, I would like you to check if
I understand the monitor trap/pending debug exception/event injection
logic on VMX correctly. I was looking at the spec for several hours and I still have more
questions than answers about it.
 
So let me state what I understand:
 
1. Event injection (aka eventinj in SVM terms):
 
  (VM_ENTRY_INTR_INFO_FIELD/VM_ENTRY_EXCEPTION_ERROR_CODE/VM_ENTRY_INSTRUCTION_LEN)
 
  If I understand correctly, all event injection types, just like on SVM, just inject,
  and never create something pending, and/or drop the injection if event is not allowed
  (like if EFLAGS.IF is 0). VMX might have some checks that could fail VM entry,
  if for example you try to inject type 0 (hardware interrupt) and EFLAGS.IF is 0
  (I haven't checked this).
 
  All event injections happen right away, don't deliver any payload (like DR6), etc.
 
  Injection types 4/5/6, do the same as injection types 0/2/3 but in addition to that,
  type 4/6 do a DPL check in IDT, and also these types can promote the RIP prior
  to pushing it to the exception stack using VM_ENTRY_INSTRUCTION_LEN to be consistent
  with cases when these trap like events are intercepted, where the interception happens
  on the start of the instruction despite exceptions being trap-like.
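
  For concreteness, a rough sketch of such an injection (field layout per
  the SDM; vmcs_write32() and the macro names are the kernel's real ones,
  the snippet itself is only illustrative, with error_code being whatever
  value is to be delivered):

	/* Inject #GP: vector 13, type 3 (hardware exception), w/ error code */
	vmcs_write32(VM_ENTRY_INTR_INFO_FIELD,
		     GP_VECTOR |                    /* bits 7:0  - vector  */
		     INTR_TYPE_HARD_EXCEPTION |     /* bits 10:8 - type    */
		     INTR_INFO_DELIVER_CODE_MASK |  /* bit 11 - error code */
		     INTR_INFO_VALID_MASK);         /* bit 31 - valid      */
	vmcs_write32(VM_ENTRY_EXCEPTION_ERROR_CODE, error_code);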
 
 
2. #DB is the only trap like exception that can be pending for one more instruction
   if MOV SS shadow is on (any other cases?).
   (AMD just ignores the whole thing, rightfully)
 
   That is why we have the GUEST_PENDING_DBG_EXCEPTIONS vmcs field.
   I understand that it will be written by CPU in case we have VM exit at the moment
   where #DB is already pending but not yet delivered.
 
   That field can also be (sadly) used to "inject" #DB to the guest, if the hypervisor sets it,
   and this #DB will actually update DR6 and such, and might be delayed/lost.
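
   E.g., as a sketch of the exit side, on a nested VM exit KVM propagates
   the field into vmcs12, roughly:

	vmcs12->guest_pending_dbg_exceptions =
		vmcs_readl(GUEST_PENDING_DBG_EXCEPTIONS);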
 
 
3. Facts about MTF:
 
   * MTF as a feature is basically 'single step the guest by generating MTF VM exits after each executed
     instruction', and is enabled in primary execution controls.
 
   * MTF is also an 'event', and it can be injected separately by the hypervisor with event type 7,
     and that has no connection to the 'feature', although usually this injection will be useful
     when the hypervisor does some kind of re-injection, triggered by the actual MTF feature.
 
   * MTF event can be lost if a higher priority VM exit happens; this is why the SDM talks about a 'pending MTF',
     which means that MTF vmexit should happen unless something else prevents it and/or higher priority VM exit
     overrides it.
 
   * MTF event is raised (when the primary execution controls bit is enabled) when:
 
	- after an injected (vectored) event, aka eventinj/VM_ENTRY_INTR_INFO_FIELD, is done updating the guest state
	  (that is stack was switched, stuff was pushed to new exception stack, RIP updated to the handler)
	  I am not 100% sure about this but this seems to be what PRM implies:
 
	  "If the “monitor trap flag” VM-execution control is 1 and VM entry is injecting a vectored event (see Section
	  26.6.1), an MTF VM exit is pending on the instruction boundary before the first instruction following the
	  VM entry."
 
	- If an interrupt and/or #DB exception happens prior to executing the first instruction of the guest,
	  then once again MTF will happen on the first instruction of the exception/interrupt handler
 
	  "If the “monitor trap flag” VM-execution control is 1, VM entry is not injecting an event, and a pending event
	  (e.g., debug exception or interrupt) is delivered before an instruction can execute, an MTF VM exit is pending
	  on the instruction boundary following delivery of the event (or any nested exception)."
 
	  That means that #DB has higher priority than MTF, but not specified if fault DB or trap DB
 
	- If an instruction causes an exception, once again, MTF will happen on the first instruction of the exception handler.
 
	- Otherwise after an instruction (or REP iteration) retires.
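
   As a sketch of the two flavors above (the macro names are real, the
   usage is illustrative), the feature is a primary exec-control bit while
   the event is injection type 7 with vector 0:

	/* Enable the MTF "feature": VM exit after each guest instruction. */
	exec_controls_setbit(to_vmx(vcpu), CPU_BASED_MONITOR_TRAP_FLAG);

	/* Or (re)inject a single pending MTF "event" directly. */
	vmcs_write32(VM_ENTRY_INTR_INFO_FIELD,
		     INTR_INFO_VALID_MASK | INTR_TYPE_OTHER_EVENT);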
 

If you have more facts about MTF and related stuff and/or if I made a mistake in the above, I am all ears to listen!

Best regards,
	Maxim Levitsky


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v2 00/21] KVM: x86: Event/exception fixes and cleanups
  2022-06-29 11:16 ` Maxim Levitsky
@ 2022-06-29 13:42   ` Jim Mattson
  2022-06-30  8:22     ` Maxim Levitsky
  2022-07-06 11:54     ` Maxim Levitsky
  2022-06-29 15:53   ` Jim Mattson
  1 sibling, 2 replies; 78+ messages in thread
From: Jim Mattson @ 2022-06-29 13:42 UTC (permalink / raw)
  To: Maxim Levitsky
  Cc: Sean Christopherson, Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li,
	Joerg Roedel, kvm, linux-kernel, Oliver Upton, Peter Shier

On Wed, Jun 29, 2022 at 4:17 AM Maxim Levitsky <mlevitsk@redhat.com> wrote:
>
> On Tue, 2022-06-14 at 20:47 +0000, Sean Christopherson wrote:
> > The main goal of this series is to fix KVM's longstanding bug of not
> > honoring L1's exception intercepts wants when handling an exception that
> > occurs during delivery of a different exception.  E.g. if L0 and L1 are
> > using shadow paging, and L2 hits a #PF, and then hits another #PF while
> > vectoring the first #PF due to _L1_ not having a shadow page for the IDT,
> > KVM needs to check L1's intercepts before morphing the #PF => #PF => #DF
> > so that the #PF is routed to L1, not injected into L2 as a #DF.
> >
> > nVMX has hacked around the bug for years by overriding the #PF injector
> > for shadow paging to go straight to VM-Exit, and nSVM has started doing
> > the same.  The hacks mostly work, but they're incomplete, confusing, and
> > lead to other hacky code, e.g. bailing from the emulator because #PF
> > injection forced a VM-Exit and suddenly KVM is back in L1.
> >
> > Everything leading up to that are related fixes and cleanups I encountered
> > along the way; some through code inspection, some through tests.
> >
> > v2:
> >   - Rebased to kvm/queue (commit 8baacf67c76c) + selftests CPUID
> >     overhaul.
> >     https://lore.kernel.org/all/20220614200707.3315957-1-seanjc@google.com
> >   - Treat KVM_REQ_TRIPLE_FAULT as a pending exception.
> >
> > v1: https://lore.kernel.org/all/20220311032801.3467418-1-seanjc@google.com
> >
> > Sean Christopherson (21):
> >   KVM: nVMX: Unconditionally purge queued/injected events on nested
> >     "exit"
> >   KVM: VMX: Drop bits 31:16 when shoving exception error code into VMCS
> >   KVM: x86: Don't check for code breakpoints when emulating on exception
> >   KVM: nVMX: Treat General Detect #DB (DR7.GD=1) as fault-like
> >   KVM: nVMX: Prioritize TSS T-flag #DBs over Monitor Trap Flag
> >   KVM: x86: Treat #DBs from the emulator as fault-like (code and
> >     DR7.GD=1)
> >   KVM: x86: Use DR7_GD macro instead of open coding check in emulator
> >   KVM: nVMX: Ignore SIPI that arrives in L2 when vCPU is not in WFS
> >   KVM: nVMX: Unconditionally clear mtf_pending on nested VM-Exit
> >   KVM: VMX: Inject #PF on ENCLS as "emulated" #PF
> >   KVM: x86: Rename kvm_x86_ops.queue_exception to inject_exception
> >   KVM: x86: Make kvm_queued_exception a properly named, visible struct
> >   KVM: x86: Formalize blocking of nested pending exceptions
> >   KVM: x86: Use kvm_queue_exception_e() to queue #DF
> >   KVM: x86: Hoist nested event checks above event injection logic
> >   KVM: x86: Evaluate ability to inject SMI/NMI/IRQ after potential
> >     VM-Exit
> >   KVM: x86: Morph pending exceptions to pending VM-Exits at queue time
> >   KVM: x86: Treat pending TRIPLE_FAULT requests as pending exceptions
> >   KVM: VMX: Update MTF and ICEBP comments to document KVM's subtle
> >     behavior
> >   KVM: selftests: Use uapi header to get VMX and SVM exit reasons/codes
> >   KVM: selftests: Add an x86-only test to verify nested exception
> >     queueing
> >
> >  arch/x86/include/asm/kvm-x86-ops.h            |   2 +-
> >  arch/x86/include/asm/kvm_host.h               |  35 +-
> >  arch/x86/kvm/emulate.c                        |   3 +-
> >  arch/x86/kvm/svm/nested.c                     | 102 ++---
> >  arch/x86/kvm/svm/svm.c                        |  18 +-
> >  arch/x86/kvm/vmx/nested.c                     | 319 +++++++++-----
> >  arch/x86/kvm/vmx/sgx.c                        |   2 +-
> >  arch/x86/kvm/vmx/vmx.c                        |  53 ++-
> >  arch/x86/kvm/x86.c                            | 404 +++++++++++-------
> >  arch/x86/kvm/x86.h                            |  11 +-
> >  tools/testing/selftests/kvm/.gitignore        |   1 +
> >  tools/testing/selftests/kvm/Makefile          |   1 +
> >  .../selftests/kvm/include/x86_64/svm_util.h   |   7 +-
> >  .../selftests/kvm/include/x86_64/vmx.h        |  51 +--
> >  .../kvm/x86_64/nested_exceptions_test.c       | 295 +++++++++++++
> >  15 files changed, 886 insertions(+), 418 deletions(-)
> >  create mode 100644 tools/testing/selftests/kvm/x86_64/nested_exceptions_test.c
> >
> >
> > base-commit: 816967202161955f398ce379f9cbbedcb1eb03cb
>
> Hi Sean and everyone!
>
>
> Before I continue reviewing the patch series, I would like you to check if
> I understand the monitor trap/pending debug exception/event injection
> logic on VMX correctly. I was looking at the spec for several hours and I still have more
> questions than answers about it.
>
> So let me state what I understand:
>
> 1. Event injection (aka eventinj in SVM terms):
>
>   (VM_ENTRY_INTR_INFO_FIELD/VM_ENTRY_EXCEPTION_ERROR_CODE/VM_ENTRY_INSTRUCTION_LEN)
>
>   If I understand correctly, all event injection types, just like on SVM, just inject,
>   and never create something pending, and/or drop the injection if event is not allowed
>   (like if EFLAGS.IF is 0). VMX might have some checks that could fail VM entry,
>   if for example you try to inject type 0 (hardware interrupt) and EFLAGS.IF is 0
>   (I haven't checked this).

The event is never just "dropped." If it is illegal to deliver the
event, VM-entry fails. See the second bullet under section 26.2.1.3:
VM-Entry Control Fields, in the SDM, volume 3.


>   All event injections happen right away, don't deliver any payload (like DR6), etc.

Correct.

>   Injection types 4/5/6, do the same as injection types 0/2/3 but in addition to that,
>   type 4/6 do a DPL check in IDT, and also these types can promote the RIP prior
>   to pushing it to the exception stack using VM_ENTRY_INSTRUCTION_LEN to be consistent
>   with cases when these trap like events are intercepted, where the interception happens
>   on the start of the instruction despite exceptions being trap-like.

Unlike the AMD "INTn intercept," these trap intercepts *do not* happen
at the start of the instruction. In early Intel VT-x parts, one could
not easily reinject an intercepted software interrupt or exception
using event injection, because VM-entry required a non-zero
instruction length, and the guest RIP had already advanced. On CPUs
that support a non-zero instruction length, one can now reinject a
software interrupt or exception, by setting the VM-entry instruction
length to 0.
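
Roughly, as a sketch (not actual KVM code; the VMCS field and macro names
are real):

	/* Re-inject an intercepted software interrupt; RIP already points
	 * past the INT n, so tell VM-entry not to advance it again. */
	vmcs_write32(VM_ENTRY_INTR_INFO_FIELD,
		     vector | INTR_TYPE_SOFT_INTR | INTR_INFO_VALID_MASK);
	vmcs_write32(VM_ENTRY_INSTRUCTION_LEN, 0);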

> 2. #DB is the only trap like exception that can be pending for one more instruction
>    if MOV SS shadow is on (any other cases?).

I believe that's it. I'm not entirely sure about RTM, though.

>    (AMD just ignores the whole thing, rightfully)

When you say "ignores," do you mean that AMD ignores a data breakpoint
or single-step trap generated by MOV-SS, or it ignores the fact that
delivering such a #DB trap between the MOV-SS and the subsequent
MOV-ESP will create a stack frame in the wrong place?

>    That is why we have the GUEST_PENDING_DBG_EXCEPTIONS vmcs field.
>    I understand that it will be written by CPU in case we have VM exit at the moment
>    where #DB is already pending but not yet delivered.
>
>    That field can also be (sadly) used to "inject" #DB to the guest, if the hypervisor sets it,
>    and this #DB will actually update DR6 and such, and might be delayed/lost.

Injecting a #DB this way (if the hypervisor just emulated MOV-SS) is
easier than emulating the next instruction or using MTF to step
through the next instruction, and getting all of the deferred #DB
delivery rules right. :-)
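
E.g., as a sketch (DR6_BS and the field name are real; a single-step trap
is bit 14 of the field, mirroring DR6.BS):

	vmcs_writel(GUEST_PENDING_DBG_EXCEPTIONS, DR6_BS);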

>
> 3. Facts about MTF:
>
>    * MTF as a feature is basically 'single step the guest by generating MTF VM exits after each executed
>      instruction', and is enabled in primary execution controls.
>
>    * MTF is also an 'event', and it can be injected separately by the hypervisor with event type 7,
>      and that has no connection to the 'feature', although usually this injection will be useful
>      when the hypervisor does some kind of re-injection, triggered by the actual MTF feature.
>
>    * MTF event can be lost if a higher priority VM exit happens; this is why the SDM talks about a 'pending MTF',
>      which means that MTF vmexit should happen unless something else prevents it and/or higher priority VM exit
>      overrides it.

Hence, the facility for injecting a "pending MTF"--so that it won't be "lost."

>    * MTF event is raised (when the primary execution controls bit is enabled) when:
>
>         - after an injected (vectored) event, aka eventinj/VM_ENTRY_INTR_INFO_FIELD, is done updating the guest state
>           (that is stack was switched, stuff was pushed to new exception stack, RIP updated to the handler)
>           I am not 100% sure about this but this seems to be what PRM implies:
>
>           "If the “monitor trap flag” VM-execution control is 1 and VM entry is injecting a vectored event (see Section
>           26.6.1), an MTF VM exit is pending on the instruction boundary before the first instruction following the
>           VM entry."
>
>         - If an interrupt and/or #DB exception happens prior to executing the first instruction of the guest,
>           then once again MTF will happen on the first instruction of the exception/interrupt handler
>
>           "If the “monitor trap flag” VM-execution control is 1, VM entry is not injecting an event, and a pending event
>           (e.g., debug exception or interrupt) is delivered before an instruction can execute, an MTF VM exit is pending
>           on the instruction boundary following delivery of the event (or any nested exception)."
>
>           That means that #DB has higher priority than MTF, but not specified if fault DB or trap DB

These are single-step, I/O and data breakpoint traps.

>         - If an instruction causes an exception, once again, MTF will happen on the first instruction of the exception handler.
>
>         - Otherwise after an instruction (or REP iteration) retires.
>
>
> If you have more facts about MTF and related stuff and/or if I made a mistake in the above, I am all ears to listen!

You might be interested in my augmented Table 6-2 (from volume 3 of
the SDM): https://docs.google.com/spreadsheets/d/e/2PACX-1vR8TkbSl4TqXtD62agRUs1QY3SY-98mKtOh-s8vYDzaDmDOcdfyTvlAxF9aVnHWRu7uyGhRwvHUziXT/pubhtml

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v2 00/21] KVM: x86: Event/exception fixes and cleanups
  2022-06-29 11:16 ` Maxim Levitsky
  2022-06-29 13:42   ` Jim Mattson
@ 2022-06-29 15:53   ` Jim Mattson
  2022-06-30  8:24     ` Maxim Levitsky
  1 sibling, 1 reply; 78+ messages in thread
From: Jim Mattson @ 2022-06-29 15:53 UTC (permalink / raw)
  To: Maxim Levitsky
  Cc: Sean Christopherson, Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li,
	Joerg Roedel, kvm, linux-kernel, Oliver Upton, Peter Shier

On Wed, Jun 29, 2022 at 4:17 AM Maxim Levitsky <mlevitsk@redhat.com> wrote:
>
> On Tue, 2022-06-14 at 20:47 +0000, Sean Christopherson wrote:
> > The main goal of this series is to fix KVM's longstanding bug of not
> > honoring L1's exception intercepts wants when handling an exception that
> > occurs during delivery of a different exception.  E.g. if L0 and L1 are
> > using shadow paging, and L2 hits a #PF, and then hits another #PF while
> > vectoring the first #PF due to _L1_ not having a shadow page for the IDT,
> > KVM needs to check L1's intercepts before morphing the #PF => #PF => #DF
> > so that the #PF is routed to L1, not injected into L2 as a #DF.
> >
> > nVMX has hacked around the bug for years by overriding the #PF injector
> > for shadow paging to go straight to VM-Exit, and nSVM has started doing
> > the same.  The hacks mostly work, but they're incomplete, confusing, and
> > lead to other hacky code, e.g. bailing from the emulator because #PF
> > injection forced a VM-Exit and suddenly KVM is back in L1.
> >
> > Everything leading up to that are related fixes and cleanups I encountered
> > along the way; some through code inspection, some through tests.
> >
> > v2:
> >   - Rebased to kvm/queue (commit 8baacf67c76c) + selftests CPUID
> >     overhaul.
> >     https://lore.kernel.org/all/20220614200707.3315957-1-seanjc@google.com
> >   - Treat KVM_REQ_TRIPLE_FAULT as a pending exception.
> >
> > v1: https://lore.kernel.org/all/20220311032801.3467418-1-seanjc@google.com
> >
> > Sean Christopherson (21):
> >   KVM: nVMX: Unconditionally purge queued/injected events on nested
> >     "exit"
> >   KVM: VMX: Drop bits 31:16 when shoving exception error code into VMCS
> >   KVM: x86: Don't check for code breakpoints when emulating on exception
> >   KVM: nVMX: Treat General Detect #DB (DR7.GD=1) as fault-like
> >   KVM: nVMX: Prioritize TSS T-flag #DBs over Monitor Trap Flag
> >   KVM: x86: Treat #DBs from the emulator as fault-like (code and
> >     DR7.GD=1)
> >   KVM: x86: Use DR7_GD macro instead of open coding check in emulator
> >   KVM: nVMX: Ignore SIPI that arrives in L2 when vCPU is not in WFS
> >   KVM: nVMX: Unconditionally clear mtf_pending on nested VM-Exit
> >   KVM: VMX: Inject #PF on ENCLS as "emulated" #PF
> >   KVM: x86: Rename kvm_x86_ops.queue_exception to inject_exception
> >   KVM: x86: Make kvm_queued_exception a properly named, visible struct
> >   KVM: x86: Formalize blocking of nested pending exceptions
> >   KVM: x86: Use kvm_queue_exception_e() to queue #DF
> >   KVM: x86: Hoist nested event checks above event injection logic
> >   KVM: x86: Evaluate ability to inject SMI/NMI/IRQ after potential
> >     VM-Exit
> >   KVM: x86: Morph pending exceptions to pending VM-Exits at queue time
> >   KVM: x86: Treat pending TRIPLE_FAULT requests as pending exceptions
> >   KVM: VMX: Update MTF and ICEBP comments to document KVM's subtle
> >     behavior
> >   KVM: selftests: Use uapi header to get VMX and SVM exit reasons/codes
> >   KVM: selftests: Add an x86-only test to verify nested exception
> >     queueing
> >
> >  arch/x86/include/asm/kvm-x86-ops.h            |   2 +-
> >  arch/x86/include/asm/kvm_host.h               |  35 +-
> >  arch/x86/kvm/emulate.c                        |   3 +-
> >  arch/x86/kvm/svm/nested.c                     | 102 ++---
> >  arch/x86/kvm/svm/svm.c                        |  18 +-
> >  arch/x86/kvm/vmx/nested.c                     | 319 +++++++++-----
> >  arch/x86/kvm/vmx/sgx.c                        |   2 +-
> >  arch/x86/kvm/vmx/vmx.c                        |  53 ++-
> >  arch/x86/kvm/x86.c                            | 404 +++++++++++-------
> >  arch/x86/kvm/x86.h                            |  11 +-
> >  tools/testing/selftests/kvm/.gitignore        |   1 +
> >  tools/testing/selftests/kvm/Makefile          |   1 +
> >  .../selftests/kvm/include/x86_64/svm_util.h   |   7 +-
> >  .../selftests/kvm/include/x86_64/vmx.h        |  51 +--
> >  .../kvm/x86_64/nested_exceptions_test.c       | 295 +++++++++++++
> >  15 files changed, 886 insertions(+), 418 deletions(-)
> >  create mode 100644 tools/testing/selftests/kvm/x86_64/nested_exceptions_test.c
> >
> >
> > base-commit: 816967202161955f398ce379f9cbbedcb1eb03cb
>
> Hi Sean and everyone!
>
>
> Before I continue reviewing the patch series, I would like you to check if
> I understand the monitor trap/pending debug exception/event injection
> logic on VMX correctly. I was looking at the spec for several hours and I still have more
> questions than answers about it.
>
> So let me state what I understand:
>
> 1. Event injection (aka eventinj in SVM terms):
>
>   (VM_ENTRY_INTR_INFO_FIELD/VM_ENTRY_EXCEPTION_ERROR_CODE/VM_ENTRY_INSTRUCTION_LEN)
>
>   If I understand correctly, all event injection types, just like on SVM, just inject,
>   and never create something pending, and/or drop the injection if event is not allowed
>   (like if EFLAGS.IF is 0). VMX might have some checks that could fail VM entry,
>   if for example you try to inject type 0 (hardware interrupt) and EFLAGS.IF is 0
>   (I haven't checked this).
>
>   All event injections happen right away, don't deliver any payload (like DR6), etc.
>
>   Injection types 4/5/6, do the same as injection types 0/2/3 but in addition to that,
>   type 4/6 do a DPL check in IDT, and also these types can promote the RIP prior
>   to pushing it to the exception stack using VM_ENTRY_INSTRUCTION_LEN to be consistent
>   with cases when these trap like events are intercepted, where the interception happens
>   on the start of the instruction despite exceptions being trap-like.
>
>
> 2. #DB is the only trap like exception that can be pending for one more instruction
>    if MOV SS shadow is on (any other cases?).
>    (AMD just ignores the whole thing, rightfully)
>
>    That is why we have the GUEST_PENDING_DBG_EXCEPTIONS vmcs field.
>    I understand that it will be written by CPU in case we have VM exit at the moment
>    where #DB is already pending but not yet delivered.
>
>    That field can also be (sadly) used to "inject" #DB to the guest, if the hypervisor sets it,
>    and this #DB will actually update DR6 and such, and might be delayed/lost.
>
>
> 3. Facts about MTF:
>
>    * MTF as a feature is basically 'single step the guest by generating MTF VM exits after each executed
>      instruction', and is enabled in primary execution controls.
>
>    * MTF is also an 'event', and it can be injected separately by the hypervisor with event type 7,
>      and that has no connection to the 'feature', although usually this injection will be useful
>      when the hypervisor does some kind of re-injection, triggered by the actual MTF feature.
>
>    * MTF event can be lost if a higher priority VM exit happens; this is why the SDM talks about a 'pending MTF',
>      which means that MTF vmexit should happen unless something else prevents it and/or higher priority VM exit
>      overrides it.
>
>    * MTF event is raised (when the primary execution controls bit is enabled) when:
>
>         - after an injected (vectored) event, aka eventinj/VM_ENTRY_INTR_INFO_FIELD, is done updating the guest state
>           (that is stack was switched, stuff was pushed to new exception stack, RIP updated to the handler)
>           I am not 100% sure about this but this seems to be what PRM implies:
>
>           "If the “monitor trap flag” VM-execution control is 1 and VM entry is injecting a vectored event (see Section
>           26.6.1), an MTF VM exit is pending on the instruction boundary before the first instruction following the
>           VM entry."
>
>         - If an interrupt and/or #DB exception happens prior to executing the first instruction of the guest,
>           then once again MTF will happen on the first instruction of the exception/interrupt handler
>
>           "If the “monitor trap flag” VM-execution control is 1, VM entry is not injecting an event, and a pending event
>           (e.g., debug exception or interrupt) is delivered before an instruction can execute, an MTF VM exit is pending
>           on the instruction boundary following delivery of the event (or any nested exception)."
>
>           That means that #DB has higher priority than MTF, but not specified if fault DB or trap DB
>
>         - If an instruction causes an exception, once again, MTF will happen on the first instruction of the exception handler.
>
>         - Otherwise after an instruction (or REP iteration) retires.
>
>
> If you have more facts about MTF and related stuff and/or if I made a mistake in the above, I am all ears to listen!

Here's a comprehensive spreadsheet on virtualizing MTF, compiled by
Peter Shier. (Just in case anyone is interested in *truly*
virtualizing the feature under KVM, rather than just setting a
VM-execution control bit in vmcs02 and calling it done.)

https://docs.google.com/spreadsheets/d/e/2PACX-1vQYP3PgY_JT42zQaR8uMp4U5LCey0qSlvMb80MLwjw-kkgfr31HqLSqAOGtdZ56aU2YdVTvfkruhuon/pubhtml

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v2 00/21] KVM: x86: Event/exception fixes and cleanups
  2022-06-29 13:42   ` Jim Mattson
@ 2022-06-30  8:22     ` Maxim Levitsky
  2022-06-30 12:17       ` Jim Mattson
  2022-06-30 16:28       ` Jim Mattson
  2022-07-06 11:54     ` Maxim Levitsky
  1 sibling, 2 replies; 78+ messages in thread
From: Maxim Levitsky @ 2022-06-30  8:22 UTC (permalink / raw)
  To: Jim Mattson
  Cc: Sean Christopherson, Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li,
	Joerg Roedel, kvm, linux-kernel, Oliver Upton, Peter Shier

On Wed, 2022-06-29 at 06:42 -0700, Jim Mattson wrote:
> On Wed, Jun 29, 2022 at 4:17 AM Maxim Levitsky <mlevitsk@redhat.com> wrote:
> > On Tue, 2022-06-14 at 20:47 +0000, Sean Christopherson wrote:
> > > [...]
> > 
> > Hi Sean and everyone!
> > 
> > 
> > Before I continue reviewing the patch series, I would like you to check if
> > I understand the monitor trap/pending debug exception/event injection
> > logic on VMX correctly. I was looking at the spec for several hours and I still have more
> > questions than answers about it.
> > 
> > So let me state what I understand:
> > 
> > 1. Event injection (aka eventinj in SVM terms):
> > 
> >   (VM_ENTRY_INTR_INFO_FIELD/VM_ENTRY_EXCEPTION_ERROR_CODE/VM_ENTRY_INSTRUCTION_LEN)
> > 
> >   If I understand correctly, all event injection types, just like on SVM, just inject
> >   and never create something pending, and/or drop the injection if the event is not allowed
> >   (like if EFLAGS.IF is 0). VMX might have some checks that could fail VM entry,
> >   if for example you try to inject type 0 (hardware interrupt) and EFLAGS.IF is 0;
> >   I haven't checked this.
> 
> The event is never just "dropped." If it is illegal to deliver the
> event, VM-entry fails. See the second bullet under section 26.2.1.3:
> VM-Entry Control Fields, in the SDM, volume 3.
Yes, that is what I wanted to confirm.


> 
> 
> >   All event injections happen right away, don't deliver any payload (like DR6), etc.
> 
> Correct.
> 
> >   Injection types 4/5/6 do the same as injection types 0/2/3, but in addition to that,
> >   types 4/6 do a DPL check in the IDT, and also these types can promote the RIP prior
> >   to pushing it to the exception stack using VM_ENTRY_INSTRUCTION_LEN, to be consistent
> >   with cases when these trap-like events are intercepted, where the interception happens
> >   on the start of the instruction despite exceptions being trap-like.
> 
> Unlike the AMD "INTn intercept," these trap intercepts *do not* happen
> at the start of the instruction.

Are you sure about that? 


This is what the SDM says for direct intercepts.

27.3.3 Saving RIP, RSP, RFLAGS, and SSP

"If the VM exit is due to a software exception (due to an execution of INT3 or INTO) or a privileged software
exception (due to an execution of INT1), the value saved references the INT3, INTO, or INT1 instruction
that caused that exception."


If these events were intercepted indirectly (via IDT_VECTORING_INFO_FIELD),
it's hard to understand from the PRM where the RIP points, but at least we at KVM
do read VM_EXIT_INSTRUCTION_LEN, and then stuff it into VM_ENTRY_INSTRUCTION_LEN
when we re-inject the event
(look at __vmx_complete_interrupts); a rough sketch of that flow is below.
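
For illustration, a minimal sketch of that re-injection flow, heavily
simplified from what __vmx_complete_interrupts() and the injection path
actually do (reinject_soft_event() is a hypothetical helper; the VMCS
field names are the real ones):

	/*
	 * Sketch: re-inject an intercepted software interrupt/exception.
	 * The CPU recorded the instruction length on VM exit; handing it
	 * back on VM entry lets the CPU push the correct (advanced) RIP.
	 */
	static void reinject_soft_event(u32 intr_info)
	{
		u32 insn_len = vmcs_read32(VM_EXIT_INSTRUCTION_LEN);

		vmcs_write32(VM_ENTRY_INSTRUCTION_LEN, insn_len);
		vmcs_write32(VM_ENTRY_INTR_INFO_FIELD,
			     intr_info | INTR_INFO_VALID_MASK);
	}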


Also note that you can't intercept INTn on Intel at all,
but it can be intercepted indirectly via the IDT vectoring field.


> In early Intel VT-x parts, one could
> not easily reinject an intercepted software interrupt or exception
> using event injection, because VM-entry required a non-zero
> instruction length, and the guest RIP had already advanced. On CPUs
> that support a zero instruction length, one can now reinject a
> software interrupt or exception, by setting the VM-entry instruction
> length to 0.
> 
> > 2. #DB is the only trap-like exception that can be pending for one more instruction
> >    if MOV SS shadow is on (any other cases?).
> 
> I believe that's it. I'm not entirely sure about RTM, though.
> 
> >    (AMD just ignores the whole thing, rightfully)
> 
> When you say "ignores," do you mean that AMD ignores a data breakpoint
> or single-step trap generated by MOV-SS, or it ignores the fact that
> delivering such a #DB trap between the MOV-SS and the subsequent
> MOV-ESP will create a stack frame in the wrong place?

Two things can be inferred from the SVM spec:
	- AMD doesn't distinguish between the MOV SS and STI interrupt shadows.
	- AMD has no 'pending debug exception' field in the VMCB.

I don't know what AMD does for a #DB that happens on MOV SS, nor whether it
distinguishes these internally; it probably just drops the #DB or something.



> 
> >    That is why we have the GUEST_PENDING_DBG_EXCEPTIONS vmcs field.
> >    I understand that it will be written by the CPU in case we have a VM exit at the moment
> >    where a #DB is already pending but not yet delivered.
> > 
> >    That field can also be (sadly) used to "inject" a #DB into the guest, if the hypervisor sets it,
> >    and this #DB will actually update DR6 and such, and might be delayed/lost.
> 
> Injecting a #DB this way (if the hypervisor just emulated MOV-SS) is
> easier than emulating the next instruction or using MTF to step
> through the next instruction, and getting all of the deferred #DB
> delivery rules right. :-)
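
(For illustration, a minimal sketch of that approach, assuming the
hypervisor just emulated MOV SS and wants the CPU itself to deliver the
deferred #DB; the helper is hypothetical, the field name is the real
VMCS one:)

	/*
	 * Sketch: park the to-be-delivered #DB (DR6-like bits, e.g.
	 * DR6_BS for a single-step trap) in the pending debug exceptions
	 * field.  The CPU will deliver the #DB, updating DR6, once the
	 * MOV SS window has passed -- subject to the architectural
	 * delay/loss rules.
	 */
	static void defer_mov_ss_db(unsigned long dbg_bits)
	{
		vmcs_writel(GUEST_PENDING_DBG_EXCEPTIONS, dbg_bits);
	}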


> 
> > 3. Facts about MTF:
> > 
> >    * MTF as a feature is basically 'single step the guest by generating MTF VM exits after each executed
> >      instruction', and is enabled in primary execution controls.
> > 
> >    * MTF is also an 'event', and it can be injected separately by the hypervisor with event type 7,
> >      and that has no connection to the 'feature', although usually this injection will be useful
> >      when the hypervisor does some kind of re-injection, triggered by the actual MTF feature.
> > 
> >    * An MTF event can be lost if a higher priority VM exit happens; this is why the SDM talks about a 'pending MTF',
> >      which means that an MTF VM exit should happen unless something else prevents it and/or a higher priority VM exit
> >      overrides it.
> 
> Hence, the facility for injecting a "pending MTF"--so that it won't be "lost."
Yes, though that would be mostly useful for nesting.

For a non-nesting hypervisor, if the hypervisor figured out that a higher priority event overrode
the MTF, it can just process the MTF - why re-inject it? (A sketch of such a re-injection is below.)
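
A minimal sketch of such a re-injection for the nesting case (the helper
is hypothetical, the field and bit names are the real VMCS ones):

	/*
	 * Sketch: re-raise a "pending MTF" by injecting it as an event,
	 * i.e. interruption type 7 (other event) with vector 0, so it is
	 * not lost to a higher priority VM exit.
	 */
	static void reinject_pending_mtf(void)
	{
		vmcs_write32(VM_ENTRY_INTR_INFO_FIELD,
			     INTR_INFO_VALID_MASK | INTR_TYPE_OTHER_EVENT);
	}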


> 
> >    * MTF event is raised (when the primary execution controls bit is enabled) when:
> > 
> >         - after an injected (vectored) event, aka eventinj/VM_ENTRY_INTR_INFO_FIELD, is done updating the guest state
> >           (that is, the stack was switched, stuff was pushed to the new exception stack, RIP updated to the handler).
> >           I am not 100% sure about this, but this seems to be what the PRM implies:
> > 
> >           "If the “monitor trap flag” VM-execution control is 1 and VM entry is injecting a vectored event (see Section
> >           26.6.1), an MTF VM exit is pending on the instruction boundary before the first instruction following the
> >           VM entry."
> > 
> >         - If an interrupt and/or #DB exception happens prior to executing the first instruction of the guest,
> >           then once again MTF will happen on the first instruction of the exception/interrupt handler
> > 
> >           "If the “monitor trap flag” VM-execution control is 1, VM entry is not injecting an event, and a pending event
> >           (e.g., debug exception or interrupt) is delivered before an instruction can execute, an MTF VM exit is pending
> >           on the instruction boundary following delivery of the event (or any nested exception)."
> > 
> >           That means that #DB has higher priority than MTF, but it is not specified whether that is a fault #DB or a trap #DB
> 
> These are single-step, I/O and data breakpoint traps.

I am not sure what you mean. Single-step, I/O, and data breakpoints are indeed trap #DBs,
while "general detect" and code breakpoints are fault #DBs; we also have the task switch #DB, but since the hardware doesn't
emulate task switches, that one has to be injected. (See the classification sketch below.)
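
To make the fault/trap split concrete, here is a hypothetical classifier
based on the to-be-DR6 payload; as patch 4 of this series notes, the
payload alone cannot fully distinguish (fault-like) code breakpoints from
(trap-like) data breakpoints:

	/* Sketch: classify a #DB by its to-be-DR6 payload bits. */
	static bool db_is_fault_like(unsigned long dr6_payload)
	{
		/* DR6.BD (general detect, DR7.GD=1) is always fault-like. */
		if (dr6_payload & DR6_BD)
			return true;
		/*
		 * Code breakpoints are also fault-like, but they set the
		 * same DR6.B0-B3 bits as trap-like data breakpoints, so
		 * the payload alone can't tell them apart.
		 */
		return false;
	}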

> 
> >         - If an instruction causes an exception, once again, MTF will happen on the first instruction of the exception handler.
> > 
> >         - Otherwise after an instruction (or REP iteration) retires.
> > 
> > 
> > If you have more facts about MTF and related stuff and/or if I made a mistake in the above, I am all ears to listen!
> 
> You might be interested in my augmented Table 6-2 (from volume 3 of
> the SDM): https://docs.google.com/spreadsheets/d/e/2PACX-1vR8TkbSl4TqXtD62agRUs1QY3SY-98mKtOh-s8vYDzaDmDOcdfyTvlAxF9aVnHWRu7uyGhRwvHUziXT/pubhtml
> 

I can't access this document for some reason (from my redhat account, which is gmail as well).

Thanks for the info,
Best regards,
	Maxim Levitsky






* Re: [PATCH v2 00/21] KVM: x86: Event/exception fixes and cleanups
  2022-06-29 15:53   ` Jim Mattson
@ 2022-06-30  8:24     ` Maxim Levitsky
  2022-06-30 12:20       ` Jim Mattson
  0 siblings, 1 reply; 78+ messages in thread
From: Maxim Levitsky @ 2022-06-30  8:24 UTC (permalink / raw)
  To: Jim Mattson
  Cc: Sean Christopherson, Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li,
	Joerg Roedel, kvm, linux-kernel, Oliver Upton, Peter Shier

On Wed, 2022-06-29 at 08:53 -0700, Jim Mattson wrote:
> On Wed, Jun 29, 2022 at 4:17 AM Maxim Levitsky <mlevitsk@redhat.com> wrote:
> > [...]
> > If you have more facts about MTF and related stuff and/or if I made a mistake in the above, I am all ears to listen!
> 
> Here's a comprehensive spreadsheet on virtualizing MTF, compiled by
> Peter Shier. (Just in case anyone is interested in *truly*
> virtualizing the feature under KVM, rather than just setting a
> VM-execution control bit in vmcs02 and calling it done.)
> 
> https://docs.google.com/spreadsheets/d/e/2PACX-1vQYP3PgY_JT42zQaR8uMp4U5LCey0qSlvMb80MLwjw-kkgfr31HqLSqAOGtdZ56aU2YdVTvfkruhuon/pubhtml

Sadly, I can't access this document either :(

Best regards,
	Maxim Levitsky

> 




* Re: [PATCH v2 00/21] KVM: x86: Event/exception fixes and cleanups
  2022-06-30  8:22     ` Maxim Levitsky
@ 2022-06-30 12:17       ` Jim Mattson
  2022-06-30 13:10         ` Maxim Levitsky
  2022-06-30 16:28       ` Jim Mattson
  1 sibling, 1 reply; 78+ messages in thread
From: Jim Mattson @ 2022-06-30 12:17 UTC (permalink / raw)
  To: Maxim Levitsky
  Cc: Sean Christopherson, Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li,
	Joerg Roedel, kvm, linux-kernel, Oliver Upton, Peter Shier

On Thu, Jun 30, 2022 at 1:22 AM Maxim Levitsky <mlevitsk@redhat.com> wrote:

> I can't access this document for some reason (from my redhat account, which is gmail as well).

Try this one: https://docs.google.com/spreadsheets/d/13Yp7Cdg3ZyKoeZ3Qebp3uWi7urlPNmo5CQU5zFlayzs


* Re: [PATCH v2 00/21] KVM: x86: Event/exception fixes and cleanups
  2022-06-30  8:24     ` Maxim Levitsky
@ 2022-06-30 12:20       ` Jim Mattson
  0 siblings, 0 replies; 78+ messages in thread
From: Jim Mattson @ 2022-06-30 12:20 UTC (permalink / raw)
  To: Maxim Levitsky
  Cc: Sean Christopherson, Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li,
	Joerg Roedel, kvm, linux-kernel, Oliver Upton, Peter Shier

On Thu, Jun 30, 2022 at 1:24 AM Maxim Levitsky <mlevitsk@redhat.com> wrote:

> Sadly, I can't access this document either :(
Try this one: https://docs.google.com/spreadsheets/d/1u6yjgj0Fshd31YKFJ524mwle7BhxB3yuEy9fhdSoh-0


* Re: [PATCH v2 00/21] KVM: x86: Event/exception fixes and cleanups
  2022-06-30 12:17       ` Jim Mattson
@ 2022-06-30 13:10         ` Maxim Levitsky
  0 siblings, 0 replies; 78+ messages in thread
From: Maxim Levitsky @ 2022-06-30 13:10 UTC (permalink / raw)
  To: Jim Mattson
  Cc: Sean Christopherson, Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li,
	Joerg Roedel, kvm, linux-kernel, Oliver Upton, Peter Shier

On Thu, 2022-06-30 at 05:17 -0700, Jim Mattson wrote:
> On Thu, Jun 30, 2022 at 1:22 AM Maxim Levitsky <mlevitsk@redhat.com> wrote:
> 
> > I can't access this document for some reason (from my redhat account, which is gmail as well).
> 
> Try this one: https://docs.google.com/spreadsheets/d/13Yp7Cdg3ZyKoeZ3Qebp3uWi7urlPNmo5CQU5zFlayzs
> 
Thanks, now I can access both documents.

Best regards,
	Maxim Levitsky



* Re: [PATCH v2 00/21] KVM: x86: Event/exception fixes and cleanups
  2022-06-30  8:22     ` Maxim Levitsky
  2022-06-30 12:17       ` Jim Mattson
@ 2022-06-30 16:28       ` Jim Mattson
  2022-07-01  7:37         ` Maxim Levitsky
  1 sibling, 1 reply; 78+ messages in thread
From: Jim Mattson @ 2022-06-30 16:28 UTC (permalink / raw)
  To: Maxim Levitsky
  Cc: Sean Christopherson, Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li,
	Joerg Roedel, kvm, linux-kernel, Oliver Upton, Peter Shier

On Thu, Jun 30, 2022 at 1:22 AM Maxim Levitsky <mlevitsk@redhat.com> wrote:
>
> On Wed, 2022-06-29 at 06:42 -0700, Jim Mattson wrote:

> > Unlike the AMD "INTn intercept," these trap intercepts *do not* happen
> > at the start of the instruction.
>
> Are you sure about that?

I had been sure when I wrote that, but now that I see your response, I
have to question my memory. The SDM is definitely more authoritative
than I am.

> > When you say "ignores," do you mean that AMD ignores a data breakpoint
> > or single-step trap generated by MOV-SS, or it ignores the fact that
> > delivering such a #DB trap between the MOV-SS and the subsequent
> > MOV-ESP will create a stack frame in the wrong place?
>
> > Two things can be inferred from the SVM spec:
> >         - AMD doesn't distinguish between the MOV SS and STI interrupt shadows.
> >         - AMD has no 'pending debug exception' field in the VMCB.
> > 
> > I don't know what AMD does for a #DB that happens on MOV SS, nor whether it
> > distinguishes these internally; it probably just drops the #DB or something.

Without carrying pending debug exceptions, it seems that the only two
choices are to deliver the #DB with the exception frame in an
unintended location, or to drop the #DB. The latter seems preferable,
but neither one seems good. What I don't understand is why you claim
that AMD does this "rightfully." Are you saying that anyone with the
audacity to run a debugger on legacy code deserves to be thrown in
front of a moving train?

> > Hence, the facility for injecting a "pending MTF"--so that it won't be "lost."
> Yes, though that would be mostly useful for nesting.
> 
> For a non-nesting hypervisor, if the hypervisor figured out that a higher priority event overrode
> the MTF, it can just process the MTF - why re-inject it?

You're right. The facility is probably just there to make MTF
virtualizable. Intel was paying much closer attention to
virtualizability by the time MTF came along.

> >
> > These are single-step, I/O and data breakpoint traps.
>
> I am not sure what you mean. Single-step, I/O, and data breakpoints are indeed trap #DBs,
> while "general detect" and code breakpoints are fault #DBs; we also have the task switch #DB, but since the hardware doesn't
> emulate task switches, that one has to be injected.

Just enumerating. No more, no less.


* Re: [PATCH v2 00/21] KVM: x86: Event/exception fixes and cleanups
  2022-06-30 16:28       ` Jim Mattson
@ 2022-07-01  7:37         ` Maxim Levitsky
  0 siblings, 0 replies; 78+ messages in thread
From: Maxim Levitsky @ 2022-07-01  7:37 UTC (permalink / raw)
  To: Jim Mattson
  Cc: Sean Christopherson, Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li,
	Joerg Roedel, kvm, linux-kernel, Oliver Upton, Peter Shier

On Thu, 2022-06-30 at 09:28 -0700, Jim Mattson wrote:
> On Thu, Jun 30, 2022 at 1:22 AM Maxim Levitsky <mlevitsk@redhat.com> wrote:
> > On Wed, 2022-06-29 at 06:42 -0700, Jim Mattson wrote:
> > > Unlike the AMD "INTn intercept," these trap intercepts *do not* happen
> > > at the start of the instruction.
> > 
> > Are you sure about that?
> 
> I had been sure when I wrote that, but now that I see your response, I
> have to question my memory. The SDM is definitely more authoritative
> than I am.

x86 is like a fractal: the more I know, the more I realize I don't.

> 
> > > When you say "ignores," do you mean that AMD ignores a data breakpoint
> > > or single-step trap generated by MOV-SS, or it ignores the fact that
> > > delivering such a #DB trap between the MOV-SS and the subsequent
> > > MOV-ESP will create a stack frame in the wrong place?
> > 
> > Two things can be inferred from the SVM spec:
> >         - AMD doesn't distinguish between the MOV SS and STI interrupt shadows.
> >         - AMD has no 'pending debug exception' field in the VMCB.
> > 
> > I don't know what AMD does for a #DB that happens on MOV SS, nor whether it
> > distinguishes these internally; it probably just drops the #DB or something.
> 
> Without carrying pending debug exceptions, it seems that the only two
> choices are to deliver the #DB with the exception frame in an
> unintended location, or to drop the #DB. The latter seems preferable,
> but neither one seems good. What I don't understand is why you claim
> that AMD does this "rightfully." Are you saying that anyone with the
> audacity to run a debugger on legacy code deserves to be thrown in
> front of a moving train?

I understand what you mean; it's a tradeoff between a 100% compliant implementation
and the complexity the corner cases introduce. #DB can already be missed in some
cases, I think, especially in my experience with debuggers, and even more so
when debugging an OS.

It is a pain, as the OS naturally tries to switch tasks and process
interrupts all the time; I even added that _BLOCKIRQ flag to KVM to make it a bit better.

But still I understand what you mean, so maybe indeed VMX did it better.

> 
> > > Hence, the facility for injecting a "pending MTF"--so that it won't be "lost."
> > Yes, though that would be mostly useful for nesting.
> > 
> > For a non-nesting hypervisor, if the hypervisor figured out that a higher priority event overrode
> > the MTF, it can just process the MTF - why re-inject it?
> 
> You're right. The facility is probably just there to make MTF
> virtualizable. Intel was paying much closer attention to
> virtualizability by the time MTF came along.

That makes sense.


> 
> > > These are single-step, I/O and data breakpoint traps.
> > 
> > I am not sure what you mean. Single-step, I/O, and data breakpoints are indeed trap #DBs,
> > while "general detect" and code breakpoints are fault #DBs; we also have the task switch #DB, but since the hardware doesn't
> > emulate task switches, that one has to be injected.
> 
> Just enumerating. No more, no less.
> 

All right, thank you very much for the help, especially for the tables you provided;
all of this should be enough now for me to review the patch series.

Thanks,
Best regards,
	Maxim Levitsky



* Re: [PATCH v2 01/21] KVM: nVMX: Unconditionally purge queued/injected events on nested "exit"
  2022-06-14 20:47 ` [PATCH v2 01/21] KVM: nVMX: Unconditionally purge queued/injected events on nested "exit" Sean Christopherson
  2022-06-16 23:47   ` Jim Mattson
@ 2022-07-06 11:40   ` Maxim Levitsky
  1 sibling, 0 replies; 78+ messages in thread
From: Maxim Levitsky @ 2022-07-06 11:40 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel, kvm,
	linux-kernel, Oliver Upton, Peter Shier

On Tue, 2022-06-14 at 20:47 +0000, Sean Christopherson wrote:
> Drop pending exceptions and events queued for re-injection when leaving
> nested guest mode, even if the "exit" is due to VM-Fail, SMI, or forced
> by host userspace.  Failure to purge events could result in an event
> belonging to L2 being injected into L1.
> 
> This _should_ never happen for VM-Fail as all events should be blocked by
> nested_run_pending, but it's possible if KVM, not the L1 hypervisor, is
> the source of VM-Fail when running vmcs02.
> 
> SMI is a nop (barring unknown bugs) as recognition of SMI and thus entry
> to SMM is blocked by pending exceptions and re-injected events.
> 
> Forced exit is definitely buggy, but has likely gone unnoticed because
> userspace probably follows the forced exit with KVM_SET_VCPU_EVENTS (or
> some other ioctl() that purges the queue).
> 
> Fixes: 4f350c6dbcb9 ("kvm: nVMX: Handle deferred early VMLAUNCH/VMRESUME failure properly")
> Cc: stable@vger.kernel.org
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>  arch/x86/kvm/vmx/nested.c | 19 +++++++++++--------
>  1 file changed, 11 insertions(+), 8 deletions(-)
> 
> diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> index 7d8cd0ebcc75..ee6f27dffdba 100644
> --- a/arch/x86/kvm/vmx/nested.c
> +++ b/arch/x86/kvm/vmx/nested.c
> @@ -4263,14 +4263,6 @@ static void prepare_vmcs12(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12,
>  			nested_vmx_abort(vcpu,
>  					 VMX_ABORT_SAVE_GUEST_MSR_FAIL);
>  	}
> -
> -	/*
> -	 * Drop what we picked up for L2 via vmx_complete_interrupts. It is
> -	 * preserved above and would only end up incorrectly in L1.
> -	 */
> -	vcpu->arch.nmi_injected = false;
> -	kvm_clear_exception_queue(vcpu);
> -	kvm_clear_interrupt_queue(vcpu);
>  }
>  
>  /*
> @@ -4609,6 +4601,17 @@ void nested_vmx_vmexit(struct kvm_vcpu *vcpu, u32 vm_exit_reason,
>  		WARN_ON_ONCE(nested_early_check);
>  	}
>  
> +	/*
> +	 * Drop events/exceptions that were queued for re-injection to L2
> +	 * (picked up via vmx_complete_interrupts()), as well as exceptions
> +	 * that were pending for L2.  Note, this must NOT be hoisted above
> +	 * prepare_vmcs12(), events/exceptions queued for re-injection need to
> +	 * be captured in vmcs12 (see vmcs12_save_pending_event()).
> +	 */
> +	vcpu->arch.nmi_injected = false;
> +	kvm_clear_exception_queue(vcpu);
> +	kvm_clear_interrupt_queue(vcpu);
> +
>  	vmx_switch_vmcs(vcpu, &vmx->vmcs01);
>  
>  	/* Update any VMCS fields that might have changed while L2 ran */

Makes sense.

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>

Best regards,
	Maxim Levitsky



* Re: [PATCH v2 02/21] KVM: VMX: Drop bits 31:16 when shoving exception error code into VMCS
  2022-06-14 20:47 ` [PATCH v2 02/21] KVM: VMX: Drop bits 31:16 when shoving exception error code into VMCS Sean Christopherson
@ 2022-07-06 11:43   ` Maxim Levitsky
  2022-07-06 16:12     ` Sean Christopherson
  2022-07-06 20:02   ` Jim Mattson
  1 sibling, 1 reply; 78+ messages in thread
From: Maxim Levitsky @ 2022-07-06 11:43 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel, kvm,
	linux-kernel, Oliver Upton, Peter Shier

On Tue, 2022-06-14 at 20:47 +0000, Sean Christopherson wrote:
> Deliberately truncate the exception error code when shoving it into the
> VMCS (VM-Entry field for vmcs01 and vmcs02, VM-Exit field for vmcs12).
> Intel CPUs are incapable of handling 32-bit error codes and will never
> generate an error code with bits 31:16, but userspace can provide an
> arbitrary error code via KVM_SET_VCPU_EVENTS.  Failure to drop the bits
> on exception injection results in failed VM-Entry, as VMX disallows
> setting bits 31:16.  Setting the bits on VM-Exit would at best confuse
> L1, and at worst induce a nested VM-Entry failure, e.g. if L1 decided to
> reinject the exception back into L2.

Wouldn't it be better to fail KVM_SET_VCPU_EVENTS instead if it tries
to set an error code with the upper 16 bits set?

Or if that is considered ABI breakage, then the KVM_SET_VCPU_EVENTS code
can truncate the user-given value to 16 bits, e.g. something like the sketch below.
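
A minimal sketch of that alternative (hypothetical, not what this patch
does), somewhere in kvm_vcpu_ioctl_x86_set_vcpu_events():

	/*
	 * Sketch: reject (or truncate) a bogus error code at the ABI
	 * boundary instead of at injection time.
	 */
	if (events->exception.has_error_code &&
	    (events->exception.error_code >> 16))
		return -EINVAL;
	/*
	 * ...or, to avoid breaking existing userspace:
	 * events->exception.error_code &= 0xffff;
	 */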

Best regards,
	Maxim Levitsky


> 
> Cc: stable@vger.kernel.org
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>  arch/x86/kvm/vmx/nested.c |  9 ++++++++-
>  arch/x86/kvm/vmx/vmx.c    | 11 ++++++++++-
>  2 files changed, 18 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> index ee6f27dffdba..33ffc8bcf9cd 100644
> --- a/arch/x86/kvm/vmx/nested.c
> +++ b/arch/x86/kvm/vmx/nested.c
> @@ -3833,7 +3833,14 @@ static void nested_vmx_inject_exception_vmexit(struct kvm_vcpu *vcpu,
>  	u32 intr_info = nr | INTR_INFO_VALID_MASK;
>  
>  	if (vcpu->arch.exception.has_error_code) {
> -		vmcs12->vm_exit_intr_error_code = vcpu->arch.exception.error_code;
> +		/*
> +		 * Intel CPUs will never generate an error code with bits 31:16
> +		 * set, and more importantly VMX disallows setting bits 31:16
> +		 * in the injected error code for VM-Entry.  Drop the bits to
> +		 * mimic hardware and avoid inducing failure on nested VM-Entry
> +		 * if L1 chooses to inject the exception back to L2.
> +		 */
> +		vmcs12->vm_exit_intr_error_code = (u16)vcpu->arch.exception.error_code;
>  		intr_info |= INTR_INFO_DELIVER_CODE_MASK;
>  	}
>  
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 5e14e4c40007..ec98992024e2 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -1621,7 +1621,16 @@ static void vmx_queue_exception(struct kvm_vcpu *vcpu)
>  	kvm_deliver_exception_payload(vcpu);
>  
>  	if (has_error_code) {
> -		vmcs_write32(VM_ENTRY_EXCEPTION_ERROR_CODE, error_code);
> +		/*
> +		 * Despite the error code being architecturally defined as 32
> +		 * bits, and the VMCS field being 32 bits, Intel CPUs and thus
> +		 * VMX don't actually support setting bits 31:16.  Hardware
> +		 * will (should) never provide a bogus error code, but KVM's
> +		 * ABI lets userspace shove in arbitrary 32-bit values.  Drop
> +		 * the upper bits to avoid VM-Fail; losing information that
> +		 * doesn't really exist is preferable to killing the VM.
> +		 */
> +		vmcs_write32(VM_ENTRY_EXCEPTION_ERROR_CODE, (u16)error_code);
>  		intr_info |= INTR_INFO_DELIVER_CODE_MASK;
>  	}
>  




* Re: [PATCH v2 03/21] KVM: x86: Don't check for code breakpoints when emulating on exception
  2022-06-14 20:47 ` [PATCH v2 03/21] KVM: x86: Don't check for code breakpoints when emulating on exception Sean Christopherson
@ 2022-07-06 11:43   ` Maxim Levitsky
  2022-07-06 22:17   ` Jim Mattson
  1 sibling, 0 replies; 78+ messages in thread
From: Maxim Levitsky @ 2022-07-06 11:43 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel, kvm,
	linux-kernel, Oliver Upton, Peter Shier

On Tue, 2022-06-14 at 20:47 +0000, Sean Christopherson wrote:
> Don't check for code breakpoints during instruction emulation if the
> emulation was triggered by exception interception.  Code breakpoints are
> the highest priority fault-like exception, and KVM only emulates on
> exceptions that are fault-like.  Thus, if hardware signaled a different
> exception, then the vCPU is already past the stage of checking for
> hardware breakpoints.
> 
> This is likely a glorified nop in terms of functionality, and is more for
> clarification and is technically an optimization.  Intel's SDM explicitly
> states vmcs.GUEST_RFLAGS.RF on exception interception is the same as the
> value that would have been saved on the stack had the exception not been
> intercepted, i.e. will be '1' due to all fault-like exceptions setting RF
> to '1'.  AMD says "guest state saved ... is the processor state as of the
> moment the intercept triggers", but that begs the question, "when does
> the intercept trigger?".
> 
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>  arch/x86/kvm/x86.c | 21 ++++++++++++++++++---
>  1 file changed, 18 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 2318a99139fa..c5db31b4bd6f 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -8364,8 +8364,24 @@ int kvm_skip_emulated_instruction(struct kvm_vcpu *vcpu)
>  }
>  EXPORT_SYMBOL_GPL(kvm_skip_emulated_instruction);
>  
> -static bool kvm_vcpu_check_code_breakpoint(struct kvm_vcpu *vcpu, int *r)
> +static bool kvm_vcpu_check_code_breakpoint(struct kvm_vcpu *vcpu,
> +					   int emulation_type, int *r)
>  {
> +	WARN_ON_ONCE(emulation_type & EMULTYPE_NO_DECODE);
> +
> +	/*
> +	 * Do not check for code breakpoints if hardware has already done the
> +	 * checks, as inferred from the emulation type.  On NO_DECODE and SKIP,
> +	 * the instruction has passed all exception checks, and all intercepted
> +	 * exceptions that trigger emulation have lower priority than code
> +	 * breakpoints, i.e. the fact that the intercepted exception occurred
> +	 * means any code breakpoints have already been serviced.
> +	 */
> +	if (emulation_type & (EMULTYPE_NO_DECODE | EMULTYPE_SKIP |
> +			      EMULTYPE_TRAP_UD | EMULTYPE_TRAP_UD_FORCED |
> +			      EMULTYPE_VMWARE_GP | EMULTYPE_PF))
> +		return false;
> +
>  	if (unlikely(vcpu->guest_debug & KVM_GUESTDBG_USE_HW_BP) &&
>  	    (vcpu->arch.guest_debug_dr7 & DR7_BP_EN_MASK)) {
>  		struct kvm_run *kvm_run = vcpu->run;
> @@ -8487,8 +8503,7 @@ int x86_emulate_instruction(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
>  		 * are fault-like and are higher priority than any faults on
>  		 * the code fetch itself.
>  		 */
> -		if (!(emulation_type & EMULTYPE_SKIP) &&
> -		    kvm_vcpu_check_code_breakpoint(vcpu, &r))
> +		if (kvm_vcpu_check_code_breakpoint(vcpu, emulation_type, &r))
>  			return r;
>  
>  		r = x86_decode_emulated_instruction(vcpu, emulation_type,


Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>

Best regards,
	Maxim Levitsky



* Re: [PATCH v2 04/21] KVM: nVMX: Treat General Detect #DB (DR7.GD=1) as fault-like
  2022-06-14 20:47 ` [PATCH v2 04/21] KVM: nVMX: Treat General Detect #DB (DR7.GD=1) as fault-like Sean Christopherson
@ 2022-07-06 11:45   ` Maxim Levitsky
  0 siblings, 0 replies; 78+ messages in thread
From: Maxim Levitsky @ 2022-07-06 11:45 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel, kvm,
	linux-kernel, Oliver Upton, Peter Shier

On Tue, 2022-06-14 at 20:47 +0000, Sean Christopherson wrote:
> Exclude General Detect #DBs, which have fault-like behavior but also have
> a non-zero payload (DR6.BD=1), from nVMX's handling of pending debug
> traps.  Opportunistically rewrite the comment to better document what is
> being checked, i.e. "has a non-zero payload" vs. "has a payload", and to
> call out the many caveats surrounding #DBs that KVM dodges one way or
> another.
> 
> Cc: Oliver Upton <oupton@google.com>
> Cc: Peter Shier <pshier@google.com>
> Fixes: 684c0422da71 ("KVM: nVMX: Handle pending #DB when injecting INIT VM-exit")
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>  arch/x86/kvm/vmx/nested.c | 36 +++++++++++++++++++++++++-----------
>  1 file changed, 25 insertions(+), 11 deletions(-)
> 
> diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> index 33ffc8bcf9cd..61bc80fc4cfa 100644
> --- a/arch/x86/kvm/vmx/nested.c
> +++ b/arch/x86/kvm/vmx/nested.c
> @@ -3857,16 +3857,29 @@ static void nested_vmx_inject_exception_vmexit(struct kvm_vcpu *vcpu,
>  }
>  
>  /*
> - * Returns true if a debug trap is pending delivery.
> + * Returns true if a debug trap is (likely) pending delivery.  Infer the class
> + * of a #DB (trap-like vs. fault-like) from the exception payload (to-be-DR6).
> + * Using the payload is flawed because code breakpoints (fault-like) and data
> + * breakpoints (trap-like) set the same bits in DR6 (breakpoint detected), i.e.
> + * this will return false positives if a to-be-injected code breakpoint #DB is
> + * pending (from KVM's perspective, but not "pending" across an instruction
> + * boundary).  ICEBP, a.k.a. INT1, is also not reflected here even though it
> + * too is trap-like.
>   *
> - * In KVM, debug traps bear an exception payload. As such, the class of a #DB
> - * exception may be inferred from the presence of an exception payload.
> + * KVM "works" despite these flaws as ICEBP isn't currently supported by the
> + * emulator, Monitor Trap Flag is not marked pending on intercepted #DBs (the
> + * #DB has already happened), and MTF isn't marked pending on code breakpoints
> + * from the emulator (because such #DBs are fault-like and thus don't trigger
> + * actions that fire on instruction retire).

Makes sense overall, but to be honest I am still not 100% sure I understand the
new description fully.

The patch itself seems to be correct, so,

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>

Best regards,
	Maxim Levitsky




>   */
> -static inline bool vmx_pending_dbg_trap(struct kvm_vcpu *vcpu)
> +static inline unsigned long vmx_get_pending_dbg_trap(struct kvm_vcpu *vcpu)
>  {
> -	return vcpu->arch.exception.pending &&
> -			vcpu->arch.exception.nr == DB_VECTOR &&
> -			vcpu->arch.exception.payload;
> +	if (!vcpu->arch.exception.pending ||
> +	    vcpu->arch.exception.nr != DB_VECTOR)
> +		return 0;
> +
> +	/* General Detect #DBs are always fault-like. */
> +	return vcpu->arch.exception.payload & ~DR6_BD;
>  }
>  
>  /*
> @@ -3878,9 +3891,10 @@ static inline bool vmx_pending_dbg_trap(struct kvm_vcpu *vcpu)
>   */
>  static void nested_vmx_update_pending_dbg(struct kvm_vcpu *vcpu)
>  {
> -	if (vmx_pending_dbg_trap(vcpu))
> -		vmcs_writel(GUEST_PENDING_DBG_EXCEPTIONS,
> -			    vcpu->arch.exception.payload);
> +	unsigned long pending_dbg = vmx_get_pending_dbg_trap(vcpu);
> +
> +	if (pending_dbg)
> +		vmcs_writel(GUEST_PENDING_DBG_EXCEPTIONS, pending_dbg);
>  }
>  
>  static bool nested_vmx_preemption_timer_pending(struct kvm_vcpu *vcpu)
> @@ -3937,7 +3951,7 @@ static int vmx_check_nested_events(struct kvm_vcpu *vcpu)
>  	 * while delivering the pending exception.
>  	 */
>  
> -	if (vcpu->arch.exception.pending && !vmx_pending_dbg_trap(vcpu)) {
> +	if (vcpu->arch.exception.pending && !vmx_get_pending_dbg_trap(vcpu)) {
>  		if (vmx->nested.nested_run_pending)
>  			return -EBUSY;
>  		if (!nested_vmx_check_exception(vcpu, &exit_qual))









* Re: [PATCH v2 00/21] KVM: x86: Event/exception fixes and cleanups
  2022-06-29 13:42   ` Jim Mattson
  2022-06-30  8:22     ` Maxim Levitsky
@ 2022-07-06 11:54     ` Maxim Levitsky
  2022-07-06 17:13       ` Jim Mattson
  1 sibling, 1 reply; 78+ messages in thread
From: Maxim Levitsky @ 2022-07-06 11:54 UTC (permalink / raw)
  To: Jim Mattson
  Cc: Sean Christopherson, Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li,
	Joerg Roedel, kvm, linux-kernel, Oliver Upton, Peter Shier

On Wed, 2022-06-29 at 06:42 -0700, Jim Mattson wrote:
> On Wed, Jun 29, 2022 at 4:17 AM Maxim Levitsky <mlevitsk@redhat.com> wrote:
> > On Tue, 2022-06-14 at 20:47 +0000, Sean Christopherson wrote:
> > > The main goal of this series is to fix KVM's longstanding bug of not
> > > honoring L1's exception intercepts wants when handling an exception that
> > > occurs during delivery of a different exception.  E.g. if L0 and L1 are
> > > using shadow paging, and L2 hits a #PF, and then hits another #PF while
> > > vectoring the first #PF due to _L1_ not having a shadow page for the IDT,
> > > KVM needs to check L1's intercepts before morphing the #PF => #PF => #DF
> > > so that the #PF is routed to L1, not injected into L2 as a #DF.
> > > 
> > > nVMX has hacked around the bug for years by overriding the #PF injector
> > > for shadow paging to go straight to VM-Exit, and nSVM has started doing
> > > the same.  The hacks mostly work, but they're incomplete, confusing, and
> > > lead to other hacky code, e.g. bailing from the emulator because #PF
> > > injection forced a VM-Exit and suddenly KVM is back in L1.
> > > 
> > > Everything leading up to that are related fixes and cleanups I encountered
> > > along the way; some through code inspection, some through tests.
> > > 
> > > v2:
> > >   - Rebased to kvm/queue (commit 8baacf67c76c) + selftests CPUID
> > >     overhaul.
> > >     https://lore.kernel.org/all/20220614200707.3315957-1-seanjc@google.com
> > >   - Treat KVM_REQ_TRIPLE_FAULT as a pending exception.
> > > 
> > > v1: https://lore.kernel.org/all/20220311032801.3467418-1-seanjc@google.com
> > > 
> > > Sean Christopherson (21):
> > >   KVM: nVMX: Unconditionally purge queued/injected events on nested
> > >     "exit"
> > >   KVM: VMX: Drop bits 31:16 when shoving exception error code into VMCS
> > >   KVM: x86: Don't check for code breakpoints when emulating on exception
> > >   KVM: nVMX: Treat General Detect #DB (DR7.GD=1) as fault-like
> > >   KVM: nVMX: Prioritize TSS T-flag #DBs over Monitor Trap Flag
> > >   KVM: x86: Treat #DBs from the emulator as fault-like (code and
> > >     DR7.GD=1)
> > >   KVM: x86: Use DR7_GD macro instead of open coding check in emulator
> > >   KVM: nVMX: Ignore SIPI that arrives in L2 when vCPU is not in WFS
> > >   KVM: nVMX: Unconditionally clear mtf_pending on nested VM-Exit
> > >   KVM: VMX: Inject #PF on ENCLS as "emulated" #PF
> > >   KVM: x86: Rename kvm_x86_ops.queue_exception to inject_exception
> > >   KVM: x86: Make kvm_queued_exception a properly named, visible struct
> > >   KVM: x86: Formalize blocking of nested pending exceptions
> > >   KVM: x86: Use kvm_queue_exception_e() to queue #DF
> > >   KVM: x86: Hoist nested event checks above event injection logic
> > >   KVM: x86: Evaluate ability to inject SMI/NMI/IRQ after potential
> > >     VM-Exit
> > >   KVM: x86: Morph pending exceptions to pending VM-Exits at queue time
> > >   KVM: x86: Treat pending TRIPLE_FAULT requests as pending exceptions
> > >   KVM: VMX: Update MTF and ICEBP comments to document KVM's subtle
> > >     behavior
> > >   KVM: selftests: Use uapi header to get VMX and SVM exit reasons/codes
> > >   KVM: selftests: Add an x86-only test to verify nested exception
> > >     queueing
> > > 
> > >  arch/x86/include/asm/kvm-x86-ops.h            |   2 +-
> > >  arch/x86/include/asm/kvm_host.h               |  35 +-
> > >  arch/x86/kvm/emulate.c                        |   3 +-
> > >  arch/x86/kvm/svm/nested.c                     | 102 ++---
> > >  arch/x86/kvm/svm/svm.c                        |  18 +-
> > >  arch/x86/kvm/vmx/nested.c                     | 319 +++++++++-----
> > >  arch/x86/kvm/vmx/sgx.c                        |   2 +-
> > >  arch/x86/kvm/vmx/vmx.c                        |  53 ++-
> > >  arch/x86/kvm/x86.c                            | 404 +++++++++++-------
> > >  arch/x86/kvm/x86.h                            |  11 +-
> > >  tools/testing/selftests/kvm/.gitignore        |   1 +
> > >  tools/testing/selftests/kvm/Makefile          |   1 +
> > >  .../selftests/kvm/include/x86_64/svm_util.h   |   7 +-
> > >  .../selftests/kvm/include/x86_64/vmx.h        |  51 +--
> > >  .../kvm/x86_64/nested_exceptions_test.c       | 295 +++++++++++++
> > >  15 files changed, 886 insertions(+), 418 deletions(-)
> > >  create mode 100644 tools/testing/selftests/kvm/x86_64/nested_exceptions_test.c
> > > 
> > > 
> > > base-commit: 816967202161955f398ce379f9cbbedcb1eb03cb
> > 
> > Hi Sean and everyone!
> > 
> > 
> > Before I continue reviewing the patch series, I would like you to check if
> > I understand the monitor trap/pending debug exception/event injection
> > logic on VMX correctly. I was looking at the spec for several hours and I still have more
> > questions that answers about it.
> > 
> > So let me state what I understand:
> > 
> > 1. Event injection (aka eventinj in SVM terms):
> > 
> >   (VM_ENTRY_INTR_INFO_FIELD/VM_ENTRY_EXCEPTION_ERROR_CODE/VM_ENTRY_INSTRUCTION_LEN)
> > 
> >   If I understand correctly all event injections types just like on SVM just inject,
> >   and never create something pending, and/or drop the injection if event is not allowed
> >   (like if EFLAGS.IF is 0). VMX might have some checks that could fail VM entry,
> >   if for example you try to inject type 0 (hardware interrupt) and EFLAGS.IF is 0,
> >   I haven't checked this)
> 
> The event is never just "dropped." If it is illegal to deliver the
> event, VM-entry fails. See the second bullet under section 26.2.1.3:
> VM-Entry Control Fields, in the SDM, volume 3.
> 
> 
> >   All event injections happen right away, don't deliver any payload (like DR6), etc.
> 
> Correct.
> 
> >   Injection types 4/5/6, do the same as injection types 0/2/3 but in addition to that,
> >   type 4/6 do a DPL check in IDT, and also these types can promote the RIP prior
> >   to pushing it to the exception stack using VM_ENTRY_INSTRUCTION_LEN to be consistent
> >   with cases when these trap like events are intercepted, where the interception happens
> >   on the start of the instruction despite exceptions being trap-like.
> 
> Unlike the AMD "INTn intercept," these trap intercepts *do not* happen
> at the start of the instruction. In early Intel VT-x parts, one could
> not easily reinject an intercepted software interrupt or exception
> using event injection, because VM-entry required a non-zero
> instruction length, and the guest RIP had already advanced. On CPUs
> that support a non-zero instruction length, one can now reinject a
> software interrupt or exception, by setting the VM-entry instruction
> length to 0.
> 
> > 2. #DB is the only trap like exception that can be pending for one more instruction
> >    if MOV SS shadow is on (any other cases?).
> 
> I believe that's it. I'm not entirely sure about RTM,though.
> 
> >    (AMD just ignores the whole thing, rightfully)
> 
> When you say "ignores," do you mean that AMD ignores a data breakpoint
> or single-step trap generated by MOV-SS, or it ignores the fact that
> delivering such a #DB trap between the MOV-SS and the subsequent
> MOV-ESP will create a stack frame in the wrong place?
> 
> >    That is why we have the GUEST_PENDING_DBG_EXCEPTIONS vmcs field.
> >    I understand that it will be written by CPU in case we have VM exit at the moment
> >    where #DB is already pending but not yet delivered.
> > 
> >    That field can also be (sadly) used to "inject" #DB to the guest, if the hypervisor sets it,
> >    and this #DB will actually update DR6 and such, and might be delayed/lost.
> 
> Injecting a #DB this way (if the hypervisor just emulated MOV-SS) is
> easier than emulating the next instruction or using MTF to step
> through the next instruction, and getting all of the deferred #DB
> delivery rules right. :-)
> 
> > 3. Facts about MTF:
> > 
> >    * MTF as a feature is basically 'single step the guest by generating MTF VM exits after each executed
> >      instruction', and is enabled in primary execution controls.
> > 
> >    * MTF is also an 'event', and it can be injected separately by the hypervisor with event type 7,
> >      and that has no connection to the 'feature', although usually this injection will be useful
> >      when the hypervisor does some kind of re-injection, triggered by the actual MTF feature.
> > 
> >    * MTF event can be lost, if higher priority VM exit happens, this is why the SDM says about 'pending MTF',
> >      which means that MTF vmexit should happen unless something else prevents it and/or higher priority VM exit
> >      overrides it.
> 
> Hence, the facility for injecting a "pending MTF"--so that it won't be "lost."
> 
> >    * MTF event is raised (when the primary execution controls bit is enabled) when:
> > 
> >         - after an injected (vectored), aka eventinj/VM_ENTRY_INTR_INFO_FIELD, done updating the guest state
> >           (that is stack was switched, stuff was pushed to new exception stack, RIP updated to the handler)
> >           I am not 100% sure about this but this seems to be what PRM implies:
> > 
> >           "If the “monitor trap flag” VM-execution control is 1 and VM entry is injecting a vectored event (see Section
> >           26.6.1), an MTF VM exit is pending on the instruction boundary before the first instruction following the
> >           VM entry."
> > 
> >         - If an interrupt and/or #DB exception happens prior to executing the first instruction of the guest,
> >           then once again MTF will happen on the first instruction of the exception/interrupt handler
> > 
> >           "If the “monitor trap flag” VM-execution control is 1, VM entry is not injecting an event, and a pending event
> >           (e.g., debug exception or interrupt) is delivered before an instruction can execute, an MTF VM exit is pending
> >           on the instruction boundary following delivery of the event (or any nested exception)."
> > 
> >           That means that #DB has higher priority than MTF, but it is not specified whether this is fault #DB or trap #DB
> 
> These are single-step, I/O and data breakpoint traps.
> 
> >         - If an instruction causes an exception, then once again, MTF will happen on the first instruction of the exception handler.
> > 
> >         - Otherwise after an instruction (or REP iteration) retires.
> > 
> > 
> > If you have more facts about MTF and related stuff, and/or if I made a mistake in the above, I am all ears!
> 
> You might be interested in my augmented Table 6-2 (from volume 3 of
> the SDM): https://docs.google.com/spreadsheets/d/e/2PACX-1vR8TkbSl4TqXtD62agRUs1QY3SY-98mKtOh-s8vYDzaDmDOcdfyTvlAxF9aVnHWRu7uyGhRwvHUziXT/pubhtml
> 


This is that table, slightly processed by me:

--
=====================================================================================
My Notes:
=====================================================================================


- Events happen on the instruction boundary.

- On the instruction boundary, the previous instruction is fully finished executing,
  which means that it is retired, or in other words, the arch state changes made by it
  are fully committed, and that includes the transfer to an exception handler if
  that instruction caused a fault-like exception.

  (The statement about the transfer to an exception handler is not 100% true in KVM, since we use hardware to inject
   exceptions, thus when we deal with events, we still have to finish the last instruction
   by delivering the exception it caused, if any).

- On the instruction boundary, the next instruction might have already started executing, but none of its results
  have been committed to the arch state YET.

- The events are listed from highest priority (1) to lowest (10). The highest pending event always wins,
  meaning that it is delivered, while all other events are lost.
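
  (Conceptually something like this entirely hypothetical sketch, where
   events_by_priority[] is ordered as in the table below:)

	for (i = 0; i < ARRAY_SIZE(events_by_priority); i++)
		if (events_by_priority[i].pending)
			return deliver(events_by_priority[i]);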

=====================================================================================
Previous instruction is retired
=====================================================================================

1.0 Hardware Reset and Machine Checks
  - RESET
  - Machine Check


2.0 Trap on Task Switch
  - T flag in TSS is set, and the task switch was done by the previous instruction


3.0 External Hardware Interventions
  - FLUSH
  - STOPCLK
  - SMI
  - INIT

3.5 Pending MTF VM-exit
   "System-management interrupts (SMIs), INIT signals, and higher priority events take priority over MTF
   VM exits. MTF VM exits take priority over debug-trap exceptions and lower priority events."

   - Note that the MTF became pending due to the previous instruction and/or injection.

4.0 #DB Traps on Previous Instruction
  - Breakpoints
  - Debug Trap Exceptions (TF flag set or data/IO breakpoint)

4.3 VMX-preemption timer expired VM-exit
   "Debug-trap exceptions and higher priority events take priority over VM exits caused by the VMX-preemption
    timer. VM exits caused by the VMX-preemption timer take priority over VM exits caused by the “NMI-window
    exiting” VM-execution control and lower priority events."

4.6 NMI-window exiting VM-exit
   "Debug-trap exceptions (see Section 26.7.3) and higher priority events take priority over VM exits caused by
   [NMI-window exiting]. VM exits caused by this control take priority over non-maskable interrupts (NMIs) and lower
   priority events."

5.0 Nonmaskable Interrupts (NMI)


5.5 Interrupt-window exiting VM-exit + Virtual-interrupt delivery (if "interrupt-window exiting" is 0)

  "Virtual-interrupt delivery has the same priority as that of VM exits due to the 1-setting of the “interrupt-window
  exiting” VM-execution control. Thus, non-maskable interrupts (NMIs) and higher priority events take priority over
  delivery of a virtual interrupt; delivery of a virtual interrupt takes priority over external interrupts and lower priority
  events. "

6.0 Maskable Hardware Interrupts
  - Real hardware interrupts

=====================================================================================
Execution of next instruction starts
=====================================================================================

7.0 #DB Fault on next instruction
  - Instruction breakpoint
  - General detect ?? 

8.0 Faults from Fetching Next Instruction
  - Code-Segment Limit Violation
  - Code Page Fault
  - Control protection exception (missing ENDBRANCH at target of indirect call or jump)

9.0 Faults from Decoding Next Instruction
  - Instruction length > 15 bytes
  - Invalid Opcode
  - Coprocessor Not Available

10. Faults on Executing Next Instruction
  - Overflow
  - Bound error
  - Invalid TSS
  - Segment Not Present
  - Stack fault
  - General Protection
  - Data Page Fault
  - Alignment Check
  - x86 FPU Floating-point exception
  - SIMD floating-point exception
  - Virtualization exception
  - Control protection exception
--


So here are my questions:


1. Since #SMI is higher priority than MTF, that means that unless the dual-monitor treatment is used,
   and the dual-monitor handler figures out that an MTF was pending and re-injects it when it
   VMRESUMEs the 'host', the MTF gets lost, and there is no way for a normal hypervisor to
   do anything about it.

   Or maybe the pending MTF is saved to SMRAM somewhere.

   In case you will say that I am inventing this again, I am stating now that the above is
   just a guess.


2. For case 7, what about General Detect? Since the CPU needs to decode the instruction to raise it,
   it is more natural to have it belong to case 9.


3. Finally, just to state it: it looks like MTF can only be lost due to #SMI or a machine check,
   because the trap on task switch under VMX is a purely software thing - the task switch is emulated,
   thus the hypervisor can (though KVM doesn't) test that bit upon completion of the task switch,
   and do all the monitor-trap-related things it needs then (see the sketch below).

   (RESET/INIT doesn't matter as that makes the CPU lose most of its state)
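
   (Roughly, what I mean by "test that bit" is the sketch below;
   "new_tss_t_flag" is a made-up name for the T bit of the incoming TSS,
   and KVM's task-switch emulation does not actually do this today:)

	/* After the emulated task switch commits the new task state: */
	if (new_tss_t_flag) {
		/* DR6.BT is the architectural status bit for this trap. */
		vcpu->arch.dr6 |= DR6_BT;
		kvm_queue_exception(vcpu, DB_VECTOR);
	}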


Best regards,
	Maxim Levitsky


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v2 05/21] KVM: nVMX: Prioritize TSS T-flag #DBs over Monitor Trap Flag
  2022-06-14 20:47 ` [PATCH v2 05/21] KVM: nVMX: Prioritize TSS T-flag #DBs over Monitor Trap Flag Sean Christopherson
@ 2022-07-06 11:57   ` Maxim Levitsky
  2022-07-06 23:51   ` Jim Mattson
  1 sibling, 0 replies; 78+ messages in thread
From: Maxim Levitsky @ 2022-07-06 11:57 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel, kvm,
	linux-kernel, Oliver Upton, Peter Shier

On Tue, 2022-06-14 at 20:47 +0000, Sean Christopherson wrote:
> Service TSS T-flag #DBs prior to pending MTFs, as such #DBs are higher
> priority than MTF.  KVM itself doesn't emulate TSS #DBs, and any such
> exceptions injected from L1 will be handled by hardware (or morphed to
> a fault-like exception if injection fails), but theoretically userspace
> could pend a TSS T-flag #DB in conjunction with a pending MTF.


After reading Jim's table 6-2, this makes sense; however, note that
*check_nested_events is a bit different in the regard that the CPU checks
the events when the previous instruction is fully done committing its state,
and all that is left of it is maybe pending trap-like events,

but in KVM, *check_nested_events happens when we still haven't delivered
the fault-like exception from the previous instruction, and thus
a fault-like exception appears to have higher priority than a pending MTF.

Assuming that my analysis is right:

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>

Best regards,
	Maxim Levitsky


> 
> Note, there's no known use case this fixes, it's purely to be technically
> correct with respect to Intel's SDM.
> 
> Cc: Oliver Upton <oupton@google.com>
> Cc: Peter Shier <pshier@google.com>
> Fixes: 5ef8acbdd687 ("KVM: nVMX: Emulate MTF when performing instruction emulation")
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>  arch/x86/kvm/vmx/nested.c | 8 +++++---
>  1 file changed, 5 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> index 61bc80fc4cfa..e794791a6bdd 100644
> --- a/arch/x86/kvm/vmx/nested.c
> +++ b/arch/x86/kvm/vmx/nested.c
> @@ -3943,15 +3943,17 @@ static int vmx_check_nested_events(struct kvm_vcpu *vcpu)
>  	}
>  
>  	/*
> -	 * Process any exceptions that are not debug traps before MTF.
> +	 * Process exceptions that are higher priority than Monitor Trap Flag:
> +	 * fault-like exceptions, TSS T flag #DB (not emulated by KVM, but
> +	 * could theoretically come in from userspace), and ICEBP (INT1).
>  	 *
>  	 * Note that only a pending nested run can block a pending exception.
>  	 * Otherwise an injected NMI/interrupt should either be
>  	 * lost or delivered to the nested hypervisor in the IDT_VECTORING_INFO,
>  	 * while delivering the pending exception.
>  	 */
> -
> -	if (vcpu->arch.exception.pending && !vmx_get_pending_dbg_trap(vcpu)) {
> +	if (vcpu->arch.exception.pending &&
> +	    !(vmx_get_pending_dbg_trap(vcpu) & ~DR6_BT)) {
>  		if (vmx->nested.nested_run_pending)
>  			return -EBUSY;
>  		if (!nested_vmx_check_exception(vcpu, &exit_qual))





^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v2 06/21] KVM: x86: Treat #DBs from the emulator as fault-like (code and DR7.GD=1)
  2022-06-14 20:47 ` [PATCH v2 06/21] KVM: x86: Treat #DBs from the emulator as fault-like (code and DR7.GD=1) Sean Christopherson
@ 2022-07-06 11:57   ` Maxim Levitsky
  2022-07-06 23:55   ` Jim Mattson
  1 sibling, 0 replies; 78+ messages in thread
From: Maxim Levitsky @ 2022-07-06 11:57 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel, kvm,
	linux-kernel, Oliver Upton, Peter Shier

On Tue, 2022-06-14 at 20:47 +0000, Sean Christopherson wrote:
> Add a dedicated "exception type" for #DBs, as #DBs can be fault-like or
> trap-like depending on the sub-type of #DB, and effectively defer the
> decision of what to do with the #DB to the caller.
> 
> For the emulator's two calls to exception_type(), treat the #DB as
> fault-like, as the emulator handles only code breakpoint and general
> detect #DBs, both of which are fault-like.
> 
> For event injection, which uses exception_type() to determine whether to
> set EFLAGS.RF=1 on the stack, keep the current behavior of not setting
> RF=1 for #DBs.  Intel and AMD explicitly state RF isn't set on code #DBs,
> so exempting by failing the "== EXCPT_FAULT" check is correct.  The only
> other fault-like #DB is General Detect, and despite Intel and AMD both
> strongly implying (through omission) that General Detect #DBs should set
> RF=1, hardware (multiple generations of both Intel and AMD) in fact does
> not.  Through insider knowledge, extreme foresight, sheer dumb luck, or
> some combination thereof, KVM correctly handled RF for General Detect #DBs.
> 
> Fixes: 38827dbd3fb8 ("KVM: x86: Do not update EFLAGS on faulting emulation")
> Cc: stable@vger.kernel.org
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>  arch/x86/kvm/x86.c | 27 +++++++++++++++++++++++++--
>  1 file changed, 25 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index c5db31b4bd6f..7c3ce601bdcc 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -529,6 +529,7 @@ static int exception_class(int vector)
>  #define EXCPT_TRAP		1
>  #define EXCPT_ABORT		2
>  #define EXCPT_INTERRUPT		3
> +#define EXCPT_DB		4
>  
>  static int exception_type(int vector)
>  {
> @@ -539,8 +540,14 @@ static int exception_type(int vector)
>  
>  	mask = 1 << vector;
>  
> -	/* #DB is trap, as instruction watchpoints are handled elsewhere */
> -	if (mask & ((1 << DB_VECTOR) | (1 << BP_VECTOR) | (1 << OF_VECTOR)))
> +	/*
> +	 * #DBs can be trap-like or fault-like, the caller must check other CPU
> +	 * state, e.g. DR6, to determine whether a #DB is a trap or fault.
> +	 */
> +	if (mask & (1 << DB_VECTOR))
> +		return EXCPT_DB;
> +
> +	if (mask & ((1 << BP_VECTOR) | (1 << OF_VECTOR)))
>  		return EXCPT_TRAP;
>  
>  	if (mask & ((1 << DF_VECTOR) | (1 << MC_VECTOR)))
> @@ -8632,6 +8639,12 @@ int x86_emulate_instruction(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
>  		unsigned long rflags = static_call(kvm_x86_get_rflags)(vcpu);
>  		toggle_interruptibility(vcpu, ctxt->interruptibility);
>  		vcpu->arch.emulate_regs_need_sync_to_vcpu = false;
> +
> +		/*
> +		 * Note, EXCPT_DB is assumed to be fault-like as the emulator
> +		 * only supports code breakpoints and general detect #DB, both
> +		 * of which are fault-like.
> +		 */
>  		if (!ctxt->have_exception ||
>  		    exception_type(ctxt->exception.vector) == EXCPT_TRAP) {
>  			kvm_pmu_trigger_event(vcpu, PERF_COUNT_HW_INSTRUCTIONS);
> @@ -9546,6 +9559,16 @@ static int inject_pending_event(struct kvm_vcpu *vcpu, bool *req_immediate_exit)
>  
>  	/* try to inject new event if pending */
>  	if (vcpu->arch.exception.pending) {
> +		/*
> +		 * Fault-class exceptions, except #DBs, set RF=1 in the RFLAGS
> +		 * value pushed on the stack.  Trap-like exception and all #DBs
> +		 * leave RF as-is (KVM follows Intel's behavior in this regard;
> +		 * AMD states that code breakpoint #DBs explicitly clear RF=0).
> +		 *
> +		 * Note, most versions of Intel's SDM and AMD's APM incorrectly
> +		 * describe the behavior of General Detect #DBs, which are
> +		 * fault-like.  They do _not_ set RF, a la code breakpoints.
> +		 */
>  		if (exception_type(vcpu->arch.exception.nr) == EXCPT_FAULT)
>  			__kvm_set_rflags(vcpu, kvm_get_rflags(vcpu) |
>  					     X86_EFLAGS_RF);

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>

Best regards,
	Maxim Levitsky


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v2 07/21] KVM: x86: Use DR7_GD macro instead of open coding check in emulator
  2022-06-14 20:47 ` [PATCH v2 07/21] KVM: x86: Use DR7_GD macro instead of open coding check in emulator Sean Christopherson
@ 2022-07-06 11:58   ` Maxim Levitsky
  0 siblings, 0 replies; 78+ messages in thread
From: Maxim Levitsky @ 2022-07-06 11:58 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel, kvm,
	linux-kernel, Oliver Upton, Peter Shier

On Tue, 2022-06-14 at 20:47 +0000, Sean Christopherson wrote:
> Use DR7_GD in the emulator instead of open coding the check, and drop a
> comically wrong comment.
> 
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>  arch/x86/kvm/emulate.c | 3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
> index 39ea9138224c..bf499716d9d3 100644
> --- a/arch/x86/kvm/emulate.c
> +++ b/arch/x86/kvm/emulate.c
> @@ -4182,8 +4182,7 @@ static int check_dr7_gd(struct x86_emulate_ctxt *ctxt)
>  
>  	ctxt->ops->get_dr(ctxt, 7, &dr7);
>  
> -	/* Check if DR7.Global_Enable is set */
> -	return dr7 & (1 << 13);
> +	return dr7 & DR7_GD;
>  }
>  
>  static int check_dr_read(struct x86_emulate_ctxt *ctxt)


Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>

Best regards,
	Maxim Levitsky


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v2 08/21] KVM: nVMX: Ignore SIPI that arrives in L2 when vCPU is not in WFS
  2022-06-14 20:47 ` [PATCH v2 08/21] KVM: nVMX: Ignore SIPI that arrives in L2 when vCPU is not in WFS Sean Christopherson
@ 2022-07-06 11:59   ` Maxim Levitsky
  0 siblings, 0 replies; 78+ messages in thread
From: Maxim Levitsky @ 2022-07-06 11:59 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel, kvm,
	linux-kernel, Oliver Upton, Peter Shier

On Tue, 2022-06-14 at 20:47 +0000, Sean Christopherson wrote:
> Fall through to handling other pending exception/events for L2 if SIPI
> is pending while the CPU is not in Wait-for-SIPI.  KVM correctly ignores
> the event, but incorrectly returns immediately, e.g. a SIPI coincident
> with another event could lead to KVM incorrectly routing the event to L1
> instead of L2.
> 
> Fixes: bf0cd88ce363 ("KVM: x86: emulate wait-for-SIPI and SIPI-VMExit")
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>  arch/x86/kvm/vmx/nested.c | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> index e794791a6bdd..d080bfca16ef 100644
> --- a/arch/x86/kvm/vmx/nested.c
> +++ b/arch/x86/kvm/vmx/nested.c
> @@ -3936,10 +3936,12 @@ static int vmx_check_nested_events(struct kvm_vcpu *vcpu)
>  			return -EBUSY;
>  
>  		clear_bit(KVM_APIC_SIPI, &apic->pending_events);
> -		if (vcpu->arch.mp_state == KVM_MP_STATE_INIT_RECEIVED)
> +		if (vcpu->arch.mp_state == KVM_MP_STATE_INIT_RECEIVED) {
>  			nested_vmx_vmexit(vcpu, EXIT_REASON_SIPI_SIGNAL, 0,
>  						apic->sipi_vector & 0xFFUL);
> -		return 0;
> +			return 0;
> +		}
> +		/* Fallthrough, the SIPI is completely ignored. */
>  	}
>  
>  	/*



Makes sense.

Note that svm_check_nested_events lacks the code to check for SIPI at all,
but SVM lacks a SIPI intercept, thus this is likely correct:
the place which delivers SIPI to L1 is, I think, kvm_apic_accept_events,
and it will ignore it unless the CPU is in the INIT (wait-for-SIPI) state, in which
case it will not be in nested mode.
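
(From memory, the relevant logic in kvm_apic_accept_events() is roughly the
following - paraphrased, not the exact code:)

	if (test_bit(KVM_APIC_SIPI, &apic->pending_events)) {
		clear_bit(KVM_APIC_SIPI, &apic->pending_events);
		if (vcpu->arch.mp_state == KVM_MP_STATE_INIT_RECEIVED)
			kvm_vcpu_deliver_sipi_vector(vcpu, apic->sipi_vector);
		/* else the SIPI is simply dropped */
	}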


Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>

Best regards,
	Maxim Levitsky


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v2 09/21] KVM: nVMX: Unconditionally clear mtf_pending on nested VM-Exit
  2022-06-14 20:47 ` [PATCH v2 09/21] KVM: nVMX: Unconditionally clear mtf_pending on nested VM-Exit Sean Christopherson
@ 2022-07-06 12:00   ` Maxim Levitsky
  2022-07-06 16:45     ` Sean Christopherson
  0 siblings, 1 reply; 78+ messages in thread
From: Maxim Levitsky @ 2022-07-06 12:00 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel, kvm,
	linux-kernel, Oliver Upton, Peter Shier

On Tue, 2022-06-14 at 20:47 +0000, Sean Christopherson wrote:
> Clear mtf_pending on nested VM-Exit instead of handling the clear on a
> case-by-case basis in vmx_check_nested_events().  The pending MTF should
> rever survive nested VM-Exit, as it is a property of KVM's run of the
^^ typo: never

Also, it is not clear what 'case by case' means here.

I see that vmx_check_nested_events always clears it unless a nested run is pending
or we re-inject an event.



> current L2, i.e. should never affect the next L2 run by L1.  In practice,
> this is likely a nop as getting to L1 with nested_run_pending is
> impossible, and KVM doesn't correctly handle morphing a pending exception
> that occurs on a prior injected exception (need for re-injected exception
> being the other case where MTF isn't cleared).  However, KVM will
> hopefully soon correctly deal with a pending exception on top of an
> injected exception.



> 
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>  arch/x86/kvm/vmx/nested.c | 16 +++++++---------
>  1 file changed, 7 insertions(+), 9 deletions(-)
> 
> diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> index d080bfca16ef..7b644513c82b 100644
> --- a/arch/x86/kvm/vmx/nested.c
> +++ b/arch/x86/kvm/vmx/nested.c
> @@ -3909,16 +3909,8 @@ static int vmx_check_nested_events(struct kvm_vcpu *vcpu)
>  	unsigned long exit_qual;
>  	bool block_nested_events =
>  	    vmx->nested.nested_run_pending || kvm_event_needs_reinjection(vcpu);
> -	bool mtf_pending = vmx->nested.mtf_pending;
>  	struct kvm_lapic *apic = vcpu->arch.apic;
>  
> -	/*
> -	 * Clear the MTF state. If a higher priority VM-exit is delivered first,
> -	 * this state is discarded.
> -	 */
> -	if (!block_nested_events)
> -		vmx->nested.mtf_pending = false;
> -
>  	if (lapic_in_kernel(vcpu) &&
>  		test_bit(KVM_APIC_INIT, &apic->pending_events)) {
>  		if (block_nested_events)
> @@ -3927,6 +3919,9 @@ static int vmx_check_nested_events(struct kvm_vcpu *vcpu)
>  		clear_bit(KVM_APIC_INIT, &apic->pending_events);
>  		if (vcpu->arch.mp_state != KVM_MP_STATE_INIT_RECEIVED)
>  			nested_vmx_vmexit(vcpu, EXIT_REASON_INIT_SIGNAL, 0, 0);
> +
> +		/* MTF is discarded if the vCPU is in WFS. */
> +		vmx->nested.mtf_pending = false;
>  		return 0;

I guess MTF should also be discarded if we enter SMM, and I see that
VMX also enters SMM with a pseudo VM exit (in vmx_enter_smm), which
will clear the MTF. Good.
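
(I.e., paraphrasing the relevant bit of vmx_enter_smm() - not the exact code:)

	vmx->nested.smm.guest_mode = is_guest_mode(vcpu);
	if (vmx->nested.smm.guest_mode)
		/* With this patch, this now also clears nested.mtf_pending. */
		nested_vmx_vmexit(vcpu, -1, 0, 0);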

>  	}
>  
> @@ -3964,7 +3959,7 @@ static int vmx_check_nested_events(struct kvm_vcpu *vcpu)
>  		return 0;
>  	}
>  
> -	if (mtf_pending) {
> +	if (vmx->nested.mtf_pending) {
>  		if (block_nested_events)
>  			return -EBUSY;
>  		nested_vmx_update_pending_dbg(vcpu);
> @@ -4562,6 +4557,9 @@ void nested_vmx_vmexit(struct kvm_vcpu *vcpu, u32 vm_exit_reason,
>  	struct vcpu_vmx *vmx = to_vmx(vcpu);
>  	struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
>  
> +	/* Pending MTF traps are discarded on VM-Exit. */
> +	vmx->nested.mtf_pending = false;
> +
>  	/* trying to cancel vmlaunch/vmresume is a bug */
>  	WARN_ON_ONCE(vmx->nested.nested_run_pending);
>  


Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>

Best regards,
	Maxim Levitsky




^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v2 10/21] KVM: VMX: Inject #PF on ENCLS as "emulated" #PF
  2022-06-14 20:47 ` [PATCH v2 10/21] KVM: VMX: Inject #PF on ENCLS as "emulated" #PF Sean Christopherson
@ 2022-07-06 12:00   ` Maxim Levitsky
  0 siblings, 0 replies; 78+ messages in thread
From: Maxim Levitsky @ 2022-07-06 12:00 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel, kvm,
	linux-kernel, Oliver Upton, Peter Shier

On Tue, 2022-06-14 at 20:47 +0000, Sean Christopherson wrote:
> Treat #PFs that occur during emulation of ENCLS as, wait for it, emulated
> page faults.  Practically speaking, this is a glorified nop as the
> exception is never of the nested flavor, and it's extremely unlikely the
> guest is relying on the side effect of an implicit INVLPG on the faulting
> address.
> 
> Fixes: 70210c044b4e ("KVM: VMX: Add SGX ENCLS[ECREATE] handler to enforce CPUID restrictions")
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>  arch/x86/kvm/vmx/sgx.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kvm/vmx/sgx.c b/arch/x86/kvm/vmx/sgx.c
> index 35e7ec91ae86..966cfa228f2a 100644
> --- a/arch/x86/kvm/vmx/sgx.c
> +++ b/arch/x86/kvm/vmx/sgx.c
> @@ -129,7 +129,7 @@ static int sgx_inject_fault(struct kvm_vcpu *vcpu, gva_t gva, int trapnr)
>  		ex.address = gva;
>  		ex.error_code_valid = true;
>  		ex.nested_page_fault = false;
> -		kvm_inject_page_fault(vcpu, &ex);
> +		kvm_inject_emulated_page_fault(vcpu, &ex);
>  	} else {
>  		kvm_inject_gp(vcpu, 0);
>  	}

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>

Best regards,
	Maxim Levitsky


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v2 11/21] KVM: x86: Rename kvm_x86_ops.queue_exception to inject_exception
  2022-06-14 20:47 ` [PATCH v2 11/21] KVM: x86: Rename kvm_x86_ops.queue_exception to inject_exception Sean Christopherson
@ 2022-07-06 12:01   ` Maxim Levitsky
  0 siblings, 0 replies; 78+ messages in thread
From: Maxim Levitsky @ 2022-07-06 12:01 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel, kvm,
	linux-kernel, Oliver Upton, Peter Shier

On Tue, 2022-06-14 at 20:47 +0000, Sean Christopherson wrote:
> Rename the kvm_x86_ops hook for exception injection to better reflect
> reality, and to align with pretty much every other related function name
> in KVM.

100% True.
> 
> No functional change intended.
> 
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>  arch/x86/include/asm/kvm-x86-ops.h | 2 +-
>  arch/x86/include/asm/kvm_host.h    | 2 +-
>  arch/x86/kvm/svm/svm.c             | 4 ++--
>  arch/x86/kvm/vmx/vmx.c             | 4 ++--
>  arch/x86/kvm/x86.c                 | 2 +-
>  5 files changed, 7 insertions(+), 7 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
> index 6f2f1affbb78..a42e2d9b04fe 100644
> --- a/arch/x86/include/asm/kvm-x86-ops.h
> +++ b/arch/x86/include/asm/kvm-x86-ops.h
> @@ -67,7 +67,7 @@ KVM_X86_OP(get_interrupt_shadow)
>  KVM_X86_OP(patch_hypercall)
>  KVM_X86_OP(inject_irq)
>  KVM_X86_OP(inject_nmi)
> -KVM_X86_OP(queue_exception)
> +KVM_X86_OP(inject_exception)
>  KVM_X86_OP(cancel_injection)
>  KVM_X86_OP(interrupt_allowed)
>  KVM_X86_OP(nmi_allowed)
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 7e98b2876380..16a7f91cdf75 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1505,7 +1505,7 @@ struct kvm_x86_ops {
>  				unsigned char *hypercall_addr);
>  	void (*inject_irq)(struct kvm_vcpu *vcpu, bool reinjected);
>  	void (*inject_nmi)(struct kvm_vcpu *vcpu);
> -	void (*queue_exception)(struct kvm_vcpu *vcpu);
> +	void (*inject_exception)(struct kvm_vcpu *vcpu);
>  	void (*cancel_injection)(struct kvm_vcpu *vcpu);
>  	int (*interrupt_allowed)(struct kvm_vcpu *vcpu, bool for_injection);
>  	int (*nmi_allowed)(struct kvm_vcpu *vcpu, bool for_injection);
> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> index c6cca0ce127b..ca39f76ca44b 100644
> --- a/arch/x86/kvm/svm/svm.c
> +++ b/arch/x86/kvm/svm/svm.c
> @@ -430,7 +430,7 @@ static int svm_update_soft_interrupt_rip(struct kvm_vcpu *vcpu)
>  	return 0;
>  }
>  
> -static void svm_queue_exception(struct kvm_vcpu *vcpu)
> +static void svm_inject_exception(struct kvm_vcpu *vcpu)
>  {
>  	struct vcpu_svm *svm = to_svm(vcpu);
>  	unsigned nr = vcpu->arch.exception.nr;
> @@ -4761,7 +4761,7 @@ static struct kvm_x86_ops svm_x86_ops __initdata = {
>  	.patch_hypercall = svm_patch_hypercall,
>  	.inject_irq = svm_inject_irq,
>  	.inject_nmi = svm_inject_nmi,
> -	.queue_exception = svm_queue_exception,
> +	.inject_exception = svm_inject_exception,
>  	.cancel_injection = svm_cancel_injection,
>  	.interrupt_allowed = svm_interrupt_allowed,
>  	.nmi_allowed = svm_nmi_allowed,
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index ec98992024e2..26b863c78a9f 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -1610,7 +1610,7 @@ static void vmx_clear_hlt(struct kvm_vcpu *vcpu)
>  		vmcs_write32(GUEST_ACTIVITY_STATE, GUEST_ACTIVITY_ACTIVE);
>  }
>  
> -static void vmx_queue_exception(struct kvm_vcpu *vcpu)
> +static void vmx_inject_exception(struct kvm_vcpu *vcpu)
>  {
>  	struct vcpu_vmx *vmx = to_vmx(vcpu);
>  	unsigned nr = vcpu->arch.exception.nr;
> @@ -7993,7 +7993,7 @@ static struct kvm_x86_ops vmx_x86_ops __initdata = {
>  	.patch_hypercall = vmx_patch_hypercall,
>  	.inject_irq = vmx_inject_irq,
>  	.inject_nmi = vmx_inject_nmi,
> -	.queue_exception = vmx_queue_exception,
> +	.inject_exception = vmx_inject_exception,
>  	.cancel_injection = vmx_cancel_injection,
>  	.interrupt_allowed = vmx_interrupt_allowed,
>  	.nmi_allowed = vmx_nmi_allowed,
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 7c3ce601bdcc..b63421d511c5 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -9504,7 +9504,7 @@ static void kvm_inject_exception(struct kvm_vcpu *vcpu)
>  
>  	if (vcpu->arch.exception.error_code && !is_protmode(vcpu))
>  		vcpu->arch.exception.error_code = false;
> -	static_call(kvm_x86_queue_exception)(vcpu);
> +	static_call(kvm_x86_inject_exception)(vcpu);
>  }
>  
>  static int inject_pending_event(struct kvm_vcpu *vcpu, bool *req_immediate_exit)

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>

Best regards,
	Maxim Levitsky <mlevitsk@redhat.com>





^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v2 12/21] KVM: x86: Make kvm_queued_exception a properly named, visible struct
  2022-06-14 20:47 ` [PATCH v2 12/21] KVM: x86: Make kvm_queued_exception a properly named, visible struct Sean Christopherson
@ 2022-07-06 12:02   ` Maxim Levitsky
  2022-07-18 13:07   ` Maxim Levitsky
  1 sibling, 0 replies; 78+ messages in thread
From: Maxim Levitsky @ 2022-07-06 12:02 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel, kvm,
	linux-kernel, Oliver Upton, Peter Shier

On Tue, 2022-06-14 at 20:47 +0000, Sean Christopherson wrote:
> Move the definition of "struct kvm_queued_exception" out of kvm_vcpu_arch
> in anticipation of adding a second instance in kvm_vcpu_arch to handle
> exceptions that occur when vectoring an injected exception and are
> morphed to VM-Exit instead of leading to #DF.
> 
> Opportunistically take advantage of the churn to rename "nr" to "vector".
> 
> No functional change intended.


Nitpick: This patch does a bit more refactoring than is stated in the changelog.

It might be worth it to split it into a few patches.

I didn't find any issues, and the refactoring is looking good overall,
but I might have missed something.

Best regards,
	Maxim Levitsky

> 
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>  arch/x86/include/asm/kvm_host.h | 23 +++++-----
>  arch/x86/kvm/svm/nested.c       | 45 ++++++++++---------
>  arch/x86/kvm/svm/svm.c          | 14 +++---
>  arch/x86/kvm/vmx/nested.c       | 42 +++++++++--------
>  arch/x86/kvm/vmx/vmx.c          | 20 ++++-----
>  arch/x86/kvm/x86.c              | 80 ++++++++++++++++-----------------
>  arch/x86/kvm/x86.h              |  3 +-
>  7 files changed, 111 insertions(+), 116 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 16a7f91cdf75..7f321d53a7e9 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -640,6 +640,17 @@ struct kvm_vcpu_xen {
>  	struct timer_list poll_timer;
>  };
>  
> +struct kvm_queued_exception {
> +	bool pending;
> +	bool injected;
> +	bool has_error_code;
> +	u8 vector;
> +	u32 error_code;
> +	unsigned long payload;
> +	bool has_payload;
> +	u8 nested_apf;
> +};
> +
>  struct kvm_vcpu_arch {
>  	/*
>  	 * rip and regs accesses must go through
> @@ -739,16 +750,8 @@ struct kvm_vcpu_arch {
>  
>  	u8 event_exit_inst_len;
>  
> -	struct kvm_queued_exception {
> -		bool pending;
> -		bool injected;
> -		bool has_error_code;
> -		u8 nr;
> -		u32 error_code;
> -		unsigned long payload;
> -		bool has_payload;
> -		u8 nested_apf;
> -	} exception;
> +	/* Exceptions to be injected to the guest. */
> +	struct kvm_queued_exception exception;
>  
>  	struct kvm_queued_interrupt {
>  		bool injected;
> diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
> index 83bae1f2eeb8..471d40e97890 100644
> --- a/arch/x86/kvm/svm/nested.c
> +++ b/arch/x86/kvm/svm/nested.c
> @@ -462,7 +462,7 @@ static void nested_save_pending_event_to_vmcb12(struct vcpu_svm *svm,
>  	unsigned int nr;
>  
>  	if (vcpu->arch.exception.injected) {
> -		nr = vcpu->arch.exception.nr;
> +		nr = vcpu->arch.exception.vector;
>  		exit_int_info = nr | SVM_EVTINJ_VALID | SVM_EVTINJ_TYPE_EXEPT;
>  
>  		if (vcpu->arch.exception.has_error_code) {
> @@ -1299,42 +1299,43 @@ int nested_svm_check_permissions(struct kvm_vcpu *vcpu)
>  
>  static bool nested_exit_on_exception(struct vcpu_svm *svm)
>  {
> -	unsigned int nr = svm->vcpu.arch.exception.nr;
> +	unsigned int vector = svm->vcpu.arch.exception.vector;
>  
> -	return (svm->nested.ctl.intercepts[INTERCEPT_EXCEPTION] & BIT(nr));
> +	return (svm->nested.ctl.intercepts[INTERCEPT_EXCEPTION] & BIT(vector));
>  }
>  
> -static void nested_svm_inject_exception_vmexit(struct vcpu_svm *svm)
> +static void nested_svm_inject_exception_vmexit(struct kvm_vcpu *vcpu)
>  {
> -	unsigned int nr = svm->vcpu.arch.exception.nr;
> +	struct kvm_queued_exception *ex = &vcpu->arch.exception;
> +	struct vcpu_svm *svm = to_svm(vcpu);
>  	struct vmcb *vmcb = svm->vmcb;
>  
> -	vmcb->control.exit_code = SVM_EXIT_EXCP_BASE + nr;
> +	vmcb->control.exit_code = SVM_EXIT_EXCP_BASE + ex->vector;
>  	vmcb->control.exit_code_hi = 0;
>  
> -	if (svm->vcpu.arch.exception.has_error_code)
> -		vmcb->control.exit_info_1 = svm->vcpu.arch.exception.error_code;
> +	if (ex->has_error_code)
> +		vmcb->control.exit_info_1 = ex->error_code;
>  
>  	/*
>  	 * EXITINFO2 is undefined for all exception intercepts other
>  	 * than #PF.
>  	 */
> -	if (nr == PF_VECTOR) {
> -		if (svm->vcpu.arch.exception.nested_apf)
> -			vmcb->control.exit_info_2 = svm->vcpu.arch.apf.nested_apf_token;
> -		else if (svm->vcpu.arch.exception.has_payload)
> -			vmcb->control.exit_info_2 = svm->vcpu.arch.exception.payload;
> +	if (ex->vector == PF_VECTOR) {
> +		if (ex->has_payload)
> +			vmcb->control.exit_info_2 = ex->payload;
>  		else
> -			vmcb->control.exit_info_2 = svm->vcpu.arch.cr2;
> -	} else if (nr == DB_VECTOR) {
> +			vmcb->control.exit_info_2 = vcpu->arch.cr2;
> +	} else if (ex->vector == DB_VECTOR) {
>  		/* See inject_pending_event.  */
> -		kvm_deliver_exception_payload(&svm->vcpu);
> -		if (svm->vcpu.arch.dr7 & DR7_GD) {
> -			svm->vcpu.arch.dr7 &= ~DR7_GD;
> -			kvm_update_dr7(&svm->vcpu);
> +		kvm_deliver_exception_payload(vcpu, ex);
> +
> +		if (vcpu->arch.dr7 & DR7_GD) {
> +			vcpu->arch.dr7 &= ~DR7_GD;
> +			kvm_update_dr7(vcpu);
>  		}
> -	} else
> -		WARN_ON(svm->vcpu.arch.exception.has_payload);
> +	} else {
> +		WARN_ON(ex->has_payload);
> +	}
>  
>  	nested_svm_vmexit(svm);
>  }
> @@ -1372,7 +1373,7 @@ static int svm_check_nested_events(struct kvm_vcpu *vcpu)
>                          return -EBUSY;
>  		if (!nested_exit_on_exception(svm))
>  			return 0;
> -		nested_svm_inject_exception_vmexit(svm);
> +		nested_svm_inject_exception_vmexit(vcpu);
>  		return 0;
>  	}
>  
> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> index ca39f76ca44b..6b80046a014f 100644
> --- a/arch/x86/kvm/svm/svm.c
> +++ b/arch/x86/kvm/svm/svm.c
> @@ -432,22 +432,20 @@ static int svm_update_soft_interrupt_rip(struct kvm_vcpu *vcpu)
>  
>  static void svm_inject_exception(struct kvm_vcpu *vcpu)
>  {
> +	struct kvm_queued_exception *ex = &vcpu->arch.exception;
>  	struct vcpu_svm *svm = to_svm(vcpu);
> -	unsigned nr = vcpu->arch.exception.nr;
> -	bool has_error_code = vcpu->arch.exception.has_error_code;
> -	u32 error_code = vcpu->arch.exception.error_code;
>  
> -	kvm_deliver_exception_payload(vcpu);
> +	kvm_deliver_exception_payload(vcpu, ex);
>  
> -	if (kvm_exception_is_soft(nr) &&
> +	if (kvm_exception_is_soft(ex->vector) &&
>  	    svm_update_soft_interrupt_rip(vcpu))
>  		return;
>  
> -	svm->vmcb->control.event_inj = nr
> +	svm->vmcb->control.event_inj = ex->vector
>  		| SVM_EVTINJ_VALID
> -		| (has_error_code ? SVM_EVTINJ_VALID_ERR : 0)
> +		| (ex->has_error_code ? SVM_EVTINJ_VALID_ERR : 0)
>  		| SVM_EVTINJ_TYPE_EXEPT;
> -	svm->vmcb->control.event_inj_err = error_code;
> +	svm->vmcb->control.event_inj_err = ex->error_code;
>  }
>  
>  static void svm_init_erratum_383(void)
> diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> index 7b644513c82b..fafdcbfeca1f 100644
> --- a/arch/x86/kvm/vmx/nested.c
> +++ b/arch/x86/kvm/vmx/nested.c
> @@ -445,29 +445,27 @@ static bool nested_vmx_is_page_fault_vmexit(struct vmcs12 *vmcs12,
>   */
>  static int nested_vmx_check_exception(struct kvm_vcpu *vcpu, unsigned long *exit_qual)
>  {
> +	struct kvm_queued_exception *ex = &vcpu->arch.exception;
>  	struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
> -	unsigned int nr = vcpu->arch.exception.nr;
> -	bool has_payload = vcpu->arch.exception.has_payload;
> -	unsigned long payload = vcpu->arch.exception.payload;
>  
> -	if (nr == PF_VECTOR) {
> -		if (vcpu->arch.exception.nested_apf) {
> +	if (ex->vector == PF_VECTOR) {
> +		if (ex->nested_apf) {
>  			*exit_qual = vcpu->arch.apf.nested_apf_token;
>  			return 1;
>  		}
> -		if (nested_vmx_is_page_fault_vmexit(vmcs12,
> -						    vcpu->arch.exception.error_code)) {
> -			*exit_qual = has_payload ? payload : vcpu->arch.cr2;
> +		if (nested_vmx_is_page_fault_vmexit(vmcs12, ex->error_code)) {
> +			*exit_qual = ex->has_payload ? ex->payload : vcpu->arch.cr2;
>  			return 1;
>  		}
> -	} else if (vmcs12->exception_bitmap & (1u << nr)) {
> -		if (nr == DB_VECTOR) {
> -			if (!has_payload) {
> -				payload = vcpu->arch.dr6;
> -				payload &= ~DR6_BT;
> -				payload ^= DR6_ACTIVE_LOW;
> +	} else if (vmcs12->exception_bitmap & (1u << ex->vector)) {
> +		if (ex->vector == DB_VECTOR) {
> +			if (ex->has_payload) {
> +				*exit_qual = ex->payload;
> +			} else {
> +				*exit_qual = vcpu->arch.dr6;
> +				*exit_qual &= ~DR6_BT;
> +				*exit_qual ^= DR6_ACTIVE_LOW;
>  			}
> -			*exit_qual = payload;
>  		} else
>  			*exit_qual = 0;
>  		return 1;
> @@ -3724,7 +3722,7 @@ static void vmcs12_save_pending_event(struct kvm_vcpu *vcpu,
>  	     is_double_fault(exit_intr_info))) {
>  		vmcs12->idt_vectoring_info_field = 0;
>  	} else if (vcpu->arch.exception.injected) {
> -		nr = vcpu->arch.exception.nr;
> +		nr = vcpu->arch.exception.vector;
>  		idt_vectoring = nr | VECTORING_INFO_VALID_MASK;
>  
>  		if (kvm_exception_is_soft(nr)) {
> @@ -3828,11 +3826,11 @@ static int vmx_complete_nested_posted_interrupt(struct kvm_vcpu *vcpu)
>  static void nested_vmx_inject_exception_vmexit(struct kvm_vcpu *vcpu,
>  					       unsigned long exit_qual)
>  {
> +	struct kvm_queued_exception *ex = &vcpu->arch.exception;
> +	u32 intr_info = ex->vector | INTR_INFO_VALID_MASK;
>  	struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
> -	unsigned int nr = vcpu->arch.exception.nr;
> -	u32 intr_info = nr | INTR_INFO_VALID_MASK;
>  
> -	if (vcpu->arch.exception.has_error_code) {
> +	if (ex->has_error_code) {
>  		/*
>  		 * Intel CPUs will never generate an error code with bits 31:16
>  		 * set, and more importantly VMX disallows setting bits 31:16
> @@ -3840,11 +3838,11 @@ static void nested_vmx_inject_exception_vmexit(struct kvm_vcpu *vcpu,
>  		 * mimic hardware and avoid inducing failure on nested VM-Entry
>  		 * if L1 chooses to inject the exception back to L2.
>  		 */
> -		vmcs12->vm_exit_intr_error_code = (u16)vcpu->arch.exception.error_code;
> +		vmcs12->vm_exit_intr_error_code = (u16)ex->error_code;
>  		intr_info |= INTR_INFO_DELIVER_CODE_MASK;
>  	}
>  
> -	if (kvm_exception_is_soft(nr))
> +	if (kvm_exception_is_soft(ex->vector))
>  		intr_info |= INTR_TYPE_SOFT_EXCEPTION;
>  	else
>  		intr_info |= INTR_TYPE_HARD_EXCEPTION;
> @@ -3875,7 +3873,7 @@ static void nested_vmx_inject_exception_vmexit(struct kvm_vcpu *vcpu,
>  static inline unsigned long vmx_get_pending_dbg_trap(struct kvm_vcpu *vcpu)
>  {
>  	if (!vcpu->arch.exception.pending ||
> -	    vcpu->arch.exception.nr != DB_VECTOR)
> +	    vcpu->arch.exception.vector != DB_VECTOR)
>  		return 0;
>  
>  	/* General Detect #DBs are always fault-like. */
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 26b863c78a9f..7ef5659a1bbd 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -1585,7 +1585,7 @@ static void vmx_update_emulated_instruction(struct kvm_vcpu *vcpu)
>  	 */
>  	if (nested_cpu_has_mtf(vmcs12) &&
>  	    (!vcpu->arch.exception.pending ||
> -	     vcpu->arch.exception.nr == DB_VECTOR))
> +	     vcpu->arch.exception.vector == DB_VECTOR))
>  		vmx->nested.mtf_pending = true;
>  	else
>  		vmx->nested.mtf_pending = false;
> @@ -1612,15 +1612,13 @@ static void vmx_clear_hlt(struct kvm_vcpu *vcpu)
>  
>  static void vmx_inject_exception(struct kvm_vcpu *vcpu)
>  {
> +	struct kvm_queued_exception *ex = &vcpu->arch.exception;
> +	u32 intr_info = ex->vector | INTR_INFO_VALID_MASK;
>  	struct vcpu_vmx *vmx = to_vmx(vcpu);
> -	unsigned nr = vcpu->arch.exception.nr;
> -	bool has_error_code = vcpu->arch.exception.has_error_code;
> -	u32 error_code = vcpu->arch.exception.error_code;
> -	u32 intr_info = nr | INTR_INFO_VALID_MASK;
>  
> -	kvm_deliver_exception_payload(vcpu);
> +	kvm_deliver_exception_payload(vcpu, ex);
>  
> -	if (has_error_code) {
> +	if (ex->has_error_code) {
>  		/*
>  		 * Despite the error code being architecturally defined as 32
>  		 * bits, and the VMCS field being 32 bits, Intel CPUs and thus
> @@ -1630,21 +1628,21 @@ static void vmx_inject_exception(struct kvm_vcpu *vcpu)
>  		 * the upper bits to avoid VM-Fail, losing information that
>  		 * doesn't really exist is preferable to killing the VM.
>  		 */
> -		vmcs_write32(VM_ENTRY_EXCEPTION_ERROR_CODE, (u16)error_code);
> +		vmcs_write32(VM_ENTRY_EXCEPTION_ERROR_CODE, (u16)ex->error_code);
>  		intr_info |= INTR_INFO_DELIVER_CODE_MASK;
>  	}
>  
>  	if (vmx->rmode.vm86_active) {
>  		int inc_eip = 0;
> -		if (kvm_exception_is_soft(nr))
> +		if (kvm_exception_is_soft(ex->vector))
>  			inc_eip = vcpu->arch.event_exit_inst_len;
> -		kvm_inject_realmode_interrupt(vcpu, nr, inc_eip);
> +		kvm_inject_realmode_interrupt(vcpu, ex->vector, inc_eip);
>  		return;
>  	}
>  
>  	WARN_ON_ONCE(vmx->emulation_required);
>  
> -	if (kvm_exception_is_soft(nr)) {
> +	if (kvm_exception_is_soft(ex->vector)) {
>  		vmcs_write32(VM_ENTRY_INSTRUCTION_LEN,
>  			     vmx->vcpu.arch.event_exit_inst_len);
>  		intr_info |= INTR_TYPE_SOFT_EXCEPTION;
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index b63421d511c5..511c0c8af80e 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -557,16 +557,13 @@ static int exception_type(int vector)
>  	return EXCPT_FAULT;
>  }
>  
> -void kvm_deliver_exception_payload(struct kvm_vcpu *vcpu)
> +void kvm_deliver_exception_payload(struct kvm_vcpu *vcpu,
> +				   struct kvm_queued_exception *ex)
>  {
> -	unsigned nr = vcpu->arch.exception.nr;
> -	bool has_payload = vcpu->arch.exception.has_payload;
> -	unsigned long payload = vcpu->arch.exception.payload;
> -
> -	if (!has_payload)
> +	if (!ex->has_payload)
>  		return;
>  
> -	switch (nr) {
> +	switch (ex->vector) {
>  	case DB_VECTOR:
>  		/*
>  		 * "Certain debug exceptions may clear bit 0-3.  The
> @@ -591,8 +588,8 @@ void kvm_deliver_exception_payload(struct kvm_vcpu *vcpu)
>  		 * So they need to be flipped for DR6.
>  		 */
>  		vcpu->arch.dr6 |= DR6_ACTIVE_LOW;
> -		vcpu->arch.dr6 |= payload;
> -		vcpu->arch.dr6 ^= payload & DR6_ACTIVE_LOW;
> +		vcpu->arch.dr6 |= ex->payload;
> +		vcpu->arch.dr6 ^= ex->payload & DR6_ACTIVE_LOW;
>  
>  		/*
>  		 * The #DB payload is defined as compatible with the 'pending
> @@ -603,12 +600,12 @@ void kvm_deliver_exception_payload(struct kvm_vcpu *vcpu)
>  		vcpu->arch.dr6 &= ~BIT(12);
>  		break;
>  	case PF_VECTOR:
> -		vcpu->arch.cr2 = payload;
> +		vcpu->arch.cr2 = ex->payload;
>  		break;
>  	}
>  
> -	vcpu->arch.exception.has_payload = false;
> -	vcpu->arch.exception.payload = 0;
> +	ex->has_payload = false;
> +	ex->payload = 0;
>  }
>  EXPORT_SYMBOL_GPL(kvm_deliver_exception_payload);
>  
> @@ -647,17 +644,18 @@ static void kvm_multiple_exception(struct kvm_vcpu *vcpu,
>  			vcpu->arch.exception.injected = false;
>  		}
>  		vcpu->arch.exception.has_error_code = has_error;
> -		vcpu->arch.exception.nr = nr;
> +		vcpu->arch.exception.vector = nr;
>  		vcpu->arch.exception.error_code = error_code;
>  		vcpu->arch.exception.has_payload = has_payload;
>  		vcpu->arch.exception.payload = payload;
>  		if (!is_guest_mode(vcpu))
> -			kvm_deliver_exception_payload(vcpu);
> +			kvm_deliver_exception_payload(vcpu,
> +						      &vcpu->arch.exception);
>  		return;
>  	}
>  
>  	/* to check exception */
> -	prev_nr = vcpu->arch.exception.nr;
> +	prev_nr = vcpu->arch.exception.vector;
>  	if (prev_nr == DF_VECTOR) {
>  		/* triple fault -> shutdown */
>  		kvm_make_request(KVM_REQ_TRIPLE_FAULT, vcpu);
> @@ -675,7 +673,7 @@ static void kvm_multiple_exception(struct kvm_vcpu *vcpu,
>  		vcpu->arch.exception.pending = true;
>  		vcpu->arch.exception.injected = false;
>  		vcpu->arch.exception.has_error_code = true;
> -		vcpu->arch.exception.nr = DF_VECTOR;
> +		vcpu->arch.exception.vector = DF_VECTOR;
>  		vcpu->arch.exception.error_code = 0;
>  		vcpu->arch.exception.has_payload = false;
>  		vcpu->arch.exception.payload = 0;
> @@ -4886,25 +4884,24 @@ static int kvm_vcpu_ioctl_x86_set_mce(struct kvm_vcpu *vcpu,
>  static void kvm_vcpu_ioctl_x86_get_vcpu_events(struct kvm_vcpu *vcpu,
>  					       struct kvm_vcpu_events *events)
>  {
> +	struct kvm_queued_exception *ex = &vcpu->arch.exception;
> +
>  	process_nmi(vcpu);
>  
>  	if (kvm_check_request(KVM_REQ_SMI, vcpu))
>  		process_smi(vcpu);
>  
>  	/*
> -	 * In guest mode, payload delivery should be deferred,
> -	 * so that the L1 hypervisor can intercept #PF before
> -	 * CR2 is modified (or intercept #DB before DR6 is
> -	 * modified under nVMX). Unless the per-VM capability,
> -	 * KVM_CAP_EXCEPTION_PAYLOAD, is set, we may not defer the delivery of
> -	 * an exception payload and handle after a KVM_GET_VCPU_EVENTS. Since we
> -	 * opportunistically defer the exception payload, deliver it if the
> -	 * capability hasn't been requested before processing a
> -	 * KVM_GET_VCPU_EVENTS.
> +	 * In guest mode, payload delivery should be deferred if the exception
> +	 * will be intercepted by L1, e.g. KVM should not modify CR2 if L1
> +	 * intercepts #PF, ditto for DR6 and #DBs.  If the per-VM capability,
> +	 * KVM_CAP_EXCEPTION_PAYLOAD, is not set, userspace may or may not
> +	 * propagate the payload and so it cannot be safely deferred.  Deliver
> +	 * the payload if the capability hasn't been requested.
>  	 */
>  	if (!vcpu->kvm->arch.exception_payload_enabled &&
> -	    vcpu->arch.exception.pending && vcpu->arch.exception.has_payload)
> -		kvm_deliver_exception_payload(vcpu);
> +	    ex->pending && ex->has_payload)
> +		kvm_deliver_exception_payload(vcpu, ex);
>  
>  	/*
>  	 * The API doesn't provide the instruction length for software
> @@ -4912,26 +4909,25 @@ static void kvm_vcpu_ioctl_x86_get_vcpu_events(struct kvm_vcpu *vcpu,
>  	 * isn't advanced, we should expect to encounter the exception
>  	 * again.
>  	 */
> -	if (kvm_exception_is_soft(vcpu->arch.exception.nr)) {
> +	if (kvm_exception_is_soft(ex->vector)) {
>  		events->exception.injected = 0;
>  		events->exception.pending = 0;
>  	} else {
> -		events->exception.injected = vcpu->arch.exception.injected;
> -		events->exception.pending = vcpu->arch.exception.pending;
> +		events->exception.injected = ex->injected;
> +		events->exception.pending = ex->pending;
>  		/*
>  		 * For ABI compatibility, deliberately conflate
>  		 * pending and injected exceptions when
>  		 * KVM_CAP_EXCEPTION_PAYLOAD isn't enabled.
>  		 */
>  		if (!vcpu->kvm->arch.exception_payload_enabled)
> -			events->exception.injected |=
> -				vcpu->arch.exception.pending;
> +			events->exception.injected |= ex->pending;
>  	}
> -	events->exception.nr = vcpu->arch.exception.nr;
> -	events->exception.has_error_code = vcpu->arch.exception.has_error_code;
> -	events->exception.error_code = vcpu->arch.exception.error_code;
> -	events->exception_has_payload = vcpu->arch.exception.has_payload;
> -	events->exception_payload = vcpu->arch.exception.payload;
> +	events->exception.nr = ex->vector;
> +	events->exception.has_error_code = ex->has_error_code;
> +	events->exception.error_code = ex->error_code;
> +	events->exception_has_payload = ex->has_payload;
> +	events->exception_payload = ex->payload;
>  
>  	events->interrupt.injected =
>  		vcpu->arch.interrupt.injected && !vcpu->arch.interrupt.soft;
> @@ -5003,7 +4999,7 @@ static int kvm_vcpu_ioctl_x86_set_vcpu_events(struct kvm_vcpu *vcpu,
>  	process_nmi(vcpu);
>  	vcpu->arch.exception.injected = events->exception.injected;
>  	vcpu->arch.exception.pending = events->exception.pending;
> -	vcpu->arch.exception.nr = events->exception.nr;
> +	vcpu->arch.exception.vector = events->exception.nr;
>  	vcpu->arch.exception.has_error_code = events->exception.has_error_code;
>  	vcpu->arch.exception.error_code = events->exception.error_code;
>  	vcpu->arch.exception.has_payload = events->exception_has_payload;
> @@ -9497,7 +9493,7 @@ int kvm_check_nested_events(struct kvm_vcpu *vcpu)
>  
>  static void kvm_inject_exception(struct kvm_vcpu *vcpu)
>  {
> -	trace_kvm_inj_exception(vcpu->arch.exception.nr,
> +	trace_kvm_inj_exception(vcpu->arch.exception.vector,
>  				vcpu->arch.exception.has_error_code,
>  				vcpu->arch.exception.error_code,
>  				vcpu->arch.exception.injected);
> @@ -9569,12 +9565,12 @@ static int inject_pending_event(struct kvm_vcpu *vcpu, bool *req_immediate_exit)
>  		 * describe the behavior of General Detect #DBs, which are
>  		 * fault-like.  They do _not_ set RF, a la code breakpoints.
>  		 */
> -		if (exception_type(vcpu->arch.exception.nr) == EXCPT_FAULT)
> +		if (exception_type(vcpu->arch.exception.vector) == EXCPT_FAULT)
>  			__kvm_set_rflags(vcpu, kvm_get_rflags(vcpu) |
>  					     X86_EFLAGS_RF);
>  
> -		if (vcpu->arch.exception.nr == DB_VECTOR) {
> -			kvm_deliver_exception_payload(vcpu);
> +		if (vcpu->arch.exception.vector == DB_VECTOR) {
> +			kvm_deliver_exception_payload(vcpu, &vcpu->arch.exception);
>  			if (vcpu->arch.dr7 & DR7_GD) {
>  				vcpu->arch.dr7 &= ~DR7_GD;
>  				kvm_update_dr7(vcpu);
> diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
> index 501b884b8cc4..dc2af0146220 100644
> --- a/arch/x86/kvm/x86.h
> +++ b/arch/x86/kvm/x86.h
> @@ -286,7 +286,8 @@ int kvm_write_guest_virt_system(struct kvm_vcpu *vcpu,
>  
>  int handle_ud(struct kvm_vcpu *vcpu);
>  
> -void kvm_deliver_exception_payload(struct kvm_vcpu *vcpu);
> +void kvm_deliver_exception_payload(struct kvm_vcpu *vcpu,
> +				   struct kvm_queued_exception *ex);
>  
>  void kvm_vcpu_mtrr_init(struct kvm_vcpu *vcpu);
>  u8 kvm_mtrr_get_guest_memory_type(struct kvm_vcpu *vcpu, gfn_t gfn);




^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v2 13/21] KVM: x86: Formalize blocking of nested pending exceptions
  2022-06-14 20:47 ` [PATCH v2 13/21] KVM: x86: Formalize blocking of nested pending exceptions Sean Christopherson
@ 2022-07-06 12:04   ` Maxim Levitsky
  2022-07-06 17:36     ` Sean Christopherson
  0 siblings, 1 reply; 78+ messages in thread
From: Maxim Levitsky @ 2022-07-06 12:04 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel, kvm,
	linux-kernel, Oliver Upton, Peter Shier

On Tue, 2022-06-14 at 20:47 +0000, Sean Christopherson wrote:
> Capture nested_run_pending as block_pending_exceptions so that the logic
> of why exceptions are blocked only needs to be documented once instead of
> at every place that employs the logic.
> 
> No functional change intended.
> 
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>  arch/x86/kvm/svm/nested.c | 20 ++++++++++----------
>  arch/x86/kvm/vmx/nested.c | 23 ++++++++++++-----------
>  2 files changed, 22 insertions(+), 21 deletions(-)
> 
> diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
> index 471d40e97890..460161e67ce5 100644
> --- a/arch/x86/kvm/svm/nested.c
> +++ b/arch/x86/kvm/svm/nested.c
> @@ -1347,10 +1347,16 @@ static inline bool nested_exit_on_init(struct vcpu_svm *svm)
>  
>  static int svm_check_nested_events(struct kvm_vcpu *vcpu)
>  {
> -	struct vcpu_svm *svm = to_svm(vcpu);
> -	bool block_nested_events =
> -		kvm_event_needs_reinjection(vcpu) || svm->nested.nested_run_pending;
>  	struct kvm_lapic *apic = vcpu->arch.apic;
> +	struct vcpu_svm *svm = to_svm(vcpu);
> +	/*
> +	 * Only a pending nested run blocks a pending exception.  If there is a
> +	 * previously injected event, the pending exception occurred while said
> +	 * event was being delivered and thus needs to be handled.
> +	 */

Tiny nitpick about the comment:

One can say that if there is an injected event, this means that we
are in the middle of handling it, thus we are not on an instruction boundary,
and thus we don't process events (e.g. interrupts).

So maybe write something like that?


> +	bool block_nested_exceptions = svm->nested.nested_run_pending;
> +	bool block_nested_events = block_nested_exceptions ||
> +				   kvm_event_needs_reinjection(vcpu);

Tiny nitpick: I don't much like the name 'nested', as
it can also mean a nested exception (e.g. an exception that
happened while jumping to an exception handler).

Here we mean just exceptions/events for the guest, so I would suggest
just dropping the word 'nested'.

>  
>  	if (lapic_in_kernel(vcpu) &&
>  	    test_bit(KVM_APIC_INIT, &apic->pending_events)) {
> @@ -1363,13 +1369,7 @@ static int svm_check_nested_events(struct kvm_vcpu *vcpu)
>  	}
>  
>  	if (vcpu->arch.exception.pending) {
> -		/*
> -		 * Only a pending nested run can block a pending exception.
> -		 * Otherwise an injected NMI/interrupt should either be
> -		 * lost or delivered to the nested hypervisor in the EXITINTINFO
> -		 * vmcb field, while delivering the pending exception.
> -		 */
> -		if (svm->nested.nested_run_pending)
> +		if (block_nested_exceptions)
>                          return -EBUSY;
>  		if (!nested_exit_on_exception(svm))
>  			return 0;
> diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> index fafdcbfeca1f..50fe66f0cc1b 100644
> --- a/arch/x86/kvm/vmx/nested.c
> +++ b/arch/x86/kvm/vmx/nested.c
> @@ -3903,11 +3903,17 @@ static bool nested_vmx_preemption_timer_pending(struct kvm_vcpu *vcpu)
>  
>  static int vmx_check_nested_events(struct kvm_vcpu *vcpu)
>  {
> -	struct vcpu_vmx *vmx = to_vmx(vcpu);
> -	unsigned long exit_qual;
> -	bool block_nested_events =
> -	    vmx->nested.nested_run_pending || kvm_event_needs_reinjection(vcpu);
>  	struct kvm_lapic *apic = vcpu->arch.apic;
> +	struct vcpu_vmx *vmx = to_vmx(vcpu);
> +	unsigned long exit_qual;
> +	/*
> +	 * Only a pending nested run blocks a pending exception.  If there is a
> +	 * previously injected event, the pending exception occurred while said
> +	 * event was being delivered and thus needs to be handled.
> +	 */
> +	bool block_nested_exceptions = vmx->nested.nested_run_pending;
> +	bool block_nested_events = block_nested_exceptions ||
> +				   kvm_event_needs_reinjection(vcpu);
>  
>  	if (lapic_in_kernel(vcpu) &&
>  		test_bit(KVM_APIC_INIT, &apic->pending_events)) {
> @@ -3941,15 +3947,10 @@ static int vmx_check_nested_events(struct kvm_vcpu *vcpu)
>  	 * Process exceptions that are higher priority than Monitor Trap Flag:
>  	 * fault-like exceptions, TSS T flag #DB (not emulated by KVM, but
>  	 * could theoretically come in from userspace), and ICEBP (INT1).
> -	 *
> -	 * Note that only a pending nested run can block a pending exception.
> -	 * Otherwise an injected NMI/interrupt should either be
> -	 * lost or delivered to the nested hypervisor in the IDT_VECTORING_INFO,
> -	 * while delivering the pending exception.
>  	 */
>  	if (vcpu->arch.exception.pending &&
>  	    !(vmx_get_pending_dbg_trap(vcpu) & ~DR6_BT)) {
> -		if (vmx->nested.nested_run_pending)
> +		if (block_nested_exceptions)
>  			return -EBUSY;
>  		if (!nested_vmx_check_exception(vcpu, &exit_qual))
>  			goto no_vmexit;
> @@ -3966,7 +3967,7 @@ static int vmx_check_nested_events(struct kvm_vcpu *vcpu)
>  	}
>  
>  	if (vcpu->arch.exception.pending) {
> -		if (vmx->nested.nested_run_pending)
> +		if (block_nested_exceptions)
>  			return -EBUSY;
>  		if (!nested_vmx_check_exception(vcpu, &exit_qual))
>  			goto no_vmexit;

Besides the nitpicks:

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>

Best regards,
	Maxim Levitsky



^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v2 14/21] KVM: x86: Use kvm_queue_exception_e() to queue #DF
  2022-06-14 20:47 ` [PATCH v2 14/21] KVM: x86: Use kvm_queue_exception_e() to queue #DF Sean Christopherson
@ 2022-07-06 12:04   ` Maxim Levitsky
  0 siblings, 0 replies; 78+ messages in thread
From: Maxim Levitsky @ 2022-07-06 12:04 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel, kvm,
	linux-kernel, Oliver Upton, Peter Shier

On Tue, 2022-06-14 at 20:47 +0000, Sean Christopherson wrote:
> Queue #DF by recursing on kvm_multiple_exception() by way of
> kvm_queue_exception_e() instead of open coding the behavior.  This will
> allow KVM to Just Work when a future commit moves exception interception
> checks (for L2 => L1) into kvm_multiple_exception().

Typo: You mean Just Work (tm) ;-) 

> 
> No functional change intended.
> 
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>  arch/x86/kvm/x86.c | 21 +++++++++------------
>  1 file changed, 9 insertions(+), 12 deletions(-)
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 511c0c8af80e..e45465075005 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -663,25 +663,22 @@ static void kvm_multiple_exception(struct kvm_vcpu *vcpu,
>  	}
>  	class1 = exception_class(prev_nr);
>  	class2 = exception_class(nr);
> -	if ((class1 == EXCPT_CONTRIBUTORY && class2 == EXCPT_CONTRIBUTORY)
> -		|| (class1 == EXCPT_PF && class2 != EXCPT_BENIGN)) {
> +	if ((class1 == EXCPT_CONTRIBUTORY && class2 == EXCPT_CONTRIBUTORY) ||
> +	    (class1 == EXCPT_PF && class2 != EXCPT_BENIGN)) {
>  		/*
> -		 * Generate double fault per SDM Table 5-5.  Set
> -		 * exception.pending = true so that the double fault
> -		 * can trigger a nested vmexit.
> +		 * Synthesize #DF.  Clear the previously injected or pending
> +		 * exception so as not to incorrectly trigger shutdown.
>  		 */
> -		vcpu->arch.exception.pending = true;
>  		vcpu->arch.exception.injected = false;
> -		vcpu->arch.exception.has_error_code = true;
> -		vcpu->arch.exception.vector = DF_VECTOR;
> -		vcpu->arch.exception.error_code = 0;
> -		vcpu->arch.exception.has_payload = false;
> -		vcpu->arch.exception.payload = 0;
> -	} else
> +		vcpu->arch.exception.pending = false;
> +
> +		kvm_queue_exception_e(vcpu, DF_VECTOR, 0);
> +	} else {
>  		/* replace previous exception with a new one in a hope
>  		   that instruction re-execution will regenerate lost
>  		   exception */
>  		goto queue;
> +	}
>  }
>  
>  void kvm_queue_exception(struct kvm_vcpu *vcpu, unsigned nr)
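
For reference, the SDM Table 5-5 merging rule that kvm_multiple_exception()
relies on can be modeled in a few lines of standalone C.  This is only an
illustrative sketch, not KVM code; vector numbers are per the SDM.

#include <stdbool.h>
#include <stdio.h>

enum excpt_class { EXCPT_BENIGN, EXCPT_CONTRIBUTORY, EXCPT_PF };

/* #DE=0, #TS=10, #NP=11, #SS=12, #GP=13 are contributory; #PF=14 is special. */
static enum excpt_class exception_class(int vector)
{
	switch (vector) {
	case 14:
		return EXCPT_PF;
	case 0: case 10: case 11: case 12: case 13:
		return EXCPT_CONTRIBUTORY;
	default:
		return EXCPT_BENIGN;
	}
}

/* SDM Table 5-5: does a second exception during delivery merge into #DF? */
static bool morphs_to_double_fault(int prev, int cur)
{
	enum excpt_class c1 = exception_class(prev);
	enum excpt_class c2 = exception_class(cur);

	return (c1 == EXCPT_CONTRIBUTORY && c2 == EXCPT_CONTRIBUTORY) ||
	       (c1 == EXCPT_PF && c2 != EXCPT_BENIGN);
}

int main(void)
{
	printf("#GP during #PF delivery -> #DF? %d\n", morphs_to_double_fault(14, 13));
	printf("#DB during #PF delivery -> #DF? %d\n", morphs_to_double_fault(14, 1));
	return 0;
}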

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>

Best regards,
	Maxim Levitsky


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v2 15/21] KVM: x86: Hoist nested event checks above event injection logic
  2022-06-14 20:47 ` [PATCH v2 15/21] KVM: x86: Hoist nested event checks above event injection logic Sean Christopherson
@ 2022-07-06 12:05   ` Maxim Levitsky
  0 siblings, 0 replies; 78+ messages in thread
From: Maxim Levitsky @ 2022-07-06 12:05 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel, kvm,
	linux-kernel, Oliver Upton, Peter Shier

On Tue, 2022-06-14 at 20:47 +0000, Sean Christopherson wrote:
> Perform nested event checks before re-injecting exceptions/events into
> L2.  If a pending exception causes VM-Exit to L1, re-injecting events
> into vmcs02 is premature and wasted effort.  Take care to ensure events
> that need to be re-injected are still re-injected if checking for nested
> events "fails", i.e. if KVM needs to force an immediate entry+exit to
> complete the to-be-re-injected event.
> 
> Keep the "can_inject" logic the same for now; it too can be pushed below
> the nested checks, but is a slightly riskier change (see past bugs about
> events not being properly purged on nested VM-Exit).
> 
> Add and/or modify comments to better document the various interactions.
> Of note is the comment regarding "blocking" previously injected NMIs and
> IRQs if an exception is pending.  The old comment isn't wrong strictly
> speaking, but it failed to capture the reason why the logic even exists.
> 
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>  arch/x86/kvm/x86.c | 89 +++++++++++++++++++++++++++-------------------
>  1 file changed, 53 insertions(+), 36 deletions(-)
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index e45465075005..930de833aa2b 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -9502,53 +9502,70 @@ static void kvm_inject_exception(struct kvm_vcpu *vcpu)
>  
>  static int inject_pending_event(struct kvm_vcpu *vcpu, bool *req_immediate_exit)
>  {
> +	bool can_inject = !kvm_event_needs_reinjection(vcpu);
>  	int r;
> -	bool can_inject = true;
>  
> -	/* try to reinject previous events if any */
> +	/*
> +	 * Process nested events first, as nested VM-Exit supersedes event
> +	 * re-injection.  If there's an event queued for re-injection, it will
> +	 * be saved into the appropriate vmc{b,s}12 fields on nested VM-Exit.
> +	 */
> +	if (is_guest_mode(vcpu))
> +		r = kvm_check_nested_events(vcpu);
> +	else
> +		r = 0;

Makes a lot of sense!

>  
> -	if (vcpu->arch.exception.injected) {
> +	/*
> +	 * Re-inject exceptions and events *especially* if immediate entry+exit
> +	 * to/from L2 is needed, as any event that has already been injected
> +	 * into L2 needs to complete its lifecycle before injecting a new event.
> +	 *
> +	 * Don't re-inject an NMI or interrupt if there is a pending exception.
> +	 * This collision arises if an exception occurred while vectoring the
> +	 * injected event, KVM intercepted said exception, and KVM ultimately
> +	 * determined the fault belongs to the guest and queues the exception
> +	 * for injection back into the guest.
> +	 *
> +	 * "Injected" interrupts can also collide with pending exceptions if
> +	 * userspace ignores the "ready for injection" flag and blindly queues
> +	 * an interrupt.  In that case, prioritizing the exception is correct,
> +	 * as the exception "occurred" before the exit to userspace.  Trap-like
> +	 * exceptions, e.g. most #DBs, have higher priority than interrupts.
> +	 * And while fault-like exceptions, e.g. #GP and #PF, are the lowest
> +	 * priority, they're only generated (pended) during instruction
> +	 * execution, and interrupts are recognized at instruction boundaries.
> +	 * Thus a pending fault-like exception means the fault occurred on the
> +	 * *previous* instruction and must be serviced prior to recognizing any
> +	 * new events in order to fully complete the previous instruction.
> +	 */
> +	if (vcpu->arch.exception.injected)
>  		kvm_inject_exception(vcpu);
> -		can_inject = false;
> -	}
> +	else if (vcpu->arch.exception.pending)
> +		; /* see above */
> +	else if (vcpu->arch.nmi_injected)
> +		static_call(kvm_x86_inject_nmi)(vcpu);
> +	else if (vcpu->arch.interrupt.injected)
> +		static_call(kvm_x86_inject_irq)(vcpu, true);
> +
>  	/*
> -	 * Do not inject an NMI or interrupt if there is a pending
> -	 * exception.  Exceptions and interrupts are recognized at
> -	 * instruction boundaries, i.e. the start of an instruction.
> -	 * Trap-like exceptions, e.g. #DB, have higher priority than
> -	 * NMIs and interrupts, i.e. traps are recognized before an
> -	 * NMI/interrupt that's pending on the same instruction.
> -	 * Fault-like exceptions, e.g. #GP and #PF, are the lowest
> -	 * priority, but are only generated (pended) during instruction
> -	 * execution, i.e. a pending fault-like exception means the
> -	 * fault occurred on the *previous* instruction and must be
> -	 * serviced prior to recognizing any new events in order to
> -	 * fully complete the previous instruction.
> +	 * Exceptions that morph to VM-Exits are handled above, and pending
> +	 * exceptions on top of injected exceptions that do not VM-Exit should
> +	 * either morph to #DF or, sadly, override the injected exception.
>  	 */
> -	else if (!vcpu->arch.exception.pending) {
> -		if (vcpu->arch.nmi_injected) {
> -			static_call(kvm_x86_inject_nmi)(vcpu);
> -			can_inject = false;
> -		} else if (vcpu->arch.interrupt.injected) {
> -			static_call(kvm_x86_inject_irq)(vcpu, true);
> -			can_inject = false;
> -		}
> -	}
> -
>  	WARN_ON_ONCE(vcpu->arch.exception.injected &&
>  		     vcpu->arch.exception.pending);
>  
>  	/*
> -	 * Call check_nested_events() even if we reinjected a previous event
> -	 * in order for caller to determine if it should require immediate-exit
> -	 * from L2 to L1 due to pending L1 events which require exit
> -	 * from L2 to L1.
> +	 * Bail if immediate entry+exit to/from the guest is needed to complete
> +	 * nested VM-Enter or event re-injection so that a different pending
> +	 * event can be serviced (or if KVM needs to exit to userspace).
> +	 *
> +	 * Otherwise, continue processing events even if VM-Exit occurred.  The
> +	 * VM-Exit will have cleared exceptions that were meant for L2, but
> +	 * there may now be events that can be injected into L1.
>  	 */
> -	if (is_guest_mode(vcpu)) {
> -		r = kvm_check_nested_events(vcpu);
> -		if (r < 0)
> -			goto out;
> -	}
> +	if (r < 0)
> +		goto out;
>  
>  	/* try to inject new event if pending */
>  	if (vcpu->arch.exception.pending) {

All makes sense AFAIK.

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>

Best regards,
	Maxim Levitsky




^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v2 16/21] KVM: x86: Evaluate ability to inject SMI/NMI/IRQ after potential VM-Exit
  2022-06-14 20:47 ` [PATCH v2 16/21] KVM: x86: Evaluate ability to inject SMI/NMI/IRQ after potential VM-Exit Sean Christopherson
@ 2022-07-06 12:05   ` Maxim Levitsky
  0 siblings, 0 replies; 78+ messages in thread
From: Maxim Levitsky @ 2022-07-06 12:05 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel, kvm,
	linux-kernel, Oliver Upton, Peter Shier

On Tue, 2022-06-14 at 20:47 +0000, Sean Christopherson wrote:
> Determine whether or not new events can be injected after checking nested
> events.  If a VM-Exit occurred during nested event handling, any previous
> event that needed re-injection is gone from KVM's perspective; the event
> is captured in the vmc*12 VM-Exit information, but doesn't exist in terms
> of what needs to be done for entry to L1.
> 
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>  arch/x86/kvm/x86.c | 10 ++++++++--
>  1 file changed, 8 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 930de833aa2b..1a301a1730a5 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -9502,7 +9502,7 @@ static void kvm_inject_exception(struct kvm_vcpu *vcpu)
>  
>  static int inject_pending_event(struct kvm_vcpu *vcpu, bool *req_immediate_exit)
>  {
> -	bool can_inject = !kvm_event_needs_reinjection(vcpu);
> +	bool can_inject;
>  	int r;
>  
>  	/*
> @@ -9567,7 +9567,13 @@ static int inject_pending_event(struct kvm_vcpu *vcpu, bool *req_immediate_exit)
>  	if (r < 0)
>  		goto out;
>  
> -	/* try to inject new event if pending */
> +	/*
> +	 * New events, other than exceptions, cannot be injected if KVM needs
> +	 * to re-inject a previous event.  See above comments on re-injecting
> +	 * for why pending exceptions get priority.
> +	 */
> +	can_inject = !kvm_event_needs_reinjection(vcpu);
> +
>  	if (vcpu->arch.exception.pending) {
>  		/*
>  		 * Fault-class exceptions, except #DBs, set RF=1 in the RFLAGS

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>

Best regards,
	Maxim Levitsky


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v2 17/21] KVM: x86: Morph pending exceptions to pending VM-Exits at queue time
  2022-06-14 20:47 ` [PATCH v2 17/21] KVM: x86: Morph pending exceptions to pending VM-Exits at queue time Sean Christopherson
@ 2022-07-06 12:15   ` Maxim Levitsky
  2022-07-07  1:24     ` Sean Christopherson
  0 siblings, 1 reply; 78+ messages in thread
From: Maxim Levitsky @ 2022-07-06 12:15 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel, kvm,
	linux-kernel, Oliver Upton, Peter Shier

On Tue, 2022-06-14 at 20:47 +0000, Sean Christopherson wrote:
> Morph pending exceptions to pending VM-Exits (due to interception) when
> the exception is queued instead of waiting until nested events are
> checked at VM-Entry.  This fixes a longstanding bug where KVM fails to
> handle an exception that occurs during delivery of a previous exception,
> KVM (L0) and L1 both want to intercept the exception (e.g. #PF for shadow
> paging), and KVM determines that the exception is in the guest's domain,
> i.e. queues the new exception for L2.  Deferring the interception check
> causes KVM to escalate various combinations of injected+pending exceptions
> to double fault (#DF) without consulting L1's interception desires, and
> ends up injecting a spurious #DF into L2.
> 
> KVM has fudged around the issue for #PF by special casing emulated #PF
> injection for shadow paging, but the underlying issue is not unique to
> shadow paging in L0, e.g. if KVM is intercepting #PF because the guest
> has a smaller maxphyaddr and L1 (but not L0) is using shadow paging.
> Other exceptions are affected as well, e.g. if KVM is intercepting #GP
> for one of SVM's workarounds or for the VMware backdoor emulation stuff.
> The other cases have gone unnoticed because the #DF is spurious if and
> only if L1 resolves the exception, e.g. KVM's goofs go unnoticed if L1
> would have injected #DF anyways.
> 
> The hack-a-fix has also led to ugly code, e.g. bailing from the emulator
> if #PF injection forced a nested VM-Exit and the emulator finds itself
> back in L1.  Allowing for direct-to-VM-Exit queueing also neatly solves
> the async #PF in L2 mess; no need to set a magic flag and token, simply
> queue a #PF nested VM-Exit.
> 
> Deal with event migration by flagging that a pending exception was queued
> by userspace and check for interception at the next KVM_RUN, e.g. so that
> KVM does the right thing regardless of the order in which userspace
> restores nested state vs. event state.
> 
> When "getting" events from userspace, simply drop any pending excpetion
> that is destined to be intercepted if there is also an injected exception
> to be migrated.  Ideally, KVM would migrate both events, but that would
> require new ABI, and practically speaking losing the event is unlikely to
> be noticed, let alone fatal.  The injected exception is captured, RIP
> still points at the original faulting instruction, etc...  So either the
> injection on the target will trigger the same intercepted exception, or
> the source of the intercepted exception was transient and/or
> non-deterministic, thus dropping it is ok-ish.
> 
> Opportunistically add a gigantic comment above vmx_check_nested_events()
> to document the priorities of all known events on Intel CPUs.  Kudos to
> Jim Mattson for doing the hard work of collecting and interpreting the
> priorities from various locations throughout the SDM (because putting
> them all in one place in the SDM would be too easy).
> 
> Fixes: a04aead144fd ("KVM: nSVM: fix running nested guests when npt=0")
> Fixes: feaf0c7dc473 ("KVM: nVMX: Do not generate #DF if #PF happens during exception delivery into L2")
> Cc: Jim Mattson <jmattson@google.com>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>  arch/x86/include/asm/kvm_host.h |  12 +-
>  arch/x86/kvm/svm/nested.c       |  41 ++----
>  arch/x86/kvm/vmx/nested.c       | 220 +++++++++++++++++++++-----------
>  arch/x86/kvm/vmx/vmx.c          |   6 +-
>  arch/x86/kvm/x86.c              | 159 ++++++++++++++++-------
>  arch/x86/kvm/x86.h              |   7 +
>  6 files changed, 287 insertions(+), 158 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 7f321d53a7e9..3bf7fdeeb25c 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -648,7 +648,6 @@ struct kvm_queued_exception {
>         u32 error_code;
>         unsigned long payload;
>         bool has_payload;
> -       u8 nested_apf;
>  };
>  
>  struct kvm_vcpu_arch {
> @@ -750,8 +749,12 @@ struct kvm_vcpu_arch {
>  
>         u8 event_exit_inst_len;
>  
> +       bool exception_from_userspace;
> +
>         /* Exceptions to be injected to the guest. */
>         struct kvm_queued_exception exception;
> +       /* Exception VM-Exits to be synthesized to L1. */
> +       struct kvm_queued_exception exception_vmexit;
>  
>         struct kvm_queued_interrupt {
>                 bool injected;
> @@ -861,7 +864,6 @@ struct kvm_vcpu_arch {
>                 u32 id;
>                 bool send_user_only;
>                 u32 host_apf_flags;
> -               unsigned long nested_apf_token;
>                 bool delivery_as_pf_vmexit;
>                 bool pageready_pending;
>         } apf;
> @@ -1618,9 +1620,9 @@ struct kvm_x86_ops {
>  
>  struct kvm_x86_nested_ops {
>         void (*leave_nested)(struct kvm_vcpu *vcpu);
> +       bool (*is_exception_vmexit)(struct kvm_vcpu *vcpu, u8 vector,
> +                                   u32 error_code);
>         int (*check_events)(struct kvm_vcpu *vcpu);
> -       bool (*handle_page_fault_workaround)(struct kvm_vcpu *vcpu,
> -                                            struct x86_exception *fault);

I think that since this patch is already quite large, it would make sense
to split the removal of the workaround/hack code into a patch after this one?


>         bool (*hv_timer_pending)(struct kvm_vcpu *vcpu);
>         void (*triple_fault)(struct kvm_vcpu *vcpu);
>         int (*get_state)(struct kvm_vcpu *vcpu,
> @@ -1847,7 +1849,7 @@ void kvm_queue_exception_p(struct kvm_vcpu *vcpu, unsigned nr, unsigned long pay
>  void kvm_requeue_exception(struct kvm_vcpu *vcpu, unsigned nr);
>  void kvm_requeue_exception_e(struct kvm_vcpu *vcpu, unsigned nr, u32 error_code);
>  void kvm_inject_page_fault(struct kvm_vcpu *vcpu, struct x86_exception *fault);
> -bool kvm_inject_emulated_page_fault(struct kvm_vcpu *vcpu,
> +void kvm_inject_emulated_page_fault(struct kvm_vcpu *vcpu,
>                                     struct x86_exception *fault);
>  bool kvm_require_cpl(struct kvm_vcpu *vcpu, int required_cpl);
>  bool kvm_require_dr(struct kvm_vcpu *vcpu, int dr);
> diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
> index 460161e67ce5..4075deefd132 100644
> --- a/arch/x86/kvm/svm/nested.c
> +++ b/arch/x86/kvm/svm/nested.c
> @@ -55,28 +55,6 @@ static void nested_svm_inject_npf_exit(struct kvm_vcpu *vcpu,
>         nested_svm_vmexit(svm);
>  }
>  
> -static bool nested_svm_handle_page_fault_workaround(struct kvm_vcpu *vcpu,
> -                                                   struct x86_exception *fault)
> -{
> -       struct vcpu_svm *svm = to_svm(vcpu);
> -       struct vmcb *vmcb = svm->vmcb;
> -
> -       WARN_ON(!is_guest_mode(vcpu));
> -
> -       if (vmcb12_is_intercept(&svm->nested.ctl,
> -                               INTERCEPT_EXCEPTION_OFFSET + PF_VECTOR) &&
> -           !WARN_ON_ONCE(svm->nested.nested_run_pending)) {
> -               vmcb->control.exit_code = SVM_EXIT_EXCP_BASE + PF_VECTOR;
> -               vmcb->control.exit_code_hi = 0;
> -               vmcb->control.exit_info_1 = fault->error_code;
> -               vmcb->control.exit_info_2 = fault->address;
> -               nested_svm_vmexit(svm);
> -               return true;
> -       }
> -
> -       return false;
> -}
> -
>  static u64 nested_svm_get_tdp_pdptr(struct kvm_vcpu *vcpu, int index)
>  {
>         struct vcpu_svm *svm = to_svm(vcpu);
> @@ -1297,16 +1275,17 @@ int nested_svm_check_permissions(struct kvm_vcpu *vcpu)
>         return 0;
>  }
>  
> -static bool nested_exit_on_exception(struct vcpu_svm *svm)
> +static bool nested_svm_is_exception_vmexit(struct kvm_vcpu *vcpu, u8 vector,
> +                                          u32 error_code)
>  {
> -       unsigned int vector = svm->vcpu.arch.exception.vector;
> +       struct vcpu_svm *svm = to_svm(vcpu);
>  
>         return (svm->nested.ctl.intercepts[INTERCEPT_EXCEPTION] & BIT(vector));
>  }
>  
>  static void nested_svm_inject_exception_vmexit(struct kvm_vcpu *vcpu)
>  {
> -       struct kvm_queued_exception *ex = &vcpu->arch.exception;
> +       struct kvm_queued_exception *ex = &vcpu->arch.exception_vmexit;
>         struct vcpu_svm *svm = to_svm(vcpu);
>         struct vmcb *vmcb = svm->vmcb;
>  
> @@ -1368,15 +1347,19 @@ static int svm_check_nested_events(struct kvm_vcpu *vcpu)
>                 return 0;
>         }
>  
> -       if (vcpu->arch.exception.pending) {
> +       if (vcpu->arch.exception_vmexit.pending) {
>                 if (block_nested_exceptions)
>                          return -EBUSY;
> -               if (!nested_exit_on_exception(svm))
> -                       return 0;
>                 nested_svm_inject_exception_vmexit(vcpu);
>                 return 0;
>         }

I see, so my approach was to have both a pending and an injected exception,
while your approach is basically to keep the 'pending' exception only when
it can't be merged right away with the injected exception.


It's less elegant IMHO, but on the other hand it's less
risky, so I agree with it.

You also still delay the actual VM-Exit to vCPU run time, which further
reduces the risk; that is a justified design choice as well.
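
For illustration, the queue-time split reduces to something like the
following standalone sketch (not the actual KVM code; L1's intercepts are
reduced to a plain bitmap here, and the payload plumbing is omitted):

#include <stdbool.h>

struct queued_exception {
	bool pending;
	unsigned char vector;
	unsigned int error_code;
};

struct model_vcpu {
	bool guest_mode;
	unsigned int l1_exception_bitmap;         /* stand-in for L1's intercepts */
	struct queued_exception exception;        /* destined for the current level */
	struct queued_exception exception_vmexit; /* VM-Exit to synthesize to L1 */
};

/*
 * Queue-time morphing: an exception bound for L2 that L1 intercepts becomes
 * a pending VM-Exit; everything else stays a plain pending exception.
 * Re-injected exceptions would bypass this check entirely.
 */
static void model_queue_exception(struct model_vcpu *v, unsigned char vector,
				  unsigned int error_code)
{
	struct queued_exception *ex;

	if (v->guest_mode && (v->l1_exception_bitmap & (1u << vector)))
		ex = &v->exception_vmexit;
	else
		ex = &v->exception;

	ex->pending = true;
	ex->vector = vector;
	ex->error_code = error_code;
}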

>  
> +       if (vcpu->arch.exception.pending) {
> +               if (block_nested_exceptions)
> +                       return -EBUSY;
> +               return 0;
> +       }
> +
>         if (vcpu->arch.smi_pending && !svm_smi_blocked(vcpu)) {
>                 if (block_nested_events)
>                         return -EBUSY;
> @@ -1714,8 +1697,8 @@ static bool svm_get_nested_state_pages(struct kvm_vcpu *vcpu)
>  
>  struct kvm_x86_nested_ops svm_nested_ops = {
>         .leave_nested = svm_leave_nested,
> +       .is_exception_vmexit = nested_svm_is_exception_vmexit,
>         .check_events = svm_check_nested_events,
> -       .handle_page_fault_workaround = nested_svm_handle_page_fault_workaround,
>         .triple_fault = nested_svm_triple_fault,
>         .get_nested_state_pages = svm_get_nested_state_pages,
>         .get_state = svm_get_nested_state,
> diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> index 50fe66f0cc1b..53f6ea15081d 100644
> --- a/arch/x86/kvm/vmx/nested.c
> +++ b/arch/x86/kvm/vmx/nested.c
> @@ -438,59 +438,22 @@ static bool nested_vmx_is_page_fault_vmexit(struct vmcs12 *vmcs12,
>         return inequality ^ bit;
>  }
>  
> -
> -/*
> - * KVM wants to inject page-faults which it got to the guest. This function
> - * checks whether in a nested guest, we need to inject them to L1 or L2.
> - */
> -static int nested_vmx_check_exception(struct kvm_vcpu *vcpu, unsigned long *exit_qual)
> -{
> -       struct kvm_queued_exception *ex = &vcpu->arch.exception;
> -       struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
> -
> -       if (ex->vector == PF_VECTOR) {
> -               if (ex->nested_apf) {
> -                       *exit_qual = vcpu->arch.apf.nested_apf_token;
> -                       return 1;
> -               }
> -               if (nested_vmx_is_page_fault_vmexit(vmcs12, ex->error_code)) {
> -                       *exit_qual = ex->has_payload ? ex->payload : vcpu->arch.cr2;
> -                       return 1;
> -               }
> -       } else if (vmcs12->exception_bitmap & (1u << ex->vector)) {
> -               if (ex->vector == DB_VECTOR) {
> -                       if (ex->has_payload) {
> -                               *exit_qual = ex->payload;
> -                       } else {
> -                               *exit_qual = vcpu->arch.dr6;
> -                               *exit_qual &= ~DR6_BT;
> -                               *exit_qual ^= DR6_ACTIVE_LOW;
> -                       }
> -               } else
> -                       *exit_qual = 0;
> -               return 1;
> -       }
> -
> -       return 0;
> -}
> -
> -static bool nested_vmx_handle_page_fault_workaround(struct kvm_vcpu *vcpu,
> -                                                   struct x86_exception *fault)
> +static bool nested_vmx_is_exception_vmexit(struct kvm_vcpu *vcpu, u8 vector,
> +                                          u32 error_code)
>  {
>         struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
>  
> -       WARN_ON(!is_guest_mode(vcpu));
> +       /*
> +        * Drop bits 31:16 of the error code when performing the #PF mask+match
> +        * check.  All VMCS fields involved are 32 bits, but Intel CPUs never
> +        * set bits 31:16 and VMX disallows setting bits 31:16 in the injected
> +        * error code.  Including the to-be-dropped bits in the check might
> +        * result in an "impossible" or missed exit from L1's perspective.
> +        */
> +       if (vector == PF_VECTOR)
> +               return nested_vmx_is_page_fault_vmexit(vmcs12, (u16)error_code);
>  
> -       if (nested_vmx_is_page_fault_vmexit(vmcs12, fault->error_code) &&
> -           !WARN_ON_ONCE(to_vmx(vcpu)->nested.nested_run_pending)) {
> -               vmcs12->vm_exit_intr_error_code = fault->error_code;
> -               nested_vmx_vmexit(vcpu, EXIT_REASON_EXCEPTION_NMI,
> -                                 PF_VECTOR | INTR_TYPE_HARD_EXCEPTION |
> -                                 INTR_INFO_DELIVER_CODE_MASK | INTR_INFO_VALID_MASK,
> -                                 fault->address);
> -               return true;
> -       }
> -       return false;
> +       return (vmcs12->exception_bitmap & (1u << vector));
>  }
>  
>  static int nested_vmx_check_io_bitmap_controls(struct kvm_vcpu *vcpu,
> @@ -3823,12 +3786,24 @@ static int vmx_complete_nested_posted_interrupt(struct kvm_vcpu *vcpu)
>         return -ENXIO;
>  }
>  
> -static void nested_vmx_inject_exception_vmexit(struct kvm_vcpu *vcpu,
> -                                              unsigned long exit_qual)
> +static void nested_vmx_inject_exception_vmexit(struct kvm_vcpu *vcpu)
>  {
> -       struct kvm_queued_exception *ex = &vcpu->arch.exception;
> +       struct kvm_queued_exception *ex = &vcpu->arch.exception_vmexit;
>         u32 intr_info = ex->vector | INTR_INFO_VALID_MASK;
>         struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
> +       unsigned long exit_qual;
> +
> +       if (ex->has_payload) {
> +               exit_qual = ex->payload;
> +       } else if (ex->vector == PF_VECTOR) {
> +               exit_qual = vcpu->arch.cr2;
> +       } else if (ex->vector == DB_VECTOR) {
> +               exit_qual = vcpu->arch.dr6;
> +               exit_qual &= ~DR6_BT;
> +               exit_qual ^= DR6_ACTIVE_LOW;
> +       } else {
> +               exit_qual = 0;
> +       }
>  
>         if (ex->has_error_code) {
>                 /*
> @@ -3870,14 +3845,24 @@ static void nested_vmx_inject_exception_vmexit(struct kvm_vcpu *vcpu,
>   * from the emulator (because such #DBs are fault-like and thus don't trigger
>   * actions that fire on instruction retire).
>   */
> -static inline unsigned long vmx_get_pending_dbg_trap(struct kvm_vcpu *vcpu)
> +static unsigned long vmx_get_pending_dbg_trap(struct kvm_queued_exception *ex)

Any reason to remove the inline?

>  {
> -       if (!vcpu->arch.exception.pending ||
> -           vcpu->arch.exception.vector != DB_VECTOR)
> +       if (!ex->pending || ex->vector != DB_VECTOR)
>                 return 0;
>  
>         /* General Detect #DBs are always fault-like. */
> -       return vcpu->arch.exception.payload & ~DR6_BD;
> +       return ex->payload & ~DR6_BD;
> +}
> +
> +/*
> + * Returns true if there's a pending #DB exception that is lower priority than
> + * a pending Monitor Trap Flag VM-Exit.  TSS T-flag #DBs are not emulated by
> + * KVM, but could theoretically be injected by userspace.  Note, this code is
> + * imperfect, see above.
> + */
> +static bool vmx_is_low_priority_db_trap(struct kvm_queued_exception *ex)
> +{
> +       return vmx_get_pending_dbg_trap(ex) & ~DR6_BT;
>  }
>  
>  /*
> @@ -3889,8 +3874,9 @@ static inline unsigned long vmx_get_pending_dbg_trap(struct kvm_vcpu *vcpu)
>   */
>  static void nested_vmx_update_pending_dbg(struct kvm_vcpu *vcpu)
>  {
> -       unsigned long pending_dbg = vmx_get_pending_dbg_trap(vcpu);
> +       unsigned long pending_dbg;
>  
> +       pending_dbg = vmx_get_pending_dbg_trap(&vcpu->arch.exception);
>         if (pending_dbg)
>                 vmcs_writel(GUEST_PENDING_DBG_EXCEPTIONS, pending_dbg);
>  }
> @@ -3901,11 +3887,93 @@ static bool nested_vmx_preemption_timer_pending(struct kvm_vcpu *vcpu)
>                to_vmx(vcpu)->nested.preemption_timer_expired;
>  }
>  
> +/*
> + * Per the Intel SDM's table "Priority Among Concurrent Events", with minor
> + * edits to fill in missing examples, e.g. #DB due to split-lock accesses,
> + * and less minor edits to splice in the priority of VMX Non-Root specific
> + * events, e.g. MTF and NMI/INTR-window exiting.
> + *
> + * 1 Hardware Reset and Machine Checks
> + *     - RESET
> + *     - Machine Check
> + *
> + * 2 Trap on Task Switch
> + *     - T flag in TSS is set (on task switch)
> + *
> + * 3 External Hardware Interventions
> + *     - FLUSH
> + *     - STOPCLK
> + *     - SMI
> + *     - INIT
> + *
> + * 3.5 Monitor Trap Flag (MTF) VM-exit[1]
> + *
> + * 4 Traps on Previous Instruction
> + *     - Breakpoints
> + *     - Trap-class Debug Exceptions (#DB due to TF flag set, data/I-O
> + *       breakpoint, or #DB due to a split-lock access)
> + *
> + * 4.3 VMX-preemption timer expired VM-exit
> + *
> + * 4.6 NMI-window exiting VM-exit[2]
> + *
> + * 5 Nonmaskable Interrupts (NMI)
> + *
> + * 5.5 Interrupt-window exiting VM-exit and Virtual-interrupt delivery
> + *
> + * 6 Maskable Hardware Interrupts
> + *
> + * 7 Code Breakpoint Fault
> + *
> + * 8 Faults from Fetching Next Instruction
> + *     - Code-Segment Limit Violation
> + *     - Code Page Fault
> + *     - Control protection exception (missing ENDBRANCH at target of indirect
> + *                                     call or jump)
> + *
> + * 9 Faults from Decoding Next Instruction
> + *     - Instruction length > 15 bytes
> + *     - Invalid Opcode
> + *     - Coprocessor Not Available
> + *
> + *10 Faults on Executing Instruction
> + *     - Overflow
> + *     - Bound error
> + *     - Invalid TSS
> + *     - Segment Not Present
> + *     - Stack fault
> + *     - General Protection
> + *     - Data Page Fault
> + *     - Alignment Check
> + *     - x86 FPU Floating-point exception
> + *     - SIMD floating-point exception
> + *     - Virtualization exception
> + *     - Control protection exception
> + *
> + * [1] Per the "Monitor Trap Flag" section: System-management interrupts (SMIs),
> + *     INIT signals, and higher priority events take priority over MTF VM exits.
> + *     MTF VM exits take priority over debug-trap exceptions and lower priority
> + *     events.
> + *
> + * [2] Debug-trap exceptions and higher priority events take priority over VM exits
> + *     caused by the VMX-preemption timer.  VM exits caused by the VMX-preemption
> + *     timer take priority over VM exits caused by the "NMI-window exiting"
> + *     VM-execution control and lower priority events.
> + *
> + * [3] Debug-trap exceptions and higher priority events take priority over VM exits
> + *     caused by "NMI-window exiting".  VM exits caused by this control take
> + *     priority over non-maskable interrupts (NMIs) and lower priority events.
> + *
> + * [4] Virtual-interrupt delivery has the same priority as that of VM exits due to
> + *     the 1-setting of the "interrupt-window exiting" VM-execution control.  Thus,
> + *     non-maskable interrupts (NMIs) and higher priority events take priority over
> + *     delivery of a virtual interrupt; delivery of a virtual interrupt takes
> + *     priority over external interrupts and lower priority events.
> + */
This comment should also probably go into a separate patch to reduce this patch's size.

Other than that, adding this to KVM is a _very_ good idea, although maybe we
should put it in the Documentation folder instead?
(but I don't have a strong preference on this)


>  static int vmx_check_nested_events(struct kvm_vcpu *vcpu)
>  {
>         struct kvm_lapic *apic = vcpu->arch.apic;
>         struct vcpu_vmx *vmx = to_vmx(vcpu);
> -       unsigned long exit_qual;
>         /*
>          * Only a pending nested run blocks a pending exception.  If there is a
>          * previously injected event, the pending exception occurred while said
> @@ -3943,19 +4011,20 @@ static int vmx_check_nested_events(struct kvm_vcpu *vcpu)
>                 /* Fallthrough, the SIPI is completely ignored. */
>         }
>  
> -       /*
> -        * Process exceptions that are higher priority than Monitor Trap Flag:
> -        * fault-like exceptions, TSS T flag #DB (not emulated by KVM, but
> -        * could theoretically come in from userspace), and ICEBP (INT1).
> -        */
> +       if (vcpu->arch.exception_vmexit.pending &&
> +           !vmx_is_low_priority_db_trap(&vcpu->arch.exception_vmexit)) {
> +               if (block_nested_exceptions)
> +                       return -EBUSY;
> +
> +               nested_vmx_inject_exception_vmexit(vcpu);
> +               return 0;
> +       }

> +
>         if (vcpu->arch.exception.pending &&
> -           !(vmx_get_pending_dbg_trap(vcpu) & ~DR6_BT)) {
> +           !vmx_is_low_priority_db_trap(&vcpu->arch.exception)) {
Small nitpick: vmx_is_low_priority_db_trap refactoring could be done in a separate patch

+ Maybe it would also be nice to add a WARN_ON_ONCE check here that this
exception is not intercepted by L1, along the lines of the sketch below.
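
Hypothetical form of the suggested assertion, reusing the is_exception_vmexit
hook added by this patch (untested, for illustration only):

	/* The pending exception here must not be one that L1 intercepts. */
	WARN_ON_ONCE(kvm_x86_ops.nested_ops->is_exception_vmexit(vcpu,
			vcpu->arch.exception.vector,
			vcpu->arch.exception.error_code));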

>                 if (block_nested_exceptions)
>                         return -EBUSY;
> -               if (!nested_vmx_check_exception(vcpu, &exit_qual))
> -                       goto no_vmexit;
> -               nested_vmx_inject_exception_vmexit(vcpu, exit_qual);
> -               return 0;
> +               goto no_vmexit;
>         }
>  
>         if (vmx->nested.mtf_pending) {
> @@ -3966,13 +4035,18 @@ static int vmx_check_nested_events(struct kvm_vcpu *vcpu)
>                 return 0;
>         }
>  
> +       if (vcpu->arch.exception_vmexit.pending) {
> +               if (block_nested_exceptions)
> +                       return -EBUSY;

And here add a WARN_ON_ONCE check that it is intercepted.

> +
> +               nested_vmx_inject_exception_vmexit(vcpu);
> +               return 0;
> +       }
> +
>         if (vcpu->arch.exception.pending) {
>                 if (block_nested_exceptions)
>                         return -EBUSY;
> -               if (!nested_vmx_check_exception(vcpu, &exit_qual))
> -                       goto no_vmexit;
> -               nested_vmx_inject_exception_vmexit(vcpu, exit_qual);
> -               return 0;
> +               goto no_vmexit;
>         }
>  
>         if (nested_vmx_preemption_timer_pending(vcpu)) {
> @@ -6863,8 +6937,8 @@ __init int nested_vmx_hardware_setup(int (*exit_handlers[])(struct kvm_vcpu *))
>  
>  struct kvm_x86_nested_ops vmx_nested_ops = {
>         .leave_nested = vmx_leave_nested,
> +       .is_exception_vmexit = nested_vmx_is_exception_vmexit,
>         .check_events = vmx_check_nested_events,
> -       .handle_page_fault_workaround = nested_vmx_handle_page_fault_workaround,
>         .hv_timer_pending = nested_vmx_preemption_timer_pending,
>         .triple_fault = nested_vmx_triple_fault,
>         .get_state = vmx_get_nested_state,
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 7ef5659a1bbd..3591fdf7ecf9 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -1585,7 +1585,9 @@ static void vmx_update_emulated_instruction(struct kvm_vcpu *vcpu)
>          */
>         if (nested_cpu_has_mtf(vmcs12) &&
>             (!vcpu->arch.exception.pending ||
> -            vcpu->arch.exception.vector == DB_VECTOR))
> +            vcpu->arch.exception.vector == DB_VECTOR) &&
> +           (!vcpu->arch.exception_vmexit.pending ||
> +            vcpu->arch.exception_vmexit.vector == DB_VECTOR))


>                 vmx->nested.mtf_pending = true;
>         else
>                 vmx->nested.mtf_pending = false;
> @@ -5624,7 +5626,7 @@ static bool vmx_emulation_required_with_pending_exception(struct kvm_vcpu *vcpu)
>         struct vcpu_vmx *vmx = to_vmx(vcpu);
>  
>         return vmx->emulation_required && !vmx->rmode.vm86_active &&
> -              (vcpu->arch.exception.pending || vcpu->arch.exception.injected);
> +              (kvm_is_exception_pending(vcpu) || vcpu->arch.exception.injected);
>  }
>  
>  static int handle_invalid_guest_state(struct kvm_vcpu *vcpu)
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 1a301a1730a5..63ee79da50df 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -609,6 +609,21 @@ void kvm_deliver_exception_payload(struct kvm_vcpu *vcpu,
>  }
>  EXPORT_SYMBOL_GPL(kvm_deliver_exception_payload);
>  
> +static void kvm_queue_exception_vmexit(struct kvm_vcpu *vcpu, unsigned int vector,
> +                                      bool has_error_code, u32 error_code,
> +                                      bool has_payload, unsigned long payload)
> +{
> +       struct kvm_queued_exception *ex = &vcpu->arch.exception_vmexit;
> +
> +       ex->vector = vector;
> +       ex->injected = false;
> +       ex->pending = true;
> +       ex->has_error_code = has_error_code;
> +       ex->error_code = error_code;
> +       ex->has_payload = has_payload;
> +       ex->payload = payload;
> +}
> +
>  static void kvm_multiple_exception(struct kvm_vcpu *vcpu,
>                 unsigned nr, bool has_error, u32 error_code,
>                 bool has_payload, unsigned long payload, bool reinject)
> @@ -618,18 +633,31 @@ static void kvm_multiple_exception(struct kvm_vcpu *vcpu,
>  
>         kvm_make_request(KVM_REQ_EVENT, vcpu);
>  
> +       /*
> +        * If the exception is destined for L2 and isn't being reinjected,
> +        * morph it to a VM-Exit if L1 wants to intercept the exception.  A
> +        * previously injected exception is not checked because it was checked
> +        * when it was originally queued, and re-checking is incorrect if _L1_
> +        * injected the exception, in which case it's exempt from interception.
> +        */
> +       if (!reinject && is_guest_mode(vcpu) &&
> +           kvm_x86_ops.nested_ops->is_exception_vmexit(vcpu, nr, error_code)) {
> +               kvm_queue_exception_vmexit(vcpu, nr, has_error, error_code,
> +                                          has_payload, payload);
> +               return;
> +       }
> +
>         if (!vcpu->arch.exception.pending && !vcpu->arch.exception.injected) {
>         queue:
>                 if (reinject) {
>                         /*
> -                        * On vmentry, vcpu->arch.exception.pending is only
> -                        * true if an event injection was blocked by
> -                        * nested_run_pending.  In that case, however,
> -                        * vcpu_enter_guest requests an immediate exit,
> -                        * and the guest shouldn't proceed far enough to
> -                        * need reinjection.
> +                        * On VM-Entry, an exception can be pending if and only
> +                        * if event injection was blocked by nested_run_pending.
> +                        * In that case, however, vcpu_enter_guest() requests an
> +                        * immediate exit, and the guest shouldn't proceed far
> +                        * enough to need reinjection.

Now that I have read Jim's document on event priorities, I think we can
update the comment:

On VMX we set an expired preemption timer, and on SVM we do a self IPI, thus
pending a real interrupt.  Both events should have higher priority than
processing the injected event

(this is something I didn't find in the Intel/AMD docs, so I might be wrong here),

thus the CPU will not attempt to process the injected event (via EVENTINJ on
SVM, or via VM_ENTRY_INTR_INFO_FIELD) and will instead copy it straight back
to exit_int_info/IDT_VECTORING_INFO_FIELD.

So in this case the event will actually be re-injected, but no new exception
can be generated since we will re-execute the VMRUN/VMRESUME instruction.
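
A toy model of that sequence, with the same caveat that the priority claim
may not be spelled out in the Intel/AMD docs (illustration only, not
hardware-accurate):

#include <stdbool.h>

struct entry_state {
	bool event_valid;        /* EVENTINJ / VM_ENTRY_INTR_INFO_FIELD */
	unsigned int event_info;
	bool immediate_exit;     /* expired preemption timer / pending self-IPI */
};

struct exit_state {
	bool int_info_valid;     /* exit_int_info / IDT_VECTORING_INFO_FIELD */
	unsigned int int_info;
	bool event_delivered;
};

static struct exit_state model_vmentry(const struct entry_state *e)
{
	struct exit_state x = { 0 };

	if (e->event_valid && e->immediate_exit) {
		/*
		 * The higher-priority exit wins: the event is not delivered,
		 * it is copied straight back so it can be re-injected on the
		 * next VMRUN/VMRESUME, and no new exception is generated.
		 */
		x.int_info_valid = true;
		x.int_info = e->event_info;
	} else if (e->event_valid) {
		x.event_delivered = true;
	}
	return x;
}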


>                          */
> -                       WARN_ON_ONCE(vcpu->arch.exception.pending);
> +                       WARN_ON_ONCE(kvm_is_exception_pending(vcpu));
>                         vcpu->arch.exception.injected = true;
>                         if (WARN_ON_ONCE(has_payload)) {
>                                 /*
> @@ -732,20 +760,22 @@ static int complete_emulated_insn_gp(struct kvm_vcpu *vcpu, int err)
>  void kvm_inject_page_fault(struct kvm_vcpu *vcpu, struct x86_exception *fault)
>  {
>         ++vcpu->stat.pf_guest;
> -       vcpu->arch.exception.nested_apf =
> -               is_guest_mode(vcpu) && fault->async_page_fault;
> -       if (vcpu->arch.exception.nested_apf) {
> -               vcpu->arch.apf.nested_apf_token = fault->address;
> -               kvm_queue_exception_e(vcpu, PF_VECTOR, fault->error_code);
> -       } else {
> +
> +       /*
> +        * Async #PF in L2 is always forwarded to L1 as a VM-Exit regardless of
> +        * whether or not L1 wants to intercept "regular" #PF.

We might also want to mention that L1 has to opt in to this
(vcpu->arch.apf.delivery_as_pf_vmexit), but the fact that we are here means
that it did opt in

(otherwise kvm_can_deliver_async_pf won't return true).

A WARN_ON_ONCE(!vcpu->arch.apf.delivery_as_pf_vmexit) would be nice to also
check this at runtime.

Also note that AFAIK, qemu sadly doesn't opt in to this feature, thus this
code is not tested (unless there is some unit test).
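
In other words, the routing in kvm_inject_page_fault() reduces to the sketch
below; the opt-in itself is assumed to have been validated upstream in
kvm_can_deliver_async_pf(), per the above:

#include <stdbool.h>

/*
 * Model only: an async #PF hitting L2 is forwarded to L1 as a synthesized
 * #PF VM-Exit, regardless of L1's regular #PF intercept settings.
 */
static bool async_pf_is_l1_vmexit(bool is_guest_mode, bool async_page_fault)
{
	return is_guest_mode && async_page_fault;
}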


> +        */
> +       if (is_guest_mode(vcpu) && fault->async_page_fault)
> +               kvm_queue_exception_vmexit(vcpu, PF_VECTOR,
> +                                          true, fault->error_code,
> +                                          true, fault->address);
> +       else
>                 kvm_queue_exception_e_p(vcpu, PF_VECTOR, fault->error_code,
>                                         fault->address);
> -       }
>  }
>  EXPORT_SYMBOL_GPL(kvm_inject_page_fault);
>  
> -/* Returns true if the page fault was immediately morphed into a VM-Exit. */
> -bool kvm_inject_emulated_page_fault(struct kvm_vcpu *vcpu,
> +void kvm_inject_emulated_page_fault(struct kvm_vcpu *vcpu,
>                                     struct x86_exception *fault)
>  {
>         struct kvm_mmu *fault_mmu;
> @@ -763,26 +793,7 @@ bool kvm_inject_emulated_page_fault(struct kvm_vcpu *vcpu,
>                 kvm_mmu_invalidate_gva(vcpu, fault_mmu, fault->address,
>                                        fault_mmu->root.hpa);
>  
> -       /*
> -        * A workaround for KVM's bad exception handling.  If KVM injected an
> -        * exception into L2, and L2 encountered a #PF while vectoring the
> -        * injected exception, manually check to see if L1 wants to intercept
> -        * #PF, otherwise queuing the #PF will lead to #DF or a lost exception.
> -        * In all other cases, defer the check to nested_ops->check_events(),
> -        * which will correctly handle priority (this does not).  Note, other
> -        * exceptions, e.g. #GP, are theoretically affected, #PF is simply the
> -        * most problematic, e.g. when L0 and L1 are both intercepting #PF for
> -        * shadow paging.
> -        *
> -        * TODO: Rewrite exception handling to track injected and pending
> -        *       (VM-Exit) exceptions separately.
> -        */
> -       if (unlikely(vcpu->arch.exception.injected && is_guest_mode(vcpu)) &&
> -           kvm_x86_ops.nested_ops->handle_page_fault_workaround(vcpu, fault))
> -               return true;
> -
>         fault_mmu->inject_page_fault(vcpu, fault);
> -       return false;
>  }
>  EXPORT_SYMBOL_GPL(kvm_inject_emulated_page_fault);
>  
> @@ -4752,7 +4763,7 @@ static int kvm_vcpu_ready_for_interrupt_injection(struct kvm_vcpu *vcpu)
>         return (kvm_arch_interrupt_allowed(vcpu) &&
>                 kvm_cpu_accept_dm_intr(vcpu) &&
>                 !kvm_event_needs_reinjection(vcpu) &&
> -               !vcpu->arch.exception.pending);
> +               !kvm_is_exception_pending(vcpu));
>  }
>  
>  static int kvm_vcpu_ioctl_interrupt(struct kvm_vcpu *vcpu,
> @@ -4881,13 +4892,27 @@ static int kvm_vcpu_ioctl_x86_set_mce(struct kvm_vcpu *vcpu,
>  static void kvm_vcpu_ioctl_x86_get_vcpu_events(struct kvm_vcpu *vcpu,
>                                                struct kvm_vcpu_events *events)
>  {
> -       struct kvm_queued_exception *ex = &vcpu->arch.exception;
> +       struct kvm_queued_exception *ex;
>  
>         process_nmi(vcpu);
>  
>         if (kvm_check_request(KVM_REQ_SMI, vcpu))
>                 process_smi(vcpu);
>  
> +       /*
> +        * KVM's ABI only allows for one exception to be migrated.  Luckily,
> +        * the only time there can be two queued exceptions is if there's a
> +        * non-exiting _injected_ exception, and a pending exiting exception.
> +        * In that case, ignore the VM-Exiting exception as it's an extension
> +        * of the injected exception.
> +        */

I think that we will lose the injected exception and thus, after the
migration, deliver only the VM-exiting exception, without the correct
IDT_VECTORING_INFO_FIELD/exit_int_info.

It's not that big a deal and can be fixed by extending this API with a new
cap, as I did in my patches.  This can be done later, but the above comment,
which tries to justify it, should be updated to mention that it is wrong.
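
For reference, the selection rule in kvm_vcpu_ioctl_x86_get_vcpu_events()
reduces to this standalone sketch (model only, not the actual KVM code):

#include <stdbool.h>

struct queued_exception {
	bool pending;
	bool injected;
};

/*
 * One-slot migration ABI: report the pending VM-Exit exception only if
 * there is no pending or injected "regular" exception; otherwise the
 * VM-Exit exception is silently dropped, as discussed above.
 */
static const struct queued_exception *
pick_exception_to_migrate(const struct queued_exception *ex,
			  const struct queued_exception *ex_vmexit)
{
	if (ex_vmexit->pending && !ex->pending && !ex->injected)
		return ex_vmexit;
	return ex;
}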



> +       if (vcpu->arch.exception_vmexit.pending &&
> +           !vcpu->arch.exception.pending &&
> +           !vcpu->arch.exception.injected)
> +               ex = &vcpu->arch.exception_vmexit;
> +       else
> +               ex = &vcpu->arch.exception;


> +
>         /*
>          * In guest mode, payload delivery should be deferred if the exception
>          * will be intercepted by L1, e.g. KVM should not modifying CR2 if L1
> @@ -4994,6 +5019,19 @@ static int kvm_vcpu_ioctl_x86_set_vcpu_events(struct kvm_vcpu *vcpu,
>                 return -EINVAL;
>  
>         process_nmi(vcpu);
> +
> +       /*
> +        * Flag that userspace is stuffing an exception, the next KVM_RUN will
> +        * morph the exception to a VM-Exit if appropriate.  Do this only for
> +        * pending exceptions, already-injected exceptions are not subject to
> +        * interception.  Note, userspace that conflates pending and injected
> +        * is hosed, and will incorrectly convert an injected exception into a
> +        * pending exception, which in turn may cause a spurious VM-Exit.
> +        */
> +       vcpu->arch.exception_from_userspace = events->exception.pending;

If I understand correctly, the only reason you added arch.exception_from_userspace
is that you don't want to check here whether the exception is intercepted and
set exception_vmexit directly, because nested state might not be loaded
yet, etc.

> +
> +       vcpu->arch.exception_vmexit.pending = false;
> +
>         vcpu->arch.exception.injected = events->exception.injected;
>         vcpu->arch.exception.pending = events->exception.pending;
>         vcpu->arch.exception.vector = events->exception.nr;
> @@ -7977,18 +8015,17 @@ static void toggle_interruptibility(struct kvm_vcpu *vcpu, u32 mask)
>         }
>  }
>  
> -static bool inject_emulated_exception(struct kvm_vcpu *vcpu)
> +static void inject_emulated_exception(struct kvm_vcpu *vcpu)
>  {
>         struct x86_emulate_ctxt *ctxt = vcpu->arch.emulate_ctxt;
> +
>         if (ctxt->exception.vector == PF_VECTOR)
> -               return kvm_inject_emulated_page_fault(vcpu, &ctxt->exception);
> -
> -       if (ctxt->exception.error_code_valid)
> +               kvm_inject_emulated_page_fault(vcpu, &ctxt->exception);
> +       else if (ctxt->exception.error_code_valid)
>                 kvm_queue_exception_e(vcpu, ctxt->exception.vector,
>                                       ctxt->exception.error_code);
>         else
>                 kvm_queue_exception(vcpu, ctxt->exception.vector);
> -       return false;
>  }
>  
>  static struct x86_emulate_ctxt *alloc_emulate_ctxt(struct kvm_vcpu *vcpu)
> @@ -8601,8 +8638,7 @@ int x86_emulate_instruction(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
>  
>         if (ctxt->have_exception) {
>                 r = 1;
> -               if (inject_emulated_exception(vcpu))
> -                       return r;
> +               inject_emulated_exception(vcpu);
>         } else if (vcpu->arch.pio.count) {
>                 if (!vcpu->arch.pio.in) {
>                         /* FIXME: return into emulator if single-stepping.  */
> @@ -9540,7 +9576,7 @@ static int inject_pending_event(struct kvm_vcpu *vcpu, bool *req_immediate_exit)
>          */
>         if (vcpu->arch.exception.injected)
>                 kvm_inject_exception(vcpu);
> -       else if (vcpu->arch.exception.pending)
> +       else if (kvm_is_exception_pending(vcpu))
>                 ; /* see above */
>         else if (vcpu->arch.nmi_injected)
>                 static_call(kvm_x86_inject_nmi)(vcpu);
> @@ -9567,6 +9603,14 @@ static int inject_pending_event(struct kvm_vcpu *vcpu, bool *req_immediate_exit)
>         if (r < 0)
>                 goto out;
>  
> +       /*
> +        * A pending exception VM-Exit should either result in nested VM-Exit
> +        * or force an immediate re-entry and exit to/from L2, and exception
> +        * VM-Exits cannot be injected (flag should _never_ be set).
> +        */
> +       WARN_ON_ONCE(vcpu->arch.exception_vmexit.injected ||
> +                    vcpu->arch.exception_vmexit.pending);
> +
>         /*
>          * New events, other than exceptions, cannot be injected if KVM needs
>          * to re-inject a previous event.  See above comments on re-injecting
> @@ -9666,7 +9710,7 @@ static int inject_pending_event(struct kvm_vcpu *vcpu, bool *req_immediate_exit)
>             kvm_x86_ops.nested_ops->hv_timer_pending(vcpu))
>                 *req_immediate_exit = true;
>  
> -       WARN_ON(vcpu->arch.exception.pending);
> +       WARN_ON(kvm_is_exception_pending(vcpu));
>         return 0;
>  
>  out:
> @@ -10680,6 +10724,7 @@ static void kvm_put_guest_fpu(struct kvm_vcpu *vcpu)
>  
>  int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
>  {
> +       struct kvm_queued_exception *ex = &vcpu->arch.exception;
>         struct kvm_run *kvm_run = vcpu->run;
>         int r;
>  
> @@ -10738,6 +10783,21 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
>                 }
>         }
>  
> +       /*
> +        * If userspace set a pending exception and L2 is active, convert it to
> +        * a pending VM-Exit if L1 wants to intercept the exception.
> +        */
> +       if (vcpu->arch.exception_from_userspace && is_guest_mode(vcpu) &&
> +           kvm_x86_ops.nested_ops->is_exception_vmexit(vcpu, ex->vector,
> +                                                       ex->error_code)) {
> +               kvm_queue_exception_vmexit(vcpu, ex->vector,
> +                                          ex->has_error_code, ex->error_code,
> +                                          ex->has_payload, ex->payload);
> +               ex->injected = false;
> +               ex->pending = false;
> +       }
> +       vcpu->arch.exception_from_userspace = false;
> +
>         if (unlikely(vcpu->arch.complete_userspace_io)) {
>                 int (*cui)(struct kvm_vcpu *) = vcpu->arch.complete_userspace_io;
>                 vcpu->arch.complete_userspace_io = NULL;
> @@ -10842,6 +10902,7 @@ static void __set_regs(struct kvm_vcpu *vcpu, struct kvm_regs *regs)
>         kvm_set_rflags(vcpu, regs->rflags | X86_EFLAGS_FIXED);
>  
>         vcpu->arch.exception.pending = false;
> +       vcpu->arch.exception_vmexit.pending = false;
>  
>         kvm_make_request(KVM_REQ_EVENT, vcpu);
>  }
> @@ -11209,7 +11270,7 @@ int kvm_arch_vcpu_ioctl_set_guest_debug(struct kvm_vcpu *vcpu,
>  
>         if (dbg->control & (KVM_GUESTDBG_INJECT_DB | KVM_GUESTDBG_INJECT_BP)) {
>                 r = -EBUSY;
> -               if (vcpu->arch.exception.pending)
> +               if (kvm_is_exception_pending(vcpu))
>                         goto out;
>                 if (dbg->control & KVM_GUESTDBG_INJECT_DB)
>                         kvm_queue_exception(vcpu, DB_VECTOR);
> @@ -12387,7 +12448,7 @@ static inline bool kvm_vcpu_has_events(struct kvm_vcpu *vcpu)
>         if (vcpu->arch.pv.pv_unhalted)
>                 return true;
>  
> -       if (vcpu->arch.exception.pending)
> +       if (kvm_is_exception_pending(vcpu))
>                 return true;
>  
>         if (kvm_test_request(KVM_REQ_NMI, vcpu) ||
> @@ -12641,7 +12702,7 @@ bool kvm_can_do_async_pf(struct kvm_vcpu *vcpu)
>  {
>         if (unlikely(!lapic_in_kernel(vcpu) ||
>                      kvm_event_needs_reinjection(vcpu) ||
> -                    vcpu->arch.exception.pending))
> +                    kvm_is_exception_pending(vcpu)))
>                 return false;
>  
>         if (kvm_hlt_in_guest(vcpu->kvm) && !kvm_can_deliver_async_pf(vcpu))
> diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
> index dc2af0146220..eee259e387d3 100644
> --- a/arch/x86/kvm/x86.h
> +++ b/arch/x86/kvm/x86.h
> @@ -82,10 +82,17 @@ static inline unsigned int __shrink_ple_window(unsigned int val,
>  void kvm_service_local_tlb_flush_requests(struct kvm_vcpu *vcpu);
>  int kvm_check_nested_events(struct kvm_vcpu *vcpu);
>  
> +static inline bool kvm_is_exception_pending(struct kvm_vcpu *vcpu)
> +{
> +       return vcpu->arch.exception.pending ||
> +              vcpu->arch.exception_vmexit.pending;
> +}
> +
>  static inline void kvm_clear_exception_queue(struct kvm_vcpu *vcpu)
>  {
>         vcpu->arch.exception.pending = false;
>         vcpu->arch.exception.injected = false;
> +       vcpu->arch.exception_vmexit.pending = false;
>  }
>  
>  static inline void kvm_queue_interrupt(struct kvm_vcpu *vcpu, u8 vector,


So overall it looks like you encountered the same pain points I did, and your
approach is a bit less risky than mine, so to me it looks OK.

Best regards,
	Maxim Levitsky


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v2 18/21] KVM: x86: Treat pending TRIPLE_FAULT requests as pending exceptions
  2022-06-14 20:47 ` [PATCH v2 18/21] KVM: x86: Treat pending TRIPLE_FAULT requests as pending exceptions Sean Christopherson
@ 2022-07-06 12:16   ` Maxim Levitsky
  0 siblings, 0 replies; 78+ messages in thread
From: Maxim Levitsky @ 2022-07-06 12:16 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel, kvm,
	linux-kernel, Oliver Upton, Peter Shier

On Tue, 2022-06-14 at 20:47 +0000, Sean Christopherson wrote:
> Treat pending TRIPLE_FAULTS as pending exceptions.  A triple fault is an
> exception for all intents and purposes, it's just not tracked as such
> because there's no vector associated with the exception.  E.g. if userspace
> were to set vcpu->request_interrupt_window while running L2 and L2 hit a
> triple fault, a triple fault nested VM-Exit should be synthesized to L1
> before exiting to userspace with KVM_EXIT_IRQ_WINDOW_OPEN.
> 
> Link: https://lore.kernel.org/all/YoVHAIGcFgJit1qp@google.com
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>  arch/x86/kvm/x86.c | 3 ---
>  arch/x86/kvm/x86.h | 3 ++-
>  2 files changed, 2 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 63ee79da50df..8e54a074b7ff 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -12477,9 +12477,6 @@ static inline bool kvm_vcpu_has_events(struct kvm_vcpu *vcpu)
>         if (kvm_xen_has_pending_events(vcpu))
>                 return true;
>  
> -       if (kvm_test_request(KVM_REQ_TRIPLE_FAULT, vcpu))
> -               return true;
> -
>         return false;
>  }
>  
> diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
> index eee259e387d3..078765287ec6 100644
> --- a/arch/x86/kvm/x86.h
> +++ b/arch/x86/kvm/x86.h
> @@ -85,7 +85,8 @@ int kvm_check_nested_events(struct kvm_vcpu *vcpu);
>  static inline bool kvm_is_exception_pending(struct kvm_vcpu *vcpu)
>  {
>         return vcpu->arch.exception.pending ||
> -              vcpu->arch.exception_vmexit.pending;
> +              vcpu->arch.exception_vmexit.pending ||
> +              kvm_test_request(KVM_REQ_TRIPLE_FAULT, vcpu);
>  }
>  
>  static inline void kvm_clear_exception_queue(struct kvm_vcpu *vcpu)


Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>

Best regards,
	Maxim Levitsky


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v2 20/21] KVM: selftests: Use uapi header to get VMX and SVM exit reasons/codes
  2022-06-14 20:47 ` [PATCH v2 20/21] KVM: selftests: Use uapi header to get VMX and SVM exit reasons/codes Sean Christopherson
@ 2022-07-06 12:16   ` Maxim Levitsky
  0 siblings, 0 replies; 78+ messages in thread
From: Maxim Levitsky @ 2022-07-06 12:16 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel, kvm,
	linux-kernel, Oliver Upton, Peter Shier

On Tue, 2022-06-14 at 20:47 +0000, Sean Christopherson wrote:
> Include the vmx.h and svm.h uapi headers that KVM so kindly provides
> instead of manually defining all the same exit reasons/codes.
> 
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>  .../selftests/kvm/include/x86_64/svm_util.h   |  7 +--
>  .../selftests/kvm/include/x86_64/vmx.h        | 51 +------------------
>  2 files changed, 4 insertions(+), 54 deletions(-)
> 
> diff --git a/tools/testing/selftests/kvm/include/x86_64/svm_util.h b/tools/testing/selftests/kvm/include/x86_64/svm_util.h
> index a339b537a575..7aee6244ab6a 100644
> --- a/tools/testing/selftests/kvm/include/x86_64/svm_util.h
> +++ b/tools/testing/selftests/kvm/include/x86_64/svm_util.h
> @@ -9,15 +9,12 @@
>  #ifndef SELFTEST_KVM_SVM_UTILS_H
>  #define SELFTEST_KVM_SVM_UTILS_H
>  
> +#include <asm/svm.h>
> +
>  #include <stdint.h>
>  #include "svm.h"
>  #include "processor.h"
>  
> -#define SVM_EXIT_EXCP_BASE     0x040
> -#define SVM_EXIT_HLT           0x078
> -#define SVM_EXIT_MSR           0x07c
> -#define SVM_EXIT_VMMCALL       0x081
> -
>  struct svm_test_data {
>         /* VMCB */
>         struct vmcb *vmcb; /* gva */
> diff --git a/tools/testing/selftests/kvm/include/x86_64/vmx.h b/tools/testing/selftests/kvm/include/x86_64/vmx.h
> index 99fa1410964c..e4206f69b716 100644
> --- a/tools/testing/selftests/kvm/include/x86_64/vmx.h
> +++ b/tools/testing/selftests/kvm/include/x86_64/vmx.h
> @@ -8,6 +8,8 @@
>  #ifndef SELFTEST_KVM_VMX_H
>  #define SELFTEST_KVM_VMX_H
>  
> +#include <asm/vmx.h>
> +
>  #include <stdint.h>
>  #include "processor.h"
>  #include "apic.h"
> @@ -100,55 +102,6 @@
>  #define VMX_EPT_VPID_CAP_AD_BITS               0x00200000
>  
>  #define EXIT_REASON_FAILED_VMENTRY     0x80000000
> -#define EXIT_REASON_EXCEPTION_NMI      0
> -#define EXIT_REASON_EXTERNAL_INTERRUPT 1
> -#define EXIT_REASON_TRIPLE_FAULT       2
> -#define EXIT_REASON_INTERRUPT_WINDOW   7
> -#define EXIT_REASON_NMI_WINDOW         8
> -#define EXIT_REASON_TASK_SWITCH                9
> -#define EXIT_REASON_CPUID              10
> -#define EXIT_REASON_HLT                        12
> -#define EXIT_REASON_INVD               13
> -#define EXIT_REASON_INVLPG             14
> -#define EXIT_REASON_RDPMC              15
> -#define EXIT_REASON_RDTSC              16
> -#define EXIT_REASON_VMCALL             18
> -#define EXIT_REASON_VMCLEAR            19
> -#define EXIT_REASON_VMLAUNCH           20
> -#define EXIT_REASON_VMPTRLD            21
> -#define EXIT_REASON_VMPTRST            22
> -#define EXIT_REASON_VMREAD             23
> -#define EXIT_REASON_VMRESUME           24
> -#define EXIT_REASON_VMWRITE            25
> -#define EXIT_REASON_VMOFF              26
> -#define EXIT_REASON_VMON               27
> -#define EXIT_REASON_CR_ACCESS          28
> -#define EXIT_REASON_DR_ACCESS          29
> -#define EXIT_REASON_IO_INSTRUCTION     30
> -#define EXIT_REASON_MSR_READ           31
> -#define EXIT_REASON_MSR_WRITE          32
> -#define EXIT_REASON_INVALID_STATE      33
> -#define EXIT_REASON_MWAIT_INSTRUCTION  36
> -#define EXIT_REASON_MONITOR_INSTRUCTION 39
> -#define EXIT_REASON_PAUSE_INSTRUCTION  40
> -#define EXIT_REASON_MCE_DURING_VMENTRY 41
> -#define EXIT_REASON_TPR_BELOW_THRESHOLD 43
> -#define EXIT_REASON_APIC_ACCESS                44
> -#define EXIT_REASON_EOI_INDUCED                45
> -#define EXIT_REASON_EPT_VIOLATION      48
> -#define EXIT_REASON_EPT_MISCONFIG      49
> -#define EXIT_REASON_INVEPT             50
> -#define EXIT_REASON_RDTSCP             51
> -#define EXIT_REASON_PREEMPTION_TIMER   52
> -#define EXIT_REASON_INVVPID            53
> -#define EXIT_REASON_WBINVD             54
> -#define EXIT_REASON_XSETBV             55
> -#define EXIT_REASON_APIC_WRITE         56
> -#define EXIT_REASON_INVPCID            58
> -#define EXIT_REASON_PML_FULL           62
> -#define EXIT_REASON_XSAVES             63
> -#define EXIT_REASON_XRSTORS            64
> -#define LAST_EXIT_REASON               64
>  
>  enum vmcs_field {
>         VIRTUAL_PROCESSOR_ID            = 0x00000000,

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>

Best regards,
	Maxim Levitsky




^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v2 21/21] KVM: selftests: Add an x86-only test to verify nested exception queueing
  2022-06-14 20:47 ` [PATCH v2 21/21] KVM: selftests: Add an x86-only test to verify nested exception queueing Sean Christopherson
@ 2022-07-06 12:17   ` Maxim Levitsky
  0 siblings, 0 replies; 78+ messages in thread
From: Maxim Levitsky @ 2022-07-06 12:17 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel, kvm,
	linux-kernel, Oliver Upton, Peter Shier

On Tue, 2022-06-14 at 20:47 +0000, Sean Christopherson wrote:
> Add a test to verify that KVM_{G,S}ET_EVENTS play nice with pending vs.
> injected exceptions when an exception is being queued for L2, and that
> KVM correctly handles L1's exception intercept wants.
> 
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>  tools/testing/selftests/kvm/.gitignore        |   1 +
>  tools/testing/selftests/kvm/Makefile          |   1 +
>  .../kvm/x86_64/nested_exceptions_test.c       | 295 ++++++++++++++++++
>  3 files changed, 297 insertions(+)
>  create mode 100644 tools/testing/selftests/kvm/x86_64/nested_exceptions_test.c
> 
> diff --git a/tools/testing/selftests/kvm/.gitignore b/tools/testing/selftests/kvm/.gitignore
> index 0ab0e255d292..7c8adb8cff83 100644
> --- a/tools/testing/selftests/kvm/.gitignore
> +++ b/tools/testing/selftests/kvm/.gitignore
> @@ -27,6 +27,7 @@
>  /x86_64/hyperv_svm_test
>  /x86_64/max_vcpuid_cap_test
>  /x86_64/mmio_warning_test
> +/x86_64/nested_exceptions_test
>  /x86_64/platform_info_test
>  /x86_64/pmu_event_filter_test
>  /x86_64/set_boot_cpu_id
> diff --git a/tools/testing/selftests/kvm/Makefile b/tools/testing/selftests/kvm/Makefile
> index 2ca5400220b9..6db2dd5eca96 100644
> --- a/tools/testing/selftests/kvm/Makefile
> +++ b/tools/testing/selftests/kvm/Makefile
> @@ -83,6 +83,7 @@ TEST_GEN_PROGS_x86_64 += x86_64/hyperv_svm_test
>  TEST_GEN_PROGS_x86_64 += x86_64/kvm_clock_test
>  TEST_GEN_PROGS_x86_64 += x86_64/kvm_pv_test
>  TEST_GEN_PROGS_x86_64 += x86_64/mmio_warning_test
> +TEST_GEN_PROGS_x86_64 += x86_64/nested_exceptions_test
>  TEST_GEN_PROGS_x86_64 += x86_64/platform_info_test
>  TEST_GEN_PROGS_x86_64 += x86_64/pmu_event_filter_test
>  TEST_GEN_PROGS_x86_64 += x86_64/set_boot_cpu_id
> diff --git a/tools/testing/selftests/kvm/x86_64/nested_exceptions_test.c b/tools/testing/selftests/kvm/x86_64/nested_exceptions_test.c
> new file mode 100644
> index 000000000000..ac33835f78f4
> --- /dev/null
> +++ b/tools/testing/selftests/kvm/x86_64/nested_exceptions_test.c
> @@ -0,0 +1,295 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +#define _GNU_SOURCE /* for program_invocation_short_name */
> +
> +#include "test_util.h"
> +#include "kvm_util.h"
> +#include "processor.h"
> +#include "vmx.h"
> +#include "svm_util.h"
> +
> +#define L2_GUEST_STACK_SIZE 256
> +
> +/*
> + * Arbitrary, never shoved into KVM/hardware, just need to avoid conflict with
> + * the "real" exceptions used, #SS/#GP/#DF (12/13/8).
> + */
> +#define FAKE_TRIPLE_FAULT_VECTOR       0xaa
> +
> +/* Arbitrary 32-bit error code injected by this test. */
> +#define SS_ERROR_CODE 0xdeadbeef
> +
> +/*
> + * Bit '0' is set on Intel if the exception occurs while delivering a previous
> + * event/exception.  AMD's wording is ambiguous, but presumably the bit is set
> + * if the exception occurs while delivering an external event, e.g. NMI or INTR,
> + * but not for exceptions that occur when delivering other exceptions or
> + * software interrupts.
> + *
> + * Note, Intel's name for it, "External event", is misleading and much more
> + * aligned with AMD's behavior, but the SDM is quite clear on its behavior.
WOW, I never noticed that in the SDM; I guess I'm learning something new.
I was sure that the Ext bit is only set when an interrupt's delivery was 'interrupted'
by an exception (like a non-present IDT entry or something).

However, Intel does exclude software interrupts from that:

"When set, indicates that the exception occurred during delivery of an
event external to the program, such as an interrupt or an earlier exception. 5 The bit is cleared if the
exception occurred during delivery of a software interrupt (INT n, INT3, or INTO)."




> + */
> +#define ERROR_CODE_EXT_FLAG    BIT(0)
> +
> +/*
> + * Bit '1' is set if the fault occurred when looking up a descriptor in the
> + * IDT, which is the case here as the IDT is empty/NULL.
> + */
> +#define ERROR_CODE_IDT_FLAG    BIT(1)
> +
> +/*
> + * The #GP that occurs when vectoring #SS should show the index into the IDT
> + * for #SS, plus have the "IDT flag" set.
> + */
> +#define GP_ERROR_CODE_AMD ((SS_VECTOR * 8) | ERROR_CODE_IDT_FLAG)
> +#define GP_ERROR_CODE_INTEL ((SS_VECTOR * 8) | ERROR_CODE_IDT_FLAG | ERROR_CODE_EXT_FLAG)
> +
> +/*
> + * Intel and AMD both shove '0' into the error code on #DF, regardless of what
> + * led to the double fault.
> + */
> +#define DF_ERROR_CODE 0
> +
> +#define INTERCEPT_SS           (BIT_ULL(SS_VECTOR))
> +#define INTERCEPT_SS_DF                (INTERCEPT_SS | BIT_ULL(DF_VECTOR))
> +#define INTERCEPT_SS_GP_DF     (INTERCEPT_SS_DF | BIT_ULL(GP_VECTOR))
> +
> +static void l2_ss_pending_test(void)
> +{
> +       GUEST_SYNC(SS_VECTOR);
> +}
> +
> +static void l2_ss_injected_gp_test(void)
> +{
> +       GUEST_SYNC(GP_VECTOR);
> +}
> +
> +static void l2_ss_injected_df_test(void)
> +{
> +       GUEST_SYNC(DF_VECTOR);
> +}
> +
> +static void l2_ss_injected_tf_test(void)
> +{
> +       GUEST_SYNC(FAKE_TRIPLE_FAULT_VECTOR);
> +}
> +
> +static void svm_run_l2(struct svm_test_data *svm, void *l2_code, int vector,
> +                      uint32_t error_code)
> +{
> +       struct vmcb *vmcb = svm->vmcb;
> +       struct vmcb_control_area *ctrl = &vmcb->control;
> +
> +       vmcb->save.rip = (u64)l2_code;
> +       run_guest(vmcb, svm->vmcb_gpa);
> +
> +       if (vector == FAKE_TRIPLE_FAULT_VECTOR)
> +               return;
> +
> +       GUEST_ASSERT_EQ(ctrl->exit_code, (SVM_EXIT_EXCP_BASE + vector));
> +       GUEST_ASSERT_EQ(ctrl->exit_info_1, error_code);
> +}
> +
> +static void l1_svm_code(struct svm_test_data *svm)
> +{
> +       struct vmcb_control_area *ctrl = &svm->vmcb->control;
> +       unsigned long l2_guest_stack[L2_GUEST_STACK_SIZE];
> +
> +       generic_svm_setup(svm, NULL, &l2_guest_stack[L2_GUEST_STACK_SIZE]);
> +       svm->vmcb->save.idtr.limit = 0;
> +       ctrl->intercept |= BIT_ULL(INTERCEPT_SHUTDOWN);
> +
> +       ctrl->intercept_exceptions = INTERCEPT_SS_GP_DF;
> +       svm_run_l2(svm, l2_ss_pending_test, SS_VECTOR, SS_ERROR_CODE);
> +       svm_run_l2(svm, l2_ss_injected_gp_test, GP_VECTOR, GP_ERROR_CODE_AMD);
> +
> +       ctrl->intercept_exceptions = INTERCEPT_SS_DF;
> +       svm_run_l2(svm, l2_ss_injected_df_test, DF_VECTOR, DF_ERROR_CODE);
> +
> +       ctrl->intercept_exceptions = INTERCEPT_SS;
> +       svm_run_l2(svm, l2_ss_injected_tf_test, FAKE_TRIPLE_FAULT_VECTOR, 0);
> +       GUEST_ASSERT_EQ(ctrl->exit_code, SVM_EXIT_SHUTDOWN);
> +
> +       GUEST_DONE();
> +}
> +
> +static void vmx_run_l2(void *l2_code, int vector, uint32_t error_code)
> +{
> +       GUEST_ASSERT(!vmwrite(GUEST_RIP, (u64)l2_code));
> +
> +       GUEST_ASSERT_EQ(vector == SS_VECTOR ? vmlaunch() : vmresume(), 0);
> +
> +       if (vector == FAKE_TRIPLE_FAULT_VECTOR)
> +               return;
> +
> +       GUEST_ASSERT_EQ(vmreadz(VM_EXIT_REASON), EXIT_REASON_EXCEPTION_NMI);
> +       GUEST_ASSERT_EQ((vmreadz(VM_EXIT_INTR_INFO) & 0xff), vector);
> +       GUEST_ASSERT_EQ(vmreadz(VM_EXIT_INTR_ERROR_CODE), error_code);
> +}
> +
> +static void l1_vmx_code(struct vmx_pages *vmx)
> +{
> +       unsigned long l2_guest_stack[L2_GUEST_STACK_SIZE];
> +
> +       GUEST_ASSERT_EQ(prepare_for_vmx_operation(vmx), true);
> +
> +       GUEST_ASSERT_EQ(load_vmcs(vmx), true);
> +
> +       prepare_vmcs(vmx, NULL, &l2_guest_stack[L2_GUEST_STACK_SIZE]);
> +       GUEST_ASSERT_EQ(vmwrite(GUEST_IDTR_LIMIT, 0), 0);
> +
> +       /*
> +        * VMX disallows injecting an exception with error_code[31:16] != 0,
> +        * and hardware will never generate a VM-Exit with bits 31:16 set.
> +        * KVM should likewise truncate the "bad" userspace value.
> +        */
> +       GUEST_ASSERT_EQ(vmwrite(EXCEPTION_BITMAP, INTERCEPT_SS_GP_DF), 0);
> +       vmx_run_l2(l2_ss_pending_test, SS_VECTOR, (u16)SS_ERROR_CODE);
> +       vmx_run_l2(l2_ss_injected_gp_test, GP_VECTOR, GP_ERROR_CODE_INTEL);
> +
> +       GUEST_ASSERT_EQ(vmwrite(EXCEPTION_BITMAP, INTERCEPT_SS_DF), 0);
> +       vmx_run_l2(l2_ss_injected_df_test, DF_VECTOR, DF_ERROR_CODE);
> +
> +       GUEST_ASSERT_EQ(vmwrite(EXCEPTION_BITMAP, INTERCEPT_SS), 0);
> +       vmx_run_l2(l2_ss_injected_tf_test, FAKE_TRIPLE_FAULT_VECTOR, 0);
> +       GUEST_ASSERT_EQ(vmreadz(VM_EXIT_REASON), EXIT_REASON_TRIPLE_FAULT);
> +
> +       GUEST_DONE();
> +}
> +
> +static void __attribute__((__flatten__)) l1_guest_code(void *test_data)
> +{
> +       if (this_cpu_has(X86_FEATURE_SVM))
> +               l1_svm_code(test_data);
> +       else
> +               l1_vmx_code(test_data);
> +}
> +
> +static void assert_ucall_vector(struct kvm_vcpu *vcpu, int vector)
> +{
> +       struct kvm_run *run = vcpu->run;
> +       struct ucall uc;
> +
> +       TEST_ASSERT(run->exit_reason == KVM_EXIT_IO,
> +                   "Unexpected exit reason: %u (%s),\n",
> +                   run->exit_reason, exit_reason_str(run->exit_reason));
> +
> +       switch (get_ucall(vcpu, &uc)) {
> +       case UCALL_SYNC:
> +               TEST_ASSERT(vector == uc.args[1],
> +                           "Expected L2 to ask for %d, got %ld", vector, uc.args[1]);
> +               break;
> +       case UCALL_DONE:
> +               TEST_ASSERT(vector == -1,
> +                           "Expected L2 to ask for %d, L2 says it's done", vector);
> +               break;
> +       case UCALL_ABORT:
> +               TEST_FAIL("%s at %s:%ld (0x%lx != 0x%lx)",
> +                         (const char *)uc.args[0], __FILE__, uc.args[1],
> +                         uc.args[2], uc.args[3]);
> +               break;
> +       default:
> +               TEST_FAIL("Expected L2 to ask for %d, got unexpected ucall %lu", vector, uc.cmd);
> +       }
> +}
> +
> +static void queue_ss_exception(struct kvm_vcpu *vcpu, bool inject)
> +{
> +       struct kvm_vcpu_events events;
> +
> +       vcpu_events_get(vcpu, &events);
> +
> +       TEST_ASSERT(!events.exception.pending,
> +                   "Vector %d unexpectedly pending", events.exception.nr);
> +       TEST_ASSERT(!events.exception.injected,
> +                   "Vector %d unexpectedly injected", events.exception.nr);
> +
> +       events.flags = KVM_VCPUEVENT_VALID_PAYLOAD;
> +       events.exception.pending = !inject;
> +       events.exception.injected = inject;
> +       events.exception.nr = SS_VECTOR;
> +       events.exception.has_error_code = true;
> +       events.exception.error_code = SS_ERROR_CODE;
> +       vcpu_events_set(vcpu, &events);
> +}
> +
> +/*
> + * Verify KVM_{G,S}ET_EVENTS play nice with pending vs. injected exceptions
> + * when an exception is being queued for L2.  Specifically, verify that KVM
> + * honors L1 exception intercept controls when a #SS is pending/injected,
> + * triggers a #GP on vectoring the #SS, morphs to #DF if #GP isn't intercepted
> + * by L1, and finally causes (nested) SHUTDOWN if #DF isn't intercepted by L1.
> + */
> +int main(int argc, char *argv[])
> +{
> +       vm_vaddr_t nested_test_data_gva;
> +       struct kvm_vcpu_events events;
> +       struct kvm_vcpu *vcpu;
> +       struct kvm_vm *vm;
> +
> +       TEST_REQUIRE(kvm_has_cap(KVM_CAP_EXCEPTION_PAYLOAD));
> +       TEST_REQUIRE(kvm_cpu_has(X86_FEATURE_SVM) || kvm_cpu_has(X86_FEATURE_VMX));
> +
> +       vm = vm_create_with_one_vcpu(&vcpu, l1_guest_code);
> +       vm_enable_cap(vm, KVM_CAP_EXCEPTION_PAYLOAD, -2ul);
> +
> +       if (kvm_cpu_has(X86_FEATURE_SVM))
> +               vcpu_alloc_svm(vm, &nested_test_data_gva);
> +       else
> +               vcpu_alloc_vmx(vm, &nested_test_data_gva);
> +
> +       vcpu_args_set(vcpu, 1, nested_test_data_gva);
> +
> +       /* Run L1 => L2.  L2 should sync and request #SS. */
> +       vcpu_run(vcpu);
> +       assert_ucall_vector(vcpu, SS_VECTOR);
> +
> +       /* Pend #SS and request immediate exit.  #SS should still be pending. */
> +       queue_ss_exception(vcpu, false);
> +       vcpu->run->immediate_exit = true;
> +       vcpu_run_complete_io(vcpu);
> +
> +       /* Verify the pending event comes back out the same as it went in. */
> +       vcpu_events_get(vcpu, &events);
> +       ASSERT_EQ(events.flags & KVM_VCPUEVENT_VALID_PAYLOAD,
> +                 KVM_VCPUEVENT_VALID_PAYLOAD);
> +       ASSERT_EQ(events.exception.pending, true);
> +       ASSERT_EQ(events.exception.nr, SS_VECTOR);
> +       ASSERT_EQ(events.exception.has_error_code, true);
> +       ASSERT_EQ(events.exception.error_code, SS_ERROR_CODE);
> +
> +       /*
> +        * Run for real with the pending #SS, L1 should get a VM-Exit due to
> +        * #SS interception and re-enter L2 to request #GP (via injected #SS).
> +        */
> +       vcpu->run->immediate_exit = false;
> +       vcpu_run(vcpu);
> +       assert_ucall_vector(vcpu, GP_VECTOR);
> +
> +       /*
> +        * Inject #SS, the #SS should bypass interception and cause #GP, which
> +        * L1 should intercept before KVM morphs it to #DF.  L1 should then
> +        * disable #GP interception and run L2 to request #DF (via #SS => #GP).
> +        */
> +       queue_ss_exception(vcpu, true);
> +       vcpu_run(vcpu);
> +       assert_ucall_vector(vcpu, DF_VECTOR);
> +
> +       /*
> +        * Inject #SS, the #SS should bypass interception and cause #GP, which
> +        * L1 is no longer intercepting, and so should see a #DF VM-Exit.  L1
> +        * should then signal that it is done.
> +        */
> +       queue_ss_exception(vcpu, true);
> +       vcpu_run(vcpu);
> +       assert_ucall_vector(vcpu, FAKE_TRIPLE_FAULT_VECTOR);
> +
> +       /*
> +        * Inject #SS yet again.  L1 is not intercepting #GP or #DF, and so
> +        * should see nested TRIPLE_FAULT / SHUTDOWN.
> +        */
> +       queue_ss_exception(vcpu, true);
> +       vcpu_run(vcpu);
> +       assert_ucall_vector(vcpu, -1);
> +
> +       kvm_vm_free(vm);
> +}

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>


A few more tests, maybe done in kvm-unit-tests, wouldn't hurt, especially
tests that exercise the interaction between an L1 that uses EVENTINJ and
nested exceptions (see the sketch below).

Also a test where delivering an interrupt causes a nested exception; I wrote
one for kvm-unit-tests, but I don't remember whether it was accepted upstream.
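
A sketch of what an EVENTINJ-vs-nested-exception case might look like on
SVM, reusing the selftest-style helpers from the patch above (hypothetical
scaffolding; the kvm-unit-tests API differs):

        /*
         * L1: have EVENTINJ deliver a #GP into L2 with a limit-0 IDT so
         * that vectoring the #GP faults and escalates to #DF.
         */
        vmcb->save.idtr.limit = 0;
        vmcb->control.intercept_exceptions = BIT_ULL(DF_VECTOR);
        vmcb->control.event_inj = GP_VECTOR | SVM_EVTINJ_TYPE_EXEPT |
                                  SVM_EVTINJ_VALID | SVM_EVTINJ_VALID_ERR;
        vmcb->control.event_inj_err = 0;
        run_guest(vmcb, svm->vmcb_gpa);

        /* The injected #GP bypasses interception; the resulting #DF must not. */
        GUEST_ASSERT_EQ(vmcb->control.exit_code,
                        SVM_EXIT_EXCP_BASE + DF_VECTOR);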


Best regards,
	Maxim Levitsky


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v2 02/21] KVM: VMX: Drop bits 31:16 when shoving exception error code into VMCS
  2022-07-06 11:43   ` Maxim Levitsky
@ 2022-07-06 16:12     ` Sean Christopherson
  2022-07-06 18:50       ` Maxim Levitsky
  0 siblings, 1 reply; 78+ messages in thread
From: Sean Christopherson @ 2022-07-06 16:12 UTC (permalink / raw)
  To: Maxim Levitsky
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel, Oliver Upton, Peter Shier

On Wed, Jul 06, 2022, Maxim Levitsky wrote:
> On Tue, 2022-06-14 at 20:47 +0000, Sean Christopherson wrote:
> > Deliberately truncate the exception error code when shoving it into the
> > VMCS (VM-Entry field for vmcs01 and vmcs02, VM-Exit field for vmcs12).
> > Intel CPUs are incapable of handling 32-bit error codes and will never
> > generate an error code with bits 31:16, but userspace can provide an
> > arbitrary error code via KVM_SET_VCPU_EVENTS.  Failure to drop the bits
> > on exception injection results in failed VM-Entry, as VMX disallows
> > setting bits 31:16.  Setting the bits on VM-Exit would at best confuse
> > L1, and at worst induce a nested VM-Entry failure, e.g. if L1 decided to
> > reinject the exception back into L2.
> 
> Wouldn't it be better to fail KVM_SET_VCPU_EVENTS instead if it tries
> to set an error code with the upper 16 bits set?

No, because AMD CPUs generate error codes with bits 31:16 set.  KVM "supports"
cross-vendor live migration, so outright rejecting is not an option.

> Or if that is considered ABI breakage, then KVM_SET_VCPU_EVENTS code
> can truncate the user-given value to 16 bits.

Again, AMD, and more specifically SVM, allows bits 31:16 to be non-zero, so
truncation is only correct for VMX.  I say "VMX" instead of "Intel" because
architecturally the Intel CPUs do have 32-bit error codes, it's just the VMX
architecture that doesn't allow injection of 32-bit values.
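
For illustration, the fix being discussed boils down to truncating at the
point the error code is written into the VMCS, e.g. (a sketch, not the
verbatim patch):

        /*
         * Drop bits 31:16; Intel hardware never generates them and VMX
         * rejects them on injection, whereas SVM keeps all 32 bits.
         */
        if (has_error_code)
                vmcs_write32(VM_ENTRY_EXCEPTION_ERROR_CODE,
                             (u16)error_code);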

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v2 09/21] KVM: nVMX: Unconditionally clear mtf_pending on nested VM-Exit
  2022-07-06 12:00   ` Maxim Levitsky
@ 2022-07-06 16:45     ` Sean Christopherson
  2022-07-06 20:03       ` Maxim Levitsky
  0 siblings, 1 reply; 78+ messages in thread
From: Sean Christopherson @ 2022-07-06 16:45 UTC (permalink / raw)
  To: Maxim Levitsky
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel, Oliver Upton, Peter Shier

On Wed, Jul 06, 2022, Maxim Levitsky wrote:
> On Tue, 2022-06-14 at 20:47 +0000, Sean Christopherson wrote:
> > Clear mtf_pending on nested VM-Exit instead of handling the clear on a
> > case-by-case basis in vmx_check_nested_events().  The pending MTF should
> > rever survive nested VM-Exit, as it is a property of KVM's run of the
> ^^ typo: never
> 
> Also, it is not clear what 'case by case' means.
> 
> I see that vmx_check_nested_events() always clears it unless a nested run is pending
> or we re-inject an event.

Those two "unless ..." are the "cases".  The point I'm trying to make in the changelog
is that there's no need for any conditional logic whatsoever.
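
A sketch of the unconditional version, hoisted into nested_vmx_vmexit()
itself (not the verbatim hunk):

        /*
         * A pending MTF is a property of KVM's current run of L2 and never
         * survives a nested VM-Exit; clear it with no conditions attached.
         */
        vmx->nested.mtf_pending = false;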

> > @@ -3927,6 +3919,9 @@ static int vmx_check_nested_events(struct kvm_vcpu *vcpu)
> >  		clear_bit(KVM_APIC_INIT, &apic->pending_events);
> >  		if (vcpu->arch.mp_state != KVM_MP_STATE_INIT_RECEIVED)
> >  			nested_vmx_vmexit(vcpu, EXIT_REASON_INIT_SIGNAL, 0, 0);
> > +
> > +		/* MTF is discarded if the vCPU is in WFS. */
> > +		vmx->nested.mtf_pending = false;
> >  		return 0;
> 
> I guess MTF should also be discarded if we enter SMM, and I see that
> VMX also enters SMM with a pseudo VM-Exit (in vmx_enter_smm), which
> will clear the MTF. Good.

No, a pending MTF should be preserved across SMI.  It's not a regression because
KVM incorrectly prioritizes MTF (and trap-like #DBs) over SMI (and because if KVM
did prioritize SMI, the existing code would also drop the pending MTF).  Note, this
isn't the only flaw that needs to be addressed in order to correctly prioritize SMIs,
e.g. KVM_{G,S}ET_NESTED_STATE would need to save/restore a pending MTF if the vCPU is
in SMM after an SMI that arrived while L2 was active.

Tangentially related, KVM's pseudo VM-Exit on SMI emulation is completely wrong[*].

[*] https://lore.kernel.org/all/Yobt1XwOfb5M6Dfa@google.com

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v2 00/21] KVM: x86: Event/exception fixes and cleanups
  2022-07-06 11:54     ` Maxim Levitsky
@ 2022-07-06 17:13       ` Jim Mattson
  2022-07-06 17:52         ` Sean Christopherson
  0 siblings, 1 reply; 78+ messages in thread
From: Jim Mattson @ 2022-07-06 17:13 UTC (permalink / raw)
  To: Maxim Levitsky
  Cc: Sean Christopherson, Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li,
	Joerg Roedel, kvm, linux-kernel, Oliver Upton, Peter Shier

On Wed, Jul 6, 2022 at 4:55 AM Maxim Levitsky <mlevitsk@redhat.com> wrote:

> 1. Since #SMI is higher priority than the #MTF, that means that unless dual monitor treatment is used,
>    and the dual monitor handler figures out that #MTF was pending and re-injects it when it
>    VMRESUME's the 'host', the MTF gets lost, and there is no way for a normal hypervisor to
>    do anything about it.
>
>    Or maybe pending MTF is saved to SMRAM somewhere.
>
>    In case you are going to say that I am inventing this again, I am saying up front that the above
>    is just a guess.

This is covered in the SDM, volume 3, section 31.14.1: "Default
Treatment of SMI Delivery:"

The pseudocode above makes reference to the saving of VMX-critical
state. This state consists of the following: (1) SS.DPL (the current
privilege level); (2) RFLAGS.VM; (3) the state of blocking by STI and
by MOV SS (see Table 24-3 in Section 24.4.2); (4) the state of
virtual-NMI blocking (only if the processor is in VMX non-root
operation and the “virtual NMIs” VM-execution control is 1); and (5)
an indication of whether an MTF VM exit is pending (see Section
25.5.2). These data may be saved internal to the processor or in the
VMCS region of the current VMCS. Processors that do not support SMI
recognition while there is blocking by STI or by MOV SS need not save
the state of such blocking.

Saving VMX-critical state to SMRAM is not documented as an option.

> 2. For case 7, what about General Detect?  Since the CPU needs to decode the instruction to raise it,
>    it's more natural to have it belong to case 9.

I think it actually belongs in case 10. The Intel table says,
"Fault-class Debug Exceptions (#DB due to instruction breakpoint),"
and I probably just blindly added "General Detect," because it is a
fault-class debug exception. You're right; the CPU has to decode the
instruction before it can deliver a #DB for general detect.

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v2 13/21] KVM: x86: Formalize blocking of nested pending exceptions
  2022-07-06 12:04   ` Maxim Levitsky
@ 2022-07-06 17:36     ` Sean Christopherson
  2022-07-06 20:03       ` Maxim Levitsky
  0 siblings, 1 reply; 78+ messages in thread
From: Sean Christopherson @ 2022-07-06 17:36 UTC (permalink / raw)
  To: Maxim Levitsky
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel, Oliver Upton, Peter Shier

On Wed, Jul 06, 2022, Maxim Levitsky wrote:
> On Tue, 2022-06-14 at 20:47 +0000, Sean Christopherson wrote:
> > Capture nested_run_pending as block_pending_exceptions so that the logic
> > of why exceptions are blocked only needs to be documented once instead of
> > at every place that employs the logic.
> > 
> > No functional change intended.
> > 
> > Signed-off-by: Sean Christopherson <seanjc@google.com>
> > ---
> >  arch/x86/kvm/svm/nested.c | 20 ++++++++++----------
> >  arch/x86/kvm/vmx/nested.c | 23 ++++++++++++-----------
> >  2 files changed, 22 insertions(+), 21 deletions(-)
> > 
> > diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
> > index 471d40e97890..460161e67ce5 100644
> > --- a/arch/x86/kvm/svm/nested.c
> > +++ b/arch/x86/kvm/svm/nested.c
> > @@ -1347,10 +1347,16 @@ static inline bool nested_exit_on_init(struct vcpu_svm *svm)
> >  
> >  static int svm_check_nested_events(struct kvm_vcpu *vcpu)
> >  {
> > -	struct vcpu_svm *svm = to_svm(vcpu);
> > -	bool block_nested_events =
> > -		kvm_event_needs_reinjection(vcpu) || svm->nested.nested_run_pending;
> >  	struct kvm_lapic *apic = vcpu->arch.apic;
> > +	struct vcpu_svm *svm = to_svm(vcpu);
> > +	/*
> > +	 * Only a pending nested run blocks a pending exception.  If there is a
> > +	 * previously injected event, the pending exception occurred while said
> > +	 * event was being delivered and thus needs to be handled.
> > +	 */
> 
> Tiny nitpick about the comment:
> 
> One can say that if there is an injected event, this means that we
> are in the middle of handling it, thus we are not on an instruction boundary,
> and thus we don't process events (e.g. interrupts).
> 
> So maybe write something like that?

Hmm, that's another way to look at things.  My goal with the comment was to try
and call out that any pending exception is a continuation of the injected event,
i.e. that the injected event won't be lost.  Talking about instruction boundaries
only explains why non-exception events are blocked, it doesn't explain why exceptions
are _not_ blocked.

I'll add a second comment above block_nested_events to capture the instruction
boundary angle.

> > +	bool block_nested_exceptions = svm->nested.nested_run_pending;
> > +	bool block_nested_events = block_nested_exceptions ||
> > +				   kvm_event_needs_reinjection(vcpu);
> 
> Tiny nitpick: I don't much like the name 'nested', as
> it can also mean a nested exception (e.g. an exception that
> happened while jumping to an exception handler).
> 
> Here we mean just exceptions/events for the guest, so I would suggest
> just dropping the word 'nested'.

I don't disagree, but I'd prefer to keep the current naming because the helper
itself is *_check_nested_events().  I'm not opposed to renaming things in the
future, but I don't want to do that in this series.

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v2 00/21] KVM: x86: Event/exception fixes and cleanups
  2022-07-06 17:13       ` Jim Mattson
@ 2022-07-06 17:52         ` Sean Christopherson
  2022-07-06 20:03           ` Maxim Levitsky
  2022-07-06 20:11           ` Jim Mattson
  0 siblings, 2 replies; 78+ messages in thread
From: Sean Christopherson @ 2022-07-06 17:52 UTC (permalink / raw)
  To: Jim Mattson
  Cc: Maxim Levitsky, Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li,
	Joerg Roedel, kvm, linux-kernel, Oliver Upton, Peter Shier

On Wed, Jul 06, 2022, Jim Mattson wrote:
> On Wed, Jul 6, 2022 at 4:55 AM Maxim Levitsky <mlevitsk@redhat.com> wrote:
> 
> > 1. Since #SMI is higher priority than the #MTF, that means that unless dual monitor treatment is used,
> >    and the dual monitor handler figures out that #MTF was pending and re-injects it when it
> >    VMRESUME's the 'host', the MTF gets lost, and there is no way for a normal hypervisor to
> >    do anything about it.
> >
> >    Or maybe pending MTF is saved to SMRAM somewhere.
> >
> >    In case you are going to say that I am inventing this again, I am saying up front that the above
> >    is just a guess.
> 
> This is covered in the SDM, volume 3, section 31.14.1: "Default
> Treatment of SMI Delivery:"
> 
> The pseudocode above makes reference to the saving of VMX-critical
> state. This state consists of the following: (1) SS.DPL (the current
> privilege level); (2) RFLAGS.VM; (3) the state of blocking by STI and
> by MOV SS (see Table 24-3 in Section 24.4.2); (4) the state of
> virtual-NMI blocking (only if the processor is in VMX non-root
> operation and the “virtual NMIs” VM-execution control is 1); and (5)
> an indication of whether an MTF VM exit is pending (see Section
> 25.5.2). These data may be saved internal to the processor or in the
> VMCS region of the current VMCS. Processors that do not support SMI
> recognition while there is blocking by STI or by MOV SS need not save
> the state of such blocking.
> 
> Saving VMX-critical state to SMRAM is not documented as an option.

Hmm, I'm not entirely convinced that Intel doesn't interpret "internal to the
processor" as "undocumented SMRAM fields".  But I could also be misremembering
the SMI flows.

Regardless, I do like the idea of using vmcs12 instead of SMRAM.  That would provide
some extra motivation for moving away from KVM's broken pseudo VM-Exit implementation.

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v2 02/21] KVM: VMX: Drop bits 31:16 when shoving exception error code into VMCS
  2022-07-06 16:12     ` Sean Christopherson
@ 2022-07-06 18:50       ` Maxim Levitsky
  0 siblings, 0 replies; 78+ messages in thread
From: Maxim Levitsky @ 2022-07-06 18:50 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel, Oliver Upton, Peter Shier

On Wed, 2022-07-06 at 16:12 +0000, Sean Christopherson wrote:
> On Wed, Jul 06, 2022, Maxim Levitsky wrote:
> > On Tue, 2022-06-14 at 20:47 +0000, Sean Christopherson wrote:
> > > Deliberately truncate the exception error code when shoving it into the
> > > VMCS (VM-Entry field for vmcs01 and vmcs02, VM-Exit field for vmcs12).
> > > Intel CPUs are incapable of handling 32-bit error codes and will never
> > > generate an error code with bits 31:16, but userspace can provide an
> > > arbitrary error code via KVM_SET_VCPU_EVENTS.  Failure to drop the bits
> > > on exception injection results in failed VM-Entry, as VMX disallows
> > > setting bits 31:16.  Setting the bits on VM-Exit would at best confuse
> > > L1, and at worst induce a nested VM-Entry failure, e.g. if L1 decided to
> > > reinject the exception back into L2.
> > 
> > Wouldn't it be better to fail KVM_SET_VCPU_EVENTS instead if it tries
> > to set an error code with the upper 16 bits set?
> 
> No, because AMD CPUs generate error codes with bits 31:16 set.  KVM "supports"
> cross-vendor live migration, so outright rejecting is not an option.
> 
> > Or if that is considered ABI breakage, then KVM_SET_VCPU_EVENTS code
> > can truncate the user-given value to 16 bits.
> 
> Again, AMD, and more specifically SVM, allows bits 31:16 to be non-zero, so
> truncation is only correct for VMX.  I say "VMX" instead of "Intel" because
> architecturally the Intel CPUs do have 32-bit error codes, it's just the VMX
> architecture that doesn't allow injection of 32-bit values.
> 

Oh, I see AMD uses bit 31 for RMP (from SEV-SNP) page fault,
Thanks for the explanation!

You might want to add this piece of info somewhere as a comment if you wish.
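
For instance, something like this above the truncation (a sketch of such a
comment, not taken from the series):

        /*
         * SVM allows full 32-bit error codes, e.g. SEV-SNP sets bit 31 of
         * the #PF error code for RMP violations, so truncating to 16 bits
         * is correct only for VMX, which disallows injecting error codes
         * with bits 31:16 set.
         */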

Thanks,
Best regards,
	Maxim Levitsky


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v2 02/21] KVM: VMX: Drop bits 31:16 when shoving exception error code into VMCS
  2022-06-14 20:47 ` [PATCH v2 02/21] KVM: VMX: Drop bits 31:16 when shoving exception error code into VMCS Sean Christopherson
  2022-07-06 11:43   ` Maxim Levitsky
@ 2022-07-06 20:02   ` Jim Mattson
  1 sibling, 0 replies; 78+ messages in thread
From: Jim Mattson @ 2022-07-06 20:02 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Joerg Roedel, kvm,
	linux-kernel, Oliver Upton, Peter Shier

On Tue, Jun 14, 2022 at 1:47 PM Sean Christopherson <seanjc@google.com> wrote:
>
> Deliberately truncate the exception error code when shoving it into the
> VMCS (VM-Entry field for vmcs01 and vmcs02, VM-Exit field for vmcs12).
> Intel CPUs are incapable of handling 32-bit error codes and will never

Never say "never." :-)

> generate an error code with bits 31:16, but userspace can provide an
> arbitrary error code via KVM_SET_VCPU_EVENTS.  Failure to drop the bits
> on exception injection results in failed VM-Entry, as VMX disallows
> setting bits 31:16.  Setting the bits on VM-Exit would at best confuse
> L1, and at worst induce a nested VM-Entry failure, e.g. if L1 decided to
> reinject the exception back into L2.
>
> Cc: stable@vger.kernel.org
> Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: Jim Mattson <jmattson@google.com>

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v2 09/21] KVM: nVMX: Unconditionally clear mtf_pending on nested VM-Exit
  2022-07-06 16:45     ` Sean Christopherson
@ 2022-07-06 20:03       ` Maxim Levitsky
  0 siblings, 0 replies; 78+ messages in thread
From: Maxim Levitsky @ 2022-07-06 20:03 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel, Oliver Upton, Peter Shier

On Wed, 2022-07-06 at 16:45 +0000, Sean Christopherson wrote:
> On Wed, Jul 06, 2022, Maxim Levitsky wrote:
> > On Tue, 2022-06-14 at 20:47 +0000, Sean Christopherson wrote:
> > > Clear mtf_pending on nested VM-Exit instead of handling the clear on a
> > > case-by-case basis in vmx_check_nested_events().  The pending MTF should
> > > rever survive nested VM-Exit, as it is a property of KVM's run of the
> > ^^ typo: never
> > 
> > Also, it is not clear what 'case by case' means.
> > 
> > I see that vmx_check_nested_events() always clears it unless a nested run is pending
> > or we re-inject an event.
> 
> Those two "unless ..." are the "cases".  The point I'm trying to make in the changelog
> is that there's no need for any conditional logic whatsoever.
> 
> > > @@ -3927,6 +3919,9 @@ static int vmx_check_nested_events(struct kvm_vcpu *vcpu)
> > >  		clear_bit(KVM_APIC_INIT, &apic->pending_events);
> > >  		if (vcpu->arch.mp_state != KVM_MP_STATE_INIT_RECEIVED)
> > >  			nested_vmx_vmexit(vcpu, EXIT_REASON_INIT_SIGNAL, 0, 0);
> > > +
> > > +		/* MTF is discarded if the vCPU is in WFS. */
> > > +		vmx->nested.mtf_pending = false;
> > >  		return 0;
> > 
> > I guess MTF should also be discarded if we enter SMM, and I see that
> > VMX also enters SMM with a pseudo VM-Exit (in vmx_enter_smm), which
> > will clear the MTF. Good.
> 
> No, a pending MTF should be preserved across SMI. 

Indeed, now I see it:

"If an MTF VM exit was pending at the time of the previous SMI, an MTF VM exit is pending on the instruction
boundary following execution of RSM. The following items detail the treatment of MTF VM exits that may be
pending following RSM:"

You might also want to add it as some comment in the source.
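
For instance (a sketch of such a comment, paraphrasing the SDM text above):

        /*
         * Per the SDM, a pending MTF VM-Exit is part of the VMX-critical
         * state saved on SMI and is re-pended on the instruction boundary
         * following RSM, i.e. a pending MTF must survive entry to and exit
         * from SMM.
         */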



>  It's not a regression because
> KVM incorrectly prioritizes MTF (and trap-like #DBs) over SMI (and because if KVM
> did prioritize SMI, the existing code would also drop the pending MTF).  Note, this
> isn't the only flaw that needs to be addressed in order to correctly prioritize SMIs,
> e.g. KVM_{G,S}ET_NESTED_STATE would need to save/restore a pending MTF if the vCPU is
> in SMM after an SMI that arrived while L2 was active.

When we fix this, should we store it in SMRAM, or in some KVM-internal state?
Or in vmcs12, as noted in the other mail?


> 
> Tangentially related, KVM's pseudo VM-Exit on SMI emulation is completely wrong[*].
> 
> [*] https://lore.kernel.org/all/Yobt1XwOfb5M6Dfa@google.com
> 


I have recently seen a patch on the KVM mailing list about exactly this,
preserving CET state in SMRAM; I'll need to take a look.



Best regards,
	Maxim Levitsky


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v2 13/21] KVM: x86: Formalize blocking of nested pending exceptions
  2022-07-06 17:36     ` Sean Christopherson
@ 2022-07-06 20:03       ` Maxim Levitsky
  0 siblings, 0 replies; 78+ messages in thread
From: Maxim Levitsky @ 2022-07-06 20:03 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel, Oliver Upton, Peter Shier

On Wed, 2022-07-06 at 17:36 +0000, Sean Christopherson wrote:
> On Wed, Jul 06, 2022, Maxim Levitsky wrote:
> > On Tue, 2022-06-14 at 20:47 +0000, Sean Christopherson wrote:
> > > Capture nested_run_pending as block_pending_exceptions so that the logic
> > > of why exceptions are blocked only needs to be documented once instead of
> > > at every place that employs the logic.
> > > 
> > > No functional change intended.
> > > 
> > > Signed-off-by: Sean Christopherson <seanjc@google.com>
> > > ---
> > >  arch/x86/kvm/svm/nested.c | 20 ++++++++++----------
> > >  arch/x86/kvm/vmx/nested.c | 23 ++++++++++++-----------
> > >  2 files changed, 22 insertions(+), 21 deletions(-)
> > > 
> > > diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
> > > index 471d40e97890..460161e67ce5 100644
> > > --- a/arch/x86/kvm/svm/nested.c
> > > +++ b/arch/x86/kvm/svm/nested.c
> > > @@ -1347,10 +1347,16 @@ static inline bool nested_exit_on_init(struct vcpu_svm *svm)
> > >  
> > >  static int svm_check_nested_events(struct kvm_vcpu *vcpu)
> > >  {
> > > -	struct vcpu_svm *svm = to_svm(vcpu);
> > > -	bool block_nested_events =
> > > -		kvm_event_needs_reinjection(vcpu) || svm->nested.nested_run_pending;
> > >  	struct kvm_lapic *apic = vcpu->arch.apic;
> > > +	struct vcpu_svm *svm = to_svm(vcpu);
> > > +	/*
> > > +	 * Only a pending nested run blocks a pending exception.  If there is a
> > > +	 * previously injected event, the pending exception occurred while said
> > > +	 * event was being delivered and thus needs to be handled.
> > > +	 */
> > 
> > Tiny nitpick about the comment:
> > 
> > One can say that if there is an injected event, this means that we
> > are in the middle of handling it, thus we are not on an instruction boundary,
> > and thus we don't process events (e.g. interrupts).
> > 
> > So maybe write something like that?
> 
> Hmm, that's another way to look at things.  My goal with the comment was to try
> and call out that any pending exception is a continuation of the injected event,
> i.e. that the injected event won't be lost.  Talking about instruction boundaries
> only explains why non-exception events are blocked, it doesn't explain why exceptions
> are _not_ blocked.
> 
> I'll add a second comment above block_nested_events to capture the instruction
> boundary angle.
> 
> > > +	bool block_nested_exceptions = svm->nested.nested_run_pending;
> > > +	bool block_nested_events = block_nested_exceptions ||
> > > +				   kvm_event_needs_reinjection(vcpu);
> > 
> > Tiny nitpick: I don't much like the name 'nested', as
> > it can also mean a nested exception (e.g. an exception that
> > happened while jumping to an exception handler).
> > 
> > Here we mean just exceptions/events for the guest, so I would suggest
> > just dropping the word 'nested'.
> 
> I don't disagree, but I'd prefer to keep the current naming because the helper
> itself is *_check_nested_events().  I'm not opposed to renaming things in the
> future, but I don't want to do that in this series.
> 
Yep, makes sense.

Best regards,
	Maxim Levitsky


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v2 00/21] KVM: x86: Event/exception fixes and cleanups
  2022-07-06 17:52         ` Sean Christopherson
@ 2022-07-06 20:03           ` Maxim Levitsky
  2022-07-06 20:11           ` Jim Mattson
  1 sibling, 0 replies; 78+ messages in thread
From: Maxim Levitsky @ 2022-07-06 20:03 UTC (permalink / raw)
  To: Sean Christopherson, Jim Mattson
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Joerg Roedel, kvm,
	linux-kernel, Oliver Upton, Peter Shier

On Wed, 2022-07-06 at 17:52 +0000, Sean Christopherson wrote:
> On Wed, Jul 06, 2022, Jim Mattson wrote:
> > On Wed, Jul 6, 2022 at 4:55 AM Maxim Levitsky <mlevitsk@redhat.com> wrote:
> > 
> > > 1. Since #SMI is higher priority than the #MTF, that means that unless dual monitor treatment is used,
> > >    and the dual monitor handler figures out that #MTF was pending and re-injects it when it
> > >    VMRESUME's the 'host', the MTF gets lost, and there is no way for a normal hypervisor to
> > >    do anything about it.
> > > 
> > >    Or maybe pending MTF is saved to SMRAM somewhere.
> > > 
> > >    In case you are going to say that I am inventing this again, I am saying up front that the above
> > >    is just a guess.
> > 
> > This is covered in the SDM, volume 3, section 31.14.1: "Default
> > Treatment of SMI Delivery:"
> > 
> > The pseudocode above makes reference to the saving of VMX-critical
> > state. This state consists of the following: (1) SS.DPL (the current
> > privilege level); (2) RFLAGS.VM; (3) the state of blocking by STI and
> > by MOV SS (see Table 24-3 in Section 24.4.2); (4) the state of
> > virtual-NMI blocking (only if the processor is in VMX non-root
> > operation and the “virtual NMIs” VM-execution control is 1); and (5)
> > an indication of whether an MTF VM exit is pending (see Section
> > 25.5.2). These data may be saved internal to the processor or in the
> > VMCS region of the current VMCS. Processors that do not support SMI
> > recognition while there is blocking by STI or by MOV SS need not save
> > the state of such blocking.
> > 
> > Saving VMX-critical state to SMRAM is not documented as an option.
> 
> Hmm, I'm not entirely convinced that Intel doesn't interpret "internal to the
> processor" as "undocumented SMRAM fields".  But I could also be misremembering
> the SMI flows.
> 
> Regardless, I do like the idea of using vmcs12 instead of SMRAM.  That would provide
> some extra motivation for moving away from KVM's broken pseudo VM-Exit implementation.
> 

For preserving a pending MTF, I guess it makes sense to use vmcs12, especially since we
own its format.

Best regards,
	Maxim Levitsky


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v2 00/21] KVM: x86: Event/exception fixes and cleanups
  2022-07-06 17:52         ` Sean Christopherson
  2022-07-06 20:03           ` Maxim Levitsky
@ 2022-07-06 20:11           ` Jim Mattson
  2022-07-10 15:58             ` Maxim Levitsky
  1 sibling, 1 reply; 78+ messages in thread
From: Jim Mattson @ 2022-07-06 20:11 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Maxim Levitsky, Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li,
	Joerg Roedel, kvm, linux-kernel, Oliver Upton, Peter Shier

On Wed, Jul 6, 2022 at 10:52 AM Sean Christopherson <seanjc@google.com> wrote:

> Hmm, I'm not entirely convinced that Intel doesn't interpret "internal to the
> processor" as "undocumented SMRAM fields".  But I could also be misremembering
> the SMI flows.

Start using reserved SMRAM, and you will regret it when the vendor
assigns some new bit of state to the same location.

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v2 03/21] KVM: x86: Don't check for code breakpoints when emulating on exception
  2022-06-14 20:47 ` [PATCH v2 03/21] KVM: x86: Don't check for code breakpoints when emulating on exception Sean Christopherson
  2022-07-06 11:43   ` Maxim Levitsky
@ 2022-07-06 22:17   ` Jim Mattson
  1 sibling, 0 replies; 78+ messages in thread
From: Jim Mattson @ 2022-07-06 22:17 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Joerg Roedel, kvm,
	linux-kernel, Oliver Upton, Peter Shier

On Tue, Jun 14, 2022 at 1:47 PM Sean Christopherson <seanjc@google.com> wrote:
>
> Don't check for code breakpoints during instruction emulation if the
> emulation was triggered by exception interception.  Code breakpoints are
> the highest priority fault-like exception, and KVM only emulates on
> exceptions that are fault-like.  Thus, if hardware signaled a different
> exception, then the vCPU is already past the stage of checking for
> hardware breakpoints.
>
> This is likely a glorified nop in terms of functionality, and is more for
> clarification, though it is technically an optimization.  Intel's SDM explicitly
> states vmcs.GUEST_RFLAGS.RF on exception interception is the same as the
> value that would have been saved on the stack had the exception not been
> intercepted, i.e. will be '1' due to all fault-like exceptions setting RF
> to '1'.  AMD says "guest state saved ... is the processor state as of the
> moment the intercept triggers", but that begs the question, "when does
> the intercept trigger?".

IIRC, AMD does not prematurely clobber EFLAGS.RF on an intercepted exception.

This is actually a big deal with shadow paging. On Intel, the
hypervisor can't fully squash a #PF and restart the guest instruction
after filling in the shadow page table entry...not easily, anyway.

(OTOH, AMD does prematurely clobber DR6 and DR7 on an intercepted #DB.
So, no one should be celebrating!)

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v2 05/21] KVM: nVMX: Prioritize TSS T-flag #DBs over Monitor Trap Flag
  2022-06-14 20:47 ` [PATCH v2 05/21] KVM: nVMX: Prioritize TSS T-flag #DBs over Monitor Trap Flag Sean Christopherson
  2022-07-06 11:57   ` Maxim Levitsky
@ 2022-07-06 23:51   ` Jim Mattson
  2022-07-07 17:14     ` Sean Christopherson
  1 sibling, 1 reply; 78+ messages in thread
From: Jim Mattson @ 2022-07-06 23:51 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Joerg Roedel, kvm,
	linux-kernel, Oliver Upton, Peter Shier

On Tue, Jun 14, 2022 at 1:47 PM Sean Christopherson <seanjc@google.com> wrote:
>
> Service TSS T-flag #DBs prior to pending MTFs, as such #DBs are higher
> priority than MTF.  KVM itself doesn't emulate TSS #DBs, and any such

Is there a KVM erratum for that?

> exceptions injected from L1 will be handled by hardware (or morphed to
> a fault-like exception if injection fails), but theoretically userspace
> could pend a TSS T-flag #DB in conjunction with a pending MTF.
>
> Note, there's no known use case this fixes, it's purely to be technically
> correct with respect to Intel's SDM.

A test would be nice. :-)
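
For what it's worth, pending a TSS T-flag #DB from userspace would look
something like this with the selftest helpers used elsewhere in this series
(a sketch; DR6_BT is the T-flag payload bit):

        struct kvm_vcpu_events events;

        vcpu_events_get(vcpu, &events);
        events.flags = KVM_VCPUEVENT_VALID_PAYLOAD;
        events.exception.pending = true;
        events.exception.nr = DB_VECTOR;
        events.exception_has_payload = true;
        events.exception_payload = DR6_BT;      /* TSS T-flag #DB */
        vcpu_events_set(vcpu, &events);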

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v2 06/21] KVM: x86: Treat #DBs from the emulator as fault-like (code and DR7.GD=1)
  2022-06-14 20:47 ` [PATCH v2 06/21] KVM: x86: Treat #DBs from the emulator as fault-like (code and DR7.GD=1) Sean Christopherson
  2022-07-06 11:57   ` Maxim Levitsky
@ 2022-07-06 23:55   ` Jim Mattson
  2022-07-07 17:19     ` Sean Christopherson
  1 sibling, 1 reply; 78+ messages in thread
From: Jim Mattson @ 2022-07-06 23:55 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Joerg Roedel, kvm,
	linux-kernel, Oliver Upton, Peter Shier

On Tue, Jun 14, 2022 at 1:47 PM Sean Christopherson <seanjc@google.com> wrote:
>
> Add a dedicated "exception type" for #DBs, as #DBs can be fault-like or
> trap-like depending on the sub-type of #DB, and effectively defer the
> decision of what to do with the #DB to the caller.
>
> For the emulator's two calls to exception_type(), treat the #DB as
> fault-like, as the emulator handles only code breakpoint and general
> detect #DBs, both of which are fault-like.

Does this mean that data and I/O breakpoint traps are just dropped?
Are there KVM errata for those misbehaviors?
What about single-stepping? Is that handled outside the emulator?

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v2 17/21] KVM: x86: Morph pending exceptions to pending VM-Exits at queue time
  2022-07-06 12:15   ` Maxim Levitsky
@ 2022-07-07  1:24     ` Sean Christopherson
  2022-07-10 15:56       ` Maxim Levitsky
  0 siblings, 1 reply; 78+ messages in thread
From: Sean Christopherson @ 2022-07-07  1:24 UTC (permalink / raw)
  To: Maxim Levitsky
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel, Oliver Upton, Peter Shier

On Wed, Jul 06, 2022, Maxim Levitsky wrote:
> > @@ -1618,9 +1620,9 @@ struct kvm_x86_ops {
> >  
> >  struct kvm_x86_nested_ops {
> >         void (*leave_nested)(struct kvm_vcpu *vcpu);
> > +       bool (*is_exception_vmexit)(struct kvm_vcpu *vcpu, u8 vector,
> > +                                   u32 error_code);
> >         int (*check_events)(struct kvm_vcpu *vcpu);
> > -       bool (*handle_page_fault_workaround)(struct kvm_vcpu *vcpu,
> > -                                            struct x86_exception *fault);
> 
> I think that since this patch is already quite large, it would make sense
> to split the removal of the workaround/hack code into a patch after this one?

Hmm, at a glance it seems doable, but I'd prefer to keep it as a single patch, in
no small part because I don't want to risk creating a transient bug where KVM
blows up during bisection due to some weird interaction.  IMO, keeping the #PF hack
for a single patch would yield nonsensical code for that one patch.  It's a lot of
code, but logically the changes are very much a single thing.

> > @@ -3870,14 +3845,24 @@ static void nested_vmx_inject_exception_vmexit(struct kvm_vcpu *vcpu,
> >   * from the emulator (because such #DBs are fault-like and thus don't trigger
> >   * actions that fire on instruction retire).
> >   */
> > -static inline unsigned long vmx_get_pending_dbg_trap(struct kvm_vcpu *vcpu)
> > +static unsigned long vmx_get_pending_dbg_trap(struct kvm_queued_exception *ex)
> Any reason to remove the inline?

Mainly to avoid an unnecessarily long line, and because generally speaking local
static functions shouldn't be tagged inline.  The compiler is almost always smarter
than humans when it comes to inlining (or not).

> >  {
> > -       if (!vcpu->arch.exception.pending ||
> > -           vcpu->arch.exception.vector != DB_VECTOR)
> > +       if (!ex->pending || ex->vector != DB_VECTOR)
> >                 return 0;
> >  
> >         /* General Detect #DBs are always fault-like. */
> > -       return vcpu->arch.exception.payload & ~DR6_BD;
> > +       return ex->payload & ~DR6_BD;
> > +}

...

> This comment should also probably go into a separate patch, to reduce this patch's size.

I'll split it out.

> Other than that, this is a _very_ good idea to add to KVM, although
> maybe we should put it in the Documentation folder instead?
> (but I don't have a strong preference on this)

I definitely want a comment in KVM that's relatively close to the code.  I'm not
opposed to also adding something in Documentation, but I'd want that to be an "and"
not an "or".

> >  static int vmx_check_nested_events(struct kvm_vcpu *vcpu)
> >  {
> >         struct kvm_lapic *apic = vcpu->arch.apic;
> >         struct vcpu_vmx *vmx = to_vmx(vcpu);
> > -       unsigned long exit_qual;
> >         /*
> >          * Only a pending nested run blocks a pending exception.  If there is a
> >          * previously injected event, the pending exception occurred while said
> > @@ -3943,19 +4011,20 @@ static int vmx_check_nested_events(struct kvm_vcpu *vcpu)
> >                 /* Fallthrough, the SIPI is completely ignored. */
> >         }
> >  
> > -       /*
> > -        * Process exceptions that are higher priority than Monitor Trap Flag:
> > -        * fault-like exceptions, TSS T flag #DB (not emulated by KVM, but
> > -        * could theoretically come in from userspace), and ICEBP (INT1).
> > -        */
> > +       if (vcpu->arch.exception_vmexit.pending &&
> > +           !vmx_is_low_priority_db_trap(&vcpu->arch.exception_vmexit)) {
> > +               if (block_nested_exceptions)
> > +                       return -EBUSY;
> > +
> > +               nested_vmx_inject_exception_vmexit(vcpu);
> > +               return 0;
> > +       }
> 
> > +
> >         if (vcpu->arch.exception.pending &&
> > -           !(vmx_get_pending_dbg_trap(vcpu) & ~DR6_BT)) {
> > +           !vmx_is_low_priority_db_trap(&vcpu->arch.exception)) {
> Small nitpick: the vmx_is_low_priority_db_trap() refactoring could be done in a separate patch.

Ya, will do.  No idea why I didn't do that.

> + Maybe it would be nice to add a WARN_ON_ONCE check here that this exception
> is not intercepted by the guest
>
> >                 if (block_nested_exceptions)
> >                         return -EBUSY;
> > -               if (!nested_vmx_check_exception(vcpu, &exit_qual))
> > -                       goto no_vmexit;
> > -               nested_vmx_inject_exception_vmexit(vcpu, exit_qual);
> > -               return 0;
> > +               goto no_vmexit;
> >         }
> >  
> >         if (vmx->nested.mtf_pending) {
> > @@ -3966,13 +4035,18 @@ static int vmx_check_nested_events(struct kvm_vcpu *vcpu)
> >                 return 0;
> >         }
> >  
> > +       if (vcpu->arch.exception_vmexit.pending) {
> > +               if (block_nested_exceptions)
> > +                       return -EBUSY;
> 
> And here add a WARN_ON_ONCE check that it is intercepted.

I like the idea of a sanity check, but I really don't want to splatter them in both
VMX and SVM, and it's a bit kludgy to implement the checks in common code.  I'll
play with it and see if I can figure out a decent solution.
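
A minimal form of that sanity check at the injection site, reusing the
is_exception_vmexit() hook added by this series (sketch):

        struct kvm_queued_exception *ex = &vcpu->arch.exception_vmexit;

        /* A queued exception VM-Exit must actually be intercepted by L1. */
        WARN_ON_ONCE(!kvm_x86_ops.nested_ops->is_exception_vmexit(vcpu, ex->vector,
                                                                  ex->error_code));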

> > @@ -618,18 +633,31 @@ static void kvm_multiple_exception(struct kvm_vcpu *vcpu,
> >  
> >         kvm_make_request(KVM_REQ_EVENT, vcpu);
> >  
> > +       /*
> > +        * If the exception is destined for L2 and isn't being reinjected,
> > +        * morph it to a VM-Exit if L1 wants to intercept the exception.  A
> > +        * previously injected exception is not checked because it was checked
> > +        * when it was originally queued, and re-checking is incorrect if _L1_
> > +        * injected the exception, in which case it's exempt from interception.
> > +        */
> > +       if (!reinject && is_guest_mode(vcpu) &&
> > +           kvm_x86_ops.nested_ops->is_exception_vmexit(vcpu, nr, error_code)) {
> > +               kvm_queue_exception_vmexit(vcpu, nr, has_error, error_code,
> > +                                          has_payload, payload);
> > +               return;
> > +       }
> > +
> >         if (!vcpu->arch.exception.pending && !vcpu->arch.exception.injected) {
> >         queue:
> >                 if (reinject) {
> >                         /*
> > -                        * On vmentry, vcpu->arch.exception.pending is only
> > -                        * true if an event injection was blocked by
> > -                        * nested_run_pending.  In that case, however,
> > -                        * vcpu_enter_guest requests an immediate exit,
> > -                        * and the guest shouldn't proceed far enough to
> > -                        * need reinjection.
> > +                        * On VM-Entry, an exception can be pending if and only
> > +                        * if event injection was blocked by nested_run_pending.
> > +                        * In that case, however, vcpu_enter_guest() requests an
> > +                        * immediate exit, and the guest shouldn't proceed far
> > +                        * enough to need reinjection.
> 
> Now that I had read the Jim's document on event priorities, I think we can
> update the comment:
>
> On VMX we set an expired preemption timer, and on SVM we do a self-IPI, thus pending a real interrupt.
> Both events should have higher priority than processing the injected event.

No, they don't.  Injected events (exceptions, IRQs, NMIs, etc...) "occur" as part
of the VMRUN/VMLAUNCH/VMRESUME, i.e. are vectored before an interrupt window is
opened at the next instruction boundary.  E.g. if the hypervisor intercepts an
exception and then reflects it back into the guest, any pending event must not be
recognized until after the injected exception is delivered, otherwise the event
would, from the guest's perspective, arrive in the middle of an instruction.

This is calling out something slightly different.  What it's saying is that if
there was a pending exception, then KVM should _not_ have injected said pending
exception and instead should have requested an immediate exit.  That "immediate
exit" should have forced a VM-Exit before the CPU could fetch a new instruction,
and thus before the guest could trigger an exception that would require reinjection.

The "immediate exit" trick works because all events with higher priority than the
VMX preemption timer (or IRQ) are guaranteed to exit, e.g. a hardware SMI can't
cause a fault in the guest.
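
To make that ordering concrete, here's a minimal, self-contained C model of
the flow described above.  All names are invented for illustration; this is
not the actual KVM code, just the decision logic under the stated assumptions.

#include <stdbool.h>
#include <stdio.h>

/* Toy vCPU state; illustrative only, not KVM's real structures. */
struct toy_vcpu {
	bool exception_pending;   /* exception queued but not yet vectored */
	bool nested_run_pending;  /* VMRUN/VMLAUNCH/VMRESUME not yet completed */
};

static void toy_enter_guest(struct toy_vcpu *v)
{
	if (v->exception_pending && v->nested_run_pending) {
		/*
		 * Injection is blocked, so request an "immediate exit"
		 * (expired VMX preemption timer / SVM self-IPI).  The CPU
		 * VM-Exits before fetching a new guest instruction, so the
		 * guest can't fault and force reinjection.
		 */
		printf("keep exception pending, request immediate exit\n");
	} else if (v->exception_pending) {
		/*
		 * Vectored as part of VM-Entry, i.e. before any interrupt
		 * window opens at the next instruction boundary.
		 */
		printf("inject exception\n");
	}
}

int main(void)
{
	struct toy_vcpu v = { .exception_pending = true,
			      .nested_run_pending = true };

	toy_enter_guest(&v);	/* prints the "immediate exit" case */
	return 0;
}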

Though there might be an edge case with vmcs12.GUEST_PENDING_DBG_EXCEPTIONS that
could result in a #DB => #PF interception + reinjection when using shadow paging.
Maybe.

> (This is something I didn't find in the Intel/AMD docs, so I might be wrong here)
> thus the CPU will not attempt to process the injected event 
> (via EVENTINJ on SVM, or via VM_ENTRY_INTR_INFO_FIELD) and instead just copy
> them straight back to exit_int_info/IDT_VECTORING_INFO_FIELD.
> 
> So in this case the event will actually be re-injected, but no new exception can
> be generated since we will re-execute the VMRUN/VMRESUME instruction.
> 
> 
> >                          */
> > -                       WARN_ON_ONCE(vcpu->arch.exception.pending);
> > +                       WARN_ON_ONCE(kvm_is_exception_pending(vcpu));
> >                         vcpu->arch.exception.injected = true;
> >                         if (WARN_ON_ONCE(has_payload)) {
> >                                 /*
> > @@ -732,20 +760,22 @@ static int complete_emulated_insn_gp(struct kvm_vcpu *vcpu, int err)
> >  void kvm_inject_page_fault(struct kvm_vcpu *vcpu, struct x86_exception *fault)
> >  {
> >         ++vcpu->stat.pf_guest;
> > -       vcpu->arch.exception.nested_apf =
> > -               is_guest_mode(vcpu) && fault->async_page_fault;
> > -       if (vcpu->arch.exception.nested_apf) {
> > -               vcpu->arch.apf.nested_apf_token = fault->address;
> > -               kvm_queue_exception_e(vcpu, PF_VECTOR, fault->error_code);
> > -       } else {
> > +
> > +       /*
> > +        * Async #PF in L2 is always forwarded to L1 as a VM-Exit regardless of
> > +        * whether or not L1 wants to intercept "regular" #PF.
> 
> We might want to also mention that the L1 has to opt-in to this
> (vcpu->arch.apf.delivery_as_pf_vmexit), but the fact that we are
> here, means that it did opt-in
> 
> (otherwise kvm_can_deliver_async_pf won't return true).
> 
> A WARN_ON_ONCE(!vcpu->arch.apf.delivery_as_pf_vmexit) would be
> nice to also check this in the runtime.

Eh, I'm not convinced this would be a worthwhile WARN, the logic is fully contained
in kvm_can_deliver_async_pf() and I don't see that changing, i.e. odds of breaking
this are very, very low.  At some point we just have to trust that we don't suck
that much :-)

> Also note that AFAIK, qemu doesn't opt in to this feature, sadly,
> thus this code is not tested (unless there is some unit test).
> 
> 
> > +        */
> > +       if (is_guest_mode(vcpu) && fault->async_page_fault)
> > +               kvm_queue_exception_vmexit(vcpu, PF_VECTOR,
> > +                                          true, fault->error_code,
> > +                                          true, fault->address);
> > +       else
> >                 kvm_queue_exception_e_p(vcpu, PF_VECTOR, fault->error_code,
> >                                         fault->address);
> > -       }
> >  }
> >  EXPORT_SYMBOL_GPL(kvm_inject_page_fault);
> >  
> > -/* Returns true if the page fault was immediately morphed into a VM-Exit. */
> > -bool kvm_inject_emulated_page_fault(struct kvm_vcpu *vcpu,
> > +void kvm_inject_emulated_page_fault(struct kvm_vcpu *vcpu,
> >                                     struct x86_exception *fault)
> >  {
> >         struct kvm_mmu *fault_mmu;
> > @@ -763,26 +793,7 @@ bool kvm_inject_emulated_page_fault(struct kvm_vcpu *vcpu,
> >                 kvm_mmu_invalidate_gva(vcpu, fault_mmu, fault->address,
> >                                        fault_mmu->root.hpa);
> >  
> > -       /*
> > -        * A workaround for KVM's bad exception handling.  If KVM injected an
> > -        * exception into L2, and L2 encountered a #PF while vectoring the
> > -        * injected exception, manually check to see if L1 wants to intercept
> > -        * #PF, otherwise queuing the #PF will lead to #DF or a lost exception.
> > -        * In all other cases, defer the check to nested_ops->check_events(),
> > -        * which will correctly handle priority (this does not).  Note, other
> > -        * exceptions, e.g. #GP, are theoretically affected, #PF is simply the
> > -        * most problematic, e.g. when L0 and L1 are both intercepting #PF for
> > -        * shadow paging.
> > -        *
> > -        * TODO: Rewrite exception handling to track injected and pending
> > -        *       (VM-Exit) exceptions separately.
> > -        */
> > -       if (unlikely(vcpu->arch.exception.injected && is_guest_mode(vcpu)) &&
> > -           kvm_x86_ops.nested_ops->handle_page_fault_workaround(vcpu, fault))
> > -               return true;
> > -
> >         fault_mmu->inject_page_fault(vcpu, fault);
> > -       return false;
> >  }
> >  EXPORT_SYMBOL_GPL(kvm_inject_emulated_page_fault);
> >  
> > @@ -4752,7 +4763,7 @@ static int kvm_vcpu_ready_for_interrupt_injection(struct kvm_vcpu *vcpu)
> >         return (kvm_arch_interrupt_allowed(vcpu) &&
> >                 kvm_cpu_accept_dm_intr(vcpu) &&
> >                 !kvm_event_needs_reinjection(vcpu) &&
> > -               !vcpu->arch.exception.pending);
> > +               !kvm_is_exception_pending(vcpu));
> >  }
> >  
> >  static int kvm_vcpu_ioctl_interrupt(struct kvm_vcpu *vcpu,
> > @@ -4881,13 +4892,27 @@ static int kvm_vcpu_ioctl_x86_set_mce(struct kvm_vcpu *vcpu,
> >  static void kvm_vcpu_ioctl_x86_get_vcpu_events(struct kvm_vcpu *vcpu,
> >                                                struct kvm_vcpu_events *events)
> >  {
> > -       struct kvm_queued_exception *ex = &vcpu->arch.exception;
> > +       struct kvm_queued_exception *ex;
> >  
> >         process_nmi(vcpu);
> >  
> >         if (kvm_check_request(KVM_REQ_SMI, vcpu))
> >                 process_smi(vcpu);
> >  
> > +       /*
> > +        * KVM's ABI only allows for one exception to be migrated.  Luckily,
> > +        * the only time there can be two queued exceptions is if there's a
> > +        * non-exiting _injected_ exception, and a pending exiting exception.
> > +        * In that case, ignore the VM-Exiting exception as it's an extension
> > +        * of the injected exception.
> > +        */
> 
> I think that we will lose the injected exception, and thus after the migration will
> only deliver the VM-exiting exception, but without the correct IDT_VECTORING_INFO_FIELD/exit_int_info.
> 
> It's not that big deal and can be fixed by extending this API, with a new cap,
> as I did in my patches. This can be done later, but the above comment
> which tries to justify it, should be updated to mention that it is wrong.

No?  The below will migrate the pending VM-Exit if and only if there's no pending
or injected exception.  The pending VM-Exit is dropped, and in theory could be
lost if something fixes the underlying exception (that results in VM-Exit) before
the guest is resumed, but that's ok.  If the exception was somehow fixed then the
exception was inherently non-deterministic anyways, i.e. the guest can't have
guaranteed that it would occur.

Or did I misunderstand?

> > +       if (vcpu->arch.exception_vmexit.pending &&
> > +           !vcpu->arch.exception.pending &&
> > +           !vcpu->arch.exception.injected)
> > +               ex = &vcpu->arch.exception_vmexit;
> > +       else
> > +               ex = &vcpu->arch.exception;

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v2 05/21] KVM: nVMX: Prioritize TSS T-flag #DBs over Monitor Trap Flag
  2022-07-06 23:51   ` Jim Mattson
@ 2022-07-07 17:14     ` Sean Christopherson
  0 siblings, 0 replies; 78+ messages in thread
From: Sean Christopherson @ 2022-07-07 17:14 UTC (permalink / raw)
  To: Jim Mattson
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Joerg Roedel, kvm,
	linux-kernel, Oliver Upton, Peter Shier

On Wed, Jul 06, 2022, Jim Mattson wrote:
> On Tue, Jun 14, 2022 at 1:47 PM Sean Christopherson <seanjc@google.com> wrote:
> >
> > Service TSS T-flag #DBs prior to pending MTFs, as such #DBs are higher
> > priority than MTF.  KVM itself doesn't emulate TSS #DBs, and any such
> 
> Is there a KVM erratum for that?

Nope, just this hilarious TODO:

	/*
	 * TODO: What about debug traps on tss switch?
	 *       Are we supposed to inject them and update dr6?
	 */

> > exceptions injected from L1 will be handled by hardware (or morphed to
> > a fault-like exception if injection fails), but theoretically userspace
> > could pend a TSS T-flag #DB in conjunction with a pending MTF.
> >
> > Note, there's no known use case this fixes, it's purely to be technically
> > correct with respect to Intel's SDM.
> 
> A test would be nice. :-)

LOL, yeah, but ensuring userspace-injected TSS T-bit #DBs work isn't exactly on
my list of top 100 things to look at.

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v2 06/21] KVM: x86: Treat #DBs from the emulator as fault-like (code and DR7.GD=1)
  2022-07-06 23:55   ` Jim Mattson
@ 2022-07-07 17:19     ` Sean Christopherson
  0 siblings, 0 replies; 78+ messages in thread
From: Sean Christopherson @ 2022-07-07 17:19 UTC (permalink / raw)
  To: Jim Mattson
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Joerg Roedel, kvm,
	linux-kernel, Oliver Upton, Peter Shier

On Wed, Jul 06, 2022, Jim Mattson wrote:
> On Tue, Jun 14, 2022 at 1:47 PM Sean Christopherson <seanjc@google.com> wrote:
> >
> > Add a dedicated "exception type" for #DBs, as #DBs can be fault-like or
> > trap-like depending on the sub-type of #DB, and effectively defer the
> > decision of what to do with the #DB to the caller.
> >
> > For the emulator's two calls to exception_type(), treat the #DB as
> > fault-like, as the emulator handles only code breakpoint and general
> > detect #DBs, both of which are fault-like.
> 
> Does this mean that data and I/O breakpoint traps are just dropped?

Yep.

> Are there KVM errata for those misbehaviors?

Nope.

> What about single-stepping? Is that handled outwith the emulator?

Single-step is emulated, and AFAIK there are no _known_ bugs.

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v2 17/21] KVM: x86: Morph pending exceptions to pending VM-Exits at queue time
  2022-07-07  1:24     ` Sean Christopherson
@ 2022-07-10 15:56       ` Maxim Levitsky
  2022-07-11 15:22         ` Sean Christopherson
  0 siblings, 1 reply; 78+ messages in thread
From: Maxim Levitsky @ 2022-07-10 15:56 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel, Oliver Upton, Peter Shier

On Thu, 2022-07-07 at 01:24 +0000, Sean Christopherson wrote:
> On Wed, Jul 06, 2022, Maxim Levitsky wrote:
> > > @@ -1618,9 +1620,9 @@ struct kvm_x86_ops {
> > >  
> > >  struct kvm_x86_nested_ops {
> > >         void (*leave_nested)(struct kvm_vcpu *vcpu);
> > > +       bool (*is_exception_vmexit)(struct kvm_vcpu *vcpu, u8 vector,
> > > +                                   u32 error_code);
> > >         int (*check_events)(struct kvm_vcpu *vcpu);
> > > -       bool (*handle_page_fault_workaround)(struct kvm_vcpu *vcpu,
> > > -                                            struct x86_exception *fault);
> > 
> > I think that since this patch is already quite large, it would make sense
> > to split the removal of the workaround/hack code into a patch after this one?
> 
> Hmm, at a glance it seems doable, but I'd prefer to keep it as a single patch, in
> no small part because I don't want to risk creating a transient bug where KVM
> blows up during bisection due to some weird interaction.  IMO, keeping the #PF hack
> for a single patch would yield nonsensical code for that one patch.  It's a lot of
> code, but logically the changes are very much a single thing.

Let it be then. I don't have a strong preference on this matter.

> 
> > > @@ -3870,14 +3845,24 @@ static void nested_vmx_inject_exception_vmexit(struct kvm_vcpu *vcpu,
> > >   * from the emulator (because such #DBs are fault-like and thus don't trigger
> > >   * actions that fire on instruction retire).
> > >   */
> > > -static inline unsigned long vmx_get_pending_dbg_trap(struct kvm_vcpu *vcpu)
> > > +static unsigned long vmx_get_pending_dbg_trap(struct kvm_queued_exception *ex)
> > Any reason to remove the inline?
> 
> Mainly to avoid an unnecessarily long line, and because generally speaking local
> static functions shouldn't be tagged inline.  The compiler is almost always smarter
> than humans when it comes to inlining (or not).

Makes sense.

> 
> > >  {
> > > -       if (!vcpu->arch.exception.pending ||
> > > -           vcpu->arch.exception.vector != DB_VECTOR)
> > > +       if (!ex->pending || ex->vector != DB_VECTOR)
> > >                 return 0;
> > >  
> > >         /* General Detect #DBs are always fault-like. */
> > > -       return vcpu->arch.exception.payload & ~DR6_BD;
> > > +       return ex->payload & ~DR6_BD;
> > > +}
> 
> ...
> 
> > This comment also should probably go into a separate patch to reduce this patch's size.
> 
> I'll split it out.
> 
> > Other than that, this is a _very_ good idea to add it to KVM, although
> > maybe we should put it in the Documentation folder instead?
> > (but I don't have a strong preference on this)
> 
> I definitely want a comment in KVM that's relatively close to the code.  I'm not
> opposed to also adding something in Documentation, but I'd want that to be an "and"
> not an "or".

Also makes sense. 

I do think that it is worthwhile to also add a comment about the way KVM
handles exceptions, namely that inject_pending_event is not always called on an
instruction boundary.  When we have a pending/injected exception, we first have
to get rid of it, and only then will we be on an instruction boundary.

And to be sure that we inject pending interrupts on the closest instruction
boundary, we actually open an interrupt/SMI/NMI window there.
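
As a rough sketch of that ordering (names are made up and this is a
simplification of the real inject_pending_event() flow, not a copy of it),
the function effectively does:

/* Standalone illustrative model; compiles as-is as a C translation unit. */
enum toy_action {
	DO_NOTHING,
	REINJECT,		/* finish delivering a previously injected event */
	INJECT_EXCEPTION,	/* deliver a newly pending exception */
	INJECT_IRQ,		/* deliver a pending interrupt */
	OPEN_IRQ_WINDOW,	/* can't deliver yet, ask for a window exit */
};

struct toy_events {
	int injected;		/* event delivery was interrupted by a VM-Exit */
	int exception_pending;	/* new exception waiting to be vectored */
	int irq_pending;	/* interrupt waiting for an open window */
	int irq_window_open;	/* guest can take an interrupt right now */
};

enum toy_action toy_inject_pending_event(const struct toy_events *e)
{
	if (e->injected)
		return REINJECT;		/* must finish delivery first */
	if (e->exception_pending)
		return INJECT_EXCEPTION;	/* exceptions beat IRQs */
	if (e->irq_pending)
		return e->irq_window_open ? INJECT_IRQ : OPEN_IRQ_WINDOW;
	return DO_NOTHING;
}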

> 
> > >  static int vmx_check_nested_events(struct kvm_vcpu *vcpu)
> > >  {
> > >         struct kvm_lapic *apic = vcpu->arch.apic;
> > >         struct vcpu_vmx *vmx = to_vmx(vcpu);
> > > -       unsigned long exit_qual;
> > >         /*
> > >          * Only a pending nested run blocks a pending exception.  If there is a
> > >          * previously injected event, the pending exception occurred while said
> > > @@ -3943,19 +4011,20 @@ static int vmx_check_nested_events(struct kvm_vcpu *vcpu)
> > >                 /* Fallthrough, the SIPI is completely ignored. */
> > >         }
> > >  
> > > -       /*
> > > -        * Process exceptions that are higher priority than Monitor Trap Flag:
> > > -        * fault-like exceptions, TSS T flag #DB (not emulated by KVM, but
> > > -        * could theoretically come in from userspace), and ICEBP (INT1).
> > > -        */
> > > +       if (vcpu->arch.exception_vmexit.pending &&
> > > +           !vmx_is_low_priority_db_trap(&vcpu->arch.exception_vmexit)) {
> > > +               if (block_nested_exceptions)
> > > +                       return -EBUSY;
> > > +
> > > +               nested_vmx_inject_exception_vmexit(vcpu);
> > > +               return 0;
> > > +       }
> > > +
> > >         if (vcpu->arch.exception.pending &&
> > > -           !(vmx_get_pending_dbg_trap(vcpu) & ~DR6_BT)) {
> > > +           !vmx_is_low_priority_db_trap(&vcpu->arch.exception)) {
> > Small nitpick: vmx_is_low_priority_db_trap refactoring could be done in a separate patch
> 
> Ya, will do.  No idea why I didn't do that.
> 
> > + Maybe it would be nice to add a WARN_ON_ONCE check here that this exception
> > is not intercepted by the guest
> > 
> > >                 if (block_nested_exceptions)
> > >                         return -EBUSY;
> > > -               if (!nested_vmx_check_exception(vcpu, &exit_qual))
> > > -                       goto no_vmexit;
> > > -               nested_vmx_inject_exception_vmexit(vcpu, exit_qual);
> > > -               return 0;
> > > +               goto no_vmexit;
> > >         }
> > >  
> > >         if (vmx->nested.mtf_pending) {
> > > @@ -3966,13 +4035,18 @@ static int vmx_check_nested_events(struct kvm_vcpu *vcpu)
> > >                 return 0;
> > >         }
> > >  
> > > +       if (vcpu->arch.exception_vmexit.pending) {
> > > +               if (block_nested_exceptions)
> > > +                       return -EBUSY;
> > 
> > And here add a WARN_ON_ONCE check that it is intercepted.
> 
> I like the idea of a sanity check, but I really don't want to splatter them in both
> VMX and SVM, and it's a bit kludgy to implement the checks in common code.  I'll
> play with it and see if I can figure out a decent solution.

Makes sense, that is what I am thinking as well; I also don't have a strong
preference in this case.

> 
> > > @@ -618,18 +633,31 @@ static void kvm_multiple_exception(struct kvm_vcpu *vcpu,
> > >  
> > >         kvm_make_request(KVM_REQ_EVENT, vcpu);
> > >  
> > > +       /*
> > > +        * If the exception is destined for L2 and isn't being reinjected,
> > > +        * morph it to a VM-Exit if L1 wants to intercept the exception.  A
> > > +        * previously injected exception is not checked because it was checked
> > > +        * when it was originally queued, and re-checking is incorrect if _L1_
> > > +        * injected the exception, in which case it's exempt from interception.
> > > +        */
> > > +       if (!reinject && is_guest_mode(vcpu) &&
> > > +           kvm_x86_ops.nested_ops->is_exception_vmexit(vcpu, nr, error_code)) {
> > > +               kvm_queue_exception_vmexit(vcpu, nr, has_error, error_code,
> > > +                                          has_payload, payload);
> > > +               return;
> > > +       }
> > > +
> > >         if (!vcpu->arch.exception.pending && !vcpu->arch.exception.injected) {
> > >         queue:
> > >                 if (reinject) {
> > >                         /*
> > > -                        * On vmentry, vcpu->arch.exception.pending is only
> > > -                        * true if an event injection was blocked by
> > > -                        * nested_run_pending.  In that case, however,
> > > -                        * vcpu_enter_guest requests an immediate exit,
> > > -                        * and the guest shouldn't proceed far enough to
> > > -                        * need reinjection.
> > > +                        * On VM-Entry, an exception can be pending if and only
> > > +                        * if event injection was blocked by nested_run_pending.
> > > +                        * In that case, however, vcpu_enter_guest() requests an
> > > +                        * immediate exit, and the guest shouldn't proceed far
> > > +                        * enough to need reinjection.
> > 
> > Now that I had read the Jim's document on event priorities, I think we can
> > update the comment:
> > 
> > On VMX we set an expired preemption timer, and on SVM we do a self-IPI, thus pending a real interrupt.
> > Both events should have higher priority than processing the injected event.
> 
> No, they don't.  Injected events (exceptions, IRQs, NMIs, etc...) "occur" as part
> of the VMRUN/VMLAUNCH/VMRESUME, i.e. are vectored before an interrupt window is
> opened at the next instruction boundary.  E.g. if the hypervisor intercepts an
> exception and then reflects it back into the guest, any pending event must not be
> recognized until after the injected exception is delivered, otherwise the event
> would, from the guest's perspective, arrive in the middle of an instruction.

Yes I was afraid that I didn't understand that correctly.

> 
> This is calling out something slightly different.  What it's saying is that if
> there was a pending exception, then KVM should _not_ have injected said pending
> exception and instead should have requested an immediate exit.  That "immediate
> exit" should have forced a VM-Exit before the CPU could fetch a new instruction,
> and thus before the guest could trigger an exception that would require reinjection.
> 
> The "immediate exit" trick works because all events with higher priority than the
> VMX preemption timer (or IRQ) are guaranteed to exit, e.g. a hardware SMI can't
> cause a fault in the guest.

Yes, it all makes sense now.  It really helps thinking in terms of instruction boundaries.

However, that makes me think: Can that actually happen?

A pending exception can only be generated by KVM itself (exceptions from the
nested hypervisor and CPU-reflected exceptions/interrupts are all injected).

If VMRUN/VMRESUME has a pending exception, it means that the instruction itself
generated it, in which case we won't be entering the guest but rather jumping
to the exception handler, and thus nested run will not be pending.

We can though have pending NMI/SMI/interrupts.

Also just a note about injected exceptions/interrupts during VMRUN/VMRESUME.

If nested_run_pending is true, then the injected exception due to the same
reasoning cannot come from VMRUN/VMRESUME. It can come from the nested hypervisor's EVENTINJ,
but in this case we currently just copy it from vmcb12/vmcs12 to vmcb02/vmcs02,
without touching vcpu->arch.interrupt.

Luckily this doesn't cause issues because when the nested run is pending
we don't inject anything into the guest.

If nested_run_pending is false, however, the opposite is true.  The EVENTINJ
will already be delivered, and we can only have an injected exception/interrupt
that comes from the CPU itself via exit_int_info/IDT_VECTORING_INFO_FIELD, which
we will copy back as an injected interrupt/exception to 'vcpu->arch.exception/interrupt'
and later re-inject the next time we run the same VMRUN instruction.
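
Roughly, that copy-back can be modeled like this (all field and helper names
are invented for illustration; the real logic lives in the SVM/VMX exit and
entry paths):

/*
 * Toy model of event reinjection.  If a VM-Exit interrupts event delivery,
 * the CPU reports the in-flight event (exit_int_info on SVM,
 * IDT_VECTORING_INFO_FIELD on VMX); KVM saves it as "injected" and programs
 * it again (EVENTINJ / VM_ENTRY_INTR_INFO_FIELD) on the next entry.
 */
struct toy_event {
	int valid;
	int vector;
};

struct toy_vcpu_events {
	struct toy_event hw_exit_info;	/* what the CPU reported on exit */
	struct toy_event injected;	/* KVM's saved copy */
	struct toy_event hw_entry_info;	/* what KVM programs for entry */
};

void toy_on_vmexit(struct toy_vcpu_events *s)
{
	if (s->hw_exit_info.valid)
		s->injected = s->hw_exit_info;	/* delivery was interrupted */
}

void toy_on_vmentry(struct toy_vcpu_events *s)
{
	if (s->injected.valid) {
		s->hw_entry_info = s->injected;	/* finish delivery */
		s->injected.valid = 0;
	}
}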

> 
> Though there might be an edge case with vmcs12.GUEST_PENDING_DBG_EXCEPTIONS that
> could result in a #DB => #PF interception + reinjection when using shadow paging.
> Maybe.

x86 is a fractal :(

> 
> > (This is something I didn't find in the Intel/AMD docs, so I might be wrong here)
> > thus the CPU will not attempt to process the injected event 
> > (via EVENTINJ on SVM, or via VM_ENTRY_INTR_INFO_FIELD) and instead just copy
> > them straight back to exit_int_info/IDT_VECTORING_INFO_FIELD.
> > 
> > So in this case the event will actually be re-injected, but no new exception can
> > be generated since we will re-execute the VMRUN/VMRESUME instruction.
> > 
> > 
> > >                          */
> > > -                       WARN_ON_ONCE(vcpu->arch.exception.pending);
> > > +                       WARN_ON_ONCE(kvm_is_exception_pending(vcpu));
> > >                         vcpu->arch.exception.injected = true;
> > >                         if (WARN_ON_ONCE(has_payload)) {
> > >                                 /*
> > > @@ -732,20 +760,22 @@ static int complete_emulated_insn_gp(struct kvm_vcpu *vcpu, int err)
> > >  void kvm_inject_page_fault(struct kvm_vcpu *vcpu, struct x86_exception *fault)
> > >  {
> > >         ++vcpu->stat.pf_guest;
> > > -       vcpu->arch.exception.nested_apf =
> > > -               is_guest_mode(vcpu) && fault->async_page_fault;
> > > -       if (vcpu->arch.exception.nested_apf) {
> > > -               vcpu->arch.apf.nested_apf_token = fault->address;
> > > -               kvm_queue_exception_e(vcpu, PF_VECTOR, fault->error_code);
> > > -       } else {
> > > +
> > > +       /*
> > > +        * Async #PF in L2 is always forwarded to L1 as a VM-Exit regardless of
> > > +        * whether or not L1 wants to intercept "regular" #PF.
> > 
> > We might want to also mention that the L1 has to opt-in to this
> > (vcpu->arch.apf.delivery_as_pf_vmexit), but the fact that we are
> > here, means that it did opt-in
> > 
> > (otherwise kvm_can_deliver_async_pf won't return true).
> > 
> > A WARN_ON_ONCE(!vcpu->arch.apf.delivery_as_pf_vmexit) would be
> > nice to also check this in the runtime.
> 
> Eh, I'm not convinced this would be a worthwhile WARN, the logic is fully contained
> in kvm_can_deliver_async_pf() and I don't see that changing, i.e. odds of breaking
> this are very, very low.  At some point we just have to trust that we don't suck
> that much :-)

Agree, but my goal was more to add this warning as a form of documentation
and not to make it catch a bug.


> 
> > Also note that AFAIK, qemu doesn't opt in to this feature, sadly,
> > thus this code is not tested (unless there is some unit test).
> > 
> > 
> > > +        */
> > > +       if (is_guest_mode(vcpu) && fault->async_page_fault)
> > > +               kvm_queue_exception_vmexit(vcpu, PF_VECTOR,
> > > +                                          true, fault->error_code,
> > > +                                          true, fault->address);
> > > +       else
> > >                 kvm_queue_exception_e_p(vcpu, PF_VECTOR, fault->error_code,
> > >                                         fault->address);
> > > -       }
> > >  }
> > >  EXPORT_SYMBOL_GPL(kvm_inject_page_fault);
> > >  
> > > -/* Returns true if the page fault was immediately morphed into a VM-Exit. */
> > > -bool kvm_inject_emulated_page_fault(struct kvm_vcpu *vcpu,
> > > +void kvm_inject_emulated_page_fault(struct kvm_vcpu *vcpu,
> > >                                     struct x86_exception *fault)
> > >  {
> > >         struct kvm_mmu *fault_mmu;
> > > @@ -763,26 +793,7 @@ bool kvm_inject_emulated_page_fault(struct kvm_vcpu *vcpu,
> > >                 kvm_mmu_invalidate_gva(vcpu, fault_mmu, fault->address,
> > >                                        fault_mmu->root.hpa);
> > >  
> > > -       /*
> > > -        * A workaround for KVM's bad exception handling.  If KVM injected an
> > > -        * exception into L2, and L2 encountered a #PF while vectoring the
> > > -        * injected exception, manually check to see if L1 wants to intercept
> > > -        * #PF, otherwise queuing the #PF will lead to #DF or a lost exception.
> > > -        * In all other cases, defer the check to nested_ops->check_events(),
> > > -        * which will correctly handle priority (this does not).  Note, other
> > > -        * exceptions, e.g. #GP, are theoretically affected, #PF is simply the
> > > -        * most problematic, e.g. when L0 and L1 are both intercepting #PF for
> > > -        * shadow paging.
> > > -        *
> > > -        * TODO: Rewrite exception handling to track injected and pending
> > > -        *       (VM-Exit) exceptions separately.
> > > -        */
> > > -       if (unlikely(vcpu->arch.exception.injected && is_guest_mode(vcpu)) &&
> > > -           kvm_x86_ops.nested_ops->handle_page_fault_workaround(vcpu, fault))
> > > -               return true;
> > > -
> > >         fault_mmu->inject_page_fault(vcpu, fault);
> > > -       return false;
> > >  }
> > >  EXPORT_SYMBOL_GPL(kvm_inject_emulated_page_fault);
> > >  
> > > @@ -4752,7 +4763,7 @@ static int kvm_vcpu_ready_for_interrupt_injection(struct kvm_vcpu *vcpu)
> > >         return (kvm_arch_interrupt_allowed(vcpu) &&
> > >                 kvm_cpu_accept_dm_intr(vcpu) &&
> > >                 !kvm_event_needs_reinjection(vcpu) &&
> > > -               !vcpu->arch.exception.pending);
> > > +               !kvm_is_exception_pending(vcpu));
> > >  }
> > >  
> > >  static int kvm_vcpu_ioctl_interrupt(struct kvm_vcpu *vcpu,
> > > @@ -4881,13 +4892,27 @@ static int kvm_vcpu_ioctl_x86_set_mce(struct kvm_vcpu *vcpu,
> > >  static void kvm_vcpu_ioctl_x86_get_vcpu_events(struct kvm_vcpu *vcpu,
> > >                                                struct kvm_vcpu_events *events)
> > >  {
> > > -       struct kvm_queued_exception *ex = &vcpu->arch.exception;
> > > +       struct kvm_queued_exception *ex;
> > >  
> > >         process_nmi(vcpu);
> > >  
> > >         if (kvm_check_request(KVM_REQ_SMI, vcpu))
> > >                 process_smi(vcpu);
> > >  
> > > +       /*
> > > +        * KVM's ABI only allows for one exception to be migrated.  Luckily,
> > > +        * the only time there can be two queued exceptions is if there's a
> > > +        * non-exiting _injected_ exception, and a pending exiting exception.
> > > +        * In that case, ignore the VM-Exiting exception as it's an extension
> > > +        * of the injected exception.
> > > +        */
> > 
> > I think that we will lose the injected exception, and thus after the migration will
> > only deliver the VM-exiting exception, but without the correct IDT_VECTORING_INFO_FIELD/exit_int_info.
> > 
> > It's not that big deal and can be fixed by extending this API, with a new cap,
> > as I did in my patches. This can be done later, but the above comment
> > which tries to justify it, should be updated to mention that it is wrong.
> 
> No?  The below will migrate the pending VM-Exit if and only if there's no pending
> or injected exception.  

I missed that part.

BTW, another thing to note is that there can't be two pending exceptions,
because pending exceptions can only be generated by KVM, and as long as one is
not injected, no new pending exception can be generated.

Same for a pending interrupt + a pending exception: as long as we haven't
injected the interrupt, we won't get any new exception.



> The pending VM-Exit is dropped, and in theory could be
> lost if something fixes the underlying exception (that results in VM-Exit) before
> the guest is resumed, but that's ok.  If the exception was somehow fixed then the
> exception was inherently non-deterministic anyways, i.e. the guest can't have
> guaranteed that it would occur.
> 
> Or did I misunderstand?

No, it looks reasonable now.

So basically you are saying that we have an injected exception/interrupt, and
during the injection we got another exception (only an exception is possible).

If we inject the exception again, we should get the nested exception again,
unless it was non-deterministic (e.g. an #MC, or a page fault that was somehow
fixed during migration), and it indeed looks OK to drop the nested exception
in this case.

As usual there could be corner cases, similar to the recently fixed case of a
nested hypervisor injecting a software interrupt at a RIP that doesn't contain
an INTn instruction, which KVM got wrong before the fix by just re-executing
the instruction.  But all of this is a corner case of a corner case of a
corner case, so I fully approve it.

It is a corner case three times over because:

1. Nested exceptions are a corner case.
2. Migrating right when a nested exception happened is a corner case.
3. Somehow losing that vmexit would be a corner case as well.


Best regards,
	Maxim Levitsky

> 
> > > +       if (vcpu->arch.exception_vmexit.pending &&
> > > +           !vcpu->arch.exception.pending &&
> > > +           !vcpu->arch.exception.injected)
> > > +               ex = &vcpu->arch.exception_vmexit;
> > > +       else
> > > +               ex = &vcpu->arch.exception;



^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v2 00/21] KVM: x86: Event/exception fixes and cleanups
  2022-07-06 20:11           ` Jim Mattson
@ 2022-07-10 15:58             ` Maxim Levitsky
  0 siblings, 0 replies; 78+ messages in thread
From: Maxim Levitsky @ 2022-07-10 15:58 UTC (permalink / raw)
  To: Jim Mattson, Sean Christopherson
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Joerg Roedel, kvm,
	linux-kernel, Oliver Upton, Peter Shier

On Wed, 2022-07-06 at 13:11 -0700, Jim Mattson wrote:
> On Wed, Jul 6, 2022 at 10:52 AM Sean Christopherson <seanjc@google.com> wrote:
> 
> > Hmm, I'm not entirely convinced that Intel doesn't interpret "internal to the
> > processor" as "undocumented SMRAM fields".  But I could also be misremembering
> > the SMI flows.
> 
> Start using reserved SMRAM, and you will regret it when the vendor
> assigns some new bit of state to the same location.
> 
This is true to some extent, but our SMRAM layout doesn't follow the spec
anyway.  That is why I asked all of you in the first place (I posted an RFC as
a good citizen) whether you prefer SMRAM or KVM-internal state.

Anyway, if this is a concern, I can just save the interrupt shadow in KVM and
migrate it; it's not hard.  In fact, v1 of my patches did exactly that.

Paolo, what should I do?

Best regards,
	Maxim Levitsky


^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v2 17/21] KVM: x86: Morph pending exceptions to pending VM-Exits at queue time
  2022-07-10 15:56       ` Maxim Levitsky
@ 2022-07-11 15:22         ` Sean Christopherson
  0 siblings, 0 replies; 78+ messages in thread
From: Sean Christopherson @ 2022-07-11 15:22 UTC (permalink / raw)
  To: Maxim Levitsky
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel, Oliver Upton, Peter Shier

On Sun, Jul 10, 2022, Maxim Levitsky wrote:
> On Thu, 2022-07-07 at 01:24 +0000, Sean Christopherson wrote:
> > On Wed, Jul 06, 2022, Maxim Levitsky wrote:
> > > Other than that, this is a _very_ good idea to add it to KVM, although
> > > maybe we should put it in the Documentation folder instead?
> > > (but I don't have a strong preference on this)
> > 
> > I definitely want a comment in KVM that's relatively close to the code.  I'm not
> > opposed to also adding something in Documentation, but I'd want that to be an "and"
> > not an "or".
> 
> Also makes sense. 
> 
> I do think that it is worthwhile to also add a comment about the way KVM
> handles exceptions, namely that inject_pending_event is not always called on an
> instruction boundary.  When we have a pending/injected exception, we first have
> to get rid of it, and only then will we be on an instruction boundary.

Yeah, though it's not like KVM has much of a choice, e.g. intercepted=>reflected
exceptions must be injected during instruction execution.  I wouldn't be opposed
to renaming inject_pending_event() if someone can come up with a decent alternative
that's sufficiently descriptive but not comically verbose.

kvm_check_events() to pair with kvm_check_nested_events()?  kvm_check_and_inject_events()?  

> And to be sure that we inject pending interrupts on the closest instruction
> boundary, we actually open an interrupt/SMI/NMI window there.
> > This is calling out something slightly different.  What it's saying is that if
> > there was a pending exception, then KVM should _not_ have injected said pending
> > exception and instead should have requested an immediate exit.  That "immediate
> > exit" should have forced a VM-Exit before the CPU could fetch a new instruction,
> > and thus before the guest could trigger an exception that would require reinjection.
> > 
> > The "immediate exit" trick works because all events with higher priority than the
> > VMX preemption timer (or IRQ) are guaranteed to exit, e.g. a hardware SMI can't
> > cause a fault in the guest.
> 
> Yes, it all makes sense now.  It really helps thinking in terms of instruction boundaries.
> 
> However, that makes me think: Can that actually happen?

I don't think KVM can get itself in that state, but I believe userspace could force
it by using KVM_SET_VCPU_EVENTS + KVM_SET_NESTED_STATE.
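
For the curious, pending (not injecting) an exception from userspace looks
roughly like the sketch below.  KVM_GET/SET_VCPU_EVENTS and
KVM_VCPUEVENT_VALID_PAYLOAD are the real uAPI; vcpu_fd and the omitted error
handling are assumptions, and KVM_CAP_EXCEPTION_PAYLOAD must be enabled for
the "pending" field to be honored.

#include <linux/kvm.h>
#include <string.h>
#include <sys/ioctl.h>

/* Pend a #DB on an already-created vCPU fd; error handling omitted. */
static void pend_db_exception(int vcpu_fd)
{
	struct kvm_vcpu_events events;

	memset(&events, 0, sizeof(events));
	ioctl(vcpu_fd, KVM_GET_VCPU_EVENTS, &events);

	events.exception.pending = 1;		/* pending, not injected */
	events.exception.nr = 1;		/* #DB */
	events.exception.has_error_code = 0;
	events.flags |= KVM_VCPUEVENT_VALID_PAYLOAD;

	ioctl(vcpu_fd, KVM_SET_VCPU_EVENTS, &events);
}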

> A pending exception can only be generated by KVM itself (exceptions from the
> nested hypervisor and CPU-reflected exceptions/interrupts are all injected).
> 
> If VMRUN/VMRESUME has a pending exception, it means that the instruction itself
> generated it, in which case we won't be entering the guest but rather jumping
> to the exception handler, and thus nested run will not be pending.

Notably, SVM handles single-step #DBs on VMRUN in the nested VM-Exit path.  That's
the only exception that I can think of off the top of my head that can be coincident
with a successful VM-Entry (ignoring things like NMI=>#PF).

> We can though have pending NMI/SMI/interrupts.
> 
> Also just a note about injected exceptions/interrupts during VMRUN/VMRESUME.
> 
> If nested_run_pending is true, then the injected exception due to the same
> reasoning cannot come from VMRUN/VMRESUME. It can come from the nested hypervisor's EVENTINJ,
> but in this case we currently just copy it from vmcb12/vmcs12 to vmcb02/vmcs02,
> without touching vcpu->arch.interrupt.
> 
> Luckily this doesn't cause issues because when the nested run is pending
> we don't inject anything into the guest.
> 
> If nested_run_pending is false, however, the opposite is true.  The EVENTINJ
> will already be delivered, and we can only have an injected exception/interrupt
> that comes from the CPU itself via exit_int_info/IDT_VECTORING_INFO_FIELD, which
> we will copy back as an injected interrupt/exception to 'vcpu->arch.exception/interrupt'
> and later re-inject the next time we run the same VMRUN instruction.

^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v2 12/21] KVM: x86: Make kvm_queued_exception a properly named, visible struct
  2022-06-14 20:47 ` [PATCH v2 12/21] KVM: x86: Make kvm_queued_exception a properly named, visible struct Sean Christopherson
  2022-07-06 12:02   ` Maxim Levitsky
@ 2022-07-18 13:07   ` Maxim Levitsky
  2022-07-18 13:10     ` Maxim Levitsky
  1 sibling, 1 reply; 78+ messages in thread
From: Maxim Levitsky @ 2022-07-18 13:07 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel, kvm,
	linux-kernel, Oliver Upton, Peter Shier

On Tue, 2022-06-14 at 20:47 +0000, Sean Christopherson wrote:
> Move the definition of "struct kvm_queued_exception" out of kvm_vcpu_arch
> in anticipation of adding a second instance in kvm_vcpu_arch to handle
> exceptions that occur when vectoring an injected exception and are
> morphed to VM-Exit instead of leading to #DF.
> 
> Opportunistically take advantage of the churn to rename "nr" to "vector".
> 
> No functional change intended.
> 
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
> 
...


Is the change below intentional?  My memory of nested_apf_token is quite rusty,
but if possible, I would prefer this to be done in a separate patch.


Best regards,
	Maxim Levitsky

> -               else if (svm->vcpu.arch.exception.has_payload)
> -                       vmcb->control.exit_info_2 = svm->vcpu.arch.exception.payload;
> +       if (ex->vector == PF_VECTOR) {
> +               if (ex->has_payload)
> +                       vmcb->control.exit_info_2 = ex->payload;
>                 else
> -                       vmcb->control.exit_info_2 = svm->vcpu.arch.cr2;
> -       } else if (nr == DB_VECTOR) {
> +                       vmcb->control.exit_info_2 = vcpu->arch.cr2;
> +       } else if (ex->vector == DB_VECTOR) {
>                 /* See inject_pending_event.  */
> -               kvm_deliver_exception_payload(&svm->vcpu);
> -               if (svm->vcpu.arch.dr7 & DR7_GD) {
> -                       svm->vcpu.arch.dr7 &= ~DR7_GD;
> -                       kvm_update_dr7(&svm->vcpu);
> +               kvm_deliver_exception_payload(vcpu, ex);
> +
> +               if (vcpu->arch.dr7 & DR7_GD) {
> +                       vcpu->arch.dr7 &= ~DR7_GD;
> +                       kvm_update_dr7(vcpu);
>                 }
> -       } else
> -               WARN_ON(svm->vcpu.arch.exception.has_payload);
> +       } else {
> +               WARN_ON(ex->has_payload);
> +       }
>  
>         nested_svm_vmexit(svm);
>  }
> @@ -1372,7 +1373,7 @@ static int svm_check_nested_events(struct kvm_vcpu *vcpu)
>                          return -EBUSY;
>                 if (!nested_exit_on_exception(svm))
>                         return 0;
> -               nested_svm_inject_exception_vmexit(svm);
> +               nested_svm_inject_exception_vmexit(vcpu);
>                 return 0;
>         }
>  
> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> index ca39f76ca44b..6b80046a014f 100644
> --- a/arch/x86/kvm/svm/svm.c
> +++ b/arch/x86/kvm/svm/svm.c
> @@ -432,22 +432,20 @@ static int svm_update_soft_interrupt_rip(struct kvm_vcpu *vcpu)
>  
>  static void svm_inject_exception(struct kvm_vcpu *vcpu)
>  {
> +       struct kvm_queued_exception *ex = &vcpu->arch.exception;
>         struct vcpu_svm *svm = to_svm(vcpu);
> -       unsigned nr = vcpu->arch.exception.nr;
> -       bool has_error_code = vcpu->arch.exception.has_error_code;
> -       u32 error_code = vcpu->arch.exception.error_code;
>  
> -       kvm_deliver_exception_payload(vcpu);
> +       kvm_deliver_exception_payload(vcpu, ex);
>  
> -       if (kvm_exception_is_soft(nr) &&
> +       if (kvm_exception_is_soft(ex->vector) &&
>             svm_update_soft_interrupt_rip(vcpu))
>                 return;
>  
> -       svm->vmcb->control.event_inj = nr
> +       svm->vmcb->control.event_inj = ex->vector
>                 | SVM_EVTINJ_VALID
> -               | (has_error_code ? SVM_EVTINJ_VALID_ERR : 0)
> +               | (ex->has_error_code ? SVM_EVTINJ_VALID_ERR : 0)
>                 | SVM_EVTINJ_TYPE_EXEPT;
> -       svm->vmcb->control.event_inj_err = error_code;
> +       svm->vmcb->control.event_inj_err = ex->error_code;
>  }
>  
>  static void svm_init_erratum_383(void)
> diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> index 7b644513c82b..fafdcbfeca1f 100644
> --- a/arch/x86/kvm/vmx/nested.c
> +++ b/arch/x86/kvm/vmx/nested.c
> @@ -445,29 +445,27 @@ static bool nested_vmx_is_page_fault_vmexit(struct vmcs12 *vmcs12,
>   */
>  static int nested_vmx_check_exception(struct kvm_vcpu *vcpu, unsigned long *exit_qual)
>  {
> +       struct kvm_queued_exception *ex = &vcpu->arch.exception;
>         struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
> -       unsigned int nr = vcpu->arch.exception.nr;
> -       bool has_payload = vcpu->arch.exception.has_payload;
> -       unsigned long payload = vcpu->arch.exception.payload;
>  
> -       if (nr == PF_VECTOR) {
> -               if (vcpu->arch.exception.nested_apf) {
> +       if (ex->vector == PF_VECTOR) {
> +               if (ex->nested_apf) {
>                         *exit_qual = vcpu->arch.apf.nested_apf_token;
>                         return 1;
>                 }
> -               if (nested_vmx_is_page_fault_vmexit(vmcs12,
> -                                                   vcpu->arch.exception.error_code)) {
> -                       *exit_qual = has_payload ? payload : vcpu->arch.cr2;
> +               if (nested_vmx_is_page_fault_vmexit(vmcs12, ex->error_code)) {
> +                       *exit_qual = ex->has_payload ? ex->payload : vcpu->arch.cr2;
>                         return 1;
>                 }
> -       } else if (vmcs12->exception_bitmap & (1u << nr)) {
> -               if (nr == DB_VECTOR) {
> -                       if (!has_payload) {
> -                               payload = vcpu->arch.dr6;
> -                               payload &= ~DR6_BT;
> -                               payload ^= DR6_ACTIVE_LOW;
> +       } else if (vmcs12->exception_bitmap & (1u << ex->vector)) {
> +               if (ex->vector == DB_VECTOR) {
> +                       if (ex->has_payload) {
> +                               *exit_qual = ex->payload;
> +                       } else {
> +                               *exit_qual = vcpu->arch.dr6;
> +                               *exit_qual &= ~DR6_BT;
> +                               *exit_qual ^= DR6_ACTIVE_LOW;
>                         }
> -                       *exit_qual = payload;
>                 } else
>                         *exit_qual = 0;
>                 return 1;
> @@ -3724,7 +3722,7 @@ static void vmcs12_save_pending_event(struct kvm_vcpu *vcpu,
>              is_double_fault(exit_intr_info))) {
>                 vmcs12->idt_vectoring_info_field = 0;
>         } else if (vcpu->arch.exception.injected) {
> -               nr = vcpu->arch.exception.nr;
> +               nr = vcpu->arch.exception.vector;
>                 idt_vectoring = nr | VECTORING_INFO_VALID_MASK;
>  
>                 if (kvm_exception_is_soft(nr)) {
> @@ -3828,11 +3826,11 @@ static int vmx_complete_nested_posted_interrupt(struct kvm_vcpu *vcpu)
>  static void nested_vmx_inject_exception_vmexit(struct kvm_vcpu *vcpu,
>                                                unsigned long exit_qual)
>  {
> +       struct kvm_queued_exception *ex = &vcpu->arch.exception;
> +       u32 intr_info = ex->vector | INTR_INFO_VALID_MASK;
>         struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
> -       unsigned int nr = vcpu->arch.exception.nr;
> -       u32 intr_info = nr | INTR_INFO_VALID_MASK;
>  
> -       if (vcpu->arch.exception.has_error_code) {
> +       if (ex->has_error_code) {
>                 /*
>                  * Intel CPUs will never generate an error code with bits 31:16
>                  * set, and more importantly VMX disallows setting bits 31:16
> @@ -3840,11 +3838,11 @@ static void nested_vmx_inject_exception_vmexit(struct kvm_vcpu *vcpu,
>                  * mimic hardware and avoid inducing failure on nested VM-Entry
>                  * if L1 chooses to inject the exception back to L2.
>                  */
> -               vmcs12->vm_exit_intr_error_code = (u16)vcpu->arch.exception.error_code;
> +               vmcs12->vm_exit_intr_error_code = (u16)ex->error_code;
>                 intr_info |= INTR_INFO_DELIVER_CODE_MASK;
>         }
>  
> -       if (kvm_exception_is_soft(nr))
> +       if (kvm_exception_is_soft(ex->vector))
>                 intr_info |= INTR_TYPE_SOFT_EXCEPTION;
>         else
>                 intr_info |= INTR_TYPE_HARD_EXCEPTION;
> @@ -3875,7 +3873,7 @@ static void nested_vmx_inject_exception_vmexit(struct kvm_vcpu *vcpu,
>  static inline unsigned long vmx_get_pending_dbg_trap(struct kvm_vcpu *vcpu)
>  {
>         if (!vcpu->arch.exception.pending ||
> -           vcpu->arch.exception.nr != DB_VECTOR)
> +           vcpu->arch.exception.vector != DB_VECTOR)
>                 return 0;
>  
>         /* General Detect #DBs are always fault-like. */
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 26b863c78a9f..7ef5659a1bbd 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -1585,7 +1585,7 @@ static void vmx_update_emulated_instruction(struct kvm_vcpu *vcpu)
>          */
>         if (nested_cpu_has_mtf(vmcs12) &&
>             (!vcpu->arch.exception.pending ||
> -            vcpu->arch.exception.nr == DB_VECTOR))
> +            vcpu->arch.exception.vector == DB_VECTOR))
>                 vmx->nested.mtf_pending = true;
>         else
>                 vmx->nested.mtf_pending = false;
> @@ -1612,15 +1612,13 @@ static void vmx_clear_hlt(struct kvm_vcpu *vcpu)
>  
>  static void vmx_inject_exception(struct kvm_vcpu *vcpu)
>  {
> +       struct kvm_queued_exception *ex = &vcpu->arch.exception;
> +       u32 intr_info = ex->vector | INTR_INFO_VALID_MASK;
>         struct vcpu_vmx *vmx = to_vmx(vcpu);
> -       unsigned nr = vcpu->arch.exception.nr;
> -       bool has_error_code = vcpu->arch.exception.has_error_code;
> -       u32 error_code = vcpu->arch.exception.error_code;
> -       u32 intr_info = nr | INTR_INFO_VALID_MASK;
>  
> -       kvm_deliver_exception_payload(vcpu);
> +       kvm_deliver_exception_payload(vcpu, ex);
>  
> -       if (has_error_code) {
> +       if (ex->has_error_code) {
>                 /*
>                  * Despite the error code being architecturally defined as 32
>                  * bits, and the VMCS field being 32 bits, Intel CPUs and thus
> @@ -1630,21 +1628,21 @@ static void vmx_inject_exception(struct kvm_vcpu *vcpu)
>                  * the upper bits to avoid VM-Fail, losing information that
>                  * doesn't really exist is preferable to killing the VM.
>                  */
> -               vmcs_write32(VM_ENTRY_EXCEPTION_ERROR_CODE, (u16)error_code);
> +               vmcs_write32(VM_ENTRY_EXCEPTION_ERROR_CODE, (u16)ex->error_code);
>                 intr_info |= INTR_INFO_DELIVER_CODE_MASK;
>         }
>  
>         if (vmx->rmode.vm86_active) {
>                 int inc_eip = 0;
> -               if (kvm_exception_is_soft(nr))
> +               if (kvm_exception_is_soft(ex->vector))
>                         inc_eip = vcpu->arch.event_exit_inst_len;
> -               kvm_inject_realmode_interrupt(vcpu, nr, inc_eip);
> +               kvm_inject_realmode_interrupt(vcpu, ex->vector, inc_eip);
>                 return;
>         }
>  
>         WARN_ON_ONCE(vmx->emulation_required);
>  
> -       if (kvm_exception_is_soft(nr)) {
> +       if (kvm_exception_is_soft(ex->vector)) {
>                 vmcs_write32(VM_ENTRY_INSTRUCTION_LEN,
>                              vmx->vcpu.arch.event_exit_inst_len);
>                 intr_info |= INTR_TYPE_SOFT_EXCEPTION;
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index b63421d511c5..511c0c8af80e 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -557,16 +557,13 @@ static int exception_type(int vector)
>         return EXCPT_FAULT;
>  }
>  
> -void kvm_deliver_exception_payload(struct kvm_vcpu *vcpu)
> +void kvm_deliver_exception_payload(struct kvm_vcpu *vcpu,
> +                                  struct kvm_queued_exception *ex)
>  {
> -       unsigned nr = vcpu->arch.exception.nr;
> -       bool has_payload = vcpu->arch.exception.has_payload;
> -       unsigned long payload = vcpu->arch.exception.payload;
> -
> -       if (!has_payload)
> +       if (!ex->has_payload)
>                 return;
>  
> -       switch (nr) {
> +       switch (ex->vector) {
>         case DB_VECTOR:
>                 /*
>                  * "Certain debug exceptions may clear bit 0-3.  The
> @@ -591,8 +588,8 @@ void kvm_deliver_exception_payload(struct kvm_vcpu *vcpu)
>                  * So they need to be flipped for DR6.
>                  */
>                 vcpu->arch.dr6 |= DR6_ACTIVE_LOW;
> -               vcpu->arch.dr6 |= payload;
> -               vcpu->arch.dr6 ^= payload & DR6_ACTIVE_LOW;
> +               vcpu->arch.dr6 |= ex->payload;
> +               vcpu->arch.dr6 ^= ex->payload & DR6_ACTIVE_LOW;
>  
>                 /*
>                  * The #DB payload is defined as compatible with the 'pending
> @@ -603,12 +600,12 @@ void kvm_deliver_exception_payload(struct kvm_vcpu *vcpu)
>                 vcpu->arch.dr6 &= ~BIT(12);
>                 break;
>         case PF_VECTOR:
> -               vcpu->arch.cr2 = payload;
> +               vcpu->arch.cr2 = ex->payload;
>                 break;
>         }
>  
> -       vcpu->arch.exception.has_payload = false;
> -       vcpu->arch.exception.payload = 0;
> +       ex->has_payload = false;
> +       ex->payload = 0;
>  }
>  EXPORT_SYMBOL_GPL(kvm_deliver_exception_payload);
>  
> @@ -647,17 +644,18 @@ static void kvm_multiple_exception(struct kvm_vcpu *vcpu,
>                         vcpu->arch.exception.injected = false;
>                 }
>                 vcpu->arch.exception.has_error_code = has_error;
> -               vcpu->arch.exception.nr = nr;
> +               vcpu->arch.exception.vector = nr;
>                 vcpu->arch.exception.error_code = error_code;
>                 vcpu->arch.exception.has_payload = has_payload;
>                 vcpu->arch.exception.payload = payload;
>                 if (!is_guest_mode(vcpu))
> -                       kvm_deliver_exception_payload(vcpu);
> +                       kvm_deliver_exception_payload(vcpu,
> +                                                     &vcpu->arch.exception);
>                 return;
>         }
>  
>         /* to check exception */
> -       prev_nr = vcpu->arch.exception.nr;
> +       prev_nr = vcpu->arch.exception.vector;
>         if (prev_nr == DF_VECTOR) {
>                 /* triple fault -> shutdown */
>                 kvm_make_request(KVM_REQ_TRIPLE_FAULT, vcpu);
> @@ -675,7 +673,7 @@ static void kvm_multiple_exception(struct kvm_vcpu *vcpu,
>                 vcpu->arch.exception.pending = true;
>                 vcpu->arch.exception.injected = false;
>                 vcpu->arch.exception.has_error_code = true;
> -               vcpu->arch.exception.nr = DF_VECTOR;
> +               vcpu->arch.exception.vector = DF_VECTOR;
>                 vcpu->arch.exception.error_code = 0;
>                 vcpu->arch.exception.has_payload = false;
>                 vcpu->arch.exception.payload = 0;
> @@ -4886,25 +4884,24 @@ static int kvm_vcpu_ioctl_x86_set_mce(struct kvm_vcpu *vcpu,
>  static void kvm_vcpu_ioctl_x86_get_vcpu_events(struct kvm_vcpu *vcpu,
>                                                struct kvm_vcpu_events *events)
>  {
> +       struct kvm_queued_exception *ex = &vcpu->arch.exception;
> +
>         process_nmi(vcpu);
>  
>         if (kvm_check_request(KVM_REQ_SMI, vcpu))
>                 process_smi(vcpu);
>  
>         /*
> -        * In guest mode, payload delivery should be deferred,
> -        * so that the L1 hypervisor can intercept #PF before
> -        * CR2 is modified (or intercept #DB before DR6 is
> -        * modified under nVMX). Unless the per-VM capability,
> -        * KVM_CAP_EXCEPTION_PAYLOAD, is set, we may not defer the delivery of
> -        * an exception payload and handle after a KVM_GET_VCPU_EVENTS. Since we
> -        * opportunistically defer the exception payload, deliver it if the
> -        * capability hasn't been requested before processing a
> -        * KVM_GET_VCPU_EVENTS.
> +        * In guest mode, payload delivery should be deferred if the exception
> +        * will be intercepted by L1, e.g. KVM should not modify CR2 if L1
> +        * intercepts #PF, ditto for DR6 and #DBs.  If the per-VM capability,
> +        * KVM_CAP_EXCEPTION_PAYLOAD, is not set, userspace may or may not
> +        * propagate the payload and so it cannot be safely deferred.  Deliver
> +        * the payload if the capability hasn't been requested.
>          */
>         if (!vcpu->kvm->arch.exception_payload_enabled &&
> -           vcpu->arch.exception.pending && vcpu->arch.exception.has_payload)
> -               kvm_deliver_exception_payload(vcpu);
> +           ex->pending && ex->has_payload)
> +               kvm_deliver_exception_payload(vcpu, ex);
>  
>         /*
>          * The API doesn't provide the instruction length for software
> @@ -4912,26 +4909,25 @@ static void kvm_vcpu_ioctl_x86_get_vcpu_events(struct kvm_vcpu *vcpu,
>          * isn't advanced, we should expect to encounter the exception
>          * again.
>          */
> -       if (kvm_exception_is_soft(vcpu->arch.exception.nr)) {
> +       if (kvm_exception_is_soft(ex->vector)) {
>                 events->exception.injected = 0;
>                 events->exception.pending = 0;
>         } else {
> -               events->exception.injected = vcpu->arch.exception.injected;
> -               events->exception.pending = vcpu->arch.exception.pending;
> +               events->exception.injected = ex->injected;
> +               events->exception.pending = ex->pending;
>                 /*
>                  * For ABI compatibility, deliberately conflate
>                  * pending and injected exceptions when
>                  * KVM_CAP_EXCEPTION_PAYLOAD isn't enabled.
>                  */
>                 if (!vcpu->kvm->arch.exception_payload_enabled)
> -                       events->exception.injected |=
> -                               vcpu->arch.exception.pending;
> +                       events->exception.injected |= ex->pending;
>         }
> -       events->exception.nr = vcpu->arch.exception.nr;
> -       events->exception.has_error_code = vcpu->arch.exception.has_error_code;
> -       events->exception.error_code = vcpu->arch.exception.error_code;
> -       events->exception_has_payload = vcpu->arch.exception.has_payload;
> -       events->exception_payload = vcpu->arch.exception.payload;
> +       events->exception.nr = ex->vector;
> +       events->exception.has_error_code = ex->has_error_code;
> +       events->exception.error_code = ex->error_code;
> +       events->exception_has_payload = ex->has_payload;
> +       events->exception_payload = ex->payload;
>  
>         events->interrupt.injected =
>                 vcpu->arch.interrupt.injected && !vcpu->arch.interrupt.soft;
> @@ -5003,7 +4999,7 @@ static int kvm_vcpu_ioctl_x86_set_vcpu_events(struct kvm_vcpu *vcpu,
>         process_nmi(vcpu);
>         vcpu->arch.exception.injected = events->exception.injected;
>         vcpu->arch.exception.pending = events->exception.pending;
> -       vcpu->arch.exception.nr = events->exception.nr;
> +       vcpu->arch.exception.vector = events->exception.nr;
>         vcpu->arch.exception.has_error_code = events->exception.has_error_code;
>         vcpu->arch.exception.error_code = events->exception.error_code;
>         vcpu->arch.exception.has_payload = events->exception_has_payload;
> @@ -9497,7 +9493,7 @@ int kvm_check_nested_events(struct kvm_vcpu *vcpu)
>  
>  static void kvm_inject_exception(struct kvm_vcpu *vcpu)
>  {
> -       trace_kvm_inj_exception(vcpu->arch.exception.nr,
> +       trace_kvm_inj_exception(vcpu->arch.exception.vector,
>                                 vcpu->arch.exception.has_error_code,
>                                 vcpu->arch.exception.error_code,
>                                 vcpu->arch.exception.injected);
> @@ -9569,12 +9565,12 @@ static int inject_pending_event(struct kvm_vcpu *vcpu, bool *req_immediate_exit)
>                  * describe the behavior of General Detect #DBs, which are
>                  * fault-like.  They do _not_ set RF, a la code breakpoints.
>                  */
> -               if (exception_type(vcpu->arch.exception.nr) == EXCPT_FAULT)
> +               if (exception_type(vcpu->arch.exception.vector) == EXCPT_FAULT)
>                         __kvm_set_rflags(vcpu, kvm_get_rflags(vcpu) |
>                                              X86_EFLAGS_RF);
>  
> -               if (vcpu->arch.exception.nr == DB_VECTOR) {
> -                       kvm_deliver_exception_payload(vcpu);
> +               if (vcpu->arch.exception.vector == DB_VECTOR) {
> +                       kvm_deliver_exception_payload(vcpu, &vcpu->arch.exception);
>                         if (vcpu->arch.dr7 & DR7_GD) {
>                                 vcpu->arch.dr7 &= ~DR7_GD;
>                                 kvm_update_dr7(vcpu);
> diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
> index 501b884b8cc4..dc2af0146220 100644
> --- a/arch/x86/kvm/x86.h
> +++ b/arch/x86/kvm/x86.h
> @@ -286,7 +286,8 @@ int kvm_write_guest_virt_system(struct kvm_vcpu *vcpu,
>  
>  int handle_ud(struct kvm_vcpu *vcpu);
>  
> -void kvm_deliver_exception_payload(struct kvm_vcpu *vcpu);
> +void kvm_deliver_exception_payload(struct kvm_vcpu *vcpu,
> +                                  struct kvm_queued_exception *ex);
>  
>  void kvm_vcpu_mtrr_init(struct kvm_vcpu *vcpu);
>  u8 kvm_mtrr_get_guest_memory_type(struct kvm_vcpu *vcpu, gfn_t gfn);
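
For context, a minimal userspace sketch of the opt-in referenced in the
comment near the top of the hunk above (hedged: vm_fd and the surrounding VM
setup are assumed, and error handling is omitted):

#include <linux/kvm.h>
#include <sys/ioctl.h>

/*
 * Enable KVM_CAP_EXCEPTION_PAYLOAD on the VM so that KVM defers #PF/#DB
 * payloads and reports them via exception_payload in KVM_GET_VCPU_EVENTS,
 * instead of writing CR2/DR6 before returning to userspace.  args[0] = 1
 * requests enablement.
 */
static int enable_exception_payload(int vm_fd)
{
	struct kvm_enable_cap cap = {
		.cap = KVM_CAP_EXCEPTION_PAYLOAD,
		.args = { 1 },
	};

	return ioctl(vm_fd, KVM_ENABLE_CAP, &cap);
}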



^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v2 12/21] KVM: x86: Make kvm_queued_exception a properly named, visible struct
  2022-07-18 13:07   ` Maxim Levitsky
@ 2022-07-18 13:10     ` Maxim Levitsky
  2022-07-18 15:40       ` Sean Christopherson
  0 siblings, 1 reply; 78+ messages in thread
From: Maxim Levitsky @ 2022-07-18 13:10 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel, kvm,
	linux-kernel, Oliver Upton, Peter Shier

On Mon, 2022-07-18 at 16:07 +0300, Maxim Levitsky wrote:
> On Tue, 2022-06-14 at 20:47 +0000, Sean Christopherson wrote:
> > Move the definition of "struct kvm_queued_exception" out of kvm_vcpu_arch
> > in anticipation of adding a second instance in kvm_vcpu_arch to handle
> > exceptions that occur when vectoring an injected exception and are
> > morphed to VM-Exit instead of leading to #DF.
> > 
> > Opportunistically take advantage of the churn to rename "nr" to "vector".
> > 
> > No functional change intended.
> > 
> > Signed-off-by: Sean Christopherson <seanjc@google.com>
> > ---
> > 
> ...
> 
> 
> Is this change below intentional? My memory of nested_apf_token is quite rusty, but if
> possible, I would prefer this to be done in a separate patch.


Sorry, I replied to the wrong mail, but the newer version also has the same issue.
(It should be v3 btw.)

Best regards,
	Maxim Levitsky
> 
> 
> Best regards,
>         Maxim Levitsky
> 
> > -               else if (svm->vcpu.arch.exception.has_payload)
> > -                       vmcb->control.exit_info_2 = svm->vcpu.arch.exception.payload;
> > +       if (ex->vector == PF_VECTOR) {
> > +               if (ex->has_payload)
> > +                       vmcb->control.exit_info_2 = ex->payload;
> >                 else
> > -                       vmcb->control.exit_info_2 = svm->vcpu.arch.cr2;
> > -       } else if (nr == DB_VECTOR) {
> > +                       vmcb->control.exit_info_2 = vcpu->arch.cr2;
> > +       } else if (ex->vector == DB_VECTOR) {
> >                 /* See inject_pending_event.  */
> > -               kvm_deliver_exception_payload(&svm->vcpu);
> > -               if (svm->vcpu.arch.dr7 & DR7_GD) {
> > -                       svm->vcpu.arch.dr7 &= ~DR7_GD;
> > -                       kvm_update_dr7(&svm->vcpu);
> > +               kvm_deliver_exception_payload(vcpu, ex);
> > +
> > +               if (vcpu->arch.dr7 & DR7_GD) {
> > +                       vcpu->arch.dr7 &= ~DR7_GD;
> > +                       kvm_update_dr7(vcpu);
> >                 }
> > -       } else
> > -               WARN_ON(svm->vcpu.arch.exception.has_payload);
> > +       } else {
> > +               WARN_ON(ex->has_payload);
> > +       }
> >  
> >         nested_svm_vmexit(svm);
> >  }
> > @@ -1372,7 +1373,7 @@ static int svm_check_nested_events(struct kvm_vcpu *vcpu)
> >                          return -EBUSY;
> >                 if (!nested_exit_on_exception(svm))
> >                         return 0;
> > -               nested_svm_inject_exception_vmexit(svm);
> > +               nested_svm_inject_exception_vmexit(vcpu);
> >                 return 0;
> >         }
> >  
> > diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> > index ca39f76ca44b..6b80046a014f 100644
> > --- a/arch/x86/kvm/svm/svm.c
> > +++ b/arch/x86/kvm/svm/svm.c
> > @@ -432,22 +432,20 @@ static int svm_update_soft_interrupt_rip(struct kvm_vcpu *vcpu)
> >  
> >  static void svm_inject_exception(struct kvm_vcpu *vcpu)
> >  {
> > +       struct kvm_queued_exception *ex = &vcpu->arch.exception;
> >         struct vcpu_svm *svm = to_svm(vcpu);
> > -       unsigned nr = vcpu->arch.exception.nr;
> > -       bool has_error_code = vcpu->arch.exception.has_error_code;
> > -       u32 error_code = vcpu->arch.exception.error_code;
> >  
> > -       kvm_deliver_exception_payload(vcpu);
> > +       kvm_deliver_exception_payload(vcpu, ex);
> >  
> > -       if (kvm_exception_is_soft(nr) &&
> > +       if (kvm_exception_is_soft(ex->vector) &&
> >             svm_update_soft_interrupt_rip(vcpu))
> >                 return;
> >  
> > -       svm->vmcb->control.event_inj = nr
> > +       svm->vmcb->control.event_inj = ex->vector
> >                 | SVM_EVTINJ_VALID
> > -               | (has_error_code ? SVM_EVTINJ_VALID_ERR : 0)
> > +               | (ex->has_error_code ? SVM_EVTINJ_VALID_ERR : 0)
> >                 | SVM_EVTINJ_TYPE_EXEPT;
> > -       svm->vmcb->control.event_inj_err = error_code;
> > +       svm->vmcb->control.event_inj_err = ex->error_code;
> >  }
> >  
> >  static void svm_init_erratum_383(void)
> > diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> > index 7b644513c82b..fafdcbfeca1f 100644
> > --- a/arch/x86/kvm/vmx/nested.c
> > +++ b/arch/x86/kvm/vmx/nested.c
> > @@ -445,29 +445,27 @@ static bool nested_vmx_is_page_fault_vmexit(struct vmcs12 *vmcs12,
> >   */
> >  static int nested_vmx_check_exception(struct kvm_vcpu *vcpu, unsigned long *exit_qual)
> >  {
> > +       struct kvm_queued_exception *ex = &vcpu->arch.exception;
> >         struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
> > -       unsigned int nr = vcpu->arch.exception.nr;
> > -       bool has_payload = vcpu->arch.exception.has_payload;
> > -       unsigned long payload = vcpu->arch.exception.payload;
> >  
> > -       if (nr == PF_VECTOR) {
> > -               if (vcpu->arch.exception.nested_apf) {
> > +       if (ex->vector == PF_VECTOR) {
> > +               if (ex->nested_apf) {
> >                         *exit_qual = vcpu->arch.apf.nested_apf_token;
> >                         return 1;
> >                 }
> > -               if (nested_vmx_is_page_fault_vmexit(vmcs12,
> > -                                                   vcpu->arch.exception.error_code)) {
> > -                       *exit_qual = has_payload ? payload : vcpu->arch.cr2;
> > +               if (nested_vmx_is_page_fault_vmexit(vmcs12, ex->error_code)) {
> > +                       *exit_qual = ex->has_payload ? ex->payload : vcpu->arch.cr2;
> >                         return 1;
> >                 }
> > -       } else if (vmcs12->exception_bitmap & (1u << nr)) {
> > -               if (nr == DB_VECTOR) {
> > -                       if (!has_payload) {
> > -                               payload = vcpu->arch.dr6;
> > -                               payload &= ~DR6_BT;
> > -                               payload ^= DR6_ACTIVE_LOW;
> > +       } else if (vmcs12->exception_bitmap & (1u << ex->vector)) {
> > +               if (ex->vector == DB_VECTOR) {
> > +                       if (ex->has_payload) {
> > +                               *exit_qual = ex->payload;
> > +                       } else {
> > +                               *exit_qual = vcpu->arch.dr6;
> > +                               *exit_qual &= ~DR6_BT;
> > +                               *exit_qual ^= DR6_ACTIVE_LOW;
> >                         }
> > -                       *exit_qual = payload;
> >                 } else
> >                         *exit_qual = 0;
> >                 return 1;
> > @@ -3724,7 +3722,7 @@ static void vmcs12_save_pending_event(struct kvm_vcpu *vcpu,
> >              is_double_fault(exit_intr_info))) {
> >                 vmcs12->idt_vectoring_info_field = 0;
> >         } else if (vcpu->arch.exception.injected) {
> > -               nr = vcpu->arch.exception.nr;
> > +               nr = vcpu->arch.exception.vector;
> >                 idt_vectoring = nr | VECTORING_INFO_VALID_MASK;
> >  
> >                 if (kvm_exception_is_soft(nr)) {
> > @@ -3828,11 +3826,11 @@ static int vmx_complete_nested_posted_interrupt(struct kvm_vcpu *vcpu)
> >  static void nested_vmx_inject_exception_vmexit(struct kvm_vcpu *vcpu,
> >                                                unsigned long exit_qual)
> >  {
> > +       struct kvm_queued_exception *ex = &vcpu->arch.exception;
> > +       u32 intr_info = ex->vector | INTR_INFO_VALID_MASK;
> >         struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
> > -       unsigned int nr = vcpu->arch.exception.nr;
> > -       u32 intr_info = nr | INTR_INFO_VALID_MASK;
> >  
> > -       if (vcpu->arch.exception.has_error_code) {
> > +       if (ex->has_error_code) {
> >                 /*
> >                  * Intel CPUs will never generate an error code with bits 31:16
> >                  * set, and more importantly VMX disallows setting bits 31:16
> > @@ -3840,11 +3838,11 @@ static void nested_vmx_inject_exception_vmexit(struct kvm_vcpu *vcpu,
> >                  * mimic hardware and avoid inducing failure on nested VM-Entry
> >                  * if L1 chooses to inject the exception back to L2.
> >                  */
> > -               vmcs12->vm_exit_intr_error_code = (u16)vcpu->arch.exception.error_code;
> > +               vmcs12->vm_exit_intr_error_code = (u16)ex->error_code;
> >                 intr_info |= INTR_INFO_DELIVER_CODE_MASK;
> >         }
> >  
> > -       if (kvm_exception_is_soft(nr))
> > +       if (kvm_exception_is_soft(ex->vector))
> >                 intr_info |= INTR_TYPE_SOFT_EXCEPTION;
> >         else
> >                 intr_info |= INTR_TYPE_HARD_EXCEPTION;
> > @@ -3875,7 +3873,7 @@ static void nested_vmx_inject_exception_vmexit(struct kvm_vcpu *vcpu,
> >  static inline unsigned long vmx_get_pending_dbg_trap(struct kvm_vcpu *vcpu)
> >  {
> >         if (!vcpu->arch.exception.pending ||
> > -           vcpu->arch.exception.nr != DB_VECTOR)
> > +           vcpu->arch.exception.vector != DB_VECTOR)
> >                 return 0;
> >  
> >         /* General Detect #DBs are always fault-like. */
> > diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> > index 26b863c78a9f..7ef5659a1bbd 100644
> > --- a/arch/x86/kvm/vmx/vmx.c
> > +++ b/arch/x86/kvm/vmx/vmx.c
> > @@ -1585,7 +1585,7 @@ static void vmx_update_emulated_instruction(struct kvm_vcpu *vcpu)
> >          */
> >         if (nested_cpu_has_mtf(vmcs12) &&
> >             (!vcpu->arch.exception.pending ||
> > -            vcpu->arch.exception.nr == DB_VECTOR))
> > +            vcpu->arch.exception.vector == DB_VECTOR))
> >                 vmx->nested.mtf_pending = true;
> >         else
> >                 vmx->nested.mtf_pending = false;
> > @@ -1612,15 +1612,13 @@ static void vmx_clear_hlt(struct kvm_vcpu *vcpu)
> >  
> >  static void vmx_inject_exception(struct kvm_vcpu *vcpu)
> >  {
> > +       struct kvm_queued_exception *ex = &vcpu->arch.exception;
> > +       u32 intr_info = ex->vector | INTR_INFO_VALID_MASK;
> >         struct vcpu_vmx *vmx = to_vmx(vcpu);
> > -       unsigned nr = vcpu->arch.exception.nr;
> > -       bool has_error_code = vcpu->arch.exception.has_error_code;
> > -       u32 error_code = vcpu->arch.exception.error_code;
> > -       u32 intr_info = nr | INTR_INFO_VALID_MASK;
> >  
> > -       kvm_deliver_exception_payload(vcpu);
> > +       kvm_deliver_exception_payload(vcpu, ex);
> >  
> > -       if (has_error_code) {
> > +       if (ex->has_error_code) {
> >                 /*
> >                  * Despite the error code being architecturally defined as 32
> >                  * bits, and the VMCS field being 32 bits, Intel CPUs and thus
> > @@ -1630,21 +1628,21 @@ static void vmx_inject_exception(struct kvm_vcpu *vcpu)
> >                  * the upper bits to avoid VM-Fail, losing information that
> >                  * doesn't really exist is preferable to killing the VM.
> >                  */
> > -               vmcs_write32(VM_ENTRY_EXCEPTION_ERROR_CODE, (u16)error_code);
> > +               vmcs_write32(VM_ENTRY_EXCEPTION_ERROR_CODE, (u16)ex->error_code);
> >                 intr_info |= INTR_INFO_DELIVER_CODE_MASK;
> >         }
> >  
> >         if (vmx->rmode.vm86_active) {
> >                 int inc_eip = 0;
> > -               if (kvm_exception_is_soft(nr))
> > +               if (kvm_exception_is_soft(ex->vector))
> >                         inc_eip = vcpu->arch.event_exit_inst_len;
> > -               kvm_inject_realmode_interrupt(vcpu, nr, inc_eip);
> > +               kvm_inject_realmode_interrupt(vcpu, ex->vector, inc_eip);
> >                 return;
> >         }
> >  
> >         WARN_ON_ONCE(vmx->emulation_required);
> >  
> > -       if (kvm_exception_is_soft(nr)) {
> > +       if (kvm_exception_is_soft(ex->vector)) {
> >                 vmcs_write32(VM_ENTRY_INSTRUCTION_LEN,
> >                              vmx->vcpu.arch.event_exit_inst_len);
> >                 intr_info |= INTR_TYPE_SOFT_EXCEPTION;
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index b63421d511c5..511c0c8af80e 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -557,16 +557,13 @@ static int exception_type(int vector)
> >         return EXCPT_FAULT;
> >  }
> >  
> > -void kvm_deliver_exception_payload(struct kvm_vcpu *vcpu)
> > +void kvm_deliver_exception_payload(struct kvm_vcpu *vcpu,
> > +                                  struct kvm_queued_exception *ex)
> >  {
> > -       unsigned nr = vcpu->arch.exception.nr;
> > -       bool has_payload = vcpu->arch.exception.has_payload;
> > -       unsigned long payload = vcpu->arch.exception.payload;
> > -
> > -       if (!has_payload)
> > +       if (!ex->has_payload)
> >                 return;
> >  
> > -       switch (nr) {
> > +       switch (ex->vector) {
> >         case DB_VECTOR:
> >                 /*
> >                  * "Certain debug exceptions may clear bit 0-3.  The
> > @@ -591,8 +588,8 @@ void kvm_deliver_exception_payload(struct kvm_vcpu *vcpu)
> >                  * So they need to be flipped for DR6.
> >                  */
> >                 vcpu->arch.dr6 |= DR6_ACTIVE_LOW;
> > -               vcpu->arch.dr6 |= payload;
> > -               vcpu->arch.dr6 ^= payload & DR6_ACTIVE_LOW;
> > +               vcpu->arch.dr6 |= ex->payload;
> > +               vcpu->arch.dr6 ^= ex->payload & DR6_ACTIVE_LOW;
> >  
> >                 /*
> >                  * The #DB payload is defined as compatible with the 'pending
> > @@ -603,12 +600,12 @@ void kvm_deliver_exception_payload(struct kvm_vcpu *vcpu)
> >                 vcpu->arch.dr6 &= ~BIT(12);
> >                 break;
> >         case PF_VECTOR:
> > -               vcpu->arch.cr2 = payload;
> > +               vcpu->arch.cr2 = ex->payload;
> >                 break;
> >         }
> >  
> > -       vcpu->arch.exception.has_payload = false;
> > -       vcpu->arch.exception.payload = 0;
> > +       ex->has_payload = false;
> > +       ex->payload = 0;
> >  }
> >  EXPORT_SYMBOL_GPL(kvm_deliver_exception_payload);
> >  
> > @@ -647,17 +644,18 @@ static void kvm_multiple_exception(struct kvm_vcpu *vcpu,
> >                         vcpu->arch.exception.injected = false;
> >                 }
> >                 vcpu->arch.exception.has_error_code = has_error;
> > -               vcpu->arch.exception.nr = nr;
> > +               vcpu->arch.exception.vector = nr;
> >                 vcpu->arch.exception.error_code = error_code;
> >                 vcpu->arch.exception.has_payload = has_payload;
> >                 vcpu->arch.exception.payload = payload;
> >                 if (!is_guest_mode(vcpu))
> > -                       kvm_deliver_exception_payload(vcpu);
> > +                       kvm_deliver_exception_payload(vcpu,
> > +                                                     &vcpu->arch.exception);
> >                 return;
> >         }
> >  
> >         /* to check exception */
> > -       prev_nr = vcpu->arch.exception.nr;
> > +       prev_nr = vcpu->arch.exception.vector;
> >         if (prev_nr == DF_VECTOR) {
> >                 /* triple fault -> shutdown */
> >                 kvm_make_request(KVM_REQ_TRIPLE_FAULT, vcpu);
> > @@ -675,7 +673,7 @@ static void kvm_multiple_exception(struct kvm_vcpu *vcpu,
> >                 vcpu->arch.exception.pending = true;
> >                 vcpu->arch.exception.injected = false;
> >                 vcpu->arch.exception.has_error_code = true;
> > -               vcpu->arch.exception.nr = DF_VECTOR;
> > +               vcpu->arch.exception.vector = DF_VECTOR;
> >                 vcpu->arch.exception.error_code = 0;
> >                 vcpu->arch.exception.has_payload = false;
> >                 vcpu->arch.exception.payload = 0;
> > @@ -4886,25 +4884,24 @@ static int kvm_vcpu_ioctl_x86_set_mce(struct kvm_vcpu *vcpu,
> >  static void kvm_vcpu_ioctl_x86_get_vcpu_events(struct kvm_vcpu *vcpu,
> >                                                struct kvm_vcpu_events *events)
> >  {
> > +       struct kvm_queued_exception *ex = &vcpu->arch.exception;
> > +
> >         process_nmi(vcpu);
> >  
> >         if (kvm_check_request(KVM_REQ_SMI, vcpu))
> >                 process_smi(vcpu);
> >  
> >         /*
> > -        * In guest mode, payload delivery should be deferred,
> > -        * so that the L1 hypervisor can intercept #PF before
> > -        * CR2 is modified (or intercept #DB before DR6 is
> > -        * modified under nVMX). Unless the per-VM capability,
> > -        * KVM_CAP_EXCEPTION_PAYLOAD, is set, we may not defer the delivery of
> > -        * an exception payload and handle after a KVM_GET_VCPU_EVENTS. Since we
> > -        * opportunistically defer the exception payload, deliver it if the
> > -        * capability hasn't been requested before processing a
> > -        * KVM_GET_VCPU_EVENTS.
> > +        * In guest mode, payload delivery should be deferred if the exception
> > +        * will be intercepted by L1, e.g. KVM should not modify CR2 if L1
> > +        * intercepts #PF, ditto for DR6 and #DBs.  If the per-VM capability,
> > +        * KVM_CAP_EXCEPTION_PAYLOAD, is not set, userspace may or may not
> > +        * propagate the payload and so it cannot be safely deferred.  Deliver
> > +        * the payload if the capability hasn't been requested.
> >          */
> >         if (!vcpu->kvm->arch.exception_payload_enabled &&
> > -           vcpu->arch.exception.pending && vcpu->arch.exception.has_payload)
> > -               kvm_deliver_exception_payload(vcpu);
> > +           ex->pending && ex->has_payload)
> > +               kvm_deliver_exception_payload(vcpu, ex);
> >  
> >         /*
> >          * The API doesn't provide the instruction length for software
> > @@ -4912,26 +4909,25 @@ static void kvm_vcpu_ioctl_x86_get_vcpu_events(struct kvm_vcpu *vcpu,
> >          * isn't advanced, we should expect to encounter the exception
> >          * again.
> >          */
> > -       if (kvm_exception_is_soft(vcpu->arch.exception.nr)) {
> > +       if (kvm_exception_is_soft(ex->vector)) {
> >                 events->exception.injected = 0;
> >                 events->exception.pending = 0;
> >         } else {
> > -               events->exception.injected = vcpu->arch.exception.injected;
> > -               events->exception.pending = vcpu->arch.exception.pending;
> > +               events->exception.injected = ex->injected;
> > +               events->exception.pending = ex->pending;
> >                 /*
> >                  * For ABI compatibility, deliberately conflate
> >                  * pending and injected exceptions when
> >                  * KVM_CAP_EXCEPTION_PAYLOAD isn't enabled.
> >                  */
> >                 if (!vcpu->kvm->arch.exception_payload_enabled)
> > -                       events->exception.injected |=
> > -                               vcpu->arch.exception.pending;
> > +                       events->exception.injected |= ex->pending;
> >         }
> > -       events->exception.nr = vcpu->arch.exception.nr;
> > -       events->exception.has_error_code = vcpu->arch.exception.has_error_code;
> > -       events->exception.error_code = vcpu->arch.exception.error_code;
> > -       events->exception_has_payload = vcpu->arch.exception.has_payload;
> > -       events->exception_payload = vcpu->arch.exception.payload;
> > +       events->exception.nr = ex->vector;
> > +       events->exception.has_error_code = ex->has_error_code;
> > +       events->exception.error_code = ex->error_code;
> > +       events->exception_has_payload = ex->has_payload;
> > +       events->exception_payload = ex->payload;
> >  
> >         events->interrupt.injected =
> >                 vcpu->arch.interrupt.injected && !vcpu->arch.interrupt.soft;
> > @@ -5003,7 +4999,7 @@ static int kvm_vcpu_ioctl_x86_set_vcpu_events(struct kvm_vcpu *vcpu,
> >         process_nmi(vcpu);
> >         vcpu->arch.exception.injected = events->exception.injected;
> >         vcpu->arch.exception.pending = events->exception.pending;
> > -       vcpu->arch.exception.nr = events->exception.nr;
> > +       vcpu->arch.exception.vector = events->exception.nr;
> >         vcpu->arch.exception.has_error_code = events->exception.has_error_code;
> >         vcpu->arch.exception.error_code = events->exception.error_code;
> >         vcpu->arch.exception.has_payload = events->exception_has_payload;
> > @@ -9497,7 +9493,7 @@ int kvm_check_nested_events(struct kvm_vcpu *vcpu)
> >  
> >  static void kvm_inject_exception(struct kvm_vcpu *vcpu)
> >  {
> > -       trace_kvm_inj_exception(vcpu->arch.exception.nr,
> > +       trace_kvm_inj_exception(vcpu->arch.exception.vector,
> >                                 vcpu->arch.exception.has_error_code,
> >                                 vcpu->arch.exception.error_code,
> >                                 vcpu->arch.exception.injected);
> > @@ -9569,12 +9565,12 @@ static int inject_pending_event(struct kvm_vcpu *vcpu, bool *req_immediate_exit)
> >                  * describe the behavior of General Detect #DBs, which are
> >                  * fault-like.  They do _not_ set RF, a la code breakpoints.
> >                  */
> > -               if (exception_type(vcpu->arch.exception.nr) == EXCPT_FAULT)
> > +               if (exception_type(vcpu->arch.exception.vector) == EXCPT_FAULT)
> >                         __kvm_set_rflags(vcpu, kvm_get_rflags(vcpu) |
> >                                              X86_EFLAGS_RF);
> >  
> > -               if (vcpu->arch.exception.nr == DB_VECTOR) {
> > -                       kvm_deliver_exception_payload(vcpu);
> > +               if (vcpu->arch.exception.vector == DB_VECTOR) {
> > +                       kvm_deliver_exception_payload(vcpu, &vcpu->arch.exception);
> >                         if (vcpu->arch.dr7 & DR7_GD) {
> >                                 vcpu->arch.dr7 &= ~DR7_GD;
> >                                 kvm_update_dr7(vcpu);
> > diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
> > index 501b884b8cc4..dc2af0146220 100644
> > --- a/arch/x86/kvm/x86.h
> > +++ b/arch/x86/kvm/x86.h
> > @@ -286,7 +286,8 @@ int kvm_write_guest_virt_system(struct kvm_vcpu *vcpu,
> >  
> >  int handle_ud(struct kvm_vcpu *vcpu);
> >  
> > -void kvm_deliver_exception_payload(struct kvm_vcpu *vcpu);
> > +void kvm_deliver_exception_payload(struct kvm_vcpu *vcpu,
> > +                                  struct kvm_queued_exception *ex);
> >  
> >  void kvm_vcpu_mtrr_init(struct kvm_vcpu *vcpu);
> >  u8 kvm_mtrr_get_guest_memory_type(struct kvm_vcpu *vcpu, gfn_t gfn);
> 



^ permalink raw reply	[flat|nested] 78+ messages in thread

* Re: [PATCH v2 12/21] KVM: x86: Make kvm_queued_exception a properly named, visible struct
  2022-07-18 13:10     ` Maxim Levitsky
@ 2022-07-18 15:40       ` Sean Christopherson
  0 siblings, 0 replies; 78+ messages in thread
From: Sean Christopherson @ 2022-07-18 15:40 UTC (permalink / raw)
  To: Maxim Levitsky
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, kvm, linux-kernel, Oliver Upton, Peter Shier

On Mon, Jul 18, 2022, Maxim Levitsky wrote:
> On Mon, 2022-07-18 at 16:07 +0300, Maxim Levitsky wrote:
> > On Tue, 2022-06-14 at 20:47 +0000, Sean Christopherson wrote:
> > > Move the definition of "struct kvm_queued_exception" out of kvm_vcpu_arch
> > > in anticipation of adding a second instance in kvm_vcpu_arch to handle
> > > exceptions that occur when vectoring an injected exception and are
> > > morphed to VM-Exit instead of leading to #DF.
> > > 
> > > Opportunistically take advantage of the churn to rename "nr" to "vector".
> > > 
> > > No functional change intended.
> > > 
> > > Signed-off-by: Sean Christopherson <seanjc@google.com>
> > > ---
> > > 
> > ...
> > 
> > 
> > Is this change below intentional? My memory of nested_apf_token is quite rusty, but if
> > possible, I would prefer this to be done in a separate patch.

Yikes!  It's not intentional as of this patch.  It _is_ intentional for the "morph"
patch, as KVM can simply force "has_payload" in kvm_inject_page_fault() when
directly queueing the VM-Exit.  I suspect I botched this patch when splitting the
original changes into separate patches.
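
Roughly what that ends up looking like in the "morph" patch (a sketch, not the
literal diff; kvm_queue_exception_vmexit() is the helper that patch introduces,
with its signature approximated here):

void kvm_inject_page_fault(struct kvm_vcpu *vcpu, struct x86_exception *fault)
{
	++vcpu->stat.pf_guest;

	/*
	 * An async #PF "protocol" event in L2 is always forwarded to L1 as a
	 * VM-Exit, so queue the VM-Exit directly with has_error_code and
	 * has_payload forced, which makes the nested_apf/nested_apf_token
	 * special case in nested_vmx_check_exception() unnecessary.
	 */
	if (unlikely(fault->async_page_fault) && is_guest_mode(vcpu))
		kvm_queue_exception_vmexit(vcpu, PF_VECTOR,
					   true, fault->error_code,
					   true, fault->address);
	else
		kvm_queue_exception_e_p(vcpu, PF_VECTOR, fault->error_code,
					fault->address);
}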

> Sorry, I replied to the wrong mail, but the newer version also has the same issue.
> (It should be v3 btw.)

Argh, sorry about the versioning mixup.  I distinctly remember thinking this
couldn't possibly have been only the second version...  Should have double-checked
instead of trusting my archive.

^ permalink raw reply	[flat|nested] 78+ messages in thread

end of thread, other threads:[~2022-07-18 15:41 UTC | newest]

Thread overview: 78+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-06-14 20:47 [PATCH v2 00/21] KVM: x86: Event/exception fixes and cleanups Sean Christopherson
2022-06-14 20:47 ` [PATCH v2 01/21] KVM: nVMX: Unconditionally purge queued/injected events on nested "exit" Sean Christopherson
2022-06-16 23:47   ` Jim Mattson
2022-07-06 11:40   ` Maxim Levitsky
2022-06-14 20:47 ` [PATCH v2 02/21] KVM: VMX: Drop bits 31:16 when shoving exception error code into VMCS Sean Christopherson
2022-07-06 11:43   ` Maxim Levitsky
2022-07-06 16:12     ` Sean Christopherson
2022-07-06 18:50       ` Maxim Levitsky
2022-07-06 20:02   ` Jim Mattson
2022-06-14 20:47 ` [PATCH v2 03/21] KVM: x86: Don't check for code breakpoints when emulating on exception Sean Christopherson
2022-07-06 11:43   ` Maxim Levitsky
2022-07-06 22:17   ` Jim Mattson
2022-06-14 20:47 ` [PATCH v2 04/21] KVM: nVMX: Treat General Detect #DB (DR7.GD=1) as fault-like Sean Christopherson
2022-07-06 11:45   ` Maxim Levitsky
2022-06-14 20:47 ` [PATCH v2 05/21] KVM: nVMX: Prioritize TSS T-flag #DBs over Monitor Trap Flag Sean Christopherson
2022-07-06 11:57   ` Maxim Levitsky
2022-07-06 23:51   ` Jim Mattson
2022-07-07 17:14     ` Sean Christopherson
2022-06-14 20:47 ` [PATCH v2 06/21] KVM: x86: Treat #DBs from the emulator as fault-like (code and DR7.GD=1) Sean Christopherson
2022-07-06 11:57   ` Maxim Levitsky
2022-07-06 23:55   ` Jim Mattson
2022-07-07 17:19     ` Sean Christopherson
2022-06-14 20:47 ` [PATCH v2 07/21] KVM: x86: Use DR7_GD macro instead of open coding check in emulator Sean Christopherson
2022-07-06 11:58   ` Maxim Levitsky
2022-06-14 20:47 ` [PATCH v2 08/21] KVM: nVMX: Ignore SIPI that arrives in L2 when vCPU is not in WFS Sean Christopherson
2022-07-06 11:59   ` Maxim Levitsky
2022-06-14 20:47 ` [PATCH v2 09/21] KVM: nVMX: Unconditionally clear mtf_pending on nested VM-Exit Sean Christopherson
2022-07-06 12:00   ` Maxim Levitsky
2022-07-06 16:45     ` Sean Christopherson
2022-07-06 20:03       ` Maxim Levitsky
2022-06-14 20:47 ` [PATCH v2 10/21] KVM: VMX: Inject #PF on ENCLS as "emulated" #PF Sean Christopherson
2022-07-06 12:00   ` Maxim Levitsky
2022-06-14 20:47 ` [PATCH v2 11/21] KVM: x86: Rename kvm_x86_ops.queue_exception to inject_exception Sean Christopherson
2022-07-06 12:01   ` Maxim Levitsky
2022-06-14 20:47 ` [PATCH v2 12/21] KVM: x86: Make kvm_queued_exception a properly named, visible struct Sean Christopherson
2022-07-06 12:02   ` Maxim Levitsky
2022-07-18 13:07   ` Maxim Levitsky
2022-07-18 13:10     ` Maxim Levitsky
2022-07-18 15:40       ` Sean Christopherson
2022-06-14 20:47 ` [PATCH v2 13/21] KVM: x86: Formalize blocking of nested pending exceptions Sean Christopherson
2022-07-06 12:04   ` Maxim Levitsky
2022-07-06 17:36     ` Sean Christopherson
2022-07-06 20:03       ` Maxim Levitsky
2022-06-14 20:47 ` [PATCH v2 14/21] KVM: x86: Use kvm_queue_exception_e() to queue #DF Sean Christopherson
2022-07-06 12:04   ` Maxim Levitsky
2022-06-14 20:47 ` [PATCH v2 15/21] KVM: x86: Hoist nested event checks above event injection logic Sean Christopherson
2022-07-06 12:05   ` Maxim Levitsky
2022-06-14 20:47 ` [PATCH v2 16/21] KVM: x86: Evaluate ability to inject SMI/NMI/IRQ after potential VM-Exit Sean Christopherson
2022-07-06 12:05   ` Maxim Levitsky
2022-06-14 20:47 ` [PATCH v2 17/21] KVM: x86: Morph pending exceptions to pending VM-Exits at queue time Sean Christopherson
2022-07-06 12:15   ` Maxim Levitsky
2022-07-07  1:24     ` Sean Christopherson
2022-07-10 15:56       ` Maxim Levitsky
2022-07-11 15:22         ` Sean Christopherson
2022-06-14 20:47 ` [PATCH v2 18/21] KVM: x86: Treat pending TRIPLE_FAULT requests as pending exceptions Sean Christopherson
2022-07-06 12:16   ` Maxim Levitsky
2022-06-14 20:47 ` [PATCH v2 19/21] KVM: VMX: Update MTF and ICEBP comments to document KVM's subtle behavior Sean Christopherson
2022-06-14 20:47 ` [PATCH v2 20/21] KVM: selftests: Use uapi header to get VMX and SVM exit reasons/codes Sean Christopherson
2022-07-06 12:16   ` Maxim Levitsky
2022-06-14 20:47 ` [PATCH v2 21/21] KVM: selftests: Add an x86-only test to verify nested exception queueing Sean Christopherson
2022-07-06 12:17   ` Maxim Levitsky
2022-06-16 13:16 ` [PATCH v2 00/21] KVM: x86: Event/exception fixes and cleanups Maxim Levitsky
2022-06-29 11:16 ` Maxim Levitsky
2022-06-29 13:42   ` Jim Mattson
2022-06-30  8:22     ` Maxim Levitsky
2022-06-30 12:17       ` Jim Mattson
2022-06-30 13:10         ` Maxim Levitsky
2022-06-30 16:28       ` Jim Mattson
2022-07-01  7:37         ` Maxim Levitsky
2022-07-06 11:54     ` Maxim Levitsky
2022-07-06 17:13       ` Jim Mattson
2022-07-06 17:52         ` Sean Christopherson
2022-07-06 20:03           ` Maxim Levitsky
2022-07-06 20:11           ` Jim Mattson
2022-07-10 15:58             ` Maxim Levitsky
2022-06-29 15:53   ` Jim Mattson
2022-06-30  8:24     ` Maxim Levitsky
2022-06-30 12:20       ` Jim Mattson
