linux-kernel.vger.kernel.org archive mirror
* [PATCH v6 0/3] Introduce Notify VM exit
@ 2022-04-21  7:29 Chenyi Qiang
  2022-04-21  7:29 ` [PATCH v6 1/3] KVM: X86: Save&restore the triple fault request Chenyi Qiang
                   ` (4 more replies)
  0 siblings, 5 replies; 17+ messages in thread
From: Chenyi Qiang @ 2022-04-21  7:29 UTC (permalink / raw)
  To: Paolo Bonzini, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Xiaoyao Li
  Cc: kvm, linux-kernel

Virtual machines can exploit Intel ISA characteristics to cause
functional denial of service to the VMM. This series introduces a new
feature named Notify VM exit, which can help mitigate such kinds of
attacks.

Patch 1: An extension of the KVM_SET_VCPU_EVENTS ioctl to inject a
synthesized shutdown event from user space. This also fixes other
synthesized triple faults, e.g. on the RSM path or in nested_vmx_abort(),
which could get lost when KVM exits to userspace for migration.

Patch 2: A selftest about get/set triple fault event.

Patch 3: The main patch to enable Notify VM exit.

---
Change logs:
v5 -> v6
- Do some changes in the documentation.
- Add a selftest for getting/setting the triple fault event. (Sean)
- Extend the argument to include both the notify window and some flags
  when enabling the KVM_CAP_X86_NOTIFY_VMEXIT CAP. (Sean)
- Change to use KVM_VCPUEVENT_VALID_TRIPLE_FAULT in the flags field and
  add a triple_fault_pending field in struct kvm_vcpu_events, which
  allows userspace to make/clear the triple fault request. (Sean)
- Add a flag in kvm_x86_ops to avoid the kvm_has_notify_vmexit global
  variable and its export. (Sean)
- v5: https://lore.kernel.org/lkml/20220318074955.22428-1-chenyi.qiang@intel.com/

v4 -> v5
- Rename KVM_VCPUEVENTS_SHUTDOWN to KVM_VCPUEVENTS_TRIPLE_FAULT. Make it
  bidirectional and add it to get_vcpu_events. (Sean)
- v4: https://lore.kernel.org/all/20220310084001.10235-1-chenyi.qiang@intel.com/

v3 -> v4
- Change this feature to per-VM scope. (Jim)
- Once VM_CONTEXT_INVALID is set in exit_qualification, exit to user
  space to notify it of this fatal case, especially when the notify VM
  exit happens in L2. (Jim)
- Extend KVM_SET_VCPU_EVENTS to allow user space to inject a shutdown
  event. (Jim)
- Minor code changes.
- Add documentation for the new KVM capability.
- v3: https://lore.kernel.org/lkml/20220223062412.22334-1-chenyi.qiang@intel.com/

v2 -> v3
- Add a vcpu stat notify_window_exits to record the number of
  occurrences, as well as a pr_warn output. (Sean)
- Add handling in nested VMs to prevent L1 from bypassing the
  restriction by launching an L2. (Sean)
- Only kill L2 when the L2 VM is context invalid; synthesize an
  EXIT_REASON_TRIPLE_FAULT to L1. (Sean)
- To ease the current implementation, make the module parameter
  notify_window read-only. (Sean)
- Disable notify window exit by default.
- v2: https://lore.kernel.org/lkml/20210525051204.1480610-1-tao3.xu@intel.com/

v1 -> v2
- Set the notify window to 0 by default; a value less than 0 disables it.
- Add more description in commit message.
---

Chenyi Qiang (2):
  KVM: X86: Save&restore the triple fault request
  KVM: selftests: Add a test to get/set triple fault event

Tao Xu (1):
  KVM: VMX: Enable Notify VM exit

 Documentation/virt/kvm/api.rst                | 55 +++++++++++
 arch/x86/include/asm/kvm_host.h               |  9 ++
 arch/x86/include/asm/vmx.h                    |  7 ++
 arch/x86/include/asm/vmxfeatures.h            |  1 +
 arch/x86/include/uapi/asm/kvm.h               |  4 +-
 arch/x86/include/uapi/asm/vmx.h               |  4 +-
 arch/x86/kvm/vmx/capabilities.h               |  6 ++
 arch/x86/kvm/vmx/nested.c                     |  8 ++
 arch/x86/kvm/vmx/vmx.c                        | 48 +++++++++-
 arch/x86/kvm/x86.c                            | 33 ++++++-
 arch/x86/kvm/x86.h                            |  5 +
 include/uapi/linux/kvm.h                      | 10 ++
 tools/testing/selftests/kvm/.gitignore        |  1 +
 tools/testing/selftests/kvm/Makefile          |  1 +
 .../kvm/x86_64/triple_fault_event_test.c      | 96 +++++++++++++++++++
 15 files changed, 280 insertions(+), 8 deletions(-)
 create mode 100644 tools/testing/selftests/kvm/x86_64/triple_fault_event_test.c

-- 
2.17.1



* [PATCH v6 1/3] KVM: X86: Save&restore the triple fault request
  2022-04-21  7:29 [PATCH v6 0/3] Introduce Notify VM exit Chenyi Qiang
@ 2022-04-21  7:29 ` Chenyi Qiang
  2022-05-18 18:42   ` Sean Christopherson
  2022-04-21  7:29 ` [PATCH v6 2/3] KVM: selftests: Add a test to get/set triple fault event Chenyi Qiang
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 17+ messages in thread
From: Chenyi Qiang @ 2022-04-21  7:29 UTC (permalink / raw)
  To: Paolo Bonzini, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Xiaoyao Li
  Cc: kvm, linux-kernel, Chenyi Qiang

For a triple fault synthesized by KVM, e.g. on the RSM path or in
nested_vmx_abort(), if KVM exits to userspace before the request is
serviced, userspace could migrate the VM and lose the triple fault.

Add support to save and restore the triple fault event from
userspace. Introduce a new flag KVM_VCPUEVENT_VALID_TRIPLE_FAULT in
get/set_vcpu_events to track the triple fault request.

Note that in the set_vcpu_events path, userspace is able to set/clear
the triple fault request through the triple_fault_pending field.
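
For illustration, a minimal userspace sketch (not part of this patch;
the helper name and the two vCPU fds are hypothetical) of how a VMM
could carry the pending triple fault across a save/restore or migration:

	#include <err.h>
	#include <string.h>
	#include <sys/ioctl.h>
	#include <linux/kvm.h>

	/* Copy pending vCPU events, including the new triple fault
	 * state, from a source vCPU fd to a destination vCPU fd. */
	static void migrate_vcpu_events(int src_vcpu_fd, int dst_vcpu_fd)
	{
		struct kvm_vcpu_events events;

		memset(&events, 0, sizeof(events));
		if (ioctl(src_vcpu_fd, KVM_GET_VCPU_EVENTS, &events) < 0)
			err(1, "KVM_GET_VCPU_EVENTS");

		/*
		 * With this patch, events.flags already contains
		 * KVM_VCPUEVENT_VALID_TRIPLE_FAULT and
		 * events.triple_fault_pending reflects the pending
		 * KVM_REQ_TRIPLE_FAULT, so writing the struct back
		 * re-creates (or clears) the request on the destination.
		 */
		if (ioctl(dst_vcpu_fd, KVM_SET_VCPU_EVENTS, &events) < 0)
			err(1, "KVM_SET_VCPU_EVENTS");
	}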

Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
---
 Documentation/virt/kvm/api.rst  |  7 +++++++
 arch/x86/include/uapi/asm/kvm.h |  4 +++-
 arch/x86/kvm/x86.c              | 15 +++++++++++++--
 3 files changed, 23 insertions(+), 3 deletions(-)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 72183ae628f7..e09ce3cb49c5 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -1150,6 +1150,9 @@ The following bits are defined in the flags field:
   fields contain a valid state. This bit will be set whenever
   KVM_CAP_EXCEPTION_PAYLOAD is enabled.
 
+- KVM_VCPUEVENT_VALID_TRIPLE_FAULT may be set to signal that the
+  triple_fault_pending field contains a valid state.
+
 ARM64:
 ^^^^^^
 
@@ -1245,6 +1248,10 @@ can be set in the flags field to signal that the
 exception_has_payload, exception_payload, and exception.pending fields
 contain a valid state and shall be written into the VCPU.
 
+KVM_VCPUEVENT_VALID_TRIPLE_FAULT can be set in the flags field to signal that
+the triple_fault_pending field contains a valid state and shall be written
+into the VCPU.
+
 ARM64:
 ^^^^^^
 
diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
index 21614807a2cb..fd083f6337af 100644
--- a/arch/x86/include/uapi/asm/kvm.h
+++ b/arch/x86/include/uapi/asm/kvm.h
@@ -325,6 +325,7 @@ struct kvm_reinject_control {
 #define KVM_VCPUEVENT_VALID_SHADOW	0x00000004
 #define KVM_VCPUEVENT_VALID_SMM		0x00000008
 #define KVM_VCPUEVENT_VALID_PAYLOAD	0x00000010
+#define KVM_VCPUEVENT_VALID_TRIPLE_FAULT	0x00000020
 
 /* Interrupt shadow states */
 #define KVM_X86_SHADOW_INT_MOV_SS	0x01
@@ -359,7 +360,8 @@ struct kvm_vcpu_events {
 		__u8 smm_inside_nmi;
 		__u8 latched_init;
 	} smi;
-	__u8 reserved[27];
+	__u8 triple_fault_pending;
+	__u8 reserved[26];
 	__u8 exception_has_payload;
 	__u64 exception_payload;
 };
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index ab336f7c82e4..c8b9b0bc42aa 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4911,9 +4911,12 @@ static void kvm_vcpu_ioctl_x86_get_vcpu_events(struct kvm_vcpu *vcpu,
 		!!(vcpu->arch.hflags & HF_SMM_INSIDE_NMI_MASK);
 	events->smi.latched_init = kvm_lapic_latched_init(vcpu);
 
+	events->triple_fault_pending = kvm_test_request(KVM_REQ_TRIPLE_FAULT, vcpu);
+
 	events->flags = (KVM_VCPUEVENT_VALID_NMI_PENDING
 			 | KVM_VCPUEVENT_VALID_SHADOW
-			 | KVM_VCPUEVENT_VALID_SMM);
+			 | KVM_VCPUEVENT_VALID_SMM
+			 | KVM_VCPUEVENT_VALID_TRIPLE_FAULT);
 	if (vcpu->kvm->arch.exception_payload_enabled)
 		events->flags |= KVM_VCPUEVENT_VALID_PAYLOAD;
 
@@ -4929,7 +4932,8 @@ static int kvm_vcpu_ioctl_x86_set_vcpu_events(struct kvm_vcpu *vcpu,
 			      | KVM_VCPUEVENT_VALID_SIPI_VECTOR
 			      | KVM_VCPUEVENT_VALID_SHADOW
 			      | KVM_VCPUEVENT_VALID_SMM
-			      | KVM_VCPUEVENT_VALID_PAYLOAD))
+			      | KVM_VCPUEVENT_VALID_PAYLOAD
+			      | KVM_VCPUEVENT_VALID_TRIPLE_FAULT))
 		return -EINVAL;
 
 	if (events->flags & KVM_VCPUEVENT_VALID_PAYLOAD) {
@@ -5002,6 +5006,13 @@ static int kvm_vcpu_ioctl_x86_set_vcpu_events(struct kvm_vcpu *vcpu,
 		}
 	}
 
+	if (events->flags & KVM_VCPUEVENT_VALID_TRIPLE_FAULT) {
+		if (events->triple_fault_pending)
+			kvm_make_request(KVM_REQ_TRIPLE_FAULT, vcpu);
+		else
+			kvm_clear_request(KVM_REQ_TRIPLE_FAULT, vcpu);
+	}
+
 	kvm_make_request(KVM_REQ_EVENT, vcpu);
 
 	return 0;
-- 
2.17.1



* [PATCH v6 2/3] KVM: selftests: Add a test to get/set triple fault event
  2022-04-21  7:29 [PATCH v6 0/3] Introduce Notify VM exit Chenyi Qiang
  2022-04-21  7:29 ` [PATCH v6 1/3] KVM: X86: Save&restore the triple fault request Chenyi Qiang
@ 2022-04-21  7:29 ` Chenyi Qiang
  2022-05-18 19:20   ` Sean Christopherson
  2022-04-21  7:29 ` [PATCH v6 3/3] KVM: VMX: Enable Notify VM exit Chenyi Qiang
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 17+ messages in thread
From: Chenyi Qiang @ 2022-04-21  7:29 UTC (permalink / raw)
  To: Paolo Bonzini, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Xiaoyao Li
  Cc: kvm, linux-kernel, Chenyi Qiang

Add a selftest to get/set the triple fault event by:
  - exiting to userspace via I/O from L2;
  - using KVM_GET_VCPU_EVENTS to verify that
    KVM_VCPUEVENT_VALID_TRIPLE_FAULT is set in the flags field;
  - using KVM_SET_VCPU_EVENTS to inject a triple fault event so that L1
    can see the appropriate exit.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
---
 tools/testing/selftests/kvm/.gitignore        |  1 +
 tools/testing/selftests/kvm/Makefile          |  1 +
 .../kvm/x86_64/triple_fault_event_test.c      | 96 +++++++++++++++++++
 3 files changed, 98 insertions(+)
 create mode 100644 tools/testing/selftests/kvm/x86_64/triple_fault_event_test.c

diff --git a/tools/testing/selftests/kvm/.gitignore b/tools/testing/selftests/kvm/.gitignore
index 56140068b763..989fd4672f61 100644
--- a/tools/testing/selftests/kvm/.gitignore
+++ b/tools/testing/selftests/kvm/.gitignore
@@ -55,6 +55,7 @@
 /x86_64/xen_vmcall_test
 /x86_64/xss_msr_test
 /x86_64/vmx_pmu_msrs_test
+/x86_64/triple_fault_event_test
 /access_tracking_perf_test
 /demand_paging_test
 /dirty_log_test
diff --git a/tools/testing/selftests/kvm/Makefile b/tools/testing/selftests/kvm/Makefile
index af582d168621..ce147b50b9be 100644
--- a/tools/testing/selftests/kvm/Makefile
+++ b/tools/testing/selftests/kvm/Makefile
@@ -88,6 +88,7 @@ TEST_GEN_PROGS_x86_64 += x86_64/xen_shinfo_test
 TEST_GEN_PROGS_x86_64 += x86_64/xen_vmcall_test
 TEST_GEN_PROGS_x86_64 += x86_64/sev_migrate_tests
 TEST_GEN_PROGS_x86_64 += x86_64/amx_test
+TEST_GEN_PROGS_x86_64 += x86_64/triple_fault_event_test
 TEST_GEN_PROGS_x86_64 += access_tracking_perf_test
 TEST_GEN_PROGS_x86_64 += demand_paging_test
 TEST_GEN_PROGS_x86_64 += dirty_log_test
diff --git a/tools/testing/selftests/kvm/x86_64/triple_fault_event_test.c b/tools/testing/selftests/kvm/x86_64/triple_fault_event_test.c
new file mode 100644
index 000000000000..20bf8f9a61d7
--- /dev/null
+++ b/tools/testing/selftests/kvm/x86_64/triple_fault_event_test.c
@@ -0,0 +1,96 @@
+// SPDX-License-Identifier: GPL-2.0-only
+#include "test_util.h"
+#include "kvm_util.h"
+#include "processor.h"
+#include "vmx.h"
+
+#include <string.h>
+#include <sys/ioctl.h>
+
+#include "kselftest.h"
+
+#ifndef __x86_64__
+# error This test is 64-bit only
+#endif
+
+#define VCPU_ID			0
+#define ARBITRARY_IO_PORT	0x2000
+
+/* The virtual machine object. */
+static struct kvm_vm *vm;
+
+static void l2_guest_code(void)
+{
+	/*
+	 * Generate an exit to L0 userspace, i.e. main(), via I/O to an
+	 * arbitrary port.
+	 */
+	asm volatile("inb %%dx, %%al"
+		     : : [port] "d" (ARBITRARY_IO_PORT) : "rax");
+}
+
+void l1_guest_code(struct vmx_pages *vmx)
+{
+#define L2_GUEST_STACK_SIZE 64
+	unsigned long l2_guest_stack[L2_GUEST_STACK_SIZE];
+
+	GUEST_ASSERT(vmx->vmcs_gpa);
+	GUEST_ASSERT(prepare_for_vmx_operation(vmx));
+	GUEST_ASSERT(load_vmcs(vmx));
+
+	prepare_vmcs(vmx, l2_guest_code,
+		     &l2_guest_stack[L2_GUEST_STACK_SIZE]);
+
+	GUEST_ASSERT(!vmlaunch());
+	/* L2 should triple fault after a triple fault event is injected. */
+	GUEST_ASSERT(vmreadz(VM_EXIT_REASON) == EXIT_REASON_TRIPLE_FAULT);
+	GUEST_DONE();
+}
+
+int main(void)
+{
+	struct kvm_run *run;
+	struct kvm_vcpu_events events;
+	vm_vaddr_t vmx_pages_gva;
+	struct ucall uc;
+
+	if (!nested_vmx_supported()) {
+		print_skip("Nested VMX not supported");
+		exit(KSFT_SKIP);
+	}
+
+	vm = vm_create_default(VCPU_ID, 0, (void *) l1_guest_code);
+
+	vcpu_alloc_vmx(vm, &vmx_pages_gva);
+	vcpu_args_set(vm, VCPU_ID, 1, vmx_pages_gva);
+
+	vcpu_run(vm, VCPU_ID);
+
+	run = vcpu_state(vm, VCPU_ID);
+
+	TEST_ASSERT(run->exit_reason == KVM_EXIT_IO,
+		    "Expected KVM_EXIT_IO, got: %u (%s)\n",
+		    run->exit_reason, exit_reason_str(run->exit_reason));
+
+	TEST_ASSERT(run->io.port == ARBITRARY_IO_PORT,
+		    "Expected IN from port %d from L2, got port %d",
+		    ARBITRARY_IO_PORT, run->io.port);
+
+	vcpu_events_get(vm, VCPU_ID, &events);
+	TEST_ASSERT(events.flags & KVM_VCPUEVENT_VALID_TRIPLE_FAULT,
+		    "Triple fault event invalid");
+	events.triple_fault_pending = true;
+	vcpu_events_set(vm, VCPU_ID, &events);
+
+	vcpu_run(vm, VCPU_ID);
+
+	switch (get_ucall(vm, VCPU_ID, &uc)) {
+	case UCALL_DONE:
+		break;
+	case UCALL_ABORT:
+		TEST_FAIL("%s", (const char *)uc.args[0]);
+	default:
+		TEST_FAIL("Unexpected ucall: %lu", uc.cmd);
+	}
+
+}
-- 
2.17.1



* [PATCH v6 3/3] KVM: VMX: Enable Notify VM exit
  2022-04-21  7:29 [PATCH v6 0/3] Introduce Notify VM exit Chenyi Qiang
  2022-04-21  7:29 ` [PATCH v6 1/3] KVM: X86: Save&restore the triple fault request Chenyi Qiang
  2022-04-21  7:29 ` [PATCH v6 2/3] KVM: selftests: Add a test to get/set triple fault event Chenyi Qiang
@ 2022-04-21  7:29 ` Chenyi Qiang
  2022-05-17  0:59   ` Chenyi Qiang
  2022-05-18 22:30   ` Sean Christopherson
  2022-05-06  2:43 ` [PATCH v6 0/3] Introduce " Chenyi Qiang
  2022-05-23 19:30 ` Paolo Bonzini
  4 siblings, 2 replies; 17+ messages in thread
From: Chenyi Qiang @ 2022-04-21  7:29 UTC (permalink / raw)
  To: Paolo Bonzini, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Xiaoyao Li
  Cc: kvm, linux-kernel, Chenyi Qiang

From: Tao Xu <tao3.xu@intel.com>

There are cases where malicious virtual machines can cause the CPU to
get stuck (because event windows don't open up), e.g., the infinite
loop in microcode on a nested #AC (CVE-2015-5307). No event window
means no event (NMI, SMI or IRQ) can be delivered, which leaves the
CPU unavailable to the host or other VMs.

VMM can enable Notify VM exit, whereby a VM exit is generated if no
event window occurs in VM non-root mode for a specified amount of time
(the notify window).

Feature enabling:
- The new vmcs field SECONDARY_EXEC_NOTIFY_VM_EXITING is introduced to
  enable this feature. VMM can set the NOTIFY_WINDOW vmcs field to
  adjust the expected notify window.
- Add a new KVM capability KVM_CAP_X86_NOTIFY_VMEXIT so that user space
  can query and enable this feature in per-VM scope (see the sketch
  after this list). The argument is a 64-bit value: bits 63:32 are used
  for the notify window, and bits 31:0 are for flags. Currently
  supported flags:
  - KVM_X86_NOTIFY_VMEXIT_ENABLED: enable the feature with the notify
    window provided.
  - KVM_X86_NOTIFY_VMEXIT_USER: exit to userspace when such exits happen.
- It's safe to even set the notify window to zero, since an internal
  hardware threshold is added to vmcs.notify_window.
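
A minimal sketch (illustrative only, not part of this patch; vm_fd is
assumed to be an open KVM VM fd) of how userspace could enable the
capability:

	#include <err.h>
	#include <string.h>
	#include <sys/ioctl.h>
	#include <linux/kvm.h>

	static void enable_notify_vmexit(int vm_fd, __u32 notify_window)
	{
		struct kvm_enable_cap cap;

		memset(&cap, 0, sizeof(cap));
		cap.cap = KVM_CAP_X86_NOTIFY_VMEXIT;
		/* Bits 63:32: notify window; bits 31:0: flags. */
		cap.args[0] = ((__u64)notify_window << 32) |
			      KVM_X86_NOTIFY_VMEXIT_ENABLED |
			      KVM_X86_NOTIFY_VMEXIT_USER;

		if (ioctl(vm_fd, KVM_ENABLE_CAP, &cap) < 0)
			err(1, "KVM_ENABLE_CAP(KVM_CAP_X86_NOTIFY_VMEXIT)");
	}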

VM exit handling:
- Introduce a vcpu stat notify_window_exits to record the count of
  notify VM exits and expose it through debugfs.
- A notify VM exit can happen incident to delivery of a vector event.
  Allow it in KVM.
- Exit to userspace unconditionally for handling when the
  VM_CONTEXT_INVALID bit is set.

Nested handling:
- Nested notify VM exits are not supported yet. Keep the same notify
  window control in vmcs02 as in vmcs01, so that L1 can't escape the
  restriction of notify VM exits by launching an L2 VM.
- When the L2 VM is context invalid, user space should synthesize a
  shutdown event to the vcpu (a hypothetical handler is sketched below).
  KVM makes a KVM_REQ_TRIPLE_FAULT request accordingly, which synthesizes
  a nested triple fault exit to the L1 hypervisor to kill L2.

Notify VM exit is defined in the latest Intel Architecture Instruction
Set Extensions Programming Reference, chapter 9.2.
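
As a hypothetical example (not part of this patch; handle_notify_exit
and vcpu_fd are illustrative, and the includes match the earlier
sketches), a VMM's vcpu run loop could handle the new exit like this,
synthesizing the shutdown via the triple fault event from patch 1:

	/* Called when KVM_RUN returns with
	 * run->exit_reason == KVM_EXIT_NOTIFY. */
	static void handle_notify_exit(int vcpu_fd, struct kvm_run *run)
	{
		if (run->notify.flags & KVM_NOTIFY_CONTEXT_INVALID) {
			struct kvm_vcpu_events events;

			/* The VM context is corrupted; resuming the
			 * guest is unsafe, so request a triple fault
			 * (shutdown) instead. */
			if (ioctl(vcpu_fd, KVM_GET_VCPU_EVENTS, &events) < 0)
				err(1, "KVM_GET_VCPU_EVENTS");
			events.flags |= KVM_VCPUEVENT_VALID_TRIPLE_FAULT;
			events.triple_fault_pending = 1;
			if (ioctl(vcpu_fd, KVM_SET_VCPU_EVENTS, &events) < 0)
				err(1, "KVM_SET_VCPU_EVENTS");
		}
		/* Otherwise, e.g. log the event and resume the vCPU. */
	}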

Co-developed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Tao Xu <tao3.xu@intel.com>
Co-developed-by: Chenyi Qiang <chenyi.qiang@intel.com>
Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
---
 Documentation/virt/kvm/api.rst     | 48 ++++++++++++++++++++++++++++++
 arch/x86/include/asm/kvm_host.h    |  9 ++++++
 arch/x86/include/asm/vmx.h         |  7 +++++
 arch/x86/include/asm/vmxfeatures.h |  1 +
 arch/x86/include/uapi/asm/vmx.h    |  4 ++-
 arch/x86/kvm/vmx/capabilities.h    |  6 ++++
 arch/x86/kvm/vmx/nested.c          |  8 +++++
 arch/x86/kvm/vmx/vmx.c             | 48 ++++++++++++++++++++++++++++--
 arch/x86/kvm/x86.c                 | 18 ++++++++++-
 arch/x86/kvm/x86.h                 |  5 ++++
 include/uapi/linux/kvm.h           | 10 +++++++
 11 files changed, 159 insertions(+), 5 deletions(-)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index e09ce3cb49c5..821ca979234d 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -6321,6 +6321,26 @@ array field represents return values. The userspace should update the return
 values of SBI call before resuming the VCPU. For more details on RISC-V SBI
 spec refer, https://github.com/riscv/riscv-sbi-doc.
 
+::
+
+    /* KVM_EXIT_NOTIFY */
+    struct {
+  #define KVM_NOTIFY_CONTEXT_INVALID	(1 << 0)
+      __u32 flags;
+    } notify;
+
+Used on x86 systems. When the VM capability KVM_CAP_X86_NOTIFY_VMEXIT is
+enabled, a VM exit is generated if no event window occurs in VM non-root
+mode for a specified amount of time. If KVM_X86_NOTIFY_VMEXIT_USER was set
+when enabling the cap, KVM exits to userspace with the exit reason
+KVM_EXIT_NOTIFY for further handling. The "flags" field contains more
+detailed info.
+
+The valid value for 'flags' is:
+
+  - KVM_NOTIFY_CONTEXT_INVALID -- the VM context is corrupted and not valid
+    in the VMCS. Resuming the target VM may lead to unpredictable results.
+
 ::
 
 		/* Fix the size of the union. */
@@ -7266,6 +7286,34 @@ The valid bits in cap.args[0] are:
                                     generate a #UD within the guest.
 =================================== ============================================
 
+7.31 KVM_CAP_X86_NOTIFY_VMEXIT
+------------------------------
+
+:Architectures: x86
+:Target: VM
+:Parameters: args[0] is the value of notify window as well as some flags
+:Returns: 0 on success, -EINVAL if hardware doesn't support notify VM exit.
+
+Bits 63:32 of args[0] are used for the notify window.
+Bits 31:0 of args[0] are used for flags. Valid bits are::
+
+  #define KVM_X86_NOTIFY_VMEXIT_ENABLED    (1 << 0)
+  #define KVM_X86_NOTIFY_VMEXIT_USER       (1 << 1)
+
+This capability allows userspace to configure the notify VM exit on/off
+in per-VM scope during VM creation. Notify VM exit is disabled by default.
+When userspace sets the KVM_X86_NOTIFY_VMEXIT_ENABLED bit in args[0], KVM
+enables this feature with the notify window provided, which generates a
+VM exit if no event window occurs in VM non-root mode for the specified
+amount of time (the notify window).
+
+If KVM_X86_NOTIFY_VMEXIT_USER is set in args[0], KVM exits to userspace
+for handling whenever a notify VM exit happens.
+
+This capability aims to mitigate the threat that malicious VMs can cause
+the CPU to get stuck (because event windows don't open up) and make the
+CPU unavailable to the host or other VMs.
+
 8. Other capabilities.
 ======================
 
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 2c20f715f009..28a42658eec5 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -65,6 +65,9 @@
 #define KVM_BUS_LOCK_DETECTION_VALID_MODE	(KVM_BUS_LOCK_DETECTION_OFF | \
 						 KVM_BUS_LOCK_DETECTION_EXIT)
 
+#define KVM_X86_NOTIFY_VMEXIT_VALID_BITS	(KVM_X86_NOTIFY_VMEXIT_ENABLED | \
+						 KVM_X86_NOTIFY_VMEXIT_USER)
+
 /* x86-specific vcpu->requests bit members */
 #define KVM_REQ_MIGRATE_TIMER		KVM_ARCH_REQ(0)
 #define KVM_REQ_REPORT_TPR_ACCESS	KVM_ARCH_REQ(1)
@@ -1157,6 +1160,9 @@ struct kvm_arch {
 
 	bool bus_lock_detection_enabled;
 	bool enable_pmu;
+
+	u32 notify_window;
+	u32 notify_vmexit_flags;
 	/*
 	 * If exit_on_emulation_error is set, and the in-kernel instruction
 	 * emulator fails to emulate an instruction, allow userspace
@@ -1293,6 +1299,7 @@ struct kvm_vcpu_stat {
 	u64 directed_yield_attempted;
 	u64 directed_yield_successful;
 	u64 guest_mode;
+	u64 notify_window_exits;
 };
 
 struct x86_instruction_info;
@@ -1504,6 +1511,8 @@ struct kvm_x86_ops {
 	 * Returns vCPU specific APICv inhibit reasons
 	 */
 	unsigned long (*vcpu_get_apicv_inhibit_reasons)(struct kvm_vcpu *vcpu);
+
+	bool has_notify_vmexit;
 };
 
 struct kvm_x86_nested_ops {
diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index 6c343c6a1855..2aa266007873 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -74,6 +74,7 @@
 #define SECONDARY_EXEC_TSC_SCALING              VMCS_CONTROL_BIT(TSC_SCALING)
 #define SECONDARY_EXEC_ENABLE_USR_WAIT_PAUSE	VMCS_CONTROL_BIT(USR_WAIT_PAUSE)
 #define SECONDARY_EXEC_BUS_LOCK_DETECTION	VMCS_CONTROL_BIT(BUS_LOCK_DETECTION)
+#define SECONDARY_EXEC_NOTIFY_VM_EXITING	VMCS_CONTROL_BIT(NOTIFY_VM_EXITING)
 
 #define PIN_BASED_EXT_INTR_MASK                 VMCS_CONTROL_BIT(INTR_EXITING)
 #define PIN_BASED_NMI_EXITING                   VMCS_CONTROL_BIT(NMI_EXITING)
@@ -269,6 +270,7 @@ enum vmcs_field {
 	SECONDARY_VM_EXEC_CONTROL       = 0x0000401e,
 	PLE_GAP                         = 0x00004020,
 	PLE_WINDOW                      = 0x00004022,
+	NOTIFY_WINDOW                   = 0x00004024,
 	VM_INSTRUCTION_ERROR            = 0x00004400,
 	VM_EXIT_REASON                  = 0x00004402,
 	VM_EXIT_INTR_INFO               = 0x00004404,
@@ -553,6 +555,11 @@ enum vm_entry_failure_code {
 #define EPT_VIOLATION_GVA_IS_VALID	(1 << EPT_VIOLATION_GVA_IS_VALID_BIT)
 #define EPT_VIOLATION_GVA_TRANSLATED	(1 << EPT_VIOLATION_GVA_TRANSLATED_BIT)
 
+/*
+ * Exit Qualifications for NOTIFY VM EXIT
+ */
+#define NOTIFY_VM_CONTEXT_INVALID     BIT(0)
+
 /*
  * VM-instruction error numbers
  */
diff --git a/arch/x86/include/asm/vmxfeatures.h b/arch/x86/include/asm/vmxfeatures.h
index d9a74681a77d..15f0f2ab4f95 100644
--- a/arch/x86/include/asm/vmxfeatures.h
+++ b/arch/x86/include/asm/vmxfeatures.h
@@ -84,5 +84,6 @@
 #define VMX_FEATURE_USR_WAIT_PAUSE	( 2*32+ 26) /* Enable TPAUSE, UMONITOR, UMWAIT in guest */
 #define VMX_FEATURE_ENCLV_EXITING	( 2*32+ 28) /* "" VM-Exit on ENCLV (leaf dependent) */
 #define VMX_FEATURE_BUS_LOCK_DETECTION	( 2*32+ 30) /* "" VM-Exit when bus lock caused */
+#define VMX_FEATURE_NOTIFY_VM_EXITING	( 2*32+ 31) /* VM-Exit when no event windows after notify window */
 
 #endif /* _ASM_X86_VMXFEATURES_H */
diff --git a/arch/x86/include/uapi/asm/vmx.h b/arch/x86/include/uapi/asm/vmx.h
index 946d761adbd3..a5faf6d88f1b 100644
--- a/arch/x86/include/uapi/asm/vmx.h
+++ b/arch/x86/include/uapi/asm/vmx.h
@@ -91,6 +91,7 @@
 #define EXIT_REASON_UMWAIT              67
 #define EXIT_REASON_TPAUSE              68
 #define EXIT_REASON_BUS_LOCK            74
+#define EXIT_REASON_NOTIFY              75
 
 #define VMX_EXIT_REASONS \
 	{ EXIT_REASON_EXCEPTION_NMI,         "EXCEPTION_NMI" }, \
@@ -153,7 +154,8 @@
 	{ EXIT_REASON_XRSTORS,               "XRSTORS" }, \
 	{ EXIT_REASON_UMWAIT,                "UMWAIT" }, \
 	{ EXIT_REASON_TPAUSE,                "TPAUSE" }, \
-	{ EXIT_REASON_BUS_LOCK,              "BUS_LOCK" }
+	{ EXIT_REASON_BUS_LOCK,              "BUS_LOCK" }, \
+	{ EXIT_REASON_NOTIFY,                "NOTIFY" }
 
 #define VMX_EXIT_REASON_FLAGS \
 	{ VMX_EXIT_REASONS_FAILED_VMENTRY,	"FAILED_VMENTRY" }
diff --git a/arch/x86/kvm/vmx/capabilities.h b/arch/x86/kvm/vmx/capabilities.h
index 3f430e218375..0102a6e8a194 100644
--- a/arch/x86/kvm/vmx/capabilities.h
+++ b/arch/x86/kvm/vmx/capabilities.h
@@ -417,4 +417,10 @@ static inline u64 vmx_supported_debugctl(void)
 	return debugctl;
 }
 
+static inline bool cpu_has_notify_vmexit(void)
+{
+	return vmcs_config.cpu_based_2nd_exec_ctrl &
+		SECONDARY_EXEC_NOTIFY_VM_EXITING;
+}
+
 #endif /* __KVM_X86_VMX_CAPS_H */
diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index a6688663da4d..5f74aae96031 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -2133,6 +2133,8 @@ static u64 nested_vmx_calc_efer(struct vcpu_vmx *vmx, struct vmcs12 *vmcs12)
 
 static void prepare_vmcs02_constant_state(struct vcpu_vmx *vmx)
 {
+	struct kvm *kvm = vmx->vcpu.kvm;
+
 	/*
 	 * If vmcs02 hasn't been initialized, set the constant vmcs02 state
 	 * according to L0's settings (vmcs12 is irrelevant here).  Host
@@ -2175,6 +2177,9 @@ static void prepare_vmcs02_constant_state(struct vcpu_vmx *vmx)
 	if (cpu_has_vmx_encls_vmexit())
 		vmcs_write64(ENCLS_EXITING_BITMAP, INVALID_GPA);
 
+	if (kvm_notify_vmexit_enabled(kvm))
+		vmcs_write32(NOTIFY_WINDOW, kvm->arch.notify_window);
+
 	/*
 	 * Set the MSR load/store lists to match L0's settings.  Only the
 	 * addresses are constant (for vmcs02), the counts can change based
@@ -6107,6 +6112,9 @@ static bool nested_vmx_l1_wants_exit(struct kvm_vcpu *vcpu,
 			SECONDARY_EXEC_ENABLE_USR_WAIT_PAUSE);
 	case EXIT_REASON_ENCLS:
 		return nested_vmx_exit_handled_encls(vcpu, vmcs12);
+	case EXIT_REASON_NOTIFY:
+		/* Notify VM exit is not exposed to L1 */
+		return false;
 	default:
 		return true;
 	}
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index cf8581978bce..4ecfca00ff9e 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -2472,7 +2472,8 @@ static __init int setup_vmcs_config(struct vmcs_config *vmcs_conf,
 			SECONDARY_EXEC_PT_USE_GPA |
 			SECONDARY_EXEC_PT_CONCEAL_VMX |
 			SECONDARY_EXEC_ENABLE_VMFUNC |
-			SECONDARY_EXEC_BUS_LOCK_DETECTION;
+			SECONDARY_EXEC_BUS_LOCK_DETECTION |
+			SECONDARY_EXEC_NOTIFY_VM_EXITING;
 		if (cpu_has_sgx())
 			opt2 |= SECONDARY_EXEC_ENCLS_EXITING;
 		if (adjust_vmx_controls(min2, opt2,
@@ -4357,6 +4358,9 @@ static u32 vmx_secondary_exec_control(struct vcpu_vmx *vmx)
 	if (!vcpu->kvm->arch.bus_lock_detection_enabled)
 		exec_control &= ~SECONDARY_EXEC_BUS_LOCK_DETECTION;
 
+	if (!kvm_notify_vmexit_enabled(vcpu->kvm))
+		exec_control &= ~SECONDARY_EXEC_NOTIFY_VM_EXITING;
+
 	return exec_control;
 }
 
@@ -4364,6 +4368,8 @@ static u32 vmx_secondary_exec_control(struct vcpu_vmx *vmx)
 
 static void init_vmcs(struct vcpu_vmx *vmx)
 {
+	struct kvm *kvm = vmx->vcpu.kvm;
+
 	if (nested)
 		nested_vmx_set_vmcs_shadowing_bitmap();
 
@@ -4392,12 +4398,15 @@ static void init_vmcs(struct vcpu_vmx *vmx)
 		vmcs_write64(POSTED_INTR_DESC_ADDR, __pa((&vmx->pi_desc)));
 	}
 
-	if (!kvm_pause_in_guest(vmx->vcpu.kvm)) {
+	if (!kvm_pause_in_guest(kvm)) {
 		vmcs_write32(PLE_GAP, ple_gap);
 		vmx->ple_window = ple_window;
 		vmx->ple_window_dirty = true;
 	}
 
+	if (kvm_notify_vmexit_enabled(kvm))
+		vmcs_write32(NOTIFY_WINDOW, kvm->arch.notify_window);
+
 	vmcs_write32(PAGE_FAULT_ERROR_CODE_MASK, 0);
 	vmcs_write32(PAGE_FAULT_ERROR_CODE_MATCH, 0);
 	vmcs_write32(CR3_TARGET_COUNT, 0);           /* 22.2.1 */
@@ -5684,6 +5693,32 @@ static int handle_bus_lock_vmexit(struct kvm_vcpu *vcpu)
 	return 1;
 }
 
+static int handle_notify(struct kvm_vcpu *vcpu)
+{
+	unsigned long exit_qual = vmx_get_exit_qual(vcpu);
+	bool context_invalid = exit_qual & NOTIFY_VM_CONTEXT_INVALID;
+
+	++vcpu->stat.notify_window_exits;
+
+	/*
+	 * Notify VM exit happened while executing iret from NMI,
+	 * "blocked by NMI" bit has to be set before next VM entry.
+	 */
+	if (enable_vnmi && (exit_qual & INTR_INFO_UNBLOCK_NMI))
+		vmcs_set_bits(GUEST_INTERRUPTIBILITY_INFO,
+			      GUEST_INTR_STATE_NMI);
+
+	if (vcpu->kvm->arch.notify_vmexit_flags & KVM_X86_NOTIFY_VMEXIT_USER ||
+	    context_invalid) {
+		vcpu->run->exit_reason = KVM_EXIT_NOTIFY;
+		vcpu->run->notify.flags = context_invalid ?
+					  KVM_NOTIFY_CONTEXT_INVALID : 0;
+		return 0;
+	}
+
+	return 1;
+}
+
 /*
  * The exit handlers return 1 if the exit was handled fully and guest execution
  * may resume.  Otherwise they set the kvm_run parameter to indicate what needs
@@ -5741,6 +5776,7 @@ static int (*kvm_vmx_exit_handlers[])(struct kvm_vcpu *vcpu) = {
 	[EXIT_REASON_PREEMPTION_TIMER]	      = handle_preemption_timer,
 	[EXIT_REASON_ENCLS]		      = handle_encls,
 	[EXIT_REASON_BUS_LOCK]                = handle_bus_lock_vmexit,
+	[EXIT_REASON_NOTIFY]		      = handle_notify,
 };
 
 static const int kvm_vmx_max_exit_handlers =
@@ -6105,7 +6141,8 @@ static int __vmx_handle_exit(struct kvm_vcpu *vcpu, fastpath_t exit_fastpath)
 	     exit_reason.basic != EXIT_REASON_EPT_VIOLATION &&
 	     exit_reason.basic != EXIT_REASON_PML_FULL &&
 	     exit_reason.basic != EXIT_REASON_APIC_ACCESS &&
-	     exit_reason.basic != EXIT_REASON_TASK_SWITCH)) {
+	     exit_reason.basic != EXIT_REASON_TASK_SWITCH &&
+	     exit_reason.basic != EXIT_REASON_NOTIFY)) {
 		int ndata = 3;
 
 		vcpu->run->exit_reason = KVM_EXIT_INTERNAL_ERROR;
@@ -7841,6 +7878,8 @@ static struct kvm_x86_ops vmx_x86_ops __initdata = {
 	.complete_emulated_msr = kvm_complete_insn_gp,
 
 	.vcpu_deliver_sipi_vector = kvm_vcpu_deliver_sipi_vector,
+
+	.has_notify_vmexit = false,
 };
 
 static unsigned int vmx_handle_intel_pt_intr(void)
@@ -7979,6 +8018,9 @@ static __init int hardware_setup(void)
 	kvm_tsc_scaling_ratio_frac_bits = 48;
 	kvm_has_bus_lock_exit = cpu_has_vmx_bus_lock_detection();
 
+	if (cpu_has_notify_vmexit())
+		vmx_x86_ops.has_notify_vmexit = true;
+
 	set_bit(0, vmx_vpid_bitmap); /* 0 is reserved for host */
 
 	if (enable_ept)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index c8b9b0bc42aa..ad8a05349614 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -291,7 +291,8 @@ const struct _kvm_stats_desc kvm_vcpu_stats_desc[] = {
 	STATS_DESC_COUNTER(VCPU, nested_run),
 	STATS_DESC_COUNTER(VCPU, directed_yield_attempted),
 	STATS_DESC_COUNTER(VCPU, directed_yield_successful),
-	STATS_DESC_ICOUNTER(VCPU, guest_mode)
+	STATS_DESC_ICOUNTER(VCPU, guest_mode),
+	STATS_DESC_COUNTER(VCPU, notify_window_exits),
 };
 
 const struct kvm_stats_header kvm_vcpu_stats_header = {
@@ -4389,6 +4390,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 	case KVM_CAP_DISABLE_QUIRKS2:
 		r = KVM_X86_VALID_QUIRKS;
 		break;
+	case KVM_CAP_X86_NOTIFY_VMEXIT:
+		r = kvm_x86_ops.has_notify_vmexit;
+		break;
 	default:
 		break;
 	}
@@ -6090,6 +6094,18 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
 		}
 		mutex_unlock(&kvm->lock);
 		break;
+	case KVM_CAP_X86_NOTIFY_VMEXIT:
+		r = -EINVAL;
+		if ((u32)cap->args[0] & ~KVM_X86_NOTIFY_VMEXIT_VALID_BITS)
+			break;
+		if (!kvm_x86_ops.has_notify_vmexit)
+			break;
+		if (!(u32)cap->args[0] & KVM_X86_NOTIFY_VMEXIT_ENABLED)
+			break;
+		kvm->arch.notify_window = cap->args[0] >> 32;
+		kvm->arch.notify_vmexit_flags = (u32)cap->args[0];
+		r = 0;
+		break;
 	default:
 		r = -EINVAL;
 		break;
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index 588792f00334..6e1e2159cbd4 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -344,6 +344,11 @@ static inline bool kvm_cstate_in_guest(struct kvm *kvm)
 	return kvm->arch.cstate_in_guest;
 }
 
+static inline bool kvm_notify_vmexit_enabled(struct kvm *kvm)
+{
+	return kvm->arch.notify_vmexit_flags & KVM_X86_NOTIFY_VMEXIT_ENABLED;
+}
+
 enum kvm_intr_type {
 	/* Values are arbitrary, but must be non-zero. */
 	KVM_HANDLING_IRQ = 1,
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index dd1d8167e71f..86f1ca9c1514 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -270,6 +270,7 @@ struct kvm_xen_exit {
 #define KVM_EXIT_X86_BUS_LOCK     33
 #define KVM_EXIT_XEN              34
 #define KVM_EXIT_RISCV_SBI        35
+#define KVM_EXIT_NOTIFY           36
 
 /* For KVM_EXIT_INTERNAL_ERROR */
 /* Emulate instruction failed. */
@@ -490,6 +491,11 @@ struct kvm_run {
 			unsigned long args[6];
 			unsigned long ret[2];
 		} riscv_sbi;
+		/* KVM_EXIT_NOTIFY */
+		struct {
+#define KVM_NOTIFY_CONTEXT_INVALID	(1 << 0)
+			__u32 flags;
+		} notify;
 		/* Fix the size of the union. */
 		char padding[256];
 	};
@@ -1148,6 +1154,7 @@ struct kvm_ppc_resize_hpt {
 #define KVM_CAP_PMU_CAPABILITY 212
 #define KVM_CAP_DISABLE_QUIRKS2 213
 #define KVM_CAP_VM_TSC_CONTROL 214
+#define KVM_CAP_X86_NOTIFY_VMEXIT 215
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
@@ -2109,4 +2116,7 @@ struct kvm_stats_desc {
 /* Available with KVM_CAP_XSAVE2 */
 #define KVM_GET_XSAVE2		  _IOR(KVMIO,  0xcf, struct kvm_xsave)
 
+#define KVM_X86_NOTIFY_VMEXIT_ENABLED		(1ULL << 0)
+#define KVM_X86_NOTIFY_VMEXIT_USER		(1ULL << 1)
+
 #endif /* __LINUX_KVM_H */
-- 
2.17.1



* Re: [PATCH v6 0/3] Introduce Notify VM exit
  2022-04-21  7:29 [PATCH v6 0/3] Introduce Notify VM exit Chenyi Qiang
                   ` (2 preceding siblings ...)
  2022-04-21  7:29 ` [PATCH v6 3/3] KVM: VMX: Enable Notify VM exit Chenyi Qiang
@ 2022-05-06  2:43 ` Chenyi Qiang
  2022-05-23 19:30 ` Paolo Bonzini
  4 siblings, 0 replies; 17+ messages in thread
From: Chenyi Qiang @ 2022-05-06  2:43 UTC (permalink / raw)
  To: Paolo Bonzini, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Xiaoyao Li
  Cc: kvm, linux-kernel

Kindly ping for comments.

On 4/21/2022 3:29 PM, Chenyi Qiang wrote:
> Virtual machines can exploit Intel ISA characteristics to cause
> functional denial of service to the VMM. This series introduces a new
> feature named Notify VM exit, which can help mitigate such kinds of
> attacks.
> 
> [...]


* Re: [PATCH v6 3/3] KVM: VMX: Enable Notify VM exit
  2022-04-21  7:29 ` [PATCH v6 3/3] KVM: VMX: Enable Notify VM exit Chenyi Qiang
@ 2022-05-17  0:59   ` Chenyi Qiang
  2022-05-18 22:30   ` Sean Christopherson
  1 sibling, 0 replies; 17+ messages in thread
From: Chenyi Qiang @ 2022-05-17  0:59 UTC (permalink / raw)
  To: Paolo Bonzini, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Xiaoyao Li
  Cc: kvm, linux-kernel



On 4/21/2022 3:29 PM, Chenyi Qiang wrote:
> From: Tao Xu <tao3.xu@intel.com>
> 
> [...]
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index c8b9b0bc42aa..ad8a05349614 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -6090,6 +6094,18 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
>   		}
>   		mutex_unlock(&kvm->lock);
>   		break;
> +	case KVM_CAP_X86_NOTIFY_VMEXIT:
> +		r = -EINVAL;
> +		if ((u32)cap->args[0] & ~KVM_X86_NOTIFY_VMEXIT_VALID_BITS)
> +			break;
> +		if (!kvm_x86_ops.has_notify_vmexit)
> +			break;
> +		if (!(u32)cap->args[0] & KVM_X86_NOTIFY_VMEXIT_ENABLED)

The parentheses are missing here; this should be
!((u32)cap->args[0] & KVM_X86_NOTIFY_VMEXIT_ENABLED).
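
Due to C operator precedence, the unary ! binds more tightly than the
bitwise &, so the current check parses as

	(!(u32)cap->args[0]) & KVM_X86_NOTIFY_VMEXIT_ENABLED

which rejects the argument only when all 32 flag bits are clear,
instead of whenever KVM_X86_NOTIFY_VMEXIT_ENABLED is absent.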

> +			break;
> +		kvm->arch.notify_window = cap->args[0] >> 32;
> +		kvm->arch.notify_vmexit_flags = (u32)cap->args[0];
> +		r = 0;
> +		break;
>   	default:
>   		r = -EINVAL;
>   		break;
> diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
> index 588792f00334..6e1e2159cbd4 100644
> --- a/arch/x86/kvm/x86.h
> +++ b/arch/x86/kvm/x86.h
> @@ -344,6 +344,11 @@ static inline bool kvm_cstate_in_guest(struct kvm *kvm)
>   	return kvm->arch.cstate_in_guest;
>   }
>   
> +static inline bool kvm_notify_vmexit_enabled(struct kvm *kvm)
> +{
> +	return kvm->arch.notify_vmexit_flags & KVM_X86_NOTIFY_VMEXIT_ENABLED;
> +}
> +
>   enum kvm_intr_type {
>   	/* Values are arbitrary, but must be non-zero. */
>   	KVM_HANDLING_IRQ = 1,
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index dd1d8167e71f..86f1ca9c1514 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -270,6 +270,7 @@ struct kvm_xen_exit {
>   #define KVM_EXIT_X86_BUS_LOCK     33
>   #define KVM_EXIT_XEN              34
>   #define KVM_EXIT_RISCV_SBI        35
> +#define KVM_EXIT_NOTIFY           36
>   
>   /* For KVM_EXIT_INTERNAL_ERROR */
>   /* Emulate instruction failed. */
> @@ -490,6 +491,11 @@ struct kvm_run {
>   			unsigned long args[6];
>   			unsigned long ret[2];
>   		} riscv_sbi;
> +		/* KVM_EXIT_NOTIFY */
> +		struct {
> +#define KVM_NOTIFY_CONTEXT_INVALID	(1 << 0)
> +			__u32 flags;
> +		} notify;
>   		/* Fix the size of the union. */
>   		char padding[256];
>   	};
> @@ -1148,6 +1154,7 @@ struct kvm_ppc_resize_hpt {
>   #define KVM_CAP_PMU_CAPABILITY 212
>   #define KVM_CAP_DISABLE_QUIRKS2 213
>   #define KVM_CAP_VM_TSC_CONTROL 214
> +#define KVM_CAP_X86_NOTIFY_VMEXIT 215
>   
>   #ifdef KVM_CAP_IRQ_ROUTING
>   
> @@ -2109,4 +2116,7 @@ struct kvm_stats_desc {
>   /* Available with KVM_CAP_XSAVE2 */
>   #define KVM_GET_XSAVE2		  _IOR(KVMIO,  0xcf, struct kvm_xsave)
>   
> +#define KVM_X86_NOTIFY_VMEXIT_ENABLED		(1ULL << 0)
> +#define KVM_X86_NOTIFY_VMEXIT_USER		(1ULL << 1)
> +
>   #endif /* __LINUX_KVM_H */
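
As a usage note, a VMM consuming the new exit might look like the fragment
below (everything other than the uapi constants above is illustrative):

	switch (run->exit_reason) {
	...
	case KVM_EXIT_NOTIFY:
		if (run->notify.flags & KVM_NOTIFY_CONTEXT_INVALID) {
			/* VM context is corrupted and can't be recovered;
			 * the only sane option is to kill the VM. */
			fprintf(stderr, "notify exit: VM context invalid\n");
			exit(1);
		}
		/* Otherwise the exit is informational (delivered because
		 * KVM_X86_NOTIFY_VMEXIT_USER was enabled); log and resume. */
		break;
	}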


* Re: [PATCH v6 1/3] KVM: X86: Save&restore the triple fault request
  2022-04-21  7:29 ` [PATCH v6 1/3] KVM: X86: Save&restore the triple fault request Chenyi Qiang
@ 2022-05-18 18:42   ` Sean Christopherson
  2022-05-19  6:25     ` Chenyi Qiang
  0 siblings, 1 reply; 17+ messages in thread
From: Sean Christopherson @ 2022-05-18 18:42 UTC (permalink / raw)
  To: Chenyi Qiang
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Xiaoyao Li, kvm, linux-kernel

Nits on the shortlog...

Please don't capitalize x86, spell out "and" instead of using an ampersand (though
I think it can be omitted entirely), and since there are plenty of chars left, call
out that this is adding/extending KVM's ABI, e.g. it's not obvious from the shortlog
where/when the save+restore happens.

  KVM: x86: Extend KVM_{G,S}ET_VCPU_EVENTS to support pending triple fault

On Thu, Apr 21, 2022, Chenyi Qiang wrote:
> For the triple fault synthesized by KVM, e.g. the RSM path or
> nested_vmx_abort(), if KVM exits to userspace before the request is
> serviced, userspace could migrate the VM and lose the triple fault.
> 
> Add the support to save and restore the triple fault event from
> userspace. Introduce a new event KVM_VCPUEVENT_VALID_TRIPLE_FAULT in
> get/set_vcpu_events to track the triple fault request.
> 
> Note that in the set_vcpu_events path, userspace is able to set/clear
> the triple fault request through triple_fault_pending field.
> 
> Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
> ---
>  Documentation/virt/kvm/api.rst  |  7 +++++++
>  arch/x86/include/uapi/asm/kvm.h |  4 +++-
>  arch/x86/kvm/x86.c              | 15 +++++++++++++--
>  3 files changed, 23 insertions(+), 3 deletions(-)
> 
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index 72183ae628f7..e09ce3cb49c5 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -1150,6 +1150,9 @@ The following bits are defined in the flags field:
>    fields contain a valid state. This bit will be set whenever
>    KVM_CAP_EXCEPTION_PAYLOAD is enabled.
>  
> +- KVM_VCPUEVENT_VALID_TRIPLE_FAULT may be set to signal that the
> +  triple_fault_pending field contains a valid state.
> +
>  ARM64:
>  ^^^^^^
>  
> @@ -1245,6 +1248,10 @@ can be set in the flags field to signal that the
>  exception_has_payload, exception_payload, and exception.pending fields
>  contain a valid state and shall be written into the VCPU.
>  
> +KVM_VCPUEVENT_VALID_TRIPLE_FAULT can be set in flags field to signal that
> +the triple_fault_pending field contains a valid state and shall be written
> +into the VCPU.
> +
>  ARM64:
>  ^^^^^^
>  
> diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
> index 21614807a2cb..fd083f6337af 100644
> --- a/arch/x86/include/uapi/asm/kvm.h
> +++ b/arch/x86/include/uapi/asm/kvm.h
> @@ -325,6 +325,7 @@ struct kvm_reinject_control {
>  #define KVM_VCPUEVENT_VALID_SHADOW	0x00000004
>  #define KVM_VCPUEVENT_VALID_SMM		0x00000008
>  #define KVM_VCPUEVENT_VALID_PAYLOAD	0x00000010
> +#define KVM_VCPUEVENT_VALID_TRIPLE_FAULT	0x00000020
>  
>  /* Interrupt shadow states */
>  #define KVM_X86_SHADOW_INT_MOV_SS	0x01
> @@ -359,7 +360,8 @@ struct kvm_vcpu_events {
>  		__u8 smm_inside_nmi;
>  		__u8 latched_init;
>  	} smi;
> -	__u8 reserved[27];
> +	__u8 triple_fault_pending;

What about writing this as

	struct {
		__u8 pending;
	} triple_fault;

to match the other events?  It's kinda silly, but I find it easier to visually
identify the various events that are handled by kvm_vcpu_events.

> +	__u8 reserved[26];
>  	__u8 exception_has_payload;
>  	__u64 exception_payload;
>  };
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index ab336f7c82e4..c8b9b0bc42aa 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -4911,9 +4911,12 @@ static void kvm_vcpu_ioctl_x86_get_vcpu_events(struct kvm_vcpu *vcpu,
>  		!!(vcpu->arch.hflags & HF_SMM_INSIDE_NMI_MASK);
>  	events->smi.latched_init = kvm_lapic_latched_init(vcpu);
>  
> +	events->triple_fault_pending = kvm_test_request(KVM_REQ_TRIPLE_FAULT, vcpu);
> +
>  	events->flags = (KVM_VCPUEVENT_VALID_NMI_PENDING
>  			 | KVM_VCPUEVENT_VALID_SHADOW
> -			 | KVM_VCPUEVENT_VALID_SMM);
> +			 | KVM_VCPUEVENT_VALID_SMM
> +			 | KVM_VCPUEVENT_VALID_TRIPLE_FAULT);

Does setting KVM_VCPUEVENT_VALID_TRIPLE_FAULT need to be guarded with a capability,
a la KVM_CAP_EXCEPTION_PAYLOAD, so that migrating from a new KVM to an old KVM doesn't
fail?  Seems rather pointless since the VM is likely hosed either way...

>  	if (vcpu->kvm->arch.exception_payload_enabled)
>  		events->flags |= KVM_VCPUEVENT_VALID_PAYLOAD;


* Re: [PATCH v6 2/3] KVM: selftests: Add a test to get/set triple fault event
  2022-04-21  7:29 ` [PATCH v6 2/3] KVM: selftests: Add a test to get/set triple fault event Chenyi Qiang
@ 2022-05-18 19:20   ` Sean Christopherson
  2022-05-23  6:46     ` Chenyi Qiang
  0 siblings, 1 reply; 17+ messages in thread
From: Sean Christopherson @ 2022-05-18 19:20 UTC (permalink / raw)
  To: Chenyi Qiang
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Xiaoyao Li, kvm, linux-kernel

On Thu, Apr 21, 2022, Chenyi Qiang wrote:
> +#ifndef __x86_64__
> +# error This test is 64-bit only

No need, all of KVM selftests are 64-bit only.

> +#endif
> +
> +#define VCPU_ID			0
> +#define ARBITRARY_IO_PORT	0x2000
> +
> +/* The virtual machine object. */
> +static struct kvm_vm *vm;
> +
> +static void l2_guest_code(void)
> +{
> +	/*
> +	 * Generate an exit to L0 userspace, i.e. main(), via I/O to an
> +	 * arbitrary port.
> +	 */

I think we can test a "real" triple fault without too much effort by abusing
vcpu->run->request_interrupt_window.  E.g. map the run struct into L2, clear
PIN_BASED_EXT_INTR_MASK in vmcs12, and then in l2_guest_code() do:

	asm volatile("cli");

	run->request_interrupt_window = true;

	asm volatile("sti; ud2");

	asm volatile("inb %%dx, %%al"                                           
		     : : [port] "d" (ARBITRARY_IO_PORT) : "rax"); 	

The STI shadow will block the IRQ-window VM-Exit until after the ud2, i.e. until
after the triple fault is pending.

And then main() can

  1. verify it got KVM_EXIT_IRQ_WINDOW_OPEN with a pending triple fault
  2. clear the triple fault and re-enter L1+l2
  3. continue with the existing code, e.g. verify it got KVM_EXIT_IO, stuff a
     pending triple fault, etc...
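
Roughly, in selftest form (a sketch only: 'run' comes from vcpu_state(),
and triple_fault_pending is the field name used in this series):

	struct kvm_vcpu_events events;

	/* 1. L2 does STI;UD2 with the IRQ window requested, so the first
	 *    exit should be KVM_EXIT_IRQ_WINDOW_OPEN with the synthesized
	 *    triple fault still pending. */
	vcpu_run(vm, VCPU_ID);
	TEST_ASSERT(run->exit_reason == KVM_EXIT_IRQ_WINDOW_OPEN,
		    "Unexpected exit reason: %u", run->exit_reason);
	vcpu_events_get(vm, VCPU_ID, &events);
	TEST_ASSERT(events.triple_fault_pending, "Triple fault not pending");

	/* 2. Clear the pending triple fault and re-enter L1+L2. */
	events.flags = KVM_VCPUEVENT_VALID_TRIPLE_FAULT;
	events.triple_fault_pending = 0;
	vcpu_events_set(vm, VCPU_ID, &events);

	/* 3. Continue with the existing flow, e.g. expect KVM_EXIT_IO from
	 *    the port I/O and then stuff a pending triple fault. */
	vcpu_run(vm, VCPU_ID);
	TEST_ASSERT(run->exit_reason == KVM_EXIT_IO, "Expected KVM_EXIT_IO");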

That said, typing that out makes me think we should technically handle/prevent this
in KVM since triple fault / shutdown is kinda sorta just a special exception.

Heh, and a potentially bad/crazy idea would be to use a reserved/magic vector in
kvm_vcpu_events.exception to save/restore triple fault, e.g. we could steal
NMI_VECTOR.

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index f2d0ee9296b9..d58c0cfd3cd3 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4743,7 +4743,8 @@ static int kvm_vcpu_ready_for_interrupt_injection(struct kvm_vcpu *vcpu)
        return (kvm_arch_interrupt_allowed(vcpu) &&
                kvm_cpu_accept_dm_intr(vcpu) &&
                !kvm_event_needs_reinjection(vcpu) &&
-               !vcpu->arch.exception.pending);
+               !vcpu->arch.exception.pending &&
+               !kvm_test_request(KVM_REQ_TRIPLE_FAULT, vcpu));
 }

 static int kvm_vcpu_ioctl_interrupt(struct kvm_vcpu *vcpu,





* Re: [PATCH v6 3/3] KVM: VMX: Enable Notify VM exit
  2022-04-21  7:29 ` [PATCH v6 3/3] KVM: VMX: Enable Notify VM exit Chenyi Qiang
  2022-05-17  0:59   ` Chenyi Qiang
@ 2022-05-18 22:30   ` Sean Christopherson
  2022-05-19 10:38     ` Chenyi Qiang
  1 sibling, 1 reply; 17+ messages in thread
From: Sean Christopherson @ 2022-05-18 22:30 UTC (permalink / raw)
  To: Chenyi Qiang
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Xiaoyao Li, kvm, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 2782 bytes --]

On Thu, Apr 21, 2022, Chenyi Qiang wrote:
> @@ -1504,6 +1511,8 @@ struct kvm_x86_ops {
>  	 * Returns vCPU specific APICv inhibit reasons
>  	 */
>  	unsigned long (*vcpu_get_apicv_inhibit_reasons)(struct kvm_vcpu *vcpu);
> +
> +	bool has_notify_vmexit;

I'm pretty sure I suggested this, but seeing it in code, it kinda sorta makes things
worse if we don't first consolidate the existing flags.  kvm_x86_ops works, but we'd
definitely be taking liberties with the "ops" part.

What about adding struct kvm_caps to collect these flags/settings that don't fit
into kvm_cpu_caps because they're not a CPUID feature flag?  kvm_x86_ops has the
advantage of kinda being read-only after init since VMX modifies vmx_x86_ops,
but IMO that's not enough reason to shove this into kvm_x86_ops.  And long term,
we might be able to find a way to mark kvm_caps as fully __ro_after_init.

If no one objects, the attached patch can slide in before this patch, then
has_notify_vmexit can land in kvm_caps.

struct kvm_caps {
	/* control of guest tsc rate supported? */
	bool has_tsc_control;
	/* maximum supported tsc_khz for guests */
	u32  max_guest_tsc_khz;
	/* number of bits of the fractional part of the TSC scaling ratio */
	u8   tsc_scaling_ratio_frac_bits;
	/* maximum allowed value of TSC scaling ratio */
	u64  max_tsc_scaling_ratio;
	/* 1ull << kvm_caps.tsc_scaling_ratio_frac_bits */
	u64  default_tsc_scaling_ratio;
	/* bus lock detection supported? */
	bool has_bus_lock_exit;

	u64 supported_mce_cap;
	u64 supported_xcr0;
	u64 supported_xss;
};
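
(Wiring up the new feature on top of that should then be about two lines;
sketch, with names matching this series:

	/* x86.h, in struct kvm_caps */
	bool has_notify_vmexit;

	/* vmx.c, hardware_setup() */
	kvm_caps.has_notify_vmexit = cpu_has_notify_vmexit();

plus switching the check_extension/enable_cap checks over to
kvm_caps.has_notify_vmexit.)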

> @@ -6090,6 +6094,18 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
>  		}
>  		mutex_unlock(&kvm->lock);
>  		break;
> +	case KVM_CAP_X86_NOTIFY_VMEXIT:
> +		r = -EINVAL;
> +		if ((u32)cap->args[0] & ~KVM_X86_NOTIFY_VMEXIT_VALID_BITS)
> +			break;
> +		if (!kvm_x86_ops.has_notify_vmexit)
> +			break;
> +		if (!(u32)cap->args[0] & KVM_X86_NOTIFY_VMEXIT_ENABLED)
> +			break;
> +		kvm->arch.notify_window = cap->args[0] >> 32;

Setting notify_vmexit and notify_vmexit_flags needs to be done under kvm->lock,
and changing notify_window if kvm->created_vcpus > 0 needs to be disallowed, otherwise
init_vmcs() will use the wrong value.

notify_vmexit_flags could be changed on the fly, but I doubt that's worth
supporting as even the smallest amount of complexity will go unused.

So I think this?

	case KVM_CAP_X86_NOTIFY_VMEXIT:
		r = -EINVAL;
		if ((u32)cap->args[0] & ~KVM_X86_NOTIFY_VMEXIT_VALID_BITS)
			break;
		if (!kvm_x86_ops.has_notify_vmexit)
			break;
		if (!((u32)cap->args[0] & KVM_X86_NOTIFY_VMEXIT_ENABLED))
			break;
		mutex_lock(&kvm->lock);
		if (!kvm->created_vcpus) {
			kvm->arch.notify_window = cap->args[0] >> 32;
			kvm->arch.notify_vmexit_flags = (u32)cap->args[0];
			r = 0;
		}
		mutex_unlock(&kvm->lock);
		break;
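
For completeness, the matching userspace call packs the window into the
upper 32 bits of args[0] (hypothetical snippet, arbitrary window value):

	struct kvm_enable_cap cap = {
		.cap = KVM_CAP_X86_NOTIFY_VMEXIT,
		/* notify_window in bits 63:32, flags in bits 31:0 */
		.args[0] = ((__u64)128 << 32) |
			   KVM_X86_NOTIFY_VMEXIT_ENABLED |
			   KVM_X86_NOTIFY_VMEXIT_USER,
	};

	ioctl(vm_fd, KVM_ENABLE_CAP, &cap);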

[-- Attachment #2: 0001-KVM-x86-Introduce-struct-kvm_caps-to-track-misc-caps.patch --]
[-- Type: text/x-diff, Size: 24373 bytes --]

From efa75f6f68eb62813f9fc78d5fd410426d079edb Mon Sep 17 00:00:00 2001
From: Sean Christopherson <seanjc@google.com>
Date: Wed, 18 May 2022 13:52:51 -0700
Subject: [PATCH] KVM: x86: Introduce "struct kvm_caps" to track misc
 caps/settings

Add kvm_caps to hold a variety of capabilities and defaults that aren't
handled by kvm_cpu_caps because they aren't CPUID bits in order to reduce
the amount of boilerplate code required to add a new feature.  The vast
majority (all?) of the caps interact with vendor code and are written
only during initialization, i.e. should be tagged __read_mostly, declared
extern in x86.h, and exported.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/include/asm/kvm_host.h |  15 -----
 arch/x86/kvm/cpuid.c            |   8 +--
 arch/x86/kvm/debugfs.c          |   4 +-
 arch/x86/kvm/lapic.c            |   2 +-
 arch/x86/kvm/svm/nested.c       |   4 +-
 arch/x86/kvm/svm/svm.c          |  13 +++--
 arch/x86/kvm/vmx/nested.c       |   4 +-
 arch/x86/kvm/vmx/vmx.c          |  22 +++----
 arch/x86/kvm/x86.c              | 100 ++++++++++++++------------------
 arch/x86/kvm/x86.h              |  26 ++++++++-
 10 files changed, 96 insertions(+), 102 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 9cdc5bbd721f..d895d25c5b2f 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1661,21 +1661,6 @@ extern bool tdp_enabled;
 
 u64 vcpu_tsc_khz(struct kvm_vcpu *vcpu);
 
-/* control of guest tsc rate supported? */
-extern bool kvm_has_tsc_control;
-/* maximum supported tsc_khz for guests */
-extern u32  kvm_max_guest_tsc_khz;
-/* number of bits of the fractional part of the TSC scaling ratio */
-extern u8   kvm_tsc_scaling_ratio_frac_bits;
-/* maximum allowed value of TSC scaling ratio */
-extern u64  kvm_max_tsc_scaling_ratio;
-/* 1ull << kvm_tsc_scaling_ratio_frac_bits */
-extern u64  kvm_default_tsc_scaling_ratio;
-/* bus lock detection supported? */
-extern bool kvm_has_bus_lock_exit;
-
-extern u64 kvm_mce_cap_supported;
-
 /*
  * EMULTYPE_NO_DECODE - Set when re-emulating an instruction (after completing
  *			userspace I/O) to indicate that the emulation context
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index 8c1a3b2430a8..6822fc9ae86d 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -199,7 +199,7 @@ void kvm_update_pv_runtime(struct kvm_vcpu *vcpu)
 
 /*
  * Calculate guest's supported XCR0 taking into account guest CPUID data and
- * supported_xcr0 (comprised of host configuration and KVM_SUPPORTED_XCR0).
+ * KVM's supported XCR0 (comprised of host's XCR0 and KVM_SUPPORTED_XCR0).
  */
 static u64 cpuid_get_supported_xcr0(struct kvm_cpuid_entry2 *entries, int nent)
 {
@@ -209,7 +209,7 @@ static u64 cpuid_get_supported_xcr0(struct kvm_cpuid_entry2 *entries, int nent)
 	if (!best)
 		return 0;
 
-	return (best->eax | ((u64)best->edx << 32)) & supported_xcr0;
+	return (best->eax | ((u64)best->edx << 32)) & kvm_caps.supported_xcr0;
 }
 
 static void __kvm_update_cpuid_runtime(struct kvm_vcpu *vcpu, struct kvm_cpuid_entry2 *entries,
@@ -927,8 +927,8 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function)
 		}
 		break;
 	case 0xd: {
-		u64 permitted_xcr0 = supported_xcr0 & xstate_get_guest_group_perm();
-		u64 permitted_xss = supported_xss;
+		u64 permitted_xcr0 = kvm_caps.supported_xcr0 & xstate_get_guest_group_perm();
+		u64 permitted_xss = kvm_caps.supported_xss;
 
 		entry->eax &= permitted_xcr0;
 		entry->ebx = xstate_required_size(permitted_xcr0, false);
diff --git a/arch/x86/kvm/debugfs.c b/arch/x86/kvm/debugfs.c
index 9240b3b7f8dd..cfed36aba2f7 100644
--- a/arch/x86/kvm/debugfs.c
+++ b/arch/x86/kvm/debugfs.c
@@ -48,7 +48,7 @@ DEFINE_SIMPLE_ATTRIBUTE(vcpu_tsc_scaling_fops, vcpu_get_tsc_scaling_ratio, NULL,
 
 static int vcpu_get_tsc_scaling_frac_bits(void *data, u64 *val)
 {
-	*val = kvm_tsc_scaling_ratio_frac_bits;
+	*val = kvm_caps.tsc_scaling_ratio_frac_bits;
 	return 0;
 }
 
@@ -66,7 +66,7 @@ void kvm_arch_create_vcpu_debugfs(struct kvm_vcpu *vcpu, struct dentry *debugfs_
 				    debugfs_dentry, vcpu,
 				    &vcpu_timer_advance_ns_fops);
 
-	if (kvm_has_tsc_control) {
+	if (kvm_caps.has_tsc_control) {
 		debugfs_create_file("tsc-scaling-ratio", 0444,
 				    debugfs_dentry, vcpu,
 				    &vcpu_tsc_scaling_fops);
diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 21ab69db689b..5fd678c90288 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -1602,7 +1602,7 @@ static inline void __wait_lapic_expire(struct kvm_vcpu *vcpu, u64 guest_cycles)
 	 * that __delay() uses delay_tsc whenever the hardware has TSC, thus
 	 * always for VMX enabled hardware.
 	 */
-	if (vcpu->arch.tsc_scaling_ratio == kvm_default_tsc_scaling_ratio) {
+	if (vcpu->arch.tsc_scaling_ratio == kvm_caps.default_tsc_scaling_ratio) {
 		__delay(min(guest_cycles,
 			nsec_to_cycles(vcpu, timer_advance_ns)));
 	} else {
diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
index bed5e1692cef..d14370cec23a 100644
--- a/arch/x86/kvm/svm/nested.c
+++ b/arch/x86/kvm/svm/nested.c
@@ -648,7 +648,7 @@ static void nested_vmcb02_prepare_control(struct vcpu_svm *svm)
 
 	vmcb02->control.tsc_offset = vcpu->arch.tsc_offset;
 
-	if (svm->tsc_ratio_msr != kvm_default_tsc_scaling_ratio) {
+	if (svm->tsc_ratio_msr != kvm_caps.default_tsc_scaling_ratio) {
 		WARN_ON(!svm->tsc_scaling_enabled);
 		nested_svm_update_tsc_ratio_msr(vcpu);
 	}
@@ -979,7 +979,7 @@ int nested_svm_vmexit(struct vcpu_svm *svm)
 		vmcb_mark_dirty(vmcb01, VMCB_INTERCEPTS);
 	}
 
-	if (svm->tsc_ratio_msr != kvm_default_tsc_scaling_ratio) {
+	if (svm->tsc_ratio_msr != kvm_caps.default_tsc_scaling_ratio) {
 		WARN_ON(!svm->tsc_scaling_enabled);
 		vcpu->arch.tsc_scaling_ratio = vcpu->arch.l1_tsc_scaling_ratio;
 		svm_write_tsc_multiplier(vcpu, vcpu->arch.tsc_scaling_ratio);
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 3b49337998ec..480eef001074 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -1225,7 +1225,7 @@ static void __svm_vcpu_reset(struct kvm_vcpu *vcpu)
 
 	svm_init_osvw(vcpu);
 	vcpu->arch.microcode_version = 0x01000065;
-	svm->tsc_ratio_msr = kvm_default_tsc_scaling_ratio;
+	svm->tsc_ratio_msr = kvm_caps.default_tsc_scaling_ratio;
 
 	if (sev_es_guest(vcpu->kvm))
 		sev_es_vcpu_reset(svm);
@@ -4769,7 +4769,7 @@ static __init void svm_set_cpu_caps(void)
 {
 	kvm_set_cpu_caps();
 
-	supported_xss = 0;
+	kvm_caps.supported_xss = 0;
 
 	/* CPUID 0x80000001 and 0x8000000A (SVM features) */
 	if (nested) {
@@ -4845,7 +4845,8 @@ static __init int svm_hardware_setup(void)
 
 	init_msrpm_offsets();
 
-	supported_xcr0 &= ~(XFEATURE_MASK_BNDREGS | XFEATURE_MASK_BNDCSR);
+	kvm_caps.supported_xcr0 &= ~(XFEATURE_MASK_BNDREGS |
+				     XFEATURE_MASK_BNDCSR);
 
 	if (boot_cpu_has(X86_FEATURE_FXSR_OPT))
 		kvm_enable_efer_bits(EFER_FFXSR);
@@ -4855,11 +4856,11 @@ static __init int svm_hardware_setup(void)
 			tsc_scaling = false;
 		} else {
 			pr_info("TSC scaling supported\n");
-			kvm_has_tsc_control = true;
+			kvm_caps.has_tsc_control = true;
 		}
 	}
-	kvm_max_tsc_scaling_ratio = SVM_TSC_RATIO_MAX;
-	kvm_tsc_scaling_ratio_frac_bits = 32;
+	kvm_caps.max_tsc_scaling_ratio = SVM_TSC_RATIO_MAX;
+	kvm_caps.tsc_scaling_ratio_frac_bits = 32;
 
 	tsc_aux_uret_slot = kvm_add_user_return_msr(MSR_TSC_AUX);
 
diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index a6688663da4d..aab4745eb1ee 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -2548,7 +2548,7 @@ static int prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12,
 			vmx_get_l2_tsc_multiplier(vcpu));
 
 	vmcs_write64(TSC_OFFSET, vcpu->arch.tsc_offset);
-	if (kvm_has_tsc_control)
+	if (kvm_caps.has_tsc_control)
 		vmcs_write64(TSC_MULTIPLIER, vcpu->arch.tsc_scaling_ratio);
 
 	nested_vmx_transition_tlb_flush(vcpu, vmcs12, true);
@@ -4610,7 +4610,7 @@ void nested_vmx_vmexit(struct kvm_vcpu *vcpu, u32 vm_exit_reason,
 	vmcs_write32(VM_EXIT_MSR_LOAD_COUNT, vmx->msr_autoload.host.nr);
 	vmcs_write32(VM_ENTRY_MSR_LOAD_COUNT, vmx->msr_autoload.guest.nr);
 	vmcs_write64(TSC_OFFSET, vcpu->arch.tsc_offset);
-	if (kvm_has_tsc_control)
+	if (kvm_caps.has_tsc_control)
 		vmcs_write64(TSC_MULTIPLIER, vcpu->arch.tsc_scaling_ratio);
 
 	if (vmx->nested.l1_tpr_threshold != -1)
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 8bbcf2071faf..b06eafa5884d 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -1715,7 +1715,7 @@ u64 vmx_get_l2_tsc_multiplier(struct kvm_vcpu *vcpu)
 	    nested_cpu_has2(vmcs12, SECONDARY_EXEC_TSC_SCALING))
 		return vmcs12->tsc_multiplier;
 
-	return kvm_default_tsc_scaling_ratio;
+	return kvm_caps.default_tsc_scaling_ratio;
 }
 
 static void vmx_write_tsc_offset(struct kvm_vcpu *vcpu, u64 offset)
@@ -7537,7 +7537,7 @@ static __init void vmx_set_cpu_caps(void)
 		kvm_cpu_cap_set(X86_FEATURE_UMIP);
 
 	/* CPUID 0xD.1 */
-	supported_xss = 0;
+	kvm_caps.supported_xss = 0;
 	if (!cpu_has_vmx_xsaves())
 		kvm_cpu_cap_clear(X86_FEATURE_XSAVES);
 
@@ -7678,9 +7678,9 @@ static int vmx_set_hv_timer(struct kvm_vcpu *vcpu, u64 guest_deadline_tsc,
 		delta_tsc = 0;
 
 	/* Convert to host delta tsc if tsc scaling is enabled */
-	if (vcpu->arch.l1_tsc_scaling_ratio != kvm_default_tsc_scaling_ratio &&
+	if (vcpu->arch.l1_tsc_scaling_ratio != kvm_caps.default_tsc_scaling_ratio &&
 	    delta_tsc && u64_shl_div_u64(delta_tsc,
-				kvm_tsc_scaling_ratio_frac_bits,
+				kvm_caps.tsc_scaling_ratio_frac_bits,
 				vcpu->arch.l1_tsc_scaling_ratio, &delta_tsc))
 		return -ERANGE;
 
@@ -8057,8 +8057,8 @@ static __init int hardware_setup(void)
 	}
 
 	if (!cpu_has_vmx_mpx())
-		supported_xcr0 &= ~(XFEATURE_MASK_BNDREGS |
-				    XFEATURE_MASK_BNDCSR);
+		kvm_caps.supported_xcr0 &= ~(XFEATURE_MASK_BNDREGS |
+					     XFEATURE_MASK_BNDCSR);
 
 	if (!cpu_has_vmx_vpid() || !cpu_has_vmx_invvpid() ||
 	    !(cpu_has_vmx_invvpid_single() || cpu_has_vmx_invvpid_global()))
@@ -8125,11 +8125,11 @@ static __init int hardware_setup(void)
 		enable_ipiv = false;
 
 	if (cpu_has_vmx_tsc_scaling())
-		kvm_has_tsc_control = true;
+		kvm_caps.has_tsc_control = true;
 
-	kvm_max_tsc_scaling_ratio = KVM_VMX_TSC_MULTIPLIER_MAX;
-	kvm_tsc_scaling_ratio_frac_bits = 48;
-	kvm_has_bus_lock_exit = cpu_has_vmx_bus_lock_detection();
+	kvm_caps.max_tsc_scaling_ratio = KVM_VMX_TSC_MULTIPLIER_MAX;
+	kvm_caps.tsc_scaling_ratio_frac_bits = 48;
+	kvm_caps.has_bus_lock_exit = cpu_has_vmx_bus_lock_detection();
 
 	set_bit(0, vmx_vpid_bitmap); /* 0 is reserved for host */
 
@@ -8186,7 +8186,7 @@ static __init int hardware_setup(void)
 		vmx_x86_ops.request_immediate_exit = __kvm_request_immediate_exit;
 	}
 
-	kvm_mce_cap_supported |= MCG_LMCE_P;
+	kvm_caps.supported_mce_cap |= MCG_LMCE_P;
 
 	if (pt_mode != PT_MODE_SYSTEM && pt_mode != PT_MODE_HOST_GUEST)
 		return -EINVAL;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 04812eaaf61b..1fdc59a5cb1b 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -87,8 +87,11 @@
 
 #define MAX_IO_MSRS 256
 #define KVM_MAX_MCE_BANKS 32
-u64 __read_mostly kvm_mce_cap_supported = MCG_CTL_P | MCG_SER_P;
-EXPORT_SYMBOL_GPL(kvm_mce_cap_supported);
+
+struct kvm_caps kvm_caps __read_mostly = {
+	.supported_mce_cap = MCG_CTL_P | MCG_SER_P,
+};
+EXPORT_SYMBOL_GPL(kvm_caps);
 
 #define  ERR_PTR_USR(e)  ((void __user *)ERR_PTR(e))
 
@@ -151,19 +154,6 @@ module_param(min_timer_period_us, uint, S_IRUGO | S_IWUSR);
 static bool __read_mostly kvmclock_periodic_sync = true;
 module_param(kvmclock_periodic_sync, bool, S_IRUGO);
 
-bool __read_mostly kvm_has_tsc_control;
-EXPORT_SYMBOL_GPL(kvm_has_tsc_control);
-u32  __read_mostly kvm_max_guest_tsc_khz;
-EXPORT_SYMBOL_GPL(kvm_max_guest_tsc_khz);
-u8   __read_mostly kvm_tsc_scaling_ratio_frac_bits;
-EXPORT_SYMBOL_GPL(kvm_tsc_scaling_ratio_frac_bits);
-u64  __read_mostly kvm_max_tsc_scaling_ratio;
-EXPORT_SYMBOL_GPL(kvm_max_tsc_scaling_ratio);
-u64 __read_mostly kvm_default_tsc_scaling_ratio;
-EXPORT_SYMBOL_GPL(kvm_default_tsc_scaling_ratio);
-bool __read_mostly kvm_has_bus_lock_exit;
-EXPORT_SYMBOL_GPL(kvm_has_bus_lock_exit);
-
 /* tsc tolerance in parts per million - default to 1/2 of the NTP threshold */
 static u32 __read_mostly tsc_tolerance_ppm = 250;
 module_param(tsc_tolerance_ppm, uint, S_IRUGO | S_IWUSR);
@@ -235,8 +225,6 @@ EXPORT_SYMBOL_GPL(enable_apicv);
 
 u64 __read_mostly host_xss;
 EXPORT_SYMBOL_GPL(host_xss);
-u64 __read_mostly supported_xss;
-EXPORT_SYMBOL_GPL(supported_xss);
 
 const struct _kvm_stats_desc kvm_vm_stats_desc[] = {
 	KVM_GENERIC_VM_STATS(),
@@ -309,8 +297,6 @@ const struct kvm_stats_header kvm_vcpu_stats_header = {
 };
 
 u64 __read_mostly host_xcr0;
-u64 __read_mostly supported_xcr0;
-EXPORT_SYMBOL_GPL(supported_xcr0);
 
 static struct kmem_cache *x86_emulator_cache;
 
@@ -2345,12 +2331,12 @@ static int set_tsc_khz(struct kvm_vcpu *vcpu, u32 user_tsc_khz, bool scale)
 
 	/* Guest TSC same frequency as host TSC? */
 	if (!scale) {
-		kvm_vcpu_write_tsc_multiplier(vcpu, kvm_default_tsc_scaling_ratio);
+		kvm_vcpu_write_tsc_multiplier(vcpu, kvm_caps.default_tsc_scaling_ratio);
 		return 0;
 	}
 
 	/* TSC scaling supported? */
-	if (!kvm_has_tsc_control) {
+	if (!kvm_caps.has_tsc_control) {
 		if (user_tsc_khz > tsc_khz) {
 			vcpu->arch.tsc_catchup = 1;
 			vcpu->arch.tsc_always_catchup = 1;
@@ -2362,10 +2348,10 @@ static int set_tsc_khz(struct kvm_vcpu *vcpu, u32 user_tsc_khz, bool scale)
 	}
 
 	/* TSC scaling required  - calculate ratio */
-	ratio = mul_u64_u32_div(1ULL << kvm_tsc_scaling_ratio_frac_bits,
+	ratio = mul_u64_u32_div(1ULL << kvm_caps.tsc_scaling_ratio_frac_bits,
 				user_tsc_khz, tsc_khz);
 
-	if (ratio == 0 || ratio >= kvm_max_tsc_scaling_ratio) {
+	if (ratio == 0 || ratio >= kvm_caps.max_tsc_scaling_ratio) {
 		pr_warn_ratelimited("Invalid TSC scaling ratio - virtual-tsc-khz=%u\n",
 			            user_tsc_khz);
 		return -1;
@@ -2383,7 +2369,7 @@ static int kvm_set_tsc_khz(struct kvm_vcpu *vcpu, u32 user_tsc_khz)
 	/* tsc_khz can be zero if TSC calibration fails */
 	if (user_tsc_khz == 0) {
 		/* set tsc_scaling_ratio to a safe value */
-		kvm_vcpu_write_tsc_multiplier(vcpu, kvm_default_tsc_scaling_ratio);
+		kvm_vcpu_write_tsc_multiplier(vcpu, kvm_caps.default_tsc_scaling_ratio);
 		return -1;
 	}
 
@@ -2460,18 +2446,18 @@ static void kvm_track_tsc_matching(struct kvm_vcpu *vcpu)
  * (frac) represent the fractional part, ie. ratio represents a fixed
  * point number (mult + frac * 2^(-N)).
  *
- * N equals to kvm_tsc_scaling_ratio_frac_bits.
+ * N equals to kvm_caps.tsc_scaling_ratio_frac_bits.
  */
 static inline u64 __scale_tsc(u64 ratio, u64 tsc)
 {
-	return mul_u64_u64_shr(tsc, ratio, kvm_tsc_scaling_ratio_frac_bits);
+	return mul_u64_u64_shr(tsc, ratio, kvm_caps.tsc_scaling_ratio_frac_bits);
 }
 
 u64 kvm_scale_tsc(u64 tsc, u64 ratio)
 {
 	u64 _tsc = tsc;
 
-	if (ratio != kvm_default_tsc_scaling_ratio)
+	if (ratio != kvm_caps.default_tsc_scaling_ratio)
 		_tsc = __scale_tsc(ratio, tsc);
 
 	return _tsc;
@@ -2498,11 +2484,11 @@ u64 kvm_calc_nested_tsc_offset(u64 l1_offset, u64 l2_offset, u64 l2_multiplier)
 {
 	u64 nested_offset;
 
-	if (l2_multiplier == kvm_default_tsc_scaling_ratio)
+	if (l2_multiplier == kvm_caps.default_tsc_scaling_ratio)
 		nested_offset = l1_offset;
 	else
 		nested_offset = mul_s64_u64_shr((s64) l1_offset, l2_multiplier,
-						kvm_tsc_scaling_ratio_frac_bits);
+						kvm_caps.tsc_scaling_ratio_frac_bits);
 
 	nested_offset += l2_offset;
 	return nested_offset;
@@ -2511,9 +2497,9 @@ EXPORT_SYMBOL_GPL(kvm_calc_nested_tsc_offset);
 
 u64 kvm_calc_nested_tsc_multiplier(u64 l1_multiplier, u64 l2_multiplier)
 {
-	if (l2_multiplier != kvm_default_tsc_scaling_ratio)
+	if (l2_multiplier != kvm_caps.default_tsc_scaling_ratio)
 		return mul_u64_u64_shr(l1_multiplier, l2_multiplier,
-				       kvm_tsc_scaling_ratio_frac_bits);
+				       kvm_caps.tsc_scaling_ratio_frac_bits);
 
 	return l1_multiplier;
 }
@@ -2555,7 +2541,7 @@ static void kvm_vcpu_write_tsc_multiplier(struct kvm_vcpu *vcpu, u64 l1_multipli
 	else
 		vcpu->arch.tsc_scaling_ratio = l1_multiplier;
 
-	if (kvm_has_tsc_control)
+	if (kvm_caps.has_tsc_control)
 		static_call(kvm_x86_write_tsc_multiplier)(
 			vcpu, vcpu->arch.tsc_scaling_ratio);
 }
@@ -2691,7 +2677,7 @@ static inline void adjust_tsc_offset_guest(struct kvm_vcpu *vcpu,
 
 static inline void adjust_tsc_offset_host(struct kvm_vcpu *vcpu, s64 adjustment)
 {
-	if (vcpu->arch.l1_tsc_scaling_ratio != kvm_default_tsc_scaling_ratio)
+	if (vcpu->arch.l1_tsc_scaling_ratio != kvm_caps.default_tsc_scaling_ratio)
 		WARN_ON(adjustment < 0);
 	adjustment = kvm_scale_tsc((u64) adjustment,
 				   vcpu->arch.l1_tsc_scaling_ratio);
@@ -3104,7 +3090,7 @@ static int kvm_guest_time_update(struct kvm_vcpu *v)
 
 	/* With all the info we got, fill in the values */
 
-	if (kvm_has_tsc_control)
+	if (kvm_caps.has_tsc_control)
 		tgt_tsc_khz = kvm_scale_tsc(tgt_tsc_khz,
 					    v->arch.l1_tsc_scaling_ratio);
 
@@ -3610,7 +3596,7 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 		 * IA32_XSS[bit 8]. Guests have to use RDMSR/WRMSR rather than
 		 * XSAVES/XRSTORS to save/restore PT MSRs.
 		 */
-		if (data & ~supported_xss)
+		if (data & ~kvm_caps.supported_xss)
 			return 1;
 		vcpu->arch.ia32_xss = data;
 		kvm_update_cpuid_runtime(vcpu);
@@ -4370,7 +4356,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 		break;
 	case KVM_CAP_TSC_CONTROL:
 	case KVM_CAP_VM_TSC_CONTROL:
-		r = kvm_has_tsc_control;
+		r = kvm_caps.has_tsc_control;
 		break;
 	case KVM_CAP_X2APIC_API:
 		r = KVM_X2APIC_API_VALID_FLAGS;
@@ -4392,7 +4378,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 		r = sched_info_on();
 		break;
 	case KVM_CAP_X86_BUS_LOCK_EXIT:
-		if (kvm_has_bus_lock_exit)
+		if (kvm_caps.has_bus_lock_exit)
 			r = KVM_BUS_LOCK_DETECTION_OFF |
 			    KVM_BUS_LOCK_DETECTION_EXIT;
 		else
@@ -4401,7 +4387,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 	case KVM_CAP_XSAVE2: {
 		u64 guest_perm = xstate_get_guest_group_perm();
 
-		r = xstate_required_size(supported_xcr0 & guest_perm, false);
+		r = xstate_required_size(kvm_caps.supported_xcr0 & guest_perm, false);
 		if (r < sizeof(struct kvm_xsave))
 			r = sizeof(struct kvm_xsave);
 		break;
@@ -4439,7 +4425,7 @@ static int kvm_x86_dev_get_attr(struct kvm_device_attr *attr)
 
 	switch (attr->attr) {
 	case KVM_X86_XCOMP_GUEST_SUPP:
-		if (put_user(supported_xcr0, uaddr))
+		if (put_user(kvm_caps.supported_xcr0, uaddr))
 			return -EFAULT;
 		return 0;
 	default:
@@ -4516,8 +4502,8 @@ long kvm_arch_dev_ioctl(struct file *filp,
 	}
 	case KVM_X86_GET_MCE_CAP_SUPPORTED:
 		r = -EFAULT;
-		if (copy_to_user(argp, &kvm_mce_cap_supported,
-				 sizeof(kvm_mce_cap_supported)))
+		if (copy_to_user(argp, &kvm_caps.supported_mce_cap,
+				 sizeof(kvm_caps.supported_mce_cap)))
 			goto out;
 		r = 0;
 		break;
@@ -4739,7 +4725,8 @@ static int kvm_vcpu_ready_for_interrupt_injection(struct kvm_vcpu *vcpu)
 	return (kvm_arch_interrupt_allowed(vcpu) &&
 		kvm_cpu_accept_dm_intr(vcpu) &&
 		!kvm_event_needs_reinjection(vcpu) &&
-		!vcpu->arch.exception.pending);
+		!vcpu->arch.exception.pending &&
+		!kvm_test_request(KVM_REQ_TRIPLE_FAULT, vcpu));
 }
 
 static int kvm_vcpu_ioctl_interrupt(struct kvm_vcpu *vcpu,
@@ -4801,7 +4788,7 @@ static int kvm_vcpu_ioctl_x86_setup_mce(struct kvm_vcpu *vcpu,
 	r = -EINVAL;
 	if (!bank_num || bank_num > KVM_MAX_MCE_BANKS)
 		goto out;
-	if (mcg_cap & ~(kvm_mce_cap_supported | 0xff | 0xff0000))
+	if (mcg_cap & ~(kvm_caps.supported_mce_cap | 0xff | 0xff0000))
 		goto out;
 	r = 0;
 	vcpu->arch.mcg_cap = mcg_cap;
@@ -5093,7 +5080,8 @@ static int kvm_vcpu_ioctl_x86_set_xsave(struct kvm_vcpu *vcpu,
 
 	return fpu_copy_uabi_to_guest_fpstate(&vcpu->arch.guest_fpu,
 					      guest_xsave->region,
-					      supported_xcr0, &vcpu->arch.pkru);
+					      kvm_caps.supported_xcr0,
+					      &vcpu->arch.pkru);
 }
 
 static void kvm_vcpu_ioctl_x86_get_xcrs(struct kvm_vcpu *vcpu,
@@ -5598,8 +5586,8 @@ long kvm_arch_vcpu_ioctl(struct file *filp,
 		r = -EINVAL;
 		user_tsc_khz = (u32)arg;
 
-		if (kvm_has_tsc_control &&
-		    user_tsc_khz >= kvm_max_guest_tsc_khz)
+		if (kvm_caps.has_tsc_control &&
+		    user_tsc_khz >= kvm_caps.max_guest_tsc_khz)
 			goto out;
 
 		if (user_tsc_khz == 0)
@@ -6039,7 +6027,7 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
 		    (cap->args[0] & KVM_BUS_LOCK_DETECTION_EXIT))
 			break;
 
-		if (kvm_has_bus_lock_exit &&
+		if (kvm_caps.has_bus_lock_exit &&
 		    cap->args[0] & KVM_BUS_LOCK_DETECTION_EXIT)
 			kvm->arch.bus_lock_detection_enabled = true;
 		r = 0;
@@ -6588,8 +6576,8 @@ long kvm_arch_vm_ioctl(struct file *filp,
 		r = -EINVAL;
 		user_tsc_khz = (u32)arg;
 
-		if (kvm_has_tsc_control &&
-		    user_tsc_khz >= kvm_max_guest_tsc_khz)
+		if (kvm_caps.has_tsc_control &&
+		    user_tsc_khz >= kvm_caps.max_guest_tsc_khz)
 			goto out;
 
 		if (user_tsc_khz == 0)
@@ -8745,7 +8733,7 @@ static void kvm_hyperv_tsc_notifier(void)
 	/* TSC frequency always matches when on Hyper-V */
 	for_each_present_cpu(cpu)
 		per_cpu(cpu_tsc_khz, cpu) = tsc_khz;
-	kvm_max_guest_tsc_khz = tsc_khz;
+	kvm_caps.max_guest_tsc_khz = tsc_khz;
 
 	list_for_each_entry(kvm, &vm_list, vm_list) {
 		__kvm_start_pvclock_update(kvm);
@@ -9007,7 +8995,7 @@ int kvm_arch_init(void *opaque)
 
 	if (boot_cpu_has(X86_FEATURE_XSAVE)) {
 		host_xcr0 = xgetbv(XCR_XFEATURE_ENABLED_MASK);
-		supported_xcr0 = host_xcr0 & KVM_SUPPORTED_XCR0;
+		kvm_caps.supported_xcr0 = host_xcr0 & KVM_SUPPORTED_XCR0;
 	}
 
 	if (pi_inject_timer == -1)
@@ -11703,13 +11691,13 @@ int kvm_arch_hardware_setup(void *opaque)
 	kvm_register_perf_callbacks(ops->handle_intel_pt_intr);
 
 	if (!kvm_cpu_cap_has(X86_FEATURE_XSAVES))
-		supported_xss = 0;
+		kvm_caps.supported_xss = 0;
 
 #define __kvm_cpu_cap_has(UNUSED_, f) kvm_cpu_cap_has(f)
 	cr4_reserved_bits = __cr4_reserved_bits(__kvm_cpu_cap_has, UNUSED_);
 #undef __kvm_cpu_cap_has
 
-	if (kvm_has_tsc_control) {
+	if (kvm_caps.has_tsc_control) {
 		/*
 		 * Make sure the user can only configure tsc_khz values that
 		 * fit into a signed integer.
@@ -11717,10 +11705,10 @@ int kvm_arch_hardware_setup(void *opaque)
 		 * be 1 on all machines.
 		 */
 		u64 max = min(0x7fffffffULL,
-			      __scale_tsc(kvm_max_tsc_scaling_ratio, tsc_khz));
-		kvm_max_guest_tsc_khz = max;
+			      __scale_tsc(kvm_caps.max_tsc_scaling_ratio, tsc_khz));
+		kvm_caps.max_guest_tsc_khz = max;
 	}
-	kvm_default_tsc_scaling_ratio = 1ULL << kvm_tsc_scaling_ratio_frac_bits;
+	kvm_caps.default_tsc_scaling_ratio = 1ULL << kvm_caps.tsc_scaling_ratio_frac_bits;
 	kvm_init_msr_list();
 	return 0;
 }
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index 588792f00334..359d0454ad28 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -8,6 +8,25 @@
 #include "kvm_cache_regs.h"
 #include "kvm_emulate.h"
 
+struct kvm_caps {
+	/* control of guest tsc rate supported? */
+	bool has_tsc_control;
+	/* maximum supported tsc_khz for guests */
+	u32  max_guest_tsc_khz;
+	/* number of bits of the fractional part of the TSC scaling ratio */
+	u8   tsc_scaling_ratio_frac_bits;
+	/* maximum allowed value of TSC scaling ratio */
+	u64  max_tsc_scaling_ratio;
+	/* 1ull << kvm_caps.tsc_scaling_ratio_frac_bits */
+	u64  default_tsc_scaling_ratio;
+	/* bus lock detection supported? */
+	bool has_bus_lock_exit;
+
+	u64 supported_mce_cap;
+	u64 supported_xcr0;
+	u64 supported_xss;
+};
+
 void kvm_spurious_fault(void);
 
 #define KVM_NESTED_VMENTER_CONSISTENCY_CHECK(consistency_check)		\
@@ -283,14 +302,15 @@ int x86_emulate_instruction(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
 fastpath_t handle_fastpath_set_msr_irqoff(struct kvm_vcpu *vcpu);
 
 extern u64 host_xcr0;
-extern u64 supported_xcr0;
 extern u64 host_xss;
-extern u64 supported_xss;
+
+extern struct kvm_caps kvm_caps;
+
 extern bool enable_pmu;
 
 static inline bool kvm_mpx_supported(void)
 {
-	return (supported_xcr0 & (XFEATURE_MASK_BNDREGS | XFEATURE_MASK_BNDCSR))
+	return (kvm_caps.supported_xcr0 & (XFEATURE_MASK_BNDREGS | XFEATURE_MASK_BNDCSR))
 		== (XFEATURE_MASK_BNDREGS | XFEATURE_MASK_BNDCSR);
 }
 

base-commit: a3808d88461270c71d3fece5e51cc486ecdac7d0
-- 
2.36.1.124.g0e6072fb45-goog



* Re: [PATCH v6 1/3] KVM: X86: Save&restore the triple fault request
  2022-05-18 18:42   ` Sean Christopherson
@ 2022-05-19  6:25     ` Chenyi Qiang
  0 siblings, 0 replies; 17+ messages in thread
From: Chenyi Qiang @ 2022-05-19  6:25 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Xiaoyao Li, kvm, linux-kernel

Thanks Sean for your review!

On 5/19/2022 2:42 AM, Sean Christopherson wrote:
> Nits on the shortlog...
> 
> Please don't capitalize x86, spell out "and" instead of using an ampersand (though
> I think it can be omitted entirely), and since there are plenty of chars left, call
> out that this is adding/extending KVM's ABI, e.g. it's not obvious from the shortlog
> where/when the save+restore happens.
> 
>    KVM: x86: Extend KVM_{G,S}ET_VCPU_EVENTS to support pending triple fault
> 

Will fix it.

> On Thu, Apr 21, 2022, Chenyi Qiang wrote:
>> For the triple fault synthesized by KVM, e.g. the RSM path or
>> nested_vmx_abort(), if KVM exits to userspace before the request is
>> serviced, userspace could migrate the VM and lose the triple fault.
>>
>> Add the support to save and restore the triple fault event from
>> userspace. Introduce a new event KVM_VCPUEVENT_VALID_TRIPLE_FAULT in
>> get/set_vcpu_events to track the triple fault request.
>>
>> Note that in the set_vcpu_events path, userspace is able to set/clear
>> the triple fault request through triple_fault_pending field.
>>
>> Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
>> ---
>>   Documentation/virt/kvm/api.rst  |  7 +++++++
>>   arch/x86/include/uapi/asm/kvm.h |  4 +++-
>>   arch/x86/kvm/x86.c              | 15 +++++++++++++--
>>   3 files changed, 23 insertions(+), 3 deletions(-)
>>
>> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
>> index 72183ae628f7..e09ce3cb49c5 100644
>> --- a/Documentation/virt/kvm/api.rst
>> +++ b/Documentation/virt/kvm/api.rst
>> @@ -1150,6 +1150,9 @@ The following bits are defined in the flags field:
>>     fields contain a valid state. This bit will be set whenever
>>     KVM_CAP_EXCEPTION_PAYLOAD is enabled.
>>   
>> +- KVM_VCPUEVENT_VALID_TRIPLE_FAULT may be set to signal that the
>> +  triple_fault_pending field contains a valid state.
>> +
>>   ARM64:
>>   ^^^^^^
>>   
>> @@ -1245,6 +1248,10 @@ can be set in the flags field to signal that the
>>   exception_has_payload, exception_payload, and exception.pending fields
>>   contain a valid state and shall be written into the VCPU.
>>   
>> +KVM_VCPUEVENT_VALID_TRIPLE_FAULT can be set in flags field to signal that
>> +the triple_fault_pending field contains a valid state and shall be written
>> +into the VCPU.
>> +
>>   ARM64:
>>   ^^^^^^
>>   
>> diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
>> index 21614807a2cb..fd083f6337af 100644
>> --- a/arch/x86/include/uapi/asm/kvm.h
>> +++ b/arch/x86/include/uapi/asm/kvm.h
>> @@ -325,6 +325,7 @@ struct kvm_reinject_control {
>>   #define KVM_VCPUEVENT_VALID_SHADOW	0x00000004
>>   #define KVM_VCPUEVENT_VALID_SMM		0x00000008
>>   #define KVM_VCPUEVENT_VALID_PAYLOAD	0x00000010
>> +#define KVM_VCPUEVENT_VALID_TRIPLE_FAULT	0x00000020
>>   
>>   /* Interrupt shadow states */
>>   #define KVM_X86_SHADOW_INT_MOV_SS	0x01
>> @@ -359,7 +360,8 @@ struct kvm_vcpu_events {
>>   		__u8 smm_inside_nmi;
>>   		__u8 latched_init;
>>   	} smi;
>> -	__u8 reserved[27];
>> +	__u8 triple_fault_pending;
> 
> What about writing this as
> 
> 	struct {
> 		__u8 pending;
> 	} triple_fault;
> 
> to match the other events?  It's kinda silly, but I find it easier to visually
> identify the various events that are handled by kvm_vcpu_events.
> 

Sure, will change in this format.

>> +	__u8 reserved[26];
>>   	__u8 exception_has_payload;
>>   	__u64 exception_payload;
>>   };
>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>> index ab336f7c82e4..c8b9b0bc42aa 100644
>> --- a/arch/x86/kvm/x86.c
>> +++ b/arch/x86/kvm/x86.c
>> @@ -4911,9 +4911,12 @@ static void kvm_vcpu_ioctl_x86_get_vcpu_events(struct kvm_vcpu *vcpu,
>>   		!!(vcpu->arch.hflags & HF_SMM_INSIDE_NMI_MASK);
>>   	events->smi.latched_init = kvm_lapic_latched_init(vcpu);
>>   
>> +	events->triple_fault_pending = kvm_test_request(KVM_REQ_TRIPLE_FAULT, vcpu);
>> +
>>   	events->flags = (KVM_VCPUEVENT_VALID_NMI_PENDING
>>   			 | KVM_VCPUEVENT_VALID_SHADOW
>> -			 | KVM_VCPUEVENT_VALID_SMM);
>> +			 | KVM_VCPUEVENT_VALID_SMM
>> +			 | KVM_VCPUEVENT_VALID_TRIPLE_FAULT);
> 
> Does setting KVM_VCPUEVENT_VALID_TRIPLE_FAULT need to be guarded with a capability,
> a la KVM_CAP_EXCEPTION_PAYLOAD, so that migrating from a new KVM to an old KVM doesn't
> fail?  Seems rather pointless since the VM is likely hosed either way...
> 

Indeed, at least adding a capability makes it more compatible. Will add it
in the next version.
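
e.g. mirroring the exception_payload pattern (cap and field names below
are hypothetical until the next version is posted):

	if (vcpu->kvm->arch.triple_fault_event)
		events->flags |= KVM_VCPUEVENT_VALID_TRIPLE_FAULT;

with a matching KVM_ENABLE_CAP handler that sets
kvm->arch.triple_fault_event.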

>>   	if (vcpu->kvm->arch.exception_payload_enabled)
>>   		events->flags |= KVM_VCPUEVENT_VALID_PAYLOAD;


* Re: [PATCH v6 3/3] KVM: VMX: Enable Notify VM exit
  2022-05-18 22:30   ` Sean Christopherson
@ 2022-05-19 10:38     ` Chenyi Qiang
  2022-05-19 15:22       ` Sean Christopherson
  0 siblings, 1 reply; 17+ messages in thread
From: Chenyi Qiang @ 2022-05-19 10:38 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Xiaoyao Li, kvm, linux-kernel



On 5/19/2022 6:30 AM, Sean Christopherson wrote:
> On Thu, Apr 21, 2022, Chenyi Qiang wrote:
>> @@ -1504,6 +1511,8 @@ struct kvm_x86_ops {
>>   	 * Returns vCPU specific APICv inhibit reasons
>>   	 */
>>   	unsigned long (*vcpu_get_apicv_inhibit_reasons)(struct kvm_vcpu *vcpu);
>> +
>> +	bool has_notify_vmexit;
> 
> I'm pretty sure I suggested this, but seeing it in code, it kinda sorta makes things
> worse if we don't first consolidate the existing flags.  kvm_x86_ops works, but we'd
> definitely be taking liberties with the "ops" part.
> 
> What about adding struct kvm_caps to collect these flags/settings that don't fit
> into kvm_cpu_caps because they're not a CPUID feature flag?  kvm_x86_ops has the
> advantage of kinda being read-only after init since VMX modifies vmx_x86_ops,
> but IMO that's not enough reason to shove this into kvm_x86_ops.  And long term,
> we might be able to find a way to mark kvm_caps as fully __ro_after_init.
> 
> If no one objects, the attached patch can slide in before this patch, then
> has_notify_vmexit can land in kvm_caps.
> 
> struct kvm_caps {
> 	/* control of guest tsc rate supported? */
> 	bool has_tsc_control;
> 	/* maximum supported tsc_khz for guests */
> 	u32  max_guest_tsc_khz;
> 	/* number of bits of the fractional part of the TSC scaling ratio */
> 	u8   tsc_scaling_ratio_frac_bits;
> 	/* maximum allowed value of TSC scaling ratio */
> 	u64  max_tsc_scaling_ratio;
> 	/* 1ull << kvm_caps.tsc_scaling_ratio_frac_bits */
> 	u64  default_tsc_scaling_ratio;
> 	/* bus lock detection supported? */
> 	bool has_bus_lock_exit;
> 
> 	u64 supported_mce_cap;
> 	u64 supported_xcr0;
> 	u64 supported_xss;
> };
> 

Thanks Sean for your patch. I think an unintentional change slipped into it:

@@ -4739,7 +4725,8 @@ static int kvm_vcpu_ready_for_interrupt_injection(struct kvm_vcpu *vcpu)
  	return (kvm_arch_interrupt_allowed(vcpu) &&
  		kvm_cpu_accept_dm_intr(vcpu) &&
  		!kvm_event_needs_reinjection(vcpu) &&
-		!vcpu->arch.exception.pending);
+		!vcpu->arch.exception.pending &&
+		!kvm_test_request(KVM_REQ_TRIPLE_FAULT, vcpu));
  }

Maybe this should belong in patch 1?

>> @@ -6090,6 +6094,18 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
>>   		}
>>   		mutex_unlock(&kvm->lock);
>>   		break;
>> +	case KVM_CAP_X86_NOTIFY_VMEXIT:
>> +		r = -EINVAL;
>> +		if ((u32)cap->args[0] & ~KVM_X86_NOTIFY_VMEXIT_VALID_BITS)
>> +			break;
>> +		if (!kvm_x86_ops.has_notify_vmexit)
>> +			break;
>> +		if (!(u32)cap->args[0] & KVM_X86_NOTIFY_VMEXIT_ENABLED)
>> +			break;
>> +		kvm->arch.notify_window = cap->args[0] >> 32;
> 
> Setting notify_vmexit and notify_vmexit_flags needs to be done under kvm->lock,
> and changing notify_window if kvm->created_vcpus > 0 needs to be disallowed, otherwise
> init_vmcs() will use the wrong value.
> 
> notify_vmexit_flags could be changed on the fly, but I doubt that's worth
> supporting as even the smallest amount of complexity will go unused.
> 
> So I think this?
> 

Makes sense.

> 	case KVM_CAP_X86_NOTIFY_VMEXIT:
> 		r = -EINVAL;
> 		if ((u32)cap->args[0] & ~KVM_X86_NOTIFY_VMEXIT_VALID_BITS)
> 			break;
> 		if (!kvm_x86_ops.has_notify_vmexit)
> 			break;
> 		if (!((u32)cap->args[0] & KVM_X86_NOTIFY_VMEXIT_ENABLED))
> 			break;
> 		mutex_lock(&kvm->lock);
> 		if (!kvm->created_vcpus) {
> 			kvm->arch.notify_window = cap->args[0] >> 32;
> 			kvm->arch.notify_vmexit_flags = (u32)cap->args[0];
> 			r = 0;
> 		}
> 		mutex_unlock(&kvm->lock);
> 		break;



* Re: [PATCH v6 3/3] KVM: VMX: Enable Notify VM exit
  2022-05-19 10:38     ` Chenyi Qiang
@ 2022-05-19 15:22       ` Sean Christopherson
  0 siblings, 0 replies; 17+ messages in thread
From: Sean Christopherson @ 2022-05-19 15:22 UTC (permalink / raw)
  To: Chenyi Qiang
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Xiaoyao Li, kvm, linux-kernel

On Thu, May 19, 2022, Chenyi Qiang wrote:
> 
> 
> On 5/19/2022 6:30 AM, Sean Christopherson wrote:
> Thanks Sean for your patch. I think an unintentional change slipped into it:
> 
> @@ -4739,7 +4725,8 @@ static int kvm_vcpu_ready_for_interrupt_injection(struct kvm_vcpu *vcpu)
>  	return (kvm_arch_interrupt_allowed(vcpu) &&
>  		kvm_cpu_accept_dm_intr(vcpu) &&
>  		!kvm_event_needs_reinjection(vcpu) &&
> -		!vcpu->arch.exception.pending);
> +		!vcpu->arch.exception.pending &&
> +		!kvm_test_request(KVM_REQ_TRIPLE_FAULT, vcpu));
>  }
> 
> Maybe this should belong in patch 1?

Good eyes!  Let's keep it out for now.  I do think it's the right behavior, but
it is woefully incomplete; pretty much every path that checks
vcpu->arch.exception.pending also needs to check for pending triple fault.

I'll add a patch to my exception/event cleanup series[*], which introduces
kvm_is_exception_pending(), i.e. provides exactly the wrapper needed to handle
a pending triple fault with minimal effort.

[*] https://lore.kernel.org/all/20220311032801.3467418-1-seanjc@google.com
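
For reference, that wrapper would presumably look something like this
sketch (not the posted patch):

	static inline bool kvm_is_exception_pending(struct kvm_vcpu *vcpu)
	{
		return vcpu->arch.exception.pending ||
		       kvm_test_request(KVM_REQ_TRIPLE_FAULT, vcpu);
	}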


Here's a new version without the spurious change...

--
From: Sean Christopherson <seanjc@google.com>
Date: Wed, 18 May 2022 13:52:51 -0700
Subject: [PATCH] KVM: x86: Introduce "struct kvm_caps" to track misc
 caps/settings

Add kvm_caps to hold a variety of capabilities and defaults that aren't
handled by kvm_cpu_caps because they aren't CPUID bits in order to reduce
the amount of boilerplate code required to add a new feature.  The vast
majority (all?) of the caps interact with vendor code and are written
only during initialization, i.e. should be tagged __read_mostly, declared
extern in x86.h, and exported.

No functional change intended.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/include/asm/kvm_host.h | 15 -----
 arch/x86/kvm/cpuid.c            |  8 +--
 arch/x86/kvm/debugfs.c          |  4 +-
 arch/x86/kvm/lapic.c            |  2 +-
 arch/x86/kvm/svm/nested.c       |  4 +-
 arch/x86/kvm/svm/svm.c          | 13 +++--
 arch/x86/kvm/vmx/nested.c       |  4 +-
 arch/x86/kvm/vmx/vmx.c          | 22 ++++----
 arch/x86/kvm/x86.c              | 97 ++++++++++++++-------------------
 arch/x86/kvm/x86.h              | 26 ++++++++-
 10 files changed, 94 insertions(+), 101 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 9cdc5bbd721f..d895d25c5b2f 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1661,21 +1661,6 @@ extern bool tdp_enabled;

 u64 vcpu_tsc_khz(struct kvm_vcpu *vcpu);

-/* control of guest tsc rate supported? */
-extern bool kvm_has_tsc_control;
-/* maximum supported tsc_khz for guests */
-extern u32  kvm_max_guest_tsc_khz;
-/* number of bits of the fractional part of the TSC scaling ratio */
-extern u8   kvm_tsc_scaling_ratio_frac_bits;
-/* maximum allowed value of TSC scaling ratio */
-extern u64  kvm_max_tsc_scaling_ratio;
-/* 1ull << kvm_tsc_scaling_ratio_frac_bits */
-extern u64  kvm_default_tsc_scaling_ratio;
-/* bus lock detection supported? */
-extern bool kvm_has_bus_lock_exit;
-
-extern u64 kvm_mce_cap_supported;
-
 /*
  * EMULTYPE_NO_DECODE - Set when re-emulating an instruction (after completing
  *			userspace I/O) to indicate that the emulation context
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index 8c1a3b2430a8..6822fc9ae86d 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -199,7 +199,7 @@ void kvm_update_pv_runtime(struct kvm_vcpu *vcpu)

 /*
  * Calculate guest's supported XCR0 taking into account guest CPUID data and
- * supported_xcr0 (comprised of host configuration and KVM_SUPPORTED_XCR0).
+ * KVM's supported XCR0 (comprised of host's XCR0 and KVM_SUPPORTED_XCR0).
  */
 static u64 cpuid_get_supported_xcr0(struct kvm_cpuid_entry2 *entries, int nent)
 {
@@ -209,7 +209,7 @@ static u64 cpuid_get_supported_xcr0(struct kvm_cpuid_entry2 *entries, int nent)
 	if (!best)
 		return 0;

-	return (best->eax | ((u64)best->edx << 32)) & supported_xcr0;
+	return (best->eax | ((u64)best->edx << 32)) & kvm_caps.supported_xcr0;
 }

 static void __kvm_update_cpuid_runtime(struct kvm_vcpu *vcpu, struct kvm_cpuid_entry2 *entries,
@@ -927,8 +927,8 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function)
 		}
 		break;
 	case 0xd: {
-		u64 permitted_xcr0 = supported_xcr0 & xstate_get_guest_group_perm();
-		u64 permitted_xss = supported_xss;
+		u64 permitted_xcr0 = kvm_caps.supported_xcr0 & xstate_get_guest_group_perm();
+		u64 permitted_xss = kvm_caps.supported_xss;

 		entry->eax &= permitted_xcr0;
 		entry->ebx = xstate_required_size(permitted_xcr0, false);
diff --git a/arch/x86/kvm/debugfs.c b/arch/x86/kvm/debugfs.c
index 9240b3b7f8dd..cfed36aba2f7 100644
--- a/arch/x86/kvm/debugfs.c
+++ b/arch/x86/kvm/debugfs.c
@@ -48,7 +48,7 @@ DEFINE_SIMPLE_ATTRIBUTE(vcpu_tsc_scaling_fops, vcpu_get_tsc_scaling_ratio, NULL,

 static int vcpu_get_tsc_scaling_frac_bits(void *data, u64 *val)
 {
-	*val = kvm_tsc_scaling_ratio_frac_bits;
+	*val = kvm_caps.tsc_scaling_ratio_frac_bits;
 	return 0;
 }

@@ -66,7 +66,7 @@ void kvm_arch_create_vcpu_debugfs(struct kvm_vcpu *vcpu, struct dentry *debugfs_
 				    debugfs_dentry, vcpu,
 				    &vcpu_timer_advance_ns_fops);

-	if (kvm_has_tsc_control) {
+	if (kvm_caps.has_tsc_control) {
 		debugfs_create_file("tsc-scaling-ratio", 0444,
 				    debugfs_dentry, vcpu,
 				    &vcpu_tsc_scaling_fops);
diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 21ab69db689b..5fd678c90288 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -1602,7 +1602,7 @@ static inline void __wait_lapic_expire(struct kvm_vcpu *vcpu, u64 guest_cycles)
 	 * that __delay() uses delay_tsc whenever the hardware has TSC, thus
 	 * always for VMX enabled hardware.
 	 */
-	if (vcpu->arch.tsc_scaling_ratio == kvm_default_tsc_scaling_ratio) {
+	if (vcpu->arch.tsc_scaling_ratio == kvm_caps.default_tsc_scaling_ratio) {
 		__delay(min(guest_cycles,
 			nsec_to_cycles(vcpu, timer_advance_ns)));
 	} else {
diff --git a/arch/x86/kvm/svm/nested.c b/arch/x86/kvm/svm/nested.c
index bed5e1692cef..d14370cec23a 100644
--- a/arch/x86/kvm/svm/nested.c
+++ b/arch/x86/kvm/svm/nested.c
@@ -648,7 +648,7 @@ static void nested_vmcb02_prepare_control(struct vcpu_svm *svm)

 	vmcb02->control.tsc_offset = vcpu->arch.tsc_offset;

-	if (svm->tsc_ratio_msr != kvm_default_tsc_scaling_ratio) {
+	if (svm->tsc_ratio_msr != kvm_caps.default_tsc_scaling_ratio) {
 		WARN_ON(!svm->tsc_scaling_enabled);
 		nested_svm_update_tsc_ratio_msr(vcpu);
 	}
@@ -979,7 +979,7 @@ int nested_svm_vmexit(struct vcpu_svm *svm)
 		vmcb_mark_dirty(vmcb01, VMCB_INTERCEPTS);
 	}

-	if (svm->tsc_ratio_msr != kvm_default_tsc_scaling_ratio) {
+	if (svm->tsc_ratio_msr != kvm_caps.default_tsc_scaling_ratio) {
 		WARN_ON(!svm->tsc_scaling_enabled);
 		vcpu->arch.tsc_scaling_ratio = vcpu->arch.l1_tsc_scaling_ratio;
 		svm_write_tsc_multiplier(vcpu, vcpu->arch.tsc_scaling_ratio);
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 3b49337998ec..480eef001074 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -1225,7 +1225,7 @@ static void __svm_vcpu_reset(struct kvm_vcpu *vcpu)

 	svm_init_osvw(vcpu);
 	vcpu->arch.microcode_version = 0x01000065;
-	svm->tsc_ratio_msr = kvm_default_tsc_scaling_ratio;
+	svm->tsc_ratio_msr = kvm_caps.default_tsc_scaling_ratio;

 	if (sev_es_guest(vcpu->kvm))
 		sev_es_vcpu_reset(svm);
@@ -4769,7 +4769,7 @@ static __init void svm_set_cpu_caps(void)
 {
 	kvm_set_cpu_caps();

-	supported_xss = 0;
+	kvm_caps.supported_xss = 0;

 	/* CPUID 0x80000001 and 0x8000000A (SVM features) */
 	if (nested) {
@@ -4845,7 +4845,8 @@ static __init int svm_hardware_setup(void)

 	init_msrpm_offsets();

-	supported_xcr0 &= ~(XFEATURE_MASK_BNDREGS | XFEATURE_MASK_BNDCSR);
+	kvm_caps.supported_xcr0 &= ~(XFEATURE_MASK_BNDREGS |
+				     XFEATURE_MASK_BNDCSR);

 	if (boot_cpu_has(X86_FEATURE_FXSR_OPT))
 		kvm_enable_efer_bits(EFER_FFXSR);
@@ -4855,11 +4856,11 @@ static __init int svm_hardware_setup(void)
 			tsc_scaling = false;
 		} else {
 			pr_info("TSC scaling supported\n");
-			kvm_has_tsc_control = true;
+			kvm_caps.has_tsc_control = true;
 		}
 	}
-	kvm_max_tsc_scaling_ratio = SVM_TSC_RATIO_MAX;
-	kvm_tsc_scaling_ratio_frac_bits = 32;
+	kvm_caps.max_tsc_scaling_ratio = SVM_TSC_RATIO_MAX;
+	kvm_caps.tsc_scaling_ratio_frac_bits = 32;

 	tsc_aux_uret_slot = kvm_add_user_return_msr(MSR_TSC_AUX);

diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index a6688663da4d..aab4745eb1ee 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -2548,7 +2548,7 @@ static int prepare_vmcs02(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12,
 			vmx_get_l2_tsc_multiplier(vcpu));

 	vmcs_write64(TSC_OFFSET, vcpu->arch.tsc_offset);
-	if (kvm_has_tsc_control)
+	if (kvm_caps.has_tsc_control)
 		vmcs_write64(TSC_MULTIPLIER, vcpu->arch.tsc_scaling_ratio);

 	nested_vmx_transition_tlb_flush(vcpu, vmcs12, true);
@@ -4610,7 +4610,7 @@ void nested_vmx_vmexit(struct kvm_vcpu *vcpu, u32 vm_exit_reason,
 	vmcs_write32(VM_EXIT_MSR_LOAD_COUNT, vmx->msr_autoload.host.nr);
 	vmcs_write32(VM_ENTRY_MSR_LOAD_COUNT, vmx->msr_autoload.guest.nr);
 	vmcs_write64(TSC_OFFSET, vcpu->arch.tsc_offset);
-	if (kvm_has_tsc_control)
+	if (kvm_caps.has_tsc_control)
 		vmcs_write64(TSC_MULTIPLIER, vcpu->arch.tsc_scaling_ratio);

 	if (vmx->nested.l1_tpr_threshold != -1)
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 8bbcf2071faf..b06eafa5884d 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -1715,7 +1715,7 @@ u64 vmx_get_l2_tsc_multiplier(struct kvm_vcpu *vcpu)
 	    nested_cpu_has2(vmcs12, SECONDARY_EXEC_TSC_SCALING))
 		return vmcs12->tsc_multiplier;

-	return kvm_default_tsc_scaling_ratio;
+	return kvm_caps.default_tsc_scaling_ratio;
 }

 static void vmx_write_tsc_offset(struct kvm_vcpu *vcpu, u64 offset)
@@ -7537,7 +7537,7 @@ static __init void vmx_set_cpu_caps(void)
 		kvm_cpu_cap_set(X86_FEATURE_UMIP);

 	/* CPUID 0xD.1 */
-	supported_xss = 0;
+	kvm_caps.supported_xss = 0;
 	if (!cpu_has_vmx_xsaves())
 		kvm_cpu_cap_clear(X86_FEATURE_XSAVES);

@@ -7678,9 +7678,9 @@ static int vmx_set_hv_timer(struct kvm_vcpu *vcpu, u64 guest_deadline_tsc,
 		delta_tsc = 0;

 	/* Convert to host delta tsc if tsc scaling is enabled */
-	if (vcpu->arch.l1_tsc_scaling_ratio != kvm_default_tsc_scaling_ratio &&
+	if (vcpu->arch.l1_tsc_scaling_ratio != kvm_caps.default_tsc_scaling_ratio &&
 	    delta_tsc && u64_shl_div_u64(delta_tsc,
-				kvm_tsc_scaling_ratio_frac_bits,
+				kvm_caps.tsc_scaling_ratio_frac_bits,
 				vcpu->arch.l1_tsc_scaling_ratio, &delta_tsc))
 		return -ERANGE;

@@ -8057,8 +8057,8 @@ static __init int hardware_setup(void)
 	}

 	if (!cpu_has_vmx_mpx())
-		supported_xcr0 &= ~(XFEATURE_MASK_BNDREGS |
-				    XFEATURE_MASK_BNDCSR);
+		kvm_caps.supported_xcr0 &= ~(XFEATURE_MASK_BNDREGS |
+					     XFEATURE_MASK_BNDCSR);

 	if (!cpu_has_vmx_vpid() || !cpu_has_vmx_invvpid() ||
 	    !(cpu_has_vmx_invvpid_single() || cpu_has_vmx_invvpid_global()))
@@ -8125,11 +8125,11 @@ static __init int hardware_setup(void)
 		enable_ipiv = false;

 	if (cpu_has_vmx_tsc_scaling())
-		kvm_has_tsc_control = true;
+		kvm_caps.has_tsc_control = true;

-	kvm_max_tsc_scaling_ratio = KVM_VMX_TSC_MULTIPLIER_MAX;
-	kvm_tsc_scaling_ratio_frac_bits = 48;
-	kvm_has_bus_lock_exit = cpu_has_vmx_bus_lock_detection();
+	kvm_caps.max_tsc_scaling_ratio = KVM_VMX_TSC_MULTIPLIER_MAX;
+	kvm_caps.tsc_scaling_ratio_frac_bits = 48;
+	kvm_caps.has_bus_lock_exit = cpu_has_vmx_bus_lock_detection();

 	set_bit(0, vmx_vpid_bitmap); /* 0 is reserved for host */

@@ -8186,7 +8186,7 @@ static __init int hardware_setup(void)
 		vmx_x86_ops.request_immediate_exit = __kvm_request_immediate_exit;
 	}

-	kvm_mce_cap_supported |= MCG_LMCE_P;
+	kvm_caps.supported_mce_cap |= MCG_LMCE_P;

 	if (pt_mode != PT_MODE_SYSTEM && pt_mode != PT_MODE_HOST_GUEST)
 		return -EINVAL;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 04812eaaf61b..a1ad422e2f37 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -87,8 +87,11 @@

 #define MAX_IO_MSRS 256
 #define KVM_MAX_MCE_BANKS 32
-u64 __read_mostly kvm_mce_cap_supported = MCG_CTL_P | MCG_SER_P;
-EXPORT_SYMBOL_GPL(kvm_mce_cap_supported);
+
+struct kvm_caps kvm_caps __read_mostly = {
+	.supported_mce_cap = MCG_CTL_P | MCG_SER_P,
+};
+EXPORT_SYMBOL_GPL(kvm_caps);

 #define  ERR_PTR_USR(e)  ((void __user *)ERR_PTR(e))

@@ -151,19 +154,6 @@ module_param(min_timer_period_us, uint, S_IRUGO | S_IWUSR);
 static bool __read_mostly kvmclock_periodic_sync = true;
 module_param(kvmclock_periodic_sync, bool, S_IRUGO);

-bool __read_mostly kvm_has_tsc_control;
-EXPORT_SYMBOL_GPL(kvm_has_tsc_control);
-u32  __read_mostly kvm_max_guest_tsc_khz;
-EXPORT_SYMBOL_GPL(kvm_max_guest_tsc_khz);
-u8   __read_mostly kvm_tsc_scaling_ratio_frac_bits;
-EXPORT_SYMBOL_GPL(kvm_tsc_scaling_ratio_frac_bits);
-u64  __read_mostly kvm_max_tsc_scaling_ratio;
-EXPORT_SYMBOL_GPL(kvm_max_tsc_scaling_ratio);
-u64 __read_mostly kvm_default_tsc_scaling_ratio;
-EXPORT_SYMBOL_GPL(kvm_default_tsc_scaling_ratio);
-bool __read_mostly kvm_has_bus_lock_exit;
-EXPORT_SYMBOL_GPL(kvm_has_bus_lock_exit);
-
 /* tsc tolerance in parts per million - default to 1/2 of the NTP threshold */
 static u32 __read_mostly tsc_tolerance_ppm = 250;
 module_param(tsc_tolerance_ppm, uint, S_IRUGO | S_IWUSR);
@@ -235,8 +225,6 @@ EXPORT_SYMBOL_GPL(enable_apicv);

 u64 __read_mostly host_xss;
 EXPORT_SYMBOL_GPL(host_xss);
-u64 __read_mostly supported_xss;
-EXPORT_SYMBOL_GPL(supported_xss);

 const struct _kvm_stats_desc kvm_vm_stats_desc[] = {
 	KVM_GENERIC_VM_STATS(),
@@ -309,8 +297,6 @@ const struct kvm_stats_header kvm_vcpu_stats_header = {
 };

 u64 __read_mostly host_xcr0;
-u64 __read_mostly supported_xcr0;
-EXPORT_SYMBOL_GPL(supported_xcr0);

 static struct kmem_cache *x86_emulator_cache;

@@ -2345,12 +2331,12 @@ static int set_tsc_khz(struct kvm_vcpu *vcpu, u32 user_tsc_khz, bool scale)

 	/* Guest TSC same frequency as host TSC? */
 	if (!scale) {
-		kvm_vcpu_write_tsc_multiplier(vcpu, kvm_default_tsc_scaling_ratio);
+		kvm_vcpu_write_tsc_multiplier(vcpu, kvm_caps.default_tsc_scaling_ratio);
 		return 0;
 	}

 	/* TSC scaling supported? */
-	if (!kvm_has_tsc_control) {
+	if (!kvm_caps.has_tsc_control) {
 		if (user_tsc_khz > tsc_khz) {
 			vcpu->arch.tsc_catchup = 1;
 			vcpu->arch.tsc_always_catchup = 1;
@@ -2362,10 +2348,10 @@ static int set_tsc_khz(struct kvm_vcpu *vcpu, u32 user_tsc_khz, bool scale)
 	}

 	/* TSC scaling required  - calculate ratio */
-	ratio = mul_u64_u32_div(1ULL << kvm_tsc_scaling_ratio_frac_bits,
+	ratio = mul_u64_u32_div(1ULL << kvm_caps.tsc_scaling_ratio_frac_bits,
 				user_tsc_khz, tsc_khz);

-	if (ratio == 0 || ratio >= kvm_max_tsc_scaling_ratio) {
+	if (ratio == 0 || ratio >= kvm_caps.max_tsc_scaling_ratio) {
 		pr_warn_ratelimited("Invalid TSC scaling ratio - virtual-tsc-khz=%u\n",
 			            user_tsc_khz);
 		return -1;
@@ -2383,7 +2369,7 @@ static int kvm_set_tsc_khz(struct kvm_vcpu *vcpu, u32 user_tsc_khz)
 	/* tsc_khz can be zero if TSC calibration fails */
 	if (user_tsc_khz == 0) {
 		/* set tsc_scaling_ratio to a safe value */
-		kvm_vcpu_write_tsc_multiplier(vcpu, kvm_default_tsc_scaling_ratio);
+		kvm_vcpu_write_tsc_multiplier(vcpu, kvm_caps.default_tsc_scaling_ratio);
 		return -1;
 	}

@@ -2460,18 +2446,18 @@ static void kvm_track_tsc_matching(struct kvm_vcpu *vcpu)
  * (frac) represent the fractional part, ie. ratio represents a fixed
  * point number (mult + frac * 2^(-N)).
  *
- * N equals to kvm_tsc_scaling_ratio_frac_bits.
+ * N equals to kvm_caps.tsc_scaling_ratio_frac_bits.
  */
 static inline u64 __scale_tsc(u64 ratio, u64 tsc)
 {
-	return mul_u64_u64_shr(tsc, ratio, kvm_tsc_scaling_ratio_frac_bits);
+	return mul_u64_u64_shr(tsc, ratio, kvm_caps.tsc_scaling_ratio_frac_bits);
 }

 u64 kvm_scale_tsc(u64 tsc, u64 ratio)
 {
 	u64 _tsc = tsc;

-	if (ratio != kvm_default_tsc_scaling_ratio)
+	if (ratio != kvm_caps.default_tsc_scaling_ratio)
 		_tsc = __scale_tsc(ratio, tsc);

 	return _tsc;
@@ -2498,11 +2484,11 @@ u64 kvm_calc_nested_tsc_offset(u64 l1_offset, u64 l2_offset, u64 l2_multiplier)
 {
 	u64 nested_offset;

-	if (l2_multiplier == kvm_default_tsc_scaling_ratio)
+	if (l2_multiplier == kvm_caps.default_tsc_scaling_ratio)
 		nested_offset = l1_offset;
 	else
 		nested_offset = mul_s64_u64_shr((s64) l1_offset, l2_multiplier,
-						kvm_tsc_scaling_ratio_frac_bits);
+						kvm_caps.tsc_scaling_ratio_frac_bits);

 	nested_offset += l2_offset;
 	return nested_offset;
@@ -2511,9 +2497,9 @@ EXPORT_SYMBOL_GPL(kvm_calc_nested_tsc_offset);

 u64 kvm_calc_nested_tsc_multiplier(u64 l1_multiplier, u64 l2_multiplier)
 {
-	if (l2_multiplier != kvm_default_tsc_scaling_ratio)
+	if (l2_multiplier != kvm_caps.default_tsc_scaling_ratio)
 		return mul_u64_u64_shr(l1_multiplier, l2_multiplier,
-				       kvm_tsc_scaling_ratio_frac_bits);
+				       kvm_caps.tsc_scaling_ratio_frac_bits);

 	return l1_multiplier;
 }
@@ -2555,7 +2541,7 @@ static void kvm_vcpu_write_tsc_multiplier(struct kvm_vcpu *vcpu, u64 l1_multipli
 	else
 		vcpu->arch.tsc_scaling_ratio = l1_multiplier;

-	if (kvm_has_tsc_control)
+	if (kvm_caps.has_tsc_control)
 		static_call(kvm_x86_write_tsc_multiplier)(
 			vcpu, vcpu->arch.tsc_scaling_ratio);
 }
@@ -2691,7 +2677,7 @@ static inline void adjust_tsc_offset_guest(struct kvm_vcpu *vcpu,

 static inline void adjust_tsc_offset_host(struct kvm_vcpu *vcpu, s64 adjustment)
 {
-	if (vcpu->arch.l1_tsc_scaling_ratio != kvm_default_tsc_scaling_ratio)
+	if (vcpu->arch.l1_tsc_scaling_ratio != kvm_caps.default_tsc_scaling_ratio)
 		WARN_ON(adjustment < 0);
 	adjustment = kvm_scale_tsc((u64) adjustment,
 				   vcpu->arch.l1_tsc_scaling_ratio);
@@ -3104,7 +3090,7 @@ static int kvm_guest_time_update(struct kvm_vcpu *v)

 	/* With all the info we got, fill in the values */

-	if (kvm_has_tsc_control)
+	if (kvm_caps.has_tsc_control)
 		tgt_tsc_khz = kvm_scale_tsc(tgt_tsc_khz,
 					    v->arch.l1_tsc_scaling_ratio);

@@ -3610,7 +3596,7 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 		 * IA32_XSS[bit 8]. Guests have to use RDMSR/WRMSR rather than
 		 * XSAVES/XRSTORS to save/restore PT MSRs.
 		 */
-		if (data & ~supported_xss)
+		if (data & ~kvm_caps.supported_xss)
 			return 1;
 		vcpu->arch.ia32_xss = data;
 		kvm_update_cpuid_runtime(vcpu);
@@ -4370,7 +4356,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 		break;
 	case KVM_CAP_TSC_CONTROL:
 	case KVM_CAP_VM_TSC_CONTROL:
-		r = kvm_has_tsc_control;
+		r = kvm_caps.has_tsc_control;
 		break;
 	case KVM_CAP_X2APIC_API:
 		r = KVM_X2APIC_API_VALID_FLAGS;
@@ -4392,7 +4378,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 		r = sched_info_on();
 		break;
 	case KVM_CAP_X86_BUS_LOCK_EXIT:
-		if (kvm_has_bus_lock_exit)
+		if (kvm_caps.has_bus_lock_exit)
 			r = KVM_BUS_LOCK_DETECTION_OFF |
 			    KVM_BUS_LOCK_DETECTION_EXIT;
 		else
@@ -4401,7 +4387,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 	case KVM_CAP_XSAVE2: {
 		u64 guest_perm = xstate_get_guest_group_perm();

-		r = xstate_required_size(supported_xcr0 & guest_perm, false);
+		r = xstate_required_size(kvm_caps.supported_xcr0 & guest_perm, false);
 		if (r < sizeof(struct kvm_xsave))
 			r = sizeof(struct kvm_xsave);
 		break;
@@ -4439,7 +4425,7 @@ static int kvm_x86_dev_get_attr(struct kvm_device_attr *attr)

 	switch (attr->attr) {
 	case KVM_X86_XCOMP_GUEST_SUPP:
-		if (put_user(supported_xcr0, uaddr))
+		if (put_user(kvm_caps.supported_xcr0, uaddr))
 			return -EFAULT;
 		return 0;
 	default:
@@ -4516,8 +4502,8 @@ long kvm_arch_dev_ioctl(struct file *filp,
 	}
 	case KVM_X86_GET_MCE_CAP_SUPPORTED:
 		r = -EFAULT;
-		if (copy_to_user(argp, &kvm_mce_cap_supported,
-				 sizeof(kvm_mce_cap_supported)))
+		if (copy_to_user(argp, &kvm_caps.supported_mce_cap,
+				 sizeof(kvm_caps.supported_mce_cap)))
 			goto out;
 		r = 0;
 		break;
@@ -4801,7 +4787,7 @@ static int kvm_vcpu_ioctl_x86_setup_mce(struct kvm_vcpu *vcpu,
 	r = -EINVAL;
 	if (!bank_num || bank_num > KVM_MAX_MCE_BANKS)
 		goto out;
-	if (mcg_cap & ~(kvm_mce_cap_supported | 0xff | 0xff0000))
+	if (mcg_cap & ~(kvm_caps.supported_mce_cap | 0xff | 0xff0000))
 		goto out;
 	r = 0;
 	vcpu->arch.mcg_cap = mcg_cap;
@@ -5093,7 +5079,8 @@ static int kvm_vcpu_ioctl_x86_set_xsave(struct kvm_vcpu *vcpu,

 	return fpu_copy_uabi_to_guest_fpstate(&vcpu->arch.guest_fpu,
 					      guest_xsave->region,
-					      supported_xcr0, &vcpu->arch.pkru);
+					      kvm_caps.supported_xcr0,
+					      &vcpu->arch.pkru);
 }

 static void kvm_vcpu_ioctl_x86_get_xcrs(struct kvm_vcpu *vcpu,
@@ -5598,8 +5585,8 @@ long kvm_arch_vcpu_ioctl(struct file *filp,
 		r = -EINVAL;
 		user_tsc_khz = (u32)arg;

-		if (kvm_has_tsc_control &&
-		    user_tsc_khz >= kvm_max_guest_tsc_khz)
+		if (kvm_caps.has_tsc_control &&
+		    user_tsc_khz >= kvm_caps.max_guest_tsc_khz)
 			goto out;

 		if (user_tsc_khz == 0)
@@ -6039,7 +6026,7 @@ int kvm_vm_ioctl_enable_cap(struct kvm *kvm,
 		    (cap->args[0] & KVM_BUS_LOCK_DETECTION_EXIT))
 			break;

-		if (kvm_has_bus_lock_exit &&
+		if (kvm_caps.has_bus_lock_exit &&
 		    cap->args[0] & KVM_BUS_LOCK_DETECTION_EXIT)
 			kvm->arch.bus_lock_detection_enabled = true;
 		r = 0;
@@ -6588,8 +6575,8 @@ long kvm_arch_vm_ioctl(struct file *filp,
 		r = -EINVAL;
 		user_tsc_khz = (u32)arg;

-		if (kvm_has_tsc_control &&
-		    user_tsc_khz >= kvm_max_guest_tsc_khz)
+		if (kvm_caps.has_tsc_control &&
+		    user_tsc_khz >= kvm_caps.max_guest_tsc_khz)
 			goto out;

 		if (user_tsc_khz == 0)
@@ -8745,7 +8732,7 @@ static void kvm_hyperv_tsc_notifier(void)
 	/* TSC frequency always matches when on Hyper-V */
 	for_each_present_cpu(cpu)
 		per_cpu(cpu_tsc_khz, cpu) = tsc_khz;
-	kvm_max_guest_tsc_khz = tsc_khz;
+	kvm_caps.max_guest_tsc_khz = tsc_khz;

 	list_for_each_entry(kvm, &vm_list, vm_list) {
 		__kvm_start_pvclock_update(kvm);
@@ -9007,7 +8994,7 @@ int kvm_arch_init(void *opaque)

 	if (boot_cpu_has(X86_FEATURE_XSAVE)) {
 		host_xcr0 = xgetbv(XCR_XFEATURE_ENABLED_MASK);
-		supported_xcr0 = host_xcr0 & KVM_SUPPORTED_XCR0;
+		kvm_caps.supported_xcr0 = host_xcr0 & KVM_SUPPORTED_XCR0;
 	}

 	if (pi_inject_timer == -1)
@@ -11703,13 +11690,13 @@ int kvm_arch_hardware_setup(void *opaque)
 	kvm_register_perf_callbacks(ops->handle_intel_pt_intr);

 	if (!kvm_cpu_cap_has(X86_FEATURE_XSAVES))
-		supported_xss = 0;
+		kvm_caps.supported_xss = 0;

 #define __kvm_cpu_cap_has(UNUSED_, f) kvm_cpu_cap_has(f)
 	cr4_reserved_bits = __cr4_reserved_bits(__kvm_cpu_cap_has, UNUSED_);
 #undef __kvm_cpu_cap_has

-	if (kvm_has_tsc_control) {
+	if (kvm_caps.has_tsc_control) {
 		/*
 		 * Make sure the user can only configure tsc_khz values that
 		 * fit into a signed integer.
@@ -11717,10 +11704,10 @@ int kvm_arch_hardware_setup(void *opaque)
 		 * be 1 on all machines.
 		 */
 		u64 max = min(0x7fffffffULL,
-			      __scale_tsc(kvm_max_tsc_scaling_ratio, tsc_khz));
-		kvm_max_guest_tsc_khz = max;
+			      __scale_tsc(kvm_caps.max_tsc_scaling_ratio, tsc_khz));
+		kvm_caps.max_guest_tsc_khz = max;
 	}
-	kvm_default_tsc_scaling_ratio = 1ULL << kvm_tsc_scaling_ratio_frac_bits;
+	kvm_caps.default_tsc_scaling_ratio = 1ULL << kvm_caps.tsc_scaling_ratio_frac_bits;
 	kvm_init_msr_list();
 	return 0;
 }
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index 588792f00334..359d0454ad28 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -8,6 +8,25 @@
 #include "kvm_cache_regs.h"
 #include "kvm_emulate.h"

+struct kvm_caps {
+	/* control of guest tsc rate supported? */
+	bool has_tsc_control;
+	/* maximum supported tsc_khz for guests */
+	u32  max_guest_tsc_khz;
+	/* number of bits of the fractional part of the TSC scaling ratio */
+	u8   tsc_scaling_ratio_frac_bits;
+	/* maximum allowed value of TSC scaling ratio */
+	u64  max_tsc_scaling_ratio;
+	/* 1ull << kvm_caps.tsc_scaling_ratio_frac_bits */
+	u64  default_tsc_scaling_ratio;
+	/* bus lock detection supported? */
+	bool has_bus_lock_exit;
+
+	u64 supported_mce_cap;
+	u64 supported_xcr0;
+	u64 supported_xss;
+};
+
 void kvm_spurious_fault(void);

 #define KVM_NESTED_VMENTER_CONSISTENCY_CHECK(consistency_check)		\
@@ -283,14 +302,15 @@ int x86_emulate_instruction(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
 fastpath_t handle_fastpath_set_msr_irqoff(struct kvm_vcpu *vcpu);

 extern u64 host_xcr0;
-extern u64 supported_xcr0;
 extern u64 host_xss;
-extern u64 supported_xss;
+
+extern struct kvm_caps kvm_caps;
+
 extern bool enable_pmu;

 static inline bool kvm_mpx_supported(void)
 {
-	return (supported_xcr0 & (XFEATURE_MASK_BNDREGS | XFEATURE_MASK_BNDCSR))
+	return (kvm_caps.supported_xcr0 & (XFEATURE_MASK_BNDREGS | XFEATURE_MASK_BNDCSR))
 		== (XFEATURE_MASK_BNDREGS | XFEATURE_MASK_BNDCSR);
 }


base-commit: a3808d88461270c71d3fece5e51cc486ecdac7d0
--
2.36.1.124.g0e6072fb45-goog
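
As a side note on the fixed-point representation being consolidated here:
the TSC scaling ratio is a 64-bit fixed-point number with N fractional bits
(N = kvm_caps.tsc_scaling_ratio_frac_bits, 48 on VMX and 32 on SVM), so a
ratio of 1.0 is 1ULL << N. The userspace sketch below redoes set_tsc_khz()'s
ratio computation and __scale_tsc()'s multiply-shift with made-up example
frequencies; the 128-bit arithmetic merely stands in for the kernel's
mul_u64_u32_div() and mul_u64_u64_shr() helpers, so this is an illustration,
not kernel code.

	#include <stdint.h>
	#include <stdio.h>

	/* (tsc * ratio) >> frac_bits, i.e. the core of __scale_tsc(). */
	static uint64_t scale_tsc(uint64_t tsc, uint64_t ratio,
				  unsigned int frac_bits)
	{
		return (uint64_t)(((unsigned __int128)tsc * ratio) >> frac_bits);
	}

	int main(void)
	{
		unsigned int frac_bits = 48;		/* VMX; SVM uses 32 */
		uint64_t tsc_khz = 2000000;		/* host at 2.0 GHz (example) */
		uint64_t user_tsc_khz = 3000000;	/* guest at 3.0 GHz (example) */

		/* set_tsc_khz(): ratio = (1ULL << N) * user_tsc_khz / tsc_khz */
		uint64_t ratio = (uint64_t)(((unsigned __int128)1 << frac_bits) *
					    user_tsc_khz / tsc_khz);

		printf("ratio = %#llx (1.5 * 2^48)\n", (unsigned long long)ratio);
		printf("host TSC 1000000 -> guest TSC %llu\n",
		       (unsigned long long)scale_tsc(1000000, ratio, frac_bits));
		return 0;
	}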




* Re: [PATCH v6 2/3] KVM: selftests: Add a test to get/set triple fault event
  2022-05-18 19:20   ` Sean Christopherson
@ 2022-05-23  6:46     ` Chenyi Qiang
  2022-05-23 16:23       ` Sean Christopherson
  0 siblings, 1 reply; 17+ messages in thread
From: Chenyi Qiang @ 2022-05-23  6:46 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Xiaoyao Li, kvm, linux-kernel



On 5/19/2022 3:20 AM, Sean Christopherson wrote:
> On Thu, Apr 21, 2022, Chenyi Qiang wrote:
>> +#ifndef __x86_64__
>> +# error This test is 64-bit only
> 
> No need, all of KVM selftests are 64-bit only.
> 
>> +#endif
>> +
>> +#define VCPU_ID			0
>> +#define ARBITRARY_IO_PORT	0x2000
>> +
>> +/* The virtual machine object. */
>> +static struct kvm_vm *vm;
>> +
>> +static void l2_guest_code(void)
>> +{
>> +	/*
>> +	 * Generate an exit to L0 userspace, i.e. main(), via I/O to an
>> +	 * arbitrary port.
>> +	 */
> 
> I think we can test a "real" triple fault without too much effort by abusing
> vcpu->run->request_interrupt_window.  E.g. map the run struct into L2, clear
> PIN_BASED_EXT_INTR_MASK in vmcs12, and then in l2_guest_code() do:
> 
> 	asm volatile("cli");
> 
> 	run->request_interrupt_window = true;
> 

Maybe a GUEST_SYNC to main() to change request_interrupt_window would
also work.
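
As a rough sketch of that alternative (the helper names follow the selftest
API of the time and are purely illustrative; it assumes the ucall machinery
is usable from L2):

	/* L2: let main() flip the bit instead of poking the run struct. */
	asm volatile("cli");
	GUEST_SYNC(1);
	asm volatile("sti; ud2");

	/* main(): on the UCALL_SYNC exit, request the IRQ window, re-enter. */
	struct kvm_run *run = vcpu_state(vm, VCPU_ID);
	struct ucall uc;

	vcpu_run(vm, VCPU_ID);
	TEST_ASSERT(get_ucall(vm, VCPU_ID, &uc) == UCALL_SYNC,
		    "expected GUEST_SYNC from L2");
	run->request_interrupt_window = true;
	vcpu_run(vm, VCPU_ID);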

> 	asm volatile("sti; ud2");
> 

How does the "real" triple fault occur here? Through UD2?

> 	asm volatile("inb %%dx, %%al"
> 		     : : [port] "d" (ARBITRARY_IO_PORT) : "rax"); 	
> 
> The STI shadow will block the IRQ-window VM-Exit until after the ud2, i.e. until
> after the triple fault is pending.
> 
> And then main() can
> 
>    1. verify it got KVM_EXIT_IRQ_WINDOW_OPEN with a pending triple fault
>    2. clear the triple fault and re-enter L1+l2
>    3. continue with the existing code, e.g. verify it got KVM_EXIT_IO, stuff a
>       pending triple fault, etc...
> 
> That said, typing that out makes me think we should technically handle/prevent this
> in KVM since triple fault / shutdown is kinda sorta just a special exception.
> 
> Heh, and a potentially bad/crazy idea would be to use a reserved/magic vector in
> kvm_vcpu_events.exception to save/restore triple fault, e.g. we could steal
> NMI_VECTOR.
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index f2d0ee9296b9..d58c0cfd3cd3 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -4743,7 +4743,8 @@ static int kvm_vcpu_ready_for_interrupt_injection(struct kvm_vcpu *vcpu)
>          return (kvm_arch_interrupt_allowed(vcpu) &&
>                  kvm_cpu_accept_dm_intr(vcpu) &&
>                  !kvm_event_needs_reinjection(vcpu) &&
> -               !vcpu->arch.exception.pending);
> +               !vcpu->arch.exception.pending &&
> +               !kvm_test_request(KVM_REQ_TRIPLE_FAULT, vcpu));
>   }
> 
>   static int kvm_vcpu_ioctl_interrupt(struct kvm_vcpu *vcpu,
> 
> 
> 


* Re: [PATCH v6 2/3] KVM: selftests: Add a test to get/set triple fault event
  2022-05-23  6:46     ` Chenyi Qiang
@ 2022-05-23 16:23       ` Sean Christopherson
  2022-05-24 13:27         ` Chenyi Qiang
  0 siblings, 1 reply; 17+ messages in thread
From: Sean Christopherson @ 2022-05-23 16:23 UTC (permalink / raw)
  To: Chenyi Qiang
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Xiaoyao Li, kvm, linux-kernel

On Mon, May 23, 2022, Chenyi Qiang wrote:
> 
> 
> On 5/19/2022 3:20 AM, Sean Christopherson wrote:
> > On Thu, Apr 21, 2022, Chenyi Qiang wrote:
> > > +#ifndef __x86_64__
> > > +# error This test is 64-bit only
> > 
> > No need, all of KVM selftests are 64-bit only.
> > 
> > > +#endif
> > > +
> > > +#define VCPU_ID			0
> > > +#define ARBITRARY_IO_PORT	0x2000
> > > +
> > > +/* The virtual machine object. */
> > > +static struct kvm_vm *vm;
> > > +
> > > +static void l2_guest_code(void)
> > > +{
> > > +	/*
> > > +	 * Generate an exit to L0 userspace, i.e. main(), via I/O to an
> > > +	 * arbitrary port.
> > > +	 */
> > 
> > I think we can test a "real" triple fault without too much effort by abusing
> > vcpu->run->request_interrupt_window.  E.g. map the run struct into L2, clear
> > PIN_BASED_EXT_INTR_MASK in vmcs12, and then in l2_guest_code() do:
> > 
> > 	asm volatile("cli");
> > 
> > 	run->request_interrupt_window = true;
> > 
> 
> Maybe a GUEST_SYNC to main() to change request_interrupt_window would also
> work.

Hmm, yes, that should work too.  Feel free to punt on writing this sub-test.  As
mentioned below, KVM should treat a pending triple fault as a pending "exception",
i.e. userspace shouldn't see KVM_EXIT_IRQ_WINDOW_OPEN with a pending triple fault,
and so the test should actually assert that the triple fault occurs in L2.  But I
can roll that test into the KVM fix if you'd like.

> > 	asm volatile("sti; ud2");
> > 
> 
> How does the "real" triple fault occur here? Through UD2?

Yeah.  IIRC, by default L2 doesn't have a valid IDT, so any fault will escalate to
a triple fault.
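
Putting the suggested pieces together, l2_guest_code() would be roughly the
sketch below; it assumes the kvm_run struct has already been mapped into L2
and is reachable as 'run', per the suggestion above:

	static void l2_guest_code(void)
	{
		/* Block IRQs so the IRQ-window exit can't fire before the ud2. */
		asm volatile("cli");

		/* 'run' is the kvm_run page mapped into L2 (see above). */
		run->request_interrupt_window = true;

		/*
		 * The STI shadow delays the IRQ-window VM-Exit until after the
		 * ud2, and with no valid IDT in L2 the resulting #UD escalates
		 * to a triple fault, which is thus pending when the exit occurs.
		 */
		asm volatile("sti; ud2");

		/* Existing exit to L0 userspace via I/O, as in the original test. */
		asm volatile("inb %%dx, %%al"
			     : : [port] "d" (ARBITRARY_IO_PORT) : "rax");
	}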

> > 	asm volatile("inb %%dx, %%al"
> > 		     : : [port] "d" (ARBITRARY_IO_PORT) : "rax"); 	
> > 
> > The STI shadow will block the IRQ-window VM-Exit until after the ud2, i.e. until
> > after the triple fault is pending.
> > 
> > And then main() can
> > 
> >    1. verify it got KVM_EXIT_IRQ_WINDOW_OPEN with a pending triple fault
> >    2. clear the triple fault and re-enter L1+l2
> >    3. continue with the existing code, e.g. verify it got KVM_EXIT_IO, stuff a
> >       pending triple fault, etc...
> > 
> > That said, typing that out makes me think we should technically handle/prevent this
> > in KVM since triple fault / shutdown is kinda sorta just a special exception.


* Re: [PATCH v6 0/3] Introduce Notify VM exit
  2022-04-21  7:29 [PATCH v6 0/3] Introduce Notify VM exit Chenyi Qiang
                   ` (3 preceding siblings ...)
  2022-05-06  2:43 ` [PATCH v6 0/3] Introduce " Chenyi Qiang
@ 2022-05-23 19:30 ` Paolo Bonzini
  2022-05-24 14:00   ` Chenyi Qiang
  4 siblings, 1 reply; 17+ messages in thread
From: Paolo Bonzini @ 2022-05-23 19:30 UTC (permalink / raw)
  To: Chenyi Qiang, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Xiaoyao Li
  Cc: kvm, linux-kernel

On 4/21/22 09:29, Chenyi Qiang wrote:
> Virtual machines can exploit Intel ISA characteristics to cause
> functional denial of service to the VMM. This series introduces a new
> feature named Notify VM exit, which can help mitigate such
> attacks.
> 
> Patch 1: An extension of the KVM_SET_VCPU_EVENTS ioctl to inject a
> synthesized shutdown event from user space. This also fixes other
> synthesized triple faults, e.g. from the RSM path or nested_vmx_abort(),
> which could get lost when exiting to userspace for migration.
> 
> Patch 2: A selftest about get/set triple fault event.
> 
> Patch 3: The main patch to enable Notify VM exit.

Chenyi, can you send v7 for inclusion?

Paolo

> ---
> Change logs:
> v5 -> v6
> - Make some changes to the documentation.
> - Add a selftest about get/set triple fault event. (Sean)
> - extend the argument to include both the notify window and some flags
>    when enabling KVM_CAP_X86_BUS_LOCK_EXIT CAP. (Sean)
> - Change to use KVM_VCPUEVENT_VALID_TRIPLE_FAULT in the flags field and add
>    pending_triple_fault field in struct kvm_vcpu_events, which allows
>    userspace to make/clear triple fault request. (Sean)
> - Add a flag in kvm_x86_ops to avoid the kvm_has_notify_vmexit global
>    variable and its export. (Sean)
> - v5: https://lore.kernel.org/lkml/20220318074955.22428-1-chenyi.qiang@intel.com/
> 
> v4 -> v5
> - rename KVM_VCPUEVENTS_SHUTDOWN to KVM_VCPUEVENTS_TRIPLE_FAULT. Make it
>    bidirectional and add it to get_vcpu_events. (Sean)
> - v4: https://lore.kernel.org/all/20220310084001.10235-1-chenyi.qiang@intel.com/
> 
> v3 -> v4
> - Change this feature to per-VM scope. (Jim)
> - Once VM_CONTEXT_INVALID set in exit_qualification, exit to user space
>    notify this fatal case, especially the notify VM exit happens in L2.
>    (Jim)
> - extend KVM_SET_VCPU_EVENTS to allow user space to inject a shutdown
>    event. (Jim)
> - A minor code changes.
> - Add document for the new KVM capability.
> - v3: https://lore.kernel.org/lkml/20220223062412.22334-1-chenyi.qiang@intel.com/
> 
> v2 -> v3
> - add a vcpu stat notify_window_exits to record the number of
>    occurrences as well as a pr_warn output. (Sean)
> - Add the handling in nested VM to prevent L1 bypassing the restriction
>    through launching a L2. (Sean)
> - Only kill L2 when the L2 VM is context-invalid; synthesize an
>    EXIT_REASON_TRIPLE_FAULT to L1. (Sean)
> - To ease the current implementation, make module parameter
>    notify_window read-only. (Sean)
> - Disable notify window exit by default.
> - v2: https://lore.kernel.org/lkml/20210525051204.1480610-1-tao3.xu@intel.com/
> 
> v1 -> v2
> - Default set notify window to 0, less than 0 to disable.
> - Add more description in commit message.
> ---
> 
> Chenyi Qiang (2):
>    KVM: X86: Save&restore the triple fault request
>    KVM: selftests: Add a test to get/set triple fault event
> 
> Tao Xu (1):
>    KVM: VMX: Enable Notify VM exit
> 
>   Documentation/virt/kvm/api.rst                | 55 +++++++++++
>   arch/x86/include/asm/kvm_host.h               |  9 ++
>   arch/x86/include/asm/vmx.h                    |  7 ++
>   arch/x86/include/asm/vmxfeatures.h            |  1 +
>   arch/x86/include/uapi/asm/kvm.h               |  4 +-
>   arch/x86/include/uapi/asm/vmx.h               |  4 +-
>   arch/x86/kvm/vmx/capabilities.h               |  6 ++
>   arch/x86/kvm/vmx/nested.c                     |  8 ++
>   arch/x86/kvm/vmx/vmx.c                        | 48 +++++++++-
>   arch/x86/kvm/x86.c                            | 33 ++++++-
>   arch/x86/kvm/x86.h                            |  5 +
>   include/uapi/linux/kvm.h                      | 10 ++
>   tools/testing/selftests/kvm/.gitignore        |  1 +
>   tools/testing/selftests/kvm/Makefile          |  1 +
>   .../kvm/x86_64/triple_fault_event_test.c      | 96 +++++++++++++++++++
>   15 files changed, 280 insertions(+), 8 deletions(-)
>   create mode 100644 tools/testing/selftests/kvm/x86_64/triple_fault_event_test.c
> 



* Re: [PATCH v6 2/3] KVM: selftests: Add a test to get/set triple fault event
  2022-05-23 16:23       ` Sean Christopherson
@ 2022-05-24 13:27         ` Chenyi Qiang
  0 siblings, 0 replies; 17+ messages in thread
From: Chenyi Qiang @ 2022-05-24 13:27 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Paolo Bonzini, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Xiaoyao Li, kvm, linux-kernel



On 5/24/2022 12:23 AM, Sean Christopherson wrote:
> On Mon, May 23, 2022, Chenyi Qiang wrote:
>>
>>
>> On 5/19/2022 3:20 AM, Sean Christopherson wrote:
>>> On Thu, Apr 21, 2022, Chenyi Qiang wrote:
>>>> +#ifndef __x86_64__
>>>> +# error This test is 64-bit only
>>>
>>> No need, all of KVM selftests are 64-bit only.
>>>
>>>> +#endif
>>>> +
>>>> +#define VCPU_ID			0
>>>> +#define ARBITRARY_IO_PORT	0x2000
>>>> +
>>>> +/* The virtual machine object. */
>>>> +static struct kvm_vm *vm;
>>>> +
>>>> +static void l2_guest_code(void)
>>>> +{
>>>> +	/*
>>>> +	 * Generate an exit to L0 userspace, i.e. main(), via I/O to an
>>>> +	 * arbitrary port.
>>>> +	 */
>>>
>>> I think we can test a "real" triple fault without too much effort by abusing
>>> vcpu->run->request_interrupt_window.  E.g. map the run struct into L2, clear
>>> PIN_BASED_EXT_INTR_MASK in vmcs12, and then in l2_guest_code() do:
>>>
>>> 	asm volatile("cli");
>>>
>>> 	run->request_interrupt_window = true;
>>>
>>
>> Maybe a GUEST_SYNC to main() to change request_interrupt_window would also
>> work.
> 
> Hmm, yes, that should work too.  Feel free to punt on writing this sub-test.  As
> mentioned below, KVM should treat a pending triple fault as a pending "exception",
> i.e. userspace shouldn't see KVM_EXIT_IRQ_WINDOW_OPEN with a pending triple fault,
> and so the test should actually assert that the triple fault occurs in L2.  But I
> can roll that test into the KVM fix if you'd like.

If userspace can't see the pending triple fault at KVM_EXIT_IRQ_WINDOW_OPEN,
asserting only that the triple fault occurs in L2 may not be meaningful in
this selftest. Anyway, feel free to add the pending triple fault test in
your KVM fix patches.

> 
>>> 	asm volatile("sti; ud2");
>>>
>>
>> How does the "real" triple fault occur here? Through UD2?
> 
> Yeah.  IIRC, by default L2 doesn't have a valid IDT, so any fault will escalate to
> a triple fault.
> 
>>> 	asm volatile("inb %%dx, %%al"
>>> 		     : : [port] "d" (ARBITRARY_IO_PORT) : "rax"); 	
>>>
>>> The STI shadow will block the IRQ-window VM-Exit until after the ud2, i.e. until
>>> after the triple fault is pending.
>>>
>>> And then main() can
>>>
>>>     1. verify it got KVM_EXIT_IRQ_WINDOW_OPEN with a pending triple fault
>>>     2. clear the triple fault and re-enter L1+l2
>>>     3. continue with the existing code, e.g. verify it got KVM_EXIT_IO, stuff a
>>>        pending triple fault, etc...
>>>
>>> That said, typing that out makes me think we should technically handle/prevent this
>>> in KVM since triple fault / shutdown is kinda sorta just a special exception.


* Re: [PATCH v6 0/3] Introduce Notify VM exit
  2022-05-23 19:30 ` Paolo Bonzini
@ 2022-05-24 14:00   ` Chenyi Qiang
  0 siblings, 0 replies; 17+ messages in thread
From: Chenyi Qiang @ 2022-05-24 14:00 UTC (permalink / raw)
  To: Paolo Bonzini, Sean Christopherson, Vitaly Kuznetsov, Wanpeng Li,
	Jim Mattson, Joerg Roedel, Xiaoyao Li
  Cc: kvm, linux-kernel



On 5/24/2022 3:30 AM, Paolo Bonzini wrote:
> On 4/21/22 09:29, Chenyi Qiang wrote:
>> Virtual machines can exploit Intel ISA characteristics to cause
>> functional denial of service to the VMM. This series introduces a new
>> feature named Notify VM exit, which can help mitigate such
>> attacks.
>>
>> Patch 1: An extension of the KVM_SET_VCPU_EVENTS ioctl to inject a
>> synthesized shutdown event from user space. This also fixes other
>> synthesized triple faults, e.g. from the RSM path or nested_vmx_abort(),
>> which could get lost when exiting to userspace for migration.
>>
>> Patch 2: A selftest about get/set triple fault event.
>>
>> Patch 3: The main patch to enable Notify VM exit.
> 
> Chenyi, can you send v7 for inclusion?
> 
> Paolo
> 

Hi Paolo,

v7 has been sent out at
https://lore.kernel.org/lkml/20220524135624.22988-1-chenyi.qiang@intel.com/

>> ---
>> Change logs:
>> v5 -> v6
>> - Make some changes to the documentation.
>> - Add a selftest about get/set triple fault event. (Sean)
>> - extend the argument to include both the notify window and some flags
>>    when enabling KVM_CAP_X86_BUS_LOCK_EXIT CAP. (Sean)
>> - Change to use KVM_VCPUEVENT_VALID_TRIPLE_FAULT in the flags field and add
>>    pending_triple_fault field in struct kvm_vcpu_events, which allows
>>    userspace to make/clear triple fault request. (Sean)
>> - Add a flag in kvm_x86_ops to avoid the kvm_has_notify_vmexit global
>>    variable and its export. (Sean)
>> - v5: 
>> https://lore.kernel.org/lkml/20220318074955.22428-1-chenyi.qiang@intel.com/ 
>>
>>
>> v4 -> v5
>> - rename KVM_VCPUEVENTS_SHUTDOWN to KVM_VCPUEVENTS_TRIPLE_FAULT. Make it
>>    bidirectional and add it to get_vcpu_events. (Sean)
>> - v4: 
>> https://lore.kernel.org/all/20220310084001.10235-1-chenyi.qiang@intel.com/ 
>>
>>
>> v3 -> v4
>> - Change this feature to per-VM scope. (Jim)
>> - Once VM_CONTEXT_INVALID set in exit_qualification, exit to user space
>>    notify this fatal case, especially the notify VM exit happens in L2.
>>    (Jim)
>> - extend KVM_SET_VCPU_EVENTS to allow user space to inject a shutdown
>>    event. (Jim)
>> - A minor code changes.
>> - Add document for the new KVM capability.
>> - v3: 
>> https://lore.kernel.org/lkml/20220223062412.22334-1-chenyi.qiang@intel.com/ 
>>
>>
>> v2 -> v3
>> - add a vcpu stat notify_window_exits to record the number of
>>    occurrences as well as a pr_warn output. (Sean)
>> - Add the handling in nested VM to prevent L1 bypassing the restriction
>>    through launching a L2. (Sean)
>> - Only kill L2 when the L2 VM is context-invalid; synthesize an
>>    EXIT_REASON_TRIPLE_FAULT to L1. (Sean)
>> - To ease the current implementation, make module parameter
>>    notify_window read-only. (Sean)
>> - Disable notify window exit by default.
>> - v2: 
>> https://lore.kernel.org/lkml/20210525051204.1480610-1-tao3.xu@intel.com/
>>
>> v1 -> v2
>> - Default set notify window to 0, less than 0 to disable.
>> - Add more description in commit message.
>> ---
>>
>> Chenyi Qiang (2):
>>    KVM: X86: Save&restore the triple fault request
>>    KVM: selftests: Add a test to get/set triple fault event
>>
>> Tao Xu (1):
>>    KVM: VMX: Enable Notify VM exit
>>
>>   Documentation/virt/kvm/api.rst                | 55 +++++++++++
>>   arch/x86/include/asm/kvm_host.h               |  9 ++
>>   arch/x86/include/asm/vmx.h                    |  7 ++
>>   arch/x86/include/asm/vmxfeatures.h            |  1 +
>>   arch/x86/include/uapi/asm/kvm.h               |  4 +-
>>   arch/x86/include/uapi/asm/vmx.h               |  4 +-
>>   arch/x86/kvm/vmx/capabilities.h               |  6 ++
>>   arch/x86/kvm/vmx/nested.c                     |  8 ++
>>   arch/x86/kvm/vmx/vmx.c                        | 48 +++++++++-
>>   arch/x86/kvm/x86.c                            | 33 ++++++-
>>   arch/x86/kvm/x86.h                            |  5 +
>>   include/uapi/linux/kvm.h                      | 10 ++
>>   tools/testing/selftests/kvm/.gitignore        |  1 +
>>   tools/testing/selftests/kvm/Makefile          |  1 +
>>   .../kvm/x86_64/triple_fault_event_test.c      | 96 +++++++++++++++++++
>>   15 files changed, 280 insertions(+), 8 deletions(-)
>>   create mode 100644 
>> tools/testing/selftests/kvm/x86_64/triple_fault_event_test.c
>>
> 


