kvm.vger.kernel.org archive mirror
* [PATCH v1 00/23] Enable FRED with KVM VMX
@ 2023-11-08 18:29 Xin Li
  2023-11-08 18:29 ` [PATCH v1 01/23] KVM: VMX: Cleanup VMX basic information defines and usages Xin Li
                   ` (22 more replies)
  0 siblings, 23 replies; 70+ messages in thread
From: Xin Li @ 2023-11-08 18:29 UTC (permalink / raw)
  To: kvm, linux-doc, linux-kernel, linux-hyperv, linux-kselftest
  Cc: seanjc, pbonzini, corbet, kys, haiyangz, wei.liu, decui, tglx,
	mingo, bp, dave.hansen, x86, hpa, vkuznets, peterz,
	ravi.v.shankar

This patch set enables the Intel flexible return and event delivery
(FRED) architecture with KVM VMX to allow guests to utilize FRED.

The FRED architecture defines simple new transitions that change
privilege level (ring transitions). The FRED architecture was
designed with the following goals:

1) Improve overall performance and response time by replacing event
   delivery through the interrupt descriptor table (IDT event
   delivery) and event return by the IRET instruction with lower
   latency transitions.

2) Improve software robustness by ensuring that event delivery
   establishes the full supervisor context and that event return
   establishes the full user context.

The new transitions defined by the FRED architecture are FRED event
delivery and, for returning from events, two FRED return instructions.
FRED event delivery can effect a transition from ring 3 to ring 0, but
it is used also to deliver events incident to ring 0. One FRED
instruction (ERETU) effects a return from ring 0 to ring 3, while the
other (ERETS) returns while remaining in ring 0. Collectively, FRED
event delivery and the FRED return instructions are FRED transitions.

The Intel VMX architecture is extended to run FRED guests, with the
following major changes:

1) New VMCS fields for FRED context management, which include two new
   event data VMCS fields, eight new guest FRED context VMCS fields and
   eight new host FRED context VMCS fields.

2) VMX nested-exception support for proper virtualization of stack
   levels introduced with the FRED architecture.

Search for the latest FRED spec in most search engines with this search pattern:

  site:intel.com FRED (flexible return and event delivery) specification

We want to send out the FRED VMX patch set for review while the FRED
native patch set v12 is being reviewed @
https://lkml.kernel.org/kvm/20231003062458.23552-1-xin3.li@intel.com/.
For easier review, I have set up a base tree with the latest FRED native
patch set on top of the tip tree in the 'fred_v12' branch of the repo
    https://github.com/xinli-intel/linux-fred-public.git.

Patches 1-2 are cleanups to the VMX basic and misc MSRs, which were sent
out earlier as preparation for the FRED changes:
https://lore.kernel.org/kvm/20231030233940.438233-1-xin@zytor.com/.

Patches 3-14 add FRED support to VMX.
Patches 15-18 add FRED support to nested VMX.
Patch 19 exposes FRED to KVM guests to complete the enabling.
Patches 20-23 add FRED selftests.


Shan Kang (1):
  KVM: selftests: Add fred exception tests

Xin Li (22):
  KVM: VMX: Cleanup VMX basic information defines and usages
  KVM: VMX: Cleanup VMX misc information defines and usages
  KVM: VMX: Add support for the secondary VM exit controls
  KVM: x86: Mark CR4.FRED as not reserved
  KVM: VMX: Initialize FRED VM entry/exit controls in vmcs_config
  KVM: VMX: Defer enabling FRED MSRs save/load until after set CPUID
  KVM: VMX: Disable intercepting FRED MSRs
  KVM: VMX: Initialize VMCS FRED fields
  KVM: VMX: Switch FRED RSP0 between host and guest
  KVM: VMX: Add support for FRED context save/restore
  KVM: x86: Add kvm_is_fred_enabled()
  KVM: VMX: Handle FRED event data
  KVM: VMX: Handle VMX nested exception for FRED
  KVM: VMX: Dump FRED context in dump_vmcs()
  KVM: nVMX: Add support for the secondary VM exit controls
  KVM: nVMX: Add FRED VMCS fields
  KVM: nVMX: Add support for VMX FRED controls
  KVM: nVMX: Add VMCS FRED states checking
  KVM: x86: Allow FRED/LKGS/WRMSRNS to be exposed to guests
  KVM: selftests: Add FRED VMCS fields to evmcs
  KVM: selftests: Run debug_regs test with FRED enabled
  KVM: selftests: Add a new VM guest mode to run user level code

 Documentation/virt/kvm/x86/nested-vmx.rst     |  19 +
 arch/x86/include/asm/hyperv-tlfs.h            |  19 +
 arch/x86/include/asm/kvm_host.h               |   9 +-
 arch/x86/include/asm/msr-index.h              |  15 +-
 arch/x86/include/asm/vmx.h                    |  57 ++-
 arch/x86/kvm/cpuid.c                          |   4 +-
 arch/x86/kvm/kvm_cache_regs.h                 |  10 +
 arch/x86/kvm/svm/svm.c                        |   4 +-
 arch/x86/kvm/vmx/capabilities.h               |  20 +-
 arch/x86/kvm/vmx/hyperv.c                     |  61 ++-
 arch/x86/kvm/vmx/nested.c                     | 315 ++++++++++++--
 arch/x86/kvm/vmx/nested.h                     |   2 +-
 arch/x86/kvm/vmx/vmcs.h                       |   1 +
 arch/x86/kvm/vmx/vmcs12.c                     |  19 +
 arch/x86/kvm/vmx/vmcs12.h                     |  38 ++
 arch/x86/kvm/vmx/vmcs_shadow_fields.h         |   6 +-
 arch/x86/kvm/vmx/vmx.c                        | 404 ++++++++++++++++--
 arch/x86/kvm/vmx/vmx.h                        |  14 +-
 arch/x86/kvm/x86.c                            |  55 ++-
 arch/x86/kvm/x86.h                            |   5 +-
 tools/testing/selftests/kvm/Makefile          |   1 +
 .../selftests/kvm/include/kvm_util_base.h     |   1 +
 .../selftests/kvm/include/x86_64/evmcs.h      | 146 +++++++
 .../selftests/kvm/include/x86_64/processor.h  |  33 ++
 .../selftests/kvm/include/x86_64/vmx.h        |  20 +
 tools/testing/selftests/kvm/lib/kvm_util.c    |   5 +-
 .../selftests/kvm/lib/x86_64/processor.c      |  15 +-
 tools/testing/selftests/kvm/lib/x86_64/vmx.c  |   4 +-
 .../testing/selftests/kvm/x86_64/debug_regs.c |  50 ++-
 .../testing/selftests/kvm/x86_64/fred_test.c  | 262 ++++++++++++
 30 files changed, 1464 insertions(+), 150 deletions(-)
 create mode 100644 tools/testing/selftests/kvm/x86_64/fred_test.c


base-commit: d49b86c24e836941c85c4906e9519fca9426a6e0
-- 
2.42.0



* [PATCH v1 01/23] KVM: VMX: Cleanup VMX basic information defines and usages
  2023-11-08 18:29 [PATCH v1 00/23] Enable FRED with KVM VMX Xin Li
@ 2023-11-08 18:29 ` Xin Li
  2023-11-08 18:29 ` [PATCH v1 02/23] KVM: VMX: Cleanup VMX misc " Xin Li
                   ` (21 subsequent siblings)
  22 siblings, 0 replies; 70+ messages in thread
From: Xin Li @ 2023-11-08 18:29 UTC (permalink / raw)
  To: kvm, linux-doc, linux-kernel, linux-hyperv, linux-kselftest
  Cc: seanjc, pbonzini, corbet, kys, haiyangz, wei.liu, decui, tglx,
	mingo, bp, dave.hansen, x86, hpa, vkuznets, peterz,
	ravi.v.shankar

Define VMX basic information fields with BIT_ULL()/GENMASK_ULL(), and
replace hardcoded VMX basic numbers with these field macros.

Per Sean's ask, save the full/raw value of MSR_IA32_VMX_BASIC in the
global vmcs_config as type u64 to get rid of the hi/lo crud, and then
use VMX_BASIC helpers to extract info as needed.
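
For reference, a quick sketch (not part of the patch) of how the new
BIT_ULL()/GENMASK_ULL() macros map onto the hardcoded masks removed
below; static_assert() is only used here to spell out the equivalence:

  /* Equivalence of the new macros and the old literals: */
  static_assert(BIT_ULL(55)         == 0x0080000000000000ULL); /* VMX_BASIC_TRUE_CTLS     */
  static_assert(BIT_ULL(54)         == 0x0040000000000000ULL); /* VMX_BASIC_INOUT         */
  static_assert(GENMASK_ULL(53, 50) == 0x003c000000000000ULL); /* VMX_BASIC_MEM_TYPE_MASK */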

Tested-by: Shan Kang <shan.kang@intel.com>
Signed-off-by: Xin Li <xin3.li@intel.com>
---

Changes since v2:
* Simply save the full/raw value of MSR_IA32_VMX_BASIC in the global
  vmcs_config, and then use the helpers to extract info from it as
  needed (Sean Christopherson).
* Move all VMX_MISC related changes to the second patch (Kai Huang).
* Commonize memory type definitions used in the VMX files, as memory
  types are architectural.

Changes since v1:
* Don't add field shift macros unless really needed, as an extra layer
  of indirection makes it harder to read (Sean Christopherson).
* Add a static_assert() to ensure that VMX_BASIC_FEATURES_MASK doesn't
  overlap with VMX_BASIC_RESERVED_BITS (Sean Christopherson).
* Read MSR_IA32_VMX_BASIC into a u64 rather than two u32s (Sean
  Christopherson).
* Add 2 new functions for extracting fields from VMX basic (Sean
  Christopherson).
* Drop the tools header update (Sean Christopherson).
* Move VMX basic field macros to arch/x86/include/asm/vmx.h.
---
 arch/x86/include/asm/msr-index.h |  9 ---------
 arch/x86/include/asm/vmx.h       | 15 +++++++++++++--
 arch/x86/kvm/vmx/capabilities.h  |  6 ++----
 arch/x86/kvm/vmx/nested.c        | 31 +++++++++++++++++++++----------
 arch/x86/kvm/vmx/vmx.c           | 24 ++++++++++--------------
 5 files changed, 46 insertions(+), 39 deletions(-)

diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index d1d6b3c3e6bd..68c75c299300 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -1110,15 +1110,6 @@
 #define MSR_IA32_VMX_VMFUNC             0x00000491
 #define MSR_IA32_VMX_PROCBASED_CTLS3	0x00000492
 
-/* VMX_BASIC bits and bitmasks */
-#define VMX_BASIC_VMCS_SIZE_SHIFT	32
-#define VMX_BASIC_TRUE_CTLS		(1ULL << 55)
-#define VMX_BASIC_64		0x0001000000000000LLU
-#define VMX_BASIC_MEM_TYPE_SHIFT	50
-#define VMX_BASIC_MEM_TYPE_MASK	0x003c000000000000LLU
-#define VMX_BASIC_MEM_TYPE_WB	6LLU
-#define VMX_BASIC_INOUT		0x0040000000000000LLU
-
 /* Resctrl MSRs: */
 /* - Intel: */
 #define MSR_IA32_L3_QOS_CFG		0xc81
diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index 4dba17363008..273960225a3b 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -121,6 +121,14 @@
 
 #define VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR	0x000011ff
 
+/* x86 memory types, explicitly used in VMX only */
+#define MEM_TYPE_WB				0x6ULL
+#define MEM_TYPE_UC				0x0ULL
+
+/* VMX_BASIC bits and bitmasks */
+#define VMX_BASIC_32BIT_PHYS_ADDR_ONLY		BIT_ULL(48)
+#define VMX_BASIC_INOUT				BIT_ULL(54)
+
 #define VMX_MISC_PREEMPTION_TIMER_RATE_MASK	0x0000001f
 #define VMX_MISC_SAVE_EFER_LMA			0x00000020
 #define VMX_MISC_ACTIVITY_HLT			0x00000040
@@ -144,6 +152,11 @@ static inline u32 vmx_basic_vmcs_size(u64 vmx_basic)
 	return (vmx_basic & GENMASK_ULL(44, 32)) >> 32;
 }
 
+static inline u32 vmx_basic_vmcs_mem_type(u64 vmx_basic)
+{
+	return (vmx_basic & GENMASK_ULL(53, 50)) >> 50;
+}
+
 static inline int vmx_misc_preemption_timer_rate(u64 vmx_misc)
 {
 	return vmx_misc & VMX_MISC_PREEMPTION_TIMER_RATE_MASK;
@@ -506,8 +519,6 @@ enum vmcs_field {
 #define VMX_EPTP_PWL_5				0x20ull
 #define VMX_EPTP_AD_ENABLE_BIT			(1ull << 6)
 #define VMX_EPTP_MT_MASK			0x7ull
-#define VMX_EPTP_MT_WB				0x6ull
-#define VMX_EPTP_MT_UC				0x0ull
 #define VMX_EPT_READABLE_MASK			0x1ull
 #define VMX_EPT_WRITABLE_MASK			0x2ull
 #define VMX_EPT_EXECUTABLE_MASK			0x4ull
diff --git a/arch/x86/kvm/vmx/capabilities.h b/arch/x86/kvm/vmx/capabilities.h
index 41a4533f9989..86ce8bb96bed 100644
--- a/arch/x86/kvm/vmx/capabilities.h
+++ b/arch/x86/kvm/vmx/capabilities.h
@@ -54,9 +54,7 @@ struct nested_vmx_msrs {
 };
 
 struct vmcs_config {
-	int size;
-	u32 basic_cap;
-	u32 revision_id;
+	u64 basic;
 	u32 pin_based_exec_ctrl;
 	u32 cpu_based_exec_ctrl;
 	u32 cpu_based_2nd_exec_ctrl;
@@ -76,7 +74,7 @@ extern struct vmx_capability vmx_capability __ro_after_init;
 
 static inline bool cpu_has_vmx_basic_inout(void)
 {
-	return	(((u64)vmcs_config.basic_cap << 32) & VMX_BASIC_INOUT);
+	return	vmcs_config.basic & VMX_BASIC_INOUT;
 }
 
 static inline bool cpu_has_virtual_nmis(void)
diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index c5ec0ef51ff7..23704f8d9035 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -1201,23 +1201,34 @@ static bool is_bitwise_subset(u64 superset, u64 subset, u64 mask)
 	return (superset | subset) == superset;
 }
 
+#define VMX_BASIC_VMCS_SIZE_SHIFT		32
+#define VMX_BASIC_DUAL_MONITOR_TREATMENT	BIT_ULL(49)
+#define VMX_BASIC_MEM_TYPE_SHIFT		50
+#define VMX_BASIC_TRUE_CTLS			BIT_ULL(55)
+
+#define VMX_BASIC_FEATURES_MASK			\
+	(VMX_BASIC_DUAL_MONITOR_TREATMENT |	\
+	 VMX_BASIC_INOUT |			\
+	 VMX_BASIC_TRUE_CTLS)
+
+#define VMX_BASIC_RESERVED_BITS			\
+	(GENMASK_ULL(63, 56) | GENMASK_ULL(47, 45) | BIT_ULL(31))
+
 static int vmx_restore_vmx_basic(struct vcpu_vmx *vmx, u64 data)
 {
-	const u64 feature_and_reserved =
-		/* feature (except bit 48; see below) */
-		BIT_ULL(49) | BIT_ULL(54) | BIT_ULL(55) |
-		/* reserved */
-		BIT_ULL(31) | GENMASK_ULL(47, 45) | GENMASK_ULL(63, 56);
 	u64 vmx_basic = vmcs_config.nested.basic;
 
-	if (!is_bitwise_subset(vmx_basic, data, feature_and_reserved))
+	static_assert(!(VMX_BASIC_FEATURES_MASK & VMX_BASIC_RESERVED_BITS));
+
+	if (!is_bitwise_subset(vmx_basic, data,
+			       VMX_BASIC_FEATURES_MASK | VMX_BASIC_RESERVED_BITS))
 		return -EINVAL;
 
 	/*
 	 * KVM does not emulate a version of VMX that constrains physical
 	 * addresses of VMX structures (e.g. VMCS) to 32-bits.
 	 */
-	if (data & BIT_ULL(48))
+	if (data & VMX_BASIC_32BIT_PHYS_ADDR_ONLY)
 		return -EINVAL;
 
 	if (vmx_basic_vmcs_revision_id(vmx_basic) !=
@@ -2690,11 +2701,11 @@ static bool nested_vmx_check_eptp(struct kvm_vcpu *vcpu, u64 new_eptp)
 
 	/* Check for memory type validity */
 	switch (new_eptp & VMX_EPTP_MT_MASK) {
-	case VMX_EPTP_MT_UC:
+	case MEM_TYPE_UC:
 		if (CC(!(vmx->nested.msrs.ept_caps & VMX_EPTP_UC_BIT)))
 			return false;
 		break;
-	case VMX_EPTP_MT_WB:
+	case MEM_TYPE_WB:
 		if (CC(!(vmx->nested.msrs.ept_caps & VMX_EPTP_WB_BIT)))
 			return false;
 		break;
@@ -6964,7 +6975,7 @@ static void nested_vmx_setup_basic(struct nested_vmx_msrs *msrs)
 		VMCS12_REVISION |
 		VMX_BASIC_TRUE_CTLS |
 		((u64)VMCS12_SIZE << VMX_BASIC_VMCS_SIZE_SHIFT) |
-		(VMX_BASIC_MEM_TYPE_WB << VMX_BASIC_MEM_TYPE_SHIFT);
+		(MEM_TYPE_WB << VMX_BASIC_MEM_TYPE_SHIFT);
 
 	if (cpu_has_vmx_basic_inout())
 		msrs->basic |= VMX_BASIC_INOUT;
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index ba5cd26137e0..6b3b2df11cea 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -2569,13 +2569,13 @@ static u64 adjust_vmx_controls64(u64 ctl_opt, u32 msr)
 static int setup_vmcs_config(struct vmcs_config *vmcs_conf,
 			     struct vmx_capability *vmx_cap)
 {
-	u32 vmx_msr_low, vmx_msr_high;
 	u32 _pin_based_exec_control = 0;
 	u32 _cpu_based_exec_control = 0;
 	u32 _cpu_based_2nd_exec_control = 0;
 	u64 _cpu_based_3rd_exec_control = 0;
 	u32 _vmexit_control = 0;
 	u32 _vmentry_control = 0;
+	u64 basic_msr;
 	u64 misc_msr;
 	int i;
 
@@ -2694,29 +2694,25 @@ static int setup_vmcs_config(struct vmcs_config *vmcs_conf,
 		_vmexit_control &= ~x_ctrl;
 	}
 
-	rdmsr(MSR_IA32_VMX_BASIC, vmx_msr_low, vmx_msr_high);
+	rdmsrl(MSR_IA32_VMX_BASIC, basic_msr);
 
 	/* IA-32 SDM Vol 3B: VMCS size is never greater than 4kB. */
-	if ((vmx_msr_high & 0x1fff) > PAGE_SIZE)
+	if ((vmx_basic_vmcs_size(basic_msr) > PAGE_SIZE))
 		return -EIO;
 
 #ifdef CONFIG_X86_64
 	/* IA-32 SDM Vol 3B: 64-bit CPUs always have VMX_BASIC_MSR[48]==0. */
-	if (vmx_msr_high & (1u<<16))
+	if (basic_msr & VMX_BASIC_32BIT_PHYS_ADDR_ONLY)
 		return -EIO;
 #endif
 
 	/* Require Write-Back (WB) memory type for VMCS accesses. */
-	if (((vmx_msr_high >> 18) & 15) != 6)
+	if (vmx_basic_vmcs_mem_type(basic_msr) != MEM_TYPE_WB)
 		return -EIO;
 
 	rdmsrl(MSR_IA32_VMX_MISC, misc_msr);
 
-	vmcs_conf->size = vmx_msr_high & 0x1fff;
-	vmcs_conf->basic_cap = vmx_msr_high & ~0x1fff;
-
-	vmcs_conf->revision_id = vmx_msr_low;
-
+	vmcs_conf->basic = basic_msr;
 	vmcs_conf->pin_based_exec_ctrl = _pin_based_exec_control;
 	vmcs_conf->cpu_based_exec_ctrl = _cpu_based_exec_control;
 	vmcs_conf->cpu_based_2nd_exec_ctrl = _cpu_based_2nd_exec_control;
@@ -2866,13 +2862,13 @@ struct vmcs *alloc_vmcs_cpu(bool shadow, int cpu, gfp_t flags)
 	if (!pages)
 		return NULL;
 	vmcs = page_address(pages);
-	memset(vmcs, 0, vmcs_config.size);
+	memset(vmcs, 0, vmx_basic_vmcs_size(vmcs_config.basic));
 
 	/* KVM supports Enlightened VMCS v1 only */
 	if (kvm_is_using_evmcs())
 		vmcs->hdr.revision_id = KVM_EVMCS_VERSION;
 	else
-		vmcs->hdr.revision_id = vmcs_config.revision_id;
+		vmcs->hdr.revision_id = vmx_basic_vmcs_revision_id(vmcs_config.basic);
 
 	if (shadow)
 		vmcs->hdr.shadow_vmcs = 1;
@@ -2965,7 +2961,7 @@ static __init int alloc_kvm_area(void)
 		 * physical CPU.
 		 */
 		if (kvm_is_using_evmcs())
-			vmcs->hdr.revision_id = vmcs_config.revision_id;
+			vmcs->hdr.revision_id = vmx_basic_vmcs_revision_id(vmcs_config.basic);
 
 		per_cpu(vmxarea, cpu) = vmcs;
 	}
@@ -3367,7 +3363,7 @@ static int vmx_get_max_ept_level(void)
 
 u64 construct_eptp(struct kvm_vcpu *vcpu, hpa_t root_hpa, int root_level)
 {
-	u64 eptp = VMX_EPTP_MT_WB;
+	u64 eptp = MEM_TYPE_WB;
 
 	eptp |= (root_level == 5) ? VMX_EPTP_PWL_5 : VMX_EPTP_PWL_4;
 
-- 
2.42.0



* [PATCH v1 02/23] KVM: VMX: Cleanup VMX misc information defines and usages
  2023-11-08 18:29 [PATCH v1 00/23] Enable FRED with KVM VMX Xin Li
  2023-11-08 18:29 ` [PATCH v1 01/23] KVM: VMX: Cleanup VMX basic information defines and usages Xin Li
@ 2023-11-08 18:29 ` Xin Li
  2023-11-08 18:29 ` [PATCH v1 03/23] KVM: VMX: Add support for the secondary VM exit controls Xin Li
                   ` (20 subsequent siblings)
  22 siblings, 0 replies; 70+ messages in thread
From: Xin Li @ 2023-11-08 18:29 UTC (permalink / raw)
  To: kvm, linux-doc, linux-kernel, linux-hyperv, linux-kselftest
  Cc: seanjc, pbonzini, corbet, kys, haiyangz, wei.liu, decui, tglx,
	mingo, bp, dave.hansen, x86, hpa, vkuznets, peterz,
	ravi.v.shankar

Define VMX misc information fields with BIT_ULL()/GENMASK_ULL(), and move
VMX misc field macros to vmx.h if used in multiple files, or to the file
where they are used if used only once.

Signed-off-by: Xin Li <xin3.li@intel.com>
---
 arch/x86/include/asm/msr-index.h |  5 -----
 arch/x86/include/asm/vmx.h       | 12 +++++------
 arch/x86/kvm/vmx/capabilities.h  |  4 ++--
 arch/x86/kvm/vmx/nested.c        | 34 ++++++++++++++++++++++++--------
 arch/x86/kvm/vmx/nested.h        |  2 +-
 arch/x86/kvm/vmx/vmx.c           |  8 +++-----
 6 files changed, 37 insertions(+), 28 deletions(-)

diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 68c75c299300..d7d4eace5dae 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -1126,11 +1126,6 @@
 #define MSR_IA32_SMBA_BW_BASE		0xc0000280
 #define MSR_IA32_EVT_CFG_BASE		0xc0000400
 
-/* MSR_IA32_VMX_MISC bits */
-#define MSR_IA32_VMX_MISC_INTEL_PT                 (1ULL << 14)
-#define MSR_IA32_VMX_MISC_VMWRITE_SHADOW_RO_FIELDS (1ULL << 29)
-#define MSR_IA32_VMX_MISC_PREEMPTION_TIMER_SCALE   0x1F
-
 /* AMD-V MSRs */
 #define MSR_VM_CR                       0xc0010114
 #define MSR_VM_IGNNE                    0xc0010115
diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index 273960225a3b..5771c5751e7e 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -129,12 +129,10 @@
 #define VMX_BASIC_32BIT_PHYS_ADDR_ONLY		BIT_ULL(48)
 #define VMX_BASIC_INOUT				BIT_ULL(54)
 
-#define VMX_MISC_PREEMPTION_TIMER_RATE_MASK	0x0000001f
-#define VMX_MISC_SAVE_EFER_LMA			0x00000020
-#define VMX_MISC_ACTIVITY_HLT			0x00000040
-#define VMX_MISC_ACTIVITY_WAIT_SIPI		0x00000100
-#define VMX_MISC_ZERO_LEN_INS			0x40000000
-#define VMX_MISC_MSR_LIST_MULTIPLIER		512
+/* VMX_MISC bits and bitmasks */
+#define VMX_MISC_INTEL_PT			BIT_ULL(14)
+#define VMX_MISC_VMWRITE_SHADOW_RO_FIELDS	BIT_ULL(29)
+#define VMX_MISC_ZERO_LEN_INS			BIT_ULL(30)
 
 /* VMFUNC functions */
 #define VMFUNC_CONTROL_BIT(x)	BIT((VMX_FEATURE_##x & 0x1f) - 28)
@@ -159,7 +157,7 @@ static inline u32 vmx_basic_vmcs_mem_type(u64 vmx_basic)
 
 static inline int vmx_misc_preemption_timer_rate(u64 vmx_misc)
 {
-	return vmx_misc & VMX_MISC_PREEMPTION_TIMER_RATE_MASK;
+	return vmx_misc & GENMASK_ULL(4, 0);
 }
 
 static inline int vmx_misc_cr3_count(u64 vmx_misc)
diff --git a/arch/x86/kvm/vmx/capabilities.h b/arch/x86/kvm/vmx/capabilities.h
index 86ce8bb96bed..cb6588238f46 100644
--- a/arch/x86/kvm/vmx/capabilities.h
+++ b/arch/x86/kvm/vmx/capabilities.h
@@ -223,7 +223,7 @@ static inline bool cpu_has_vmx_vmfunc(void)
 static inline bool cpu_has_vmx_shadow_vmcs(void)
 {
 	/* check if the cpu supports writing r/o exit information fields */
-	if (!(vmcs_config.misc & MSR_IA32_VMX_MISC_VMWRITE_SHADOW_RO_FIELDS))
+	if (!(vmcs_config.misc & VMX_MISC_VMWRITE_SHADOW_RO_FIELDS))
 		return false;
 
 	return vmcs_config.cpu_based_2nd_exec_ctrl &
@@ -365,7 +365,7 @@ static inline bool cpu_has_vmx_invvpid_global(void)
 
 static inline bool cpu_has_vmx_intel_pt(void)
 {
-	return (vmcs_config.misc & MSR_IA32_VMX_MISC_INTEL_PT) &&
+	return (vmcs_config.misc & VMX_MISC_INTEL_PT) &&
 		(vmcs_config.cpu_based_2nd_exec_ctrl & SECONDARY_EXEC_PT_USE_GPA) &&
 		(vmcs_config.vmentry_ctrl & VM_ENTRY_LOAD_IA32_RTIT_CTL);
 }
diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index 23704f8d9035..ff07d6e736a2 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -886,6 +886,8 @@ static int nested_vmx_store_msr_check(struct kvm_vcpu *vcpu,
 	return 0;
 }
 
+#define VMX_MISC_MSR_LIST_MULTIPLIER	512
+
 static u32 nested_vmx_max_atomic_switch_msrs(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
@@ -1295,18 +1297,34 @@ vmx_restore_control_msr(struct vcpu_vmx *vmx, u32 msr_index, u64 data)
 	return 0;
 }
 
+#define VMX_MISC_SAVE_EFER_LMA		BIT_ULL(5)
+#define VMX_MISC_ACTIVITY_STATE_BITMAP	GENMASK_ULL(8, 6)
+#define VMX_MISC_ACTIVITY_HLT		BIT_ULL(6)
+#define VMX_MISC_ACTIVITY_WAIT_SIPI	BIT_ULL(8)
+#define VMX_MISC_RDMSR_IN_SMM		BIT_ULL(15)
+#define VMX_MISC_VMXOFF_BLOCK_SMI	BIT_ULL(28)
+
+#define VMX_MISC_FEATURES_MASK			\
+	(VMX_MISC_SAVE_EFER_LMA |		\
+	 VMX_MISC_ACTIVITY_STATE_BITMAP |	\
+	 VMX_MISC_INTEL_PT |			\
+	 VMX_MISC_RDMSR_IN_SMM |		\
+	 VMX_MISC_VMXOFF_BLOCK_SMI |		\
+	 VMX_MISC_VMWRITE_SHADOW_RO_FIELDS |	\
+	 VMX_MISC_ZERO_LEN_INS)
+
+#define VMX_MISC_RESERVED_BITS			\
+	(BIT_ULL(31) | GENMASK_ULL(13, 9))
+
 static int vmx_restore_vmx_misc(struct vcpu_vmx *vmx, u64 data)
 {
-	const u64 feature_and_reserved_bits =
-		/* feature */
-		BIT_ULL(5) | GENMASK_ULL(8, 6) | BIT_ULL(14) | BIT_ULL(15) |
-		BIT_ULL(28) | BIT_ULL(29) | BIT_ULL(30) |
-		/* reserved */
-		GENMASK_ULL(13, 9) | BIT_ULL(31);
 	u64 vmx_misc = vmx_control_msr(vmcs_config.nested.misc_low,
 				       vmcs_config.nested.misc_high);
 
-	if (!is_bitwise_subset(vmx_misc, data, feature_and_reserved_bits))
+	static_assert(!(VMX_MISC_FEATURES_MASK & VMX_MISC_RESERVED_BITS));
+
+	if (!is_bitwise_subset(vmx_misc, data,
+			       VMX_MISC_FEATURES_MASK | VMX_MISC_RESERVED_BITS))
 		return -EINVAL;
 
 	if ((vmx->nested.msrs.pinbased_ctls_high &
@@ -6956,7 +6974,7 @@ static void nested_vmx_setup_misc_data(struct vmcs_config *vmcs_conf,
 {
 	msrs->misc_low = (u32)vmcs_conf->misc & VMX_MISC_SAVE_EFER_LMA;
 	msrs->misc_low |=
-		MSR_IA32_VMX_MISC_VMWRITE_SHADOW_RO_FIELDS |
+		VMX_MISC_VMWRITE_SHADOW_RO_FIELDS |
 		VMX_MISC_EMULATED_PREEMPTION_TIMER_RATE |
 		VMX_MISC_ACTIVITY_HLT |
 		VMX_MISC_ACTIVITY_WAIT_SIPI;
diff --git a/arch/x86/kvm/vmx/nested.h b/arch/x86/kvm/vmx/nested.h
index b4b9d51438c6..24ff4df509b6 100644
--- a/arch/x86/kvm/vmx/nested.h
+++ b/arch/x86/kvm/vmx/nested.h
@@ -108,7 +108,7 @@ static inline unsigned nested_cpu_vmx_misc_cr3_count(struct kvm_vcpu *vcpu)
 static inline bool nested_cpu_has_vmwrite_any_field(struct kvm_vcpu *vcpu)
 {
 	return to_vmx(vcpu)->nested.msrs.misc_low &
-		MSR_IA32_VMX_MISC_VMWRITE_SHADOW_RO_FIELDS;
+		VMX_MISC_VMWRITE_SHADOW_RO_FIELDS;
 }
 
 static inline bool nested_cpu_has_zero_length_injection(struct kvm_vcpu *vcpu)
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 6b3b2df11cea..396a806bf2b2 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -2576,7 +2576,6 @@ static int setup_vmcs_config(struct vmcs_config *vmcs_conf,
 	u32 _vmexit_control = 0;
 	u32 _vmentry_control = 0;
 	u64 basic_msr;
-	u64 misc_msr;
 	int i;
 
 	/*
@@ -2710,8 +2709,6 @@ static int setup_vmcs_config(struct vmcs_config *vmcs_conf,
 	if (vmx_basic_vmcs_mem_type(basic_msr) != MEM_TYPE_WB)
 		return -EIO;
 
-	rdmsrl(MSR_IA32_VMX_MISC, misc_msr);
-
 	vmcs_conf->basic = basic_msr;
 	vmcs_conf->pin_based_exec_ctrl = _pin_based_exec_control;
 	vmcs_conf->cpu_based_exec_ctrl = _cpu_based_exec_control;
@@ -2719,7 +2716,8 @@ static int setup_vmcs_config(struct vmcs_config *vmcs_conf,
 	vmcs_conf->cpu_based_3rd_exec_ctrl = _cpu_based_3rd_exec_control;
 	vmcs_conf->vmexit_ctrl         = _vmexit_control;
 	vmcs_conf->vmentry_ctrl        = _vmentry_control;
-	vmcs_conf->misc	= misc_msr;
+
+	rdmsrl(MSR_IA32_VMX_MISC, vmcs_conf->misc);
 
 #if IS_ENABLED(CONFIG_HYPERV)
 	if (enlightened_vmcs)
@@ -8555,7 +8553,7 @@ static __init int hardware_setup(void)
 		u64 use_timer_freq = 5000ULL * 1000 * 1000;
 
 		cpu_preemption_timer_multi =
-			vmcs_config.misc & VMX_MISC_PREEMPTION_TIMER_RATE_MASK;
+			vmx_misc_preemption_timer_rate(vmcs_config.misc);
 
 		if (tsc_khz)
 			use_timer_freq = (u64)tsc_khz * 1000;
-- 
2.42.0



* [PATCH v1 03/23] KVM: VMX: Add support for the secondary VM exit controls
  2023-11-08 18:29 [PATCH v1 00/23] Enable FRED with KVM VMX Xin Li
  2023-11-08 18:29 ` [PATCH v1 01/23] KVM: VMX: Cleanup VMX basic information defines and usages Xin Li
  2023-11-08 18:29 ` [PATCH v1 02/23] KVM: VMX: Cleanup VMX misc " Xin Li
@ 2023-11-08 18:29 ` Xin Li
  2023-11-08 18:29 ` [PATCH v1 04/23] KVM: x86: Mark CR4.FRED as not reserved Xin Li
                   ` (19 subsequent siblings)
  22 siblings, 0 replies; 70+ messages in thread
From: Xin Li @ 2023-11-08 18:29 UTC (permalink / raw)
  To: kvm, linux-doc, linux-kernel, linux-hyperv, linux-kselftest
  Cc: seanjc, pbonzini, corbet, kys, haiyangz, wei.liu, decui, tglx,
	mingo, bp, dave.hansen, x86, hpa, vkuznets, peterz,
	ravi.v.shankar

Enable the secondary VM exit controls to prepare for FRED enabling.

The activation of the secondary VM exit controls is off for now; it
will be switched on when a VMX feature that needs it is enabled.
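
To illustrate how the activation bit gets flipped on later, a minimal
sketch borrowing the FRED wiring from patch 06 of this series (not part
of this diff):

  /* Once a vCPU enumerates FRED (see patch 06): */
  vm_exit_controls_setbit(vmx, VM_EXIT_ACTIVATE_SECONDARY_CONTROLS);
  secondary_vm_exit_controls_setbit(vmx,
                                    SECONDARY_VM_EXIT_SAVE_IA32_FRED |
                                    SECONDARY_VM_EXIT_LOAD_IA32_FRED);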

Tested-by: Shan Kang <shan.kang@intel.com>
Signed-off-by: Xin Li <xin3.li@intel.com>
---
 arch/x86/include/asm/msr-index.h |  1 +
 arch/x86/include/asm/vmx.h       |  3 +++
 arch/x86/kvm/vmx/capabilities.h  |  9 ++++++++-
 arch/x86/kvm/vmx/vmcs.h          |  1 +
 arch/x86/kvm/vmx/vmx.c           | 28 +++++++++++++++++++++++++---
 arch/x86/kvm/vmx/vmx.h           |  7 ++++++-
 6 files changed, 44 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index d7d4eace5dae..c95ed288aa5d 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -1109,6 +1109,7 @@
 #define MSR_IA32_VMX_TRUE_ENTRY_CTLS     0x00000490
 #define MSR_IA32_VMX_VMFUNC             0x00000491
 #define MSR_IA32_VMX_PROCBASED_CTLS3	0x00000492
+#define MSR_IA32_VMX_EXIT_CTLS2		0x00000493
 
 /* Resctrl MSRs: */
 /* - Intel: */
diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index 5771c5751e7e..4d4177ec802c 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -105,6 +105,7 @@
 #define VM_EXIT_CLEAR_BNDCFGS                   0x00800000
 #define VM_EXIT_PT_CONCEAL_PIP			0x01000000
 #define VM_EXIT_CLEAR_IA32_RTIT_CTL		0x02000000
+#define VM_EXIT_ACTIVATE_SECONDARY_CONTROLS	0x80000000
 
 #define VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR	0x00036dff
 
@@ -247,6 +248,8 @@ enum vmcs_field {
 	TERTIARY_VM_EXEC_CONTROL_HIGH	= 0x00002035,
 	PID_POINTER_TABLE		= 0x00002042,
 	PID_POINTER_TABLE_HIGH		= 0x00002043,
+	SECONDARY_VM_EXIT_CONTROLS	= 0x00002044,
+	SECONDARY_VM_EXIT_CONTROLS_HIGH	= 0x00002045,
 	GUEST_PHYSICAL_ADDRESS          = 0x00002400,
 	GUEST_PHYSICAL_ADDRESS_HIGH     = 0x00002401,
 	VMCS_LINK_POINTER               = 0x00002800,
diff --git a/arch/x86/kvm/vmx/capabilities.h b/arch/x86/kvm/vmx/capabilities.h
index cb6588238f46..e8f3ad0f79ee 100644
--- a/arch/x86/kvm/vmx/capabilities.h
+++ b/arch/x86/kvm/vmx/capabilities.h
@@ -59,8 +59,9 @@ struct vmcs_config {
 	u32 cpu_based_exec_ctrl;
 	u32 cpu_based_2nd_exec_ctrl;
 	u64 cpu_based_3rd_exec_ctrl;
-	u32 vmexit_ctrl;
 	u32 vmentry_ctrl;
+	u32 vmexit_ctrl;
+	u64 secondary_vmexit_ctrl;
 	u64 misc;
 	struct nested_vmx_msrs nested;
 };
@@ -136,6 +137,12 @@ static inline bool cpu_has_tertiary_exec_ctrls(void)
 		CPU_BASED_ACTIVATE_TERTIARY_CONTROLS;
 }
 
+static inline bool cpu_has_secondary_vmexit_ctrls(void)
+{
+	return vmcs_config.vmexit_ctrl &
+		VM_EXIT_ACTIVATE_SECONDARY_CONTROLS;
+}
+
 static inline bool cpu_has_vmx_virtualize_apic_accesses(void)
 {
 	return vmcs_config.cpu_based_2nd_exec_ctrl &
diff --git a/arch/x86/kvm/vmx/vmcs.h b/arch/x86/kvm/vmx/vmcs.h
index 7c1996b433e2..7d45a6504200 100644
--- a/arch/x86/kvm/vmx/vmcs.h
+++ b/arch/x86/kvm/vmx/vmcs.h
@@ -47,6 +47,7 @@ struct vmcs_host_state {
 struct vmcs_controls_shadow {
 	u32 vm_entry;
 	u32 vm_exit;
+	u64 secondary_vm_exit;
 	u32 pin;
 	u32 exec;
 	u32 secondary_exec;
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 396a806bf2b2..df769207cbe0 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -2574,6 +2574,7 @@ static int setup_vmcs_config(struct vmcs_config *vmcs_conf,
 	u32 _cpu_based_2nd_exec_control = 0;
 	u64 _cpu_based_3rd_exec_control = 0;
 	u32 _vmexit_control = 0;
+	u64 _secondary_vmexit_control = 0;
 	u32 _vmentry_control = 0;
 	u64 basic_msr;
 	int i;
@@ -2693,6 +2694,11 @@ static int setup_vmcs_config(struct vmcs_config *vmcs_conf,
 		_vmexit_control &= ~x_ctrl;
 	}
 
+	if (_vmexit_control & VM_EXIT_ACTIVATE_SECONDARY_CONTROLS)
+		_secondary_vmexit_control =
+			adjust_vmx_controls64(KVM_OPTIONAL_VMX_SECONDARY_VM_EXIT_CONTROLS,
+					      MSR_IA32_VMX_EXIT_CTLS2);
+
 	rdmsrl(MSR_IA32_VMX_BASIC, basic_msr);
 
 	/* IA-32 SDM Vol 3B: VMCS size is never greater than 4kB. */
@@ -2714,8 +2720,9 @@ static int setup_vmcs_config(struct vmcs_config *vmcs_conf,
 	vmcs_conf->cpu_based_exec_ctrl = _cpu_based_exec_control;
 	vmcs_conf->cpu_based_2nd_exec_ctrl = _cpu_based_2nd_exec_control;
 	vmcs_conf->cpu_based_3rd_exec_ctrl = _cpu_based_3rd_exec_control;
-	vmcs_conf->vmexit_ctrl         = _vmexit_control;
 	vmcs_conf->vmentry_ctrl        = _vmentry_control;
+	vmcs_conf->vmexit_ctrl         = _vmexit_control;
+	vmcs_conf->secondary_vmexit_ctrl   = _secondary_vmexit_control;
 
 	rdmsrl(MSR_IA32_VMX_MISC, vmcs_conf->misc);
 
@@ -4420,9 +4427,21 @@ static u32 vmx_vmexit_ctrl(void)
 	if (cpu_has_perf_global_ctrl_bug())
 		vmexit_ctrl &= ~VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL;
 
-	/* Loading of EFER and PERF_GLOBAL_CTRL are toggled dynamically */
+	/*
+	 * The following features are toggled dynamically:
+	 * - Loading of PERF_GLOBAL_CTRL
+	 * - Loading of EFER
+	 * - Activating of SECONDARY_CONTROLS
+	 */
 	return vmexit_ctrl &
-		~(VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL | VM_EXIT_LOAD_IA32_EFER);
+		~(VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL |
+		  VM_EXIT_LOAD_IA32_EFER |
+		  VM_EXIT_ACTIVATE_SECONDARY_CONTROLS);
+}
+
+static u64 vmx_secondary_vmexit_ctrl(void)
+{
+	return vmcs_config.secondary_vmexit_ctrl;
 }
 
 static void vmx_refresh_apicv_exec_ctrl(struct kvm_vcpu *vcpu)
@@ -4770,6 +4789,9 @@ static void init_vmcs(struct vcpu_vmx *vmx)
 
 	vm_exit_controls_set(vmx, vmx_vmexit_ctrl());
 
+	if (cpu_has_secondary_vmexit_ctrls())
+		secondary_vm_exit_controls_set(vmx, vmx_secondary_vmexit_ctrl());
+
 	/* 22.2.1, 20.8.1 */
 	vm_entry_controls_set(vmx, vmx_vmentry_ctrl());
 
diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index c2130d2c8e24..99a0f6783085 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -502,7 +502,11 @@ static inline u8 vmx_get_rvi(void)
 	       VM_EXIT_LOAD_IA32_EFER |					\
 	       VM_EXIT_CLEAR_BNDCFGS |					\
 	       VM_EXIT_PT_CONCEAL_PIP |					\
-	       VM_EXIT_CLEAR_IA32_RTIT_CTL)
+	       VM_EXIT_CLEAR_IA32_RTIT_CTL |				\
+	       VM_EXIT_ACTIVATE_SECONDARY_CONTROLS)
+
+#define KVM_REQUIRED_VMX_SECONDARY_VM_EXIT_CONTROLS (0)
+#define KVM_OPTIONAL_VMX_SECONDARY_VM_EXIT_CONTROLS (0)
 
 #define KVM_REQUIRED_VMX_PIN_BASED_VM_EXEC_CONTROL			\
 	(PIN_BASED_EXT_INTR_MASK |					\
@@ -606,6 +610,7 @@ static __always_inline void lname##_controls_clearbit(struct vcpu_vmx *vmx, u##b
 }
 BUILD_CONTROLS_SHADOW(vm_entry, VM_ENTRY_CONTROLS, 32)
 BUILD_CONTROLS_SHADOW(vm_exit, VM_EXIT_CONTROLS, 32)
+BUILD_CONTROLS_SHADOW(secondary_vm_exit, SECONDARY_VM_EXIT_CONTROLS, 64)
 BUILD_CONTROLS_SHADOW(pin, PIN_BASED_VM_EXEC_CONTROL, 32)
 BUILD_CONTROLS_SHADOW(exec, CPU_BASED_VM_EXEC_CONTROL, 32)
 BUILD_CONTROLS_SHADOW(secondary_exec, SECONDARY_VM_EXEC_CONTROL, 32)
-- 
2.42.0



* [PATCH v1 04/23] KVM: x86: Mark CR4.FRED as not reserved
  2023-11-08 18:29 [PATCH v1 00/23] Enable FRED with KVM VMX Xin Li
                   ` (2 preceding siblings ...)
  2023-11-08 18:29 ` [PATCH v1 03/23] KVM: VMX: Add support for the secondary VM exit controls Xin Li
@ 2023-11-08 18:29 ` Xin Li
  2023-11-08 18:29 ` [PATCH v1 05/23] KVM: VMX: Initialize FRED VM entry/exit controls in vmcs_config Xin Li
                   ` (18 subsequent siblings)
  22 siblings, 0 replies; 70+ messages in thread
From: Xin Li @ 2023-11-08 18:29 UTC (permalink / raw)
  To: kvm, linux-doc, linux-kernel, linux-hyperv, linux-kselftest
  Cc: seanjc, pbonzini, corbet, kys, haiyangz, wei.liu, decui, tglx,
	mingo, bp, dave.hansen, x86, hpa, vkuznets, peterz,
	ravi.v.shankar

The CR4.FRED bit, i.e., CR4[32], is no longer a reserved bit when a guest
enumerates FRED; otherwise it remains a reserved bit.
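
A minimal sketch (not part of this diff) of the behavior the macro
change results in, reusing guest_cpuid_has() as elsewhere in this
series:

  /* CR4[32] is accepted only when the guest enumerates FRED: */
  if (!guest_cpuid_has(vcpu, X86_FEATURE_FRED) && (cr4 & X86_CR4_FRED))
          return 1;  /* attempt to set a reserved bit */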

Tested-by: Shan Kang <shan.kang@intel.com>
Signed-off-by: Xin Li <xin3.li@intel.com>
---
 arch/x86/include/asm/kvm_host.h | 3 ++-
 arch/x86/kvm/x86.h              | 2 ++
 2 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index d7036982332e..1e5a6d9439f8 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -133,7 +133,8 @@
 			  | X86_CR4_PGE | X86_CR4_PCE | X86_CR4_OSFXSR | X86_CR4_PCIDE \
 			  | X86_CR4_OSXSAVE | X86_CR4_SMEP | X86_CR4_FSGSBASE \
 			  | X86_CR4_OSXMMEXCPT | X86_CR4_LA57 | X86_CR4_VMXE \
-			  | X86_CR4_SMAP | X86_CR4_PKE | X86_CR4_UMIP))
+			  | X86_CR4_SMAP | X86_CR4_PKE | X86_CR4_UMIP \
+			  | X86_CR4_FRED))
 
 #define CR8_RESERVED_BITS (~(unsigned long)X86_CR8_TPR)
 
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index 5184fde1dc54..60da8cbe6759 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -530,6 +530,8 @@ bool kvm_msr_allowed(struct kvm_vcpu *vcpu, u32 index, u32 type);
 		__reserved_bits |= X86_CR4_VMXE;        \
 	if (!__cpu_has(__c, X86_FEATURE_PCID))          \
 		__reserved_bits |= X86_CR4_PCIDE;       \
+	if (!__cpu_has(__c, X86_FEATURE_FRED))          \
+		__reserved_bits |= X86_CR4_FRED;        \
 	__reserved_bits;                                \
 })
 
-- 
2.42.0



* [PATCH v1 05/23] KVM: VMX: Initialize FRED VM entry/exit controls in vmcs_config
  2023-11-08 18:29 [PATCH v1 00/23] Enable FRED with KVM VMX Xin Li
                   ` (3 preceding siblings ...)
  2023-11-08 18:29 ` [PATCH v1 04/23] KVM: x86: Mark CR4.FRED as not reserved Xin Li
@ 2023-11-08 18:29 ` Xin Li
  2023-11-09  8:53   ` Chao Gao
  2023-11-13 17:18   ` Nikolay Borisov
  2023-11-08 18:29 ` [PATCH v1 06/23] KVM: VMX: Defer enabling FRED MSRs save/load until after set CPUID Xin Li
                   ` (17 subsequent siblings)
  22 siblings, 2 replies; 70+ messages in thread
From: Xin Li @ 2023-11-08 18:29 UTC (permalink / raw)
  To: kvm, linux-doc, linux-kernel, linux-hyperv, linux-kselftest
  Cc: seanjc, pbonzini, corbet, kys, haiyangz, wei.liu, decui, tglx,
	mingo, bp, dave.hansen, x86, hpa, vkuznets, peterz,
	ravi.v.shankar

Setup the global vmcs_config for FRED:
1) Add VM_ENTRY_LOAD_IA32_FRED to KVM_OPTIONAL_VMX_VM_ENTRY_CONTROLS to
   have a FRED CPU load guest FRED MSRs from VMCS upon VM entry.
2) Add SECONDARY_VM_EXIT_SAVE_IA32_FRED to
   KVM_OPTIONAL_VMX_SECONDARY_VM_EXIT_CONTROLS to have a FRED CPU save
   guest FRED MSRs to VMCS during VM exit.
3) Add SECONDARY_VM_EXIT_LOAD_IA32_FRED to
   KVM_OPTIONAL_VMX_SECONDARY_VM_EXIT_CONTROLS to have a FRED CPU load
   host FRED MSRs from VMCS during VM exit.

Also add sanity checks to make sure FRED VM entry/exit controls can be
set on a FRED CPU.

Tested-by: Shan Kang <shan.kang@intel.com>
Signed-off-by: Xin Li <xin3.li@intel.com>
---
 arch/x86/include/asm/vmx.h |  3 +++
 arch/x86/kvm/vmx/vmx.c     | 19 ++++++++++++++++++-
 arch/x86/kvm/vmx/vmx.h     |  7 +++++--
 3 files changed, 26 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index 4d4177ec802c..41796a733bc9 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -106,6 +106,8 @@
 #define VM_EXIT_PT_CONCEAL_PIP			0x01000000
 #define VM_EXIT_CLEAR_IA32_RTIT_CTL		0x02000000
 #define VM_EXIT_ACTIVATE_SECONDARY_CONTROLS	0x80000000
+#define SECONDARY_VM_EXIT_SAVE_IA32_FRED	0x00000001
+#define SECONDARY_VM_EXIT_LOAD_IA32_FRED	0x00000002
 
 #define VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR	0x00036dff
 
@@ -119,6 +121,7 @@
 #define VM_ENTRY_LOAD_BNDCFGS                   0x00010000
 #define VM_ENTRY_PT_CONCEAL_PIP			0x00020000
 #define VM_ENTRY_LOAD_IA32_RTIT_CTL		0x00040000
+#define VM_ENTRY_LOAD_IA32_FRED			0x00800000
 
 #define VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR	0x000011ff
 
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index df769207cbe0..9186f41974ab 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -2694,10 +2694,27 @@ static int setup_vmcs_config(struct vmcs_config *vmcs_conf,
 		_vmexit_control &= ~x_ctrl;
 	}
 
-	if (_vmexit_control & VM_EXIT_ACTIVATE_SECONDARY_CONTROLS)
+	if (_vmexit_control & VM_EXIT_ACTIVATE_SECONDARY_CONTROLS) {
 		_secondary_vmexit_control =
 			adjust_vmx_controls64(KVM_OPTIONAL_VMX_SECONDARY_VM_EXIT_CONTROLS,
 					      MSR_IA32_VMX_EXIT_CTLS2);
+		if (cpu_feature_enabled(X86_FEATURE_FRED) &&
+		    !(_secondary_vmexit_control & SECONDARY_VM_EXIT_SAVE_IA32_FRED &&
+		      _secondary_vmexit_control & SECONDARY_VM_EXIT_LOAD_IA32_FRED)) {
+			pr_warn_once("FRED enabled but no VMX VM-Exit {SAVE,LOAD}_IA32_FRED controls: %llx\n",
+				     _secondary_vmexit_control);
+			if (error_on_inconsistent_vmcs_config)
+				return -EIO;
+		}
+	}
+
+	if (cpu_feature_enabled(X86_FEATURE_FRED) &&
+	    !(_vmentry_control & VM_ENTRY_LOAD_IA32_FRED)) {
+		pr_warn_once("FRED enabled but no VMX VM-Entry LOAD_IA32_FRED control: %x\n",
+			     _vmentry_control);
+		if (error_on_inconsistent_vmcs_config)
+			return -EIO;
+	}
 
 	rdmsrl(MSR_IA32_VMX_BASIC, basic_msr);
 
diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index 99a0f6783085..f8c02bd37069 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -480,7 +480,8 @@ static inline u8 vmx_get_rvi(void)
 	 VM_ENTRY_LOAD_IA32_EFER |					\
 	 VM_ENTRY_LOAD_BNDCFGS |					\
 	 VM_ENTRY_PT_CONCEAL_PIP |					\
-	 VM_ENTRY_LOAD_IA32_RTIT_CTL)
+	 VM_ENTRY_LOAD_IA32_RTIT_CTL |					\
+	 VM_ENTRY_LOAD_IA32_FRED)
 
 #define __KVM_REQUIRED_VMX_VM_EXIT_CONTROLS				\
 	(VM_EXIT_SAVE_DEBUG_CONTROLS |					\
@@ -506,7 +507,9 @@ static inline u8 vmx_get_rvi(void)
 	       VM_EXIT_ACTIVATE_SECONDARY_CONTROLS)
 
 #define KVM_REQUIRED_VMX_SECONDARY_VM_EXIT_CONTROLS (0)
-#define KVM_OPTIONAL_VMX_SECONDARY_VM_EXIT_CONTROLS (0)
+#define KVM_OPTIONAL_VMX_SECONDARY_VM_EXIT_CONTROLS			\
+	     (SECONDARY_VM_EXIT_SAVE_IA32_FRED |			\
+	      SECONDARY_VM_EXIT_LOAD_IA32_FRED)
 
 #define KVM_REQUIRED_VMX_PIN_BASED_VM_EXEC_CONTROL			\
 	(PIN_BASED_EXT_INTR_MASK |					\
-- 
2.42.0



* [PATCH v1 06/23] KVM: VMX: Defer enabling FRED MSRs save/load until after set CPUID
  2023-11-08 18:29 [PATCH v1 00/23] Enable FRED with KVM VMX Xin Li
                   ` (4 preceding siblings ...)
  2023-11-08 18:29 ` [PATCH v1 05/23] KVM: VMX: Initialize FRED VM entry/exit controls in vmcs_config Xin Li
@ 2023-11-08 18:29 ` Xin Li
  2023-11-09  9:15   ` Chao Gao
  2023-11-08 18:29 ` [PATCH v1 07/23] KVM: VMX: Disable intercepting FRED MSRs Xin Li
                   ` (16 subsequent siblings)
  22 siblings, 1 reply; 70+ messages in thread
From: Xin Li @ 2023-11-08 18:29 UTC (permalink / raw)
  To: kvm, linux-doc, linux-kernel, linux-hyperv, linux-kselftest
  Cc: seanjc, pbonzini, corbet, kys, haiyangz, wei.liu, decui, tglx,
	mingo, bp, dave.hansen, x86, hpa, vkuznets, peterz,
	ravi.v.shankar

Clear FRED VM entry/exit controls when initializing a vCPU, and set
these controls only if FRED is enumerated after set CPUID.

FRED VM entry/exit controls need to be set to establish context
sufficient to support FRED event delivery immediately after VM entry
and exit.  However, saving/loading FRED MSRs is not required for a
non-FRED guest, which isn't supposed to access FRED MSRs.

A non-FRED guest should get #GP upon accessing FRED MSRs; otherwise
such accesses could corrupt host FRED MSRs.
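
For orientation, a rough sketch (in comment form, not part of the diff)
of when the deferred FRED controls get set over a vCPU's lifetime:

  /*
   * KVM_CREATE_VCPU
   *   init_vmcs()                              -> FRED entry/exit bits clear
   * KVM_SET_CPUID2 (FRED enumerated)
   *   vmx_vcpu_after_set_cpuid()
   *     vmx_vcpu_config_fred_after_set_cpuid() -> FRED entry/exit bits set
   */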

Tested-by: Shan Kang <shan.kang@intel.com>
Signed-off-by: Xin Li <xin3.li@intel.com>
---
 arch/x86/kvm/vmx/vmx.c | 34 +++++++++++++++++++++++++++++++++-
 1 file changed, 33 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 9186f41974ab..5d4786812664 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -4423,6 +4423,9 @@ static u32 vmx_vmentry_ctrl(void)
 	if (cpu_has_perf_global_ctrl_bug())
 		vmentry_ctrl &= ~VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL;
 
+	/* Whether to load guest FRED MSRs is deferred until after set CPUID */
+	vmentry_ctrl &= ~VM_ENTRY_LOAD_IA32_FRED;
+
 	return vmentry_ctrl;
 }
 
@@ -4458,7 +4461,13 @@ static u32 vmx_vmexit_ctrl(void)
 
 static u64 vmx_secondary_vmexit_ctrl(void)
 {
-	return vmcs_config.secondary_vmexit_ctrl;
+	u64 secondary_vmexit_ctrl = vmcs_config.secondary_vmexit_ctrl;
+
+	/* Whether to save/load FRED MSRs is deferred until after set CPUID */
+	secondary_vmexit_ctrl &= ~(SECONDARY_VM_EXIT_SAVE_IA32_FRED |
+				   SECONDARY_VM_EXIT_LOAD_IA32_FRED);
+
+	return secondary_vmexit_ctrl;
 }
 
 static void vmx_refresh_apicv_exec_ctrl(struct kvm_vcpu *vcpu)
@@ -7785,10 +7794,33 @@ static void update_intel_pt_cfg(struct kvm_vcpu *vcpu)
 		vmx->pt_desc.ctl_bitmask &= ~(0xfULL << (32 + i * 4));
 }
 
+static void vmx_vcpu_config_fred_after_set_cpuid(struct kvm_vcpu *vcpu)
+{
+	struct vcpu_vmx *vmx = to_vmx(vcpu);
+
+	if (!cpu_feature_enabled(X86_FEATURE_FRED) ||
+	    !guest_cpuid_has(vcpu, X86_FEATURE_FRED))
+		return;
+
+	/* Enable loading guest FRED MSRs from VMCS */
+	vm_entry_controls_setbit(vmx, VM_ENTRY_LOAD_IA32_FRED);
+
+	/*
+	 * Enable saving guest FRED MSRs into VMCS and loading host FRED MSRs
+	 * from VMCS.
+	 */
+	vm_exit_controls_setbit(vmx, VM_EXIT_ACTIVATE_SECONDARY_CONTROLS);
+	secondary_vm_exit_controls_setbit(vmx,
+					  SECONDARY_VM_EXIT_SAVE_IA32_FRED |
+					  SECONDARY_VM_EXIT_LOAD_IA32_FRED);
+}
+
 static void vmx_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
 
+	vmx_vcpu_config_fred_after_set_cpuid(vcpu);
+
 	/*
 	 * XSAVES is effectively enabled if and only if XSAVE is also exposed
 	 * to the guest.  XSAVES depends on CR4.OSXSAVE, and CR4.OSXSAVE can be
-- 
2.42.0



* [PATCH v1 07/23] KVM: VMX: Disable intercepting FRED MSRs
  2023-11-08 18:29 [PATCH v1 00/23] Enable FRED with KVM VMX Xin Li
                   ` (5 preceding siblings ...)
  2023-11-08 18:29 ` [PATCH v1 06/23] KVM: VMX: Defer enabling FRED MSRs save/load until after set CPUID Xin Li
@ 2023-11-08 18:29 ` Xin Li
  2023-11-09  9:21   ` Chao Gao
  2023-11-08 18:29 ` [PATCH v1 08/23] KVM: VMX: Initialize VMCS FRED fields Xin Li
                   ` (15 subsequent siblings)
  22 siblings, 1 reply; 70+ messages in thread
From: Xin Li @ 2023-11-08 18:29 UTC (permalink / raw)
  To: kvm, linux-doc, linux-kernel, linux-hyperv, linux-kselftest
  Cc: seanjc, pbonzini, corbet, kys, haiyangz, wei.liu, decui, tglx,
	mingo, bp, dave.hansen, x86, hpa, vkuznets, peterz,
	ravi.v.shankar

Add FRED MSRs to the valid passthrough MSR list and disable intercepting
FRED MSRs only if FRED is enumerated after set CPUID.

Tested-by: Shan Kang <shan.kang@intel.com>
Signed-off-by: Xin Li <xin3.li@intel.com>
---
 arch/x86/kvm/vmx/vmx.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 5d4786812664..327e052d90c1 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -700,6 +700,9 @@ static bool is_valid_passthrough_msr(u32 msr)
 	case MSR_LBR_CORE_TO ... MSR_LBR_CORE_TO + 8:
 		/* LBR MSRs. These are handled in vmx_update_intercept_for_lbr_msrs() */
 		return true;
+	case MSR_IA32_FRED_RSP0 ... MSR_IA32_FRED_CONFIG:
+		/* FRED MSRs should be passthrough to FRED guests only */
+		return true;
 	}
 
 	r = possible_passthrough_msr_slot(msr) != -ENOENT;
@@ -7813,6 +7816,16 @@ static void vmx_vcpu_config_fred_after_set_cpuid(struct kvm_vcpu *vcpu)
 	secondary_vm_exit_controls_setbit(vmx,
 					  SECONDARY_VM_EXIT_SAVE_IA32_FRED |
 					  SECONDARY_VM_EXIT_LOAD_IA32_FRED);
+
+	vmx_disable_intercept_for_msr(vcpu, MSR_IA32_FRED_RSP0, MSR_TYPE_RW);
+	vmx_disable_intercept_for_msr(vcpu, MSR_IA32_FRED_RSP1, MSR_TYPE_RW);
+	vmx_disable_intercept_for_msr(vcpu, MSR_IA32_FRED_RSP2, MSR_TYPE_RW);
+	vmx_disable_intercept_for_msr(vcpu, MSR_IA32_FRED_RSP3, MSR_TYPE_RW);
+	vmx_disable_intercept_for_msr(vcpu, MSR_IA32_FRED_STKLVLS, MSR_TYPE_RW);
+	vmx_disable_intercept_for_msr(vcpu, MSR_IA32_FRED_SSP1, MSR_TYPE_RW);
+	vmx_disable_intercept_for_msr(vcpu, MSR_IA32_FRED_SSP2, MSR_TYPE_RW);
+	vmx_disable_intercept_for_msr(vcpu, MSR_IA32_FRED_SSP3, MSR_TYPE_RW);
+	vmx_disable_intercept_for_msr(vcpu, MSR_IA32_FRED_CONFIG, MSR_TYPE_RW);
 }
 
 static void vmx_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
-- 
2.42.0



* [PATCH v1 08/23] KVM: VMX: Initialize VMCS FRED fields
  2023-11-08 18:29 [PATCH v1 00/23] Enable FRED with KVM VMX Xin Li
                   ` (6 preceding siblings ...)
  2023-11-08 18:29 ` [PATCH v1 07/23] KVM: VMX: Disable intercepting FRED MSRs Xin Li
@ 2023-11-08 18:29 ` Xin Li
  2023-11-13  3:04   ` Chao Gao
  2023-11-08 18:29 ` [PATCH v1 09/23] KVM: VMX: Switch FRED RSP0 between host and guest Xin Li
                   ` (14 subsequent siblings)
  22 siblings, 1 reply; 70+ messages in thread
From: Xin Li @ 2023-11-08 18:29 UTC (permalink / raw)
  To: kvm, linux-doc, linux-kernel, linux-hyperv, linux-kselftest
  Cc: seanjc, pbonzini, corbet, kys, haiyangz, wei.liu, decui, tglx,
	mingo, bp, dave.hansen, x86, hpa, vkuznets, peterz,
	ravi.v.shankar

Initialize host VMCS FRED fields with host FRED MSRs' value and
guest VMCS FRED fields to 0.

FRED CPU states are managed in 9 new FRED MSRs, as well as a few
existing CPU registers and MSRs, e.g., CR4.FRED.  To support FRED
context management, new VMCS fields corresponding to most of FRED
CPU state MSRs are added to both the host-state and guest-state
areas of VMCS.

Specifically, no VMCS fields are added for the FRED RSP0 and SSP0 MSRs,
because these two MSRs are used during ring 3 event delivery only; thus
KVM, running in ring 0, can run safely even with guest FRED RSP0 and
SSP0 loaded.  Loading host FRED RSP0 and SSP0 can be deferred until
just before returning to user level.

Tested-by: Shan Kang <shan.kang@intel.com>
Signed-off-by: Xin Li <xin3.li@intel.com>
---
 arch/x86/include/asm/vmx.h | 16 ++++++++++++++++
 arch/x86/kvm/vmx/vmx.c     | 32 ++++++++++++++++++++++++++++++++
 2 files changed, 48 insertions(+)

diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index 41796a733bc9..d54a1a1057b0 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -277,12 +277,28 @@ enum vmcs_field {
 	GUEST_BNDCFGS_HIGH              = 0x00002813,
 	GUEST_IA32_RTIT_CTL		= 0x00002814,
 	GUEST_IA32_RTIT_CTL_HIGH	= 0x00002815,
+	GUEST_IA32_FRED_CONFIG		= 0x0000281a,
+	GUEST_IA32_FRED_RSP1		= 0x0000281c,
+	GUEST_IA32_FRED_RSP2		= 0x0000281e,
+	GUEST_IA32_FRED_RSP3		= 0x00002820,
+	GUEST_IA32_FRED_STKLVLS		= 0x00002822,
+	GUEST_IA32_FRED_SSP1		= 0x00002824,
+	GUEST_IA32_FRED_SSP2		= 0x00002826,
+	GUEST_IA32_FRED_SSP3		= 0x00002828,
 	HOST_IA32_PAT			= 0x00002c00,
 	HOST_IA32_PAT_HIGH		= 0x00002c01,
 	HOST_IA32_EFER			= 0x00002c02,
 	HOST_IA32_EFER_HIGH		= 0x00002c03,
 	HOST_IA32_PERF_GLOBAL_CTRL	= 0x00002c04,
 	HOST_IA32_PERF_GLOBAL_CTRL_HIGH	= 0x00002c05,
+	HOST_IA32_FRED_CONFIG		= 0x00002c08,
+	HOST_IA32_FRED_RSP1		= 0x00002c0a,
+	HOST_IA32_FRED_RSP2		= 0x00002c0c,
+	HOST_IA32_FRED_RSP3		= 0x00002c0e,
+	HOST_IA32_FRED_STKLVLS		= 0x00002c10,
+	HOST_IA32_FRED_SSP1		= 0x00002c12,
+	HOST_IA32_FRED_SSP2		= 0x00002c14,
+	HOST_IA32_FRED_SSP3		= 0x00002c16,
 	PIN_BASED_VM_EXEC_CONTROL       = 0x00004000,
 	CPU_BASED_VM_EXEC_CONTROL       = 0x00004002,
 	EXCEPTION_BITMAP                = 0x00004004,
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 327e052d90c1..41772ecdd368 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -1477,6 +1477,18 @@ void vmx_vcpu_load_vmcs(struct kvm_vcpu *vcpu, int cpu,
 				    (unsigned long)(cpu_entry_stack(cpu) + 1));
 		}
 
+#ifdef CONFIG_X86_64
+		/* Per-CPU FRED MSRs */
+		if (cpu_feature_enabled(X86_FEATURE_FRED)) {
+			vmcs_write64(HOST_IA32_FRED_RSP1, read_msr(MSR_IA32_FRED_RSP1));
+			vmcs_write64(HOST_IA32_FRED_RSP2, read_msr(MSR_IA32_FRED_RSP2));
+			vmcs_write64(HOST_IA32_FRED_RSP3, read_msr(MSR_IA32_FRED_RSP3));
+			vmcs_write64(HOST_IA32_FRED_SSP1, read_msr(MSR_IA32_FRED_SSP1));
+			vmcs_write64(HOST_IA32_FRED_SSP2, read_msr(MSR_IA32_FRED_SSP2));
+			vmcs_write64(HOST_IA32_FRED_SSP3, read_msr(MSR_IA32_FRED_SSP3));
+		}
+#endif
+
 		vmx->loaded_vmcs->cpu = cpu;
 	}
 }
@@ -4375,6 +4387,15 @@ void vmx_set_constant_host_state(struct vcpu_vmx *vmx)
 
 	if (cpu_has_load_ia32_efer())
 		vmcs_write64(HOST_IA32_EFER, host_efer);
+
+	/*
+	 * FRED MSRs are per-cpu, however FRED CONFIG and STKLVLS MSRs
+	 * are the same on all CPUs, thus they are initialized here.
+	 */
+	if (cpu_feature_enabled(X86_FEATURE_FRED)) {
+		vmcs_write64(HOST_IA32_FRED_CONFIG, read_msr(MSR_IA32_FRED_CONFIG));
+		vmcs_write64(HOST_IA32_FRED_STKLVLS, read_msr(MSR_IA32_FRED_STKLVLS));
+	}
 }
 
 void set_cr4_guest_host_mask(struct vcpu_vmx *vmx)
@@ -4936,6 +4957,17 @@ static void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
 	vmcs_writel(GUEST_IDTR_BASE, 0);
 	vmcs_write32(GUEST_IDTR_LIMIT, 0xffff);
 
+	if (cpu_feature_enabled(X86_FEATURE_FRED)) {
+		vmcs_write64(GUEST_IA32_FRED_CONFIG, 0);
+		vmcs_write64(GUEST_IA32_FRED_RSP1, 0);
+		vmcs_write64(GUEST_IA32_FRED_RSP2, 0);
+		vmcs_write64(GUEST_IA32_FRED_RSP3, 0);
+		vmcs_write64(GUEST_IA32_FRED_STKLVLS, 0);
+		vmcs_write64(GUEST_IA32_FRED_SSP1, 0);
+		vmcs_write64(GUEST_IA32_FRED_SSP2, 0);
+		vmcs_write64(GUEST_IA32_FRED_SSP3, 0);
+	}
+
 	vmcs_write32(GUEST_ACTIVITY_STATE, GUEST_ACTIVITY_ACTIVE);
 	vmcs_write32(GUEST_INTERRUPTIBILITY_INFO, 0);
 	vmcs_writel(GUEST_PENDING_DBG_EXCEPTIONS, 0);
-- 
2.42.0



* [PATCH v1 09/23] KVM: VMX: Switch FRED RSP0 between host and guest
  2023-11-08 18:29 [PATCH v1 00/23] Enable FRED with KVM VMX Xin Li
                   ` (7 preceding siblings ...)
  2023-11-08 18:29 ` [PATCH v1 08/23] KVM: VMX: Initialize VMCS FRED fields Xin Li
@ 2023-11-08 18:29 ` Xin Li
  2023-11-13  3:47   ` Chao Gao
  2023-11-08 18:29 ` [PATCH v1 10/23] KVM: VMX: Add support for FRED context save/restore Xin Li
                   ` (13 subsequent siblings)
  22 siblings, 1 reply; 70+ messages in thread
From: Xin Li @ 2023-11-08 18:29 UTC (permalink / raw)
  To: kvm, linux-doc, linux-kernel, linux-hyperv, linux-kselftest
  Cc: seanjc, pbonzini, corbet, kys, haiyangz, wei.liu, decui, tglx,
	mingo, bp, dave.hansen, x86, hpa, vkuznets, peterz,
	ravi.v.shankar

Switch MSR_IA32_FRED_RSP0 between host and guest in
vmx_prepare_switch_to_{host,guest}().

MSR_IA32_FRED_RSP0 is used during ring 3 event delivery only; thus
KVM, running in ring 0, can run safely with the guest's FRED RSP0,
i.e., there is no need to switch between host/guest FRED RSP0 during
VM entry and exit.

KVM should switch to host FRED RSP0 before returning to user level,
and switch to guest FRED RSP0 before entering guest mode.

Tested-by: Shan Kang <shan.kang@intel.com>
Signed-off-by: Xin Li <xin3.li@intel.com>
---
 arch/x86/kvm/vmx/vmx.c | 17 +++++++++++++++++
 arch/x86/kvm/vmx/vmx.h |  2 ++
 2 files changed, 19 insertions(+)

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 41772ecdd368..d00ab9d4c93e 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -1344,6 +1344,17 @@ void vmx_prepare_switch_to_guest(struct kvm_vcpu *vcpu)
 	}
 
 	wrmsrl(MSR_KERNEL_GS_BASE, vmx->msr_guest_kernel_gs_base);
+
+	if (cpu_feature_enabled(X86_FEATURE_FRED) &&
+	    guest_cpuid_has(vcpu, X86_FEATURE_FRED)) {
+		/*
+		 * MSR_IA32_FRED_RSP0 is top of task stack, which never changes.
+		 * Thus it should be initialized only once.
+		 */
+		if (unlikely(vmx->msr_host_fred_rsp0 == 0))
+			vmx->msr_host_fred_rsp0 = read_msr(MSR_IA32_FRED_RSP0);
+		wrmsrl(MSR_IA32_FRED_RSP0, vmx->msr_guest_fred_rsp0);
+	}
 #else
 	savesegment(fs, fs_sel);
 	savesegment(gs, gs_sel);
@@ -1388,6 +1399,12 @@ static void vmx_prepare_switch_to_host(struct vcpu_vmx *vmx)
 	invalidate_tss_limit();
 #ifdef CONFIG_X86_64
 	wrmsrl(MSR_KERNEL_GS_BASE, vmx->msr_host_kernel_gs_base);
+
+	if (cpu_feature_enabled(X86_FEATURE_FRED) &&
+	    guest_cpuid_has(&vmx->vcpu, X86_FEATURE_FRED)) {
+		vmx->msr_guest_fred_rsp0 = read_msr(MSR_IA32_FRED_RSP0);
+		wrmsrl(MSR_IA32_FRED_RSP0, vmx->msr_host_fred_rsp0);
+	}
 #endif
 	load_fixmap_gdt(raw_smp_processor_id());
 	vmx->guest_state_loaded = false;
diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index f8c02bd37069..328a3447f064 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -276,6 +276,8 @@ struct vcpu_vmx {
 #ifdef CONFIG_X86_64
 	u64		      msr_host_kernel_gs_base;
 	u64		      msr_guest_kernel_gs_base;
+	u64		      msr_host_fred_rsp0;
+	u64		      msr_guest_fred_rsp0;
 #endif
 
 	u64		      spec_ctrl;
-- 
2.42.0



* [PATCH v1 10/23] KVM: VMX: Add support for FRED context save/restore
  2023-11-08 18:29 [PATCH v1 00/23] Enable FRED with KVM VMX Xin Li
                   ` (8 preceding siblings ...)
  2023-11-08 18:29 ` [PATCH v1 09/23] KVM: VMX: Switch FRED RSP0 between host and guest Xin Li
@ 2023-11-08 18:29 ` Xin Li
  2023-11-13  5:24   ` Chao Gao
  2023-11-08 18:29 ` [PATCH v1 11/23] KVM: x86: Add kvm_is_fred_enabled() Xin Li
                   ` (12 subsequent siblings)
  22 siblings, 1 reply; 70+ messages in thread
From: Xin Li @ 2023-11-08 18:29 UTC (permalink / raw)
  To: kvm, linux-doc, linux-kernel, linux-hyperv, linux-kselftest
  Cc: seanjc, pbonzini, corbet, kys, haiyangz, wei.liu, decui, tglx,
	mingo, bp, dave.hansen, x86, hpa, vkuznets, peterz,
	ravi.v.shankar

Handle host-initiated FRED MSR access requests to allow FRED context
to be set/get from user level.

During VM save/restore and live migration, FRED context needs to be
saved/restored, which requires FRED MSRs to be accessed from a user
level application, e.g., Qemu.
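
As a rough user-space sketch of such a host-initiated access (assuming
vcpu_fd is an already-created KVM vCPU fd and MSR_IA32_FRED_RSP1 comes
from the FRED MSR definitions of the native series; needs <linux/kvm.h>,
<sys/ioctl.h> and <err.h>):

  struct {
          struct kvm_msrs hdr;
          struct kvm_msr_entry entry;
  } m = {
          .hdr.nmsrs   = 1,
          .entry.index = MSR_IA32_FRED_RSP1,
  };

  /* KVM_GET_MSRS returns the number of MSRs successfully read. */
  if (ioctl(vcpu_fd, KVM_GET_MSRS, &m) != 1)
          err(1, "KVM_GET_MSRS");
  /* m.entry.data now holds the guest's FRED RSP1 value. */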

Note, handling of MSR_IA32_FRED_SSP0, i.e., MSR_IA32_PL0_SSP, is not
added yet; it needs to be aligned with the KVM CET patch set.

Tested-by: Shan Kang <shan.kang@intel.com>
Signed-off-by: Xin Li <xin3.li@intel.com>
---
 arch/x86/kvm/vmx/vmx.c | 72 ++++++++++++++++++++++++++++++++++++++++++
 arch/x86/kvm/x86.c     | 23 ++++++++++++++
 2 files changed, 95 insertions(+)

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index d00ab9d4c93e..58d01e845804 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -1429,6 +1429,24 @@ static void vmx_write_guest_kernel_gs_base(struct vcpu_vmx *vmx, u64 data)
 	preempt_enable();
 	vmx->msr_guest_kernel_gs_base = data;
 }
+
+static u64 vmx_read_guest_fred_rsp0(struct vcpu_vmx *vmx)
+{
+	preempt_disable();
+	if (vmx->guest_state_loaded)
+		vmx->msr_guest_fred_rsp0 = read_msr(MSR_IA32_FRED_RSP0);
+	preempt_enable();
+	return vmx->msr_guest_fred_rsp0;
+}
+
+static void vmx_write_guest_fred_rsp0(struct vcpu_vmx *vmx, u64 data)
+{
+	preempt_disable();
+	if (vmx->guest_state_loaded)
+		wrmsrl(MSR_IA32_FRED_RSP0, data);
+	preempt_enable();
+	vmx->msr_guest_fred_rsp0 = data;
+}
 #endif
 
 void vmx_vcpu_load_vmcs(struct kvm_vcpu *vcpu, int cpu,
@@ -2028,6 +2046,33 @@ static int vmx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 	case MSR_KERNEL_GS_BASE:
 		msr_info->data = vmx_read_guest_kernel_gs_base(vmx);
 		break;
+	case MSR_IA32_FRED_RSP0:
+		msr_info->data = vmx_read_guest_fred_rsp0(vmx);
+		break;
+	case MSR_IA32_FRED_RSP1:
+		msr_info->data = vmcs_read64(GUEST_IA32_FRED_RSP1);
+		break;
+	case MSR_IA32_FRED_RSP2:
+		msr_info->data = vmcs_read64(GUEST_IA32_FRED_RSP2);
+		break;
+	case MSR_IA32_FRED_RSP3:
+		msr_info->data = vmcs_read64(GUEST_IA32_FRED_RSP3);
+		break;
+	case MSR_IA32_FRED_STKLVLS:
+		msr_info->data = vmcs_read64(GUEST_IA32_FRED_STKLVLS);
+		break;
+	case MSR_IA32_FRED_SSP1:
+		msr_info->data = vmcs_read64(GUEST_IA32_FRED_SSP1);
+		break;
+	case MSR_IA32_FRED_SSP2:
+		msr_info->data = vmcs_read64(GUEST_IA32_FRED_SSP2);
+		break;
+	case MSR_IA32_FRED_SSP3:
+		msr_info->data = vmcs_read64(GUEST_IA32_FRED_SSP3);
+		break;
+	case MSR_IA32_FRED_CONFIG:
+		msr_info->data = vmcs_read64(GUEST_IA32_FRED_CONFIG);
+		break;
 #endif
 	case MSR_EFER:
 		return kvm_get_msr_common(vcpu, msr_info);
@@ -2233,6 +2278,33 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 			vmx_update_exception_bitmap(vcpu);
 		}
 		break;
+	case MSR_IA32_FRED_RSP0:
+		vmx_write_guest_fred_rsp0(vmx, data);
+		break;
+	case MSR_IA32_FRED_RSP1:
+		vmcs_write64(GUEST_IA32_FRED_RSP1, data);
+		break;
+	case MSR_IA32_FRED_RSP2:
+		vmcs_write64(GUEST_IA32_FRED_RSP2, data);
+		break;
+	case MSR_IA32_FRED_RSP3:
+		vmcs_write64(GUEST_IA32_FRED_RSP3, data);
+		break;
+	case MSR_IA32_FRED_STKLVLS:
+		vmcs_write64(GUEST_IA32_FRED_STKLVLS, data);
+		break;
+	case MSR_IA32_FRED_SSP1:
+		vmcs_write64(GUEST_IA32_FRED_SSP1, data);
+		break;
+	case MSR_IA32_FRED_SSP2:
+		vmcs_write64(GUEST_IA32_FRED_SSP2, data);
+		break;
+	case MSR_IA32_FRED_SSP3:
+		vmcs_write64(GUEST_IA32_FRED_SSP3, data);
+		break;
+	case MSR_IA32_FRED_CONFIG:
+		vmcs_write64(GUEST_IA32_FRED_CONFIG, data);
+		break;
 #endif
 	case MSR_IA32_SYSENTER_CS:
 		if (is_guest_mode(vcpu))
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 2c924075f6f1..c5a55810647f 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1451,6 +1451,9 @@ static const u32 msrs_to_save_base[] = {
 	MSR_STAR,
 #ifdef CONFIG_X86_64
 	MSR_CSTAR, MSR_KERNEL_GS_BASE, MSR_SYSCALL_MASK, MSR_LSTAR,
+	MSR_IA32_FRED_RSP0, MSR_IA32_FRED_RSP1, MSR_IA32_FRED_RSP2,
+	MSR_IA32_FRED_RSP3, MSR_IA32_FRED_STKLVLS, MSR_IA32_FRED_SSP1,
+	MSR_IA32_FRED_SSP2, MSR_IA32_FRED_SSP3, MSR_IA32_FRED_CONFIG,
 #endif
 	MSR_IA32_TSC, MSR_IA32_CR_PAT, MSR_VM_HSAVE_PA,
 	MSR_IA32_FEAT_CTL, MSR_IA32_BNDCFGS, MSR_TSC_AUX,
@@ -1890,6 +1893,16 @@ static int __kvm_set_msr(struct kvm_vcpu *vcpu, u32 index, u64 data,
 
 		data = (u32)data;
 		break;
+	case MSR_IA32_FRED_RSP0 ... MSR_IA32_FRED_CONFIG:
+		if (host_initiated || guest_cpuid_has(vcpu, X86_FEATURE_FRED))
+			break;
+
+		/*
+		 * Inject #GP upon FRED MSRs accesses from a non-FRED guest to
+		 * make sure no malicious guest can write to FRED MSRs thus to
+		 * corrupt host FRED MSRs.
+		 */
+		return 1;
 	}
 
 	msr.data = data;
@@ -1933,6 +1946,16 @@ int __kvm_get_msr(struct kvm_vcpu *vcpu, u32 index, u64 *data,
 		    !guest_cpuid_has(vcpu, X86_FEATURE_RDPID))
 			return 1;
 		break;
+	case MSR_IA32_FRED_RSP0 ... MSR_IA32_FRED_CONFIG:
+		if (host_initiated || guest_cpuid_has(vcpu, X86_FEATURE_FRED))
+			break;
+
+		/*
+		 * Inject #GP upon FRED MSRs accesses from a non-FRED guest to
+		 * make sure no malicious guest can write to FRED MSRs thus to
+		 * corrupt host FRED MSRs.
+		 */
+		return 1;
 	}
 
 	msr.index = index;
-- 
2.42.0



* [PATCH v1 11/23] KVM: x86: Add kvm_is_fred_enabled()
  2023-11-08 18:29 [PATCH v1 00/23] Enable FRED with KVM VMX Xin Li
                   ` (9 preceding siblings ...)
  2023-11-08 18:29 ` [PATCH v1 10/23] KVM: VMX: Add support for FRED context save/restore Xin Li
@ 2023-11-08 18:29 ` Xin Li
  2023-11-13  7:35   ` Chao Gao
  2023-11-08 18:29 ` [PATCH v1 12/23] KVM: VMX: Handle FRED event data Xin Li
                   ` (11 subsequent siblings)
  22 siblings, 1 reply; 70+ messages in thread
From: Xin Li @ 2023-11-08 18:29 UTC (permalink / raw)
  To: kvm, linux-doc, linux-kernel, linux-hyperv, linux-kselftest
  Cc: seanjc, pbonzini, corbet, kys, haiyangz, wei.liu, decui, tglx,
	mingo, bp, dave.hansen, x86, hpa, vkuznets, peterz,
	ravi.v.shankar

Add kvm_is_fred_enabled() to query whether FRED is enabled on a vCPU.

Tested-by: Shan Kang <shan.kang@intel.com>
Signed-off-by: Xin Li <xin3.li@intel.com>
---
 arch/x86/kvm/kvm_cache_regs.h | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/arch/x86/kvm/kvm_cache_regs.h b/arch/x86/kvm/kvm_cache_regs.h
index 75eae9c4998a..390643e8c532 100644
--- a/arch/x86/kvm/kvm_cache_regs.h
+++ b/arch/x86/kvm/kvm_cache_regs.h
@@ -187,6 +187,16 @@ static __always_inline bool kvm_is_cr4_bit_set(struct kvm_vcpu *vcpu,
 	return !!kvm_read_cr4_bits(vcpu, cr4_bit);
 }
 
+static __always_inline bool kvm_is_fred_enabled(struct kvm_vcpu *vcpu)
+{
+#ifdef CONFIG_X86_64
+	return cpu_feature_enabled(X86_FEATURE_FRED) &&
+	       kvm_is_cr4_bit_set(vcpu, X86_CR4_FRED);
+#else
+	return false;
+#endif
+}
+
 static inline ulong kvm_read_cr3(struct kvm_vcpu *vcpu)
 {
 	if (!kvm_register_is_available(vcpu, VCPU_EXREG_CR3))
-- 
2.42.0



* [PATCH v1 12/23] KVM: VMX: Handle FRED event data
  2023-11-08 18:29 [PATCH v1 00/23] Enable FRED with KVM VMX Xin Li
                   ` (10 preceding siblings ...)
  2023-11-08 18:29 ` [PATCH v1 11/23] KVM: x86: Add kvm_is_fred_enabled() Xin Li
@ 2023-11-08 18:29 ` Xin Li
  2023-11-13 10:14   ` Chao Gao
  2023-11-08 18:29 ` [PATCH v1 13/23] KVM: VMX: Handle VMX nested exception for FRED Xin Li
                   ` (10 subsequent siblings)
  22 siblings, 1 reply; 70+ messages in thread
From: Xin Li @ 2023-11-08 18:29 UTC (permalink / raw)
  To: kvm, linux-doc, linux-kernel, linux-hyperv, linux-kselftest
  Cc: seanjc, pbonzini, corbet, kys, haiyangz, wei.liu, decui, tglx,
	mingo, bp, dave.hansen, x86, hpa, vkuznets, peterz,
	ravi.v.shankar

Set injected-event data when injecting a #PF, #DB, or #NM (the latter
caused by extended feature disable, i.e., XFD) using FRED event delivery,
and save original-event data so it can later be used as injected-event
data.

Unlike IDT event delivery, which uses extra CPU registers to carry part
of the event context (e.g., %cr2 for #PF), FRED saves a complete event
context in its stack frame, e.g., FRED saves the faulting linear address
of a #PF into the event data field defined in its stack frame.

Thus a new VMX control field called injected-event data is added to
provide the event data that will be pushed into a FRED stack frame for
VM entries that inject an event using FRED event delivery.  In addition,
a new VM exit information field called original-event data is added to
store the event data that would have been saved into a FRED stack frame
for VM exits that occur during FRED event delivery.  After such a VM exit
is handled, the data in the original-event data VMCS field needs to be
copied into the injected-event data VMCS field so that the original event
can be re-injected together with its event data.
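
To make the #DB case concrete: the FRED event data is the DR6 image with
the bits that architecturally read as 1 (DR6_RESERVED, 0xFFFF0FF0, i.e.,
bits 4..11 and 16..31) inverted, so converting between the two is a
single XOR.  A small hedged sketch, with example values rather than
anything taken from a real trace:

  #define DR6_RESERVED  0xFFFF0FF0UL  /* bits 4..11 and 16..31 */

  /* DR6 = 0xFFFF0FF1 (only B0 set) yields FRED event data 0x1. */
  static unsigned long dr6_to_fred_event_data(unsigned long dr6)
  {
      return dr6 ^ DR6_RESERVED;
  }

  /* XOR is its own inverse, so the reverse conversion is identical. */
  static unsigned long fred_event_data_to_dr6(unsigned long event_data)
  {
      return event_data ^ DR6_RESERVED;
  }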

Tested-by: Shan Kang <shan.kang@intel.com>
Signed-off-by: Xin Li <xin3.li@intel.com>
---
 arch/x86/include/asm/vmx.h |  4 ++
 arch/x86/kvm/vmx/vmx.c     | 84 +++++++++++++++++++++++++++++++++++---
 arch/x86/kvm/x86.c         | 10 ++++-
 3 files changed, 91 insertions(+), 7 deletions(-)

diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index d54a1a1057b0..97729248e844 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -253,8 +253,12 @@ enum vmcs_field {
 	PID_POINTER_TABLE_HIGH		= 0x00002043,
 	SECONDARY_VM_EXIT_CONTROLS	= 0x00002044,
 	SECONDARY_VM_EXIT_CONTROLS_HIGH	= 0x00002045,
+	INJECTED_EVENT_DATA		= 0x00002052,
+	INJECTED_EVENT_DATA_HIGH	= 0x00002053,
 	GUEST_PHYSICAL_ADDRESS          = 0x00002400,
 	GUEST_PHYSICAL_ADDRESS_HIGH     = 0x00002401,
+	ORIGINAL_EVENT_DATA		= 0x00002404,
+	ORIGINAL_EVENT_DATA_HIGH	= 0x00002405,
 	VMCS_LINK_POINTER               = 0x00002800,
 	VMCS_LINK_POINTER_HIGH          = 0x00002801,
 	GUEST_IA32_DEBUGCTL             = 0x00002802,
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 58d01e845804..67fd4a56d031 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -1880,9 +1880,30 @@ static void vmx_inject_exception(struct kvm_vcpu *vcpu)
 		vmcs_write32(VM_ENTRY_INSTRUCTION_LEN,
 			     vmx->vcpu.arch.event_exit_inst_len);
 		intr_info |= INTR_TYPE_SOFT_EXCEPTION;
-	} else
+	} else {
 		intr_info |= INTR_TYPE_HARD_EXCEPTION;
 
+		if (kvm_is_fred_enabled(vcpu)) {
+			u64 event_data = 0;
+
+			if (is_debug(intr_info))
+				/*
+				 * Compared to DR6, FRED #DB event data saved on
+				 * the stack frame have bits 4 ~ 11 and 16 ~ 31
+				 * inverted, i.e.,
+				 *   fred_db_event_data = dr6 ^ 0xFFFF0FF0UL
+				 */
+				event_data = vcpu->arch.dr6 ^ DR6_RESERVED;
+			else if (is_page_fault(intr_info))
+				event_data = vcpu->arch.cr2;
+			else if (is_nm_fault(intr_info) &&
+				 vcpu->arch.guest_fpu.fpstate->xfd)
+				event_data = vcpu->arch.guest_fpu.xfd_err;
+
+			vmcs_write64(INJECTED_EVENT_DATA, event_data);
+		}
+	}
+
 	vmcs_write32(VM_ENTRY_INTR_INFO_FIELD, intr_info);
 
 	vmx_clear_hlt(vcpu);
@@ -7226,7 +7247,8 @@ static void vmx_recover_nmi_blocking(struct vcpu_vmx *vmx)
 static void __vmx_complete_interrupts(struct kvm_vcpu *vcpu,
 				      u32 idt_vectoring_info,
 				      int instr_len_field,
-				      int error_code_field)
+				      int error_code_field,
+				      int event_data_field)
 {
 	u8 vector;
 	int type;
@@ -7260,6 +7282,37 @@ static void __vmx_complete_interrupts(struct kvm_vcpu *vcpu,
 		vcpu->arch.event_exit_inst_len = vmcs_read32(instr_len_field);
 		fallthrough;
 	case INTR_TYPE_HARD_EXCEPTION:
+		if (kvm_is_fred_enabled(vcpu) && event_data_field) {
+			/*
+			 * Save original-event data for being used as injected-event data.
+			 */
+			u64 event_data = vmcs_read64(event_data_field);
+
+			switch (vector) {
+			case DB_VECTOR:
+				get_debugreg(vcpu->arch.dr6, 6);
+				WARN_ON(vcpu->arch.dr6 != (event_data ^ DR6_RESERVED));
+				vcpu->arch.dr6 = event_data ^ DR6_RESERVED;
+				break;
+			case NM_VECTOR:
+				if (vcpu->arch.guest_fpu.fpstate->xfd) {
+					rdmsrl(MSR_IA32_XFD_ERR, vcpu->arch.guest_fpu.xfd_err);
+					WARN_ON(vcpu->arch.guest_fpu.xfd_err != event_data);
+					vcpu->arch.guest_fpu.xfd_err = event_data;
+				} else {
+					WARN_ON(event_data != 0);
+				}
+				break;
+			case PF_VECTOR:
+				WARN_ON(vcpu->arch.cr2 != event_data);
+				vcpu->arch.cr2 = event_data;
+				break;
+			default:
+				WARN_ON(event_data != 0);
+				break;
+			}
+		}
+
 		if (idt_vectoring_info & VECTORING_INFO_DELIVER_CODE_MASK) {
 			u32 err = vmcs_read32(error_code_field);
 			kvm_requeue_exception_e(vcpu, vector, err);
@@ -7279,9 +7332,11 @@ static void __vmx_complete_interrupts(struct kvm_vcpu *vcpu,
 
 static void vmx_complete_interrupts(struct vcpu_vmx *vmx)
 {
-	__vmx_complete_interrupts(&vmx->vcpu, vmx->idt_vectoring_info,
+	__vmx_complete_interrupts(&vmx->vcpu,
+				  vmx->idt_vectoring_info,
 				  VM_EXIT_INSTRUCTION_LEN,
-				  IDT_VECTORING_ERROR_CODE);
+				  IDT_VECTORING_ERROR_CODE,
+				  ORIGINAL_EVENT_DATA);
 }
 
 static void vmx_cancel_injection(struct kvm_vcpu *vcpu)
@@ -7289,7 +7344,8 @@ static void vmx_cancel_injection(struct kvm_vcpu *vcpu)
 	__vmx_complete_interrupts(vcpu,
 				  vmcs_read32(VM_ENTRY_INTR_INFO_FIELD),
 				  VM_ENTRY_INSTRUCTION_LEN,
-				  VM_ENTRY_EXCEPTION_ERROR_CODE);
+				  VM_ENTRY_EXCEPTION_ERROR_CODE,
+				  0);
 
 	vmcs_write32(VM_ENTRY_INTR_INFO_FIELD, 0);
 }
@@ -7406,6 +7462,24 @@ static noinstr void vmx_vcpu_enter_exit(struct kvm_vcpu *vcpu,
 
 	vmx_disable_fb_clear(vmx);
 
+	/*
+	 * %cr2 needs to be saved after a VM exit and restored before a VM
+	 * entry in case a VM exit happens immediately after delivery of a
+	 * guest #PF but before guest reads %cr2.
+	 *
+	 * A FRED guest should read its #PF faulting linear address from
+	 * the event data field in its FRED stack frame instead of %cr2.
+	 * But the FRED 5.0 spec still requires a FRED CPU to update %cr2
+	 * in the normal way, thus %cr2 is still updated even for a FRED
+	 * guest.
+	 *
+	 * Note, an NMI could interrupt KVM:
+	 *   1) after VM exit but before CR2 is saved.
+	 *   2) after CR2 is restored but before VM entry.
+	 * And a #PF could happen during NMI handling, which overwrites %cr2.
+	 * Thus exc_nmi() should save and restore %cr2 upon entering and
+	 * before leaving to make sure %cr2 is not corrupted.
+	 */
 	if (vcpu->arch.cr2 != native_read_cr2())
 		native_write_cr2(vcpu->arch.cr2);
 
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index c5a55810647f..d190bfc63fc4 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -680,8 +680,14 @@ static void kvm_multiple_exception(struct kvm_vcpu *vcpu,
 			vcpu->arch.exception.injected = true;
 			if (WARN_ON_ONCE(has_payload)) {
 				/*
-				 * A reinjected event has already
-				 * delivered its payload.
+				 * For a reinjected event, KVM delivers its
+				 * payload through:
+				 *   #PF: save %cr2 into arch.cr2 immediately
+				 *        after VM exits.
+				 *   #DB: save %dr6 into arch.dr6 later in
+				 *        sync_dirty_debug_regs().
+				 *
+				 * For FRED guest, see __vmx_complete_interrupts().
 				 */
 				has_payload = false;
 				payload = 0;
-- 
2.42.0



* [PATCH v1 13/23] KVM: VMX: Handle VMX nested exception for FRED
  2023-11-08 18:29 [PATCH v1 00/23] Enable FRED with KVM VMX Xin Li
                   ` (11 preceding siblings ...)
  2023-11-08 18:29 ` [PATCH v1 12/23] KVM: VMX: Handle FRED event data Xin Li
@ 2023-11-08 18:29 ` Xin Li
  2023-11-14  7:40   ` Chao Gao
  2023-11-08 18:29 ` [PATCH v1 14/23] KVM: VMX: Dump FRED context in dump_vmcs() Xin Li
                   ` (9 subsequent siblings)
  22 siblings, 1 reply; 70+ messages in thread
From: Xin Li @ 2023-11-08 18:29 UTC (permalink / raw)
  To: kvm, linux-doc, linux-kernel, linux-hyperv, linux-kselftest
  Cc: seanjc, pbonzini, corbet, kys, haiyangz, wei.liu, decui, tglx,
	mingo, bp, dave.hansen, x86, hpa, vkuznets, peterz,
	ravi.v.shankar

Set the VMX nested-exception bit in the VM-entry interruption-information
VMCS field when injecting a nested exception using FRED event delivery,
to ensure:
  1) the nested exception is injected on the correct stack level.
  2) the nested bit defined in the FRED stack frame is set.

The event stack level used by FRED event delivery depends on whether the
event was a nested exception encountered during delivery of another event,
because a nested exception is "regarded" as happening on ring 0.  E.g.,
when #PF is configured to use stack level 1 in the IA32_FRED_STKLVLS MSR:
  - nested #PF will be delivered on stack level 1 when triggered from
    user level.
  - normal #PF will be delivered on stack level 0 when triggered from
    user level.

The VMX nested-exception support ensures the correct event stack level is
chosen when a VM entry injects a nested exception.
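
For context on the stack-level configuration used in the example above:
IA32_FRED_STKLVLS assigns a 2-bit stack level to each of the 32 exception
vectors, in bits 2*v+1:2*v for vector v.  The sketch below is a hedged
illustration; the FRED_STKLVL() helper is modeled on the native FRED
series and is an assumption here, and the chosen levels are arbitrary
examples, not a recommended configuration:

  /* Assumed helper: place the 2-bit stack level for 'vector'. */
  #define FRED_STKLVL(vector, lvl)  ((unsigned long long)(lvl) << (2 * (vector)))

  /*
   * #PF (vector 14) at stack level 1 sets bits 29:28 to 01, matching the
   * example in the text; #DB and NMI levels here are just examples.
   */
  unsigned long long stklvls = FRED_STKLVL(1, 1) |   /* #DB -> level 1 */
                               FRED_STKLVL(2, 2) |   /* NMI -> level 2 */
                               FRED_STKLVL(14, 1);   /* #PF -> level 1 */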

Tested-by: Shan Kang <shan.kang@intel.com>
Signed-off-by: Xin Li <xin3.li@intel.com>
---
 arch/x86/include/asm/kvm_host.h |  6 ++++--
 arch/x86/include/asm/vmx.h      |  4 +++-
 arch/x86/kvm/svm/svm.c          |  4 ++--
 arch/x86/kvm/vmx/vmx.c          | 26 +++++++++++++++++++++-----
 arch/x86/kvm/x86.c              | 22 +++++++++++++---------
 arch/x86/kvm/x86.h              |  1 +
 6 files changed, 44 insertions(+), 19 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 1e5a6d9439f8..2ae8cc83dbb3 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -721,6 +721,7 @@ struct kvm_queued_exception {
 	u32 error_code;
 	unsigned long payload;
 	bool has_payload;
+	bool nested;
 };
 
 struct kvm_vcpu_arch {
@@ -2015,8 +2016,9 @@ int kvm_emulate_rdpmc(struct kvm_vcpu *vcpu);
 void kvm_queue_exception(struct kvm_vcpu *vcpu, unsigned nr);
 void kvm_queue_exception_e(struct kvm_vcpu *vcpu, unsigned nr, u32 error_code);
 void kvm_queue_exception_p(struct kvm_vcpu *vcpu, unsigned nr, unsigned long payload);
-void kvm_requeue_exception(struct kvm_vcpu *vcpu, unsigned nr);
-void kvm_requeue_exception_e(struct kvm_vcpu *vcpu, unsigned nr, u32 error_code);
+void kvm_requeue_exception(struct kvm_vcpu *vcpu, unsigned nr, bool nested);
+void kvm_requeue_exception_e(struct kvm_vcpu *vcpu, unsigned nr,
+			     u32 error_code, bool nested);
 void kvm_inject_page_fault(struct kvm_vcpu *vcpu, struct x86_exception *fault);
 void kvm_inject_emulated_page_fault(struct kvm_vcpu *vcpu,
 				    struct x86_exception *fault);
diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index 97729248e844..020dfd3f6b44 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -132,6 +132,7 @@
 /* VMX_BASIC bits and bitmasks */
 #define VMX_BASIC_32BIT_PHYS_ADDR_ONLY		BIT_ULL(48)
 #define VMX_BASIC_INOUT				BIT_ULL(54)
+#define VMX_BASIC_NESTED_EXCEPTION		BIT_ULL(58)
 
 /* VMX_MISC bits and bitmasks */
 #define VMX_MISC_INTEL_PT			BIT_ULL(14)
@@ -404,8 +405,9 @@ enum vmcs_field {
 #define INTR_INFO_INTR_TYPE_MASK        0x700           /* 10:8 */
 #define INTR_INFO_DELIVER_CODE_MASK     0x800           /* 11 */
 #define INTR_INFO_UNBLOCK_NMI		0x1000		/* 12 */
+#define INTR_INFO_NESTED_EXCEPTION_MASK	0x2000		/* 13 */
 #define INTR_INFO_VALID_MASK            0x80000000      /* 31 */
-#define INTR_INFO_RESVD_BITS_MASK       0x7ffff000
+#define INTR_INFO_RESVD_BITS_MASK       0x7fffd000
 
 #define VECTORING_INFO_VECTOR_MASK           	INTR_INFO_VECTOR_MASK
 #define VECTORING_INFO_TYPE_MASK        	INTR_INFO_INTR_TYPE_MASK
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 712146312358..78a9ff5cfcad 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -4047,10 +4047,10 @@ static void svm_complete_interrupts(struct kvm_vcpu *vcpu)
 
 		if (exitintinfo & SVM_EXITINTINFO_VALID_ERR) {
 			u32 err = svm->vmcb->control.exit_int_info_err;
-			kvm_requeue_exception_e(vcpu, vector, err);
+			kvm_requeue_exception_e(vcpu, vector, err, false);
 
 		} else
-			kvm_requeue_exception(vcpu, vector);
+			kvm_requeue_exception(vcpu, vector, false);
 		break;
 	case SVM_EXITINTINFO_TYPE_INTR:
 		kvm_queue_interrupt(vcpu, vector, false);
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 67fd4a56d031..518e68ee5a0d 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -1901,6 +1901,8 @@ static void vmx_inject_exception(struct kvm_vcpu *vcpu)
 				event_data = vcpu->arch.guest_fpu.xfd_err;
 
 			vmcs_write64(INJECTED_EVENT_DATA, event_data);
+
+			intr_info |= ex->nested ? INTR_INFO_NESTED_EXCEPTION_MASK : 0;
 		}
 	}
 
@@ -2851,6 +2853,19 @@ static int setup_vmcs_config(struct vmcs_config *vmcs_conf,
 	/* IA-32 SDM Vol 3B: 64-bit CPUs always have VMX_BASIC_MSR[48]==0. */
 	if (basic_msr & VMX_BASIC_32BIT_PHYS_ADDR_ONLY)
 		return -EIO;
+
+	/*
+	 * FRED draft Spec 5.0 Section 9.2:
+	 *
+	 * Any processor that enumerates support for FRED transitions
+	 * will also enumerate VMX nested-exception support.
+	 */
+	if (cpu_feature_enabled(X86_FEATURE_FRED) &&
+	    !(basic_msr & VMX_BASIC_NESTED_EXCEPTION)) {
+		pr_warn_once("FRED enabled but no VMX nested-exception support\n");
+		if (error_on_inconsistent_vmcs_config)
+			return -EIO;
+	}
 #endif
 
 	/* Require Write-Back (WB) memory type for VMCS accesses. */
@@ -7313,11 +7328,12 @@ static void __vmx_complete_interrupts(struct kvm_vcpu *vcpu,
 			}
 		}
 
-		if (idt_vectoring_info & VECTORING_INFO_DELIVER_CODE_MASK) {
-			u32 err = vmcs_read32(error_code_field);
-			kvm_requeue_exception_e(vcpu, vector, err);
-		} else
-			kvm_requeue_exception(vcpu, vector);
+		if (idt_vectoring_info & VECTORING_INFO_DELIVER_CODE_MASK)
+			kvm_requeue_exception_e(vcpu, vector, vmcs_read32(error_code_field),
+						idt_vectoring_info & INTR_INFO_NESTED_EXCEPTION_MASK);
+		else
+			kvm_requeue_exception(vcpu, vector,
+					      idt_vectoring_info & INTR_INFO_NESTED_EXCEPTION_MASK);
 		break;
 	case INTR_TYPE_SOFT_INTR:
 		vcpu->arch.event_exit_inst_len = vmcs_read32(instr_len_field);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index d190bfc63fc4..51c07730f1b6 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -645,7 +645,8 @@ static void kvm_leave_nested(struct kvm_vcpu *vcpu)
 
 static void kvm_multiple_exception(struct kvm_vcpu *vcpu,
 		unsigned nr, bool has_error, u32 error_code,
-	        bool has_payload, unsigned long payload, bool reinject)
+	        bool has_payload, unsigned long payload,
+		bool reinject, bool nested)
 {
 	u32 prev_nr;
 	int class1, class2;
@@ -678,6 +679,7 @@ static void kvm_multiple_exception(struct kvm_vcpu *vcpu,
 			 */
 			WARN_ON_ONCE(kvm_is_exception_pending(vcpu));
 			vcpu->arch.exception.injected = true;
+			vcpu->arch.exception.nested = nested;
 			if (WARN_ON_ONCE(has_payload)) {
 				/*
 				 * For a reinjected event, KVM delivers its
@@ -727,6 +729,8 @@ static void kvm_multiple_exception(struct kvm_vcpu *vcpu,
 
 		kvm_queue_exception_e(vcpu, DF_VECTOR, 0);
 	} else {
+		vcpu->arch.exception.nested = true;
+
 		/* replace previous exception with a new one in a hope
 		   that instruction re-execution will regenerate lost
 		   exception */
@@ -736,20 +740,20 @@ static void kvm_multiple_exception(struct kvm_vcpu *vcpu,
 
 void kvm_queue_exception(struct kvm_vcpu *vcpu, unsigned nr)
 {
-	kvm_multiple_exception(vcpu, nr, false, 0, false, 0, false);
+	kvm_multiple_exception(vcpu, nr, false, 0, false, 0, false, false);
 }
 EXPORT_SYMBOL_GPL(kvm_queue_exception);
 
-void kvm_requeue_exception(struct kvm_vcpu *vcpu, unsigned nr)
+void kvm_requeue_exception(struct kvm_vcpu *vcpu, unsigned nr, bool nested)
 {
-	kvm_multiple_exception(vcpu, nr, false, 0, false, 0, true);
+	kvm_multiple_exception(vcpu, nr, false, 0, false, 0, true, nested);
 }
 EXPORT_SYMBOL_GPL(kvm_requeue_exception);
 
 void kvm_queue_exception_p(struct kvm_vcpu *vcpu, unsigned nr,
 			   unsigned long payload)
 {
-	kvm_multiple_exception(vcpu, nr, false, 0, true, payload, false);
+	kvm_multiple_exception(vcpu, nr, false, 0, true, payload, false, false);
 }
 EXPORT_SYMBOL_GPL(kvm_queue_exception_p);
 
@@ -757,7 +761,7 @@ static void kvm_queue_exception_e_p(struct kvm_vcpu *vcpu, unsigned nr,
 				    u32 error_code, unsigned long payload)
 {
 	kvm_multiple_exception(vcpu, nr, true, error_code,
-			       true, payload, false);
+			       true, payload, false, false);
 }
 
 int kvm_complete_insn_gp(struct kvm_vcpu *vcpu, int err)
@@ -829,13 +833,13 @@ void kvm_inject_nmi(struct kvm_vcpu *vcpu)
 
 void kvm_queue_exception_e(struct kvm_vcpu *vcpu, unsigned nr, u32 error_code)
 {
-	kvm_multiple_exception(vcpu, nr, true, error_code, false, 0, false);
+	kvm_multiple_exception(vcpu, nr, true, error_code, false, 0, false, false);
 }
 EXPORT_SYMBOL_GPL(kvm_queue_exception_e);
 
-void kvm_requeue_exception_e(struct kvm_vcpu *vcpu, unsigned nr, u32 error_code)
+void kvm_requeue_exception_e(struct kvm_vcpu *vcpu, unsigned nr, u32 error_code, bool nested)
 {
-	kvm_multiple_exception(vcpu, nr, true, error_code, false, 0, true);
+	kvm_multiple_exception(vcpu, nr, true, error_code, false, 0, true, nested);
 }
 EXPORT_SYMBOL_GPL(kvm_requeue_exception_e);
 
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index 60da8cbe6759..63e543c6834b 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -108,6 +108,7 @@ static inline void kvm_clear_exception_queue(struct kvm_vcpu *vcpu)
 {
 	vcpu->arch.exception.pending = false;
 	vcpu->arch.exception.injected = false;
+	vcpu->arch.exception.nested = false;
 	vcpu->arch.exception_vmexit.pending = false;
 }
 
-- 
2.42.0



* [PATCH v1 14/23] KVM: VMX: Dump FRED context in dump_vmcs()
  2023-11-08 18:29 [PATCH v1 00/23] Enable FRED with KVM VMX Xin Li
                   ` (12 preceding siblings ...)
  2023-11-08 18:29 ` [PATCH v1 13/23] KVM: VMX: Handle VMX nested exception for FRED Xin Li
@ 2023-11-08 18:29 ` Xin Li
  2023-11-14 14:36   ` Nikolay Borisov
  2023-11-08 18:29 ` [PATCH v1 15/23] KVM: nVMX: Add support for the secondary VM exit controls Xin Li
                   ` (8 subsequent siblings)
  22 siblings, 1 reply; 70+ messages in thread
From: Xin Li @ 2023-11-08 18:29 UTC (permalink / raw)
  To: kvm, linux-doc, linux-kernel, linux-hyperv, linux-kselftest
  Cc: seanjc, pbonzini, corbet, kys, haiyangz, wei.liu, decui, tglx,
	mingo, bp, dave.hansen, x86, hpa, vkuznets, peterz,
	ravi.v.shankar

Add FRED-related VMCS fields to dump_vmcs() so that it also dumps the
FRED context.

Tested-by: Shan Kang <shan.kang@intel.com>
Signed-off-by: Xin Li <xin3.li@intel.com>
---
 arch/x86/kvm/vmx/vmx.c | 48 ++++++++++++++++++++++++++++++++++++------
 1 file changed, 41 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 518e68ee5a0d..b826dc188fc7 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -6429,7 +6429,7 @@ void dump_vmcs(struct kvm_vcpu *vcpu)
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
 	u32 vmentry_ctl, vmexit_ctl;
 	u32 cpu_based_exec_ctrl, pin_based_exec_ctrl, secondary_exec_control;
-	u64 tertiary_exec_control;
+	u64 tertiary_exec_control, secondary_vmexit_ctl;
 	unsigned long cr4;
 	int efer_slot;
 
@@ -6440,6 +6440,8 @@ void dump_vmcs(struct kvm_vcpu *vcpu)
 
 	vmentry_ctl = vmcs_read32(VM_ENTRY_CONTROLS);
 	vmexit_ctl = vmcs_read32(VM_EXIT_CONTROLS);
+	secondary_vmexit_ctl = cpu_has_secondary_vmexit_ctrls() ?
+			       vmcs_read64(SECONDARY_VM_EXIT_CONTROLS) : 0;
 	cpu_based_exec_ctrl = vmcs_read32(CPU_BASED_VM_EXEC_CONTROL);
 	pin_based_exec_ctrl = vmcs_read32(PIN_BASED_VM_EXEC_CONTROL);
 	cr4 = vmcs_readl(GUEST_CR4);
@@ -6486,6 +6488,19 @@ void dump_vmcs(struct kvm_vcpu *vcpu)
 	vmx_dump_sel("LDTR:", GUEST_LDTR_SELECTOR);
 	vmx_dump_dtsel("IDTR:", GUEST_IDTR_LIMIT);
 	vmx_dump_sel("TR:  ", GUEST_TR_SELECTOR);
+#ifdef CONFIG_X86_64
+	if (cpu_feature_enabled(X86_FEATURE_FRED)) {
+		pr_err("FRED guest: config=0x%016llx, stack levels=0x%016llx\n"
+		       "RSP0=0x%016lx, RSP1=0x%016llx\n"
+		       "RSP2=0x%016llx, RSP3=0x%016llx\n",
+		       vmcs_read64(GUEST_IA32_FRED_CONFIG),
+		       vmcs_read64(GUEST_IA32_FRED_STKLVLS),
+		       read_msr(MSR_IA32_FRED_RSP0),
+		       vmcs_read64(GUEST_IA32_FRED_RSP1),
+		       vmcs_read64(GUEST_IA32_FRED_RSP2),
+		       vmcs_read64(GUEST_IA32_FRED_RSP3));
+	}
+#endif
 	efer_slot = vmx_find_loadstore_msr_slot(&vmx->msr_autoload.guest, MSR_EFER);
 	if (vmentry_ctl & VM_ENTRY_LOAD_IA32_EFER)
 		pr_err("EFER= 0x%016llx\n", vmcs_read64(GUEST_IA32_EFER));
@@ -6533,6 +6548,19 @@ void dump_vmcs(struct kvm_vcpu *vcpu)
 	       vmcs_readl(HOST_TR_BASE));
 	pr_err("GDTBase=%016lx IDTBase=%016lx\n",
 	       vmcs_readl(HOST_GDTR_BASE), vmcs_readl(HOST_IDTR_BASE));
+#ifdef CONFIG_X86_64
+	if (cpu_feature_enabled(X86_FEATURE_FRED)) {
+		pr_err("FRED host: config=0x%016llx, stack levels=0x%016llx\n"
+		       "RSP0=0x%016llx, RSP1=0x%016llx\n"
+		       "RSP2=0x%016llx, RSP3=0x%016llx\n",
+		       vmcs_read64(HOST_IA32_FRED_CONFIG),
+		       vmcs_read64(HOST_IA32_FRED_STKLVLS),
+		       vmx->msr_host_fred_rsp0,
+		       vmcs_read64(HOST_IA32_FRED_RSP1),
+		       vmcs_read64(HOST_IA32_FRED_RSP2),
+		       vmcs_read64(HOST_IA32_FRED_RSP3));
+	}
+#endif
 	pr_err("CR0=%016lx CR3=%016lx CR4=%016lx\n",
 	       vmcs_readl(HOST_CR0), vmcs_readl(HOST_CR3),
 	       vmcs_readl(HOST_CR4));
@@ -6554,25 +6582,31 @@ void dump_vmcs(struct kvm_vcpu *vcpu)
 	pr_err("*** Control State ***\n");
 	pr_err("CPUBased=0x%08x SecondaryExec=0x%08x TertiaryExec=0x%016llx\n",
 	       cpu_based_exec_ctrl, secondary_exec_control, tertiary_exec_control);
-	pr_err("PinBased=0x%08x EntryControls=%08x ExitControls=%08x\n",
-	       pin_based_exec_ctrl, vmentry_ctl, vmexit_ctl);
+	pr_err("PinBased=0x%08x EntryControls=0x%08x\n",
+	       pin_based_exec_ctrl, vmentry_ctl);
+	pr_err("ExitControls=0x%08x SecondaryExitControls=0x%016llx\n",
+	       vmexit_ctl, secondary_vmexit_ctl);
 	pr_err("ExceptionBitmap=%08x PFECmask=%08x PFECmatch=%08x\n",
 	       vmcs_read32(EXCEPTION_BITMAP),
 	       vmcs_read32(PAGE_FAULT_ERROR_CODE_MASK),
 	       vmcs_read32(PAGE_FAULT_ERROR_CODE_MATCH));
-	pr_err("VMEntry: intr_info=%08x errcode=%08x ilen=%08x\n",
+	pr_err("VMEntry: intr_info=%08x errcode=%08x ilen=%08x event data=%016llx\n",
 	       vmcs_read32(VM_ENTRY_INTR_INFO_FIELD),
 	       vmcs_read32(VM_ENTRY_EXCEPTION_ERROR_CODE),
-	       vmcs_read32(VM_ENTRY_INSTRUCTION_LEN));
+	       vmcs_read32(VM_ENTRY_INSTRUCTION_LEN),
+	       cpu_feature_enabled(X86_FEATURE_FRED) ?
+	       vmcs_read64(INJECTED_EVENT_DATA) : 0);
 	pr_err("VMExit: intr_info=%08x errcode=%08x ilen=%08x\n",
 	       vmcs_read32(VM_EXIT_INTR_INFO),
 	       vmcs_read32(VM_EXIT_INTR_ERROR_CODE),
 	       vmcs_read32(VM_EXIT_INSTRUCTION_LEN));
 	pr_err("        reason=%08x qualification=%016lx\n",
 	       vmcs_read32(VM_EXIT_REASON), vmcs_readl(EXIT_QUALIFICATION));
-	pr_err("IDTVectoring: info=%08x errcode=%08x\n",
+	pr_err("IDTVectoring: info=%08x errcode=%08x event data=%016llx\n",
 	       vmcs_read32(IDT_VECTORING_INFO_FIELD),
-	       vmcs_read32(IDT_VECTORING_ERROR_CODE));
+	       vmcs_read32(IDT_VECTORING_ERROR_CODE),
+	       cpu_feature_enabled(X86_FEATURE_FRED) ?
+	       vmcs_read64(ORIGINAL_EVENT_DATA) : 0);
 	pr_err("TSC Offset = 0x%016llx\n", vmcs_read64(TSC_OFFSET));
 	if (secondary_exec_control & SECONDARY_EXEC_TSC_SCALING)
 		pr_err("TSC Multiplier = 0x%016llx\n",
-- 
2.42.0



* [PATCH v1 15/23] KVM: nVMX: Add support for the secondary VM exit controls
  2023-11-08 18:29 [PATCH v1 00/23] Enable FRED with KVM VMX Xin Li
                   ` (13 preceding siblings ...)
  2023-11-08 18:29 ` [PATCH v1 14/23] KVM: VMX: Dump FRED context in dump_vmcs() Xin Li
@ 2023-11-08 18:29 ` Xin Li
  2023-11-09  8:21   ` Jeremi Piotrowski
  2023-11-20 15:52   ` Vitaly Kuznetsov
  2023-11-08 18:29 ` [PATCH v1 16/23] KVM: nVMX: Add FRED VMCS fields Xin Li
                   ` (7 subsequent siblings)
  22 siblings, 2 replies; 70+ messages in thread
From: Xin Li @ 2023-11-08 18:29 UTC (permalink / raw)
  To: kvm, linux-doc, linux-kernel, linux-hyperv, linux-kselftest
  Cc: seanjc, pbonzini, corbet, kys, haiyangz, wei.liu, decui, tglx,
	mingo, bp, dave.hansen, x86, hpa, vkuznets, peterz,
	ravi.v.shankar

Enable the secondary VM exit controls to prepare for nested FRED.
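
For background, an L1 hypervisor consuming this is expected to check the
activate-secondary-controls bit advertised in the allowed-1 half of the
primary VM-exit controls MSR before reading MSR_IA32_VMX_EXIT_CTLS2.  A
hedged, kernel-context sketch follows (it assumes <asm/msr.h> and the
VMX control definitions used elsewhere in this series, and that
VM_EXIT_ACTIVATE_SECONDARY_CONTROLS is the bit-31 primary exit control):

  static u64 probe_secondary_exit_ctls(void)
  {
      u64 exit_ctls, exit_ctls2 = 0;

      rdmsrl(MSR_IA32_VMX_EXIT_CTLS, exit_ctls);

      /* Allowed-1 settings live in the high 32 bits of the control MSR. */
      if (exit_ctls & ((u64)VM_EXIT_ACTIVATE_SECONDARY_CONTROLS << 32))
          rdmsrl(MSR_IA32_VMX_EXIT_CTLS2, exit_ctls2);

      return exit_ctls2;
  }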

Tested-by: Shan Kang <shan.kang@intel.com>
Signed-off-by: Xin Li <xin3.li@intel.com>
---
 Documentation/virt/kvm/x86/nested-vmx.rst |  1 +
 arch/x86/include/asm/hyperv-tlfs.h        |  1 +
 arch/x86/kvm/vmx/capabilities.h           |  1 +
 arch/x86/kvm/vmx/hyperv.c                 | 18 +++++++++++++++++-
 arch/x86/kvm/vmx/nested.c                 | 18 +++++++++++++++++-
 arch/x86/kvm/vmx/vmcs12.c                 |  1 +
 arch/x86/kvm/vmx/vmcs12.h                 |  2 ++
 arch/x86/kvm/x86.h                        |  2 +-
 8 files changed, 41 insertions(+), 3 deletions(-)

diff --git a/Documentation/virt/kvm/x86/nested-vmx.rst b/Documentation/virt/kvm/x86/nested-vmx.rst
index ac2095d41f02..e64ef231f310 100644
--- a/Documentation/virt/kvm/x86/nested-vmx.rst
+++ b/Documentation/virt/kvm/x86/nested-vmx.rst
@@ -217,6 +217,7 @@ struct shadow_vmcs is ever changed.
 		u16 host_fs_selector;
 		u16 host_gs_selector;
 		u16 host_tr_selector;
+		u64 secondary_vm_exit_controls;
 	};
 
 
diff --git a/arch/x86/include/asm/hyperv-tlfs.h b/arch/x86/include/asm/hyperv-tlfs.h
index 2ff26f53cd62..299554708e37 100644
--- a/arch/x86/include/asm/hyperv-tlfs.h
+++ b/arch/x86/include/asm/hyperv-tlfs.h
@@ -616,6 +616,7 @@ struct hv_enlightened_vmcs {
 	u64 host_ssp;
 	u64 host_ia32_int_ssp_table_addr;
 	u64 padding64_6;
+	u64 secondary_vm_exit_controls;
 } __packed;
 
 #define HV_VMX_ENLIGHTENED_CLEAN_FIELD_NONE			0
diff --git a/arch/x86/kvm/vmx/capabilities.h b/arch/x86/kvm/vmx/capabilities.h
index e8f3ad0f79ee..caf38a54856c 100644
--- a/arch/x86/kvm/vmx/capabilities.h
+++ b/arch/x86/kvm/vmx/capabilities.h
@@ -38,6 +38,7 @@ struct nested_vmx_msrs {
 	u32 pinbased_ctls_high;
 	u32 exit_ctls_low;
 	u32 exit_ctls_high;
+	u64 secondary_exit_ctls;
 	u32 entry_ctls_low;
 	u32 entry_ctls_high;
 	u32 misc_low;
diff --git a/arch/x86/kvm/vmx/hyperv.c b/arch/x86/kvm/vmx/hyperv.c
index 313b8bb5b8a7..b8cd53601a00 100644
--- a/arch/x86/kvm/vmx/hyperv.c
+++ b/arch/x86/kvm/vmx/hyperv.c
@@ -103,7 +103,10 @@
 	 VM_EXIT_LOAD_IA32_EFER |					\
 	 VM_EXIT_CLEAR_BNDCFGS |					\
 	 VM_EXIT_PT_CONCEAL_PIP |					\
-	 VM_EXIT_CLEAR_IA32_RTIT_CTL)
+	 VM_EXIT_CLEAR_IA32_RTIT_CTL |					\
+	 VM_EXIT_ACTIVATE_SECONDARY_CONTROLS)
+
+#define EVMCS1_SUPPORTED_VMEXIT_CTRL2 (0ULL)
 
 #define EVMCS1_SUPPORTED_VMENTRY_CTRL					\
 	(VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR |				\
@@ -315,6 +318,8 @@ const struct evmcs_field vmcs_field_to_evmcs_1[] = {
 		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_CONTROL_GRP1),
 	EVMCS1_FIELD(VM_EXIT_CONTROLS, vm_exit_controls,
 		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_CONTROL_GRP1),
+	EVMCS1_FIELD(SECONDARY_VM_EXIT_CONTROLS, secondary_vm_exit_controls,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_CONTROL_GRP1),
 	EVMCS1_FIELD(SECONDARY_VM_EXEC_CONTROL, secondary_vm_exec_control,
 		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_CONTROL_GRP1),
 	EVMCS1_FIELD(GUEST_ES_LIMIT, guest_es_limit,
@@ -464,6 +469,7 @@ enum evmcs_revision {
 
 enum evmcs_ctrl_type {
 	EVMCS_EXIT_CTRLS,
+	EVMCS_2NDEXIT,
 	EVMCS_ENTRY_CTRLS,
 	EVMCS_EXEC_CTRL,
 	EVMCS_2NDEXEC,
@@ -477,6 +483,9 @@ static const u32 evmcs_supported_ctrls[NR_EVMCS_CTRLS][NR_EVMCS_REVISIONS] = {
 	[EVMCS_EXIT_CTRLS] = {
 		[EVMCSv1_LEGACY] = EVMCS1_SUPPORTED_VMEXIT_CTRL,
 	},
+	[EVMCS_2NDEXIT] = {
+		[EVMCSv1_LEGACY] = EVMCS1_SUPPORTED_VMEXIT_CTRL2,
+	},
 	[EVMCS_ENTRY_CTRLS] = {
 		[EVMCSv1_LEGACY] = EVMCS1_SUPPORTED_VMENTRY_CTRL,
 	},
@@ -539,6 +548,9 @@ void nested_evmcs_filter_control_msr(struct kvm_vcpu *vcpu, u32 msr_index, u64 *
 			supported_ctrls &= ~VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL;
 		ctl_high &= supported_ctrls;
 		break;
+	case MSR_IA32_VMX_EXIT_CTLS2:
+		ctl_low &= evmcs_get_supported_ctls(EVMCS_2NDEXIT);
+		break;
 	case MSR_IA32_VMX_ENTRY_CTLS:
 	case MSR_IA32_VMX_TRUE_ENTRY_CTLS:
 		supported_ctrls = evmcs_get_supported_ctls(EVMCS_ENTRY_CTRLS);
@@ -589,6 +601,10 @@ int nested_evmcs_check_controls(struct vmcs12 *vmcs12)
 					       vmcs12->vm_exit_controls)))
 		return -EINVAL;
 
+	if (CC(!nested_evmcs_is_valid_controls(EVMCS_2NDEXIT,
+					       vmcs12->secondary_vm_exit_controls)))
+		return -EINVAL;
+
 	if (CC(!nested_evmcs_is_valid_controls(EVMCS_ENTRY_CTRLS,
 					       vmcs12->vm_entry_controls)))
 		return -EINVAL;
diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index ff07d6e736a2..d6341845df43 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -1411,6 +1411,7 @@ int vmx_set_vmx_msr(struct kvm_vcpu *vcpu, u32 msr_index, u64 data)
 	case MSR_IA32_VMX_PINBASED_CTLS:
 	case MSR_IA32_VMX_PROCBASED_CTLS:
 	case MSR_IA32_VMX_EXIT_CTLS:
+	case MSR_IA32_VMX_EXIT_CTLS2:
 	case MSR_IA32_VMX_ENTRY_CTLS:
 		/*
 		 * The "non-true" VMX capability MSRs are generated from the
@@ -1489,6 +1490,9 @@ int vmx_get_vmx_msr(struct nested_vmx_msrs *msrs, u32 msr_index, u64 *pdata)
 		if (msr_index == MSR_IA32_VMX_EXIT_CTLS)
 			*pdata |= VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR;
 		break;
+	case MSR_IA32_VMX_EXIT_CTLS2:
+		*pdata = msrs->secondary_exit_ctls;
+		break;
 	case MSR_IA32_VMX_TRUE_ENTRY_CTLS:
 	case MSR_IA32_VMX_ENTRY_CTLS:
 		*pdata = vmx_control_msr(
@@ -1692,6 +1696,8 @@ static void copy_enlightened_to_vmcs12(struct vcpu_vmx *vmx, u32 hv_clean_fields
 		vmcs12->pin_based_vm_exec_control =
 			evmcs->pin_based_vm_exec_control;
 		vmcs12->vm_exit_controls = evmcs->vm_exit_controls;
+		vmcs12->secondary_vm_exit_controls =
+			evmcs->secondary_vm_exit_controls;
 		vmcs12->secondary_vm_exec_control =
 			evmcs->secondary_vm_exec_control;
 	}
@@ -1894,6 +1900,7 @@ static void copy_vmcs12_to_enlightened(struct vcpu_vmx *vmx)
 	 * evmcs->vmcs_link_pointer = vmcs12->vmcs_link_pointer;
 	 * evmcs->pin_based_vm_exec_control = vmcs12->pin_based_vm_exec_control;
 	 * evmcs->vm_exit_controls = vmcs12->vm_exit_controls;
+	 * evmcs->secondary_vm_exit_controls = vmcs12->secondary_vm_exit_controls;
 	 * evmcs->secondary_vm_exec_control = vmcs12->secondary_vm_exec_control;
 	 * evmcs->page_fault_error_code_mask =
 	 *		vmcs12->page_fault_error_code_mask;
@@ -2411,6 +2418,11 @@ static void prepare_vmcs02_early(struct vcpu_vmx *vmx, struct loaded_vmcs *vmcs0
 		exec_control &= ~VM_EXIT_LOAD_IA32_EFER;
 	vm_exit_controls_set(vmx, exec_control);
 
+	if (exec_control & VM_EXIT_ACTIVATE_SECONDARY_CONTROLS) {
+		exec_control = __secondary_vm_exit_controls_get(vmcs01);
+		secondary_vm_exit_controls_set(vmx, exec_control);
+	}
+
 	/*
 	 * Interrupt/Exception Fields
 	 */
@@ -6819,13 +6831,17 @@ static void nested_vmx_setup_exit_ctls(struct vmcs_config *vmcs_conf,
 		VM_EXIT_HOST_ADDR_SPACE_SIZE |
 #endif
 		VM_EXIT_LOAD_IA32_PAT | VM_EXIT_SAVE_IA32_PAT |
-		VM_EXIT_CLEAR_BNDCFGS;
+		VM_EXIT_CLEAR_BNDCFGS | VM_EXIT_ACTIVATE_SECONDARY_CONTROLS;
 	msrs->exit_ctls_high |=
 		VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR |
 		VM_EXIT_LOAD_IA32_EFER | VM_EXIT_SAVE_IA32_EFER |
 		VM_EXIT_SAVE_VMX_PREEMPTION_TIMER | VM_EXIT_ACK_INTR_ON_EXIT |
 		VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL;
 
+	/* secondary exit controls */
+	if (msrs->exit_ctls_high & VM_EXIT_ACTIVATE_SECONDARY_CONTROLS)
+		rdmsrl(MSR_IA32_VMX_EXIT_CTLS2, msrs->secondary_exit_ctls);
+
 	/* We support free control of debug control saving. */
 	msrs->exit_ctls_low &= ~VM_EXIT_SAVE_DEBUG_CONTROLS;
 }
diff --git a/arch/x86/kvm/vmx/vmcs12.c b/arch/x86/kvm/vmx/vmcs12.c
index 106a72c923ca..98457d7b2b23 100644
--- a/arch/x86/kvm/vmx/vmcs12.c
+++ b/arch/x86/kvm/vmx/vmcs12.c
@@ -73,6 +73,7 @@ const unsigned short vmcs12_field_offsets[] = {
 	FIELD(PAGE_FAULT_ERROR_CODE_MATCH, page_fault_error_code_match),
 	FIELD(CR3_TARGET_COUNT, cr3_target_count),
 	FIELD(VM_EXIT_CONTROLS, vm_exit_controls),
+	FIELD(SECONDARY_VM_EXIT_CONTROLS, secondary_vm_exit_controls),
 	FIELD(VM_EXIT_MSR_STORE_COUNT, vm_exit_msr_store_count),
 	FIELD(VM_EXIT_MSR_LOAD_COUNT, vm_exit_msr_load_count),
 	FIELD(VM_ENTRY_CONTROLS, vm_entry_controls),
diff --git a/arch/x86/kvm/vmx/vmcs12.h b/arch/x86/kvm/vmx/vmcs12.h
index 01936013428b..f50f897b9b5f 100644
--- a/arch/x86/kvm/vmx/vmcs12.h
+++ b/arch/x86/kvm/vmx/vmcs12.h
@@ -185,6 +185,7 @@ struct __packed vmcs12 {
 	u16 host_gs_selector;
 	u16 host_tr_selector;
 	u16 guest_pml_index;
+	u64 secondary_vm_exit_controls;
 };
 
 /*
@@ -358,6 +359,7 @@ static inline void vmx_check_vmcs12_offsets(void)
 	CHECK_OFFSET(host_gs_selector, 992);
 	CHECK_OFFSET(host_tr_selector, 994);
 	CHECK_OFFSET(guest_pml_index, 996);
+	CHECK_OFFSET(secondary_vm_exit_controls, 998);
 }
 
 extern const unsigned short vmcs12_field_offsets[];
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index 63e543c6834b..96ad139adc3f 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -47,7 +47,7 @@ void kvm_spurious_fault(void);
  * associated feature that KVM supports for nested virtualization.
  */
 #define KVM_FIRST_EMULATED_VMX_MSR	MSR_IA32_VMX_BASIC
-#define KVM_LAST_EMULATED_VMX_MSR	MSR_IA32_VMX_VMFUNC
+#define KVM_LAST_EMULATED_VMX_MSR	MSR_IA32_VMX_EXIT_CTLS2
 
 #define KVM_DEFAULT_PLE_GAP		128
 #define KVM_VMX_DEFAULT_PLE_WINDOW	4096
-- 
2.42.0



* [PATCH v1 16/23] KVM: nVMX: Add FRED VMCS fields
  2023-11-08 18:29 [PATCH v1 00/23] Enable FRED with KVM VMX Xin Li
                   ` (14 preceding siblings ...)
  2023-11-08 18:29 ` [PATCH v1 15/23] KVM: nVMX: Add support for the secondary VM exit controls Xin Li
@ 2023-11-08 18:29 ` Xin Li
  2023-11-08 18:29 ` [PATCH v1 17/23] KVM: nVMX: Add support for VMX FRED controls Xin Li
                   ` (6 subsequent siblings)
  22 siblings, 0 replies; 70+ messages in thread
From: Xin Li @ 2023-11-08 18:29 UTC (permalink / raw)
  To: kvm, linux-doc, linux-kernel, linux-hyperv, linux-kselftest
  Cc: seanjc, pbonzini, corbet, kys, haiyangz, wei.liu, decui, tglx,
	mingo, bp, dave.hansen, x86, hpa, vkuznets, peterz,
	ravi.v.shankar

Add FRED VMCS fields to nested VMX context management.

Tested-by: Shan Kang <shan.kang@intel.com>
Signed-off-by: Xin Li <xin3.li@intel.com>
---
 Documentation/virt/kvm/x86/nested-vmx.rst |  18 +++
 arch/x86/include/asm/hyperv-tlfs.h        |  18 +++
 arch/x86/kvm/vmx/hyperv.c                 |  38 ++++++
 arch/x86/kvm/vmx/nested.c                 | 154 ++++++++++++++++++++--
 arch/x86/kvm/vmx/vmcs12.c                 |  18 +++
 arch/x86/kvm/vmx/vmcs12.h                 |  36 +++++
 arch/x86/kvm/vmx/vmcs_shadow_fields.h     |   6 +-
 7 files changed, 274 insertions(+), 14 deletions(-)

diff --git a/Documentation/virt/kvm/x86/nested-vmx.rst b/Documentation/virt/kvm/x86/nested-vmx.rst
index e64ef231f310..87fa9f3877ab 100644
--- a/Documentation/virt/kvm/x86/nested-vmx.rst
+++ b/Documentation/virt/kvm/x86/nested-vmx.rst
@@ -218,6 +218,24 @@ struct shadow_vmcs is ever changed.
 		u16 host_gs_selector;
 		u16 host_tr_selector;
 		u64 secondary_vm_exit_controls;
+		u64 guest_ia32_fred_config;
+		u64 guest_ia32_fred_rsp1;
+		u64 guest_ia32_fred_rsp2;
+		u64 guest_ia32_fred_rsp3;
+		u64 guest_ia32_fred_stklvls;
+		u64 guest_ia32_fred_ssp1;
+		u64 guest_ia32_fred_ssp2;
+		u64 guest_ia32_fred_ssp3;
+		u64 host_ia32_fred_config;
+		u64 host_ia32_fred_rsp1;
+		u64 host_ia32_fred_rsp2;
+		u64 host_ia32_fred_rsp3;
+		u64 host_ia32_fred_stklvls;
+		u64 host_ia32_fred_ssp1;
+		u64 host_ia32_fred_ssp2;
+		u64 host_ia32_fred_ssp3;
+		u64 injected_event_data;
+		u64 original_event_data;
 	};
 
 
diff --git a/arch/x86/include/asm/hyperv-tlfs.h b/arch/x86/include/asm/hyperv-tlfs.h
index 299554708e37..269dbf73b63f 100644
--- a/arch/x86/include/asm/hyperv-tlfs.h
+++ b/arch/x86/include/asm/hyperv-tlfs.h
@@ -617,6 +617,24 @@ struct hv_enlightened_vmcs {
 	u64 host_ia32_int_ssp_table_addr;
 	u64 padding64_6;
 	u64 secondary_vm_exit_controls;
+	u64 guest_ia32_fred_config;
+	u64 guest_ia32_fred_rsp1;
+	u64 guest_ia32_fred_rsp2;
+	u64 guest_ia32_fred_rsp3;
+	u64 guest_ia32_fred_stklvls;
+	u64 guest_ia32_fred_ssp1;
+	u64 guest_ia32_fred_ssp2;
+	u64 guest_ia32_fred_ssp3;
+	u64 host_ia32_fred_config;
+	u64 host_ia32_fred_rsp1;
+	u64 host_ia32_fred_rsp2;
+	u64 host_ia32_fred_rsp3;
+	u64 host_ia32_fred_stklvls;
+	u64 host_ia32_fred_ssp1;
+	u64 host_ia32_fred_ssp2;
+	u64 host_ia32_fred_ssp3;
+	u64 injected_event_data;
+	u64 original_event_data;
 } __packed;
 
 #define HV_VMX_ENLIGHTENED_CLEAN_FIELD_NONE			0
diff --git a/arch/x86/kvm/vmx/hyperv.c b/arch/x86/kvm/vmx/hyperv.c
index b8cd53601a00..b12ef8fd76c9 100644
--- a/arch/x86/kvm/vmx/hyperv.c
+++ b/arch/x86/kvm/vmx/hyperv.c
@@ -243,6 +243,8 @@ const struct evmcs_field vmcs_field_to_evmcs_1[] = {
 		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_CONTROL_GRP2),
 	EVMCS1_FIELD(TSC_MULTIPLIER, tsc_multiplier,
 		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_CONTROL_GRP2),
+	EVMCS1_FIELD(INJECTED_EVENT_DATA, injected_event_data,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_CONTROL_EVENT),
 	/*
 	 * Not used by KVM:
 	 *
@@ -262,11 +264,47 @@ const struct evmcs_field vmcs_field_to_evmcs_1[] = {
 	 *	     HV_VMX_ENLIGHTENED_CLEAN_FIELD_HOST_GRP1),
 	 */
 
+	/* FRED guest and host context */
+	EVMCS1_FIELD(GUEST_IA32_FRED_CONFIG, guest_ia32_fred_config,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP2),
+	EVMCS1_FIELD(GUEST_IA32_FRED_RSP1, guest_ia32_fred_rsp1,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP2),
+	EVMCS1_FIELD(GUEST_IA32_FRED_RSP2, guest_ia32_fred_rsp2,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP2),
+	EVMCS1_FIELD(GUEST_IA32_FRED_RSP3, guest_ia32_fred_rsp3,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP2),
+	EVMCS1_FIELD(GUEST_IA32_FRED_STKLVLS, guest_ia32_fred_stklvls,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP2),
+	EVMCS1_FIELD(GUEST_IA32_FRED_SSP1, guest_ia32_fred_ssp1,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP2),
+	EVMCS1_FIELD(GUEST_IA32_FRED_SSP2, guest_ia32_fred_ssp2,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP2),
+	EVMCS1_FIELD(GUEST_IA32_FRED_SSP3, guest_ia32_fred_ssp3,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP2),
+	EVMCS1_FIELD(HOST_IA32_FRED_CONFIG, host_ia32_fred_config,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_HOST_POINTER),
+	EVMCS1_FIELD(HOST_IA32_FRED_RSP1, host_ia32_fred_rsp1,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_HOST_POINTER),
+	EVMCS1_FIELD(HOST_IA32_FRED_RSP2, host_ia32_fred_rsp2,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_HOST_POINTER),
+	EVMCS1_FIELD(HOST_IA32_FRED_RSP3, host_ia32_fred_rsp3,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_HOST_POINTER),
+	EVMCS1_FIELD(HOST_IA32_FRED_STKLVLS, host_ia32_fred_stklvls,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_HOST_POINTER),
+	EVMCS1_FIELD(HOST_IA32_FRED_SSP1, host_ia32_fred_ssp1,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_HOST_POINTER),
+	EVMCS1_FIELD(HOST_IA32_FRED_SSP2, host_ia32_fred_ssp2,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_HOST_POINTER),
+	EVMCS1_FIELD(HOST_IA32_FRED_SSP3, host_ia32_fred_ssp3,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_HOST_POINTER),
+
 	/* 64 bit read only */
 	EVMCS1_FIELD(GUEST_PHYSICAL_ADDRESS, guest_physical_address,
 		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_NONE),
 	EVMCS1_FIELD(EXIT_QUALIFICATION, exit_qualification,
 		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_NONE),
+	EVMCS1_FIELD(ORIGINAL_EVENT_DATA, original_event_data,
+		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_NONE),
 	/*
 	 * Not defined in KVM:
 	 *
diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index d6341845df43..af616c4a3491 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -83,7 +83,18 @@ static void init_vmcs_shadow_fields(void)
 			pr_err("Missing field from shadow_read_only_field %x\n",
 			       field + 1);
 
+		switch (field) {
+		case ORIGINAL_EVENT_DATA:
+		case ORIGINAL_EVENT_DATA_HIGH:
+			if (!cpu_feature_enabled(X86_FEATURE_FRED))
+				continue;
+			break;
+		default:
+			break;
+		}
+
 		clear_bit(field, vmx_vmread_bitmap);
+
 		if (field & 1)
 #ifdef CONFIG_X86_64
 			continue;
@@ -126,6 +137,11 @@ static void init_vmcs_shadow_fields(void)
 			if (!cpu_has_vmx_apicv())
 				continue;
 			break;
+		case INJECTED_EVENT_DATA:
+		case INJECTED_EVENT_DATA_HIGH:
+			if (!cpu_feature_enabled(X86_FEATURE_FRED))
+				continue;
+			break;
 		default:
 			break;
 		}
@@ -650,6 +666,9 @@ static inline bool nested_vmx_prepare_msr_bitmap(struct kvm_vcpu *vcpu,
 
 	nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
 					 MSR_KERNEL_GS_BASE, MSR_TYPE_RW);
+
+	nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
+					 MSR_IA32_FRED_RSP0, MSR_TYPE_RW);
 #endif
 	nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
 					 MSR_IA32_SPEC_CTRL, MSR_TYPE_RW);
@@ -1565,6 +1584,17 @@ static void copy_shadow_to_vmcs12(struct vcpu_vmx *vmx)
 
 	for (i = 0; i < max_shadow_read_write_fields; i++) {
 		field = shadow_read_write_fields[i];
+
+		switch (field.encoding) {
+		case INJECTED_EVENT_DATA:
+		case INJECTED_EVENT_DATA_HIGH:
+			if (!cpu_feature_enabled(X86_FEATURE_FRED))
+				continue;
+			break;
+		default:
+			break;
+		}
+
 		val = __vmcs_readl(field.encoding);
 		vmcs12_write_any(vmcs12, field.encoding, field.offset, val);
 	}
@@ -1599,6 +1629,19 @@ static void copy_vmcs12_to_shadow(struct vcpu_vmx *vmx)
 	for (q = 0; q < ARRAY_SIZE(fields); q++) {
 		for (i = 0; i < max_fields[q]; i++) {
 			field = fields[q][i];
+
+			switch (field.encoding) {
+			case INJECTED_EVENT_DATA:
+			case INJECTED_EVENT_DATA_HIGH:
+			case ORIGINAL_EVENT_DATA:
+			case ORIGINAL_EVENT_DATA_HIGH:
+				if (!cpu_feature_enabled(X86_FEATURE_FRED))
+					continue;
+				break;
+			default:
+				break;
+			}
+
 			val = vmcs12_read_any(vmcs12, field.encoding,
 					      field.offset);
 			__vmcs_writel(field.encoding, val);
@@ -1660,6 +1703,8 @@ static void copy_enlightened_to_vmcs12(struct vcpu_vmx *vmx, u32 hv_clean_fields
 			evmcs->vm_entry_intr_info_field;
 		vmcs12->vm_entry_exception_error_code =
 			evmcs->vm_entry_exception_error_code;
+		vmcs12->injected_event_data =
+			evmcs->injected_event_data;
 		vmcs12->vm_entry_instruction_len =
 			evmcs->vm_entry_instruction_len;
 	}
@@ -1725,6 +1770,14 @@ static void copy_enlightened_to_vmcs12(struct vcpu_vmx *vmx, u32 hv_clean_fields
 		vmcs12->guest_tr_base = evmcs->guest_tr_base;
 		vmcs12->guest_gdtr_base = evmcs->guest_gdtr_base;
 		vmcs12->guest_idtr_base = evmcs->guest_idtr_base;
+		vmcs12->guest_ia32_fred_config = evmcs->guest_ia32_fred_config;
+		vmcs12->guest_ia32_fred_rsp1 = evmcs->guest_ia32_fred_rsp1;
+		vmcs12->guest_ia32_fred_rsp2 = evmcs->guest_ia32_fred_rsp2;
+		vmcs12->guest_ia32_fred_rsp3 = evmcs->guest_ia32_fred_rsp3;
+		vmcs12->guest_ia32_fred_stklvls = evmcs->guest_ia32_fred_stklvls;
+		vmcs12->guest_ia32_fred_ssp1 = evmcs->guest_ia32_fred_ssp1;
+		vmcs12->guest_ia32_fred_ssp2 = evmcs->guest_ia32_fred_ssp2;
+		vmcs12->guest_ia32_fred_ssp3 = evmcs->guest_ia32_fred_ssp3;
 		vmcs12->guest_es_limit = evmcs->guest_es_limit;
 		vmcs12->guest_cs_limit = evmcs->guest_cs_limit;
 		vmcs12->guest_ss_limit = evmcs->guest_ss_limit;
@@ -1781,6 +1834,14 @@ static void copy_enlightened_to_vmcs12(struct vcpu_vmx *vmx, u32 hv_clean_fields
 		vmcs12->host_tr_base = evmcs->host_tr_base;
 		vmcs12->host_gdtr_base = evmcs->host_gdtr_base;
 		vmcs12->host_idtr_base = evmcs->host_idtr_base;
+		vmcs12->host_ia32_fred_config = evmcs->host_ia32_fred_config;
+		vmcs12->host_ia32_fred_rsp1 = evmcs->host_ia32_fred_rsp1;
+		vmcs12->host_ia32_fred_rsp2 = evmcs->host_ia32_fred_rsp2;
+		vmcs12->host_ia32_fred_rsp3 = evmcs->host_ia32_fred_rsp3;
+		vmcs12->host_ia32_fred_stklvls = evmcs->host_ia32_fred_stklvls;
+		vmcs12->host_ia32_fred_ssp1 = evmcs->host_ia32_fred_ssp1;
+		vmcs12->host_ia32_fred_ssp2 = evmcs->host_ia32_fred_ssp2;
+		vmcs12->host_ia32_fred_ssp3 = evmcs->host_ia32_fred_ssp3;
 		vmcs12->host_rsp = evmcs->host_rsp;
 	}
 
@@ -1840,6 +1901,7 @@ static void copy_enlightened_to_vmcs12(struct vcpu_vmx *vmx, u32 hv_clean_fields
 	 * vmcs12->vm_exit_intr_error_code = evmcs->vm_exit_intr_error_code;
 	 * vmcs12->idt_vectoring_info_field = evmcs->idt_vectoring_info_field;
 	 * vmcs12->idt_vectoring_error_code = evmcs->idt_vectoring_error_code;
+	 * vmcs12->original_event_data = evmcs->original_event_data;
 	 * vmcs12->vm_exit_instruction_len = evmcs->vm_exit_instruction_len;
 	 * vmcs12->vmx_instruction_info = evmcs->vmx_instruction_info;
 	 * vmcs12->exit_qualification = evmcs->exit_qualification;
@@ -1975,6 +2037,14 @@ static void copy_vmcs12_to_enlightened(struct vcpu_vmx *vmx)
 	evmcs->guest_tr_base = vmcs12->guest_tr_base;
 	evmcs->guest_gdtr_base = vmcs12->guest_gdtr_base;
 	evmcs->guest_idtr_base = vmcs12->guest_idtr_base;
+	evmcs->guest_ia32_fred_config = vmcs12->guest_ia32_fred_config;
+	evmcs->guest_ia32_fred_rsp1 = vmcs12->guest_ia32_fred_rsp1;
+	evmcs->guest_ia32_fred_rsp2 = vmcs12->guest_ia32_fred_rsp2;
+	evmcs->guest_ia32_fred_rsp3 = vmcs12->guest_ia32_fred_rsp3;
+	evmcs->guest_ia32_fred_stklvls = vmcs12->guest_ia32_fred_stklvls;
+	evmcs->guest_ia32_fred_ssp1 = vmcs12->guest_ia32_fred_ssp1;
+	evmcs->guest_ia32_fred_ssp2 = vmcs12->guest_ia32_fred_ssp2;
+	evmcs->guest_ia32_fred_ssp3 = vmcs12->guest_ia32_fred_ssp3;
 
 	evmcs->guest_ia32_pat = vmcs12->guest_ia32_pat;
 	evmcs->guest_ia32_efer = vmcs12->guest_ia32_efer;
@@ -2005,6 +2075,7 @@ static void copy_vmcs12_to_enlightened(struct vcpu_vmx *vmx)
 	evmcs->vm_exit_intr_error_code = vmcs12->vm_exit_intr_error_code;
 	evmcs->idt_vectoring_info_field = vmcs12->idt_vectoring_info_field;
 	evmcs->idt_vectoring_error_code = vmcs12->idt_vectoring_error_code;
+	evmcs->original_event_data = vmcs12->original_event_data;
 	evmcs->vm_exit_instruction_len = vmcs12->vm_exit_instruction_len;
 	evmcs->vmx_instruction_info = vmcs12->vmx_instruction_info;
 
@@ -2021,6 +2092,7 @@ static void copy_vmcs12_to_enlightened(struct vcpu_vmx *vmx)
 	evmcs->vm_entry_intr_info_field = vmcs12->vm_entry_intr_info_field;
 	evmcs->vm_entry_exception_error_code =
 		vmcs12->vm_entry_exception_error_code;
+	evmcs->injected_event_data = vmcs12->injected_event_data;
 	evmcs->vm_entry_instruction_len = vmcs12->vm_entry_instruction_len;
 
 	evmcs->guest_rip = vmcs12->guest_rip;
@@ -2435,6 +2507,8 @@ static void prepare_vmcs02_early(struct vcpu_vmx *vmx, struct loaded_vmcs *vmcs0
 			     vmcs12->vm_entry_instruction_len);
 		vmcs_write32(GUEST_INTERRUPTIBILITY_INFO,
 			     vmcs12->guest_interruptibility_info);
+		if (cpu_feature_enabled(X86_FEATURE_FRED))
+			vmcs_write64(INJECTED_EVENT_DATA, vmcs12->injected_event_data);
 		vmx->loaded_vmcs->nmi_known_unmasked =
 			!(vmcs12->guest_interruptibility_info & GUEST_INTR_STATE_NMI);
 	} else {
@@ -2485,6 +2559,17 @@ static void prepare_vmcs02_rare(struct vcpu_vmx *vmx, struct vmcs12 *vmcs12)
 		vmcs_writel(GUEST_GDTR_BASE, vmcs12->guest_gdtr_base);
 		vmcs_writel(GUEST_IDTR_BASE, vmcs12->guest_idtr_base);
 
+		if (cpu_feature_enabled(X86_FEATURE_FRED)) {
+			vmcs_write64(GUEST_IA32_FRED_CONFIG, vmcs12->guest_ia32_fred_config);
+			vmcs_write64(GUEST_IA32_FRED_RSP1, vmcs12->guest_ia32_fred_rsp1);
+			vmcs_write64(GUEST_IA32_FRED_RSP2, vmcs12->guest_ia32_fred_rsp2);
+			vmcs_write64(GUEST_IA32_FRED_RSP3, vmcs12->guest_ia32_fred_rsp3);
+			vmcs_write64(GUEST_IA32_FRED_STKLVLS, vmcs12->guest_ia32_fred_stklvls);
+			vmcs_write64(GUEST_IA32_FRED_SSP1, vmcs12->guest_ia32_fred_ssp1);
+			vmcs_write64(GUEST_IA32_FRED_SSP2, vmcs12->guest_ia32_fred_ssp2);
+			vmcs_write64(GUEST_IA32_FRED_SSP3, vmcs12->guest_ia32_fred_ssp3);
+		}
+
 		vmx->segment_cache.bitmask = 0;
 	}
 
@@ -3765,6 +3850,22 @@ vmcs12_guest_cr4(struct kvm_vcpu *vcpu, struct vmcs12 *vmcs12)
 			vcpu->arch.cr4_guest_owned_bits));
 }
 
+static inline unsigned long
+nested_vmx_get_event_data(struct kvm_vcpu *vcpu, bool for_ex_vmexit)
+{
+	struct kvm_queued_exception *ex = for_ex_vmexit ?
+		&vcpu->arch.exception_vmexit : &vcpu->arch.exception;
+
+	if (ex->has_payload)
+		return ex->payload;
+	else if (ex->vector == PF_VECTOR)
+		return vcpu->arch.cr2;
+	else if (ex->vector == DB_VECTOR)
+		return (vcpu->arch.dr6 & ~DR6_BT) ^ DR6_ACTIVE_LOW;
+	else
+		return 0;
+}
+
 static void vmcs12_save_pending_event(struct kvm_vcpu *vcpu,
 				      struct vmcs12 *vmcs12,
 				      u32 vm_exit_reason, u32 exit_intr_info)
@@ -3772,6 +3873,8 @@ static void vmcs12_save_pending_event(struct kvm_vcpu *vcpu,
 	u32 idt_vectoring;
 	unsigned int nr;
 
+	vmcs12->original_event_data = 0;
+
 	/*
 	 * Per the SDM, VM-Exits due to double and triple faults are never
 	 * considered to occur during event delivery, even if the double/triple
@@ -3810,6 +3913,12 @@ static void vmcs12_save_pending_event(struct kvm_vcpu *vcpu,
 				vcpu->arch.exception.error_code;
 		}
 
+		idt_vectoring |= vcpu->arch.exception.nested ?
+				INTR_INFO_NESTED_EXCEPTION_MASK : 0;
+
+		vmcs12->original_event_data =
+			nested_vmx_get_event_data(vcpu, false);
+
 		vmcs12->idt_vectoring_info_field = idt_vectoring;
 	} else if (vcpu->arch.nmi_injected) {
 		vmcs12->idt_vectoring_info_field =
@@ -3900,19 +4009,7 @@ static void nested_vmx_inject_exception_vmexit(struct kvm_vcpu *vcpu)
 	struct kvm_queued_exception *ex = &vcpu->arch.exception_vmexit;
 	u32 intr_info = ex->vector | INTR_INFO_VALID_MASK;
 	struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
-	unsigned long exit_qual;
-
-	if (ex->has_payload) {
-		exit_qual = ex->payload;
-	} else if (ex->vector == PF_VECTOR) {
-		exit_qual = vcpu->arch.cr2;
-	} else if (ex->vector == DB_VECTOR) {
-		exit_qual = vcpu->arch.dr6;
-		exit_qual &= ~DR6_BT;
-		exit_qual ^= DR6_ACTIVE_LOW;
-	} else {
-		exit_qual = 0;
-	}
+	unsigned long exit_qual = nested_vmx_get_event_data(vcpu, true);
 
 	/*
 	 * Unlike AMD's Paged Real Mode, which reports an error code on #PF
@@ -4282,6 +4379,14 @@ static bool is_vmcs12_ext_field(unsigned long field)
 	case GUEST_TR_BASE:
 	case GUEST_GDTR_BASE:
 	case GUEST_IDTR_BASE:
+	case GUEST_IA32_FRED_CONFIG:
+	case GUEST_IA32_FRED_RSP1:
+	case GUEST_IA32_FRED_RSP2:
+	case GUEST_IA32_FRED_RSP3:
+	case GUEST_IA32_FRED_STKLVLS:
+	case GUEST_IA32_FRED_SSP1:
+	case GUEST_IA32_FRED_SSP2:
+	case GUEST_IA32_FRED_SSP3:
 	case GUEST_PENDING_DBG_EXCEPTIONS:
 	case GUEST_BNDCFGS:
 		return true;
@@ -4331,6 +4436,18 @@ static void sync_vmcs02_to_vmcs12_rare(struct kvm_vcpu *vcpu,
 	vmcs12->guest_tr_base = vmcs_readl(GUEST_TR_BASE);
 	vmcs12->guest_gdtr_base = vmcs_readl(GUEST_GDTR_BASE);
 	vmcs12->guest_idtr_base = vmcs_readl(GUEST_IDTR_BASE);
+
+	if (cpu_feature_enabled(X86_FEATURE_FRED)) {
+		vmcs12->guest_ia32_fred_config = vmcs_read64(GUEST_IA32_FRED_CONFIG);
+		vmcs12->guest_ia32_fred_rsp1 = vmcs_read64(GUEST_IA32_FRED_RSP1);
+		vmcs12->guest_ia32_fred_rsp2 = vmcs_read64(GUEST_IA32_FRED_RSP2);
+		vmcs12->guest_ia32_fred_rsp3 = vmcs_read64(GUEST_IA32_FRED_RSP3);
+		vmcs12->guest_ia32_fred_stklvls = vmcs_read64(GUEST_IA32_FRED_STKLVLS);
+		vmcs12->guest_ia32_fred_ssp1 = vmcs_read64(GUEST_IA32_FRED_SSP1);
+		vmcs12->guest_ia32_fred_ssp2 = vmcs_read64(GUEST_IA32_FRED_SSP2);
+		vmcs12->guest_ia32_fred_ssp3 = vmcs_read64(GUEST_IA32_FRED_SSP3);
+	}
+
 	vmcs12->guest_pending_dbg_exceptions =
 		vmcs_readl(GUEST_PENDING_DBG_EXCEPTIONS);
 
@@ -4555,6 +4672,17 @@ static void load_vmcs12_host_state(struct kvm_vcpu *vcpu,
 	vmcs_write32(GUEST_IDTR_LIMIT, 0xFFFF);
 	vmcs_write32(GUEST_GDTR_LIMIT, 0xFFFF);
 
+	if (cpu_feature_enabled(X86_FEATURE_FRED)) {
+		vmcs_write64(GUEST_IA32_FRED_CONFIG, vmcs12->host_ia32_fred_config);
+		vmcs_write64(GUEST_IA32_FRED_RSP1, vmcs12->host_ia32_fred_rsp1);
+		vmcs_write64(GUEST_IA32_FRED_RSP2, vmcs12->host_ia32_fred_rsp2);
+		vmcs_write64(GUEST_IA32_FRED_RSP3, vmcs12->host_ia32_fred_rsp3);
+		vmcs_write64(GUEST_IA32_FRED_STKLVLS, vmcs12->host_ia32_fred_stklvls);
+		vmcs_write64(GUEST_IA32_FRED_SSP1, vmcs12->host_ia32_fred_ssp1);
+		vmcs_write64(GUEST_IA32_FRED_SSP2, vmcs12->host_ia32_fred_ssp2);
+		vmcs_write64(GUEST_IA32_FRED_SSP3, vmcs12->host_ia32_fred_ssp3);
+	}
+
 	/* If not VM_EXIT_CLEAR_BNDCFGS, the L2 value propagates to L1.  */
 	if (vmcs12->vm_exit_controls & VM_EXIT_CLEAR_BNDCFGS)
 		vmcs_write64(GUEST_BNDCFGS, 0);
diff --git a/arch/x86/kvm/vmx/vmcs12.c b/arch/x86/kvm/vmx/vmcs12.c
index 98457d7b2b23..59f17fdfad11 100644
--- a/arch/x86/kvm/vmx/vmcs12.c
+++ b/arch/x86/kvm/vmx/vmcs12.c
@@ -80,6 +80,7 @@ const unsigned short vmcs12_field_offsets[] = {
 	FIELD(VM_ENTRY_MSR_LOAD_COUNT, vm_entry_msr_load_count),
 	FIELD(VM_ENTRY_INTR_INFO_FIELD, vm_entry_intr_info_field),
 	FIELD(VM_ENTRY_EXCEPTION_ERROR_CODE, vm_entry_exception_error_code),
+	FIELD(INJECTED_EVENT_DATA, injected_event_data),
 	FIELD(VM_ENTRY_INSTRUCTION_LEN, vm_entry_instruction_len),
 	FIELD(TPR_THRESHOLD, tpr_threshold),
 	FIELD(SECONDARY_VM_EXEC_CONTROL, secondary_vm_exec_control),
@@ -89,6 +90,7 @@ const unsigned short vmcs12_field_offsets[] = {
 	FIELD(VM_EXIT_INTR_ERROR_CODE, vm_exit_intr_error_code),
 	FIELD(IDT_VECTORING_INFO_FIELD, idt_vectoring_info_field),
 	FIELD(IDT_VECTORING_ERROR_CODE, idt_vectoring_error_code),
+	FIELD(ORIGINAL_EVENT_DATA, original_event_data),
 	FIELD(VM_EXIT_INSTRUCTION_LEN, vm_exit_instruction_len),
 	FIELD(VMX_INSTRUCTION_INFO, vmx_instruction_info),
 	FIELD(GUEST_ES_LIMIT, guest_es_limit),
@@ -152,5 +154,21 @@ const unsigned short vmcs12_field_offsets[] = {
 	FIELD(HOST_IA32_SYSENTER_EIP, host_ia32_sysenter_eip),
 	FIELD(HOST_RSP, host_rsp),
 	FIELD(HOST_RIP, host_rip),
+	FIELD(GUEST_IA32_FRED_CONFIG, guest_ia32_fred_config),
+	FIELD(GUEST_IA32_FRED_RSP1, guest_ia32_fred_rsp1),
+	FIELD(GUEST_IA32_FRED_RSP2, guest_ia32_fred_rsp2),
+	FIELD(GUEST_IA32_FRED_RSP3, guest_ia32_fred_rsp3),
+	FIELD(GUEST_IA32_FRED_STKLVLS, guest_ia32_fred_stklvls),
+	FIELD(GUEST_IA32_FRED_SSP1, guest_ia32_fred_ssp1),
+	FIELD(GUEST_IA32_FRED_SSP2, guest_ia32_fred_ssp2),
+	FIELD(GUEST_IA32_FRED_SSP3, guest_ia32_fred_ssp3),
+	FIELD(HOST_IA32_FRED_CONFIG, host_ia32_fred_config),
+	FIELD(HOST_IA32_FRED_RSP1, host_ia32_fred_rsp1),
+	FIELD(HOST_IA32_FRED_RSP2, host_ia32_fred_rsp2),
+	FIELD(HOST_IA32_FRED_RSP3, host_ia32_fred_rsp3),
+	FIELD(HOST_IA32_FRED_STKLVLS, host_ia32_fred_stklvls),
+	FIELD(HOST_IA32_FRED_SSP1, host_ia32_fred_ssp1),
+	FIELD(HOST_IA32_FRED_SSP2, host_ia32_fred_ssp2),
+	FIELD(HOST_IA32_FRED_SSP3, host_ia32_fred_ssp3),
 };
 const unsigned int nr_vmcs12_fields = ARRAY_SIZE(vmcs12_field_offsets);
diff --git a/arch/x86/kvm/vmx/vmcs12.h b/arch/x86/kvm/vmx/vmcs12.h
index f50f897b9b5f..edf7fcef8ccf 100644
--- a/arch/x86/kvm/vmx/vmcs12.h
+++ b/arch/x86/kvm/vmx/vmcs12.h
@@ -186,6 +186,24 @@ struct __packed vmcs12 {
 	u16 host_tr_selector;
 	u16 guest_pml_index;
 	u64 secondary_vm_exit_controls;
+	u64 guest_ia32_fred_config;
+	u64 guest_ia32_fred_rsp1;
+	u64 guest_ia32_fred_rsp2;
+	u64 guest_ia32_fred_rsp3;
+	u64 guest_ia32_fred_stklvls;
+	u64 guest_ia32_fred_ssp1;
+	u64 guest_ia32_fred_ssp2;
+	u64 guest_ia32_fred_ssp3;
+	u64 host_ia32_fred_config;
+	u64 host_ia32_fred_rsp1;
+	u64 host_ia32_fred_rsp2;
+	u64 host_ia32_fred_rsp3;
+	u64 host_ia32_fred_stklvls;
+	u64 host_ia32_fred_ssp1;
+	u64 host_ia32_fred_ssp2;
+	u64 host_ia32_fred_ssp3;
+	u64 injected_event_data;
+	u64 original_event_data;
 };
 
 /*
@@ -360,6 +378,24 @@ static inline void vmx_check_vmcs12_offsets(void)
 	CHECK_OFFSET(host_tr_selector, 994);
 	CHECK_OFFSET(guest_pml_index, 996);
 	CHECK_OFFSET(secondary_vm_exit_controls, 998);
+	CHECK_OFFSET(guest_ia32_fred_config, 1006);
+	CHECK_OFFSET(guest_ia32_fred_rsp1, 1014);
+	CHECK_OFFSET(guest_ia32_fred_rsp2, 1022);
+	CHECK_OFFSET(guest_ia32_fred_rsp3, 1030);
+	CHECK_OFFSET(guest_ia32_fred_stklvls, 1038);
+	CHECK_OFFSET(guest_ia32_fred_ssp1, 1046);
+	CHECK_OFFSET(guest_ia32_fred_ssp2, 1054);
+	CHECK_OFFSET(guest_ia32_fred_ssp3, 1062);
+	CHECK_OFFSET(host_ia32_fred_config, 1070);
+	CHECK_OFFSET(host_ia32_fred_rsp1, 1078);
+	CHECK_OFFSET(host_ia32_fred_rsp2, 1086);
+	CHECK_OFFSET(host_ia32_fred_rsp3, 1094);
+	CHECK_OFFSET(host_ia32_fred_stklvls, 1102);
+	CHECK_OFFSET(host_ia32_fred_ssp1, 1110);
+	CHECK_OFFSET(host_ia32_fred_ssp2, 1118);
+	CHECK_OFFSET(host_ia32_fred_ssp3, 1126);
+	CHECK_OFFSET(injected_event_data, 1134);
+	CHECK_OFFSET(original_event_data, 1142);
 }
 
 extern const unsigned short vmcs12_field_offsets[];
diff --git a/arch/x86/kvm/vmx/vmcs_shadow_fields.h b/arch/x86/kvm/vmx/vmcs_shadow_fields.h
index cad128d1657b..33c3e2a06e41 100644
--- a/arch/x86/kvm/vmx/vmcs_shadow_fields.h
+++ b/arch/x86/kvm/vmx/vmcs_shadow_fields.h
@@ -41,9 +41,9 @@ SHADOW_FIELD_RW(HOST_GS_SELECTOR, host_gs_selector)
 SHADOW_FIELD_RO(VM_EXIT_REASON, vm_exit_reason)
 SHADOW_FIELD_RO(VM_EXIT_INTR_INFO, vm_exit_intr_info)
 SHADOW_FIELD_RO(VM_EXIT_INSTRUCTION_LEN, vm_exit_instruction_len)
+SHADOW_FIELD_RO(VM_EXIT_INTR_ERROR_CODE, vm_exit_intr_error_code)
 SHADOW_FIELD_RO(IDT_VECTORING_INFO_FIELD, idt_vectoring_info_field)
 SHADOW_FIELD_RO(IDT_VECTORING_ERROR_CODE, idt_vectoring_error_code)
-SHADOW_FIELD_RO(VM_EXIT_INTR_ERROR_CODE, vm_exit_intr_error_code)
 SHADOW_FIELD_RO(GUEST_CS_AR_BYTES, guest_cs_ar_bytes)
 SHADOW_FIELD_RO(GUEST_SS_AR_BYTES, guest_ss_ar_bytes)
 SHADOW_FIELD_RW(CPU_BASED_VM_EXEC_CONTROL, cpu_based_vm_exec_control)
@@ -74,6 +74,10 @@ SHADOW_FIELD_RW(HOST_GS_BASE, host_gs_base)
 /* 64-bit */
 SHADOW_FIELD_RO(GUEST_PHYSICAL_ADDRESS, guest_physical_address)
 SHADOW_FIELD_RO(GUEST_PHYSICAL_ADDRESS_HIGH, guest_physical_address)
+SHADOW_FIELD_RO(ORIGINAL_EVENT_DATA, original_event_data)
+SHADOW_FIELD_RO(ORIGINAL_EVENT_DATA_HIGH, original_event_data)
+SHADOW_FIELD_RW(INJECTED_EVENT_DATA, injected_event_data)
+SHADOW_FIELD_RW(INJECTED_EVENT_DATA_HIGH, injected_event_data)
 
 #undef SHADOW_FIELD_RO
 #undef SHADOW_FIELD_RW
-- 
2.42.0


^ permalink raw reply related	[flat|nested] 70+ messages in thread

* [PATCH v1 17/23] KVM: nVMX: Add support for VMX FRED controls
  2023-11-08 18:29 [PATCH v1 00/23] Enable FRED with KVM VMX Xin Li
                   ` (15 preceding siblings ...)
  2023-11-08 18:29 ` [PATCH v1 16/23] KVM: nVMX: Add FRED VMCS fields Xin Li
@ 2023-11-08 18:29 ` Xin Li
  2023-11-08 18:29 ` [PATCH v1 18/23] KVM: nVMX: Add VMCS FRED states checking Xin Li
                   ` (5 subsequent siblings)
  22 siblings, 0 replies; 70+ messages in thread
From: Xin Li @ 2023-11-08 18:29 UTC (permalink / raw)
  To: kvm, linux-doc, linux-kernel, linux-hyperv, linux-kselftest
  Cc: seanjc, pbonzini, corbet, kys, haiyangz, wei.liu, decui, tglx,
	mingo, bp, dave.hansen, x86, hpa, vkuznets, peterz,
	ravi.v.shankar

Add VMX FRED controls to nested VMX controls and set the VMX
nested-exception support bit (bit 58) in the nested IA32_VMX_BASIC MSR
when FRED is enabled.
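
For context, a minimal sketch (not part of this patch) of how an L1
hypervisor could probe the advertised bit; the bit position follows the
commit message above, and the helper name is made up for illustration:

  #include <linux/types.h>
  #include <linux/bits.h>
  #include <asm/msr.h>

  /* Bit 58 of IA32_VMX_BASIC: VMX nested-exception support. */
  #define VMX_BASIC_NESTED_EXCEPTION	BIT_ULL(58)

  static bool l1_has_vmx_nested_exception(void)
  {
  	u64 basic;

  	/* When running as an L1 guest, this RDMSR is emulated by KVM (L0). */
  	rdmsrl(MSR_IA32_VMX_BASIC, basic);
  	return !!(basic & VMX_BASIC_NESTED_EXCEPTION);
  }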

Tested-by: Shan Kang <shan.kang@intel.com>
Signed-off-by: Xin Li <xin3.li@intel.com>
---
 arch/x86/kvm/vmx/hyperv.c |  7 +++++--
 arch/x86/kvm/vmx/nested.c | 14 ++++++++++----
 arch/x86/kvm/vmx/vmx.c    |  3 +++
 3 files changed, 18 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/vmx/hyperv.c b/arch/x86/kvm/vmx/hyperv.c
index b12ef8fd76c9..28fca62f3887 100644
--- a/arch/x86/kvm/vmx/hyperv.c
+++ b/arch/x86/kvm/vmx/hyperv.c
@@ -106,7 +106,9 @@
 	 VM_EXIT_CLEAR_IA32_RTIT_CTL |					\
 	 VM_EXIT_ACTIVATE_SECONDARY_CONTROLS)
 
-#define EVMCS1_SUPPORTED_VMEXIT_CTRL2 (0ULL)
+#define EVMCS1_SUPPORTED_VMEXIT_CTRL2					\
+	(SECONDARY_VM_EXIT_SAVE_IA32_FRED |				\
+	 SECONDARY_VM_EXIT_LOAD_IA32_FRED)
 
 #define EVMCS1_SUPPORTED_VMENTRY_CTRL					\
 	(VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR |				\
@@ -117,7 +119,8 @@
 	 VM_ENTRY_LOAD_IA32_EFER |					\
 	 VM_ENTRY_LOAD_BNDCFGS |					\
 	 VM_ENTRY_PT_CONCEAL_PIP |					\
-	 VM_ENTRY_LOAD_IA32_RTIT_CTL)
+	 VM_ENTRY_LOAD_IA32_RTIT_CTL |					\
+	 VM_ENTRY_LOAD_IA32_FRED)
 
 #define EVMCS1_SUPPORTED_VMFUNC (0)
 
diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index af616c4a3491..b85cd5c0ec98 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -1230,10 +1230,12 @@ static bool is_bitwise_subset(u64 superset, u64 subset, u64 mask)
 #define VMX_BASIC_FEATURES_MASK			\
 	(VMX_BASIC_DUAL_MONITOR_TREATMENT |	\
 	 VMX_BASIC_INOUT |			\
-	 VMX_BASIC_TRUE_CTLS)
+	 VMX_BASIC_TRUE_CTLS |			\
+	 VMX_BASIC_NESTED_EXCEPTION)
 
-#define VMX_BASIC_RESERVED_BITS			\
-	(GENMASK_ULL(63, 56) | GENMASK_ULL(47, 45) | BIT_ULL(31))
+#define VMX_BASIC_RESERVED_BITS				\
+	(GENMASK_ULL(63, 59) | GENMASK_ULL(57, 56) |	\
+	 GENMASK_ULL(47, 45) | BIT_ULL(31))
 
 static int vmx_restore_vmx_basic(struct vcpu_vmx *vmx, u64 data)
 {
@@ -6985,7 +6987,8 @@ static void nested_vmx_setup_entry_ctls(struct vmcs_config *vmcs_conf,
 #ifdef CONFIG_X86_64
 		VM_ENTRY_IA32E_MODE |
 #endif
-		VM_ENTRY_LOAD_IA32_PAT | VM_ENTRY_LOAD_BNDCFGS;
+		VM_ENTRY_LOAD_IA32_PAT | VM_ENTRY_LOAD_BNDCFGS |
+		VM_ENTRY_LOAD_IA32_FRED;
 	msrs->entry_ctls_high |=
 		(VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR | VM_ENTRY_LOAD_IA32_EFER |
 		 VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL);
@@ -7141,6 +7144,9 @@ static void nested_vmx_setup_basic(struct nested_vmx_msrs *msrs)
 
 	if (cpu_has_vmx_basic_inout())
 		msrs->basic |= VMX_BASIC_INOUT;
+
+	if (cpu_feature_enabled(X86_FEATURE_FRED))
+		msrs->basic |= VMX_BASIC_NESTED_EXCEPTION;
 }
 
 static void nested_vmx_setup_cr_fixed(struct nested_vmx_msrs *msrs)
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index b826dc188fc7..3353fd6615a2 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7970,6 +7970,9 @@ static void nested_vmx_cr_fixed1_bits_update(struct kvm_vcpu *vcpu)
 	cr4_fixed1_update(X86_CR4_UMIP,       ecx, feature_bit(UMIP));
 	cr4_fixed1_update(X86_CR4_LA57,       ecx, feature_bit(LA57));
 
+	entry = kvm_find_cpuid_entry_index(vcpu, 0x7, 1);
+	cr4_fixed1_update(X86_CR4_FRED,       eax, feature_bit(FRED));
+
 #undef cr4_fixed1_update
 }
 
-- 
2.42.0


^ permalink raw reply related	[flat|nested] 70+ messages in thread

* [PATCH v1 18/23] KVM: nVMX: Add VMCS FRED states checking
  2023-11-08 18:29 [PATCH v1 00/23] Enable FRED with KVM VMX Xin Li
                   ` (16 preceding siblings ...)
  2023-11-08 18:29 ` [PATCH v1 17/23] KVM: nVMX: Add support for VMX FRED controls Xin Li
@ 2023-11-08 18:29 ` Xin Li
  2023-11-08 18:29 ` [PATCH v1 19/23] KVM: x86: Allow FRED/LKGS/WRMSRNS to be exposed to guests Xin Li
                   ` (4 subsequent siblings)
  22 siblings, 0 replies; 70+ messages in thread
From: Xin Li @ 2023-11-08 18:29 UTC (permalink / raw)
  To: kvm, linux-doc, linux-kernel, linux-hyperv, linux-kselftest
  Cc: seanjc, pbonzini, corbet, kys, haiyangz, wei.liu, decui, tglx,
	mingo, bp, dave.hansen, x86, hpa, vkuznets, peterz,
	ravi.v.shankar

Add FRED-related VMCS field checks.

Like real hardware, nested VMX performs checks on various VMCS fields,
including both controls and guest/host state.  With the introduction
of VMX FRED, add the corresponding checks for the FRED VMCS fields.
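
For reference, a quick sketch (not from the patch) of how the 0x834
literal used in the new IA32_FRED_CONFIG checks decomposes; the macro
name here is made up:

  #include <linux/bits.h>

  /* Bit 2, bits 5:4 and bit 11 of IA32_FRED_CONFIG must be zero. */
  #define FRED_CONFIG_RESV_BITS	(BIT_ULL(2) | GENMASK_ULL(5, 4) | BIT_ULL(11))
  /* 0x004 | 0x030 | 0x800 == 0x834, matching the literal in the checks. */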

Tested-by: Shan Kang <shan.kang@intel.com>
Signed-off-by: Xin Li <xin3.li@intel.com>
---
 arch/x86/kvm/vmx/nested.c | 70 ++++++++++++++++++++++++++++++++++++++-
 1 file changed, 69 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index b85cd5c0ec98..bbfa09d575d3 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -2940,6 +2940,7 @@ static int nested_check_vm_entry_controls(struct kvm_vcpu *vcpu,
 					  struct vmcs12 *vmcs12)
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
+	bool fred_enabled = !!(vmcs12->guest_cr4 & X86_CR4_FRED);
 
 	if (CC(!vmx_control_verify(vmcs12->vm_entry_controls,
 				    vmx->nested.msrs.entry_ctls_low,
@@ -2958,6 +2959,7 @@ static int nested_check_vm_entry_controls(struct kvm_vcpu *vcpu,
 		u32 intr_type = intr_info & INTR_INFO_INTR_TYPE_MASK;
 		bool has_error_code = intr_info & INTR_INFO_DELIVER_CODE_MASK;
 		bool should_have_error_code;
+		bool has_nested_exception = vmx->nested.msrs.basic & VMX_BASIC_NESTED_EXCEPTION;
 		bool urg = nested_cpu_has2(vmcs12,
 					   SECONDARY_EXEC_UNRESTRICTED_GUEST);
 		bool prot_mode = !urg || vmcs12->guest_cr0 & X86_CR0_PE;
@@ -2971,7 +2973,9 @@ static int nested_check_vm_entry_controls(struct kvm_vcpu *vcpu,
 		/* VM-entry interruption-info field: vector */
 		if (CC(intr_type == INTR_TYPE_NMI_INTR && vector != NMI_VECTOR) ||
 		    CC(intr_type == INTR_TYPE_HARD_EXCEPTION && vector > 31) ||
-		    CC(intr_type == INTR_TYPE_OTHER_EVENT && vector != 0))
+		    CC(intr_type == INTR_TYPE_OTHER_EVENT &&
+		       ((!fred_enabled && vector > 0) ||
+		        (fred_enabled && vector > 2))))
 			return -EINVAL;
 
 		/* VM-entry interruption-info field: deliver error code */
@@ -2990,6 +2994,15 @@ static int nested_check_vm_entry_controls(struct kvm_vcpu *vcpu,
 		if (CC(intr_info & INTR_INFO_RESVD_BITS_MASK))
 			return -EINVAL;
 
+		/*
+		 * When the CPU enumerates VMX nested-exception support, bit 13
+		 * (set to indicate a nested exception) of the intr info field
+		 * may have value 1.  Otherwise, bit 13 is reserved.
+		 */
+		if (CC(!has_nested_exception &&
+		       (intr_info & INTR_INFO_NESTED_EXCEPTION_MASK)))
+			return -EINVAL;
+
 		/* VM-entry instruction length */
 		switch (intr_type) {
 		case INTR_TYPE_SOFT_EXCEPTION:
@@ -2999,6 +3012,12 @@ static int nested_check_vm_entry_controls(struct kvm_vcpu *vcpu,
 			    CC(vmcs12->vm_entry_instruction_len == 0 &&
 			    CC(!nested_cpu_has_zero_length_injection(vcpu))))
 				return -EINVAL;
+			break;
+		case INTR_TYPE_OTHER_EVENT:
+			if (fred_enabled && (vector == 1 || vector == 2))
+				if (CC(vmcs12->vm_entry_instruction_len > 15))
+					return -EINVAL;
+			break;
 		}
 	}
 
@@ -3056,14 +3075,31 @@ static int nested_vmx_check_host_state(struct kvm_vcpu *vcpu,
 					   vmcs12->host_ia32_perf_global_ctrl)))
 		return -EINVAL;
 
+	/* Host FRED state checking */
 	if (ia32e) {
 		if (CC(!(vmcs12->host_cr4 & X86_CR4_PAE)))
 			return -EINVAL;
+		if (vmcs12->vm_exit_controls & VM_EXIT_ACTIVATE_SECONDARY_CONTROLS &&
+		    vmcs12->secondary_vm_exit_controls & SECONDARY_VM_EXIT_LOAD_IA32_FRED) {
+			/* Bit 2, bits 5:4, and bit 11 of the IA32_FRED_CONFIG must be zero */
+			if (CC(vmcs12->host_ia32_fred_config & 0x834) ||
+			    CC(vmcs12->host_ia32_fred_rsp1 & 0x3F) ||
+			    CC(vmcs12->host_ia32_fred_rsp2 & 0x3F) ||
+			    CC(vmcs12->host_ia32_fred_rsp3 & 0x3F))
+				return -EINVAL;
+			if (CC(is_noncanonical_address(vmcs12->host_ia32_fred_config & ~0xFFFULL, vcpu)) ||
+			    CC(is_noncanonical_address(vmcs12->host_ia32_fred_rsp1, vcpu)) ||
+			    CC(is_noncanonical_address(vmcs12->host_ia32_fred_rsp2, vcpu)) ||
+			    CC(is_noncanonical_address(vmcs12->host_ia32_fred_rsp3, vcpu)))
+				return -EINVAL;
+		}
 	} else {
 		if (CC(vmcs12->vm_entry_controls & VM_ENTRY_IA32E_MODE) ||
 		    CC(vmcs12->host_cr4 & X86_CR4_PCIDE) ||
 		    CC((vmcs12->host_rip) >> 32))
 			return -EINVAL;
+		if (CC(vmcs12->host_cr4 & X86_CR4_FRED))
+			return -EINVAL;
 	}
 
 	if (CC(vmcs12->host_cs_selector & (SEGMENT_RPL_MASK | SEGMENT_TI_MASK)) ||
@@ -3205,6 +3241,38 @@ static int nested_vmx_check_guest_state(struct kvm_vcpu *vcpu,
 	     CC((vmcs12->guest_bndcfgs & MSR_IA32_BNDCFGS_RSVD))))
 		return -EINVAL;
 
+	/* Guest FRED state checking */
+	if (ia32e) {
+		if (vmcs12->vm_entry_controls & VM_ENTRY_LOAD_IA32_FRED) {
+			/* Bit 2, bits 5:4, and bit 11 of the IA32_FRED_CONFIG must be zero */
+			if (CC(vmcs12->guest_ia32_fred_config & 0x834) ||
+			    CC(vmcs12->guest_ia32_fred_rsp1 & 0x3F) ||
+			    CC(vmcs12->guest_ia32_fred_rsp2 & 0x3F) ||
+			    CC(vmcs12->guest_ia32_fred_rsp3 & 0x3F))
+				return -EINVAL;
+			if (CC(is_noncanonical_address(vmcs12->guest_ia32_fred_config & ~0xFFFULL, vcpu)) ||
+			    CC(is_noncanonical_address(vmcs12->guest_ia32_fred_rsp1, vcpu)) ||
+			    CC(is_noncanonical_address(vmcs12->guest_ia32_fred_rsp2, vcpu)) ||
+			    CC(is_noncanonical_address(vmcs12->guest_ia32_fred_rsp3, vcpu)))
+				return -EINVAL;
+		}
+		if (vmcs12->guest_cr4 & X86_CR4_FRED) {
+			unsigned int ss_dpl = VMX_AR_DPL(vmcs12->guest_ss_ar_bytes);
+			if (CC(ss_dpl == 1 || ss_dpl == 2))
+				return -EINVAL;
+			if (ss_dpl == 0 &&
+			    CC(!(vmcs12->guest_cs_ar_bytes & VMX_AR_L_MASK)))
+				return -EINVAL;
+			if (ss_dpl == 3 &&
+			    (CC(vmcs12->guest_rflags & X86_EFLAGS_IOPL) ||
+			     CC(vmcs12->guest_interruptibility_info & GUEST_INTR_STATE_STI)))
+				return -EINVAL;
+		}
+	} else {
+		if (CC(vmcs12->guest_cr4 & X86_CR4_FRED))
+			return -EINVAL;
+	}
+
 	if (nested_check_guest_non_reg_state(vmcs12))
 		return -EINVAL;
 
-- 
2.42.0


^ permalink raw reply related	[flat|nested] 70+ messages in thread

* [PATCH v1 19/23] KVM: x86: Allow FRED/LKGS/WRMSRNS to be exposed to guests
  2023-11-08 18:29 [PATCH v1 00/23] Enable FRED with KVM VMX Xin Li
                   ` (17 preceding siblings ...)
  2023-11-08 18:29 ` [PATCH v1 18/23] KVM: nVMX: Add VMCS FRED states checking Xin Li
@ 2023-11-08 18:29 ` Xin Li
  2023-11-08 18:30 ` [PATCH v1 20/23] KVM: selftests: Add FRED VMCS fields to evmcs Xin Li
                   ` (3 subsequent siblings)
  22 siblings, 0 replies; 70+ messages in thread
From: Xin Li @ 2023-11-08 18:29 UTC (permalink / raw)
  To: kvm, linux-doc, linux-kernel, linux-hyperv, linux-kselftest
  Cc: seanjc, pbonzini, corbet, kys, haiyangz, wei.liu, decui, tglx,
	mingo, bp, dave.hansen, x86, hpa, vkuznets, peterz,
	ravi.v.shankar

Allow FRED/LKGS/WRMSRNS to be exposed to guests, so that a guest OS can
see these features when the guest is configured with FRED/LKGS/WRMSRNS in
QEMU.

A QEMU patch is required to actually expose FRED/LKGS/WRMSRNS to KVM guests.
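
For illustration only, a guest can sanity-check the newly exposed bits
via CPUID.(EAX=7,ECX=1):EAX; the bit positions (FRED=17, LKGS=18,
WRMSRNS=19) match the selftest definitions added later in this series:

  #include <stdbool.h>
  #include <stdint.h>
  #include <cpuid.h>

  static bool guest_has_fred_lkgs_wrmsrns(void)
  {
  	uint32_t eax, ebx, ecx, edx;

  	if (!__get_cpuid_count(7, 1, &eax, &ebx, &ecx, &edx))
  		return false;

  	/* FRED is EAX bit 17, LKGS bit 18, WRMSRNS bit 19. */
  	return (eax & (1u << 17)) && (eax & (1u << 18)) && (eax & (1u << 19));
  }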

Tested-by: Shan Kang <shan.kang@intel.com>
Signed-off-by: Xin Li <xin3.li@intel.com>
---
 arch/x86/kvm/cpuid.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index dda6fc4cfae8..aa11b3e394ae 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -670,8 +670,8 @@ void kvm_set_cpu_caps(void)
 
 	kvm_cpu_cap_mask(CPUID_7_1_EAX,
 		F(AVX_VNNI) | F(AVX512_BF16) | F(CMPCCXADD) |
-		F(FZRM) | F(FSRS) | F(FSRC) |
-		F(AMX_FP16) | F(AVX_IFMA)
+		F(FZRM) | F(FSRS) | F(FSRC) | F(FRED) | F(LKGS) |
+		F(WRMSRNS) | F(AMX_FP16) | F(AVX_IFMA)
 	);
 
 	kvm_cpu_cap_init_kvm_defined(CPUID_7_1_EDX,
-- 
2.42.0


^ permalink raw reply related	[flat|nested] 70+ messages in thread

* [PATCH v1 20/23] KVM: selftests: Add FRED VMCS fields to evmcs
  2023-11-08 18:29 [PATCH v1 00/23] Enable FRED with KVM VMX Xin Li
                   ` (18 preceding siblings ...)
  2023-11-08 18:29 ` [PATCH v1 19/23] KVM: x86: Allow FRED/LKGS/WRMSRNS to be exposed to guests Xin Li
@ 2023-11-08 18:30 ` Xin Li
  2023-11-08 18:30 ` [PATCH v1 21/23] KVM: selftests: Run debug_regs test with FRED enabled Xin Li
                   ` (2 subsequent siblings)
  22 siblings, 0 replies; 70+ messages in thread
From: Xin Li @ 2023-11-08 18:30 UTC (permalink / raw)
  To: kvm, linux-doc, linux-kernel, linux-hyperv, linux-kselftest
  Cc: seanjc, pbonzini, corbet, kys, haiyangz, wei.liu, decui, tglx,
	mingo, bp, dave.hansen, x86, hpa, vkuznets, peterz,
	ravi.v.shankar

Add FRED VMCS fields to evmcs.

Signed-off-by: Xin Li <xin3.li@intel.com>
---
 .../selftests/kvm/include/x86_64/evmcs.h      | 146 ++++++++++++++++++
 .../selftests/kvm/include/x86_64/vmx.h        |  20 +++
 2 files changed, 166 insertions(+)

diff --git a/tools/testing/selftests/kvm/include/x86_64/evmcs.h b/tools/testing/selftests/kvm/include/x86_64/evmcs.h
index 901caf0e0939..afd35f0c34fd 100644
--- a/tools/testing/selftests/kvm/include/x86_64/evmcs.h
+++ b/tools/testing/selftests/kvm/include/x86_64/evmcs.h
@@ -216,6 +216,27 @@ struct hv_enlightened_vmcs {
 	u64 host_ssp;
 	u64 host_ia32_int_ssp_table_addr;
 	u64 padding64_6;
+
+	u64 host_ia32_fred_config;
+	u64 host_ia32_fred_rsp1;
+	u64 host_ia32_fred_rsp2;
+	u64 host_ia32_fred_rsp3;
+	u64 host_ia32_fred_stklvls;
+	u64 host_ia32_fred_ssp1;
+	u64 host_ia32_fred_ssp2;
+	u64 host_ia32_fred_ssp3;
+
+	u64 guest_ia32_fred_config;
+	u64 guest_ia32_fred_rsp1;
+	u64 guest_ia32_fred_rsp2;
+	u64 guest_ia32_fred_rsp3;
+	u64 guest_ia32_fred_stklvls;
+	u64 guest_ia32_fred_ssp1;
+	u64 guest_ia32_fred_ssp2;
+	u64 guest_ia32_fred_ssp3;
+
+	u64 injected_event_data;
+	u64 original_event_data;
 } __packed;
 
 #define HV_VMX_ENLIGHTENED_CLEAN_FIELD_NONE                     0
@@ -450,6 +471,9 @@ static inline int evmcs_vmread(uint64_t encoding, uint64_t *value)
 	case GUEST_LINEAR_ADDRESS:
 		*value = current_evmcs->guest_linear_address;
 		break;
+	case ORIGINAL_EVENT_DATA:
+		*value = current_evmcs->original_event_data;
+		break;
 	case VM_EXIT_MSR_STORE_ADDR:
 		*value = current_evmcs->vm_exit_msr_store_addr;
 		break;
@@ -492,6 +516,9 @@ static inline int evmcs_vmread(uint64_t encoding, uint64_t *value)
 	case VM_ENTRY_EXCEPTION_ERROR_CODE:
 		*value = current_evmcs->vm_entry_exception_error_code;
 		break;
+	case INJECTED_EVENT_DATA:
+		*value = current_evmcs->injected_event_data;
+		break;
 	case VM_ENTRY_INSTRUCTION_LEN:
 		*value = current_evmcs->vm_entry_instruction_len;
 		break;
@@ -669,6 +696,54 @@ static inline int evmcs_vmread(uint64_t encoding, uint64_t *value)
 	case TSC_MULTIPLIER:
 		*value = current_evmcs->tsc_multiplier;
 		break;
+	case HOST_IA32_FRED_CONFIG:
+		*value = current_evmcs->host_ia32_fred_config;
+		break;
+	case HOST_IA32_FRED_RSP1:
+		*value = current_evmcs->host_ia32_fred_rsp1;
+		break;
+	case HOST_IA32_FRED_RSP2:
+		*value = current_evmcs->host_ia32_fred_rsp2;
+		break;
+	case HOST_IA32_FRED_RSP3:
+		*value = current_evmcs->host_ia32_fred_rsp3;
+		break;
+	case HOST_IA32_FRED_STKLVLS:
+		*value = current_evmcs->host_ia32_fred_stklvls;
+		break;
+	case HOST_IA32_FRED_SSP1:
+		*value = current_evmcs->host_ia32_fred_ssp1;
+		break;
+	case HOST_IA32_FRED_SSP2:
+		*value = current_evmcs->host_ia32_fred_ssp2;
+		break;
+	case HOST_IA32_FRED_SSP3:
+		*value = current_evmcs->host_ia32_fred_ssp3;
+		break;
+	case GUEST_IA32_FRED_CONFIG:
+		*value = current_evmcs->guest_ia32_fred_config;
+		break;
+	case GUEST_IA32_FRED_RSP1:
+		*value = current_evmcs->guest_ia32_fred_rsp1;
+		break;
+	case GUEST_IA32_FRED_RSP2:
+		*value = current_evmcs->guest_ia32_fred_rsp2;
+		break;
+	case GUEST_IA32_FRED_RSP3:
+		*value = current_evmcs->guest_ia32_fred_rsp3;
+		break;
+	case GUEST_IA32_FRED_STKLVLS:
+		*value = current_evmcs->guest_ia32_fred_stklvls;
+		break;
+	case GUEST_IA32_FRED_SSP1:
+		*value = current_evmcs->guest_ia32_fred_ssp1;
+		break;
+	case GUEST_IA32_FRED_SSP2:
+		*value = current_evmcs->guest_ia32_fred_ssp2;
+		break;
+	case GUEST_IA32_FRED_SSP3:
+		*value = current_evmcs->guest_ia32_fred_ssp3;
+		break;
 	default: return 1;
 	}
 
@@ -906,6 +981,10 @@ static inline int evmcs_vmwrite(uint64_t encoding, uint64_t value)
 		current_evmcs->guest_linear_address = value;
 		current_evmcs->hv_clean_fields &= ~HV_VMX_ENLIGHTENED_CLEAN_FIELD_NONE;
 		break;
+	case ORIGINAL_EVENT_DATA:
+		current_evmcs->original_event_data = value;
+		current_evmcs->hv_clean_fields &= ~HV_VMX_ENLIGHTENED_CLEAN_FIELD_NONE;
+		break;
 	case VM_EXIT_MSR_STORE_ADDR:
 		current_evmcs->vm_exit_msr_store_addr = value;
 		current_evmcs->hv_clean_fields &= ~HV_VMX_ENLIGHTENED_CLEAN_FIELD_ALL;
@@ -962,6 +1041,10 @@ static inline int evmcs_vmwrite(uint64_t encoding, uint64_t value)
 		current_evmcs->vm_entry_exception_error_code = value;
 		current_evmcs->hv_clean_fields &= ~HV_VMX_ENLIGHTENED_CLEAN_FIELD_CONTROL_EVENT;
 		break;
+	case INJECTED_EVENT_DATA:
+		current_evmcs->injected_event_data = value;
+		current_evmcs->hv_clean_fields &= ~HV_VMX_ENLIGHTENED_CLEAN_FIELD_CONTROL_EVENT;
+		break;
 	case VM_ENTRY_INSTRUCTION_LEN:
 		current_evmcs->vm_entry_instruction_len = value;
 		current_evmcs->hv_clean_fields &= ~HV_VMX_ENLIGHTENED_CLEAN_FIELD_CONTROL_EVENT;
@@ -1198,6 +1281,69 @@ static inline int evmcs_vmwrite(uint64_t encoding, uint64_t value)
 		current_evmcs->tsc_multiplier = value;
 		current_evmcs->hv_clean_fields &= ~HV_VMX_ENLIGHTENED_CLEAN_FIELD_CONTROL_GRP2;
 		break;
+	case HOST_IA32_FRED_CONFIG:
+		current_evmcs->host_ia32_fred_config = value;
+		current_evmcs->hv_clean_fields &= ~HV_VMX_ENLIGHTENED_CLEAN_FIELD_HOST_POINTER;
+		break;
+	case HOST_IA32_FRED_RSP1:
+		current_evmcs->host_ia32_fred_rsp1 = value;
+		current_evmcs->hv_clean_fields &= ~HV_VMX_ENLIGHTENED_CLEAN_FIELD_HOST_POINTER;
+		break;
+	case HOST_IA32_FRED_RSP2:
+		current_evmcs->host_ia32_fred_rsp2 = value;
+		current_evmcs->hv_clean_fields &= ~HV_VMX_ENLIGHTENED_CLEAN_FIELD_HOST_POINTER;
+		break;
+	case HOST_IA32_FRED_RSP3:
+		current_evmcs->host_ia32_fred_rsp3 = value;
+		current_evmcs->hv_clean_fields &= ~HV_VMX_ENLIGHTENED_CLEAN_FIELD_HOST_POINTER;
+		break;
+	case HOST_IA32_FRED_STKLVLS:
+		current_evmcs->host_ia32_fred_stklvls = value;
+		current_evmcs->hv_clean_fields &= ~HV_VMX_ENLIGHTENED_CLEAN_FIELD_HOST_POINTER;
+		break;
+	case HOST_IA32_FRED_SSP1:
+		current_evmcs->host_ia32_fred_ssp1 = value;
+		current_evmcs->hv_clean_fields &= ~HV_VMX_ENLIGHTENED_CLEAN_FIELD_HOST_POINTER;
+		break;
+	case HOST_IA32_FRED_SSP2:
+		current_evmcs->host_ia32_fred_ssp2 = value;
+		current_evmcs->hv_clean_fields &= ~HV_VMX_ENLIGHTENED_CLEAN_FIELD_HOST_POINTER;
+		break;
+	case HOST_IA32_FRED_SSP3:
+		current_evmcs->host_ia32_fred_ssp3 = value;
+		current_evmcs->hv_clean_fields &= ~HV_VMX_ENLIGHTENED_CLEAN_FIELD_HOST_POINTER;
+		break;
+	case GUEST_IA32_FRED_CONFIG:
+		current_evmcs->guest_ia32_fred_config = value;
+		current_evmcs->hv_clean_fields &= ~HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP2;
+		break;
+	case GUEST_IA32_FRED_RSP1:
+		current_evmcs->guest_ia32_fred_rsp1 = value;
+		current_evmcs->hv_clean_fields &= ~HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP2;
+		break;
+	case GUEST_IA32_FRED_RSP2:
+		current_evmcs->guest_ia32_fred_rsp2 = value;
+		current_evmcs->hv_clean_fields &= ~HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP2;
+		break;
+	case GUEST_IA32_FRED_RSP3:
+		current_evmcs->guest_ia32_fred_rsp3 = value;
+		current_evmcs->hv_clean_fields &= ~HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP2;
+		break;
+	case GUEST_IA32_FRED_STKLVLS:
+		current_evmcs->guest_ia32_fred_stklvls = value;
+		current_evmcs->hv_clean_fields &= ~HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP2;
+		break;
+	case GUEST_IA32_FRED_SSP1:
+		current_evmcs->guest_ia32_fred_ssp1 = value;
+		current_evmcs->hv_clean_fields &= ~HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP2;
+		break;
+	case GUEST_IA32_FRED_SSP2:
+		current_evmcs->guest_ia32_fred_ssp2 = value;
+		current_evmcs->hv_clean_fields &= ~HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP2;
+		break;
+	case GUEST_IA32_FRED_SSP3:
+		current_evmcs->guest_ia32_fred_ssp3 = value;
+		current_evmcs->hv_clean_fields &= ~HV_VMX_ENLIGHTENED_CLEAN_FIELD_GUEST_GRP2;
+		break;
 	default: return 1;
 	}
 
diff --git a/tools/testing/selftests/kvm/include/x86_64/vmx.h b/tools/testing/selftests/kvm/include/x86_64/vmx.h
index 5f0c0a29c556..e1fdbc293af7 100644
--- a/tools/testing/selftests/kvm/include/x86_64/vmx.h
+++ b/tools/testing/selftests/kvm/include/x86_64/vmx.h
@@ -165,8 +165,12 @@ enum vmcs_field {
 	ENCLS_EXITING_BITMAP_HIGH	= 0x0000202F,
 	TSC_MULTIPLIER			= 0x00002032,
 	TSC_MULTIPLIER_HIGH		= 0x00002033,
+	INJECTED_EVENT_DATA		= 0x00002052,
+	INJECTED_EVENT_DATA_HIGH	= 0x00002053,
 	GUEST_PHYSICAL_ADDRESS		= 0x00002400,
 	GUEST_PHYSICAL_ADDRESS_HIGH	= 0x00002401,
+	ORIGINAL_EVENT_DATA		= 0x00002404,
+	ORIGINAL_EVENT_DATA_HIGH	= 0x00002405,
 	VMCS_LINK_POINTER		= 0x00002800,
 	VMCS_LINK_POINTER_HIGH		= 0x00002801,
 	GUEST_IA32_DEBUGCTL		= 0x00002802,
@@ -187,12 +191,28 @@ enum vmcs_field {
 	GUEST_PDPTR3_HIGH		= 0x00002811,
 	GUEST_BNDCFGS			= 0x00002812,
 	GUEST_BNDCFGS_HIGH		= 0x00002813,
+	GUEST_IA32_FRED_CONFIG		= 0x0000281a,
+	GUEST_IA32_FRED_RSP1		= 0x0000281c,
+	GUEST_IA32_FRED_RSP2		= 0x0000281e,
+	GUEST_IA32_FRED_RSP3		= 0x00002820,
+	GUEST_IA32_FRED_STKLVLS		= 0x00002822,
+	GUEST_IA32_FRED_SSP1		= 0x00002824,
+	GUEST_IA32_FRED_SSP2		= 0x00002826,
+	GUEST_IA32_FRED_SSP3		= 0x00002828,
 	HOST_IA32_PAT			= 0x00002c00,
 	HOST_IA32_PAT_HIGH		= 0x00002c01,
 	HOST_IA32_EFER			= 0x00002c02,
 	HOST_IA32_EFER_HIGH		= 0x00002c03,
 	HOST_IA32_PERF_GLOBAL_CTRL	= 0x00002c04,
 	HOST_IA32_PERF_GLOBAL_CTRL_HIGH	= 0x00002c05,
+	HOST_IA32_FRED_CONFIG		= 0x00002c08,
+	HOST_IA32_FRED_RSP1		= 0x00002c0a,
+	HOST_IA32_FRED_RSP2		= 0x00002c0c,
+	HOST_IA32_FRED_RSP3		= 0x00002c0e,
+	HOST_IA32_FRED_STKLVLS		= 0x00002c10,
+	HOST_IA32_FRED_SSP1		= 0x00002c12,
+	HOST_IA32_FRED_SSP2		= 0x00002c14,
+	HOST_IA32_FRED_SSP3		= 0x00002c16,
 	PIN_BASED_VM_EXEC_CONTROL	= 0x00004000,
 	CPU_BASED_VM_EXEC_CONTROL	= 0x00004002,
 	EXCEPTION_BITMAP		= 0x00004004,
-- 
2.42.0


^ permalink raw reply related	[flat|nested] 70+ messages in thread

* [PATCH v1 21/23] KVM: selftests: Run debug_regs test with FRED enabled
  2023-11-08 18:29 [PATCH v1 00/23] Enable FRED with KVM VMX Xin Li
                   ` (19 preceding siblings ...)
  2023-11-08 18:30 ` [PATCH v1 20/23] KVM: selftests: Add FRED VMCS fields to evmcs Xin Li
@ 2023-11-08 18:30 ` Xin Li
  2023-11-08 18:30 ` [PATCH v1 22/23] KVM: selftests: Add a new VM guest mode to run user level code Xin Li
  2023-11-08 18:30 ` [PATCH v1 23/23] KVM: selftests: Add fred exception tests Xin Li
  22 siblings, 0 replies; 70+ messages in thread
From: Xin Li @ 2023-11-08 18:30 UTC (permalink / raw)
  To: kvm, linux-doc, linux-kernel, linux-hyperv, linux-kselftest
  Cc: seanjc, pbonzini, corbet, kys, haiyangz, wei.liu, decui, tglx,
	mingo, bp, dave.hansen, x86, hpa, vkuznets, peterz,
	ravi.v.shankar

Run another round of the debug_regs test with FRED enabled, if FRED is
available.

Signed-off-by: Xin Li <xin3.li@intel.com>
---
 .../selftests/kvm/include/x86_64/processor.h  |  4 ++
 .../testing/selftests/kvm/x86_64/debug_regs.c | 50 ++++++++++++++-----
 2 files changed, 41 insertions(+), 13 deletions(-)

diff --git a/tools/testing/selftests/kvm/include/x86_64/processor.h b/tools/testing/selftests/kvm/include/x86_64/processor.h
index 25bc61dac5fb..165d21fd1577 100644
--- a/tools/testing/selftests/kvm/include/x86_64/processor.h
+++ b/tools/testing/selftests/kvm/include/x86_64/processor.h
@@ -47,6 +47,7 @@ extern bool host_cpu_is_amd;
 #define X86_CR4_SMEP		(1ul << 20)
 #define X86_CR4_SMAP		(1ul << 21)
 #define X86_CR4_PKE		(1ul << 22)
+#define X86_CR4_FRED		(1ul << 32)
 
 struct xstate_header {
 	u64				xstate_bv;
@@ -163,6 +164,9 @@ struct kvm_x86_cpu_feature {
 #define	X86_FEATURE_SPEC_CTRL		KVM_X86_CPU_FEATURE(0x7, 0, EDX, 26)
 #define	X86_FEATURE_ARCH_CAPABILITIES	KVM_X86_CPU_FEATURE(0x7, 0, EDX, 29)
 #define	X86_FEATURE_PKS			KVM_X86_CPU_FEATURE(0x7, 0, ECX, 31)
+#define	X86_FEATURE_FRED		KVM_X86_CPU_FEATURE(0x7, 1, EAX, 17)
+#define	X86_FEATURE_LKGS		KVM_X86_CPU_FEATURE(0x7, 1, EAX, 18)
+#define	X86_FEATURE_WRMSRNS		KVM_X86_CPU_FEATURE(0x7, 1, EAX, 19)
 #define	X86_FEATURE_XTILECFG		KVM_X86_CPU_FEATURE(0xD, 0, EAX, 17)
 #define	X86_FEATURE_XTILEDATA		KVM_X86_CPU_FEATURE(0xD, 0, EAX, 18)
 #define	X86_FEATURE_XSAVES		KVM_X86_CPU_FEATURE(0xD, 1, EAX, 3)
diff --git a/tools/testing/selftests/kvm/x86_64/debug_regs.c b/tools/testing/selftests/kvm/x86_64/debug_regs.c
index f6b295e0b2d2..69055e764f15 100644
--- a/tools/testing/selftests/kvm/x86_64/debug_regs.c
+++ b/tools/testing/selftests/kvm/x86_64/debug_regs.c
@@ -20,7 +20,7 @@ uint32_t guest_value;
 
 extern unsigned char sw_bp, hw_bp, write_data, ss_start, bd_start;
 
-static void guest_code(void)
+static void guest_test_code(void)
 {
 	/* Create a pending interrupt on current vCPU */
 	x2apic_enable();
@@ -61,6 +61,15 @@ static void guest_code(void)
 
 	/* DR6.BD test */
 	asm volatile("bd_start: mov %%dr0, %%rax" : : : "rax");
+}
+
+static void guest_code(void)
+{
+	guest_test_code();
+
+	if (get_cr4() & X86_CR4_FRED)
+		guest_test_code();
+
 	GUEST_DONE();
 }
 
@@ -75,19 +84,15 @@ static void vcpu_skip_insn(struct kvm_vcpu *vcpu, int insn_len)
 	vcpu_regs_set(vcpu, &regs);
 }
 
-int main(void)
+void run_test(struct kvm_vcpu *vcpu)
 {
 	struct kvm_guest_debug debug;
+	struct kvm_run *run = vcpu->run;
 	unsigned long long target_dr6, target_rip;
-	struct kvm_vcpu *vcpu;
-	struct kvm_run *run;
-	struct kvm_vm *vm;
-	struct ucall uc;
-	uint64_t cmd;
 	int i;
 	/* Instruction lengths starting at ss_start */
 	int ss_size[6] = {
-		1,		/* sti*/
+		1,		/* sti */
 		2,		/* xor */
 		2,		/* cpuid */
 		5,		/* mov */
@@ -95,11 +100,6 @@ int main(void)
 		1,		/* cli */
 	};
 
-	TEST_REQUIRE(kvm_has_cap(KVM_CAP_SET_GUEST_DEBUG));
-
-	vm = vm_create_with_one_vcpu(&vcpu, guest_code);
-	run = vcpu->run;
-
 	/* Test software BPs - int3 */
 	memset(&debug, 0, sizeof(debug));
 	debug.control = KVM_GUESTDBG_ENABLE | KVM_GUESTDBG_USE_SW_BP;
@@ -202,6 +202,30 @@ int main(void)
 	/* Disable all debug controls, run to the end */
 	memset(&debug, 0, sizeof(debug));
 	vcpu_guest_debug_set(vcpu, &debug);
+}
+
+int main(void)
+{
+	struct kvm_vcpu *vcpu;
+	struct kvm_vm *vm;
+	struct ucall uc;
+	uint64_t cmd;
+
+	TEST_REQUIRE(kvm_has_cap(KVM_CAP_SET_GUEST_DEBUG));
+
+	vm = vm_create_with_one_vcpu(&vcpu, guest_code);
+
+	run_test(vcpu);
+
+	if (kvm_cpu_has(X86_FEATURE_FRED)) {
+		struct kvm_sregs sregs;
+
+		vcpu_sregs_get(vcpu, &sregs);
+		sregs.cr4 |= X86_CR4_FRED;
+		vcpu_sregs_set(vcpu, &sregs);
+
+		run_test(vcpu);
+	}
 
 	vcpu_run(vcpu);
 	TEST_ASSERT_KVM_EXIT_REASON(vcpu, KVM_EXIT_IO);
-- 
2.42.0


^ permalink raw reply related	[flat|nested] 70+ messages in thread

* [PATCH v1 22/23] KVM: selftests: Add a new VM guest mode to run user level code
  2023-11-08 18:29 [PATCH v1 00/23] Enable FRED with KVM VMX Xin Li
                   ` (20 preceding siblings ...)
  2023-11-08 18:30 ` [PATCH v1 21/23] KVM: selftests: Run debug_regs test with FRED enabled Xin Li
@ 2023-11-08 18:30 ` Xin Li
  2023-11-08 18:30 ` [PATCH v1 23/23] KVM: selftests: Add fred exception tests Xin Li
  22 siblings, 0 replies; 70+ messages in thread
From: Xin Li @ 2023-11-08 18:30 UTC (permalink / raw)
  To: kvm, linux-doc, linux-kernel, linux-hyperv, linux-kselftest
  Cc: seanjc, pbonzini, corbet, kys, haiyangz, wei.liu, decui, tglx,
	mingo, bp, dave.hansen, x86, hpa, vkuznets, peterz,
	ravi.v.shankar

Add a new VM guest mode, VM_MODE_PXXV48_4K_USER, which sets the user bit
in guest page table entries, thus allowing user-level code to run in guests.
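
A usage sketch, mirroring the FRED selftest added later in this series
(guest_code stands for whatever guest entry point a test defines):

  #include "kvm_util.h"
  #include "processor.h"

  static void guest_code(void);	/* test-specific guest entry point */

  static struct kvm_vm *create_user_mode_vm(struct kvm_vcpu **vcpu)
  {
  	/* Same as VM_MODE_PXXV48_4K, except PTEs also get the USER bit. */
  	return __vm_create_with_vcpus(VM_MODE_PXXV48_4K_USER, 1, 0,
  				      guest_code, vcpu);
  }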

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Xin Li <xin3.li@intel.com>
---
 .../testing/selftests/kvm/include/kvm_util_base.h |  1 +
 tools/testing/selftests/kvm/lib/kvm_util.c        |  5 ++++-
 .../testing/selftests/kvm/lib/x86_64/processor.c  | 15 ++++++++++-----
 tools/testing/selftests/kvm/lib/x86_64/vmx.c      |  4 ++--
 4 files changed, 17 insertions(+), 8 deletions(-)

diff --git a/tools/testing/selftests/kvm/include/kvm_util_base.h b/tools/testing/selftests/kvm/include/kvm_util_base.h
index a18db6a7b3cf..1d2922447978 100644
--- a/tools/testing/selftests/kvm/include/kvm_util_base.h
+++ b/tools/testing/selftests/kvm/include/kvm_util_base.h
@@ -185,6 +185,7 @@ enum vm_guest_mode {
 	VM_MODE_P36V48_16K,
 	VM_MODE_P36V48_64K,
 	VM_MODE_P36V47_16K,
+	VM_MODE_PXXV48_4K_USER,	/* For 48bits VA but ANY bits PA with USER bit set */
 	NUM_VM_MODES,
 };
 
diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c b/tools/testing/selftests/kvm/lib/kvm_util.c
index 7a8af1821f5d..36f2acb378b6 100644
--- a/tools/testing/selftests/kvm/lib/kvm_util.c
+++ b/tools/testing/selftests/kvm/lib/kvm_util.c
@@ -162,6 +162,7 @@ const char *vm_guest_mode_string(uint32_t i)
 		[VM_MODE_P36V48_16K]	= "PA-bits:36,  VA-bits:48, 16K pages",
 		[VM_MODE_P36V48_64K]	= "PA-bits:36,  VA-bits:48, 64K pages",
 		[VM_MODE_P36V47_16K]	= "PA-bits:36,  VA-bits:47, 16K pages",
+		[VM_MODE_PXXV48_4K_USER]	= "PA-bits:ANY, VA-bits:48,  4K user pages",
 	};
 	_Static_assert(sizeof(strings)/sizeof(char *) == NUM_VM_MODES,
 		       "Missing new mode strings?");
@@ -187,6 +188,7 @@ const struct vm_guest_mode_params vm_guest_mode_params[] = {
 	[VM_MODE_P36V48_16K]	= { 36, 48,  0x4000, 14 },
 	[VM_MODE_P36V48_64K]	= { 36, 48, 0x10000, 16 },
 	[VM_MODE_P36V47_16K]	= { 36, 47,  0x4000, 14 },
+	[VM_MODE_PXXV48_4K_USER]	= {  0,  0,  0x1000, 12 },
 };
 _Static_assert(sizeof(vm_guest_mode_params)/sizeof(struct vm_guest_mode_params) == NUM_VM_MODES,
 	       "Missing new mode params?");
@@ -260,6 +262,7 @@ struct kvm_vm *____vm_create(enum vm_guest_mode mode)
 		vm->pgtable_levels = 3;
 		break;
 	case VM_MODE_PXXV48_4K:
+	case VM_MODE_PXXV48_4K_USER:
 #ifdef __x86_64__
 		kvm_get_cpu_address_width(&vm->pa_bits, &vm->va_bits);
 		/*
@@ -275,7 +278,7 @@ struct kvm_vm *____vm_create(enum vm_guest_mode mode)
 		vm->pgtable_levels = 4;
 		vm->va_bits = 48;
 #else
-		TEST_FAIL("VM_MODE_PXXV48_4K not supported on non-x86 platforms");
+		TEST_FAIL("VM_MODE_PXXV48_4K(_USER) not supported on non-x86 platforms");
 #endif
 		break;
 	case VM_MODE_P47V64_4K:
diff --git a/tools/testing/selftests/kvm/lib/x86_64/processor.c b/tools/testing/selftests/kvm/lib/x86_64/processor.c
index d8288374078e..a8e60641df53 100644
--- a/tools/testing/selftests/kvm/lib/x86_64/processor.c
+++ b/tools/testing/selftests/kvm/lib/x86_64/processor.c
@@ -124,8 +124,8 @@ bool kvm_is_tdp_enabled(void)
 
 void virt_arch_pgd_alloc(struct kvm_vm *vm)
 {
-	TEST_ASSERT(vm->mode == VM_MODE_PXXV48_4K, "Attempt to use "
-		"unknown or unsupported guest mode, mode: 0x%x", vm->mode);
+	TEST_ASSERT((vm->mode == VM_MODE_PXXV48_4K) || (vm->mode == VM_MODE_PXXV48_4K_USER),
+		"Attempt to use unknown or unsupported guest mode, mode: 0x%x", vm->mode);
 
 	/* If needed, create page map l4 table. */
 	if (!vm->pgd_created) {
@@ -159,6 +159,8 @@ static uint64_t *virt_create_upper_pte(struct kvm_vm *vm,
 
 	if (!(*pte & PTE_PRESENT_MASK)) {
 		*pte = PTE_PRESENT_MASK | PTE_WRITABLE_MASK;
+		if (vm->mode == VM_MODE_PXXV48_4K_USER)
+			*pte |= PTE_USER_MASK;
 		if (current_level == target_level)
 			*pte |= PTE_LARGE_MASK | (paddr & PHYSICAL_PAGE_MASK);
 		else
@@ -185,7 +187,7 @@ void __virt_pg_map(struct kvm_vm *vm, uint64_t vaddr, uint64_t paddr, int level)
 	uint64_t *pml4e, *pdpe, *pde;
 	uint64_t *pte;
 
-	TEST_ASSERT(vm->mode == VM_MODE_PXXV48_4K,
+	TEST_ASSERT((vm->mode == VM_MODE_PXXV48_4K) || (vm->mode == VM_MODE_PXXV48_4K_USER),
 		    "Unknown or unsupported guest mode, mode: 0x%x", vm->mode);
 
 	TEST_ASSERT((vaddr % pg_size) == 0,
@@ -222,6 +224,8 @@ void __virt_pg_map(struct kvm_vm *vm, uint64_t vaddr, uint64_t paddr, int level)
 	TEST_ASSERT(!(*pte & PTE_PRESENT_MASK),
 		    "PTE already present for 4k page at vaddr: 0x%lx\n", vaddr);
 	*pte = PTE_PRESENT_MASK | PTE_WRITABLE_MASK | (paddr & PHYSICAL_PAGE_MASK);
+	if (vm->mode == VM_MODE_PXXV48_4K_USER)
+		*pte |= PTE_USER_MASK;
 }
 
 void virt_arch_pg_map(struct kvm_vm *vm, uint64_t vaddr, uint64_t paddr)
@@ -268,8 +272,8 @@ uint64_t *__vm_get_page_table_entry(struct kvm_vm *vm, uint64_t vaddr,
 	TEST_ASSERT(*level >= PG_LEVEL_NONE && *level < PG_LEVEL_NUM,
 		    "Invalid PG_LEVEL_* '%d'", *level);
 
-	TEST_ASSERT(vm->mode == VM_MODE_PXXV48_4K, "Attempt to use "
-		"unknown or unsupported guest mode, mode: 0x%x", vm->mode);
+	TEST_ASSERT((vm->mode == VM_MODE_PXXV48_4K) || (vm->mode == VM_MODE_PXXV48_4K_USER),
+		"Attempt to use unknown or unsupported guest mode, mode: 0x%x", vm->mode);
 	TEST_ASSERT(sparsebit_is_set(vm->vpages_valid,
 		(vaddr >> vm->page_shift)),
 		"Invalid virtual address, vaddr: 0x%lx",
@@ -536,6 +540,7 @@ static void vcpu_setup(struct kvm_vm *vm, struct kvm_vcpu *vcpu)
 
 	switch (vm->mode) {
 	case VM_MODE_PXXV48_4K:
+	case VM_MODE_PXXV48_4K_USER:
 		sregs.cr0 = X86_CR0_PE | X86_CR0_NE | X86_CR0_PG;
 		sregs.cr4 |= X86_CR4_PAE | X86_CR4_OSFXSR;
 		sregs.efer |= (EFER_LME | EFER_LMA | EFER_NX);
diff --git a/tools/testing/selftests/kvm/lib/x86_64/vmx.c b/tools/testing/selftests/kvm/lib/x86_64/vmx.c
index 59d97531c9b1..65147de6f9c0 100644
--- a/tools/testing/selftests/kvm/lib/x86_64/vmx.c
+++ b/tools/testing/selftests/kvm/lib/x86_64/vmx.c
@@ -403,8 +403,8 @@ void __nested_pg_map(struct vmx_pages *vmx, struct kvm_vm *vm,
 	struct eptPageTableEntry *pt = vmx->eptp_hva, *pte;
 	uint16_t index;
 
-	TEST_ASSERT(vm->mode == VM_MODE_PXXV48_4K, "Attempt to use "
-		    "unknown or unsupported guest mode, mode: 0x%x", vm->mode);
+	TEST_ASSERT((vm->mode == VM_MODE_PXXV48_4K) || (vm->mode == VM_MODE_PXXV48_4K_USER),
+		    "Attempt to use unknown or unsupported guest mode, mode: 0x%x", vm->mode);
 
 	TEST_ASSERT((nested_paddr >> 48) == 0,
 		    "Nested physical address 0x%lx requires 5-level paging",
-- 
2.42.0


^ permalink raw reply related	[flat|nested] 70+ messages in thread

* [PATCH v1 23/23] KVM: selftests: Add fred exception tests
  2023-11-08 18:29 [PATCH v1 00/23] Enable FRED with KVM VMX Xin Li
                   ` (21 preceding siblings ...)
  2023-11-08 18:30 ` [PATCH v1 22/23] KVM: selftests: Add a new VM guest mode to run user level code Xin Li
@ 2023-11-08 18:30 ` Xin Li
  22 siblings, 0 replies; 70+ messages in thread
From: Xin Li @ 2023-11-08 18:30 UTC (permalink / raw)
  To: kvm, linux-doc, linux-kernel, linux-hyperv, linux-kselftest
  Cc: seanjc, pbonzini, corbet, kys, haiyangz, wei.liu, decui, tglx,
	mingo, bp, dave.hansen, x86, hpa, vkuznets, peterz,
	ravi.v.shankar

From: Shan Kang <shan.kang@intel.com>

Add tests for FRED event data and VMX nested-exception.

FRED is designed to save a complete event context in its stack frame,
e.g., FRED saves the faulting linear address of a #PF into a 64-bit
event data field defined in the FRED stack frame.  As such, FRED VMX
adds event data handling during VMX transitions.

FRED introduces event stack levels to dispatch an event handler onto a
stack based on the current stack level and the stack levels defined in
the IA32_FRED_STKLVLS MSR for each exception vector.  VMX nested-exception
support ensures that a correct event stack level is chosen when a VM entry
injects a nested exception, which is regarded as having occurred in ring 0.
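
For reference, a worked example of the stack-level encoding the test
programs (the FRED_STKLVL macro below is the one the test itself defines):

  /* IA32_FRED_STKLVLS holds a 2-bit stack level per vector, at bits 2v+1:2v. */
  #define FRED_STKLVL(v, l)	(_AT(unsigned long, l) << (2 * (v)))

  /*
   * PF_VECTOR is 14, so directing #PF to stack level 1 sets bits 29:28:
   *   FRED_STKLVL(PF_VECTOR, 1) == 1ul << 28 == 0x10000000
   * while all other vectors stay on stack level 0.
   */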

Signed-off-by: Shan Kang <shan.kang@intel.com>
Co-developed-by: Xin Li <xin3.li@intel.com>
Signed-off-by: Xin Li <xin3.li@intel.com>
---
 tools/testing/selftests/kvm/Makefile          |   1 +
 .../selftests/kvm/include/x86_64/processor.h  |  29 ++
 .../testing/selftests/kvm/x86_64/fred_test.c  | 262 ++++++++++++++++++
 3 files changed, 292 insertions(+)
 create mode 100644 tools/testing/selftests/kvm/x86_64/fred_test.c

diff --git a/tools/testing/selftests/kvm/Makefile b/tools/testing/selftests/kvm/Makefile
index a5963ab9215b..06d16e59aa3c 100644
--- a/tools/testing/selftests/kvm/Makefile
+++ b/tools/testing/selftests/kvm/Makefile
@@ -76,6 +76,7 @@ TEST_GEN_PROGS_x86_64 += x86_64/get_msr_index_features
 TEST_GEN_PROGS_x86_64 += x86_64/exit_on_emulation_failure_test
 TEST_GEN_PROGS_x86_64 += x86_64/fix_hypercall_test
 TEST_GEN_PROGS_x86_64 += x86_64/hwcr_msr_test
+TEST_GEN_PROGS_x86_64 += x86_64/fred_test
 TEST_GEN_PROGS_x86_64 += x86_64/hyperv_clock
 TEST_GEN_PROGS_x86_64 += x86_64/hyperv_cpuid
 TEST_GEN_PROGS_x86_64 += x86_64/hyperv_evmcs
diff --git a/tools/testing/selftests/kvm/include/x86_64/processor.h b/tools/testing/selftests/kvm/include/x86_64/processor.h
index 165d21fd1577..9c26705aa320 100644
--- a/tools/testing/selftests/kvm/include/x86_64/processor.h
+++ b/tools/testing/selftests/kvm/include/x86_64/processor.h
@@ -1260,4 +1260,33 @@ void virt_map_level(struct kvm_vm *vm, uint64_t vaddr, uint64_t paddr,
 #define PFERR_GUEST_PAGE_MASK	BIT_ULL(PFERR_GUEST_PAGE_BIT)
 #define PFERR_IMPLICIT_ACCESS	BIT_ULL(PFERR_IMPLICIT_ACCESS_BIT)
 
+/*
+ * FRED related data structures and functions
+ */
+struct fred_stack {
+	u64 r15;
+	u64 r14;
+	u64 r13;
+	u64 r12;
+	u64 bp;
+	u64 bx;
+	u64 r11;
+	u64 r10;
+	u64 r9;
+	u64 r8;
+	u64 ax;
+	u64 cx;
+	u64 dx;
+	u64 si;
+	u64 di;
+	u64 error_code;
+	u64 ip;
+	u64 csx;
+	u64 flags;
+	u64 sp;
+	u64 ssx;
+	u64 event_data;
+	u64 reserved;
+};
+
 #endif /* SELFTEST_KVM_PROCESSOR_H */
diff --git a/tools/testing/selftests/kvm/x86_64/fred_test.c b/tools/testing/selftests/kvm/x86_64/fred_test.c
new file mode 100644
index 000000000000..ed117db017cd
--- /dev/null
+++ b/tools/testing/selftests/kvm/x86_64/fred_test.c
@@ -0,0 +1,262 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * FRED nested exception tests
+ *
+ * Copyright (C) 2023, Intel, Inc.
+ */
+#define _GNU_SOURCE /* for program_invocation_short_name */
+#include <fcntl.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/ioctl.h>
+#include <asm/msr-index.h>
+
+#include "kvm_util.h"
+#include "test_util.h"
+#include "guest_modes.h"
+#include "processor.h"
+
+#define FRED_STKLVL(v,l)		(_AT(unsigned long, l) << (2 * (v)))
+#define FRED_CONFIG_ENTRYPOINT(p)	_AT(unsigned long, (p))
+
+/* This address is already mapped in guest page table. */
+#define FRED_VALID_RSP			0x8000
+
+/*
+ * The following addresses are not yet mapped in either the EPT or the
+ * guest page tables at the beginning.  As a result, accessing any of
+ * them for the first time causes an EPT violation VM exit with an
+ * original guest #PF.
+ *
+ * Use these addresses as guest FRED RSP0 to generate nested #PFs to test
+ * if event data are properly virtualized.
+ */
+static unsigned long fred_invalid_rsp[4] = {
+	0x0,
+	0xf0000000,
+	0xe0000000,
+	0xd0000000,
+};
+
+extern char asm_user_wrmsr[];
+extern char asm_user_ud[];
+extern char asm_done_fault[];
+
+extern void asm_test_fault(int test);
+
+/*
+ * user level code for triggering faults.
+ */
+asm(".pushsection .text\n"
+    ".type asm_user_wrmsr, @function\n"
+    ".align 4096\n"
+    "asm_user_wrmsr:\n"
+    /* Trigger a #GP */
+    "wrmsr\n"
+
+    ".fill asm_user_ud - ., 1, 0xcc\n"
+
+    ".type asm_user_ud, @function\n"
+    ".org asm_user_wrmsr + 16\n"
+    "asm_user_ud:\n"
+    /* Trigger a #UD */
+    "ud2\n"
+
+    ".align 4096, 0xcc\n"
+    ".popsection");
+
+/* Send current stack level and #PF address */
+#define GUEST_SYNC_CSL_FA(__stage, __pf_address)		\
+	GUEST_SYNC_ARGS(__stage, __pf_address, 0, 0, 0)
+
+void fred_entry_from_user(struct fred_stack *stack)
+{
+	u32 current_stack_level = rdmsr(MSR_IA32_FRED_CONFIG) & 0x3;
+
+	GUEST_SYNC_CSL_FA(current_stack_level, stack->event_data);
+
+	/* Do NOT go back to user level, continue the next test instead */
+	stack->ssx = 0x18;
+	stack->csx = 0x10;
+	stack->ip = (u64)&asm_done_fault;
+}
+
+void fred_entry_from_kernel(struct fred_stack *stack)
+{
+	TEST_FAIL("kernel events not allowed in FRED tests.");
+}
+
+#define PUSH_REGS	\
+	"push %rdi\n"	\
+	"push %rsi\n"	\
+	"push %rdx\n"	\
+	"push %rcx\n"	\
+	"push %rax\n"	\
+	"push %r8\n"	\
+	"push %r9\n"	\
+	"push %r10\n"	\
+	"push %r11\n"	\
+	"push %rbx\n"	\
+	"push %rbp\n"	\
+	"push %r12\n"	\
+	"push %r13\n"	\
+	"push %r14\n"	\
+	"push %r15\n"
+
+#define POP_REGS	\
+	"pop %r15\n"	\
+	"pop %r14\n"	\
+	"pop %r13\n"	\
+	"pop %r12\n"	\
+	"pop %rbp\n"	\
+	"pop %rbx\n"	\
+	"pop %r11\n"	\
+	"pop %r10\n"	\
+	"pop %r9\n"	\
+	"pop %r8\n"	\
+	"pop %rax\n"	\
+	"pop %rcx\n"	\
+	"pop %rdx\n"	\
+	"pop %rsi\n"	\
+	"pop %rdi\n"
+
+/*
+ * FRED entry points.
+ */
+asm(".pushsection .text\n"
+    ".type asm_fred_entrypoint_user, @function\n"
+    ".align 4096\n"
+    "asm_fred_entrypoint_user:\n"
+    "endbr64\n"
+    PUSH_REGS
+    "movq %rsp, %rdi\n"
+    "call fred_entry_from_user\n"
+    POP_REGS
+    /* Do NOT go back to user level, continue the next test instead */
+    ".byte 0xf2,0x0f,0x01,0xca\n"	/* ERETS */
+
+    ".fill asm_fred_entrypoint_kernel - ., 1, 0xcc\n"
+
+    ".type asm_fred_entrypoint_kernel, @function\n"
+    ".org asm_fred_entrypoint_user + 256\n"
+    "asm_fred_entrypoint_kernel:\n"
+    "endbr64\n"
+    PUSH_REGS
+    "movq %rsp, %rdi\n"
+    "call fred_entry_from_kernel\n"
+    POP_REGS
+    ".byte 0xf2,0x0f,0x01,0xca\n"	/* ERETS */
+    ".align 4096, 0xcc\n"
+    ".popsection");
+
+extern char asm_fred_entrypoint_user[];
+
+/*
+ * Prepare a FRED stack frame for ERETU to run user level code, WRMSR or UD,
+ * which causes a #GP or #UD.  However, because FRED RSP0 is not yet mapped
+ * in the guest page table, the delivery of the #GP or #UD causes a nested #PF,
+ * which is then delivered on FRED RSPx (x is 1, 2 or 3, determined by MSR
+ * FRED_STKLVL(PF_VECTOR)).
+ *
+ * If FRED RSPx is also not yet mapped in the guest page table, a triple fault is
+ * generated.
+ */
+asm(".pushsection .text\n"
+    ".type asm_test_fault, @function\n"
+    ".align 4096\n"
+    "asm_test_fault:\n"
+    "endbr64\n"
+    "push %rbp\n"
+    "mov %rsp, %rbp\n"
+    "and $(~0x3f), %rsp\n"
+    "push $0\n"
+    "push $0\n"
+    "mov $0x2b, %rax\n"
+    "bts $57, %rax\n"
+    "push %rax\n"
+    /* The FRED user level test code does NOT need a stack. */
+    "push $0\n"
+    "pushf\n"
+    "mov $0x33, %rax\n"
+    "push %rax\n"
+    "cmp $0, %edi\n"
+    "jne 1f\n"
+    "lea asm_user_wrmsr(%rip), %rax\n"
+    "jmp 2f\n"
+    "1: lea asm_user_ud(%rip), %rax\n"
+    "2: push %rax\n"
+    "push $0\n"
+    /* ERETU to user level code to generate a fault immediately */
+    ".byte 0xf3,0x0f,0x01,0xca\n"
+    "asm_done_fault:\n"
+    "mov %rbp, %rsp\n"
+    "pop %rbp\n"
+    "ret\n"
+    ".align 4096, 0xcc\n"
+    ".popsection");
+
+static void guest_code(void)
+{
+	wrmsr(MSR_IA32_FRED_CONFIG,
+	      FRED_CONFIG_ENTRYPOINT(asm_fred_entrypoint_user));
+
+	wrmsr(MSR_IA32_FRED_RSP1, FRED_VALID_RSP);
+	wrmsr(MSR_IA32_FRED_RSP2, FRED_VALID_RSP);
+	wrmsr(MSR_IA32_FRED_RSP3, FRED_VALID_RSP);
+
+	/* Enable FRED */
+	set_cr4(get_cr4() | X86_CR4_FRED);
+
+	wrmsr(MSR_IA32_FRED_STKLVLS, FRED_STKLVL(PF_VECTOR, 1));
+	wrmsr(MSR_IA32_FRED_RSP0, fred_invalid_rsp[1]);
+	/* 0: wrmsr to generate #GP */
+	asm_test_fault(0);
+
+	wrmsr(MSR_IA32_FRED_STKLVLS, FRED_STKLVL(PF_VECTOR, 2));
+	wrmsr(MSR_IA32_FRED_RSP0, fred_invalid_rsp[2]);
+	/* 1: ud2 to generate #UD */
+	asm_test_fault(1);
+
+	wrmsr(MSR_IA32_FRED_STKLVLS, FRED_STKLVL(PF_VECTOR, 3));
+	wrmsr(MSR_IA32_FRED_RSP0, fred_invalid_rsp[3]);
+	/* 0: wrmsr to generate #GP */
+	asm_test_fault(0);
+
+	GUEST_DONE();
+}
+
+int main(int argc, char *argv[])
+{
+	struct kvm_vcpu *vcpu;
+	struct kvm_vm *vm;
+	struct ucall uc;
+	uint64_t expected_current_stack_level = 1;
+
+	TEST_REQUIRE(kvm_cpu_has(X86_FEATURE_FRED));
+
+	vm = __vm_create_with_vcpus(VM_MODE_PXXV48_4K_USER, 1, 0,
+				    guest_code, &vcpu);
+
+	while (true) {
+		uint64_t r;
+
+		vcpu_run(vcpu);
+
+		r = get_ucall(vcpu, &uc);
+
+		if (r == UCALL_DONE)
+			break;
+
+		if (r == UCALL_SYNC) {
+			TEST_ASSERT((uc.args[1] == expected_current_stack_level) &&
+				    (uc.args[2] == fred_invalid_rsp[expected_current_stack_level] - 1),
+				    "Incorrect stack level %lx and #PF address %lx\n",
+				    uc.args[1], uc.args[2]);
+			expected_current_stack_level++;
+		}
+	}
+
+	kvm_vm_free(vm);
+	return 0;
+}
-- 
2.42.0


^ permalink raw reply related	[flat|nested] 70+ messages in thread

* Re: [PATCH v1 15/23] KVM: nVMX: Add support for the secondary VM exit controls
  2023-11-08 18:29 ` [PATCH v1 15/23] KVM: nVMX: Add support for the secondary VM exit controls Xin Li
@ 2023-11-09  8:21   ` Jeremi Piotrowski
  2023-11-10  0:12     ` Li, Xin3
  2023-11-20 15:52   ` Vitaly Kuznetsov
  1 sibling, 1 reply; 70+ messages in thread
From: Jeremi Piotrowski @ 2023-11-09  8:21 UTC (permalink / raw)
  To: Xin Li, kvm, linux-doc, linux-kernel, linux-hyperv, linux-kselftest
  Cc: seanjc, pbonzini, corbet, kys, haiyangz, wei.liu, decui, tglx,
	mingo, bp, dave.hansen, x86, hpa, vkuznets, peterz,
	ravi.v.shankar

On 08/11/2023 19:29, Xin Li wrote:
> Enable the secondary VM exit controls to prepare for nested FRED.
> 
> Tested-by: Shan Kang <shan.kang@intel.com>
> Signed-off-by: Xin Li <xin3.li@intel.com>
> ---
>  Documentation/virt/kvm/x86/nested-vmx.rst |  1 +
>  arch/x86/include/asm/hyperv-tlfs.h        |  1 +
>  arch/x86/kvm/vmx/capabilities.h           |  1 +
>  arch/x86/kvm/vmx/hyperv.c                 | 18 +++++++++++++++++-
>  arch/x86/kvm/vmx/nested.c                 | 18 +++++++++++++++++-
>  arch/x86/kvm/vmx/vmcs12.c                 |  1 +
>  arch/x86/kvm/vmx/vmcs12.h                 |  2 ++
>  arch/x86/kvm/x86.h                        |  2 +-
>  8 files changed, 41 insertions(+), 3 deletions(-)
> 
> diff --git a/Documentation/virt/kvm/x86/nested-vmx.rst b/Documentation/virt/kvm/x86/nested-vmx.rst
> index ac2095d41f02..e64ef231f310 100644
> --- a/Documentation/virt/kvm/x86/nested-vmx.rst
> +++ b/Documentation/virt/kvm/x86/nested-vmx.rst
> @@ -217,6 +217,7 @@ struct shadow_vmcs is ever changed.
>  		u16 host_fs_selector;
>  		u16 host_gs_selector;
>  		u16 host_tr_selector;
> +		u64 secondary_vm_exit_controls;
>  	};
>  
>  
> diff --git a/arch/x86/include/asm/hyperv-tlfs.h b/arch/x86/include/asm/hyperv-tlfs.h
> index 2ff26f53cd62..299554708e37 100644
> --- a/arch/x86/include/asm/hyperv-tlfs.h
> +++ b/arch/x86/include/asm/hyperv-tlfs.h
> @@ -616,6 +616,7 @@ struct hv_enlightened_vmcs {
>  	u64 host_ssp;
>  	u64 host_ia32_int_ssp_table_addr;
>  	u64 padding64_6;
> +	u64 secondary_vm_exit_controls;
>  } __packed;
> >  #define HV_VMX_ENLIGHTENED_CLEAN_FIELD_NONE			0

Hi Xin Li,

The comment at the top of hyperv-tlfs.h says:
"This file contains definitions from Hyper-V Hypervisor Top-Level Functional Specification (TLFS)"

These struct definitions are shared with the hypervisor, so you can't just add fields to them.
Same comment for patch 16.

Would FRED work in nested virtualization if the L0 hypervisor does not support it (or doesn't know
about it)?

Thanks,
Jeremi

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v1 05/23] KVM: VMX: Initialize FRED VM entry/exit controls in vmcs_config
  2023-11-08 18:29 ` [PATCH v1 05/23] KVM: VMX: Initialize FRED VM entry/exit controls in vmcs_config Xin Li
@ 2023-11-09  8:53   ` Chao Gao
  2023-11-09 15:15     ` Sean Christopherson
  2023-11-13 17:18   ` Nikolay Borisov
  1 sibling, 1 reply; 70+ messages in thread
From: Chao Gao @ 2023-11-09  8:53 UTC (permalink / raw)
  To: Xin Li
  Cc: kvm, linux-doc, linux-kernel, linux-hyperv, linux-kselftest,
	seanjc, pbonzini, corbet, kys, haiyangz, wei.liu, decui, tglx,
	mingo, bp, dave.hansen, x86, hpa, vkuznets, peterz,
	ravi.v.shankar

On Wed, Nov 08, 2023 at 10:29:45AM -0800, Xin Li wrote:
>Setup the global vmcs_config for FRED:
>1) Add VM_ENTRY_LOAD_IA32_FRED to KVM_OPTIONAL_VMX_VM_ENTRY_CONTROLS to
>   have a FRED CPU load guest FRED MSRs from VMCS upon VM entry.
>2) Add SECONDARY_VM_EXIT_SAVE_IA32_FRED to
>   KVM_OPTIONAL_VMX_SECONDARY_VM_EXIT_CONTROLS to have a FRED CPU save
>   guest FRED MSRs to VMCS during VM exit.
>3) add SECONDARY_VM_EXIT_LOAD_IA32_FRED to
>   KVM_OPTIONAL_VMX_SECONDARY_VM_EXIT_CONTROLS to have a FRED CPU load
>   host FRED MSRs from VMCS during VM exit.
>
>Also add sanity checks to make sure FRED VM entry/exit controls can be
>set on a FRED CPU.
>
>Tested-by: Shan Kang <shan.kang@intel.com>
>Signed-off-by: Xin Li <xin3.li@intel.com>
>---
> arch/x86/include/asm/vmx.h |  3 +++
> arch/x86/kvm/vmx/vmx.c     | 19 ++++++++++++++++++-
> arch/x86/kvm/vmx/vmx.h     |  7 +++++--
> 3 files changed, 26 insertions(+), 3 deletions(-)
>
>diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
>index 4d4177ec802c..41796a733bc9 100644
>--- a/arch/x86/include/asm/vmx.h
>+++ b/arch/x86/include/asm/vmx.h
>@@ -106,6 +106,8 @@
> #define VM_EXIT_PT_CONCEAL_PIP			0x01000000
> #define VM_EXIT_CLEAR_IA32_RTIT_CTL		0x02000000
> #define VM_EXIT_ACTIVATE_SECONDARY_CONTROLS	0x80000000
>+#define SECONDARY_VM_EXIT_SAVE_IA32_FRED	0x00000001
>+#define SECONDARY_VM_EXIT_LOAD_IA32_FRED	0x00000002
> 
> #define VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR	0x00036dff
> 
>@@ -119,6 +121,7 @@
> #define VM_ENTRY_LOAD_BNDCFGS                   0x00010000
> #define VM_ENTRY_PT_CONCEAL_PIP			0x00020000
> #define VM_ENTRY_LOAD_IA32_RTIT_CTL		0x00040000
>+#define VM_ENTRY_LOAD_IA32_FRED			0x00800000
> 
> #define VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR	0x000011ff
> 
>diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
>index df769207cbe0..9186f41974ab 100644
>--- a/arch/x86/kvm/vmx/vmx.c
>+++ b/arch/x86/kvm/vmx/vmx.c
>@@ -2694,10 +2694,27 @@ static int setup_vmcs_config(struct vmcs_config *vmcs_conf,
> 		_vmexit_control &= ~x_ctrl;
> 	}
> 
>-	if (_vmexit_control & VM_EXIT_ACTIVATE_SECONDARY_CONTROLS)
>+	if (_vmexit_control & VM_EXIT_ACTIVATE_SECONDARY_CONTROLS) {
> 		_secondary_vmexit_control =
> 			adjust_vmx_controls64(KVM_OPTIONAL_VMX_SECONDARY_VM_EXIT_CONTROLS,
> 					      MSR_IA32_VMX_EXIT_CTLS2);
>+		if (cpu_feature_enabled(X86_FEATURE_FRED) &&
>+		    !(_secondary_vmexit_control & SECONDARY_VM_EXIT_SAVE_IA32_FRED &&
>+		      _secondary_vmexit_control & SECONDARY_VM_EXIT_LOAD_IA32_FRED)) {
>+			pr_warn_once("FRED enabled but no VMX VM-Exit {SAVE,LOAD}_IA32_FRED controls: %llx\n",
>+				     _secondary_vmexit_control);

if there is no VM_EXIT_ACTIVATE_SECONDARY_CONTROLS, shouldn't we also emit this
warning?

>+			if (error_on_inconsistent_vmcs_config)
>+				return -EIO;
>+		}
>+	}
>+
>+	if (cpu_feature_enabled(X86_FEATURE_FRED) &&
>+	    !(_vmentry_control & VM_ENTRY_LOAD_IA32_FRED)) {
>+		pr_warn_once("FRED enabled but no VMX VM-Entry LOAD_IA32_FRED control: %x\n",
>+			     _vmentry_control);

Can we just hide FRED from guests like what KVM does for other features which
have similar dependencies? see vmx_set_cpu_caps().
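
Roughly something like the below in vmx_set_cpu_caps() (sketch only;
cpu_has_vmx_fred_controls() is a made-up helper standing in for a check of the
new FRED entry/exit control bits in vmcs_config, it doesn't exist in this series):

	/* Sketch: hide FRED when the VMX controls needed to virtualize it are absent. */
	if (!cpu_has_vmx_fred_controls())
		kvm_cpu_cap_clear(X86_FEATURE_FRED);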

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v1 06/23] KVM: VMX: Defer enabling FRED MSRs save/load until after set CPUID
  2023-11-08 18:29 ` [PATCH v1 06/23] KVM: VMX: Defer enabling FRED MSRs save/load until after set CPUID Xin Li
@ 2023-11-09  9:15   ` Chao Gao
  2023-11-09 23:50     ` Li, Xin3
  0 siblings, 1 reply; 70+ messages in thread
From: Chao Gao @ 2023-11-09  9:15 UTC (permalink / raw)
  To: Xin Li
  Cc: kvm, linux-doc, linux-kernel, linux-hyperv, linux-kselftest,
	seanjc, pbonzini, corbet, kys, haiyangz, wei.liu, decui, tglx,
	mingo, bp, dave.hansen, x86, hpa, vkuznets, peterz,
	ravi.v.shankar

On Wed, Nov 08, 2023 at 10:29:46AM -0800, Xin Li wrote:
>Clear FRED VM entry/exit controls when initializing a vCPU, and set
>these controls only if FRED is enumerated after set CPUID.
>
>FRED VM entry/exit controls need to be set to establish context
>sufficient to support FRED event delivery immediately after VM entry
>and exit.  However it is not required to save/load FRED MSRs for
>a non-FRED guest, which aren't supposed to access FRED MSRs.
>
>A non-FRED guest should get #GP upon accessing FRED MSRs, otherwise
>it corrupts host FRED MSRs.
>
>Tested-by: Shan Kang <shan.kang@intel.com>
>Signed-off-by: Xin Li <xin3.li@intel.com>
>---
> arch/x86/kvm/vmx/vmx.c | 34 +++++++++++++++++++++++++++++++++-
> 1 file changed, 33 insertions(+), 1 deletion(-)
>
>diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
>index 9186f41974ab..5d4786812664 100644
>--- a/arch/x86/kvm/vmx/vmx.c
>+++ b/arch/x86/kvm/vmx/vmx.c
>@@ -4423,6 +4423,9 @@ static u32 vmx_vmentry_ctrl(void)
> 	if (cpu_has_perf_global_ctrl_bug())
> 		vmentry_ctrl &= ~VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL;
> 
>+	/* Whether to load guest FRED MSRs is deferred until after set CPUID */
>+	vmentry_ctrl &= ~VM_ENTRY_LOAD_IA32_FRED;
>+
> 	return vmentry_ctrl;
> }
> 
>@@ -4458,7 +4461,13 @@ static u32 vmx_vmexit_ctrl(void)
> 
> static u64 vmx_secondary_vmexit_ctrl(void)
> {
>-	return vmcs_config.secondary_vmexit_ctrl;
>+	u64 secondary_vmexit_ctrl = vmcs_config.secondary_vmexit_ctrl;
>+
>+	/* Whether to save/load FRED MSRs is deferred until after set CPUID */
>+	secondary_vmexit_ctrl &= ~(SECONDARY_VM_EXIT_SAVE_IA32_FRED |
>+				   SECONDARY_VM_EXIT_LOAD_IA32_FRED);
>+
>+	return secondary_vmexit_ctrl;
> }
> 
> static void vmx_refresh_apicv_exec_ctrl(struct kvm_vcpu *vcpu)
>@@ -7785,10 +7794,33 @@ static void update_intel_pt_cfg(struct kvm_vcpu *vcpu)
> 		vmx->pt_desc.ctl_bitmask &= ~(0xfULL << (32 + i * 4));
> }
> 
>+static void vmx_vcpu_config_fred_after_set_cpuid(struct kvm_vcpu *vcpu)
>+{
>+	struct vcpu_vmx *vmx = to_vmx(vcpu);
>+
>+	if (!cpu_feature_enabled(X86_FEATURE_FRED) ||
>+	    !guest_cpuid_has(vcpu, X86_FEATURE_FRED))
>+		return;
>+
>+	/* Enable loading guest FRED MSRs from VMCS */
>+	vm_entry_controls_setbit(vmx, VM_ENTRY_LOAD_IA32_FRED);
>+
>+	/*
>+	 * Enable saving guest FRED MSRs into VMCS and loading host FRED MSRs
>+	 * from VMCS.
>+	 */
>+	vm_exit_controls_setbit(vmx, VM_EXIT_ACTIVATE_SECONDARY_CONTROLS);
>+	secondary_vm_exit_controls_setbit(vmx,
>+					  SECONDARY_VM_EXIT_SAVE_IA32_FRED |
>+					  SECONDARY_VM_EXIT_LOAD_IA32_FRED);

all above vmcs controls need to be cleared if guest doesn't enumerate FRED, see

https://lore.kernel.org/all/ZJYzPn7ipYfO0fLZ@google.com/

Clearing VM_EXIT_ACTIVATE_SECONDARY_CONTROLS may be problematic when
new bits are added to secondary vmcs controls. Why not keep
VM_EXIT_ACTIVATE_SECONDARY_CONTROLS always on if it is supported? or you see
any perf impact?
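
E.g., a sketch of handling both directions, assuming the clearbit counterparts
generated by the same controls-shadow macros as the setbit helpers used above:

	/* Sketch: keep the FRED entry/exit controls in sync with guest CPUID. */
	if (guest_cpuid_has(vcpu, X86_FEATURE_FRED)) {
		vm_entry_controls_setbit(vmx, VM_ENTRY_LOAD_IA32_FRED);
		secondary_vm_exit_controls_setbit(vmx,
						  SECONDARY_VM_EXIT_SAVE_IA32_FRED |
						  SECONDARY_VM_EXIT_LOAD_IA32_FRED);
	} else {
		vm_entry_controls_clearbit(vmx, VM_ENTRY_LOAD_IA32_FRED);
		secondary_vm_exit_controls_clearbit(vmx,
						    SECONDARY_VM_EXIT_SAVE_IA32_FRED |
						    SECONDARY_VM_EXIT_LOAD_IA32_FRED);
	}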

>+}
>+
> static void vmx_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
> {
> 	struct vcpu_vmx *vmx = to_vmx(vcpu);
> 
>+	vmx_vcpu_config_fred_after_set_cpuid(vcpu);
>+
> 	/*
> 	 * XSAVES is effectively enabled if and only if XSAVE is also exposed
> 	 * to the guest.  XSAVES depends on CR4.OSXSAVE, and CR4.OSXSAVE can be
>-- 
>2.42.0
>
>

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v1 07/23] KVM: VMX: Disable intercepting FRED MSRs
  2023-11-08 18:29 ` [PATCH v1 07/23] KVM: VMX: Disable intercepting FRED MSRs Xin Li
@ 2023-11-09  9:21   ` Chao Gao
  0 siblings, 0 replies; 70+ messages in thread
From: Chao Gao @ 2023-11-09  9:21 UTC (permalink / raw)
  To: Xin Li
  Cc: kvm, linux-doc, linux-kernel, linux-hyperv, linux-kselftest,
	seanjc, pbonzini, corbet, kys, haiyangz, wei.liu, decui, tglx,
	mingo, bp, dave.hansen, x86, hpa, vkuznets, peterz,
	ravi.v.shankar

On Wed, Nov 08, 2023 at 10:29:47AM -0800, Xin Li wrote:
>Add FRED MSRs to the valid passthrough MSR list and disable intercepting
>FRED MSRs only if FRED is enumerated after set CPUID.
>
>Tested-by: Shan Kang <shan.kang@intel.com>
>Signed-off-by: Xin Li <xin3.li@intel.com>
>---
> arch/x86/kvm/vmx/vmx.c | 13 +++++++++++++
> 1 file changed, 13 insertions(+)
>
>diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
>index 5d4786812664..327e052d90c1 100644
>--- a/arch/x86/kvm/vmx/vmx.c
>+++ b/arch/x86/kvm/vmx/vmx.c
>@@ -700,6 +700,9 @@ static bool is_valid_passthrough_msr(u32 msr)
> 	case MSR_LBR_CORE_TO ... MSR_LBR_CORE_TO + 8:
> 		/* LBR MSRs. These are handled in vmx_update_intercept_for_lbr_msrs() */
> 		return true;
>+	case MSR_IA32_FRED_RSP0 ... MSR_IA32_FRED_CONFIG:
>+		/* FRED MSRs should be passthrough to FRED guests only */
>+		return true;
> 	}
> 
> 	r = possible_passthrough_msr_slot(msr) != -ENOENT;
>@@ -7813,6 +7816,16 @@ static void vmx_vcpu_config_fred_after_set_cpuid(struct kvm_vcpu *vcpu)
> 	secondary_vm_exit_controls_setbit(vmx,
> 					  SECONDARY_VM_EXIT_SAVE_IA32_FRED |
> 					  SECONDARY_VM_EXIT_LOAD_IA32_FRED);
>+
>+	vmx_disable_intercept_for_msr(vcpu, MSR_IA32_FRED_RSP0, MSR_TYPE_RW);
>+	vmx_disable_intercept_for_msr(vcpu, MSR_IA32_FRED_RSP1, MSR_TYPE_RW);
>+	vmx_disable_intercept_for_msr(vcpu, MSR_IA32_FRED_RSP2, MSR_TYPE_RW);
>+	vmx_disable_intercept_for_msr(vcpu, MSR_IA32_FRED_RSP3, MSR_TYPE_RW);
>+	vmx_disable_intercept_for_msr(vcpu, MSR_IA32_FRED_STKLVLS, MSR_TYPE_RW);
>+	vmx_disable_intercept_for_msr(vcpu, MSR_IA32_FRED_SSP1, MSR_TYPE_RW);
>+	vmx_disable_intercept_for_msr(vcpu, MSR_IA32_FRED_SSP2, MSR_TYPE_RW);
>+	vmx_disable_intercept_for_msr(vcpu, MSR_IA32_FRED_SSP3, MSR_TYPE_RW);
>+	vmx_disable_intercept_for_msr(vcpu, MSR_IA32_FRED_CONFIG, MSR_TYPE_RW);

this has the same issue as patch 6.

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v1 05/23] KVM: VMX: Initialize FRED VM entry/exit controls in vmcs_config
  2023-11-09  8:53   ` Chao Gao
@ 2023-11-09 15:15     ` Sean Christopherson
  2023-11-10  0:04       ` Li, Xin3
  0 siblings, 1 reply; 70+ messages in thread
From: Sean Christopherson @ 2023-11-09 15:15 UTC (permalink / raw)
  To: Chao Gao
  Cc: Xin Li, kvm, linux-doc, linux-kernel, linux-hyperv,
	linux-kselftest, pbonzini, corbet, kys, haiyangz, wei.liu, decui,
	tglx, mingo, bp, dave.hansen, x86, hpa, vkuznets, peterz,
	ravi.v.shankar

On Thu, Nov 09, 2023, Chao Gao wrote:
> On Wed, Nov 08, 2023 at 10:29:45AM -0800, Xin Li wrote:
> >Setup the global vmcs_config for FRED:
> >1) Add VM_ENTRY_LOAD_IA32_FRED to KVM_OPTIONAL_VMX_VM_ENTRY_CONTROLS to
> >   have a FRED CPU load guest FRED MSRs from VMCS upon VM entry.
> >2) Add SECONDARY_VM_EXIT_SAVE_IA32_FRED to
> >   KVM_OPTIONAL_VMX_SECONDARY_VM_EXIT_CONTROLS to have a FRED CPU save
> >   guest FRED MSRs to VMCS during VM exit.
> >3) add SECONDARY_VM_EXIT_LOAD_IA32_FRED to
> >   KVM_OPTIONAL_VMX_SECONDARY_VM_EXIT_CONTROLS to have a FRED CPU load
> >   host FRED MSRs from VMCS during VM exit.
> >
> >Also add sanity checks to make sure FRED VM entry/exit controls can be
> >set on a FRED CPU.
> >
> >Tested-by: Shan Kang <shan.kang@intel.com>
> >Signed-off-by: Xin Li <xin3.li@intel.com>
> >---
> > arch/x86/include/asm/vmx.h |  3 +++
> > arch/x86/kvm/vmx/vmx.c     | 19 ++++++++++++++++++-
> > arch/x86/kvm/vmx/vmx.h     |  7 +++++--
> > 3 files changed, 26 insertions(+), 3 deletions(-)
> >
> >diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
> >index 4d4177ec802c..41796a733bc9 100644
> >--- a/arch/x86/include/asm/vmx.h
> >+++ b/arch/x86/include/asm/vmx.h
> >@@ -106,6 +106,8 @@
> > #define VM_EXIT_PT_CONCEAL_PIP			0x01000000
> > #define VM_EXIT_CLEAR_IA32_RTIT_CTL		0x02000000
> > #define VM_EXIT_ACTIVATE_SECONDARY_CONTROLS	0x80000000
> >+#define SECONDARY_VM_EXIT_SAVE_IA32_FRED	0x00000001
> >+#define SECONDARY_VM_EXIT_LOAD_IA32_FRED	0x00000002
> > 
> > #define VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR	0x00036dff
> > 
> >@@ -119,6 +121,7 @@
> > #define VM_ENTRY_LOAD_BNDCFGS                   0x00010000
> > #define VM_ENTRY_PT_CONCEAL_PIP			0x00020000
> > #define VM_ENTRY_LOAD_IA32_RTIT_CTL		0x00040000
> >+#define VM_ENTRY_LOAD_IA32_FRED			0x00800000
> > 
> > #define VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR	0x000011ff
> > 
> >diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> >index df769207cbe0..9186f41974ab 100644
> >--- a/arch/x86/kvm/vmx/vmx.c
> >+++ b/arch/x86/kvm/vmx/vmx.c
> >@@ -2694,10 +2694,27 @@ static int setup_vmcs_config(struct vmcs_config *vmcs_conf,
> > 		_vmexit_control &= ~x_ctrl;
> > 	}
> > 
> >-	if (_vmexit_control & VM_EXIT_ACTIVATE_SECONDARY_CONTROLS)
> >+	if (_vmexit_control & VM_EXIT_ACTIVATE_SECONDARY_CONTROLS) {
> > 		_secondary_vmexit_control =
> > 			adjust_vmx_controls64(KVM_OPTIONAL_VMX_SECONDARY_VM_EXIT_CONTROLS,
> > 					      MSR_IA32_VMX_EXIT_CTLS2);
> >+		if (cpu_feature_enabled(X86_FEATURE_FRED) &&
> >+		    !(_secondary_vmexit_control & SECONDARY_VM_EXIT_SAVE_IA32_FRED &&
> >+		      _secondary_vmexit_control & SECONDARY_VM_EXIT_LOAD_IA32_FRED)) {
> >+			pr_warn_once("FRED enabled but no VMX VM-Exit {SAVE,LOAD}_IA32_FRED controls: %llx\n",
> >+				     _secondary_vmexit_control);
> 
> if there is no VM_EXIT_ACTIVATE_SECONDARY_CONTROLS, shouldn't we also emit this
> warning?
> 
> >+			if (error_on_inconsistent_vmcs_config)
> >+				return -EIO;
> >+		}
> >+	}
> >+
> >+	if (cpu_feature_enabled(X86_FEATURE_FRED) &&
> >+	    !(_vmentry_control & VM_ENTRY_LOAD_IA32_FRED)) {
> >+		pr_warn_once("FRED enabled but no VMX VM-Entry LOAD_IA32_FRED control: %x\n",
> >+			     _vmentry_control);
> 
> Can we just hide FRED from guests like what KVM does for other features which
> have similar dependencies? see vmx_set_cpu_caps().

Both of these warnings should simply be dropped.  The error_on_inconsistent_vmcs_config
stuff is for inconsistencies within the allowed VMCS fields.  Having a feature
that is supported in bare metal but not virtualized is perfectly legal, if
uncommon.

What *is* needed is for KVM to refuse to virtualize FRED if the entry/exit controls
aren't consistent.  E.g. if at least one control is present, and at least one
control is missing.   I.e. KVM needs a version of vmcs_entry_exit_pairs that can
deal with SECONDARY_VM_EXIT controls.  I'll circle back to this when I give the
series a proper review, which is going to be 3+ weeks. 
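
For the record, one possible shape for that, modeled on the existing
vmcs_entry_exit_pairs loop in setup_vmcs_config() (illustrative sketch only;
it assumes it runs after both _vmentry_control and _secondary_vmexit_control
have been computed):

	/* Sketch: drop both halves of a pair if only one half is supported. */
	struct {
		u32 entry_control;
		u64 exit2_control;
	} const entry_exit2_pairs[] = {
		{ VM_ENTRY_LOAD_IA32_FRED,
		  SECONDARY_VM_EXIT_SAVE_IA32_FRED | SECONDARY_VM_EXIT_LOAD_IA32_FRED },
	};
	int i;

	for (i = 0; i < ARRAY_SIZE(entry_exit2_pairs); i++) {
		u32 n_ctrl = entry_exit2_pairs[i].entry_control;
		u64 x_ctrl = entry_exit2_pairs[i].exit2_control;

		if ((_vmentry_control & n_ctrl) == n_ctrl &&
		    (_secondary_vmexit_control & x_ctrl) == x_ctrl)
			continue;
		if (!(_vmentry_control & n_ctrl) && !(_secondary_vmexit_control & x_ctrl))
			continue;

		pr_warn_once("Inconsistent VM-Entry/secondary VM-Exit pair, entry = %x, exit2 = %llx\n",
			     _vmentry_control & n_ctrl,
			     _secondary_vmexit_control & x_ctrl);

		_vmentry_control &= ~n_ctrl;
		_secondary_vmexit_control &= ~x_ctrl;
	}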

^ permalink raw reply	[flat|nested] 70+ messages in thread

* RE: [PATCH v1 06/23] KVM: VMX: Defer enabling FRED MSRs save/load until after set CPUID
  2023-11-09  9:15   ` Chao Gao
@ 2023-11-09 23:50     ` Li, Xin3
  2023-11-10  0:18       ` Sean Christopherson
  0 siblings, 1 reply; 70+ messages in thread
From: Li, Xin3 @ 2023-11-09 23:50 UTC (permalink / raw)
  To: Gao, Chao
  Cc: kvm, linux-doc, linux-kernel, linux-hyperv, linux-kselftest,
	Christopherson,,
	Sean, pbonzini, corbet, kys, haiyangz, wei.liu, Cui, Dexuan,
	tglx, mingo, bp, dave.hansen, x86, hpa, vkuznets, peterz,
	Shankar, Ravi V

> >+static void vmx_vcpu_config_fred_after_set_cpuid(struct kvm_vcpu *vcpu)
> >+{
> >+	struct vcpu_vmx *vmx = to_vmx(vcpu);
> >+
> >+	if (!cpu_feature_enabled(X86_FEATURE_FRED) ||
> >+	    !guest_cpuid_has(vcpu, X86_FEATURE_FRED))
> >+		return;
> >+
> >+	/* Enable loading guest FRED MSRs from VMCS */
> >+	vm_entry_controls_setbit(vmx, VM_ENTRY_LOAD_IA32_FRED);
> >+
> >+	/*
> >+	 * Enable saving guest FRED MSRs into VMCS and loading host FRED MSRs
> >+	 * from VMCS.
> >+	 */
> >+	vm_exit_controls_setbit(vmx,
> VM_EXIT_ACTIVATE_SECONDARY_CONTROLS);
> >+	secondary_vm_exit_controls_setbit(vmx,
> >+					  SECONDARY_VM_EXIT_SAVE_IA32_FRED
> |
> >+
> SECONDARY_VM_EXIT_LOAD_IA32_FRED);
> 
> all above vmcs controls need to be cleared if guest doesn't enumerate FRED, see
> 
> https://lore.kernel.org/all/ZJYzPn7ipYfO0fLZ@google.com/

Good point, the user space could set cpuid multiple times...
 
> Clearing VM_EXIT_ACTIVATE_SECONDARY_CONTROLS may be problematic when
> new bits are added to secondary vmcs controls. Why not keep
> VM_EXIT_ACTIVATE_SECONDARY_CONTROLS always on if it is supported? or you
> see any perf impact?

I think about it the other way: why keep hw loading it on every vmentry
even if it's not used by a guest?

Different CPUs may implement it in different ways, which we can't assume.

Other features needing it should set it separately, say with a refcount.


^ permalink raw reply	[flat|nested] 70+ messages in thread

* RE: [PATCH v1 05/23] KVM: VMX: Initialize FRED VM entry/exit controls in vmcs_config
  2023-11-09 15:15     ` Sean Christopherson
@ 2023-11-10  0:04       ` Li, Xin3
  2023-11-10 15:01         ` Sean Christopherson
  0 siblings, 1 reply; 70+ messages in thread
From: Li, Xin3 @ 2023-11-10  0:04 UTC (permalink / raw)
  To: Sean Christopherson, Gao, Chao
  Cc: kvm, linux-doc, linux-kernel, linux-hyperv, linux-kselftest,
	pbonzini, corbet, kys, haiyangz, wei.liu, Cui, Dexuan, tglx,
	mingo, bp, dave.hansen, x86, hpa, vkuznets, peterz, Shankar,
	Ravi V

> > >+	if (cpu_feature_enabled(X86_FEATURE_FRED) &&
> > >+	    !(_vmentry_control & VM_ENTRY_LOAD_IA32_FRED)) {
> > >+		pr_warn_once("FRED enabled but no VMX VM-Entry
> LOAD_IA32_FRED control: %x\n",
> > >+			     _vmentry_control);
> >
> > Can we just hide FRED from guests like what KVM does for other
> > features which have similar dependencies? see vmx_set_cpu_caps().
> 
> Both of these warnings should simply be dropped.  The
> error_on_inconsistent_vmcs_config stuff is for inconsistencies within the allowed
> VMCS fields.  Having a feature that is supported in bare metal but not virtualized
> is perfectly legal, if uncommon.

I deliberately keep it, at least for now, because these checks are helpful
during the development of nVMX FRED.  It will be helpful for other VMMs
being developed/tested on KVM.

> What *is* needed is for KVM to refuse to virtualize FRED if the entry/exit controls
> aren't consistent.  E.g. if at least one control is present, and at least one
> control is missing.   I.e. KVM needs a version of vmcs_entry_exit_pairs that can
> deal with SECONDAY_VM_EXIT controls.

I agree there are better ways.  But maybe after or before VMX FRED.

>  I'll circle back to this when I give the
> series a proper review, which is going to be 3+ weeks.

The traffic on the KVM mailing list has been surprisingly high recently, so
that is totally expected.

^ permalink raw reply	[flat|nested] 70+ messages in thread

* RE: [PATCH v1 15/23] KVM: nVMX: Add support for the secondary VM exit controls
  2023-11-09  8:21   ` Jeremi Piotrowski
@ 2023-11-10  0:12     ` Li, Xin3
  0 siblings, 0 replies; 70+ messages in thread
From: Li, Xin3 @ 2023-11-10  0:12 UTC (permalink / raw)
  To: Jeremi Piotrowski, kvm, linux-doc, linux-kernel, linux-hyperv,
	linux-kselftest
  Cc: Christopherson,,
	Sean, pbonzini, corbet, kys, haiyangz, wei.liu, Cui, Dexuan,
	tglx, mingo, bp, dave.hansen, x86, hpa, vkuznets, peterz,
	Shankar, Ravi V

> > diff --git a/arch/x86/include/asm/hyperv-tlfs.h
> > b/arch/x86/include/asm/hyperv-tlfs.h
> > index 2ff26f53cd62..299554708e37 100644
> > --- a/arch/x86/include/asm/hyperv-tlfs.h
> > +++ b/arch/x86/include/asm/hyperv-tlfs.h
> > @@ -616,6 +616,7 @@ struct hv_enlightened_vmcs {
> >  	u64 host_ssp;
> >  	u64 host_ia32_int_ssp_table_addr;
> >  	u64 padding64_6;
> > +	u64 secondary_vm_exit_controls;
> >  } __packed;
> > 
> >  #define HV_VMX_ENLIGHTENED_CLEAN_FIELD_NONE			0
> 
> Hi Xin Li,
> 
> The comment at the top of hyperv-tlfs.h says:
> "This file contains definitions from Hyper-V Hypervisor Top-Level Functional
> Specification (TLFS)"
> 
> These struct definitions are shared with the hypervisor, so you can't just add fields
> to it.
> Same comment for patch 16.

I tried not to touch any hyperv stuff but hit some problems.

> 
> Would FRED work in nested virtualization if the L0 hypervisor does not support it
> (or doesn't know about it)?

I don't think so, AFAICT.  But I could be wrong, say if a VMM implements FRED
completely in software.  Otherwise L0 needs to add code to have the hardware
switch FRED context between host and guest, which can't be delayed.

Thanks a lot for your quick review!

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v1 06/23] KVM: VMX: Defer enabling FRED MSRs save/load until after set CPUID
  2023-11-09 23:50     ` Li, Xin3
@ 2023-11-10  0:18       ` Sean Christopherson
  2023-11-14  2:50         ` Li, Xin3
  0 siblings, 1 reply; 70+ messages in thread
From: Sean Christopherson @ 2023-11-10  0:18 UTC (permalink / raw)
  To: Xin3 Li
  Cc: Chao Gao, kvm, linux-doc, linux-kernel, linux-hyperv,
	linux-kselftest, pbonzini, corbet, kys, haiyangz, wei.liu,
	Dexuan Cui, tglx, mingo, bp, dave.hansen, x86, hpa, vkuznets,
	peterz, Ravi V Shankar

On Thu, Nov 09, 2023, Xin3 Li wrote:
> > >+static void vmx_vcpu_config_fred_after_set_cpuid(struct kvm_vcpu *vcpu)
> > >+{
> > >+	struct vcpu_vmx *vmx = to_vmx(vcpu);
> > >+
> > >+	if (!cpu_feature_enabled(X86_FEATURE_FRED) ||
> > >+	    !guest_cpuid_has(vcpu, X86_FEATURE_FRED))
> > >+		return;
> > >+
> > >+	/* Enable loading guest FRED MSRs from VMCS */
> > >+	vm_entry_controls_setbit(vmx, VM_ENTRY_LOAD_IA32_FRED);
> > >+
> > >+	/*
> > >+	 * Enable saving guest FRED MSRs into VMCS and loading host FRED MSRs
> > >+	 * from VMCS.
> > >+	 */
> > >+	vm_exit_controls_setbit(vmx,
> > VM_EXIT_ACTIVATE_SECONDARY_CONTROLS);
> > >+	secondary_vm_exit_controls_setbit(vmx,
> > >+					  SECONDARY_VM_EXIT_SAVE_IA32_FRED
> > |
> > >+
> > SECONDARY_VM_EXIT_LOAD_IA32_FRED);
> > 
> > all above vmcs controls need to be cleared if guest doesn't enumerate FRED, see
> > 
> > https://lore.kernel.org/all/ZJYzPn7ipYfO0fLZ@google.com/
> 
> Good point, the user space could set cpuid multiple times...
>  
> > Clearing VM_EXIT_ACTIVATE_SECONDARY_CONTROLS may be problematic when
> > new bits are added to secondary vmcs controls. Why not keep
> > VM_EXIT_ACTIVATE_SECONDARY_CONTROLS always on if it is supported? or you
> > see any perf impact?
> 
> I think about it the other way: why keep hw loading it on every vmentry
> even if it's not used by a guest?

Oh, yeesh, this is clearing the activation control.  Yeah, NAK to that, just
leave it set.  If it weren't for the fact that there is apparently a metric ton
of FRED state (I thought the whole point of FRED was to kill off legacy crap like
CPL1 and CPL2 stacks?) _and_ that KVM needs to toggle MSR intercepts, I'd probably
push back on toggling even the FRED controls.

> Different CPUs may implement it in different ways, which we can't assume.

Implement what in a different way?  The VMCS fields and FRED are architectural.
The internal layout of the VMCS is uarch specific, but the encodings and semantics
absolutely cannot change without breaking software.  And if Intel does something
asinine like make a control active-low then we have far, far bigger problems.

> Other features needing it should set it separately, say with a refcount.

Absolutely not.  Unless Intel screwed up the implementation, the cost of leaving
VM_EXIT_ACTIVATE_SECONDARY_CONTROLS set when it's supported shouldn't even be
measurable.

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v1 05/23] KVM: VMX: Initialize FRED VM entry/exit controls in vmcs_config
  2023-11-10  0:04       ` Li, Xin3
@ 2023-11-10 15:01         ` Sean Christopherson
  2023-11-14  4:05           ` Li, Xin3
  0 siblings, 1 reply; 70+ messages in thread
From: Sean Christopherson @ 2023-11-10 15:01 UTC (permalink / raw)
  To: Xin3 Li
  Cc: Chao Gao, kvm, linux-doc, linux-kernel, linux-hyperv,
	linux-kselftest, pbonzini, corbet, kys, haiyangz, wei.liu,
	Dexuan Cui, tglx, mingo, bp, dave.hansen, x86, hpa, vkuznets,
	peterz, Ravi V Shankar

On Fri, Nov 10, 2023, Xin3 Li wrote:
> > > >+	if (cpu_feature_enabled(X86_FEATURE_FRED) &&
> > > >+	    !(_vmentry_control & VM_ENTRY_LOAD_IA32_FRED)) {
> > > >+		pr_warn_once("FRED enabled but no VMX VM-Entry
> > LOAD_IA32_FRED control: %x\n",
> > > >+			     _vmentry_control);
> > >
> > > Can we just hide FRED from guests like what KVM does for other
> > > features which have similar dependencies? see vmx_set_cpu_caps().
> > 
> > Both of these warnings should simply be dropped.  The
> > error_on_inconsistent_vmcs_config stuff is for inconsistencies within the allowed
> > VMCS fields.  Having a feature that is supported in bare metal but not virtualized
> > is perfectly legal, if uncommon.
> 
> I deliberately keep it, at least for now, because these checks are helpful
> during the development of nVMX FRED.  It will be helpful for other VMMs
> being developed/tested on KVM.

No, remove it.  It's architecturally legal for a CPU to support a feature in bare
metal but not provide virtualization support.

> > What *is* needed is for KVM to refuse to virtualize FRED if the entry/exit controls
> > aren't consistent.  E.g. if at least one control is present, and at least one
> > control is missing.   I.e. KVM needs a version of vmcs_entry_exit_pairs that can
> > deal with SECONDARY_VM_EXIT controls.
> 
> I agree there are better ways.  But maybe after or before VMX FRED.

Uh, no.  This is not optional.  FRED is the first feature that uses SECONDARY_VM_EXIT
controls, so FRED gets the honor of adding the necessary infrastructure support.

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v1 08/23] KVM: VMX: Initialize VMCS FRED fields
  2023-11-08 18:29 ` [PATCH v1 08/23] KVM: VMX: Initialize VMCS FRED fields Xin Li
@ 2023-11-13  3:04   ` Chao Gao
  2023-11-14  6:02     ` Li, Xin3
  0 siblings, 1 reply; 70+ messages in thread
From: Chao Gao @ 2023-11-13  3:04 UTC (permalink / raw)
  To: Xin Li
  Cc: kvm, linux-doc, linux-kernel, linux-hyperv, linux-kselftest,
	seanjc, pbonzini, corbet, kys, haiyangz, wei.liu, decui, tglx,
	mingo, bp, dave.hansen, x86, hpa, vkuznets, peterz,
	ravi.v.shankar

On Wed, Nov 08, 2023 at 10:29:48AM -0800, Xin Li wrote:
>Initialize host VMCS FRED fields with host FRED MSRs' value and
>guest VMCS FRED fields to 0.
>
>FRED CPU states are managed in 9 new FRED MSRs, as well as a few
>existing CPU registers and MSRs, e.g., CR4.FRED.  To support FRED
>context management, new VMCS fields corresponding to most of FRED
>CPU state MSRs are added to both the host-state and guest-state
>areas of VMCS.
>
>Specifically no VMCS fields are added for FRED RSP0 and SSP0 MSRs,
>because the 2 FRED MSRs are used during ring 3 event delivery only,
>thus KVM, running on ring 0, can run safely even with guest FRED
>RSP0 and SSP0.  It can be deferred to load host FRED RSP0 and SSP0
>until before returning to user level.
>
>Tested-by: Shan Kang <shan.kang@intel.com>
>Signed-off-by: Xin Li <xin3.li@intel.com>
>---
> arch/x86/include/asm/vmx.h | 16 ++++++++++++++++
> arch/x86/kvm/vmx/vmx.c     | 32 ++++++++++++++++++++++++++++++++
> 2 files changed, 48 insertions(+)
>
>diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
>index 41796a733bc9..d54a1a1057b0 100644
>--- a/arch/x86/include/asm/vmx.h
>+++ b/arch/x86/include/asm/vmx.h
>@@ -277,12 +277,28 @@ enum vmcs_field {
> 	GUEST_BNDCFGS_HIGH              = 0x00002813,
> 	GUEST_IA32_RTIT_CTL		= 0x00002814,
> 	GUEST_IA32_RTIT_CTL_HIGH	= 0x00002815,
>+	GUEST_IA32_FRED_CONFIG		= 0x0000281a,
>+	GUEST_IA32_FRED_RSP1		= 0x0000281c,
>+	GUEST_IA32_FRED_RSP2		= 0x0000281e,
>+	GUEST_IA32_FRED_RSP3		= 0x00002820,
>+	GUEST_IA32_FRED_STKLVLS		= 0x00002822,
>+	GUEST_IA32_FRED_SSP1		= 0x00002824,
>+	GUEST_IA32_FRED_SSP2		= 0x00002826,
>+	GUEST_IA32_FRED_SSP3		= 0x00002828,
> 	HOST_IA32_PAT			= 0x00002c00,
> 	HOST_IA32_PAT_HIGH		= 0x00002c01,
> 	HOST_IA32_EFER			= 0x00002c02,
> 	HOST_IA32_EFER_HIGH		= 0x00002c03,
> 	HOST_IA32_PERF_GLOBAL_CTRL	= 0x00002c04,
> 	HOST_IA32_PERF_GLOBAL_CTRL_HIGH	= 0x00002c05,
>+	HOST_IA32_FRED_CONFIG		= 0x00002c08,
>+	HOST_IA32_FRED_RSP1		= 0x00002c0a,
>+	HOST_IA32_FRED_RSP2		= 0x00002c0c,
>+	HOST_IA32_FRED_RSP3		= 0x00002c0e,
>+	HOST_IA32_FRED_STKLVLS		= 0x00002c10,
>+	HOST_IA32_FRED_SSP1		= 0x00002c12,
>+	HOST_IA32_FRED_SSP2		= 0x00002c14,
>+	HOST_IA32_FRED_SSP3		= 0x00002c16,
> 	PIN_BASED_VM_EXEC_CONTROL       = 0x00004000,
> 	CPU_BASED_VM_EXEC_CONTROL       = 0x00004002,
> 	EXCEPTION_BITMAP                = 0x00004004,
>diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
>index 327e052d90c1..41772ecdd368 100644
>--- a/arch/x86/kvm/vmx/vmx.c
>+++ b/arch/x86/kvm/vmx/vmx.c
>@@ -1477,6 +1477,18 @@ void vmx_vcpu_load_vmcs(struct kvm_vcpu *vcpu, int cpu,
> 				    (unsigned long)(cpu_entry_stack(cpu) + 1));
> 		}
> 
>+#ifdef CONFIG_X86_64
>+		/* Per-CPU FRED MSRs */
>+		if (cpu_feature_enabled(X86_FEATURE_FRED)) {

How about kvm_cpu_cap_has()? That would decouple KVM's capability to virtualize
a feature from the host's enabling of the feature.
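
I.e., something like (sketch only; assumes X86_FEATURE_FRED is wired into the
KVM capability masks):

	/* Sketch: key off KVM's ability to virtualize FRED, not host enabling. */
	if (kvm_cpu_cap_has(X86_FEATURE_FRED))
		vmcs_write64(HOST_IA32_FRED_RSP1, read_msr(MSR_IA32_FRED_RSP1));

with the remaining per-CPU FRED writes below following the same pattern.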

>+			vmcs_write64(HOST_IA32_FRED_RSP1, read_msr(MSR_IA32_FRED_RSP1));
>+			vmcs_write64(HOST_IA32_FRED_RSP2, read_msr(MSR_IA32_FRED_RSP2));
>+			vmcs_write64(HOST_IA32_FRED_RSP3, read_msr(MSR_IA32_FRED_RSP3));
>+			vmcs_write64(HOST_IA32_FRED_SSP1, read_msr(MSR_IA32_FRED_SSP1));
>+			vmcs_write64(HOST_IA32_FRED_SSP2, read_msr(MSR_IA32_FRED_SSP2));
>+			vmcs_write64(HOST_IA32_FRED_SSP3, read_msr(MSR_IA32_FRED_SSP3));
>+		}
>+#endif

why is this hunk enclosed in #ifdef CONFIG_X86_64 while the one below isn't?

>+
> 		vmx->loaded_vmcs->cpu = cpu;
> 	}
> }
>@@ -4375,6 +4387,15 @@ void vmx_set_constant_host_state(struct vcpu_vmx *vmx)
> 
> 	if (cpu_has_load_ia32_efer())
> 		vmcs_write64(HOST_IA32_EFER, host_efer);
>+
>+	/*
>+	 * FRED MSRs are per-cpu, however FRED CONFIG and STKLVLS MSRs
>+	 * are the same on all CPUs, thus they are initialized here.
>+	 */
>+	if (cpu_feature_enabled(X86_FEATURE_FRED)) {
>+		vmcs_write64(HOST_IA32_FRED_CONFIG, read_msr(MSR_IA32_FRED_CONFIG));
>+		vmcs_write64(HOST_IA32_FRED_STKLVLS, read_msr(MSR_IA32_FRED_STKLVLS));
>+	}
> }
> 
> void set_cr4_guest_host_mask(struct vcpu_vmx *vmx)
>@@ -4936,6 +4957,17 @@ static void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
> 	vmcs_writel(GUEST_IDTR_BASE, 0);
> 	vmcs_write32(GUEST_IDTR_LIMIT, 0xffff);
> 
>+	if (cpu_feature_enabled(X86_FEATURE_FRED)) {
>+		vmcs_write64(GUEST_IA32_FRED_CONFIG, 0);
>+		vmcs_write64(GUEST_IA32_FRED_RSP1, 0);
>+		vmcs_write64(GUEST_IA32_FRED_RSP2, 0);
>+		vmcs_write64(GUEST_IA32_FRED_RSP3, 0);
>+		vmcs_write64(GUEST_IA32_FRED_STKLVLS, 0);
>+		vmcs_write64(GUEST_IA32_FRED_SSP1, 0);
>+		vmcs_write64(GUEST_IA32_FRED_SSP2, 0);
>+		vmcs_write64(GUEST_IA32_FRED_SSP3, 0);
>+	}
>+

move this hunk to __vmx_vcpu_reset() because FRED spec says

"INIT does not change the value of the new MSRs."

> 	vmcs_write32(GUEST_ACTIVITY_STATE, GUEST_ACTIVITY_ACTIVE);
> 	vmcs_write32(GUEST_INTERRUPTIBILITY_INFO, 0);
> 	vmcs_writel(GUEST_PENDING_DBG_EXCEPTIONS, 0);
>-- 
>2.42.0
>
>

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v1 09/23] KVM: VMX: Switch FRED RSP0 between host and guest
  2023-11-08 18:29 ` [PATCH v1 09/23] KVM: VMX: Switch FRED RSP0 between host and guest Xin Li
@ 2023-11-13  3:47   ` Chao Gao
  2023-11-14  5:17     ` Li, Xin3
  0 siblings, 1 reply; 70+ messages in thread
From: Chao Gao @ 2023-11-13  3:47 UTC (permalink / raw)
  To: Xin Li
  Cc: kvm, linux-doc, linux-kernel, linux-hyperv, linux-kselftest,
	seanjc, pbonzini, corbet, kys, haiyangz, wei.liu, decui, tglx,
	mingo, bp, dave.hansen, x86, hpa, vkuznets, peterz,
	ravi.v.shankar

On Wed, Nov 08, 2023 at 10:29:49AM -0800, Xin Li wrote:
>Switch MSR_IA32_FRED_RSP0 between host and guest in
>vmx_prepare_switch_to_{host,guest}().
>
>MSR_IA32_FRED_RSP0 is used during ring 3 event delivery only, thus
>KVM, running on ring 0, can run safely with guest FRED RSP0, i.e.,
>no need to switch between host/guest FRED RSP0 during VM entry and
>exit.
>
>KVM should switch to host FRED RSP0 before returning to user level,
>and switch to guest FRED RSP0 before entering guest mode.
>
>Tested-by: Shan Kang <shan.kang@intel.com>
>Signed-off-by: Xin Li <xin3.li@intel.com>
>---
> arch/x86/kvm/vmx/vmx.c | 17 +++++++++++++++++
> arch/x86/kvm/vmx/vmx.h |  2 ++
> 2 files changed, 19 insertions(+)
>
>diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
>index 41772ecdd368..d00ab9d4c93e 100644
>--- a/arch/x86/kvm/vmx/vmx.c
>+++ b/arch/x86/kvm/vmx/vmx.c
>@@ -1344,6 +1344,17 @@ void vmx_prepare_switch_to_guest(struct kvm_vcpu *vcpu)
> 	}
> 
> 	wrmsrl(MSR_KERNEL_GS_BASE, vmx->msr_guest_kernel_gs_base);
>+
>+	if (cpu_feature_enabled(X86_FEATURE_FRED) &&
>+	    guest_cpuid_has(vcpu, X86_FEATURE_FRED)) {
>+		/*
>+		 * MSR_IA32_FRED_RSP0 is top of task stack, which never changes.
>+		 * Thus it should be initialized only once.
>+		 */
>+		if (unlikely(vmx->msr_host_fred_rsp0 == 0))
>+			vmx->msr_host_fred_rsp0 = read_msr(MSR_IA32_FRED_RSP0);
>+		wrmsrl(MSR_IA32_FRED_RSP0, vmx->msr_guest_fred_rsp0);
>+	}
> #else
> 	savesegment(fs, fs_sel);
> 	savesegment(gs, gs_sel);
>@@ -1388,6 +1399,12 @@ static void vmx_prepare_switch_to_host(struct vcpu_vmx *vmx)
> 	invalidate_tss_limit();
> #ifdef CONFIG_X86_64
> 	wrmsrl(MSR_KERNEL_GS_BASE, vmx->msr_host_kernel_gs_base);
>+
>+	if (cpu_feature_enabled(X86_FEATURE_FRED) &&
>+	    guest_cpuid_has(&vmx->vcpu, X86_FEATURE_FRED)) {

IIUC, vmx_prepare_switch_to_host() is called from IRQ-disabled context.  Using
guest_cpuid_has() in this context is not desired, see lockdep_assert_irqs_enabled()
in cpuid_entry2_find().
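
One option (sketch only; the vcpu_vmx field name is made up) is to cache the
result when CPUID is set and test the cached flag on the switch paths:

	/* Sketch: computed once in vmx_vcpu_after_set_cpuid()... */
	vmx->guest_has_fred = cpu_feature_enabled(X86_FEATURE_FRED) &&
			      guest_cpuid_has(vcpu, X86_FEATURE_FRED);

	/* ...and checked in vmx_prepare_switch_to_{guest,host}(). */
	if (vmx->guest_has_fred)
		wrmsrl(MSR_IA32_FRED_RSP0, vmx->msr_guest_fred_rsp0);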

>+		vmx->msr_guest_fred_rsp0 = read_msr(MSR_IA32_FRED_RSP0);
>+		wrmsrl(MSR_IA32_FRED_RSP0, vmx->msr_host_fred_rsp0);
>+	}
> #endif
> 	load_fixmap_gdt(raw_smp_processor_id());
> 	vmx->guest_state_loaded = false;
>diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
>index f8c02bd37069..328a3447f064 100644
>--- a/arch/x86/kvm/vmx/vmx.h
>+++ b/arch/x86/kvm/vmx/vmx.h
>@@ -276,6 +276,8 @@ struct vcpu_vmx {
> #ifdef CONFIG_X86_64
> 	u64		      msr_host_kernel_gs_base;
> 	u64		      msr_guest_kernel_gs_base;
>+	u64		      msr_host_fred_rsp0;
>+	u64		      msr_guest_fred_rsp0;

resetting guest fred rsp0 to 0 during vcpu reset is missing.
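
E.g. (sketch only, placed wherever the rest of the vCPU state is zeroed on
RESET):

	/* Sketch: don't carry a stale guest FRED RSP0 across vCPU RESET. */
	vmx->msr_guest_fred_rsp0 = 0;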

> #endif
> 
> 	u64		      spec_ctrl;
>-- 
>2.42.0
>
>

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v1 10/23] KVM: VMX: Add support for FRED context save/restore
  2023-11-08 18:29 ` [PATCH v1 10/23] KVM: VMX: Add support for FRED context save/restore Xin Li
@ 2023-11-13  5:24   ` Chao Gao
  2023-11-14  4:48     ` Li, Xin3
  0 siblings, 1 reply; 70+ messages in thread
From: Chao Gao @ 2023-11-13  5:24 UTC (permalink / raw)
  To: Xin Li
  Cc: kvm, linux-doc, linux-kernel, linux-hyperv, linux-kselftest,
	seanjc, pbonzini, corbet, kys, haiyangz, wei.liu, decui, tglx,
	mingo, bp, dave.hansen, x86, hpa, vkuznets, peterz,
	ravi.v.shankar

>diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>index 2c924075f6f1..c5a55810647f 100644
>--- a/arch/x86/kvm/x86.c
>+++ b/arch/x86/kvm/x86.c
>@@ -1451,6 +1451,9 @@ static const u32 msrs_to_save_base[] = {
> 	MSR_STAR,
> #ifdef CONFIG_X86_64
> 	MSR_CSTAR, MSR_KERNEL_GS_BASE, MSR_SYSCALL_MASK, MSR_LSTAR,
>+	MSR_IA32_FRED_RSP0, MSR_IA32_FRED_RSP1, MSR_IA32_FRED_RSP2,
>+	MSR_IA32_FRED_RSP3, MSR_IA32_FRED_STKLVLS, MSR_IA32_FRED_SSP1,
>+	MSR_IA32_FRED_SSP2, MSR_IA32_FRED_SSP3, MSR_IA32_FRED_CONFIG,

Need to handle the case where FRED MSRs are valid but KVM cannot virtualize
FRED, see kvm_probe_msr_to_save().
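
I.e., a filter along the lines of what kvm_probe_msr_to_save() already does
for other feature-dependent MSRs (sketch only; assumes X86_FEATURE_FRED is
reflected in kvm_cpu_cap_has()):

	case MSR_IA32_FRED_RSP0 ... MSR_IA32_FRED_CONFIG:
		/* Sketch: don't report FRED MSRs if KVM can't virtualize FRED. */
		if (!kvm_cpu_cap_has(X86_FEATURE_FRED))
			return;
		break;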

> #endif
> 	MSR_IA32_TSC, MSR_IA32_CR_PAT, MSR_VM_HSAVE_PA,
> 	MSR_IA32_FEAT_CTL, MSR_IA32_BNDCFGS, MSR_TSC_AUX,
>@@ -1890,6 +1893,16 @@ static int __kvm_set_msr(struct kvm_vcpu *vcpu, u32 index, u64 data,
> 
> 		data = (u32)data;
> 		break;
>+	case MSR_IA32_FRED_RSP0 ... MSR_IA32_FRED_CONFIG:
>+		if (host_initiated || guest_cpuid_has(vcpu, X86_FEATURE_FRED))
>+			break;

Nothing guarantees FRED MSRs/VMCS fields exist on the hardware here. Probably use
guest_cpu_cap_has()*.

*: https://lore.kernel.org/kvm/20231110235528.1561679-1-seanjc@google.com

>+
>+		/*
>+		 * Inject #GP upon FRED MSRs accesses from a non-FRED guest to
>+		 * make sure no malicious guest can write to FRED MSRs thus to
>+		 * corrupt host FRED MSRs.
>+		 */

I think injecting #GP here is simply because KVM should emulate hardware
behavior. To me, preventing guest from corrupting FRED MSRs is at most a
byproduct. I prefer to drop the comment.

>+		return 1;
> 	}
> 
> 	msr.data = data;
>@@ -1933,6 +1946,16 @@ int __kvm_get_msr(struct kvm_vcpu *vcpu, u32 index, u64 *data,
> 		    !guest_cpuid_has(vcpu, X86_FEATURE_RDPID))
> 			return 1;
> 		break;
>+	case MSR_IA32_FRED_RSP0 ... MSR_IA32_FRED_CONFIG:
>+		if (host_initiated || guest_cpuid_has(vcpu, X86_FEATURE_FRED))
>+			break;
>+
>+		/*
>+		 * Inject #GP upon FRED MSRs accesses from a non-FRED guest to
>+		 * make sure no malicious guest can write to FRED MSRs thus to
>+		 * corrupt host FRED MSRs.
>+		 */
>+		return 1;
> 	}
> 
> 	msr.index = index;
>-- 
>2.42.0
>
>

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v1 11/23] KVM: x86: Add kvm_is_fred_enabled()
  2023-11-08 18:29 ` [PATCH v1 11/23] KVM: x86: Add kvm_is_fred_enabled() Xin Li
@ 2023-11-13  7:35   ` Chao Gao
  2023-11-14  4:42     ` Li, Xin3
  0 siblings, 1 reply; 70+ messages in thread
From: Chao Gao @ 2023-11-13  7:35 UTC (permalink / raw)
  To: Xin Li
  Cc: kvm, linux-doc, linux-kernel, linux-hyperv, linux-kselftest,
	seanjc, pbonzini, corbet, kys, haiyangz, wei.liu, decui, tglx,
	mingo, bp, dave.hansen, x86, hpa, vkuznets, peterz,
	ravi.v.shankar

On Wed, Nov 08, 2023 at 10:29:51AM -0800, Xin Li wrote:
>Add kvm_is_fred_enabled() to get if FRED is enabled on a vCPU.
>
>Tested-by: Shan Kang <shan.kang@intel.com>
>Signed-off-by: Xin Li <xin3.li@intel.com>
>---
> arch/x86/kvm/kvm_cache_regs.h | 10 ++++++++++
> 1 file changed, 10 insertions(+)
>
>diff --git a/arch/x86/kvm/kvm_cache_regs.h b/arch/x86/kvm/kvm_cache_regs.h
>index 75eae9c4998a..390643e8c532 100644
>--- a/arch/x86/kvm/kvm_cache_regs.h
>+++ b/arch/x86/kvm/kvm_cache_regs.h
>@@ -187,6 +187,16 @@ static __always_inline bool kvm_is_cr4_bit_set(struct kvm_vcpu *vcpu,
> 	return !!kvm_read_cr4_bits(vcpu, cr4_bit);
> }
> 
>+static __always_inline bool kvm_is_fred_enabled(struct kvm_vcpu *vcpu)
>+{
>+#ifdef CONFIG_X86_64
>+	return cpu_feature_enabled(X86_FEATURE_FRED) &&
>+	       kvm_is_cr4_bit_set(vcpu, X86_CR4_FRED);

FRED is enabled when CR4.FRED = IA32_EFER.LMA = 1. Any reason to omit
the check about long mode?
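
For reference, a version mirroring the architectural condition could look like
(sketch only; reading the guest's EFER.LMA via vcpu->arch.efer is the
assumption here):

	/* Sketch: FRED is enabled iff CR4.FRED = IA32_EFER.LMA = 1. */
	return cpu_feature_enabled(X86_FEATURE_FRED) &&
	       kvm_is_cr4_bit_set(vcpu, X86_CR4_FRED) &&
	       (vcpu->arch.efer & EFER_LMA);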

>+#else
>+	return false;
>+#endif
>+}
>+
> static inline ulong kvm_read_cr3(struct kvm_vcpu *vcpu)
> {
> 	if (!kvm_register_is_available(vcpu, VCPU_EXREG_CR3))
>-- 
>2.42.0
>
>

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v1 12/23] KVM: VMX: Handle FRED event data
  2023-11-08 18:29 ` [PATCH v1 12/23] KVM: VMX: Handle FRED event data Xin Li
@ 2023-11-13 10:14   ` Chao Gao
  2023-11-14  4:34     ` Li, Xin3
  0 siblings, 1 reply; 70+ messages in thread
From: Chao Gao @ 2023-11-13 10:14 UTC (permalink / raw)
  To: Xin Li
  Cc: kvm, linux-doc, linux-kernel, linux-hyperv, linux-kselftest,
	seanjc, pbonzini, corbet, kys, haiyangz, wei.liu, decui, tglx,
	mingo, bp, dave.hansen, x86, hpa, vkuznets, peterz,
	ravi.v.shankar

On Wed, Nov 08, 2023 at 10:29:52AM -0800, Xin Li wrote:
>Set injected-event data when injecting a #PF, #DB, or #NM caused
>by extended feature disable using FRED event delivery, and save
>original-event data for being used as injected-event data.
>
>Unlike IDT using some extra CPU register as part of an event
>context, e.g., %cr2 for #PF, FRED saves a complete event context
>in its stack frame, e.g., FRED saves the faulting linear address
>of a #PF into the event data field defined in its stack frame.
>
>Thus a new VMX control field called injected-event data is added
>to provide the event data that will be pushed into a FRED stack
>frame for VM entries that inject an event using FRED event delivery.
>In addition, a new VM exit information field called original-event
>data is added to store the event data that would have saved into a
>FRED stack frame for VM exits that occur during FRED event delivery.
>After such a VM exit is handled to allow the original-event to be
>delivered, the data in the original-event data VMCS field needs to
>be set into the injected-event data VMCS field for the injection of
>the original event.
>
>Tested-by: Shan Kang <shan.kang@intel.com>
>Signed-off-by: Xin Li <xin3.li@intel.com>
>---
> arch/x86/include/asm/vmx.h |  4 ++
> arch/x86/kvm/vmx/vmx.c     | 84 +++++++++++++++++++++++++++++++++++---
> arch/x86/kvm/x86.c         | 10 ++++-
> 3 files changed, 91 insertions(+), 7 deletions(-)
>
>diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
>index d54a1a1057b0..97729248e844 100644
>--- a/arch/x86/include/asm/vmx.h
>+++ b/arch/x86/include/asm/vmx.h
>@@ -253,8 +253,12 @@ enum vmcs_field {
> 	PID_POINTER_TABLE_HIGH		= 0x00002043,
> 	SECONDARY_VM_EXIT_CONTROLS	= 0x00002044,
> 	SECONDARY_VM_EXIT_CONTROLS_HIGH	= 0x00002045,
>+	INJECTED_EVENT_DATA		= 0x00002052,
>+	INJECTED_EVENT_DATA_HIGH	= 0x00002053,
> 	GUEST_PHYSICAL_ADDRESS          = 0x00002400,
> 	GUEST_PHYSICAL_ADDRESS_HIGH     = 0x00002401,
>+	ORIGINAL_EVENT_DATA		= 0x00002404,
>+	ORIGINAL_EVENT_DATA_HIGH	= 0x00002405,
> 	VMCS_LINK_POINTER               = 0x00002800,
> 	VMCS_LINK_POINTER_HIGH          = 0x00002801,
> 	GUEST_IA32_DEBUGCTL             = 0x00002802,
>diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
>index 58d01e845804..67fd4a56d031 100644
>--- a/arch/x86/kvm/vmx/vmx.c
>+++ b/arch/x86/kvm/vmx/vmx.c
>@@ -1880,9 +1880,30 @@ static void vmx_inject_exception(struct kvm_vcpu *vcpu)
> 		vmcs_write32(VM_ENTRY_INSTRUCTION_LEN,
> 			     vmx->vcpu.arch.event_exit_inst_len);
> 		intr_info |= INTR_TYPE_SOFT_EXCEPTION;
>-	} else
>+	} else {
> 		intr_info |= INTR_TYPE_HARD_EXCEPTION;
> 
>+		if (kvm_is_fred_enabled(vcpu)) {
>+			u64 event_data = 0;
>+
>+			if (is_debug(intr_info))
>+				/*
>+				 * Compared to DR6, FRED #DB event data saved on
>+				 * the stack frame have bits 4 ~ 11 and 16 ~ 31
>+				 * inverted, i.e.,
>+				 *   fred_db_event_data = dr6 ^ 0xFFFF0FF0UL
>+				 */
>+				event_data = vcpu->arch.dr6 ^ DR6_RESERVED;
>+			else if (is_page_fault(intr_info))
>+				event_data = vcpu->arch.cr2;
>+			else if (is_nm_fault(intr_info) &&
>+				 vcpu->arch.guest_fpu.fpstate->xfd)

does this necessarily mean the #NM is caused by XFD?

>+				event_data = vcpu->arch.guest_fpu.xfd_err;
>+
>+			vmcs_write64(INJECTED_EVENT_DATA, event_data);
>+		}
>+	}
>+
> 	vmcs_write32(VM_ENTRY_INTR_INFO_FIELD, intr_info);
> 
> 	vmx_clear_hlt(vcpu);
>@@ -7226,7 +7247,8 @@ static void vmx_recover_nmi_blocking(struct vcpu_vmx *vmx)
> static void __vmx_complete_interrupts(struct kvm_vcpu *vcpu,
> 				      u32 idt_vectoring_info,
> 				      int instr_len_field,
>-				      int error_code_field)
>+				      int error_code_field,
>+				      int event_data_field)

event_data_field is used to indicate whether this is a "cancel".  I think it may be
better to simply use a boolean, e.g., bool cancel.
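
I.e., something like (sketch only):

	static void __vmx_complete_interrupts(struct kvm_vcpu *vcpu,
					      u32 idt_vectoring_info,
					      int instr_len_field,
					      int error_code_field,
					      bool cancel);

with vmx_complete_interrupts() passing false (and reading ORIGINAL_EVENT_DATA
internally) and vmx_cancel_injection() passing true.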


> {
> 	u8 vector;
> 	int type;
>@@ -7260,6 +7282,37 @@ static void __vmx_complete_interrupts(struct kvm_vcpu *vcpu,
> 		vcpu->arch.event_exit_inst_len = vmcs_read32(instr_len_field);
> 		fallthrough;
> 	case INTR_TYPE_HARD_EXCEPTION:
>+		if (kvm_is_fred_enabled(vcpu) && event_data_field) {
>+			/*
>+			 * Save original-event data for being used as injected-event data.
>+			 */

Looks like we also expect the CPU to update CR2/DR6/XFD_ERR.  This hunk looks to me like
just a paranoid check to ensure the CPU works as expected.  If that's the case, I suggest
documenting it a bit in the comment.

>+			u64 event_data = vmcs_read64(event_data_field);
>+
>+			switch (vector) {
>+			case DB_VECTOR:
>+				get_debugreg(vcpu->arch.dr6, 6);
>+				WARN_ON(vcpu->arch.dr6 != (event_data ^ DR6_RESERVED));
>+				vcpu->arch.dr6 = event_data ^ DR6_RESERVED;
>+				break;
>+			case NM_VECTOR:
>+				if (vcpu->arch.guest_fpu.fpstate->xfd) {
>+					rdmsrl(MSR_IA32_XFD_ERR, vcpu->arch.guest_fpu.xfd_err);
>+					WARN_ON(vcpu->arch.guest_fpu.xfd_err != event_data);
>+					vcpu->arch.guest_fpu.xfd_err = event_data;
>+				} else {
>+					WARN_ON(event_data != 0);
>+				}
>+				break;
>+			case PF_VECTOR:
>+				WARN_ON(vcpu->arch.cr2 != event_data);
>+				vcpu->arch.cr2 = event_data;
>+				break;
>+			default:
>+				WARN_ON(event_data != 0);

I am not sure if this WARN_ON() can be triggered by nested VMX. It is
legitimate for L1 VMM to inject any event w/ an event_data.

FRED spec says:

Section 5.2.1 specifies the event data that FRED event delivery of certain events saves
on the stack. When FRED event delivery is used for an event injected by VM entry, the
event data saved is the value of the injected-event-data field in the VMCS. This value is
used instead of what is specified in Section 5.2.1 and is done for __ALL__ injected events
using FRED event delivery


>+				break;
>+			}
>+		}
>+
> 		if (idt_vectoring_info & VECTORING_INFO_DELIVER_CODE_MASK) {
> 			u32 err = vmcs_read32(error_code_field);
> 			kvm_requeue_exception_e(vcpu, vector, err);
>@@ -7279,9 +7332,11 @@ static void __vmx_complete_interrupts(struct kvm_vcpu *vcpu,
> 
> static void vmx_complete_interrupts(struct vcpu_vmx *vmx)
> {
>-	__vmx_complete_interrupts(&vmx->vcpu, vmx->idt_vectoring_info,
>+	__vmx_complete_interrupts(&vmx->vcpu,
>+				  vmx->idt_vectoring_info,
> 				  VM_EXIT_INSTRUCTION_LEN,
>-				  IDT_VECTORING_ERROR_CODE);
>+				  IDT_VECTORING_ERROR_CODE,
>+				  ORIGINAL_EVENT_DATA);
> }
> 
> static void vmx_cancel_injection(struct kvm_vcpu *vcpu)
>@@ -7289,7 +7344,8 @@ static void vmx_cancel_injection(struct kvm_vcpu *vcpu)
> 	__vmx_complete_interrupts(vcpu,
> 				  vmcs_read32(VM_ENTRY_INTR_INFO_FIELD),
> 				  VM_ENTRY_INSTRUCTION_LEN,
>-				  VM_ENTRY_EXCEPTION_ERROR_CODE);
>+				  VM_ENTRY_EXCEPTION_ERROR_CODE,
>+				  0);
> 
> 	vmcs_write32(VM_ENTRY_INTR_INFO_FIELD, 0);
> }

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v1 05/23] KVM: VMX: Initialize FRED VM entry/exit controls in vmcs_config
  2023-11-08 18:29 ` [PATCH v1 05/23] KVM: VMX: Initialize FRED VM entry/exit controls in vmcs_config Xin Li
  2023-11-09  8:53   ` Chao Gao
@ 2023-11-13 17:18   ` Nikolay Borisov
  2023-11-15  2:39     ` Li, Xin3
  1 sibling, 1 reply; 70+ messages in thread
From: Nikolay Borisov @ 2023-11-13 17:18 UTC (permalink / raw)
  To: Xin Li, kvm, linux-doc, linux-kernel, linux-hyperv, linux-kselftest
  Cc: seanjc, pbonzini, corbet, kys, haiyangz, wei.liu, decui, tglx,
	mingo, bp, dave.hansen, x86, hpa, vkuznets, peterz,
	ravi.v.shankar



On 8.11.23 г. 20:29 ч., Xin Li wrote:
> Setup the global vmcs_config for FRED:
> 1) Add VM_ENTRY_LOAD_IA32_FRED to KVM_OPTIONAL_VMX_VM_ENTRY_CONTROLS to
>     have a FRED CPU load guest FRED MSRs from VMCS upon VM entry.
> 2) Add SECONDARY_VM_EXIT_SAVE_IA32_FRED to
>     KVM_OPTIONAL_VMX_SECONDARY_VM_EXIT_CONTROLS to have a FRED CPU save
>     guest FRED MSRs to VMCS during VM exit.
> 3) add SECONDARY_VM_EXIT_LOAD_IA32_FRED to
>     KVM_OPTIONAL_VMX_SECONDARY_VM_EXIT_CONTROLS to have a FRED CPU load
>     host FRED MSRs from VMCS during VM exit.
> 
> Also add sanity checks to make sure FRED VM entry/exit controls can be
> set on a FRED CPU.
> 
> Tested-by: Shan Kang <shan.kang@intel.com>
> Signed-off-by: Xin Li <xin3.li@intel.com>
> ---
>   arch/x86/include/asm/vmx.h |  3 +++
>   arch/x86/kvm/vmx/vmx.c     | 19 ++++++++++++++++++-
>   arch/x86/kvm/vmx/vmx.h     |  7 +++++--
>   3 files changed, 26 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
> index 4d4177ec802c..41796a733bc9 100644
> --- a/arch/x86/include/asm/vmx.h
> +++ b/arch/x86/include/asm/vmx.h
> @@ -106,6 +106,8 @@
>   #define VM_EXIT_PT_CONCEAL_PIP			0x01000000
>   #define VM_EXIT_CLEAR_IA32_RTIT_CTL		0x02000000
>   #define VM_EXIT_ACTIVATE_SECONDARY_CONTROLS	0x80000000
> +#define SECONDARY_VM_EXIT_SAVE_IA32_FRED	0x00000001
> +#define SECONDARY_VM_EXIT_LOAD_IA32_FRED	0x00000002
>   
>   #define VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR	0x00036dff
>   
> @@ -119,6 +121,7 @@
>   #define VM_ENTRY_LOAD_BNDCFGS                   0x00010000
>   #define VM_ENTRY_PT_CONCEAL_PIP			0x00020000
>   #define VM_ENTRY_LOAD_IA32_RTIT_CTL		0x00040000
> +#define VM_ENTRY_LOAD_IA32_FRED			0x00800000
>   
>   #define VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR	0x000011ff
>   
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index df769207cbe0..9186f41974ab 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -2694,10 +2694,27 @@ static int setup_vmcs_config(struct vmcs_config *vmcs_conf,
>   		_vmexit_control &= ~x_ctrl;
>   	}
>   
> -	if (_vmexit_control & VM_EXIT_ACTIVATE_SECONDARY_CONTROLS)
> +	if (_vmexit_control & VM_EXIT_ACTIVATE_SECONDARY_CONTROLS) {
>   		_secondary_vmexit_control =
>   			adjust_vmx_controls64(KVM_OPTIONAL_VMX_SECONDARY_VM_EXIT_CONTROLS,
>   					      MSR_IA32_VMX_EXIT_CTLS2);
> +		if (cpu_feature_enabled(X86_FEATURE_FRED) &&
> +		    !(_secondary_vmexit_control & SECONDARY_VM_EXIT_SAVE_IA32_FRED &&
> +		      _secondary_vmexit_control & SECONDARY_VM_EXIT_LOAD_IA32_FRED)) {

Can those checks actually trigger? I.e. if FEATURE_FRED is set it means 
the CPU implements the FRED spec. According to the spec it's guaranteed 
that VMX_EXIT_CTLS2 will contain those bits set to 1. So aren't those 
checks superfluous?

> +			pr_warn_once("FRED enabled but no VMX VM-Exit {SAVE,LOAD}_IA32_FRED controls: %llx\n",
> +				     _secondary_vmexit_control);
> +			if (error_on_inconsistent_vmcs_config)
> +				return -EIO;
> +		}
> +	}
> +
> +	if (cpu_feature_enabled(X86_FEATURE_FRED) &&
> +	    !(_vmentry_control & VM_ENTRY_LOAD_IA32_FRED)) {
DITTO

<snip>

^ permalink raw reply	[flat|nested] 70+ messages in thread

* RE: [PATCH v1 06/23] KVM: VMX: Defer enabling FRED MSRs save/load until after set CPUID
  2023-11-10  0:18       ` Sean Christopherson
@ 2023-11-14  2:50         ` Li, Xin3
  2023-11-15 21:47           ` Sean Christopherson
  0 siblings, 1 reply; 70+ messages in thread
From: Li, Xin3 @ 2023-11-14  2:50 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Gao, Chao, kvm, linux-doc, linux-kernel, linux-hyperv,
	linux-kselftest, pbonzini, corbet, kys, haiyangz, wei.liu, Cui,
	Dexuan, tglx, mingo, bp, dave.hansen, x86, hpa, vkuznets, peterz,
	Shankar, Ravi V

> > > Clearing VM_EXIT_ACTIVATE_SECONDARY_CONTROLS may be problematic when
> > > new bits are added to secondary vmcs controls. Why not keep
> > > VM_EXIT_ACTIVATE_SECONDARY_CONTROLS always on if it is supported? or
> > > you see any perf impact?
> >
> > I think it from the other way, why keeps hw loading it on every
> > vmentry even if it's not used by a guest?
> 
> Oh, yeesh, this is clearing the activation control.  Yeah, NAK to that, just leave it
> set.  If it weren't for the fact that there is apparently a metric ton of FRED state (I
> thought the whole point of FRED was to kill off legacy crap like
> CPL1 and CPL2 stacks?) _and_ that KVM needs to toggle MSR intercepts, I'd
> probably push back on toggling even the FRED controls.

To me, FRED is about _architecturally_ doing the right thing for x86 event handling.

I don't think FRED's major purpose is to kill legacy crap, but longer term
it paves the way for that.

Yeah, I would like to discuss whether to toggle FRED controls.

> 
> > Different CPUs may implement it in different ways, which we can't assume.
> 
> Implement what in a different way?  The VMCS fields and FRED are architectural.
> The internal layout of the VMCS is uarch specific, but the encodings and semantics
> absolutely cannot change without breaking software.  And if Intel does something
> asinine like make a control active-low then we have far, far bigger problems.

I should have made it clear that I wasn't talking at the ISA level.  And
of course CPU uarch implementations should be transparent to software.

I mean a CPU uarch could choose to check the activation bit in the VM exit
controls first and then decide whether to load the 2nd VM exit controls.
While if resources allow, a CPU uarch could always load the 2nd VM exit
controls.

BTW, I believe the active-low controls are really gone with new features.
All new controls are all 0s by default.

> > Other features needing it should set it separately, say with a refcount.
> 
> Absolutely not.  Unless Intel screwed up the implementation, the cost of leaving
> VM_EXIT_ACTIVATE_SECONDARY_CONTROLS set when it's supported shouldn't
> even be measurable.

I do hope so.  However, I don't know whether this is guaranteed or not on
all uarch implementations.

A decision to leave it set is good enough for now.

Thanks!
    Xin

^ permalink raw reply	[flat|nested] 70+ messages in thread

* RE: [PATCH v1 05/23] KVM: VMX: Initialize FRED VM entry/exit controls in vmcs_config
  2023-11-10 15:01         ` Sean Christopherson
@ 2023-11-14  4:05           ` Li, Xin3
  0 siblings, 0 replies; 70+ messages in thread
From: Li, Xin3 @ 2023-11-14  4:05 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Gao, Chao, kvm, linux-doc, linux-kernel, linux-hyperv,
	linux-kselftest, pbonzini, corbet, kys, haiyangz, wei.liu, Cui,
	Dexuan, tglx, mingo, bp, dave.hansen, x86, hpa, vkuznets, peterz,
	Shankar, Ravi V

> > > > Can we just hide FRED from guests like what KVM does for other
> > > > features which have similar dependencies? see vmx_set_cpu_caps().
> > >
> > > Both of these warnings should simply be dropped.  The
> > > error_on_inconsistent_vmcs_config stuff is for inconsistencies within the
> allowed
> > > VMCS fields.  Having a feature that is supported in bare metal but not
> virtualized
> > > is perfectly legal, if uncommon.
> >
> > I deliberately keep it, at least for now, because these checks are helpful
> > during the development of nVMX FRED.  It will be helpful for other VMMs
> > being developed/tested on KVM.
> 
> No, remove it.  It's architecturally legal for a CPU to support a feature in bare
> metal but not provide virtualization support.

Like the stage when native Linux has FRED support but KVM doesn't yet?

> > > What *is* needed is for KVM to refuse to virtualize FRED if the entry/exit
> controls
> > > aren't consistent.  E.g. if at least one control is present, and at least one
> > > control is missing.   I.e. KVM needs a version of vmcs_entry_exit_pairs that can
> > > deal with SECONDAY_VM_EXIT controls.
> >
> > I agree there are better ways.  But maybe after or before VMX FRED.
> 
> Uh, no.  This is not optional.  FRED is the first feature that uses
> SECONDARY_VM_EXIT
> controls, so FRED gets the honor of adding the necessary infrastructure support.

The 2nd VM exit controls are a must for FRED, so I should do it.

I think you mean the consistency checks can be done in a better way (which
is not just for FRED controls consistency checks).  No?

Thanks!
    Xin

^ permalink raw reply	[flat|nested] 70+ messages in thread

* RE: [PATCH v1 12/23] KVM: VMX: Handle FRED event data
  2023-11-13 10:14   ` Chao Gao
@ 2023-11-14  4:34     ` Li, Xin3
  2023-11-14  8:58       ` Chao Gao
  0 siblings, 1 reply; 70+ messages in thread
From: Li, Xin3 @ 2023-11-14  4:34 UTC (permalink / raw)
  To: Gao, Chao
  Cc: kvm, linux-doc, linux-kernel, linux-hyperv, linux-kselftest,
	seanjc, pbonzini, corbet, kys, haiyangz, wei.liu, Cui, Dexuan,
	tglx, mingo, bp, dave.hansen, x86, hpa, vkuznets, peterz,
	Shankar, Ravi V

> >+			else if (is_nm_fault(intr_info) &&
> >+				 vcpu->arch.guest_fpu.fpstate->xfd)
> 
> does this necessarily mean the #NM is caused by XFD?

Then the event data should be 0.  Or did I miss something obvious?  I.e.,
the two cases can be easily differentiated, and we should just explicitly set it
to 0.

> 
> >+				event_data = vcpu->arch.guest_fpu.xfd_err;
> >+
> >+			vmcs_write64(INJECTED_EVENT_DATA, event_data);
> >+		}
> >+	}
> >+
> > 	vmcs_write32(VM_ENTRY_INTR_INFO_FIELD, intr_info);
> >
> > 	vmx_clear_hlt(vcpu);
> >@@ -7226,7 +7247,8 @@ static void vmx_recover_nmi_blocking(struct
> >vcpu_vmx *vmx)  static void __vmx_complete_interrupts(struct kvm_vcpu *vcpu,
> > 				      u32 idt_vectoring_info,
> > 				      int instr_len_field,
> >-				      int error_code_field)
> >+				      int error_code_field,
> >+				      int event_data_field)
> 
> event_data_field is used to indicate whether this is a "cancel". I may think it is
> better to simply use a boolean e.g., bool cancel.

I'm fine with the idea if no objections.

> 
> 
> > {
> > 	u8 vector;
> > 	int type;
> >@@ -7260,6 +7282,37 @@ static void __vmx_complete_interrupts(struct
> kvm_vcpu *vcpu,
> > 		vcpu->arch.event_exit_inst_len = vmcs_read32(instr_len_field);
> > 		fallthrough;
> > 	case INTR_TYPE_HARD_EXCEPTION:
> >+		if (kvm_is_fred_enabled(vcpu) && event_data_field) {
> >+			/*
> >+			 * Save original-event data for being used as injected-
> event data.
> >+			 */
> 
> Looks we also expect CPU will update CR2/DR6/XFD_ERR. this hunk looks to me
> just a paranoid check to ensure the cpu works as expected. if that's the case, I
> suggest documenting it a bit in the comment.

These checks are not intended for hw; they make sure nVMX FRED is correctly
implemented and catch regressions.

And yes, in the early stage, I prefer to be paranoid.

> 
> >+			u64 event_data = vmcs_read64(event_data_field);
> >+
> >+			switch (vector) {
> >+			case DB_VECTOR:
> >+				get_debugreg(vcpu->arch.dr6, 6);
> >+				WARN_ON(vcpu->arch.dr6 != (event_data ^
> DR6_RESERVED));
> >+				vcpu->arch.dr6 = event_data ^ DR6_RESERVED;
> >+				break;
> >+			case NM_VECTOR:
> >+				if (vcpu->arch.guest_fpu.fpstate->xfd) {
> >+					rdmsrl(MSR_IA32_XFD_ERR, vcpu-
> >arch.guest_fpu.xfd_err);
> >+					WARN_ON(vcpu-
> >arch.guest_fpu.xfd_err != event_data);
> >+					vcpu->arch.guest_fpu.xfd_err =
> event_data;
> >+				} else {
> >+					WARN_ON(event_data != 0);
> >+				}
> >+				break;
> >+			case PF_VECTOR:
> >+				WARN_ON(vcpu->arch.cr2 != event_data);
> >+				vcpu->arch.cr2 = event_data;
> >+				break;
> >+			default:
> >+				WARN_ON(event_data != 0);
> 
> I am not sure if this WARN_ON() can be triggered by nested VMX. It is legitimate
> for L1 VMM to inject any event w/ an event_data.
> 
> FRED spec says:
> 
> Section 5.2.1 specifies the event data that FRED event delivery of certain events
> saves on the stack. When FRED event delivery is used for an event injected by VM
> entry, the event data saved is the value of the injected-event-data field in the
> VMCS. This value is used instead of what is specified in Section 5.2.1 and is done
> for __ALL__ injected events using FRED event delivery

5.2.1 Saving Information on the Regular Stack also says:
- For any other event, the event data are not currently defined and will
  be zero until they are.

Or do you mean something else?

^ permalink raw reply	[flat|nested] 70+ messages in thread

* RE: [PATCH v1 11/23] KVM: x86: Add kvm_is_fred_enabled()
  2023-11-13  7:35   ` Chao Gao
@ 2023-11-14  4:42     ` Li, Xin3
  2023-11-14  8:16       ` Chao Gao
  0 siblings, 1 reply; 70+ messages in thread
From: Li, Xin3 @ 2023-11-14  4:42 UTC (permalink / raw)
  To: Gao, Chao
  Cc: kvm, linux-doc, linux-kernel, linux-hyperv, linux-kselftest,
	seanjc, pbonzini, corbet, kys, haiyangz, wei.liu, Cui, Dexuan,
	tglx, mingo, bp, dave.hansen, x86, hpa, vkuznets, peterz,
	Shankar, Ravi V

> >+	return cpu_feature_enabled(X86_FEATURE_FRED) &&
> >+	       kvm_is_cr4_bit_set(vcpu, X86_CR4_FRED);
> 
> FRED is enabled when CR4.FRED = IA32_EFER.LMA = 1. Any reason to omit the
> check about long mode?

It won't allow CR4.FRED to be set if not in long mode, and I don't expect
that case at runtime.  Or do you have one?

If you are talking about save/restore a corrupted vCPU state, a following
VM entry should fail anyway.

^ permalink raw reply	[flat|nested] 70+ messages in thread

* RE: [PATCH v1 10/23] KVM: VMX: Add support for FRED context save/restore
  2023-11-13  5:24   ` Chao Gao
@ 2023-11-14  4:48     ` Li, Xin3
  0 siblings, 0 replies; 70+ messages in thread
From: Li, Xin3 @ 2023-11-14  4:48 UTC (permalink / raw)
  To: Gao, Chao
  Cc: kvm, linux-doc, linux-kernel, linux-hyperv, linux-kselftest,
	seanjc, pbonzini, corbet, kys, haiyangz, wei.liu, Cui, Dexuan,
	tglx, mingo, bp, dave.hansen, x86, hpa, vkuznets, peterz,
	Shankar, Ravi V

> > #ifdef CONFIG_X86_64
> > 	MSR_CSTAR, MSR_KERNEL_GS_BASE, MSR_SYSCALL_MASK, MSR_LSTAR,
> >+	MSR_IA32_FRED_RSP0, MSR_IA32_FRED_RSP1, MSR_IA32_FRED_RSP2,
> >+	MSR_IA32_FRED_RSP3, MSR_IA32_FRED_STKLVLS, MSR_IA32_FRED_SSP1,
> >+	MSR_IA32_FRED_SSP2, MSR_IA32_FRED_SSP3, MSR_IA32_FRED_CONFIG,
> 
> Need to handle the case where FRED MSRs are valid but KVM cannot virtualize
> FRED, see kvm_probe_msr_to_save().

Will take care of it, thanks for reminding.

> > #endif
> > 	MSR_IA32_TSC, MSR_IA32_CR_PAT, MSR_VM_HSAVE_PA,
> > 	MSR_IA32_FEAT_CTL, MSR_IA32_BNDCFGS, MSR_TSC_AUX, @@ -1890,6
> +1893,16
> >@@ static int __kvm_set_msr(struct kvm_vcpu *vcpu, u32 index, u64 data,
> >
> > 		data = (u32)data;
> > 		break;
> >+	case MSR_IA32_FRED_RSP0 ... MSR_IA32_FRED_CONFIG:
> >+		if (host_initiated || guest_cpuid_has(vcpu, X86_FEATURE_FRED))
> >+			break;
> 
> Nothing guarantees FRED MSRs/VMCS fields exist on the hardware here. Probably
> use guest_cpu_cap_has()*.

Ah, my bad!
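
Roughly like this, I think (just a sketch; guest_cpu_cap_has() is the helper
from the series linked below, not in tree yet):

	case MSR_IA32_FRED_RSP0 ... MSR_IA32_FRED_CONFIG:
		if (!host_initiated &&
		    !guest_cpu_cap_has(vcpu, X86_FEATURE_FRED))
			return 1;
		break;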

> *: https://lore.kernel.org/kvm/20231110235528.1561679-1-seanjc@google.com
> >+
> >+		/*
> >+		 * Inject #GP upon FRED MSRs accesses from a non-FRED guest to
> >+		 * make sure no malicious guest can write to FRED MSRs thus to
> >+		 * corrupt host FRED MSRs.
> >+		 */
> 
> I think injecting #GP here is simply because KVM should emulate hardware
> behavior. To me, preventing guest from corrupting FRED MSRs is at most a
> byproduct. I prefer to drop the comment.

From security POV, this is important to mention.

^ permalink raw reply	[flat|nested] 70+ messages in thread

* RE: [PATCH v1 09/23] KVM: VMX: Switch FRED RSP0 between host and guest
  2023-11-13  3:47   ` Chao Gao
@ 2023-11-14  5:17     ` Li, Xin3
  2023-11-14  7:47       ` Chao Gao
  0 siblings, 1 reply; 70+ messages in thread
From: Li, Xin3 @ 2023-11-14  5:17 UTC (permalink / raw)
  To: Gao, Chao
  Cc: kvm, linux-doc, linux-kernel, linux-hyperv, linux-kselftest,
	seanjc, pbonzini, corbet, kys, haiyangz, wei.liu, Cui, Dexuan,
	tglx, mingo, bp, dave.hansen, x86, hpa, vkuznets, peterz,
	Shankar, Ravi V

> >+	if (cpu_feature_enabled(X86_FEATURE_FRED) &&
> >+	    guest_cpuid_has(&vmx->vcpu, X86_FEATURE_FRED)) {
> 
> IIUC, vmx_prepare_switch_to_host() is called from IRQ-disabled context. using
> guest_cpuid_has() in this context is not desired, see
> lockdep_assert_irqs_enabled() in cpuid_entry2_find().

Nice catch!

Anyway, it's a bad idea to do such a lookup here; let me find a better way
for all FRED CPUID checks.
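
One option (purely a sketch, the field name is made up): cache the lookup when
guest CPUID is set, e.g. in vmx_vcpu_after_set_cpuid(), so that IRQ-disabled
paths only test a flag:

	vmx->guest_fred_enabled = cpu_feature_enabled(X86_FEATURE_FRED) &&
				  guest_cpuid_has(vcpu, X86_FEATURE_FRED);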

> >diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h index
> >f8c02bd37069..328a3447f064 100644
> >--- a/arch/x86/kvm/vmx/vmx.h
> >+++ b/arch/x86/kvm/vmx/vmx.h
> >@@ -276,6 +276,8 @@ struct vcpu_vmx {
> > #ifdef CONFIG_X86_64
> > 	u64		      msr_host_kernel_gs_base;
> > 	u64		      msr_guest_kernel_gs_base;
> >+	u64		      msr_host_fred_rsp0;
> >+	u64		      msr_guest_fred_rsp0;
> 
> resetting guest fred rsp0 to 0 during vcpu reset is missing.

hmm, I assume it gets the same treatment as guest_kernel_gs_base.

It seems we don't reset guest_kernel_gs_base.  No?

^ permalink raw reply	[flat|nested] 70+ messages in thread

* RE: [PATCH v1 08/23] KVM: VMX: Initialize VMCS FRED fields
  2023-11-13  3:04   ` Chao Gao
@ 2023-11-14  6:02     ` Li, Xin3
  2023-11-14  6:51       ` Li, Xin3
  0 siblings, 1 reply; 70+ messages in thread
From: Li, Xin3 @ 2023-11-14  6:02 UTC (permalink / raw)
  To: Gao, Chao
  Cc: kvm, linux-doc, linux-kernel, linux-hyperv, linux-kselftest,
	seanjc, pbonzini, corbet, kys, haiyangz, wei.liu, Cui, Dexuan,
	tglx, mingo, bp, dave.hansen, x86, hpa, vkuznets, peterz,
	Shankar, Ravi V

> >@@ -1477,6 +1477,18 @@ void vmx_vcpu_load_vmcs(struct kvm_vcpu *vcpu, int cpu,
> > 				    (unsigned long)(cpu_entry_stack(cpu) + 1));
> > 		}
> >
> >+#ifdef CONFIG_X86_64
> >+		/* Per-CPU FRED MSRs */
> >+		if (cpu_feature_enabled(X86_FEATURE_FRED)) {
> 
> How about kvm_cpu_cap_has()? To decouple KVM's capability to virtualize a feature
> from the host's enabling of the feature.

Very likely I guess.

> >+			vmcs_write64(HOST_IA32_FRED_RSP1, read_msr(MSR_IA32_FRED_RSP1));
> >+			vmcs_write64(HOST_IA32_FRED_RSP2, read_msr(MSR_IA32_FRED_RSP2));
> >+			vmcs_write64(HOST_IA32_FRED_RSP3, read_msr(MSR_IA32_FRED_RSP3));
> >+			vmcs_write64(HOST_IA32_FRED_SSP1, read_msr(MSR_IA32_FRED_SSP1));
> >+			vmcs_write64(HOST_IA32_FRED_SSP2, read_msr(MSR_IA32_FRED_SSP2));
> >+			vmcs_write64(HOST_IA32_FRED_SSP3, read_msr(MSR_IA32_FRED_SSP3));
> >+		}
> >+#endif
> 
> why is this hunk enclosed in #ifdef CONFIG_X86_64 while the one below isn't?

If the compiler doesn't complain, I should NOT add it.

> 
> >+	if (cpu_feature_enabled(X86_FEATURE_FRED)) {
> >+		vmcs_write64(GUEST_IA32_FRED_CONFIG, 0);
> >+		vmcs_write64(GUEST_IA32_FRED_RSP1, 0);
> >+		vmcs_write64(GUEST_IA32_FRED_RSP2, 0);
> >+		vmcs_write64(GUEST_IA32_FRED_RSP3, 0);
> >+		vmcs_write64(GUEST_IA32_FRED_STKLVLS, 0);
> >+		vmcs_write64(GUEST_IA32_FRED_SSP1, 0);
> >+		vmcs_write64(GUEST_IA32_FRED_SSP2, 0);
> >+		vmcs_write64(GUEST_IA32_FRED_SSP3, 0);
> >+	}
> >+
> 
> move this hunk to __vmx_vcpu_reset() because FRED spec says
> 
> "INIT does not change the value of the new MSRs."
> 

Yeah, will do.

^ permalink raw reply	[flat|nested] 70+ messages in thread

* RE: [PATCH v1 08/23] KVM: VMX: Initialize VMCS FRED fields
  2023-11-14  6:02     ` Li, Xin3
@ 2023-11-14  6:51       ` Li, Xin3
  0 siblings, 0 replies; 70+ messages in thread
From: Li, Xin3 @ 2023-11-14  6:51 UTC (permalink / raw)
  To: Gao, Chao
  Cc: kvm, linux-doc, linux-kernel, linux-hyperv, linux-kselftest,
	seanjc, pbonzini, corbet, kys, haiyangz, wei.liu, Dexuan Cui,
	tglx, mingo, bp, dave.hansen, x86, hpa, vkuznets, peterz,
	Shankar, Ravi V

> 
> > >+			vmcs_write64(HOST_IA32_FRED_RSP1, read_msr(MSR_IA32_FRED_RSP1));
> > >+			vmcs_write64(HOST_IA32_FRED_RSP2, read_msr(MSR_IA32_FRED_RSP2));
> > >+			vmcs_write64(HOST_IA32_FRED_RSP3, read_msr(MSR_IA32_FRED_RSP3));
> > >+			vmcs_write64(HOST_IA32_FRED_SSP1, read_msr(MSR_IA32_FRED_SSP1));
> > >+			vmcs_write64(HOST_IA32_FRED_SSP2, read_msr(MSR_IA32_FRED_SSP2));
> > >+			vmcs_write64(HOST_IA32_FRED_SSP3, read_msr(MSR_IA32_FRED_SSP3));
> > >+		}
> > >+#endif
> >
> > why is this hunk enclosed in #ifdef CONFIG_X86_64 while the one below isn't?
> 
> If the compiler doesn't complain, I should NOT add it.

I think I don't need to add CONFIG_X86_64 for the above, but somehow
this was left over.  Let me double check.

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v1 13/23] KVM: VMX: Handle VMX nested exception for FRED
  2023-11-08 18:29 ` [PATCH v1 13/23] KVM: VMX: Handle VMX nested exception for FRED Xin Li
@ 2023-11-14  7:40   ` Chao Gao
  2023-11-15  3:03     ` Li, Xin3
  0 siblings, 1 reply; 70+ messages in thread
From: Chao Gao @ 2023-11-14  7:40 UTC (permalink / raw)
  To: Xin Li
  Cc: kvm, linux-doc, linux-kernel, linux-hyperv, linux-kselftest,
	seanjc, pbonzini, corbet, kys, haiyangz, wei.liu, decui, tglx,
	mingo, bp, dave.hansen, x86, hpa, vkuznets, peterz,
	ravi.v.shankar

> 	/* Require Write-Back (WB) memory type for VMCS accesses. */
>@@ -7313,11 +7328,12 @@ static void __vmx_complete_interrupts(struct kvm_vcpu *vcpu,
> 			}
> 		}
> 
>-		if (idt_vectoring_info & VECTORING_INFO_DELIVER_CODE_MASK) {
>-			u32 err = vmcs_read32(error_code_field);
>-			kvm_requeue_exception_e(vcpu, vector, err);
>-		} else
>-			kvm_requeue_exception(vcpu, vector);
>+		if (idt_vectoring_info & VECTORING_INFO_DELIVER_CODE_MASK)
>+			kvm_requeue_exception_e(vcpu, vector, vmcs_read32(error_code_field),
>+						idt_vectoring_info & INTR_INFO_NESTED_EXCEPTION_MASK);
>+		else
>+			kvm_requeue_exception(vcpu, vector,
>+					      idt_vectoring_info & INTR_INFO_NESTED_EXCEPTION_MASK);

Exiting-event identification can also have bit 13 set, indicating a nested
exception encountered and caused VM-exit. when reinjecting the exception to
guests, kvm needs to set the "nested" bit, right? I suspect some changes
to e.g., handle_exception_nmi() are needed.

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v1 09/23] KVM: VMX: Switch FRED RSP0 between host and guest
  2023-11-14  5:17     ` Li, Xin3
@ 2023-11-14  7:47       ` Chao Gao
  2023-11-15  3:04         ` Li, Xin3
  0 siblings, 1 reply; 70+ messages in thread
From: Chao Gao @ 2023-11-14  7:47 UTC (permalink / raw)
  To: Li, Xin3
  Cc: kvm, linux-doc, linux-kernel, linux-hyperv, linux-kselftest,
	seanjc, pbonzini, corbet, kys, haiyangz, wei.liu, Cui, Dexuan,
	tglx, mingo, bp, dave.hansen, x86, hpa, vkuznets, peterz,
	Shankar, Ravi V

>> >diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h index
>> >f8c02bd37069..328a3447f064 100644
>> >--- a/arch/x86/kvm/vmx/vmx.h
>> >+++ b/arch/x86/kvm/vmx/vmx.h
>> >@@ -276,6 +276,8 @@ struct vcpu_vmx {
>> > #ifdef CONFIG_X86_64
>> > 	u64		      msr_host_kernel_gs_base;
>> > 	u64		      msr_guest_kernel_gs_base;
>> >+	u64		      msr_host_fred_rsp0;
>> >+	u64		      msr_guest_fred_rsp0;
>> 
>> resetting guest fred rsp0 to 0 during vcpu reset is missing.
>
>hmm, I assume it gets the same treatment as guest_kernel_gs_base.
>
>It seems we don't reset guest_kernel_gs_base.  No?

Yes. But for fred MSRs, FRED spec clearly says their RESET values
are 0s. for kernel_gs_base MSR, looks there is no such description
in SDM.

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v1 11/23] KVM: x86: Add kvm_is_fred_enabled()
  2023-11-14  4:42     ` Li, Xin3
@ 2023-11-14  8:16       ` Chao Gao
  2023-11-14 18:57         ` Li, Xin3
  0 siblings, 1 reply; 70+ messages in thread
From: Chao Gao @ 2023-11-14  8:16 UTC (permalink / raw)
  To: Li, Xin3
  Cc: kvm, linux-doc, linux-kernel, linux-hyperv, linux-kselftest,
	seanjc, pbonzini, corbet, kys, haiyangz, wei.liu, Cui, Dexuan,
	tglx, mingo, bp, dave.hansen, x86, hpa, vkuznets, peterz,
	Shankar, Ravi V

On Tue, Nov 14, 2023 at 12:42:13PM +0800, Li, Xin3 wrote:
>> >+	return cpu_feature_enabled(X86_FEATURE_FRED) &&
>> >+	       kvm_is_cr4_bit_set(vcpu, X86_CR4_FRED);
>> 
>> FRED is enabled when CR4.FRED = IA32_EFER.LMA = 1. Any reason to omit the
>> check about long mode?
>
>It won't allow CR4.FRED to be set if not in long mode; I don't expect it
>at runtime.  Or do you have one?

I was thinking about a very contrived case:

1. the CPU enters 64-bit long mode and sets CR4.FRED
2. the CPU switches out of 64-bit long mode

and SDM vol3 chapter 2.5 CONTROL REGISTERS says:

A 64-bit capable processor will retain the upper 32 bits of each control
register when transitioning out of IA-32e mode.

so, to me, it is possible that CR4.FRED is 1 while IA32_EFER.LMA is 0.
and in this case, FRED should be considered disabled.


Anyway, I think we should align with FRED SPEC. If we deliberately omit
the check about long mode, please add a comment to explain why it is ok
to do that.

>
>If you are talking about save/restore a corrupted vCPU state, a following
>VM entry should fail anyway.

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v1 12/23] KVM: VMX: Handle FRED event data
  2023-11-14  4:34     ` Li, Xin3
@ 2023-11-14  8:58       ` Chao Gao
  2023-11-15  2:52         ` Li, Xin3
  0 siblings, 1 reply; 70+ messages in thread
From: Chao Gao @ 2023-11-14  8:58 UTC (permalink / raw)
  To: Li, Xin3
  Cc: kvm, linux-doc, linux-kernel, linux-hyperv, linux-kselftest,
	seanjc, pbonzini, corbet, kys, haiyangz, wei.liu, Cui, Dexuan,
	tglx, mingo, bp, dave.hansen, x86, hpa, vkuznets, peterz,
	Shankar, Ravi V

On Tue, Nov 14, 2023 at 12:34:02PM +0800, Li, Xin3 wrote:
>> >+			else if (is_nm_fault(intr_info) &&
>> >+				 vcpu->arch.guest_fpu.fpstate->xfd)
>> 
>> does this necessarily mean the #NM is caused by XFD?
>
>Then the event data should be 0.  Or did I miss something obvious?  I.e.,
>it can easily be differentiated and we should just explicitly set it
>to 0.

vcpu->arch.guest_fpu.fpstate->xfd just means the guest is enabling XFD.
I don't think we can conclude that this #NM is caused by XFD only from
this.  i.e., there may be some false positives.

>> >+			u64 event_data = vmcs_read64(event_data_field);
>> >+
>> >+			switch (vector) {
>> >+			case DB_VECTOR:
>> >+				get_debugreg(vcpu->arch.dr6, 6);
>> >+				WARN_ON(vcpu->arch.dr6 != (event_data ^ DR6_RESERVED));
>> >+				vcpu->arch.dr6 = event_data ^ DR6_RESERVED;
>> >+				break;
>> >+			case NM_VECTOR:
>> >+				if (vcpu->arch.guest_fpu.fpstate->xfd) {
>> >+					rdmsrl(MSR_IA32_XFD_ERR, vcpu->arch.guest_fpu.xfd_err);
>> >+					WARN_ON(vcpu->arch.guest_fpu.xfd_err != event_data);
>> >+					vcpu->arch.guest_fpu.xfd_err = event_data;
>> >+				} else {
>> >+					WARN_ON(event_data != 0);
>> >+				}
>> >+				break;
>> >+			case PF_VECTOR:
>> >+				WARN_ON(vcpu->arch.cr2 != event_data);
>> >+				vcpu->arch.cr2 = event_data;
>> >+				break;
>> >+			default:
>> >+				WARN_ON(event_data != 0);
>> 
>> I am not sure if this WARN_ON() can be triggered by nested VMX. It is legitimate
>> for L1 VMM to inject any event w/ an event_data.
>> 
>> FRED spec says:
>> 
>> Section 5.2.1 specifies the event data that FRED event delivery of certain events
>> saves on the stack. When FRED event delivery is used for an event injected by VM
>> entry, the event data saved is the value of the injected-event-data field in the
>> VMCS. This value is used instead of what is specified in Section 5.2.1 and is done
>> for __ALL__ injected events using FRED event delivery
>
>5.2.1 Saving Information on the Regular Stack also says:
>- For any other event, the event data are not currently defined and will
>  be zero until they are.
>
>Or you mean something else?

IIUC, L1 KVM can inject a nested exception whose vector isn't #DB, or #NM or
#PF with a non-zero event_data to L2. If delivering the nested exception causes a
VM-exit to L0 KVM, the assertion that event_data is always 0 for vectors other
than #DB/#NM/#PF fails.

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v1 14/23] KVM: VMX: Dump FRED context in dump_vmcs()
  2023-11-08 18:29 ` [PATCH v1 14/23] KVM: VMX: Dump FRED context in dump_vmcs() Xin Li
@ 2023-11-14 14:36   ` Nikolay Borisov
  2023-11-15  2:41     ` Li, Xin3
  0 siblings, 1 reply; 70+ messages in thread
From: Nikolay Borisov @ 2023-11-14 14:36 UTC (permalink / raw)
  To: Xin Li, kvm, linux-doc, linux-kernel, linux-hyperv, linux-kselftest
  Cc: seanjc, pbonzini, corbet, kys, haiyangz, wei.liu, decui, tglx,
	mingo, bp, dave.hansen, x86, hpa, vkuznets, peterz,
	ravi.v.shankar



On 8.11.23 г. 20:29 ч., Xin Li wrote:
> Add FRED related VMCS fields to dump_vmcs() to have it dump FRED context.
> 
> Tested-by: Shan Kang <shan.kang@intel.com>
> Signed-off-by: Xin Li <xin3.li@intel.com>
> ---
>   arch/x86/kvm/vmx/vmx.c | 48 ++++++++++++++++++++++++++++++++++++------
>   1 file changed, 41 insertions(+), 7 deletions(-)
> 
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 518e68ee5a0d..b826dc188fc7 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -6429,7 +6429,7 @@ void dump_vmcs(struct kvm_vcpu *vcpu)
>   	struct vcpu_vmx *vmx = to_vmx(vcpu);
>   	u32 vmentry_ctl, vmexit_ctl;
>   	u32 cpu_based_exec_ctrl, pin_based_exec_ctrl, secondary_exec_control;
> -	u64 tertiary_exec_control;
> +	u64 tertiary_exec_control, secondary_vmexit_ctl;
>   	unsigned long cr4;
>   	int efer_slot;
>   
> @@ -6440,6 +6440,8 @@ void dump_vmcs(struct kvm_vcpu *vcpu)
>   
>   	vmentry_ctl = vmcs_read32(VM_ENTRY_CONTROLS);
>   	vmexit_ctl = vmcs_read32(VM_EXIT_CONTROLS);
> +	secondary_vmexit_ctl = cpu_has_secondary_vmexit_ctrls() ?
> +			       vmcs_read64(SECONDARY_VM_EXIT_CONTROLS) : 0;
>   	cpu_based_exec_ctrl = vmcs_read32(CPU_BASED_VM_EXEC_CONTROL);
>   	pin_based_exec_ctrl = vmcs_read32(PIN_BASED_VM_EXEC_CONTROL);
>   	cr4 = vmcs_readl(GUEST_CR4);
> @@ -6486,6 +6488,19 @@ void dump_vmcs(struct kvm_vcpu *vcpu)
>   	vmx_dump_sel("LDTR:", GUEST_LDTR_SELECTOR);
>   	vmx_dump_dtsel("IDTR:", GUEST_IDTR_LIMIT);
>   	vmx_dump_sel("TR:  ", GUEST_TR_SELECTOR);
> +#ifdef CONFIG_X86_64
> +	if (cpu_feature_enabled(X86_FEATURE_FRED)) {

Shouldn't this be gated on whether FRED is enabled in kvm, aka the CPUID
bit is enumerated?

<snip>

^ permalink raw reply	[flat|nested] 70+ messages in thread

* RE: [PATCH v1 11/23] KVM: x86: Add kvm_is_fred_enabled()
  2023-11-14  8:16       ` Chao Gao
@ 2023-11-14 18:57         ` Li, Xin3
  2023-11-20  9:04           ` Li, Xin3
  0 siblings, 1 reply; 70+ messages in thread
From: Li, Xin3 @ 2023-11-14 18:57 UTC (permalink / raw)
  To: Gao, Chao
  Cc: kvm, linux-doc, linux-kernel, linux-hyperv, linux-kselftest,
	seanjc, pbonzini, corbet, kys, haiyangz, wei.liu, Dexuan Cui,
	tglx, mingo, bp, dave.hansen, x86, hpa, vkuznets, peterz,
	Shankar, Ravi V


> >> FRED is enabled when CR4.FRED = IA32_EFER.LMA = 1. Any reason to omit the
> >> check about long mode?
> >
> >It won't allow CR4.FRED to be set if not in long mode; I don't
> >expect it at runtime.  Or do you have one?
> 
> I was thinking about a very contrived case:
> 
> 1. the CPU enters 64-bit long mode and sets CR4.FRED
> 2. the CPU switches out of 64-bit long mode
> 
> and SDM vol3 chapter 2.5 CONTROL REGISTERS says:
> 
> A 64-bit capable processor will retain the upper 32 bits of each control
> register when transitioning out of IA-32e mode.
> 
> so, to me, it is possible that CR4.FRED is 1 while IA32_EFER.LMA is 0.
> and in this case, FRED should be considered disabled.

You're correct, this is a solid case.

It's not a one-way implication; I forgot about the other direction.

> 
> Anyway, I think we should align with FRED SPEC. If we deliberately omit
> the check about long mode, please add a comment to explain why it is ok
> to do that.

Yeah,  I will add it.

^ permalink raw reply	[flat|nested] 70+ messages in thread

* RE: [PATCH v1 05/23] KVM: VMX: Initialize FRED VM entry/exit controls in vmcs_config
  2023-11-13 17:18   ` Nikolay Borisov
@ 2023-11-15  2:39     ` Li, Xin3
  0 siblings, 0 replies; 70+ messages in thread
From: Li, Xin3 @ 2023-11-15  2:39 UTC (permalink / raw)
  To: Nikolay Borisov, kvm, linux-doc, linux-kernel, linux-hyperv,
	linux-kselftest
  Cc: seanjc, pbonzini, corbet, kys, haiyangz, wei.liu, Cui, Dexuan,
	tglx, mingo, bp, dave.hansen, x86, hpa, vkuznets, peterz,
	Shankar, Ravi V

> > -	if (_vmexit_control & VM_EXIT_ACTIVATE_SECONDARY_CONTROLS)
> > +	if (_vmexit_control & VM_EXIT_ACTIVATE_SECONDARY_CONTROLS) {
> >   		_secondary_vmexit_control =
> >   			adjust_vmx_controls64(KVM_OPTIONAL_VMX_SECONDARY_VM_EXIT_CONTROLS,
> >   					      MSR_IA32_VMX_EXIT_CTLS2);
> > +		if (cpu_feature_enabled(X86_FEATURE_FRED) &&
> > +		    !(_secondary_vmexit_control & SECONDARY_VM_EXIT_SAVE_IA32_FRED &&
> > +		      _secondary_vmexit_control & SECONDARY_VM_EXIT_LOAD_IA32_FRED)) {
> 
> Can those checks actually trigger? I.e if FEATURE_FRED is set it means
> the CPU implements the FRED spec. According to the spec it's guaranteed
> that VMX_EXIT_CTLS2 will contain those bits set to 1. So aren't those
> checks superfluous?

Such checks are for nVMX FRED.

> > +	if (cpu_feature_enabled(X86_FEATURE_FRED) &&
> > +	    !(_vmentry_control & VM_ENTRY_LOAD_IA32_FRED)) {
> DITTO
> 
> <snip>

^ permalink raw reply	[flat|nested] 70+ messages in thread

* RE: [PATCH v1 14/23] KVM: VMX: Dump FRED context in dump_vmcs()
  2023-11-14 14:36   ` Nikolay Borisov
@ 2023-11-15  2:41     ` Li, Xin3
  0 siblings, 0 replies; 70+ messages in thread
From: Li, Xin3 @ 2023-11-15  2:41 UTC (permalink / raw)
  To: Nikolay Borisov, kvm, linux-doc, linux-kernel, linux-hyperv,
	linux-kselftest
  Cc: seanjc, pbonzini, corbet, kys, haiyangz, wei.liu, Cui, Dexuan,
	tglx, mingo, bp, dave.hansen, x86, hpa, vkuznets, peterz,
	Shankar, Ravi V

> > @@ -6440,6 +6440,8 @@ void dump_vmcs(struct kvm_vcpu *vcpu)
> >
> >   	vmentry_ctl = vmcs_read32(VM_ENTRY_CONTROLS);
> >   	vmexit_ctl = vmcs_read32(VM_EXIT_CONTROLS);
> > +	secondary_vmexit_ctl = cpu_has_secondary_vmexit_ctrls() ?
> > +			       vmcs_read64(SECONDARY_VM_EXIT_CONTROLS) : 0;
> >   	cpu_based_exec_ctrl = vmcs_read32(CPU_BASED_VM_EXEC_CONTROL);
> >   	pin_based_exec_ctrl = vmcs_read32(PIN_BASED_VM_EXEC_CONTROL);
> >   	cr4 = vmcs_readl(GUEST_CR4);
> > @@ -6486,6 +6488,19 @@ void dump_vmcs(struct kvm_vcpu *vcpu)
> >   	vmx_dump_sel("LDTR:", GUEST_LDTR_SELECTOR);
> >   	vmx_dump_dtsel("IDTR:", GUEST_IDTR_LIMIT);
> >   	vmx_dump_sel("TR:  ", GUEST_TR_SELECTOR);
> > +#ifdef CONFIG_X86_64
> > +	if (cpu_feature_enabled(X86_FEATURE_FRED)) {
> 
> Shouldn't this be gated on whether FRED is enabled in kvm, aka the CPUID bit is
> enumerated?

Yeah, that is more accurate.
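
I.e., something like (a sketch; the exact predicate is still open):

	if (guest_cpuid_has(vcpu, X86_FEATURE_FRED)) {
		/* dump the guest FRED VMCS fields */
	}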

^ permalink raw reply	[flat|nested] 70+ messages in thread

* RE: [PATCH v1 12/23] KVM: VMX: Handle FRED event data
  2023-11-14  8:58       ` Chao Gao
@ 2023-11-15  2:52         ` Li, Xin3
  2023-11-16  2:39           ` Chao Gao
  0 siblings, 1 reply; 70+ messages in thread
From: Li, Xin3 @ 2023-11-15  2:52 UTC (permalink / raw)
  To: Gao, Chao
  Cc: kvm, linux-doc, linux-kernel, linux-hyperv, linux-kselftest,
	seanjc, pbonzini, corbet, kys, haiyangz, wei.liu, Cui, Dexuan,
	tglx, mingo, bp, dave.hansen, x86, hpa, vkuznets, peterz,
	Shankar, Ravi V

> >> >+			else if (is_nm_fault(intr_info) &&
> >> >+				 vcpu->arch.guest_fpu.fpstate->xfd)
> >>
> >> does this necessarily mean the #NM is caused by XFD?
> >
> >Then the event data should be 0.  Or did I miss something obvious?  I.e.,
> >it can easily be differentiated and we should just explicitly set it to
> >0.
> 
> vcpu->arch.guest_fpu.fpstate->xfd just means the guest is enabling XFD.
> I don't think we can conclude that this #NM is caused by XFD only from this.  i.e.,
> there may be some false positives.

Then we should get 0 in event data.  Otherwise, it is a bug in how we deal
with IA32_XFD_ERR MSR, w/ or w/o FRED.

> >> >+			default:
> >> >+				WARN_ON(event_data != 0);
> >>
> >> I am not sure if this WARN_ON() can be triggered by nested VMX. It is
> >> legitimate for L1 VMM to inject any event w/ an event_data.
> >>
> >> FRED spec says:
> >>
> >> Section 5.2.1 specifies the event data that FRED event delivery of
> >> certain events saves on the stack. When FRED event delivery is used
> >> for an event injected by VM entry, the event data saved is the value
> >> of the injected-event-data field in the VMCS. This value is used
> >> instead of what is specified in Section 5.2.1 and is done for __ALL__
> >> injected events using FRED event delivery
> >
> >5.2.1 Saving Information on the Regular Stack also says:
> >- For any other event, the event data are not currently defined and
> >will
> >  be zero until they are.
> >
> >Or you mean something else?
> 
> IIUC, L1 KVM can inject a nested exception whose vector isn't #DB, or #NM or #PF
> with a non-zero event_data to L2.

No, this is not allowed.

> If delivering the nested exception causes a VM-
> exit to L0 KVM, the assertion that event_data is always 0 for vectors other than
> #DB/#NM/#PF fails.

^ permalink raw reply	[flat|nested] 70+ messages in thread

* RE: [PATCH v1 13/23] KVM: VMX: Handle VMX nested exception for FRED
  2023-11-14  7:40   ` Chao Gao
@ 2023-11-15  3:03     ` Li, Xin3
  2023-12-06  8:37       ` Li, Xin3
  0 siblings, 1 reply; 70+ messages in thread
From: Li, Xin3 @ 2023-11-15  3:03 UTC (permalink / raw)
  To: Gao, Chao
  Cc: kvm, linux-doc, linux-kernel, linux-hyperv, linux-kselftest,
	seanjc, pbonzini, corbet, kys, haiyangz, wei.liu, Cui, Dexuan,
	tglx, mingo, bp, dave.hansen, x86, hpa, vkuznets, peterz,
	Shankar, Ravi V

> >+		if (idt_vectoring_info & VECTORING_INFO_DELIVER_CODE_MASK)
> >+			kvm_requeue_exception_e(vcpu, vector, vmcs_read32(error_code_field),
> >+						idt_vectoring_info & INTR_INFO_NESTED_EXCEPTION_MASK);
> >+		else
> >+			kvm_requeue_exception(vcpu, vector,
> >+					      idt_vectoring_info & INTR_INFO_NESTED_EXCEPTION_MASK);
> 
> Exiting-event identification can also have bit 13 set, indicating a nested
> exception encountered and caused VM-exit. when reinjecting the exception to
> guests, kvm needs to set the "nested" bit, right? I suspect some changes
> to e.g., handle_exception_nmi() are needed.

The current patch relies on kvm_multiple_exception() to do that.  But TBH,
I'm not sure it can recognize all nested cases.  I probably should revisit
it.

^ permalink raw reply	[flat|nested] 70+ messages in thread

* RE: [PATCH v1 09/23] KVM: VMX: Switch FRED RSP0 between host and guest
  2023-11-14  7:47       ` Chao Gao
@ 2023-11-15  3:04         ` Li, Xin3
  0 siblings, 0 replies; 70+ messages in thread
From: Li, Xin3 @ 2023-11-15  3:04 UTC (permalink / raw)
  To: Gao, Chao
  Cc: kvm, linux-doc, linux-kernel, linux-hyperv, linux-kselftest,
	seanjc, pbonzini, corbet, kys, haiyangz, wei.liu, Cui, Dexuan,
	tglx, mingo, bp, dave.hansen, x86, hpa, vkuznets, peterz,
	Shankar, Ravi V

> >> >diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h index
> >> >f8c02bd37069..328a3447f064 100644
> >> >--- a/arch/x86/kvm/vmx/vmx.h
> >> >+++ b/arch/x86/kvm/vmx/vmx.h
> >> >@@ -276,6 +276,8 @@ struct vcpu_vmx {
> >> > #ifdef CONFIG_X86_64
> >> > 	u64		      msr_host_kernel_gs_base;
> >> > 	u64		      msr_guest_kernel_gs_base;
> >> >+	u64		      msr_host_fred_rsp0;
> >> >+	u64		      msr_guest_fred_rsp0;
> >>
> >> resetting guest fred rsp0 to 0 during vcpu reset is missing.
> >
> >hmm, I assume it gets the same treatment as guest_kernel_gs_base.
> >
> >It seems we don't reset guest_kernel_gs_base.  No?
> 
> Yes. But for fred MSRs, FRED spec clearly says their RESET values
> are 0s. for kernel_gs_base MSR, looks there is no such description
> in SDM.

Right, maybe better to set both to 0s.
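
E.g., in vmx_vcpu_reset() (a sketch; the FRED spec says the MSRs reset to 0
and are unchanged by INIT, and KERNEL_GS_BASE would just get the same
treatment):

	if (!init_event) {
		vmx->msr_guest_fred_rsp0 = 0;
		vmx->msr_guest_kernel_gs_base = 0;
	}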


^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v1 06/23] KVM: VMX: Defer enabling FRED MSRs save/load until after set CPUID
  2023-11-14  2:50         ` Li, Xin3
@ 2023-11-15 21:47           ` Sean Christopherson
  0 siblings, 0 replies; 70+ messages in thread
From: Sean Christopherson @ 2023-11-15 21:47 UTC (permalink / raw)
  To: Xin3 Li
  Cc: Chao Gao, kvm, linux-doc, linux-kernel, linux-hyperv,
	linux-kselftest, pbonzini, corbet, kys, haiyangz, wei.liu,
	Dexuan Cui, tglx, mingo, bp, dave.hansen, x86, hpa, vkuznets,
	peterz, Ravi V Shankar

On Tue, Nov 14, 2023, Xin3 Li wrote:
> > Implement what in a different way?  The VMCS fields and FRED are architectural.
> > The internal layout of the VMCS is uarch specific, but the encodings and semantics
> > absolutely cannot change without breaking software.  And if Intel does something
> > asinine like make a control active-low then we have far, far bigger problems.
> 
> I should have made it clear that I wasn't talking at the ISA level.  And
> of course CPU uarch implementations should be transparent to software.
> 
> I mean a CPU uarch could choose to check the activation bit in the VM exit
> controls first and then decide whether to load the 2nd VM exit controls.
> While if resources allow, a CPU uarch could always load the 2nd VM exit
> controls.

And why does that matter?  Loading a field speculatively/out-of-order is fine,
consuming it when it architecturally is supposed to be ignored is not.

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v1 12/23] KVM: VMX: Handle FRED event data
  2023-11-15  2:52         ` Li, Xin3
@ 2023-11-16  2:39           ` Chao Gao
  2023-11-20  8:16             ` Li, Xin3
  0 siblings, 1 reply; 70+ messages in thread
From: Chao Gao @ 2023-11-16  2:39 UTC (permalink / raw)
  To: Li, Xin3
  Cc: kvm, linux-doc, linux-kernel, linux-hyperv, linux-kselftest,
	seanjc, pbonzini, corbet, kys, haiyangz, wei.liu, Cui, Dexuan,
	tglx, mingo, bp, dave.hansen, x86, hpa, vkuznets, peterz,
	Shankar, Ravi V

>> >> >+			default:
>> >> >+				WARN_ON(event_data != 0);
>> >>
>> >> I am not sure if this WARN_ON() can be triggered by nested VMX. It is
>> >> legitimate for L1 VMM to inject any event w/ an event_data.
>> >>
>> >> FRED spec says:
>> >>
>> >> Section 5.2.1 specifies the event data that FRED event delivery of
>> >> certain events saves on the stack. When FRED event delivery is used
>> >> for an event injected by VM entry, the event data saved is the value
>> >> of the injected-event-data field in the VMCS. This value is used
>> >> instead of what is specified in Section 5.2.1 and is done for __ALL__
>> >> injected events using FRED event delivery
>> >
>> >5.2.1 Saving Information on the Regular Stack also says:
>> >- For any other event, the event data are not currently defined and
>> >will
>> >  be zero until they are.
>> >
>> >Or you mean something else?
>> 
>> IIUC, L1 KVM can inject a nested exception whose vector isn't #DB, or #NM or #PF
>> with a non-zero event_data to L2.
>
>No, this is not allowed.

How do you interpret the last sentence:

	Section 5.2.1 specifies the event data that FRED event delivery of
	certain events saves on the stack. When FRED event delivery is used
	for an event injected by VM entry, the event data saved is the value
	of the injected-event-data field in the VMCS. This value is used
	instead of what is specified in Section 5.2.1 and is done for __ALL__
	injected events using FRED event delivery

^ permalink raw reply	[flat|nested] 70+ messages in thread

* RE: [PATCH v1 12/23] KVM: VMX: Handle FRED event data
  2023-11-16  2:39           ` Chao Gao
@ 2023-11-20  8:16             ` Li, Xin3
  0 siblings, 0 replies; 70+ messages in thread
From: Li, Xin3 @ 2023-11-20  8:16 UTC (permalink / raw)
  To: Gao, Chao
  Cc: kvm, linux-doc, linux-kernel, linux-hyperv, linux-kselftest,
	seanjc, pbonzini, corbet, kys, haiyangz, wei.liu, Cui, Dexuan,
	tglx, mingo, bp, dave.hansen, x86, hpa, vkuznets, peterz,
	Shankar, Ravi V

> >> IIUC, L1 KVM can inject a nested exception whose vector isn't #DB, or
> >> #NM or #PF with a non-zero event_data to L2.
> >
> >No, this is not allowed.
> 
> How do you interpret the last sentence:
> 
> 	Section 5.2.1 specifies the event data that FRED event delivery of
> 	certain events saves on the stack. When FRED event delivery is used
> 	for an event injected by VM entry, the event data saved is the value
> 	of the injected-event-data field in the VMCS. This value is used
> 	instead of what is specified in Section 5.2.1 and is done for __ALL__
> 	injected events using FRED event delivery

To me, it means FRED event injection during VM entry simply pushes the
value in the injected-event-data field of the VMCS as event data.  But
the event data definition should comply with Section 5.2.1.  It is a
forward compatibility issue otherwise.

^ permalink raw reply	[flat|nested] 70+ messages in thread

* RE: [PATCH v1 11/23] KVM: x86: Add kvm_is_fred_enabled()
  2023-11-14 18:57         ` Li, Xin3
@ 2023-11-20  9:04           ` Li, Xin3
  0 siblings, 0 replies; 70+ messages in thread
From: Li, Xin3 @ 2023-11-20  9:04 UTC (permalink / raw)
  To: Li, Xin3, Gao, Chao
  Cc: kvm, linux-doc, linux-kernel, linux-hyperv, linux-kselftest,
	seanjc, pbonzini, corbet, kys, haiyangz, wei.liu, Cui, Dexuan,
	tglx, mingo, bp, dave.hansen, x86, hpa, vkuznets, peterz,
	Shankar, Ravi V

> > >> FRED is enabled when CR4.FRED = IA32_EFER.LMA = 1. Any reason to
> > >> omit the check about long mode?
> > >
> > >It won't allow CR4.FRED to be set if not in long mode; I don't
> > >expect it at runtime.  Or do you have one?
> >
> > I was thinking about a very contrived case:
> >
> > 1. the CPU enters 64-bit long mode and sets CR4.FRED 2. the CPU
> > switches out of 64-bit long mode

Actually, SDM3 Section 10.8.5 Initializing IA-32e Mode says: 64-bit mode
consistency checks fail on attempts to enable or disable IA-32e mode
while paging is enabled.  In other words, the CPU allows software to
modify IA32_EFER.LME only when CR0.PG = 0 (i.e., only in legacy mode).
Thus, attempts to do so when CR0.PG = 1 will cause #GP.

Remember FRED only works with a 64-bit kernel, which can be entered only
if paging is enabled.  As such, to clear IA32_EFER.LME, the OS/VMM needs to
turn off paging first.  And to turn off paging, 64-bit mode needs to be
left first (entering compatibility mode in ring 0, because it's not allowed
to disable paging in 64-bit mode).  Thus CR4.FRED needs to be cleared first.

So CR4.FRED fully indicates whether FRED is enabled or not, which is
what FRED spec 5.0 section 9.5 says.  But we can put more explanation
there.
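
E.g., the helper could carry the explanation (a rough sketch; the wording just
mirrors the reasoning above):

	static inline bool kvm_is_fred_enabled(struct kvm_vcpu *vcpu)
	{
		/*
		 * Architecturally FRED is enabled iff CR4.FRED = IA32_EFER.LMA = 1.
		 * However CR4.FRED can only be set in 64-bit mode and has to be
		 * cleared before IA-32e mode can be left, so checking CR4.FRED
		 * alone is sufficient (FRED spec 5.0 section 9.5).
		 */
		return cpu_feature_enabled(X86_FEATURE_FRED) &&
		       kvm_is_cr4_bit_set(vcpu, X86_CR4_FRED);
	}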

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v1 15/23] KVM: nVMX: Add support for the secondary VM exit controls
  2023-11-08 18:29 ` [PATCH v1 15/23] KVM: nVMX: Add support for the secondary VM exit controls Xin Li
  2023-11-09  8:21   ` Jeremi Piotrowski
@ 2023-11-20 15:52   ` Vitaly Kuznetsov
  2023-11-20 17:42     ` Li, Xin3
  1 sibling, 1 reply; 70+ messages in thread
From: Vitaly Kuznetsov @ 2023-11-20 15:52 UTC (permalink / raw)
  To: Xin Li, kvm, linux-doc, linux-kernel, linux-hyperv, linux-kselftest
  Cc: seanjc, pbonzini, corbet, kys, haiyangz, wei.liu, decui, tglx,
	mingo, bp, dave.hansen, x86, hpa, peterz, ravi.v.shankar

Xin Li <xin3.li@intel.com> writes:

> Enable the secondary VM exit controls to prepare for nested FRED.
>
> Tested-by: Shan Kang <shan.kang@intel.com>
> Signed-off-by: Xin Li <xin3.li@intel.com>
> ---
>  Documentation/virt/kvm/x86/nested-vmx.rst |  1 +
>  arch/x86/include/asm/hyperv-tlfs.h        |  1 +
>  arch/x86/kvm/vmx/capabilities.h           |  1 +
>  arch/x86/kvm/vmx/hyperv.c                 | 18 +++++++++++++++++-
>  arch/x86/kvm/vmx/nested.c                 | 18 +++++++++++++++++-
>  arch/x86/kvm/vmx/vmcs12.c                 |  1 +
>  arch/x86/kvm/vmx/vmcs12.h                 |  2 ++
>  arch/x86/kvm/x86.h                        |  2 +-
>  8 files changed, 41 insertions(+), 3 deletions(-)
>
> diff --git a/Documentation/virt/kvm/x86/nested-vmx.rst b/Documentation/virt/kvm/x86/nested-vmx.rst
> index ac2095d41f02..e64ef231f310 100644
> --- a/Documentation/virt/kvm/x86/nested-vmx.rst
> +++ b/Documentation/virt/kvm/x86/nested-vmx.rst
> @@ -217,6 +217,7 @@ struct shadow_vmcs is ever changed.
>  		u16 host_fs_selector;
>  		u16 host_gs_selector;
>  		u16 host_tr_selector;
> +		u64 secondary_vm_exit_controls;
>  	};
>  
>  
> diff --git a/arch/x86/include/asm/hyperv-tlfs.h b/arch/x86/include/asm/hyperv-tlfs.h
> index 2ff26f53cd62..299554708e37 100644
> --- a/arch/x86/include/asm/hyperv-tlfs.h
> +++ b/arch/x86/include/asm/hyperv-tlfs.h
> @@ -616,6 +616,7 @@ struct hv_enlightened_vmcs {
>  	u64 host_ssp;
>  	u64 host_ia32_int_ssp_table_addr;
>  	u64 padding64_6;
> +	u64 secondary_vm_exit_controls;

(I think Jeremi has asked a similar question but just to be sure)

This doesn't seem to be present in the currently available TLFS version
e.g. here:
https://learn.microsoft.com/en-us/virtualization/hyper-v-on-windows/tlfs/datatypes/hv_vmx_enlightened_vmcs

That wouldn't be the first time when TLFS lags behind but as I don't see
anyone from Microsoft signing this off, let me ask: where did you get
this information and, in case it came from someone @microsoft.com, can
we get their sign-off on the patch?

>  } __packed;
>  
>  #define HV_VMX_ENLIGHTENED_CLEAN_FIELD_NONE			0
> diff --git a/arch/x86/kvm/vmx/capabilities.h b/arch/x86/kvm/vmx/capabilities.h
> index e8f3ad0f79ee..caf38a54856c 100644
> --- a/arch/x86/kvm/vmx/capabilities.h
> +++ b/arch/x86/kvm/vmx/capabilities.h
> @@ -38,6 +38,7 @@ struct nested_vmx_msrs {
>  	u32 pinbased_ctls_high;
>  	u32 exit_ctls_low;
>  	u32 exit_ctls_high;
> +	u64 secondary_exit_ctls;
>  	u32 entry_ctls_low;
>  	u32 entry_ctls_high;
>  	u32 misc_low;
> diff --git a/arch/x86/kvm/vmx/hyperv.c b/arch/x86/kvm/vmx/hyperv.c
> index 313b8bb5b8a7..b8cd53601a00 100644
> --- a/arch/x86/kvm/vmx/hyperv.c
> +++ b/arch/x86/kvm/vmx/hyperv.c
> @@ -103,7 +103,10 @@
>  	 VM_EXIT_LOAD_IA32_EFER |					\
>  	 VM_EXIT_CLEAR_BNDCFGS |					\
>  	 VM_EXIT_PT_CONCEAL_PIP |					\
> -	 VM_EXIT_CLEAR_IA32_RTIT_CTL)
> +	 VM_EXIT_CLEAR_IA32_RTIT_CTL |					\
> +	 VM_EXIT_ACTIVATE_SECONDARY_CONTROLS)
> +
> +#define EVMCS1_SUPPORTED_VMEXIT_CTRL2 (0ULL)
>  
>  #define EVMCS1_SUPPORTED_VMENTRY_CTRL					\
>  	(VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR |				\
> @@ -315,6 +318,8 @@ const struct evmcs_field vmcs_field_to_evmcs_1[] = {
>  		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_CONTROL_GRP1),
>  	EVMCS1_FIELD(VM_EXIT_CONTROLS, vm_exit_controls,
>  		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_CONTROL_GRP1),
> +	EVMCS1_FIELD(SECONDARY_VM_EXIT_CONTROLS, secondary_vm_exit_controls,
> +		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_CONTROL_GRP1),
>  	EVMCS1_FIELD(SECONDARY_VM_EXEC_CONTROL, secondary_vm_exec_control,
>  		     HV_VMX_ENLIGHTENED_CLEAN_FIELD_CONTROL_GRP1),
>  	EVMCS1_FIELD(GUEST_ES_LIMIT, guest_es_limit,
> @@ -464,6 +469,7 @@ enum evmcs_revision {
>  
>  enum evmcs_ctrl_type {
>  	EVMCS_EXIT_CTRLS,
> +	EVMCS_2NDEXIT,
>  	EVMCS_ENTRY_CTRLS,
>  	EVMCS_EXEC_CTRL,
>  	EVMCS_2NDEXEC,
> @@ -477,6 +483,9 @@ static const u32 evmcs_supported_ctrls[NR_EVMCS_CTRLS][NR_EVMCS_REVISIONS] = {
>  	[EVMCS_EXIT_CTRLS] = {
>  		[EVMCSv1_LEGACY] = EVMCS1_SUPPORTED_VMEXIT_CTRL,
>  	},
> +	[EVMCS_2NDEXIT] = {
> +		[EVMCSv1_LEGACY] = EVMCS1_SUPPORTED_VMEXIT_CTRL2,
> +	},
>  	[EVMCS_ENTRY_CTRLS] = {
>  		[EVMCSv1_LEGACY] = EVMCS1_SUPPORTED_VMENTRY_CTRL,
>  	},

What's the desired effect here? I.e. why exposing
VM_EXIT_ACTIVATE_SECONDARY_CONTROLS when none of the controls are going
to be exposed?

> @@ -539,6 +548,9 @@ void nested_evmcs_filter_control_msr(struct kvm_vcpu *vcpu, u32 msr_index, u64 *
>  			supported_ctrls &= ~VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL;
>  		ctl_high &= supported_ctrls;
>  		break;
> +	case MSR_IA32_VMX_EXIT_CTLS2:
> +		ctl_low &= evmcs_get_supported_ctls(EVMCS_2NDEXIT);
> +		break;
>  	case MSR_IA32_VMX_ENTRY_CTLS:
>  	case MSR_IA32_VMX_TRUE_ENTRY_CTLS:
>  		supported_ctrls = evmcs_get_supported_ctls(EVMCS_ENTRY_CTRLS);
> @@ -589,6 +601,10 @@ int nested_evmcs_check_controls(struct vmcs12 *vmcs12)
>  					       vmcs12->vm_exit_controls)))
>  		return -EINVAL;
>  
> +	if (CC(!nested_evmcs_is_valid_controls(EVMCS_2NDEXIT,
> +					       vmcs12->secondary_vm_exit_controls)))
> +		return -EINVAL;
> +
>  	if (CC(!nested_evmcs_is_valid_controls(EVMCS_ENTRY_CTRLS,
>  					       vmcs12->vm_entry_controls)))
>  		return -EINVAL;
> diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> index ff07d6e736a2..d6341845df43 100644
> --- a/arch/x86/kvm/vmx/nested.c
> +++ b/arch/x86/kvm/vmx/nested.c
> @@ -1411,6 +1411,7 @@ int vmx_set_vmx_msr(struct kvm_vcpu *vcpu, u32 msr_index, u64 data)
>  	case MSR_IA32_VMX_PINBASED_CTLS:
>  	case MSR_IA32_VMX_PROCBASED_CTLS:
>  	case MSR_IA32_VMX_EXIT_CTLS:
> +	case MSR_IA32_VMX_EXIT_CTLS2:
>  	case MSR_IA32_VMX_ENTRY_CTLS:
>  		/*
>  		 * The "non-true" VMX capability MSRs are generated from the
> @@ -1489,6 +1490,9 @@ int vmx_get_vmx_msr(struct nested_vmx_msrs *msrs, u32 msr_index, u64 *pdata)
>  		if (msr_index == MSR_IA32_VMX_EXIT_CTLS)
>  			*pdata |= VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR;
>  		break;
> +	case MSR_IA32_VMX_EXIT_CTLS2:
> +		*pdata = msrs->secondary_exit_ctls;
> +		break;
>  	case MSR_IA32_VMX_TRUE_ENTRY_CTLS:
>  	case MSR_IA32_VMX_ENTRY_CTLS:
>  		*pdata = vmx_control_msr(
> @@ -1692,6 +1696,8 @@ static void copy_enlightened_to_vmcs12(struct vcpu_vmx *vmx, u32 hv_clean_fields
>  		vmcs12->pin_based_vm_exec_control =
>  			evmcs->pin_based_vm_exec_control;
>  		vmcs12->vm_exit_controls = evmcs->vm_exit_controls;
> +		vmcs12->secondary_vm_exit_controls =
> +			evmcs->secondary_vm_exit_controls;
>  		vmcs12->secondary_vm_exec_control =
>  			evmcs->secondary_vm_exec_control;
>  	}
> @@ -1894,6 +1900,7 @@ static void copy_vmcs12_to_enlightened(struct vcpu_vmx *vmx)
>  	 * evmcs->vmcs_link_pointer = vmcs12->vmcs_link_pointer;
>  	 * evmcs->pin_based_vm_exec_control = vmcs12->pin_based_vm_exec_control;
>  	 * evmcs->vm_exit_controls = vmcs12->vm_exit_controls;
> +	 * evmcs->secondary_vm_exit_controls = vmcs12->secondary_vm_exit_controls;
>  	 * evmcs->secondary_vm_exec_control = vmcs12->secondary_vm_exec_control;
>  	 * evmcs->page_fault_error_code_mask =
>  	 *		vmcs12->page_fault_error_code_mask;
> @@ -2411,6 +2418,11 @@ static void prepare_vmcs02_early(struct vcpu_vmx *vmx, struct loaded_vmcs *vmcs0
>  		exec_control &= ~VM_EXIT_LOAD_IA32_EFER;
>  	vm_exit_controls_set(vmx, exec_control);
>  
> +	if (exec_control & VM_EXIT_ACTIVATE_SECONDARY_CONTROLS) {
> +		exec_control = __secondary_vm_exit_controls_get(vmcs01);
> +		secondary_vm_exit_controls_set(vmx, exec_control);
> +	}
> +
>  	/*
>  	 * Interrupt/Exception Fields
>  	 */
> @@ -6819,13 +6831,17 @@ static void nested_vmx_setup_exit_ctls(struct vmcs_config *vmcs_conf,
>  		VM_EXIT_HOST_ADDR_SPACE_SIZE |
>  #endif
>  		VM_EXIT_LOAD_IA32_PAT | VM_EXIT_SAVE_IA32_PAT |
> -		VM_EXIT_CLEAR_BNDCFGS;
> +		VM_EXIT_CLEAR_BNDCFGS | VM_EXIT_ACTIVATE_SECONDARY_CONTROLS;
>  	msrs->exit_ctls_high |=
>  		VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR |
>  		VM_EXIT_LOAD_IA32_EFER | VM_EXIT_SAVE_IA32_EFER |
>  		VM_EXIT_SAVE_VMX_PREEMPTION_TIMER | VM_EXIT_ACK_INTR_ON_EXIT |
>  		VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL;
>  
> +	/* secondary exit controls */
> +	if (msrs->exit_ctls_high & VM_EXIT_ACTIVATE_SECONDARY_CONTROLS)
> +		rdmsrl(MSR_IA32_VMX_EXIT_CTLS2, msrs->secondary_exit_ctls);
> +
>  	/* We support free control of debug control saving. */
>  	msrs->exit_ctls_low &= ~VM_EXIT_SAVE_DEBUG_CONTROLS;
>  }
> diff --git a/arch/x86/kvm/vmx/vmcs12.c b/arch/x86/kvm/vmx/vmcs12.c
> index 106a72c923ca..98457d7b2b23 100644
> --- a/arch/x86/kvm/vmx/vmcs12.c
> +++ b/arch/x86/kvm/vmx/vmcs12.c
> @@ -73,6 +73,7 @@ const unsigned short vmcs12_field_offsets[] = {
>  	FIELD(PAGE_FAULT_ERROR_CODE_MATCH, page_fault_error_code_match),
>  	FIELD(CR3_TARGET_COUNT, cr3_target_count),
>  	FIELD(VM_EXIT_CONTROLS, vm_exit_controls),
> +	FIELD(SECONDARY_VM_EXIT_CONTROLS, secondary_vm_exit_controls),
>  	FIELD(VM_EXIT_MSR_STORE_COUNT, vm_exit_msr_store_count),
>  	FIELD(VM_EXIT_MSR_LOAD_COUNT, vm_exit_msr_load_count),
>  	FIELD(VM_ENTRY_CONTROLS, vm_entry_controls),
> diff --git a/arch/x86/kvm/vmx/vmcs12.h b/arch/x86/kvm/vmx/vmcs12.h
> index 01936013428b..f50f897b9b5f 100644
> --- a/arch/x86/kvm/vmx/vmcs12.h
> +++ b/arch/x86/kvm/vmx/vmcs12.h
> @@ -185,6 +185,7 @@ struct __packed vmcs12 {
>  	u16 host_gs_selector;
>  	u16 host_tr_selector;
>  	u16 guest_pml_index;
> +	u64 secondary_vm_exit_controls;
>  };
>  
>  /*
> @@ -358,6 +359,7 @@ static inline void vmx_check_vmcs12_offsets(void)
>  	CHECK_OFFSET(host_gs_selector, 992);
>  	CHECK_OFFSET(host_tr_selector, 994);
>  	CHECK_OFFSET(guest_pml_index, 996);
> +	CHECK_OFFSET(secondary_vm_exit_controls, 998);
>  }
>  
>  extern const unsigned short vmcs12_field_offsets[];
> diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
> index 63e543c6834b..96ad139adc3f 100644
> --- a/arch/x86/kvm/x86.h
> +++ b/arch/x86/kvm/x86.h
> @@ -47,7 +47,7 @@ void kvm_spurious_fault(void);
>   * associated feature that KVM supports for nested virtualization.
>   */
>  #define KVM_FIRST_EMULATED_VMX_MSR	MSR_IA32_VMX_BASIC
> -#define KVM_LAST_EMULATED_VMX_MSR	MSR_IA32_VMX_VMFUNC
> +#define KVM_LAST_EMULATED_VMX_MSR	MSR_IA32_VMX_EXIT_CTLS2
>  
>  #define KVM_DEFAULT_PLE_GAP		128
>  #define KVM_VMX_DEFAULT_PLE_WINDOW	4096

-- 
Vitaly


^ permalink raw reply	[flat|nested] 70+ messages in thread

* RE: [PATCH v1 15/23] KVM: nVMX: Add support for the secondary VM exit controls
  2023-11-20 15:52   ` Vitaly Kuznetsov
@ 2023-11-20 17:42     ` Li, Xin3
  0 siblings, 0 replies; 70+ messages in thread
From: Li, Xin3 @ 2023-11-20 17:42 UTC (permalink / raw)
  To: Vitaly Kuznetsov, kvm, linux-doc, linux-kernel, linux-hyperv,
	linux-kselftest
  Cc: seanjc, pbonzini, corbet, kys, haiyangz, wei.liu, Cui, Dexuan,
	tglx, mingo, bp, dave.hansen, x86, hpa, peterz, Shankar, Ravi V

> > diff --git a/arch/x86/include/asm/hyperv-tlfs.h
> > b/arch/x86/include/asm/hyperv-tlfs.h
> > index 2ff26f53cd62..299554708e37 100644
> > --- a/arch/x86/include/asm/hyperv-tlfs.h
> > +++ b/arch/x86/include/asm/hyperv-tlfs.h
> > @@ -616,6 +616,7 @@ struct hv_enlightened_vmcs {
> >  	u64 host_ssp;
> >  	u64 host_ia32_int_ssp_table_addr;
> >  	u64 padding64_6;
> > +	u64 secondary_vm_exit_controls;
> 
> (I think Jeremi has asked a similar question but just to be sure)
> 
> This doesn't seem to be present in the currently available TLFS version e.g. here:
> https://learn.microsoft.com/en-us/virtualization/hyper-v-on-
> windows/tlfs/datatypes/hv_vmx_enlightened_vmcs
> 
> That wouldn't be the first time when TLFS lags behind but as I don't see anyone
> from Microsoft signing this off, let me ask: where did you get this information
> and, in case it came from someone @microsoft.com, can we get their sign-off on
> the patch?

This is being worked on.

> > diff --git a/arch/x86/kvm/vmx/hyperv.c b/arch/x86/kvm/vmx/hyperv.c
> > index 313b8bb5b8a7..b8cd53601a00 100644
> > --- a/arch/x86/kvm/vmx/hyperv.c
> > +++ b/arch/x86/kvm/vmx/hyperv.c
> > @@ -477,6 +483,9 @@ static const u32
> evmcs_supported_ctrls[NR_EVMCS_CTRLS][NR_EVMCS_REVISIONS] = {
> >  	[EVMCS_EXIT_CTRLS] = {
> >  		[EVMCSv1_LEGACY] = EVMCS1_SUPPORTED_VMEXIT_CTRL,
> >  	},
> > +	[EVMCS_2NDEXIT] = {
> > +		[EVMCSv1_LEGACY] = EVMCS1_SUPPORTED_VMEXIT_CTRL2,
> > +	},
> >  	[EVMCS_ENTRY_CTRLS] = {
> >  		[EVMCSv1_LEGACY] = EVMCS1_SUPPORTED_VMENTRY_CTRL,
> >  	},
> 
> What's the desired effect here? I.e. why exposing
> VM_EXIT_ACTIVATE_SECONDARY_CONTROLS when none of the controls are
> going to be exposed?
 
This is wrong for eVMCS v1; I will drop it.
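
I.e., roughly reverting the hunk that extends EVMCS1_SUPPORTED_VMEXIT_CTRL
(sketch):

-	 VM_EXIT_CLEAR_IA32_RTIT_CTL |					\
-	 VM_EXIT_ACTIVATE_SECONDARY_CONTROLS)
+	 VM_EXIT_CLEAR_IA32_RTIT_CTL)

and dropping the EVMCS_2NDEXIT entries added above.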


^ permalink raw reply	[flat|nested] 70+ messages in thread

* RE: [PATCH v1 13/23] KVM: VMX: Handle VMX nested exception for FRED
  2023-11-15  3:03     ` Li, Xin3
@ 2023-12-06  8:37       ` Li, Xin3
  2023-12-07  8:42         ` Chao Gao
  0 siblings, 1 reply; 70+ messages in thread
From: Li, Xin3 @ 2023-12-06  8:37 UTC (permalink / raw)
  To: Li, Xin3, Gao, Chao
  Cc: kvm, linux-doc, linux-kernel, linux-hyperv, linux-kselftest,
	seanjc, pbonzini, corbet, kys, haiyangz, wei.liu, Cui, Dexuan,
	tglx, mingo, bp, dave.hansen, x86, hpa, vkuznets, peterz,
	Shankar, Ravi V

> Subject: RE: [PATCH v1 13/23] KVM: VMX: Handle VMX nested exception for FRED
> 
> > >+		if (idt_vectoring_info & VECTORING_INFO_DELIVER_CODE_MASK)
> > >+			kvm_requeue_exception_e(vcpu, vector, vmcs_read32(error_code_field),
> > >+						idt_vectoring_info & INTR_INFO_NESTED_EXCEPTION_MASK);
> > >+		else
> > >+			kvm_requeue_exception(vcpu, vector,
> > >+					      idt_vectoring_info & INTR_INFO_NESTED_EXCEPTION_MASK);
> >
> > Exiting-event identification can also have bit 13 set, indicating a
> > nested exception encountered and caused VM-exit. when reinjecting the
> > exception to guests, kvm needs to set the "nested" bit, right? I
> > suspect some changes to e.g., handle_exception_nmi() are needed.
> 
> The current patch relies on kvm_multiple_exception() to do that.  But TBH, I'm
> not sure it can recognize all nested cases.  I probably should revisit it.

So the conclusion is that kvm_multiple_exception() is smart enough, and
a VMM doesn't have to check bit 13 of the Exiting-event identification.

In FRED spec 5.0, section 9.2 - New VMX Feature: VMX Nested-Exception
Support, there is a statement at the end of Exiting-event identification:

(The value of this bit is always identical to that of the valid bit of
the original-event identification field.)

It means that even w/o VMX Nested-Exception support, a VMM already knows
if an exception is a nested exception encountered during delivery of
another event in an exception caused VM exit (exit reason 0).  This is
done in KVM through reading IDT_VECTORING_INFO_FIELD and calling
vmx_complete_interrupts() immediately after VM exits.

vmx_complete_interrupts() simply queues the original exception if there is
one, and later the nested exception causing the VM exit could be cancelled
if it is a shadow page fault.  However if the shadow page fault is caused
by a guest page fault, KVM injects it as a nested exception to have guest
fix its page table.

I will add comments about this background in the next iteration.
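
Draft wording for that comment (just a sketch):

	/*
	 * Bit 13 (nested exception) of the exiting-event identification is
	 * always identical to the valid bit of the original-event
	 * identification.  KVM already consumes the latter via
	 * IDT_VECTORING_INFO_FIELD in vmx_complete_interrupts(), which
	 * requeues the original event right after VM exit; if a shadow #PF
	 * turns out to be a guest #PF, kvm_multiple_exception() then injects
	 * it as a nested exception.  Hence bit 13 needs no separate check
	 * here.
	 */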

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v1 13/23] KVM: VMX: Handle VMX nested exception for FRED
  2023-12-06  8:37       ` Li, Xin3
@ 2023-12-07  8:42         ` Chao Gao
  2023-12-07 10:09           ` Li, Xin3
  0 siblings, 1 reply; 70+ messages in thread
From: Chao Gao @ 2023-12-07  8:42 UTC (permalink / raw)
  To: Li, Xin3
  Cc: kvm, linux-doc, linux-kernel, linux-hyperv, linux-kselftest,
	seanjc, pbonzini, corbet, kys, haiyangz, wei.liu, Cui, Dexuan,
	tglx, mingo, bp, dave.hansen, x86, hpa, vkuznets, peterz,
	Shankar, Ravi V

On Wed, Dec 06, 2023 at 04:37:39PM +0800, Li, Xin3 wrote:
>> Subject: RE: [PATCH v1 13/23] KVM: VMX: Handle VMX nested exception for FRED
>> 
>> > >+		if (idt_vectoring_info & VECTORING_INFO_DELIVER_CODE_MASK)
>> > >+			kvm_requeue_exception_e(vcpu, vector, vmcs_read32(error_code_field),
>> > >+						idt_vectoring_info & INTR_INFO_NESTED_EXCEPTION_MASK);
>> > >+		else
>> > >+			kvm_requeue_exception(vcpu, vector,
>> > >+					      idt_vectoring_info & INTR_INFO_NESTED_EXCEPTION_MASK);
>> >
>> > Exiting-event identification can also have bit 13 set, indicating a
>> > nested exception encountered and caused VM-exit. when reinjecting the
>> > exception to guests, kvm needs to set the "nested" bit, right? I
>> > suspect some changes to e.g., handle_exception_nmi() are needed.
>> 
>> The current patch relies on kvm_multiple_exception() to do that.  But TBH, I'm
>> not sure it can recognize all nested cases.  I probably should revisit it.
>
>So the conclusion is that kvm_multiple_exception() is smart enough, and
>a VMM doesn't have to check bit 13 of the Exiting-event identification.
>
>In FRED spec 5.0, section 9.2 - New VMX Feature: VMX Nested-Exception
>Support, there is a statement at the end of Exiting-event identification:
>
>(The value of this bit is always identical to that of the valid bit of
>the original-event identification field.)
>
>It means that even w/o VMX Nested-Exception support, a VMM already knows
>if an exception is a nested exception encountered during delivery of
>another event in an exception caused VM exit (exit reason 0).  This is
>done in KVM through reading IDT_VECTORING_INFO_FIELD and calling
>vmx_complete_interrupts() immediately after VM exits.
>
>vmx_complete_interrupts() simply queues the original exception if there is
>one, and later the nested exception causing the VM exit could be cancelled
>if it is a shadow page fault.  However if the shadow page fault is caused
>by a guest page fault, KVM injects it as a nested exception to have guest
>fix its page table.
>
>I will add comments about this background in the next iteration.

Is it possible that the CPU encounters an exception and causes a VM-exit while
injecting an __interrupt__?  In this case, no __exception__ will be (re-)queued
by vmx_complete_interrupts().

^ permalink raw reply	[flat|nested] 70+ messages in thread

* RE: [PATCH v1 13/23] KVM: VMX: Handle VMX nested exception for FRED
  2023-12-07  8:42         ` Chao Gao
@ 2023-12-07 10:09           ` Li, Xin3
  2023-12-08  1:56             ` Chao Gao
  0 siblings, 1 reply; 70+ messages in thread
From: Li, Xin3 @ 2023-12-07 10:09 UTC (permalink / raw)
  To: Gao, Chao
  Cc: kvm, linux-doc, linux-kernel, linux-hyperv, linux-kselftest,
	seanjc, pbonzini, corbet, kys, haiyangz, wei.liu, Cui, Dexuan,
	tglx, mingo, bp, dave.hansen, x86, hpa, vkuznets, peterz,
	Shankar, Ravi V

> >> > Exiting-event identification can also have bit 13 set, indicating a
> >> > nested exception encountered and caused VM-exit. when reinjecting the
> >> > exception to guests, kvm needs to set the "nested" bit, right? I
> >> > suspect some changes to e.g., handle_exception_nmi() are needed.
> >>
> >> The current patch relies on kvm_multiple_exception() to do that.  But TBH, I'm
> >> not sure it can recognize all nested cases.  I probably should revisit it.
> >
> >So the conclusion is that kvm_multiple_exception() is smart enough, and
> >a VMM doesn't have to check bit 13 of the Exiting-event identification.
> >
> >In FRED spec 5.0, section 9.2 - New VMX Feature: VMX Nested-Exception
> >Support, there is a statement at the end of Exiting-event identification:
> >
> >(The value of this bit is always identical to that of the valid bit of
> >the original-event identification field.)
> >
> >It means that even w/o VMX Nested-Exception support, a VMM already knows
> >if an exception is a nested exception encountered during delivery of
> >another event in an exception caused VM exit (exit reason 0).  This is
> >done in KVM through reading IDT_VECTORING_INFO_FIELD and calling
> >vmx_complete_interrupts() immediately after VM exits.
> >
> >vmx_complete_interrupts() simply queues the original exception if there is
> >one, and later the nested exception causing the VM exit could be cancelled
> >if it is a shadow page fault.  However if the shadow page fault is caused
> >by a guest page fault, KVM injects it as a nested exception to have guest
> >fix its page table.
> >
> >I will add comments about this background in the next iteration.
> 
> Is it possible that the CPU encounters an exception and causes a VM-exit while
> injecting an __interrupt__?  In this case, no __exception__ will be (re-)queued
> by vmx_complete_interrupts().

I guess the following case is what you're suggesting:
KVM injects an external interrupt after shadow page tables are nuked.

vmx_complete_interrupts() is called after each VM exit to clear both
interrupt and exception queues, which means it always pushes the
deepest event if there is an original event.  In the above case, the
original event is the external interrupt KVM just tried to inject.

^ permalink raw reply	[flat|nested] 70+ messages in thread

* Re: [PATCH v1 13/23] KVM: VMX: Handle VMX nested exception for FRED
  2023-12-07 10:09           ` Li, Xin3
@ 2023-12-08  1:56             ` Chao Gao
  2023-12-08 23:48               ` Li, Xin3
  0 siblings, 1 reply; 70+ messages in thread
From: Chao Gao @ 2023-12-08  1:56 UTC (permalink / raw)
  To: Li, Xin3
  Cc: kvm, linux-doc, linux-kernel, linux-hyperv, linux-kselftest,
	seanjc, pbonzini, corbet, kys, haiyangz, wei.liu, Cui, Dexuan,
	tglx, mingo, bp, dave.hansen, x86, hpa, vkuznets, peterz,
	Shankar, Ravi V

On Thu, Dec 07, 2023 at 06:09:46PM +0800, Li, Xin3 wrote:
>> >> > Exiting-event identification can also have bit 13 set, indicating a
>> >> > nested exception encountered and caused VM-exit. when reinjecting the
>> >> > exception to guests, kvm needs to set the "nested" bit, right? I
>> >> > suspect some changes to e.g., handle_exception_nmi() are needed.
>> >>
>> >> The current patch relies on kvm_multiple_exception() to do that.  But TBH, I'm
>> >> not sure it can recognize all nested cases.  I probably should revisit it.
>> >
>> >So the conclusion is that kvm_multiple_exception() is smart enough, and
>> >a VMM doesn't have to check bit 13 of the Exiting-event identification.
>> >
>> >In FRED spec 5.0, section 9.2 - New VMX Feature: VMX Nested-Exception
>> >Support, there is a statement at the end of Exiting-event identification:
>> >
>> >(The value of this bit is always identical to that of the valid bit of
>> >the original-event identification field.)
>> >
>> >It means that even w/o VMX Nested-Exception support, a VMM already knows
>> >if an exception is a nested exception encountered during delivery of
>> >another event in an exception caused VM exit (exit reason 0).  This is
>> >done in KVM through reading IDT_VECTORING_INFO_FIELD and calling
>> >vmx_complete_interrupts() immediately after VM exits.
>> >
>> >vmx_complete_interrupts() simply queues the original exception if there is
>> >one, and later the nested exception causing the VM exit could be cancelled
>> >if it is a shadow page fault.  However if the shadow page fault is caused
>> >by a guest page fault, KVM injects it as a nested exception to have guest
>> >fix its page table.
>> >
>> >I will add comments about this background in the next iteration.
>> 
>> Is it possible that the CPU encounters an exception and causes a VM-exit while
>> injecting an __interrupt__?  In this case, no __exception__ will be (re-)queued
>> by vmx_complete_interrupts().
>
>I guess the following case is what you're suggesting:
>KVM injects an external interrupt after shadow page tables are nuked.
>
>vmx_complete_interrupts() is called after each VM exit to clear both
>the interrupt and exception queues, which means it always re-queues the
>deepest event if there is an original event.  In the above case, the
>original event is the external interrupt KVM just tried to inject.

In my understanding, your point is:
1. if bit 13 of the Exiting-event identification is set, the original-event
identification field should be valid.
2. vmx_complete_interrupts() is called immediately after VM exit; it reads
the original-event identification and reinjects the event recorded there.
3. if KVM injects the exception in the exiting-event identification to the
guest, KVM doesn't need to read bit 13, because kvm_multiple_exception() is
"smart enough" to recognize the exception as a nested exception: if bit 13
is 1, an exception must have been queued in #2.

My question is:
what if the event in the original-event identification is an interrupt, e.g.
an external interrupt or NMI, rather than an exception?
vmx_complete_interrupts() won't queue an exception, so how can KVM or
kvm_multiple_exception() know that the exception that caused the VM exit is
a nested exception w/o reading bit 13 of the Exiting-event identification?
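
To make the case concrete, a small standalone sketch (not KVM code; all
names are invented) of why a nested check that only looks at the re-queued
exception would miss an original interrupt/NMI:

#include <stdbool.h>
#include <stdio.h>

/*
 * Toy model of the two possible checks for "is the exception that caused
 * the VM exit a nested exception?".
 */
struct toy_pending {
	bool exception_queued;	/* original event was an exception */
	bool interrupt_queued;	/* original event was an interrupt or NMI */
};

/* Check that only looks at the re-queued exception, as discussed above. */
static bool toy_nested_from_exception_queue(const struct toy_pending *p)
{
	return p->exception_queued;
}

/* Check that treats any re-queued original event as "nested". */
static bool toy_nested_from_any_original(const struct toy_pending *p)
{
	return p->exception_queued || p->interrupt_queued;
}

int main(void)
{
	/* #PF encountered while injecting an external interrupt. */
	struct toy_pending p = { .exception_queued = false,
				 .interrupt_queued = true };

	printf("exception-queue check: nested=%d\n",
	       toy_nested_from_exception_queue(&p));	/* 0 -> missed */
	printf("any-original check:    nested=%d\n",
	       toy_nested_from_any_original(&p));	/* 1 */
	return 0;
}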

^ permalink raw reply	[flat|nested] 70+ messages in thread

* RE: [PATCH v1 13/23] KVM: VMX: Handle VMX nested exception for FRED
  2023-12-08  1:56             ` Chao Gao
@ 2023-12-08 23:48               ` Li, Xin3
  0 siblings, 0 replies; 70+ messages in thread
From: Li, Xin3 @ 2023-12-08 23:48 UTC (permalink / raw)
  To: Gao, Chao
  Cc: kvm, linux-doc, linux-kernel, linux-hyperv, linux-kselftest,
	seanjc, pbonzini, corbet, kys, haiyangz, wei.liu, Cui, Dexuan,
	tglx, mingo, bp, dave.hansen, x86, hpa, vkuznets, peterz,
	Shankar, Ravi V

> >> >> > Exiting-event identification can also have bit 13 set, indicating a
> >> >> > nested exception encountered and caused VM-exit. when reinjecting the
> >> >> > exception to guests, kvm needs to set the "nested" bit, right? I
> >> >> > suspect some changes to e.g., handle_exception_nmi() are needed.
> >> >>
> >> >> The current patch relies on kvm_multiple_exception() to do that.  But TBH, I'm
> >> >> not sure it can recognize all nested cases.  I probably should revisit it.
> >> >
> >> >So the conclusion is that kvm_multiple_exception() is smart enough, and
> >> >a VMM doesn't have to check bit 13 of the Exiting-event identification.
> >> >
> >> >In FRED spec 5.0, section 9.2 - New VMX Feature: VMX Nested-Exception
> >> >Support, there is a statement at the end of Exiting-event identification:
> >> >
> >> >(The value of this bit is always identical to that of the valid bit of
> >> >the original-event identification field.)
> >> >
> >> >It means that even w/o VMX Nested-Exception support, a VMM already knows
> >> >if an exception is a nested exception encountered during delivery of
> >> >another event in an exception caused VM exit (exit reason 0).  This is
> >> >done in KVM through reading IDT_VECTORING_INFO_FIELD and calling
> >> >vmx_complete_interrupts() immediately after VM exits.
> >> >
> >> >vmx_complete_interrupts() simply queues the original exception if there is
> >> >one, and later the nested exception causing the VM exit could be cancelled
> >> >if it is a shadow page fault.  However if the shadow page fault is caused
> >> >by a guest page fault, KVM injects it as a nested exception to have guest
> >> >fix its page table.
> >> >
> >> >I will add comments about this background in the next iteration.
> >>
> >> is it possible that the CPU encounters an exception and causes VM-exit during
> >> injecting an __interrupt__? in this case, no __exception__ will be (re-)queued
> >> by vmx_complete_interrupts().
> >
> >I guess the following case is what you're suggesting:
> >KVM injects an external interrupt after shadow page tables are nuked.
> >
> >vmx_complete_interrupts() is called after each VM exit to clear both
> >the interrupt and exception queues, which means it always re-queues the
> >deepest event if there is an original event.  In the above case, the
> >original event is the external interrupt KVM just tried to inject.
> 
> In my understanding, your point is:
> 1. if bit 13 of the Exiting-event identification is set, the original-event
> identification field should be valid.
> 2. vmx_complete_interrupts() is called immediately after VM exit; it reads
> the original-event identification and reinjects the event recorded there.
> 3. if KVM injects the exception in the exiting-event identification to the
> guest, KVM doesn't need to read bit 13, because kvm_multiple_exception() is
> "smart enough" to recognize the exception as a nested exception: if bit 13
> is 1, an exception must have been queued in #2.
> 
> My question is:
> what if the event in the original-event identification is an interrupt, e.g.
> an external interrupt or NMI, rather than an exception?
> vmx_complete_interrupts() won't queue an exception, so how can KVM or
> kvm_multiple_exception() know that the exception that caused the VM exit is
> a nested exception w/o reading bit 13 of the Exiting-event identification?

The good news is that vmx_complete_interrupts() still queues the event
even if it's not a hardware exception.  It's just that kvm_multiple_exception()
doesn't check whether there is an original interrupt or NMI, because IDT event
delivery doesn't care about such a case.

I think your point is rather that we should check it when FRED is enabled
for a guest.  Yes, architecturally we should do that.

What I want to emphasize is that bit 13 of the exiting-event identification
is set to the valid bit of the original-event identification, so the two are
logically the same thing when FRED is enabled.  It doesn't matter which one
a VMM reads and uses.  But a VMM doesn't need to differentiate FRED from IDT
event delivery if it reads the info from the original-event identification.
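
A minimal standalone sketch of that equivalence (not KVM code; the names
are invented, and I'm assuming the valid bit of the original-event
identification sits at bit 31 like the IDT-vectoring information field,
so please double-check the spec):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Bit 13 is the nested-exception bit of the exiting-event identification
 * per the spec text quoted earlier in this thread; bit 31 as the valid
 * bit of the original-event identification is an assumption here.
 */
#define TOY_EXITING_EVENT_NESTED	(1u << 13)
#define TOY_ORIGINAL_EVENT_VALID	(1u << 31)

static bool toy_nested_from_exiting(uint32_t exiting_event_id)
{
	return exiting_event_id & TOY_EXITING_EVENT_NESTED;
}

static bool toy_nested_from_original(uint32_t original_event_id)
{
	return original_event_id & TOY_ORIGINAL_EVENT_VALID;
}

int main(void)
{
	/* #PF (vector 14) hit while delivering external interrupt 0x30. */
	uint32_t exiting  = 14 | TOY_EXITING_EVENT_NESTED;
	uint32_t original = 0x30 | TOY_ORIGINAL_EVENT_VALID;

	/* Both views agree on whether the exception is nested. */
	printf("nested via exiting-event id:  %d\n", toy_nested_from_exiting(exiting));
	printf("nested via original-event id: %d\n", toy_nested_from_original(original));
	return 0;
}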

^ permalink raw reply	[flat|nested] 70+ messages in thread

end of thread, other threads:[~2023-12-08 23:48 UTC | newest]

Thread overview: 70+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-11-08 18:29 [PATCH v1 00/23] Enable FRED with KVM VMX Xin Li
2023-11-08 18:29 ` [PATCH v1 01/23] KVM: VMX: Cleanup VMX basic information defines and usages Xin Li
2023-11-08 18:29 ` [PATCH v1 02/23] KVM: VMX: Cleanup VMX misc " Xin Li
2023-11-08 18:29 ` [PATCH v1 03/23] KVM: VMX: Add support for the secondary VM exit controls Xin Li
2023-11-08 18:29 ` [PATCH v1 04/23] KVM: x86: Mark CR4.FRED as not reserved Xin Li
2023-11-08 18:29 ` [PATCH v1 05/23] KVM: VMX: Initialize FRED VM entry/exit controls in vmcs_config Xin Li
2023-11-09  8:53   ` Chao Gao
2023-11-09 15:15     ` Sean Christopherson
2023-11-10  0:04       ` Li, Xin3
2023-11-10 15:01         ` Sean Christopherson
2023-11-14  4:05           ` Li, Xin3
2023-11-13 17:18   ` Nikolay Borisov
2023-11-15  2:39     ` Li, Xin3
2023-11-08 18:29 ` [PATCH v1 06/23] KVM: VMX: Defer enabling FRED MSRs save/load until after set CPUID Xin Li
2023-11-09  9:15   ` Chao Gao
2023-11-09 23:50     ` Li, Xin3
2023-11-10  0:18       ` Sean Christopherson
2023-11-14  2:50         ` Li, Xin3
2023-11-15 21:47           ` Sean Christopherson
2023-11-08 18:29 ` [PATCH v1 07/23] KVM: VMX: Disable intercepting FRED MSRs Xin Li
2023-11-09  9:21   ` Chao Gao
2023-11-08 18:29 ` [PATCH v1 08/23] KVM: VMX: Initialize VMCS FRED fields Xin Li
2023-11-13  3:04   ` Chao Gao
2023-11-14  6:02     ` Li, Xin3
2023-11-14  6:51       ` Li, Xin3
2023-11-08 18:29 ` [PATCH v1 09/23] KVM: VMX: Switch FRED RSP0 between host and guest Xin Li
2023-11-13  3:47   ` Chao Gao
2023-11-14  5:17     ` Li, Xin3
2023-11-14  7:47       ` Chao Gao
2023-11-15  3:04         ` Li, Xin3
2023-11-08 18:29 ` [PATCH v1 10/23] KVM: VMX: Add support for FRED context save/restore Xin Li
2023-11-13  5:24   ` Chao Gao
2023-11-14  4:48     ` Li, Xin3
2023-11-08 18:29 ` [PATCH v1 11/23] KVM: x86: Add kvm_is_fred_enabled() Xin Li
2023-11-13  7:35   ` Chao Gao
2023-11-14  4:42     ` Li, Xin3
2023-11-14  8:16       ` Chao Gao
2023-11-14 18:57         ` Li, Xin3
2023-11-20  9:04           ` Li, Xin3
2023-11-08 18:29 ` [PATCH v1 12/23] KVM: VMX: Handle FRED event data Xin Li
2023-11-13 10:14   ` Chao Gao
2023-11-14  4:34     ` Li, Xin3
2023-11-14  8:58       ` Chao Gao
2023-11-15  2:52         ` Li, Xin3
2023-11-16  2:39           ` Chao Gao
2023-11-20  8:16             ` Li, Xin3
2023-11-08 18:29 ` [PATCH v1 13/23] KVM: VMX: Handle VMX nested exception for FRED Xin Li
2023-11-14  7:40   ` Chao Gao
2023-11-15  3:03     ` Li, Xin3
2023-12-06  8:37       ` Li, Xin3
2023-12-07  8:42         ` Chao Gao
2023-12-07 10:09           ` Li, Xin3
2023-12-08  1:56             ` Chao Gao
2023-12-08 23:48               ` Li, Xin3
2023-11-08 18:29 ` [PATCH v1 14/23] KVM: VMX: Dump FRED context in dump_vmcs() Xin Li
2023-11-14 14:36   ` Nikolay Borisov
2023-11-15  2:41     ` Li, Xin3
2023-11-08 18:29 ` [PATCH v1 15/23] KVM: nVMX: Add support for the secondary VM exit controls Xin Li
2023-11-09  8:21   ` Jeremi Piotrowski
2023-11-10  0:12     ` Li, Xin3
2023-11-20 15:52   ` Vitaly Kuznetsov
2023-11-20 17:42     ` Li, Xin3
2023-11-08 18:29 ` [PATCH v1 16/23] KVM: nVMX: Add FRED VMCS fields Xin Li
2023-11-08 18:29 ` [PATCH v1 17/23] KVM: nVMX: Add support for VMX FRED controls Xin Li
2023-11-08 18:29 ` [PATCH v1 18/23] KVM: nVMX: Add VMCS FRED states checking Xin Li
2023-11-08 18:29 ` [PATCH v1 19/23] KVM: x86: Allow FRED/LKGS/WRMSRNS to be exposed to guests Xin Li
2023-11-08 18:30 ` [PATCH v1 20/23] KVM: selftests: Add FRED VMCS fields to evmcs Xin Li
2023-11-08 18:30 ` [PATCH v1 21/23] KVM: selftests: Run debug_regs test with FRED enabled Xin Li
2023-11-08 18:30 ` [PATCH v1 22/23] KVM: selftests: Add a new VM guest mode to run user level code Xin Li
2023-11-08 18:30 ` [PATCH v1 23/23] KVM: selftests: Add fred exception tests Xin Li

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).