* [PATCH v6 00/25] Enable CET Virtualization
@ 2023-09-14  6:33 Yang Weijiang
  2023-09-14  6:33 ` [PATCH v6 01/25] x86/fpu/xstate: Manually check and add XFEATURE_CET_USER xstate bit Yang Weijiang
                   ` (25 more replies)
  0 siblings, 26 replies; 120+ messages in thread
From: Yang Weijiang @ 2023-09-14  6:33 UTC (permalink / raw)
  To: seanjc, pbonzini, kvm, linux-kernel
  Cc: dave.hansen, peterz, chao.gao, rick.p.edgecombe, weijiang.yang,
	john.allen

Control-flow Enforcement Technology (CET) is a CPU feature designed to
defend against Return/Call/Jump-Oriented Programming (ROP/COP/JOP) style
control-flow subversion attacks. It provides two sub-features: Shadow
Stack (SHSTK) and Indirect Branch Tracking (IBT).

Shadow Stack (SHSTK):
  A shadow stack is a second stack used exclusively for control transfer
  operations. The shadow stack is separate from the data/normal stack and
  can be enabled individually in user and kernel mode. When shadow stack
  is enabled, CALL pushes the return address on both the data and shadow
  stack. RET pops the return address from both stacks and compares them.
  If the return addresses from the two stacks do not match, the processor
  generates a #CP.
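  The CALL/RET matching described above can be illustrated with a tiny
  user-space model (a sketch only, not kernel code: the struct and function
  names are invented, and a zero return stands in for the #CP fault the
  processor would generate):

```c
#include <assert.h>
#include <stdint.h>

#define STACK_DEPTH 16

/*
 * Toy model of SHSTK-checked CALL/RET: CALL pushes the return address
 * on both the data stack and the shadow stack, RET pops both and
 * compares them. A mismatch returns 0 here, standing in for #CP.
 */
struct cpu_model {
	uint64_t data_stack[STACK_DEPTH];
	uint64_t shadow_stack[STACK_DEPTH];
	int data_top;
	int shstk_top;
};

static void model_call(struct cpu_model *c, uint64_t ret_addr)
{
	c->data_stack[c->data_top++] = ret_addr;
	c->shadow_stack[c->shstk_top++] = ret_addr;
}

static uint64_t model_ret(struct cpu_model *c)
{
	uint64_t from_data = c->data_stack[--c->data_top];
	uint64_t from_shstk = c->shadow_stack[--c->shstk_top];

	return from_data == from_shstk ? from_data : 0;
}
```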

Indirect Branch Tracking (IBT):
  IBT introduces a new instruction (ENDBRANCH) to mark valid target addresses
  of indirect branches (CALL, JMP, etc.). If an indirect branch is executed
  and the next instruction is _not_ an ENDBRANCH, the processor generates a
  #CP. The instruction behaves as a NOP on platforms that don't support
  CET.
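  The IBT check can be modeled the same way (a sketch only: the helper name
  is invented, and a zero return stands in for the #CP the processor would
  generate):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Toy model of the IBT target check: an indirect branch is allowed only
 * if the target instruction is ENDBR64 (bytes f3 0f 1e fa, read here as
 * a little-endian 32-bit value).
 */
#define ENDBR64_LE	0xfa1e0ff3u

static int ibt_target_ok(uint32_t target_insn)
{
	return target_insn == ENDBR64_LE;
}
```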


Dependency:
--------------------------------------------------------------------------
At the moment, the CET native series for user mode shadow stack has been
merged upstream in v6.6-rc1, so no native patches are included in this series.

The first 8 kernel patches are prerequisites for this KVM patch series, as
guest CET user mode and supervisor mode xstates/MSRs rely on the host FPU
framework to properly save/reload guest MSRs when required, e.g., when the
vCPU thread is scheduled in/out. The kernel patches are posted for review in
a separate thread here [1].

To test a CET guest, apply this KVM series to the kernel tree to build a
qualified host kernel. Also apply the QEMU CET enabling patches [2] to build
a qualified QEMU.


Implementation:
--------------------------------------------------------------------------
This series enables full support for guest CET SHSTK/IBT register states,
i.e., guest CET register states in the usage models below are supported.

                  |
    User SHSTK    |    User IBT      (user mode)
--------------------------------------------------
    Kernel SHSTK  |    Kernel IBT    (kernel mode)
                  |

KVM cooperates with the host kernel FPU framework to back guest CET xstate
switching when guest CET MSRs need to be saved/reloaded on the host side, so
KVM relies on the host FPU xstate settings. From the KVM perspective, part of
the user mode CET state support is in the native series, but series [1] is
required to fix some issues and to enable CET supervisor xstate support for
guests.

Note, guest supervisor (kernel) SHSTK cannot be fully supported by this
series, therefore the guest SSS_CET bit, CPUID(0x7,1):EDX[bit 18], is
cleared. See the SDM (Vol 1, Section 17.2.3) for details.
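A hypothetical sketch of clearing the guest SSS_CET bit described above
(illustrative only; the helper name is invented and this is not the actual
KVM CPUID code):

```c
#include <assert.h>
#include <stdint.h>

/* SSS_CET is CPUID.(EAX=7,ECX=1):EDX[18], per the cover letter. */
#define SSS_CET_BIT	18

/* Mask the supervisor shadow stack feature bit out of the guest CPUID. */
static uint32_t clear_sss_cet(uint32_t edx)
{
	return edx & ~(1u << SSS_CET_BIT);
}
```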


CET states management:
--------------------------------------------------------------------------
CET user mode and supervisor mode xstates, i.e., MSR_IA32_{U_CET,PL3_SSP}
and MSR_IA32_PL{0,1,2}_SSP, depend on the host FPU framework to swap guest
and host xstates. On VM-Exit, guest CET xstates are saved to the guest fpu
area and host CET xstates are loaded from task/thread context before the
vCPU returns to userspace, and vice versa on VM-Entry. See details in
kvm_{load,put}_guest_fpu(). Guest CET xstate management therefore depends on
the CET xstate bits (U_CET/S_CET) being set in the host XSS MSR.

CET supervisor mode states are grouped into two categories: XSAVE-managed
and non-XSAVE-managed. The former includes MSR_IA32_PL{0,1,2}_SSP, which are
controlled by the CET supervisor mode bit (S_CET) in XSS; the latter
consists of MSR_IA32_S_CET and MSR_IA32_INTR_SSP_TBL.
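The split described above can be sketched with the architectural XSS
component bits (a hedged sketch: the bit positions are the architectural CET
xstate component numbers from the SDM, but the helper names are invented,
not KVM's):

```c
#include <assert.h>
#include <stdint.h>

/*
 * XSAVE-managed CET state is gated by the U_CET/S_CET bits in IA32_XSS.
 * MSR_IA32_S_CET and MSR_IA32_INTR_SSP_TBL are not XSAVE-managed at all;
 * they are handled via the new VMCS fields instead.
 */
#define XFEATURE_MASK_CET_USER		(1ULL << 11)	/* U_CET, PL3_SSP */
#define XFEATURE_MASK_CET_KERNEL	(1ULL << 12)	/* PL{0,1,2}_SSP */

static int xss_manages_user_cet(uint64_t xss)
{
	return !!(xss & XFEATURE_MASK_CET_USER);
}

static int xss_manages_supervisor_ssp(uint64_t xss)
{
	return !!(xss & XFEATURE_MASK_CET_KERNEL);
}
```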

VMX introduces new VMCS fields, {GUEST|HOST}_{S_CET,SSP,INTR_SSP_TBL}, to
facilitate the guest/host non-XSAVES-managed states. When the VMX CET
entry/exit load bits are set, guest/host MSR_IA32_{S_CET,INTR_SSP_TBL,SSP}
are loaded from the equivalent fields at VM-Entry/VM-Exit respectively. With
these new fields, such supervisor states require no additional KVM
save/reload actions.


Tests:
--------------------------------------------------------------------------
This series passed the basic CET user shadow stack test and kernel IBT test
in L1 and L2 guests.

The patch series _does_ impact existing VMX test cases in KVM-unit-tests;
the resulting failures have been fixed in series [3].

All other KVM-unit-tests and selftests passed with this series. A new
selftest app for the CET MSRs is also included in this series.

Note, this series hasn't been tested on AMD platform yet.

To run the user SHSTK test and kernel IBT test in a guest, a CET-capable
platform is required, e.g., a Sapphire Rapids server. Follow the steps below
to properly build the host/guest kernels:

1. Build host kernel: Apply this series to a kernel tree (>= v6.6-rc1) and build.

2. Build guest kernel: Pull a kernel (>= v6.6-rc1) and opt in to the
CONFIG_X86_KERNEL_IBT and CONFIG_X86_USER_SHADOW_STACK options. Build with a
CET-enabled gcc (>= 8.5.0).

3. Use patched QEMU to launch a guest.

Check kernel selftest test_shadow_stack_64 output:

[INFO]  new_ssp = 7f8c82100ff8, *new_ssp = 7f8c82101001
[INFO]  changing ssp from 7f8c82900ff0 to 7f8c82100ff8
[INFO]  ssp is now 7f8c82101000
[OK]    Shadow stack pivot
[OK]    Shadow stack faults
[INFO]  Corrupting shadow stack
[INFO]  Generated shadow stack violation successfully
[OK]    Shadow stack violation test
[INFO]  Gup read -> shstk access success
[INFO]  Gup write -> shstk access success
[INFO]  Violation from normal write
[INFO]  Gup read -> write access success
[INFO]  Violation from normal write
[INFO]  Gup write -> write access success
[INFO]  Cow gup write -> write access success
[OK]    Shadow gup test
[INFO]  Violation from shstk access
[OK]    mprotect() test
[SKIP]  Userfaultfd unavailable.
[OK]    32 bit test


Check kernel IBT with dmesg | grep CET:

CET detected: Indirect Branch Tracking enabled

--------------------------------------------------------------------------
Changes in v6:
1. Added kernel patches to enable CET supervisor xstate support for guest. [Sean, Paolo]
2. Overhauled the CET MSR access interface to make reads/writes clearer. [Sean, Chao]
3. Removed KVM-managed CET supervisor state patches.
4. Tweaked the code for XSS MSR access/CET MSR reporting/SSP access in SMM mode/
CET MSR interception etc. per review feedback. [Sean, Paolo, Chao]
5. Rebased to: https://github.com/kvm-x86/linux tag: kvm-x86-next-2023.09.07


[1]: CET supervisor xstate support:
https://lore.kernel.org/all/20230914032334.75212-1-weijiang.yang@intel.com/
[2]: QEMU patch:
https://lore.kernel.org/all/20230720111445.99509-1-weijiang.yang@intel.com/
[3]: KVM-unit-tests fixup:
https://lore.kernel.org/all/20230913235006.74172-1-weijiang.yang@intel.com/
[4]: v5 patchset:
https://lore.kernel.org/kvm/20230803042732.88515-1-weijiang.yang@intel.com/


Patch 1-8: 	Kernel patches to enable CET supervisor state.
Patch 9-14: 	Enable XSS support in KVM.
Patch 15:  	Fault check for CR4.CET setting.
Patch 16:	Report CET MSRs to userspace.
Patch 17:	Introduce CET VMCS fields.
Patch 18:  	Add SHSTK/IBT to KVM-governed framework.
Patch 19: 	Emulate CET MSR access.
Patch 20: 	Handle SSP at entry/exit to SMM.
Patch 21: 	Set up CET MSR interception.
Patch 22: 	Initialize host constant supervisor state.
Patch 23: 	Add CET virtualization settings.
Patch 24-25: 	Add CET nested support.



Sean Christopherson (3):
  KVM: x86: Rework cpuid_get_supported_xcr0() to operate on vCPU data
  KVM: x86: Report XSS as to-be-saved if there are supported features
  KVM: x86: Load guest FPU state when access XSAVE-managed MSRs

Yang Weijiang (22):
  x86/fpu/xstate: Manually check and add XFEATURE_CET_USER xstate bit
  x86/fpu/xstate: Fix guest fpstate allocation size calculation
  x86/fpu/xstate: Add CET supervisor mode state support
  x86/fpu/xstate: Introduce kernel dynamic xfeature set
  x86/fpu/xstate: Remove kernel dynamic xfeatures from kernel
    default_features
  x86/fpu/xstate: Opt-in kernel dynamic bits when calculate guest xstate
    size
  x86/fpu/xstate: Tweak guest fpstate to support kernel dynamic
    xfeatures
  x86/fpu/xstate: WARN if normal fpstate contains kernel dynamic
    xfeatures
  KVM: x86: Add kvm_msr_{read,write}() helpers
  KVM: x86: Refresh CPUID on write to guest MSR_IA32_XSS
  KVM: x86: Initialize kvm_caps.supported_xss
  KVM: x86: Add fault checks for guest CR4.CET setting
  KVM: x86: Report KVM supported CET MSRs as to-be-saved
  KVM: VMX: Introduce CET VMCS fields and control bits
  KVM: x86: Use KVM-governed feature framework to track "SHSTK/IBT
    enabled"
  KVM: VMX: Emulate read and write to CET MSRs
  KVM: x86: Save and reload SSP to/from SMRAM
  KVM: VMX: Set up interception for CET MSRs
  KVM: VMX: Set host constant supervisor states to VMCS fields
  KVM: x86: Enable CET virtualization for VMX and advertise to userspace
  KVM: nVMX: Introduce new VMX_BASIC bit for event error_code delivery
    to L1
  KVM: nVMX: Enable CET support for nested guest

 arch/x86/include/asm/fpu/types.h     |  14 +-
 arch/x86/include/asm/fpu/xstate.h    |   6 +-
 arch/x86/include/asm/kvm_host.h      |   8 +-
 arch/x86/include/asm/msr-index.h     |   1 +
 arch/x86/include/asm/vmx.h           |   8 ++
 arch/x86/include/uapi/asm/kvm_para.h |   1 +
 arch/x86/kernel/fpu/core.c           |  56 ++++++--
 arch/x86/kernel/fpu/xstate.c         |  49 ++++++-
 arch/x86/kernel/fpu/xstate.h         |   5 +
 arch/x86/kvm/cpuid.c                 |  62 ++++++---
 arch/x86/kvm/governed_features.h     |   2 +
 arch/x86/kvm/smm.c                   |   8 ++
 arch/x86/kvm/smm.h                   |   2 +-
 arch/x86/kvm/vmx/capabilities.h      |  10 ++
 arch/x86/kvm/vmx/nested.c            |  49 +++++--
 arch/x86/kvm/vmx/nested.h            |   5 +
 arch/x86/kvm/vmx/vmcs12.c            |   6 +
 arch/x86/kvm/vmx/vmcs12.h            |  14 +-
 arch/x86/kvm/vmx/vmx.c               | 104 ++++++++++++++-
 arch/x86/kvm/vmx/vmx.h               |   6 +-
 arch/x86/kvm/x86.c                   | 192 +++++++++++++++++++++++++--
 arch/x86/kvm/x86.h                   |  28 ++++
 22 files changed, 569 insertions(+), 67 deletions(-)


base-commit: ff6e6ded54725cd01623b9a1a86b74a523198733
-- 
2.27.0


^ permalink raw reply	[flat|nested] 120+ messages in thread

* [PATCH v6 01/25] x86/fpu/xstate: Manually check and add XFEATURE_CET_USER xstate bit
  2023-09-14  6:33 [PATCH v6 00/25] Enable CET Virtualization Yang Weijiang
@ 2023-09-14  6:33 ` Yang Weijiang
  2023-09-14 22:39   ` Edgecombe, Rick P
  2023-10-31 17:43   ` Maxim Levitsky
  2023-09-14  6:33 ` [PATCH v6 02/25] x86/fpu/xstate: Fix guest fpstate allocation size calculation Yang Weijiang
                   ` (24 subsequent siblings)
  25 siblings, 2 replies; 120+ messages in thread
From: Yang Weijiang @ 2023-09-14  6:33 UTC (permalink / raw)
  To: seanjc, pbonzini, kvm, linux-kernel
  Cc: dave.hansen, peterz, chao.gao, rick.p.edgecombe, weijiang.yang,
	john.allen

Remove the XFEATURE_CET_USER entry from the dependency array as the entry
doesn't reflect the true dependency between the CET features and the xstate
bit; instead, manually check and add the bit back if either SHSTK or IBT is
supported.

Both the user mode shadow stack and indirect branch tracking features depend
on the XFEATURE_CET_USER bit in XSS to automatically save/restore the user
mode xstate registers, i.e., IA32_U_CET and IA32_PL3_SSP, whenever
necessary.

Although a platform with IBT but no SHSTK is rare in the real world, it's
common in the virtualization world, since guest SHSTK and IBT can be
controlled independently via the userspace app.

Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>
---
 arch/x86/kernel/fpu/xstate.c | 9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index cadf68737e6b..12c8cb278346 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -73,7 +73,6 @@ static unsigned short xsave_cpuid_features[] __initdata = {
 	[XFEATURE_PT_UNIMPLEMENTED_SO_FAR]	= X86_FEATURE_INTEL_PT,
 	[XFEATURE_PKRU]				= X86_FEATURE_OSPKE,
 	[XFEATURE_PASID]			= X86_FEATURE_ENQCMD,
-	[XFEATURE_CET_USER]			= X86_FEATURE_SHSTK,
 	[XFEATURE_XTILE_CFG]			= X86_FEATURE_AMX_TILE,
 	[XFEATURE_XTILE_DATA]			= X86_FEATURE_AMX_TILE,
 };
@@ -798,6 +797,14 @@ void __init fpu__init_system_xstate(unsigned int legacy_size)
 			fpu_kernel_cfg.max_features &= ~BIT_ULL(i);
 	}
 
+	/*
+	 * Manually add CET user mode xstate bit if either SHSTK or IBT is
+	 * available. Both features depend on the xstate bit to save/restore
+	 * CET user mode state.
+	 */
+	if (boot_cpu_has(X86_FEATURE_SHSTK) || boot_cpu_has(X86_FEATURE_IBT))
+		fpu_kernel_cfg.max_features |= BIT_ULL(XFEATURE_CET_USER);
+
 	if (!cpu_feature_enabled(X86_FEATURE_XFD))
 		fpu_kernel_cfg.max_features &= ~XFEATURE_MASK_USER_DYNAMIC;
 
-- 
2.27.0



* [PATCH v6 02/25] x86/fpu/xstate: Fix guest fpstate allocation size calculation
  2023-09-14  6:33 [PATCH v6 00/25] Enable CET Virtualization Yang Weijiang
  2023-09-14  6:33 ` [PATCH v6 01/25] x86/fpu/xstate: Manually check and add XFEATURE_CET_USER xstate bit Yang Weijiang
@ 2023-09-14  6:33 ` Yang Weijiang
  2023-09-14 22:45   ` Edgecombe, Rick P
  2023-10-21  0:39   ` Sean Christopherson
  2023-09-14  6:33 ` [PATCH v6 03/25] x86/fpu/xstate: Add CET supervisor mode state support Yang Weijiang
                   ` (23 subsequent siblings)
  25 siblings, 2 replies; 120+ messages in thread
From: Yang Weijiang @ 2023-09-14  6:33 UTC (permalink / raw)
  To: seanjc, pbonzini, kvm, linux-kernel
  Cc: dave.hansen, peterz, chao.gao, rick.p.edgecombe, weijiang.yang,
	john.allen

Fix the guest xsave area allocation size from fpu_user_cfg.default_size to
fpu_kernel_cfg.default_size so that the xsave area size is consistent with
fpstate->size as set in __fpstate_reset().

With the fix, the guest fpstate size is sufficient for all KVM-supported
guest xfeatures.

Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>
---
 arch/x86/kernel/fpu/core.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
index a86d37052a64..a42d8ad26ce6 100644
--- a/arch/x86/kernel/fpu/core.c
+++ b/arch/x86/kernel/fpu/core.c
@@ -220,7 +220,9 @@ bool fpu_alloc_guest_fpstate(struct fpu_guest *gfpu)
 	struct fpstate *fpstate;
 	unsigned int size;
 
-	size = fpu_user_cfg.default_size + ALIGN(offsetof(struct fpstate, regs), 64);
+	size = fpu_kernel_cfg.default_size +
+	       ALIGN(offsetof(struct fpstate, regs), 64);
+
 	fpstate = vzalloc(size);
 	if (!fpstate)
 		return false;
-- 
2.27.0



* [PATCH v6 03/25] x86/fpu/xstate: Add CET supervisor mode state support
  2023-09-14  6:33 [PATCH v6 00/25] Enable CET Virtualization Yang Weijiang
  2023-09-14  6:33 ` [PATCH v6 01/25] x86/fpu/xstate: Manually check and add XFEATURE_CET_USER xstate bit Yang Weijiang
  2023-09-14  6:33 ` [PATCH v6 02/25] x86/fpu/xstate: Fix guest fpstate allocation size calculation Yang Weijiang
@ 2023-09-14  6:33 ` Yang Weijiang
  2023-09-15  0:06   ` Edgecombe, Rick P
  2023-09-14  6:33 ` [PATCH v6 04/25] x86/fpu/xstate: Introduce kernel dynamic xfeature set Yang Weijiang
                   ` (22 subsequent siblings)
  25 siblings, 1 reply; 120+ messages in thread
From: Yang Weijiang @ 2023-09-14  6:33 UTC (permalink / raw)
  To: seanjc, pbonzini, kvm, linux-kernel
  Cc: dave.hansen, peterz, chao.gao, rick.p.edgecombe, weijiang.yang,
	john.allen

Add supervisor mode state support within the FPU xstate management
framework. Although supervisor shadow stack is not enabled/used in the
kernel today, KVM requires the support because when KVM advertises the
shadow stack feature to a guest, it architecturally claims support for both
user and supervisor modes, for Linux and non-Linux guest OSes.

With the xstate support, guest supervisor mode shadow stack state can be
properly saved/restored when 1) the guest/host FPU context is swapped or
2) the vCPU thread is scheduled out/in.

The alternative was to manage the state in the KVM domain, but the KVM
maintainers NAKed that solution. The external discussion can be found at
[*]; it ended up with adding the support in the kernel instead of the KVM
domain.

Note, in the KVM case, the guest CET supervisor state, i.e., the
IA32_PL{0,1,2}_SSP MSRs, is preserved after VM-Exit until the host/guest
fpstates are swapped, but since host supervisor shadow stack is disabled,
the preserved MSRs won't hurt the host.

[*]: https://lore.kernel.org/all/806e26c2-8d21-9cc9-a0b7-7787dd231729@intel.com/

Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>
---
 arch/x86/include/asm/fpu/types.h  | 14 ++++++++++++--
 arch/x86/include/asm/fpu/xstate.h |  6 +++---
 arch/x86/kernel/fpu/xstate.c      |  6 +++++-
 3 files changed, 20 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/fpu/types.h b/arch/x86/include/asm/fpu/types.h
index eb810074f1e7..c6fd13a17205 100644
--- a/arch/x86/include/asm/fpu/types.h
+++ b/arch/x86/include/asm/fpu/types.h
@@ -116,7 +116,7 @@ enum xfeature {
 	XFEATURE_PKRU,
 	XFEATURE_PASID,
 	XFEATURE_CET_USER,
-	XFEATURE_CET_KERNEL_UNUSED,
+	XFEATURE_CET_KERNEL,
 	XFEATURE_RSRVD_COMP_13,
 	XFEATURE_RSRVD_COMP_14,
 	XFEATURE_LBR,
@@ -139,7 +139,7 @@ enum xfeature {
 #define XFEATURE_MASK_PKRU		(1 << XFEATURE_PKRU)
 #define XFEATURE_MASK_PASID		(1 << XFEATURE_PASID)
 #define XFEATURE_MASK_CET_USER		(1 << XFEATURE_CET_USER)
-#define XFEATURE_MASK_CET_KERNEL	(1 << XFEATURE_CET_KERNEL_UNUSED)
+#define XFEATURE_MASK_CET_KERNEL	(1 << XFEATURE_CET_KERNEL)
 #define XFEATURE_MASK_LBR		(1 << XFEATURE_LBR)
 #define XFEATURE_MASK_XTILE_CFG		(1 << XFEATURE_XTILE_CFG)
 #define XFEATURE_MASK_XTILE_DATA	(1 << XFEATURE_XTILE_DATA)
@@ -264,6 +264,16 @@ struct cet_user_state {
 	u64 user_ssp;
 };
 
+/*
+ * State component 12 is Control-flow Enforcement supervisor states
+ */
+struct cet_supervisor_state {
+	/* supervisor ssp pointers  */
+	u64 pl0_ssp;
+	u64 pl1_ssp;
+	u64 pl2_ssp;
+};
+
 /*
  * State component 15: Architectural LBR configuration state.
  * The size of Arch LBR state depends on the number of LBRs (lbr_depth).
diff --git a/arch/x86/include/asm/fpu/xstate.h b/arch/x86/include/asm/fpu/xstate.h
index d4427b88ee12..3b4a038d3c57 100644
--- a/arch/x86/include/asm/fpu/xstate.h
+++ b/arch/x86/include/asm/fpu/xstate.h
@@ -51,7 +51,8 @@
 
 /* All currently supported supervisor features */
 #define XFEATURE_MASK_SUPERVISOR_SUPPORTED (XFEATURE_MASK_PASID | \
-					    XFEATURE_MASK_CET_USER)
+					    XFEATURE_MASK_CET_USER | \
+					    XFEATURE_MASK_CET_KERNEL)
 
 /*
  * A supervisor state component may not always contain valuable information,
@@ -78,8 +79,7 @@
  * Unsupported supervisor features. When a supervisor feature in this mask is
  * supported in the future, move it to the supported supervisor feature mask.
  */
-#define XFEATURE_MASK_SUPERVISOR_UNSUPPORTED (XFEATURE_MASK_PT | \
-					      XFEATURE_MASK_CET_KERNEL)
+#define XFEATURE_MASK_SUPERVISOR_UNSUPPORTED (XFEATURE_MASK_PT)
 
 /* All supervisor states including supported and unsupported states. */
 #define XFEATURE_MASK_SUPERVISOR_ALL (XFEATURE_MASK_SUPERVISOR_SUPPORTED | \
diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index 12c8cb278346..c3ed86732d33 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -51,7 +51,7 @@ static const char *xfeature_names[] =
 	"Protection Keys User registers",
 	"PASID state",
 	"Control-flow User registers",
-	"Control-flow Kernel registers (unused)",
+	"Control-flow Kernel registers",
 	"unknown xstate feature",
 	"unknown xstate feature",
 	"unknown xstate feature",
@@ -73,6 +73,7 @@ static unsigned short xsave_cpuid_features[] __initdata = {
 	[XFEATURE_PT_UNIMPLEMENTED_SO_FAR]	= X86_FEATURE_INTEL_PT,
 	[XFEATURE_PKRU]				= X86_FEATURE_OSPKE,
 	[XFEATURE_PASID]			= X86_FEATURE_ENQCMD,
+	[XFEATURE_CET_KERNEL]			= X86_FEATURE_SHSTK,
 	[XFEATURE_XTILE_CFG]			= X86_FEATURE_AMX_TILE,
 	[XFEATURE_XTILE_DATA]			= X86_FEATURE_AMX_TILE,
 };
@@ -277,6 +278,7 @@ static void __init print_xstate_features(void)
 	print_xstate_feature(XFEATURE_MASK_PKRU);
 	print_xstate_feature(XFEATURE_MASK_PASID);
 	print_xstate_feature(XFEATURE_MASK_CET_USER);
+	print_xstate_feature(XFEATURE_MASK_CET_KERNEL);
 	print_xstate_feature(XFEATURE_MASK_XTILE_CFG);
 	print_xstate_feature(XFEATURE_MASK_XTILE_DATA);
 }
@@ -346,6 +348,7 @@ static __init void os_xrstor_booting(struct xregs_state *xstate)
 	 XFEATURE_MASK_BNDCSR |			\
 	 XFEATURE_MASK_PASID |			\
 	 XFEATURE_MASK_CET_USER |		\
+	 XFEATURE_MASK_CET_KERNEL |		\
 	 XFEATURE_MASK_XTILE)
 
 /*
@@ -546,6 +549,7 @@ static bool __init check_xstate_against_struct(int nr)
 	case XFEATURE_PASID:	  return XCHECK_SZ(sz, nr, struct ia32_pasid_state);
 	case XFEATURE_XTILE_CFG:  return XCHECK_SZ(sz, nr, struct xtile_cfg);
 	case XFEATURE_CET_USER:	  return XCHECK_SZ(sz, nr, struct cet_user_state);
+	case XFEATURE_CET_KERNEL: return XCHECK_SZ(sz, nr, struct cet_supervisor_state);
 	case XFEATURE_XTILE_DATA: check_xtile_data_against_struct(sz); return true;
 	default:
 		XSTATE_WARN_ON(1, "No structure for xstate: %d\n", nr);
-- 
2.27.0



* [PATCH v6 04/25] x86/fpu/xstate: Introduce kernel dynamic xfeature set
  2023-09-14  6:33 [PATCH v6 00/25] Enable CET Virtualization Yang Weijiang
                   ` (2 preceding siblings ...)
  2023-09-14  6:33 ` [PATCH v6 03/25] x86/fpu/xstate: Add CET supervisor mode state support Yang Weijiang
@ 2023-09-14  6:33 ` Yang Weijiang
  2023-09-15  0:24   ` Edgecombe, Rick P
  2023-09-14  6:33 ` [PATCH v6 05/25] x86/fpu/xstate: Remove kernel dynamic xfeatures from kernel default_features Yang Weijiang
                   ` (21 subsequent siblings)
  25 siblings, 1 reply; 120+ messages in thread
From: Yang Weijiang @ 2023-09-14  6:33 UTC (permalink / raw)
  To: seanjc, pbonzini, kvm, linux-kernel
  Cc: dave.hansen, peterz, chao.gao, rick.p.edgecombe, weijiang.yang,
	john.allen

Define a new kernel xfeature set including the features that can be
dynamically enabled, i.e., the relevant feature is enabled on demand. The
xfeature set is currently used by KVM to configure the __guest__ fpstate,
e.g., to calculate the xfeature and fpstate storage size. The xfeature set
is initialized once and reused whenever it's referenced, to avoid repeated
calculation.

Currently it's used when 1) the guest fpstate __state_size is calculated
while the guest permissions are configured, and 2) a guest vCPU is created
and its fpstate is initialized.

Suggested-by: Dave Hansen <dave.hansen@intel.com>
Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>
---
 arch/x86/kernel/fpu/xstate.c | 21 +++++++++++++++++++++
 1 file changed, 21 insertions(+)

diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index c3ed86732d33..eaec05bc1b3c 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -84,6 +84,8 @@ static unsigned int xstate_sizes[XFEATURE_MAX] __ro_after_init =
 	{ [ 0 ... XFEATURE_MAX - 1] = -1};
 static unsigned int xstate_flags[XFEATURE_MAX] __ro_after_init;
 
+u64 fpu_kernel_dynamic_xfeatures __ro_after_init;
+
 #define XSTATE_FLAG_SUPERVISOR	BIT(0)
 #define XSTATE_FLAG_ALIGNED64	BIT(1)
 
@@ -740,6 +742,23 @@ static void __init fpu__init_disable_system_xstate(unsigned int legacy_size)
 	fpstate_reset(&current->thread.fpu);
 }
 
+static unsigned short xsave_kernel_dynamic_xfeatures[] = {
+	[XFEATURE_CET_KERNEL]	= X86_FEATURE_SHSTK,
+};
+
+static void __init init_kernel_dynamic_xfeatures(void)
+{
+	unsigned short cid;
+	int i;
+
+	for (i = 0; i < ARRAY_SIZE(xsave_kernel_dynamic_xfeatures); i++) {
+		cid = xsave_kernel_dynamic_xfeatures[i];
+
+		if (cid && boot_cpu_has(cid))
+			fpu_kernel_dynamic_xfeatures |= BIT_ULL(i);
+	}
+}
+
 /*
  * Enable and initialize the xsave feature.
  * Called once per system bootup.
@@ -809,6 +828,8 @@ void __init fpu__init_system_xstate(unsigned int legacy_size)
 	if (boot_cpu_has(X86_FEATURE_SHSTK) || boot_cpu_has(X86_FEATURE_IBT))
 		fpu_kernel_cfg.max_features |= BIT_ULL(XFEATURE_CET_USER);
 
+	init_kernel_dynamic_xfeatures();
+
 	if (!cpu_feature_enabled(X86_FEATURE_XFD))
 		fpu_kernel_cfg.max_features &= ~XFEATURE_MASK_USER_DYNAMIC;
 
-- 
2.27.0



* [PATCH v6 05/25] x86/fpu/xstate: Remove kernel dynamic xfeatures from kernel default_features
  2023-09-14  6:33 [PATCH v6 00/25] Enable CET Virtualization Yang Weijiang
                   ` (3 preceding siblings ...)
  2023-09-14  6:33 ` [PATCH v6 04/25] x86/fpu/xstate: Introduce kernel dynamic xfeature set Yang Weijiang
@ 2023-09-14  6:33 ` Yang Weijiang
  2023-09-14 16:22   ` Dave Hansen
  2023-09-14  6:33 ` [PATCH v6 06/25] x86/fpu/xstate: Opt-in kernel dynamic bits when calculate guest xstate size Yang Weijiang
                   ` (20 subsequent siblings)
  25 siblings, 1 reply; 120+ messages in thread
From: Yang Weijiang @ 2023-09-14  6:33 UTC (permalink / raw)
  To: seanjc, pbonzini, kvm, linux-kernel
  Cc: dave.hansen, peterz, chao.gao, rick.p.edgecombe, weijiang.yang,
	john.allen

The kernel dynamic xfeatures are supported by the host, i.e., they're
enabled in the xsaves/xrstors operating xfeature set (XCR0 | XSS), but the
corresponding CPU features are disabled in the host kernel for the time
being, so the bits are not necessarily set by default.

Remove the bits from fpu_kernel_cfg.default_features so that the bits in
xstate_bv and xcomp_bv are cleared and xsaves/xrstors can be optimized by
the HW for a normal fpstate.

Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>
---
 arch/x86/kernel/fpu/xstate.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index eaec05bc1b3c..4753c677e2e1 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -845,6 +845,7 @@ void __init fpu__init_system_xstate(unsigned int legacy_size)
 	/* Clean out dynamic features from default */
 	fpu_kernel_cfg.default_features = fpu_kernel_cfg.max_features;
 	fpu_kernel_cfg.default_features &= ~XFEATURE_MASK_USER_DYNAMIC;
+	fpu_kernel_cfg.default_features &= ~fpu_kernel_dynamic_xfeatures;
 
 	fpu_user_cfg.default_features = fpu_user_cfg.max_features;
 	fpu_user_cfg.default_features &= ~XFEATURE_MASK_USER_DYNAMIC;
-- 
2.27.0



* [PATCH v6 06/25] x86/fpu/xstate: Opt-in kernel dynamic bits when calculate guest xstate size
  2023-09-14  6:33 [PATCH v6 00/25] Enable CET Virtualization Yang Weijiang
                   ` (4 preceding siblings ...)
  2023-09-14  6:33 ` [PATCH v6 05/25] x86/fpu/xstate: Remove kernel dynamic xfeatures from kernel default_features Yang Weijiang
@ 2023-09-14  6:33 ` Yang Weijiang
  2023-09-14 17:40   ` Dave Hansen
  2023-09-14  6:33 ` [PATCH v6 07/25] x86/fpu/xstate: Tweak guest fpstate to support kernel dynamic xfeatures Yang Weijiang
                   ` (19 subsequent siblings)
  25 siblings, 1 reply; 120+ messages in thread
From: Yang Weijiang @ 2023-09-14  6:33 UTC (permalink / raw)
  To: seanjc, pbonzini, kvm, linux-kernel
  Cc: dave.hansen, peterz, chao.gao, rick.p.edgecombe, weijiang.yang,
	john.allen

When user space requests guest xstate permissions, the sufficient xstate
size is calculated from the permitted mask. Currently the max guest
permissions are set to fpu_kernel_cfg.default_features, and the latter
doesn't include the kernel dynamic xfeatures, so add them back for a correct
guest fpstate size.

If guest dynamic xfeatures are enabled, KVM re-allocates the guest fpstate
area with the above resulting size before launching the VM.

Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>
---
 arch/x86/kernel/fpu/xstate.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index 4753c677e2e1..c5d903b4df4d 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -1636,9 +1636,17 @@ static int __xstate_request_perm(u64 permitted, u64 requested, bool guest)
 
 	/* Calculate the resulting kernel state size */
 	mask = permitted | requested;
-	/* Take supervisor states into account on the host */
+	/*
+	 * Take supervisor states into account on the host. And add
+	 * kernel dynamic xfeatures to guest since guest kernel may
+	 * enable corresponding CPU features and the xstate registers
+	 * need to be saved/restored properly.
+	 */
 	if (!guest)
 		mask |= xfeatures_mask_supervisor();
+	else
+		mask |= fpu_kernel_dynamic_xfeatures;
+
 	ksize = xstate_calculate_size(mask, compacted);
 
 	/* Calculate the resulting user state size */
-- 
2.27.0



* [PATCH v6 07/25] x86/fpu/xstate: Tweak guest fpstate to support kernel dynamic xfeatures
  2023-09-14  6:33 [PATCH v6 00/25] Enable CET Virtualization Yang Weijiang
                   ` (5 preceding siblings ...)
  2023-09-14  6:33 ` [PATCH v6 06/25] x86/fpu/xstate: Opt-in kernel dynamic bits when calculate guest xstate size Yang Weijiang
@ 2023-09-14  6:33 ` Yang Weijiang
  2023-10-31 17:45   ` Maxim Levitsky
  2023-09-14  6:33 ` [PATCH v6 08/25] x86/fpu/xstate: WARN if normal fpstate contains " Yang Weijiang
                   ` (18 subsequent siblings)
  25 siblings, 1 reply; 120+ messages in thread
From: Yang Weijiang @ 2023-09-14  6:33 UTC (permalink / raw)
  To: seanjc, pbonzini, kvm, linux-kernel
  Cc: dave.hansen, peterz, chao.gao, rick.p.edgecombe, weijiang.yang,
	john.allen

The guest fpstate is sized with fpu_kernel_cfg.default_size (by the
preceding fix) and the kernel dynamic xfeatures are not taken into account,
so add the support and tweak the fpstate xfeatures and size accordingly.

The configuration steps below are currently enforced to get the guest
fpstate:
1) User space sets thread group xstate permits via arch_prctl().
2) User space creates vcpu thread.
3) User space enables guest dynamic xfeatures.

In #1, the guest fpstate size (i.e., __state_size [1]) is derived from
(fpu_kernel_cfg.default_features | user dynamic xfeatures) [2].
In #2, the guest fpstate size is calculated with fpu_kernel_cfg.default_size
and fpstate->size is set to the same. fpstate->xfeatures is set to
fpu_kernel_cfg.default_features.
In #3, the guest fpstate is re-allocated as [1] and fpstate->xfeatures is
set to [2].

By adding the kernel dynamic xfeatures in #1 and #2 above, the guest xstate
area size is expanded to hold (fpu_kernel_cfg.default_features | kernel
dynamic xfeatures | user dynamic xfeatures) [3], and guest
fpstate->xfeatures is set to [3]. The host xsaves/xrstors can then act on
all guest xfeatures.

The user_* fields remain unchanged for compatibility with the non-compacted
KVM uAPIs.

Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>
---
 arch/x86/kernel/fpu/core.c   | 56 +++++++++++++++++++++++++++++-------
 arch/x86/kernel/fpu/xstate.c |  2 +-
 arch/x86/kernel/fpu/xstate.h |  2 ++
 3 files changed, 49 insertions(+), 11 deletions(-)

diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
index a42d8ad26ce6..e5819b38545a 100644
--- a/arch/x86/kernel/fpu/core.c
+++ b/arch/x86/kernel/fpu/core.c
@@ -33,6 +33,8 @@ DEFINE_STATIC_KEY_FALSE(__fpu_state_size_dynamic);
 DEFINE_PER_CPU(u64, xfd_state);
 #endif
 
+extern unsigned int xstate_calculate_size(u64 xfeatures, bool compacted);
+
 /* The FPU state configuration data for kernel and user space */
 struct fpu_state_config	fpu_kernel_cfg __ro_after_init;
 struct fpu_state_config fpu_user_cfg __ro_after_init;
@@ -193,8 +195,6 @@ void fpu_reset_from_exception_fixup(void)
 }
 
 #if IS_ENABLED(CONFIG_KVM)
-static void __fpstate_reset(struct fpstate *fpstate, u64 xfd);
-
 static void fpu_init_guest_permissions(struct fpu_guest *gfpu)
 {
 	struct fpu_state_perm *fpuperm;
@@ -215,28 +215,64 @@ static void fpu_init_guest_permissions(struct fpu_guest *gfpu)
 	gfpu->perm = perm & ~FPU_GUEST_PERM_LOCKED;
 }
 
-bool fpu_alloc_guest_fpstate(struct fpu_guest *gfpu)
+static struct fpstate *__fpu_alloc_init_guest_fpstate(struct fpu_guest *gfpu)
 {
+	bool compacted = cpu_feature_enabled(X86_FEATURE_XCOMPACTED);
+	unsigned int gfpstate_size, size;
 	struct fpstate *fpstate;
-	unsigned int size;
+	u64 xfeatures;
+
+	/*
+	 * fpu_kernel_cfg.default_features includes all enabled xfeatures
+	 * except the dynamic xfeatures. Unlike the user dynamic
+	 * xfeatures, the kernel dynamic ones are enabled for the guest
+	 * by default, so add the kernel dynamic xfeatures back when
+	 * calculating the guest fpstate size.
+	 *
+	 * If the user dynamic xfeatures are enabled, the guest fpstate
+	 * will be re-allocated to hold all guest-enabled xfeatures, so
+	 * omit the user dynamic xfeatures here.
+	 */
+	xfeatures = fpu_kernel_cfg.default_features |
+		    fpu_kernel_dynamic_xfeatures;
+
+	gfpstate_size = xstate_calculate_size(xfeatures, compacted);
 
-	size = fpu_kernel_cfg.default_size +
-	       ALIGN(offsetof(struct fpstate, regs), 64);
+	size = gfpstate_size + ALIGN(offsetof(struct fpstate, regs), 64);
 
 	fpstate = vzalloc(size);
 	if (!fpstate)
-		return false;
+		return NULL;
+	/*
+	 * Initialize sizes and feature masks, use fpu_user_cfg.* for
+	 * the user_* settings for compatibility with existing uAPIs.
+	 */
+	fpstate->size		= gfpstate_size;
+	fpstate->xfeatures	= xfeatures;
+	fpstate->user_size	= fpu_user_cfg.default_size;
+	fpstate->user_xfeatures	= fpu_user_cfg.default_features;
+	fpstate->xfd		= 0;
 
-	/* Leave xfd to 0 (the reset value defined by spec) */
-	__fpstate_reset(fpstate, 0);
 	fpstate_init_user(fpstate);
 	fpstate->is_valloc	= true;
 	fpstate->is_guest	= true;
 
 	gfpu->fpstate		= fpstate;
-	gfpu->xfeatures		= fpu_user_cfg.default_features;
+	gfpu->xfeatures		= xfeatures;
 	gfpu->perm		= fpu_user_cfg.default_features;
 
+	return fpstate;
+}
+
+bool fpu_alloc_guest_fpstate(struct fpu_guest *gfpu)
+{
+	struct fpstate *fpstate;
+
+	fpstate = __fpu_alloc_init_guest_fpstate(gfpu);
+
+	if (!fpstate)
+		return false;
+
 	/*
 	 * KVM sets the FP+SSE bits in the XSAVE header when copying FPU state
 	 * to userspace, even when XSAVE is unsupported, so that restoring FPU
diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index c5d903b4df4d..87149aba6f11 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -561,7 +561,7 @@ static bool __init check_xstate_against_struct(int nr)
 	return true;
 }
 
-static unsigned int xstate_calculate_size(u64 xfeatures, bool compacted)
+unsigned int xstate_calculate_size(u64 xfeatures, bool compacted)
 {
 	unsigned int topmost = fls64(xfeatures) -  1;
 	unsigned int offset = xstate_offsets[topmost];
diff --git a/arch/x86/kernel/fpu/xstate.h b/arch/x86/kernel/fpu/xstate.h
index a4ecb04d8d64..9c6e3ca05c5c 100644
--- a/arch/x86/kernel/fpu/xstate.h
+++ b/arch/x86/kernel/fpu/xstate.h
@@ -10,6 +10,8 @@
 DECLARE_PER_CPU(u64, xfd_state);
 #endif
 
+extern u64 fpu_kernel_dynamic_xfeatures;
+
 static inline void xstate_init_xcomp_bv(struct xregs_state *xsave, u64 mask)
 {
 	/*
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 120+ messages in thread

* [PATCH v6 08/25] x86/fpu/xstate: WARN if normal fpstate contains kernel dynamic xfeatures
  2023-09-14  6:33 [PATCH v6 00/25] Enable CET Virtualization Yang Weijiang
                   ` (6 preceding siblings ...)
  2023-09-14  6:33 ` [PATCH v6 07/25] x86/fpu/xstate: Tweak guest fpstate to support kernel dynamic xfeatures Yang Weijiang
@ 2023-09-14  6:33 ` Yang Weijiang
  2023-10-31 17:45   ` Maxim Levitsky
  2023-09-14  6:33 ` [PATCH v6 09/25] KVM: x86: Rework cpuid_get_supported_xcr0() to operate on vCPU data Yang Weijiang
                   ` (17 subsequent siblings)
  25 siblings, 1 reply; 120+ messages in thread
From: Yang Weijiang @ 2023-09-14  6:33 UTC (permalink / raw)
  To: seanjc, pbonzini, kvm, linux-kernel
  Cc: dave.hansen, peterz, chao.gao, rick.p.edgecombe, weijiang.yang,
	john.allen

fpu_kernel_dynamic_xfeatures are now __ONLY__ enabled by a guest kernel and
used for guest fpstate, i.e., never for normal fpstate. The bits are added
when the guest fpstate is allocated and fpstate->is_guest is set to %true.

For normal fpstate, the bits should have been removed when initializing the
system FPU settings, so WARN_ONCE() if a normal fpstate contains kernel
dynamic xfeatures before xsaves is executed.

Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>
---
 arch/x86/kernel/fpu/xstate.h | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/x86/kernel/fpu/xstate.h b/arch/x86/kernel/fpu/xstate.h
index 9c6e3ca05c5c..c2b33a5db53d 100644
--- a/arch/x86/kernel/fpu/xstate.h
+++ b/arch/x86/kernel/fpu/xstate.h
@@ -186,6 +186,9 @@ static inline void os_xsave(struct fpstate *fpstate)
 	WARN_ON_FPU(!alternatives_patched);
 	xfd_validate_state(fpstate, mask, false);
 
+	WARN_ON_FPU(!fpstate->is_guest &&
+		    (mask & fpu_kernel_dynamic_xfeatures));
+
 	XSTATE_XSAVE(&fpstate->regs.xsave, lmask, hmask, err);
 
 	/* We should never fault when copying to a kernel buffer: */
-- 
2.27.0



* [PATCH v6 09/25] KVM: x86: Rework cpuid_get_supported_xcr0() to operate on vCPU data
  2023-09-14  6:33 [PATCH v6 00/25] Enable CET Virtualization Yang Weijiang
                   ` (7 preceding siblings ...)
  2023-09-14  6:33 ` [PATCH v6 08/25] x86/fpu/xstate: WARN if normal fpstate contains " Yang Weijiang
@ 2023-09-14  6:33 ` Yang Weijiang
  2023-10-31 17:46   ` Maxim Levitsky
  2023-09-14  6:33 ` [PATCH v6 10/25] KVM: x86: Add kvm_msr_{read,write}() helpers Yang Weijiang
                   ` (16 subsequent siblings)
  25 siblings, 1 reply; 120+ messages in thread
From: Yang Weijiang @ 2023-09-14  6:33 UTC (permalink / raw)
  To: seanjc, pbonzini, kvm, linux-kernel
  Cc: dave.hansen, peterz, chao.gao, rick.p.edgecombe, weijiang.yang,
	john.allen

From: Sean Christopherson <seanjc@google.com>

Rework and rename cpuid_get_supported_xcr0() to explicitly operate on vCPU
state, i.e. on a vCPU's CPUID state.  Prior to commit 275a87244ec8 ("KVM:
x86: Don't adjust guest's CPUID.0x12.1 (allowed SGX enclave XFRM)"), KVM
incorrectly fudged guest CPUID at runtime, which in turn necessitated
massaging the incoming CPUID state for KVM_SET_CPUID{2} so as not to run
afoul of kvm_cpuid_check_equal().

Opportunistically move the helper below kvm_update_cpuid_runtime() to make
it harder to repeat the mistake of querying supported XCR0 for runtime
updates.

No functional change intended.
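The composition performed by the renamed helper can be sketched as plain
bit arithmetic: CPUID.0xD.0 reports the low 32 supported XCR0 bits in EAX
and the high 32 in EDX, clamped by KVM's own supported mask. The values in
the sketch below are made-up examples, not real CPUID output:

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of the vcpu_get_supported_xcr0() composition, not the KVM code. */
static uint64_t supported_xcr0(uint32_t eax, uint32_t edx,
			       uint64_t kvm_supported)
{
	return (eax | ((uint64_t)edx << 32)) & kvm_supported;
}
```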

Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>
---
 arch/x86/kvm/cpuid.c | 33 ++++++++++++++++-----------------
 1 file changed, 16 insertions(+), 17 deletions(-)

diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index 0544e30b4946..7c3e4a550ca7 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -247,21 +247,6 @@ void kvm_update_pv_runtime(struct kvm_vcpu *vcpu)
 		vcpu->arch.pv_cpuid.features = best->eax;
 }
 
-/*
- * Calculate guest's supported XCR0 taking into account guest CPUID data and
- * KVM's supported XCR0 (comprised of host's XCR0 and KVM_SUPPORTED_XCR0).
- */
-static u64 cpuid_get_supported_xcr0(struct kvm_cpuid_entry2 *entries, int nent)
-{
-	struct kvm_cpuid_entry2 *best;
-
-	best = cpuid_entry2_find(entries, nent, 0xd, 0);
-	if (!best)
-		return 0;
-
-	return (best->eax | ((u64)best->edx << 32)) & kvm_caps.supported_xcr0;
-}
-
 static void __kvm_update_cpuid_runtime(struct kvm_vcpu *vcpu, struct kvm_cpuid_entry2 *entries,
 				       int nent)
 {
@@ -312,6 +297,21 @@ void kvm_update_cpuid_runtime(struct kvm_vcpu *vcpu)
 }
 EXPORT_SYMBOL_GPL(kvm_update_cpuid_runtime);
 
+/*
+ * Calculate guest's supported XCR0 taking into account guest CPUID data and
+ * KVM's supported XCR0 (comprised of host's XCR0 and KVM_SUPPORTED_XCR0).
+ */
+static u64 vcpu_get_supported_xcr0(struct kvm_vcpu *vcpu)
+{
+	struct kvm_cpuid_entry2 *best;
+
+	best = kvm_find_cpuid_entry_index(vcpu, 0xd, 0);
+	if (!best)
+		return 0;
+
+	return (best->eax | ((u64)best->edx << 32)) & kvm_caps.supported_xcr0;
+}
+
 static bool kvm_cpuid_has_hyperv(struct kvm_cpuid_entry2 *entries, int nent)
 {
 	struct kvm_cpuid_entry2 *entry;
@@ -357,8 +357,7 @@ static void kvm_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
 		kvm_apic_set_version(vcpu);
 	}
 
-	vcpu->arch.guest_supported_xcr0 =
-		cpuid_get_supported_xcr0(vcpu->arch.cpuid_entries, vcpu->arch.cpuid_nent);
+	vcpu->arch.guest_supported_xcr0 = vcpu_get_supported_xcr0(vcpu);
 
 	/*
 	 * FP+SSE can always be saved/restored via KVM_{G,S}ET_XSAVE, even if
-- 
2.27.0



* [PATCH v6 10/25] KVM: x86: Add kvm_msr_{read,write}() helpers
  2023-09-14  6:33 [PATCH v6 00/25] Enable CET Virtualization Yang Weijiang
                   ` (8 preceding siblings ...)
  2023-09-14  6:33 ` [PATCH v6 09/25] KVM: x86: Rework cpuid_get_supported_xcr0() to operate on vCPU data Yang Weijiang
@ 2023-09-14  6:33 ` Yang Weijiang
  2023-10-31 17:47   ` Maxim Levitsky
  2023-09-14  6:33 ` [PATCH v6 11/25] KVM: x86: Report XSS as to-be-saved if there are supported features Yang Weijiang
                   ` (15 subsequent siblings)
  25 siblings, 1 reply; 120+ messages in thread
From: Yang Weijiang @ 2023-09-14  6:33 UTC (permalink / raw)
  To: seanjc, pbonzini, kvm, linux-kernel
  Cc: dave.hansen, peterz, chao.gao, rick.p.edgecombe, weijiang.yang,
	john.allen

Wrap __kvm_{get,set}_msr() into two new helpers for KVM usage and use the
helpers to replace existing usage of the raw functions.
kvm_msr_{read,write}() are KVM-internal helpers, i.e. used when KVM needs
to get/set an MSR value for emulating CPU behavior.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>
---
 arch/x86/include/asm/kvm_host.h |  4 +++-
 arch/x86/kvm/cpuid.c            |  2 +-
 arch/x86/kvm/x86.c              | 16 +++++++++++++---
 3 files changed, 17 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 1a4def36d5bb..0fc5e6312e93 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1956,7 +1956,9 @@ void kvm_prepare_emulation_failure_exit(struct kvm_vcpu *vcpu);
 
 void kvm_enable_efer_bits(u64);
 bool kvm_valid_efer(struct kvm_vcpu *vcpu, u64 efer);
-int __kvm_get_msr(struct kvm_vcpu *vcpu, u32 index, u64 *data, bool host_initiated);
+
+int kvm_msr_read(struct kvm_vcpu *vcpu, u32 index, u64 *data);
+int kvm_msr_write(struct kvm_vcpu *vcpu, u32 index, u64 data);
 int kvm_get_msr(struct kvm_vcpu *vcpu, u32 index, u64 *data);
 int kvm_set_msr(struct kvm_vcpu *vcpu, u32 index, u64 data);
 int kvm_emulate_rdmsr(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index 7c3e4a550ca7..1f206caec559 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -1531,7 +1531,7 @@ bool kvm_cpuid(struct kvm_vcpu *vcpu, u32 *eax, u32 *ebx,
 		*edx = entry->edx;
 		if (function == 7 && index == 0) {
 			u64 data;
-		        if (!__kvm_get_msr(vcpu, MSR_IA32_TSX_CTRL, &data, true) &&
+		        if (!kvm_msr_read(vcpu, MSR_IA32_TSX_CTRL, &data) &&
 			    (data & TSX_CTRL_CPUID_CLEAR))
 				*ebx &= ~(F(RTM) | F(HLE));
 		} else if (function == 0x80000007) {
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 6c9c81e82e65..e0b55c043dab 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1917,8 +1917,8 @@ static int kvm_set_msr_ignored_check(struct kvm_vcpu *vcpu,
  * Returns 0 on success, non-0 otherwise.
  * Assumes vcpu_load() was already called.
  */
-int __kvm_get_msr(struct kvm_vcpu *vcpu, u32 index, u64 *data,
-		  bool host_initiated)
+static int __kvm_get_msr(struct kvm_vcpu *vcpu, u32 index, u64 *data,
+			 bool host_initiated)
 {
 	struct msr_data msr;
 	int ret;
@@ -1944,6 +1944,16 @@ int __kvm_get_msr(struct kvm_vcpu *vcpu, u32 index, u64 *data,
 	return ret;
 }
 
+int kvm_msr_write(struct kvm_vcpu *vcpu, u32 index, u64 data)
+{
+	return __kvm_set_msr(vcpu, index, data, true);
+}
+
+int kvm_msr_read(struct kvm_vcpu *vcpu, u32 index, u64 *data)
+{
+	return __kvm_get_msr(vcpu, index, data, true);
+}
+
 static int kvm_get_msr_ignored_check(struct kvm_vcpu *vcpu,
 				     u32 index, u64 *data, bool host_initiated)
 {
@@ -12082,7 +12092,7 @@ void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
 						  MSR_IA32_MISC_ENABLE_BTS_UNAVAIL;
 
 		__kvm_set_xcr(vcpu, 0, XFEATURE_MASK_FP);
-		__kvm_set_msr(vcpu, MSR_IA32_XSS, 0, true);
+		kvm_msr_write(vcpu, MSR_IA32_XSS, 0);
 	}
 
 	/* All GPRs except RDX (handled below) are zeroed on RESET/INIT. */
-- 
2.27.0



* [PATCH v6 11/25] KVM: x86: Report XSS as to-be-saved if there are supported features
  2023-09-14  6:33 [PATCH v6 00/25] Enable CET Virtualization Yang Weijiang
                   ` (9 preceding siblings ...)
  2023-09-14  6:33 ` [PATCH v6 10/25] KVM: x86: Add kvm_msr_{read,write}() helpers Yang Weijiang
@ 2023-09-14  6:33 ` Yang Weijiang
  2023-10-31 17:47   ` Maxim Levitsky
  2023-09-14  6:33 ` [PATCH v6 12/25] KVM: x86: Refresh CPUID on write to guest MSR_IA32_XSS Yang Weijiang
                   ` (14 subsequent siblings)
  25 siblings, 1 reply; 120+ messages in thread
From: Yang Weijiang @ 2023-09-14  6:33 UTC (permalink / raw)
  To: seanjc, pbonzini, kvm, linux-kernel
  Cc: dave.hansen, peterz, chao.gao, rick.p.edgecombe, weijiang.yang,
	john.allen

From: Sean Christopherson <seanjc@google.com>

Add MSR_IA32_XSS to the list of MSRs reported to userspace if supported_xss
is non-zero, i.e. KVM supports at least one XSS-based feature.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>
---
 arch/x86/kvm/x86.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index e0b55c043dab..1258d1d6dd52 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1464,6 +1464,7 @@ static const u32 msrs_to_save_base[] = {
 	MSR_IA32_UMWAIT_CONTROL,
 
 	MSR_IA32_XFD, MSR_IA32_XFD_ERR,
+	MSR_IA32_XSS,
 };
 
 static const u32 msrs_to_save_pmu[] = {
@@ -7195,6 +7196,10 @@ static void kvm_probe_msr_to_save(u32 msr_index)
 		if (!(kvm_get_arch_capabilities() & ARCH_CAP_TSX_CTRL_MSR))
 			return;
 		break;
+	case MSR_IA32_XSS:
+		if (!kvm_caps.supported_xss)
+			return;
+		break;
 	default:
 		break;
 	}
-- 
2.27.0



* [PATCH v6 12/25] KVM: x86: Refresh CPUID on write to guest MSR_IA32_XSS
  2023-09-14  6:33 [PATCH v6 00/25] Enable CET Virtualization Yang Weijiang
                   ` (10 preceding siblings ...)
  2023-09-14  6:33 ` [PATCH v6 11/25] KVM: x86: Report XSS as to-be-saved if there are supported features Yang Weijiang
@ 2023-09-14  6:33 ` Yang Weijiang
  2023-10-08  5:54   ` Chao Gao
                     ` (2 more replies)
  2023-09-14  6:33 ` [PATCH v6 13/25] KVM: x86: Initialize kvm_caps.supported_xss Yang Weijiang
                   ` (13 subsequent siblings)
  25 siblings, 3 replies; 120+ messages in thread
From: Yang Weijiang @ 2023-09-14  6:33 UTC (permalink / raw)
  To: seanjc, pbonzini, kvm, linux-kernel
  Cc: dave.hansen, peterz, chao.gao, rick.p.edgecombe, weijiang.yang,
	john.allen, Zhang Yi Z

Update CPUID.(EAX=0DH,ECX=1).EBX to reflect the current required xstate size
when the XSS MSR is modified.
CPUID.(EAX=0DH,ECX=1).EBX reports the required storage size of all enabled
xstate features in (XCR0 | IA32_XSS). The guest can use this CPUID value to
allocate a sufficiently sized xsave buffer.

Note, KVM does not yet support any XSS based features, i.e. supported_xss
is guaranteed to be zero at this time.

Opportunistically modify the XSS write access logic: if !guest_cpuid_has(),
a write initiated from the host is allowed iff it is a reset operation,
i.e., data == 0; reject host-initiated non-reset writes and any guest write.
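The write-access rule described above can be sketched as a small predicate;
this is an illustrative model, not the kernel function itself:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Sketch of the XSS write-access rule: a write is rejected unless the
 * guest enumerates XSAVES, or the write is a host-initiated reset
 * (data == 0).  Value checks against guest_supported_xss are omitted.
 */
static bool xss_write_allowed(bool guest_has_xsaves, bool host_initiated,
			      uint64_t data)
{
	bool host_msr_reset = host_initiated && data == 0;

	return guest_has_xsaves || host_msr_reset;
}
```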

Suggested-by: Sean Christopherson <seanjc@google.com>
Co-developed-by: Zhang Yi Z <yi.z.zhang@linux.intel.com>
Signed-off-by: Zhang Yi Z <yi.z.zhang@linux.intel.com>
Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>
---
 arch/x86/include/asm/kvm_host.h |  1 +
 arch/x86/kvm/cpuid.c            | 15 ++++++++++++++-
 arch/x86/kvm/x86.c              | 13 +++++++++----
 3 files changed, 24 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 0fc5e6312e93..d77b030e996c 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -803,6 +803,7 @@ struct kvm_vcpu_arch {
 
 	u64 xcr0;
 	u64 guest_supported_xcr0;
+	u64 guest_supported_xss;
 
 	struct kvm_pio_request pio;
 	void *pio_data;
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index 1f206caec559..4e7a820cba62 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -275,7 +275,8 @@ static void __kvm_update_cpuid_runtime(struct kvm_vcpu *vcpu, struct kvm_cpuid_e
 	best = cpuid_entry2_find(entries, nent, 0xD, 1);
 	if (best && (cpuid_entry_has(best, X86_FEATURE_XSAVES) ||
 		     cpuid_entry_has(best, X86_FEATURE_XSAVEC)))
-		best->ebx = xstate_required_size(vcpu->arch.xcr0, true);
+		best->ebx = xstate_required_size(vcpu->arch.xcr0 |
+						 vcpu->arch.ia32_xss, true);
 
 	best = __kvm_find_kvm_cpuid_features(vcpu, entries, nent);
 	if (kvm_hlt_in_guest(vcpu->kvm) && best &&
@@ -312,6 +313,17 @@ static u64 vcpu_get_supported_xcr0(struct kvm_vcpu *vcpu)
 	return (best->eax | ((u64)best->edx << 32)) & kvm_caps.supported_xcr0;
 }
 
+static u64 vcpu_get_supported_xss(struct kvm_vcpu *vcpu)
+{
+	struct kvm_cpuid_entry2 *best;
+
+	best = kvm_find_cpuid_entry_index(vcpu, 0xd, 1);
+	if (!best)
+		return 0;
+
+	return (best->ecx | ((u64)best->edx << 32)) & kvm_caps.supported_xss;
+}
+
 static bool kvm_cpuid_has_hyperv(struct kvm_cpuid_entry2 *entries, int nent)
 {
 	struct kvm_cpuid_entry2 *entry;
@@ -358,6 +370,7 @@ static void kvm_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
 	}
 
 	vcpu->arch.guest_supported_xcr0 = vcpu_get_supported_xcr0(vcpu);
+	vcpu->arch.guest_supported_xss = vcpu_get_supported_xss(vcpu);
 
 	/*
 	 * FP+SSE can always be saved/restored via KVM_{G,S}ET_XSAVE, even if
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 1258d1d6dd52..9a616d84bd39 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3795,20 +3795,25 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 			vcpu->arch.ia32_tsc_adjust_msr += adj;
 		}
 		break;
-	case MSR_IA32_XSS:
-		if (!msr_info->host_initiated &&
-		    !guest_cpuid_has(vcpu, X86_FEATURE_XSAVES))
+	case MSR_IA32_XSS: {
+		bool host_msr_reset = msr_info->host_initiated && data == 0;
+
+		if (!guest_cpuid_has(vcpu, X86_FEATURE_XSAVES) &&
+		    (!host_msr_reset || !msr_info->host_initiated))
 			return 1;
 		/*
 		 * KVM supports exposing PT to the guest, but does not support
 		 * IA32_XSS[bit 8]. Guests have to use RDMSR/WRMSR rather than
 		 * XSAVES/XRSTORS to save/restore PT MSRs.
 		 */
-		if (data & ~kvm_caps.supported_xss)
+		if (data & ~vcpu->arch.guest_supported_xss)
 			return 1;
+		if (vcpu->arch.ia32_xss == data)
+			break;
 		vcpu->arch.ia32_xss = data;
 		kvm_update_cpuid_runtime(vcpu);
 		break;
+	}
 	case MSR_SMI_COUNT:
 		if (!msr_info->host_initiated)
 			return 1;
-- 
2.27.0



* [PATCH v6 13/25] KVM: x86: Initialize kvm_caps.supported_xss
  2023-09-14  6:33 [PATCH v6 00/25] Enable CET Virtualization Yang Weijiang
                   ` (11 preceding siblings ...)
  2023-09-14  6:33 ` [PATCH v6 12/25] KVM: x86: Refresh CPUID on write to guest MSR_IA32_XSS Yang Weijiang
@ 2023-09-14  6:33 ` Yang Weijiang
  2023-10-31 17:51   ` Maxim Levitsky
  2023-09-14  6:33 ` [PATCH v6 14/25] KVM: x86: Load guest FPU state when access XSAVE-managed MSRs Yang Weijiang
                   ` (12 subsequent siblings)
  25 siblings, 1 reply; 120+ messages in thread
From: Yang Weijiang @ 2023-09-14  6:33 UTC (permalink / raw)
  To: seanjc, pbonzini, kvm, linux-kernel
  Cc: dave.hansen, peterz, chao.gao, rick.p.edgecombe, weijiang.yang,
	john.allen

Set the initial kvm_caps.supported_xss to (host_xss & KVM_SUPPORTED_XSS) if
XSAVES is supported. host_xss contains the host supported xstate feature
bits for thread FPU context switching, while KVM_SUPPORTED_XSS includes all
KVM-enabled XSS feature bits. The resulting value represents the supervisor
xstates that are available to the guest and are backed by the host FPU
framework for swapping {guest,host} XSAVE-managed registers/MSRs.

Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>
---
 arch/x86/kvm/x86.c | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 9a616d84bd39..66edbed25db8 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -226,6 +226,8 @@ static struct kvm_user_return_msrs __percpu *user_return_msrs;
 				| XFEATURE_MASK_BNDCSR | XFEATURE_MASK_AVX512 \
 				| XFEATURE_MASK_PKRU | XFEATURE_MASK_XTILE)
 
+#define KVM_SUPPORTED_XSS     0
+
 u64 __read_mostly host_efer;
 EXPORT_SYMBOL_GPL(host_efer);
 
@@ -9515,12 +9517,13 @@ static int __kvm_x86_vendor_init(struct kvm_x86_init_ops *ops)
 		host_xcr0 = xgetbv(XCR_XFEATURE_ENABLED_MASK);
 		kvm_caps.supported_xcr0 = host_xcr0 & KVM_SUPPORTED_XCR0;
 	}
+	if (boot_cpu_has(X86_FEATURE_XSAVES)) {
+		rdmsrl(MSR_IA32_XSS, host_xss);
+		kvm_caps.supported_xss = host_xss & KVM_SUPPORTED_XSS;
+	}
 
 	rdmsrl_safe(MSR_EFER, &host_efer);
 
-	if (boot_cpu_has(X86_FEATURE_XSAVES))
-		rdmsrl(MSR_IA32_XSS, host_xss);
-
 	kvm_init_pmu_capability(ops->pmu_ops);
 
 	if (boot_cpu_has(X86_FEATURE_ARCH_CAPABILITIES))
-- 
2.27.0



* [PATCH v6 14/25] KVM: x86: Load guest FPU state when access XSAVE-managed MSRs
  2023-09-14  6:33 [PATCH v6 00/25] Enable CET Virtualization Yang Weijiang
                   ` (12 preceding siblings ...)
  2023-09-14  6:33 ` [PATCH v6 13/25] KVM: x86: Initialize kvm_caps.supported_xss Yang Weijiang
@ 2023-09-14  6:33 ` Yang Weijiang
  2023-10-31 17:51   ` Maxim Levitsky
  2023-09-14  6:33 ` [PATCH v6 15/25] KVM: x86: Add fault checks for guest CR4.CET setting Yang Weijiang
                   ` (11 subsequent siblings)
  25 siblings, 1 reply; 120+ messages in thread
From: Yang Weijiang @ 2023-09-14  6:33 UTC (permalink / raw)
  To: seanjc, pbonzini, kvm, linux-kernel
  Cc: dave.hansen, peterz, chao.gao, rick.p.edgecombe, weijiang.yang,
	john.allen

From: Sean Christopherson <seanjc@google.com>

Load the guest's FPU state if userspace is accessing MSRs whose values are
managed by XSAVES. Introduce two helpers, kvm_{get,set}_xstate_msr(), to
facilitate access to such MSRs.

If MSRs supported in kvm_caps.supported_xss are passed through to the guest,
the guest MSRs are swapped with the host's before the vCPU exits to
userspace and after it re-enters the kernel before the next VM-entry.

Because the modified code is also used for the KVM_GET_MSRS device ioctl(),
explicitly check @vcpu is non-null before attempting to load guest state.
The XSS supporting MSRs cannot be retrieved via the device ioctl() without
loading guest FPU state (which doesn't exist).

Note that guest_cpuid_has() is not queried as host userspace is allowed to
access MSRs that have not been exposed to the guest, e.g. it might do
KVM_SET_MSRS prior to KVM_SET_CPUID2.
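The batching in __msr_io() loads the guest FPU at most once per MSR batch,
and only if some MSR in the batch is XSAVE-managed. A minimal model of that
pattern, with counters standing in for kvm_load_guest_fpu()/
kvm_put_guest_fpu() and placeholder indices standing in for the CET MSR
list:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

static int fpu_loads, fpu_puts;	/* stand-ins for load/put FPU calls */

static bool is_xstate_msr_example(uint32_t index)
{
	return index == 0x6a0 || index == 0x6a4; /* placeholder indices */
}

/* Load-once pattern: the FPU is loaded lazily, at most once per batch. */
static void process_batch(const uint32_t *index, int n)
{
	bool fpu_loaded = false;
	int i;

	for (i = 0; i < n; i++) {
		if (!fpu_loaded && is_xstate_msr_example(index[i])) {
			fpu_loads++;	/* kvm_load_guest_fpu() */
			fpu_loaded = true;
		}
		/* do_msr() on index[i] would run here */
	}
	if (fpu_loaded)
		fpu_puts++;		/* kvm_put_guest_fpu() */
}
```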

Signed-off-by: Sean Christopherson <seanjc@google.com>
Co-developed-by: Yang Weijiang <weijiang.yang@intel.com>
Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>
---
 arch/x86/kvm/x86.c | 30 +++++++++++++++++++++++++++++-
 arch/x86/kvm/x86.h | 24 ++++++++++++++++++++++++
 2 files changed, 53 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 66edbed25db8..a091764bf1d2 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -133,6 +133,9 @@ static int __set_sregs2(struct kvm_vcpu *vcpu, struct kvm_sregs2 *sregs2);
 static void __get_sregs2(struct kvm_vcpu *vcpu, struct kvm_sregs2 *sregs2);
 
 static DEFINE_MUTEX(vendor_module_lock);
+static void kvm_load_guest_fpu(struct kvm_vcpu *vcpu);
+static void kvm_put_guest_fpu(struct kvm_vcpu *vcpu);
+
 struct kvm_x86_ops kvm_x86_ops __read_mostly;
 
 #define KVM_X86_OP(func)					     \
@@ -4372,6 +4375,22 @@ int kvm_get_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 }
 EXPORT_SYMBOL_GPL(kvm_get_msr_common);
 
+static const u32 xstate_msrs[] = {
+	MSR_IA32_U_CET, MSR_IA32_PL0_SSP, MSR_IA32_PL1_SSP,
+	MSR_IA32_PL2_SSP, MSR_IA32_PL3_SSP,
+};
+
+static bool is_xstate_msr(u32 index)
+{
+	int i;
+
+	for (i = 0; i < ARRAY_SIZE(xstate_msrs); i++) {
+		if (index == xstate_msrs[i])
+			return true;
+	}
+	return false;
+}
+
 /*
  * Read or write a bunch of msrs. All parameters are kernel addresses.
  *
@@ -4382,11 +4401,20 @@ static int __msr_io(struct kvm_vcpu *vcpu, struct kvm_msrs *msrs,
 		    int (*do_msr)(struct kvm_vcpu *vcpu,
 				  unsigned index, u64 *data))
 {
+	bool fpu_loaded = false;
 	int i;
 
-	for (i = 0; i < msrs->nmsrs; ++i)
+	for (i = 0; i < msrs->nmsrs; ++i) {
+		if (vcpu && !fpu_loaded && kvm_caps.supported_xss &&
+		    is_xstate_msr(entries[i].index)) {
+			kvm_load_guest_fpu(vcpu);
+			fpu_loaded = true;
+		}
 		if (do_msr(vcpu, entries[i].index, &entries[i].data))
 			break;
+	}
+	if (fpu_loaded)
+		kvm_put_guest_fpu(vcpu);
 
 	return i;
 }
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index 1e7be1f6ab29..9a8e3a84eaf4 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -540,4 +540,28 @@ int kvm_sev_es_string_io(struct kvm_vcpu *vcpu, unsigned int size,
 			 unsigned int port, void *data,  unsigned int count,
 			 int in);
 
+/*
+ * Lock and/or reload guest FPU and access xstate MSRs. For accesses initiated
+ * by host, guest FPU is loaded in __msr_io(). For accesses initiated by guest,
+ * guest FPU should have been loaded already.
+ */
+
+static inline void kvm_get_xstate_msr(struct kvm_vcpu *vcpu,
+				      struct msr_data *msr_info)
+{
+	KVM_BUG_ON(!vcpu->arch.guest_fpu.fpstate->in_use, vcpu->kvm);
+	kvm_fpu_get();
+	rdmsrl(msr_info->index, msr_info->data);
+	kvm_fpu_put();
+}
+
+static inline void kvm_set_xstate_msr(struct kvm_vcpu *vcpu,
+				      struct msr_data *msr_info)
+{
+	KVM_BUG_ON(!vcpu->arch.guest_fpu.fpstate->in_use, vcpu->kvm);
+	kvm_fpu_get();
+	wrmsrl(msr_info->index, msr_info->data);
+	kvm_fpu_put();
+}
+
 #endif
-- 
2.27.0



* [PATCH v6 15/25] KVM: x86: Add fault checks for guest CR4.CET setting
  2023-09-14  6:33 [PATCH v6 00/25] Enable CET Virtualization Yang Weijiang
                   ` (13 preceding siblings ...)
  2023-09-14  6:33 ` [PATCH v6 14/25] KVM: x86: Load guest FPU state when access XSAVE-managed MSRs Yang Weijiang
@ 2023-09-14  6:33 ` Yang Weijiang
  2023-10-31 17:51   ` Maxim Levitsky
  2023-09-14  6:33 ` [PATCH v6 16/25] KVM: x86: Report KVM supported CET MSRs as to-be-saved Yang Weijiang
                   ` (10 subsequent siblings)
  25 siblings, 1 reply; 120+ messages in thread
From: Yang Weijiang @ 2023-09-14  6:33 UTC (permalink / raw)
  To: seanjc, pbonzini, kvm, linux-kernel
  Cc: dave.hansen, peterz, chao.gao, rick.p.edgecombe, weijiang.yang,
	john.allen

Check potential faults for CR4.CET setting per Intel SDM requirements.
CET can be enabled if and only if CR0.WP == 1, i.e. setting CR4.CET ==
1 faults if CR0.WP == 0 and setting CR0.WP == 0 fails if CR4.CET == 1.
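The two-sided dependency can be expressed as a pair of predicates; this is a
sketch of the rule stated above, not KVM's actual helpers:

```c
#include <assert.h>
#include <stdbool.h>

/* Setting CR4.CET == 1 faults if CR0.WP == 0. */
static bool cr4_write_faults(bool cr0_wp, bool new_cr4_cet)
{
	return new_cr4_cet && !cr0_wp;
}

/* Clearing CR0.WP faults while CR4.CET == 1. */
static bool cr0_write_faults(bool cr4_cet, bool new_cr0_wp)
{
	return !new_cr0_wp && cr4_cet;
}
```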

Reviewed-by: Chao Gao <chao.gao@intel.com>
Co-developed-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>
---
 arch/x86/kvm/x86.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index a091764bf1d2..dda9c7141ea1 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1006,6 +1006,9 @@ int kvm_set_cr0(struct kvm_vcpu *vcpu, unsigned long cr0)
 	    (is_64_bit_mode(vcpu) || kvm_is_cr4_bit_set(vcpu, X86_CR4_PCIDE)))
 		return 1;
 
+	if (!(cr0 & X86_CR0_WP) && kvm_is_cr4_bit_set(vcpu, X86_CR4_CET))
+		return 1;
+
 	static_call(kvm_x86_set_cr0)(vcpu, cr0);
 
 	kvm_post_set_cr0(vcpu, old_cr0, cr0);
@@ -1217,6 +1220,9 @@ int kvm_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
 			return 1;
 	}
 
+	if ((cr4 & X86_CR4_CET) && !kvm_is_cr0_bit_set(vcpu, X86_CR0_WP))
+		return 1;
+
 	static_call(kvm_x86_set_cr4)(vcpu, cr4);
 
 	kvm_post_set_cr4(vcpu, old_cr4, cr4);
-- 
2.27.0



* [PATCH v6 16/25] KVM: x86: Report KVM supported CET MSRs as to-be-saved
  2023-09-14  6:33 [PATCH v6 00/25] Enable CET Virtualization Yang Weijiang
                   ` (14 preceding siblings ...)
  2023-09-14  6:33 ` [PATCH v6 15/25] KVM: x86: Add fault checks for guest CR4.CET setting Yang Weijiang
@ 2023-09-14  6:33 ` Yang Weijiang
  2023-10-08  6:19   ` Chao Gao
  2023-10-31 17:52   ` Maxim Levitsky
  2023-09-14  6:33 ` [PATCH v6 17/25] KVM: VMX: Introduce CET VMCS fields and control bits Yang Weijiang
                   ` (9 subsequent siblings)
  25 siblings, 2 replies; 120+ messages in thread
From: Yang Weijiang @ 2023-09-14  6:33 UTC (permalink / raw)
  To: seanjc, pbonzini, kvm, linux-kernel
  Cc: dave.hansen, peterz, chao.gao, rick.p.edgecombe, weijiang.yang,
	john.allen

Add CET MSRs to the list of MSRs reported to userspace if the feature,
i.e. IBT or SHSTK, associated with the MSRs is supported by KVM.

SSP can only be read via RDSSP. Writing it requires destructive and
potentially faulting operations such as SAVEPREVSSP/RSTORSSP or
SETSSBSY/CLRSSBSY. Let the host use a pseudo-MSR that is just a wrapper
for the GUEST_SSP field of the VMCS.

Suggested-by: Chao Gao <chao.gao@intel.com>
Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>
---
 arch/x86/include/uapi/asm/kvm_para.h |  1 +
 arch/x86/kvm/vmx/vmx.c               |  2 ++
 arch/x86/kvm/x86.c                   | 18 ++++++++++++++++++
 3 files changed, 21 insertions(+)

diff --git a/arch/x86/include/uapi/asm/kvm_para.h b/arch/x86/include/uapi/asm/kvm_para.h
index 6e64b27b2c1e..9864bbcf2470 100644
--- a/arch/x86/include/uapi/asm/kvm_para.h
+++ b/arch/x86/include/uapi/asm/kvm_para.h
@@ -58,6 +58,7 @@
 #define MSR_KVM_ASYNC_PF_INT	0x4b564d06
 #define MSR_KVM_ASYNC_PF_ACK	0x4b564d07
 #define MSR_KVM_MIGRATION_CONTROL	0x4b564d08
+#define MSR_KVM_SSP	0x4b564d09
 
 struct kvm_steal_time {
 	__u64 steal;
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 72e3943f3693..9409753f45b0 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7009,6 +7009,8 @@ static bool vmx_has_emulated_msr(struct kvm *kvm, u32 index)
 	case MSR_AMD64_TSC_RATIO:
 		/* This is AMD only.  */
 		return false;
+	case MSR_KVM_SSP:
+		return kvm_cpu_cap_has(X86_FEATURE_SHSTK);
 	default:
 		return true;
 	}
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index dda9c7141ea1..73b45351c0fc 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1476,6 +1476,9 @@ static const u32 msrs_to_save_base[] = {
 
 	MSR_IA32_XFD, MSR_IA32_XFD_ERR,
 	MSR_IA32_XSS,
+	MSR_IA32_U_CET, MSR_IA32_S_CET,
+	MSR_IA32_PL0_SSP, MSR_IA32_PL1_SSP, MSR_IA32_PL2_SSP,
+	MSR_IA32_PL3_SSP, MSR_IA32_INT_SSP_TAB,
 };
 
 static const u32 msrs_to_save_pmu[] = {
@@ -1576,6 +1579,7 @@ static const u32 emulated_msrs_all[] = {
 
 	MSR_K7_HWCR,
 	MSR_KVM_POLL_CONTROL,
+	MSR_KVM_SSP,
 };
 
 static u32 emulated_msrs[ARRAY_SIZE(emulated_msrs_all)];
@@ -7241,6 +7245,20 @@ static void kvm_probe_msr_to_save(u32 msr_index)
 		if (!kvm_caps.supported_xss)
 			return;
 		break;
+	case MSR_IA32_U_CET:
+	case MSR_IA32_S_CET:
+		if (!kvm_cpu_cap_has(X86_FEATURE_SHSTK) &&
+		    !kvm_cpu_cap_has(X86_FEATURE_IBT))
+			return;
+		break;
+	case MSR_IA32_INT_SSP_TAB:
+		if (!kvm_cpu_cap_has(X86_FEATURE_LM))
+			return;
+		fallthrough;
+	case MSR_IA32_PL0_SSP ... MSR_IA32_PL3_SSP:
+		if (!kvm_cpu_cap_has(X86_FEATURE_SHSTK))
+			return;
+		break;
 	default:
 		break;
 	}
-- 
2.27.0



* [PATCH v6 17/25] KVM: VMX: Introduce CET VMCS fields and control bits
@ 2023-09-14  6:33 ` Yang Weijiang
From: Yang Weijiang @ 2023-09-14  6:33 UTC (permalink / raw)
  To: seanjc, pbonzini, kvm, linux-kernel
  Cc: dave.hansen, peterz, chao.gao, rick.p.edgecombe, weijiang.yang,
	john.allen, Zhang Yi Z

Control-flow Enforcement Technology (CET) is a CPU feature used to prevent
Return/CALL/Jump-Oriented Programming (ROP/COP/JOP) attacks. It provides
two sub-features (SHSTK, IBT) to defend against ROP/COP/JOP style
control-flow subversion attacks.

Shadow Stack (SHSTK):
  A shadow stack is a second stack used exclusively for control transfer
  operations. The shadow stack is separate from the data/normal stack and
  can be enabled individually in user and kernel mode. When shadow stack
  is enabled, CALL pushes the return address on both the data and shadow
  stack. RET pops the return address from both stacks and compares them.
  If the return addresses from the two stacks do not match, the processor
  generates a #CP.
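The CALL/RET matching described above can be sketched with a toy model
(illustrative only — `toy_cpu`, `toy_call` and `toy_ret` are hypothetical
names, not KVM or hardware interfaces):

```c
#include <stdint.h>

#define STACK_DEPTH 16

/* Toy model of SHSTK semantics: CALL pushes the return address onto both
 * stacks; RET pops both and raises #CP (modeled as a 0 return) on mismatch. */
struct toy_cpu {
	uint64_t data_stack[STACK_DEPTH];
	uint64_t shadow_stack[STACK_DEPTH];
	int sp, ssp;
};

static void toy_call(struct toy_cpu *cpu, uint64_t ret_addr)
{
	cpu->data_stack[cpu->sp++] = ret_addr;
	cpu->shadow_stack[cpu->ssp++] = ret_addr;
}

/* Returns 1 on a successful return, 0 to model a #CP fault. */
static int toy_ret(struct toy_cpu *cpu, uint64_t *ret_addr)
{
	uint64_t from_data = cpu->data_stack[--cpu->sp];
	uint64_t from_shadow = cpu->shadow_stack[--cpu->ssp];

	if (from_data != from_shadow)
		return 0;	/* #CP: return addresses diverged */
	*ret_addr = from_data;
	return 1;
}
```

An attacker overwriting the data-stack return address (the classic ROP
primitive) trips the mismatch, which is exactly the protection SHSTK adds.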

Indirect Branch Tracking (IBT):
  IBT introduces a new instruction (ENDBRANCH) to mark valid target
  addresses of indirect branches (CALL, JMP, etc.). If an indirect branch
  is executed and the next instruction is _not_ an ENDBRANCH, the processor
  generates a #CP. The instruction behaves as a NOP on platforms that do
  not support CET.
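The tracker behavior can likewise be sketched as a tiny state machine
(hypothetical names, not real KVM/CPU code):

```c
/* Toy IBT tracker: after an indirect branch the CPU enters WAIT_ENDBR and
 * any instruction other than ENDBRANCH raises #CP. */
enum toy_insn { INSN_ENDBR, INSN_NOP, INSN_INDIRECT_CALL };
enum toy_tracker { TRACKER_IDLE, TRACKER_WAIT_ENDBR };

/* Returns 0 on success, -1 to model a #CP fault. */
static int ibt_step(enum toy_tracker *t, enum toy_insn i)
{
	if (*t == TRACKER_WAIT_ENDBR && i != INSN_ENDBR)
		return -1;	/* #CP: indirect branch to a non-ENDBR target */
	*t = (i == INSN_INDIRECT_CALL) ? TRACKER_WAIT_ENDBR : TRACKER_IDLE;
	return 0;
}
```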

Several new CET MSRs are defined to support CET:
  MSR_IA32_{U,S}_CET: CET settings for user and supervisor mode
			respectively.

  MSR_IA32_PL{0,1,2,3}_SSP: SHSTK pointer linear address for CPL{0,1,2,3}.

  MSR_IA32_INT_SSP_TAB: Linear address of the SHSTK pointer table, whose
			entries are indexed by the IST field of the
			interrupt gate descriptor.

Two XSAVES state bits are introduced for CET:
  IA32_XSS:[bit 11]: Controls saving/restoring user mode CET states.
  IA32_XSS:[bit 12]: Controls saving/restoring supervisor mode CET states.

Six VMCS fields are introduced for CET:
  {HOST,GUEST}_S_CET: Stores CET settings for kernel mode.
  {HOST,GUEST}_SSP: Stores current active SSP.
  {HOST,GUEST}_INTR_SSP_TABLE: Stores current active MSR_IA32_INT_SSP_TAB.

On Intel platforms, two additional bits are defined in the VM_EXIT and
VM_ENTRY control fields:
If VM_EXIT_LOAD_CET_STATE = 1, host CET states are loaded from the
following VMCS fields at VM-Exit:
  HOST_S_CET
  HOST_SSP
  HOST_INTR_SSP_TABLE

If VM_ENTRY_LOAD_CET_STATE = 1, guest CET states are loaded from the
following VMCS fields at VM-Entry:
  GUEST_S_CET
  GUEST_SSP
  GUEST_INTR_SSP_TABLE

Reviewed-by: Chao Gao <chao.gao@intel.com>
Co-developed-by: Zhang Yi Z <yi.z.zhang@linux.intel.com>
Signed-off-by: Zhang Yi Z <yi.z.zhang@linux.intel.com>
Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>
---
 arch/x86/include/asm/vmx.h | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index 0e73616b82f3..451fd4f4fedc 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -104,6 +104,7 @@
 #define VM_EXIT_CLEAR_BNDCFGS                   0x00800000
 #define VM_EXIT_PT_CONCEAL_PIP			0x01000000
 #define VM_EXIT_CLEAR_IA32_RTIT_CTL		0x02000000
+#define VM_EXIT_LOAD_CET_STATE                  0x10000000
 
 #define VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR	0x00036dff
 
@@ -117,6 +118,7 @@
 #define VM_ENTRY_LOAD_BNDCFGS                   0x00010000
 #define VM_ENTRY_PT_CONCEAL_PIP			0x00020000
 #define VM_ENTRY_LOAD_IA32_RTIT_CTL		0x00040000
+#define VM_ENTRY_LOAD_CET_STATE                 0x00100000
 
 #define VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR	0x000011ff
 
@@ -345,6 +347,9 @@ enum vmcs_field {
 	GUEST_PENDING_DBG_EXCEPTIONS    = 0x00006822,
 	GUEST_SYSENTER_ESP              = 0x00006824,
 	GUEST_SYSENTER_EIP              = 0x00006826,
+	GUEST_S_CET                     = 0x00006828,
+	GUEST_SSP                       = 0x0000682a,
+	GUEST_INTR_SSP_TABLE            = 0x0000682c,
 	HOST_CR0                        = 0x00006c00,
 	HOST_CR3                        = 0x00006c02,
 	HOST_CR4                        = 0x00006c04,
@@ -357,6 +362,9 @@ enum vmcs_field {
 	HOST_IA32_SYSENTER_EIP          = 0x00006c12,
 	HOST_RSP                        = 0x00006c14,
 	HOST_RIP                        = 0x00006c16,
+	HOST_S_CET                      = 0x00006c18,
+	HOST_SSP                        = 0x00006c1a,
+	HOST_INTR_SSP_TABLE             = 0x00006c1c
 };
 
 /*
-- 
2.27.0



* [PATCH v6 18/25] KVM: x86: Use KVM-governed feature framework to track "SHSTK/IBT enabled"
@ 2023-09-14  6:33 ` Yang Weijiang
From: Yang Weijiang @ 2023-09-14  6:33 UTC (permalink / raw)
  To: seanjc, pbonzini, kvm, linux-kernel
  Cc: dave.hansen, peterz, chao.gao, rick.p.edgecombe, weijiang.yang,
	john.allen

Use the governed feature framework to track whether X86_FEATURE_SHSTK
and X86_FEATURE_IBT can be used by the guest, i.e., each feature is
usable iff both KVM and the guest's CPUID support it.

Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>
---
 arch/x86/kvm/governed_features.h | 2 ++
 arch/x86/kvm/vmx/vmx.c           | 2 ++
 2 files changed, 4 insertions(+)

diff --git a/arch/x86/kvm/governed_features.h b/arch/x86/kvm/governed_features.h
index 423a73395c10..db7e21c5ecc2 100644
--- a/arch/x86/kvm/governed_features.h
+++ b/arch/x86/kvm/governed_features.h
@@ -16,6 +16,8 @@ KVM_GOVERNED_X86_FEATURE(PAUSEFILTER)
 KVM_GOVERNED_X86_FEATURE(PFTHRESHOLD)
 KVM_GOVERNED_X86_FEATURE(VGIF)
 KVM_GOVERNED_X86_FEATURE(VNMI)
+KVM_GOVERNED_X86_FEATURE(SHSTK)
+KVM_GOVERNED_X86_FEATURE(IBT)
 
 #undef KVM_GOVERNED_X86_FEATURE
 #undef KVM_GOVERNED_FEATURE
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 9409753f45b0..fd5893b3a2c8 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7765,6 +7765,8 @@ static void vmx_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
 		kvm_governed_feature_check_and_set(vcpu, X86_FEATURE_XSAVES);
 
 	kvm_governed_feature_check_and_set(vcpu, X86_FEATURE_VMX);
+	kvm_governed_feature_check_and_set(vcpu, X86_FEATURE_SHSTK);
+	kvm_governed_feature_check_and_set(vcpu, X86_FEATURE_IBT);
 
 	vmx_setup_uret_msrs(vmx);
 
-- 
2.27.0



* [PATCH v6 19/25] KVM: VMX: Emulate read and write to CET MSRs
@ 2023-09-14  6:33 ` Yang Weijiang
From: Yang Weijiang @ 2023-09-14  6:33 UTC (permalink / raw)
  To: seanjc, pbonzini, kvm, linux-kernel
  Cc: dave.hansen, peterz, chao.gao, rick.p.edgecombe, weijiang.yang,
	john.allen

Add an emulation interface for CET MSR access. The emulation code is split
into a common part and a vendor-specific part: the former performs common
checks for the MSRs and reads/writes XSAVE-managed MSRs directly via
helpers, while the latter accesses the MSRs linked to VMCS fields.
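The common-path value check this patch adds for MSR_IA32_{U,S}_CET can be
sketched as a standalone function (this mocks the kernel's GENMASK_ULL and
IS_ALIGNED helpers; `cet_us_value_invalid` is a hypothetical name):

```c
#include <stdint.h>

#define GENMASK_ULL(h, l) \
	((~0ULL << (l)) & (~0ULL >> (63 - (h))))

#define CET_US_RESERVED_BITS		GENMASK_ULL(9, 6)
#define CET_US_SHSTK_MASK_BITS		GENMASK_ULL(1, 0)
#define CET_US_IBT_MASK_BITS		(GENMASK_ULL(5, 2) | GENMASK_ULL(63, 10))
#define CET_US_LEGACY_BITMAP_BASE(data)	((data) >> 12)

/* Returns 1 if @data should #GP given the guest's SHSTK/IBT enables. */
static int cet_us_value_invalid(uint64_t data, int have_shstk, int have_ibt)
{
	if (data & CET_US_RESERVED_BITS)
		return 1;
	if (!have_shstk && (data & CET_US_SHSTK_MASK_BITS))
		return 1;
	if (!have_ibt && (data & CET_US_IBT_MASK_BITS))
		return 1;
	if (CET_US_LEGACY_BITMAP_BASE(data) & 3)	/* IS_ALIGNED(base, 4) */
		return 1;
	return 0;
}
```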

Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>
---
 arch/x86/kvm/vmx/vmx.c | 18 +++++++++++
 arch/x86/kvm/x86.c     | 71 ++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 89 insertions(+)

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index fd5893b3a2c8..9f4b56337251 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -2111,6 +2111,15 @@ static int vmx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 		else
 			msr_info->data = vmx->pt_desc.guest.addr_a[index / 2];
 		break;
+	case MSR_IA32_S_CET:
+		msr_info->data = vmcs_readl(GUEST_S_CET);
+		break;
+	case MSR_KVM_SSP:
+		msr_info->data = vmcs_readl(GUEST_SSP);
+		break;
+	case MSR_IA32_INT_SSP_TAB:
+		msr_info->data = vmcs_readl(GUEST_INTR_SSP_TABLE);
+		break;
 	case MSR_IA32_DEBUGCTLMSR:
 		msr_info->data = vmcs_read64(GUEST_IA32_DEBUGCTL);
 		break;
@@ -2420,6 +2429,15 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 		else
 			vmx->pt_desc.guest.addr_a[index / 2] = data;
 		break;
+	case MSR_IA32_S_CET:
+		vmcs_writel(GUEST_S_CET, data);
+		break;
+	case MSR_KVM_SSP:
+		vmcs_writel(GUEST_SSP, data);
+		break;
+	case MSR_IA32_INT_SSP_TAB:
+		vmcs_writel(GUEST_INTR_SSP_TABLE, data);
+		break;
 	case MSR_IA32_PERF_CAPABILITIES:
 		if (data && !vcpu_to_pmu(vcpu)->version)
 			return 1;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 73b45351c0fc..c85ee42ab4f1 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1847,6 +1847,11 @@ bool kvm_msr_allowed(struct kvm_vcpu *vcpu, u32 index, u32 type)
 }
 EXPORT_SYMBOL_GPL(kvm_msr_allowed);
 
+#define CET_US_RESERVED_BITS		GENMASK(9, 6)
+#define CET_US_SHSTK_MASK_BITS		GENMASK(1, 0)
+#define CET_US_IBT_MASK_BITS		(GENMASK_ULL(5, 2) | GENMASK_ULL(63, 10))
+#define CET_US_LEGACY_BITMAP_BASE(data)	((data) >> 12)
+
 /*
  * Write @data into the MSR specified by @index.  Select MSR specific fault
  * checks are bypassed if @host_initiated is %true.
@@ -1856,6 +1861,7 @@ EXPORT_SYMBOL_GPL(kvm_msr_allowed);
 static int __kvm_set_msr(struct kvm_vcpu *vcpu, u32 index, u64 data,
 			 bool host_initiated)
 {
+	bool host_msr_reset = host_initiated && data == 0;
 	struct msr_data msr;
 
 	switch (index) {
@@ -1906,6 +1912,46 @@ static int __kvm_set_msr(struct kvm_vcpu *vcpu, u32 index, u64 data,
 
 		data = (u32)data;
 		break;
+	case MSR_IA32_U_CET:
+	case MSR_IA32_S_CET:
+		if (host_msr_reset && (kvm_cpu_cap_has(X86_FEATURE_SHSTK) ||
+				       kvm_cpu_cap_has(X86_FEATURE_IBT)))
+			break;
+		if (!guest_can_use(vcpu, X86_FEATURE_SHSTK) &&
+		    !guest_can_use(vcpu, X86_FEATURE_IBT))
+			return 1;
+		if (data & CET_US_RESERVED_BITS)
+			return 1;
+		if (!guest_can_use(vcpu, X86_FEATURE_SHSTK) &&
+		    (data & CET_US_SHSTK_MASK_BITS))
+			return 1;
+		if (!guest_can_use(vcpu, X86_FEATURE_IBT) &&
+		    (data & CET_US_IBT_MASK_BITS))
+			return 1;
+		if (!IS_ALIGNED(CET_US_LEGACY_BITMAP_BASE(data), 4))
+			return 1;
+
+		/* IBT can be suppressed iff the TRACKER isn't WAIT_ENDBR. */
+		if ((data & CET_SUPPRESS) && (data & CET_WAIT_ENDBR))
+			return 1;
+		break;
+	case MSR_IA32_INT_SSP_TAB:
+		if (!guest_cpuid_has(vcpu, X86_FEATURE_LM))
+			return 1;
+		fallthrough;
+	case MSR_IA32_PL0_SSP ... MSR_IA32_PL3_SSP:
+	case MSR_KVM_SSP:
+		if (host_msr_reset && kvm_cpu_cap_has(X86_FEATURE_SHSTK))
+			break;
+		if (!guest_can_use(vcpu, X86_FEATURE_SHSTK))
+			return 1;
+		if (index == MSR_KVM_SSP && !host_initiated)
+			return 1;
+		if (is_noncanonical_address(data, vcpu))
+			return 1;
+		if (index != MSR_IA32_INT_SSP_TAB && !IS_ALIGNED(data, 4))
+			return 1;
+		break;
 	}
 
 	msr.data = data;
@@ -1949,6 +1995,23 @@ static int __kvm_get_msr(struct kvm_vcpu *vcpu, u32 index, u64 *data,
 		    !guest_cpuid_has(vcpu, X86_FEATURE_RDPID))
 			return 1;
 		break;
+	case MSR_IA32_U_CET:
+	case MSR_IA32_S_CET:
+		if (!guest_can_use(vcpu, X86_FEATURE_IBT) &&
+		    !guest_can_use(vcpu, X86_FEATURE_SHSTK))
+			return 1;
+		break;
+	case MSR_IA32_INT_SSP_TAB:
+		if (!guest_cpuid_has(vcpu, X86_FEATURE_LM))
+			return 1;
+		fallthrough;
+	case MSR_IA32_PL0_SSP ... MSR_IA32_PL3_SSP:
+	case MSR_KVM_SSP:
+		if (!guest_can_use(vcpu, X86_FEATURE_SHSTK))
+			return 1;
+		if (index == MSR_KVM_SSP && !host_initiated)
+			return 1;
+		break;
 	}
 
 	msr.index = index;
@@ -4009,6 +4072,10 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 		vcpu->arch.guest_fpu.xfd_err = data;
 		break;
 #endif
+	case MSR_IA32_U_CET:
+	case MSR_IA32_PL0_SSP ... MSR_IA32_PL3_SSP:
+		kvm_set_xstate_msr(vcpu, msr_info);
+		break;
 	default:
 		if (kvm_pmu_is_valid_msr(vcpu, msr))
 			return kvm_pmu_set_msr(vcpu, msr_info);
@@ -4365,6 +4432,10 @@ int kvm_get_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 		msr_info->data = vcpu->arch.guest_fpu.xfd_err;
 		break;
 #endif
+	case MSR_IA32_U_CET:
+	case MSR_IA32_PL0_SSP ... MSR_IA32_PL3_SSP:
+		kvm_get_xstate_msr(vcpu, msr_info);
+		break;
 	default:
 		if (kvm_pmu_is_valid_msr(vcpu, msr_info->index))
 			return kvm_pmu_get_msr(vcpu, msr_info);
-- 
2.27.0



* [PATCH v6 20/25] KVM: x86: Save and reload SSP to/from SMRAM
@ 2023-09-14  6:33 ` Yang Weijiang
From: Yang Weijiang @ 2023-09-14  6:33 UTC (permalink / raw)
  To: seanjc, pbonzini, kvm, linux-kernel
  Cc: dave.hansen, peterz, chao.gao, rick.p.edgecombe, weijiang.yang,
	john.allen

Save CET SSP to SMRAM on SMI and reload it on RSM. KVM emulates hardware
behavior when the guest enters/leaves SMM, i.e., it saves registers to
SMRAM on SMM entry and reloads them on SMM exit. Per the SDM, SSP is one
of those registers on 64-bit architectures, so add support for SSP.
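The save/reload flow reduces to a round trip through the SMRAM image; a
minimal sketch follows (toy types, not the real kvm_smram_state_64):

```c
#include <stdint.h>

struct toy_smram { uint64_t ssp; };
struct toy_vcpu  { uint64_t ssp; };	/* stands in for GUEST_SSP */

static void toy_enter_smm(struct toy_vcpu *v, struct toy_smram *s)
{
	s->ssp = v->ssp;	/* save SSP to SMRAM on SMI */
	v->ssp = 0;		/* the SMM handler may clobber guest SSP */
}

static void toy_rsm(struct toy_vcpu *v, const struct toy_smram *s)
{
	v->ssp = s->ssp;	/* reload SSP from SMRAM on RSM */
}
```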

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>
---
 arch/x86/kvm/smm.c | 8 ++++++++
 arch/x86/kvm/smm.h | 2 +-
 2 files changed, 9 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/smm.c b/arch/x86/kvm/smm.c
index b42111a24cc2..235fca95f103 100644
--- a/arch/x86/kvm/smm.c
+++ b/arch/x86/kvm/smm.c
@@ -275,6 +275,10 @@ static void enter_smm_save_state_64(struct kvm_vcpu *vcpu,
 	enter_smm_save_seg_64(vcpu, &smram->gs, VCPU_SREG_GS);
 
 	smram->int_shadow = static_call(kvm_x86_get_interrupt_shadow)(vcpu);
+
+	if (guest_can_use(vcpu, X86_FEATURE_SHSTK))
+		KVM_BUG_ON(kvm_msr_read(vcpu, MSR_KVM_SSP, &smram->ssp),
+			   vcpu->kvm);
 }
 #endif
 
@@ -565,6 +569,10 @@ static int rsm_load_state_64(struct x86_emulate_ctxt *ctxt,
 	static_call(kvm_x86_set_interrupt_shadow)(vcpu, 0);
 	ctxt->interruptibility = (u8)smstate->int_shadow;
 
+	if (guest_can_use(vcpu, X86_FEATURE_SHSTK))
+		KVM_BUG_ON(kvm_msr_write(vcpu, MSR_KVM_SSP, smstate->ssp),
+			   vcpu->kvm);
+
 	return X86EMUL_CONTINUE;
 }
 #endif
diff --git a/arch/x86/kvm/smm.h b/arch/x86/kvm/smm.h
index a1cf2ac5bd78..1e2a3e18207f 100644
--- a/arch/x86/kvm/smm.h
+++ b/arch/x86/kvm/smm.h
@@ -116,8 +116,8 @@ struct kvm_smram_state_64 {
 	u32 smbase;
 	u32 reserved4[5];
 
-	/* ssp and svm_* fields below are not implemented by KVM */
 	u64 ssp;
+	/* svm_* fields below are not implemented by KVM */
 	u64 svm_guest_pat;
 	u64 svm_host_efer;
 	u64 svm_host_cr4;
-- 
2.27.0



* [PATCH v6 21/25] KVM: VMX: Set up interception for CET MSRs
@ 2023-09-14  6:33 ` Yang Weijiang
From: Yang Weijiang @ 2023-09-14  6:33 UTC (permalink / raw)
  To: seanjc, pbonzini, kvm, linux-kernel
  Cc: dave.hansen, peterz, chao.gao, rick.p.edgecombe, weijiang.yang,
	john.allen

Enable/disable interception of CET MSRs per the associated feature
configuration. The Shadow Stack feature requires all CET MSRs to be passed
through to the guest to support it in user and supervisor mode, while the
IBT feature only depends on MSR_IA32_{U,S}_CET for user and supervisor IBT.

Note, this MSR design introduces an architectural limitation of SHSTK and
IBT control for the guest, i.e., when SHSTK is exposed, IBT is also
available to the guest from an architectural perspective since IBT relies
on a subset of the SHSTK-relevant MSRs.
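The resulting interception policy can be modeled as a small decision
function (a hypothetical standalone mock; the real code instead calls
vmx_set_intercept_for_msr() per MSR):

```c
#include <stdbool.h>

enum cet_msr { CET_MSR_U_CET, CET_MSR_S_CET, CET_MSR_PLx_SSP,
	       CET_MSR_INT_SSP_TAB };

/*
 * Returns true if the MSR stays intercepted: the SSP MSRs follow the
 * guest's SHSTK enable only, while MSR_IA32_{U,S}_CET are passed through
 * when either SHSTK or IBT is enabled for the guest.
 */
static bool cet_msr_intercepted(enum cet_msr msr, bool guest_shstk,
				bool guest_ibt)
{
	if (msr == CET_MSR_U_CET || msr == CET_MSR_S_CET)
		return !guest_shstk && !guest_ibt;
	return !guest_shstk;
}
```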

Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>
---
 arch/x86/kvm/vmx/vmx.c | 42 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 42 insertions(+)

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 9f4b56337251..30373258573d 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -699,6 +699,10 @@ static bool is_valid_passthrough_msr(u32 msr)
 	case MSR_LBR_CORE_TO ... MSR_LBR_CORE_TO + 8:
 		/* LBR MSRs. These are handled in vmx_update_intercept_for_lbr_msrs() */
 		return true;
+	case MSR_IA32_U_CET:
+	case MSR_IA32_S_CET:
+	case MSR_IA32_PL0_SSP ... MSR_IA32_INT_SSP_TAB:
+		return true;
 	}
 
 	r = possible_passthrough_msr_slot(msr) != -ENOENT;
@@ -7769,6 +7773,42 @@ static void update_intel_pt_cfg(struct kvm_vcpu *vcpu)
 		vmx->pt_desc.ctl_bitmask &= ~(0xfULL << (32 + i * 4));
 }
 
+static void vmx_update_intercept_for_cet_msr(struct kvm_vcpu *vcpu)
+{
+	bool incpt;
+
+	if (kvm_cpu_cap_has(X86_FEATURE_SHSTK)) {
+		incpt = !guest_cpuid_has(vcpu, X86_FEATURE_SHSTK);
+
+		vmx_set_intercept_for_msr(vcpu, MSR_IA32_U_CET,
+					  MSR_TYPE_RW, incpt);
+		vmx_set_intercept_for_msr(vcpu, MSR_IA32_S_CET,
+					  MSR_TYPE_RW, incpt);
+		vmx_set_intercept_for_msr(vcpu, MSR_IA32_PL0_SSP,
+					  MSR_TYPE_RW, incpt);
+		vmx_set_intercept_for_msr(vcpu, MSR_IA32_PL1_SSP,
+					  MSR_TYPE_RW, incpt);
+		vmx_set_intercept_for_msr(vcpu, MSR_IA32_PL2_SSP,
+					  MSR_TYPE_RW, incpt);
+		vmx_set_intercept_for_msr(vcpu, MSR_IA32_PL3_SSP,
+					  MSR_TYPE_RW, incpt);
+		if (guest_cpuid_has(vcpu, X86_FEATURE_LM))
+			vmx_set_intercept_for_msr(vcpu, MSR_IA32_INT_SSP_TAB,
+						  MSR_TYPE_RW, incpt);
+		if (!incpt)
+			return;
+	}
+
+	if (kvm_cpu_cap_has(X86_FEATURE_IBT)) {
+		incpt = !guest_cpuid_has(vcpu, X86_FEATURE_IBT);
+
+		vmx_set_intercept_for_msr(vcpu, MSR_IA32_U_CET,
+					  MSR_TYPE_RW, incpt);
+		vmx_set_intercept_for_msr(vcpu, MSR_IA32_S_CET,
+					  MSR_TYPE_RW, incpt);
+	}
+}
+
 static void vmx_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
@@ -7846,6 +7886,8 @@ static void vmx_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
 
 	/* Refresh #PF interception to account for MAXPHYADDR changes. */
 	vmx_update_exception_bitmap(vcpu);
+
+	vmx_update_intercept_for_cet_msr(vcpu);
 }
 
 static u64 vmx_get_perf_capabilities(void)
-- 
2.27.0



* [PATCH v6 22/25] KVM: VMX: Set host constant supervisor states to VMCS fields
@ 2023-09-14  6:33 ` Yang Weijiang
From: Yang Weijiang @ 2023-09-14  6:33 UTC (permalink / raw)
  To: seanjc, pbonzini, kvm, linux-kernel
  Cc: dave.hansen, peterz, chao.gao, rick.p.edgecombe, weijiang.yang,
	john.allen

Save constant values to the HOST_{S_CET,SSP,INTR_SSP_TABLE} fields
explicitly. Kernel IBT is supported and the setting in MSR_IA32_S_CET is
static post-boot (the exception is the BIOS call case, but a vCPU thread
never crosses it), so KVM doesn't need to refresh the HOST_S_CET field
before every VM-Enter/VM-Exit sequence.

Host supervisor shadow stack is not enabled now and SSP is not accessible
to kernel mode, thus it's safe to set the host IA32_INT_SSP_TAB/SSP VMCS
fields to 0. When shadow stack is enabled for CPL3, SSP is reloaded from
PL3_SSP before the kernel exits to userspace. See SDM Vol 2A/B Chapters 3
and 4 for SYSCALL/SYSRET/SYSENTER/SYSEXIT/RDSSP/CALL etc.

Prevent KVM module loading if host supervisor shadow stack (SHSTK_EN) is
set in MSR_IA32_S_CET, as KVM cannot coexist with it correctly.
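The load-time guard amounts to a single bit test on the host's
MSR_IA32_S_CET value (a sketch; `cet_host_state_ok` is a hypothetical
helper name, and SH_STK_EN is bit 0 of IA32_S_CET per the SDM):

```c
#include <stdint.h>

#define CET_SHSTK_EN	(1ULL << 0)	/* IA32_S_CET.SH_STK_EN */

/* Returns 1 if KVM may load, 0 if supervisor shadow stack is enabled. */
static int cet_host_state_ok(uint64_t host_s_cet)
{
	return !(host_s_cet & CET_SHSTK_EN);
}
```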

Suggested-by: Sean Christopherson <seanjc@google.com>
Suggested-by: Chao Gao <chao.gao@intel.com>
Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>
---
 arch/x86/kvm/vmx/capabilities.h |  4 ++++
 arch/x86/kvm/vmx/vmx.c          | 15 +++++++++++++++
 arch/x86/kvm/x86.c              | 14 ++++++++++++++
 arch/x86/kvm/x86.h              |  1 +
 4 files changed, 34 insertions(+)

diff --git a/arch/x86/kvm/vmx/capabilities.h b/arch/x86/kvm/vmx/capabilities.h
index 41a4533f9989..ee8938818c8a 100644
--- a/arch/x86/kvm/vmx/capabilities.h
+++ b/arch/x86/kvm/vmx/capabilities.h
@@ -106,6 +106,10 @@ static inline bool cpu_has_load_perf_global_ctrl(void)
 	return vmcs_config.vmentry_ctrl & VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL;
 }
 
+static inline bool cpu_has_load_cet_ctrl(void)
+{
+	return (vmcs_config.vmentry_ctrl & VM_ENTRY_LOAD_CET_STATE);
+}
 static inline bool cpu_has_vmx_mpx(void)
 {
 	return vmcs_config.vmentry_ctrl & VM_ENTRY_LOAD_BNDCFGS;
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 30373258573d..9ccc2c552f55 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -4375,6 +4375,21 @@ void vmx_set_constant_host_state(struct vcpu_vmx *vmx)
 
 	if (cpu_has_load_ia32_efer())
 		vmcs_write64(HOST_IA32_EFER, host_efer);
+
+	/*
+	 * Supervisor shadow stack is not enabled on host side, i.e.,
+	 * host IA32_S_CET.SHSTK_EN bit is guaranteed to 0 now, per SDM
+	 * description(RDSSP instruction), SSP is not readable in CPL0,
+	 * so resetting the two registers to 0s at VM-Exit does no harm
+	 * to kernel execution. When execution flow exits to userspace,
+	 * SSP is reloaded from IA32_PL3_SSP. Check SDM Vol.2A/B Chapter
+	 * 3 and 4 for details.
+	 */
+	if (cpu_has_load_cet_ctrl()) {
+		vmcs_writel(HOST_S_CET, host_s_cet);
+		vmcs_writel(HOST_SSP, 0);
+		vmcs_writel(HOST_INTR_SSP_TABLE, 0);
+	}
 }
 
 void set_cr4_guest_host_mask(struct vcpu_vmx *vmx)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index c85ee42ab4f1..231d4a7b6f3d 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -114,6 +114,8 @@ static u64 __read_mostly efer_reserved_bits = ~((u64)EFER_SCE);
 #endif
 
 static u64 __read_mostly cr4_reserved_bits = CR4_RESERVED_BITS;
+u64 __read_mostly host_s_cet;
+EXPORT_SYMBOL_GPL(host_s_cet);
 
 #define KVM_EXIT_HYPERCALL_VALID_MASK (1 << KVM_HC_MAP_GPA_RANGE)
 
@@ -9618,6 +9620,18 @@ static int __kvm_x86_vendor_init(struct kvm_x86_init_ops *ops)
 		return -EIO;
 	}
 
+	if (boot_cpu_has(X86_FEATURE_SHSTK)) {
+		rdmsrl(MSR_IA32_S_CET, host_s_cet);
+		/*
+		 * Linux doesn't yet support supervisor shadow stacks (SSS), so
+		 * KVM doesn't save/restore the associated MSRs, i.e. KVM may
+		 * clobber the host values.  Yell and refuse to load if SSS is
+		 * unexpectedly enabled, e.g. to avoid crashing the host.
+		 */
+		if (WARN_ON_ONCE(host_s_cet & CET_SHSTK_EN))
+			return -EIO;
+	}
+
 	x86_emulator_cache = kvm_alloc_emulator_cache();
 	if (!x86_emulator_cache) {
 		pr_err("failed to allocate cache for x86 emulator\n");
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index 9a8e3a84eaf4..0d5f673338dd 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -324,6 +324,7 @@ fastpath_t handle_fastpath_set_msr_irqoff(struct kvm_vcpu *vcpu);
 extern u64 host_xcr0;
 extern u64 host_xss;
 extern u64 host_arch_capabilities;
+extern u64 host_s_cet;
 
 extern struct kvm_caps kvm_caps;
 
-- 
2.27.0



* [PATCH v6 23/25] KVM: x86: Enable CET virtualization for VMX and advertise to userspace
@ 2023-09-14  6:33 ` Yang Weijiang
From: Yang Weijiang @ 2023-09-14  6:33 UTC (permalink / raw)
  To: seanjc, pbonzini, kvm, linux-kernel
  Cc: dave.hansen, peterz, chao.gao, rick.p.edgecombe, weijiang.yang,
	john.allen

Expose CET features to the guest if KVM/host can support them, and clear
the CPUID feature bits if KVM/host cannot.

Set CPUID feature bits so that CET features are available in guest CPUID.
Add CR4.CET bit support in order to allow the guest to set the CET master
control bit.

Disable the KVM CET feature if unrestricted_guest is unsupported/disabled,
as KVM does not support emulating CET.
Don't expose the CET feature if either of the {U,S}_CET xstate bits is
cleared in host XSS or if XSAVES isn't supported.

The CET load-bits in the VM_ENTRY/VM_EXIT control fields should be set to
keep guest CET xstates isolated from the host's. All platforms that
support CET enumerate VMX_BASIC[bit 56] as 1, so clear the CET feature
bits if the bit doesn't read 1.

Regarding the CET MSR contents after Reset/INIT, the SDM doesn't mention
the default values, nor have I been able to get the answer internally so
far; I will fill the gap once it's clear.
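The enabling conditions above combine into a simple predicate (a
hypothetical helper for illustration; the real code clears kvm_cpu_cap
bits rather than returning a bool):

```c
#include <stdbool.h>

/* CET can be advertised only if VM-Entry/VM-Exit can load CET state,
 * unrestricted guest is on, and VMX_BASIC[56] reports the relaxed
 * event error-code injection behavior. */
static bool cet_can_be_advertised(bool has_load_cet_ctrl,
				  bool unrestricted_guest,
				  bool basic_no_hw_errcode)
{
	return has_load_cet_ctrl && unrestricted_guest && basic_no_hw_errcode;
}
```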

Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>
---
 arch/x86/include/asm/kvm_host.h  |  3 ++-
 arch/x86/include/asm/msr-index.h |  1 +
 arch/x86/kvm/cpuid.c             | 12 ++++++++++--
 arch/x86/kvm/vmx/capabilities.h  |  6 ++++++
 arch/x86/kvm/vmx/vmx.c           | 23 ++++++++++++++++++++++-
 arch/x86/kvm/vmx/vmx.h           |  6 ++++--
 arch/x86/kvm/x86.c               | 12 +++++++++++-
 arch/x86/kvm/x86.h               |  3 +++
 8 files changed, 59 insertions(+), 7 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index d77b030e996c..db0010fa3363 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -125,7 +125,8 @@
 			  | X86_CR4_PGE | X86_CR4_PCE | X86_CR4_OSFXSR | X86_CR4_PCIDE \
 			  | X86_CR4_OSXSAVE | X86_CR4_SMEP | X86_CR4_FSGSBASE \
 			  | X86_CR4_OSXMMEXCPT | X86_CR4_LA57 | X86_CR4_VMXE \
-			  | X86_CR4_SMAP | X86_CR4_PKE | X86_CR4_UMIP))
+			  | X86_CR4_SMAP | X86_CR4_PKE | X86_CR4_UMIP \
+			  | X86_CR4_CET))
 
 #define CR8_RESERVED_BITS (~(unsigned long)X86_CR8_TPR)
 
diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 1d111350197f..1f8dc04da468 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -1091,6 +1091,7 @@
 #define VMX_BASIC_MEM_TYPE_MASK	0x003c000000000000LLU
 #define VMX_BASIC_MEM_TYPE_WB	6LLU
 #define VMX_BASIC_INOUT		0x0040000000000000LLU
+#define VMX_BASIC_NO_HW_ERROR_CODE_CC	0x0100000000000000LLU
 
 /* Resctrl MSRs: */
 /* - Intel: */
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index 4e7a820cba62..d787a506746a 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -654,7 +654,7 @@ void kvm_set_cpu_caps(void)
 		F(AVX512_VPOPCNTDQ) | F(UMIP) | F(AVX512_VBMI2) | F(GFNI) |
 		F(VAES) | F(VPCLMULQDQ) | F(AVX512_VNNI) | F(AVX512_BITALG) |
 		F(CLDEMOTE) | F(MOVDIRI) | F(MOVDIR64B) | 0 /*WAITPKG*/ |
-		F(SGX_LC) | F(BUS_LOCK_DETECT)
+		F(SGX_LC) | F(BUS_LOCK_DETECT) | F(SHSTK)
 	);
 	/* Set LA57 based on hardware capability. */
 	if (cpuid_ecx(7) & F(LA57))
@@ -672,7 +672,8 @@ void kvm_set_cpu_caps(void)
 		F(SPEC_CTRL_SSBD) | F(ARCH_CAPABILITIES) | F(INTEL_STIBP) |
 		F(MD_CLEAR) | F(AVX512_VP2INTERSECT) | F(FSRM) |
 		F(SERIALIZE) | F(TSXLDTRK) | F(AVX512_FP16) |
-		F(AMX_TILE) | F(AMX_INT8) | F(AMX_BF16) | F(FLUSH_L1D)
+		F(AMX_TILE) | F(AMX_INT8) | F(AMX_BF16) | F(FLUSH_L1D) |
+		F(IBT)
 	);
 
 	/* TSC_ADJUST and ARCH_CAPABILITIES are emulated in software. */
@@ -685,6 +686,13 @@ void kvm_set_cpu_caps(void)
 		kvm_cpu_cap_set(X86_FEATURE_INTEL_STIBP);
 	if (boot_cpu_has(X86_FEATURE_AMD_SSBD))
 		kvm_cpu_cap_set(X86_FEATURE_SPEC_CTRL_SSBD);
+	/*
+	 * The feature bit in boot_cpu_data.x86_capability could have been
+	 * cleared due to ibt=off cmdline option, then add it back if CPU
+	 * supports IBT.
+	 */
+	if (cpuid_edx(7) & F(IBT))
+		kvm_cpu_cap_set(X86_FEATURE_IBT);
 
 	kvm_cpu_cap_mask(CPUID_7_1_EAX,
 		F(AVX_VNNI) | F(AVX512_BF16) | F(CMPCCXADD) |
diff --git a/arch/x86/kvm/vmx/capabilities.h b/arch/x86/kvm/vmx/capabilities.h
index ee8938818c8a..e12bc233d88b 100644
--- a/arch/x86/kvm/vmx/capabilities.h
+++ b/arch/x86/kvm/vmx/capabilities.h
@@ -79,6 +79,12 @@ static inline bool cpu_has_vmx_basic_inout(void)
 	return	(((u64)vmcs_config.basic_cap << 32) & VMX_BASIC_INOUT);
 }
 
+static inline bool cpu_has_vmx_basic_no_hw_errcode(void)
+{
+	return	((u64)vmcs_config.basic_cap << 32) &
+		 VMX_BASIC_NO_HW_ERROR_CODE_CC;
+}
+
 static inline bool cpu_has_virtual_nmis(void)
 {
 	return vmcs_config.pin_based_exec_ctrl & PIN_BASED_VIRTUAL_NMIS &&
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 9ccc2c552f55..f0dea8ecd0c6 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -2614,6 +2614,7 @@ static int setup_vmcs_config(struct vmcs_config *vmcs_conf,
 		{ VM_ENTRY_LOAD_IA32_EFER,		VM_EXIT_LOAD_IA32_EFER },
 		{ VM_ENTRY_LOAD_BNDCFGS,		VM_EXIT_CLEAR_BNDCFGS },
 		{ VM_ENTRY_LOAD_IA32_RTIT_CTL,		VM_EXIT_CLEAR_IA32_RTIT_CTL },
+		{ VM_ENTRY_LOAD_CET_STATE,		VM_EXIT_LOAD_CET_STATE },
 	};
 
 	memset(vmcs_conf, 0, sizeof(*vmcs_conf));
@@ -4934,6 +4935,9 @@ static void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
 		vmcs_write64(GUEST_BNDCFGS, 0);
 
 	vmcs_write32(VM_ENTRY_INTR_INFO_FIELD, 0);  /* 22.2.1 */
+	vmcs_writel(GUEST_SSP, 0);
+	vmcs_writel(GUEST_S_CET, 0);
+	vmcs_writel(GUEST_INTR_SSP_TABLE, 0);
 
 	kvm_make_request(KVM_REQ_APIC_PAGE_RELOAD, vcpu);
 
@@ -6354,6 +6358,12 @@ void dump_vmcs(struct kvm_vcpu *vcpu)
 	if (vmcs_read32(VM_EXIT_MSR_STORE_COUNT) > 0)
 		vmx_dump_msrs("guest autostore", &vmx->msr_autostore.guest);
 
+	if (vmentry_ctl & VM_ENTRY_LOAD_CET_STATE) {
+		pr_err("S_CET = 0x%016lx\n", vmcs_readl(GUEST_S_CET));
+		pr_err("SSP = 0x%016lx\n", vmcs_readl(GUEST_SSP));
+		pr_err("INTR SSP TABLE = 0x%016lx\n",
+		       vmcs_readl(GUEST_INTR_SSP_TABLE));
+	}
 	pr_err("*** Host State ***\n");
 	pr_err("RIP = 0x%016lx  RSP = 0x%016lx\n",
 	       vmcs_readl(HOST_RIP), vmcs_readl(HOST_RSP));
@@ -6431,6 +6441,12 @@ void dump_vmcs(struct kvm_vcpu *vcpu)
 	if (secondary_exec_control & SECONDARY_EXEC_ENABLE_VPID)
 		pr_err("Virtual processor ID = 0x%04x\n",
 		       vmcs_read16(VIRTUAL_PROCESSOR_ID));
+	if (vmexit_ctl & VM_EXIT_LOAD_CET_STATE) {
+		pr_err("S_CET = 0x%016lx\n", vmcs_readl(HOST_S_CET));
+		pr_err("SSP = 0x%016lx\n", vmcs_readl(HOST_SSP));
+		pr_err("INTR SSP TABLE = 0x%016lx\n",
+		       vmcs_readl(HOST_INTR_SSP_TABLE));
+	}
 }
 
 /*
@@ -7967,7 +7983,6 @@ static __init void vmx_set_cpu_caps(void)
 		kvm_cpu_cap_set(X86_FEATURE_UMIP);
 
 	/* CPUID 0xD.1 */
-	kvm_caps.supported_xss = 0;
 	if (!cpu_has_vmx_xsaves())
 		kvm_cpu_cap_clear(X86_FEATURE_XSAVES);
 
@@ -7979,6 +7994,12 @@ static __init void vmx_set_cpu_caps(void)
 
 	if (cpu_has_vmx_waitpkg())
 		kvm_cpu_cap_check_and_set(X86_FEATURE_WAITPKG);
+
+	if (!cpu_has_load_cet_ctrl() || !enable_unrestricted_guest ||
+	    !cpu_has_vmx_basic_no_hw_errcode()) {
+		kvm_cpu_cap_clear(X86_FEATURE_SHSTK);
+		kvm_cpu_cap_clear(X86_FEATURE_IBT);
+	}
 }
 
 static void vmx_request_immediate_exit(struct kvm_vcpu *vcpu)
diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index c2130d2c8e24..fb72819fbb41 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -480,7 +480,8 @@ static inline u8 vmx_get_rvi(void)
 	 VM_ENTRY_LOAD_IA32_EFER |					\
 	 VM_ENTRY_LOAD_BNDCFGS |					\
 	 VM_ENTRY_PT_CONCEAL_PIP |					\
-	 VM_ENTRY_LOAD_IA32_RTIT_CTL)
+	 VM_ENTRY_LOAD_IA32_RTIT_CTL |					\
+	 VM_ENTRY_LOAD_CET_STATE)
 
 #define __KVM_REQUIRED_VMX_VM_EXIT_CONTROLS				\
 	(VM_EXIT_SAVE_DEBUG_CONTROLS |					\
@@ -502,7 +503,8 @@ static inline u8 vmx_get_rvi(void)
 	       VM_EXIT_LOAD_IA32_EFER |					\
 	       VM_EXIT_CLEAR_BNDCFGS |					\
 	       VM_EXIT_PT_CONCEAL_PIP |					\
-	       VM_EXIT_CLEAR_IA32_RTIT_CTL)
+	       VM_EXIT_CLEAR_IA32_RTIT_CTL |				\
+	       VM_EXIT_LOAD_CET_STATE)
 
 #define KVM_REQUIRED_VMX_PIN_BASED_VM_EXEC_CONTROL			\
 	(PIN_BASED_EXT_INTR_MASK |					\
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 231d4a7b6f3d..b7d1ac6b8d75 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -231,7 +231,8 @@ static struct kvm_user_return_msrs __percpu *user_return_msrs;
 				| XFEATURE_MASK_BNDCSR | XFEATURE_MASK_AVX512 \
 				| XFEATURE_MASK_PKRU | XFEATURE_MASK_XTILE)
 
-#define KVM_SUPPORTED_XSS     0
+#define KVM_SUPPORTED_XSS	(XFEATURE_MASK_CET_USER | \
+				 XFEATURE_MASK_CET_KERNEL)
 
 u64 __read_mostly host_efer;
 EXPORT_SYMBOL_GPL(host_efer);
@@ -9699,6 +9700,15 @@ static int __kvm_x86_vendor_init(struct kvm_x86_init_ops *ops)
 	if (!kvm_cpu_cap_has(X86_FEATURE_XSAVES))
 		kvm_caps.supported_xss = 0;
 
+	if ((kvm_caps.supported_xss & (XFEATURE_MASK_CET_USER |
+	     XFEATURE_MASK_CET_KERNEL)) !=
+	    (XFEATURE_MASK_CET_USER | XFEATURE_MASK_CET_KERNEL)) {
+		kvm_cpu_cap_clear(X86_FEATURE_SHSTK);
+		kvm_cpu_cap_clear(X86_FEATURE_IBT);
+		kvm_caps.supported_xss &= ~XFEATURE_MASK_CET_USER;
+		kvm_caps.supported_xss &= ~XFEATURE_MASK_CET_KERNEL;
+	}
+
 #define __kvm_cpu_cap_has(UNUSED_, f) kvm_cpu_cap_has(f)
 	cr4_reserved_bits = __cr4_reserved_bits(__kvm_cpu_cap_has, UNUSED_);
 #undef __kvm_cpu_cap_has
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index 0d5f673338dd..665a7f91d04f 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -530,6 +530,9 @@ bool kvm_msr_allowed(struct kvm_vcpu *vcpu, u32 index, u32 type);
 		__reserved_bits |= X86_CR4_VMXE;        \
 	if (!__cpu_has(__c, X86_FEATURE_PCID))          \
 		__reserved_bits |= X86_CR4_PCIDE;       \
+	if (!__cpu_has(__c, X86_FEATURE_SHSTK) &&       \
+	    !__cpu_has(__c, X86_FEATURE_IBT))           \
+		__reserved_bits |= X86_CR4_CET;         \
 	__reserved_bits;                                \
 })
 
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 120+ messages in thread

* [PATCH v6 24/25] KVM: nVMX: Introduce new VMX_BASIC bit for event error_code delivery to L1
  2023-09-14  6:33 [PATCH v6 00/25] Enable CET Virtualization Yang Weijiang
                   ` (22 preceding siblings ...)
  2023-09-14  6:33 ` [PATCH v6 23/25] KVM: x86: Enable CET virtualization for VMX and advertise to userspace Yang Weijiang
@ 2023-09-14  6:33 ` Yang Weijiang
  2023-10-31 17:57   ` Maxim Levitsky
  2023-11-01  4:21   ` Chao Gao
  2023-09-14  6:33 ` [PATCH v6 25/25] KVM: nVMX: Enable CET support for nested guest Yang Weijiang
  2023-09-25  0:31 ` [PATCH v6 00/25] Enable CET Virtualization Yang, Weijiang
  25 siblings, 2 replies; 120+ messages in thread
From: Yang Weijiang @ 2023-09-14  6:33 UTC (permalink / raw)
  To: seanjc, pbonzini, kvm, linux-kernel
  Cc: dave.hansen, peterz, chao.gao, rick.p.edgecombe, weijiang.yang,
	john.allen

Per the SDM description (Vol.3D, Appendix A.1):
"If bit 56 is read as 1, software can use VM entry to deliver a hardware
exception with or without an error code, regardless of vector"

Modify the has_error_code check before injecting events into the nested
guest. Only enforce the check when the guest is in real mode, the
exception is not a hardware exception, or the platform doesn't enumerate
bit 56 in VMX_BASIC; in all other cases skip the check to keep the logic
consistent with the SDM.

Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>
---
 arch/x86/kvm/vmx/nested.c | 22 ++++++++++++++--------
 arch/x86/kvm/vmx/nested.h |  5 +++++
 2 files changed, 19 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index c5ec0ef51ff7..78a3be394d00 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -1205,9 +1205,9 @@ static int vmx_restore_vmx_basic(struct vcpu_vmx *vmx, u64 data)
 {
 	const u64 feature_and_reserved =
 		/* feature (except bit 48; see below) */
-		BIT_ULL(49) | BIT_ULL(54) | BIT_ULL(55) |
+		BIT_ULL(49) | BIT_ULL(54) | BIT_ULL(55) | BIT_ULL(56) |
 		/* reserved */
-		BIT_ULL(31) | GENMASK_ULL(47, 45) | GENMASK_ULL(63, 56);
+		BIT_ULL(31) | GENMASK_ULL(47, 45) | GENMASK_ULL(63, 57);
 	u64 vmx_basic = vmcs_config.nested.basic;
 
 	if (!is_bitwise_subset(vmx_basic, data, feature_and_reserved))
@@ -2846,12 +2846,16 @@ static int nested_check_vm_entry_controls(struct kvm_vcpu *vcpu,
 		    CC(intr_type == INTR_TYPE_OTHER_EVENT && vector != 0))
 			return -EINVAL;
 
-		/* VM-entry interruption-info field: deliver error code */
-		should_have_error_code =
-			intr_type == INTR_TYPE_HARD_EXCEPTION && prot_mode &&
-			x86_exception_has_error_code(vector);
-		if (CC(has_error_code != should_have_error_code))
-			return -EINVAL;
+		if (!prot_mode || intr_type != INTR_TYPE_HARD_EXCEPTION ||
+		    !nested_cpu_has_no_hw_errcode_cc(vcpu)) {
+			/* VM-entry interruption-info field: deliver error code */
+			should_have_error_code =
+				intr_type == INTR_TYPE_HARD_EXCEPTION &&
+				prot_mode &&
+				x86_exception_has_error_code(vector);
+			if (CC(has_error_code != should_have_error_code))
+				return -EINVAL;
+		}
 
 		/* VM-entry exception error code */
 		if (CC(has_error_code &&
@@ -6968,6 +6972,8 @@ static void nested_vmx_setup_basic(struct nested_vmx_msrs *msrs)
 
 	if (cpu_has_vmx_basic_inout())
 		msrs->basic |= VMX_BASIC_INOUT;
+	if (cpu_has_vmx_basic_no_hw_errcode())
+		msrs->basic |= VMX_BASIC_NO_HW_ERROR_CODE_CC;
 }
 
 static void nested_vmx_setup_cr_fixed(struct nested_vmx_msrs *msrs)
diff --git a/arch/x86/kvm/vmx/nested.h b/arch/x86/kvm/vmx/nested.h
index b4b9d51438c6..26842da6857d 100644
--- a/arch/x86/kvm/vmx/nested.h
+++ b/arch/x86/kvm/vmx/nested.h
@@ -284,6 +284,11 @@ static inline bool nested_cr4_valid(struct kvm_vcpu *vcpu, unsigned long val)
 	       __kvm_is_valid_cr4(vcpu, val);
 }
 
+static inline bool nested_cpu_has_no_hw_errcode_cc(struct kvm_vcpu *vcpu)
+{
+	return to_vmx(vcpu)->nested.msrs.basic & VMX_BASIC_NO_HW_ERROR_CODE_CC;
+}
+
 /* No difference in the restrictions on guest and host CR4 in VMX operation. */
 #define nested_guest_cr4_valid	nested_cr4_valid
 #define nested_host_cr4_valid	nested_cr4_valid
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 120+ messages in thread

* [PATCH v6 25/25] KVM: nVMX: Enable CET support for nested guest
  2023-09-14  6:33 [PATCH v6 00/25] Enable CET Virtualization Yang Weijiang
                   ` (23 preceding siblings ...)
  2023-09-14  6:33 ` [PATCH v6 24/25] KVM: nVMX: Introduce new VMX_BASIC bit for event error_code delivery to L1 Yang Weijiang
@ 2023-09-14  6:33 ` Yang Weijiang
  2023-10-31 17:57   ` Maxim Levitsky
  2023-11-01  2:09   ` Chao Gao
  2023-09-25  0:31 ` [PATCH v6 00/25] Enable CET Virtualization Yang, Weijiang
  25 siblings, 2 replies; 120+ messages in thread
From: Yang Weijiang @ 2023-09-14  6:33 UTC (permalink / raw)
  To: seanjc, pbonzini, kvm, linux-kernel
  Cc: dave.hansen, peterz, chao.gao, rick.p.edgecombe, weijiang.yang,
	john.allen

Set up CET MSRs, the related VM_ENTRY/EXIT control bits and the fixed
CR4 setting to enable CET for nested VMs.

Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>
---
 arch/x86/kvm/vmx/nested.c | 27 +++++++++++++++++++++++++--
 arch/x86/kvm/vmx/vmcs12.c |  6 ++++++
 arch/x86/kvm/vmx/vmcs12.h | 14 +++++++++++++-
 arch/x86/kvm/vmx/vmx.c    |  2 ++
 4 files changed, 46 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index 78a3be394d00..2c4ff13fddb0 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -660,6 +660,28 @@ static inline bool nested_vmx_prepare_msr_bitmap(struct kvm_vcpu *vcpu,
 	nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
 					 MSR_IA32_FLUSH_CMD, MSR_TYPE_W);
 
+	/* Pass CET MSRs to nested VM if L0 and L1 are set to pass-through. */
+	nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
+					 MSR_IA32_U_CET, MSR_TYPE_RW);
+
+	nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
+					 MSR_IA32_S_CET, MSR_TYPE_RW);
+
+	nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
+					 MSR_IA32_PL0_SSP, MSR_TYPE_RW);
+
+	nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
+					 MSR_IA32_PL1_SSP, MSR_TYPE_RW);
+
+	nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
+					 MSR_IA32_PL2_SSP, MSR_TYPE_RW);
+
+	nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
+					 MSR_IA32_PL3_SSP, MSR_TYPE_RW);
+
+	nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
+					 MSR_IA32_INT_SSP_TAB, MSR_TYPE_RW);
+
 	kvm_vcpu_unmap(vcpu, &vmx->nested.msr_bitmap_map, false);
 
 	vmx->nested.force_msr_bitmap_recalc = false;
@@ -6794,7 +6816,7 @@ static void nested_vmx_setup_exit_ctls(struct vmcs_config *vmcs_conf,
 		VM_EXIT_HOST_ADDR_SPACE_SIZE |
 #endif
 		VM_EXIT_LOAD_IA32_PAT | VM_EXIT_SAVE_IA32_PAT |
-		VM_EXIT_CLEAR_BNDCFGS;
+		VM_EXIT_CLEAR_BNDCFGS | VM_EXIT_LOAD_CET_STATE;
 	msrs->exit_ctls_high |=
 		VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR |
 		VM_EXIT_LOAD_IA32_EFER | VM_EXIT_SAVE_IA32_EFER |
@@ -6816,7 +6838,8 @@ static void nested_vmx_setup_entry_ctls(struct vmcs_config *vmcs_conf,
 #ifdef CONFIG_X86_64
 		VM_ENTRY_IA32E_MODE |
 #endif
-		VM_ENTRY_LOAD_IA32_PAT | VM_ENTRY_LOAD_BNDCFGS;
+		VM_ENTRY_LOAD_IA32_PAT | VM_ENTRY_LOAD_BNDCFGS |
+		VM_ENTRY_LOAD_CET_STATE;
 	msrs->entry_ctls_high |=
 		(VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR | VM_ENTRY_LOAD_IA32_EFER |
 		 VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL);
diff --git a/arch/x86/kvm/vmx/vmcs12.c b/arch/x86/kvm/vmx/vmcs12.c
index 106a72c923ca..4233b5ca9461 100644
--- a/arch/x86/kvm/vmx/vmcs12.c
+++ b/arch/x86/kvm/vmx/vmcs12.c
@@ -139,6 +139,9 @@ const unsigned short vmcs12_field_offsets[] = {
 	FIELD(GUEST_PENDING_DBG_EXCEPTIONS, guest_pending_dbg_exceptions),
 	FIELD(GUEST_SYSENTER_ESP, guest_sysenter_esp),
 	FIELD(GUEST_SYSENTER_EIP, guest_sysenter_eip),
+	FIELD(GUEST_S_CET, guest_s_cet),
+	FIELD(GUEST_SSP, guest_ssp),
+	FIELD(GUEST_INTR_SSP_TABLE, guest_ssp_tbl),
 	FIELD(HOST_CR0, host_cr0),
 	FIELD(HOST_CR3, host_cr3),
 	FIELD(HOST_CR4, host_cr4),
@@ -151,5 +154,8 @@ const unsigned short vmcs12_field_offsets[] = {
 	FIELD(HOST_IA32_SYSENTER_EIP, host_ia32_sysenter_eip),
 	FIELD(HOST_RSP, host_rsp),
 	FIELD(HOST_RIP, host_rip),
+	FIELD(HOST_S_CET, host_s_cet),
+	FIELD(HOST_SSP, host_ssp),
+	FIELD(HOST_INTR_SSP_TABLE, host_ssp_tbl),
 };
 const unsigned int nr_vmcs12_fields = ARRAY_SIZE(vmcs12_field_offsets);
diff --git a/arch/x86/kvm/vmx/vmcs12.h b/arch/x86/kvm/vmx/vmcs12.h
index 01936013428b..3884489e7f7e 100644
--- a/arch/x86/kvm/vmx/vmcs12.h
+++ b/arch/x86/kvm/vmx/vmcs12.h
@@ -117,7 +117,13 @@ struct __packed vmcs12 {
 	natural_width host_ia32_sysenter_eip;
 	natural_width host_rsp;
 	natural_width host_rip;
-	natural_width paddingl[8]; /* room for future expansion */
+	natural_width host_s_cet;
+	natural_width host_ssp;
+	natural_width host_ssp_tbl;
+	natural_width guest_s_cet;
+	natural_width guest_ssp;
+	natural_width guest_ssp_tbl;
+	natural_width paddingl[2]; /* room for future expansion */
 	u32 pin_based_vm_exec_control;
 	u32 cpu_based_vm_exec_control;
 	u32 exception_bitmap;
@@ -292,6 +298,12 @@ static inline void vmx_check_vmcs12_offsets(void)
 	CHECK_OFFSET(host_ia32_sysenter_eip, 656);
 	CHECK_OFFSET(host_rsp, 664);
 	CHECK_OFFSET(host_rip, 672);
+	CHECK_OFFSET(host_s_cet, 680);
+	CHECK_OFFSET(host_ssp, 688);
+	CHECK_OFFSET(host_ssp_tbl, 696);
+	CHECK_OFFSET(guest_s_cet, 704);
+	CHECK_OFFSET(guest_ssp, 712);
+	CHECK_OFFSET(guest_ssp_tbl, 720);
 	CHECK_OFFSET(pin_based_vm_exec_control, 744);
 	CHECK_OFFSET(cpu_based_vm_exec_control, 748);
 	CHECK_OFFSET(exception_bitmap, 752);
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index f0dea8ecd0c6..2c43f1088d77 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7731,6 +7731,8 @@ static void nested_vmx_cr_fixed1_bits_update(struct kvm_vcpu *vcpu)
 	cr4_fixed1_update(X86_CR4_PKE,        ecx, feature_bit(PKU));
 	cr4_fixed1_update(X86_CR4_UMIP,       ecx, feature_bit(UMIP));
 	cr4_fixed1_update(X86_CR4_LA57,       ecx, feature_bit(LA57));
+	cr4_fixed1_update(X86_CR4_CET,	      ecx, feature_bit(SHSTK));
+	cr4_fixed1_update(X86_CR4_CET,	      edx, feature_bit(IBT));
 
 #undef cr4_fixed1_update
 }
-- 
2.27.0


^ permalink raw reply related	[flat|nested] 120+ messages in thread

* Re: [PATCH v6 05/25] x86/fpu/xstate: Remove kernel dynamic xfeatures from kernel default_features
  2023-09-14  6:33 ` [PATCH v6 05/25] x86/fpu/xstate: Remove kernel dynamic xfeatures from kernel default_features Yang Weijiang
@ 2023-09-14 16:22   ` Dave Hansen
  2023-09-15  1:52     ` Yang, Weijiang
  2023-10-31 17:44     ` Maxim Levitsky
  0 siblings, 2 replies; 120+ messages in thread
From: Dave Hansen @ 2023-09-14 16:22 UTC (permalink / raw)
  To: Yang Weijiang, seanjc, pbonzini, kvm, linux-kernel
  Cc: peterz, chao.gao, rick.p.edgecombe, john.allen

On 9/13/23 23:33, Yang Weijiang wrote:
> --- a/arch/x86/kernel/fpu/xstate.c
> +++ b/arch/x86/kernel/fpu/xstate.c
> @@ -845,6 +845,7 @@ void __init fpu__init_system_xstate(unsigned int legacy_size)
>  	/* Clean out dynamic features from default */
>  	fpu_kernel_cfg.default_features = fpu_kernel_cfg.max_features;
>  	fpu_kernel_cfg.default_features &= ~XFEATURE_MASK_USER_DYNAMIC;
> +	fpu_kernel_cfg.default_features &= ~fpu_kernel_dynamic_xfeatures;

I'd much rather that this be a closer analog to XFEATURE_MASK_USER_DYNAMIC.

Please define a XFEATURE_MASK_KERNEL_DYNAMIC value and use it here.
Don't use a dynamically generated one.

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v6 06/25] x86/fpu/xstate: Opt-in kernel dynamic bits when calculate guest xstate size
  2023-09-14  6:33 ` [PATCH v6 06/25] x86/fpu/xstate: Opt-in kernel dynamic bits when calculate guest xstate size Yang Weijiang
@ 2023-09-14 17:40   ` Dave Hansen
  2023-09-15  2:22     ` Yang, Weijiang
  0 siblings, 1 reply; 120+ messages in thread
From: Dave Hansen @ 2023-09-14 17:40 UTC (permalink / raw)
  To: Yang Weijiang, seanjc, pbonzini, kvm, linux-kernel
  Cc: peterz, chao.gao, rick.p.edgecombe, john.allen

On 9/13/23 23:33, Yang Weijiang wrote:
> --- a/arch/x86/kernel/fpu/xstate.c
> +++ b/arch/x86/kernel/fpu/xstate.c
> @@ -1636,9 +1636,17 @@ static int __xstate_request_perm(u64 permitted, u64 requested, bool guest)
>  
>  	/* Calculate the resulting kernel state size */
>  	mask = permitted | requested;
> -	/* Take supervisor states into account on the host */
> +	/*
> +	 * Take supervisor states into account on the host. And add
> +	 * kernel dynamic xfeatures to guest since guest kernel may
> +	 * enable corresponding CPU feaures and the xstate registers
> +	 * need to be saved/restored properly.
> +	 */
>  	if (!guest)
>  		mask |= xfeatures_mask_supervisor();
> +	else
> +		mask |= fpu_kernel_dynamic_xfeatures;
> +
>  	ksize = xstate_calculate_size(mask, compacted);

Heh, you changed the "guest" naming in "fpu_kernel_dynamic_xfeatures"
but didn't change the logic.

As it's coded at the moment *ALL* "fpu_kernel_dynamic_xfeatures" are
guest xfeatures.  So, they're different in name only.

If you want to change the rules for guests, we have *ONE* place that's
done: fpstate_reset().  It establishes the permissions and the sizes for
the default guest FPU.  Start there.  If you want to make the guest
defaults include XFEATURE_CET_USER, then you need to put the bit in *there*.

The other option is to have the KVM code actually go and "request" that
the dynamic states get added to 'fpu->guest_perm'.  Would there ever be
any reason for KVM to be on a system which supports a dynamic kernel
feature but where it doesn't get enabled for guest use, or at least
shouldn't have the FPU space allocated?

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v6 01/25] x86/fpu/xstate: Manually check and add XFEATURE_CET_USER xstate bit
  2023-09-14  6:33 ` [PATCH v6 01/25] x86/fpu/xstate: Manually check and add XFEATURE_CET_USER xstate bit Yang Weijiang
@ 2023-09-14 22:39   ` Edgecombe, Rick P
  2023-09-15  2:32     ` Yang, Weijiang
  2023-10-31 17:43   ` Maxim Levitsky
  1 sibling, 1 reply; 120+ messages in thread
From: Edgecombe, Rick P @ 2023-09-14 22:39 UTC (permalink / raw)
  To: kvm, Yang, Weijiang, pbonzini, Christopherson,, Sean, linux-kernel
  Cc: peterz, Hansen, Dave, Gao, Chao, john.allen

On Thu, 2023-09-14 at 02:33 -0400, Yang Weijiang wrote:
> Remove XFEATURE_CET_USER entry from dependency array as the entry
> doesn't
> reflect true dependency between CET features and the xstate bit,
> instead
> manually check and add the bit back if either SHSTK or IBT is
> supported.
> 
> Both user mode shadow stack and indirect branch tracking features
> depend
> on XFEATURE_CET_USER bit in XSS to automatically save/restore user
> mode
> xstate registers, i.e., IA32_U_CET and IA32_PL3_SSP whenever
> necessary.
> 
> Although in real world a platform with IBT but no SHSTK is rare, but
> in
> virtualization world it's common, guest SHSTK and IBT can be
> controlled
> independently via userspace app.

Nit, not sure we can assert it's common yet. It's true in general that
guests can have CPUID combinations that don't appear in real world of
course. Is that what you meant?

Also, this doesn't discuss the main reason for this patch, which is
that KVM will soon use the xfeature for user IBT, and so there will now
be a reason to have XFEATURE_CET_USER depend on IBT.

> 
> Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>

Otherwise:

Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Tested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v6 02/25] x86/fpu/xstate: Fix guest fpstate allocation size calculation
  2023-09-14  6:33 ` [PATCH v6 02/25] x86/fpu/xstate: Fix guest fpstate allocation size calculation Yang Weijiang
@ 2023-09-14 22:45   ` Edgecombe, Rick P
  2023-09-15  2:45     ` Yang, Weijiang
  2023-10-21  0:39   ` Sean Christopherson
  1 sibling, 1 reply; 120+ messages in thread
From: Edgecombe, Rick P @ 2023-09-14 22:45 UTC (permalink / raw)
  To: kvm, Yang, Weijiang, pbonzini, Christopherson,, Sean, linux-kernel
  Cc: peterz, Hansen, Dave, Gao, Chao, john.allen

On Thu, 2023-09-14 at 02:33 -0400, Yang Weijiang wrote:
> Fix guest xsave area allocation size from fpu_user_cfg.default_size
> to
> fpu_kernel_cfg.default_size so that the xsave area size is consistent
> with fpstate->size set in __fpstate_reset().
> 
> With the fix, guest fpstate size is sufficient for KVM supported
> guest
> xfeatures.
> 
> Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>

There is no fix (Fixes: ...) here, right? I think this change is needed
to make sure KVM guests can support supervisor features. But KVM CET
support (to follow in future patches) will be the first one, right?

The side effect will be that KVM guest FPUs now get guaranteed room for
PASID as well as CET. I think I remember you mentioned that due to
alignment requirements, there shouldn't usually be any size change
though? It might be nice to add that in the log, if I'm remembering
correctly.

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v6 03/25] x86/fpu/xstate: Add CET supervisor mode state support
  2023-09-14  6:33 ` [PATCH v6 03/25] x86/fpu/xstate: Add CET supervisor mode state support Yang Weijiang
@ 2023-09-15  0:06   ` Edgecombe, Rick P
  2023-09-15  6:30     ` Yang, Weijiang
  0 siblings, 1 reply; 120+ messages in thread
From: Edgecombe, Rick P @ 2023-09-15  0:06 UTC (permalink / raw)
  To: kvm, Yang, Weijiang, pbonzini, Christopherson,, Sean, linux-kernel
  Cc: peterz, Hansen, Dave, Gao, Chao, john.allen

On Thu, 2023-09-14 at 02:33 -0400, Yang Weijiang wrote:
> Add supervisor mode state support within FPU xstate management
> framework.
> Although supervisor shadow stack is not enabled/used today in
> kernel,KVM
         ^ Nit: needs a space
> requires the support because when KVM advertises shadow stack feature
> to
> guest, architechturally it claims the support for both user and
         ^ Spelling: "architecturally"
> supervisor
> modes for Linux and non-Linux guest OSes.
> 
> With the xstate support, guest supervisor mode shadow stack state can
> be
> properly saved/restored when 1) guest/host FPU context is swapped 
> 2) vCPU
> thread is sched out/in.
(2) is a little bit confusing, because the lazy FPU stuff won't always
save/restore while scheduling. But trying to explain the details in
this commit log is probably unnecessary. Maybe something like?

   2) At the proper times while other tasks are scheduled

I think also a key part of this is that XFEATURE_CET_KERNEL is not
*all* of the "guest supervisor mode shadow stack state", at least with
respect to the MSRs. It might be worth calling that out a little more
loudly.

> 
> The alternative is to enable it in KVM domain, but KVM maintainers
> NAKed
> the solution. The external discussion can be found at [*], it ended
> up
> with adding the support in kernel instead of KVM domain.
> 
> Note, in KVM case, guest CET supervisor state i.e.,
> IA32_PL{0,1,2}_MSRs,
> are preserved after VM-Exit until host/guest fpstates are swapped,
> but
> since host supervisor shadow stack is disabled, the preserved MSRs
> won't
> hurt host.

It might beg the question of if this solution will need to be redone by
some future Linux supervisor shadow stack effort. I *think* the answer
is no.

Most of the xsave managed features are restored before returning to
userspace because they would have userspace effect. But
XFEATURE_CET_KERNEL is different. It only affects the kernel. But the
IA32_PL{0,1,2}_MSRs are used when transitioning to those rings. So for
Linux they would get used when transitioning back from userspace. In
order for it to be used when control transfers back *from* userspace,
it needs to be restored before returning *to* userspace. So despite
being needed only for the kernel, and having no effect on userspace, it
might need to be swapped/restored at the same time as the rest of the
FPU state that only affects userspace.

Probably supervisor shadow stack for Linux needs much more analysis,
but trying to leave some breadcrumbs on the thinking from internal
reviews. I don't know if it might be good to include some of this
reasoning in the commit log. It's a bit hand wavy.

> 
> [*]:
> https://lore.kernel.org/all/806e26c2-8d21-9cc9-a0b7-7787dd231729@intel.com/
> 
> Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>

Otherwise, the code looked good to me.

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v6 04/25] x86/fpu/xstate: Introduce kernel dynamic xfeature set
  2023-09-14  6:33 ` [PATCH v6 04/25] x86/fpu/xstate: Introduce kernel dynamic xfeature set Yang Weijiang
@ 2023-09-15  0:24   ` Edgecombe, Rick P
  2023-09-15  6:42     ` Yang, Weijiang
  0 siblings, 1 reply; 120+ messages in thread
From: Edgecombe, Rick P @ 2023-09-15  0:24 UTC (permalink / raw)
  To: kvm, Yang, Weijiang, pbonzini, Christopherson,, Sean, linux-kernel
  Cc: peterz, Hansen, Dave, Gao, Chao, john.allen

On Thu, 2023-09-14 at 02:33 -0400, Yang Weijiang wrote:
> +static void __init init_kernel_dynamic_xfeatures(void)
> +{
> +       unsigned short cid;
> +       int i;
> +
> +       for (i = 0; i < ARRAY_SIZE(xsave_kernel_dynamic_xfeatures);
> i++) {
> +               cid = xsave_kernel_dynamic_xfeatures[i];
> +
> +               if (cid && boot_cpu_has(cid))
> +                       fpu_kernel_dynamic_xfeatures |= BIT_ULL(i);
> +       }
> +}
> +

I think this can be part of the max_features calculation that uses
xsave_cpuid_features when you use a fixed mask like Dave suggested in
the other patch.

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v6 05/25] x86/fpu/xstate: Remove kernel dynamic xfeatures from kernel default_features
  2023-09-14 16:22   ` Dave Hansen
@ 2023-09-15  1:52     ` Yang, Weijiang
  2023-10-31 17:44     ` Maxim Levitsky
  1 sibling, 0 replies; 120+ messages in thread
From: Yang, Weijiang @ 2023-09-15  1:52 UTC (permalink / raw)
  To: Dave Hansen, seanjc, pbonzini, kvm, linux-kernel
  Cc: peterz, chao.gao, rick.p.edgecombe, john.allen

On 9/15/2023 12:22 AM, Dave Hansen wrote:
> On 9/13/23 23:33, Yang Weijiang wrote:
>> --- a/arch/x86/kernel/fpu/xstate.c
>> +++ b/arch/x86/kernel/fpu/xstate.c
>> @@ -845,6 +845,7 @@ void __init fpu__init_system_xstate(unsigned int legacy_size)
>>   	/* Clean out dynamic features from default */
>>   	fpu_kernel_cfg.default_features = fpu_kernel_cfg.max_features;
>>   	fpu_kernel_cfg.default_features &= ~XFEATURE_MASK_USER_DYNAMIC;
>> +	fpu_kernel_cfg.default_features &= ~fpu_kernel_dynamic_xfeatures;
> I'd much rather that this be a closer analog to XFEATURE_MASK_USER_DYNAMIC.
>
> Please define a XFEATURE_MASK_KERNEL_DYNAMIC value and use it here.
> Don't use a dynamically generated one.

OK,  I will change it, thanks!


^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v6 06/25] x86/fpu/xstate: Opt-in kernel dynamic bits when calculate guest xstate size
  2023-09-14 17:40   ` Dave Hansen
@ 2023-09-15  2:22     ` Yang, Weijiang
  2023-10-24 17:07       ` Sean Christopherson
  0 siblings, 1 reply; 120+ messages in thread
From: Yang, Weijiang @ 2023-09-15  2:22 UTC (permalink / raw)
  To: Dave Hansen, seanjc, pbonzini, kvm, linux-kernel
  Cc: peterz, chao.gao, rick.p.edgecombe, john.allen

On 9/15/2023 1:40 AM, Dave Hansen wrote:
> On 9/13/23 23:33, Yang Weijiang wrote:
>> --- a/arch/x86/kernel/fpu/xstate.c
>> +++ b/arch/x86/kernel/fpu/xstate.c
>> @@ -1636,9 +1636,17 @@ static int __xstate_request_perm(u64 permitted, u64 requested, bool guest)
>>   
>>   	/* Calculate the resulting kernel state size */
>>   	mask = permitted | requested;
>> -	/* Take supervisor states into account on the host */
>> +	/*
>> +	 * Take supervisor states into account on the host. And add
>> +	 * kernel dynamic xfeatures to guest since guest kernel may
>> +	 * enable corresponding CPU feaures and the xstate registers
>> +	 * need to be saved/restored properly.
>> +	 */
>>   	if (!guest)
>>   		mask |= xfeatures_mask_supervisor();
>> +	else
>> +		mask |= fpu_kernel_dynamic_xfeatures;
>> +
>>   	ksize = xstate_calculate_size(mask, compacted);
> Heh, you changed the "guest" naming in "fpu_kernel_dynamic_xfeatures"
> but didn't change the logic.
>
> As it's coded at the moment *ALL* "fpu_kernel_dynamic_xfeatures" are
> guest xfeatures.  So, they're different in name only.
>
> If you want to change the rules for guests, we have *ONE* place that's
> done: fpstate_reset().  It establishes the permissions and the sizes for
> the default guest FPU.  Start there.  If you want to make the guest
> defaults include XFEATURE_CET_USER, then you need to put the bit in *there*.

Yeah, fpstate_reset() is the right place to hold the guest init permissions
and propagate them here, thanks for the suggestion!

Nit, did you actually mean XFEATURE_CET_KERNEL instead of XFEATURE_CET_USER above?
because the latter is already supported by upstream kernel.

> The other option is to have the KVM code actually go and "request" that
> the dynamic states get added to 'fpu->guest_perm'.

Yes, compared with the above option, it would change the current userspace
handling logic, i.e., today only user xstates are dynamically requested.
So I'd try the above option first.

>   Would there ever be
> any reason for KVM to be on a system which supports a dynamic kernel
> feature but where it doesn't get enabled for guest use, or at least
> shouldn't have the FPU space allocated?

I haven't heard of that kind of usage for other features so far. The CET
supervisor xstate is the only dynamic kernel feature now, and I'm not sure
whether other CPU features with supervisor xstates will share CET's
handling logic one day.



^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v6 01/25] x86/fpu/xstate: Manually check and add XFEATURE_CET_USER xstate bit
  2023-09-14 22:39   ` Edgecombe, Rick P
@ 2023-09-15  2:32     ` Yang, Weijiang
  2023-09-15 16:35       ` Edgecombe, Rick P
  0 siblings, 1 reply; 120+ messages in thread
From: Yang, Weijiang @ 2023-09-15  2:32 UTC (permalink / raw)
  To: Edgecombe, Rick P, kvm, pbonzini, Christopherson,, Sean, linux-kernel
  Cc: peterz, Hansen, Dave, Gao, Chao, john.allen

On 9/15/2023 6:39 AM, Edgecombe, Rick P wrote:
> On Thu, 2023-09-14 at 02:33 -0400, Yang Weijiang wrote:
>> Remove XFEATURE_CET_USER entry from dependency array as the entry
>> doesn't
>> reflect true dependency between CET features and the xstate bit,
>> instead
>> manually check and add the bit back if either SHSTK or IBT is
>> supported.
>>
>> Both user mode shadow stack and indirect branch tracking features
>> depend
>> on XFEATURE_CET_USER bit in XSS to automatically save/restore user
>> mode
>> xstate registers, i.e., IA32_U_CET and IA32_PL3_SSP whenever
>> necessary.
>>
>> Although in real world a platform with IBT but no SHSTK is rare, but
>> in
>> virtualization world it's common, guest SHSTK and IBT can be
>> controlled
>> independently via userspace app.
> Nit, not sure we can assert it's common yet. It's true in general that
> guests can have CPUID combinations that don't appear in real world of
> course. Is that what you meant?

Yes, guest CPUID features can be configured by userspace flexibly.

>
> Also, this doesn't discuss the real main reason for this patch, and
> that is that KVM will soon use the xfeature for user ibt, and so there
> will now be a reason to have XFEATURE_CET_USER depend on IBT.

This is one justification for Linux; another reason is that there are
non-Linux OSes which use the user IBT feature. I should make the reasons
clearer in the changelog, thanks for pointing it out!

>> Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>
> Otherwise:
>
> Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> Tested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>


^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v6 02/25] x86/fpu/xstate: Fix guest fpstate allocation size calculation
  2023-09-14 22:45   ` Edgecombe, Rick P
@ 2023-09-15  2:45     ` Yang, Weijiang
  2023-09-15 16:35       ` Edgecombe, Rick P
  0 siblings, 1 reply; 120+ messages in thread
From: Yang, Weijiang @ 2023-09-15  2:45 UTC (permalink / raw)
  To: Edgecombe, Rick P, kvm, pbonzini, Christopherson,, Sean, linux-kernel
  Cc: peterz, Hansen, Dave, Gao, Chao, john.allen

On 9/15/2023 6:45 AM, Edgecombe, Rick P wrote:
> On Thu, 2023-09-14 at 02:33 -0400, Yang Weijiang wrote:
>> Fix guest xsave area allocation size from fpu_user_cfg.default_size
>> to
>> fpu_kernel_cfg.default_size so that the xsave area size is consistent
>> with fpstate->size set in __fpstate_reset().
>>
>> With the fix, guest fpstate size is sufficient for KVM supported
>> guest
>> xfeatures.
>>
>> Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>
> There is no fix (Fixes: ...) here, right?

Oops, I lost it during a rebase, thanks!

> I think this change is needed
> to make sure KVM guests can support supervisor features. But KVM CET
> support (to follow in future patches) will be the first one, right?

Exactly, the existing code takes only user xfeatures into account, and more
and more CPU features rely on supervisor xstates.

> The side effect will be that KVM guest FPUs now get guaranteed room for
> PASID as well as CET. I think I remember you mentioned that due to
> alignment requirements, there shouldn't usually be any size change
> though?

Yes, IIUC the precondition is that the AMX feature is enabled for the guest, so the CET
supervisor state actually resides in the gap before the next 64-byte-aligned address, i.e.,
there is no actual size expansion.

> It might be nice to add that in the log, if I'm remembering
> correctly.

Sure, thanks!


^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v6 03/25] x86/fpu/xstate: Add CET supervisor mode state support
  2023-09-15  0:06   ` Edgecombe, Rick P
@ 2023-09-15  6:30     ` Yang, Weijiang
  2023-10-31 17:44       ` Maxim Levitsky
  0 siblings, 1 reply; 120+ messages in thread
From: Yang, Weijiang @ 2023-09-15  6:30 UTC (permalink / raw)
  To: Edgecombe, Rick P, kvm, pbonzini, Christopherson,, Sean, linux-kernel
  Cc: peterz, Hansen, Dave, Gao, Chao, john.allen

On 9/15/2023 8:06 AM, Edgecombe, Rick P wrote:
> On Thu, 2023-09-14 at 02:33 -0400, Yang Weijiang wrote:
>> Add supervisor mode state support within FPU xstate management
>> framework.
>> Although supervisor shadow stack is not enabled/used today in
>> kernel,KVM
>           ^ Nit: needs a space
>> requires the support because when KVM advertises shadow stack feature
>> to
>> guest, architechturally it claims the support for both user and
>           ^ Spelling: "architecturally"

Thank you!!

>> supervisor
>> modes for Linux and non-Linux guest OSes.
>>
>> With the xstate support, guest supervisor mode shadow stack state can
>> be
>> properly saved/restored when 1) guest/host FPU context is swapped
>> 2) vCPU
>> thread is sched out/in.
> (2) is a little bit confusing, because the lazy FPU stuff won't always
> save/restore while scheduling.

That's true for a normal thread, but a vCPU thread is a bit different. On the path to
VM-entry, after the host/guest FPU states are swapped, preemption is not disabled, so
the vCPU thread can still be scheduled out/in. In that case, guest FPU states will be
saved/restored because TIF_NEED_FPU_LOAD is always cleared after the swap.

> But trying to explain the details in
> this commit log is probably unnecessary. Maybe something like?
>
>     2) At the proper times while other tasks are scheduled

I just want to justify that enabling the supervisor xstate is necessary for the guest.
Maybe I need to reword it a bit :-)

> I think also a key part of this is that XFEATURE_CET_KERNEL is not
> *all* of the "guest supervisor mode shadow stack state", at least with
> respect to the MSRs. It might be worth calling that out a little more
> loudly.

OK, I will call out that supervisor mode shadow stack state also includes the IA32_S_CET MSR.

>> The alternative is to enable it in KVM domain, but KVM maintainers
>> NAKed
>> the solution. The external discussion can be found at [*], it ended
>> up
>> with adding the support in kernel instead of KVM domain.
>>
>> Note, in KVM case, guest CET supervisor state i.e.,
>> IA32_PL{0,1,2}_MSRs,
>> are preserved after VM-Exit until host/guest fpstates are swapped,
>> but
>> since host supervisor shadow stack is disabled, the preserved MSRs
>> won't
>> hurt host.
> It might beg the question of if this solution will need to be redone by
> some future Linux supervisor shadow stack effort. I *think* the answer
> is no.

AFAICT KVM needs to be modified if host shadow stack is implemented; at the least,
guest/host CET supervisor MSRs should be swapped as early as possible after
VM-exit so that the host won't misbehave on *guest* MSR contents.

> Most of the xsave managed features are restored before returning to
> userspace because they would have userspace effect. But
> XFEATURE_CET_KERNEL is different. It only effects the kernel. But the
> IA32_PL{0,1,2}_MSRs are used when transitioning to those rings. So for
> Linux they would get used when transitioning back from userspace. In
> order for it to be used when control transfers back *from* userspace,
> it needs to be restored before returning *to* userspace. So despite
> being needed only for the kernel, and having no effect on userspace, it
> might need to be swapped/restored at the same time as the rest of the
> FPU state that only affects userspace.

You're right, to enable supervisor mode shadow stack we need to handle every
ring/stack switch carefully. But we still have time to figure out those details.

Thanks a lot for bringing up this line of thinking!

> Probably supervisor shadow stack for Linux needs much more analysis,
> but trying to leave some breadcrumbs on the thinking from internal
> reviews. I don't know if it might be good to include some of this
> reasoning in the commit log. It's a bit hand wavy.

IMO, we have leaned heavily on the assumption that the CET supervisor shadow stack is
not enabled in the kernel, and this patch itself is straightforward and simple; it's just a
small brick toward enabling supervisor shadow stack. We can revisit whether something is
an issue based on how SSS is eventually implemented in the kernel. So let's not add that
kind of reasoning :-)

Thank you for the enlightenment!
>> [*]:
>> https://lore.kernel.org/all/806e26c2-8d21-9cc9-a0b7-7787dd231729@intel.com/
>>
>> Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>
> Otherwise, the code looked good to me.


^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v6 04/25] x86/fpu/xstate: Introduce kernel dynamic xfeature set
  2023-09-15  0:24   ` Edgecombe, Rick P
@ 2023-09-15  6:42     ` Yang, Weijiang
  2023-10-31 17:44       ` Maxim Levitsky
  0 siblings, 1 reply; 120+ messages in thread
From: Yang, Weijiang @ 2023-09-15  6:42 UTC (permalink / raw)
  To: Edgecombe, Rick P, kvm, pbonzini, Christopherson,, Sean, linux-kernel
  Cc: peterz, Hansen, Dave, Gao, Chao, john.allen

On 9/15/2023 8:24 AM, Edgecombe, Rick P wrote:
> On Thu, 2023-09-14 at 02:33 -0400, Yang Weijiang wrote:
>> +static void __init init_kernel_dynamic_xfeatures(void)
>> +{
>> +       unsigned short cid;
>> +       int i;
>> +
>> +       for (i = 0; i < ARRAY_SIZE(xsave_kernel_dynamic_xfeatures);
>> i++) {
>> +               cid = xsave_kernel_dynamic_xfeatures[i];
>> +
>> +               if (cid && boot_cpu_has(cid))
>> +                       fpu_kernel_dynamic_xfeatures |= BIT_ULL(i);
>> +       }
>> +}
>> +
> I think this can be part of the max_features calculation that uses
> xsave_cpuid_features when you use use a fixed mask like Dave suggested
> in the other patch.

Yes, max_features already includes the CET supervisor state bit. After switching to the
fixed mask, this function is no longer needed.
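A rough sketch of the suggestion, i.e. replacing the per-feature CPUID loop with a fixed mask known at build time (the bit position below is hypothetical, for illustration only):

```c
#include <stdint.h>

/* Hypothetical bit position, for illustration only. */
#define XFEATURE_CET_KERNEL 12

/*
 * With a fixed mask, the kernel-dynamic xfeatures are simply the
 * intersection of that mask with the CPU-supported max_features; no
 * separate init-time loop over CPUID bits is needed.
 */
static uint64_t kernel_dynamic_xfeatures(uint64_t max_features)
{
	const uint64_t mask = 1ULL << XFEATURE_CET_KERNEL;

	return max_features & mask;
}
```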



^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v6 01/25] x86/fpu/xstate: Manually check and add XFEATURE_CET_USER xstate bit
  2023-09-15  2:32     ` Yang, Weijiang
@ 2023-09-15 16:35       ` Edgecombe, Rick P
  2023-09-18  7:16         ` Yang, Weijiang
  0 siblings, 1 reply; 120+ messages in thread
From: Edgecombe, Rick P @ 2023-09-15 16:35 UTC (permalink / raw)
  To: kvm, Yang, Weijiang, pbonzini, Christopherson,, Sean, linux-kernel
  Cc: peterz, Hansen, Dave, Gao, Chao, john.allen

On Fri, 2023-09-15 at 10:32 +0800, Yang, Weijiang wrote:
> > 
> > Also, this doesn't discuss the real main reason for this patch, and
> > that is that KVM will soon use the xfeature for user ibt, and so
> > there
> > will now be a reason to have XFEATURE_CET_USER depend on IBT.
> 
> This is one justification for Linux OS, another reason is there's
> non-Linux
> OS which is using the user IBT feature.  I should make the reasons
> clearer
> in changelog, thanks for pointing it out!

The point I was trying to make was today (before this series) nothing
on the system can use user IBT. Not the host, and not in any guest
because KVM doesn't support it. So the added xfeature dependency on IBT
was not previously needed. It is being added only for KVM CET support
(which, yes, may run on guests with non-standard CPUID).



^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v6 02/25] x86/fpu/xstate: Fix guest fpstate allocation size calculation
  2023-09-15  2:45     ` Yang, Weijiang
@ 2023-09-15 16:35       ` Edgecombe, Rick P
  0 siblings, 0 replies; 120+ messages in thread
From: Edgecombe, Rick P @ 2023-09-15 16:35 UTC (permalink / raw)
  To: kvm, Yang, Weijiang, pbonzini, Christopherson,, Sean, linux-kernel
  Cc: peterz, Hansen, Dave, Gao, Chao, john.allen

On Fri, 2023-09-15 at 10:45 +0800, Yang, Weijiang wrote:
> On 9/15/2023 6:45 AM, Edgecombe, Rick P wrote:
> > On Thu, 2023-09-14 at 02:33 -0400, Yang Weijiang wrote:
> > > Fix guest xsave area allocation size from
> > > fpu_user_cfg.default_size
> > > to
> > > fpu_kernel_cfg.default_size so that the xsave area size is
> > > consistent
> > > with fpstate->size set in __fpstate_reset().
> > > 
> > > With the fix, guest fpstate size is sufficient for KVM supported
> > > guest
> > > xfeatures.
> > > 
> > > Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>
> > There is no fix (Fixes: ...) here, right?
> 
> Ooh, I got it lost during rebase, thanks!
> 
> > I think this change is needed
> > to make sure KVM guests can support supervisor features. But KVM
> > CET
> > support (to follow in future patches) will be the first one, right?
> 
> Exactly, the existing code takes only user xfeatures into account,
> and we have more
> and more CPU features rely on supervisor xstates.

If KVM is not using any supervisor features, then pre CET KVM support I
think the current code is more correct.

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v6 01/25] x86/fpu/xstate: Manually check and add XFEATURE_CET_USER xstate bit
  2023-09-15 16:35       ` Edgecombe, Rick P
@ 2023-09-18  7:16         ` Yang, Weijiang
  0 siblings, 0 replies; 120+ messages in thread
From: Yang, Weijiang @ 2023-09-18  7:16 UTC (permalink / raw)
  To: Edgecombe, Rick P, kvm, pbonzini, Christopherson,, Sean, linux-kernel
  Cc: peterz, Hansen, Dave, Gao, Chao, john.allen

On 9/16/2023 12:35 AM, Edgecombe, Rick P wrote:
> On Fri, 2023-09-15 at 10:32 +0800, Yang, Weijiang wrote:
>>> Also, this doesn't discuss the real main reason for this patch, and
>>> that is that KVM will soon use the xfeature for user ibt, and so
>>> there
>>> will now be a reason to have XFEATURE_CET_USER depend on IBT.
>> This is one justification for Linux OS, another reason is there's
>> non-Linux
>> OS which is using the user IBT feature.  I should make the reasons
>> clearer
>> in changelog, thanks for pointing it out!
> The point I was trying to make was today (before this series) nothing
> on the system can use user IBT. Not the host, and not in any guest
> because KVM doesn't support it. So the added xfeature dependency on IBT
> was not previously needed. It is being added only for KVM CET support
> (which, yes, may run on guests with non-standard CPUID).

Agree, I'll highlight this in the changelog, thanks!



^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v6 23/25] KVM: x86: Enable CET virtualization for VMX and advertise to userspace
  2023-09-14  6:33 ` [PATCH v6 23/25] KVM: x86: Enable CET virtualization for VMX and advertise to userspace Yang Weijiang
@ 2023-09-24 13:38   ` kernel test robot
  2023-09-25  0:26     ` Yang, Weijiang
  2023-10-31 17:56   ` Maxim Levitsky
  1 sibling, 1 reply; 120+ messages in thread
From: kernel test robot @ 2023-09-24 13:38 UTC (permalink / raw)
  To: Yang Weijiang
  Cc: oe-lkp, lkp, kvm, linux-kernel, seanjc, pbonzini, dave.hansen,
	peterz, chao.gao, rick.p.edgecombe, weijiang.yang, john.allen,
	oliver.sang



Hello,

kernel test robot noticed "WARNING:at_arch/x86/kvm/vmx/vmx.c:#vmwrite_error[kvm_intel]" on:

commit: 68d0338a67df85ab18482295976e7bd873987165 ("[PATCH v6 23/25] KVM: x86: Enable CET virtualization for VMX and advertise to userspace")
url: https://github.com/intel-lab-lkp/linux/commits/Yang-Weijiang/x86-fpu-xstate-Manually-check-and-add-XFEATURE_CET_USER-xstate-bit/20230914-174056
patch link: https://lore.kernel.org/all/20230914063325.85503-24-weijiang.yang@intel.com/
patch subject: [PATCH v6 23/25] KVM: x86: Enable CET virtualization for VMX and advertise to userspace

in testcase: kvm-unit-tests-qemu
version: 
with following parameters:




compiler: gcc-12
test machine: 128 threads 2 sockets Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz (Ice Lake) with 128G memory

(please refer to attached dmesg/kmsg for entire log/backtrace)



If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <oliver.sang@intel.com>
| Closes: https://lore.kernel.org/oe-lkp/202309242050.90b36814-oliver.sang@intel.com



The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20230924/202309242050.90b36814-oliver.sang@intel.com



[  271.856711][T15436] ------------[ cut here ]------------
[  271.863011][T15436] vmwrite failed: field=682a val=0 err=12
[  271.869458][T15436] WARNING: CPU: 117 PID: 15436 at arch/x86/kvm/vmx/vmx.c:444 vmwrite_error+0x16b/0x2e0 [kvm_intel]
[  271.880940][T15436] Modules linked in: kvm_intel kvm irqbypass btrfs blake2b_generic xor raid6_pq zstd_compress libcrc32c sd_mod t10_pi crc64_rocksoft_generic crc64_rocksoft crc64 sg intel_rapl_msr intel_rapl_common x86_pkg_temp_thermal intel_powerclamp coretemp crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel sha512_ssse3 rapl intel_cstate ipmi_ssif ahci ast libahci mei_me drm_shmem_helper intel_uncore dax_hmem ioatdma joydev drm_kms_helper acpi_ipmi libata mei intel_pch_thermal dca wmi ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter acpi_pad fuse drm ip_tables [last unloaded: irqbypass]
[  271.939752][T15436] CPU: 117 PID: 15436 Comm: qemu-system-x86 Not tainted 6.5.0-12553-g68d0338a67df #1
[  271.950090][T15436] RIP: 0010:vmwrite_error+0x16b/0x2e0 [kvm_intel]
[  271.957256][T15436] Code: ff c6 05 f1 4b 82 ff 01 66 90 b9 00 44 00 00 0f 78 c9 0f 86 e0 00 00 00 48 89 ea 48 89 de 48 c7 c7 80 1c d9 c0 e8 c5 b7 c4 bf <0f> 0b e9 ae fe ff ff 48 c7 c0 a0 6f d9 c0 48 ba 00 00 00 00 00 fc
[  271.978720][T15436] RSP: 0018:ffa000000e117980 EFLAGS: 00010286
[  271.985599][T15436] RAX: 0000000000000000 RBX: 000000000000682a RCX: ffffffff82216eee
[  271.994345][T15436] RDX: 1fe2200403fd57c8 RSI: 0000000000000008 RDI: ffa000000e117738
[  272.003044][T15436] RBP: 0000000000000000 R08: 0000000000000001 R09: fff3fc0001c22ee7
[  272.011865][T15436] R10: ffa000000e11773f R11: 0000000000000001 R12: ff110011b12a4b20
[  272.020632][T15436] R13: 0000000000000000 R14: 0000000000000000 R15: ff110011b12a4980
[  272.029340][T15436] FS:  00007f79fd975700(0000) GS:ff1100201fe80000(0000) knlGS:0000000000000000
[  272.039141][T15436] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  272.046484][T15436] CR2: 00007f79e8000010 CR3: 00000010d23c0003 CR4: 0000000000773ee0
[  272.055167][T15436] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  272.063980][T15436] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  272.072749][T15436] PKRU: 55555554
[  272.076985][T15436] Call Trace:
[  272.080947][T15436]  <TASK>
[  272.084650][T15436]  ? __warn+0xcd/0x260
[  272.089420][T15436]  ? vmwrite_error+0x16b/0x2e0 [kvm_intel]
[  272.096014][T15436]  ? report_bug+0x267/0x2d0
[  272.101163][T15436]  ? handle_bug+0x3c/0x70
[  272.106130][T15436]  ? exc_invalid_op+0x17/0x40
[  272.111483][T15436]  ? asm_exc_invalid_op+0x1a/0x20
[  272.117132][T15436]  ? llist_add_batch+0xbe/0x130
[  272.122685][T15436]  ? vmwrite_error+0x16b/0x2e0 [kvm_intel]
[  272.129113][T15436]  vmx_vcpu_reset+0x2382/0x30b0 [kvm_intel]
[  272.135741][T15436]  ? init_vmcs+0x7230/0x7230 [kvm_intel]
[  272.141988][T15436]  ? irq_work_sync+0x8a/0x1f0
[  272.147302][T15436]  ? kvm_clear_async_pf_completion_queue+0x2e6/0x4c0 [kvm]
[  272.155191][T15436]  kvm_vcpu_reset+0x8cc/0x1080 [kvm]
[  272.161154][T15436]  kvm_arch_vcpu_create+0x8c5/0xbd0 [kvm]
[  272.167584][T15436]  kvm_vm_ioctl_create_vcpu+0x4be/0xe20 [kvm]
[  272.174297][T15436]  ? __alloc_pages+0x1d5/0x440
[  272.179723][T15436]  ? kvm_get_dirty_log_protect+0x5f0/0x5f0 [kvm]
[  272.186757][T15436]  ? __alloc_pages_slowpath+0x1cf0/0x1cf0
[  272.194079][T15436]  ? do_user_addr_fault+0x26c/0xac0
[  272.199837][T15436]  ? mem_cgroup_handle_over_high+0x570/0x570
[  272.206405][T15436]  ? _raw_spin_lock+0x85/0xe0
[  272.211721][T15436]  ? _raw_write_lock_irq+0xe0/0xe0
[  272.217414][T15436]  kvm_vm_ioctl+0x939/0xde0 [kvm]
[  272.223014][T15436]  ? __mod_memcg_lruvec_state+0x100/0x220
[  272.229278][T15436]  ? kvm_unregister_device_ops+0x90/0x90 [kvm]
[  272.235978][T15436]  ? __mod_lruvec_page_state+0x1ad/0x3a0
[  272.242092][T15436]  ? perf_trace_mm_lru_insertion+0x7c0/0x7c0
[  272.248627][T15436]  ? folio_batch_add_and_move+0xc1/0x110
[  272.254832][T15436]  ? do_anonymous_page+0x5e2/0xc10
[  272.260431][T15436]  ? up_write+0x52/0x90
[  272.265006][T15436]  ? vfs_fileattr_set+0x4e0/0x4e0
[  272.270502][T15436]  ? copy_page_range+0x880/0x880
[  272.275831][T15436]  ? __count_memcg_events+0xdd/0x1e0
[  272.281564][T15436]  ? handle_mm_fault+0x187/0x7a0
[  272.286855][T15436]  ? __fget_light+0x236/0x4d0
[  272.291883][T15436]  __x64_sys_ioctl+0x130/0x1a0
[  272.296994][T15436]  do_syscall_64+0x38/0x80
[  272.301756][T15436]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
[  272.307993][T15436] RIP: 0033:0x7f79fe886237
[  272.312758][T15436] Code: 00 00 00 48 8b 05 59 cc 0d 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 29 cc 0d 00 f7 d8 64 89 01 48
[  272.333241][T15436] RSP: 002b:00007f79fd974808 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[  272.342024][T15436] RAX: ffffffffffffffda RBX: 000000000000ae41 RCX: 00007f79fe886237
[  272.350428][T15436] RDX: 0000000000000000 RSI: 000000000000ae41 RDI: 000000000000000d
[  272.358789][T15436] RBP: 00005606ece4cc90 R08: 0000000000000000 R09: 0000000000000000
[  272.367151][T15436] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[  272.375587][T15436] R13: 00007ffefe5a1daf R14: 00007f79fd974a80 R15: 0000000000802000
[  272.383950][T15436]  </TASK>
[  272.387416][T15436] ---[ end trace 0000000000000000 ]---
[  272.393295][T15436] kvm_intel: vmwrite failed: field=682a val=0 err=12



-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki


^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v6 23/25] KVM: x86: Enable CET virtualization for VMX and advertise to userspace
  2023-09-24 13:38   ` kernel test robot
@ 2023-09-25  0:26     ` Yang, Weijiang
  0 siblings, 0 replies; 120+ messages in thread
From: Yang, Weijiang @ 2023-09-25  0:26 UTC (permalink / raw)
  To: kernel test robot
  Cc: oe-lkp, lkp, kvm, linux-kernel, seanjc, pbonzini, dave.hansen,
	peterz, chao.gao, rick.p.edgecombe, john.allen


It's due to a missing capability check; I will fix the call trace in the next version.

On 9/24/2023 9:38 PM, kernel test robot wrote:
>
> Hello,
>
> kernel test robot noticed "WARNING:at_arch/x86/kvm/vmx/vmx.c:#vmwrite_error[kvm_intel]" on:
>
> commit: 68d0338a67df85ab18482295976e7bd873987165 ("[PATCH v6 23/25] KVM: x86: Enable CET virtualization for VMX and advertise to userspace")
> url: https://github.com/intel-lab-lkp/linux/commits/Yang-Weijiang/x86-fpu-xstate-Manually-check-and-add-XFEATURE_CET_USER-xstate-bit/20230914-174056
> patch link: https://lore.kernel.org/all/20230914063325.85503-24-weijiang.yang@intel.com/
> patch subject: [PATCH v6 23/25] KVM: x86: Enable CET virtualization for VMX and advertise to userspace
>
> in testcase: kvm-unit-tests-qemu
> version:
> with following parameters:
>
>
>
>
> compiler: gcc-12
> test machine: 128 threads 2 sockets Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz (Ice Lake) with 128G memory
>
> (please refer to attached dmesg/kmsg for entire log/backtrace)
>
>
>
> If you fix the issue in a separate patch/commit (i.e. not just a new version of
> the same patch/commit), kindly add following tags
> | Reported-by: kernel test robot <oliver.sang@intel.com>
> | Closes: https://lore.kernel.org/oe-lkp/202309242050.90b36814-oliver.sang@intel.com
>
>
>
> The kernel config and materials to reproduce are available at:
> https://download.01.org/0day-ci/archive/20230924/202309242050.90b36814-oliver.sang@intel.com
>
>
>
> [  271.856711][T15436] ------------[ cut here ]------------
> [  271.863011][T15436] vmwrite failed: field=682a val=0 err=12
> [  271.869458][T15436] WARNING: CPU: 117 PID: 15436 at arch/x86/kvm/vmx/vmx.c:444 vmwrite_error+0x16b/0x2e0 [kvm_intel]
> [  271.880940][T15436] Modules linked in: kvm_intel kvm irqbypass btrfs blake2b_generic xor raid6_pq zstd_compress libcrc32c sd_mod t10_pi crc64_rocksoft_generic crc64_rocksoft crc64 sg intel_rapl_msr intel_rapl_common x86_pkg_temp_thermal intel_powerclamp coretemp crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel sha512_ssse3 rapl intel_cstate ipmi_ssif ahci ast libahci mei_me drm_shmem_helper intel_uncore dax_hmem ioatdma joydev drm_kms_helper acpi_ipmi libata mei intel_pch_thermal dca wmi ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter acpi_pad fuse drm ip_tables [last unloaded: irqbypass]
> [  271.939752][T15436] CPU: 117 PID: 15436 Comm: qemu-system-x86 Not tainted 6.5.0-12553-g68d0338a67df #1
> [  271.950090][T15436] RIP: 0010:vmwrite_error+0x16b/0x2e0 [kvm_intel]
> [  271.957256][T15436] Code: ff c6 05 f1 4b 82 ff 01 66 90 b9 00 44 00 00 0f 78 c9 0f 86 e0 00 00 00 48 89 ea 48 89 de 48 c7 c7 80 1c d9 c0 e8 c5 b7 c4 bf <0f> 0b e9 ae fe ff ff 48 c7 c0 a0 6f d9 c0 48 ba 00 00 00 00 00 fc
> [...]
>
>
>


^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v6 00/25] Enable CET Virtualization
  2023-09-14  6:33 [PATCH v6 00/25] Enable CET Virtualization Yang Weijiang
                   ` (24 preceding siblings ...)
  2023-09-14  6:33 ` [PATCH v6 25/25] KVM: nVMX: Enable CET support for nested guest Yang Weijiang
@ 2023-09-25  0:31 ` Yang, Weijiang
  25 siblings, 0 replies; 120+ messages in thread
From: Yang, Weijiang @ 2023-09-25  0:31 UTC (permalink / raw)
  To: seanjc, pbonzini, kvm, linux-kernel
  Cc: dave.hansen, peterz, chao.gao, rick.p.edgecombe, john.allen


Kindly pinging the maintainers for review of the KVM part, thanks!

On 9/14/2023 2:33 PM, Yang Weijiang wrote:
> Control-flow Enforcement Technology (CET) is a kind of CPU feature used
> to prevent Return/CALL/Jump-Oriented Programming (ROP/COP/JOP) attacks.
> It provides two sub-features(SHSTK,IBT) to defend against ROP/COP/JOP
> style control-flow subversion attacks.
>
> Shadow Stack (SHSTK):
>    A shadow stack is a second stack used exclusively for control transfer
>    operations. The shadow stack is separate from the data/normal stack and
>    can be enabled individually in user and kernel mode. When shadow stack
>    is enabled, CALL pushes the return address on both the data and shadow
>    stack. RET pops the return address from both stacks and compares them.
>    If the return addresses from the two stacks do not match, the processor
>    generates a #CP.
>
> Indirect Branch Tracking (IBT):
>    IBT introduces new instruction(ENDBRANCH)to mark valid target addresses of
>    indirect branches (CALL, JMP etc...). If an indirect branch is executed
>    and the next instruction is _not_ an ENDBRANCH, the processor generates a
>    #CP. These instruction behaves as a NOP on platforms that doesn't support
>    CET.
>
>
[...]

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v6 12/25] KVM: x86: Refresh CPUID on write to guest MSR_IA32_XSS
  2023-09-14  6:33 ` [PATCH v6 12/25] KVM: x86: Refresh CPUID on write to guest MSR_IA32_XSS Yang Weijiang
@ 2023-10-08  5:54   ` Chao Gao
  2023-10-10  0:49     ` Yang, Weijiang
  2023-10-31 17:51   ` Maxim Levitsky
  2023-11-15  7:18   ` Binbin Wu
  2 siblings, 1 reply; 120+ messages in thread
From: Chao Gao @ 2023-10-08  5:54 UTC (permalink / raw)
  To: Yang Weijiang
  Cc: seanjc, pbonzini, kvm, linux-kernel, dave.hansen, peterz,
	rick.p.edgecombe, john.allen, Zhang Yi Z

>diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
>index 0fc5e6312e93..d77b030e996c 100644
>--- a/arch/x86/include/asm/kvm_host.h
>+++ b/arch/x86/include/asm/kvm_host.h
>@@ -803,6 +803,7 @@ struct kvm_vcpu_arch {
> 
> 	u64 xcr0;
> 	u64 guest_supported_xcr0;
>+	u64 guest_supported_xss;

This structure has the ia32_xss field. how about moving it here for symmetry?

> 
> 	struct kvm_pio_request pio;
> 	void *pio_data;
>diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
>index 1f206caec559..4e7a820cba62 100644
>--- a/arch/x86/kvm/cpuid.c
>+++ b/arch/x86/kvm/cpuid.c
>@@ -275,7 +275,8 @@ static void __kvm_update_cpuid_runtime(struct kvm_vcpu *vcpu, struct kvm_cpuid_e
> 	best = cpuid_entry2_find(entries, nent, 0xD, 1);
> 	if (best && (cpuid_entry_has(best, X86_FEATURE_XSAVES) ||
> 		     cpuid_entry_has(best, X86_FEATURE_XSAVEC)))
>-		best->ebx = xstate_required_size(vcpu->arch.xcr0, true);
>+		best->ebx = xstate_required_size(vcpu->arch.xcr0 |
>+						 vcpu->arch.ia32_xss, true);
> 
> 	best = __kvm_find_kvm_cpuid_features(vcpu, entries, nent);
> 	if (kvm_hlt_in_guest(vcpu->kvm) && best &&
>@@ -312,6 +313,17 @@ static u64 vcpu_get_supported_xcr0(struct kvm_vcpu *vcpu)
> 	return (best->eax | ((u64)best->edx << 32)) & kvm_caps.supported_xcr0;
> }
> 
>+static u64 vcpu_get_supported_xss(struct kvm_vcpu *vcpu)
>+{
>+	struct kvm_cpuid_entry2 *best;
>+
>+	best = kvm_find_cpuid_entry_index(vcpu, 0xd, 1);
>+	if (!best)
>+		return 0;
>+
>+	return (best->ecx | ((u64)best->edx << 32)) & kvm_caps.supported_xss;
>+}
>+
> static bool kvm_cpuid_has_hyperv(struct kvm_cpuid_entry2 *entries, int nent)
> {
> 	struct kvm_cpuid_entry2 *entry;
>@@ -358,6 +370,7 @@ static void kvm_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
> 	}
> 
> 	vcpu->arch.guest_supported_xcr0 = vcpu_get_supported_xcr0(vcpu);
>+	vcpu->arch.guest_supported_xss = vcpu_get_supported_xss(vcpu);
> 
> 	/*
> 	 * FP+SSE can always be saved/restored via KVM_{G,S}ET_XSAVE, even if
>diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>index 1258d1d6dd52..9a616d84bd39 100644
>--- a/arch/x86/kvm/x86.c
>+++ b/arch/x86/kvm/x86.c
>@@ -3795,20 +3795,25 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
> 			vcpu->arch.ia32_tsc_adjust_msr += adj;
> 		}
> 		break;
>-	case MSR_IA32_XSS:
>-		if (!msr_info->host_initiated &&
>-		    !guest_cpuid_has(vcpu, X86_FEATURE_XSAVES))
>+	case MSR_IA32_XSS: {
>+		bool host_msr_reset = msr_info->host_initiated && data == 0;
>+
>+		if (!guest_cpuid_has(vcpu, X86_FEATURE_XSAVES) &&
>+		    (!host_msr_reset || !msr_info->host_initiated))

!msr_info->host_initiated can be dropped here.

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v6 16/25] KVM: x86: Report KVM supported CET MSRs as to-be-saved
  2023-09-14  6:33 ` [PATCH v6 16/25] KVM: x86: Report KVM supported CET MSRs as to-be-saved Yang Weijiang
@ 2023-10-08  6:19   ` Chao Gao
  2023-10-10  0:54     ` Yang, Weijiang
  2023-10-31 17:52   ` Maxim Levitsky
  1 sibling, 1 reply; 120+ messages in thread
From: Chao Gao @ 2023-10-08  6:19 UTC (permalink / raw)
  To: Yang Weijiang
  Cc: seanjc, pbonzini, kvm, linux-kernel, dave.hansen, peterz,
	rick.p.edgecombe, john.allen

On Thu, Sep 14, 2023 at 02:33:16AM -0400, Yang Weijiang wrote:
>Add CET MSRs to the list of MSRs reported to userspace if the feature,
>i.e. IBT or SHSTK, associated with the MSRs is supported by KVM.
>
>SSP can only be read via RDSSP. Writing even requires destructive and
>potentially faulting operations such as SAVEPREVSSP/RSTORSSP or
>SETSSBSY/CLRSSBSY. Let the host use a pseudo-MSR that is just a wrapper
>for the GUEST_SSP field of the VMCS.
>
>Suggested-by: Chao Gao <chao.gao@intel.com>
>Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>
>---
> arch/x86/include/uapi/asm/kvm_para.h |  1 +
> arch/x86/kvm/vmx/vmx.c               |  2 ++
> arch/x86/kvm/x86.c                   | 18 ++++++++++++++++++
> 3 files changed, 21 insertions(+)
>
>diff --git a/arch/x86/include/uapi/asm/kvm_para.h b/arch/x86/include/uapi/asm/kvm_para.h
>index 6e64b27b2c1e..9864bbcf2470 100644
>--- a/arch/x86/include/uapi/asm/kvm_para.h
>+++ b/arch/x86/include/uapi/asm/kvm_para.h
>@@ -58,6 +58,7 @@
> #define MSR_KVM_ASYNC_PF_INT	0x4b564d06
> #define MSR_KVM_ASYNC_PF_ACK	0x4b564d07
> #define MSR_KVM_MIGRATION_CONTROL	0x4b564d08
>+#define MSR_KVM_SSP	0x4b564d09
> 
> struct kvm_steal_time {
> 	__u64 steal;
>diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
>index 72e3943f3693..9409753f45b0 100644
>--- a/arch/x86/kvm/vmx/vmx.c
>+++ b/arch/x86/kvm/vmx/vmx.c
>@@ -7009,6 +7009,8 @@ static bool vmx_has_emulated_msr(struct kvm *kvm, u32 index)
> 	case MSR_AMD64_TSC_RATIO:
> 		/* This is AMD only.  */
> 		return false;
>+	case MSR_KVM_SSP:
>+		return kvm_cpu_cap_has(X86_FEATURE_SHSTK);

For other MSRs in emulated_msrs_all[], KVM doesn't check the associated
CPUID feature bits. Why bother doing this for MSR_KVM_SSP?


* Re: [PATCH v6 12/25] KVM: x86: Refresh CPUID on write to guest MSR_IA32_XSS
  2023-10-08  5:54   ` Chao Gao
@ 2023-10-10  0:49     ` Yang, Weijiang
  0 siblings, 0 replies; 120+ messages in thread
From: Yang, Weijiang @ 2023-10-10  0:49 UTC (permalink / raw)
  To: Chao Gao
  Cc: seanjc, pbonzini, kvm, linux-kernel, dave.hansen, peterz,
	rick.p.edgecombe, john.allen, Zhang Yi Z

On 10/8/2023 1:54 PM, Chao Gao wrote:
>> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
>> index 0fc5e6312e93..d77b030e996c 100644
>> --- a/arch/x86/include/asm/kvm_host.h
>> +++ b/arch/x86/include/asm/kvm_host.h
>> @@ -803,6 +803,7 @@ struct kvm_vcpu_arch {
>>
>> 	u64 xcr0;
>> 	u64 guest_supported_xcr0;
>> +	u64 guest_supported_xss;
> This structure has the ia32_xss field. how about moving it here for symmetry?

OK, will do it, thanks!

>> 	struct kvm_pio_request pio;
>> 	void *pio_data;
>> diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
>> index 1f206caec559..4e7a820cba62 100644
>> --- a/arch/x86/kvm/cpuid.c
>> +++ b/arch/x86/kvm/cpuid.c
>> @@ -275,7 +275,8 @@ static void __kvm_update_cpuid_runtime(struct kvm_vcpu *vcpu, struct kvm_cpuid_e
>> 	best = cpuid_entry2_find(entries, nent, 0xD, 1);
>> 	if (best && (cpuid_entry_has(best, X86_FEATURE_XSAVES) ||
>> 		     cpuid_entry_has(best, X86_FEATURE_XSAVEC)))
>> -		best->ebx = xstate_required_size(vcpu->arch.xcr0, true);
>> +		best->ebx = xstate_required_size(vcpu->arch.xcr0 |
>> +						 vcpu->arch.ia32_xss, true);
>>
>> 	best = __kvm_find_kvm_cpuid_features(vcpu, entries, nent);
>> 	if (kvm_hlt_in_guest(vcpu->kvm) && best &&
>> @@ -312,6 +313,17 @@ static u64 vcpu_get_supported_xcr0(struct kvm_vcpu *vcpu)
>> 	return (best->eax | ((u64)best->edx << 32)) & kvm_caps.supported_xcr0;
>> }
>>
>> +static u64 vcpu_get_supported_xss(struct kvm_vcpu *vcpu)
>> +{
>> +	struct kvm_cpuid_entry2 *best;
>> +
>> +	best = kvm_find_cpuid_entry_index(vcpu, 0xd, 1);
>> +	if (!best)
>> +		return 0;
>> +
>> +	return (best->ecx | ((u64)best->edx << 32)) & kvm_caps.supported_xss;
>> +}
>> +
>> static bool kvm_cpuid_has_hyperv(struct kvm_cpuid_entry2 *entries, int nent)
>> {
>> 	struct kvm_cpuid_entry2 *entry;
>> @@ -358,6 +370,7 @@ static void kvm_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
>> 	}
>>
>> 	vcpu->arch.guest_supported_xcr0 = vcpu_get_supported_xcr0(vcpu);
>> +	vcpu->arch.guest_supported_xss = vcpu_get_supported_xss(vcpu);
>>
>> 	/*
>> 	 * FP+SSE can always be saved/restored via KVM_{G,S}ET_XSAVE, even if
>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>> index 1258d1d6dd52..9a616d84bd39 100644
>> --- a/arch/x86/kvm/x86.c
>> +++ b/arch/x86/kvm/x86.c
>> @@ -3795,20 +3795,25 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
>> 			vcpu->arch.ia32_tsc_adjust_msr += adj;
>> 		}
>> 		break;
>> -	case MSR_IA32_XSS:
>> -		if (!msr_info->host_initiated &&
>> -		    !guest_cpuid_has(vcpu, X86_FEATURE_XSAVES))
>> +	case MSR_IA32_XSS: {
>> +		bool host_msr_reset = msr_info->host_initiated && data == 0;
>> +
>> +		if (!guest_cpuid_has(vcpu, X86_FEATURE_XSAVES) &&
>> +		    (!host_msr_reset || !msr_info->host_initiated))
> !msr_info->host_initiated can be dropped here.

Yes, it's not necessary, will remove it, thanks!



* Re: [PATCH v6 16/25] KVM: x86: Report KVM supported CET MSRs as to-be-saved
  2023-10-08  6:19   ` Chao Gao
@ 2023-10-10  0:54     ` Yang, Weijiang
  0 siblings, 0 replies; 120+ messages in thread
From: Yang, Weijiang @ 2023-10-10  0:54 UTC (permalink / raw)
  To: Chao Gao
  Cc: seanjc, pbonzini, kvm, linux-kernel, dave.hansen, peterz,
	rick.p.edgecombe, john.allen

On 10/8/2023 2:19 PM, Chao Gao wrote:
> On Thu, Sep 14, 2023 at 02:33:16AM -0400, Yang Weijiang wrote:
>> Add CET MSRs to the list of MSRs reported to userspace if the feature,
>> i.e. IBT or SHSTK, associated with the MSRs is supported by KVM.
>>
>> SSP can only be read via RDSSP. Writing even requires destructive and
>> potentially faulting operations such as SAVEPREVSSP/RSTORSSP or
>> SETSSBSY/CLRSSBSY. Let the host use a pseudo-MSR that is just a wrapper
>> for the GUEST_SSP field of the VMCS.
>>
>> Suggested-by: Chao Gao <chao.gao@intel.com>
>> Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>
>> ---
>> arch/x86/include/uapi/asm/kvm_para.h |  1 +
>> arch/x86/kvm/vmx/vmx.c               |  2 ++
>> arch/x86/kvm/x86.c                   | 18 ++++++++++++++++++
>> 3 files changed, 21 insertions(+)
>>
>> diff --git a/arch/x86/include/uapi/asm/kvm_para.h b/arch/x86/include/uapi/asm/kvm_para.h
>> index 6e64b27b2c1e..9864bbcf2470 100644
>> --- a/arch/x86/include/uapi/asm/kvm_para.h
>> +++ b/arch/x86/include/uapi/asm/kvm_para.h
>> @@ -58,6 +58,7 @@
>> #define MSR_KVM_ASYNC_PF_INT	0x4b564d06
>> #define MSR_KVM_ASYNC_PF_ACK	0x4b564d07
>> #define MSR_KVM_MIGRATION_CONTROL	0x4b564d08
>> +#define MSR_KVM_SSP	0x4b564d09
>>
>> struct kvm_steal_time {
>> 	__u64 steal;
>> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
>> index 72e3943f3693..9409753f45b0 100644
>> --- a/arch/x86/kvm/vmx/vmx.c
>> +++ b/arch/x86/kvm/vmx/vmx.c
>> @@ -7009,6 +7009,8 @@ static bool vmx_has_emulated_msr(struct kvm *kvm, u32 index)
>> 	case MSR_AMD64_TSC_RATIO:
>> 		/* This is AMD only.  */
>> 		return false;
>> +	case MSR_KVM_SSP:
>> +		return kvm_cpu_cap_has(X86_FEATURE_SHSTK);
> For other MSRs in emulated_msrs_all[], KVM doesn't check the associated
> CPUID feature bits. Why bother doing this for MSR_KVM_SSP?

As you can see, MSR_KVM_SSP is not a purely emulated MSR; it is linked to a VMCS field (GUEST_SSP).
IMO the check is necessary; in other words, there is no need to expose it when SHSTK is not supported
by KVM.



* Re: [PATCH v6 02/25] x86/fpu/xstate: Fix guest fpstate allocation size calculation
  2023-09-14  6:33 ` [PATCH v6 02/25] x86/fpu/xstate: Fix guest fpstate allocation size calculation Yang Weijiang
  2023-09-14 22:45   ` Edgecombe, Rick P
@ 2023-10-21  0:39   ` Sean Christopherson
  2023-10-24  8:50     ` Yang, Weijiang
  1 sibling, 1 reply; 120+ messages in thread
From: Sean Christopherson @ 2023-10-21  0:39 UTC (permalink / raw)
  To: Yang Weijiang
  Cc: pbonzini, kvm, linux-kernel, dave.hansen, peterz, chao.gao,
	rick.p.edgecombe, john.allen

On Thu, Sep 14, 2023, Yang Weijiang wrote:
> Fix guest xsave area allocation size from fpu_user_cfg.default_size to
> fpu_kernel_cfg.default_size so that the xsave area size is consistent
> with fpstate->size set in __fpstate_reset().
> 
> With the fix, guest fpstate size is sufficient for KVM supported guest
> xfeatures.
> 
> Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>
> ---
>  arch/x86/kernel/fpu/core.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
> index a86d37052a64..a42d8ad26ce6 100644
> --- a/arch/x86/kernel/fpu/core.c
> +++ b/arch/x86/kernel/fpu/core.c
> @@ -220,7 +220,9 @@ bool fpu_alloc_guest_fpstate(struct fpu_guest *gfpu)
>  	struct fpstate *fpstate;
>  	unsigned int size;
>  
> -	size = fpu_user_cfg.default_size + ALIGN(offsetof(struct fpstate, regs), 64);
> +	size = fpu_kernel_cfg.default_size +
> +	       ALIGN(offsetof(struct fpstate, regs), 64);

Shouldn't all the other calculations in this function also switch to fpu_kernel_cfg?
At the very least, this looks wrong when paired with the above:

	gfpu->uabi_size		= sizeof(struct kvm_xsave);
	if (WARN_ON_ONCE(fpu_user_cfg.default_size > gfpu->uabi_size))
		gfpu->uabi_size = fpu_user_cfg.default_size;


* Re: [PATCH v6 02/25] x86/fpu/xstate: Fix guest fpstate allocation size calculation
  2023-10-21  0:39   ` Sean Christopherson
@ 2023-10-24  8:50     ` Yang, Weijiang
  2023-10-24 16:32       ` Sean Christopherson
  0 siblings, 1 reply; 120+ messages in thread
From: Yang, Weijiang @ 2023-10-24  8:50 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: pbonzini, kvm, linux-kernel, dave.hansen, peterz, chao.gao,
	rick.p.edgecombe, john.allen

On 10/21/2023 8:39 AM, Sean Christopherson wrote:
> On Thu, Sep 14, 2023, Yang Weijiang wrote:
>> Fix guest xsave area allocation size from fpu_user_cfg.default_size to
>> fpu_kernel_cfg.default_size so that the xsave area size is consistent
>> with fpstate->size set in __fpstate_reset().
>>
>> With the fix, guest fpstate size is sufficient for KVM supported guest
>> xfeatures.
>>
>> Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>
>> ---
>>   arch/x86/kernel/fpu/core.c | 4 +++-
>>   1 file changed, 3 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
>> index a86d37052a64..a42d8ad26ce6 100644
>> --- a/arch/x86/kernel/fpu/core.c
>> +++ b/arch/x86/kernel/fpu/core.c
>> @@ -220,7 +220,9 @@ bool fpu_alloc_guest_fpstate(struct fpu_guest *gfpu)
>>   	struct fpstate *fpstate;
>>   	unsigned int size;
>>   
>> -	size = fpu_user_cfg.default_size + ALIGN(offsetof(struct fpstate, regs), 64);
>> +	size = fpu_kernel_cfg.default_size +
>> +	       ALIGN(offsetof(struct fpstate, regs), 64);
> Shouldn't all the other calculations in this function also switch to fpu_kernel_cfg?
> At the very least, this looks wrong when paired with the above:
>
> 	gfpu->uabi_size		= sizeof(struct kvm_xsave);
> 	if (WARN_ON_ONCE(fpu_user_cfg.default_size > gfpu->uabi_size))
> 		gfpu->uabi_size = fpu_user_cfg.default_size;

Hi, Sean,
Not sure what your concerns are.
From my understanding, fpu_kernel_cfg.default_size should include all xfeatures enabled on the host (XCR0 | XSS),
which is also what is needed to support all guest-enabled xfeatures. gfpu->uabi_size only includes enabled user
xfeatures, which are operated on via the KVM uABIs (KVM_GET_XSAVE/KVM_SET_XSAVE/KVM_GET_XSAVE2), so the two
sizes are relatively independent, since guest supervisor xfeatures are saved/restored via the GET/SET_MSRS interfaces.




* Re: [PATCH v6 02/25] x86/fpu/xstate: Fix guest fpstate allocation size calculation
  2023-10-24  8:50     ` Yang, Weijiang
@ 2023-10-24 16:32       ` Sean Christopherson
  2023-10-25 13:49         ` Yang, Weijiang
  2023-10-31 17:43         ` Maxim Levitsky
  0 siblings, 2 replies; 120+ messages in thread
From: Sean Christopherson @ 2023-10-24 16:32 UTC (permalink / raw)
  To: Weijiang Yang
  Cc: pbonzini, kvm, linux-kernel, dave.hansen, peterz, chao.gao,
	rick.p.edgecombe, john.allen

On Tue, Oct 24, 2023, Weijiang Yang wrote:
> On 10/21/2023 8:39 AM, Sean Christopherson wrote:
> > On Thu, Sep 14, 2023, Yang Weijiang wrote:
> > > Fix guest xsave area allocation size from fpu_user_cfg.default_size to
> > > fpu_kernel_cfg.default_size so that the xsave area size is consistent
> > > with fpstate->size set in __fpstate_reset().
> > > 
> > > With the fix, guest fpstate size is sufficient for KVM supported guest
> > > xfeatures.
> > > 
> > > Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>
> > > ---
> > >   arch/x86/kernel/fpu/core.c | 4 +++-
> > >   1 file changed, 3 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
> > > index a86d37052a64..a42d8ad26ce6 100644
> > > --- a/arch/x86/kernel/fpu/core.c
> > > +++ b/arch/x86/kernel/fpu/core.c
> > > @@ -220,7 +220,9 @@ bool fpu_alloc_guest_fpstate(struct fpu_guest *gfpu)
> > >   	struct fpstate *fpstate;
> > >   	unsigned int size;
> > > -	size = fpu_user_cfg.default_size + ALIGN(offsetof(struct fpstate, regs), 64);
> > > +	size = fpu_kernel_cfg.default_size +
> > > +	       ALIGN(offsetof(struct fpstate, regs), 64);
> > Shouldn't all the other calculations in this function also switch to fpu_kernel_cfg?
> > At the very least, this looks wrong when paired with the above:
> > 
> > 	gfpu->uabi_size		= sizeof(struct kvm_xsave);
> > 	if (WARN_ON_ONCE(fpu_user_cfg.default_size > gfpu->uabi_size))
> > 		gfpu->uabi_size = fpu_user_cfg.default_size;
> 
> Hi, Sean,
> Not sure what's your concerns.
> From my understanding fpu_kernel_cfg.default_size should include all enabled
> xfeatures in host (XCR0 | XSS), this is also expected for supporting all
> guest enabled xfeatures. gfpu->uabi_size only includes enabled user xfeatures
> which are operated via KVM uABIs(KVM_GET_XSAVE/KVM_SET_XSAVE/KVM_GET_XSAVE2),
> so the two sizes are relatively independent since guest supervisor xfeatures
> are saved/restored via GET/SET_MSRS interfaces.

Ah, right, I keep forgetting that KVM's ABI can't use XRSTOR because it forces
the compacted format.

This part still looks odd to me:

	gfpu->xfeatures		= fpu_user_cfg.default_features;
	gfpu->perm		= fpu_user_cfg.default_features;

but I'm probably just not understanding something in the other patches yet.


* Re: [PATCH v6 06/25] x86/fpu/xstate: Opt-in kernel dynamic bits when calculate guest xstate size
  2023-09-15  2:22     ` Yang, Weijiang
@ 2023-10-24 17:07       ` Sean Christopherson
  2023-10-25 14:49         ` Yang, Weijiang
  0 siblings, 1 reply; 120+ messages in thread
From: Sean Christopherson @ 2023-10-24 17:07 UTC (permalink / raw)
  To: Weijiang Yang
  Cc: Dave Hansen, pbonzini, kvm, linux-kernel, peterz, chao.gao,
	rick.p.edgecombe, john.allen

On Fri, Sep 15, 2023, Weijiang Yang wrote:
> On 9/15/2023 1:40 AM, Dave Hansen wrote:
> > On 9/13/23 23:33, Yang Weijiang wrote:
> > > --- a/arch/x86/kernel/fpu/xstate.c
> > > +++ b/arch/x86/kernel/fpu/xstate.c
> > > @@ -1636,9 +1636,17 @@ static int __xstate_request_perm(u64 permitted, u64 requested, bool guest)
> > >   	/* Calculate the resulting kernel state size */
> > >   	mask = permitted | requested;
> > > -	/* Take supervisor states into account on the host */
> > > +	/*
> > > +	 * Take supervisor states into account on the host. And add
> > > +	 * kernel dynamic xfeatures to guest since guest kernel may
> > > +	 * enable corresponding CPU feaures and the xstate registers
> > > +	 * need to be saved/restored properly.
> > > +	 */
> > >   	if (!guest)
> > >   		mask |= xfeatures_mask_supervisor();
> > > +	else
> > > +		mask |= fpu_kernel_dynamic_xfeatures;

This looks wrong.  Per commit 781c64bfcb73 ("x86/fpu/xstate: Handle supervisor
states in XSTATE permissions"), mask at this point only contains user features,
which somewhat unintuitively doesn't include CET_USER (I get that they're MSRs
and thus supervisor state, it's just the name that's odd).

IIUC, the "dynamic" features contain CET_KERNEL, whereas xfeatures_mask_supervisor()
contains PASID, CET_USER, and CET_KERNEL.  PASID isn't virtualized by KVM, but
doesn't that mean CET_USER will get dropped/lost if userspace requests AMX/XTILE
enabling?

The existing code also seems odd, but I might be missing something.  Won't the
kernel drop PASID if the guest requests AMX/XTILE?  I'm not at all familiar with
what PASID state is managed via XSAVE, so I've no idea if that's an actual problem
or just an oddity.

> > >   	ksize = xstate_calculate_size(mask, compacted);
> > Heh, you changed the "guest" naming in "fpu_kernel_dynamic_xfeatures"
> > but didn't change the logic.
> > 
> > As it's coded at the moment *ALL* "fpu_kernel_dynamic_xfeatures" are
> > guest xfeatures.  So, they're different in name only.

...

> > Would there ever be any reason for KVM to be on a system which supports a
> > dynamic kernel feature but where it doesn't get enabled for guest use, or
> > at least shouldn't have the FPU space allocated?
> 
> I haven't heard of that kind of usage for other features so far, CET
> supervisor xstate is the only dynamic kernel feature now,  not sure whether
> other CPU features having supervisor xstate would share the handling logic
> like CET does one day.

There are definitely scenarios where CET will not be exposed to KVM guests, but
I don't see any reason to make the guest FPU space dynamically sized for CET.
It's what, 40 bytes?

I would much prefer to avoid the whole "dynamic" thing and instead make CET
explicitly guest-only.  E.g. fpu_kernel_guest_only_xfeatures?  Or even better
if it doesn't cause weirdness elsewhere, a dedicated fpu_guest_cfg.  For me at
least, a fpu_guest_cfg would make it easier to understand what all is going on. 


* Re: [PATCH v6 02/25] x86/fpu/xstate: Fix guest fpstate allocation size calculation
  2023-10-24 16:32       ` Sean Christopherson
@ 2023-10-25 13:49         ` Yang, Weijiang
  2023-10-31 17:43         ` Maxim Levitsky
  1 sibling, 0 replies; 120+ messages in thread
From: Yang, Weijiang @ 2023-10-25 13:49 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: pbonzini, kvm, linux-kernel, dave.hansen, peterz, chao.gao,
	rick.p.edgecombe, john.allen

On 10/25/2023 12:32 AM, Sean Christopherson wrote:
> On Tue, Oct 24, 2023, Weijiang Yang wrote:
>> On 10/21/2023 8:39 AM, Sean Christopherson wrote:
>>> On Thu, Sep 14, 2023, Yang Weijiang wrote:
>>>> Fix guest xsave area allocation size from fpu_user_cfg.default_size to
>>>> fpu_kernel_cfg.default_size so that the xsave area size is consistent
>>>> with fpstate->size set in __fpstate_reset().
>>>>
>>>> With the fix, guest fpstate size is sufficient for KVM supported guest
>>>> xfeatures.
>>>>
>>>> Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>
>>>> ---
>>>>    arch/x86/kernel/fpu/core.c | 4 +++-
>>>>    1 file changed, 3 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
>>>> index a86d37052a64..a42d8ad26ce6 100644
>>>> --- a/arch/x86/kernel/fpu/core.c
>>>> +++ b/arch/x86/kernel/fpu/core.c
>>>> @@ -220,7 +220,9 @@ bool fpu_alloc_guest_fpstate(struct fpu_guest *gfpu)
>>>>    	struct fpstate *fpstate;
>>>>    	unsigned int size;
>>>> -	size = fpu_user_cfg.default_size + ALIGN(offsetof(struct fpstate, regs), 64);
>>>> +	size = fpu_kernel_cfg.default_size +
>>>> +	       ALIGN(offsetof(struct fpstate, regs), 64);
>>> Shouldn't all the other calculations in this function also switch to fpu_kernel_cfg?
>>> At the very least, this looks wrong when paired with the above:
>>>
>>> 	gfpu->uabi_size		= sizeof(struct kvm_xsave);
>>> 	if (WARN_ON_ONCE(fpu_user_cfg.default_size > gfpu->uabi_size))
>>> 		gfpu->uabi_size = fpu_user_cfg.default_size;
>> Hi, Sean,
>> Not sure what's your concerns.
>>  From my understanding fpu_kernel_cfg.default_size should include all enabled
>> xfeatures in host (XCR0 | XSS), this is also expected for supporting all
>> guest enabled xfeatures. gfpu->uabi_size only includes enabled user xfeatures
>> which are operated via KVM uABIs(KVM_GET_XSAVE/KVM_SET_XSAVE/KVM_GET_XSAVE2),
>> so the two sizes are relatively independent since guest supervisor xfeatures
>> are saved/restored via GET/SET_MSRS interfaces.
> Ah, right, I keep forgetting that KVM's ABI can't use XRSTOR because it forces
> the compacted format.
>
> This part still looks odd to me:
>
> 	gfpu->xfeatures		= fpu_user_cfg.default_features;
> 	gfpu->perm		= fpu_user_cfg.default_features;

I guess when the kernel FPU code was overhauled, the supervisor xstates were not taken into
account for guest-supported xfeatures, so the first line looks reasonable until supervisor xfeatures
land. And for the second line, per the current design, user mode can only control user
xfeatures via the arch_prctl() kernel uAPI, so it also makes sense to initialize perm with
fpu_user_cfg.default_features too.

But in this CET KVM series I'd like to expand the former to support all guest-enabled xfeatures, i.e.,
both user and supervisor xfeatures, and keep the latter as-is since there seems to be no reason
to allow userspace to alter supervisor xfeatures.

> but I'm probably just not understanding something in the other patches changes yet.



* Re: [PATCH v6 06/25] x86/fpu/xstate: Opt-in kernel dynamic bits when calculate guest xstate size
  2023-10-24 17:07       ` Sean Christopherson
@ 2023-10-25 14:49         ` Yang, Weijiang
  2023-10-26 17:24           ` Sean Christopherson
  0 siblings, 1 reply; 120+ messages in thread
From: Yang, Weijiang @ 2023-10-25 14:49 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Dave Hansen, pbonzini, kvm, linux-kernel, peterz, chao.gao,
	rick.p.edgecombe, john.allen

On 10/25/2023 1:07 AM, Sean Christopherson wrote:
> On Fri, Sep 15, 2023, Weijiang Yang wrote:
>> On 9/15/2023 1:40 AM, Dave Hansen wrote:
>>> On 9/13/23 23:33, Yang Weijiang wrote:
>>>> --- a/arch/x86/kernel/fpu/xstate.c
>>>> +++ b/arch/x86/kernel/fpu/xstate.c
>>>> @@ -1636,9 +1636,17 @@ static int __xstate_request_perm(u64 permitted, u64 requested, bool guest)
>>>>    	/* Calculate the resulting kernel state size */
>>>>    	mask = permitted | requested;
>>>> -	/* Take supervisor states into account on the host */
>>>> +	/*
>>>> +	 * Take supervisor states into account on the host. And add
>>>> +	 * kernel dynamic xfeatures to guest since guest kernel may
>>>> +	 * enable corresponding CPU feaures and the xstate registers
>>>> +	 * need to be saved/restored properly.
>>>> +	 */
>>>>    	if (!guest)
>>>>    		mask |= xfeatures_mask_supervisor();
>>>> +	else
>>>> +		mask |= fpu_kernel_dynamic_xfeatures;
> This looks wrong.  Per commit 781c64bfcb73 ("x86/fpu/xstate: Handle supervisor
> states in XSTATE permissions"), mask at this point only contains user features,
> which somewhat unintuitively doesn't include CET_USER (I get that they're MSRs
> and thus supervisor state, it's just the name that's odd).

I think the user-only boundary becomes unclear when fpstate_reset() introduces the line below:
fpu->perm.__state_perm          = fpu_kernel_cfg.default_features;

Then in xstate_request_perm(), it re-uses above reset value for __xstate_request_perm(),
so in the latter, the mask is already mixed with supervisor xfeatures.

> IIUC, the "dynamic" features contains CET_KERNEL, whereas xfeatures_mask_supervisor()
> conatins PASID, CET_USER, and CET_KERNEL.  PASID isn't virtualized by KVM, but
> doesn't that mean CET_USER will get dropped/lost if userspace requests AMX/XTILE
> enabling?

Yes, __state_size is correct for guest-enabled xfeatures, including CET_USER, but CET_USER gets
removed from __state_perm.

IIUC, from the current qemu/kernel interaction for guest permission settings, __xstate_request_perm()
is called only _ONCE_ to set AMX/XTILE for every vCPU thread, so the removal of guest supervisor
xfeatures won't hurt the guest! ;-/

> The existing code also seems odd, but I might be missing something.  Won't the
> kernel drop PASID if the guest request AMX/XTILE?

Yeah, dropped after the first invocation.

> I'm not at all familiar with
> what PASID state is managed via XSAVE, so I've no idea if that's an actual problem
> or just an oddity.
>
>>>>    	ksize = xstate_calculate_size(mask, compacted);
>>> Heh, you changed the "guest" naming in "fpu_kernel_dynamic_xfeatures"
>>> but didn't change the logic.
>>>
>>> As it's coded at the moment *ALL* "fpu_kernel_dynamic_xfeatures" are
>>> guest xfeatures.  So, they're different in name only.
> ...
>
>>> Would there ever be any reason for KVM to be on a system which supports a
>>> dynamic kernel feature but where it doesn't get enabled for guest use, or
>>> at least shouldn't have the FPU space allocated?
>> I haven't heard of that kind of usage for other features so far, CET
>> supervisor xstate is the only dynamic kernel feature now,  not sure whether
>> other CPU features having supervisor xstate would share the handling logic
>> like CET does one day.
> There are definitely scenarios where CET will not be exposed to KVM guests, but
> I don't see any reason to make the guest FPU space dynamically sized for CET.
> It's what, 40 bytes?

Could it also be about xsave/xrstor operation efficiency for non-guest threads?

> I would much prefer to avoid the whole "dynamic" thing and instead make CET
> explicitly guest-only.  E.g. fpu_kernel_guest_only_xfeatures?  Or even better
> if it doesn't cause weirdness elsewhere, a dedicated fpu_guest_cfg.  For me at
> least, a fpu_guest_cfg would make it easier to understand what all is going on.

Agree, I guess non-generic designs are not very welcome in the kernel...



* Re: [PATCH v6 06/25] x86/fpu/xstate: Opt-in kernel dynamic bits when calculate guest xstate size
  2023-10-25 14:49         ` Yang, Weijiang
@ 2023-10-26 17:24           ` Sean Christopherson
  2023-10-26 22:06             ` Edgecombe, Rick P
  2023-10-31 17:45             ` Maxim Levitsky
  0 siblings, 2 replies; 120+ messages in thread
From: Sean Christopherson @ 2023-10-26 17:24 UTC (permalink / raw)
  To: Weijiang Yang
  Cc: Dave Hansen, pbonzini, kvm, linux-kernel, peterz, chao.gao,
	rick.p.edgecombe, john.allen

On Wed, Oct 25, 2023, Weijiang Yang wrote:
> On 10/25/2023 1:07 AM, Sean Christopherson wrote:
> > On Fri, Sep 15, 2023, Weijiang Yang wrote:
> > IIUC, the "dynamic" features contains CET_KERNEL, whereas xfeatures_mask_supervisor()
> > conatins PASID, CET_USER, and CET_KERNEL.  PASID isn't virtualized by KVM, but
> > doesn't that mean CET_USER will get dropped/lost if userspace requests AMX/XTILE
> > enabling?
> 
> Yes, __state_size is correct for guest enabled xfeatures, including CET_USER,
> and it gets removed from __state_perm.
> 
> IIUC, from current qemu/kernel interaction for guest permission settings,
> __xstate_request_perm() is called only _ONCE_ to set AMX/XTILE for every vCPU
> thread, so the removal of guest supervisor xfeatures won't hurt guest! ;-/

Huh?  I don't follow.  What does calling __xstate_request_perm() only once have
to do with anything?

/me stares more

OMG, hell no.  First off, this code is a nightmare to follow.  The existing comment
is useless.  No shit the code is adding in supervisor states for the host.  What's
not AT ALL clear is *why*.

The commit says it's necessary because the "permission bitmap is only relevant
for user states":

  commit 781c64bfcb735960717d1cb45428047ff6a5030c
  Author: Thomas Gleixner <tglx@linutronix.de>
  Date:   Thu Mar 24 14:47:14 2022 +0100

    x86/fpu/xstate: Handle supervisor states in XSTATE permissions
    
    The size calculation in __xstate_request_perm() fails to take supervisor
    states into account because the permission bitmap is only relevant for user
    states.

But @permitted comes from:

  permitted = xstate_get_group_perm(guest);

which is either fpu->guest_perm.__state_perm or fpu->perm.__state_perm.  And
__state_perm is initialized to:

	fpu->perm.__state_perm		= fpu_kernel_cfg.default_features;

where fpu_kernel_cfg.default_features contains everything except the dynamic
xfeatures, i.e. everything except XFEATURE_MASK_XTILE_DATA:

	fpu_kernel_cfg.default_features = fpu_kernel_cfg.max_features;
	fpu_kernel_cfg.default_features &= ~XFEATURE_MASK_USER_DYNAMIC;

So why on earth does this code force back xfeatures_mask_supervisor()?  Because
the code just below drops the supervisor bits to compute the user xstate size and
then clobbers __state_perm.

	/* Calculate the resulting user state size */
	mask &= XFEATURE_MASK_USER_SUPPORTED;
	usize = xstate_calculate_size(mask, false);

	...

	WRITE_ONCE(perm->__state_perm, mask);

That is beyond asinine.  IIUC, the intent is to apply the permission bitmap only
for user states, because the only dynamic states are user states.  But the above
creates an inconsistent mess.  If userspace doesn't request XTILE_DATA,
__state_perm will contain supervisor states, but once userspace does request
XTILE_DATA, __state_perm will be lost.

And because that's not confusing enough, clobbering __state_perm would also drop
FPU_GUEST_PERM_LOCKED, except that __xstate_request_perm() can' be reached with
said LOCKED flag set.

fpu_xstate_prctl() already strips out supervisor features:

	case ARCH_GET_XCOMP_PERM:
		/*
		 * Lockless snapshot as it can also change right after the
		 * dropping the lock.
		 */
		permitted = xstate_get_host_group_perm();
		permitted &= XFEATURE_MASK_USER_SUPPORTED;
		return put_user(permitted, uptr);

	case ARCH_GET_XCOMP_GUEST_PERM:
		permitted = xstate_get_guest_group_perm();
		permitted &= XFEATURE_MASK_USER_SUPPORTED;
		return put_user(permitted, uptr);

and while KVM doesn't apply the __state_perm to supervisor states, if it did
there would be zero harm in doing so.

	case 0xd: {
		u64 permitted_xcr0 = kvm_get_filtered_xcr0();
		u64 permitted_xss = kvm_caps.supported_xss;

Second, relying on QEMU to only trigger __xstate_request_perm() once is not acceptable.
It "works" for the current code, but only because there's only a single dynamic
feature, i.e. this will short circuit and prevent computing a bad ksize.

	/* Check whether fully enabled */
	if ((permitted & requested) == requested)
		return 0;

I don't know how I can possibly make it any clearer: KVM absolutely must not assume
userspace behavior.

So rather than continue with the current madness, which will break if/when the
next dynamic feature comes along, just preserve non-user xfeatures/flags in
__guest_perm.
 
If there are no objections, I'll test the below and write a proper changelog.
 
--
From: Sean Christopherson <seanjc@google.com>
Date: Thu, 26 Oct 2023 10:17:33 -0700
Subject: [PATCH] x86/fpu/xstate: Always preserve non-user xfeatures/flags in
 __state_perm

Fixes: 781c64bfcb73 ("x86/fpu/xstate: Handle supervisor states in XSTATE permissions")
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kernel/fpu/xstate.c | 18 +++++++++++-------
 1 file changed, 11 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index ef6906107c54..73f6bc00d178 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -1601,16 +1601,20 @@ static int __xstate_request_perm(u64 permitted, u64 requested, bool guest)
 	if ((permitted & requested) == requested)
 		return 0;
 
-	/* Calculate the resulting kernel state size */
+	/*
+	 * Calculate the resulting kernel state size.  Note, @permitted also
+	 * contains supervisor xfeatures even though supervisor are always
+	 * permitted for kernel and guest FPUs, and never permitted for user
+	 * FPUs.
+	 */
 	mask = permitted | requested;
-	/* Take supervisor states into account on the host */
-	if (!guest)
-		mask |= xfeatures_mask_supervisor();
 	ksize = xstate_calculate_size(mask, compacted);
 
-	/* Calculate the resulting user state size */
-	mask &= XFEATURE_MASK_USER_SUPPORTED;
-	usize = xstate_calculate_size(mask, false);
+	/*
+	 * Calculate the resulting user state size.  Take care not to clobber
+	 * the supervisor xfeatures in the new mask!
+	 */
+	usize = xstate_calculate_size(mask & XFEATURE_MASK_USER_SUPPORTED, false);
 
 	if (!guest) {
 		ret = validate_sigaltstack(usize);

base-commit: c076acf10c78c0d7e1aa50670e9cc4c91e8d59b4
-- 

^ permalink raw reply related	[flat|nested] 120+ messages in thread

* Re: [PATCH v6 06/25] x86/fpu/xstate: Opt-in kernel dynamic bits when calculate guest xstate size
  2023-10-26 17:24           ` Sean Christopherson
@ 2023-10-26 22:06             ` Edgecombe, Rick P
  2023-10-31 17:45             ` Maxim Levitsky
  1 sibling, 0 replies; 120+ messages in thread
From: Edgecombe, Rick P @ 2023-10-26 22:06 UTC (permalink / raw)
  To: Yang, Weijiang, Christopherson,, Sean
  Cc: kvm, pbonzini, Hansen, Dave, linux-kernel, Gao, Chao, john.allen, peterz

On Thu, 2023-10-26 at 10:24 -0700, Sean Christopherson wrote:
> +       /*
> +        * Calculate the resulting kernel state size.  Note,
> @permitted also
> +        * contains supervisor xfeatures even though supervisor are
> always
> +        * permitted for kernel and guest FPUs, and never permitted
> for user
> +        * FPUs.

What is a user FPU vs kernel FPU in this context? By user FPU do you
mean, like user FPU state in a sigframe or something? Or a kernel
task's FPU? If the former I think this comment could be made more
clear. Maybe just drop the bit about user FPUs. At least the comment
makes more sense to me without it.


^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v6 01/25] x86/fpu/xstate: Manually check and add XFEATURE_CET_USER xstate bit
  2023-09-14  6:33 ` [PATCH v6 01/25] x86/fpu/xstate: Manually check and add XFEATURE_CET_USER xstate bit Yang Weijiang
  2023-09-14 22:39   ` Edgecombe, Rick P
@ 2023-10-31 17:43   ` Maxim Levitsky
  2023-11-01  9:19     ` Yang, Weijiang
  1 sibling, 1 reply; 120+ messages in thread
From: Maxim Levitsky @ 2023-10-31 17:43 UTC (permalink / raw)
  To: Yang Weijiang, seanjc, pbonzini, kvm, linux-kernel
  Cc: dave.hansen, peterz, chao.gao, rick.p.edgecombe, john.allen

On Thu, 2023-09-14 at 02:33 -0400, Yang Weijiang wrote:
> Remove XFEATURE_CET_USER entry from dependency array as the entry doesn't
> reflect true dependency between CET features and the xstate bit, instead
> manually check and add the bit back if either SHSTK or IBT is supported.
> 
> Both user mode shadow stack and indirect branch tracking features depend
> on XFEATURE_CET_USER bit in XSS to automatically save/restore user mode
> xstate registers, i.e., IA32_U_CET and IA32_PL3_SSP whenever necessary.
> 
> Although in real world a platform with IBT but no SHSTK is rare, but in
> virtualization world it's common, guest SHSTK and IBT can be controlled
> independently via userspace app.
> 
> Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>
> ---
>  arch/x86/kernel/fpu/xstate.c | 9 ++++++++-
>  1 file changed, 8 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
> index cadf68737e6b..12c8cb278346 100644
> --- a/arch/x86/kernel/fpu/xstate.c
> +++ b/arch/x86/kernel/fpu/xstate.c
> @@ -73,7 +73,6 @@ static unsigned short xsave_cpuid_features[] __initdata = {
>  	[XFEATURE_PT_UNIMPLEMENTED_SO_FAR]	= X86_FEATURE_INTEL_PT,
>  	[XFEATURE_PKRU]				= X86_FEATURE_OSPKE,
>  	[XFEATURE_PASID]			= X86_FEATURE_ENQCMD,
> -	[XFEATURE_CET_USER]			= X86_FEATURE_SHSTK,
>  	[XFEATURE_XTILE_CFG]			= X86_FEATURE_AMX_TILE,
>  	[XFEATURE_XTILE_DATA]			= X86_FEATURE_AMX_TILE,
>  };
> @@ -798,6 +797,14 @@ void __init fpu__init_system_xstate(unsigned int legacy_size)
>  			fpu_kernel_cfg.max_features &= ~BIT_ULL(i);
>  	}
>  
> +	/*
> +	 * Manually add CET user mode xstate bit if either SHSTK or IBT is
> +	 * available. Both features depend on the xstate bit to save/restore
> +	 * CET user mode state.
> +	 */
> +	if (boot_cpu_has(X86_FEATURE_SHSTK) || boot_cpu_has(X86_FEATURE_IBT))
> +		fpu_kernel_cfg.max_features |= BIT_ULL(XFEATURE_CET_USER);
> +
>  	if (!cpu_feature_enabled(X86_FEATURE_XFD))
>  		fpu_kernel_cfg.max_features &= ~XFEATURE_MASK_USER_DYNAMIC;
>  


The goal of xsave_cpuid_features is to disable xfeature state bits which are enabled
in CPUID while their parent feature bit (e.g. X86_FEATURE_AVX512) is disabled in CPUID,
something that should not happen on a real CPU, but can happen if the user explicitly
disables the feature on the kernel command line and/or due to virtualization.

However, the above code does the opposite: it will enable the XFEATURE_CET_USER xsaves component
when in fact it might be disabled in CPUID (and one can argue that in theory such a
configuration is even useful, since the kernel can still context switch the CET MSRs manually).


So I think that the code should do this instead:

if (!boot_cpu_has(X86_FEATURE_SHSTK) && !boot_cpu_has(X86_FEATURE_IBT))
 	fpu_kernel_cfg.max_features &= ~BIT_ULL(XFEATURE_CET_USER);


Best regards,
	Maxim Levitsky





^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v6 02/25] x86/fpu/xstate: Fix guest fpstate allocation size calculation
  2023-10-24 16:32       ` Sean Christopherson
  2023-10-25 13:49         ` Yang, Weijiang
@ 2023-10-31 17:43         ` Maxim Levitsky
  1 sibling, 0 replies; 120+ messages in thread
From: Maxim Levitsky @ 2023-10-31 17:43 UTC (permalink / raw)
  To: Sean Christopherson, Weijiang Yang
  Cc: pbonzini, kvm, linux-kernel, dave.hansen, peterz, chao.gao,
	rick.p.edgecombe, john.allen

On Tue, 2023-10-24 at 09:32 -0700, Sean Christopherson wrote:
> On Tue, Oct 24, 2023, Weijiang Yang wrote:
> > On 10/21/2023 8:39 AM, Sean Christopherson wrote:
> > > On Thu, Sep 14, 2023, Yang Weijiang wrote:
> > > > Fix guest xsave area allocation size from fpu_user_cfg.default_size to
> > > > fpu_kernel_cfg.default_size so that the xsave area size is consistent
> > > > with fpstate->size set in __fpstate_reset().
> > > > 
> > > > With the fix, guest fpstate size is sufficient for KVM supported guest
> > > > xfeatures.
> > > > 
> > > > Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>
> > > > ---
> > > >   arch/x86/kernel/fpu/core.c | 4 +++-
> > > >   1 file changed, 3 insertions(+), 1 deletion(-)
> > > > 
> > > > diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
> > > > index a86d37052a64..a42d8ad26ce6 100644
> > > > --- a/arch/x86/kernel/fpu/core.c
> > > > +++ b/arch/x86/kernel/fpu/core.c
> > > > @@ -220,7 +220,9 @@ bool fpu_alloc_guest_fpstate(struct fpu_guest *gfpu)
> > > >   	struct fpstate *fpstate;
> > > >   	unsigned int size;
> > > > -	size = fpu_user_cfg.default_size + ALIGN(offsetof(struct fpstate, regs), 64);
> > > > +	size = fpu_kernel_cfg.default_size +
> > > > +	       ALIGN(offsetof(struct fpstate, regs), 64);
> > > Shouldn't all the other calculations in this function also switch to fpu_kernel_cfg?
> > > At the very least, this looks wrong when paired with the above:
> > > 
> > > 	gfpu->uabi_size		= sizeof(struct kvm_xsave);
> > > 	if (WARN_ON_ONCE(fpu_user_cfg.default_size > gfpu->uabi_size))
> > > 		gfpu->uabi_size = fpu_user_cfg.default_size;
> > 
> > Hi, Sean,
> > Not sure what's your concerns.
> > From my understanding fpu_kernel_cfg.default_size should include all enabled
> > xfeatures in host (XCR0 | XSS), this is also expected for supporting all
> > guest enabled xfeatures. gfpu->uabi_size only includes enabled user xfeatures
> > which are operated via KVM uABIs(KVM_GET_XSAVE/KVM_SET_XSAVE/KVM_GET_XSAVE2),
> > so the two sizes are relatively independent since guest supervisor xfeatures
> > are saved/restored via GET/SET_MSRS interfaces.
> 
> Ah, right, I keep forgetting that KVM's ABI can't use XRSTOR because it forces
> the compacted format.
> 
> This part still looks odd to me:
> 
> 	gfpu->xfeatures		= fpu_user_cfg.default_features;

That should indeed be fpu_kernel_cfg.default_features.
This variable is also currently hardly used; it only tracks which dynamic userspace features
are enabled, and KVM only uses it once (in fpu_enable_guest_xfd_features).



> 	gfpu->perm		= fpu_user_cfg.default_features;


This variable I think is currently only set and never read.

Note that current->group_leader->thread.fpu.guest_perm is actually initialized to fpu_kernel_cfg.default_features,
but the kernel components of it are masked out in the corresponding prctls
(ARCH_GET_XCOMP_SUPP/ARCH_GET_XCOMP_GUEST_PERM/ARCH_REQ_XCOMP_GUEST_PERM).

So I think that we should also use fpu_kernel_cfg.default_features here, for the sake of not having an uninitialized
variable on 32-bit kernels, because as far as I can see the whole FPU permission mechanism is implemented for 64-bit kernels only.

Or even better IMHO is to remove both variables and in fpu_enable_guest_xfd_features,
just mask the xfeatures with the XFEATURE_MASK_USER_DYNAMIC instead.


Best regards,
	Maxim Levitsky

> 
> but I'm probably just not understanding something in the other patches changes yet.
> 








^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v6 03/25] x86/fpu/xstate: Add CET supervisor mode state support
  2023-09-15  6:30     ` Yang, Weijiang
@ 2023-10-31 17:44       ` Maxim Levitsky
  0 siblings, 0 replies; 120+ messages in thread
From: Maxim Levitsky @ 2023-10-31 17:44 UTC (permalink / raw)
  To: Yang, Weijiang, Edgecombe, Rick P, kvm, pbonzini, Christopherson,,
	Sean, linux-kernel
  Cc: peterz, Hansen, Dave, Gao, Chao, john.allen

On Fri, 2023-09-15 at 14:30 +0800, Yang, Weijiang wrote:
> On 9/15/2023 8:06 AM, Edgecombe, Rick P wrote:
> > On Thu, 2023-09-14 at 02:33 -0400, Yang Weijiang wrote:
> > > Add supervisor mode state support within FPU xstate management
> > > framework.
> > > Although supervisor shadow stack is not enabled/used today in
> > > kernel,KVM
> >           ^ Nit: needs a space
> > > requires the support because when KVM advertises shadow stack feature
> > > to
> > > guest, architechturally it claims the support for both user and
> >           ^ Spelling: "architecturally"
> 
> Thank you!!
> 
> > > supervisor
> > > modes for Linux and non-Linux guest OSes.
> > > 
> > > With the xstate support, guest supervisor mode shadow stack state can
> > > be
> > > properly saved/restored when 1) guest/host FPU context is swapped
> > > 2) vCPU
> > > thread is sched out/in.
> > (2) is a little bit confusing, because the lazy FPU stuff won't always
> > save/restore while scheduling.
> 
> It's true for a normal thread, but for a vCPU thread it's a bit different: on the path to
> VM-entry, after the host/guest FPU states are swapped, preemption is not disabled and the
> vCPU thread could be sched out/in. In this case, guest FPU states will be saved/
> restored because TIF_NEED_FPU_LOAD is always cleared after the swap.
> 
> > But trying to explain the details in
> > this commit log is probably unnecessary. Maybe something like?
> > 
> >     2) At the proper times while other tasks are scheduled
> 
> I just want to justify that enabling of supervisor xstate is necessary for guest.
> Maybe I need to reword a bit :-)
> 
> > I think also a key part of this is that XFEATURE_CET_KERNEL is not
> > *all* of the "guest supervisor mode shadow stack state", at least with
> > respect to the MSRs. It might be worth calling that out a little more
> > loudly.
> 
> OK, I will call it out that supervisor mode shadow stack state also includes IA32_S_CET msr.
> 
> > > The alternative is to enable it in KVM domain, but KVM maintainers
> > > NAKed
> > > the solution. The external discussion can be found at [*], it ended
> > > up
> > > with adding the support in kernel instead of KVM domain.
> > > 
> > > Note, in KVM case, guest CET supervisor state i.e.,
> > > IA32_PL{0,1,2}_MSRs,
> > > are preserved after VM-Exit until host/guest fpstates are swapped,
> > > but
> > > since host supervisor shadow stack is disabled, the preserved MSRs
> > > won't
> > > hurt host.
> > It might beg the question of if this solution will need to be redone by
> > some future Linux supervisor shadow stack effort. I *think* the answer
> > is no.
> 
> AFAICT KVM needs to be modified if host shadow stack is implemented, at least
> the guest/host CET supervisor MSRs should be swapped at the earliest time after
> VM-exit so that the host won't misbehave on *guest* MSR contents.

I agree.

> 
> > Most of the xsave managed features are restored before returning to
> > userspace because they would have userspace effect. But
> > XFEATURE_CET_KERNEL is different. It only effects the kernel. But the
> > IA32_PL{0,1,2}_MSRs are used when transitioning to those rings. So for
> > Linux they would get used when transitioning back from userspace. In
> > order for it to be used when control transfers back *from* userspace,
> > it needs to be restored before returning *to* userspace. So despite
> > being needed only for the kernel, and having no effect on userspace, it
> > might need to be swapped/restored at the same time as the rest of the
> > FPU state that only affects userspace.
> 
> You're right, for enabling supervisor mode shadow stack we need to handle it
> carefully whenever the ring/stack is switched. But we still have time to figure out
> the details.
> 
> Thanks a lot for bringing up this line of thinking!
> 
> > Probably supervisor shadow stack for Linux needs much more analysis,
> > but trying to leave some breadcrumbs on the thinking from internal
> > reviews. I don't know if it might be good to include some of this
> > reasoning in the commit log. It's a bit hand wavy.
> 
> IMO, we have based a lot on the assumption that CET supervisor shadow stack is not
> enabled in the kernel, and this patch itself is straightforward and simple; it's just a small
> brick for enabling supervisor shadow stack. We can revisit whether something is an
> issue based on how SSS is implemented in the kernel. So let's not add that kind of reasoning :-)

Overall the patch looks OK to me.

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>

Best regards,
	Maxim Levitsky

> 
> Thank you for the enlightenment!
> > > [*]:
> > > https://lore.kernel.org/all/806e26c2-8d21-9cc9-a0b7-7787dd231729@intel.com/
> > > 
> > > Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>
> > Otherwise, the code looked good to me.






^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v6 04/25] x86/fpu/xstate: Introduce kernel dynamic xfeature set
  2023-09-15  6:42     ` Yang, Weijiang
@ 2023-10-31 17:44       ` Maxim Levitsky
  0 siblings, 0 replies; 120+ messages in thread
From: Maxim Levitsky @ 2023-10-31 17:44 UTC (permalink / raw)
  To: Yang, Weijiang, Edgecombe, Rick P, kvm, pbonzini, Christopherson,,
	Sean, linux-kernel
  Cc: peterz, Hansen, Dave, Gao, Chao, john.allen

On Fri, 2023-09-15 at 14:42 +0800, Yang, Weijiang wrote:
> On 9/15/2023 8:24 AM, Edgecombe, Rick P wrote:
> > On Thu, 2023-09-14 at 02:33 -0400, Yang Weijiang wrote:
> > > +static void __init init_kernel_dynamic_xfeatures(void)
> > > +{
> > > +       unsigned short cid;
> > > +       int i;
> > > +
> > > +       for (i = 0; i < ARRAY_SIZE(xsave_kernel_dynamic_xfeatures);
> > > i++) {
> > > +               cid = xsave_kernel_dynamic_xfeatures[i];
> > > +
> > > +               if (cid && boot_cpu_has(cid))
> > > +                       fpu_kernel_dynamic_xfeatures |= BIT_ULL(i);
> > > +       }
> > > +}
> > > +
> > I think this can be part of the max_features calculation that uses
> > xsave_cpuid_features when you use use a fixed mask like Dave suggested
> > in the other patch.
> 
> Yes, max_features already includes the CET supervisor state bit. After using a
> fixed mask, this function is not needed.
> 
> 
My 0.2 cents are also on having an XFEATURE_MASK_KERNEL_DYNAMIC macro instead.

Best regards,
	Maxim Levitsky





^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v6 05/25] x86/fpu/xstate: Remove kernel dynamic xfeatures from kernel default_features
  2023-09-14 16:22   ` Dave Hansen
  2023-09-15  1:52     ` Yang, Weijiang
@ 2023-10-31 17:44     ` Maxim Levitsky
  1 sibling, 0 replies; 120+ messages in thread
From: Maxim Levitsky @ 2023-10-31 17:44 UTC (permalink / raw)
  To: Dave Hansen, Yang Weijiang, seanjc, pbonzini, kvm, linux-kernel
  Cc: peterz, chao.gao, rick.p.edgecombe, john.allen

On Thu, 2023-09-14 at 09:22 -0700, Dave Hansen wrote:
> On 9/13/23 23:33, Yang Weijiang wrote:
> > --- a/arch/x86/kernel/fpu/xstate.c
> > +++ b/arch/x86/kernel/fpu/xstate.c
> > @@ -845,6 +845,7 @@ void __init fpu__init_system_xstate(unsigned int legacy_size)
> >  	/* Clean out dynamic features from default */
> >  	fpu_kernel_cfg.default_features = fpu_kernel_cfg.max_features;
> >  	fpu_kernel_cfg.default_features &= ~XFEATURE_MASK_USER_DYNAMIC;
> > +	fpu_kernel_cfg.default_features &= ~fpu_kernel_dynamic_xfeatures;
> 
> I'd much rather that this be a closer analog to XFEATURE_MASK_USER_DYNAMIC.
> 
> Please define a XFEATURE_MASK_KERNEL_DYNAMIC value and use it here.
> Don't use a dynamically generated one.
> 

I also think so.

Best regards,
	Maxim Levitsky





^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v6 06/25] x86/fpu/xstate: Opt-in kernel dynamic bits when calculate guest xstate size
  2023-10-26 17:24           ` Sean Christopherson
  2023-10-26 22:06             ` Edgecombe, Rick P
@ 2023-10-31 17:45             ` Maxim Levitsky
  2023-11-01 14:16               ` Sean Christopherson
  1 sibling, 1 reply; 120+ messages in thread
From: Maxim Levitsky @ 2023-10-31 17:45 UTC (permalink / raw)
  To: Sean Christopherson, Weijiang Yang
  Cc: Dave Hansen, pbonzini, kvm, linux-kernel, peterz, chao.gao,
	rick.p.edgecombe, john.allen

On Thu, 2023-10-26 at 10:24 -0700, Sean Christopherson wrote:
> On Wed, Oct 25, 2023, Weijiang Yang wrote:
> > On 10/25/2023 1:07 AM, Sean Christopherson wrote:
> > > On Fri, Sep 15, 2023, Weijiang Yang wrote:
> > > IIUC, the "dynamic" features contains CET_KERNEL, whereas xfeatures_mask_supervisor()
> > > conatins PASID, CET_USER, and CET_KERNEL.  PASID isn't virtualized by KVM, but
> > > doesn't that mean CET_USER will get dropped/lost if userspace requests AMX/XTILE
> > > enabling?
> > 
> > Yes, __state_size is correct for guest enabled xfeatures, including CET_USER,
> > and it gets removed from __state_perm.
> > 
> > IIUC, from current qemu/kernel interaction for guest permission settings,
> > __xstate_request_perm() is called only _ONCE_ to set AMX/XTILE for every vCPU
> > thread, so the removal of guest supervisor xfeatures won't hurt guest! ;-/
> 
> Huh?  I don't follow.  What does calling __xstate_request_perm() only once have
> to do with anything?
> 
> /me stares more
> 
> OMG, hell no.  First off, this code is a nightmare to follow.  The existing comment
> is useless.  No shit the code is adding in supervisor states for the host.  What's
> not AT ALL clear is *why*.
> 
> The commit says it's necessary because the "permission bitmap is only relevant
> for user states":
> 
>   commit 781c64bfcb735960717d1cb45428047ff6a5030c
>   Author: Thomas Gleixner <tglx@linutronix.de>
>   Date:   Thu Mar 24 14:47:14 2022 +0100
> 
>     x86/fpu/xstate: Handle supervisor states in XSTATE permissions
>     
>     The size calculation in __xstate_request_perm() fails to take supervisor
>     states into account because the permission bitmap is only relevant for user
>     states.
> 
> But @permitted comes from:
> 
>   permitted = xstate_get_group_perm(guest);
> 
> which is either fpu->guest_perm.__state_perm or fpu->perm.__state_perm.  And
> __state_perm is initialized to:
> 
> 	fpu->perm.__state_perm		= fpu_kernel_cfg.default_features;

Not anymore after patch 5, and patch 5 does make sense, given that
we might not want to needlessly save/restore kernel CET state for regular kernel threads.

> 
> where fpu_kernel_cfg.default_features contains everything except the dynamic
> xfeatures, i.e. everything except XFEATURE_MASK_XTILE_DATA:
> 
> 	fpu_kernel_cfg.default_features = fpu_kernel_cfg.max_features;
> 	fpu_kernel_cfg.default_features &= ~XFEATURE_MASK_USER_DYNAMIC;
> 
> So why on earth does this code to force back xfeatures_mask_supervisor()?  Because
> the code just below drops the supervisor bits to compute the user xstate size and
> then clobbers __state_perm.
> 
> 	/* Calculate the resulting user state size */
> 	mask &= XFEATURE_MASK_USER_SUPPORTED;
> 	usize = xstate_calculate_size(mask, false);
> 
> 	...
> 
> 	WRITE_ONCE(perm->__state_perm, mask);
> 
> That is beyond asinine.  IIUC, the intent is to apply the permission bitmap only
> for user states, because the only dynamic states are user states.  But the above
> creates an inconsistent mess.  If userspace doesn't request XTILE_DATA,
> __state_perm will contain supervisor states, but once userspace does request
> XTILE_DATA, the supervisor states in __state_perm will be lost.
> 
> And because that's not confusing enough, clobbering __state_perm would also drop
> FPU_GUEST_PERM_LOCKED, except that __xstate_request_perm() can't be reached with
> said LOCKED flag set.
> 
> fpu_xstate_prctl() already strips out supervisor features:
> 
> 	case ARCH_GET_XCOMP_PERM:
> 		/*
> 		 * Lockless snapshot as it can also change right after the
> 		 * dropping the lock.
> 		 */
> 		permitted = xstate_get_host_group_perm();
> 		permitted &= XFEATURE_MASK_USER_SUPPORTED;
> 		return put_user(permitted, uptr);
> 
> 	case ARCH_GET_XCOMP_GUEST_PERM:
> 		permitted = xstate_get_guest_group_perm();
> 		permitted &= XFEATURE_MASK_USER_SUPPORTED;
> 		return put_user(permitted, uptr);
> 
> and while KVM doesn't apply the __state_perm to supervisor states, if it did
> there would be zero harm in doing so.
> 
> 	case 0xd: {
> 		u64 permitted_xcr0 = kvm_get_filtered_xcr0();
> 		u64 permitted_xss = kvm_caps.supported_xss;
> 
> Second, relying on QEMU to only trigger __xstate_request_perm() once is not acceptable.
> It "works" for the current code, but only because there's only a single dynamic
> feature, i.e. this will short circuit and prevent computing a bad ksize.
> 
> 	/* Check whether fully enabled */
> 	if ((permitted & requested) == requested)
> 		return 0;
> 
> I don't know how I can possibly make it any clearer: KVM absolutely must not assume
> userspace behavior.
> 
> So rather than continue with the current madness, which will break if/when the
> next dynamic feature comes along, just preserve non-user xfeatures/flags in
> __guest_perm.

I more or less agree with you, however I would like to discuss the FPU permissions
in more depth:


First of all we have two things at play here:

1. On demand resize of the thread's FPU state buffer to avoid penalty of context switching the AMX state.

2. The fact that allowing this on demand resize of this state buffer breaks the x86_64 ABI,
   because FPU state has to be saved on the signal stack and ABI allows the stack size to be smaller than what is
   needed to save the FPU state with AMX features enabled.

Thus a two tiered approach was done: first application asks for a permission to use the dynamic features,
and then when it actually uses it, the FPU state buffer is resized.

Otherwise, if the app uses an AMX instruction without having asked for permission for its xstate component,
the application is terminated.

(I might not 100% understand this correctly, please correct me if I am wrong).

However, IMHO the 'FPU permission' name is a bit misleading:
this feature is not really about security/permissions but more like an opt-in to a newer ABI,
similar to the KVM capabilities API, and the kernel will never refuse the permission request
(except if the signal stack size is too small, but userspace can adjust it
before asking for the permission).


On top of that I think that applying the same permission approach to guest's FPU state is not a good fit,
because of two reasons:

1. The guest FPU state will never be pushed on the signal stack - KVM swaps back the host FPU state
   before it returns from the KVM_RUN ioctl.

   Also I think (not sure) that ptrace can only access the (FPU) state of a stopped process, and a stopped vCPU process
   will also first have returned to userspace. Again I might be mistaken here; I never researched this in depth.

   Assuming that I am correct on these assumptions, the guest FPU state can only be accessed via 
   KVM_GET_XSAVE/KVM_SET_XSAVE/KVM_GET_XSAVE2 ioctls,
   which also returns the userspace portion of the state including optionally the AMX state, 
   but this ioctl doesn't really need FPU permission framework, because it is a KVM ABI, and in 
   fact KVM_GET_XSAVE2 was added exactly because of that: to make sure that userspace
   is aware that larger than 4K buffer can be returned.

2. Guest FPU state is not even on demand resized (but I can imagine that in the future we will do this).


And of course, adding permissions for kernel features, that is even worse idea, which we really
shouldn't do.

>  
> If there are no objections, I'll test the below and write a proper changelog.
>  
> --
> From: Sean Christopherson <seanjc@google.com>
> Date: Thu, 26 Oct 2023 10:17:33 -0700
> Subject: [PATCH] x86/fpu/xstate: Always preserve non-user xfeatures/flags in
>  __state_perm
> 
> Fixes: 781c64bfcb73 ("x86/fpu/xstate: Handle supervisor states in XSTATE permissions")
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>  arch/x86/kernel/fpu/xstate.c | 18 +++++++++++-------
>  1 file changed, 11 insertions(+), 7 deletions(-)
> 
> diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
> index ef6906107c54..73f6bc00d178 100644
> --- a/arch/x86/kernel/fpu/xstate.c
> +++ b/arch/x86/kernel/fpu/xstate.c
> @@ -1601,16 +1601,20 @@ static int __xstate_request_perm(u64 permitted, u64 requested, bool guest)
>  	if ((permitted & requested) == requested)
>  		return 0;
>  
> -	/* Calculate the resulting kernel state size */
> +	/*
> +	 * Calculate the resulting kernel state size.  Note, @permitted also
> +	 * contains supervisor xfeatures even though supervisor are always
> +	 * permitted for kernel and guest FPUs, and never permitted for user
> +	 * FPUs.
> +	 */
>  	mask = permitted | requested;
> -	/* Take supervisor states into account on the host */
> -	if (!guest)
> -		mask |= xfeatures_mask_supervisor();
>  	ksize = xstate_calculate_size(mask, compacted);

This might not work with kernel dynamic features, because xfeatures_mask_supervisor() will
return all supported supervisor features.


Therefore, at least until we have an actual kernel dynamic feature
(a feature used by the host kernel and not KVM, and which has to be dynamic like AMX),
I suggest that KVM stop using the permission API completely for the guest FPU state,
and just pass all the features it wants to enable right to __fpu_alloc_init_guest_fpstate().
(The guest FPU permission API IMHO should be deprecated and ignored.)



>  
> -	/* Calculate the resulting user state size */
> -	mask &= XFEATURE_MASK_USER_SUPPORTED;
> -	usize = xstate_calculate_size(mask, false);
> +	/*
> +	 * Calculate the resulting user state size.  Take care not to clobber
> +	 * the supervisor xfeatures in the new mask!
> +	 */
> +	usize = xstate_calculate_size(mask & XFEATURE_MASK_USER_SUPPORTED, false);
>  
>  	if (!guest) {
>  		ret = validate_sigaltstack(usize);




Best regards,
	Maxim Levitsky

> 
> base-commit: c076acf10c78c0d7e1aa50670e9cc4c91e8d59b4







^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v6 07/25] x86/fpu/xstate: Tweak guest fpstate to support kernel dynamic xfeatures
  2023-09-14  6:33 ` [PATCH v6 07/25] x86/fpu/xstate: Tweak guest fpstate to support kernel dynamic xfeatures Yang Weijiang
@ 2023-10-31 17:45   ` Maxim Levitsky
  0 siblings, 0 replies; 120+ messages in thread
From: Maxim Levitsky @ 2023-10-31 17:45 UTC (permalink / raw)
  To: Yang Weijiang, seanjc, pbonzini, kvm, linux-kernel
  Cc: dave.hansen, peterz, chao.gao, rick.p.edgecombe, john.allen

On Thu, 2023-09-14 at 02:33 -0400, Yang Weijiang wrote:
> The guest fpstate is sized with fpu_kernel_cfg.default_size (by preceding
> fix) and the kernel dynamic xfeatures are not taken into account, so add
> the support and tweak fpstate xfeatures and size accordingly.
> 
> Below configuration steps are currently enforced to get guest fpstate:
> 1) User space sets thread group xstate permits via arch_prctl().
> 2) User space creates vcpu thread.
> 3) User space enables guest dynamic xfeatures.
> 
> In #1, guest fpstate size (i.e., __state_size [1]) is induced from
> (fpu_kernel_cfg.default_features | user dynamic xfeatures) [2].
> In #2, guest fpstate size is calculated with fpu_kernel_cfg.default_size
> and fpstate->size is set to the same. fpstate->xfeatures is set to
> fpu_kernel_cfg.default_features.
> In #3, guest fpstate is re-allocated as [1] and fpstate->xfeatures is
> set to [2].
> 
> By adding kernel dynamic xfeatures in above #1 and #2, guest xstate area
> size is expanded to hold (fpu_kernel_cfg.default_features | kernel dynamic
> _xfeatures | user dynamic xfeatures)[3], and guest fpstate->xfeatures is
> set to [3]. Then host xsaves/xrstors can act on all guest xfeatures.
> 
> The user_* fields remain unchanged for compatibility of non-compacted KVM
> uAPIs.
> 
> Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>
> ---
>  arch/x86/kernel/fpu/core.c   | 56 +++++++++++++++++++++++++++++-------
>  arch/x86/kernel/fpu/xstate.c |  2 +-
>  arch/x86/kernel/fpu/xstate.h |  2 ++
>  3 files changed, 49 insertions(+), 11 deletions(-)
> 
> diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
> index a42d8ad26ce6..e5819b38545a 100644
> --- a/arch/x86/kernel/fpu/core.c
> +++ b/arch/x86/kernel/fpu/core.c
> @@ -33,6 +33,8 @@ DEFINE_STATIC_KEY_FALSE(__fpu_state_size_dynamic);
>  DEFINE_PER_CPU(u64, xfd_state);
>  #endif
>  
> +extern unsigned int xstate_calculate_size(u64 xfeatures, bool compacted);
> +
>  /* The FPU state configuration data for kernel and user space */
>  struct fpu_state_config	fpu_kernel_cfg __ro_after_init;
>  struct fpu_state_config fpu_user_cfg __ro_after_init;
> @@ -193,8 +195,6 @@ void fpu_reset_from_exception_fixup(void)
>  }
>  
>  #if IS_ENABLED(CONFIG_KVM)
> -static void __fpstate_reset(struct fpstate *fpstate, u64 xfd);
> -
>  static void fpu_init_guest_permissions(struct fpu_guest *gfpu)
>  {
>  	struct fpu_state_perm *fpuperm;
> @@ -215,28 +215,64 @@ static void fpu_init_guest_permissions(struct fpu_guest *gfpu)
>  	gfpu->perm = perm & ~FPU_GUEST_PERM_LOCKED;
>  }
>  
> -bool fpu_alloc_guest_fpstate(struct fpu_guest *gfpu)
> +static struct fpstate *__fpu_alloc_init_guest_fpstate(struct fpu_guest *gfpu)
>  {
> +	bool compacted = cpu_feature_enabled(X86_FEATURE_XCOMPACTED);
> +	unsigned int gfpstate_size, size;
>  	struct fpstate *fpstate;
> -	unsigned int size;
> +	u64 xfeatures;
> +
> +	/*
> +	 * fpu_kernel_cfg.default_features includes all enabled xfeatures
> +	 * except those dynamic xfeatures. Compared with user dynamic
> +	 * xfeatures, the kernel dynamic ones are enabled for guest by
> +	 * default, so add the kernel dynamic xfeatures back when calculating
> +	 * the guest fpstate size.
> +	 *
> +	 * If the user dynamic xfeatures are enabled, the guest fpstate will
> +	 * be re-allocated to hold all guest enabled xfeatures, so omit user
> +	 * dynamic xfeatures here.
> +	 */
> +	xfeatures = fpu_kernel_cfg.default_features |
> +		    fpu_kernel_dynamic_xfeatures;


This is roughly what I had in mind when I was reviewing the previous patch;
however, for the sake of not hard-coding even more of the KVM policy here,
I would let KVM tell this function, via a parameter, whether it wants to
enable the dynamic kernel features, or even better, *which* features it
wants to enable.


> +
> +	gfpstate_size = xstate_calculate_size(xfeatures, compacted);
>  
> -	size = fpu_kernel_cfg.default_size +
> -	       ALIGN(offsetof(struct fpstate, regs), 64);
> +	size = gfpstate_size + ALIGN(offsetof(struct fpstate, regs), 64);
>  
>  	fpstate = vzalloc(size);
>  	if (!fpstate)
> -		return false;
> +		return NULL;
> +	/*
> +	 * Initialize sizes and feature masks, use fpu_user_cfg.*
> +	 * for user_* settings for compatibility of existing uAPIs.
> +	 */
> +	fpstate->size		= gfpstate_size;
> +	fpstate->xfeatures	= xfeatures;
> +	fpstate->user_size	= fpu_user_cfg.default_size;
> +	fpstate->user_xfeatures	= fpu_user_cfg.default_features;
> +	fpstate->xfd		= 0;
>  
> -	/* Leave xfd to 0 (the reset value defined by spec) */
> -	__fpstate_reset(fpstate, 0);
>  	fpstate_init_user(fpstate);
>  	fpstate->is_valloc	= true;
>  	fpstate->is_guest	= true;
>  
>  	gfpu->fpstate		= fpstate;
> -	gfpu->xfeatures		= fpu_user_cfg.default_features;
> +	gfpu->xfeatures		= xfeatures;
>  	gfpu->perm		= fpu_user_cfg.default_features;
>  
> +	return fpstate;
> +}


I think that this code will break later, when the permission API is called by KVM,
because it will overwrite fpstate->user_size with fpstate->size,
assuming that all kernel dynamic features are enabled/disabled (depending
on Sean's patch).


> +
> +bool fpu_alloc_guest_fpstate(struct fpu_guest *gfpu)
> +{
> +	struct fpstate *fpstate;
> +
> +	fpstate = __fpu_alloc_init_guest_fpstate(gfpu);
> +
> +	if (!fpstate)
> +		return false;
> +

What is the point of the __fpu_alloc_init_guest_fpstate / fpu_alloc_guest_fpstate split,
since there is only one caller?


Best regards,
	Maxim Levitsky

>  	/*
>  	 * KVM sets the FP+SSE bits in the XSAVE header when copying FPU state
>  	 * to userspace, even when XSAVE is unsupported, so that restoring FPU
> diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
> index c5d903b4df4d..87149aba6f11 100644
> --- a/arch/x86/kernel/fpu/xstate.c
> +++ b/arch/x86/kernel/fpu/xstate.c
> @@ -561,7 +561,7 @@ static bool __init check_xstate_against_struct(int nr)
>  	return true;
>  }
>  
> -static unsigned int xstate_calculate_size(u64 xfeatures, bool compacted)
> +unsigned int xstate_calculate_size(u64 xfeatures, bool compacted)
>  {
>  	unsigned int topmost = fls64(xfeatures) -  1;
>  	unsigned int offset = xstate_offsets[topmost];
> diff --git a/arch/x86/kernel/fpu/xstate.h b/arch/x86/kernel/fpu/xstate.h
> index a4ecb04d8d64..9c6e3ca05c5c 100644
> --- a/arch/x86/kernel/fpu/xstate.h
> +++ b/arch/x86/kernel/fpu/xstate.h
> @@ -10,6 +10,8 @@
>  DECLARE_PER_CPU(u64, xfd_state);
>  #endif
>  
> +extern u64 fpu_kernel_dynamic_xfeatures;
> +
>  static inline void xstate_init_xcomp_bv(struct xregs_state *xsave, u64 mask)
>  {
>  	/*








^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v6 08/25] x86/fpu/xstate: WARN if normal fpstate contains kernel dynamic xfeatures
  2023-09-14  6:33 ` [PATCH v6 08/25] x86/fpu/xstate: WARN if normal fpstate contains " Yang Weijiang
@ 2023-10-31 17:45   ` Maxim Levitsky
  0 siblings, 0 replies; 120+ messages in thread
From: Maxim Levitsky @ 2023-10-31 17:45 UTC (permalink / raw)
  To: Yang Weijiang, seanjc, pbonzini, kvm, linux-kernel
  Cc: dave.hansen, peterz, chao.gao, rick.p.edgecombe, john.allen

On Thu, 2023-09-14 at 02:33 -0400, Yang Weijiang wrote:
> fpu_kernel_dynamic_xfeatures are now __ONLY__ enabled by the guest kernel and
> used for guest fpstate, i.e., never for normal fpstate. The bits are added
> when guest fpstate is allocated and fpstate->is_guest is set to %true.
> 
> For normal fpstate, the bits should have been removed when initializing the
> system FPU settings, so WARN_ONCE() if normal fpstate contains kernel dynamic
> xfeatures before xsaves is executed.
> 
> Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>
> ---
>  arch/x86/kernel/fpu/xstate.h | 3 +++
>  1 file changed, 3 insertions(+)
> 
> diff --git a/arch/x86/kernel/fpu/xstate.h b/arch/x86/kernel/fpu/xstate.h
> index 9c6e3ca05c5c..c2b33a5db53d 100644
> --- a/arch/x86/kernel/fpu/xstate.h
> +++ b/arch/x86/kernel/fpu/xstate.h
> @@ -186,6 +186,9 @@ static inline void os_xsave(struct fpstate *fpstate)
>  	WARN_ON_FPU(!alternatives_patched);
>  	xfd_validate_state(fpstate, mask, false);
>  
> +	WARN_ON_FPU(!fpstate->is_guest &&
> +		    (mask & fpu_kernel_dynamic_xfeatures));
> +
>  	XSTATE_XSAVE(&fpstate->regs.xsave, lmask, hmask, err);
>  
>  	/* We should never fault when copying to a kernel buffer: */

I am not sure about this patch. It's true that now the kernel dynamic features
are for guest only, but in the future I can easily see a kernel dynamic feature
that will also be used in the kernel itself.

Maybe we can add a comment above this warning to say that _currently_ there are
no kernel dynamic features that are enabled for the host kernel.

Best regards,
	Maxim Levitsky






* Re: [PATCH v6 09/25] KVM: x86: Rework cpuid_get_supported_xcr0() to operate on vCPU data
  2023-09-14  6:33 ` [PATCH v6 09/25] KVM: x86: Rework cpuid_get_supported_xcr0() to operate on vCPU data Yang Weijiang
@ 2023-10-31 17:46   ` Maxim Levitsky
  2023-11-01 14:41     ` Sean Christopherson
  0 siblings, 1 reply; 120+ messages in thread
From: Maxim Levitsky @ 2023-10-31 17:46 UTC (permalink / raw)
  To: Yang Weijiang, seanjc, pbonzini, kvm, linux-kernel
  Cc: dave.hansen, peterz, chao.gao, rick.p.edgecombe, john.allen

On Thu, 2023-09-14 at 02:33 -0400, Yang Weijiang wrote:
> From: Sean Christopherson <seanjc@google.com>
> 
> Rework and rename cpuid_get_supported_xcr0() to explicitly operate on vCPU
> state, i.e. on a vCPU's CPUID state.  Prior to commit 275a87244ec8 ("KVM:
> x86: Don't adjust guest's CPUID.0x12.1 (allowed SGX enclave XFRM)"), KVM
> incorrectly fudged guest CPUID at runtime,
Can you explain how commit 275a87244ec8 relates to this patch?


>  which in turn necessitated
> massaging the incoming CPUID state for KVM_SET_CPUID{2} so as not to run
> afoul of kvm_cpuid_check_equal().

Can you link the commit that added this 'massaging' and explain how this relates to this patch?

Can you explain what is the problem that this patch is trying to solve?


Does the x86 spec really allow different CPUs (assuming all CPUs are of the
same type) to have different supported masks of XCR0 bits?

If true, does KVM support it?

Assuming that the answer to the above questions is no, won't this patch make it easier
to break this rule and thus make it easier to introduce a bug?

Best regards,
	Maxim Levitsky

> 
> Opportunistically move the helper below kvm_update_cpuid_runtime() to make
> it harder to repeat the mistake of querying supported XCR0 for runtime
> updates.

> 
> No functional change intended.
> 
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>
> ---
>  arch/x86/kvm/cpuid.c | 33 ++++++++++++++++-----------------
>  1 file changed, 16 insertions(+), 17 deletions(-)
> 
> diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
> index 0544e30b4946..7c3e4a550ca7 100644
> --- a/arch/x86/kvm/cpuid.c
> +++ b/arch/x86/kvm/cpuid.c
> @@ -247,21 +247,6 @@ void kvm_update_pv_runtime(struct kvm_vcpu *vcpu)
>  		vcpu->arch.pv_cpuid.features = best->eax;
>  }
>  
> -/*
> - * Calculate guest's supported XCR0 taking into account guest CPUID data and
> - * KVM's supported XCR0 (comprised of host's XCR0 and KVM_SUPPORTED_XCR0).
> - */
> -static u64 cpuid_get_supported_xcr0(struct kvm_cpuid_entry2 *entries, int nent)
> -{
> -	struct kvm_cpuid_entry2 *best;
> -
> -	best = cpuid_entry2_find(entries, nent, 0xd, 0);
> -	if (!best)
> -		return 0;
> -
> -	return (best->eax | ((u64)best->edx << 32)) & kvm_caps.supported_xcr0;
> -}
> -
>  static void __kvm_update_cpuid_runtime(struct kvm_vcpu *vcpu, struct kvm_cpuid_entry2 *entries,
>  				       int nent)
>  {
> @@ -312,6 +297,21 @@ void kvm_update_cpuid_runtime(struct kvm_vcpu *vcpu)
>  }
>  EXPORT_SYMBOL_GPL(kvm_update_cpuid_runtime);
>  
> +/*
> + * Calculate guest's supported XCR0 taking into account guest CPUID data and
> + * KVM's supported XCR0 (comprised of host's XCR0 and KVM_SUPPORTED_XCR0).
> + */
> +static u64 vcpu_get_supported_xcr0(struct kvm_vcpu *vcpu)
> +{
> +	struct kvm_cpuid_entry2 *best;
> +
> +	best = kvm_find_cpuid_entry_index(vcpu, 0xd, 0);
> +	if (!best)
> +		return 0;
> +
> +	return (best->eax | ((u64)best->edx << 32)) & kvm_caps.supported_xcr0;
> +}
> +
>  static bool kvm_cpuid_has_hyperv(struct kvm_cpuid_entry2 *entries, int nent)
>  {
>  	struct kvm_cpuid_entry2 *entry;
> @@ -357,8 +357,7 @@ static void kvm_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
>  		kvm_apic_set_version(vcpu);
>  	}
>  
> -	vcpu->arch.guest_supported_xcr0 =
> -		cpuid_get_supported_xcr0(vcpu->arch.cpuid_entries, vcpu->arch.cpuid_nent);
> +	vcpu->arch.guest_supported_xcr0 = vcpu_get_supported_xcr0(vcpu);
>  
>  	/*
>  	 * FP+SSE can always be saved/restored via KVM_{G,S}ET_XSAVE, even if







* Re: [PATCH v6 10/25] KVM: x86: Add kvm_msr_{read,write}() helpers
  2023-09-14  6:33 ` [PATCH v6 10/25] KVM: x86: Add kvm_msr_{read,write}() helpers Yang Weijiang
@ 2023-10-31 17:47   ` Maxim Levitsky
  2023-11-01 19:32     ` Sean Christopherson
  0 siblings, 1 reply; 120+ messages in thread
From: Maxim Levitsky @ 2023-10-31 17:47 UTC (permalink / raw)
  To: Yang Weijiang, seanjc, pbonzini, kvm, linux-kernel
  Cc: dave.hansen, peterz, chao.gao, rick.p.edgecombe, john.allen

On Thu, 2023-09-14 at 02:33 -0400, Yang Weijiang wrote:
> Wrap __kvm_{get,set}_msr() into two new helpers for KVM usage and use the
> helpers to replace existing usage of the raw functions.
> kvm_msr_{read,write}() are KVM-internal helpers, i.e. used when KVM needs
> to get/set a MSR value for emulating CPU behavior.

I am not sure if I like this patch or not. On one hand the code is cleaner this way,
but on the other hand now it is easier to call kvm_msr_write() on behalf of the guest.

For example we also have the 'kvm_set_msr()' which does actually set the msr on behalf of the guest.

How about we call the new function kvm_msr_set_host() and rename kvm_set_msr() to kvm_msr_set_guest(),
together with good comments explaining what they do?


Also functions like kvm_set_msr_ignored_check(), kvm_set_msr_with_filter() and such,
IMHO have names that are not very user friendly. 

A refactoring is very welcome in this area. At the very least they should gain 
thoughtful comments about what they do.


For reading msrs API, I can suggest similar names and comments:

/* 
 * Read a value of a MSR. 
 * Some MSRs exist in the KVM model even when the guest can't read them.
 */
int kvm_get_msr_value(struct kvm_vcpu *vcpu, u32 index, u64 *data);


/*  Read a value of an MSR on behalf of the guest */

int kvm_get_guest_msr_value(struct kvm_vcpu *vcpu, u32 index, u64 *data);


Although I am not going to argue over this, there are multiple ways to improve this,
and keeping things as is, or something similar to this patch, is also fine with me.


Best regards,
	Maxim Levitsky

> 
> Suggested-by: Sean Christopherson <seanjc@google.com>
> Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>
> ---
>  arch/x86/include/asm/kvm_host.h |  4 +++-
>  arch/x86/kvm/cpuid.c            |  2 +-
>  arch/x86/kvm/x86.c              | 16 +++++++++++++---
>  3 files changed, 17 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 1a4def36d5bb..0fc5e6312e93 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1956,7 +1956,9 @@ void kvm_prepare_emulation_failure_exit(struct kvm_vcpu *vcpu);
>  
>  void kvm_enable_efer_bits(u64);
>  bool kvm_valid_efer(struct kvm_vcpu *vcpu, u64 efer);
> -int __kvm_get_msr(struct kvm_vcpu *vcpu, u32 index, u64 *data, bool host_initiated);
> +
> +int kvm_msr_read(struct kvm_vcpu *vcpu, u32 index, u64 *data);
> +int kvm_msr_write(struct kvm_vcpu *vcpu, u32 index, u64 data);
>  int kvm_get_msr(struct kvm_vcpu *vcpu, u32 index, u64 *data);
>  int kvm_set_msr(struct kvm_vcpu *vcpu, u32 index, u64 data);
>  int kvm_emulate_rdmsr(struct kvm_vcpu *vcpu);
> diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
> index 7c3e4a550ca7..1f206caec559 100644
> --- a/arch/x86/kvm/cpuid.c
> +++ b/arch/x86/kvm/cpuid.c
> @@ -1531,7 +1531,7 @@ bool kvm_cpuid(struct kvm_vcpu *vcpu, u32 *eax, u32 *ebx,
>  		*edx = entry->edx;
>  		if (function == 7 && index == 0) {
>  			u64 data;
> -		        if (!__kvm_get_msr(vcpu, MSR_IA32_TSX_CTRL, &data, true) &&
> +		        if (!kvm_msr_read(vcpu, MSR_IA32_TSX_CTRL, &data) &&
>  			    (data & TSX_CTRL_CPUID_CLEAR))
>  				*ebx &= ~(F(RTM) | F(HLE));
>  		} else if (function == 0x80000007) {
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 6c9c81e82e65..e0b55c043dab 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -1917,8 +1917,8 @@ static int kvm_set_msr_ignored_check(struct kvm_vcpu *vcpu,
>   * Returns 0 on success, non-0 otherwise.
>   * Assumes vcpu_load() was already called.
>   */
> -int __kvm_get_msr(struct kvm_vcpu *vcpu, u32 index, u64 *data,
> -		  bool host_initiated)
> +static int __kvm_get_msr(struct kvm_vcpu *vcpu, u32 index, u64 *data,
> +			 bool host_initiated)
>  {
>  	struct msr_data msr;
>  	int ret;
> @@ -1944,6 +1944,16 @@ int __kvm_get_msr(struct kvm_vcpu *vcpu, u32 index, u64 *data,
>  	return ret;
>  }
>  
> +int kvm_msr_write(struct kvm_vcpu *vcpu, u32 index, u64 data)
> +{
> +	return __kvm_set_msr(vcpu, index, data, true);
> +}
> +
> +int kvm_msr_read(struct kvm_vcpu *vcpu, u32 index, u64 *data)
> +{
> +	return __kvm_get_msr(vcpu, index, data, true);
> +}
> +
>  static int kvm_get_msr_ignored_check(struct kvm_vcpu *vcpu,
>  				     u32 index, u64 *data, bool host_initiated)
>  {
> @@ -12082,7 +12092,7 @@ void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
>  						  MSR_IA32_MISC_ENABLE_BTS_UNAVAIL;
>  
>  		__kvm_set_xcr(vcpu, 0, XFEATURE_MASK_FP);
> -		__kvm_set_msr(vcpu, MSR_IA32_XSS, 0, true);
> +		kvm_msr_write(vcpu, MSR_IA32_XSS, 0);
>  	}
>  
>  	/* All GPRs except RDX (handled below) are zeroed on RESET/INIT. */








* Re: [PATCH v6 11/25] KVM: x86: Report XSS as to-be-saved if there are supported features
  2023-09-14  6:33 ` [PATCH v6 11/25] KVM: x86: Report XSS as to-be-saved if there are supported features Yang Weijiang
@ 2023-10-31 17:47   ` Maxim Levitsky
  2023-11-01 19:18     ` Sean Christopherson
  0 siblings, 1 reply; 120+ messages in thread
From: Maxim Levitsky @ 2023-10-31 17:47 UTC (permalink / raw)
  To: Yang Weijiang, seanjc, pbonzini, kvm, linux-kernel
  Cc: dave.hansen, peterz, chao.gao, rick.p.edgecombe, john.allen

On Thu, 2023-09-14 at 02:33 -0400, Yang Weijiang wrote:
> From: Sean Christopherson <seanjc@google.com>
> 
> Add MSR_IA32_XSS to list of MSRs reported to userspace if supported_xss
> is non-zero, i.e. KVM supports at least one XSS based feature.


I can't believe that CET is the first supervisor feature that KVM supports...

Ah, now I understand why:

1. XSAVES on AMD can't really be intercepted (other than clearing CR4.OSXSAVE bit, which isn't an option if you want to support AVX for example)
   On VMX however you can intercept XSAVES and even intercept it only when it touches specific bits of state that you don't want the guest to read/write
   freely.

2. Even if it was possible to intercept it, guests use XSAVES on every context switch if available and emulating it might be costly.

3. Emulating XSAVES is also not that easy to do correctly.

However, XSAVES touches various MSRs, thus letting the guest use it unintercepted means giving access to host MSRs,
which might be wrong security-wise in some cases.

Thus I see that KVM hardcodes IA32_XSS to 0, which makes XSAVES work exactly like XSAVE.

And for some features that would benefit from XSAVES state components,
KVM likely won't even be able to use them due to this limitation
(thankfully, the CPUID allows this), forcing the guests to use rdmsr/wrmsr instead.


However, it is possible to enable IA32_XSS bits in case the MSRs XSAVES reads/writes can't do harm to the host; KVM
can then context switch these MSRs when the guest exits, and that is what is done here with CET.

If you think that a short summary of the above can help the future reader to understand why IA32_XSS support is added only now,
it might be a good idea to add a few lines to the changelog of this patch.

> 
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>
> ---
>  arch/x86/kvm/x86.c | 5 +++++
>  1 file changed, 5 insertions(+)
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index e0b55c043dab..1258d1d6dd52 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -1464,6 +1464,7 @@ static const u32 msrs_to_save_base[] = {
>  	MSR_IA32_UMWAIT_CONTROL,
>  
>  	MSR_IA32_XFD, MSR_IA32_XFD_ERR,
> +	MSR_IA32_XSS,
>  };
>  
>  static const u32 msrs_to_save_pmu[] = {
> @@ -7195,6 +7196,10 @@ static void kvm_probe_msr_to_save(u32 msr_index)
>  		if (!(kvm_get_arch_capabilities() & ARCH_CAP_TSX_CTRL_MSR))
>  			return;
>  		break;
> +	case MSR_IA32_XSS:
> +		if (!kvm_caps.supported_xss)
> +			return;
> +		break;
>  	default:
>  		break;
>  	}


Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>

Best regards,
	Maxim Levitsky







* Re: [PATCH v6 12/25] KVM: x86: Refresh CPUID on write to guest MSR_IA32_XSS
  2023-09-14  6:33 ` [PATCH v6 12/25] KVM: x86: Refresh CPUID on write to guest MSR_IA32_XSS Yang Weijiang
  2023-10-08  5:54   ` Chao Gao
@ 2023-10-31 17:51   ` Maxim Levitsky
  2023-11-01 17:20     ` Sean Christopherson
  2023-11-15  7:18   ` Binbin Wu
  2 siblings, 1 reply; 120+ messages in thread
From: Maxim Levitsky @ 2023-10-31 17:51 UTC (permalink / raw)
  To: Yang Weijiang, seanjc, pbonzini, kvm, linux-kernel
  Cc: dave.hansen, peterz, chao.gao, rick.p.edgecombe, john.allen, Zhang Yi Z

On Thu, 2023-09-14 at 02:33 -0400, Yang Weijiang wrote:
> Update CPUID.(EAX=0DH,ECX=1).EBX to reflect current required xstate size
> due to XSS MSR modification.
> CPUID(EAX=0DH,ECX=1).EBX reports the required storage size of all enabled
> xstate features in (XCR0 | IA32_XSS). The CPUID value can be used by the
> guest before allocating a sufficient xsave buffer.
> 
> Note, KVM does not yet support any XSS based features, i.e. supported_xss
> is guaranteed to be zero at this time.
> 
> Opportunistically modify XSS write access logic as: if !guest_cpuid_has(),
> write initiated from host is allowed iff the write is a reset operation,
> i.e., data == 0, reject host_initiated non-reset write and any guest write.

The commit message is not clear and somewhat misleading because it forces the reader 
to parse the whole patch before they can understand what '!guest_cpuid_has()'
means.

Also I don't think that the term 'reset operation' is a good choice, because it is too closely
related to vCPU reset IMHO. Let's at least call it 'reset to a default value' or something like that.
Also note that 0 is not always the default/reset value of an MSR.

I suggest this instead:

"If XSAVES is not enabled in the guest CPUID, forbid setting IA32_XSS msr
to anything but 0, even if the write is host initiated."

Also, isn't this change, at least in theory, not backward compatible?
While KVM didn't report IA32_XSS as one needing save/restore, before this
patch userspace could set IA32_XSS to any value; now it can't.

Maybe it's safer to allow to set any value, ignore the set value and
issue a WARN_ON_ONCE or something?

Finally, I think that this change is better to be done in a separate patch
because it is unrelated and might not even be backward compatible.

Best regards,
	Maxim Levitsky

> 
> Suggested-by: Sean Christopherson <seanjc@google.com>
> Co-developed-by: Zhang Yi Z <yi.z.zhang@linux.intel.com>
> Signed-off-by: Zhang Yi Z <yi.z.zhang@linux.intel.com>
> Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>
> ---
>  arch/x86/include/asm/kvm_host.h |  1 +
>  arch/x86/kvm/cpuid.c            | 15 ++++++++++++++-
>  arch/x86/kvm/x86.c              | 13 +++++++++----
>  3 files changed, 24 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 0fc5e6312e93..d77b030e996c 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -803,6 +803,7 @@ struct kvm_vcpu_arch {
>  
>  	u64 xcr0;
>  	u64 guest_supported_xcr0;
> +	u64 guest_supported_xss;
>  
>  	struct kvm_pio_request pio;
>  	void *pio_data;
> diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
> index 1f206caec559..4e7a820cba62 100644
> --- a/arch/x86/kvm/cpuid.c
> +++ b/arch/x86/kvm/cpuid.c
> @@ -275,7 +275,8 @@ static void __kvm_update_cpuid_runtime(struct kvm_vcpu *vcpu, struct kvm_cpuid_e
>  	best = cpuid_entry2_find(entries, nent, 0xD, 1);
>  	if (best && (cpuid_entry_has(best, X86_FEATURE_XSAVES) ||
>  		     cpuid_entry_has(best, X86_FEATURE_XSAVEC)))
> -		best->ebx = xstate_required_size(vcpu->arch.xcr0, true);
> +		best->ebx = xstate_required_size(vcpu->arch.xcr0 |
> +						 vcpu->arch.ia32_xss, true);
>  
>  	best = __kvm_find_kvm_cpuid_features(vcpu, entries, nent);
>  	if (kvm_hlt_in_guest(vcpu->kvm) && best &&
> @@ -312,6 +313,17 @@ static u64 vcpu_get_supported_xcr0(struct kvm_vcpu *vcpu)
>  	return (best->eax | ((u64)best->edx << 32)) & kvm_caps.supported_xcr0;
>  }
>  
> +static u64 vcpu_get_supported_xss(struct kvm_vcpu *vcpu)
> +{
> +	struct kvm_cpuid_entry2 *best;
> +
> +	best = kvm_find_cpuid_entry_index(vcpu, 0xd, 1);
> +	if (!best)
> +		return 0;
> +
> +	return (best->ecx | ((u64)best->edx << 32)) & kvm_caps.supported_xss;
> +}

Same question as for the patch that added vcpu_get_supported_xcr0():
why have a per-vCPU supported XSS if we assume that all CPUs have the same
CPUID?

I mean I am not against supporting hybrid CPU models, but KVM currently doesn't
support this, and this creates the illusion that it does.

> +
>  static bool kvm_cpuid_has_hyperv(struct kvm_cpuid_entry2 *entries, int nent)
>  {
>  	struct kvm_cpuid_entry2 *entry;
> @@ -358,6 +370,7 @@ static void kvm_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
>  	}
>  
>  	vcpu->arch.guest_supported_xcr0 = vcpu_get_supported_xcr0(vcpu);
> +	vcpu->arch.guest_supported_xss = vcpu_get_supported_xss(vcpu);
>  
>  	/*
>  	 * FP+SSE can always be saved/restored via KVM_{G,S}ET_XSAVE, even if
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 1258d1d6dd52..9a616d84bd39 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -3795,20 +3795,25 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
>  			vcpu->arch.ia32_tsc_adjust_msr += adj;
>  		}
>  		break;
> -	case MSR_IA32_XSS:
> -		if (!msr_info->host_initiated &&
> -		    !guest_cpuid_has(vcpu, X86_FEATURE_XSAVES))
> +	case MSR_IA32_XSS: {
> +		bool host_msr_reset = msr_info->host_initiated && data == 0;
> +
> +		if (!guest_cpuid_has(vcpu, X86_FEATURE_XSAVES) &&
> +		    (!host_msr_reset || !msr_info->host_initiated))
>  			return 1;
>  		/*
>  		 * KVM supports exposing PT to the guest, but does not support
>  		 * IA32_XSS[bit 8]. Guests have to use RDMSR/WRMSR rather than
>  		 * XSAVES/XRSTORS to save/restore PT MSRs.
>  		 */
Just in case.... TODO

> -		if (data & ~kvm_caps.supported_xss)
> +		if (data & ~vcpu->arch.guest_supported_xss)
>  			return 1;
> +		if (vcpu->arch.ia32_xss == data)
> +			break;
>  		vcpu->arch.ia32_xss = data;
>  		kvm_update_cpuid_runtime(vcpu);
>  		break;
> +	}
>  	case MSR_SMI_COUNT:
>  		if (!msr_info->host_initiated)
>  			return 1;








* Re: [PATCH v6 13/25] KVM: x86: Initialize kvm_caps.supported_xss
  2023-09-14  6:33 ` [PATCH v6 13/25] KVM: x86: Initialize kvm_caps.supported_xss Yang Weijiang
@ 2023-10-31 17:51   ` Maxim Levitsky
  0 siblings, 0 replies; 120+ messages in thread
From: Maxim Levitsky @ 2023-10-31 17:51 UTC (permalink / raw)
  To: Yang Weijiang, seanjc, pbonzini, kvm, linux-kernel
  Cc: dave.hansen, peterz, chao.gao, rick.p.edgecombe, john.allen

On Thu, 2023-09-14 at 02:33 -0400, Yang Weijiang wrote:
> Set original kvm_caps.supported_xss to (host_xss & KVM_SUPPORTED_XSS) if
> XSAVES is supported. host_xss contains the host supported xstate feature
> bits for thread FPU context switch, KVM_SUPPORTED_XSS includes all KVM
> enabled XSS feature bits, the resulting value represents the supervisor
> xstates that are available to guest and are backed by host FPU framework
> for swapping {guest,host} XSAVE-managed registers/MSRs.
> 
> Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>
> ---
>  arch/x86/kvm/x86.c | 9 ++++++---
>  1 file changed, 6 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 9a616d84bd39..66edbed25db8 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -226,6 +226,8 @@ static struct kvm_user_return_msrs __percpu *user_return_msrs;
>  				| XFEATURE_MASK_BNDCSR | XFEATURE_MASK_AVX512 \
>  				| XFEATURE_MASK_PKRU | XFEATURE_MASK_XTILE)
>  
> +#define KVM_SUPPORTED_XSS     0
> +
>  u64 __read_mostly host_efer;
>  EXPORT_SYMBOL_GPL(host_efer);
>  
> @@ -9515,12 +9517,13 @@ static int __kvm_x86_vendor_init(struct kvm_x86_init_ops *ops)
>  		host_xcr0 = xgetbv(XCR_XFEATURE_ENABLED_MASK);
>  		kvm_caps.supported_xcr0 = host_xcr0 & KVM_SUPPORTED_XCR0;
>  	}
> +	if (boot_cpu_has(X86_FEATURE_XSAVES)) {
> +		rdmsrl(MSR_IA32_XSS, host_xss);
> +		kvm_caps.supported_xss = host_xss & KVM_SUPPORTED_XSS;
> +	}
>  
>  	rdmsrl_safe(MSR_EFER, &host_efer);
>  
> -	if (boot_cpu_has(X86_FEATURE_XSAVES))
> -		rdmsrl(MSR_IA32_XSS, host_xss);
> -
>  	kvm_init_pmu_capability(ops->pmu_ops);
>  
>  	if (boot_cpu_has(X86_FEATURE_ARCH_CAPABILITIES))


Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>

Best regards,
	Maxim Levitsky






* Re: [PATCH v6 14/25] KVM: x86: Load guest FPU state when access XSAVE-managed MSRs
  2023-09-14  6:33 ` [PATCH v6 14/25] KVM: x86: Load guest FPU state when access XSAVE-managed MSRs Yang Weijiang
@ 2023-10-31 17:51   ` Maxim Levitsky
  2023-11-01 18:05     ` Sean Christopherson
  0 siblings, 1 reply; 120+ messages in thread
From: Maxim Levitsky @ 2023-10-31 17:51 UTC (permalink / raw)
  To: Yang Weijiang, seanjc, pbonzini, kvm, linux-kernel
  Cc: dave.hansen, peterz, chao.gao, rick.p.edgecombe, john.allen

On Thu, 2023-09-14 at 02:33 -0400, Yang Weijiang wrote:
> From: Sean Christopherson <seanjc@google.com>
> 
> Load the guest's FPU state if userspace is accessing MSRs whose values
> are managed by XSAVES. Introduce two helpers, kvm_{get,set}_xstate_msr(),
> to facilitate access to such kind of MSRs.
> 
> If MSRs supported in kvm_caps.supported_xss are passed through to guest,
> the guest MSRs are swapped with host's before vCPU exits to userspace and
> after it re-enters kernel before next VM-entry.
> 
> Because the modified code is also used for the KVM_GET_MSRS device ioctl(),
> explicitly check @vcpu is non-null before attempting to load guest state.
> The XSS supporting MSRs cannot be retrieved via the device ioctl() without
> loading guest FPU state (which doesn't exist).
> 
> Note that guest_cpuid_has() is not queried as host userspace is allowed to
> access MSRs that have not been exposed to the guest, e.g. it might do
> KVM_SET_MSRS prior to KVM_SET_CPUID2.
> 
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> Co-developed-by: Yang Weijiang <weijiang.yang@intel.com>
> Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>
> ---
>  arch/x86/kvm/x86.c | 30 +++++++++++++++++++++++++++++-
>  arch/x86/kvm/x86.h | 24 ++++++++++++++++++++++++
>  2 files changed, 53 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 66edbed25db8..a091764bf1d2 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -133,6 +133,9 @@ static int __set_sregs2(struct kvm_vcpu *vcpu, struct kvm_sregs2 *sregs2);
>  static void __get_sregs2(struct kvm_vcpu *vcpu, struct kvm_sregs2 *sregs2);
>  
>  static DEFINE_MUTEX(vendor_module_lock);
> +static void kvm_load_guest_fpu(struct kvm_vcpu *vcpu);
> +static void kvm_put_guest_fpu(struct kvm_vcpu *vcpu);
> +
>  struct kvm_x86_ops kvm_x86_ops __read_mostly;
>  
>  #define KVM_X86_OP(func)					     \
> @@ -4372,6 +4375,22 @@ int kvm_get_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
>  }
>  EXPORT_SYMBOL_GPL(kvm_get_msr_common);
>  
> +static const u32 xstate_msrs[] = {
> +	MSR_IA32_U_CET, MSR_IA32_PL0_SSP, MSR_IA32_PL1_SSP,
> +	MSR_IA32_PL2_SSP, MSR_IA32_PL3_SSP,
> +};
> +
> +static bool is_xstate_msr(u32 index)
> +{
> +	int i;
> +
> +	for (i = 0; i < ARRAY_SIZE(xstate_msrs); i++) {
> +		if (index == xstate_msrs[i])
> +			return true;
> +	}
> +	return false;
> +}

The name 'xstate_msr' IMHO is not clear.

How about naming it 'guest_fpu_state_msrs', together with adding a comment like this:

"These MSRs are context switched together with the rest of the guest FPU state,
on exit/entry to/from userspace.

There is also an assumption that loading guest values while the host kernel runs,
doesn't cause harm to the host kernel"


But if you prefer something else, it's fine with me, but I would appreciate having
some comment attached to 'xstate_msr' at least.

> +
>  /*
>   * Read or write a bunch of msrs. All parameters are kernel addresses.
>   *
> @@ -4382,11 +4401,20 @@ static int __msr_io(struct kvm_vcpu *vcpu, struct kvm_msrs *msrs,
>  		    int (*do_msr)(struct kvm_vcpu *vcpu,
>  				  unsigned index, u64 *data))
>  {
> +	bool fpu_loaded = false;
>  	int i;
>  
> -	for (i = 0; i < msrs->nmsrs; ++i)
> +	for (i = 0; i < msrs->nmsrs; ++i) {
> +		if (vcpu && !fpu_loaded && kvm_caps.supported_xss &&
> +		    is_xstate_msr(entries[i].index)) {

A comment here about why this is done, will also be appreciated:

"Userspace requested us to read an MSR whose value resides in the guest FPU state.
Load this state temporarily onto the CPU to read/update it."

> +			kvm_load_guest_fpu(vcpu);
> +			fpu_loaded = true;
> +		}
>  		if (do_msr(vcpu, entries[i].index, &entries[i].data))
>  			break;
> +	}

And maybe here too:

"If KVM loaded the guest FPU state, unload it to restore the original userspace FPU state
and to update the guest FPU state in case it was modified."

> +	if (fpu_loaded)
> +		kvm_put_guest_fpu(vcpu);
>  
>  	return i;
>  }
> diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
> index 1e7be1f6ab29..9a8e3a84eaf4 100644
> --- a/arch/x86/kvm/x86.h
> +++ b/arch/x86/kvm/x86.h
> @@ -540,4 +540,28 @@ int kvm_sev_es_string_io(struct kvm_vcpu *vcpu, unsigned int size,
>  			 unsigned int port, void *data,  unsigned int count,
>  			 int in);
>  
> +/*
> + * Lock and/or reload guest FPU and access xstate MSRs. For accesses initiated
> + * by host, guest FPU is loaded in __msr_io(). For accesses initiated by guest,
> + * guest FPU should have been loaded already.
> + */
> +
> +static inline void kvm_get_xstate_msr(struct kvm_vcpu *vcpu,
> +				      struct msr_data *msr_info)
> +{
> +	KVM_BUG_ON(!vcpu->arch.guest_fpu.fpstate->in_use, vcpu->kvm);
> +	kvm_fpu_get();
> +	rdmsrl(msr_info->index, msr_info->data);
> +	kvm_fpu_put();
> +}
> +
> +static inline void kvm_set_xstate_msr(struct kvm_vcpu *vcpu,
> +				      struct msr_data *msr_info)
> +{
> +	KVM_BUG_ON(!vcpu->arch.guest_fpu.fpstate->in_use, vcpu->kvm);
> +	kvm_fpu_get();
> +	wrmsrl(msr_info->index, msr_info->data);
> +	kvm_fpu_put();
> +}

These functions are not used in the patch. I think they should be added later
when used.

Best regards,
	Maxim Levitsky

> +
>  #endif






^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v6 15/25] KVM: x86: Add fault checks for guest CR4.CET setting
  2023-09-14  6:33 ` [PATCH v6 15/25] KVM: x86: Add fault checks for guest CR4.CET setting Yang Weijiang
@ 2023-10-31 17:51   ` Maxim Levitsky
  0 siblings, 0 replies; 120+ messages in thread
From: Maxim Levitsky @ 2023-10-31 17:51 UTC (permalink / raw)
  To: Yang Weijiang, seanjc, pbonzini, kvm, linux-kernel
  Cc: dave.hansen, peterz, chao.gao, rick.p.edgecombe, john.allen

On Thu, 2023-09-14 at 02:33 -0400, Yang Weijiang wrote:
> Check potential faults for CR4.CET setting per Intel SDM requirements.
> CET can be enabled if and only if CR0.WP == 1, i.e. setting CR4.CET ==
> 1 faults if CR0.WP == 0 and setting CR0.WP == 0 fails if CR4.CET == 1.
> 
> Reviewed-by: Chao Gao <chao.gao@intel.com>
> Co-developed-by: Sean Christopherson <seanjc@google.com>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>
> ---
>  arch/x86/kvm/x86.c | 6 ++++++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index a091764bf1d2..dda9c7141ea1 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -1006,6 +1006,9 @@ int kvm_set_cr0(struct kvm_vcpu *vcpu, unsigned long cr0)
>  	    (is_64_bit_mode(vcpu) || kvm_is_cr4_bit_set(vcpu, X86_CR4_PCIDE)))
>  		return 1;
>  
> +	if (!(cr0 & X86_CR0_WP) && kvm_is_cr4_bit_set(vcpu, X86_CR4_CET))
> +		return 1;
> +
>  	static_call(kvm_x86_set_cr0)(vcpu, cr0);
>  
>  	kvm_post_set_cr0(vcpu, old_cr0, cr0);
> @@ -1217,6 +1220,9 @@ int kvm_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
>  			return 1;
>  	}
>  
> +	if ((cr4 & X86_CR4_CET) && !kvm_is_cr0_bit_set(vcpu, X86_CR0_WP))
> +		return 1;
> +
>  	static_call(kvm_x86_set_cr4)(vcpu, cr4);
>  
>  	kvm_post_set_cr4(vcpu, old_cr4, cr4);

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>

Best regards,
	Maxim Levitsky





^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v6 16/25] KVM: x86: Report KVM supported CET MSRs as to-be-saved
  2023-09-14  6:33 ` [PATCH v6 16/25] KVM: x86: Report KVM supported CET MSRs as to-be-saved Yang Weijiang
  2023-10-08  6:19   ` Chao Gao
@ 2023-10-31 17:52   ` Maxim Levitsky
  1 sibling, 0 replies; 120+ messages in thread
From: Maxim Levitsky @ 2023-10-31 17:52 UTC (permalink / raw)
  To: Yang Weijiang, seanjc, pbonzini, kvm, linux-kernel
  Cc: dave.hansen, peterz, chao.gao, rick.p.edgecombe, john.allen

On Thu, 2023-09-14 at 02:33 -0400, Yang Weijiang wrote:
> Add CET MSRs to the list of MSRs reported to userspace if the feature,
> i.e. IBT or SHSTK, associated with the MSRs is supported by KVM.
> 
> SSP can only be read via RDSSP. Writing even requires destructive and
> potentially faulting operations such as SAVEPREVSSP/RSTORSSP or
> SETSSBSY/CLRSSBSY. Let the host use a pseudo-MSR that is just a wrapper
> for the GUEST_SSP field of the VMCS.

A fake MSR just feels wrong for future generations of readers of this code.
This is not an MSR no matter how we look at it, and KVM has never supported such
fake MSRs - this would be the first one.

I'd say it's better to define a new ioctl for this register,
or if you are feeling adventurous, you can try to add support for
KVM_GET_ONE_REG/KVM_SET_ONE_REG, which is what at least arm uses
for this purpose.


Also, I think it would be better to split this patch into two: a first patch that adds the new ioctl,
and a second patch that adds the normal CET MSRs to the list of MSRs to be saved.


> 
> Suggested-by: Chao Gao <chao.gao@intel.com>
> Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>
> ---
>  arch/x86/include/uapi/asm/kvm_para.h |  1 +
>  arch/x86/kvm/vmx/vmx.c               |  2 ++
>  arch/x86/kvm/x86.c                   | 18 ++++++++++++++++++
>  3 files changed, 21 insertions(+)
> 
> diff --git a/arch/x86/include/uapi/asm/kvm_para.h b/arch/x86/include/uapi/asm/kvm_para.h
> index 6e64b27b2c1e..9864bbcf2470 100644
> --- a/arch/x86/include/uapi/asm/kvm_para.h
> +++ b/arch/x86/include/uapi/asm/kvm_para.h
> @@ -58,6 +58,7 @@
>  #define MSR_KVM_ASYNC_PF_INT	0x4b564d06
>  #define MSR_KVM_ASYNC_PF_ACK	0x4b564d07
>  #define MSR_KVM_MIGRATION_CONTROL	0x4b564d08
> +#define MSR_KVM_SSP	0x4b564d09

Another reason not to do this: someone will think that this is a KVM PV MSR.

>  
>  struct kvm_steal_time {
>  	__u64 steal;
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 72e3943f3693..9409753f45b0 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -7009,6 +7009,8 @@ static bool vmx_has_emulated_msr(struct kvm *kvm, u32 index)
>  	case MSR_AMD64_TSC_RATIO:
>  		/* This is AMD only.  */
>  		return false;
> +	case MSR_KVM_SSP:
> +		return kvm_cpu_cap_has(X86_FEATURE_SHSTK);
>  	default:
>  		return true;
>  	}
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index dda9c7141ea1..73b45351c0fc 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -1476,6 +1476,9 @@ static const u32 msrs_to_save_base[] = {
>  
>  	MSR_IA32_XFD, MSR_IA32_XFD_ERR,
>  	MSR_IA32_XSS,
> +	MSR_IA32_U_CET, MSR_IA32_S_CET,
> +	MSR_IA32_PL0_SSP, MSR_IA32_PL1_SSP, MSR_IA32_PL2_SSP,
> +	MSR_IA32_PL3_SSP, MSR_IA32_INT_SSP_TAB,
>  };
>  
>  static const u32 msrs_to_save_pmu[] = {
> @@ -1576,6 +1579,7 @@ static const u32 emulated_msrs_all[] = {
>  
>  	MSR_K7_HWCR,
>  	MSR_KVM_POLL_CONTROL,
> +	MSR_KVM_SSP,
>  };
>  
>  static u32 emulated_msrs[ARRAY_SIZE(emulated_msrs_all)];
> @@ -7241,6 +7245,20 @@ static void kvm_probe_msr_to_save(u32 msr_index)
>  		if (!kvm_caps.supported_xss)
>  			return;
>  		break;
> +	case MSR_IA32_U_CET:
> +	case MSR_IA32_S_CET:
> +		if (!kvm_cpu_cap_has(X86_FEATURE_SHSTK) &&
> +		    !kvm_cpu_cap_has(X86_FEATURE_IBT))
> +			return;
> +		break;
> +	case MSR_IA32_INT_SSP_TAB:
> +		if (!kvm_cpu_cap_has(X86_FEATURE_LM))
> +			return;
> +		fallthrough;
> +	case MSR_IA32_PL0_SSP ... MSR_IA32_PL3_SSP:
> +		if (!kvm_cpu_cap_has(X86_FEATURE_SHSTK))
> +			return;
> +		break;
>  	default:
>  		break;
>  	}

Best regards,
	Maxim Levitsky






^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v6 17/25] KVM: VMX: Introduce CET VMCS fields and control bits
  2023-09-14  6:33 ` [PATCH v6 17/25] KVM: VMX: Introduce CET VMCS fields and control bits Yang Weijiang
@ 2023-10-31 17:52   ` Maxim Levitsky
  0 siblings, 0 replies; 120+ messages in thread
From: Maxim Levitsky @ 2023-10-31 17:52 UTC (permalink / raw)
  To: Yang Weijiang, seanjc, pbonzini, kvm, linux-kernel
  Cc: dave.hansen, peterz, chao.gao, rick.p.edgecombe, john.allen, Zhang Yi Z

On Thu, 2023-09-14 at 02:33 -0400, Yang Weijiang wrote:
> Control-flow Enforcement Technology (CET) is a kind of CPU feature used
> to prevent Return/CALL/Jump-Oriented Programming (ROP/COP/JOP) attacks.
> It provides two sub-features(SHSTK,IBT) to defend against ROP/COP/JOP
> style control-flow subversion attacks.
> 
> Shadow Stack (SHSTK):
>   A shadow stack is a second stack used exclusively for control transfer
>   operations. The shadow stack is separate from the data/normal stack and
>   can be enabled individually in user and kernel mode. When shadow stack
>   is enabled, CALL pushes the return address on both the data and shadow
>   stack. RET pops the return address from both stacks and compares them.
>   If the return addresses from the two stacks do not match, the processor
>   generates a #CP.
> 
> Indirect Branch Tracking (IBT):
>   IBT introduces instruction(ENDBRANCH)to mark valid target addresses of
>   indirect branches (CALL, JMP etc...). If an indirect branch is executed
>   and the next instruction is _not_ an ENDBRANCH, the processor generates
>   a #CP. These instruction behaves as a NOP on platforms that have no CET.
> 
> Several new CET MSRs are defined to support CET:
>   MSR_IA32_{U,S}_CET: CET settings for {user,supervisor} CET respectively.
> 
>   MSR_IA32_PL{0,1,2,3}_SSP: SHSTK pointer linear address for CPL{0,1,2,3}.
> 
>   MSR_IA32_INT_SSP_TAB: Linear address of SHSTK pointer table, whose entry
> 			is indexed by IST of interrupt gate desc.
> 
> Two XSAVES state bits are introduced for CET:
>   IA32_XSS:[bit 11]: Control saving/restoring user mode CET states
>   IA32_XSS:[bit 12]: Control saving/restoring supervisor mode CET states.
> 
> Six VMCS fields are introduced for CET:
>   {HOST,GUEST}_S_CET: Stores CET settings for kernel mode.
>   {HOST,GUEST}_SSP: Stores current active SSP.
>   {HOST,GUEST}_INTR_SSP_TABLE: Stores current active MSR_IA32_INT_SSP_TAB.
> 
> On Intel platforms, two additional bits are defined in VM_EXIT and VM_ENTRY
> control fields:
> If VM_EXIT_LOAD_CET_STATE = 1, host CET states are loaded from following
> VMCS fields at VM-Exit:
>   HOST_S_CET
>   HOST_SSP
>   HOST_INTR_SSP_TABLE
> 
> If VM_ENTRY_LOAD_CET_STATE = 1, guest CET states are loaded from following
> VMCS fields at VM-Entry:
>   GUEST_S_CET
>   GUEST_SSP
>   GUEST_INTR_SSP_TABLE
> 
> Reviewed-by: Chao Gao <chao.gao@intel.com>
> Co-developed-by: Zhang Yi Z <yi.z.zhang@linux.intel.com>
> Signed-off-by: Zhang Yi Z <yi.z.zhang@linux.intel.com>
> Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>
> ---
>  arch/x86/include/asm/vmx.h | 8 ++++++++
>  1 file changed, 8 insertions(+)
> 
> diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
> index 0e73616b82f3..451fd4f4fedc 100644
> --- a/arch/x86/include/asm/vmx.h
> +++ b/arch/x86/include/asm/vmx.h
> @@ -104,6 +104,7 @@
>  #define VM_EXIT_CLEAR_BNDCFGS                   0x00800000
>  #define VM_EXIT_PT_CONCEAL_PIP			0x01000000
>  #define VM_EXIT_CLEAR_IA32_RTIT_CTL		0x02000000
> +#define VM_EXIT_LOAD_CET_STATE                  0x10000000
Bit 28, matches PRM.
>  
>  #define VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR	0x00036dff
>  
> @@ -117,6 +118,7 @@
>  #define VM_ENTRY_LOAD_BNDCFGS                   0x00010000
>  #define VM_ENTRY_PT_CONCEAL_PIP			0x00020000
>  #define VM_ENTRY_LOAD_IA32_RTIT_CTL		0x00040000
> +#define VM_ENTRY_LOAD_CET_STATE                 0x00100000
Bit 20, matches PRM.


I wish we redefined these masks with BIT_ULL(n) macros to reduce the
chance of a mistake. Patches to refactor this are welcome!

>  
>  #define VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR	0x000011ff
>  
> @@ -345,6 +347,9 @@ enum vmcs_field {
>  	GUEST_PENDING_DBG_EXCEPTIONS    = 0x00006822,
>  	GUEST_SYSENTER_ESP              = 0x00006824,
>  	GUEST_SYSENTER_EIP              = 0x00006826,
> +	GUEST_S_CET                     = 0x00006828,
> +	GUEST_SSP                       = 0x0000682a,
> +	GUEST_INTR_SSP_TABLE            = 0x0000682c,
Matches the PRM.

>  	HOST_CR0                        = 0x00006c00,
>  	HOST_CR3                        = 0x00006c02,
>  	HOST_CR4                        = 0x00006c04,
> @@ -357,6 +362,9 @@ enum vmcs_field {
>  	HOST_IA32_SYSENTER_EIP          = 0x00006c12,
>  	HOST_RSP                        = 0x00006c14,
>  	HOST_RIP                        = 0x00006c16,

> +	HOST_S_CET                      = 0x00006c18,
> +	HOST_SSP                        = 0x00006c1a,
> +	HOST_INTR_SSP_TABLE             = 0x00006c1c
Matches the PRM as well.


Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>

Best regards,
	Maxim Levitsky



>  };
>  
>  /*






^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v6 18/25] KVM: x86: Use KVM-governed feature framework to track "SHSTK/IBT enabled"
  2023-09-14  6:33 ` [PATCH v6 18/25] KVM: x86: Use KVM-governed feature framework to track "SHSTK/IBT enabled" Yang Weijiang
@ 2023-10-31 17:54   ` Maxim Levitsky
  2023-11-01 15:46     ` Sean Christopherson
  0 siblings, 1 reply; 120+ messages in thread
From: Maxim Levitsky @ 2023-10-31 17:54 UTC (permalink / raw)
  To: Yang Weijiang, seanjc, pbonzini, kvm, linux-kernel
  Cc: dave.hansen, peterz, chao.gao, rick.p.edgecombe, john.allen

On Thu, 2023-09-14 at 02:33 -0400, Yang Weijiang wrote:
> Use the governed feature framework to track whether X86_FEATURE_SHSTK
> and X86_FEATURE_IBT features can be used by userspace and guest, i.e.,
> the features can be used iff both KVM and guest CPUID can support them.
> 
> Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>
> ---
>  arch/x86/kvm/governed_features.h | 2 ++
>  arch/x86/kvm/vmx/vmx.c           | 2 ++
>  2 files changed, 4 insertions(+)
> 
> diff --git a/arch/x86/kvm/governed_features.h b/arch/x86/kvm/governed_features.h
> index 423a73395c10..db7e21c5ecc2 100644
> --- a/arch/x86/kvm/governed_features.h
> +++ b/arch/x86/kvm/governed_features.h
> @@ -16,6 +16,8 @@ KVM_GOVERNED_X86_FEATURE(PAUSEFILTER)
>  KVM_GOVERNED_X86_FEATURE(PFTHRESHOLD)
>  KVM_GOVERNED_X86_FEATURE(VGIF)
>  KVM_GOVERNED_X86_FEATURE(VNMI)
> +KVM_GOVERNED_X86_FEATURE(SHSTK)
> +KVM_GOVERNED_X86_FEATURE(IBT)
>  
>  #undef KVM_GOVERNED_X86_FEATURE
>  #undef KVM_GOVERNED_FEATURE
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 9409753f45b0..fd5893b3a2c8 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -7765,6 +7765,8 @@ static void vmx_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
>  		kvm_governed_feature_check_and_set(vcpu, X86_FEATURE_XSAVES);
>  
>  	kvm_governed_feature_check_and_set(vcpu, X86_FEATURE_VMX);
> +	kvm_governed_feature_check_and_set(vcpu, X86_FEATURE_SHSTK);
> +	kvm_governed_feature_check_and_set(vcpu, X86_FEATURE_IBT);
>  
>  	vmx_setup_uret_msrs(vmx);
>  

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>


PS: IMHO the whole 'governed feature framework' is very confusing and somewhat poorly documented.

Currently the only partial explanation of it is at 'governed_features', which doesn't
explain how to use it.

For the reference this is how KVM expects governed features to be used in the common case
(there are some exceptions to this but they are rare)

1. If a feature is not enabled in host CPUID or KVM doesn't support it, 
   KVM is expected to not enable it in KVM cpu caps.

2. Userspace uploads guest CPUID.

3. After the guest CPUID upload, the vendor code calls kvm_governed_feature_check_and_set() which sets
	governed features = True iff feature is supported in both kvm cpu caps and in guest CPUID.

4. kvm/vendor code uses 'guest_can_use()' to query the value of the governed feature instead of reading
guest CPUID.

It might make sense to document the above somewhere at least.

Now about another thing I am thinking:

I do know that the mess of boolean flags that SVM had is worse than these governed features, and functionality-wise the two are equivalent.

However, thinking again about the whole thing:

IMHO the 'governed features' is another quite confusing term that a KVM developer will need to learn and keep in memory.

Because of that, can't we just use guest CPUID as a single source of truth and drop all the governed features code?

In most cases, the governed feature value differs from guest CPUID when a feature is enabled
in guest CPUID but not enabled in KVM caps.

I do see two exceptions to this: XSAVES on AMD and X86_FEATURE_GBPAGES, in which the opposite
happens: the governed feature is enabled even when the feature is hidden from guest CPUID.
It might be better, readability-wise, to deal with these cases manually; we are unlikely
to have many new such cases in the future.

So for the common case of CPUID mismatch, when the governed feature is disabled but the
guest CPUID bit is enabled, does it make sense to allow this?

A feature that is advertised as supported but doesn't really work is a recipe for hard-to-find guest bugs, IMHO.

IMHO it would be much better to just check this condition and do kvm_vm_bugged() or something
when a feature is enabled in guest CPUID but KVM can't support it, and then just use guest CPUID in 'guest_can_use()'.

Best regards,
	Maxim Levitsky







^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v6 19/25] KVM: VMX: Emulate read and write to CET MSRs
  2023-09-14  6:33 ` [PATCH v6 19/25] KVM: VMX: Emulate read and write to CET MSRs Yang Weijiang
@ 2023-10-31 17:55   ` Maxim Levitsky
  2023-11-01 16:31     ` Sean Christopherson
  0 siblings, 1 reply; 120+ messages in thread
From: Maxim Levitsky @ 2023-10-31 17:55 UTC (permalink / raw)
  To: Yang Weijiang, seanjc, pbonzini, kvm, linux-kernel
  Cc: dave.hansen, peterz, chao.gao, rick.p.edgecombe, john.allen

On Thu, 2023-09-14 at 02:33 -0400, Yang Weijiang wrote:
> Add emulation interface for CET MSR access. The emulation code is split
> into common part and vendor specific part. The former does common check
> for MSRs and reads/writes directly from/to XSAVE-managed MSRs via the
> helpers while the latter accesses the MSRs linked to VMCS fields.
> 
> Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>
> ---
>  arch/x86/kvm/vmx/vmx.c | 18 +++++++++++
>  arch/x86/kvm/x86.c     | 71 ++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 89 insertions(+)
> 
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index fd5893b3a2c8..9f4b56337251 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -2111,6 +2111,15 @@ static int vmx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
>  		else
>  			msr_info->data = vmx->pt_desc.guest.addr_a[index / 2];
>  		break;
> +	case MSR_IA32_S_CET:
> +		msr_info->data = vmcs_readl(GUEST_S_CET);
> +		break;
> +	case MSR_KVM_SSP:
> +		msr_info->data = vmcs_readl(GUEST_SSP);
> +		break;
> +	case MSR_IA32_INT_SSP_TAB:
> +		msr_info->data = vmcs_readl(GUEST_INTR_SSP_TABLE);
> +		break;
>  	case MSR_IA32_DEBUGCTLMSR:
>  		msr_info->data = vmcs_read64(GUEST_IA32_DEBUGCTL);
>  		break;
> @@ -2420,6 +2429,15 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
>  		else
>  			vmx->pt_desc.guest.addr_a[index / 2] = data;
>  		break;
> +	case MSR_IA32_S_CET:
> +		vmcs_writel(GUEST_S_CET, data);
> +		break;
> +	case MSR_KVM_SSP:
> +		vmcs_writel(GUEST_SSP, data);
> +		break;
> +	case MSR_IA32_INT_SSP_TAB:
> +		vmcs_writel(GUEST_INTR_SSP_TABLE, data);
> +		break;
>  	case MSR_IA32_PERF_CAPABILITIES:
>  		if (data && !vcpu_to_pmu(vcpu)->version)
>  			return 1;
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 73b45351c0fc..c85ee42ab4f1 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -1847,6 +1847,11 @@ bool kvm_msr_allowed(struct kvm_vcpu *vcpu, u32 index, u32 type)
>  }
>  EXPORT_SYMBOL_GPL(kvm_msr_allowed);
>  
> +#define CET_US_RESERVED_BITS		GENMASK(9, 6)
> +#define CET_US_SHSTK_MASK_BITS		GENMASK(1, 0)
> +#define CET_US_IBT_MASK_BITS		(GENMASK_ULL(5, 2) | GENMASK_ULL(63, 10))
> +#define CET_US_LEGACY_BITMAP_BASE(data)	((data) >> 12)
> +
>  /*
>   * Write @data into the MSR specified by @index.  Select MSR specific fault
>   * checks are bypassed if @host_initiated is %true.
> @@ -1856,6 +1861,7 @@ EXPORT_SYMBOL_GPL(kvm_msr_allowed);
>  static int __kvm_set_msr(struct kvm_vcpu *vcpu, u32 index, u64 data,
>  			 bool host_initiated)
>  {
> +	bool host_msr_reset = host_initiated && data == 0;

I really don't like this boolean. While 0 is usually the reset value, it doesn't have to
be (the SVM TSC ratio reset value is 1, for example).
Also, its name is confusing.

I suggest just open-coding this instead.

>  	struct msr_data msr;
>  
>  	switch (index) {
> @@ -1906,6 +1912,46 @@ static int __kvm_set_msr(struct kvm_vcpu *vcpu, u32 index, u64 data,
>  
>  		data = (u32)data;
>  		break;
> +	case MSR_IA32_U_CET:
> +	case MSR_IA32_S_CET:
> +		if (host_msr_reset && (kvm_cpu_cap_has(X86_FEATURE_SHSTK) ||
> +				       kvm_cpu_cap_has(X86_FEATURE_IBT)))
> +			break;
> +		if (!guest_can_use(vcpu, X86_FEATURE_SHSTK) &&
> +		    !guest_can_use(vcpu, X86_FEATURE_IBT))
> +			return 1;
> +		if (data & CET_US_RESERVED_BITS)
> +			return 1;
> +		if (!guest_can_use(vcpu, X86_FEATURE_SHSTK) &&
> +		    (data & CET_US_SHSTK_MASK_BITS))
> +			return 1;
> +		if (!guest_can_use(vcpu, X86_FEATURE_IBT) &&
> +		    (data & CET_US_IBT_MASK_BITS))
> +			return 1;
> +		if (!IS_ALIGNED(CET_US_LEGACY_BITMAP_BASE(data), 4))
> +			return 1;
> +
> +		/* IBT can be suppressed iff the TRACKER isn't WAIT_ENDBR. */
> +		if ((data & CET_SUPPRESS) && (data & CET_WAIT_ENDBR))
> +			return 1;
> +		break;
> +	case MSR_IA32_INT_SSP_TAB:
> +		if (!guest_cpuid_has(vcpu, X86_FEATURE_LM))
> +			return 1;
> +		fallthrough;
> +	case MSR_IA32_PL0_SSP ... MSR_IA32_PL3_SSP:
> +	case MSR_KVM_SSP:
> +		if (host_msr_reset && kvm_cpu_cap_has(X86_FEATURE_SHSTK))
> +			break;
> +		if (!guest_can_use(vcpu, X86_FEATURE_SHSTK))
> +			return 1;
> +		if (index == MSR_KVM_SSP && !host_initiated)
> +			return 1;
> +		if (is_noncanonical_address(data, vcpu))
> +			return 1;
> +		if (index != MSR_IA32_INT_SSP_TAB && !IS_ALIGNED(data, 4))
> +			return 1;
> +		break;
Once again, I'd prefer to have an ioctl for setting/getting SSP; that would make the above
code simpler (e.g. there would be no need to check that the write comes from the host, etc.).

>  	}
>  
>  	msr.data = data;
> @@ -1949,6 +1995,23 @@ static int __kvm_get_msr(struct kvm_vcpu *vcpu, u32 index, u64 *data,
>  		    !guest_cpuid_has(vcpu, X86_FEATURE_RDPID))
>  			return 1;
>  		break;
> +	case MSR_IA32_U_CET:
> +	case MSR_IA32_S_CET:
> +		if (!guest_can_use(vcpu, X86_FEATURE_IBT) &&
> +		    !guest_can_use(vcpu, X86_FEATURE_SHSTK))
> +			return 1;
> +		break;
> +	case MSR_IA32_INT_SSP_TAB:
> +		if (!guest_cpuid_has(vcpu, X86_FEATURE_LM))
> +			return 1;
> +		fallthrough;
> +	case MSR_IA32_PL0_SSP ... MSR_IA32_PL3_SSP:
> +	case MSR_KVM_SSP:
> +		if (!guest_can_use(vcpu, X86_FEATURE_SHSTK))
> +			return 1;
> +		if (index == MSR_KVM_SSP && !host_initiated)
> +			return 1;
> +		break;
>  	}
>  
>  	msr.index = index;
> @@ -4009,6 +4072,10 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
>  		vcpu->arch.guest_fpu.xfd_err = data;
>  		break;
>  #endif
> +	case MSR_IA32_U_CET:
> +	case MSR_IA32_PL0_SSP ... MSR_IA32_PL3_SSP:
> +		kvm_set_xstate_msr(vcpu, msr_info);

Ah, so here is where these functions (kvm_set_xstate_msr/kvm_get_xstate_msr) are used.
I think this patch should be the one that introduces them.

Also I will appreciate a comment to kvm_set_xstate_msr/kvm_get_xstate_msr saying something like:

"This function updates a guest MSR whose value is saved in the guest FPU state.
Wrap the write with a load/save of the guest FPU state to keep the state consistent with the new MSR value"

Or something similar, although I will not argue over this.

> +		break;
>  	default:
>  		if (kvm_pmu_is_valid_msr(vcpu, msr))
>  			return kvm_pmu_set_msr(vcpu, msr_info);
> @@ -4365,6 +4432,10 @@ int kvm_get_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
>  		msr_info->data = vcpu->arch.guest_fpu.xfd_err;
>  		break;
>  #endif
> +	case MSR_IA32_U_CET:
> +	case MSR_IA32_PL0_SSP ... MSR_IA32_PL3_SSP:
> +		kvm_get_xstate_msr(vcpu, msr_info);
> +		break;
>  	default:
>  		if (kvm_pmu_is_valid_msr(vcpu, msr_info->index))
>  			return kvm_pmu_get_msr(vcpu, msr_info);


Best regards,
	Maxim Levitsky





^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v6 20/25] KVM: x86: Save and reload SSP to/from SMRAM
  2023-09-14  6:33 ` [PATCH v6 20/25] KVM: x86: Save and reload SSP to/from SMRAM Yang Weijiang
@ 2023-10-31 17:55   ` Maxim Levitsky
  0 siblings, 0 replies; 120+ messages in thread
From: Maxim Levitsky @ 2023-10-31 17:55 UTC (permalink / raw)
  To: Yang Weijiang, seanjc, pbonzini, kvm, linux-kernel
  Cc: dave.hansen, peterz, chao.gao, rick.p.edgecombe, john.allen

On Thu, 2023-09-14 at 02:33 -0400, Yang Weijiang wrote:
> Save CET SSP to SMRAM on SMI and reload it on RSM. KVM emulates HW arch
> behavior when guest enters/leaves SMM mode,i.e., save registers to SMRAM
> at the entry of SMM and reload them at the exit to SMM. Per SDM, SSP is
> one of such registers on 64bit Arch, so add the support for SSP.
> 
> Suggested-by: Sean Christopherson <seanjc@google.com>
> Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>
> ---
>  arch/x86/kvm/smm.c | 8 ++++++++
>  arch/x86/kvm/smm.h | 2 +-
>  2 files changed, 9 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kvm/smm.c b/arch/x86/kvm/smm.c
> index b42111a24cc2..235fca95f103 100644
> --- a/arch/x86/kvm/smm.c
> +++ b/arch/x86/kvm/smm.c
> @@ -275,6 +275,10 @@ static void enter_smm_save_state_64(struct kvm_vcpu *vcpu,
>  	enter_smm_save_seg_64(vcpu, &smram->gs, VCPU_SREG_GS);
>  
>  	smram->int_shadow = static_call(kvm_x86_get_interrupt_shadow)(vcpu);
> +
> +	if (guest_can_use(vcpu, X86_FEATURE_SHSTK))
> +		KVM_BUG_ON(kvm_msr_read(vcpu, MSR_KVM_SSP, &smram->ssp),
> +			   vcpu->kvm);
>  }
>  #endif
>  
> @@ -565,6 +569,10 @@ static int rsm_load_state_64(struct x86_emulate_ctxt *ctxt,
>  	static_call(kvm_x86_set_interrupt_shadow)(vcpu, 0);
>  	ctxt->interruptibility = (u8)smstate->int_shadow;
>  
> +	if (guest_can_use(vcpu, X86_FEATURE_SHSTK))
> +		KVM_BUG_ON(kvm_msr_write(vcpu, MSR_KVM_SSP, smstate->ssp),
> +			   vcpu->kvm);
> +
>  	return X86EMUL_CONTINUE;
>  }
>  #endif
> diff --git a/arch/x86/kvm/smm.h b/arch/x86/kvm/smm.h
> index a1cf2ac5bd78..1e2a3e18207f 100644
> --- a/arch/x86/kvm/smm.h
> +++ b/arch/x86/kvm/smm.h
> @@ -116,8 +116,8 @@ struct kvm_smram_state_64 {
>  	u32 smbase;
>  	u32 reserved4[5];
>  
> -	/* ssp and svm_* fields below are not implemented by KVM */
>  	u64 ssp;
> +	/* svm_* fields below are not implemented by KVM */
>  	u64 svm_guest_pat;
>  	u64 svm_host_efer;
>  	u64 svm_host_cr4;


Just one note: for historical reasons, KVM supports two formats of the SMM save area: 32 bit and 64 bit.
The 32 bit format more or less resembles the format that true 32 bit Intel and AMD CPUs used,
while the 64 bit format more or less resembles the format that 64 bit AMD CPUs use (Intel uses a very different SMRAM layout).

The 32 bit format is used when X86_FEATURE_LM is not exposed in the guest CPUID, which is very rare (only 32 bit qemu doesn't set it),
and it lacks several fields because it is no longer maintained.

Still, for the sake of completeness, it might make sense to make enter_smm_save_state_32() fail if CET is
enabled (we'd need to add a return value and do 'goto error' in the main 'enter_smm' in that case).

I did a similar thing in the SVM code ('svm_enter_smm') when it detects the lack of X86_FEATURE_LM.

Best regards,
	Maxim Levitsky





^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v6 21/25] KVM: VMX: Set up interception for CET MSRs
  2023-09-14  6:33 ` [PATCH v6 21/25] KVM: VMX: Set up interception for CET MSRs Yang Weijiang
@ 2023-10-31 17:56   ` Maxim Levitsky
  0 siblings, 0 replies; 120+ messages in thread
From: Maxim Levitsky @ 2023-10-31 17:56 UTC (permalink / raw)
  To: Yang Weijiang, seanjc, pbonzini, kvm, linux-kernel
  Cc: dave.hansen, peterz, chao.gao, rick.p.edgecombe, john.allen

On Thu, 2023-09-14 at 02:33 -0400, Yang Weijiang wrote:
> Enable/disable CET MSRs interception per associated feature configuration.
> Shadow Stack feature requires all CET MSRs passed through to guest to make
> it supported in user and supervisor mode while IBT feature only depends on
> MSR_IA32_{U,S}_CETS_CET to enable user and supervisor IBT.

I don't think this statement is 100% true.

KVM can still technically intercept wrmsr/rdmsr access to all CET MSRs, because they should not be used
often by the guest, and this way it can show the guest values different from the actual
hardware values.

For example, KVM can hide (and maybe it should) the indirect branch tracking bits in MSR_IA32_S_CET
if only shadow stack is enabled and indirect branch tracking is disabled.

The real problem is that MSR_IA32_U_CET is indirectly allowed to be read/written unintercepted,
because of XSAVES (CET_U state component 11).

Note that on the other hand the MSR_IA32_S_CET is not saved/restored by XSAVES.

So this is what I think would be the best effort that KVM can do to separate the
two features:

1. If the support state of shadow stack and indirect branch tracking matches the host (the common case), then
it is simple:
	- allow both CET_S and CET_U XSAVES components
	- allow unintercepted access to all CET msrs

2. If only indirect branch tracking is enabled in the guest CPUID, but the *host also supports shadow stacks*:
	- don't expose to the guest either the CET_S nor CET_U XSAVES components.
	- only support IA32_S_CET/IA32_U_CET msrs, intercept them, 
          and hide the shadow stack bits from the guest.

3. If only shadow stacks are enabled in the guest CPUID but the *host also supports indirect branch tracking*:

	- intercept access to IA32_S_CET and IA32_U_CET and disallow 
	  indirect branch tracking bits to be set there.

	- for the sake of performance allow both CET_S and CET_U XSAVES components,
	  and accept the fact that these instructions can enable the hidden indirect branch
	  tracking bits there (this causes no harm to the host, and will likely let the
	  guest keep both pieces, fair for using undocumented features).

	  -or-

	  don't enable CET_U XSAVES component and hope that the guest can cope with this
	  by context switching the msrs instead.


	  Yet another solution is to enable the intercept of the XSAVES, and adjust
	  the saved/restored bits of CET_U msrs in the image after its emulation/execution.
	  (This can't be done on AMD, but at least this can be done on Intel, and AMD
	  so far doesn't support the indirect branch tracking at all).


Another, much simpler option is to fail the guest creation if the shadow stack + indirect branch tracking
state differs between host and the guest, unless both are disabled in the guest.
(in essence, don't let the guest be created if (2) or (3) happens)

Best regards,
	Maxim Levitsky


> 
> Note, this MSR design introduced an architectual limitation of SHSTK and
> IBT control for guest, i.e., when SHSTK is exposed, IBT is also available
> to guest from architectual perspective since IBT relies on subset of SHSTK
> relevant MSRs.
> 
> Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>
> ---
>  arch/x86/kvm/vmx/vmx.c | 42 ++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 42 insertions(+)
> 
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 9f4b56337251..30373258573d 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -699,6 +699,10 @@ static bool is_valid_passthrough_msr(u32 msr)
>  	case MSR_LBR_CORE_TO ... MSR_LBR_CORE_TO + 8:
>  		/* LBR MSRs. These are handled in vmx_update_intercept_for_lbr_msrs() */
>  		return true;
> +	case MSR_IA32_U_CET:
> +	case MSR_IA32_S_CET:
> +	case MSR_IA32_PL0_SSP ... MSR_IA32_INT_SSP_TAB:
> +		return true;
>  	}
>  
>  	r = possible_passthrough_msr_slot(msr) != -ENOENT;
> @@ -7769,6 +7773,42 @@ static void update_intel_pt_cfg(struct kvm_vcpu *vcpu)
>  		vmx->pt_desc.ctl_bitmask &= ~(0xfULL << (32 + i * 4));
>  }
>  
> +static void vmx_update_intercept_for_cet_msr(struct kvm_vcpu *vcpu)
> +{
> +	bool incpt;
> +
> +	if (kvm_cpu_cap_has(X86_FEATURE_SHSTK)) {
> +		incpt = !guest_cpuid_has(vcpu, X86_FEATURE_SHSTK);
> +
> +		vmx_set_intercept_for_msr(vcpu, MSR_IA32_U_CET,
> +					  MSR_TYPE_RW, incpt);
> +		vmx_set_intercept_for_msr(vcpu, MSR_IA32_S_CET,
> +					  MSR_TYPE_RW, incpt);
> +		vmx_set_intercept_for_msr(vcpu, MSR_IA32_PL0_SSP,
> +					  MSR_TYPE_RW, incpt);
> +		vmx_set_intercept_for_msr(vcpu, MSR_IA32_PL1_SSP,
> +					  MSR_TYPE_RW, incpt);
> +		vmx_set_intercept_for_msr(vcpu, MSR_IA32_PL2_SSP,
> +					  MSR_TYPE_RW, incpt);
> +		vmx_set_intercept_for_msr(vcpu, MSR_IA32_PL3_SSP,
> +					  MSR_TYPE_RW, incpt);
> +		if (guest_cpuid_has(vcpu, X86_FEATURE_LM))
> +			vmx_set_intercept_for_msr(vcpu, MSR_IA32_INT_SSP_TAB,
> +						  MSR_TYPE_RW, incpt);
> +		if (!incpt)
> +			return;
> +	}
> +
> +	if (kvm_cpu_cap_has(X86_FEATURE_IBT)) {
> +		incpt = !guest_cpuid_has(vcpu, X86_FEATURE_IBT);
> +
> +		vmx_set_intercept_for_msr(vcpu, MSR_IA32_U_CET,
> +					  MSR_TYPE_RW, incpt);
> +		vmx_set_intercept_for_msr(vcpu, MSR_IA32_S_CET,
> +					  MSR_TYPE_RW, incpt);
> +	}
> +}
> +
>  static void vmx_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
>  {
>  	struct vcpu_vmx *vmx = to_vmx(vcpu);
> @@ -7846,6 +7886,8 @@ static void vmx_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
>  
>  	/* Refresh #PF interception to account for MAXPHYADDR changes. */
>  	vmx_update_exception_bitmap(vcpu);
> +
> +	vmx_update_intercept_for_cet_msr(vcpu);
>  }
>  
>  static u64 vmx_get_perf_capabilities(void)






^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v6 22/25] KVM: VMX: Set host constant supervisor states to VMCS fields
  2023-09-14  6:33 ` [PATCH v6 22/25] KVM: VMX: Set host constant supervisor states to VMCS fields Yang Weijiang
@ 2023-10-31 17:56   ` Maxim Levitsky
  0 siblings, 0 replies; 120+ messages in thread
From: Maxim Levitsky @ 2023-10-31 17:56 UTC (permalink / raw)
  To: Yang Weijiang, seanjc, pbonzini, kvm, linux-kernel
  Cc: dave.hansen, peterz, chao.gao, rick.p.edgecombe, john.allen

On Thu, 2023-09-14 at 02:33 -0400, Yang Weijiang wrote:
> Save constant values to the HOST_{S_CET,SSP,INTR_SSP_TABLE} fields explicitly.
> Kernel IBT is supported and the setting in MSR_IA32_S_CET is static after
> post-boot (the exception is the BIOS call case, but a vCPU thread never
> crosses it), so KVM doesn't need to refresh the HOST_S_CET field before every
> VM-Enter/VM-Exit sequence.
> 
> Host supervisor shadow stack is not enabled for now and SSP is not accessible
> to kernel mode, thus it's safe to set the host IA32_INT_SSP_TAB/SSP VMCS
> fields to 0s. When shadow stack is enabled for CPL3, SSP is reloaded from
> PL3_SSP before execution returns to userspace. Check SDM Vol 2A/B Chapter 3/4
> for SYSCALL/SYSRET/SYSENTER/SYSEXIT/RDSSP/CALL etc.
> 
> Prevent KVM module loading if the host supervisor shadow stack SHSTK_EN bit
> is set in MSR_IA32_S_CET, as KVM cannot co-exist with it correctly.
> 
> Suggested-by: Sean Christopherson <seanjc@google.com>
> Suggested-by: Chao Gao <chao.gao@intel.com>
> Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>
> ---
>  arch/x86/kvm/vmx/capabilities.h |  4 ++++
>  arch/x86/kvm/vmx/vmx.c          | 15 +++++++++++++++
>  arch/x86/kvm/x86.c              | 14 ++++++++++++++
>  arch/x86/kvm/x86.h              |  1 +
>  4 files changed, 34 insertions(+)
> 
> diff --git a/arch/x86/kvm/vmx/capabilities.h b/arch/x86/kvm/vmx/capabilities.h
> index 41a4533f9989..ee8938818c8a 100644
> --- a/arch/x86/kvm/vmx/capabilities.h
> +++ b/arch/x86/kvm/vmx/capabilities.h
> @@ -106,6 +106,10 @@ static inline bool cpu_has_load_perf_global_ctrl(void)
>  	return vmcs_config.vmentry_ctrl & VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL;
>  }
>  
> +static inline bool cpu_has_load_cet_ctrl(void)
> +{
> +	return (vmcs_config.vmentry_ctrl & VM_ENTRY_LOAD_CET_STATE);
> +}
>  static inline bool cpu_has_vmx_mpx(void)
>  {
>  	return vmcs_config.vmentry_ctrl & VM_ENTRY_LOAD_BNDCFGS;
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 30373258573d..9ccc2c552f55 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -4375,6 +4375,21 @@ void vmx_set_constant_host_state(struct vcpu_vmx *vmx)
>  
>  	if (cpu_has_load_ia32_efer())
>  		vmcs_write64(HOST_IA32_EFER, host_efer);
> +
> +	/*
> +	 * Supervisor shadow stack is not enabled on host side, i.e.,
> +	 * host IA32_S_CET.SHSTK_EN bit is guaranteed to 0 now, per SDM
> +	 * description(RDSSP instruction), SSP is not readable in CPL0,
> +	 * so resetting the two registers to 0s at VM-Exit does no harm
> +	 * to kernel execution. When execution flow exits to userspace,
> +	 * SSP is reloaded from IA32_PL3_SSP. Check SDM Vol.2A/B Chapter
> +	 * 3 and 4 for details.
> +	 */
> +	if (cpu_has_load_cet_ctrl()) {
> +		vmcs_writel(HOST_S_CET, host_s_cet);
> +		vmcs_writel(HOST_SSP, 0);
> +		vmcs_writel(HOST_INTR_SSP_TABLE, 0);
> +	}
>  }
>  
>  void set_cr4_guest_host_mask(struct vcpu_vmx *vmx)
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index c85ee42ab4f1..231d4a7b6f3d 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -114,6 +114,8 @@ static u64 __read_mostly efer_reserved_bits = ~((u64)EFER_SCE);
>  #endif
>  
>  static u64 __read_mostly cr4_reserved_bits = CR4_RESERVED_BITS;
> +u64 __read_mostly host_s_cet;
> +EXPORT_SYMBOL_GPL(host_s_cet);
>  
>  #define KVM_EXIT_HYPERCALL_VALID_MASK (1 << KVM_HC_MAP_GPA_RANGE)
>  
> @@ -9618,6 +9620,18 @@ static int __kvm_x86_vendor_init(struct kvm_x86_init_ops *ops)
>  		return -EIO;
>  	}
>  
> +	if (boot_cpu_has(X86_FEATURE_SHSTK)) {
> +		rdmsrl(MSR_IA32_S_CET, host_s_cet);
> +		/*
> +		 * Linux doesn't yet support supervisor shadow stacks (SSS), so
> +		 * KVM doesn't save/restore the associated MSRs, i.e. KVM may
> +		 * clobber the host values.  Yell and refuse to load if SSS is
> +		 * unexpectedly enabled, e.g. to avoid crashing the host.
> +		 */
> +		if (WARN_ON_ONCE(host_s_cet & CET_SHSTK_EN))
> +			return -EIO;
This is a good idea.

> +	}
> +
>  	x86_emulator_cache = kvm_alloc_emulator_cache();
>  	if (!x86_emulator_cache) {
>  		pr_err("failed to allocate cache for x86 emulator\n");
> diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
> index 9a8e3a84eaf4..0d5f673338dd 100644
> --- a/arch/x86/kvm/x86.h
> +++ b/arch/x86/kvm/x86.h
> @@ -324,6 +324,7 @@ fastpath_t handle_fastpath_set_msr_irqoff(struct kvm_vcpu *vcpu);
>  extern u64 host_xcr0;
>  extern u64 host_xss;
>  extern u64 host_arch_capabilities;
> +extern u64 host_s_cet;
>  
>  extern struct kvm_caps kvm_caps;
>  


Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>

Best regards,
	Maxim Levitsky







* Re: [PATCH v6 23/25] KVM: x86: Enable CET virtualization for VMX and advertise to userspace
  2023-09-14  6:33 ` [PATCH v6 23/25] KVM: x86: Enable CET virtualization for VMX and advertise to userspace Yang Weijiang
  2023-09-24 13:38   ` kernel test robot
@ 2023-10-31 17:56   ` Maxim Levitsky
  2023-11-01 22:14     ` Sean Christopherson
  1 sibling, 1 reply; 120+ messages in thread
From: Maxim Levitsky @ 2023-10-31 17:56 UTC (permalink / raw)
  To: Yang Weijiang, seanjc, pbonzini, kvm, linux-kernel
  Cc: dave.hansen, peterz, chao.gao, rick.p.edgecombe, john.allen

On Thu, 2023-09-14 at 02:33 -0400, Yang Weijiang wrote:
> Expose CET features to the guest if KVM/host can support them, and clear the
> CPUID feature bits if they cannot be supported.
> 
> Set CPUID feature bits so that CET features are available in guest CPUID.
> Add CR4.CET bit support in order to allow the guest to set the CET master
> control bit.
> 
> Disable the KVM CET feature if unrestricted_guest is unsupported/disabled, as
> KVM does not support emulating CET.
> Don't expose the CET feature if either of the {U,S}_CET xstate bits is
> cleared in host XSS or if XSAVES isn't supported.
> 
> The CET load-bits in the VM_ENTRY/VM_EXIT control fields should be set to
> keep guest CET xstates isolated from the host's. All platforms that support
> CET enumerate VMX_BASIC[bit56] as 1, so clear the CET feature bits if the
> bit doesn't read 1.
> 
> Regarding the CET MSR contents after Reset/INIT, the SDM doesn't mention the
> default values, nor have I gotten the answer internally so far; I will fill
> the gap once it's clear.
> 
> Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>
> ---
>  arch/x86/include/asm/kvm_host.h  |  3 ++-
>  arch/x86/include/asm/msr-index.h |  1 +
>  arch/x86/kvm/cpuid.c             | 12 ++++++++++--
>  arch/x86/kvm/vmx/capabilities.h  |  6 ++++++
>  arch/x86/kvm/vmx/vmx.c           | 23 ++++++++++++++++++++++-
>  arch/x86/kvm/vmx/vmx.h           |  6 ++++--
>  arch/x86/kvm/x86.c               | 12 +++++++++++-
>  arch/x86/kvm/x86.h               |  3 +++
>  8 files changed, 59 insertions(+), 7 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index d77b030e996c..db0010fa3363 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -125,7 +125,8 @@
>  			  | X86_CR4_PGE | X86_CR4_PCE | X86_CR4_OSFXSR | X86_CR4_PCIDE \
>  			  | X86_CR4_OSXSAVE | X86_CR4_SMEP | X86_CR4_FSGSBASE \
>  			  | X86_CR4_OSXMMEXCPT | X86_CR4_LA57 | X86_CR4_VMXE \
> -			  | X86_CR4_SMAP | X86_CR4_PKE | X86_CR4_UMIP))
> +			  | X86_CR4_SMAP | X86_CR4_PKE | X86_CR4_UMIP \
> +			  | X86_CR4_CET))
>  
>  #define CR8_RESERVED_BITS (~(unsigned long)X86_CR8_TPR)
>  
> diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
> index 1d111350197f..1f8dc04da468 100644
> --- a/arch/x86/include/asm/msr-index.h
> +++ b/arch/x86/include/asm/msr-index.h
> @@ -1091,6 +1091,7 @@
>  #define VMX_BASIC_MEM_TYPE_MASK	0x003c000000000000LLU
>  #define VMX_BASIC_MEM_TYPE_WB	6LLU
>  #define VMX_BASIC_INOUT		0x0040000000000000LLU
> +#define VMX_BASIC_NO_HW_ERROR_CODE_CC	0x0100000000000000LLU
>  
>  /* Resctrl MSRs: */
>  /* - Intel: */
> diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
> index 4e7a820cba62..d787a506746a 100644
> --- a/arch/x86/kvm/cpuid.c
> +++ b/arch/x86/kvm/cpuid.c
> @@ -654,7 +654,7 @@ void kvm_set_cpu_caps(void)
>  		F(AVX512_VPOPCNTDQ) | F(UMIP) | F(AVX512_VBMI2) | F(GFNI) |
>  		F(VAES) | F(VPCLMULQDQ) | F(AVX512_VNNI) | F(AVX512_BITALG) |
>  		F(CLDEMOTE) | F(MOVDIRI) | F(MOVDIR64B) | 0 /*WAITPKG*/ |
> -		F(SGX_LC) | F(BUS_LOCK_DETECT)
> +		F(SGX_LC) | F(BUS_LOCK_DETECT) | F(SHSTK)
>  	);
>  	/* Set LA57 based on hardware capability. */
>  	if (cpuid_ecx(7) & F(LA57))
> @@ -672,7 +672,8 @@ void kvm_set_cpu_caps(void)
>  		F(SPEC_CTRL_SSBD) | F(ARCH_CAPABILITIES) | F(INTEL_STIBP) |
>  		F(MD_CLEAR) | F(AVX512_VP2INTERSECT) | F(FSRM) |
>  		F(SERIALIZE) | F(TSXLDTRK) | F(AVX512_FP16) |
> -		F(AMX_TILE) | F(AMX_INT8) | F(AMX_BF16) | F(FLUSH_L1D)
> +		F(AMX_TILE) | F(AMX_INT8) | F(AMX_BF16) | F(FLUSH_L1D) |
> +		F(IBT)
>  	);
>  
>  	/* TSC_ADJUST and ARCH_CAPABILITIES are emulated in software. */
> @@ -685,6 +686,13 @@ void kvm_set_cpu_caps(void)
>  		kvm_cpu_cap_set(X86_FEATURE_INTEL_STIBP);
>  	if (boot_cpu_has(X86_FEATURE_AMD_SSBD))
>  		kvm_cpu_cap_set(X86_FEATURE_SPEC_CTRL_SSBD);
> +	/*
> +	 * The feature bit in boot_cpu_data.x86_capability could have been
> +	 * cleared due to ibt=off cmdline option, then add it back if CPU
> +	 * supports IBT.
> +	 */
> +	if (cpuid_edx(7) & F(IBT))
> +		kvm_cpu_cap_set(X86_FEATURE_IBT);

The usual policy is that when the host doesn't support a feature, the guest
should not support it either. On the other hand, for this particular feature
it is probably safe to use it. Just a point for discussion.

>  
>  	kvm_cpu_cap_mask(CPUID_7_1_EAX,
>  		F(AVX_VNNI) | F(AVX512_BF16) | F(CMPCCXADD) |
> diff --git a/arch/x86/kvm/vmx/capabilities.h b/arch/x86/kvm/vmx/capabilities.h
> index ee8938818c8a..e12bc233d88b 100644
> --- a/arch/x86/kvm/vmx/capabilities.h
> +++ b/arch/x86/kvm/vmx/capabilities.h
> @@ -79,6 +79,12 @@ static inline bool cpu_has_vmx_basic_inout(void)
>  	return	(((u64)vmcs_config.basic_cap << 32) & VMX_BASIC_INOUT);
>  }
>  
> +static inline bool cpu_has_vmx_basic_no_hw_errcode(void)
> +{
> +	return	((u64)vmcs_config.basic_cap << 32) &
> +		 VMX_BASIC_NO_HW_ERROR_CODE_CC;
> +}

I see — this is because #CP does have an error code, but when bit 56 of
IA32_VMX_BASIC is clear, an error code must be present iff the exception is
within a hardcoded list of exceptions. #CP is not on that list because it
didn't exist when the list was defined, and all new CPUs have bit 56 set.

I am not 100% sure that this check is worth it; I don't mind having it,
though, but please add a comment explaining why bit 56 is needed for CET.

> +
>  static inline bool cpu_has_virtual_nmis(void)
>  {
>  	return vmcs_config.pin_based_exec_ctrl & PIN_BASED_VIRTUAL_NMIS &&
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 9ccc2c552f55..f0dea8ecd0c6 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -2614,6 +2614,7 @@ static int setup_vmcs_config(struct vmcs_config *vmcs_conf,
>  		{ VM_ENTRY_LOAD_IA32_EFER,		VM_EXIT_LOAD_IA32_EFER },
>  		{ VM_ENTRY_LOAD_BNDCFGS,		VM_EXIT_CLEAR_BNDCFGS },
>  		{ VM_ENTRY_LOAD_IA32_RTIT_CTL,		VM_EXIT_CLEAR_IA32_RTIT_CTL },
> +		{ VM_ENTRY_LOAD_CET_STATE,		VM_EXIT_LOAD_CET_STATE },
>  	};
>  
>  	memset(vmcs_conf, 0, sizeof(*vmcs_conf));
> @@ -4934,6 +4935,9 @@ static void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
>  		vmcs_write64(GUEST_BNDCFGS, 0);
>  
>  	vmcs_write32(VM_ENTRY_INTR_INFO_FIELD, 0);  /* 22.2.1 */


I guess that the 3 writes below should only be done if CET is supported;
this is what the kernel test robot is complaining about.


> +	vmcs_writel(GUEST_SSP, 0);
> +	vmcs_writel(GUEST_S_CET, 0);
> +	vmcs_writel(GUEST_INTR_SSP_TABLE, 0);
>  
>  	kvm_make_request(KVM_REQ_APIC_PAGE_RELOAD, vcpu);
>  
> @@ -6354,6 +6358,12 @@ void dump_vmcs(struct kvm_vcpu *vcpu)
>  	if (vmcs_read32(VM_EXIT_MSR_STORE_COUNT) > 0)
>  		vmx_dump_msrs("guest autostore", &vmx->msr_autostore.guest);
>  
> +	if (vmentry_ctl & VM_ENTRY_LOAD_CET_STATE) {
> +		pr_err("S_CET = 0x%016lx\n", vmcs_readl(GUEST_S_CET));
> +		pr_err("SSP = 0x%016lx\n", vmcs_readl(GUEST_SSP));
> +		pr_err("INTR SSP TABLE = 0x%016lx\n",
> +		       vmcs_readl(GUEST_INTR_SSP_TABLE));
> +	}
>  	pr_err("*** Host State ***\n");
>  	pr_err("RIP = 0x%016lx  RSP = 0x%016lx\n",
>  	       vmcs_readl(HOST_RIP), vmcs_readl(HOST_RSP));
> @@ -6431,6 +6441,12 @@ void dump_vmcs(struct kvm_vcpu *vcpu)
>  	if (secondary_exec_control & SECONDARY_EXEC_ENABLE_VPID)
>  		pr_err("Virtual processor ID = 0x%04x\n",
>  		       vmcs_read16(VIRTUAL_PROCESSOR_ID));
> +	if (vmexit_ctl & VM_EXIT_LOAD_CET_STATE) {
> +		pr_err("S_CET = 0x%016lx\n", vmcs_readl(HOST_S_CET));
> +		pr_err("SSP = 0x%016lx\n", vmcs_readl(HOST_SSP));
> +		pr_err("INTR SSP TABLE = 0x%016lx\n",
> +		       vmcs_readl(HOST_INTR_SSP_TABLE));
> +	}
>  }
>  
>  /*
> @@ -7967,7 +7983,6 @@ static __init void vmx_set_cpu_caps(void)
>  		kvm_cpu_cap_set(X86_FEATURE_UMIP);
>  
>  	/* CPUID 0xD.1 */
> -	kvm_caps.supported_xss = 0;
>  	if (!cpu_has_vmx_xsaves())
>  		kvm_cpu_cap_clear(X86_FEATURE_XSAVES);
>  
> @@ -7979,6 +7994,12 @@ static __init void vmx_set_cpu_caps(void)
>  
>  	if (cpu_has_vmx_waitpkg())
>  		kvm_cpu_cap_check_and_set(X86_FEATURE_WAITPKG);
> +
> +	if (!cpu_has_load_cet_ctrl() || !enable_unrestricted_guest ||
> +	    !cpu_has_vmx_basic_no_hw_errcode()) {
> +		kvm_cpu_cap_clear(X86_FEATURE_SHSTK);
> +		kvm_cpu_cap_clear(X86_FEATURE_IBT);

I think that here we also need to clear kvm_caps.supported_xss, or even
better, let's set the CET bits in kvm_caps.supported_xss only once CET is
fully enabled (i.e., both this check and the check in __kvm_x86_vendor_init
pass).

> +	}
>  }
>  
>  static void vmx_request_immediate_exit(struct kvm_vcpu *vcpu)
> diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
> index c2130d2c8e24..fb72819fbb41 100644
> --- a/arch/x86/kvm/vmx/vmx.h
> +++ b/arch/x86/kvm/vmx/vmx.h
> @@ -480,7 +480,8 @@ static inline u8 vmx_get_rvi(void)
>  	 VM_ENTRY_LOAD_IA32_EFER |					\
>  	 VM_ENTRY_LOAD_BNDCFGS |					\
>  	 VM_ENTRY_PT_CONCEAL_PIP |					\
> -	 VM_ENTRY_LOAD_IA32_RTIT_CTL)
> +	 VM_ENTRY_LOAD_IA32_RTIT_CTL |					\
> +	 VM_ENTRY_LOAD_CET_STATE)
>  
>  #define __KVM_REQUIRED_VMX_VM_EXIT_CONTROLS				\
>  	(VM_EXIT_SAVE_DEBUG_CONTROLS |					\
> @@ -502,7 +503,8 @@ static inline u8 vmx_get_rvi(void)
>  	       VM_EXIT_LOAD_IA32_EFER |					\
>  	       VM_EXIT_CLEAR_BNDCFGS |					\
>  	       VM_EXIT_PT_CONCEAL_PIP |					\
> -	       VM_EXIT_CLEAR_IA32_RTIT_CTL)
> +	       VM_EXIT_CLEAR_IA32_RTIT_CTL |				\
> +	       VM_EXIT_LOAD_CET_STATE)
>  
>  #define KVM_REQUIRED_VMX_PIN_BASED_VM_EXEC_CONTROL			\
>  	(PIN_BASED_EXT_INTR_MASK |					\
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 231d4a7b6f3d..b7d1ac6b8d75 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -231,7 +231,8 @@ static struct kvm_user_return_msrs __percpu *user_return_msrs;
>  				| XFEATURE_MASK_BNDCSR | XFEATURE_MASK_AVX512 \
>  				| XFEATURE_MASK_PKRU | XFEATURE_MASK_XTILE)
>  
> -#define KVM_SUPPORTED_XSS     0
> +#define KVM_SUPPORTED_XSS	(XFEATURE_MASK_CET_USER | \
> +				 XFEATURE_MASK_CET_KERNEL)
>  
>  u64 __read_mostly host_efer;
>  EXPORT_SYMBOL_GPL(host_efer);
> @@ -9699,6 +9700,15 @@ static int __kvm_x86_vendor_init(struct kvm_x86_init_ops *ops)
>  	if (!kvm_cpu_cap_has(X86_FEATURE_XSAVES))
>  		kvm_caps.supported_xss = 0;
>  
> +	if ((kvm_caps.supported_xss & (XFEATURE_MASK_CET_USER |
> +	     XFEATURE_MASK_CET_KERNEL)) !=
> +	    (XFEATURE_MASK_CET_USER | XFEATURE_MASK_CET_KERNEL)) {
> +		kvm_cpu_cap_clear(X86_FEATURE_SHSTK);
> +		kvm_cpu_cap_clear(X86_FEATURE_IBT);
> +		kvm_caps.supported_xss &= ~XFEATURE_CET_USER;
> +		kvm_caps.supported_xss &= ~XFEATURE_CET_KERNEL;
> +	}
> +
>  #define __kvm_cpu_cap_has(UNUSED_, f) kvm_cpu_cap_has(f)
>  	cr4_reserved_bits = __cr4_reserved_bits(__kvm_cpu_cap_has, UNUSED_);
>  #undef __kvm_cpu_cap_has
> diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
> index 0d5f673338dd..665a7f91d04f 100644
> --- a/arch/x86/kvm/x86.h
> +++ b/arch/x86/kvm/x86.h
> @@ -530,6 +530,9 @@ bool kvm_msr_allowed(struct kvm_vcpu *vcpu, u32 index, u32 type);
>  		__reserved_bits |= X86_CR4_VMXE;        \
>  	if (!__cpu_has(__c, X86_FEATURE_PCID))          \
>  		__reserved_bits |= X86_CR4_PCIDE;       \
> +	if (!__cpu_has(__c, X86_FEATURE_SHSTK) &&       \
> +	    !__cpu_has(__c, X86_FEATURE_IBT))           \
> +		__reserved_bits |= X86_CR4_CET;         \
>  	__reserved_bits;                                \
>  })
>  


Best regards,
	Maxim Levitsky








* Re: [PATCH v6 24/25] KVM: nVMX: Introduce new VMX_BASIC bit for event error_code delivery to L1
  2023-09-14  6:33 ` [PATCH v6 24/25] KVM: nVMX: Introduce new VMX_BASIC bit for event error_code delivery to L1 Yang Weijiang
@ 2023-10-31 17:57   ` Maxim Levitsky
  2023-11-01  4:21   ` Chao Gao
  1 sibling, 0 replies; 120+ messages in thread
From: Maxim Levitsky @ 2023-10-31 17:57 UTC (permalink / raw)
  To: Yang Weijiang, seanjc, pbonzini, kvm, linux-kernel
  Cc: dave.hansen, peterz, chao.gao, rick.p.edgecombe, john.allen

On Thu, 2023-09-14 at 02:33 -0400, Yang Weijiang wrote:
> Per the SDM description (Vol.3D, Appendix A.1):
> "If bit 56 is read as 1, software can use VM entry to deliver a hardware
> exception with or without an error code, regardless of vector"
> 
> Modify the has_error_code check before injecting events into the nested
> guest. Only enforce the check when the guest is in real mode, the exception
> is not a hard exception, or the platform doesn't enumerate bit 56 in
> VMX_BASIC; in all other cases skip the check to make the logic consistent
> with the SDM.
> 
> Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>
> ---
>  arch/x86/kvm/vmx/nested.c | 22 ++++++++++++++--------
>  arch/x86/kvm/vmx/nested.h |  5 +++++
>  2 files changed, 19 insertions(+), 8 deletions(-)
> 
> diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> index c5ec0ef51ff7..78a3be394d00 100644
> --- a/arch/x86/kvm/vmx/nested.c
> +++ b/arch/x86/kvm/vmx/nested.c
> @@ -1205,9 +1205,9 @@ static int vmx_restore_vmx_basic(struct vcpu_vmx *vmx, u64 data)
>  {
>  	const u64 feature_and_reserved =
>  		/* feature (except bit 48; see below) */
> -		BIT_ULL(49) | BIT_ULL(54) | BIT_ULL(55) |
> +		BIT_ULL(49) | BIT_ULL(54) | BIT_ULL(55) | BIT_ULL(56) |
>  		/* reserved */
> -		BIT_ULL(31) | GENMASK_ULL(47, 45) | GENMASK_ULL(63, 56);
> +		BIT_ULL(31) | GENMASK_ULL(47, 45) | GENMASK_ULL(63, 57);
>  	u64 vmx_basic = vmcs_config.nested.basic;
>  
>  	if (!is_bitwise_subset(vmx_basic, data, feature_and_reserved))
> @@ -2846,12 +2846,16 @@ static int nested_check_vm_entry_controls(struct kvm_vcpu *vcpu,
>  		    CC(intr_type == INTR_TYPE_OTHER_EVENT && vector != 0))
>  			return -EINVAL;
>  
> -		/* VM-entry interruption-info field: deliver error code */
> -		should_have_error_code =
> -			intr_type == INTR_TYPE_HARD_EXCEPTION && prot_mode &&
> -			x86_exception_has_error_code(vector);
> -		if (CC(has_error_code != should_have_error_code))
> -			return -EINVAL;
> +		if (!prot_mode || intr_type != INTR_TYPE_HARD_EXCEPTION ||
> +		    !nested_cpu_has_no_hw_errcode_cc(vcpu)) {
> +			/* VM-entry interruption-info field: deliver error code */
> +			should_have_error_code =
> +				intr_type == INTR_TYPE_HARD_EXCEPTION &&
> +				prot_mode &&
> +				x86_exception_has_error_code(vector);
> +			if (CC(has_error_code != should_have_error_code))
> +				return -EINVAL;
> +		}
>  
>  		/* VM-entry exception error code */
>  		if (CC(has_error_code &&
> @@ -6968,6 +6972,8 @@ static void nested_vmx_setup_basic(struct nested_vmx_msrs *msrs)
>  
>  	if (cpu_has_vmx_basic_inout())
>  		msrs->basic |= VMX_BASIC_INOUT;
> +	if (cpu_has_vmx_basic_no_hw_errcode())
> +		msrs->basic |= VMX_BASIC_NO_HW_ERROR_CODE_CC;
>  }
>  
>  static void nested_vmx_setup_cr_fixed(struct nested_vmx_msrs *msrs)
> diff --git a/arch/x86/kvm/vmx/nested.h b/arch/x86/kvm/vmx/nested.h
> index b4b9d51438c6..26842da6857d 100644
> --- a/arch/x86/kvm/vmx/nested.h
> +++ b/arch/x86/kvm/vmx/nested.h
> @@ -284,6 +284,11 @@ static inline bool nested_cr4_valid(struct kvm_vcpu *vcpu, unsigned long val)
>  	       __kvm_is_valid_cr4(vcpu, val);
>  }
>  
> +static inline bool nested_cpu_has_no_hw_errcode_cc(struct kvm_vcpu *vcpu)
> +{
> +	return to_vmx(vcpu)->nested.msrs.basic & VMX_BASIC_NO_HW_ERROR_CODE_CC;
> +}
> +
>  /* No difference in the restrictions on guest and host CR4 in VMX operation. */
>  #define nested_guest_cr4_valid	nested_cr4_valid
>  #define nested_host_cr4_valid	nested_cr4_valid

Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>

Best regards,
	Maxim Levitsky






* Re: [PATCH v6 25/25] KVM: nVMX: Enable CET support for nested guest
  2023-09-14  6:33 ` [PATCH v6 25/25] KVM: nVMX: Enable CET support for nested guest Yang Weijiang
@ 2023-10-31 17:57   ` Maxim Levitsky
  2023-11-01  2:09   ` Chao Gao
  1 sibling, 0 replies; 120+ messages in thread
From: Maxim Levitsky @ 2023-10-31 17:57 UTC (permalink / raw)
  To: Yang Weijiang, seanjc, pbonzini, kvm, linux-kernel
  Cc: dave.hansen, peterz, chao.gao, rick.p.edgecombe, john.allen

On Thu, 2023-09-14 at 02:33 -0400, Yang Weijiang wrote:
> Set up CET MSRs, the related VM_ENTRY/EXIT control bits, and the fixed CR4
> setting to enable CET for nested VMs.
> 
> Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>
> ---
>  arch/x86/kvm/vmx/nested.c | 27 +++++++++++++++++++++++++--
>  arch/x86/kvm/vmx/vmcs12.c |  6 ++++++
>  arch/x86/kvm/vmx/vmcs12.h | 14 +++++++++++++-
>  arch/x86/kvm/vmx/vmx.c    |  2 ++
>  4 files changed, 46 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> index 78a3be394d00..2c4ff13fddb0 100644
> --- a/arch/x86/kvm/vmx/nested.c
> +++ b/arch/x86/kvm/vmx/nested.c
> @@ -660,6 +660,28 @@ static inline bool nested_vmx_prepare_msr_bitmap(struct kvm_vcpu *vcpu,
>  	nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
>  					 MSR_IA32_FLUSH_CMD, MSR_TYPE_W);
>  
> +	/* Pass CET MSRs to nested VM if L0 and L1 are set to pass-through. */
> +	nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
> +					 MSR_IA32_U_CET, MSR_TYPE_RW);
> +
> +	nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
> +					 MSR_IA32_S_CET, MSR_TYPE_RW);
> +
> +	nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
> +					 MSR_IA32_PL0_SSP, MSR_TYPE_RW);
> +
> +	nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
> +					 MSR_IA32_PL1_SSP, MSR_TYPE_RW);
> +
> +	nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
> +					 MSR_IA32_PL2_SSP, MSR_TYPE_RW);
> +
> +	nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
> +					 MSR_IA32_PL3_SSP, MSR_TYPE_RW);
> +
> +	nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
> +					 MSR_IA32_INT_SSP_TAB, MSR_TYPE_RW);
> +
>  	kvm_vcpu_unmap(vcpu, &vmx->nested.msr_bitmap_map, false);
>  
>  	vmx->nested.force_msr_bitmap_recalc = false;
> @@ -6794,7 +6816,7 @@ static void nested_vmx_setup_exit_ctls(struct vmcs_config *vmcs_conf,
>  		VM_EXIT_HOST_ADDR_SPACE_SIZE |
>  #endif
>  		VM_EXIT_LOAD_IA32_PAT | VM_EXIT_SAVE_IA32_PAT |
> -		VM_EXIT_CLEAR_BNDCFGS;
> +		VM_EXIT_CLEAR_BNDCFGS | VM_EXIT_LOAD_CET_STATE;
>  	msrs->exit_ctls_high |=
>  		VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR |
>  		VM_EXIT_LOAD_IA32_EFER | VM_EXIT_SAVE_IA32_EFER |
> @@ -6816,7 +6838,8 @@ static void nested_vmx_setup_entry_ctls(struct vmcs_config *vmcs_conf,
>  #ifdef CONFIG_X86_64
>  		VM_ENTRY_IA32E_MODE |
>  #endif
> -		VM_ENTRY_LOAD_IA32_PAT | VM_ENTRY_LOAD_BNDCFGS;
> +		VM_ENTRY_LOAD_IA32_PAT | VM_ENTRY_LOAD_BNDCFGS |
> +		VM_ENTRY_LOAD_CET_STATE;
>  	msrs->entry_ctls_high |=
>  		(VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR | VM_ENTRY_LOAD_IA32_EFER |
>  		 VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL);
> diff --git a/arch/x86/kvm/vmx/vmcs12.c b/arch/x86/kvm/vmx/vmcs12.c
> index 106a72c923ca..4233b5ca9461 100644
> --- a/arch/x86/kvm/vmx/vmcs12.c
> +++ b/arch/x86/kvm/vmx/vmcs12.c
> @@ -139,6 +139,9 @@ const unsigned short vmcs12_field_offsets[] = {
>  	FIELD(GUEST_PENDING_DBG_EXCEPTIONS, guest_pending_dbg_exceptions),
>  	FIELD(GUEST_SYSENTER_ESP, guest_sysenter_esp),
>  	FIELD(GUEST_SYSENTER_EIP, guest_sysenter_eip),
> +	FIELD(GUEST_S_CET, guest_s_cet),
> +	FIELD(GUEST_SSP, guest_ssp),
> +	FIELD(GUEST_INTR_SSP_TABLE, guest_ssp_tbl),
>  	FIELD(HOST_CR0, host_cr0),
>  	FIELD(HOST_CR3, host_cr3),
>  	FIELD(HOST_CR4, host_cr4),
> @@ -151,5 +154,8 @@ const unsigned short vmcs12_field_offsets[] = {
>  	FIELD(HOST_IA32_SYSENTER_EIP, host_ia32_sysenter_eip),
>  	FIELD(HOST_RSP, host_rsp),
>  	FIELD(HOST_RIP, host_rip),
> +	FIELD(HOST_S_CET, host_s_cet),
> +	FIELD(HOST_SSP, host_ssp),
> +	FIELD(HOST_INTR_SSP_TABLE, host_ssp_tbl),
>  };
>  const unsigned int nr_vmcs12_fields = ARRAY_SIZE(vmcs12_field_offsets);
> diff --git a/arch/x86/kvm/vmx/vmcs12.h b/arch/x86/kvm/vmx/vmcs12.h
> index 01936013428b..3884489e7f7e 100644
> --- a/arch/x86/kvm/vmx/vmcs12.h
> +++ b/arch/x86/kvm/vmx/vmcs12.h
> @@ -117,7 +117,13 @@ struct __packed vmcs12 {
>  	natural_width host_ia32_sysenter_eip;
>  	natural_width host_rsp;
>  	natural_width host_rip;
> -	natural_width paddingl[8]; /* room for future expansion */
> +	natural_width host_s_cet;
> +	natural_width host_ssp;
> +	natural_width host_ssp_tbl;
> +	natural_width guest_s_cet;
> +	natural_width guest_ssp;
> +	natural_width guest_ssp_tbl;
> +	natural_width paddingl[2]; /* room for future expansion */
>  	u32 pin_based_vm_exec_control;
>  	u32 cpu_based_vm_exec_control;
>  	u32 exception_bitmap;
> @@ -292,6 +298,12 @@ static inline void vmx_check_vmcs12_offsets(void)
>  	CHECK_OFFSET(host_ia32_sysenter_eip, 656);
>  	CHECK_OFFSET(host_rsp, 664);
>  	CHECK_OFFSET(host_rip, 672);
> +	CHECK_OFFSET(host_s_cet, 680);
> +	CHECK_OFFSET(host_ssp, 688);
> +	CHECK_OFFSET(host_ssp_tbl, 696);
> +	CHECK_OFFSET(guest_s_cet, 704);
> +	CHECK_OFFSET(guest_ssp, 712);
> +	CHECK_OFFSET(guest_ssp_tbl, 720);
>  	CHECK_OFFSET(pin_based_vm_exec_control, 744);
>  	CHECK_OFFSET(cpu_based_vm_exec_control, 748);
>  	CHECK_OFFSET(exception_bitmap, 752);
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index f0dea8ecd0c6..2c43f1088d77 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -7731,6 +7731,8 @@ static void nested_vmx_cr_fixed1_bits_update(struct kvm_vcpu *vcpu)
>  	cr4_fixed1_update(X86_CR4_PKE,        ecx, feature_bit(PKU));
>  	cr4_fixed1_update(X86_CR4_UMIP,       ecx, feature_bit(UMIP));
>  	cr4_fixed1_update(X86_CR4_LA57,       ecx, feature_bit(LA57));
> +	cr4_fixed1_update(X86_CR4_CET,	      ecx, feature_bit(SHSTK));
> +	cr4_fixed1_update(X86_CR4_CET,	      edx, feature_bit(IBT));
>  
>  #undef cr4_fixed1_update
>  }


It is surprising how little needs to be done to support nested mode, but it
does look correct. I might have missed something, though; I can't be 100%
sure in this case.


Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>

Best regards,
	Maxim Levitsky








* Re: [PATCH v6 25/25] KVM: nVMX: Enable CET support for nested guest
  2023-09-14  6:33 ` [PATCH v6 25/25] KVM: nVMX: Enable CET support for nested guest Yang Weijiang
  2023-10-31 17:57   ` Maxim Levitsky
@ 2023-11-01  2:09   ` Chao Gao
  2023-11-01  9:22     ` Yang, Weijiang
                       ` (2 more replies)
  1 sibling, 3 replies; 120+ messages in thread
From: Chao Gao @ 2023-11-01  2:09 UTC (permalink / raw)
  To: Yang Weijiang
  Cc: seanjc, pbonzini, kvm, linux-kernel, dave.hansen, peterz,
	rick.p.edgecombe, john.allen

On Thu, Sep 14, 2023 at 02:33:25AM -0400, Yang Weijiang wrote:
>Set up CET MSRs, the related VM_ENTRY/EXIT control bits, and the fixed CR4
>setting to enable CET for nested VMs.
>
>Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>
>---
> arch/x86/kvm/vmx/nested.c | 27 +++++++++++++++++++++++++--
> arch/x86/kvm/vmx/vmcs12.c |  6 ++++++
> arch/x86/kvm/vmx/vmcs12.h | 14 +++++++++++++-
> arch/x86/kvm/vmx/vmx.c    |  2 ++
> 4 files changed, 46 insertions(+), 3 deletions(-)
>
>diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
>index 78a3be394d00..2c4ff13fddb0 100644
>--- a/arch/x86/kvm/vmx/nested.c
>+++ b/arch/x86/kvm/vmx/nested.c
>@@ -660,6 +660,28 @@ static inline bool nested_vmx_prepare_msr_bitmap(struct kvm_vcpu *vcpu,
> 	nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
> 					 MSR_IA32_FLUSH_CMD, MSR_TYPE_W);
> 
>+	/* Pass CET MSRs to nested VM if L0 and L1 are set to pass-through. */
>+	nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
>+					 MSR_IA32_U_CET, MSR_TYPE_RW);
>+
>+	nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
>+					 MSR_IA32_S_CET, MSR_TYPE_RW);
>+
>+	nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
>+					 MSR_IA32_PL0_SSP, MSR_TYPE_RW);
>+
>+	nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
>+					 MSR_IA32_PL1_SSP, MSR_TYPE_RW);
>+
>+	nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
>+					 MSR_IA32_PL2_SSP, MSR_TYPE_RW);
>+
>+	nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
>+					 MSR_IA32_PL3_SSP, MSR_TYPE_RW);
>+
>+	nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
>+					 MSR_IA32_INT_SSP_TAB, MSR_TYPE_RW);
>+
> 	kvm_vcpu_unmap(vcpu, &vmx->nested.msr_bitmap_map, false);
> 
> 	vmx->nested.force_msr_bitmap_recalc = false;
>@@ -6794,7 +6816,7 @@ static void nested_vmx_setup_exit_ctls(struct vmcs_config *vmcs_conf,
> 		VM_EXIT_HOST_ADDR_SPACE_SIZE |
> #endif
> 		VM_EXIT_LOAD_IA32_PAT | VM_EXIT_SAVE_IA32_PAT |
>-		VM_EXIT_CLEAR_BNDCFGS;
>+		VM_EXIT_CLEAR_BNDCFGS | VM_EXIT_LOAD_CET_STATE;
> 	msrs->exit_ctls_high |=
> 		VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR |
> 		VM_EXIT_LOAD_IA32_EFER | VM_EXIT_SAVE_IA32_EFER |
>@@ -6816,7 +6838,8 @@ static void nested_vmx_setup_entry_ctls(struct vmcs_config *vmcs_conf,
> #ifdef CONFIG_X86_64
> 		VM_ENTRY_IA32E_MODE |
> #endif
>-		VM_ENTRY_LOAD_IA32_PAT | VM_ENTRY_LOAD_BNDCFGS;
>+		VM_ENTRY_LOAD_IA32_PAT | VM_ENTRY_LOAD_BNDCFGS |
>+		VM_ENTRY_LOAD_CET_STATE;
> 	msrs->entry_ctls_high |=
> 		(VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR | VM_ENTRY_LOAD_IA32_EFER |
> 		 VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL);
>diff --git a/arch/x86/kvm/vmx/vmcs12.c b/arch/x86/kvm/vmx/vmcs12.c
>index 106a72c923ca..4233b5ca9461 100644
>--- a/arch/x86/kvm/vmx/vmcs12.c
>+++ b/arch/x86/kvm/vmx/vmcs12.c
>@@ -139,6 +139,9 @@ const unsigned short vmcs12_field_offsets[] = {
> 	FIELD(GUEST_PENDING_DBG_EXCEPTIONS, guest_pending_dbg_exceptions),
> 	FIELD(GUEST_SYSENTER_ESP, guest_sysenter_esp),
> 	FIELD(GUEST_SYSENTER_EIP, guest_sysenter_eip),
>+	FIELD(GUEST_S_CET, guest_s_cet),
>+	FIELD(GUEST_SSP, guest_ssp),
>+	FIELD(GUEST_INTR_SSP_TABLE, guest_ssp_tbl),

I think we need to sync guest states, e.g., guest_s_cet/guest_ssp/guest_ssp_tbl,
between vmcs02 and vmcs12 on nested VM entry/exit, probably in
sync_vmcs02_to_vmcs12() and prepare_vmcs12() or "_rare" variants of them.

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v6 24/25] KVM: nVMX: Introduce new VMX_BASIC bit for event error_code delivery to L1
  2023-09-14  6:33 ` [PATCH v6 24/25] KVM: nVMX: Introduce new VMX_BASIC bit for event error_code delivery to L1 Yang Weijiang
  2023-10-31 17:57   ` Maxim Levitsky
@ 2023-11-01  4:21   ` Chao Gao
  2023-11-15  8:31     ` Yang, Weijiang
  1 sibling, 1 reply; 120+ messages in thread
From: Chao Gao @ 2023-11-01  4:21 UTC (permalink / raw)
  To: Yang Weijiang
  Cc: seanjc, pbonzini, kvm, linux-kernel, dave.hansen, peterz,
	rick.p.edgecombe, john.allen

On Thu, Sep 14, 2023 at 02:33:24AM -0400, Yang Weijiang wrote:
>Per SDM description (Vol.3D, Appendix A.1):
>"If bit 56 is read as 1, software can use VM entry to deliver a hardware
>exception with or without an error code, regardless of vector"
>
>Modify the has_error_code check before injecting events into the nested
>guest: only enforce the check when the guest is in real mode, the event is
>not a hardware exception, or the platform doesn't enumerate bit 56 in
>VMX_BASIC; in all other cases skip the check to keep the logic consistent
>with the SDM.
>
>Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>
>---
> arch/x86/kvm/vmx/nested.c | 22 ++++++++++++++--------
> arch/x86/kvm/vmx/nested.h |  5 +++++
> 2 files changed, 19 insertions(+), 8 deletions(-)
>
>diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
>index c5ec0ef51ff7..78a3be394d00 100644
>--- a/arch/x86/kvm/vmx/nested.c
>+++ b/arch/x86/kvm/vmx/nested.c
>@@ -1205,9 +1205,9 @@ static int vmx_restore_vmx_basic(struct vcpu_vmx *vmx, u64 data)
> {
> 	const u64 feature_and_reserved =
> 		/* feature (except bit 48; see below) */
>-		BIT_ULL(49) | BIT_ULL(54) | BIT_ULL(55) |
>+		BIT_ULL(49) | BIT_ULL(54) | BIT_ULL(55) | BIT_ULL(56) |
> 		/* reserved */
>-		BIT_ULL(31) | GENMASK_ULL(47, 45) | GENMASK_ULL(63, 56);
>+		BIT_ULL(31) | GENMASK_ULL(47, 45) | GENMASK_ULL(63, 57);
> 	u64 vmx_basic = vmcs_config.nested.basic;
> 
> 	if (!is_bitwise_subset(vmx_basic, data, feature_and_reserved))
>@@ -2846,12 +2846,16 @@ static int nested_check_vm_entry_controls(struct kvm_vcpu *vcpu,
> 		    CC(intr_type == INTR_TYPE_OTHER_EVENT && vector != 0))
> 			return -EINVAL;
> 
>-		/* VM-entry interruption-info field: deliver error code */
>-		should_have_error_code =
>-			intr_type == INTR_TYPE_HARD_EXCEPTION && prot_mode &&
>-			x86_exception_has_error_code(vector);
>-		if (CC(has_error_code != should_have_error_code))
>-			return -EINVAL;
>+		if (!prot_mode || intr_type != INTR_TYPE_HARD_EXCEPTION ||
>+		    !nested_cpu_has_no_hw_errcode_cc(vcpu)) {
>+			/* VM-entry interruption-info field: deliver error code */
>+			should_have_error_code =
>+				intr_type == INTR_TYPE_HARD_EXCEPTION &&
>+				prot_mode &&
>+				x86_exception_has_error_code(vector);
>+			if (CC(has_error_code != should_have_error_code))
>+				return -EINVAL;
>+		}

prot_mode and intr_type are used twice, making the code a little hard to read.

how about:
		/*
		 * Cannot deliver error code in real mode or if the
		 * interruption type is not hardware exception. For other
		 * cases, do the consistency check only if the vCPU doesn't
		 * enumerate VMX_BASIC_NO_HW_ERROR_CODE_CC.
		 */
		if (!prot_mode || intr_type != INTR_TYPE_HARD_EXCEPTION) {
			if (CC(has_error_code))
				return -EINVAL;
		} else if (!nested_cpu_has_no_hw_errcode_cc(vcpu)) {
			if (CC(has_error_code != x86_exception_has_error_code(vector)))
				return -EINVAL;
		}

and drop should_have_error_code.

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v6 01/25] x86/fpu/xstate: Manually check and add XFEATURE_CET_USER xstate bit
  2023-10-31 17:43   ` Maxim Levitsky
@ 2023-11-01  9:19     ` Yang, Weijiang
  0 siblings, 0 replies; 120+ messages in thread
From: Yang, Weijiang @ 2023-11-01  9:19 UTC (permalink / raw)
  To: Maxim Levitsky, seanjc, pbonzini, kvm, linux-kernel
  Cc: dave.hansen, peterz, chao.gao, rick.p.edgecombe, john.allen

On 11/1/2023 1:43 AM, Maxim Levitsky wrote:
> On Thu, 2023-09-14 at 02:33 -0400, Yang Weijiang wrote:
>> Remove XFEATURE_CET_USER entry from dependency array as the entry doesn't
>> reflect true dependency between CET features and the xstate bit, instead
>> manually check and add the bit back if either SHSTK or IBT is supported.
>>
>> Both user mode shadow stack and indirect branch tracking features depend
>> on XFEATURE_CET_USER bit in XSS to automatically save/restore user mode
>> xstate registers, i.e., IA32_U_CET and IA32_PL3_SSP whenever necessary.
>>
>> Although in the real world a platform with IBT but no SHSTK is rare, it is
>> common in the virtualization world, since guest SHSTK and IBT can be
>> controlled independently via the userspace app.
>>
>> Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>
>> ---
>>   arch/x86/kernel/fpu/xstate.c | 9 ++++++++-
>>   1 file changed, 8 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
>> index cadf68737e6b..12c8cb278346 100644
>> --- a/arch/x86/kernel/fpu/xstate.c
>> +++ b/arch/x86/kernel/fpu/xstate.c
>> @@ -73,7 +73,6 @@ static unsigned short xsave_cpuid_features[] __initdata = {
>>   	[XFEATURE_PT_UNIMPLEMENTED_SO_FAR]	= X86_FEATURE_INTEL_PT,
>>   	[XFEATURE_PKRU]				= X86_FEATURE_OSPKE,
>>   	[XFEATURE_PASID]			= X86_FEATURE_ENQCMD,
>> -	[XFEATURE_CET_USER]			= X86_FEATURE_SHSTK,
>>   	[XFEATURE_XTILE_CFG]			= X86_FEATURE_AMX_TILE,
>>   	[XFEATURE_XTILE_DATA]			= X86_FEATURE_AMX_TILE,
>>   };
>> @@ -798,6 +797,14 @@ void __init fpu__init_system_xstate(unsigned int legacy_size)
>>   			fpu_kernel_cfg.max_features &= ~BIT_ULL(i);
>>   	}
>>   
>> +	/*
>> +	 * Manually add CET user mode xstate bit if either SHSTK or IBT is
>> +	 * available. Both features depend on the xstate bit to save/restore
>> +	 * CET user mode state.
>> +	 */
>> +	if (boot_cpu_has(X86_FEATURE_SHSTK) || boot_cpu_has(X86_FEATURE_IBT))
>> +		fpu_kernel_cfg.max_features |= BIT_ULL(XFEATURE_CET_USER);
>> +
>>   	if (!cpu_feature_enabled(X86_FEATURE_XFD))
>>   		fpu_kernel_cfg.max_features &= ~XFEATURE_MASK_USER_DYNAMIC;
>>   
>
> The goal of xsave_cpuid_features is to disable xfeature state bits that are enabled
> in CPUID while their parent feature bit (e.g. X86_FEATURE_AVX512) is disabled in CPUID,
> something that should not happen on a real CPU, but can happen if the user explicitly
> disables the feature on the kernel command line and/or due to virtualization.
>
> However the above code does the opposite: it will enable the XFEATURE_CET_USER XSAVES
> component when in fact it might be disabled in CPUID (and one can say that in theory
> such a configuration is even useful, since the kernel can still context switch the CET MSRs manually).
>
>
> So I think that the code should do this instead:
>
> if (!boot_cpu_has(X86_FEATURE_SHSTK) && !boot_cpu_has(X86_FEATURE_IBT))
>   	fpu_kernel_cfg.max_features &= ~BIT_ULL(XFEATURE_CET_USER);

Hi, Maxim,
Thanks a lot for the comments on the series!
I'll check and reply to them after finishing an urgent task at hand.

Yeah, it looks good to me and makes the handling logic more consistent!

> Best regards,
> 	Maxim Levitsky
>
>
>
>



* Re: [PATCH v6 25/25] KVM: nVMX: Enable CET support for nested guest
  2023-11-01  2:09   ` Chao Gao
@ 2023-11-01  9:22     ` Yang, Weijiang
  2023-11-01  9:54     ` Maxim Levitsky
  2023-11-15  8:23     ` Yang, Weijiang
  2 siblings, 0 replies; 120+ messages in thread
From: Yang, Weijiang @ 2023-11-01  9:22 UTC (permalink / raw)
  To: Chao Gao
  Cc: seanjc, pbonzini, kvm, linux-kernel, dave.hansen, peterz,
	rick.p.edgecombe, john.allen

On 11/1/2023 10:09 AM, Chao Gao wrote:
> On Thu, Sep 14, 2023 at 02:33:25AM -0400, Yang Weijiang wrote:
>> Set up CET MSRs, related VM_ENTRY/EXIT control bits and fixed CR4 setting
>> to enable CET for nested VM.
>>
>> Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>
>> ---
>> arch/x86/kvm/vmx/nested.c | 27 +++++++++++++++++++++++++--
>> arch/x86/kvm/vmx/vmcs12.c |  6 ++++++
>> arch/x86/kvm/vmx/vmcs12.h | 14 +++++++++++++-
>> arch/x86/kvm/vmx/vmx.c    |  2 ++
>> 4 files changed, 46 insertions(+), 3 deletions(-)
>>
>> diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
>> index 78a3be394d00..2c4ff13fddb0 100644
>> --- a/arch/x86/kvm/vmx/nested.c
>> +++ b/arch/x86/kvm/vmx/nested.c
>> @@ -660,6 +660,28 @@ static inline bool nested_vmx_prepare_msr_bitmap(struct kvm_vcpu *vcpu,
>> 	nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
>> 					 MSR_IA32_FLUSH_CMD, MSR_TYPE_W);
>>
>> +	/* Pass CET MSRs to nested VM if L0 and L1 are set to pass-through. */
>> +	nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
>> +					 MSR_IA32_U_CET, MSR_TYPE_RW);
>> +
>> +	nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
>> +					 MSR_IA32_S_CET, MSR_TYPE_RW);
>> +
>> +	nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
>> +					 MSR_IA32_PL0_SSP, MSR_TYPE_RW);
>> +
>> +	nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
>> +					 MSR_IA32_PL1_SSP, MSR_TYPE_RW);
>> +
>> +	nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
>> +					 MSR_IA32_PL2_SSP, MSR_TYPE_RW);
>> +
>> +	nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
>> +					 MSR_IA32_PL3_SSP, MSR_TYPE_RW);
>> +
>> +	nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
>> +					 MSR_IA32_INT_SSP_TAB, MSR_TYPE_RW);
>> +
>> 	kvm_vcpu_unmap(vcpu, &vmx->nested.msr_bitmap_map, false);
>>
>> 	vmx->nested.force_msr_bitmap_recalc = false;
>> @@ -6794,7 +6816,7 @@ static void nested_vmx_setup_exit_ctls(struct vmcs_config *vmcs_conf,
>> 		VM_EXIT_HOST_ADDR_SPACE_SIZE |
>> #endif
>> 		VM_EXIT_LOAD_IA32_PAT | VM_EXIT_SAVE_IA32_PAT |
>> -		VM_EXIT_CLEAR_BNDCFGS;
>> +		VM_EXIT_CLEAR_BNDCFGS | VM_EXIT_LOAD_CET_STATE;
>> 	msrs->exit_ctls_high |=
>> 		VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR |
>> 		VM_EXIT_LOAD_IA32_EFER | VM_EXIT_SAVE_IA32_EFER |
>> @@ -6816,7 +6838,8 @@ static void nested_vmx_setup_entry_ctls(struct vmcs_config *vmcs_conf,
>> #ifdef CONFIG_X86_64
>> 		VM_ENTRY_IA32E_MODE |
>> #endif
>> -		VM_ENTRY_LOAD_IA32_PAT | VM_ENTRY_LOAD_BNDCFGS;
>> +		VM_ENTRY_LOAD_IA32_PAT | VM_ENTRY_LOAD_BNDCFGS |
>> +		VM_ENTRY_LOAD_CET_STATE;
>> 	msrs->entry_ctls_high |=
>> 		(VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR | VM_ENTRY_LOAD_IA32_EFER |
>> 		 VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL);
>> diff --git a/arch/x86/kvm/vmx/vmcs12.c b/arch/x86/kvm/vmx/vmcs12.c
>> index 106a72c923ca..4233b5ca9461 100644
>> --- a/arch/x86/kvm/vmx/vmcs12.c
>> +++ b/arch/x86/kvm/vmx/vmcs12.c
>> @@ -139,6 +139,9 @@ const unsigned short vmcs12_field_offsets[] = {
>> 	FIELD(GUEST_PENDING_DBG_EXCEPTIONS, guest_pending_dbg_exceptions),
>> 	FIELD(GUEST_SYSENTER_ESP, guest_sysenter_esp),
>> 	FIELD(GUEST_SYSENTER_EIP, guest_sysenter_eip),
>> +	FIELD(GUEST_S_CET, guest_s_cet),
>> +	FIELD(GUEST_SSP, guest_ssp),
>> +	FIELD(GUEST_INTR_SSP_TABLE, guest_ssp_tbl),
> I think we need to sync guest states, e.g., guest_s_cet/guest_ssp/guest_ssp_tbl,
> between vmcs02 and vmcs12 on nested VM entry/exit, probably in
> sync_vmcs02_to_vmcs12() and prepare_vmcs12() or "_rare" variants of them.

Thanks Chao!
Let me double check the nested code part and reply.



* Re: [PATCH v6 25/25] KVM: nVMX: Enable CET support for nested guest
  2023-11-01  2:09   ` Chao Gao
  2023-11-01  9:22     ` Yang, Weijiang
@ 2023-11-01  9:54     ` Maxim Levitsky
  2023-11-15  8:56       ` Yang, Weijiang
  2023-11-15  8:23     ` Yang, Weijiang
  2 siblings, 1 reply; 120+ messages in thread
From: Maxim Levitsky @ 2023-11-01  9:54 UTC (permalink / raw)
  To: Chao Gao, Yang Weijiang
  Cc: seanjc, pbonzini, kvm, linux-kernel, dave.hansen, peterz,
	rick.p.edgecombe, john.allen

On Wed, 2023-11-01 at 10:09 +0800, Chao Gao wrote:
> On Thu, Sep 14, 2023 at 02:33:25AM -0400, Yang Weijiang wrote:
> > Set up CET MSRs, related VM_ENTRY/EXIT control bits and fixed CR4 setting
> > to enable CET for nested VM.
> > 
> > Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>
> > ---
> > arch/x86/kvm/vmx/nested.c | 27 +++++++++++++++++++++++++--
> > arch/x86/kvm/vmx/vmcs12.c |  6 ++++++
> > arch/x86/kvm/vmx/vmcs12.h | 14 +++++++++++++-
> > arch/x86/kvm/vmx/vmx.c    |  2 ++
> > 4 files changed, 46 insertions(+), 3 deletions(-)
> > 
> > diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> > index 78a3be394d00..2c4ff13fddb0 100644
> > --- a/arch/x86/kvm/vmx/nested.c
> > +++ b/arch/x86/kvm/vmx/nested.c
> > @@ -660,6 +660,28 @@ static inline bool nested_vmx_prepare_msr_bitmap(struct kvm_vcpu *vcpu,
> > 	nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
> > 					 MSR_IA32_FLUSH_CMD, MSR_TYPE_W);
> > 
> > +	/* Pass CET MSRs to nested VM if L0 and L1 are set to pass-through. */
> > +	nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
> > +					 MSR_IA32_U_CET, MSR_TYPE_RW);
> > +
> > +	nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
> > +					 MSR_IA32_S_CET, MSR_TYPE_RW);
> > +
> > +	nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
> > +					 MSR_IA32_PL0_SSP, MSR_TYPE_RW);
> > +
> > +	nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
> > +					 MSR_IA32_PL1_SSP, MSR_TYPE_RW);
> > +
> > +	nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
> > +					 MSR_IA32_PL2_SSP, MSR_TYPE_RW);
> > +
> > +	nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
> > +					 MSR_IA32_PL3_SSP, MSR_TYPE_RW);
> > +
> > +	nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
> > +					 MSR_IA32_INT_SSP_TAB, MSR_TYPE_RW);
> > +
> > 	kvm_vcpu_unmap(vcpu, &vmx->nested.msr_bitmap_map, false);
> > 
> > 	vmx->nested.force_msr_bitmap_recalc = false;
> > @@ -6794,7 +6816,7 @@ static void nested_vmx_setup_exit_ctls(struct vmcs_config *vmcs_conf,
> > 		VM_EXIT_HOST_ADDR_SPACE_SIZE |
> > #endif
> > 		VM_EXIT_LOAD_IA32_PAT | VM_EXIT_SAVE_IA32_PAT |
> > -		VM_EXIT_CLEAR_BNDCFGS;
> > +		VM_EXIT_CLEAR_BNDCFGS | VM_EXIT_LOAD_CET_STATE;
> > 	msrs->exit_ctls_high |=
> > 		VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR |
> > 		VM_EXIT_LOAD_IA32_EFER | VM_EXIT_SAVE_IA32_EFER |
> > @@ -6816,7 +6838,8 @@ static void nested_vmx_setup_entry_ctls(struct vmcs_config *vmcs_conf,
> > #ifdef CONFIG_X86_64
> > 		VM_ENTRY_IA32E_MODE |
> > #endif
> > -		VM_ENTRY_LOAD_IA32_PAT | VM_ENTRY_LOAD_BNDCFGS;
> > +		VM_ENTRY_LOAD_IA32_PAT | VM_ENTRY_LOAD_BNDCFGS |
> > +		VM_ENTRY_LOAD_CET_STATE;
> > 	msrs->entry_ctls_high |=
> > 		(VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR | VM_ENTRY_LOAD_IA32_EFER |
> > 		 VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL);
> > diff --git a/arch/x86/kvm/vmx/vmcs12.c b/arch/x86/kvm/vmx/vmcs12.c
> > index 106a72c923ca..4233b5ca9461 100644
> > --- a/arch/x86/kvm/vmx/vmcs12.c
> > +++ b/arch/x86/kvm/vmx/vmcs12.c
> > @@ -139,6 +139,9 @@ const unsigned short vmcs12_field_offsets[] = {
> > 	FIELD(GUEST_PENDING_DBG_EXCEPTIONS, guest_pending_dbg_exceptions),
> > 	FIELD(GUEST_SYSENTER_ESP, guest_sysenter_esp),
> > 	FIELD(GUEST_SYSENTER_EIP, guest_sysenter_eip),
> > +	FIELD(GUEST_S_CET, guest_s_cet),
> > +	FIELD(GUEST_SSP, guest_ssp),
> > +	FIELD(GUEST_INTR_SSP_TABLE, guest_ssp_tbl),
> 
> I think we need to sync guest states, e.g., guest_s_cet/guest_ssp/guest_ssp_tbl,
> between vmcs02 and vmcs12 on nested VM entry/exit, probably in
> sync_vmcs02_to_vmcs12() and prepare_vmcs12() or "_rare" variants of them.
> 

Aha, this is why I suspected that nested support is incomplete; 100% agree.

In particular, looking at Intel's SDM I see that:

HOST_S_CET, HOST_SSP, HOST_INTR_SSP_TABLE need to be copied from vmcs12 to vmcs02 but not vice versa,
because the CPU doesn't touch them.

GUEST_S_CET, GUEST_SSP, GUEST_INTR_SSP_TABLE should be copied bi-directionally.

This of course depends on the corresponding vm entry and vm exit controls being set.
That means that it is legal in theory to do VM entry/exit with CET enabled but
without VM_ENTRY_LOAD_CET_STATE and/or VM_EXIT_LOAD_CET_STATE set, because, for
example, the nested hypervisor can opt to save/load these MSRs itself.

I think that this is all, but I also can't be 100% sure. This thing has to be tested well before
we can be sure that it works.

Best regards,
	Maxim Levitsky



* Re: [PATCH v6 06/25] x86/fpu/xstate: Opt-in kernel dynamic bits when calculate guest xstate size
  2023-10-31 17:45             ` Maxim Levitsky
@ 2023-11-01 14:16               ` Sean Christopherson
  2023-11-02 18:20                 ` Maxim Levitsky
  0 siblings, 1 reply; 120+ messages in thread
From: Sean Christopherson @ 2023-11-01 14:16 UTC (permalink / raw)
  To: Maxim Levitsky
  Cc: Weijiang Yang, Dave Hansen, pbonzini, kvm, linux-kernel, peterz,
	chao.gao, rick.p.edgecombe, john.allen

On Tue, Oct 31, 2023, Maxim Levitsky wrote:
> On Thu, 2023-10-26 at 10:24 -0700, Sean Christopherson wrote:
> > On Wed, Oct 25, 2023, Weijiang Yang wrote:
> On top of that I think that applying the same permission approach to guest's
> FPU state is not a good fit, because of two reasons:
> 
> 1. The guest FPU state will never be pushed on the signal stack - KVM swaps
>    back the host FPU state before it returns from the KVM_RUN ioctl.
> 
>    Also I think (not sure) that ptrace can only access (FPU) state of a
>    stopped process, and a stopped vCPU process will also first return to
>    userspace. Again I might be mistaken here, I never researched this in
>    depth.
> 
>    Assuming that I am correct on these assumptions, the guest FPU state can
>    only be accessed via KVM_GET_XSAVE/KVM_SET_XSAVE/KVM_GET_XSAVE2 ioctls,
>    which also returns the userspace portion of the state including optionally
>    the AMX state, but this ioctl doesn't really need FPU permission
>    framework, because it is a KVM ABI, and in fact KVM_GET_XSAVE2 was added
>    exactly because of that: to make sure that userspace is aware that larger
>    than 4K buffer can be returned.
> 
> 2. Guest FPU state is not even on demand resized (but I can imagine that in
>    the future we will do this).

Just because guest FPU state isn't resized doesn't mean there's no value in
requiring userspace to opt-in to allocating 8KiB of data per-vCPU.

> And of course, adding permissions for kernel features is an even worse
> idea, which we really shouldn't do.
> 
> >  
> > If there are no objections, I'll test the below and write a proper changelog.
> >  
> > --
> > From: Sean Christopherson <seanjc@google.com>
> > Date: Thu, 26 Oct 2023 10:17:33 -0700
> > Subject: [PATCH] x86/fpu/xstate: Always preserve non-user xfeatures/flags in
> >  __state_perm
> > 
> > Fixes: 781c64bfcb73 ("x86/fpu/xstate: Handle supervisor states in XSTATE permissions")
> > Signed-off-by: Sean Christopherson <seanjc@google.com>
> > ---
> >  arch/x86/kernel/fpu/xstate.c | 18 +++++++++++-------
> >  1 file changed, 11 insertions(+), 7 deletions(-)
> > 
> > diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
> > index ef6906107c54..73f6bc00d178 100644
> > --- a/arch/x86/kernel/fpu/xstate.c
> > +++ b/arch/x86/kernel/fpu/xstate.c
> > @@ -1601,16 +1601,20 @@ static int __xstate_request_perm(u64 permitted, u64 requested, bool guest)
> >  	if ((permitted & requested) == requested)
> >  		return 0;
> >  
> > -	/* Calculate the resulting kernel state size */
> > +	/*
> > +	 * Calculate the resulting kernel state size.  Note, @permitted also
> > +	 * contains supervisor xfeatures even though supervisor are always
> > +	 * permitted for kernel and guest FPUs, and never permitted for user
> > +	 * FPUs.
> > +	 */
> >  	mask = permitted | requested;
> > -	/* Take supervisor states into account on the host */
> > -	if (!guest)
> > -		mask |= xfeatures_mask_supervisor();
> >  	ksize = xstate_calculate_size(mask, compacted);
> 
> This might not work with kernel dynamic features, because
> xfeatures_mask_supervisor() will return all supported supervisor features.

I don't understand what you mean by "This".

Somewhat of a side topic, I feel very strongly that we should use "guest only"
terminology instead of "dynamic".  There is nothing dynamic about whether or not
XFEATURE_CET_KERNEL is allowed; there's not even a real "decision" beyond checking
whether or not CET is supported.

> Therefore at least until we have an actual kernel dynamic feature (a feature
> used by the host kernel and not KVM, and which has to be dynamic like AMX),
> I suggest that KVM stops using the permission API completely for the guest
> FPU state, and just gives all the features it wants to enable right to

By "it", I assume you mean userspace?

> __fpu_alloc_init_guest_fpstate() (Guest FPU permission API IMHO should be
> deprecated and ignored)

KVM allocates guest FPU state during KVM_CREATE_VCPU, so not using prctl() would
either require KVM to defer allocating guest FPU state until KVM_SET_CPUID{,2},
or would require a VM-scoped KVM ioctl() to let userspace opt in to dynamic
xfeatures before vCPUs are created.

Allocating guest FPU state during KVM_SET_CPUID{,2} would get messy, as KVM allows
multiple calls to KVM_SET_CPUID{,2} so long as the vCPU hasn't done KVM_RUN.  E.g.
KVM would need to support actually resizing guest FPU state, which would be extra
complexity without any meaningful benefit.

The only benefit I can think of for a VM-scoped ioctl() is that it would allow a
single process to host multiple VMs with different dynamic xfeature requirements.
But such a setup is mostly theoretical.  Maybe it'll affect the SEV migration
helper at some point?  But even that isn't guaranteed.

So while I agree that ARCH_GET_XCOMP_GUEST_PERM isn't ideal, practically speaking
it's sufficient for all current use cases.  Unless a concrete use case comes along,
deprecating ARCH_GET_XCOMP_GUEST_PERM in favor of a KVM ioctl() would be churn for
both the kernel and userspace without any meaningful benefit, or really even any
true change in behavior.


* Re: [PATCH v6 09/25] KVM: x86: Rework cpuid_get_supported_xcr0() to operate on vCPU data
  2023-10-31 17:46   ` Maxim Levitsky
@ 2023-11-01 14:41     ` Sean Christopherson
  2023-11-02 18:25       ` Maxim Levitsky
  0 siblings, 1 reply; 120+ messages in thread
From: Sean Christopherson @ 2023-11-01 14:41 UTC (permalink / raw)
  To: Maxim Levitsky
  Cc: Yang Weijiang, pbonzini, kvm, linux-kernel, dave.hansen, peterz,
	chao.gao, rick.p.edgecombe, john.allen

On Tue, Oct 31, 2023, Maxim Levitsky wrote:
> On Thu, 2023-09-14 at 02:33 -0400, Yang Weijiang wrote:
> > From: Sean Christopherson <seanjc@google.com>
> > 
> > Rework and rename cpuid_get_supported_xcr0() to explicitly operate on vCPU
> > state, i.e. on a vCPU's CPUID state.  Prior to commit 275a87244ec8 ("KVM:
> > x86: Don't adjust guest's CPUID.0x12.1 (allowed SGX enclave XFRM)"), KVM
> > incorrectly fudged guest CPUID at runtime,
> Can you explain how commit 275a87244ec8 relates to this patch?
>
> > which in turn necessitated massaging the incoming CPUID state for
> > KVM_SET_CPUID{2} so as not to run afoul of kvm_cpuid_check_equal().
> 
> Can you link the commit that added this 'massaging' and explain on how this
> relates to this patch?

It's commit 275a87244ec8, which is right above.  I think the missing part is an
explicit call out that the massaging used cpuid_get_supported_xcr0() with the
incoming "struct kvm_cpuid_entry2", i.e. without a "struct kvm_vcpu".

> Can you explain what is the problem that this patch is trying to solve?

Is this better?

--
Rework and rename cpuid_get_supported_xcr0() to explicitly operate on vCPU
state, i.e. on a vCPU's CPUID state, now that the only usage of the helper
is to retrieve a vCPU's already-set CPUID.

Prior to commit 275a87244ec8 ("KVM: x86: Don't adjust guest's CPUID.0x12.1
(allowed SGX enclave XFRM)"), KVM incorrectly fudged guest CPUID at
runtime, which in turn necessitated massaging the incoming CPUID state for
KVM_SET_CPUID{2} so as not to run afoul of kvm_cpuid_check_equal().  I.e.
KVM also invoked cpuid_get_supported_xcr0() with the incoming CPUID state,
and thus without an explicit vCPU object.
--

> Is it really allowed in x86 spec to have different supported mask of XCR0 bits
> on different CPUs (assuming all CPUs of the same type)?

Yes, nothing in the SDM explicitly states that all cores have identical feature
sets.  And "assuming all CPUs of the same type" isn't really a valid constraint
because it's very doable to put different SKUs into a multi-socket system.

Intel even (somewhat inadvertently) kinda sorta shipped such CPUs, as Alder Lake
P-cores support AVX512 but E-cores do not, and IIRC early (pre-production?) BIOS
didn't disable AVX512 on the P-Cores, i.e. software could observe cores with and
without AVX512.  That quickly got fixed because it confused software, but until
Intel squashed AVX512 entirely with a microcode update, disabling E-Cores in BIOS
would effectively enable AVX512 on the remaining P-Cores.

And it's not XCR0-related, but PMUs on Alder Lake (and all Intel hybrid CPUs) are
truly heterogeneous.  It's a mess for virtualization, but concrete proof that there
are no architectural guarantees regarding homogeneity of feature sets.

> If true, does KVM supports it?

Yes.  Whether or not that's a good thing is definitely debatable, but KVM's ABI for
a very long time has allowed userspace to expose whatever it wants via KVM_SET_CPUID.

Getting (guest) software to play nice is an entirely different matter, but exposing
heterogeneous vCPUs isn't an architectural violation.


* Re: [PATCH v6 18/25] KVM: x86: Use KVM-governed feature framework to track "SHSTK/IBT enabled"
  2023-10-31 17:54   ` Maxim Levitsky
@ 2023-11-01 15:46     ` Sean Christopherson
  2023-11-02 18:35       ` Maxim Levitsky
  0 siblings, 1 reply; 120+ messages in thread
From: Sean Christopherson @ 2023-11-01 15:46 UTC (permalink / raw)
  To: Maxim Levitsky
  Cc: Yang Weijiang, pbonzini, kvm, linux-kernel, dave.hansen, peterz,
	chao.gao, rick.p.edgecombe, john.allen

On Tue, Oct 31, 2023, Maxim Levitsky wrote:
> On Thu, 2023-09-14 at 02:33 -0400, Yang Weijiang wrote:
> > Use the governed feature framework to track whether X86_FEATURE_SHSTK
> > and X86_FEATURE_IBT features can be used by userspace and guest, i.e.,
> > the features can be used iff both KVM and guest CPUID can support them.
> > 
> > Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>
> > ---
> >  arch/x86/kvm/governed_features.h | 2 ++
> >  arch/x86/kvm/vmx/vmx.c           | 2 ++
> >  2 files changed, 4 insertions(+)
> > 
> > diff --git a/arch/x86/kvm/governed_features.h b/arch/x86/kvm/governed_features.h
> > index 423a73395c10..db7e21c5ecc2 100644
> > --- a/arch/x86/kvm/governed_features.h
> > +++ b/arch/x86/kvm/governed_features.h
> > @@ -16,6 +16,8 @@ KVM_GOVERNED_X86_FEATURE(PAUSEFILTER)
> >  KVM_GOVERNED_X86_FEATURE(PFTHRESHOLD)
> >  KVM_GOVERNED_X86_FEATURE(VGIF)
> >  KVM_GOVERNED_X86_FEATURE(VNMI)
> > +KVM_GOVERNED_X86_FEATURE(SHSTK)
> > +KVM_GOVERNED_X86_FEATURE(IBT)
> >  
> >  #undef KVM_GOVERNED_X86_FEATURE
> >  #undef KVM_GOVERNED_FEATURE
> > diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> > index 9409753f45b0..fd5893b3a2c8 100644
> > --- a/arch/x86/kvm/vmx/vmx.c
> > +++ b/arch/x86/kvm/vmx/vmx.c
> > @@ -7765,6 +7765,8 @@ static void vmx_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
> >  		kvm_governed_feature_check_and_set(vcpu, X86_FEATURE_XSAVES);
> >  
> >  	kvm_governed_feature_check_and_set(vcpu, X86_FEATURE_VMX);
> > +	kvm_governed_feature_check_and_set(vcpu, X86_FEATURE_SHSTK);
> > +	kvm_governed_feature_check_and_set(vcpu, X86_FEATURE_IBT);
> >  
> >  	vmx_setup_uret_msrs(vmx);
> >  
> 
> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
> 
> 
> PS: IMHO The whole 'governed feature framework' is very confusing and
> somewhat poorly documented.
>
> Currently the only partial explanation of it, is at 'governed_features',
> which doesn't explain how to use it.

To be honest, terrible name aside, I thought kvm_governed_feature_check_and_set()
would be fairly self-explanatory, at least relative to all the other CPUID handling
in KVM.

> For the reference this is how KVM expects governed features to be used in the
> common case (there are some exceptions to this but they are rare)
> 
> 1. If a feature is not enabled in host CPUID or KVM doesn't support it, 
>    KVM is expected to not enable it in KVM cpu caps.
> 
> 2. Userspace uploads guest CPUID.
> 
> 3. After the guest CPUID upload, the vendor code calls
>    kvm_governed_feature_check_and_set() which sets governed features = True iff
>    feature is supported in both kvm cpu caps and in guest CPUID.
>
> 4. kvm/vendor code uses 'guest_can_use()' to query the value of the governed
>    feature instead of reading guest CPUID.
> 
> It might make sense to document the above somewhere at least.
>
> Now about another thing I am thinking:
> 
> I do know that the mess of boolean flags that svm had is worse than these
> governed features and functionality wise these are equivalent.
> 
> However thinking again about the whole thing: 
> 
> IMHO the 'governed features' is another quite confusing term that a KVM
> developer will need to learn and keep in memory.

I 100% agree, but I explicitly called out the terrible name in the v1 and v2
cover letters[1][2], and the patches were on the list for 6 months before I
applied them.  I'm definitely still open to a better name, but I'm also not
exactly chomping at the bit to get behind the bikeshed.

v1:
 : Note, I don't like the name "governed", but it was the least awful thing I
 : could come up with.  Suggestions most definitely welcome.

v2:
 : Note, I still don't like the name "governed", but no one has suggested
 : anything else, let alone anything better :-)


[1] https://lore.kernel.org/all/20230217231022.816138-1-seanjc@google.com
[2] https://lore.kernel.org/all/20230729011608.1065019-1-seanjc@google.com

> Because of that, can't we just use guest CPUID as a single source of truth
> and drop all the governed features code?

No, not without a rather massive ABI break.  To make guest CPUID the single source
of truth, KVM would need to modify guest CPUID to squash features that userspace
has set, but that are not supported by hardware.  And that is most definitely a
can of worms I don't want to reopen, e.g. see the mess that got created when KVM
tried to "help" userspace by mucking with VMX capability MSRs in response to
CPUID changes.

There aren't many real use cases for advertising "unsupported" features via guest
CPUID, but there are some, and I have definitely abused KVM_SET_CPUID2 for testing
purposes.

And as above, that doesn't work for X86_FEATURE_XSAVES or X86_FEATURE_GBPAGES.

We'd also have to overhaul guest CPUID lookups to be significantly faster (which
is doable), as one of the motivations for the framework was to avoid the overhead
of looking through guest CPUID without needing one-off boolean fields.

> In most cases, when the governed feature value will differ from the guest
> CPUID is when a feature is enabled in the guest CPUID, but not enabled in the
> KVM caps.
> 
> I do see two exceptions to this: XSAVES on AMD and X86_FEATURE_GBPAGES, in
> which the opposite happens, governed feature is enabled, even when the
> feature is hidden from the guest CPUID, but it might be better from
> readability wise point, to deal with these cases manually and we unlikely to
> have many new such cases in the future.
> 
> So for the common case of CPUID mismatch, when the governed feature is
> disabled but guest CPUID is enabled, does it make sense to allow this? 

Yes and no.  For "governed features", probably not.  But for CPUID as a whole, there
are legitimate cases where userspace needs to enumerate things that aren't officially
"supported" by KVM.  E.g. topology, core crystal frequency (CPUID 0x15), defeatures
that KVM hasn't yet learned about, features that don't have virtualization controls
and KVM hasn't yet learned about, etc.  And for things like Xen and Hyper-V paravirt
features, it's very doable to implement features that are enumerated by CPUID fully
in userspace, e.g. using MSR filters.

But again, it's a moot point because KVM has (mostly) allowed userspace to fully
control guest CPUID for a very long time.

> Such a feature which is advertised as supported but not really working is a
> recipe for hard-to-find guest bugs IMHO.
> 
> IMHO it would be much better to just check this condition and do
> kvm_vm_bugged() or something in case when a feature is enabled in the guest
> CPUID but KVM can't support it, and then just use guest CPUID in
> 'guest_can_use()'.

Maybe, if we were creating KVM from scratch, e.g. didn't have to worry about
existing userspace behavior and could implement a more forward-looking API than
KVM_GET_SUPPORTED_CPUID.  But even then the enforcement would need to be limited
to "pure" hardware-defined feature bits, and I suspect that there would still be
exceptions.  And there would likely be complexity in dealing with CPUID leafs
that are completely unknown to KVM, e.g. unless KVM completely disallowed non-zero
values for unknown CPUID leafs, adding restrictions when a feature is defined by
Intel or AMD would be at constant risk of breaking userspace.

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v6 19/25] KVM: VMX: Emulate read and write to CET MSRs
  2023-10-31 17:55   ` Maxim Levitsky
@ 2023-11-01 16:31     ` Sean Christopherson
  2023-11-02 18:38       ` Maxim Levitsky
  2023-11-03  8:18       ` Yang, Weijiang
  0 siblings, 2 replies; 120+ messages in thread
From: Sean Christopherson @ 2023-11-01 16:31 UTC (permalink / raw)
  To: Maxim Levitsky
  Cc: Yang Weijiang, pbonzini, kvm, linux-kernel, dave.hansen, peterz,
	chao.gao, rick.p.edgecombe, john.allen

On Tue, Oct 31, 2023, Maxim Levitsky wrote:
> On Thu, 2023-09-14 at 02:33 -0400, Yang Weijiang wrote:
> > Add emulation interface for CET MSR access. The emulation code is split
> > into common part and vendor specific part. The former does common check
> > for MSRs and reads/writes directly from/to XSAVE-managed MSRs via the
> > helpers while the latter accesses the MSRs linked to VMCS fields.
> > 
> > Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>
> > ---

...

> > +	case MSR_IA32_PL0_SSP ... MSR_IA32_PL3_SSP:
> > +	case MSR_KVM_SSP:
> > +		if (host_msr_reset && kvm_cpu_cap_has(X86_FEATURE_SHSTK))
> > +			break;
> > +		if (!guest_can_use(vcpu, X86_FEATURE_SHSTK))
> > +			return 1;
> > +		if (index == MSR_KVM_SSP && !host_initiated)
> > +			return 1;
> > +		if (is_noncanonical_address(data, vcpu))
> > +			return 1;
> > +		if (index != MSR_IA32_INT_SSP_TAB && !IS_ALIGNED(data, 4))
> > +			return 1;
> > +		break;
> Once again I'll prefer to have an ioctl for setting/getting SSP, this will
> make the above code simpler (e.g there will be no need to check that write
> comes from the host/etc).

I don't think an ioctl() would be simpler overall, especially when factoring in
userspace.  With a synthetic MSR, we get the following quite cheaply:

 1. Enumerating support to userspace.
 2. Save/restore of the value, e.g. for live migration.
 3. Vendor hooks for propagating values to/from the VMCS/VMCB.

For an ioctl(), #1 would require a capability, #2 (and #1 to some extent) would
require new userspace flows, and #3 would require new kvm_x86_ops hooks.

The synthetic MSR adds a small amount of messiness, as does bundling 
MSR_IA32_INT_SSP_TAB with the other shadow stack MSRs.  The bulk of the mess comes
from the need to allow userspace to write '0' when KVM enumerated support to
userspace.

If we isolate MSR_IA32_INT_SSP_TAB, that'll help with the synthetic MSR and with
MSR_IA32_INT_SSP_TAB.  For the unfortunate "host reset" behavior, the best idea I
came up with is to add a helper.  It's still a bit ugly, but the ugliness is
contained in a helper and IMO makes it much easier to follow the case statements.

get:

	case MSR_IA32_INT_SSP_TAB:
		if (!guest_can_use(vcpu, X86_FEATURE_SHSTK) ||
		    !guest_cpuid_has(vcpu, X86_FEATURE_LM))
			return 1;
		break;
	case MSR_KVM_SSP:
		if (!host_initiated)
			return 1;
		fallthrough;
	case MSR_IA32_PL0_SSP ... MSR_IA32_PL3_SSP:
		if (!guest_can_use(vcpu, X86_FEATURE_SHSTK))
			return 1;
		break;

static bool is_set_cet_msr_allowed(struct kvm_vcpu *vcpu, u32 index, u64 data,
				   bool host_initiated)
{
	bool any_cet = index == MSR_IA32_S_CET || index == MSR_IA32_U_CET;

	if (guest_can_use(vcpu, X86_FEATURE_SHSTK))
		return true;

	if (any_cet && guest_can_use(vcpu, X86_FEATURE_IBT))
		return true;

	/* 
	 * If KVM supports the MSR, i.e. has enumerated the MSR existence to
	 * userspace, then userspace is allowed to write '0' irrespective of
	 * whether or not the MSR is exposed to the guest.
	 */
	if (!host_initiated || data)
		return false;

	if (kvm_cpu_cap_has(X86_FEATURE_SHSTK))
		return true;

	return any_cet && kvm_cpu_cap_has(X86_FEATURE_IBT);
}

set:
	case MSR_IA32_U_CET:
	case MSR_IA32_S_CET:
		if (!is_set_cet_msr_allowed(vcpu, index, data, host_initiated))
			return 1;
		if (data & CET_US_RESERVED_BITS)
			return 1;
		if (!guest_can_use(vcpu, X86_FEATURE_SHSTK) &&
		    (data & CET_US_SHSTK_MASK_BITS))
			return 1;
		if (!guest_can_use(vcpu, X86_FEATURE_IBT) &&
		    (data & CET_US_IBT_MASK_BITS))
			return 1;
		if (!IS_ALIGNED(CET_US_LEGACY_BITMAP_BASE(data), 4))
			return 1;

		/* IBT can be suppressed iff the TRACKER isn't WAIT_ENDBR. */
		if ((data & CET_SUPPRESS) && (data & CET_WAIT_ENDBR))
			return 1;
		break;
	case MSR_IA32_INT_SSP_TAB:
		if (!guest_cpuid_has(vcpu, X86_FEATURE_LM))
			return 1;

		if (is_noncanonical_address(data, vcpu))
			return 1;
		break;
	case MSR_KVM_SSP:
		if (!host_initiated)
			return 1;
		fallthrough;
	case MSR_IA32_PL0_SSP ... MSR_IA32_PL3_SSP:
		if (!is_set_cet_msr_allowed(vcpu, index, data, host_initiated))
			return 1;
		if (is_noncanonical_address(data, vcpu))
			return 1;
		if (!IS_ALIGNED(data, 4))
			return 1;
		break;
	}

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v6 12/25] KVM: x86: Refresh CPUID on write to guest MSR_IA32_XSS
  2023-10-31 17:51   ` Maxim Levitsky
@ 2023-11-01 17:20     ` Sean Christopherson
  0 siblings, 0 replies; 120+ messages in thread
From: Sean Christopherson @ 2023-11-01 17:20 UTC (permalink / raw)
  To: Maxim Levitsky
  Cc: Yang Weijiang, pbonzini, kvm, linux-kernel, dave.hansen, peterz,
	chao.gao, rick.p.edgecombe, john.allen, Zhang Yi Z

On Tue, Oct 31, 2023, Maxim Levitsky wrote:
> On Thu, 2023-09-14 at 02:33 -0400, Yang Weijiang wrote:
> > @@ -312,6 +313,17 @@ static u64 vcpu_get_supported_xcr0(struct kvm_vcpu *vcpu)
> >  	return (best->eax | ((u64)best->edx << 32)) & kvm_caps.supported_xcr0;
> >  }
> >  
> > +static u64 vcpu_get_supported_xss(struct kvm_vcpu *vcpu)
> > +{
> > +	struct kvm_cpuid_entry2 *best;
> > +
> > +	best = kvm_find_cpuid_entry_index(vcpu, 0xd, 1);
> > +	if (!best)
> > +		return 0;
> > +
> > +	return (best->ecx | ((u64)best->edx << 32)) & kvm_caps.supported_xss;
> > +}
> 
> Same question as one for patch that added vcpu_get_supported_xcr0()
> Why to have per vCPU supported XSS if we assume that all CPUs have the same
> CPUID?
> 
> I mean I am not against supporting hybrid CPU models, but KVM currently doesn't
> support this and this creates illusion that it does.

KVM does "support" hybrid vCPU models in the sense that KVM has allowed hybrid models
since forever.  There are definite things that won't work, e.g. not all relevant
CPUID bits are captured in kvm_mmu_page_role, and so KVM will incorrectly share
page tables across vCPUs that are technically incompatible.

But for many features, heterogenous vCPU models do Just Work as far as KVM is
concerned.  There likely isn't a real world kernel that supports heterogenous
feature sets for things like XSS and XCR0, but that's a guest software limitation,
not a limitation of KVM's CPU virtualization.

As with many things, KVM's ABI is to let userspace shoot themselves in the foot.

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v6 14/25] KVM: x86: Load guest FPU state when access XSAVE-managed MSRs
  2023-10-31 17:51   ` Maxim Levitsky
@ 2023-11-01 18:05     ` Sean Christopherson
  2023-11-02 18:31       ` Maxim Levitsky
  2023-11-03  8:46       ` Yang, Weijiang
  0 siblings, 2 replies; 120+ messages in thread
From: Sean Christopherson @ 2023-11-01 18:05 UTC (permalink / raw)
  To: Maxim Levitsky
  Cc: Yang Weijiang, pbonzini, kvm, linux-kernel, dave.hansen, peterz,
	chao.gao, rick.p.edgecombe, john.allen

On Tue, Oct 31, 2023, Maxim Levitsky wrote:
> On Thu, 2023-09-14 at 02:33 -0400, Yang Weijiang wrote:
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index 66edbed25db8..a091764bf1d2 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -133,6 +133,9 @@ static int __set_sregs2(struct kvm_vcpu *vcpu, struct kvm_sregs2 *sregs2);
> >  static void __get_sregs2(struct kvm_vcpu *vcpu, struct kvm_sregs2 *sregs2);
> >  
> >  static DEFINE_MUTEX(vendor_module_lock);
> > +static void kvm_load_guest_fpu(struct kvm_vcpu *vcpu);
> > +static void kvm_put_guest_fpu(struct kvm_vcpu *vcpu);
> > +
> >  struct kvm_x86_ops kvm_x86_ops __read_mostly;
> >  
> >  #define KVM_X86_OP(func)					     \
> > @@ -4372,6 +4375,22 @@ int kvm_get_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
> >  }
> >  EXPORT_SYMBOL_GPL(kvm_get_msr_common);
> >  
> > +static const u32 xstate_msrs[] = {
> > +	MSR_IA32_U_CET, MSR_IA32_PL0_SSP, MSR_IA32_PL1_SSP,
> > +	MSR_IA32_PL2_SSP, MSR_IA32_PL3_SSP,
> > +};
> > +
> > +static bool is_xstate_msr(u32 index)
> > +{
> > +	int i;
> > +
> > +	for (i = 0; i < ARRAY_SIZE(xstate_msrs); i++) {
> > +		if (index == xstate_msrs[i])
> > +			return true;
> > +	}
> > +	return false;
> > +}
> 
> The name 'xstate_msr' IMHO is not clear.
> 
> How about naming it 'guest_fpu_state_msrs', together with adding a comment like that:

Maybe xstate_managed_msrs?  I'd prefer not to include "guest" because the behavior
is more a property of the architecture and/or the host kernel.  I understand where
you're coming from, but the MSR *values* are part of guest state, whereas the
check is a query on how KVM manages the MSR value, if that makes sense.

And I really don't like "FPU".  I get why the kernel uses the "FPU" terminology,
but for this check in particular I want to tie the behavior back to the architecture,
i.e. provide the hint that the reason why these MSRs are special is because Intel
defined them to be context switched via XSTATE.

Actually, this is unnecessary bikeshedding to some extent, using an array is silly.
It's easier and likely far more performant (not that that matters in this case)
to use a switch statement.

Is this better?

/*
 * Returns true if the MSR in question is managed via XSTATE, i.e. is context
 * switched with the rest of guest FPU state.
 */
static bool is_xstate_managed_msr(u32 index)
{
	switch (index) {
	case MSR_IA32_U_CET:
	case MSR_IA32_PL0_SSP ... MSR_IA32_PL3_SSP:
		return true;
	default:
		return false;
	}
}

/*
 * Read or write a bunch of msrs. All parameters are kernel addresses.
 *
 * @return number of msrs set successfully.
 */
static int __msr_io(struct kvm_vcpu *vcpu, struct kvm_msrs *msrs,
		    struct kvm_msr_entry *entries,
		    int (*do_msr)(struct kvm_vcpu *vcpu,
				  unsigned index, u64 *data))
{
	bool fpu_loaded = false;
	int i;

	for (i = 0; i < msrs->nmsrs; ++i) {
		/*
	 	 * If userspace is accessing one or more XSTATE-managed MSRs,
		 * temporarily load the guest's FPU state so that the guest's
		 * MSR value(s) is resident in hardware, i.e. so that KVM can
		 * get/set the MSR via RDMSR/WRMSR.
	 	 */
		if (vcpu && !fpu_loaded && kvm_caps.supported_xss &&
		    is_xstate_managed_msr(entries[i].index)) {
			kvm_load_guest_fpu(vcpu);
			fpu_loaded = true;
		}
		if (do_msr(vcpu, entries[i].index, &entries[i].data))
			break;
	}
	if (fpu_loaded)
		kvm_put_guest_fpu(vcpu);

	return i;
}

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v6 11/25] KVM: x86: Report XSS as to-be-saved if there are supported features
  2023-10-31 17:47   ` Maxim Levitsky
@ 2023-11-01 19:18     ` Sean Christopherson
  2023-11-02 18:31       ` Maxim Levitsky
  0 siblings, 1 reply; 120+ messages in thread
From: Sean Christopherson @ 2023-11-01 19:18 UTC (permalink / raw)
  To: Maxim Levitsky
  Cc: Yang Weijiang, pbonzini, kvm, linux-kernel, dave.hansen, peterz,
	chao.gao, rick.p.edgecombe, john.allen

On Tue, Oct 31, 2023, Maxim Levitsky wrote:
> On Thu, 2023-09-14 at 02:33 -0400, Yang Weijiang wrote:
> > From: Sean Christopherson <seanjc@google.com>
> > 
> > Add MSR_IA32_XSS to list of MSRs reported to userspace if supported_xss
> > is non-zero, i.e. KVM supports at least one XSS based feature.
> 
> 
> I can't believe that CET is the first supervisor feature that KVM supports...
> 
> Ah, now I understand why:
> 
> 1. XSAVES on AMD can't really be intercepted (other than clearing CR4.OSXSAVE
>    bit, which isn't an option if you want to support AVX for example) On VMX
>    however you can intercept XSAVES and even intercept it only when it touches
>    specific bits of state that you don't want the guest to read/write freely.
> 
> 2. Even if it was possible to intercept it, guests use XSAVES on every
>    context switch if available and emulating it might be costly.
>
> 3. Emulating XSAVES is also not that easy to do correctly.
>
> However XSAVES touches various MSRs, thus letting the guest use it
> unintercepted means giving access to host MSRs, which might be wrong security
> wise in some cases.
>
> Thus I see that KVM hardcodes the IA32_XSS to 0, and that makes the XSAVES
> work exactly like XSAVE.
> 
> And for some features which would benefit from XSAVES state components,
> KVM likely won't even be able to do so due to this limitation.
> (this is allowed thankfully by the CPUID), forcing the guests to use
> rdmsr/wrmsr instead.

Sort of?  KVM doesn't (yet) virtualize PASID, HDC, HWP, or arch LBRs (wow,
there's a lot of stuff getting thrown into XSTATE), so naturally those aren't
supported in XSS.

KVM does virtualize Processor Trace (PT), but PT is a bit of a special snowflake.
E.g. the host kernel elects NOT to manage PT MSRs via XSTATE, but it would be
possible for KVM to allow the guest to manage PT MSRs via XSTATE.

I suspect the answer to PT is threefold:

 1. Exposing a feature that isn't "supported" by the host kernel is scary.
 2. No one has pushed for the support, e.g. Linux guests obviously don't complain
    about lack of XSS support for PT.
 3. Toggling PT MSR passthrough on XSAVES/XRSTORS accesses would be more complex
    and less performant than KVM's current approach.

Re: #3, KVM does passthrough PT MSRs, but only when the guest is actively using
PT.  PT is basically a super fancy PMU feature, and so KVM "needs" to load guest
state as late as possible before VM-Entry, and load host state as early as possible
after VM-Exit.  I.e. the context switch happens on *every* entry/exit pair.

By passing through PT MSRs only when needed, KVM avoids a rather large pile of
RDMSRs and WRMSRs on every entry/exit, as the host values can be kept resident in
hardware so long as the main enable bit is cleared in the guest's control MSR
(which is context switch via a dedicated VMCS field).

XSAVES isn't subject to MSR intercepts, but KVM could utilize VMX's XSS-exiting
bitmap to effectively intercept reads and writes to PT MSRs.  Except that as you
note, KVM would either need to emulate XSAVES (oof) or save/load PT MSRs much more
frequently.

So it's kind of an emulation thing, but I honestly doubt that emulating XSAVES
was ever seriously considered when KVM support for PT was added.

CET is different than PT because the MSRs that need to be context switched at
every entry/exit have dedicated VMCS fields.  The IA32_PLx_SSP MSRs don't have
VMCS fields, but they are consumed only on privilege level changes, i.e. can be
safely deferred until guest "FPU" state is put.

> However it is possible to enable IA32_XSS bits in case the msrs XSAVES
> reads/writes can't do harm to the host, and then KVM can context switch these
> MSRs when the guest exits and that is what is done here with CET.

This isn't really true.  It's not a safety or correctness issue so much as it's
a performance issue.  E.g. KVM could let the guest use XSS for any virtualized
feature, but it would effectively require context switching related state that
the host needs loaded "immediately" after VM-Exit.  And for MSRs, that gets
very expensive without dedicated VMCS fields.

I mean, yeah, it's a correctness thing to not consume guest state in the host
and vice versa, but that's not unique to XSS in any way.

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v6 10/25] KVM: x86: Add kvm_msr_{read,write}() helpers
  2023-10-31 17:47   ` Maxim Levitsky
@ 2023-11-01 19:32     ` Sean Christopherson
  2023-11-02 18:26       ` Maxim Levitsky
  0 siblings, 1 reply; 120+ messages in thread
From: Sean Christopherson @ 2023-11-01 19:32 UTC (permalink / raw)
  To: Maxim Levitsky
  Cc: Yang Weijiang, pbonzini, kvm, linux-kernel, dave.hansen, peterz,
	chao.gao, rick.p.edgecombe, john.allen

On Tue, Oct 31, 2023, Maxim Levitsky wrote:
> On Thu, 2023-09-14 at 02:33 -0400, Yang Weijiang wrote:
> > Wrap __kvm_{get,set}_msr() into two new helpers for KVM usage and use the
> > helpers to replace existing usage of the raw functions.
> > kvm_msr_{read,write}() are KVM-internal helpers, i.e. used when KVM needs
> > to get/set a MSR value for emulating CPU behavior.
> 
> I am not sure if I like this patch or not. On one hand the code is cleaner
> this way, but on the other hand now it is easier to call kvm_msr_write() on
> behalf of the guest.
> 
> For example we also have the 'kvm_set_msr()' which does actually set the msr
> on behalf of the guest.
> 
> How about we call the new function kvm_msr_set_host() and rename
> kvm_set_msr() to kvm_msr_set_guest(), together with good comments explaning
> what they do?

LOL, just call me Nostradamus[*] ;-)

 : > SSP save/load should go to enter_smm_save_state_64() and rsm_load_state_64(),
 : > where other fields of SMRAM are handled.
 : 
 : +1.  The right way to get/set MSRs like this is to use __kvm_get_msr() and pass
 : %true for @host_initiated.  Though I would add a prep patch to provide wrappers
 : for __kvm_get_msr() and __kvm_set_msr().  Naming will be hard, but I think we
                                             ^^^^^^^^^^^^^^^^^^^
 : can use kvm_{read,write}_msr() to go along with the KVM-initiated register
 : accessors/mutators, e.g. kvm_register_read(), kvm_pdptr_write(), etc.

[*] https://lore.kernel.org/all/ZM0YZgFsYWuBFOze@google.com

> Also functions like kvm_set_msr_ignored_check(), kvm_set_msr_with_filter() and such,
> IMHO have names that are not very user friendly.

I don't like the host/guest split because KVM always operates on guest values,
e.g. kvm_msr_set_host() in particular could get confusing.

IMO kvm_get_msr() and kvm_set_msr(), and to some extent the helpers you note below,
are the real problem.

What if we rename kvm_{g,s}et_msr() to kvm_emulate_msr_{read,write}() to make it
more obvious that those are the "guest" helpers?  And do that as a prep patch in
this series (there aren't _that_ many users).

I'm also in favor of renaming the "inner" helpers, but I think we should tackle
those separately.

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v6 23/25] KVM: x86: Enable CET virtualization for VMX and advertise to userspace
  2023-10-31 17:56   ` Maxim Levitsky
@ 2023-11-01 22:14     ` Sean Christopherson
  0 siblings, 0 replies; 120+ messages in thread
From: Sean Christopherson @ 2023-11-01 22:14 UTC (permalink / raw)
  To: Maxim Levitsky
  Cc: Yang Weijiang, pbonzini, kvm, linux-kernel, dave.hansen, peterz,
	chao.gao, rick.p.edgecombe, john.allen

On Tue, Oct 31, 2023, Maxim Levitsky wrote:
> On Thu, 2023-09-14 at 02:33 -0400, Yang Weijiang wrote:
> > @@ -685,6 +686,13 @@ void kvm_set_cpu_caps(void)
> >  		kvm_cpu_cap_set(X86_FEATURE_INTEL_STIBP);
> >  	if (boot_cpu_has(X86_FEATURE_AMD_SSBD))
> >  		kvm_cpu_cap_set(X86_FEATURE_SPEC_CTRL_SSBD);
> > +	/*
> > +	 * The feature bit in boot_cpu_data.x86_capability could have been
> > +	 * cleared due to ibt=off cmdline option, then add it back if CPU
> > +	 * supports IBT.
> > +	 */
> > +	if (cpuid_edx(7) & F(IBT))
> > +		kvm_cpu_cap_set(X86_FEATURE_IBT);
> 
> The usual policy is that when the host doesn't support a feature, then the guest
> should not support it either. On the other hand, for this particular feature,
> it is probably safe to use it. Just a point for a discussion.

Agreed, this needs extra justification.  It's "safe" in theory, but if the admin
disabled IBT because of a ucode bug, then all bets are off.

I'm guessing this was added because of the virtualization hole?  I.e. if KVM
allows CR4.CET=1 for shadow stacks, then KVM can't (easily?) prevent the guest
from also using IBT.

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v6 06/25] x86/fpu/xstate: Opt-in kernel dynamic bits when calculate guest xstate size
  2023-11-01 14:16               ` Sean Christopherson
@ 2023-11-02 18:20                 ` Maxim Levitsky
  2023-11-03 14:33                   ` Sean Christopherson
  0 siblings, 1 reply; 120+ messages in thread
From: Maxim Levitsky @ 2023-11-02 18:20 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Weijiang Yang, Dave Hansen, pbonzini, kvm, linux-kernel, peterz,
	chao.gao, rick.p.edgecombe, john.allen

On Wed, 2023-11-01 at 07:16 -0700, Sean Christopherson wrote:
> On Tue, Oct 31, 2023, Maxim Levitsky wrote:
> > On Thu, 2023-10-26 at 10:24 -0700, Sean Christopherson wrote:
> > > On Wed, Oct 25, 2023, Weijiang Yang wrote:
> > On top of that I think that applying the same permission approach to guest's
> > FPU state is not a good fit, because of two reasons:
> > 
> > 1. The guest FPU state will never be pushed on the signal stack - KVM swaps
> >    back the host FPU state before it returns from the KVM_RUN ioctl.
> > 
> >    Also I think (not sure) that ptrace can only access (FPU) state of a
> >    stopped process, and a stopped vCPU process will also first return to
> >    userspace. Again I might be mistaken here, I never researched this in
> >    depth.
> > 
> >    Assuming that I am correct on these assumptions, the guest FPU state can
> >    only be accessed via KVM_GET_XSAVE/KVM_SET_XSAVE/KVM_GET_XSAVE2 ioctls,
> >    which also returns the userspace portion of the state including optionally
> >    the AMX state, but this ioctl doesn't really need FPU permission
> >    framework, because it is a KVM ABI, and in fact KVM_GET_XSAVE2 was added
> >    exactly because of that: to make sure that userspace is aware that larger
> >    than 4K buffer can be returned.
> > 
> > 2. Guest FPU state is not even on demand resized (but I can imagine that in
> >    the future we will do this).
> 
> Just because guest FPU state isn't resized doesn't mean there's no value in
> requiring userspace to opt-in to allocating 8KiB of data per-vCPU.
See my response below:
> 
> > And of course, adding permissions for kernel features, that is even worse
> > idea, which we really shouldn't do.
> > 
> > >  
> > > If there are no objections, I'll test the below and write a proper changelog.
> > >  
> > > --
> > > From: Sean Christopherson <seanjc@google.com>
> > > Date: Thu, 26 Oct 2023 10:17:33 -0700
> > > Subject: [PATCH] x86/fpu/xstate: Always preserve non-user xfeatures/flags in
> > >  __state_perm
> > > 
> > > Fixes: 781c64bfcb73 ("x86/fpu/xstate: Handle supervisor states in XSTATE permissions")
> > > Signed-off-by: Sean Christopherson <seanjc@google.com>
> > > ---
> > >  arch/x86/kernel/fpu/xstate.c | 18 +++++++++++-------
> > >  1 file changed, 11 insertions(+), 7 deletions(-)
> > > 
> > > diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
> > > index ef6906107c54..73f6bc00d178 100644
> > > --- a/arch/x86/kernel/fpu/xstate.c
> > > +++ b/arch/x86/kernel/fpu/xstate.c
> > > @@ -1601,16 +1601,20 @@ static int __xstate_request_perm(u64 permitted, u64 requested, bool guest)
> > >  	if ((permitted & requested) == requested)
> > >  		return 0;
> > >  
> > > -	/* Calculate the resulting kernel state size */
> > > +	/*
> > > +	 * Calculate the resulting kernel state size.  Note, @permitted also
> > > +	 * contains supervisor xfeatures even though supervisor are always
> > > +	 * permitted for kernel and guest FPUs, and never permitted for user
> > > +	 * FPUs.
> > > +	 */
> > >  	mask = permitted | requested;
> > > -	/* Take supervisor states into account on the host */
> > > -	if (!guest)
> > > -		mask |= xfeatures_mask_supervisor();
> > >  	ksize = xstate_calculate_size(mask, compacted);
> > 
> > This might not work with kernel dynamic features, because
> > xfeatures_mask_supervisor() will return all supported supervisor features.
> 
> I don't understand what you mean by "This".

> 
> Somewhat of a side topic, I feel very strongly that we should use "guest only"
> terminology instead of "dynamic".  There is nothing dynamic about whether or not
> XFEATURE_CET_KERNEL is allowed; there's not even a real "decision" beyond checking
> whether or not CET is supported.

> > Therefore at least until we have an actual kernel dynamic feature (a feature
> > used by the host kernel and not KVM, and which has to be dynamic like AMX),
> > I suggest that KVM stops using the permission API completely for the guest
> > FPU state, and just gives all the features it wants to enable right to
> 
> By "it", I assume you mean userspace?
> 
> > __fpu_alloc_init_guest_fpstate() (Guest FPU permission API IMHO should be
> > deprecated and ignored)
> 
> KVM allocates guest FPU state during KVM_CREATE_VCPU, so not using prctl() would
> either require KVM to defer allocating guest FPU state until KVM_SET_CPUID{,2},
> or would require a VM-scoped KVM ioctl() to let userspace opt-in to
> 
> Allocating guest FPU state during KVM_SET_CPUID{,2} would get messy, 

> as KVM allows
> multiple calls to KVM_SET_CPUID{,2} so long as the vCPU hasn't done KVM_RUN.  E.g.
> KVM would need to support actually resizing guest FPU state, which would be extra
> complexity without any meaningful benefit.


OK, I understand you now. What you claim is that it is legal to do this:

- KVM_SET_XSAVE
- KVM_SET_CPUID (with AMX enabled)

KVM_SET_CPUID will have to resize the xstate which is already valid.

Your patch to fix __xstate_request_perm() does seem to be correct in the sense that it will
preserve the kernel FPU components in the FPU permissions.

However note that kernel fpu permissions come from 'fpu_kernel_cfg.default_features' 
which don't include the dynamic kernel xfeatures (added a few patches before this one).

Therefore an attempt to resize the xstate to include a kernel dynamic feature via
__xfd_enable_feature will fail.

If KVM, on the other hand, includes all the kernel dynamic features in the initial
allocation of the FPU state (not optimal but possible), then a later call to __xstate_request_perm
for a userspace dynamic feature (which can still happen) will mess up the xstate,
because again the permission code assumes that only default kernel features were granted permissions.


This has to be solved one way or another.

> 
> The only benefit I can think of for a VM-scoped ioctl() is that it would allow a
> single process to host multiple VMs with different dynamic xfeature requirements.
> But such a setup is mostly theoretical.  Maybe it'll affect the SEV migration
> helper at some point?  But even that isn't guaranteed.
> 
> So while I agree that ARCH_GET_XCOMP_GUEST_PERM isn't ideal, practically speaking
> it's sufficient for all current use cases.  Unless a concrete use case comes along,
> deprecating ARCH_GET_XCOMP_GUEST_PERM in favor of a KVM ioctl() would be churn for
> both the kernel and userspace without any meaningful benefit, or really even any
> true change in behavior.


ARCH_GET_XCOMP_GUEST_PERM/ARCH_SET_XCOMP_GUEST_PERM is not a good API from a usability POV, because it is redundant.

KVM already has an API called KVM_SET_CPUID2, by which QEMU/userspace instructs KVM how much
space to allocate to support a VM with *this* CPUID.


For example, if QEMU asks for nested SVM/VMX, then KVM will allocate state for it on demand (also at least 8K/vCPU, btw).
The same should apply for AMX: QEMU sets the AMX xsave bit in CPUID, and that permits KVM to allocate the extra state when needed.

I don't see why we need an extra, non-KVM API for that.


Best regards,
	Maxim Levitsky






^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v6 09/25] KVM: x86: Rework cpuid_get_supported_xcr0() to operate on vCPU data
  2023-11-01 14:41     ` Sean Christopherson
@ 2023-11-02 18:25       ` Maxim Levitsky
  0 siblings, 0 replies; 120+ messages in thread
From: Maxim Levitsky @ 2023-11-02 18:25 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Yang Weijiang, pbonzini, kvm, linux-kernel, dave.hansen, peterz,
	chao.gao, rick.p.edgecombe, john.allen

On Wed, 2023-11-01 at 07:41 -0700, Sean Christopherson wrote:
> On Tue, Oct 31, 2023, Maxim Levitsky wrote:
> > On Thu, 2023-09-14 at 02:33 -0400, Yang Weijiang wrote:
> > > From: Sean Christopherson <seanjc@google.com>
> > > 
> > > Rework and rename cpuid_get_supported_xcr0() to explicitly operate on vCPU
> > > state, i.e. on a vCPU's CPUID state.  Prior to commit 275a87244ec8 ("KVM:
> > > x86: Don't adjust guest's CPUID.0x12.1 (allowed SGX enclave XFRM)"), KVM
> > > incorrectly fudged guest CPUID at runtime,
> > Can you explain how commit 275a87244ec8 relates to this patch?
> > 
> > > which in turn necessitated massaging the incoming CPUID state for
> > > KVM_SET_CPUID{2} so as not to run afoul of kvm_cpuid_check_equal().
> > 
> > Can you link the commit that added this 'massaging' and explain on how this
> > relates to this patch?
> 
> It's commit 275a87244ec8, which is right above.  I think the missing part is an
> explicit call out that the massaging used cpuid_get_supported_xcr0() with the
> incoming "struct kvm_cpuid_entry2", i.e. without a "struct kvm_vcpu".
> 
> > Can you explain what is the problem that this patch is trying to solve?
> 
> Is this better?
> 
> --
> Rework and rename cpuid_get_supported_xcr0() to explicitly operate on vCPU
> state, i.e. on a vCPU's CPUID state, now that the only usage of the helper
> is to retrieve a vCPU's already-set CPUID.
> 
> Prior to commit 275a87244ec8 ("KVM: x86: Don't adjust guest's CPUID.0x12.1
> (allowed SGX enclave XFRM)"), KVM incorrectly fudged guest CPUID at
> runtime, which in turn necessitated massaging the incoming CPUID state for
> KVM_SET_CPUID{2} so as not to run afoul of kvm_cpuid_check_equal().  I.e.
> KVM also invoked cpuid_get_supported_xcr0() with the incoming CPUID state,
> and thus without an explicit vCPU object.

Ah, I understand you. I incorrectly assumed that KVM doesn't allow different CPUID
on different vCPUs, while the actual restriction that was recently placed was
to not allow changing a vCPU's CPUID once the vCPU has entered guest mode.

I also understand what you mean in regard to the commit 275a87244ec8 but IMHO
this part of the commit message is only adding to the confusion.

I think that this will be a better commit message:


"Rework and rename cpuid_get_supported_xcr0() to explicitly operate on vCPU state
i.e. on a vCPU's CPUID state.

This is needed because KVM permits different vCPUs to have different CPUIDs,
and thus it is valid for each vCPU to have a different set of supported XCR0 bits."

> --
> 
> > Is it really allowed in x86 spec to have different supported mask of XCR0 bits
> > on different CPUs (assuming all CPUs of the same type)?
> 
> Yes, nothing in the SDM explicitly states that all cores have identical feature
> sets.  And "assuming all CPUs of the same type" isn't really a valid constraint
> because it's very doable to put different SKUs into a multi-socket system.
> 
> Intel even (somewhat inadvertently) kinda sorta shipped such CPUs, as Alder Lake
> P-cores support AVX512 but E-cores do not, and IIRC early (pre-production?) BIOS
> didn't disable AVX512 on the P-Cores, i.e. software could observe cores with and
> without AVX512.  That quickly got fixed because it confused software, but until
> Intel squashed AVX512 entirely with a microcode update, disabling E-Cores in BIOS
> would effectively enable AVX512 on the remaining P-Cores.

Yea, sure, I know about this, that's why I said "same type". I was just under the
impression that KVM doesn't 'officially' support heterogeneous vCPUs and had recently
added a check to ensure that all vCPUs have the same CPUID. Now I understand.

> 
> And it's not XCR0-related, but PMUs on Alder Lake (and all Intel hybrid CPUs) are
> truly heterogeneous.  It's a mess for virtualization, but concrete proof that there
> are no architectural guarantees regarding homogeneity of feature sets.
> 
> > If true, does KVM supports it?
> 
> Yes.  Whether or not that's a good thing is definitely debatable, but KVM's ABI for
> a very long time has allowed userspace to expose whatever it wants via KVM_SET_CPUID.

> 
> Getting (guest) software to play nice is an entirely different matter, but exposing
> heterogeneous vCPUs isn't an architectural violation.
> 

Best regards,
	Maxim Levitsky



^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v6 10/25] KVM: x86: Add kvm_msr_{read,write}() helpers
  2023-11-01 19:32     ` Sean Christopherson
@ 2023-11-02 18:26       ` Maxim Levitsky
  2023-11-15  9:00         ` Yang, Weijiang
  0 siblings, 1 reply; 120+ messages in thread
From: Maxim Levitsky @ 2023-11-02 18:26 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Yang Weijiang, pbonzini, kvm, linux-kernel, dave.hansen, peterz,
	chao.gao, rick.p.edgecombe, john.allen

On Wed, 2023-11-01 at 12:32 -0700, Sean Christopherson wrote:
> On Tue, Oct 31, 2023, Maxim Levitsky wrote:
> > On Thu, 2023-09-14 at 02:33 -0400, Yang Weijiang wrote:
> > > Wrap __kvm_{get,set}_msr() into two new helpers for KVM usage and use the
> > > helpers to replace existing usage of the raw functions.
> > > kvm_msr_{read,write}() are KVM-internal helpers, i.e. used when KVM needs
> > > to get/set an MSR value for emulating CPU behavior.
> > 
> > I am not sure if I like this patch or not. On one hand the code is cleaner
> > this way, but on the other hand now it is easier to call kvm_msr_write() on
> > behalf of the guest.
> > 
> > For example we also have the 'kvm_set_msr()' which does actually set the msr
> > on behalf of the guest.
> > 
> > How about we call the new function kvm_msr_set_host() and rename
> > kvm_set_msr() to kvm_msr_set_guest(), together with good comments explaning
> > what they do?
> 
> LOL, just call me Nostradamus[*] ;-)
> 
>  : > SSP save/load should go to enter_smm_save_state_64() and rsm_load_state_64(),
>  : > where other fields of SMRAM are handled.
>  : 
>  : +1.  The right way to get/set MSRs like this is to use __kvm_get_msr() and pass
>  : %true for @host_initiated.  Though I would add a prep patch to provide wrappers
>  : for __kvm_get_msr() and __kvm_set_msr().  Naming will be hard, but I think we
>                                              ^^^^^^^^^^^^^^^^^^^
>  : can use kvm_{read,write}_msr() to go along with the KVM-initiated register
>  : accessors/mutators, e.g. kvm_register_read(), kvm_pdptr_write(), etc.
> 
> [*] https://lore.kernel.org/all/ZM0YZgFsYWuBFOze@google.com
> 
> > Also functions like kvm_set_msr_ignored_check(), kvm_set_msr_with_filter() and such,
> > IMHO have names that are not very user friendly.
> 
> I don't like the host/guest split because KVM always operates on guest values,
> e.g. kvm_msr_set_host() in particular could get confusing.
That makes sense.

> 
> IMO kvm_get_msr() and kvm_set_msr(), and to some extent the helpers you note below,
> are the real problem.
> 
> What if we rename kvm_{g,s}et_msr() to kvm_emulate_msr_{read,write}() to make it
> more obvious that those are the "guest" helpers?  And do that as a prep patch in
> this series (there aren't _that_ many users).
Makes sense.

> 
> I'm also in favor of renaming the "inner" helpers, but I think we should tackle
> those separately.

OK.

> 

Best regards,
	Maxim Levitsky



* Re: [PATCH v6 11/25] KVM: x86: Report XSS as to-be-saved if there are supported features
  2023-11-01 19:18     ` Sean Christopherson
@ 2023-11-02 18:31       ` Maxim Levitsky
  0 siblings, 0 replies; 120+ messages in thread
From: Maxim Levitsky @ 2023-11-02 18:31 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Yang Weijiang, pbonzini, kvm, linux-kernel, dave.hansen, peterz,
	chao.gao, rick.p.edgecombe, john.allen

On Wed, 2023-11-01 at 12:18 -0700, Sean Christopherson wrote:
> On Tue, Oct 31, 2023, Maxim Levitsky wrote:
> > On Thu, 2023-09-14 at 02:33 -0400, Yang Weijiang wrote:
> > > From: Sean Christopherson <seanjc@google.com>
> > > 
> > > Add MSR_IA32_XSS to list of MSRs reported to userspace if supported_xss
> > > is non-zero, i.e. KVM supports at least one XSS based feature.
> > 
> > I can't believe that CET is the first supervisor feature that KVM supports...
> > 
> > Ah, now I understand why:
> > 
> > 1. XSAVES on AMD can't really be intercepted (other than clearing CR4.OSXSAVE
> >    bit, which isn't an option if you want to support AVX for example) On VMX
> >    however you can intercept XSAVES and even intercept it only when it touches
> >    specific bits of state that you don't want the guest to read/write freely.
> > 
> > 2. Even if it was possible to intercept it, guests use XSAVES on every
> >    context switch if available and emulating it might be costly.
> > 
> > 3. Emulating XSAVES is also not that easy to do correctly.
> > 
> > However XSAVES touches various MSRs, thus letting the guest use it
> > unintercepted means giving access to host MSRs, which might be wrong security
> > wise in some cases.
> > 
> > Thus I see that KVM hardcodes IA32_XSS to 0, and that makes XSAVES
> > work exactly like XSAVE.
> > 
> > And for some features that would benefit from XSAVES state components,
> > KVM likely won't even be able to support them due to this limitation
> > (thankfully this is allowed by CPUID), forcing guests to use
> > rdmsr/wrmsr instead.
> 
> Sort of?  KVM doesn't (yet) virtualize PASID, HDC, HWP, or arch LBRs (wow,
> there's a lot of stuff getting thrown into XSTATE), so naturally those aren't
> supported in XSS.
> 
> KVM does virtualize Processor Trace (PT), but PT is a bit of a special snowflake.
> E.g. the host kernel elects NOT to manage PT MSRs via XSTATE, but it would be
> possible for KVM to let the guest manage PT MSRs via XSTATE.

I must also note that PT doesn't always use guest physical addresses to write
its trace output, because there is a secondary execution control,
'Intel PT uses guest physical addresses'.  However, I see that KVM requires it,
so yes, we likely could have supported a PT XSAVES component.

> 
> I suspect the answer to PT is threefold:
> 
>  1. Exposing a feature that isn't "supported" by the host kernel is scary.
>  2. No one has pushed for the support, e.g. Linux guests obviously don't complain
>     about lack of XSS support for PT.
>  3. Toggling PT MSR passthrough on XSAVES/XRSTORS accesses would be more complex
>     and less performant than KVM's current approach.
> 
> Re: #3, KVM does passthrough PT MSRs, but only when the guest is actively using
> PT.  PT is basically a super fancy PMU feature, and so KVM "needs" to load guest
> state as late as possible before VM-Entry, and load host state as early as possible
> after VM-Exit.  I.e. the context switch happens on *every* entry/exit pair.
> 
Makes sense.

> By passing through PT MSRs only when needed, KVM avoids a rather large pile of
> RDMSRs and WRMSRs on every entry/exit, as the host values can be kept resident in
> hardware so long as the main enable bit is cleared in the guest's control MSR
> (which is context switch via a dedicated VMCS field).
> 
> XSAVES isn't subject to MSR intercepts, but KVM could utilize VMX's XSS-exiting
> bitmap to effectively intercept reads and writes to PT MSRs.  Except that as you
> note, KVM would either need to emulate XSAVES (oof) or save/load PT MSRs much more
> frequently.
> 
> So it's kind of an emulation thing, but I honestly doubt that emulating XSAVES
> was ever seriously considered when KVM support for PT was added.
> 
> CET is different than PT because the MSRs that need to be context switched at
> every entry/exit have dedicated VMCS fields.  The IA32_PLx_SSP MSRs don't have
> VMCS fields, but they are consumed only in privilege level changes, i.e. can be
> safely deferred until guest "FPU" state is put.
> 
> > However it is possible to enable IA32_XSS bits in cases where the MSRs XSAVES
> > reads/writes can't do harm to the host, and then KVM can context switch these
> > MSRs when the guest exits, and that is what is done here with CET.
> 
> This isn't really true.  It's not a safety or correctness issue so much as it's
> a performance issue. 
True as well, I haven't thought about it from this POV.


>  E.g. KVM could let the guest use XSS for any virtualized
> feature, but it would effectively require context switching related state that
> the host needs loaded "immediately" after VM-Exit.  And for MSRs, that gets
> very expensive without dedicated VMCS fields.

Yes, unless allowing the guest to set an MSR via XRSTORS causes harm to the host
(for example, an MSR that holds a physical address).

Such MSRs cannot be allowed to be set by the guest even for the duration of the guest
run, and that means that we cannot pass through the corresponding XSS state component.

> 
> I mean, yeah, it's a correctness thing to not consume guest state in the host
> and vice versa, but that's not unique to XSS in any way.
> 

Best regards,
	Maxim Levitsky




* Re: [PATCH v6 14/25] KVM: x86: Load guest FPU state when access XSAVE-managed MSRs
  2023-11-01 18:05     ` Sean Christopherson
@ 2023-11-02 18:31       ` Maxim Levitsky
  2023-11-03  8:46       ` Yang, Weijiang
  1 sibling, 0 replies; 120+ messages in thread
From: Maxim Levitsky @ 2023-11-02 18:31 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Yang Weijiang, pbonzini, kvm, linux-kernel, dave.hansen, peterz,
	chao.gao, rick.p.edgecombe, john.allen

On Wed, 2023-11-01 at 11:05 -0700, Sean Christopherson wrote:
> On Tue, Oct 31, 2023, Maxim Levitsky wrote:
> > On Thu, 2023-09-14 at 02:33 -0400, Yang Weijiang wrote:
> > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > > index 66edbed25db8..a091764bf1d2 100644
> > > --- a/arch/x86/kvm/x86.c
> > > +++ b/arch/x86/kvm/x86.c
> > > @@ -133,6 +133,9 @@ static int __set_sregs2(struct kvm_vcpu *vcpu, struct kvm_sregs2 *sregs2);
> > >  static void __get_sregs2(struct kvm_vcpu *vcpu, struct kvm_sregs2 *sregs2);
> > >  
> > >  static DEFINE_MUTEX(vendor_module_lock);
> > > +static void kvm_load_guest_fpu(struct kvm_vcpu *vcpu);
> > > +static void kvm_put_guest_fpu(struct kvm_vcpu *vcpu);
> > > +
> > >  struct kvm_x86_ops kvm_x86_ops __read_mostly;
> > >  
> > >  #define KVM_X86_OP(func)					     \
> > > @@ -4372,6 +4375,22 @@ int kvm_get_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
> > >  }
> > >  EXPORT_SYMBOL_GPL(kvm_get_msr_common);
> > >  
> > > +static const u32 xstate_msrs[] = {
> > > +	MSR_IA32_U_CET, MSR_IA32_PL0_SSP, MSR_IA32_PL1_SSP,
> > > +	MSR_IA32_PL2_SSP, MSR_IA32_PL3_SSP,
> > > +};
> > > +
> > > +static bool is_xstate_msr(u32 index)
> > > +{
> > > +	int i;
> > > +
> > > +	for (i = 0; i < ARRAY_SIZE(xstate_msrs); i++) {
> > > +		if (index == xstate_msrs[i])
> > > +			return true;
> > > +	}
> > > +	return false;
> > > +}
> > 
> > The name 'xstate_msrs' IMHO is not clear.
> > 
> > How about naming it 'guest_fpu_state_msrs', together with adding a comment like that:
> 
> Maybe xstate_managed_msrs?  I'd prefer not to include "guest" because the behavior
> is more a property of the architecture and/or the host kernel.  I understand where
> you're coming from, but it's that the MSR *values* are part of guest state, whereas the
> check is a query on how KVM manages the MSR value, if that makes sense.
Makes sense.
> 
> And I really don't like "FPU".  I get why the kernel uses the "FPU" terminology,
> but for this check in particular I want to tie the behavior back to the architecture,
> i.e. provide the hint that the reason why these MSRs are special is because Intel
> defined them to be context switched via XSTATE.
> 
> Actually, this is unnecessary bikeshedding to some extent, using an array is silly.
> It's easier and likely far more performant (not that that matters in this case)
> to use a switch statement.
> 
> Is this better?
> 
> /*
>  * Returns true if the MSR in question is managed via XSTATE, i.e. is context
>  * switched with the rest of guest FPU state.
>  */
> static bool is_xstate_managed_msr(u32 index)
> {
> 	switch (index) {
> 	case MSR_IA32_U_CET:
> 	case MSR_IA32_PL0_SSP ... MSR_IA32_PL3_SSP:
> 		return true;
> 	default:
> 		return false;
> 	}
> }

Reasonable.

> 
> /*
>  * Read or write a bunch of msrs. All parameters are kernel addresses.
>  *
>  * @return number of msrs set successfully.
>  */
> static int __msr_io(struct kvm_vcpu *vcpu, struct kvm_msrs *msrs,
> 		    struct kvm_msr_entry *entries,
> 		    int (*do_msr)(struct kvm_vcpu *vcpu,
> 				  unsigned index, u64 *data))
> {
> 	bool fpu_loaded = false;
> 	int i;
> 
> 	for (i = 0; i < msrs->nmsrs; ++i) {
> 		/*
> 	 	 * If userspace is accessing one or more XSTATE-managed MSRs,
> 		 * temporarily load the guest's FPU state so that the guest's
> 		 * MSR value(s) is resident in hardware, i.e. so that KVM can
> 		 * get/set the MSR via RDMSR/WRMSR.
> 	 	 */
Reasonable as well.
> 		if (vcpu && !fpu_loaded && kvm_caps.supported_xss &&
> 		    is_xstate_managed_msr(entries[i].index)) {
> 			kvm_load_guest_fpu(vcpu);
> 			fpu_loaded = true;
> 		}
> 		if (do_msr(vcpu, entries[i].index, &entries[i].data))
> 			break;
> 	}
> 	if (fpu_loaded)
> 		kvm_put_guest_fpu(vcpu);
> 
> 	return i;
> }
> 

Best regards,
	Maxim Levitsky




* Re: [PATCH v6 18/25] KVM: x86: Use KVM-governed feature framework to track "SHSTK/IBT enabled"
  2023-11-01 15:46     ` Sean Christopherson
@ 2023-11-02 18:35       ` Maxim Levitsky
  2023-11-04  0:07         ` Sean Christopherson
  0 siblings, 1 reply; 120+ messages in thread
From: Maxim Levitsky @ 2023-11-02 18:35 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Yang Weijiang, pbonzini, kvm, linux-kernel, dave.hansen, peterz,
	chao.gao, rick.p.edgecombe, john.allen

On Wed, 2023-11-01 at 08:46 -0700, Sean Christopherson wrote:
> On Tue, Oct 31, 2023, Maxim Levitsky wrote:
> > On Thu, 2023-09-14 at 02:33 -0400, Yang Weijiang wrote:
> > > Use the governed feature framework to track whether X86_FEATURE_SHSTK
> > > and X86_FEATURE_IBT features can be used by userspace and guest, i.e.,
> > > the features can be used iff both KVM and guest CPUID can support them.
> > > 
> > > Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>
> > > ---
> > >  arch/x86/kvm/governed_features.h | 2 ++
> > >  arch/x86/kvm/vmx/vmx.c           | 2 ++
> > >  2 files changed, 4 insertions(+)
> > > 
> > > diff --git a/arch/x86/kvm/governed_features.h b/arch/x86/kvm/governed_features.h
> > > index 423a73395c10..db7e21c5ecc2 100644
> > > --- a/arch/x86/kvm/governed_features.h
> > > +++ b/arch/x86/kvm/governed_features.h
> > > @@ -16,6 +16,8 @@ KVM_GOVERNED_X86_FEATURE(PAUSEFILTER)
> > >  KVM_GOVERNED_X86_FEATURE(PFTHRESHOLD)
> > >  KVM_GOVERNED_X86_FEATURE(VGIF)
> > >  KVM_GOVERNED_X86_FEATURE(VNMI)
> > > +KVM_GOVERNED_X86_FEATURE(SHSTK)
> > > +KVM_GOVERNED_X86_FEATURE(IBT)
> > >  
> > >  #undef KVM_GOVERNED_X86_FEATURE
> > >  #undef KVM_GOVERNED_FEATURE
> > > diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> > > index 9409753f45b0..fd5893b3a2c8 100644
> > > --- a/arch/x86/kvm/vmx/vmx.c
> > > +++ b/arch/x86/kvm/vmx/vmx.c
> > > @@ -7765,6 +7765,8 @@ static void vmx_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
> > >  		kvm_governed_feature_check_and_set(vcpu, X86_FEATURE_XSAVES);
> > >  
> > >  	kvm_governed_feature_check_and_set(vcpu, X86_FEATURE_VMX);
> > > +	kvm_governed_feature_check_and_set(vcpu, X86_FEATURE_SHSTK);
> > > +	kvm_governed_feature_check_and_set(vcpu, X86_FEATURE_IBT);
> > >  
> > >  	vmx_setup_uret_msrs(vmx);
> > >  
> > 
> > Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com>
> > 
> > 
> > PS: IMHO The whole 'governed feature framework' is very confusing and
> > somewhat poorly documented.
> > 
> > Currently the only partial explanation of it is at 'governed_features',
> > which doesn't explain how to use it.
> 
> To be honest, terrible name aside, I thought kvm_governed_feature_check_and_set()
> would be fairly self-explanatory, at least relative to all the other CPUID handling
> in KVM.

What is not self-explanatory is what the governed features are and how to query them.

> 
> > For the reference this is how KVM expects governed features to be used in the
> > common case (there are some exceptions to this but they are rare)
> > 
> > 1. If a feature is not enabled in host CPUID or KVM doesn't support it, 
> >    KVM is expected to not enable it in KVM cpu caps.
> > 
> > 2. Userspace uploads guest CPUID.
> > 
> > 3. After the guest CPUID upload, the vendor code calls
> >    kvm_governed_feature_check_and_set() which sets governed features = True iff
> >    feature is supported in both kvm cpu caps and in guest CPUID.
> > 
> > 4. kvm/vendor code uses 'guest_can_use()' to query the value of the governed
> >    feature instead of reading guest CPUID.
> > 
> > It might make sense to document the above somewhere at least.
> > 
> > Now about another thing I am thinking:
> > 
> > I do know that the mess of boolean flags that svm had is worse than these
> > governed features, and functionality-wise these are equivalent.
> > 
> > However thinking again about the whole thing: 
> > 
> > IMHO the 'governed features' is another quite confusing term that a KVM
> > developer will need to learn and keep in memory.
> 
> I 100% agree, but I explicitly called out the terrible name in the v1 and v2
> cover letters[1][2], and the patches were on the list for 6 months before I
> applied them.  I'm definitely still open to a better name, but I'm also not
> exactly chomping at the bit to get behind the bikeshed.

Honestly I don't know if I can come up with a better name either.
Name is IMHO not the underlying problem, it's the feature itself that is confusing.

> 
> v1:
>  : Note, I don't like the name "governed", but it was the least awful thing I
>  : could come up with.  Suggestions most definitely welcome.
> 
> v2:
>  : Note, I still don't like the name "governed", but no one has suggested
>  : anything else, let alone anything better :-)
> 
> 
> [1] https://lore.kernel.org/all/20230217231022.816138-1-seanjc@google.com
> [2] https://lore.kernel.org/all/20230729011608.1065019-1-seanjc@google.com
> 
> > Because of that, can't we just use guest CPUID as a single source of truth
> > and drop all the governed features code?
> 
> No, not without a rather massive ABI break.  To make guest CPUID the single source
> of truth, KVM would need to modify guest CPUID to squash features that userspace
> has set, but that are not supported by hardware.  And that is most definitely a
> can of worms I don't want to reopen, e.g. see the mess that got created when KVM
> tried to "help" userspace by mucking with VMX capability MSRs in response to
> CPUID changes.


> 
> There aren't many real use cases for advertising "unsupported" features via guest
> CPUID, but there are some, and I have definitely abused KVM_SET_CPUID2 for testing
> purposes.
> 
> And as above, that doesn't work for X86_FEATURE_XSAVES or X86_FEATURE_GBPAGES.
> 
> We'd also have to overhaul guest CPUID lookups to be significantly faster (which
> is doable), as one of the motivations for the framework was to avoid the overhead
> of looking through guest CPUID without needing one-off boolean fields.
> 
> > In most cases, when the governed feature value will differ from the guest
> > CPUID is when a feature is enabled in the guest CPUID, but not enabled in the
> > KVM caps.
> > 
> > I do see two exceptions to this: XSAVES on AMD and X86_FEATURE_GBPAGES, in
> > which the opposite happens: the governed feature is enabled even when the
> > feature is hidden from the guest CPUID.  But it might be better, readability
> > wise, to deal with these cases manually, and we are unlikely to have many
> > new such cases in the future.
> > 
> > So for the common case of CPUID mismatch, when the governed feature is
> > disabled but guest CPUID is enabled, does it make sense to allow this? 
> 
> Yes and no.  For "governed features", probably not.  But for CPUID as a whole, there
> are legitimate cases where userspace needs to enumerate things that aren't officially
> "supported" by KVM.  E.g. topology, core crystal frequency (CPUID 0x15), defeatures
> that KVM hasn't yet learned about, features that don't have virtualization controls
> and KVM hasn't yet learned about, etc.  And for things like Xen and Hyper-V paravirt
> features, it's very doable to implement features that are enumerated by CPUID fully
> in userspace, e.g. using MSR filters.
> 
> But again, it's a moot point because KVM has (mostly) allowed userspace to fully
> control guest CPUID for a very long time.
> 
> > Such a feature, which is advertised as supported but not really working, is a
> > recipe for hard-to-find guest bugs IMHO.
> > 
> > IMHO it would be much better to just check this condition and do
> > kvm_vm_bugged() or something in case when a feature is enabled in the guest
> > CPUID but KVM can't support it, and then just use guest CPUID in
> > 'guest_can_use()'.

OK, I won't argue that much over this, however I still think that there are
better ways to deal with it.

If we put optimizations aside (all of this can surely be optimized to have
very little overhead):

How about we have two CPUIDs: a guest-visible CPUID, which KVM will never use
directly other than during initialization, and an effective CPUID, which is roughly
what governed features are, but will include all features and will be initialized
roughly like governed features are initialized:

effective_cpuid = guest_cpuid & kvm_supported_cpuid

except for some forced overrides, e.g. for XSAVES and such.

Then we won't need to maintain a list of governed features, and guest_can_use()
for all features will just return the effective cpuid leafs.

In other words, I want KVM to turn all known CPUID features to governed features,
and then remove all the mentions of governed features except 'guest_can_use'
which is a good API.

Such a proposal will use a bit more memory but will make it easier for future
KVM developers to understand the code and have less chance of introducing bugs.

Best regards,
	Maxim Levitsky



> 
> Maybe, if we were creating KVM from scratch, e.g. didn't have to worry about
> existing userspace behavior and could implement a more forward-looking API than
> KVM_GET_SUPPORTED_CPUID.  But even then the enforcement would need to be limited
> to "pure" hardware-defined feature bits, and I suspect that there would still be
> exceptions.  And there would likely be complexity in dealing with CPUID leafs
> that are completely unknown to KVM, e.g. unless KVM completely disallowed non-zero
> values for unknown CPUID leafs, adding restrictions when a feature is defined by
> Intel or AMD would be at constant risk of breaking userspace.
> 




* Re: [PATCH v6 19/25] KVM: VMX: Emulate read and write to CET MSRs
  2023-11-01 16:31     ` Sean Christopherson
@ 2023-11-02 18:38       ` Maxim Levitsky
  2023-11-02 23:58         ` Sean Christopherson
  2023-11-03  8:18       ` Yang, Weijiang
  1 sibling, 1 reply; 120+ messages in thread
From: Maxim Levitsky @ 2023-11-02 18:38 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Yang Weijiang, pbonzini, kvm, linux-kernel, dave.hansen, peterz,
	chao.gao, rick.p.edgecombe, john.allen

On Wed, 2023-11-01 at 09:31 -0700, Sean Christopherson wrote:
> On Tue, Oct 31, 2023, Maxim Levitsky wrote:
> > On Thu, 2023-09-14 at 02:33 -0400, Yang Weijiang wrote:
> > > Add emulation interface for CET MSR access. The emulation code is split
> > > into common part and vendor specific part. The former does common check
> > > for MSRs and reads/writes directly from/to XSAVE-managed MSRs via the
> > > helpers while the latter accesses the MSRs linked to VMCS fields.
> > > 
> > > Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>
> > > ---
> 
> ...
> 
> > > +	case MSR_IA32_PL0_SSP ... MSR_IA32_PL3_SSP:
> > > +	case MSR_KVM_SSP:
> > > +		if (host_msr_reset && kvm_cpu_cap_has(X86_FEATURE_SHSTK))
> > > +			break;
> > > +		if (!guest_can_use(vcpu, X86_FEATURE_SHSTK))
> > > +			return 1;
> > > +		if (index == MSR_KVM_SSP && !host_initiated)
> > > +			return 1;
> > > +		if (is_noncanonical_address(data, vcpu))
> > > +			return 1;
> > > +		if (index != MSR_IA32_INT_SSP_TAB && !IS_ALIGNED(data, 4))
> > > +			return 1;
> > > +		break;
> > Once again, I'd prefer to have an ioctl for setting/getting SSP; this will
> > make the above code simpler (e.g. there will be no need to check that the write
> > comes from the host, etc.).
> 
> I don't think an ioctl() would be simpler overall, especially when factoring in
> userspace.  With a synthetic MSR, we get the following quite cheaply:
> 
>  1. Enumerating support to userspace.
>  2. Save/restore of the value, e.g. for live migration.
>  3. Vendor hooks for propagating values to/from the VMCS/VMCB.
> 
> For an ioctl(), 
> #1 would require a capability, #2 (and #1 to some extent) would
> require new userspace flows, and #3 would require new kvm_x86_ops hooks.
> 
> The synthetic MSR adds a small amount of messiness, as does bundling 
> MSR_IA32_INT_SSP_TAB with the other shadow stack MSRs.  The bulk of the mess comes
> from the need to allow userspace to write '0' when KVM has enumerated support to
> userspace.


Let me put it this way - all hacks start like that, and in this case this is an
API/ABI hack, so we will have to live with it forever.

Once there is a precedent, trust me, there will be tens of new 'fake' MSRs added, and
the interface will become one big mess.

As I suggested, if you don't want to add a new capability/ioctl and vendor callback per
new x86 arch register, then let's implement KVM_GET_ONE_REG/KVM_SET_ONE_REG; then it
will be really easy to add new regs without confusing users and without polluting the
MSR namespace with MSRs that don't exist.


Best regards,
	Maxim Levitsky


> 
> If we isolate MSR_IA32_INT_SSP_TAB, that'll help with the synthetic MSR and with
> MSR_IA32_INT_SSP_TAB.  For the unfortunate "host reset" behavior, the best idea I
> came up with is to add a helper.  It's still a bit ugly, but the ugliness is
> contained in a helper and IMO makes it much easier to follow the case statements.
> 
> get:
> 
> 	case MSR_IA32_INT_SSP_TAB:
> 		if (!guest_can_use(vcpu, X86_FEATURE_SHSTK) ||
> 		    !guest_cpuid_has(vcpu, X86_FEATURE_LM))
> 			return 1;
> 		break;
> 	case MSR_KVM_SSP:
> 		if (!host_initiated)
> 			return 1;
> 		fallthrough;
> 	case MSR_IA32_PL0_SSP ... MSR_IA32_PL3_SSP:
> 		if (!guest_can_use(vcpu, X86_FEATURE_SHSTK))
> 			return 1;
> 		break;
> 
> static bool is_set_cet_msr_allowed(struct kvm_vcpu *vcpu, u32 index, u64 data,
> 				   bool host_initiated)
> {
> 	bool any_cet = index == MSR_IA32_S_CET || index == MSR_IA32_U_CET;
> 
> 	if (guest_can_use(vcpu, X86_FEATURE_SHSTK))
> 		return true;
> 
> 	if (any_cet && guest_can_use(vcpu, X86_FEATURE_IBT))
> 		return true;
> 
> 	/* 
> 	 * If KVM supports the MSR, i.e. has enumerated the MSR existence to
> 	 * userspace, then userspace is allowed to write '0' irrespective of
> 	 * whether or not the MSR is exposed to the guest.
> 	 */
> 	if (!host_initiated || data)
> 		return false;
> 
> 	if (kvm_cpu_cap_has(X86_FEATURE_SHSTK))
> 		return true;
> 
> 	return any_cet && kvm_cpu_cap_has(X86_FEATURE_IBT);
> }
> 
> set:
> 	case MSR_IA32_U_CET:
> 	case MSR_IA32_S_CET:
> 		if (!is_set_cet_msr_allowed(vcpu, index, data, host_initiated))
> 			return 1;
> 		if (data & CET_US_RESERVED_BITS)
> 			return 1;
> 		if (!guest_can_use(vcpu, X86_FEATURE_SHSTK) &&
> 		    (data & CET_US_SHSTK_MASK_BITS))
> 			return 1;
> 		if (!guest_can_use(vcpu, X86_FEATURE_IBT) &&
> 		    (data & CET_US_IBT_MASK_BITS))
> 			return 1;
> 		if (!IS_ALIGNED(CET_US_LEGACY_BITMAP_BASE(data), 4))
> 			return 1;
> 
> 		/* IBT can be suppressed iff the TRACKER isn't WAIT_ENDBR. */
> 		if ((data & CET_SUPPRESS) && (data & CET_WAIT_ENDBR))
> 			return 1;
> 		break;
> 	case MSR_IA32_INT_SSP_TAB:
> 		if (!guest_cpuid_has(vcpu, X86_FEATURE_LM))
> 			return 1;
> 
> 		if (is_noncanonical_address(data, vcpu))
> 			return 1;
> 		break;
> 	case MSR_KVM_SSP:
> 		if (!host_initiated)
> 			return 1;
> 		fallthrough;
> 	case MSR_IA32_PL0_SSP ... MSR_IA32_PL3_SSP:
> 		if (!is_set_cet_msr_allowed(vcpu, index, data, host_initiated))
> 			return 1;
> 		if (is_noncanonical_address(data, vcpu))
> 			return 1;
> 		if (!IS_ALIGNED(data, 4))
> 			return 1;
> 		break;
> 	}
> 




* Re: [PATCH v6 19/25] KVM: VMX: Emulate read and write to CET MSRs
  2023-11-02 18:38       ` Maxim Levitsky
@ 2023-11-02 23:58         ` Sean Christopherson
  2023-11-07 18:12           ` Maxim Levitsky
  0 siblings, 1 reply; 120+ messages in thread
From: Sean Christopherson @ 2023-11-02 23:58 UTC (permalink / raw)
  To: Maxim Levitsky
  Cc: Yang Weijiang, pbonzini, kvm, linux-kernel, dave.hansen, peterz,
	chao.gao, rick.p.edgecombe, john.allen

On Thu, Nov 02, 2023, Maxim Levitsky wrote:
> On Wed, 2023-11-01 at 09:31 -0700, Sean Christopherson wrote:
> > On Tue, Oct 31, 2023, Maxim Levitsky wrote:
> > > On Thu, 2023-09-14 at 02:33 -0400, Yang Weijiang wrote:
> > > > Add emulation interface for CET MSR access. The emulation code is split
> > > > into common part and vendor specific part. The former does common check
> > > > for MSRs and reads/writes directly from/to XSAVE-managed MSRs via the
> > > > helpers while the latter accesses the MSRs linked to VMCS fields.
> > > > 
> > > > Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>
> > > > ---
> > 
> > ...
> > 
> > > > +	case MSR_IA32_PL0_SSP ... MSR_IA32_PL3_SSP:
> > > > +	case MSR_KVM_SSP:
> > > > +		if (host_msr_reset && kvm_cpu_cap_has(X86_FEATURE_SHSTK))
> > > > +			break;
> > > > +		if (!guest_can_use(vcpu, X86_FEATURE_SHSTK))
> > > > +			return 1;
> > > > +		if (index == MSR_KVM_SSP && !host_initiated)
> > > > +			return 1;
> > > > +		if (is_noncanonical_address(data, vcpu))
> > > > +			return 1;
> > > > +		if (index != MSR_IA32_INT_SSP_TAB && !IS_ALIGNED(data, 4))
> > > > +			return 1;
> > > > +		break;
> > > Once again, I'd prefer to have an ioctl for setting/getting SSP; this will
> > > make the above code simpler (e.g. there will be no need to check that the write
> > > comes from the host, etc.).
> > 
> > I don't think an ioctl() would be simpler overall, especially when factoring in
> > userspace.  With a synthetic MSR, we get the following quite cheaply:
> > 
> >  1. Enumerating support to userspace.
> >  2. Save/restore of the value, e.g. for live migration.
> >  3. Vendor hooks for propagating values to/from the VMCS/VMCB.
> > 
> > For an ioctl(), 
> > #1 would require a capability, #2 (and #1 to some extent) would
> > require new userspace flows, and #3 would require new kvm_x86_ops hooks.
> > 
> > The synthetic MSR adds a small amount of messiness, as does bundling 
> > MSR_IA32_INT_SSP_TAB with the other shadow stack MSRs.  The bulk of the mess comes
> > from the need to allow userspace to write '0' when KVM has enumerated support to
> > userspace.
> 
> Let me put it this way - all hacks start like that, and in this case this is an API/ABI
> hack, so we will have to live with it forever.

Eh, I don't view it as a hack, at least the kind of hack that has a negative
connotation.  KVM effectively has ~240 MSR indices reserved for whatever KVM
wants.  The only weird thing about this one is that it's not accessible from the
guest.  Which I agree is quite weird, but from a code perspective I think it
works quite well.

> Once there is a precedent, trust me, there will be tens of new 'fake' MSRs added, and the
> interface will become one big mess.

That suggests MSRs aren't already one big mess. :-)  I'm somewhat joking, but also
somewhat serious.  I really don't think that adding one oddball synthetic MSR is
going to meaningfully move the needle on the messiness of MSRs.

Hmm, there probably is a valid slippery slope argument though.  As you say, at
some point, enough state will get shoved into hardware that KVM will need an
ever-growing number of synthetic MSRs to keep pace.

> As I suggested, if you don't want to add new capability/ioctl and vendor
> callback per new x86 arch register, then let's implement
> KVM_GET_ONE_REG/KVM_SET_ONE_REG and then it will be really easy to add new
> regs without confusing users, and without polluting msr namespace with msrs
> that don't exist.

I definitely don't hate the idea of KVM_{G,S}ET_ONE_REG, what I don't want is to
have an entirely separate path in KVM for handling the actual get/set.

What if we combine the approaches?  Add KVM_{G,S}ET_ONE_REG support so that the
uAPI can use completely arbitrary register indices without having to worry about
polluting the MSR space and making MSR_KVM_SSP ABI.

Ooh, if we're clever, I bet we can extend KVM_{G,S}ET_ONE_REG to also work with
existing MSRs, GPRs, and other stuff, i.e. not force userspace through the funky
KVM_SET_MSRS just to set one reg, and not force a RMW of all GPRs just to set
RIP or something.  E.g. use bits 39:32 of the id to encode the register class,
bits 31:0 to hold the index within a class, and reserve bits 63:40 for future
usage.

Then for KVM-defined registers, we can route them internally as needed, e.g. we
can still define MSR_KVM_SSP so that internally it's treated like an MSR, but its
index isn't ABI and so can be changed at will.  And future KVM-defined registers
wouldn't _need_ to be treated like MSRs, i.e. we could route registers through
the MSR APIs if and only if it makes sense to do so.

Side topic, why on earth is the data value of kvm_one_reg "addr"?


* Re: [PATCH v6 19/25] KVM: VMX: Emulate read and write to CET MSRs
  2023-11-01 16:31     ` Sean Christopherson
  2023-11-02 18:38       ` Maxim Levitsky
@ 2023-11-03  8:18       ` Yang, Weijiang
  2023-11-03 22:26         ` Sean Christopherson
  1 sibling, 1 reply; 120+ messages in thread
From: Yang, Weijiang @ 2023-11-03  8:18 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: pbonzini, kvm, linux-kernel, dave.hansen, peterz, chao.gao,
	rick.p.edgecombe, john.allen, Maxim Levitsky

On 11/2/2023 12:31 AM, Sean Christopherson wrote:
> On Tue, Oct 31, 2023, Maxim Levitsky wrote:
>> On Thu, 2023-09-14 at 02:33 -0400, Yang Weijiang wrote:
>>> Add an emulation interface for CET MSR access. The emulation code is split
>>> into a common part and a vendor-specific part. The former does the common
>>> checks for the MSRs and reads/writes XSAVE-managed MSRs directly via the
>>> helpers, while the latter accesses the MSRs linked to VMCS fields.
>>>
>>> Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>
>>> ---
> ...
>
>>> +	case MSR_IA32_PL0_SSP ... MSR_IA32_PL3_SSP:
>>> +	case MSR_KVM_SSP:
>>> +		if (host_msr_reset && kvm_cpu_cap_has(X86_FEATURE_SHSTK))
>>> +			break;
>>> +		if (!guest_can_use(vcpu, X86_FEATURE_SHSTK))
>>> +			return 1;
>>> +		if (index == MSR_KVM_SSP && !host_initiated)
>>> +			return 1;
>>> +		if (is_noncanonical_address(data, vcpu))
>>> +			return 1;
>>> +		if (index != MSR_IA32_INT_SSP_TAB && !IS_ALIGNED(data, 4))
>>> +			return 1;
>>> +		break;
>> Once again I'll prefer to have an ioctl for setting/getting SSP, this will
>> make the above code simpler (e.g there will be no need to check that write
>> comes from the host/etc).
> I don't think an ioctl() would be simpler overall, especially when factoring in
> userspace.  With a synthetic MSR, we get the following quite cheaply:
>
>   1. Enumerating support to userspace.
>   2. Save/restore of the value, e.g. for live migration.
>   3. Vendor hooks for propagating values to/from the VMCS/VMCB.
>
> For an ioctl(), #1 would require a capability, #2 (and #1 to some extent) would
> require new userspace flows, and #3 would require new kvm_x86_ops hooks.
>
> The synthetic MSR adds a small amount of messiness, as does bundling
> MSR_IA32_INT_SSP_TAB with the other shadow stack MSRs.  The bulk of the mess comes
> from the need to allow userspace to write '0' when KVM has enumerated support to
> userspace.
>
> If we isolate MSR_IA32_INT_SSP_TAB, that'll help with the synthetic MSR and with
> MSR_IA32_INT_SSP_TAB.  For the unfortunate "host reset" behavior, the best idea I
> came up with is to add a helper.  It's still a bit ugly, but the ugliness is
> contained in a helper and IMO makes it much easier to follow the case statements.

Frankly speaking, the existing code is not hard for me to understand :-); the handling for MSR_KVM_SSP
and MSR_IA32_INT_SSP_TAB is straightforward if readers consult the related spec.
But I'll take your advice and include the changes below. Thanks!
> get:
>
> 	case MSR_IA32_INT_SSP_TAB:
> 		if (!guest_can_use(vcpu, X86_FEATURE_SHSTK) ||
> 		    !guest_cpuid_has(vcpu, X86_FEATURE_LM))
> 			return 1;
> 		break;
> 	case MSR_KVM_SSP:
> 		if (!host_initiated)
> 			return 1;
> 		fallthrough;
> 	case MSR_IA32_PL0_SSP ... MSR_IA32_PL3_SSP:
> 		if (!guest_can_use(vcpu, X86_FEATURE_SHSTK))
> 			return 1;
> 		break;
>
> static bool is_set_cet_msr_allowed(struct kvm_vcpu *vcpu, u32 index, u64 data,
> 				   bool host_initiated)
> {
> 	bool any_cet = index == MSR_IA32_S_CET || index == MSR_IA32_U_CET;
>
> 	if (guest_can_use(vcpu, X86_FEATURE_SHSTK))
> 		return true;
>
> 	if (any_cet && guest_can_use(vcpu, X86_FEATURE_IBT))
> 		return true;
>
> 	/*
> 	 * If KVM supports the MSR, i.e. has enumerated the MSR's existence to
> 	 * userspace, then userspace is allowed to write '0' irrespective of
> 	 * whether or not the MSR is exposed to the guest.
> 	 */
> 	if (!host_initiated || data)
> 		return false;
> 	if (kvm_cpu_cap_has(X86_FEATURE_SHSTK))
> 		return true;
>
> 	return any_cet && kvm_cpu_cap_has(X86_FEATURE_IBT);
> }
>
> set:
> 	case MSR_IA32_U_CET:
> 	case MSR_IA32_S_CET:
> 		if (!is_set_cet_msr_allowed(vcpu, index, data, host_initiated))
> 			return 1;
> 		if (data & CET_US_RESERVED_BITS)
> 			return 1;
> 		if (!guest_can_use(vcpu, X86_FEATURE_SHSTK) &&
> 		    (data & CET_US_SHSTK_MASK_BITS))
> 			return 1;
> 		if (!guest_can_use(vcpu, X86_FEATURE_IBT) &&
> 		    (data & CET_US_IBT_MASK_BITS))
> 			return 1;
> 		if (!IS_ALIGNED(CET_US_LEGACY_BITMAP_BASE(data), 4))
> 			return 1;
>
> 		/* IBT can be suppressed iff the TRACKER isn't WAIT_ENDBR. */
> 		if ((data & CET_SUPPRESS) && (data & CET_WAIT_ENDBR))
> 			return 1;
> 		break;
> 	case MSR_IA32_INT_SSP_TAB:
> 		if (!guest_cpuid_has(vcpu, X86_FEATURE_LM))
> 			return 1;
>
> 		if (is_noncanonical_address(data, vcpu))
> 			return 1;
> 		break;
> 	case MSR_KVM_SSP:
> 		if (!host_initiated)
> 			return 1;
> 		fallthrough;
> 	case MSR_IA32_PL0_SSP ... MSR_IA32_PL3_SSP:
> 		if (!is_set_cet_msr_allowed(vcpu, index, data, host_initiated))
> 			return 1;
> 		if (is_noncanonical_address(data, vcpu))
> 			return 1;
> 		if (!IS_ALIGNED(data, 4))
> 			return 1;
> 		break;
> 	}



* Re: [PATCH v6 14/25] KVM: x86: Load guest FPU state when access XSAVE-managed MSRs
  2023-11-01 18:05     ` Sean Christopherson
  2023-11-02 18:31       ` Maxim Levitsky
@ 2023-11-03  8:46       ` Yang, Weijiang
  2023-11-03 14:02         ` Sean Christopherson
  1 sibling, 1 reply; 120+ messages in thread
From: Yang, Weijiang @ 2023-11-03  8:46 UTC (permalink / raw)
  To: Sean Christopherson, Maxim Levitsky
  Cc: pbonzini, kvm, linux-kernel, dave.hansen, peterz, chao.gao,
	rick.p.edgecombe, john.allen

On 11/2/2023 2:05 AM, Sean Christopherson wrote:
> On Tue, Oct 31, 2023, Maxim Levitsky wrote:
>> On Thu, 2023-09-14 at 02:33 -0400, Yang Weijiang wrote:
>>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>>> index 66edbed25db8..a091764bf1d2 100644
>>> --- a/arch/x86/kvm/x86.c
>>> +++ b/arch/x86/kvm/x86.c
>>> @@ -133,6 +133,9 @@ static int __set_sregs2(struct kvm_vcpu *vcpu, struct kvm_sregs2 *sregs2);
>>>   static void __get_sregs2(struct kvm_vcpu *vcpu, struct kvm_sregs2 *sregs2);
>>>   
>>>   static DEFINE_MUTEX(vendor_module_lock);
>>> +static void kvm_load_guest_fpu(struct kvm_vcpu *vcpu);
>>> +static void kvm_put_guest_fpu(struct kvm_vcpu *vcpu);
>>> +
>>>   struct kvm_x86_ops kvm_x86_ops __read_mostly;
>>>   
>>>   #define KVM_X86_OP(func)					     \
>>> @@ -4372,6 +4375,22 @@ int kvm_get_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
>>>   }
>>>   EXPORT_SYMBOL_GPL(kvm_get_msr_common);
>>>   
>>> +static const u32 xstate_msrs[] = {
>>> +	MSR_IA32_U_CET, MSR_IA32_PL0_SSP, MSR_IA32_PL1_SSP,
>>> +	MSR_IA32_PL2_SSP, MSR_IA32_PL3_SSP,
>>> +};
>>> +
>>> +static bool is_xstate_msr(u32 index)
>>> +{
>>> +	int i;
>>> +
>>> +	for (i = 0; i < ARRAY_SIZE(xstate_msrs); i++) {
>>> +		if (index == xstate_msrs[i])
>>> +			return true;
>>> +	}
>>> +	return false;
>>> +}
>> The name 'xstate_msr' IMHO is not clear.
>>
>> How about naming it 'guest_fpu_state_msrs', together with adding a comment like that:
> Maybe xstate_managed_msrs?  I'd prefer not to include "guest" because the behavior
> is more a property of the architecture and/or the host kernel.  I understand where
> you're coming from, but it's that the MSR *values* are part of guest state, whereas the
> check is a query on how KVM manages the MSR value, if that makes sense.
>
> And I really don't like "FPU".  I get why the kernel uses the "FPU" terminology,
> but for this check in particular I want to tie the behavior back to the architecture,
> i.e. provide the hint that the reason why these MSRs are special is because Intel
> defined them to be context switched via XSTATE.
>
> Actually, this is unnecessary bikeshedding to some extent; using an array is silly.
> It's easier and likely far more performant (not that that matters in this case)
> to use a switch statement.
>
> Is this better?

The change looks good to me! Thanks!

> /*
>   * Returns true if the MSR in question is managed via XSTATE, i.e. is context
>   * switched with the rest of guest FPU state.
>   */
> static bool is_xstate_managed_msr(u32 index)

How about is_xfeature_msr()? "xfeature" means an XSAVE-supported feature, just to align
with the SDM convention.

> {
> 	switch (index) {
> 	case MSR_IA32_U_CET:
> 	case MSR_IA32_PL0_SSP ... MSR_IA32_PL3_SSP:
> 		return true;
> 	default:
> 		return false;
> 	}
> }
>
> /*
>   * Read or write a bunch of msrs. All parameters are kernel addresses.
>   *
>   * @return number of msrs set successfully.
>   */
> static int __msr_io(struct kvm_vcpu *vcpu, struct kvm_msrs *msrs,
> 		    struct kvm_msr_entry *entries,
> 		    int (*do_msr)(struct kvm_vcpu *vcpu,
> 				  unsigned index, u64 *data))
> {
> 	bool fpu_loaded = false;
> 	int i;
>
> 	for (i = 0; i < msrs->nmsrs; ++i) {
> 		/*
> 	 	 * If userspace is accessing one or more XSTATE-managed MSRs,
> 		 * temporarily load the guest's FPU state so that the guest's
> 		 * MSR value(s) is resident in hardware, i.e. so that KVM can
> 		 * get/set the MSR via RDMSR/WRMSR.
> 	 	 */
> 		if (vcpu && !fpu_loaded && kvm_caps.supported_xss &&
> 		    is_xstate_managed_msr(entries[i].index)) {
> 			kvm_load_guest_fpu(vcpu);
> 			fpu_loaded = true;
> 		}
> 		if (do_msr(vcpu, entries[i].index, &entries[i].data))
> 			break;
> 	}
> 	if (fpu_loaded)
> 		kvm_put_guest_fpu(vcpu);
>
> 	return i;
> }



* Re: [PATCH v6 14/25] KVM: x86: Load guest FPU state when access XSAVE-managed MSRs
  2023-11-03  8:46       ` Yang, Weijiang
@ 2023-11-03 14:02         ` Sean Christopherson
  0 siblings, 0 replies; 120+ messages in thread
From: Sean Christopherson @ 2023-11-03 14:02 UTC (permalink / raw)
  To: Weijiang Yang
  Cc: Maxim Levitsky, pbonzini, kvm, linux-kernel, dave.hansen, peterz,
	chao.gao, rick.p.edgecombe, john.allen

On Fri, Nov 03, 2023, Weijiang Yang wrote:
> On 11/2/2023 2:05 AM, Sean Christopherson wrote:
> > /*
> >   * Returns true if the MSR in question is managed via XSTATE, i.e. is context
> >   * switched with the rest of guest FPU state.
> >   */
> > static bool is_xstate_managed_msr(u32 index)
> 
> How about is_xfeature_msr()? "xfeature" means an XSAVE-supported feature, just to align
> with the SDM convention.

My vote remains for is_xstate_managed_msr().  is_xfeature_msr() could also refer
to MSRs that control XSTATE features, e.g. XSS.


* Re: [PATCH v6 06/25] x86/fpu/xstate: Opt-in kernel dynamic bits when calculate guest xstate size
  2023-11-02 18:20                 ` Maxim Levitsky
@ 2023-11-03 14:33                   ` Sean Christopherson
  2023-11-07 18:04                     ` Maxim Levitsky
  0 siblings, 1 reply; 120+ messages in thread
From: Sean Christopherson @ 2023-11-03 14:33 UTC (permalink / raw)
  To: Maxim Levitsky
  Cc: Weijiang Yang, Dave Hansen, pbonzini, kvm, linux-kernel, peterz,
	chao.gao, rick.p.edgecombe, john.allen

On Thu, Nov 02, 2023, Maxim Levitsky wrote:
> On Wed, 2023-11-01 at 07:16 -0700, Sean Christopherson wrote:
> > On Tue, Oct 31, 2023, Maxim Levitsky wrote:
> > > On Thu, 2023-10-26 at 10:24 -0700, Sean Christopherson wrote:
> > > > --
> > > > From: Sean Christopherson <seanjc@google.com>
> > > > Date: Thu, 26 Oct 2023 10:17:33 -0700
> > > > Subject: [PATCH] x86/fpu/xstate: Always preserve non-user xfeatures/flags in
> > > >  __state_perm
> > > > 
> > > > Fixes: 781c64bfcb73 ("x86/fpu/xstate: Handle supervisor states in XSTATE permissions")
> > > > Signed-off-by: Sean Christopherson <seanjc@google.com>
> > > > ---
> > > >  arch/x86/kernel/fpu/xstate.c | 18 +++++++++++-------
> > > >  1 file changed, 11 insertions(+), 7 deletions(-)
> > > > 
> > > > diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
> > > > index ef6906107c54..73f6bc00d178 100644
> > > > --- a/arch/x86/kernel/fpu/xstate.c
> > > > +++ b/arch/x86/kernel/fpu/xstate.c
> > > > @@ -1601,16 +1601,20 @@ static int __xstate_request_perm(u64 permitted, u64 requested, bool guest)
> > > >  	if ((permitted & requested) == requested)
> > > >  		return 0;
> > > >  
> > > > -	/* Calculate the resulting kernel state size */
> > > > +	/*
> > > > +	 * Calculate the resulting kernel state size.  Note, @permitted also
> > > > +	 * contains supervisor xfeatures even though supervisor are always
> > > > +	 * permitted for kernel and guest FPUs, and never permitted for user
> > > > +	 * FPUs.
> > > > +	 */
> > > >  	mask = permitted | requested;
> > > > -	/* Take supervisor states into account on the host */
> > > > -	if (!guest)
> > > > -		mask |= xfeatures_mask_supervisor();
> > > >  	ksize = xstate_calculate_size(mask, compacted);
> > > 
> > > This might not work with kernel dynamic features, because
> > > xfeatures_mask_supervisor() will return all supported supervisor features.
> > 
> > I don't understand what you mean by "This".
> 
> > 
> > Somewhat of a side topic, I feel very strongly that we should use "guest only"
> > terminology instead of "dynamic".  There is nothing dynamic about whether or not
> > XFEATURE_CET_KERNEL is allowed; there's not even a real "decision" beyond checking
> > whether or not CET is supported.
> 
> > > Therefore at least until we have an actual kernel dynamic feature (a feature
> > > used by the host kernel and not KVM, and which has to be dynamic like AMX),
> > > I suggest that KVM stops using the permission API completely for the guest
> > > FPU state, and just gives all the features it wants to enable right to
> > 
> > By "it", I assume you mean userspace?
> > 
> > > __fpu_alloc_init_guest_fpstate() (Guest FPU permission API IMHO should be
> > > deprecated and ignored)
> > 
> > KVM allocates guest FPU state during KVM_CREATE_VCPU, so not using prctl() would
> > either require KVM to defer allocating guest FPU state until KVM_SET_CPUID{,2},
> > or would require a VM-scoped KVM ioctl() to let userspace opt-in to
> > 
> > Allocating guest FPU state during KVM_SET_CPUID{,2} would get messy, 
> 
> > as KVM allows
> > multiple calls to KVM_SET_CPUID{,2} so long as the vCPU hasn't done KVM_RUN.  E.g.
> > KVM would need to support actually resizing guest FPU state, which would be extra
> > complexity without any meaningful benefit.
> 
> 
> OK, I understand you now. What you claim is that it is legal to do this:
> 
> - KVM_SET_XSAVE
> - KVM_SET_CPUID (with AMX enabled)
> 
> KVM_SET_CPUID will have to resize the xstate which is already valid.

I was actually talking about

  KVM_SET_CPUID2 (with dynamic user feature #1)
  KVM_SET_CPUID2 (with dynamic user feature #2)

The second call through __xstate_request_perm() will be done with only user
xfeatures in @permitted and so the kernel will compute the wrong ksize.

> Your patch to fix __xstate_request_perm() does seem to be correct in the
> sense that it will preserve the kernel FPU components in the FPU permissions.
> 
> However, note that kernel FPU permissions come from
> 'fpu_kernel_cfg.default_features', which doesn't include the dynamic kernel
> xfeatures (added a few patches earlier in this series).

CET_KERNEL isn't dynamic!  It's guest-only.  There are no runtime decisions as to
whether or not CET_KERNEL is allowed.  All guest FPUs get CET_KERNEL; no kernel FPUs
get CET_KERNEL.

That matters because I am also proposing that we add a dedicated, defined-at-boot
fpu_guest_cfg instead of bolting on a "dynamic", which is what I meant by this:

 : Or even better if it doesn't cause weirdness elsewhere, a dedicated
 : fpu_guest_cfg.  For me at least, a fpu_guest_cfg would make it easier to
 : understand what all is going on.

That way, initialization of permissions is simply

	fpu->guest_perm = fpu_guest_cfg.default_features;

and there's no need to differentiate between guest and kernel FPUs when reallocating
for dynamic user xfeatures because guest_perm.__state_perm already holds the correct
data.

> Therefore an attempt to resize the xstate to include a kernel dynamic feature by
> __xfd_enable_feature will fail.
> 
> If kvm on the other hand includes all the kernel dynamic features in the
> initial allocation of FPU state (not optimal but possible),

This is what I am suggesting.

 : There are definitely scenarios where CET will not be exposed to KVM guests, but
 : I don't see any reason to make the guest FPU space dynamically sized for CET.
 : It's what, 40 bytes?

> then a later call to __xstate_request_perm() for a userspace dynamic feature
> (which can still happen) will mess up the xstate, because again the
> permission code assumes that only the default kernel features were granted
> permission.
> 
> 
> This has to be solved one way or another.
> 
> > 
> > The only benefit I can think of for a VM-scoped ioctl() is that it would allow a
> > single process to host multiple VMs with different dynamic xfeature requirements.
> > But such a setup is mostly theoretical.  Maybe it'll affect the SEV migration
> > helper at some point?  But even that isn't guaranteed.
> > 
> > So while I agree that ARCH_GET_XCOMP_GUEST_PERM isn't ideal, practically speaking
> > it's sufficient for all current use cases.  Unless a concrete use case comes along,
> > deprecating ARCH_GET_XCOMP_GUEST_PERM in favor of a KVM ioctl() would be churn for
> > both the kernel and userspace without any meaningful benefit, or really even any
> > true change in behavior.
> 
> 
> ARCH_GET_XCOMP_GUEST_PERM/ARCH_SET_XCOMP_GUEST_PERM is not a good API from a
> usability POV, because it is redundant.
> 
> KVM already has an API called KVM_SET_CPUID2, by which QEMU/userspace
> instructs KVM how much space to allocate to support a VM with *this*
> CPUID.
> 
> For example, if QEMU asks for nested SVM/VMX, then KVM will allocate state
> for it on demand (also at least 8K/vCPU, btw).  The same should apply for AMX:
> QEMU sets the AMX xsave bit in CPUID, and that permits KVM to allocate the
> extra state when needed.
> 
> I don't see why we need an extra, non-KVM API for that.

I don't necessarily disagree, but what's done is done.  We missed our chance to
propose a different mechanism, and at this point undoing all of that without good
cause is unlikely to benefit anyone.  If a use comes along that needs something
"better" than the prctl() API, then I agree it'd be worth revisiting.


* Re: [PATCH v6 19/25] KVM: VMX: Emulate read and write to CET MSRs
  2023-11-03  8:18       ` Yang, Weijiang
@ 2023-11-03 22:26         ` Sean Christopherson
  0 siblings, 0 replies; 120+ messages in thread
From: Sean Christopherson @ 2023-11-03 22:26 UTC (permalink / raw)
  To: Weijiang Yang
  Cc: pbonzini, kvm, linux-kernel, dave.hansen, peterz, chao.gao,
	rick.p.edgecombe, john.allen, Maxim Levitsky

On Fri, Nov 03, 2023, Weijiang Yang wrote:
> On 11/2/2023 12:31 AM, Sean Christopherson wrote:
> > On Tue, Oct 31, 2023, Maxim Levitsky wrote:
> > > On Thu, 2023-09-14 at 02:33 -0400, Yang Weijiang wrote:
> > > > Add an emulation interface for CET MSR access. The emulation code is split
> > > > into a common part and a vendor-specific part. The former does the common
> > > > checks for the MSRs and reads/writes XSAVE-managed MSRs directly via the
> > > > helpers, while the latter accesses the MSRs linked to VMCS fields.
> > > > 
> > > > Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>
> > > > ---
> > ...
> > 
> > > > +	case MSR_IA32_PL0_SSP ... MSR_IA32_PL3_SSP:
> > > > +	case MSR_KVM_SSP:
> > > > +		if (host_msr_reset && kvm_cpu_cap_has(X86_FEATURE_SHSTK))
> > > > +			break;
> > > > +		if (!guest_can_use(vcpu, X86_FEATURE_SHSTK))
> > > > +			return 1;
> > > > +		if (index == MSR_KVM_SSP && !host_initiated)
> > > > +			return 1;
> > > > +		if (is_noncanonical_address(data, vcpu))
> > > > +			return 1;
> > > > +		if (index != MSR_IA32_INT_SSP_TAB && !IS_ALIGNED(data, 4))
> > > > +			return 1;
> > > > +		break;
> > > Once again I'll prefer to have an ioctl for setting/getting SSP, this will
> > > make the above code simpler (e.g there will be no need to check that write
> > > comes from the host/etc).
> > I don't think an ioctl() would be simpler overall, especially when factoring in
> > userspace.  With a synthetic MSR, we get the following quite cheaply:
> > 
> >   1. Enumerating support to userspace.
> >   2. Save/restore of the value, e.g. for live migration.
> >   3. Vendor hooks for propagating values to/from the VMCS/VMCB.
> > 
> > For an ioctl(), #1 would require a capability, #2 (and #1 to some extent) would
> > require new userspace flows, and #3 would require new kvm_x86_ops hooks.
> > 
> > The synthetic MSR adds a small amount of messiness, as does bundling
> > MSR_IA32_INT_SSP_TAB with the other shadow stack MSRs.  The bulk of the mess comes
> > from the need to allow userspace to write '0' when KVM has enumerated support to
> > userspace.
> > 
> > If we isolate MSR_IA32_INT_SSP_TAB, that'll help with the synthetic MSR and with
> > MSR_IA32_INT_SSP_TAB.  For the unfortunate "host reset" behavior, the best idea I
> > came up with is to add a helper.  It's still a bit ugly, but the ugliness is
> > contained in a helper and IMO makes it much easier to follow the case statements.
> 
> Frankly speaking, the existing code is not hard for me to understand :-); the
> handling for MSR_KVM_SSP and MSR_IA32_INT_SSP_TAB is straightforward if
> readers consult the related spec.

I don't necessarily disagree, but I 100% agree with Maxim that host_msr_reset is
a confusing name.  As Maxim pointed out, '0' isn't necessarily the RESET value.
And host_msr_reset implies that userspace is emulating a RESET, which may not
actually be true, e.g. a naive userspace could be restoring '0' as part of live
migration.

> But I'll take your advice and enclose below changes. Thanks!

Definitely feel free to propose an alternative.  My goal with the suggested change
is to eliminate host_msr_reset without creating unwieldy case statements.
Isolating MSR_IA32_INT_SSP_TAB was (obviously) the best solution I came up with.

> > get:
> > 
> > 	case MSR_IA32_INT_SSP_TAB:
> > 		if (!guest_can_use(vcpu, X86_FEATURE_SHSTK) ||
> > 		    !guest_cpuid_has(vcpu, X86_FEATURE_LM))
> > 			return 1;
> > 		break;
> > 	case MSR_KVM_SSP:
> > 		if (!host_initiated)
> > 			return 1;
> > 		fallthrough;
> > 	case MSR_IA32_PL0_SSP ... MSR_IA32_PL3_SSP:
> > 		if (!guest_can_use(vcpu, X86_FEATURE_SHSTK))
> > 			return 1;
> > 		break;
> > 
> > static bool is_set_cet_msr_allowed(struct kvm_vcpu *vcpu, u32 index, u64 data,
> > 				   bool host_initiated)
> > {
> > 	bool any_cet = index == MSR_IA32_S_CET || index == MSR_IA32_U_CET;
> > 
> > 	if (guest_can_use(vcpu, X86_FEATURE_SHSTK))
> > 		return true;
> > 
> > 	if (any_cet && guest_can_use(vcpu, X86_FEATURE_IBT))
> > 		return true;
> > 
> > 	/*
> > 	 * If KVM supports the MSR, i.e. has enumerated the MSR's existence to
> > 	 * userspace, then userspace is allowed to write '0' irrespective of
> > 	 * whether or not the MSR is exposed to the guest.
> > 	 */
> > 	if (!host_initiated || data)
> > 		return false;
> > 	if (kvm_cpu_cap_has(X86_FEATURE_SHSTK))
> > 		return true;
> > 
> > 	return any_cet && kvm_cpu_cap_has(X86_FEATURE_IBT);
> > }
> > 
> > set:
> > 	case MSR_IA32_U_CET:
> > 	case MSR_IA32_S_CET:
> > 		if (!is_set_cet_msr_allowed(vcpu, index, data, host_initiated))
> > 			return 1;
> > 		if (data & CET_US_RESERVED_BITS)
> > 			return 1;
> > 		if (!guest_can_use(vcpu, X86_FEATURE_SHSTK) &&
> > 		    (data & CET_US_SHSTK_MASK_BITS))
> > 			return 1;
> > 		if (!guest_can_use(vcpu, X86_FEATURE_IBT) &&
> > 		    (data & CET_US_IBT_MASK_BITS))
> > 			return 1;
> > 		if (!IS_ALIGNED(CET_US_LEGACY_BITMAP_BASE(data), 4))
> > 			return 1;
> > 
> > 		/* IBT can be suppressed iff the TRACKER isn't WAIT_ENDBR. */
> > 		if ((data & CET_SUPPRESS) && (data & CET_WAIT_ENDBR))
> > 			return 1;
> > 		break;
> > 	case MSR_IA32_INT_SSP_TAB:
> > 		if (!guest_cpuid_has(vcpu, X86_FEATURE_LM))
> > 			return 1;

Doh, I think this should be:

		if (!is_set_cet_msr_allowed(vcpu, index, data, host_initiated) ||
		    !guest_cpuid_has(vcpu, X86_FEATURE_LM))
			return 1;
> > 
> > 		if (is_noncanonical_address(data, vcpu))
> > 			return 1;
> > 		break;
> > 	case MSR_KVM_SSP:
> > 		if (!host_initiated)
> > 			return 1;
> > 		fallthrough;
> > 	case MSR_IA32_PL0_SSP ... MSR_IA32_PL3_SSP:
> > 		if (!is_set_cet_msr_allowed(vcpu, index, data, host_initiated))
> > 			return 1;
> > 		if (is_noncanonical_address(data, vcpu))
> > 			return 1;
> > 		if (!IS_ALIGNED(data, 4))
> > 			return 1;
> > 		break;
> > 	}
> 


* Re: [PATCH v6 18/25] KVM: x86: Use KVM-governed feature framework to track "SHSTK/IBT enabled"
  2023-11-02 18:35       ` Maxim Levitsky
@ 2023-11-04  0:07         ` Sean Christopherson
  2023-11-07 18:05           ` Maxim Levitsky
  0 siblings, 1 reply; 120+ messages in thread
From: Sean Christopherson @ 2023-11-04  0:07 UTC (permalink / raw)
  To: Maxim Levitsky
  Cc: Yang Weijiang, pbonzini, kvm, linux-kernel, dave.hansen, peterz,
	chao.gao, rick.p.edgecombe, john.allen

On Thu, Nov 02, 2023, Maxim Levitsky wrote:
> On Wed, 2023-11-01 at 08:46 -0700, Sean Christopherson wrote:
> > On Tue, Oct 31, 2023, Maxim Levitsky wrote:
> > > On Thu, 2023-09-14 at 02:33 -0400, Yang Weijiang wrote:
> > > > Use the governed feature framework to track whether X86_FEATURE_SHSTK
> > > > and X86_FEATURE_IBT features can be used by userspace and guest, i.e.,
> > > > the features can be used iff both KVM and guest CPUID can support them.
> > > PS: IMHO The whole 'governed feature framework' is very confusing and
> > > somewhat poorly documented.
> > > 
> > > Currently the only partial explanation of it, is at 'governed_features',
> > > which doesn't explain how to use it.
> > 
> > To be honest, terrible name aside, I thought kvm_governed_feature_check_and_set()
> > would be fairly self-explanatory, at least relative to all the other CPUID handling
> > in KVM.
> 
What is not self-explanatory is what the governed features are and how to query them.

...

> > > However thinking again about the whole thing: 
> > > 
> > > IMHO the 'governed features' is another quite confusing term that a KVM
> > > developer will need to learn and keep in memory.
> > 
> > I 100% agree, but I explicitly called out the terrible name in the v1 and v2
> > cover letters[1][2], and the patches were on the list for 6 months before I
> > applied them.  I'm definitely still open to a better name, but I'm also not
> > exactly chomping at the bit to get behind the bikeshed.
> 
> Honestly, I don't know if I can come up with a better name either.  The name is
> IMHO not the underlying problem; it's the feature itself that is confusing.

...

> > Yes and no.  For "governed features", probably not.  But for CPUID as a whole, there
> > are legitimate cases where userspace needs to enumerate things that aren't officially
> > "supported" by KVM.  E.g. topology, core crystal frequency (CPUID 0x15), defeatures
> > that KVM hasn't yet learned about, features that don't have virtualization controls
> > and KVM hasn't yet learned about, etc.  And for things like Xen and Hyper-V paravirt
> > features, it's very doable to implement features that are enumerated by CPUID fully
> > in userspace, e.g. using MSR filters.
> > 
> > But again, it's a moot point because KVM has (mostly) allowed userspace to fully
> > control guest CPUID for a very long time.
> > 
> > > Such a feature, which is advertised as supported but not really working, is a
> > > recipe for hard-to-find guest bugs IMHO.
> > > 
> > > IMHO it would be much better to just check this condition and do
> > > kvm_vm_bugged() or something in case when a feature is enabled in the guest
> > > CPUID but KVM can't support it, and then just use guest CPUID in
> > > 'guest_can_use()'.
> 
> OK, I won't argue that much over this, however I still think that there are
> better ways to deal with it.
> 
> If we put optimizations aside (all of this can surely be optimized so as to
> have very little overhead)
> 
> How about we have two CPUIDs: a guest-visible CPUID, which KVM will never use directly
> other than during initialization, and an effective CPUID, which is roughly
> what governed features are, but will include all features and will be initialized
> roughly the way governed features are initialized:
> 
> effective_cpuid = guest_cpuid & kvm_supported_cpuid 
> 
> Except for some forced overrides like for XSAVES and such.
> 
> Then we won't need to maintain a list of governed features, and guest_can_use()
> for all features will just return the effective cpuid leafs.
> 
> In other words, I want KVM to turn all known CPUID features to governed features,
> and then remove all the mentions of governed features except 'guest_can_use'
> which is a good API.
> 
> Such a proposal will use a bit more memory but will make it easier for future
> KVM developers to understand the code and have less chance of introducing bugs.

Hmm, two _full_ CPUID arrays would be a mess and completely unnecessary.  E.g.
we'd have to sort out Hyper-V and KVM PV, which both have their own caches.  And
a duplicate entry for things like F/M/S would be ridiculous.

But maintaining a per-vCPU version of the CPU caps is definitely doable.  I.e. a
vCPU equivalent to kvm_cpu_caps and the per-CPU capabilities.  There are currently
25 leafs that are tracked by kvm_cpu_caps, so relative to "governed" features,
the cost will be 96 bytes per vCPU.  I agree that 96 bytes is worth eating, we've
certainly taken on more for a lot, lot less.

It's a lot of churn, and there are some subtle nasties, e.g. MWAIT and other
> CPUID bits that change based on MSRs or CR4, but most of the churn is superficial
and the result is waaaaay less ugly than governed features and for the majority of
features will Just Work.

I'll get a series posted next week (need to write changelogs and do a _lot_ more
testing).  If you want to take a peek at where I'm headed before then:

  https://github.com/sean-jc/linux x86/guest_cpufeatures

^ permalink raw reply	[flat|nested] 120+ messages in thread

* Re: [PATCH v6 06/25] x86/fpu/xstate: Opt-in kernel dynamic bits when calculate guest xstate size
  2023-11-03 14:33                   ` Sean Christopherson
@ 2023-11-07 18:04                     ` Maxim Levitsky
  2023-11-14  9:13                       ` Yang, Weijiang
  0 siblings, 1 reply; 120+ messages in thread
From: Maxim Levitsky @ 2023-11-07 18:04 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Weijiang Yang, Dave Hansen, pbonzini, kvm, linux-kernel, peterz,
	chao.gao, rick.p.edgecombe, john.allen

On Fri, 2023-11-03 at 07:33 -0700, Sean Christopherson wrote:
> On Thu, Nov 02, 2023, Maxim Levitsky wrote:
> > On Wed, 2023-11-01 at 07:16 -0700, Sean Christopherson wrote:
> > > On Tue, Oct 31, 2023, Maxim Levitsky wrote:
> > > > On Thu, 2023-10-26 at 10:24 -0700, Sean Christopherson wrote:
> > > > > --
> > > > > From: Sean Christopherson <seanjc@google.com>
> > > > > Date: Thu, 26 Oct 2023 10:17:33 -0700
> > > > > Subject: [PATCH] x86/fpu/xstate: Always preserve non-user xfeatures/flags in
> > > > >  __state_perm
> > > > > 
> > > > > Fixes: 781c64bfcb73 ("x86/fpu/xstate: Handle supervisor states in XSTATE permissions")
> > > > > Signed-off-by: Sean Christopherson <seanjc@google.com>
> > > > > ---
> > > > >  arch/x86/kernel/fpu/xstate.c | 18 +++++++++++-------
> > > > >  1 file changed, 11 insertions(+), 7 deletions(-)
> > > > > 
> > > > > diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
> > > > > index ef6906107c54..73f6bc00d178 100644
> > > > > --- a/arch/x86/kernel/fpu/xstate.c
> > > > > +++ b/arch/x86/kernel/fpu/xstate.c
> > > > > @@ -1601,16 +1601,20 @@ static int __xstate_request_perm(u64 permitted, u64 requested, bool guest)
> > > > >  	if ((permitted & requested) == requested)
> > > > >  		return 0;
> > > > >  
> > > > > -	/* Calculate the resulting kernel state size */
> > > > > +	/*
> > > > > +	 * Calculate the resulting kernel state size.  Note, @permitted also
> > > > > +	 * contains supervisor xfeatures even though supervisor are always
> > > > > +	 * permitted for kernel and guest FPUs, and never permitted for user
> > > > > +	 * FPUs.
> > > > > +	 */
> > > > >  	mask = permitted | requested;
> > > > > -	/* Take supervisor states into account on the host */
> > > > > -	if (!guest)
> > > > > -		mask |= xfeatures_mask_supervisor();
> > > > >  	ksize = xstate_calculate_size(mask, compacted);
> > > > 
> > > > This might not work with kernel dynamic features, because
> > > > xfeatures_mask_supervisor() will return all supported supervisor features.
> > > 
> > > I don't understand what you mean by "This".
> > > Somewhat of a side topic, I feel very strongly that we should use "guest only"
> > > terminology instead of "dynamic".  There is nothing dynamic about whether or not
> > > XFEATURE_CET_KERNEL is allowed; there's not even a real "decision" beyond checking
> > > whether or not CET is supported.
> > > > Therefore at least until we have an actual kernel dynamic feature (a feature
> > > > used by the host kernel and not KVM, and which has to be dynamic like AMX),
> > > > I suggest that KVM stops using the permission API completely for the guest
> > > > FPU state, and just gives all the features it wants to enable right to
> > > 
> > > By "it", I assume you mean userspace?
> > > 
> > > > __fpu_alloc_init_guest_fpstate() (Guest FPU permission API IMHO should be
> > > > deprecated and ignored)
> > > 
> > > KVM allocates guest FPU state during KVM_CREATE_VCPU, so not using prctl() would
> > > either require KVM to defer allocating guest FPU state until KVM_SET_CPUID{,2},
> > > or would require a VM-scoped KVM ioctl() to let userspace opt-in to
> > > 
> > > Allocating guest FPU state during KVM_SET_CPUID{,2} would get messy, 
> > > as KVM allows
> > > multiple calls to KVM_SET_CPUID{,2} so long as the vCPU hasn't done KVM_RUN.  E.g.
> > > KVM would need to support actually resizing guest FPU state, which would be extra
> > > complexity without any meaningful benefit.
> > 
> > OK, I understand you now. What you claim is that it is legal to do this:
> > 
> > - KVM_SET_XSAVE
> > - KVM_SET_CPUID (with AMX enabled)
> > 
> > KVM_SET_CPUID will have to resize the xstate which is already valid.
> 
> I was actually talking about
> 
>   KVM_SET_CPUID2 (with dynamic user feature #1)
>   KVM_SET_CPUID2 (with dynamic user feature #2)
> 
> The second call through __xstate_request_perm() will be done with only user
> xfeatures in @permitted and so the kernel will compute the wrong ksize.
> 
> > Your patch to fix the __xstate_request_perm() does seem to be correct in a
> > sense that it will preserve the kernel fpu components in the fpu permissions.
> > 
> > However note that kernel fpu permissions come from
> > 'fpu_kernel_cfg.default_features' which don't include the dynamic kernel
> > xfeatures (added a few patches before this one).
> 
> CET_KERNEL isn't dynamic!  It's guest-only.  There are no runtime decisions as to
> whether or not CET_KERNEL is allowed.  All guest FPU get CET_KERNEL, no kernel FPUs
> get CET_KERNEL.
> 
> That matters because I am also proposing that we add a dedicated, defined-at-boot
> fpu_guest_cfg instead of bolting on a "dynamic", which is what I meant by this:

Seems fair.

> 
>  : Or even better if it doesn't cause weirdness elsewhere, a dedicated
>  : fpu_guest_cfg.  For me at least, a fpu_guest_cfg would make it easier to
>  : understand what all is going on.
This is a very good idea.

> 
> That way, initialization of permissions is simply
> 
> 	fpu->guest_perm = fpu_guest_cfg.default_features;
> 
> and there's no need to differentiate between guest and kernel FPUs when reallocating
> for dynamic user xfeatures because guest_perm.__state_perm already holds the correct
> data.
> 
> > Therefore an attempt to resize the xstate to include a kernel dynamic feature by
> > __xfd_enable_feature will fail.
> > 
> > If kvm on the other hand includes all the kernel dynamic features in the
> > initial allocation of FPU state (not optimal but possible),
> 
> This is what I am suggesting.

This is a valid solution.

> 
>  : There are definitely scenarios where CET will not be exposed to KVM guests, but
>  : I don't see any reason to make the guest FPU space dynamically sized for CET.
>  : It's what, 40 bytes?

I don't disagree with this. Allocating all guest kernel features is a valid solution
for now, although this can change in the future if a 'heavy' kernel feature comes along.

Also IMHO it's not a question of space but more a question of run-time overhead.
I don't know how well the INIT/MODIFIED ucode state tracking works (on Intel and AMD)
or what the costs of saving/restoring an unused feature are.

But again this is a valid solution and as long as the code works, I don't have
anything against it.

> 
> > then a later call to __xstate_request_perm() for a userspace dynamic feature
> > (which can still happen) will mess up the xstate, because again the
> > permission code assumes that only default kernel features were granted the
> > permissions.
> > 
> > 
> > This has to be solved this way or another.
> > 
> > > The only benefit I can think of for a VM-scoped ioctl() is that it would allow a
> > > single process to host multiple VMs with different dynamic xfeature requirements.
> > > But such a setup is mostly theoretical.  Maybe it'll affect the SEV migration
> > > helper at some point?  But even that isn't guaranteed.
> > > 
> > > So while I agree that ARCH_GET_XCOMP_GUEST_PERM isn't ideal, practically speaking
> > > it's sufficient for all current use cases.  Unless a concrete use case comes along,
> > > deprecating ARCH_GET_XCOMP_GUEST_PERM in favor of a KVM ioctl() would be churn for
> > > both the kernel and userspace without any meaningful benefit, or really even any
> > > true change in behavior.
> > 
> > ARCH_GET_XCOMP_GUEST_PERM/ARCH_SET_XCOMP_GUEST_PERM is not a good API from
> > usability POV, because it is redundant.
> > 
> > KVM already has API called KVM_SET_CPUID2, by which the qemu/userspace
> > instructs the KVM, how much space to allocate, to support a VM with *this*
> > CPUID.
> > 
> > For example if qemu asks for nested SVM/VMX, then kvm will allocate on demand
> > state for it (also at least 8K/vCPU btw).  The same should apply for AMX -
> > Qemu sets AMX xsave bit in CPUID - that permits KVM to allocate the extra
> > state when needed.
> > 
> > I don't see why we need an extra and non KVM API for that.
> 
> I don't necessarily disagree, but what's done is done.  We missed our chance to
> propose a different mechanism, and at this point undoing all of that without good
> cause is unlikely to benefit anyone.  If a use comes along that needs something
> "better" than the prctl() API, then I agree it'd be worth revisiting.

I do think it is not too late to deprecate ARCH_GET_XCOMP_GUEST_PERM/ARCH_SET_XCOMP_GUEST_PERM
and just ignore it, taking the guest CPUID as the source of truth instead.

That API was out for only a few releases and only has to be used for AMX, which is a very new feature.

Also, if we let userspace call the deprecated API but ignore it (allow everything regardless of whether
the permission API was called), that will not break the existing code IMHO.

Best regards,
	Maxim Levitsky


> 




* Re: [PATCH v6 18/25] KVM: x86: Use KVM-governed feature framework to track "SHSTK/IBT enabled"
  2023-11-04  0:07         ` Sean Christopherson
@ 2023-11-07 18:05           ` Maxim Levitsky
  0 siblings, 0 replies; 120+ messages in thread
From: Maxim Levitsky @ 2023-11-07 18:05 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Yang Weijiang, pbonzini, kvm, linux-kernel, dave.hansen, peterz,
	chao.gao, rick.p.edgecombe, john.allen

On Fri, 2023-11-03 at 17:07 -0700, Sean Christopherson wrote:
> On Thu, Nov 02, 2023, Maxim Levitsky wrote:
> > On Wed, 2023-11-01 at 08:46 -0700, Sean Christopherson wrote:
> > > On Tue, Oct 31, 2023, Maxim Levitsky wrote:
> > > > On Thu, 2023-09-14 at 02:33 -0400, Yang Weijiang wrote:
> > > > > Use the governed feature framework to track whether X86_FEATURE_SHSTK
> > > > > and X86_FEATURE_IBT features can be used by userspace and guest, i.e.,
> > > > > the features can be used iff both KVM and guest CPUID can support them.
> > > > PS: IMHO the whole 'governed feature framework' is very confusing and
> > > > somewhat poorly documented.
> > > > 
> > > > Currently the only partial explanation of it is at 'governed_features',
> > > > which doesn't explain how to use it.
> > > 
> > > To be honest, terrible name aside, I thought kvm_governed_feature_check_and_set()
> > > would be fairly self-explanatory, at least relative to all the other CPUID handling
> > > in KVM.
> > 
> > What is not self-explanatory is what the governed features are and how to query them.
> 
> ...
> 
> > > > However thinking again about the whole thing: 
> > > > 
> > > > IMHO the 'governed features' is another quite confusing term that a KVM
> > > > developer will need to learn and keep in memory.
> > > 
> > > I 100% agree, but I explicitly called out the terrible name in the v1 and v2
> > > cover letters[1][2], and the patches were on the list for 6 months before I
> > > applied them.  I'm definitely still open to a better name, but I'm also not
> > > exactly chomping at the bit to get behind the bikeshed.
> > 
> > Honestly I don't know if I can come up with a better name either.  Name is
> > IMHO not the underlying problem; it's the feature itself that is confusing.
> 
> ...
> 
> > > Yes and no.  For "governed features", probably not.  But for CPUID as a whole, there
> > > are legitimate cases where userspace needs to enumerate things that aren't officially
> > > "supported" by KVM.  E.g. topology, core crystal frequency (CPUID 0x15), defeatures
> > > that KVM hasn't yet learned about, features that don't have virtualization controls
> > > and KVM hasn't yet learned about, etc.  And for things like Xen and Hyper-V paravirt
> > > features, it's very doable to implement features that are enumerated by CPUID fully
> > > in userspace, e.g. using MSR filters.
> > > 
> > > But again, it's a moot point because KVM has (mostly) allowed userspace to fully
> > > control guest CPUID for a very long time.
> > > 
> > > > Such a feature which is advertised as supported but not really working is a
> > > > recipe for hard-to-find guest bugs IMHO.
> > > > 
> > > > IMHO it would be much better to just check this condition and do
> > > > kvm_vm_bugged() or something in case when a feature is enabled in the guest
> > > > CPUID but KVM can't support it, and then just use guest CPUID in
> > > > 'guest_can_use()'.
> > 
> > OK, I won't argue that much over this, however I still think that there are
> > better ways to deal with it.
> > 
> > If we put optimizations aside (all of this can surely be optimized so as to
> > have very little overhead):
> > 
> > How about we have two CPUIDs: a guest-visible CPUID, which KVM will never use directly
> > other than during initialization, and an effective CPUID, which is roughly
> > what governed features are, but will include all features and will be initialized
> > roughly the way governed features are initialized:
> > 
> > effective_cpuid = guest_cpuid & kvm_supported_cpuid 
> > 
> > Except for some forced overrides like for XSAVES and such.
> > 
> > Then we won't need to maintain a list of governed features, and guest_can_use()
> > for all features will just return the effective cpuid leafs.
> > 
> > In other words, I want KVM to turn all known CPUID features to governed features,
> > and then remove all the mentions of governed features except 'guest_can_use'
> > which is a good API.
> > 
> > Such a proposal will use a bit more memory but will make it easier for future
> > KVM developers to understand the code and have less chance of introducing bugs.
> 
> Hmm, two _full_ CPUID arrays would be a mess and completely unnecessary.  E.g.
> we'd have to sort out Hyper-V and KVM PV, which both have their own caches.  And
> a duplicate entry for things like F/M/S would be ridiculous.
> 
> But maintaining a per-vCPU version of the CPU caps is definitely doable.  I.e. a
> vCPU equivalent to kvm_cpu_caps and the per-CPU capabilities.  There are currently
> 25 leafs that are tracked by kvm_cpu_caps, so relative to "governed" features,
> the cost will be 96 bytes per vCPU.  I agree that 96 bytes is worth eating, we've
> certainly taken on more for a lot, lot less.
> 
> It's a lot of churn, and there are some subtle nasties, e.g. MWAIT and other
> CPUID bits that change based on MSRs or CR4, but most of the churn is superficial
> and the result is waaaaay less ugly than governed features and for the majority of
> features will Just Work.
> 
> I'll get a series posted next week (need to write changelogs and do a _lot_ more
> testing).  If you want to take a peek at where I'm headed before then:
> 
>   https://github.com/sean-jc/linux x86/guest_cpufeatures

This looks very good, looking forward to see the patches on the mailing list.

Best regards,
	Maxim Levitsky

> 




* Re: [PATCH v6 19/25] KVM: VMX: Emulate read and write to CET MSRs
  2023-11-02 23:58         ` Sean Christopherson
@ 2023-11-07 18:12           ` Maxim Levitsky
  2023-11-07 18:39             ` Sean Christopherson
  0 siblings, 1 reply; 120+ messages in thread
From: Maxim Levitsky @ 2023-11-07 18:12 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Yang Weijiang, pbonzini, kvm, linux-kernel, dave.hansen, peterz,
	chao.gao, rick.p.edgecombe, john.allen

On Thu, 2023-11-02 at 16:58 -0700, Sean Christopherson wrote:
> On Thu, Nov 02, 2023, Maxim Levitsky wrote:
> > On Wed, 2023-11-01 at 09:31 -0700, Sean Christopherson wrote:
> > > On Tue, Oct 31, 2023, Maxim Levitsky wrote:
> > > > On Thu, 2023-09-14 at 02:33 -0400, Yang Weijiang wrote:
> > > > > Add emulation interface for CET MSR access. The emulation code is split
> > > > > into common part and vendor specific part. The former does common check
> > > > > for MSRs and reads/writes directly from/to XSAVE-managed MSRs via the
> > > > > helpers while the latter accesses the MSRs linked to VMCS fields.
> > > > > 
> > > > > Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>
> > > > > ---
> > > 
> > > ...
> > > 
> > > > > +	case MSR_IA32_PL0_SSP ... MSR_IA32_PL3_SSP:
> > > > > +	case MSR_KVM_SSP:
> > > > > +		if (host_msr_reset && kvm_cpu_cap_has(X86_FEATURE_SHSTK))
> > > > > +			break;
> > > > > +		if (!guest_can_use(vcpu, X86_FEATURE_SHSTK))
> > > > > +			return 1;
> > > > > +		if (index == MSR_KVM_SSP && !host_initiated)
> > > > > +			return 1;
> > > > > +		if (is_noncanonical_address(data, vcpu))
> > > > > +			return 1;
> > > > > +		if (index != MSR_IA32_INT_SSP_TAB && !IS_ALIGNED(data, 4))
> > > > > +			return 1;
> > > > > +		break;
> > > > Once again, I'd prefer to have an ioctl for setting/getting SSP; this will
> > > > make the above code simpler (e.g. there will be no need to check that the write
> > > > comes from the host, etc.).
> > > 
> > > I don't think an ioctl() would be simpler overall, especially when factoring in
> > > userspace.  With a synthetic MSR, we get the following quite cheaply:
> > > 
> > >  1. Enumerating support to userspace.
> > >  2. Save/restore of the value, e.g. for live migration.
> > >  3. Vendor hooks for propagating values to/from the VMCS/VMCB.
> > > 
> > > For an ioctl(), 
> > > #1 would require a capability, #2 (and #1 to some extent) would
> > > require new userspace flows, and #3 would require new kvm_x86_ops hooks.
> > > 
> > > The synthetic MSR adds a small amount of messiness, as does bundling 
> > > MSR_IA32_INT_SSP_TAB with the other shadow stack MSRs.  The bulk of the mess comes
> > > from the need to allow userspace to write '0' when KVM enumerated supported to
> > > userspace.
> > 
> > Let me put it this way - all hacks start like that, and in this case this is an API/ABI hack
> > so we will have to live with it forever.
> 
> Eh, I don't view it as a hack, at least the kind of hack that has a negative
> connotation.  KVM effectively has ~240 MSR indices reserved for whatever KVM
> wants. 
This is exactly the problem. These indices are reserved for PV features, not
for fake msrs, and my fear is that once we mix it up, it will be a mess.

If that was not API/ABI, I wouldn't complain, but since this is API/ABI, I'm afraid
to make a mistake and then be sorry.


>  The only weird thing about this one is that it's not accessible from the
> guest.  Which I agree is quite weird, but from a code perspective I think it
> works quite well.
> 
> > Once there is a precedent, trust me there will be 10s of new 'fake' msrs added, and the
> > interface will become one big mess.
> 
> That suggests MSRs aren't already one big mess. :-)  I'm somewhat joking, but also
> somewhat serious.  I really don't think that adding one oddball synthetic MSR is
> going to meaningfully move the needle on the messiness of MSRs.
> 
> Hmm, there probably is a valid slippery slope argument though.  As you say, at
> some point, enough state will get shoved into hardware that KVM will need an ever
> growing number of synthetic MSRs to keep pace.

Yes, exactly what I mean - Honestly though I don't expect many new x86 registers/states
that are not msrs, but we don't know what x86 designers will do next,
and APIs are something that can't be fixed later.

> 
> > As I suggested, if you don't want to add new capability/ioctl and vendor
> > callback per new x86 arch register, then let's implement
> > KVM_GET_ONE_REG/KVM_SET_ONE_REG and then it will be really easy to add new
> > regs without confusing users, and without polluting msr namespace with msrs
> > that don't exist.
> 
> I definitely don't hate the idea of KVM_{G,S}ET_ONE_REG, what I don't want is to
> have an entirely separate path in KVM for handling the actual get/set.
> 
> What if we combine the approaches?  Add KVM_{G,S}ET_ONE_REG support so that the
> uAPI can use completely arbitrary register indices without having to worry about
> polluting the MSR space and making MSR_KVM_SSP ABI.
Sounds like a reasonable idea but might be overkill.

> 
> Ooh, if we're clever, I bet we can extend KVM_{G,S}ET_ONE_REG to also work with
> existing MSRs, GPRs, and other stuff,

Not sure if we want to make it work with MSRs. MSRs are a very well defined thing
on x86, and we already have an API to read/write them. Other registers maybe,
don't know.

>  i.e. not force userspace through the funky
> KVM_SET_MSRS just to set one reg, and not force a RMW of all GPRs just to set
> RIP or something.
Setting one GPR like RIP does sound like a valid use case of KVM_SET_ONE_REG.

>   E.g. use bits 39:32 of the id to encode the register class,
> bits 31:0 to hold the index within a class, and reserve bits 63:40 for future
> usage.
> 
> Then for KVM-defined registers, we can route them internally as needed, e.g. we
> can still define MSR_KVM_SSP so that internal it's treated like an MSR, but its
> index isn't ABI and so can be changed at will.  And future KVM-defined registers
> wouldn't _need_ to be treated like MSRs, i.e. we could route registers through
> the MSR APIs if and only if it makes sense to do so.

I am not sure that even internally I'll treat MSR_KVM_SSP as an MSR.
An MSR IMHO is an MSR, and a register is a register; mixing this up will
just add to the confusion.


Honestly, if I were to add support for the SSP register, I'd just add a new
ioctl/capability and vendor callback. All of this code is just harmless
boilerplate code.
Even using KVM_GET_ONE_REG/KVM_SET_ONE_REG is probably overkill, although using
it for new registers is reasonable.

In the end I am not going to argue much about this - I just voiced my opinion that currently
the MSR read/write interface is pure in the sense that it only works on either real MSRs or
at least PV MSRs that the guest can read/write.

All other guest state is set via separate ioctls/callbacks/etc, and thus it's more consistent
from an API POV to add SSP the same way.


> 
> Side topic, why on earth is the data value of kvm_one_reg "addr"?

I don't know, probably something ARM related.

> 


Best regards,
	Maxim Levitsky



* Re: [PATCH v6 19/25] KVM: VMX: Emulate read and write to CET MSRs
  2023-11-07 18:12           ` Maxim Levitsky
@ 2023-11-07 18:39             ` Sean Christopherson
  0 siblings, 0 replies; 120+ messages in thread
From: Sean Christopherson @ 2023-11-07 18:39 UTC (permalink / raw)
  To: Maxim Levitsky
  Cc: Yang Weijiang, pbonzini, kvm, linux-kernel, dave.hansen, peterz,
	chao.gao, rick.p.edgecombe, john.allen

On Tue, Nov 07, 2023, Maxim Levitsky wrote:
> On Thu, 2023-11-02 at 16:58 -0700, Sean Christopherson wrote:
> > Ooh, if we're clever, I bet we can extend KVM_{G,S}ET_ONE_REG to also work with
> > existing MSRs, GPRs, and other stuff,
> 
> Not sure if we want to make it work with MSRs. MSRs are a very well defined thing
> on x86, and we already have an API to read/write them.

Yeah, the API is weird though :-)

> Other registers maybe, don't know.
> 
> >  i.e. not force userspace through the funky
> > KVM_SET_MSRS just to set one reg, and not force a RMW of all GPRs just to set
> > RIP or something.
> Setting one GPR like RIP does sound like a valid use case of KVM_SET_ONE_REG.
> 
> >   E.g. use bits 39:32 of the id to encode the register class,
> > bits 31:0 to hold the index within a class, and reserve bits 63:40 for future
> > usage.
> > 
> > Then for KVM-defined registers, we can route them internally as needed, e.g. we
> > can still define MSR_KVM_SSP so that internal it's treated like an MSR, but its
> > index isn't ABI and so can be changed at will.  And future KVM-defined registers
> > wouldn't _need_ to be treated like MSRs, i.e. we could route registers through
> > the MSR APIs if and only if it makes sense to do so.
> 
> I am not sure that even internally I'll treat MSR_KVM_SSP as an MSR.
> An MSR IMHO is an MSR, and a register is a register; mixing this up will
> just add to the confusion.

I disagree, things like MSR_{FS,GS}_BASE already set the precedent that MSRs and
registers can be separate viewpoints to the same internal CPU state.  AIUI, these
days, whether a register is exposed via an MSR or dedicated ISA largely comes
down to CPL restrictions and performance.

> Honestly if I were to add support for the SSP register, I'll just add a new
> ioctl/capability and vendor callback. All of this code is just harmless
> boilerplate code.

We've had far too many bugs and confusion over error handling for things like
checking "is this value legal" to be considered harmless boilerplate code.

> Even using KVM_GET_ONE_REG/KVM_SET_ONE_REG is probably overkill, although using
> it for new registers is reasonable.

Maybe, but if we're going to bother adding new ioctls() for x86, I don't see any
benefit to reinventing a wheel that's only good for one thing.


* Re: [PATCH v6 06/25] x86/fpu/xstate: Opt-in kernel dynamic bits when calculate guest xstate size
  2023-11-07 18:04                     ` Maxim Levitsky
@ 2023-11-14  9:13                       ` Yang, Weijiang
  0 siblings, 0 replies; 120+ messages in thread
From: Yang, Weijiang @ 2023-11-14  9:13 UTC (permalink / raw)
  To: Maxim Levitsky, Sean Christopherson
  Cc: Dave Hansen, pbonzini, kvm, linux-kernel, peterz, chao.gao,
	rick.p.edgecombe, john.allen

On 11/8/2023 2:04 AM, Maxim Levitsky wrote:
> On Fri, 2023-11-03 at 07:33 -0700, Sean Christopherson wrote:
>> On Thu, Nov 02, 2023, Maxim Levitsky wrote:
>>> On Wed, 2023-11-01 at 07:16 -0700, Sean Christopherson wrote:
>>>> On Tue, Oct 31, 2023, Maxim Levitsky wrote:
>>>>> On Thu, 2023-10-26 at 10:24 -0700, Sean Christopherson wrote:
>>>>>> --
>>>>>> From: Sean Christopherson <seanjc@google.com>
>>>>>> Date: Thu, 26 Oct 2023 10:17:33 -0700
>>>>>> Subject: [PATCH] x86/fpu/xstate: Always preserve non-user xfeatures/flags in
>>>>>>   __state_perm
>>>>>>
>>>>>> Fixes: 781c64bfcb73 ("x86/fpu/xstate: Handle supervisor states in XSTATE permissions")
>>>>>> Signed-off-by: Sean Christopherson <seanjc@google.com>
>>>>>> ---
>>>>>>   arch/x86/kernel/fpu/xstate.c | 18 +++++++++++-------
>>>>>>   1 file changed, 11 insertions(+), 7 deletions(-)
>>>>>>
>>>>>> diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
>>>>>> index ef6906107c54..73f6bc00d178 100644
>>>>>> --- a/arch/x86/kernel/fpu/xstate.c
>>>>>> +++ b/arch/x86/kernel/fpu/xstate.c
>>>>>> @@ -1601,16 +1601,20 @@ static int __xstate_request_perm(u64 permitted, u64 requested, bool guest)
>>>>>>   	if ((permitted & requested) == requested)
>>>>>>   		return 0;
>>>>>>   
>>>>>> -	/* Calculate the resulting kernel state size */
>>>>>> +	/*
>>>>>> +	 * Calculate the resulting kernel state size.  Note, @permitted also
>>>>>> +	 * contains supervisor xfeatures even though supervisor are always
>>>>>> +	 * permitted for kernel and guest FPUs, and never permitted for user
>>>>>> +	 * FPUs.
>>>>>> +	 */
>>>>>>   	mask = permitted | requested;
>>>>>> -	/* Take supervisor states into account on the host */
>>>>>> -	if (!guest)
>>>>>> -		mask |= xfeatures_mask_supervisor();
>>>>>>   	ksize = xstate_calculate_size(mask, compacted);
>>>>> This might not work with kernel dynamic features, because
>>>>> xfeatures_mask_supervisor() will return all supported supervisor features.
>>>> I don't understand what you mean by "This".
>>>> Somewhat of a side topic, I feel very strongly that we should use "guest only"
>>>> terminology instead of "dynamic".  There is nothing dynamic about whether or not
>>>> XFEATURE_CET_KERNEL is allowed; there's not even a real "decision" beyond checking
>>>> whether or not CET is supported.
>>>>> Therefore at least until we have an actual kernel dynamic feature (a feature
>>>>> used by the host kernel and not KVM, and which has to be dynamic like AMX),
>>>>> I suggest that KVM stops using the permission API completely for the guest
>>>>> FPU state, and just gives all the features it wants to enable right to
>>>> By "it", I assume you mean userspace?
>>>>
>>>>> __fpu_alloc_init_guest_fpstate() (Guest FPU permission API IMHO should be
>>>>> deprecated and ignored)
>>>> KVM allocates guest FPU state during KVM_CREATE_VCPU, so not using prctl() would
>>>> either require KVM to defer allocating guest FPU state until KVM_SET_CPUID{,2},
>>>> or would require a VM-scoped KVM ioctl() to let userspace opt-in to
>>>>
>>>> Allocating guest FPU state during KVM_SET_CPUID{,2} would get messy,
>>>> as KVM allows
>>>> multiple calls to KVM_SET_CPUID{,2} so long as the vCPU hasn't done KVM_RUN.  E.g.
>>>> KVM would need to support actually resizing guest FPU state, which would be extra
>>>> complexity without any meaningful benefit.
>>> OK, I understand you now. What you claim is that it is legal to do this:
>>>
>>> - KVM_SET_XSAVE
>>> - KVM_SET_CPUID (with AMX enabled)
>>>
>>> KVM_SET_CPUID will have to resize the xstate which is already valid.
>> I was actually talking about
>>
>>    KVM_SET_CPUID2 (with dynamic user feature #1)
>>    KVM_SET_CPUID2 (with dynamic user feature #2)
>>
>> The second call through __xstate_request_perm() will be done with only user
>> xfeatures in @permitted and so the kernel will compute the wrong ksize.
>>
>>> Your patch to fix the __xstate_request_perm() does seem to be correct in a
>>> sense that it will preserve the kernel fpu components in the fpu permissions.
>>>
>>> However note that kernel fpu permissions come from
>>> 'fpu_kernel_cfg.default_features' which don't include the dynamic kernel
>>> xfeatures (added a few patches before this one).
>> CET_KERNEL isn't dynamic!  It's guest-only.  There are no runtime decisions as to
>> whether or not CET_KERNEL is allowed.  All guest FPU get CET_KERNEL, no kernel FPUs
>> get CET_KERNEL.
>>
>> That matters because I am also proposing that we add a dedicated, defined-at-boot
>> fpu_guest_cfg instead of bolting on a "dynamic", which is what I meant by this:
> Seems fair.
>
>>   : Or even better if it doesn't cause weirdness elsewhere, a dedicated
>>   : fpu_guest_cfg.  For me at least, a fpu_guest_cfg would make it easier to
>>   : understand what all is going on.
> This is a very good idea.
>
>> That way, initialization of permissions is simply
>>
>> 	fpu->guest_perm = fpu_guest_cfg.default_features;
>>
>> and there's no need to differentiate between guest and kernel FPUs when reallocating
>> for dynamic user xfeatures because guest_perm.__state_perm already holds the correct
>> data.
>>
>>> Therefore an attempt to resize the xstate to include a kernel dynamic feature by
>>> __xfd_enable_feature will fail.
>>>
>>> If kvm on the other hand includes all the kernel dynamic features in the
>>> initial allocation of FPU state (not optimal but possible),
>> This is what I am suggesting.
> This is a valid solution.

Sorry for the delayed response!

I favor adding a new fpu_guest_cfg to make things clearer.
Maybe you're thinking of a patch like the one below (not tested):

 From 19c77aad196efe7eab4a10c5882166453de287b9 Mon Sep 17 00:00:00 2001
From: Yang Weijiang <weijiang.yang@intel.com>
Date: Fri, 22 Sep 2023 00:37:20 -0400
Subject: [PATCH] x86/fpu/xstate: Introduce fpu_guest_cfg for guest FPU
  configuration

Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>
---
  arch/x86/include/asm/fpu/types.h |  2 +-
  arch/x86/kernel/fpu/core.c       | 14 +++++++++++---
  arch/x86/kernel/fpu/xstate.c     |  9 +++++++++
  3 files changed, 21 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/fpu/types.h b/arch/x86/include/asm/fpu/types.h
index c6fd13a17205..306825ad6bc0 100644
--- a/arch/x86/include/asm/fpu/types.h
+++ b/arch/x86/include/asm/fpu/types.h
@@ -602,6 +602,6 @@ struct fpu_state_config {
  };

  /* FPU state configuration information */
-extern struct fpu_state_config fpu_kernel_cfg, fpu_user_cfg;
+extern struct fpu_state_config fpu_kernel_cfg, fpu_user_cfg, fpu_guest_cfg;

  #endif /* _ASM_X86_FPU_H */
diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
index a86d37052a64..c70dad9894f0 100644
--- a/arch/x86/kernel/fpu/core.c
+++ b/arch/x86/kernel/fpu/core.c
@@ -33,9 +33,10 @@ DEFINE_STATIC_KEY_FALSE(__fpu_state_size_dynamic);
  DEFINE_PER_CPU(u64, xfd_state);
  #endif
-/* The FPU state configuration data for kernel and user space */
+/* The FPU state configuration data for kernel, user space and guest. */
  struct fpu_state_config        fpu_kernel_cfg __ro_after_init;
  struct fpu_state_config fpu_user_cfg __ro_after_init;
+struct fpu_state_config fpu_guest_cfg __ro_after_init;

  /*
   * Represents the initial FPU state. It's mostly (but not completely) zeroes,
@@ -535,8 +536,15 @@ void fpstate_reset(struct fpu *fpu)
         fpu->perm.__state_perm          = fpu_kernel_cfg.default_features;
         fpu->perm.__state_size          = fpu_kernel_cfg.default_size;
         fpu->perm.__user_state_size     = fpu_user_cfg.default_size;
-       /* Same defaults for guests */
-       fpu->guest_perm = fpu->perm;
+
+       /* Guest permission settings */
+       fpu->guest_perm.__state_perm    = fpu_guest_cfg.default_features;
+       fpu->guest_perm.__state_size    = fpu_guest_cfg.default_size;
+       /*
+        * Set guest's __user_state_size to fpu_user_cfg.default_size so that
+        * existing uAPIs can still work.
+        */
+       fpu->guest_perm.__user_state_size = fpu_user_cfg.default_size;
  }

  static inline void fpu_inherit_perms(struct fpu *dst_fpu)
diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index 1b7bc03968c5..bebabace628b 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -681,6 +681,7 @@ static int __init init_xstate_size(void)
  {
         /* Recompute the context size for enabled features: */
         unsigned int user_size, kernel_size, kernel_default_size;
+       unsigned int guest_default_size;
         bool compacted = cpu_feature_enabled(X86_FEATURE_XCOMPACTED);

         /* Uncompacted user space size */
@@ -702,13 +703,18 @@ static int __init init_xstate_size(void)
         kernel_default_size =
                 xstate_calculate_size(fpu_kernel_cfg.default_features, compacted);

+       guest_default_size =
+               xstate_calculate_size(fpu_guest_cfg.default_features, compacted);
+
         if (!paranoid_xstate_size_valid(kernel_size))
                 return -EINVAL;

         fpu_kernel_cfg.max_size = kernel_size;
         fpu_user_cfg.max_size = user_size;
+       fpu_guest_cfg.max_size = kernel_size;

         fpu_kernel_cfg.default_size = kernel_default_size;
+       fpu_guest_cfg.default_size = guest_default_size;
         fpu_user_cfg.default_size =
                 xstate_calculate_size(fpu_user_cfg.default_features, false);

@@ -829,6 +835,9 @@ void __init fpu__init_system_xstate(unsigned int legacy_size)
         fpu_user_cfg.default_features = fpu_user_cfg.max_features;
         fpu_user_cfg.default_features &= ~XFEATURE_MASK_USER_DYNAMIC;

+       fpu_guest_cfg.max_features = fpu_kernel_cfg.max_features;
+       fpu_guest_cfg.default_features = fpu_guest_cfg.max_features &
+                                        ~XFEATURE_MASK_USER_DYNAMIC;
         /* Store it for paranoia check at the end */
         xfeatures = fpu_kernel_cfg.max_features;

--
2.27.0

[...]

^ permalink raw reply related	[flat|nested] 120+ messages in thread

* Re: [PATCH v6 12/25] KVM: x86: Refresh CPUID on write to guest MSR_IA32_XSS
  2023-09-14  6:33 ` [PATCH v6 12/25] KVM: x86: Refresh CPUID on write to guest MSR_IA32_XSS Yang Weijiang
  2023-10-08  5:54   ` Chao Gao
  2023-10-31 17:51   ` Maxim Levitsky
@ 2023-11-15  7:18   ` Binbin Wu
  2 siblings, 0 replies; 120+ messages in thread
From: Binbin Wu @ 2023-11-15  7:18 UTC (permalink / raw)
  To: seanjc, pbonzini
  Cc: Yang Weijiang, linux-kernel, kvm, dave.hansen, peterz, chao.gao,
	rick.p.edgecombe, john.allen, Zhang Yi Z



On 9/14/2023 2:33 PM, Yang Weijiang wrote:
> Update CPUID.(EAX=0DH,ECX=1).EBX to reflect current required xstate size
> due to XSS MSR modification.
> CPUID(EAX=0DH,ECX=1).EBX reports the required storage size of all enabled
> xstate features in (XCR0 | IA32_XSS). The CPUID value can be used by the
> guest before allocating a sufficiently sized xsave buffer.
>
> Note, KVM does not yet support any XSS based features, i.e. supported_xss
> is guaranteed to be zero at this time.
>
> Opportunistically modify the XSS write access logic: if !guest_cpuid_has(),
> a write initiated from the host is allowed iff it is a reset operation,
> i.e., data == 0; reject host_initiated non-reset writes and any guest write.
Hi Sean & Paolo,
During code review of the Enable CET Virtualization v5 patchset, there were
discussions about "do a wholesale cleanup of all the cases that essentially
allow userspace to do KVM_SET_MSR before KVM_SET_CPUID2", i.e. enforce the
order between KVM_SET_CPUID2 and KVM_SET_MSR, but still allow the
host_initiated path with the default (generally 0) value.
https://lore.kernel.org/kvm/ZM1C+ILRMCfzJxx7@google.com/
https://lore.kernel.org/kvm/CABgObfbvr8F8g5hJN6jn95m7u7m2+8ACkqO25KAZwRmJ9AncZg@mail.gmail.com/

I can take on the task of doing this cleanup.
Before going any further, I want to confirm that this is still the intended
direction, right?


>
> Suggested-by: Sean Christopherson <seanjc@google.com>
> Co-developed-by: Zhang Yi Z <yi.z.zhang@linux.intel.com>
> Signed-off-by: Zhang Yi Z <yi.z.zhang@linux.intel.com>
> Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>
> ---
>   arch/x86/include/asm/kvm_host.h |  1 +
>   arch/x86/kvm/cpuid.c            | 15 ++++++++++++++-
>   arch/x86/kvm/x86.c              | 13 +++++++++----
>   3 files changed, 24 insertions(+), 5 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 0fc5e6312e93..d77b030e996c 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -803,6 +803,7 @@ struct kvm_vcpu_arch {
>   
>   	u64 xcr0;
>   	u64 guest_supported_xcr0;
> +	u64 guest_supported_xss;
>   
>   	struct kvm_pio_request pio;
>   	void *pio_data;
> diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
> index 1f206caec559..4e7a820cba62 100644
> --- a/arch/x86/kvm/cpuid.c
> +++ b/arch/x86/kvm/cpuid.c
> @@ -275,7 +275,8 @@ static void __kvm_update_cpuid_runtime(struct kvm_vcpu *vcpu, struct kvm_cpuid_e
>   	best = cpuid_entry2_find(entries, nent, 0xD, 1);
>   	if (best && (cpuid_entry_has(best, X86_FEATURE_XSAVES) ||
>   		     cpuid_entry_has(best, X86_FEATURE_XSAVEC)))
> -		best->ebx = xstate_required_size(vcpu->arch.xcr0, true);
> +		best->ebx = xstate_required_size(vcpu->arch.xcr0 |
> +						 vcpu->arch.ia32_xss, true);
>   
>   	best = __kvm_find_kvm_cpuid_features(vcpu, entries, nent);
>   	if (kvm_hlt_in_guest(vcpu->kvm) && best &&
> @@ -312,6 +313,17 @@ static u64 vcpu_get_supported_xcr0(struct kvm_vcpu *vcpu)
>   	return (best->eax | ((u64)best->edx << 32)) & kvm_caps.supported_xcr0;
>   }
>   
> +static u64 vcpu_get_supported_xss(struct kvm_vcpu *vcpu)
> +{
> +	struct kvm_cpuid_entry2 *best;
> +
> +	best = kvm_find_cpuid_entry_index(vcpu, 0xd, 1);
> +	if (!best)
> +		return 0;
> +
> +	return (best->ecx | ((u64)best->edx << 32)) & kvm_caps.supported_xss;
> +}
> +
>   static bool kvm_cpuid_has_hyperv(struct kvm_cpuid_entry2 *entries, int nent)
>   {
>   	struct kvm_cpuid_entry2 *entry;
> @@ -358,6 +370,7 @@ static void kvm_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
>   	}
>   
>   	vcpu->arch.guest_supported_xcr0 = vcpu_get_supported_xcr0(vcpu);
> +	vcpu->arch.guest_supported_xss = vcpu_get_supported_xss(vcpu);
>   
>   	/*
>   	 * FP+SSE can always be saved/restored via KVM_{G,S}ET_XSAVE, even if
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 1258d1d6dd52..9a616d84bd39 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -3795,20 +3795,25 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
>   			vcpu->arch.ia32_tsc_adjust_msr += adj;
>   		}
>   		break;
> -	case MSR_IA32_XSS:
> -		if (!msr_info->host_initiated &&
> -		    !guest_cpuid_has(vcpu, X86_FEATURE_XSAVES))
> +	case MSR_IA32_XSS: {
> +		bool host_msr_reset = msr_info->host_initiated && data == 0;
> +
> +		if (!guest_cpuid_has(vcpu, X86_FEATURE_XSAVES) &&
> +		    (!host_msr_reset || !msr_info->host_initiated))
>   			return 1;
>   		/*
>   		 * KVM supports exposing PT to the guest, but does not support
>   		 * IA32_XSS[bit 8]. Guests have to use RDMSR/WRMSR rather than
>   		 * XSAVES/XRSTORS to save/restore PT MSRs.
>   		 */
> -		if (data & ~kvm_caps.supported_xss)
> +		if (data & ~vcpu->arch.guest_supported_xss)
>   			return 1;
> +		if (vcpu->arch.ia32_xss == data)
> +			break;
>   		vcpu->arch.ia32_xss = data;
>   		kvm_update_cpuid_runtime(vcpu);
>   		break;
> +	}
>   	case MSR_SMI_COUNT:
>   		if (!msr_info->host_initiated)
>   			return 1;



* Re: [PATCH v6 25/25] KVM: nVMX: Enable CET support for nested guest
  2023-11-01  2:09   ` Chao Gao
  2023-11-01  9:22     ` Yang, Weijiang
  2023-11-01  9:54     ` Maxim Levitsky
@ 2023-11-15  8:23     ` Yang, Weijiang
  2 siblings, 0 replies; 120+ messages in thread
From: Yang, Weijiang @ 2023-11-15  8:23 UTC (permalink / raw)
  To: Chao Gao
  Cc: seanjc, pbonzini, kvm, linux-kernel, dave.hansen, peterz,
	rick.p.edgecombe, john.allen

On 11/1/2023 10:09 AM, Chao Gao wrote:
> On Thu, Sep 14, 2023 at 02:33:25AM -0400, Yang Weijiang wrote:
>> Set up CET MSRs, related VM_ENTRY/EXIT control bits and fixed CR4 setting
>> to enable CET for nested VM.
>>
>> Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>
>> ---
>> arch/x86/kvm/vmx/nested.c | 27 +++++++++++++++++++++++++--
>> arch/x86/kvm/vmx/vmcs12.c |  6 ++++++
>> arch/x86/kvm/vmx/vmcs12.h | 14 +++++++++++++-
>> arch/x86/kvm/vmx/vmx.c    |  2 ++
>> 4 files changed, 46 insertions(+), 3 deletions(-)
>>
>> diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
>> index 78a3be394d00..2c4ff13fddb0 100644
>> --- a/arch/x86/kvm/vmx/nested.c
>> +++ b/arch/x86/kvm/vmx/nested.c
>> @@ -660,6 +660,28 @@ static inline bool nested_vmx_prepare_msr_bitmap(struct kvm_vcpu *vcpu,
>> 	nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
>> 					 MSR_IA32_FLUSH_CMD, MSR_TYPE_W);
>>
>> +	/* Pass CET MSRs to nested VM if L0 and L1 are set to pass-through. */
>> +	nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
>> +					 MSR_IA32_U_CET, MSR_TYPE_RW);
>> +
>> +	nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
>> +					 MSR_IA32_S_CET, MSR_TYPE_RW);
>> +
>> +	nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
>> +					 MSR_IA32_PL0_SSP, MSR_TYPE_RW);
>> +
>> +	nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
>> +					 MSR_IA32_PL1_SSP, MSR_TYPE_RW);
>> +
>> +	nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
>> +					 MSR_IA32_PL2_SSP, MSR_TYPE_RW);
>> +
>> +	nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
>> +					 MSR_IA32_PL3_SSP, MSR_TYPE_RW);
>> +
>> +	nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
>> +					 MSR_IA32_INT_SSP_TAB, MSR_TYPE_RW);
>> +
>> 	kvm_vcpu_unmap(vcpu, &vmx->nested.msr_bitmap_map, false);
>>
>> 	vmx->nested.force_msr_bitmap_recalc = false;
>> @@ -6794,7 +6816,7 @@ static void nested_vmx_setup_exit_ctls(struct vmcs_config *vmcs_conf,
>> 		VM_EXIT_HOST_ADDR_SPACE_SIZE |
>> #endif
>> 		VM_EXIT_LOAD_IA32_PAT | VM_EXIT_SAVE_IA32_PAT |
>> -		VM_EXIT_CLEAR_BNDCFGS;
>> +		VM_EXIT_CLEAR_BNDCFGS | VM_EXIT_LOAD_CET_STATE;
>> 	msrs->exit_ctls_high |=
>> 		VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR |
>> 		VM_EXIT_LOAD_IA32_EFER | VM_EXIT_SAVE_IA32_EFER |
>> @@ -6816,7 +6838,8 @@ static void nested_vmx_setup_entry_ctls(struct vmcs_config *vmcs_conf,
>> #ifdef CONFIG_X86_64
>> 		VM_ENTRY_IA32E_MODE |
>> #endif
>> -		VM_ENTRY_LOAD_IA32_PAT | VM_ENTRY_LOAD_BNDCFGS;
>> +		VM_ENTRY_LOAD_IA32_PAT | VM_ENTRY_LOAD_BNDCFGS |
>> +		VM_ENTRY_LOAD_CET_STATE;
>> 	msrs->entry_ctls_high |=
>> 		(VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR | VM_ENTRY_LOAD_IA32_EFER |
>> 		 VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL);
>> diff --git a/arch/x86/kvm/vmx/vmcs12.c b/arch/x86/kvm/vmx/vmcs12.c
>> index 106a72c923ca..4233b5ca9461 100644
>> --- a/arch/x86/kvm/vmx/vmcs12.c
>> +++ b/arch/x86/kvm/vmx/vmcs12.c
>> @@ -139,6 +139,9 @@ const unsigned short vmcs12_field_offsets[] = {
>> 	FIELD(GUEST_PENDING_DBG_EXCEPTIONS, guest_pending_dbg_exceptions),
>> 	FIELD(GUEST_SYSENTER_ESP, guest_sysenter_esp),
>> 	FIELD(GUEST_SYSENTER_EIP, guest_sysenter_eip),
>> +	FIELD(GUEST_S_CET, guest_s_cet),
>> +	FIELD(GUEST_SSP, guest_ssp),
>> +	FIELD(GUEST_INTR_SSP_TABLE, guest_ssp_tbl),
> I think we need to sync guest states, e.g., guest_s_cet/guest_ssp/guest_ssp_tbl,
> between vmcs02 and vmcs12 on nested VM entry/exit, probably in
> sync_vmcs02_to_vmcs12() and prepare_vmcs12() or "_rare" variants of them.

After checking the code, I agree it's necessary to sync the related fields from
vmcs02 to vmcs12 at nested VM exit so that L1 or userspace can read the correct
values. I'll add this part, thanks!


* Re: [PATCH v6 24/25] KVM: nVMX: Introduce new VMX_BASIC bit for event error_code delivery to L1
  2023-11-01  4:21   ` Chao Gao
@ 2023-11-15  8:31     ` Yang, Weijiang
  2023-11-15 13:27       ` Sean Christopherson
  0 siblings, 1 reply; 120+ messages in thread
From: Yang, Weijiang @ 2023-11-15  8:31 UTC (permalink / raw)
  To: Chao Gao
  Cc: seanjc, pbonzini, kvm, linux-kernel, dave.hansen, peterz,
	rick.p.edgecombe, john.allen

On 11/1/2023 12:21 PM, Chao Gao wrote:
> On Thu, Sep 14, 2023 at 02:33:24AM -0400, Yang Weijiang wrote:
>> Per SDM description(Vol.3D, Appendix A.1):
>> "If bit 56 is read as 1, software can use VM entry to deliver a hardware
>> exception with or without an error code, regardless of vector"
>>
>> Modify the has_error_code check before injecting events into the nested
>> guest: only enforce the check when the guest is in real mode, the exception
>> is not a hardware exception, or the platform doesn't enumerate bit 56 in
>> VMX_BASIC; in all other cases skip the check to make the logic consistent
>> with the SDM.
>>
>> Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>
>> ---
>> arch/x86/kvm/vmx/nested.c | 22 ++++++++++++++--------
>> arch/x86/kvm/vmx/nested.h |  5 +++++
>> 2 files changed, 19 insertions(+), 8 deletions(-)
>>
>> diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
>> index c5ec0ef51ff7..78a3be394d00 100644
>> --- a/arch/x86/kvm/vmx/nested.c
>> +++ b/arch/x86/kvm/vmx/nested.c
>> @@ -1205,9 +1205,9 @@ static int vmx_restore_vmx_basic(struct vcpu_vmx *vmx, u64 data)
>> {
>> 	const u64 feature_and_reserved =
>> 		/* feature (except bit 48; see below) */
>> -		BIT_ULL(49) | BIT_ULL(54) | BIT_ULL(55) |
>> +		BIT_ULL(49) | BIT_ULL(54) | BIT_ULL(55) | BIT_ULL(56) |
>> 		/* reserved */
>> -		BIT_ULL(31) | GENMASK_ULL(47, 45) | GENMASK_ULL(63, 56);
>> +		BIT_ULL(31) | GENMASK_ULL(47, 45) | GENMASK_ULL(63, 57);
>> 	u64 vmx_basic = vmcs_config.nested.basic;
>>
>> 	if (!is_bitwise_subset(vmx_basic, data, feature_and_reserved))
>> @@ -2846,12 +2846,16 @@ static int nested_check_vm_entry_controls(struct kvm_vcpu *vcpu,
>> 		    CC(intr_type == INTR_TYPE_OTHER_EVENT && vector != 0))
>> 			return -EINVAL;
>>
>> -		/* VM-entry interruption-info field: deliver error code */
>> -		should_have_error_code =
>> -			intr_type == INTR_TYPE_HARD_EXCEPTION && prot_mode &&
>> -			x86_exception_has_error_code(vector);
>> -		if (CC(has_error_code != should_have_error_code))
>> -			return -EINVAL;
>> +		if (!prot_mode || intr_type != INTR_TYPE_HARD_EXCEPTION ||
>> +		    !nested_cpu_has_no_hw_errcode_cc(vcpu)) {
>> +			/* VM-entry interruption-info field: deliver error code */
>> +			should_have_error_code =
>> +				intr_type == INTR_TYPE_HARD_EXCEPTION &&
>> +				prot_mode &&
>> +				x86_exception_has_error_code(vector);
>> +			if (CC(has_error_code != should_have_error_code))
>> +				return -EINVAL;
>> +		}
> prot_mode and intr_type are used twice, making the code a little hard to read.
>
> how about:
> 		/*
> 		 * Cannot deliver error code in real mode or if the
> 		 * interruption type is not hardware exception. For other
> 		 * cases, do the consistency check only if the vCPU doesn't
> 		 * enumerate VMX_BASIC_NO_HW_ERROR_CODE_CC.
> 		 */
> 		if (!prot_mode || intr_type != INTR_TYPE_HARD_EXCEPTION) {
> 			if (CC(has_error_code))
> 				return -EINVAL;
> 		} else if (!nested_cpu_has_no_hw_errcode_cc(vcpu)) {
> 			if (CC(has_error_code != x86_exception_has_error_code(vector)))
> 				return -EINVAL;
> 		}
>
> and drop should_have_error_code.

The change looks clearer, I'll take it, thanks!




* Re: [PATCH v6 25/25] KVM: nVMX: Enable CET support for nested guest
  2023-11-01  9:54     ` Maxim Levitsky
@ 2023-11-15  8:56       ` Yang, Weijiang
  0 siblings, 0 replies; 120+ messages in thread
From: Yang, Weijiang @ 2023-11-15  8:56 UTC (permalink / raw)
  To: Maxim Levitsky, Chao Gao
  Cc: seanjc, pbonzini, kvm, linux-kernel, dave.hansen, peterz,
	rick.p.edgecombe, john.allen

On 11/1/2023 5:54 PM, Maxim Levitsky wrote:
> On Wed, 2023-11-01 at 10:09 +0800, Chao Gao wrote:
>> On Thu, Sep 14, 2023 at 02:33:25AM -0400, Yang Weijiang wrote:
>>> Set up CET MSRs, related VM_ENTRY/EXIT control bits and fixed CR4 setting
>>> to enable CET for nested VM.
>>>
>>> Signed-off-by: Yang Weijiang <weijiang.yang@intel.com>
>>> ---
>>> arch/x86/kvm/vmx/nested.c | 27 +++++++++++++++++++++++++--
>>> arch/x86/kvm/vmx/vmcs12.c |  6 ++++++
>>> arch/x86/kvm/vmx/vmcs12.h | 14 +++++++++++++-
>>> arch/x86/kvm/vmx/vmx.c    |  2 ++
>>> 4 files changed, 46 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
>>> index 78a3be394d00..2c4ff13fddb0 100644
>>> --- a/arch/x86/kvm/vmx/nested.c
>>> +++ b/arch/x86/kvm/vmx/nested.c
>>> @@ -660,6 +660,28 @@ static inline bool nested_vmx_prepare_msr_bitmap(struct kvm_vcpu *vcpu,
>>> 	nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
>>> 					 MSR_IA32_FLUSH_CMD, MSR_TYPE_W);
>>>
>>> +	/* Pass CET MSRs to nested VM if L0 and L1 are set to pass-through. */
>>> +	nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
>>> +					 MSR_IA32_U_CET, MSR_TYPE_RW);
>>> +
>>> +	nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
>>> +					 MSR_IA32_S_CET, MSR_TYPE_RW);
>>> +
>>> +	nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
>>> +					 MSR_IA32_PL0_SSP, MSR_TYPE_RW);
>>> +
>>> +	nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
>>> +					 MSR_IA32_PL1_SSP, MSR_TYPE_RW);
>>> +
>>> +	nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
>>> +					 MSR_IA32_PL2_SSP, MSR_TYPE_RW);
>>> +
>>> +	nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
>>> +					 MSR_IA32_PL3_SSP, MSR_TYPE_RW);
>>> +
>>> +	nested_vmx_set_intercept_for_msr(vmx, msr_bitmap_l1, msr_bitmap_l0,
>>> +					 MSR_IA32_INT_SSP_TAB, MSR_TYPE_RW);
>>> +
>>> 	kvm_vcpu_unmap(vcpu, &vmx->nested.msr_bitmap_map, false);
>>>
>>> 	vmx->nested.force_msr_bitmap_recalc = false;
>>> @@ -6794,7 +6816,7 @@ static void nested_vmx_setup_exit_ctls(struct vmcs_config *vmcs_conf,
>>> 		VM_EXIT_HOST_ADDR_SPACE_SIZE |
>>> #endif
>>> 		VM_EXIT_LOAD_IA32_PAT | VM_EXIT_SAVE_IA32_PAT |
>>> -		VM_EXIT_CLEAR_BNDCFGS;
>>> +		VM_EXIT_CLEAR_BNDCFGS | VM_EXIT_LOAD_CET_STATE;
>>> 	msrs->exit_ctls_high |=
>>> 		VM_EXIT_ALWAYSON_WITHOUT_TRUE_MSR |
>>> 		VM_EXIT_LOAD_IA32_EFER | VM_EXIT_SAVE_IA32_EFER |
>>> @@ -6816,7 +6838,8 @@ static void nested_vmx_setup_entry_ctls(struct vmcs_config *vmcs_conf,
>>> #ifdef CONFIG_X86_64
>>> 		VM_ENTRY_IA32E_MODE |
>>> #endif
>>> -		VM_ENTRY_LOAD_IA32_PAT | VM_ENTRY_LOAD_BNDCFGS;
>>> +		VM_ENTRY_LOAD_IA32_PAT | VM_ENTRY_LOAD_BNDCFGS |
>>> +		VM_ENTRY_LOAD_CET_STATE;
>>> 	msrs->entry_ctls_high |=
>>> 		(VM_ENTRY_ALWAYSON_WITHOUT_TRUE_MSR | VM_ENTRY_LOAD_IA32_EFER |
>>> 		 VM_ENTRY_LOAD_IA32_PERF_GLOBAL_CTRL);
>>> diff --git a/arch/x86/kvm/vmx/vmcs12.c b/arch/x86/kvm/vmx/vmcs12.c
>>> index 106a72c923ca..4233b5ca9461 100644
>>> --- a/arch/x86/kvm/vmx/vmcs12.c
>>> +++ b/arch/x86/kvm/vmx/vmcs12.c
>>> @@ -139,6 +139,9 @@ const unsigned short vmcs12_field_offsets[] = {
>>> 	FIELD(GUEST_PENDING_DBG_EXCEPTIONS, guest_pending_dbg_exceptions),
>>> 	FIELD(GUEST_SYSENTER_ESP, guest_sysenter_esp),
>>> 	FIELD(GUEST_SYSENTER_EIP, guest_sysenter_eip),
>>> +	FIELD(GUEST_S_CET, guest_s_cet),
>>> +	FIELD(GUEST_SSP, guest_ssp),
>>> +	FIELD(GUEST_INTR_SSP_TABLE, guest_ssp_tbl),
>> I think we need to sync guest states, e.g., guest_s_cet/guest_ssp/guest_ssp_tbl,
>> between vmcs02 and vmcs12 on nested VM entry/exit, probably in
>> sync_vmcs02_to_vmcs12() and prepare_vmcs12() or "_rare" variants of them.
>>
> Aha, this is why I suspected that nested support is incomplete,
> 100% agree.
>
> In particular, looking at Intel's SDM I see that:
>
> HOST_S_CET, HOST_SSP, HOST_INTR_SSP_TABLE needs to be copied from vmcb12 to vmcb02 but not vise versa
> because the CPU doesn't touch them.
>
> GUEST_S_CET, GUEST_SSP, GUEST_INTR_SSP_TABLE should be copied bi-directionally.

Yes, I'll complete this part of the code in the next version, thanks!

> This of course depends on the corresponding vm entry and vm exit controls being set.
> That means that it is legal in theory to do VM entry/exit with CET enabled but not use
> VM_ENTRY_LOAD_CET_STATE and/or VM_EXIT_LOAD_CET_STATE,
> because for example nested hypervisor in theory can opt to save/load these itself.
>
> I think that this is all, but I also can't be 100% sure. This thing has to be tested well before
> we can be sure that it works.
>
> Best regards,
> 	Maxim Levitsky
>



* Re: [PATCH v6 10/25] KVM: x86: Add kvm_msr_{read,write}() helpers
  2023-11-02 18:26       ` Maxim Levitsky
@ 2023-11-15  9:00         ` Yang, Weijiang
  0 siblings, 0 replies; 120+ messages in thread
From: Yang, Weijiang @ 2023-11-15  9:00 UTC (permalink / raw)
  To: Maxim Levitsky, Sean Christopherson
  Cc: pbonzini, kvm, linux-kernel, dave.hansen, peterz, chao.gao,
	rick.p.edgecombe, john.allen

On 11/3/2023 2:26 AM, Maxim Levitsky wrote:
> On Wed, 2023-11-01 at 12:32 -0700, Sean Christopherson wrote:
>> On Tue, Oct 31, 2023, Maxim Levitsky wrote:
>>> On Thu, 2023-09-14 at 02:33 -0400, Yang Weijiang wrote:
>>>> Wrap __kvm_{get,set}_msr() into two new helpers for KVM usage and use the
>>>> helpers to replace existing usage of the raw functions.
>>>> kvm_msr_{read,write}() are KVM-internal helpers, i.e. used when KVM needs
>>>> to get/set a MSR value for emulating CPU behavior.
>>> I am not sure if I like this patch or not. On one hand the code is cleaner
>>> this way, but on the other hand now it is easier to call kvm_msr_write() on
>>> behalf of the guest.
>>>
>>> For example we also have the 'kvm_set_msr()' which does actually set the msr
>>> on behalf of the guest.
>>>
>>> How about we call the new function kvm_msr_set_host() and rename
>>> kvm_set_msr() to kvm_msr_set_guest(), together with good comments explaning
>>> what they do?
>> LOL, just call me Nostradamus[*] ;-)
>>
>>   : > SSP save/load should go to enter_smm_save_state_64() and rsm_load_state_64(),
>>   : > where other fields of SMRAM are handled.
>>   :
>>   : +1.  The right way to get/set MSRs like this is to use __kvm_get_msr() and pass
>>   : %true for @host_initiated.  Though I would add a prep patch to provide wrappers
>>   : for __kvm_get_msr() and __kvm_set_msr().  Naming will be hard, but I think we
>>                                               ^^^^^^^^^^^^^^^^^^^
>>   : can use kvm_{read,write}_msr() to go along with the KVM-initiated register
>>   : accessors/mutators, e.g. kvm_register_read(), kvm_pdptr_write(), etc.
>>
>> [*] https://lore.kernel.org/all/ZM0YZgFsYWuBFOze@google.com
>>
>>> Also functions like kvm_set_msr_ignored_check(), kvm_set_msr_with_filter() and such,
>>> IMHO have names that are not very user friendly.
>> I don't like the host/guest split because KVM always operates on guest values,
>> e.g. kvm_msr_set_host() in particular could get confusing.
> That makes sense.
>
>> IMO kvm_get_msr() and kvm_set_msr(), and to some extent the helpers you note below,
>> are the real problem.
>>
>> What if we rename kvm_{g,s}et_msr() to kvm_emulate_msr_{read,write}() to make it
>> more obvious that those are the "guest" helpers?  And do that as a prep patch in
>> this series (there aren't _that_ many users).
> Makes sense.

Then I'll modify the related code and add the prep patch in the next version, thanks!

>> I'm also in favor of renaming the "inner" helpers, but I think we should tackle
>> those separately.
> OK.
>
> Best regards,
> 	Maxim Levitsky
>
>



* Re: [PATCH v6 24/25] KVM: nVMX: Introduce new VMX_BASIC bit for event error_code delivery to L1
  2023-11-15  8:31     ` Yang, Weijiang
@ 2023-11-15 13:27       ` Sean Christopherson
  0 siblings, 0 replies; 120+ messages in thread
From: Sean Christopherson @ 2023-11-15 13:27 UTC (permalink / raw)
  To: Weijiang Yang
  Cc: Chao Gao, pbonzini, kvm, linux-kernel, dave.hansen, peterz,
	rick.p.edgecombe, john.allen

On Wed, Nov 15, 2023, Weijiang Yang wrote:
> On 11/1/2023 12:21 PM, Chao Gao wrote:
> > On Thu, Sep 14, 2023 at 02:33:24AM -0400, Yang Weijiang wrote:
> > > @@ -2846,12 +2846,16 @@ static int nested_check_vm_entry_controls(struct kvm_vcpu *vcpu,
> > > 		    CC(intr_type == INTR_TYPE_OTHER_EVENT && vector != 0))
> > > 			return -EINVAL;
> > > 
> > > -		/* VM-entry interruption-info field: deliver error code */
> > > -		should_have_error_code =
> > > -			intr_type == INTR_TYPE_HARD_EXCEPTION && prot_mode &&
> > > -			x86_exception_has_error_code(vector);
> > > -		if (CC(has_error_code != should_have_error_code))
> > > -			return -EINVAL;
> > > +		if (!prot_mode || intr_type != INTR_TYPE_HARD_EXCEPTION ||
> > > +		    !nested_cpu_has_no_hw_errcode_cc(vcpu)) {
> > > +			/* VM-entry interruption-info field: deliver error code */
> > > +			should_have_error_code =
> > > +				intr_type == INTR_TYPE_HARD_EXCEPTION &&
> > > +				prot_mode &&
> > > +				x86_exception_has_error_code(vector);
> > > +			if (CC(has_error_code != should_have_error_code))
> > > +				return -EINVAL;
> > > +		}
> > prot_mode and intr_type are used twice, making the code a little hard to read.
> > 
> > how about:
> > 		/*
> > 		 * Cannot deliver error code in real mode or if the
> > 		 * interruption type is not hardware exception. For other
> > 		 * cases, do the consistency check only if the vCPU doesn't
> > 		 * enumerate VMX_BASIC_NO_HW_ERROR_CODE_CC.
> > 		 */
> > 		if (!prot_mode || intr_type != INTR_TYPE_HARD_EXCEPTION) {
> > 			if (CC(has_error_code))
> > 				return -EINVAL;
> > 		} else if (!nested_cpu_has_no_hw_errcode_cc(vcpu)) {
> > 			if (CC(has_error_code != x86_exception_has_error_code(vector)))
> > 				return -EINVAL;
> > 		}

Or maybe go one step further and put the nested_cpu_has...() check inside the CC()
macro so that it too will be captured on error.  It's a little uglier though, and
I doubt providing that extra information will matter in practice, so definitely
feel free to stick with Chao's version.

		if (!prot_mode || intr_type != INTR_TYPE_HARD_EXCEPTION) {
			if (CC(has_error_code))
				return -EINVAL;
		} else if (CC(!nested_cpu_has_no_hw_errcode_cc(vcpu) &&
			      has_error_code != x86_exception_has_error_code(vector))) {
			return -EINVAL;
		}


end of thread, other threads:[~2023-11-15 13:27 UTC | newest]

Thread overview: 120+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-09-14  6:33 [PATCH v6 00/25] Enable CET Virtualization Yang Weijiang
2023-09-14  6:33 ` [PATCH v6 01/25] x86/fpu/xstate: Manually check and add XFEATURE_CET_USER xstate bit Yang Weijiang
2023-09-14 22:39   ` Edgecombe, Rick P
2023-09-15  2:32     ` Yang, Weijiang
2023-09-15 16:35       ` Edgecombe, Rick P
2023-09-18  7:16         ` Yang, Weijiang
2023-10-31 17:43   ` Maxim Levitsky
2023-11-01  9:19     ` Yang, Weijiang
2023-09-14  6:33 ` [PATCH v6 02/25] x86/fpu/xstate: Fix guest fpstate allocation size calculation Yang Weijiang
2023-09-14 22:45   ` Edgecombe, Rick P
2023-09-15  2:45     ` Yang, Weijiang
2023-09-15 16:35       ` Edgecombe, Rick P
2023-10-21  0:39   ` Sean Christopherson
2023-10-24  8:50     ` Yang, Weijiang
2023-10-24 16:32       ` Sean Christopherson
2023-10-25 13:49         ` Yang, Weijiang
2023-10-31 17:43         ` Maxim Levitsky
2023-09-14  6:33 ` [PATCH v6 03/25] x86/fpu/xstate: Add CET supervisor mode state support Yang Weijiang
2023-09-15  0:06   ` Edgecombe, Rick P
2023-09-15  6:30     ` Yang, Weijiang
2023-10-31 17:44       ` Maxim Levitsky
2023-09-14  6:33 ` [PATCH v6 04/25] x86/fpu/xstate: Introduce kernel dynamic xfeature set Yang Weijiang
2023-09-15  0:24   ` Edgecombe, Rick P
2023-09-15  6:42     ` Yang, Weijiang
2023-10-31 17:44       ` Maxim Levitsky
2023-09-14  6:33 ` [PATCH v6 05/25] x86/fpu/xstate: Remove kernel dynamic xfeatures from kernel default_features Yang Weijiang
2023-09-14 16:22   ` Dave Hansen
2023-09-15  1:52     ` Yang, Weijiang
2023-10-31 17:44     ` Maxim Levitsky
2023-09-14  6:33 ` [PATCH v6 06/25] x86/fpu/xstate: Opt-in kernel dynamic bits when calculate guest xstate size Yang Weijiang
2023-09-14 17:40   ` Dave Hansen
2023-09-15  2:22     ` Yang, Weijiang
2023-10-24 17:07       ` Sean Christopherson
2023-10-25 14:49         ` Yang, Weijiang
2023-10-26 17:24           ` Sean Christopherson
2023-10-26 22:06             ` Edgecombe, Rick P
2023-10-31 17:45             ` Maxim Levitsky
2023-11-01 14:16               ` Sean Christopherson
2023-11-02 18:20                 ` Maxim Levitsky
2023-11-03 14:33                   ` Sean Christopherson
2023-11-07 18:04                     ` Maxim Levitsky
2023-11-14  9:13                       ` Yang, Weijiang
2023-09-14  6:33 ` [PATCH v6 07/25] x86/fpu/xstate: Tweak guest fpstate to support kernel dynamic xfeatures Yang Weijiang
2023-10-31 17:45   ` Maxim Levitsky
2023-09-14  6:33 ` [PATCH v6 08/25] x86/fpu/xstate: WARN if normal fpstate contains " Yang Weijiang
2023-10-31 17:45   ` Maxim Levitsky
2023-09-14  6:33 ` [PATCH v6 09/25] KVM: x86: Rework cpuid_get_supported_xcr0() to operate on vCPU data Yang Weijiang
2023-10-31 17:46   ` Maxim Levitsky
2023-11-01 14:41     ` Sean Christopherson
2023-11-02 18:25       ` Maxim Levitsky
2023-09-14  6:33 ` [PATCH v6 10/25] KVM: x86: Add kvm_msr_{read,write}() helpers Yang Weijiang
2023-10-31 17:47   ` Maxim Levitsky
2023-11-01 19:32     ` Sean Christopherson
2023-11-02 18:26       ` Maxim Levitsky
2023-11-15  9:00         ` Yang, Weijiang
2023-09-14  6:33 ` [PATCH v6 11/25] KVM: x86: Report XSS as to-be-saved if there are supported features Yang Weijiang
2023-10-31 17:47   ` Maxim Levitsky
2023-11-01 19:18     ` Sean Christopherson
2023-11-02 18:31       ` Maxim Levitsky
2023-09-14  6:33 ` [PATCH v6 12/25] KVM: x86: Refresh CPUID on write to guest MSR_IA32_XSS Yang Weijiang
2023-10-08  5:54   ` Chao Gao
2023-10-10  0:49     ` Yang, Weijiang
2023-10-31 17:51   ` Maxim Levitsky
2023-11-01 17:20     ` Sean Christopherson
2023-11-15  7:18   ` Binbin Wu
2023-09-14  6:33 ` [PATCH v6 13/25] KVM: x86: Initialize kvm_caps.supported_xss Yang Weijiang
2023-10-31 17:51   ` Maxim Levitsky
2023-09-14  6:33 ` [PATCH v6 14/25] KVM: x86: Load guest FPU state when access XSAVE-managed MSRs Yang Weijiang
2023-10-31 17:51   ` Maxim Levitsky
2023-11-01 18:05     ` Sean Christopherson
2023-11-02 18:31       ` Maxim Levitsky
2023-11-03  8:46       ` Yang, Weijiang
2023-11-03 14:02         ` Sean Christopherson
2023-09-14  6:33 ` [PATCH v6 15/25] KVM: x86: Add fault checks for guest CR4.CET setting Yang Weijiang
2023-10-31 17:51   ` Maxim Levitsky
2023-09-14  6:33 ` [PATCH v6 16/25] KVM: x86: Report KVM supported CET MSRs as to-be-saved Yang Weijiang
2023-10-08  6:19   ` Chao Gao
2023-10-10  0:54     ` Yang, Weijiang
2023-10-31 17:52   ` Maxim Levitsky
2023-09-14  6:33 ` [PATCH v6 17/25] KVM: VMX: Introduce CET VMCS fields and control bits Yang Weijiang
2023-10-31 17:52   ` Maxim Levitsky
2023-09-14  6:33 ` [PATCH v6 18/25] KVM: x86: Use KVM-governed feature framework to track "SHSTK/IBT enabled" Yang Weijiang
2023-10-31 17:54   ` Maxim Levitsky
2023-11-01 15:46     ` Sean Christopherson
2023-11-02 18:35       ` Maxim Levitsky
2023-11-04  0:07         ` Sean Christopherson
2023-11-07 18:05           ` Maxim Levitsky
2023-09-14  6:33 ` [PATCH v6 19/25] KVM: VMX: Emulate read and write to CET MSRs Yang Weijiang
2023-10-31 17:55   ` Maxim Levitsky
2023-11-01 16:31     ` Sean Christopherson
2023-11-02 18:38       ` Maxim Levitsky
2023-11-02 23:58         ` Sean Christopherson
2023-11-07 18:12           ` Maxim Levitsky
2023-11-07 18:39             ` Sean Christopherson
2023-11-03  8:18       ` Yang, Weijiang
2023-11-03 22:26         ` Sean Christopherson
2023-09-14  6:33 ` [PATCH v6 20/25] KVM: x86: Save and reload SSP to/from SMRAM Yang Weijiang
2023-10-31 17:55   ` Maxim Levitsky
2023-09-14  6:33 ` [PATCH v6 21/25] KVM: VMX: Set up interception for CET MSRs Yang Weijiang
2023-10-31 17:56   ` Maxim Levitsky
2023-09-14  6:33 ` [PATCH v6 22/25] KVM: VMX: Set host constant supervisor states to VMCS fields Yang Weijiang
2023-10-31 17:56   ` Maxim Levitsky
2023-09-14  6:33 ` [PATCH v6 23/25] KVM: x86: Enable CET virtualization for VMX and advertise to userspace Yang Weijiang
2023-09-24 13:38   ` kernel test robot
2023-09-25  0:26     ` Yang, Weijiang
2023-10-31 17:56   ` Maxim Levitsky
2023-11-01 22:14     ` Sean Christopherson
2023-09-14  6:33 ` [PATCH v6 24/25] KVM: nVMX: Introduce new VMX_BASIC bit for event error_code delivery to L1 Yang Weijiang
2023-10-31 17:57   ` Maxim Levitsky
2023-11-01  4:21   ` Chao Gao
2023-11-15  8:31     ` Yang, Weijiang
2023-11-15 13:27       ` Sean Christopherson
2023-09-14  6:33 ` [PATCH v6 25/25] KVM: nVMX: Enable CET support for nested guest Yang Weijiang
2023-10-31 17:57   ` Maxim Levitsky
2023-11-01  2:09   ` Chao Gao
2023-11-01  9:22     ` Yang, Weijiang
2023-11-01  9:54     ` Maxim Levitsky
2023-11-15  8:56       ` Yang, Weijiang
2023-11-15  8:23     ` Yang, Weijiang
2023-09-25  0:31 ` [PATCH v6 00/25] Enable CET Virtualization Yang, Weijiang