* [PATCH 00/19] AMX Support in KVM
@ 2021-12-08  0:03 Yang Zhong
  2021-12-08  0:03 ` [PATCH 01/19] x86/fpu: Extend prctl() with guest permissions Yang Zhong
                   ` (19 more replies)
  0 siblings, 20 replies; 80+ messages in thread
From: Yang Zhong @ 2021-12-08  0:03 UTC (permalink / raw)
  To: x86, kvm, linux-kernel, tglx, mingo, bp, dave.hansen, pbonzini
  Cc: seanjc, jun.nakajima, kevin.tian, jing2.liu, jing2.liu, yang.zhong

(sent on behalf of Jing, who is currently on leave)

This series brings AMX (Advanced Matrix eXtensions) virtualization
support to KVM. The three preparation patches for the fpu core from
Thomas [1] are also included.

A large portion of the changes in this series deals with eXtended
Feature Disable (XFD), which allows resizing the fpstate buffer to
support dynamically-enabled XSTATE features with large state components
(e.g. 8K for AMX).

The support is based on several key changes (design discussions can be
found in [2]):

  - Guest permissions for dynamically-enabled XSAVE features

    Native tasks have to request permission via prctl() before touching
    a dynamically-resized XSTATE component. Introduce guest permissions
    for the same purpose. The userspace VMM is expected to request guest
    permission only once, when the first vCPU is created.

    KVM checks guest permission in KVM_SET_CPUID2. Setting XFD in guest
    cpuid w/o proper permission fails this operation.
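
    As an illustration, a minimal userspace sketch (the prctl option
    value and the xfeature index are taken from patch 1; the wrapper
    name and error handling are our own):

	#include <sys/syscall.h>
	#include <unistd.h>

	#define ARCH_REQ_XCOMP_GUEST_PERM	0x1025
	#define XFEATURE_XTILE_DATA		18	/* AMX tile data */

	/* Called once by the VMM before creating the first vCPU */
	static int request_amx_guest_perm(void)
	{
		/* arg2 is the xfeature index, as with ARCH_REQ_XCOMP_PERM */
		return syscall(SYS_arch_prctl, ARCH_REQ_XCOMP_GUEST_PERM,
			       XFEATURE_XTILE_DATA);
	}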

  - Extend fpstate reallocation mechanism to cover guest fpu

    Unlike native tasks, which have reallocation triggered from the #NM
    handler, guest fpstate reallocation is requested by KVM when it
    detects the guest's intention to use dynamically-enabled XSAVE
    features.

    The reallocation request is handled when exiting to the userspace
    VMM. This implies that KVM must break the vcpu_run() loop and exit
    to the userspace VMM, instead of immediately resuming the guest,
    when reallocation is required.

  - Detect fpstate reallocation in the emulation code

    Because guest #NM is not trapped by KVM (doing so would be costly),
    the guest's intention to use a dynamically-enabled XSAVE feature[i]
    is instead inferred from guest XCR0[i]=1 and XFD[i]=0. This requires
    the emulation logic of both WRMSR(IA32_XFD) and XSETBV to check the
    reallocation requirement whenever either of the two conditions
    changes.
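
    In pseudo-C the trigger condition boils down to the following
    (simplified from the helper added in patch 9; this is a fragment,
    not a complete function):

	/* dynamic features enabled in XCR0 but still cleared in XFD */
	u64 dynamic = vcpu->arch.xcr0 & XFEATURE_MASK_USER_DYNAMIC;
	u64 request = dynamic & ~xfd;	/* XCR0[i]=1 && XFD[i]=0 */

	/* exit to userspace if the fpstate doesn't cover them yet */
	if ((guest_fpu->user_xfeatures & request) != request)
		reallocation_is_required = true;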

  - Disable WRMSR interception for IA32_XFD

    IA32_XFD can be updated frequently by the guest, as it is part of
    the task state and is swapped on context switch when the previous
    and next tasks have different XFD settings. Always intercepting
    WRMSR can easily cause non-negligible overhead.

    Disable WRMSR interception for IA32_XFD after fpstate reallocation
    succeeds. After that point the guest writes IA32_XFD directly,
    without causing VM-exits.

    However, MSR passthrough implies that guest_fpstate::xfd and the
    per-cpu xfd cache might be out of sync with the current IA32_XFD
    value set by the guest. KVM therefore needs to re-sync the software
    state with IA32_XFD before the vCPU thread might be preempted or
    interrupted.
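
    Conceptually the re-sync is (a sketch; xfd_sync_state() is added in
    patch 12, while the vcpu flag below is made up for illustration):

	/* After VM-exit, before preemption/interrupts are possible */
	if (vcpu->arch.xfd_passthrough)	/* hypothetical flag */
		xfd_sync_state();	/* RDMSR -> fpstate::xfd + per-cpu cache */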

  - Save/restore guest XFD_ERR

    When XFD causes an instruction to generate #NM, XFD_ERR contains
    information about which disabled state components are being accessed.
    The #NM handler is expected to check this information and then enable
    the state components by clearing IA32_XFD for the faulting task (if
    it has permission).

    #NM can be triggered in both host and guest. It'd be problematic if
    the XFD_ERR value generated in the guest is consumed or clobbered by
    the host before the guest itself does so. This may lead to a
    non-XFD-related #NM being treated as an XFD #NM in the host (due to
    the guest XFD_ERR value), or an XFD-related #NM being treated as a
    non-XFD #NM in the guest (XFD_ERR cleared by the host #NM handler).

    KVM needs to save the guest XFD_ERR value before this register
    might be accessed by the host and restore it before entering the 
    guest.

    One question remains open in this area: when to start saving and
    restoring guest XFD_ERR. Several options are discussed in patch 15.
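
    In rough pseudo-C the protected window looks like below (the vcpu
    field name is illustrative; where exactly to place these accesses
    is the open question discussed in patch 15):

	/* VM-exit path, before a host #NM handler could clobber it */
	rdmsrl(MSR_IA32_XFD_ERR, vcpu->arch.guest_xfd_err);

	/* VM-entry path */
	if (vcpu->arch.guest_xfd_err)
		wrmsrl(MSR_IA32_XFD_ERR, vcpu->arch.guest_xfd_err);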

  - Expose related cpuid bits to guest

    The last step is to allow exposing XFD, AMX_TILE, AMX_INT8 and
    AMX_BF16 in guest cpuid. Adding those bits into kvm_cpu_caps finally
    activates all the logic introduced earlier in this series.

To verify the AMX virtualization overhead on non-AMX usages, we ran the
Phoronix kernel build test in the guest w/ and w/o AMX in cpuid. The
result shows no observable difference between the two configurations.

Live migration support is still being worked on. The userspace VMM needs
to use the new KVM_{G|S}ET_XSAVE2 ioctls in this series to migrate state
for dynamically-enabled XSAVE features.
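
A hypothetical sketch of the save side (the actual struct layout and
buffer sizing are defined by the KVM_{G|S}ET_XSAVE2 patch in this
series; vcpu_fd and xsave_buf here are assumptions):

	/* Illustrative only: fetch the enlarged (>4K with AMX) XSAVE image */
	if (ioctl(vcpu_fd, KVM_GET_XSAVE2, xsave_buf) < 0)
		err(1, "KVM_GET_XSAVE2");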

Thanks to Thomas for the thoughts and patches on the KVM FPU and AMX
support, and to Jun Nakajima for the design suggestions.

[1] git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git x86/fpu-kvm
[2] https://www.spinics.net/lists/kvm/msg259015.html

Thanks,
Yang

---
Jing Liu (13):
  kvm: x86: Fix xstate_required_size() to follow XSTATE alignment rule
  kvm: x86: Check guest xstate permissions when KVM_SET_CPUID2
  x86/fpu: Move xfd initialization out of __fpstate_reset() to the
    callers
  kvm: x86: Propagate fpstate reallocation error to userspace
  x86/fpu: Move xfd_update_state() to xstate.c and export symbol
  kvm: x86: Prepare reallocation check
  kvm: x86: Emulate WRMSR of guest IA32_XFD
  kvm: x86: Disable WRMSR interception for IA32_XFD on demand
  x86/fpu: Prepare for KVM XFD_ERR handling
  kvm: x86: Introduce KVM_{G|S}ET_XSAVE2 ioctl
  docs: virt: api.rst: Document the new KVM_{G, S}ET_XSAVE2 ioctls
  kvm: x86: AMX XCR0 support for guest
  kvm: x86: Add AMX CPUIDs support

Thomas Gleixner (4):
  x86/fpu: Extend prctl() with guest permissions
  x86/fpu: Prepare KVM for dynamically enabled states
  x86/fpu: Add reallocation mechanism for KVM
  x86/fpu: Prepare KVM for bringing XFD state back in-sync

Yang Zhong (2):
  kvm: x86: Check fpstate reallocation in XSETBV emulation
  kvm: x86: Save and restore guest XFD_ERR properly

 Documentation/virt/kvm/api.rst     |  47 +++++++
 arch/x86/include/asm/cpufeatures.h |   2 +
 arch/x86/include/asm/fpu/api.h     |  12 ++
 arch/x86/include/asm/fpu/types.h   |  56 +++++++++
 arch/x86/include/asm/fpu/xstate.h  |   2 +
 arch/x86/include/asm/kvm-x86-ops.h |   1 +
 arch/x86/include/asm/kvm_host.h    |   2 +
 arch/x86/include/uapi/asm/kvm.h    |   6 +
 arch/x86/include/uapi/asm/prctl.h  |  26 ++--
 arch/x86/kernel/fpu/core.c         | 109 ++++++++++++++++-
 arch/x86/kernel/fpu/xstate.c       | 119 +++++++++++++++---
 arch/x86/kernel/fpu/xstate.h       |  29 +++--
 arch/x86/kernel/process.c          |   2 +
 arch/x86/kvm/cpuid.c               |  36 +++++-
 arch/x86/kvm/vmx/vmx.c             |  20 +++
 arch/x86/kvm/vmx/vmx.h             |   2 +-
 arch/x86/kvm/x86.c                 | 189 ++++++++++++++++++++++++++++-
 arch/x86/kvm/x86.h                 |   2 +
 include/uapi/linux/kvm.h           |   8 +-
 19 files changed, 607 insertions(+), 63 deletions(-)



* [PATCH 01/19] x86/fpu: Extend prctl() with guest permissions
  2021-12-08  0:03 [PATCH 00/19] AMX Support in KVM Yang Zhong
@ 2021-12-08  0:03 ` Yang Zhong
  2021-12-14  0:16   ` Thomas Gleixner
  2021-12-08  0:03 ` [PATCH 02/19] x86/fpu: Prepare KVM for dynamically enabled states Yang Zhong
                   ` (18 subsequent siblings)
  19 siblings, 1 reply; 80+ messages in thread
From: Yang Zhong @ 2021-12-08  0:03 UTC (permalink / raw)
  To: x86, kvm, linux-kernel, tglx, mingo, bp, dave.hansen, pbonzini
  Cc: seanjc, jun.nakajima, kevin.tian, jing2.liu, jing2.liu, yang.zhong

From: Thomas Gleixner <tglx@linutronix.de>

Add guest permission control for dynamic XSTATE components, including
an extension of prctl() with two new options (ARCH_GET_XCOMP_GUEST_PERM
and ARCH_REQ_XCOMP_GUEST_PERM) and a new member in struct fpu
(guest_perm).

The userspace VMM has to request guest permissions before it exposes
any XSAVE feature using dynamic XSTATE components. The permission can
be set only once, when the first vCPU is created. A new flag,
FPU_GUEST_PERM_LOCKED, is introduced to lock the change for this
purpose.

Similar to native permissions, this doesn't actually enable the
permitted feature. KVM is expected to install a larger kernel buffer
and enable the feature when it detects the intention from the guest.
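
For example, from the VMM side the intended usage is (a sketch;
arch_prctl() stands for the raw syscall and error handling is omitted):

	u64 permitted;

	/* Request AMX tile data for guests, then read back the permission */
	arch_prctl(ARCH_REQ_XCOMP_GUEST_PERM, XFEATURE_XTILE_DATA);
	arch_prctl(ARCH_GET_XCOMP_GUEST_PERM, (unsigned long)&permitted);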

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Jing Liu <jing2.liu@intel.com>
Signed-off-by: Yang Zhong <yang.zhong@intel.com>
---
(To Thomas) We moved the declaration of xstate_get_guest_group_perm()
from xstate.h to api.h since it will be called by KVM.

 arch/x86/include/asm/fpu/api.h    |  2 ++
 arch/x86/include/asm/fpu/types.h  |  9 ++++++
 arch/x86/include/uapi/asm/prctl.h | 26 ++++++++--------
 arch/x86/kernel/fpu/core.c        |  3 ++
 arch/x86/kernel/fpu/xstate.c      | 50 +++++++++++++++++++++++--------
 arch/x86/kernel/fpu/xstate.h      | 13 ++++++--
 arch/x86/kernel/process.c         |  2 ++
 7 files changed, 78 insertions(+), 27 deletions(-)

diff --git a/arch/x86/include/asm/fpu/api.h b/arch/x86/include/asm/fpu/api.h
index 6053674f9132..7532f73c82a6 100644
--- a/arch/x86/include/asm/fpu/api.h
+++ b/arch/x86/include/asm/fpu/api.h
@@ -138,6 +138,8 @@ static inline void fpstate_free(struct fpu *fpu) { }
 /* fpstate-related functions which are exported to KVM */
 extern void fpstate_clear_xstate_component(struct fpstate *fps, unsigned int xfeature);
 
+extern inline u64 xstate_get_guest_group_perm(void);
+
 /* KVM specific functions */
 extern bool fpu_alloc_guest_fpstate(struct fpu_guest *gfpu);
 extern void fpu_free_guest_fpstate(struct fpu_guest *gfpu);
diff --git a/arch/x86/include/asm/fpu/types.h b/arch/x86/include/asm/fpu/types.h
index 3c06c82ab355..6ddf80637697 100644
--- a/arch/x86/include/asm/fpu/types.h
+++ b/arch/x86/include/asm/fpu/types.h
@@ -387,6 +387,8 @@ struct fpstate {
 	/* @regs is dynamically sized! Don't add anything after @regs! */
 } __aligned(64);
 
+#define FPU_GUEST_PERM_LOCKED		BIT_ULL(63)
+
 struct fpu_state_perm {
 	/*
 	 * @__state_perm:
@@ -476,6 +478,13 @@ struct fpu {
 	 */
 	struct fpu_state_perm		perm;
 
+	/*
+	 * @guest_perm:
+	 *
+	 * Permission related information for guest pseudo FPUs
+	 */
+	struct fpu_state_perm		guest_perm;
+
 	/*
 	 * @__fpstate:
 	 *
diff --git a/arch/x86/include/uapi/asm/prctl.h b/arch/x86/include/uapi/asm/prctl.h
index 754a07856817..500b96e71f18 100644
--- a/arch/x86/include/uapi/asm/prctl.h
+++ b/arch/x86/include/uapi/asm/prctl.h
@@ -2,20 +2,22 @@
 #ifndef _ASM_X86_PRCTL_H
 #define _ASM_X86_PRCTL_H
 
-#define ARCH_SET_GS		0x1001
-#define ARCH_SET_FS		0x1002
-#define ARCH_GET_FS		0x1003
-#define ARCH_GET_GS		0x1004
+#define ARCH_SET_GS			0x1001
+#define ARCH_SET_FS			0x1002
+#define ARCH_GET_FS			0x1003
+#define ARCH_GET_GS			0x1004
 
-#define ARCH_GET_CPUID		0x1011
-#define ARCH_SET_CPUID		0x1012
+#define ARCH_GET_CPUID			0x1011
+#define ARCH_SET_CPUID			0x1012
 
-#define ARCH_GET_XCOMP_SUPP	0x1021
-#define ARCH_GET_XCOMP_PERM	0x1022
-#define ARCH_REQ_XCOMP_PERM	0x1023
+#define ARCH_GET_XCOMP_SUPP		0x1021
+#define ARCH_GET_XCOMP_PERM		0x1022
+#define ARCH_REQ_XCOMP_PERM		0x1023
+#define ARCH_GET_XCOMP_GUEST_PERM	0x1024
+#define ARCH_REQ_XCOMP_GUEST_PERM	0x1025
 
-#define ARCH_MAP_VDSO_X32	0x2001
-#define ARCH_MAP_VDSO_32	0x2002
-#define ARCH_MAP_VDSO_64	0x2003
+#define ARCH_MAP_VDSO_X32		0x2001
+#define ARCH_MAP_VDSO_32		0x2002
+#define ARCH_MAP_VDSO_64		0x2003
 
 #endif /* _ASM_X86_PRCTL_H */
diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
index 8ea306b1bf8e..ab19b3d8b2f7 100644
--- a/arch/x86/kernel/fpu/core.c
+++ b/arch/x86/kernel/fpu/core.c
@@ -450,6 +450,8 @@ void fpstate_reset(struct fpu *fpu)
 	fpu->perm.__state_perm		= fpu_kernel_cfg.default_features;
 	fpu->perm.__state_size		= fpu_kernel_cfg.default_size;
 	fpu->perm.__user_state_size	= fpu_user_cfg.default_size;
+	/* Same defaults for guests */
+	fpu->guest_perm = fpu->perm;
 }
 
 static inline void fpu_inherit_perms(struct fpu *dst_fpu)
@@ -460,6 +462,7 @@ static inline void fpu_inherit_perms(struct fpu *dst_fpu)
 		spin_lock_irq(&current->sighand->siglock);
 		/* Fork also inherits the permissions of the parent */
 		dst_fpu->perm = src_fpu->perm;
+		dst_fpu->guest_perm = src_fpu->guest_perm;
 		spin_unlock_irq(&current->sighand->siglock);
 	}
 }
diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index d28829403ed0..9856d579aa6e 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -1595,7 +1595,7 @@ static int validate_sigaltstack(unsigned int usize)
 	return 0;
 }
 
-static int __xstate_request_perm(u64 permitted, u64 requested)
+static int __xstate_request_perm(u64 permitted, u64 requested, bool guest)
 {
 	/*
 	 * This deliberately does not exclude !XSAVES as we still might
@@ -1605,6 +1605,7 @@ static int __xstate_request_perm(u64 permitted, u64 requested)
 	 */
 	bool compacted = cpu_feature_enabled(X86_FEATURE_XSAVES);
 	struct fpu *fpu = &current->group_leader->thread.fpu;
+	struct fpu_state_perm *perm;
 	unsigned int ksize, usize;
 	u64 mask;
 	int ret;
@@ -1621,15 +1622,18 @@ static int __xstate_request_perm(u64 permitted, u64 requested)
 	mask &= XFEATURE_MASK_USER_SUPPORTED;
 	usize = xstate_calculate_size(mask, false);
 
-	ret = validate_sigaltstack(usize);
-	if (ret)
-		return ret;
+	if (!guest) {
+		ret = validate_sigaltstack(usize);
+		if (ret)
+			return ret;
+	}
 
+	perm = guest ? &fpu->guest_perm : &fpu->perm;
 	/* Pairs with the READ_ONCE() in xstate_get_group_perm() */
-	WRITE_ONCE(fpu->perm.__state_perm, requested);
+	WRITE_ONCE(perm->__state_perm, requested);
 	/* Protected by sighand lock */
-	fpu->perm.__state_size = ksize;
-	fpu->perm.__user_state_size = usize;
+	perm->__state_size = ksize;
+	perm->__user_state_size = usize;
 	return ret;
 }
 
@@ -1640,7 +1644,7 @@ static const u64 xstate_prctl_req[XFEATURE_MAX] = {
 	[XFEATURE_XTILE_DATA] = XFEATURE_MASK_XTILE_DATA,
 };
 
-static int xstate_request_perm(unsigned long idx)
+static int xstate_request_perm(unsigned long idx, bool guest)
 {
 	u64 permitted, requested;
 	int ret;
@@ -1661,14 +1665,19 @@ static int xstate_request_perm(unsigned long idx)
 		return -EOPNOTSUPP;
 
 	/* Lockless quick check */
-	permitted = xstate_get_host_group_perm();
+	permitted = xstate_get_group_perm(guest);
 	if ((permitted & requested) == requested)
 		return 0;
 
 	/* Protect against concurrent modifications */
 	spin_lock_irq(&current->sighand->siglock);
-	permitted = xstate_get_host_group_perm();
-	ret = __xstate_request_perm(permitted, requested);
+	permitted = xstate_get_group_perm(guest);
+
+	/* First vCPU allocation locks the permissions. */
+	if (guest && (permitted & FPU_GUEST_PERM_LOCKED))
+		ret = -EBUSY;
+	else
+		ret = __xstate_request_perm(permitted, requested, guest);
 	spin_unlock_irq(&current->sighand->siglock);
 	return ret;
 }
@@ -1713,12 +1722,17 @@ int xfd_enable_feature(u64 xfd_err)
 	return 0;
 }
 #else /* CONFIG_X86_64 */
-static inline int xstate_request_perm(unsigned long idx)
+static inline int xstate_request_perm(unsigned long idx, bool guest)
 {
 	return -EPERM;
 }
 #endif  /* !CONFIG_X86_64 */
 
+inline u64 xstate_get_guest_group_perm(void)
+{
+	return xstate_get_group_perm(true);
+}
+EXPORT_SYMBOL_GPL(xstate_get_guest_group_perm);
 /**
  * fpu_xstate_prctl - xstate permission operations
  * @tsk:	Redundant pointer to current
@@ -1742,6 +1756,7 @@ long fpu_xstate_prctl(struct task_struct *tsk, int option, unsigned long arg2)
 	u64 __user *uptr = (u64 __user *)arg2;
 	u64 permitted, supported;
 	unsigned long idx = arg2;
+	bool guest = false;
 
 	if (tsk != current)
 		return -EPERM;
@@ -1760,11 +1775,20 @@ long fpu_xstate_prctl(struct task_struct *tsk, int option, unsigned long arg2)
 		permitted &= XFEATURE_MASK_USER_SUPPORTED;
 		return put_user(permitted, uptr);
 
+	case ARCH_GET_XCOMP_GUEST_PERM:
+		permitted = xstate_get_guest_group_perm();
+		permitted &= XFEATURE_MASK_USER_SUPPORTED;
+		return put_user(permitted, uptr);
+
+	case ARCH_REQ_XCOMP_GUEST_PERM:
+		guest = true;
+		fallthrough;
+
 	case ARCH_REQ_XCOMP_PERM:
 		if (!IS_ENABLED(CONFIG_X86_64))
 			return -EOPNOTSUPP;
 
-		return xstate_request_perm(idx);
+		return xstate_request_perm(idx, guest);
 
 	default:
 		return -EINVAL;
diff --git a/arch/x86/kernel/fpu/xstate.h b/arch/x86/kernel/fpu/xstate.h
index 86ea7c0fa2f6..98a472775c97 100644
--- a/arch/x86/kernel/fpu/xstate.h
+++ b/arch/x86/kernel/fpu/xstate.h
@@ -20,10 +20,19 @@ static inline void xstate_init_xcomp_bv(struct xregs_state *xsave, u64 mask)
 		xsave->header.xcomp_bv = mask | XCOMP_BV_COMPACTED_FORMAT;
 }
 
-static inline u64 xstate_get_host_group_perm(void)
+static inline u64 xstate_get_group_perm(bool guest)
 {
+	struct fpu *fpu = &current->group_leader->thread.fpu;
+	struct fpu_state_perm *perm;
+
 	/* Pairs with WRITE_ONCE() in xstate_request_perm() */
-	return READ_ONCE(current->group_leader->thread.fpu.perm.__state_perm);
+	perm = guest ? &fpu->guest_perm : &fpu->perm;
+	return READ_ONCE(perm->__state_perm);
+}
+
+static inline u64 xstate_get_host_group_perm(void)
+{
+	return xstate_get_group_perm(false);
 }
 
 enum xstate_copy_mode {
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 04143a653a8a..d7bc23589062 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -993,6 +993,8 @@ long do_arch_prctl_common(struct task_struct *task, int option,
 	case ARCH_GET_XCOMP_SUPP:
 	case ARCH_GET_XCOMP_PERM:
 	case ARCH_REQ_XCOMP_PERM:
+	case ARCH_GET_XCOMP_GUEST_PERM:
+	case ARCH_REQ_XCOMP_GUEST_PERM:
 		return fpu_xstate_prctl(task, option, arg2);
 	}
 


* [PATCH 02/19] x86/fpu: Prepare KVM for dynamically enabled states
  2021-12-08  0:03 [PATCH 00/19] AMX Support in KVM Yang Zhong
  2021-12-08  0:03 ` [PATCH 01/19] x86/fpu: Extend prctl() with guest permissions Yang Zhong
@ 2021-12-08  0:03 ` Yang Zhong
  2021-12-13  9:12   ` Paolo Bonzini
  2021-12-08  0:03 ` [PATCH 03/19] kvm: x86: Fix xstate_required_size() to follow XSTATE alignment rule Yang Zhong
                   ` (17 subsequent siblings)
  19 siblings, 1 reply; 80+ messages in thread
From: Yang Zhong @ 2021-12-08  0:03 UTC (permalink / raw)
  To: x86, kvm, linux-kernel, tglx, mingo, bp, dave.hansen, pbonzini
  Cc: seanjc, jun.nakajima, kevin.tian, jing2.liu, jing2.liu, yang.zhong

From: Thomas Gleixner <tglx@linutronix.de>

Add more fields for tracking per-vCPU permissions for dynamic XSTATE
components:

  - user_xfeatures

    Track which features are currently enabled for the vCPU.

  - user_perm

    Copied from guest_perm of the group leader thread. The first
    vCPU which does the copy locks the guest_perm.

  - realloc_request

    KVM sets this field to request dynamically-enabled features
    which require reallocation of @fpstate.

Initialize those fields properly.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Jing Liu <jing2.liu@intel.com>
Signed-off-by: Yang Zhong <yang.zhong@intel.com>
---
 arch/x86/include/asm/fpu/types.h | 23 +++++++++++++++++++++++
 arch/x86/kernel/fpu/core.c       | 26 +++++++++++++++++++++++++-
 2 files changed, 48 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/fpu/types.h b/arch/x86/include/asm/fpu/types.h
index 6ddf80637697..861cffca3209 100644
--- a/arch/x86/include/asm/fpu/types.h
+++ b/arch/x86/include/asm/fpu/types.h
@@ -504,6 +504,29 @@ struct fpu {
  * Guest pseudo FPU container
  */
 struct fpu_guest {
+	/*
+	 * @user_xfeatures:		xfeature bitmap of features which are
+	 *				currently enabled for the guest vCPU.
+	 */
+	u64				user_xfeatures;
+
+	/*
+	 * @user_perm:			xfeature bitmap of features which are
+	 *				permitted to be enabled for the guest
+	 *				vCPU.
+	 */
+	u64				user_perm;
+
+	/*
+	 * @realloc_request:		xfeature bitmap of features which are
+	 *				requested to be enabled dynamically
+	 *				which requires reallocation of @fpstate
+	 *
+	 *				Set by an intercept handler and
+	 *				evaluated in fpu_swap_kvm_fpstate()
+	 */
+	u64				realloc_request;
+
 	/*
 	 * @fpstate:			Pointer to the allocated guest fpstate
 	 */
diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
index ab19b3d8b2f7..fe592799508c 100644
--- a/arch/x86/kernel/fpu/core.c
+++ b/arch/x86/kernel/fpu/core.c
@@ -201,6 +201,26 @@ void fpu_reset_from_exception_fixup(void)
 #if IS_ENABLED(CONFIG_KVM)
 static void __fpstate_reset(struct fpstate *fpstate);
 
+static void fpu_init_guest_permissions(struct fpu_guest *gfpu)
+{
+	struct fpu_state_perm *fpuperm;
+	u64 perm;
+
+	if (!IS_ENABLED(CONFIG_X86_64))
+		return;
+
+	spin_lock_irq(&current->sighand->siglock);
+	fpuperm = &current->group_leader->thread.fpu.guest_perm;
+	perm = fpuperm->__state_perm;
+
+	/* First fpstate allocation locks down permissions. */
+	WRITE_ONCE(fpuperm->__state_perm, perm | FPU_GUEST_PERM_LOCKED);
+
+	spin_unlock_irq(&current->sighand->siglock);
+
+	gfpu->user_perm = perm & ~FPU_GUEST_PERM_LOCKED;
+}
+
 bool fpu_alloc_guest_fpstate(struct fpu_guest *gfpu)
 {
 	struct fpstate *fpstate;
@@ -216,7 +236,11 @@ bool fpu_alloc_guest_fpstate(struct fpu_guest *gfpu)
 	fpstate->is_valloc	= true;
 	fpstate->is_guest	= true;
 
-	gfpu->fpstate = fpstate;
+	gfpu->fpstate		= fpstate;
+	gfpu->user_xfeatures	= fpu_user_cfg.default_features;
+	gfpu->user_perm		= fpu_user_cfg.default_features;
+	fpu_init_guest_permissions(gfpu);
+
 	return true;
 }
 EXPORT_SYMBOL_GPL(fpu_alloc_guest_fpstate);


* [PATCH 03/19] kvm: x86: Fix xstate_required_size() to follow XSTATE alignment rule
  2021-12-08  0:03 [PATCH 00/19] AMX Support in KVM Yang Zhong
  2021-12-08  0:03 ` [PATCH 01/19] x86/fpu: Extend prctl() with guest permissions Yang Zhong
  2021-12-08  0:03 ` [PATCH 02/19] x86/fpu: Prepare KVM for dynamically enabled states Yang Zhong
@ 2021-12-08  0:03 ` Yang Zhong
  2021-12-08  0:03 ` [PATCH 04/19] kvm: x86: Check guest xstate permissions when KVM_SET_CPUID2 Yang Zhong
                   ` (16 subsequent siblings)
  19 siblings, 0 replies; 80+ messages in thread
From: Yang Zhong @ 2021-12-08  0:03 UTC (permalink / raw)
  To: x86, kvm, linux-kernel, tglx, mingo, bp, dave.hansen, pbonzini
  Cc: seanjc, jun.nakajima, kevin.tian, jing2.liu, jing2.liu, yang.zhong

From: Jing Liu <jing2.liu@intel.com>

CPUID.0xD.1.EBX enumerates the size of the XSAVE area (in compacted
format) required by XSAVES. If CPUID.0xD.i.ECX[1] is set for a state
component (i), this state component should be located on the next
64-byte boundary following the preceding state component in the
compacted layout.

Fix xstate_required_size() to follow the alignment rule. AMX is the
first state component with 64-byte alignment, and thus the first to
catch this bug.
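
For example, if the running size 'ret' is 2440 bytes and ECX[1] is set
for the next component, that component must start at ALIGN(2440, 64) =
2496 rather than at 2440 (the numbers are illustrative, not AMX's
actual offsets).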

Signed-off-by: Jing Liu <jing2.liu@intel.com>
Signed-off-by: Yang Zhong <yang.zhong@intel.com>
---
 arch/x86/kvm/cpuid.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index 07e9215e911d..148003e26cbb 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -42,7 +42,8 @@ static u32 xstate_required_size(u64 xstate_bv, bool compacted)
 		if (xstate_bv & 0x1) {
 		        u32 eax, ebx, ecx, edx, offset;
 		        cpuid_count(0xD, feature_bit, &eax, &ebx, &ecx, &edx);
-			offset = compacted ? ret : ebx;
+			/* ECX[1]: 64B alignment in compacted form */
+			offset = compacted ? ((ecx & 0x2) ? ALIGN(ret, 64) : ret) : ebx;
 			ret = max(ret, offset + eax);
 		}
 


* [PATCH 04/19] kvm: x86: Check guest xstate permissions when KVM_SET_CPUID2
  2021-12-08  0:03 [PATCH 00/19] AMX Support in KVM Yang Zhong
                   ` (2 preceding siblings ...)
  2021-12-08  0:03 ` [PATCH 03/19] kvm: x86: Fix xstate_required_size() to follow XSTATE alignment rule Yang Zhong
@ 2021-12-08  0:03 ` Yang Zhong
  2021-12-08  0:03 ` [PATCH 05/19] x86/fpu: Move xfd initialization out of __fpstate_reset() to the callers Yang Zhong
                   ` (15 subsequent siblings)
  19 siblings, 0 replies; 80+ messages in thread
From: Yang Zhong @ 2021-12-08  0:03 UTC (permalink / raw)
  To: x86, kvm, linux-kernel, tglx, mingo, bp, dave.hansen, pbonzini
  Cc: seanjc, jun.nakajima, kevin.tian, jing2.liu, jing2.liu, yang.zhong

From: Jing Liu <jing2.liu@intel.com>

Guest xstate permissions should be set by the userspace VMM before vCPU
creation. This patch extends KVM to check the guest permissions in the
KVM_SET_CPUID2 ioctl, to avoid a permission failure at guest run-time
(e.g. when the reallocation path is triggered).

Signed-off-by: Jing Liu <jing2.liu@intel.com>
Signed-off-by: Yang Zhong <yang.zhong@intel.com>
---
 arch/x86/kvm/cpuid.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index 148003e26cbb..f3c61205bbf4 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -18,6 +18,7 @@
 #include <asm/processor.h>
 #include <asm/user.h>
 #include <asm/fpu/xstate.h>
+#include <asm/fpu/api.h>
 #include <asm/sgx.h>
 #include "cpuid.h"
 #include "lapic.h"
@@ -97,6 +98,17 @@ static int kvm_check_cpuid(struct kvm_cpuid_entry2 *entries, int nent)
 			return -EINVAL;
 	}
 
+	/*
+	 * Check guest permissions for XSTATE features which must
+	 * be enabled dynamically.
+	 */
+	best = cpuid_entry2_find(entries, nent, 7, 0);
+	if (best && cpuid_entry_has(best, X86_FEATURE_AMX_TILE)) {
+		if (!(xstate_get_guest_group_perm() &
+			XFEATURE_MASK_XTILE_DATA))
+			return -EINVAL;
+	}
+
 	return 0;
 }
 


* [PATCH 05/19] x86/fpu: Move xfd initialization out of __fpstate_reset() to the callers
  2021-12-08  0:03 [PATCH 00/19] AMX Support in KVM Yang Zhong
                   ` (3 preceding siblings ...)
  2021-12-08  0:03 ` [PATCH 04/19] kvm: x86: Check guest xstate permissions when KVM_SET_CPUID2 Yang Zhong
@ 2021-12-08  0:03 ` Yang Zhong
  2021-12-10 22:33   ` Thomas Gleixner
  2021-12-08  0:03 ` [PATCH 06/19] x86/fpu: Add reallocation mechanism for KVM Yang Zhong
                   ` (14 subsequent siblings)
  19 siblings, 1 reply; 80+ messages in thread
From: Yang Zhong @ 2021-12-08  0:03 UTC (permalink / raw)
  To: x86, kvm, linux-kernel, tglx, mingo, bp, dave.hansen, pbonzini
  Cc: seanjc, jun.nakajima, kevin.tian, jing2.liu, jing2.liu, yang.zhong

From: Jing Liu <jing2.liu@intel.com>

vCPU threads are different from native tasks regarding the initial
xfd value. While all native tasks follow a fixed value (init_fpstate::xfd)
defined by the fpu core, vCPU threads need to obey the reset value
(i.e. zero) defined by the spec, to meet the expectation of the guest.

Move xfd initialization out of __fpstate_reset() to the callers, so
each can choose a specific value.

Signed-off-by: Jing Liu <jing2.liu@intel.com>
Signed-off-by: Yang Zhong <yang.zhong@intel.com>
---
 arch/x86/kernel/fpu/core.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
index fe592799508c..fae44fa27cdb 100644
--- a/arch/x86/kernel/fpu/core.c
+++ b/arch/x86/kernel/fpu/core.c
@@ -231,6 +231,7 @@ bool fpu_alloc_guest_fpstate(struct fpu_guest *gfpu)
 	if (!fpstate)
 		return false;
 
+	/* Leave xfd to 0 (the reset value defined by spec) */
 	__fpstate_reset(fpstate);
 	fpstate_init_user(fpstate);
 	fpstate->is_valloc	= true;
@@ -461,7 +462,6 @@ static void __fpstate_reset(struct fpstate *fpstate)
 	fpstate->user_size	= fpu_user_cfg.default_size;
 	fpstate->xfeatures	= fpu_kernel_cfg.default_features;
 	fpstate->user_xfeatures	= fpu_user_cfg.default_features;
-	fpstate->xfd		= init_fpstate.xfd;
 }
 
 void fpstate_reset(struct fpu *fpu)
@@ -469,6 +469,7 @@ void fpstate_reset(struct fpu *fpu)
 	/* Set the fpstate pointer to the default fpstate */
 	fpu->fpstate = &fpu->__fpstate;
 	__fpstate_reset(fpu->fpstate);
+	fpu->fpstate->xfd		= init_fpstate.xfd;
 
 	/* Initialize the permission related info in fpu */
 	fpu->perm.__state_perm		= fpu_kernel_cfg.default_features;


* [PATCH 06/19] x86/fpu: Add reallocation mechanism for KVM
  2021-12-08  0:03 [PATCH 00/19] AMX Support in KVM Yang Zhong
                   ` (4 preceding siblings ...)
  2021-12-08  0:03 ` [PATCH 05/19] x86/fpu: Move xfd initialization out of __fpstate_reset() to the callers Yang Zhong
@ 2021-12-08  0:03 ` Yang Zhong
  2021-12-08  0:03 ` [PATCH 07/19] kvm: x86: Propagate fpstate reallocation error to userspace Yang Zhong
                   ` (13 subsequent siblings)
  19 siblings, 0 replies; 80+ messages in thread
From: Yang Zhong @ 2021-12-08  0:03 UTC (permalink / raw)
  To: x86, kvm, linux-kernel, tglx, mingo, bp, dave.hansen, pbonzini
  Cc: seanjc, jun.nakajima, kevin.tian, jing2.liu, jing2.liu, yang.zhong

From: Thomas Gleixner <tglx@linutronix.de>

Extend the fpstate reallocation mechanism to cover guest fpu. Unlike
native tasks, which have reallocation triggered from the #NM handler,
guest fpstate reallocation is requested by KVM when it detects the
guest's intention to use a dynamically-enabled XSAVE feature.

Since KVM currently swaps host/guest fpstate when exiting to the
userspace VMM (see fpu_swap_kvm_fpstate()), deal with fpstate
reallocation at this point as well.

The implication is that KVM must break the vcpu_run() loop and exit to
the userspace VMM, instead of immediately returning to the guest, when
the fpstate requires reallocation. In this case KVM sets
guest_fpu::realloc_request in the related VM-exit handlers to mark the
requested features.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Jing Liu <jing2.liu@intel.com>
Signed-off-by: Yang Zhong <yang.zhong@intel.com>
---
 arch/x86/kernel/fpu/core.c   | 26 +++++++++++++++++++---
 arch/x86/kernel/fpu/xstate.c | 43 ++++++++++++++++++++++++++++++------
 arch/x86/kernel/fpu/xstate.h |  2 ++
 3 files changed, 61 insertions(+), 10 deletions(-)

diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
index fae44fa27cdb..7a0436a0cb2c 100644
--- a/arch/x86/kernel/fpu/core.c
+++ b/arch/x86/kernel/fpu/core.c
@@ -261,11 +261,31 @@ void fpu_free_guest_fpstate(struct fpu_guest *gfpu)
 }
 EXPORT_SYMBOL_GPL(fpu_free_guest_fpstate);
 
+static int fpu_guest_realloc_fpstate(struct fpu_guest *guest_fpu,
+				     bool enter_guest)
+{
+	/*
+	 * Reallocation requests can only be handled when
+	 * switching from guest to host mode.
+	 */
+	if (WARN_ON_ONCE(enter_guest || !IS_ENABLED(CONFIG_X86_64))) {
+		guest_fpu->realloc_request = 0;
+		return -EUNATCH;
+	}
+	return xfd_enable_guest_features(guest_fpu);
+}
+
 int fpu_swap_kvm_fpstate(struct fpu_guest *guest_fpu, bool enter_guest)
 {
-	struct fpstate *guest_fps = guest_fpu->fpstate;
+	struct fpstate *guest_fps, *cur_fps;
 	struct fpu *fpu = &current->thread.fpu;
-	struct fpstate *cur_fps = fpu->fpstate;
+	int ret = 0;
+
+	if (unlikely(guest_fpu->realloc_request))
+		ret = fpu_guest_realloc_fpstate(guest_fpu, enter_guest);
+
+	guest_fps = guest_fpu->fpstate;
+	cur_fps = fpu->fpstate;
 
 	fpregs_lock();
 	if (!cur_fps->is_confidential && !test_thread_flag(TIF_NEED_FPU_LOAD))
@@ -298,7 +318,7 @@ int fpu_swap_kvm_fpstate(struct fpu_guest *guest_fpu, bool enter_guest)
 
 	fpregs_mark_activate();
 	fpregs_unlock();
-	return 0;
+	return ret;
 }
 EXPORT_SYMBOL_GPL(fpu_swap_kvm_fpstate);
 
diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index 9856d579aa6e..fe3d8ed3db0e 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -1529,6 +1529,7 @@ static struct fpstate *fpu_install_fpstate(struct fpu *fpu,
  *		of that task
  * @ksize:	The required size for the kernel buffer
  * @usize:	The required size for user space buffers
+ * @guest_fpu:	Pointer to a guest FPU container. NULL for host allocations
  *
  * Note vs. vmalloc(): If the task with a vzalloc()-allocated buffer
  * terminates quickly, vfree()-induced IPIs may be a concern, but tasks
@@ -1537,7 +1538,7 @@ static struct fpstate *fpu_install_fpstate(struct fpu *fpu,
  * Returns: 0 on success, -ENOMEM on allocation error.
  */
 static int fpstate_realloc(u64 xfeatures, unsigned int ksize,
-			   unsigned int usize)
+			   unsigned int usize, struct fpu_guest *guest_fpu)
 {
 	struct fpu *fpu = &current->thread.fpu;
 	struct fpstate *curfps, *newfps = NULL;
@@ -1553,6 +1554,12 @@ static int fpstate_realloc(u64 xfeatures, unsigned int ksize,
 	newfps->user_size = usize;
 	newfps->is_valloc = true;
 
+	if (guest_fpu) {
+		newfps->is_guest = true;
+		newfps->is_confidential = curfps->is_confidential;
+		guest_fpu->user_xfeatures |= xfeatures;
+	}
+
 	fpregs_lock();
 	/*
 	 * Ensure that the current state is in the registers before
@@ -1566,12 +1573,14 @@ static int fpstate_realloc(u64 xfeatures, unsigned int ksize,
 	newfps->user_xfeatures = curfps->user_xfeatures | xfeatures;
 	newfps->xfd = curfps->xfd & ~xfeatures;
 
+	if (guest_fpu)
+		guest_fpu->fpstate = newfps;
+
 	curfps = fpu_install_fpstate(fpu, newfps);
 
 	/* Do the final updates within the locked region */
 	xstate_init_xcomp_bv(&newfps->regs.xsave, newfps->xfeatures);
 	xfd_update_state(newfps);
-
 	fpregs_unlock();
 
 	vfree(curfps);
@@ -1682,9 +1691,10 @@ static int xstate_request_perm(unsigned long idx, bool guest)
 	return ret;
 }
 
-int xfd_enable_feature(u64 xfd_err)
+static int __xfd_enable_feature(u64 xfd_err, struct fpu_guest *guest_fpu)
 {
 	u64 xfd_event = xfd_err & XFEATURE_MASK_USER_DYNAMIC;
+	struct fpu_state_perm *perm;
 	unsigned int ksize, usize;
 	struct fpu *fpu;
 
@@ -1697,14 +1707,16 @@ int xfd_enable_feature(u64 xfd_err)
 	spin_lock_irq(&current->sighand->siglock);
 
 	/* If not permitted let it die */
-	if ((xstate_get_host_group_perm() & xfd_event) != xfd_event) {
+	if ((xstate_get_group_perm(!!guest_fpu) & xfd_event) != xfd_event) {
 		spin_unlock_irq(&current->sighand->siglock);
 		return -EPERM;
 	}
 
 	fpu = &current->group_leader->thread.fpu;
-	ksize = fpu->perm.__state_size;
-	usize = fpu->perm.__user_state_size;
+	perm = guest_fpu ? &fpu->guest_perm : &fpu->perm;
+	ksize = perm->__state_size;
+	usize = perm->__user_state_size;
+
 	/*
 	 * The feature is permitted. State size is sufficient.  Dropping
 	 * the lock is safe here even if more features are added from
@@ -1717,10 +1729,27 @@ int xfd_enable_feature(u64 xfd_err)
 	 * Try to allocate a new fpstate. If that fails there is no way
 	 * out.
 	 */
-	if (fpstate_realloc(xfd_event, ksize, usize))
+	if (fpstate_realloc(xfd_event, ksize, usize, guest_fpu))
 		return -EFAULT;
 	return 0;
 }
+
+int xfd_enable_feature(u64 xfd_err)
+{
+	return __xfd_enable_feature(xfd_err, NULL);
+}
+
+int xfd_enable_guest_features(struct fpu_guest *guest_fpu)
+{
+	u64 xfd_err = guest_fpu->realloc_request & XFEATURE_MASK_USER_SUPPORTED;
+
+	guest_fpu->realloc_request = 0;
+
+	if (!xfd_err)
+		return 0;
+	return __xfd_enable_feature(xfd_err, guest_fpu);
+}
+
 #else /* CONFIG_X86_64 */
 static inline int xstate_request_perm(unsigned long idx, bool guest)
 {
diff --git a/arch/x86/kernel/fpu/xstate.h b/arch/x86/kernel/fpu/xstate.h
index 98a472775c97..3254e2b5f17f 100644
--- a/arch/x86/kernel/fpu/xstate.h
+++ b/arch/x86/kernel/fpu/xstate.h
@@ -55,6 +55,8 @@ extern void fpu__init_system_xstate(unsigned int legacy_size);
 
 extern void *get_xsave_addr(struct xregs_state *xsave, int xfeature_nr);
 
+extern int xfd_enable_guest_features(struct fpu_guest *guest_fpu);
+
 static inline u64 xfeatures_mask_supervisor(void)
 {
 	return fpu_kernel_cfg.max_features & XFEATURE_MASK_SUPERVISOR_SUPPORTED;


* [PATCH 07/19] kvm: x86: Propagate fpstate reallocation error to userspace
  2021-12-08  0:03 [PATCH 00/19] AMX Support in KVM Yang Zhong
                   ` (5 preceding siblings ...)
  2021-12-08  0:03 ` [PATCH 06/19] x86/fpu: Add reallocation mechanism for KVM Yang Zhong
@ 2021-12-08  0:03 ` Yang Zhong
  2021-12-10 15:44   ` Paolo Bonzini
  2021-12-08  0:03 ` [PATCH 08/19] x86/fpu: Move xfd_update_state() to xstate.c and export symbol Yang Zhong
                   ` (12 subsequent siblings)
  19 siblings, 1 reply; 80+ messages in thread
From: Yang Zhong @ 2021-12-08  0:03 UTC (permalink / raw)
  To: x86, kvm, linux-kernel, tglx, mingo, bp, dave.hansen, pbonzini
  Cc: seanjc, jun.nakajima, kevin.tian, jing2.liu, jing2.liu, yang.zhong

From: Jing Liu <jing2.liu@intel.com>

fpstate reallocation is handled when the vCPU thread returns to
userspace. As reallocation could fail (e.g. for lack of memory),
this patch extends kvm_put_guest_fpu() to return an integer value
carrying the error code to the userspace VMM. The userspace VMM is
expected to handle any error caused by fpstate reallocation.

Signed-off-by: Jing Liu <jing2.liu@intel.com>
Signed-off-by: Yang Zhong <yang.zhong@intel.com>
---
 arch/x86/kvm/x86.c | 15 +++++++++++----
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 0ee1a039b490..05f2cda73d69 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -10171,17 +10171,21 @@ static void kvm_load_guest_fpu(struct kvm_vcpu *vcpu)
 }
 
 /* When vcpu_run ends, restore user space FPU context. */
-static void kvm_put_guest_fpu(struct kvm_vcpu *vcpu)
+static int kvm_put_guest_fpu(struct kvm_vcpu *vcpu)
 {
-	fpu_swap_kvm_fpstate(&vcpu->arch.guest_fpu, false);
+	int ret;
+
+	ret = fpu_swap_kvm_fpstate(&vcpu->arch.guest_fpu, false);
 	++vcpu->stat.fpu_reload;
 	trace_kvm_fpu(0);
+
+	return ret;
 }
 
 int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
 {
 	struct kvm_run *kvm_run = vcpu->run;
-	int r;
+	int r, ret;
 
 	vcpu_load(vcpu);
 	kvm_sigset_activate(vcpu);
@@ -10243,7 +10247,10 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu)
 		r = vcpu_run(vcpu);
 
 out:
-	kvm_put_guest_fpu(vcpu);
+	ret = kvm_put_guest_fpu(vcpu);
+	if ((r >= 0) && (ret < 0))
+		r = ret;
+
 	if (kvm_run->kvm_valid_regs)
 		store_regs(vcpu);
 	post_kvm_run_save(vcpu);


* [PATCH 08/19] x86/fpu: Move xfd_update_state() to xstate.c and export symbol
  2021-12-08  0:03 [PATCH 00/19] AMX Support in KVM Yang Zhong
                   ` (6 preceding siblings ...)
  2021-12-08  0:03 ` [PATCH 07/19] kvm: x86: Propagate fpstate reallocation error to userspace Yang Zhong
@ 2021-12-08  0:03 ` Yang Zhong
  2021-12-10 22:44   ` Thomas Gleixner
  2021-12-08  0:03 ` [PATCH 09/19] kvm: x86: Prepare reallocation check Yang Zhong
                   ` (11 subsequent siblings)
  19 siblings, 1 reply; 80+ messages in thread
From: Yang Zhong @ 2021-12-08  0:03 UTC (permalink / raw)
  To: x86, kvm, linux-kernel, tglx, mingo, bp, dave.hansen, pbonzini
  Cc: seanjc, jun.nakajima, kevin.tian, jing2.liu, jing2.liu, yang.zhong

From: Jing Liu <jing2.liu@intel.com>

xfd_update_state() is the interface to update IA32_XFD and its per-cpu
cache. All callers of this interface are currently in the fpu core. KVM
only indirectly triggers an IA32_XFD update via a helper function
(fpu_swap_kvm_fpstate()) when switching between user fpu and guest fpu.

Supporting AMX in the guest now requires KVM to directly update IA32_XFD
with the guest value (when emulating WRMSR) so XSAVE/XRSTOR can manage
the XSTATE components correctly inside the guest.

This patch moves xfd_update_state() from fpu/xstate.h to fpu/xstate.c
and exports the symbol for use outside of the fpu core.

Signed-off-by: Jing Liu <jing2.liu@intel.com>
Signed-off-by: Yang Zhong <yang.zhong@intel.com>
---
 arch/x86/include/asm/fpu/api.h |  2 ++
 arch/x86/kernel/fpu/xstate.c   | 12 ++++++++++++
 arch/x86/kernel/fpu/xstate.h   | 14 +-------------
 3 files changed, 15 insertions(+), 13 deletions(-)

diff --git a/arch/x86/include/asm/fpu/api.h b/arch/x86/include/asm/fpu/api.h
index 7532f73c82a6..999d89026be9 100644
--- a/arch/x86/include/asm/fpu/api.h
+++ b/arch/x86/include/asm/fpu/api.h
@@ -131,8 +131,10 @@ DECLARE_PER_CPU(struct fpu *, fpu_fpregs_owner_ctx);
 /* Process cleanup */
 #ifdef CONFIG_X86_64
 extern void fpstate_free(struct fpu *fpu);
+extern void xfd_update_state(struct fpstate *fpstate);
 #else
 static inline void fpstate_free(struct fpu *fpu) { }
+static inline void xfd_update_state(struct fpstate *fpstate) { }
 #endif
 
 /* fpstate-related functions which are exported to KVM */
diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index fe3d8ed3db0e..3c39789deeb9 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -1750,6 +1750,18 @@ int xfd_enable_guest_features(struct fpu_guest *guest_fpu)
 	return __xfd_enable_feature(xfd_err, guest_fpu);
 }
 
+void xfd_update_state(struct fpstate *fpstate)
+{
+	if (fpu_state_size_dynamic()) {
+		u64 xfd = fpstate->xfd;
+
+		if (__this_cpu_read(xfd_state) != xfd) {
+			wrmsrl(MSR_IA32_XFD, xfd);
+			__this_cpu_write(xfd_state, xfd);
+		}
+	}
+}
+EXPORT_SYMBOL_GPL(xfd_update_state);
 #else /* CONFIG_X86_64 */
 static inline int xstate_request_perm(unsigned long idx, bool guest)
 {
diff --git a/arch/x86/kernel/fpu/xstate.h b/arch/x86/kernel/fpu/xstate.h
index 3254e2b5f17f..651bd29977b9 100644
--- a/arch/x86/kernel/fpu/xstate.h
+++ b/arch/x86/kernel/fpu/xstate.h
@@ -149,19 +149,7 @@ static inline void xfd_validate_state(struct fpstate *fpstate, u64 mask, bool rs
 #endif
 
 #ifdef CONFIG_X86_64
-static inline void xfd_update_state(struct fpstate *fpstate)
-{
-	if (fpu_state_size_dynamic()) {
-		u64 xfd = fpstate->xfd;
-
-		if (__this_cpu_read(xfd_state) != xfd) {
-			wrmsrl(MSR_IA32_XFD, xfd);
-			__this_cpu_write(xfd_state, xfd);
-		}
-	}
-}
-#else
-static inline void xfd_update_state(struct fpstate *fpstate) { }
+extern void xfd_update_state(struct fpstate *fpstate);
 #endif
 
 /*


* [PATCH 09/19] kvm: x86: Prepare reallocation check
  2021-12-08  0:03 [PATCH 00/19] AMX Support in KVM Yang Zhong
                   ` (7 preceding siblings ...)
  2021-12-08  0:03 ` [PATCH 08/19] x86/fpu: Move xfd_update_state() to xstate.c and export symbol Yang Zhong
@ 2021-12-08  0:03 ` Yang Zhong
  2021-12-13  9:16   ` Paolo Bonzini
  2021-12-08  0:03 ` [PATCH 10/19] kvm: x86: Emulate WRMSR of guest IA32_XFD Yang Zhong
                   ` (10 subsequent siblings)
  19 siblings, 1 reply; 80+ messages in thread
From: Yang Zhong @ 2021-12-08  0:03 UTC (permalink / raw)
  To: x86, kvm, linux-kernel, tglx, mingo, bp, dave.hansen, pbonzini
  Cc: seanjc, jun.nakajima, kevin.tian, jing2.liu, jing2.liu, yang.zhong

From: Jing Liu <jing2.liu@intel.com>

On native systems, fpstate reallocation is triggered by #NM because
IA32_XFD is initialized to 1 for all native tasks.

However, #NM in the guest is not trapped by KVM. Instead, guest enabling
of a dynamic extended feature can be captured via emulation of
IA32_XFD and XSETBV. Basically, guest XCR0[i]=1 and XFD[i]=0 indicates
that feature[i] is activated by the guest.

This patch provides a helper function for this check, invoked when
either XCR0 or XFD is changed in the emulation path.

Signed-off-by: Jing Liu <jing2.liu@intel.com>
Signed-off-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Yang Zhong <yang.zhong@intel.com>
---
 arch/x86/kvm/x86.c | 24 ++++++++++++++++++++++++
 arch/x86/kvm/x86.h |  1 +
 2 files changed, 25 insertions(+)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 05f2cda73d69..91cc6f69a7ca 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -956,6 +956,30 @@ void kvm_load_host_xsave_state(struct kvm_vcpu *vcpu)
 }
 EXPORT_SYMBOL_GPL(kvm_load_host_xsave_state);
 
+bool kvm_check_guest_realloc_fpstate(struct kvm_vcpu *vcpu, u64 xfd)
+{
+	u64 xcr0 = vcpu->arch.xcr0 & XFEATURE_MASK_USER_DYNAMIC;
+
+	/* For any state which is enabled dynamically */
+	if ((xfd & xcr0) != xcr0) {
+		u64 request = (xcr0 ^ xfd) & xcr0;
+		struct fpu_guest *guest_fpu = &vcpu->arch.guest_fpu;
+
+		/*
+		 * If requested features haven't been enabled, update
+		 * the request bitmap and tell the caller to request
+		 * dynamic buffer reallocation.
+		 */
+		if ((guest_fpu->user_xfeatures & request) != request) {
+			vcpu->arch.guest_fpu.realloc_request = request;
+			return true;
+		}
+	}
+
+	return false;
+}
+EXPORT_SYMBOL_GPL(kvm_check_guest_realloc_fpstate);
+
 static int __kvm_set_xcr(struct kvm_vcpu *vcpu, u32 index, u64 xcr)
 {
 	u64 xcr0 = xcr;
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index 4abcd8d9836d..24a323980146 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -445,6 +445,7 @@ static inline void kvm_machine_check(void)
 
 void kvm_load_guest_xsave_state(struct kvm_vcpu *vcpu);
 void kvm_load_host_xsave_state(struct kvm_vcpu *vcpu);
+bool kvm_check_guest_realloc_fpstate(struct kvm_vcpu *vcpu, u64 new_xfd);
 int kvm_spec_ctrl_test_value(u64 value);
 bool kvm_is_valid_cr4(struct kvm_vcpu *vcpu, unsigned long cr4);
 int kvm_handle_memory_failure(struct kvm_vcpu *vcpu, int r,


* [PATCH 10/19] kvm: x86: Emulate WRMSR of guest IA32_XFD
  2021-12-08  0:03 [PATCH 00/19] AMX Support in KVM Yang Zhong
                   ` (8 preceding siblings ...)
  2021-12-08  0:03 ` [PATCH 09/19] kvm: x86: Prepare reallocation check Yang Zhong
@ 2021-12-08  0:03 ` Yang Zhong
  2021-12-10 16:02   ` Paolo Bonzini
                     ` (2 more replies)
  2021-12-08  0:03 ` [PATCH 11/19] kvm: x86: Check fpstate reallocation in XSETBV emulation Yang Zhong
                   ` (9 subsequent siblings)
  19 siblings, 3 replies; 80+ messages in thread
From: Yang Zhong @ 2021-12-08  0:03 UTC (permalink / raw)
  To: x86, kvm, linux-kernel, tglx, mingo, bp, dave.hansen, pbonzini
  Cc: seanjc, jun.nakajima, kevin.tian, jing2.liu, jing2.liu, yang.zhong

From: Jing Liu <jing2.liu@intel.com>

Intel's eXtended Feature Disable (XFD) feature allows the software
to dynamically adjust fpstate buffer size for XSAVE features which
have large state.

WRMSR to IA32_XFD is intercepted so that, if the written value enables
a dynamic XSAVE feature, the emulation code can exit to userspace
to trigger fpstate reallocation for that state.

Introduce a new KVM exit reason (KVM_EXIT_FPU_REALLOC) for this
purpose. If reallocation succeeds in fpu_swap_kvm_fpstate(), this
exit just bounces to userspace and then back. Otherwise the
userspace VMM should handle the error properly.

Using a new exit reason (instead of KVM_EXIT_X86_WRMSR) is clearer
and allows it to be shared between WRMSR(IA32_XFD) and XSETBV. It also
avoids mixing with the userspace MSR machinery, which is tied to
KVM_EXIT_X86_WRMSR today.

Also introduce a new MSR return type (KVM_MSR_RET_USERSPACE).
Currently MSR emulation returns to userspace only upon error or
per certain filtering rules via the userspace MSR machinery. The
new return type indicates that emulation of a certain MSR has its
own specific reason to bounce to userspace.

IA32_XFD is updated in two ways:

  - If reallocation is not required, the emulation code directly
    updates guest_fpu::xfd and then calls xfd_update_state() to
    update IA32_XFD and per-cpu cache;

  - If reallocation is triggered, the above updates are completed as
    part of the fpstate reallocation process, provided it succeeds;

RDMSR to IA32_XFD is not intercepted, since fpu_swap_kvm_fpstate()
ensures the guest XFD value is loaded into the MSR before re-entering
the guest. This just saves an unnecessary VM-exit.

Signed-off-by: Jing Liu <jing2.liu@intel.com>
Signed-off-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Yang Zhong <yang.zhong@intel.com>
---
 arch/x86/kvm/vmx/vmx.c   |  8 +++++++
 arch/x86/kvm/x86.c       | 48 ++++++++++++++++++++++++++++++++++++++++
 arch/x86/kvm/x86.h       |  1 +
 include/uapi/linux/kvm.h |  1 +
 4 files changed, 58 insertions(+)

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 70d86ffbccf7..971d60980d5b 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7141,6 +7141,11 @@ static void update_intel_pt_cfg(struct kvm_vcpu *vcpu)
 		vmx->pt_desc.ctl_bitmask &= ~(0xfULL << (32 + i * 4));
 }
 
+static void vmx_update_intercept_xfd(struct kvm_vcpu *vcpu)
+{
+	vmx_set_intercept_for_msr(vcpu, MSR_IA32_XFD, MSR_TYPE_R, false);
+}
+
 static void vmx_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
@@ -7181,6 +7186,9 @@ static void vmx_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
 		}
 	}
 
+	if (cpu_feature_enabled(X86_FEATURE_XFD) && guest_cpuid_has(vcpu, X86_FEATURE_XFD))
+		vmx_update_intercept_xfd(vcpu);
+
 	set_cr4_guest_host_mask(vmx);
 
 	vmx_write_encls_bitmap(vcpu, NULL);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 91cc6f69a7ca..c83887cb55ee 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1873,6 +1873,16 @@ static int kvm_msr_user_space(struct kvm_vcpu *vcpu, u32 index,
 {
 	u64 msr_reason = kvm_msr_reason(r);
 
+	/*
+	 * MSR emulation may need certain effect triggered in the
+	 * path transitioning to userspace (e.g. fpstate reallocation).
+	 * In this case the actual exit reason and completion
+	 * func should have been set by the emulation code before
+	 * this point.
+	 */
+	if (r == KVM_MSR_RET_USERSPACE)
+		return 1;
+
 	/* Check if the user wanted to know about this MSR fault */
 	if (!(vcpu->kvm->arch.user_space_msr_mask & msr_reason))
 		return 0;
@@ -3692,6 +3702,44 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 			return 1;
 		vcpu->arch.msr_misc_features_enables = data;
 		break;
+#ifdef CONFIG_X86_64
+	case MSR_IA32_XFD:
+		if (!guest_cpuid_has(vcpu, X86_FEATURE_XFD))
+			return 1;
+
+		/* Setting unsupported bits causes #GP */
+		if (~XFEATURE_MASK_USER_DYNAMIC & data) {
+			kvm_inject_gp(vcpu, 0);
+			break;
+		}
+
+		WARN_ON_ONCE(current->thread.fpu.fpstate !=
+			     vcpu->arch.guest_fpu.fpstate);
+
+		/*
+		 * Check if fpstate reallocate is required. If yes, then
+		 * let the fpu core do reallocation and update xfd;
+		 * otherwise, update xfd here.
+		 */
+		if (kvm_check_guest_realloc_fpstate(vcpu, data)) {
+			vcpu->run->exit_reason = KVM_EXIT_FPU_REALLOC;
+			vcpu->arch.complete_userspace_io =
+				kvm_skip_emulated_instruction;
+			return KVM_MSR_RET_USERSPACE;
+		}
+
+		/*
+		 * Update IA32_XFD to the guest value so #NM can be
+		 * raised properly in the guest. Instead of directly
+		 * writing the MSR, call a helper to avoid breaking
+		 * per-cpu cached value in fpu core.
+		 */
+		fpregs_lock();
+		current->thread.fpu.fpstate->xfd = data;
+		xfd_update_state(current->thread.fpu.fpstate);
+		fpregs_unlock();
+		break;
+#endif
 	default:
 		if (kvm_pmu_is_valid_msr(vcpu, msr))
 			return kvm_pmu_set_msr(vcpu, msr_info);
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index 24a323980146..446ffa8c7804 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -460,6 +460,7 @@ bool kvm_msr_allowed(struct kvm_vcpu *vcpu, u32 index, u32 type);
  */
 #define  KVM_MSR_RET_INVALID	2	/* in-kernel MSR emulation #GP condition */
 #define  KVM_MSR_RET_FILTERED	3	/* #GP due to userspace MSR filter */
+#define  KVM_MSR_RET_USERSPACE	4	/* Userspace handling */
 
 #define __cr4_reserved_bits(__cpu_has, __c)             \
 ({                                                      \
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 1daa45268de2..0c7b301c7254 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -270,6 +270,7 @@ struct kvm_xen_exit {
 #define KVM_EXIT_X86_BUS_LOCK     33
 #define KVM_EXIT_XEN              34
 #define KVM_EXIT_RISCV_SBI        35
+#define KVM_EXIT_FPU_REALLOC      36
 
 /* For KVM_EXIT_INTERNAL_ERROR */
 /* Emulate instruction failed. */


* [PATCH 11/19] kvm: x86: Check fpstate reallocation in XSETBV emulation
  2021-12-08  0:03 [PATCH 00/19] AMX Support in KVM Yang Zhong
                   ` (9 preceding siblings ...)
  2021-12-08  0:03 ` [PATCH 10/19] kvm: x86: Emulate WRMSR of guest IA32_XFD Yang Zhong
@ 2021-12-08  0:03 ` Yang Zhong
  2021-12-08  0:03 ` [PATCH 12/19] x86/fpu: Prepare KVM for bringing XFD state back in-sync Yang Zhong
                   ` (8 subsequent siblings)
  19 siblings, 0 replies; 80+ messages in thread
From: Yang Zhong @ 2021-12-08  0:03 UTC (permalink / raw)
  To: x86, kvm, linux-kernel, tglx, mingo, bp, dave.hansen, pbonzini
  Cc: seanjc, jun.nakajima, kevin.tian, jing2.liu, jing2.liu, yang.zhong

XSETBV allows software to write the extended control register XCR0,
so its emulation handler also needs to check for fpstate reallocation
when the new XCR0 value enables certain dynamically-enabled
features.

Signed-off-by: Jing Liu <jing2.liu@intel.com>
Signed-off-by: Yang Zhong <yang.zhong@intel.com>
---
 arch/x86/kvm/x86.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index c83887cb55ee..b195f4fa888f 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -1028,6 +1028,15 @@ int kvm_emulate_xsetbv(struct kvm_vcpu *vcpu)
 		return 1;
 	}
 
+	if (guest_cpuid_has(vcpu, X86_FEATURE_XFD)) {
+		if (kvm_check_guest_realloc_fpstate(vcpu, vcpu->arch.guest_fpu.fpstate->xfd)) {
+			vcpu->run->exit_reason = KVM_EXIT_FPU_REALLOC;
+			vcpu->arch.complete_userspace_io =
+				kvm_skip_emulated_instruction;
+			return 0;
+		}
+	}
+
 	return kvm_skip_emulated_instruction(vcpu);
 }
 EXPORT_SYMBOL_GPL(kvm_emulate_xsetbv);


* [PATCH 12/19] x86/fpu: Prepare KVM for bringing XFD state back in-sync
  2021-12-08  0:03 [PATCH 00/19] AMX Support in KVM Yang Zhong
                   ` (10 preceding siblings ...)
  2021-12-08  0:03 ` [PATCH 11/19] kvm: x86: Check fpstate reallocation in XSETBV emulation Yang Zhong
@ 2021-12-08  0:03 ` Yang Zhong
  2021-12-10 23:11   ` Thomas Gleixner
  2021-12-08  0:03 ` [PATCH 13/19] kvm: x86: Disable WRMSR interception for IA32_XFD on demand Yang Zhong
                   ` (7 subsequent siblings)
  19 siblings, 1 reply; 80+ messages in thread
From: Yang Zhong @ 2021-12-08  0:03 UTC (permalink / raw)
  To: x86, kvm, linux-kernel, tglx, mingo, bp, dave.hansen, pbonzini
  Cc: seanjc, jun.nakajima, kevin.tian, jing2.liu, jing2.liu, yang.zhong

From: Thomas Gleixner <tglx@linutronix.de>

The guest may toggle IA32_XFD at high frequency, as it is part of the
fpstate information (features, sizes, xfd) and is swapped on task
context switch.

To minimize the trap overhead of writes to this MSR, one optimization
is to allow the guest to write it directly, thus eliminating traps.
However, MSR passthrough implies that guest_fpstate::xfd and the
per-cpu xfd cache might be out of sync with the current IA32_XFD value
set by the guest.

This suggests KVM needs to re-sync guest_fpstate::xfd and the per-cpu
cache with IA32_XFD before the vCPU thread might be preempted or
interrupted.

This patch provides a helper function for the re-sync purpose.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Jing Liu <jing2.liu@intel.com>
Signed-off-by: Yang Zhong <yang.zhong@intel.com>
---
(To Thomas): the original name kvm_update_guest_xfd_state() in
your sample code is renamed to xfd_sync_state() in this patch. In
concept it is a general helper to bring software values back in sync
with the MSR value after they become out-of-sync. KVM is just the
first user of this helper, so a neutral name may make more sense.
But if you prefer the original name we can change it back.

 arch/x86/include/asm/fpu/xstate.h |  2 ++
 arch/x86/kernel/fpu/xstate.c      | 14 ++++++++++++++
 2 files changed, 16 insertions(+)

diff --git a/arch/x86/include/asm/fpu/xstate.h b/arch/x86/include/asm/fpu/xstate.h
index cd3dd170e23a..c8b51d34daab 100644
--- a/arch/x86/include/asm/fpu/xstate.h
+++ b/arch/x86/include/asm/fpu/xstate.h
@@ -129,4 +129,6 @@ static __always_inline __pure bool fpu_state_size_dynamic(void)
 }
 #endif
 
+extern void xfd_sync_state(void);
+
 #endif
diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index 3c39789deeb9..a5656237a763 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -1762,11 +1762,25 @@ void xfd_update_state(struct fpstate *fpstate)
 	}
 }
 EXPORT_SYMBOL_GPL(xfd_update_state);
+
+/* Bring software state in sync with the current MSR value */
+void xfd_sync_state(void)
+{
+	if (fpu_state_size_dynamic()) {
+		u64 xfd;
+
+		rdmsrl(MSR_IA32_XFD, xfd);
+		current->thread.fpu.fpstate->xfd = xfd;
+		__this_cpu_write(xfd_state, xfd);
+	}
+}
+EXPORT_SYMBOL_GPL(xfd_sync_state);
 #else /* CONFIG_X86_64 */
 static inline int xstate_request_perm(unsigned long idx, bool guest)
 {
 	return -EPERM;
 }
+void xfd_sync_state(void) {}
 #endif  /* !CONFIG_X86_64 */
 
 inline u64 xstate_get_guest_group_perm(void)

^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [PATCH 13/19] kvm: x86: Disable WRMSR interception for IA32_XFD on demand
  2021-12-08  0:03 [PATCH 00/19] AMX Support in KVM Yang Zhong
                   ` (11 preceding siblings ...)
  2021-12-08  0:03 ` [PATCH 12/19] x86/fpu: Prepare KVM for bringing XFD state back in-sync Yang Zhong
@ 2021-12-08  0:03 ` Yang Zhong
  2021-12-08  7:23   ` Liu, Jing2
  2021-12-08  0:03 ` [PATCH 14/19] x86/fpu: Prepare for KVM XFD_ERR handling Yang Zhong
                   ` (6 subsequent siblings)
  19 siblings, 1 reply; 80+ messages in thread
From: Yang Zhong @ 2021-12-08  0:03 UTC (permalink / raw)
  To: x86, kvm, linux-kernel, tglx, mingo, bp, dave.hansen, pbonzini
  Cc: seanjc, jun.nakajima, kevin.tian, jing2.liu, jing2.liu, yang.zhong

From: Jing Liu <jing2.liu@intel.com>

Always intercepting IA32_XFD causes non-negligible overhead when
this register is updated frequently in the guest.

Disable WRMSR interception for IA32_XFD after fpstate reallocation
is completed. There are three options for when to disable the
interception:

  1) When emulating the 1st WRMSR which requires reallocation,
     disable interception before exiting to userspace with the
     assumption that the userspace VMM should not bounce back to
     the kernel if reallocation fails. However it's not good to
     design the kernel based on assumed application behavior. If,
     due to a bug, the vCPU thread comes back to the kernel after
     reallocation fails, XFD passthrough may lead to host memory
     corruption when doing XSAVES for a guest fpstate which has a
     smaller size than what the guest XFD value allows.

  2) Disable interception when coming back from the userspace VMM
     (for the 1st WRMSR which triggers reallocation). Re-check
     whether fpstate size can serve the new guest XFD value. Disable
     interception only when the check succeeds. This requires KVM
     to store guest XFD value in some place and then compare it
     to guest_fpu::user_xfeatures in the completion handler.

  3) Disable interception at the 2nd WRMSR which enables dynamic
     XSTATE features. If guest_fpu::user_xfeatures already includes
     the bits for the dynamic features being enabled by the guest
     XFD value, disable interception.

Currently 3) is implemented, with a flow like below:

    (G) WRMSR(IA32_XFD) which enables AMX for the FIRST time
    --trap to host--
    (HK) Emulate WRMSR and find fpstate size too small
    (HK) Reallocate fpstate
    --exit to userspace--
    (HU) do nothing
    --back to kernel via kvm_run--
    (HK) complete WRMSR emulation
    --enter guest--
    (G) do something
    (G) WRMSR(IA32_XFD) which disables AMX
    --trap to host--
    (HK) Emulate WRMSR and disable AMX in IA32_XFD
    --enter guest--
    (G) do something
    (G) WRMSR(IA32_XFD) which enables AMX for the SECOND time
    --trap to host--
    (HK) Emulate WRMSR and find fpstate size sufficient for AMX
    (HK) Disable WRMSR interception for IA32_XFD
    --enter guest--
    (G) WRMSR(IA32_XFD)
    (G) WRMSR(IA32_XFD)
    (G) WRMSR(IA32_XFD)
    ...
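
A simplified sketch of the option 3) check inside
kvm_check_guest_realloc_fpstate(vcpu, xfd), for illustration only
(the surrounding reallocation check was introduced in an earlier
patch of this series):

    /* dynamic features the guest is enabling with this write */
    u64 enabling = vcpu->arch.xcr0 & XFEATURE_MASK_USER_DYNAMIC & ~xfd;

    if ((vcpu->arch.guest_fpu.user_xfeatures & enabling) == enabling) {
            /* fpstate already sized for them: passthrough is safe */
            if (kvm_x86_ops.set_xfd_passthrough)
                    static_call(kvm_x86_set_xfd_passthrough)(vcpu);
    }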

After disabling WRMSR interception, the guest directly updates
IA32_XFD, which becomes out-of-sync with the host-side software
state (guest_fpstate::xfd and the per-cpu xfd cache). This requires
KVM to call xfd_sync_state() to bring the software state back in
sync with the IA32_XFD register after VM-exit (before preemption
happens or before exiting to userspace).

p.s. We have confirmed that the SDM is being revised to say that
when setting IA32_XFD[18] the AMX register state is not
guaranteed to be preserved. This clarification avoids extra
complexity for a creative guest which sets IA32_XFD[18]=1 before
saving its active AMX state to its own storage.

Signed-off-by: Jing Liu <jing2.liu@intel.com>
Signed-off-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Yang Zhong <yang.zhong@intel.com>
---
 arch/x86/include/asm/kvm-x86-ops.h |  1 +
 arch/x86/include/asm/kvm_host.h    |  2 ++
 arch/x86/kvm/vmx/vmx.c             | 10 ++++++++++
 arch/x86/kvm/vmx/vmx.h             |  2 +-
 arch/x86/kvm/x86.c                 |  7 +++++++
 5 files changed, 21 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
index cefe1d81e2e8..60c27f9990e9 100644
--- a/arch/x86/include/asm/kvm-x86-ops.h
+++ b/arch/x86/include/asm/kvm-x86-ops.h
@@ -30,6 +30,7 @@ KVM_X86_OP(update_exception_bitmap)
 KVM_X86_OP(get_msr)
 KVM_X86_OP(set_msr)
 KVM_X86_OP(get_segment_base)
+KVM_X86_OP_NULL(set_xfd_passthrough)
 KVM_X86_OP(get_segment)
 KVM_X86_OP(get_cpl)
 KVM_X86_OP(set_segment)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 6ac61f85e07b..7c97cc1fea89 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -640,6 +640,7 @@ struct kvm_vcpu_arch {
 	u64 smi_count;
 	bool tpr_access_reporting;
 	bool xsaves_enabled;
+	bool xfd_out_of_sync;
 	u64 ia32_xss;
 	u64 microcode_version;
 	u64 arch_capabilities;
@@ -1328,6 +1329,7 @@ struct kvm_x86_ops {
 	void (*update_exception_bitmap)(struct kvm_vcpu *vcpu);
 	int (*get_msr)(struct kvm_vcpu *vcpu, struct msr_data *msr);
 	int (*set_msr)(struct kvm_vcpu *vcpu, struct msr_data *msr);
+	void (*set_xfd_passthrough)(struct kvm_vcpu *vcpu);
 	u64 (*get_segment_base)(struct kvm_vcpu *vcpu, int seg);
 	void (*get_segment)(struct kvm_vcpu *vcpu,
 			    struct kvm_segment *var, int seg);
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 971d60980d5b..6198b13c4846 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -160,6 +160,7 @@ static u32 vmx_possible_passthrough_msrs[MAX_POSSIBLE_PASSTHROUGH_MSRS] = {
 	MSR_FS_BASE,
 	MSR_GS_BASE,
 	MSR_KERNEL_GS_BASE,
+	MSR_IA32_XFD,
 #endif
 	MSR_IA32_SYSENTER_CS,
 	MSR_IA32_SYSENTER_ESP,
@@ -1924,6 +1925,14 @@ static u64 vcpu_supported_debugctl(struct kvm_vcpu *vcpu)
 	return debugctl;
 }
 
+#ifdef CONFIG_X86_64
+static void vmx_set_xfd_passthrough(struct kvm_vcpu *vcpu)
+{
+	vmx_disable_intercept_for_msr(vcpu, MSR_IA32_XFD, MSR_TYPE_RW);
+	vcpu->arch.xfd_out_of_sync = true;
+}
+#endif
+
 /*
  * Writes msr value into the appropriate "register".
  * Returns 0 on success, non-0 otherwise.
@@ -7657,6 +7666,7 @@ static struct kvm_x86_ops vmx_x86_ops __initdata = {
 #ifdef CONFIG_X86_64
 	.set_hv_timer = vmx_set_hv_timer,
 	.cancel_hv_timer = vmx_cancel_hv_timer,
+	.set_xfd_passthrough = vmx_set_xfd_passthrough,
 #endif
 
 	.setup_mce = vmx_setup_mce,
diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index 4df2ac24ffc1..bf9d3051cd6c 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -340,7 +340,7 @@ struct vcpu_vmx {
 	struct lbr_desc lbr_desc;
 
 	/* Save desired MSR intercept (read: pass-through) state */
-#define MAX_POSSIBLE_PASSTHROUGH_MSRS	13
+#define MAX_POSSIBLE_PASSTHROUGH_MSRS	14
 	struct {
 		DECLARE_BITMAP(read, MAX_POSSIBLE_PASSTHROUGH_MSRS);
 		DECLARE_BITMAP(write, MAX_POSSIBLE_PASSTHROUGH_MSRS);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index b195f4fa888f..d127b229dd29 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -974,6 +974,10 @@ bool kvm_check_guest_realloc_fpstate(struct kvm_vcpu *vcpu, u64 xfd)
 			vcpu->arch.guest_fpu.realloc_request = request;
 			return true;
 		}
+
+		/* Disable WRMSR interception if possible */
+		if (kvm_x86_ops.set_xfd_passthrough)
+			static_call(kvm_x86_set_xfd_passthrough)(vcpu);
 	}
 
 	return false;
@@ -10002,6 +10006,9 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 	if (hw_breakpoint_active())
 		hw_breakpoint_restore();
 
+	if (vcpu->arch.xfd_out_of_sync)
+		xfd_sync_state();
+
 	vcpu->arch.last_vmentry_cpu = vcpu->cpu;
 	vcpu->arch.last_guest_tsc = kvm_read_l1_tsc(vcpu, rdtsc());
 

^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [PATCH 14/19] x86/fpu: Prepare for KVM XFD_ERR handling
  2021-12-08  0:03 [PATCH 00/19] AMX Support in KVM Yang Zhong
                   ` (12 preceding siblings ...)
  2021-12-08  0:03 ` [PATCH 13/19] kvm: x86: Disable WRMSR interception for IA32_XFD on demand Yang Zhong
@ 2021-12-08  0:03 ` Yang Zhong
  2021-12-10 16:16   ` Paolo Bonzini
  2021-12-10 23:20   ` Thomas Gleixner
  2021-12-08  0:03 ` [PATCH 15/19] kvm: x86: Save and restore guest XFD_ERR properly Yang Zhong
                   ` (5 subsequent siblings)
  19 siblings, 2 replies; 80+ messages in thread
From: Yang Zhong @ 2021-12-08  0:03 UTC (permalink / raw)
  To: x86, kvm, linux-kernel, tglx, mingo, bp, dave.hansen, pbonzini
  Cc: seanjc, jun.nakajima, kevin.tian, jing2.liu, jing2.liu, yang.zhong

From: Jing Liu <jing2.liu@intel.com>

When XFD causes an instruction to generate #NM, IA32_XFD_ERR
contains information about which disabled state components
are being accessed. The #NM handler is expected to check this
information and then enable the state components by clearing
IA32_XFD for the faulting task (if it has permission).

Problems arise if the XFD_ERR value generated in the guest is
consumed/clobbered by the host before the guest itself does so.
This may lead to a non-XFD-related #NM being treated as an XFD #NM
in the host (due to a non-zero value in XFD_ERR), or an
XFD-related #NM being treated as a non-XFD #NM in the guest
(XFD_ERR cleared by the host #NM handler).

This patch provides two helpers to swap the guest and host XFD_ERR
values. Where to call them in KVM will be discussed thoroughly in
the next patch; a rough sketch of the expected pairing follows the
bit descriptions below.

The guest XFD_ERR value is saved in fpu_guest::xfd_err. There is
no need to save host XFD_ERR because it's always cleared to ZERO
by the host #NM handler (which cannot be preempted by a vCPU
thread to observe a non-zero value).

The lower two bits in fpu_guest::xfd_err are borrowed for special
purposes. The state components (FP and SSE) covered by the two
bits are not XSAVE-managed features, thus not XFD-enabled either,
so hardware will never set them in XFD_ERR:

  - XFD_ERR_GUEST_DISABLED (bit 0)

    Indicates that the XFD extension is not exposed to the guest,
    thus there is no need to save/restore XFD_ERR.

  - XFD_ERR_GUEST_SAVED (bit 1)

    Indicates that fpu_guest::xfd_err already contains a saved value,
    thus there is no need for duplicated saving (e.g. when the vCPU
    thread is preempted multiple times before re-entering the guest).
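
The expected pairing, roughly (the exact call sites appear in the
next patch):

    /* when the vCPU thread is preempted or exits to userspace */
    fpu_save_guest_xfd_err(&vcpu->arch.guest_fpu);

    /* right before VM-entry, with preemption disabled */
    fpu_restore_guest_xfd_err(&vcpu->arch.guest_fpu);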

Signed-off-by: Jing Liu <jing2.liu@intel.com>
Signed-off-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Yang Zhong <yang.zhong@intel.com>
---
 arch/x86/include/asm/fpu/api.h   |  8 ++++++
 arch/x86/include/asm/fpu/types.h | 24 ++++++++++++++++
 arch/x86/kernel/fpu/core.c       | 49 ++++++++++++++++++++++++++++++++
 3 files changed, 81 insertions(+)

diff --git a/arch/x86/include/asm/fpu/api.h b/arch/x86/include/asm/fpu/api.h
index 999d89026be9..c2e8f2172994 100644
--- a/arch/x86/include/asm/fpu/api.h
+++ b/arch/x86/include/asm/fpu/api.h
@@ -147,6 +147,14 @@ extern bool fpu_alloc_guest_fpstate(struct fpu_guest *gfpu);
 extern void fpu_free_guest_fpstate(struct fpu_guest *gfpu);
 extern int fpu_swap_kvm_fpstate(struct fpu_guest *gfpu, bool enter_guest);
 
+#ifdef CONFIG_X86_64
+extern void fpu_save_guest_xfd_err(struct fpu_guest *guest_fpu);
+extern void fpu_restore_guest_xfd_err(struct fpu_guest *guest_fpu);
+#else
+static inline void fpu_save_guest_xfd_err(struct fpu_guest *guest_fpu) { }
+static inline void fpu_restore_guest_xfd_err(struct fpu_guest *guest_fpu) { }
+#endif
+
 extern void fpu_copy_guest_fpstate_to_uabi(struct fpu_guest *gfpu, void *buf, unsigned int size, u32 pkru);
 extern int fpu_copy_uabi_to_guest_fpstate(struct fpu_guest *gfpu, const void *buf, u64 xcr0, u32 *vpkru);
 
diff --git a/arch/x86/include/asm/fpu/types.h b/arch/x86/include/asm/fpu/types.h
index 861cffca3209..5ee98222c103 100644
--- a/arch/x86/include/asm/fpu/types.h
+++ b/arch/x86/include/asm/fpu/types.h
@@ -500,6 +500,22 @@ struct fpu {
 	 */
 };
 
+/*
+ * Use @xfd_err:bit0 to indicate whether guest XFD_ERR should be
+ * saved/restored. The x87 state covered by bit 0 is not an
+ * XSAVE-enabled feature, thus is not XFD-enabled either (won't
+ * occur in XFD_ERR).
+ */
+#define XFD_ERR_GUEST_DISABLED		(1 << XFEATURE_FP)
+
+/*
+ * Use @xfd_err:bit1 to indicate the validity of @xfd_err. Used to
+ * avoid duplicated savings in case the vCPU is preempted multiple
+ * times before it re-enters the guest. The SSE state covered by
+ * bit 1 is neither XSAVE-enabled nor XFD-enabled.
+ */
+#define XFD_ERR_GUEST_SAVED		(1 << XFEATURE_SSE)
+
 /*
  * Guest pseudo FPU container
  */
@@ -527,6 +543,14 @@ struct fpu_guest {
 	 */
 	u64				realloc_request;
 
+	/*
+	 * @xfd_err:			save the guest value. bit 0 and bit1
+	 *				have special meaning to indicate the
+	 *				requirement of saving and the validity
+	 *				of the saved value.
+	 */
+	u64				xfd_err;
+
 	/*
 	 * @fpstate:			Pointer to the allocated guest fpstate
 	 */
diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
index 7a0436a0cb2c..5089f2e7dc22 100644
--- a/arch/x86/kernel/fpu/core.c
+++ b/arch/x86/kernel/fpu/core.c
@@ -322,6 +322,55 @@ int fpu_swap_kvm_fpstate(struct fpu_guest *guest_fpu, bool enter_guest)
 }
 EXPORT_SYMBOL_GPL(fpu_swap_kvm_fpstate);
 
+#ifdef CONFIG_X86_64
+void fpu_save_guest_xfd_err(struct fpu_guest *guest_fpu)
+{
+	if (guest_fpu->xfd_err & XFD_ERR_GUEST_DISABLED)
+		return;
+
+	/* A non-zero value indicates guest XFD_ERR already saved */
+	if (guest_fpu->xfd_err)
+		return;
+
+	/* Guest XFD_ERR must be saved before switching to host fpstate */
+	WARN_ON_ONCE(!current->thread.fpu.fpstate->is_guest);
+
+	rdmsrl(MSR_IA32_XFD_ERR, guest_fpu->xfd_err);
+
+	/*
+	 * Restore to the host value if guest xfd_err is non-zero.
+	 * Except in #NM handler, all other places in the kernel
+	 * should just see xfd_err=0. So just restore to 0.
+	 */
+	if (guest_fpu->xfd_err)
+		wrmsrl(MSR_IA32_XFD_ERR, 0);
+
+	guest_fpu->xfd_err |= XFD_ERR_GUEST_SAVED;
+}
+EXPORT_SYMBOL_GPL(fpu_save_guest_xfd_err);
+
+void fpu_restore_guest_xfd_err(struct fpu_guest *guest_fpu)
+{
+	u64 xfd_err = guest_fpu->xfd_err;
+
+	if (xfd_err & XFD_ERR_GUEST_DISABLED)
+		return;
+
+	xfd_err &= ~XFD_ERR_GUEST_SAVED;
+
+	/*
+	 * No need to restore a zero value since XFD_ERR
+	 * is always zero outside of #NM handler in the host.
+	 */
+	if (!xfd_err)
+		return;
+
+	wrmsrl(MSR_IA32_XFD_ERR, xfd_err);
+	guest_fpu->xfd_err = 0;
+}
+EXPORT_SYMBOL_GPL(fpu_restore_guest_xfd_err);
+#endif
+
 void fpu_copy_guest_fpstate_to_uabi(struct fpu_guest *gfpu, void *buf,
 				    unsigned int size, u32 pkru)
 {

^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [PATCH 15/19] kvm: x86: Save and restore guest XFD_ERR properly
  2021-12-08  0:03 [PATCH 00/19] AMX Support in KVM Yang Zhong
                   ` (13 preceding siblings ...)
  2021-12-08  0:03 ` [PATCH 14/19] x86/fpu: Prepare for KVM XFD_ERR handling Yang Zhong
@ 2021-12-08  0:03 ` Yang Zhong
  2021-12-10 16:23   ` Paolo Bonzini
                     ` (2 more replies)
  2021-12-08  0:03 ` [PATCH 16/19] kvm: x86: Introduce KVM_{G|S}ET_XSAVE2 ioctl Yang Zhong
                   ` (4 subsequent siblings)
  19 siblings, 3 replies; 80+ messages in thread
From: Yang Zhong @ 2021-12-08  0:03 UTC (permalink / raw)
  To: x86, kvm, linux-kernel, tglx, mingo, bp, dave.hansen, pbonzini
  Cc: seanjc, jun.nakajima, kevin.tian, jing2.liu, jing2.liu, yang.zhong

KVM needs to save the guest XFD_ERR value before this register
might be accessed by the host and restore it before entering the
guest.

This implementation saves guest XFD_ERR at two transition points:

  - When the vCPU thread exits to the userspace VMM;
  - When the vCPU thread is preempted;

XFD_ERR is cleared to ZERO right after saving the previous guest
value. Otherwise a stale guest value may confuse the host #NM
handler into misinterpreting a non-XFD-related #NM as XFD related.

There is no need to save the host XFD_ERR value because the only
place where XFD_ERR is consumed outside of KVM is the #NM handler
(which cannot be preempted by a vCPU thread). XFD_ERR should
always be observed as ZERO outside of the #NM handler, thus
clearing XFD_ERR meets the host expectation here.

The saved guest value is restored to XFD_ERR right before entering
the guest (with preemption disabled).

The current implementation still has two open questions on which we
would like to hear suggestions:

  1) Will #NM be triggered in the host kernel?

  The code is currently written assuming the above is true, and it's
  the only reason for saving guest XFD_ERR at preemption time.
  Otherwise the save would only be required when the CPU enters
  ring-3 (either from the vCPU thread itself or from other threads),
  by leveraging the "user-return notifier" machinery as suggested by
  Paolo.

  2) When to enable XFD_ERR save/restore?

  There are four options on the table:

    a) As long as guest cpuid has xfd enabled

       XFD_ERR save/restore is enabled on every VM-exit (if preemption
       or ret-to-userspace happens).

    b) When the guest sets IA32_XFD to 1 for the first time

       This indicates that the guest OS supports XFD features. Because
       a guest OS usually initializes IA32_XFD at boot time, XFD_ERR
       save/restore is enabled for almost every VM-exit (if preemption
       or ret-to-userspace happens).

       No save/restore is done for a legacy guest OS which doesn't
       support XFD features at all (and thus won't touch IA32_XFD).

    c) When the guest sets IA32_XFD to 0 for the first time

       Lazily enable XFD_ERR save/restore once XFD features are used
       inside the guest. However, this option doesn't work because
       XFD_ERR is set when #NM is raised. A VM-exit could happen
       between the CPU raising #NM and the guest #NM handler reading
       XFD_ERR (before setting XFD to 0). The very first XFD_ERR value
       might already be clobbered by the host due to there being no
       save/restore in that small window.

    d) When the 1st guest #NM with non-zero XFD_ERR occurs

       Lazily enable XFD_ERR save/restore once XFD features are used
       inside the guest. This requires intercepting guest #NM until a
       non-zero XFD_ERR occurs. If a guest with XFD in cpuid never
       launches an AMX application, it implies that #NM is always
       trapped, adding a constant overhead which may be even higher
       than doing the RDMSR in the preemption path in a) and b):

         #preempts < #VMEXITS (no #NM trap) < #VMEXITS (#NM trap)

       The number of preemptions and ret-to-userspace transitions
       should be a small portion of the total #VMEXITs in a healthy
       virtualization environment. Our gut feeling is that adding at
       most one MSR read and one MSR write to the preempt/user-ret
       paths is possibly more efficient than increasing #VMEXITs by
       trapping #NM.

Based on the above analysis we plan to go with option b), although
this version currently implements a). We would like to hear other
suggestions before making this change.
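
For illustration, option b) could be keyed off the first guest write
of a non-zero IA32_XFD value, reusing the XFD_ERR_GUEST_DISABLED
marker introduced in the previous patch (a sketch, not final code):

    case MSR_IA32_XFD:
            ...
            /* 1st non-zero write: the guest OS knows about XFD,
             * start saving/restoring XFD_ERR across VM-exits */
            if (data &&
                (vcpu->arch.guest_fpu.xfd_err & XFD_ERR_GUEST_DISABLED))
                    vcpu->arch.guest_fpu.xfd_err = 0;
            ...
            break;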

Signed-off-by: Jing Liu <jing2.liu@intel.com>
Signed-off-by: Kevin Tian <kevin.tian@intel.com>
Signed-off-by: Yang Zhong <yang.zhong@intel.com>
---
 arch/x86/kernel/fpu/core.c | 2 ++
 arch/x86/kvm/cpuid.c       | 5 +++++
 arch/x86/kvm/vmx/vmx.c     | 2 ++
 arch/x86/kvm/vmx/vmx.h     | 2 +-
 arch/x86/kvm/x86.c         | 5 +++++
 5 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
index 5089f2e7dc22..9811dc98d550 100644
--- a/arch/x86/kernel/fpu/core.c
+++ b/arch/x86/kernel/fpu/core.c
@@ -238,6 +238,7 @@ bool fpu_alloc_guest_fpstate(struct fpu_guest *gfpu)
 	fpstate->is_guest	= true;
 
 	gfpu->fpstate		= fpstate;
+	gfpu->xfd_err           = XFD_ERR_GUEST_DISABLED;
 	gfpu->user_xfeatures	= fpu_user_cfg.default_features;
 	gfpu->user_perm		= fpu_user_cfg.default_features;
 	fpu_init_guest_permissions(gfpu);
@@ -297,6 +298,7 @@ int fpu_swap_kvm_fpstate(struct fpu_guest *guest_fpu, bool enter_guest)
 		fpu->fpstate = guest_fps;
 		guest_fps->in_use = true;
 	} else {
+		fpu_save_guest_xfd_err(guest_fpu);
 		guest_fps->in_use = false;
 		fpu->fpstate = fpu->__task_fpstate;
 		fpu->__task_fpstate = NULL;
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index f3c61205bbf4..ea51b986ee67 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -219,6 +219,11 @@ static void kvm_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
 		kvm_apic_set_version(vcpu);
 	}
 
+	/* Enable saving guest XFD_ERR */
+	best = kvm_find_cpuid_entry(vcpu, 7, 0);
+	if (best && cpuid_entry_has(best, X86_FEATURE_AMX_TILE))
+		vcpu->arch.guest_fpu.xfd_err = 0;
+
 	best = kvm_find_cpuid_entry(vcpu, 0xD, 0);
 	if (!best)
 		vcpu->arch.guest_supported_xcr0 = 0;
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 6198b13c4846..0db8bdf273e2 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -161,6 +161,7 @@ static u32 vmx_possible_passthrough_msrs[MAX_POSSIBLE_PASSTHROUGH_MSRS] = {
 	MSR_GS_BASE,
 	MSR_KERNEL_GS_BASE,
 	MSR_IA32_XFD,
+	MSR_IA32_XFD_ERR,
 #endif
 	MSR_IA32_SYSENTER_CS,
 	MSR_IA32_SYSENTER_ESP,
@@ -7153,6 +7154,7 @@ static void update_intel_pt_cfg(struct kvm_vcpu *vcpu)
 static void vmx_update_intercept_xfd(struct kvm_vcpu *vcpu)
 {
 	vmx_set_intercept_for_msr(vcpu, MSR_IA32_XFD, MSR_TYPE_R, false);
+	vmx_set_intercept_for_msr(vcpu, MSR_IA32_XFD_ERR, MSR_TYPE_RW, false);
 }
 
 static void vmx_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index bf9d3051cd6c..0a00242a91e7 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -340,7 +340,7 @@ struct vcpu_vmx {
 	struct lbr_desc lbr_desc;
 
 	/* Save desired MSR intercept (read: pass-through) state */
-#define MAX_POSSIBLE_PASSTHROUGH_MSRS	14
+#define MAX_POSSIBLE_PASSTHROUGH_MSRS	15
 	struct {
 		DECLARE_BITMAP(read, MAX_POSSIBLE_PASSTHROUGH_MSRS);
 		DECLARE_BITMAP(write, MAX_POSSIBLE_PASSTHROUGH_MSRS);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index d127b229dd29..8b033c9241d6 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4550,6 +4550,9 @@ void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
 		kvm_steal_time_set_preempted(vcpu);
 	srcu_read_unlock(&vcpu->kvm->srcu, idx);
 
+	if (vcpu->preempted)
+		fpu_save_guest_xfd_err(&vcpu->arch.guest_fpu);
+
 	static_call(kvm_x86_vcpu_put)(vcpu);
 	vcpu->arch.last_host_tsc = rdtsc();
 }
@@ -9951,6 +9954,8 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 	if (test_thread_flag(TIF_NEED_FPU_LOAD))
 		switch_fpu_return();
 
+	fpu_restore_guest_xfd_err(&vcpu->arch.guest_fpu);
+
 	if (unlikely(vcpu->arch.switch_db_regs)) {
 		set_debugreg(0, 7);
 		set_debugreg(vcpu->arch.eff_db[0], 0);

^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [PATCH 16/19] kvm: x86: Introduce KVM_{G|S}ET_XSAVE2 ioctl
  2021-12-08  0:03 [PATCH 00/19] AMX Support in KVM Yang Zhong
                   ` (14 preceding siblings ...)
  2021-12-08  0:03 ` [PATCH 15/19] kvm: x86: Save and restore guest XFD_ERR properly Yang Zhong
@ 2021-12-08  0:03 ` Yang Zhong
  2021-12-10 16:25   ` Paolo Bonzini
  2021-12-10 16:30   ` Paolo Bonzini
  2021-12-08  0:03 ` [PATCH 17/19] docs: virt: api.rst: Document the new KVM_{G, S}ET_XSAVE2 ioctls Yang Zhong
                   ` (3 subsequent siblings)
  19 siblings, 2 replies; 80+ messages in thread
From: Yang Zhong @ 2021-12-08  0:03 UTC (permalink / raw)
  To: x86, kvm, linux-kernel, tglx, mingo, bp, dave.hansen, pbonzini
  Cc: seanjc, jun.nakajima, kevin.tian, jing2.liu, jing2.liu, yang.zhong

From: Jing Liu <jing2.liu@intel.com>

When dynamic XSTATE features are supported, the xsave state can
exceed 4KB. The current kvm_xsave structure and the related
KVM_{G, S}ET_XSAVE ioctls only allow 4KB, which is not enough for
the full state.

Introduce a new kvm_xsave2 structure and the corresponding
KVM_GET_XSAVE2 and KVM_SET_XSAVE2 ioctls so that the userspace VMM
can get and set the full xsave state.
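
An illustrative userspace flow (hypothetical code; the VMM is assumed
to have derived the required size, e.g. from the guest's CPUID leaf
0xD information):

    struct kvm_xsave2 *x2 = calloc(1, sizeof(*x2) + size);

    x2->size = size;
    if (ioctl(vcpu_fd, KVM_GET_XSAVE2, x2))     /* save */
            err(1, "KVM_GET_XSAVE2");
    ...
    if (ioctl(vcpu_fd, KVM_SET_XSAVE2, x2))     /* restore */
            err(1, "KVM_SET_XSAVE2");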

Signed-off-by: Jing Liu <jing2.liu@intel.com>
Signed-off-by: Yang Zhong <yang.zhong@intel.com>
---
 arch/x86/include/uapi/asm/kvm.h |  6 ++++
 arch/x86/kvm/x86.c              | 62 +++++++++++++++++++++++++++++++++
 include/uapi/linux/kvm.h        |  7 +++-
 3 files changed, 74 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
index 5a776a08f78c..de42a51e20c3 100644
--- a/arch/x86/include/uapi/asm/kvm.h
+++ b/arch/x86/include/uapi/asm/kvm.h
@@ -47,6 +47,7 @@
 #define __KVM_HAVE_VCPU_EVENTS
 #define __KVM_HAVE_DEBUGREGS
 #define __KVM_HAVE_XSAVE
+#define __KVM_HAVE_XSAVE2
 #define __KVM_HAVE_XCRS
 #define __KVM_HAVE_READONLY_MEM
 
@@ -378,6 +379,11 @@ struct kvm_xsave {
 	__u32 region[1024];
 };
 
+struct kvm_xsave2 {
+	__u32 size;
+	__u8 state[0];
+};
+
 #define KVM_MAX_XCRS	16
 
 struct kvm_xcr {
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 8b033c9241d6..d212f6d2d39a 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4216,6 +4216,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 	case KVM_CAP_DEBUGREGS:
 	case KVM_CAP_X86_ROBUST_SINGLESTEP:
 	case KVM_CAP_XSAVE:
+	case KVM_CAP_XSAVE2:
 	case KVM_CAP_ASYNC_PF:
 	case KVM_CAP_ASYNC_PF_INT:
 	case KVM_CAP_GET_TSC_KHZ:
@@ -4940,6 +4941,17 @@ static void kvm_vcpu_ioctl_x86_get_xsave(struct kvm_vcpu *vcpu,
 				       vcpu->arch.pkru);
 }
 
+static void kvm_vcpu_ioctl_x86_get_xsave2(struct kvm_vcpu *vcpu,
+					  u8 *state, u32 size)
+{
+	if (fpstate_is_confidential(&vcpu->arch.guest_fpu))
+		return;
+
+	fpu_copy_guest_fpstate_to_uabi(&vcpu->arch.guest_fpu,
+				       state, size,
+				       vcpu->arch.pkru);
+}
+
 static int kvm_vcpu_ioctl_x86_set_xsave(struct kvm_vcpu *vcpu,
 					struct kvm_xsave *guest_xsave)
 {
@@ -4951,6 +4963,15 @@ static int kvm_vcpu_ioctl_x86_set_xsave(struct kvm_vcpu *vcpu,
 					      supported_xcr0, &vcpu->arch.pkru);
 }
 
+static int kvm_vcpu_ioctl_x86_set_xsave2(struct kvm_vcpu *vcpu, u8 *state)
+{
+	if (fpstate_is_confidential(&vcpu->arch.guest_fpu))
+		return 0;
+
+	return fpu_copy_uabi_to_guest_fpstate(&vcpu->arch.guest_fpu, state,
+					      supported_xcr0, &vcpu->arch.pkru);
+}
+
 static void kvm_vcpu_ioctl_x86_get_xcrs(struct kvm_vcpu *vcpu,
 					struct kvm_xcrs *guest_xcrs)
 {
@@ -5416,6 +5437,47 @@ long kvm_arch_vcpu_ioctl(struct file *filp,
 		r = kvm_vcpu_ioctl_x86_set_xsave(vcpu, u.xsave);
 		break;
 	}
+	case KVM_GET_XSAVE2: {
+		struct kvm_xsave2 __user *xsave2_arg = argp;
+		struct kvm_xsave2 xsave2;
+
+		r = -EFAULT;
+		if (copy_from_user(&xsave2, xsave2_arg, sizeof(struct kvm_xsave2)))
+			break;
+
+		u.buffer = kzalloc(xsave2.size, GFP_KERNEL_ACCOUNT);
+
+		r = -ENOMEM;
+		if (!u.buffer)
+			break;
+
+		kvm_vcpu_ioctl_x86_get_xsave2(vcpu, u.buffer, xsave2.size);
+
+		r = -EFAULT;
+		if (copy_to_user(xsave2_arg->state, u.buffer, xsave2.size))
+			break;
+
+		r = 0;
+		break;
+	}
+	case KVM_SET_XSAVE2: {
+		struct kvm_xsave2 __user *xsave2_arg = argp;
+		struct kvm_xsave2 xsave2;
+
+		r = -EFAULT;
+		if (copy_from_user(&xsave2, xsave2_arg, sizeof(struct kvm_xsave2)))
+			break;
+
+		u.buffer = memdup_user(xsave2_arg->state, xsave2.size);
+
+		if (IS_ERR(u.buffer)) {
+			r = PTR_ERR(u.buffer);
+			goto out_nofree;
+		}
+
+		r = kvm_vcpu_ioctl_x86_set_xsave2(vcpu, u.buffer);
+		break;
+	}
 	case KVM_GET_XCRS: {
 		u.xcrs = kzalloc(sizeof(struct kvm_xcrs), GFP_KERNEL_ACCOUNT);
 		r = -ENOMEM;
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 0c7b301c7254..603e1ca9ba09 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1132,7 +1132,9 @@ struct kvm_ppc_resize_hpt {
 #define KVM_CAP_EXIT_ON_EMULATION_FAILURE 204
 #define KVM_CAP_ARM_MTE 205
 #define KVM_CAP_VM_MOVE_ENC_CONTEXT_FROM 206
-
+#ifdef __KVM_HAVE_XSAVE2
+#define KVM_CAP_XSAVE2 207
+#endif
 #ifdef KVM_CAP_IRQ_ROUTING
 
 struct kvm_irq_routing_irqchip {
@@ -1679,6 +1681,9 @@ struct kvm_xen_hvm_attr {
 #define KVM_GET_SREGS2             _IOR(KVMIO,  0xcc, struct kvm_sregs2)
 #define KVM_SET_SREGS2             _IOW(KVMIO,  0xcd, struct kvm_sregs2)
 
+#define KVM_GET_XSAVE2		   _IOR(KVMIO,  0xcf, struct kvm_xsave2)
+#define KVM_SET_XSAVE2		   _IOW(KVMIO,  0xd0, struct kvm_xsave2)
+
 struct kvm_xen_vcpu_attr {
 	__u16 type;
 	__u16 pad[3];

^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [PATCH 17/19] docs: virt: api.rst: Document the new KVM_{G, S}ET_XSAVE2 ioctls
  2021-12-08  0:03 [PATCH 00/19] AMX Support in KVM Yang Zhong
                   ` (15 preceding siblings ...)
  2021-12-08  0:03 ` [PATCH 16/19] kvm: x86: Introduce KVM_{G|S}ET_XSAVE2 ioctl Yang Zhong
@ 2021-12-08  0:03 ` Yang Zhong
  2021-12-08  0:03 ` [PATCH 18/19] kvm: x86: AMX XCR0 support for guest Yang Zhong
                   ` (2 subsequent siblings)
  19 siblings, 0 replies; 80+ messages in thread
From: Yang Zhong @ 2021-12-08  0:03 UTC (permalink / raw)
  To: x86, kvm, linux-kernel, tglx, mingo, bp, dave.hansen, pbonzini
  Cc: seanjc, jun.nakajima, kevin.tian, jing2.liu, jing2.liu, yang.zhong

From: Jing Liu <jing2.liu@intel.com>

Document the detailed information of the new KVM_{G, S}ET_XSAVE2 ioctls.

Signed-off-by: Jing Liu <jing2.liu@intel.com>
Signed-off-by: Yang Zhong <yang.zhong@intel.com>
---
 Documentation/virt/kvm/api.rst | 47 ++++++++++++++++++++++++++++++++++
 1 file changed, 47 insertions(+)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index aeeb071c7688..39dfd867e429 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -1569,6 +1569,8 @@ otherwise it will return EBUSY error.
   };
 
 This ioctl would copy current vcpu's xsave struct to the userspace.
+Applications should use KVM_GET_XSAVE2 if the xsave state is larger
+than 4KB.
 
 
 4.43 KVM_SET_XSAVE
@@ -1588,6 +1590,8 @@ This ioctl would copy current vcpu's xsave struct to the userspace.
   };
 
 This ioctl would copy userspace's xsave struct to the kernel.
+Applications should use KVM_SET_XSAVE2 if the xsave state is larger
+than 4KB.
 
 
 4.44 KVM_GET_XCRS
@@ -7484,3 +7488,46 @@ The argument to KVM_ENABLE_CAP is also a bitmask, and must be a subset
 of the result of KVM_CHECK_EXTENSION.  KVM will forward to userspace
 the hypercalls whose corresponding bit is in the argument, and return
 ENOSYS for the others.
+
+8.35 KVM_GET_XSAVE2
+-------------------
+
+:Capability: KVM_CAP_XSAVE2
+:Architectures: x86
+:Type: vcpu ioctl
+:Parameters: struct kvm_xsave2 (in/out)
+:Returns: 0 on success, -1 on error
+
+
+::
+
+  struct kvm_xsave2 {
+	__u32 size;
+	__u8 state[0];
+  };
+
+This ioctl is used to copy the current vcpu's xsave state to
+userspace when the xsave state size is larger than 4KB. Application
+code should set the 'size' member, which indicates the size of the
+xsave state, and KVM copies the xsave state into the 'state' region.
+
+8.36 KVM_SET_XSAVE2
+-------------------
+
+:Capability: KVM_CAP_XSAVE2
+:Architectures: x86
+:Type: vcpu ioctl
+:Parameters: struct kvm_xsave2 (in)
+:Returns: 0 on success, -1 on error
+
+
+::
+
+  struct kvm_xsave2 {
+	__u32 size;
+	__u8 state[0];
+  };
+
+This ioctl is used to copy userspace's xsave state to the kernel when
+the xsave state size is larger than 4KB. Application code should set
+the 'size' member, which indicates the size of the xsave state.

^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [PATCH 18/19] kvm: x86: AMX XCR0 support for guest
  2021-12-08  0:03 [PATCH 00/19] AMX Support in KVM Yang Zhong
                   ` (16 preceding siblings ...)
  2021-12-08  0:03 ` [PATCH 17/19] docs: virt: api.rst: Document the new KVM_{G, S}ET_XSAVE2 ioctls Yang Zhong
@ 2021-12-08  0:03 ` Yang Zhong
  2021-12-10 16:30   ` Paolo Bonzini
  2021-12-08  0:03 ` [PATCH 19/19] kvm: x86: Add AMX CPUIDs support Yang Zhong
  2021-12-11 21:20 ` [PATCH 00/19] AMX Support in KVM Thomas Gleixner
  19 siblings, 1 reply; 80+ messages in thread
From: Yang Zhong @ 2021-12-08  0:03 UTC (permalink / raw)
  To: x86, kvm, linux-kernel, tglx, mingo, bp, dave.hansen, pbonzini
  Cc: seanjc, jun.nakajima, kevin.tian, jing2.liu, jing2.liu, yang.zhong

From: Jing Liu <jing2.liu@intel.com>

Two XCR0 bits are defined for AMX to support the XSAVE mechanism:
bit 17 is for XTILECFG and bit 18 is for XTILEDATA.

The value of XCR0[18:17] is always either 00b or 11b. Also, the SDM
recommends that only 64-bit operating systems enable Intel AMX by
setting XCR0[18:17].

Signed-off-by: Jing Liu <jing2.liu@intel.com>
Signed-off-by: Yang Zhong <yang.zhong@intel.com>
---
 arch/x86/kvm/x86.c | 19 ++++++++++++++++++-
 1 file changed, 18 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index d212f6d2d39a..a9a608c8fa50 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -210,7 +210,7 @@ static struct kvm_user_return_msrs __percpu *user_return_msrs;
 #define KVM_SUPPORTED_XCR0     (XFEATURE_MASK_FP | XFEATURE_MASK_SSE \
 				| XFEATURE_MASK_YMM | XFEATURE_MASK_BNDREGS \
 				| XFEATURE_MASK_BNDCSR | XFEATURE_MASK_AVX512 \
-				| XFEATURE_MASK_PKRU)
+				| XFEATURE_MASK_PKRU | XFEATURE_MASK_XTILE)
 
 u64 __read_mostly host_efer;
 EXPORT_SYMBOL_GPL(host_efer);
@@ -1017,6 +1017,23 @@ static int __kvm_set_xcr(struct kvm_vcpu *vcpu, u32 index, u64 xcr)
 		if ((xcr0 & XFEATURE_MASK_AVX512) != XFEATURE_MASK_AVX512)
 			return 1;
 	}
+
+#ifdef CONFIG_X86_64
+	if ((xcr0 & XFEATURE_MASK_XTILE) &&
+	    ((xcr0 & XFEATURE_MASK_XTILE) != XFEATURE_MASK_XTILE))
+		return 1;
+#else
+	/*
+	 * Intel AMX instructions can be executed only in 64-bit mode but
+	 * XSAVE can operate on XTILECFG and XTILEDATA in any mode.
+	 * Since the FPU core follows the SDM recommendation to set
+	 * XCR0[18:17] only in a 64-bit environment, also prevent any
+	 * guest OS from setting the two bits when the host is 32-bit.
+	 *
+	 * XFEATURE_MASK_XTILE cannot be used since it is 0 in this case.
+	 */
+	xcr0 &= ~(XFEATURE_MASK_XTILE_DATA | XFEATURE_MASK_XTILE_CFG);
+#endif
 	vcpu->arch.xcr0 = xcr0;
 
 	if ((xcr0 ^ old_xcr0) & XFEATURE_MASK_EXTEND)

^ permalink raw reply related	[flat|nested] 80+ messages in thread

* [PATCH 19/19] kvm: x86: Add AMX CPUIDs support
  2021-12-08  0:03 [PATCH 00/19] AMX Support in KVM Yang Zhong
                   ` (17 preceding siblings ...)
  2021-12-08  0:03 ` [PATCH 18/19] kvm: x86: AMX XCR0 support for guest Yang Zhong
@ 2021-12-08  0:03 ` Yang Zhong
  2021-12-10 21:52   ` Paolo Bonzini
  2021-12-11 21:20 ` [PATCH 00/19] AMX Support in KVM Thomas Gleixner
  19 siblings, 1 reply; 80+ messages in thread
From: Yang Zhong @ 2021-12-08  0:03 UTC (permalink / raw)
  To: x86, kvm, linux-kernel, tglx, mingo, bp, dave.hansen, pbonzini
  Cc: seanjc, jun.nakajima, kevin.tian, jing2.liu, jing2.liu, yang.zhong

From: Jing Liu <jing2.liu@intel.com>

Extend CPUID emulation to support XFD, AMX_TILE, AMX_INT8 and
AMX_BF16. Adding those bits into kvm_cpu_caps finally activates all
the previous logic in this series.

Signed-off-by: Jing Liu <jing2.liu@intel.com>
Signed-off-by: Yang Zhong <yang.zhong@intel.com>
---
 arch/x86/include/asm/cpufeatures.h |  2 ++
 arch/x86/kvm/cpuid.c               | 16 +++++++++++++---
 2 files changed, 15 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index d5b5f2ab87a0..da872b6f8d8b 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -299,7 +299,9 @@
 /* Intel-defined CPU features, CPUID level 0x00000007:1 (EAX), word 12 */
 #define X86_FEATURE_AVX_VNNI		(12*32+ 4) /* AVX VNNI instructions */
 #define X86_FEATURE_AVX512_BF16		(12*32+ 5) /* AVX512 BFLOAT16 instructions */
+#define X86_FEATURE_AMX_BF16		(18*32+22) /* AMX bf16 Support */
 #define X86_FEATURE_AMX_TILE		(18*32+24) /* AMX tile Support */
+#define X86_FEATURE_AMX_INT8		(18*32+25) /* AMX int8 Support */
 
 /* AMD-defined CPU features, CPUID level 0x80000008 (EBX), word 13 */
 #define X86_FEATURE_CLZERO		(13*32+ 0) /* CLZERO instruction */
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index ea51b986ee67..7bb56cc89aa7 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -510,7 +510,8 @@ void kvm_set_cpu_caps(void)
 		F(AVX512_4VNNIW) | F(AVX512_4FMAPS) | F(SPEC_CTRL) |
 		F(SPEC_CTRL_SSBD) | F(ARCH_CAPABILITIES) | F(INTEL_STIBP) |
 		F(MD_CLEAR) | F(AVX512_VP2INTERSECT) | F(FSRM) |
-		F(SERIALIZE) | F(TSXLDTRK) | F(AVX512_FP16)
+		F(SERIALIZE) | F(TSXLDTRK) | F(AVX512_FP16) |
+		F(AMX_TILE) | F(AMX_INT8) | F(AMX_BF16)
 	);
 
 	/* TSC_ADJUST and ARCH_CAPABILITIES are emulated in software. */
@@ -529,7 +530,7 @@ void kvm_set_cpu_caps(void)
 	);
 
 	kvm_cpu_cap_mask(CPUID_D_1_EAX,
-		F(XSAVEOPT) | F(XSAVEC) | F(XGETBV1) | F(XSAVES)
+		F(XSAVEOPT) | F(XSAVEC) | F(XGETBV1) | F(XSAVES) | F(XFD)
 	);
 
 	kvm_cpu_cap_init_scattered(CPUID_12_EAX,
@@ -655,6 +656,8 @@ static struct kvm_cpuid_entry2 *do_host_cpuid(struct kvm_cpuid_array *array,
 	case 0x14:
 	case 0x17:
 	case 0x18:
+	case 0x1d:
+	case 0x1e:
 	case 0x1f:
 	case 0x8000001d:
 		entry->flags |= KVM_CPUID_FLAG_SIGNIFCANT_INDEX;
@@ -779,6 +782,7 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function)
 		}
 		break;
 	case 9:
+	case 0x1e: /* TMUL information */
 		break;
 	case 0xa: { /* Architectural Performance Monitoring */
 		struct x86_pmu_capability cap;
@@ -914,7 +918,8 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function)
 		break;
 	/* Intel PT */
 	case 0x14:
-		if (!kvm_cpu_cap_has(X86_FEATURE_INTEL_PT)) {
+		if ((function == 0x14 && !kvm_cpu_cap_has(X86_FEATURE_INTEL_PT)) ||
+		    (function == 0x1d && !kvm_cpu_cap_has(X86_FEATURE_AMX_TILE))) {
 			entry->eax = entry->ebx = entry->ecx = entry->edx = 0;
 			break;
 		}
@@ -924,6 +929,11 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function)
 				goto out;
 		}
 		break;
+	/* Intel AMX TILE */
+	case 0x1d:
+		if (!kvm_cpu_cap_has(X86_FEATURE_AMX_TILE))
+			entry->eax = entry->ebx = entry->ecx = entry->edx = 0;
+		break;
 	case KVM_CPUID_SIGNATURE: {
 		const u32 *sigptr = (const u32 *)KVM_SIGNATURE;
 		entry->eax = KVM_CPUID_FEATURES;

^ permalink raw reply related	[flat|nested] 80+ messages in thread

* RE: [PATCH 13/19] kvm: x86: Disable WRMSR interception for IA32_XFD on demand
  2021-12-08  0:03 ` [PATCH 13/19] kvm: x86: Disable WRMSR interception for IA32_XFD on demand Yang Zhong
@ 2021-12-08  7:23   ` Liu, Jing2
  0 siblings, 0 replies; 80+ messages in thread
From: Liu, Jing2 @ 2021-12-08  7:23 UTC (permalink / raw)
  To: Zhong, Yang, x86, kvm, linux-kernel, tglx, mingo, bp,
	dave.hansen, pbonzini
  Cc: seanjc, Nakajima, Jun, Tian, Kevin, jing2.liu


On 12/8/2021 8:03 AM, Yang Zhong wrote: 
> From: Jing Liu <jing2.liu@intel.com>
> 
> Always intercepting IA32_XFD causes non-negligible overhead when this
> register is updated frequently in the guest.
> 
> Disable WRMSR interception for IA32_XFD after fpstate reallocation is
> completed. There are three options for when to disable the
> interception:
> 
>   1) When emulating the 1st WRMSR which requires reallocation,
>      disable interception before exiting to userspace with the
>      assumption that the userspace VMM should not bounce back to
>      the kernel if reallocation fails. However it's not good to
>      design the kernel based on assumed application behavior. If,
>      due to a bug, the vCPU thread comes back to the kernel after
>      reallocation fails, XFD passthrough may lead to host memory
>      corruption when doing XSAVES for a guest fpstate which has a
>      smaller size than what the guest XFD value allows.
> 
>   2) Disable interception when coming back from the userspace VMM
>      (for the 1st WRMSR which triggers reallocation). Re-check
>      whether fpstate size can serve the new guest XFD value. Disable
>      interception only when the check succeeds. This requires KVM
>      to store guest XFD value in some place and then compare it
>      to guest_fpu::user_xfeatures in the completion handler.

For option 2), we are considering that fpstate->size can be used to indicate
whether reallocation was successful. Once one of the XFD features (today,
it's AMX) is enabled, the kernel needs to reallocate the full size; otherwise
KVM has no chance to reallocate for other XFD features later, since the MSR
is no longer trapped (to avoid WRMSR VM-exits due to the guest toggling XFD).

Then KVM doesn't need to store the guest XFD value anywhere. And the kernel
fpu core may need an API to tell KVM the guest's permitted size.
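
Roughly, in the completion handler (fpu_guest_permitted_size() is a
hypothetical name for such an API):

    if (vcpu->arch.guest_fpu.fpstate->size ==
        fpu_guest_permitted_size(&vcpu->arch.guest_fpu))
            static_call(kvm_x86_set_xfd_passthrough)(vcpu);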

Thanks,
Jing

> 
>   3) Disable interception at the 2nd WRMSR which enables dynamic
>      XSTATE features. If guest_fpu::user_xfeatures already includes
>      the bits for the dynamic features being enabled by the guest
>      XFD value, disable interception.
> 


^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 07/19] kvm: x86: Propagate fpstate reallocation error to userspace
  2021-12-08  0:03 ` [PATCH 07/19] kvm: x86: Propagate fpstate reallocation error to userspace Yang Zhong
@ 2021-12-10 15:44   ` Paolo Bonzini
  0 siblings, 0 replies; 80+ messages in thread
From: Paolo Bonzini @ 2021-12-10 15:44 UTC (permalink / raw)
  To: Yang Zhong, x86, kvm, linux-kernel, tglx, mingo, bp, dave.hansen
  Cc: seanjc, jun.nakajima, kevin.tian, jing2.liu, jing2.liu

On 12/8/21 01:03, Yang Zhong wrote:
>   out:
> -	kvm_put_guest_fpu(vcpu);
> +	ret = kvm_put_guest_fpu(vcpu);
> +	if ((r >= 0) && (ret < 0))
> +		r = ret;
> +

No extra parentheses.

Paolo

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 10/19] kvm: x86: Emulate WRMSR of guest IA32_XFD
  2021-12-08  0:03 ` [PATCH 10/19] kvm: x86: Emulate WRMSR of guest IA32_XFD Yang Zhong
@ 2021-12-10 16:02   ` Paolo Bonzini
  2021-12-13  7:51     ` Liu, Jing2
  2021-12-14 10:26     ` Yang Zhong
  2021-12-10 23:09   ` Thomas Gleixner
  2021-12-13 15:06   ` Paolo Bonzini
  2 siblings, 2 replies; 80+ messages in thread
From: Paolo Bonzini @ 2021-12-10 16:02 UTC (permalink / raw)
  To: Yang Zhong, x86, kvm, linux-kernel, tglx, mingo, bp, dave.hansen
  Cc: seanjc, jun.nakajima, kevin.tian, jing2.liu, jing2.liu

First, the MSR should be added to msrs_to_save_all and 
kvm_cpu_cap_has(X86_FEATURE_XFD) should be checked in kvm_init_msr_list.

It seems that RDMSR support is missing, too.

More important, please include:

- documentation for the new KVM_EXIT_* value

- a selftest that explains how userspace should react to it.

This is a strong requirement for any new API (the first has been for 
years; but the latter is also almost always respected these days).  This 
series should not have been submitted without documentation.

Also:

On 12/8/21 01:03, Yang Zhong wrote:
> 
> +		if (!guest_cpuid_has(vcpu, X86_FEATURE_XFD))
> +			return 1;

This should allow msr->host_initiated always (even if XFD is not part of 
CPUID).  However, if XFD is nonzero and kvm_check_guest_realloc_fpstate 
returns true, then it should return 1.

The selftest should also cover using KVM_GET_MSR/KVM_SET_MSR.

> +		/* Setting unsupported bits causes #GP */
> +		if (~XFEATURE_MASK_USER_DYNAMIC & data) {
> +			kvm_inject_gp(vcpu, 0);
> +			break;
> +		}

This should check

	if (data & ~(XFEATURE_MASK_USER_DYNAMIC &
		    vcpu->arch.guest_supported_xcr0))

instead.

Paolo

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 14/19] x86/fpu: Prepare for KVM XFD_ERR handling
  2021-12-08  0:03 ` [PATCH 14/19] x86/fpu: Prepare for KVM XFD_ERR handling Yang Zhong
@ 2021-12-10 16:16   ` Paolo Bonzini
  2021-12-10 23:20   ` Thomas Gleixner
  1 sibling, 0 replies; 80+ messages in thread
From: Paolo Bonzini @ 2021-12-10 16:16 UTC (permalink / raw)
  To: Yang Zhong, x86, kvm, linux-kernel, tglx, mingo, bp, dave.hansen
  Cc: seanjc, jun.nakajima, kevin.tian, jing2.liu, jing2.liu

On 12/8/21 01:03, Yang Zhong wrote:
> 
> The guest XFD_ERR value is saved in fpu_guest::xfd_err. There is
> no need to save host XFD_ERR because it's always cleared to ZERO
> by the host #NM handler (which cannot be preempted by a vCPU
> thread to observe a non-zero value).
> 
> The lower two bits in fpu_guest::xfd_err is borrowed for special
> purposes. The state components (FP and SSE) covered by the two
> bits are not XSAVE-enabled feature, thus not XFD-enabled either.
> It's impossible to see hardware setting them in XFD_ERR:
> 
>    - XFD_ERR_GUEST_DISABLED (bit 0)
> 
>      Indicate that XFD extension is not exposed to the guest thus
>      no need to save/restore it.

Can this instead just check if xfeatures includes any dynamic-save features?

Paolo

>    - XFD_ERR_GUEST_SAVED (bit 1)
> 
>      Indicate fpu_guest::xfd_err already contains a saved value
>      thus no need for duplicated saving (e.g. when the vCPU thread
>      is preempted multiple times before re-enter the guest).


^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 15/19] kvm: x86: Save and restore guest XFD_ERR properly
  2021-12-08  0:03 ` [PATCH 15/19] kvm: x86: Save and restore guest XFD_ERR properly Yang Zhong
@ 2021-12-10 16:23   ` Paolo Bonzini
  2021-12-10 22:01   ` Paolo Bonzini
  2021-12-11  0:10   ` Thomas Gleixner
  2 siblings, 0 replies; 80+ messages in thread
From: Paolo Bonzini @ 2021-12-10 16:23 UTC (permalink / raw)
  To: Yang Zhong, x86, kvm, linux-kernel, tglx, mingo, bp, dave.hansen
  Cc: seanjc, jun.nakajima, kevin.tian, jing2.liu, jing2.liu

On 12/8/21 01:03, Yang Zhong wrote:
>   		kvm_steal_time_set_preempted(vcpu);
>   	srcu_read_unlock(&vcpu->kvm->srcu, idx);
>   
> +	if (vcpu->preempted)
> +		fpu_save_guest_xfd_err(&vcpu->arch.guest_fpu);
> +

Instead of checking vcpu->preempted, can you instead check if the active 
FPU is the guest FPU?  That is, save if 
current->thread.fpu->fpstate->is_guest?

Paolo

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 16/19] kvm: x86: Introduce KVM_{G|S}ET_XSAVE2 ioctl
  2021-12-08  0:03 ` [PATCH 16/19] kvm: x86: Introduce KVM_{G|S}ET_XSAVE2 ioctl Yang Zhong
@ 2021-12-10 16:25   ` Paolo Bonzini
  2021-12-10 16:30   ` Paolo Bonzini
  1 sibling, 0 replies; 80+ messages in thread
From: Paolo Bonzini @ 2021-12-10 16:25 UTC (permalink / raw)
  To: Yang Zhong, x86, kvm, linux-kernel, tglx, mingo, bp, dave.hansen
  Cc: seanjc, jun.nakajima, kevin.tian, jing2.liu, jing2.liu

On 12/8/21 01:03, Yang Zhong wrote:
> From: Jing Liu <jing2.liu@intel.com>
> 
> When dynamic XSTATE features are supported, the xsave states are
> beyond 4KB. The current kvm_xsave structure and related
> KVM_{G, S}ET_XSAVE only allows 4KB which is not enough for full
> states.
> 
> Introduce a new kvm_xsave2 structure and the corresponding
> KVM_GET_XSAVE2 and KVM_SET_XSAVE2 ioctls so that userspace VMM
> can get and set the full xsave states.
> 
> Signed-off-by: Jing Liu <jing2.liu@intel.com>
> Signed-off-by: Yang Zhong <yang.zhong@intel.com>
> ---
>   arch/x86/include/uapi/asm/kvm.h |  6 ++++
>   arch/x86/kvm/x86.c              | 62 +++++++++++++++++++++++++++++++++
>   include/uapi/linux/kvm.h        |  7 +++-
>   3 files changed, 74 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
> index 5a776a08f78c..de42a51e20c3 100644
> --- a/arch/x86/include/uapi/asm/kvm.h
> +++ b/arch/x86/include/uapi/asm/kvm.h
> @@ -47,6 +47,7 @@
>   #define __KVM_HAVE_VCPU_EVENTS
>   #define __KVM_HAVE_DEBUGREGS
>   #define __KVM_HAVE_XSAVE
> +#define __KVM_HAVE_XSAVE2
>   #define __KVM_HAVE_XCRS
>   #define __KVM_HAVE_READONLY_MEM
>   
> @@ -378,6 +379,11 @@ struct kvm_xsave {
>   	__u32 region[1024];
>   };
>   
> +struct kvm_xsave2 {
> +	__u32 size;
> +	__u8 state[0];
> +};
> +
>   #define KVM_MAX_XCRS	16
>   
>   struct kvm_xcr {
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 8b033c9241d6..d212f6d2d39a 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -4216,6 +4216,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
>   	case KVM_CAP_DEBUGREGS:
>   	case KVM_CAP_X86_ROBUST_SINGLESTEP:
>   	case KVM_CAP_XSAVE:
> +	case KVM_CAP_XSAVE2:
>   	case KVM_CAP_ASYNC_PF:
>   	case KVM_CAP_ASYNC_PF_INT:
>   	case KVM_CAP_GET_TSC_KHZ:
> @@ -4940,6 +4941,17 @@ static void kvm_vcpu_ioctl_x86_get_xsave(struct kvm_vcpu *vcpu,
>   				       vcpu->arch.pkru);
>   }
>   
> +static void kvm_vcpu_ioctl_x86_get_xsave2(struct kvm_vcpu *vcpu,
> +					  u8 *state, u32 size)
> +{
> +	if (fpstate_is_confidential(&vcpu->arch.guest_fpu))
> +		return;
> +
> +	fpu_copy_guest_fpstate_to_uabi(&vcpu->arch.guest_fpu,
> +				       state, size,
> +				       vcpu->arch.pkru);
> +}
> +
>   static int kvm_vcpu_ioctl_x86_set_xsave(struct kvm_vcpu *vcpu,
>   					struct kvm_xsave *guest_xsave)
>   {
> @@ -4951,6 +4963,15 @@ static int kvm_vcpu_ioctl_x86_set_xsave(struct kvm_vcpu *vcpu,
>   					      supported_xcr0, &vcpu->arch.pkru);
>   }
>   
> +static int kvm_vcpu_ioctl_x86_set_xsave2(struct kvm_vcpu *vcpu, u8 *state)
> +{
> +	if (fpstate_is_confidential(&vcpu->arch.guest_fpu))
> +		return 0;
> +
> +	return fpu_copy_uabi_to_guest_fpstate(&vcpu->arch.guest_fpu, state,
> +					      supported_xcr0, &vcpu->arch.pkru);
> +}
> +
>   static void kvm_vcpu_ioctl_x86_get_xcrs(struct kvm_vcpu *vcpu,
>   					struct kvm_xcrs *guest_xcrs)
>   {
> @@ -5416,6 +5437,47 @@ long kvm_arch_vcpu_ioctl(struct file *filp,
>   		r = kvm_vcpu_ioctl_x86_set_xsave(vcpu, u.xsave);
>   		break;
>   	}
> +	case KVM_GET_XSAVE2: {
> +		struct kvm_xsave2 __user *xsave2_arg = argp;
> +		struct kvm_xsave2 xsave2;
> +
> +		r = -EFAULT;
> +		if (copy_from_user(&xsave2, xsave2_arg, sizeof(struct kvm_xsave2)))
> +			break;
> +
> +		u.buffer = kzalloc(xsave2.size, GFP_KERNEL_ACCOUNT);
> +
> +		r = -ENOMEM;
> +		if (!u.buffer)
> +			break;
> +
> +		kvm_vcpu_ioctl_x86_get_xsave2(vcpu, u.buffer, xsave2.size);
> +
> +		r = -EFAULT;
> +		if (copy_to_user(xsave2_arg->state, u.buffer, xsave2.size))
> +			break;
> +
> +		r = 0;
> +		break;
> +	}
> +	case KVM_SET_XSAVE2: {
> +		struct kvm_xsave2 __user *xsave2_arg = argp;
> +		struct kvm_xsave2 xsave2;
> +
> +		r = -EFAULT;
> +		if (copy_from_user(&xsave2, xsave2_arg, sizeof(struct kvm_xsave2)))
> +			break;
> +
> +		u.buffer = memdup_user(xsave2_arg->state, xsave2.size);
> +
> +		if (IS_ERR(u.buffer)) {
> +			r = PTR_ERR(u.buffer);
> +			goto out_nofree;
> +		}
> +
> +		r = kvm_vcpu_ioctl_x86_set_xsave2(vcpu, u.buffer);
> +		break;
> +	}
>   	case KVM_GET_XCRS: {
>   		u.xcrs = kzalloc(sizeof(struct kvm_xcrs), GFP_KERNEL_ACCOUNT);
>   		r = -ENOMEM;
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index 0c7b301c7254..603e1ca9ba09 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -1132,7 +1132,9 @@ struct kvm_ppc_resize_hpt {
>   #define KVM_CAP_EXIT_ON_EMULATION_FAILURE 204
>   #define KVM_CAP_ARM_MTE 205
>   #define KVM_CAP_VM_MOVE_ENC_CONTEXT_FROM 206
> -
> +#ifdef __KVM_HAVE_XSAVE2
> +#define KVM_CAP_XSAVE2 207
> +#endif
>   #ifdef KVM_CAP_IRQ_ROUTING
>   
>   struct kvm_irq_routing_irqchip {
> @@ -1679,6 +1681,9 @@ struct kvm_xen_hvm_attr {
>   #define KVM_GET_SREGS2             _IOR(KVMIO,  0xcc, struct kvm_sregs2)
>   #define KVM_SET_SREGS2             _IOW(KVMIO,  0xcd, struct kvm_sregs2)
>   
> +#define KVM_GET_XSAVE2		   _IOR(KVMIO,  0xcf, struct kvm_xsave2)
> +#define KVM_SET_XSAVE2		   _IOW(KVMIO,  0xd0, struct kvm_xsave2)
> +
>   struct kvm_xen_vcpu_attr {
>   	__u16 type;
>   	__u16 pad[3];
> 

Please also modify KVM_GET/SET_XSAVE to fail with ENOSPC if the 
requested size is bigger than 4096.

Paolo

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 16/19] kvm: x86: Introduce KVM_{G|S}ET_XSAVE2 ioctl
  2021-12-08  0:03 ` [PATCH 16/19] kvm: x86: Introduce KVM_{G|S}ET_XSAVE2 ioctl Yang Zhong
  2021-12-10 16:25   ` Paolo Bonzini
@ 2021-12-10 16:30   ` Paolo Bonzini
  2021-12-10 22:13     ` Paolo Bonzini
  2021-12-13 10:10     ` [PATCH 16/19] kvm: x86: Introduce KVM_{G|S}ET_XSAVE2 ioctl Thomas Gleixner
  1 sibling, 2 replies; 80+ messages in thread
From: Paolo Bonzini @ 2021-12-10 16:30 UTC (permalink / raw)
  To: Yang Zhong, x86, kvm, linux-kernel, tglx, mingo, bp, dave.hansen
  Cc: seanjc, jun.nakajima, kevin.tian, jing2.liu, jing2.liu

On 12/8/21 01:03, Yang Zhong wrote:
> +static void kvm_vcpu_ioctl_x86_get_xsave2(struct kvm_vcpu *vcpu,
> +					  u8 *state, u32 size)
> +{
> +	if (fpstate_is_confidential(&vcpu->arch.guest_fpu))
> +		return;
> +
> +	fpu_copy_guest_fpstate_to_uabi(&vcpu->arch.guest_fpu,
> +				       state, size,
> +				       vcpu->arch.pkru);
> +}
> +
>   static int kvm_vcpu_ioctl_x86_set_xsave(struct kvm_vcpu *vcpu,
>   					struct kvm_xsave *guest_xsave)
>   {
> @@ -4951,6 +4963,15 @@ static int kvm_vcpu_ioctl_x86_set_xsave(struct kvm_vcpu *vcpu,
>   					      supported_xcr0, &vcpu->arch.pkru);
>   }
>   
> +static int kvm_vcpu_ioctl_x86_set_xsave2(struct kvm_vcpu *vcpu, u8 *state)
> +{
> +	if (fpstate_is_confidential(&vcpu->arch.guest_fpu))
> +		return 0;
> +
> +	return fpu_copy_uabi_to_guest_fpstate(&vcpu->arch.guest_fpu, state,
> +					      supported_xcr0, &vcpu->arch.pkru);
> +}
> +

I think fpu_copy_uabi_to_guest_fpstate (and therefore 
copy_uabi_from_kernel_to_xstate) needs to check that the size is 
compatible with the components in the input.

Also, IIUC the size of the AMX state will vary in different processors. 
  Is this correct?  If so, this should be handled already by 
KVM_GET/SET_XSAVE2 and therefore should be part of the 
arch/x86/kernel/fpu APIs.  In the future we want to support migrating a 
"small AMX" host to a "large AMX" host; and also migrating from a "large 
AMX" host to a "small AMX" host if the guest CPUID is compatible with 
the destination of the migration.

Paolo

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 18/19] kvm: x86: AMX XCR0 support for guest
  2021-12-08  0:03 ` [PATCH 18/19] kvm: x86: AMX XCR0 support for guest Yang Zhong
@ 2021-12-10 16:30   ` Paolo Bonzini
  0 siblings, 0 replies; 80+ messages in thread
From: Paolo Bonzini @ 2021-12-10 16:30 UTC (permalink / raw)
  To: Yang Zhong, x86, kvm, linux-kernel, tglx, mingo, bp, dave.hansen
  Cc: seanjc, jun.nakajima, kevin.tian, jing2.liu, jing2.liu

On 12/8/21 01:03, Yang Zhong wrote:
> +
> +#ifdef CONFIG_X86_64
> +	if ((xcr0 & XFEATURE_MASK_XTILE) &&
> +	    ((xcr0 & XFEATURE_MASK_XTILE) != XFEATURE_MASK_XTILE))
> +		return 1;
> +#else
> +	/*
> +	 * Intel AMX instructions can be executed only in 64-bit mode but
> +	 * XSAVE can operate on XTILECFG and XTILEDATA in any mode.
> +	 * Since the FPU core follows SDM recommendation to set
> +	 * XCR[18:17] only in 64-bit environment, here also prevent any
> +	 * guest OS from setting the two bits when host is 32-bit.
> +	 *
> +	 * XFEATURE_MASK_XTILE cannot be used since it is 0 in this case.
> +	 */
> +	xcr0 &= ~(XFEATURE_MASK_XTILE_DATA | XFEATURE_MASK_XTILE_CFG);
> +#endif

This should not be necessary, because on a 32-bit system the bits won't 
be part of supported_xcr0.

Paolo

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 19/19] kvm: x86: Add AMX CPUIDs support
  2021-12-08  0:03 ` [PATCH 19/19] kvm: x86: Add AMX CPUIDs support Yang Zhong
@ 2021-12-10 21:52   ` Paolo Bonzini
  0 siblings, 0 replies; 80+ messages in thread
From: Paolo Bonzini @ 2021-12-10 21:52 UTC (permalink / raw)
  To: Yang Zhong, x86, kvm, linux-kernel, tglx, mingo, bp, dave.hansen
  Cc: seanjc, jun.nakajima, kevin.tian, jing2.liu, jing2.liu

On 12/8/21 01:03, Yang Zhong wrote:
> @@ -914,7 +918,8 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function)
>   		break;
>   	/* Intel PT */
>   	case 0x14:
> -		if (!kvm_cpu_cap_has(X86_FEATURE_INTEL_PT)) {
> +		if ((function == 0x14 && !kvm_cpu_cap_has(X86_FEATURE_INTEL_PT)) ||
> +		    (function == 0x1d && !kvm_cpu_cap_has(X86_FEATURE_AMX_TILE))) {

This hunk is wrong.

>   			entry->eax = entry->ebx = entry->ecx = entry->edx = 0;
>   			break;
>   		}
> @@ -924,6 +929,11 @@ static inline int __do_cpuid_func(struct kvm_cpuid_array *array, u32 function)
>   				goto out;
>   		}
>   		break;
> +	/* Intel AMX TILE */
> +	case 0x1d:
> +		if (!kvm_cpu_cap_has(X86_FEATURE_AMX_TILE))
> +			entry->eax = entry->ebx = entry->ecx = entry->edx = 0;
> +		break;

This also needs a loop similar to the one in case 0x14; so the "break" 
goes inside the "if" and then you have

                 for (i = 1, max_idx = entry->eax; i <= max_idx; ++i) {
                         if (!do_host_cpuid(array, function, i))
                                 goto out;
                 }


Same for 0x1e, which also needs to be marked conditional.

Paolo

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 15/19] kvm: x86: Save and restore guest XFD_ERR properly
  2021-12-08  0:03 ` [PATCH 15/19] kvm: x86: Save and restore guest XFD_ERR properly Yang Zhong
  2021-12-10 16:23   ` Paolo Bonzini
@ 2021-12-10 22:01   ` Paolo Bonzini
  2021-12-12 13:10     ` Yang Zhong
  2021-12-11  0:10   ` Thomas Gleixner
  2 siblings, 1 reply; 80+ messages in thread
From: Paolo Bonzini @ 2021-12-10 22:01 UTC (permalink / raw)
  To: Yang Zhong, x86, kvm, linux-kernel, tglx, mingo, bp, dave.hansen
  Cc: seanjc, jun.nakajima, kevin.tian, jing2.liu, jing2.liu

On 12/8/21 01:03, Yang Zhong wrote:
> --- a/arch/x86/kvm/cpuid.c
> +++ b/arch/x86/kvm/cpuid.c
> @@ -219,6 +219,11 @@ static void kvm_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
>   		kvm_apic_set_version(vcpu);
>   	}
>   
> +	/* Enable saving guest XFD_ERR */
> +	best = kvm_find_cpuid_entry(vcpu, 7, 0);
> +	if (best && cpuid_entry_has(best, X86_FEATURE_AMX_TILE))
> +		vcpu->arch.guest_fpu.xfd_err = 0;
> +

This is incorrect.  Instead it should check whether leaf 0xD includes 
any dynamic features.
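
Roughly like this (sketch; XFEATURE_MASK_USER_DYNAMIC as defined by the
fpu series):

	best = kvm_find_cpuid_entry(vcpu, 0xd, 0);
	if (best && (best->eax & XFEATURE_MASK_USER_DYNAMIC))
		vcpu->arch.guest_fpu.xfd_err = 0;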

Paolo

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 16/19] kvm: x86: Introduce KVM_{G|S}ET_XSAVE2 ioctl
  2021-12-10 16:30   ` Paolo Bonzini
@ 2021-12-10 22:13     ` Paolo Bonzini
  2021-12-13  8:23       ` Wang, Wei W
  2021-12-20 17:54       ` State Component 18 and Palette 1 (Re: [PATCH 16/19] kvm: x86: Introduce KVM_{G|S}ET_XSAVE2 ioctl) Nakajima, Jun
  2021-12-13 10:10     ` [PATCH 16/19] kvm: x86: Introduce KVM_{G|S}ET_XSAVE2 ioctl Thomas Gleixner
  1 sibling, 2 replies; 80+ messages in thread
From: Paolo Bonzini @ 2021-12-10 22:13 UTC (permalink / raw)
  To: Yang Zhong, x86, kvm, linux-kernel, tglx, mingo, bp, dave.hansen
  Cc: seanjc, jun.nakajima, kevin.tian, jing2.liu, jing2.liu

On 12/10/21 17:30, Paolo Bonzini wrote:
>>
>> +static int kvm_vcpu_ioctl_x86_set_xsave2(struct kvm_vcpu *vcpu, u8 
>> *state)
>> +{
>> +    if (fpstate_is_confidential(&vcpu->arch.guest_fpu))
>> +        return 0;
>> +
>> +    return fpu_copy_uabi_to_guest_fpstate(&vcpu->arch.guest_fpu, state,
>> +                          supported_xcr0, &vcpu->arch.pkru);
>> +}
>> +
> 
> I think fpu_copy_uabi_to_guest_fpstate (and therefore 
> copy_uabi_from_kernel_to_xstate) needs to check that the size is 
> compatible with the components in the input.
> 
> Also, IIUC the size of the AMX state will vary in different processors. 
>   Is this correct?  If so, this should be handled already by 
> KVM_GET/SET_XSAVE2 and therefore should be part of the 
> arch/x86/kernel/fpu APIs.  In the future we want to support migrating a 
> "small AMX" host to a "large AMX" host; and also migrating from a "large 
> AMX" host to a "small AMX" host if the guest CPUID is compatible with 
> the destination of the migration.

So, the size of the AMX state will depend on the active "palette" in 
TILECONFIG, and on the CPUID information.  I have a few questions on how 
Intel intends to handle future extensions to AMX:

- can we assume that, in the future, palette 1 will always have the same 
value (bytes_per_row=64, max_names=8, max_rows=16), and basically that 
the only variable value is really the number of palettes?

- how does Intel plan to handle bigger TILEDATA?  Will it use more XCR0 
bits or will it rather enlarge save state 18?

If it will use more XCR0 bits, I suppose that XCR0 bits will control 
which palettes can be chosen by LDTILECFG.

If not, on the other hand, this will be a first case of one system's 
XSAVE data not being XRSTOR-able on another system even if the 
destination system can set XCR0 to the same value as the source system.

Likewise, if the size and offsets for save state 18 were to vary 
depending on the selected palette, then this would be novel, in that the 
save state size and offsets would not be in CPUID anymore.  It would be 
particularly interesting for non-compacted format, where all save states 
after 18 would also move forward.

So, I hope that save state 18 will be frozen to 8k.  In that case, and 
if palette 1 is frozen to the same values as today, implementing 
migration will not be a problem; it will be essentially the same as 
SSE->AVX (horizontal extension of existing registers) and/or AVX->AVX512 
(both horizontal and vertical extension).

By the way, I think KVM_SET_XSAVE2 is not needed.  Instead:

- KVM_CHECK_EXTENSION(KVM_CAP_XSAVE2) should return the size of the 
buffer that is passed to KVM_GET_XSAVE2

- KVM_GET_XSAVE2 should fill in the buffer expecting that its size is 
whatever KVM_CHECK_EXTENSION(KVM_CAP_XSAVE2) passes

- KVM_SET_XSAVE can just expect a buffer that is bigger than 4k if the 
save states recorded in the header point to offsets larger than 4k.
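
From the userspace side the flow would then be (sketch, error handling
omitted; KVM_GET_XSAVE2 being the proposed new ioctl):

	int size = ioctl(vm_fd, KVM_CHECK_EXTENSION, KVM_CAP_XSAVE2);
	struct kvm_xsave *xsave = malloc(size);

	ioctl(vcpu_fd, KVM_GET_XSAVE2, xsave);	/* save */
	...
	ioctl(vcpu_fd, KVM_SET_XSAVE, xsave);	/* restore, same buffer */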

Paolo

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 05/19] x86/fpu: Move xfd initialization out of __fpstate_reset() to the callers
  2021-12-08  0:03 ` [PATCH 05/19] x86/fpu: Move xfd initialization out of __fpstate_reset() to the callers Yang Zhong
@ 2021-12-10 22:33   ` Thomas Gleixner
  0 siblings, 0 replies; 80+ messages in thread
From: Thomas Gleixner @ 2021-12-10 22:33 UTC (permalink / raw)
  To: Yang Zhong, x86, kvm, linux-kernel, mingo, bp, dave.hansen, pbonzini
  Cc: seanjc, jun.nakajima, kevin.tian, jing2.liu, jing2.liu, yang.zhong

Yang, Jing,

On Tue, Dec 07 2021 at 19:03, Yang Zhong wrote:
> diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
> index fe592799508c..fae44fa27cdb 100644
> --- a/arch/x86/kernel/fpu/core.c
> +++ b/arch/x86/kernel/fpu/core.c
> @@ -231,6 +231,7 @@ bool fpu_alloc_guest_fpstate(struct fpu_guest *gfpu)
>  	if (!fpstate)
>  		return false;
>  
> +	/* Leave xfd to 0 (the reset value defined by spec) */
>  	__fpstate_reset(fpstate);

That change makes me a bit wary simply because the comment here is above
__fpstate_reset() which makes no sense. It does make sense to you at the
time, but does it make sense to you when you look at it 6 months down the
road?

So I'd rather make it very obvious what's going on. See below.

Thanks,

        tglx
---

--- a/arch/x86/kernel/fpu/core.c
+++ b/arch/x86/kernel/fpu/core.c
@@ -199,7 +199,7 @@ void fpu_reset_from_exception_fixup(void
 }
 
 #if IS_ENABLED(CONFIG_KVM)
-static void __fpstate_reset(struct fpstate *fpstate);
+static void __fpstate_reset(struct fpstate *fpstate, u64 xfd);
 
 static void fpu_init_guest_permissions(struct fpu_guest *gfpu)
 {
@@ -231,7 +231,8 @@ bool fpu_alloc_guest_fpstate(struct fpu_
 	if (!fpstate)
 		return false;
 
-	__fpstate_reset(fpstate);
+	/* Leave xfd to 0 (the reset value defined by spec) */
+	__fpstate_reset(fpstate, 0);
 	fpstate_init_user(fpstate);
 	fpstate->is_valloc	= true;
 	fpstate->is_guest	= true;
@@ -454,21 +455,21 @@ void fpstate_init_user(struct fpstate *f
 		fpstate_init_fstate(fpstate);
 }
 
-static void __fpstate_reset(struct fpstate *fpstate)
+static void __fpstate_reset(struct fpstate *fpstate, u64 xfd)
 {
 	/* Initialize sizes and feature masks */
 	fpstate->size		= fpu_kernel_cfg.default_size;
 	fpstate->user_size	= fpu_user_cfg.default_size;
 	fpstate->xfeatures	= fpu_kernel_cfg.default_features;
 	fpstate->user_xfeatures	= fpu_user_cfg.default_features;
-	fpstate->xfd		= init_fpstate.xfd;
+	fpstate->xfd		= xfd;
 }
 
 void fpstate_reset(struct fpu *fpu)
 {
 	/* Set the fpstate pointer to the default fpstate */
 	fpu->fpstate = &fpu->__fpstate;
-	__fpstate_reset(fpu->fpstate);
+	__fpstate_reset(fpu->fpstate, init_fpstate.xfd);
 
 	/* Initialize the permission related info in fpu */
 	fpu->perm.__state_perm		= fpu_kernel_cfg.default_features;

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 08/19] x86/fpu: Move xfd_update_state() to xstate.c and export symbol
  2021-12-08  0:03 ` [PATCH 08/19] x86/fpu: Move xfd_update_state() to xstate.c and export symbol Yang Zhong
@ 2021-12-10 22:44   ` Thomas Gleixner
  0 siblings, 0 replies; 80+ messages in thread
From: Thomas Gleixner @ 2021-12-10 22:44 UTC (permalink / raw)
  To: Yang Zhong, x86, kvm, linux-kernel, mingo, bp, dave.hansen, pbonzini
  Cc: seanjc, jun.nakajima, kevin.tian, jing2.liu, jing2.liu, yang.zhong

On Tue, Dec 07 2021 at 19:03, Yang Zhong wrote:
> From: Jing Liu <jing2.liu@intel.com>
>
> xfd_update_state() is the interface to update IA32_XFD and its per-cpu
> cache. All callers of this interface are currently in fpu core. KVM only
> indirectly triggers IA32_XFD update via a helper function
> (fpu_swap_kvm_fpstate()) when switching between user fpu and guest fpu.
>
> Supporting AMX in guest now requires KVM to directly update IA32_XFD
> with the guest value (when emulating WRMSR) so XSAVE/XRSTOR can manage
> XSTATE components correctly inside guest.
>
> This patch moves xfd_update_state() from fpu/xstate.h to fpu/xstate.c

s/This patch moves/Move/

please. See Documentation/process/submitting-patches.rst and search for
'This patch'

> and export it for reference outside of fpu core.
>
> Signed-off-by: Jing Liu <jing2.liu@intel.com>
> Signed-off-by: Yang Zhong <yang.zhong@intel.com>
> ---
>  arch/x86/include/asm/fpu/api.h |  2 ++
>  arch/x86/kernel/fpu/xstate.c   | 12 ++++++++++++
>  arch/x86/kernel/fpu/xstate.h   | 14 +-------------
>  3 files changed, 15 insertions(+), 13 deletions(-)
>
> diff --git a/arch/x86/include/asm/fpu/api.h b/arch/x86/include/asm/fpu/api.h
> index 7532f73c82a6..999d89026be9 100644
> --- a/arch/x86/include/asm/fpu/api.h
> +++ b/arch/x86/include/asm/fpu/api.h
> @@ -131,8 +131,10 @@ DECLARE_PER_CPU(struct fpu *, fpu_fpregs_owner_ctx);
>  /* Process cleanup */
>  #ifdef CONFIG_X86_64
>  extern void fpstate_free(struct fpu *fpu);
> +extern void xfd_update_state(struct fpstate *fpstate);
>  #else
>  static inline void fpstate_free(struct fpu *fpu) { }
> +static void xfd_update_state(struct fpstate *fpstate) { }

Try a 32bit build to see the warnings this causes. That wants to be
'static inline void' obviously.

>  #ifdef CONFIG_X86_64
> -static inline void xfd_update_state(struct fpstate *fpstate)
> -{
> -	if (fpu_state_size_dynamic()) {
> -		u64 xfd = fpstate->xfd;
> -
> -		if (__this_cpu_read(xfd_state) != xfd) {
> -			wrmsrl(MSR_IA32_XFD, xfd);
> -			__this_cpu_write(xfd_state, xfd);
> -		}
> -	}
> -}
> -#else
> -static inline void xfd_update_state(struct fpstate *fpstate) { }
> +extern void xfd_update_state(struct fpstate *fpstate);

Why? It's already declared in the global header. So all of this has to
be simply removed, no?

Thanks,

        tglx



^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 10/19] kvm: x86: Emulate WRMSR of guest IA32_XFD
  2021-12-08  0:03 ` [PATCH 10/19] kvm: x86: Emulate WRMSR of guest IA32_XFD Yang Zhong
  2021-12-10 16:02   ` Paolo Bonzini
@ 2021-12-10 23:09   ` Thomas Gleixner
  2021-12-13 15:06   ` Paolo Bonzini
  2 siblings, 0 replies; 80+ messages in thread
From: Thomas Gleixner @ 2021-12-10 23:09 UTC (permalink / raw)
  To: Yang Zhong, x86, kvm, linux-kernel, mingo, bp, dave.hansen, pbonzini
  Cc: seanjc, jun.nakajima, kevin.tian, jing2.liu, jing2.liu, yang.zhong

On Tue, Dec 07 2021 at 19:03, Yang Zhong wrote:
> +
> +		/*
> +		 * Update IA32_XFD to the guest value so #NM can be
> +		 * raised properly in the guest. Instead of directly
> +		 * writing the MSR, call a helper to avoid breaking
> +		 * per-cpu cached value in fpu core.
> +		 */
> +		fpregs_lock();
> +		current->thread.fpu.fpstate->xfd = data;
> +		xfd_update_state(current->thread.fpu.fpstate);
> +		fpregs_unlock();
> +		break;

Now looking at the actual callsite, the previous patch really should be
something like the below. Why?

It preserves the inline which allows the compiler to generate better
code in the other hot paths and it keeps the FPU internals in the core
code. Hmm?

Thanks,

        tglx

--- a/arch/x86/include/asm/fpu/api.h
+++ b/arch/x86/include/asm/fpu/api.h
@@ -125,8 +125,10 @@ DECLARE_PER_CPU(struct fpu *, fpu_fpregs
 /* Process cleanup */
 #ifdef CONFIG_X86_64
 extern void fpstate_free(struct fpu *fpu);
+extern void fpu_update_xfd_state(u64 xfd);
 #else
 static inline void fpstate_free(struct fpu *fpu) { }
+static inline void fpu_update_xfd_state(u64 xfd) { }
 #endif
 
 /* fpstate-related functions which are exported to KVM */
--- a/arch/x86/kernel/fpu/core.c
+++ b/arch/x86/kernel/fpu/core.c
@@ -322,6 +322,19 @@ int fpu_swap_kvm_fpstate(struct fpu_gues
 }
 EXPORT_SYMBOL_GPL(fpu_swap_kvm_fpstate);
 
+#ifdef CONFIG_X86_64
+void fpu_update_xfd_state(u64 xfd)
+{
+	struct fpstate *fps = current->thread.fpu.fpstate;
+
+	fpregs_lock();
+	fps->xfd = xfd;
+	xfd_update_state(fps);
+	fpregs_unlock();
+}
+EXPORT_SYMBOL_GPL(fpu_update_xfd_state);
+#endif
+
 void fpu_copy_guest_fpstate_to_uabi(struct fpu_guest *gfpu, void *buf,
 				    unsigned int size, u32 pkru)
 {
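
The WRMSR emulation site then shrinks to something like this (sketch):

	case MSR_IA32_XFD:
		if (!guest_cpuid_has(vcpu, X86_FEATURE_XFD))
			return 1;
		fpu_update_xfd_state(data);
		break;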



^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 12/19] x86/fpu: Prepare KVM for bringing XFD state back in-sync
  2021-12-08  0:03 ` [PATCH 12/19] x86/fpu: Prepare KVM for bringing XFD state back in-sync Yang Zhong
@ 2021-12-10 23:11   ` Thomas Gleixner
  0 siblings, 0 replies; 80+ messages in thread
From: Thomas Gleixner @ 2021-12-10 23:11 UTC (permalink / raw)
  To: Yang Zhong, x86, kvm, linux-kernel, mingo, bp, dave.hansen, pbonzini
  Cc: seanjc, jun.nakajima, kevin.tian, jing2.liu, jing2.liu, yang.zhong

On Tue, Dec 07 2021 at 19:03, Yang Zhong wrote:
> From: Thomas Gleixner <tglx@linutronix.de>
>
> Guest may toggle IA32_XFD in high frequency as it is part of the fpstate
> information (features, sizes, xfd) and swapped in task context switch.
>
> To minimize the trap overhead of writes to this MSR, one optimization
> is to allow guest direct write thus eliminate traps. However MSR
> passthrough implies that guest_fpstate::xfd and per-cpu xfd cache might
> be out of sync with the current IA32_XFD value by the guest.
>
> This suggests KVM needs to re-sync guest_fpstate::xfd and per-cpu cache
> with IA32_XFD before the vCPU thread might be preempted or interrupted.
>
> This patch provides a helper function for the re-sync purpose.

Provide a ....

> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> Signed-off-by: Jing Liu <jing2.liu@intel.com>
> Signed-off-by: Yang Zhong <yang.zhong@intel.com>
> ---
> (To Thomas): the original name kvm_update_guest_xfd_state() in
> your sample code is renamed to xfd_sync_state() in this patch. In
> concept it is a general helper to bring software values in-sync with
> the MSR value after they become out-of-sync. KVM is just the
> first out-of-sync usage on this helper, so a neutral name may make
> more sense. But if you prefer to the original name we can also
> change back.

There is no need for a general helper, really.

It's KVM specific and should go into the KVM section in core.c, next to
the other helper related to the XFD update.
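
I.e. something along these lines (sketch, named from KVM's point of
view):

#if IS_ENABLED(CONFIG_KVM)
void fpu_sync_guest_vmexit_xfd_state(void)
{
	struct fpstate *fps = current->thread.fpu.fpstate;

	lockdep_assert_irqs_disabled();
	if (fpu_state_size_dynamic()) {
		rdmsrl(MSR_IA32_XFD, fps->xfd);
		__this_cpu_write(xfd_state, fps->xfd);
	}
}
EXPORT_SYMBOL_GPL(fpu_sync_guest_vmexit_xfd_state);
#endif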

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 14/19] x86/fpu: Prepare for KVM XFD_ERR handling
  2021-12-08  0:03 ` [PATCH 14/19] x86/fpu: Prepare for KVM XFD_ERR handling Yang Zhong
  2021-12-10 16:16   ` Paolo Bonzini
@ 2021-12-10 23:20   ` Thomas Gleixner
  1 sibling, 0 replies; 80+ messages in thread
From: Thomas Gleixner @ 2021-12-10 23:20 UTC (permalink / raw)
  To: Yang Zhong, x86, kvm, linux-kernel, mingo, bp, dave.hansen, pbonzini
  Cc: seanjc, jun.nakajima, kevin.tian, jing2.liu, jing2.liu, yang.zhong

On Tue, Dec 07 2021 at 19:03, Yang Zhong wrote:
> --- a/arch/x86/kernel/fpu/core.c
> +++ b/arch/x86/kernel/fpu/core.c
> @@ -322,6 +322,55 @@ int fpu_swap_kvm_fpstate(struct fpu_guest *guest_fpu, bool enter_guest)
>  }
>  EXPORT_SYMBOL_GPL(fpu_swap_kvm_fpstate);
>  
> +#ifdef CONFIG_X86_64
> +void fpu_save_guest_xfd_err(struct fpu_guest *guest_fpu)
> +{
> +	if (guest_fpu->xfd_err & XFD_ERR_GUEST_DISABLED)
> +		return;
> +
> +	/* A non-zero value indicates guest XFD_ERR already saved */
> +	if (guest_fpu->xfd_err)
> +		return;
> +
> +	/* Guest XFD_ERR must be saved before switching to host fpstate */
> +	WARN_ON_ONCE(!current->thread.fpu.fpstate->is_guest);

Warn and proceed?

> +	rdmsrl(MSR_IA32_XFD_ERR, guest_fpu->xfd_err);
> +
> +	/*
> +	 * Restore to the host value if guest xfd_err is non-zero.
> +	 * Except in #NM handler, all other places in the kernel
> +	 * should just see xfd_err=0. So just restore to 0.
> +	 */
> +	if (guest_fpu->xfd_err)
> +		wrmsrl(MSR_IA32_XFD_ERR, 0);
> +
> +	guest_fpu->xfd_err |= XFD_ERR_GUEST_SAVED;
> +}
> +EXPORT_SYMBOL_GPL(fpu_save_guest_xfd_err);
> +
> +void fpu_restore_guest_xfd_err(struct fpu_guest *guest_fpu)
> +{
> +	u64 xfd_err = guest_fpu->xfd_err;
> +
> +	if (xfd_err & XFD_ERR_GUEST_DISABLED)
> +		return;
> +
> +	xfd_err &= ~XFD_ERR_GUEST_SAVED;
> +
> +	/*
> +	 * No need to restore a zero value since XFD_ERR
> +	 * is always zero outside of #NM handler in the host.
> +	 */
> +	if (!xfd_err)
> +		return;
> +
> +	wrmsrl(MSR_IA32_XFD_ERR, xfd_err);
> +	guest_fpu->xfd_err = 0;
> +}

Why should any of this be in the FPU core?

It's a pure guest issue as all of this is related to struct fpu_guest
and not struct fpu or any other core FPU state.

Thanks,

        tglx


^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 15/19] kvm: x86: Save and restore guest XFD_ERR properly
  2021-12-08  0:03 ` [PATCH 15/19] kvm: x86: Save and restore guest XFD_ERR properly Yang Zhong
  2021-12-10 16:23   ` Paolo Bonzini
  2021-12-10 22:01   ` Paolo Bonzini
@ 2021-12-11  0:10   ` Thomas Gleixner
  2021-12-11  1:31     ` Paolo Bonzini
  2021-12-11  3:07     ` Tian, Kevin
  2 siblings, 2 replies; 80+ messages in thread
From: Thomas Gleixner @ 2021-12-11  0:10 UTC (permalink / raw)
  To: Yang Zhong, x86, kvm, linux-kernel, mingo, bp, dave.hansen, pbonzini
  Cc: seanjc, jun.nakajima, kevin.tian, jing2.liu, jing2.liu, yang.zhong

On Tue, Dec 07 2021 at 19:03, Yang Zhong wrote:
> diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
> index 5089f2e7dc22..9811dc98d550 100644
> --- a/arch/x86/kernel/fpu/core.c
> +++ b/arch/x86/kernel/fpu/core.c
> @@ -238,6 +238,7 @@ bool fpu_alloc_guest_fpstate(struct fpu_guest *gfpu)
>  	fpstate->is_guest	= true;
>  
>  	gfpu->fpstate		= fpstate;
> +	gfpu->xfd_err           = XFD_ERR_GUEST_DISABLED;

This wants to be part of the previous patch, which introduces the field.

>  	gfpu->user_xfeatures	= fpu_user_cfg.default_features;
>  	gfpu->user_perm		= fpu_user_cfg.default_features;
>  	fpu_init_guest_permissions(gfpu);
> @@ -297,6 +298,7 @@ int fpu_swap_kvm_fpstate(struct fpu_guest *guest_fpu, bool enter_guest)
>  		fpu->fpstate = guest_fps;
>  		guest_fps->in_use = true;
>  	} else {
> +		fpu_save_guest_xfd_err(guest_fpu);

Hmm. See below.

>  		guest_fps->in_use = false;
>  		fpu->fpstate = fpu->__task_fpstate;
>  		fpu->__task_fpstate = NULL;
> @@ -4550,6 +4550,9 @@ void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
>  		kvm_steal_time_set_preempted(vcpu);
>  	srcu_read_unlock(&vcpu->kvm->srcu, idx);
>  
> +	if (vcpu->preempted)
> +		fpu_save_guest_xfd_err(&vcpu->arch.guest_fpu);

I'm not really excited about the thought of an exception cause register
in guest clobbered state.

Aside of that I really have to ask the question why all this is needed?

#NM in the guest is slow path, right? So why are you trying to optimize
for it?

The straight forward solution to this is:

    1) Trap #NM and MSR_XFD_ERR write

    2) When the guest triggers #NM it takes a VMEXIT and the host
       does:

                rdmsrl(MSR_XFD_ERR, vcpu->arch.guest_fpu.xfd_err);

       injects the #NM and goes on.

    3) When the guest writes to MSR_XFD_ERR it takes a VMEXIT and
       the host does:

           vcpu->arch.guest_fpu.xfd_err = msrval;
           wrmsrl(MSR_XFD_ERR, msrval);

      and goes back.

    4) Before entering the preemption disabled section of the VCPU loop
       do:
       
           if (vcpu->arch.guest_fpu.xfd_err)
                      wrmsrl(MSR_XFD_ERR, vcpu->arch.guest_fpu.xfd_err);

    5) Before leaving the preemption disabled section of the VCPU loop
       do:
       
           if (vcpu->arch.guest_fpu.xfd_err)
                      wrmsrl(MSR_XFD_ERR, 0);

It's really that simple and pretty much 0 overhead for the regular case.
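
In KVM terms step 2) boils down to an exit handler of roughly this
shape (sketch; where exactly it has to run relative to preemption is a
separate question, see below in the thread):

	/* #NM VMEXIT: snapshot the guest's XFD_ERR, reflect the #NM */
	rdmsrl(MSR_IA32_XFD_ERR, vcpu->arch.guest_fpu.xfd_err);
	kvm_queue_exception(vcpu, NM_VECTOR);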

If the guest triggers #NM with a high frequency then taking the VMEXITs
is the least of the problems. That's not a realistic use case, really.

Hmm?

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 15/19] kvm: x86: Save and restore guest XFD_ERR properly
  2021-12-11  0:10   ` Thomas Gleixner
@ 2021-12-11  1:31     ` Paolo Bonzini
  2021-12-11  3:23       ` Tian, Kevin
  2021-12-11 13:10       ` Thomas Gleixner
  2021-12-11  3:07     ` Tian, Kevin
  1 sibling, 2 replies; 80+ messages in thread
From: Paolo Bonzini @ 2021-12-11  1:31 UTC (permalink / raw)
  To: Thomas Gleixner, Yang Zhong, x86, kvm, linux-kernel, mingo, bp,
	dave.hansen
  Cc: seanjc, jun.nakajima, kevin.tian, jing2.liu, jing2.liu

On 12/11/21 01:10, Thomas Gleixner wrote:
>      2) When the guest triggers #NM it takes a VMEXIT and the host
>         does:
> 
>                  rdmsrl(MSR_XFD_ERR, vcpu->arch.guest_fpu.xfd_err);
> 
>         injects the #NM and goes on.
> 
>      3) When the guest writes to MSR_XFD_ERR it takes a VMEXIT and
>         the host does:
> 
>             vcpu->arch.guest_fpu.xfd_err = msrval;
>             wrmsrl(MSR_XFD_ERR, msrval);

No wrmsrl here I think, the host value is 0 and should stay so.  Instead 
the wrmsrl will happen the next time the VCPU loop is entered.

Paolo

^ permalink raw reply	[flat|nested] 80+ messages in thread

* RE: [PATCH 15/19] kvm: x86: Save and restore guest XFD_ERR properly
  2021-12-11  0:10   ` Thomas Gleixner
  2021-12-11  1:31     ` Paolo Bonzini
@ 2021-12-11  3:07     ` Tian, Kevin
  2021-12-11 13:29       ` Thomas Gleixner
  1 sibling, 1 reply; 80+ messages in thread
From: Tian, Kevin @ 2021-12-11  3:07 UTC (permalink / raw)
  To: Thomas Gleixner, Zhong, Yang, x86, kvm, linux-kernel, mingo, bp,
	dave.hansen, pbonzini
  Cc: seanjc, Nakajima, Jun, jing2.liu, Liu, Jing2, Zhong, Yang

> From: Thomas Gleixner <tglx@linutronix.de>
> Sent: Saturday, December 11, 2021 8:11 AM
> 
> On Tue, Dec 07 2021 at 19:03, Yang Zhong wrote:
> > diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
> > index 5089f2e7dc22..9811dc98d550 100644
> > --- a/arch/x86/kernel/fpu/core.c
> > +++ b/arch/x86/kernel/fpu/core.c
> > @@ -238,6 +238,7 @@ bool fpu_alloc_guest_fpstate(struct fpu_guest
> *gfpu)
> >  	fpstate->is_guest	= true;
> >
> >  	gfpu->fpstate		= fpstate;
> > +	gfpu->xfd_err           = XFD_ERR_GUEST_DISABLED;
> 
> This wants to be part of the previous patch, which introduces the field.
> 
> >  	gfpu->user_xfeatures	= fpu_user_cfg.default_features;
> >  	gfpu->user_perm		= fpu_user_cfg.default_features;
> >  	fpu_init_guest_permissions(gfpu);
> > @@ -297,6 +298,7 @@ int fpu_swap_kvm_fpstate(struct fpu_guest
> *guest_fpu, bool enter_guest)
> >  		fpu->fpstate = guest_fps;
> >  		guest_fps->in_use = true;
> >  	} else {
> > +		fpu_save_guest_xfd_err(guest_fpu);
> 
> Hmm. See below.
> 
> >  		guest_fps->in_use = false;
> >  		fpu->fpstate = fpu->__task_fpstate;
> >  		fpu->__task_fpstate = NULL;
> > @@ -4550,6 +4550,9 @@ void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
> >  		kvm_steal_time_set_preempted(vcpu);
> >  	srcu_read_unlock(&vcpu->kvm->srcu, idx);
> >
> > +	if (vcpu->preempted)
> > +		fpu_save_guest_xfd_err(&vcpu->arch.guest_fpu);
> 
> I'm not really exited about the thought of an exception cause register
> in guest clobbered state.
> 
> Aside of that I really have to ask the question why all this is needed?
> 
> #NM in the guest is slow path, right? So why are you trying to optimize
> for it?

This is really good information. The current logic is obviously
based on the assumption that #NM is frequently triggered.

> 
> The straight forward solution to this is:
> 
>     1) Trap #NM and MSR_XFD_ERR write

and the #NM vmexit handler should be called in kvm_x86_handle_exit_irqoff()
before preemption is enabled, otherwise there is still a small window
where MSR_XFD_ERR might be clobbered after preemption is enabled and
before the #NM handler is actually called.

> 
>     2) When the guest triggers #NM it takes a VMEXIT and the host
>        does:
> 
>                 rdmsrl(MSR_XFD_ERR, vcpu->arch.guest_fpu.xfd_err);
> 
>        injects the #NM and goes on.
> 
>     3) When the guest writes to MSR_XFD_ERR it takes a VMEXIT and
>        the host does:
> 
>            vcpu->arch.guest_fpu.xfd_err = msrval;
>            wrmsrl(MSR_XFD_ERR, msrval);
> 
>       and goes back.
> 
>     4) Before entering the preemption disabled section of the VCPU loop
>        do:
> 
>            if (vcpu->arch.guest_fpu.xfd_err)
>                       wrmsrl(MSR_XFD_ERR, vcpu->arch.guest_fpu.xfd_err);
> 
>     5) Before leaving the preemption disabled section of the VCPU loop
>        do:
> 
>            if (vcpu->arch.guest_fpu.xfd_err)
>                       wrmsrl(MSR_XFD_ERR, 0);
> 
> It's really that simple and pretty much 0 overhead for the regular case.

Much cleaner.

> 
> If the guest triggers #NM with a high frequency then taking the VMEXITs
> is the least of the problems. That's not a realistic use case, really.
> 
> Hmm?
> 
> Thanks,
> 
>         tglx

Thanks
Kevin

^ permalink raw reply	[flat|nested] 80+ messages in thread

* RE: [PATCH 15/19] kvm: x86: Save and restore guest XFD_ERR properly
  2021-12-11  1:31     ` Paolo Bonzini
@ 2021-12-11  3:23       ` Tian, Kevin
  2021-12-11 13:10       ` Thomas Gleixner
  1 sibling, 0 replies; 80+ messages in thread
From: Tian, Kevin @ 2021-12-11  3:23 UTC (permalink / raw)
  To: Paolo Bonzini, Thomas Gleixner, Zhong, Yang, x86, kvm,
	linux-kernel, mingo, bp, dave.hansen
  Cc: seanjc, Nakajima, Jun, jing2.liu, Liu, Jing2

> From: Paolo Bonzini
> Sent: Saturday, December 11, 2021 9:32 AM
> 
> On 12/11/21 01:10, Thomas Gleixner wrote:
> >      2) When the guest triggers #NM it takes a VMEXIT and the host
> >         does:
> >
> >                  rdmsrl(MSR_XFD_ERR, vcpu->arch.guest_fpu.xfd_err);
> >
> >         injects the #NM and goes on.
> >
> >      3) When the guest writes to MSR_XFD_ERR it takes a VMEXIT and
> >         the host does:
> >
> >             vcpu->arch.guest_fpu.xfd_err = msrval;
> >             wrmsrl(MSR_XFD_ERR, msrval);
> 
> No wrmsrl here I think, the host value is 0 and should stay so.  Instead
> the wrmsrl will happen the next time the VCPU loop is entered.
> 

To elaborate, I guess the reason is that MSR_XFD_ERR should always
contain the host value 0 after preemption is enabled, while WRMSR
emulation is called with preemption enabled. Then we just need to
wait for the next time the vcpu loop is entered to restore the guest
value after preemption is disabled. 😊

Thanks
Kevin

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 15/19] kvm: x86: Save and restore guest XFD_ERR properly
  2021-12-11  1:31     ` Paolo Bonzini
  2021-12-11  3:23       ` Tian, Kevin
@ 2021-12-11 13:10       ` Thomas Gleixner
  1 sibling, 0 replies; 80+ messages in thread
From: Thomas Gleixner @ 2021-12-11 13:10 UTC (permalink / raw)
  To: Paolo Bonzini, Yang Zhong, x86, kvm, linux-kernel, mingo, bp,
	dave.hansen
  Cc: seanjc, jun.nakajima, kevin.tian, jing2.liu, jing2.liu

On Sat, Dec 11 2021 at 02:31, Paolo Bonzini wrote:
> On 12/11/21 01:10, Thomas Gleixner wrote:
>>      2) When the guest triggers #NM it takes a VMEXIT and the host
>>         does:
>> 
>>                  rdmsrl(MSR_XFD_ERR, vcpu->arch.guest_fpu.xfd_err);
>> 
>>         injects the #NM and goes on.
>> 
>>      3) When the guest writes to MSR_XFD_ERR it takes a VMEXIT and
>>         the host does:
>> 
>>             vcpu->arch.guest_fpu.xfd_err = msrval;
>>             wrmsrl(MSR_XFD_ERR, msrval);
>
> No wrmsrl here I think, the host value is 0 and should stay so.  Instead 
>> the wrmsrl will happen the next time the VCPU loop is entered.

I assumed this can be handled in the fast path, but either way.

^ permalink raw reply	[flat|nested] 80+ messages in thread

* RE: [PATCH 15/19] kvm: x86: Save and restore guest XFD_ERR properly
  2021-12-11  3:07     ` Tian, Kevin
@ 2021-12-11 13:29       ` Thomas Gleixner
  2021-12-12  1:50         ` Tian, Kevin
  0 siblings, 1 reply; 80+ messages in thread
From: Thomas Gleixner @ 2021-12-11 13:29 UTC (permalink / raw)
  To: Tian, Kevin, Zhong, Yang, x86, kvm, linux-kernel, mingo, bp,
	dave.hansen, pbonzini
  Cc: seanjc, Nakajima, Jun, jing2.liu, Liu, Jing2, Zhong, Yang

Kevin,

On Sat, Dec 11 2021 at 03:07, Kevin Tian wrote:
>> From: Thomas Gleixner <tglx@linutronix.de>
>> #NM in the guest is slow path, right? So why are you trying to optimize
>> for it?
>
> This is really good information. The current logic is obviously
> based on the assumption that #NM is frequently triggered.

More context.

When an application wants to use AMX, it invokes the prctl() which
grants permission. Even when permission is granted, the kernel FPU
state buffers stay at the default size and XFD is armed.

When a thread of that process issues the first AMX (tile) instruction,
#NM is raised.

The #NM handler does:

    1) Read MSR_XFD_ERR. If 0, goto regular #NM

    2) Write MSR_XFD_ERR to 0

    3) Check whether the process has permission granted. If not,
       raise SIGILL and return.

    4) Allocate and install a larger FPU state buffer for the task.
       If allocation fails, raise SIGSEGV and return.

    5) Disarm XFD for that task
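
In code, roughly (sketch along the lines of the handler in the fpu
series; exact signatures and signal details may differ):

	static bool handle_xfd_event(struct pt_regs *regs)
	{
		u64 xfd_err;
		int err;

		rdmsrl(MSR_IA32_XFD_ERR, xfd_err);
		if (!xfd_err)
			return false;		/* 1) regular #NM */

		wrmsrl(MSR_IA32_XFD_ERR, 0);	/* 2) */

		/* 3)-5): check permission, realloc buffer, disarm XFD */
		err = xfd_enable_feature(xfd_err);
		if (err == -EPERM)
			force_sig(SIGILL);	/* 3) */
		else if (err == -EFAULT)
			force_sig(SIGSEGV);	/* 4) */

		return true;
	}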

That means one thread takes at most one AMX/XFD related #NM during its
lifetime, which means two VMEXITs.

If there are other XFD controlled facilities in the future, then it will
be NR_USED_XFD_CONTROLLED_FACILITIES * 2 VMEXITs per thread which uses
them. Not the end of the world either.

Looking at the targeted application space it's pretty unlikely that
tasks which utilize AMX are going to be so short lived that the overhead
of these VMEXITs really matters.

This of course can be revisited when there is a sane use case, but
optimizing for it prematurely does not buy us anything else than
pointless complexity.

>> The straight forward solution to this is:
>> 
>>     1) Trap #NM and MSR_XFD_ERR write
>
> and #NM vmexit handler should be called in kvm_x86_handle_exit_irqoff()
> before preemption is enabled, otherwise there is still a small window
> where MSR_XFD_ERR might be clobbered after preemption enable and
> before #NM handler is actually called.

Yes.

Thanks,

        tglx


^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 00/19] AMX Support in KVM
  2021-12-08  0:03 [PATCH 00/19] AMX Support in KVM Yang Zhong
                   ` (18 preceding siblings ...)
  2021-12-08  0:03 ` [PATCH 19/19] kvm: x86: Add AMX CPUIDs support Yang Zhong
@ 2021-12-11 21:20 ` Thomas Gleixner
  19 siblings, 0 replies; 80+ messages in thread
From: Thomas Gleixner @ 2021-12-11 21:20 UTC (permalink / raw)
  To: Yang Zhong, x86, kvm, linux-kernel, mingo, bp, dave.hansen, pbonzini
  Cc: seanjc, jun.nakajima, kevin.tian, jing2.liu, jing2.liu, yang.zhong

Jing, Yang,

On Tue, Dec 07 2021 at 19:03, Yang Zhong wrote:
>
> Thanks Thomas for the thoughts and patches on the KVM FPU and AMX
> support.

welcome. The overall impression of that series is that it is heading
in the right direction. Keep up the good work!

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 80+ messages in thread

* RE: [PATCH 15/19] kvm: x86: Save and restore guest XFD_ERR properly
  2021-12-11 13:29       ` Thomas Gleixner
@ 2021-12-12  1:50         ` Tian, Kevin
  2021-12-12  9:10           ` Paolo Bonzini
  0 siblings, 1 reply; 80+ messages in thread
From: Tian, Kevin @ 2021-12-12  1:50 UTC (permalink / raw)
  To: Thomas Gleixner, Zhong, Yang, x86, kvm, linux-kernel, mingo, bp,
	dave.hansen, pbonzini
  Cc: Christopherson,, Sean, Nakajima, Jun, jing2.liu, Liu, Jing2, Zhong, Yang

> From: Thomas Gleixner <tglx@linutronix.de>
> Sent: Saturday, December 11, 2021 9:29 PM
> 
> Kevin,
> 
> On Sat, Dec 11 2021 at 03:07, Kevin Tian wrote:
> >> From: Thomas Gleixner <tglx@linutronix.de>
> >> #NM in the guest is slow path, right? So why are you trying to optimize
> >> for it?
> >
> > This is really good information. The current logic is obviously
> > based on the assumption that #NM is frequently triggered.
> 
> More context.
> 
> When an application wants to use AMX, it invokes the prctl() which
> grants permission. Even when permission is granted, the kernel FPU
> state buffers stay at the default size and XFD is armed.
> 
> When a thread of that process issues the first AMX (tile) instruction,
> #NM is raised.
> 
> The #NM handler does:
> 
>     1) Read MSR_XFD_ERR. If 0, goto regular #NM
> 
>     2) Write MSR_XFD_ERR to 0
> 
>     3) Check whether the process has permission granted. If not,
>        raise SIGILL and return.
> 
>     4) Allocate and install a larger FPU state buffer for the task.
>        If allocation fails, raise SIGSEGV and return.
> 
>     5) Disarm XFD for that task
> 
>     That means one thread takes at most one AMX/XFD related #NM during its
> lifetime, which means two VMEXITs.
> 
> If there are other XFD controlled facilities in the future, then it will
> be NR_USED_XFD_CONTROLLED_FACILITIES * 2 VMEXITs per thread which
> uses
> them. Not the end of the world either.
> 
> Looking at the targeted application space it's pretty unlikely that
> tasks which utilize AMX are going to be so short lived that the overhead
> of these VMEXITs really matters.
> 
> This of course can be revisited when there is a sane use case, but
> optimizing for it prematurely does not buy us anything else than
> pointless complexity.

I get all of the above.

I guess the original open question is also about the frequency of #NM
not due to XFD. For a Linux guest it looks like it's not a problem,
since CR0.TS is no longer set when math emulation is not required:

DEFINE_IDTENTRY(exc_device_not_available)
{
	...
	/* This should not happen. */
	if (WARN(cr0 & X86_CR0_TS, "CR0.TS was set")) {
		/* Try to fix it up and carry on. */
		write_cr0(cr0 & ~X86_CR0_TS);
	} else {
		/*
		 * Something terrible happened, and we're better off trying
		 * to kill the task than getting stuck in a never-ending
		 * loop of #NM faults.
		 */
		die("unexpected #NM exception", regs, 0);
	}
}

It may affect a guest which still uses CR0.TS to do lazy save. But
modern OSes have likely all moved to the eager save approach, so always
trapping #NM should be fine.

Is this understanding correct?

Thanks
Kevin

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 15/19] kvm: x86: Save and restore guest XFD_ERR properly
  2021-12-12  1:50         ` Tian, Kevin
@ 2021-12-12  9:10           ` Paolo Bonzini
  0 siblings, 0 replies; 80+ messages in thread
From: Paolo Bonzini @ 2021-12-12  9:10 UTC (permalink / raw)
  To: Tian, Kevin, Thomas Gleixner, Zhong, Yang, x86, kvm,
	linux-kernel, mingo, bp, dave.hansen
  Cc: Christopherson,, Sean, Nakajima, Jun, jing2.liu, Liu, Jing2

On 12/12/21 02:50, Tian, Kevin wrote:
>>
>> If there are other XFD controlled facilities in the future, then it will
>> be NR_USED_XFD_CONTROLLED_FACILITIES * 2 VMEXITs per thread which uses
>> them. Not the end of the world either.
>>
>> Looking at the targeted application space it's pretty unlikely that
>> tasks which utilize AMX are going to be so short lived that the overhead
>> of these VMEXITs really matters.
>>
>> This of course can be revisited when there is a sane use case, but
>> optimizing for it prematurely does not buy us anything else than
>> pointless complexity.
> It may affect a guest which still uses CR0.TS to do lazy save. But
> modern OSes have likely all moved to the eager save approach, so always
> trapping #NM should be fine.

You also don't need to trap #NM if CPUID includes no dynamic bits, 
because then XFD will never be nonzero.
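
Roughly (sketch; the exact hook and condition are up to the series,
e.g. in vmx_update_exception_bitmap()):

	/* Trap #NM only if dynamic state can be enabled at all */
	if (vcpu->arch.guest_supported_xcr0 & XFEATURE_MASK_USER_DYNAMIC)
		eb |= (1u << NM_VECTOR);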

Paolo

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 15/19] kvm: x86: Save and restore guest XFD_ERR properly
  2021-12-10 22:01   ` Paolo Bonzini
@ 2021-12-12 13:10     ` Yang Zhong
  0 siblings, 0 replies; 80+ messages in thread
From: Yang Zhong @ 2021-12-12 13:10 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: x86, kvm, linux-kernel, tglx, mingo, bp, dave.hansen, seanjc,
	jun.nakajima, kevin.tian, jing2.liu, jing2.liu, yang.zhong

On Fri, Dec 10, 2021 at 11:01:15PM +0100, Paolo Bonzini wrote:
> On 12/8/21 01:03, Yang Zhong wrote:
> >--- a/arch/x86/kvm/cpuid.c
> >+++ b/arch/x86/kvm/cpuid.c
> >@@ -219,6 +219,11 @@ static void kvm_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
> >  		kvm_apic_set_version(vcpu);
> >  	}
> >+	/* Enable saving guest XFD_ERR */
> >+	best = kvm_find_cpuid_entry(vcpu, 7, 0);
> >+	if (best && cpuid_entry_has(best, X86_FEATURE_AMX_TILE))
> >+		vcpu->arch.guest_fpu.xfd_err = 0;
> >+
> 
> This is incorrect.  Instead it should check whether leaf 0xD
> includes any dynamic features.
> 

  Thanks Paolo. So ditto for "[PATCH 04/19] kvm: x86: Check guest xstate permissions when KVM_SET_CPUID2".

  Yang

> Paolo

^ permalink raw reply	[flat|nested] 80+ messages in thread

* RE: [PATCH 10/19] kvm: x86: Emulate WRMSR of guest IA32_XFD
  2021-12-10 16:02   ` Paolo Bonzini
@ 2021-12-13  7:51     ` Liu, Jing2
  2021-12-13  9:01       ` Paolo Bonzini
  2021-12-14 10:26     ` Yang Zhong
  1 sibling, 1 reply; 80+ messages in thread
From: Liu, Jing2 @ 2021-12-13  7:51 UTC (permalink / raw)
  To: Paolo Bonzini, Zhong, Yang, x86, kvm, linux-kernel, tglx, mingo,
	bp, dave.hansen
  Cc: Christopherson,,
	Sean, Nakajima, Jun, Tian, Kevin, jing2.liu, Wang, Wei W, Zeng,
	Guang

On 12/11/2021 12:02 AM, Paolo Bonzini wrote:
> 
> Also:
> 
> On 12/8/21 01:03, Yang Zhong wrote:
> >
> > +		if (!guest_cpuid_has(vcpu, X86_FEATURE_XFD))
> > +			return 1;
> 
> This should allow msr->host_initiated always (even if XFD is not part of
> CPUID). 
Thanks Paolo.

msr->host_initiated handling will be added in the next version.

I'd like to ask why we should always allow msr->host_initiated even if XFD is
not part of CPUID, given that the guest doesn't care about that MSR.  We found
that some MSRs (e.g. MSR_AMD64_OSVW_STATUS and MSR_AMD64_OSVW_ID_LENGTH)
are specially handled, so we would like to understand the rationale for
allowing msr->host_initiated.

if (!msr_info->host_initiated && !guest_cpuid_has(vcpu, X86_FEATURE_XFD))
        return 1;


>  However, if XFD is nonzero and kvm_check_guest_realloc_fpstate
> returns true, then it should return 1.
> 
If XFD is nonzero, kvm_check_guest_realloc_fpstate() won't return true, so
we may not need this check here?

Thanks,
Jing

> 
> Paolo

^ permalink raw reply	[flat|nested] 80+ messages in thread

* RE: [PATCH 16/19] kvm: x86: Introduce KVM_{G|S}ET_XSAVE2 ioctl
  2021-12-10 22:13     ` Paolo Bonzini
@ 2021-12-13  8:23       ` Wang, Wei W
  2021-12-13  9:24         ` Paolo Bonzini
  2021-12-20 17:54       ` State Component 18 and Palette 1 (Re: [PATCH 16/19] kvm: x86: Introduce KVM_{G|S}ET_XSAVE2 ioctl) Nakajima, Jun
  1 sibling, 1 reply; 80+ messages in thread
From: Wang, Wei W @ 2021-12-13  8:23 UTC (permalink / raw)
  To: Paolo Bonzini, Zhong, Yang, x86, kvm, linux-kernel, tglx, mingo,
	bp, dave.hansen
  Cc: seanjc, Nakajima, Jun, Tian, Kevin, jing2.liu, Liu, Jing2, Zeng, Guang

On Saturday, December 11, 2021 6:13 AM, Paolo Bonzini wrote:
> 
> By the way, I think KVM_SET_XSAVE2 is not needed.  Instead:
> 
> - KVM_CHECK_EXTENSION(KVM_CAP_XSAVE2) should return the size of the
> buffer that is passed to KVM_GET_XSAVE2
> 
> - KVM_GET_XSAVE2 should fill in the buffer expecting that its size is
> whatever KVM_CHECK_EXTENSION(KVM_CAP_XSAVE2) passes
> 
> - KVM_SET_XSAVE can just expect a buffer that is bigger than 4k if the
> save states recorded in the header point to offsets larger than 4k.

I think one issue is that KVM_SET_XSAVE works with "struct kvm_xsave" (a hardcoded 4KB buffer),
including kvm_vcpu_ioctl_x86_set_xsave. The states obtained via KVM_GET_XSAVE2 would be laid
out using "struct kvm_xsave2".

Did you mean that we could add a new code path under KVM_SET_XSAVE to make it work with
the new "struct kvm_xsave2"?
e.g.:

(xsave2_enabled below is set when userspace checks for KVM_CAP_XSAVE2)
if (kvm->xsave2_enabled) {
	new implementation using "struct kvm_xsave2"
} else {
	current implementation using "struct kvm_xsave"
}
(this seems like a new implementation which might deserve a new ioctl)

Thanks,
Wei


^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 10/19] kvm: x86: Emulate WRMSR of guest IA32_XFD
  2021-12-13  7:51     ` Liu, Jing2
@ 2021-12-13  9:01       ` Paolo Bonzini
  0 siblings, 0 replies; 80+ messages in thread
From: Paolo Bonzini @ 2021-12-13  9:01 UTC (permalink / raw)
  To: Liu, Jing2, Zhong, Yang, x86, kvm, linux-kernel, tglx, mingo, bp,
	dave.hansen
  Cc: Christopherson,,
	Sean, Nakajima, Jun, Tian, Kevin, jing2.liu, Wang, Wei W, Zeng,
	Guang

On 12/13/21 08:51, Liu, Jing2 wrote:
> On 12/11/2021 12:02 AM, Paolo Bonzini wrote:
>>
>> Also:
>>
>> On 12/8/21 01:03, Yang Zhong wrote:
>>>
>>> +		if (!guest_cpuid_has(vcpu, X86_FEATURE_XFD))
>>> +			return 1;
>>
>> This should allow msr->host_initiated always (even if XFD is not part of
>> CPUID).
> Thanks Paolo.
> 
> msr->host_initiated handling would be added in next version.
> 
> I'd like to ask why we should always allow msr->host_initiated even if XFD is
> not part of CPUID, given that the guest doesn't care about that MSR.  We found
> that some MSRs (e.g. MSR_AMD64_OSVW_STATUS and MSR_AMD64_OSVW_ID_LENGTH)
> are specially handled, so we would like to understand the rationale for
> allowing msr->host_initiated.
> 
> if (!msr_info->host_initiated && !guest_cpuid_has(vcpu, X86_FEATURE_XFD))
>          return 1;

Because it's simpler if userspace can just take the entire list from 
KVM_GET_MSR_INDEX_LIST and pass it to KVM_GET/SET_MSR.  See for example 
vcpu_save_state and vcpu_load_state in 
tools/testing/selftests/kvm/lib/x86_64/processor.c.

>>  However, if XFD is nonzero and kvm_check_guest_realloc_fpstate
>> returns true, then it should return 1.
>
> If XFD is nonzero, kvm_check_guest_realloc_fpstate() won't return true, so
> we may not need this check here?

It can't for now, because there's a single dynamic feature, but here:

+	if ((xfd & xcr0) != xcr0) {
+		u64 request = (xcr0 ^ xfd) & xcr0;
+		struct fpu_guest *guest_fpu = &vcpu->arch.guest_fpu;
+
+		/*
+		 * If requested features haven't been enabled, update
+		 * the request bitmap and tell the caller to request
+		 * dynamic buffer reallocation.
+		 */
+		if ((guest_fpu->user_xfeatures & request) != request) {
+			vcpu->arch.guest_fpu.realloc_request = request;
+			return true;
+		}
+	}

it is certainly possible to return true with nonzero XFD.

Paolo

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 02/19] x86/fpu: Prepare KVM for dynamically enabled states
  2021-12-08  0:03 ` [PATCH 02/19] x86/fpu: Prepare KVM for dynamically enabled states Yang Zhong
@ 2021-12-13  9:12   ` Paolo Bonzini
  2021-12-13 12:00     ` Thomas Gleixner
  0 siblings, 1 reply; 80+ messages in thread
From: Paolo Bonzini @ 2021-12-13  9:12 UTC (permalink / raw)
  To: Yang Zhong, x86, kvm, linux-kernel, tglx, mingo, bp, dave.hansen
  Cc: seanjc, jun.nakajima, kevin.tian, jing2.liu, jing2.liu

On 12/8/21 01:03, Yang Zhong wrote:
>    - user_xfeatures
> 
>      Track which features are currently enabled for the vCPU

Please rename to alloc_xfeatures

>    - user_perm
> 
>      Copied from guest_perm of the group leader thread. The first
>      vCPU which does the copy locks the guest_perm

Please rename to perm_xfeatures.

>    - realloc_request
> 
>      KVM sets this field to request dynamically-enabled features
>      which require reallocation of @fpstate

This field should be in vcpu->arch, and there is no need for 
fpu_guest_realloc_fpstate.  Rename __xfd_enable_feature to 
fpu_enable_xfd_feature and add it to the public API, then just do

	if (unlikely(vcpu->arch.xfd_realloc_request)) {
		u64 request = vcpu->arch.xfd_realloc_request;
		ret = fpu_enable_xfd(request, enter_guest);
	}

to kvm_put_guest_fpu.

Paolo

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 09/19] kvm: x86: Prepare reallocation check
  2021-12-08  0:03 ` [PATCH 09/19] kvm: x86: Prepare reallocation check Yang Zhong
@ 2021-12-13  9:16   ` Paolo Bonzini
  2021-12-14  7:06     ` Tian, Kevin
  0 siblings, 1 reply; 80+ messages in thread
From: Paolo Bonzini @ 2021-12-13  9:16 UTC (permalink / raw)
  To: Yang Zhong, x86, kvm, linux-kernel, tglx, mingo, bp, dave.hansen
  Cc: seanjc, jun.nakajima, kevin.tian, jing2.liu, jing2.liu

On 12/8/21 01:03, Yang Zhong wrote:
> +	u64 xcr0 = vcpu->arch.xcr0 & XFEATURE_MASK_USER_DYNAMIC;
> +
> +	/* For any state which is enabled dynamically */
> +	if ((xfd & xcr0) != xcr0) {
> +		u64 request = (xcr0 ^ xfd) & xcr0;
> +		struct fpu_guest *guest_fpu = &vcpu->arch.guest_fpu;
> +
> +		/*
> +		 * If requested features haven't been enabled, update
> +		 * the request bitmap and tell the caller to request
> +		 * dynamic buffer reallocation.
> +		 */
> +		if ((guest_fpu->user_xfeatures & request) != request) {
> +			vcpu->arch.guest_fpu.realloc_request = request;

This should be "|=".  If you have

	wrmsr(XFD, dynamic-feature-1);
	...
	wrmsr(XFD, dynamic-feature-2);

then the space for both features has to be allocated.

> +			return true;
> +		}
> +	}
> +


This is just:

	struct fpu_guest *guest_fpu = &vcpu->arch.guest_fpu;
	u64 xcr0 = vcpu->arch.xcr0 & XFEATURE_MASK_USER_DYNAMIC;
	u64 dynamic_enabled = xcr0 & ~xfd;
	if (!(dynamic_enabled & ~guest_fpu->user_xfeatures))
		return false;

	/*
	 * This should actually not be in guest_fpu, see review of
	 * patch 2.  Also see above regarding "=" vs "|=".
	 */
	vcpu->arch.guest_fpu.realloc_request |= dynamic_enabled;
	return true;

But without documentation I'm not sure why this exit-to-userspace step 
is needed:

- if (dynamic_enabled & ~guest_fpu->user_perm) != 0, then this is a 
userspace error and you can #GP the guest without any issue.  Userspace 
is buggy

- if (dynamic_enabled & ~guest_fpu->user_xfeatures) != 0, but the 
feature *is* permitted, then it is okay to just go on and reallocate the 
guest FPU.


Paolo

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 16/19] kvm: x86: Introduce KVM_{G|S}ET_XSAVE2 ioctl
  2021-12-13  8:23       ` Wang, Wei W
@ 2021-12-13  9:24         ` Paolo Bonzini
  2021-12-14  6:06           ` Wang, Wei W
  0 siblings, 1 reply; 80+ messages in thread
From: Paolo Bonzini @ 2021-12-13  9:24 UTC (permalink / raw)
  To: Wang, Wei W, Zhong, Yang, x86, kvm, linux-kernel, tglx, mingo,
	bp, dave.hansen
  Cc: seanjc, Nakajima, Jun, Tian, Kevin, jing2.liu, Liu, Jing2, Zeng, Guang

On 12/13/21 09:23, Wang, Wei W wrote:
> On Saturday, December 11, 2021 6:13 AM, Paolo Bonzini wrote:
>>
>> By the way, I think KVM_SET_XSAVE2 is not needed.  Instead:
>>
>> - KVM_CHECK_EXTENSION(KVM_CAP_XSAVE2) should return the size of the
>> buffer that is passed to KVM_GET_XSAVE2
>>
>> - KVM_GET_XSAVE2 should fill in the buffer expecting that its size is
>> whatever KVM_CHECK_EXTENSION(KVM_CAP_XSAVE2) passes
>>
>> - KVM_SET_XSAVE can just expect a buffer that is bigger than 4k if the
>> save states recorded in the header point to offsets larger than 4k.
> 
> I think one issue is that KVM_SET_XSAVE works with "struct kvm_xsave" (hardcoded 4KB buffer),
> including kvm_vcpu_ioctl_x86_set_xsave. The states obtained via KVM_GET_XSAVE2 will be made
> using "struct kvm_xsave2".
> 
> Did you mean that we could add a new code path under KVM_SET_XSAVE to make it work with
> the new "struct kvm_xsave2"?

There is no need for struct kvm_xsave2, because there is no need for a 
"size" argument.

- KVM_GET_XSAVE2 *is* needed, and it can expect a buffer as big as the 
return value of KVM_CHECK_EXTENSION(KVM_CAP_XSAVE2)

- but KVM_SET_XSAVE2 is not needed, because KVM_SET_XSAVE can use 
copy_from_user to read the XSTATE_BV, use it to deduce the size of the 
buffer, and use copy_from_user to read the full size of the buffer.

For this to work you can redefine struct kvm_xsave to

	struct kvm_xsave {
		__u32 region[1024];

		/*
		 * KVM_GET_XSAVE only uses 4096 bytes and only returns
		 * user save states up to save state 17 (TILECFG).
		 *
		 * For KVM_GET_XSAVE2, the total size of region + extra
		 * must be the size that is communicated by
		 * KVM_CHECK_EXTENSION(KVM_CAP_XSAVE2).
		 *
		 * KVM_SET_XSAVE uses the extra field if the struct was
		 * returned by KVM_GET_XSAVE2.
		 */
		__u32 extra[];
	};
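
The size deduction for KVM_SET_XSAVE could then be (sketch; 'uxsave'
and 'size' are made-up local names, and XSTATE_BV sits at byte offset
512 of the XSAVE area, i.e. region[128]):

	u64 xstate_bv;

	if (copy_from_user(&xstate_bv, &uxsave->region[128],
			   sizeof(xstate_bv)))
		return -EFAULT;
	size = xstate_required_size(xstate_bv, false);
	/* then copy_from_user() the full 'size' bytes and validate them */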

Paolo

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 16/19] kvm: x86: Introduce KVM_{G|S}ET_XSAVE2 ioctl
  2021-12-10 16:30   ` Paolo Bonzini
  2021-12-10 22:13     ` Paolo Bonzini
@ 2021-12-13 10:10     ` Thomas Gleixner
  2021-12-13 10:43       ` Paolo Bonzini
  1 sibling, 1 reply; 80+ messages in thread
From: Thomas Gleixner @ 2021-12-13 10:10 UTC (permalink / raw)
  To: Paolo Bonzini, Yang Zhong, x86, kvm, linux-kernel, mingo, bp,
	dave.hansen
  Cc: seanjc, jun.nakajima, kevin.tian, jing2.liu, jing2.liu

On Fri, Dec 10 2021 at 17:30, Paolo Bonzini wrote:
> On 12/8/21 01:03, Yang Zhong wrote:
>> +static int kvm_vcpu_ioctl_x86_set_xsave2(struct kvm_vcpu *vcpu, u8 *state)
>> +{
>> +	if (fpstate_is_confidential(&vcpu->arch.guest_fpu))
>> +		return 0;
>> +
>> +	return fpu_copy_uabi_to_guest_fpstate(&vcpu->arch.guest_fpu, state,
>> +					      supported_xcr0, &vcpu->arch.pkru);
>> +}
>> +
>
> I think fpu_copy_uabi_to_guest_fpstate (and therefore 
> copy_uabi_from_kernel_to_xstate) needs to check that the size is 
> compatible with the components in the input.

fpu_copy_uabi_to_guest_fpstate() expects that the input buffer is
correctly sized. We surely can add a size check there.

> Also, IIUC the size of the AMX state will vary in different processors. 
>   Is this correct?  If so, this should be handled already by 
> KVM_GET/SET_XSAVE2 and therefore should be part of the 
> arch/x86/kernel/fpu APIs.  In the future we want to support migrating a 
> "small AMX" host to a "large AMX" host; and also migrating from a "large 
> AMX" host to a "small AMX" host if the guest CPUID is compatible with 
> the destination of the migration.

How is that supposed to work? If the AMX state size differs then the
hosts are not compatible.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 16/19] kvm: x86: Introduce KVM_{G|S}ET_XSAVE2 ioctl
  2021-12-13 10:10     ` [PATCH 16/19] kvm: x86: Introduce KVM_{G|S}ET_XSAVE2 ioctl Thomas Gleixner
@ 2021-12-13 10:43       ` Paolo Bonzini
  2021-12-13 12:40         ` Thomas Gleixner
  0 siblings, 1 reply; 80+ messages in thread
From: Paolo Bonzini @ 2021-12-13 10:43 UTC (permalink / raw)
  To: Thomas Gleixner, Yang Zhong, x86, kvm, linux-kernel, mingo, bp,
	dave.hansen
  Cc: seanjc, jun.nakajima, kevin.tian, jing2.liu, jing2.liu

On 12/13/21 11:10, Thomas Gleixner wrote:
> On Fri, Dec 10 2021 at 17:30, Paolo Bonzini wrote:
>> On 12/8/21 01:03, Yang Zhong wrote:
>>> +static int kvm_vcpu_ioctl_x86_set_xsave2(struct kvm_vcpu *vcpu, u8 *state)
>>> +{
>>> +	if (fpstate_is_confidential(&vcpu->arch.guest_fpu))
>>> +		return 0;
>>> +
>>> +	return fpu_copy_uabi_to_guest_fpstate(&vcpu->arch.guest_fpu, state,
>>> +					      supported_xcr0, &vcpu->arch.pkru);
>>> +}
>>> +
>>
>> I think fpu_copy_uabi_to_guest_fpstate (and therefore
>> copy_uabi_from_kernel_to_xstate) needs to check that the size is
>> compatible with the components in the input.
> 
> fpu_copy_uabi_to_guest_fpstate() expects that the input buffer is
> correctly sized. We surely can add a size check there.

fpu_copy_guest_fpstate_to_uabi is more problematic because that one
writes memory.  For fpu_copy_uabi_to_guest_fpstate, we know the input
buffer size from the components and we can use it to do a properly-sized
memdup_user.

For fpu_copy_guest_fpstate_to_uabi we can just decide that KVM_GET_XSAVE
will only save up to the first 4K.  Something like the following might
actually be good for 5.16-rc; right now, header.xfeatures might lead
userspace into reading uninitialized or unmapped memory:

diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index d28829403ed0..69609b8c3887 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -1138,8 +1138,10 @@ void __copy_xstate_to_uabi_buf(struct membuf to, struct fpstate *fpstate,
  	struct xstate_header header;
  	unsigned int zerofrom;
  	u64 mask;
+	u64 size;
  	int i;
  
+	size = to.left;
  	memset(&header, 0, sizeof(header));
  	header.xfeatures = xsave->header.xfeatures;
  
@@ -1186,7 +1188,20 @@ void __copy_xstate_to_uabi_buf(struct membuf to, struct fpstate *fpstate,
  	/* Copy xsave->i387.sw_reserved */
  	membuf_write(&to, xstate_fx_sw_bytes, sizeof(xsave->i387.sw_reserved));
  
-	/* Copy the user space relevant state of @xsave->header */
+	/*
+	 * Copy the user space relevant state of @xsave->header.
+	 * If not all features fit in the buffer, drop them from the
+	 * saved state so that userspace does not read uninitialized or
+	 * unmapped memory.
+	 */
+	mask = fpstate->user_xfeatures;
+	for_each_extended_xfeature(i, mask) {
+		if (xstate_offsets[i] + xstate_sizes[i] > size) {
+			header.xfeatures &= BIT(i) - 1;
+			mask &= BIT(i) - 1;
+			break;
+		}
+	}
  	membuf_write(&to, &header, sizeof(header));
  
  	zerofrom = offsetof(struct xregs_state, extended_state_area);
@@ -1197,7 +1212,6 @@ void __copy_xstate_to_uabi_buf(struct membuf to, struct fpstate *fpstate,
  	 * but there is no state to copy from in the compacted
  	 * init_fpstate. The gap tracking will zero these states.
  	 */
-	mask = fpstate->user_xfeatures;
  
  	for_each_extended_xfeature(i, mask) {
  		/*



>> Also, IIUC the size of the AMX state will vary in different processors.
>>    Is this correct?  If so, this should be handled already by
>> KVM_GET/SET_XSAVE2 and therefore should be part of the
>> arch/x86/kernel/fpu APIs.  In the future we want to support migrating a
>> "small AMX" host to a "large AMX" host; and also migrating from a "large
>> AMX" host to a "small AMX" host if the guest CPUID is compatible with
>> the destination of the migration.
> 
> How is that supposed to work? If the AMX state size differs then the
> hosts are not compatible.

I replied with some more questions later.  Basically it depends on how
Intel will define palettes that aren't part of the first implementation
of AMX.

Paolo

^ permalink raw reply related	[flat|nested] 80+ messages in thread

* Re: [PATCH 02/19] x86/fpu: Prepare KVM for dynamically enabled states
  2021-12-13  9:12   ` Paolo Bonzini
@ 2021-12-13 12:00     ` Thomas Gleixner
  2021-12-13 12:45       ` Paolo Bonzini
  0 siblings, 1 reply; 80+ messages in thread
From: Thomas Gleixner @ 2021-12-13 12:00 UTC (permalink / raw)
  To: Paolo Bonzini, Yang Zhong, x86, kvm, linux-kernel, mingo, bp,
	dave.hansen
  Cc: seanjc, jun.nakajima, kevin.tian, jing2.liu, jing2.liu

On Mon, Dec 13 2021 at 10:12, Paolo Bonzini wrote:
> On 12/8/21 01:03, Yang Zhong wrote:
>>    - user_xfeatures
>> 
>>      Track which features are currently enabled for the vCPU
>
> Please rename to alloc_xfeatures

That name makes no sense at all. This has nothing to do with alloc.

>>    - user_perm
>> 
>>      Copied from guest_perm of the group leader thread. The first
>>      vCPU which does the copy locks the guest_perm
>
> Please rename to perm_xfeatures.

All of that is following the naming conventions in the FPU code related
to permissions etc.

>>    - realloc_request
>> 
>>      KVM sets this field to request dynamically-enabled features
>>      which require reallocation of @fpstate
>
> This field should be in vcpu->arch, and there is no need for 
> fpu_guest_realloc_fpstate.  Rename __xfd_enable_feature to 
> fpu_enable_xfd_feature and add it to the public API, then just do
>
> 	if (unlikely(vcpu->arch.xfd_realloc_request)) {
> 		u64 request = vcpu->arch.xfd_realloc_request;
> 		ret = fpu_enable_xfd(request, enter_guest);
> 	}
>
> to kvm_put_guest_fpu.

Why? Yet another export of FPU internals just because?

Also what clears the reallocation request and what is the @enter_guest
argument supposed to help with?

I have no idea what you are trying to achieve.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 16/19] kvm: x86: Introduce KVM_{G|S}ET_XSAVE2 ioctl
  2021-12-13 10:43       ` Paolo Bonzini
@ 2021-12-13 12:40         ` Thomas Gleixner
  0 siblings, 0 replies; 80+ messages in thread
From: Thomas Gleixner @ 2021-12-13 12:40 UTC (permalink / raw)
  To: Paolo Bonzini, Yang Zhong, x86, kvm, linux-kernel, mingo, bp,
	dave.hansen
  Cc: seanjc, jun.nakajima, kevin.tian, jing2.liu, jing2.liu

On Mon, Dec 13 2021 at 11:43, Paolo Bonzini wrote:
> On 12/13/21 11:10, Thomas Gleixner wrote:
>> On Fri, Dec 10 2021 at 17:30, Paolo Bonzini wrote:
>>> I think fpu_copy_uabi_to_guest_fpstate (and therefore
>>> copy_uabi_from_kernel_to_xstate) needs to check that the size is
>>> compatible with the components in the input.
>> 
>> fpu_copy_uabi_to_guest_fpstate() expects that the input buffer is
>> correctly sized. We surely can add a size check there.
>
> fpu_copy_guest_fpstate_to_uabi is more problematic because that one
> writes memory.  For fpu_copy_uabi_to_guest_fpstate, we know the input
> buffer size from the components and we can use it to do a properly-sized
> memdup_user.
>
> For fpu_copy_guest_fpstate_to_uabi we can just decide that KVM_GET_XSAVE
> will only save up to the first 4K.  Something like the following might
> actually be good for 5.16-rc; right now, header.xfeatures might lead
> userspace into reading uninitialized or unmapped memory:

If user space supplies a 4k buffer and reads beyond the end of the
buffer then it's hardly a kernel problem.

That function allows the caller to provide a short buffer, and fills it
with the real information up to the point where the buffer ends.
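
A minimal sketch of that clamping behaviour (illustrative names only,
not the actual kernel helper):

	/*
	 * Illustrative only: copy the xstate image into a possibly
	 * short user buffer, truncating at the end of the buffer.
	 */
	static void copy_xstate_to_short_buffer(void *ubuf, size_t ubuf_size,
						const void *xstate,
						size_t xstate_size)
	{
		size_t n = min(ubuf_size, xstate_size);

		memcpy(ubuf, xstate, n);	/* fill up to the buffer end */
	}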

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 02/19] x86/fpu: Prepare KVM for dynamically enabled states
  2021-12-13 12:00     ` Thomas Gleixner
@ 2021-12-13 12:45       ` Paolo Bonzini
  2021-12-13 19:50         ` Thomas Gleixner
  0 siblings, 1 reply; 80+ messages in thread
From: Paolo Bonzini @ 2021-12-13 12:45 UTC (permalink / raw)
  To: Thomas Gleixner, Yang Zhong, x86, kvm, linux-kernel, mingo, bp,
	dave.hansen
  Cc: seanjc, jun.nakajima, kevin.tian, jing2.liu, jing2.liu

On 12/13/21 13:00, Thomas Gleixner wrote:
> On Mon, Dec 13 2021 at 10:12, Paolo Bonzini wrote:
>> On 12/8/21 01:03, Yang Zhong wrote:
>>>     - user_xfeatures
>>>
>>>       Track which features are currently enabled for the vCPU
>>
>> Please rename to alloc_xfeatures
> 
> That name makes no sense at all. This has nothing to do with alloc.

Isn't that the features for which space is currently allocated?

fpstate_realloc does

+	if (guest_fpu) {
+		newfps->is_guest = true;
+		newfps->is_confidential = curfps->is_confidential;
+		guest_fpu->user_xfeatures |= xfeatures;
+	}
+

and kvm_check_guest_realloc_fpstate does


+		if ((guest_fpu->user_xfeatures & request) != request) {
+			vcpu->arch.guest_fpu.realloc_request |= request;
+			return true;
+		}

Reading "user_xfeatures" in there is cryptic, it seems like it's 
something related to the userspace thread or group that has invoked the 
KVM ioctl.  If it's renamed to alloc_xfeatures, then this:

+		missing = request & ~guest_fpu->alloc_xfeatures;
+		if (missing) {
+			vcpu->arch.guest_fpu.realloc_request |= missing;
+			return true;
+		}

makes it obvious that the allocation is for features that are requested 
but haven't been allocated in the xstate yet.

>>>     - user_perm
>>>
>>>       Copied from guest_perm of the group leader thread. The first
>>>       vCPU which does the copy locks the guest_perm
>>
>> Please rename to perm_xfeatures.
> 
> All of that is following the naming conventions in the FPU code related
> to permissions etc.

perm or guest_perm is just fine as well; but user_perm is not (there's 
no preexisting reference to user_perm in the code, in fact, as far as I 
can see).

>>>     - realloc_request
>>>
>>>       KVM sets this field to request dynamically-enabled features
>>>       which require reallocation of @fpstate
>>
>> This field should be in vcpu->arch, and there is no need for
>> fpu_guest_realloc_fpstate.  Rename __xfd_enable_feature to
>> fpu_enable_xfd_feature and add it to the public API, then just do
>>
>> 	if (unlikely(vcpu->arch.xfd_realloc_request)) {
>> 		u64 request = vcpu->arch.xfd_realloc_request;
>> 		ret = fpu_enable_xfd(request, enter_guest);
>> 	}
>>
>> to kvm_put_guest_fpu.
> 
> Why? Yet another export of FPU internals just because?

It's one function more and one field less.  I prefer another export of 
FPU internals to a write to a random field with undocumented invariants.

For example, why WARN_ON_ONCE if enter_guest == true?  If you enter the 
guest after the host has restored MSR_IA32_XFD with KVM_SET_MSR, the 
WARN_ON_ONCE can actually fire.  It would be just fine to skip the 
reallocation until enter_guest == false, which is when you actually 
XSAVE into the guest_fpu.

As an aside, realloc_request (if it survives, see below) shouldn't be 
added until patch 6.

> Also what clears the reallocation request and what is the @enter_guest
> argument supposed to help with?

Nothing---make it

  	if (unlikely(vcpu->arch.xfd_realloc_request)) {
  		u64 request = vcpu->arch.xfd_realloc_request;
  		ret = fpu_enable_xfd(request, &vcpu->arch.guest_fpu);
		if (!ret)
			vcpu->arch.xfd_realloc_request = 0;
  	}

but in fact, I'm not sure why the request has to be delayed at all.  The 
obvious implementation of a write to XFD, after all the validity checks, 
is just

	/* This function just calls xfd_enable_feature.  */
	r = fpu_guest_alloc_xfeatures(&vcpu->arch.guest_fpu,
				      vcpu->arch.xcr0 & ~xfd);
	/*
	 * An error means that userspace has screwed up by not doing
	 * arch_prctl(ARCH_REQ_XCOMP_GUEST_PERM).  If we got here from
	 * an ioctl, fail it; if the guest has done WRMSR or XSETBV,
	 * a #GP will be injected.
	 */
	if (r < 0)
		return 1;

	vcpu->arch.xfd = xfd;

with none of the kvm_check_guest_realloc_fpstate or KVM_EXIT_FPU_REALLOC 
business.  No field and a simple, self-contained external API.  The same 
code works for __kvm_set_xcr as well, just with "xcr0 & ~vcpu->arch.xfd" 
as the second argument instead.
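
For completeness, the XSETBV side would be the same shape (a sketch
under the same assumptions, with fpu_guest_alloc_xfeatures as the
hypothetical wrapper named above):

	r = fpu_guest_alloc_xfeatures(&vcpu->arch.guest_fpu,
				      xcr0 & ~vcpu->arch.xfd);
	if (r < 0)
		return 1;	/* ioctl error, or #GP for the guest */

	vcpu->arch.xcr0 = xcr0;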

(Maybe I'm missing why KVM_EXIT_FPU_REALLOC is needed, but I'm not 
ashamed to say that, given that the new userspace API was added without 
documentation or tests.  The only comment in the code is:

+		/*
+		 * Check if fpstate reallocate is required. If yes, then
+		 * let the fpu core do reallocation and update xfd;
+		 * otherwise, update xfd here.
+		 */
+		if (kvm_check_guest_realloc_fpstate(vcpu, data)) {
+			vcpu->run->exit_reason = KVM_EXIT_FPU_REALLOC;
+			vcpu->arch.complete_userspace_io =
+				kvm_skip_emulated_instruction;
+			return KVM_MSR_RET_USERSPACE;
+		}
+

which has nothing to do with the actual content of either 
kvm_check_guest_realloc_fpstate or the "then" branch.  Userspace can 
just do ARCH_REQ_XCOMP_GUEST_PERM based on the guest CPUID, before 
KVM_SET_CPUID2, KVM_SET_MSR or KVM_SET_XCR).

Paolo

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 10/19] kvm: x86: Emulate WRMSR of guest IA32_XFD
  2021-12-08  0:03 ` [PATCH 10/19] kvm: x86: Emulate WRMSR of guest IA32_XFD Yang Zhong
  2021-12-10 16:02   ` Paolo Bonzini
  2021-12-10 23:09   ` Thomas Gleixner
@ 2021-12-13 15:06   ` Paolo Bonzini
  2021-12-13 19:45     ` Thomas Gleixner
  2 siblings, 1 reply; 80+ messages in thread
From: Paolo Bonzini @ 2021-12-13 15:06 UTC (permalink / raw)
  To: Yang Zhong, x86, kvm, linux-kernel, tglx, mingo, bp, dave.hansen
  Cc: seanjc, jun.nakajima, kevin.tian, jing2.liu, jing2.liu

On 12/8/21 01:03, Yang Zhong wrote:
> +		/*
> +		 * Update IA32_XFD to the guest value so #NM can be
> +		 * raised properly in the guest. Instead of directly
> +		 * writing the MSR, call a helper to avoid breaking
> +		 * per-cpu cached value in fpu core.
> +		 */
> +		fpregs_lock();
> +		current->thread.fpu.fpstate->xfd = data;

This is wrong, it should be written in vcpu->arch.guest_fpu.

> +		xfd_update_state(current->thread.fpu.fpstate);

This is okay though, so that KVM_SET_MSR will not write XFD and WRMSR will.

That said, I think xfd_update_state should not have an argument. 
current->thread.fpu.fpstate->xfd is the only fpstate that should be 
synced with the xfd_state per-CPU variable.

Paolo

> +		fpregs_unlock();


^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 10/19] kvm: x86: Emulate WRMSR of guest IA32_XFD
  2021-12-13 15:06   ` Paolo Bonzini
@ 2021-12-13 19:45     ` Thomas Gleixner
  2021-12-13 21:23       ` Thomas Gleixner
  0 siblings, 1 reply; 80+ messages in thread
From: Thomas Gleixner @ 2021-12-13 19:45 UTC (permalink / raw)
  To: Paolo Bonzini, Yang Zhong, x86, kvm, linux-kernel, mingo, bp,
	dave.hansen
  Cc: seanjc, jun.nakajima, kevin.tian, jing2.liu, jing2.liu

On Mon, Dec 13 2021 at 16:06, Paolo Bonzini wrote:
> On 12/8/21 01:03, Yang Zhong wrote:
>> +		/*
>> +		 * Update IA32_XFD to the guest value so #NM can be
>> +		 * raised properly in the guest. Instead of directly
>> +		 * writing the MSR, call a helper to avoid breaking
>> +		 * per-cpu cached value in fpu core.
>> +		 */
>> +		fpregs_lock();
>> +		current->thread.fpu.fpstate->xfd = data;
>
> This is wrong, it should be written in vcpu->arch.guest_fpu.
>
>> +		xfd_update_state(current->thread.fpu.fpstate);
>
> This is okay though, so that KVM_SET_MSR will not write XFD and WRMSR
> will.
>
> That said, I think xfd_update_state should not have an argument. 
> current->thread.fpu.fpstate->xfd is the only fpstate that should be 
> synced with the xfd_state per-CPU variable.

I'm looking into this right now. The whole restore versus runtime thing
needs to be handled differently.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 02/19] x86/fpu: Prepare KVM for dynamically enabled states
  2021-12-13 12:45       ` Paolo Bonzini
@ 2021-12-13 19:50         ` Thomas Gleixner
  0 siblings, 0 replies; 80+ messages in thread
From: Thomas Gleixner @ 2021-12-13 19:50 UTC (permalink / raw)
  To: Paolo Bonzini, Yang Zhong, x86, kvm, linux-kernel, mingo, bp,
	dave.hansen
  Cc: seanjc, jun.nakajima, kevin.tian, jing2.liu, jing2.liu

Paolo,

On Mon, Dec 13 2021 at 13:45, Paolo Bonzini wrote:
> On 12/13/21 13:00, Thomas Gleixner wrote:
>> On Mon, Dec 13 2021 at 10:12, Paolo Bonzini wrote:
>>> Please rename to alloc_xfeatures
>> 
>> That name makes no sense at all. This has nothing to do with alloc.
>
> Isn't that the features for which space is currently allocated?

It is, but from the kernel POV this is user. :)

> Reading "user_xfeatures" in there is cryptic, it seems like it's 
> something related to the userspace thread or group that has invoked the 
> KVM ioctl.  If it's renamed to alloc_xfeatures, then this:
>
> +		missing = request & ~guest_fpu->alloc_xfeatures;
> +		if (missing) {
> +			vcpu->arch.guest_fpu.realloc_request |= missing;
> +			return true;
> +		}
>
> makes it obvious that the allocation is for features that are requested 
> but haven't been allocated in the xstate yet.

Let's rename it to xfeatures and perm and be done with it.

>> Why? Yet another export of FPU internals just because?
>
> It's one function more and one field less.  I prefer another export of 
> FPU internals, to a write to a random field with undocumented
> invariants.

We want fewer exports, not more. :)

> For example, why WARN_ON_ONCE if enter_guest == true?  If you enter the 
> guest after the host has restored MSR_IA32_XFD with KVM_SET_MSR, the

Indeed, restoring a guest might require buffer reallocation; I missed
that, duh!

On restore the following components are involved:

   XCR0, XFD, XSTATE

XCR0 and XFD have to be restored _before_ XSTATE and that needs to
be enforced.

But independent of the ordering of XCR0 and XFD restore the following
check applies to both the restore and the runtime logic:

int kvm_fpu_realloc(struct kvm_vcpu *vcpu, u64 xcr0, u64 xfd)
{
	u64 expand, enabled = xcr0 & ~xfd;

	expand = enabled & ~vcpu->arch.guest_fpu.xfeatures;
	if (!expand)
		return 0;

	return fpu_enable_guest_features(&vcpu->arch.guest_fpu, expand);
}

int fpu_enable_guest_features(struct guest_fpu *gfpu, u64 which)
{
	permission_checks();
	...
	return fpstate_realloc(.....);
}

fpstate_realloc() needs to be careful about flipping the pointers
depending on whether guest_fpu->fpstate is actually active, i.e.:

        current->thread.fpu.fpstate == gfpu->fpstate

I'm halfway done with that. Will send something soonish.

Thanks,

        tglx

       


^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 10/19] kvm: x86: Emulate WRMSR of guest IA32_XFD
  2021-12-13 19:45     ` Thomas Gleixner
@ 2021-12-13 21:23       ` Thomas Gleixner
  2021-12-14  7:16         ` Tian, Kevin
  0 siblings, 1 reply; 80+ messages in thread
From: Thomas Gleixner @ 2021-12-13 21:23 UTC (permalink / raw)
  To: Paolo Bonzini, Yang Zhong, x86, kvm, linux-kernel, mingo, bp,
	dave.hansen

Paolo,

On Mon, Dec 13 2021 at 20:45, Thomas Gleixner wrote:
> On Mon, Dec 13 2021 at 16:06, Paolo Bonzini wrote:
>> That said, I think xfd_update_state should not have an argument. 
>> current->thread.fpu.fpstate->xfd is the only fpstate that should be 
>> synced with the xfd_state per-CPU variable.
>
> I'm looking into this right now. The whole restore versus runtime thing
> needs to be handled differently.

We need to look at different things here:

   1) XFD MSR write emulation

   2) XFD MSR synchronization when write emulation is disabled

   3) Guest restore

#1 and #2 are in the context of vcpu_run() and

   vcpu->arch.guest_fpu.fpstate == current->thread.fpu.fpstate

while #3 has:

   vcpu->arch.guest_fpu.fpstate != current->thread.fpu.fpstate


#2 is only updating fpstate->xfd and the per CPU shadow.

So the state synchronization wants to be something like this:

void fpu_sync_guest_xfd_state(void)
{
	struct fpstate *fps = current->thread.fpu.fpstate;

	lockdep_assert_irqs_disabled();
	if (fpu_state_size_dynamic()) {
		rdmsrl(MSR_IA32_XFD, fps->xfd);
		__this_cpu_write(xfd_state, fps->xfd);
	}
}
EXPORT_SYMBOL_GPL(fpu_sync_guest_xfd_state);

No wrmsrl() because the MSR is already up to date. The important part is
that fpstate->xfd and the shadow state are updated so that after
reenabling preemption the context switch FPU logic works correctly.
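
Presumably the call site is right after the VM-exit while IRQs are
still disabled, something like this sketch (the flag name is made up):

	/* illustrative call site, not the actual KVM code */
	if (vcpu->arch.xfd_no_intercept)	/* hypothetical flag */
		fpu_sync_guest_xfd_state();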


#1 and #3 can trigger a reallocation of guest_fpu.fpstate and
can fail. But this is also true for XSETBV emulation and XCR0 restore.

For #1 modifying fps->xfd in the KVM code before calling into the FPU
code is just _wrong_ because if the guest removes the XFD restriction
then it must be ensured that the buffer is sized correctly _before_ this
is updated.
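
In other words, a sketch of the correct ordering for the WRMSR
emulation path, reusing the kvm_fpu_realloc() helper from my earlier
mail (names illustrative):

	/* grow guest_fpu.fpstate first if the write enables features ... */
	if (kvm_fpu_realloc(vcpu, vcpu->arch.xcr0, data))
		return 1;

	/* ... and only then publish the new XFD value */
	vcpu->arch.guest_fpu.fpstate->xfd = data;
	xfd_update_state(vcpu->arch.guest_fpu.fpstate);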

For #3 it's not really important, but I'm still trying to wrap my head
around the whole picture vs. XCR0.

There are two options:

  1) Require strict ordering of XFD and XCR0 update to avoid pointless
     buffer expansion, i.e. XFD before XCR0.

     Because if XCR0 is updated while guest_fpu->fpstate.xfd is still in
     init state (0) and XCR0 contains extended features, then the buffer
     would be expanded because XFD does not mask the extended features
     out. When XFD is restored with a non-zero value, it's too late
     already.

  2) Ignore buffer expansion up to the point where XSTATE restore happens
     and evaluate guest XCR0 and guest_fpu->fpstate.xfd there.

I'm leaning towards #1 because that means we have exactly _ONE_ place
where we need to deal with buffer expansion. If Qemu gets the ordering
wrong it wastes memory per vCPU, *shrug*.
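
A tiny worked example of why the ordering matters, reusing the
expansion check from the sketch in my earlier mail (standalone
illustration, with TILEDATA as state component 18):

	#include <stdio.h>
	#include <stdint.h>

	#define XTILEDATA	(1ULL << 18)	/* AMX state component 18 */

	int main(void)
	{
		uint64_t xfeatures  = 0x7;		/* legacy states already allocated */
		uint64_t guest_xcr0 = 0x7 | XTILEDATA;	/* guest enabled AMX in XCR0 */
		uint64_t guest_xfd  = XTILEDATA;	/* but XFD still arms it */
		uint64_t xfd0       = 0;		/* XFD init state */
		uint64_t enabled, expand;

		/* Option #1: XFD restored before XCR0 -> no expansion */
		enabled = guest_xcr0 & ~guest_xfd;
		expand  = enabled & ~xfeatures;
		printf("XFD first:  expand = %#llx\n", (unsigned long long)expand);

		/* Wrong order: XCR0 restored while XFD is still in init state */
		enabled = guest_xcr0 & ~xfd0;
		expand  = enabled & ~xfeatures;
		printf("XCR0 first: expand = %#llx\n", (unsigned long long)expand);
		return 0;	/* prints 0, then 0x40000: a pointless realloc */
	}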

Thanks,

        tglx





^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 01/19] x86/fpu: Extend prctl() with guest permissions
  2021-12-08  0:03 ` [PATCH 01/19] x86/fpu: Extend prctl() with guest permissions Yang Zhong
@ 2021-12-14  0:16   ` Thomas Gleixner
  0 siblings, 0 replies; 80+ messages in thread
From: Thomas Gleixner @ 2021-12-14  0:16 UTC (permalink / raw)
  To: Yang Zhong, x86, kvm, linux-kernel, mingo, bp, dave.hansen, pbonzini
  Cc: seanjc, jun.nakajima, kevin.tian, jing2.liu, jing2.liu, yang.zhong

On Tue, Dec 07 2021 at 19:03, Yang Zhong wrote:
> Similar to native permissions this doesn't actually enable the
> permitted feature. KVM is expected to install a larger kernel buffer
> and enable the feature when detecting the intention from the guest.
>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> Signed-off-by: Jing Liu <jing2.liu@intel.com>
> Signed-off-by: Yang Zhong <yang.zhong@intel.com>
> ---
> (To Thomas) We change the definition of xstate_get_guest_group_perm()
> from xstate.h to api.h since this will be called by KVM.

No.

There is absolutely no need for that. After creating a vCPU the
permissions are frozen and readily available via
vcpu->arch.guest_fpu.perm.

Thanks,

        tglx


^ permalink raw reply	[flat|nested] 80+ messages in thread

* RE: [PATCH 16/19] kvm: x86: Introduce KVM_{G|S}ET_XSAVE2 ioctl
  2021-12-13  9:24         ` Paolo Bonzini
@ 2021-12-14  6:06           ` Wang, Wei W
  2021-12-14  6:18             ` Paolo Bonzini
  0 siblings, 1 reply; 80+ messages in thread
From: Wang, Wei W @ 2021-12-14  6:06 UTC (permalink / raw)
  To: Paolo Bonzini, Zhong, Yang, x86, kvm, linux-kernel, tglx, mingo,
	bp, dave.hansen
  Cc: seanjc, Nakajima, Jun, Tian, Kevin, jing2.liu, Liu, Jing2, Zeng, Guang

On Monday, December 13, 2021 5:24 PM, Paolo Bonzini wrote:
> There is no need for struct kvm_xsave2, because there is no need for a "size"
> argument.
> 
> - KVM_GET_XSAVE2 *is* needed, and it can expect a buffer as big as the return
> value of KVM_CHECK_EXTENSION(KVM_CAP_XSAVE2)

Why would KVM_GET_XSAVE2 still be needed in this case?

I'm thinking it would also be possible to reuse KVM_GET_XSAVE:

- If userspace calls KVM_CHECK_EXTENSION(KVM_CAP_XSAVE2),
 then KVM knows that userspace is a new version that works with the larger xsave buffer, using the "size" that KVM returns via KVM_CAP_XSAVE2.
 So we can add a flag "kvm->xsave2_enabled", which gets set when userspace checks KVM_CAP_XSAVE2.

- On KVM_GET_XSAVE, if "kvm->xsave2_enabled" is set,
 then KVM allocates a buffer to load the xstates and copies the loaded xstate data to the userspace buffer
 using the "size" that was returned to userspace via KVM_CAP_XSAVE2.
 If "kvm->xsave2_enabled" isn't set, KVM uses the legacy 4KB size. (A rough sketch follows.)

Thanks,
Wei

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 16/19] kvm: x86: Introduce KVM_{G|S}ET_XSAVE2 ioctl
  2021-12-14  6:06           ` Wang, Wei W
@ 2021-12-14  6:18             ` Paolo Bonzini
  2021-12-15  2:39               ` Wang, Wei W
  0 siblings, 1 reply; 80+ messages in thread
From: Paolo Bonzini @ 2021-12-14  6:18 UTC (permalink / raw)
  To: Wang, Wei W, Zhong, Yang, x86, kvm, linux-kernel, tglx, mingo,
	bp, dave.hansen
  Cc: seanjc, Nakajima, Jun, Tian, Kevin, jing2.liu, Liu, Jing2, Zeng, Guang

On 12/14/21 07:06, Wang, Wei W wrote:
> On Monday, December 13, 2021 5:24 PM, Paolo Bonzini wrote:
>> There is no need for struct kvm_xsave2, because there is no need for a "size"
>> argument.
>>
>> - KVM_GET_XSAVE2 *is* needed, and it can expect a buffer as big as the return
>> value of KVM_CHECK_EXTENSION(KVM_CAP_XSAVE2)
> 
> Why would KVM_GET_XSAVE2 still be needed in this case?
> 
> I'm thinking it would also be possible to reuse KVM_GET_XSAVE:
> 
> - If userspace calls KVM_CHECK_EXTENSION(KVM_CAP_XSAVE2),
>   then KVM knows that userspace is a new version that works with the larger xsave buffer, using the "size" that KVM returns via KVM_CAP_XSAVE2.
>   So we can add a flag "kvm->xsave2_enabled", which gets set when userspace checks KVM_CAP_XSAVE2.

You can use KVM_ENABLE_CAP(KVM_CAP_XSAVE2) for that, yes.  In that case 
you don't need KVM_GET_XSAVE2.

Paolo

> - On KVM_GET_XSAVE, if "kvm->xsave2_enabled" is set,
>   then KVM allocates a buffer to load the xstates and copies the loaded xstate data to the userspace buffer
>   using the "size" that was returned to userspace via KVM_CAP_XSAVE2.
>   If "kvm->xsave2_enabled" isn't set, KVM uses the legacy 4KB size.
> 
> Thanks,
> Wei
> 


^ permalink raw reply	[flat|nested] 80+ messages in thread

* RE: [PATCH 09/19] kvm: x86: Prepare reallocation check
  2021-12-13  9:16   ` Paolo Bonzini
@ 2021-12-14  7:06     ` Tian, Kevin
  2021-12-14 10:16       ` Paolo Bonzini
  0 siblings, 1 reply; 80+ messages in thread
From: Tian, Kevin @ 2021-12-14  7:06 UTC (permalink / raw)
  To: Paolo Bonzini, Zhong, Yang, x86, kvm, linux-kernel, tglx, mingo,
	bp, dave.hansen
  Cc: Christopherson,, Sean, Nakajima, Jun, jing2.liu, Liu, Jing2

> From: Paolo Bonzini
> Sent: Monday, December 13, 2021 5:16 PM
> 
> 
> - if (dynamic_enabled & ~guest_fpu->user_perm) != 0, then this is a
> userspace error and you can #GP the guest without any issue.  Userspace
> is buggy
> 

Is it a general guideline that an error caused by emulation itself (e.g.
due to no memory) can be reflected into the guest as #GP, even
when from guest p.o.v there is nothing wrong with its setting?

Thanks
Kevin

^ permalink raw reply	[flat|nested] 80+ messages in thread

* RE: [PATCH 10/19] kvm: x86: Emulate WRMSR of guest IA32_XFD
  2021-12-13 21:23       ` Thomas Gleixner
@ 2021-12-14  7:16         ` Tian, Kevin
  0 siblings, 0 replies; 80+ messages in thread
From: Tian, Kevin @ 2021-12-14  7:16 UTC (permalink / raw)
  To: Thomas Gleixner, Paolo Bonzini, Zhong, Yang, x86, kvm,
	linux-kernel, mingo, bp, dave.hansen

Hi, Thomas,

> From: Thomas Gleixner <tglx@linutronix.de>
> Sent: Tuesday, December 14, 2021 5:23 AM
> 
> Paolo,
> 
> On Mon, Dec 13 2021 at 20:45, Thomas Gleixner wrote:
> > On Mon, Dec 13 2021 at 16:06, Paolo Bonzini wrote:
> >> That said, I think xfd_update_state should not have an argument.
> >> current->thread.fpu.fpstate->xfd is the only fpstate that should be
> >> synced with the xfd_state per-CPU variable.
> >
> > I'm looking into this right now. The whole restore versus runtime thing
> > needs to be handled differently.
> 

After looking at your series, I think it missed Paolo's comment
about changing xfd_update_state() to accept no argument.

Thanks
Kevin

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 09/19] kvm: x86: Prepare reallocation check
  2021-12-14  7:06     ` Tian, Kevin
@ 2021-12-14 10:16       ` Paolo Bonzini
  2021-12-14 14:41         ` Liu, Jing2
  0 siblings, 1 reply; 80+ messages in thread
From: Paolo Bonzini @ 2021-12-14 10:16 UTC (permalink / raw)
  To: Tian, Kevin, Zhong, Yang, x86, kvm, linux-kernel, tglx, mingo,
	bp, dave.hansen
  Cc: Christopherson,, Sean, Nakajima, Jun, jing2.liu, Liu, Jing2

On 12/14/21 08:06, Tian, Kevin wrote:
>> - if (dynamic_enabled & ~guest_fpu->user_perm) != 0, then this is a
>> userspace error and you can #GP the guest without any issue.  Userspace
>> is buggy
>
> Is it a general guideline that an error caused by emulation itself (e.g.
> due to no memory) can be reflected into the guest as #GP, even
> when from guest p.o.v there is nothing wrong with its setting?

No memory is a tricky one, if possible it should propagate -ENOMEM up to 
KVM_RUN or KVM_SET_MSR.  But it's basically an impossible case anyway, 
because even with 8K TILEDATA we're within the limit of 
PAGE_ALLOC_COSTLY_ORDER.

So, since it's not easy to do it right now, we can look at it later.

Paolo

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 10/19] kvm: x86: Emulate WRMSR of guest IA32_XFD
  2021-12-10 16:02   ` Paolo Bonzini
  2021-12-13  7:51     ` Liu, Jing2
@ 2021-12-14 10:26     ` Yang Zhong
  2021-12-14 11:24       ` Paolo Bonzini
  1 sibling, 1 reply; 80+ messages in thread
From: Yang Zhong @ 2021-12-14 10:26 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: x86, kvm, linux-kernel, tglx, mingo, bp, dave.hansen, seanjc,
	jun.nakajima, kevin.tian, jing2.liu, jing2.liu, yang.zhong

On Fri, Dec 10, 2021 at 05:02:49PM +0100, Paolo Bonzini wrote:
> First, the MSR should be added to msrs_to_save_all and
> kvm_cpu_cap_has(X86_FEATURE_XFD) should be checked in
> kvm_init_msr_list.
> 
> It seems that RDMSR support is missing, too.
> 
> More important, please include:
> 
> - documentation for the new KVM_EXIT_* value
> 
> - a selftest that explains how userspace should react to it.
> 
> This is a strong requirement for any new API (the first has been for
> years; but the latter is also almost always respected these days).
> This series should not have been submitted without documentation.
> 
> Also:
> 
> On 12/8/21 01:03, Yang Zhong wrote:
> >
> >+		if (!guest_cpuid_has(vcpu, X86_FEATURE_XFD))
> >+			return 1;
> 
> This should allow msr->host_initiated always (even if XFD is not
> part of CPUID).  However, if XFD is nonzero and
> kvm_check_guest_realloc_fpstate returns true, then it should return
> 1.
> 
> The selftest should also cover using KVM_GET_MSR/KVM_SET_MSR.
> 

  Paolo, it seems we do not need a new KVM_EXIT_* value anymore, given Thomas' new patchset below:
  git://git.kernel.org/pub/scm/linux/kernel/git/people/tglx/devel.git x86/fpu-kvm

  So the selftest still needs to support KVM_GET_MSR/KVM_SET_MSR for MSR_IA32_XFD
  and MSR_IA32_XFD_ERR? If yes, should we only do some read/write tests with
  vcpu_set_msr()/vcpu_get_msr() from the new selftest tool, or do wrmsr from the
  guest side and check the value from the selftest side? 

  I checked some MSR selftest reference code, tsc_msrs_test.c, which may be a good
  reference for this. If you have a better suggestion, please share it with me. Thanks!

  Yang 



^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 10/19] kvm: x86: Emulate WRMSR of guest IA32_XFD
  2021-12-14 10:26     ` Yang Zhong
@ 2021-12-14 11:24       ` Paolo Bonzini
  0 siblings, 0 replies; 80+ messages in thread
From: Paolo Bonzini @ 2021-12-14 11:24 UTC (permalink / raw)
  To: Yang Zhong
  Cc: x86, kvm, linux-kernel, tglx, mingo, bp, dave.hansen, seanjc,
	jun.nakajima, kevin.tian, jing2.liu, jing2.liu

On 12/14/21 11:26, Yang Zhong wrote:
>    Paolo, it seems we do not need a new KVM_EXIT_* value anymore, given Thomas' new patchset below:
>    git://git.kernel.org/pub/scm/linux/kernel/git/people/tglx/devel.git x86/fpu-kvm
> 
>    So the selftest still needs to support KVM_GET_MSR/KVM_SET_MSR for MSR_IA32_XFD
>    and MSR_IA32_XFD_ERR? If yes, should we only do some read/write tests with
>    vcpu_set_msr()/vcpu_get_msr() from the new selftest tool, or do wrmsr from the
>    guest side and check the value from the selftest side?

You can write a test similar to state_test.c to cover XCR0, XFD and the
new XSAVE extensions.  The test can:

- initialize AMX and write a nonzero value to XFD

- load a matrix into TMM0

- check that #NM is delivered (search for vm_install_exception_handler) and
that XFD_ERR is correct

- write 0 to XFD

- load again the matrix, and check that #NM is not delivered

- store it back into memory

- compare it with the original data

All of this can be done with a full save&restore after every step
(though I suggest that you first get it working without save&restore;
the relevant code in state_test.c is easy to identify and comment out).
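
For the #NM step, a rough sketch of the guest-side handler under the
selftest conventions (MSR numbers per the SDM; illustrative flow, not
a complete test):

	#define NM_VECTOR		7
	#define MSR_IA32_XFD		0x1c4
	#define MSR_IA32_XFD_ERR	0x1c5
	#define XFEATURE_MASK_XTILEDATA	(1ULL << 18)

	static void guest_nm_handler(struct ex_regs *regs)
	{
		/* #NM fired: XFD_ERR says which disarmed component was touched */
		GUEST_ASSERT(rdmsr(MSR_IA32_XFD_ERR) == XFEATURE_MASK_XTILEDATA);
		wrmsr(MSR_IA32_XFD_ERR, 0);	/* mimic the native #NM handler */
		wrmsr(MSR_IA32_XFD, 0);		/* enable AMX; the insn is retried */
	}

	/* host side, before running the vCPU:
	 * vm_install_exception_handler(vm, NM_VECTOR, guest_nm_handler);
	 */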

You will have to modify vcpu_load_state, so that it does
first KVM_SET_MSRS, then KVM_SET_XCRS, then KVM_SET_XSAVE.
See patch below.

Paolo

>    I checked some msr selftest reference code, tsc_msrs_test.c, which maybe better for this
>    reference. If you have better suggestion, please share it to me. thanks!


------------------ 8< -----------------
From: Paolo Bonzini <pbonzini@redhat.com>
Subject: [PATCH] selftest: kvm: Reorder vcpu_load_state steps for AMX

For AMX support it is recommended to load XCR0 after XFD, so that
KVM does not see XFD=0, XCR=1 for a save state that will eventually
be disabled (which would lead to premature allocation of the space
required for that save state).

It is also required to load XSAVE data after XCR0 and XFD, so that
KVM can trigger allocation of the extra space required to store AMX
state.

Adjust vcpu_load_state to obey these new requirements.

Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>

diff --git a/tools/testing/selftests/kvm/lib/x86_64/processor.c b/tools/testing/selftests/kvm/lib/x86_64/processor.c
index 82c39db91369..d805f63f7203 100644
--- a/tools/testing/selftests/kvm/lib/x86_64/processor.c
+++ b/tools/testing/selftests/kvm/lib/x86_64/processor.c
@@ -1157,16 +1157,6 @@ void vcpu_load_state(struct kvm_vm *vm, uint32_t vcpuid, struct kvm_x86_state *s
  	struct vcpu *vcpu = vcpu_find(vm, vcpuid);
  	int r;
  
-	r = ioctl(vcpu->fd, KVM_SET_XSAVE, &state->xsave);
-        TEST_ASSERT(r == 0, "Unexpected result from KVM_SET_XSAVE, r: %i",
-                r);
-
-	if (kvm_check_cap(KVM_CAP_XCRS)) {
-		r = ioctl(vcpu->fd, KVM_SET_XCRS, &state->xcrs);
-		TEST_ASSERT(r == 0, "Unexpected result from KVM_SET_XCRS, r: %i",
-			    r);
-	}
-
  	r = ioctl(vcpu->fd, KVM_SET_SREGS, &state->sregs);
          TEST_ASSERT(r == 0, "Unexpected result from KVM_SET_SREGS, r: %i",
                  r);
@@ -1175,6 +1165,16 @@ void vcpu_load_state(struct kvm_vm *vm, uint32_t vcpuid, struct kvm_x86_state *s
          TEST_ASSERT(r == state->msrs.nmsrs, "Unexpected result from KVM_SET_MSRS, r: %i (failed at %x)",
                  r, r == state->msrs.nmsrs ? -1 : state->msrs.entries[r].index);
  
+	if (kvm_check_cap(KVM_CAP_XCRS)) {
+		r = ioctl(vcpu->fd, KVM_SET_XCRS, &state->xcrs);
+		TEST_ASSERT(r == 0, "Unexpected result from KVM_SET_XCRS, r: %i",
+			    r);
+	}
+
+	r = ioctl(vcpu->fd, KVM_SET_XSAVE, &state->xsave);
+        TEST_ASSERT(r == 0, "Unexpected result from KVM_SET_XSAVE, r: %i",
+                r);
+
  	r = ioctl(vcpu->fd, KVM_SET_VCPU_EVENTS, &state->events);
          TEST_ASSERT(r == 0, "Unexpected result from KVM_SET_VCPU_EVENTS, r: %i",
                  r);

^ permalink raw reply related	[flat|nested] 80+ messages in thread

* RE: [PATCH 09/19] kvm: x86: Prepare reallocation check
  2021-12-14 10:16       ` Paolo Bonzini
@ 2021-12-14 14:41         ` Liu, Jing2
  2021-12-15  7:09           ` Tian, Kevin
  0 siblings, 1 reply; 80+ messages in thread
From: Liu, Jing2 @ 2021-12-14 14:41 UTC (permalink / raw)
  To: Paolo Bonzini, Tian, Kevin, Zhong, Yang, x86, kvm, linux-kernel,
	tglx, mingo, bp, dave.hansen
  Cc: Christopherson,, Sean, Nakajima, Jun, jing2.liu


On 12/14/2021 6:16 PM, Paolo Bonzini wrote:
> 
> On 12/14/21 08:06, Tian, Kevin wrote:
> >> - if (dynamic_enabled & ~guest_fpu->user_perm) != 0, then this is a
> >> userspace error and you can #GP the guest without any issue.
> >> Userspace is buggy
> >
> > Is it a general guideline that an error caused by emulation itself (e.g.
> > due to no memory) can be reflected into the guest as #GP, even when
> > from guest p.o.v there is nothing wrong with its setting?
> 
> No memory is a tricky one, if possible it should propagate -ENOMEM up to
> KVM_RUN or KVM_SET_MSR.  But it's basically an impossible case anyway,
> because even with 8K TILEDATA we're within the limit of
> PAGE_ALLOC_COSTLY_ORDER.
> 
> So, since it's not easy to do it right now, we can look at it later.

Regarding the handling of xcr0 and xfd ioctl failures, xcr0 and xfd are 
handled differently: currently KVM_SET_XCRS returns -EINVAL to 
userspace, while KVM_SET_MSR is always allowed per the discussion in 
another thread.

So if reallocation fails in KVM_SET_XCRS or KVM_SET_MSR (perhaps due 
to ENOMEM, EPERM or ENOTSUPP), 
which way would we like to choose?

Thanks,
Jing
 
> Paolo

^ permalink raw reply	[flat|nested] 80+ messages in thread

* RE: [PATCH 16/19] kvm: x86: Introduce KVM_{G|S}ET_XSAVE2 ioctl
  2021-12-14  6:18             ` Paolo Bonzini
@ 2021-12-15  2:39               ` Wang, Wei W
  2021-12-15 13:42                 ` Paolo Bonzini
  0 siblings, 1 reply; 80+ messages in thread
From: Wang, Wei W @ 2021-12-15  2:39 UTC (permalink / raw)
  To: Paolo Bonzini, Zhong, Yang, x86, kvm, linux-kernel, tglx, mingo,
	bp, dave.hansen
  Cc: seanjc, Nakajima, Jun, Tian, Kevin, jing2.liu, Liu, Jing2, Zeng, Guang

On Tuesday, December 14, 2021 2:19 PM, Paolo Bonzini wrote:
> 
> On 12/14/21 07:06, Wang, Wei W wrote:
> > On Monday, December 13, 2021 5:24 PM, Paolo Bonzini wrote:
> >> There is no need for struct kvm_xsave2, because there is no need for a
> "size"
> >> argument.
> >>
> >> - KVM_GET_XSAVE2 *is* needed, and it can expect a buffer as big as
> >> the return value of KVM_CHECK_EXTENSION(KVM_CAP_XSAVE2)
> >
> > Why would KVM_GET_XSAVE2 still be needed in this case?
> >
> > I'm thinking it would also be possible to reuse KVM_GET_XSAVE:
> >
> > - If userspace calls KVM_CHECK_EXTENSION(KVM_CAP_XSAVE2),
> >   then KVM knows that userspace is a new version that works with the
> >   larger xsave buffer, using the "size" that KVM returns via KVM_CAP_XSAVE2.
> >   So we can add a flag "kvm->xsave2_enabled", which gets set when
> >   userspace checks KVM_CAP_XSAVE2.
> 
> You can use KVM_ENABLE_CAP(KVM_CAP_XSAVE2) for that, yes.  In that case
> you don't need KVM_GET_XSAVE2.

One more thing here: what size should KVM_CHECK_EXTENSION(KVM_CAP_XSAVE2) return?
If the size still comes from the guest CPUID(0xd, 0)::RCX, would it be better to just return 1?
That requires the QEMU CPUID info to have been set in KVM before checking the cap,
and QEMU already has this CPUID info to get the size (so there seems to be no need to inquire KVM for it).

Thanks,
Wei

^ permalink raw reply	[flat|nested] 80+ messages in thread

* RE: [PATCH 09/19] kvm: x86: Prepare reallocation check
  2021-12-14 14:41         ` Liu, Jing2
@ 2021-12-15  7:09           ` Tian, Kevin
  0 siblings, 0 replies; 80+ messages in thread
From: Tian, Kevin @ 2021-12-15  7:09 UTC (permalink / raw)
  To: Liu, Jing2, Paolo Bonzini, Zhong, Yang, x86, kvm, linux-kernel,
	tglx, mingo, bp, dave.hansen
  Cc: Christopherson,, Sean, Nakajima, Jun, jing2.liu

> From: Liu, Jing2 <jing2.liu@intel.com>
> Sent: Tuesday, December 14, 2021 10:42 PM
> 
> On 12/14/2021 6:16 PM, Paolo Bonzini wrote:
> >
> > On 12/14/21 08:06, Tian, Kevin wrote:
> > >> - if (dynamic_enabled & ~guest_fpu->user_perm) != 0, then this is a
> > >> userspace error and you can #GP the guest without any issue.
> > >> Userspace is buggy
> > >
> > > Is it a general guideline that an error caused by emulation itself (e.g.
> > > due to no memory) can be reflected into the guest as #GP, even when
> > > from guest p.o.v there is nothing wrong with its setting?
> >
> > No memory is a tricky one, if possible it should propagate -ENOMEM up to
> > KVM_RUN or KVM_SET_MSR.  But it's basically an impossible case anyway,
> > because even with 8K TILEDATA we're within the limit of
> > PAGE_ALLOC_COSTLY_ORDER.
> >
> > So, since it's not easy to do it right now, we can look at it later.
> 
> For the way handling xcr0 and xfd ioctl failure, xcr0 and xfd have
> different handlings. Current KVM_SET_XCRS returns -EINVAL to
> userspace. KVM_SET_MSR is always allowed as the discussion in
> another thread.
> 
> So I'm thinking if reallocation failure in KVM_SET_XCRS and
> KVM_SET_MSR (may due to NOMEM or EPERM or ENOTSUPP),
> what is the way we would like to choose?
> 

KVM_SET_MSRS can definitely accept failure according to msr_io().
I think Paolo's point is more that the restore path should not
inherit any check related to vCPU capability; it's a different matter
if the error is caused by other host kernel errors.

Given that, we don't need any special handling between the two
scenarios (set by guest vs. set by host) in those emulation paths.
Just return '1' to indicate an error, and whatever error policy exists
in each scenario is applied, as sketched below.
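
For example (the case body is illustrative; kvm_fpu_realloc() is the
helper from Thomas' earlier sketch in this thread):

	case MSR_IA32_XFD:
		if (kvm_fpu_realloc(vcpu, vcpu->arch.xcr0, data))
			return 1;	/* #GP if from the guest, failed entry for msr_io() */
		vcpu->arch.guest_fpu.fpstate->xfd = data;
		break;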

Thanks
Kevin

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 16/19] kvm: x86: Introduce KVM_{G|S}ET_XSAVE2 ioctl
  2021-12-15  2:39               ` Wang, Wei W
@ 2021-12-15 13:42                 ` Paolo Bonzini
  2021-12-16  8:25                   ` Wang, Wei W
  0 siblings, 1 reply; 80+ messages in thread
From: Paolo Bonzini @ 2021-12-15 13:42 UTC (permalink / raw)
  To: Wang, Wei W, Zhong, Yang, x86, kvm, linux-kernel, tglx, mingo,
	bp, dave.hansen
  Cc: seanjc, Nakajima, Jun, Tian, Kevin, jing2.liu, Liu, Jing2, Zeng, Guang

On 12/15/21 03:39, Wang, Wei W wrote:
>>> Why would KVM_GET_XSAVE2 still be needed in this case?
>>>
>>> I'm thinking it would also be possible to reuse KVM_GET_XSAVE:
>>>
>>> - If userspace calls KVM_CHECK_EXTENSION(KVM_CAP_XSAVE2),
>>>    then KVM knows that userspace is a new version that works with the
>>>    larger xsave buffer, using the "size" that KVM returns via KVM_CAP_XSAVE2.
>>>    So we can add a flag "kvm->xsave2_enabled", which gets set when
>>>    userspace checks KVM_CAP_XSAVE2.
>>
>> You can use KVM_ENABLE_CAP(KVM_CAP_XSAVE2) for that, yes.  In that case
>> you don't need KVM_GET_XSAVE2.
>
> One more thing here: what size should KVM_CHECK_EXTENSION(KVM_CAP_XSAVE2) return?
> If the size still comes from the guest CPUID(0xd, 0)::RCX, would it be better to just return 1?
> That requires the QEMU CPUID info to have been set in KVM before checking the cap,
> and QEMU already has this CPUID info to get the size (so there seems to be no need to inquire KVM for it).

It's still easier to return the full size of the buffer from 
KVM_CHECK_EXTENSION(KVM_CAP_XSAVE2).  It makes the userspace code a bit 
easier.

I'm also thinking that I prefer KVM_GET_XSAVE2 to 
KVM_ENABLE_CAP(KVM_CAP_XSAVE2), after all.  Since it would be a 
backwards-incompatible change to an _old_ ioctl (KVM_GET_XSAVE), I 
prefer to limit the ways that userspace can shoot itself in the foot.

Paolo

^ permalink raw reply	[flat|nested] 80+ messages in thread

* RE: [PATCH 16/19] kvm: x86: Introduce KVM_{G|S}ET_XSAVE2 ioctl
  2021-12-15 13:42                 ` Paolo Bonzini
@ 2021-12-16  8:25                   ` Wang, Wei W
  2021-12-16 10:28                     ` Paolo Bonzini
  0 siblings, 1 reply; 80+ messages in thread
From: Wang, Wei W @ 2021-12-16  8:25 UTC (permalink / raw)
  To: Paolo Bonzini, Zhong, Yang, x86, kvm, linux-kernel, tglx, mingo,
	bp, dave.hansen
  Cc: seanjc, Nakajima, Jun, Tian, Kevin, jing2.liu, Liu, Jing2, Zeng, Guang

On Wednesday, December 15, 2021 9:43 PM, Paolo Bonzini wrote:
> It's still easier to return the full size of the buffer from
> KVM_CHECK_EXTENSION(KVM_CAP_XSAVE2).  It makes the userspace code a
> bit easier.

OK. For the "full size" returned to userspace, would you prefer to directly use the value retrieved from guest CPUID(0xd),
or get it from guest_fpu (i.e. fpstate->user_size)?
(retrieved from CPUID will be the max size and should work fine as well)

> 
> I'm also thinking that I prefer KVM_GET_XSAVE2 to
> KVM_ENABLE_CAP(KVM_CAP_XSAVE2), after all.  Since it would be a
> backwards-incompatible change to an _old_ ioctl (KVM_GET_XSAVE), I prefer
> to limit the ways that userspace can shoot itself in the foot.

OK, we will use KVM_GET_XSAVE2.

Thanks,
Wei

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: [PATCH 16/19] kvm: x86: Introduce KVM_{G|S}ET_XSAVE2 ioctl
  2021-12-16  8:25                   ` Wang, Wei W
@ 2021-12-16 10:28                     ` Paolo Bonzini
  0 siblings, 0 replies; 80+ messages in thread
From: Paolo Bonzini @ 2021-12-16 10:28 UTC (permalink / raw)
  To: Wang, Wei W, Zhong, Yang, x86, kvm, linux-kernel, tglx, mingo,
	bp, dave.hansen
  Cc: seanjc, Nakajima, Jun, Tian, Kevin, jing2.liu, Liu, Jing2, Zeng, Guang

On 12/16/21 09:25, Wang, Wei W wrote:
>> It's still easier to return the full size of the buffer from 
>> KVM_CHECK_EXTENSION(KVM_CAP_XSAVE2).  It makes the userspace code
>> a bit easier.
> 
> OK. For the "full size" returned to userspace, would you prefer to
> directly use the value retrieved from guest CPUID(0xd), or get it
> from guest_fpu (i.e. fpstate->user_size)? (retrieved from CPUID will
> be the max size and should work fine as well)

It is okay to reflect only the bits that were enabled in prctl, but 
please document it in api.rst as well.
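
Something along these lines, perhaps (my wording, to be adapted):

  KVM_CHECK_EXTENSION(KVM_CAP_XSAVE2) returns the size in bytes of the
  buffer consumed/produced by KVM_GET_XSAVE2/KVM_SET_XSAVE2.  Only the
  extended state components for which the VM was granted permission via
  arch_prctl(ARCH_REQ_XCOMP_GUEST_PERM) are reflected in that size.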

Paolo

^ permalink raw reply	[flat|nested] 80+ messages in thread

* State Component 18 and Palette 1 (Re: [PATCH 16/19] kvm: x86: Introduce KVM_{G|S}ET_XSAVE2 ioctl)
  2021-12-10 22:13     ` Paolo Bonzini
  2021-12-13  8:23       ` Wang, Wei W
@ 2021-12-20 17:54       ` Nakajima, Jun
  2021-12-22 14:44         ` Paolo Bonzini
  2021-12-22 14:52         ` Dave Hansen
  1 sibling, 2 replies; 80+ messages in thread
From: Nakajima, Jun @ 2021-12-20 17:54 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Zhong, Yang, x86, kvm, linux-kernel, tglx, mingo, bp,
	dave.hansen, Christopherson,,
	Sean, Tian, Kevin, jing2.liu, Liu, Jing2


> On Dec 10, 2021, at 2:13 PM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> 
> On 12/10/21 17:30, Paolo Bonzini wrote:
>>> 
>>> +static int kvm_vcpu_ioctl_x86_set_xsave2(struct kvm_vcpu *vcpu, u8 *state)
>>> +{
>>> +    if (fpstate_is_confidential(&vcpu->arch.guest_fpu))
>>> +        return 0;
>>> +
>>> +    return fpu_copy_uabi_to_guest_fpstate(&vcpu->arch.guest_fpu, state,
>>> +                          supported_xcr0, &vcpu->arch.pkru);
>>> +}
>>> +
>> I think fpu_copy_uabi_to_guest_fpstate (and therefore copy_uabi_from_kernel_to_xstate) needs to check that the size is compatible with the components in the input.
>> Also, IIUC the size of the AMX state will vary in different processors.   Is this correct?  If so, this should be handled already by KVM_GET/SET_XSAVE2 and therefore should be part of the arch/x86/kernel/fpu APIs.  In the future we want to support migrating a "small AMX" host to a "large AMX" host; and also migrating from a "large AMX" host to a "small AMX" host if the guest CPUID is compatible with the destination of the migration.
> 
> So, the size of the AMX state will depend on the active "palette" in TILECONFIG, and on the CPUID information.  I have a few questions on how Intel intends to handle future extensions to AMX:
> 
> - can we assume that, in the future, palette 1 will always have the same value (bytes_per_row=64, max_names=8, max_rows=16), and basically that the only variable value is really the number of palettes?
> 
> - how does Intel plan to handle bigger TILEDATA?  Will it use more XCR0 bits or will it rather enlarge save state 18?
> 
> If it will use more XCR0 bits, I suppose that XCR0 bits will control which palettes can be chosen by LDTILECFG.
> 
> If not, on the other hand, this will be a first case of one system's XSAVE data not being XRSTOR-able on another system even if the destination system can set XCR0 to the same value as the source system.
> 
> Likewise, if the size and offsets for save state 18 were to vary depending on the selected palette, then this would be novel, in that the save state size and offsets would not be in CPUID anymore.  It would be particularly interesting for non-compacted format, where all save states after 18 would also move forward.
> 
> So, I hope that save state 18 will be frozen to 8k.  In that case, and if palette 1 is frozen to the same values as today, implementing migration will not be a problem; it will be essentially the same as SSE->AVX (horizontal extension of existing registers) and/or AVX->AVX512 (both horizontal and vertical extension).

Hi Paolo,

I would like to confirm that the state component 18 will remain 8KB and palette 1 will remain the same. 
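
For reference, those palette 1 parameters are enumerated in CPUID leaf
0x1D; a small userspace sketch (leaf layout as published in the SDM,
expected first-generation AMX values in the comment):

	#include <stdio.h>
	#include <cpuid.h>

	int main(void)
	{
		unsigned int eax, ebx, ecx, edx;

		/* CPUID.0x1D, subleaf 1: tile parameters of palette 1 */
		__cpuid_count(0x1d, 1, eax, ebx, ecx, edx);
		printf("total_tile_bytes=%u bytes_per_tile=%u\n",
		       eax & 0xffff, eax >> 16);
		printf("bytes_per_row=%u max_names=%u max_rows=%u\n",
		       ebx & 0xffff, ebx >> 16, ecx & 0xffff);
		return 0;	/* expect 8192 / 1024 / 64 / 8 / 16 */
	}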

Thanks,
--- 
Jun






^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: State Component 18 and Palette 1 (Re: [PATCH 16/19] kvm: x86: Introduce KVM_{G|S}ET_XSAVE2 ioctl)
  2021-12-20 17:54       ` State Component 18 and Palette 1 (Re: [PATCH 16/19] kvm: x86: Introduce KVM_{G|S}ET_XSAVE2 ioctl) Nakajima, Jun
@ 2021-12-22 14:44         ` Paolo Bonzini
  2021-12-22 23:47           ` Nakajima, Jun
  2021-12-22 14:52         ` Dave Hansen
  1 sibling, 1 reply; 80+ messages in thread
From: Paolo Bonzini @ 2021-12-22 14:44 UTC (permalink / raw)
  To: Nakajima, Jun
  Cc: Zhong, Yang, x86, kvm, linux-kernel, tglx, mingo, bp,
	dave.hansen, Christopherson,,
	Sean, Tian, Kevin, jing2.liu, Liu, Jing2

On 12/20/21 18:54, Nakajima, Jun wrote:
> Hi Paolo,
> 
> I would like to confirm that the state component 18 will remain 8KB and palette 1 will remain the same.

Great!  Can you also confirm that XCR0 bits will control which palettes 
can be chosen by LDTILECFG?

Thanks,

Paolo

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: State Component 18 and Palette 1 (Re: [PATCH 16/19] kvm: x86: Introduce KVM_{G|S}ET_XSAVE2 ioctl)
  2021-12-20 17:54       ` State Component 18 and Palette 1 (Re: [PATCH 16/19] kvm: x86: Introduce KVM_{G|S}ET_XSAVE2 ioctl) Nakajima, Jun
  2021-12-22 14:44         ` Paolo Bonzini
@ 2021-12-22 14:52         ` Dave Hansen
  2021-12-22 23:51           ` Nakajima, Jun
  1 sibling, 1 reply; 80+ messages in thread
From: Dave Hansen @ 2021-12-22 14:52 UTC (permalink / raw)
  To: Nakajima, Jun, Paolo Bonzini
  Cc: Zhong, Yang, x86, kvm, linux-kernel, tglx, mingo, bp,
	dave.hansen, Christopherson,,
	Sean, Tian, Kevin, jing2.liu, Liu, Jing2

On 12/20/21 9:54 AM, Nakajima, Jun wrote:
>> So, I hope that save state 18 will be frozen to 8k.  In that case,
>> and if palette 1 is frozen to the same values as today,
>> implementing migration will not be a problem; it will be
>> essentially the same as SSE->AVX (horizontal extension of existing
>> registers) and/or AVX->AVX512 (both horizontal and vertical
>> extension).
> 
> I would like to confirm that the state component 18 will remain 8KB
> and palette 1 will remain the same.

Is that an architectural statement that will soon be making its way into
the SDM?

^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: State Component 18 and Palette 1 (Re: [PATCH 16/19] kvm: x86: Introduce KVM_{G|S}ET_XSAVE2 ioctl)
  2021-12-22 14:44         ` Paolo Bonzini
@ 2021-12-22 23:47           ` Nakajima, Jun
  0 siblings, 0 replies; 80+ messages in thread
From: Nakajima, Jun @ 2021-12-22 23:47 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Zhong, Yang, x86, kvm, linux-kernel, tglx, mingo, bp,
	dave.hansen, Christopherson,,
	Sean, Tian, Kevin, jing2.liu, Liu, Jing2

> On Dec 22, 2021, at 6:44 AM, Paolo Bonzini <pbonzini@redhat.com> wrote:
> 
> On 12/20/21 18:54, Nakajima, Jun wrote:
>> Hi Paolo,
>> I would like to confirm that the state component 18 will remain 8KB and palette 1 will remain the same.
> 
> Great!  Can you also confirm that XCR0 bits will control which palettes can be chosen by LDTILECFG?
> 

I need to discuss this more with the H/W team and come back. I think this request is plausible; I will suggest it.

Thanks,
--- 
Jun


^ permalink raw reply	[flat|nested] 80+ messages in thread

* Re: State Component 18 and Palette 1 (Re: [PATCH 16/19] kvm: x86: Introduce KVM_{G|S}ET_XSAVE2 ioctl)
  2021-12-22 14:52         ` Dave Hansen
@ 2021-12-22 23:51           ` Nakajima, Jun
  0 siblings, 0 replies; 80+ messages in thread
From: Nakajima, Jun @ 2021-12-22 23:51 UTC (permalink / raw)
  To: Hansen, Dave
  Cc: Paolo Bonzini, Zhong, Yang, x86, kvm, linux-kernel, tglx, mingo,
	bp, dave.hansen, Christopherson,,
	Sean, Tian, Kevin, jing2.liu, Liu, Jing2


> On Dec 22, 2021, at 6:52 AM, Hansen, Dave <dave.hansen@intel.com> wrote:
> 
> On 12/20/21 9:54 AM, Nakajima, Jun wrote:
>>> So, I hope that save state 18 will be frozen to 8k.  In that case,
>>> and if palette 1 is frozen to the same values as today,
>>> implementing migration will not be a problem; it will be
>>> essentially the same as SSE->AVX (horizontal extension of existing
>>> registers) and/or AVX->AVX512 (both horizontal and vertical
>>> extension).
>> 
>> I would like to confirm that the state component 18 will remain 8KB
>> and palette 1 will remain the same.
> 
> Is that an architectural statement that will soon be making its way into
> the SDM?

Yes, with the other clarifications (e.g. setting IA32_XFD[18]).

Thanks,
--- 
Jun


^ permalink raw reply	[flat|nested] 80+ messages in thread

end of thread, other threads:[~2021-12-22 23:51 UTC | newest]

Thread overview: 80+ messages
2021-12-08  0:03 [PATCH 00/19] AMX Support in KVM Yang Zhong
2021-12-08  0:03 ` [PATCH 01/19] x86/fpu: Extend prctl() with guest permissions Yang Zhong
2021-12-14  0:16   ` Thomas Gleixner
2021-12-08  0:03 ` [PATCH 02/19] x86/fpu: Prepare KVM for dynamically enabled states Yang Zhong
2021-12-13  9:12   ` Paolo Bonzini
2021-12-13 12:00     ` Thomas Gleixner
2021-12-13 12:45       ` Paolo Bonzini
2021-12-13 19:50         ` Thomas Gleixner
2021-12-08  0:03 ` [PATCH 03/19] kvm: x86: Fix xstate_required_size() to follow XSTATE alignment rule Yang Zhong
2021-12-08  0:03 ` [PATCH 04/19] kvm: x86: Check guest xstate permissions when KVM_SET_CPUID2 Yang Zhong
2021-12-08  0:03 ` [PATCH 05/19] x86/fpu: Move xfd initialization out of __fpstate_reset() to the callers Yang Zhong
2021-12-10 22:33   ` Thomas Gleixner
2021-12-08  0:03 ` [PATCH 06/19] x86/fpu: Add reallocation mechanims for KVM Yang Zhong
2021-12-08  0:03 ` [PATCH 07/19] kvm: x86: Propagate fpstate reallocation error to userspace Yang Zhong
2021-12-10 15:44   ` Paolo Bonzini
2021-12-08  0:03 ` [PATCH 08/19] x86/fpu: Move xfd_update_state() to xstate.c and export symbol Yang Zhong
2021-12-10 22:44   ` Thomas Gleixner
2021-12-08  0:03 ` [PATCH 09/19] kvm: x86: Prepare reallocation check Yang Zhong
2021-12-13  9:16   ` Paolo Bonzini
2021-12-14  7:06     ` Tian, Kevin
2021-12-14 10:16       ` Paolo Bonzini
2021-12-14 14:41         ` Liu, Jing2
2021-12-15  7:09           ` Tian, Kevin
2021-12-08  0:03 ` [PATCH 10/19] kvm: x86: Emulate WRMSR of guest IA32_XFD Yang Zhong
2021-12-10 16:02   ` Paolo Bonzini
2021-12-13  7:51     ` Liu, Jing2
2021-12-13  9:01       ` Paolo Bonzini
2021-12-14 10:26     ` Yang Zhong
2021-12-14 11:24       ` Paolo Bonzini
2021-12-10 23:09   ` Thomas Gleixner
2021-12-13 15:06   ` Paolo Bonzini
2021-12-13 19:45     ` Thomas Gleixner
2021-12-13 21:23       ` Thomas Gleixner
2021-12-14  7:16         ` Tian, Kevin
2021-12-08  0:03 ` [PATCH 11/19] kvm: x86: Check fpstate reallocation in XSETBV emulation Yang Zhong
2021-12-08  0:03 ` [PATCH 12/19] x86/fpu: Prepare KVM for bringing XFD state back in-sync Yang Zhong
2021-12-10 23:11   ` Thomas Gleixner
2021-12-08  0:03 ` [PATCH 13/19] kvm: x86: Disable WRMSR interception for IA32_XFD on demand Yang Zhong
2021-12-08  7:23   ` Liu, Jing2
2021-12-08  0:03 ` [PATCH 14/19] x86/fpu: Prepare for KVM XFD_ERR handling Yang Zhong
2021-12-10 16:16   ` Paolo Bonzini
2021-12-10 23:20   ` Thomas Gleixner
2021-12-08  0:03 ` [PATCH 15/19] kvm: x86: Save and restore guest XFD_ERR properly Yang Zhong
2021-12-10 16:23   ` Paolo Bonzini
2021-12-10 22:01   ` Paolo Bonzini
2021-12-12 13:10     ` Yang Zhong
2021-12-11  0:10   ` Thomas Gleixner
2021-12-11  1:31     ` Paolo Bonzini
2021-12-11  3:23       ` Tian, Kevin
2021-12-11 13:10       ` Thomas Gleixner
2021-12-11  3:07     ` Tian, Kevin
2021-12-11 13:29       ` Thomas Gleixner
2021-12-12  1:50         ` Tian, Kevin
2021-12-12  9:10           ` Paolo Bonzini
2021-12-08  0:03 ` [PATCH 16/19] kvm: x86: Introduce KVM_{G|S}ET_XSAVE2 ioctl Yang Zhong
2021-12-10 16:25   ` Paolo Bonzini
2021-12-10 16:30   ` Paolo Bonzini
2021-12-10 22:13     ` Paolo Bonzini
2021-12-13  8:23       ` Wang, Wei W
2021-12-13  9:24         ` Paolo Bonzini
2021-12-14  6:06           ` Wang, Wei W
2021-12-14  6:18             ` Paolo Bonzini
2021-12-15  2:39               ` Wang, Wei W
2021-12-15 13:42                 ` Paolo Bonzini
2021-12-16  8:25                   ` Wang, Wei W
2021-12-16 10:28                     ` Paolo Bonzini
2021-12-20 17:54       ` State Component 18 and Palette 1 (Re: [PATCH 16/19] kvm: x86: Introduce KVM_{G|S}ET_XSAVE2 ioctl) Nakajima, Jun
2021-12-22 14:44         ` Paolo Bonzini
2021-12-22 23:47           ` Nakajima, Jun
2021-12-22 14:52         ` Dave Hansen
2021-12-22 23:51           ` Nakajima, Jun
2021-12-13 10:10     ` [PATCH 16/19] kvm: x86: Introduce KVM_{G|S}ET_XSAVE2 ioctl Thomas Gleixner
2021-12-13 10:43       ` Paolo Bonzini
2021-12-13 12:40         ` Thomas Gleixner
2021-12-08  0:03 ` [PATCH 17/19] docs: virt: api.rst: Document the new KVM_{G, S}ET_XSAVE2 ioctls Yang Zhong
2021-12-08  0:03 ` [PATCH 18/19] kvm: x86: AMX XCR0 support for guest Yang Zhong
2021-12-10 16:30   ` Paolo Bonzini
2021-12-08  0:03 ` [PATCH 19/19] kvm: x86: Add AMX CPUIDs support Yang Zhong
2021-12-10 21:52   ` Paolo Bonzini
2021-12-11 21:20 ` [PATCH 00/19] AMX Support in KVM Thomas Gleixner
