* [patch 0/6] x86/fpu: Preparatory changes for guest AMX support
@ 2021-12-14  2:50 Thomas Gleixner
  2021-12-14  2:50 ` [patch 1/6] x86/fpu: Extend fpu_xstate_prctl() with guest permissions Thomas Gleixner
                   ` (8 more replies)
  0 siblings, 9 replies; 47+ messages in thread
From: Thomas Gleixner @ 2021-12-14  2:50 UTC (permalink / raw)
  To: LKML
  Cc: Jing Liu, Yang Zhong, Paolo Bonzini, x86, kvm,
	Sean Christopherson, Jun Nakajima, Kevin Tian

Folks,

this is a follow-up to the initial sketch of patches which were picked up by
Jing and posted in combination with the KVM parts:

   https://lore.kernel.org/r/20211208000359.2853257-1-yang.zhong@intel.com

This update is only touching the x86/fpu code and not changing anything on
the KVM side.

    BIG FAT WARNING: This is compile tested only!

In the course of the discussion of the above patchset it turned out that
there are a few conceptual issues vs. hardware and software state and also
vs. guest restore.

This series addresses this with the following changes vs. the original
approach:

  1) fpstate reallocation is now independent of fpu_swap_kvm_fpstate()

     It is triggered directly via XSETBV and XFD MSR write emulation which
     are used both for runtime and restore purposes.

     For this, two wrappers around a common update function are provided,
     one for XCR0 and one for XFD.

     Both check the validity of the arguments and the correct sizing of the
     guest FPU fpstate. If the size is not sufficient, fpstate is
     reallocated.

     The functions can fail.

  2) XFD synchronization

     KVM must neither touch the XFD MSR nor the fpstate->xfd software state
     in order to guarantee state consistency.

     In the MSR write emulation case the XFD specific update handler has to
     be invoked. See #1.

     If MSR write emulation is disabled because the buffer size is
     sufficient for all use cases, i.e.:

     		guest_fpu::xfeatures == guest_fpu::perm

     then there is no guarantee that the XFD software state on VMEXIT is
     the same as the state on VMENTER.

     A separate synchronization function is provided which reads the XFD
     MSR and updates the relevant software state. This function has to be
     invoked after a VMEXIT before reenabling interrupts.

With that the KVM logic looks like this:

     xsetbv_emulate()
	ret = fpu_update_guest_xcr0(&vcpu->arch.guest_fpu, xcr0);
	if (ret)
		handle_fail()
	....


     kvm_emulate_wrmsr()
        ....
	case MSR_IA32_XFD:
	     ret = fpu_update_guest_xfd(&vcpu->arch.guest_fpu, vcpu->arch.xcr0, msrval);
	     if (ret)
		handle_fail()
	     ....

This covers both the case of a running vCPU and the case of restore.

The XFD synchronization mechanism is only relevant for a running vCPU after
VMEXIT when XFD MSR write emulation is disabled:

     vcpu_run()
	vcpu_enter_guest()
	  for (;;) {
	      ...
	      vmenter();
	      ...
	  };
	  ...

	  if (!xfd_write_emulated(vcpu))
		fpu_sync_guest_vmexit_xfd_state();

	  local_irq_enable();

It has no relevance for the guest restore case.

With that all XFD/fpstate related issues should be covered in a consistent
way.

CPUID validation can be done without exporting yet more FPU functions:

      if (requested_xfeatures & ~vcpu->arch.guest_fpu.perm)
      		return -ENOPONY;

That's what fpu_guest::perm is there for from the beginning, along with
fpu_guest::xfeatures for other validation purposes.
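
As a minimal illustration of that check, reduced to pure mask arithmetic
(the helper name is made up; KVM's actual CPUID emulation differs):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Illustrative only: a requested guest xfeature set is valid when it
 * contains no bit outside the permissions granted to the guest via
 * ARCH_REQ_XCOMP_GUEST_PERM (i.e. fpu_guest::perm).
 */
int guest_xfeatures_valid(uint64_t requested, uint64_t perm)
{
	return (requested & ~perm) == 0;
}
```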

XFD_ERR MSR handling is completely separate and as discussed a KVM only
issue for now. KVM has to ensure that the MSR is 0 before interrupts are
enabled. So this is not touched here.

The only remaining issue is the KVM XSTATE save/restore size checking which
probably requires some FPU core assistance. But that requires some more
thought vs. the IOCTL interface extension, and once that is settled it
needs to be solved in one go. That's an orthogonal issue to the above.

The series is also available from git:

   git://git.kernel.org/pub/scm/linux/kernel/git/people/tglx/devel.git x86/fpu-kvm

Thanks,

	tglx
---
 include/asm/fpu/api.h    |   63 ++++++++++++++++++++++++
 include/asm/fpu/types.h  |   22 ++++++++
 include/uapi/asm/prctl.h |   26 +++++----
 kernel/fpu/core.c        |  123 ++++++++++++++++++++++++++++++++++++++++++++---
 kernel/fpu/xstate.c      |  118 +++++++++++++++++++++++++++------------------
 kernel/fpu/xstate.h      |   20 ++++++-
 kernel/process.c         |    2 
 7 files changed, 307 insertions(+), 67 deletions(-)


* [patch 1/6] x86/fpu: Extend fpu_xstate_prctl() with guest permissions
  2021-12-14  2:50 [patch 0/6] x86/fpu: Preparatory changes for guest AMX support Thomas Gleixner
@ 2021-12-14  2:50 ` Thomas Gleixner
  2021-12-14  5:13   ` Tian, Kevin
  2021-12-14  2:50 ` [patch 2/6] x86/fpu: Prepare guest FPU for dynamically enabled FPU features Thomas Gleixner
                   ` (7 subsequent siblings)
  8 siblings, 1 reply; 47+ messages in thread
From: Thomas Gleixner @ 2021-12-14  2:50 UTC (permalink / raw)
  To: LKML
  Cc: Jing Liu, Yang Zhong, Paolo Bonzini, x86, kvm,
	Sean Christopherson, Jun Nakajima, Kevin Tian

KVM requires a clear separation of host user space and guest permissions
for dynamic XSTATE components.

Add a guest permissions member to struct fpu and a separate set of prctl()
arguments: ARCH_GET_XCOMP_GUEST_PERM and ARCH_REQ_XCOMP_GUEST_PERM.

The semantics are equivalent to the host user space permission control
except for the following constraints:

  1) Permissions have to be requested before the first vCPU is created

  2) Permissions are frozen when the first vCPU is created to ensure
     consistency. Any attempt to expand permissions via the prctl() after
     that point is rejected.
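
The freeze semantics can be modeled with a small sketch. The
FPU_GUEST_PERM_LOCKED bit and the -EBUSY return mirror the patch below;
the helper functions themselves are illustrative stand-ins, not the
kernel code:

```c
#include <assert.h>
#include <stdint.h>

#define FPU_GUEST_PERM_LOCKED	(1ULL << 63)

/*
 * Model of a permission request: once the permission set is locked
 * (first vCPU created), any further expansion attempt fails.
 */
int request_guest_perm(uint64_t *state_perm, uint64_t requested)
{
	if (*state_perm & FPU_GUEST_PERM_LOCKED)
		return -16;	/* -EBUSY */
	*state_perm |= requested;
	return 0;
}

/* Model of first vCPU creation locking the permissions. */
void lock_guest_perm(uint64_t *state_perm)
{
	*state_perm |= FPU_GUEST_PERM_LOCKED;
}
```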

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/include/asm/fpu/types.h  |    9 +++++++
 arch/x86/include/uapi/asm/prctl.h |   26 +++++++++++----------
 arch/x86/kernel/fpu/core.c        |    3 ++
 arch/x86/kernel/fpu/xstate.c      |   45 +++++++++++++++++++++++++++-----------
 arch/x86/kernel/fpu/xstate.h      |   18 +++++++++++++--
 arch/x86/kernel/process.c         |    2 +
 6 files changed, 76 insertions(+), 27 deletions(-)

--- a/arch/x86/include/asm/fpu/types.h
+++ b/arch/x86/include/asm/fpu/types.h
@@ -387,6 +387,8 @@ struct fpstate {
 	/* @regs is dynamically sized! Don't add anything after @regs! */
 } __aligned(64);
 
+#define FPU_GUEST_PERM_LOCKED		BIT_ULL(63)
+
 struct fpu_state_perm {
 	/*
 	 * @__state_perm:
@@ -477,6 +479,13 @@ struct fpu {
 	struct fpu_state_perm		perm;
 
 	/*
+	 * @guest_perm:
+	 *
+	 * Permission related information for guest pseudo FPUs
+	 */
+	struct fpu_state_perm		guest_perm;
+
+	/*
 	 * @__fpstate:
 	 *
 	 * Initial in-memory storage for FPU registers which are saved in
--- a/arch/x86/include/uapi/asm/prctl.h
+++ b/arch/x86/include/uapi/asm/prctl.h
@@ -2,20 +2,22 @@
 #ifndef _ASM_X86_PRCTL_H
 #define _ASM_X86_PRCTL_H
 
-#define ARCH_SET_GS		0x1001
-#define ARCH_SET_FS		0x1002
-#define ARCH_GET_FS		0x1003
-#define ARCH_GET_GS		0x1004
+#define ARCH_SET_GS			0x1001
+#define ARCH_SET_FS			0x1002
+#define ARCH_GET_FS			0x1003
+#define ARCH_GET_GS			0x1004
 
-#define ARCH_GET_CPUID		0x1011
-#define ARCH_SET_CPUID		0x1012
+#define ARCH_GET_CPUID			0x1011
+#define ARCH_SET_CPUID			0x1012
 
-#define ARCH_GET_XCOMP_SUPP	0x1021
-#define ARCH_GET_XCOMP_PERM	0x1022
-#define ARCH_REQ_XCOMP_PERM	0x1023
+#define ARCH_GET_XCOMP_SUPP		0x1021
+#define ARCH_GET_XCOMP_PERM		0x1022
+#define ARCH_REQ_XCOMP_PERM		0x1023
+#define ARCH_GET_XCOMP_GUEST_PERM	0x1024
+#define ARCH_REQ_XCOMP_GUEST_PERM	0x1025
 
-#define ARCH_MAP_VDSO_X32	0x2001
-#define ARCH_MAP_VDSO_32	0x2002
-#define ARCH_MAP_VDSO_64	0x2003
+#define ARCH_MAP_VDSO_X32		0x2001
+#define ARCH_MAP_VDSO_32		0x2002
+#define ARCH_MAP_VDSO_64		0x2003
 
 #endif /* _ASM_X86_PRCTL_H */
--- a/arch/x86/kernel/fpu/core.c
+++ b/arch/x86/kernel/fpu/core.c
@@ -450,6 +450,8 @@ void fpstate_reset(struct fpu *fpu)
 	fpu->perm.__state_perm		= fpu_kernel_cfg.default_features;
 	fpu->perm.__state_size		= fpu_kernel_cfg.default_size;
 	fpu->perm.__user_state_size	= fpu_user_cfg.default_size;
+	/* Same defaults for guests */
+	fpu->guest_perm = fpu->perm;
 }
 
 static inline void fpu_inherit_perms(struct fpu *dst_fpu)
@@ -460,6 +462,7 @@ static inline void fpu_inherit_perms(str
 		spin_lock_irq(&current->sighand->siglock);
 		/* Fork also inherits the permissions of the parent */
 		dst_fpu->perm = src_fpu->perm;
+		dst_fpu->guest_perm = src_fpu->guest_perm;
 		spin_unlock_irq(&current->sighand->siglock);
 	}
 }
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -1595,7 +1595,7 @@ static int validate_sigaltstack(unsigned
 	return 0;
 }
 
-static int __xstate_request_perm(u64 permitted, u64 requested)
+static int __xstate_request_perm(u64 permitted, u64 requested, bool guest)
 {
 	/*
 	 * This deliberately does not exclude !XSAVES as we still might
@@ -1605,6 +1605,7 @@ static int __xstate_request_perm(u64 per
 	 */
 	bool compacted = cpu_feature_enabled(X86_FEATURE_XSAVES);
 	struct fpu *fpu = &current->group_leader->thread.fpu;
+	struct fpu_state_perm *perm;
 	unsigned int ksize, usize;
 	u64 mask;
 	int ret;
@@ -1621,15 +1622,18 @@ static int __xstate_request_perm(u64 per
 	mask &= XFEATURE_MASK_USER_SUPPORTED;
 	usize = xstate_calculate_size(mask, false);
 
-	ret = validate_sigaltstack(usize);
-	if (ret)
-		return ret;
+	if (!guest) {
+		ret = validate_sigaltstack(usize);
+		if (ret)
+			return ret;
+	}
 
+	perm = guest ? &fpu->guest_perm : &fpu->perm;
 	/* Pairs with the READ_ONCE() in xstate_get_group_perm() */
-	WRITE_ONCE(fpu->perm.__state_perm, requested);
+	WRITE_ONCE(perm->__state_perm, requested);
 	/* Protected by sighand lock */
-	fpu->perm.__state_size = ksize;
-	fpu->perm.__user_state_size = usize;
+	perm->__state_size = ksize;
+	perm->__user_state_size = usize;
 	return ret;
 }
 
@@ -1640,7 +1644,7 @@ static const u64 xstate_prctl_req[XFEATU
 	[XFEATURE_XTILE_DATA] = XFEATURE_MASK_XTILE_DATA,
 };
 
-static int xstate_request_perm(unsigned long idx)
+static int xstate_request_perm(unsigned long idx, bool guest)
 {
 	u64 permitted, requested;
 	int ret;
@@ -1661,14 +1665,19 @@ static int xstate_request_perm(unsigned
 		return -EOPNOTSUPP;
 
 	/* Lockless quick check */
-	permitted = xstate_get_host_group_perm();
+	permitted = xstate_get_group_perm(guest);
 	if ((permitted & requested) == requested)
 		return 0;
 
 	/* Protect against concurrent modifications */
 	spin_lock_irq(&current->sighand->siglock);
-	permitted = xstate_get_host_group_perm();
-	ret = __xstate_request_perm(permitted, requested);
+	permitted = xstate_get_group_perm(guest);
+
+	/* First vCPU allocation locks the permissions. */
+	if (guest && (permitted & FPU_GUEST_PERM_LOCKED))
+		ret = -EBUSY;
+	else
+		ret = __xstate_request_perm(permitted, requested, guest);
 	spin_unlock_irq(&current->sighand->siglock);
 	return ret;
 }
@@ -1713,7 +1722,7 @@ int xfd_enable_feature(u64 xfd_err)
 	return 0;
 }
 #else /* CONFIG_X86_64 */
-static inline int xstate_request_perm(unsigned long idx)
+static inline int xstate_request_perm(unsigned long idx, bool guest)
 {
 	return -EPERM;
 }
@@ -1742,6 +1751,7 @@ long fpu_xstate_prctl(struct task_struct
 	u64 __user *uptr = (u64 __user *)arg2;
 	u64 permitted, supported;
 	unsigned long idx = arg2;
+	bool guest = false;
 
 	if (tsk != current)
 		return -EPERM;
@@ -1760,11 +1770,20 @@ long fpu_xstate_prctl(struct task_struct
 		permitted &= XFEATURE_MASK_USER_SUPPORTED;
 		return put_user(permitted, uptr);
 
+	case ARCH_GET_XCOMP_GUEST_PERM:
+		permitted = xstate_get_guest_group_perm();
+		permitted &= XFEATURE_MASK_USER_SUPPORTED;
+		return put_user(permitted, uptr);
+
+	case ARCH_REQ_XCOMP_GUEST_PERM:
+		guest = true;
+		fallthrough;
+
 	case ARCH_REQ_XCOMP_PERM:
 		if (!IS_ENABLED(CONFIG_X86_64))
 			return -EOPNOTSUPP;
 
-		return xstate_request_perm(idx);
+		return xstate_request_perm(idx, guest);
 
 	default:
 		return -EINVAL;
--- a/arch/x86/kernel/fpu/xstate.h
+++ b/arch/x86/kernel/fpu/xstate.h
@@ -20,10 +20,24 @@ static inline void xstate_init_xcomp_bv(
 		xsave->header.xcomp_bv = mask | XCOMP_BV_COMPACTED_FORMAT;
 }
 
-static inline u64 xstate_get_host_group_perm(void)
+static inline u64 xstate_get_group_perm(bool guest)
 {
+	struct fpu *fpu = &current->group_leader->thread.fpu;
+	struct fpu_state_perm *perm;
+
 	/* Pairs with WRITE_ONCE() in xstate_request_perm() */
-	return READ_ONCE(current->group_leader->thread.fpu.perm.__state_perm);
+	perm = guest ? &fpu->guest_perm : &fpu->perm;
+	return READ_ONCE(perm->__state_perm);
+}
+
+static inline u64 xstate_get_host_group_perm(void)
+{
+	return xstate_get_group_perm(false);
+}
+
+static inline u64 xstate_get_guest_group_perm(void)
+{
+	return xstate_get_group_perm(true);
 }
 
 enum xstate_copy_mode {
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -993,6 +993,8 @@ long do_arch_prctl_common(struct task_st
 	case ARCH_GET_XCOMP_SUPP:
 	case ARCH_GET_XCOMP_PERM:
 	case ARCH_REQ_XCOMP_PERM:
+	case ARCH_GET_XCOMP_GUEST_PERM:
+	case ARCH_REQ_XCOMP_GUEST_PERM:
 		return fpu_xstate_prctl(task, option, arg2);
 	}
 



* [patch 2/6] x86/fpu: Prepare guest FPU for dynamically enabled FPU features
  2021-12-14  2:50 [patch 0/6] x86/fpu: Preparatory changes for guest AMX support Thomas Gleixner
  2021-12-14  2:50 ` [patch 1/6] x86/fpu: Extend fpu_xstate_prctl() with guest permissions Thomas Gleixner
@ 2021-12-14  2:50 ` Thomas Gleixner
  2021-12-14  2:50 ` [patch 3/6] x86/fpu: Make XFD initialization in __fpstate_reset() a function argument Thomas Gleixner
                   ` (6 subsequent siblings)
  8 siblings, 0 replies; 47+ messages in thread
From: Thomas Gleixner @ 2021-12-14  2:50 UTC (permalink / raw)
  To: LKML
  Cc: Jing Liu, Yang Zhong, Paolo Bonzini, x86, kvm,
	Sean Christopherson, Jun Nakajima, Kevin Tian

To support dynamically enabled FPU features for guests, prepare the guest
pseudo FPU container to keep track of the currently enabled xfeatures and
the guest permissions.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/include/asm/fpu/types.h |   13 +++++++++++++
 arch/x86/kernel/fpu/core.c       |   26 +++++++++++++++++++++++++-
 2 files changed, 38 insertions(+), 1 deletion(-)

--- a/arch/x86/include/asm/fpu/types.h
+++ b/arch/x86/include/asm/fpu/types.h
@@ -505,6 +505,19 @@ struct fpu {
  */
 struct fpu_guest {
 	/*
+	 * @xfeatures:			xfeature bitmap of features which are
+	 *				currently enabled for the guest vCPU.
+	 */
+	u64				xfeatures;
+
+	/*
+	 * @perm:			xfeature bitmap of features which are
+	 *				permitted to be enabled for the guest
+	 *				vCPU.
+	 */
+	u64				perm;
+
+	/*
 	 * @fpstate:			Pointer to the allocated guest fpstate
 	 */
 	struct fpstate			*fpstate;
--- a/arch/x86/kernel/fpu/core.c
+++ b/arch/x86/kernel/fpu/core.c
@@ -201,6 +201,26 @@ void fpu_reset_from_exception_fixup(void
 #if IS_ENABLED(CONFIG_KVM)
 static void __fpstate_reset(struct fpstate *fpstate);
 
+static void fpu_init_guest_permissions(struct fpu_guest *gfpu)
+{
+	struct fpu_state_perm *fpuperm;
+	u64 perm;
+
+	if (!IS_ENABLED(CONFIG_X86_64))
+		return;
+
+	spin_lock_irq(&current->sighand->siglock);
+	fpuperm = &current->group_leader->thread.fpu.guest_perm;
+	perm = fpuperm->__state_perm;
+
+	/* First fpstate allocation locks down permissions. */
+	WRITE_ONCE(fpuperm->__state_perm, perm | FPU_GUEST_PERM_LOCKED);
+
+	spin_unlock_irq(&current->sighand->siglock);
+
+	gfpu->perm = perm & ~FPU_GUEST_PERM_LOCKED;
+}
+
 bool fpu_alloc_guest_fpstate(struct fpu_guest *gfpu)
 {
 	struct fpstate *fpstate;
@@ -216,7 +236,11 @@ bool fpu_alloc_guest_fpstate(struct fpu_
 	fpstate->is_valloc	= true;
 	fpstate->is_guest	= true;
 
-	gfpu->fpstate = fpstate;
+	gfpu->fpstate		= fpstate;
+	gfpu->xfeatures		= fpu_user_cfg.default_features;
+	gfpu->perm		= fpu_user_cfg.default_features;
+	fpu_init_guest_permissions(gfpu);
+
 	return true;
 }
 EXPORT_SYMBOL_GPL(fpu_alloc_guest_fpstate);



* [patch 3/6] x86/fpu: Make XFD initialization in __fpstate_reset() a function argument
  2021-12-14  2:50 [patch 0/6] x86/fpu: Preparatory changes for guest AMX support Thomas Gleixner
  2021-12-14  2:50 ` [patch 1/6] x86/fpu: Extend fpu_xstate_prctl() with guest permissions Thomas Gleixner
  2021-12-14  2:50 ` [patch 2/6] x86/fpu: Prepare guest FPU for dynamically enabled FPU features Thomas Gleixner
@ 2021-12-14  2:50 ` Thomas Gleixner
  2021-12-14  2:50 ` [patch 4/6] x86/fpu: Add guest support to xfd_enable_feature() Thomas Gleixner
                   ` (5 subsequent siblings)
  8 siblings, 0 replies; 47+ messages in thread
From: Thomas Gleixner @ 2021-12-14  2:50 UTC (permalink / raw)
  To: LKML
  Cc: Jing Liu, Yang Zhong, Paolo Bonzini, x86, kvm,
	Sean Christopherson, Jun Nakajima, Kevin Tian, Jing Liu

From: Jing Liu <jing2.liu@intel.com>

vCPU threads are different from native tasks with regard to the initial XFD
value. While all native tasks follow a fixed value (init_fpstate::xfd)
established by the FPU core at boot, vCPU threads need to obey the reset
value (i.e. ZERO) defined by the specification, to meet the expectation of
the guest.

Let the caller supply an argument and adjust the host and guest related
invocations accordingly.

[ tglx: Make it explicit via function argument ]

Signed-off-by: Jing Liu <jing2.liu@intel.com>
Signed-off-by: Yang Zhong <yang.zhong@intel.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/kernel/fpu/core.c |   11 ++++++-----
 1 file changed, 6 insertions(+), 5 deletions(-)

--- a/arch/x86/kernel/fpu/core.c
+++ b/arch/x86/kernel/fpu/core.c
@@ -199,7 +199,7 @@ void fpu_reset_from_exception_fixup(void
 }
 
 #if IS_ENABLED(CONFIG_KVM)
-static void __fpstate_reset(struct fpstate *fpstate);
+static void __fpstate_reset(struct fpstate *fpstate, u64 xfd);
 
 static void fpu_init_guest_permissions(struct fpu_guest *gfpu)
 {
@@ -231,7 +231,8 @@ bool fpu_alloc_guest_fpstate(struct fpu_
 	if (!fpstate)
 		return false;
 
-	__fpstate_reset(fpstate);
+	/* Leave xfd to 0 (the reset value defined by spec) */
+	__fpstate_reset(fpstate, 0);
 	fpstate_init_user(fpstate);
 	fpstate->is_valloc	= true;
 	fpstate->is_guest	= true;
@@ -454,21 +455,21 @@ void fpstate_init_user(struct fpstate *f
 		fpstate_init_fstate(fpstate);
 }
 
-static void __fpstate_reset(struct fpstate *fpstate)
+static void __fpstate_reset(struct fpstate *fpstate, u64 xfd)
 {
 	/* Initialize sizes and feature masks */
 	fpstate->size		= fpu_kernel_cfg.default_size;
 	fpstate->user_size	= fpu_user_cfg.default_size;
 	fpstate->xfeatures	= fpu_kernel_cfg.default_features;
 	fpstate->user_xfeatures	= fpu_user_cfg.default_features;
-	fpstate->xfd		= init_fpstate.xfd;
+	fpstate->xfd		= xfd;
 }
 
 void fpstate_reset(struct fpu *fpu)
 {
 	/* Set the fpstate pointer to the default fpstate */
 	fpu->fpstate = &fpu->__fpstate;
-	__fpstate_reset(fpu->fpstate);
+	__fpstate_reset(fpu->fpstate, init_fpstate.xfd);
 
 	/* Initialize the permission related info in fpu */
 	fpu->perm.__state_perm		= fpu_kernel_cfg.default_features;



* [patch 4/6] x86/fpu: Add guest support to xfd_enable_feature()
  2021-12-14  2:50 [patch 0/6] x86/fpu: Preparatory changes for guest AMX support Thomas Gleixner
                   ` (2 preceding siblings ...)
  2021-12-14  2:50 ` [patch 3/6] x86/fpu: Make XFD initialization in __fpstate_reset() a function argument Thomas Gleixner
@ 2021-12-14  2:50 ` Thomas Gleixner
  2021-12-14  6:05   ` Tian, Kevin
  2021-12-14  2:50 ` [patch 5/6] x86/fpu: Provide fpu_update_guest_xcr0/xfd() Thomas Gleixner
                   ` (4 subsequent siblings)
  8 siblings, 1 reply; 47+ messages in thread
From: Thomas Gleixner @ 2021-12-14  2:50 UTC (permalink / raw)
  To: LKML
  Cc: Jing Liu, Yang Zhong, Paolo Bonzini, x86, kvm,
	Sean Christopherson, Jun Nakajima, Kevin Tian

Guest support for dynamically enabling FPU features requires a few
modifications to the enablement function which is currently invoked from
the #NM handler:

  1) Use guest permissions and sizes for the update

  2) Update fpu_guest state accordingly

  3) Take into account that the enabling can be triggered either from a
     running guest via XSETBV and MSR_IA32_XFD write emulation or from
     a guest restore. In the latter case the guest's fpstate is not the
     current task's active fpstate.

Split the function and implement the guest mechanics throughout the
callchain.
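
The event the enablement function acts on can be sketched as plain bit
arithmetic; XFEATURE_MASK_USER_DYNAMIC is reduced here to the XTILE_DATA
bit for illustration, so this is a model of the check, not the kernel
code:

```c
#include <assert.h>
#include <stdint.h>

#define XFEATURE_MASK_USER_DYNAMIC	(1ULL << 18)	/* XTILE_DATA today */

/*
 * Only dynamically enabled user features present in the error value
 * constitute a valid enable event; anything else is spurious/invalid.
 */
uint64_t xfd_event_from_err(uint64_t xfd_err)
{
	return xfd_err & XFEATURE_MASK_USER_DYNAMIC;
}
```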

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
Split out from combo patch. Add fpstate.in_use initialization.
---
 arch/x86/kernel/fpu/xstate.c |   73 ++++++++++++++++++++++---------------------
 arch/x86/kernel/fpu/xstate.h |    2 +
 2 files changed, 41 insertions(+), 34 deletions(-)

--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -1500,35 +1500,13 @@ void fpstate_free(struct fpu *fpu)
 }
 
 /**
- * fpu_install_fpstate - Update the active fpstate in the FPU
- *
- * @fpu:	A struct fpu * pointer
- * @newfps:	A struct fpstate * pointer
- *
- * Returns:	A null pointer if the last active fpstate is the embedded
- *		one or the new fpstate is already installed;
- *		otherwise, a pointer to the old fpstate which has to
- *		be freed by the caller.
- */
-static struct fpstate *fpu_install_fpstate(struct fpu *fpu,
-					   struct fpstate *newfps)
-{
-	struct fpstate *oldfps = fpu->fpstate;
-
-	if (fpu->fpstate == newfps)
-		return NULL;
-
-	fpu->fpstate = newfps;
-	return oldfps != &fpu->__fpstate ? oldfps : NULL;
-}
-
-/**
  * fpstate_realloc - Reallocate struct fpstate for the requested new features
  *
  * @xfeatures:	A bitmap of xstate features which extend the enabled features
  *		of that task
  * @ksize:	The required size for the kernel buffer
  * @usize:	The required size for user space buffers
+ * @guest_fpu:	Pointer to a guest FPU container. NULL for host allocations
  *
  * Note vs. vmalloc(): If the task with a vzalloc()-allocated buffer
  * terminates quickly, vfree()-induced IPIs may be a concern, but tasks
@@ -1537,7 +1515,7 @@ static struct fpstate *fpu_install_fpsta
  * Returns: 0 on success, -ENOMEM on allocation error.
  */
 static int fpstate_realloc(u64 xfeatures, unsigned int ksize,
-			   unsigned int usize)
+			   unsigned int usize, struct fpu_guest *guest_fpu)
 {
 	struct fpu *fpu = &current->thread.fpu;
 	struct fpstate *curfps, *newfps = NULL;
@@ -1553,6 +1531,13 @@ static int fpstate_realloc(u64 xfeatures
 	newfps->user_size = usize;
 	newfps->is_valloc = true;
 
+	if (guest_fpu) {
+		newfps->is_guest = true;
+		newfps->is_confidential = curfps->is_confidential;
+		newfps->in_use = curfps->in_use;
+		guest_fpu->xfeatures |= xfeatures;
+	}
+
 	fpregs_lock();
 	/*
 	 * Ensure that the current state is in the registers before
@@ -1566,15 +1551,25 @@ static int fpstate_realloc(u64 xfeatures
 	newfps->user_xfeatures = curfps->user_xfeatures | xfeatures;
 	newfps->xfd = curfps->xfd & ~xfeatures;
 
-	curfps = fpu_install_fpstate(fpu, newfps);
-
 	/* Do the final updates within the locked region */
 	xstate_init_xcomp_bv(&newfps->regs.xsave, newfps->xfeatures);
-	xfd_update_state(newfps);
 
+	if (guest_fpu) {
+		curfps = xchg(&guest_fpu->fpstate, newfps);
+		/* If curfps is active, update the FPU fpstate pointer */
+		if (fpu->fpstate == curfps)
+			fpu->fpstate = newfps;
+	} else {
+		curfps = xchg(&fpu->fpstate, newfps);
+	}
+
+	xfd_update_state(fpu->fpstate);
 	fpregs_unlock();
 
-	vfree(curfps);
+	/* Only free valloc'ed state */
+	if (curfps && curfps->is_valloc)
+		vfree(curfps);
+
 	return 0;
 }
 
@@ -1682,14 +1677,16 @@ static int xstate_request_perm(unsigned
 	return ret;
 }
 
-int xfd_enable_feature(u64 xfd_err)
+int __xfd_enable_feature(u64 xfd_err, struct fpu_guest *guest_fpu)
 {
 	u64 xfd_event = xfd_err & XFEATURE_MASK_USER_DYNAMIC;
+	struct fpu_state_perm *perm;
 	unsigned int ksize, usize;
 	struct fpu *fpu;
 
 	if (!xfd_event) {
-		pr_err_once("XFD: Invalid xfd error: %016llx\n", xfd_err);
+		if (!guest_fpu)
+			pr_err_once("XFD: Invalid xfd error: %016llx\n", xfd_err);
 		return 0;
 	}
 
@@ -1697,14 +1694,16 @@ int xfd_enable_feature(u64 xfd_err)
 	spin_lock_irq(&current->sighand->siglock);
 
 	/* If not permitted let it die */
-	if ((xstate_get_host_group_perm() & xfd_event) != xfd_event) {
+	if ((xstate_get_group_perm(!!guest_fpu) & xfd_event) != xfd_event) {
 		spin_unlock_irq(&current->sighand->siglock);
 		return -EPERM;
 	}
 
 	fpu = &current->group_leader->thread.fpu;
-	ksize = fpu->perm.__state_size;
-	usize = fpu->perm.__user_state_size;
+	perm = guest_fpu ? &fpu->guest_perm : &fpu->perm;
+	ksize = perm->__state_size;
+	usize = perm->__user_state_size;
+
 	/*
 	 * The feature is permitted. State size is sufficient.  Dropping
 	 * the lock is safe here even if more features are added from
@@ -1717,10 +1716,16 @@ int xfd_enable_feature(u64 xfd_err)
 	 * Try to allocate a new fpstate. If that fails there is no way
 	 * out.
 	 */
-	if (fpstate_realloc(xfd_event, ksize, usize))
+	if (fpstate_realloc(xfd_event, ksize, usize, guest_fpu))
 		return -EFAULT;
 	return 0;
 }
+
+int xfd_enable_feature(u64 xfd_err)
+{
+	return __xfd_enable_feature(xfd_err, NULL);
+}
+
 #else /* CONFIG_X86_64 */
 static inline int xstate_request_perm(unsigned long idx, bool guest)
 {
--- a/arch/x86/kernel/fpu/xstate.h
+++ b/arch/x86/kernel/fpu/xstate.h
@@ -60,6 +60,8 @@ extern void fpu__init_system_xstate(unsi
 
 extern void *get_xsave_addr(struct xregs_state *xsave, int xfeature_nr);
 
+extern int __xfd_enable_feature(u64 which, struct fpu_guest *guest_fpu);
+
 static inline u64 xfeatures_mask_supervisor(void)
 {
 	return fpu_kernel_cfg.max_features & XFEATURE_MASK_SUPERVISOR_SUPPORTED;



* [patch 5/6] x86/fpu: Provide fpu_update_guest_xcr0/xfd()
  2021-12-14  2:50 [patch 0/6] x86/fpu: Preparatory changes for guest AMX support Thomas Gleixner
                   ` (3 preceding siblings ...)
  2021-12-14  2:50 ` [patch 4/6] x86/fpu: Add guest support to xfd_enable_feature() Thomas Gleixner
@ 2021-12-14  2:50 ` Thomas Gleixner
  2021-12-14  6:25   ` Tian, Kevin
                     ` (2 more replies)
  2021-12-14  2:50 ` [patch 6/6] x86/fpu: Provide kvm_sync_guest_vmexit_xfd_state() Thomas Gleixner
                   ` (3 subsequent siblings)
  8 siblings, 3 replies; 47+ messages in thread
From: Thomas Gleixner @ 2021-12-14  2:50 UTC (permalink / raw)
  To: LKML
  Cc: Jing Liu, Yang Zhong, Paolo Bonzini, x86, kvm,
	Sean Christopherson, Jun Nakajima, Kevin Tian

KVM can require fpstate expansion due to updates to XCR0 and to the XFD
MSR. In both cases it is required to check whether:

  - the requested values are correct or permitted

  - the resulting xfeature mask which is relevant for XSAVES is a subset of
    the guest's fpstate xfeature mask for which the register buffer is sized.

    If the feature mask does not fit into the guest's fpstate then
    reallocation is required.

Provide a common update function which utilizes the existing XFD enablement
mechanics and two wrapper functions, one for XCR0 and one for XFD.

These wrappers have to be invoked from XSETBV emulation and the XFD MSR
write emulation.

XCR0 modification can only proceed when fpu_update_guest_xcr0() returns
success.

XFD modification is done by the FPU core code as it requires to update the
software state as well.
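
The core expansion check which the common update function performs can be
sketched as mask arithmetic. This is an illustrative model of the logic in
the patch below, not the kernel code itself:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Features requested via XCR0 and not disarmed via XFD must be a
 * subset of the xfeatures for which the guest fpstate buffer is
 * currently sized; any remainder requires fpstate reallocation.
 */
uint64_t features_to_expand(uint64_t xcr0, uint64_t xfd, uint64_t cur_xfeatures)
{
	uint64_t requested = xcr0 & ~xfd;

	return requested & ~cur_xfeatures;
}
```

With AMX (bits 17/18) armed via XFD no expansion is needed; clearing XFD
for those bits makes the missing features show up.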

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
New version to handle the restore case and XCR0 updates correctly.
---
 arch/x86/include/asm/fpu/api.h |   57 +++++++++++++++++++++++++++++++++++++++++
 arch/x86/kernel/fpu/core.c     |   57 +++++++++++++++++++++++++++++++++++++++++
 2 files changed, 114 insertions(+)

--- a/arch/x86/include/asm/fpu/api.h
+++ b/arch/x86/include/asm/fpu/api.h
@@ -136,6 +136,63 @@ extern void fpstate_clear_xstate_compone
 extern bool fpu_alloc_guest_fpstate(struct fpu_guest *gfpu);
 extern void fpu_free_guest_fpstate(struct fpu_guest *gfpu);
 extern int fpu_swap_kvm_fpstate(struct fpu_guest *gfpu, bool enter_guest);
+extern int __fpu_update_guest_features(struct fpu_guest *guest_fpu, u64 xcr0, u64 xfd);
+
+/**
+ * fpu_update_guest_xcr0 - Update guest XCR0 from XSETBV emulation
+ * @guest_fpu:	Pointer to the guest FPU container
+ * @xcr0:	Requested guest XCR0
+ *
+ * Has to be invoked before making the guest XCR0 update effective. The
+ * function validates that the requested features are permitted and ensures
+ * that @guest_fpu->fpstate is properly sized taking @guest_fpu->fpstate->xfd
+ * into account.
+ *
+ * If @guest_fpu->fpstate is not the current task's active fpstate then the
+ * caller has to ensure that @guest_fpu->fpstate cannot be concurrently in
+ * use, i.e. the guest restore case.
+ *
+ * Return:
+ * 0		- Success
+ * -EPERM	- Feature(s) not permitted
+ * -EFAULT	- Resizing of fpstate failed
+ */
+static inline int fpu_update_guest_xcr0(struct fpu_guest *guest_fpu, u64 xcr0)
+{
+	return __fpu_update_guest_features(guest_fpu, xcr0, guest_fpu->fpstate->xfd);
+}
+
+/**
+ * fpu_update_guest_xfd - Update guest XFD from MSR write emulation
+ * @guest_fpu:	Pointer to the guest FPU container
+ * @xcr0:	Current guest XCR0
+ * @xfd:	Requested XFD value
+ *
+ * Has to be invoked to make the guest XFD update effective. The function
+ * validates the XFD value and ensures that @guest_fpu->fpstate is properly
+ * sized by taking @xcr0 into account.
+ *
+ * The caller must not modify @guest_fpu->fpstate->xfd or the XFD MSR
+ * directly.
+ *
+ * If @guest_fpu->fpstate is not the current task's active fpstate then the
+ * caller has to ensure that @guest_fpu->fpstate cannot be concurrently in
+ * use, i.e. the guest restore case.
+ *
+ * On success the buffer size is valid, @guest_fpu->fpstate.xfd == @xfd and
+ * if the guest fpstate is active then MSR_IA32_XFD == @xfd.
+ *
+ * On failure the previous state is retained.
+ *
+ * Return:
+ * 0		- Success
+ * -ENOTSUPP	- XFD value not supported
+ * -EFAULT	- Resizing of fpstate failed
+ */
+static inline int fpu_update_guest_xfd(struct fpu_guest *guest_fpu, u64 xcr0, u64 xfd)
+{
+	return __fpu_update_guest_features(guest_fpu, xcr0, xfd);
+}
 
 extern void fpu_copy_guest_fpstate_to_uabi(struct fpu_guest *gfpu, void *buf, unsigned int size, u32 pkru);
 extern int fpu_copy_uabi_to_guest_fpstate(struct fpu_guest *gfpu, const void *buf, u64 xcr0, u32 *vpkru);
--- a/arch/x86/kernel/fpu/core.c
+++ b/arch/x86/kernel/fpu/core.c
@@ -261,6 +261,63 @@ void fpu_free_guest_fpstate(struct fpu_g
 }
 EXPORT_SYMBOL_GPL(fpu_free_guest_fpstate);
 
+/**
+ * __fpu_update_guest_features - Validate and enable guest XCR0 and XFD updates
+ * @guest_fpu:	Pointer to the guest FPU container
+ * @xcr0:	Guest XCR0
+ * @xfd:	Guest XFD
+ *
+ * Note: @xcr0 and @xfd must either be the already validated values or the
+ * requested values (guest emulation or host write on restore).
+ *
+ * Do not invoke directly. Use the provided wrappers fpu_update_guest_xcr0()
+ * and fpu_update_guest_xfd() instead.
+ *
+ * Return: 0 on success, error code otherwise
+ */
+int __fpu_update_guest_features(struct fpu_guest *guest_fpu, u64 xcr0, u64 xfd)
+{
+	u64 expand, requested;
+
+	lockdep_assert_preemption_enabled();
+
+	/* Only permitted features are allowed in XCR0 */
+	if (xcr0 & ~guest_fpu->perm)
+		return -EPERM;
+
+	/* Check for unsupported XFD values */
+	if (xfd & ~XFEATURE_MASK_USER_DYNAMIC || xfd & ~fpu_user_cfg.max_features)
+		return -ENOTSUPP;
+
+	if (!IS_ENABLED(CONFIG_X86_64))
+		return 0;
+
+	/*
+	 * The requested features are set in XCR0 and not set in XFD.
+	 * Feature expansion is required when the requested features are
+	 * not a subset of the xfeatures for which @guest_fpu->fpstate
+	 * is sized.
+	 */
+	requested = xcr0 & ~xfd;
+	expand = requested & ~guest_fpu->xfeatures;
+	if (!expand) {
+		/*
+		 * fpstate size is correct, update the XFD value in fpstate
+		 * and if @guest_fpu->fpstate is active also the per CPU
+		 * cache and the MSR.
+		 */
+		fpregs_lock();
+		guest_fpu->fpstate->xfd = xfd;
+		if (guest_fpu->fpstate->in_use)
+			xfd_update_state(guest_fpu->fpstate);
+		fpregs_unlock();
+		return 0;
+	}
+
+	return __xfd_enable_feature(expand, guest_fpu);
+}
+EXPORT_SYMBOL_GPL(__fpu_update_guest_features);
+
 int fpu_swap_kvm_fpstate(struct fpu_guest *guest_fpu, bool enter_guest)
 {
 	struct fpstate *guest_fps = guest_fpu->fpstate;


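Stripped of the locking and MSR details, the validity and expansion check in __fpu_update_guest_features() above is plain bitmask arithmetic. A minimal userspace sketch of that logic — all mask values and the struct are invented for illustration, and ENOTSUPP is defined locally since it is a kernel-internal errno:

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#ifndef ENOTSUPP
#define ENOTSUPP 524	/* kernel-internal value, not in userspace errno.h */
#endif

/* Illustrative stand-in for the dynamic feature bits (e.g. AMX) */
#define XFEATURE_MASK_USER_DYNAMIC_SKETCH	0x60000ULL

struct fpu_guest_sketch {
	uint64_t perm;		/* frozen guest permissions */
	uint64_t xfeatures;	/* features the fpstate buffer is sized for */
};

/*
 * Returns a negative error code for invalid arguments, 0 when the
 * current fpstate size is sufficient, or a positive mask of features
 * which would require buffer expansion.
 */
static int64_t check_guest_features(const struct fpu_guest_sketch *g,
				    uint64_t max_features,
				    uint64_t xcr0, uint64_t xfd)
{
	uint64_t requested, expand;

	/* Only permitted features are allowed in XCR0 */
	if (xcr0 & ~g->perm)
		return -EPERM;

	/* XFD may only carry supported dynamic feature bits */
	if ((xfd & ~XFEATURE_MASK_USER_DYNAMIC_SKETCH) || (xfd & ~max_features))
		return -ENOTSUPP;

	/* Features set in XCR0 and not disarmed via XFD need backing store */
	requested = xcr0 & ~xfd;
	expand = requested & ~g->xfeatures;
	return (int64_t)expand;
}
```

Only when the returned expansion mask is non-zero does the real code fall through to __xfd_enable_feature() and reallocate.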
^ permalink raw reply	[flat|nested] 47+ messages in thread

* [patch 6/6] x86/fpu: Provide kvm_sync_guest_vmexit_xfd_state()
  2021-12-14  2:50 [patch 0/6] x86/fpu: Preparatory changes for guest AMX support Thomas Gleixner
                   ` (4 preceding siblings ...)
  2021-12-14  2:50 ` [patch 5/6] x86/fpu: Provide fpu_update_guest_xcr0/xfd() Thomas Gleixner
@ 2021-12-14  2:50 ` Thomas Gleixner
  2021-12-15  6:35   ` Liu, Jing2
  2021-12-14  6:50 ` [patch 0/6] x86/fpu: Preparatory changes for guest AMX support Tian, Kevin
                   ` (2 subsequent siblings)
  8 siblings, 1 reply; 47+ messages in thread
From: Thomas Gleixner @ 2021-12-14  2:50 UTC (permalink / raw)
  To: LKML
  Cc: Jing Liu, Yang Zhong, Paolo Bonzini, x86, kvm,
	Sean Christoperson, Jin Nakajima, Kevin Tian

To reduce overhead, KVM can disable write emulation for the XFD MSR when the
vCPU's fpstate is already correctly sized.

When write emulation is disabled the XFD MSR state after a VMEXIT is
unknown and therefore not in sync with the software states in fpstate and
the per CPU XFD cache.

Provide kvm_sync_guest_vmexit_xfd_state() which has to be invoked after a
VMEXIT before enabling interrupts when write emulation is disabled for the
XFD MSR.

It could be invoked unconditionally even when write emulation is enabled
for the price of a pointless MSR read.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/include/asm/fpu/api.h |    6 ++++++
 arch/x86/kernel/fpu/core.c     |   26 ++++++++++++++++++++++++++
 2 files changed, 32 insertions(+)

--- a/arch/x86/include/asm/fpu/api.h
+++ b/arch/x86/include/asm/fpu/api.h
@@ -194,6 +194,12 @@ static inline int fpu_update_guest_xfd(s
 	return __fpu_update_guest_features(guest_fpu, xcr0, xfd);
 }
 
+#ifdef CONFIG_X86_64
+extern void fpu_sync_guest_vmexit_xfd_state(void);
+#else
+static inline void fpu_sync_guest_vmexit_xfd_state(void) { }
+#endif
+
 extern void fpu_copy_guest_fpstate_to_uabi(struct fpu_guest *gfpu, void *buf, unsigned int size, u32 pkru);
 extern int fpu_copy_uabi_to_guest_fpstate(struct fpu_guest *gfpu, const void *buf, u64 xcr0, u32 *vpkru);
 
--- a/arch/x86/kernel/fpu/core.c
+++ b/arch/x86/kernel/fpu/core.c
@@ -318,6 +318,32 @@ int __fpu_update_guest_features(struct f
 }
 EXPORT_SYMBOL_GPL(__fpu_update_guest_features);
 
+#ifdef CONFIG_X86_64
+/**
+ * fpu_sync_guest_vmexit_xfd_state - Synchronize XFD MSR and software state
+ *
+ * Must be invoked from KVM after a VMEXIT before enabling interrupts when
+ * XFD write emulation is disabled. This is required because the guest can
+ * freely modify XFD and the state at VMEXIT is not guaranteed to be the
+ * same as the state on VMENTER. So software state has to be updated before
+ * any operation which depends on it can take place.
+ *
+ * Note: It can be invoked unconditionally even when write emulation is
+ * enabled for the price of a then pointless MSR read.
+ */
+void fpu_sync_guest_vmexit_xfd_state(void)
+{
+	struct fpstate *fps = current->thread.fpu.fpstate;
+
+	lockdep_assert_irqs_disabled();
+	if (fpu_state_size_dynamic()) {
+		rdmsrl(MSR_IA32_XFD, fps->xfd);
+		__this_cpu_write(xfd_state, fps->xfd);
+	}
+}
+EXPORT_SYMBOL_GPL(fpu_sync_guest_vmexit_xfd_state);
+#endif /* CONFIG_X86_64 */
+
 int fpu_swap_kvm_fpstate(struct fpu_guest *guest_fpu, bool enter_guest)
 {
 	struct fpstate *guest_fps = guest_fpu->fpstate;


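The rule fpu_sync_guest_vmexit_xfd_state() above enforces boils down to: when the MSR is not intercepted, its value after VMEXIT is unknown, so re-read it and refresh both software copies before anything consults them. A userspace sketch with a stubbed MSR (all names are invented for illustration):

```c
#include <assert.h>
#include <stdint.h>
#include <stdbool.h>

/* Stand-ins for hardware and kernel state */
static uint64_t msr_ia32_xfd;		/* "hardware" MSR, guest-writable */
static uint64_t percpu_xfd_cache;	/* per-CPU shadow of the MSR */

struct fpstate_sketch {
	uint64_t xfd;			/* software copy in fpstate */
};

static uint64_t rdmsrl_stub(void)
{
	return msr_ia32_xfd;
}

/*
 * Mirrors the shape of fpu_sync_guest_vmexit_xfd_state(): after a
 * VMEXIT with XFD write emulation disabled, pull the MSR value into
 * the fpstate and the per-CPU cache.
 */
static void sync_vmexit_xfd(struct fpstate_sketch *fps, bool dynamic_features)
{
	if (dynamic_features) {
		fps->xfd = rdmsrl_stub();
		percpu_xfd_cache = fps->xfd;
	}
}
```

If write emulation is enabled instead, the software copies are already authoritative and the read is merely pointless, matching the note in the kerneldoc comment.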

* RE: [patch 1/6] x86/fpu: Extend fpu_xstate_prctl() with guest permissions
  2021-12-14  2:50 ` [patch 1/6] x86/fpu: Extend fpu_xstate_prctl() with guest permissions Thomas Gleixner
@ 2021-12-14  5:13   ` Tian, Kevin
  2021-12-14 10:37     ` Paolo Bonzini
  0 siblings, 1 reply; 47+ messages in thread
From: Tian, Kevin @ 2021-12-14  5:13 UTC (permalink / raw)
  To: Thomas Gleixner, LKML
  Cc: Jing Liu, Zhong, Yang, Paolo Bonzini, x86, kvm, Christopherson,,
	Sean, Nakajima, Jun

> From: Thomas Gleixner <tglx@linutronix.de>
> Sent: Tuesday, December 14, 2021 10:50 AM
> 
> KVM requires a clear separation of host user space and guest permissions
> for dynamic XSTATE components.
> 
> Add a guest permissions member to struct fpu and a separate set of prctl()
> arguments: ARCH_GET_XCOMP_GUEST_PERM and
> ARCH_REQ_XCOMP_GUEST_PERM.
> 
> The semantics are equivalent to the host user space permission control
> except for the following constraints:
> 
>   1) Permissions have to be requested before the first vCPU is created
> 
>   2) Permissions are frozen when the first vCPU is created to ensure
>      consistency. Any attempt to expand permissions via the prctl() after
>      that point is rejected.

A curiosity question. Do we allow the user to reduce permissions?

> @@ -477,6 +479,13 @@ struct fpu {
>  	struct fpu_state_perm		perm;
> 
>  	/*
> +	 * @guest_perm:
> +	 *
> +	 * Permission related information for guest pseudo FPUs
> +	 */

Why call it 'pseudo'? It's real FPU state managed by this series...

> @@ -1742,6 +1751,7 @@ long fpu_xstate_prctl(struct task_struct
>  	u64 __user *uptr = (u64 __user *)arg2;
>  	u64 permitted, supported;
>  	unsigned long idx = arg2;
> +	bool guest = false;
> 
>  	if (tsk != current)
>  		return -EPERM;
> @@ -1760,11 +1770,20 @@ long fpu_xstate_prctl(struct task_struct
>  		permitted &= XFEATURE_MASK_USER_SUPPORTED;
>  		return put_user(permitted, uptr);
> 
> +	case ARCH_GET_XCOMP_GUEST_PERM:
> +		permitted = xstate_get_guest_group_perm();
> +		permitted &= XFEATURE_MASK_USER_SUPPORTED;
> +		return put_user(permitted, uptr);

Similarly as done for ARCH_REQ_XCOMP_GUEST_PERM:

+	case ARCH_GET_XCOMP_GUEST_PERM:
+		guest = true;
+		fallthrough;
+
	case ARCH_GET_XCOMP_PERM:
		/*
		 * Lockless snapshot as it can also change right after the
		 * dropping the lock.
		 */
-		permitted = xstate_get_host_group_perm();
+		permitted = xstate_get_group_perm(guest);
		permitted &= XFEATURE_MASK_USER_SUPPORTED;
		return put_user(permitted, uptr);

So the comment about 'lockless' is shared by both.

> +
> +	case ARCH_REQ_XCOMP_GUEST_PERM:
> +		guest = true;
> +		fallthrough;
> +
>  	case ARCH_REQ_XCOMP_PERM:
>  		if (!IS_ENABLED(CONFIG_X86_64))
>  			return -EOPNOTSUPP;
> 
> -		return xstate_request_perm(idx);
> +		return xstate_request_perm(idx, guest);
> 
>  	default:
>  		return -EINVAL;

Thanks
Kevin

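Kevin's fallthrough suggestion above reduces both GET cases to one lockless snapshot selected by a guest flag. A minimal userspace sketch of that shape — the struct and mask value are invented for illustration:

```c
#include <assert.h>
#include <stdint.h>
#include <stdbool.h>

/* Illustrative stand-in for XFEATURE_MASK_USER_SUPPORTED */
#define USER_SUPPORTED_SKETCH	0x602e7ULL

struct group_perms {
	uint64_t host;
	uint64_t guest;
};

/* Lockless snapshot; shared by the host and guest GET cases */
static uint64_t get_group_perm(const struct group_perms *p, bool guest)
{
	uint64_t permitted = guest ? p->guest : p->host;

	return permitted & USER_SUPPORTED_SKETCH;
}
```

The REQ cases follow the same pattern: set the flag in the guest case, fall through, and pass it down to the common worker.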

* RE: [patch 4/6] x86/fpu: Add guest support to xfd_enable_feature()
  2021-12-14  2:50 ` [patch 4/6] x86/fpu: Add guest support to xfd_enable_feature() Thomas Gleixner
@ 2021-12-14  6:05   ` Tian, Kevin
  2021-12-14 10:21     ` Paolo Bonzini
  0 siblings, 1 reply; 47+ messages in thread
From: Tian, Kevin @ 2021-12-14  6:05 UTC (permalink / raw)
  To: Thomas Gleixner, LKML
  Cc: Jing Liu, Zhong, Yang, Paolo Bonzini, x86, kvm, Christopherson,,
	Sean, Nakajima, Jun

> From: Thomas Gleixner <tglx@linutronix.de>
> Sent: Tuesday, December 14, 2021 10:50 AM
> 
> Guest support for dynamically enabling FPU features requires a few

'enabling' -> 'enabled'

> modifications to the enablement function which is currently invoked from
> the #NM handler:
> 
>   1) Use guest permissions and sizes for the update
> 
>   2) Update fpu_guest state accordingly
> 
>   3) Take into account that the enabling can be triggered either from a
>      running guest via XSETBV and MSR_IA32_XFD write emulation and from

'and from' -> 'or from'

>      a guest restore. In the latter case the guests fpstate is not the
>      current tasks active fpstate.
> 
> Split the function and implement the guest mechanics throughout the
> callchain.
> 
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

[...]
> @@ -1553,6 +1531,13 @@ static int fpstate_realloc(u64 xfeatures
>  	newfps->user_size = usize;
>  	newfps->is_valloc = true;
> 
> +	if (guest_fpu) {
> +		newfps->is_guest = true;
> +		newfps->is_confidential = curfps->is_confidential;
> +		newfps->in_use = curfps->in_use;
> +		guest_fpu->xfeatures |= xfeatures;
> +	}
> +

As you explained, the guest fpstate is not the currently active one in the
restoring path, thus it's not correct to always inherit attributes from
the active one.

Also we want to avoid touching real hardware state if guest_fpstate
!= curfps, e.g.:

	if (test_thread_flag(TIF_NEED_FPU_LOAD))
		fpregs_restore_userregs();

> +	if (guest_fpu) {
> +		curfps = xchg(&guest_fpu->fpstate, newfps);
> +		/* If curfps is active, update the FPU fpstate pointer */
> +		if (fpu->fpstate == curfps)
> +			fpu->fpstate = newfps;
> +	} else {
> +		curfps = xchg(&fpu->fpstate, newfps);
> +	}
> +
> +	xfd_update_state(fpu->fpstate);

and also here.

> @@ -1697,14 +1694,16 @@ int xfd_enable_feature(u64 xfd_err)
>  	spin_lock_irq(&current->sighand->siglock);
> 
>  	/* If not permitted let it die */
> -	if ((xstate_get_host_group_perm() & xfd_event) != xfd_event) {
> +	if ((xstate_get_group_perm(!!guest_fpu) & xfd_event) != xfd_event) {
>  		spin_unlock_irq(&current->sighand->siglock);
>  		return -EPERM;
>  	}
> 
>  	fpu = &current->group_leader->thread.fpu;
> -	ksize = fpu->perm.__state_size;
> -	usize = fpu->perm.__user_state_size;
> +	perm = guest_fpu ? &fpu->guest_perm : &fpu->perm;
> +	ksize = perm->__state_size;
> +	usize = perm->__user_state_size;
> +

Do we want to mention in the commit msg that fpstate 
reallocation size is based on permissions instead of requested 
features? The intuitive thought is that each time a new feature is 
requested this expands the buffer to match the requested feature...

Thanks
Kevin


* RE: [patch 5/6] x86/fpu: Provide fpu_update_guest_xcr0/xfd()
  2021-12-14  2:50 ` [patch 5/6] x86/fpu: Provide fpu_update_guest_xcr0/xfd() Thomas Gleixner
@ 2021-12-14  6:25   ` Tian, Kevin
  2021-12-14 15:09   ` Wang, Wei W
  2021-12-15  6:14   ` Tian, Kevin
  2 siblings, 0 replies; 47+ messages in thread
From: Tian, Kevin @ 2021-12-14  6:25 UTC (permalink / raw)
  To: Thomas Gleixner, LKML
  Cc: Jing Liu, Zhong, Yang, Paolo Bonzini, x86, kvm, Christopherson,,
	Sean, Nakajima, Jun

> From: Thomas Gleixner <tglx@linutronix.de>
> Sent: Tuesday, December 14, 2021 10:50 AM
> 
> KVM can require fpstate expansion due to updates to XCR0 and to the XFD
> MSR. In both cases it is required to check whether:
> 
>   - the requested values are correct or permitted
> 
>   - the resulting xfeature mask which is relevant for XSAVES is a subset of
>     the guests fpstate xfeature mask for which the register buffer is sized.
> 
>     If the feature mask does not fit into the guests fpstate then
>     reallocation is required.
> 
> Provide a common update function which utilizes the existing XFD
> enablement
> mechanics and two wrapper functions, one for XCR0 and one for XFD.
> 
> These wrappers have to be invoked from XSETBV emulation and the XFD
> MSR
> write emulation.
> 
> XCR0 modification can only proceed when fpu_update_guest_xcr0() returns
> success.

Currently XCR0 is modified right before entering guest with preemption 
disabled (see kvm_load_guest_xsave_state()). So this assumption is met.

> 
> XFD modification is done by the FPU core code as it requires to update the
> software state as well.
> 
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>

[...]
> +static inline int fpu_update_guest_xfd(struct fpu_guest *guest_fpu, u64 xcr0,
> u64 xfd)
> +{
> +	return __fpu_update_guest_features(guest_fpu, xcr0, xfd);
> +}

no need to pass in xcr0. It can be fetched from vcpu->arch.xcr0.

> +int __fpu_update_guest_features(struct fpu_guest *guest_fpu, u64 xcr0,
> u64 xfd)
> +{
> +	u64 expand, requested;
> +
> +	lockdep_assert_preemption_enabled();
> +
> +	/* Only permitted features are allowed in XCR0 */
> +	if (xcr0 & ~guest_fpu->perm)
> +		return -EPERM;
> +
> +	/* Check for unsupported XFD values */
> +	if (xfd & ~XFEATURE_MASK_USER_DYNAMIC || xfd &
> ~fpu_user_cfg.max_features)
> +		return -ENOTSUPP;
> +
> +	if (!IS_ENABLED(CONFIG_X86_64))
> +		return 0;

this could be checked first.

Thanks
Kevin


* RE: [patch 0/6] x86/fpu: Preparatory changes for guest AMX support
  2021-12-14  2:50 [patch 0/6] x86/fpu: Preparatory changes for guest AMX support Thomas Gleixner
                   ` (5 preceding siblings ...)
  2021-12-14  2:50 ` [patch 6/6] x86/fpu: Provide kvm_sync_guest_vmexit_xfd_state() Thomas Gleixner
@ 2021-12-14  6:50 ` Tian, Kevin
  2021-12-14  6:52 ` Liu, Jing2
  2021-12-14 10:42 ` Paolo Bonzini
  8 siblings, 0 replies; 47+ messages in thread
From: Tian, Kevin @ 2021-12-14  6:50 UTC (permalink / raw)
  To: Thomas Gleixner, LKML
  Cc: Jing Liu, Zhong, Yang, Paolo Bonzini, x86, kvm, Christopherson,,
	Sean, Nakajima, Jun

> From: Thomas Gleixner <tglx@linutronix.de>
> Sent: Tuesday, December 14, 2021 10:50 AM
> 
> Folks,
> 
> this is a follow up to the initial sketch of patches which got picked up by
> Jing and have been posted in combination with the KVM parts:
> 
>    https://lore.kernel.org/r/20211208000359.2853257-1-
> yang.zhong@intel.com
> 
> This update is only touching the x86/fpu code and not changing anything on
> the KVM side.
> 
>     BIG FAT WARNING: This is compile tested only!
> 
> In course of the discussion of the above patchset it turned out that there
> are a few conceptual issues vs. hardware and software state and also
> vs. guest restore.

Overall this is definitely a good move and also helps simplify the
KVM side logic. 😊

Thanks
Kevin


* Re: [patch 0/6] x86/fpu: Preparatory changes for guest AMX support
  2021-12-14  2:50 [patch 0/6] x86/fpu: Preparatory changes for guest AMX support Thomas Gleixner
                   ` (6 preceding siblings ...)
  2021-12-14  6:50 ` [patch 0/6] x86/fpu: Preparatory changes for guest AMX support Tian, Kevin
@ 2021-12-14  6:52 ` Liu, Jing2
  2021-12-14  7:54   ` Tian, Kevin
  2021-12-14 10:42 ` Paolo Bonzini
  8 siblings, 1 reply; 47+ messages in thread
From: Liu, Jing2 @ 2021-12-14  6:52 UTC (permalink / raw)
  To: Thomas Gleixner, LKML
  Cc: Yang Zhong, Paolo Bonzini, x86, kvm, Sean Christoperson,
	Jin Nakajima, Kevin Tian

Hi Thomas,

On 12/14/2021 10:50 AM, Thomas Gleixner wrote:
> Folks,
>
> this is a follow up to the initial sketch of patches which got picked up by
> Jing and have been posted in combination with the KVM parts:
>
>     https://lore.kernel.org/r/20211208000359.2853257-1-yang.zhong@intel.com
>
> This update is only touching the x86/fpu code and not changing anything on
> the KVM side.
>
>      BIG FAT WARNING: This is compile tested only!
>
> In course of the discussion of the above patchset it turned out that there
> are a few conceptual issues vs. hardware and software state and also
> vs. guest restore.
>
> This series addresses this with the following changes vs. the original
> approach:
>
>    1) fpstate reallocation is now independent of fpu_swap_kvm_fpstate()
>
>       It is triggered directly via XSETBV and XFD MSR write emulation which
>       are used both for runtime and restore purposes.
>
>       For this it provides two wrappers around a common update function, one
>       for XCR0 and one for XFD.
>
>       Both check the validity of the arguments and the correct sizing of the
>       guest FPU fpstate. If the size is not sufficient, fpstate is
>       reallocated.
>
>       The functions can fail.
>
>    2) XFD synchronization
>
>       KVM must neither touch the XFD MSR nor the fpstate->xfd software state
>       in order to guarantee state consistency.
>
>       In the MSR write emulation case the XFD specific update handler has to
>       be invoked. See #1
>
>       If MSR write emulation is disabled because the buffer size is
>       sufficient for all use cases, i.e.:
>
>       		guest_fpu::xfeatures == guest_fpu::perm
>
The buffer size can be sufficient once one of the features is requested,
since the kernel FPU reallocates the full (permitted) size. And I think
we don't want to disable interception until all the features are
detected, e.g., one by one.

Thus it can be guest_fpu::xfeatures != guest_fpu::perm.


Thanks,
Jing



* RE: [patch 0/6] x86/fpu: Preparatory changes for guest AMX support
  2021-12-14  6:52 ` Liu, Jing2
@ 2021-12-14  7:54   ` Tian, Kevin
  0 siblings, 0 replies; 47+ messages in thread
From: Tian, Kevin @ 2021-12-14  7:54 UTC (permalink / raw)
  To: Liu, Jing2, Thomas Gleixner, LKML
  Cc: Zhong, Yang, Paolo Bonzini, x86, kvm, Christopherson,,
	Sean, Nakajima, Jun

> From: Liu, Jing2 <jing2.liu@linux.intel.com>
> Sent: Tuesday, December 14, 2021 2:52 PM
> 
> On 12/14/2021 10:50 AM, Thomas Gleixner wrote:
> >       If MSR write emulation is disabled because the buffer size is
> >       sufficient for all use cases, i.e.:
> >
> >       		guest_fpu::xfeatures == guest_fpu::perm
> >
> The buffer size can be sufficient once one of the features is requested,
> since the kernel FPU reallocates the full (permitted) size. And I think
> we don't want to disable interception until all the features are
> detected, e.g., one by one.
> 
> Thus it can be guest_fpu::xfeatures != guest_fpu::perm.
> 

There are two options to handle multiple xfd features.

a) a conservative approach as Thomas suggested, i.e. don't disable emulation
   until all the features in guest_fpu::perm are requested by the guest. This
   definitely has poor performance if the guest only wants to use a subset of
   perm features. But from a functional p.o.v. it just works.

   Given we only have one xfeature today, let's just use this simple check which
   has ZERO negative impact.

b) an optimized approach by dynamically enabling/disabling emulation. e.g.
   we can disable emulation after the 1st xfd feature is enabled and then 
   reenable it in #NM vmexit handler when XFD_ERR includes a bit which is 
   not in guest_fpu::xfeatures, sort of like:

	--xfd trapped, perm has two xfd features--
	(G) access xfd_feature1;
	(H) trap #NM (XFD_ERR = xfd_feature1) and inject #NM;
	(G) WRMSR(IA32_XFD, (-1ULL) & ~xfd_feature1);
	(H) reallocate fpstate and disable write emulation for XFD;

	--xfd passed through--
	(G) do something...
	(G) access xfd_feature2;
	(H) trap #NM (XFD_ERR = xfd_feature2), enable emulation, inject #NM;

	--xfd trapped--
	(G) WRMSR(IA32_XFD, (-1ULL) & ~(xfd_feature1 | xfd_feature2));
	(H) reallocate fpstate and disable write emulation for XFD;

	--xfd passed through--
	(G) do something...

Thanks
Kevin

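Option b) above is essentially a small state machine keyed on whether XFD_ERR carries bits outside guest_fpu::xfeatures. A userspace sketch of Kevin's trace — all names are invented, and the reallocation is reduced to a mask update:

```c
#include <assert.h>
#include <stdint.h>
#include <stdbool.h>

/* Illustrative vCPU state for the interception toggle */
struct vcpu_sketch {
	uint64_t guest_xfeatures;	/* features the fpstate is sized for */
	bool xfd_intercepted;		/* XFD write emulation enabled? */
};

/* #NM vmexit: re-enable interception when an unseen feature faulted */
static void handle_nm_vmexit(struct vcpu_sketch *v, uint64_t xfd_err)
{
	if (xfd_err & ~v->guest_xfeatures)
		v->xfd_intercepted = true;
}

/*
 * Emulated XFD write: grow the buffer for newly armed features
 * (bits cleared in XFD within the permitted set), then drop
 * interception again.
 */
static void handle_xfd_write(struct vcpu_sketch *v, uint64_t xfd, uint64_t perm)
{
	uint64_t requested = perm & ~xfd;

	v->guest_xfeatures |= requested;	/* stands in for reallocation */
	v->xfd_intercepted = false;
}
```

Walking the two-feature trace through these handlers reproduces the trapped/passed-through transitions in the message above.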

* Re: [patch 4/6] x86/fpu: Add guest support to xfd_enable_feature()
  2021-12-14  6:05   ` Tian, Kevin
@ 2021-12-14 10:21     ` Paolo Bonzini
  2021-12-14 13:15       ` Thomas Gleixner
  0 siblings, 1 reply; 47+ messages in thread
From: Paolo Bonzini @ 2021-12-14 10:21 UTC (permalink / raw)
  To: Tian, Kevin, Thomas Gleixner, LKML
  Cc: Jing Liu, Zhong, Yang, x86, kvm, Christopherson,, Sean, Nakajima, Jun

On 12/14/21 07:05, Tian, Kevin wrote:
>> +	if (guest_fpu) {
>> +		newfps->is_guest = true;
>> +		newfps->is_confidential = curfps->is_confidential;
>> +		newfps->in_use = curfps->in_use;
>> +		guest_fpu->xfeatures |= xfeatures;
>> +	}
>> +
> As you explained, the guest fpstate is not the currently active one in
> the restoring path, thus it's not correct to always inherit attributes
> from the active one.

Indeed, guest_fpu->fpstate should be used instead of curfps.

Paolo


* Re: [patch 1/6] x86/fpu: Extend fpu_xstate_prctl() with guest permissions
  2021-12-14  5:13   ` Tian, Kevin
@ 2021-12-14 10:37     ` Paolo Bonzini
  0 siblings, 0 replies; 47+ messages in thread
From: Paolo Bonzini @ 2021-12-14 10:37 UTC (permalink / raw)
  To: Tian, Kevin, Thomas Gleixner, LKML
  Cc: Jing Liu, Zhong, Yang, x86, kvm, Christopherson,, Sean, Nakajima, Jun

On 12/14/21 06:13, Tian, Kevin wrote:
>>    2) Permissions are frozen when the first vCPU is created to ensure
>>       consistency. Any attempt to expand permissions via the prctl() after
>>       that point is rejected.
>
> A curiosity question. Do we allow the user to reduce permissions?

No, there's no prctl for that.

Paolo


* Re: [patch 0/6] x86/fpu: Preparatory changes for guest AMX support
  2021-12-14  2:50 [patch 0/6] x86/fpu: Preparatory changes for guest AMX support Thomas Gleixner
                   ` (7 preceding siblings ...)
  2021-12-14  6:52 ` Liu, Jing2
@ 2021-12-14 10:42 ` Paolo Bonzini
  2021-12-14 13:24   ` Thomas Gleixner
  8 siblings, 1 reply; 47+ messages in thread
From: Paolo Bonzini @ 2021-12-14 10:42 UTC (permalink / raw)
  To: Thomas Gleixner, LKML
  Cc: Jing Liu, Yang Zhong, x86, kvm, Sean Christoperson, Jin Nakajima,
	Kevin Tian

On 12/14/21 03:50, Thomas Gleixner wrote:
> The only remaining issue is the KVM XSTATE save/restore size checking which
> probably requires some FPU core assistance. But that requires some more
> thoughts vs. the IOCTL interface extension and once that is settled it
> needs to be solved in one go. But that's an orthogonal issue to the above.

That's not a big deal because KVM uses the uncompacted format.  So 
KVM_CHECK_EXTENSION and KVM_GET_XSAVE can just use CPUID to retrieve the 
size and uncompacted offset of the largest bit that is set in 
kvm_supported_xcr0, while KVM_SET_XSAVE can do the same with the largest 
bit that is set in the xstate_bv.

Paolo



> The series is also available from git:
> 
>     git://git.kernel.org/pub/scm/linux/kernel/git/people/tglx/devel.git x86/fpu-kvm



* Re: [patch 4/6] x86/fpu: Add guest support to xfd_enable_feature()
  2021-12-14 10:21     ` Paolo Bonzini
@ 2021-12-14 13:15       ` Thomas Gleixner
  2021-12-15  5:46         ` Tian, Kevin
  0 siblings, 1 reply; 47+ messages in thread
From: Thomas Gleixner @ 2021-12-14 13:15 UTC (permalink / raw)
  To: Paolo Bonzini, Tian, Kevin, LKML
  Cc: Jing Liu, Zhong, Yang, x86, kvm, Christopherson,, Sean, Nakajima, Jun

On Tue, Dec 14 2021 at 11:21, Paolo Bonzini wrote:
> On 12/14/21 07:05, Tian, Kevin wrote:
>>> +	if (guest_fpu) {
>>> +		newfps->is_guest = true;
>>> +		newfps->is_confidential = curfps->is_confidential;
>>> +		newfps->in_use = curfps->in_use;
>>> +		guest_fpu->xfeatures |= xfeatures;
>>> +	}
>>> +
>> As you explained, the guest fpstate is not the currently active one in
>> the restoring path, thus it's not correct to always inherit attributes
>> from the active one.

Good catch!

> Indeed, guest_fpu->fpstate should be used instead of curfps.

Something like the below.

Thanks,

        tglx
---
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -1500,35 +1500,13 @@ void fpstate_free(struct fpu *fpu)
 }
 
 /**
- * fpu_install_fpstate - Update the active fpstate in the FPU
- *
- * @fpu:	A struct fpu * pointer
- * @newfps:	A struct fpstate * pointer
- *
- * Returns:	A null pointer if the last active fpstate is the embedded
- *		one or the new fpstate is already installed;
- *		otherwise, a pointer to the old fpstate which has to
- *		be freed by the caller.
- */
-static struct fpstate *fpu_install_fpstate(struct fpu *fpu,
-					   struct fpstate *newfps)
-{
-	struct fpstate *oldfps = fpu->fpstate;
-
-	if (fpu->fpstate == newfps)
-		return NULL;
-
-	fpu->fpstate = newfps;
-	return oldfps != &fpu->__fpstate ? oldfps : NULL;
-}
-
-/**
  * fpstate_realloc - Reallocate struct fpstate for the requested new features
  *
  * @xfeatures:	A bitmap of xstate features which extend the enabled features
  *		of that task
  * @ksize:	The required size for the kernel buffer
  * @usize:	The required size for user space buffers
+ * @guest_fpu:	Pointer to a guest FPU container. NULL for host allocations
  *
  * Note vs. vmalloc(): If the task with a vzalloc()-allocated buffer
  * terminates quickly, vfree()-induced IPIs may be a concern, but tasks
@@ -1537,13 +1515,13 @@ static struct fpstate *fpu_install_fpsta
  * Returns: 0 on success, -ENOMEM on allocation error.
  */
 static int fpstate_realloc(u64 xfeatures, unsigned int ksize,
-			   unsigned int usize)
+			   unsigned int usize, struct fpu_guest *guest_fpu)
 {
 	struct fpu *fpu = &current->thread.fpu;
 	struct fpstate *curfps, *newfps = NULL;
 	unsigned int fpsize;
+	bool in_use;
 
-	curfps = fpu->fpstate;
 	fpsize = ksize + ALIGN(offsetof(struct fpstate, regs), 64);
 
 	newfps = vzalloc(fpsize);
@@ -1553,28 +1531,55 @@ static int fpstate_realloc(u64 xfeatures
 	newfps->user_size = usize;
 	newfps->is_valloc = true;
 
+	/*
+	 * When a guest FPU is supplied, use @guest_fpu->fpstate
+	 * as reference, independent of whether it is in use or not.
+	 */
+	curfps = guest_fpu ? guest_fpu->fpstate : fpu->fpstate;
+
+	/* Determine whether @curfps is the active fpstate */
+	in_use = fpu->fpstate == curfps;
+
+	if (guest_fpu) {
+		newfps->is_guest = true;
+		newfps->is_confidential = curfps->is_confidential;
+		newfps->in_use = curfps->in_use;
+		guest_fpu->xfeatures |= xfeatures;
+	}
+
 	fpregs_lock();
 	/*
-	 * Ensure that the current state is in the registers before
-	 * swapping fpstate as that might invalidate it due to layout
-	 * changes.
+	 * If @curfps is in use, ensure that the current state is in the
+	 * registers before swapping fpstate as that might invalidate it
+	 * due to layout changes.
 	 */
-	if (test_thread_flag(TIF_NEED_FPU_LOAD))
+	if (in_use && test_thread_flag(TIF_NEED_FPU_LOAD))
 		fpregs_restore_userregs();
 
 	newfps->xfeatures = curfps->xfeatures | xfeatures;
 	newfps->user_xfeatures = curfps->user_xfeatures | xfeatures;
 	newfps->xfd = curfps->xfd & ~xfeatures;
 
-	curfps = fpu_install_fpstate(fpu, newfps);
-
 	/* Do the final updates within the locked region */
 	xstate_init_xcomp_bv(&newfps->regs.xsave, newfps->xfeatures);
-	xfd_update_state(newfps);
 
+	if (guest_fpu) {
+		curfps = xchg(&guest_fpu->fpstate, newfps);
+		/* If curfps is active, update the FPU fpstate pointer */
+		if (in_use)
+			fpu->fpstate = newfps;
+	} else {
+		curfps = xchg(&fpu->fpstate, newfps);
+	}
+
+	if (in_use)
+		xfd_update_state(fpu->fpstate);
 	fpregs_unlock();
 
-	vfree(curfps);
+	/* Only free valloc'ed state */
+	if (curfps && curfps->is_valloc)
+		vfree(curfps);
+
 	return 0;
 }
 
@@ -1682,14 +1687,16 @@ static int xstate_request_perm(unsigned
 	return ret;
 }
 
-int xfd_enable_feature(u64 xfd_err)
+int __xfd_enable_feature(u64 xfd_err, struct fpu_guest *guest_fpu)
 {
 	u64 xfd_event = xfd_err & XFEATURE_MASK_USER_DYNAMIC;
+	struct fpu_state_perm *perm;
 	unsigned int ksize, usize;
 	struct fpu *fpu;
 
 	if (!xfd_event) {
-		pr_err_once("XFD: Invalid xfd error: %016llx\n", xfd_err);
+		if (!guest_fpu)
+			pr_err_once("XFD: Invalid xfd error: %016llx\n", xfd_err);
 		return 0;
 	}
 
@@ -1697,14 +1704,16 @@ int xfd_enable_feature(u64 xfd_err)
 	spin_lock_irq(&current->sighand->siglock);
 
 	/* If not permitted let it die */
-	if ((xstate_get_host_group_perm() & xfd_event) != xfd_event) {
+	if ((xstate_get_group_perm(!!guest_fpu) & xfd_event) != xfd_event) {
 		spin_unlock_irq(&current->sighand->siglock);
 		return -EPERM;
 	}
 
 	fpu = &current->group_leader->thread.fpu;
-	ksize = fpu->perm.__state_size;
-	usize = fpu->perm.__user_state_size;
+	perm = guest_fpu ? &fpu->guest_perm : &fpu->perm;
+	ksize = perm->__state_size;
+	usize = perm->__user_state_size;
+
 	/*
 	 * The feature is permitted. State size is sufficient.  Dropping
 	 * the lock is safe here even if more features are added from
@@ -1717,10 +1726,16 @@ int xfd_enable_feature(u64 xfd_err)
 	 * Try to allocate a new fpstate. If that fails there is no way
 	 * out.
 	 */
-	if (fpstate_realloc(xfd_event, ksize, usize))
+	if (fpstate_realloc(xfd_event, ksize, usize, guest_fpu))
 		return -EFAULT;
 	return 0;
 }
+
+int xfd_enable_feature(u64 xfd_err)
+{
+	return __xfd_enable_feature(xfd_err, NULL);
+}
+
 #else /* CONFIG_X86_64 */
 static inline int xstate_request_perm(unsigned long idx, bool guest)
 {
--- a/arch/x86/kernel/fpu/xstate.h
+++ b/arch/x86/kernel/fpu/xstate.h
@@ -60,6 +60,8 @@ extern void fpu__init_system_xstate(unsi
 
 extern void *get_xsave_addr(struct xregs_state *xsave, int xfeature_nr);
 
+extern int __xfd_enable_feature(u64 which, struct fpu_guest *guest_fpu);
+
 static inline u64 xfeatures_mask_supervisor(void)
 {
 	return fpu_kernel_cfg.max_features & XFEATURE_MASK_SUPERVISOR_SUPPORTED;


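The core of the fix in the patch above is that the reference fpstate and the task's active fpstate may differ; only when they coincide may the task pointer and the hardware-facing XFD state be touched. A sketch of just that pointer logic, with types simplified and all names invented:

```c
#include <assert.h>
#include <stdbool.h>

struct fps { int id; };

struct fpu_sketch { struct fps *fpstate; };
struct guest_sketch { struct fps *fpstate; };

/* Swap in @newfps; return the old fpstate which the caller may free */
static struct fps *install_fpstate(struct fpu_sketch *fpu,
				   struct guest_sketch *guest,
				   struct fps *newfps,
				   bool *hw_update)
{
	struct fps *curfps = guest ? guest->fpstate : fpu->fpstate;
	bool in_use = fpu->fpstate == curfps;

	if (guest) {
		guest->fpstate = newfps;
		if (in_use)
			fpu->fpstate = newfps;	/* keep task pointer coherent */
	} else {
		fpu->fpstate = newfps;
	}
	/* Only an active fpstate may drive the XFD MSR update */
	*hw_update = in_use;
	return curfps;
}
```

The restore path (guest fpstate inactive) thus swaps the guest pointer without touching the task's registers, which is exactly the case the original version got wrong.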

* Re: [patch 0/6] x86/fpu: Preparatory changes for guest AMX support
  2021-12-14 10:42 ` Paolo Bonzini
@ 2021-12-14 13:24   ` Thomas Gleixner
  0 siblings, 0 replies; 47+ messages in thread
From: Thomas Gleixner @ 2021-12-14 13:24 UTC (permalink / raw)
  To: Paolo Bonzini, LKML
  Cc: Jing Liu, Yang Zhong, x86, kvm, Sean Christoperson, Jin Nakajima,
	Kevin Tian

On Tue, Dec 14 2021 at 11:42, Paolo Bonzini wrote:
> On 12/14/21 03:50, Thomas Gleixner wrote:
>> The only remaining issue is the KVM XSTATE save/restore size checking which
>> probably requires some FPU core assistance. But that requires some more
>> thoughts vs. the IOCTL interface extension and once that is settled it
>> needs to be solved in one go. But that's an orthogonal issue to the above.
>
> That's not a big deal because KVM uses the uncompacted format.  So 
> KVM_CHECK_EXTENSION and KVM_GET_XSAVE can just use CPUID to retrieve the 
> size and uncompacted offset of the largest bit that is set in 
> kvm_supported_xcr0, while KVM_SET_XSAVE can do the same with the largest 
> bit that is set in the xstate_bv.

For simplicity you can just get that information from guest_fpu. See
below.

Thanks,

        tglx
---
--- a/arch/x86/include/asm/fpu/types.h
+++ b/arch/x86/include/asm/fpu/types.h
@@ -518,6 +518,11 @@ struct fpu_guest {
 	u64				perm;
 
 	/*
+	 * @uabi_size:			Size required for save/restore
+	 */
+	unsigned int			uabi_size;
+
+	/*
 	 * @fpstate:			Pointer to the allocated guest fpstate
 	 */
 	struct fpstate			*fpstate;
--- a/arch/x86/kernel/fpu/core.c
+++ b/arch/x86/kernel/fpu/core.c
@@ -240,6 +240,7 @@ bool fpu_alloc_guest_fpstate(struct fpu_
 	gfpu->fpstate		= fpstate;
 	gfpu->xfeatures		= fpu_user_cfg.default_features;
 	gfpu->perm		= fpu_user_cfg.default_features;
+	gfpu->uabi_size		= fpu_user_cfg.default_size;
 	fpu_init_guest_permissions(gfpu);
 
 	return true;
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -1545,6 +1545,7 @@ static int fpstate_realloc(u64 xfeatures
 		newfps->is_confidential = curfps->is_confidential;
 		newfps->in_use = curfps->in_use;
 		guest_fpu->xfeatures |= xfeatures;
+		guest_fpu->uabi_size = usize;
 	}
 
 	fpregs_lock();

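Paolo's sizing argument relies on the uncompacted XSAVE layout, where the required buffer size is simply the end of the highest enabled component. A sketch with an invented offset/size table — on real hardware these values come from CPUID leaf 0xD, and the numbers below are only illustrative:

```c
#include <assert.h>
#include <stdint.h>

#define NFEATURES 19

/* Illustrative per-component layout; CPUID(0xD, i) supplies the real one */
struct xcomp { uint32_t offset, size; };

/* Uncompacted format: size is offset+size of the highest set component */
static uint32_t uncompacted_size(const struct xcomp comp[NFEATURES],
				 uint64_t xstate_bv)
{
	uint32_t end = 512 + 64;	/* legacy area + XSAVE header */
	int i;

	for (i = 0; i < NFEATURES; i++) {
		if (!(xstate_bv & (1ULL << i)))
			continue;
		if (comp[i].offset + comp[i].size > end)
			end = comp[i].offset + comp[i].size;
	}
	return end;
}
```

Keeping guest_fpu::uabi_size up to date at reallocation time, as in the hunk above, saves KVM from recomputing this on every save/restore ioctl.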

* RE: [patch 5/6] x86/fpu: Provide fpu_update_guest_xcr0/xfd()
  2021-12-14  2:50 ` [patch 5/6] x86/fpu: Provide fpu_update_guest_xcr0/xfd() Thomas Gleixner
  2021-12-14  6:25   ` Tian, Kevin
@ 2021-12-14 15:09   ` Wang, Wei W
  2021-12-14 15:40     ` Thomas Gleixner
  2021-12-15  6:14   ` Tian, Kevin
  2 siblings, 1 reply; 47+ messages in thread
From: Wang, Wei W @ 2021-12-14 15:09 UTC (permalink / raw)
  To: Thomas Gleixner, LKML
  Cc: Jing Liu, Zhong, Yang, Paolo Bonzini, x86, kvm,
	Sean Christoperson, Nakajima, Jun, Tian, Kevin

On Tuesday, December 14, 2021 10:50 AM, Thomas Gleixner wrote:
> KVM can require fpstate expansion due to updates to XCR0 and to the XFD
> MSR. In both cases it is required to check whether:
> 
>   - the requested values are correct or permitted
> 
>   - the resulting xfeature mask which is relevant for XSAVES is a subset of
>     the guests fpstate xfeature mask for which the register buffer is sized.
> 
>     If the feature mask does not fit into the guests fpstate then
>     reallocation is required.
> 
> Provide a common update function which utilizes the existing XFD
> enablement mechanics and two wrapper functions, one for XCR0 and one
> for XFD.
> 
> These wrappers have to be invoked from XSETBV emulation and the XFD
> MSR write emulation.
> 
> XCR0 modification can only proceed when fpu_update_guest_xcr0() returns
> success.
> 
> XFD modification is done by the FPU core code as it requires to update the
> software state as well.
> 
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> ---
> New version to handle the restore case and XCR0 updates correctly.
> ---
>  arch/x86/include/asm/fpu/api.h |   57 +++++++++++++++++++++++++++++++++++++++++
>  arch/x86/kernel/fpu/core.c     |   57 +++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 114 insertions(+)
> 
> --- a/arch/x86/include/asm/fpu/api.h
> +++ b/arch/x86/include/asm/fpu/api.h
> @@ -136,6 +136,63 @@ extern void fpstate_clear_xstate_compone
>  extern bool fpu_alloc_guest_fpstate(struct fpu_guest *gfpu);
>  extern void fpu_free_guest_fpstate(struct fpu_guest *gfpu);
>  extern int fpu_swap_kvm_fpstate(struct fpu_guest *gfpu, bool enter_guest);
> +extern int __fpu_update_guest_features(struct fpu_guest *guest_fpu, u64 xcr0, u64 xfd);
> +
> +/**
> + * fpu_update_guest_xcr0 - Update guest XCR0 from XSETBV emulation
> + * @guest_fpu:	Pointer to the guest FPU container
> + * @xcr0:	Requested guest XCR0
> + *
> + * Has to be invoked before making the guest XCR0 update effective. The
> + * function validates that the requested features are permitted and ensures
> + * that @guest_fpu->fpstate is properly sized taking @guest_fpu->fpstate->xfd
> + * into account.
> + *
> + * If @guest_fpu->fpstate is not the current tasks active fpstate then the
> + * caller has to ensure that @guest_fpu->fpstate cannot be concurrently in
> + * use, i.e. the guest restore case.
> + *
> + * Return:
> + * 0		- Success
> + * -EPERM	- Feature(s) not permitted
> + * -EFAULT	- Resizing of fpstate failed
> + */
> +static inline int fpu_update_guest_xcr0(struct fpu_guest *guest_fpu, u64 xcr0)
> +{
> +	return __fpu_update_guest_features(guest_fpu, xcr0, guest_fpu->fpstate->xfd);
> +}
> +
> +/**
> + * fpu_update_guest_xfd - Update guest XFD from MSR write emulation
> + * @guest_fpu:	Pointer to the guest FPU container
> + * @xcr0:	Current guest XCR0
> + * @xfd:	Requested XFD value
> + *
> + * Has to be invoked to make the guest XFD update effective. The function
> + * validates the XFD value and ensures that @guest_fpu->fpstate is properly
> + * sized by taking @xcr0 into account.
> + *
> + * The caller must not modify @guest_fpu->fpstate->xfd or the XFD MSR
> + * directly.
> + *
> + * If @guest_fpu->fpstate is not the current tasks active fpstate then the
> + * caller has to ensure that @guest_fpu->fpstate cannot be concurrently in
> + * use, i.e. the guest restore case.
> + *
> + * On success the buffer size is valid, @guest_fpu->fpstate.xfd == @xfd and
> + * if the guest fpstate is active then MSR_IA32_XFD == @xfd.
> + *
> + * On failure the previous state is retained.
> + *
> + * Return:
> + * 0		- Success
> + * -ENOTSUPP	- XFD value not supported
> + * -EFAULT	- Resizing of fpstate failed
> + */
> +static inline int fpu_update_guest_xfd(struct fpu_guest *guest_fpu, u64 xcr0, u64 xfd)
> +{
> +	return __fpu_update_guest_features(guest_fpu, xcr0, xfd);
> +}
> 
>  extern void fpu_copy_guest_fpstate_to_uabi(struct fpu_guest *gfpu, void *buf, unsigned int size, u32 pkru);
>  extern int fpu_copy_uabi_to_guest_fpstate(struct fpu_guest *gfpu, const void *buf, u64 xcr0, u32 *vpkru);
> --- a/arch/x86/kernel/fpu/core.c
> +++ b/arch/x86/kernel/fpu/core.c
> @@ -261,6 +261,63 @@ void fpu_free_guest_fpstate(struct fpu_g
>  }
>  EXPORT_SYMBOL_GPL(fpu_free_guest_fpstate);
> 
> +/**
> + * __fpu_update_guest_features - Validate and enable guest XCR0 and XFD updates
> + * @guest_fpu:	Pointer to the guest FPU container
> + * @xcr0:	Guest XCR0
> + * @xfd:	Guest XFD
> + *
> + * Note: @xcr0 and @xfd must either be the already validated values or the
> + * requested values (guest emulation or host write on restore).
> + *
> + * Do not invoke directly. Use the provided wrappers fpu_validate_guest_xcr0()
> + * and fpu_update_guest_xfd() instead.
> + *
> + * Return: 0 on success, error code otherwise
> + */
> +int __fpu_update_guest_features(struct fpu_guest *guest_fpu, u64 xcr0, u64 xfd)
> +{

I think there would be one issue for the "host write on restore" case.
The current QEMU-based host restore uses the following sequence:
1) restore xsave
2) restore xcr0
3) restore XFD MSR

At the time of "1) restore xsave", KVM already needs fpstate expansion before restoring the xsave data.
So the two APIs here might not be usable for this case.
Our current solution to fpstate expansion at KVM_SET_XSAVE (i.e. step 1) above) is:

kvm_load_guest_fpu(vcpu);
guest_fpu->realloc_request = realloc_request;
kvm_put_guest_fpu(vcpu);

"realloc_request" above is generated from the "xstate_header" received from userspace.

Thanks,
Wei



* RE: [patch 5/6] x86/fpu: Provide fpu_update_guest_xcr0/xfd()
  2021-12-14 15:09   ` Wang, Wei W
@ 2021-12-14 15:40     ` Thomas Gleixner
  2021-12-14 16:11       ` Wang, Wei W
  0 siblings, 1 reply; 47+ messages in thread
From: Thomas Gleixner @ 2021-12-14 15:40 UTC (permalink / raw)
  To: Wang, Wei W, LKML
  Cc: Jing Liu, Zhong, Yang, Paolo Bonzini, x86, kvm,
	Sean Christoperson, Nakajima, Jun, Tian, Kevin

On Tue, Dec 14 2021 at 15:09, Wei W. Wang wrote:
> On Tuesday, December 14, 2021 10:50 AM, Thomas Gleixner wrote:
>> + * Return: 0 on success, error code otherwise
>> + */
>> +int __fpu_update_guest_features(struct fpu_guest *guest_fpu, u64 xcr0, u64 xfd)
>> +{
>
> I think there would be one issue for the "host write on restore" case.
> The current QEMU based host restore uses the following sequence:
> 1) restore xsave
> 2) restore xcr0
> 3) restore XFD MSR

This needs to be fixed. Ordering clearly needs to be:

  XFD, XCR0, XSTATE

> At the time of "1) restore xsave", KVM already needs fpstate expansion
> before restoring the xsave data.

> So the 2 APIs here might not be usable for this usage.
> Our current solution to fpstate expansion at KVM_SET_XSAVE (i.e. step 1) above) is:
>
> kvm_load_guest_fpu(vcpu);
> guest_fpu->realloc_request = realloc_request;
> kvm_put_guest_fpu(vcpu);
>
> "realloc_request" above is generated from the "xstate_header" received from userspace.

That's a horrible hack. Please fix the ordering in QEMU. Trying to
accommodate nonsensical use cases in the kernel is just wrong.

That's like expecting the following to work:

       u8 *p = mmap(NULL, 4096, ....);

       p[8192] = x;

It rightfully explodes in your face and you can keep the pieces.

Having ordering constraints vs. these 3 involved parts is just sensible.

Thanks,

        tglx


* RE: [patch 5/6] x86/fpu: Provide fpu_update_guest_xcr0/xfd()
  2021-12-14 15:40     ` Thomas Gleixner
@ 2021-12-14 16:11       ` Wang, Wei W
  2021-12-14 18:04         ` Thomas Gleixner
  0 siblings, 1 reply; 47+ messages in thread
From: Wang, Wei W @ 2021-12-14 16:11 UTC (permalink / raw)
  To: Thomas Gleixner, LKML, Dr. David Alan Gilbert, Juan Quintela
  Cc: Jing Liu, Zhong, Yang, Paolo Bonzini, x86, kvm,
	Sean Christoperson, Nakajima, Jun, Tian, Kevin

On Tuesday, December 14, 2021 11:40 PM, Thomas Gleixner wrote:
> On Tue, Dec 14 2021 at 15:09, Wei W. Wang wrote:
> > On Tuesday, December 14, 2021 10:50 AM, Thomas Gleixner wrote:
> >> + * Return: 0 on success, error code otherwise
> >> + */
> >> +int __fpu_update_guest_features(struct fpu_guest *guest_fpu, u64 xcr0, u64 xfd)
> >> +{
> >
> > I think there would be one issue for the "host write on restore" case.
> > The current QEMU based host restore uses the following sequence:
> > 1) restore xsave
> > 2) restore xcr0
> > 3) restore XFD MSR
> 
> This needs to be fixed. Ordering clearly needs to be:
> 
>   XFD, XCR0, XSTATE

Sorry, just to clarify: the ordering in QEMU wasn't introduced by us for this specific XFD enabling.
It has been there for a long time for the general restore of all the XCRs and MSRs.
(if you are interested..FYI: https://github.com/qemu/qemu/blob/master/target/i386/kvm/kvm.c#L4168).
- kvm_put_xsave()
- kvm_put_xcrs()
- kvm_put_msrs()

We need to check with the QEMU migration maintainer (Dave and Juan CC-ed)
if changing that ordering would be OK.
(In general, I think there are no hard rules documented for this ordering)

Thanks,
Wei



* RE: [patch 5/6] x86/fpu: Provide fpu_update_guest_xcr0/xfd()
  2021-12-14 16:11       ` Wang, Wei W
@ 2021-12-14 18:04         ` Thomas Gleixner
  2021-12-14 19:07           ` Juan Quintela
  0 siblings, 1 reply; 47+ messages in thread
From: Thomas Gleixner @ 2021-12-14 18:04 UTC (permalink / raw)
  To: Wang, Wei W, LKML, Dr. David Alan Gilbert, Juan Quintela
  Cc: Jing Liu, Zhong, Yang, Paolo Bonzini, x86, kvm,
	Sean Christoperson, Nakajima, Jun, Tian, Kevin

Wei,

On Tue, Dec 14 2021 at 16:11, Wei W. Wang wrote:
> On Tuesday, December 14, 2021 11:40 PM, Thomas Gleixner wrote:
>> On Tue, Dec 14 2021 at 15:09, Wei W. Wang wrote:
>> > On Tuesday, December 14, 2021 10:50 AM, Thomas Gleixner wrote:
>> >> + * Return: 0 on success, error code otherwise
>> >> + */
>> >> +int __fpu_update_guest_features(struct fpu_guest *guest_fpu, u64 xcr0, u64 xfd)
>> >> +{
>> >
>> > I think there would be one issue for the "host write on restore" case.
>> > The current QEMU based host restore uses the following sequence:
>> > 1) restore xsave
>> > 2) restore xcr0
>> > 3) restore XFD MSR
>> 
>> This needs to be fixed. Ordering clearly needs to be:
>> 
>>   XFD, XCR0, XSTATE
>
> Sorry, just to clarify that the ordering in QEMU isn't made by us for this specific XFD enabling.
> It has been there for long time for the general restoring of all the XCRs and MSRs.
> (if you are interested..FYI: https://github.com/qemu/qemu/blob/master/target/i386/kvm/kvm.c#L4168).
> - kvm_put_xsave()
> - kvm_put_xcrs()
> - kvm_put_msrs()
>
> We need to check with the QEMU migration maintainer (Dave and Juan CC-ed)
> if changing that ordering would be OK.
> (In general, I think there are no hard rules documented for this ordering)

There haven't been ordering requirements so far, but with dynamic
feature enablement there are.

I really want to avoid going to the point of deducing it from the
xstate:xfeatures bitmap, which is just backwards when QEMU has all the
required information already.

Thanks,

        tglx






* Re: [patch 5/6] x86/fpu: Provide fpu_update_guest_xcr0/xfd()
  2021-12-14 18:04         ` Thomas Gleixner
@ 2021-12-14 19:07           ` Juan Quintela
  2021-12-14 20:28             ` Thomas Gleixner
  0 siblings, 1 reply; 47+ messages in thread
From: Juan Quintela @ 2021-12-14 19:07 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Wang, Wei W, LKML, Dr. David Alan Gilbert, Jing Liu, Zhong, Yang,
	Paolo Bonzini, x86, kvm, Sean Christoperson, Nakajima, Jun, Tian,
	Kevin

Thomas Gleixner <tglx@linutronix.de> wrote:
> Wei,
>
> On Tue, Dec 14 2021 at 16:11, Wei W. Wang wrote:
>> On Tuesday, December 14, 2021 11:40 PM, Thomas Gleixner wrote:
>>> On Tue, Dec 14 2021 at 15:09, Wei W. Wang wrote:
>>> > On Tuesday, December 14, 2021 10:50 AM, Thomas Gleixner wrote:
>>> >> + * Return: 0 on success, error code otherwise
>>> >> + */
>>> >> +int __fpu_update_guest_features(struct fpu_guest *guest_fpu, u64 xcr0, u64 xfd)
>>> >> +{
>>> >
>>> > I think there would be one issue for the "host write on restore" case.
>>> > The current QEMU based host restore uses the following sequence:
>>> > 1) restore xsave
>>> > 2) restore xcr0
>>> > 3) restore XFD MSR
>>> 
>>> This needs to be fixed. Ordering clearly needs to be:
>>> 
>>>   XFD, XCR0, XSTATE
>>
>> Sorry, just to clarify that the ordering in QEMU isn't made by us
>> for this specific XFD enabling.
>> It has been there for long time for the general restoring of all the
>> XCRs and MSRs.
>> (if you are interested..FYI:
>> https://github.com/qemu/qemu/blob/master/target/i386/kvm/kvm.c#L4168).
>> - kvm_put_xsave()
>> - kvm_put_xcrs()
>> - kvm_put_msrs()
>>
>> We need to check with the QEMU migration maintainer (Dave and Juan CC-ed)
>> if changing that ordering would be OK.
>> (In general, I think there are no hard rules documented for this ordering)
>
> There haven't been ordering requirements so far, but with dynamic
> feature enablement there are.
>
> I really want to avoid going to the point to deduce it from the
> xstate:xfeatures bitmap, which is just backwards and Qemu has all the
> required information already.

Hi

First of all, I claim ZERO knowledge about low level x86_64.

Once told that, this doesn't matter for qemu migration; the code is at

target/i386/kvm/kvm.c:kvm_arch_put_registers()


    ret = kvm_put_xsave(x86_cpu);
    if (ret < 0) {
        return ret;
    }
    ret = kvm_put_xcrs(x86_cpu);
    if (ret < 0) {
        return ret;
    }
    /* must be before kvm_put_msrs */
    ret = kvm_inject_mce_oldstyle(x86_cpu);
    if (ret < 0) {
        return ret;
    }
    ret = kvm_put_msrs(x86_cpu, level);
    if (ret < 0) {
        return ret;
    }

If it needs to be done in any other order, it is completely independent
of whatever is inside the migration stream.

I guess that Paolo will put some light here.

Later, Juan.



* Re: [patch 5/6] x86/fpu: Provide fpu_update_guest_xcr0/xfd()
  2021-12-14 19:07           ` Juan Quintela
@ 2021-12-14 20:28             ` Thomas Gleixner
  2021-12-14 21:35               ` Juan Quintela
  0 siblings, 1 reply; 47+ messages in thread
From: Thomas Gleixner @ 2021-12-14 20:28 UTC (permalink / raw)
  To: quintela
  Cc: Wang, Wei W, LKML, Dr. David Alan Gilbert, Jing Liu, Zhong, Yang,
	Paolo Bonzini, x86, kvm, Sean Christoperson, Nakajima, Jun, Tian,
	Kevin

Juan,

On Tue, Dec 14 2021 at 20:07, Juan Quintela wrote:
> Thomas Gleixner <tglx@linutronix.de> wrote:
>> On Tue, Dec 14 2021 at 16:11, Wei W. Wang wrote:
>>> We need to check with the QEMU migration maintainer (Dave and Juan CC-ed)
>>> if changing that ordering would be OK.
>>> (In general, I think there are no hard rules documented for this ordering)
>>
>> There haven't been ordering requirements so far, but with dynamic
>> feature enablement there are.
>>
>> I really want to avoid going to the point to deduce it from the
>> xstate:xfeatures bitmap, which is just backwards and Qemu has all the
>> required information already.
>
> First of all, I claim ZERO knowledge about low level x86_64.

Lucky you.

> Once told that, this don't matter for qemu migration, code is at

Once, that was at the time where rubber boots were still made of wood,
right? :)

> target/i386/kvm/kvm.c:kvm_arch_put_registers()
>
>
>     ret = kvm_put_xsave(x86_cpu);
>     if (ret < 0) {
>         return ret;
>     }
>     ret = kvm_put_xcrs(x86_cpu);
>     if (ret < 0) {
>         return ret;
>     }
>     /* must be before kvm_put_msrs */
>     ret = kvm_inject_mce_oldstyle(x86_cpu);

So this has already ordering requirements.

>     if (ret < 0) {
>         return ret;
>     }
>     ret = kvm_put_msrs(x86_cpu, level);
>     if (ret < 0) {
>         return ret;
>     }
>
> If it needs to be done in any other order, it is completely independent
> of whatever is inside the migration stream.

From the migration data perspective that's correct, but I have the
nagging feeling that this is not that simple.

> I guess that Paolo will put some light here.

I fear shining light on that will unearth quite a few skeletons :)

Thanks,

        tglx


* Re: [patch 5/6] x86/fpu: Provide fpu_update_guest_xcr0/xfd()
  2021-12-14 20:28             ` Thomas Gleixner
@ 2021-12-14 21:35               ` Juan Quintela
  2021-12-15  2:17                 ` Wang, Wei W
  0 siblings, 1 reply; 47+ messages in thread
From: Juan Quintela @ 2021-12-14 21:35 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Wang, Wei W, LKML, Dr. David Alan Gilbert, Jing Liu, Zhong, Yang,
	Paolo Bonzini, x86, kvm, Sean Christoperson, Nakajima, Jun, Tian,
	Kevin

Thomas Gleixner <tglx@linutronix.de> wrote:

Hi Thomas

> On Tue, Dec 14 2021 at 20:07, Juan Quintela wrote:
>> Thomas Gleixner <tglx@linutronix.de> wrote:
>>> On Tue, Dec 14 2021 at 16:11, Wei W. Wang wrote:
>>>> We need to check with the QEMU migration maintainer (Dave and Juan CC-ed)
>>>> if changing that ordering would be OK.
>>>> (In general, I think there are no hard rules documented for this ordering)
>>>
>>> There haven't been ordering requirements so far, but with dynamic
>>> feature enablement there are.
>>>
>>> I really want to avoid going to the point to deduce it from the
>>> xstate:xfeatures bitmap, which is just backwards and Qemu has all the
>>> required information already.
>>
>> First of all, I claim ZERO knowledge about low level x86_64.
>
> Lucky you.

Well, that is true until I have to debug some bug, at that time I miss
the knowledge O:-)

>> Once told that, this don't matter for qemu migration, code is at
>
> Once, that was at the time where rubber boots were still made of wood,
> right? :)

I forgot to add: "famous last words".

>> target/i386/kvm/kvm.c:kvm_arch_put_registers()
>>
>>
>>     ret = kvm_put_xsave(x86_cpu);
>>     if (ret < 0) {
>>         return ret;
>>     }
>>     ret = kvm_put_xcrs(x86_cpu);
>>     if (ret < 0) {
>>         return ret;
>>     }
>>     /* must be before kvm_put_msrs */
>>     ret = kvm_inject_mce_oldstyle(x86_cpu);
>
> So this has already ordering requirements.
>
>>     if (ret < 0) {
>>         return ret;
>>     }
>>     ret = kvm_put_msrs(x86_cpu, level);
>>     if (ret < 0) {
>>         return ret;
>>     }
>>
>> If it needs to be done in any other order, it is completely independent
>> of whatever is inside the migration stream.
>
> From the migration data perspective that's correct, but I have the
> nagging feeling that this in not that simple.

Oh, I was not meaning that it was simple at all.

We have backward compatibility baggage on x86_64 that is grotesque at
this point.  We don't send a single msr (by that name) on the main
migration stream.  And then we send them based on "conditions".  So the
trick is something like:

- qemu reads the msrs from kvm at some order
- it stores them in a list of MSR's
- On migration pre_save we "mangle" those msr's and other cpu state
  into the main (decades old) state
- then we send the main state
- then we send conditionally the variable state

on reception side:

- we receive everything that the sending side have sent
- we "demangle" it on pre_load
- then we create the list of MSR's that need to be transferred
- at that point we send them to kvm in another random order

So yes, I fully agree that it is not _that_ simple O:-)


>> I guess that Paolo will put some light here.
>
> I fear shining light on that will unearth quite a few skeletons :)

It is quite probable.

When a bugzilla starts with "We found a bug while we were trying to
migrate during (BIOS) boot", I just run for the hills O:-)

Later, Juan.



* RE: [patch 5/6] x86/fpu: Provide fpu_update_guest_xcr0/xfd()
  2021-12-14 21:35               ` Juan Quintela
@ 2021-12-15  2:17                 ` Wang, Wei W
  2021-12-15 10:09                   ` Thomas Gleixner
  0 siblings, 1 reply; 47+ messages in thread
From: Wang, Wei W @ 2021-12-15  2:17 UTC (permalink / raw)
  To: quintela, Thomas Gleixner
  Cc: LKML, Dr. David Alan Gilbert, Jing Liu, Zhong, Yang,
	Paolo Bonzini, x86, kvm, Sean Christoperson, Nakajima, Jun, Tian,
	Kevin, Zeng, Guang

Hi Thomas,

On Wednesday, December 15, 2021 5:36 AM, Juan Quintela wrote:
> 
> Thomas Gleixner <tglx@linutronix.de> wrote:
> 
> Hi Thomas
> 
> > On Tue, Dec 14 2021 at 20:07, Juan Quintela wrote:
> >> Thomas Gleixner <tglx@linutronix.de> wrote:
> >>> On Tue, Dec 14 2021 at 16:11, Wei W. Wang wrote:
> >>>> We need to check with the QEMU migration maintainer (Dave and Juan
> >>>> CC-ed) if changing that ordering would be OK.
> >>>> (In general, I think there are no hard rules documented for this
> >>>> ordering)
> >>>
> >>> There haven't been ordering requirements so far, but with dynamic
> >>> feature enablement there are.
> >>>
> >>> I really want to avoid going to the point to deduce it from the
> >>> xstate:xfeatures bitmap, which is just backwards and Qemu has all
> >>> the required information already.
> >>
> >> First of all, I claim ZERO knowledge about low level x86_64.
> >
> > Lucky you.
> 
> Well, that is true until I have to debug some bug, at that time I miss the
> knowledge O:-)
> 
> >> Once told that, this don't matter for qemu migration, code is at
> >
> > Once, that was at the time where rubber boots were still made of wood,
> > right? :)
> 
> I forgot to add: "famous last words".
> 
> >> target/i386/kvm/kvm.c:kvm_arch_put_registers()
> >>
> >>
> >>     ret = kvm_put_xsave(x86_cpu);
> >>     if (ret < 0) {
> >>         return ret;
> >>     }
> >>     ret = kvm_put_xcrs(x86_cpu);
> >>     if (ret < 0) {
> >>         return ret;
> >>     }
> >>     /* must be before kvm_put_msrs */
> >>     ret = kvm_inject_mce_oldstyle(x86_cpu);
> >
> > So this has already ordering requirements.
> >
> >>     if (ret < 0) {
> >>         return ret;
> >>     }
> >>     ret = kvm_put_msrs(x86_cpu, level);
> >>     if (ret < 0) {
> >>         return ret;
> >>     }
> >>
> >> If it needs to be done in any other order, it is completely
> >> independent of whatever is inside the migration stream.
> >
> > From the migration data perspective that's correct, but I have the
> > nagging feeling that this in not that simple.
> 
> Oh, I was not meaning that it was simple at all.

It seems to be the consensus that fixing the ordering constraint wouldn't be that easy.
Would you consider our current solution (the 3 parts shared earlier to do fpstate expansion at KVM_SET_XSAVE) acceptable as a first version?

Thanks,
Wei


* RE: [patch 4/6] x86/fpu: Add guest support to xfd_enable_feature()
  2021-12-14 13:15       ` Thomas Gleixner
@ 2021-12-15  5:46         ` Tian, Kevin
  2021-12-15  9:53           ` Thomas Gleixner
  0 siblings, 1 reply; 47+ messages in thread
From: Tian, Kevin @ 2021-12-15  5:46 UTC (permalink / raw)
  To: Thomas Gleixner, Paolo Bonzini, LKML
  Cc: Jing Liu, Zhong, Yang, x86, kvm, Christopherson,, Sean, Nakajima, Jun

> From: Thomas Gleixner <tglx@linutronix.de>
> Sent: Tuesday, December 14, 2021 9:16 PM
> 
> On Tue, Dec 14 2021 at 11:21, Paolo Bonzini wrote:
> > On 12/14/21 07:05, Tian, Kevin wrote:
> >>> +	if (guest_fpu) {
> >>> +		newfps->is_guest = true;
> >>> +		newfps->is_confidential = curfps->is_confidential;
> >>> +		newfps->in_use = curfps->in_use;
> >>> +		guest_fpu->xfeatures |= xfeatures;
> >>> +	}
> >>> +
> >> As you explained guest fpstate is not current active in the restoring
> >> path, thus it's not correct to always inherit attributes from the
> >> active one.
> 
> Good catch!
> 
> > Indeed, guest_fpu->fpstate should be used instead of curfps.
> 
> Something like the below.

Looks good. Just two nits.

> +	/*
> +	 * When a guest FPU is supplied, use @guest_fpu->fpstate
> +	 * as reference, independent of whether it is in use or not.
> +	 */
> +	curfps = guest_fpu ? guest_fpu->fpstate : fpu->fpstate;
> +
> +	/* Determine whether @curfps is the active fpstate */
> +	in_use = fpu->fpstate == curfps;
> +
> +	if (guest_fpu) {
> +		newfps->is_guest = true;
> +		newfps->is_confidential = curfps->is_confidential;
> +		newfps->in_use = curfps->in_use;

What is the purpose of this 'in_use' field? Currently it's only
touched in three places:

  - set when entering guest;
  - cleared when exiting to userspace;
  - checked when freeing a guest FPU;

The last one can be easily checked by comparing to current fps.

Any other intended usage which is currently missed for guest FPU?

> 
> +	if (guest_fpu) {
> +		curfps = xchg(&guest_fpu->fpstate, newfps);

This can be a direct value update to guest_fpu->fpstate since
curfps has already been acquired at the start.

> +		/* If curfps is active, update the FPU fpstate pointer */
> +		if (in_use)
> +			fpu->fpstate = newfps;
> +	} else {
> +		curfps = xchg(&fpu->fpstate, newfps);

ditto.

Thanks
Kevin


* RE: [patch 5/6] x86/fpu: Provide fpu_update_guest_xcr0/xfd()
  2021-12-14  2:50 ` [patch 5/6] x86/fpu: Provide fpu_update_guest_xcr0/xfd() Thomas Gleixner
  2021-12-14  6:25   ` Tian, Kevin
  2021-12-14 15:09   ` Wang, Wei W
@ 2021-12-15  6:14   ` Tian, Kevin
  2 siblings, 0 replies; 47+ messages in thread
From: Tian, Kevin @ 2021-12-15  6:14 UTC (permalink / raw)
  To: Thomas Gleixner, LKML
  Cc: Jing Liu, Zhong, Yang, Paolo Bonzini, x86, kvm, Christopherson,,
	Sean, Nakajima, Jun

> From: Tian, Kevin
> Sent: Tuesday, December 14, 2021 2:26 PM
> > +static inline int fpu_update_guest_xfd(struct fpu_guest *guest_fpu, u64
> xcr0,
> > u64 xfd)
> > +{
> > +	return __fpu_update_guest_features(guest_fpu, xcr0, xfd);
> > +}
> 
> no need to pass in xcr0. It can be fetched from vcpu->arch.xcr0.
> 

This is a silly comment. There is no vcpu to be referenced...


* Re: [patch 6/6] x86/fpu: Provide kvm_sync_guest_vmexit_xfd_state()
  2021-12-14  2:50 ` [patch 6/6] x86/fpu: Provide kvm_sync_guest_vmexit_xfd_state() Thomas Gleixner
@ 2021-12-15  6:35   ` Liu, Jing2
  2021-12-15  9:49     ` Thomas Gleixner
  0 siblings, 1 reply; 47+ messages in thread
From: Liu, Jing2 @ 2021-12-15  6:35 UTC (permalink / raw)
  To: Thomas Gleixner, LKML
  Cc: Yang Zhong, Paolo Bonzini, x86, kvm, Sean Christoperson,
	Jin Nakajima, Kevin Tian


Hi Thomas,
On 12/14/2021 10:50 AM, Thomas Gleixner wrote:
> KVM can disable the write emulation for the XFD MSR when the vCPU's fpstate
> is already correctly sized to reduce the overhead.
>
> When write emulation is disabled the XFD MSR state after a VMEXIT is
> unknown and therefore not in sync with the software states in fpstate and
> the per CPU XFD cache.
>
> Provide kvm_sync_guest_vmexit_xfd_state() which has to be invoked after a
> VMEXIT before enabling interrupts when write emulation is disabled for the
> XFD MSR.
Thanks for this function.

s/kvm_sync_guest_vmexit_xfd_state/fpu_sync_guest_vmexit_xfd_state
in subject and changelog.


Thanks,
Jing
>
> It could be invoked unconditionally even when write emulation is enabled
> for the price of a pointless MSR read.
>
> Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
> ---
>   arch/x86/include/asm/fpu/api.h |    6 ++++++
>   arch/x86/kernel/fpu/core.c     |   26 ++++++++++++++++++++++++++
>   2 files changed, 32 insertions(+)
>
> --- a/arch/x86/include/asm/fpu/api.h
> +++ b/arch/x86/include/asm/fpu/api.h
> @@ -194,6 +194,12 @@ static inline int fpu_update_guest_xfd(s
>   	return __fpu_update_guest_features(guest_fpu, xcr0, xfd);
>   }
>   
> +#ifdef CONFIG_X86_64
> +extern void fpu_sync_guest_vmexit_xfd_state(void);
> +#else
> +static inline void fpu_sync_guest_vmexit_xfd_state(void) { }
> +#endif
> +
>   extern void fpu_copy_guest_fpstate_to_uabi(struct fpu_guest *gfpu, void *buf, unsigned int size, u32 pkru);
>   extern int fpu_copy_uabi_to_guest_fpstate(struct fpu_guest *gfpu, const void *buf, u64 xcr0, u32 *vpkru);
>   
> --- a/arch/x86/kernel/fpu/core.c
> +++ b/arch/x86/kernel/fpu/core.c
> @@ -318,6 +318,32 @@ int __fpu_update_guest_features(struct f
>   }
>   EXPORT_SYMBOL_GPL(__fpu_update_guest_features);
>   
> +#ifdef CONFIG_X86_64
> +/**
> + * fpu_sync_guest_vmexit_xfd_state - Synchronize XFD MSR and software state
> + *
> + * Must be invoked from KVM after a VMEXIT before enabling interrupts when
> + * XFD write emulation is disabled. This is required because the guest can
> + * freely modify XFD and the state at VMEXIT is not guaranteed to be the
> + * same as the state on VMENTER. So software state has to be updated before
> + * any operation which depends on it can take place.
> + *
> + * Note: It can be invoked unconditionally even when write emulation is
> + * enabled for the price of a then pointless MSR read.
> + */
> +void fpu_sync_guest_vmexit_xfd_state(void)
> +{
> +	struct fpstate *fps = current->thread.fpu.fpstate;
> +
> +	lockdep_assert_irqs_disabled();
> +	if (fpu_state_size_dynamic()) {
> +		rdmsrl(MSR_IA32_XFD, fps->xfd);
> +		__this_cpu_write(xfd_state, fps->xfd);
> +	}
> +}
> +EXPORT_SYMBOL_GPL(fpu_sync_guest_vmexit_xfd_state);
> +#endif /* CONFIG_X86_64 */
> +
>   int fpu_swap_kvm_fpstate(struct fpu_guest *guest_fpu, bool enter_guest)
>   {
>   	struct fpstate *guest_fps = guest_fpu->fpstate;
>



* Re: [patch 6/6] x86/fpu: Provide kvm_sync_guest_vmexit_xfd_state()
  2021-12-15  6:35   ` Liu, Jing2
@ 2021-12-15  9:49     ` Thomas Gleixner
  0 siblings, 0 replies; 47+ messages in thread
From: Thomas Gleixner @ 2021-12-15  9:49 UTC (permalink / raw)
  To: Liu, Jing2, LKML
  Cc: Yang Zhong, Paolo Bonzini, x86, kvm, Sean Christoperson,
	Jin Nakajima, Kevin Tian

Jing,

On Wed, Dec 15 2021 at 14:35, Jing2 Liu wrote:
> On 12/14/2021 10:50 AM, Thomas Gleixner wrote:
>> KVM can disable the write emulation for the XFD MSR when the vCPU's fpstate
>> is already correctly sized to reduce the overhead.
>>
>> When write emulation is disabled the XFD MSR state after a VMEXIT is
>> unknown and therefore not in sync with the software states in fpstate and
>> the per CPU XFD cache.
>>
>> Provide kvm_sync_guest_vmexit_xfd_state() which has to be invoked after a
>> VMEXIT before enabling interrupts when write emulation is disabled for the
>> XFD MSR.
> Thanks for this function.
>
> s/kvm_sync_guest_vmexit_xfd_state/fpu_sync_guest_vmexit_xfd_state
> in subject and changelog.

I clearly need more sleep.


* RE: [patch 4/6] x86/fpu: Add guest support to xfd_enable_feature()
  2021-12-15  5:46         ` Tian, Kevin
@ 2021-12-15  9:53           ` Thomas Gleixner
  2021-12-15 10:02             ` Tian, Kevin
  0 siblings, 1 reply; 47+ messages in thread
From: Thomas Gleixner @ 2021-12-15  9:53 UTC (permalink / raw)
  To: Tian, Kevin, Paolo Bonzini, LKML
  Cc: Jing Liu, Zhong, Yang, x86, kvm, Christopherson,, Sean, Nakajima, Jun

On Wed, Dec 15 2021 at 05:46, Kevin Tian wrote:
>> From: Thomas Gleixner <tglx@linutronix.de>
>> +	if (guest_fpu) {
>> +		newfps->is_guest = true;
>> +		newfps->is_confidential = curfps->is_confidential;
>> +		newfps->in_use = curfps->in_use;
>
> What is the purpose of this 'in_use' field? Currently it's only
> touched in three places:
>
>   - set when entering guest;
>   - cleared when exiting to userspace;
>   - checked when freeing a guest FPU;
>
> The last one can be easily checked by comparing to current fps.

I added it for paranoia's sake because the destruction of the KVM FPU
state is not necessarily in the context of the vCPU thread. Yes, it
should not happen...

>> +	if (guest_fpu) {
>> +		curfps = xchg(&guest_fpu->fpstate, newfps);
>
> This can be a direct value update to guest_fpu->fpstate since
> curfps has already been acquired at the start.

Indeed.

Thanks,

        tglx


* RE: [patch 4/6] x86/fpu: Add guest support to xfd_enable_feature()
  2021-12-15  9:53           ` Thomas Gleixner
@ 2021-12-15 10:02             ` Tian, Kevin
  0 siblings, 0 replies; 47+ messages in thread
From: Tian, Kevin @ 2021-12-15 10:02 UTC (permalink / raw)
  To: Thomas Gleixner, Paolo Bonzini, LKML
  Cc: Jing Liu, Zhong, Yang, x86, kvm, Christopherson, Sean, Nakajima, Jun

> From: Thomas Gleixner <tglx@linutronix.de>
> Sent: Wednesday, December 15, 2021 5:53 PM
> 
> On Wed, Dec 15 2021 at 05:46, Kevin Tian wrote:
> >> From: Thomas Gleixner <tglx@linutronix.de>
> >> +	if (guest_fpu) {
> >> +		newfps->is_guest = true;
> >> +		newfps->is_confidential = curfps->is_confidential;
> >> +		newfps->in_use = curfps->in_use;
> >
> > What is the purpose of this 'in_use' field? Currently it's only
> > touched in three places:
> >
> >   - set when entering guest;
> >   - cleared when exiting to userspace;
> >   - checked when freeing a guest FPU;
> >
> > The last one can be easily checked by comparing to current fps.
> 
> I added it for paranoia's sake because the destruction of the KVM FPU
> state is not necessarily in the context of the vCPU thread. Yes, it
> should not happen...
> 
> >> +	if (guest_fpu) {
> >> +		curfps = xchg(&guest_fpu->fpstate, newfps);
> >
> > This can be a direct value update to guest_fpu->fpstate since
> > curfps has already been acquired at the start.
> 
> Indeed.
> 

Thanks for confirmation. We'll include those changes in next version.

Thanks
Kevin


* RE: [patch 5/6] x86/fpu: Provide fpu_update_guest_xcr0/xfd()
  2021-12-15  2:17                 ` Wang, Wei W
@ 2021-12-15 10:09                   ` Thomas Gleixner
  2021-12-15 10:27                     ` Paolo Bonzini
  0 siblings, 1 reply; 47+ messages in thread
From: Thomas Gleixner @ 2021-12-15 10:09 UTC (permalink / raw)
  To: Wang, Wei W, quintela
  Cc: LKML, Dr. David Alan Gilbert, Jing Liu, Zhong, Yang,
	Paolo Bonzini, x86, kvm, Sean Christopherson, Nakajima, Jun, Tian,
	Kevin, Zeng, Guang

On Wed, Dec 15 2021 at 02:17, Wei W. Wang wrote:
> On Wednesday, December 15, 2021 5:36 AM, Juan Quintela wrote:
>> >> If it needs to be done in any other order, it is completely
>> >> independent of whatever is inside the migration stream.
>> >
>> > From the migration data perspective that's correct, but I have the
>> > nagging feeling that this in not that simple.
>> 
>> Oh, I was not meaning that it was simple at all.
>
> It seems to be a consensus that the ordering constraint wouldn't be
> that easy.

Right, but what is easy in this context? Not easy does not mean it is
impossible.

> Would you think that our current solution (the 3 parts shared earlier
> to do fpstate expansion at KVM_SET_XSAVE) is acceptable as the 1st
> version?

This is really the wrong question in the context of a user-space ABI.

The point is that if we go and add that hack, then the hack has to be
supported forever. So there is no "as the 1st version".

I'm not at all a fan of this kind of approach because it puts the burden
at the wrong end and carefully avoids sitting down and really thinking
it through.

That way we just pile hacks on hacks forever up to the point where the
hacks end up in a circular dependency which becomes unresolvable.

That's not a KVM/Qemu specific problem. That's a problem in general and
we've been bitten by that again and again.

The worst about this is that those people who try to sell their quick
and dirty hack in the first place are not those who end up dealing with
the consequences some time down the road.

Let's assume the restore order is XSTATE, XCR0, XFD:

     XSTATE has everything in init state, which means the default
     buffer is good enough

     XCR0 has everything enabled including AMX, so the buffer is
     expanded

     XFD has AMX disable set, which means the buffer expansion was
     pointless

If we go there, then we can just use a full expanded buffer for KVM
unconditionally and be done with it. That spares a lot of code.

Thanks,

        tglx


* Re: [patch 5/6] x86/fpu: Provide fpu_update_guest_xcr0/xfd()
  2021-12-15 10:09                   ` Thomas Gleixner
@ 2021-12-15 10:27                     ` Paolo Bonzini
  2021-12-15 10:41                       ` Paolo Bonzini
  2021-12-16  1:04                       ` Tian, Kevin
  0 siblings, 2 replies; 47+ messages in thread
From: Paolo Bonzini @ 2021-12-15 10:27 UTC (permalink / raw)
  To: Thomas Gleixner, Wang, Wei W, quintela
  Cc: LKML, Dr. David Alan Gilbert, Jing Liu, Zhong, Yang, x86, kvm,
	Sean Christopherson, Nakajima, Jun, Tian, Kevin, Zeng, Guang

On 12/15/21 11:09, Thomas Gleixner wrote:
> Let's assume the restore order is XSTATE, XCR0, XFD:
> 
>       XSTATE has everything in init state, which means the default
>       buffer is good enough
> 
>       XCR0 has everything enabled including AMX, so the buffer is
>       expanded
> 
>       XFD has AMX disable set, which means the buffer expansion was
>       pointless
> 
> If we go there, then we can just use a full expanded buffer for KVM
> unconditionally and be done with it. That spares a lot of code.

If we decide to use a full expanded buffer as soon as KVM_SET_CPUID2 is 
done, that would work for me.  Basically KVM_SET_CPUID2 would:

- check bits from CPUID[0xD] against the prctl requested with GUEST_PERM

- return with -ENXIO or whatever if any dynamic bits were not requested

- otherwise call fpstate_realloc if there are any dynamic bits requested
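In toy form, the permission check could look like this (a sketch only: the helper name and the use of the XTILE bits as the dynamic mask are assumptions, not actual KVM code):

```c
#include <errno.h>
#include <stdint.h>

/* Assumed mask of dynamically-enabled xfeatures: XTILECFG (bit 17)
 * and XTILEDATA (bit 18) for AMX. */
#define XFEATURE_MASK_XTILE	(3ULL << 17)

/*
 * Sketch of the KVM_SET_CPUID2 check: any dynamic xfeature that the
 * guest CPUID[0xD] exposes must have been requested via prctl() first.
 * Returns 0 if acceptable (the caller may then expand the fpstate
 * buffer), -ENXIO if a dynamic bit was enumerated without permission.
 */
static int check_guest_cpuid_xd(uint64_t cpuid_xd_xfeatures, uint64_t guest_perm)
{
	uint64_t dynamic = cpuid_xd_xfeatures & XFEATURE_MASK_XTILE;

	if (dynamic & ~guest_perm)
		return -ENXIO;
	return 0;
}
```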

Considering that in practice all Linux guests with AMX would have XFD 
passthrough (because if there's no prctl, Linux keeps AMX disabled in 
XFD), this removes the need to do all the #NM handling too.  Just make 
XFD passthrough if it can ever be set to a nonzero value.  This costs an 
RDMSR per vmexit even if neither the host nor the guest ever use AMX.

That said, if we don't want to use a full expanded buffer, I don't 
expect any issue with requiring XFD first then XCR0 then XSAVE.  As Juan 
said, QEMU first gets everything from the migration stream and then 
restores it.  So yes, the QEMU code is complicated and messy but we can 
change the order without breaking migration from old to new QEMU.  QEMU 
also forbids migration if there's any CPUID feature that it does not 
understand, so old QEMU versions that don't understand AMX won't migrate
it (with no possibility to override).

Paolo


* Re: [patch 5/6] x86/fpu: Provide fpu_update_guest_xcr0/xfd()
  2021-12-15 10:27                     ` Paolo Bonzini
@ 2021-12-15 10:41                       ` Paolo Bonzini
  2021-12-16  1:00                         ` Tian, Kevin
                                           ` (3 more replies)
  2021-12-16  1:04                       ` Tian, Kevin
  1 sibling, 4 replies; 47+ messages in thread
From: Paolo Bonzini @ 2021-12-15 10:41 UTC (permalink / raw)
  To: Thomas Gleixner, Wang, Wei W, quintela
  Cc: LKML, Dr. David Alan Gilbert, Jing Liu, Zhong, Yang, x86, kvm,
	Sean Christopherson, Nakajima, Jun, Tian, Kevin, Zeng, Guang

On 12/15/21 11:27, Paolo Bonzini wrote:
> On 12/15/21 11:09, Thomas Gleixner wrote:
>> Let's assume the restore order is XSTATE, XCR0, XFD:
>>
>>       XSTATE has everything in init state, which means the default
>>       buffer is good enough
>>
>>       XCR0 has everything enabled including AMX, so the buffer is
>>       expanded
>>
>>       XFD has AMX disable set, which means the buffer expansion was
>>       pointless
>>
>> If we go there, then we can just use a full expanded buffer for KVM
>> unconditionally and be done with it. That spares a lot of code.
> 
> If we decide to use a full expanded buffer as soon as KVM_SET_CPUID2 is 
> done, that would work for me. 

Off-list, Thomas mentioned doing it even at vCPU creation as long as the 
prctl has been called.  That is also okay and even simpler.

There's also another important thing that hasn't been mentioned so far: 
KVM_GET_SUPPORTED_CPUID should _not_ include the dynamic bits in 
CPUID[0xD] if they have not been requested with prctl.  It's okay to 
return the AMX bit, but not the bit in CPUID[0xD].

Paolo


* RE: [patch 5/6] x86/fpu: Provide fpu_update_guest_xcr0/xfd()
  2021-12-15 10:41                       ` Paolo Bonzini
@ 2021-12-16  1:00                         ` Tian, Kevin
  2021-12-16  5:36                         ` Tian, Kevin
                                           ` (2 subsequent siblings)
  3 siblings, 0 replies; 47+ messages in thread
From: Tian, Kevin @ 2021-12-16  1:00 UTC (permalink / raw)
  To: Paolo Bonzini, Thomas Gleixner, Wang, Wei W, quintela
  Cc: LKML, Dr. David Alan Gilbert, Jing Liu, Zhong, Yang, x86, kvm,
	Christopherson, Sean, Nakajima, Jun, Zeng, Guang

> From: Paolo Bonzini
> Sent: Wednesday, December 15, 2021 6:41 PM
> 
> On 12/15/21 11:27, Paolo Bonzini wrote:
> > On 12/15/21 11:09, Thomas Gleixner wrote:
> >> Let's assume the restore order is XSTATE, XCR0, XFD:
> >>
> >>       XSTATE has everything in init state, which means the default
> >>       buffer is good enough
> >>
> >>       XCR0 has everything enabled including AMX, so the buffer is
> >>       expanded
> >>
> >>       XFD has AMX disable set, which means the buffer expansion was
> >>       pointless
> >>
> >> If we go there, then we can just use a full expanded buffer for KVM
> >> unconditionally and be done with it. That spares a lot of code.
> >
> > If we decide to use a full expanded buffer as soon as KVM_SET_CPUID2 is
> > done, that would work for me.
> 
> Off-list, Thomas mentioned doing it even at vCPU creation as long as the
> prctl has been called.  That is also okay and even simpler.

Makes sense. It also avoids raising #GP in the emulation path just due to
a reallocation error.

We'll follow this direction to do a quick update/test. 

> 
> There's also another important thing that hasn't been mentioned so far:
> KVM_GET_SUPPORTED_CPUID should _not_ include the dynamic bits in
> CPUID[0xD] if they have not been requested with prctl.  It's okay to
> return the AMX bit, but not the bit in CPUID[0xD].
> 

will do.

Thanks
Kevin


* RE: [patch 5/6] x86/fpu: Provide fpu_update_guest_xcr0/xfd()
  2021-12-15 10:27                     ` Paolo Bonzini
  2021-12-15 10:41                       ` Paolo Bonzini
@ 2021-12-16  1:04                       ` Tian, Kevin
  2021-12-16  9:34                         ` Thomas Gleixner
  1 sibling, 1 reply; 47+ messages in thread
From: Tian, Kevin @ 2021-12-16  1:04 UTC (permalink / raw)
  To: Paolo Bonzini, Thomas Gleixner, Wang, Wei W, quintela
  Cc: LKML, Dr. David Alan Gilbert, Jing Liu, Zhong, Yang, x86, kvm,
	Christopherson, Sean, Nakajima, Jun, Zeng, Guang

> From: Paolo Bonzini <paolo.bonzini@gmail.com> On Behalf Of Paolo Bonzini
> Sent: Wednesday, December 15, 2021 6:28 PM
> 
> On 12/15/21 11:09, Thomas Gleixner wrote:
> > Let's assume the restore order is XSTATE, XCR0, XFD:
> >
> >       XSTATE has everything in init state, which means the default
> >       buffer is good enough
> >
> >       XCR0 has everything enabled including AMX, so the buffer is
> >       expanded
> >
> >       XFD has AMX disable set, which means the buffer expansion was
> >       pointless
> >
> > If we go there, then we can just use a full expanded buffer for KVM
> > unconditionally and be done with it. That spares a lot of code.
> 
> If we decide to use a full expanded buffer as soon as KVM_SET_CPUID2 is
> done, that would work for me.  Basically KVM_SET_CPUID2 would:
> 
> - check bits from CPUID[0xD] against the prctl requested with GUEST_PERM
> 
> - return with -ENXIO or whatever if any dynamic bits were not requested
> 
> - otherwise call fpstate_realloc if there are any dynamic bits requested
> 
> Considering that in practice all Linux guests with AMX would have XFD
> passthrough (because if there's no prctl, Linux keeps AMX disabled in
> XFD), this removes the need to do all the #NM handling too.  Just make

#NM trap is for XFD_ERR thus still required.

> XFD passthrough if it can ever be set to a nonzero value.  This costs an
> RDMSR per vmexit even if neither the host nor the guest ever use AMX.

Well, we can still trap WRMSR(XFD) at the start and then disable interception
after the 1st trap.

Thanks
Kevin


* RE: [patch 5/6] x86/fpu: Provide fpu_update_guest_xcr0/xfd()
  2021-12-15 10:41                       ` Paolo Bonzini
  2021-12-16  1:00                         ` Tian, Kevin
@ 2021-12-16  5:36                         ` Tian, Kevin
  2021-12-16 21:07                           ` Paolo Bonzini
  2021-12-16 10:21                         ` Tian, Kevin
  2021-12-16 13:00                         ` Tian, Kevin
  3 siblings, 1 reply; 47+ messages in thread
From: Tian, Kevin @ 2021-12-16  5:36 UTC (permalink / raw)
  To: Paolo Bonzini, Thomas Gleixner, Wang, Wei W, quintela
  Cc: LKML, Dr. David Alan Gilbert, Jing Liu, Zhong, Yang, x86, kvm,
	Christopherson, Sean, Nakajima, Jun, Zeng, Guang

> From: Tian, Kevin
> Sent: Thursday, December 16, 2021 9:00 AM
> >
> > Off-list, Thomas mentioned doing it even at vCPU creation as long as the
> > prctl has been called.  That is also okay and even simpler.
> 
> Makes sense. It also avoids raising #GP in the emulation path just due to
> a reallocation error.
> 
> We'll follow this direction to do a quick update/test.
> 

After some study there are three open questions which we'd like to sync on here. Once
they are closed we'll send out a new version very soon (hopefully tomorrow).

1) Have a full expanded buffer at vCPU creation

There are two options.

One is to directly allocate a big-enough buffer based on guest_fpu::perm in
fpu_alloc_guest_fpstate(). There is no reallocation per se, thus most changes
in this series are not required.

The other is to keep the reallocation concept (thus all previous patches are
kept) and still call a wrapper around __xfd_enable_feature() even at vCPU
creation (e.g. right after fpu_init_guest_permissions()). This matches the
fpu core assumption that fpstate for xfd features is dynamically allocated,
though the actual calling point may not be dynamic. This also allows us
to exploit doing expansion in KVM_SET_CPUID2 (see next).

2) Do expansion at vCPU creation or KVM_SET_CPUID2?

If the reallocation concept is still kept, then we feel doing expansion in
KVM_SET_CPUID2 makes slightly more sense. There is no functional
difference between the two options since the guest is not running at this
point. And in general Qemu should set the prctl according to the cpuid bits.
But since we still need to check guest cpuid against guest perm in
KVM_SET_CPUID2 anyway, it reads more clearly to expand the buffer only
after this check has passed.

If this approach is agreed, then we may replace the helper functions in
this patch with a new one:

/*
 * fpu_update_guest_perm_features - Enable xfeatures according to guest perm
 * @guest_fpu:		Pointer to the guest FPU container
 *
 * Enable all dynamic xfeatures according to guest perm. Invoked if the
 * caller wants to conservatively expand fpstate buffer instead of waiting
 * until a given feature is accessed.
 *
 * Return: 0 on success, error code otherwise
 */
+int fpu_update_guest_perm_features(struct fpu_guest *guest_fpu)
+{
+	u64 expand;
+
+	lockdep_assert_preemption_enabled();
+
+	if (!IS_ENABLED(CONFIG_X86_64))
+		return 0;
+
+	expand = guest_fpu->perm & ~guest_fpu->xfeatures;
+	if (!expand)
+		return 0;
+
+	return __xfd_enable_feature(expand, guest_fpu);
+}
+EXPORT_SYMBOL_GPL(fpu_update_guest_perm_features);

3) Always disable interception, or disable only after the 1st interception?

Once we choose to have a full expanded buffer before guest runs, the
point of intercepting WRMSR(IA32_XFD) becomes less obvious since
no dynamic reallocation is required.

One option is to always disable WRMSR interception once 
KVM_SET_CPUID2 succeeds, with the cost of one RDMSR per vm-exit. 
But doing so affects legacy OSes which have no XFD logic at all.

The other option is to continue the current policy i.e. disable write 
emulation only after the 1st interception of setting XFD to a non-zero 
value. Then the RDMSR cost is added only for guest which supports XFD.

In either case we need another helper to update guest_fpu->fpstate->xfd
as required in the restore path. For the 2nd option we further want
to update MSR if guest_fpu is currently in use:

+void xfd_update_guest_xfd(struct fpu_guest *guest_fpu, u64 xfd)
+{
+	fpregs_lock();
+	guest_fpu->fpstate->xfd = xfd;
+	if (guest_fpu->fpstate->in_use)
+		xfd_update_state(guest_fpu->fpstate);
+	fpregs_unlock();
+}

Thoughts?
--
p.s. currently we're working on a quick prototype based on:
  - Expand buffer in KVM_SET_CPUID2
  - Disable write emulation after first interception
to check for any oversights.

Thanks
Kevin


* RE: [patch 5/6] x86/fpu: Provide fpu_update_guest_xcr0/xfd()
  2021-12-16  1:04                       ` Tian, Kevin
@ 2021-12-16  9:34                         ` Thomas Gleixner
  2021-12-16  9:59                           ` Tian, Kevin
  0 siblings, 1 reply; 47+ messages in thread
From: Thomas Gleixner @ 2021-12-16  9:34 UTC (permalink / raw)
  To: Tian, Kevin, Paolo Bonzini, Wang, Wei W, quintela
  Cc: LKML, Dr. David Alan Gilbert, Jing Liu, Zhong, Yang, x86, kvm,
	Christopherson, Sean, Nakajima, Jun, Zeng, Guang

On Thu, Dec 16 2021 at 01:04, Kevin Tian wrote:
>> From: Paolo Bonzini <paolo.bonzini@gmail.com> On Behalf Of Paolo Bonzini
>> Considering that in practice all Linux guests with AMX would have XFD
>> passthrough (because if there's no prctl, Linux keeps AMX disabled in
>> XFD), this removes the need to do all the #NM handling too.  Just make
>
> #NM trap is for XFD_ERR thus still required.
>
>> XFD passthrough if it can ever be set to a nonzero value.  This costs an
>> RDMSR per vmexit even if neither the host nor the guest ever use AMX.
>
> Well, we can still trap WRMSR(XFD) at the start and then disable interception
> after the 1st trap.

If we go for buffer expansion at vcpu_create() or CPUID2 then I think
you don't need a trap at all.

XFD_ERR:  Always 0 on the host. Guest state needs to be preserved on
          VMEXIT and restored on VMENTER

This can be done simply with the MSR entry/exit controls. No trap is
required, neither for #NM nor for XFD_ERR.

VMENTER loads guest state. VMEXIT saves guest state and loads host state
(0)
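As a sketch of that arrangement (the entry layout follows the SDM's MSR-load/store area format; the setup helper and its use here are invented for illustration):

```c
#include <stdint.h>

#define MSR_IA32_XFD_ERR	0x1c5	/* architectural MSR number */

/* VM-entry/VM-exit MSR-load/store area entry layout per the Intel SDM. */
struct vmx_msr_entry {
	uint32_t index;
	uint32_t reserved;
	uint64_t value;
};

/*
 * Arrange XFD_ERR handling without any trap:
 *  - entry_load:  VMENTER loads the guest value
 *  - exit_store:  VMEXIT saves the guest value here automatically
 *  - exit_load:   VMEXIT restores the host value (always 0)
 */
static void setup_xfd_err_autosave(struct vmx_msr_entry *entry_load,
				   struct vmx_msr_entry *exit_store,
				   struct vmx_msr_entry *exit_load,
				   uint64_t guest_xfd_err)
{
	entry_load->index = MSR_IA32_XFD_ERR;
	entry_load->reserved = 0;
	entry_load->value = guest_xfd_err;

	exit_store->index = MSR_IA32_XFD_ERR;
	exit_store->reserved = 0;
	exit_store->value = 0;		/* overwritten by hardware on VMEXIT */

	exit_load->index = MSR_IA32_XFD_ERR;
	exit_load->reserved = 0;
	exit_load->value = 0;		/* host XFD_ERR is always 0 */
}
```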

XFD:     Always guest state

So VMENTER does nothing, and on VMEXIT either the guest state is saved
automatically and the sync function uses that value, or you keep the sync
function which does the rdmsrl() as is.

Hmm?

Thanks,

        tglx


* RE: [patch 5/6] x86/fpu: Provide fpu_update_guest_xcr0/xfd()
  2021-12-16  9:34                         ` Thomas Gleixner
@ 2021-12-16  9:59                           ` Tian, Kevin
  2021-12-16 14:12                             ` Thomas Gleixner
  0 siblings, 1 reply; 47+ messages in thread
From: Tian, Kevin @ 2021-12-16  9:59 UTC (permalink / raw)
  To: Thomas Gleixner, Paolo Bonzini, Wang, Wei W, quintela
  Cc: LKML, Dr. David Alan Gilbert, Jing Liu, Zhong, Yang, x86, kvm,
	Christopherson, Sean, Nakajima, Jun, Zeng, Guang

> From: Thomas Gleixner <tglx@linutronix.de>
> Sent: Thursday, December 16, 2021 5:35 PM
> 
> On Thu, Dec 16 2021 at 01:04, Kevin Tian wrote:
> >> From: Paolo Bonzini <paolo.bonzini@gmail.com> On Behalf Of Paolo
> Bonzini
> >> Considering that in practice all Linux guests with AMX would have XFD
> >> passthrough (because if there's no prctl, Linux keeps AMX disabled in
> >> XFD), this removes the need to do all the #NM handling too.  Just make
> >
> > #NM trap is for XFD_ERR thus still required.
> >
> >> XFD passthrough if it can ever be set to a nonzero value.  This costs an
> >> RDMSR per vmexit even if neither the host nor the guest ever use AMX.
> >
> > Well, we can still trap WRMSR(XFD) at the start and then disable
> > interception after the 1st trap.
> 
> If we go for buffer expansion at vcpu_create() or CPUID2 then I think
> you don't need a trap at all.
> 
> XFD_ERR:  Always 0 on the host. Guest state needs to be preserved on
>           VMEXIT and restored on VMENTER
> 
> This can be done simply with the MSR entry/exit controls. No trap is
> required, neither for #NM nor for XFD_ERR.
> 
> VMENTER loads guest state. VMEXIT saves guest state and loads host state
> (0)

This implies three MSR operations for every vm-exit.

With the trap we only need one RDMSR in the host #NM handler and one
RDMSR/one WRMSR exit in the guest #NM handler, which are both rare,
plus one RDMSR/one WRMSR per vm-exit only if the saved xfd_err is
non-zero, which is again rare.
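That accounting can be sketched with mocked MSR accesses (purely illustrative; the helpers, counter and bookkeeping are invented for this sketch, not proposed code):

```c
#include <stdint.h>

/* Mocked MSR plumbing so the access count is observable; the real code
 * would use rdmsrl()/wrmsrl() on MSR_IA32_XFD_ERR. */
static uint64_t mock_xfd_err_msr;
static unsigned int msr_ops;

static uint64_t mock_rdmsr(void)
{
	msr_ops++;
	return mock_xfd_err_msr;
}

static void mock_wrmsr(uint64_t v)
{
	msr_ops++;
	mock_xfd_err_msr = v;
}

/* Last known guest XFD_ERR, captured via the (rare) #NM trap path. */
static uint64_t saved_guest_xfd_err;

/*
 * Trap-based scheme, per-VMEXIT path: the MSR is read, and the host
 * value (0) written back, only when the saved guest XFD_ERR was known
 * to be non-zero (the rare case). The common case costs no MSR ops.
 */
static void vmexit_sync_xfd_err(void)
{
	if (!saved_guest_xfd_err)
		return;
	saved_guest_xfd_err = mock_rdmsr();
	if (saved_guest_xfd_err)
		mock_wrmsr(0);
}
```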

> 
> XFD:     Always guest state
> 
> So VMENTER does nothing and VMEXIT either saves guest state and the sync
> function uses the automatically saved value or you keep the sync
> function which does the rdmsrl() as is.
> 

Yes, this is the 3rd open question that I asked in another reply. The only
restriction with this approach is that the sync cost is also added for a
legacy OS which doesn't touch xfd at all.

Thanks
Kevin


* RE: [patch 5/6] x86/fpu: Provide fpu_update_guest_xcr0/xfd()
  2021-12-15 10:41                       ` Paolo Bonzini
  2021-12-16  1:00                         ` Tian, Kevin
  2021-12-16  5:36                         ` Tian, Kevin
@ 2021-12-16 10:21                         ` Tian, Kevin
  2021-12-16 10:24                           ` Paolo Bonzini
  2021-12-16 10:26                           ` Paolo Bonzini
  2021-12-16 13:00                         ` Tian, Kevin
  3 siblings, 2 replies; 47+ messages in thread
From: Tian, Kevin @ 2021-12-16 10:21 UTC (permalink / raw)
  To: Paolo Bonzini, Thomas Gleixner, Wang, Wei W, quintela
  Cc: LKML, Dr. David Alan Gilbert, Jing Liu, Zhong, Yang, x86, kvm,
	Christopherson, Sean, Nakajima, Jun, Zeng, Guang

> From: Paolo Bonzini
> Sent: Wednesday, December 15, 2021 6:41 PM
> 
> There's also another important thing that hasn't been mentioned so far:
> KVM_GET_SUPPORTED_CPUID should _not_ include the dynamic bits in
> CPUID[0xD] if they have not been requested with prctl.  It's okay to
> return the AMX bit, but not the bit in CPUID[0xD].

There is no vcpu in this ioctl, thus we cannot check vcpu->arch.guest_fpu.perm.

This then requires exposing xstate_get_guest_group_perm() to KVM. Thomas, 
are you OK with this change given Paolo's ask? v1 included this change but
it was not necessary at the moment:

	https://lore.kernel.org/lkml/87lf0ot50q.ffs@tglx/

and Paolo, do we want to document that prctl() must be done before
calling KVM_GET_SUPPORTED_CPUID? If yes, where is the proper location?

Thanks
Kevin


* Re: [patch 5/6] x86/fpu: Provide fpu_update_guest_xcr0/xfd()
  2021-12-16 10:21                         ` Tian, Kevin
@ 2021-12-16 10:24                           ` Paolo Bonzini
  2021-12-16 10:26                           ` Paolo Bonzini
  1 sibling, 0 replies; 47+ messages in thread
From: Paolo Bonzini @ 2021-12-16 10:24 UTC (permalink / raw)
  To: Tian, Kevin, Thomas Gleixner, Wang, Wei W, quintela
  Cc: LKML, Dr. David Alan Gilbert, Jing Liu, Zhong, Yang, x86, kvm,
	Christopherson, Sean, Nakajima, Jun, Zeng, Guang

On 12/16/21 11:21, Tian, Kevin wrote:
>> From: Paolo Bonzini
>> Sent: Wednesday, December 15, 2021 6:41 PM
>>
>> There's also another important thing that hasn't been mentioned so far:
>> KVM_GET_SUPPORTED_CPUID should _not_ include the dynamic bits in
>> CPUID[0xD] if they have not been requested with prctl.  It's okay to
>> return the AMX bit, but not the bit in CPUID[0xD].
> 
> There is no vcpu in this ioctl, thus we cannot check vcpu->arch.guest_fpu.perm.
> 
> This then requires exposing xstate_get_guest_group_perm() to KVM.

Right, this is a generic /dev/kvm ioctl therefore it has to check the 
process state.

> Thomas, are you OK with this change given Paolo's ask? v1 included
> this change but it was not necessary at the moment:
> 
> 	https://lore.kernel.org/lkml/87lf0ot50q.ffs@tglx/
> 
> and Paolo, do we want to document that prctl() must be done before
> calling KVM_GET_SUPPORTED_CPUID? If yes, where is the proper location?

You can document it under the KVM_GET_SUPPORTED_CPUID ioctl.

Paolo


* Re: [patch 5/6] x86/fpu: Provide fpu_update_guest_xcr0/xfd()
  2021-12-16 10:21                         ` Tian, Kevin
  2021-12-16 10:24                           ` Paolo Bonzini
@ 2021-12-16 10:26                           ` Paolo Bonzini
  1 sibling, 0 replies; 47+ messages in thread
From: Paolo Bonzini @ 2021-12-16 10:26 UTC (permalink / raw)
  To: Tian, Kevin, Thomas Gleixner, Wang, Wei W, quintela
  Cc: LKML, Dr. David Alan Gilbert, Jing Liu, Zhong, Yang, x86, kvm,
	Christopherson, Sean, Nakajima, Jun, Zeng, Guang

On 12/16/21 11:21, Tian, Kevin wrote:
>> From: Paolo Bonzini
>> Sent: Wednesday, December 15, 2021 6:41 PM
>>
>> There's also another important thing that hasn't been mentioned so far:
>> KVM_GET_SUPPORTED_CPUID should _not_ include the dynamic bits in
>> CPUID[0xD] if they have not been requested with prctl.  It's okay to
>> return the AMX bit, but not the bit in CPUID[0xD].
> 
> There is no vcpu in this ioctl, thus we cannot check vcpu->arch.guest_fpu.perm.
> 
> This then requires exposing xstate_get_guest_group_perm() to KVM.

Right, this is a generic /dev/kvm ioctl therefore it has to check the 
process state.

> Thomas, are you OK with this change given Paolo's ask? v1 included
> this change but it was not necessary at the moment:
> 
> 	https://lore.kernel.org/lkml/87lf0ot50q.ffs@tglx/
> 
> and Paolo, do we want to document that prctl() must be done before
> calling KVM_GET_SUPPORTED_CPUID? If yes, where is the proper location?

You can document it under the KVM_GET_SUPPORTED_CPUID ioctl.

(The reason for this ordering is backwards compatibility: otherwise a 
process could pass the output of KVM_GET_SUPPORTED_CPUID to KVM_SET_CPUID2 directly, 
and the resulting VM would not be able to use AMX because it hasn't been 
requested.  Likewise, userspace needs to know that if you use prctl then 
you also need to allocate >4K for the xstate and use KVM_GET_XSAVE2 to 
retrieve it).

Paolo


* RE: [patch 5/6] x86/fpu: Provide fpu_update_guest_xcr0/xfd()
  2021-12-15 10:41                       ` Paolo Bonzini
                                           ` (2 preceding siblings ...)
  2021-12-16 10:21                         ` Tian, Kevin
@ 2021-12-16 13:00                         ` Tian, Kevin
  3 siblings, 0 replies; 47+ messages in thread
From: Tian, Kevin @ 2021-12-16 13:00 UTC (permalink / raw)
  To: Paolo Bonzini, Thomas Gleixner, Wang, Wei W, quintela
  Cc: LKML, Dr. David Alan Gilbert, Jing Liu, Zhong, Yang, x86, kvm,
	Christopherson, Sean, Nakajima, Jun, Zeng, Guang

Hi, Paolo/Thomas,

Any comment on the following open questions? In general, doing static
buffer expansion definitely simplifies the logic, but we still need your
help to finalize its impact on the overall design. 😊

Thanks
Kevin

> From: Tian, Kevin
> Sent: Thursday, December 16, 2021 1:36 PM
> 
> > From: Tian, Kevin
> > Sent: Thursday, December 16, 2021 9:00 AM
> > >
> > > Off-list, Thomas mentioned doing it even at vCPU creation as long as the
> > > prctl has been called.  That is also okay and even simpler.
> >
> > Makes sense. It also avoids raising #GP in the emulation path just due
> > to a reallocation error.
> >
> > We'll follow this direction to do a quick update/test.
> >
> 
> After some study there are three opens which we'd like to sync here. Once
> they are closed we'll send out a new version very soon (hopefully tomorrow).
> 
> 1) Have a full expanded buffer at vCPU creation
> 
> There are two options.
> 
> One is to directly allocate a big-enough buffer upon guest_fpu::perm in
> fpu_alloc_guest_fpstate(). There is no reallocation per-se thus most changes
> in this series are not required.
> 
> The other is to keep the reallocation concept (thus all previous patches are
> kept) and still call a wrapper around __xfd_enable_feature() even at vCPU
> creation (e.g. right after fpu_init_guest_permissions()). This matches the
> fpu core assumption that fpstate for xfd features are dynamically allocated,
> though the actual calling point may not be dynamic. This also allows us
> to exploit doing expansion in KVM_SET_CPUID2 (see next).
> 
> 2) Do expansion at vCPU creation or KVM_SET_CPUID2?
> 
> If the reallocation concept is still kept, then we feel doing expansion in
> KVM_SET_CPUID2 makes slightly more sense. There is no functional
> difference between two options since the guest is not running at this
> point. And in general Qemu should set prctl according to the cpuid bits.
> But since anyway we still need to check guest cpuid against guest perm in
> KVM_SET_CPUID2, it reads clearer to expand the buffer only after this
> check is passed.
> 
> If this approach is agreed, then we may replace the helper functions in
> this patch with a new one:
> 
> /*
>  * fpu_update_guest_perm_features - Enable xfeatures according to guest
> perm
>  * @guest_fpu:		Pointer to the guest FPU container
>  *
>  * Enable all dynamic xfeatures according to guest perm. Invoked if the
>  * caller wants to conservatively expand fpstate buffer instead of waiting
>  * until a given feature is accessed.
>  *
>  * Return: 0 on success, error code otherwise
>  */
> +int fpu_update_guest_perm_features(struct fpu_guest *guest_fpu)
> +{
> +	u64 expand;
> +
> +	lockdep_assert_preemption_enabled();
> +
> +	if (!IS_ENABLED(CONFIG_X86_64))
> +		return 0;
> +
> +	expand = guest_fpu->perm & ~guest_fpu->xfeatures;
> +	if (!expand)
> +		return 0;
> +
> +	return __xfd_enable_feature(expand, guest_fpu);
> +}
> +EXPORT_SYMBOL_GPL(fpu_update_guest_perm_features);
> 
> 3) Always disable interception, or disable only after the 1st interception?
> 
> Once we choose to have a full expanded buffer before guest runs, the
> point of intercepting WRMSR(IA32_XFD) becomes less obvious since
> no dynamic reallocation is required.
> 
> One option is to always disable WRMSR interception once
> KVM_SET_CPUID2 succeeds, with the cost of one RDMSR per vm-exit.
> But doing so affects legacy OS which even has no XFD logic at all.
> 
> The other option is to continue the current policy i.e. disable write
> emulation only after the 1st interception of setting XFD to a non-zero
> value. Then the RDMSR cost is added only for guest which supports XFD.
> 
> In either case we need another helper to update guest_fpu->fpstate->xfd
> as required in the restore path. For the 2nd option we further want
> to update MSR if guest_fpu is currently in use:
> 
> +void xfd_update_guest_xfd(struct fpu_guest *guest_fpu, u64 xfd)
> +{
> +	fpregs_lock();
> +	guest_fpu->fpstate->xfd = xfd;
> +	if (guest_fpu->fpstate->in_use)
> +		xfd_update_state(guest_fpu->fpstate);
> +	fpregs_unlock();
> +}
> 
> Thoughts?
> --
> p.s. currently we're working on a quick prototype based on:
>   - Expand buffer in KVM_SET_CPUID2
>   - Disable write emulation after first interception
> to check for any oversights.
> 
> Thanks
> Kevin


* RE: [patch 5/6] x86/fpu: Provide fpu_update_guest_xcr0/xfd()
  2021-12-16  9:59                           ` Tian, Kevin
@ 2021-12-16 14:12                             ` Thomas Gleixner
  2021-12-17 15:33                               ` Tian, Kevin
  0 siblings, 1 reply; 47+ messages in thread
From: Thomas Gleixner @ 2021-12-16 14:12 UTC (permalink / raw)
  To: Tian, Kevin, Paolo Bonzini, Wang, Wei W, quintela
  Cc: LKML, Dr. David Alan Gilbert, Jing Liu, Zhong, Yang, x86, kvm,
	Christopherson, Sean, Nakajima, Jun, Zeng, Guang

On Thu, Dec 16 2021 at 09:59, Kevin Tian wrote:
>> From: Thomas Gleixner <tglx@linutronix.de>
>> This can be done simply with the MSR entry/exit controls. No trap
>> required, neither for #NM nor for XFD_ERR.
>> 
>> VMENTER loads guest state. VMEXIT saves guest state and loads host state
>> (0)
>
> This implies three MSR operations for every vm-exit.
>
> With the trap we only need one RDMSR in the host #NM handler and one
> RDMSR/one WRMSR exit in the guest #NM handler, which are both rare,
> plus one RDMSR/one WRMSR per vm-exit only if the saved xfd_err is
> non-zero, which is again rare.

Fair enough.

>> XFD:     Always guest state
>> 
>> So VMENTER does nothing and VMEXIT either saves guest state and the sync
>> function uses the automatically saved value or you keep the sync
>> function which does the rdmsrl() as is.
>> 
>
> Yes, this is the 3rd open question that I asked in another reply. The only
> restriction with this approach is that the sync cost is also added for a
> legacy OS which doesn't touch xfd at all.

You still can make that conditional on the guest XCR0. If guest never
enables the extended bit then neither the #NM trap nor the XFD sync
are required.

But yes, there are too many moving parts here :)

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [patch 5/6] x86/fpu: Provide fpu_update_guest_xcr0/xfd()
  2021-12-16  5:36                         ` Tian, Kevin
@ 2021-12-16 21:07                           ` Paolo Bonzini
  0 siblings, 0 replies; 47+ messages in thread
From: Paolo Bonzini @ 2021-12-16 21:07 UTC (permalink / raw)
  To: Tian, Kevin, Thomas Gleixner, Wang, Wei W, quintela
  Cc: LKML, Dr. David Alan Gilbert, Jing Liu, Zhong, Yang, x86, kvm,
	Christopherson, Sean, Nakajima, Jun, Zeng, Guang

On 12/16/21 06:36, Tian, Kevin wrote:
> 2) Do expansion at vCPU creation or KVM_SET_CPUID2?
> 
> If the reallocation concept is still kept, then we feel doing expansion in
> KVM_SET_CPUID2 makes slightly more sense. There is no functional
> difference between two options since the guest is not running at this
> point. And in general Qemu should set prctl according to the cpuid bits.
> But since we anyway still need to check guest cpuid against guest perm in
> KVM_SET_CPUID2, it is clearer to expand the buffer only after this
> check has passed.

Yes, that makes sense to me as well.  In principle userspace could call 
prctl only after KVM_CREATE_VCPU.

> 
> One option is to always disable WRMSR interception once 
> KVM_SET_CPUID2 succeeds, with the cost of one RDMSR per vm-exit. 
> But doing so affects even legacy OSes which have no XFD logic at all.
> 
> The other option is to continue the current policy i.e. disable write 
> emulation only after the 1st interception of setting XFD to a non-zero 
> value. Then the RDMSR cost is added only for guest which supports XFD.

For this I suggest to implement the current policy, but place it at the 
end of the series so it's easy to drop it.

Paolo

^ permalink raw reply	[flat|nested] 47+ messages in thread

* RE: [patch 5/6] x86/fpu: Provide fpu_update_guest_xcr0/xfd()
  2021-12-16 14:12                             ` Thomas Gleixner
@ 2021-12-17 15:33                               ` Tian, Kevin
  0 siblings, 0 replies; 47+ messages in thread
From: Tian, Kevin @ 2021-12-17 15:33 UTC (permalink / raw)
  To: Thomas Gleixner, Paolo Bonzini, Wang, Wei W, quintela
  Cc: LKML, Dr. David Alan Gilbert, Jing Liu, Zhong, Yang, x86, kvm,
	Christopherson, Sean, Nakajima, Jun, Zeng, Guang

> From: Thomas Gleixner <tglx@linutronix.de>
> Sent: Thursday, December 16, 2021 10:13 PM
> >
> > Yes, this is the 3rd open question that I asked in another reply. The only
> > restriction with this approach is that the sync cost is also added for a
> > legacy OS which doesn't touch xfd at all.
> 
> You still can make that conditional on the guest XCR0. If guest never
> enables the extended bit then neither the #NM trap nor the XFD sync
> are required.
> 
> But yes, there are too many moving parts here :)
> 

Yes. Many moving parts, but in general it's getting cleaner and simpler. 😊

We just sent out v2 to hopefully lock down already-closed opens. Based on
that we can see what remains to be further solved.

And really appreciate all the suggestions from you and Paolo!

Thanks
Kevin

^ permalink raw reply	[flat|nested] 47+ messages in thread

end of thread, other threads:[~2021-12-17 15:34 UTC | newest]

Thread overview: 47+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-12-14  2:50 [patch 0/6] x86/fpu: Preparatory changes for guest AMX support Thomas Gleixner
2021-12-14  2:50 ` [patch 1/6] x86/fpu: Extend fpu_xstate_prctl() with guest permissions Thomas Gleixner
2021-12-14  5:13   ` Tian, Kevin
2021-12-14 10:37     ` Paolo Bonzini
2021-12-14  2:50 ` [patch 2/6] x86/fpu: Prepare guest FPU for dynamically enabled FPU features Thomas Gleixner
2021-12-14  2:50 ` [patch 3/6] x86/fpu: Make XFD initialization in __fpstate_reset() a function argument Thomas Gleixner
2021-12-14  2:50 ` [patch 4/6] x86/fpu: Add guest support to xfd_enable_feature() Thomas Gleixner
2021-12-14  6:05   ` Tian, Kevin
2021-12-14 10:21     ` Paolo Bonzini
2021-12-14 13:15       ` Thomas Gleixner
2021-12-15  5:46         ` Tian, Kevin
2021-12-15  9:53           ` Thomas Gleixner
2021-12-15 10:02             ` Tian, Kevin
2021-12-14  2:50 ` [patch 5/6] x86/fpu: Provide fpu_update_guest_xcr0/xfd() Thomas Gleixner
2021-12-14  6:25   ` Tian, Kevin
2021-12-14 15:09   ` Wang, Wei W
2021-12-14 15:40     ` Thomas Gleixner
2021-12-14 16:11       ` Wang, Wei W
2021-12-14 18:04         ` Thomas Gleixner
2021-12-14 19:07           ` Juan Quintela
2021-12-14 20:28             ` Thomas Gleixner
2021-12-14 21:35               ` Juan Quintela
2021-12-15  2:17                 ` Wang, Wei W
2021-12-15 10:09                   ` Thomas Gleixner
2021-12-15 10:27                     ` Paolo Bonzini
2021-12-15 10:41                       ` Paolo Bonzini
2021-12-16  1:00                         ` Tian, Kevin
2021-12-16  5:36                         ` Tian, Kevin
2021-12-16 21:07                           ` Paolo Bonzini
2021-12-16 10:21                         ` Tian, Kevin
2021-12-16 10:24                           ` Paolo Bonzini
2021-12-16 10:26                           ` Paolo Bonzini
2021-12-16 13:00                         ` Tian, Kevin
2021-12-16  1:04                       ` Tian, Kevin
2021-12-16  9:34                         ` Thomas Gleixner
2021-12-16  9:59                           ` Tian, Kevin
2021-12-16 14:12                             ` Thomas Gleixner
2021-12-17 15:33                               ` Tian, Kevin
2021-12-15  6:14   ` Tian, Kevin
2021-12-14  2:50 ` [patch 6/6] x86/fpu: Provide kvm_sync_guest_vmexit_xfd_state() Thomas Gleixner
2021-12-15  6:35   ` Liu, Jing2
2021-12-15  9:49     ` Thomas Gleixner
2021-12-14  6:50 ` [patch 0/6] x86/fpu: Preparatory changes for guest AMX support Tian, Kevin
2021-12-14  6:52 ` Liu, Jing2
2021-12-14  7:54   ` Tian, Kevin
2021-12-14 10:42 ` Paolo Bonzini
2021-12-14 13:24   ` Thomas Gleixner
