linux-kernel.vger.kernel.org archive mirror
* [patch 0/3] x86/fpu/xsave: Add XSAVEC support and XGETBV1 utilization
@ 2022-04-04 12:11 Thomas Gleixner
  2022-04-04 12:11 ` [patch 1/3] x86/fpu/xsave: Support XSAVEC in the kernel Thomas Gleixner
                   ` (2 more replies)
  0 siblings, 3 replies; 16+ messages in thread
From: Thomas Gleixner @ 2022-04-04 12:11 UTC (permalink / raw)
  To: LKML; +Cc: x86, Andrew Cooper, Edgecombe, Rick P, Andrew Cooper

This series adds:

  1) XSAVEC support

     Hypervisors expose XSAVEC but not XSAVES to guests, yet the kernel
      does not support XSAVEC, which means it cannot make use of the
      compacted storage format.

  2) Utilize XGETBV1

     XGETBV1 reads a bitmap which contains the current active XSTATE
     components. This bitmap can be used to optimize the behaviour of
     XSAVES and XSAVEC by requesting only the active components to be
     saved.

     While the init optimization of XSAVEC and XSAVES skips writing the
      state of components which are inactive, the buffer layout still
      provides space for inactive components which were requested to be
      saved. That is suboptimal in terms of prefetch and dTLB when the
      active component bitmap is sparse. A small user space sketch for
      reading that bitmap follows below.
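
For illustration, the XINUSE bitmap which XGETBV1 returns can be inspected
from user space with a trivial sketch (purely illustrative, not part of the
series; assumes a CPU with XGETBV1 support and a recent GCC/Clang, built
with -O2 -mxsave):

	#include <immintrin.h>
	#include <stdio.h>

	int main(void)
	{
		/* XGETBV with ECX=1 returns the XINUSE bitmap: a set bit
		 * means that the component is not in its init state. */
		unsigned long long xinuse = _xgetbv(1);

		printf("XINUSE: %#llx, PKRU (bit 9) active: %s\n", xinuse,
		       (xinuse & (1ULL << 9)) ? "yes" : "no");
		return 0;
	}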

Thanks,

	tglx
---
 include/asm/cpufeatures.h |    2 
 kernel/fpu/xstate.c       |  145 +++++++++++++++++++++++++++++++++++++---------
 kernel/fpu/xstate.h       |   42 ++++++++++---
 3 files changed, 152 insertions(+), 37 deletions(-)


* [patch 1/3] x86/fpu/xsave: Support XSAVEC in the kernel
  2022-04-04 12:11 [patch 0/3] x86/fpu/xsave: Add XSAVEC support and XGETBV1 utilization Thomas Gleixner
@ 2022-04-04 12:11 ` Thomas Gleixner
  2022-04-04 16:10   ` Andrew Cooper
                     ` (2 more replies)
  2022-04-04 12:11 ` [patch 2/3] x86/fpu/xsave: Prepare for optimized compaction Thomas Gleixner
  2022-04-04 12:11 ` [patch 3/3] x86/fpu/xsave: Optimize XSAVEC/S when XGETBV1 is supported Thomas Gleixner
  2 siblings, 3 replies; 16+ messages in thread
From: Thomas Gleixner @ 2022-04-04 12:11 UTC (permalink / raw)
  To: LKML; +Cc: x86, Andrew Cooper, Edgecombe, Rick P, Andrew Cooper

XSAVEC is the user space counterpart of XSAVES; unlike XSAVES it cannot save
supervisor state. In virtualization scenarios the hypervisor does not expose
XSAVES but does expose XSAVEC to the guest, yet the kernel does not make use
of XSAVEC.

That's unfortunate because XSAVEC uses the compacted format of saving the
XSTATE. This is more efficient in terms of storage space vs. XSAVE[OPT] as
it does not create holes for XSTATE components which are not supported or
enabled by the kernel but are available in hardware. There is room for
further optimizations when XSAVEC/S and XGETBV1 are supported.

In order to support XSAVEC:

 - Define the XSAVEC ASM macro as it's not yet supported by the required
   minimal toolchain.

 - Create a software defined X86_FEATURE_XCOMPACTED to select the compacted
   XSTATE buffer format for both XSAVEC and XSAVES.

 - Make XSAVEC an option in the 'XSAVE' ASM alternatives

Requested-by: Andrew Cooper <Andrew.Cooper3@citrix.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/include/asm/cpufeatures.h |    2 -
 arch/x86/kernel/fpu/xstate.c       |   58 ++++++++++++++++++++++++-------------
 arch/x86/kernel/fpu/xstate.h       |   14 +++++---
 3 files changed, 48 insertions(+), 26 deletions(-)

--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -201,7 +201,7 @@
 #define X86_FEATURE_INVPCID_SINGLE	( 7*32+ 7) /* Effectively INVPCID && CR4.PCIDE=1 */
 #define X86_FEATURE_HW_PSTATE		( 7*32+ 8) /* AMD HW-PState */
 #define X86_FEATURE_PROC_FEEDBACK	( 7*32+ 9) /* AMD ProcFeedbackInterface */
-/* FREE!                                ( 7*32+10) */
+#define X86_FEATURE_XCOMPACTED		( 7*32+10) /* "" Use compacted XSTATE (XSAVES or XSAVEC) */
 #define X86_FEATURE_PTI			( 7*32+11) /* Kernel Page Table Isolation enabled */
 #define X86_FEATURE_RETPOLINE		( 7*32+12) /* "" Generic Retpoline mitigation for Spectre variant 2 */
 #define X86_FEATURE_RETPOLINE_LFENCE	( 7*32+13) /* "" Use LFENCE for Spectre variant 2 */
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -142,7 +142,8 @@ static unsigned int xfeature_get_offset(
 	 * Non-compacted format and legacy features use the cached fixed
 	 * offsets.
 	 */
-	if (!cpu_feature_enabled(X86_FEATURE_XSAVES) || xfeature <= XFEATURE_SSE)
+	if (!cpu_feature_enabled(X86_FEATURE_XCOMPACTED) ||
+	    xfeature <= XFEATURE_SSE)
 		return xstate_offsets[xfeature];
 
 	/*
@@ -369,12 +370,12 @@ static void __init setup_init_fpu_buf(vo
 	/*
 	 * All components are now in init state. Read the state back so
 	 * that init_fpstate contains all non-zero init state. This only
-	 * works with XSAVE, but not with XSAVEOPT and XSAVES because
+	 * works with XSAVE, but not with XSAVEOPT and XSAVEC/S because
 	 * those use the init optimization which skips writing data for
 	 * components in init state.
 	 *
 	 * XSAVE could be used, but that would require to reshuffle the
-	 * data when XSAVES is available because XSAVES uses xstate
+	 * data when XSAVEC/S is available because XSAVEC/S uses xstate
 	 * compaction. But doing so is a pointless exercise because most
 	 * components have an all zeros init state except for the legacy
 	 * ones (FP and SSE). Those can be saved with FXSAVE into the
@@ -584,7 +585,8 @@ static unsigned int xstate_calculate_siz
  */
 static bool __init paranoid_xstate_size_valid(unsigned int kernel_size)
 {
-	bool compacted = cpu_feature_enabled(X86_FEATURE_XSAVES);
+	bool compacted = cpu_feature_enabled(X86_FEATURE_XCOMPACTED);
+	bool xsaves = cpu_feature_enabled(X86_FEATURE_XSAVES);
 	unsigned int size = FXSAVE_SIZE + XSAVE_HDR_SIZE;
 	int i;
 
@@ -595,7 +597,7 @@ static bool __init paranoid_xstate_size_
 		 * Supervisor state components can be managed only by
 		 * XSAVES.
 		 */
-		if (!compacted && xfeature_is_supervisor(i)) {
+		if (!xsaves && xfeature_is_supervisor(i)) {
 			XSTATE_WARN_ON(1);
 			return false;
 		}
@@ -612,8 +614,11 @@ static bool __init paranoid_xstate_size_
  * the size of the *user* states.  If we use it to size a buffer
  * that we use 'XSAVES' on, we could potentially overflow the
  * buffer because 'XSAVES' saves system states too.
+ *
+ * This also takes compaction into account. So this works for
+ * XSAVEC as well.
  */
-static unsigned int __init get_xsaves_size(void)
+static unsigned int __init get_compacted_size(void)
 {
 	unsigned int eax, ebx, ecx, edx;
 	/*
@@ -623,6 +628,10 @@ static unsigned int __init get_xsaves_si
 	 *    containing all the state components
 	 *    corresponding to bits currently set in
 	 *    XCR0 | IA32_XSS.
+	 *
+	 * When XSAVES is not available but XSAVEC (virt)
+	 * then there are no supervisor states, but XSAVEC
+	 * still uses compacted format.
 	 */
 	cpuid_count(XSTATE_CPUID, 1, &eax, &ebx, &ecx, &edx);
 	return ebx;
@@ -632,13 +641,13 @@ static unsigned int __init get_xsaves_si
  * Get the total size of the enabled xstates without the independent supervisor
  * features.
  */
-static unsigned int __init get_xsaves_size_no_independent(void)
+static unsigned int __init get_xsave_compacted_size(void)
 {
 	u64 mask = xfeatures_mask_independent();
 	unsigned int size;
 
 	if (!mask)
-		return get_xsaves_size();
+		return get_compacted_size();
 
 	/* Disable independent features. */
 	wrmsrl(MSR_IA32_XSS, xfeatures_mask_supervisor());
@@ -647,7 +656,7 @@ static unsigned int __init get_xsaves_si
 	 * Ask the hardware what size is required of the buffer.
 	 * This is the size required for the task->fpu buffer.
 	 */
-	size = get_xsaves_size();
+	size = get_compacted_size();
 
 	/* Re-enable independent features so XSAVES will work on them again. */
 	wrmsrl(MSR_IA32_XSS, xfeatures_mask_supervisor() | mask);
@@ -687,20 +696,21 @@ static int __init init_xstate_size(void)
 {
 	/* Recompute the context size for enabled features: */
 	unsigned int user_size, kernel_size, kernel_default_size;
-	bool compacted = cpu_feature_enabled(X86_FEATURE_XSAVES);
+	bool compacted = cpu_feature_enabled(X86_FEATURE_XCOMPACTED);
 
 	/* Uncompacted user space size */
 	user_size = get_xsave_size_user();
 
 	/*
-	 * XSAVES kernel size includes supervisor states and
-	 * uses compacted format when available.
+	 * XSAVES kernel size includes supervisor states and uses compacted
+	 * format. XSAVEC uses compacted format, but does not save
+	 * supervisor states.
 	 *
-	 * XSAVE does not support supervisor states so
-	 * kernel and user size is identical.
+	 * XSAVE[OPT] do not support supervisor states so kernel and user
+	 * size is identical.
 	 */
 	if (compacted)
-		kernel_size = get_xsaves_size_no_independent();
+		kernel_size = get_xsave_compacted_size();
 	else
 		kernel_size = user_size;
 
@@ -813,8 +823,11 @@ void __init fpu__init_system_xstate(unsi
 	if (!cpu_feature_enabled(X86_FEATURE_XFD))
 		fpu_kernel_cfg.max_features &= ~XFEATURE_MASK_USER_DYNAMIC;
 
-	fpu_kernel_cfg.max_features &= XFEATURE_MASK_USER_SUPPORTED |
-			      XFEATURE_MASK_SUPERVISOR_SUPPORTED;
+	if (!cpu_feature_enabled(X86_FEATURE_XSAVES))
+		fpu_kernel_cfg.max_features &= XFEATURE_MASK_USER_SUPPORTED;
+	else
+		fpu_kernel_cfg.max_features &= XFEATURE_MASK_USER_SUPPORTED |
+					XFEATURE_MASK_SUPERVISOR_SUPPORTED;
 
 	fpu_user_cfg.max_features = fpu_kernel_cfg.max_features;
 	fpu_user_cfg.max_features &= XFEATURE_MASK_USER_SUPPORTED;
@@ -837,6 +850,11 @@ void __init fpu__init_system_xstate(unsi
 	 */
 	init_fpstate.xfd = fpu_user_cfg.max_features & XFEATURE_MASK_USER_DYNAMIC;
 
+	/* Set up compaction feature bit */
+	if (cpu_feature_enabled(X86_FEATURE_XSAVEC) ||
+	    cpu_feature_enabled(X86_FEATURE_XSAVES))
+		setup_force_cpu_cap(X86_FEATURE_XCOMPACTED);
+
 	/* Enable xstate instructions to be able to continue with initialization: */
 	fpu__init_cpu_xstate();
 
@@ -873,7 +891,7 @@ void __init fpu__init_system_xstate(unsi
 	pr_info("x86/fpu: Enabled xstate features 0x%llx, context size is %d bytes, using '%s' format.\n",
 		fpu_kernel_cfg.max_features,
 		fpu_kernel_cfg.max_size,
-		boot_cpu_has(X86_FEATURE_XSAVES) ? "compacted" : "standard");
+		boot_cpu_has(X86_FEATURE_XCOMPACTED) ? "compacted" : "standard");
 	return;
 
 out_disable:
@@ -917,7 +935,7 @@ static void *__raw_xsave_addr(struct xre
 	if (WARN_ON_ONCE(!xfeature_enabled(xfeature_nr)))
 		return NULL;
 
-	if (cpu_feature_enabled(X86_FEATURE_XSAVES)) {
+	if (cpu_feature_enabled(X86_FEATURE_XCOMPACTED)) {
 		if (WARN_ON_ONCE(!(xcomp_bv & BIT_ULL(xfeature_nr))))
 			return NULL;
 	}
@@ -1525,7 +1543,7 @@ static int __xstate_request_perm(u64 per
 	 * vendors into extending XFD for the pre AMX states, especially
 	 * AVX512.
 	 */
-	bool compacted = cpu_feature_enabled(X86_FEATURE_XSAVES);
+	bool compacted = cpu_feature_enabled(X86_FEATURE_XCOMPACTED);
 	struct fpu *fpu = &current->group_leader->thread.fpu;
 	struct fpu_state_perm *perm;
 	unsigned int ksize, usize;
--- a/arch/x86/kernel/fpu/xstate.h
+++ b/arch/x86/kernel/fpu/xstate.h
@@ -16,7 +16,7 @@ static inline void xstate_init_xcomp_bv(
 	 * XRSTORS requires these bits set in xcomp_bv, or it will
 	 * trigger #GP:
 	 */
-	if (cpu_feature_enabled(X86_FEATURE_XSAVES))
+	if (cpu_feature_enabled(X86_FEATURE_XCOMPACTED))
 		xsave->header.xcomp_bv = mask | XCOMP_BV_COMPACTED_FORMAT;
 }
 
@@ -79,6 +79,7 @@ static inline u64 xfeatures_mask_indepen
 /* These macros all use (%edi)/(%rdi) as the single memory argument. */
 #define XSAVE		".byte " REX_PREFIX "0x0f,0xae,0x27"
 #define XSAVEOPT	".byte " REX_PREFIX "0x0f,0xae,0x37"
+#define XSAVEC		".byte " REX_PREFIX "0x0f,0xc7,0x27"
 #define XSAVES		".byte " REX_PREFIX "0x0f,0xc7,0x2f"
 #define XRSTOR		".byte " REX_PREFIX "0x0f,0xae,0x2f"
 #define XRSTORS		".byte " REX_PREFIX "0x0f,0xc7,0x1f"
@@ -97,9 +98,11 @@ static inline u64 xfeatures_mask_indepen
 		     : "memory")
 
 /*
- * If XSAVES is enabled, it replaces XSAVEOPT because it supports a compact
- * format and supervisor states in addition to modified optimization in
- * XSAVEOPT.
+ * If XSAVES is enabled, it replaces XSAVEC because it supports supervisor
+ * states in addition to XSAVEC.
+ *
+ * Otherwise if XSAVEC is enabled, it replaces XSAVEOPT because it supports
+ * compacted storage format in addition to XSAVEOPT.
  *
  * Otherwise, if XSAVEOPT is enabled, XSAVEOPT replaces XSAVE because XSAVEOPT
  * supports modified optimization which is not supported by XSAVE.
@@ -111,8 +114,9 @@ static inline u64 xfeatures_mask_indepen
  * address of the instruction where we might get an exception at.
  */
 #define XSTATE_XSAVE(st, lmask, hmask, err)				\
-	asm volatile(ALTERNATIVE_2(XSAVE,				\
+	asm volatile(ALTERNATIVE_3(XSAVE,				\
 				   XSAVEOPT, X86_FEATURE_XSAVEOPT,	\
+				   XSAVEC,   X86_FEATURE_XSAVEC,	\
 				   XSAVES,   X86_FEATURE_XSAVES)	\
 		     "\n"						\
 		     "xor %[err], %[err]\n"				\



* [patch 2/3] x86/fpu/xsave: Prepare for optimized compaction
  2022-04-04 12:11 [patch 0/3] x86/fpu/xsave: Add XSAVEC support and XGETBV1 utilization Thomas Gleixner
  2022-04-04 12:11 ` [patch 1/3] x86/fpu/xsave: Support XSAVEC in the kernel Thomas Gleixner
@ 2022-04-04 12:11 ` Thomas Gleixner
  2022-04-14 15:46   ` Dave Hansen
  2022-04-04 12:11 ` [patch 3/3] x86/fpu/xsave: Optimize XSAVEC/S when XGETBV1 is supported Thomas Gleixner
  2 siblings, 1 reply; 16+ messages in thread
From: Thomas Gleixner @ 2022-04-04 12:11 UTC (permalink / raw)
  To: LKML; +Cc: x86, Andrew Cooper, Edgecombe, Rick P, Andrew Cooper

XSAVES/C allow for optimized compaction by saving only the active component
states and compacting the storage so that it is linear without holes, which
is better for prefetching and dTLB behaviour.

This needs preparation in the copy to/from UABI functions. They have to be
aware of the optimized layout.

The change for __copy_xstate_to_uabi_buf() is trivial. It just has to avoid
calculating the offset in the XSTATE buffer when a component is in init
state. That's an improvement in general.

For copy_uabi_to_xstate() it's more effort when supervisor states are
supported and active. If the user space buffer has active states which are
not in the set of current states of the XSTATE buffer, then the buffer
layout will change which means the supervisor states might be overwritten.

Provide a function, which moves them out of the way and invoke it before
any of the extended features is copied from the user space buffer.
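
To illustrate why the supervisor state has to be moved, here is a standalone
sketch with made up names and sizes (HDR_END, AVX_SIZE, SUPERV_SIZE are
illustrative only, this is not kernel code). The kernel buffer holds only one
supervisor component right behind the XSAVE header; when the user buffer adds
a lower numbered component, the supervisor data has to be moved up before the
user data is copied in:

	#include <stdio.h>
	#include <string.h>

	#define HDR_END		576	/* legacy area + XSAVE header */
	#define AVX_SIZE	256	/* illustrative component sizes */
	#define SUPERV_SIZE	8

	int main(void)
	{
		unsigned char buf[1024] = { 0 };
		unsigned int from = HDR_END;		/* supervisor state today */
		unsigned int to = HDR_END + AVX_SIZE;	/* after adding AVX */

		/* Pretend there is live supervisor data in the buffer */
		memset(buf + from, 0xAA, SUPERV_SIZE);

		/* Move the supervisor state out of the way first ... */
		memmove(buf + to, buf + from, SUPERV_SIZE);

		/* ... only then copy the user supplied AVX data to HDR_END */
		printf("Supervisor state moved from %u to %u\n", from, to);
		return 0;
	}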

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/kernel/fpu/xstate.c |   77 ++++++++++++++++++++++++++++++++++++++++---
 arch/x86/kernel/fpu/xstate.h |   14 ++++++-
 2 files changed, 84 insertions(+), 7 deletions(-)

--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -1147,10 +1147,10 @@ void __copy_xstate_to_uabi_buf(struct me
 			pkru.pkru = pkru_val;
 			membuf_write(&to, &pkru, sizeof(pkru));
 		} else {
-			copy_feature(header.xfeatures & BIT_ULL(i), &to,
-				     __raw_xsave_addr(xsave, i),
-				     __raw_xsave_addr(xinit, i),
-				     xstate_sizes[i]);
+			bool active = header.xfeatures & BIT_ULL(i);
+			void *from = __raw_xsave_addr(active ? xsave : xinit, i);
+
+			membuf_write(&to, from, xstate_sizes[i]);
 		}
 		/*
 		 * Keep track of the last copied state in the non-compacted
@@ -1195,6 +1195,73 @@ static int copy_from_buffer(void *dst, u
 	return 0;
 }
 
+/*
+ * Prepare the kernel XSTATE buffer for saving the user supplied XSTATE. If
+ * the XGETBV1 based optimization is in use then the kernel buffer might
+ * not have the user supplied active features set and an eventual active
+ * supervisor state has to be moved out of the way. With optimized
+ * compaction the features which are to be stored need to be set in the
+ * XCOMP_BV field of the XSTATE header.
+ */
+static void xsave_adjust_xcomp(struct fpstate *fpstate, u64 xuser)
+{
+	struct xregs_state *xsave = &fpstate->regs.xsave;
+	u64 xtmp, xall, xbv, xcur = xsave->header.xfeatures;
+	int i;
+
+	/* Nothing to do if optimized compaction is not in use */
+	if (!xsave_use_xgetbv1())
+		return;
+
+	/*
+	 * Check whether the current xstate buffer contains supervisor
+	 * state. If not, just set the user supplied features.
+	 */
+	if (!(xcur & XFEATURE_MASK_SUPERVISOR_ALL)) {
+		__xstate_init_xcomp_bv(xsave, xuser);
+		return;
+	}
+
+	/*
+	 * Set legacy features. They are at a fixed offset and do not
+	 * affect the layout.
+	 */
+	xbv = xsave->header.xcomp_bv;
+	xbv |= xuser & (XFEATURE_MASK_FP | XFEATURE_MASK_SSE);
+
+	/*
+	 * Check whether there is new extended state in the user supplied
+	 * buffer. If not, then nothing has to be moved around.
+	 */
+	if (!(xuser & ~xbv)) {
+		__xstate_init_xcomp_bv(xsave, xbv);
+		return;
+	}
+
+	/*
+	 * No more optimizations. Set the user features and move the
+	 * supervisor state(s). If the new user feature is past
+	 * the supervisor state, then the loop is moving nothing.
+	 */
+	xtmp = xbv & XFEATURE_MASK_SUPERVISOR_ALL;
+	xall = xbv | xuser;
+
+	for (i = fls64(xtmp) - 1; i >= FIRST_EXTENDED_XFEATURE;
+	     i = fls64(xtmp) - 1) {
+		unsigned int to, from;
+
+		from = xfeature_get_offset(xbv, i);
+		to = xfeature_get_offset(xall, i);
+		if (from < to) {
+			memmove((void *)xsave + to, (void *)xsave + from,
+				xstate_sizes[i]);
+		} else {
+			WARN_ON_ONCE(to < from);
+		}
+		xtmp &= ~BIT_ULL(i);
+	}
+	__xstate_init_xcomp_bv(xsave, xall);
+}
 
 static int copy_uabi_to_xstate(struct fpstate *fpstate, const void *kbuf,
 			       const void __user *ubuf)
@@ -1232,6 +1299,8 @@ static int copy_uabi_to_xstate(struct fp
 		}
 	}
 
+	xsave_adjust_xcomp(fpstate, hdr.xfeatures);
+
 	for (i = 0; i < XFEATURE_MAX; i++) {
 		u64 mask = ((u64)1 << i);
 
--- a/arch/x86/kernel/fpu/xstate.h
+++ b/arch/x86/kernel/fpu/xstate.h
@@ -10,14 +10,22 @@
 DECLARE_PER_CPU(u64, xfd_state);
 #endif
 
-static inline void xstate_init_xcomp_bv(struct xregs_state *xsave, u64 mask)
+static inline bool xsave_use_xgetbv1(void) { return false; }
+
+static inline void __xstate_init_xcomp_bv(struct xregs_state *xsave, u64 mask)
 {
 	/*
 	 * XRSTORS requires these bits set in xcomp_bv, or it will
-	 * trigger #GP:
+	 * trigger #GP. It's also required for XRSTOR when the buffer
+	 * was saved with XSAVEC in compacted format.
 	 */
+	xsave->header.xcomp_bv = mask | XCOMP_BV_COMPACTED_FORMAT;
+}
+
+static inline void xstate_init_xcomp_bv(struct xregs_state *xsave, u64 mask)
+{
 	if (cpu_feature_enabled(X86_FEATURE_XCOMPACTED))
-		xsave->header.xcomp_bv = mask | XCOMP_BV_COMPACTED_FORMAT;
+		__xstate_init_xcomp_bv(xsave, mask);
 }
 
 static inline u64 xstate_get_group_perm(bool guest)



* [patch 3/3] x86/fpu/xsave: Optimize XSAVEC/S when XGETBV1 is supported
  2022-04-04 12:11 [patch 0/3] x86/fpu/xsave: Add XSAVEC support and XGETBV1 utilization Thomas Gleixner
  2022-04-04 12:11 ` [patch 1/3] x86/fpu/xsave: Support XSAVEC in the kernel Thomas Gleixner
  2022-04-04 12:11 ` [patch 2/3] x86/fpu/xsave: Prepare for optimized compaction Thomas Gleixner
@ 2022-04-04 12:11 ` Thomas Gleixner
  2022-04-14 17:24   ` Dave Hansen
  2 siblings, 1 reply; 16+ messages in thread
From: Thomas Gleixner @ 2022-04-04 12:11 UTC (permalink / raw)
  To: LKML; +Cc: x86, Andrew Cooper, Edgecombe, Rick P, Andrew Cooper

XSAVEC/S store the FPU state in compacted format, which avoids holes in the
memory image. The kernel uses this feature in a very naive way and just
avoids holes which come from unsupported features, like PT. That's a
marginal saving of 128 bytes vs. the uncompacted format on a SKL-X.

The first 576 bytes are fixed: 512 bytes of legacy (FP/SSE) state and the
64 byte XSAVE header. On a SKL-X machine the other components are stored at
the following offsets:
 
 xstate_offset[2]:  576, xstate_sizes[2]:  256
 xstate_offset[3]:  832, xstate_sizes[3]:   64
 xstate_offset[4]:  896, xstate_sizes[4]:   64
 xstate_offset[5]:  960, xstate_sizes[5]:   64
 xstate_offset[6]: 1024, xstate_sizes[6]:  512
 xstate_offset[7]: 1536, xstate_sizes[7]: 1024
 xstate_offset[9]: 2560, xstate_sizes[9]:    8

XSAVEC/S use the init optimization, which does not write the data of a
component when that component is in init state. Whether a component is in
init state is tracked in the XSTATE_BV bitmap of the XSTATE header.

The kernel requests to save all enabled components, which results in a
suboptimal write/read pattern when the set of active components is sparse.

A typical scenario is an active set of 0x202 (PKRU + SSE) out of the full
supported set of 0x2FF. That means XSAVEC/S writes and XRSTOR[S] reads:

  - SSE in the legacy area (0-511)
  - Part of the XSTATE header (512-575)
  - PKRU at offset 2560

which is suboptimal. Prefetch works better when the access is linear. But
what's worse is that PKRU can be located in a different page which
obviously affects dTLB.

XSAVEC/S allow the memory footprint to be reduced further when the active
feature set is sparse and the CPU supports XGETBV1. XGETBV1 reads the
init/active state of the XSTATE components as a bitmap. This bitmap can be
fed into XSAVEC/S to request storage of only the active components, which
changes the layout of the state buffer to:

  - SSE in the legacy area (0-511)
  - Part of the XSTATE header (512-575)
  - PKRU at offset 576
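
The offsets above can be reproduced with a trivial sketch based on the SKL-X
sizes listed earlier (illustrative only; compacted_offset() is a made up
helper and the real kernel code additionally honours the 64-byte alignment
attribute of some components):

	#include <stdio.h>

	static const unsigned int xstate_size[10] = {
		[2] = 256, [3] = 64, [4] = 64, [5] = 64,
		[6] = 512, [7] = 1024, [9] = 8,
	};

	/* Offset of @feature in a compacted buffer saved with request mask @rfbm */
	static unsigned int compacted_offset(unsigned long long rfbm, int feature)
	{
		unsigned int offset = 576;	/* legacy area + XSAVE header */
		int i;

		for (i = 2; i < feature; i++) {
			if (rfbm & (1ULL << i))
				offset += xstate_size[i];
		}
		return offset;
	}

	int main(void)
	{
		printf("PKRU @ %u (rfbm=0x2ff)\n", compacted_offset(0x2ff, 9)); /* 2560 */
		printf("PKRU @ %u (rfbm=0x202)\n", compacted_offset(0x202, 9)); /*  576 */
		return 0;
	}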

This optimization does not gain much for e.g. a kernel build, but for
context switch heavy applications it's very visible. Perf stats from
hackbench:

Before:

        242,618.89 msec task-clock                #  102.928 CPUs utilized            ( +-  0.20% )
         1,038,988      context-switches          #    0.004 M/sec                    ( +-  0.54% )
           460,081      cpu-migrations            #    0.002 M/sec                    ( +-  0.56% )
            10,813      page-faults               #    0.045 K/sec                    ( +-  0.62% )
   506,912,353,968      cycles                    #    2.089 GHz                      ( +-  0.20% )
   167,267,811,210      instructions              #    0.33  insn per cycle           ( +-  0.04% )
    34,481,978,727      branches                  #  142.124 M/sec                    ( +-  0.04% )
       305,975,304      branch-misses             #    0.89% of all branches          ( +-  0.09% )

           2.35717 +- 0.00607 seconds time elapsed  ( +-  0.26% )

   506,064,738,921      cycles                                                        ( +-  0.43% )
     3,334,160,871      L1-dcache-load-misses                                         ( +-  0.77% )
       135,271,979      dTLB-load-misses                                              ( +-  2.12% )
        18,169,634      dTLB-store-misses                                             ( +-  1.78% )

            2.3323 +- 0.0117 seconds time elapsed  ( +-  0.50% )

After:

        222,252.90 msec task-clock                #  103.800 CPUs utilized            ( +-  0.51% )
         1,004,665      context-switches          #    0.005 M/sec                    ( +-  0.42% )
           459,123      cpu-migrations            #    0.002 M/sec                    ( +-  0.33% )
            10,677      page-faults               #    0.048 K/sec                    ( +-  0.79% )
   464,356,465,870      cycles                    #    2.089 GHz                      ( +-  0.51% )
   166,615,501,152      instructions              #    0.36  insn per cycle           ( +-  0.05% )
    34,355,848,663      branches                  #  154.580 M/sec                    ( +-  0.05% )
       300,049,704      branch-misses             #    0.87% of all branches          ( +-  0.14% )

            2.1412 +- 0.0117 seconds time elapsed  ( +-  0.55% )

   473,864,807,936      cycles                                                        ( +-  0.64% )
     3,198,078,809      L1-dcache-load-misses                                         ( +-  0.24% )
        27,798,721      dTLB-load-misses                                              ( +-  2.33% )
         4,981,069      dTLB-store-misses                                             ( +-  1.80% )

            2.1733 +- 0.0132 seconds time elapsed  ( +-  0.61% )

The most significant change is in dTLB misses.

The effect depends on the application scenario, the kernel configuration
and the allocation placement of task_struct, so it might not be noticeable
at all. As the XGETBV1 optimization does not introduce measurable overhead,
it is worth using when supported by the hardware.

Enable it when available with a static key and mask out the non-active
states in the requested bitmap for XSAVEC/S.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
---
 arch/x86/kernel/fpu/xstate.c |   10 ++++++++--
 arch/x86/kernel/fpu/xstate.h |   16 +++++++++++++---
 2 files changed, 21 insertions(+), 5 deletions(-)

--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -86,6 +86,8 @@ static unsigned int xstate_flags[XFEATUR
 #define XSTATE_FLAG_SUPERVISOR	BIT(0)
 #define XSTATE_FLAG_ALIGNED64	BIT(1)
 
+DEFINE_STATIC_KEY_FALSE(__xsave_use_xgetbv1);
+
 /*
  * Return whether the system supports a given xfeature.
  *
@@ -1481,7 +1483,7 @@ void xfd_validate_state(struct fpstate *
 }
 #endif /* CONFIG_X86_DEBUG_FPU */
 
-static int __init xfd_update_static_branch(void)
+static int __init fpu_update_static_branches(void)
 {
 	/*
 	 * If init_fpstate.xfd has bits set then dynamic features are
@@ -1489,9 +1491,13 @@ static int __init xfd_update_static_bran
 	 */
 	if (init_fpstate.xfd)
 		static_branch_enable(&__fpu_state_size_dynamic);
+
+	if (cpu_feature_enabled(X86_FEATURE_XGETBV1) &&
+	    cpu_feature_enabled(X86_FEATURE_XCOMPACTED))
+		static_branch_enable(&__xsave_use_xgetbv1);
 	return 0;
 }
-arch_initcall(xfd_update_static_branch)
+arch_initcall(fpu_update_static_branches)
 
 void fpstate_free(struct fpu *fpu)
 {
--- a/arch/x86/kernel/fpu/xstate.h
+++ b/arch/x86/kernel/fpu/xstate.h
@@ -10,7 +10,12 @@
 DECLARE_PER_CPU(u64, xfd_state);
 #endif
 
-static inline bool xsave_use_xgetbv1(void) { return false; }
+DECLARE_STATIC_KEY_FALSE(__xsave_use_xgetbv1);
+
+static __always_inline __pure bool xsave_use_xgetbv1(void)
+{
+	return static_branch_likely(&__xsave_use_xgetbv1);
+}
 
 static inline void __xstate_init_xcomp_bv(struct xregs_state *xsave, u64 mask)
 {
@@ -185,13 +190,18 @@ static inline int __xfd_enable_feature(u
 static inline void os_xsave(struct fpstate *fpstate)
 {
 	u64 mask = fpstate->xfeatures;
-	u32 lmask = mask;
-	u32 hmask = mask >> 32;
+	u32 lmask, hmask;
 	int err;
 
 	WARN_ON_FPU(!alternatives_patched);
 	xfd_validate_state(fpstate, mask, false);
 
+	if (xsave_use_xgetbv1())
+		mask &= xgetbv(1);
+
+	lmask = mask;
+	hmask = mask >> 32;
+
 	XSTATE_XSAVE(&fpstate->regs.xsave, lmask, hmask, err);
 
 	/* We should never fault when copying to a kernel buffer: */



* Re: [patch 1/3] x86/fpu/xsave: Support XSAVEC in the kernel
  2022-04-04 12:11 ` [patch 1/3] x86/fpu/xsave: Support XSAVEC in the kernel Thomas Gleixner
@ 2022-04-04 16:10   ` Andrew Cooper
  2022-04-14 14:43   ` Dave Hansen
  2022-04-25 13:11   ` [tip: x86/fpu] " tip-bot2 for Thomas Gleixner
  2 siblings, 0 replies; 16+ messages in thread
From: Andrew Cooper @ 2022-04-04 16:10 UTC (permalink / raw)
  To: Thomas Gleixner, LKML; +Cc: x86, Edgecombe, Rick P

On 04/04/2022 13:11, Thomas Gleixner wrote:
> --- a/arch/x86/kernel/fpu/xstate.c
> +++ b/arch/x86/kernel/fpu/xstate.c
> @@ -623,6 +628,10 @@ static unsigned int __init get_xsaves_si
>  	 *    containing all the state components
>  	 *    corresponding to bits currently set in
>  	 *    XCR0 | IA32_XSS.
> +	 *
> +	 * When XSAVES is not available but XSAVEC (virt)

"but XSAVEC is" ?

~Andrew


* Re: [patch 1/3] x86/fpu/xsave: Support XSAVEC in the kernel
  2022-04-04 12:11 ` [patch 1/3] x86/fpu/xsave: Support XSAVEC in the kernel Thomas Gleixner
  2022-04-04 16:10   ` Andrew Cooper
@ 2022-04-14 14:43   ` Dave Hansen
  2022-04-25 13:11   ` [tip: x86/fpu] " tip-bot2 for Thomas Gleixner
  2 siblings, 0 replies; 16+ messages in thread
From: Dave Hansen @ 2022-04-14 14:43 UTC (permalink / raw)
  To: Thomas Gleixner, LKML; +Cc: x86, Andrew Cooper, Edgecombe, Rick P

On 4/4/22 05:11, Thomas Gleixner wrote:
>  - Create a software defined X86_FEATURE_XCOMPACTED to select the compacted
>    XSTATE buffer format for both XSAVEC and XSAVES.

We probably should have logically separated the compacted format bits
from the supervisor bits long ago.  I'm glad this is doing it now.


* Re: [patch 2/3] x86/fpu/xsave: Prepare for optimized compaction
  2022-04-04 12:11 ` [patch 2/3] x86/fpu/xsave: Prepare for optimized compaction Thomas Gleixner
@ 2022-04-14 15:46   ` Dave Hansen
  2022-04-19 12:39     ` Thomas Gleixner
  0 siblings, 1 reply; 16+ messages in thread
From: Dave Hansen @ 2022-04-14 15:46 UTC (permalink / raw)
  To: Thomas Gleixner, LKML; +Cc: x86, Andrew Cooper, Edgecombe, Rick P

[-- Attachment #1: Type: text/plain, Size: 6564 bytes --]

On 4/4/22 05:11, Thomas Gleixner wrote:
...
> For copy_uabi_to_xstate() it's more effort when supervisor states are
> supported and active. If the user space buffer has active states which are
> not in the set of current states of the XSTATE buffer, then the buffer
> layout will change which means the supervisor states might be overwritten.
> 
> Provide a function, which moves them out of the way and invoke it before
> any of the extended features is copied from the user space buffer.

This took me a minute to sort through.  There are three kinds of state
in the kernel buffer in copy_uabi_to_xstate().

1. User state, set in hdr.xfeatures, which is copied from the uabi buffer
2. User state, clear in hdr.xfeatures.  Feature will be set to its init
   state because xsave->header.xfeatures is cleared.
3. Supervisor state which is preserved.

I was confused for a *moment* because the space for a #2 might be
overwritten by a #1 copy-in from userspace.  But, that only happens for
states that will end up being set to init.  No harm, no foul.

Any interest in doing something like the attached to make
copy_uabi_to_xstate() easier to read?

> --- a/arch/x86/kernel/fpu/xstate.c
> +++ b/arch/x86/kernel/fpu/xstate.c
> @@ -1147,10 +1147,10 @@ void __copy_xstate_to_uabi_buf(struct me
>  			pkru.pkru = pkru_val;
>  			membuf_write(&to, &pkru, sizeof(pkru));
>  		} else {
> -			copy_feature(header.xfeatures & BIT_ULL(i), &to,
> -				     __raw_xsave_addr(xsave, i),
> -				     __raw_xsave_addr(xinit, i),
> -				     xstate_sizes[i]);
> +			bool active = header.xfeatures & BIT_ULL(i);
> +			void *from = __raw_xsave_addr(active ? xsave : xinit, i);
> +
> +			membuf_write(&to, from, xstate_sizes[i]);
>  		}
>  		/*
>  		 * Keep track of the last copied state in the non-compacted
> @@ -1195,6 +1195,73 @@ static int copy_from_buffer(void *dst, u
>  	return 0;
>  }
>  
> +/*
> + * Prepare the kernel XSTATE buffer for saving the user supplied XSTATE. If
> + * the XGETBV1 based optimization is in use then the kernel buffer might
> + * not have the user supplied active features set and an eventual active
> + * supervisor state has to be moved out of the way. With optimized
> + * compaction the features which are to be stored need to be set in the
> + * XCOMP_BV field of the XSTATE header.
> + */

What does "eventual active supervisor state" mean?  Just that the state
needs to be preserved since it needs to be restored eventually?  I found
that phrase a bit confusing.

I was thinking of a comment more along these lines:

/*
 * Adjust 'fpstate' so that it can additionally store all the xfeatures
 * set in 'xuser' when optimized compaction is enabled.  All bits in
 * 'xuser' will be set in XCOMP_BV.  Supervisor state will be preserved
 * and may be moved to make room for new 'xuser' states.  User state may
 * be destroyed.
 */

> +static void xsave_adjust_xcomp(struct fpstate *fpstate, u64 xuser)
> +{
> +	struct xregs_state *xsave = &fpstate->regs.xsave;
> +	u64 xtmp, xall, xbv, xcur = xsave->header.xfeatures;
> +	int i;
> +
> +	/* Nothing to do if optimized compaction is not in use */
> +	if (!xsave_use_xgetbv1())
> +		return;

The comment makes me wonder if we need a more descriptive name for
xsave_use_xgetbv1(), like:

	if (!xsave_do_optimized_compaction())
		return;

> +	/*
> +	 * Check whether the current xstate buffer contains supervisor
> +	 * state. If not, just set the user supplied features.
> +	 */
> +	if (!(xcur & XFEATURE_MASK_SUPERVISOR_ALL)) {
> +		__xstate_init_xcomp_bv(xsave, xuser);
> +		return;
> +	}
> +
> +	/*
> +	 * Set legacy features. They are at a fixed offset and do not
> +	 * affect the layout.
> +	 */
> +	xbv = xsave->header.xcomp_bv;
> +	xbv |= xuser & (XFEATURE_MASK_FP | XFEATURE_MASK_SSE);
> +
> +	/*
> +	 * Check whether there is new extended state in the user supplied
> +	 * buffer. If not, then nothing has to be moved around.
> +	 */
> +	if (!(xuser & ~xbv)) {
> +		__xstate_init_xcomp_bv(xsave, xbv);
> +		return;
> +	}
> +
> +	/*
> +	 * No more optimizations. Set the user features and move the
> +	 * supervisor state(s). If the new user feature is past
> +	 * the supervisor state, then the loop is moving nothing.
> +	 */
> +	xtmp = xbv & XFEATURE_MASK_SUPERVISOR_ALL;
> +	xall = xbv | xuser;


I'd probably at least comment why the loop is backwards:

	/*
	 * Features are only moved up in the buffer.  Start with the
	 * high features to avoid overwriting them with lower ones.
	 */

I know this is a very typical way to implement non-destructive moves,
but my stupid brain seems to default to assuming that for loops only go
forward.

> +	for (i = fls64(xtmp) - 1; i >= FIRST_EXTENDED_XFEATURE;
> +	     i = fls64(xtmp) - 1) {
> +		unsigned int to, from;

Is it worth a check here like:

		/* Do not move features in their init state: */
		if (!(xcur & BIT_ULL(i))) {
			xtmp &= ~BIT_ULL(i);
			continue;
		}

(probably also moving the &= around instead of copying it)

> +		from = xfeature_get_offset(xbv, i);
> +		to = xfeature_get_offset(xall, i);
> +		if (from < to) {
> +			memmove((void *)xsave + to, (void *)xsave + from,
> +				xstate_sizes[i]);
> +		} else {
> +			WARN_ON_ONCE(to < from);
> +		}
> +		xtmp &= ~BIT_ULL(i);
> +	}
> +	__xstate_init_xcomp_bv(xsave, xall);
> +}
>  
>  static int copy_uabi_to_xstate(struct fpstate *fpstate, const void *kbuf,
>  			       const void __user *ubuf)
> @@ -1232,6 +1299,8 @@ static int copy_uabi_to_xstate(struct fp
>  		}
>  	}
>  
> +	xsave_adjust_xcomp(fpstate, hdr.xfeatures);
> +
>  	for (i = 0; i < XFEATURE_MAX; i++) {
>  		u64 mask = ((u64)1 << i);
>  
> --- a/arch/x86/kernel/fpu/xstate.h
> +++ b/arch/x86/kernel/fpu/xstate.h
> @@ -10,14 +10,22 @@
>  DECLARE_PER_CPU(u64, xfd_state);
>  #endif
>  
> -static inline void xstate_init_xcomp_bv(struct xregs_state *xsave, u64 mask)
> +static inline bool xsave_use_xgetbv1(void) { return false; }
> +
> +static inline void __xstate_init_xcomp_bv(struct xregs_state *xsave, u64 mask)
>  {
>  	/*
>  	 * XRSTORS requires these bits set in xcomp_bv, or it will
> -	 * trigger #GP:
> +	 * trigger #GP. It's also required for XRSTOR when the buffer
> +	 * was saved with XSAVEC in compacted format.
>  	 */
> +	xsave->header.xcomp_bv = mask | XCOMP_BV_COMPACTED_FORMAT;
> +}
> +
> +static inline void xstate_init_xcomp_bv(struct xregs_state *xsave, u64 mask)
> +{
>  	if (cpu_feature_enabled(X86_FEATURE_XCOMPACTED))
> -		xsave->header.xcomp_bv = mask | XCOMP_BV_COMPACTED_FORMAT;
> +		__xstate_init_xcomp_bv(xsave, mask);
>  }
>  
>  static inline u64 xstate_get_group_perm(bool guest)

[-- Attachment #2: copy_uabi_to_xstate.1.patch --]
[-- Type: text/x-patch, Size: 2077 bytes --]

diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index 39e1c8626ab9..8a32a8398e1f 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -1178,6 +1178,15 @@ static int copy_from_buffer(void *dst, unsigned int offset, unsigned int size,
 }
 
 
+/*
+ * Replace all user xfeatures with data from 'ubuf'.  Features
+ * marked as init in 'ubuf' will be marked as init in the kernel
+ * buffer.  Supervisor xfeatures will be preserved.
+ *
+ * Returns 0 on success.  Non-zero return codes indicate that
+ * the kernel xsave buffer may have been left in an inconsistent
+ * state.  Caller must reset fpstate to recover.
+ */
 static int copy_uabi_to_xstate(struct fpstate *fpstate, const void *kbuf,
 			       const void __user *ubuf)
 {
@@ -1214,30 +1223,30 @@ static int copy_uabi_to_xstate(struct fpstate *fpstate, const void *kbuf,
 		}
 	}
 
+	/*
+	 * Leave only supervisor states in 'xfeatures'.  User xfeature
+	 * bits are set in the loop below.
+	 */
+	xsave->header.xfeatures &= XFEATURE_MASK_SUPERVISOR_ALL;
+
 	for (i = 0; i < XFEATURE_MAX; i++) {
 		u64 mask = ((u64)1 << i);
+		void *dst = __raw_xsave_addr(xsave, i);
 
-		if (hdr.xfeatures & mask) {
-			void *dst = __raw_xsave_addr(xsave, i);
-
-			offset = xstate_offsets[i];
-			size = xstate_sizes[i];
-
-			if (copy_from_buffer(dst, offset, size, kbuf, ubuf))
-				return -EFAULT;
+		if (!(hdr.xfeatures & mask)) {
+			/* Feature was marked as init in uabi buffer: */
+			xsave->header.xfeatures &= ~mask;
+			continue;
 		}
-	}
+		/* Feature was present in uabi buffer: */
+		xsave->header.xfeatures |= mask;
 
-	/*
-	 * The state that came in from userspace was user-state only.
-	 * Mask all the user states out of 'xfeatures':
-	 */
-	xsave->header.xfeatures &= XFEATURE_MASK_SUPERVISOR_ALL;
+		offset = xstate_offsets[i];
+		size = xstate_sizes[i];
 
-	/*
-	 * Add back in the features that came in from userspace:
-	 */
-	xsave->header.xfeatures |= hdr.xfeatures;
+		if (copy_from_buffer(dst, offset, size, kbuf, ubuf))
+			return -EFAULT;
+	}
 
 	return 0;
 }


* Re: [patch 3/3] x86/fpu/xsave: Optimize XSAVEC/S when XGETBV1 is supported
  2022-04-04 12:11 ` [patch 3/3] x86/fpu/xsave: Optimize XSAVEC/S when XGETBV1 is supported Thomas Gleixner
@ 2022-04-14 17:24   ` Dave Hansen
  2022-04-19 13:43     ` Thomas Gleixner
  0 siblings, 1 reply; 16+ messages in thread
From: Dave Hansen @ 2022-04-14 17:24 UTC (permalink / raw)
  To: Thomas Gleixner, LKML; +Cc: x86, Andrew Cooper, Edgecombe, Rick P

On 4/4/22 05:11, Thomas Gleixner wrote:
> A typical scenario is an active set of 0x202 (PKRU + SSE) out of the full
> supported set of 0x2FF. That means XSAVEC/S writes and XRSTOR[S] reads:

It might be worth reminding folks why PKRU is a special snowflake:

The default PKRU enforced by the kernel is its most restrictive possible
value (0xfffffffc).  This means that PKRU defaults to being in its
non-init state even for tasks which do nothing protection-keys-related.


> which is suboptimal. Prefetch works better when the access is linear. But
> what's worse is that PKRU can be located in a different page which
> obviously affects dTLB.

The numbers don't lie, but I'm still surprised by this.  Was this in a
VM that isn't backed with large pages?  task_struct.thread.fpu is
kmem_cache_alloc()'d and is in the direct map, which should be 2M/1G
pages almost all the time.

> --- a/arch/x86/kernel/fpu/xstate.c
> +++ b/arch/x86/kernel/fpu/xstate.c
> @@ -86,6 +86,8 @@ static unsigned int xstate_flags[XFEATUR
>  #define XSTATE_FLAG_SUPERVISOR	BIT(0)
>  #define XSTATE_FLAG_ALIGNED64	BIT(1)
>  
> +DEFINE_STATIC_KEY_FALSE(__xsave_use_xgetbv1);
> +
>  /*
>   * Return whether the system supports a given xfeature.
>   *
> @@ -1481,7 +1483,7 @@ void xfd_validate_state(struct fpstate *
>  }
>  #endif /* CONFIG_X86_DEBUG_FPU */
>  
> -static int __init xfd_update_static_branch(void)
> +static int __init fpu_update_static_branches(void)
>  {
>  	/*
>  	 * If init_fpstate.xfd has bits set then dynamic features are
> @@ -1489,9 +1491,13 @@ static int __init xfd_update_static_bran
>  	 */
>  	if (init_fpstate.xfd)
>  		static_branch_enable(&__fpu_state_size_dynamic);
> +
> +	if (cpu_feature_enabled(X86_FEATURE_XGETBV1) &&
> +	    cpu_feature_enabled(X86_FEATURE_XCOMPACTED))
> +		static_branch_enable(&__xsave_use_xgetbv1);
>  	return 0;
>  }
> -arch_initcall(xfd_update_static_branch)
> +arch_initcall(fpu_update_static_branches)
>  
>  void fpstate_free(struct fpu *fpu)
>  {
> --- a/arch/x86/kernel/fpu/xstate.h
> +++ b/arch/x86/kernel/fpu/xstate.h
> @@ -10,7 +10,12 @@
>  DECLARE_PER_CPU(u64, xfd_state);
>  #endif
>  
> -static inline bool xsave_use_xgetbv1(void) { return false; }
> +DECLARE_STATIC_KEY_FALSE(__xsave_use_xgetbv1);
> +
> +static __always_inline __pure bool xsave_use_xgetbv1(void)
> +{
> +	return static_branch_likely(&__xsave_use_xgetbv1);
> +}
>  
>  static inline void __xstate_init_xcomp_bv(struct xregs_state *xsave, u64 mask)
>  {
> @@ -185,13 +190,18 @@ static inline int __xfd_enable_feature(u
>  static inline void os_xsave(struct fpstate *fpstate)
>  {
>  	u64 mask = fpstate->xfeatures;
> -	u32 lmask = mask;
> -	u32 hmask = mask >> 32;
> +	u32 lmask, hmask;
>  	int err;
>  
>  	WARN_ON_FPU(!alternatives_patched);
>  	xfd_validate_state(fpstate, mask, false);
>  
> +	if (xsave_use_xgetbv1())
> +		mask &= xgetbv(1);

How about this comment for the masking operation:

	/*
	 * Remove features in their init state from the mask.  This
	 * makes the XSAVE{S,C} writes less sparse and quicker for
	 * the CPU.
	 */

> +	lmask = mask;
> +	hmask = mask >> 32;
> +
>  	XSTATE_XSAVE(&fpstate->regs.xsave, lmask, hmask, err);
>  
>  	/* We should never fault when copying to a kernel buffer: */



* Re: [patch 2/3] x86/fpu/xsave: Prepare for optimized compaction
  2022-04-14 15:46   ` Dave Hansen
@ 2022-04-19 12:39     ` Thomas Gleixner
  2022-04-19 13:33       ` Thomas Gleixner
  0 siblings, 1 reply; 16+ messages in thread
From: Thomas Gleixner @ 2022-04-19 12:39 UTC (permalink / raw)
  To: Dave Hansen, LKML; +Cc: x86, Andrew Cooper, Edgecombe, Rick P

On Thu, Apr 14 2022 at 08:46, Dave Hansen wrote:
> On 4/4/22 05:11, Thomas Gleixner wrote:
> Any interest in doing something like the attached to make
> copy_uabi_to_xstate() easier to read?

Yeah. I've picked it up.

>> +static void xsave_adjust_xcomp(struct fpstate *fpstate, u64 xuser)
>> +{
>> +	struct xregs_state *xsave = &fpstate->regs.xsave;
>> +	u64 xtmp, xall, xbv, xcur = xsave->header.xfeatures;
>> +	int i;
>> +
>> +	/* Nothing to do if optimized compaction is not in use */
>> +	if (!xsave_use_xgetbv1())
>> +		return;
>
> The comment makes me wonder if we need a more descriptive name for
> xsave_use_xgetbv1(), like:
>
> 	if (!xsave_do_optimized_compaction())
> 		return;

Makes sense.

>> +	/*
>> +	 * No more optimizations. Set the user features and move the
>> +	 * supervisor state(s). If the new user feature is past
>> +	 * the supervisor state, then the loop is moving nothing.
>> +	 */
>> +	xtmp = xbv & XFEATURE_MASK_SUPERVISOR_ALL;
>> +	xall = xbv | xuser;
>
>
> I'd probably at least comment why the loop is backwards:
>
> 	/*
> 	 * Features are only moved up in the buffer.  Start with the
> 	 * high features to avoid overwriting them with lower ones.
> 	 */
>
> I know this is a very typical way to implement non-destructive moves,
> but my stupid brain seems to default to assuming that for loops only go
> forward.

:)

>> +	for (i = fls64(xtmp) - 1; i >= FIRST_EXTENDED_XFEATURE;
>> +	     i = fls64(xtmp) - 1) {
>> +		unsigned int to, from;
>
> Is it worth a check here like:
>
> 		/* Do not move features in their init state: */
> 		if (!(xcur & BIT_ULL(i))) {
> 			xtmp &= ~BIT_ULL(i);
> 			continue;
> 		}

That would also require clearing the bit in xall, but we can't do that
in the loop as that affects offsets. Let me think about that.

Thanks,

        tglx


* Re: [patch 2/3] x86/fpu/xsave: Prepare for optimized compaction
  2022-04-19 12:39     ` Thomas Gleixner
@ 2022-04-19 13:33       ` Thomas Gleixner
  0 siblings, 0 replies; 16+ messages in thread
From: Thomas Gleixner @ 2022-04-19 13:33 UTC (permalink / raw)
  To: Dave Hansen, LKML; +Cc: x86, Andrew Cooper, Edgecombe, Rick P

On Tue, Apr 19 2022 at 14:39, Thomas Gleixner wrote:
> On Thu, Apr 14 2022 at 08:46, Dave Hansen wrote:
>>> +	for (i = fls64(xtmp) - 1; i >= FIRST_EXTENDED_XFEATURE;
>>> +	     i = fls64(xtmp) - 1) {
>>> +		unsigned int to, from;
>>
>> Is it worth a check here like:
>>
>> 		/* Do not move features in their init state: */
>> 		if (!(xcur & BIT_ULL(i))) {
>> 			xtmp &= ~BIT_ULL(i);
>> 			continue;
>> 		}
>
> That would also require clearing the bit in xall, but we can't do that
> in the loop as that affects offsets. Let me think about that.

Duh, misread it. Yes, it's possible to do that. OTOH, with the optimized
compaction there won't be a bit which is set in xbv but not in xfeatures,
except for the initial state when a task starts.

Thanks,

        tglx



* Re: [patch 3/3] x86/fpu/xsave: Optimize XSAVEC/S when XGETBV1 is supported
  2022-04-14 17:24   ` Dave Hansen
@ 2022-04-19 13:43     ` Thomas Gleixner
  2022-04-19 21:22       ` Thomas Gleixner
  0 siblings, 1 reply; 16+ messages in thread
From: Thomas Gleixner @ 2022-04-19 13:43 UTC (permalink / raw)
  To: Dave Hansen, LKML; +Cc: x86, Andrew Cooper, Edgecombe, Rick P

On Thu, Apr 14 2022 at 10:24, Dave Hansen wrote:
> On 4/4/22 05:11, Thomas Gleixner wrote:
>> which is suboptimal. Prefetch works better when the access is linear. But
>> what's worse is that PKRU can be located in a different page which
>> obviously affects dTLB.
>
> The numbers don't lie, but I'm still surprised by this.  Was this in a
> VM that isn't backed with large pages?  task_struct.thread.fpu is
> kmem_cache_alloc()'d and is in the direct map, which should be 2M/1G
> pages almost all the time.

Hmm. Indeed, that's weird.

That was bare metal and I just checked that this was a production config
and not some weird debug muck which breaks large pages. I'll look deeper
into that.

Thanks,

        tglx





* Re: [patch 3/3] x86/fpu/xsave: Optimize XSAVEC/S when XGETBV1 is supported
  2022-04-19 13:43     ` Thomas Gleixner
@ 2022-04-19 21:22       ` Thomas Gleixner
  2022-04-20 18:15         ` Tom Lendacky
  0 siblings, 1 reply; 16+ messages in thread
From: Thomas Gleixner @ 2022-04-19 21:22 UTC (permalink / raw)
  To: Dave Hansen, LKML; +Cc: x86, Andrew Cooper, Edgecombe, Rick P, Tom Lendacky

On Tue, Apr 19 2022 at 15:43, Thomas Gleixner wrote:
> On Thu, Apr 14 2022 at 10:24, Dave Hansen wrote:
>> On 4/4/22 05:11, Thomas Gleixner wrote:
>>> which is suboptimal. Prefetch works better when the access is linear. But
>>> what's worse is that PKRU can be located in a different page which
>>> obviously affects dTLB.
>>
>> The numbers don't lie, but I'm still surprised by this.  Was this in a
>> VM that isn't backed with large pages?  task_struct.thread.fpu is
>> kmem_cache_alloc()'d and is in the direct map, which should be 2M/1G
>> pages almost all the time.
>
> Hmm. Indeed, that's weird.
>
> That was bare metal and I just checked that this was a production config
> and not some weird debug muck which breaks large pages. I'll look deeper
> into that.

I can't find any reasonable explanation. The pages are definitely large
pages, so yes the dTLB miss count does not make sense, but it's
consistently faster and it's always the dTLB miss count which makes the
big difference according to perf.

For enhanced fun, I ran the lot on an AMD Zen3 machine and with the same
test case (hackbench -l 10000) repeated 10 times by perf stat this is
consistently slower than the non-optimized variant. There is at least an
explanation for that. A tight loop of 1 Mio xgetbv(1) invocations takes
9 Mio cycles on a SKL-X and 50 Mio cycles on an AMD Zen3.
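
Such a measurement can be reproduced with a trivial user space loop along
these lines (illustrative sketch, assuming XGETBV1 support and a recent
GCC/Clang, built with -O2 -mxsave):

	#include <immintrin.h>
	#include <x86intrin.h>
	#include <stdio.h>

	int main(void)
	{
		unsigned long long sum = 0, start, end;
		int i;

		start = __rdtsc();
		for (i = 0; i < 1000000; i++)
			sum += _xgetbv(1);	/* read the XINUSE bitmap */
		end = __rdtsc();

		printf("1 Mio XGETBV1: %llu cycles (accum %#llx)\n",
		       end - start, sum);
		return 0;
	}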

XSAVE is wonderful, isn't it?

Thanks,

        tglx


* Re: [patch 3/3] x86/fpu/xsave: Optimize XSAVEC/S when XGETBV1 is supported
  2022-04-19 21:22       ` Thomas Gleixner
@ 2022-04-20 18:15         ` Tom Lendacky
  2022-04-22 19:30           ` Thomas Gleixner
  0 siblings, 1 reply; 16+ messages in thread
From: Tom Lendacky @ 2022-04-20 18:15 UTC (permalink / raw)
  To: Thomas Gleixner, Dave Hansen, LKML; +Cc: x86, Andrew Cooper, Edgecombe, Rick P

On 4/19/22 16:22, Thomas Gleixner wrote:
> On Tue, Apr 19 2022 at 15:43, Thomas Gleixner wrote:
>> On Thu, Apr 14 2022 at 10:24, Dave Hansen wrote:
>>> On 4/4/22 05:11, Thomas Gleixner wrote:
>>>> which is suboptimal. Prefetch works better when the access is linear. But
>>>> what's worse is that PKRU can be located in a different page which
>>>> obviously affects dTLB.
>>>
>>> The numbers don't lie, but I'm still surprised by this.  Was this in a
>>> VM that isn't backed with large pages?  task_struct.thread.fpu is
>>> kmem_cache_alloc()'d and is in the direct map, which should be 2M/1G
>>> pages almost all the time.
>>
>> Hmm. Indeed, that's weird.
>>
>> That was bare metal and I just checked that this was a production config
>> and not some weird debug muck which breaks large pages. I'll look deeper
>> into that.
> 
> I can't find any reasonable explanation. The pages are definitely large
> pages, so yes the dTLB miss count does not make sense, but it's
> consistently faster and it's always the dTLB miss count which makes the
> big difference according to perf.
> 
> For enhanced fun, I ran the lot on an AMD Zen3 machine and with the same
> test case (hackbench -l 10000) repeated 10 times by perf stat this is
> consistently slower than the non-optimized variant. There is at least an
> explanation for that. A tight loop of 1 Mio xgetbv(1) invocations takes
> 9 Mio cycles on a SKL-X and 50 Mio cycles on an AMD Zen3.

I'll take a look into this and see what I find. Might be interesting to 
see if the actual XSAVES is slower or quicker, too, based on the input mask.

If the performance slowdown shows up in real world benchmarks, we might 
want to consider not using the xgetbv() call on AMD.

Thanks,
Tom

> 
> XSAVE is wonderful, isn't it?
> 
> Thanks,
> 
>          tglx


* Re: [patch 3/3] x86/fpu/xsave: Optimize XSAVEC/S when XGETBV1 is supported
  2022-04-20 18:15         ` Tom Lendacky
@ 2022-04-22 19:30           ` Thomas Gleixner
  2022-04-23 15:20             ` Dave Hansen
  0 siblings, 1 reply; 16+ messages in thread
From: Thomas Gleixner @ 2022-04-22 19:30 UTC (permalink / raw)
  To: Tom Lendacky, Dave Hansen, LKML; +Cc: x86, Andrew Cooper, Edgecombe, Rick P

On Wed, Apr 20 2022 at 13:15, Tom Lendacky wrote:
> On 4/19/22 16:22, Thomas Gleixner wrote:
>>> That was bare metal and I just checked that this was a production config
>>> and not some weird debug muck which breaks large pages. I'll look deeper
>>> into that.
>> 
>> I can't find any reasonable explanation. The pages are definitely large
>> pages, so yes the dTLB miss count does not make sense, but it's
>> consistently faster and it's always the dTLB miss count which makes the
>> big difference according to perf.
>> 
>> For enhanced fun, I ran the lot on an AMD Zen3 machine and with the same
>> test case (hackbench -l 10000) repeated 10 times by perf stat this is
>> consistently slower than the non-optimized variant. There is at least an
>> explanation for that. A tight loop of 1 Mio xgetbv(1) invocations takes
>> 9 Mio cycles on a SKL-X and 50 Mio cycles on an AMD Zen3.
>
> I'll take a look into this and see what I find. Might be interesting to 
> see if the actual XSAVES is slower or quicker, too, based on the input mask.
>
> If the performance slowdown shows up in real world benchmarks, we might 
> want to consider not using the xgetbv() call on AMD.

As things stand now, I'm not going to pursue this further at the moment.

The effect on SKL-X is not explainable; especially the dTLB miss count
decrease does not make any sense. Aside from that, I just figured out that
it is very sensitive to kernel configurations and I have no idea yet
what exactly is the screw to turn to make the effect come and go.

So I just go and add the XSAVEC support alone as that's actually
something which _is_ beneficial for guests.

Thanks,

        tglx



* Re: [patch 3/3] x86/fpu/xsave: Optimize XSAVEC/S when XGETBV1 is supported
  2022-04-22 19:30           ` Thomas Gleixner
@ 2022-04-23 15:20             ` Dave Hansen
  0 siblings, 0 replies; 16+ messages in thread
From: Dave Hansen @ 2022-04-23 15:20 UTC (permalink / raw)
  To: Thomas Gleixner, Tom Lendacky, LKML; +Cc: x86, Andrew Cooper, Edgecombe, Rick P

On 4/22/22 12:30, Thomas Gleixner wrote:
> So I just go and add the XSAVEC support alone as that's actually
> something which _is_ beneficial for guests.

Yeah, agreed.

When I went to test these patches, a bit loop with XSAVEC was ~10%
faster than XSAVEOPT.  This system has XSAVES, so it wouldn't have been
using XSAVEOPT in the first place in the kernel.  But, this is at least
one more data point in favor of XSAVEC.
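
For reference, a comparable micro-benchmark can be sketched in user space
(illustrative only; XSAVEC and XSAVEOPT are usable at CPL3, assuming AVX is
enabled in XCR0 and the toolchain provides the _xsavec()/_xsaveopt()
intrinsics, built with -O2 -mxsave -mxsavec -mxsaveopt). The numbers depend
heavily on the request mask and on how much of the state is in init or
unmodified state, so they are only a rough indicator:

	#include <immintrin.h>
	#include <x86intrin.h>
	#include <stdio.h>

	#define MASK	0x7ULL	/* x87 | SSE | AVX */

	static unsigned char cbuf[4096] __attribute__((aligned(64)));
	static unsigned char obuf[4096] __attribute__((aligned(64)));

	int main(void)
	{
		unsigned long long t0, t1, t2;
		int i;

		t0 = __rdtsc();
		for (i = 0; i < 1000000; i++)
			_xsavec(cbuf, MASK);
		t1 = __rdtsc();
		for (i = 0; i < 1000000; i++)
			_xsaveopt(obuf, MASK);
		t2 = __rdtsc();

		printf("XSAVEC:   %llu cycles\n", t1 - t0);
		printf("XSAVEOPT: %llu cycles\n", t2 - t1);
		return 0;
	}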


* [tip: x86/fpu] x86/fpu/xsave: Support XSAVEC in the kernel
  2022-04-04 12:11 ` [patch 1/3] x86/fpu/xsave: Support XSAVEC in the kernel Thomas Gleixner
  2022-04-04 16:10   ` Andrew Cooper
  2022-04-14 14:43   ` Dave Hansen
@ 2022-04-25 13:11   ` tip-bot2 for Thomas Gleixner
  2 siblings, 0 replies; 16+ messages in thread
From: tip-bot2 for Thomas Gleixner @ 2022-04-25 13:11 UTC (permalink / raw)
  To: linux-tip-commits; +Cc: Thomas Gleixner, x86, linux-kernel

The following commit has been merged into the x86/fpu branch of tip:

Commit-ID:     8ad7e8f696951f192c6629a0cbda9ac94c773159
Gitweb:        https://git.kernel.org/tip/8ad7e8f696951f192c6629a0cbda9ac94c773159
Author:        Thomas Gleixner <tglx@linutronix.de>
AuthorDate:    Mon, 04 Apr 2022 14:11:25 +02:00
Committer:     Thomas Gleixner <tglx@linutronix.de>
CommitterDate: Mon, 25 Apr 2022 15:05:37 +02:00

x86/fpu/xsave: Support XSAVEC in the kernel

XSAVEC is the user space counterpart of XSAVES; unlike XSAVES it cannot save
supervisor state. In virtualization scenarios the hypervisor does not expose
XSAVES but does expose XSAVEC to the guest, yet the kernel does not make use
of XSAVEC.

That's unfortunate because XSAVEC uses the compacted format of saving the
XSTATE. This is more efficient in terms of storage space vs. XSAVE[OPT] as
it does not create holes for XSTATE components which are not supported or
enabled by the kernel but are available in hardware. There is room for
further optimizations when XSAVEC/S and XGETBV1 are supported.

In order to support XSAVEC:

 - Define the XSAVEC ASM macro as it's not yet supported by the required
   minimal toolchain.

 - Create a software defined X86_FEATURE_XCOMPACTED to select the compacted
   XSTATE buffer format for both XSAVEC and XSAVES.

 - Make XSAVEC an option in the 'XSAVE' ASM alternatives

Requested-by: Andrew Cooper <Andrew.Cooper3@citrix.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/r/20220404104820.598704095@linutronix.de

---
 arch/x86/include/asm/cpufeatures.h |  2 +-
 arch/x86/kernel/fpu/xstate.c       | 58 +++++++++++++++++++----------
 arch/x86/kernel/fpu/xstate.h       | 14 ++++---
 3 files changed, 48 insertions(+), 26 deletions(-)

diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index 73e643a..ff08da8 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -201,7 +201,7 @@
 #define X86_FEATURE_INVPCID_SINGLE	( 7*32+ 7) /* Effectively INVPCID && CR4.PCIDE=1 */
 #define X86_FEATURE_HW_PSTATE		( 7*32+ 8) /* AMD HW-PState */
 #define X86_FEATURE_PROC_FEEDBACK	( 7*32+ 9) /* AMD ProcFeedbackInterface */
-/* FREE!                                ( 7*32+10) */
+#define X86_FEATURE_XCOMPACTED		( 7*32+10) /* "" Use compacted XSTATE (XSAVES or XSAVEC) */
 #define X86_FEATURE_PTI			( 7*32+11) /* Kernel Page Table Isolation enabled */
 #define X86_FEATURE_RETPOLINE		( 7*32+12) /* "" Generic Retpoline mitigation for Spectre variant 2 */
 #define X86_FEATURE_RETPOLINE_LFENCE	( 7*32+13) /* "" Use LFENCE for Spectre variant 2 */
diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index 39e1c86..31c12f4 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -142,7 +142,8 @@ static unsigned int xfeature_get_offset(u64 xcomp_bv, int xfeature)
 	 * Non-compacted format and legacy features use the cached fixed
 	 * offsets.
 	 */
-	if (!cpu_feature_enabled(X86_FEATURE_XSAVES) || xfeature <= XFEATURE_SSE)
+	if (!cpu_feature_enabled(X86_FEATURE_XCOMPACTED) ||
+	    xfeature <= XFEATURE_SSE)
 		return xstate_offsets[xfeature];
 
 	/*
@@ -369,12 +370,12 @@ static void __init setup_init_fpu_buf(void)
 	/*
 	 * All components are now in init state. Read the state back so
 	 * that init_fpstate contains all non-zero init state. This only
-	 * works with XSAVE, but not with XSAVEOPT and XSAVES because
+	 * works with XSAVE, but not with XSAVEOPT and XSAVEC/S because
 	 * those use the init optimization which skips writing data for
 	 * components in init state.
 	 *
 	 * XSAVE could be used, but that would require to reshuffle the
-	 * data when XSAVES is available because XSAVES uses xstate
+	 * data when XSAVEC/S is available because XSAVEC/S uses xstate
 	 * compaction. But doing so is a pointless exercise because most
 	 * components have an all zeros init state except for the legacy
 	 * ones (FP and SSE). Those can be saved with FXSAVE into the
@@ -584,7 +585,8 @@ static unsigned int xstate_calculate_size(u64 xfeatures, bool compacted)
  */
 static bool __init paranoid_xstate_size_valid(unsigned int kernel_size)
 {
-	bool compacted = cpu_feature_enabled(X86_FEATURE_XSAVES);
+	bool compacted = cpu_feature_enabled(X86_FEATURE_XCOMPACTED);
+	bool xsaves = cpu_feature_enabled(X86_FEATURE_XSAVES);
 	unsigned int size = FXSAVE_SIZE + XSAVE_HDR_SIZE;
 	int i;
 
@@ -595,7 +597,7 @@ static bool __init paranoid_xstate_size_valid(unsigned int kernel_size)
 		 * Supervisor state components can be managed only by
 		 * XSAVES.
 		 */
-		if (!compacted && xfeature_is_supervisor(i)) {
+		if (!xsaves && xfeature_is_supervisor(i)) {
 			XSTATE_WARN_ON(1);
 			return false;
 		}
@@ -612,8 +614,11 @@ static bool __init paranoid_xstate_size_valid(unsigned int kernel_size)
  * the size of the *user* states.  If we use it to size a buffer
  * that we use 'XSAVES' on, we could potentially overflow the
  * buffer because 'XSAVES' saves system states too.
+ *
+ * This also takes compaction into account. So this works for
+ * XSAVEC as well.
  */
-static unsigned int __init get_xsaves_size(void)
+static unsigned int __init get_compacted_size(void)
 {
 	unsigned int eax, ebx, ecx, edx;
 	/*
@@ -623,6 +628,10 @@ static unsigned int __init get_xsaves_size(void)
 	 *    containing all the state components
 	 *    corresponding to bits currently set in
 	 *    XCR0 | IA32_XSS.
+	 *
+	 * When XSAVES is not available but XSAVEC is (virt), then there
+	 * are no supervisor states, but XSAVEC still uses compacted
+	 * format.
 	 */
 	cpuid_count(XSTATE_CPUID, 1, &eax, &ebx, &ecx, &edx);
 	return ebx;
@@ -632,13 +641,13 @@ static unsigned int __init get_xsaves_size(void)
  * Get the total size of the enabled xstates without the independent supervisor
  * features.
  */
-static unsigned int __init get_xsaves_size_no_independent(void)
+static unsigned int __init get_xsave_compacted_size(void)
 {
 	u64 mask = xfeatures_mask_independent();
 	unsigned int size;
 
 	if (!mask)
-		return get_xsaves_size();
+		return get_compacted_size();
 
 	/* Disable independent features. */
 	wrmsrl(MSR_IA32_XSS, xfeatures_mask_supervisor());
@@ -647,7 +656,7 @@ static unsigned int __init get_xsaves_size_no_independent(void)
 	 * Ask the hardware what size is required of the buffer.
 	 * This is the size required for the task->fpu buffer.
 	 */
-	size = get_xsaves_size();
+	size = get_compacted_size();
 
 	/* Re-enable independent features so XSAVES will work on them again. */
 	wrmsrl(MSR_IA32_XSS, xfeatures_mask_supervisor() | mask);
@@ -687,20 +696,21 @@ static int __init init_xstate_size(void)
 {
 	/* Recompute the context size for enabled features: */
 	unsigned int user_size, kernel_size, kernel_default_size;
-	bool compacted = cpu_feature_enabled(X86_FEATURE_XSAVES);
+	bool compacted = cpu_feature_enabled(X86_FEATURE_XCOMPACTED);
 
 	/* Uncompacted user space size */
 	user_size = get_xsave_size_user();
 
 	/*
-	 * XSAVES kernel size includes supervisor states and
-	 * uses compacted format when available.
+	 * XSAVES kernel size includes supervisor states and uses compacted
+	 * format. XSAVEC uses compacted format, but does not save
+	 * supervisor states.
 	 *
-	 * XSAVE does not support supervisor states so
-	 * kernel and user size is identical.
+	 * XSAVE[OPT] do not support supervisor states so kernel and user
+	 * size is identical.
 	 */
 	if (compacted)
-		kernel_size = get_xsaves_size_no_independent();
+		kernel_size = get_xsave_compacted_size();
 	else
 		kernel_size = user_size;
 
@@ -813,8 +823,11 @@ void __init fpu__init_system_xstate(unsigned int legacy_size)
 	if (!cpu_feature_enabled(X86_FEATURE_XFD))
 		fpu_kernel_cfg.max_features &= ~XFEATURE_MASK_USER_DYNAMIC;
 
-	fpu_kernel_cfg.max_features &= XFEATURE_MASK_USER_SUPPORTED |
-			      XFEATURE_MASK_SUPERVISOR_SUPPORTED;
+	if (!cpu_feature_enabled(X86_FEATURE_XSAVES))
+		fpu_kernel_cfg.max_features &= XFEATURE_MASK_USER_SUPPORTED;
+	else
+		fpu_kernel_cfg.max_features &= XFEATURE_MASK_USER_SUPPORTED |
+					XFEATURE_MASK_SUPERVISOR_SUPPORTED;
 
 	fpu_user_cfg.max_features = fpu_kernel_cfg.max_features;
 	fpu_user_cfg.max_features &= XFEATURE_MASK_USER_SUPPORTED;
@@ -837,6 +850,11 @@ void __init fpu__init_system_xstate(unsigned int legacy_size)
 	 */
 	init_fpstate.xfd = fpu_user_cfg.max_features & XFEATURE_MASK_USER_DYNAMIC;
 
+	/* Set up compaction feature bit */
+	if (cpu_feature_enabled(X86_FEATURE_XSAVEC) ||
+	    cpu_feature_enabled(X86_FEATURE_XSAVES))
+		setup_force_cpu_cap(X86_FEATURE_XCOMPACTED);
+
 	/* Enable xstate instructions to be able to continue with initialization: */
 	fpu__init_cpu_xstate();
 
@@ -873,7 +891,7 @@ void __init fpu__init_system_xstate(unsigned int legacy_size)
 	pr_info("x86/fpu: Enabled xstate features 0x%llx, context size is %d bytes, using '%s' format.\n",
 		fpu_kernel_cfg.max_features,
 		fpu_kernel_cfg.max_size,
-		boot_cpu_has(X86_FEATURE_XSAVES) ? "compacted" : "standard");
+		boot_cpu_has(X86_FEATURE_XCOMPACTED) ? "compacted" : "standard");
 	return;
 
 out_disable:
@@ -917,7 +935,7 @@ static void *__raw_xsave_addr(struct xregs_state *xsave, int xfeature_nr)
 	if (WARN_ON_ONCE(!xfeature_enabled(xfeature_nr)))
 		return NULL;
 
-	if (cpu_feature_enabled(X86_FEATURE_XSAVES)) {
+	if (cpu_feature_enabled(X86_FEATURE_XCOMPACTED)) {
 		if (WARN_ON_ONCE(!(xcomp_bv & BIT_ULL(xfeature_nr))))
 			return NULL;
 	}
@@ -1525,7 +1543,7 @@ static int __xstate_request_perm(u64 permitted, u64 requested, bool guest)
 	 * vendors into extending XFD for the pre AMX states, especially
 	 * AVX512.
 	 */
-	bool compacted = cpu_feature_enabled(X86_FEATURE_XSAVES);
+	bool compacted = cpu_feature_enabled(X86_FEATURE_XCOMPACTED);
 	struct fpu *fpu = &current->group_leader->thread.fpu;
 	struct fpu_state_perm *perm;
 	unsigned int ksize, usize;
diff --git a/arch/x86/kernel/fpu/xstate.h b/arch/x86/kernel/fpu/xstate.h
index d22ace0..5ad4703 100644
--- a/arch/x86/kernel/fpu/xstate.h
+++ b/arch/x86/kernel/fpu/xstate.h
@@ -16,7 +16,7 @@ static inline void xstate_init_xcomp_bv(struct xregs_state *xsave, u64 mask)
 	 * XRSTORS requires these bits set in xcomp_bv, or it will
 	 * trigger #GP:
 	 */
-	if (cpu_feature_enabled(X86_FEATURE_XSAVES))
+	if (cpu_feature_enabled(X86_FEATURE_XCOMPACTED))
 		xsave->header.xcomp_bv = mask | XCOMP_BV_COMPACTED_FORMAT;
 }
 
@@ -79,6 +79,7 @@ static inline u64 xfeatures_mask_independent(void)
 /* These macros all use (%edi)/(%rdi) as the single memory argument. */
 #define XSAVE		".byte " REX_PREFIX "0x0f,0xae,0x27"
 #define XSAVEOPT	".byte " REX_PREFIX "0x0f,0xae,0x37"
+#define XSAVEC		".byte " REX_PREFIX "0x0f,0xc7,0x27"
 #define XSAVES		".byte " REX_PREFIX "0x0f,0xc7,0x2f"
 #define XRSTOR		".byte " REX_PREFIX "0x0f,0xae,0x2f"
 #define XRSTORS		".byte " REX_PREFIX "0x0f,0xc7,0x1f"
@@ -97,9 +98,11 @@ static inline u64 xfeatures_mask_independent(void)
 		     : "memory")
 
 /*
- * If XSAVES is enabled, it replaces XSAVEOPT because it supports a compact
- * format and supervisor states in addition to modified optimization in
- * XSAVEOPT.
+ * If XSAVES is enabled, it replaces XSAVEC because it supports supervisor
+ * states in addition to XSAVEC.
+ *
+ * Otherwise if XSAVEC is enabled, it replaces XSAVEOPT because it supports
+ * compacted storage format in addition to XSAVEOPT.
  *
  * Otherwise, if XSAVEOPT is enabled, XSAVEOPT replaces XSAVE because XSAVEOPT
  * supports modified optimization which is not supported by XSAVE.
@@ -111,8 +114,9 @@ static inline u64 xfeatures_mask_independent(void)
  * address of the instruction where we might get an exception at.
  */
 #define XSTATE_XSAVE(st, lmask, hmask, err)				\
-	asm volatile(ALTERNATIVE_2(XSAVE,				\
+	asm volatile(ALTERNATIVE_3(XSAVE,				\
 				   XSAVEOPT, X86_FEATURE_XSAVEOPT,	\
+				   XSAVEC,   X86_FEATURE_XSAVEC,	\
 				   XSAVES,   X86_FEATURE_XSAVES)	\
 		     "\n"						\
 		     "xor %[err], %[err]\n"				\

^ permalink raw reply related	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2022-04-25 13:11 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-04-04 12:11 [patch 0/3] x86/fpu/xsave: Add XSAVEC support and XGETBV1 utilization Thomas Gleixner
2022-04-04 12:11 ` [patch 1/3] x86/fpu/xsave: Support XSAVEC in the kernel Thomas Gleixner
2022-04-04 16:10   ` Andrew Cooper
2022-04-14 14:43   ` Dave Hansen
2022-04-25 13:11   ` [tip: x86/fpu] " tip-bot2 for Thomas Gleixner
2022-04-04 12:11 ` [patch 2/3] x86/fpu/xsave: Prepare for optimized compaction Thomas Gleixner
2022-04-14 15:46   ` Dave Hansen
2022-04-19 12:39     ` Thomas Gleixner
2022-04-19 13:33       ` Thomas Gleixner
2022-04-04 12:11 ` [patch 3/3] x86/fpu/xsave: Optimize XSAVEC/S when XGETBV1 is supported Thomas Gleixner
2022-04-14 17:24   ` Dave Hansen
2022-04-19 13:43     ` Thomas Gleixner
2022-04-19 21:22       ` Thomas Gleixner
2022-04-20 18:15         ` Tom Lendacky
2022-04-22 19:30           ` Thomas Gleixner
2022-04-23 15:20             ` Dave Hansen

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).