* [RFC PATCH 00/22] x86: Support Intel Advanced Matrix Extensions
@ 2020-10-01 20:38 Chang S. Bae
  2020-10-01 20:38 ` [RFC PATCH 01/22] x86/fpu/xstate: Modify area init helper prototypes to access all the possible areas Chang S. Bae
                   ` (21 more replies)
  0 siblings, 22 replies; 43+ messages in thread
From: Chang S. Bae @ 2020-10-01 20:38 UTC (permalink / raw)
  To: tglx, mingo, bp, luto, x86
  Cc: len.brown, dave.hansen, jing2.liu, ravi.v.shankar, linux-kernel,
	chang.seok.bae

Intel Advanced Matrix Extensions (AMX)[1][2] will be shipping on servers
soon.  AMX consists of configurable TMM "TILE" registers plus new
accelerator instructions that operate on them.  TMUL (Tile matrix MULtiply)
is the first accelerator instruction set to use the new registers, and we
anticipate additional instructions in the future.

Neither AMX state nor TMUL instructions depend on AVX.  However, AMX and
AVX do share common challenges.  The TMM registers are 8KB today, and can
architecturally grow as large as 64KB, which merits updates to hardware and
software state management.

Further, both technologies run faster when they are not simultaneously
running on SMT siblings, and both technologies' use of power and bandwidth
impacts the power and performance available to neighboring cores.  (This
impact has measurably improved in recent hardware.)

If the existing kernel approach for managing XSAVE state were employed to
handle AMX, 8KB of space would be added to every task, yet possibly rarely
used.  So Linux support is optimized by using a new XSAVE feature: eXtended
Feature Disabling (XFD).  The kernel arms XFD to provide a #NM exception
upon a task's first access to TILE state.  The kernel exception handler
installs the appropriate XSAVE context switch buffer, and the task behaves
as if the kernel had done that for all tasks.  Using XFD, AMX space is
allocated only when needed, eliminating the memory waste for unused state
components.
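
Very roughly, the first-use flow looks like this (an illustrative sketch
only; the helper names below are placeholders, not functions from this
series):

	/* #NM exception handler, sketched: */
	void handle_nm_xfd(struct fpu *fpu)
	{
		/* Was #NM raised because XFD is armed for TILE state? */
		if (!xfd_event_pending(fpu))
			return;			/* normal #NM handling */

		/* First touch: allocate the larger XSAVE buffer ... */
		expand_xstate_buffer(fpu);

		/* ... and disarm XFD so the task can use TILE state. */
		disarm_xfd(fpu);
	}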

This series requires the new minimum sigaltstack support [4] and is based
on the mainline with dynamic supervisor state support [3]. The series is
composed of three parts:
* Patches 1-16: Foundation to support dynamic user state management, in
                preparation for managing the tile data state.
* Patches 17-21: Actual AMX enablement, including unit tests
* Patch 22: Introduce boot parameters

Thanks to Len Brown and Dave Hansen for help with the cover letter.

[1]: Intel Architecture Instruction Set Extension Programming Reference
     June 2020, https://software.intel.com/content/dam/develop/public/us/en/documents/architecture-instruction-set-extensions-programming-reference.pdf
[2]: https://software.intel.com/content/www/us/en/develop/documentation/cpp-compiler-developer-guide-and-reference/top/compiler-reference/intrinsics/intrinsics-for-intel-advanced-matrix-extensions-intel-amx-instructions.html
[3]: https://lore.kernel.org/lkml/1593780569-62993-1-git-send-email-kan.liang@linux.intel.com/
[4]: https://lore.kernel.org/lkml/20200929205746.6763-1-chang.seok.bae@intel.com/

Chang S. Bae (22):
  x86/fpu/xstate: Modify area init helper prototypes to access all the
    possible areas
  x86/fpu/xstate: Modify xstate copy helper prototypes to access all the
    possible areas
  x86/fpu/xstate: Modify address finder prototypes to access all the
    possible areas
  x86/fpu/xstate: Modify save and restore helper prototypes to access
    all the possible areas
  x86/fpu/xstate: Introduce a new variable for dynamic user states
  x86/fpu/xstate: Outline dynamic xstate area size in the task context
  x86/fpu/xstate: Introduce helpers to manage an xstate area dynamically
  x86/fpu/xstate: Define the scope of the initial xstate data
  x86/fpu/xstate: Introduce wrapper functions for organizing xstate area
    access
  x86/fpu/xstate: Update xstate save function for supporting dynamic
    user xstate
  x86/fpu/xstate: Update xstate area address finder for supporting
    dynamic user xstate
  x86/fpu/xstate: Update xstate context copy function for supporting
    dynamic area
  x86/fpu/xstate: Expand dynamic user state area on first use
  x86/fpu/xstate: Inherit dynamic user state when used in the parent
  x86/fpu/xstate: Support ptracer-induced xstate area expansion
  x86/fpu/xstate: Support dynamic user state in the signal handling path
  x86/fpu/xstate: Extend the table for mapping xstate components with
    features
  x86/cpufeatures/amx: Enumerate Advanced Matrix Extension (AMX) feature
    bits
  x86/fpu/amx: Define AMX state components and have it used for
    boot-time checks
  x86/fpu/amx: Enable the AMX feature in 64-bit mode
  selftest/x86/amx: Include test cases for the AMX state management
  x86/fpu/xstate: Introduce boot-parameters to control some state
    component support

 .../admin-guide/kernel-parameters.txt         |  15 +
 arch/x86/include/asm/cpufeatures.h            |   4 +
 arch/x86/include/asm/fpu/internal.h           |  99 ++-
 arch/x86/include/asm/fpu/types.h              |  62 +-
 arch/x86/include/asm/fpu/xstate.h             |  41 +-
 arch/x86/include/asm/msr-index.h              |   2 +
 arch/x86/include/asm/pgtable.h                |   2 +-
 arch/x86/include/asm/processor.h              |  10 +-
 arch/x86/include/asm/trace/fpu.h              |   6 +-
 arch/x86/kernel/cpu/common.c                  |   2 +-
 arch/x86/kernel/cpu/cpuid-deps.c              |   3 +
 arch/x86/kernel/fpu/core.c                    | 107 ++-
 arch/x86/kernel/fpu/init.c                    |  89 ++-
 arch/x86/kernel/fpu/regset.c                  |  65 +-
 arch/x86/kernel/fpu/signal.c                  |  41 +-
 arch/x86/kernel/fpu/xstate.c                  | 483 ++++++++++--
 arch/x86/kernel/process.c                     |  11 +
 arch/x86/kernel/process_32.c                  |   2 +-
 arch/x86/kernel/process_64.c                  |   2 +-
 arch/x86/kernel/traps.c                       |   3 +
 arch/x86/kvm/x86.c                            |  43 +-
 arch/x86/mm/pkeys.c                           |   2 +-
 tools/testing/selftests/x86/Makefile          |   2 +-
 tools/testing/selftests/x86/amx.c             | 736 ++++++++++++++++++
 24 files changed, 1620 insertions(+), 212 deletions(-)
 create mode 100644 tools/testing/selftests/x86/amx.c

--
2.17.1



* [RFC PATCH 01/22] x86/fpu/xstate: Modify area init helper prototypes to access all the possible areas
  2020-10-01 20:38 [RFC PATCH 00/22] x86: Support Intel Advanced Matrix Extensions Chang S. Bae
@ 2020-10-01 20:38 ` Chang S. Bae
  2020-10-01 20:38 ` [RFC PATCH 02/22] x86/fpu/xstate: Modify xstate copy " Chang S. Bae
                   ` (20 subsequent siblings)
  21 siblings, 0 replies; 43+ messages in thread
From: Chang S. Bae @ 2020-10-01 20:38 UTC (permalink / raw)
  To: tglx, mingo, bp, luto, x86
  Cc: len.brown, dave.hansen, jing2.liu, ravi.v.shankar, linux-kernel,
	chang.seok.bae, kvm

The xstate infrastructure is not flexible enough to support dynamic areas
in task->fpu. Change the fpstate_init() prototype to access task->fpu
directly. It treats a null pointer as indicating init_fpstate, as this
initial data does not belong to any task. For the compacted format,
fpstate_init_xstate() now accepts the state component bitmap to configure
XCOMP_BV.
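
The resulting calling convention, sketched from the call sites in this
patch ('state' being the local buffer pointer, as in the diff):

	/* A NULL pointer selects init_fpstate, which belongs to no task: */
	fpstate_init(NULL);

	/* A non-NULL pointer initializes that task's own state: */
	fpstate_init(&current->thread.fpu);

	/* The compacted-format helper now takes the component bitmap: */
	fpstate_init_xstate(&state->xsave, xfeatures_mask_all);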

No functional change.

Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Reviewed-by: Len Brown <len.brown@intel.com>
Cc: x86@kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: kvm@vger.kernel.org
---
 arch/x86/include/asm/fpu/internal.h |  6 +++---
 arch/x86/kernel/fpu/core.c          | 14 +++++++++++---
 arch/x86/kernel/fpu/init.c          |  2 +-
 arch/x86/kernel/fpu/regset.c        |  2 +-
 arch/x86/kernel/fpu/xstate.c        |  3 +--
 arch/x86/kvm/x86.c                  |  2 +-
 6 files changed, 18 insertions(+), 11 deletions(-)

diff --git a/arch/x86/include/asm/fpu/internal.h b/arch/x86/include/asm/fpu/internal.h
index 0a460f2a3f90..c404fedf1a75 100644
--- a/arch/x86/include/asm/fpu/internal.h
+++ b/arch/x86/include/asm/fpu/internal.h
@@ -79,20 +79,20 @@ static __always_inline __pure bool use_fxsr(void)
 
 extern union fpregs_state init_fpstate;
 
-extern void fpstate_init(union fpregs_state *state);
+extern void fpstate_init(struct fpu *fpu);
 #ifdef CONFIG_MATH_EMULATION
 extern void fpstate_init_soft(struct swregs_state *soft);
 #else
 static inline void fpstate_init_soft(struct swregs_state *soft) {}
 #endif
 
-static inline void fpstate_init_xstate(struct xregs_state *xsave)
+static inline void fpstate_init_xstate(struct xregs_state *xsave, u64 xcomp_mask)
 {
 	/*
 	 * XRSTORS requires these bits set in xcomp_bv, or it will
 	 * trigger #GP:
 	 */
-	xsave->header.xcomp_bv = XCOMP_BV_COMPACTED_FORMAT | xfeatures_mask_all;
+	xsave->header.xcomp_bv = XCOMP_BV_COMPACTED_FORMAT | xcomp_mask;
 }
 
 static inline void fpstate_init_fxstate(struct fxregs_state *fx)
diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
index eb86a2b831b1..41d926c76615 100644
--- a/arch/x86/kernel/fpu/core.c
+++ b/arch/x86/kernel/fpu/core.c
@@ -191,8 +191,16 @@ static inline void fpstate_init_fstate(struct fregs_state *fp)
 	fp->fos = 0xffff0000u;
 }
 
-void fpstate_init(union fpregs_state *state)
+/* If a null pointer is given, use the initial FPU state, init_fpstate. */
+void fpstate_init(struct fpu *fpu)
 {
+	union fpregs_state *state;
+
+	if (fpu)
+		state = &fpu->state;
+	else
+		state = &init_fpstate;
+
 	if (!static_cpu_has(X86_FEATURE_FPU)) {
 		fpstate_init_soft(&state->soft);
 		return;
@@ -201,7 +209,7 @@ void fpstate_init(union fpregs_state *state)
 	memset(state, 0, fpu_kernel_xstate_size);
 
 	if (static_cpu_has(X86_FEATURE_XSAVES))
-		fpstate_init_xstate(&state->xsave);
+		fpstate_init_xstate(&state->xsave, xfeatures_mask_all);
 	if (static_cpu_has(X86_FEATURE_FXSR))
 		fpstate_init_fxstate(&state->fxsave);
 	else
@@ -261,7 +269,7 @@ static void fpu__initialize(struct fpu *fpu)
 	WARN_ON_FPU(fpu != &current->thread.fpu);
 
 	set_thread_flag(TIF_NEED_FPU_LOAD);
-	fpstate_init(&fpu->state);
+	fpstate_init(fpu);
 	trace_x86_fpu_init_state(fpu);
 }
 
diff --git a/arch/x86/kernel/fpu/init.c b/arch/x86/kernel/fpu/init.c
index 61ddc3a5e5c2..4e89a2698cfb 100644
--- a/arch/x86/kernel/fpu/init.c
+++ b/arch/x86/kernel/fpu/init.c
@@ -125,7 +125,7 @@ static void __init fpu__init_system_generic(void)
 	 * Set up the legacy init FPU context. (xstate init might overwrite this
 	 * with a more modern format, if the CPU supports it.)
 	 */
-	fpstate_init(&init_fpstate);
+	fpstate_init(NULL);
 
 	fpu__init_system_mxcsr();
 }
diff --git a/arch/x86/kernel/fpu/regset.c b/arch/x86/kernel/fpu/regset.c
index c413756ba89f..4c4d9059ff36 100644
--- a/arch/x86/kernel/fpu/regset.c
+++ b/arch/x86/kernel/fpu/regset.c
@@ -144,7 +144,7 @@ int xstateregs_set(struct task_struct *target, const struct user_regset *regset,
 	 * In case of failure, mark all states as init:
 	 */
 	if (ret)
-		fpstate_init(&fpu->state);
+		fpstate_init(fpu);
 
 	return ret;
 }
diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index 038e19c0019e..ee4946c60ab1 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -454,8 +454,7 @@ static void __init setup_init_fpu_buf(void)
 	print_xstate_features();
 
 	if (boot_cpu_has(X86_FEATURE_XSAVES))
-		init_fpstate.xsave.header.xcomp_bv = XCOMP_BV_COMPACTED_FORMAT |
-						     xfeatures_mask_all;
+		fpstate_init_xstate(&init_fpstate.xsave, xfeatures_mask_all);
 
 	/*
 	 * Init all the features state with header.xfeatures being 0x0
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index ce856e0ece84..9da8cb4b8589 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -9448,7 +9448,7 @@ static int sync_regs(struct kvm_vcpu *vcpu)
 
 static void fx_init(struct kvm_vcpu *vcpu)
 {
-	fpstate_init(&vcpu->arch.guest_fpu->state);
+	fpstate_init(vcpu->arch.guest_fpu);
 	if (boot_cpu_has(X86_FEATURE_XSAVES))
 		vcpu->arch.guest_fpu->state.xsave.header.xcomp_bv =
 			host_xcr0 | XSTATE_COMPACTION_ENABLED;
-- 
2.17.1



* [RFC PATCH 02/22] x86/fpu/xstate: Modify xstate copy helper prototypes to access all the possible areas
  2020-10-01 20:38 [RFC PATCH 00/22] x86: Support Intel Advanced Matrix Extensions Chang S. Bae
  2020-10-01 20:38 ` [RFC PATCH 01/22] x86/fpu/xstate: Modify area init helper prototypes to access all the possible areas Chang S. Bae
@ 2020-10-01 20:38 ` Chang S. Bae
  2020-10-01 20:38 ` [RFC PATCH 03/22] x86/fpu/xstate: Modify address finder " Chang S. Bae
                   ` (19 subsequent siblings)
  21 siblings, 0 replies; 43+ messages in thread
From: Chang S. Bae @ 2020-10-01 20:38 UTC (permalink / raw)
  To: tglx, mingo, bp, luto, x86
  Cc: len.brown, dave.hansen, jing2.liu, ravi.v.shankar, linux-kernel,
	chang.seok.bae

The xstate infrastructure is not flexible enough to support dynamic areas
in task->fpu. Make the xstate copy functions access task->fpu directly.

No functional change.

Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Reviewed-by: Len Brown <len.brown@intel.com>
Cc: x86@kernel.org
Cc: linux-kernel@vger.kernel.org
---
 arch/x86/include/asm/fpu/xstate.h |  8 ++++----
 arch/x86/kernel/fpu/regset.c      |  6 +++---
 arch/x86/kernel/fpu/signal.c      | 17 ++++++++---------
 arch/x86/kernel/fpu/xstate.c      | 19 +++++++++++++++----
 4 files changed, 30 insertions(+), 20 deletions(-)

diff --git a/arch/x86/include/asm/fpu/xstate.h b/arch/x86/include/asm/fpu/xstate.h
index 14ab815132d4..a315b055212f 100644
--- a/arch/x86/include/asm/fpu/xstate.h
+++ b/arch/x86/include/asm/fpu/xstate.h
@@ -105,10 +105,10 @@ const void *get_xsave_field_ptr(int xfeature_nr);
 int using_compacted_format(void);
 int xfeature_size(int xfeature_nr);
 struct membuf;
-void copy_xstate_to_kernel(struct membuf to, struct xregs_state *xsave);
-int copy_kernel_to_xstate(struct xregs_state *xsave, const void *kbuf);
-int copy_user_to_xstate(struct xregs_state *xsave, const void __user *ubuf);
-void copy_supervisor_to_kernel(struct xregs_state *xsave);
+void copy_xstate_to_kernel(struct membuf to, struct fpu *fpu);
+int copy_kernel_to_xstate(struct fpu *fpu, const void *kbuf);
+int copy_user_to_xstate(struct fpu *fpu, const void __user *ubuf);
+void copy_supervisor_to_kernel(struct fpu *fpu);
 void copy_dynamic_supervisor_to_kernel(struct xregs_state *xstate, u64 mask);
 void copy_kernel_to_dynamic_supervisor(struct xregs_state *xstate, u64 mask);
 
diff --git a/arch/x86/kernel/fpu/regset.c b/arch/x86/kernel/fpu/regset.c
index 4c4d9059ff36..5e13e58d11d4 100644
--- a/arch/x86/kernel/fpu/regset.c
+++ b/arch/x86/kernel/fpu/regset.c
@@ -85,7 +85,7 @@ int xstateregs_get(struct task_struct *target, const struct user_regset *regset,
 	fpu__prepare_read(fpu);
 
 	if (using_compacted_format()) {
-		copy_xstate_to_kernel(to, xsave);
+		copy_xstate_to_kernel(to, fpu);
 		return 0;
 	} else {
 		fpstate_sanitize_xstate(fpu);
@@ -126,9 +126,9 @@ int xstateregs_set(struct task_struct *target, const struct user_regset *regset,
 
 	if (using_compacted_format()) {
 		if (kbuf)
-			ret = copy_kernel_to_xstate(xsave, kbuf);
+			ret = copy_kernel_to_xstate(fpu, kbuf);
 		else
-			ret = copy_user_to_xstate(xsave, ubuf);
+			ret = copy_user_to_xstate(fpu, ubuf);
 	} else {
 		ret = user_regset_copyin(&pos, &count, &kbuf, &ubuf, xsave, 0, -1);
 		if (!ret)
diff --git a/arch/x86/kernel/fpu/signal.c b/arch/x86/kernel/fpu/signal.c
index 9f009525f551..adbf63114bc2 100644
--- a/arch/x86/kernel/fpu/signal.c
+++ b/arch/x86/kernel/fpu/signal.c
@@ -212,11 +212,11 @@ int copy_fpstate_to_sigframe(void __user *buf, void __user *buf_fx, int size)
 }
 
 static inline void
-sanitize_restored_user_xstate(union fpregs_state *state,
+sanitize_restored_user_xstate(struct fpu *fpu,
 			      struct user_i387_ia32_struct *ia32_env,
 			      u64 user_xfeatures, int fx_only)
 {
-	struct xregs_state *xsave = &state->xsave;
+	struct xregs_state *xsave = &fpu->state.xsave;
 	struct xstate_header *header = &xsave->header;
 
 	if (use_xsave()) {
@@ -253,7 +253,7 @@ sanitize_restored_user_xstate(union fpregs_state *state,
 		xsave->i387.mxcsr &= mxcsr_feature_mask;
 
 		if (ia32_env)
-			convert_to_fxsr(&state->fxsave, ia32_env);
+			convert_to_fxsr(&fpu->state.fxsave, ia32_env);
 	}
 }
 
@@ -396,7 +396,7 @@ static int __fpu__restore_sig(void __user *buf, void __user *buf_fx, int size)
 		 * current supervisor states first and invalidate the FPU regs.
 		 */
 		if (xfeatures_mask_supervisor())
-			copy_supervisor_to_kernel(&fpu->state.xsave);
+			copy_supervisor_to_kernel(fpu);
 		set_thread_flag(TIF_NEED_FPU_LOAD);
 	}
 	__fpu_invalidate_fpregs_state(fpu);
@@ -406,18 +406,18 @@ static int __fpu__restore_sig(void __user *buf, void __user *buf_fx, int size)
 		u64 init_bv = xfeatures_mask_user() & ~user_xfeatures;
 
 		if (using_compacted_format()) {
-			ret = copy_user_to_xstate(&fpu->state.xsave, buf_fx);
+			ret = copy_user_to_xstate(fpu, buf_fx);
 		} else {
 			ret = __copy_from_user(&fpu->state.xsave, buf_fx, state_size);
 
 			if (!ret && state_size > offsetof(struct xregs_state, header))
 				ret = validate_user_xstate_header(&fpu->state.xsave.header);
+
 		}
 		if (ret)
 			goto err_out;
 
-		sanitize_restored_user_xstate(&fpu->state, envp, user_xfeatures,
-					      fx_only);
+		sanitize_restored_user_xstate(fpu, envp, user_xfeatures, fx_only);
 
 		fpregs_lock();
 		if (unlikely(init_bv))
@@ -437,8 +437,7 @@ static int __fpu__restore_sig(void __user *buf, void __user *buf_fx, int size)
 			goto err_out;
 		}
 
-		sanitize_restored_user_xstate(&fpu->state, envp, user_xfeatures,
-					      fx_only);
+		sanitize_restored_user_xstate(fpu, envp, user_xfeatures, fx_only);
 
 		fpregs_lock();
 		if (use_xsave()) {
diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index ee4946c60ab1..e3a9bddc39d9 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -1067,14 +1067,17 @@ static void copy_part(struct membuf *to, unsigned *last, unsigned offset,
  * It supports partial copy but pos always starts from zero. This is called
  * from xstateregs_get() and there we check the CPU has XSAVES.
  */
-void copy_xstate_to_kernel(struct membuf to, struct xregs_state *xsave)
+void copy_xstate_to_kernel(struct membuf to, struct fpu *fpu)
 {
 	struct xstate_header header;
 	const unsigned off_mxcsr = offsetof(struct fxregs_state, mxcsr);
+	struct xregs_state *xsave;
 	unsigned size = to.left;
 	unsigned last = 0;
 	int i;
 
+	xsave = &fpu->state.xsave;
+
 	/*
 	 * The destination is a ptrace buffer; we put in only user xstates:
 	 */
@@ -1123,8 +1126,9 @@ void copy_xstate_to_kernel(struct membuf to, struct xregs_state *xsave)
  * Convert from a ptrace standard-format kernel buffer to kernel XSAVES format
  * and copy to the target thread. This is called from xstateregs_set().
  */
-int copy_kernel_to_xstate(struct xregs_state *xsave, const void *kbuf)
+int copy_kernel_to_xstate(struct fpu *fpu, const void *kbuf)
 {
+	struct xregs_state *xsave;
 	unsigned int offset, size;
 	int i;
 	struct xstate_header hdr;
@@ -1137,6 +1141,8 @@ int copy_kernel_to_xstate(struct xregs_state *xsave, const void *kbuf)
 	if (validate_user_xstate_header(&hdr))
 		return -EINVAL;
 
+	xsave = &fpu->state.xsave;
+
 	for (i = 0; i < XFEATURE_MAX; i++) {
 		u64 mask = ((u64)1 << i);
 
@@ -1176,8 +1182,9 @@ int copy_kernel_to_xstate(struct xregs_state *xsave, const void *kbuf)
  * xstateregs_set(), as well as potentially from the sigreturn() and
  * rt_sigreturn() system calls.
  */
-int copy_user_to_xstate(struct xregs_state *xsave, const void __user *ubuf)
+int copy_user_to_xstate(struct fpu *fpu, const void __user *ubuf)
 {
+	struct xregs_state *xsave;
 	unsigned int offset, size;
 	int i;
 	struct xstate_header hdr;
@@ -1191,6 +1198,8 @@ int copy_user_to_xstate(struct xregs_state *xsave, const void __user *ubuf)
 	if (validate_user_xstate_header(&hdr))
 		return -EINVAL;
 
+	xsave = &fpu->state.xsave;
+
 	for (i = 0; i < XFEATURE_MAX; i++) {
 		u64 mask = ((u64)1 << i);
 
@@ -1231,9 +1240,10 @@ int copy_user_to_xstate(struct xregs_state *xsave, const void __user *ubuf)
  * old states, and is intended to be used only in __fpu__restore_sig(), where
  * user states are restored from the user buffer.
  */
-void copy_supervisor_to_kernel(struct xregs_state *xstate)
+void copy_supervisor_to_kernel(struct fpu *fpu)
 {
 	struct xstate_header *header;
+	struct xregs_state *xstate;
 	u64 max_bit, min_bit;
 	u32 lmask, hmask;
 	int err, i;
@@ -1247,6 +1257,7 @@ void copy_supervisor_to_kernel(struct xregs_state *xstate)
 	max_bit = __fls(xfeatures_mask_supervisor());
 	min_bit = __ffs(xfeatures_mask_supervisor());
 
+	xstate = &fpu->state.xsave;
 	lmask = xfeatures_mask_supervisor();
 	hmask = xfeatures_mask_supervisor() >> 32;
 	XSTATE_OP(XSAVES, xstate, lmask, hmask, err);
-- 
2.17.1



* [RFC PATCH 03/22] x86/fpu/xstate: Modify address finder prototypes to access all the possible areas
  2020-10-01 20:38 [RFC PATCH 00/22] x86: Support Intel Advanced Matrix Extensions Chang S. Bae
  2020-10-01 20:38 ` [RFC PATCH 01/22] x86/fpu/xstate: Modify area init helper prototypes to access all the possible areas Chang S. Bae
  2020-10-01 20:38 ` [RFC PATCH 02/22] x86/fpu/xstate: Modify xstate copy " Chang S. Bae
@ 2020-10-01 20:38 ` Chang S. Bae
  2020-10-01 20:38 ` [RFC PATCH 04/22] x86/fpu/xstate: Modify save and restore helper " Chang S. Bae
                   ` (18 subsequent siblings)
  21 siblings, 0 replies; 43+ messages in thread
From: Chang S. Bae @ 2020-10-01 20:38 UTC (permalink / raw)
  To: tglx, mingo, bp, luto, x86
  Cc: len.brown, dave.hansen, jing2.liu, ravi.v.shankar, linux-kernel,
	chang.seok.bae, kvm

The xstate infrastructure is not flexible enough to support dynamic areas
in task->fpu. Change the prototypes of some address-finding functions to
access task->fpu directly. Make changes for both the outer and inner
helpers: get_xsave_addr() and __raw_xsave_addr().
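
The changed lookups, sketched from the call sites in this patch:

	struct pkru_state *pk;

	/* Task case: resolve PKRU within this task's xsave buffer: */
	pk = get_xsave_addr(&current->thread.fpu, XFEATURE_PKRU);

	/* Boot case: a NULL fpu now selects init_fpstate: */
	pk = get_xsave_addr(NULL, XFEATURE_PKRU);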

No functional change.

Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Reviewed-by: Len Brown <len.brown@intel.com>
Cc: x86@kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: kvm@vger.kernel.org
---
 arch/x86/include/asm/fpu/internal.h |  2 +-
 arch/x86/include/asm/fpu/xstate.h   |  2 +-
 arch/x86/include/asm/pgtable.h      |  2 +-
 arch/x86/kernel/cpu/common.c        |  2 +-
 arch/x86/kernel/fpu/xstate.c        | 43 ++++++++++++++++++++---------
 arch/x86/kvm/x86.c                  | 26 +++++++++++------
 arch/x86/mm/pkeys.c                 |  2 +-
 7 files changed, 52 insertions(+), 27 deletions(-)

diff --git a/arch/x86/include/asm/fpu/internal.h b/arch/x86/include/asm/fpu/internal.h
index c404fedf1a75..baca80e877a6 100644
--- a/arch/x86/include/asm/fpu/internal.h
+++ b/arch/x86/include/asm/fpu/internal.h
@@ -578,7 +578,7 @@ static inline void switch_fpu_finish(struct fpu *new_fpu)
 	 * return to userland e.g. for a copy_to_user() operation.
 	 */
 	if (current->mm) {
-		pk = get_xsave_addr(&new_fpu->state.xsave, XFEATURE_PKRU);
+		pk = get_xsave_addr(new_fpu, XFEATURE_PKRU);
 		if (pk)
 			pkru_val = pk->pkru;
 	}
diff --git a/arch/x86/include/asm/fpu/xstate.h b/arch/x86/include/asm/fpu/xstate.h
index a315b055212f..3fbf45727ad6 100644
--- a/arch/x86/include/asm/fpu/xstate.h
+++ b/arch/x86/include/asm/fpu/xstate.h
@@ -100,7 +100,7 @@ extern u64 xstate_fx_sw_bytes[USER_XSTATE_FX_SW_WORDS];
 extern void __init update_regset_xstate_info(unsigned int size,
 					     u64 xstate_mask);
 
-void *get_xsave_addr(struct xregs_state *xsave, int xfeature_nr);
+void *get_xsave_addr(struct fpu *fpu, int xfeature_nr);
 const void *get_xsave_field_ptr(int xfeature_nr);
 int using_compacted_format(void);
 int xfeature_size(int xfeature_nr);
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index b836138ce852..e24a8fb8f479 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -142,7 +142,7 @@ static inline void write_pkru(u32 pkru)
 	if (!boot_cpu_has(X86_FEATURE_OSPKE))
 		return;
 
-	pk = get_xsave_addr(&current->thread.fpu.state.xsave, XFEATURE_PKRU);
+	pk = get_xsave_addr(&current->thread.fpu, XFEATURE_PKRU);
 
 	/*
 	 * The PKRU value in xstate needs to be in sync with the value that is
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index d0363e15ec2e..183ee7f77065 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -478,7 +478,7 @@ static __always_inline void setup_pku(struct cpuinfo_x86 *c)
 		return;
 
 	cr4_set_bits(X86_CR4_PKE);
-	pk = get_xsave_addr(&init_fpstate.xsave, XFEATURE_PKRU);
+	pk = get_xsave_addr(NULL, XFEATURE_PKRU);
 	if (pk)
 		pk->pkru = init_pkru_value;
 	/*
diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index e3a9bddc39d9..bab22766b79b 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -891,15 +891,23 @@ void fpu__resume_cpu(void)
  * buffer the state is.  Callers should ensure that the buffer
  * is valid.
  */
-static void *__raw_xsave_addr(struct xregs_state *xsave, int xfeature_nr)
+static void *__raw_xsave_addr(struct fpu *fpu, int xfeature_nr)
 {
+	void *xsave;
+
 	if (!xfeature_enabled(xfeature_nr)) {
 		WARN_ON_FPU(1);
 		return NULL;
 	}
 
-	return (void *)xsave + xstate_comp_offsets[xfeature_nr];
+	if (fpu)
+		xsave = &fpu->state.xsave;
+	else
+		xsave = &init_fpstate.xsave;
+
+	return xsave + xstate_comp_offsets[xfeature_nr];
 }
+
 /*
  * Given the xsave area and a state inside, this function returns the
  * address of the state.
@@ -911,15 +919,18 @@ static void *__raw_xsave_addr(struct xregs_state *xsave, int xfeature_nr)
  * this will return NULL.
  *
  * Inputs:
- *	xstate: the thread's storage area for all FPU data
+ *	fpu: the thread's FPU data to access all the FPU state storages.
+ *	     (If a null pointer is given, assume init_fpstate.)
  *	xfeature_nr: state which is defined in xsave.h (e.g. XFEATURE_FP,
  *	XFEATURE_SSE, etc...)
  * Output:
  *	address of the state in the xsave area, or NULL if the
  *	field is not present in the xsave buffer.
  */
-void *get_xsave_addr(struct xregs_state *xsave, int xfeature_nr)
+void *get_xsave_addr(struct fpu *fpu, int xfeature_nr)
 {
+	struct xregs_state *xsave;
+
 	/*
 	 * Do we even *have* xsave state?
 	 */
@@ -932,6 +943,12 @@ void *get_xsave_addr(struct xregs_state *xsave, int xfeature_nr)
 	 */
 	WARN_ONCE(!(xfeatures_mask_all & BIT_ULL(xfeature_nr)),
 		  "get of unsupported state");
+
+	if (fpu)
+		xsave = &fpu->state.xsave;
+	else
+		xsave = &init_fpstate.xsave;
+
 	/*
 	 * This assumes the last 'xsave*' instruction to
 	 * have requested that 'xfeature_nr' be saved.
@@ -946,7 +963,7 @@ void *get_xsave_addr(struct xregs_state *xsave, int xfeature_nr)
 	if (!(xsave->header.xfeatures & BIT_ULL(xfeature_nr)))
 		return NULL;
 
-	return __raw_xsave_addr(xsave, xfeature_nr);
+	return __raw_xsave_addr(fpu, xfeature_nr);
 }
 EXPORT_SYMBOL_GPL(get_xsave_addr);
 
@@ -977,7 +994,7 @@ const void *get_xsave_field_ptr(int xfeature_nr)
 	 */
 	fpu__save(fpu);
 
-	return get_xsave_addr(&fpu->state.xsave, xfeature_nr);
+	return get_xsave_addr(fpu, xfeature_nr);
 }
 
 #ifdef CONFIG_ARCH_HAS_PKEYS
@@ -1112,7 +1129,7 @@ void copy_xstate_to_kernel(struct membuf to, struct fpu *fpu)
 		 * Copy only in-use xstates:
 		 */
 		if ((header.xfeatures >> i) & 1) {
-			void *src = __raw_xsave_addr(xsave, i);
+			void *src = __raw_xsave_addr(fpu, i);
 
 			copy_part(&to, &last, xstate_offsets[i],
 				  xstate_sizes[i], src);
@@ -1141,13 +1158,11 @@ int copy_kernel_to_xstate(struct fpu *fpu, const void *kbuf)
 	if (validate_user_xstate_header(&hdr))
 		return -EINVAL;
 
-	xsave = &fpu->state.xsave;
-
 	for (i = 0; i < XFEATURE_MAX; i++) {
 		u64 mask = ((u64)1 << i);
 
 		if (hdr.xfeatures & mask) {
-			void *dst = __raw_xsave_addr(xsave, i);
+			void *dst = __raw_xsave_addr(fpu, i);
 
 			offset = xstate_offsets[i];
 			size = xstate_sizes[i];
@@ -1156,6 +1171,8 @@ int copy_kernel_to_xstate(struct fpu *fpu, const void *kbuf)
 		}
 	}
 
+	xsave = &fpu->state.xsave;
+
 	if (xfeatures_mxcsr_quirk(hdr.xfeatures)) {
 		offset = offsetof(struct fxregs_state, mxcsr);
 		size = MXCSR_AND_FLAGS_SIZE;
@@ -1198,13 +1215,11 @@ int copy_user_to_xstate(struct fpu *fpu, const void __user *ubuf)
 	if (validate_user_xstate_header(&hdr))
 		return -EINVAL;
 
-	xsave = &fpu->state.xsave;
-
 	for (i = 0; i < XFEATURE_MAX; i++) {
 		u64 mask = ((u64)1 << i);
 
 		if (hdr.xfeatures & mask) {
-			void *dst = __raw_xsave_addr(xsave, i);
+			void *dst = __raw_xsave_addr(fpu, i);
 
 			offset = xstate_offsets[i];
 			size = xstate_sizes[i];
@@ -1214,6 +1229,8 @@ int copy_user_to_xstate(struct fpu *fpu, const void __user *ubuf)
 		}
 	}
 
+	xsave = &fpu->state.xsave;
+
 	if (xfeatures_mxcsr_quirk(hdr.xfeatures)) {
 		offset = offsetof(struct fxregs_state, mxcsr);
 		size = MXCSR_AND_FLAGS_SIZE;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 9da8cb4b8589..c4b8d3705625 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4182,10 +4182,15 @@ static int kvm_vcpu_ioctl_x86_set_debugregs(struct kvm_vcpu *vcpu,
 
 static void fill_xsave(u8 *dest, struct kvm_vcpu *vcpu)
 {
-	struct xregs_state *xsave = &vcpu->arch.guest_fpu->state.xsave;
-	u64 xstate_bv = xsave->header.xfeatures;
+	struct xregs_state *xsave;
+	struct fpu *guest_fpu;
+	u64 xstate_bv;
 	u64 valid;
 
+	guest_fpu = vcpu->arch.guest_fpu;
+	xsave = &guest_fpu->state.xsave;
+	xstate_bv = xsave->header.xfeatures;
+
 	/*
 	 * Copy legacy XSAVE area, to avoid complications with CPUID
 	 * leaves 0 and 1 in the loop below.
@@ -4204,7 +4209,7 @@ static void fill_xsave(u8 *dest, struct kvm_vcpu *vcpu)
 	while (valid) {
 		u64 xfeature_mask = valid & -valid;
 		int xfeature_nr = fls64(xfeature_mask) - 1;
-		void *src = get_xsave_addr(xsave, xfeature_nr);
+		void *src = get_xsave_addr(guest_fpu, xfeature_nr);
 
 		if (src) {
 			u32 size, offset, ecx, edx;
@@ -4224,10 +4229,14 @@ static void fill_xsave(u8 *dest, struct kvm_vcpu *vcpu)
 
 static void load_xsave(struct kvm_vcpu *vcpu, u8 *src)
 {
-	struct xregs_state *xsave = &vcpu->arch.guest_fpu->state.xsave;
 	u64 xstate_bv = *(u64 *)(src + XSAVE_HDR_OFFSET);
+	struct xregs_state *xsave;
+	struct fpu *guest_fpu;
 	u64 valid;
 
+	guest_fpu = vcpu->arch.guest_fpu;
+	xsave = &guest_fpu->state.xsave;
+
 	/*
 	 * Copy legacy XSAVE area, to avoid complications with CPUID
 	 * leaves 0 and 1 in the loop below.
@@ -4247,7 +4256,7 @@ static void load_xsave(struct kvm_vcpu *vcpu, u8 *src)
 	while (valid) {
 		u64 xfeature_mask = valid & -valid;
 		int xfeature_nr = fls64(xfeature_mask) - 1;
-		void *dest = get_xsave_addr(xsave, xfeature_nr);
+		void *dest = get_xsave_addr(guest_fpu, xfeature_nr);
 
 		if (dest) {
 			u32 size, offset, ecx, edx;
@@ -9662,6 +9671,7 @@ void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
 	vcpu->arch.apf.halted = false;
 
 	if (kvm_mpx_supported()) {
+		struct fpu *guest_fpu = vcpu->arch.guest_fpu;
 		void *mpx_state_buffer;
 
 		/*
@@ -9670,12 +9680,10 @@ void kvm_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
 		 */
 		if (init_event)
 			kvm_put_guest_fpu(vcpu);
-		mpx_state_buffer = get_xsave_addr(&vcpu->arch.guest_fpu->state.xsave,
-					XFEATURE_BNDREGS);
+		mpx_state_buffer = get_xsave_addr(guest_fpu, XFEATURE_BNDREGS);
 		if (mpx_state_buffer)
 			memset(mpx_state_buffer, 0, sizeof(struct mpx_bndreg_state));
-		mpx_state_buffer = get_xsave_addr(&vcpu->arch.guest_fpu->state.xsave,
-					XFEATURE_BNDCSR);
+		mpx_state_buffer = get_xsave_addr(guest_fpu, XFEATURE_BNDCSR);
 		if (mpx_state_buffer)
 			memset(mpx_state_buffer, 0, sizeof(struct mpx_bndcsr));
 		if (init_event)
diff --git a/arch/x86/mm/pkeys.c b/arch/x86/mm/pkeys.c
index 8873ed1438a9..772e8bc3d49d 100644
--- a/arch/x86/mm/pkeys.c
+++ b/arch/x86/mm/pkeys.c
@@ -177,7 +177,7 @@ static ssize_t init_pkru_write_file(struct file *file,
 		return -EINVAL;
 
 	WRITE_ONCE(init_pkru_value, new_init_pkru);
-	pk = get_xsave_addr(&init_fpstate.xsave, XFEATURE_PKRU);
+	pk = get_xsave_addr(NULL, XFEATURE_PKRU);
 	if (!pk)
 		return -EINVAL;
 	pk->pkru = new_init_pkru;
-- 
2.17.1



* [RFC PATCH 04/22] x86/fpu/xstate: Modify save and restore helper prototypes to access all the possible areas
  2020-10-01 20:38 [RFC PATCH 00/22] x86: Support Intel Advanced Matrix Extensions Chang S. Bae
                   ` (2 preceding siblings ...)
  2020-10-01 20:38 ` [RFC PATCH 03/22] x86/fpu/xstate: Modify address finder " Chang S. Bae
@ 2020-10-01 20:38 ` Chang S. Bae
  2020-10-01 20:38 ` [RFC PATCH 05/22] x86/fpu/xstate: Introduce a new variable for dynamic user states Chang S. Bae
                   ` (17 subsequent siblings)
  21 siblings, 0 replies; 43+ messages in thread
From: Chang S. Bae @ 2020-10-01 20:38 UTC (permalink / raw)
  To: tglx, mingo, bp, luto, x86
  Cc: len.brown, dave.hansen, jing2.liu, ravi.v.shankar, linux-kernel,
	chang.seok.bae, kvm

The xstate infrastructure is not flexible enough to support dynamic areas
in task->fpu. Make the xstate save and restore helpers access task->fpu
directly.

No functional change.

Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Reviewed-by: Len Brown <len.brown@intel.com>
Cc: x86@kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: kvm@vger.kernel.org
---
 arch/x86/include/asm/fpu/internal.h | 9 ++++++---
 arch/x86/kernel/fpu/core.c          | 4 ++--
 arch/x86/kernel/fpu/signal.c        | 3 +--
 arch/x86/kvm/x86.c                  | 2 +-
 4 files changed, 10 insertions(+), 8 deletions(-)

diff --git a/arch/x86/include/asm/fpu/internal.h b/arch/x86/include/asm/fpu/internal.h
index baca80e877a6..6eec5209750f 100644
--- a/arch/x86/include/asm/fpu/internal.h
+++ b/arch/x86/include/asm/fpu/internal.h
@@ -396,8 +396,9 @@ static inline int copy_user_to_xregs(struct xregs_state __user *buf, u64 mask)
  * Restore xstate from kernel space xsave area, return an error code instead of
  * an exception.
  */
-static inline int copy_kernel_to_xregs_err(struct xregs_state *xstate, u64 mask)
+static inline int copy_kernel_to_xregs_err(struct fpu *fpu, u64 mask)
 {
+	struct xregs_state *xstate = &fpu->state.xsave;
 	u32 lmask = mask;
 	u32 hmask = mask >> 32;
 	int err;
@@ -424,8 +425,10 @@ static inline void __copy_kernel_to_fpregs(union fpregs_state *fpstate, u64 mask
 	}
 }
 
-static inline void copy_kernel_to_fpregs(union fpregs_state *fpstate)
+static inline void copy_kernel_to_fpregs(struct fpu *fpu)
 {
+	union fpregs_state *fpstate = &fpu->state;
+
 	/*
 	 * AMD K7/K8 CPUs don't save/restore FDP/FIP/FOP unless an exception is
 	 * pending. Clear the x87 state here by setting it to fixed values.
@@ -510,7 +513,7 @@ static inline void __fpregs_load_activate(void)
 		return;
 
 	if (!fpregs_state_valid(fpu, cpu)) {
-		copy_kernel_to_fpregs(&fpu->state);
+		copy_kernel_to_fpregs(fpu);
 		fpregs_activate(fpu);
 		fpu->last_cpu = cpu;
 	}
diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
index 41d926c76615..39ddb22c143b 100644
--- a/arch/x86/kernel/fpu/core.c
+++ b/arch/x86/kernel/fpu/core.c
@@ -172,7 +172,7 @@ void fpu__save(struct fpu *fpu)
 
 	if (!test_thread_flag(TIF_NEED_FPU_LOAD)) {
 		if (!copy_fpregs_to_fpstate(fpu)) {
-			copy_kernel_to_fpregs(&fpu->state);
+			copy_kernel_to_fpregs(fpu);
 		}
 	}
 
@@ -248,7 +248,7 @@ int fpu__copy(struct task_struct *dst, struct task_struct *src)
 		memcpy(&dst_fpu->state, &src_fpu->state, fpu_kernel_xstate_size);
 
 	else if (!copy_fpregs_to_fpstate(dst_fpu))
-		copy_kernel_to_fpregs(&dst_fpu->state);
+		copy_kernel_to_fpregs(dst_fpu);
 
 	fpregs_unlock();
 
diff --git a/arch/x86/kernel/fpu/signal.c b/arch/x86/kernel/fpu/signal.c
index adbf63114bc2..6f3bcc7dab80 100644
--- a/arch/x86/kernel/fpu/signal.c
+++ b/arch/x86/kernel/fpu/signal.c
@@ -427,8 +427,7 @@ static int __fpu__restore_sig(void __user *buf, void __user *buf_fx, int size)
 		 * Restore previously saved supervisor xstates along with
 		 * copied-in user xstates.
 		 */
-		ret = copy_kernel_to_xregs_err(&fpu->state.xsave,
-					       user_xfeatures | xfeatures_mask_supervisor());
+		ret = copy_kernel_to_xregs_err(fpu, user_xfeatures | xfeatures_mask_supervisor());
 
 	} else if (use_fxsr()) {
 		ret = __copy_from_user(&fpu->state.fxsave, buf_fx, state_size);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index c4b8d3705625..192d52ff5b8c 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -8877,7 +8877,7 @@ static void kvm_put_guest_fpu(struct kvm_vcpu *vcpu)
 
 	kvm_save_current_fpu(vcpu->arch.guest_fpu);
 
-	copy_kernel_to_fpregs(&vcpu->arch.user_fpu->state);
+	copy_kernel_to_fpregs(vcpu->arch.user_fpu);
 
 	fpregs_mark_activate();
 	fpregs_unlock();
-- 
2.17.1



* [RFC PATCH 05/22] x86/fpu/xstate: Introduce a new variable for dynamic user states
  2020-10-01 20:38 [RFC PATCH 00/22] x86: Support Intel Advanced Matrix Extensions Chang S. Bae
                   ` (3 preceding siblings ...)
  2020-10-01 20:38 ` [RFC PATCH 04/22] x86/fpu/xstate: Modify save and restore helper " Chang S. Bae
@ 2020-10-01 20:38 ` Chang S. Bae
  2020-10-01 20:38 ` [RFC PATCH 06/22] x86/fpu/xstate: Outline dynamic xstate area size in the task context Chang S. Bae
                   ` (16 subsequent siblings)
  21 siblings, 0 replies; 43+ messages in thread
From: Chang S. Bae @ 2020-10-01 20:38 UTC (permalink / raw)
  To: tglx, mingo, bp, luto, x86
  Cc: len.brown, dave.hansen, jing2.liu, ravi.v.shankar, linux-kernel,
	chang.seok.bae

The kernel recently gained support for dynamic supervisor states. That
approach does not save the register states at every context switch (even
when in use), but only when needed. It is not suitable for user states.

Introduce xfeatures_mask_user_dynamic to identify dynamic user states, and
rename the following, to make their relation to the dynamic supervisor
states explicit:
	xfeatures_mask_supervisor_dynamic()
	XFEATURE_MASK_SUPERVISOR_DYNAMIC
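
The intended mask split, sketched for illustration (per this patch,
xfeatures_mask_user_dynamic stays zero until a later patch sets it):

	/* Dynamic user states are a subset of the enabled user states: */
	u64 user_dynamic = xfeatures_mask_user() & xfeatures_mask_user_dynamic;

	/* Statically-allocated user states are the remainder: */
	u64 user_static  = xfeatures_mask_user() & ~xfeatures_mask_user_dynamic;

	/* Dynamic supervisor states remain a separate set: */
	u64 sv_dynamic   = xfeatures_mask_supervisor_dynamic();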

No functional change.

Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Reviewed-by: Len Brown <len.brown@intel.com>
Cc: x86@kernel.org
Cc: linux-kernel@vger.kernel.org
---
 arch/x86/include/asm/fpu/xstate.h | 12 +++++++-----
 arch/x86/kernel/fpu/xstate.c      | 29 +++++++++++++++++++----------
 2 files changed, 26 insertions(+), 15 deletions(-)

diff --git a/arch/x86/include/asm/fpu/xstate.h b/arch/x86/include/asm/fpu/xstate.h
index 3fbf45727ad6..9aad91c0725b 100644
--- a/arch/x86/include/asm/fpu/xstate.h
+++ b/arch/x86/include/asm/fpu/xstate.h
@@ -56,7 +56,7 @@
  * - Don't set the bit corresponding to the dynamic supervisor feature in
  *   IA32_XSS at run time, since it has been set at boot time.
  */
-#define XFEATURE_MASK_DYNAMIC (XFEATURE_MASK_LBR)
+#define XFEATURE_MASK_SUPERVISOR_DYNAMIC (XFEATURE_MASK_LBR)
 
 /*
  * Unsupported supervisor features. When a supervisor feature in this mask is
@@ -66,7 +66,7 @@
 
 /* All supervisor states including supported and unsupported states. */
 #define XFEATURE_MASK_SUPERVISOR_ALL (XFEATURE_MASK_SUPERVISOR_SUPPORTED | \
-				      XFEATURE_MASK_DYNAMIC | \
+				      XFEATURE_MASK_SUPERVISOR_DYNAMIC | \
 				      XFEATURE_MASK_SUPERVISOR_UNSUPPORTED)
 
 #ifdef CONFIG_X86_64
@@ -87,14 +87,16 @@ static inline u64 xfeatures_mask_user(void)
 	return xfeatures_mask_all & XFEATURE_MASK_USER_SUPPORTED;
 }
 
-static inline u64 xfeatures_mask_dynamic(void)
+static inline u64 xfeatures_mask_supervisor_dynamic(void)
 {
 	if (!boot_cpu_has(X86_FEATURE_ARCH_LBR))
-		return XFEATURE_MASK_DYNAMIC & ~XFEATURE_MASK_LBR;
+		return XFEATURE_MASK_SUPERVISOR_DYNAMIC & ~XFEATURE_MASK_LBR;
 
-	return XFEATURE_MASK_DYNAMIC;
+	return XFEATURE_MASK_SUPERVISOR_DYNAMIC;
 }
 
+extern u64 xfeatures_mask_user_dynamic;
+
 extern u64 xstate_fx_sw_bytes[USER_XSTATE_FX_SW_WORDS];
 
 extern void __init update_regset_xstate_info(unsigned int size,
diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index bab22766b79b..bf2b09bf9b38 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -59,6 +59,12 @@ static short xsave_cpuid_features[] __initdata = {
  */
 u64 xfeatures_mask_all __read_mostly;
 
+/*
+ * This represents user xstates, a subset of xfeatures_mask_all, saved in a
+ * dynamic kernel XSAVE buffer.
+ */
+u64 xfeatures_mask_user_dynamic __read_mostly;
+
 static unsigned int xstate_offsets[XFEATURE_MAX] = { [ 0 ... XFEATURE_MAX - 1] = -1};
 static unsigned int xstate_sizes[XFEATURE_MAX]   = { [ 0 ... XFEATURE_MAX - 1] = -1};
 static unsigned int xstate_comp_offsets[XFEATURE_MAX] = { [ 0 ... XFEATURE_MAX - 1] = -1};
@@ -235,7 +241,7 @@ void fpu__init_cpu_xstate(void)
 	 */
 	if (boot_cpu_has(X86_FEATURE_XSAVES)) {
 		wrmsrl(MSR_IA32_XSS, xfeatures_mask_supervisor() |
-				     xfeatures_mask_dynamic());
+				     xfeatures_mask_supervisor_dynamic());
 	}
 }
 
@@ -682,7 +688,7 @@ static unsigned int __init get_xsaves_size(void)
  */
 static unsigned int __init get_xsaves_size_no_dynamic(void)
 {
-	u64 mask = xfeatures_mask_dynamic();
+	u64 mask = xfeatures_mask_supervisor_dynamic();
 	unsigned int size;
 
 	if (!mask)
@@ -769,6 +775,7 @@ static int __init init_xstate_size(void)
 static void fpu__init_disable_system_xstate(void)
 {
 	xfeatures_mask_all = 0;
+	xfeatures_mask_user_dynamic = 0;
 	cr4_clear_bits(X86_CR4_OSXSAVE);
 	setup_clear_cpu_cap(X86_FEATURE_XSAVE);
 }
@@ -835,6 +842,8 @@ void __init fpu__init_system_xstate(void)
 	}
 
 	xfeatures_mask_all &= fpu__get_supported_xfeatures_mask();
+	/* Do not support the dynamically allocated area yet. */
+	xfeatures_mask_user_dynamic = 0;
 
 	/* Enable xstate instructions to be able to continue with initialization: */
 	fpu__init_cpu_xstate();
@@ -882,7 +891,7 @@ void fpu__resume_cpu(void)
 	 */
 	if (boot_cpu_has(X86_FEATURE_XSAVES)) {
 		wrmsrl(MSR_IA32_XSS, xfeatures_mask_supervisor()  |
-				     xfeatures_mask_dynamic());
+				     xfeatures_mask_supervisor_dynamic());
 	}
 }
 
@@ -1316,8 +1325,8 @@ void copy_supervisor_to_kernel(struct fpu *fpu)
  * @mask: Represent the dynamic supervisor features saved into the xsave area
  *
  * Only the dynamic supervisor states sets in the mask are saved into the xsave
- * area (See the comment in XFEATURE_MASK_DYNAMIC for the details of dynamic
- * supervisor feature). Besides the dynamic supervisor states, the legacy
+ * area (See the comment in XFEATURE_MASK_SUPERVISOR_DYNAMIC for the details of
+ * dynamic supervisor feature). Besides the dynamic supervisor states, the legacy
  * region and XSAVE header are also saved into the xsave area. The supervisor
  * features in the XFEATURE_MASK_SUPERVISOR_SUPPORTED and
  * XFEATURE_MASK_SUPERVISOR_UNSUPPORTED are not saved.
@@ -1326,7 +1335,7 @@ void copy_supervisor_to_kernel(struct fpu *fpu)
  */
 void copy_dynamic_supervisor_to_kernel(struct xregs_state *xstate, u64 mask)
 {
-	u64 dynamic_mask = xfeatures_mask_dynamic() & mask;
+	u64 dynamic_mask = xfeatures_mask_supervisor_dynamic() & mask;
 	u32 lmask, hmask;
 	int err;
 
@@ -1352,9 +1361,9 @@ void copy_dynamic_supervisor_to_kernel(struct xregs_state *xstate, u64 mask)
  * @mask: Represent the dynamic supervisor features restored from the xsave area
  *
  * Only the dynamic supervisor states sets in the mask are restored from the
- * xsave area (See the comment in XFEATURE_MASK_DYNAMIC for the details of
- * dynamic supervisor feature). Besides the dynamic supervisor states, the
- * legacy region and XSAVE header are also restored from the xsave area. The
+ * xsave area (See the comment in XFEATURE_MASK_SUPERVISOR_DYNAMIC for the
+ * details of dynamic supervisor feature). Besides the dynamic supervisor states,
+ * the legacy region and XSAVE header are also restored from the xsave area. The
  * supervisor features in the XFEATURE_MASK_SUPERVISOR_SUPPORTED and
  * XFEATURE_MASK_SUPERVISOR_UNSUPPORTED are not restored.
  *
@@ -1362,7 +1371,7 @@ void copy_dynamic_supervisor_to_kernel(struct xregs_state *xstate, u64 mask)
  */
 void copy_kernel_to_dynamic_supervisor(struct xregs_state *xstate, u64 mask)
 {
-	u64 dynamic_mask = xfeatures_mask_dynamic() & mask;
+	u64 dynamic_mask = xfeatures_mask_supervisor_dynamic() & mask;
 	u32 lmask, hmask;
 	int err;
 
-- 
2.17.1



* [RFC PATCH 06/22] x86/fpu/xstate: Outline dynamic xstate area size in the task context
  2020-10-01 20:38 [RFC PATCH 00/22] x86: Support Intel Advanced Matrix Extensions Chang S. Bae
                   ` (4 preceding siblings ...)
  2020-10-01 20:38 ` [RFC PATCH 05/22] x86/fpu/xstate: Introduce a new variable for dynamic user states Chang S. Bae
@ 2020-10-01 20:38 ` Chang S. Bae
  2020-10-01 20:38 ` [RFC PATCH 07/22] x86/fpu/xstate: Introduce helpers to manage an xstate area dynamically Chang S. Bae
                   ` (15 subsequent siblings)
  21 siblings, 0 replies; 43+ messages in thread
From: Chang S. Bae @ 2020-10-01 20:38 UTC (permalink / raw)
  To: tglx, mingo, bp, luto, x86
  Cc: len.brown, dave.hansen, jing2.liu, ravi.v.shankar, linux-kernel,
	chang.seok.bae, kvm

The xstate area size in task->fpu used to be fixed at runtime. To
accommodate dynamic user states, introduce variables representing the
maximum and the default (as the minimum) area sizes.

do_extra_xstate_size_checks() is ready to calculate both sizes, which can
be compared with CPUID. CPUID can immediately provide the maximum size. To
obtain the default size, which excludes the dynamic parts, the code would
need to rewrite the XCR0 register. That is not always straightforward,
especially when inter-dependencies exist between state component bits. To
keep it simple, the code double-checks the maximum size only.
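
The resulting size relationship, sketched for illustration:

	/*
	 * fpu_kernel_xstate_default_size: static features only; the size
	 *	every task embeds in task->fpu by default.
	 * fpu_kernel_xstate_max_size: the default plus the dynamic user
	 *	states (e.g. the 8KB AMX tiles), as enumerated by CPUID.
	 *
	 * default <= max always holds, and the two stay equal until a
	 * dynamic user state is enabled.
	 */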

No functional change as long as the kernel does not support the dynamic
area.

Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Reviewed-by: Len Brown <len.brown@intel.com>
Cc: x86@kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: kvm@vger.kernel.org
---
 arch/x86/include/asm/processor.h | 10 ++-----
 arch/x86/kernel/fpu/core.c       |  6 ++--
 arch/x86/kernel/fpu/init.c       | 33 ++++++++++++----------
 arch/x86/kernel/fpu/signal.c     |  2 +-
 arch/x86/kernel/fpu/xstate.c     | 48 +++++++++++++++++++++-----------
 arch/x86/kernel/process.c        |  6 ++++
 arch/x86/kvm/x86.c               |  2 +-
 7 files changed, 65 insertions(+), 42 deletions(-)

diff --git a/arch/x86/include/asm/processor.h b/arch/x86/include/asm/processor.h
index 97143d87994c..f5f83aa1b90f 100644
--- a/arch/x86/include/asm/processor.h
+++ b/arch/x86/include/asm/processor.h
@@ -477,7 +477,8 @@ DECLARE_PER_CPU_ALIGNED(struct stack_canary, stack_canary);
 DECLARE_PER_CPU(struct irq_stack *, softirq_stack_ptr);
 #endif	/* X86_64 */
 
-extern unsigned int fpu_kernel_xstate_size;
+extern unsigned int fpu_kernel_xstate_default_size;
+extern unsigned int fpu_kernel_xstate_max_size;
 extern unsigned int fpu_user_xstate_size;
 
 struct perf_event;
@@ -551,12 +552,7 @@ struct thread_struct {
 };
 
 /* Whitelist the FPU state from the task_struct for hardened usercopy. */
-static inline void arch_thread_struct_whitelist(unsigned long *offset,
-						unsigned long *size)
-{
-	*offset = offsetof(struct thread_struct, fpu.state);
-	*size = fpu_kernel_xstate_size;
-}
+extern void arch_thread_struct_whitelist(unsigned long *offset, unsigned long *size);
 
 /*
  * Thread-synchronous status.
diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
index 39ddb22c143b..875620fdfe61 100644
--- a/arch/x86/kernel/fpu/core.c
+++ b/arch/x86/kernel/fpu/core.c
@@ -206,7 +206,7 @@ void fpstate_init(struct fpu *fpu)
 		return;
 	}
 
-	memset(state, 0, fpu_kernel_xstate_size);
+	memset(state, 0, fpu_kernel_xstate_default_size);
 
 	if (static_cpu_has(X86_FEATURE_XSAVES))
 		fpstate_init_xstate(&state->xsave, xfeatures_mask_all);
@@ -233,7 +233,7 @@ int fpu__copy(struct task_struct *dst, struct task_struct *src)
 	 * Don't let 'init optimized' areas of the XSAVE area
 	 * leak into the child task:
 	 */
-	memset(&dst_fpu->state.xsave, 0, fpu_kernel_xstate_size);
+	memset(&dst_fpu->state.xsave, 0, fpu_kernel_xstate_default_size);
 
 	/*
 	 * If the FPU registers are not current just memcpy() the state.
@@ -245,7 +245,7 @@ int fpu__copy(struct task_struct *dst, struct task_struct *src)
 	 */
 	fpregs_lock();
 	if (test_thread_flag(TIF_NEED_FPU_LOAD))
-		memcpy(&dst_fpu->state, &src_fpu->state, fpu_kernel_xstate_size);
+		memcpy(&dst_fpu->state, &src_fpu->state, fpu_kernel_xstate_default_size);
 
 	else if (!copy_fpregs_to_fpstate(dst_fpu))
 		copy_kernel_to_fpregs(dst_fpu);
diff --git a/arch/x86/kernel/fpu/init.c b/arch/x86/kernel/fpu/init.c
index 4e89a2698cfb..ee6499075a89 100644
--- a/arch/x86/kernel/fpu/init.c
+++ b/arch/x86/kernel/fpu/init.c
@@ -131,13 +131,17 @@ static void __init fpu__init_system_generic(void)
 }
 
 /*
- * Size of the FPU context state. All tasks in the system use the
- * same context size, regardless of what portion they use.
- * This is inherent to the XSAVE architecture which puts all state
- * components into a single, continuous memory block:
+ * Size of the maximum FPU context state. It is inherent to the XSAVE architecture
+ * which puts all state components into a single, continuous memory block:
  */
-unsigned int fpu_kernel_xstate_size;
-EXPORT_SYMBOL_GPL(fpu_kernel_xstate_size);
+unsigned int fpu_kernel_xstate_max_size;
+
+/*
+ * Size of the initial FPU context state. All tasks in the system use this context
+ * size by default.
+ */
+unsigned int fpu_kernel_xstate_default_size;
+EXPORT_SYMBOL_GPL(fpu_kernel_xstate_default_size);
 
 /* Get alignment of the TYPE. */
 #define TYPE_ALIGN(TYPE) offsetof(struct { char x; TYPE test; }, test)
@@ -167,9 +171,9 @@ static void __init fpu__init_task_struct_size(void)
 
 	/*
 	 * Add back the dynamically-calculated register state
-	 * size.
+	 * size by default.
 	 */
-	task_size += fpu_kernel_xstate_size;
+	task_size += fpu_kernel_xstate_default_size;
 
 	/*
 	 * We dynamically size 'struct fpu', so we require that
@@ -194,6 +198,7 @@ static void __init fpu__init_task_struct_size(void)
 static void __init fpu__init_system_xstate_size_legacy(void)
 {
 	static int on_boot_cpu __initdata = 1;
+	unsigned int size;
 
 	WARN_ON_FPU(!on_boot_cpu);
 	on_boot_cpu = 0;
@@ -204,17 +209,17 @@ static void __init fpu__init_system_xstate_size_legacy(void)
 	 */
 
 	if (!boot_cpu_has(X86_FEATURE_FPU)) {
-		fpu_kernel_xstate_size = sizeof(struct swregs_state);
+		size = sizeof(struct swregs_state);
 	} else {
 		if (boot_cpu_has(X86_FEATURE_FXSR))
-			fpu_kernel_xstate_size =
-				sizeof(struct fxregs_state);
+			size = sizeof(struct fxregs_state);
 		else
-			fpu_kernel_xstate_size =
-				sizeof(struct fregs_state);
+			size = sizeof(struct fregs_state);
 	}
 
-	fpu_user_xstate_size = fpu_kernel_xstate_size;
+	fpu_kernel_xstate_default_size = size;
+	fpu_kernel_xstate_max_size = size;
+	fpu_user_xstate_size = size;
 }
 
 /*
diff --git a/arch/x86/kernel/fpu/signal.c b/arch/x86/kernel/fpu/signal.c
index 6f3bcc7dab80..4fcd2caa63d3 100644
--- a/arch/x86/kernel/fpu/signal.c
+++ b/arch/x86/kernel/fpu/signal.c
@@ -289,8 +289,8 @@ static int copy_user_to_fpregs_zeroing(void __user *buf, u64 xbv, int fx_only)
 
 static int __fpu__restore_sig(void __user *buf, void __user *buf_fx, int size)
 {
+	int state_size = fpu_kernel_xstate_default_size;
 	struct user_i387_ia32_struct *envp = NULL;
-	int state_size = fpu_kernel_xstate_size;
 	int ia32_fxstate = (buf != buf_fx);
 	struct task_struct *tsk = current;
 	struct fpu *fpu = &tsk->thread.fpu;
diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index bf2b09bf9b38..6e0d8a9699ed 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -623,13 +623,20 @@ static void check_xstate_against_struct(int nr)
  */
 static void do_extra_xstate_size_checks(void)
 {
-	int paranoid_xstate_size = FXSAVE_SIZE + XSAVE_HDR_SIZE;
+	int paranoid_min_size, paranoid_max_size;
 	int i;
 
+	paranoid_min_size = FXSAVE_SIZE + XSAVE_HDR_SIZE;
+	paranoid_max_size = paranoid_min_size;
+
 	for (i = FIRST_EXTENDED_XFEATURE; i < XFEATURE_MAX; i++) {
+		bool dynamic;
+
 		if (!xfeature_enabled(i))
 			continue;
 
+		dynamic = (xfeatures_mask_user_dynamic & BIT_ULL(i)) ? true : false;
+
 		check_xstate_against_struct(i);
 		/*
 		 * Supervisor state components can be managed only by
@@ -639,23 +646,32 @@ static void do_extra_xstate_size_checks(void)
 			XSTATE_WARN_ON(xfeature_is_supervisor(i));
 
 		/* Align from the end of the previous feature */
-		if (xfeature_is_aligned(i))
-			paranoid_xstate_size = ALIGN(paranoid_xstate_size, 64);
+		if (xfeature_is_aligned(i)) {
+			paranoid_max_size = ALIGN(paranoid_max_size, 64);
+			if (!dynamic)
+				paranoid_min_size = ALIGN(paranoid_min_size, 64);
+		}
 		/*
 		 * The offset of a given state in the non-compacted
 		 * format is given to us in a CPUID leaf.  We check
 		 * them for being ordered (increasing offsets) in
 		 * setup_xstate_features().
 		 */
-		if (!using_compacted_format())
-			paranoid_xstate_size = xfeature_uncompacted_offset(i);
+		if (!using_compacted_format()) {
+			paranoid_max_size = xfeature_uncompacted_offset(i);
+			if (!dynamic)
+				paranoid_min_size = xfeature_uncompacted_offset(i);
+		}
 		/*
 		 * The compacted-format offset always depends on where
 		 * the previous state ended.
 		 */
-		paranoid_xstate_size += xfeature_size(i);
+		paranoid_max_size += xfeature_size(i);
+		if (!dynamic)
+			paranoid_min_size += xfeature_size(i);
 	}
-	XSTATE_WARN_ON(paranoid_xstate_size != fpu_kernel_xstate_size);
+	XSTATE_WARN_ON(paranoid_max_size != fpu_kernel_xstate_max_size);
+	fpu_kernel_xstate_default_size = paranoid_min_size;
 }
 
 
@@ -740,27 +756,27 @@ static bool is_supported_xstate_size(unsigned int test_xstate_size)
 static int __init init_xstate_size(void)
 {
 	/* Recompute the context size for enabled features: */
-	unsigned int possible_xstate_size;
+	unsigned int possible_max_xstate_size;
 	unsigned int xsave_size;
 
 	xsave_size = get_xsave_size();
 
 	if (boot_cpu_has(X86_FEATURE_XSAVES))
-		possible_xstate_size = get_xsaves_size_no_dynamic();
+		possible_max_xstate_size = get_xsaves_size_no_dynamic();
 	else
-		possible_xstate_size = xsave_size;
-
-	/* Ensure we have the space to store all enabled: */
-	if (!is_supported_xstate_size(possible_xstate_size))
-		return -EINVAL;
+		possible_max_xstate_size = xsave_size;
 
 	/*
 	 * The size is OK, we are definitely going to use xsave,
 	 * make it known to the world that we need more space.
 	 */
-	fpu_kernel_xstate_size = possible_xstate_size;
+	fpu_kernel_xstate_max_size = possible_max_xstate_size;
 	do_extra_xstate_size_checks();
 
+	/* Ensure we have the default space: */
+	if (!is_supported_xstate_size(fpu_kernel_xstate_default_size))
+		return -EINVAL;
+
 	/*
 	 * User space is always in standard format.
 	 */
@@ -865,7 +881,7 @@ void __init fpu__init_system_xstate(void)
 
 	pr_info("x86/fpu: Enabled xstate features 0x%llx, context size is %d bytes, using '%s' format.\n",
 		xfeatures_mask_all,
-		fpu_kernel_xstate_size,
+		fpu_kernel_xstate_max_size,
 		boot_cpu_has(X86_FEATURE_XSAVES) ? "compacted" : "standard");
 	return;
 
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index ba4593a913fa..43d38bd09fb1 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -96,6 +96,12 @@ int arch_dup_task_struct(struct task_struct *dst, struct task_struct *src)
 	return fpu__copy(dst, src);
 }
 
+void arch_thread_struct_whitelist(unsigned long *offset, unsigned long *size)
+{
+	*offset = offsetof(struct thread_struct, fpu.state);
+	*size = fpu_kernel_xstate_default_size;
+}
+
 /*
  * Free thread data structures etc..
  */
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 192d52ff5b8c..ecec6418ccca 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -8848,7 +8848,7 @@ static void kvm_save_current_fpu(struct fpu *fpu)
 	 */
 	if (test_thread_flag(TIF_NEED_FPU_LOAD))
 		memcpy(&fpu->state, &current->thread.fpu.state,
-		       fpu_kernel_xstate_size);
+		       fpu_kernel_xstate_default_size);
 	else
 		copy_fpregs_to_fpstate(fpu);
 }
-- 
2.17.1



* [RFC PATCH 07/22] x86/fpu/xstate: Introduce helpers to manage an xstate area dynamically
  2020-10-01 20:38 [RFC PATCH 00/22] x86: Support Intel Advanced Matrix Extensions Chang S. Bae
                   ` (5 preceding siblings ...)
  2020-10-01 20:38 ` [RFC PATCH 06/22] x86/fpu/xstate: Outline dynamic xstate area size in the task context Chang S. Bae
@ 2020-10-01 20:38 ` Chang S. Bae
  2020-10-01 23:41   ` Andy Lutomirski
  2020-10-01 20:38 ` [RFC PATCH 08/22] x86/fpu/xstate: Define the scope of the initial xstate data Chang S. Bae
                   ` (14 subsequent siblings)
  21 siblings, 1 reply; 43+ messages in thread
From: Chang S. Bae @ 2020-10-01 20:38 UTC (permalink / raw)
  To: tglx, mingo, bp, luto, x86
  Cc: len.brown, dave.hansen, jing2.liu, ravi.v.shankar, linux-kernel,
	chang.seok.bae

task->fpu has a buffer to keep the extended register states, but it is not
expandable at runtime. Introduce runtime methods and new fpu struct fields
to support the expansion.

fpu->state_mask indicates which states are saved for each task, and
fpu->state_ptr points to the dynamically allocated area.

alloc_xstate_area() uses vmalloc() for scalability. However, set a
threshold (64KB) to watch for a potential need for an alternative
mechanism.

Also, introduce a new helper -- get_xstate_size() -- to calculate the area
size.

No functional change until the kernel supports dynamic user states.
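
For illustration only, a hypothetical caller (example_expand() is not part
of this patch) might grow a task's buffer like this:

	/* Sketch: expand the xstate area to cover feature bit 'nr' */
	static int example_expand(struct fpu *fpu, int nr)
	{
		unsigned int new_size = 0;	/* stays 0 if no growth was needed */
		int err;

		err = alloc_xstate_area(fpu, BIT_ULL(nr), &new_size);
		if (err)
			return err;	/* -ENOMEM; the old area is left intact */

		if (new_size)
			pr_debug("x86/fpu: xstate area is now %u bytes\n", new_size);
		return 0;
	}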

Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Reviewed-by: Len Brown <len.brown@intel.com>
Cc: x86@kernel.org
Cc: linux-kernel@vger.kernel.org
---
 arch/x86/include/asm/fpu/types.h  |  29 +++++--
 arch/x86/include/asm/fpu/xstate.h |   3 +
 arch/x86/kernel/fpu/core.c        |   3 +
 arch/x86/kernel/fpu/xstate.c      | 124 ++++++++++++++++++++++++++++++
 4 files changed, 154 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/fpu/types.h b/arch/x86/include/asm/fpu/types.h
index c87364ea6446..4b7756644824 100644
--- a/arch/x86/include/asm/fpu/types.h
+++ b/arch/x86/include/asm/fpu/types.h
@@ -327,14 +327,33 @@ struct fpu {
 	 */
 	unsigned long			avx512_timestamp;
 
+	/*
+	 * @state_mask:
+	 *
+	 * The bitmap of saved state components. It indicates which xstate
+	 * is saved in either @state or @state_ptr: the bitmap describes
+	 * @state initially, and @state_ptr once that area is in use.
+	 */
+	u64				state_mask;
+
+	/*
+	 * @state_ptr:
+	 *
+	 * Dynamically-allocated copy of all extended register states that
+	 * we save and restore over context switches. When a task uses
+	 * extended features, the registers hold the most current state,
+	 * and this copy is more recent than @state. If the task context-
+	 * switches away, the registers get saved here as the xstate.
+	 */
+	union fpregs_state		*state_ptr;
+
 	/*
 	 * @state:
 	 *
-	 * In-memory copy of all FPU registers that we save/restore
-	 * over context switches. If the task is using the FPU then
-	 * the registers in the FPU are more recent than this state
-	 * copy. If the task context-switches away then they get
-	 * saved here and represent the FPU state.
+	 * Copy of some extended register state that we save and restore
+	 * over context switches. If a task uses the dynamically-allocated
+	 * area, @state_ptr, then that area holds a more recent copy than
+	 * this one; otherwise this copy has the semantics described above.
 	 */
 	union fpregs_state		state;
 	/*
diff --git a/arch/x86/include/asm/fpu/xstate.h b/arch/x86/include/asm/fpu/xstate.h
index 9aad91c0725b..37728bfcb71e 100644
--- a/arch/x86/include/asm/fpu/xstate.h
+++ b/arch/x86/include/asm/fpu/xstate.h
@@ -103,6 +103,9 @@ extern void __init update_regset_xstate_info(unsigned int size,
 					     u64 xstate_mask);
 
 void *get_xsave_addr(struct fpu *fpu, int xfeature_nr);
+int alloc_xstate_area(struct fpu *fpu, u64 mask, unsigned int *alloc_size);
+void free_xstate_area(struct fpu *fpu);
+
 const void *get_xsave_field_ptr(int xfeature_nr);
 int using_compacted_format(void);
 int xfeature_size(int xfeature_nr);
diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
index 875620fdfe61..e25f7866800e 100644
--- a/arch/x86/kernel/fpu/core.c
+++ b/arch/x86/kernel/fpu/core.c
@@ -235,6 +235,9 @@ int fpu__copy(struct task_struct *dst, struct task_struct *src)
 	 */
 	memset(&dst_fpu->state.xsave, 0, fpu_kernel_xstate_default_size);
 
+	dst_fpu->state_mask = xfeatures_mask_all & ~xfeatures_mask_user_dynamic;
+	dst_fpu->state_ptr = NULL;
+
 	/*
 	 * If the FPU registers are not current just memcpy() the state.
 	 * Otherwise save current FPU registers directly into the child's FPU
diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index 6e0d8a9699ed..af60332aafef 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -10,6 +10,7 @@
 #include <linux/pkeys.h>
 #include <linux/seq_file.h>
 #include <linux/proc_fs.h>
+#include <linux/vmalloc.h>
 
 #include <asm/fpu/api.h>
 #include <asm/fpu/internal.h>
@@ -69,6 +70,7 @@ static unsigned int xstate_offsets[XFEATURE_MAX] = { [ 0 ... XFEATURE_MAX - 1] =
 static unsigned int xstate_sizes[XFEATURE_MAX]   = { [ 0 ... XFEATURE_MAX - 1] = -1};
 static unsigned int xstate_comp_offsets[XFEATURE_MAX] = { [ 0 ... XFEATURE_MAX - 1] = -1};
 static unsigned int xstate_supervisor_only_offsets[XFEATURE_MAX] = { [ 0 ... XFEATURE_MAX - 1] = -1};
+static bool xstate_aligns[XFEATURE_MAX] = { [ 0 ... XFEATURE_MAX - 1] = false};
 
 /*
  * The XSAVE area of kernel can be in standard or compacted format;
@@ -128,6 +130,48 @@ static bool xfeature_is_supervisor(int xfeature_nr)
 	return ecx & 1;
 }
 
+/*
+ * Available only after the offset, size, and alignment arrays have been
+ * set up by setup_xstate_features().
+ */
+static unsigned int get_xstate_size(u64 mask)
+{
+	unsigned int size;
+	u64 xmask;
+	int i, nr;
+
+	if (!mask)
+		return 0;
+	else if (mask == (xfeatures_mask_all & ~xfeatures_mask_user_dynamic))
+		return fpu_kernel_xstate_default_size;
+	else if (mask == xfeatures_mask_all)
+		return fpu_kernel_xstate_max_size;
+
+	nr = fls64(mask) - 1;
+
+	if (!using_compacted_format())
+		return xstate_offsets[nr] + xstate_sizes[nr];
+
+	xmask = BIT_ULL(nr + 1) - 1;
+
+	if (mask == (xmask & xfeatures_mask_all))
+		return xstate_comp_offsets[nr] + xstate_sizes[nr];
+
+	/*
+	 * No precomputed size matches the given mask in this xstate area
+	 * format, so calculate the size by summing up each state.
+	 */
+	for (size = FXSAVE_SIZE + XSAVE_HDR_SIZE, i = FIRST_EXTENDED_XFEATURE; i <= nr; i++) {
+		if (!(mask & BIT_ULL(i)))
+			continue;
+
+		if (xstate_aligns[i])
+			size = ALIGN(size, 64);
+		size += xstate_sizes[i];
+	}
+	return size;
+}
+
 /*
  * When executing XSAVEOPT (or other optimized XSAVE instructions), if
  * a processor implementation detects that an FPU state component is still
@@ -268,10 +312,12 @@ static void __init setup_xstate_features(void)
 	xstate_offsets[XFEATURE_FP]	= 0;
 	xstate_sizes[XFEATURE_FP]	= offsetof(struct fxregs_state,
 						   xmm_space);
+	xstate_aligns[XFEATURE_FP]	= true;
 
 	xstate_offsets[XFEATURE_SSE]	= xstate_sizes[XFEATURE_FP];
 	xstate_sizes[XFEATURE_SSE]	= sizeof_field(struct fxregs_state,
 						       xmm_space);
+	xstate_aligns[XFEATURE_SSE]	= true;
 
 	for (i = FIRST_EXTENDED_XFEATURE; i < XFEATURE_MAX; i++) {
 		if (!xfeature_enabled(i))
@@ -289,6 +335,7 @@ static void __init setup_xstate_features(void)
 			continue;
 
 		xstate_offsets[i] = ebx;
+		xstate_aligns[i] = (ecx & 2) ? true : false;
 
 		/*
 		 * In our xstate size checks, we assume that the highest-numbered
@@ -753,6 +800,9 @@ static bool is_supported_xstate_size(unsigned int test_xstate_size)
 	return false;
 }
 
+/* Watch threshold for the size of the dynamically allocated xstate area */
+#define XSTATE_AREA_MAX_BYTES		(64 * 1024)
+
 static int __init init_xstate_size(void)
 {
 	/* Recompute the context size for enabled features: */
@@ -777,6 +827,14 @@ static int __init init_xstate_size(void)
 	if (!is_supported_xstate_size(fpu_kernel_xstate_default_size))
 		return -EINVAL;
 
+	/*
+	 * If the allocation is observed to exceed the threshold, it is worth
+	 * considering a more scalable mechanism than the current one.
+	 */
+	if (fpu_kernel_xstate_max_size > XSTATE_AREA_MAX_BYTES)
+		pr_warn("x86/fpu: xstate buffer too large (%u > %u)\n",
+			fpu_kernel_xstate_max_size, XSTATE_AREA_MAX_BYTES);
+
 	/*
 	 * User space is always in standard format.
 	 */
@@ -867,6 +925,12 @@ void __init fpu__init_system_xstate(void)
 	if (err)
 		goto out_disable;
 
+	/*
+	 * At this point, the size passed the sanity check. Have init_task
+	 * take the feature mask.
+	 */
+	current->thread.fpu.state_mask = (xfeatures_mask_all & ~xfeatures_mask_user_dynamic);
+
 	/*
 	 * Update info used for ptrace frames; use standard-format size and no
 	 * supervisor xstates:
@@ -1086,6 +1150,66 @@ static inline bool xfeatures_mxcsr_quirk(u64 xfeatures)
 	return true;
 }
 
+void free_xstate_area(struct fpu *fpu)
+{
+	vfree(fpu->state_ptr);
+}
+
+/*
+ * Allocate a new xstate area, sized from the given feature bitmap.
+ *
+ * No mechanism is implemented yet to shrink or reclaim the xstate area on
+ * the fly; whether one is needed depends on real-world usage.
+ */
+int alloc_xstate_area(struct fpu *fpu, u64 mask, unsigned int *alloc_size)
+{
+	union fpregs_state *state_ptr;
+	unsigned int oldsz, newsz;
+	u64 state_mask;
+
+	state_mask = fpu->state_mask | mask;
+
+	oldsz = get_xstate_size(fpu->state_mask);
+	newsz = get_xstate_size(state_mask);
+
+	if (oldsz >= newsz)
+		return 0;
+
+	if (newsz > fpu_kernel_xstate_max_size) {
+		pr_warn_once("x86/fpu: state buffer too large (%u > %u bytes)\n",
+			     newsz, fpu_kernel_xstate_max_size);
+		XSTATE_WARN_ON(1);
+		return 0;
+	}
+
+	/*
+	 * The caller may run with interrupts disabled. Enable interrupts
+	 * before the memory allocation, which may involve page faults.
+	 */
+	local_irq_enable();
+	/* We need a 64B-aligned pointer; vmalloc()'s page-aligned return satisfies that */
+	state_ptr = vmalloc(newsz);
+	local_irq_disable();
+	if (!state_ptr)
+		return -ENOMEM;
+
+	memset(state_ptr, 0, newsz);
+	if (using_compacted_format())
+		fpstate_init_xstate(&state_ptr->xsave, state_mask);
+
+	/*
+	 * As long as the register state is intact, the xstate gets saved in the new
+	 * area at the next context copy/switch, or by a ptrace-driven xstate write.
+	 */
+
+	vfree(fpu->state_ptr);
+	fpu->state_ptr = state_ptr;
+	fpu->state_mask = state_mask;
+	if (alloc_size)
+		*alloc_size = newsz;
+	return 0;
+}
+
 static void fill_gap(struct membuf *to, unsigned *last, unsigned offset)
 {
 	if (*last >= offset)
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [RFC PATCH 08/22] x86/fpu/xstate: Define the scope of the initial xstate data
  2020-10-01 20:38 [RFC PATCH 00/22] x86: Support Intel Advanced Matrix Extensions Chang S. Bae
                   ` (6 preceding siblings ...)
  2020-10-01 20:38 ` [RFC PATCH 07/22] x86/fpu/xstate: Introduce helpers to manage an xstate area dynamically Chang S. Bae
@ 2020-10-01 20:38 ` Chang S. Bae
  2020-10-01 20:39 ` [RFC PATCH 09/22] x86/fpu/xstate: Introduce wrapper functions for organizing xstate area access Chang S. Bae
                   ` (13 subsequent siblings)
  21 siblings, 0 replies; 43+ messages in thread
From: Chang S. Bae @ 2020-10-01 20:38 UTC (permalink / raw)
  To: tglx, mingo, bp, luto, x86
  Cc: len.brown, dave.hansen, jing2.liu, ravi.v.shankar, linux-kernel,
	chang.seok.bae

init_fpstate covers all the component states. But doing so becomes less
efficient as state sizes trend larger while their initial data stays
trivial.

Limit init_fpstate by clarifying its size and coverage: everything except
dynamic user states. The dynamic states are assumed to be large but to
have all-zero initial data.

No functional change until the kernel supports dynamic user states.
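
As a condensed sketch (reusing names from this series; mirroring the
fpstate_init() path for init_fpstate after this change, not a hunk from
this patch):

	/* init_fpstate now covers only the default (non-dynamic) states */
	memset(&init_fpstate, 0, get_init_fpstate_size());	/* default size only */
	if (static_cpu_has(X86_FEATURE_XSAVES))
		fpstate_init_xstate(&init_fpstate.xsave,
				    get_init_fpstate_mask());	/* no dynamic user states */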

Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Reviewed-by: Len Brown <len.brown@intel.com>
Cc: x86@kernel.org
Cc: linux-kernel@vger.kernel.org
---
 arch/x86/include/asm/fpu/internal.h | 18 +++++++++++++++---
 arch/x86/include/asm/fpu/xstate.h   |  1 +
 arch/x86/kernel/fpu/core.c          |  4 ++--
 arch/x86/kernel/fpu/xstate.c        |  4 ++--
 4 files changed, 20 insertions(+), 7 deletions(-)

diff --git a/arch/x86/include/asm/fpu/internal.h b/arch/x86/include/asm/fpu/internal.h
index 6eec5209750f..d64c1083bd93 100644
--- a/arch/x86/include/asm/fpu/internal.h
+++ b/arch/x86/include/asm/fpu/internal.h
@@ -79,6 +79,18 @@ static __always_inline __pure bool use_fxsr(void)
 
 extern union fpregs_state init_fpstate;
 
+static inline u64 get_init_fpstate_mask(void)
+{
+	/* init_fpstate covers the same state components as fpu->state */
+	return (xfeatures_mask_all & ~xfeatures_mask_user_dynamic);
+}
+
+static inline unsigned int get_init_fpstate_size(void)
+{
+	/* The fpu->state size matches the init_fpstate size */
+	return fpu_kernel_xstate_default_size;
+}
+
 extern void fpstate_init(struct fpu *fpu);
 #ifdef CONFIG_MATH_EMULATION
 extern void fpstate_init_soft(struct swregs_state *soft);
@@ -268,12 +280,12 @@ static inline void copy_fxregs_to_kernel(struct fpu *fpu)
 		     : "memory")
 
 /*
- * This function is called only during boot time when x86 caps are not set
- * up and alternative can not be used yet.
+ * Use this function to dump the initial state, only during boot time when
+ * x86 caps are not yet set up and alternatives are not available.
  */
 static inline void copy_xregs_to_kernel_booting(struct xregs_state *xstate)
 {
-	u64 mask = xfeatures_mask_all;
+	u64 mask = get_init_fpstate_mask();
 	u32 lmask = mask;
 	u32 hmask = mask >> 32;
 	int err;
diff --git a/arch/x86/include/asm/fpu/xstate.h b/arch/x86/include/asm/fpu/xstate.h
index 37728bfcb71e..9de8b4c49855 100644
--- a/arch/x86/include/asm/fpu/xstate.h
+++ b/arch/x86/include/asm/fpu/xstate.h
@@ -103,6 +103,7 @@ extern void __init update_regset_xstate_info(unsigned int size,
 					     u64 xstate_mask);
 
 void *get_xsave_addr(struct fpu *fpu, int xfeature_nr);
+unsigned int get_xstate_size(u64 mask);
 int alloc_xstate_area(struct fpu *fpu, u64 mask, unsigned int *alloc_size);
 void free_xstate_area(struct fpu *fpu);
 
diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
index e25f7866800e..33956ae3de2b 100644
--- a/arch/x86/kernel/fpu/core.c
+++ b/arch/x86/kernel/fpu/core.c
@@ -206,10 +206,10 @@ void fpstate_init(struct fpu *fpu)
 		return;
 	}
 
-	memset(state, 0, fpu_kernel_xstate_default_size);
+	memset(state, 0, fpu ? get_xstate_size(fpu->state_mask) : get_init_fpstate_size());
 
 	if (static_cpu_has(X86_FEATURE_XSAVES))
-		fpstate_init_xstate(&state->xsave, xfeatures_mask_all);
+		fpstate_init_xstate(&state->xsave, fpu ? fpu->state_mask : get_init_fpstate_mask());
 	if (static_cpu_has(X86_FEATURE_FXSR))
 		fpstate_init_fxstate(&state->fxsave);
 	else
diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index af60332aafef..2e190254d4aa 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -134,7 +134,7 @@ static bool xfeature_is_supervisor(int xfeature_nr)
  * Available once those arrays for the offset, size, and alignment info are set up,
  * by setup_xstate_features().
  */
-static unsigned int get_xstate_size(u64 mask)
+unsigned int get_xstate_size(u64 mask)
 {
 	unsigned int size;
 	u64 xmask;
@@ -507,7 +507,7 @@ static void __init setup_init_fpu_buf(void)
 	print_xstate_features();
 
 	if (boot_cpu_has(X86_FEATURE_XSAVES))
-		fpstate_init_xstate(&init_fpstate.xsave, xfeatures_mask_all);
+		fpstate_init_xstate(&init_fpstate.xsave, get_init_fpstate_mask());
 
 	/*
 	 * Init all the features state with header.xfeatures being 0x0
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [RFC PATCH 09/22] x86/fpu/xstate: Introduce wrapper functions for organizing xstate area access
  2020-10-01 20:38 [RFC PATCH 00/22] x86: Support Intel Advanced Matrix Extensions Chang S. Bae
                   ` (7 preceding siblings ...)
  2020-10-01 20:38 ` [RFC PATCH 08/22] x86/fpu/xstate: Define the scope of the initial xstate data Chang S. Bae
@ 2020-10-01 20:39 ` Chang S. Bae
  2020-10-01 20:39 ` [RFC PATCH 10/22] x86/fpu/xstate: Update xstate save function for supporting dynamic user xstate Chang S. Bae
                   ` (12 subsequent siblings)
  21 siblings, 0 replies; 43+ messages in thread
From: Chang S. Bae @ 2020-10-01 20:39 UTC (permalink / raw)
  To: tglx, mingo, bp, luto, x86
  Cc: len.brown, dave.hansen, jing2.liu, ravi.v.shankar, linux-kernel,
	chang.seok.bae

task->fpu now has two possible xstate areas, fpu->state and fpu->state_ptr.
Instead of open-coding access to one of the two areas, rearrange the code
to use a new wrapper.

Some open-coded accesses (e.g., in KVM) are left unchanged as they are not
going to use fpu->state_ptr at the moment.

No functional change until the kernel supports dynamic user states.
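
A condensed usage sketch (not a hunk from this patch):

	/* Always resolve the active area through the wrapper */
	struct fpu *fpu = &current->thread.fpu;
	struct xregs_state *xsave = __xsave(fpu);	/* fpu->state_ptr if set, else &fpu->state */
	u64 xfeatures = xsave->header.xfeatures;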

Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Reviewed-by: Len Brown <len.brown@intel.com>
Cc: x86@kernel.org
Cc: linux-kernel@vger.kernel.org
---
 arch/x86/include/asm/fpu/internal.h | 10 ++++++----
 arch/x86/include/asm/fpu/xstate.h   | 10 ++++++++++
 arch/x86/include/asm/trace/fpu.h    |  6 ++++--
 arch/x86/kernel/fpu/core.c          | 27 ++++++++++++++++-----------
 arch/x86/kernel/fpu/regset.c        | 28 +++++++++++++++++-----------
 arch/x86/kernel/fpu/signal.c        | 23 +++++++++++++----------
 arch/x86/kernel/fpu/xstate.c        | 20 +++++++++++---------
 7 files changed, 77 insertions(+), 47 deletions(-)

diff --git a/arch/x86/include/asm/fpu/internal.h b/arch/x86/include/asm/fpu/internal.h
index d64c1083bd93..2dfb3b6f58fc 100644
--- a/arch/x86/include/asm/fpu/internal.h
+++ b/arch/x86/include/asm/fpu/internal.h
@@ -209,10 +209,12 @@ static inline int copy_user_to_fregs(struct fregs_state __user *fx)
 
 static inline void copy_fxregs_to_kernel(struct fpu *fpu)
 {
+	union fpregs_state *xstate = __xstate(fpu);
+
 	if (IS_ENABLED(CONFIG_X86_32))
-		asm volatile( "fxsave %[fx]" : [fx] "=m" (fpu->state.fxsave));
+		asm volatile("fxsave %[fx]" : [fx] "=m" (xstate->fxsave));
 	else
-		asm volatile("fxsaveq %[fx]" : [fx] "=m" (fpu->state.fxsave));
+		asm volatile("fxsaveq %[fx]" : [fx] "=m" (xstate->fxsave));
 }
 
 /* These macros all use (%edi)/(%rdi) as the single memory argument. */
@@ -410,7 +412,7 @@ static inline int copy_user_to_xregs(struct xregs_state __user *buf, u64 mask)
  */
 static inline int copy_kernel_to_xregs_err(struct fpu *fpu, u64 mask)
 {
-	struct xregs_state *xstate = &fpu->state.xsave;
+	struct xregs_state *xstate = __xsave(fpu);
 	u32 lmask = mask;
 	u32 hmask = mask >> 32;
 	int err;
@@ -439,7 +441,7 @@ static inline void __copy_kernel_to_fpregs(union fpregs_state *fpstate, u64 mask
 
 static inline void copy_kernel_to_fpregs(struct fpu *fpu)
 {
-	union fpregs_state *fpstate = &fpu->state;
+	union fpregs_state *fpstate = __xstate(fpu);
 
 	/*
 	 * AMD K7/K8 CPUs don't save/restore FDP/FIP/FOP unless an exception is
diff --git a/arch/x86/include/asm/fpu/xstate.h b/arch/x86/include/asm/fpu/xstate.h
index 9de8b4c49855..b2125ec90cdb 100644
--- a/arch/x86/include/asm/fpu/xstate.h
+++ b/arch/x86/include/asm/fpu/xstate.h
@@ -102,6 +102,16 @@ extern u64 xstate_fx_sw_bytes[USER_XSTATE_FX_SW_WORDS];
 extern void __init update_regset_xstate_info(unsigned int size,
 					     u64 xstate_mask);
 
+static inline union fpregs_state *__xstate(struct fpu *fpu)
+{
+	return (fpu->state_ptr) ? fpu->state_ptr : &fpu->state;
+}
+
+static inline struct xregs_state *__xsave(struct fpu *fpu)
+{
+	return &__xstate(fpu)->xsave;
+}
+
 void *get_xsave_addr(struct fpu *fpu, int xfeature_nr);
 unsigned int get_xstate_size(u64 mask);
 int alloc_xstate_area(struct fpu *fpu, u64 mask, unsigned int *alloc_size);
diff --git a/arch/x86/include/asm/trace/fpu.h b/arch/x86/include/asm/trace/fpu.h
index 879b77792f94..9dcce5833bc6 100644
--- a/arch/x86/include/asm/trace/fpu.h
+++ b/arch/x86/include/asm/trace/fpu.h
@@ -22,8 +22,10 @@ DECLARE_EVENT_CLASS(x86_fpu,
 		__entry->fpu		= fpu;
 		__entry->load_fpu	= test_thread_flag(TIF_NEED_FPU_LOAD);
 		if (boot_cpu_has(X86_FEATURE_OSXSAVE)) {
-			__entry->xfeatures = fpu->state.xsave.header.xfeatures;
-			__entry->xcomp_bv  = fpu->state.xsave.header.xcomp_bv;
+			struct xregs_state *xsave = __xsave(fpu);
+
+			__entry->xfeatures = xsave->header.xfeatures;
+			__entry->xcomp_bv  = xsave->header.xcomp_bv;
 		}
 	),
 	TP_printk("x86/fpu: %p load: %d xfeatures: %llx xcomp_bv: %llx",
diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
index 33956ae3de2b..dca4961fcc36 100644
--- a/arch/x86/kernel/fpu/core.c
+++ b/arch/x86/kernel/fpu/core.c
@@ -94,14 +94,18 @@ EXPORT_SYMBOL(irq_fpu_usable);
  */
 int copy_fpregs_to_fpstate(struct fpu *fpu)
 {
+	union fpregs_state *xstate = __xstate(fpu);
+
 	if (likely(use_xsave())) {
-		copy_xregs_to_kernel(&fpu->state.xsave);
+		struct xregs_state *xsave = &xstate->xsave;
+
+		copy_xregs_to_kernel(xsave);
 
 		/*
 		 * AVX512 state is tracked here because its use is
 		 * known to slow the max clock speed of the core.
 		 */
-		if (fpu->state.xsave.header.xfeatures & XFEATURE_MASK_AVX512)
+		if (xsave->header.xfeatures & XFEATURE_MASK_AVX512)
 			fpu->avx512_timestamp = jiffies;
 		return 1;
 	}
@@ -115,7 +119,7 @@ int copy_fpregs_to_fpstate(struct fpu *fpu)
 	 * Legacy FPU register saving, FNSAVE always clears FPU registers,
 	 * so we have to mark them inactive:
 	 */
-	asm volatile("fnsave %[fp]; fwait" : [fp] "=m" (fpu->state.fsave));
+	asm volatile("fnsave %[fp]; fwait" : [fp] "=m" (xstate->fsave));
 
 	return 0;
 }
@@ -197,7 +201,7 @@ void fpstate_init(struct fpu *fpu)
 	union fpregs_state *state;
 
 	if (fpu)
-		state = &fpu->state;
+		state = __xstate(fpu);
 	else
 		state = &init_fpstate;
 
@@ -248,7 +252,7 @@ int fpu__copy(struct task_struct *dst, struct task_struct *src)
 	 */
 	fpregs_lock();
 	if (test_thread_flag(TIF_NEED_FPU_LOAD))
-		memcpy(&dst_fpu->state, &src_fpu->state, fpu_kernel_xstate_default_size);
+		memcpy(__xstate(dst_fpu), __xstate(src_fpu), fpu_kernel_xstate_default_size);
 
 	else if (!copy_fpregs_to_fpstate(dst_fpu))
 		copy_kernel_to_fpregs(dst_fpu);
@@ -384,7 +388,7 @@ static void fpu__clear(struct fpu *fpu, bool user_only)
 	if (user_only) {
 		if (!fpregs_state_valid(fpu, smp_processor_id()) &&
 		    xfeatures_mask_supervisor())
-			copy_kernel_to_xregs(&fpu->state.xsave,
+			copy_kernel_to_xregs(__xsave(fpu),
 					     xfeatures_mask_supervisor());
 		copy_init_fpstate_to_fpregs(xfeatures_mask_user());
 	} else {
@@ -451,6 +455,7 @@ EXPORT_SYMBOL_GPL(fpregs_mark_activate);
 
 int fpu__exception_code(struct fpu *fpu, int trap_nr)
 {
+	union fpregs_state *xstate = __xstate(fpu);
 	int err;
 
 	if (trap_nr == X86_TRAP_MF) {
@@ -466,11 +471,11 @@ int fpu__exception_code(struct fpu *fpu, int trap_nr)
 		 * fully reproduce the context of the exception.
 		 */
 		if (boot_cpu_has(X86_FEATURE_FXSR)) {
-			cwd = fpu->state.fxsave.cwd;
-			swd = fpu->state.fxsave.swd;
+			cwd = xstate->fxsave.cwd;
+			swd = xstate->fxsave.swd;
 		} else {
-			cwd = (unsigned short)fpu->state.fsave.cwd;
-			swd = (unsigned short)fpu->state.fsave.swd;
+			cwd = (unsigned short)xstate->fsave.cwd;
+			swd = (unsigned short)xstate->fsave.swd;
 		}
 
 		err = swd & ~cwd;
@@ -484,7 +489,7 @@ int fpu__exception_code(struct fpu *fpu, int trap_nr)
 		unsigned short mxcsr = MXCSR_DEFAULT;
 
 		if (boot_cpu_has(X86_FEATURE_XMM))
-			mxcsr = fpu->state.fxsave.mxcsr;
+			mxcsr = xstate->fxsave.mxcsr;
 
 		err = ~(mxcsr >> 7) & mxcsr;
 	}
diff --git a/arch/x86/kernel/fpu/regset.c b/arch/x86/kernel/fpu/regset.c
index 5e13e58d11d4..8d863240b9c6 100644
--- a/arch/x86/kernel/fpu/regset.c
+++ b/arch/x86/kernel/fpu/regset.c
@@ -37,7 +37,7 @@ int xfpregs_get(struct task_struct *target, const struct user_regset *regset,
 	fpu__prepare_read(fpu);
 	fpstate_sanitize_xstate(fpu);
 
-	return membuf_write(&to, &fpu->state.fxsave, sizeof(struct fxregs_state));
+	return membuf_write(&to, &__xstate(fpu)->fxsave, sizeof(struct fxregs_state));
 }
 
 int xfpregs_set(struct task_struct *target, const struct user_regset *regset,
@@ -45,6 +45,7 @@ int xfpregs_set(struct task_struct *target, const struct user_regset *regset,
 		const void *kbuf, const void __user *ubuf)
 {
 	struct fpu *fpu = &target->thread.fpu;
+	union fpregs_state *xstate;
 	int ret;
 
 	if (!boot_cpu_has(X86_FEATURE_FXSR))
@@ -53,20 +54,22 @@ int xfpregs_set(struct task_struct *target, const struct user_regset *regset,
 	fpu__prepare_write(fpu);
 	fpstate_sanitize_xstate(fpu);
 
+	xstate = __xstate(fpu);
+
 	ret = user_regset_copyin(&pos, &count, &kbuf, &ubuf,
-				 &fpu->state.fxsave, 0, -1);
+				 &xstate->fxsave, 0, -1);
 
 	/*
 	 * mxcsr reserved bits must be masked to zero for security reasons.
 	 */
-	fpu->state.fxsave.mxcsr &= mxcsr_feature_mask;
+	xstate->fxsave.mxcsr &= mxcsr_feature_mask;
 
 	/*
 	 * update the header bits in the xsave header, indicating the
 	 * presence of FP and SSE state.
 	 */
 	if (boot_cpu_has(X86_FEATURE_XSAVE))
-		fpu->state.xsave.header.xfeatures |= XFEATURE_MASK_FPSSE;
+		xstate->xsave.header.xfeatures |= XFEATURE_MASK_FPSSE;
 
 	return ret;
 }
@@ -80,7 +83,7 @@ int xstateregs_get(struct task_struct *target, const struct user_regset *regset,
 	if (!boot_cpu_has(X86_FEATURE_XSAVE))
 		return -ENODEV;
 
-	xsave = &fpu->state.xsave;
+	xsave = __xsave(fpu);
 
 	fpu__prepare_read(fpu);
 
@@ -120,7 +123,7 @@ int xstateregs_set(struct task_struct *target, const struct user_regset *regset,
 	if ((pos != 0) || (count < fpu_user_xstate_size))
 		return -EFAULT;
 
-	xsave = &fpu->state.xsave;
+	xsave = __xsave(fpu);
 
 	fpu__prepare_write(fpu);
 
@@ -224,7 +227,7 @@ static inline u32 twd_fxsr_to_i387(struct fxregs_state *fxsave)
 void
 convert_from_fxsr(struct user_i387_ia32_struct *env, struct task_struct *tsk)
 {
-	struct fxregs_state *fxsave = &tsk->thread.fpu.state.fxsave;
+	struct fxregs_state *fxsave = &__xstate(&tsk->thread.fpu)->fxsave;
 	struct _fpreg *to = (struct _fpreg *) &env->st_space[0];
 	struct _fpxreg *from = (struct _fpxreg *) &fxsave->st_space[0];
 	int i;
@@ -297,7 +300,7 @@ int fpregs_get(struct task_struct *target, const struct user_regset *regset,
 		return fpregs_soft_get(target, regset, to);
 
 	if (!boot_cpu_has(X86_FEATURE_FXSR)) {
-		return membuf_write(&to, &fpu->state.fsave,
+		return membuf_write(&to, &__xstate(fpu)->fsave,
 				    sizeof(struct fregs_state));
 	}
 
@@ -318,6 +321,7 @@ int fpregs_set(struct task_struct *target, const struct user_regset *regset,
 {
 	struct fpu *fpu = &target->thread.fpu;
 	struct user_i387_ia32_struct env;
+	union fpregs_state *xstate;
 	int ret;
 
 	fpu__prepare_write(fpu);
@@ -326,9 +330,11 @@ int fpregs_set(struct task_struct *target, const struct user_regset *regset,
 	if (!boot_cpu_has(X86_FEATURE_FPU))
 		return fpregs_soft_set(target, regset, pos, count, kbuf, ubuf);
 
+	xstate = __xstate(fpu);
+
 	if (!boot_cpu_has(X86_FEATURE_FXSR))
 		return user_regset_copyin(&pos, &count, &kbuf, &ubuf,
-					  &fpu->state.fsave, 0,
+					  &xstate->fsave, 0,
 					  -1);
 
 	if (pos > 0 || count < sizeof(env))
@@ -336,14 +342,14 @@ int fpregs_set(struct task_struct *target, const struct user_regset *regset,
 
 	ret = user_regset_copyin(&pos, &count, &kbuf, &ubuf, &env, 0, -1);
 	if (!ret)
-		convert_to_fxsr(&target->thread.fpu.state.fxsave, &env);
+		convert_to_fxsr(&__xstate(&target->thread.fpu)->fxsave, &env);
 
 	/*
 	 * update the header bit in the xsave header, indicating the
 	 * presence of FP.
 	 */
 	if (boot_cpu_has(X86_FEATURE_XSAVE))
-		fpu->state.xsave.header.xfeatures |= XFEATURE_MASK_FP;
+		xstate->xsave.header.xfeatures |= XFEATURE_MASK_FP;
 	return ret;
 }
 
diff --git a/arch/x86/kernel/fpu/signal.c b/arch/x86/kernel/fpu/signal.c
index 4fcd2caa63d3..941e80b1a315 100644
--- a/arch/x86/kernel/fpu/signal.c
+++ b/arch/x86/kernel/fpu/signal.c
@@ -58,7 +58,7 @@ static inline int check_for_xstate(struct fxregs_state __user *buf,
 static inline int save_fsave_header(struct task_struct *tsk, void __user *buf)
 {
 	if (use_fxsr()) {
-		struct xregs_state *xsave = &tsk->thread.fpu.state.xsave;
+		struct xregs_state *xsave = __xsave(&tsk->thread.fpu);
 		struct user_i387_ia32_struct env;
 		struct _fpstate_32 __user *fp = buf;
 
@@ -152,8 +152,8 @@ static inline int copy_fpregs_to_sigframe(struct xregs_state __user *buf)
  *
  * Try to save it directly to the user frame with disabled page fault handler.
  * If this fails then do the slow path where the FPU state is first saved to
- * task's fpu->state and then copy it to the user frame pointed to by the
- * aligned pointer 'buf_fx'.
+ * task->fpu and then copy it to the user frame pointed to by the aligned
+ * pointer 'buf_fx'.
  *
  * If this is a 32-bit frame with fxstate, put a fsave header before
  * the aligned state at 'buf_fx'.
@@ -216,7 +216,7 @@ sanitize_restored_user_xstate(struct fpu *fpu,
 			      struct user_i387_ia32_struct *ia32_env,
 			      u64 user_xfeatures, int fx_only)
 {
-	struct xregs_state *xsave = &fpu->state.xsave;
+	struct xregs_state *xsave = __xsave(fpu);
 	struct xstate_header *header = &xsave->header;
 
 	if (use_xsave()) {
@@ -253,7 +253,7 @@ sanitize_restored_user_xstate(struct fpu *fpu,
 		xsave->i387.mxcsr &= mxcsr_feature_mask;
 
 		if (ia32_env)
-			convert_to_fxsr(&fpu->state.fxsave, ia32_env);
+			convert_to_fxsr(&__xstate(fpu)->fxsave, ia32_env);
 	}
 }
 
@@ -295,6 +295,7 @@ static int __fpu__restore_sig(void __user *buf, void __user *buf_fx, int size)
 	struct task_struct *tsk = current;
 	struct fpu *fpu = &tsk->thread.fpu;
 	struct user_i387_ia32_struct env;
+	union fpregs_state *xstate;
 	u64 user_xfeatures = 0;
 	int fx_only = 0;
 	int ret = 0;
@@ -335,6 +336,8 @@ static int __fpu__restore_sig(void __user *buf, void __user *buf_fx, int size)
 	if ((unsigned long)buf_fx % 64)
 		fx_only = 1;
 
+	xstate = __xstate(fpu);
+
 	if (!ia32_fxstate) {
 		/*
 		 * Attempt to restore the FPU registers directly from user
@@ -363,7 +366,7 @@ static int __fpu__restore_sig(void __user *buf, void __user *buf_fx, int size)
 			 */
 			if (test_thread_flag(TIF_NEED_FPU_LOAD) &&
 			    xfeatures_mask_supervisor())
-				copy_kernel_to_xregs(&fpu->state.xsave,
+				copy_kernel_to_xregs(&xstate->xsave,
 						     xfeatures_mask_supervisor());
 			fpregs_mark_activate();
 			fpregs_unlock();
@@ -430,7 +433,7 @@ static int __fpu__restore_sig(void __user *buf, void __user *buf_fx, int size)
 		ret = copy_kernel_to_xregs_err(fpu, user_xfeatures | xfeatures_mask_supervisor());
 
 	} else if (use_fxsr()) {
-		ret = __copy_from_user(&fpu->state.fxsave, buf_fx, state_size);
+		ret = __copy_from_user(&xstate->fxsave, buf_fx, state_size);
 		if (ret) {
 			ret = -EFAULT;
 			goto err_out;
@@ -446,14 +449,14 @@ static int __fpu__restore_sig(void __user *buf, void __user *buf_fx, int size)
 			copy_kernel_to_xregs(&init_fpstate.xsave, init_bv);
 		}
 
-		ret = copy_kernel_to_fxregs_err(&fpu->state.fxsave);
+		ret = copy_kernel_to_fxregs_err(&xstate->fxsave);
 	} else {
-		ret = __copy_from_user(&fpu->state.fsave, buf_fx, state_size);
+		ret = __copy_from_user(&xstate->fsave, buf_fx, state_size);
 		if (ret)
 			goto err_out;
 
 		fpregs_lock();
-		ret = copy_kernel_to_fregs_err(&fpu->state.fsave);
+		ret = copy_kernel_to_fregs_err(&xstate->fsave);
 	}
 	if (!ret)
 		fpregs_mark_activate();
diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index 2e190254d4aa..d73ab3259896 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -189,14 +189,16 @@ unsigned int get_xstate_size(u64 mask)
  */
 void fpstate_sanitize_xstate(struct fpu *fpu)
 {
-	struct fxregs_state *fx = &fpu->state.fxsave;
+	union fpregs_state *xstate = __xstate(fpu);
+	struct xregs_state *xsave = &xstate->xsave;
+	struct fxregs_state *fx = &xstate->fxsave;
 	int feature_bit;
 	u64 xfeatures;
 
 	if (!use_xsaveopt())
 		return;
 
-	xfeatures = fpu->state.xsave.header.xfeatures;
+	xfeatures = xsave->header.xfeatures;
 
 	/*
 	 * None of the feature bits are in init state. So nothing else
@@ -241,7 +243,7 @@ void fpstate_sanitize_xstate(struct fpu *fpu)
 			int offset = xstate_comp_offsets[feature_bit];
 			int size = xstate_sizes[feature_bit];
 
-			memcpy((void *)fx + offset,
+			memcpy((void *)xsave + offset,
 			       (void *)&init_fpstate.xsave + offset,
 			       size);
 		}
@@ -990,7 +992,7 @@ static void *__raw_xsave_addr(struct fpu *fpu, int xfeature_nr)
 	}
 
 	if (fpu)
-		xsave = &fpu->state.xsave;
+		xsave = __xsave(fpu);
 	else
 		xsave = &init_fpstate.xsave;
 
@@ -1034,7 +1036,7 @@ void *get_xsave_addr(struct fpu *fpu, int xfeature_nr)
 		  "get of unsupported state");
 
 	if (fpu)
-		xsave = &fpu->state.xsave;
+		xsave = __xsave(fpu);
 	else
 		xsave = &init_fpstate.xsave;
 
@@ -1242,7 +1244,7 @@ void copy_xstate_to_kernel(struct membuf to, struct fpu *fpu)
 	unsigned last = 0;
 	int i;
 
-	xsave = &fpu->state.xsave;
+	xsave = __xsave(fpu);
 
 	/*
 	 * The destination is a ptrace buffer; we put in only user xstates:
@@ -1320,7 +1322,7 @@ int copy_kernel_to_xstate(struct fpu *fpu, const void *kbuf)
 		}
 	}
 
-	xsave = &fpu->state.xsave;
+	xsave = __xsave(fpu);
 
 	if (xfeatures_mxcsr_quirk(hdr.xfeatures)) {
 		offset = offsetof(struct fxregs_state, mxcsr);
@@ -1378,7 +1380,7 @@ int copy_user_to_xstate(struct fpu *fpu, const void __user *ubuf)
 		}
 	}
 
-	xsave = &fpu->state.xsave;
+	xsave = __xsave(fpu);
 
 	if (xfeatures_mxcsr_quirk(hdr.xfeatures)) {
 		offset = offsetof(struct fxregs_state, mxcsr);
@@ -1423,7 +1425,7 @@ void copy_supervisor_to_kernel(struct fpu *fpu)
 	max_bit = __fls(xfeatures_mask_supervisor());
 	min_bit = __ffs(xfeatures_mask_supervisor());
 
-	xstate = &fpu->state.xsave;
+	xstate = __xsave(fpu);
 	lmask = xfeatures_mask_supervisor();
 	hmask = xfeatures_mask_supervisor() >> 32;
 	XSTATE_OP(XSAVES, xstate, lmask, hmask, err);
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [RFC PATCH 10/22] x86/fpu/xstate: Update xstate save function for supporting dynamic user xstate
  2020-10-01 20:38 [RFC PATCH 00/22] x86: Support Intel Advanced Matrix Extensions Chang S. Bae
                   ` (8 preceding siblings ...)
  2020-10-01 20:39 ` [RFC PATCH 09/22] x86/fpu/xstate: Introduce wrapper functions for organizing xstate area access Chang S. Bae
@ 2020-10-01 20:39 ` Chang S. Bae
  2020-10-01 20:39 ` [RFC PATCH 11/22] x86/fpu/xstate: Update xstate area address finder " Chang S. Bae
                   ` (11 subsequent siblings)
  21 siblings, 0 replies; 43+ messages in thread
From: Chang S. Bae @ 2020-10-01 20:39 UTC (permalink / raw)
  To: tglx, mingo, bp, luto, x86
  Cc: len.brown, dave.hansen, jing2.liu, ravi.v.shankar, linux-kernel,
	chang.seok.bae, kvm

copy_xregs_to_kernel() used to save all user states in an invariably
sufficient buffer. When dynamic user states are enabled, which states get
saved becomes conditional.

fpu->state_mask indicates which state components have room reserved in the
XSAVE buffer. Use it as the XSAVE instruction mask to select states.

KVM saves xstate in guest_fpu and user_fpu. With this change, the KVM code
needs to ensure a valid fpu->state_mask before XSAVE.

No functional change until the kernel supports dynamic user states.
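
For reference, a sketch of how the new mask parameter feeds the instruction
operands, following the existing lmask/hmask convention (a fragment, not a
hunk from this patch):

	/* Sketch: inside copy_xregs_to_kernel(xstate, fpu->state_mask) */
	u64 mask  = fpu->state_mask;	/* only components with space in this buffer */
	u32 lmask = mask;		/* low 32 bits of the instruction mask (EAX) */
	u32 hmask = mask >> 32;		/* high 32 bits (EDX) */
	int err;

	XSTATE_XSAVE(xstate, lmask, hmask, err);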

Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Reviewed-by: Len Brown <len.brown@intel.com>
Cc: x86@kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: kvm@vger.kernel.org
---
 arch/x86/include/asm/fpu/internal.h |  3 +--
 arch/x86/kernel/fpu/core.c          |  2 +-
 arch/x86/kvm/x86.c                  | 11 ++++++++---
 3 files changed, 10 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/fpu/internal.h b/arch/x86/include/asm/fpu/internal.h
index 2dfb3b6f58fc..3b03ead87a46 100644
--- a/arch/x86/include/asm/fpu/internal.h
+++ b/arch/x86/include/asm/fpu/internal.h
@@ -331,9 +331,8 @@ static inline void copy_kernel_to_xregs_booting(struct xregs_state *xstate)
 /*
  * Save processor xstate to xsave area.
  */
-static inline void copy_xregs_to_kernel(struct xregs_state *xstate)
+static inline void copy_xregs_to_kernel(struct xregs_state *xstate, u64 mask)
 {
-	u64 mask = xfeatures_mask_all;
 	u32 lmask = mask;
 	u32 hmask = mask >> 32;
 	int err;
diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
index dca4961fcc36..ece6428ba85b 100644
--- a/arch/x86/kernel/fpu/core.c
+++ b/arch/x86/kernel/fpu/core.c
@@ -99,7 +99,7 @@ int copy_fpregs_to_fpstate(struct fpu *fpu)
 	if (likely(use_xsave())) {
 		struct xregs_state *xsave = &xstate->xsave;
 
-		copy_xregs_to_kernel(xsave);
+		copy_xregs_to_kernel(xsave, fpu->state_mask);
 
 		/*
 		 * AVX512 state is tracked here because its use is
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index ecec6418ccca..a8b5f507083c 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -8842,15 +8842,20 @@ static int complete_emulated_mmio(struct kvm_vcpu *vcpu)
 
 static void kvm_save_current_fpu(struct fpu *fpu)
 {
+	struct fpu *src_fpu = &current->thread.fpu;
+
 	/*
 	 * If the target FPU state is not resident in the CPU registers, just
 	 * memcpy() from current, else save CPU state directly to the target.
 	 */
-	if (test_thread_flag(TIF_NEED_FPU_LOAD))
-		memcpy(&fpu->state, &current->thread.fpu.state,
+	if (test_thread_flag(TIF_NEED_FPU_LOAD)) {
+		memcpy(&fpu->state, &src_fpu->state,
 		       fpu_kernel_xstate_default_size);
-	else
+	} else {
+		if (fpu->state_mask != src_fpu->state_mask)
+			fpu->state_mask = src_fpu->state_mask;
 		copy_fpregs_to_fpstate(fpu);
+	}
 }
 
 /* Swap (qemu) user FPU context for the guest FPU context. */
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [RFC PATCH 11/22] x86/fpu/xstate: Update xstate area address finder for supporting dynamic user xstate
  2020-10-01 20:38 [RFC PATCH 00/22] x86: Support Intel Advanced Matrix Extensions Chang S. Bae
                   ` (9 preceding siblings ...)
  2020-10-01 20:39 ` [RFC PATCH 10/22] x86/fpu/xstate: Update xstate save function for supporting dynamic user xstate Chang S. Bae
@ 2020-10-01 20:39 ` Chang S. Bae
  2020-10-01 20:39 ` [RFC PATCH 12/22] x86/fpu/xstate: Update xstate context copy function for supporting dynamic area Chang S. Bae
                   ` (10 subsequent siblings)
  21 siblings, 0 replies; 43+ messages in thread
From: Chang S. Bae @ 2020-10-01 20:39 UTC (permalink / raw)
  To: tglx, mingo, bp, luto, x86
  Cc: len.brown, dave.hansen, jing2.liu, ravi.v.shankar, linux-kernel,
	chang.seok.bae

__raw_xsave_addr() returns a pointer to the requested component in an XSAVE
buffer, by simply looking up the offset table. The offset used to be fixed,
but, with dynamic user states, it becomes variable.

get_xstate_size() contains a routine to find an offset at run-time.
Refactor that routine out and use it for the address finder.

No functional change until the kernel enables dynamic user xstates.
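
A worked example of the run-time calculation (component sizes here are
illustrative):

	/*
	 * Compacted format; mask = AVX (bit 2, 256 bytes) | PKRU (bit 9,
	 * 8 bytes) | a 64B-aligned dynamic state (bit 18, 8192 bytes):
	 *
	 *   base           = FXSAVE_SIZE + XSAVE_HDR_SIZE = 512 + 64 = 576
	 *   offset(bit 2)  = 576,            next = 576 + 256 = 832
	 *   offset(bit 9)  = 832,            next = 832 + 8   = 840
	 *   offset(bit 18) = ALIGN(840, 64)  = 896
	 *   size(mask)     = 896 + 8192      = 9088 bytes
	 */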

Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Reviewed-by: Len Brown <len.brown@intel.com>
Cc: x86@kernel.org
Cc: linux-kernel@vger.kernel.org
---
 arch/x86/kernel/fpu/xstate.c | 82 +++++++++++++++++++++++-------------
 1 file changed, 52 insertions(+), 30 deletions(-)

diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index d73ab3259896..556ae8593806 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -130,15 +130,50 @@ static bool xfeature_is_supervisor(int xfeature_nr)
 	return ecx & 1;
 }
 
+/*
+ * Available only after the offset, size, and alignment arrays have been
+ * set up by setup_xstate_features().
+ */
+static unsigned int __get_xstate_comp_offset(u64 mask, int feature_nr)
+{
+	u64 xmask = BIT_ULL(feature_nr + 1) - 1;
+	unsigned int next_offset, offset = 0;
+	int i;
+
+	if ((mask & xmask) == (xfeatures_mask_all & xmask))
+		return xstate_comp_offsets[feature_nr];
+
+	/*
+	 * No precomputed offset matches the given mask in this xstate area
+	 * format, so calculate the offset by summing up each state size.
+	 */
+
+	next_offset = FXSAVE_SIZE + XSAVE_HDR_SIZE;
+
+	for (i = FIRST_EXTENDED_XFEATURE; i <= feature_nr; i++) {
+		if (!(mask & BIT_ULL(i)))
+			continue;
+
+		offset = xstate_aligns[i] ? ALIGN(next_offset, 64) : next_offset;
+		next_offset += xstate_sizes[i];
+	}
+
+	return offset;
+}
+
+static unsigned int get_xstate_comp_offset(struct fpu *fpu, int feature_nr)
+{
+	return __get_xstate_comp_offset(fpu->state_mask, feature_nr);
+}
+
 /*
  * Available only after the offset, size, and alignment arrays have been
  * set up by setup_xstate_features().
  */
 unsigned int get_xstate_size(u64 mask)
 {
-	unsigned int size;
-	u64 xmask;
-	int i, nr;
+	unsigned int offset;
+	int nr;
 
 	if (!mask)
 		return 0;
@@ -152,24 +187,8 @@ unsigned int get_xstate_size(u64 mask)
 	if (!using_compacted_format())
 		return xstate_offsets[nr] + xstate_sizes[nr];
 
-	xmask = BIT_ULL(nr + 1) - 1;
-
-	if (mask == (xmask & xfeatures_mask_all))
-		return xstate_comp_offsets[nr] + xstate_sizes[nr];
-
-	/*
-	 * Calculate the size by summing up each state together, since no known
-	 * size found with the xstate area format out of the given mask.
-	 */
-	for (size = FXSAVE_SIZE + XSAVE_HDR_SIZE, i = FIRST_EXTENDED_XFEATURE; i <= nr; i++) {
-		if (!(mask & BIT_ULL(i)))
-			continue;
-
-		if (xstate_aligns[i])
-			size = ALIGN(size, 64);
-		size += xstate_sizes[i];
-	}
-	return size;
+	offset = __get_xstate_comp_offset(mask, nr);
+	return offset + xstate_sizes[nr];
 }
 
 /*
@@ -986,17 +1005,20 @@ static void *__raw_xsave_addr(struct fpu *fpu, int xfeature_nr)
 {
 	void *xsave;
 
-	if (!xfeature_enabled(xfeature_nr)) {
-		WARN_ON_FPU(1);
-		return NULL;
-	}
-
-	if (fpu)
-		xsave = __xsave(fpu);
-	else
+	if (!xfeature_enabled(xfeature_nr))
+		goto not_found;
+	else if (!fpu)
 		xsave = &init_fpstate.xsave;
+	else if (!(fpu->state_mask & BIT_ULL(xfeature_nr)))
+		goto not_found;
+	else
+		xsave = __xsave(fpu);
+
+	return (xsave + get_xstate_comp_offset(fpu, xfeature_nr));
 
-	return xsave + xstate_comp_offsets[xfeature_nr];
+not_found:
+	WARN_ON_FPU(1);
+	return NULL;
 }
 
 /*
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [RFC PATCH 12/22] x86/fpu/xstate: Update xstate context copy function for supporting dynamic area
  2020-10-01 20:38 [RFC PATCH 00/22] x86: Support Intel Advanced Matrix Extensions Chang S. Bae
                   ` (10 preceding siblings ...)
  2020-10-01 20:39 ` [RFC PATCH 11/22] x86/fpu/xstate: Update xstate area address finder " Chang S. Bae
@ 2020-10-01 20:39 ` Chang S. Bae
  2020-10-01 20:39 ` [RFC PATCH 13/22] x86/fpu/xstate: Expand dynamic user state area on first use Chang S. Bae
                   ` (9 subsequent siblings)
  21 siblings, 0 replies; 43+ messages in thread
From: Chang S. Bae @ 2020-10-01 20:39 UTC (permalink / raw)
  To: tglx, mingo, bp, luto, x86
  Cc: len.brown, dave.hansen, jing2.liu, ravi.v.shankar, linux-kernel,
	chang.seok.bae

There are xstate context copy functions that are used in the ptrace() and
signal return paths. They let callers read (or write) xstate values in the
task->fpu buffer, or get initial values. With dynamic user states, a
component's position in the buffer may vary and the initial value is not
always stored in init_fpstate.

Change the helpers to find a component's offset accordingly (either by
lookup table or by calculation).

When copying an initial value, explicitly check the init_fpstate coverage.
If the component is not covered, zero out the destination memory.
Otherwise, copy the values from init_fpstate.

No functional change until the kernel supports dynamic user states.
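
The init-value fallback at the heart of this change boils down to the
following sketch (names from this series; 'nr' and 'dst' are placeholders):

	unsigned int offset = get_xstate_comp_offset(fpu, nr);
	unsigned int size   = xstate_sizes[nr];

	if (get_init_fpstate_mask() & BIT_ULL(nr))
		memcpy(dst + offset, (void *)&init_fpstate.xsave + offset, size);
	else
		memset(dst + offset, 0, size);	/* dynamic states init to all zeros */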

Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Reviewed-by: Len Brown <len.brown@intel.com>
Cc: x86@kernel.org
Cc: linux-kernel@vger.kernel.org
---
 arch/x86/kernel/fpu/xstate.c | 55 +++++++++++++++++++++++++++---------
 1 file changed, 41 insertions(+), 14 deletions(-)

diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index 556ae8593806..b9261ab4e5e2 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -245,12 +245,14 @@ void fpstate_sanitize_xstate(struct fpu *fpu)
 	if (!(xfeatures & XFEATURE_MASK_SSE))
 		memset(&fx->xmm_space[0], 0, 256);
 
+	/* Make sure 'xfeatures' is a subset of fpu->state_mask */
+	xfeatures = ((xfeatures_mask_user() & fpu->state_mask) & ~xfeatures);
 	/*
 	 * First two features are FPU and SSE, which above we handled
 	 * in a special way already:
 	 */
 	feature_bit = 0x2;
-	xfeatures = (xfeatures_mask_user() & ~xfeatures) >> 2;
+	xfeatures >>= 0x2;
 
 	/*
 	 * Update all the remaining memory layouts according to their
@@ -259,12 +261,15 @@ void fpstate_sanitize_xstate(struct fpu *fpu)
 	 */
 	while (xfeatures) {
 		if (xfeatures & 0x1) {
-			int offset = xstate_comp_offsets[feature_bit];
-			int size = xstate_sizes[feature_bit];
-
-			memcpy((void *)xsave + offset,
-			       (void *)&init_fpstate.xsave + offset,
-			       size);
+			unsigned int offset = get_xstate_comp_offset(fpu, feature_bit);
+			unsigned int size = xstate_sizes[feature_bit];
+
+			if (get_init_fpstate_mask() & BIT_ULL(feature_bit))
+				memcpy((void *)xsave + offset,
+				       (void *)&init_fpstate.xsave + offset,
+				       size);
+			else
+				memset((void *)xsave + offset, 0, size);
 		}
 
 		xfeatures >>= 1;
@@ -1238,7 +1243,10 @@ static void fill_gap(struct membuf *to, unsigned *last, unsigned offset)
 {
 	if (*last >= offset)
 		return;
-	membuf_write(to, (void *)&init_fpstate.xsave + *last, offset - *last);
+	if (offset <= get_init_fpstate_size())
+		membuf_write(to, (void *)&init_fpstate.xsave + *last, offset - *last);
+	else
+		membuf_zero(to, offset - *last);
 	*last = offset;
 }
 
@@ -1246,7 +1254,10 @@ static void copy_part(struct membuf *to, unsigned *last, unsigned offset,
 		      unsigned size, void *from)
 {
 	fill_gap(to, last, offset);
-	membuf_write(to, from, size);
+	if (from)
+		membuf_write(to, from, size);
+	else
+		membuf_zero(to, size);
 	*last = offset + size;
 }
 
@@ -1298,12 +1309,22 @@ void copy_xstate_to_kernel(struct membuf to, struct fpu *fpu)
 		  sizeof(header), &header);
 
 	for (i = FIRST_EXTENDED_XFEATURE; i < XFEATURE_MAX; i++) {
+		u64 mask = BIT_ULL(i);
+		void *src;
 		/*
-		 * Copy only in-use xstates:
+		 * Copy in-use xstates first. For a feature that is enabled
+		 * but not in use, find the init value -- whether stored in
+		 * init_fpstate or simply zeros -- and copy that instead.
 		 */
-		if ((header.xfeatures >> i) & 1) {
-			void *src = __raw_xsave_addr(fpu, i);
-
+		if (header.xfeatures & mask) {
+			src = __raw_xsave_addr(fpu, i);
+			copy_part(&to, &last, xstate_offsets[i],
+				  xstate_sizes[i], src);
+		} else if (xfeatures_mask_user() & mask) {
+			if (get_init_fpstate_mask() & mask)
+				src = (void *)&init_fpstate.xsave + last;
+			else
+				src = NULL;
 			copy_part(&to, &last, xstate_offsets[i],
 				  xstate_sizes[i], src);
 		}
@@ -1337,6 +1358,9 @@ int copy_kernel_to_xstate(struct fpu *fpu, const void *kbuf)
 		if (hdr.xfeatures & mask) {
 			void *dst = __raw_xsave_addr(fpu, i);
 
+			if (!dst)
+				continue;
+
 			offset = xstate_offsets[i];
 			size = xstate_sizes[i];
 
@@ -1394,6 +1418,9 @@ int copy_user_to_xstate(struct fpu *fpu, const void __user *ubuf)
 		if (hdr.xfeatures & mask) {
 			void *dst = __raw_xsave_addr(fpu, i);
 
+			if (!dst)
+				continue;
+
 			offset = xstate_offsets[i];
 			size = xstate_sizes[i];
 
@@ -1476,7 +1503,7 @@ void copy_supervisor_to_kernel(struct fpu *fpu)
 			continue;
 
 		/* Move xfeature 'i' into its normal location */
-		memmove(xbuf + xstate_comp_offsets[i],
+		memmove(xbuf + get_xstate_comp_offset(fpu, i),
 			xbuf + xstate_supervisor_only_offsets[i],
 			xstate_sizes[i]);
 	}
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [RFC PATCH 13/22] x86/fpu/xstate: Expand dynamic user state area on first use
  2020-10-01 20:38 [RFC PATCH 00/22] x86: Support Intel Advanced Matrix Extensions Chang S. Bae
                   ` (11 preceding siblings ...)
  2020-10-01 20:39 ` [RFC PATCH 12/22] x86/fpu/xstate: Update xstate context copy function for supporting dynamic area Chang S. Bae
@ 2020-10-01 20:39 ` Chang S. Bae
  2020-10-01 23:39   ` Andy Lutomirski
  2020-10-01 20:39 ` [RFC PATCH 14/22] x86/fpu/xstate: Inherit dynamic user state when used in the parent Chang S. Bae
                   ` (8 subsequent siblings)
  21 siblings, 1 reply; 43+ messages in thread
From: Chang S. Bae @ 2020-10-01 20:39 UTC (permalink / raw)
  To: tglx, mingo, bp, luto, x86
  Cc: len.brown, dave.hansen, jing2.liu, ravi.v.shankar, linux-kernel,
	chang.seok.bae

Intel's Extended Feature Disable (XFD) feature is an extension of the XSAVE
architecture. XFD allows the kernel to enable a feature state in XCR0 and
to receive a #NM trap when a task uses instructions accessing that state.
In this way, Linux can allocate the large task->fpu buffer only for tasks
that use it.

XFD introduces two MSRs: IA32_XFD to enable/disable the feature and
IA32_XFD_ERR to assist the #NM trap handler. Both use the same
state-component bitmap format, used by XCR0.

Use this hardware capability to find the right time to expand the xstate
area. Introduce two sets of helper functions for that:

1. The first set is primarily for interacting with the XFD hardware
   feature. Helpers for configuring disablement, e.g. in context switching,
   are:
	xdisable_setbits()
	xdisable_getbits()
	xdisable_switch()

2. The second set is for managing the first-use status and handling #NM
   trap:
	xfirstuse_enabled()
	xfirstuse_not_detected()
	xfirstuse_event_handler()

The #NM handler induces the xstate area expansion to save the first-used
states.

No functional change until the kernel enables dynamic user states and XFD.
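
To make the XFD bit convention concrete, an illustrative walk-through (bit
18 as the sole dynamic user state is an assumption of this sketch):

	/*
	 *   xfirstuse_enabled()            == BIT_ULL(18)
	 *
	 *   Before first use, fpu->state_mask bit 18 is clear:
	 *     xfirstuse_not_detected(fpu)  == BIT_ULL(18) -> IA32_XFD[18]=1
	 *     -> touching the state raises #NM; xfirstuse_event_handler()
	 *        expands the area via alloc_xstate_area()
	 *
	 *   After the expansion, fpu->state_mask bit 18 is set:
	 *     xfirstuse_not_detected(fpu)  == 0           -> IA32_XFD[18]=0
	 */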

Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Reviewed-by: Len Brown <len.brown@intel.com>
Cc: x86@kernel.org
Cc: linux-kernel@vger.kernel.org
---
 arch/x86/include/asm/cpufeatures.h  |  1 +
 arch/x86/include/asm/fpu/internal.h | 53 ++++++++++++++++++++++++++++-
 arch/x86/include/asm/msr-index.h    |  2 ++
 arch/x86/kernel/fpu/core.c          | 37 ++++++++++++++++++++
 arch/x86/kernel/fpu/xstate.c        | 34 ++++++++++++++++--
 arch/x86/kernel/process.c           |  5 +++
 arch/x86/kernel/process_32.c        |  2 +-
 arch/x86/kernel/process_64.c        |  2 +-
 arch/x86/kernel/traps.c             |  3 ++
 9 files changed, 133 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index 2901d5df4366..7d7fe1d82966 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -274,6 +274,7 @@
 #define X86_FEATURE_XSAVEC		(10*32+ 1) /* XSAVEC instruction */
 #define X86_FEATURE_XGETBV1		(10*32+ 2) /* XGETBV with ECX = 1 instruction */
 #define X86_FEATURE_XSAVES		(10*32+ 3) /* XSAVES/XRSTORS instructions */
+#define X86_FEATURE_XFD			(10*32+ 4) /* eXtended Feature Disabling */
 
 /*
  * Extended auxiliary flags: Linux defined - for features scattered in various
diff --git a/arch/x86/include/asm/fpu/internal.h b/arch/x86/include/asm/fpu/internal.h
index 3b03ead87a46..f5dbbaa060fb 100644
--- a/arch/x86/include/asm/fpu/internal.h
+++ b/arch/x86/include/asm/fpu/internal.h
@@ -572,11 +572,60 @@ static inline void switch_fpu_prepare(struct fpu *old_fpu, int cpu)
  * Misc helper functions:
  */
 
+/* The first-use detection helpers: */
+
+static inline void xdisable_setbits(u64 value)
+{
+	wrmsrl_safe(MSR_IA32_XFD, value);
+}
+
+static inline u64 xdisable_getbits(void)
+{
+	u64 value;
+
+	rdmsrl_safe(MSR_IA32_XFD, &value);
+	return value;
+}
+
+static inline u64 xfirstuse_enabled(void)
+{
+	/* All the dynamic user components are first-use enabled. */
+	return xfeatures_mask_user_dynamic;
+}
+
+/*
+ * Convert the first-use bits derived from fpu->state_mask to the xdisable
+ * configuration in MSR IA32_XFD. The result is only fed to xdisable_setbits().
+ */
+static inline u64 xfirstuse_not_detected(struct fpu *fpu)
+{
+	u64 firstuse_bv = (fpu->state_mask & xfirstuse_enabled());
+
+	/*
+	 * If first use is not yet detected, set the bit. If the detection
+	 * is not enabled, the bit is always zero in firstuse_bv. So, make
+	 * the following conversion:
+	 */
+	return  (xfirstuse_enabled() ^ firstuse_bv);
+}
+
+/* Update MSR IA32_XFD based on the task's first-use bits */
+static inline void xdisable_switch(struct fpu *prev, struct fpu *next)
+{
+	if (!static_cpu_has(X86_FEATURE_XFD) || !xfirstuse_enabled())
+		return;
+
+	if (unlikely(prev->state_mask != next->state_mask))
+		xdisable_setbits(xfirstuse_not_detected(next));
+}
+
+bool xfirstuse_event_handler(struct fpu *fpu);
+
 /*
  * Load PKRU from the FPU context if available. Delay loading of the
  * complete FPU state until the return to userland.
  */
-static inline void switch_fpu_finish(struct fpu *new_fpu)
+static inline void switch_fpu_finish(struct fpu *old_fpu, struct fpu *new_fpu)
 {
 	u32 pkru_val = init_pkru_value;
 	struct pkru_state *pk;
@@ -586,6 +635,8 @@ static inline void switch_fpu_finish(struct fpu *new_fpu)
 
 	set_thread_flag(TIF_NEED_FPU_LOAD);
 
+	xdisable_switch(old_fpu, new_fpu);
+
 	if (!cpu_feature_enabled(X86_FEATURE_OSPKE))
 		return;
 
diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 2859ee4f39a8..0ccbe8cc99ad 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -610,6 +610,8 @@
 #define MSR_IA32_BNDCFGS_RSVD		0x00000ffc
 
 #define MSR_IA32_XSS			0x00000da0
+#define MSR_IA32_XFD			0x000001c4
+#define MSR_IA32_XFD_ERR		0x000001c5
 
 #define MSR_IA32_APICBASE		0x0000001b
 #define MSR_IA32_APICBASE_BSP		(1<<8)
diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
index ece6428ba85b..2e07bfcd54b3 100644
--- a/arch/x86/kernel/fpu/core.c
+++ b/arch/x86/kernel/fpu/core.c
@@ -518,3 +518,40 @@ int fpu__exception_code(struct fpu *fpu, int trap_nr)
 	 */
 	return 0;
 }
+
+bool xfirstuse_event_handler(struct fpu *fpu)
+{
+	bool handled = false;
+	u64 event_mask;
+
+	/* Check whether the first-use detection is running. */
+	if (!static_cpu_has(X86_FEATURE_XFD) || !xfirstuse_enabled())
+		return handled;
+
+	rdmsrl_safe(MSR_IA32_XFD_ERR, &event_mask);
+
+	/* The trap event should be for one of the first-use enabled features */
+	WARN_ON(!(event_mask & xfirstuse_enabled()));
+
+	/* If IA32_XFD_ERR is empty, the current trap has nothing to do with XFD. */
+	if (!event_mask)
+		return handled;
+
+	/*
+	 * The first-use event is presumed to be from userspace, so it should have
+	 * nothing to do with interrupt context.
+	 */
+	if (WARN_ON(in_interrupt()))
+		return handled;
+
+	if (alloc_xstate_area(fpu, event_mask, NULL))
+		return handled;
+
+	xdisable_setbits(xfirstuse_not_detected(fpu));
+
+	/* Clear the trap record. */
+	wrmsrl_safe(MSR_IA32_XFD_ERR, 0);
+	handled = true;
+
+	return handled;
+}
diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index b9261ab4e5e2..13e8eff7a23b 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -130,6 +130,18 @@ static bool xfeature_is_supervisor(int xfeature_nr)
 	return ecx & 1;
 }
 
+static bool xfeature_disable_supported(int xfeature_nr)
+{
+	u32 eax, ebx, ecx, edx;
+
+	/*
+	 * If state component 'i' supports xfeature disable (or first-use
+	 * detection), ECX[2] returns 1; otherwise, 0.
+	 */
+	cpuid_count(XSTATE_CPUID, xfeature_nr, &eax, &ebx, &ecx, &edx);
+	return ecx & 4;
+}
+
 /*
  * Available once those arrays for the offset, size, and alignment info are set up,
  * by setup_xstate_features().
@@ -313,6 +325,9 @@ void fpu__init_cpu_xstate(void)
 		wrmsrl(MSR_IA32_XSS, xfeatures_mask_supervisor() |
 				     xfeatures_mask_supervisor_dynamic());
 	}
+
+	if (boot_cpu_has(X86_FEATURE_XFD))
+		xdisable_setbits(xfirstuse_enabled());
 }
 
 static bool xfeature_enabled(enum xfeature xfeature)
@@ -511,8 +526,9 @@ static void __init print_xstate_offset_size(void)
 	for (i = FIRST_EXTENDED_XFEATURE; i < XFEATURE_MAX; i++) {
 		if (!xfeature_enabled(i))
 			continue;
-		pr_info("x86/fpu: xstate_offset[%d]: %4d, xstate_sizes[%d]: %4d\n",
-			 i, xstate_comp_offsets[i], i, xstate_sizes[i]);
+		pr_info("x86/fpu: xstate_offset[%d]: %4d, xstate_sizes[%d]: %4d (%s)\n",
+			i, xstate_comp_offsets[i], i, xstate_sizes[i],
+			(xfeatures_mask_user_dynamic & BIT_ULL(i)) ? "on-demand" : "default");
 	}
 }
 
@@ -942,9 +958,18 @@ void __init fpu__init_system_xstate(void)
 	}
 
 	xfeatures_mask_all &= fpu__get_supported_xfeatures_mask();
-	/* Do not support the dynamically allocated area yet. */
 	xfeatures_mask_user_dynamic = 0;
 
+	for (i = FIRST_EXTENDED_XFEATURE; i < XFEATURE_MAX; i++) {
+		u64 feature_mask = BIT_ULL(i);
+
+		if (!(xfeatures_mask_user() & feature_mask))
+			continue;
+
+		if (xfeature_disable_supported(i))
+			xfeatures_mask_user_dynamic |= feature_mask;
+	}
+
 	/* Enable xstate instructions to be able to continue with initialization: */
 	fpu__init_cpu_xstate();
 	err = init_xstate_size();
@@ -999,6 +1024,9 @@ void fpu__resume_cpu(void)
 		wrmsrl(MSR_IA32_XSS, xfeatures_mask_supervisor()  |
 				     xfeatures_mask_supervisor_dynamic());
 	}
+
+	if (boot_cpu_has(X86_FEATURE_XFD))
+		xdisable_setbits(xfirstuse_not_detected(&current->thread.fpu));
 }
 
 /*
diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 43d38bd09fb1..66850b63fe66 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -102,6 +102,11 @@ void arch_thread_struct_whitelist(unsigned long *offset, unsigned long *size)
 	*size = fpu_kernel_xstate_default_size;
 }
 
+void arch_release_task_struct(struct task_struct *tsk)
+{
+	free_xstate_area(&tsk->thread.fpu);
+}
+
 /*
  * Free thread data structures etc..
  */
diff --git a/arch/x86/kernel/process_32.c b/arch/x86/kernel/process_32.c
index 4f2f54e1281c..7bd5d08eeb41 100644
--- a/arch/x86/kernel/process_32.c
+++ b/arch/x86/kernel/process_32.c
@@ -213,7 +213,7 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
 
 	this_cpu_write(current_task, next_p);
 
-	switch_fpu_finish(next_fpu);
+	switch_fpu_finish(prev_fpu, next_fpu);
 
 	/* Load the Intel cache allocation PQR MSR. */
 	resctrl_sched_in();
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
index 9afefe325acb..3970367c02c0 100644
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -595,7 +595,7 @@ __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
 	this_cpu_write(current_task, next_p);
 	this_cpu_write(cpu_current_top_of_stack, task_top_of_stack(next_p));
 
-	switch_fpu_finish(next_fpu);
+	switch_fpu_finish(prev_fpu, next_fpu);
 
 	/* Reload sp0. */
 	update_task_stack(next_p);
diff --git a/arch/x86/kernel/traps.c b/arch/x86/kernel/traps.c
index 81a2fb711091..0eebcae99938 100644
--- a/arch/x86/kernel/traps.c
+++ b/arch/x86/kernel/traps.c
@@ -1028,6 +1028,9 @@ DEFINE_IDTENTRY(exc_device_not_available)
 {
 	unsigned long cr0 = read_cr0();
 
+	if (xfirstuse_event_handler(&current->thread.fpu))
+		return;
+
 #ifdef CONFIG_MATH_EMULATION
 	if (!boot_cpu_has(X86_FEATURE_FPU) && (cr0 & X86_CR0_EM)) {
 		struct math_emu_info info = { };
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [RFC PATCH 14/22] x86/fpu/xstate: Inherit dynamic user state when used in the parent
  2020-10-01 20:38 [RFC PATCH 00/22] x86: Support Intel Advanced Matrix Extensions Chang S. Bae
                   ` (12 preceding siblings ...)
  2020-10-01 20:39 ` [RFC PATCH 13/22] x86/fpu/xstate: Expand dynamic user state area on first use Chang S. Bae
@ 2020-10-01 20:39 ` Chang S. Bae
  2020-10-01 20:39 ` [RFC PATCH 15/22] x86/fpu/xstate: Support ptracer-induced xstate area expansion Chang S. Bae
                   ` (7 subsequent siblings)
  21 siblings, 0 replies; 43+ messages in thread
From: Chang S. Bae @ 2020-10-01 20:39 UTC (permalink / raw)
  To: tglx, mingo, bp, luto, x86
  Cc: len.brown, dave.hansen, jing2.liu, ravi.v.shankar, linux-kernel,
	chang.seok.bae

When a new task is created, the kernel copies all the states from the
parent. If the parent already has any dynamic user state in use, the new
task's XSAVE buffer has to be expanded to hold it. Also, disable the
associated first-use fault, as the state is already known to be in use.

No functional change until the kernel supports dynamic user states.

Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Reviewed-by: Len Brown <len.brown@intel.com>
Cc: x86@kernel.org
Cc: linux-kernel@vger.kernel.org
---
 arch/x86/kernel/fpu/core.c | 28 ++++++++++++++++++++--------
 1 file changed, 20 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
index 2e07bfcd54b3..239c7798bc01 100644
--- a/arch/x86/kernel/fpu/core.c
+++ b/arch/x86/kernel/fpu/core.c
@@ -225,6 +225,7 @@ int fpu__copy(struct task_struct *dst, struct task_struct *src)
 {
 	struct fpu *dst_fpu = &dst->thread.fpu;
 	struct fpu *src_fpu = &src->thread.fpu;
+	unsigned int size;
 
 	dst_fpu->last_cpu = -1;
 
@@ -233,15 +234,26 @@ int fpu__copy(struct task_struct *dst, struct task_struct *src)
 
 	WARN_ON_FPU(src_fpu != &current->thread.fpu);
 
-	/*
-	 * Don't let 'init optimized' areas of the XSAVE area
-	 * leak into the child task:
-	 */
-	memset(&dst_fpu->state.xsave, 0, fpu_kernel_xstate_default_size);
-
-	dst_fpu->state_mask = xfeatures_mask_all & ~xfeatures_mask_user_dynamic;
 	dst_fpu->state_ptr = NULL;
 
+	/* Inherit the dynamic area if the parent already has one. */
+	if (src_fpu->state_ptr) {
+		int ret;
+
+		dst_fpu->state_mask = 0;
+		ret = alloc_xstate_area(dst_fpu, src_fpu->state_mask, &size);
+		if (ret)
+			return ret;
+	} else {
+		dst_fpu->state_mask = src_fpu->state_mask & ~xfeatures_mask_user_dynamic;
+		size = fpu_kernel_xstate_default_size;
+		/*
+		 * Don't let 'init optimized' areas of the XSAVE area
+		 * leak into the child task:
+		 */
+		memset(&dst_fpu->state.xsave, 0, size);
+	}
+
 	/*
 	 * If the FPU registers are not current just memcpy() the state.
 	 * Otherwise save current FPU registers directly into the child's FPU
@@ -252,7 +264,7 @@ int fpu__copy(struct task_struct *dst, struct task_struct *src)
 	 */
 	fpregs_lock();
 	if (test_thread_flag(TIF_NEED_FPU_LOAD))
-		memcpy(__xstate(dst_fpu), __xstate(src_fpu), fpu_kernel_xstate_default_size);
+		memcpy(__xstate(dst_fpu), __xstate(src_fpu), size);
 
 	else if (!copy_fpregs_to_fpstate(dst_fpu))
 		copy_kernel_to_fpregs(dst_fpu);
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [RFC PATCH 15/22] x86/fpu/xstate: Support ptracer-induced xstate area expansion
  2020-10-01 20:38 [RFC PATCH 00/22] x86: Support Intel Advanced Matrix Extensions Chang S. Bae
                   ` (13 preceding siblings ...)
  2020-10-01 20:39 ` [RFC PATCH 14/22] x86/fpu/xstate: Inherit dynamic user state when used in the parent Chang S. Bae
@ 2020-10-01 20:39 ` Chang S. Bae
  2020-10-01 20:39 ` [RFC PATCH 16/22] x86/fpu/xstate: Support dynamic user state in the signal handling path Chang S. Bae
                   ` (6 subsequent siblings)
  21 siblings, 0 replies; 43+ messages in thread
From: Chang S. Bae @ 2020-10-01 20:39 UTC (permalink / raw)
  To: tglx, mingo, bp, luto, x86
  Cc: len.brown, dave.hansen, jing2.liu, ravi.v.shankar, linux-kernel,
	chang.seok.bae

ptrace() may request an update to state in task->fpu whose buffer has not
yet been allocated. Detect this case and expand the buffer to support the
request. Also, disable the (now unnecessary) associated first-use fault.

No functional change until the kernel supports dynamic user states.

Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Reviewed-by: Len Brown <len.brown@intel.com>
Cc: x86@kernel.org
Cc: linux-kernel@vger.kernel.org
---
 arch/x86/kernel/fpu/regset.c | 29 +++++++++++++++++++++++++++++
 1 file changed, 29 insertions(+)

diff --git a/arch/x86/kernel/fpu/regset.c b/arch/x86/kernel/fpu/regset.c
index 8d863240b9c6..6b9d0c0a266d 100644
--- a/arch/x86/kernel/fpu/regset.c
+++ b/arch/x86/kernel/fpu/regset.c
@@ -125,6 +125,35 @@ int xstateregs_set(struct task_struct *target, const struct user_regset *regset,
 
 	xsave = __xsave(fpu);
 
+	/*
+	 * When a ptracer attempts to write state that is not yet allocated in
+	 * task->fpu, dynamically expand the xstate area of fpu->state_ptr.
+	 */
+	if (count > get_xstate_size(fpu->state_mask)) {
+		unsigned int offset, size;
+		struct xstate_header hdr;
+		u64 mask;
+
+		offset = offsetof(struct xregs_state, header);
+		size = sizeof(hdr);
+
+		/* Retrieve XSTATE_BV */
+		if (kbuf) {
+			memcpy(&hdr, kbuf + offset, size);
+		} else {
+			ret = __copy_from_user(&hdr, ubuf + offset, size);
+			if (ret)
+				return ret;
+		}
+
+		mask = hdr.xfeatures & xfeatures_mask_user_dynamic;
+		if (mask) {
+			ret = alloc_xstate_area(fpu, mask, NULL);
+			if (ret)
+				return ret;
+		}
+	}
+
 	fpu__prepare_write(fpu);
 
 	if (using_compacted_format()) {
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [RFC PATCH 16/22] x86/fpu/xstate: Support dynamic user state in the signal handling path
  2020-10-01 20:38 [RFC PATCH 00/22] x86: Support Intel Advanced Matrix Extensions Chang S. Bae
                   ` (14 preceding siblings ...)
  2020-10-01 20:39 ` [RFC PATCH 15/22] x86/fpu/xstate: Support ptracer-induced xstate area expansion Chang S. Bae
@ 2020-10-01 20:39 ` Chang S. Bae
  2020-10-01 20:39 ` [RFC PATCH 17/22] x86/fpu/xstate: Extend the table for mapping xstate components with features Chang S. Bae
                   ` (5 subsequent siblings)
  21 siblings, 0 replies; 43+ messages in thread
From: Chang S. Bae @ 2020-10-01 20:39 UTC (permalink / raw)
  To: tglx, mingo, bp, luto, x86
  Cc: len.brown, dave.hansen, jing2.liu, ravi.v.shankar, linux-kernel,
	chang.seok.bae

On entry to a signal handler, the kernel saves the XSAVE area. The dynamic
user state is best saved only when in use; fpu->state_mask helps to
exclude the unused states.

On return from the signal handler, XRSTOR re-initializes the excluded
state components.

No functional change until the kernel actually supports the dynamic user
states.
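
For reference, the narrowed mask reaches the hardware through XSAVE's
implicit EDX:EAX operand pair. Below is a minimal user-space sketch of
that split (illustrative only, not part of this patch; xsave_masked() is
a made-up helper):

	#include <stdint.h>

	/*
	 * Save only the state components named in 'mask' into 'buf',
	 * which must point to a 64-byte-aligned XSAVE area.
	 */
	static inline void xsave_masked(void *buf, uint64_t mask)
	{
		uint32_t lmask = mask;		/* EAX: low 32 bits of the mask */
		uint32_t hmask = mask >> 32;	/* EDX: high 32 bits of the mask */

		asm volatile("xsave %0"
			     : "+m" (*(char (*)[64])buf)
			     : "a" (lmask), "d" (hmask)
			     : "memory");
	}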

Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Reviewed-by: Len Brown <len.brown@intel.com>
Cc: x86@kernel.org
Cc: linux-kernel@vger.kernel.org
---
 arch/x86/include/asm/fpu/internal.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/fpu/internal.h b/arch/x86/include/asm/fpu/internal.h
index f5dbbaa060fb..fd044b31ce40 100644
--- a/arch/x86/include/asm/fpu/internal.h
+++ b/arch/x86/include/asm/fpu/internal.h
@@ -368,7 +368,7 @@ static inline void copy_kernel_to_xregs(struct xregs_state *xstate, u64 mask)
  */
 static inline int copy_xregs_to_user(struct xregs_state __user *buf)
 {
-	u64 mask = xfeatures_mask_user();
+	u64 mask = current->thread.fpu.state_mask;
 	u32 lmask = mask;
 	u32 hmask = mask >> 32;
 	int err;
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [RFC PATCH 17/22] x86/fpu/xstate: Extend the table for mapping xstate components with features
  2020-10-01 20:38 [RFC PATCH 00/22] x86: Support Intel Advanced Matrix Extensions Chang S. Bae
                   ` (15 preceding siblings ...)
  2020-10-01 20:39 ` [RFC PATCH 16/22] x86/fpu/xstate: Support dynamic user state in the signal handling path Chang S. Bae
@ 2020-10-01 20:39 ` Chang S. Bae
  2020-10-01 20:39 ` [RFC PATCH 18/22] x86/cpufeatures/amx: Enumerate Advanced Matrix Extension (AMX) feature bits Chang S. Bae
                   ` (4 subsequent siblings)
  21 siblings, 0 replies; 43+ messages in thread
From: Chang S. Bae @ 2020-10-01 20:39 UTC (permalink / raw)
  To: tglx, mingo, bp, luto, x86
  Cc: len.brown, dave.hansen, jing2.liu, ravi.v.shankar, linux-kernel,
	chang.seok.bae

At compile-time xfeatures_mask_all includes all possible XCR0 features. At
run-time fpu__init_system_xstate() clears features in xfeatures_mask_all
that are not enabled in CPUID. It does this by looping through all possible
XCR0 features.

Update the code to handle the possibility that there will be gaps in the
XCR0 feature bit numbers.

No functional change until hardware with bit-number gaps in XCR0 arrives.

Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Reviewed-by: Len Brown <len.brown@intel.com>
Cc: x86@kernel.org
Cc: linux-kernel@vger.kernel.org
---
 arch/x86/kernel/fpu/xstate.c | 39 ++++++++++++++++++++++--------------
 1 file changed, 24 insertions(+), 15 deletions(-)

diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index 13e8eff7a23b..eaada4a38153 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -41,17 +41,22 @@ static const char *xfeature_names[] =
 	"unknown xstate feature"	,
 };
 
-static short xsave_cpuid_features[] __initdata = {
-	X86_FEATURE_FPU,
-	X86_FEATURE_XMM,
-	X86_FEATURE_AVX,
-	X86_FEATURE_MPX,
-	X86_FEATURE_MPX,
-	X86_FEATURE_AVX512F,
-	X86_FEATURE_AVX512F,
-	X86_FEATURE_AVX512F,
-	X86_FEATURE_INTEL_PT,
-	X86_FEATURE_PKU,
+struct xfeature_capflag_info {
+	int xfeature_idx;
+	short cpu_cap;
+};
+
+static struct xfeature_capflag_info xfeature_capflags[] __initdata = {
+	{ XFEATURE_FP,				X86_FEATURE_FPU },
+	{ XFEATURE_SSE,				X86_FEATURE_XMM },
+	{ XFEATURE_YMM,				X86_FEATURE_AVX },
+	{ XFEATURE_BNDREGS,			X86_FEATURE_MPX },
+	{ XFEATURE_BNDCSR,			X86_FEATURE_MPX },
+	{ XFEATURE_OPMASK,			X86_FEATURE_AVX512F },
+	{ XFEATURE_ZMM_Hi256,			X86_FEATURE_AVX512F },
+	{ XFEATURE_Hi16_ZMM,			X86_FEATURE_AVX512F },
+	{ XFEATURE_PT_UNIMPLEMENTED_SO_FAR,	X86_FEATURE_INTEL_PT },
+	{ XFEATURE_PKRU,			X86_FEATURE_PKU },
 };
 
 /*
@@ -950,11 +955,15 @@ void __init fpu__init_system_xstate(void)
 	}
 
 	/*
-	 * Clear XSAVE features that are disabled in the normal CPUID.
+	 * Cross-check XSAVE feature with CPU capability flag. Clear the
+	 * mask bit for disabled features.
 	 */
-	for (i = 0; i < ARRAY_SIZE(xsave_cpuid_features); i++) {
-		if (!boot_cpu_has(xsave_cpuid_features[i]))
-			xfeatures_mask_all &= ~BIT_ULL(i);
+	for (i = 0; i < ARRAY_SIZE(xfeature_capflags); i++) {
+		short cpu_cap = xfeature_capflags[i].cpu_cap;
+		int idx = xfeature_capflags[i].xfeature_idx;
+
+		if (!boot_cpu_has(cpu_cap))
+			xfeatures_mask_all &= ~BIT_ULL(idx);
 	}
 
 	xfeatures_mask_all &= fpu__get_supported_xfeatures_mask();
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [RFC PATCH 18/22] x86/cpufeatures/amx: Enumerate Advanced Matrix Extension (AMX) feature bits
  2020-10-01 20:38 [RFC PATCH 00/22] x86: Support Intel Advanced Matrix Extensions Chang S. Bae
                   ` (16 preceding siblings ...)
  2020-10-01 20:39 ` [RFC PATCH 17/22] x86/fpu/xstate: Extend the table for mapping xstate components with features Chang S. Bae
@ 2020-10-01 20:39 ` Chang S. Bae
  2020-10-01 20:39 ` [RFC PATCH 19/22] x86/fpu/amx: Define AMX state components and have it used for boot-time checks Chang S. Bae
                   ` (3 subsequent siblings)
  21 siblings, 0 replies; 43+ messages in thread
From: Chang S. Bae @ 2020-10-01 20:39 UTC (permalink / raw)
  To: tglx, mingo, bp, luto, x86
  Cc: len.brown, dave.hansen, jing2.liu, ravi.v.shankar, linux-kernel,
	chang.seok.bae

Intel's Advanced Matrix Extension (AMX) is a new 64-bit extended feature
consisting of two-dimensional registers and an accelerator unit. The first
implementation of the latter is the tile matrix multiply unit (TMUL). TMUL
performs SIMD dot-products on four bytes (INT8) or two bfloat16
floating-point (BF16) elements.

Here we add AMX to the kernel/user ABI by enumerating the capability.
E.g., /proc/cpuinfo: amx_tile, amx_bf16, amx_int8
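
These flags live in CPUID leaf 7, subleaf 0, EDX -- bits 22, 24, and 25,
matching the cpufeatures definitions below. As a quick sanity check, the
same bits can be read from user space; a small sketch (illustrative only,
not part of this patch; assumes a GCC/Clang toolchain providing
<cpuid.h>):

	#include <stdio.h>
	#include <cpuid.h>

	int main(void)
	{
		unsigned int eax, ebx, ecx, edx;

		/* CPUID leaf 7, subleaf 0: structured extended feature flags */
		if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
			return 1;

		printf("amx_bf16: %d\n", !!(edx & (1u << 22)));
		printf("amx_tile: %d\n", !!(edx & (1u << 24)));
		printf("amx_int8: %d\n", !!(edx & (1u << 25)));
		return 0;
	}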

Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Reviewed-by: Len Brown <len.brown@intel.com>
Cc: x86@kernel.org
Cc: linux-kernel@vger.kernel.org
---
 arch/x86/include/asm/cpufeatures.h | 3 +++
 arch/x86/kernel/cpu/cpuid-deps.c   | 3 +++
 2 files changed, 6 insertions(+)

diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index 7d7fe1d82966..79ad9bb1c01c 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -371,6 +371,9 @@
 #define X86_FEATURE_SERIALIZE		(18*32+14) /* SERIALIZE instruction */
 #define X86_FEATURE_PCONFIG		(18*32+18) /* Intel PCONFIG */
 #define X86_FEATURE_ARCH_LBR		(18*32+19) /* Intel ARCH LBR */
+#define X86_FEATURE_AMX_BF16		(18*32+22) /* AMX BF16 Support */
+#define X86_FEATURE_AMX_TILE		(18*32+24) /* AMX tile Support */
+#define X86_FEATURE_AMX_INT8		(18*32+25) /* AMX INT8 Support */
 #define X86_FEATURE_SPEC_CTRL		(18*32+26) /* "" Speculation Control (IBRS + IBPB) */
 #define X86_FEATURE_INTEL_STIBP		(18*32+27) /* "" Single Thread Indirect Branch Predictors */
 #define X86_FEATURE_FLUSH_L1D		(18*32+28) /* Flush L1D cache */
diff --git a/arch/x86/kernel/cpu/cpuid-deps.c b/arch/x86/kernel/cpu/cpuid-deps.c
index 3cbe24ca80ab..27e036c73f7e 100644
--- a/arch/x86/kernel/cpu/cpuid-deps.c
+++ b/arch/x86/kernel/cpu/cpuid-deps.c
@@ -69,6 +69,9 @@ static const struct cpuid_dep cpuid_deps[] = {
 	{ X86_FEATURE_CQM_MBM_TOTAL,		X86_FEATURE_CQM_LLC   },
 	{ X86_FEATURE_CQM_MBM_LOCAL,		X86_FEATURE_CQM_LLC   },
 	{ X86_FEATURE_AVX512_BF16,		X86_FEATURE_AVX512VL  },
+	{ X86_FEATURE_AMX_TILE,			X86_FEATURE_XSAVE     },
+	{ X86_FEATURE_AMX_INT8,			X86_FEATURE_AMX_TILE  },
+	{ X86_FEATURE_AMX_BF16,			X86_FEATURE_AMX_TILE  },
 	{}
 };
 
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [RFC PATCH 19/22] x86/fpu/amx: Define AMX state components and have it used for boot-time checks
  2020-10-01 20:38 [RFC PATCH 00/22] x86: Support Intel Advanced Matrix Extensions Chang S. Bae
                   ` (17 preceding siblings ...)
  2020-10-01 20:39 ` [RFC PATCH 18/22] x86/cpufeatures/amx: Enumerate Advanced Matrix Extension (AMX) feature bits Chang S. Bae
@ 2020-10-01 20:39 ` Chang S. Bae
  2020-10-01 20:39 ` [RFC PATCH 20/22] x86/fpu/amx: Enable the AMX feature in 64-bit mode Chang S. Bae
                   ` (2 subsequent siblings)
  21 siblings, 0 replies; 43+ messages in thread
From: Chang S. Bae @ 2020-10-01 20:39 UTC (permalink / raw)
  To: tglx, mingo, bp, luto, x86
  Cc: len.brown, dave.hansen, jing2.liu, ravi.v.shankar, linux-kernel,
	chang.seok.bae

Linux uses check_xstate_against_struct() to sanity check the size of
XSTATE-enabled features. AMX is an XSAVE-enabled feature, and its size is
not hard-coded but discoverable at run-time via CPUID.

The AMX state is composed of state components 17 and 18, both of which are
user state components. The first, XTILECFG, holds a 64-byte tile-related
control register. State component 18, called XTILEDATA, contains the
actual tile data, and its size varies across implementations. The
architectural maximum, as defined in CPUID(0x1d, 1): EAX[15:0], is a byte
less than 64KB. The first implementation supports 8KB.

Check the XTILEDATA state size dynamically. The feature introduces the new
tile register, TMM. Define only one register struct and read the number of
registers from CPUID. Cross-check the overall size with CPUID again.
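
The same enumeration is visible from user space. A hedged sketch of
reading palette 1 (illustrative only, not part of this patch; the field
layout follows the CPUID leaf 0x1d description above):

	#include <stdio.h>
	#include <cpuid.h>

	int main(void)
	{
		unsigned int eax, ebx, ecx, edx;

		/* CPUID leaf 0x1d, subleaf 1: attributes of tile palette 1 */
		if (!__get_cpuid_count(0x1d, 1, &eax, &ebx, &ecx, &edx))
			return 1;

		printf("total tile bytes: %u\n", eax & 0xffff);
		printf("bytes per tile  : %u\n", eax >> 16);
		printf("bytes per row   : %u\n", ebx & 0xffff);
		printf("max tile names  : %u\n", ebx >> 16);
		printf("max rows        : %u\n", ecx & 0xffff);
		return 0;
	}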

Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Reviewed-by: Len Brown <len.brown@intel.com>
Cc: x86@kernel.org
Cc: linux-kernel@vger.kernel.org
---
 arch/x86/include/asm/fpu/types.h  | 27 +++++++++++++
 arch/x86/include/asm/fpu/xstate.h |  2 +
 arch/x86/kernel/fpu/xstate.c      | 65 +++++++++++++++++++++++++++++++
 3 files changed, 94 insertions(+)

diff --git a/arch/x86/include/asm/fpu/types.h b/arch/x86/include/asm/fpu/types.h
index 4b7756644824..002248dba6dc 100644
--- a/arch/x86/include/asm/fpu/types.h
+++ b/arch/x86/include/asm/fpu/types.h
@@ -120,6 +120,9 @@ enum xfeature {
 	XFEATURE_RSRVD_COMP_13,
 	XFEATURE_RSRVD_COMP_14,
 	XFEATURE_LBR,
+	XFEATURE_RSRVD_COMP_16,
+	XFEATURE_XTILE_CFG,
+	XFEATURE_XTILE_DATA,
 
 	XFEATURE_MAX,
 };
@@ -135,11 +138,15 @@ enum xfeature {
 #define XFEATURE_MASK_PT		(1 << XFEATURE_PT_UNIMPLEMENTED_SO_FAR)
 #define XFEATURE_MASK_PKRU		(1 << XFEATURE_PKRU)
 #define XFEATURE_MASK_LBR		(1 << XFEATURE_LBR)
+#define XFEATURE_MASK_XTILE_CFG	(1 << XFEATURE_XTILE_CFG)
+#define XFEATURE_MASK_XTILE_DATA	(1 << XFEATURE_XTILE_DATA)
 
 #define XFEATURE_MASK_FPSSE		(XFEATURE_MASK_FP | XFEATURE_MASK_SSE)
 #define XFEATURE_MASK_AVX512		(XFEATURE_MASK_OPMASK \
 					 | XFEATURE_MASK_ZMM_Hi256 \
 					 | XFEATURE_MASK_Hi16_ZMM)
+#define XFEATURE_MASK_XTILE		(XFEATURE_MASK_XTILE_DATA \
+					 | XFEATURE_MASK_XTILE_CFG)
 
 #define FIRST_EXTENDED_XFEATURE	XFEATURE_YMM
 
@@ -152,6 +159,9 @@ struct reg_256_bit {
 struct reg_512_bit {
 	u8	regbytes[512/8];
 };
+struct reg_1024_byte {
+	u8	regbytes[1024];
+};
 
 /*
  * State component 2:
@@ -254,6 +264,23 @@ struct arch_lbr_state {
 	u64 ler_to;
 	u64 ler_info;
 	struct lbr_entry		entries[];
+};
+
+/*
+ * State component 17: 64-byte tile configuration register.
+ */
+struct xtile_cfg {
+	u64				tcfg[8];
+} __packed;
+
+/*
+ * State component 18: 1KB tile data register.
+ * Each register represents 16 64-byte rows of the matrix
+ * data. But the number of registers depends on the actual
+ * implementation.
+ */
+struct xtile_data {
+	struct reg_1024_byte		tmm;
 } __packed;
 
 struct xstate_header {
diff --git a/arch/x86/include/asm/fpu/xstate.h b/arch/x86/include/asm/fpu/xstate.h
index b2125ec90cdb..aadbcf893cc0 100644
--- a/arch/x86/include/asm/fpu/xstate.h
+++ b/arch/x86/include/asm/fpu/xstate.h
@@ -13,6 +13,8 @@
 
 #define XSTATE_CPUID		0x0000000d
 
+#define TILE_CPUID		0x0000001d
+
 #define FXSAVE_SIZE	512
 
 #define XSAVE_HDR_SIZE	    64
diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index eaada4a38153..9d617d6506be 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -39,6 +39,15 @@ static const char *xfeature_names[] =
 	"Processor Trace (unused)"	,
 	"Protection Keys User registers",
 	"unknown xstate feature"	,
+	"unknown xstate feature"	,
+	"unknown xstate feature"	,
+	"unknown xstate feature"	,
+	"unknown xstate feature"	,
+	"unknown xstate feature"	,
+	"unknown xstate feature"	,
+	"AMX Tile config"		,
+	"AMX Tile data"			,
+	"unknown xstate feature"	,
 };
 
 struct xfeature_capflag_info {
@@ -57,6 +66,8 @@ static struct xfeature_capflag_info xfeature_capflags[] __initdata = {
 	{ XFEATURE_Hi16_ZMM,			X86_FEATURE_AVX512F },
 	{ XFEATURE_PT_UNIMPLEMENTED_SO_FAR,	X86_FEATURE_INTEL_PT },
 	{ XFEATURE_PKRU,			X86_FEATURE_PKU },
+	{ XFEATURE_XTILE_CFG,			X86_FEATURE_AMX_TILE },
+	{ XFEATURE_XTILE_DATA,			X86_FEATURE_AMX_TILE }
 };
 
 /*
@@ -417,6 +428,8 @@ static void __init print_xstate_features(void)
 	print_xstate_feature(XFEATURE_MASK_ZMM_Hi256);
 	print_xstate_feature(XFEATURE_MASK_Hi16_ZMM);
 	print_xstate_feature(XFEATURE_MASK_PKRU);
+	print_xstate_feature(XFEATURE_MASK_XTILE_CFG);
+	print_xstate_feature(XFEATURE_MASK_XTILE_DATA);
 }
 
 /*
@@ -669,6 +682,51 @@ static void __xstate_dump_leaves(void)
 	}								\
 } while (0)
 
+static void check_xtile_data_against_struct(int size)
+{
+	u32 max_palid, palid, state_size;
+	u32 eax, ebx, ecx, edx;
+	u16 max_tile;
+
+	/*
+	 * Check the maximum palette id:
+	 * eax:  Highest numbered palette subleaf.
+	 */
+	cpuid_count(TILE_CPUID, 0, &max_palid, &ebx, &ecx, &edx);
+
+	/*
+	 * Cross-check each tile size and find the maximum
+	 * number of supported tiles.
+	 */
+	for (palid = 1, max_tile = 0; palid <= max_palid; palid++) {
+		u16 tile_size, max;
+
+		/*
+		 * Check the tile size info:
+		 * eax[31:16]:  bytes per tile
+		 * ebx[31:16]:  the max names (or max number of tiles)
+		 */
+		cpuid_count(TILE_CPUID, palid, &eax, &ebx, &ecx, &edx);
+
+		tile_size = eax >> 16;
+		max = ebx >> 16;
+		if (WARN_ONCE(tile_size != sizeof(struct xtile_data),
+			      "%s: struct is %zu bytes, cpu xtile %d bytes\n",
+			      __stringify(XFEATURE_XTILE_DATA),
+			      sizeof(struct xtile_data), tile_size))
+			__xstate_dump_leaves();
+
+		if (max > max_tile)
+			max_tile = max;
+	}
+
+	state_size = sizeof(struct xtile_data) * max_tile;
+	if (WARN_ONCE(size != state_size,
+		      "%s: calculated size is %u bytes, cpu state %d bytes\n",
+		      __stringify(XFEATURE_XTILE_DATA), state_size, size))
+		__xstate_dump_leaves();
+}
+
 /*
  * We have a C struct for each 'xstate'.  We need to ensure
  * that our software representation matches what the CPU
@@ -691,6 +749,13 @@ static void check_xstate_against_struct(int nr)
 	XCHECK_SZ(sz, nr, XFEATURE_ZMM_Hi256, struct avx_512_zmm_uppers_state);
 	XCHECK_SZ(sz, nr, XFEATURE_Hi16_ZMM,  struct avx_512_hi16_state);
 	XCHECK_SZ(sz, nr, XFEATURE_PKRU,      struct pkru_state);
+	XCHECK_SZ(sz, nr, XFEATURE_XTILE_CFG, struct xtile_cfg);
+	/*
+	 * The tile data size varies between implementations, while the other state
+	 * sizes are constant.
+	 */
+	if (nr == XFEATURE_XTILE_DATA)
+		check_xtile_data_against_struct(sz);
 
 	/*
 	 * Make *SURE* to add any feature numbers in below if
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [RFC PATCH 20/22] x86/fpu/amx: Enable the AMX feature in 64-bit mode
  2020-10-01 20:38 [RFC PATCH 00/22] x86: Support Intel Advanced Matrix Extensions Chang S. Bae
                   ` (18 preceding siblings ...)
  2020-10-01 20:39 ` [RFC PATCH 19/22] x86/fpu/amx: Define AMX state components and have it used for boot-time checks Chang S. Bae
@ 2020-10-01 20:39 ` Chang S. Bae
  2020-10-01 20:39 ` [RFC PATCH 21/22] selftest/x86/amx: Include test cases for the AMX state management Chang S. Bae
  2020-10-01 20:39 ` [RFC PATCH 22/22] x86/fpu/xstate: Introduce boot-parameters to control some state component support Chang S. Bae
  21 siblings, 0 replies; 43+ messages in thread
From: Chang S. Bae @ 2020-10-01 20:39 UTC (permalink / raw)
  To: tglx, mingo, bp, luto, x86
  Cc: len.brown, dave.hansen, jing2.liu, ravi.v.shankar, linux-kernel,
	chang.seok.bae

In 64-bit mode, include the AMX state components in
XFEATURE_MASK_USER_SUPPORTED.

The XFD feature will be used to dynamically allocate the per-task XSAVE
buffer on first use.

Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Reviewed-by: Len Brown <len.brown@intel.com>
Cc: x86@kernel.org
Cc: linux-kernel@vger.kernel.org
---
 arch/x86/include/asm/fpu/xstate.h | 3 ++-
 arch/x86/kernel/fpu/init.c        | 8 ++++++--
 2 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/fpu/xstate.h b/arch/x86/include/asm/fpu/xstate.h
index aadbcf893cc0..872325768b13 100644
--- a/arch/x86/include/asm/fpu/xstate.h
+++ b/arch/x86/include/asm/fpu/xstate.h
@@ -34,7 +34,8 @@
 				      XFEATURE_MASK_Hi16_ZMM	 | \
 				      XFEATURE_MASK_PKRU | \
 				      XFEATURE_MASK_BNDREGS | \
-				      XFEATURE_MASK_BNDCSR)
+				      XFEATURE_MASK_BNDCSR | \
+				      XFEATURE_MASK_XTILE)
 
 /* All currently supported supervisor features */
 #define XFEATURE_MASK_SUPERVISOR_SUPPORTED (0)
diff --git a/arch/x86/kernel/fpu/init.c b/arch/x86/kernel/fpu/init.c
index ee6499075a89..8e2a77bc1782 100644
--- a/arch/x86/kernel/fpu/init.c
+++ b/arch/x86/kernel/fpu/init.c
@@ -229,8 +229,12 @@ static void __init fpu__init_system_xstate_size_legacy(void)
  */
 u64 __init fpu__get_supported_xfeatures_mask(void)
 {
-	return XFEATURE_MASK_USER_SUPPORTED |
-	       XFEATURE_MASK_SUPERVISOR_SUPPORTED;
+	u64 mask = XFEATURE_MASK_USER_SUPPORTED | XFEATURE_MASK_SUPERVISOR_SUPPORTED;
+
+	if (!IS_ENABLED(CONFIG_X86_64))
+		mask &= ~(XFEATURE_MASK_XTILE);
+
+	return mask;
 }
 
 /* Legacy code to initialize eager fpu mode. */
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [RFC PATCH 21/22] selftest/x86/amx: Include test cases for the AMX state management
  2020-10-01 20:38 [RFC PATCH 00/22] x86: Support Intel Advanced Matrix Extensions Chang S. Bae
                   ` (19 preceding siblings ...)
  2020-10-01 20:39 ` [RFC PATCH 20/22] x86/fpu/amx: Enable the AMX feature in 64-bit mode Chang S. Bae
@ 2020-10-01 20:39 ` Chang S. Bae
  2020-10-01 20:39 ` [RFC PATCH 22/22] x86/fpu/xstate: Introduce boot-parameters to control some state component support Chang S. Bae
  21 siblings, 0 replies; 43+ messages in thread
From: Chang S. Bae @ 2020-10-01 20:39 UTC (permalink / raw)
  To: tglx, mingo, bp, luto, x86
  Cc: len.brown, dave.hansen, jing2.liu, ravi.v.shankar, linux-kernel,
	chang.seok.bae, linux-kselftest

This selftest exercises the kernel's ability to inherit and context switch
AMX state, by verifying that tasks retain unique data across fork() and
between multiple threads.

Also, ptrace() is used to insert AMX state into existing threads -- both
before and after the existing thread has initialized its AMX state.

For the signal handling path, verify in the signal handler that the signal
frame either includes or excludes AMX data -- depending on whether the
signaled thread has initialized AMX state.

Collect the test cases validating those operations together, as they
share some common setup for the AMX state.

These test cases do not depend on AMX compiler support, as they use XSAVE
directly from user space to access AMX state.

Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Reviewed-by: Len Brown <len.brown@intel.com>
Cc: x86@kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: linux-kselftest@vger.kernel.org
---
 tools/testing/selftests/x86/Makefile |   2 +-
 tools/testing/selftests/x86/amx.c    | 736 +++++++++++++++++++++++++++
 2 files changed, 737 insertions(+), 1 deletion(-)
 create mode 100644 tools/testing/selftests/x86/amx.c

diff --git a/tools/testing/selftests/x86/Makefile b/tools/testing/selftests/x86/Makefile
index e0c52e5ab49e..6f6e6cabca69 100644
--- a/tools/testing/selftests/x86/Makefile
+++ b/tools/testing/selftests/x86/Makefile
@@ -17,7 +17,7 @@ TARGETS_C_BOTHBITS := single_step_syscall sysret_ss_attrs syscall_nt test_mremap
 TARGETS_C_32BIT_ONLY := entry_from_vm86 test_syscall_vdso unwind_vdso \
 			test_FCMOV test_FCOMI test_FISTTP \
 			vdso_restorer
-TARGETS_C_64BIT_ONLY := fsgsbase sysret_rip syscall_numbering
+TARGETS_C_64BIT_ONLY := fsgsbase sysret_rip syscall_numbering amx
 # Some selftests require 32bit support enabled also on 64bit systems
 TARGETS_C_32BIT_NEEDED := ldt_gdt ptrace_syscall
 
diff --git a/tools/testing/selftests/x86/amx.c b/tools/testing/selftests/x86/amx.c
new file mode 100644
index 000000000000..bf766b22cf77
--- /dev/null
+++ b/tools/testing/selftests/x86/amx.c
@@ -0,0 +1,736 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#define _GNU_SOURCE
+#include <err.h>
+#include <elf.h>
+#include <pthread.h>
+#include <sched.h>
+#include <signal.h>
+#include <stdio.h>
+#include <string.h>
+#include <stdbool.h>
+#include <stdint.h>
+#include <stdlib.h>
+#include <time.h>
+#include <malloc.h>
+#include <unistd.h>
+#include <ucontext.h>
+
+#include <linux/futex.h>
+
+#include <sys/ipc.h>
+#include <sys/mman.h>
+#include <sys/ptrace.h>
+#include <sys/shm.h>
+#include <sys/signal.h>
+#include <sys/syscall.h>
+#include <sys/time.h>
+#include <sys/types.h>
+#include <sys/wait.h>
+#include <sys/uio.h>
+#include <sys/ucontext.h>
+
+#include <x86intrin.h>
+
+#ifndef __x86_64__
+# error This test is 64-bit only
+#endif
+
+typedef uint8_t u8;
+typedef uint16_t u16;
+typedef uint32_t u32;
+typedef uint64_t u64;
+
+#define PAGE_SIZE			(1 << 12)
+
+#define NUM_TILES			8
+#define TILE_SIZE			1024
+#define XSAVE_SIZE			((NUM_TILES * TILE_SIZE) + PAGE_SIZE)
+
+struct xsave_data {
+	u8 area[XSAVE_SIZE];
+} __attribute__((aligned(64)));
+
+/* Tile configuration definitions: */
+#define MAX_TILES			16
+#define RESERVED_BYTES			14
+
+struct tile_config {
+	u8  palette_id;
+	u8  start_row;
+	u8  reserved[RESERVED_BYTES];
+	u16 colsb[MAX_TILES];
+	u8  rows[MAX_TILES];
+};
+
+struct tile_data {
+	u8 data[NUM_TILES * TILE_SIZE];
+};
+
+static inline u64 __xgetbv(u32 index)
+{
+	u32 eax, edx;
+
+	asm volatile(".byte 0x0f,0x01,0xd0"
+		     : "=a" (eax), "=d" (edx)
+		     : "c" (index));
+	return eax | ((u64)edx << 32);
+}
+
+static inline void __cpuid(u32 *eax, u32 *ebx, u32 *ecx, u32 *edx)
+{
+	asm volatile("cpuid;"
+		     : "=a" (*eax), "=b" (*ebx), "=c" (*ecx), "=d" (*edx)
+		     : "0" (*eax), "2" (*ecx));
+}
+
+/* Load tile configuration */
+static inline void __ldtilecfg(void *cfg)
+{
+	asm volatile(".byte 0xc4,0xe2,0x78,0x49,0x00"
+		     : : "a"(cfg));
+}
+
+/* Load tile data to %tmm0 register only */
+static inline void __tileloadd(void *tile)
+{
+	asm volatile(".byte 0xc4,0xe2,0x7b,0x4b,0x04,0x10"
+		     : : "a"(tile), "d"(0));
+}
+
+/* Save extended states */
+static inline void __xsave(void *area, u32 lo, u32 hi)
+{
+	asm volatile(".byte 0x48,0x0f,0xae,0x27"
+		     : : "D" (area), "a" (lo), "d" (hi)
+		     : "memory");
+}
+
+/* Restore extended states */
+static inline void __xrstor(void *area, u32 lo, u32 hi)
+{
+	asm volatile(".byte 0x48,0x0f,0xae,0x2f"
+		     : : "D" (area), "a" (lo), "d" (hi));
+}
+
+/* Release tile states to init values */
+static inline void __tilerelease(void)
+{
+	asm volatile(".byte 0xc4, 0xe2, 0x78, 0x49, 0xc0" ::);
+}
+
+/* Hardware info check: */
+
+static inline bool check_xsave_supports_xtile(void)
+{
+	u32 eax, ebx, ecx, edx;
+	bool available = false;
+
+#define XSAVE_CPUID		0x1
+#define XSAVE_ECX_BIT		26
+#define XFEATURE_XTILE_CFG	17
+#define XFEATURE_XTILE_DATA	18
+#define XFEATURE_MASK_XTILE	((1 << XFEATURE_XTILE_DATA) | \
+				(1 << XFEATURE_XTILE_CFG))
+
+	eax = XSAVE_CPUID;
+	ecx = 0;
+
+	__cpuid(&eax, &ebx, &ecx, &edx);
+	if (!(ecx & (1 << XSAVE_ECX_BIT)))
+		return available;
+
+	if (__xgetbv(0) & XFEATURE_MASK_XTILE)
+		available = true;
+
+	return available;
+}
+
+struct xtile_hwinfo {
+	struct {
+		u16 bytes_per_tile;
+		u16 bytes_per_row;
+		u16 max_names;
+		u16 max_rows;
+	} spec;
+
+	struct {
+		u32 offset;
+		u32 size;
+	} xsave;
+};
+
+static struct xtile_hwinfo xtile;
+
+static bool __enum_xtile_config(void)
+{
+	u32 eax, ebx, ecx, edx;
+	u16 bytes_per_tile;
+	bool valid = false;
+
+#define TILE_CPUID			0x1d
+#define TILE_PALETTE_CPUID_SUBLEAVE	0x1
+
+	eax = TILE_CPUID;
+	ecx = TILE_PALETTE_CPUID_SUBLEAVE;
+
+	__cpuid(&eax, &ebx, &ecx, &edx);
+	if (!eax || !ebx || !ecx)
+		return valid;
+
+	xtile.spec.max_names = ebx >> 16;
+	if (xtile.spec.max_names < NUM_TILES)
+		return valid;
+
+	bytes_per_tile = eax >> 16;
+	if (bytes_per_tile < TILE_SIZE)
+		return valid;
+
+	xtile.spec.bytes_per_row = ebx;
+	xtile.spec.max_rows = ecx;
+	valid = true;
+
+	return valid;
+}
+
+static bool __enum_xsave_tile(void)
+{
+	u32 eax, ebx, ecx, edx;
+	bool valid = false;
+
+#define XSTATE_CPUID			0xd
+#define XSTATE_USER_STATE_SUBLEAVE	0x0
+
+	eax = XSTATE_CPUID;
+	ecx = XFEATURE_XTILE_DATA;
+
+	__cpuid(&eax, &ebx, &ecx, &edx);
+	if (!eax || !ebx)
+		return valid;
+
+	xtile.xsave.offset = ebx;
+	xtile.xsave.size = eax;
+	valid = true;
+
+	return valid;
+}
+
+static bool __check_xsave_size(void)
+{
+	u32 eax, ebx, ecx, edx;
+	bool valid = false;
+
+	eax = XSTATE_CPUID;
+	ecx = XSTATE_USER_STATE_SUBLEAVE;
+
+	__cpuid(&eax, &ebx, &ecx, &edx);
+	if (ebx && ebx <= XSAVE_SIZE)
+		valid = true;
+
+	return valid;
+}
+
+/*
+ * Check the hardware-provided tile state info and cross-check it with the
+ * hard-coded values: XSAVE_SIZE, NUM_TILES, and TILE_SIZE.
+ */
+static int check_xtile_hwinfo(void)
+{
+	bool success = false;
+
+	if (!__check_xsave_size())
+		return success;
+
+	if (!__enum_xsave_tile())
+		return success;
+
+	if (!__enum_xtile_config())
+		return success;
+
+	if (sizeof(struct tile_data) >= xtile.xsave.size)
+		success = true;
+
+	return success;
+}
+
+/* The helpers for managing XSAVE area and tile states: */
+
+/* Use the uncompacted format without 'init optimization' */
+static void save_xdata(void *data)
+{
+	__xsave(data, -1, -1);
+}
+
+static void restore_xdata(void *data)
+{
+	__xrstor(data, -1, -1);
+}
+
+static inline u64 __get_xsave_xstate_bv(void *data)
+{
+#define XSAVE_HDR_OFFSET	512
+	return *(u64 *)(data + XSAVE_HDR_OFFSET);
+}
+
+static void set_tilecfg(struct tile_config *cfg)
+{
+	int i;
+
+	memset(cfg, 0, sizeof(*cfg));
+	/* The first implementation has one significant palette with id 1 */
+	cfg->palette_id = 1;
+	for (i = 0; i < xtile.spec.max_names; i++) {
+		cfg->colsb[i] = xtile.spec.bytes_per_row;
+		cfg->rows[i] = xtile.spec.max_rows;
+	}
+}
+
+static void load_tilecfg(struct tile_config *cfg)
+{
+	__ldtilecfg(cfg);
+}
+
+static void make_tiles(void *tiles)
+{
+	u32 iterations = xtile.xsave.size / sizeof(u32);
+	static u32 value = 1;
+	u32 *ptr;
+	int i;
+
+	for (i = 0, ptr = tiles; i < iterations; i++, ptr++)
+		*ptr = value;
+	value++;
+}
+
+/*
+ * Initialize the XSAVE area:
+ *
+ * Make sure the tile configuration is already loaded. Load limited tile data
+ * (%tmm0 only) and save all the states. The XSAVE area is then ready to be
+ * filled with the complete tile data.
+ */
+static void init_xdata(void *data)
+{
+	struct tile_data tiles;
+
+	make_tiles(&tiles);
+	__tileloadd(&tiles);
+	__xsave(data, -1, -1);
+}
+
+static inline void *__get_xsave_tile_data_addr(void *data)
+{
+	return data + xtile.xsave.offset;
+}
+
+static void copy_tiles_to_xdata(void *xdata, void *tiles)
+{
+	void *dst = __get_xsave_tile_data_addr(xdata);
+
+	memcpy(dst, tiles, xtile.xsave.size);
+}
+
+static int compare_xdata_tiles(void *xdata, void *tiles)
+{
+	void *tile_data = __get_xsave_tile_data_addr(xdata);
+
+	if (memcmp(tile_data, tiles, xtile.xsave.size))
+		return 1;
+
+	return 0;
+}
+
+static int nerrs, errs;
+
+/* Testing tile data inheritance */
+
+static void test_tile_data_inheritance(void)
+{
+	struct xsave_data xdata;
+	struct tile_data tiles;
+	struct tile_config cfg;
+	pid_t child;
+	int status;
+
+	set_tilecfg(&cfg);
+	load_tilecfg(&cfg);
+	init_xdata(&xdata);
+
+	make_tiles(&tiles);
+	copy_tiles_to_xdata(&xdata, &tiles);
+	restore_xdata(&xdata);
+
+	errs = 0;
+
+	child = fork();
+	if (child < 0)
+		err(1, "fork");
+
+	if (child == 0) {
+		memset(&xdata, 0, sizeof(xdata));
+		save_xdata(&xdata);
+		if (compare_xdata_tiles(&xdata, &tiles)) {
+			printf("[FAIL]\tchild failed to inherit tile data during fork()\n");
+			nerrs++;
+		} else {
+			printf("[OK]\tchild inherited tile data during fork()\n");
+		}
+		_exit(0);
+	}
+	wait(&status);
+}
+
+static void test_fork(void)
+{
+	pid_t child;
+	int status;
+
+	child = fork();
+	if (child < 0)
+		err(1, "fork");
+
+	if (child == 0) {
+		test_tile_data_inheritance();
+		_exit(0);
+	}
+
+	wait(&status);
+}
+
+/* Context switching test */
+
+#define ITERATIONS			10
+#define NUM_THREADS			5
+
+struct futex_info {
+	int current;
+	int next;
+	int *futex;
+};
+
+static inline void command_wait(struct futex_info *info, int value)
+{
+	do {
+		sched_yield();
+	} while (syscall(SYS_futex, info->futex, FUTEX_WAIT, value, 0, 0, 0));
+}
+
+static inline void command_wake(struct futex_info *info, int value)
+{
+	do {
+		*info->futex = value;
+		while (!syscall(SYS_futex, info->futex, FUTEX_WAKE, 1, 0, 0, 0))
+			sched_yield();
+	} while (0);
+}
+
+static inline int get_iterative_value(int id)
+{
+	return ((id << 1) & ~0x1);
+}
+
+static inline int get_endpoint_value(int id)
+{
+	return ((id << 1) | 0x1);
+}
+
+static void *check_tiles(void *info)
+{
+	struct futex_info *finfo = (struct futex_info *)info;
+	struct xsave_data xdata;
+	struct tile_data tiles;
+	struct tile_config cfg;
+	int i;
+
+	set_tilecfg(&cfg);
+	load_tilecfg(&cfg);
+	init_xdata(&xdata);
+
+	make_tiles(&tiles);
+	copy_tiles_to_xdata(&xdata, &tiles);
+	restore_xdata(&xdata);
+
+	for (i = 0; i < ITERATIONS; i++) {
+		command_wait(finfo, get_iterative_value(finfo->current));
+
+		memset(&xdata, 0, sizeof(xdata));
+		save_xdata(&xdata);
+		errs += compare_xdata_tiles(&xdata, &tiles);
+
+		make_tiles(&tiles);
+		copy_tiles_to_xdata(&xdata, &tiles);
+		restore_xdata(&xdata);
+
+		command_wake(finfo, get_iterative_value(finfo->next));
+	}
+
+	command_wait(finfo, get_endpoint_value(finfo->current));
+	__tilerelease();
+	return NULL;
+}
+
+static int create_children(int num, struct futex_info *finfo)
+{
+	const int shm_id = shmget(IPC_PRIVATE, sizeof(int), IPC_CREAT | 0666);
+	int *futex = shmat(shm_id, NULL, 0);
+	pthread_t thread;
+	int i;
+
+	for (i = 0; i < num; i++) {
+		finfo[i].futex = futex;
+		finfo[i].current = i + 1;
+		finfo[i].next = (i + 2) % (num + 1);
+
+		if (pthread_create(&thread, NULL, check_tiles, &finfo[i])) {
+			err(1, "pthread_create");
+			return 1;
+		}
+	}
+	return 0;
+}
+
+static void test_context_switch(void)
+{
+	struct futex_info *finfo;
+	cpu_set_t cpuset;
+	int i;
+
+	printf("[RUN]\t%u context switches of tile states in %d threads\n",
+	       ITERATIONS * NUM_THREADS, NUM_THREADS);
+
+	errs = 0;
+
+	CPU_ZERO(&cpuset);
+	CPU_SET(0, &cpuset);
+	if (sched_setaffinity(0, sizeof(cpuset), &cpuset) != 0)
+		err(1, "sched_setaffinity to CPU 0");
+
+	finfo = malloc(sizeof(*finfo) * NUM_THREADS);
+
+	if (create_children(NUM_THREADS, finfo))
+		return;
+
+	for (i = 0; i < ITERATIONS; i++) {
+		command_wake(finfo, get_iterative_value(1));
+		command_wait(finfo, get_iterative_value(0));
+	}
+
+	for (i = 1; i <= NUM_THREADS; i++)
+		command_wake(finfo, get_endpoint_value(i));
+
+	if (errs) {
+		printf("[FAIL]\t%u incorrect tile states\n", errs);
+		nerrs += errs;
+		return;
+	}
+
+	printf("[OK]\tall tile states are correct\n");
+}
+
+/* Ptrace test */
+
+static inline long get_tile_state(pid_t child, struct iovec *iov)
+{
+	return ptrace(PTRACE_GETREGSET, child, (u32)NT_X86_XSTATE, iov);
+}
+
+static inline long set_tile_state(pid_t child, struct iovec *iov)
+{
+	return ptrace(PTRACE_SETREGSET, child, (u32)NT_X86_XSTATE, iov);
+}
+
+static int write_tile_state(bool load_tile, pid_t child)
+{
+	struct xsave_data xdata;
+	struct tile_data tiles;
+	struct iovec iov;
+
+	iov.iov_base = &xdata;
+	iov.iov_len = sizeof(xdata);
+
+	if (get_tile_state(child, &iov))
+		err(1, "PTRACE_GETREGSET");
+
+	make_tiles(&tiles);
+	copy_tiles_to_xdata(&xdata, &tiles);
+	if (set_tile_state(child, &iov))
+		err(1, "PTRACE_SETREGSET");
+
+	memset(&xdata, 0, sizeof(xdata));
+	if (get_tile_state(child, &iov))
+		err(1, "PTRACE_GETREGSET");
+
+	if (!load_tile)
+		memset(&tiles, 0, sizeof(tiles));
+
+	return compare_xdata_tiles(&xdata, &tiles);
+}
+
+static void test_tile_state_write(bool load_tile)
+{
+	pid_t child;
+	int status;
+
+	child = fork();
+	if (child < 0)
+		err(1, "fork");
+
+	if (child == 0) {
+		printf("[RUN]\tPtrace-induced tile state write, ");
+		printf("%s tile data loaded\n", load_tile ? "with" : "without");
+
+		if (ptrace(PTRACE_TRACEME, 0, NULL, NULL))
+			err(1, "PTRACE_TRACEME");
+
+		if (load_tile) {
+			struct tile_config cfg;
+			struct tile_data tiles;
+
+			set_tilecfg(&cfg);
+			load_tilecfg(&cfg);
+			make_tiles(&tiles);
+			/* Load only %tmm0, which induces the #NM */
+			__tileloadd(&tiles);
+		}
+
+		raise(SIGTRAP);
+		_exit(0);
+	}
+
+	do {
+		wait(&status);
+	} while (WSTOPSIG(status) != SIGTRAP);
+
+	errs = write_tile_state(load_tile, child);
+	if (errs) {
+		nerrs++;
+		printf("[FAIL]\t%s write\n", load_tile ? "incorrect" : "unexpected");
+	} else {
+		printf("[OK]\t%s write\n", load_tile ? "correct" : "no");
+	}
+
+	ptrace(PTRACE_DETACH, child, NULL, NULL);
+	wait(&status);
+}
+
+static void test_ptrace(void)
+{
+	bool ptracee_loads_tiles;
+
+	ptracee_loads_tiles = true;
+	test_tile_state_write(ptracee_loads_tiles);
+
+	ptracee_loads_tiles = false;
+	test_tile_state_write(ptracee_loads_tiles);
+}
+
+/* Signal handling test */
+
+static int sigtrapped;
+struct tile_data sig_tiles, sighdl_tiles;
+
+static void sethandler(int sig, void (*handler)(int, siginfo_t *, void *),
+		       int flags)
+{
+	struct sigaction sa;
+
+	memset(&sa, 0, sizeof(sa));
+	sa.sa_sigaction = handler;
+	sa.sa_flags = SA_SIGINFO | flags;
+	sigemptyset(&sa.sa_mask);
+	if (sigaction(sig, &sa, 0))
+		err(1, "sigaction");
+}
+
+static void clearhandler(int sig)
+{
+	struct sigaction sa;
+
+	memset(&sa, 0, sizeof(sa));
+	sa.sa_handler = SIG_DFL;
+	sigemptyset(&sa.sa_mask);
+	if (sigaction(sig, &sa, 0))
+		err(1, "sigaction");
+}
+
+static void sighandler(int sig, siginfo_t *info, void *ctx_void)
+{
+	ucontext_t *uctxt = (ucontext_t *)ctx_void;
+	struct xsave_data xdata;
+	struct tile_config cfg;
+	struct tile_data tiles;
+	u64 header;
+
+	header = __get_xsave_xstate_bv((void *)uctxt->uc_mcontext.fpregs);
+
+	if (header & (1 << XFEATURE_XTILE_DATA))
+		printf("[FAIL]\ttile data was written in sigframe\n");
+	else
+		printf("[OK]\ttile data was skipped in sigframe\n");
+
+	set_tilecfg(&cfg);
+	load_tilecfg(&cfg);
+	init_xdata(&xdata);
+
+	make_tiles(&tiles);
+	copy_tiles_to_xdata(&xdata, &tiles);
+	restore_xdata(&xdata);
+
+	save_xdata(&xdata);
+	if (compare_xdata_tiles(&xdata, &tiles))
+		err(1, "tile load failed");
+
+	printf("\tsignal handler: load tile data\n");
+
+	sigtrapped = sig;
+}
+
+static void test_signal_handling(void)
+{
+	struct xsave_data xdata = { 0 };
+	struct tile_data tiles = { 0 };
+
+	clearhandler(SIGTRAP);
+	sethandler(SIGTRAP, sighandler, 1);
+	sigtrapped = 0;
+
+	printf("[RUN]\tCheck tile state management in handling signal\n");
+
+	printf("\tbefore signal: initial tile data state\n");
+
+	raise(SIGTRAP);
+
+	if (sigtrapped == 0)
+		err(1, "sigtrap");
+
+	save_xdata(&xdata);
+	if (compare_xdata_tiles(&xdata, &tiles)) {
+		printf("[FAIL]\ttile data was not loaded at sigreturn\n");
+		nerrs++;
+	} else {
+		printf("[OK]\ttile data was re-initialized at sigreturn\n");
+	}
+}
+
+int main(void)
+{
+	/* Check hardware availability at first */
+
+	if (!check_xsave_supports_xtile()) {
+		printf("\tTile state is not supported by XSAVE.\n");
+		return 0;
+	}
+
+	if (!check_xtile_hwinfo()) {
+		printf("\tAvailable tile state size is insufficient to test.\n");
+		return 0;
+	}
+
+	nerrs = 0;
+
+	test_fork();
+	test_context_switch();
+	test_ptrace();
+	test_signal_handling();
+
+	return nerrs ? 1 : 0;
+}
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* [RFC PATCH 22/22] x86/fpu/xstate: Introduce boot-parameters to control some state component support
  2020-10-01 20:38 [RFC PATCH 00/22] x86: Support Intel Advanced Matrix Extensions Chang S. Bae
                   ` (20 preceding siblings ...)
  2020-10-01 20:39 ` [RFC PATCH 21/22] selftest/x86/amx: Include test cases for the AMX state management Chang S. Bae
@ 2020-10-01 20:39 ` Chang S. Bae
  2020-10-02  2:09   ` Randy Dunlap
                     ` (2 more replies)
  21 siblings, 3 replies; 43+ messages in thread
From: Chang S. Bae @ 2020-10-01 20:39 UTC (permalink / raw)
  To: tglx, mingo, bp, luto, x86
  Cc: len.brown, dave.hansen, jing2.liu, ravi.v.shankar, linux-kernel,
	chang.seok.bae, linux-doc

"xstate.disable=0x6000" will disable AMX on a system that has AMX compiled
into XFEATURE_MASK_USER_SUPPORTED.

"xstate.enable=0x6000" will enable AMX on a system that does NOT have AMX
compiled into XFEATURE_MASK_USER_SUPPORTED (assuming the kernel is new
enough to support this feature).

While this cmdline is currently wired up only for AMX, it is intended to
be easily extended to future XSAVE-enabled features.
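
The mask value follows the XCR0 bit layout. With XTILECFG at bit 17 and
XTILEDATA at bit 18, the AMX mask works out as shown below (a worked
sketch, not part of the patch):

	#include <stdio.h>

	int main(void)
	{
		unsigned long long xtile_cfg  = 1ULL << 17;	/* XFEATURE_XTILE_CFG  */
		unsigned long long xtile_data = 1ULL << 18;	/* XFEATURE_XTILE_DATA */

		/* Prints 0x60000 -- the value passed to xstate.enable/disable */
		printf("0x%llx\n", xtile_cfg | xtile_data);
		return 0;
	}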

Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
Reviewed-by: Len Brown <len.brown@intel.com>
Cc: x86@kernel.org
Cc: linux-doc@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
---
 .../admin-guide/kernel-parameters.txt         | 15 ++++++
 arch/x86/include/asm/fpu/types.h              |  6 +++
 arch/x86/kernel/fpu/init.c                    | 52 +++++++++++++++++--
 3 files changed, 70 insertions(+), 3 deletions(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index a1068742a6df..742167c6f789 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -5838,6 +5838,21 @@
 			which allow the hypervisor to 'idle' the guest on lock
 			contention.
 
+	xstate.enable=	[X86-64]
+	xstate.disable=	[X86-64]
+			The kernel is compiled with a default xstate bitmask --
+			enabling it to use the XSAVE hardware to efficiently
+			save and restore thread states on context switch.
+			xstate.enable allows adding to that default mask at
+			boot-time without recompiling the kernel just to support
+			the new thread state. (Note that the kernel will ignore
+			any bits in the mask that do not correspond to features
+			that are actually available in CPUID.)  xstate.disable
+			allows clearing bits in the default mask, forcing the
+			kernel to forget that it supports the specified thread
+			state. When a bit is set in both masks, the kernel gives
+			xstate.disable the priority.
+
 	xirc2ps_cs=	[NET,PCMCIA]
 			Format:
 			<irq>,<irq_mask>,<io>,<full_duplex>,<do_sound>,<lockup_hack>[,<irq2>[,<irq3>[,<irq4>]]]
diff --git a/arch/x86/include/asm/fpu/types.h b/arch/x86/include/asm/fpu/types.h
index 002248dba6dc..2a944e8903bb 100644
--- a/arch/x86/include/asm/fpu/types.h
+++ b/arch/x86/include/asm/fpu/types.h
@@ -148,6 +148,12 @@ enum xfeature {
 #define XFEATURE_MASK_XTILE		(XFEATURE_MASK_XTILE_DATA \
 					 | XFEATURE_MASK_XTILE_CFG)
 
+#define XFEATURE_REGION_MASK(max_bit, min_bit) \
+	((BIT_ULL((max_bit) - (min_bit) + 1) - 1) << (min_bit))
+
+#define XFEATURE_MASK_CONFIGURABLE \
+	XFEATURE_REGION_MASK(XFEATURE_XTILE_DATA, XFEATURE_XTILE_CFG)
+
 #define FIRST_EXTENDED_XFEATURE	XFEATURE_YMM
 
 struct reg_128_bit {
diff --git a/arch/x86/kernel/fpu/init.c b/arch/x86/kernel/fpu/init.c
index 8e2a77bc1782..a354286e7c90 100644
--- a/arch/x86/kernel/fpu/init.c
+++ b/arch/x86/kernel/fpu/init.c
@@ -227,13 +227,42 @@ static void __init fpu__init_system_xstate_size_legacy(void)
  * This must be called after fpu__init_parse_early_param() is called and
  * xfeatures_mask is enumerated.
  */
+
+static u64 xstate_enable;
+static u64 xstate_disable;
+
 u64 __init fpu__get_supported_xfeatures_mask(void)
 {
 	u64 mask = XFEATURE_MASK_USER_SUPPORTED | XFEATURE_MASK_SUPERVISOR_SUPPORTED;
 
-	if (!IS_ENABLED(CONFIG_X86_64))
-		mask &= ~(XFEATURE_MASK_XTILE);
-
+	if (!IS_ENABLED(CONFIG_X86_64)) {
+		mask  &= ~(XFEATURE_MASK_XTILE);
+	} else if (xstate_enable || xstate_disable) {
+		u64 custom = mask;
+		u64 unknown;
+
+		custom |= xstate_enable;
+		custom &= ~xstate_disable;
+
+		unknown = custom & ~mask;
+		if (unknown) {
+			/*
+			 * Users should fully understand the consequences of
+			 * enabling an undocumented xstate component.
+			 */
+			pr_warn("x86/fpu: Attempt to enable unknown xstate features 0x%llx\n",
+				unknown);
+			WARN_ON_FPU(1);
+		}
+
+		if ((custom & XFEATURE_MASK_XTILE) != XFEATURE_MASK_XTILE) {
+			pr_warn("x86/fpu: Disabling 0x%x components due to an incomplete setup\n",
+				XFEATURE_MASK_XTILE);
+			custom &= ~(XFEATURE_MASK_XTILE);
+		}
+
+		mask = custom;
+	}
 	return mask;
 }
 
@@ -254,6 +283,7 @@ static void __init fpu__init_parse_early_param(void)
 {
 	char arg[32];
 	char *argptr = arg;
+	u64 mask;
 	int bit;
 
 #ifdef CONFIG_X86_32
@@ -283,6 +313,22 @@ static void __init fpu__init_parse_early_param(void)
 	    bit >= 0 &&
 	    bit < NCAPINTS * 32)
 		setup_clear_cpu_cap(bit);
+
+	if (cmdline_find_option(boot_command_line, "xstate.enable", arg,
+				sizeof(arg)) &&
+	    !kstrtoull(arg, 16, &mask) &&
+	    (mask &= XFEATURE_MASK_CONFIGURABLE))
+		xstate_enable = mask;
+	else
+		xstate_enable = 0;
+
+	if (cmdline_find_option(boot_command_line, "xstate.disable", arg,
+				sizeof(arg)) &&
+	    !kstrtoull(arg, 16, &mask) &&
+	    (mask &= XFEATURE_MASK_CONFIGURABLE))
+		xstate_disable = mask;
+	else
+		xstate_disable = 0;
 }
 
 /*
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 43+ messages in thread

* Re: [RFC PATCH 13/22] x86/fpu/xstate: Expand dynamic user state area on first use
  2020-10-01 20:39 ` [RFC PATCH 13/22] x86/fpu/xstate: Expand dynamic user state area on first use Chang S. Bae
@ 2020-10-01 23:39   ` Andy Lutomirski
  2020-10-13 22:31     ` Brown, Len
  0 siblings, 1 reply; 43+ messages in thread
From: Andy Lutomirski @ 2020-10-01 23:39 UTC (permalink / raw)
  To: Chang S. Bae
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Andrew Lutomirski,
	X86 ML, Len Brown, Dave Hansen, jing2.liu, Ravi V. Shankar, LKML

On Thu, Oct 1, 2020 at 1:43 PM Chang S. Bae <chang.seok.bae@intel.com> wrote:
>
> Intel's Extended Feature Disable (XFD) feature is an extension of the XSAVE
> architecture. XFD allows the kernel to enable a feature state in XCR0 and
> to receive a #NM trap when a task uses instructions accessing that state.
> In this way, Linux can allocate the large task->fpu buffer only for tasks
> that use it.
>
> XFD introduces two MSRs: IA32_XFD to enable/disable the feature and
> IA32_XFD_ERR to assist the #NM trap handler. Both use the same
> state-component bitmap format, used by XCR0.
>
> Use this hardware capability to find the right time to expand xstate area.
> Introduce two sets of helper functions for that:
>
> 1. The first set is primarily for interacting with the XFD hardware
>    feature. Helpers for configuring disablement, e.g. in context switching,
>    are:
>         xdisable_setbits()
>         xdisable_getbits()
>         xdisable_switch()
>
> 2. The second set is for managing the first-use status and handling #NM
>    trap:
>         xfirstuse_enabled()
>         xfirstuse_not_detected()
>         xfirstuse_event_handler()
>
> The #NM handler induces the xstate area expansion to save the first-used
> states.
>
> No functional change until the kernel enables dynamic user states and XFD.
>
> Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
> Reviewed-by: Len Brown <len.brown@intel.com>
> Cc: x86@kernel.org
> Cc: linux-kernel@vger.kernel.org
> ---
>  arch/x86/include/asm/cpufeatures.h  |  1 +
>  arch/x86/include/asm/fpu/internal.h | 53 ++++++++++++++++++++++++++++-
>  arch/x86/include/asm/msr-index.h    |  2 ++
>  arch/x86/kernel/fpu/core.c          | 37 ++++++++++++++++++++
>  arch/x86/kernel/fpu/xstate.c        | 34 ++++++++++++++++--
>  arch/x86/kernel/process.c           |  5 +++
>  arch/x86/kernel/process_32.c        |  2 +-
>  arch/x86/kernel/process_64.c        |  2 +-
>  arch/x86/kernel/traps.c             |  3 ++
>  9 files changed, 133 insertions(+), 6 deletions(-)
>
> diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
> index 2901d5df4366..7d7fe1d82966 100644
> --- a/arch/x86/include/asm/cpufeatures.h
> +++ b/arch/x86/include/asm/cpufeatures.h
> @@ -274,6 +274,7 @@
>  #define X86_FEATURE_XSAVEC             (10*32+ 1) /* XSAVEC instruction */
>  #define X86_FEATURE_XGETBV1            (10*32+ 2) /* XGETBV with ECX = 1 instruction */
>  #define X86_FEATURE_XSAVES             (10*32+ 3) /* XSAVES/XRSTORS instructions */
> +#define X86_FEATURE_XFD                        (10*32+ 4) /* eXtended Feature Disabling */
>
>  /*
>   * Extended auxiliary flags: Linux defined - for features scattered in various
> diff --git a/arch/x86/include/asm/fpu/internal.h b/arch/x86/include/asm/fpu/internal.h
> index 3b03ead87a46..f5dbbaa060fb 100644
> --- a/arch/x86/include/asm/fpu/internal.h
> +++ b/arch/x86/include/asm/fpu/internal.h
> @@ -572,11 +572,60 @@ static inline void switch_fpu_prepare(struct fpu *old_fpu, int cpu)
>   * Misc helper functions:
>   */
>
> +/* The first-use detection helpers: */
> +
> +static inline void xdisable_setbits(u64 value)
> +{
> +       wrmsrl_safe(MSR_IA32_XFD, value);
> +}
> +
> +static inline u64 xdisable_getbits(void)
> +{
> +       u64 value;
> +
> +       rdmsrl_safe(MSR_IA32_XFD, &value);
> +       return value;
> +}
> +
> +static inline u64 xfirstuse_enabled(void)
> +{
> +       /* All the dynamic user components are first-use enabled. */
> +       return xfeatures_mask_user_dynamic;
> +}
> +
> +/*
> + * Convert fpu->firstuse_bv to xdisable configuration in MSR IA32_XFD.
> + * xdisable_setbits() only uses this.
> + */
> +static inline u64 xfirstuse_not_detected(struct fpu *fpu)
> +{
> +       u64 firstuse_bv = (fpu->state_mask & xfirstuse_enabled());
> +
> +       /*
> +        * If first-use is not detected, set the bit. If the detection is
> +        * not enabled, the bit is always zero in firstuse_bv. So, make
> +        * following conversion:
> +        */
> +       return  (xfirstuse_enabled() ^ firstuse_bv);
> +}
> +
> +/* Update MSR IA32_XFD based on fpu->firstuse_bv */
> +static inline void xdisable_switch(struct fpu *prev, struct fpu *next)
> +{
> +       if (!static_cpu_has(X86_FEATURE_XFD) || !xfirstuse_enabled())
> +               return;
> +
> +       if (unlikely(prev->state_mask != next->state_mask))
> +               xdisable_setbits(xfirstuse_not_detected(next));
> +}
> +
> +bool xfirstuse_event_handler(struct fpu *fpu);
> +
>  /*
>   * Load PKRU from the FPU context if available. Delay loading of the
>   * complete FPU state until the return to userland.
>   */
> -static inline void switch_fpu_finish(struct fpu *new_fpu)
> +static inline void switch_fpu_finish(struct fpu *old_fpu, struct fpu *new_fpu)
>  {
>         u32 pkru_val = init_pkru_value;
>         struct pkru_state *pk;
> @@ -586,6 +635,8 @@ static inline void switch_fpu_finish(struct fpu *new_fpu)
>
>         set_thread_flag(TIF_NEED_FPU_LOAD);
>
> +       xdisable_switch(old_fpu, new_fpu);
> +
>         if (!cpu_feature_enabled(X86_FEATURE_OSPKE))
>                 return;
>
> diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
> index 2859ee4f39a8..0ccbe8cc99ad 100644
> --- a/arch/x86/include/asm/msr-index.h
> +++ b/arch/x86/include/asm/msr-index.h
> @@ -610,6 +610,8 @@
>  #define MSR_IA32_BNDCFGS_RSVD          0x00000ffc
>
>  #define MSR_IA32_XSS                   0x00000da0
> +#define MSR_IA32_XFD                   0x000001c4
> +#define MSR_IA32_XFD_ERR               0x000001c5
>
>  #define MSR_IA32_APICBASE              0x0000001b
>  #define MSR_IA32_APICBASE_BSP          (1<<8)
> diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
> index ece6428ba85b..2e07bfcd54b3 100644
> --- a/arch/x86/kernel/fpu/core.c
> +++ b/arch/x86/kernel/fpu/core.c
> @@ -518,3 +518,40 @@ int fpu__exception_code(struct fpu *fpu, int trap_nr)
>          */
>         return 0;
>  }
> +
> +bool xfirstuse_event_handler(struct fpu *fpu)
> +{
> +       bool handled = false;
> +       u64 event_mask;
> +
> +       /* Check whether the first-use detection is running. */
> +       if (!static_cpu_has(X86_FEATURE_XFD) || !xfirstuse_enabled())
> +               return handled;
> +
> +       rdmsrl_safe(MSR_IA32_XFD_ERR, &event_mask);

NAK.

MSR_IA32_XFD_ERR needs to be wired up in the exception handler, not in
some helper called farther down the stack

But this raises an interesting point -- what happens if allocation
fails?  I think that, from kernel code, we simply cannot support this
exception mechanism.  If kernel code wants to use AMX (and that would
be very strange indeed), it should call x86_i_am_crazy_amx_begin() and
handle errors, not rely on exceptions.  From user code, I assume we
send a signal if allocation fails.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC PATCH 07/22] x86/fpu/xstate: Introduce helpers to manage an xstate area dynamically
  2020-10-01 20:38 ` [RFC PATCH 07/22] x86/fpu/xstate: Introduce helpers to manage an xstate area dynamically Chang S. Bae
@ 2020-10-01 23:41   ` Andy Lutomirski
  2020-10-13 22:00     ` Brown, Len
  0 siblings, 1 reply; 43+ messages in thread
From: Andy Lutomirski @ 2020-10-01 23:41 UTC (permalink / raw)
  To: Chang S. Bae
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Andrew Lutomirski,
	X86 ML, Len Brown, Dave Hansen, jing2.liu, Ravi V. Shankar, LKML

On Thu, Oct 1, 2020 at 1:42 PM Chang S. Bae <chang.seok.bae@intel.com> wrote:
>
> task->fpu has a buffer to keep the extended register states, but it is not
> expandable at runtime. Introduce runtime methods and new fpu struct fields
> to support the expansion.
>
> fpu->state_mask indicates the saved states per task and fpu->state_ptr
> points to the dynamically allocated area.
>
> alloc_xstate_area() uses vmalloc() for its scalability. However, set a
> threshold (64KB) to watch for a potential need for an alternative
> mechanism.
>
> Also, introduce a new helper -- get_xstate_size() to calculate the area
> size.
>
> No functional change until the kernel supports dynamic user states.
>
> Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com>
> Reviewed-by: Len Brown <len.brown@intel.com>
> Cc: x86@kernel.org
> Cc: linux-kernel@vger.kernel.org
> ---
>  arch/x86/include/asm/fpu/types.h  |  29 +++++--
>  arch/x86/include/asm/fpu/xstate.h |   3 +
>  arch/x86/kernel/fpu/core.c        |   3 +
>  arch/x86/kernel/fpu/xstate.c      | 124 ++++++++++++++++++++++++++++++
>  4 files changed, 154 insertions(+), 5 deletions(-)
>
> diff --git a/arch/x86/include/asm/fpu/types.h b/arch/x86/include/asm/fpu/types.h
> index c87364ea6446..4b7756644824 100644
> --- a/arch/x86/include/asm/fpu/types.h
> +++ b/arch/x86/include/asm/fpu/types.h
> @@ -327,14 +327,33 @@ struct fpu {
>          */
>         unsigned long                   avx512_timestamp;
>
> +       /*
> +        * @state_mask:
> +        *
> +        * The state component bitmap. It indicates the saved xstate in
> +        * either @state or @state_ptr. The mask describes @state at first,
> +        * and describes @state_ptr once that area is in use.
> +        */
> +       u64                             state_mask;
> +
> +       /*
> +        * @state_ptr:
> +        *
> +        * Copy of all extended register states, in a dynamically-allocated
> +        * area, that we save and restore over context switches. While a task
> +        * uses extended features, the registers hold the most current state.
> +        * When the task context-switches away, they get saved here. This
> +        * copy is always more recent than @state.
> +        */
> +       union fpregs_state              *state_ptr;
> +
>         /*
>          * @state:
>          *
> -        * In-memory copy of all FPU registers that we save/restore
> -        * over context switches. If the task is using the FPU then
> -        * the registers in the FPU are more recent than this state
> -        * copy. If the task context-switches away then they get
> -        * saved here and represent the FPU state.
> +        * Copy of some extended register state that we save and restore
> +        * over context switches. If a task uses the dynamically-allocated
> +        * area, @state_ptr, then that area holds a more recent state copy
> +        * than this one. The caveats described for @state_ptr apply here too.
>          */
>         union fpregs_state              state;
>         /*
> diff --git a/arch/x86/include/asm/fpu/xstate.h b/arch/x86/include/asm/fpu/xstate.h
> index 9aad91c0725b..37728bfcb71e 100644
> --- a/arch/x86/include/asm/fpu/xstate.h
> +++ b/arch/x86/include/asm/fpu/xstate.h
> @@ -103,6 +103,9 @@ extern void __init update_regset_xstate_info(unsigned int size,
>                                              u64 xstate_mask);
>
>  void *get_xsave_addr(struct fpu *fpu, int xfeature_nr);
> +int alloc_xstate_area(struct fpu *fpu, u64 mask, unsigned int *alloc_size);
> +void free_xstate_area(struct fpu *fpu);
> +
>  const void *get_xsave_field_ptr(int xfeature_nr);
>  int using_compacted_format(void);
>  int xfeature_size(int xfeature_nr);
> diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
> index 875620fdfe61..e25f7866800e 100644
> --- a/arch/x86/kernel/fpu/core.c
> +++ b/arch/x86/kernel/fpu/core.c
> @@ -235,6 +235,9 @@ int fpu__copy(struct task_struct *dst, struct task_struct *src)
>          */
>         memset(&dst_fpu->state.xsave, 0, fpu_kernel_xstate_default_size);
>
> +       dst_fpu->state_mask = xfeatures_mask_all & ~xfeatures_mask_user_dynamic;
> +       dst_fpu->state_ptr = NULL;
> +
>         /*
>          * If the FPU registers are not current just memcpy() the state.
>          * Otherwise save current FPU registers directly into the child's FPU
> diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
> index 6e0d8a9699ed..af60332aafef 100644
> --- a/arch/x86/kernel/fpu/xstate.c
> +++ b/arch/x86/kernel/fpu/xstate.c
> @@ -10,6 +10,7 @@
>  #include <linux/pkeys.h>
>  #include <linux/seq_file.h>
>  #include <linux/proc_fs.h>
> +#include <linux/vmalloc.h>
>
>  #include <asm/fpu/api.h>
>  #include <asm/fpu/internal.h>
> @@ -69,6 +70,7 @@ static unsigned int xstate_offsets[XFEATURE_MAX] = { [ 0 ... XFEATURE_MAX - 1] =
>  static unsigned int xstate_sizes[XFEATURE_MAX]   = { [ 0 ... XFEATURE_MAX - 1] = -1};
>  static unsigned int xstate_comp_offsets[XFEATURE_MAX] = { [ 0 ... XFEATURE_MAX - 1] = -1};
>  static unsigned int xstate_supervisor_only_offsets[XFEATURE_MAX] = { [ 0 ... XFEATURE_MAX - 1] = -1};
> +static bool xstate_aligns[XFEATURE_MAX] = { [ 0 ... XFEATURE_MAX - 1] = false};
>
>  /*
>   * The XSAVE area of kernel can be in standard or compacted format;
> @@ -128,6 +130,48 @@ static bool xfeature_is_supervisor(int xfeature_nr)
>         return ecx & 1;
>  }
>
> +/*
> + * Available once the offset, size, and alignment info arrays have been
> + * set up by setup_xstate_features().
> + */
> +static unsigned int get_xstate_size(u64 mask)
> +{
> +       unsigned int size;
> +       u64 xmask;
> +       int i, nr;
> +
> +       if (!mask)
> +               return 0;
> +       else if (mask == (xfeatures_mask_all & ~xfeatures_mask_user_dynamic))
> +               return fpu_kernel_xstate_default_size;
> +       else if (mask == xfeatures_mask_all)
> +               return fpu_kernel_xstate_max_size;
> +
> +       nr = fls64(mask) - 1;
> +
> +       if (!using_compacted_format())
> +               return xstate_offsets[nr] + xstate_sizes[nr];
> +
> +       xmask = BIT_ULL(nr + 1) - 1;
> +
> +       if (mask == (xmask & xfeatures_mask_all))
> +               return xstate_comp_offsets[nr] + xstate_sizes[nr];
> +
> +       /*
> +        * No precomputed size matches the given mask, so calculate the size
> +        * by summing up each included state component.
> +        */
> +       for (size = FXSAVE_SIZE + XSAVE_HDR_SIZE, i = FIRST_EXTENDED_XFEATURE; i <= nr; i++) {
> +               if (!(mask & BIT_ULL(i)))
> +                       continue;
> +
> +               if (xstate_aligns[i])
> +                       size = ALIGN(size, 64);
> +               size += xstate_sizes[i];
> +       }
> +       return size;
> +}
> +
>  /*
>   * When executing XSAVEOPT (or other optimized XSAVE instructions), if
>   * a processor implementation detects that an FPU state component is still
> @@ -268,10 +312,12 @@ static void __init setup_xstate_features(void)
>         xstate_offsets[XFEATURE_FP]     = 0;
>         xstate_sizes[XFEATURE_FP]       = offsetof(struct fxregs_state,
>                                                    xmm_space);
> +       xstate_aligns[XFEATURE_FP]      = true;
>
>         xstate_offsets[XFEATURE_SSE]    = xstate_sizes[XFEATURE_FP];
>         xstate_sizes[XFEATURE_SSE]      = sizeof_field(struct fxregs_state,
>                                                        xmm_space);
> +       xstate_aligns[XFEATURE_SSE]     = true;
>
>         for (i = FIRST_EXTENDED_XFEATURE; i < XFEATURE_MAX; i++) {
>                 if (!xfeature_enabled(i))
> @@ -289,6 +335,7 @@ static void __init setup_xstate_features(void)
>                         continue;
>
>                 xstate_offsets[i] = ebx;
> +               xstate_aligns[i] = (ecx & 2) ? true : false;
>
>                 /*
>                  * In our xstate size checks, we assume that the highest-numbered
> @@ -753,6 +800,9 @@ static bool is_supported_xstate_size(unsigned int test_xstate_size)
>         return false;
>  }
>
> +/* Warn when the dynamically allocated xstate area exceeds this threshold */
> +#define XSTATE_AREA_MAX_BYTES          (64 * 1024)
> +
>  static int __init init_xstate_size(void)
>  {
>         /* Recompute the context size for enabled features: */
> @@ -777,6 +827,14 @@ static int __init init_xstate_size(void)
>         if (!is_supported_xstate_size(fpu_kernel_xstate_default_size))
>                 return -EINVAL;
>
> +       /*
> +        * If the allocation grows beyond this threshold, it is a hint to consider
> +        * switching to a more scalable mechanism than the current one.
> +        */
> +       if (fpu_kernel_xstate_max_size > XSTATE_AREA_MAX_BYTES)
> +               pr_warn("x86/fpu: xstate buffer too large (%u > %u)\n",
> +                       fpu_kernel_xstate_max_size, XSTATE_AREA_MAX_BYTES);
> +
>         /*
>          * User space is always in standard format.
>          */
> @@ -867,6 +925,12 @@ void __init fpu__init_system_xstate(void)
>         if (err)
>                 goto out_disable;
>
> +       /*
> +        * At this point, the size sanity checks have passed. Have the init_task
> +        * take the feature mask.
> +        */
> +       current->thread.fpu.state_mask = (xfeatures_mask_all & ~xfeatures_mask_user_dynamic);
> +
>         /*
>          * Update info used for ptrace frames; use standard-format size and no
>          * supervisor xstates:
> @@ -1086,6 +1150,66 @@ static inline bool xfeatures_mxcsr_quirk(u64 xfeatures)
>         return true;
>  }
>
> +void free_xstate_area(struct fpu *fpu)
> +{
> +       vfree(fpu->state_ptr);
> +}
> +
> +/*
> + * Allocate a new xstate area, sized according to the given feature bitmask.
> + *
> + * No mechanism is implemented yet to shrink or reclaim the xstate area on
> + * the fly; whether one is needed depends on real-world usage.
> + */
> +int alloc_xstate_area(struct fpu *fpu, u64 mask, unsigned int *alloc_size)
> +{
> +       union fpregs_state *state_ptr;
> +       unsigned int oldsz, newsz;
> +       u64 state_mask;
> +
> +       state_mask = fpu->state_mask | mask;
> +
> +       oldsz = get_xstate_size(fpu->state_mask);
> +       newsz = get_xstate_size(state_mask);
> +
> +       if (oldsz >= newsz)
> +               return 0;
> +
> +       if (newsz > fpu_kernel_xstate_max_size) {
> +               pr_warn_once("x86/fpu: state buffer too large (%u > %u bytes)\n",
> +                            newsz, fpu_kernel_xstate_max_size);
> +               XSTATE_WARN_ON(1);
> +               return 0;
> +       }
> +
> +       /*
> +        * The caller may be in an interrupt-disabled condition. Ensure interrupts
> +        * are enabled before the memory allocation, which may involve page faults.
> +        */
> +       local_irq_enable();
> +       /* We need a 64B-aligned pointer; vmalloc()'s page-aligned address satisfies that */
> +       state_ptr = vmalloc(newsz);

I'm speechless.

First, you can't just enable IRQs here.  If IRQs are off, they're off
for a reason.  Secondly, if they're *on*, you just forgot that fact.

And allocating this state from vmalloc() seems questionable.  Why are
you doing this?

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC PATCH 22/22] x86/fpu/xstate: Introduce boot-parameters for control some state component support
  2020-10-01 20:39 ` [RFC PATCH 22/22] x86/fpu/xstate: Introduce boot-parameters for control some state component support Chang S. Bae
@ 2020-10-02  2:09   ` Randy Dunlap
  2020-10-13 23:00     ` Brown, Len
  2020-10-02 17:15   ` Andy Lutomirski
  2020-10-02 17:25   ` Dave Hansen
  2 siblings, 1 reply; 43+ messages in thread
From: Randy Dunlap @ 2020-10-02  2:09 UTC (permalink / raw)
  To: Chang S. Bae, tglx, mingo, bp, luto, x86
  Cc: len.brown, dave.hansen, jing2.liu, ravi.v.shankar, linux-kernel,
	linux-doc

Hi--

On 10/1/20 1:39 PM, Chang S. Bae wrote:
> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> index a1068742a6df..742167c6f789 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -5838,6 +5838,21 @@
>  			which allow the hypervisor to 'idle' the guest on lock
>  			contention.
>  
> +	xstate.enable=	[X86-64]
> +	xstate.disable=	[X86-64]
> +			The kernel is compiled with a default xstate bitmask --
> +			enabling it to use the XSAVE hardware to efficiently
> +			save and restore thread states on context switch.
> +			xstate.enable allows adding to that default mask at
> +			boot-time without recompiling the kernel just to support
> +			the new thread state. (Note that the kernel will ignore
> +			any bits in the mask that do not correspond to features
> +			that are actually available in CPUID)  xstate.disable

			                             in CPUID.)

> +			allows clearing bits in the default mask, forcing the
> +			kernel to forget that it supports the specified thread
> +			state. When a bit set for both, the kernel takes
> +			xstate.disable in a priority.

			               as a priority.


What do these bitmasks look like?  what do the bits mean?
Where does a user find this info?


thanks.
-- 
~Randy


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC PATCH 22/22] x86/fpu/xstate: Introduce boot-parameters for control some state component support
  2020-10-01 20:39 ` [RFC PATCH 22/22] x86/fpu/xstate: Introduce boot-parameters for control some state component support Chang S. Bae
  2020-10-02  2:09   ` Randy Dunlap
@ 2020-10-02 17:15   ` Andy Lutomirski
  2020-10-02 17:26     ` Dave Hansen
  2020-10-13 23:38     ` Brown, Len
  2020-10-02 17:25   ` Dave Hansen
  2 siblings, 2 replies; 43+ messages in thread
From: Andy Lutomirski @ 2020-10-02 17:15 UTC (permalink / raw)
  To: Chang S. Bae
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, Andrew Lutomirski,
	X86 ML, Len Brown, Dave Hansen, jing2.liu, Ravi V. Shankar, LKML,
	open list:DOCUMENTATION

On Thu, Oct 1, 2020 at 1:43 PM Chang S. Bae <chang.seok.bae@intel.com> wrote:
>
> "xstate.disable=0x6000" will disable AMX on a system that has AMX compiled
> into XFEATURE_MASK_USER_SUPPORTED.

Can we please use words for this?  Perhaps:

xstate.disable=amx,zmm

and maybe add a list in /proc/cpuinfo or somewhere in /proc or /sys
that shows all the currently enabled xsave states.

>
> "xstate.enable=0x6000" will enable AMX on a system that does NOT have AMX
> compiled into XFEATURE_MASK_USER_SUPPORTED (assuming the kernel is new
> enough to support this feature).

This sounds like it will be quite confusing to anyone reading the
kernel code to discover that a feature that is not "SUPPORTED" is
nonetheless enabled.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC PATCH 22/22] x86/fpu/xstate: Introduce boot-parameters for control some state component support
  2020-10-01 20:39 ` [RFC PATCH 22/22] x86/fpu/xstate: Introduce boot-parameters for control some state component support Chang S. Bae
  2020-10-02  2:09   ` Randy Dunlap
  2020-10-02 17:15   ` Andy Lutomirski
@ 2020-10-02 17:25   ` Dave Hansen
  2 siblings, 0 replies; 43+ messages in thread
From: Dave Hansen @ 2020-10-02 17:25 UTC (permalink / raw)
  To: Chang S. Bae, tglx, mingo, bp, luto, x86
  Cc: len.brown, jing2.liu, ravi.v.shankar, linux-kernel, linux-doc

On 10/1/20 1:39 PM, Chang S. Bae wrote:
> +		if ((custom & XFEATURE_MASK_XTILE) != XFEATURE_MASK_XTILE) {
> +			pr_warn("x86/fpu: Disable 0x%x components due to incorrect setup\n",
> +				XFEATURE_MASK_XTILE);
> +			custom &= ~(XFEATURE_MASK_XTILE);
> +		}

Saying "incorrect setup" is pretty much just wasting the bytes.  We
might as well just say "disabling due to random error", or "disabling
due to the easter bunny".  Each are equally actionable.  How about:

"error in xstate.disable parameter.  Additionally disabling '%s'".

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC PATCH 22/22] x86/fpu/xstate: Introduce boot-parameters for control some state component support
  2020-10-02 17:15   ` Andy Lutomirski
@ 2020-10-02 17:26     ` Dave Hansen
  2020-10-13 23:38     ` Brown, Len
  1 sibling, 0 replies; 43+ messages in thread
From: Dave Hansen @ 2020-10-02 17:26 UTC (permalink / raw)
  To: Andy Lutomirski, Chang S. Bae
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, X86 ML, Len Brown,
	jing2.liu, Ravi V. Shankar, LKML, open list:DOCUMENTATION

On 10/2/20 10:15 AM, Andy Lutomirski wrote:
>> "xstate.enable=0x6000" will enable AMX on a system that does NOT have AMX
>> compiled into XFEATURE_MASK_USER_SUPPORTED (assuming the kernel is new
>> enough to support this feature).
> This sounds like it will be quite confusing to anyone reading the
> kernel code to discover that a feature that is not "SUPPORTED" is
> nonetheless enabled.
Yeah, if we do this, XFEATURE_MASK_USER_SUPPORTED needs a name change
for sure.


^ permalink raw reply	[flat|nested] 43+ messages in thread

* RE: [RFC PATCH 07/22] x86/fpu/xstate: Introduce helpers to manage an xstate area dynamically
  2020-10-01 23:41   ` Andy Lutomirski
@ 2020-10-13 22:00     ` Brown, Len
  0 siblings, 0 replies; 43+ messages in thread
From: Brown, Len @ 2020-10-13 22:00 UTC (permalink / raw)
  To: Andy Lutomirski, Bae, Chang Seok
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, X86 ML, Hansen,
	Dave, Liu, Jing2, Shankar, Ravi V, LKML



>
> From: Andy Lutomirski <luto@kernel.org> 
>
> > +       /*
> > +        * The caller may be in an interrupt-disabled condition. Ensure interrupts
> > +        * are enabled before the memory allocation, which may involve page faults.
> > +        */
> > +       local_irq_enable();

> ... you can't just enable IRQs here.  If IRQs are off, they're off for a reason.  Secondly, if they're *on*, you just forgot that fact.

Good catch.  This is a trap handler from user-space and should never be called with irqs disabled,
so the local_irq_enable() should be dead code.  Chang suggested that he erroneously left it in
from a previous implementation.

> > +       /* We need a 64B-aligned pointer; vmalloc()'s page-aligned address satisfies that */
> > +       state_ptr = vmalloc(newsz);

> And allocating this state from vmalloc() seems questionable.  Why are you doing this?

While the buffer needs to be virtually contiguous, it doesn't need to be physically contiguous,
and so vmalloc() has less overhead than kmalloc() for this.
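
A rough sketch of that allocation path with the dead local_irq_enable()
dropped (the explicit failure check below is an addition, not in the
posted patch):

	/* 64B alignment is needed; vmalloc()'s page alignment satisfies it */
	state_ptr = vmalloc(newsz);
	if (!state_ptr)
		return -ENOMEM;	/* caller must cope, e.g. by signaling the task */

	memset(state_ptr, 0, newsz);
	/* copy the old, smaller buffer over and publish the new pointer */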

Thanks,
-Len


^ permalink raw reply	[flat|nested] 43+ messages in thread

* RE: [RFC PATCH 13/22] x86/fpu/xstate: Expand dynamic user state area on first use
  2020-10-01 23:39   ` Andy Lutomirski
@ 2020-10-13 22:31     ` Brown, Len
  2020-10-13 22:43       ` Dave Hansen
  2020-10-14  1:47       ` Andy Lutomirski
  0 siblings, 2 replies; 43+ messages in thread
From: Brown, Len @ 2020-10-13 22:31 UTC (permalink / raw)
  To: Andy Lutomirski, Bae, Chang Seok
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, X86 ML, Hansen,
	Dave, Liu, Jing2, Shankar, Ravi V, LKML

> From: Andy Lutomirski <luto@kernel.org> 

> > diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
> > @@ -518,3 +518,40 @@ int fpu__exception_code(struct fpu *fpu, int trap_nr)
..
> > +bool xfirstuse_event_handler(struct fpu *fpu)
> > +{
> > +       bool handled = false;
> > +       u64 event_mask;
> > +
> > +       /* Check whether the first-use detection is running. */
> > +       if (!static_cpu_has(X86_FEATURE_XFD) || !xfirstuse_enabled())
> > +               return handled;
> > +

> MSR_IA32_XFD_ERR needs to be wired up in the exception handler, not in
> some helper called farther down the stack

xfirstuse_event_handler() is called directly from the IDTENTRY exc_device_not_available():

> > @@ -1028,6 +1028,9 @@ DEFINE_IDTENTRY(exc_device_not_available)
> >  {
> >  	unsigned long cr0 = read_cr0();
> >  
> > +	if (xfirstuse_event_handler(&current->thread.fpu))
> > +		return;

Are you suggesting we should instead open code it inside that routine?

> But this raises an interesting point -- what happens if allocation
> fails?  I think that, from kernel code, we simply cannot support this
> exception mechanism.  If kernel code wants to use AMX (and that would
> be very strange indeed), it should call x86_i_am_crazy_amx_begin() and
> handle errors, not rely on exceptions.  From user code, I assume we
> send a signal if allocation fails.

The XFD feature allows us to transparently expand the kernel context switch buffer
for a user task, when that task first touches the associated hardware.
It allows applications to operate as if the kernel had allocated the backing AMX
context switch buffer at initialization time.  However, since we expect only
a subset of tasks to actually use AMX, we instead defer allocation until
we actually see first use for that task, rather than allocating for all tasks.

While we currently don't propose AMX use inside the kernel, it is conceivable
that it could be done in the future, just like AVX is used by the RAID code;
and it would be done the same way -- kernel_fpu_begin()/kernel_fpu_end().
Such future Kernel AMX use would _not_ arm XFD, and would _not_ provoke this fault.
(note that we context switch the XFD-armed state per task)

vmalloc() does not fail, and does not return an error, and so there is no concept
of returning a signal.  If we got to the point where vmalloc() sleeps, then the system
has bigger OOM issues, and the OOM killer would be on the prowl.

If we were concerned about using vmalloc for a couple of pages in the task structure,
then we could implement a routine to harvest unused buffers and free them --
but that didn't seem worth the complexity.  Note that this feature is 64-bit only.

Thanks,
-Len



^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC PATCH 13/22] x86/fpu/xstate: Expand dynamic user state area on first use
  2020-10-13 22:31     ` Brown, Len
@ 2020-10-13 22:43       ` Dave Hansen
  2020-10-14  1:11         ` Andy Lutomirski
  2020-10-14 10:41         ` Peter Zijlstra
  2020-10-14  1:47       ` Andy Lutomirski
  1 sibling, 2 replies; 43+ messages in thread
From: Dave Hansen @ 2020-10-13 22:43 UTC (permalink / raw)
  To: Brown, Len, Andy Lutomirski, Bae, Chang Seok
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, X86 ML, Liu,
	Jing2, Shankar, Ravi V, LKML

On 10/13/20 3:31 PM, Brown, Len wrote:
> vmalloc() does not fail, and does not return an error, and so there is no concept
> of returning a signal.

Well, the order-0 allocations are no-fail, as are the vmalloc kernel
structures and the page tables that might have to be allocated.  But,
that's not guaranteed to be in place *forever*.  I think we still need
to check for and handle allocation failures, even if they're not known
to be possible today.

> If we got to the point where vmalloc() sleeps, then the system
> has bigger OOM issues, and the OOM killer would be on the prowl.

vmalloc() can *certainly* sleep.  Allocation failures mean returning
NULL from the allocator, and the very way we avoid doing that is by
sleeping to go reclaim some memory from some other allocation.

Sleeping is a normal and healthy part of handling allocation requests,
including vmalloc().

> If we were concerned about using vmalloc for a couple of pages in the task structure,
> then we could implement a routine to harvest unused buffers and free them --
> but that didn't seem worth the complexity.  Note that this feature is 64-bit only.

IMNHO, vmalloc() is overkill for ~10k, which is roughly the size of the
XSAVE buffer for the first AMX implementation.  But, it's not overkill
for the ~66k of space that will be needed if some CPU implementation
comes along and uses all of the architectural space AMX provides.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* RE: [RFC PATCH 22/22] x86/fpu/xstate: Introduce boot-parameters for control some state component support
  2020-10-02  2:09   ` Randy Dunlap
@ 2020-10-13 23:00     ` Brown, Len
  0 siblings, 0 replies; 43+ messages in thread
From: Brown, Len @ 2020-10-13 23:00 UTC (permalink / raw)
  To: Randy Dunlap, Bae, Chang Seok, tglx, mingo, bp, luto, x86
  Cc: Hansen, Dave, Liu, Jing2, Shankar, Ravi V, linux-kernel, linux-doc


> From: Randy Dunlap <rdunlap@infradead.org> 

> What do these bitmasks look like?  what do the bits mean?
> Where does a user find this info?

The XSAVE state component bitmaps are detailed in
the Intel Software Developer's Manual, volume 1, Chapter 13:
"Managing State using the XSAVE Feature Set".

http://intel.com/sdm

In the kernel source, they are enumerated in xstate.c
and you can observe them in dmesg:

[    0.000000] x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'
[    0.000000] x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers'
[    0.000000] x86/fpu: Supporting XSAVE feature 0x004: 'AVX registers'
[    0.000000] x86/fpu: Supporting XSAVE feature 0x008: 'MPX bounds registers'
[    0.000000] x86/fpu: Supporting XSAVE feature 0x010: 'MPX CSR'

Thanks,
-Len


^ permalink raw reply	[flat|nested] 43+ messages in thread

* RE: [RFC PATCH 22/22] x86/fpu/xstate: Introduce boot-parameters for control some state component support
  2020-10-02 17:15   ` Andy Lutomirski
  2020-10-02 17:26     ` Dave Hansen
@ 2020-10-13 23:38     ` Brown, Len
  1 sibling, 0 replies; 43+ messages in thread
From: Brown, Len @ 2020-10-13 23:38 UTC (permalink / raw)
  To: Andy Lutomirski, Bae, Chang Seok
  Cc: Thomas Gleixner, Ingo Molnar, Borislav Petkov, X86 ML, Hansen,
	Dave, Liu, Jing2, Shankar, Ravi V, LKML, open list:DOCUMENTATION

> From: Andy Lutomirski <luto@kernel.org> 

> On Thu, Oct 1, 2020 at 1:43 PM Chang S. Bae <chang.seok.bae@intel.com> wrote:

> > "xstate.disable=0x6000" will disable AMX on a system that has AMX 
> > compiled into XFEATURE_MASK_USER_SUPPORTED.

> Can we please use words for this?  Perhaps:

> xstate.disable=amx,zmm

Yes, I think it is reasonable to add support for keywords for the features that
are both supported by the hardware and known by the kernel.

However, we need to continue to support the numerical state-component bitmap format.
Otherwise, it would not be possible to coerce a legacy kernel (e.g. a distro kernel)
to enable a feature on a new machine until it has been updated to know that keyword.

> and maybe add a list in /proc/cpuinfo or somewhere in /proc or /sys that shows all the currently enabled xsave states.

Agreed, if we invent keywords, the list of supported+known keywords should be available on the system.

I do not advocate /proc/cpuinfo -- which is already an out of control parsing mess.

We could add the keywords to dmesg, since we already print the supported XSAVE BV:

[    0.000000] x86/fpu: Supporting XSAVE feature 0x001: 'x87 floating point registers'
[    0.000000] x86/fpu: Supporting XSAVE feature 0x002: 'SSE registers'
[    0.000000] x86/fpu: Supporting XSAVE feature 0x004: 'AVX registers'
[    0.000000] x86/fpu: Supporting XSAVE feature 0x008: 'MPX bounds registers'
[    0.000000] x86/fpu: Supporting XSAVE feature 0x010: 'MPX CSR'

Or maybe a sysfs attribute or a modparam that simply lists them all.

We wouldn't be able to dynamically _write_ to that attribute, since the cmdline is boot-time only.

> > "xstate.enable=0x6000" will enable AMX on a system that does NOT have 
> > AMX compiled into XFEATURE_MASK_USER_SUPPORTED (assuming the kernel is 
> > new enough to support this feature).
>
> This sounds like it will be quite confusing to anyone reading the kernel code to discover that a feature that is not "SUPPORTED" is nonetheless enabled.

Right now, this cmdline will only allow a new kernel to enable/disable kernel support for AMX
on hardware that also supports AMX.  But we hope to re-use this XSTATE code -- unchanged --
for future features that require just this simple state management support from the kernel.

We anticipate a future hardware enumeration mechanism to identify such qualified features
to assist the kernel in deciding whether to support a feature or not, by default.
The kernel can use the combination of its build-time config with advice from this boot-time
enumeration to decide if it wants to enable a particular feature or not.
And in the end, a user is empowered to override that default using this cmdline.

Thanks,
-Len


^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC PATCH 13/22] x86/fpu/xstate: Expand dynamic user state area on first use
  2020-10-13 22:43       ` Dave Hansen
@ 2020-10-14  1:11         ` Andy Lutomirski
  2020-10-14  6:03           ` Dave Hansen
  2020-10-14 10:41         ` Peter Zijlstra
  1 sibling, 1 reply; 43+ messages in thread
From: Andy Lutomirski @ 2020-10-14  1:11 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Brown, Len, Andy Lutomirski, Bae, Chang Seok, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, X86 ML, Liu, Jing2, Shankar,
	Ravi V, LKML


> On Oct 13, 2020, at 3:44 PM, Dave Hansen <dave.hansen@intel.com> wrote:
> 
> On 10/13/20 3:31 PM, Brown, Len wrote:
>> vmalloc() does not fail, and does not return an error, and so there is no concept
>> of returning a signal.
> 
> Well, the order-0 allocations are no-fail, as are the vmalloc kernel
> structures and the page tables that might have to be allocated.  But,
> that's not guaranteed to be in place *forever*.  I think we still need
> to check for and handle allocation failures, even if they're not known
> to be possible today.
> 
>> If we got to the point where vmalloc() sleeps, then the system
>> has bigger OOM issues, and the OOM killer would be on the prowl.
> 
> vmalloc() can *certainly* sleep.  Allocation failures mean returning
> NULL from the allocator, and the very way we avoid doing that is by
> sleeping to go reclaim some memory from some other allocation.
> 
> Sleeping is a normal and healthy part of handling allocation requests,
> including vmalloc().
> 
>> If we were concerned about using vmalloc for a couple of pages in the task structure,
>> Then we could implement a routine to harvest unused buffers and free them --
>> but that didn't seem worth the complexity.  Note that this feature is 64-bit only.
> 
> IMNHO, vmalloc() is overkill for ~10k, which is roughly the size of the
> XSAVE buffer for the first AMX implementation.  But, it's not overkill
> for the ~66k of space that will be needed if some CPU implementation
> comes along and uses all of the architectural space AMX provides.

I have no problem with vmalloc(), but I do have a problem with vfree() due to the IPIs that result. We need a cache or something.

I have to say: this mechanism is awful. Can we get away with skipping the dynamic XSAVES mess entirely?  What if we instead allocate however much space we need as an array of pages and have one percpu contiguous region. To save, we XSAVE(S or C) just the AMX state to the percpu area and then copy it.  To restore, we do the inverse.  Or would this kill the modified optimization and thus be horrible?

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC PATCH 13/22] x86/fpu/xstate: Expand dynamic user state area on first use
  2020-10-13 22:31     ` Brown, Len
  2020-10-13 22:43       ` Dave Hansen
@ 2020-10-14  1:47       ` Andy Lutomirski
  1 sibling, 0 replies; 43+ messages in thread
From: Andy Lutomirski @ 2020-10-14  1:47 UTC (permalink / raw)
  To: Brown, Len
  Cc: Andy Lutomirski, Bae, Chang Seok, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, X86 ML, Hansen, Dave, Liu, Jing2, Shankar,
	Ravi V, LKML


> On Oct 13, 2020, at 3:31 PM, Brown, Len <len.brown@intel.com> wrote:
> 
> 
>> 
>> From: Andy Lutomirski <luto@kernel.org> 
> 
>>> diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
>>> @@ -518,3 +518,40 @@ int fpu__exception_code(struct fpu *fpu, int trap_nr)
> ..
>>> +bool xfirstuse_event_handler(struct fpu *fpu)
>>> +{
>>> +       bool handled = false;
>>> +       u64 event_mask;
>>> +
>>> +       /* Check whether the first-use detection is running. */
>>> +       if (!static_cpu_has(X86_FEATURE_XFD) || !xfirstuse_enabled())
>>> +               return handled;
>>> +
> 
>> MSR_IA32_XFD_ERR needs to be wired up in the exception handler, not in
>> some helper called farther down the stack
> 
> xfirstuse_event_handler() is called directly from the IDTENTRY exc_device_not_available():
> 
>>> @@ -1028,6 +1028,9 @@ DEFINE_IDTENTRY(exc_device_not_available)
>>> {
>>>    unsigned long cr0 = read_cr0();
>>> 
>>> +    if (xfirstuse_event_handler(&current->thread.fpu))
>>> +        return;
> 
> Are you suggesting we should instead open code it inside that routine?

MSR_IA32_XFD_ERR is like CR2 and DR6 — it’s functionally a part of the exception. Let’s handle it as such.  (And, a bit like DR6, it’s a bit broken.)
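
Roughly this shape, as a sketch only (the helper name is made up):

	DEFINE_IDTENTRY(exc_device_not_available)
	{
		u64 xfd_err;

		/* Grab and clear the error right away, like CR2/DR6 */
		rdmsrl(MSR_IA32_XFD_ERR, xfd_err);
		if (xfd_err) {
			wrmsrl(MSR_IA32_XFD_ERR, 0);
			handle_xfd_event(&current->thread.fpu, xfd_err);
			return;
		}
		/* ... existing #NM handling ... */
	}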

> 
>> But this raises an interesting point -- what happens if allocation
>> fails?  I think that, from kernel code, we simply cannot support this
>> exception mechanism.  If kernel code wants to use AMX (and that would
>> be very strange indeed), it should call x86_i_am_crazy_amx_begin() and
>> handle errors, not rely on exceptions.  From user code, I assume we
>> send a signal if allocation fails.
> 
> The XFD feature allows us to transparently expand the kernel context switch buffer
> for a user task, when that task first touches the associated hardware.
> It allows applications to operate as if the kernel had allocated the backing AMX
> context switch buffer at initialization time.  However, since we expect only
> a subset of tasks to actually use AMX, we instead defer allocation until
> we actually see first use for that task, rather than allocating for all tasks.

I sure hope that not all tasks use it. Context-switching it will be absurdly expensive.

> 
> While we currently don't propose AMX use inside the kernel, it is conceivable
> that it could be done in the future, just like AVX is used by the RAID code;
> and it would be done the same way -- kernel_fpu_begin()/kernel_fpu_end().
> Such future Kernel AMX use would _not_ arm XFD, and would _not_ provoke this fault.
> (note that we context switch the XFD-armed state per task)

How expensive is *that*?  Can you give approximate cycle counts for saving, restoring, arming and disarming?

This reminds me of TS.  Writing TS was more expensive than saving the whole FPU state.  And, for kernel code, we can’t just “not arm” the XFD — we would have to disarm it.
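
Concretely, a sketch of what the begin/end pair would have to do, using
the helpers from this series:

	/* kernel_fpu_begin(), before touching AMX state: */
	if (static_cpu_has(X86_FEATURE_XFD))
		xdisable_setbits(0);	/* disarm XFD: an MSR write */

	/* kernel_fpu_end(): */
	if (static_cpu_has(X86_FEATURE_XFD))
		xdisable_setbits(xfirstuse_not_detected(&current->thread.fpu));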



> 
> vmalloc() does not fail, and does not return an error, and so there is no concept
> of returning a signal.  If we got to the point where vmalloc() sleeps, then the system
> has bigger OOM issues, and the OOM killer would be on the prowl.
> 
> If we were concerned about using vmalloc for a couple of pages in the task structure,
> then we could implement a routine to harvest unused buffers and free them

Kind of like we vmalloc a couple pages for kernel stacks, and we carefully cache them.  And we handle failure gracefully.
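
A sketch of that kind of cache, mirroring the cached_stacks trick in
kernel/fork.c (illustrative only; assumes every cached buffer is
max-size):

	#define NR_CACHED_XBUFS 2
	static DEFINE_PER_CPU(void *, cached_xbufs[NR_CACHED_XBUFS]);

	static void *xbuf_alloc(void)
	{
		int i;

		for (i = 0; i < NR_CACHED_XBUFS; i++) {
			void *buf = this_cpu_xchg(cached_xbufs[i], NULL);

			if (buf)
				return buf;
		}
		return vmalloc(fpu_kernel_xstate_max_size);
	}

	static void xbuf_free(void *buf)
	{
		int i;

		for (i = 0; i < NR_CACHED_XBUFS; i++) {
			if (this_cpu_cmpxchg(cached_xbufs[i], NULL, buf) == NULL)
				return;	/* parked; no vfree() IPI */
		}
		vfree(buf);	/* cache full: pay for the IPIs */
	}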

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC PATCH 13/22] x86/fpu/xstate: Expand dynamic user state area on first use
  2020-10-14  1:11         ` Andy Lutomirski
@ 2020-10-14  6:03           ` Dave Hansen
  2020-10-14 16:10             ` Andy Lutomirski
  0 siblings, 1 reply; 43+ messages in thread
From: Dave Hansen @ 2020-10-14  6:03 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Brown, Len, Andy Lutomirski, Bae, Chang Seok, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, X86 ML, Liu, Jing2, Shankar,
	Ravi V, LKML

On 10/13/20 6:11 PM, Andy Lutomirski wrote:
> I have no problem with vmalloc(), but I do have a problem with
> vfree() due to the IPIs that result. We need a cache or something.

This sounds like the kind of thing we should just build into vmalloc()
instead of having a bunch of callers implement their own caches.  It
shouldn't be too much of a challenge to have vmalloc() keep a cacheline
or two of stats about common vmalloc() sizes and keep some pools around.

It's not going to be hard to implement caches to reduce vfree()-induced
churn, but I'm having a hard time imaging that it'll have anywhere near
the benefits that it did for stacks.  Tasks fundamentally come and go a
*lot*, and those paths are hot.

Tasks who go to the trouble to populate 8k or 64k of register state
fundamentally *can't* come and go frequently.  We also (probably) don't
have to worry about AMX tasks doing fork()/exec() pairs and putting
pressure on vmalloc().  Before an app can call out to library functions
to do the fork, they've got to save the state off themselves and likely
get it back to the init state.  The fork() code can tell AMX is in the
init state and decline to allocate actual space for the child.

> I have to say: this mechanism is awful. Can we get away with skipping
> the dynamic XSAVES mess entirely?  What if we instead allocate
> however much space we need as an array of pages and have one percpu
> contiguous region. To save, we XSAVE(S or C) just the AMX state to
> the percpu area and then copy it.  To restore, we do the inverse.  Or
> would this kill the modified optimization and thus be horrible?

Actually, I think the modified optimization would survive such a scheme:

 * copy page array into percpu area
 * XRSTORS from percpu area, modified optimization tuple is saved
 * run userspace
 * XSAVES back to percpu area.  tuple matches, modified optimization
   is still in play
 * copy percpu area back to page array

Since the XRSTORS->XSAVES pair is both done to the percpu area, the
XSAVE tracking hardware never knows it isn't working on the "canonical"
buffer (the page array).

It seems a bit ugly, but it's not like an 8k memcpy() is *that* painful.
 The double dcache footprint isn't super nice, though.

Chang, I think that leaves us with three possibilities:

1. use plain old vmalloc()
2. use vmalloc(), but implement caches for large XSAVE state, either in
   front of, or inside the vmalloc() API
3. Use a static per-cpu buffer for XSAVE*/XRSTOR* and memcpy() to/from
   a scattered list of pages.

A #4 also comes to mind:

Do something somewhat like kmap_atomic().  Map the scattered pages
contiguously and use XSAVE*/XRSTOR* directly, but unmap immediately
after XSAVE*/XRSTOR*.  For ~12k of state, I suspect the ~400-500 cycles
for 3 INVLPG's might beat out a memcpy() of 12k of state.

Global TLB invalidations would not be required.

I have to say, though, that I'd be OK with sticking with plain old
vmalloc() for now.
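
For reference, a rough sketch of #3 (the scatter/gather helpers are
invented; copy_xregs_to_kernel()/copy_kernel_to_xregs() are today's
XSAVES/XRSTORS wrappers):

	/* one staging buffer per CPU, allocated at boot (elided) */
	static DEFINE_PER_CPU(struct xregs_state *, xstate_stage);

	static void xsave_to_pages(struct fpu *fpu)
	{
		struct xregs_state *stage = this_cpu_read(xstate_stage);

		/*
		 * XSAVES always targets the same percpu buffer, so the
		 * modified-optimization tuple stays valid across switches.
		 */
		copy_xregs_to_kernel(stage);
		copy_stage_to_page_array(fpu, stage);
	}

	static void xrstor_from_pages(struct fpu *fpu)
	{
		struct xregs_state *stage = this_cpu_read(xstate_stage);

		copy_page_array_to_stage(fpu, stage);
		copy_kernel_to_xregs(stage, xfeatures_mask_all);
	}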

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC PATCH 13/22] x86/fpu/xstate: Expand dynamic user state area on first use
  2020-10-13 22:43       ` Dave Hansen
  2020-10-14  1:11         ` Andy Lutomirski
@ 2020-10-14 10:41         ` Peter Zijlstra
  1 sibling, 0 replies; 43+ messages in thread
From: Peter Zijlstra @ 2020-10-14 10:41 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Brown, Len, Andy Lutomirski, Bae, Chang Seok, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, X86 ML, Liu, Jing2, Shankar,
	Ravi V, LKML

On Tue, Oct 13, 2020 at 03:43:59PM -0700, Dave Hansen wrote:
> On 10/13/20 3:31 PM, Brown, Len wrote:
> > vmalloc() does not fail, and does not return an error, and so there is no concept
> > of returning a signal.
> 
> Well, the order-0 allocations are no-fail, as are the vmalloc kernel
> structures and the page tables that might have to be allocated.  But,
> that's not guaranteed to be in place *forever*.  I think we still need
> to check for and handle allocation failures, even if they're not known
> to be possible today.

Quite, on top of that, we could run out of vmalloc space (unlikely, but
still).

You really have to deal with vmalloc() failing.
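
E.g., in the #NM path, the failure leg could look like this sketch
(assuming alloc_xstate_area() grows an error return; whether to signal
is not settled here):

	unsigned int size;

	if (alloc_xstate_area(&current->thread.fpu, event_mask, &size)) {
		/* can't grow the buffer, so don't let the task run armed */
		force_sig(SIGSEGV);
		return true;
	}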

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC PATCH 13/22] x86/fpu/xstate: Expand dynamic user state area on first use
  2020-10-14  6:03           ` Dave Hansen
@ 2020-10-14 16:10             ` Andy Lutomirski
  2020-10-14 16:29               ` Dave Hansen
  0 siblings, 1 reply; 43+ messages in thread
From: Andy Lutomirski @ 2020-10-14 16:10 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Brown, Len, Andy Lutomirski, Bae, Chang Seok, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, X86 ML, Liu, Jing2, Shankar,
	Ravi V, LKML

On Tue, Oct 13, 2020 at 11:03 PM Dave Hansen <dave.hansen@intel.com> wrote:
>
> On 10/13/20 6:11 PM, Andy Lutomirski wrote:
> > I have no problem with vmalloc(), but I do have a problem with
> > vfree() due to the IPIs that result. We need a cache or something.
>
> This sounds like the kind of thing we should just build into vmalloc()
> instead of having a bunch of callers implement their own caches.  It
> shouldn't be too much of a challenge to have vmalloc() keep a cacheline
> or two of stats about common vmalloc() sizes and keep some pools around.
>
> It's not going to be hard to implement caches to reduce vfree()-induced
> churn, but I'm having a hard time imaging that it'll have anywhere near
> the benefits that it did for stacks.  Tasks fundamentally come and go a
> *lot*, and those paths are hot.
>
> Tasks who go to the trouble to populate 8k or 64k of register state
> fundamentally *can't* come and go frequently.  We also (probably) don't
> have to worry about AMX tasks doing fork()/exec() pairs and putting
> pressure on vmalloc().  Before an app can call out to library functions
> to do the fork, they've got to save the state off themselves and likely
> get it back to the init state.  The fork() code can tell AMX is in the
> init state and decline to allocate actual space for the child.
>
> > I have to say: this mechanism is awful. Can we get away with skipping
> > the dynamic XSAVES mess entirely?  What if we instead allocate
> > however much space we need as an array of pages and have one percpu
> > contiguous region. To save, we XSAVE(S or C) just the AMX state to
> > the percpu area and then copy it.  To restore, we do the inverse.  Or
> > would this kill the modified optimization and thus be horrible?
>
> Actually, I think the modified optimization would survive such a scheme:
>
>  * copy page array into percpu area
>  * XRSTORS from percpu area, modified optimization tuple is saved
>  * run userspace
>  * XSAVES back to percpu area.  tuple matches, modified optimization
>    is still in play
>  * copy percpu area back to page array
>
> Since the XRSTORS->XSAVES pair is both done to the percpu area, the
> XSAVE tracking hardware never knows it isn't working on the "canonical"
> buffer (the page array).

I was suggesting something a little bit different.  We'd keep XMM,
YMM, ZMM, etc state stored exactly the way we do now and, for
AMX-using tasks, we would save the AMX state in an entirely separate
buffer.  This way the pain of having a variable xstate layout is
confined just to AMX tasks.
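
Something like this shape, roughly (the field name is invented, and the
other fpu members are elided):

	struct fpu {
		/* XMM/YMM/ZMM etc., fixed layout, embedded as today */
		union fpregs_state	state;
		/* AMX tile state only, allocated on first use */
		struct xregs_state	*amx_state;
	};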

I'm okay with vmalloc() too, but I do think we need to deal with the
various corner cases like allocation failing.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC PATCH 13/22] x86/fpu/xstate: Expand dynamic user state area on first use
  2020-10-14 16:10             ` Andy Lutomirski
@ 2020-10-14 16:29               ` Dave Hansen
  2020-11-03 21:32                 ` Bae, Chang Seok
  0 siblings, 1 reply; 43+ messages in thread
From: Dave Hansen @ 2020-10-14 16:29 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: Brown, Len, Andy Lutomirski, Bae, Chang Seok, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, X86 ML, Liu, Jing2, Shankar,
	Ravi V, LKML

On 10/14/20 9:10 AM, Andy Lutomirski wrote:
>> Actually, I think the modified optimization would survive such a scheme:
>>
>>  * copy page array into percpu area
>>  * XRSTORS from percpu area, modified optimization tuple is saved
>>  * run userspace
>>  * XSAVES back to percpu area.  tuple matches, modified optimization
>>    is still in play
>>  * copy percpu area back to page array
>>
>> Since the XRSTORS->XSAVES pair is both done to the percpu area, the
>> XSAVE tracking hardware never knows it isn't working on the "canonical"
>> buffer (the page array).
> I was suggesting something a little bit different.  We'd keep XMM,
> YMM, ZMM, etc state stored exactly the way we do now and, for
> AMX-using tasks, we would save the AMX state in an entirely separate
> buffer.  This way the pain of having a variable xstate layout is
> confined just to AMX tasks.

OK, got it.

So, we'd either need a second set of XSAVE/XRSTORs, or "manual" copying
of the registers out to memory.  We can preserve the modified
optimization if we're careful about ordering, but only for *ONE* of the
XSAVE buffers (if we use two).

> I'm okay with vmalloc() too, but I do think we need to deal with the
> various corner cases like allocation failing.

Yeah, agreed about handling the corner cases.  Also, if we preserve
plain old vmalloc() for now, we need good tracepoints or stats so we can
precisely figure out how many vmalloc()s (and IPIs) are due to AMX.
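
Something like this sketch would do (event name and field invented):

	TRACE_EVENT(x86_fpu_xstate_alloc,
		TP_PROTO(unsigned int size),
		TP_ARGS(size),
		TP_STRUCT__entry(
			__field(unsigned int, size)
		),
		TP_fast_assign(
			__entry->size = size;
		),
		TP_printk("size=%u", __entry->size)
	);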

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC PATCH 13/22] x86/fpu/xstate: Expand dynamic user state area on first use
  2020-10-14 16:29               ` Dave Hansen
@ 2020-11-03 21:32                 ` Bae, Chang Seok
  2020-11-03 21:41                   ` Dave Hansen
  0 siblings, 1 reply; 43+ messages in thread
From: Bae, Chang Seok @ 2020-11-03 21:32 UTC (permalink / raw)
  To: Lutomirski, Andy, Hansen, Dave
  Cc: mingo, linux-kernel, bp, tglx, Liu, Jing2, x86, luto, Shankar,
	Ravi V, Brown, Len

On Wed, 2020-10-14 at 09:29 -0700, Dave Hansen wrote:
> On 10/14/20 9:10 AM, Andy Lutomirski wrote:
> > 
> > I was suggesting something a little bit different.  We'd keep XMM,
> > YMM, ZMM, etc state stored exactly the way we do now and, for
> > AMX-using tasks, we would save the AMX state in an entirely separate
> > buffer.  This way the pain of having a variable xstate layout is
> > confined just to AMX tasks.
> 
> OK, got it.
> 
> So, we'd either need a second set of XSAVE/XRSTORs, or "manual" copying
> of the registers out to memory.  We can preserve the modified
> optimization if we're careful about ordering, but only for *ONE* of the
> XSAVE buffers (if we use two).

For what it's worth,

If using two buffers, the buffer for saving the tile data also needs space
for the legacy states.

The AMX state is stored at the end of the XSAVE buffer (at least for now).
So, the layout (in terms of offsets of non-AMX states) won't be changed at
run-time.

Thanks,
Chang

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC PATCH 13/22] x86/fpu/xstate: Expand dynamic user state area on first use
  2020-11-03 21:32                 ` Bae, Chang Seok
@ 2020-11-03 21:41                   ` Dave Hansen
  2020-11-03 21:53                     ` Bae, Chang Seok
  0 siblings, 1 reply; 43+ messages in thread
From: Dave Hansen @ 2020-11-03 21:41 UTC (permalink / raw)
  To: Bae, Chang Seok, Lutomirski, Andy
  Cc: mingo, linux-kernel, bp, tglx, Liu, Jing2, x86, luto, Shankar,
	Ravi V, Brown, Len

On 11/3/20 1:32 PM, Bae, Chang Seok wrote:
> On Wed, 2020-10-14 at 09:29 -0700, Dave Hansen wrote:
>> On 10/14/20 9:10 AM, Andy Lutomirski wrote:
>>> I was suggesting something a little bit different.  We'd keep XMM,
>>> YMM, ZMM, etc state stored exactly the way we do now and, for
>>> AMX-using tasks, we would save the AMX state in an entirely separate
>>> buffer.  This way the pain of having a variable xstate layout is
>>> confined just to AMX tasks.
>> OK, got it.
>>
>> So, we'd either need a second set of XSAVE/XRSTORs, or "manual" copying
>> of the registers out to memory.  We can preserve the modified
>> optimization if we're careful about ordering, but only for *ONE* of the
>> XSAVE buffers (if we use two).
> For what it's worth,
> 
> If using two buffers, the buffer for saving the tile data also needs space
> for the legacy states.

Just to be clear, you're talking about the 512-byte 'struct
fxregs_state' which is the first member of 'struct xregs_state', right?

Basically, any two-buffer model wastes 576 bytes of space (legacy fxsave
plus xsave header) for each task that uses two buffers.  This provides
an overall space savings unless there are something like 16x more
TMUL-using tasks than non-TMUL-using tasks.
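(Rough arithmetic: each non-TMUL task avoids ~8k of tile space while
each TMUL task duplicates 576 bytes, and 8192/576 is about 14, hence
the ~16x figure.)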

We could eliminate the 576 bytes of waste, but at the *performance* cost
of having a permanently (non-task_struct-embedded) out-of-line XSAVE
buffer for all tasks.

> The AMX state is stored at the end of the XSAVE buffer (at least for now).
> So, the layout (in terms of offsets of non-AMX states) won't be changed at
> run-time.

I don't know what point you are trying to make here.

^ permalink raw reply	[flat|nested] 43+ messages in thread

* Re: [RFC PATCH 13/22] x86/fpu/xstate: Expand dynamic user state area on first use
  2020-11-03 21:41                   ` Dave Hansen
@ 2020-11-03 21:53                     ` Bae, Chang Seok
  0 siblings, 0 replies; 43+ messages in thread
From: Bae, Chang Seok @ 2020-11-03 21:53 UTC (permalink / raw)
  To: Lutomirski, Andy, Hansen, Dave
  Cc: mingo, linux-kernel, bp, tglx, Liu, Jing2, x86, luto, Shankar,
	Ravi V, Brown, Len

On Tue, 2020-11-03 at 13:41 -0800, Dave Hansen wrote:
> On 11/3/20 1:32 PM, Bae, Chang Seok wrote:
> > On Wed, 2020-10-14 at 09:29 -0700, Dave Hansen wrote:
> > > On 10/14/20 9:10 AM, Andy Lutomirski wrote:
> > > > I was suggesting something a little bit different.  We'd keep XMM,
> > > > YMM, ZMM, etc state stored exactly the way we do now and, for
> > > > AMX-using tasks, we would save the AMX state in an entirely
> > > > separate
> > > > buffer.  This way the pain of having a variable xstate layout is
> > > > confined just to AMX tasks.
> > > 
> > > OK, got it.
> > > 
> > > So, we'd either need a second set of XSAVE/XRSTORs, or "manual"
> > > copying
> > > of the registers out to memory.  We can preserve the modified
> > > optimization if we're careful about ordering, but only for *ONE* of
> > > the
> > > XSAVE buffers (if we use two).
> > 
> > For what it's worth,
> > 
> > If using two buffers, the buffer for saving the tile data also needs
> > space
> > for the legacy states.
> 
> Just to be clear, you're talking about the 512-byte 'struct
> fxregs_state' which is the first member of 'struct xregs_state', right?

Yes.

> 
> > The AMX state is stored at the end of the XSAVE buffer (at least for
> > now).
> > So, the layout (in terms of offsets of non-AMX states) won't be changed
> > at
> > run-time.
> 
> I don't know what point you are trying to make here.

I was trying to clarify the concern that Andy had:

	"the pain of having a variable xstate layout"

Thanks,
Chang

^ permalink raw reply	[flat|nested] 43+ messages in thread

end of thread, other threads:[~2020-11-03 21:53 UTC | newest]

Thread overview: 43+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-10-01 20:38 [RFC PATCH 00/22] x86: Support Intel Advanced Matrix Extensions Chang S. Bae
2020-10-01 20:38 ` [RFC PATCH 01/22] x86/fpu/xstate: Modify area init helper prototypes to access all the possible areas Chang S. Bae
2020-10-01 20:38 ` [RFC PATCH 02/22] x86/fpu/xstate: Modify xstate copy " Chang S. Bae
2020-10-01 20:38 ` [RFC PATCH 03/22] x86/fpu/xstate: Modify address finder " Chang S. Bae
2020-10-01 20:38 ` [RFC PATCH 04/22] x86/fpu/xstate: Modify save and restore helper " Chang S. Bae
2020-10-01 20:38 ` [RFC PATCH 05/22] x86/fpu/xstate: Introduce a new variable for dynamic user states Chang S. Bae
2020-10-01 20:38 ` [RFC PATCH 06/22] x86/fpu/xstate: Outline dynamic xstate area size in the task context Chang S. Bae
2020-10-01 20:38 ` [RFC PATCH 07/22] x86/fpu/xstate: Introduce helpers to manage an xstate area dynamically Chang S. Bae
2020-10-01 23:41   ` Andy Lutomirski
2020-10-13 22:00     ` Brown, Len
2020-10-01 20:38 ` [RFC PATCH 08/22] x86/fpu/xstate: Define the scope of the initial xstate data Chang S. Bae
2020-10-01 20:39 ` [RFC PATCH 09/22] x86/fpu/xstate: Introduce wrapper functions for organizing xstate area access Chang S. Bae
2020-10-01 20:39 ` [RFC PATCH 10/22] x86/fpu/xstate: Update xstate save function for supporting dynamic user xstate Chang S. Bae
2020-10-01 20:39 ` [RFC PATCH 11/22] x86/fpu/xstate: Update xstate area address finder " Chang S. Bae
2020-10-01 20:39 ` [RFC PATCH 12/22] x86/fpu/xstate: Update xstate context copy function for supporting dynamic area Chang S. Bae
2020-10-01 20:39 ` [RFC PATCH 13/22] x86/fpu/xstate: Expand dynamic user state area on first use Chang S. Bae
2020-10-01 23:39   ` Andy Lutomirski
2020-10-13 22:31     ` Brown, Len
2020-10-13 22:43       ` Dave Hansen
2020-10-14  1:11         ` Andy Lutomirski
2020-10-14  6:03           ` Dave Hansen
2020-10-14 16:10             ` Andy Lutomirski
2020-10-14 16:29               ` Dave Hansen
2020-11-03 21:32                 ` Bae, Chang Seok
2020-11-03 21:41                   ` Dave Hansen
2020-11-03 21:53                     ` Bae, Chang Seok
2020-10-14 10:41         ` Peter Zijlstra
2020-10-14  1:47       ` Andy Lutomirski
2020-10-01 20:39 ` [RFC PATCH 14/22] x86/fpu/xstate: Inherit dynamic user state when used in the parent Chang S. Bae
2020-10-01 20:39 ` [RFC PATCH 15/22] x86/fpu/xstate: Support ptracer-induced xstate area expansion Chang S. Bae
2020-10-01 20:39 ` [RFC PATCH 16/22] x86/fpu/xstate: Support dynamic user state in the signal handling path Chang S. Bae
2020-10-01 20:39 ` [RFC PATCH 17/22] x86/fpu/xstate: Extend the table for mapping xstate components with features Chang S. Bae
2020-10-01 20:39 ` [RFC PATCH 18/22] x86/cpufeatures/amx: Enumerate Advanced Matrix Extension (AMX) feature bits Chang S. Bae
2020-10-01 20:39 ` [RFC PATCH 19/22] x86/fpu/amx: Define AMX state components and have it used for boot-time checks Chang S. Bae
2020-10-01 20:39 ` [RFC PATCH 20/22] x86/fpu/amx: Enable the AMX feature in 64-bit mode Chang S. Bae
2020-10-01 20:39 ` [RFC PATCH 21/22] selftest/x86/amx: Include test cases for the AMX state management Chang S. Bae
2020-10-01 20:39 ` [RFC PATCH 22/22] x86/fpu/xstate: Introduce boot-parameters for control some state component support Chang S. Bae
2020-10-02  2:09   ` Randy Dunlap
2020-10-13 23:00     ` Brown, Len
2020-10-02 17:15   ` Andy Lutomirski
2020-10-02 17:26     ` Dave Hansen
2020-10-13 23:38     ` Brown, Len
2020-10-02 17:25   ` Dave Hansen

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).